Cloudera customers usually have two major sources of data: log files, which can be imported into Hadoop via Flume, and relational databases. Throughout previous releases of CDH2 and CDH3, Cloudera has included a package we developed called Sqoop. Sqoop performs batch imports and exports between relational databases and Hadoop, storing data in HDFS and creating Hive tables to hold the results. We described its motivation and some use cases in a previous blog post. CDH3b2 includes a greatly expanded version of Sqoop that has had a major overhaul since previous releases. This version is important enough that we’re deeming it the “1.0” release of Sqoop. In this blog post we’ll cover the highlights of the new features available in Sqoop.
The biggest change you’ll notice is that the Sqoop command-line interface has completely changed. Users who have been embedding Sqoop in scripts may be frustrated by this incompatible change, but given the amount of functionality now available in Sqoop, we felt some refactoring was necessary, and this was the right opportunity to do it. Sqoop is now arranged as a set of tools. If you type sqoop help, you’ll see the list of available tools. Most of the original functionality is contained in a tool called import; running sqoop help import will list the options available to this tool.
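As a quick sketch of the new tool-based interface (this requires a Sqoop installation; the connection string, database, and credentials below are hypothetical):

```shell
# List the tools Sqoop provides
sqoop help

# Show the options accepted by the import tool
sqoop help import

# A typical table import; connection details here are purely illustrative
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username reporting \
  --table orders
```

Each tool carries its own option set, so scripts that previously passed a single flat argument list will need to name the tool they intend to run.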
Improved Export Performance
In CDH3b1 we provided basic support for exports: the ability to take results from HDFS and insert them back into a database. CDH3b2 features a completely rewritten export pipeline which demonstrates considerably greater throughput and scalability. You can now export gigabytes of data with high performance. For MySQL users, we’ve added a separate “direct mode” channel that uses mysqlimport to perform this job even faster.
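A minimal export invocation looks like the following sketch (the table name, HDFS path, and connection string are hypothetical):

```shell
# Export the contents of an HDFS directory back into a database table
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username reporting \
  --table order_summaries \
  --export-dir /user/hive/warehouse/order_summaries
```

MySQL users can add the --direct flag to route the export through mysqlimport instead of JDBC INSERT statements, which is where the additional speedup comes from.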
Large Object Support
Sqoop now has the ability to import CLOB and BLOB columns and store them in a separate file format in HDFS designed for these large records. If you have been accumulating large volumes of unstructured data in your database for a long period of time, Sqoop can now help you get this data into a format that you can more easily process with MapReduce.
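For example, a LOB-bearing table can be imported like this (the connection string, table, and the 16 MB threshold are illustrative; records whose large-object fields exceed the inline limit are spilled to the separate large-object file format in HDFS):

```shell
# Import a table containing BLOB/CLOB columns; large-object values bigger
# than the inline limit are stored in separate files designed for them
sqoop import \
  --connect jdbc:mysql://db.example.com/media \
  --table documents \
  --inline-lob-limit 16777216
```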
Append to an Existing Dataset
Sqoop can now append new results to an existing dataset in HDFS. Users who perform periodic imports to synchronize a copy of a dataset in HDFS with a continually-updated copy in a database will now find that this process has been made much smoother.
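A periodic synchronization run might look like this sketch (the WHERE clause and the id watermark are hypothetical; in practice you would track the highest id already imported):

```shell
# Append only newly-arrived rows to an existing dataset in HDFS, rather
# than failing because the target directory already exists
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table orders \
  --where "id > 100000" \
  --append
```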
A Rewritten User Guide
We’ve completely rewritten the Sqoop user manual. You can browse it online, and it’s also included in the Sqoop installation package.
Oozie Integration
As Arvind mentioned in yesterday’s blog post about Oozie, Sqoop is now a supported component of the workflow engine. You can import source data from a database, run a MapReduce pipeline, and export your results back to a database entirely inside the Oozie framework.
On the Horizon…
We’re continually working on improving Sqoop at Cloudera. Here’s a short list of new features we’re actively working on for the next release:
- HBase integration – Import from a database to a table in HBase.
- Free-form query support – Sqoop’s existing import model is table-driven. You will be able to import data from an arbitrary SELECT statement against your database.
- UPDATE support – Export data as a set of UPDATE statements applied to existing rows in a database, rather than as a set of new records.
We’re also building in support for additional database vendors. We’ve seen a lot of interest recently in integrating with Sqoop. In June, we announced a partnership with Quest to develop high-performance Oracle support. We’re also pleased to announce another partnership with Netezza, to develop tools for fast connectivity between Hadoop and their enterprise data warehouse.
Longer term, we’re looking at some deeper enhancements to the system, such as adding a public Java API as well as support for pluggable serialization and storage formats, allowing users to process records with Avro or Protocol Buffers.
These features (and more!) are coming soon.
For more information
- Get started with CDH3 and Sqoop by configuring the installation packages.
- Sqoop is open source! Get the code at github.com/cloudera/sqoop.
- Read the user guide to learn more about how it works.
- Join the mailing list to get help with Sqoop and participate in the user community.
- Browse the issue tracker to see outstanding issues, or to file a bug report.