What’s New in CDH3b2: Sqoop

Cloudera customers usually have two major sources of data: log files, which can be imported into Hadoop via Flume, and relational databases. Throughout previous releases of CDH2 and CDH3, Cloudera has included Sqoop, a package we developed that performs batch imports and exports between relational databases and Hadoop, storing data in HDFS and creating Hive tables to hold the results. We described its motivation and some use cases in a previous blog post. CDH3b2 includes a greatly expanded version of Sqoop that has had a major overhaul since previous releases. This version is important enough that we’re deeming it the “1.0” release of Sqoop. In this post we’ll cover the highlights of the new features.

New Interface

The biggest change you’ll notice is that the Sqoop command-line interface has been completely redesigned. Users who have been embedding Sqoop in scripts may be frustrated by this incompatible change, but given how much functionality Sqoop now offers, some refactoring was necessary, and this release was the right opportunity to do it. Sqoop is now arranged as a set of tools. If you type sqoop help, you’ll see the list of available tools. Most of the original functionality is contained in a tool called import; running sqoop help import lists the options available to that tool.
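As a sketch, the new tool-based syntax looks like this (the connect string, table name, and username below are hypothetical placeholders, not values from this post):

```shell
# List the available tools, then the options for the import tool.
sqoop help
sqoop help import

# A typical import: pull a table from a (hypothetical) MySQL
# database into HDFS, prompting for the password with -P.
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --table EMPLOYEES \
  --username dbuser -P
```

Scripts written against the old single-command interface will need to be updated to name a tool as the first argument.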

Improved Export Performance

In CDH3b1 we provided basic support for exports: the ability to take results from HDFS and insert them back into a database. CDH3b2 features a completely rewritten export pipeline which demonstrates considerably greater throughput and scalability. You can now export gigabytes of data with high performance. For MySQL users, we’ve added a separate “direct mode” channel that uses mysqlimport to perform this job even faster.
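A hedged sketch of an export run (the connection details, table, and HDFS path are made up for illustration):

```shell
# Push the contents of an HDFS directory back into a database table.
sqoop export \
  --connect jdbc:mysql://db.example.com/corp \
  --table SALES_RESULTS \
  --export-dir /user/hadoop/results \
  --username dbuser -P
```

For the MySQL fast path, adding the --direct option routes the export through mysqlimport instead of generic JDBC inserts.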

Large Object Support

Sqoop now has the ability to import CLOB and BLOB columns and store them in a separate file format in HDFS designed for these large records. If you have been accumulating large volumes of unstructured data in your database, Sqoop can now help you get that data into a format you can process more easily with MapReduce.
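As an illustrative sketch (the table name and size threshold here are assumptions, not part of the announcement), an import of a table with large-object columns might look like:

```shell
# Import a table containing BLOB/CLOB columns. Values larger than
# the inline threshold (16 MB here) are spilled to Sqoop's separate
# large-object storage format in HDFS.
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --table DOCUMENTS \
  --inline-lob-limit 16777216
```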

Append to an Existing Dataset

Sqoop can now append new results to an existing dataset in HDFS. Users who perform periodic imports to synchronize a copy of a dataset in HDFS with a continually-updated copy in a database will now find that this process has been made much smoother.
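A sketch of a periodic synchronization run (the table and the WHERE predicate are hypothetical; the key idea is the --append flag):

```shell
# Import only rows added since the last run and append them to the
# existing dataset in HDFS, instead of failing because the target
# directory already exists.
sqoop import \
  --connect jdbc:mysql://db.example.com/corp \
  --table EVENTS \
  --where "id > 100000" \
  --append
```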

Documentation Overhaul

We’ve completely rewritten the Sqoop user manual. You can browse it online, and it’s also included in the Sqoop installation package.

Oozie Integration

As Arvind mentioned in yesterday’s blog post about Oozie, Sqoop is now a supported component of the workflow engine. You can import source data from a database, run a MapReduce pipeline, and export your results back to a database entirely inside the Oozie framework.

On the Horizon…

We’re continually working on improving Sqoop at Cloudera. Here’s a short list of new features we’re actively working on for the next release:

  • HBase integration – Import from a database to a table in HBase.
  • Free-form query support – Sqoop’s existing import model is table-driven; this will let you import the results of an arbitrary SELECT statement against your database.
  • UPDATE support – Export by issuing UPDATE statements against existing rows in the database, rather than inserting a set of new records.

We’re also building in support for additional database vendors. We’ve seen a lot of interest recently in integrating with Sqoop. In June, we announced a partnership with Quest to develop high-performance Oracle support. We’re also pleased to announce another partnership with Netezza, to develop tools for fast connectivity between Hadoop and their enterprise data warehouse.

Longer term, we’re looking at some deeper enhancements to the system, such as adding a public Java API as well as support for pluggable serialization and storage formats, allowing users to process records with Avro or Protocol Buffers.

These features (and more!) are coming soon.


4 Responses
  • Arushi / August 20, 2010 / 2:32 AM

    Hi,

    Is the --export option not at all supported by Sqoop now?

    I can also see that sqoop --version does not work as well.

    Are these just the problems with my cloudera-training vm or is it the new release?

  • Aaron Kimball / August 20, 2010 / 10:37 AM

    Arushi,

    As mentioned, the new version of Sqoop has changed the way you specify operations and their arguments. Rest assured; imports, exports, and more functionality are all available.

    You should run “sqoop help” to see how you use the new Sqoop command line, or check out the user’s guide at http://archive.cloudera.com/cdh/3/sqoop/

    - Aaron

  • Arushi / August 20, 2010 / 10:52 AM

    Thanks for replying Aaron. My problem is resolved after using the cloudera VM 3.4.

    In cloudera’s training VM 3.3, I was getting the issue reported above.

    I installed VM 3.4 on another machine and sqoop (with all its options including export) works fine there.
