Hadoop World: Sqoop – Database Import for Hadoop

Categories: General

At Cloudera, we’re always working to make it easier for you to work with Hadoop and to integrate Hadoop-based systems with your existing data sources. One example of how we accomplish this is Sqoop, a database import tool developed at Cloudera that lets you easily copy data between databases and HDFS. We originally announced this tool in June, but we’ve been steadily improving it since then. It can now talk to several more databases than before, and its performance has improved considerably. Sqoop has demonstrated its usefulness quickly; several open source projects and many of our clients use Sqoop as part of their data pipelines. Last summer our friend Pete Skomoroch demonstrated how to integrate it into his Wikipedia Trending Topics project (blog tutorial).

This talk at Hadoop World NYC by Cloudera engineer Aaron Kimball introduces Sqoop, describes its use cases, and gives some technical details of how it works.


9 responses on “Hadoop World: Sqoop – Database Import for Hadoop”

  1. aaron

    I’m not certain. Sqoop uses JDBC and generates standard SQL, so it can connect to many database vendors out of the box. Some vendors require nonstandard SQL, or implement their JDBC drivers slightly differently from the norm, for which we have to write special-case code.

    We haven’t explicitly tested it with Teradata. I’d love to hear feedback as to how that experiment works out.
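    To make the "out-of-the-box" JDBC path concrete, here is a sketch of what a generic-driver import might look like; the connect string, driver class, table, and username below are hypothetical placeholders, and the driver JAR would need to be on Sqoop's classpath:

```shell
# Hypothetical import over a vendor JDBC driver; the connect string,
# driver class name, table, and username are placeholders.
sqoop import \
  --connect jdbc:teradata://db.example.com/sales \
  --driver com.teradata.jdbc.TeraDriver \
  --table ORDERS \
  --username etl_user -P
```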

  2. Chetan Conikee

    Typically, Netezza and Teradata are bundled with their own proprietary ODBC bindings [and] ship with a JAR file which is essentially an ODBC-JDBC bridge.
    So if you have the specified bridge JAR, you should be able to Sqoop from/to Netezza or Teradata.

  3. Arushi

    Any updates on unsqoop?

    I want to pull data in using Hadoop, modify it with MapReduce, and put it back into MySQL.

    Is it possible with the latest Sqoop, or should we wait for unsqoop to send data back to MySQL?

    1. Jeff Hammerbacher

      Hey Arushi,

      Unsqoop is now available in CDH3b2 (see http://blog.cloudera.com/blog/2010/07/what’s-new-in-cdh3b2-sqoop for more details), so go and get it! Additionally, we have added a Sqoop task to Oozie (http://blog.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie), so you can tie the import, processing, and export phases together in a single Oozie workflow.

      For further questions on Sqoop, check out the Sqoop mailing list at https://groups.google.com/a/cloudera.org/group/sqoop-user.
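      The export phase mentioned above might be invoked along these lines; the database, table, and HDFS directory here are hypothetical placeholders:

```shell
# Hypothetical export of processed HDFS files back into a MySQL table;
# the connect string, table, directory, and username are placeholders.
sqoop export \
  --connect jdbc:mysql://db.example.com/reports \
  --table daily_summary \
  --export-dir /user/etl/output/daily_summary \
  --username etl_user -P
```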


  4. sathi

    Does the current version of Sqoop have a way to capture only the delta? If I have an initial load of table A under /original, and then import the incremental changes under /delta, is there a better way to store the combined data under /latest, which is basically a merge of /original and /delta?
    In a standard ETL pipeline, if I am storing a transaction table that changes every day, what is the best way to store this evolving data within HDFS?

  5. Arvind Prabhakar

    Hi Sathi,

    Sqoop does allow you to do incremental imports based on timestamp columns or numeric columns. During such imports, Sqoop populates data files in the same target directory, using file names that continue from where the previous job left off. That way the target directory always reflects the latest changes.
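    An incremental import based on a numeric column might be invoked like this; the database, table, check column, and last value below are hypothetical placeholders:

```shell
# Hypothetical incremental import: only rows whose id column exceeds
# the last-imported value are fetched on each run, and the new files
# are appended under the same target directory.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --table transactions \
  --target-dir /delta \
  --incremental append \
  --check-column id \
  --last-value 42000
```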

    Regarding the storage of a transaction table that changes every day, one way to use it for ETL use cases would be to load it into HBase. Storing it in flat files may not be a good idea, since that could lead to duplication if previously imported data is updated.

    Arvind Prabhakar