Introducing Sqoop

In addition to providing you with a dependable release of Hadoop that is easy to configure, at Cloudera we also focus on developing tools to extend Hadoop’s usability, and make Hadoop a more central component of your data infrastructure. In this vein, we’re proud to announce the availability of Sqoop, a tool designed to easily import information from SQL databases into your Hadoop cluster.

Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:

  • Imports individual tables or entire databases to files in HDFS
  • Generates Java classes to allow you to interact with your imported data
  • Provides the ability to import from SQL databases straight into your Hive data warehouse

After setting up an import job in Sqoop, you can get started working with SQL database-backed data from your Hadoop MapReduce cluster in minutes.

Motivation

Hadoop MapReduce is a powerful tool; its flexibility in parsing unstructured or semi-structured data means that there is a lot of potential for creative applications. But your analyses are only as useful as the data which they process. In many organizations, large volumes of useful information are locked away in disparate databases across the enterprise. HDFS, Hadoop’s distributed file system represents a great place to bring this data together, but actually doing so is a cumbersome task.

Consider the task of processing access logs and analysing user behavior on your web site. Users may present your site with a cookie that identifies who they are. You can log the cookies in conjunction with the pages they visit. This lets you coordinate users with their actions. But actually matching their behavior against their profiles or their previously recorded history requires that you look up information in a database. If several MapReduce programs needed to do similar joins, the database server would experience very high load, in addition to a large number of concurrent connections, while MapReduce programs were running, possibly causing performance of your interactive web site to suffer.

The solution: periodically dump the contents of the users database and the action history database to HDFS, and let your MapReduce programs join against the data stored there. Going one step further, you could take the in-HDFS copy of the users database and import it into Hive, allowing you to perform ad-hoc SQL queries against the entire database without working on the production database.

Sqoop makes all of the above possible with a single command-line.

Example Usage

Continuing the example above, let’s say that our front end servers connected to a MySQL database named website, stored on db.example.com. The website database has several tables, but the one we are most interested in is one named USERS.

This table has several columns; it might have been created from a SQL statement like:

CREATE TABLE USERS (
  user_id INTEGER NOT NULL PRIMARY KEY,
  first_name VARCHAR(32) NOT NULL,
  last_name VARCHAR(32) NOT NULL,
  join_date DATE NOT NULL,
  zip INTEGER,
  state CHAR(2),
  email VARCHAR(128),
  password_hash CHAR(64));

Importing this table into HDFS could be done with the command:

you@db$ sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
    --local --hive-import

This would connect to the MySQL database on this server and import the USERS table into HDFS. The –-local option instructs Sqoop to take advantage of a local MySQL connection which performs very well. The –-hive-import option means that after reading the data into HDFS, Sqoop will connect to the Hive metastore, create a table named USERS with the same columns and types (translated into their closest analogues in Hive), and load the data into the Hive warehouse directory on HDFS (instead of a subdirectory of your HDFS home directory).

Suppose you wanted to work with this data in MapReduce and weren’t concerned with Hive. When storing this table in HDFS, you might want to take advantage of compression, so you’d like to be able to store the data in SequenceFiles. In this case, you might want to import the data with the command:

you@db@ sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
    --as-sequencefile

Sqoop will also emit a Java class named USERS with getter methods for each of the columns of the table.

They support the majority of SQL’s types including optionally-null values. The data will be loaded into HDFS as a set of SequenceFiles; you can use the USERS.java class to work with the data in your MapReduce analyses.

Sqoop can also connect to other databases besides MySQL; anything with a JDBC driver should work. If you are running locally on a MySQL server the import will be especially high-performance, but a MapReduce-based import mechanism allows remote database connections as well. Authenticated connections with usernames and passwords are also supported. Several other options allow you to control which columns of a table are imported, and other aspects of the import process. The full reference manual is available at www.cloudera.com/hadoop-sqoop.

A Closer Look

In this section I’ll briefly outline how Sqoop works under the hood.

In an earlier blog post, I described the DBInputFormat, a connector that allows Hadoop MapReduce programs to read rows from SQL databases. DBInputFormat allows Hadoop to read input from JDBC: a Java interface to databases that most popular database vendors (Oracle, MySQL, Postgresql, etc.) implement.

In order to use DBInputFormat you need to write a class that deserializes the columns from the database record into individual data fields to work with. This is pretty tedious—and entirely algorithmic. Sqoop auto-generates class definitions to deserialze the data from the database. These classes can also be used to store the results in Hadoop’s SequenceFile format, which allows you to take advantage of built-in compression within HDFS too. The classes are written out as .java files that you can incorporate in your own data processing pipeline later. The class definition is created by taking advantage of JDBC’s ability to read metadata about databases and tables.

When Sqoop is invoked, it retrieves the table’s metadata, writes out the class definition for the columns you want to import, and launches a MapReduce job to import the table body proper.

Hadoop users know that moving large volumes of data can be a time-intensive operation. While it provides a reliable implementation-independent mechanism to read database tables, using a MapReduce JDBC job to import data from a remote database is often inefficient. Database vendors usually provide an export tool that exports data in a more high-performance manner. Sqoop is capable of using alternate import strategies as well. By examining the connect string URL that tells Sqoop which database to connect to, Sqoop will choose alternate import strategies as appropriate to the database. We’ve already implemented the ability to take advantage of MySQL’s export tool called mysqldump. We’ll add support for other systems as soon as we can.

Getting Sqoop

The first beta release of Sqoop is available today as part of Cloudera’s Distribution for Hadoop. It installs as part of the same RPM (or Debian package) that contains Hadoop itself.

Hadoop users who aren’t using our distribution can apply the patch that is contributed to Apache Hadoop as issue HADOOP-5815, and compile it themselves, but Sqoop won’t be part of the standard Hadoop release for some time (at least until version 0.21.0). mysqldump support is added in HADOOP-5844, and Hive integration is provided in HADOOP-5887.

You can read the documentation for Sqoop at http://www.cloudera.com/hadoop-sqoop. You can also get some basic usage information from Sqoop itself by running sqoop –-help after it’s installed.

We also did a preview of this tool at the May Bay Area Hadoop User Group meet-up; you can catch the presentation here:

We hope you find this tool useful—please check it out! Then let us know your feedback on GetSatisfaction. Bug reports and feature requests especially welcome.

33 Responses
  • Toby Jungen / June 01, 2009 / 1:31 PM

    Thanks for the post, looking forward to using it!

    A couple of notes:
    - You mentioned that JDBC performs poorly for larger dumps due to needing to iterate over the tables in chunks via limit and offset. If this is the same problem I’ve run into, then you are doing this chunking because a single query would cause you to run out of memory. I’ve resolved this problem by setting PreparedStatement.setFetchSize to a reasonable size.
    - You also mentioned an extra overhead in unescaping data output over mysqldump. I believe you can disable value escaping in mysqldump via one of its command line flags.

  • Aaron / June 01, 2009 / 2:35 PM

    Hi Toby,

    The chunks aren’t really the issue here. The problem is that by using a MapReduce job to read from a table, individual map tasks responsible for loading slices of the table are run on different nodes; they each generate a separate SELECT statement to accomplish this slice load. But most databases cannot start a query at a particular offset, nor are related queries fused together. Instead, the db just delivers to each task output starting from an offset into a separate full-table read, meaning that each mapper incurs a table scan up to the end of its OFFSET+LIMIT. The solution is to use a single mapper; the downside is that this somewhat defeats the purpose of using MapReduce and doesn’t utilize the network efficiently.

    As for mysqldump, I looked at the command-line flags in question. Unfortunately, it’s hard to use those flags in conjunction with a pipe. So I think that using them would incur an additional r/w cycle on the disk, which is even worse.

    Thanks for the thorough reading; keep keeping me honest :)
    - Aaron

  • Toby Jungen / June 01, 2009 / 6:22 PM

    Hi Aaron -
    Seems I have sorely misunderstood, my apologies! It sounded like Sqoop operated by first dumping the database data into HDFS and then running the MapReduce task as a second step. My dealings with databases in hadoop have consisted of dumping the database data of interest to HDFS, and then starting a mapreduce task over that data. That way, the map / reduce tasks would never touch the database directly. Of course, this approach introduces somewhat of a bottleneck since the dump process cannot be parallelized, and if the task is primarily I/O bound then this will be a problem.
    -Toby

  • Steve / June 09, 2009 / 9:07 AM

    Can this tool also run in reverse? I.E, Export data from Hive into mySQL?

  • Aaron / June 09, 2009 / 1:53 PM

    Hi Steve,

    Not yet. But we’re working on it!

    - Aaron

  • Simonluca Landi / June 18, 2009 / 6:40 AM

    Hi all.
    I’m trying to use sqoop to import a few GBs table from MySQL to Hadoop, but i got an OOME:
    java.lang.OutOfMemoryError: Java heap space
    at com.mysql.jdbc.Buffer.getBytes(Buffer.java:198)
    at com.mysql.jdbc.Buffer.readLenByteArray(Buffer.java:318)
    at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1375)
    at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2369)
    at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:451)
    at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2076)
    at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1451)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1787)
    at com.mysql.jdbc.Connection.execSQL(Connection.java:3283)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1332)
    at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1467)
    at org.apache.hadoop.sqoop.manager.SqlManager.execute(SqlManager.java:254)

    The hadoop node is running with 1GB of RAM.
    Any work-around for this?

    -=[SLL]=-

  • NealRichter / July 21, 2009 / 11:18 PM

    +1 on making Sqoop publish data into MySQL… though I would think that creating an ‘external’ table with a jdbc link as the LOCATION would be pretty optimal also.

  • Dharmesh / September 17, 2012 / 10:45 PM

    Hi all,

    I have question on DBInputFormat Vs. Sqoop

    I am trying to build wrapper class where I can read data from mysql and process it in map reduce and output can be stored in again RDBMS (MySQL) or NOSQL databases or HDFS.

    My confusion is what is the best option to use Sqoop or DBInputFormat and DBOutputFormat? I am creating plugable api where I will just configure input and output datasource and my mapreduce will read and write data accordingly in selected datasource.

    If DBInputFormat and DBOutput format is the best option than do let me know which version of Hadoop Mapreduce API I should use.

    Thanks in advance.

Leave a comment


+ 4 = eleven