In addition to providing you with a dependable release of Hadoop that is easy to configure, at Cloudera we also focus on developing tools to extend Hadoop’s usability, and make Hadoop a more central component of your data infrastructure. In this vein, we’re proud to announce the availability of Sqoop, a tool designed to easily import information from SQL databases into your Hadoop cluster.
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
- Imports individual tables or entire databases to files in HDFS
- Generates Java classes to allow you to interact with your imported data
- Provides the ability to import from SQL databases straight into your Hive data warehouse
After setting up an import job in Sqoop, you can get started working with SQL database-backed data from your Hadoop MapReduce cluster in minutes.
Hadoop MapReduce is a powerful tool; its flexibility in parsing unstructured or semi-structured data means that there is a lot of potential for creative applications. But your analyses are only as useful as the data they process. In many organizations, large volumes of useful information are locked away in disparate databases across the enterprise. HDFS, Hadoop’s distributed file system, is a great place to bring this data together, but actually doing so is a cumbersome task.
Consider the task of processing access logs to analyze user behavior on your web site. Users may present your site with a cookie that identifies who they are, and you can log these cookies alongside the pages they visit. This lets you correlate users with their actions. But matching their behavior against their profiles or their previously recorded history requires that you look up information in a database. If several MapReduce programs needed to perform similar joins, the database server would experience very high load and a large number of concurrent connections while those programs ran, possibly degrading the performance of your interactive web site.
The solution: periodically dump the contents of the users database and the action history database to HDFS, and let your MapReduce programs join against the data stored there. Going one step further, you could take the in-HDFS copy of the users database and import it into Hive, allowing you to perform ad-hoc SQL queries against the entire database without working on the production database.
Sqoop makes all of the above possible with a single command line.
Continuing the example above, let’s say that our front-end servers connect to a MySQL database named website, hosted on db.example.com. The website database has several tables, but the one we are most interested in is named USERS.
This table has several columns; it might have been created from a SQL statement like:
CREATE TABLE USERS (
  user_id INTEGER NOT NULL PRIMARY KEY,
  first_name VARCHAR(32) NOT NULL,
  last_name VARCHAR(32) NOT NULL,
  join_date DATE NOT NULL,
  zip INTEGER,
  state CHAR(2),
  email VARCHAR(128),
  password_hash CHAR(64)
);
Importing this table into HDFS could be done with the command:
you@db$ sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
    --local --hive-import
This would connect to the MySQL database on this server and import the USERS table into HDFS. The --local option instructs Sqoop to take advantage of a local MySQL connection, which performs very well. The --hive-import option means that after reading the data into HDFS, Sqoop will connect to the Hive metastore, create a table named USERS with the same columns and types (translated into their closest analogues in Hive), and load the data into the Hive warehouse directory on HDFS (instead of a subdirectory of your HDFS home directory).
Suppose you wanted to work with this data in MapReduce and weren’t concerned with Hive. When storing this table in HDFS, you might want to take advantage of compression, so you’d like to store the data in SequenceFiles. In that case, you might import the data with the command:
you@db$ sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
    --as-sequencefile
Sqoop will also emit a Java class named USERS with getter methods for each of the columns of the table. The generated class supports the majority of SQL’s types, including optionally-null values. The data will be loaded into HDFS as a set of SequenceFiles; you can use the USERS.java class to work with the data in your MapReduce analyses.
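To give a feel for what the generated code provides, here is a hypothetical, simplified sketch of such a class for a few of the USERS columns. The field names and accessor style are assumptions for illustration; the class Sqoop actually emits also implements Hadoop’s serialization interfaces so records can be read from and written to SequenceFiles, which is omitted here.

```java
// Hypothetical sketch of the kind of class Sqoop generates for the USERS
// table (simplified; the real generated class also handles Hadoop
// serialization for SequenceFiles).
public class Users {
    private Integer userId;     // user_id INTEGER NOT NULL
    private String firstName;   // first_name VARCHAR(32) NOT NULL
    private String lastName;    // last_name VARCHAR(32) NOT NULL
    private String email;       // email VARCHAR(128), nullable: may be null

    public void setUserId(Integer v)   { this.userId = v; }
    public void setFirstName(String v) { this.firstName = v; }
    public void setLastName(String v)  { this.lastName = v; }
    public void setEmail(String v)     { this.email = v; }

    public Integer getUserId()   { return userId; }
    public String getFirstName() { return firstName; }
    public String getLastName()  { return lastName; }
    public String getEmail()     { return email; }
}
```

In a MapReduce program, your mapper would receive populated instances of this class as record values and call the getters to read individual fields.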
Sqoop can also connect to other databases besides MySQL; anything with a JDBC driver should work. If you are running locally on a MySQL server the import will be especially high-performance, but a MapReduce-based import mechanism allows remote database connections as well. Authenticated connections with usernames and passwords are also supported. Several other options allow you to control which columns of a table are imported, and other aspects of the import process. The full reference manual is available at www.cloudera.com/hadoop-sqoop.
A Closer Look
In this section I’ll briefly outline how Sqoop works under the hood.
In an earlier blog post, I described the DBInputFormat, a connector that allows Hadoop MapReduce programs to read rows from SQL databases. DBInputFormat allows Hadoop to read input via JDBC, a Java database-access interface that most popular database vendors (Oracle, MySQL, PostgreSQL, etc.) implement.
To use DBInputFormat, you need to write a class that deserializes the columns of a database record into individual data fields. This is tedious, and entirely algorithmic. Sqoop auto-generates these class definitions by taking advantage of JDBC’s ability to read metadata about databases and tables. The generated classes can also be used to store the results in Hadoop’s SequenceFile format, which lets you take advantage of built-in compression within HDFS. The classes are written out as .java files that you can incorporate into your own data processing pipeline later.
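The "entirely algorithmic" part is the key insight: JDBC reports each column's SQL type as a constant from java.sql.Types, and each SQL type has a natural Java analogue. The sketch below is not Sqoop's actual code, just an illustrative mapping of the kind a code generator would consult (nullable columns motivate object types like Integer rather than int).

```java
import java.sql.Types;

// Illustrative sketch (not Sqoop's real internals) of mapping JDBC type
// codes, as reported by table metadata, to Java field types for code
// generation. Object wrapper types are used so null columns are representable.
public class TypeMapper {
    public static String javaTypeFor(int sqlType) {
        switch (sqlType) {
            case Types.INTEGER:  return "Integer";
            case Types.VARCHAR:
            case Types.CHAR:     return "String";
            case Types.DATE:     return "java.sql.Date";
            case Types.DOUBLE:   return "Double";
            default:             return "Object";  // fallback for unmapped types
        }
    }
}
```

Applied to the USERS table above, this mapping yields an Integer field for user_id, String fields for the VARCHAR and CHAR columns, and a java.sql.Date for join_date.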
When Sqoop is invoked, it retrieves the table’s metadata, writes out the class definition for the columns you want to import, and launches a MapReduce job to import the table body proper.
Hadoop users know that moving large volumes of data can be a time-intensive operation. While JDBC provides a reliable, implementation-independent way to read database tables, using a MapReduce job over JDBC to import data from a remote database is often inefficient. Database vendors usually provide an export tool that moves data out of the database with much higher performance. Sqoop can use these alternate import strategies as well: by examining the connect string URL that tells it which database to connect to, Sqoop chooses an import strategy appropriate to that database. We’ve already implemented support for MySQL’s export tool, mysqldump, and we’ll add support for other systems as soon as we can.
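The dispatch described above can be pictured as a simple decision on the connect string. This is a hypothetical sketch under the assumptions stated in the post (a local MySQL connection can use mysqldump; everything else falls back to the generic MapReduce-over-JDBC import); the method and strategy names here are invented for illustration and are not Sqoop's API.

```java
// Hypothetical sketch of import-strategy selection based on the JDBC
// connect string, as described in the post. Names are illustrative only.
public class StrategyPicker {
    public static String pick(String connectUrl, boolean local) {
        if (connectUrl.startsWith("jdbc:mysql:") && local) {
            return "mysqldump";      // fast path via MySQL's export tool
        }
        return "jdbc-mapreduce";     // generic MapReduce-over-JDBC import
    }
}
```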
The first beta release of Sqoop is available today as part of Cloudera’s Distribution for Hadoop. It installs as part of the same RPM (or Debian package) that contains Hadoop itself.
Hadoop users who aren’t using our distribution can apply the patch contributed to Apache Hadoop as issue HADOOP-5815 and compile it themselves, but Sqoop won’t be part of a standard Hadoop release for some time (at least until version 0.21.0). mysqldump support is added in HADOOP-5844, and Hive integration is provided in HADOOP-5887.
You can read the documentation for Sqoop at http://www.cloudera.com/hadoop-sqoop. You can also get basic usage information from Sqoop itself by running sqoop --help after it’s installed.
We also previewed this tool at the May Bay Area Hadoop User Group meet-up.
We hope you find this tool useful, so please check it out! Then let us know your feedback on GetSatisfaction. Bug reports and feature requests are especially welcome.