Sqooping Data with Hue

Hue, the open source Web UI that makes Apache Hadoop easier to use, has a brand-new application that enables transferring data between relational databases and Hadoop. This new application is driven by Apache Sqoop 2 and has several user experience improvements, to boot.

Sqoop is a batch data migration tool for transferring data between traditional databases and Hadoop. The first version of Sqoop is a heavy client that drives and oversees data transfer via MapReduce. In Sqoop 2, the majority of the work was moved to a server that a thin client communicates with. Also, any client can communicate with the Sqoop 2 server over its JSON-REST protocol. Sqoop 2 was chosen instead of its predecessors because of its client-server design.

Importing from MySQL to HDFS

The following is the canonical import job example sourced from http://sqoop.apache.org/docs/1.99.2/Sqoop5MinutesDemo.html. In Hue, this can be done in three easy steps.

Environment

  • CDH 4.4 or CDH 5 Beta
  • Hue 3.0.0
  • MySQL 5.1

1. Create a Connection

In the Sqoop app, the connection manager is available from the “New Job” wizard. To get to the new job wizard, click New Job. There may be a list of connections available if a few have been created before. For the purposes of this demo, we’ll go through the process of creating a new connection. Click Add a new connection and fill in the blanks with the data below. Then click Save to return to the New Job wizard!

Connection Parameter

Value

Name

mysql-connection-demo

JDBC Driver Class

com.mysql.jdbc.Driver

JDBC Connection String

jdbc:mysql://hue-demo/demo

Username

demo

Password

demo

 

2. Create a Job

After creating a connection, follow the wizard and fill in the blanks with the information below.

Job Wizard Parameter

Value

Name

mysql-import-job-demo

Type

IMPORT

Connection

mysql-connection-demo

Table name

test

Storage Type

HDFS

Output format

TEXT_FILE

Output directory

/tmp/mysql-import-job-demo

 

3. Save and Submit the Job

At the end of the Job wizard, click Save and Run! The job will auto-magically start and the job dashboard will be displayed. As the job is running, a progress bar below the job listing will be dynamically updated. Links to the HDFS output via the File Browser and Map Reduce logs via Job Browser will be available on the left hand side of the job edit page.

Conclusion

The new Sqoop application enables batch data migration from a more traditional databases to Hadoop and vice versa through Hue. Using Hue, a user can move data between storage systems in a distributed fashion with the click of a button.

I’d like to send out a big thank you to the Sqoop community for the new client-server design!

Both projects are undergoing heavy development and are welcoming external contributions. Have any suggestions? Feel free to tell us what you think through hue-user or our community forums.

Abraham Elmahrek is a Software Engineer at Cloudera, working on the Hue team.

Filed under:

2 Responses
  • muki / April 10, 2014 / 10:24 AM

    In Hue I can see only the following error. I cannot remove this job anymore

    Job application_1397150195282_0001 could not be found: HTTPConnectionPool(host=’localhost’, port=8088): Max retries exceeded with url: /ws/v1/cluster/apps/application_1397150195282_0001 (Caused by : [Errno 111] Connection refused)

    • Justin Kestelyn (@kestelyn) / April 10, 2014 / 4:05 PM

      muki,

      For faster response, post this issue in the Hue area at cloudera.com/community.

Leave a comment


9 × = nine