Sqooping Data with Hue

Hue, the open source Web UI that makes Apache Hadoop easier to use, has a brand-new application that enables transferring data between relational databases and Hadoop. This new application is driven by Apache Sqoop 2 and has several user experience improvements, to boot.

Sqoop is a batch data migration tool for transferring data between traditional databases and Hadoop. The first version of Sqoop is a heavy client that drives and oversees data transfer via MapReduce. In Sqoop 2, the majority of the work was moved to a server that a thin client communicates with; in fact, any client can talk to the Sqoop 2 server over its JSON-REST protocol. Sqoop 2 was chosen over its predecessor because of this client-server design.
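
Because the protocol is plain JSON over HTTP, you can poke at the Sqoop 2 server directly. As a minimal sketch (assuming the server is on localhost, listening on its default port 12000, and exposing the version endpoint documented for Sqoop 1.99.x), you could check that the server is up with:

    # Ask the Sqoop 2 server for its version over the JSON-REST API
    curl -s http://localhost:12000/sqoop/version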

Importing from MySQL to HDFS

The following is the canonical import job example sourced from http://sqoop.apache.org/docs/1.99.2/Sqoop5MinutesDemo.html. In Hue, this can be done in three easy steps.

Environment

  • CDH 4.4 or CDH 5 Beta
  • Hue 3.0.0
  • MySQL 5.1
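
The walkthrough assumes a MySQL database named demo, reachable from the Sqoop 2 server at the host hue-demo, with a demo/demo user and a table called test (mirroring the upstream Sqoop demo linked above). If you need to set that up yourself, a rough sketch follows; the column layout and sample rows are just placeholders:

    # Run on the MySQL host: create the demo database, user, and a small "test" table
    mysql -u root -p -e "
      CREATE DATABASE demo;
      CREATE USER 'demo'@'%' IDENTIFIED BY 'demo';
      GRANT ALL PRIVILEGES ON demo.* TO 'demo'@'%';
      CREATE TABLE demo.test (id INT PRIMARY KEY, msg VARCHAR(32));
      INSERT INTO demo.test VALUES (1, 'hello'), (2, 'world');"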

1. Create a Connection

In the Sqoop app, the connection manager is available from the “New Job” wizard; to get there, click New Job. If any connections have been created before, they will be listed here. For the purposes of this demo, we’ll go through the process of creating a new connection. Click Add a new connection, fill in the blanks with the data below, and then click Save to return to the New Job wizard.

Connection Parameter      Value
Name                      mysql-connection-demo
JDBC Driver Class         com.mysql.jdbc.Driver
JDBC Connection String    jdbc:mysql://hue-demo/demo
Username                  demo
Password                  demo
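
Note that it is the Sqoop 2 server, not Hue itself, that has to load com.mysql.jdbc.Driver, so the MySQL JDBC driver jar must be on the Sqoop 2 server's classpath. As a sketch of one common way to do that on a CDH install (the lib directory and service name here are assumptions and may differ on your setup):

    # Copy the MySQL JDBC driver where the Sqoop 2 server looks for it, then restart
    sudo cp mysql-connector-java-*.jar /var/lib/sqoop2/
    sudo service sqoop2-server restart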

 

2. Create a Job

After creating a connection, follow the wizard and fill in the blanks with the information below.

Job Wizard Parameter      Value
Name                      mysql-import-job-demo
Type                      IMPORT
Connection                mysql-connection-demo
Table name                test
Storage Type              HDFS
Output format             TEXT_FILE
Output directory          /tmp/mysql-import-job-demo
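
For readers more familiar with the original Sqoop command line, the job defined above does roughly what the following Sqoop 1 invocation would do (shown purely for comparison; in the Sqoop 2 app all of these settings come from the wizard instead):

    sqoop import \
      --connect jdbc:mysql://hue-demo/demo \
      --username demo --password demo \
      --table test \
      --as-textfile \
      --target-dir /tmp/mysql-import-job-demo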

 

3. Save and Submit the Job

At the end of the Job wizard, click Save and Run! The job will auto-magically start and the job dashboard will be displayed. As the job runs, a progress bar below the job listing is updated dynamically. Links to the HDFS output (via the File Browser) and the MapReduce logs (via the Job Browser) are available on the left-hand side of the job edit page.
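
Once the job completes, the imported records can also be checked from the command line, using the output directory chosen in the wizard above:

    # List the import output and peek at the generated text files
    hadoop fs -ls /tmp/mysql-import-job-demo
    hadoop fs -cat /tmp/mysql-import-job-demo/*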

Conclusion

The new Sqoop application enables batch data migration from more traditional databases to Hadoop and vice versa through Hue. Using Hue, a user can move data between storage systems in a distributed fashion with the click of a button.

I’d like to send out a big thank you to the Sqoop community for the new client-server design!

Both projects are undergoing heavy development and are welcoming external contributions. Have any suggestions? Feel free to tell us what you think through hue-user or our community forums.

Abraham Elmahrek is a Software Engineer at Cloudera, working on the Hue team.


7 Responses
  • muki / April 10, 2014 / 10:24 AM

    In Hue I can see only the following error. I cannot remove this job anymore

    Job application_1397150195282_0001 could not be found: HTTPConnectionPool(host='localhost', port=8088): Max retries exceeded with url: /ws/v1/cluster/apps/application_1397150195282_0001 (Caused by : [Errno 111] Connection refused)

    • Justin Kestelyn (@kestelyn) / April 10, 2014 / 4:05 PM

      muki,

      For faster response, post this issue in the Hue area at cloudera.com/community.

  • Sqooped / May 04, 2014 / 3:23 PM

    It would be helpful to have a description of how to use ${CONDITIONS} when running the Hue UI. Specifically, if the “Table SQL statement” is
    SELECT x,y,z FROM mydb WHERE ${CONDITIONS};
    then where is the expression CONDITIONS specified? There doesn’t appear to be any documentation on this and there is no obvious field in the UI for specifying the WHERE CONDITIONS. I’ve tried putting the expression in the brace brackets, as well as other variations, and keep getting the same error:
    SQL statement must contain placeholder for auto generated conditions – ${CONDITIONS}

    (If it matters, all the above was on the VirtualBox for CDH5.)

  • Oscar / May 21, 2014 / 12:12 PM

    Hi,

    I am following the instructions from this page, using my local database. I created a connector and a job. I also created a folder for output files, /usr/cloudera/temp1. Although it is not complaining about the driver and/or job creation, when I go to /usr/cloudera/temp1, I see no files.

    But I can run sqoop from terminal with this command successfully:
    sqoop import --connect jdbc:sqlserver://dssvmprod.cloudapp.net:1433\\SSAS2008SBS --username=XXXXX --password=***** --driver com.microsoft.sqlserver.jdbc.SQLServerDriver --table DimAccount --target-dir /user/cloudera/Temp1

    Any idea what I am doing wrong?

    • Justin Kestelyn (@kestelyn) / May 22, 2014 / 11:03 AM

      I recommend that you post your questions and issues to cloudera.com/community – will be easier to address there!

  • sean / July 23, 2014 / 7:01 PM

    I am about to pull my hair out. I get an error “Can’t load the driver” on com.mysql.jdbc.Driver no matter what I do or how many places I load that damn mysql jar file and point Sqoop2 to it, I continue to get the error. I went and plunked down 20 bucks for the book about Sqoop and the author glazes over it like it’s such a trivial task to load the driver class and point sqoop to the jar file. At this point, my feelings about Sqoop are not too pleasant. I’ve avoided it all these years because it was – and seems like it always will be – a piece of poop!!!

    • Justin Kestelyn (@kestelyn) / July 24, 2014 / 10:17 AM

      Sean,

      My suggestion would be to post your issue at cloudera.com/community (in the “Data Ingest” area), if you have not already. There are other Sqoop users there who can help.
