Our thanks to Montrial Harrell, Enterprise Architect for the State of Indiana, for the guest post below.
Recently, the State of Indiana has begun to focus on how enterprise data management can help our state’s government operate more efficiently and improve the lives of our residents. With that goal in mind, I began this journey just like everyone else I know: with an interest in learning more about Apache Hadoop.
I started learning Hadoop via a virtual server onto which I installed CDH and worked through a few online tutorials. Then, I learned a little more by reading blogs and documentation, and by trial and error.
Eventually, I decided to experiment with a classic Hadoop use case: extract, load, and transform (ELT). In most cases, ELT lets you offload resource-intensive data transforms onto Hadoop’s MPP-like processing, thereby cutting resource usage on the current ETL server at relatively low cost. This functionality is delivered in part via the Hadoop ecosystem project called Apache Sqoop.
Preparing for Sqoop
In preparing to use Sqoop, I found that there are two versions inside CDH. The classic version, called Sqoop 1, has a command-line interface (CLI), and you store drivers for it in /var/lib/sqoop. If you are going to use Apache Oozie jobs that reference Sqoop, you also need to store your driver in
The more recent Sqoop engine is called Sqoop 2. Sqoop 2 uses the path /var/lib/sqoop2 for drivers and is the engine used when you access Sqoop Transfer under the “Data Browsers” menu in Hue. There are differences between the two Sqoop engines that are better explained by the Cloudera documentation, but suffice it to say here that I chose to use Sqoop 1.
The steps below are based on Sqoop 1 only; should you want to follow along, you should install the JDBC drivers first (docs).
Preparing for a SQL Sqoop Job
Here are the steps I followed to run my first Sqoop job using Microsoft SQL Server as the source:
- First, I downloaded the Microsoft JDBC Driver 4.0 tar file from the Microsoft Download Center.
- Next, I unpacked the tar file using the command
- I copied the driver to the path /var/lib/sqoop on the machine where I would run the Sqoop command. (If CDH is installed via a parcel, you’ll need to create this directory first.)
- I created a test user on the SQL Server instance with read-only permission to my test database.
- Next, I logged into the test Hadoop cluster.
- I ran the command below to import a single SQL Server database table.
sqoop import --connect "jdbc:sqlserver://xx.xx.xx.xx:3464;databaseName=testing" --username testingusername -P --table testtable --hive-import -m 24
--connect: the JDBC connect string, containing the server name, port (if needed), and database name. Note the double quotes around the connection string.
--username: the database login name
-P: prompt the user for the database login password
--table: the table you wish to import
--hive-import: import the table into Hive
-m: the number of map tasks (parallel processes) to use to perform the import
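To make the flags concrete, the same invocation can be written as a short script with the connection details pulled out into variables. The server address, credentials, and table name below are the placeholder values from the example above, not real ones:

```shell
#!/bin/sh
# Placeholder connection details from the example above; substitute
# your own server, database, credentials, and table name.
SERVER="xx.xx.xx.xx:3464"
DATABASE="testing"
DB_USER="testingusername"
TABLE="testtable"
MAPPERS=24

# Assemble the Sqoop 1 import command. -P makes Sqoop prompt for the
# password interactively, so it never lands in the shell history.
SQOOP_CMD="sqoop import --connect \"jdbc:sqlserver://${SERVER};databaseName=${DATABASE}\" --username ${DB_USER} -P --table ${TABLE} --hive-import -m ${MAPPERS}"

# Print the command rather than running it, so the script can be
# sanity-checked before pointing it at a live cluster.
echo "$SQOOP_CMD"
```

Keeping the details in variables makes it easy to reuse the same script for each table you want to bring over.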
Finally, I logged into Hue and verified that the table had imported successfully.
After testing Sqoop and becoming more familiar with its features, flexibility, and ease of use alongside Apache Hive, Impala, and Apache Flume, my team and I are very excited. During our testing, I was able to load 65GB of data from a SQL Server table into Hive in four minutes and to run new transformation test code in minutes as opposed to hours. This is not to say that there have not been bumps along the way, but it does confirm that this platform has definite advantages, and we intend to use it as part of our standard enterprise tool set going forward.