Cloudera Engineering Blog · Sqoop Posts
Thanks to Guy Harrison of Dell Inc. for the guest post below about time-tested performance optimizations for connecting Oracle Database with Apache Hadoop that are now available in Apache Sqoop 1.4.5 and later.
Back in 2009, I attended a presentation by a Cloudera employee named Aaron Kimball at the MySQL User Conference in which he unveiled a new tool for moving data from relational databases into Hadoop. This tool was to become, of course, the now very widely known and beloved Sqoop!
Thanks to M. Asokan, Chief Architect at Syncsort, for the guest post below.
Apache Sqoop provides a framework to move data between HDFS and relational databases in a parallel fashion using Hadoop’s MR framework. As Hadoop becomes more popular in enterprises, there is a growing need to move data from non-relational sources like mainframe datasets to Hadoop. Following are possible reasons for this:
Hue, the open source Web UI that makes Apache Hadoop easier to use, has a brand-new application that enables transferring data between relational databases and Hadoop. This new application is driven by Apache Sqoop 2 and has several user experience improvements, to boot.
Sqoop is a batch data migration tool for transferring data between traditional databases and Hadoop. The first version of Sqoop is a heavy client that drives and oversees data transfer via MapReduce. In Sqoop 2, the majority of the work was moved to a server that a thin client communicates with. Also, any client can communicate with the Sqoop 2 server over its JSON-REST protocol. Sqoop 2 was chosen instead of its predecessors because of its client-server design.
Importing from MySQL to HDFS
Note: This post was originally published at blogs.apache.org in a slightly different form.
Apache Sqoop is a tool for doing highly efficient data transfers between relational databases and the Apache Hadoop ecosystem. One significant benefit of Sqoop is that it’s easy to use and can work with a variety of systems inside as well as outside of that ecosystem. Thus, with one tool, you can import or export data from all databases supporting the JDBC interface with the same command-line arguments exposed by Sqoop. Furthermore, Sqoop was designed to be modular, allowing you to plug in specialized additions to optimize transfers for particular database systems.
The ecosystem is evolving at a rapid pace – so rapidly, that important developments are often passing through the public attention zone too quickly. Thus, we think it might be helpful to bring you a digest (by no means complete!) of our favorite highlights on a regular basis. (This effort, by the way, has different goals than the fine Hadoop Weekly newsletter, which has a more expansive view – and which you should subscribe to immediately, as far as we’re concerned.)
Find the first installment below. Although the time period reflected here is obviously more than a month long, we have some catching up to do before we can move to a truly monthly cadence.
Continuing the fine tradition of Clouderans contributing books to the Apache Hadoop ecosystem, Apache Sqoop Committers/PMC Members Kathleen Ting and Jarek Jarcec Cecho have officially joined the book author community: their Apache Sqoop Cookbook is now available from O’Reilly Media (with a pelican the assigned cover beast).
The book arrives at an ideal time. Hadoop has quickly become the standard for processing and analyzing Big Data, and in order to integrate a new Hadoop deployment into your existing environment, you will very likely need to transfer data stored in legacy relational databases into your new cluster.
In this installment of “Meet the Engineer”, get to know Customer Operations Engineering Manager/Apache Sqoop committer Kathleen Ting (@kate_ting).
What do you do at Cloudera, and in what open-source projects are you involved?
I’m a support manager at Cloudera, and an Apache Sqoop committer and PMC member. I also contribute to the Apache Flume and Apache ZooKeeper mailing lists and organize and present at meetups, as well as speak at conferences, about those projects.
For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.
In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.
Our hearty congratulations to the Cloudera engineers who have been accepted as ApacheCon NA 2013 (Feb. 26-28 in Portland, OR) speakers for these talks:
(The following is a re-post from apache.org)
Apache Sqoop 1.4.2 was released in August 2012. As this was an extremely important release for the Sqoop community – our first release as an Apache Top Level project – I would like to highlight the key features and fixes of this release. The entire change log can be viewed on our JIRA and actual bits can be downloaded from the usual place.