Cloudera Developer Blog · Sqoop Posts
Hue, the open source Web UI that makes Apache Hadoop easier to use, has a brand-new application that enables transferring data between relational databases and Hadoop. This new application is driven by Apache Sqoop 2 and has several user experience improvements, to boot.
Sqoop is a batch data migration tool for transferring data between traditional databases and Hadoop. The first version of Sqoop is a heavy client that drives and oversees data transfers via MapReduce. In Sqoop 2, the majority of that work moved to a server with which a thin client communicates; in fact, any client can talk to the Sqoop 2 server over its JSON-REST protocol. Sqoop 2 was chosen over its predecessor for this application because of that client-server design.
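For a flavor of that protocol, here is a minimal sketch of querying a Sqoop 2 server directly with curl (assuming a server on its default port 12000 with the web app deployed under /sqoop; exact endpoint paths may vary between 1.99.x releases):

    # Ask the Sqoop 2 server for its version metadata; the response is JSON.
    curl -s http://localhost:12000/sqoop/version

    # List the connectors registered with the server (path assumed for the 1.99.x API).
    curl -s http://localhost:12000/sqoop/v1/connector/all

Hue’s new application is essentially one such client, layering a point-and-click interface over these REST calls.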
Importing from MySQL to HDFS
The following is the canonical import job example sourced from http://sqoop.apache.org/docs/1.99.2/Sqoop5MinutesDemo.html. In Hue, this can be done in three easy steps.
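For comparison, the equivalent flow from that demo in the Sqoop 2 client shell looks roughly like this sketch (the host name is hypothetical, and the interactive prompts for connection and job metadata are omitted; see the linked demo for the exact 1.99.2 dialogue):

    # Start the thin client shell and point it at the Sqoop 2 server.
    bin/sqoop.sh client
    sqoop:000> set server --host sqoop2.example.com --port 12000 --webapp sqoop

    # Define a connection (the shell then prompts for the JDBC URL and credentials).
    sqoop:000> create connection --cid 1

    # Define an import job against that connection (prompts for table, output directory, and so on).
    sqoop:000> create job --xid 1 --type import

    # Run the job and check on its progress.
    sqoop:000> submission start --jid 1
    sqoop:000> submission status --jid 1

In Hue, that same flow is completed through forms rather than shell prompts.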
Note: This post was originally published at blogs.apache.org in a slightly different form.
Apache Sqoop is a tool for doing highly efficient data transfers between relational databases and the Apache Hadoop ecosystem. One significant benefit of Sqoop is that it’s easy to use and can work with a variety of systems both inside and outside of that ecosystem. Thus, with one tool, you can import data from, or export data to, any database supporting the JDBC interface, using the same command-line arguments that Sqoop exposes. Furthermore, Sqoop was designed to be modular, allowing you to plug in specialized additions to optimize transfers for particular database systems.
Many users use the words “connector” and “driver” interchangeably, but these words mean completely different things in the context of Sqoop — which can lead to confusion because both connectors and drivers are needed for every Sqoop invocation. This blog post will explain the difference between them, and how Sqoop uses these concepts when transferring data between Hadoop and other systems.
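To make the distinction concrete up front, here are two hedged sketches of the same import (the MySQL host, database, table, and credentials are hypothetical). In the first, Sqoop selects its specialized MySQL connector from the jdbc:mysql:// URL; in the second, the --driver option forces the Generic JDBC Connector, which loads the named JDBC driver class directly:

    # Connector chosen automatically from the JDBC URL.
    sqoop import \
      --connect jdbc:mysql://mysql.example.com/sqoop \
      --username sqoop --password sqoop \
      --table cities

    # Naming the driver class explicitly falls back to the Generic JDBC Connector.
    sqoop import \
      --connect jdbc:mysql://mysql.example.com/sqoop \
      --username sqoop --password sqoop \
      --table cities \
      --driver com.mysql.jdbc.Driver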
The Apache Hadoop ecosystem is evolving at a rapid pace – so rapidly that important developments often pass out of public attention too quickly. Thus, we think it might be helpful to bring you a digest (by no means complete!) of our favorite highlights on a regular basis. (This effort, by the way, has different goals than the fine Hadoop Weekly newsletter, which has a more expansive view – and which you should subscribe to immediately, as far as we’re concerned.)
Find the first installment below. Although the time period reflected here is obviously more than a month long, we have some catching up to do before we can move to a truly monthly cadence.
Continuing the fine tradition of Clouderans contributing books to the Apache Hadoop ecosystem, Apache Sqoop Committers/PMC Members Kathleen Ting and Jarek Jarcec Cecho have officially joined the book author community: their Apache Sqoop Cookbook is now available from O’Reilly Media (with a pelican as the assigned cover beast).
The book arrives at an ideal time. Hadoop has quickly become the standard for processing and analyzing Big Data, and in order to integrate a new Hadoop deployment into your existing environment, you will very likely need to transfer data stored in legacy relational databases into your new cluster.
Sqoop is just the ticket: it optimizes data transfers between Hadoop and RDBMSs via a command-line interface offering some 60 parameters. This new cookbook focuses on applying those parameters to common use cases. One recipe at a time, Kathleen and Jarek guide you from basic commands that require no prior Sqoop knowledge all the way to very advanced use cases. The recipes are detailed enough not only to let you deploy Sqoop in your environment, but also to help you understand its inner workings.
In this installment of “Meet the Engineer”, get to know Customer Operations Engineering Manager/Apache Sqoop committer Kathleen Ting (@kate_ting).
What do you do at Cloudera, and in what open-source projects are you involved?
I’m a support manager at Cloudera, and an Apache Sqoop committer and PMC member. I also contribute to the Apache Flume and Apache ZooKeeper mailing lists and organize and present at meetups, as well as speak at conferences, about those projects.
My role is a hybrid “player/coach” model: in addition to doing managerial things like leading a team and addressing customer escalations, I also answer customer support cases directly, which is a fairly rare combination. It’s an effective approach: it gives me direct insight into customer concerns that I otherwise wouldn’t get, keeps me grounded, and ensures I appreciate first-hand the work the team is doing.
For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.
In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.
Let’s look at the highlights of the 2012 developments around projects supported by Cloudera.
Apache Hadoop Releases
Our hearty congratulations to the Cloudera engineers who have been accepted as ApacheCon NA 2013 (Feb. 26-28 in Portland, OR) speakers for these talks:
(The following is a re-post from apache.org)
Apache Sqoop 1.4.2 was released in August 2012. As this was an extremely important release for the Sqoop community – our first release as an Apache Top Level project – I would like to highlight the key features and fixes of this release. The entire change log can be viewed on our JIRA and actual bits can be downloaded from the usual place.
Apache Hadoop 2.0.0 Support
One of the main goals of the previous 1.4.1-incubating release was to make sure that Sqoop works across various Apache Hadoop versions rather than being locked to one specific version; we verified that Sqoop works on Hadoop 0.20, 1.0, and 0.23. We’re excited to announce that Sqoop 1.4.2 adds support for Hadoop 2.0.x (SQOOP-574), so Sqoop now works on four different major Hadoop releases!
Update time! As a reminder, Cloudera releases major versions of CDH, our 100% open source distribution of Apache Hadoop and related projects, annually, and then issues updates to CDH every three months. Updates primarily comprise bug fixes, but we also add enhancements. We only include fixes or enhancements that maintain compatibility, improve system stability, and still allow customers and users to skip updates as they see fit.
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June, and a number of exciting use cases have moved to production. CDH4.1 is an update that delivers many fixes as well as several useful enhancements. Among them:
Strata Conference + Hadoop World (Oct. 23-25 in New York City) is a bonanza for Hadoop and big data enthusiasts – but not only because of the technical sessions and tutorials. It’s also an important gathering place for the developer community, most of whom are eager to share info from their experiences in the “trenches”.
Just to make that process easier, Cloudera is teaming up with local meetups during that week to organize a series of meetings on a variety of topics. (If for no other reason, stop into one of these meetups for a chance to grab a coveted Cloudera t-shirt.)
As you can see, these meetups are highly parallel, so you will either have to make careful choices or have very quick feet. The good news is: there’s something for everybody.