Cloudera Blog · Distribution Posts
Introducing Sqoop
In addition to providing you with a dependable release of Hadoop that is easy to configure, at Cloudera we also focus on developing tools to extend Hadoop’s usability, and make Hadoop a more central component of your data infrastructure. In this vein, we’re proud to announce the availability of Sqoop, a tool designed to easily import information from SQL databases into your Hadoop cluster.
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
Common Questions and Requests From Our Users
A few months ago we announced the Cloudera Distribution for Hadoop. We’re happy to report that lots of people have started using our distribution, and our GetSatisfaction product (which is essentially a message board about our products) has seen lots of good Hadoop questions and answers. We thought it would be worthwhile to share some of the interesting questions and requests we’ve seen from our users.
Question: How do I back up my name node metadata?
The name node (NN) stores all of the HDFS metadata, which includes file names, directory structures, and block locations. This metadata is stored in memory for fast lookup, but the NN also maintains two on-disk data structures to ensure that metadata is persisted. The first structure stored is a snapshot of the in-memory metadata, and the second structure stored is an edit log of changes that have been made since the snapshot was last taken. The secondary name node (2NN) is in charge of fetching the snapshot and edit log from the NN and merging the two into a new snapshot, which is then sent back to the NN. Once the NN gets the new snapshot, it clears its edit log, and the process repeats. Take a look at our other blog post about multi-host secondary name nodes for more information about configuring the 2NN.
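To make the checkpoint cycle described above concrete, here is a toy Python sketch. It is not Hadoop code: the class, function names, and operation tuples are all made up for illustration, and real name nodes persist far richer structures. It only models the three pieces mentioned above (in-memory metadata, an on-disk snapshot, and an edit log) and the merge the 2NN performs.

```python
class NameNode:
    """Toy model of NN persistence (hypothetical, for illustration only)."""

    def __init__(self):
        self.memory = {}     # in-memory metadata: path -> list of block IDs
        self.snapshot = {}   # on-disk snapshot of the metadata
        self.edit_log = []   # on-disk log of changes since the snapshot

    def create(self, path, blocks):
        # Every change is applied in memory and appended to the edit log.
        self.memory[path] = blocks
        self.edit_log.append(("create", path, blocks))

    def delete(self, path):
        del self.memory[path]
        self.edit_log.append(("delete", path, None))

    def install_checkpoint(self, new_snapshot, merged_edits):
        # Accept the merged snapshot and drop only the edits it already covers.
        self.snapshot = new_snapshot
        self.edit_log = self.edit_log[merged_edits:]


def replay(snapshot, edits):
    """Apply a list of edits on top of a snapshot and return the result."""
    state = dict(snapshot)
    for op, path, blocks in edits:
        if op == "create":
            state[path] = blocks
        elif op == "delete":
            state.pop(path, None)
    return state


def checkpoint(nn):
    """What the 2NN does in this model: fetch, merge, send back."""
    fetched_snapshot = dict(nn.snapshot)
    fetched_edits = list(nn.edit_log)
    new_snapshot = replay(fetched_snapshot, fetched_edits)
    nn.install_checkpoint(new_snapshot, len(fetched_edits))


nn = NameNode()
nn.create("/logs/a", ["blk_1", "blk_2"])
nn.create("/logs/b", ["blk_3"])
nn.delete("/logs/a")
checkpoint(nn)
print(nn.snapshot, nn.edit_log)
```

The point of the model is the invariant: replaying the edit log on top of the snapshot always reproduces the in-memory metadata, both before and after a checkpoint.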
There are two types of metadata backups that one should implement, and each type solves a different problem. I will talk about each of these backup strategies separately. The first backup strategy ensures that no metadata is lost in the event of an NN failure, whether that failure is disks dying, power supplies catching fire, or some other unforeseen loss of the NN or its local data. The way to avoid losing NN metadata in the event of a crash is to configure dfs.name.dir so that the NN writes to several local disks and at least one NFS mount. dfs.name.dir takes a comma-separated list of local filesystem paths, so an example configuration might look like “/hdd1/hadoop/dfs/name,/hdd2/hadoop/dfs/name,/mnt/nfs/hadoop/dfs/name”. The purpose of storing data on several local hard drives is to avoid data loss if a single drive fails. The purpose of storing data on an NFS mount is to avoid data loss if the NN machine goes down entirely. With at least two local drives and one NFS mount storing the same NN metadata, you should be well protected from losing any data in a crash. To be fair, NFS isn’t the only solution for mounting a remote file system, but it’s the de facto standard for Hadoop.
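As a sketch, the example value above could be written as a property in the cluster's configuration file. The property name and the three paths are the ones from the example; which file it belongs in is an assumption to check against your Hadoop version (hadoop-site.xml on older releases, hdfs-site.xml on later ones).

```xml
<!-- Assumed location: inside the <configuration> element of
     hadoop-site.xml or hdfs-site.xml, depending on the Hadoop version. -->
<property>
  <name>dfs.name.dir</name>
  <!-- Two local disks plus one NFS mount, comma-separated, no spaces. -->
  <value>/hdd1/hadoop/dfs/name,/hdd2/hadoop/dfs/name,/mnt/nfs/hadoop/dfs/name</value>
</property>
```

The NN writes the same metadata to every directory in the list, so each path should sit on a different physical device for the redundancy to mean anything.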
Debian packages for Apache Hadoop
When we announced Cloudera’s Distribution for Apache Hadoop last month, we asked the community to give us feedback on what features they liked best and what new development was most important to them. Almost immediately, Debian and Ubuntu packages for Hadoop emerged as the most popular request. A lot of customers prefer Debian derivatives over Red Hat, and installing RPMs on top of Debian, while possible with tools like alien, is a pain, to say the least.
After some weeks of development and testing, we are happy to announce the Cloudera APT Repository. APT is the standard package distribution mechanism for Ubuntu and Debian, and by simply pointing your machines at our repository, you can have Hadoop installed within minutes.
Our Debian packages are composed of the same components as our RPM-based distribution, including:
Configuring Eclipse for Apache Hadoop Development (a screencast)
Update (added 5/15/2013): The information below is a bit dated; see this post for current instructions about configuring Eclipse for Hadoop contributions.
One of the perks of using Java is the availability of functional, cross-platform IDEs. I use vim for my daily editing needs, but when it comes to navigating, debugging, and coding large Java projects, I fire up Eclipse.
Typically, when you’re developing Map-Reduce applications, you simply point Eclipse at the Apache Hadoop jar file, and you’re good to go. (Cloudera’s Hadoop training VM has a fully configured example.) However, when you want to dig deeper to explore, and modify, Hadoop’s internals themselves, you’ll want to configure Eclipse to build Hadoop. Because there’s generated code and a complicated Ant build.xml file, this takes some tinkering. Now that I have the full Hadoop Eclipse experience going (it took me a few tries), I’ve prepared a screencast that will help guide you through it, from downloading Eclipse to debugging one of Hadoop’s unit tests. You’ll also want to reference the EclipseEnvironment Hadoop wiki page, which has more details.
Cloudera’s Distribution for Apache Hadoop: Making Hadoop Easier for a Sysadmin
A few weeks ago we announced Cloudera’s Distribution for Apache Hadoop, and I want to spend some time showing how our distribution makes a sysadmin’s job a little easier.
Perhaps the most useful features in our distribution, at least for sysadmins, are RPM packages and init scripts. RPMs are the standard way of installing software on a Red Hat Linux distribution (RHEL, Fedora Core, CentOS). They give sysadmins a one-command install, and they install libraries, binaries, init scripts, log files, man pages, and configuration files in places where Linux users expect them, typically /usr/lib, /usr/bin, /etc/init.d, /var/log, /usr/share/man, and /etc, respectively. RPMs are also very easy to uninstall and upgrade.
Init scripts are the standard way to start, stop, and restart daemon processes on a Linux system. They allow sysadmins to start and stop daemons with the /sbin/service script, and they use a standard parameter interface, namely start, stop, or restart (e.g., sudo /sbin/service hadoop-datanode start). Init scripts also make sure that the daemon runs as the correct user, which in Hadoop’s case is the hadoop user. Lastly, init scripts are used to start daemons at boot time, allowing daemons to survive reboots.