Cloudera Developer Blog · HDFS Posts
Cloudera is happy to announce the availability of the first update to version 2 of our distribution for Hadoop. While major new features are planned for our release of version 3 we will regularly update version 2 with improvements and bug fixes. Check out the change log and release notes for details. You can find the packages and tarballs on our website, or simply update if you are already using our yum and apt repositories.
A notable addition in update 1 is a FUSE package for HDFS. This package allows you to easily mount HDFS as a standard file system for use with traditional Unix utilities. Check out the Mountable HDFS section in the CDH docs and the hadoop-fuse-dfs manpage for details.
While the vast majority of the Hadoop development discussion takes place on the Apache Jira and various project mailing lists, it’s often useful to meet face to face for high bandwidth discussion. To that end, Facebook hosted the first Apache Hadoop contributors meeting yesterday at their campus in Palo Alto. Cloudera, Facebook, Yahoo! and the Apache HBase team were well-represented. It was great to see a broad cross section of Hadoop developers in one room. Contributor meetings will be held on a monthly basis, at a rotating location. While any Hadoop project contributor is welcome to attend, the current focus of the meetings is HDFS and MapReduce. The goal of the discussion is to surface and flesh out ideas rather than make decisions, which happens on the development lists. If you’ve got ideas to add check out the meeting notes and continue the discussion.
Sanjay Radia kicked off the meeting with a discussion of development priorities. Hadoop has become a platform and industry standard for data storage and analytics. What advances are most important to users? How do we continue to innovate without disrupting the installed base? Development must maintain and improve the quality that has allowed companies to adopt Hadoop in their production environments. Fortunately there is broad agreement among contributors on development priorities: availability, compatibility, security, scalability and performance.
The next topic was releases. Tom White gave an update on the upcoming 0.21 release. The release has branched and is now in feature freeze; the focus is now shifting to closing out issues that block the release. You can hear more about the 0.21 release at the next Bay Area Hadoop user group. Chris Douglas recently proposed Hadoop adopt release managers. Owen O’Malley led a discussion on the role and responsibilities of the release manager, and the Hadoop release process.
At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.
We plan on pushing new packages into the testing repository every 3 to 6 weeks. And it just so happens it is just about 3 weeks after we announced the first testing release. So it must be time for a new one. Here are some of the highlights:
Disclaimer: Cloudera no longer approves of the recommendations in this post. Please see this documentation for configuration recommendations.
One of the things we get a lot of questions about is how to make Hadoop highly available. There is still a lot of work to be done on this front, but we wanted to take a moment and share the best practices from one of our customers. Check out what Paul George has to say about how they keep thier NameNode up at ContextWeb. – Christophe
Here at ContextWeb, our Apache Hadoop infrastructure has become a critical part of our day-to-day business operations. As such, it was important for us to find a way to resolve the single-point-of-failure issue that surrounds the master node processes, namely the NameNode and JobTracker. While it was easy for us to follow the best practice of offloading the secondary NameNode data to an NFS mount to protect metadata, ensuring that the processes were constantly available for job execution and data retrieval were of greater importance. We’ve leveraged some existing, well tested components that are available and commonly used in Linux systems today. Our solution primarily makes use of DRBD from LINBIT and Heartbeat from the Linux-HA project. The natural combination of these two projects provides us with a reliable and highly available solution, which addresses limitations that currently exist.
Last Wednesday, we hosted a Hadoop meetup, and I gave a short talk about the new project split. How does the split change the project’s organization, and what does it mean for end users?
The mailing lists and the source code repositories have been rearranged. For those doing development against Hadoop’s “trunk” branch, compiling Hadoop and using the various components in concert has become more complicated.
My presentation slides cover which mailing lists to subscribe to, where the source repositories are located, and how to compile and run the development version of Hadoop.
There is some confusion about the state of the file append operation in HDFS. It was in, now it’s out. Why was it removed, and when will it be reinstated? This post looks at some of the history behind HDFS capability for supporting file appends.
Early versions of HDFS had no support for an append operation. Once a file was closed, it was immutable and could only be changed by writing a new copy with a different filename. This style of file access actually fits very nicely with MapReduce, where you write the output of a data processing job to a set of new files; this is much more efficient than manipulating the input files that are already in place.
A file didn’t exist until it had been successfully closed (by calling
close() method). If the client failed before it closed the file, or if the
close() method failed by throwing an exception, then (to other clients at least), it was as if the file had never been written. The only way to recover the file was to rewrite it from the beginning. MapReduce worked well with this behavior, since it would simply rerun the task that had failed from the beginning.
First Steps Toward Append
Administrators of HDFS clusters understand that the HDFS metadata is some of the most precious bits they have. While you might have hundreds of terabytes of information stored in HDFS, the NameNode’s metadata is the key that allows this information, spread across several million “blocks” to be reassembled into coherent, ordered files.
The techniques to preserve HDFS NameNode metadata are well established. You should store several copies across many separate local hard drives, as well as at least one remote hard drive mounted via NFS. (To do this, list multiple directories, on separate mount points, in your dfs.name.dir configuration variable.) You should also run the SecondaryNameNode on a separate machine, which will result in further off-machine backups of “checkpointed” HDFS state made on an hourly basis.
But an aspect of HDFS that is talked about less frequently is the metadata stored on individual DataNodes. Each DataNode keeps a small amount of metadata allowing it to identify the cluster it participates in. If this metadata is lost, then the DataNode cannot participate in an HDFS instance and the data blocks it stores cannot be reached. The bug HADOOP-5342, “DataNodes do not start up because InconsistentFSStateException on just part of the disks in use” describes a condition where the DataNode metadata is corrupted across all the DataNodes, causing a cluster to be inaccessible.
Update (added 5/15/2013): The information below is dated; see this post for current instructions about configuring Eclipse for Hadoop contributions.
One of the perks of using Java is the availability of functional, cross-platform IDEs. I use
vim for my daily editing needs, but when it comes to navigating, debugging, and coding large Java projects, I fire up Eclipse.
Typically, when you’re developing Map-Reduce applications, you simply point Eclipse at the Apache Hadoop
jar file, and you’re good to go. (Cloudera’s Hadoop training VM has a fully-configured example.) However, when you want to dig deeper to explore—and modify—Hadoop’s internals themselves, you’ll want to configure Eclipse to build Hadoop. Because there’s generated code and a complicated
build.xml file, this takes some tinkering. Now that I have the full Hadoop Eclipse experience going (it took me a few tries), I’ve prepared a screencast that will help guide you through it, from downloading Eclipse to debugging one of its unit tests. You’ll also want to reference the EclipseEnvironment Hadoop wiki page, which has more details.