Cloudera Blog · HDFS Posts
At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.
We plan on pushing new packages into the testing repository every 3 to 6 weeks. And it just so happens it is just about 3 weeks after we announced the first testing release. So it must be time for a new one. Here are some of the highlights:
Disclaimer: Cloudera no longer approves of the recommendations in this post. Please see this documentation for configuration recommendations.
One of the things we get a lot of questions about is how to make Hadoop highly available. There is still a lot of work to be done on this front, but we wanted to take a moment and share the best practices from one of our customers. Check out what Paul George has to say about how they keep thier NameNode up at ContextWeb. – Christophe
Here at ContextWeb, our Apache Hadoop infrastructure has become a critical part of our day-to-day business operations. As such, it was important for us to find a way to resolve the single-point-of-failure issue that surrounds the master node processes, namely the NameNode and JobTracker. While it was easy for us to follow the best practice of offloading the secondary NameNode data to an NFS mount to protect metadata, ensuring that the processes were constantly available for job execution and data retrieval were of greater importance. We’ve leveraged some existing, well tested components that are available and commonly used in Linux systems today. Our solution primarily makes use of DRBD from LINBIT and Heartbeat from the Linux-HA project. The natural combination of these two projects provides us with a reliable and highly available solution, which addresses limitations that currently exist.
Last Wednesday, we hosted a Hadoop meetup, and I gave a short talk about the new project split. How does the split change the project’s organization, and what does it mean for end users?
The mailing lists and the source code repositories have been rearranged. For those doing development against Hadoop’s “trunk” branch, compiling Hadoop and using the various components in concert has become more complicated.
My presentation slides cover which mailing lists to subscribe to, where the source repositories are located, and how to compile and run the development version of Hadoop.
There is some confusion about the state of the file append operation in HDFS. It was in, now it’s out. Why was it removed, and when will it be reinstated? This post looks at some of the history behind HDFS capability for supporting file appends.
Early versions of HDFS had no support for an append operation. Once a file was closed, it was immutable and could only be changed by writing a new copy with a different filename. This style of file access actually fits very nicely with MapReduce, where you write the output of a data processing job to a set of new files; this is much more efficient than manipulating the input files that are already in place.
A file didn’t exist until it had been successfully closed (by calling
close() method). If the client failed before it closed the file, or if the
close() method failed by throwing an exception, then (to other clients at least), it was as if the file had never been written. The only way to recover the file was to rewrite it from the beginning. MapReduce worked well with this behavior, since it would simply rerun the task that had failed from the beginning.
First Steps Toward Append
Administrators of HDFS clusters understand that the HDFS metadata is some of the most precious bits they have. While you might have hundreds of terabytes of information stored in HDFS, the NameNode’s metadata is the key that allows this information, spread across several million “blocks” to be reassembled into coherent, ordered files.
The techniques to preserve HDFS NameNode metadata are well established. You should store several copies across many separate local hard drives, as well as at least one remote hard drive mounted via NFS. (To do this, list multiple directories, on separate mount points, in your dfs.name.dir configuration variable.) You should also run the SecondaryNameNode on a separate machine, which will result in further off-machine backups of “checkpointed” HDFS state made on an hourly basis.
But an aspect of HDFS that is talked about less frequently is the metadata stored on individual DataNodes. Each DataNode keeps a small amount of metadata allowing it to identify the cluster it participates in. If this metadata is lost, then the DataNode cannot participate in an HDFS instance and the data blocks it stores cannot be reached. The bug HADOOP-5342, “DataNodes do not start up because InconsistentFSStateException on just part of the disks in use” describes a condition where the DataNode metadata is corrupted across all the DataNodes, causing a cluster to be inaccessible.
Update (added 5/15/2013): The information below is a bit dated; see this post for current instructions about configuring Eclipse for Hadoop contributions.
One of the perks of using Java is the availability of functional, cross-platform IDEs. I use
vim for my daily editing needs, but when it comes to navigating, debugging, and coding large Java projects, I fire up Eclipse.
Typically, when you’re developing Map-Reduce applications, you simply point Eclipse at the Apache Hadoop
jar file, and you’re good to go. (Cloudera’s Hadoop training VM has a fully-configured example.) However, when you want to dig deeper to explore—and modify—Hadoop’s internals themselves, you’ll want to configure Eclipse to build Hadoop. Because there’s generated code and a complicated
build.xml file, this takes some tinkering. Now that I have the full Hadoop Eclipse experience going (it took me a few tries), I’ve prepared a screencast that will help guide you through it, from downloading Eclipse to debugging one of its unit tests. You’ll also want to reference the EclipseEnvironment Hadoop wiki page, which has more details.
Configuring a Hadoop cluster is something akin to voodoo. There are a large number of variables in hadoop-default.xml that you can override in hadoop-site.xml. Some specify file paths on your system, but others adjust levers and knobs deep inside Hadoop’s guts. Unfortuately, there’s little or no documentation on how to set them well. Is there a single optimal configuration? Are there some settings that can just be “set to 11?”
At Cloudera, we’re working hard to make Hadoop easier to use and to make configuration less painful. Our Hadoop Configuration Tool gives you a web-based guide to help set up your cluster. Once it’s running, though, you might want to look under the hood and tune things a bit.
The rest of this post discusses why it’s a bad idea to just set all the limits as high as they’ll go, and gives you some pointers to get started on finding a happy medium.
Why can’t you just set all the limits to 1,000,000?
You might think that the SecondaryNameNode is a hot backup daemon for the NameNode. You’d be wrong. The SecondaryNameNode is a poorly understood component of the HDFS architecture, but one which provides the important function of lowering NameNode restart time. This blog post describes how to configure this daemon in a large-scale environment. The default Hadoop configuration places an instance of the SecondaryNameNode on the same node as the NameNode. A more scalable configuration involves configuring the SecondaryNameNode on a different machine.
About the SecondaryNameNode
The NameNode is responsible for the reliable storage and interactive lookup and modification of the metadata for HDFS. To maintain interactive speed, the filesystem metadata is stored in the NameNode’s RAM. Storing the data reliably necessitates writing it to disk as well. To ensure that these writes do not become a speed bottleneck, instead of storing the current snapshot of the filesystem every time, a list of modifications is continually appended to a log file called the EditLog. Restarting the NameNode involves replaying the EditLog to reconstruct the final system state.
The SecondaryNameNode periodically compacts the EditLog into a “checkpoint;” the EditLog is then cleared. A restart of the NameNode then involves loading the most recent checkpoint and a shorter EditLog containing only events since the checkpoint. Without this compaction process, restarting the NameNode can take a very long time. Compaction ensures that restarts do not incur unnecessary downtime.