Cloudera Engineering Blog · HDFS Posts
Update (added 5/15/2013): The information below is dated; see this post for current instructions about configuring Eclipse for Hadoop contributions.
One of the perks of using Java is the availability of functional, cross-platform IDEs. I use
vim for my daily editing needs, but when it comes to navigating, debugging, and coding large Java projects, I fire up Eclipse.
Configuring a Hadoop cluster is something akin to voodoo. There are a large number of variables in hadoop-default.xml that you can override in hadoop-site.xml. Some specify file paths on your system, but others adjust levers and knobs deep inside Hadoop’s guts. Unfortuately, there’s little or no documentation on how to set them well. Is there a single optimal configuration? Are there some settings that can just be “set to 11?”
At Cloudera, we’re working hard to make Hadoop easier to use and to make configuration less painful. Our Hadoop Configuration Tool gives you a web-based guide to help set up your cluster. Once it’s running, though, you might want to look under the hood and tune things a bit.
You might think that the SecondaryNameNode is a hot backup daemon for the NameNode. You’d be wrong. The SecondaryNameNode is a poorly understood component of the HDFS architecture, but one which provides the important function of lowering NameNode restart time. This blog post describes how to configure this daemon in a large-scale environment. The default Hadoop configuration places an instance of the SecondaryNameNode on the same node as the NameNode. A more scalable configuration involves configuring the SecondaryNameNode on a different machine.
About the SecondaryNameNode
The NameNode is responsible for the reliable storage and interactive lookup and modification of the metadata for HDFS. To maintain interactive speed, the filesystem metadata is stored in the NameNode’s RAM. Storing the data reliably necessitates writing it to disk as well. To ensure that these writes do not become a speed bottleneck, instead of storing the current snapshot of the filesystem every time, a list of modifications is continually appended to a log file called the EditLog. Restarting the NameNode involves replaying the EditLog to reconstruct the final system state.
We’ve been talking to enterprise users of Hadoop about existing and new projects, and lots of them are asking questions about reliability and data integrity. So we wrote up a short paper entitled HDFS Reliability to summarize the state of the art and provide advice. We’d like to get your feedback, too, so please leave a comment.