Protecting per-DataNode Metadata

Administrators of HDFS clusters understand that the HDFS metadata is among the most precious data they manage. While you might have hundreds of terabytes of information stored in HDFS, the NameNode’s metadata is the key that allows this information, spread across several million “blocks”, to be reassembled into coherent, ordered files.

The techniques for preserving HDFS NameNode metadata are well established. You should store several copies across separate local hard drives, as well as on at least one remote drive mounted via NFS. (To do this, list multiple directories, on separate mount points, in your dfs.name.dir configuration variable.) You should also run the SecondaryNameNode on a separate machine, which provides a further off-machine backup of “checkpointed” HDFS state on an hourly basis.
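
As a quick illustration, here is a minimal Python sketch of that advice; the configuration file path and the comma-separated form of dfs.name.dir are assumptions based on a typical installation (older releases keep this property in hadoop-site.xml), so adjust them for yours. It reads the configured metadata directories and warns if they do not span at least two separate devices.

```python
#!/usr/bin/env python
"""Sanity-check that dfs.name.dir lists several directories on separate devices.

Assumes the configuration lives in hdfs-site.xml; adjust CONF_FILE
for your installation.
"""
import os
import xml.etree.ElementTree as ET

CONF_FILE = "/etc/hadoop/conf/hdfs-site.xml"  # assumption: typical config location

def name_dirs(conf_file):
    """Return the list of directories configured in dfs.name.dir."""
    tree = ET.parse(conf_file)
    for prop in tree.findall("property"):
        if prop.findtext("name") == "dfs.name.dir":
            return [d.strip() for d in prop.findtext("value").split(",") if d.strip()]
    return []

def main():
    dirs = name_dirs(CONF_FILE)
    if len(dirs) < 2:
        print("WARNING: dfs.name.dir lists fewer than two directories")
    # st_dev differs when directories sit on different mounted devices
    devices = set()
    for d in dirs:
        if os.path.isdir(d):
            devices.add(os.stat(d).st_dev)
        else:
            print("WARNING: %s does not exist on this host" % d)
    if len(devices) < 2:
        print("WARNING: metadata directories do not appear to span separate devices")
    else:
        print("OK: %d directories on %d devices" % (len(dirs), len(devices)))

if __name__ == "__main__":
    main()
```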

But an aspect of HDFS that is discussed less frequently is the metadata stored on individual DataNodes. Each DataNode keeps a small amount of metadata that allows it to identify the cluster it participates in. If this metadata is lost, the DataNode cannot participate in an HDFS instance and the data blocks it stores cannot be reached. The bug HADOOP-5342, “DataNodes do not start up because InconsistentFSStateException on just part of the disks in use,” describes a condition where this DataNode metadata is corrupted across all the DataNodes, rendering a cluster inaccessible.

When an HDFS instance is formatted, the NameNode generates a unique namespace id for the instance. When DataNodes first connect to the NameNode, they bind to this namespace id and establish a unique “storage id” that identifies that particular DataNode in the HDFS instance. This data, as well as information about which version of Hadoop was used to create the block files, is stored in a file named VERSION in the ${dfs.data.dir}/current directory. Under some conditions, the VERSION file can be removed or corrupted. The issue logged in Hadoop’s JIRA suggests that it occurs when some disks are mounted read-only, an upgrade is performed, and then the disks are remounted read-write, causing VERSION files to go out of sync. Unfortunately, we’ve seen this happen during routine operation as well, with no read-only disks and no upgrades.
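
For reference, VERSION is a small Java-properties-style file of key=value pairs. The following sketch parses one and prints the identifiers described above; the field names (namespaceID, storageID, layoutVersion) are the ones we have observed, but treat the exact set as dependent on your Hadoop release, and the example path as a placeholder.

```python
#!/usr/bin/env python
"""Print the identifiers stored in a DataNode VERSION file.

The file is a Java-properties-style list of key=value pairs; the
exact set of fields varies by Hadoop release.
"""
import sys

def parse_version(path):
    """Return a dict of the key=value pairs in a VERSION file."""
    fields = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):   # skip the timestamp comment
                continue
            key, _, value = line.partition("=")
            fields[key] = value
    return fields

if __name__ == "__main__":
    # e.g. python print_version.py /data/1/dfs/data/current/VERSION
    fields = parse_version(sys.argv[1])
    print("namespace id: %s" % fields.get("namespaceID"))
    print("storage id:   %s" % fields.get("storageID"))
    print("layout:       %s" % fields.get("layoutVersion"))
```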

While no programmatic fix to this problem is currently available, a workable solution is a simple piece of preventative maintenance: for each DataNode, keep a copy of ${dfs.data.dir}/current/VERSION in a separate directory, possibly off-machine. If this bug ever manifests, restore the backed-up VERSION files and restart HDFS. The bug does not affect the block files themselves. The VERSION files don’t change during normal operation, so you only need to back them up when they’re first created or after you upgrade your HDFS instance.
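
A minimal sketch of such a backup script is shown below. The data-directory list and backup destination are placeholders to adapt to your environment; point the destination at an NFS mount or another off-machine location, and run the script once after formatting and again after each upgrade.

```python
#!/usr/bin/env python
"""Back up each ${dfs.data.dir}/current/VERSION file for this DataNode.

The data-directory list and backup destination below are placeholders;
point BACKUP_ROOT at an NFS mount or rsync target to keep copies off-machine.
"""
import os
import shutil
import socket
import time

DATA_DIRS = ["/data/1/dfs/data", "/data/2/dfs/data"]   # assumption: your dfs.data.dir entries
BACKUP_ROOT = "/backup/hdfs-version-files"             # assumption: ideally off-machine

def backup_version_files():
    host = socket.gethostname()
    stamp = time.strftime("%Y%m%d-%H%M%S")
    for i, data_dir in enumerate(DATA_DIRS):
        src = os.path.join(data_dir, "current", "VERSION")
        if not os.path.exists(src):
            print("WARNING: %s is missing" % src)
            continue
        # keep one copy per host and per disk, timestamped
        dest_dir = os.path.join(BACKUP_ROOT, host, "disk%d" % i)
        if not os.path.isdir(dest_dir):
            os.makedirs(dest_dir)
        shutil.copy2(src, os.path.join(dest_dir, "VERSION-%s" % stamp))
        print("backed up %s" % src)

if __name__ == "__main__":
    backup_version_files()
```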

If you’re storing data on multiple disks per node, note that while each node’s VERSION file is unique to that node, it is identical across all disks on that node.
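
That property also makes a cheap consistency check possible: compare the VERSION files across a node’s data directories and flag any disk whose copy differs. A sketch, again with placeholder directory paths:

```python
#!/usr/bin/env python
"""Check that the VERSION file matches across all data disks on this node."""
import sys

DATA_DIRS = ["/data/1/dfs/data", "/data/2/dfs/data"]   # assumption: your dfs.data.dir entries

def main():
    contents = {}
    for d in DATA_DIRS:
        path = "%s/current/VERSION" % d
        try:
            with open(path) as f:
                contents[d] = f.read()
        except IOError as e:
            print("WARNING: could not read %s: %s" % (path, e))
    if len(set(contents.values())) > 1:
        print("MISMATCH: VERSION files differ across disks on this node")
        sys.exit(1)
    print("OK: %d VERSION files are identical" % len(contents))

if __name__ == "__main__":
    main()
```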

We’re interested in knowing what other HDFS best practices you’ve developed at your organization. Please share them in the comments section.

2 Responses
  • Carlos Valiente / May 23, 2009 / 1:29 AM

    We’re running the NameNode in a Ganeti (http://code.google.com/p/ganeti/) virtual instance on a two-node cluster using DRBD disk replication. Live migration of the virtual instance is very fast, and failing over to the backup node seems to work OK, even when things go really wrong with the master node.
