Category Archives: HDFS

File Appends in HDFS

Categories: General Hadoop HDFS

There is some confusion about the state of the file append operation in HDFS. It was in, now it’s out. Why was it removed, and when will it be reinstated? This post looks at some of the history behind HDFS’s support for file appends.


Early versions of HDFS had no support for an append operation. Once a file was closed, it was immutable and could only be changed by writing a new copy with a different filename.

Read more

Protecting per-DataNode Metadata

Categories: Hadoop HDFS

Administrators of HDFS clusters understand that the HDFS metadata is some of the most precious bits they have. While you might have hundreds of terabytes of information stored in HDFS, the NameNode’s metadata is the key that allows this information, spread across several million “blocks,” to be reassembled into coherent, ordered files.

The techniques to preserve HDFS NameNode metadata are well established. You should store several copies across many separate local hard drives,

Read more

High Energy Hadoop

Categories: General Guest Hadoop HDFS

We asked Brian Bockelman, a Post Doc Research Associate in the Computer Science & Engineering Department at the University of Nebraska–Lincoln, to tell us how Hadoop is being used to process the results from High-Energy Physics experiments. His response gives insight into the kind and volume of data that High-Energy Physics experiments generate and how Hadoop is being used at the University of Nebraska. -Matt

In the least technical language,

Read more

Configuring Eclipse for Apache Hadoop Development (a screencast)

Categories: Data Ingestion General HDFS Training

Update (added 5/15/2013): The information below is dated; see this post for current instructions about configuring Eclipse for Hadoop contributions.

One of the perks of using Java is the availability of functional, cross-platform IDEs. I use vim for my daily editing needs, but when it comes to navigating, debugging, and coding large Java projects, I fire up Eclipse.

Typically, when you’re developing MapReduce applications,

Read more

Configuration Parameters: What can you just ignore?

Categories: General Hadoop HDFS MapReduce

Configuring a Hadoop cluster is something akin to voodoo. There are a large number of variables in hadoop-default.xml that you can override in hadoop-site.xml. Some specify file paths on your system, but others adjust levers and knobs deep inside Hadoop’s guts. Unfortunately, there’s little or no documentation on how to set them well. Is there a single optimal configuration? Are there some settings that can just be “set to 11?”

Nigel's guitar goes to 11, but your cluster might not. At Cloudera,

Read more