File Appends in HDFS
There is some confusion about the state of the file append operation in HDFS. It was in, now it’s out. Why was it removed, and when will it be reinstated? This post looks at some of the history behind HDFS capability for supporting file appends.
Early versions of HDFS had no support for an append operation. Once a file was closed, it was immutable and could only be changed by writing a new copy with a different filename. This style of file access actually fits very nicely with MapReduce, where you write the output of a data processing job to a set of new files; this is much more efficient than manipulating the input files that are already in place.
A file didn’t exist until it had been successfully closed (by calling
close() method). If the client failed before it closed the file, or if the
close() method failed by throwing an exception, then (to other clients at least), it was as if the file had never been written. The only way to recover the file was to rewrite it from the beginning. MapReduce worked well with this behavior, since it would simply rerun the task that had failed from the beginning.
First Steps Toward Append
It was not until the 0.15.0 release of Hadoop that open files were visible in the filesystem namespace (HADOOP-1708). Until that point, they magically appeared after they had been written and closed. At the same time, the contents of files could be read by other clients as they were being written, although only the last fully-written block was visible (see HADOOP-89). This made it possible to gauge the progress of a file that was being written, albeit in a crude manner. Additionally, tools such as
hadoop fs -tail (and its web UI equivalent) were introduced and allowed users to view the contents of a file as it was being written, block by block.
For some applications, the API offered by HDFS was not strong enough. For example, a database, such as HBase, which wishes to write its transaction log to HDFS, cannot do so in a reliable fashion. For this application, some form of sync operation is needed, which guarantees that the bytes up to a given point in the stream are persisted (like POSIX’s fsync). In the event of the process crashing, it can recover its previous state by playing through the transaction log, and then it can open the log in append mode to write new entries to it.
Similarly, writing unbounded log files to HDFS is unsatisfactory, since it is generally unacceptable to lose up to a block’s worth of log records if the client writing the log stream fails. The application should be able to choose how much data it is prepared to lose, since there is a trade-off between performance and reliability. A database transaction log would opt for reliability, while an application for logging web page accesses might tolerate a few lost records in exchange for better throughput.
HADOOP-1700 was opened in August 2007 to add an append operation to HDFS, available through the
append() methods on
FileSystem. (The issue also includes discussion about a truncate operation, which sets the end of the file to a particular position, causing data at the end of the file to be deleted. However, truncates have never been implemented.) A little under a year later, in July 2008, the append operation was committed in time for the 0.19.0 release of Hadoop Core.
Implementing the append operation necessitated substantial changes to the core HDFS code. For example, in a pre-append world, HDFS blocks are immutable. However, if you can append to a file, then the last (incomplete) block is mutated, so there needs to be some way of updating its identity so that (for example) a datanode that is down when the block is updated is recognized as having an out-of-date version of the block when it become available again. This is solved by adding a generation stamp to each block, an incrementing integer that records the version of a particular block. There are a host of other technical challenges to solve, many of which are articulated in the design document attached to HADOOP-1700.
After the work on HADOOP-1700 was committed, it was noticed in October 2008 that one of the guarantees of the append function, that readers can read data that has been flushed (via
sync() method) by the writer, was not working (HDFS-200 “In HDFS, sync() not yet guarantees data available to the new readers”). Further issues were found, which were related:
- HDFS-142 “Datanode should delete files under tmp when upgraded from 0.17″
- HADOOP-4692 “Namenode in infinite loop for replicating/deleting corrupted block”
- HDFS-145 “FSNameSystem#addStoredBlock does not handle inconsistent block length correctly”
- HDFS-168 “Block report processing should compare g[e]neration stamp”
Because of these problems, append support was disabled in the 0.19.1 release of Hadoop Core, and in the first release of the 0.20 branch 0.20.0. Configuration parameter
dfs.support.append, which is false by default, was introduced (HADOOP-5332) to make it easy to enable or disable append functionality (note that append functionality is still unstable, so this flag should be set to true only on development or test clusters).
This prompted developers to step back and take a fresh look at the problem. One of the first actions was to create an umbrella issue (HDFS-265 “Revisit append”) with a new design document attached, which aimed to build on the work done to date on appends and provide a foundation for solving the remaining implementation challenges in a coherent fashion. It provides input to the open Jira issues mentioned above—it does not seek to replace them.
A group of interested Hadoop committers had a meeting at Yahoo!’s offices on May 22, 2009 to discuss the requirements for appends. They reached agreement on the precise semantics for the sync operation and renamed it to hflush in the design document in HDFS-265 to avoid confusion with other sync operations. They agreed on API3, which guarantees that data is flushed to all datanodes holding replicas for the current block but is not flushed to the operating system buffers or the datanodes’ persistent disks.
At the time of this writing, not all of these issues have been fixed, but hopefully they will be fixed in time for a 0.21 release.
Getting appends supported has been a stormy ride. It’s not over yet, but when it is finished, it will enable a new class of applications to be built upon HDFS.
When appends are done, what will be next? Record appends from multiple concurrent writers (like Google’s GFS)? Truncation? File writes at arbitrary offsets? Bear in mind, however, that every new feature adds complexity and therefore may compromise reliability or performance, so each must be very carefully considered before being added.