The first release (0.19.0) from the 0.19 branch of Apache Hadoop Core was made on November 24. Many changes go into a release like this, and it can be difficult to get a feel for the more significant ones, even with the detailed Jira log, change log, and release notes. (There’s also JDiff documentation, which is a great way to see how the public API changed, via a JavaDoc-like interface.) This post gives a high-level feel for what’s new.
Hadoop Core now requires Java 6 to run (HADOOP-2325). Most installations are using Java 6 already due to the performance boost it gives over Java 5, but those that aren’t will need to upgrade.
The most talked about change to HDFS, at least while it was being worked on, was HDFS appends (HADOOP-1700). People quoted the Jira issue number without risk of being misunderstood (“Is 1700 done yet?”). Well, it’s in 0.19.0, so you can now re-open HDFS files and add content to the end of the file. The biggest use so far of this feature is for making HBase’s redo log reliable.
Administrators can now set space quotas on directory trees in HDFS (HADOOP-3938). Previously, you could only set a limit of the number of files and directories created in a certain directory (this was useful to stop folks eating up namenode memory). Full details in the guide.
Something that folks have been asking for is access times on files (HADOOP-1869), as it helps administrators see which files haven’t been used for a while and therefore may be eligible for archival or deletion. Access times are hard to implement efficiently in a distributed filesystem, so you can specify their precision as a trade-off. By default they have hourly granularity, but you can change this via the
dfs.access.time.precision property; setting it to 0 to turn off access times entirely.
Filesystem checksums (HADOOP-3941). While HDFS uses checksums internally to ensure data integrity, it’s not possible to retrieve the checksums to use them for file content comparison, for example. This has changed with the addition of a new
getFileChecksum(Path f) method on
FileSystem, which returns a
FileChecksum object. Most
FileSystem implementations will return a null
FileChecksum (meaning no checksum algorithm is implemented for the filesystem, or the filesystem doesn’t expose checksums), but HDFS returns a non-null
FileChecksum that can be used to check if two files are the same. DistCp uses this API to see if two files are the same when performing an update.
The addition of a Thrift interface (HADOOP-3754) to Hadoop’s
FileSystem interface makes it easy to use HDFS (or indeed any Hadoop filesystem) from languages other than Java. The API comes as a part of the thriftfs contrib module, and includes pre-compiled stubs for a number of languages including Perl, PHP, Python and Ruby, so you can get up and running quickly without having to compile Thrift.
There are new library classes that support multiple inputs and multiple outputs for MapReduce jobs. Multiple inputs (HADOOP-372) allow application writers to specify input formats and mappers on a per path basis. For example, if you change your file format, it’s more elegant to have two mappers, one for the old file format, and one for the new, rather than putting all the logic for interpreting both formats in one mapper class. Or if you want to consume both text and binary files in the same job you can do so by using multiple input formats for your job. See the
MultipleInputs class for details.
Multiple outputs (HADOOP-3149) allows you to create several outputs for your jobs. The format of each output, and which records go in which output is under the application’s control, and it’s even possible to have different types for each output. This feature is provided by the
MultipleOutputs class. There is another, related, library for writing multiple outputs that was introduced in an earlier release (
MultipleOutputFormat). It is perhaps unfortunate that there are two libraries, but they have different strengths and weaknesses: for example,
MultipleOutputFormat has more flexibility over the names of the output files, whereas
MultipleOutputs has more flexibility over the output types.
Chaining maps in a single job (HADOOP-3702). Most data processing problems in the real world that use MapReduce consist of a chain of MapReduce jobs, so this change may be helpful, as it collapses a chain of maps into a single task. The general case of (MR)+ is not handled, rather M+RM*, where the chain of maps before the reduce are executed as a single map task, and any maps after the reduce are executed in the reduce task. See
ChainReducer for how to use this feature.
Skipping bad records (HADOOP-153). When processing terabytes of data it is common for applications to encounter bad records that they don’t know how to handle. If the failure occurs in a thrid party library (e.g. a native library), then there is often little that the MapReduce application writer can do. This feature allows the application to skip records that cause the program to fail, so that it can complete the job successfully, with only a small (controllable) number of discarded records. See the MapReduce user guide for documentation.
Total order partitions (HADOOP-3019). As a spin-off from the TeraSort record, Hadoop now has library classes for efficiently producing a globally sorted output.
InputSampler is used to sample a subset of the input data, and then
TotalOrderPartitioner is used to partition the map outputs into approximately equal-sized partitions. Very neat stuff — well worth a look, even if you don’t need to use it.
Streaming options (HADOOP-3722). Prior to 0.19 the way you passed options to jobs was different for Java MapReduce, Streaming and Pipes. For example, to pass a configuration parameter to a Streaming job, you would use say
-jobconf name=value, whereas in Java you would use
-D name=value. The options have been unified to use the Java convention, as implemented by
GenericOptionsParser. The old options have been deprecated and will be removed in a future release, so you should update your programs when you upgrade to a 0.19 release.
Job scheduling has long been a pain point for many users of Hadoop (the FIFO scheduler is particularly unfair in a multi-user environment), so it was good to see not one, but two, new schedulers being introduced: the fair share scheduler (HADOOP-3746) and the capacity scheduler (HADOOP-3445). Both are possible due to the new pluggable scheduling API (HADOOP-3412), which allows third-party schedulers to be used in Hadoop. The schedulers are not the last word in scheduling, so expect to see further improvements in future releases.
There have been several improvements in giving administrators and programmers more control over MapReduce resource usage. Task JVM re-use (HADOOP-249) allows the tasks from the same job to run (sequentially) in the same JVM, which reduces JVM startup overhead. (The default is not to share JVM instances, see the user guide.) Also, more work has been done to provide controls for managing memory-intensive jobs (HADOOP-3759, HADOOP-3581). For example, you can kill tasks whose child processes consume more than a given amount of memory.
Hive (HADOOP-3601), the distributed data warehouse system, was in Hadoop Core contrib in 0.19, although it has since been moved to be its own Hadoop subproject. Hive provides a SQL-like interface to data stored in HDFS. Development is very active, so if you are interested in using Hive, you should consider using a more recent version.
Chukwa (HADOOP-3719) and FailMon (HADOOP-3585) are both (independent) attempts to use Hadoop to monitor and analyze large distributed systems. They provide components to gather information about the system, and feed it into HDFS for later analysis by a series of pre-defined MapReduce jobs.