What’s New in Hadoop Core 0.20
Hadoop Core version 0.20.0 was released on April 22. In this post I will run through some of the larger or more significant user-facing changes since the first 0.19 release—there were 262 Jiras fixed for this release (fewer than the 303 in 0.19.0). The full list, which includes many bug fixes, can be found in the change log (or in Jira), and in the release notes. The JDiff documentation provides a view of what changed in the public APIs.
The two main components in Hadoop Core are the distributed filesystem (HDFS) and MapReduce. There are plans, however, to move these components into their own Hadoop subprojects, so they can have their own release cycles, and to make it easier to manage their development (with separate mailing lists). Hadoop Core would remain, with the vestiges of other filesystems (local, S3, KFS), and other core infrastructure for I/O, such as RPC libraries, serialization, compression, MapFiles and SequenceFiles.
Splitting Core, HDFS, and MapReduce did not happen for the 0.20.0 release, so you will find HDFS and MapReduce in a single download as usual, but there are some changes in preparation for this split. Notably, hadoop-site.xml has been broken into three: core-site.xml, hdfs-site.xml, and mapred-site.xml (HADOOP-4631). You can still use a single hadoop-site.xml, albeit with a warning. The default files have also been split, although they no longer are bundled in the conf directory; instead you can view them in the form of HTML files in the docs directory of the distribution (or online: core-default.html, hdfs-default.html, mapred-default.html). Also, the control scripts start-all.sh and stop-all.sh have been deprecated, in favor of controlling the two sets of daemons separately (using start-dfs.sh, start-mapred.sh, stop-dfs.sh, stop-mapred.sh).
It’s a minor change, but the ability to comment out entries in the slaves file with the # character (HADOOP-4454) is a very useful one in practice.
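For example (the hostnames here are made up), a slaves file can now look like this:

```
worker001.example.com
worker002.example.com
# worker003.example.com is out for repair, so it is commented out rather than deleted
```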
Hadoop configuration files now support XInclude elements for including portions of another configuration file (HADOOP-4944). This mechanism allows you to make configuration files more modular and reusable.
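A sketch of how this looks, assuming a shared fragment called common-site.xml (the file name is hypothetical; the xi:include element and its namespace come from the W3C XInclude standard):

```xml
<?xml version="1.0"?>
<configuration xmlns:xi="http://www.w3.org/2001/XInclude">
  <!-- Pull in properties shared by several clusters (hypothetical file) -->
  <xi:include href="common-site.xml"/>
  <!-- Cluster-specific settings follow as usual -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020/</value>
  </property>
</configuration>
```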
There is a lot of work going on around security in Hadoop. Most of the changes will appear in future releases, but one that made it into 0.20.0 is service-level authorization (HADOOP-4348). The idea is that you can set up ACLs (access control lists) to enforce which clients can communicate with which Hadoop daemons. (Note that this mechanism is advisory: it is not secure since the user and group identities are not authenticated. This will be remedied in a future release.) For example, you might restrict the set of users or groups who can submit MapReduce jobs. See the documentation for further details.
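To turn the feature on you set hadoop.security.authorization to true in core-site.xml, then list per-protocol ACLs in hadoop-policy.xml. A minimal sketch, assuming the property names introduced by HADOOP-4348 (the ACL value format is a comma-separated user list, a space, then a comma-separated group list; check the service-level authorization documentation for the exact syntax):

```xml
<!-- core-site.xml: enable service-level authorization -->
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>

<!-- hadoop-policy.xml: allow alice, bob, and the mrusers group to submit jobs -->
<property>
  <name>security.job.submission.protocol.acl</name>
  <value>alice,bob mrusers</value>
</property>
```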
One thing that’s no longer in Hadoop Core is the LZO compression library, which had to be removed due to license incompatibilities (HADOOP-4874). If you can use GPL’d code then you can still find the LZO compression code at the hadoop-gpl-compression project on Google Code. You may also be interested in LzoTextInputFormat (HADOOP-4640), which uses a pre-generated index of LZO-compressed input files to split the compressed files for MapReduce processing. Note that LzoTextInputFormat is not in 0.20.0 either, so you need to build it yourself.
HDFS append has been disabled in the 0.20.0 release due to stability issues (HADOOP-5332). It has also been disabled in 0.19.1, which means that there is currently no stable Hadoop release with a working HDFS append operation. The configuration property dfs.support.append can be set to true if you wish to try out the append feature, but it should not be used in production. You can track the progress of the work to get append working in HADOOP-5744.
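If you do want to experiment with it on a test cluster, the property goes in hdfs-site.xml:

```xml
<property>
  <name>dfs.support.append</name>
  <value>true</value>
  <!-- Experimental: do not enable on a production cluster -->
</property>
```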
There is a new command for cluster administrators to save the HDFS namespace (HADOOP-4826). By running hadoop dfsadmin -saveNamespace (when in safe mode), the namenode will dump its namespace to disk. This is useful when doing a planned namenode restart, since the edits log doesn’t need to be replayed when the namenode comes back up.
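A typical planned-restart sequence looks something like this (run as the HDFS superuser; this is a sketch, so consult the dfsadmin documentation before relying on it):

```shell
hadoop dfsadmin -safemode enter   # block changes to the namespace
hadoop dfsadmin -saveNamespace    # dump the namespace image to disk
# ...restart the namenode, then...
hadoop dfsadmin -safemode leave   # resume normal operation
```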
The biggest change in this release is the introduction of a new Java API for MapReduce, dubbed “Context Objects” (HADOOP-1230). The Streaming and Pipes APIs are unchanged, and, in fact, the Pipes API already looks like the new Java one (some of the ideas were tried out there first).
The change was motivated by a desire to make the API easier to evolve in the future, by making Mapper and Reducer abstract classes (not interfaces) and by introducing a Context object (hence the name) as a central place to get information from the MapReduce framework. Context objects also replace OutputCollectors for emitting values from the map() and reduce() functions. There are also the following changes:
- JobConf no longer exists. Use Configuration to hold job configuration.
- It’s now easier to get the job configuration in the map() or reduce() method: just call getConfiguration() on the Context object you have been passed.
- The new API supports a “pull” style of iteration. Previously, if you wanted to process a batch of records in one go (in the mapper), you would have to store them in instance variables in the Mapper class. With the new API you have the option of calling nextKeyValue() from within the map() method to advance to the next record.
- You can also control how the mapper is run by overriding the run() method.
- There are no IdentityMapper or IdentityReducer classes in the new API, since Mapper and Reducer perform the identity function by default.
The new API is not backward compatible with the old one (which is still present but deprecated; there is no published date for when it will be removed), so you will need to rewrite your applications to use it. Some of the examples that come with Hadoop have been migrated, so you can look at these (for example WordCount) for hints on how to change your code (basically a few method signature changes). Be aware that some MapReduce libraries (such as MultipleOutputs) have not been moved to the new API yet, so you should check before starting a migration. The new API is found in the org.apache.hadoop.mapreduce package and subpackages; the old API is in org.apache.hadoop.mapred.
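To give a flavor of the new API, here is a minimal sketch of a word-count mapper written against it. This is illustrative only: it needs the Hadoop 0.20 jars to compile, and it mirrors the bundled WordCount example rather than reproducing it exactly.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In the new API, Mapper is an abstract class rather than an interface,
// and all interaction with the framework goes through the Context object.
public class TokenCounterMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);  // replaces OutputCollector.collect()
    }
  }
}
```

Note how the old OutputCollector and Reporter parameters have collapsed into the single Context argument, which is also where you would call getConfiguration() to read job settings.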
Multiple task assignment (HADOOP-3136). This optimization allows the jobtracker to assign multiple tasks to a tasktracker on each heartbeat, which can help improve utilization. A configuration parameter, mapred.reduce.slowstart.completed.maps, was introduced at the same time (default 0.05); it specifies the proportion of map tasks in a job that need to be complete before any reduce tasks are scheduled.
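For example, to hold the reduces back until 80% of a job’s maps have finished, you could override the default in mapred-site.xml:

```xml
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
  <!-- Schedule reduces only after 80% of the job's map tasks are complete -->
</property>
```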
Input formats have seen some interesting improvements. With HADOOP-3293, FileInputFormat (the most widely used input format) does a better job of choosing the hosts that have the maximum amount of data for a given file split. You don’t need to make any changes to your code for this to work. In a separate development, CombineFileInputFormat, introduced in 0.20 (HADOOP-4565), is a new type of file input format that can combine multiple files into a single split, while taking data locality into account. This input format can be a useful remedy for the small-files problem. It may also be useful for consuming multiple HDFS blocks in one mapper, if you find your mappers are completing too quickly, without having to change the block size of your files.
Gridmix2 (HADOOP-3770) is the second generation of the suite of benchmarks to model typical cluster MapReduce workloads seen in practice. Find out more about it in the src/benchmarks directory.
It was always faintly embarrassing that the firepower of a Hadoop cluster has difficulty calculating pi to more than a few decimal places. The new version of the PiEstimator example (HADOOP-4437) uses a Halton sequence to produce a better distribution of random input, and as a result produces a more accurate estimate of pi.
Two new contrib modules appeared in the 0.20 branch.
HDFS Proxy is a new contrib module for running a proxy that exposes the (read-only) HSFTP interface to HDFS. This may be useful if you want to provide secure read-only access for users who don’t have direct access to a cluster.
Vaidya is a tool for diagnosing problems with MapReduce jobs by looking at the job history and configuration after a job has run. It has a set of rules that detect common problems and suggest improvements you can make to your code or configuration to avoid them.