Last week the Apache Hadoop PMC voted to release Apache Hadoop 2.0.3-alpha, the latest in the Hadoop 2 release series. This release fixes over 500 issues (covering the Common, HDFS, MapReduce and YARN sub-projects) since the 2.0.2-alpha release in October last year. In addition to bug fixes and general improvements the more noteworthy changes include:
- HDFS High Availability (HA) can now use a Quorum Journal Manager (QJM) for sharing namenode edit logs (HDFS-3077). QJM runs a quorum of journal nodes (typically three), and is designed to replace the reliance on shared storage such as an NFS filer for the edit logs. There is a QJM guide available.
- There is a new C client for HDFS that uses the WebHDFS REST interface rather than libhdfs, which uses JNI (HDFS-2656).
- The shuffle and sort in MapReduce are now pluggable, allowing third parties to try out their own implementations for improved performance (MAPREDUCE-2454). The documentation has more details on this feature.
- There is a new option (
mapreduce.job.classloader) to run MapReduce tasks in a custom classloader to isolate user classes from system classes, like in Java Application Servers (MAPREDUCE-1700).
- The YARN Resource Manager (RM) supports application recovery across restarts (YARN-128). Recovery is off by default, but by setting
trueany YARN apps (including MR jobs) that were running on the cluster will be recovered after the RM is restarted. This is the first step in providing HA for the RM (YARN-149).
- YARN’s Capacity Scheduler now supports CPU in resource requests, in addition to memory (YARN-2). This means that applications may ask for containers with a desired number of “virtual cores”. The allowable range of virtual cores for a container is determined by
yarn.scheduler.maximum-allocation-vcores, the capacity of a Node Manager is set by
yarn.nodemanager.resource.cpu-cores, and a virtual core is defined by
yarn.nodemanager.vcores-pcores-ratio. MAPREDUCE-4520 covers the changes to MapReduce to allow the number of cores to be specified for jobs.
- YARN can optionally use Linux Control Groups (cgroups) to enforce resource limits for containers (YARN-3). See
yarn.nodemanager.resource-enforcer.classin yarn-default.xml for how to enable this feature.
- The YARN Fair Scheduler now supports all the features of the Fair Scheduler in MR1, plus some new features like support for hierarchical queues. Its UI has been improved to support the new UI framework in YARN (YARN-145).
- AMRMClient was added as a convenience class for YARN application writers to manage Application Master-Resource Manager communication. (YARN-103 and YARN-277.)
Getting to Hadoop 2 GA
The next release, Apache Hadoop 2.0.4-beta, is planned for in a couple of months, with a Hadoop 2.0 GA release following sometime in the middle of the year. The remaining work before the GA release is focused on stabilizing YARN so that it can run production MapReduce workloads at scale, and ensuring its APIs are stable. Yahoo! recently reported good progress in this area: they have run over 14 million jobs on YARN, on tens of thousands of nodes. (They run the 0.23.x code line, which contains all the YARN stabilization fixes that are in 2.0.3-alpha.)
Try It Out!
You can download the release from an Apache mirror. The forthcoming CDH 4.2.0 release will include most of the changes from Apache Hadoop 2.0.3-alpha. (Note that QJM has been available since CDH 4.1.0.) Like the Apache release, MR2 (running on YARN) is still experimental in CDH 4.2.0, but HDFS and MR1 are stable and fully supported in production since CDH 4.0.
Thanks to everyone who contributed to this release—every contribution is appreciated. Also, thanks go to Arun C Murthy who acted as release manager.