Earlier this month the Apache Hadoop PMC released Apache Hadoop 2.0.2-alpha, which fixes over 600 issues since the previous release in the 2.0 series, 2.0.1-alpha, back in July. This is a tremendous rate of development, of which all contributors to the project should feel proud.
Some of the more noteworthy changes in this release include:
- HDFS HA supports automatic failover using ZooKeeper (HDFS-3042).
- The FUSE-DFS module now supports secure HDFS clusters (HDFS-3568).
- The (non-standard) Kerberos over SSL has been replaced with SPNEGO for image transfers and for secure HDFS web access in general (HDFS-2617).
- SASL encryption can be enabled for block data transfers in HDFS (HDFS-3637), and the MapReduce shuffle can be encrypted using HTTPS (MAPREDUCE-4417). There is also HTTPS support for the web UIs (HADOOP-8581).
- A new type of Hadoop Metric, a quantile metric, has been added to provide latency histograms for various HDFS metrics (HDFS-3650).
- The Capacity Scheduler now supports delay scheduling (YARN-80).
- There are various performance improvements, including support for fadvise in the shuffle handler (MAPREDUCE-3289) and in the datanode (HDFS-3697).
- YARN is now a subproject of Hadoop (YARN-1). The separation will make it easier for folks who want to write YARN applications that are independent of MapReduce. (See Harsh Chouraria’s “MR2 and YARN Briefly Explained” post for more on the relationship between YARN and MapReduce.)
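Several of the features above are switched on through configuration rather than code. As a rough illustration, here is a small Python sketch that renders the relevant properties in the usual Hadoop `*-site.xml` layout. The property names (`dfs.ha.automatic-failover.enabled`, `dfs.encrypt.data.transfer`, `ha.zookeeper.quorum`) are the ones I understand these features to use, but do check them against the documentation for your release. The ZooKeeper hostnames and the `render_site_xml` helper are purely hypothetical, and these few properties are not a complete HA or security setup on their own.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of property name -> value as Hadoop-style *-site.xml text.

    Hypothetical helper for illustration only; it is not part of Hadoop.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Intended for hdfs-site.xml: automatic failover (HDFS-3042) and
# encrypted block data transfers (HDFS-3637).
hdfs_site = {
    "dfs.ha.automatic-failover.enabled": "true",
    "dfs.encrypt.data.transfer": "true",
}

# Intended for core-site.xml: the ZooKeeper ensemble used for failover.
# These hostnames are placeholders, not real machines.
core_site = {
    "ha.zookeeper.quorum": "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181",
}

hdfs_site_xml = render_site_xml(hdfs_site)
core_site_xml = render_site_xml(core_site)

print(hdfs_site_xml)
print(core_site_xml)
```

The output is a single-line XML document per file; in practice you would merge these properties into your existing site files alongside the rest of the HA settings (nameservice IDs, fencing methods, and so on).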
Try It Out!
You can download the release from an Apache mirror. Alternatively, you can try CDH 4.1, since it includes most of the changes from Apache Hadoop 2.0.2-alpha. Note that MR2 in CDH 4.1 is still experimental, in line with the Apache release. MR1 in CDH 4.1, however, is stable and fully supported in production.
A Note on Release Numbering
Historically, the numbering of Apache Hadoop releases has been somewhat confusing. Things have improved since the Hadoop community voted to adopt 1.x for the current stable branch (renamed from the 0.20.x series) and 2.x for the new line of development (previously 0.23.x), which is still unstable, as mentioned above.
Some confusion lingers because there is still a 0.23 branch that is producing releases (Robert Evans is the release manager). However, this branch is a special case: it is an earlier version of the branch-2 line that Yahoo! is using to stabilize YARN for their own use, with plans to move to a 2.x release sometime next year. The Yahoo! Hadoop team are also backporting fixes from the 2.x branch to the 0.23 branch as needed, and of course all changes that go into 0.23 go into trunk and 2.x first, so all the valuable stabilization work they are doing will benefit future 2.x releases. From a feature point of view, the biggest difference between 0.23 and 2.x is that 0.23 lacks HDFS High Availability.
I would like to thank the many people from many different organizations who contributed to this release. From the smallest bug report to the largest feature, all contributions are appreciated. Also, thanks to Arun C Murthy, who acted as release manager for this release.