The Apache Hadoop PMC has voted to release Apache Hadoop 0.23.0. This release is significant because it is the first major release of Hadoop in over a year, and it incorporates many new features and improvements over the 0.20 release series. The biggest new features are HDFS federation and a new MapReduce framework. There is also a new build system (Maven), Kerberos HTTP SPNEGO support, and some significant performance improvements, which we’ll be covering in future posts. Note, however, that 0.23.0 is not a production release, so please don’t install it on your production cluster.
HDFS federation improves HDFS scalability by allowing multiple independent namenodes, each managing a portion of the namespace. Each datanode in the cluster can provide storage to all the namenodes, so a datanode does not belong to any single namenode. Note that HDFS federation is not to be confused with HDFS High Availability, which will be coming in a future 0.23 release.
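To make the federation idea concrete, here is a minimal toy model in Python. It is not Hadoop code: the class names, the example prefixes (/user, /data), and the longest-prefix routing rule are all assumptions made for illustration, since this post does not describe how the namespace is actually divided or how clients find the right namenode.

```python
# Toy model of HDFS federation. NOT Hadoop code: every name here is hypothetical.

class NameNode:
    """One independent namenode that manages a single portion of the namespace."""

    def __init__(self, name, prefix):
        self.name = name
        self.prefix = prefix      # assumed: the namespace portion is a path prefix
        self.datanodes = set()    # datanodes providing storage to this namenode


class DataNode:
    """A datanode that offers its storage to every namenode in the cluster."""

    def __init__(self, name):
        self.name = name

    def register_with(self, namenodes):
        # A datanode registers with all namenodes rather than belonging to one.
        for nn in namenodes:
            nn.datanodes.add(self.name)


def resolve(path, namenodes):
    """Return the namenode whose prefix is the longest match for the path."""
    matches = [
        nn for nn in namenodes
        if path == nn.prefix or path.startswith(nn.prefix.rstrip("/") + "/")
    ]
    if not matches:
        raise KeyError(f"no namenode manages {path}")
    return max(matches, key=lambda nn: len(nn.prefix))


namenodes = [NameNode("nn1", "/user"), NameNode("nn2", "/data")]
for dn in (DataNode("dn1"), DataNode("dn2"), DataNode("dn3")):
    dn.register_with(namenodes)

owner = resolve("/user/alice/file.txt", namenodes)
print(owner.name, sorted(owner.datanodes))
```

The point of the sketch is the shape of the relationships: the namenodes never talk to each other, while every datanode appears in every namenode's set of storage providers.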
MapReduce 2 (“next gen”) is a re-write of the MapReduce runtime to overcome scalability bottlenecks in the jobtracker. It is based on a new framework called YARN for cluster resource management, and a MapReduce “application” which runs users’ jobs on YARN. In this design MapReduce becomes a user-space library, and other parallel applications can run on Hadoop clusters alongside MapReduce applications.
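The split between resource management and the MapReduce “application” can be pictured with a small toy model in Python. This is a sketch under loud assumptions, not the YARN API: ResourceManager here is just a counter of free "containers", and MapReduceApp and EchoApp are hypothetical classes invented for this illustration.

```python
# Toy model of the YARN / MapReduce split. NOT the YARN API: all names are hypothetical.

class ResourceManager:
    """Hands out cluster resources without knowing what applications compute."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class MapReduceApp:
    """MapReduce as a user-space application that runs on borrowed containers."""

    def __init__(self, map_fn, reduce_fn):
        self.map_fn = map_fn
        self.reduce_fn = reduce_fn

    def run(self, rm, records):
        containers = rm.allocate(max(1, len(records)))
        if containers == 0:
            raise RuntimeError("no containers available")
        try:
            groups = {}
            # Process the input in waves, one record per granted container.
            for start in range(0, len(records), containers):
                for record in records[start:start + containers]:
                    for key, value in self.map_fn(record):
                        groups.setdefault(key, []).append(value)
            return {key: self.reduce_fn(key, values) for key, values in groups.items()}
        finally:
            rm.release(containers)


class EchoApp:
    """A non-MapReduce application sharing the same resource manager."""

    def run(self, rm, message):
        containers = rm.allocate(1)
        if containers == 0:
            raise RuntimeError("no containers available")
        try:
            return message.upper()
        finally:
            rm.release(containers)


rm = ResourceManager(total_containers=2)
word_count = MapReduceApp(
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
counts = word_count.run(rm, ["a b", "b c", "c"])
echoed = EchoApp().run(rm, "hello")
print(counts, echoed, rm.free)
```

Nothing in ResourceManager mentions maps or reduces, which is the design point: the MapReduce logic lives entirely in the application, so a different kind of application can use the same cluster.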
Be aware that Hadoop 0.23.0 does not come with the “classic” MapReduce runtime (MapReduce 1) which runs jobtrackers and tasktrackers. However, it does fully support both the old and new MapReduce user APIs (the old API is in the org.apache.hadoop.mapred package, the new one in org.apache.hadoop.mapreduce). In 0.23.0 the old API is deprecated and users are encouraged to move to the new API from this release onwards. Note that if you wish to use the classic runtime (or the old API) you can use a 0.20.x based release, such as the one included in CDH3.
Stability, Compatibility and Testing
It is important to stress that 0.23.0 is not ready for production use yet. It is an early release that users can start testing so that we can stabilize later 0.23 releases. We expect a later dot release to be production-ready, and that release will be incorporated into CDH4.
In terms of compatibility, in the vast majority of cases, programs written to use the public Hadoop APIs in 0.20.x should run correctly on 0.23.0, although they will need to be recompiled. You can find detailed notes on compatibility in HADOOP-7738.
The process of updating Hadoop ecosystem projects to work with 0.23.0 is still underway. One of the goals of the Apache Bigtop project (incubating) is interoperability testing of Hadoop components, and the project is tracking the status of downstream builds that use Hadoop in BIGTOP-162. If you use Hadoop we encourage you to get involved in this testing effort by trying out your workloads and applications on Hadoop 0.23.0 and reporting any issues you find to the Hadoop project.
Thanks go to everyone who contributed to the release (reporting issues, fixing bugs, reviewing changes, writing documentation, etc.), and especially to Arun C Murthy, who did a fantastic job as release manager.