The release of Apache Hadoop 2, as announced today by the Apache Software Foundation, is an exciting one for the entire Hadoop ecosystem.
Cloudera engineers have been working hard for many months with the rest of the vast Hadoop community to ensure that Hadoop 2 is the best it can possibly be, both for users of Cloudera’s platform and for Hadoop users generally. Hadoop 2 contains many major advances, including (but not limited to):
- High availability for the HDFS NameNode, which eliminates the previous single point of failure (SPOF) in HDFS.
- Support for filesystem snapshots in HDFS, which brings native backup and disaster recovery processes to Hadoop.
- Support for federated NameNodes, which allows for horizontal scaling of the filesystem namespace.
- Support for NFS access to HDFS, which allows HDFS to be mounted as a standard filesystem.
- Native network encryption, which secures data while in transit.
- The YARN resource management system, which provides infrastructure for the creation of new Hadoop computing paradigms beyond MapReduce. This new flexibility should expand the use cases for Hadoop and improve the efficiency of certain types of processing over data already stored in Hadoop.
- Several performance-related enhancements, including more efficient (and secure) short-circuit local reads in HDFS.
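To make a few of the features above more concrete, here is a minimal sketch of how NameNode high availability, network encryption, and short-circuit local reads might be switched on in an `hdfs-site.xml` file. The property names are standard HDFS configuration keys, but the nameservice ID (`mycluster`), the NameNode IDs (`nn1`, `nn2`), the hostnames, and the socket path are hypothetical placeholders, and the `build_hdfs_site` helper is our own illustration rather than part of Hadoop. A real HA deployment needs further settings (for example, a shared edits directory and a fencing method) that are left out here.

```python
import xml.etree.ElementTree as ET


def build_hdfs_site(properties):
    """Render a dict of property name -> value as Hadoop-style configuration XML."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical nameservice ID; the hostnames and socket path below are placeholders too.
NAMESERVICE = "mycluster"

properties = {
    # NameNode high availability: one logical nameservice backed by two NameNodes.
    "dfs.nameservices": NAMESERVICE,
    f"dfs.ha.namenodes.{NAMESERVICE}": "nn1,nn2",
    f"dfs.namenode.rpc-address.{NAMESERVICE}.nn1": "namenode1.example.com:8020",
    f"dfs.namenode.rpc-address.{NAMESERVICE}.nn2": "namenode2.example.com:8020",
    # Lets clients find whichever NameNode is currently active.
    f"dfs.client.failover.proxy.provider.{NAMESERVICE}":
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    # Encrypt block data while it is in transit.
    "dfs.encrypt.data.transfer": "true",
    # Short-circuit local reads over a UNIX domain socket.
    "dfs.client.read.shortcircuit": "true",
    "dfs.domain.socket.path": "/var/lib/hadoop-hdfs/dn_socket",
}

hdfs_site_xml = build_hdfs_site(properties)
print(hdfs_site_xml)
```

Generating the file from a dictionary is only a convenience for the example; the same properties can be written by hand or managed through whatever configuration tooling a cluster already uses.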
Furthermore, a great deal of work has gone into stabilizing and maturing Hadoop’s APIs in preparation for this release, which should give all users and projects building on top of Hadoop confidence that what they’re creating today will work for years to come.
With the continuing growth of the Hadoop development community, the myriad advances in this release in particular highlight the benefits that the entire ecosystem receives from participating in collaborative open source development. Thanks in part to the testing and packaging standardization provided through the Apache Bigtop project, customers now have the luxury of choosing roughly the same core software from no fewer than eight different vendors — and thus can focus on selecting the platform with the best applications, best data access frameworks, and best support on top of this ubiquitous core.
As for CDH, Cloudera’s distribution including Hadoop and related projects, we have already delivered several stable, high-value parts of Hadoop 2 in the current release (such as HDFS 2.0, network encryption, and performance improvements). The next release (CDH 5) will be based entirely on Hadoop 2, using YARN for resource coordination between MapReduce and other components. We look forward to continuing to work with the entire community to push Hadoop forward at a rapid clip, in the Hadoop 2 release line and beyond!
Aaron T. Myers is a Software Engineer at Cloudera and a Committer/PMC Member on the Hadoop project.