Today, Cloudera is excited to announce the third beta release of Cloudera’s Distribution for Hadoop version 3, CDH3b3. In this post, I’ll cover the major changes since CDH3b2 and give some insight into what’s coming down the pipe in the next couple of months.
Many of you may have read the recent announcements of partnerships between Cloudera and leading data management software companies such as Teradata, Netezza, Greenplum (EMC), Quest and Aster Data. We established these partnerships because Hadoop is increasingly serving as an open platform that many different applications and complementary technologies work with. Our goal is to make this as easy and as standardized as possible. As part of this beta update, we’ve added several platform enhancements that make it easier for complementary technologies to run on or work with the Hadoop platform.
Sqoop was enhanced with a plugin framework that allows swappable adapters, each optimized for integration with a particular database technology. The first of these adapters will be available in beta form as early as this week, with many more to come in the next months. In addition, Sqoop can now cover a broader range of integration use cases thanks to support for incremental database updates to and from Hadoop.
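To make the idea of an incremental update concrete, here is a minimal sketch of the general pattern: remember the highest value seen in a monotonically increasing column, and on the next run fetch only the rows above it. This is not Sqoop's implementation. The function `incremental_fetch`, the `orders` table and the use of an in-memory SQLite database are all assumptions made purely for illustration.

```python
import sqlite3


def incremental_fetch(conn, table, check_column, last_value):
    """Return rows whose check_column exceeds last_value, plus the new high-water mark.

    Hypothetical helper for illustration only. Table and column names are
    interpolated into the SQL text, so pass trusted names only.
    """
    query = f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}"
    cursor = conn.execute(query, (last_value,))
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    idx = columns.index(check_column)
    # If nothing new arrived, keep the previous high-water mark.
    new_last = rows[-1][idx] if rows else last_value
    return rows, new_last


def demo():
    """Run two passes: a full initial pull, then a pull of only the new row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
    first, mark = incremental_fetch(conn, "orders", "id", 0)
    conn.execute("INSERT INTO orders VALUES (4, 'd')")
    second, mark2 = incremental_fetch(conn, "orders", "id", mark)
    return first, mark, second, mark2
```

The second pass transfers only the row added since the first pass, which is what makes repeated synchronization between a database and Hadoop cheap compared with re-importing the whole table.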
We’ve also been hard at work on providing better integration opportunities for different analytical tools and applications. As of CDH3b3, this includes an ODBC driver that can be used in conjunction with Hive to enable users to query Hadoop using their favorite BI tool.
As one of the primary contributors and largest production users of Hadoop, Yahoo! publishes the source tree for the version of Hadoop that they run on their production clusters. We are pleased to announce that we have merged Yahoo!’s source tree into CDH3b3. This merge brings many improvements developed at Yahoo! into CDH, including better MapReduce scalability on clusters of 1000 or more nodes and several new tools for benchmarking and testing Hadoop.
The largest new feature, though, is the introduction of a strong authentication system based on Kerberos. Kerberos is an industry-standard authentication system supported both by completely open source software like MIT Kerberos as well as by common enterprise authentication systems like Microsoft Active Directory. The integration of Kerberos authentication into CDH enables enterprises to use their existing authentication infrastructure to manage user identities, and allows more sensitive data to be stored and analyzed within a cluster.
Some new authorization features have also been added to CDH in this release. For example, if the security features are enabled, MapReduce jobs can specify access control lists (ACLs) that determine which users and groups may view job details or kill the job before it completes. Additionally, the tasks of MapReduce jobs may now run as the UNIX user who submitted them, improving the ability to isolate resources and protect the confidentiality of intermediate data and logs. In addition to integrating these new features into the MapReduce and HDFS components, we have also updated the rest of the components of CDH to operate in an authenticated environment. This marks the first time that it has been possible to run a secure Hadoop cluster while continuing to use other Hadoop components like Hive, Hue and HBase.
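As a rough illustration of how a per-job ACL check can work, the sketch below evaluates an ACL string against a user and that user's groups. It assumes the ACL is written as comma-separated users, a single space, then comma-separated groups, with `*` meaning everyone, and that the job owner is always allowed. The function names `parse_acl` and `is_allowed` are hypothetical, and this is a simplified model rather than the code that ships in CDH.

```python
def parse_acl(acl_string):
    """Parse 'user1,user2 group1,group2' into (allow_all, users, groups)."""
    if acl_string.strip() == "*":
        return True, set(), set()
    # Everything before the first space is the user list, the rest is the group list.
    parts = acl_string.split(" ", 1)
    users = {u.strip() for u in parts[0].split(",") if u.strip()}
    groups = set()
    if len(parts) == 2:
        groups = {g.strip() for g in parts[1].split(",") if g.strip()}
    return False, users, groups


def is_allowed(acl_string, user, user_groups, job_owner):
    """Return True if the user may perform the action guarded by this ACL."""
    # Assumption: the user who submitted the job always keeps access to it.
    if user == job_owner:
        return True
    allow_all, users, groups = parse_acl(acl_string)
    if allow_all:
        return True
    return user in users or bool(groups & set(user_groups))
```

For example, with the ACL `"alice,bob analysts"`, the users alice and bob and any member of the analysts group would pass the check, while an empty ACL leaves access to the job owner alone. A leading space, as in `" analysts"`, names groups only.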
The work to integrate these new security features across the distribution is still continuing. We are aware of some places in which the current implementation is incomplete and vulnerable to certain exploits, and we will fix these issues before we declare CDH3 stable. We are also hard at work on a comprehensive guide that will detail setup instructions and best practices for operating a secured Hadoop cluster. If you have security requirements in your organization, we hope you will find this beta release useful as a preview of what’s to come.
Of course, CDH3b3 also contains several bug fixes and improvements based on our experiences deploying clusters for customers with a wide range of use cases. Please check the release notes for the full list.
We are happy to include Apache Whirr as the newest member of the CDH family. Whirr is a tool for quickly starting and managing clusters running on cloud services like Amazon EC2. Stay tuned for an upcoming post with more information about Whirr.
Improvements in Performance and Stability
CDH3b3 includes updates to all of the other components in the platform. Most of these updates fix bugs or improve performance. One notable enhancement is support for calendar-based and event-based scheduling of Hadoop jobs via the Oozie workflow engine.
Upgrading to CDH3b3
In order to upgrade an existing cluster from CDH2 or CDH3b2 to CDH3b3, you’ll have to perform a few extra manual steps. Please check out our CDH3 upgrade guide for detailed instructions.
What’s up next
While we’re done adding major new features for CDH3, we expect to do at least one more beta release before declaring it stable for critical production use. Here’s a sneak peek of what’s to come:
- New upstream versions of some components, including Hive 0.6.0 and HBase 0.90.
- Further integration of security features, including improved authentication support in Hive, ACLs for Fair Scheduler Pools, SPNEGO support for Oozie, and easier deployment.
- Further bug fixes based on our experiences deploying CDH3b3 in QA and in the field.
As always, the CDH team is excited to hear your feedback. Please join the cdh-user mailing list and let us know what you think!