CDH2: Cloudera’s Distribution for Apache Hadoop 2

In March of this year, we released our distribution for Apache Hadoop.  Our initial focus was on stability and making Hadoop easy to install. This original distribution, now named CDH1, was based on the most stable version of Apache Hadoop at the time: 0.18.3. We packaged Apache Hadoop, Pig and Hive into RPMs and Debian packages to make managing Hadoop installations easier.  For the first time, Hadoop cluster managers could bring up a deployment by running one of the following commands, depending on their Linux distribution:

# yum install hadoop
# apt-get install hadoop

As proof of this, our easy-to-use Hadoop Amazon Machine Images (AMIs) use these commands at boot to install the latest release of CDH1 whenever a Hadoop cluster is launched on EC2.

In addition, our packages followed the Filesystem Hierarchy Standard, used chkconfig and service for service management, and even added a hadoop man page.

[Screenshot: the hadoop man page]

CDH1 demonstrated that Hadoop, Pig and Hive could be tightly integrated into Linux, allowing people to manage Hadoop with tools they are already familiar with, such as service, chkconfig, alternatives, logrotate and man.  Hadoop users benefited from the fact that hadoop scripts and environment variables were automatically managed for them, making it easier to focus on using Hadoop.
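For example, once the packages are installed, routine administration can lean on those same standard tools. (The daemon name hadoop-namenode below is an illustrative assumption, not taken from the release notes; your installed packages determine the actual service names.)

```shell
# Start the NameNode daemon and enable it at boot
# (the daemon name here is an assumption for illustration):
service hadoop-namenode start
chkconfig hadoop-namenode on

# Inspect which installation the alternatives system points at,
# and read the packaged man page:
alternatives --display hadoop
man hadoop
```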

Our distribution has been installed on a variety of clusters, from small clusters parsing web logs to large clusters processing protein sequences. We’ve received plenty of great feedback on CDH1 from our customers and the community. That feedback was really useful, and influenced the second major release: CDH2.

CDH2

We released CDH2 on August 3rd as a testing release.  CDH1 is still considered our stable release.  More on that later in this post.

A few of our customers and community members volunteered to serve as beta testers in July to get early access to CDH2 features.  With CDH2, Apache Hadoop is available in two packages: hadoop-0.18 and hadoop-0.20.  Hadoop cluster administrators can now choose to install one or both versions on their cluster.  Installing Hadoop is still as simple as running one of the following commands depending on your Linux distribution and which version of Hadoop you want:

# yum install hadoop-0.18
# yum install hadoop-0.20
# apt-get install hadoop-0.18
# apt-get install hadoop-0.20

The hadoop-0.18 and hadoop-0.20 packages don’t conflict because they use different directory paths, e.g. /usr/lib/hadoop-0.18 and /usr/lib/hadoop-0.20.  To see all the files these packages install, run one of the following commands depending on your Linux distribution:

$ rpm -ql hadoop-0.18
$ dpkg -L hadoop-0.18

These packages also allow you to upgrade from your existing CDH1 hadoop package installation.

The hadoop-0.20 package is based on Apache Hadoop 0.20.0 but with many additions:

  • Sqoop Updates
    • Support for Oracle databases in addition to MySQL
    • Configurable format control (quoting, delimiters)
    • Better documentation (man sqoop)
    • WHERE clause support
    • User-defined class and package name support
    • Bug fixes, including more secure treatment of passwords and the elimination of an out-of-memory condition with MySQL
  • Fair Share Scheduler Updates
    • Support for scheduler preemption
    • Support for setting the default value of maxRunningJobs for all pools
  • And much more (see our manifests for details)
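As a rough illustration of the Sqoop features above (the exact flag names here are assumptions inferred from the feature list, not copied from the documentation; run man sqoop for the authoritative options), an Oracle import with a WHERE clause and custom delimiters might look like:

```shell
# Illustrative only: flag names are assumptions, check `man sqoop`.
sqoop --connect jdbc:oracle:thin:@db.example.com:1521:orcl \
      --table EMPLOYEES \
      --where "DEPT_ID = 7" \
      --fields-terminated-by '\t'
```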

Stable and testing releases

Some people are happy to trade some stability for the latest features.  To accommodate these differing needs, we are publishing our distribution to separate stable and testing repositories.

Packages in our stable repository are deemed ready for production clusters. These packages have passed all unit tests, functional tests and have had a few months of “soak time” in production environments. You can trust that we’ll work hard to prevent any breaking changes to packages in the stable repository (e.g. changing interfaces). As such, we can’t always put the latest features into our current stable release.

Packages in our testing repository are recommended for people who want more features. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.

For example, CDH2 is our current testing release. Over the next few months as our customers and community report their experiences with CDH2, we’ll work to craft it into a stable release. You can expect a new stable and testing release about every quarter.

This doesn’t mean, however, that we’ll immediately discontinue support for previous stable releases.  We’ll support a stable release for at least one year after an alternative stable release is available.

Here is a table showing our current repository state:

Repository  CDH Release  Released     Patched Source  Apt Repository  Yum Repository
Stable      CDH1         March 2009   /cdh/stable     /debian         /redhat/cdh/stable
Testing     CDH2         August 2009  /cdh/testing    /debian         /redhat/cdh/testing
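As a sketch of what subscribing to one of these repositories might look like (the hostname and suite names below are assumptions for illustration; see our Software Archive for the actual repository URLs), an apt user could add a line like this to /etc/apt/sources.list:

```
deb http://archive.cloudera.com/debian testing contrib
```

After an apt-get update, the testing packages would then be visible to the package manager alongside your distribution's own.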

From CDH2 forward we’ll be using an improved version syntax which allows package managers like yum and apt-get to correctly manage your package updates depending on whether you’ve subscribed to the stable or testing repository.

Here is a table with some examples to help familiarize you with the new version syntax:

Full Package Version     Component  Branch  Base Version  Patch Level
hadoop-0.18-0.18.3+76    hadoop     0.18    0.18.3        76
hadoop-0.20-0.20.0+45    hadoop     0.20    0.20.0        45
hadoop-0.18-0.18.3+76.3  hadoop     0.18    0.18.3        76.3
pig-0.4.0+14             pig        -       0.4.0         14

The dot releases in the patch level, e.g. 76.3, are used to ensure software continuity as packages are promoted from testing to stable.  Once a package is promoted, the major patch level will never change.
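To make the syntax concrete, here is a small sketch (not part of the distribution's own tooling) that splits a version string at the "+" separator using standard shell parameter expansion:

```shell
# Split a CDH package version into its base version and patch level.
# The "+" separates the upstream base version from Cloudera's patch level.
ver="0.18.3+76.3"
base="${ver%%+*}"   # everything before the first "+"
patch="${ver#*+}"   # everything after the first "+"
echo "base=$base patch=$patch"
# prints: base=0.18.3 patch=76.3
```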

Pig and Hive

There are currently no packages for Pig and Hive in CDH2, but we’re working hard to remedy this. Two Cloudera team members worked with the community to make Pig and Hive run on Hadoop 0.20.0. Todd Lipcon submitted a patch for HIVE-487, which has been committed to trunk and will be included in Hive 0.4.0. Dmitriy Ryaboy opened PIG-924 and submitted a patch that provides “shims” allowing Pig 0.4 to run on Hadoop 0.20.0. This work will allow us to create CDH2 packages over the next week or two.

Learning more

This blog post only touches on the highlights of the CDH2 release.  If you’re interested in learning more, visit our Software Archive.  We’ve unified our documentation into a single manual with a clean style to make it easier for you to find the answers you are looking for. As always, we look forward to any feedback you might have regarding CDH2.

8 Responses
  • Otis Gospodnetic / September 11, 2009 / 7:12 PM

    Excellent. I see both HIVE-487 and PIG-924 have been committed. Any idea when 0.4 releases of Hive and Pig will happen? I’m asking in order to figure out whether it makes sense to wait for them to show up in CDH2 or whether I should just use CDH1 because release 0.4 is, say, more than a month away.

    Thanks.

  • matt / September 12, 2009 / 11:24 AM

    Thanks for the question, Otis. Our current development sprint ends September 18th. We’ll have Pig and Hive packages available for CDH2 by the end of our sprint regardless of the Pig and Hive release cycle.

  • Edward Capriolo / September 15, 2009 / 11:12 AM

    I have also branched the Cacti Templates for Hadoop for 18 and 20.
    http://www.jointhegrid.com/hadoop
    :)

  • tommy / September 25, 2009 / 11:32 AM

    Are there EC2 AMIs available for CDH2?

  • Dioktos / October 15, 2009 / 1:50 PM

    Is there any plan to allow the configurator to generate deb packages instead of rpms?
