Cloudera Engineering Blog · Distribution Posts
We’re proud to announce that Cloudera’s Distribution for Hadoop Version 2 (CDH2) is officially released.
We’ve come a long way to get to a production quality release. At the beginning of September we announced the first beta of CDH2. After 6 months of additional testing we announced a release candidate. The release candidate spent over a month hardening in Cloudera’s internal QA process and on a wide variety of customer clusters. CDH2 is now stable and ready for use – we are pleased to recommend it to all our production users.
At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.
We plan on pushing new packages into the testing repository every 3 to 6 weeks. And it just so happens it is just about 3 weeks after we announced the first testing release. So it must be time for a new one. Here are some of the highlights:
One of the more common requests we receive from the community is to package Apache HBase with Cloudera’s Distribution for Apache Hadoop. Lately, I’ve been doing a lot of work on making Cloudera’s packages easy to use, and recently, the HBase team has pitched in to help us deliver compatible HBase packages. We’re pretty excited about this, and we’re looking forward to your feedback. A big thanks to Andrew Purtell, a Senior Architect at TrendMicro and HBase Contributor, for leading this packaging project and providing this guest blog post. -Chad Metcalf
What is HBase?
Apache HBase is an open-source, distributed, column-oriented store modeled after Google’s Bigtable large scale structured data storage system. You can read Google’s Bigtable paper here.
In March of this year, we released our distribution for Apache Hadoop. Our initial focus was on stability and making Hadoop easy to install. This original distribution, now named CDH1, was based on the most stable version of Apache Hadoop at the time:0.18.3. We packaged up Apache Hadoop, Pig and Hive into RPMs and Debian packages to make managing Hadoop installations easier. For the first time ever, Hadoop cluster managers were able to bring up a deployment by running one of the following commands depending on your Linux distribution:
# yum install hadoop # apt-get install hadoop
At Cloudera, we frequently work with leading Hadoop developers to produce guest blog posts of general interest to the community. We started a project with Pete Skomoroch a while back, and we were so impressed with his work, we’ve decided to bring Pete on as a regular guest blogger. Pete can show you how to do some pretty amazing things with Hadoop, Pig and Hive and has a particular bias towards Amazon EC2. With that, I’m happy to welcome Pete to the blog, and hope you enjoy his first post as much as we did. -Christophe
Trendingtopics.org was built by Data Wrangling to demonstrate how Hadoop and Amazon EC2 can be used with Rails to power a data-driven website. This post will give an overview of how trendingtopics.org was put together and show some basic approaches for finding trends in log data with Hive. The source code for trendingtopics is available on Github and a tutorial is provided on the Cloudera site which describes many of the data processing steps in greater detail.
Disclaimer: Cloudera no longer approves of the recommendations in this post. Please see this documentation for configuration recommendations.
One of the things we get a lot of questions about is how to make Hadoop highly available. There is still a lot of work to be done on this front, but we wanted to take a moment and share the best practices from one of our customers. Check out what Paul George has to say about how they keep thier NameNode up at ContextWeb. – Christophe
Yesterday, Chris Goffinet from Digg made a great blog post about LZO and Hadoop. Many users have been frustrated because LZO has been removed from Apache Hadoop’s core, and Chris highlights a great way to mitigate this while the project identifies an alternative with a compatible license. We liked the post so much, we asked Chris to share it with our audience. Thanks Chris! -Christophe
So at Digg, we have been working our own Apache Hadoop cluster using Cloudera’s distribution. One of the things we have been working through is how can we split our large compressed data and run them in parallel on Hadoop? One of the biggest drawbacks from compression algorithms like Gzip is that you can’t split them into multiple mappers. This is where LZO comes in.
On June 10th, more than 750 people from around the world descended on the Santa Clara Marriott to share their love for a little stuffed elephant named Hadoop. It was a good week to be part of this exploding community, and I want to extend Cloudera’s heartfelt thanks to everyone who made it possible, especially our friends at Yahoo! who organized this Summit. Most importantly, I want to thank all of you who were able to participate. I know many of you couldn’t make it to California this time, so I hope to see you at the Hadoop Summit East in October.
For those of you who couldn’t join us, I thought I would post my notes on a few of the highlights.