Cloudera Developer Blog · Distribution Posts
We’re proud to announce that Cloudera’s Distribution for Hadoop Version 2 (CDH2) is officially released.
We’ve come a long way to get to a production-quality release. At the beginning of September we announced the first beta of CDH2. After six months of additional testing, we announced a release candidate. The release candidate spent over a month hardening in Cloudera’s internal QA process and on a wide variety of customer clusters. CDH2 is now stable and ready for use – we are pleased to recommend it to all our production users.
CDH2 is based on Apache Hadoop 0.20 – a release that has been available for almost a year. During this time, the Apache Hadoop community has produced hundreds of bug fixes, improvements and features. Cloudera is proud to have contributed many of these and incorporated them into CDH2. For more information, please review the following resources:
At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.
We plan on pushing new packages into the testing repository every 3 to 6 weeks. As it happens, it has been just about 3 weeks since we announced the first testing release, so it must be time for a new one. Here are some of the highlights:
One of the more common requests we receive from the community is to package Apache HBase with Cloudera’s Distribution for Apache Hadoop. Lately, I’ve been doing a lot of work on making Cloudera’s packages easy to use, and recently, the HBase team has pitched in to help us deliver compatible HBase packages. We’re pretty excited about this, and we’re looking forward to your feedback. A big thanks to Andrew Purtell, a Senior Architect at TrendMicro and HBase Contributor, for leading this packaging project and providing this guest blog post. -Chad Metcalf
What is HBase?
Apache HBase is an open-source, distributed, column-oriented store modeled after Bigtable, Google’s large-scale structured data storage system. You can read Google’s Bigtable paper here.
“Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from back end bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.”
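To make the Bigtable-style data model concrete, here is a quick sketch using the HBase shell against a running cluster; the table name, column families, and row key below are made up for illustration, not taken from any real deployment:

```shell
# Model a tiny Bigtable-style "webtable" in the HBase shell:
# rows are keyed by reversed domain name, and each cell lives
# in a column family ('contents', 'anchor') declared at create time.
hbase shell <<'EOF'
create 'webtable', 'contents', 'anchor'
put 'webtable', 'com.example.www', 'contents:html', '<html>...</html>'
put 'webtable', 'com.example.www', 'anchor:com.other', 'Example Site'
get 'webtable', 'com.example.www'
EOF
```

Because columns within a family are created on write rather than declared up front, each row can carry a different, sparse set of columns – the property that lets Bigtable-style stores scale across very wide, irregular datasets.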
In March of this year, we released our distribution for Apache Hadoop. Our initial focus was on stability and making Hadoop easy to install. This original distribution, now named CDH1, was based on the most stable version of Apache Hadoop at the time: 0.18.3. We packaged up Apache Hadoop, Pig and Hive into RPMs and Debian packages to make managing Hadoop installations easier. For the first time ever, Hadoop cluster managers were able to bring up a deployment by running one of the following commands, depending on their Linux distribution:
# yum install hadoop
# apt-get install hadoop
As proof of this, our easy-to-use Hadoop Amazon Machine Images (AMIs) use these commands at boot to install the latest release of CDH1 whenever a Hadoop cluster is launched on EC2.
At Cloudera, we frequently work with leading Hadoop developers to produce guest blog posts of general interest to the community. We started a project with Pete Skomoroch a while back, and we were so impressed with his work, we’ve decided to bring Pete on as a regular guest blogger. Pete can show you how to do some pretty amazing things with Hadoop, Pig and Hive and has a particular bias towards Amazon EC2. With that, I’m happy to welcome Pete to the blog, and hope you enjoy his first post as much as we did. -Christophe
Trendingtopics.org was built by Data Wrangling to demonstrate how Hadoop and Amazon EC2 can be used with Rails to power a data-driven website. This post will give an overview of how trendingtopics.org was put together and show some basic approaches for finding trends in log data with Hive. The source code for trendingtopics is available on Github and a tutorial is provided on the Cloudera site which describes many of the data processing steps in greater detail.
The trendingtopics Rails application identifies recent trends on the web by periodically launching an EC2 cluster running Cloudera’s Distribution for Hadoop to process Wikipedia log files. The cluster runs a Hive batch job that analyzes hourly pageview statistics for millions of Wikipedia articles, and then loads the resulting trend parameters into the application’s MySQL database.
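A batch job of this kind can be sketched in HiveQL. The table and column names below (`pageviews`, `page`, `views`, `dt`) are hypothetical stand-ins, not the actual trendingtopics schema:

```sql
-- Aggregate hourly pageview logs into per-article totals over a window,
-- then surface the most-viewed articles as trend candidates.
SELECT page, SUM(views) AS total_views
FROM pageviews
WHERE dt BETWEEN '2009-08-01' AND '2009-08-07'
GROUP BY page
ORDER BY total_views DESC
LIMIT 50;
```

In a real pipeline, the results of a query like this would be exported from Hive and bulk-loaded into the application’s MySQL database for serving.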
Disclaimer: Cloudera no longer approves of the recommendations in this post. Please see this documentation for configuration recommendations.
One of the things we get a lot of questions about is how to make Hadoop highly available. There is still a lot of work to be done on this front, but we wanted to take a moment and share the best practices from one of our customers. Check out what Paul George has to say about how they keep their NameNode up at ContextWeb. – Christophe
Here at ContextWeb, our Apache Hadoop infrastructure has become a critical part of our day-to-day business operations. As such, it was important for us to find a way to resolve the single point of failure that surrounds the master node processes, namely the NameNode and JobTracker. While it was easy for us to follow the best practice of offloading the secondary NameNode data to an NFS mount to protect metadata, ensuring that the processes were constantly available for job execution and data retrieval was of greater importance. We’ve leveraged some existing, well-tested components that are available and commonly used in Linux systems today. Our solution primarily makes use of DRBD from LINBIT and Heartbeat from the Linux-HA project. The natural combination of these two projects provides us with a reliable and highly available solution, which addresses limitations that currently exist.
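As a rough illustration of how the pieces fit together, a Heartbeat v1 `haresources` entry for such a setup might look like the following. The node name, DRBD resource, virtual IP, mount point, and `hadoop-namenode` init script are all illustrative assumptions, not ContextWeb’s actual configuration:

```
# /etc/ha.d/haresources (Heartbeat v1 style) - illustrative sketch.
# On failover, the standby node takes over the virtual IP, promotes the
# DRBD device, mounts the replicated metadata volume, and starts the
# NameNode via an LSB init script.
master1 IPaddr::10.0.0.100 \
        drbddisk::r0 \
        Filesystem::/dev/drbd0::/hadoop/name::ext3 \
        hadoop-namenode
```

Resources on a line are started left to right on takeover and stopped in reverse order on release, which is exactly the ordering a NameNode needs: network identity first, then storage, then the process itself.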
Yesterday, Chris Goffinet from Digg made a great blog post about LZO and Hadoop. Many users have been frustrated because LZO has been removed from Apache Hadoop’s core, and Chris highlights a great way to mitigate this while the project identifies an alternative with a compatible license. We liked the post so much, we asked Chris to share it with our audience. Thanks Chris! -Christophe
So at Digg, we have been running our own Apache Hadoop cluster using Cloudera’s distribution. One of the things we have been working through is how we can split our large compressed files and process them in parallel on Hadoop. One of the biggest drawbacks of compression formats like gzip is that you can’t split a file across multiple mappers. This is where LZO comes in.
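The key to splittable LZO is an index of compressed block boundaries, which the hadoop-lzo project builds with its `LzoIndexer` tool. A rough sketch of the workflow against a cluster (the jar path and HDFS paths here are illustrative, and the hadoop-lzo native libraries are assumed to be installed):

```shell
# Compress a local log file with lzop and copy it into HDFS.
lzop access_log
hadoop fs -put access_log.lzo /data/logs/

# Build a .index file alongside the .lzo file so MapReduce can cut it
# into multiple input splits instead of feeding it to a single mapper.
hadoop jar /path/to/hadoop-lzo.jar \
    com.hadoop.compression.lzo.LzoIndexer /data/logs/access_log.lzo
```

Once the index exists, jobs that use the LZO-aware input formats from hadoop-lzo can schedule one mapper per split, recovering the parallelism that gzip’s monolithic stream gives up.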
On June 10th, more than 750 people from around the world descended on the Santa Clara Marriott to share their love for a little stuffed elephant named Hadoop. It was a good week to be part of this exploding community, and I want to extend Cloudera’s heartfelt thanks to everyone who made it possible, especially our friends at Yahoo! who organized this Summit. Most importantly, I want to thank all of you who were able to participate. I know many of you couldn’t make it to California this time, so I hope to see you at the Hadoop Summit East in October.
For those of you who couldn’t join us, I thought I would post my notes on a few of the highlights.
Apache Hadoop Goes Mainstream:
About 300 developers attended last year’s summit, primarily from web companies and research labs. They were joined by a few forward-thinking venture capitalists. This year’s audience was both larger and different. In addition to the vibrant developer community, there was a flood of users of Hadoop. Though the audience was still dominated by web companies, attendees included traditional enterprise users with applications ranging from finance to biotech. There were technology previews from IBM and Sun. Major companies like Amazon joined our commercial efforts around Hadoop. VCs had also stepped up to sponsor status. Take-away? You ain’t seen nothing yet.