Cloudera Engineering Blog · Guest Posts

Tracking Trends with Hadoop and Hive on EC2

At Cloudera, we frequently work with leading Hadoop developers to produce guest blog posts of general interest to the community. We started a project with Pete Skomoroch a while back, and we were so impressed with his work, we’ve decided to bring Pete on as a regular guest blogger. Pete can show you how to do some pretty amazing things with Hadoop, Pig and Hive and has a particular bias towards Amazon EC2. With that, I’m happy to welcome Pete to the blog, and hope you enjoy his first post as much as we did. -Christophe was built by Data Wrangling to demonstrate how Hadoop and Amazon EC2 can be used with Rails to power a data-driven website.  This post will give an overview of how was put together and show some basic approaches for finding trends in log data with Hive.  The source code for trendingtopics is available on Github and a tutorial is provided on the Cloudera site which describes many of the data processing steps in greater detail.

Running the Cloudera Training VM in VirtualBox

Update (May 1 2013): The post below, which is based on an outdated VM, is deprecated. Rather please see the Cloudera QuickStart VM, which runs on VirtualBox, VMware, and KVM.

Cloudera’s Training VM is one of the most popular resources on our website. It was created with VMware Workstation, and plays nicely with the VMware Player for Windows, Linux, and Mac. But VMware isn’t for everyone. Thomas Lockney has managed to get our VM image running on Virtual Box, and has written a step-by-step guide for the community. Thanks Thomas! – Christophe

Apache Hadoop HA Configuration

Disclaimer: Cloudera no longer approves of the recommendations in this post. Please see this documentation for configuration recommendations.

One of the things we get a lot of questions about is how to make Hadoop highly available. There is still a lot of work to be done on this front, but we wanted to take a moment and share the best practices from one of our customers. Check out what Paul George has to say about how they keep thier NameNode up at ContextWeb. – Christophe

Hadoop Graphing with Cacti

An important part of making sure Apache Hadoop works well for all users is developing and maintaining strong relationships with the folks who run Hadoop day in and day out. Edward Capriolo keeps’s Hadoop cluster happy, and we frequently chew the fat with Ed on issues ranging from administrative best practices to monitoring. Ed’s been an invaluable resource as we beta test our distribution and chase down bugs before our official releases. Today’s article looks at some of Ed’s tricks for monitoring Hadoop with Cacti through JMX. -Christophe

You may have already read Philip’s Hadoop Metrics post, which provides a general overview of the Hadoop Metrics system. Here, we’ll examine Hadoop monitoring with Cacti through JMX.

What is Cacti?

Rackspace Upgrades to Cloudera’s Distribution for Apache Hadoop

Parallel LZO: Splittable Compression for Apache Hadoop

Yesterday, Chris Goffinet from Digg made a great blog post about LZO and Hadoop. Many users have been frustrated because LZO has been removed from Apache Hadoop’s core, and Chris highlights a great way to mitigate this while the project identifies an alternative with a compatible license. We liked the post so much, we asked Chris to share it with our audience. Thanks Chris! -Christophe

So at Digg, we have been working our own Apache Hadoop cluster using Cloudera’s distribution. One of the things we have been working through is how can we split our large compressed data and run them in parallel on Hadoop? One of the biggest drawbacks from compression algorithms like Gzip is that you can’t split them into multiple mappers. This is where LZO comes in.

The Smart Grid: Hadoop at the Tennessee Valley Authority (TVA)

Using Cloudera’s Hadoop AMIs to process EBS datasets on EC2

High Energy Hadoop

Newer Posts