Migrating from Elastic MapReduce to a Cloudera’s Distribution including Apache Hadoop Cluster

Categories: Community Guest Hadoop MapReduce

This post was contributed by Jennie Cochran-Chinn and Joe Crobak. They are part of the team building out Adconion‘s Hadoop infrastructure to support Adconion’s next-generation ad optimization and reporting systems.

This is the first of a two part series about moving away from Amazon’s EMR service to an in-house Apache Hadoop cluster.

When we first started using Hadoop, we went down the path of Amazon’s EMR service.  We had limited operational resources and wanted to get up and running quickly.  After a while, we starting hitting the limitations of EMR and had to migrate towards managing our own cluster.  In doing so we did not want to lose the features of EMR we found useful – mainly the ease of cluster setup.

This first part of the series discusses our motivation for choosing and then moving away from EMR, while the second part deals with how we maintained ease of cluster setup using Puppet.

Many of our systems use Amazon’s S3 as a backup repository for log data.  Our data became too large to process by traditional techniques, so we started using Amazon’s Elastic MapReduce (EMR) to do more expensive queries on our data stored in S3.  The major advantage of EMR for us was the lack of operational overhead.  With a simple API call, we could have a 20 or 40 node cluster running to crunch our data, which we shutdown at the conclusion of the run.

We had two systems interacting with EMR.  The first consisted of shell scripts to start an EMR cluster, run a pig script, and load the output data from S3 into our data warehousing system.  The second was a Java application that launched pig jobs on an EMR cluster via the Java API and consumed the data in S3 produced by EMR.

The magic of spinning up and configuring a Hadoop cluster in EC2 was spectacular, but there were a few areas that we saw room for improvement.  In particular:

Performance & Tuning. We were hit by the small-files problem, lack of data locality (data stored in S3 but processed on nodes of the EMR cluster), decompression (bz2) performance issues, and virtualization penalties.  To solve these problems, we decided that we needed a non-transient cluster (to satisfy data locality), and a process to aggregate our logfiles into a Hadoop-friendly size and data format (we ultimately chose avro). After crunching the numbers, it was evident that storing large amounts of data on an EC2 cluster quickly becomes expensive, and one still suffers from virtualization penalties (particularly since Hadoop is so I/O intensive), so we decided to build-out a cluster using CDH3.

Monitoring. Typically for us, a pig script running on EMR was one step in a workflow, so we needed to monitor the status of the job to determine when it finished and the next steps could continue.  While Amazon exposes a rich API for monitoring a job, we really wanted a more generic mechanism for monitoring all steps in a workflow, not just those on an EMR cluster.  After considering a number of solutions, we ultimately chose to use Azkaban as our workflow engine for managing dependencies, alerting, and monitoring (which we added atop Azkaban ourselves).

API Access. Interacting with a cluster only over an API is both a blessing and a curse.  The API takes care of otherwise complicated mechanics, such as starting, configuring, and stopping the cluster.  With that said, the calls to the EMR service are rate-limited, so it doesn’t scale very well for monitoring a number of clusters.  Also, we found that we could continuously keep a cluster busy, and thus the EMR limitation of 100 or so jobs on a cluster meant that we had to build wrappers to periodically shutdown and startup clusters.

Lack of latest features. We were using Hadoop 0.18 and Pig 0.3 on EMR, which were missing many features that we wanted to try (e.g. JVM reuse, CombineInputFormats, and improved pig optimization plans).  Eventually, Amazon upgraded to Hadoop 0.20 and Pig 0.6, but even at that point Cloudera’s Distribution including Apache Hadoop had backported many useful features such as performance improvements, monitoring enhancements, and additional APIs.  In addition, CDH provides a full-suite of solutions including Pig, Hive, Flume, and Sqoop, that we’re either actively using or planning to use.

For us, the major drawback to moving away from EMR was new operational overhead.  Starting a cluster with an API call is incredibly useful, and we soon discovered that CDH provided scripts for doing so (now there’s something even better, Apache Whirr).  Eventually, we decided to move out of the cloud, though, so we wanted to build an infrastructure for maintaining a cluster that worked regardless of the hardware configurations.  The RPMs for CDH3 and the great documentation on installing and configuring CDH from Cloudera helped to make this project much-less intimidating.  Ultimately, we built puppet modules for configuring our cluster, which we’ll talk much more about in part two of this post.