How-To: Run a MapReduce Job in CDH4 using Advanced Features

In my previous post, you learned how to write a basic MapReduce job and run it on Apache Hadoop. In this post, we’ll delve deeper into MapReduce programming and cover some of the framework’s more advanced features. In particular, we’ll explore:

  • Combiner functions, a feature that allows you to aggregate map outputs before they are passed to the reducer, possibly greatly reducing the amount of data written to disk and sent over the network for certain types of jobs
  • Counters, a way to track how often user-defined events occur across an entire job – for example, count the number of bad records your MapReduce job encounters in all your data and feed it back to you, without any complex instrumentation on your part
  • Custom Writables, which let you go beyond the basic data types that Hadoop provides as keys and values for your mappers and reducers
  • MRUnit, a framework that facilitates unit testing of MapReduce programs

The full code and short instructions for how to compile and run it are available at https://github.com/sryza/traffic-reduce.

In addition, this time we’ll write our MapReduce program using the “new” MapReduce API, a cleaned-up take on what MapReduce programs should look like that was introduced in Hadoop 0.20. Note that the difference between the old and new MapReduce API is entirely separate from the difference between MR1 and MR2: the API changes affect developers writing MapReduce code, while MR2 is an architectural change that, under the hood, extracts the scheduling and resource-management aspects out into YARN, allowing Hadoop to support other parallel execution frameworks and scale to larger clusters. Both MR1 and MR2 support the old and new MapReduce APIs.

The Use Case

It’s 11pm on a Thursday, and while Los Angeles is known for its atrocious traffic, you can usually count on being safe from heavy traffic five hours after rush hour. But when you merge onto the I-10 going west, it’s bumper to bumper for miles! What’s going on?

It has to be the Clippers game. With tens of thousands of cars leaving from the Staples Center after a home-team basketball game, of course it’s going to be bad. But what about for a Lakers game? How bad does it get for those? And what about on holidays and during political events? It would be great if you could enter a time and determine how far traffic deviated from average for every road in the city.

Caltrans’ Performance Measurement System (PeMS) provides detailed traffic data from sensors placed on freeways across the state, with updates coming in every 30 seconds. The Los Angeles area alone contains over 4,000 sensor stations. While this is frankly a boatload of data, MapReduce allows you to leverage a cluster to process it in a reasonable amount of time.

In this post, we’ll write a MapReduce program that computes the average traffic at each sensor station for each time of the week. Next time, we’ll write a program that uses this information to build an index of the data, so that a program can easily query it to display data for the relevant time.

The TrafficInduce Program

For our first MapReduce job, we would like to find the average traffic for each sensor station at each time of the week. While the data is available every 30 seconds, we don’t need such fine granularity, so we will use the five-minute summaries that PeMS also publishes. Thus, with 4,370 stations, we will be calculating 4,370 * (60 / 5) * 24 * 7 = 8,809,920 averages.

Each of our input data files contains the measurements for all the stations over a month. Each line contains a station ID, a time, some information about the station, and the measurements taken from that station during that time interval.

Here are some example lines. The fields that are useful to us are the first, which tells the time; the second, which tells the station ID; and the 10th, which gives a normalized vehicle count at that station at that time.

 

The mappers will parse the input lines and emit a key/value pair for each line, where the key is an ID that combines the station ID with the time of the week, and the value is the number of cars that passed over that sensor during that time. Each call to the reduce function receives a station/time of week and the vehicle count values over all the weeks, and computes their average.

Combiners

An interesting inefficiency to note is that if a single mapper processes measurements over multiple weeks, it will end up with multiple outputs going to the same reducer. As these outputs are going to be averaged by the reducer anyway, we would be able to save I/O by computing partial averages before we have the complete data. To do this, we would need to maintain a count of how many data points are in each partial average, so that we can weight our final average by that count.  For example, we could collapse a set of map outputs like 5, 6, 9, 10 into (avg=7.5, count=4). As each map output is written to disk on the mapper, sent over the network, and then possibly written to disk on the reducer, reducing the number of outputs in this way can save a fair amount of I/O.

MapReduce provides us with a way to do exactly this in the form of combiner functions. The framework calls the combiner function in between the map and reduce phase, with the combiner’s outputs sent to the reducer instead of the map outputs that it’s called on. The framework may choose to call a combiner function zero or more times – generally it is called before map outputs are persisted to disk, both on the map and reduce side.

Thus, from a high level, our program looks like this:

 
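Roughly, in pseudo-code (the exact tuple layout here is an assumption, but it follows the description above):

map(line):
    time, stationId, ..., vehicleCount = parse(line)
    emit(key = (stationId, timeOfWeek(time)), value = (vehicleCount, count = 1))

combine(key, values) / reduce(key, values):
    totalCount  = sum of all counts in values
    weightedSum = sum of (average * count) over values
    emit(key, value = (weightedSum / totalCount, totalCount))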

Custom Writables

MapReduce key and value classes implement Hadoop’s Writable interface so that they can be serialized to and from binary. While Hadoop provides a set of classes that implement Writable to serialize primitive types, the tuples we use in our pseudo-code don’t map efficiently onto any of them. For our keys, we can concatenate the station ID with the time of week to represent them as strings and use the Text type. However, as our value tuple is composed of primitive types, a float and an integer, it would be nice not to have to convert them to and from strings each time we want to use them. We can accomplish this by implementing a Writable for them.

 
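A minimal sketch of such a Writable follows. The class and field names (AverageWritable, average, numSamples) are our own and may differ from the class in the traffic-reduce repo, but the structure (a no-argument constructor plus write and readFields) is what the framework requires.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Sketch of a value type holding a partial average and the number of
// data points it covers. Class and field names are illustrative.
public class AverageWritable implements Writable {
  private float average;
  private int numSamples;

  // The framework instantiates Writables through the no-argument
  // constructor and then populates them via readFields().
  public AverageWritable() {}

  public AverageWritable(float average, int numSamples) {
    set(average, numSamples);
  }

  public void set(float average, int numSamples) {
    this.average = average;
    this.numSamples = numSamples;
  }

  public float getAverage() { return average; }
  public int getNumSamples() { return numSamples; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeFloat(average);
    out.writeInt(numSamples);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    average = in.readFloat();
    numSamples = in.readInt();
  }

  // equals/hashCode make the type usable in MRUnit assertions later on.
  @Override
  public boolean equals(Object other) {
    if (!(other instanceof AverageWritable)) {
      return false;
    }
    AverageWritable o = (AverageWritable) other;
    return Float.compare(average, o.average) == 0 && numSamples == o.numSamples;
  }

  @Override
  public int hashCode() {
    return Float.floatToIntBits(average) * 31 + numSamples;
  }

  @Override
  public String toString() {
    return average + "\t" + numSamples;
  }
}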

We deploy our Writable by including it in our job jar. To instantiate our Writable, the framework will call its no-argument constructor and then fill it in by calling its readFields method. Note that if we wanted to use a custom class as a key, it would need to implement WritableComparable so that it can be sorted.

At Last, the Program

With our custom data type in hand, we are at last ready to write our MapReduce program. Here is what our mapper looks like:

 
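A minimal sketch of that mapper follows, written against the new API. The class name AveragerMapper, the key format, the comma-separated field layout (timestamp first, station ID second, flow in the 10th field), and the timestamp format are assumptions based on the description of the data above.

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the mapper, using the new (org.apache.hadoop.mapreduce) API.
public class AveragerMapper extends Mapper<LongWritable, Text, Text, AverageWritable> {

  // Assumed timestamp format in the PeMS 5-minute files.
  private final SimpleDateFormat dateFormat = new SimpleDateFormat("MM/dd/yyyy HH:mm:ss");
  private final Calendar calendar = Calendar.getInstance();

  private final Text outKey = new Text();
  private final AverageWritable outValue = new AverageWritable();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    // Skip records that are missing the vehicle flow.
    if (fields.length < 10 || fields[9].isEmpty()) {
      return;
    }
    try {
      String stationId = fields[1];
      float vehicleFlow = Float.parseFloat(fields[9]);
      calendar.setTime(dateFormat.parse(fields[0]));
      // Index of the five-minute slot within the week for this measurement.
      int slotInWeek = ((calendar.get(Calendar.DAY_OF_WEEK) - 1) * 24 * 60
          + calendar.get(Calendar.HOUR_OF_DAY) * 60
          + calendar.get(Calendar.MINUTE)) / 5;

      // The key combines the station ID and the time of week; the value is
      // a "partial average" covering a single data point.
      outKey.set(stationId + "_" + slotInWeek);
      outValue.set(vehicleFlow, 1);
      context.write(outKey, outValue);
    } catch (ParseException e) {
      return;
    } catch (NumberFormatException e) {
      return;
    }
  }
}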

You may notice that this mapper looks a little bit different than the mapper used in the last post. This is because in this post we use the “new” MapReduce API, a rewrite of the MapReduce API that was introduced in Hadoop 0.20.  The newer one is a little bit cleaner, but Hadoop will support both APIs far into the future.

An astute observer will notice that our combiner and reducer are doing exactly the same thing – i.e. outputting a weighted average of the inputs.  Thus, we can write the following reducer function, and pass it as a combiner as well:

 
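A sketch of that function (again with our own class name, AveragerReducer) might look like the following; because its input and output types match, the same class can be registered as both the combiner and the reducer.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the reducer. It outputs a weighted average of its input
// partial averages, so it also works as a combiner.
public class AveragerReducer extends Reducer<Text, AverageWritable, Text, AverageWritable> {

  private final AverageWritable outValue = new AverageWritable();

  @Override
  protected void reduce(Text key, Iterable<AverageWritable> values, Context context)
      throws IOException, InterruptedException {
    float weightedSum = 0;
    int totalSamples = 0;
    for (AverageWritable value : values) {
      // Weight each partial average by the number of points it covers.
      weightedSum += value.getAverage() * value.getNumSamples();
      totalSamples += value.getNumSamples();
    }
    outValue.set(weightedSum / totalSamples, totalSamples);
    context.write(key, outValue);
  }
}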

Using the new API, our driver class looks like this:

 
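A sketch of such a driver follows. The class name AveragerRunner matches the invocation shown later in this post; the rest is a standard new-API job setup, including registering the reducer class as the combiner.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Sketch of the driver for the averaging job.
public class AveragerRunner {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "traffic averager");
    job.setJarByClass(AveragerRunner.class);

    job.setMapperClass(AveragerMapper.class);
    // The reducer doubles as the combiner.
    job.setCombinerClass(AveragerReducer.class);
    job.setReducerClass(AveragerReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(AverageWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(AverageWritable.class);

    // TextInputFormat passes each line as the value and its byte offset as the key.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}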

Note that unlike last time, when we used KeyValueTextInputFormat, we use TextInputFormat for our input data. While KeyValueTextInputFormat splits up the line into a key and a value, TextInputFormat passes the entire line as the value, and uses its position in the file (as an offset from the first byte) as the key.  The position is not used, which is fairly typical when using TextInputFormat.

Counters

In the real world, data is messy. Traffic sensor data, for example, contains records with missing fields all the time, as sensors in the wild are bound to malfunction at times. When running our MapReduce job, it is often useful to count up and collect metrics on the side about what the job is doing. For a program on a single computer, we might just do this by adding a count variable, incrementing it whenever our event of interest occurs, and printing it out at the end, but when our code is running in a distributed fashion, aggregating these counts gets hairy very quickly.

Luckily, Hadoop provides a mechanism to handle this for us, using Counters. MapReduce contains a number of built-in counters that you have probably seen in the output on completion of a MapReduce job.

 

This information is also available in the web UI, both per-job and per-task. To use our own counter, we can simply add a line like

 
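// Sketch, using the new-API Context; the group and counter names here
// match the output shown below.
context.getCounter("Averager Counters", "Missing vehicle flows").increment(1);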

to the point in the code where the mapper comes across a record with a missing count. Then, when our job completes, we will see our count along with the built-in counters:

Averager Counters
   Missing vehicle flows=2329

It’s often convenient to wrap your entire map or reduce function in a try/catch and increment a counter in the catch block, using the exception class’s name as the counter’s name, to get a profile of what kinds of errors come up, as in the sketch below.
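Here is a minimal sketch of that pattern inside a map method; the counter group name (“Map Errors”) is our own choice.

@Override
protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  try {
    // ... the normal parsing and emitting logic goes here ...
  } catch (Exception e) {
    // One counter per exception class gives a quick profile of what went wrong.
    context.getCounter("Map Errors", e.getClass().getSimpleName()).increment(1);
  }
}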

Testing

Running a MapReduce program on a cluster, if we even have access to one, can take a while. However, if we want to make sure that our basic logic works, we have no need for all the machinery. Enter Apache MRUnit, an Apache project that makes writing JUnit tests for MapReduce programs probably as easy as it could possibly be. Through MRUnit, we can test our mappers and reducers both separately and as a full flow. 

To include it in our project, we add the following to the dependencies section of Maven’s pom.xml:

 
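Something like the following should work; the version shown is an assumption (use the latest MRUnit release), and the hadoop2 classifier selects the build for the new API.

<dependency>
  <groupId>org.apache.mrunit</groupId>
  <artifactId>mrunit</artifactId>
  <version>1.0.0</version>
  <!-- hadoop2 selects the build for the new (org.apache.hadoop.mapreduce) API;
       use hadoop1 for the old (org.apache.hadoop.mapred) API. -->
  <classifier>hadoop2</classifier>
  <scope>test</scope>
</dependency>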

The following contains a test for both the mapper and reducer, verifying that with sample inputs, they produce the expected outputs:

 
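A sketch of such a test follows. The class names match the sketches above, and the sample input line and expected outputs are purely illustrative, not real PeMS data.

import java.util.Arrays;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

// Sketch of MRUnit tests against the mapper and reducer sketches above.
public class AveragerTest {

  @Test
  public void testMapper() throws Exception {
    MapDriver<LongWritable, Text, Text, AverageWritable> mapDriver =
        MapDriver.newMapDriver(new AveragerMapper());
    // Assumed layout: timestamp, station ID, seven other fields, then the flow.
    String line = "01/01/2012 00:05:00,100,x,x,x,x,x,x,x,42.0";
    mapDriver
        .withInput(new LongWritable(0), new Text(line))
        // Station 100, second five-minute slot of the week, one sample of 42.0.
        .withOutput(new Text("100_1"), new AverageWritable(42.0f, 1))
        .runTest();
  }

  @Test
  public void testReducer() throws Exception {
    ReduceDriver<Text, AverageWritable, Text, AverageWritable> reduceDriver =
        ReduceDriver.newReduceDriver(new AveragerReducer());
    reduceDriver
        .withInput(new Text("100_1"),
            Arrays.asList(new AverageWritable(10.0f, 1), new AverageWritable(20.0f, 3)))
        // Weighted average: (10*1 + 20*3) / 4 = 17.5 over 4 samples.
        .withOutput(new Text("100_1"), new AverageWritable(17.5f, 4))
        .runTest();
  }
}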

We can run our tests with “mvn test” in the project directory. If there are failures, information on why they failed is available in the project directory under target/surefire-reports.

A more in-depth MRUnit tutorial is available here: https://cwiki.apache.org/confluence/display/MRUNIT/MRUnit+Tutorial.

Running Our Program on Hadoop

Like last time, we can build the jar with

 
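For instance, from the project directory (with the project’s Maven coordinates, this produces target/trafficinduce-1.0-SNAPSHOT.jar):

mvn clean package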

The full data is available at http://pems.dot.ca.gov/?dnode=Clearinghouse, but like last time, the GitHub repo contains some sample data to run our program on. To place it on the cluster, we can run:

 
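For example (the local path to the sample file is an assumption; adjust it to wherever the sample data lives in your checkout):

hadoop fs -mkdir trafficcounts
hadoop fs -put data/input.txt trafficcounts/input.txt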

To run our program, we can use

 
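For example, using the paths from the previous step:

hadoop jar target/trafficinduce-1.0-SNAPSHOT.jar AveragerRunner trafficcounts/input.txt trafficcounts/output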

We can inspect the output with:

 
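For example (with the new API the reducer output files are typically named part-r-00000 and so on):

hadoop fs -cat trafficcounts/output/part-*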

Thanks for reading! Next time, we’ll delve into some more advanced MapReduce features, like the distributed cache, custom partitioners, and custom input and output formats.

Sandy Ryza is a Software Engineer on the Platform team.


4 Responses
  • David Parks / May 04, 2013 / 10:17 PM

    -libjars doesn’t seem to work when I follow this example, adding a job that uses 3rd party libraries would be a great example of an “advanced feature”. It’s a royal pain to find a good example out here on google as there seem to be many ways to do everything in hadoop, and cdh4 and cdh3 don’t seem to be the same in this respect.

  • Sandy Ryza / May 06, 2013 / 1:20 PM

    David,

    Thanks for the feedback. I’ll try to cover -libjars in my next post. If you still have an issue that you’re trying to work out, you might be able to get help on the cdh-user mailing list.

  • Muthukumar / October 11, 2013 / 11:12 AM

    I keep getting ClassNotFoundException when I run this example, not sure what is missing. I just downloaded this example and ran it as it is. I am running this on the Cloudera CDH4 VM image. Any help is appreciated.

    hadoop jar target/trafficinduce-1.0-SNAPSHOT.jar AveragerRunner trafficcounts/input.txt trafficcounts/output
    Exception in thread “main” java.lang.ClassNotFoundException: AveragerRunner
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:201)

  • Justin Kestelyn (@kestelyn) / October 15, 2013 / 4:28 PM

    Muthu,

    Blog comments are not such a good medium for trouble-shooting. Please post your issue to the “MR” board at community.cloudera.com: http://community.cloudera.com/t5/Batch-Processing-and-Workflow/bd-p/JavaAPI
