How-To: Run a MapReduce Job in CDH4

Categories: CDH How-to MapReduce Use Case

This is the first post in a series that will get you going on how to write, compile, and run a simple MapReduce job on Apache Hadoop. The full code, along with tests, is available on GitHub. The program will run on either MR1 or MR2.

We’ll assume that you have a running Hadoop installation, either locally or on a cluster, and that your environment is set up correctly, so that typing “hadoop” at your command line prints some usage notes. Detailed instructions for installing CDH, Cloudera’s open-source, enterprise-ready distribution of Hadoop and related projects, are available in Cloudera’s documentation. We’ll also assume you have Maven installed on your system, as this will make compiling your code easier. Note that Maven is not a strict dependency; we could also compile using javac on the command line, or with an IDE like Eclipse.

The Use Case

There’s been a lot of brawling on our pirate ship recently. All too often, one of the mates will punch another one in the mouth, knocking a tooth out onto the deck. Our poor sailors will wake up the next day with an empty bottle of rum, wondering who’s responsible for the gap between their teeth. All this violence has gotten out of hand, so as a deterrent, we’d like to provide everyone with a list of everyone who’s ever left them with a gap. Luckily, we’ve been able to set up a Flume source so that every time someone punches someone else, it gets written out as a line in a big log file in Hadoop. To turn this data into these lists, we need a MapReduce job that can 1) invert the mapping so that each victim is paired with their attackers, 2) group by victim, and 3) eliminate duplicates.

The Input Data

A file or set of text files, in which each line represents a specific instance of punching: a pirate (the puncher), a tab character as a delimiter, and then the victim. An example input file is located in the GitHub repo.

Writing Our GapDeduce Program

Our map function simply inverts the mapping of punchers to their targets.
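A minimal sketch of what such a mapper might look like. The class name `GapMapper` and the assumption that input lines are tab-separated `puncher\tvictim` pairs are ours; consult the repo for the post’s actual code.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: each input line is "<puncher>\t<victim>".
// We emit (victim, puncher), inverting the relationship so that
// the shuffle delivers all of a victim's punchers to one reducer.
public class GapMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Text victim = new Text();
  private final Text puncher = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t");
    if (parts.length == 2) {
      puncher.set(parts[0]);
      victim.set(parts[1]);
      context.write(victim, puncher); // key = victim, value = puncher
    }
  }
}
```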

By using the victim as our map output/reduce input key, the shuffle groups by victim.

Our reduce function is a little more complicated, but not by much. It uses a TreeSet to eliminate duplicates and order the punchers by name, and then outputs the list of pirates as a single string.
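A sketch of the reducer described above, under the same assumptions as the mapper; the class name `GapReducer` and the comma-separated output format are illustrative choices, not necessarily the repo’s.

```java
import java.io.IOException;
import java.util.TreeSet;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: collects all punchers for a victim into a
// TreeSet, which both removes duplicates and sorts by name, then
// emits the set as a single delimited string.
public class GapReducer extends Reducer<Text, Text, Text, Text> {
  private final Text outValue = new Text();

  @Override
  protected void reduce(Text victim, Iterable<Text> punchers, Context context)
      throws IOException, InterruptedException {
    TreeSet<String> uniquePunchers = new TreeSet<String>();
    for (Text puncher : punchers) {
      uniquePunchers.add(puncher.toString());
    }
    StringBuilder sb = new StringBuilder();
    for (String p : uniquePunchers) {
      if (sb.length() > 0) {
        sb.append(",");
      }
      sb.append(p);
    }
    outValue.set(sb.toString());
    context.write(victim, outValue);
  }
}
```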

Finally, we write a driver that will run the job.
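A driver along these lines wires the two classes together. The class name `GapDeduce` matches the program name used in this post; everything else (job name, argument order) is an assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: configures the job and takes the input and
// output paths from the command line.
public class GapDeduce {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "gapdeduce");
    job.setJarByClass(GapDeduce.class);
    job.setMapperClass(GapMapper.class);
    job.setReducerClass(GapReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```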

Compiling and Packaging

To run our program, we need to compile it and package it into a jar that can be sent to the machines in our cluster. To do this with Maven, we set up a project whose POM contains the hadoop-client dependency.
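The dependency entry looks something like the following; the version string is an example only, so check the repo’s pom.xml for the exact CDH4 artifact version.

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <!-- example version; use the one from the repo's pom.xml -->
  <version>2.0.0-cdh4.0.0</version>
</dependency>
```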

Furthermore, we include the following repositories so Maven knows where to get the bits.
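A repository entry along these lines points Maven at Cloudera’s artifact repository; the `id` is arbitrary, and the URL should be verified against the repo’s pom.xml.

```xml
<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>
```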

The full pom.xml is included in the github repository at

Then, in the project’s root directory, we run mvn install. This command both compiles our code and generates a jar, which should show up in the target/ directory under the root project directory.
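Something like the following; the jar’s exact name depends on the artifactId and version in the POM, so the name shown here is illustrative.

```shell
# Compile, run tests, and package the jar
mvn install

# The packaged jar lands under target/ (name will vary with your POM)
ls target/*.jar
```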

Running Our MapReduce Program

The GitHub repo contains a sample input file and the expected output file. We can put the sample input file into a directory on HDFS called “gaps”.
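The commands look something like this; the local file name `punches.txt` is a stand-in, since the sample file’s actual name is in the repo.

```shell
# Create the input directory in HDFS and upload the sample file
# ("punches.txt" is a hypothetical name -- use the repo's sample file)
hadoop fs -mkdir gaps
hadoop fs -put punches.txt gaps/
```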

We submit our MapReduce job using “hadoop jar”, which is similar to “java -jar” but includes all the environment necessary for running our jar on Hadoop.
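For example, assuming the jar name from the build step above and a driver class called `GapDeduce` (both illustrative):

```shell
# Submit the job: <jar> <main class> <input dir> <output dir>
# Jar and class names here are hypothetical -- match them to your build
hadoop jar target/gapdeduce-1.0.jar GapDeduce gaps gaps-output
```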

To inspect the output file of our job, we can cat it from HDFS.
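Assuming the output directory name used above, a single-reducer job writes its results to a part file that we can print with “hadoop fs -cat”:

```shell
# part-r-00000 is the standard name of the first reducer's output file
hadoop fs -cat gaps-output/part-r-00000
```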

At last, we can provide our mates with a full accounting of their gaps.

In the next post in this series, we’ll cover some advanced MapReduce features like the distributed cache, as well as testing MapReduce code with MRUnit.

Sandy Ryza is a Software Engineer at Cloudera, working on the Platform team.


6 responses on “How-To: Run a MapReduce Job in CDH4”

  1. Assirk

    Nice project! I was reading the group-by example and imagined we have a register of daily unique visits for each URL: [“url”, “date”, “visits”]. If we want to calculate the unique visits up to each date from that register, the cumulative metric computed is not actually unique visits, as there might be duplicated visits across days. Anyway, nice abstraction on top of Hadoop!

  2. David Parks

    How do you use MultipleInputs in this example? MultipleInputs requires a “Job” object, not a JobConf object, and it’s not clear to me when to use one vs. the other.

  3. Rishq

    Can I run a MapReduce job on CDH4 without rebuilding it against the new dependencies in pom.xml? (It’s one I built and ran on Apache Hadoop before.)