Lessons Learned from Cloudera’s Hadoop Developer Training Course

This is a guest post from an attendee of our Hadoop Developer Training course, Attila Csordas, bioinformatician at the European Bioinformatics Institute, Hinxton, Cambridge, UK.

As a wet lab biologist turned bioinformatician I have about two years of programming experience, mainly in Perl, and have been working with Java for the last 9 months. A bioinformatician is not a developer, so I write simple code in just a fraction of my work time: parsers, db connections, xml validators, little bug fixes, shell scripts. On the other hand, I now have 5 months of Hadoop experience – and a 6 month old baby named Alice – and that experience is as immense as it gets. Ever since I read the classic Dean-Ghemawat paper, MapReduce: Simplified Data Processing on Large Clusters, I have been thinking about bioinformatics problems in terms of Map and Reduce functions (especially during my evening jog), then implementing these ideas in my free time, which consists of feeding the baby, writing code, changing the nappy, rewriting code.

Not unrelatedly, I am now a Cloudera Certified Hadoop Developer: two weeks ago I attended the Cloudera Hadoop Developer Training in London and passed the exam. Right now this is the only Hadoop certificate in the world. The instructor, Ian Wrigley, was a relaxed Brit from west LA who was articulate enough for us all to follow the ideas put forward, and responsive enough to answer the questions raised and clear up misconceptions.

In what follows I summarize some lessons and tricks learned on the course. I will be using some references from the movie The Prestige because, besides being my favorite movie, I sometimes sense a little analogy between the three parts of a magic trick and the MapReduce paradigm in general:

Map ~ The Pledge, Shuffle&Sort ~ The Turn, Reduce ~ The Prestige.

1. In native Java it is good practice to create objects outside the map/reduce methods and repopulate them with data each time around the loop inside those methods.

If you consider a WordCount job running over billions of documents, the map method of a typical map task can be called millions of times, creating new objects each time around the loop. If your mapper looks like this:

  public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      Text word = new Text();                      // a new Text on every call to map()
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, new IntWritable(1));  // and a new IntWritable on every token
      }
    }
  }

then you’re creating a lot of throwaway objects and an inefficient pattern, making the life of the garbage collector unnecessarily hard.

But if you initialize your Text object outside the map method like this:

  public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);  // created once
    private Text word = new Text();                             // created once, reused below

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);

      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());   // repopulate the existing Text instead of allocating a new one
        output.collect(word, one);
      }
    }
  }

you can get roughly twice the performance. No wonder the relevant Hadoop Writable classes are mutable.
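
The same reuse pattern applies on the reduce side, which the lesson’s title mentions but the course examples above don’t show. Here is a minimal sketch of a WordCount-style reducer using the same old (org.apache.hadoop.mapred) API and the same implicit imports as the mappers above; the class name ReduceClass is my own choice, not from the course material:

  public static class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable sum = new IntWritable();  // created once, reused for every key

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int total = 0;
      while (values.hasNext()) {
        total += values.next().get();
      }
      sum.set(total);              // repopulate instead of new IntWritable(total)
      output.collect(key, sum);
    }
  }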

2. Don’t let the reduce progress report fool you!

When you see something like this on the command line:

11/01/23 19:16:28 INFO mapred.JobClient: map 100% reduce 33%

you might think, “Wow, the reducers are quick and making nice progress. See, they’ve already processed a third of the input!” Wrong! This is just a false impression created by the progress report. That 33% means only that the shuffle phase has finished and the output from the mappers has landed safely at the nodes where the reducers reside. And 66%? That means the sort phase (or rather the merge phase), which merges the already pre-sorted map outputs while maintaining the sort order, has completed as well. While the copy/shuffle and sort/merge are technically part of the reduce task and accomplished via background threads, beginners tend to think of map/reduce progress in terms of key/value processing, because they were taught that they only have to write those functions (and the driver) on key/values to accomplish the task. Quite inconsistently, the map progress report does not include the in-memory pre-sort by key and is based solely on the amount of input that has already been processed.

In Hadoop: The Definitive Guide, Tom White explains how the system estimates and displays the proportion of reduce progress, admitting that the calculation is a little more complicated than estimating map progress:

“it does this by dividing the total progress into three parts, corresponding to the three phases of the shuffle: For example, if the task has run the reducer on half of its input, then the task’s progress is 5/6, since it has completed the copy and the sort phases (1/3 each) and is halfway through the reduce phase (1/6).”

So, according to this calculation, reduce 66% reflects 0% of actual reduce key/value processing, since that 2/3 really represents the Shuffle & Sort, which sometimes seems like the bastard child (or children) of the whole MapReduce framework. According to Tom White, “shuffle is the heart of MapReduce and is where the ‘magic’ happens.” Well, the creators of MapReduce obviously agree with the advice of Alfred Borden from The Prestige, “The secret impresses no one. The trick you use it for is everything,” because they let Shuffle & Sort stay in the background, and I guess MapShuffleSortReduce sounded like a less catchy name, just like 23andMitochondriaandMe for 23andMe[1]. Actually the extended name should be MapSortShuffleMergeReduce, reflecting the pre-sort at the end of the map and the merge on the reduce side, which maintains the sort order produced by the map.
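
To make the arithmetic concrete, here is a tiny illustrative sketch of how a reported reduce percentage maps onto the three phases under the 1/3 + 1/3 + 1/3 split described above; the class and method names are made up for the example and are not part of the Hadoop API:

  public class ReduceProgressExample {
    // each of the three phases contributes at most one third to the displayed figure
    static double reportedProgress(double copyDone, double sortDone, double reduceDone) {
      return (copyDone + sortDone + reduceDone) / 3.0;
    }

    public static void main(String[] args) {
      // copy and sort finished, but the reducer has not processed a single key/value yet
      System.out.println(reportedProgress(1.0, 1.0, 0.0));  // 0.666... -> "reduce 66%"
      // reducer halfway through its input, as in Tom White's example
      System.out.println(reportedProgress(1.0, 1.0, 0.5));  // 0.833... -> "reduce 83%"
    }
  }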

This false impression could be avoided in future progress reports, maybe by giving the separate shuffle and sort steps their own progress lines. Until then, if you are stuck at reduce 66%, rewrite your reduce class.

3. Call RecordReader if InputSplit crosses the line.

The byte-oriented InputSplit can split a record in half, and the record-oriented RecordReader makes it Mapper-ready. I asked Ian Wrigley to explain how the RecordReader can ensure that each key/value pair is processed, but only once, even if that pair is split across an InputSplit boundary (a rough sketch of this logic follows the list):

  • The InputSplit says “Start at offset x, and continue for y bytes”.
  • The RecordReader gets the InputSplit.
  • If the InputSplit’s offset is 0, start at the beginning of the InputSplit.
  • Otherwise, scan forward to the start of the next record.
  • Read a record at a time; this might take you past the end of the InputSplit, but that’s OK.
  • After you’re past the end of the InputSplit, stop.
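
Here is a minimal sketch of that boundary-handling logic. It is not the real LineRecordReader: the class name and the two helper methods are invented for illustration, and the actual stream handling is stubbed out.

  import java.io.IOException;

  public class LineBoundarySketch {
    private long start;   // byte offset where this InputSplit begins
    private long end;     // first byte past the end of this InputSplit
    private long pos;     // current read position

    public void initialize(long splitStart, long splitLength) throws IOException {
      start = splitStart;
      end = splitStart + splitLength;
      pos = start;
      if (start != 0) {
        // Not the first split: the previous reader already consumed the record
        // we are standing in, so scan forward to the start of the next record.
        pos = skipToNextRecord(pos);
      }
    }

    public String nextRecord() throws IOException {
      // Only start a record while we are still inside the split; the last record
      // read may run past 'end', and that is fine.
      if (pos >= end) {
        return null;
      }
      String record = readOneRecordAt(pos);    // hypothetical helper
      pos += record.length() + 1;              // +1 for the delimiter (a sketch; real readers track bytes)
      return record;
    }

    // Hypothetical helpers standing in for the actual stream handling.
    private long skipToNextRecord(long from) throws IOException { return from; }
    private String readOneRecordAt(long at) throws IOException { return ""; }
  }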

4. “The sacrifice… that’s the price of a good trick.” (Alfred Borden)

Explicitly relying on unreliable hardware: that is the beautiful concept behind the original Google File System and the Hadoop Distributed File System. But the MapReduce framework’s robust fault tolerance (except for the master node) and the seemingly “nice & easy” abstraction it provides for developers come with a price that needs to be considered.

Mappers don’t communicate with each other, and neither do reducers; these functions have no identities and are replaceable and disposable. Developers have to completely rethink the familiar sequential algorithms they use if they want to implement them in MapReduce, and there’s no guarantee that all of those techniques will work, since they cannot write code that communicates between nodes. As a result, debugging is complicated but necessary once the code’s complexity surpasses the basic examples.

Finally, four snippets of juicy Hadoop ecosystem info I picked up more or less at random during the course:

  • There’s a ~20% performance hit when using the Streaming API with languages other than Java, so you can expect faster development time at the price of slower execution.
  • The latest version of Oozie (Yahoo’s Hadoop workflow engine, included in CDH3) has nice advanced features, like running workflows at specific times, monitoring an HDFS directory, and running a workflow when data appears in that directory.
  • Mahout (the machine learning library built on top of Hadoop) has XML input support.
  • If you run Hadoop jobs via Pig scripts, your sysadmin might not see the Pig in them at all.

[1] The personal genotyping company provides and analyzes genetic variations not just from the 23 nuclear chromosomes but from the mitochondrial DNA as well.
