Getting MapReduce 2 Up to Speed

Thanks to the improvements described here, CDH 5 will ship with a version of MapReduce 2 that is just as fast as (or faster than) MapReduce 1.

Performance fixes are tiny, easy, and boring, once you know what the problem is. The hard work is in putting your finger on that problem: narrowing, drilling down, and measuring, measuring, measuring.

Apache Hadoop is no exception to this rule. Recently, Cloudera engineers set out to ensure that MapReduce performance in Hadoop 2 (MR2/YARN) is on par with, or better than, MapReduce performance in Hadoop 1 (MR1). Architecturally, MR2 has many performance advantages over MR1:

  • Better scalability by splitting the JobTracker into the ResourceManager and Application Masters. 
  • Better cluster utilization and higher throughput through finer-grained resource scheduling. 
  • Less tuning required to avoid over-spilling from smarter sort buffer management.
  • Faster completion times for small jobs through “Uber Application Masters,” which run all of a job’s tasks in a single JVM.

While these improvements are important, none of them matters much for a well-tuned, medium-sized job on a medium-sized cluster. And whenever a codebase goes through large changes, regressions are likely to seep in.

While correctness issues are easy to spot, performance regressions are difficult to catch without rigorous measurement. When we started including MR2 in our performance measurements last year, we noticed that it lagged significantly behind MR1 on nearly every benchmark. Since then, we've done a ton of work, tuning parameters in Cloudera Manager and fixing regressions in MapReduce itself, and can now proudly say that CDH 5 MR2 performs as well as, or better than, MR1 on all our benchmarks.

In this post, I'll offer a couple of examples of this work as case studies in tracking down performance regressions in complex (Java) distributed systems.

Ensuring a Fair Comparison

Ensuring a fair comparison between MR1 and MR2 is tricky. One common pitfall is that TeraSort, the job most commonly used for benchmarking, changed between MR1 and MR2. To reflect rule changes in the GraySort benchmark on which it is based, the data generated by the TeraSort included with MR2 is less compressible. A valid comparison would use the same version of TeraSort for both releases; otherwise, MR1 will have an unfair advantage.

Another difficult area is resource configuration. In MR1, each node’s resources must be split between slots available for map tasks and slots available for reduce tasks. In MR2, the resource capacity configured for each node is available to both map and reduce tasks. So, if you give MR1 nodes 8 map slots and 8 reduce slots and give MR2 nodes 16 slots worth of memory, resources will be underutilized during MR1’s map phase. MR2 will be able to run 16 concurrent mappers per node while MR1 will only be able to run 8. If you only give MR2 nodes 8 slots of memory, then MR2 will suffer during the period when the map and reduce phases overlap – it will only get to run 8 tasks concurrently, while MR1 will be able to run more. (See this post for more information about properly configuring MR2.)

To circumvent this issue, our benchmarks give full node capacity in MR1 to both map slots and reduce slots. We then set the mapred.reduce.slowstart.completedmaps parameter to 0.99 in both, meaning that there is no overlap between the map and reduce phases. This ensures that MR1 and MR2 get full cluster resources for both phases.
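
For reference, here is a minimal sketch of how a benchmark job might pin the slowstart setting. It uses the MR1-era property name from above (which MR2 should also understand via its deprecated-key handling); the rest of the job setup is elided.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SlowstartBenchmarkSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Don't launch any reduce tasks until 99% of the maps have finished,
        // so the map and reduce phases do not overlap and each phase gets
        // the node's full capacity.
        conf.setFloat("mapred.reduce.slowstart.completedmaps", 0.99f);
        Job job = Job.getInstance(conf, "benchmark");
        // ... set mapper, reducer, input/output formats and paths as usual ...
      }
    }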

Case 1: CPU Cache Locality in Sorting Map Outputs

A performance fix starts with noticing a performance problem. In this case, I noticed WordCount was lagging: A job that ran in 375 seconds on an MR1 cluster took 472 seconds on an MR2 cluster, more than 25 percent longer.

A good start when diagnosing a MapReduce performance issue is to determine in which phase it occurs. In this case, the web UI reported that MR2 map tasks were taking much more time than MR1 map tasks. The next thing is to look for any big differences in the counters; here, they were nearly the same between jobs, so that provided few hints.

With little to go on from the counters, the next step is to isolate the problem – reproduce it with as little else going on as possible. The difference in map time showed up when running the same job with a single map and single reduce task in the LocalJobRunner.  However, with the reduce task cut out, the times evened out. Because the sort is skipped when no reduce tasks run, it seemed likely that there was some kind of regression in the map-side sort phase.
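
As a rough sketch of that isolation step, the repro boils down to something like the following (the property values are the standard local-runner settings; the WordCount wiring itself is elided):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class LocalRepro {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run the whole job inside this JVM with the LocalJobRunner,
        // against the local filesystem.
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf, "wordcount-repro");
        // With one reduce task the map-time gap shows up; with zero reduce
        // tasks the map-side sort is skipped and the gap disappears.
        job.setNumReduceTasks(1);
        // ... configure the WordCount mapper, reducer, and input/output paths ...
      }
    }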

The Map-Side Sort

The next step requires a little bit of background on how the map-side sort works.

Map output data is placed in an in-memory buffer. When the buffer fills up, the framework sorts it and then writes it to disk (“spills” it). A separate thread merges the sorted on-disk files into a single larger sorted file. The buffer consists of two parts: a section holding contiguous raw output data and a metadata section holding, for each record, a pointer into the raw data. In MR1, the relative sizes of these two sections were fixed, controlled by io.sort.record.percent, which could be configured per job. This meant that, without proper tuning of this parameter, a job with many small records could fill up the metadata section much more quickly than the raw data section, and the buffer would be spilled to disk before it was anywhere near full.
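
A quick back-of-the-envelope calculation shows how badly the fixed split can behave for small records. The numbers below (the old defaults of io.sort.mb=100 and io.sort.record.percent=0.05, and 16 accounting bytes per record) are assumptions for illustration, not measurements from the job:

    public class SortBufferMath {
      public static void main(String[] args) {
        long sortBytes = 100L * 1024 * 1024;   // io.sort.mb = 100 (assumed default)
        double recordPercent = 0.05;           // io.sort.record.percent (assumed default)
        long metaBytesPerRecord = 16;          // accounting bytes per record (assumed)
        long recordBytes = 20;                 // a small WordCount-style output record

        long metaCapacity = (long) (sortBytes * recordPercent);   // ~5 MB
        long dataCapacity = sortBytes - metaCapacity;             // ~95 MB

        long recordsUntilMetaFull = metaCapacity / metaBytesPerRecord;  // ~327k records
        long dataUsedAtSpill = recordsUntilMetaFull * recordBytes;      // ~6 MB

        // The metadata section fills long before the data section does, so the
        // buffer spills when it is only a few percent full.
        System.out.printf("spill after %d records, data section %.1f%% full%n",
            recordsUntilMetaFull, 100.0 * dataUsedAtSpill / dataCapacity);
      }
    }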

MAPREDUCE-64 fixed this issue in MR2 by allowing the two sections to share the same space and vary in size, meaning that manual tuning of io.sort.record.percent is no longer required to minimize the number of spills.

With all this in mind, I realized that we had not yet tuned io.sort.record.percent for the job, and therefore the MR1 map tasks were spilling 10 times as often as the MR2 map tasks. When I tuned the parameter for MR1 so that it spilled only as few times as MR2, MR1 performance actually suffered: spilling fewer, larger chunks meant slower map tasks.

I had a theory that CPU cache latency was involved. A smaller chunk of output data might fit into a CPU cache, meaning that all the memory accesses made while sorting it would be extremely fast. A larger chunk would not fit, meaning that memory accesses would have to go to a higher-level cache or maybe even all the way to main memory. As each step up the memory hierarchy takes roughly an order of magnitude longer to access, this could cause a large performance hit.

Fortunately, there is an extremely powerful Linux profiling tool, called perf, that makes this easy to measure. Running a command under perf stat -e cache-misses will spit out the number of CPU cache misses it encountered. In this case, the MR2 job and the tuned MR1 job had a similar number of cache misses, while the untuned MR1 job had half as many.

Improving CPU Cache Latency

So how to improve CPU cache latency? At the suggestion of Todd Lipcon, I took a look at MAPREDUCE-3235, an unfinished JIRA from a year ago that proposed a couple of ways to improve CPU cache performance in exactly this situation. The suggested change was trivial: instead of sorting indices into the map output metadata array, sort the metadata array itself. Previously, to access the nth map output record, the code found the nth element of the index array, followed it to the corresponding entry in the metadata array, and then followed the position stored there into the raw output data. Now, it just finds the nth entry in the metadata array and follows the position stored there.

This approach removes a layer of indirection and means fewer accesses to possibly far-away memory locations. The drawback is that when the sort does a swap, it moves the full metadata entry (about 20 bytes) instead of just the index (4 bytes). But a memory access outside the cache is far more expensive than a few extra move instructions inside it, so the trade is worth it.
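
Here is a simplified sketch of the two approaches. The names and the flat int[] layout are illustrative, not the actual MapTask code, but they show where the indirection goes away:

    public class MetaSortSketch {
      static final int NMETA = 4;  // ints of metadata per record (illustrative)

      // Before: sort a separate array of indices. Swaps are cheap (4 bytes),
      // but every comparison chases kvoffsets[i] into kvmeta and then into the
      // raw key bytes, touching scattered memory locations.
      static void swapIndirect(int[] kvoffsets, int i, int j) {
        int tmp = kvoffsets[i];
        kvoffsets[i] = kvoffsets[j];
        kvoffsets[j] = tmp;
      }

      // After (MAPREDUCE-3235): sort the metadata entries themselves. Swaps move
      // the whole entry, but comparisons read memory that is already close by.
      static void swapDirect(int[] kvmeta, int i, int j) {
        for (int k = 0; k < NMETA; k++) {
          int tmp = kvmeta[i * NMETA + k];
          kvmeta[i * NMETA + k] = kvmeta[j * NMETA + k];
          kvmeta[j * NMETA + k] = tmp;
        }
      }
    }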

The tiny change worked like magic. It cut the number of cache misses in half for the local job and brought the runtime of the MR2 job on the cluster below that of the MR1 job. Win!

Case 2: Over-reading in the Shuffle

Thanks to a report from a Cloudera partner, we learned that more disk reads were occurring in the shuffle during an MR2 job than during the same one on MR1. To reproduce this issue, I turned to ShuffleText, a benchmark job we run that specifically targets the shuffle. It generates a bunch of data in the mappers, shuffles it, and then throws it away in the reducers.

I failed to reproduce the problem with a pseudo-distributed setup, but it immediately reared its head when I ran the jobs on the cluster. The job submitted to MR2 took 30 percent longer than the same job submitted to MR1. Even more dramatically, reduce tasks spent an average of a whopping 60 seconds fetching map output data in MR2, compared to 27 seconds in MR1.

While MapReduce counters are helpful in many situations, they don't provide machine-wide hardware metrics like the number of reads that actually hit disk. Cloudera Manager was invaluable both for measuring the problem and for digging down into what was wrong. I quickly created charts showing the total bytes read from disk on machines in the cluster while the job was running: 31.2GB for MR2 compared with 6.4GB for MR1.

Disk reads happen in a few different places in a MapReduce job: First, when reading the input data from disk; second, when merging map outputs if there are multiple spills; third, when serving data to reducers during the shuffle; and fourth, when merging data on the reduce side.

The TaskTracker and NodeManager processes are responsible for serving intermediate data to reducers in MR1 and MR2, respectively. Looking at the disk bytes read by these processes in Cloudera Manager confirmed that the extra reads were occurring while serving data to the reducers. Looking at the code, I noticed that the MR2 shuffle had been rewritten to serve data with Netty async I/O instead of the thread-per-request Jetty web server used in MR1, but nothing popped out to explain why this specific issue would be occurring. Adding log messages that printed out the total number of bytes read by the shuffle server yielded no useful leads.

On top of this I noticed a few facts:

  1. In both MR1 and MR2, the sum of bytes being read from disk during the shuffle phase was smaller than the total map output bytes.
  2. In MR1, the majority of shuffle disk reads appeared to be occurring on two of the machines, while the rest had nearly 0.
  3. These two machines happened to have less physical memory than the other machines on the cluster.

The story that best fit all three observations was that, in MR1, most of the map outputs were able to reside in memory in the OS buffer cache. The two machines with less physical memory were experiencing more disk reads because they couldn't fit the map output data in the cache. For some reason, MR2 was not taking advantage of the buffer cache as well as MR1 was.

fadvise

Turning again to the code, I noticed that MR (both 1 and 2) is very careful about its interactions with the OS buffer cache. The Linux fadvise system call lets a process give the OS memory subsystem hints about whether or not to cache regions of files. When the shuffle server receives a request for map output data, it fadvises the file regions it is about to read with FADV_WILLNEED so that they will be ready in memory. When done with a file region, the server fadvises it out of memory with FADV_DONTNEED to free up space, because it knows that the region will likely not be read again.
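
In rough pseudocode terms, the shuffle server's interaction with the buffer cache looks like this. The helper names below are illustrative stand-ins for Hadoop's native fadvise wrapper and the data transfer, not the real method names:

    public abstract class ShuffleFadviseSketch {
      // Advice values matching the Linux fadvise constants.
      static final int POSIX_FADV_WILLNEED = 3;
      static final int POSIX_FADV_DONTNEED = 4;

      // Roughly what happens when a reducer asks for one map output region.
      void serveMapOutput(java.io.FileDescriptor fd, long offset, long length) {
        // Hint that this region is about to be read, so the kernel pre-loads it.
        fadvise(fd, offset, length, POSIX_FADV_WILLNEED);

        sendToReducer(fd, offset, length);

        // Once served, the region won't be read again, so let the kernel drop
        // it from the buffer cache to make room for other map outputs.
        fadvise(fd, offset, length, POSIX_FADV_DONTNEED);
      }

      // Stand-ins for the native fadvise call and the actual data transfer.
      abstract void fadvise(java.io.FileDescriptor fd, long off, long len, int advice);
      abstract void sendToReducer(java.io.FileDescriptor fd, long off, long len);
    }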

Without any obvious bad logic in the code, the next step was to figure out more directly what was going on. I turned to strace, which traces all the system calls made by a process, and listened for fadvises, file opens, closes, and reads. The amount of data this produced proved unmanageable to sift through by hand. What about simply counting the number of WILLNEEDs and DONTNEEDs?
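
The counting itself can be a throwaway script; here is a sketch in Java (the strace invocation in the comment and the log format it assumes are illustrative):

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class FadviseTally {
      public static void main(String[] args) throws Exception {
        // args[0]: a log captured with something like
        //   strace -f -e trace=fadvise64 -o shuffle.strace -p <NodeManager/TaskTracker pid>
        // (the exact strace flags and output format are assumptions)
        long willneed = 0, dontneed = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
          String line;
          while ((line = in.readLine()) != null) {
            if (line.contains("POSIX_FADV_WILLNEED")) willneed++;
            if (line.contains("POSIX_FADV_DONTNEED")) dontneed++;
          }
        }
        System.out.println("WILLNEED: " + willneed + ", DONTNEED: " + dontneed);
      }
    }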

The WILLNEED counts were pretty similar between MR1 and MR2, but seemed to vary per run. For the DONTNEEDs, MR1 was pegged at 768 per node, which corresponded exactly to the number of map tasks that ran on that node multiplied by the number of reducers. MR2, however, was always higher than this, and varied between runs.

A mere four hours of aggravation later, it all fell into place: in normal operation, reducers will often issue requests for map outputs, then decide they don't want those outputs yet and terminate the connection. This occurs because reducers have a limited amount of memory into which to read map outputs, and they only find out how much space a map output will take up from a prefix in the response when they request it from the NodeManager/TaskTracker. (In other words, they don't know how much space a map output needs before asking for it, so, when they find out, they may need to abort and come back later when the space is available.)

If the reducer terminates a connection, the shuffle server should not evict the file regions being fetched from the OS cache (not fadvise them as DONTNEED) because the reducer will come back and ask for them later. MR1 was doing this right. MR2 was not, meaning that even if a map output was in memory the first time the fetcher came around, it would be evicted if the reducer terminated the connection, and when the reducer came back it would need to be read from disk.

The fix merely consisted of the shuffle server not fadvise-ing regions as DONTNEED when a fetch terminates before reading all of the data. This dropped the average time a reducer spends fetching intermediate data from 60 seconds to 27, the same as MR1. The average job runtime also dropped by 30 percent, bringing it in line with MR1 as well.
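
Sketched with the same illustrative helpers as above, the shape of the fix is simply to make the eviction conditional on the whole region having been sent:

    // Extending the earlier sketch: called when a transfer finishes or the
    // connection drops (in the real code, from a Netty completion listener).
    void onTransferDone(java.io.FileDescriptor fd, long offset, long length,
                        long bytesActuallySent) {
      if (bytesActuallySent == length) {
        // The reducer consumed everything; safe to drop the pages.
        fadvise(fd, offset, length, POSIX_FADV_DONTNEED);
      }
      // Otherwise the reducer hung up early and will come back for this
      // region, so leave it in the OS buffer cache.
    }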

The astute reader will realize that the situation could be improved even further by communicating the size of map output regions to reducers before they try to fetch them. This would allow us to avoid initiating a fetch when the reducer can’t fit the results inside its merge buffer, and reduce unnecessary seeks on the shuffle server side. (We would like to implement this change in future work.)

Conclusion

I hope that these experiences will assist you in your own performance regression testing, and maybe give you a tiny drop of solace the next time you’re trapped inside an Eclipse window, wondering whether there’s a way to make everything ok.

Thanks to these improvements and many others from Cloudera and the rest of the community (the MR2 improvements have gone upstream and will appear in Hadoop 2.3), we are confident that the version of MR2 that will ship inside CDH 5 (in beta at the time of writing; download here) will perform at least as well as, and very likely better than, MR1!

Many thanks to Yanpei Chen, Prashant Gokhale, Todd Lipcon, and Chris Leroy for their assistance on this project.

Sandy Ryza is a Software Engineer at Cloudera and a Hadoop Committer.
