A profile of Apache Hadoop MapReduce computing efficiency (continued)
Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc. where he develops large-scale, distributed computing solutions.
Previously we proposed how we measure the performance in Hadoop MapReduce applications in an effort to better understand the computing efficiency. In this part, we’ll describe some results and illuminate both good and bad characteristics.
We selected our SIFT-M MapReduce application, described in our presentation at Hadoop World 2010 , as the candidate algorithm for Node Scalability since it is embarrassingly parallel and is representative of compute-intensive applications where the bulk of work is computation and not data movement. The Terasort MapReduce benchmark is used for the data scalability tests since it has a greater dependence on the distribution of data than the SIFT algorithm. The Terasort MapReduce benchmark is distributed with the Hadoop codebase. The Yahoo implementation gained notoriety for breaking the terabyte sorting benchmark in 2009 for sorting 100 TB in 173 minutes.
Examples indicating saturation and steady-state condition are given in the following plots.
Figure 1 SIFT-M phase plot.
Figure 2 SIFT-M task rate plot.
Figure 3 SIFT-M Ganglia CPU plot.
Figure 1, Figure 2, and Figure 3 depict our SIFT-M job on a cluster with a total of 56×2=112 PEs, the maximum number of task slots. We see from Figure 1 the task count is maximized for the duration of the benchmark indicating the system is saturated and has reached steady-state. The task rate plotted in Figure 2 displays a constant arrival rate, a linear curve, expected for steady-state and is approximately 3 tasks per second. Further evidence of the CPU saturation is provided by the Ganglia hardware metrics in Figure 3. The next plots in Figure 4, Figure 5, and Figure 6 demonstrate an under-utilized cluster which supports 256 PEs. We can see the concurrency is not maintained in the map phase, the map task assignment never reaches a constant rate and declines rapidly. The reduce phase reaches steady-state but is far from saturating the available execution slots.
Figure 4 Under-utilized Terasort phase plot.
Figure 5 Under-utilized Terasort task rate plot.
Figure 6 Under-utilized Terasort Ganglia CPU plot.
An example of a Terasort job with better system utilization is depicted in Figure 7, Figure 8, and Figure 9. Note there are multiple shuffle, sort, and reduce phases. This can arise when there are more reduce tasks allocated than available slots so the reduce phases complete in “waves”. It is important to balance the concurrency between the map and reduce phases to achieve the best performance.
Figure 7 Terasort phase plot.
Figure 8 Terasort task rate plot.
Figure 9 Terasort Ganglia CPU plot.
The histogram plots for the SIFT-M job identify the variance in workload and performance per host, evident in Figure 10, Figure 11, and Figure 12. Clearly the “red” host is struggling with the tasks.
Figure 10 SIFT-M histogram of task count.
Figure 11 SIFT-M histogram of task duration.
Figure 12 SIFT-M histogram of I/O in bytes.
The final plots depict the scalability performance. The plot in Figure 13 compares the node scalability between all the clusters. All clusters scale linearly for the fixed-sized study as expected. Figure 14 displays the data scalability for the scaled-sized problem. We find that each cluster exhibits a decrease in throughput as the input scales correlating with a drop in CPU utilization which we attribute to greater latency in moving the data to the CPU. The dramatic drop in the early data points is likely due to the input fitting entirely in memory.
Figure 13 Node scalability plot.
Figure 14 Data scalability plot.
We expect embarrassingly parallel Hadoop MapReduce applications will scale linearly and our Node Scalability results align very well with this expectation. The preliminary results of our Data Scalability study indicate performance degrades with increasing input as a result of less data locality and higher demand on disk and network resources. Since input is necessarily shared by all compute hosts contention on the same disks increases with input size even when different file blocks are requested. The imposed data dependency impedes an important advantage of MapReduce, the overlap of communication and computation. The MapReduce paradigm masks the latency from information requests by transferring map output to the reduce hosts during the map phase, known as shuffling, rather than waiting for all map tasks to complete. But the map tasks are also requesting data between hosts because of the cluster-wide data dependency, and so the shuffling phase contends for the very same network and storage resources.
Although we used the default Hadoop FIFO scheduler, we suggest that selecting different job schedulers and increasing the replication factor, HDFS block size, and virtual memory page size in isolation or combination can improve performance. Developers may also need to create specialized Partition and Combiner classes to address data skew. A judicious choice in the key-space partitioning and the number of reduce tasks will have significant impact on the performance. Since both map and reduce phases overlap, care is needed not to over-subscribe the system during the map phase but avoid under-utilizing the resources in the reduce phase. A system with many smaller disks over few large disks is recommended in conjunction with high compute density.
I would like to thank Jonathan Jarrett for developing the scripts for the Gnuplot charts and automated benchmarks, and thank David Ritch and Adam Watts for administering our clusters. I also thank the rest of the SRA team for their support.