Thanks to Wuheng Luo, a Hadoop and big data architect at Sears Holdings, for the guest post below about Pig job-level performance tuning
Many factors can affect Apache Pig job performance in Apache Hadoop, including hardware, network I/O, cluster settings, code logic, and algorithm. Although the sysadmin team is responsible for monitoring many of these factors, there are other issues that MapReduce job owners or data application developers can help diagnose, tune, and improve. One such example is a disproportionate Map-to-Reduce ratio—that is, using too many reducers or mappers in a Pig job.
There are two simple reasons why using too many mappers or reducers should be avoided. First, it can inhibit your MapReduce job’s performance. It’s a myth that the more mappers and reducers we use, the faster our jobs will run—the fact is, each MapReduce task carries certain overhead, and the communication and data movement between mappers and reducers take resources and time. Thus, tuning your job so that workload is evenly distributed across reducers, with as little skew as possible, is much more effective than blindly increasing the number of mappers or reducers. Furthermore, the number of MapReduce tasks that can run simultaneously on each machine is limited. Given these facts, using more mappers or reducers than you actually need will slow your job down rather than speed it up.
Second, using more mappers or reducers than needed can have a significant negative impact on other MapReduce applications, and can constrain your Hadoop cluster’s overall performance. We all know that a Hadoop cluster is a distributed, resource sharing grid, but we often forget or ignore the fact that we are not running our own MapReduce jobs in a vacuum.
For example, a typical cluster’s total MapReduce task capacity, generally speaking, is configured to a ratio about 1:0.7. A Hadoop cluster with 100 data nodes, for instance, could have 1,000 map slots and 700 reduce slots allocated in total, about 10 slots for maps and 6-7 slots for reduces for each data node on average. With Hadoop 2, YARN’s default capacity scheduler allows us to partition applications with queues based on the nature, type, and priority of MapReduce jobs and cap how much resource a MapReduce task can take.
There are scenarios, however, in which the request of a job asking for more resources than it actually needs can be granted if other queues’ resources are under-utilized and free resources are available at that moment. In that case, a queue could be allocated with unnecessary resources beyond its defined capacity, since the ResourceManager is not designed to distinguish reasonable resource requests from unreasonable or excessive ones. Therefore, it is our responsibility as developers to not waste or abuse shared resources in way that interferes with the normal operation of other MapReduce jobs in the same cluster.
Well, you might ask, how do I know whether I used too many reducers or mappers in my Pig jobs? Listed below are some simple steps to help you diagnose your jobs.
- Go to the job history portal, or in Hadoop 1, the Jobtracker page, find your completed job, and look at the numbers of mappers and reducers used. If the number of your reducers is equal to or even greater than that of mappers, that is the first sign that your job might have a map-to-reduce ratio issue.
- Check the number and size of your Pig job’s input files. If the number of mappers used in your Pig job is equal to the number of input files (that is, one mapper for each input file), and the size of your input files is less than the default Hadoop data split size (64 or 128 MB), then it is likely that you used too many mappers.
- On the job history or Jobtracker portal, check the actual execution time of your MapReduce tasks. If the majority of your completed mappers or reducers were very short-lived, say, under 10-15 seconds, and if your reducers ran even faster than your mappers, that’s a good sign you might have used too many reducers and/or mappers.
- Go to your job’s output directory in HDFS and check the number of output part-* files. If that number is greater than your planned reducer number (each reducer is supposed to generate one output file), it indicates that your planned reducer number somehow got overridden and you used more reducers than you meant to.
- Check the size of your output part-* files. If the planned reducer number is 50, but some of the output files are empty, that is a good indicator that you over-allocated reducers for your job and wasted the cluster’s resources.
If you find your job has one or more of the symptoms mentioned above, how do you fix it? Here are a few ideas that can help you tune and improve the MapReduce performance at job level while you design, implement, and test your applications.
- Generally, the Mappers-to-Reducers ratio is about 4:1. A Pig job with 400 mappers usually should not have reducers more than 100. If your job has combiners implemented (using
group … byand
foreachtogether without other operators in between), that ratio can be as high as 10:1—that is, for 100 mappers using 10 reducers is sufficient in that case.
- For Symptom 2 mentioned previously, when there are a lot of small input files, smaller than the default data split size, then you need to enable the combination of data splits and set the minimum size of the combined data splits to a multiple of the default data split size, usually 256 or 512MB or even more. Specifically, set these Pig parameters as shown:
12-Dpig.splitCombination="true" \-Dmapred.min.split.size=$[1024 * 1024 * 128 * 2] \
- Do not solely rely on a generic default reduce parallelism setting in the line of
SET default_parallel …at the very beginning of your Pig code. Generally, hard-coding a fixed number of reducers in Pig using
parallelis a bad idea. Instead, calculate specifically the appropriate number of mappers based on the split size of your input data (the split size should be a multiple of the system default, in the range of 128-512MB normally, could be even higher if input data is at multi-TB scale), and then apply the ratio mentioned in Solution 1 above to come up with the proper number of reducers to use. For example, if each of your mappers takes input data split of 512MB or so, each reducer then can take about 2GB of data (512MB x 4) based on the recommended ratio. Another way to estimate the right number of reducers is to look at the reduce task run time—which is usually between 5-15 minutes, on average.
- If you don’t have the line
set default_parallelin your script, and don’t use the keyword
parallelspecifically anywhere in your code, then at the script/application level, Pig has two properties that users can manipulate toward appropriate reduce parallelism:
reducers.bytes.per.reducer, which has a default value of 1GB but as mentioned in Solution 2 can be set to 2GB or even higher, depending on the total input size and how optimized your script is; and
reducers.max, which has a default setting of 999 but should be set with a much lower number to cap the maximum number of reducers a Pig job should take. Between the two properties—(total input size /
reducers.max—Pig will give precedence to the one that is smaller. If your total input data is about 100GB, possible settings could include:
12-Dpig.exec.reducers.bytes.per.reducer=$[1024 * 1024 * 1024 * 2] \-Dpig.exec.reducers.max=50 \
- If you know what you are doing and really need more control over each and every one of the reduce phases in your Pig script, then instead of setting
bytes per reduceras outlined in Solution 3, you may specify the reducer number with the keyword
parallelin each reduce block, such as
CROSS, so that your job-specific reduce settings will take over and override possible inappropriate default settings.
Tuning Pig performance at job level by using an appropriate numbers of mappers and reducers involves many things, and is contingent on resources, job workload, other jobs in the same environment, and other factors. Therefore, it may take some iteration to achieve tangible improvement. Just use your best judgment, and you’ll get there!
Wuheng Luo is currently a Hadoop and big data architect at Sears Holdings. His technical experience includes working on Hadoop data pipelines at Yahoo! He has presented at IEEE Big Data conferences and Hadoop Summit.