Configuration Parameters: What can you just ignore?

Configuring a Hadoop cluster is something akin to voodoo. There are a large number of variables in hadoop-default.xml that you can override in hadoop-site.xml. Some specify file paths on your system, but others adjust levers and knobs deep inside Hadoop’s guts. Unfortunately, there’s little or no documentation on how to set them well. Is there a single optimal configuration? Are there some settings that can just be “set to 11?”

Nigel's guitar goes to 11, but your cluster might not. At Cloudera, we’re working hard to make Hadoop easier to use and to make configuration less painful. Our Hadoop Configuration Tool gives you a web-based guide to help set up your cluster. Once it’s running, though, you might want to look under the hood and tune things a bit.

The rest of this post discusses why it’s a bad idea to just set all the limits as high as they’ll go, and gives you some pointers to get started on finding a happy medium.

Why can’t you just set all the limits to 1,000,000?

Increasing most settings has a direct impact on memory consumption. Increasing DataNode and TaskTracker settings, therefore, has an adverse impact on the RAM available to individual MapReduce tasks. On large hardware, they can be set generously high. In general, though, unless you have several dozen or more nodes working together, dialing up settings very high wastes system resources like RAM that could be better applied to running your mapper and reducer code.

That having been said, here’s a list of some things that can be cranked up higher than the defaults by a fair margin:

File descriptor limits

A busy Hadoop daemon might need to open a lot of files. The open fd ulimit in Linux defaults to 1024, which might be too low. You can set this to something more generous, say 16384. Setting it an order of magnitude higher (e.g., 128K) is probably not a good idea. No individual Hadoop daemon is supposed to need hundreds of thousands of fds; if it’s consuming that many, then there’s probably an fd leak or other bug that needs fixing, and a huge limit would just mask the true problem until errors started showing up somewhere else.

You can view your ulimits in bash by running:
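    ulimit -a    # all limits for the current shell; the "open files" line is the fd limit
    ulimit -n    # just the open-file limit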

To set the fd ulimit for a process, you’ll need to be root. As root, open a shell, and run:
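    ulimit -n 16384    # raises the open-file limit for this shell and its children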

You can then run the Hadoop daemon from that shell; the ulimits will be inherited. e.g.:
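A minimal sketch, assuming a hadoop user and an install under /usr/lib/hadoop (both are placeholders; substitute your own user, paths, and start scripts):

    # still in the root shell where the ulimit was raised
    su -s /bin/bash hadoop -c "/usr/lib/hadoop/bin/hadoop-daemon.sh start datanode"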

You can also set the ulimit for the hadoop user in /etc/security/limits.conf; this mechanism will set the value persistently. Make sure pam_limits is enabled for whatever auth mechanism the hadoop daemon is using. The entry will look something like:
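    # /etc/security/limits.conf format: <domain> <type> <item> <value>
    hadoop  soft  nofile  16384
    hadoop  hard  nofile  16384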

If you’re running our distribution, we ship a modified version of Hadoop 0.18.3 that includes HADOOP-4346, a fix for the “soft fd leak” that has affected Hadoop since 0.17, so this should be less critical for our users. Users of the official Apache Hadoop release are affected by the fd leak for all 0.17, 0.18, and 0.19 versions. (The fix is committed for 0.20.) For the curious, we’ve published a list of all differences between our release of Hadoop and the stock 0.18.3 release.

If you’re running Linux 2.6.27, you should also set the epoll limit to something generous; maybe 4096 or 8192.
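You can apply it immediately with sysctl. The knob name below is the one exposed by the 2.6.27/2.6.28 kernels; check /proc/sys/fs/epoll/ to see what your kernel actually provides:

    sudo sysctl -w fs.epoll.max_user_instances=4096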

Then put the following text in /etc/sysctl.conf:
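    # epoll limit; adjust the name if your kernel exposes a different
    # knob under /proc/sys/fs/epoll/
    fs.epoll.max_user_instances = 4096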

See http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/ for more details.

Internal settings

If there is more RAM available than is consumed by task instances, set io.sort.factor to 25 or 32 (up from 10), and set io.sort.mb to 10 * io.sort.factor. Don’t forget to multiply io.sort.mb by the number of concurrent tasks to determine how much RAM you’re actually allocating here, or you risk swapping. (So 10 task instances with io.sort.mb = 320 means you’re actually allocating 3.2 GB of RAM for sorting, up from 1.0 GB at the defaults.) An open ticket in the Hadoop bug tracker suggests raising the default io.sort.factor to 100, which would likely result in a per-stream cache size lower than the current 10 MB.
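For example, in hadoop-site.xml (the values below simply illustrate the 10x ratio on a cluster with RAM to spare; they are not a universal recommendation):

    <property>
      <name>io.sort.factor</name>
      <value>32</value>
    </property>
    <property>
      <name>io.sort.mb</name>
      <value>320</value>
    </property>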

io.file.buffer.size – this is one of the more “magic” parameters. You can set this to 65536 and leave it there. (I’ve profiled this in a bunch of scenarios; this seems to be the sweet spot.)

If the NameNode and JobTracker are on big hardware, set dfs.namenode.handler.count to 64, and do the same for mapred.job.tracker.handler.count. If you’ve got more than 64 GB of RAM in this machine, you can double it again.

dfs.datanode.handler.count defaults to 3 and could be set a bit higher. (Maybe 8 or 10.) More than this takes up memory that could be devoted to running MapReduce tasks, and I don’t know that it gives you any more performance. (An increased number of HDFS clients implies an increased number of DataNodes to handle the load.)

mapred.child.ulimit should be set to 2–3x the heap size specified in mapred.child.java.opts and left there, to prevent runaway child task memory consumption. (Note that mapred.child.ulimit is expressed in kilobytes, while the heap in mapred.child.java.opts is usually given with -Xmx.)

Setting tasktracker.http.threads higher than 40 will deprive individual tasks of RAM, and you won’t see a positive impact on shuffle performance until your cluster approaches 100 nodes or more.
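Pulling the remaining settings together, a hadoop-site.xml sketch might look like the following. The child heap size and the exact values are illustrative assumptions; tune them to your own hardware:

    <!-- "magic" I/O buffer size discussed above -->
    <property>
      <name>io.file.buffer.size</name>
      <value>65536</value>
    </property>

    <!-- NameNode / JobTracker RPC handler threads on big hardware -->
    <property>
      <name>dfs.namenode.handler.count</name>
      <value>64</value>
    </property>
    <property>
      <name>mapred.job.tracker.handler.count</name>
      <value>64</value>
    </property>

    <!-- DataNode handler threads, up a bit from the default of 3 -->
    <property>
      <name>dfs.datanode.handler.count</name>
      <value>8</value>
    </property>

    <!-- example: 512 MB child heap with a ulimit of 3x that, in KB -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx512m</value>
    </property>
    <property>
      <name>mapred.child.ulimit</name>
      <value>1572864</value>
    </property>

    <!-- shuffle server threads; 40 is plenty below roughly 100 nodes -->
    <property>
      <name>tasktracker.http.threads</name>
      <value>40</value>
    </property>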

Conclusions

Configuring Hadoop for “optimal performance” is a moving target, and depends heavily on your own applications. There are settings that need to be moved off their defaults, but finding the best value for each is difficult. Our configurator for Hadoop will do a reasonable job of getting you started.

We’d love to hear from you about your own configurations. Did you discover a combination of settings that really made your cluster sing? Please share in the comments.

The photo of Nigel’s amplifier is from the movie This is Spinal Tap, distributed by Embassy Pictures.

6 Responses
  • Abhishek Verma / April 03, 2009 / 10:16 PM

    I created a hadoop job which ran a billion iterations per mapper and ran it on a Hadoop cluster of 62 dual-quad core nodes. I was also using the combiner optimization to decrease the intermediate data. A million iterations ran under a minute, but the billion iterations ran for > 40 hours. At the end of it, I killed the job but the cleanup seemed to be taking forever.

    The framework made > 3 million files at each node in the same directory, and all the inode tables were fragmented. In fact, counting the number of files itself took 20 mins.

    I am wondering if there is a parameter that can be tweaked so that the intermediate map outputs are spilled infrequently and appended to existing tmp files instead of creating new ones. Note that I did not run over the default (1024) open fd limit at any point of time.

    Or does the hadoop framework need to be changed in order to do this?

  • aaron / April 06, 2009 / 10:33 AM

    Hi Abhishek,

    That’s an interesting problem. I’m not particularly certain of the answer. Looking through the configuration, no settings jump out at me in terms of controlling the combiner process. The mapred.inmem.merge.threshold parameter defaults to 1000. So I think that after 1000 intermediate files get created, they should be merged before presentation to the combiner. I’m not sure why so many millions of files are being created.

    When you say that the mapper runs “a billion iterations,” does that mean that for every input record to the mapper, you generate a billion (k, v) pairs as output? How many input records were you processing? I’m not sure Hadoop was really designed for a 1,000,000,000:1 fanout ratio. Most MapReduce jobs characteristically have less data output than input.

    You might have better luck getting a solution from the Hadoop Core mailing list; sign up at http://hadoop.apache.org/core/mailing_lists.html. Not only are there more people there who can help, but the list format is also much better suited to the back-and-forth required to diagnose these sorts of issues.

    Regards,
    - Aaron Kimball

  • Alan / May 06, 2010 / 11:57 AM

    I cannot get the ulimit change to be permanent in Ubuntu 9.10. I edited the /etc/security/limits.conf file to contain " hard nofile 16384" then logged off and logged back on. But "ulimit -a" still shows a nofile limit of 1024. Any suggestions on how to make this permanent? (Executing "ulimit -n 16384" did work within its terminal window.)

  • Pavan Kulkarni / August 30, 2012 / 9:05 AM

    Alan,

    You need to reboot the system for the change to be picked up. The ulimit you set in the terminal applies only to that session, which is why it showed the updated value there.
