Automatically Documenting Apache Hadoop Configuration

Ari Rabkin is a summer intern at Cloudera, working with the engineering team to help make Hadoop more usable and simpler to configure. The rest of the year, Ari is a PhD student at UC Berkeley. He’s applying the results of recent research to automatically find and document configuration options for Hadoop.

Background

Hadoop uses a key-value style of configuration, where each option has a name and a value. There is no central list of options, and it’s easy for developers to add new options as needed. Unfortunately, this opens the way for bugs and for erroneous documentation: not all documented options exist, and many options are undocumented. An option can also have different default values in the code and in the configuration files, in which case the config-file default wins.
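
To make the mechanics concrete, here is a minimal sketch of how an option is typically read through Hadoop’s Configuration class; the option name and in-code default shown are illustrative, not a recommendation.

    import org.apache.hadoop.conf.Configuration;

    public class ReadOptionExample {
        public static void main(String[] args) {
            // Loads core-default.xml and core-site.xml from the classpath.
            Configuration conf = new Configuration();

            // The second argument is the in-code default. It applies only if no
            // loaded configuration file defines the key; if a *-default.xml or
            // *-site.xml file sets it, the file's value wins.
            int handlers = conf.getInt("dfs.namenode.handler.count", 10);
            System.out.println("dfs.namenode.handler.count = " + handlers);
        }
    }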

Without automated tools, configuration mistakes can be quite hard to track down. Imagine the user who tries to set dfs.datanode.max.xcievers, but accidentally types xceivers, a mistake that is easy for a machine to diagnose but hard for a human to spot. This case is particularly insidious, both because the real option name is peculiarly spelled and because the typo causes no immediate, overt failure. The cluster simply uses the default value, and the resulting loss of performance and reliability can be hard to trace back to the configuration.
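
A toy demonstration of that failure mode, calling the Configuration API directly (in a real cluster the misspelled name would sit in a site XML file instead):

    import org.apache.hadoop.conf.Configuration;

    public class MisspelledOptionDemo {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // The user meant to raise the transceiver limit, but typed the
            // ordinary English spelling of the key...
            conf.set("dfs.datanode.max.xceivers", "4096");

            // ...while the DataNode reads the peculiarly spelled real key, so it
            // silently falls back to the in-code default (historically 256).
            int xcievers = conf.getInt("dfs.datanode.max.xcievers", 256);
            System.out.println("effective value: " + xcievers);  // prints 256
        }
    }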

As I’ll show in the rest of this post, we’re using advanced techniques to help fix this problem, for both developers and users of the Hadoop platform.

Applying cutting-edge techniques

One way to find out which options Hadoop really supports would be to modify the code to log whenever an option is loaded, and then run Hadoop in many environments with many configurations. Unfortunately, many options are only used in special circumstances, making it hard to get good coverage. We took a different approach, using static program analysis: we automatically analyze the compiled bytecode, without running it. Static analysis of a program with the size and complexity of Hadoop is far from routine; reflection, remote procedure calls, and dynamic classloading make it technically challenging. It works decently well, however, thanks to years of research on analyzing reflection-heavy Java programs.

Results

Here is a table, for Apache Hadoop 0.20.2, of which options are read, where they are read, whether they are documented, and what the default values are, both in configuration files and in the code. Here is another such table, for Cloudera’s CDH3u0. And here’s the current development trunk.

For CDH3, we supplemented the static analysis approach with dynamic instrumentation. This both helps gauge the accuracy of the static results and supplies more precise information about which Hadoop components read which options.
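
The idea behind logging option reads can be sketched with a Configuration subclass like the one below. This is not the instrumentation we actually used, and not every typed getter funnels through these two overloads in every Hadoop version, so treat it purely as an illustration of the concept.

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical sketch: record every configuration key as it is looked up.
    public class LoggingConfiguration extends Configuration {
        @Override
        public String get(String name) {
            System.err.println("[config-read] " + name);
            return super.get(name);
        }

        @Override
        public String get(String name, String defaultValue) {
            System.err.println("[config-read] " + name);
            return super.get(name, defaultValue);
        }
    }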

It’s interesting to compare the total numbers of options in each. Apache Hadoop 0.20.2 had about 300 options. CDH3 boosted this to over 500, primarily by incorporating many new security-related options. Apache trunk is somewhere in between, with around 430. Again, security is a big part of the difference, since it wasn’t in Apache Hadoop 0.20.2, but will be in the next Apache release.

Why it matters

There are several ways that Cloudera is using the results of this analysis to improve Hadoop and to help customers.

  • Improving the platform: We used the results of this analysis to find undocumented options, which have now been documented for users. We also noticed that some undocumented options shouldn’t exist at all: they had been renamed, and the old names should have been removed. We fixed those as well. Previous versions of Hadoop sometimes documented options that no longer existed, having been renamed or removed outright; happily, we found no such options in trunk, though we’ll keep an eye open for them in the future.
  • Guiding Cloudera Enterprise development: The option listings linked to above have been used by the Cloudera Service and Configuration Manager (SCM) developers to check that SCM correctly handles all the known options in Hadoop. By pinpointing where in the code an option is read and which daemons may read it, the analysis helps SCM avoid mistakes such as setting MapReduce options on HDFS-only nodes.
  • Debugging user configurations: We also use these option listings to help diagnose user problems. The results of the analysis can be thought of as a “spell check dictionary” for configuration options: if a user sends us a configuration file, we can run it through our configuration spellchecker and verify that no option names are misspelled. (A minimal sketch of the idea follows this list.)
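
Here is a minimal sketch of that spell-check idea, assuming a known-options dictionary produced by the analysis and flagging near-misses by edit distance. The option names and threshold below are placeholders; the real tool reads the generated tables and the user’s XML files.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ConfigSpellChecker {

        // Standard Levenshtein edit distance between two strings.
        static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int subst = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + subst);
                }
            }
            return d[a.length()][b.length()];
        }

        public static void main(String[] args) {
            // Placeholder dictionary; the real list comes from the analysis output.
            Set<String> knownOptions = new HashSet<String>(Arrays.asList(
                    "dfs.datanode.max.xcievers", "dfs.replication"));

            // Placeholder user keys; the real tool parses the user's *-site.xml files.
            List<String> userKeys = Arrays.asList("dfs.datanode.max.xceivers");

            for (String key : userKeys) {
                if (knownOptions.contains(key)) continue;
                for (String known : knownOptions) {
                    if (editDistance(key, known) <= 2) {
                        System.out.println("Unknown option '" + key
                                + "'; did you mean '" + known + "'?");
                    }
                }
            }
        }
    }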

Why so many undocumented options?

The bottom-line figures show that Hadoop has many undocumented options — often over 100. This may be surprising, but there are good reasons for it.

  • Many Hadoop options are only there for the benefit of unit tests. For example, there’s no reason why users would want to configure dfs.client.block.write.locateFollowingBlock.retries, but unit tests need to set it to drive Hadoop into otherwise-rare corner cases.
  • Some options are deprecated. For example, dfs.network.script is no longer a supported option, but the code checks if users set it, in order to print a deprecation warning. There’s no reason to document the old stale name.
  • Some options might not be needed. Developers often use options as a sort of configurable named constant. Most complex programs, including Hadoop, are full of parameters that are theoretically tunable but for which developers have no insight into what the right value is; the operating system, for example, offers many options controlling the details of TCP connection management. In most cases there is no reason to prefer one value over another, and offering the control to users would add more complexity and confusion than benefit, so the option goes undocumented. But the configurability might conceivably be useful in some circumstance, so rather than compiling in a constant, developers read a configuration option. In the event of a production failure, engineers reading the code can then find and tune these options without recompiling and redeploying.

Limitations

The data is autogenerated and will contain mistakes. In particular, the analysis sometimes can’t tell that a particular method will be called by user code, so the point where an option is read may never be analyzed. As a result, some options are missed.

Another limitation is that the analysis is sometimes severely over-inclusive; network protocols in particular confuse it. Suppose a DataNode is listening for instructions from the NameNode. The NameNode will only ever send certain messages over the wire, but the analysis doesn’t know the limitations of the protocol, so it assumes, pessimistically, that anything could arrive, including messages handled by the MapReduce code. As a result, the analysis will sometimes claim that code is reachable from the DataNode when it really isn’t. The root cause is that the analysis can’t rule out behavior that the type system allows but that no actual Hadoop server would exhibit.

Conclusions

Despite its imprecision, this analysis has been useful in practice. Its results are being used to improve the development of the Hadoop platform, to improve configuration management in the Cloudera Enterprise tools, and to help answer user questions.

It’s been a personal thrill being able to take my academic research and put it in the hands of Cloudera engineers. Cloudera has been very supportive.

For more information

  • The static analysis is built around the open-source JChord analysis tool developed by Mayur Naik et al.

  • An academic paper describing and evaluating the approach is Rabkin and Katz, “Static Extraction of Program Configuration Options”, presented a few months ago at ICSE 2011, the International Conference on Software Engineering, in Honolulu.

7 Responses
  • Praveen / August 10, 2011 / 9:30 AM

    Ari,

It’s very interesting. Is it something which can be open-sourced and integrated with the Apache Hadoop build process? Looks like you have built on top of JChord.

Lots of improvements are happening around Apache Hadoop (MRv2, HDFS HA, etc.) and I am sure that a lot of configuration parameters are being added to the code.

    Thanks,
    Praveen

  • Ari Rabkin / August 11, 2011 / 11:26 AM

Yes, I expect that all the code will be open-sourced. The core analysis is already public, and the scripts around it should be shared soon too, as soon as I can clean them up. I would like to have this done regularly during the build and development process.

  • Michael Dorf / August 29, 2011 / 12:38 PM

    Very handy article, Ari. You can almost use the table you generated as a cheat-sheet for all the Hadoop options available. I know I will! Thanks a lot!

  • Praveen / November 27, 2011 / 11:53 PM

Since 0.23 has been released, can this be open-sourced and integrated with the build process? Some of the parameters used in the code are not documented.

  • Mike / December 19, 2011 / 9:06 AM

Would love to see this kept up. This is the best source of config params for Hadoop generally (and CDH specifically) that I’ve found.
