Testing Apache Hadoop

Categories: General Hadoop

As a developer coming to Apache Hadoop it is important to understand how testing is organized in the project. For the most part it is simple — it’s really just a lot of JUnit tests — but there are some aspects that are not so well known.

Running Hadoop Unit Tests

Let’s have a look at some of the tests in Hadoop Core, and see how to run them. First check out the Hadoop Core source, and from the top-level directory type the following command. (Warning: it will take a few hours to run the whole test suite, so you may not want to do this straight away.)

The test target runs all of the core tests and the tests in contrib, by depending on test-core and test-contrib. Often, you only want to run one of these targets, and, furthermore, you may only want to run a single unit test, when you are working on a single class, for instance. To do this, use the testcase property. For example, to only run the TestConfiguration test, use the following (which will take a matter of seconds to run):

You can use Ant wildcards in the spec:

This will test TestFileInputFormat, TestFileOutputFormat and so on (but not TestFileInputFormatPathFilter).

If you want to do more complex includes and excludes, then use the test.include and test.exclude properties:

This will run the same tests as before, except it will exclude TestFileOutputFormat.

The default value of test.include is Test*, so when you don’t specify any of these properties, you run all the tests beginning with “Test”. Some tests are included which don’t begin with “Test”: they are intended to be run manually. For example, Jets3tS3FileSystemContractTest, which runs against Amazon S3 (and therefore costs money) can be run like this:

(For this one you will need to set up some S3 credentials in src/test/hadoop-site.xml, as explained on the Hadoop Wiki).

Turning to the contrib modules, there are a couple of ways to run the tests for a single module. From the top-level you can just filter appropriately. For example, to run just the streaming tests:

Alternatively, you can change to the contrib’s subdirectory and invoke ant test from there.

Writing Tests In Hadoop

If you are adding a feature or fixing a bug, you need to write a test for the change. When you submit a patch, as well as running the test suite Hudson will look for new or changed tests, and if none are present will mark the issue with a -1 vote. Only if it is unreasonably difficult to test the change will not writing a test for the feature be justified.

Hadoop provides a few useful tools and extension points for writing tests for testing Hadoop itself. Chief among the test classes to know about are the mini clusters, MiniDFSCluster and MiniMRCluster. These start HDFS and MapReduce clusters where each daemon runs in its own thread rather than its own process. Note however that MapReduce tasks run in a separate JVM to the tasktracker, so they are not in the same JVM as the test which launched them. This is important to understand if you want to communicate state between the task and the test, as you need to do so using the local filesystem or similar.

Many tests need to run a MapReduce job, and to do this the ClusterMapReduceTestCase base class is especially convenient as it takes care of setting up and tearing down the mini cluster for you.

For more information on these classes have a look at their source code (there is no published documentation on them).

Why do the tests take so long to run?

Having the mini clusters is a great boon as it allows us to write tests that exercise the whole Hadoop stack, but this comes at the price of test execution time. The current time to run the tests is over three hours, which means that developers don’t run the tests as often as they might. Kent Beck, in his book Extreme Programming Explained advocated the 10 minute build, since it allows the developer to get rapid feedback on the validity of changes they make to the system. The three hours it takes for the Hadoop build forces developers to work on something else while the tests run, and when failures occur, the context switch back to the previous task inevitably takes longer.

So, how can we shorten the run times? There is a real tension here as the last thing we want to do is discourage developers from writing tests. Yet, if we keep growing the suite of tests as we have done, then the time to run them all will become prohibitive. Here are a few ideas to improve things:

  • It’s often the case that each test doesn’t need its own brand new mini cluster. To pick a random example, TestJobName has two test methods that could easily test using the same mini cluster without fear of interacting (they run serially anyway).
  • Many tests don’t need the full machinery of the mini clusters to run against. There is a lot of scope for making a test more focused, and writing it as a true unit test, i.e. one that tests the unit and not the whole system. Introducing a mock framework would facilitate this.
  • HADOOP-1257 is a nice idea to use Hadoop itself to run the unit tests in parallel. Developers with access to a cluster will be able to run the tests in a fraction of the time.
  • There are a few tests which take a disproportionate amount of time to run. It would be wise to focus on making these faster to run. Here are the top 10 from a recent run (times in seconds):

  • Hudson provides a breakdown of test times by package too, which you can sort by clicking on the “Duration” column.
  • Clover is a test coverage tool, which used to run as a part of the Hudson nightly build (and will be re-instated soon, hopefully). The latest version has a feature called Test Optimization that allows you test “only what you need”. I don’t know how this works, but it looks intriguing.
  • HDFS and MapReduce are being split out of core, and this may give us an opportunity to improve things, not least because the project will be decomposed into smaller chunks. Imagine if we had a set of unit tests for each project that could run in 10 minutes or less, and a set of integration tests that took longer to run, but tested the whole system. The smaller set of tests would be useful during development of a new feature or bugfix, and once this passed it would be acceptable to submit the patch to Hudson, which would run the integration tests. This would be a good place to be.

I plan to create Jiras to improve some of these aspects of testing in Hadoop. Comments, suggestions and help are all welcome.


5 responses on “Testing Apache Hadoop

  1. tom


    Thanks for the link to the post on Clover’s Test Optimization, very interesting. Lots of Hadoop’s tests launch a new JVM for the mapper and reducer code, but this may not matter too much as Clover will still have a good idea of the tests that run a particular application class’s code. This could be improved by changing the tests to run tasks in the same JVM, see https://issues.apache.org/jira/browse/HADOOP-3675.


  2. Steve Loughran

    the 3h test cycles really irritate me too. There is some stuff in ant 1.7 for recording which tests fail and rerunning just those, so you could work on the failing tests. But then you need to go back to the big run. Maybe we could split into fast+slow, and so do test runs that run all fast tests and if that works, run the slow set.

  3. Nick Pellow

    Hi Steve,

    Part of Clover’s Test Optimization is to automatically re-order your tests from fastest to slowest so that the failures are detected as early as possible in the build. This removes the need to manually separate fast and slow tests in Hadoop.

    Being Open Source, Hadoop is eligible for a free Clover2 license. I’ll also have a look at integrating Clover into the Hadoop build.