As Hadoop adoption increases among organizations, companies, and individuals, and as it makes its way into production, testing MapReduce (MR) jobs becomes more and more important. By regularly running tests on your MR jobs–either invoked by developers before they commit a change or by a continuous integration server such as hudson–an engineering organization can catch bugs early, strive for quality, and make developing and maintaining MR jobs easier and faster.
MR jobs are particularly difficult to test thoroughly because they run in a distributed environment. This post will give specific advice on how an engineering team might QA test its MR jobs. Note that Chapter 5 of Hadoop: The Definitive Guide gives specific code examples for testing an MR job.
As is the case with most testing scenarios, there are certain practices one can follow that have a low barrier to entry; such practices might do a fairly sufficient job of testing. There are also practices one can follow that are more complicated but perhaps result in more thorough testing. Let’s walk through some good QA practices, starting with the easiest and ending with the most complicated.
Traditional Unit Tests – JUnit, PyUnit, Etc.
Your MR job will probably have some functionality that can be tested in isolation using a unit-testing framework such as JUnit or PyUnit. For example, if your MR job does some document parsing in Java, your parse method can be tested using JUnit.
Using a traditional unit-testing framework is perhaps the easiest way to get started testing your MR jobs for a few reasons. First, they are already used by a huge collection of developers. Second, they can be invoked by and integrated into most popular continuous integration servers. Finally, they are simple, effective, and don’t require Hadoop daemons to be running.
Unit tests do a great job of testing each individual part of your MR job, but they do not test your MR job as a whole, and they do not test your MR job within Hadoop.
To start using traditional unit tests to improve the quality of your MR jobs, simply write your map and reduce functions in such a way that functionality is extracted from these functions into “private” helper functions (or classes), which can be tested in isolation. For example, put all your parse code in a “parse” method. Then, choose a unit-testing framework that fits your use case.
Most Python unit testing uses PyUnit, and most Java unit testing uses either JUnit or TestNG. Most popular programming languages have a standard unit-testing tool, so do some research to learn which framework is best for your needs. Finally, write tests in the framework of choice that thoroughly test each helper function you’ve defined for your map and reduce functions. Unit tests can then be run either on the local machine or on a continuous integration to ensure that the post conditions of the tested helper functions are met for a particular input.
MRUnit – Unit Testing for MR Jobs
MRUnit is a tool that was developed here at Cloudera and released back to the Apache Hadoop project. It can be used to unit-test map and reduce functions. MRUnit lets you define key-value pairs to be given to map and reduce functions, and it tests that the correct key-value pairs are emitted from each of these functions. MRUnit tests are similar to traditional unit tests in that they are simple, isolated, and don’t require Hadoop daemons to be running. Aaron Kimball has written a very detailed blog post about MRUnit here.
Local Job Runner Testing – Running MR Jobs on a Single Machine in a Single JVM
Traditional unit tests and MRUnit should do a fairly sufficient job detecting bugs early, but neither will test your MR jobs with Hadoop. The local job runner lets you run Hadoop on a local machine, in one JVM, making MR jobs a little easier to debug in the case of a job failing.
To enable the local job runner, set “mapred.job.tracker” to “local” and “fs.default.name” to “file:///some/local/path” (these are the default values).
Remember, there is no need to start any Hadoop daemons when using the local job runner. Running bin/hadoop will start a JVM and will run your job for you. Creating a new hadoop-local.xml file (or mapred-local.xml and hdfs-local.xml if you’re using 0.20) probably makes sense. You can then use the –config parameter to tell bin/hadoop which configuration directory to use. If you’d rather avoid fiddling with configuration files, you can create a class that implements Tool and uses ToolRunner, and then run this class with bin/hadoop jar foo.jar com.example.Bar -D mapred.job.tracker=local -D fs.default.name=file:/// (args), where Bar is the Tool implementation.
To start using the local job runner to test your MR jobs in Hadoop, create a new configuration directory that is local job runner enabled and invoke your job as you normally would, remembering to include the –config parameter, which points to a directory containing your local configuration files.
The -conf parameter also works in 0.18.3 and lets you specify your hadoop-local.xml file instead of specifying a directory with –config. Hadoop will run the job happily. The difficulty with this form of testing is verifying that the job ran correctly. Note: you’ll have to ensure that input files are set up correctly and output directories don’t exist before running the job.
Assuming you’ve managed to configure the local job runner and get a job running, you’ll have to verify that your job completed correctly. Simply basing success on exit codes isn’t quite good enough. At the very least, you’ll want to verify that the output of your job is correct. You may also want to scan the output of bin/hadoop for exceptions. You should create a script or unit test that sets up preconditions, runs the job, diffs actual output and expected output, and scans for raised exceptions. This script or unit test can then exit with the appropriate status and output specific messages explaining how the job failed.
Pseudo-distributed Testing – Running MR Jobs on a Single Machine Using Daemons
The local job runner lets you run your job in a single thread. Running an MR job in a single thread is useful for debugging, but it doesn’t properly simulate a real cluster with several Hadoop daemons running (e.g., NameNode, DataNode, TaskTracker, JobTracker, SecondaryNameNode). A pseudo-distributed cluster is composed of a single machine running all Hadoop daemons. This cluster is still relatively easy to manage (though harder than local job runner) and tests integration with Hadoop better than the local job runner does.
To start using a pseudo-distributed cluster to test your MR jobs in Hadoop, follow the aforementioned advice for using the local job runner, but in your precondition setup include the configuration and start-up of all Hadoop daemons. Then, to start your job, just use bin/hadoop as you would normally.
Full Integration Testing – Running MR Jobs on a QA Cluster
Probably the most thorough yet most cumbersome mechanism for testing your MR jobs is to run them on a QA cluster composed of at least a few machines. By running your MR jobs on a QA cluster, you’ll be testing all aspects of both your job and its integration with Hadoop.
Running your jobs on a QA cluster has many of the same issues as the local job runner. Namely, you’ll have to check the output of your job for correctness. You may also want to scan the stdin and stdout produced by each task attempt, which will require collecting these logs to a central place and grepping them. Scribe is a useful tool for collecting logs, though it may be superfluous depending on your QA cluster.
We find that most of our customers have some sort of QA or development cluster where they can deploy and test new jobs, try out newer versions of Hadoop, and practice upgrading clusters from one version of Hadoop to another. If Hadoop is a major part of your production pipeline, then creating a QA or development cluster makes a lot of sense, and repeatedly running jobs on it will ensure that changes to your jobs continue to get tested thoroughly. EC2 may be a good host for your QA cluster, as you can bring it up and down on demand. Take a look at our beta EC2 EBS Hadoop scripts if you’re interested in creating a QA cluster in EC2.
You should choose QA practices based on the importance of QA for your organization and also on the amount of resources you have. Simply using a traditional unit-testing framework, MRUnit and the local job runner can test your MR jobs thoroughly in a simple way without using too many resources. However, running your jobs on a QA or development cluster is naturally the best way to fully test your MR jobs with the expenses and operational tasks of a Hadoop cluster.
Do you have any helpful advice on beneficial QA practices for MR jobs? Leave a comment :).