Cloudera Engineering has developed (and recently open sourced) a distributed unit testing framework that cuts testing time from multiple hours to just 10 minutes.
Upstream unit tests are Cloudera’s first line of defense for finding and fixing software bugs, as part of a multidimensional process that also includes static/dynamic code analysis, fault injection, integration/scale/endurance testing, and validation on real workloads. However, running a full unit test suite for Apache Hadoop ecosystem components can take hours, or even days. Consequently, the test suite is unworkable as an on-demand tool for developers, leading to a number of issues:
- Lower software quality, since the full test suite isn’t run as a pre-commit check
- Lower developer productivity, as they wait for unit test runs
- Difficulty reproducing flaky test failures (due to race conditions, for example)
- A resistance to adding tests, particularly long-running end-to-end or fuzz testing, since it further increases the runtime of the test suite. This practice lowers software quality.
- Related to the above, developer time wasted optimizing the runtime of the test suite
- Treating the unit test suite as a burden, since it takes too long to run locally, and kicking off test runs on Jenkins involves extra work and leaving the command line
In this post, I will describe a developer-oriented distributed testing framework created by Cloudera Engineering, which we recently open sourced, that makes running unit tests for Hadoop ecosystem components fast, easy, and accessible—running the full test suite in about 10 minutes, and making it as simple as running the tests locally with
mvn test or
ctest. Furthermore, this framework boosts developer productivity by enabling brand-new workflows, such as quickly reproducing flaky test failures without having to worry about increasing test-suite runtimes.
This distributed testing infrastructure started out as a Cloudera hackathon project in 2014. Todd Lipcon and I worked on a shared backend for running test tasks on a cluster, with Todd focusing on onboarding the Apache Kudu (incubating) tests, and myself on Apache Hadoop. Our prototype implementation reduced the runtime of the 1,700+ Hadoop unit tests from 8.5 hours to 15 minutes.
Since then, we’ve spent time improving the infrastructure and on-boarding additional projects. Besides Kudu and Hadoop, our distributed testing infrastructure is also being used by our Apache Hive and Apache HBase teams. We can now run all the Hadoop unit tests in less than 10 minutes!
Finally, we’re happy to announce that both our infrastructure and code are public! You can browse the webUI at http://dist-test.cloudera.org and see all the source code (ASLv2 licensed) at the cloudera/dist_test github repository. This infrastructure is already being used at upstream Apache to run the Kudu pre-commit tests.
Sidebar: Why Not Parallel Testing?
Test execution frameworks like Surefire and gtest support parallel test execution modes, which allow a single machine to run multiple unit tests in parallel. We have experience with parallel testing, and found it lacking in a number of ways:
Perhaps most damning criticism, though, is that work spent on parallel testing work is not reusable across projects. Parallelizing one project’s unit tests does not help the other projects in CDH.
The distributed testing infrastructure has two components: the front end and the back end.
The backend stores test dependencies (binaries, JAR files, configs, etc), runs test tasks, and provides APIs for submitting test tasks and monitoring ongoing test runs.
dist_test front scans the on-disk project state and submits test tasks to the backend.
Front ends are language- and test framework-dependent. We’ve developed a front end for Kudu, a C++ project using gtest, and a front end named Grind for Java projects using Maven and Surefire.
Front ends are responsible for enumerating the tests in a project, determining each test’s runtime dependencies, and how to invoke the test. Then, the front end uploads the test dependencies to the back end and submits the tests to be run as a job.
dist_test back end. Test tasks are placed in a Beanstalk queue. Slaves consume these tasks and download dependencies from the Isolate server. Task state is stored in MySQL.
.isolatefile format, for specifying the dependencies of a test and how to invoke it
- The Isolate server, a content-addressed blobstore for serving test dependencies and isolate test descriptions
run_isolated.py, which downloads and runs a test from the Isolate server
However, these pieces still need to be glued together and fronted by an API, which is done by the
dist_test server and slaves.
The master is the user’s entry point to the cluster, and provides REST APIs and web pages for submitting test jobs and viewing job status. The master also schedules tasks to be run on the slaves. Since our infrastructure runs on the cloud, slaves are provisioned on-demand based on queue length.
We reuse some off-the-shelf software components like beanstalkd for a work queue and MySQL to persist master state. We also use S3 to store stdout, stderr, and other test artifacts.
Let’s demonstrate how developers interact with the distributed testing infrastructure. Detailed setup instructions are available in the github repo.
For these examples, we’ll be using Grind, our front end for Maven projects using Surefire and JUnit for test execution. Grind expects to be run from the root of the Maven project, and exposes an interface broadly similar to mvn test. These examples are being run on the example Maven project included with Grind.
Running the tests is as simple as calling
grind test. Grind will crawl your project to determine the dependencies, create
isolate tasks, and submit the tasks to the back end to be run on the cluster.
It looks like some of the tests in our run failed. Let’s click on the job URL to view the job status page.
TestFailSometimes failed. Now, for some fun stuff: let’s re-run these failed tests, specifying
-r 3 to retry up to three times on failure. This step helps us determine if these tests are failing consistently or are flaky.
The job view for this page indicates that
TestFailAlways failed all three times and was thus marked as a failure, but
TestFailSometimes succeeded this time. This result means that
TestFailAlways is likely a consistent failure, and
TestFailSometimes is a flake.
One cool server feature is the trace view, which shows when tasks were run on slaves. In this case, we see the retries for
TestFailAlways being scheduled and run as it fails.
So, we know that
TestFailSometimes is a flaky test. Let’s demonstrate another cool feature and run
TestFailSometimes 1,000 times to collect some test logs and determine its failure rate. We can do that by passing
-n 1000 to Grind.
TestFailSometimes 1,000 times in 22 seconds, and see that it failed 508 out of 1,000 times. We have the option of downloading all the test logs, but can also view them in-browser on the job page:
Hopefully, this post has provided some insight into how and why developers at Cloudera use distributed testing infrastructure. We’ve added distributed test runs to our internal pre-commit checks, and developers use the framework extensively for debugging flaky unit test runs and over the course of normal development.
However, I’m most proud of how the unit test suite has been revitalized as a useful developer tool. Cloudera is building an engineering culture that emphasizes software quality and developer productivity. This distributed testing framework is one part of that overarching goal.
If you’d like to hear more about this distributed testing infrastructure (and how it can be applied to your project!), I’ll be speaking on “Happier Developers and Happier Software through Distributed Testing” at Apache: Big Data 2016 in Vancouver, BC, on May 9.
Special thanks to Todd Lipcon for the collaboration, code reviews, and maintaining the shared infra, and also Sergio Peña for working on the Hive integration.
Andrew Wang is a Software Engineer at Cloudera, and a committer/PMC member on the Apache Hadoop project.