Quality Assurance at Cloudera: Distributed Unit Testing

Categories: Kudu, Testing, Tools

Cloudera Engineering has developed (and recently open sourced) a distributed unit testing framework that cuts testing time from multiple hours to just 10 minutes.

Upstream unit tests are Cloudera’s first line of defense for finding and fixing software bugs, as part of a multidimensional process that also includes static/dynamic code analysis, fault injection, integration/scale/endurance testing, and validation on real workloads. However, running a full unit test suite for Apache Hadoop ecosystem components can take hours, or even days. Consequently, the test suite is unworkable as an on-demand tool for developers, leading to a number of issues:

  • Lower software quality, since the full test suite isn’t run as a pre-commit check
  • Lower developer productivity, as developers wait for unit test runs
  • Difficulty reproducing flaky test failures (due to race conditions, for example)
  • Resistance to adding tests, particularly long-running end-to-end or fuzz tests, since they further increase the runtime of the test suite, which in turn lowers software quality
  • Related to the above, developer time wasted optimizing the runtime of the test suite
  • Treating the unit test suite as a burden, since it takes too long to run locally, and kicking off test runs on Jenkins involves extra work and leaving the command line

In this post, I will describe a developer-oriented distributed testing framework created by Cloudera Engineering, which we recently open sourced, that makes running unit tests for Hadoop ecosystem components fast, easy, and accessible—running the full test suite in about 10 minutes, and making it as simple as running the tests locally with mvn test or ctest. Furthermore, this framework boosts developer productivity by enabling brand-new workflows, such as quickly reproducing flaky test failures without having to worry about increasing test-suite runtimes.

Background

This distributed testing infrastructure started out as a Cloudera hackathon project in 2014. Todd Lipcon and I worked on a shared backend for running test tasks on a cluster, with Todd focusing on onboarding the Apache Kudu (incubating) tests, and myself on Apache Hadoop. Our prototype implementation reduced the runtime of the 1,700+ Hadoop unit tests from 8.5 hours to 15 minutes.

Since then, we’ve spent time improving the infrastructure and on-boarding additional projects. Besides Kudu and Hadoop, our distributed testing infrastructure is also being used by our Apache Hive and Apache HBase teams. We can now run all the Hadoop unit tests in less than 10 minutes!

Finally, we’re happy to announce that both our infrastructure and code are public! You can browse the web UI at http://dist-test.cloudera.org and see all the source code (ASLv2 licensed) at the cloudera/dist_test GitHub repository. This infrastructure is already being used upstream at Apache to run the Kudu pre-commit tests.

Sidebar: Why Not Parallel Testing?

Test execution frameworks like Surefire and gtest support parallel test execution modes, which allow a single machine to run multiple unit tests in parallel. We have experience with parallel testing, and found it lacking in a number of ways:

  • Parallel testing is limited by the parallelism of a single machine, meaning it would still take too long to run large unit test suites.
  • Parallel testing requires significant reworking of the unit test suite to avoid shared state like temp directories and port numbers. These issues were prevalent in the Hadoop unit tests, and likely true of unit tests of other distributed systems.
  • Parallel testing led to test timeouts from increased load. This was exacerbated for Hadoop, since individual Hadoop unit tests are already multi-threaded.

Perhaps the most damning criticism, though, is that work spent on parallel testing is not reusable across projects. Parallelizing one project’s unit tests does not help the other projects in CDH.

Architecture

The distributed testing infrastructure has two components: the front end and the back end.
The back end stores test dependencies (binaries, JAR files, configs, etc.), runs test tasks, and provides APIs for submitting test tasks and monitoring ongoing runs.

[Figure: The dist_test front end scans the on-disk project state and submits test tasks to the back end.]

Front ends are language- and test framework-dependent. We’ve developed a front end for Kudu, a C++ project using gtest, and a front end named Grind for Java projects using Maven and Surefire.

Front ends are responsible for enumerating the tests in a project, determining each test’s runtime dependencies, and determining how to invoke each test. The front end then uploads the test dependencies to the back end and submits the tests to be run as a job.
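
To make this flow concrete, here is a rough Python sketch of a front end. Everything in it (the helper logic, the isolate_client and dist_test_client objects, and their upload/submit methods) is a hypothetical placeholder for illustration; it is not the actual Grind or Kudu front-end code.

    # Hypothetical sketch of a dist_test front end; every name below is an
    # illustrative placeholder, not actual Grind or Kudu code.
    import glob
    import os

    def enumerate_tests(project_root):
        """Find JUnit test classes and describe how to run each one."""
        classes = os.path.join(project_root, "target", "classes")
        test_classes = os.path.join(project_root, "target", "test-classes")
        src_root = os.path.join(project_root, "src", "test", "java")
        for path in glob.glob(os.path.join(src_root, "**", "*Test.java"), recursive=True):
            # Turn src/test/java/com/example/FooTest.java into com.example.FooTest.
            rel = os.path.relpath(path, src_root)
            name = os.path.splitext(rel)[0].replace(os.sep, ".")
            yield {
                "name": name,
                # Runtime dependencies; a real front end would also include JARs,
                # configuration files, and test resources.
                "files": [classes, test_classes],
                "command": ["java", "-cp", classes + os.pathsep + test_classes,
                            "org.junit.runner.JUnitCore", name],
            }

    def submit_job(project_root, isolate_client, dist_test_client):
        """Upload each test's dependencies, then submit all tasks as one job."""
        tasks = []
        for test in enumerate_tests(project_root):
            # The Isolate server is content-addressed, so unchanged files are
            # only uploaded once across runs.
            digest = isolate_client.upload(files=test["files"], command=test["command"])
            tasks.append({"name": test["name"], "isolate_hash": digest})
        return dist_test_client.submit(tasks)  # hypothetical API; returns a job URL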

[Figure: The dist_test back end. Test tasks are placed in a Beanstalk queue. Slaves consume these tasks and download dependencies from the Isolate server. Task state is stored in MySQL.]

The dist_test backend reuses some distributed testing building blocks developed by Chromium: Swarming and its Go rewrite, Luci. These provide:

  • The .isolate file format, for specifying the dependencies of a test and how to invoke it (sketched just after this list)
  • The Isolate server, a content-addressed blobstore for serving test dependencies and isolate test descriptions
  • run_isolated.py, which downloads and runs a test from the Isolate server
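
To make the first of these concrete, here is roughly what a minimal .isolate description looks like. The script, binary, and data paths are placeholders rather than anything taken from the Kudu or Grind front ends, and the exact set of keys can vary between isolate versions.

    {
      'variables': {
        # What to run once the files below have been downloaded.
        'command': ['bin/run-my-test.sh'],
        # Everything the test needs at runtime, relative to this file.
        'files': [
          'bin/run-my-test.sh',
          'build/my-test-binary',
          'testdata/',
        ],
      },
    }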

However, these pieces still need to be glued together and fronted by an API, which is done by the dist_test server and slaves.

The master is the user’s entry point to the cluster, and provides REST APIs and web pages for submitting test jobs and viewing job status. The master also schedules tasks to be run on the slaves. Since our infrastructure runs in the cloud, slaves are provisioned on demand based on queue length.

We reuse some off-the-shelf software components like beanstalkd for a work queue and MySQL to persist master state. We also use S3 to store stdout, stderr, and other test artifacts.
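
To illustrate how these pieces fit together on the slave side, here is a rough sketch of the consume-and-run loop, using the beanstalkc Python client as one way to talk to beanstalkd. The record_result hook, the task payload shape, and the run_isolated.py flags are assumptions for illustration, not the real dist_test slave code.

    # Hypothetical sketch of a slave's consume-and-run loop; the real slave,
    # run_isolated.py flags, and result storage all differ in detail.
    import json
    import subprocess
    import beanstalkc  # a Python client library for beanstalkd

    def run_slave(queue_host, isolate_server, record_result):
        """Pull test tasks off the queue and run them via run_isolated.py."""
        queue = beanstalkc.Connection(host=queue_host, port=11300)
        while True:
            job = queue.reserve()        # blocks until a task is available
            task = json.loads(job.body)  # assumed shape: {"name": ..., "isolate_hash": ...}
            log_path = task["name"] + ".log"
            with open(log_path, "w") as log:
                # Download the test's dependencies from the Isolate server and run it.
                # (Flag names are illustrative; see run_isolated.py --help.)
                returncode = subprocess.call(
                    ["python", "run_isolated.py",
                     "--isolated", task["isolate_hash"],
                     "--isolate-server", isolate_server],
                    stdout=log, stderr=subprocess.STDOUT)
            # Caller-supplied hook: record pass/fail in MySQL, push logs to S3.
            record_result(task["name"], returncode, log_path)
            job.delete()                 # acknowledge so the task is not re-queued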

Example

Let’s demonstrate how developers interact with the distributed testing infrastructure. Detailed setup instructions are available in the GitHub repo.

For these examples, we’ll be using Grind, our front end for Maven projects that use Surefire and JUnit for test execution. Grind expects to be run from the root of the Maven project and exposes an interface broadly similar to mvn test. These examples are run against the example Maven project included with Grind.

Running the tests is as simple as calling grind test. Grind will crawl your project to determine the dependencies, create isolate tasks, and submit the tasks to the back end to be run on the cluster.
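
Concretely, from the root of the example project (the path below is an assumption; the repo’s setup docs have the exact layout):

    $ cd dist_test/grind/example    # the example Maven project bundled with Grind (path assumed)
    $ grind test                    # crawl the project, upload dependencies, run on the cluster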

It looks like some of the tests in our run failed. Let’s click on the job URL to view the job status page.

[Figure: The job status page for this run.]

TestFailAlways and TestFailSometimes failed. Now, for some fun stuff: let’s re-run these failed tests, specifying -r 3 to retry up to three times on failure. This step helps us determine if these tests are failing consistently or are flaky.
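
On the command line, the retry run looks roughly like this. The -r flag matches the description above, but the test-selection syntax shown is my assumption; check Grind’s help output for the real form.

    $ grind test -r 3 TestFailAlways TestFailSometimes    # retry each failed test up to three times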

The job view for this page indicates that TestFailAlways failed all three times and was thus marked as a failure, but TestFailSometimes succeeded this time. This result means that TestFailAlways is likely a consistent failure, and TestFailSometimes is a flake.

[Figure: The job view for the retry run.]

One cool server feature is the trace view, which shows when tasks were run on slaves. In this case, we see the retries for TestFailAlways being scheduled and run as it fails.

[Figure: The trace view, showing when tasks ran on the slaves.]

So, we know that TestFailSometimes is a flaky test. Let’s demonstrate another cool feature and run TestFailSometimes 1,000 times to collect some test logs and determine its failure rate. We can do that by passing -n 1000 to Grind.
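
Again as a command line, with the same caveat about the test-selection syntax:

    $ grind test -n 1000 TestFailSometimes    # run the test 1,000 times across the cluster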

We ran TestFailSometimes 1,000 times in 22 seconds and found that it failed 508 out of 1,000 times. We have the option of downloading all the test logs, but can also view them in-browser on the job page.

Conclusion

Hopefully, this post has provided some insight into how and why developers at Cloudera use distributed testing infrastructure. We’ve added distributed test runs to our internal pre-commit checks, and developers use the framework extensively for debugging flaky unit test runs and over the course of normal development.

However, I’m most proud of how the unit test suite has been revitalized as a useful developer tool. Cloudera is building an engineering culture that emphasizes software quality and developer productivity. This distributed testing framework is one part of that overarching goal.

If you’d like to hear more about this distributed testing infrastructure (and how it can be applied to your project!), I’ll be speaking on “Happier Developers and Happier Software through Distributed Testing” at Apache: Big Data 2016 in Vancouver, BC, on May 9.

Special thanks to Todd Lipcon for the collaboration, code reviews, and maintaining the shared infra, and also Sergio Peña for working on the Hive integration.

Andrew Wang is a Software Engineer at Cloudera, and a committer/PMC member on the Apache Hadoop project.


