Cloudera Developer Blog · Testing Posts
While Apache HBase adoption for building end-user applications has skyrocketed, many of those applications (and many apps generally) have not been well-tested. In this post, you’ll learn some of the ways this testing can easily be done.
We will start with unit testing via JUnit, then move on to using Mockito and Apache MRUnit, and then to using an HBase mini-cluster for integration testing. (The HBase codebase itself is tested via a mini-cluster, so why not tap into that for upstream applications, as well?)
As a basis for discussion, let’s assume you have an HBase data access object (DAO) that does the following insert into HBase. The logic could be more complicated, of course, but for the sake of example, this does the job.
Apache HBase supports three primary client APIs that developers can use to bind applications with HBase: the Java API, the REST API, and the Thrift API. Therefore, as developers build apps against HBase, it’s very important for them to be aware of the compatibility guidelines with respect to CDH.
This blog post will describe the efforts that go into protecting the experience of a developer using the Java API. Through its testing work, Cloudera allows developers to write code and sleep well at night, knowing that their code will remain compatible through supported upgrade paths.
First, we’ll explore the compatibility guidelines themselves. From there, we will discuss some of the testing that ensures compatibility across CDH versions, as well as some of the interesting incompatibilities we’ve detected and fixed along the way.
Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.
Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.
Table 1. Summary of Hadoop workloads analyzed
At 5 pm PDT on June 30, a leap second was added to Coordinated Universal Time (UTC). Within an hour, Cloudera Support started receiving reports of systems running at 100% CPU utilization. The Support Team worked quickly to understand and diagnose the problem and soon published a solution. Bugs due to the leap second, coupled with the Amazon Web Services outage, would make this Cloudera’s busiest support weekend to date.
Since Hadoop is written in Java and closely interoperates with the underlying OS, Cloudera Support troubleshoots not only all 17 components in the Hadoop ecosystem, but also any underlying Linux and Java bugs. Last weekend many of our customers were affected by the now infamous “leap second” bugs. Initially, many assumed that Java and Linux would process the leap second gracefully. However, we soon discovered that this wasn’t the case, and depending on the version of Linux being used, several distinct issues were observed.
Leap seconds are added to UTC to correct for Earth’s slowing rotation. The latest leap second was added last Saturday (6/30) at 23:59:60 UTC (5 pm PDT). Due to a missed function call in the Linux timekeeping code, the leap second was not accounted for properly. As a result, after the leap second, timers expired one second earlier than requested. Many applications use a recurring timer of 1 second or less; such timers expired immediately, causing the application to immediately try to set another timer, ad infinitum. This infinite loop led to CPU load spikes that generated 21 separate support tickets.
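The busy loop described above can be illustrated with a toy simulation. This is only a sketch of the reasoning, not the kernel's timekeeping code: the function name, the `skew_s` parameter, and the wakeup cap are all hypothetical, and time is simulated rather than read from a real clock.

```python
def simulate_timer_loop(interval_s, skew_s, window_s=5.0, max_wakeups=100_000):
    """Count how often a recurring timer fires inside a simulated time window.

    skew_s models the bug (an assumption of this sketch): every timer
    expires skew_s seconds earlier than the application requested.
    """
    now = 0.0
    wakeups = 0
    # max_wakeups caps the loop so the simulation itself cannot spin forever.
    while now < window_s and wakeups < max_wakeups:
        # A timer can fire early, but never before the current time.
        now += max(0.0, interval_s - skew_s)
        wakeups += 1
    return wakeups


# Healthy clock: a 1-second timer fires once per simulated second.
healthy = simulate_timer_loop(interval_s=1.0, skew_s=0.0)

# One-second skew: the same timer expires immediately, simulated time never
# advances, and the loop runs until it hits the cap.
buggy = simulate_timer_loop(interval_s=1.0, skew_s=1.0)

print(healthy, buggy)
```

In the healthy case the loop wakes a handful of times and then finishes; with the skew applied, each wakeup schedules another timer that has already expired, which is the pattern that pins a CPU.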
This post was originally posted on the Apache Software Foundation’s blog.
We (the Apache MRUnit team) have just released Apache MRUnit 0.9.0-incubating (tarball, nexus, javadoc). Apache MRUnit is an Apache Incubator project: a Java library that helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify that the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they’re deployed to a production system.
The MRUnit project is quite active: 0.9.0 is our fourth release since entering the incubator, and we have added 4 new committers beyond the project’s initial charter! We are very interested in having new contributors and committers join the project! Please join our mailing list to find out how you can help!
In September 2009, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable. It’s a long road of feedback, bug fixes, QA and testing to move from testing to stable. As someone who tracks the maturity of a testing build throughout its life cycle, I’m pleased to say we’ve put a lot of polish into this release.
At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.
We plan on pushing new packages into the testing repository every 3 to 6 weeks. And it just so happens it is just about 3 weeks after we announced the first testing release. So it must be time for a new one. Here are some of the highlights:
As Hadoop adoption increases among organizations, companies, and individuals, and as it makes its way into production, testing MapReduce (MR) jobs becomes more and more important. By regularly running tests on your MR jobs (either invoked by developers before they commit a change or by a continuous integration server such as Hudson), an engineering organization can catch bugs early, strive for quality, and make developing and maintaining MR jobs easier and faster.
MR jobs are particularly difficult to test thoroughly because they run in a distributed environment. This post will give specific advice on how an engineering team might QA test its MR jobs. Note that Chapter 5 of Hadoop: The Definitive Guide gives specific code examples for testing an MR job.
As is the case with most testing scenarios, some practices have a low barrier to entry and may still test a job adequately. Other practices are more complicated but can result in more thorough testing. Let’s walk through some good QA practices, starting with the easiest and ending with the most complicated.
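One low-barrier practice is to keep the map and reduce logic in plain functions that can be tested without a cluster. The sketch below shows the idea in Python rather than with the Java MapReduce API; the word-count job and the names `map_words`, `reduce_counts`, and `run_local` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for each whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return (word, sum(counts))


def run_local(lines):
    """Tiny in-memory stand-in for the shuffle: group by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_words(line):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in sorted(grouped.items()))


# Each piece can be checked on its own, then together on a small input.
assert map_words("Hadoop tests") == [("hadoop", 1), ("tests", 1)]
assert reduce_counts("hadoop", [1, 1]) == ("hadoop", 2)
print(run_local(["Hadoop tests", "hadoop jobs"]))
```

Tests like these run in milliseconds, so they suit a pre-commit check; they say nothing about behavior in a distributed environment, which is where the more involved practices come in.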