Cloudera Blog · Testing Posts

What Do Real-Life Apache Hadoop Workloads Look Like?

Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.

Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations and the thoughts they triggered about present and future challenges and opportunities in the Hadoop ecosystem.

Table 1. Summary of Hadoop workloads analyzed

Watching the Clock: Cloudera’s Response to Leap Second Troubles

At 5 pm PDT on June 30, a leap second was added to Coordinated Universal Time (UTC). Within an hour, Cloudera Support started receiving reports of systems running at 100% CPU utilization. The Support Team worked quickly to understand and diagnose the problem and soon published a solution. Bugs due to the leap second, coupled with the Amazon Web Services outage, would make this Cloudera’s busiest support weekend to date.

Since Hadoop is written in Java and closely interoperates with the underlying OS, Cloudera Support troubleshoots not only all 17 components in the Hadoop ecosystem, but also any underlying Linux and Java bugs. Last weekend many of our customers were affected by the now infamous “leap second” bugs. Initially, many assumed that Java and Linux would process the leap second gracefully. However, we soon discovered that this wasn’t the case: depending on the version of Linux in use, we observed several distinct issues.

Background

Leap seconds are added to UTC to correct for Earth’s slowing rotation. The latest leap second was added last Saturday (6/30) at 23:59:60 UTC (5 pm PDT). Due to a missed function call in the Linux timekeeping code, the leap second was not accounted for properly. As a result, after the leap second, timers expired one second earlier than requested. Many applications use a recurring timer of 1 second or less; such timers expired immediately, causing the application to immediately set another timer, ad infinitum. This infinite loop led to CPU load spikes that generated 21 separate support tickets.
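The effect of timers expiring one second early can be illustrated with a toy model. This is only a sketch built on the description above, not the Linux timekeeping code: the function names, the one-second observation window, and the iteration cap are all hypothetical choices made for illustration.

```python
def effective_wait(requested_s, skew_s):
    """Time a timer really waits if it expires `skew_s` seconds early.

    Hypothetical model: the wait can never be negative, so any request
    of `skew_s` seconds or less expires immediately.
    """
    return max(0.0, requested_s - skew_s)


def fires_in_window(requested_s, skew_s, window_s=1.0, cap=10_000):
    """Count how often a recurring timer fires during `window_s` seconds.

    `cap` stands in for "ad infinitum": a timer that never waits would
    otherwise spin forever, which is the CPU-burning loop described above.
    """
    elapsed = 0.0
    fires = 0
    while elapsed < window_s and fires < cap:
        elapsed += effective_wait(requested_s, skew_s)
        fires += 1
    return fires


# Healthy clock: a 0.5 s recurring timer fires a handful of times per second.
healthy = fires_in_window(0.5, skew_s=0.0)

# One-second skew: the same timer never waits and spins until the cap.
spinning = fires_in_window(0.5, skew_s=1.0)

# Timers longer than the skew are shortened but still wait.
long_timer = fires_in_window(2.0, skew_s=1.0)
```

In this model only timers of one second or less degenerate into a busy loop, which matches the observation that applications with short recurring timers were the ones driving CPU utilization to 100%.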

Apache MRUnit 0.9.0-incubating has been released!

This post was originally posted on the Apache Software Foundation’s blog.

We (the Apache MRUnit team) have just released Apache MRUnit 0.9.0-incubating (tarball, nexus, javadoc). Apache MRUnit, an Apache Incubator project, is a Java library that helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that automatically verifies that the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they’re deployed to a production system.
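To make the idea concrete, here is a minimal sketch of unit testing map and reduce logic in isolation. It uses Python’s standard `unittest` module rather than MRUnit, so it does not show the MRUnit API; the word-count functions `map_words` and `reduce_counts` and the `run_suite` helper are hypothetical names invented for this example.

```python
import unittest


def map_words(line):
    """Hypothetical mapper: emit (word, 1) for each whitespace-separated word."""
    for word in line.lower().split():
        yield (word, 1)


def reduce_counts(word, counts):
    """Hypothetical reducer: sum every count emitted for a single word."""
    return (word, sum(counts))


class WordCountTest(unittest.TestCase):
    def test_mapper_emits_one_pair_per_word(self):
        self.assertEqual(
            list(map_words("Hadoop tests hadoop")),
            [("hadoop", 1), ("tests", 1), ("hadoop", 1)],
        )

    def test_mapper_ignores_blank_line(self):
        self.assertEqual(list(map_words("   ")), [])

    def test_reducer_sums_counts(self):
        self.assertEqual(reduce_counts("hadoop", [1, 1]), ("hadoop", 2))


def run_suite():
    """Run the tests in-process and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordCountTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_suite()` or running the file with `python -m unittest` executes the three tests without a cluster. Keeping the map and reduce logic in plain functions is what makes this kind of fast, local check possible.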

The MRUnit project is quite active: 0.9.0 is our fourth release since entering the incubator, and we have added 4 new committers beyond the project’s initial charter! We are very interested in having new contributors and committers join the project! Please join our mailing list to find out how you can help!

2010 Cloudera Apache Hadoop Webinars

Cloudera produced several webinars in 2010 providing attendees with insights into a range of topics from technical best practices to common business applications of Hadoop. These webinars proved to be very popular so we thought we would provide a brief recap for our readers.

Starting way back in June, we presented Top Ten Tips and Tricks for Hadoop Success. In this webinar we explained some tips that the Cloudera Solutions Architect team has picked up from implementing, deploying, and running Hadoop with our customers.

Top Ten Tips and Tricks for Hadoop Success (Link to video recording)

CDH2: “Testing” Heading Towards “Stable”

In September 2009, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable. It’s a long road of feedback, bug fixes, QA and testing to move from testing to stable. As someone who tracks the maturity of a testing build throughout its life cycle, I’m pleased to say we’ve put a lot of polish into this release.

CDH2: Testing Release now with Pig, Hive, and HBase

At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.

We plan to push new packages into the testing repository every 3 to 6 weeks. It has been about 3 weeks since we announced the first testing release, so it must be time for a new one. Here are some of the highlights:

Advice on QA Testing Your MapReduce Jobs

As Hadoop adoption increases among organizations, companies, and individuals, and as it makes its way into production, testing MapReduce (MR) jobs becomes more and more important. By regularly running tests on your MR jobs, whether invoked by developers before they commit a change or by a continuous integration server such as Hudson, an engineering organization can catch bugs early, strive for quality, and make developing and maintaining MR jobs easier and faster.

MR jobs are particularly difficult to test thoroughly because they run in a distributed environment. This post will give specific advice on how an engineering team might QA test its MR jobs. Note that Chapter 5 of Hadoop: The Definitive Guide gives specific code examples for testing an MR job.

As is the case with most testing scenarios, some practices have a low barrier to entry and may test your jobs well enough. Other practices are more complicated but may result in more thorough testing. Let’s walk through some good QA practices, starting with the easiest and ending with the most complicated.

Traditional Unit Tests – JUnit, PyUnit, Etc.