Cloudera Engineering Blog · Testing Posts
Security architecture is complex, but these testing strategies help Cloudera customers rely on production-ready results.
Among other things, good security requires user authentication and that authenticated users and services be granted access to those things (and only those things) that they’re authorized to use. Across Apache Hadoop and Apache Solr (which ships in CDH and powers Cloudera Search), authentication is accomplished using Kerberos and SPNego over HTTP and authorization is accomplished using Apache Sentry (the emerging standard for role-based fine grain access control, currently incubating at the ASF).
While Apache HBase adoption for building end-user applications has skyrocketed, many of those applications (and many apps generally) have not been well-tested. In this post, you’ll learn some of the ways this testing can easily be done.
We will start with unit testing via JUnit, then move on to using Mockito and Apache MRUnit, and then to using an HBase mini-cluster for integration testing. (The HBase codebase itself is tested via a mini-cluster, so why not tap into that for upstream applications, as well?)
Apache HBase supports three primary client APIs that developers can use to bind applications with HBase: the Java API, the REST API, and the Thrift API. Therefore, as developers build apps against HBase, it’s very important for them to be aware of the compatibility guidelines with respect to CDH.
This blog post will describe the efforts that go into protecting the experience of a developer using the Java API. Through its testing work, Cloudera allows developers to write code and sleep well at night, knowing that their code will remain compatible through supported upgrade paths.
Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.
Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.
At 5 pm PDT on June 30, a leap second was added to the Universal Coordinated Time (UTC). Within an hour, Cloudera Support started receiving reports of systems running at 100% CPU utilization. The Support Team worked quickly to understand and diagnose the problem and soon published a solution. Bugs due to the leap second coupled with the Amazon Web Services outage would make this Cloudera’s busiest support weekend to date.
Since Hadoop is written in Java and closely interoperates with the underlying OS, Cloudera Support troubleshoots not only all 17 components in the Hadoop ecosystem, but also any underlying Linux and Java bugs. Last weekend many of our customers were affected by the now infamous “leap second” bugs. Initially, many assumed that Java and Linux would process the leap second gracefully. However, we soon discovered that this wasn’t the case and depending on the version of Linux being used, several distinct issues were observed.
This post was originally posted on the Apache Software Foundation’s blog.
We (the Apache MRUnit team) have just released Apache MRUnit 0.9.0-incubating (tarball, nexus, javadoc). Apache MRUnit is an Apache Incubator project that is a Java library which helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they’re deployed to a production system.
In September 2009, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable. It’s a long road of feedback, bug fixes, QA and testing to move from testing to stable. As someone who tracks the maturity of a testing build throughout its life cycle, I’m pleased to say we’ve put a lot of polish into this release.
At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.
We plan on pushing new packages into the testing repository every 3 to 6 weeks. And it just so happens it is just about 3 weeks after we announced the first testing release. So it must be time for a new one. Here are some of the highlights:
As Hadoop adoption increases among organizations, companies, and individuals, and as it makes its way into production, testing MapReduce (MR) jobs becomes more and more important. By regularly running tests on your MR jobs–either invoked by developers before they commit a change or by a continuous integration server such as hudson–an engineering organization can catch bugs early, strive for quality, and make developing and maintaining MR jobs easier and faster.
MR jobs are particularly difficult to test thoroughly because they run in a distributed environment. This post will give specific advice on how an engineering team might QA test its MR jobs. Note that Chapter 5 of Hadoop: The Definitive Guide gives specific code examples for testing an MR job.