5 Pitfalls of Benchmarking Big Data Systems

Categories: Hadoop Performance

Benchmarking Big Data systems is nontrivial. Avoid these traps!

Here at Cloudera, we know how hard it is to get reliable performance benchmarking results. Benchmarking matters because one of the defining characteristics of Big Data systems is the ability to process large datasets faster. “How large” and “how fast” drive technology choices, purchasing decisions, and cluster operations. Even with the best intentions, performance benchmarking is fraught with pitfalls—easy to get numbers, hard to tell if they are sound.

Below we list five common pitfalls and illustrate them with internal and customer-based stories. They offer a behind-the-scenes look at our engineering and review processes that allow us to produce rigorous benchmark results. These stories illustrate important principles of conducting your own performance benchmarking, or assessing others’ results.

Pitfall 1: Comparing Apples to Oranges

We often run two tests, expecting only one parameter to change, while in fact many parameters changed and a comparison is impossible – in other words, we “compare apples to oranges.”

CDH 5.0.0 was the first release of our software distribution with YARN and MapReduce 2 (MR2) as the default MapReduce execution framework. To their credit, our partners and customers did performance benchmarking on their own when they considered whether to upgrade. Many partners and customers initially reported a performance regression from MapReduce 1 (MR1) in earlier versions of CDH to YARN and MapReduce2 in CDH 5.0.0.

What actually happened was that a straightforward benchmark ended up comparing two different things—comparing apples to oranges. Two technical issues led to this comparison discrepancy.

One issue was that TeraSort, a limited yet popular benchmark, changed between MR1 and MR2. To reflect rule changes in the GraySort benchmark on which it is based, the data generated by the TeraSort included with MR2 is less compressible. A valid comparison would use the same version of TeraSort for both releases, because map-output compression is enabled by default as a performance optimization in CDH. Otherwise, MR1 will have an unfair advantage by using more compressible data.

Another issue was the replacement of the “task slot” concept in MR1 with the “container” concept in MR2. YARN has several configuration parameters that affected how many containers will be run on each node. A valid comparison would set these configurations such that there is the same degree of parallel processing between MR1 and MR2. Otherwise, depending on whether hardware is over or under-committed, either MR1 or MR2 will have the advantage.

We committed these pitfalls ourselves in the early days of ensuring MR1 and MR2 performance parity. We regularly compared MR1 and MR2 performance on our nightly CDH builds, and the “regression” was caught the very first time we did this comparison. Our MapReduce and Performance Engineering teams collaborated to both identify the code changes, and understand what makes a valid performance comparison. This effort culminated in MR2 shipped in CDH 5.0.0 at performance parity with MR1.

Pitfall 2: Not Testing at Scale

“Scale” for big data systems can mean data scale, concurrency scale (number of jobs and number of tasks per job), cluster scale (number of nodes/racks), or node scale (per node hardware size). Failing to test “at scale” for any of these dimensions can lead to surprising behavior for your production clusters.

It is illustrative to look at another aspect of our efforts to drive MR2 to performance parity with MR1. We wanted to verify that MR2 and MR1 perform at parity when a large number of jobs are running. We ran SWIM, which submits many jobs concurrently over hours or even days, simulating the workload logged on actual production clusters. The first runs of SWIM on MR2 revealed a live-lock issue where the jobs would appear as submitted, but none of them would make any progress. What happened is that all available resources were allocated to the Application Masters, leaving no room for the actual tasks.

This issue escaped detection in our other scale tests that covered a range of data, cluster, and node scales. The live-lock occurs only when all the containers in a cluster are taken up by Application Masters. On a cluster of non-trivial size, this means hundreds or even thousands of concurrent jobs. SWIM is specifically designed to reveal such issues by replaying production workloads with their original level of concurrency and load variation over time. In this case, we found a critical issue before our customers ever hit it.

Pitfall 3: Believing in Miracles

If something is too good to be true, it’s probably not true. This means we should always have a model of what performance should be, so that we can tell if a performance improvement is expected, or too good to be true.

Here are some recent “miracles” we have had to debunk for ourselves and for our customers:

  • A customer came to us and declared that Impala performs more than 1000x better than its existing data warehouse system, and wanted us to help it set up a new cluster to handle a growing production workload. The 1000x difference is orders of magnitude larger than our own measurements, and immediately made us skeptical. Following much discussion, we realized that the customer was comparing very simple queries running on a proof-of-concept Impala cluster versus complex queries running on a heavily loaded production system. We helped the customer do an apples-to-apples comparison, yet it turns out Impala still has an advantage. We left the customer with realistic plans for how to grow its data management systems.
  • A customer asked us to run Apache Sqoop in several configurations, with the intent of finding the configuration leading to the best export performance. Among other tests we compared the performance of loading data to new partitions through Oracle Database’s direct path writes, to loading the same data through normal inserts. We normally expect direct path writes to be significantly faster since they bypass the normally busy buffer-cache and redo log subsystems, writing data blocks directly to storage. In this test, the normal inserts were 3 times faster than the direct path writes. Quick investigation revealed that Sqoop was exporting data to an otherwise idle Oracle cluster with over 300GB of memory dedicated to the buffer cache. Loading data into memory in a server with no contention is obviously faster than writing the same data to disk. We explained the results to the customer and recommended repeating the tests on a cluster with realistic workloads.
  • A customer asked us for comment on a Hadoop sort benchmark result in the trade press. The result is more than 100x faster than what we found internally. We took a look at the benchmark report and very quickly found that the data size being tested is considerably smaller than the available memory in the cluster. In other words, a knowledgeable operator would be able to configure Hadoop in a way that the sort takes place completely in memory. This approach departs from the common practice of configuring sort with data size much greater than total cluster memory. So the more-than-100x gap comes from the inherent hardware difference between memory and disk IO, rather than a difference between two software systems.

The ability to identify miracles requires us having models of expected performance beyond just a “gut feeling”. These models can come from prior results, or an understanding of where the system bottlenecks should be. Benchmarking without such models would give you a lot of numbers but not a lot of meaning.

Pitfall 4: Using Unrealistic Benchmarks

Biased benchmarks are benchmarks where the choice of workload, hardware or presentation choices is done regardless of the expected requirements of the customers. Rather, these choices are meant to highlight the capabilities of the vendor performing the benchmark.

Here are specific warning signs of a biased benchmark:

  • Misleading workloads: When a vendor ran benchmarks on 100GB of data when the system is marketed as a “Big Data” system designed for 100TB data sets. Or when a transactional workload is used to test a system with mostly analytical use cases. Terasort, for example, has specific characteristics and stresses a very specific subset of the processing subsystem. It is not necessarily a good benchmark to evaluate how the system will scale for other workloads, although it is a useful first step in comparing different hardware configurations.

At Cloudera, Terasort is only one job in our MapReduce performance benchmarking suite. We run all jobs in the suite under different meanings of scale beyond just large data size. (See Pitfall 2 above.)

  • Premium hardware: Vendors often improve their numbers by using hardware not typically used in production: solid state drives (SSDs) when the customers more commonly use hard disk drives (HDDs), or types of SSDs not available in the general market. The Transaction Processing Council – C (TPC-C) benchmark allow the use of hardware that is not available provided that availability dates are published. It is wise to check if the hardware choices make results irrelevant when using benchmarks for purchasing decisions.

At Cloudera, we have explored MapReduce performance for SSDs. We were very conscious of SSD’s prevalence in the market compared with HDDs. This prompted us to suggest to our hardware partners to track SSD performance-per-cost in addition to the more commonly cited capacity-per-cost. The importance of the performance-per-cost metric represents a key insight from the study.

  • Cherry-picking queries or jobs: The vendor picked very specific queries out of a standard benchmark, but can’t explain the choice with objective criteria that is relevant to the customers (or worse, doesn’t even disclose that a choice was made!)

At Cloudera, many of our past Impala performance results used 20 queries derived from the TPC – Decision Support (TPC-DS) benchmark. These queries were chosen over a year ago, and cover interactive, reporting, and deep analytic use cases. At the time, it was a major improvement over a frequently cited set of queries that were constructed without empirical backing from actual customer use cases. The 20 queries also represent a step forward from our own prior efforts using queries derived from TPC-H. Both TPC-H and TPC-DS are backed by customer surveys from vendors in the TPC consortium, with TPC-H considered to be the less demanding benchmark. We have kept the set of 20 queries derived from TPC-DS to help ourselves compare against our own prior results, and we are well aware they are fewer than the full set of 99 queries in the official TPC-DS. Look for our future posts in this space.

To an extent all commercial benchmarks are suspect of bias, since they are performed or commissioned by a specific vendor to market their products. Vendors can enhance their own credibility by being transparent about the limits of their own work. Customers can hold vendors accountable by understanding their own workload and have a conversation with vendors about whether a product addresses their specific use case.

Pitfall 5. Communicating Results Poorly

Poorly communicated results detract from otherwise good performance benchmarking projects. Here are Cloudera, we check all external-facing benchmarking communications for the following:

  1. Whether we selected a benchmark that
    1. Is unbiased (see Pitfall 3 above),
    2. Exercise workloads relevant to actual customers, and
    3. Scales across data size, concurrency level, cluster size, and node size.
  2. Whether we reported sufficient information for industry peers to assess the significance of the result, and to reproduce the tests if needed. This requires reporting
    1. The benchmark we used and why we used it,
    2. The performance metrics we measured and how we measured them,
    3. The hardware used and the software tuning applied.

One more aspect of a good benchmarking report is whether the results have been independently verified or audited. The purpose of an independent audit is to have the above checks done by someone other than the organization that does the benchmarking study. Benchmarking results that passed independent audit are more likely to be communicated clearly and completely.

There are several gold standards for audit and verification practices established before the rise of Big Data:

  • Dedicated auditors: The TPC uses dedicated auditors. Each auditor is certified to audit a particular benchmark only after passing a test designed by the working group who initially specified that benchmark.
  • Validation kits and fair-use rules: The Standard Performance Evaluation Corporation (SPEC) uses a combination of validation checks built into benchmarking kits, fair-use rules governing how the results should be communicated, and review by the SPEC organization, which encompasses many industry peers of the test sponsor.
  • Peer review: The official Sort Benchmark has new entries reviewed by past winners. This incentivizes the winners to “hand over the torch” only if new entries are sufficiently rigorous.

There are not yet any widely accepted audit and verification processes for Big Data. The need for complete and neutral benchmarking results is sometimes diluted by the need to stand out in the trade press. However, the past year has seen a phenomenal growth in the level of performance knowledge in the customer base and the broader community. Every vendor benchmark is now audited by customers and industry peers. This is why we always conduct and communicate our performance benchmarking in a rigorous and open manner.


Performance benchmarking is hard. When they are done well, benchmarks can guide us as well as the community. We close this blog with anecdotes of the authors’ benchmarking mistakes committed early in their career. After all, anyone can make benchmarking errors, and everyone can learn from them.

Gwen Shapira is on the Platform Engineering team at Cloudera. She once ran a database performance benchmark on a proof-of-concept 5-node cluster. When she was asked what would be the performance for a 50-node production cluster, she multiplied the 5-node performance numbers by 10x. The production cluster blew up, hitting network bottlenecks not revealed at the proof-of-concept scale. Lesson: Testing is better than extrapolating.

Yanpei Chen is on the Performance Engineering team at Cloudera. He ran his first Hadoop benchmark as a grad student at UC Berkeley, where he accidentally mounted HDFS on the departmental network filer. He took down the filer for all EECS professors, staff, and students, and received hate mail from the system administrators for a week. Lesson: Run your benchmarks in a way that doesn’t disrupt production systems.