New Benchmarks for SQL-on-Hadoop: Impala 1.4 Widens the Performance Gap

Categories: Hadoop Impala Performance

With 1.4, Impala’s performance lead over the SQL-on-Hadoop ecosystem gets wider, especially under multi-user load.

As noted in our recent post about the Impala 2.x roadmap (“What’s Next for Impala: Focus on Advanced SQL Functionality”), Impala’s ecosystem momentum continues to accelerate, with nearly 1 million downloads since the GA of 1.0, deployment by most of Cloudera’s enterprise data hub customers, and adoption by MapR, Amazon, and Oracle as a shipping product. Furthermore, in the past few months, independent sources such as IBM Research have confirmed that “Impala’s database-like architecture provides significant performance gains, compared to Hive’s MapReduce- or Tez-based runtime.”

Cloudera’s performance engineering team recently completed a new round of benchmark testing based on Impala 1.4 and the most recent stable releases of the major SQL engine options for the Apache Hadoop platform (see previous results for 1.1 here, and for 1.3 here). As you’ll see in the results below, Impala has extended its performance lead since our last post, especially under multi-user load:

  • For single-user queries, Impala is up to 13x faster than alternatives, and 6.7x faster on average.
  • For multi-user queries, the gap widens: Impala is up to 27.4x faster than alternatives, and 18x faster on average — or nearly three times faster on average for multi-user queries than for single-user ones.

Now, let’s review the details.

Configuration

As always, all tests were run on precisely the same 21-node cluster. This time around we ran on a smaller (64GB per node) memory footprint across all engines to correct a misperception that Impala only works well with lots of memory (Impala performed similarly well on larger memory clusters):

  • 2 processors, 12 cores, Intel Xeon CPU E5-2630L 0 at 2.00GHz
  • 12 disk drives at 932GB each (one for the OS, the rest for HDFS)
  • 64GB memory

Comparative Set

This time, we dropped Shark from the comparative set as it has since been retired in favor of the broad-based community initiative to speed up batch processing with Hive-on-Spark. We also added the current version of Spark SQL to the mix for those interested:

  • Impala 1.4.0
  • Hive-on-Tez: The final phase of the 18-month Stinger initiative (aka Hive 0.13)
  • Spark SQL 1.1: A new project that enables developers to do inline calls to SQL for sampling, filtering, and aggregations as part of a Spark data processing application
  • Presto 0.74: Facebook’s query engine project

Queries

  • Just like last time, to ensure a realistic Hadoop workload with representative data-size-per-node, queries were run on a 15TB scale-factor dataset across 21 nodes.
  • We ran precisely the same open decision-support benchmark derived from TPC-DS described in our previous rounds of testing (with queries categorized into Interactive, Reporting, and Analytics buckets).
  • Due to the lack of a cost-based optimizer in all tested engines except Impala, we tested all engines with queries that had been converted to SQL-92 style joins (the same modifications used in previous rounds of testing). For consistency, we ran those same queries against Impala — although Impala produces identical results without these modifications.
  • We selected the optimal file format for each engine, consistently using Snappy compression to ensure apples-to-apples comparisons. Each engine was assessed on the file format that delivered its best possible performance: Impala on Apache Parquet (incubating), Hive-on-Tez on ORC, Presto on RCFile, and Spark SQL on Parquet.
  • The standard rigorous testing techniques (multiple runs, tuning, and so on) were used for each of the engines involved.
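To illustrate the join conversion described above, here is a hedged sketch (the table and column names are illustrative, not the actual benchmark queries): an implicit comma-style join in the WHERE clause is rewritten into an explicit SQL-92 JOIN ... ON with a fixed join order, so that engines without a cost-based optimizer are not penalized by a poor automatic join ordering.

```sql
-- Original style: implicit join expressed in the WHERE clause
SELECT s.store_name, SUM(ss.net_paid)
FROM store_sales ss, store s
WHERE ss.store_id = s.store_id
GROUP BY s.store_name;

-- Equivalent SQL-92 style: explicit JOIN ... ON, run against all engines
SELECT s.store_name, SUM(ss.net_paid)
FROM store_sales ss
JOIN store s ON ss.store_id = s.store_id
GROUP BY s.store_name;
```

Both forms return identical results; only the syntax (and the implied join order) differs.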
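As an example of the per-engine format setup, the following sketch shows how a Snappy-compressed Parquet table might be prepared in Impala (table names here are hypothetical; Hive-on-Tez would use an analogous CREATE TABLE ... STORED AS ORC statement for its ORC tables):

```sql
-- Impala: materialize a text-format table into Snappy-compressed Parquet
SET COMPRESSION_CODEC=snappy;  -- Snappy is also Impala's default Parquet codec
CREATE TABLE store_sales_parquet
STORED AS PARQUET
AS SELECT * FROM store_sales_text;
```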

Results: Single User

Impala outperformed all alternatives on single-user workloads across all queries run. Its performance advantage ranged from 2.1x to 13.0x, averaging 6.7x. This indicates that Impala's lead is actually widening over time: the average gap was 4.8x in our previous round of testing. The widening is most pronounced relative to “Stinger”/Hive-on-Tez (from an average of 4.9x to 9x) and Presto (from an average of 5.3x to 7.5x):

Next, let’s turn to multi-user results. We view performance under concurrent load as the more meaningful metric for real-world comparison.

Results: Multiple Users

We re-ran the same Interactive queries as in the previous post, this time with 10 users running concurrently. Our experience matched the past several posts: Impala's performance advantage widens under concurrent load, with its average lead growing from 6.7x for a single user to 18.7x under concurrency. Depending on the comparison, the advantage ranged from 10.6x to 27.4x. This also exceeds the 13x average advantage from our previous round of testing. Again, Impala is widening the gap:

Note that Impala's speed under 10-user load was nearly half its single-user speed, whereas the alternatives, on average, slowed to just one-fifth of their single-user speed. We attribute that advantage to Impala's extremely efficient query engine, which demonstrated query throughput 8x to 22x better than the alternatives (14x better on average):

Conclusion

Our vision for Impala is for it to become the most performant, compatible, and usable native analytic SQL engine for Hadoop. These results demonstrate Impala’s widening performance lead over alternatives, especially under multi-user workloads. We have a number of exciting performance and efficiency improvements planned, so stay tuned for additional benchmark posts. (As usual, we encourage you to independently verify these results by running your own benchmarks based on the open toolkit.)

While Impala’s performance advantage has widened, the team has been hard at work adding to Impala’s functionality. Later this year, Impala 2.0 will ship with a great many feature additions including commonly requested ANSI SQL functionality (such as analytic window functions and subqueries), additional datatypes, and additional popular vendor-specific SQL extensions.

Impala occupies a unique position in the Hadoop open source ecosystem today by being the only SQL engine that offers:

  • Low-latency, feature-rich SQL for BI users
  • Ability to handle highly-concurrent workloads
  • Efficient resource usage in a shared workload environment (via YARN)
  • Open formats for accessing any data from any native Hadoop engine
  • Low-lock-in, multi-vendor support, and
  • Broad ISV support

As always, we welcome your comments and feedback!

Justin Erickson is Director of Product Management at Cloudera.

Marcel Kornacker is Impala’s architect and the Impala tech lead at Cloudera.

Dileep Kumar is a Performance Engineer at Cloudera.

David Rorke is a Performance Engineer at Cloudera.


10 responses on “New Benchmarks for SQL-on-Hadoop: Impala 1.4 Widens the Performance Gap”

  1. Patrick

    Really impressive results! I would be very interested in seeing these same benchmarks for Impala running on YARN. My guess is that the additional overhead of interacting with the YARN ResourceManager would slow it down a bit. One of the benefits of using Hive on Tez is that it is just another YARN application running on the cluster alongside MapReduce, Pig, etc., which reduces deployment and configuration complexity when compared to installing Impala side-by-side with YARN or a separate cluster. Thanks.

    1. Justin Kestelyn (@kestelyn) Post author

      Hey Patrick,

      Impala’s overhead of interacting with YARN is typically no more than other YARN components. In fact, Llama enables a fast path to YARN resources that allows Impala to use resources from previous queries without having to gate every query on YARN, and Llama will improve in this respect in the future.

    1. Justin Kestelyn (@kestelyn) Post author

      Aha, sorry for our misunderstanding!

      We used HiveContext, via this flag:

      ./make-distribution.sh --name cloudera -Phive -Phive-thriftserver -Phadoop-2.3

  2. Antonia

    Hi, I am quite interested in your single-user benchmark results, and I have some questions:
    1. Where did you store your 15TB of data? HDFS? What type of disks do these machines use, SSD or SATA?
    2. We ran the same queries as the “Interactive” set on Spark SQL, on approximately the same scale of cluster, but our results were much worse than yours. Are there any crucial parameters to set?
