New SQL Choices in the Apache Hadoop Ecosystem: Why Impala Continues to Lead

Impala continues to demonstrate performance leadership over the alternatives (by 950% or more), while providing higher query throughput and a far smaller CPU footprint.

In our previous post from January 2014, we reported that Impala had achieved query performance over Apache Hadoop equivalent to that of an analytic DBMS over its own proprietary storage system. We believed this was an important milestone because Impala’s objective has been to support a high-quality BI experience on Hadoop data, not to produce a “faster Apache Hive.” An enterprise-quality BI experience requires low latency and high concurrency (among other things), so surpassing a well-known proprietary MPP DBMS in these areas was important evidence of progress.
 
In the past nine months, we’ve all seen additional public validation that the original technical design of Hive, while effective for batch processing, was a dead end for BI workloads. Recent examples include the launch of Facebook’s Presto engine (Facebook was the inventor and world’s largest user of Hive), the emergence of Shark (Hive running on the Apache Spark DAG), and the “Stinger” initiative (Hive running on the Apache Tez [incubating] DAG).
 
Given the introduction of a number of new SQL-on-Hadoop implementations, it seemed like a good time to do a roundup of the latest versions of each engine to see how they differ. We find that Impala maintains a significant performance advantage over the various other open source alternatives, ranging from 5x to 23x depending on the workload and the implementations being compared. This advantage is due to some inherent design differences among the various systems, which we’ll explain below. Impala’s advantage is strongest for multi-user workloads, arguably the most relevant scenario for users evaluating their options for BI use cases.

Configuration

Cluster

All tests were run on precisely the same cluster, which was torn down between runs to ensure fair comparisons. The cluster comprised 21 nodes, each equipped with:

  • 2 processors, 12 cores, Intel Xeon CPU E5-2630L 0 at 2.00GHz
  • 12 disk drives at 932GB each (one for the OS, the rest for HDFS)
  • 384GB of memory

Comparative Set

  • Impala 1.3.0
  • Hive-on-Tez: The final phase of the 18-month Stinger initiative (aka Hive 0.13)
  • Shark 0.9.2: A port of Hive from UC Berkeley AMPLab that is architecturally similar to Hive-on-Tez, but based on Spark instead of Tez. Shark testing was done on a native in-memory dataset (RDD) as well as HDFS.
  • Presto 0.60: Facebook’s query engine project

Queries

  • To ensure a realistic Hadoop workload with representative data-size-per-node, queries were run on a 15TB scale-factor dataset across 20 nodes.
  • We ran precisely the same open decision-support benchmark derived from TPC-DS described in our previous testing (with queries categorized into Interactive, Reporting, and Deep Analytics buckets).
  • Due to the lack of a cost-based optimizer and predicate propagation in all tested engines except Impala, we ran the same queries that had been converted to SQL-92-style joins for the previous round of testing, and we also manually propagated predicates where semantically equivalent (an illustrative sketch of these rewrites appears after this list). For consistency, we ran those same modified queries against Impala, although Impala produces identical results without these modifications.
  • In the case of Shark, manual query hints were needed in addition to the modifications above to complete the query runs. Furthermore, Shark required more memory than available in the cluster to run the Reporting and Deep Analytics queries on RDDs (and thus those queries could not be completed).
  • We selected comparable file formats across all engines, consistently using Snappy compression to ensure apples-to-apples comparisons. Furthermore, each engine was tested on a file format that ensures the best possible performance and a fair, consistent comparison: Impala on Apache Parquet (incubating), Hive-on-Tez on ORC, Presto on RCFile, and Shark on ORC. (Note that native support for Parquet in Shark as well as Presto is forthcoming.)
  • Standard methodical testing techniques (multiple runs, tuning, and so on) were used for each of the engines involved.
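
To make the nature of these query modifications concrete, here is a hypothetical sketch over TPC-DS-style tables; it is illustrative only and not one of the actual benchmark queries, and the filter values are made up. The first statement uses the original SQL-89-style implicit joins with the date filter stated once; the second uses SQL-92-style explicit joins and manually copies the semantically equivalent date predicate onto the fact table so that engines without automatic predicate propagation can still prune the large scan.

```sql
-- Original SQL-89-style form: implicit joins, filter stated once.
-- Illustrative sketch only; not one of the actual benchmark queries.
SELECT i.i_brand,
       SUM(ss.ss_ext_sales_price) AS ext_price
FROM   store_sales ss, date_dim d, item i
WHERE  ss.ss_sold_date_sk = d.d_date_sk
  AND  ss.ss_item_sk      = i.i_item_sk
  AND  d.d_date_sk BETWEEN 2451911 AND 2451941        -- roughly one month
GROUP  BY i.i_brand;

-- Modified form: SQL-92-style joins, with the date predicate manually
-- propagated across the equi-join so that engines lacking automatic
-- predicate propagation can prune the store_sales scan.
SELECT i.i_brand,
       SUM(ss.ss_ext_sales_price) AS ext_price
FROM   store_sales ss
       JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
       JOIN item i     ON ss.ss_item_sk = i.i_item_sk
WHERE  d.d_date_sk        BETWEEN 2451911 AND 2451941
  AND  ss.ss_sold_date_sk BETWEEN 2451911 AND 2451941  -- manually propagated copy
GROUP  BY i.i_brand;
```

Impala runs either form with identical results and performance; the other engines tested needed the second form (and, in Shark’s case, additional query hints) to complete the runs competitively.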

Results

Single User

Impala on Parquet was the performance leader by a substantial margin, running on average 5x faster than its next best alternative (Shark 0.9.2).

(Note: The results are not shown here, but the queries were also run on Impala/RCFile as a direct comparison to Presto/RCFile; Impala/RCFile was consistently 20-30% slower than Impala/Parquet.)

The two Hive-on-DAG implementations produced similar results, which is consistent with what one would expect given their highly similar designs. Presto is the youngest implementation of the four and is held back by the fact that it runs on RCFile, a much less effective columnar format than Parquet. We look forward to re-running these benchmarks in a few months when Presto runs on Parquet.

Although these results are exciting in themselves, as previously explained, we believe that measuring latency under a multi-user workload is a more valuable metric — because you would very rarely, if ever, commit your entire cluster to a single query at a time.

Multiple Users

In this test of a concurrent workload, we ran seven Interactive queries (q42, q52, q55, q63, q68, q73, q98) 10 times concurrently. To prevent all processes from running the same queries at the same time, queries were run continuously back-to-back and in a randomized order. Furthermore, because we could not run the full query set for Shark on RDDs, we used only the partition necessary for the Interactive queries for both the single-user and 10-user comparisons.
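
For readers unfamiliar with the workload, the sketch below shows the general shape of the Interactive bucket: a single scan of the store_sales fact table joined to a couple of small dimension tables, with a grouped aggregation and a LIMIT. It is written in the SQL-92 join style used in the runs and is only in the spirit of queries like q52; it is not a verbatim benchmark query, and the filter values are made up.

```sql
-- Illustrative sketch in the spirit of the Interactive queries:
-- one large fact-table scan, two dimension joins, a grouped
-- aggregation, and a LIMIT. Not a verbatim benchmark query.
SELECT d.d_year,
       i.i_brand_id               AS brand_id,
       i.i_brand                  AS brand,
       SUM(ss.ss_ext_sales_price) AS ext_price
FROM   store_sales ss
       JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
       JOIN item i     ON ss.ss_item_sk = i.i_item_sk
WHERE  d.d_moy  = 11
  AND  d.d_year = 2000
GROUP  BY d.d_year, i.i_brand_id, i.i_brand
ORDER  BY d.d_year, ext_price DESC, brand_id
LIMIT  100;
```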

In this run, Impala widened its performance advantage, performing 9.5x better than the next best alternative:

Throughput and Hardware Utilization

In the above chart you can see that under the (simulated) load of 10 concurrent users, Impala slows down by 1.9x, whereas for other SQL implementations, query performance slows by 2.6x – 8.6x under the same load. This performance difference translates into quality of experience as perceived by the BI user.

We also measured total throughput: how many queries the system can process in a given hour. Throughput determines how much hardware is required to run a SQL workload at a targeted performance level, and so it has a big influence on TCO, since the carrying cost of hardware is typically two-thirds of the TCO of a Hadoop system.
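
As a rough rule of thumb (an approximation added here for intuition, not a number measured in this benchmark): if S query streams run concurrently and each query completes in an average of L seconds under that load, then sustained throughput is roughly

    queries per hour ≈ (3600 × S) / L

so an engine that slows down less under concurrent load delivers proportionally more queries per hour from the same hardware.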

It may seem surprising that Shark running on data cached as RDDs produced slightly slower single-user queries than Shark running directly on HDFS. The reason is that the HDFS data was already in memory (in the local OS buffer cache), so the RDD layer only added overhead. This disparity will widen over time now that HDFS supports in-memory reads (HDFS-4949), which are more efficient than the OS buffer cache; in-memory writes are also planned for an upcoming HDFS release (HDFS-5851). (In the coming months, we plan to re-run these benchmarks using updated versions and with HDFS caching configured.)

CPU efficiency explains how Impala is able to provide lower latency and higher throughput than the alternatives, and why a native high-performance MPP query engine offers benefits that just porting Hive onto a DAG (either Tez or Spark) does not. While utilizing a DAG removes additional I/O costs beyond the initial scan of the data, most of the performance and concurrency gains come from the CPU efficiency of the query engine itself.

Beyond Performance

Based on these results, Impala not only outperformed its nearest competitors, but also proved itself to be a more robust system that requires less manual tuning:

  • The other systems required significant rewrites of the original queries in order to run, while Impala could run the original as well as modified queries.
    Deep knowledge of how to rewrite SQL statements was required to ensure a head-to-head comparison across the non-Impala systems and to avoid even slower response times or, in some cases, outright query failures. For most users of applications or BI tools, such manual rewriting of queries is highly undesirable, if not impossible.
     
    In contrast, Impala’s cost-based optimizer and predicate propagation capability allow it to run the queries in the original SQL-89 form of the TPC-DS-derived benchmark or in the modified versions, with identical performance. Manual predicate propagation in particular is often challenging for users; traditional databases provide automatic propagation similar to Impala’s, and incorrect manual placement can lead to wrong results.
  • Some systems require manual tuning of the JVM’s garbage collection parameters.
    Presto in particular required manual tuning of Java garbage collection in order to achieve its results. Likewise, Shark’s inability to run without manual query hints was partially due to its dependence on JVM memory management. Tez, for its part, either pays extra startup cost, which hurts smaller queries, when running each query in separate containers, or runs into similar JVM memory-management challenges when reusing containers.
     
    Impala’s query execution, however, is written in native code, which not only leads to greater performance and CPU efficiency as demonstrated above, but also offers a more stable multi-user service similar to traditional MPP query engines.

Conclusion

In summary, these new results show that Impala achieves lower latency under concurrency than its competitors while providing higher query throughput, and with a far smaller CPU footprint. Furthermore, of the entire comparative set, only Impala was able to run the queries in their original SQL-89-style join format without modification.

The results above help demonstrate that despite significant engineering investments into alternatives, Impala uniquely delivers on the requirements for BI and SQL analytics by combining:

  • Interactive SQL
  • Ability to handle highly-concurrent workloads
  • Efficient resource usage (so Impala is a “good citizen” in a shared workload environment)
  • Open formats for accessing any data
  • Multi-vendor support (from Cloudera, MapR, and Amazon) to avoid lock-in, and
  • Broad ISV support

As usual, we invite you to do the same testing for yourselves using our openly published benchmark kit — any and all feedback is welcome and appreciated. We’ll bring you more news over time as Impala continues to hit its milestones!

Justin Erickson is Director of Product Management at Cloudera.

Marcel Kornacker is Impala’s architect and the Impala tech lead at Cloudera.

Dileep Kumar is a Performance Engineer at Cloudera.

35 Responses
  • Pratik / May 30, 2014 / 1:07 AM

    You have published an invalid link for PrestoDB; it should be http://prestodb.io

  • Chiris Zhang / May 30, 2014 / 2:03 AM

    Yes, I also found that the performance of Shark dropped sharply when handling multiple-table join queries.

  • Sam / May 30, 2014 / 8:40 AM

    384GB of memory per server? I don’t know how many customers run that kind of server yet.

    • Justin Kestelyn (@kestelyn) / May 30, 2014 / 9:36 AM

      Hi Sam,

      The reason for this amount of memory was to ensure a head-to-head comparison of query-engine efficiency in isolation. Thus, we needed to provide a lot of “head room”.

      Note that even at this size, we still did not have enough memory to run the Reporting and Deep Analytics queries on Shark/RDD.

  • Patrick Angeles / May 30, 2014 / 9:54 AM

    @Sam It doesn’t make sense to get less than 128GB these days given current prices, and 256GB is getting common.

  • Rob / May 30, 2014 / 4:40 PM

    What are the times in the first chart? Sum of all query times or something else?

  • Otis Gospodnetic / June 01, 2014 / 9:25 AM

    @Patrick is that really true for all the AWS users out there, for example? :) It would be good to see how these systems behave with a higher number of servers with less RAM – 16GB, 32GB or 64 GB, for example.

  • Antonio Conte / June 02, 2014 / 5:11 AM

    Why are these benchmarks so different from the AMPLab one published in February 2014?

    https://amplab.cs.berkeley.edu/benchmark/

    • Justin Kestelyn (@kestelyn) / June 02, 2014 / 11:33 AM

      There are several reasons the cited AMPLab testing differs, including:

      - No use of columnar format (eg Parquet)
      - Overly simplistic schema (very few tables, very few columns in those tables)
      - Overly simplistic queries (the most complex one joins two tables and performs an aggregation)
      - Only 4 queries in the entire workload
      - Workload was not TPC-DS-derived

      We think this new benchmark is much more representative of real-world workloads.

      (BTW, Databricks engineers assisted us to ensure that the Shark run was as competitive as possible.)

  • ML / June 02, 2014 / 12:16 PM

    I’d be curious to hear your feedback on the benchmarks from amplab that seemingly tell a different story. Thoughts?
    https://amplab.cs.berkeley.edu/benchmark/

  • Simon / June 02, 2014 / 10:05 PM

    I scanned the 20 or so queries used in the benchmark, but as far as I can tell they only refer to one of the large fact tables in the TPC-DS schema (store_sales). Traditionally this table accounts for approx 43% of the data in TPC-DS, which means the active data set used for this benchmark was approx 6.5TB. Given that the cluster has approx 8TB of memory, was any pre-caching involved when executing the queries, and if so, was it the same for all technologies?

    • Justin Kestelyn (@kestelyn) / June 03, 2014 / 9:25 AM

      Simon,

      We warmed up the cache – and as indicated above, we used the same hardware and benchmarking method for every system (thus, we warmed up the cache for each of them).

  • Damien Carol / June 03, 2014 / 12:45 AM

    PrestoDB was used at version 0.60, BUT there have been several releases since 0.60 that improve performance.

    Why use an old version?

  • Patrick Angeles / June 03, 2014 / 5:46 PM

    Hey @Otis,

    That would be a different benchmark altogether. However, Amazon does offer the hs1.8xlarge with 16 vcores, 117GB RAM and 24 x 2TB drives.

  • Yingjie / June 05, 2014 / 11:28 PM

    What’s the difference between Shark (HDFS) and Shark (RDD)?

    • Justin Kestelyn (@kestelyn) / June 06, 2014 / 9:55 AM

      Shark can use either HDFS (Hadoop Filesystem) or RDD (Spark’s native in-memory dataset) as a data source.

  • David Joosten / June 10, 2014 / 10:55 AM

    Can you publish the results of Impala/RCFile as well? I’m not sure how widely used Parquet is currently, and there are still some scenarios where Impala can’t query a Parquet table.

    • Justin Kestelyn (@kestelyn) / June 10, 2014 / 4:12 PM

      Results for Impala/RCFile (single user, geometric mean) – as reported, considerably slower than Impala/Parquet, but still faster than the rest of the comparative set:

      Interactive – 27 sec
      Reporting – 44 sec
      Deep Analytics – 87 sec

      10-user/Interactive – 49 sec

      Total CPU time: 220 hours
      Query throughput: 561 queries/hour

      BTW, there are no limitations or scenarios where RCFile would/should be preferred over Parquet. If you’re running into issues, please contact Cloudera support or post in the Impala community forum so we can help you resolve them.

  • Hari Sekhon / June 24, 2014 / 9:32 AM

    When can we expect to see Impala in Apache Foundation?

    • Justin Kestelyn (@kestelyn) / June 24, 2014 / 10:46 AM

      There is no current plan to do that, but it’s not impossible at some point in the future.

  • Bo Wang / July 07, 2014 / 9:52 AM

    It would be interesting to see how impala/Parquet compares with Shark[Hive-on-Tez]/ORC with secondary indices. Is there a plan to support secondary indices on Parquet?

    • Justin Kestelyn (@kestelyn) / July 11, 2014 / 10:44 AM

      Parquet already supports nested data, if that’s your question. Impala does not yet write to nested data structures, though.

  • Arun Sundar S / July 09, 2014 / 11:43 PM

    Agreed that Impala is the best fit for real-time and near-real-time queries. But what if I want to process the delta data daily and produce a report on that? Also, Impala depends purely on memory; what will happen if the cluster is configured with only 128GB of RAM for the above-mentioned scenario?

    • Justin Kestelyn (@kestelyn) / July 10, 2014 / 9:53 AM

      Arun,

      That depends — if you only need the report once per day, you may not need an “interactive” tool like Impala (a batch process may or may not make more sense, depending on your data and environment).

      Recommended memory config for Impala is 128GB or more.

  • Alexey Bozhkov / July 11, 2014 / 8:01 AM

    Thanks for the review. Why is IBM BigInsight not in the list of competitors?

    • Justin Kestelyn (@kestelyn) / July 11, 2014 / 10:41 AM

      I think you are referring to IBM Big SQL. This comparison was limited to open source options, and Big SQL is not open source.

  • Eswara Reddy / July 11, 2014 / 12:44 PM

    This may be off-topic, as Impala is not claimed to be a fit for ETL purposes, but this question is about PB-scale ETL.
    I am ruling out Spark in-memory RDDs for PB-scale ETL, as they hit memory limits in your test. Spark on HDFS is slower than Tez (from most of the graphs above I see Tez doing better than Spark on HDFS), so what does Cloudera recommend for faster ETL of PB-scale data: Tez or Spark on HDFS?
    Have you also considered a Spark hybrid mode (partly in-memory and partly on HDFS) in your tests?

    • Justin Kestelyn (@kestelyn) / July 16, 2014 / 1:26 PM

      Eswara,

      I don’t think you can conclude that Spark is faster or slower than Tez generically based on these results, as there were queries that did better on each side. Overall, the results were quite similar, which is what you’d expect from a similar architectural approach. The key distinction to consider is the broad community momentum behind Spark in general, and behind Hive on Spark specifically, per our joint development with IBM, MapR, Intel, and Databricks as outlined here:

      http://vision.cloudera.com/broadening-support-for-apache-spark/

      While Hive is traditionally more common for batch processing, Impala’s performance advantage also makes it an attractive option to consider, depending on the type of ELT processing.

  • Sridhar / August 27, 2014 / 2:48 PM

    The results are so different from the ones below… any idea?

    https://amplab.cs.berkeley.edu/benchmark/

    • Justin Kestelyn (@kestelyn) / August 27, 2014 / 3:56 PM

      The AMPlab benchmark was run on an extremely simple workload (and for just a single user) — not really reflective of real-world BI-type queries.

  • Haifeng Li / September 05, 2014 / 9:09 AM

    It is worth understanding why Impala is faster. Here are some possible reasons:

    1. As a native query engine, Impala avoids the startup overhead of MapReduce/Tez jobs. It is well known that MapReduce programs take some time before all nodes are running at full capacity. In Hive, every query suffers this “cold start” problem. In contrast, Impala daemon processes are started at boot time, and thus are always ready to execute a query.

    2. Hadoop reuses JVM instances to partially reduce the startup overhead. However, this also introduces another problem. For big data processing, we prefer large memory. For example, the recommended physical memory for an Impala node is 128 GB or higher. The nodes in this benchmark have 384 GB of memory. Such a big heap is actually a big challenge for the garbage collection system of the reused JVM instances. Stop-the-world GC pauses may add high latency to queries. Impala’s execution engine, written in native C++ code, avoids this problem. Impala may also do a better job of cache management.

    3. Impala processes are multithreaded. Importantly, the scanning portion of a plan fragment is multithreaded and makes use of SSE4.2 instructions. The I/O and network subsystems are also highly multithreaded. As a result, each Impala node runs more efficiently thanks to a high degree of local parallelism.

    4. Impala’s query execution is pipelined as much as possible. In the case of aggregation, the coordinator starts the final aggregation as soon as the pre-aggregation fragments have started to return results. In contrast, in MapReduce the sort and reduce phases can only start once all the mappers are done. Tez currently doesn’t support pipelined execution.

    5. MapReduce materializes all intermediate results, while Impala streams intermediate results between executors. Tez allows different types of Input/Output, including file, TCP, etc., but it seems that Hive doesn’t yet use this feature to avoid unnecessary disk writes.

    6. The reducer of MapReduce employs a pull model to get Map output partitions. For sorted output, Tez makes use of the MapReduce ShuffleHandler, which requires downstream Inputs to pull data over HTTP. With multiple reducers (or downstream Inputs) running simultaneously, it is highly likely that some of them will attempt to read from the same map node at the same time, inducing a large number of disk seeks and slowing the effective disk transfer rate.

    7. Hive’s query expressions are generated at compile time, while Impala does runtime code generation for “big loops” using LLVM, which can produce more optimized code.

    8. Tez allows complete control over the processing, e.g. stopping processing when limits are met. It is very useful for top-k calculation and straggler handling. Unfortunately, this feature is not used by Hive currently. (BTW, Dremel calculates approximate results for top-k and count-distinct using one-pass algorithms. It is not clear if Impala does the same.)

    9. During query execution, Dremel computes a histogram of tablet processing time. If a tablet takes a disproportionately long time to process, it is rescheduled to another server. If trading speed against accuracy is acceptable, Dremel can return results before scanning all the data, which may reduce the response time significantly, as a small fraction of the tablets often takes a lot longer. It is not clear whether Impala implements a similar mechanism, although straggler handling has been stated to be on its roadmap.

    What else?
