New SQL Benchmarks: Apache Impala (incubating) Uniquely Delivers Analytic Database Performance

Categories: Hive Impala Performance Spark

New testing results show a significant difference between the analytic database performance of Impala compared to batch and procedural development engines, as well as Impala running all 99 TPC-DS-derived queries in the benchmark workload.

2015 was an exciting year for Apache Impala (incubating). Cloudera’s Impala team significantly improved Impala’s scale and stability, which enabled many customers to deploy Impala clusters with hundreds of nodes, run millions of queries, and even push Impala concurrency to thousands of users. We introduced highly anticipated features like nested data types, unified fine-grained security with RecordService, and Apache Sentry (incubating). We unveiled Apache Kudu (incubating), which enables Impala to query fast-changing data and provide updatability, and Kudu and Impala both became Apache Incubator projects.

As noted in our recent roadmap update, 2016 is slated to be the most exciting year yet for Impala. With features like dynamic partition pruning, improved YARN integration, Amazon S3 support, and even better performance via multi-core joins and aggregations and increased runtime code-generation, Impala is set to take on even more use cases and greater concurrency as the analytic database for Apache Hadoop.

Over the past few years, the distinction has widened between systems designed as analytic databases, as compared to SQL interfaces exposed for easier development (such as Apache Hive and the Spark SQL module in Apache Spark). Cloudera’s performance engineering team recently completed a new round of benchmarks comparing the most recent, stable releases of those SQL-on-Hadoop engines. For the low-latency and multi-user throughput performance requirements of BI and SQL analytics, there are clear, substantial differences between Impala as an analytic database compared to its batch and procedural-development engine peers:

  • For multi-user queries, Impala is on average 16.4x faster than Hive-on-Tez and 7.6x faster than Spark SQL with Tungsten, with an average response time of 12.8s compared to over 1.6 minutes or more.
  • All 99 TPC-DS-derived queries in the benchmark run on Impala, with similar results in Impala’s favor.

These results demonstrate Impala’s leadership in delivering the low-latency and multi-user throughput for running SQL analytics and BI on Hadoop. Even after the large investments that improved the performance in Hive with Stinger and Spark with Tungsten, there is nearly an order-of-magnitude difference in performance when compared to an analytic database like Impala.

We unfortunately cannot publish results against Impala’s analytic database competitors due to their proprietary licensing restrictions, but we have published the benchmarking queries so you can replicate the results that customers such as Quaero, Epsilon, and many others have seen.

Now, for the details.

Configuration

All tests were run on the same 21-node cluster. For the previous round of testing, we ran on a smaller (64GB per node) memory footprint to correct a misperception that Impala only works well with lots of memory. But, this time, we ran on larger memory (384GB per node) to provide ample memory for Spark to perform at its best.

Each node was configured as follows:

  • CPU: 2 sockets, 12 total cores, Intel Xeon CPU E5-2630L 0 at 2.00GHz
  • 12 disk drives at 932GB each (one for the OS, the rest for HDFS)
  • 384GB memory

Comparative Set

This time, we dropped Presto because it has not kept up with the SQL language functionality to run the latest benchmarks in an apples-to-apples comparison (due to its lack of DECIMAL support, for example).

Queries

In this post, we demonstrate how to run all 99 queries derived from TPC-DS on Impala. Given the significant advancements Impala and other engines have made since our last performance update, we dropped previously used rewrites with DOUBLE (as Impala and other engines support the DECIMAL datatype now) and also dropped SQL-92 style join rewrites (as other engines are now capable of running without them). Full queries are published here.

The testing used a 15TB scale-factor dataset. Each engine was assessed on a file format and compression that ensured the best possible performance and a fair, consistent comparison: Impala and Spark SQL on Apache Parquet with Snappy, and Hive-on-Tez on ORC with Zlib. The standard rigorous testing techniques (multiple runs, tuning, and so on) were used for each of the engines involved.

Results: Multiple Users

Our performance comparison focuses on multi-user results to simulate real-world workloads (as running a dedicated cluster per BI user is clearly impractical). We re-ran the same Interactive queries as in previous testing, running 10 users at the same time. To summarize the results:

  • Impala is the only engine that provides interactive responses for the Interactive query set with an average of 12.8s, compared to the next nearest alternative with an average of over 1.5 minutes (7.6x slower).
  • Impala delivers 6.8x – 15x higher throughput under concurrent load.

These results show that Impala is the only engine that consistently provides the interactive latency and concurrency needed for BI and SQL analytics, despite significant performance improvement efforts from Spark SQL with Tungsten as well as Hive-on-Tez with the Stinger initiative. We’re excited to see these results as more significant investments in performance and concurrency are still under away.

impala-23-f1

impala23-throughput

The astute reader will note that the performance numbers are slower than what was reflected in the previous testing. We attribute that result to the fact that, as noted previously, the originally specified DECIMAL types from the TPC-DS spec were included in the new round of testing, now that all the engines support DECIMAL (and DECIMAL incurs more computational complexity than DOUBLE.)

Running the Full 99 Queries Based on TPC-DS

Next, we will demonstrate how to run all 99 queries derived from the TPC-DS spec on Impala. While most of the queries ran with no changes beyond the addition of simple partition filters, we modified some queries for language differences and additional optimizations, specifically:

impala-23-t1

The 99 queries that run on Impala are published in the benchmark kit here. As you can see in the above table, most of the changes are due to a small number of language features used in multiple queries. For a fair apples-to-apples comparison across engines, we excluded these 30 queries and also eliminated queries that could not run on Spark SQL and Hive-on-Tez, which left a total of 47 remaining queries.

The chart below illustrates the breakdown and results of the same single-user buckets/queries as before, plus the 47 common queries. As noted previously, it’s uncommon for real-world deployments to run a cluster with a single BI user, so the multiuser results from the previous section are a better comparison.

impala-23-f3

TPC-DS was designed to be representative of a traditional report-based workload, rather than the more common self-service and exploratory BI workloads you see in Hadoop today. The latter doesn’t have a canonical industry benchmark. When assessing for your environment it’s important to look at the characteristics of your workload that are best suited for particular systems. For Cloudera customers, the Cloudera Navigator Optimizer tool is designed to help with that assessment.

Conclusion

Cloudera’s vision for Impala is for it to become the most performant, compatible, and usable analytic database. These results, which demonstrate how to run all 99 queries derived from TPC-DS on Impala, is another important milestone on the way to its status as the leader and open standard for BI and SQL analytics on modern big data architecture.

Choosing the right engine for the right job is very important. Despite Impala’s significant performance lead as an analytic database, Hive and Spark SQL continue to provide important capabilities for other use cases and users alongside Impala:

  • Hive is designed to make batch processing jobs like data preparation and ETL more accessible than raw MapReduce via a SQL-like language. Most data served to BI users in Impala today is prepared by ETL developers in Hive. Hive-on-Tez and Hive-on-Spark provide the same great Hive capabilities, yet use Tez or Spark as the execution engines for incrementally faster processing.
  • Spark SQL is an API within Spark that is designed for Scala or Java developers to embed SQL queries into their Spark programs. This API enables common data engineering like aggregations, filters, joins, and so on to be simply expressed in SQL as part of a broader procedural Spark application. For example, data engineers and data scientists commonly use Spark for feature engineering and model development.
  • Impala is modern MPP query engine purpose-built for Hadoop to provide BI and SQL analytics at interactive latencies. For BI users, there’s a big difference between clicking on a report or visualization and getting a response in seconds, versus having to wait minutes. Interactivity is critical for BI users and that interactivity must be maintained as these tools scale to more users.

With multi-core joins and aggregates, dynamic partition pruning, and a variety of other performance enhancements coming up, we expect Impala’s performance lead to widen further. Be on the lookout for new performance testing results that showcase these enhancements in future posts.

(As usual, we encourage you to independently verify these results by running your own benchmarks based on the open toolkit.)

This benchmark is derived from the TPC-DS benchmark and, as such, is incomparable to published TPC-DS results.

Devadutta Ghat is Senior Product Manager at Cloudera.

David Rorke is a Performance Engineer at Cloudera.

Dileep Kumar is a Performance Engineer at Cloudera.

Appendix (added April 20, 2016)

New testing was also recently completed with Hive 2.0 on Tez 0.8.2; results shown below.

impala-tez082-f1 impala-tez082-f2

 

facebooktwittergoogle_pluslinkedinmailfacebooktwittergoogle_pluslinkedinmail

35 responses on “New SQL Benchmarks: Apache Impala (incubating) Uniquely Delivers Analytic Database Performance

    1. Justin Kestelyn Post author

      Richard,

      At this point, we don’t see enough production deployments of Drill to make including it in testing useful and worthwhile.

      1. John

        Well define “See” with Apache Drill. You don’t support it, so you wouldn’t see it correct? I guess, using “We don’t see it” as a reason to exclude it seems an odd reason. Do you see implementations of Spark SQL that are replacing Hive/Impala in functionality? (I.e. as the enterprise warehousing engine? I am guessing not as Spark SQL doesn’t really seem to fit that, yet Spark SQL is included.

        1. Justin Kestelyn Post author

          John,

          1. Drill is open source, so we don’t need to be supporting it in order to “see” it…
          2. No. As you point out, Spark SQL has a different use case than either Hive or Impala. But there are plenty of new use cases for it popping up.

          1. partha

            I beg to differ here. The purpose of performance benchmarking is to help enable adoption of a tool. Therefore, if someone is already using Drill for a production implementation, s/he has already done the benchmarking and you don’t have to do it anymore unless for the purpose of a blog.

    1. Justin Kestelyn Post author

      HD,

      At the time testing started (October 2015), the then-current release of Hive 2.0 was pre-configured for use with Tez 0.5.2 (pom.xml). For future rounds of testing, we will continue to default to the pom.xml.

      1. John

        Isn’t the pom.xml just a compile time dependency to expose APIs? I mean, in a cluster deployed in October 2015, a production Tez Cluster would not be running 0.5.2. I find going by the pom.xml on the Hive Project to be disingenuous at best.

        1. Justin Kestelyn Post author

          I fail to see what is “disingenuous” about relying on upstream Hive for configuration information. What would be a more “fair” way to do it?

          1. John

            More Fair: Use the version of Tez that shipped with a common platform (Hortonworks) at the time of the benchmarks. What was the most recent Impala version shipping with Cloudera at the time of the Benchmarks? It wasn’t 2.3, if I am reading the Cloudera documentation correctly, 2.3 isn’t even released now, yet when the benchmarks were performed, Hortonworks was supporting more than just 0.5.2 on their platform. So you took a super modern (not yet)released version of Impala and put it against an very old version of Tez. (note, using Hive as your “upstream” provider doesn’t make sense either, Hive just submits the job, so why use it as the indicator?) Hive 2.0 with version of Tez supported at the time by Hortonworks would be defendable versioning to use at the time of the benchmarks. So doing a test dated Oct 2015, using a competitors product from 2014, and then using your product that isn’t released, even on the date you release your blog post seems suspect. If your product is not released, you may want to provide a release date and/or a list of blockers on that release as per https://issues.cloudera.org/projects/IMPALA/versions/10911 2.3 isn’t out yet.

          2. Justin Kestelyn Post author

            Impala 2.3 was released in November 2015.

            We feel that deferring to the pom.xml is the correct way to go because it is objective and repeatable. Thus, for the next round of testing, a newer version of Tez will be used.

            In the meantime, we have provided everything you/users need to run exactly the same benchmark. So, try it for yourself.

          3. Justin Kestelyn Post author

            FYI,

            We recently completed testing using Hive 2.0 on Tez 0.8.2; see Appendix for results.

    1. Justin Kestelyn Post author

      Maciek,

      Spark 1.6 had not been released when testing began. Future benchmarks will include more recent releases of Spark.

  1. reader

    Dropping presto because it’s not supporting some sql constructs like decimal is lousy. At least you should present the numbers for the queries that can be run in presto. And btw presto is one of the engines that has great SQL support, I am sure there are many sql constructs that presto supports and the engines you evaluated do not.

    1. Justin Kestelyn Post author

      Anonymous Reader,

      65% of the queries in the TPC-DS derived benchmark require support for DECIMAL, so excluding these queries (and other queries that did not run) just for Presto would leave few queries left over. We will, however, consider publishing future results including Presto with a more suitable query set.

  2. Buntu

    I realize Parquet was used for Impala vs Spark but would be great to know number of files, parquet block size and hdfs block size settings used.

  3. Shef

    It seems to me the title of this blog entry gives the impression that performance results were published for 99 queries “derived” from TPC-DS.

    But performance results were only published for 47 of the 99 queries. And for a single user.

    For multiple users, performance results were only published for 7 of the 99 queries. Which are the simpler, interactive ones.

    For the full set of 99 queries – only the derived SQL query statements themselves are provided in a link. Not performance results.

    1. Justin Kestelyn Post author

      Shef, a couple things:

      1. As stated in the post, for the single-user queries, we excluded 30 queries that were modified for Impala due to language differences and additional optimizations, and 22 queries that would not run on either Spark SQL or Hive-on-Tez. That left 47 queries for fair, apples-to-apples comparison.

      2. In all rounds of testing throughout Impala’s history, we have always used only “Interactive” queries as the benchmark because those queries are the most commonly used/most representative ones in multi-user BI environments. In contrast, long-running queries are rare in such environments.

      As for results, we have provided the geometric mean for comparison purposes.

      I hope that helps…

  4. Sandeep Khosala

    Can you guys put the SparkSQL/Tez queries out as well ? Also can you put tunings you have used for these component. In absence of that it seems incomplete and be never be reproducible.
    Looking at Impala queries looks like Tez/SparkSQL has unfair disadvantage of using specific Partition Filter. Some of the PF you have used I bet your planner would be able to catch in near future.

    1. Justin Kestelyn Post author

      Sandeep,

      We used the best known configurations for each component, and ran precisely the same queries on all of them. The queries were run with and without partition filters; overall, the partition filters were indeed helpful for Spark SQL and Hive.

      1. Sandeep Khosala

        Is is possible that you guys share the sql and tuning as we are not able to run exact same version of queries on tea/spark ?
        I think that would be straightforward to share and I assume there is nothing to hide the way cloudera has been doing the benchmark so far

  5. xingang

    it’s awesome, impala, GO!!!

    I even make a new scanner required by impala to scan my own OLAP storage engine, and that happened even Impala in 2013, and its performance just rocks!!! now, even better!! I mean for its qps improvement

  6. Theorist

    So here you’ve shown that, with your benchmarking, Impala is roughly around 5x faster than Spark SQL and roughly around 10x faster than Hive-on-Tez. However, Hortonworks published a comparison of Impala to Hive-on-Tez done by Yahoo Japan, in which is was found that Impala is only faster than Hive-on-Tez at low cluster load. At high cluster load, they found the reverse is the case (“Impala returns a very fast response when the cluster is at low load state, but is not suitable for use such as running the SQL in parallel.”):

    So this is yet another illustration, at least for me, about how tricky benchmarks can be — change the conditions, and the results can reverse completely.

    Could you please address the discrepancy between the results of the two different benchmarks? I’d be interested to hear your take on this, and whether you agree with their explanation and conclusion.

    1. Justin Kestelyn Post author

      Theorist,

      It’s difficult to comment on that benchmark because we know nothing about the workload or the hardware involved. Benchmarks that are not reproducible, and hence that prevent true apples-to-apples comparisons, are hard to take seriously. (Furthermore, the data format used was text, not a columnar format like Parquet. Again, hard to take seriously.)

  7. Giuro

    I can’t understand why it’s so hard understand Impala is becoming a great path for Analytics.
    I see Justin reporting “Despite Impala’s significant performance lead as an analytic database, Hive and Spark SQL continue to provide important capabilities for other use cases and users alongside Impala”.

    In my experience I see very similar results out of a laboratory; in any case I never use Impala for Machine Learning cause of Spark offers a better way ( MLib ) ; i find both honest and unbiased ” Hive and Spark SQL continue to provide important capabilities for other use cases and users alongside Impala”.

    But this is only my poor opinion

Leave a Reply

Your email address will not be published. Required fields are marked *