Impala Performance Update: Now Reaching DBMS-Class Speed

Impala’s speed now beats the fastest SQL-on-Hadoop alternatives. Test for yourself!

Since the initial beta release of Cloudera Impala more than one year ago (October 2012), we’ve been committed to regularly updating you about its evolution into the standard for running interactive SQL queries across data in Apache Hadoop and Hadoop-based enterprise data hubs. To briefly recap where we are today:

  • Impala is being widely adopted.
    Impala has been downloaded by people in more than 5,000 unique organizations, and Cloudera Enterprise customers from multiple industries and Fortune 100 companies have deployed it as supported infrastructure, integrated with their existing BI tools, in business-critical environments. Furthermore, MapR and Amazon Web Services (via its Elastic MapReduce service) recently announced support for Impala in their own platforms.
  • Impala has demonstrated production readiness.
    Impala has demonstrated superior performance for interactive queries under concurrent workloads, while providing more reliable resource sharing than batch-processing alternatives like Apache Hive.
  • Impala sports an ever-increasing list of enterprise features.
    New enterprise features are materializing rapidly and on schedule, including fine-grained, role-based authorization, user-defined functions (UDFs), cost-based optimization, and pre-built analytic functions.

In this update, you’ll read about some exciting performance results (based on an open decision-support benchmark derived from TPC-DS) that not only confirm Impala’s continuing advantage over the latest release of Hive (0.12) for interactive queries, but also show Impala querying open Hadoop data faster than a leading proprietary analytic DBMS querying its own native columnar data store. (Yes, you read that right!)

We consider this milestone the best evidence yet that Impala is unique in its ability to surpass the performance of all SQL-on-Hadoop alternatives while delivering the value propositions of Hadoop (fractional cost, unlimited scalability, unmatched flexibility), a combination that the creators of those alternatives (including remote-query approaches and siloed DBMS/Hadoop hybrids) cannot hope to match.

First, let’s take a look at the latest results compared to those of Hive 0.12 (aka “Stinger”).

Impala versus Hive

The results below show that Impala continues to outperform all the latest publicly available releases of Hive (the most current of which runs on YARN/MR2).

For this analysis, we ran Hive 0.12 against an ORCFile data set versus Impala 1.1.1 against the same data set in Parquet (the general-purpose, open source columnar storage format for Hadoop). To obtain the best possible results from Hive, we converted the TPC-DS queries into SQL-92-style joins, manually optimized the join order, and added an explicit partition predicate. (To ensure a direct comparison, the same modified queries were also run against Impala.)
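
For illustration, a rewrite of the kind just described might look like the following minimal sketch, using the standard TPC-DS store_sales and date_dim tables. This is not one of our actual benchmark queries; the year and the partition-key range are placeholders.

    -- Original TPC-DS style: implicit comma join, with the date filter only on the dimension table
    -- SELECT ss_item_sk, SUM(ss_ext_sales_price)
    -- FROM store_sales, date_dim
    -- WHERE ss_sold_date_sk = d_date_sk
    --   AND d_year = 2002
    -- GROUP BY ss_item_sk;

    -- Hand-tuned version: explicit SQL-92 join syntax, a chosen join order, and an
    -- explicit partition predicate on the fact table's partition key
    SELECT ss_item_sk, SUM(ss_ext_sales_price)
    FROM store_sales
    JOIN date_dim ON ss_sold_date_sk = d_date_sk
    WHERE d_year = 2002
      AND ss_sold_date_sk BETWEEN 2452245 AND 2452609  -- placeholder surrogate-key range for the year
    GROUP BY ss_item_sk;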

The data set in question comprised 3TB (TPC-DS scale factor 3,000) spread across five typical Hadoop DataNodes (dual-socket, 8-core, 16-thread CPU; 96GB memory; 1Gbps Ethernet; 12 x 2TB disk drives). As in our previous analysis, the query set was intentionally diverse to ensure unbiased results, including a variety of fairly standard joins (from one to seven joins per query) and aggregations, as well as complex multi-level aggregations and inline views (categorized into Interactive Exploration, Reports, and Deep Analytics buckets). We performed five runs and aggregated the results into the numbers you see below.

As you’ll see in the chart below, Impala outperformed Hive on every query, and by very considerable margins:

Impala versus Hive 0.12/Stinger
(Lower bars are better)

In summary, Impala outperformed Hive by 6x to 69x (an average of 24x), depending on the category involved.

Clearly, Impala’s performance continues to lead that of Hive. Now, let’s move on to the even more exciting results!

Impala versus DBMS-Y

For this milestone, we tested a popular commercial analytic DBMS (referred to here as “DBMS-Y” due to a restrictive proprietary licensing agreement) against Impala 1.2.2 on a 30TB set of TPC-DS data (scale factor 30,000) — maintaining the same query complexity and unbiased pre-selection characteristics as in the Impala versus Hive analysis above. The queries were not customized for optimal Impala join order; rather, we used the same format as provided in the official TPC-DS kit.

In this experiment, we:

  1. Pre-selected a diverse set of 20 TPC-DS queries before obtaining any results, to ensure an unbiased comparison.
  2. Installed DBMS-Y (with OS filesystem) and Impala (with CDH) on precisely the same cluster (20 nodes with 96GB memory per node).
  3. Loaded all the data into DBMS-Y’s proprietary columnar data store for the DBMS-Y runs, and into Parquet for the Impala runs.
  4. Ran the set of 20 queries on each of the two systems individually (with the other one completely shut down to prevent any interference).
  5. Validated that execution plans from both systems were common-sense and routine — to avoid comparing a “garbage” plan to a “good” plan.
  6. Removed currently unsupported analytic functions and added an explicit predicate to the WHERE clauses that expresses a partition filter on the fact table; a sketch of this kind of modification appears after this list. (Window functions and dynamic partition pruning are both on the Impala near-term roadmap.)
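
For illustration only, a modification in the spirit of step 6 might look like the fragment below. This is a hypothetical sketch loosely modeled on the published q98 changes in the impala-tpcds-kit repository, not the exact query we ran; the column names come from the standard TPC-DS schema and the partition-key range is a placeholder.

    -- Window function commented out (not supported in Impala 1.2.2; on the near-term roadmap):
    --   , SUM(ss_ext_sales_price) OVER (PARTITION BY i_class) AS class_revenue
    SELECT i_item_desc, i_category, i_class,
           SUM(ss_ext_sales_price) AS itemrevenue
    FROM store_sales
    JOIN item     ON ss_item_sk = i_item_sk
    JOIN date_dim ON ss_sold_date_sk = d_date_sk
    WHERE d_date BETWEEN '2002-02-01' AND '2002-03-01'
      -- Explicit partition filter: the dimension-table date condition restated on the
      -- fact table's partition key, so only the relevant partitions are scanned
      AND ss_sold_date_sk BETWEEN 2452307 AND 2452335  -- placeholder key range
    GROUP BY i_item_desc, i_category, i_class;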

To obtain the best possible results from DBMS-Y, we modified the date-range predicates in the original TPC-DS queries to one-month boundaries, to avoid penalizing DBMS-Y for its inability to handle daily partitions on the 30TB data set. We ran the queries against DBMS-Y both with and without an explicit partition filter and chose the better result. (In a few cases, excluding the partition filter led to better DBMS-Y performance.)
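
Concretely, “one-month boundaries” means that a query’s date-range predicate was aligned to a single calendar month rather than an arbitrary multi-week window. The sketch below is illustrative only; the dates are placeholders, not values from our actual queries.

    -- Original TPC-DS style: a range anchored at an arbitrary date
    --   WHERE d_date BETWEEN '2002-02-13' AND '2002-03-15'
    -- Modified form: the same kind of filter aligned to one calendar month
    SELECT COUNT(*)
    FROM store_sales
    JOIN date_dim ON ss_sold_date_sk = d_date_sk
    WHERE d_date BETWEEN '2002-02-01' AND '2002-02-28';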

As you can see from the results below, Impala outperformed DBMS-Y by up to 4.5x (and by an average of 2x), with only three queries performing more slowly on Impala:

Impala versus DBMS-Y
(Lower bars are better)

To repeat: on 17 of the 20 queries, Impala running on open Hadoop data was faster than DBMS-Y running on its native, proprietary data store, with some queries nearly 5x as fast.

Multiple Impala customers have reported results similar to those above against DBMS-Y and other popular commercial analytic DBMSs. We’ve published the benchmark kit here so you can confirm the results for yourself!

Experiments with Linear Scalability

With that extremely impressive outcome, we could have called it a day — but we didn’t stop there. Next, we wanted to document what Impala users have reported with respect to Impala’s linear scalability in response time, concurrency, and data scale. For this set of tests, we used:

  • Two clusters of the same hardware: one of 18 nodes and one of 36 nodes
  • 15TB-scale and 30TB-scale TPC-DS data sets
  • A multi-user workload of TPC-DS queries selected from the “Interactive” bucket described previously

The first set of results below demonstrates that you can simply add machines to an Impala cluster to achieve the desired latency. For example, if you compare a set of queries on 15TB with 18 nodes to the same queries on 15TB with 36 nodes, you’ll find that response times are roughly halved with 2x the number of nodes:

2x the Hardware (Expectation: Cut response times in half)

 

Take-away: When the number of nodes in the cluster is doubled, expect the response time for queries to be reduced by half.

The next results demonstrate that you can simply add more machines to the cluster to scale up the number of concurrent users while maintaining latency. Here we ran the same set of queries and 15TB data set against both the 18-node and 36-node clusters, but we doubled the level of concurrency with the 36-node cluster:

2x the Users, 2x the Hardware (Expectation: Constant response times)

 

Take-away: Assuming constant data, expect that doubling the number of users and hardware will result in constant response time.

The final results below demonstrate that you can simply add new machines to your cluster as your Big Data sets grow and expect to achieve the same response times. In this experiment, we ran the same queries on the 18-node and 36-node clusters, but this time we ran against a 15TB data set with the 18-node cluster and a 30TB data set with the 36-node cluster:

2x the Data, 2x the Hardware (Expectation: Constant response times)

 

Take-away: When both the data and the number of nodes in the cluster are doubled, expect the response time for queries to remain constant.

Based on the above, you can safely conclude that Impala’s linear scalability is well established.

Conclusion

To briefly summarize, based on our latest performance analysis:

  • Impala handily outperformed the latest public release of Hive (0.12 – aka Stinger).
  • Impala’s performance on the open Parquet file format for Hadoop, on average, substantially surpassed that of DBMS-Y on its proprietary columnar data store.
  • Impala scaled linearly and predictably as users, data, or hardware increased.

These results are a milestone in an already exciting year for Impala users. Until now, users could count on superior performance for BI-style, interactive queries over batch-processing alternatives, even when running multiple workloads simultaneously. Today, with Impala 1.2.x and Parquet, they can also count on performance similar to that of an analytic DBMS on proprietary data, while retaining all the well-known advantages of Hadoop; there’s no longer any need to compromise. Clearly, we’re entering a “golden age” for BI users and customers!

Although we’re proud of these results, we encourage you to do your own testing, whether using the methodology described above or your own approach. (Unfortunately, this goal is made harder to reach by the common use of restrictive licensing agreements that make direct benchmarking impossible.) We welcome any and all outcomes, because in the end, they can only make your Impala experience better.

When Impala 2.0 arrives in the first half of 2014, you’ll see further performance improvements, as well as the addition of popular SQL functionality like analytic window functions. Impala is well on its way to becoming one of the main access points to the Enterprise Data Hub.

The authors would like to thank Arun Singla (Engineering Manager, Performance at Cloudera) for his contributions to this project.

Justin Erickson is a director of product management at Cloudera.

Marcel Kornacker is the creator/tech lead of Impala at Cloudera.

Greg Rahn and Yanpei Chen are performance engineers at Cloudera.


20 Responses
  • Steve Smith / January 13, 2014 / 10:56 AM

    That’s great news! Could you confirm the roadmap for supporting all of the TPC-DS queries? And I’m wondering what DBMS-Y stands for. This kind of openness in competition can only bring out the best.

  • Andy / January 13, 2014 / 12:15 PM

    Could you share the exact 20 TPC-DS queries that were executed? And did you run the queries as is or did you modify any of those queries?

  • John A / January 14, 2014 / 8:13 PM

    What were the “currently unsupported analytic functions” excluded from this comparison?

  • Justin Erickson / January 15, 2014 / 5:23 PM

    John A,

    The query modifications are marked with comments in the GitHub repo and explained in the blog post. For example, in q98 (https://github.com/cloudera/impala-tpcds-kit/blob/master/queries/q98.sql) you can see the commented-out window function, as well as a comment indicating the explicit partition-key filter that moves a WHERE condition from the dimension table to the fact table without affecting the semantics of the query, as described in the post above.

    The currently unsupported analytic functions that we commented out are window functions (i.e., OVER ... PARTITION BY clauses), ROLLUP, and GROUPING.

  • G.C. Chen / January 16, 2014 / 6:54 PM

    Great article! Congratulations on the nice results!

    It would be interesting to see a performance comparison between the latest Impala and Shark. Ref: https://amplab.cs.berkeley.edu/benchmark/

  • Justin Kestelyn (@kestelyn) / January 17, 2014 / 3:18 PM

    Have at it! As indicated above, the queries we used are published to the world.

  • Aminul / January 19, 2014 / 8:51 PM

    Pivotal claims that HAWQ is faster than Impala.
    http://www.gopivotal.com/sites/default/files/Hawq_WP_042313_FINAL.pdf.
    Is there any performance data for Impala vs HAWQ?

    • Justin Kestelyn (@kestelyn) / January 20, 2014 / 3:35 PM

      Aminul,

      The results you cite here are extremely dated – they’re based on testing with the Beta release of Impala, and were done before Parquet even existed. They also do not disclose any details of the file formats or benchmark.

      We’d love to disclose transparent comparisons with proprietary vendors, but Greenplum is an example of “a popular commercial analytic DBMS” with “a restrictive proprietary licensing agreement” (as referred to above) that forced us to mask it as “DBMS-Y”. HAWQ is based on the Greenplum database and carries a similarly restrictive license.

  • Nikita Makeev / January 30, 2014 / 10:02 AM

    Did you test for performance/service degradation? My tests showed an increasing error rate as I increased concurrency. I tried running 16 concurrent queries against a 5-node cluster (16GB RAM, 4 cores per node) and got about a 50% error rate, i.e., some queries could not be completed. The test setup used Impala 1.2.0 and 160GB of data stored in Parquet.

    • Justin Kestelyn (@kestelyn) / February 05, 2014 / 8:37 AM

      Nikita,

      It’s hard to diagnose this issue without knowing your config details, but 16GB RAM per node is minuscule for Impala (at least 128GB is recommended). Feel free to post more details in the Impala forum at community.cloudera.com.

  • Dave / April 28, 2014 / 7:43 AM

    Hello,

    I have some questions about the queries.

    What queries are included for each of the query categories that you used?

    The query categories don’t match what is defined in the TPC-DS specification. Well, the Interactive and Reports categories have analogous classes in the spec (Ad Hoc and Reporting) but the Data Analytics category – does that map to the Data Mining class in the spec? Or to the Iterative OLAP class?

    Finally, where did the ss_max query come from? I didn’t see it in the TPC-DS specification.

    Thanks,
    Dave

  • Rob / May 23, 2014 / 2:14 AM

    A proprietary, columnar data store, workloads starting from 3TB, scales across 36+ Linux hosts…
    lemme think what that DBMS-Y might be :D

  • Igor / June 10, 2014 / 7:56 AM

    Did you:
    1) give the DBMS-Y vendor (whoever it is) an opportunity to review and comment on your DBMS-Y benchmark results, and/or tune DBMS-Y’s performance, or
    2) make any effort yourself to tune DBMS-Y query performance (to reflect real-world usage, since no one runs queries that take many minutes against a 30TB database without making a serious attempt at tuning)?
    If neither 1) nor 2) is true, the blog post above sounds a lot like PR rather than an honest comparison.

    • Justin Kestelyn (@kestelyn) / June 10, 2014 / 9:28 AM

      Igor,

      The post itself is fully transparent about what was done. And as previously noted, the results are in the same ballpark as those reported by customers.

      Furthermore, we have published/provided all the tools you would need for running your own comparison against an analytic DBMS of your choice.

  • Dave / June 23, 2014 / 3:57 PM

    Hi, Justin,

    Thanks for the response (to my earlier ‘query’). I think one of the queries is missing – q65. Where does that land in the classification of the queries?

    Also, ss_max is not a TPC-DS query (at least not one that I could find in the TPC-DS spec), so should it not be included in this discussion? Or perhaps at least separated (removed?) from the discussion of the TPC-DS queries?

    Thanks again for your support on this topic.

    Dave

  • Dave / June 23, 2014 / 4:03 PM

    I just noticed the counts of the types and looking at the elapsed time I’m guessing q65 is Deep Analytics (to make that the 6th Deep Analytics query).

    My other question about ss_max remains though.

    Thanks,
    Dave

    • Justin Kestelyn (@kestelyn) / June 23, 2014 / 8:50 PM

      Dave, ss_max is not from TPC-DS, no. It does a full table scan and several aggregations over a fact table.
