New testing results show a significant difference between the analytic database performance of Impala compared to batch and procedural development engines, as well as Impala running all 99 TPC-DS-derived queries in the benchmark workload.
2015 was an exciting year for Apache Impala (incubating). Cloudera’s Impala team significantly improved Impala’s scale and stability, which enabled many customers to deploy Impala clusters with hundreds of nodes, run millions of queries, and even push Impala concurrency to thousands of users. We introduced highly anticipated features like nested data types, unified fine-grained security with RecordService, and Apache Sentry (incubating). We unveiled Apache Kudu (incubating), which enables Impala to query fast-changing data and provide updatability, and Kudu and Impala both became Apache Incubator projects.
As noted in our recent roadmap update, 2016 is slated to be the most exciting year yet for Impala. With features like dynamic partition pruning, improved YARN integration, Amazon S3 support, and even better performance via multi-core joins and aggregations and increased runtime code-generation, Impala is set to take on even more use cases and greater concurrency as the analytic database for Apache Hadoop.
Over the past few years, the distinction has widened between systems designed as analytic databases, as compared to SQL interfaces exposed for easier development (such as Apache Hive and the Spark SQL module in Apache Spark). Cloudera’s performance engineering team recently completed a new round of benchmarks comparing the most recent, stable releases of those SQL-on-Hadoop engines. For the low-latency and multi-user throughput performance requirements of BI and SQL analytics, there are clear, substantial differences between Impala as an analytic database compared to its batch and procedural-development engine peers:
- For multi-user queries, Impala is on average 16.4x faster than Hive-on-Tez and 7.6x faster than Spark SQL with Tungsten, with an average response time of 12.8s compared to over 1.6 minutes or more.
- All 99 TPC-DS-derived queries in the benchmark run on Impala, with similar results in Impala’s favor.
These results demonstrate Impala’s leadership in delivering the low-latency and multi-user throughput for running SQL analytics and BI on Hadoop. Even after the large investments that improved the performance in Hive with Stinger and Spark with Tungsten, there is nearly an order-of-magnitude difference in performance when compared to an analytic database like Impala.
We unfortunately cannot publish results against Impala’s analytic database competitors due to their proprietary licensing restrictions, but we have published the benchmarking queries so you can replicate the results that customers such as Quaero, Epsilon, and many others have seen.
Now, for the details.
All tests were run on the same 21-node cluster. For the previous round of testing, we ran on a smaller (64GB per node) memory footprint to correct a misperception that Impala only works well with lots of memory. But, this time, we ran on larger memory (384GB per node) to provide ample memory for Spark to perform at its best.
Each node was configured as follows:
- CPU: 2 sockets, 12 total cores, Intel Xeon CPU E5-2630L 0 at 2.00GHz
- 12 disk drives at 932GB each (one for the OS, the rest for HDFS)
- 384GB memory
- Impala 2.3
- Hive-on-Tez: Hive 2.0 on Tez 0.5.2 (aka Stinger; see appendix for results on Tez 0.8.2)
- Spark SQL 1.5 with Tungsten
This time, we dropped Presto because it has not kept up with the SQL language functionality to run the latest benchmarks in an apples-to-apples comparison (due to its lack of
DECIMAL support, for example).
In this post, we demonstrate how to run all 99 queries derived from TPC-DS on Impala. Given the significant advancements Impala and other engines have made since our last performance update, we dropped previously used rewrites with
DOUBLE (as Impala and other engines support the
DECIMAL datatype now) and also dropped SQL-92 style join rewrites (as other engines are now capable of running without them). Full queries are published here.
The testing used a 15TB scale-factor dataset. Each engine was assessed on a file format and compression that ensured the best possible performance and a fair, consistent comparison: Impala and Spark SQL on Apache Parquet with Snappy, and Hive-on-Tez on ORC with Zlib. The standard rigorous testing techniques (multiple runs, tuning, and so on) were used for each of the engines involved.
Results: Multiple Users
Our performance comparison focuses on multi-user results to simulate real-world workloads (as running a dedicated cluster per BI user is clearly impractical). We re-ran the same Interactive queries as in previous testing, running 10 users at the same time. To summarize the results:
- Impala is the only engine that provides interactive responses for the Interactive query set with an average of 12.8s, compared to the next nearest alternative with an average of over 1.5 minutes (7.6x slower).
- Impala delivers 6.8x – 15x higher throughput under concurrent load.
These results show that Impala is the only engine that consistently provides the interactive latency and concurrency needed for BI and SQL analytics, despite significant performance improvement efforts from Spark SQL with Tungsten as well as Hive-on-Tez with the Stinger initiative. We’re excited to see these results as more significant investments in performance and concurrency are still under away.
The astute reader will note that the performance numbers are slower than what was reflected in the previous testing. We attribute that result to the fact that, as noted previously, the originally specified
DECIMAL types from the TPC-DS spec were included in the new round of testing, now that all the engines support
DECIMAL incurs more computational complexity than
Running the Full 99 Queries Based on TPC-DS
Next, we will demonstrate how to run all 99 queries derived from the TPC-DS spec on Impala. While most of the queries ran with no changes beyond the addition of simple partition filters, we modified some queries for language differences and additional optimizations, specifically:
The 99 queries that run on Impala are published in the benchmark kit here. As you can see in the above table, most of the changes are due to a small number of language features used in multiple queries. For a fair apples-to-apples comparison across engines, we excluded these 30 queries and also eliminated queries that could not run on Spark SQL and Hive-on-Tez, which left a total of 47 remaining queries.
The chart below illustrates the breakdown and results of the same single-user buckets/queries as before, plus the 47 common queries. As noted previously, it’s uncommon for real-world deployments to run a cluster with a single BI user, so the multiuser results from the previous section are a better comparison.
TPC-DS was designed to be representative of a traditional report-based workload, rather than the more common self-service and exploratory BI workloads you see in Hadoop today. The latter doesn’t have a canonical industry benchmark. When assessing for your environment it’s important to look at the characteristics of your workload that are best suited for particular systems. For Cloudera customers, the Cloudera Navigator Optimizer tool is designed to help with that assessment.
Cloudera’s vision for Impala is for it to become the most performant, compatible, and usable analytic database. These results, which demonstrate how to run all 99 queries derived from TPC-DS on Impala, is another important milestone on the way to its status as the leader and open standard for BI and SQL analytics on modern big data architecture.
Choosing the right engine for the right job is very important. Despite Impala’s significant performance lead as an analytic database, Hive and Spark SQL continue to provide important capabilities for other use cases and users alongside Impala:
- Hive is designed to make batch processing jobs like data preparation and ETL more accessible than raw MapReduce via a SQL-like language. Most data served to BI users in Impala today is prepared by ETL developers in Hive. Hive-on-Tez and Hive-on-Spark provide the same great Hive capabilities, yet use Tez or Spark as the execution engines for incrementally faster processing.
- Spark SQL is an API within Spark that is designed for Scala or Java developers to embed SQL queries into their Spark programs. This API enables common data engineering like aggregations, filters, joins, and so on to be simply expressed in SQL as part of a broader procedural Spark application. For example, data engineers and data scientists commonly use Spark for feature engineering and model development.
- Impala is modern MPP query engine purpose-built for Hadoop to provide BI and SQL analytics at interactive latencies. For BI users, there’s a big difference between clicking on a report or visualization and getting a response in seconds, versus having to wait minutes. Interactivity is critical for BI users and that interactivity must be maintained as these tools scale to more users.
With multi-core joins and aggregates, dynamic partition pruning, and a variety of other performance enhancements coming up, we expect Impala’s performance lead to widen further. Be on the lookout for new performance testing results that showcase these enhancements in future posts.
(As usual, we encourage you to independently verify these results by running your own benchmarks based on the open toolkit.)
This benchmark is derived from the TPC-DS benchmark and, as such, is incomparable to published TPC-DS results.
Devadutta Ghat is Senior Product Manager at Cloudera.
David Rorke is a Performance Engineer at Cloudera.
Dileep Kumar is a Performance Engineer at Cloudera.
Appendix (added April 20, 2016)
New testing was also recently completed with Hive 2.0 on Tez 0.8.2; results shown below.