Released with CDH 5.8, Impala 2.6 brings solid performance improvements, particularly for clusters secured by Kerberos running BI workloads on Apache Hadoop.
Just a few months back, we showed you how Impala 2.5 delivered a 4x performance boost compared to Impala 2.3 for BI workloads on Hadoop via the introduction of several features like runtime filters. Here’s an update: Compared to two releases ago, Impala 2.6 delivers 12x better performance on secure workloads and continues this drumbeat of consistent performance improvement.
We are excited to share details on performance improvements in Impala 2.6 with you here. (Impala 2.6 also brings some great new cloud features, including support for reading and writing data directly from Amazon S3. You can learn more about Impala on S3 and its performance in an upcoming post.) For the full list of new features, check out the release notes.
Summary of Performance Improvements
Compared to Impala 2.5, Impala 2.6:
- Is 22% faster on TPC-DS, with targeted queries running 2.4x to 8.9x faster
- Is 3x faster on secure clusters as measured on queries derived from TPC-H
- Offers 38% better query throughput for concurrent workloads, as measured by 16 concurrent users on TPC-DS
(Note: The benchmark used is derived from the TPC-DS and TPC-H benchmarks and, as such, is not comparable to published TPC-DS or TPC-H results.)
Next, let’s look at the details.
Improved Performance on Encrypted Workloads
Through optimizations in the Thrift client, Impala 2.6 query performance has improved by 3x on average on workloads secured with Kerberos, with some queries (such as TPC-H Q4) running 11x faster.
Multiple Destinations for Runtime Filters
Impala 2.6 brings several improvements to the runtime filtering feature introduced in Impala 2.5. For example, each runtime filter can now target more than one scan. Queries in which multiple fact tables join with one or more selective dimension tables benefit from this feature: TPC-DS queries such as 2, 5, 40, 54, 71 and 73 are 4x faster. The same change also means that a runtime filter is now applied to all the operands of UNION ALL operators.
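As a rough illustration of one filter serving several scans, the sketch below builds a single filter from a selective dimension table and applies it to two fact-table scans, as would happen under a UNION ALL. It is a toy model, not Impala's implementation: a plain Python set stands in for the Bloom filter, and the table and column names (date_dim, store_sales, web_sales, date_sk) and both helper functions are hypothetical, chosen only for illustration.

```python
# Toy model: one runtime filter, several target scans.
# Assumption: a Python set stands in for the Bloom filter.

def build_runtime_filter(dim_rows, key, predicate):
    """Collect join-key values from dimension rows that pass the predicate."""
    return {row[key] for row in dim_rows if predicate(row)}


def filtered_scan(fact_rows, key, runtime_filter):
    """Scan a fact table, dropping rows whose join key cannot match."""
    return [row for row in fact_rows if row[key] in runtime_filter]


# Hypothetical sample data.
date_dim = [
    {"date_sk": 1, "year": 2015},
    {"date_sk": 2, "year": 2016},
    {"date_sk": 3, "year": 2016},
]
store_sales = [{"date_sk": 1, "amount": 10}, {"date_sk": 2, "amount": 20}]
web_sales = [{"date_sk": 3, "amount": 30}, {"date_sk": 1, "amount": 40}]

# Built once from the selective side of the join...
rf = build_runtime_filter(date_dim, "date_sk", lambda r: r["year"] == 2016)

# ...and applied to every scan that feeds the join, one per UNION ALL operand.
union_all = (
    filtered_scan(store_sales, "date_sk", rf)
    + filtered_scan(web_sales, "date_sk", rf)
)
```

The point of the sketch is that the filter is computed once and reused, so each additional fact table gets pruned at scan time without any extra work on the dimension side.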
Scan-node Wait Improvements for Runtime Filters
Several bug fixes in Impala 2.6 deliver improved performance. One such fix reduces scan-node wait times by notifying scan nodes early when a runtime filter will not be effective because of a high false-positive rate. This change leads to a speedup of 94% on certain TPC-DS queries.
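To make the idea of an ineffective filter concrete, here is a minimal sketch that estimates a Bloom filter's false-positive rate with the textbook formula (1 - e^(-kn/m))^k and decides whether waiting for the filter is worthwhile. The function names, the choice of k = 7 hash functions, and the 0.75 cutoff are all assumptions made for illustration; Impala's actual filter layout and threshold may differ.

```python
import math


def expected_fpp(num_bits: int, num_keys: int, num_hashes: int = 7) -> float:
    """Textbook Bloom filter false-positive estimate: (1 - e^(-k*n/m))^k."""
    if num_bits <= 0 or num_hashes <= 0:
        raise ValueError("num_bits and num_hashes must be positive")
    if num_keys <= 0:
        return 0.0  # an empty filter rejects every probe
    fill = 1.0 - math.exp(-num_hashes * num_keys / num_bits)
    return fill ** num_hashes


def filter_is_worth_waiting_for(num_bits: int, num_keys: int,
                                max_fpp: float = 0.75) -> bool:
    """Hypothetical policy: tell scans to skip the filter when nearly
    every probe would pass it anyway. The 0.75 cutoff is an assumption."""
    return expected_fpp(num_bits, num_keys) <= max_fpp
```

Under this model, a producer that sees far more distinct keys than the filter was sized for can signal "do not wait" as soon as the estimate crosses the cutoff, rather than letting scan nodes block until the filter arrives.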
Dynamically Sizing Bloom Filters
Impala 2.6 now uses table statistics to estimate the appropriate size for runtime Bloom filters, resulting in much better memory utilization. In targeted benchmark queries, memory consumption for runtime filters is reduced by 10x.
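A sketch of how statistics can drive filter sizing, assuming the classic Bloom filter formula m = -n ln(p) / (ln 2)^2, where n is the estimated number of distinct values from table statistics and p is the target false-positive probability. The 10% target, the power-of-two rounding, and the size bounds below are illustrative assumptions, not values taken from Impala.

```python
import math


def bloom_filter_bits(ndv: int, fpp: float) -> int:
    """Classic sizing formula: m = -n * ln(p) / (ln 2)^2, rounded up."""
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def filter_size_bytes(ndv: int, fpp: float = 0.1,
                      min_bytes: int = 4096,
                      max_bytes: int = 16 * 1024 * 1024) -> int:
    """Size a filter from an NDV estimate, clamped to assumed bounds."""
    size = next_power_of_two(math.ceil(bloom_filter_bits(ndv, fpp) / 8))
    return max(min_bytes, min(size, max_bytes))
```

With these assumptions, a build side of about a million distinct keys gets a 1 MiB filter, while a build side of a thousand keys gets only the 4 KiB minimum. A fixed, one-size-fits-all filter would either waste memory on the small case or be too dense to be selective on the large one.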
For large TopN queries, Impala 2.6 provides up to 2x speedup.
Software Prefetching for Hash Tables
Impala 2.6 adds software prefetching to speed up hash-table build and probe for joins and aggregations. Software prefetching eliminates 70% of stalls due to cache misses when building and probing hash tables. This results in a 25% speedup on average on targeted TPC-DS queries and up to 52% speedup for TPC-DS Query 50.
Improvements in codegen for decimal arithmetic in Impala 2.6 result in gains of up to 48% on benchmark queries like TPC-H Query 1.
Apache Parquet Scanner Improvements
Impala 2.6 brings improved Parquet scanner performance of up to 2x by materializing many values of each column at a time. This feature results in up to 60% performance improvement in TPC-DS Query 28, specifically.
Improved Predicate Ordering
The Impala 2.6 planner now uses statistics and simple heuristics to order filters based on selectivity and cost, and applies highly selective filters before less selective ones. This improvement results in a 30% speed boost on average in benchmark queries, and a 1.9x speedup on TPC-H Query 6.
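One well-known way to combine selectivity and cost is to sort predicates by the rank (selectivity - 1) / cost, which minimizes expected per-row work when predicates are independent. The sketch below uses that heuristic as a stand-in; it is not necessarily the rule the Impala planner applies, and the Predicate class and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    name: str
    selectivity: float  # fraction of rows that pass, in [0, 1]
    cost: float         # relative per-row evaluation cost, > 0


def rank(p: Predicate) -> float:
    """Lower rank runs first: cheap, highly selective predicates win."""
    return (p.selectivity - 1.0) / p.cost


def order_predicates(preds):
    """Return predicates in the order they should be evaluated."""
    return sorted(preds, key=rank)


def expected_cost(preds) -> float:
    """Expected per-row cost when predicates short-circuit in order."""
    total, surviving = 0.0, 1.0
    for p in preds:
        total += surviving * p.cost   # only surviving rows pay this cost
        surviving *= p.selectivity    # rows that pass move on
    return total
```

For example, given a cheap filter that keeps 10% of rows, a cheap filter that keeps 90%, and an expensive filter that keeps 50%, this ordering runs the 10% filter first, so the expensive predicate is evaluated on only a small fraction of the input.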
Less than a year ago, we outlined our long-term commitment to Impala performance, and 2.6 is another promising step in that direction: In just three months, the Impala team has made several improvements that make this release the fastest one yet. With multi-core joins and aggregates, improved buffer management, even more codegen, and improved cloud functionality coming soon, 2016 will end up being another very exciting year for Impala!
If you are interested in contributing to Apache Impala (incubating), please do get in touch.
Devadutta Ghat is a Senior Product Manager at Cloudera.
Marcel Kornacker is a Tech Lead at Cloudera and the founder of Impala.
Mostafa Mokhtar is a Software Engineer at Cloudera.
Henry Robinson is a Software Engineer at Cloudera.