BI and SQL Analytics with Apache Impala (Incubating) in CDH 5.8: 3x Faster on Secure Clusters

Released with CDH 5.8, Impala 2.6 brings solid performance improvements, particularly for Kerberos-secured clusters running BI workloads on Apache Hadoop.

Just a few months back, we showed you how Impala 2.5 delivered a 4x performance boost compared to Impala 2.3 for BI workloads on Hadoop via the introduction of several features like runtime filters. Here’s an update: Compared to two releases ago, Impala 2.6 delivers 12x better performance on secure workloads and continues this drumbeat of consistent performance improvement.

We are excited to share details on performance improvements in Impala 2.6 with you here. (Impala 2.6 also brings some great new cloud features, including support for reading and writing data directly in Amazon S3. You can learn more about Impala on S3 and its performance in an upcoming post.) For the full list of new features, check out the release notes.

Summary of Performance Improvements

[Figure: impala26-summary]

Compared to Impala 2.5, Impala 2.6:

  • Is 22% faster on TPC-DS, with targeted queries running 2.4x to 8.9x faster
  • Is 3x faster on secure clusters as measured on queries derived from TPC-H
  • Offers 38% better query throughput for concurrent workloads, as measured by 16 concurrent users on TPC-DS

(Note: The benchmark used is derived from the TPC-DS and TPC-H benchmarks and, as such, is not comparable to published TPC-DS or TPC-H results.)

Next, let’s look at the details.

Improved Performance on Encrypted Workloads

Through optimizations in the Thrift client, Impala 2.6 query performance has improved by 3x on average on workloads secured with Kerberos, with some queries (such as TPC-H Q4) running 11x faster.

[Figure: impala26-geomean-secure]

[Figure: impala26-secure-gains]

Multiple Destinations for Runtime Filters

Impala 2.6 brings several improvements to the runtime filtering feature introduced in Impala 2.5. Notably, each runtime filter can now target more than one scan. Queries in which multiple fact tables join with one or more selective dimension tables benefit from this feature; for example, TPC-DS queries such as 2, 5, 40, 54, 71 and 73 are 4x faster. As a further consequence, runtime filters are now applied to all operands of UNION and UNION ALL operators.

[Figure: impala26-filters]
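To make the idea of one filter feeding several scans concrete, here is a minimal Python sketch. It is not Impala's implementation: the table contents, the `build_filter` and `scan` helpers, and the use of a plain set in place of a Bloom filter are all assumptions made for illustration.

```python
# Illustrative sketch only: a plain Python set stands in for the Bloom filter
# that a real engine would build. All names and data here are hypothetical.

def build_filter(dim_rows, predicate, key):
    """Collect the join keys of dimension rows that survive the predicate."""
    return {row[key] for row in dim_rows if predicate(row)}

def scan(fact_rows, key, runtime_filter):
    """Scan a fact table, dropping rows whose join key cannot match."""
    return [row for row in fact_rows if row[key] in runtime_filter]

date_dim = [
    {"d_date_sk": 1, "d_year": 2015},
    {"d_date_sk": 2, "d_year": 2016},
    {"d_date_sk": 3, "d_year": 2016},
]
store_sales = [{"sold_date_sk": k, "amount": a} for k, a in [(1, 10), (2, 20), (3, 30)]]
web_sales = [{"sold_date_sk": k, "amount": a} for k, a in [(1, 5), (1, 7), (3, 9)]]

# Build the filter once from the selective dimension table...
year_filter = build_filter(date_dim, lambda r: r["d_year"] == 2016, "d_date_sk")

# ...and apply it to more than one scan, here both operands of a UNION ALL.
filtered_store = scan(store_sales, "sold_date_sk", year_filter)
filtered_web = scan(web_sales, "sold_date_sk", year_filter)
union_all = filtered_store + filtered_web
```

The point of the sketch is that the filter is built once and reused: each additional scan that shares the join key gets its rows pruned without a second pass over the dimension table.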

Scan-node Wait Improvements for Runtime Filters

Several bug fixes in Impala 2.6 deliver improved performance; one such fix reduces scan-node wait times by notifying scan nodes early if a runtime filter will not be effective due to a high false-positive rate. This change leads to a speedup of 94% on certain TPC-DS queries.

[Figure: impala26-scan-node]
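The kind of check involved can be sketched with the textbook false-positive estimate for a Bloom filter. This is a sketch under stated assumptions, not Impala's code: the function names and the 0.75 cutoff are hypothetical, and the engine's actual criterion may differ.

```python
import math

def estimated_fpp(num_bits, num_hashes, num_inserted):
    """Textbook false-positive estimate for a Bloom filter:
    (1 - e^(-k*n/m))^k for m bits, k hash functions, n inserted keys."""
    return (1.0 - math.exp(-num_hashes * num_inserted / num_bits)) ** num_hashes

def should_disable(num_bits, num_hashes, num_inserted, max_fpp=0.75):
    """Return True when the filter would let almost every row through.

    max_fpp is a hypothetical cutoff chosen for this example. When this
    returns True, the producer could tell waiting scans right away that no
    useful filter is coming, instead of letting them wait for it."""
    return estimated_fpp(num_bits, num_hashes, num_inserted) > max_fpp

# A lightly loaded filter is worth waiting for; an overloaded one is not.
lightly_loaded = should_disable(num_bits=8192, num_hashes=3, num_inserted=100)
overloaded = should_disable(num_bits=8192, num_hashes=3, num_inserted=100_000)
```

In this example `lightly_loaded` is `False` and `overloaded` is `True`: once far more keys are inserted than the filter has bits, nearly every probe returns a false positive, so a scan gains nothing by waiting.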

Dynamically Sizing Bloom Filters

Impala 2.6 now uses table statistics to estimate the appropriate size for runtime Bloom filters, resulting in much better memory utilization. In targeted benchmark queries, memory consumption for runtime filters is reduced by 10x.

[Figure: impala26-bloom]
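As a rough illustration of statistics-driven sizing, the sketch below derives a filter size from an estimated number of distinct join keys using the standard Bloom filter sizing formula. The function name, the target false-positive rate, the size bounds, and the power-of-two rounding are assumptions for this example rather than Impala's actual settings.

```python
import math

def bloom_filter_bytes(ndv_estimate, target_fpp=0.1,
                       min_bytes=4 * 1024, max_bytes=16 * 1024 * 1024):
    """Pick a Bloom filter size from an estimated number of distinct values.

    Uses the standard formula m = -n * ln(p) / (ln 2)^2 for the number of
    bits, rounds the byte count up to a power of two, and clamps it to
    hypothetical minimum and maximum sizes."""
    bits = -ndv_estimate * math.log(target_fpp) / (math.log(2) ** 2)
    size = max(1, math.ceil(bits / 8))
    size = 1 << (size - 1).bit_length()  # round up to the next power of two
    return min(max(size, min_bytes), max_bytes)

small = bloom_filter_bytes(1_000)          # clamped up to the 4 KiB minimum
medium = bloom_filter_bytes(1_000_000)     # 1 MiB
huge = bloom_filter_bytes(1_000_000_000)   # clamped down to the 16 MiB maximum
```

With a fixed size, a filter built from a thousand keys would occupy as much memory as one built from a million; sizing from statistics lets the small join pay only for what it needs.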

Codegen Improvements

TopN

For large TopN queries, Impala 2.6 provides up to 2x speedup.

[Figure: impala26-topn]

Software Prefetching for Hash Tables

Impala 2.6 adds software prefetching to speed up hash-table build and probe for joins and aggregations. Software prefetching eliminates 70% of stalls due to cache misses when building and probing hash tables. This results in a 25% speedup on average on targeted TPC-DS queries and up to 52% speedup for TPC-DS Query 50.

[Figure: impala26-hash-table]

Decimal Arithmetic

Improvements in codegen for decimal arithmetic in Impala 2.6 result in gains of up to 48% on benchmark queries like TPC-H Query 1.

[Figure: impala26-decimal]

Apache Parquet Scanner Improvements

Impala 2.6 brings improved Parquet scanner performance of up to 2x by materializing many values of each column at a time. This feature results in up to 60% performance improvement in TPC-DS Query 28, specifically.

[Figure: impala26-scanner]
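The access pattern behind this change can be pictured as follows. This is a simplified Python sketch with hypothetical names (`read_batches`, `batch_size`); it only shows the column-at-a-time batching pattern, not the scanner's real decoding logic.

```python
from itertools import islice

def read_batches(columns, batch_size=1024):
    """Yield lists of row tuples, decoding one column at a time per batch.

    `columns` maps a column name to an iterable of already-decoded values;
    a real scanner would be decoding encoded column pages here instead."""
    readers = {name: iter(values) for name, values in columns.items()}
    names = list(columns)
    while True:
        # Materialize up to batch_size values from each column in turn...
        buffers = [list(islice(readers[name], batch_size)) for name in names]
        if not buffers or not buffers[0]:
            return
        # ...then stitch the per-column buffers back into rows.
        yield list(zip(*buffers))

columns = {"a": [1, 2, 3, 4, 5], "b": ["v", "w", "x", "y", "z"]}
batches = list(read_batches(columns, batch_size=2))
```

Staying on one column for many values at a time keeps the inner loop tight and switches between columns once per batch rather than once per row.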

Improved Predicate Ordering

The Impala 2.6 planner now uses statistics and simple heuristics to order filters based on selectivity and cost, and applies highly selective filters before less selective ones. This improvement results in a 30% speed boost on average in benchmark queries, and a 1.9x speedup on TPC-H Query 6.

[Figure: impala26-tpch6]
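One common way to combine selectivity and cost is to sort predicates by cost per discarded row. The sketch below uses that textbook ranking; it is an assumption for illustration, not a description of the planner's actual heuristics, and the predicate names, selectivities, and costs are made up.

```python
from dataclasses import dataclass

@dataclass
class Predicate:
    name: str
    selectivity: float  # estimated fraction of rows that pass (0..1)
    cost: float         # estimated per-row evaluation cost, arbitrary units

def rank(p):
    """Cost per discarded row: cheap, highly selective predicates rank first."""
    dropped = 1.0 - p.selectivity
    return float("inf") if dropped <= 0 else p.cost / dropped

def order_predicates(preds):
    return sorted(preds, key=rank)

def expected_cost(preds):
    """Expected per-row cost when predicates short-circuit in the given order."""
    total, pass_fraction = 0.0, 1.0
    for p in preds:
        total += pass_fraction * p.cost
        pass_fraction *= p.selectivity
    return total

# Hypothetical estimates for three filters on one table.
as_written = [
    Predicate("discount_in_range", selectivity=0.27, cost=2.0),
    Predicate("quantity_below_limit", selectivity=0.46, cost=1.0),
    Predicate("shipdate_in_year", selectivity=0.15, cost=2.0),
]
reordered = order_predicates(as_written)
```

Here `reordered` evaluates `quantity_below_limit` first, then `shipdate_in_year`, then `discount_in_range`, and `expected_cost(reordered)` is lower than `expected_cost(as_written)` because later predicates run on fewer rows.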

Conclusion

Less than a year ago, we outlined our long-term commitment to Impala performance, and 2.6 is another promising step in that direction: In just three months, the Impala team has made several improvements that make this release the fastest one yet. With multi-core joins and aggregates, improved buffer management, even more codegen, and improved cloud functionality coming soon, 2016 will end up being another very exciting year for Impala!

If you are interested in contributing to Apache Impala (incubating), please do get in touch.

Devadutta Ghat is a Senior Product Manager at Cloudera.

Marcel Kornacker is a Tech Lead at Cloudera and the founder of Impala.

Mostafa Mokhtar is a Software Engineer at Cloudera.

Henry Robinson is a Software Engineer at Cloudera.
