What’s Next for Impala: More Reliability, Usability, and Performance at Even Greater Scale

Categories: Impala

This year will close out with new features for reliability, usability, and nested types, and in 2016, performance-related enhancements promise >20x gains.

It’s been roughly a year since we provided an update about the Impala roadmap. During that time, a number of milestones have been reached:

  • Most Cloudera customers have deployed Impala to production across industries including financial services, retail, healthcare, gaming, government, advertising, and telecom.
  • The number of customers in the million-query club (with cluster sizes ranging from tens to hundreds of nodes) has steadily increased. Furthermore, many customers are pushing the concurrency envelope—for example, one large advertising company is capable of running more than 80 queries/second to power 1,000+ web dashboard end-users with sub-second response time.
  • Downloads of standalone Impala for CDH 4 passed the 1-million mark, with many millions more binaries downloaded as part of CDH 5.
  • Impala became an open standard as multi-vendor support from Cloudera, Oracle, MapR, and Amazon came online, and Impala recently shipped inside IBM Big SQL as well.
  • As the “cement” set in Impala’s foundation, outside contributions to Impala from the community became increasingly important.

In the previous roadmap post, we explained how the roadmap could be generally characterized as an effort to complement Impala’s MPP-like performance with additional advanced SQL functionality—to provide users with the SQL support and performance of commercial MPP-query engines, running natively on open source Apache Hadoop. In this post, we’ll report on our success rate for delivering these new features and shed some light on the roadmap from this point forward. (As with all forward-looking statements about roadmaps, keep in mind that these plans are always subject to change.)

Report Card for 2.0, 2.1, and 2.2

Since the introduction of Impala 2.0 in late 2014, Cloudera’s customers have pushed Impala into greater cluster scale, user scalability, and query complexity. Based on these customer experiences, for 2.1 and 2.2, the Impala team re-prioritized greater reliability and usability at these higher scales over new features. Thus:

Delivered in 2.0
  • SQL 2003-compliant analytic window functions
  • External joins and aggregations using disk (aka “spill to disk”)
  • Sub-queries inside WHERE clauses
  • Additional data types (including VARCHAR, CHAR)
  • Additional built-in functions
Delivered in 2.1
  • Incremental stats
  • Enhanced scalability for metadata updates
Delivered in 2.2
  • Column-level lineage tracking with Cloudera Navigator
  • Ability to read directly from Amazon S3 (unsupported beta)

Coming in Late 2015 and Beyond

For the remainder of this year, this focus on reliability and usability at even greater user/node scalability will continue. The top-priority feature beyond this effort in 2015 is the much-anticipated support for nested types.

Starting in 2016, some exciting feature enhancements will dramatically expand the types of workloads and data volumes available for interactive BI and analytics. Notable examples include support for updates/inserts in Hadoop as well as the most significant performance gains seen since Impala 1.0.

Planned for 2015
  • EMC Isilon support – to execute distributed queries against data in Isilon
  • Nested types – enabling queries on complex data structures like maps, structs, and arrays
  • Even greater scalability and reliability – to run at greater node and user scalability with less manual tuning
  • Even better predictability under concurrency – to handle greater concurrency in resource-limited situations
  • New Python data analysis framework – to maximize user productivity while executing natively on Impala’s high-performance scale-out architecture (more news soon!)
Planned for late 2015 or early 2016
  • Fine-grained authorization across CDH – expansion and full integration of Apache Sentry authorization across all CDH frameworks
  • Dynamic partition pruning – to skip irrelevant fact-table partitions for queries whose partition filters are on dimension tables instead of the fact tables
  • Greater node scalability for metadata propagation – finer-grained updates of metadata to enable more frequent metadata updates and greater node scalability
  • Improved YARN integration – for more predictable resource requests with dynamic resource scheduling
Planned for 2016
  • Support for updates – to directly update and insert data into Hadoop
  • >20x performance gains – via multi-core joins/aggregations, even more runtime code generation, collaboration with Intel for greater hardware efficiency gains, and more
  • In-memory columnar format – for more efficient, more scalable, vectorized operations on nested data types as well as to enable high-performance custom logic (UDFs/UDAs) without serialization/deserialization bottlenecks
  • Automated and incremental metadata refresh
  • Automated and incremental stats collection
  • Temporary tables – enabling temporary scratch locations to store interim results
  • Additional language extensions and data types – addition of new SQL and vendor-specific language extensions and data types based on customer feedback
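
To make the dynamic partition pruning item above more concrete, here is a minimal Python sketch of the general idea. It is not Impala's implementation. The tables, the `total_for_month` function, and the `prune` flag are all hypothetical names invented for illustration. The point is only that the filter is evaluated on the small dimension table first, and the resulting keys decide at run time which fact-table partitions are read.

```python
# Hypothetical fact table partitioned by date_key:
# partition key -> list of (date_key, amount) rows.
FACT_PARTITIONS = {
    20150101: [(20150101, 10.0), (20150101, 5.0)],
    20150102: [(20150102, 7.5)],
    20150201: [(20150201, 20.0)],
}

# Hypothetical dimension table: date_key -> month name.
DATE_DIM = {
    20150101: "January",
    20150102: "January",
    20150201: "February",
}


def total_for_month(month, prune=True):
    """Sum fact amounts for one month; return (total, partitions_scanned)."""
    # Step 1: evaluate the filter on the small dimension table.
    matching_keys = {key for key, name in DATE_DIM.items() if name == month}

    # Step 2: use those keys at run time to choose the fact partitions to read.
    if prune:
        partitions = [p for p in FACT_PARTITIONS if p in matching_keys]
    else:
        partitions = list(FACT_PARTITIONS)

    # Step 3: scan the chosen partitions and keep rows that join to the filter.
    total = 0.0
    for partition in partitions:
        for date_key, amount in FACT_PARTITIONS[partition]:
            if date_key in matching_keys:
                total += amount
    return total, len(partitions)


if __name__ == "__main__":
    print(total_for_month("January", prune=True))
    print(total_for_month("January", prune=False))
```

Both calls return the same total. The pruned call simply touches fewer partitions, which is where the savings would come from on a large fact table.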

Conclusion

To summarize, by supporting a new range of analytics, the 2.0 release extended and accelerated Impala adoption. The 2.x releases in the first half of 2015 reflected deeper investment in reliability and usability at greater scale to meet increasing demands for higher concurrency and scalability. And, in the second half of 2015, nested types, concurrency, and scalability will be the key focus areas.

Based on the current plan, 2016 may grade out as the most exciting year yet for the expansion of Impala use cases. Major new capability enhancements will unlock new analytic workloads that could never run at Big Data scale before by doubling down on:

  • Low-latency queries for a BI user experience
  • Ability to handle highly concurrent workloads
  • Efficient resource usage in a shared workload environment (via YARN)
  • Commitment to open standards, and
  • Broad ISV support

We look forward to bringing you more information about these new efforts as it becomes available!

Marcel Kornacker is Impala’s architect.

Silvius Rus is an Engineering Manager at Cloudera.

Justin Erickson is Director of Product Management at Cloudera.


8 responses on “What’s Next for Impala: More Reliability, Usability, and Performance at Even Greater Scale”

  1. Lars Francke

    Those sound like great plans. Especially nested types and automated metadata refresh!

    One thing is missing, though: an Oozie action for Impala. Its absence is a major hindrance to putting Impala into production in ETL workflows, especially with Kerberos enabled.

    I know there are crutches out there (via Java or Shell actions), but they’re not good enough for most customers.

    Any plans to finally get to this?

    1. Justin Kestelyn (@kestelyn) Post author

      Lars,

      Yes, it’s part of the roadmap. We’re looking at 2016 for an Impala Oozie action and will have a better idea of timing as we get closer to the end of 2015.

    1. Justin Kestelyn (@kestelyn) Post author

      Stefano,

      IBM Big SQL 3.0 has some Impala code inside it, in addition to retrofitted legacy technology. So, it’s not purely Impala architecture.

  2. Edward

    How is the integration with S3 or other cloud storage coming along? Would love to have a production ready implementation of Impala that can read (or write) from S3 via external table.