What’s Next for Impala After Release 1.1

by Justin Erickson

Posted in Technical | September 24, 2013 3 min read

In December 2012, while Cloudera Impala was still in its beta phase, we provided a roadmap for planned functionality in the production release. In the same spirit of keeping Impala users, customers, and enthusiasts well informed, this post provides an updated roadmap for upcoming releases later this year and in early 2014.

But first, a thank-you: Since the initial beta release, we’ve received a tremendous amount of high-quality feedback and validation about Impala. To date, at least one person in each of approximately 4,500 unique organizations around the world has downloaded the Impala binary. And even after only a few months of GA, we’ve seen Cloudera Enterprise customers from multiple industries deploy Impala 1.x in business-critical environments with support via a Cloudera RTQ (Real-Time Query) subscription, including leading organizations in insurance, banking, retail, healthcare, gaming, government, telecom, and advertising.

Furthermore, based on the reaction from other vendors in the data management space, few observers would dispute the notion that Impala has made low-latency, interactive SQL queries for Hadoop as important a customer requirement as the high-latency, batch-oriented SQL queries enabled by Apache Hive. That’s a great development for Hadoop users everywhere!

What Was Delivered in Impala 1.0/1.1

Let’s begin with a report card on the previously published Impala 1.0/1.1 roadmap. Here’s the feature list, grouped by delivery status:

Delivered:

Support for Parquet format, Apache Avro file format, and LZO-compressed TextFiles
Support for the same 64-bit OS platforms as supported for CDH
JDBC driver
DDL support
Faster, bigger, more memory efficient joins
Faster, bigger, more memory efficient aggregations
More SQL performance optimizations

Postponed based on customer feedback:

Straggler handling
Automatic metadata refresh

Furthermore, thanks to the addition of the Apache Sentry module (incubating), Impala 1.1 and later now also provide granular, role-based authorization, ensuring that the right users and applications have access to the right data. (With the recent contribution of Sentry to the Apache Incubator and of HiveServer2 to Hive by Cloudera, Hive 0.11 and later have that functionality, as well.)

A lot of work was done, but there is still plenty of work to do. Now, on to the Impala 2.0 wave.

Near-Term Roadmap

The following new Impala functionality will be released incrementally across near-term future releases, starting with Impala 1.2 in late 2013 and ending with Impala 2.0 in the first third of 2014. In addition, you’ll see more performance gains and SQL functionality enhancements in each release – with the goal of expanding Impala’s performance lead over the alternative SQL-on-Hadoop approaches of legacy relational database vendors as well as Hadoop distro vendors.

Please note that, as with any roadmap, timelines and features are subject to change. What you see below captures our current plan of record.

Impala 1.2

UDFs and extensibility – enables users to add their own custom functionality; Impala will support existing Hive Java UDFs as well as high-performance native UDFs and UDAFs
Automatic metadata refresh – makes new tables and data seamlessly available for Impala queries as they are added, without having to issue a manual refresh on each Impala node
In-memory HDFS caching – allows access to frequently accessed Hadoop data at in-memory speeds
Cost-based join order optimization – frees the user from having to guess the correct join order
Preview of YARN-integrated resource manager – allows prioritization of workloads at a finer granularity than the service-level isolation currently provided in Cloudera Manager
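
To make the idea of cost-based join ordering concrete, here is a minimal Python sketch. It is not Impala’s planner: the table names, row counts, selectivities, and the cost model (the sum of intermediate row counts over left-deep join orders, found by exhaustive search) are all illustrative assumptions.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical table sizes (rows); illustrative numbers only.
ROW_COUNTS = {"sales": 1_000_000, "customers": 10_000, "regions": 50}

# Hypothetical join selectivities: the fraction of the cross product
# that survives the join predicate between two tables.
SELECTIVITY = {
    frozenset({"sales", "customers"}): Fraction(1, 10_000),
    frozenset({"customers", "regions"}): Fraction(1, 50),
}


def join_cost(order):
    """Estimate a left-deep join order's cost as the sum of intermediate row counts."""
    joined = [order[0]]
    rows = Fraction(ROW_COUNTS[order[0]])
    cost = Fraction(0)
    for table in order[1:]:
        # Combine the selectivity of every predicate linking the new table
        # to a table that is already joined; no predicate means a cross product.
        sel = Fraction(1)
        for other in joined:
            sel *= SELECTIVITY.get(frozenset({other, table}), Fraction(1))
        rows = rows * ROW_COUNTS[table] * sel
        cost += rows
        joined.append(table)
    return cost


def best_join_order(tables):
    """Try every left-deep order and return the cheapest one under join_cost."""
    return min(permutations(tables), key=join_cost)


if __name__ == "__main__":
    best = best_join_order(list(ROW_COUNTS))
    print("cheapest order:", best, "estimated cost:", int(join_cost(best)))
```

Under these assumed statistics, orders that join the two small tables first keep the intermediate results small, while any order that pairs sales directly with regions pays for a large cross product. A user writing the query by hand would otherwise have to guess at that ordering.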

Impala 2.0

The list below captures only the bigger, most frequently requested features; it’s by no means complete.

SQL 2003-compliant analytic window functions (aggregations with OVER and PARTITION BY clauses) – to provide more advanced SQL analytic capabilities
Additional authentication mechanisms – including the ability to specify username/passwords in addition to the already supported Kerberos authentication
UDTFs (user-defined table functions) – for more advanced user functions and extensibility
Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on top of Impala’s existing performance gains
Nested data – enables queries on complex nested structures including maps, structs, and arrays
Enhanced, production-ready, YARN-integrated resource manager
Parquet enhancements – continued performance gains including index pages
Additional data types – including Date and Decimal types
ORDER BY without LIMIT clauses
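
As a rough illustration of two items on this list, the sketch below runs an analytic window function and an ORDER BY without a LIMIT clause. It uses Python’s built-in sqlite3 module purely as a stand-in SQL engine (assuming a bundled SQLite of version 3.25 or later, which added window functions); it does not connect to Impala, and the sales table and its values are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in engine, not Impala.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 30), ("west", 5)],
)

# SUM(...) OVER (PARTITION BY ...) attaches a per-region total to every row
# without collapsing the rows the way GROUP BY would.
# The ORDER BY here has no LIMIT clause.
query = """
    SELECT region,
           amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, amount
"""
rows = conn.execute(query).fetchall()

for region, amount, region_total in rows:
    print(region, amount, region_total)
```

Each output row keeps its own amount alongside the total for its region, which is the kind of result that is awkward to express with plain GROUP BY aggregation.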

Beyond Impala 2.0

We currently anticipate that the following features will be present in Impala 2.1 or a release soon thereafter:

Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SETS
Apache HBase CRUD – allows use of Impala for inserts and updates into HBase
External joins using disk – enables joins to spill to disk when the tables being joined are larger than the aggregate memory size
Subqueries inside WHERE clauses

As we learn more about customer and partner requirements, this list will expand.

Conclusion

As you can see, Impala has evolved considerably since its beta release, and it will continue to evolve as we gather more feedback from users, customers, and partners.

Ultimately, we believe that Impala has already enabled our overall goal of allowing users to store all their data in native Hadoop file formats, and simultaneously run all batch, machine learning, interactive SQL/BI, math, search, and other workloads on that data in place. From here, it’s just a matter of continuing to build upon that very solid foundation with richer functionality and improved performance.

Justin Erickson is a director of product management at Cloudera.
