In December 2012, while Cloudera Impala was still in its beta phase, we provided a roadmap for planned functionality in the production release. In the same spirit of keeping Impala users, customers, and enthusiasts well informed, this post provides an updated roadmap for upcoming releases later this year and in early 2014.
But first, a thank-you: Since the initial beta release, we’ve received a tremendous amount of high-quality feedback and validation about Impala. To date, at least one person in each of approximately 4,500 unique organizations around the world has downloaded the Impala binary. And even after only a few months of GA, we’ve seen Cloudera Enterprise customers from multiple industries deploy Impala 1.x in business-critical environments with support via a Cloudera RTQ (Real-Time Query) subscription, including leading organizations in insurance, banking, retail, healthcare, gaming, government, telecom, and advertising.
Furthermore, based on the reaction from other vendors in the data management space, few observers would dispute the notion that Impala has made low-latency, interactive SQL queries for Hadoop as important a customer requirement as the high-latency, batch-oriented SQL queries enabled by Apache Hive. That’s a great development for Hadoop users everywhere!
What Was Delivered in Impala 1.0/1.1
Let’s begin with a report card on the previously published Impala 1.0/1.1 roadmap. Here’s the feature list, grouped by delivery status:
Postponed based on customer feedback:
Furthermore, thanks to the addition of the Apache Sentry module (incubating), Impala 1.1 and later now also provide granular, role-based authorization, ensuring that the right users and applications have access to the right data. (With the recent contribution of Sentry to the Apache Incubator and of HiveServer2 to Hive by Cloudera, Hive 0.11 and later have that functionality, as well.)
A lot of work was done, but there is still plenty of work to do. Now, on to the Impala 2.0 wave.
The following new Impala functionality will be released incrementally across near-term future releases, starting with Impala 1.2 in late 2013 and ending with Impala 2.0 in the first third of 2014. In addition, you’ll see more performance gains and SQL functionality enhancements in each release – with the goal of expanding Impala’s performance lead over the alternative SQL-on-Hadoop approaches of legacy relational database vendors as well as Hadoop distro vendors.
Please note that, as is the case with any roadmap, timelines and features are subject to change. What you see below captures our current plan of record, however.
- UDFs and extensibility – enables users to add their own custom functionality; Impala will support existing Hive Java UDFs as well as high-performance native UDFs and UDAFs
- Automatic metadata refresh – enables new tables and data to become available for Impala queries seamlessly as they are added, without having to issue a manual refresh on each Impala node
- In-memory HDFS caching – allows access to frequently accessed Hadoop data at in-memory speeds
- Cost-based join order optimization – frees the user from having to guess the correct join order
- Preview of YARN-integrated resource manager – allows prioritization of workloads at a finer granularity than the service-level isolation currently provided in Cloudera Manager
The list below captures only the bigger, most frequently requested features; it’s by no means complete.
- SQL 2003-compliant analytic window functions (aggregation OVER PARTITION) – to provide more advanced SQL analytic capabilities
- Additional authentication mechanisms – including the ability to specify usernames and passwords in addition to the already supported Kerberos authentication
- UDTFs (user-defined table functions) – for more advanced user functions and extensibility
- Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on top of the performance gains of Impala
- Nested data – enables queries on complex nested structures including maps, structs, and arrays
- Enhanced, production-ready, YARN-integrated resource manager
- Parquet enhancements – continued performance gains including index pages
- Additional data types – including Date and Decimal types
- ORDER BY without LIMIT clauses
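For readers less familiar with the analytic window functions item in the list above, the following is a minimal Python sketch of what an aggregation over a partition computes. It is not Impala code and says nothing about how Impala will implement the feature; the function name `sum_over_partition`, the `sales` rows, and the SQL shown in the comment are all hypothetical and used only for illustration.

```python
from collections import defaultdict

def sum_over_partition(rows, partition_key, value_key):
    """Emulate the semantics of a standard SQL window aggregate such as:

        SELECT region, amount,
               SUM(amount) OVER (PARTITION BY region) AS partition_total
        FROM sales;

    (Illustrative SQL only; table and column names are made up.)
    """
    # First pass: accumulate one total per partition value.
    totals = defaultdict(int)
    for row in rows:
        totals[row[partition_key]] += row[value_key]
    # Second pass: unlike GROUP BY, every input row is kept and is
    # annotated with the total for the partition it belongs to.
    return [dict(row, partition_total=totals[row[partition_key]]) for row in rows]

# Hypothetical sample data.
sales = [
    {"region": "east", "amount": 10},
    {"region": "east", "amount": 5},
    {"region": "west", "amount": 7},
]
result = sum_over_partition(sales, "region", "amount")
```

The point of the sketch is the difference from a plain GROUP BY: the result has as many rows as the input, which is what makes running totals, rankings, and similar per-row analytics possible.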
Beyond Impala 2.0
We currently anticipate that the following features will be present in Impala 2.1 or a release soon thereafter:
- Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SETS
- Apache HBase CRUD – allows use of Impala for inserts and updates into HBase
- External joins using disk – enables joins to spill to disk when the tables being joined are larger than the aggregate memory size
- Subqueries inside WHERE clauses
As we learn more about customer and partner requirements, this list will expand.
As you can see, Impala has evolved considerably since its beta release, and it will continue to evolve as we gather more feedback from users, customers, and partners.
Ultimately, we believe that Impala has already enabled our overall goal of allowing users to store all their data in native Hadoop file formats, and simultaneously run all batch, machine learning, interactive SQL/BI, math, search, and other workloads on that data in place. From here, it’s just a matter of continuing to build upon that very solid foundation with richer functionality and improved performance.
Justin Erickson is a director of product management at Cloudera.