Impala 2.0 will add much more complete SQL functionality to what is already the fastest SQL-on-Hadoop solution available.
In September 2013, we provided a roadmap for Impala — the open source MPP SQL query engine for Apache Hadoop, which was on release 1.1 at the time — that documented planned functionality through release 2.0 and beyond.
Impala is now on release 1.4, with many major features delivered since our previous roadmap update, and adoption is at an all-time high: it’s been downloaded by 10,000 unique organizations since January 2013, is in use by most of Cloudera’s enterprise data hub customers, and is shipped by MapR, Amazon, and inside the Oracle Big Data Appliance in addition to Cloudera. For these reasons, it seems like a good time to elaborate on the 2.x roadmap.
First, let’s recap what has been delivered since 1.1. Then, we’ll follow with a list of the substantial new features, mainly in the area of SQL functionality, planned for Impala 2.0 and a few of the features beyond.
Delivered Thus Far
Impala 1.2 (Shipped Oct. 2013)
- UDFs and extensibility – enables users to add their own custom functionality; Impala supports existing Hive Java UDFs as well as high-performance native UDFs and UDAFs
- Automatic metadata refresh – makes new tables and data seamlessly available for Impala queries as they are added, without having to issue a manual refresh on each Impala node
- Cost-based join order optimization – frees the user from having to guess the correct join order
- Additional authentication mechanisms – including the ability to specify Active Directory username/passwords in addition to the already supported Kerberos authentication
Impala 1.3 (Shipped May 2014)
- Admission Control – allows prioritization and queueing of queries within Impala
- Preview of YARN-integrated resource manager (CDH 5.0) – allows prioritization of workloads at a finer granularity than the service-level isolation currently provided in Cloudera Manager
- Improved memory consumption at higher scale – allows for greater multi-user concurrency with lower memory footprints
Impala 1.4 (Shipped July 2014)
- In-memory HDFS caching (CDH 5.1 or higher) via Impala DDL – allows access to frequently accessed Hadoop data at in-memory speeds
- DECIMAL data type – allows Impala to query fixed-precision numeric data
- COMPUTE STATS – 5x faster statistics capture than previous releases
- Additional built-ins from traditional databases – easier migration with some common SQL language extensions, such as statistics functions
- ORDER BY without LIMIT clauses – allows easier migration of existing queries without requiring the result set to fit in memory or requiring a LIMIT clause
- Improved performance for selective joins – improvements in such queries by over 2x compared to previous versions of Impala
- Enhanced, production-ready, YARN-integrated resource manager (CDH 5.1 and later)
To Be Delivered by Impala 2.x
Impala 2.0, scheduled for release by the end of 2014, is the most significant milestone since GA. It will add the most popular SQL analytic language features on top of what has already been demonstrated to be not only the fastest SQL-on-Hadoop solution (by at least 950% compared to Shark, “Stinger,” and Presto), but more important, one that has been documented by multiple customers as performing on the same level as traditional MPP query engines while doing so on Hadoop-native data sets. Essentially, the Impala 2.0 milestone marks the point at which Hadoop users will get the “whole package”: the expected SQL support and performance of commercial MPP query engines, running natively on Hadoop.
Impala 2.0 (Ships in Fall 2014)
- SQL 2003-compliant analytic window functions (aggregation, LAG, and so on) – to provide more advanced SQL analytic capabilities
- External joins and aggregations using disk – enables operations to spill to disk if their internal state exceeds the aggregate memory size
- Subqueries inside
- Incremental statistics – only run statistics on the new or changed data for even faster statistics computations
- Additional data types – including
- Additional built-in functions – enables easier migration of custom language extensions for users of traditional SQL engines
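
For readers unfamiliar with analytic window functions, the following is a minimal Python sketch of what LAG computes over a partitioned, ordered set of rows. It illustrates the semantics only and is not Impala's implementation; the function name `lag`, the sample `sales` rows, and the column names are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from operator import itemgetter


def lag(rows, partition_key, order_key, value_key, offset=1, default=None):
    """Mimic LAG(value, offset, default) OVER (PARTITION BY ... ORDER BY ...).

    Returns new row dicts with an extra "lag" column. Illustrative only;
    assumes offset >= 1.
    """
    # Group rows by the partition column.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    result = []
    for key in sorted(partitions):
        # Order rows within each partition, then look back `offset` rows.
        ordered = sorted(partitions[key], key=itemgetter(order_key))
        for i, row in enumerate(ordered):
            prev = ordered[i - offset][value_key] if i >= offset else default
            result.append({**row, "lag": prev})
    return result


# Hypothetical sample data: daily sales amounts per region.
sales = [
    {"region": "east", "day": 2, "amount": 15},
    {"region": "east", "day": 1, "amount": 10},
    {"region": "west", "day": 1, "amount": 7},
    {"region": "west", "day": 2, "amount": 9},
]
lagged = lag(sales, "region", "day", "amount")
```

Each output row carries the previous day's amount for the same region, or the default when no earlier row exists in that partition. In a SQL engine the same result comes from a single query rather than application code, which is the migration benefit this class of feature targets.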
Impala 2.1 and Beyond (Ships in 2015)
- Nested data – enables queries on complex nested structures including maps, structs, and arrays (early 2015)
- MERGE statement – enables merging updates into existing tables
- Additional analytic SQL functionality –
- Apache HBase CRUD – allows use of Impala for inserts and updates into HBase
- UDTFs (user-defined table functions) – for more advanced user functions and extensibility
- Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on top of the performance gains of Impala
- Parquet enhancements – continued performance gains including index pages
- Amazon S3 integration
From the outset, we described the Impala journey as one that would take its users beyond the limits of what they thought Hadoop could do by offering the performance and SQL capabilities of traditional analytic DBMSs natively on Hadoop data. The functionality delivered thus far has certainly done that in terms of performance, and with the features planned for Impala 2.0, we’re confident it will do the same with respect to SQL functionality.
As we’ve written before, thanks to these features, Impala uniquely delivers on requirements for BI and SQL analytics in enterprise data hubs by blending:
- Low-latency queries for a BI user experience
- Ability to handle highly concurrent workloads
- Efficient resource usage in a shared workload environment (via YARN)
- Open formats for accessing any data from any native Hadoop engine
- Multi-vendor support to avoid lock-in, and
- Broad ISV support
As always, we welcome your comments and feedback!
Justin Erickson is Director of Product Management at Cloudera.
Marcel Kornacker is Impala’s architect and the Impala tech lead at Cloudera.
To learn more about and discuss the Impala roadmap, attend the Bay Area Impala User Group meeting in Palo Alto on Sept. 16, 2014.