It’s been an exciting month and a half since the launch of the Cloudera Impala (the new open source distributed query engine for Apache Hadoop) beta, and we thought it’d be a great time to provide an update about what’s next for the project – including our product roadmap, release schedule and open-source plan.
First of all, we’d like to thank you for your enthusiasm and valuable beta feedback. We’re actively listening and have already fixed many of the bugs reported, captured feature requests for the roadmap, and updated the Cloudera Impala FAQ based on user input.
Our primary focus between now and general availability (GA) is making Impala enterprise-ready for your production Hadoop clusters. This means continued investments in product stability as well as product functionality, including:
- Additional file formats – specifically the Avro file format and LZO-compressed TextFiles
- Additional OS support – for the same supported 64-bit OS platforms as CDH4 including RHEL/CentOS 5.7, Ubuntu, Debian, SLES, and Oracle Linux
- Straggler handling – enables Impala to give more work to faster machines and less to slower machines for the fastest response times. Large clusters often show wide variance in performance across nodes because of issues such as slow or faulty disks.
- JDBC driver – enables Java apps to interface with Impala. We’ll leverage the JDBC driver from Apache Hive to provide a common SQL interface for Java apps for both Impala and Hive.
- Data Definition Language (DDL) – enables users to create tables in the shared Hive metastore from Impala as well as Hive. As of Impala beta version 0.3, you can query from Impala but need to create your tables through Hive first.
- Faster, bigger, and more memory-efficient joins – through a partitioned hash join, Impala will be able to partition the second table in a join across the nodes in the cluster, so that the cluster as a whole holds only one copy of that table. Currently Impala stores a full copy of the second table in a join in each node’s memory. Impala will use table statistics to determine which strategy performs best for each query.
- Faster, bigger, and more memory-efficient aggregations – enables distributed pre-aggregation local to the data, which offloads work, and thus memory consumption, from the coordinator node that returns the final results.
- Broader SQL performance optimizations – enables more of Impala’s SQL features and built-ins to return with lowest latency by expanding our usage of LLVM code generation.
- Automatic metadata refresh – enables new tables and data to seamlessly be available for Impala queries as they are added without having to issue a manual refresh command to Impala.
- New Parquet columnar file format – enables even faster performance through an optional columnar format like Google Dremel’s ColumnIO and those of other analytical query engines. For a Hadoop user, Parquet will be just another file format, so any processing framework can access data stored in Parquet format as it does today with Avro and SequenceFiles.
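To make the difference between the two join strategies in the list above concrete, here is a minimal single-process sketch in Python. It is only an illustration under stated assumptions, not Impala's implementation: "nodes" are modeled as plain lists, and the function names (`partition`, `local_hash_join`, `broadcast_join`, `partitioned_join`) and the sample tables are hypothetical.

```python
from collections import defaultdict


def partition(rows, key, num_nodes):
    """Assign each row to one simulated node by hashing its join key."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts


def local_hash_join(left_rows, right_rows, key):
    """Build a hash table on the right side, then probe it with the left side."""
    table = defaultdict(list)
    for right in right_rows:
        table[right[key]].append(right)
    return [{**left, **right}
            for left in left_rows
            for right in table.get(left[key], [])]


def broadcast_join(left_parts, right_rows, key):
    """Every node joins its local left rows against a FULL copy of the right table."""
    out = []
    for left_rows in left_parts:
        out.extend(local_hash_join(left_rows, right_rows, key))
    return out


def partitioned_join(left_rows, right_rows, key, num_nodes):
    """Both tables are hash-partitioned on the join key, so each node holds
    only its slice of the right table and matching keys meet on the same node."""
    left_parts = partition(left_rows, key, num_nodes)
    right_parts = partition(right_rows, key, num_nodes)
    out = []
    for left_rows, right_slice in zip(left_parts, right_parts):
        out.extend(local_hash_join(left_rows, right_slice, key))
    return out


if __name__ == "__main__":
    # Hypothetical sample data: 10 orders referencing 6 customers, 3 nodes.
    nodes = 3
    customers = [{"cust_id": i, "name": f"c{i}"} for i in range(6)]
    orders = [{"order_id": n, "cust_id": n % 4} for n in range(10)]

    joined = partitioned_join(orders, customers, "cust_id", nodes)
    print("joined rows:", len(joined))
    print("right-table rows held, broadcast:", nodes * len(customers))
    print("right-table rows held, partitioned:", len(customers))
```

In this model the broadcast strategy keeps `nodes * len(customers)` right-table rows in memory across the cluster, while the partitioned strategy keeps `len(customers)`, at the cost of redistributing both tables by join key. That trade-off is why a planner would want table statistics before choosing between them.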
Post-GA Top Asks
We have a good list of additional enhancements that are important to us and our users that are on our post-GA roadmap. The most notable and frequently asked for items include:
- UDFs and extensibility – enables users to add their own custom functionality. This is a frequent request, and building the right model, given the performance and isolation requirements, will take more time than the GA schedule allows.
- Cost-based join order optimization – frees users from having to order joins correctly by hand based on the size and selectivity of the tables.
- External joins using disk – enables joins between tables to spill to disk for arbitrarily large joins.
- Nested data – enables queries on complex nested structures including maps, structs, and arrays.
We are tentatively planning for the Impala 1.0 GA at the end of the first quarter of 2013. During the beta period we will continue to ship Impala beta updates every 2-4 weeks. These updates will include stability fixes as well as features from our roadmap listed above as soon as they are ready. For example, two of our top asks, additional OS platforms and a JDBC driver, will be coming soon after the New Year.
For those of you involved in the Apache Hadoop community, we appreciate your patience as we provide more transparency into our open-source development. Our internal test code and issue tracking contain some confidential information from our early private beta customers. We need to separate this out before we can push more of our infrastructure to public systems.
Earlier this week we provided the second update to the Impala code base. Going forward, the plan is to provide:
- Up-to-date source repositories – we’ll update the public repo more frequently.
- Transparent issue tracking – we’ll be moving bug and feature request tracking over to the public Jira we have set up for Impala.
We are eagerly listening to feedback and continuously adjusting our roadmap to best meet the needs of our user base. Please note that because this is a beta product, the roadmap and timelines above may change.
Justin Erickson is the product manager for Cloudera Impala.