Cloudera Impala 1.0: It’s Here, It’s Real, It’s Already the Standard for SQL on Hadoop
- by Marcel Kornacker & Justin Erickson
- May 01, 2013
In October 2012, we introduced the Impala project, at that time the first known effort to bring a modern, open source, distributed SQL query engine to Apache Hadoop. Our release of source code and a beta implementation was met with widespread acclaim, and it later inspired similar efforts across the industry that now measure themselves against the Impala standard.
Today, we are proud to announce the first production drop of Impala (download here), which reflects feedback from across the user community based on multiple types of real-world workloads. Just as a refresher, the main design principle behind Impala is complete integration with the Hadoop platform (jointly utilizing a single pool of storage, metadata model, security framework, and set of system resources). This integration allows Impala users to take advantage of the time-tested cost, flexibility, and scale advantages of Hadoop for interactive SQL queries, and makes SQL a first-class Hadoop citizen alongside MapReduce and other frameworks. The net result is that all your data becomes available for interactive analysis simultaneously with all other types of processing, with no ETL delays needed.
Although the features and performance results described below are impressive, it’s important to note that they represent only a down payment toward the full promise of Impala. There is much more to come — and soon.
Features in Impala 1.0
First, a brief summary of features (see release notes for full detail). In combination with the design principles described above, they deliver on all requirements for a SQL-on-Hadoop platform (local processing to avoid networking bottlenecks, interactive response time, a single pool of native data, and the freedom to do different kinds of processing on the same data at the same time):
- Support for a subset of ANSI-92 SQL (compatible with Hive SQL), including JOIN and subqueries
- Support for partitioned joins, fully distributed aggregations, and fully distributed top-n queries
- Support for a variety of data formats: Hadoop native (Apache Avro, SequenceFile, RCFile with Snappy, GZIP, BZIP, or uncompressed); text (uncompressed or LZO-compressed); and Parquet (Snappy or uncompressed), the new state-of-the-art columnar storage format
- Support for all CDH4 64-bit packages: RHEL 6.2/5.7, Ubuntu, Debian, SLES
- Connectivity via JDBC, ODBC, Hue GUI, or command-line shell
- Kerberos authentication and MR/Impala resource isolation
Current State of Performance
Much effort has gone into improving performance over the beta release. But before we offer an overview of benchmark numbers, we want to explain how performance testing was done to ensure a realistic preview of real-world use cases.
Because doing BI and analytics often involves running a set of different queries to generate a report, our primary emphasis for measuring performance is to use a diverse set of real-world customer queries against files in their native Hadoop file formats, not cherry-picked queries against pre-loaded specialized file formats, as we’ve seen elsewhere. Furthermore, to measure the full spectrum of performance across the platform, we quantified stand-alone performance as well as the multi-tenant performance of Impala queries and other processing jobs working concurrently. Finally, we relied on beta customers and the broader community to validate the results even more widely with their queries in their own environments. We believe that any other approach will not provide meaningful results. (In fact, it will be quite misleading.)
Some other important facts about the testing process:
- Where Impala and Hive/MapReduce are compared for single-user results, both sets of queries operated against precisely the same Snappy-compressed SequenceFile in HDFS.
- The fact table contains 5 years of data totaling over 1TB.
- Queries varied in date range (1 month to 5 years) and in latency, and were categorized into Interactive Exploration, Reports, and Deep Analytics buckets.
- Queries involved a variety of fairly standard joins (from one to seven in number) and aggregations as well as complex multi-level aggregations and inline views.
- This query set is one among several from customers that we run regularly against various native file formats.
Here are the results (in seconds) for single-user query response on a 20-node cluster, bucketed by type and expressed as a geometric mean across those buckets (we prefer the geometric mean to the arithmetic mean because response times vary widely by query, and a few long-running queries would otherwise dominate the average):
Impala 1.0 versus Hive: Query response time (geometric mean, by category)
And the results above viewed through a “Times Faster Than Hive” lens (expressed as ranges):
Now, let’s look at how Impala achieves better-than-linear scale as we add more concurrent clients:
Impala 1.0: Multi-tenant query response times (geometric mean, by category)
Note from the above that even with a 24x increase in the number of concurrent clients running, performance is still faster than single-user results for Hive! (Note: Concurrency is an important subject, so we will provide in-depth benchmarking results for this area in a future post.)
These results show that Impala, unlike Hive, is suitable for modern BI-scale environments, in which many users run different query types concurrently. And with Impala, as you increase the number of nodes, you see performance similarly increase.
Although we’re proud of these results, we also consider them to be only scratching the surface of what Impala will do when all the benefits of Parquet (currently available in preview form) and full multi-threaded execution are brought to bear over the next two releases.
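As a side note on the geometric mean used for the figures in this section, here is a minimal Python sketch of how it differs from the arithmetic mean. The function names and the sample timings are assumptions made up purely for illustration. They are not the benchmark harness or any measured Impala or Hive result.

```python
import math


def geometric_mean(times):
    """Geometric mean of a list of positive response times (seconds)."""
    if not times or any(t <= 0 for t in times):
        raise ValueError("times must be a non-empty list of positive numbers")
    # Average the logarithms, then exponentiate. This is equivalent to the
    # n-th root of the product, but avoids overflow on long lists.
    return math.exp(sum(math.log(t) for t in times) / len(times))


def arithmetic_mean(times):
    """Plain average, shown only for comparison."""
    return sum(times) / len(times)


# Hypothetical response times in seconds (illustration only, NOT benchmark
# data): one fast, one medium, and one slow query in the same bucket.
sample = [1.0, 10.0, 100.0]

print("geometric mean: ", geometric_mean(sample))   # approximately 10.0
print("arithmetic mean:", arithmetic_mean(sample))  # 37.0
```

In this made-up bucket the single 100-second query pulls the arithmetic mean up to 37 seconds, while the geometric mean stays near 10 seconds, the midpoint on a ratio scale. That is the sense in which the geometric mean is less sensitive to a few long-running queries.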
The Road Ahead for Impala
As you can deduce from the above, Impala 1.0 offers significant performance improvements over MapReduce/Hive for a wide range of BI/analytic queries, making BI over Hadoop feasible. And, thanks to its complete integration with Hadoop, Impala also offers that platform’s familiar flexibility and TCO advantages over remote-query approaches and siloed DBMS/Hadoop hybrids – making costly redundant infrastructure unnecessary.
As we reach new milestones along the roadmap, Impala will move toward achieving the ultimate goal: allowing users to store all their data in the same flexible, open, and native Hadoop file formats and simultaneously run all their batch MapReduce, machine learning, interactive SQL/BI, math (e.g., SAS), and other jobs on the same data. We look forward to your continuing feedback as Impala travels in that direction!
– Impala FAQ
– Impala 1.0 source code
– Impala 1.0 downloads (binaries and inside the QuickStart VM)
– Installing Impala using Cloudera Manager Free Edition
– Impala 1.0 documentation
– Public JIRA
– Impala mailing list
– Cloudera RTQ Support Subscription
Marcel Kornacker is the architect of Impala. Prior to joining Cloudera, he was the lead developer for the query engine of Google’s F1 project.
Justin Erickson is the product manager for Impala.