Cloudera Impala 1.0: It’s Here, It’s Real, It’s Already the Standard for SQL on Hadoop

In October 2012, we introduced the Impala project, at that time the first known effort to bring a modern, open source, distributed SQL query engine to Apache Hadoop. Our release of source code and a beta implementation were met with widespread acclaim — and later inspired similar efforts across the industry that now measure themselves against the Impala standard.

Today, we are proud to announce the first production drop of Impala (download here), which reflects feedback from across the user community based on multiple types of real-world workloads. Just as a refresher, the main design principle behind Impala is complete integration with the Hadoop platform (jointly utilizing a single pool of storage, metadata model, security framework, and set of system resources). This integration allows Impala users to take advantage of the time-tested cost, flexibility, and scale advantages of Hadoop for interactive SQL queries, and makes SQL a first-class Hadoop citizen alongside MapReduce and other frameworks. The net result is that all your data becomes available for interactive analysis simultaneously with all other types of processing, with no ETL delays needed.

Although the features and performance results described below are impressive, it’s important to note that they represent only a down payment toward the full promise of Impala. There is much more to come — and soon.

Features in Impala 1.0

First, a brief summary of features (see release notes for full detail). In combination with the design principles described above, they deliver on all requirements for a SQL-on-Hadoop platform: local processing to avoid networking bottlenecks, interactive response time, a single pool of native data, and the freedom to do different kinds of processing on the same data at the same time:

  • Support for a subset of ANSI-92 SQL (compatible with Hive SQL), including CREATE, ALTER, SELECT, INSERT, JOIN, and subqueries
  • Support for partitioned joins, fully distributed aggregations, and fully distributed top-n queries
  • Support for a variety of data formats: Hadoop native (Apache Avro, SequenceFile, RCFile with Snappy, GZIP, BZIP, or uncompressed); text (uncompressed or LZO-compressed); and Parquet (Snappy or uncompressed), the new state-of-the-art columnar storage format
  • Support for all CDH4 64-bit packages: RHEL 6.2/5.7, Ubuntu, Debian, SLES
  • Connectivity via JDBC, ODBC, Hue GUI, or command-line shell
  • Kerberos authentication and MR/Impala resource isolation

Current State of Performance

Much effort has gone into improving performance over the beta release. But before we offer an overview of benchmark numbers, we want to explain how performance testing was done to ensure a realistic preview of real-world use cases.

Because doing BI and analytics often involves running a set of different queries to generate a report, our primary emphasis for measuring performance is to use a diverse set of real-world customer queries against files in their native Hadoop file formats — not cherry-picked queries against pre-loaded specialized file formats, as we’ve seen elsewhere. Furthermore, to measure the full spectrum of performance across the platform, we quantified stand-alone performance as well as the multi-tenant performance of Impala queries and other processing jobs working concurrently. Finally, we ultimately relied on beta customers and the broader community to validate the results even more widely with their queries in their own environments. We believe that any approach other than that above will not provide meaningful results. (In fact, they will be quite misleading.)

Some other important facts about the testing process:

  • Where Impala and Hive/MapReduce are compared for single-user results, both sets of queries operated against precisely the same Snappy-compressed SequenceFile in HDFS.
  • The fact table contains 5 years of data totaling over 1TB.
  • Queries were diverse in date range (1 month to 5 years) and amount of latency (and categorized into Interactive Exploration, Reports, and Deep Analytics buckets).
  • Queries involved a variety of fairly standard joins (from one to seven in number) and aggregations as well as complex multi-level aggregations and inline views.
  • This query set is one among several from customers that we run regularly against various native file formats.

Here are the results (in seconds) for single-user query response on a 20-node cluster, bucketed by type and expressed as a geometric mean across those buckets (geometric mean being our preferred approach over arithmetic mean because response times vary by query):

Single-user query performance

Impala 1.0 versus Hive: Query response time (geometric mean, by category)

And the results above viewed through a “Times Faster Than Hive” lens (expressed as ranges):

Now, let’s look at how Impala achieves better-than-linear scale as we add more concurrent clients: 

multi-user query response time
Impala 1.0: Multi-tenant query response times (geometric mean, by category)

Note from the above that even with a 24x increase in the number of concurrent clients running, performance is still faster than single-user results for Hive! (Note: Concurrency is an important subject, so we will provide in-depth benchmarking results for this area in a future post.)

The above proves out that Impala, unlike Hive, is suitable for modern BI-scale environments (in which many users are running different query types concurrently) — and with Impala, as you increase the number of nodes, you see performance similarly increase.

Although we’re proud of these results, we also consider them to be only scratching the surface of what Impala will do when all the benefits of Parquet (currently available in preview form) and full multi-threaded execution are brought to bear over the next two releases.

The Road Ahead for Impala

As you can deduce from the above, Impala 1.0 offers significant performance improvements over MapReduce/Hive for a wide range of BI/analytic queries, making BI over Hadoop feasible. And, thanks to its complete integration with Hadoop, Impala also offers that platform’s familiar flexibility and TCO advantages over remote-query approaches and siloed DBMS/Hadoop hybrids – making costly redundant infrastructure unnecessary.

As we reach new milestones along the roadmap, Impala will move toward achieving the ultimate goal: allowing users to store all their data in the same flexible, open, and native Hadoop file formats and simultaneously run all their batch MapReduce, machine learning, interactive SQL/BI, math (e.g., SAS), and other jobs on the same data. We look forward to your continuing feedback as Impala travels in that direction!

Additional resources:

– Impala FAQ
Impala 1.0 source code
Impala 1.0 downloads (binaries and inside the QuickStart VM)
Installing Impala using Cloudera Manager Free Edition
Impala 1.0 documentation
Public JIRA
Impala mailing list
Cloudera RTQ Support Subscription

Marcel Kornacker is the architect of Impala. Prior to joining Cloudera, he was the lead developer for the query engine of Google’s F1 project.

Justin Erickson is the product manager for Impala.

Filed under:

5 Responses
  • Slim Baltagi / May 02, 2013 / 7:48 AM

    Thanks for the update. I have a couple questions:
    1. Is there any scheduled 1.1 release of Impala soon to contain fixes of issues you are considering critical in your product backlog? https://issues.cloudera.org/browse/IMPALA
    IMPALA-174, IMPALA-248, IMPALA-124, IMPALA-85, IMPALA-50
    2. With the above ‘critical’ issues and concerns about the number of concurrent users Impala can support, is Impala 1.0 still being considered production ready?
    3. Does Cloudera Impala RTQ shares the same code base with Impala (open source and freely available version) and only provides supplementary service: technical support and management?
    4. How the cost of Impala RTQ compares to licensed counterparts like Hadapt?

    Thanks

    Slim Baltagi

  • Justin Erickson / May 02, 2013 / 8:11 PM

    Hi Slim,

    Thanks for your interest in Impala. For questions 1 and 2: Impala 1.0 has gone through nearly a year of customer beta testing and extensive internal QA in order to validate production general availability. The specific issues you called out do not affect production quality or stability for the following reasons:
    * IMPALA-174 – we’re waiting for confirmation on which build the customer ran and/or more details on the issue. We suspect this may be something we have fixed in later beta builds and in the 1.0 build
    * IMPALA-248 – a nice usability feature to allow submitting multiple commands at once through the command line shell but by no means a blocker because you can just run each command sequentially
    * IMPALA-124 – Impala uses and reports cancellation as an exception. It would be nicer if this was more clearly surfaced but this doesn’t affect behavior
    * IMPALA-85 – this adds new useful SQL functionality but doesn’t impact stability of the existing functionality. More specifically this makes the older ANSI-89 style joins more usable but the more commonly used ANSI-92 style joins work fine today
    * IMPALA-50 – this one is a more impactful, but the major issue (i.e. the crash) has been fixed prior to release. Under situations of high stress we discovered a case where new connections would timeout where we could have behaved better. This was uncovered through our internal stress testing where we specifically push the product to the limits and not something we have observed in customer beta environments

    I hope that helps clarify the state of the issues you mentioned and why they do not affect production installations of Impala. We will of course have continued releases of Impala and will keep the strong momentum going on the Impala project given the great customer response for the project. I can’t comment on release timeframe for 1.1 yet but can say that you should expect Impala to release more frequently than quarterly CDH updates.

    3. Impala is and will continue to be 100% open sourced. Customers of Cloudera Enterprise RTQ get exactly the same Impala that is publicly available under the Apache license. Cloudera Enterprise RTQ is Cloudera’s support offering, which also includes the Enterprise Edition of Cloudera Manager for easy deployment, management, monitoring, etc.

    4. While I can’t comment on other’s prices, we’d be happy to talk in more depth on what Cloudera offers. Please request us to contact you directly through: http://cloudera.com/content/cloudera/en/about/contact-us/contact-form.html

    Thanks,
    Justin

  • ludonara / May 04, 2013 / 12:50 AM

    Thanks for this great job! Impala brings the necessary to do real time queries for BI analytics on hadoop and play the role that hive doesn’t, due to this poor performance.

    I have just one question about the benchmark on concurrent Impala clients. Does each client run the same set of queries in this test?
    In the case of the answer is yes, does impala implement a mechanism to detect that different clients run same query and so doesn’t do the same job multiple time for the same result (or do some caching) ?

    Thanks for the great job of Cloudera team.

  • Pat Wellington / May 07, 2013 / 10:41 AM

    This is a fascinating post.

    The post mentions a 20 node cluster. What was the memory configured on each node? I am a little confused by statements about Impala’s memory requirements. The FAQ seems to indicate that each node must have memory capacity to contain the entire working set of a query. I thought I read that this has been removed in the GA release and join processing is distributed across nodes. So I am not sure if my impala cluster should consist of a few large nodes or many small ones.

    Thanks!

  • Justin Kestelyn (@kestelyn) / May 21, 2013 / 5:05 PM

    This response from Impala product manager Justin Erickson:

    “Hi Ludonara and Pat,

    Thanks for your interest and excitement for the Impala project. In the test above we ran the same set of queries from multiple clients simultaneously. We’ll have a follow-up post with more in-depth coverage of multi-user performance.

    Impala and Hadoop currently have no memory caching so each query runs independently and afresh. Linux has an OS block cache which will help keep some portion of the working data set in memory if a query runs a plan fragment on a node that already has a block in the Linux OS cache. This current mechanism ends up being less efficient than application-level in-memory caching since the file cache is not coordinated with Hadoop and Impala. Stay tuned for Hadoop to gain the ability to perform more efficient memory caching.

    Impala does not need to keep the entire working set in memory. The confusion comes from the fact that before Impala 0.7, Impala needed to keep the right-hand side input of a join in memory on each node – the left-hand side input can be arbitrarily large as it is only scanned, never cached in memory. This limitation has since been lifted with partitioned joins which enable the right-hand side input of a join to be partitioned across the aggregate available memory of the entire cluster. Please also note that Impala does not cache the entire right-hand side input table of a join, but only those rows and columns that are needed for the join.

    It’s hard to provide absolute numbers of memory recommendations. In general the bigger the joins and concurrency between big joins the larger the memory footprint to have in the cluster. Typical deployments from our beta customers have been between 48 GB and 96 GB per node. With the advent of beyond-batch processing engines such as the recently announced SAS integration and Impala general availability, we’re seeing more customers go for 64 GB and 96 GB per node with a few that go even larger, anticipating the ability to pin data sets in memory for even lower latencies.

    Thanks,
    Justin”

Leave a comment


seven − 1 =