The Impala project has already passed several important milestones on the way to its status as the leader and open standard for BI and SQL analytics on modern big data architecture. Today’s milestone is the submission of proposals for Impala and Kudu to join the Apache Software Foundation (ASF) Incubator.
Since Impala’s initial release roughly three years ago, we’ve kept you informed about important landmarks in its evolution, including its production-readiness with GA, the addition of analytic SQL capabilities in Impala 2.0, updatability via Kudu integration, and unified, fine-grained security with RecordService and Apache Sentry (incubating). To mark today’s announcement of proposals for Impala and Kudu to enter the ASF Incubator, this post reviews these major milestones and what’s coming next.
The Road to Impala 2.0: Analytic Database Performance and Functionality
In October 2012, Cloudera introduced Impala, the first effort to build a modern, open source, distributed SQL query engine for Apache Hadoop. Impala immediately earned widespread market attention with orders-of-magnitude performance advantages over alternative SQL-on-Hadoop solutions, and performance on par with that of traditional analytic databases. These performance gains unlocked the ability to perform interactive BI and SQL analytics directly on Hadoop for the first time.
Impala widened its performance lead with Impala 1.1, 1.3, and 1.4, even as other SQL-on-Hadoop engines improved their own performance; IBM Research, for example, concluded that “Impala’s database-like architecture provides significant performance gains” in a VLDB paper comparing SQL-on-Hadoop engines. More important, though, we observed that even in its first year on the market, Impala already outperformed traditional parallel databases across many customer deployments. To document and quantify this observation, in January 2014 we published a set of performance results against a traditional analytic database (masked as “DBMS-Y” due to proprietary licensing restrictions).
With each release leading up to Impala 2.0, Impala responded to increased customer usage with added SQL compatibility and enterprise capabilities. Impala 2.0 aimed to provide the functional as well as performance benefits of traditional analytic databases, most notably:
- User-defined functions for customized SQL language extensions
- Fine-grained authorization via Apache Sentry (later extended across the platform with RecordService)
- Admission control for multi-tenancy
- Additional ANSI SQL capabilities and vendor-specific extensions including SQL:2003 analytic window functions
- Integration with most of the leading business intelligence (BI) tools on the market
During this time, downloads of Impala rapidly surpassed the 1-million mark. Impala also emerged as an open standard as multi-vendor support from Cloudera, Oracle, MapR, and Amazon came online; more recently, Impala code has also shipped as part of IBM Big SQL.
Post Impala 2.0: Reliability with Concurrency and Scalability
The functionality in Impala 2.0 further accelerated Impala’s adoption across customer deployments. Today, Impala has been adopted by most Cloudera customers and is the most popular additional component used in Cloudera Enterprise.
Based on this accelerated adoption, the next phase of development focused primarily on reliability, while scaling to meet the multi-user concurrency and data scalability needs of Impala’s customers. With the improvements released since Impala 2.0, we’ve seen Impala’s customers continue to scale successfully on all these fronts, as documented in a few public studies (Cloudera, Zoosk, AtScale).
Impala now has many customers in the 1-million-queries club, clusters ranging from tens to hundreds of nodes, and customers that are pushing the concurrency envelope into 1,000+ users. Many enterprises have repeatedly shown that Impala can support the multi-user workloads that, historically, were possible only with traditional MPP database technologies.
Impala 2.0 also cemented the project’s architectural foundation sufficiently that we recently opened Impala development to external contributors in the open source community. Contributions from Intel, Arcadia Data, and others have already gone upstream, and Google is currently in the process of contributing code for Impala-on-Bigtable integration. We’re excited to see these contributions and even more excited for what’s to come.
Hadoop, Kudu, and Impala: The Modern Analytic Database Architecture
Just a few weeks ago, we unveiled Kudu, the most significant milestone since Impala’s launch. The Impala and Kudu communities are now working together to enable new Hadoop workloads in which one can directly query fast-changing data in real time with support for direct inserts, updates, and deletes.
The architecture of Kudu, Impala, and Hadoop sets the foundation for the modern analytic database architecture. Hadoop and Kudu enable all your data to be flexibly used across all Hadoop SQL and non-SQL processing frameworks, as Kudu can now handle fast-changing data that can’t be easily managed in HDFS. Impala continues to be the leading analytic query engine that is uniquely positioned to enable interactive BI and SQL analytics for this platform.
Today, having established the architectural foundation of Impala via 2.0 functional enhancements and integration with Kudu as the high-performance storage manager, we’re very excited to take the next step in Impala’s open source journey, proposing that Impala join the ASF to further grow the project’s community of users and developers. While we work through the ASF proposal process, if you’re interested in contributing to Impala, please review our blog post about how to contribute today.
We’ve come a long way to establish Impala as a core component of modern analytic database architecture, but we’re just beginning to unlock its potential as part of the broader platform. As Turing Award winner and database pioneer Michael Stonebraker wrote:
Parallel DBMSs excel at efficient querying of large data sets; MR-style systems excel at complex analytics and ETL tasks. Neither is good at what the other does well. Hence, the two technologies are complementary, and we expect MR-style systems performing ETL to live directly upstream from DBMSs.
With continuing investments in Impala, Kudu, and Apache Spark (the latter through the One Platform Initiative), we see this vision mirrored in the Hadoop ecosystem: Impala as the leading parallel DBMS and Spark as the leading data processing framework, working together at the core of a unified, modern enterprise data hub architecture. We’re excited to take the next step in building on this vision together with the rest of the open source community.
Marcel Kornacker is the founder and architect of Impala.
Justin Erickson is director of product management at Cloudera.