Impala users can expect new performance and usability benefits via improved integration with Kudu.
It’s been nearly one year since the public beta announcement of Kudu (now a top-level Apache project) and a noteworthy milestone has been reached: its 1.0 release. This is particularly exciting as Kudu extends the use cases that can be supported on the Apache Hadoop platform, whether it be on-premises or in the cloud, by providing a high-performance, columnar relational storage engine that enables fast analytics on fast (changing) data.
When it comes to analytics, most will recognize SQL as the lingua franca of data analysts, and Apache Impala (incubating) brings the same low-latency SQL query access that users have come to rely on for data stored in HDFS and Amazon S3 to data stored in Kudu tables. Thus, unlike other analytic database solutions where you need to first bulk-load data (or use a “microbatch” approach), the combination of Kudu and Impala provides instant access to the most recent data via SQL.
Remind Me What Kudu Is, and Why It’s Exciting for Impala Users?
At a high level, Kudu is a new storage manager that enables durable single-record inserts, updates, and deletes, as well as fast and efficient columnar scans due to its in-memory row format and on-disk columnar format. This architecture makes Kudu very attractive for data that arrives as a single record at a time or that may need to be modified at a later time.
Today, many users try to solve this challenge with a Lambda architecture, which imposes inherent complexity by requiring separate code bases and storage systems for its batch and real-time components. Using Kudu and Impala together avoids this complexity entirely: data inserted into Kudu is immediately available for querying and analytics via Impala. (For more technical details on how Impala and Kudu work together for analytical workloads, see this post.)
Initial Impala Integration Features for Kudu
Finally, let’s review the Impala functionality for Kudu 1.0 that is scheduled to appear in the upcoming Impala 2.7 release:
- DROP support added for Kudu tables. The tables follow the same internal/external approach as other tables in Impala, allowing flexible data ingestion and querying.
- INSERT support for Kudu tables in Impala, using the same mechanisms as any other table with HDFS or HBase persistence. Note that there is no penalty for single-row inserts into Kudu tables as compared to HDFS (where each single-row insert results in a single-record file).
- UPDATE and DELETE support for Kudu tables. The syntax of the SQL commands is as compatible as possible with existing solutions. In addition to basic UPDATE and DELETE commands, you can specify complex joins in the FROM clause of the query, using the same syntax as a regular SELECT statement.
- In addition to DELETE support, Impala 2.8 also supports UPSERT (UPDATE if the primary key exists, else INSERT as a new record).
- To achieve the best possible performance, the Kudu client in Impala parallelizes scans to multiple tablets.
- Impala 2.7 pushes down predicate evaluation to Kudu where possible so filters can be evaluated close to the data. Query performance is generally comparable to Apache Parquet for many workloads.
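To make the feature list above concrete, the statements below sketch what these operations look like in Impala SQL against a Kudu-backed table. The table and column names are hypothetical, and this assumes a table already created with a primary key column id (the exact DDL syntax for creating Kudu tables varies by Impala version):

```sql
-- Hypothetical Kudu-backed table `metrics` with primary key column `id`.

-- Single-row insert: immediately visible to queries, with no
-- small-file penalty of the kind single-row inserts incur on HDFS
INSERT INTO metrics VALUES (1, 'host-01', 99.5);

-- Update and delete individual records by primary key
UPDATE metrics SET value = 98.2 WHERE id = 1;
DELETE FROM metrics WHERE id = 1;

-- DELETE with a join in the FROM clause, using the same
-- syntax as a regular SELECT statement
DELETE m FROM metrics m JOIN retired_hosts r ON m.host = r.host;

-- Impala 2.8: UPSERT updates the row if the primary key
-- already exists, otherwise inserts it as a new record
UPSERT INTO metrics VALUES (1, 'host-01', 97.0);
```

Because Kudu enforces a unique primary key, UPSERT gives you idempotent single-statement ingestion of late-arriving or corrected records, with no separate read-then-write step in application code.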
The Future of Impala and Kudu
As Kudu adds more functionality to better support fast analytics on fast data, Impala will also work to add supporting functionality to enable those features for its SQL users. The journey is just beginning, but the combination of Impala and Kudu is exciting as it brings more of a “database-like” experience to Hadoop, and unlocks even more use cases on that platform to support the ever increasing demand for real-time analytics.
Greg Rahn is a director of product management at Cloudera.