Apache Kudu and Apache Impala (Incubating): The Integration Roadmap

Categories: Impala, Kudu

Impala users can expect new performance and usability benefits via improved integration with Kudu.

It’s been nearly one year since the public beta announcement of Kudu (now a top-level Apache project) and a noteworthy milestone has been reached: its 1.0 release. This is particularly exciting as Kudu extends the use cases that can be supported on the Apache Hadoop platform, whether it be on-premises or in the cloud, by providing a high-performance, columnar relational storage engine that enables fast analytics on fast (changing) data.

When it comes to analytics, most will recognize SQL as the lingua franca of data analysts, and Apache Impala (incubating) brings to data stored in Kudu tables the same low-latency SQL query access that users have come to rely on for data stored in HDFS and Amazon S3. Thus, unlike other analytic database solutions where you first need to bulk-load data (or use a “microbatch” approach), the combination of Kudu and Impala provides instant access to the most recent data via SQL.

Remind Me What Kudu Is, and Why It’s Exciting for Impala Users?

At a high level, Kudu is a new storage manager that enables durable single-record inserts, updates, and deletes, as well as fast and efficient columnar scans due to its in-memory row format and on-disk columnar format.  This architecture makes Kudu very attractive for data that arrives as a single record at a time or that may need to be modified at a later time.

Today, many users try to solve this challenge via a Lambda architecture, which presents inherent challenges by requiring different code bases and storage for the necessary batch and real-time components. Using Kudu and Impala together completely avoids this problematic complexity by easily and immediately making data inserted into Kudu available for querying and analytics via Impala. (For more technical details on how Impala and Kudu work together for analytical workloads, see this post.)
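
To make that concrete, here is a minimal sketch of the insert-then-query flow. The table and column names are hypothetical; the point is only that a single-row insert into a Kudu table is immediately visible to Impala queries, with no bulk-load or micro-batch step in between:

    -- Assumes a Kudu-backed Impala table "events" with columns
    -- (event_id BIGINT, event_type STRING, ts TIMESTAMP) and event_id as the primary key.
    INSERT INTO events VALUES (1001, 'click', now());

    -- The new row is queryable right away; no file rewrite or batch window is needed.
    SELECT event_type, count(*) AS cnt
    FROM events
    GROUP BY event_type;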

Initial Impala Integration Features for Kudu

Finally, let’s review the Impala functionality for Kudu 1.0 that is scheduled to appear in the upcoming Impala 2.7 release:

  • CREATE and DROP support added for Kudu tables. The tables follow the same internal/external approach as other tables in Impala, allowing flexible data ingestion and querying.
  • INSERT support for Kudu tables in Impala using the same mechanisms as any other table with HDFS or HBase persistence. Note that there is no penalty for single-row inserts into Kudu tables as compared to HDFS (where each single-row insert results in a single-record file).
  • UPDATE and DELETE support for Kudu tables. The syntax of the SQL commands is as compatible as possible with existing solutions. In addition to basic DELETE or UPDATE commands, you can specify complex joins in the FROM clause of the query, using the same syntax as a regular SELECT statement.
  • In addition to INSERT, UPDATE, and DELETE support, Impala 2.8 will also add support for UPSERT (UPDATE if the primary key exists, else INSERT as a new record).
  • To achieve the best possible performance, the Kudu client in Impala parallelizes scans to multiple tablets.
  • Impala 2.7 pushes down predicate evaluation to Kudu where possible so filters can be evaluated close to the data. Query performance is generally comparable to Apache Parquet for many workloads. (A sketch of these statements against a Kudu-backed table appears right after this list.)
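
To make the list above more concrete, the following is a rough sketch of these statements against a hypothetical Kudu-backed table. The exact DDL clauses for Kudu tables (primary key declaration, partitioning, STORED AS KUDU) have varied across early Impala releases, so treat this as illustrative rather than version-exact:

    -- CREATE: Kudu tables are declared with a primary key; this uses the
    -- PARTITION BY / STORED AS KUDU form documented for later Impala releases.
    CREATE TABLE metrics (
      host STRING,
      ts TIMESTAMP,
      value DOUBLE,
      PRIMARY KEY (host, ts)
    )
    PARTITION BY HASH (host) PARTITIONS 16
    STORED AS KUDU;

    -- INSERT: the same statement shape as for HDFS- or HBase-backed tables.
    INSERT INTO metrics VALUES ('web01', now(), 0.42);

    -- UPDATE and DELETE: row-level changes by predicate; a join can be
    -- specified in the FROM clause (staged_metrics is a hypothetical table).
    UPDATE metrics SET value = 0.0 WHERE host = 'web01';
    UPDATE metrics SET value = s.value
    FROM metrics JOIN staged_metrics s
      ON metrics.host = s.host AND metrics.ts = s.ts;
    DELETE FROM metrics WHERE ts < '2016-01-01';

    -- UPSERT (Impala 2.8): update the row if the primary key already exists,
    -- otherwise insert it as a new row.
    UPSERT INTO metrics VALUES ('web01', now(), 0.99);

    -- Predicate pushdown: the filter on host can be evaluated by Kudu itself,
    -- so only matching rows are returned to Impala.
    SELECT host, avg(value) FROM metrics WHERE host = 'web01' GROUP BY host;

    -- DROP: for an internal (managed) table this also drops the underlying Kudu table.
    DROP TABLE metrics;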

The Future of Impala and Kudu

As Kudu adds more functionality to better support fast analytics on fast data, Impala will also work to add supporting functionality to enable those features for its SQL users. The journey is just beginning, but the combination of Impala and Kudu is exciting as it brings more of a “database-like” experience to Hadoop, and unlocks even more use cases on that platform to support the ever increasing demand for real-time analytics.

Greg Rahn is a director of product management at Cloudera.


8 responses on “Apache Kudu and Apache Impala (Incubating): The Integration Roadmap”

  1. Pablo Vazquez

    Hi.
    I have a question about this part:

    “INSERT support for Kudu tables in Impala using the same mechanisms as any other table with HDFS or HBase persistence. Note that there is no penalty for single-row inserts into Kudu tables as compared to HDFS (where each single-row insert results in a single-record file).”

    Does that mean Impala 2.6 uses the “single record file” approach for Kudu tables?

    I’ve done some tests trying to insert 1M records into a Kudu table, but the performance is really poor (around 20 minutes). I assumed it was caused by the “single record file” approach.

  2. Greg Rahn

    Thanks for the comment, Pablo. When I mentioned “there is no penalty,” I was referring to the fact that single-row inserts create single-row files in HDFS, which has several negative side effects: very many small files, poor or no compression, and poor performance for query engines scanning over those files. I did not mean to imply that there is zero latency overhead for using Impala for INSERTs versus using the Kudu API directly; that code path does cost something, but the teams hope to optimize and minimize that latency. Hope that helps clarify things. Also note that the Impala integration with Kudu is still early and under active development, so functionality and performance should both improve with upcoming releases.

    1. Pablo Vazquez

      Thanks, Greg, for your quick reply.

      I understand there is a difference between using the Java API and Impala. Latency will be present anyway (it always finds a way to appear).

      In my project we are using Pentaho to build a DWH with Kudu as the storage layer. Pentaho uses the Impala driver to reach Kudu. In my first approach I tried to insert directly from Pentaho into Kudu through the Impala driver, but the performance was really bad. Right now, we’re creating a txt file with Pentaho, moving it to HDFS, using ‘load data inpath …’ in Impala and, finally, using the “INSERT … SELECT …” approach to persist the data in Kudu. The time is much better, but it has many failure points. I don’t feel comfortable with it at all.
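
      For reference, a rough sketch of that staging path (the file path and table names here are made up for illustration):

        -- Step 1: the exported text file is already in HDFS; move it into an
        -- HDFS-backed Impala table (staging_txt is hypothetical).
        LOAD DATA INPATH '/staging/pentaho_export.txt' INTO TABLE staging_txt;

        -- Step 2: copy the staged rows into the Kudu-backed table in one
        -- set-based statement instead of row-by-row inserts.
        INSERT INTO kudu_target SELECT * FROM staging_txt;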

      My intention with Impala 2.7 was to test the first approach again and check how it works without that ‘single record file’ penalty.

      Keep up the great work.

      1. J-D Cryans

        Hi Pablo,

        Single-row operations via the Impala driver will definitely be slow; Impala is not an OLTP kind of database, so single-row operations will take a lot more time than you’d expect. Loading the data via a text file, as you saw, is much faster because it’s all part of a single query. But still, there are a lot of performance improvements we can make there.

        The fastest way to load data into Kudu is to use the native Java or C++ APIs.

        BTW, where is your data coming from?

        Thanks,

        J-D

  3. mizukusak

    I am sorry for my poor English.
    First of all, congratulations on the Kudu 1.0 release!!!
    I am very much looking forward to watching the Impala and Kudu projects grow.

    I have some questions.
    Do these projects have plans to support multi-tablet transactions (UPDATE, INSERT, DELETE)?
    If so, around when will they be supported?

    1. J-D Cryans

      Hi mizukusak,

      There’s nothing, architecturally speaking, that prevents Kudu from having multi-tablet transactions, but it’s not on our roadmap at the moment. What is your use case?

      Thanks,

      J-D

  4. mizukusak

    Hi J-D,
    Thank you for your reply.
    I will use Impala and Kudu for a web application which may grow large.
    For example, my use case is the following:
    DELETE FROM comments WHERE article_id = 1;
    INSERT INTO comments (article_id, text) VALUES (1, 'foo');
    The rows stay deleted if the INSERT statement fails for some reason.
    But actually it is not much of a problem even without transactions, because I can implement an alternative to it.

  5. Son

    Hello, I believe Kudu is the solution to my issues. I am ingesting millions of different files along with metadata from different sources; the end goal is to run analytics on all of these files. Since each file is small (a few MB), even when serialized/deserialized to JSON, it does not fill an HDFS block, and therefore things do not run optimally (especially when running Spark SQL queries). I believe Kudu should be able to handle these small JSON files for fast read/write. Is that a correct assumption?
