Inside Cloudera Impala: Runtime Code Generation

Cloudera Impala, the open-source real-time query engine for Apache Hadoop, uses many tools and techniques to get the best query performance. This blog post will discuss how we use runtime code generation to significantly improve our CPU efficiency and overall query execution time. We’ll explain the types of inefficiency that code generation eliminates and examine in more detail one of the queries in the TPCH workload, where code generation improves overall query speed by close to 3x.

Why Code Generation?

The baseline for “optimal” query engine performance is a native application written specifically for your data format and for your query alone. For example, it would be ideal if a query engine could execute this query:

select count(*)
from tbl
where col like '%XYZ%'

as fast as grep -c "XYZ" tbl.

As another example, consider the query select sum(col) from tbl. If the table had only one int64 column, encoded as little endian, this could be implemented by a dedicated application as:

// Sum a buffer of little-endian int64 values in a single tight loop.
int64_t sum = 0;
const int64_t* values = (const int64_t*)buffer;
for (int i = 0; i < num_rows; ++i) {
  sum += values[i];
}

Both these queries are reasonable (as is the encoding in the second example for a columnar format), and both would probably run much slower through most existing query engines than through the dedicated programs above. (This assumes brute-force execution; a database could certainly use indices or precomputed values to perform much better than a simple application. Of course, the techniques described here apply to non-brute-force execution strategies as well.) The gap exists because most query engines in use today follow an interpretive approach that adds various forms of execution overhead.

Overhead comes in the form of:

  1. Virtual function calls. Without codegen, expressions (e.g. col1 + col2 < col3) are interpreted, resulting in a virtual function call per expression. (This certainly depends on the implementation, but ours, and probably most others, involves something akin to a virtual, generic “Eval” function that each operator implements.) In this case the expression itself is very cheap (a single add), but the virtual function call overhead is very high; see the sketch after this list.
  2. Large switch statements for types, operators, and functions that are not referenced by the query. While the branch predictor can alleviate this problem, branch instructions still prevent effective instruction pipelining and instruction-level parallelism. 
  3. Inability to propagate all constants. Impala computes a fixed-width tuple format during planning (e.g. col3 is at byte offset 16). It would be beneficial if these constants could be folded into the code itself rather than having to do additional memory lookups.
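
To make the first point concrete, here is a minimal C++ sketch (illustrative only, not Impala’s actual classes) of interpreted expression evaluation versus the straight-line code that codegen effectively produces:

#include <cstdint>
#include <memory>

// Interpreted evaluation: each operator is a tree node with a virtual
// Eval(), so even a cheap expression like col1 + col2 pays several
// virtual calls per row.
struct Expr {
  virtual ~Expr() {}
  virtual int64_t Eval(const int64_t* row) = 0;
};

struct SlotRef : Expr {
  int slot;  // which column of the row to read
  explicit SlotRef(int s) : slot(s) {}
  int64_t Eval(const int64_t* row) override { return row[slot]; }
};

struct Add : Expr {
  std::unique_ptr<Expr> lhs, rhs;
  Add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
      : lhs(std::move(l)), rhs(std::move(r)) {}
  int64_t Eval(const int64_t* row) override {
    return lhs->Eval(row) + rhs->Eval(row);  // two more virtual calls
  }
};

// What codegen effectively produces for col1 + col2: no dispatch at all,
// just an add (with the column offsets folded in as constants) that the
// compiler can inline, reorder, and pipeline freely.
int64_t EvalCol1PlusCol2(const int64_t* row) { return row[0] + row[1]; }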

The goal of code generation is to have each query use the same number of instructions as custom-written code that implements exactly the logic of the query, with no overhead due to the query execution supporting broader functionality.

Introduction to LLVM

LLVM (Low-Level Virtual Machine) is a set of libraries that constitute the building blocks of a compiler (the clang compiler is built using these libraries). The key components are:

  1. Abstract Syntax Tree (AST) → Intermediate Representation (IR) generation
  2. IR optimization (i.e. compiler optimizations)
  3. IR → machine code generation

IR is an intermediate, typed assembly language that LLVM uses as the input and output for most of its internal components; it’s comparable to Java bytecode. LLVM also exposes higher-level code objects (e.g. the Instruction and Function objects) that make it easy to manipulate IR programmatically. This includes operations like inlining a function call, removing instructions, replacing a computation with a constant, and so on. Impala utilizes only (2) and (3) at runtime, since we have our own version of the AST in the form of the query plan.

LLVM gives us the best balance of low compilation overhead, access to a full optimizing compiler, and an actual API for generating code programmatically.

There are other techniques that can be used for code generation, but we believe LLVM is superior to them:

  • Generating assembly on the fly and running it. While this is very fast to do, writing assembly (as ASCII text) is error-prone and difficult, especially as the number of generated functions grows. It also gets no benefit from the optimization passes a compiler can perform. 
  • Generating C++ (as text) on the fly, compiling it (by exec-ing a compiler), and dynamically loading the resulting binary. This has the benefit of using an optimizing compiler and writing in a higher-level language (although generating code as text is still not great), but the overhead of compiling can be prohibitive (multiple seconds). 

Impala’s Use of IR

After the SQL semantic analysis phase, we generate code for the “kernels” of the individual operators of the query: that is, the inner loops in which the query spends the vast majority of its CPU cycles. At the time of code generation, we know all the types, tuple layouts, SQL operations, and expressions the query will use. The result is a very tight inner loop with all function calls inlined and no extraneous instructions.

We first need to get IR function objects for pieces of the code path. LLVM provides two mechanisms for generating IR. The first is to use LLVM’s IRBuilder C++ API, which lets us programmatically generate IR, instruction by instruction. The second is to use the clang compiler to compile C++ source code into IR. Impala uses both methods.
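
As a flavor of the first mechanism, here is a minimal sketch that uses the IRBuilder to emit a function computing col1 + col2 (exact headers and signatures vary across LLVM versions, so treat this as illustrative rather than Impala’s actual codegen):

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"

// Build IR for: int64_t add_cols(int64_t col1, int64_t col2).
llvm::Function* BuildAddCols(llvm::Module* module, llvm::LLVMContext& ctx) {
  llvm::IRBuilder<> builder(ctx);
  llvm::Type* i64 = builder.getInt64Ty();
  llvm::FunctionType* fn_type =
      llvm::FunctionType::get(i64, {i64, i64}, /*isVarArg=*/false);
  llvm::Function* fn = llvm::Function::Create(
      fn_type, llvm::Function::ExternalLinkage, "add_cols", module);

  // A function body is a graph of basic blocks; this one needs only "entry".
  llvm::BasicBlock* entry = llvm::BasicBlock::Create(ctx, "entry", fn);
  builder.SetInsertPoint(entry);

  auto arg = fn->arg_begin();
  llvm::Value* col1 = &*arg++;
  llvm::Value* col2 = &*arg;
  builder.CreateRet(builder.CreateAdd(col1, col2, "sum"));

  llvm::verifyFunction(*fn);  // sanity-check the generated IR
  return fn;
}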

At a high level, to perform code generation we:

  1. Generate IR using the IRBuilder for the places where more efficient code can be generated with additional runtime information.
  2. Load precompiled IR for functions that we need but don’t benefit from runtime information.
  3. Combine the IR from 1 and 2 by replacing called functions. This allows us to inline across what would have been virtual function calls.
  4. Run the LLVM optimizer along with some of our custom optimizations. This is very similar to compiling your code with optimizations enabled and takes care of many things for you: beyond shrinking the code, this step performs common subexpression elimination, constant propagation, further function inlining, instruction reordering, dead-code elimination, and other compiler optimization techniques.
  5. JIT-compile the optimized IR to machine code. LLVM returns this as a function pointer, which the query engine uses instead of the interpreted function. (Steps 4 and 5 are sketched below.)
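
Here is a minimal sketch of steps 4 and 5 using LLVM’s legacy pass manager and MCJIT execution engine; both APIs have changed across LLVM releases, so this is an illustration of the shape of the code, not Impala’s actual implementation:

#include <cstdint>
#include <memory>

#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/ExecutionEngine/MCJIT.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Support/TargetSelect.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"

using AddColsFn = int64_t (*)(int64_t, int64_t);

// 'module' holds the combined IR from steps 1-3 (e.g. add_cols from above).
AddColsFn OptimizeAndJit(std::unique_ptr<llvm::Module> module) {
  // Step 4: run the standard optimization pipeline over the whole module.
  llvm::legacy::PassManager passes;
  llvm::PassManagerBuilder pass_builder;
  pass_builder.OptLevel = 3;
  pass_builder.populateModulePassManager(passes);
  passes.run(*module);

  // Step 5: JIT-compile to native code and hand back a function pointer.
  llvm::InitializeNativeTarget();
  llvm::InitializeNativeTargetAsmPrinter();
  llvm::ExecutionEngine* engine =
      llvm::EngineBuilder(std::move(module)).create();
  engine->finalizeObject();
  return reinterpret_cast<AddColsFn>(engine->getFunctionAddress("add_cols"));
}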

Example & Results

The most common question we get when talking about code generation is: how much does it actually speed things up?

Here’s an example running the TPCH-Q1 query on a test cluster of 10 data nodes, each with 10 disks, 48GB RAM, and 8 cores (16 hyperthreads). The query is:

select
  l_returnflag,
  l_linestatus,
  sum(l_quantity),
  sum(l_extendedprice),
  sum(l_extendedprice * (1 - l_discount)),
  sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
  avg(l_quantity),
  avg(l_extendedprice),
  avg(l_discount),
  count(1)
from
  tpch.lineitem
where
  l_shipdate <= '1998-09-02'
group by
  l_returnflag,
  l_linestatus

Details of how Impala executes queries will be provided in a future blog post. As a quick overview, Impala processes batches of tuples through an operator tree. In this case, there are two operators: a scan, which reads the input from disk, and a hash aggregation, which computes the sums and averages. 

We’ll focus on the aggregation step. For hash aggregation, we iterate through the tuples in the batch, evaluate and hash the grouping columns (l_returnflag and l_linestatus), do a hash table lookup, and then evaluate the aggregation expressions (the sums, avgs, and count in the select list). For the aggregation operator, the code generation phase compiles all the logic for evaluating batches of rows into a single, fully inlined loop.
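
For intuition, here is a hypothetical C++ sketch of the shape of that inlined loop for this query; the Row and Agg types, the string grouping key, and std::unordered_map are all illustrative stand-ins, not Impala’s actual data structures:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative tuple layout for the lineitem columns this query touches.
struct Row {
  std::string l_returnflag, l_linestatus;
  double l_quantity, l_extendedprice, l_discount, l_tax;
};

// Running aggregates per group; averages are derived as sum / count at the end.
struct Agg {
  double sum_qty = 0, sum_price = 0, sum_disc_price = 0, sum_charge = 0,
         sum_disc = 0;
  int64_t count = 0;
};

// The shape of the inner loop after codegen: one pass over the batch with the
// grouping-key hash, hash table probe, and all aggregate updates inlined.
void ProcessBatch(const std::vector<Row>& batch,
                  std::unordered_map<std::string, Agg>& hash_table) {
  for (const Row& row : batch) {
    Agg& agg = hash_table[row.l_returnflag + "|" + row.l_linestatus];
    double disc_price = row.l_extendedprice * (1 - row.l_discount);
    agg.sum_qty += row.l_quantity;
    agg.sum_price += row.l_extendedprice;
    agg.sum_disc_price += disc_price;
    agg.sum_charge += disc_price * (1 + row.l_tax);
    agg.sum_disc += row.l_discount;
    ++agg.count;
  }
}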

We’ll run this query on two differently sized datasets; the first is at the 1TB scale factor and the second is at the 100GB scale factor. The files are stored as sequence files with Snappy block compression. The 100GB dataset is small enough to fit easily in the cluster’s OS buffer cache, which prevents the disks from being a possible bottleneck.

For both datasets, enabling codegen improves the query time by just under 3x. The total code generation time for this query is about 150ms. (Code generation can be toggled on and off through the query options so you can try the same experiment. To see the list of query options, just type ‘set’ in the impala shell.)

To further quantify the benefit of codegen, we can compare some more detailed metrics. In this case, the query is run on a single machine over a much smaller dataset (700MB) under perf stat, a Linux perf tool that gathers hardware counters with minimal setup. The results are the totals from running the query five times.

Codegen on?   Time    Instructions       Branches          Branch Misses   Miss %
Yes           0.63s   52,605,701,380     9,050,446,359     145,461,106     1.607
No            1.7s    102,345,521,322    17,131,519,396    370,150,103     2.161

As you can see, without codegen we execute about twice as many instructions and incur over twice as many branch misses.

Conclusion

The time we’ve invested in code generation has already paid dividends, and we expect the benefits to grow as we continue to improve the query engine. With a columnar file format, more efficient encodings, and larger (memory) caches, we expect IO performance to improve dramatically, making CPU efficiency more and more important.

Code generation is most beneficial for queries that execute simple expressions, where the interpretation overhead is proportionally most pronounced. For example, a query doing a regular expression match over each row will not benefit much from code generation, because the interpretation overhead is low compared to the regex processing time. We expect such cases to be much less common, and our initial users have confirmed this.

There are still code paths in the current version of Impala (0.5) that are not yet code generated; we simply have not had the time to do so. Many of these code paths will be finished for our upcoming GA release, and we have more ideas on how to take better advantage of code generation after GA.

Nong Li is a Software Engineer on the Impala team.

5 Responses
  • Igor / February 21, 2013 / 3:03 PM

    Pretty impressive speedup.
    How many rows were in lineitem for the 1TB dataset – was that ~6 billion (total, i.e. 600M/node) or ….? And what was the CPU used?

  • Ashwin Jayaprakash / February 26, 2013 / 10:34 PM

    Just curious, why C++/LLVM? If you had generated JVM bytecode dynamically from the query planner, would the speedup have been significantly less than the equivalent LLVM-generated native code?

  • Gary Trakhman / April 23, 2013 / 2:05 PM

    What Ashwin said, I’m curious what tradeoffs were considered for such a departure from the JVM.

  • Henrik Behrens / November 11, 2013 / 3:52 AM

    This explains why what I did worked well: I bought a cluster based on cheap, energy-saving CPUs (1 Celeron 847 per node) and fast IO (1 Samsung SSD per node). This way I have an excellent cost-performance ratio using Impala:
    $360 hardware cost per node
    250 GB/s scan performance per node on CSV external tables
    35 W power consumption per node ($8 per month)
    During query execution, the system resources are used evenly (80% CPU utilization, 80% SSD utilization), so it is a balanced configuration for Impala. For the lineitem table, scan performance is 2.3M rows per second per node in CSV format (Parquet: up to 40M rows per second per node for count(*)).
    I expect that it will also work for more expensive nodes with several SSDs and mid-size CPUs per node.

    Did anybody else try Impala on SSD based systems?
