How-to: Get Started Writing Impala UDFs

Cloudera provides docs and a sample build environment to help you get started writing your own Impala UDFs.

User-defined functions (UDFs) let you code your own application logic for processing column values during a Cloudera Impala query. For example, a UDF could perform calculations using an external math library, combine several column values into one, do geospatial calculations, or apply other kinds of tests and transformations that are outside the scope of the built-in SQL operators and functions.

You can use UDFs to simplify query logic when producing reports, or to transform data in flexible ways when copying from one table to another with the INSERT ... SELECT syntax.

Since release 1.2.0, Impala has supported UDFs written in C++. Although existing Apache Hive UDFs written in Java are supported as well, Cloudera recommends using C++ UDFs because the compiled native code can yield higher performance, as illustrated in the chart below (running on a single core; see sample UDF here):

 

In summary: Native Impala UDFs execute 10x faster than Hive UDFs when run in Impala (resulting in significantly faster queries), and Hive UDFs run faster in Impala than they do in Hive.

Impala can run scalar UDFs that return a single value for each row of the result set, and user-defined aggregate functions (UDAFs) that return a value based on a set of rows. Currently, Impala does not support user-defined table functions (UDTFs) or window functions (although this support is on the roadmap).
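
To make the scalar/aggregate distinction concrete, here is a minimal sketch. The value types are simplified stand-ins written for this example (they are not copied from the Impala headers), and the function names AddOne, CountInit, CountUpdate, CountMerge, CountFinalize, and CountNonNull are hypothetical. The init/update/merge/finalize split is an assumption about the general shape of an aggregate function; check the sample code in the build environment for the exact signatures Impala expects.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-ins (assumptions for this sketch, not the real Impala
// headers), defined here so the example compiles on its own.
struct FunctionContext {};

struct IntVal {
  bool is_null;
  int32_t val;
  IntVal(int32_t v = 0) : is_null(false), val(v) {}
  static IntVal null() {
    IntVal result;
    result.is_null = true;
    return result;
  }
};

struct BigIntVal {
  bool is_null;
  int64_t val;
  BigIntVal(int64_t v = 0) : is_null(false), val(v) {}
};

// Scalar UDF: called once per row, returns one value for that row.
IntVal AddOne(FunctionContext* /*context*/, const IntVal& x) {
  if (x.is_null) return IntVal::null();
  return IntVal(x.val + 1);
}

// Aggregate function: folds a set of rows into a single value.
// Here the result is the number of non-NULL inputs.
void CountInit(FunctionContext* /*context*/, BigIntVal* state) {
  state->is_null = false;
  state->val = 0;
}

void CountUpdate(FunctionContext* /*context*/, const IntVal& input,
                 BigIntVal* state) {
  if (!input.is_null) ++state->val;
}

// Combines two partial results, e.g. from different parts of the data.
void CountMerge(FunctionContext* /*context*/, const BigIntVal& src,
                BigIntVal* dst) {
  dst->val += src.val;
}

BigIntVal CountFinalize(FunctionContext* /*context*/, const BigIntVal& state) {
  return state;
}

// Hypothetical driver that mimics what a query engine would do: aggregate
// two halves of the input separately, then merge the partial states.
int64_t CountNonNull(const std::vector<IntVal>& rows) {
  FunctionContext ctx;
  BigIntVal left, right;
  CountInit(&ctx, &left);
  CountInit(&ctx, &right);
  const std::size_t half = rows.size() / 2;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    CountUpdate(&ctx, rows[i], i < half ? &left : &right);
  }
  CountMerge(&ctx, right, &left);
  return CountFinalize(&ctx, left).val;
}
```

The scalar function has no memory between rows, while the aggregate carries a state value from row to row and only produces its answer at the end.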

In this post, you will learn how to get started writing your own UDFs in the current Impala release (1.2.3).

Sample Build Environment for UDFs

The Impala team has made a sample build environment available so that you can create your own UDFs with minimal work.

To develop UDFs for Impala, download and install the impala-udf-devel package containing header files, sample source, and build configuration files. Start at http://archive.cloudera.com/impala/ and locate the appropriate .repo or list file for your operating system version, such as the .repo file for RHEL 6. Use the familiar yum, zypper, or apt-get commands, depending on your operating system, with impala-udf-devel for the package name.

(Note: The UDF development code does not rely on Impala being installed on the same machine. You can write and compile UDFs on a minimal development system, then deploy them on a different one for use with Impala. If you develop UDFs on a server managed by Cloudera Manager through the parcel mechanism, you still install the UDF development kit through the package mechanism; this small standalone package does not interfere with the parcels containing the main Impala code.)

When you are ready to start writing your own UDFs, download the sample code and build scripts from the Cloudera sample UDF GitHub repository, and see Examples of Creating and Using UDFs for how to build and run UDFs.

To understand the layout and member variables and functions of the predefined UDF data types, examine the header file /usr/include/impala_udf/udf.h:
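
As a rough guide to what you will find there, the sketch below shows the general pattern of a UDF value type: a NULL flag plus the underlying C++ value. These structs are simplified stand-ins written for this post so that the snippet compiles without the impala-udf-devel package; the real header declares more types and members, so treat udf.h itself as the authority. The ValueOr() helper is hypothetical and exists only to show the fields in use.

```cpp
#include <cstdint>

// Simplified stand-ins modeled on the pattern used by the UDF value types.
// The actual definitions live in /usr/include/impala_udf/udf.h.
struct AnyVal {
  bool is_null;  // true when the SQL value is NULL
  AnyVal(bool is_null = false) : is_null(is_null) {}
};

struct IntVal : public AnyVal {
  int32_t val;  // the payload; only meaningful when is_null is false
  IntVal(int32_t val = 0) : val(val) {}

  // Factory for a NULL value of this type.
  static IntVal null() {
    IntVal result;
    result.is_null = true;
    return result;
  }
};

// Hypothetical helper: read the payload, or a fallback if the value is NULL.
int32_t ValueOr(const IntVal& v, int32_t fallback) {
  return v.is_null ? fallback : v.val;
}
```

Because NULL is carried as a separate flag, a UDF should check is_null before reading val.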

 

For the basic declarations needed to write a scalar UDF, see the header file udf-sample.h within the sample build environment, which declares a simple function named AddUdf():
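
A header along these lines might look like the sketch below. It is not a copy of udf-sample.h: the include guard name is hypothetical, and FunctionContext and IntVal are minimal stand-ins so that the snippet compiles on its own (a real header would include impala_udf/udf.h instead). The part that matters is the shape of the declaration: a context pointer first, then one const reference per SQL argument.

```cpp
#ifndef UDF_SAMPLE_SKETCH_H  // hypothetical guard name
#define UDF_SAMPLE_SKETCH_H

#include <cstdint>
#include <type_traits>

// Stand-ins for types that would normally come from impala_udf/udf.h.
class FunctionContext;  // only used through a pointer here

struct IntVal {
  bool is_null;
  int32_t val;
  IntVal(int32_t v = 0) : is_null(false), val(v) {}
  static IntVal null() {
    IntVal result;
    result.is_null = true;
    return result;
  }
};

// Scalar UDF taking two INT arguments and returning an INT.
IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2);

// Compile-time check that the declaration has the intended signature.
typedef IntVal (*AddUdfSignature)(FunctionContext*, const IntVal&,
                                  const IntVal&);
static_assert(std::is_same<decltype(&AddUdf), AddUdfSignature>::value,
              "AddUdf must take (FunctionContext*, const IntVal&, const IntVal&)");

#endif  // UDF_SAMPLE_SKETCH_H
```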

 

For the C++ implementation of AddUdf(), see the source file udf-sample.cc within the sample build environment:
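
The implementation can be as short as the following sketch. As before, FunctionContext and IntVal are simplified stand-ins so that the code compiles without the development package, and the NULL handling (return NULL if either argument is NULL) is an assumption that follows the usual SQL convention rather than a quote from the shipped sample.

```cpp
#include <cstdint>

// Stand-ins for types that would normally come from impala_udf/udf.h.
struct FunctionContext {};

struct IntVal {
  bool is_null;
  int32_t val;
  IntVal(int32_t v = 0) : is_null(false), val(v) {}
  static IntVal null() {
    IntVal result;
    result.is_null = true;
    return result;
  }
};

// Adds two INT values. Called once per row of the query.
IntVal AddUdf(FunctionContext* context, const IntVal& arg1,
              const IntVal& arg2) {
  (void)context;  // unused in this simple function
  // Assumed convention: a NULL input produces a NULL result.
  if (arg1.is_null || arg2.is_null) return IntVal::null();
  return IntVal(arg1.val + arg2.val);
}
```

Note that the function never touches SQL or table data directly: Impala hands it one set of argument values at a time and collects the return value.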

 

Conclusion

Writing your own UDFs helps you customize an Impala deployment for your particular use case. Also, as you can see from the above, it’s easy to get started!

John Russell is a technical writer at Cloudera and the author of the free O’Reilly Media e-book, Cloudera Impala.

1 Response
  • Viacheslav Rodionov / January 27, 2014 / 2:07 AM

    Hi,

    thanks for the article, it’s very useful for me.

    Please fix your code visualization – “ampersand” character is represented as XML entity right now.

    Best regards,
    Viachesalv Rodionov
