New in Cloudera Enterprise 5.5: Support for Complex Types in Impala

Categories: Impala Parquet

The new support for complex types in Impala makes running analytic workloads considerably simpler.

Impala 2.3 (shipping starting in Cloudera Enterprise 5.5) contains support for querying complex types in Apache Parquet tables, specifically ARRAY, MAP, and STRUCTs. This capability enables users to query against naturally nested data sets without having to perform ETL to flatten them. This feature provides a few major benefits, including:

  • It removes additional ETL and data modeling work to flatten data sets.
  • It makes queries easier by maintaining the natural relationship between nested data elements.
  • It boosts performance by removing joins.

This post is a gentle introduction to querying Impala tables with complex types; we will have follow-up posts that will go into more depth. Our goal is give you a quick understanding of Impala’s design philosophy and querying capabilities in Impala 2.3. We will focus on the SQL syntax extensions by presenting a series of simple SQL examples.

The Impala team is excited about the first Impala release with complex types support, and we hope to get you excited, too!

Design Goals

The new SQL language extensions were designed with a few principles in mind:

  • The syntax should feel natural to the user, and should be a natural extension of SQL.
  • It should also allow the full expressiveness of SQL with complex types.
  • Common querying patterns can be expressed concisely.
  • Queries can be executed efficiently.

Consequently, we came up with the following main ideas and extensions:

  • Nested fields within a STRUCT are referenced via the familiar . notation.
  • The collection types ARRAY and MAP are referenced in the FROM clause, just like conventional tables. This exposes the nested data as columns that can be referenced as usual.
  • Subqueries (including inline views) can reference nested ARRAYs and MAPs from tables in enclosing SELECT blocks. This allows a SQL-in-SQL pattern where the data in nested collections can be filtered/joined/aggregated with the full expressiveness of SQL.

Onward to the examples!

Example Schema

The following schema models a hypothetical customer data warehouse that contains data that might have been assembled from various data sources. There is routine customer data like name, address, orders, and so on, but also data about website and call-center interactions, all in a single table. The schema presents a customer-centric view of the data with the intent of performing customer-centric analyses.

We have highlighted the complex types below; the column/field names should be more-or-less self-explanatory. Their meaning will become clear in the examples to follow.

Query A: Referencing STRUCT fields

Find all customer names and their zip in the automotive market segment in San Francisco.

The nested fields of the STRUCT-typed address column are accessed via . notation.

Query B: Accessing a Nested Collection

List all orders from 11/27/2015 with a total price over $1000.

The . notation in the FROM clause is used to expose the nested orders collection as a table that contains all orders of all customers. By referencing the nested orders collection in the FROM clause, we can use its fields anywhere a conventional column reference could appear. Notice that in this query we do not reference any of the top-level customer fields.

Query C: Accessing a Deeply Nested Collection in a Single Path

Compute the average item price over all items.

Here the FROM clause produces a table with all items of all orders of all customers. A single dotted path in the FROM clause can traverse any number of collections, flattening all of them.

Query D: Using Relative Table References

Compute the minimum and maximum duration of all successfully resolved support calls by market segment.

Unlike the previous examples, this query references top-level customer columns, as well as nested fields from the support_calls collection. For every table or nested collection we wish to use fields/columns from, we need to establish a table alias in the FROM clause. In this example, we have an alias c that exposes the top-level columns cid, name, etc. and another alias s that exposes the columns agent_id, order_date, duration_ms, and so on. You can think of the FROM clause as a join between all customers and all support calls on the implicit is-nested-in relationship. This implicit join condition is expressed by virtue of the c.support_calls reference that is relative to the alias c.

Query E: ANSI-92 Joins with Nested Collections

List all customers and their orders from a specific zip code, including customers that have no orders.

The LEFT OUTER JOIN flattens the nested orders collection while preserving those customers that have no orders. The orders columns are set to NULL for customers that have no orders. Notice that an ON-clause is not required here due to the implicit is-nested-in join condition.

Query F: Pseudocolumns of Arrays

List all phone numbers of a specific customer.

We use the pseudocolumn item to refer to the column exposed by c.phone_numbers because the element type of the phone_numbers array is an anonymous scalar. The pos pseudocolumn contains the ordinal position of item in the corresponding array.

Query G: Pseudocolumns of Maps

Count the number of distinct user agents that accessed a purchasing-related URL.

We use the pseudocolumn key to refer to the key portion of the web_visits map. Similarly, the value pseudocolumn can be used to access the value portion of the map (e.g., if the map’s value was an anonymous scalar type).

Query H: Correlated Table References in Inline Views

Count the number of pending orders and show their total value for every customer living in “Palo Alto”.

This example shows how a relative reference to a nested collection c.orders can be made inside an inline view. We say the reference is correlated because it references a table alias from an enclosing query block, akin to conventional correlated subqueries in SQL. One way to think about the meaning of this query is that the query inside v is evaluated for every customer c due to the correlated reference. The result of each v evaluation is then joined with the corresponding c row. Such inline views can contain arbitrary SQL for operating on nested collections. Note that this query does per-customer aggregation, but does not require a GROUP BY clause.

Query I: Correlated Table References in Subqueries

Count the number of customers who have no orders but at least one web visit, grouped by market segment.

This example shows two EXISTS subqueries that have correlated table references to filter a customer based on evaluating some SQL over the nested orders and web_visits collections of that particular customer. Note that in a flat schema, this query would require explicit and potentially expensive distributed joins between customers, orders, and web_visits.

Query J: Deeply Nested Subqueries

List all customers that have at least five orders that were ordered on a date where the customer also had at least one support call and web interaction.

The purpose of this final example is to show a more complicated query with multiple nested subqueries that contain correlated table references. Subqueries can be arbitrarily nested just as in the conventional “flat” SQL.

Useful Utility Commands

When working with data that has complex types, you might find the following utility commands useful.

  • CREATE TABLE <tbl> LIKE PARQUET 'path_to_file'. See documentation.
  • DESCRIBE <path>. See ARRAY type and MAP type.

What’s Next

While the core built-in language extensions are available in CDH 5.5 today, we are continuing to improve the complex types feature on several dimensions, and have the following items on the roadmap for future releases:

  • Support for Apache Avro
  • Support for JSON files
  • Syntactic sugar for even more concise aggregates over collections
  • INSERT into tables with complex types
  • Builtin functions and user-defined functions that return and/or operate on complex types
  • Performance improvements

This post was intended to be a short introduction and reference, so we inevitably omitted many interesting details. For more information and examples, try the following resources:

Alex Behm is a Software Engineer at Cloudera, working on the Impala team.

facebooktwittergoogle_pluslinkedinmailfacebooktwittergoogle_pluslinkedinmail

10 responses on “New in Cloudera Enterprise 5.5: Support for Complex Types in Impala

    1. Vinod

      How can i access a particular element with in a map, in my select clause. IF I want to select the map elements corresponding to a certain key value in select, how would I do that. I am not talking about having a filter like mapName.key =”filter”
      More like how hive lets you access the elements in the select by the using the square brackets, mapcolumnName[key]

      Thanks

  1. Alex

    INSERT is on future work. Does that mean that for the moment I cannot use Impala to create a table with complex types out of existing “normal” tables? So the only way to use complex types is to already have a Parquet file with the corresponding structure that has been created somehow else?

  2. Alex Behm

    Alex, you can CREATE/ALTER tables with complex types in Impala as usual. For populating the tables with data (INSERT or CREATE TABLE AS SELECT), you can use conventional tools from the Hadoop ecosystem such as Hive/Kite/MR/Spark/Pig/etc. For this first release with complex types support, we decided to focus on the querying portion to deliver this feature as quickly as possible.

    I’d recommend looking into Hive’s collect_list() + group by, for generating nested data from existing flat data.

  3. Michael

    Alex, Excellent article. Thank you. I have created your table in a CDH 5.5 VM for testing. Question: Is there a location where you’ve placed the test data used in your examples?

  4. Subur

    Alex

    Excellent article. Can’t wait to explore cost benefits from a teradata / netezza implementation to switch to this. U know we r greedy – advanced analytics will be next as a request;-). Good intro.

  5. Benjamin Greve

    Thanks for the article, very helpful. I have not yet tried nested types with Hive or Impala, but it sounds interesting.
    Are complex types visible in the HDFS folder structure, e.g. like partitions?
    If I have a single large de-normalized table, e.g. website tracking data with partitions by day, should I use complex types anyway (despite there being no joins involved) to make the data more structured? Would this still have an impact on performance?

  6. mansson

    I am very surprised that Impala cannot provide result from a query in JSON. Isn’t that the obvious way to deliver a complex data structure? My database has about five levels of arrays of structs, and I am construcing a query to get the result in a way that I can treat as a Javascript object. The sheer amount of subqueries, concats, group concats and quote signs required for that drives my crazy.
    The object is in there, I just want to get it out!

Leave a Reply

Your email address will not be published. Required fields are marked *