The new support for complex types in Impala makes running analytic workloads considerably simpler.
Impala 2.3 (shipping starting in Cloudera Enterprise 5.5) contains support for querying complex types in Apache Parquet tables, specifically
STRUCTs. This capability enables users to query against naturally nested data sets without having to perform ETL to flatten them. This feature provides a few major benefits, including:
- It removes additional ETL and data modeling work to flatten data sets.
- It makes queries easier by maintaining the natural relationship between nested data elements.
- It boosts performance by removing joins.
This post is a gentle introduction to querying Impala tables with complex types; we will have follow-up posts that will go into more depth. Our goal is give you a quick understanding of Impala’s design philosophy and querying capabilities in Impala 2.3. We will focus on the SQL syntax extensions by presenting a series of simple SQL examples.
The Impala team is excited about the first Impala release with complex types support, and we hope to get you excited, too!
The new SQL language extensions were designed with a few principles in mind:
- The syntax should feel natural to the user, and should be a natural extension of SQL.
- It should also allow the full expressiveness of SQL with complex types.
- Common querying patterns can be expressed concisely.
- Queries can be executed efficiently.
Consequently, we came up with the following main ideas and extensions:
- Nested fields within a
STRUCTare referenced via the familiar
- The collection types
MAPare referenced in the
FROMclause, just like conventional tables. This exposes the nested data as columns that can be referenced as usual.
- Subqueries (including inline views) can reference nested
MAPs from tables in enclosing
SELECTblocks. This allows a SQL-in-SQL pattern where the data in nested collections can be filtered/joined/aggregated with the full expressiveness of SQL.
Onward to the examples!
The following schema models a hypothetical customer data warehouse that contains data that might have been assembled from various data sources. There is routine customer data like name, address, orders, and so on, but also data about website and call-center interactions, all in a single table. The schema presents a customer-centric view of the data with the intent of performing customer-centric analyses.
We have highlighted the complex types below; the column/field names should be more-or-less self-explanatory. Their meaning will become clear in the examples to follow.
CREATE TABLE customers (
/* Keyed on web URL */
STORED AS PARQUET;
Query A: Referencing STRUCT fields
Find all customer names and their zip in the automotive market segment in San Francisco.
SELECT name, address.zip FROM customers
WHERE mktsegment = "AUTOMOTIVE" AND address.city = "SAN FRANCISCO"
The nested fields of the
STRUCT-typed address column are accessed via
Query B: Accessing a Nested Collection
List all orders from 11/27/2015 with a total price over $1000.
SELECT oid, status, totalprice, order_date FROM customers.orders
WHERE order_date = "11-27-2015" and totalprice > 1000
. notation in the
FROM clause is used to expose the nested
orders collection as a table that contains all orders of all customers. By referencing the nested
orders collection in the
FROM clause, we can use its fields anywhere a conventional column reference could appear. Notice that in this query we do not reference any of the top-level customer fields.
Query C: Accessing a Deeply Nested Collection in a Single Path
Compute the average item price over all items.
SELECT AVG(price) FROM customers.orders.items
FROM clause produces a table with all items of all orders of all customers. A single dotted path in the
FROM clause can traverse any number of collections, flattening all of them.
Query D: Using Relative Table References
Compute the minimum and maximum duration of all successfully resolved support calls by market segment.
SELECT MIN(duration_ms), MAX(duration_ms) FROM customers c, c.support_calls s
WHERE s.issue_resolved = true
GROUP BY c.mktsegment
Unlike the previous examples, this query references top-level customer columns, as well as nested fields from the
support_calls collection. For every table or nested collection we wish to use fields/columns from, we need to establish a table alias in the
FROM clause. In this example, we have an alias
c that exposes the top-level columns
name, etc. and another alias
s that exposes the columns
duration_ms, and so on. You can think of the
FROM clause as a join between all customers and all support calls on the implicit is-nested-in relationship. This implicit join condition is expressed by virtue of the
c.support_calls reference that is relative to the alias
Query E: ANSI-92 Joins with Nested Collections
List all customers and their orders from a specific zip code, including customers that have no orders.
SELECT cid, name, oid, status, totalprice, order_date
FROM customers c LEFT OUTER JOIN c.orders
WHERE address.zip = 92309
LEFT OUTER JOIN flattens the nested orders collection while preserving those customers that have no orders. The orders columns are set to
NULL for customers that have no orders. Notice that an
ON-clause is not required here due to the implicit is-nested-in join condition.
Query F: Pseudocolumns of Arrays
List all phone numbers of a specific customer.
SELECT c.cid, c.name, p.item, p.pos FROM customers c, c.phone_numbers p
WHERE c.cid = 12345
We use the pseudocolumn
item to refer to the column exposed by
c.phone_numbers because the element type of the
phone_numbers array is an anonymous scalar. The
pos pseudocolumn contains the ordinal position of
item in the corresponding array.
Query G: Pseudocolumns of Maps
Count the number of distinct user agents that accessed a purchasing-related URL.
SELECT COUNT(DISTINCT w.user_agent) FROM customers.web_visits w
WHERE w.key LIKE "%purchase%"
We use the pseudocolumn
key to refer to the key portion of the
web_visits map. Similarly, the
value pseudocolumn can be used to access the value portion of the map (e.g., if the map’s value was an anonymous scalar type).
Query H: Correlated Table References in Inline Views
Count the number of pending orders and show their total value for every customer living in “Palo Alto”.
SELECT cid, name, order_cnt, price_sum
FROM customers c,
(SELECT COUNT(*) order_cnt, SUM(totalprice) price_sum FROM c.orders
WHERE status = "PENDING") v
WHERE address.city = "PALO ALTO"
This example shows how a relative reference to a nested collection
c.orders can be made inside an inline view. We say the reference is correlated because it references a table alias from an enclosing query block, akin to conventional correlated subqueries in SQL. One way to think about the meaning of this query is that the query inside
v is evaluated for every customer
c due to the correlated reference. The result of each
v evaluation is then joined with the corresponding
c row. Such inline views can contain arbitrary SQL for operating on nested collections. Note that this query does per-customer aggregation, but does not require a
GROUP BY clause.
Query I: Correlated Table References in Subqueries
Count the number of customers who have no orders but at least one web visit, grouped by market segment.
SELECT mktsegment, COUNT(*)
FROM customers c
WHERE NOT EXISTS (SELECT oid FROM c.orders)
AND EXISTS (SELECT key from c.web_visits)
GROUP BY mktsegment
This example shows two
EXISTS subqueries that have correlated table references to filter a customer based on evaluating some SQL over the nested
web_visits collections of that particular customer. Note that in a flat schema, this query would require explicit and potentially expensive distributed joins between
Query J: Deeply Nested Subqueries
List all customers that have at least five orders that were ordered on a date where the customer also had at least one support call and web interaction.
SELECT cid, name, cnt
FROM customers c,
(SELECT COUNT(oid) cnt FROM c.orders o
WHERE order_date IN (SELECT visit_date FROM c.web_visits)
AND order_date IN (SELECT call_date FROM c.support_calls)
HAVING cnt >= 5
The purpose of this final example is to show a more complicated query with multiple nested subqueries that contain correlated table references. Subqueries can be arbitrarily nested just as in the conventional “flat” SQL.
Useful Utility Commands
When working with data that has complex types, you might find the following utility commands useful.
CREATE TABLE <tbl> LIKE PARQUET 'path_to_file'. See documentation.
DESCRIBE <path>. See
While the core built-in language extensions are available in CDH 5.5 today, we are continuing to improve the complex types feature on several dimensions, and have the following items on the roadmap for future releases:
- Support for Apache Avro
- Support for JSON files
- Syntactic sugar for even more concise aggregates over collections
INSERTinto tables with complex types
- Builtin functions and user-defined functions that return and/or operate on complex types
- Performance improvements
This post was intended to be a short introduction and reference, so we inevitably omitted many interesting details. For more information and examples, try the following resources:
- Impala Documentation (Complex Types)
- Presentation: “Nested Types in Impala“
- Presentation: “Data Modeling for Data Science: Simplify Your Workload with Complex Types in Impala“
Alex Behm is a Software Engineer at Cloudera, working on the Impala team.