One User’s Impala Experience at Data Hacking Day

Categories: Data Science Hive Impala

The following guest post comes to you from Alan Gardner of remote database services and consulting company Pythian, who participated in Data Hacking Day (and was on the winning team!) at Cloudera’s offices in February.

Last Feb. 25, just prior to attending Strata, Alex Gorbachev (our CTO) and I had the chance to visit Cloudera’s Palo Alto offices for Data Hacking Day. The goal of the event was to produce something cool that leverages Cloudera Impala – the new open source, low-latency platform for querying data in Apache Hadoop.

Our hosts helpfully suggested some datasets, including the DEBS 2013 Grand Challenge data. This dataset contains the position of all the players and ball during a football match; our project was to map the data for a given span of time and player onto a map of the field, to create a heatmap of how much time that player spent at different positions.

The Data

The full-game CSV describes the position of all the players and the ball on the field, in the following format:

sid, ts, x, y, z, |v|, |a|, vx, vy, vz, ax, ay, az

sid uniquely identifies the sensor: two are attached to each player (one per leg), four are attached to the ball, and others are attached to the goalies’ arms and the referees. ts is the timestamp in picoseconds, and (x,y,z) is the position of the ball in space. The dataset also provides the acceleration a, and velocity v in each axis, and their absolute magnitude – we don’t use these values in our application. In all, the Hive statement to create the table is:


DEB doesn’t provide an easy, machine-readable mapping of the sensors to the players, so we wrote our own, called sensors.csv.  The definition of the sensors table is:


By joining the sensors table in, we can do ad hoc queries about specific players’ positions, rather than working with sensor ids.


The web component is written in Ruby, and leverages Colin Marc’s impala-ruby gem, which he’s blogged about here. Colin’s gem provides a simple interface to run SQL queries on Impala; combined with Sinatra, the server-side came together very quickly. The main query we use to populate the heatmap is:


This query splits the field into 20-by-20 grid and counts the number of sensor readings which land within a given square – since the readings are at regular intervals, this is proportional to how much time the player spent in that area. We have to approximate player position like this to reduce the load on the web server and client system – a straight SELECT would produce about 50MB of data per query, versus the 10KB the aggregation uses. The multiplication and addition roughly map the x and y values provided by DEBS onto the football field image in the web UI. 


At the event we ran the web interface against the single-node demo VM image Cloudera provides to test Impala. Running on a Macbook Pro with 8GB of RAM, Impala was consistently capable of producing 1-to-2 second response times, which was fast enough that we could iterate on queries and quickly test them in the console. Once we got home, we loaded the same data into Pythian’s in-house Hadoop cluster, to better assess performance on a “real-world” system.

Querying on the console, the benefits of not spinning up a JVM are appreciable: a typical query to aggregate all the sensor data ran in about 1 second in Impala, versus 90 seconds in Hive. In the web interface we saw similar performance: the entire request – including the web server and Beeswax – takes about 1.5 seconds on the first SELECT for a player, and about 800ms on subsequent queries over the same data.

It’s worth noting that partitioning the table produced a meaningful performance boost, since we almost always scan all rows for a single sid – we gained about 100ms on average, or about a 10% boost. The queries to set up the partitioned table are included on the Github page (see below).

Get The Code

All of the code, including the CSV of player-sensor mappings, is available on Github. Feel free to direct any feedback to