As a follow-up to a previous post about the Impala demo he built during Data Hacking Day, Alan Gardner from Pythian has deployed the app for a limited time on Amazon EC2. We republish his original post below.
A little while ago I blogged about (and open sourced) a Cloudera Impala-powered soccer visualization demo, designed to demonstrate just how responsive Impala queries can be. Since not everyone has the time or resources to run the project themselves, we’ve decided to host it ourselves on an EC2 instance. [Note: instance live only for one week!] You can try the visualization; we’ve also opened up the Impala web interface, where you can see query profiles and performance numbers, and Hue (username and password are both ‘test’), where you can run your own queries on the dataset.
Deploying Impala on EC2
While there are many tools to deploy a Hadoop cluster on EC2 – like Apache Whirr, or even Cloudera Manager – I only wanted to use a single instance for the entire cluster. Starting from the base Ubuntu (Precise) image, I added Cloudera’s apt repos, and installed the single node configuration. Impala doesn’t support using Derby for the Hive metastore, so I installed MySQL and configured Hive to use it instead. Then I installed Impala using Cloudera’s instructions. Impala, and all of the Hadoop daemons, are running comfortably on one M3 2XLarge EC2 instance. Given our modest demands, this may actually be overkill; I over-spec’ed the server trying to find a (now-obvious) performance problem involving short-circuit reads.
On the Pythian cluster, we could consistently return a query in around half a second. On EC2 queries took closer to 5 seconds. A bit of investigation showed that in getting the server up and running, I had disabled short-circuit reads, which slows down Impala considerably. While Impala isn’t supposed to start without short-circuit reads, it only throws an error if you have short-circuit reads enabled but misconfigured. If short-circuits are off in the hdfs-site configuration, it will happily start and run very slowly. With the default DEB install, the libhadoop library isn’t installed to the LD_PATH on Ubuntu, which prevents short-circuit read from working. The easiest solution was to create symlinks for libhadoop to /usr/lib/, then run ldconfig:
ln -s /usr/lib/hadoop/lib/native/libhadoop.so /usr/lib/
ln -s /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0 /usr/lib/
To confirm whether your cluster has short-circuit reads enabled, you can visit the Impala web interface (by default, port 25000 on any system running impalad) and click on the ‘/varz’ tab. Search for ‘dfs.client.read.shortcircuit’ – it should be set to ‘true’.
With libhadoop installed and short-circuit reads enabled, the next greatest performance improvement came from partitioning the table on the sensor id. Since all of our web interface queries filter by sensor id, Impala can perform some serious partition elimination: looking at the query profiles, partitioning the table reduced the amount of data read from HDFS from 4GB to 50MB, and the query time from 2.6s to 130ms. The README on Github has instructions on how to use dynamic partitioning in Hive to quickly partition the soccer data; these steps can be generalized to any dataset.