From Zero to Impala in Minutes

This post was originally published by U.C. Berkeley AMPLab developer (and former Clouderan) Matt Massie on his personal blog. Matt has graciously permitted us to re-publish it here for your convenience.

Note: The post below is valid for Impala version 0.6 only and is not being maintained for subsequent releases. To deploy Impala 0.7 and later using a much easier (and also free) method, use this how-to.

Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or Apache HBase.

This post will explain how to use Apache Whirr to bring up a Cloudera Impala multi-node cluster on EC2 in minutes. When the installation script finishes, you’ll be able to query the sample data in Impala immediately, with no further setup. The script also configures Impala for performance (e.g. by enabling direct reads). And since Amazon Elastic Compute Cloud (Amazon EC2) provides resizable compute capacity, you can easily choose whatever size Impala cluster you want.

In addition, your Impala cluster will automatically be set up with Ganglia: a lightweight and scalable metric-collection framework that provides a powerful web UI for analyzing trends in cluster and application performance.

The installation scripts represent a day of work, so I’m sure there are ways they can be improved. Please feel free to comment at the end of the post if you have any ideas (or issues). These scripts could also easily be used as the basis for a proper Whirr service if someone had the time.

If you’re planning to deploy Impala in production, I highly recommend that you use Cloudera Manager.

Installing Whirr

If you haven’t already installed Apache Whirr, download and install using the following instructions. If you already have Whirr 0.8.1 installed, feel free to skip ahead.

Note: I like to install things in /workspace on my machine but you can install Whirr anywhere you like of course.

$ cd /workspace
$ wget
$ gunzip < whirr-0.8.1.tar.gz | tar -xvf -
$ cd whirr-0.8.1
$ mkdir ~/.whirr
$ cp conf/credentials.sample ~/.whirr/credentials


Add the following line to your .bashrc replacing /workspace with the path you installed Whirr into.



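Assuming you unpacked Whirr into /workspace as above, the line to add would look something like this (adjust the path to match your installation):

```shell
# Append Whirr's bin directory to the shell's search path
# (adjust /workspace if you installed elsewhere)
export PATH=$PATH:/workspace/whirr-0.8.1/bin
```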
Once you’ve edited your .bashrc, source it and check that whirr is in your path.

$ . ~/.bashrc
$ whirr version
Apache Whirr 0.8.1
jclouds 1.5.1


Edit your ~/.whirr/credentials file (created above) to set EC2 (aws-ec2) as your cloud provider and add your AWS identity and credential, e.g.

IDENTITY=[Put your AWS Access Key ID here]
CREDENTIAL=[Put your AWS Secret Access Key here]
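Putting it together, a complete ~/.whirr/credentials file for EC2 would look something like this (the PROVIDER key follows Whirr's credentials.sample; the identity and credential values below are placeholders, not real keys):

```shell
# ~/.whirr/credentials
PROVIDER=aws-ec2
IDENTITY=AKIAIOSFODNN7EXAMPLE                       # your AWS Access Key ID
CREDENTIAL=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY # your AWS Secret Access Key
```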


For the last step, you need to create an SSH RSA (not DSA!) keypair. You’ll use this keypair whenever you launch a cluster on EC2 using Whirr (more on that soon).

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr


Note that this keypair has nothing to do with your AWS keypair that is generated in the AWS Management Console or by running ec2-add-keypair.

Preparing for Impala Installation

You will need three files to install Impala: a Whirr recipe file, an installer script, and a per-node setup script.

The recipe file will be passed to Whirr as a recipe for creating your cluster. This cluster will be built to satisfy Impala’s requirements, e.g. CentOS 6.2, CDH, etc.

The installer script will use the information provided by Whirr about your cluster to scp the setup script to each machine and run it over ssh. It will pass along the address of the machine to house the Hive metadata store, as well as a randomly generated password for the ‘hive’ user.

The setup script does the actual installation on each machine in your cluster, completely configuring Impala and Hive for optimal performance. Once it completes, you will immediately be able to run queries against Impala (and Hive).

Let’s go through each of these files in detail.

The Impala installation guide lists the following requirements:

  • Red Hat Enterprise Linux (RHEL)/CentOS 6.2 (64-bit)
  • CDH 4.2.0 or later
  • Hive
  • MySQL
  • Sufficient memory to handle join operations

The RightImage CentOS_6.2_x64 v5.8.8 EBS image (ami-51c3e614) will satisfy the CentOS 6.2 requirement and Whirr will do all the work to install CDH 4.2.x on your cluster. The installation scripts provided in this post will handle setting up Hive, MySQL and Impala.

Here is the Impala-ready Apache Whirr recipe to use as a starting point for your deployment:


# The private key you created during the Whirr installation above (you'll need to change this path)
# The public key you created during the Whirr installation above (you'll need to change this path)
# The size of EC2 instances to run. Keep in mind that some
# joins can require quite a bit of memory. We'll use the m2.xlarge (High-Memory Extra Large Instance) for extra memory.
# You can use any size instance you like here (except micro).
# You can modify the number of machines in the cluster here. The first machine type is the master and the second
# machine type is for the workers. To change your cluster size, change the number of workers in the cluster.
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+ganglia-metad,5 hadoop-datanode+hadoop-tasktracker+ganglia-monitor


# This name will be used by Amazon to create the security group name. Use any string you like.
# Impala should not be run as root since root is not allowed to do direct reads
# The RightImage CentOS 6.2 x64 image
# The following two lines will cause Whirr to install CDH instead of Apache Hadoop


The recipe file passes these properties to Whirr when you launch your cluster. You should edit the properties at the top of the file to match your environment and desired cluster characteristics (e.g. RSA key, cluster size, and EC2 instance type).

Do not edit the properties at the bottom of the file; doing so could break the installer.

If you want to learn more about these Whirr options, take a look at the Whirr Configuration Guide. There is also a recipes directory inside the Whirr distribution with example recipes.
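For example, growing or shrinking the cluster means editing only the worker count in the whirr.instance-templates property; a 10-worker variant of the recipe above would read (sketch; all other properties unchanged):

```
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+ganglia-metad,10 hadoop-datanode+hadoop-tasktracker+ganglia-monitor
```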

The installer script orchestrates the installation using the Whirr deployment information that is generated by the launch-cluster command; this information is found in the directory ~/.whirr/myimpalacluster. Here is the script:


# Please provide the path to the RSA private key you
# created as part of the Whirr installation


# Generate a random password to secure the 'root' and 'hive' mysql users
RANDOM_PASSWORD=$(dd count=1 bs=16 if=/dev/urandom of=/dev/stdout 2>/dev/null | base64)

# Use Whirr to bring up the CDH cluster
whirr launch-cluster --config

# Fetch the list of workers from the Whirr deployment
WORKER_NODES=$(egrep -v 'hadoop-namenode|hadoop-jobtracker|ganglia-metad' \
                    $WHIRR_INSTANCES | awk '{print $3}')

# Install the Hive metastore on the first worker node
# Hive box internal IP
HIVE_MYSQL_BOX_INTERNAL=$(head -1 $WHIRR_INSTANCES | awk '{print $4}')
# Hive box external IP
HIVE_MYSQL_BOX_EXTERNAL=$(head -1 $WHIRR_INSTANCES | awk '{print $3}')

# Copy the impala setup script to every machine in the cluster and run it
SSH_OPTS=" -i $RSA_PRIVATE_KEY -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no "
        # Run the script in the background so the installation is in parallel
            sudo bash /tmp/$SETUP_IMPALA_SCRIPT $HIVE_MYSQL_BOX_INTERNAL $RANDOM_PASSWORD > /tmp/impala-install.log 2>&1 &

echo "Waiting for the installation scripts to finish on all the nodes. This will take about a minute per node in the cluster."

echo "The password for your root and Hive account on the MySQL box is $RANDOM_PASSWORD"
echo "Please save this password somewhere safe."


You will likely need to change the RSA_PRIVATE_KEY specified at the top of the script; otherwise, you should not need to modify anything else in this file.

This script generates a random password for the Hive metastore user, launches a cluster using the recipe file, and then uses the Whirr deployment information to copy the setup script to every worker in the cluster and run it.

Note that, for performance, the installer runs the ssh calls in parallel and waits for them to complete. Timing the installer run, I’ve found that it takes about a minute per machine to bring up a cluster; an 11-node cluster (1 master, 10 workers), for example, takes:

real 11m41.172s
user  0m22.061s
sys   0m2.402s
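The fan-out itself is plain bash job control: one backgrounded command per node, then a single wait. Here is a minimal, self-contained sketch of the pattern; do_install and the worker names are stand-ins for the real scp/ssh invocation and node list, not part of the actual installer:

```shell
#!/bin/bash
# Each per-node result is appended to a scratch file so we can
# see that all nodes finished.
RESULTS=$(mktemp)

do_install() {
    # Stand-in for the real work: in the installer this would scp the
    # setup script to "$1" and run it over ssh.
    sleep 1                    # simulate a slow remote install
    echo "$1 done" >> "$RESULTS"
}

WORKERS="worker1 worker2 worker3"
for node in $WORKERS; do
    do_install "$node" &       # background: all nodes install in parallel
done
wait                           # block until every backgrounded job finishes
echo "installed on $(wc -l < "$RESULTS") nodes"
```

Because the per-node jobs run concurrently, total wall-clock time is roughly the slowest node rather than the sum of all nodes.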

This is the script that is run on each machine in your Impala cluster to install and configure Impala:


# IP address of the box with the Hive metastore
# Password to use for the hive user


function write_hive_site {
cat > $1 <<HIVESITE
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

# Some configuration only needs to be run on the box housing the hive metastore
/sbin/ifconfig -a | grep "addr:$HIVE_METASTORE_IP " > /dev/null && {

# Install all the necessary packages
yum install -y hive mysql mysql-server mysql-connector-java
# Start the mysql server
/etc/init.d/mysqld start
# Create the Hive metastore and hive user
/usr/bin/mysql -u root <<SQL
-- Create the metastore database
create DATABASE metastore;
-- Use the metastore database
use metastore;
-- Import the metastore schema from hive
SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;
-- Secure the root accounts with the hive password
update mysql.user set password = PASSWORD('$HIVE_PASSWORD') where user = 'root';
-- Create a user 'hive' with random password for localhost access
-- Grant privileges on the metastore to the 'hive' user on localhost
GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'localhost' WITH GRANT OPTION;
-- Create a user 'hive' with random password
-- Grant privileges on the metastore to the 'hive' user
-- Load the new privileges
SQL
# Write the hive-site to the Hive configuration directory
write_hive_site $HIVE_CONF_DIR/hive-site.xml
# Link the mysql connector to hive lib
ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib
# Load up a really basic tab-delimited table into hive for testing end-to-end functionality
cat > /tmp/numbers.txt <<TABLE
1 one
2 two
3 three
4 four
TABLE
sudo -E -u impala hadoop fs -mkdir /user/impala
sudo -E -u impala hadoop fs -put /tmp/numbers.txt /user/impala/numbers.txt
sudo -E -u impala hive -e "LOAD DATA INPATH '/user/impala/numbers.txt' into table numbers;"
} # /end hive metadata store specific commands

# Fetch the Cloudera yum repo file
(cd /etc/yum.repos.d/ && wget -N $IMPALA_REPO_FILE)
# Install the impala and impala-shell packages
yum -y install impala impala-shell impala-server impala-state-store

# Create the impala configuration directory
# Install the hive-site.xml into the Impala configuration directory
write_hive_site $IMPALA_CONF_DIR/hive-site.xml

# Copy the Hadoop core-site.xml into the Impala config directory
# Make sure to prepend some properties for performance
     <name>dfs.domain.socket.path</name>
     <value>/var/lib/hadoop-hdfs/socket._PORT</value>
     <name> </name>
     <value>false</value>

# Update the hdfs-site.xml file
cat > /tmp/$HDFS_SITE_XML <<'EOF'
grep -v "<configuration>" $HADOOP_CONF_DIR/$HDFS_SITE_XML >> /tmp/$HDFS_SITE_XML
# Copy the hdfs-site.xml file into the Impala config directory
# Copy the log4j properties from Hadoop to Impala

# Add Impala to the HDFS group
/usr/sbin/usermod -G hdfs impala

# Restart HDFS
/etc/init.d/hadoop-hdfs-datanode restart

# Start the impala services
# NOTE: It's important to run impala as a non-root user or performance will suffer (no direct reads)
sudo -E -u impala GLOG_v=1 nohup /usr/bin/impalad \
-state_store_host=$HIVE_METASTORE_IP -nn=$NN_HOST -nn_port=$NN_PORT \
-ipaddress=$(host $HOSTNAME | awk '{print $4}') \
< /dev/null > /tmp/impalad.out 2>&1 &


You do not need to edit this file.

This script is a bit long but I hope it’s easy to understand. It is passed two arguments: the IP address of the Hive metadata store and the password for the ‘hive’ user.

When run, this script will:

  1. Check if it’s running on the machine designated to be the Hive metadata store; if so, it will install and configure Hive and MySQL and drop in a very simple example table.
  2. Install the necessary Impala packages
  3. Configure Impala for read.shortcircuit, skip.checksum, local-path-access.user and data-locality tracking for performance
  4. Create an impala user
  5. Restart the datanode to pull in the modified configuration
  6. Start the statestored service
  7. Start impalad, passing in the -state_store_host (all impalad instances use the state store running on the Hive metadata store machine), -nn (NameNode) and -nn_port (NameNode port) arguments

Once you’ve modified your recipe and installer files, you’re ready to launch your Impala cluster.

Launching your Impala cluster

At this point, you should have a directory with your customized installation script and configuration file:

$ ls  


To launch your cluster, simply run the installer script.

% time bash ./
Running on provider aws-ec2 using identity ABCDEFGHIJKLMNOP
Bootstrapping cluster
Configuring template for bootstrap-hadoop-datanode_hadoop-tasktracker_ganglia-monitor
Configuring template for bootstrap-hadoop-namenode_hadoop-jobtracker_ganglia-metad
Starting 5 node(s) with roles [hadoop-datanode, hadoop-tasktracker, ganglia-monitor]
Starting 1 node(s) with roles [hadoop-namenode, hadoop-jobtracker, ganglia-metad]


When completes, you should see messages like the following:

Warning: Permanently added '' (RSA) to the list of known hosts.                                                                               100% 5675     5.5KB/s   00:00
Warning: Permanently added '' (RSA) to the list of known hosts.                                                                               100% 5675     5.5KB/s   00:00
Warning: Permanently added '' (RSA) to the list of known hosts.                                                                               100% 5675     5.5KB/s   00:00
Warning: Permanently added '' (RSA) to the list of known hosts.                                                                               100% 5675     5.5KB/s   00:00
Warning: Permanently added '' (RSA) to the list of known hosts.                                                                               100% 5675     5.5KB/s   00:00
Waiting for the installation scripts to finish on all the nodes. This will take about a minute per node in the cluster.
The password for your root and Hive account on the MySQL box is lBnn/HynCPcYNr/AUm5Hzg==
Please save this password somewhere safe.

real  6m42.620s
user  0m14.590s
sys   0m1.261s


At this point, your Impala Cluster is up and ready for work.

Using your Impala cluster

You can find your deployment details in the file ~/.whirr/myimpalacluster/instances, e.g.

$ cat ~/.whirr/myimpalacluster/instances
us-west-1/i-082ab151  hadoop-datanode,hadoop-tasktracker,ganglia-monitor
us-west-1/i-0a2ab153  hadoop-datanode,hadoop-tasktracker,ganglia-monitor
us-west-1/i-0c2ab155  hadoop-datanode,hadoop-tasktracker,ganglia-monitor
us-west-1/i-0e2ab157  hadoop-datanode,hadoop-tasktracker,ganglia-monitor
us-west-1/i-142ab14d  hadoop-datanode,hadoop-tasktracker,ganglia-monitor
us-west-1/i-162ab14f  hadoop-namenode,hadoop-jobtracker,ganglia-metad


The columns are, in order, the EC2 instance id, the Whirr service template, the EC2 public IP of the machine and the EC2 private address of the machine. The Hive metadata store is always installed on the first machine in the list (that is not a master running the namenode, jobtracker, etc).
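These columns are easy to slice with grep and awk, which is exactly what the installer does. A sketch of the pattern — note the instance lines below are fabricated samples in the same four-column layout, not real deployment output:

```shell
# Build a sample instances file in the documented layout:
# instance-id  roles  public-ip  private-ip
INSTANCES=$(mktemp)
cat > "$INSTANCES" <<'EOF'
us-west-1/i-082ab151  hadoop-datanode,hadoop-tasktracker,ganglia-monitor  203.0.113.11  10.0.0.11
us-west-1/i-162ab14f  hadoop-namenode,hadoop-jobtracker,ganglia-metad  203.0.113.10  10.0.0.10
EOF

# Public IP of the master (the namenode/ganglia-metad box)
MASTER_PUBLIC=$(grep hadoop-namenode "$INSTANCES" | awk '{print $3}')
# Public and private IPs of the first non-master node -- the box
# that receives the Hive metastore
HIVE_BOX_EXTERNAL=$(grep -v hadoop-namenode "$INSTANCES" | head -1 | awk '{print $3}')
HIVE_BOX_INTERNAL=$(grep -v hadoop-namenode "$INSTANCES" | head -1 | awk '{print $4}')
echo "master=$MASTER_PUBLIC hive=$HIVE_BOX_EXTERNAL/$HIVE_BOX_INTERNAL"
```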

To log into the Hive machine, use the public IP address of the first node: in this example.

$ ssh -i /Users/matt/.ssh/id_rsa_whirr -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no impala@
Warning: Permanently added '' (RSA) to the list of known hosts.
Last login: Tue Nov 20 22:57:12 2012 from


Launch Hive to ensure you can run queries:

-bash-4.1$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/
Hive history file=/tmp/impala/hive_job_log_impala_201211202349_133443320.txt
hive> show tables;
Time taken: 2.883 seconds
hive> select * from numbers;
1 one
2 two
3 three
4 four
Time taken: 0.943 seconds
hive> select word from numbers;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201211210024_0003, Tracking URL =
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201211210024_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2012-11-21 00:29:11,916 Stage-1 map = 0%,  reduce = 0%
2012-11-21 00:29:15,940 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.73 sec
2012-11-21 00:29:16,949 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.73 sec
2012-11-21 00:29:17,961 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 0.73 sec
MapReduce Total cumulative CPU time: 730 msec
Ended Job = job_201211210024_0003
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 0.73 sec   HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 730 msec
Time taken: 10.687 seconds
hive> quit;


Now that you know Hive is running correctly, you can use Impala to query the same table.

$ impala-shell
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Build version: Impala v0.1 (e50c5a0) built on Mon Nov 12 13:22:11 PST 2012)
[Not connected] > connect localhost
[localhost:21000] > show tables;
[localhost:21000] > select * from numbers;
1 one
2 two
3 three
4 four
[localhost:21000] > select word from numbers;
[localhost:21000] >


Destroying your Impala Cluster

To destroy your Impala cluster, use the Whirr destroy-cluster command:

$ whirr destroy-cluster --config
Running on provider aws-ec2 using identity ABCDEFGHIJKLMNOP
Finished running destroy phase scripts on all cluster instances
Destroying myimpalacluster cluster


Taking a look at Ganglia

For security, Whirr installs the Ganglia web interface so that it is accessible only from localhost:

  # Ganglia monitoring system php web frontend

  Alias /ganglia /usr/share/ganglia

    Order deny,allow
    Deny from all
    Allow from
    Allow from ::1
    # Allow from


In order to view Ganglia, you will need to run the following script (in a separate terminal) to create a secure SSH tunnel.

CLUSTER_NAME=myimpalacluster
CLUSTER_USER=impala   # the cluster login user created by Whirr (see the recipe)
LOCAL_PORT=8080
GMETA_NODE=`grep ganglia-metad $HOME/.whirr/$CLUSTER_NAME/instances | awk '{print $3}'`
echo "Creating an SSH tunnel to $GMETA_NODE. Open your browser to localhost:$LOCAL_PORT. Ctrl+C to exit"
ssh -i $HOME/.ssh/id_rsa_whirr -o ConnectTimeout=10 -o ServerAliveInterval=60 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -L $LOCAL_PORT:localhost:80 -N $CLUSTER_USER@$GMETA_NODE


This script looks at your Whirr deployment to find the ganglia-metad machine and starts an SSH tunnel. To view your Ganglia data, open your browser to http://localhost:8080/ganglia/ or use whatever port you set LOCAL_PORT to in the script. (An alternative is to use the Whirr SOCKS proxy – see the Whirr docs.)

Ganglia tracks performance metrics for all your hosts and services. Keep in mind that it will take a few minutes for Ganglia to distribute all metrics when it first starts. Initially, check the “Hosts up” number to make sure all the machines are reporting (meaning that Ganglia heartbeats are getting through).

That’s it

I hope you find these bash scripts useful. Feel free to contact me using the comment box below.

Matt Massie is the lead developer at the UC Berkeley AMP Lab, and previously worked in the Cloudera engineering team. He founded the Ganglia project in 2000.


15 Responses
  • r roy / February 12, 2013 / 4:42 PM

    When I test Hive as per the instructions above, I get the following error. What could be the reason?

    hive> show tables;
    FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Access denied for user ‘hive’@'’ (using password: YES)
    java.sql.SQLException: Access denied for user ‘hive’@'’ (using password: YES)
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

  • Ashish / February 12, 2013 / 5:36 PM

    1. For the “base64: command not found” error: use MacPorts to install base64
    2. For “Access denied for user ‘hive’@’’ (using password: YES)” error:

    I changed line RANDOM_PASSWORD=$(dd count=1 bs=16 if=/dev/urandom of=/dev/stdout 2>/dev/null | base64)


    and it worked.

  • Dave / February 27, 2013 / 3:46 PM

    These instructions no longer work with the latest release of Impala. After the cluster has been created, I get the following error when issuing the ‘show tables’ command from the Hive CLI:

    -bash-4.1$ hive
    Logging initialized using configuration in file:/etc/hive/conf.dist/
    Hive history file=/tmp/impala/hive_job_log_impala_201302272345_1246843221.txt
    hive> show tables;
    FAILED: Error in metadata: MetaException(message:Got exception: org.apache.hadoop.hive.metastore.api.MetaException javax.jdo.JDODataStoreException: Required table missing : “`SKEWED_STRING_LIST`” in Catalog “” Schema “”. DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable “datanucleus.autoCreateTables”
    NestedThrowables: Required table missing : “`SKEWED_STRING_LIST`” in Catalog “” Schema “”. DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable “datanucleus.autoCreateTables”)
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

    I get a similar message from the Impala shell. Can someone take a look?

  • Justin Kestelyn (@kestelyn) / February 27, 2013 / 4:59 PM


    We have updated the script and it should work for you now.

  • Aleksandra / March 01, 2013 / 6:31 AM

    It seems like the problem still exists: “show tables” from Hive CLI produces the same error.

  • Justin Kestelyn (@kestelyn) / March 01, 2013 / 8:34 AM


    Try dropping the old metastore database.

  • Steven Wong / March 07, 2013 / 12:46 PM

    These instructions do not work with the instance types m2.4xlarge and hs1.8xlarge. The instances come up but have 1 or 0 ephemeral disk, respectively. There should be 2 and 24 ephemeral disks, respectively, for the 2 instance types. How can this be fixed?

  • Naga / March 14, 2013 / 4:52 PM

    Excellent article, I tried setting up the cluster today and I still get the same error message as Dave for both Hive and Impala-shell. Kindly suggest.


  • Naga / March 16, 2013 / 4:03 PM

    Found the problem, the download script has to be updated from
    The deployment script needs to be updated
    to change:

    or manually modify the changes in before running

  • Justin Kestelyn (@kestelyn) / March 17, 2013 / 4:30 PM

    Naga, thanks for reporting.

    All scripts shown inline are correct; we have removed the download links for now to avoid confusion.

  • Dave / April 23, 2013 / 4:21 PM


    This script seems to be broken again. Neither Hive nor Impala start. I get errors trying to get into the respective shells.

  • Justin Kestelyn (@kestelyn) / April 23, 2013 / 4:41 PM


    We’re working on updates to resolve the issue. Thanks for your patience.

    In the meantime consider using this method:

  • Justin Kestelyn (@kestelyn) / April 23, 2013 / 7:50 PM


    Rather than updating this post and causing confusion, we have determined that the fastest and most reliable approach for users is to follow the instructions provided here:

  • Faina / June 15, 2013 / 11:03 PM

    Hi Justin!
    Is an updated post available?
    I would like to install an Impala 0.7 cluster with Whirr.

  • Justin Kestelyn (@kestelyn) / June 18, 2013 / 8:30 AM


    Please see the message at the top of the post.
