From Zero to Impala in Minutes

This post was originally published by U.C. Berkeley AMPLab developer (and former Clouderan) Matt Massie on his personal blog. Matt has graciously permitted us to republish it here for your convenience.

Note: The post below is valid for Impala version 0.6 only and is not being maintained for subsequent releases. To deploy Impala 0.7 and later using a much easier (and also free) method, use this how-to.

Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or Apache HBase.

This post explains how to use Apache Whirr to bring up a multi-node Cloudera Impala cluster on EC2 in minutes. When the installation script finishes, you’ll be able to query the sample data in Impala immediately, with no further setup. The script also configures Impala for performance (e.g. enabling direct reads). Because Amazon’s Elastic Compute Cloud (Amazon EC2) provides resizable compute capacity, you can easily choose any size of Impala cluster you want.

In addition, your Impala cluster will be automatically set up with Ganglia: a lightweight and scalable metric-collection framework that provides a powerful web UI for analyzing trends in cluster and application performance.

The installation scripts represent a day of work so I’m sure there are ways they can be improved. Please feel free to comment at the end of the post if you have any ideas (or issues). These scripts could also easily be used as a basis for a proper Whirr service if someone had the time.

If you’re planning to deploy Impala in production, I highly recommend that you use Cloudera Manager.

Installing Whirr

If you haven’t already installed Apache Whirr, download and install using the following instructions. If you already have Whirr 0.8.1 installed, feel free to skip ahead.

Note: I like to install things in /workspace on my machine but you can install Whirr anywhere you like of course.

 

Add the following line to your .bashrc replacing /workspace with the path you installed Whirr into.
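A minimal sketch of such a line, assuming the Whirr 0.8.1 distribution was unpacked into /workspace/whirr-0.8.1 (the directory name is an assumption; adjust it to wherever your copy lives):

```shell
# Assumed install location: replace /workspace/whirr-0.8.1 with your own path.
# The whirr launcher lives in the bin/ directory of the distribution.
export PATH="/workspace/whirr-0.8.1/bin:$PATH"
```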

 

Once you’ve edited your .bashrc, source it and check that whirr is in your path.

 

Edit your ~/.whirr/credentials file (created above) to set EC2 (aws-ec2) as your cloud provider and add your AWS identity and credentials, e.g.
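As a rough illustration, the helper below (a hypothetical function, not part of Whirr) prints a template for that file. PROVIDER, IDENTITY and CREDENTIAL are the keys the Whirr credentials file uses; the values shown are placeholders that you would replace with your own AWS access key ID and secret access key.

```shell
# Hypothetical helper: print a template for the Whirr credentials file.
# The values in angle brackets are placeholders, not real keys.
whirr_credentials_template() {
  cat <<'EOF'
PROVIDER=aws-ec2
IDENTITY=<your AWS access key ID>
CREDENTIAL=<your AWS secret access key>
EOF
}

whirr_credentials_template
```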

 

For the last step, you need to create an SSH RSA (not DSA!) keypair. You’ll use this keypair whenever you launch a cluster on EC2 using Whirr (more on that soon).
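One way to do this is sketched below. The key path ~/.ssh/id_rsa_whirr is an assumption (any path works as long as your recipe points at it), and the empty passphrase is there so the key can be used without a prompt.

```shell
# Hypothetical key location; override KEY_FILE to use a different path.
KEY_FILE="${KEY_FILE:-$HOME/.ssh/id_rsa_whirr}"

mkdir -p "$(dirname "$KEY_FILE")"

if ! command -v ssh-keygen >/dev/null 2>&1; then
  echo "ssh-keygen not found; install OpenSSH first" >&2
elif [ ! -f "$KEY_FILE" ]; then
  # -t rsa: generate an RSA (not DSA) key
  # -P '':  empty passphrase, so the key can be used unattended
  # -f:     write the private key here and the public key to "$KEY_FILE.pub"
  ssh-keygen -q -t rsa -P '' -f "$KEY_FILE"
fi
```

The existence check keeps the sketch from overwriting a key you already have.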

 

Note that this keypair has nothing to do with your AWS keypair that is generated in the AWS Management Console or by running ec2-add-keypair.

Preparing for Impala Installation

You will need three files to install Impala: impalacluster.properties, installer.sh, and setup-impala.sh. 

The impalacluster.properties file will be passed to Whirr as a recipe for creating your cluster. This cluster will be built to satisfy Impala’s requirements, e.g. CentOS 6.2, CDH, etc.

The installer.sh script will use the information provided by Whirr about your cluster to copy the setup-impala.sh script to each machine with scp and run it over ssh. It also passes along the address of the machine that will house the Hive metadata store, as well as a randomly generated password for the ‘hive’ user.

The setup-impala.sh script does the actual installation on each machine in your cluster. This script will completely configure Impala and Hive on your cluster for optimal performance. Once complete, you will immediately be able to query against Impala (and Hive).

Let’s go through each of these files in detail.

impalacluster.properties

The Impala installation guide lists requirements such as the following:

  • Red Hat Enterprise Linux (RHEL)/CentOS 6.2 (64-bit)
  • CDH 4.2.0 or later
  • Hive
  • MySQL
  • Sufficient memory to handle join operations

The RightImage CentOS_6.2_x64 v5.8.8 EBS image (ami-51c3e614) will satisfy the CentOS 6.2 requirement and Whirr will do all the work to install CDH 4.2.x on your cluster. The installation scripts provided in this post will handle setting up Hive, MySQL and Impala.

Here is the Impala-ready Apache Whirr recipe, impalacluster.properties, to use as a starting point for your deployment:
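The sketch below writes out a recipe along these lines. It is an approximation under stated assumptions, not the exact file from this post: the property names are standard Whirr configuration keys, the AMI and cluster name come from this post, and the instance type, key paths, region and output file name are placeholders or inferences that you should check against your own setup.

```shell
# Hypothetical output path, chosen so an existing recipe is not overwritten.
RECIPE_FILE="${RECIPE_FILE:-./impalacluster.properties.example}"

cat > "$RECIPE_FILE" <<'EOF'
# --- Edit these to match your environment ---
whirr.cluster-name=myimpalacluster
# 1 master and 10 workers (the example cluster size used in this post)
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+ganglia-metad,10 hadoop-datanode+hadoop-tasktracker+ganglia-monitor
# Assumed key paths; point these at your own RSA keypair
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_whirr
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa_whirr.pub
# Placeholder instance type (an assumption, not taken from this post)
whirr.hardware-id=m1.large

# --- Image named in this post; the region is inferred, so verify it ---
whirr.provider=aws-ec2
whirr.image-id=us-west-1/ami-51c3e614
whirr.location-id=us-west-1
# Install CDH4 rather than stock Apache Hadoop
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
EOF

echo "wrote $RECIPE_FILE"
```

The heredoc delimiter is quoted so that `${sys:user.home}` reaches the file literally and is resolved by Whirr, not by the shell.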

 

The installer.sh will pass these properties to Whirr when you launch your cluster. You should edit the properties at the top of the file to match your environment and desired cluster characteristics (e.g. RSA key, cluster size and EC2 instance type).

Do not edit the properties at the bottom of the file. Doing so could break the installer.

If you want to learn more about these Whirr options, take a look at the Whirr Configuration Guide. There is also a recipes directory inside the Whirr distribution with example recipes.

installer.sh

The installer.sh file orchestrates the installation using the Whirr deployment information that is generated by the launch-cluster command. This information is found in the directory ~/.whirr/myimpalacluster. Here is the script:

 

You will likely need to change the RSA_PRIVATE_KEY specified at the top of the script; otherwise, you should not need to modify anything else in this file.

This script generates a random password for the Hive metadata store user, launches a cluster using the impalacluster.properties file, and then uses the Whirr deployment information to copy the setup-impala.sh script to every worker in the cluster and run it.
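The overall flow can be sketched roughly as follows. The password line mirrors the approach of base64-encoding 16 random bytes. The run_setup_on function is a hypothetical stand-in for the real scp and ssh step, and the addresses are placeholders. Each worker is started in the background and the script then waits for all of them.

```shell
# Random password: 16 random bytes, base64-encoded (24 characters).
RANDOM_PASSWORD=$(dd count=1 bs=16 if=/dev/urandom 2>/dev/null | base64)

# Hypothetical stand-in for the per-machine step. The real installer would
# scp setup-impala.sh to the host and run it over ssh with these arguments.
run_setup_on() {
  host=$1
  metastore=$2
  password=$3
  echo "would run setup-impala.sh on $host (metastore=$metastore)"
}

# Placeholder addresses, for illustration only.
METASTORE_HOST=10.0.0.2
WORKERS="10.0.0.2 10.0.0.3 10.0.0.4"

# Start the step for every worker in the background...
for host in $WORKERS; do
  run_setup_on "$host" "$METASTORE_HOST" "$RANDOM_PASSWORD" &
done
# ...then block until all background jobs have finished.
wait
echo "all workers finished"
```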

Note that, for performance, the installer runs the ssh calls in parallel and waits for them to complete. Using time ./installer.sh, I’ve found that it takes about a minute per machine to bring up a cluster. Here is the timing for an 11-node cluster (1 master, 10 workers):

 

setup-impala.sh

This is the setup-impala.sh script that is run on each machine in your Impala cluster to install and configure Impala:

 

You do not need to edit this file.

This script is a bit long but I hope it’s easy to understand. It takes two arguments: the IP address of the Hive metadata store and the password for the hive user.

When run, this script will:

  1. Check if it’s running on the machine designated to be the Hive metadata store; if so, it will install and configure Hive and MySQL and drop in a very simple example table.
  2. Install the necessary Impala packages
  3. Configure Impala for read.shortcircuit, skip.checksum, local-path-access.user and data locality tracking for performance
  4. Create an impala user
  5. Restart the datanode to pull in the modified configuration
  6. Start the statestored service
  7. Start impalad passing in the -state_store_host (all impalad use the state store running on the Hive metadata store machine), -nn (NameNode) and -nn_port (NameNode port) arguments
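Two of these steps can be illustrated with a short sketch. Both helper functions are hypothetical names, and the addresses and port are placeholders. The first shows the kind of check step 1 implies (is this machine the metadata store host?) and the second assembles the impalad flags named in step 7.

```shell
# Hypothetical helper for step 1.
# $1 = metastore IP passed in by the installer; remaining args = this machine's addresses.
is_metastore_host() {
  metastore_ip=$1
  shift
  for addr in "$@"; do
    if [ "$addr" = "$metastore_ip" ]; then
      return 0
    fi
  done
  return 1
}

# Hypothetical helper for step 7: build the impalad flags.
# $1 = state store host, $2 = NameNode host, $3 = NameNode port.
impalad_args() {
  echo "-state_store_host=$1 -nn=$2 -nn_port=$3"
}

# Placeholder addresses, for illustration only.
if is_metastore_host 10.0.0.2 10.0.0.2 127.0.0.1; then
  echo "this machine would also get Hive and MySQL"
fi
impalad_args 10.0.0.2 10.0.0.1 8020
```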

Once you’ve modified your impalacluster.properties and installer.sh files, you’re ready to launch your Impala cluster.

Launching your Impala cluster

At this point, you should have a directory with your customized installation script and configuration file:

 

To launch your cluster, simply run the installer.sh script.

 

When installer.sh completes, you should see messages like the following:

 

At this point, your Impala cluster is up and ready for work.

Using your Impala cluster

You can find your deployment details in the file ~/.whirr/myimpalacluster/instances, e.g.

 

The columns are, in order: the EC2 instance ID, the Whirr service template (roles), the EC2 public IP of the machine and the EC2 private address of the machine. The Hive metadata store is always installed on the first machine in the list that is not a master (i.e. not running the NameNode, JobTracker, etc.).
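If you would rather not pick out the address by eye, a small helper can do it. This is a sketch that assumes the instances file is tab-separated with the roles in the second column. The sample rows are made up for illustration, apart from the 50.18.85.89 address used in this example.

```shell
# Hypothetical sample in the four-column layout described above
# (instance ID, roles, public IP, private IP), tab-separated.
INSTANCES_FILE=$(mktemp)
printf 'us-west-1/i-aaaa1111\thadoop-namenode,hadoop-jobtracker\t50.18.0.1\t10.0.0.1\n' > "$INSTANCES_FILE"
printf 'us-west-1/i-bbbb2222\thadoop-datanode,hadoop-tasktracker\t50.18.85.89\t10.0.0.2\n' >> "$INSTANCES_FILE"
printf 'us-west-1/i-cccc3333\thadoop-datanode,hadoop-tasktracker\t50.18.0.3\t10.0.0.3\n' >> "$INSTANCES_FILE"

# Print the public IP (column 3) of the first machine whose roles
# (column 2) do not include the NameNode, i.e. the first non-master.
metastore_public_ip() {
  awk -F'\t' '$2 !~ /namenode/ { print $3; exit }' "$1"
}

metastore_public_ip "$INSTANCES_FILE"
```

Against a real deployment you would call it as `metastore_public_ip ~/.whirr/myimpalacluster/instances`.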

To log into the Hive machine, use the public IP address of the first node: 50.18.85.89 in this example.

 

Launch hive to ensure you can run queries, e.g.

 

Now that you know Hive is running correctly, you can use Impala to query the same table.

 

Destroying your Impala Cluster

To destroy your Impala cluster, use the Whirr destroy-cluster command:
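A minimal sketch, assuming you run it from the directory that holds your impalacluster.properties recipe:

```shell
# Recipe used to launch the cluster; it identifies which cluster to destroy.
CONFIG=impalacluster.properties

if command -v whirr >/dev/null 2>&1; then
  # Terminates every instance in the cluster described by the recipe.
  whirr destroy-cluster --config "$CONFIG"
else
  echo "whirr is not on your PATH" >&2
fi
```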

 

Taking a look at Ganglia

For security, Whirr configures the Ganglia web interface to be accessible only from localhost, e.g.

 

To view Ganglia, you will need to run the following script (in a separate terminal) to create a secure SSH tunnel.

 

This script looks at your Whirr deployment to find the ganglia-meta machine and starts an SSH tunnel. To view your Ganglia data, open your browser to http://localhost:8080/ganglia/ or use whatever port you set LOCAL_PORT to in the script. (An alternative is to use the Whirr SOCKS proxy – see the Whirr docs.)
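The core of such a tunnel script might look like the sketch below. It makes several assumptions: the instances file is tab-separated, the Ganglia web UI listens on port 80 of the remote machine, and the key path and login user are placeholders for the values in your own recipe. To stay side-effect free, the demo prints the ssh command without running it.

```shell
LOCAL_PORT=8080
# Hypothetical values; use the key and login user from your own recipe.
KEY_FILE="$HOME/.ssh/id_rsa_whirr"
CLUSTER_USER=impala

# Print the public IP (column 3) of the machine whose roles include ganglia-meta*.
ganglia_host() {
  awk -F'\t' '$2 ~ /ganglia-meta/ { print $3; exit }' "$1"
}

# Print the tunnel command: -N opens no remote shell, and -L forwards
# LOCAL_PORT on this machine to port 80 on the Ganglia host.
tunnel_command() {
  echo "ssh -i $KEY_FILE -N -L $LOCAL_PORT:localhost:80 $CLUSTER_USER@$1"
}

# Demo with a hypothetical one-line instances file.
demo=$(mktemp)
printf 'us-west-1/i-aaaa1111\thadoop-namenode,ganglia-metad\t50.18.0.1\t10.0.0.1\n' > "$demo"
tunnel_command "$(ganglia_host "$demo")"
```

To open a real tunnel, pass ~/.whirr/myimpalacluster/instances to ganglia_host and run the printed command.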

Ganglia tracks performance metrics for all your hosts and services. Keep in mind that it will take a few minutes for Ganglia to distribute all metrics when it first starts. Initially, check the Hosts up: number to make sure all the machines are reporting (meaning that Ganglia heartbeats are getting through).

That’s it

I hope you find these bash scripts useful. Feel free to contact me using the comment box below.

Matt Massie is the lead developer at the UC Berkeley AMP Lab, and previously worked in the Cloudera engineering team. He founded the Ganglia project in 2000.


15 Responses
  • r roy / February 12, 2013 / 4:42 PM

    When I testing Hive as per instructions above. I get following error. what could be the reason?

    hive> show tables;
    FAILED: Error in metadata: javax.jdo.JDOFatalDataStoreException: Access denied for user 'hive'@'ip-10-196-40-24.us-west-1.compute.internal' (using password: YES)
    NestedThrowables:
    java.sql.SQLException: Access denied for user 'hive'@'ip-10-196-40-24.us-west-1.compute.internal' (using password: YES)
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
    hive>

  • Ashish / February 12, 2013 / 5:36 PM

    1. For base64: command not found: use macports to install base64
    2. For “Access denied for user ‘hive’@’ip-10-196-40-24.us-west-1.compute.internal’ (using password: YES)” error:

    I changed line RANDOM_PASSWORD=$(dd count=1 bs=16 if=/dev/urandom of=/dev/stdout 2>/dev/null | base64)

    to
    RANDOM_PASSWORD = ”

    and it worked.

  • Dave / February 27, 2013 / 3:46 PM

    These instructions no longer work with the latest release of Impala. After the cluster has been created, I get the following error when issuing the ‘show tables’ command from the Hive CLI:

    -bash-4.1$ hive
    Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
    Hive history file=/tmp/impala/hive_job_log_impala_201302272345_1246843221.txt
    hive> show tables;
    FAILED: Error in metadata: MetaException(message:Got exception: org.apache.hadoop.hive.metastore.api.MetaException javax.jdo.JDODataStoreException: Required table missing : “SKEWED_STRING_LIST” in Catalog “” Schema “”. DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable “datanucleus.autoCreateTables”
    NestedThrowables:
    org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : “SKEWED_STRING_LIST” in Catalog “” Schema “”. DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable “datanucleus.autoCreateTables”)
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
    hive>

    I get a similar message from the Impala shell. Can someone take a look?

  • Justin Kestelyn (@kestelyn) / February 27, 2013 / 4:59 PM

    Dave,

    We have updated the script and it should work for you now.

  • Aleksandra / March 01, 2013 / 6:31 AM

    It seems like the problem still exists: “show tables” from Hive CLI produces the same error.

  • Justin Kestelyn (@kestelyn) / March 01, 2013 / 8:34 AM

    Aleksandra,

    Try dropping the old metastore database.

  • Steven Wong / March 07, 2013 / 12:46 PM

    These instructions do not work with the instance types m2.4xlarge and hs1.8xlarge. The instances come up but have 1 or 0 ephemeral disk, respectively. There should be 2 and 24 ephemeral disks, respectively, for the 2 instance types. How can this be fixed?

  • Naga / March 14, 2013 / 4:52 PM

    Hi,
    Excellent article, I tried setting up the cluster today and I still get the same error message as Dave for both Hive and Impala-shell. Kindly suggest.

    Thanks

  • Naga / March 16, 2013 / 4:03 PM

    Found the problem: the downloadable setup_impala.sh deployment script needs to be updated to change:
    SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.9.0.mysql.sql;
    To:
    SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;

    or manually modify the changes in setup_impala.sh before running installer.sh

  • Justin Kestelyn (@kestelyn) / March 17, 2013 / 4:30 PM

    Naga, thanks for reporting.

    All scripts shown inline are correct; we have removed the download links for now to avoid confusion.

  • Dave / April 23, 2013 / 4:21 PM

    HI,

    This script seems to be broken again. Neither Hive nor Impala start. I get errors trying to get into the respective shells.

  • Justin Kestelyn (@kestelyn) / April 23, 2013 / 4:41 PM

    All,

    We’re working on updates to resolve the issue. Thanks for your patience.

    In the meantime consider using this method:

    http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/

  • Justin Kestelyn (@kestelyn) / April 23, 2013 / 7:50 PM

    All,

    Rather than updating this post and causing confusion, we have determined that the fastest and most reliable approach for users is to follow the instructions provided here:

    http://blog.cloudera.com/blog/2013/03/how-to-create-a-cdh-cluster-on-amazon-ec2-via-cloudera-manager/

  • Faina / June 15, 2013 / 11:03 PM

    Hi Justin!
    Is updated post available?
    I would like to install Impala 7.0 cluster with Whirr.

  • Justin Kestelyn (@kestelyn) / June 18, 2013 / 8:30 AM

    Faina,

    Please see the message at the top of the post.
