How-to: Set Up an Apache Hadoop/Apache HBase Cluster on EC2 in (About) an Hour

Note (added July 8, 2013): The information below is deprecated; we suggest that you refer to this post for current instructions.

Today we bring you one user’s experience using Apache Whirr to spin up a CDH cluster in the cloud. This post was originally published here by George London (@rogueleaderr) based on his personal experiences; he has graciously allowed us to bring it to you here as well in a condensed form. (Note: the configuration described here is intended for learning/testing purposes only.)

I’m going to walk you through a (relatively) simple set of steps that will get you up and running MapReduce programs on a cloud-based, six-node distributed Apache Hadoop/Apache HBase cluster as fast as possible. This is all based on what I’ve picked up on my own, so if you know of better/faster methods, please let me know in comments!

We’re going to run our cluster on Amazon EC2, launching it with Apache Whirr and configuring it with Cloudera Manager Free Edition. Then we’ll run some basic programs I’ve posted on GitHub that will parse data and load it into Apache HBase.

Altogether, this tutorial will take a bit over one hour and cost about $10 in server costs.

Step 1: Get the Cluster Running

I’m going to assume you already have an Amazon Web Services account (because it’s awesome, and the basic tier is free). If you don’t, go get one. Amazon’s directions for getting started are pretty clear, or you can easily find a guide with Google. We won’t actually be interacting with the Amazon management console much, but you will need two pieces of information: your AWS Access Key ID and your AWS Secret Access Key.

To find these, go to https://portal.aws.amazon.com/gp/aws/securityCredentials. You can write these down, or better yet add them to your shell startup script by doing:
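For example, something like this in your ~/.bashrc (substitute your own key values; these are the conventional AWS variable names, and the Whirr config we write later can read them):

    export AWS_ACCESS_KEY_ID=<your access key id>
    export AWS_SECRET_ACCESS_KEY=<your secret access key>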

You will also need a security certificate and private key that will let you use the command-line tools to interact with AWS. From the AWS Management Console go to Account > Security Credentials > Access Credentials, select the “X.509 Certificates” tab and click on Create a new Certificate. Download and save this somewhere safe (e.g. ~/.ec2)

Then do:
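That is, point the EC2 command-line tools at the certificate and private key you just saved (the file names will match whatever you downloaded):

    export EC2_PRIVATE_KEY=~/.ec2/pk-XXXXXXXX.pem
    export EC2_CERT=~/.ec2/cert-XXXXXXXX.pem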

Finally, you’ll need a different key to log into your servers using SSH. To create that, do:
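One way to do that with the EC2 tools; the key pair name "hadoop_tutorial" is the one the launch command below expects:

    # create the key pair, strip the fingerprint line, and keep the private key
    ec2-add-keypair hadoop_tutorial | sed 1d > ~/.ec2/hadoop_tutorial.pem
    chmod 600 ~/.ec2/hadoop_tutorial.pem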

You have the option of manually creating a bunch of EC2 nodes, but that’s a pain. Instead, we’re going to use Whirr, which is specifically designed to allow push-button setup of clusters in the cloud.

To use Whirr, we are going to need to create one node manually, which we are going to use as our “control center.” I’m assuming you have the EC2 command-line tools installed (if not, go here and follow directions).

We’re going to create an instance running Ubuntu 10.04 (it’s old, but all of the tools we need run on it in stable fashion), and launch it in the USA-East region. You can find AMIs for other Ubuntu versions and regions here.

So, do:
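Something along these lines, where ami-xxxxxxxx is the Ubuntu 10.04 AMI ID for us-east-1 from the list linked above (the m1.small size is just a cheap, assumed choice for a control node):

    ec2-run-instances ami-xxxxxxxx -t m1.small -k hadoop_tutorial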

This creates an EC2 instance using a minimal Ubuntu image, with the SSH key “hadoop_tutorial” that we created a moment ago. The command will produce a bunch of information about your instance. Look for the “instance id” that starts with i- , then do:
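For example (substitute your own instance id):

    ec2-describe-instances i-xxxxxxxx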

This will tell you the public DNS name of your new instance (it will start with ec2-). Now we’re going to remotely log in to that server.
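The default user on Ubuntu AMIs is "ubuntu", so something like this should work (substitute your instance's public DNS):

    ssh -i ~/.ec2/hadoop_tutorial.pem ubuntu@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com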

Now we’re in! This server is only going to run two programs, Whirr and the Cloudera Manager. First we’ll install Whirr. Find a mirror at http://www.apache.org/dyn/closer.cgi/whirr/, then download it to your home directory using wget:
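For example (I'm assuming a 0.8.x release here; use whatever version and mirror path the closer.cgi page gives you):

    wget http://<your-mirror>/whirr/whirr-0.8.1/whirr-0.8.1.tar.gz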

Untar and unzip:
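Adjusting the file name to whatever version you grabbed:

    tar xzf whirr-0.8.1.tar.gz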

Whirr will launch clusters for you in EC2 according to a “properties” file you pass it. It’s actually quite powerful: it allows a lot of customization (and can be used with non-Amazon cloud providers), and it can even set up complicated servers using Chef scripts. But for our purposes, we’ll keep it simple.

Create a file called hadoop.properties:

And give it these contents:
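Here's a sketch of what mine looked like. The property names are standard Whirr settings, but the cluster name, instance count/size, and key paths are choices you can change; the ${env:...} references assume you exported your AWS keys earlier (you can also paste the values in directly):

    whirr.cluster-name=whirrly
    # six bare (unconfigured) nodes for Cloudera Manager to take over
    whirr.instance-templates=6 noop
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
    whirr.cluster-user=huser
    whirr.hardware-id=m1.large
    whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
    whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub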

This will launch a cluster of six unconfigured “large” EC2 instances. (Whirr refused to create small or medium instances for me. Please let me know in comments if you know how to do that.)

Before we can use Whirr, we’re going to need to install Java, so do:
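On this Ubuntu 10.04 image, OpenJDK 6 from the standard repositories works fine:

    sudo apt-get update
    sudo apt-get install -y openjdk-6-jdk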

Next we need to create the SSH key that will let our control node log in to our cluster.
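Whirr wants a passphrase-less RSA key in the default location, matching the paths in hadoop.properties:

    ssh-keygen -t rsa -P ''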

And hit [enter] at the prompt.

Now we’re ready to launch!
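From the directory containing hadoop.properties (adjust the path to wherever you untarred Whirr):

    ~/whirr-0.8.1/bin/whirr launch-cluster --config hadoop.properties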

This will produce a bunch of output and end with commands to SSH into your servers.

We’re going to need these IPs for the next step, so copy and paste these lines into a new file:

Then use this bit of regular expression magic to create a file with just the IPs:
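Save the ssh lines into hosts.txt, then strip everything up to the "@" so only the IPs remain. This sed is a sketch; if it misbehaves, see the awk alternative in the comments below:

    nano hosts.txt                        # paste in the ssh lines from Whirr's output
    sed -e 's/.*@//' hosts.txt > ips.txt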

Step 2: Configure the Cluster

From your Control Node, download Cloudera Manager; we will install the Free Edition, which can be used for up to 50 nodes:
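At the time of writing the Free Edition installer lived at the URL below; adjust if Cloudera has moved it:

    wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin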

Then install it:
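Make it executable and run it as root:

    chmod u+x cloudera-manager-installer.bin
    sudo ./cloudera-manager-installer.bin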

This will pop up an extremely green install wizard; just hit “yes” to everything.

Cloudera Manager works poorly with textual browsers like Lynx. (It has an API, but we won’t cover that here.) Luckily, we can access the web interface from our laptop by looking up the public DNS address we used to log in to our control node, and appending “:7180” to the end in our web browser.

First, you need to tell Amazon to open that port. The manager also needs a pretty ridiculously long list of open ports to work, so we’re just going to tell Amazon to open all TCP ports. That’s not great for security, so you can add the individual ports if you care enough (lists here):
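With the EC2 command-line tools that looks something like the following. The group names are assumptions: the control node launched into the "default" group, and Whirr names its group "jclouds#" plus the cluster name (check ec2-describe-group for the exact names):

    # Cloudera Manager's web UI on the control node
    ec2-authorize default -P tcp -p 7180 -s 0.0.0.0/0
    # every TCP port on the cluster nodes (crude, but simple)
    ec2-authorize "jclouds#whirrly" -P tcp -p 0-65535 -s 0.0.0.0/0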

Then fire up Chrome and visit your control node’s public DNS on port 7180, i.e. http://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:7180/.

Log in with the default credentials: user “admin”, password “admin”.

Click “just install the free edition”, “continue”, then “proceed” in tiny text at the bottom right of the registration screen.

Now go back to that ips.txt file we created in the last part and copy the list of IPs. Paste them into the box on the next screen, click “search”, then “install CDH on selected hosts.”

Next, the manager needs credentials that’ll allow it to log into the nodes in the cluster to set them up. You need to give it an SSH key, but that key is on the control node and can’t be directly accessed from your laptop. So you need to copy it to your laptop.
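From your laptop (not the control node), something like this copies the control node's .ssh directory down; the key file and hostname are whatever you used to connect earlier:

    scp -i ~/.ec2/hadoop_tutorial.pem -r \
        ubuntu@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:~/.ssh ~/control_node_ssh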

(“scp” is a program that securely copies files through SSH, and the -r flag will copy a directory.)

Now you can give the manager the username “huser”, and the SSH keys you just downloaded:

Click “start installation,” then “ok” to log in with no passphrase. Now wait for a while as CDH is installed on each node.

Next, Cloudera Manager will inspect the hosts and issue some warnings, but just click “continue.” Then it will ask you which services you want to start – choose “custom” and then select ZooKeeper, HDFS, HBase, and MapReduce.

Click “continue” on the “review configuration changes” page, then wait as the manager starts your services.

Click “continue” a couple more times when prompted, and now you’ve got a functioning cluster.

Step 3: Do Something

To use your cluster, you need to log in to one of the nodes via SSH. Pop open the “hosts.txt” file we made earlier, grab any of the lines, and paste it into the terminal.

If you already know how to use Hadoop and HBase, then you’re all done. Your cluster is good to go. If you don’t, here’s a brief overview:

The basic Hadoop workflow is to run a “job” that reads some data from HDFS, “maps” some function onto that data to process it, “reduces” the results back to a single set of data, and then stores the results back to HDFS. You can also use HBase as the input and/or output to your job.

You can interact with HDFS directly from the terminal through commands starting with “hadoop fs”. In CDH, Cloudera’s open-source Hadoop distro, you need to be logged in as the “hdfs” user to manipulate HDFS, so let’s log in as hdfs, create a user directory for ourselves, then create an input directory to store data.
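For example (huser gets passwordless sudo from Whirr, so this should just work):

    sudo su - hdfs
    hadoop fs -mkdir /user/hdfs
    hadoop fs -mkdir /user/hdfs/input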

You can list the contents of HDFS by typing:
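For example:

    hadoop fs -ls /
    hadoop fs -ls /user/hdfs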

To run a program using MapReduce, you have two options. You can either:

  • Write a program in Java using the MapReduce API and package it as a JAR
  • Use Hadoop Streaming, which allows you to write your mapper and reducer scripts in whatever language you want and transmit data between stages by reading/writing to stdout.

If you’re used to scripting languages like Python or Ruby and just want to crank through some data, Hadoop Streaming is great (especially since you can add more nodes to overcome the relative CPU slowness of a higher level language). But interacting programmatically with HBase is a lot easier through Java. (Interacting with HBase is tricky but not impossible with Python. There is a package called “Happybase” which lets you interact “pythonically” with HBase; the problem is that you have to run a special service called Thrift on each server to translate the Python instructions into Java, or else transmit all of your requests over the wire to a server on one node, which I assume will heavily degrade performance. Cloudera Manager will not set up Thrift for you, though you could do it by hand or using Whirr+Chef.) So I’ll provide a quick example of Hadoop streaming and then a more extended HBase example using Java.

Now, grab my example code repo off Github. We’ll need git.

(If you’re still logged in as hdfs, do “exit” back to “huser” since hdfs doesn’t have sudo privileges by default.)
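Install git and clone the repo; the URL here is assumed from the project name, so substitute the actual repo address if it differs:

    sudo apt-get install -y git-core
    git clone https://github.com/rogueleaderr/Hadoop_Tutorial.git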

Cloudera Manager won’t tell the nodes where to find the configuration files it needs to run (i.e. “set the classpath”), so let’s do that now:
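One way to do that, assuming Cloudera Manager deployed client configurations to the usual /etc locations on this node:

    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export HBASE_CONF_DIR=/etc/hbase/conf
    # put the HBase jars and config on the classpath for Hadoop jobs
    export HADOOP_CLASSPATH=`hbase classpath`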

Hadoop Streaming

Michael Noll has a good tutorial on Hadoop Streaming with Python here. I’ve stolen the code and put it on GitHub for you, so to get going, load some sample data into HDFS:
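Something like this, assuming sample_rdf.nt sits in the cloned project directory (adjust the local path to wherever the file actually lives):

    cd ~/Hadoop_Tutorial/hadoop_tutorial
    # the local file must be readable by the hdfs user
    sudo -u hdfs hadoop fs -put sample_rdf.nt /user/hdfs/input/sample_rdf.nt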

Now let’s Hadoop:
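Here's the shape of the command; the streaming jar's exact file name varies by CDH version (hence the wildcard), and it should be run from the directory containing mapper.py and reducer.py:

    sudo -u hdfs hadoop jar \
        /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-*.jar \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py \
        -input /user/hdfs/input/sample_rdf.nt -output /user/hdfs/output/1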

That’s a big honking statement, but what it’s doing is telling Hadoop (which Cloudera Manager installs in /usr/lib/hadoop-0.20-mapreduce) to execute the “streaming” jar, to use the mapper and reducer “mapper.py” and “reducer.py”, passing those actual script files along to all of the nodes, telling it to operate on the sample_rdf.nt file, and to store the output in the (automatically created) output/1/ folder.

Let that run for a few minutes, then confirm that it worked by looking at the data:
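For example:

    sudo -u hdfs hadoop fs -ls /user/hdfs/output/1
    sudo -u hdfs hadoop fs -cat /user/hdfs/output/1/part-00000 | head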

That’s Hadoop Streaming in a nutshell. You can execute whatever code you want for your mappers/reducers (e.g. Ruby, or even shell commands like “cat”). If you want to use non-standard-library Python packages (e.g. “rdflib” for actually parsing the RDF), you need to zip the packages and pass those files to Hadoop Streaming using -file [package.zip].

Hadoop/HBase API

If you want to program directly into Hadoop and HBase, you’ll do that using Java. The necessary Java code can be pretty intimidating and verbose, but it’s fairly straightforward once you get the hang of it.

The GitHub repo we downloaded in Step 3 contains some example code that should just run if you’ve followed this guide carefully, and you can incrementally modify that code for your own purposes. (The basic code is adapted from the code examples in Lars George’s HBase: The Definitive Guide. The full original code can be found here. That code has its own license, but my marginal changes are released into the public domain.)

All you need to run the code is Maven. Grab that:
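On these Ubuntu images the packaged Maven is the easy route (any reasonably recent Maven will do):

    sudo apt-get install -y maven2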

(If you’re logged in as user “hdfs”, type “exit” until you get back to huser. Or give hdfs sudo privileges with “visudo” if you know how.)

When you run Hadoop jobs from the command line, Hadoop is literally shipping your code over the wire to each of the nodes to be run locally. So you need to wrap your code up into a JAR file that contains your code and all the dependencies. (There are other ways to bundle or transmit your code but I think fully self-contained “fat jars” are the easiest. You can make these using the “shade” plugin which is included in the example project. )

Build the jar file by typing:
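From the Maven project directory inside the repo:

    cd ~/Hadoop_Tutorial/hadoop_tutorial
    mvn package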

That will take an irritatingly long time (possibly 20+ minutes) as Maven downloads all the dependencies, but it requires no supervision.

(If you’re curious, you can look at the code with a text editor at /home/users/hdfs/Hadoop_Tutorial/hadoop_tutorial/src/main/java/com/tumblr/rogueleaderr/hadoop_tutorial/HBaseMapReduceExample.java). There’s a lot going on there, but I’ve tried to make it clearer via comments.

Now we can actually run our job:
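Something like the following; the jar file name is an assumption (use whatever Maven just produced in target/), and the class name comes from the source path mentioned above:

    hadoop jar target/hadoop_tutorial-1.0-SNAPSHOT.jar \
        com.tumblr.rogueleaderr.hadoop_tutorial.HBaseMapReduceExample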

If you get a bunch of connection errors, make sure your classpath is set correctly by doing:
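That is, re-run the exports from earlier on this node:

    export HBASE_CONF_DIR=/etc/hbase/conf
    export HADOOP_CLASSPATH=`hbase classpath`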

Confirm that it worked by opening up the hbase commandline shell:
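The table name below is a placeholder; "list" will show you whichever table the job actually created:

    hbase shell
    # then, inside the shell:
    list
    scan 'your_table_name'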

If you see a whole bunch of lines of data, then – congratulations! You’ve just parsed RDF data using a six-node Hadoop Cluster, and stored the results in HBase!

Next Steps

If you’re planning on doing serious work with Hadoop and HBase, just buy the books: Tom White’s Hadoop: The Definitive Guide and Lars George’s HBase: The Definitive Guide.

The official tutorials for Whirr, Hadoop, and HBase are okay, but pretty intimidating for beginners.

Beyond that, you should be able to Google up some good tutorials.

2 Responses
  • Kyle Moses / November 11, 2012 / 2:05 PM

    Great tutorial! Thanks for posting. I couldn’t get the regexp for sed to work for some reason. A simple alternative is:

    awk -F'@' '{print $2}' hosts.txt > ips.txt

    I’d also recommend adding a comment to Step 1 reminding users to fill in their EC2 access info for whirr.credential and whirr.identity.

    I missed this implicit step and the resulting Java error message was a bit cryptic and only stated “Empty key”. The Whirr installation guide (link below) helped me realize my mistake:

    https://ccp.cloudera.com/display/CDHDOC/Whirr+Installation

  • Mehul / September 26, 2013 / 1:22 PM

    Nice stuff. Can you tell me how you installed java on the rest of the machines that were launched by whirr? Is cloudera manager responsible for installing even java on these machines?
