How-to: Create a CDH Cluster on Amazon EC2 via Cloudera Manager

Editor’s Note (added Feb. 28, 2014): The instructions below are deprecated for Cloudera Manager releases beyond 4.5. Please refer to this doc for instructions pertaining to releases 4.6 and later.

Cloudera Manager includes a new express installation wizard for Amazon Web Services (AWS) EC2. Its goal is to let Cloudera Manager users provision CDH clusters and Cloudera Impala (the open source distributed query engine for Apache Hadoop) on EC2 as easily as possible. The wizard is intended for testing and development only and is not supported for production workloads; within that scope, it is currently the fastest way to provision a Cloudera Manager-managed cluster in EC2.

The distinguishing new feature in version 4.5 is that Cloudera Manager can now launch and configure the instances for you, so you don’t have to launch the instances, authorize SSH keys, or configure a firewall yourself. All of this can now be done from within Cloudera Manager!

Since Cloudera Manager and the nodes running CDH use internal hostnames to communicate, the Cloudera Manager server must run on EC2 as well. In fact, the Cloud Express Wizard only appears when installing Cloudera Manager on EC2.

Here’s what you can do with Cloud Express Wizard:

  • Provision new EC2 instances (AWS credentials required)
  • Choose between CentOS and Ubuntu images (or a custom AMI)
  • Choose your EC2 instance type
  • Install the most recently released CDH, Cloudera Impala, and Cloudera Manager agent packages on them

And here’s what you cannot do:

  • Use pre-existing EC2 instances
  • Install older versions of CDH and Cloudera Manager, or use Parcels

I am excited to show you how this feature works. These instructions will set up a fully configured CDH cluster (all services with embedded PostgreSQL) from scratch in less than 15 minutes.

Step 1: Install Cloudera Manager Server on EC2

First, you will need to launch an EC2 instance for the Cloudera Manager server, which will require an AWS Access Key ID and AWS Secret Key; please follow these instructions if you need help getting them.

To launch the EC2 instance, go to “EC2” in the AWS web console and select “Instances” in the left menu. Before you provision the instance, select the EC2 region you want your instance to be in (dropdown in the top right corner of the web console). For this demo, you can simply use the default “N. Virginia (us-east-1)” region. Click on “Launch Instance” and select the Classic Wizard. On the next page, pick the “Ubuntu Server 12.04 LTS” 64-bit image. You need one instance of type “m1.large.” You can keep the default values of the other settings and proceed to the “Create Key Pair” page.

If you don’t have an SSH key imported to EC2 already, select “Create a new Key Pair.” Enter the name of your new key pair, and click “Create and Download your key pair.” This will download a .pem file to your computer. (Important: AWS does not store the private SSH keys, so save this file or you won’t be able to SSH into the instance we’re about to launch.)

It is very important to configure the EC2 firewall correctly. On the “Configure Firewall” page choose “Create a new Security Group,” and authorize all the ports listed below:

Protocol   Port   Purpose
TCP        22     SSH
TCP        7180   Cloudera Manager web console
TCP        7182   Agent heartbeat
TCP        7183   Cloudera Manager web console with TLS (optional)
TCP        7432   Embedded PostgreSQL
ICMP       -1     Ping echo
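If you prefer to script the firewall instead of clicking through the wizard, the port list above maps naturally onto the rule format used by the EC2 API. The following is only a sketch: the helper name `to_ip_permissions` and the wide-open `0.0.0.0/0` source range are my assumptions for illustration, and the boto3 call in the trailing comment is shown but not executed.

```python
# Rules copied from the port list above: (protocol, port, purpose).
# For ICMP, a port value of -1 means "all ICMP types" in the EC2 API.
RULES = [
    ("tcp", 22, "SSH"),
    ("tcp", 7180, "Cloudera Manager web console"),
    ("tcp", 7182, "Agent heartbeat"),
    ("tcp", 7183, "Cloudera Manager web console with TLS (optional)"),
    ("tcp", 7432, "Embedded PostgreSQL"),
    ("icmp", -1, "Ping echo"),
]


def to_ip_permissions(rules, cidr="0.0.0.0/0"):
    """Convert (protocol, port, purpose) tuples into EC2 IpPermissions dicts.

    The default CIDR opens each port to the whole internet, which is an
    assumption made for brevity; pass a narrower range where possible.
    """
    return [
        {
            "IpProtocol": protocol,
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr}],
        }
        for protocol, port, _purpose in rules
    ]


# Hypothetical usage with boto3 (the group ID is a placeholder you supply):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   ec2.authorize_security_group_ingress(
#       GroupId=my_group_id, IpPermissions=to_ip_permissions(RULES))
```

Keeping the rules as plain data makes it easy to review them against the list above before anything touches your AWS account.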

Next, go to the last page of the wizard and launch the instance!

How to Install the Latest Version of Cloudera Manager
Once the state of the instance is “running” (provisioning usually takes less than five minutes), you can SSH in and install Cloudera Manager 4.5. The public hostname of the instance is listed in the instance details in the AWS console.

 

Download the Cloudera Manager 4.5 installer and execute it on the remote instance:
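As a rough sketch of this step, the Python helper below only assembles the ssh command line; it does not run anything. Several details are assumptions on my part: `ubuntu` as the login user (the usual default on Ubuntu AMIs), the installer file name `cloudera-manager-installer.bin`, the need for `sudo`, and the `installer_url` argument, which you must fill in with the real download location for your release.

```python
import shlex

# Assumed installer file name; adjust if your download is named differently.
INSTALLER = "cloudera-manager-installer.bin"


def install_command(key_path, public_host, installer_url, user="ubuntu"):
    """Build an ssh argv that downloads and runs the installer remotely.

    installer_url is supplied by the caller; no default is given because
    the correct location depends on the Cloudera Manager release.
    """
    remote = " && ".join([
        f"wget -O {INSTALLER} {shlex.quote(installer_url)}",
        f"chmod u+x {INSTALLER}",
        f"sudo ./{INSTALLER}",  # assumption: the installer must run as root
    ])
    # -t allocates a terminal so an interactive installer can prompt you.
    return ["ssh", "-t", "-i", key_path, f"{user}@{public_host}", remote]


# Hypothetical usage (runs the command for real, so it is left commented out):
#   import subprocess
#   subprocess.run(install_command("my-key.pem", public_hostname, url), check=True)
```

Returning an argument list rather than one shell string avoids local quoting problems with the key path or hostname.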

 

Once the installer finishes, use the public hostname of your server instance to navigate in your browser to http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7180, and then log into the web console (the default username and password are both “admin”). If you’re successfully logged in, congratulations!

Step 2: Installing a CDH Cluster with Cloud Express Wizard

After you log in, Cloudera Manager will detect that it is running on EC2 and will greet you with the welcome screen of the new wizard (see below). There is a warning that the instances started by this installer are instance store-based, which means that stopping or terminating these instances results in losing all data stored on them. Remember to back up important data from the cluster before terminating the instances!

Figure 1: Cloud Express Wizard

Why does Cloudera Manager prefer instance store-backed over EBS-backed AMIs? Although EBS volumes offer persistent storage, they are network-attached and charge per I/O request, so they are not suitable for Hadoop deployments. If you wish to experiment with EBS-backed instances, you can always use a custom EBS AMI.


Figure 2: Cloud Express Wizard – instance specifications

Go to the second page of the wizard (Figure 2) to specify the details about the hosts we are about to launch. Cloudera Manager detects the region it runs in, and the new instances will be installed there as well. The following attributes can be specified:

  • OS (Amazon Machine Image, AMI): Cloudera supports Ubuntu 12.04 and CentOS 6.3 images. Cloudera Manager knows which AMI to use for the specified region. If you choose to use a custom AMI (this is especially handy if you want to pre-install some tools or authorize SSH keys on your hosts), make sure the AMI is available in the specified region.
  • Instance Type: Only instance types matching the minimum requirements for CDH hosts are available. m1.medium will be sufficient for this demo. The high-storage instances (hs1.8xlarge) are not yet available but will be included in a future release of Cloudera Manager.
  • Number of Instances: You will create four instances for this demo. Although there is no limit on the number of instances, you’re likely to exceed the EC2 API request limit if you try to create more than ~20 instances at once.
  • Group name: The optional “group name” is there to help you identify the instances launched by the wizard, and it will be used as a suffix for the name, Security Group, and Key Pair of the instances.

The next page (Figure 3) is the credentials page. You need to paste in the AWS Access Key ID and AWS Secret Key. Then you can choose an SSH key for the hosts; in this demo I will let Cloudera Manager generate a new key pair for my instances, and the private key will be available for download on the next page once the instances are launched. If you upload an existing private SSH key, Cloudera Manager will extract the public part and authorize it in your AWS account.


Figure 3: Cloud Express Wizard – Credentials

Proceed to the review page (Figure 4), where you can double-check your installation settings. You can easily go back to modify the settings. However, once the instances are provisioned, you must terminate them in order to make changes.

Note that if provisioning the instances fails with “503 Error: Api Request Limit exceeded”, it’s likely because other applications (or users) are issuing API calls to the same AWS account at the same time, or because you are launching a large number of instances at once. (In testing we successfully spun up as many as 20 instances simultaneously.) This limitation will be removed in a future Cloudera Manager release.


Figure 4: Cloud Express Wizard – Review Installation

The review page indicates you are about to install the latest packages of CDH and Impala. Currently this is the only supported option in this installation wizard. If everything looks right, click the “Start Installation” button. (Note: if node installation fails because “CM failed to receive a heartbeat from Agent”, confirm that port 7182 is authorized in the Security Group of the Cloudera Manager server and retry the installation.)
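If you do hit the heartbeat error, a quick way to check the firewall is to try a plain TCP connection to port 7182 on the Cloudera Manager server from one of the cluster hosts. The helper below is a minimal sketch using only the Python standard library; the function name `port_open` and the three-second timeout are my own choices, and a successful connection only shows that the port is reachable, not that the agent is configured correctly.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical usage from a cluster host (cm_server_host is a placeholder
# for the internal hostname of your Cloudera Manager server):
#   print(port_open(cm_server_host, 7182))
```

A result of False from a cluster host, combined with True when run on the server itself, usually points at the Security Group rather than at Cloudera Manager.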


Figure 5: AWS web console – EC2 instance started by Cloudera Manager

Cloudera Manager uses jclouds to create a new key pair and security group, and to launch the EC2 instances. The new instances will also appear in your AWS EC2 console (Figure 5). You can see that the names of the security group and the key pair start with the “jclouds#” prefix. Also, all ports required for CDH have already been enabled. Provisioning new instances usually takes less than five minutes.

Once the instances are successfully provisioned, you can download the private SSH key (Figure 6). It’s a good idea to download the key in case something goes wrong and you need to SSH in to investigate the issue. However, this installation path won’t require us to do anything manually on the remote hosts.


Figure 6: Cloud Express Wizard – Instances successfully provisioned

The next screen looks familiar if you’ve used the classic express wizard in Cloudera Manager. It shows the progress of package installation on the newly provisioned hosts (Figure 7).


Figure 7: Cloud Express Wizard – Package installation

After finishing the package installation, you can proceed to the Host Inspector and Services First Run page – you’re done. Congratulations, the CDH cluster is up and running now!

Note: The hosts cannot be terminated from Cloudera Manager, so to do that you’ll need to use the EC2 CLI tools or the AWS web console instead. Go to the Instances page in https://console.aws.amazon.com/ec2, select the instance you created for the server and all the instances launched by the wizard (hint: use the Group Name string to filter for them), and click “Actions > Terminate”.
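The same cleanup can be scripted. The sketch below assumes a boto3-style EC2 client (`describe_instances` and `terminate_instances`) and relies on the fact that the wizard uses the group name as a suffix of the key pair name; the function names and the `dry_run` default are mine. It does not follow pagination, so treat it as a starting point for small accounts, and note that it will not pick up the Cloudera Manager server instance, which was launched with your own key pair.

```python
def wizard_instance_ids(ec2_client, group_name):
    """Collect IDs of live instances whose key pair name contains group_name."""
    ids = []
    # Only the first page of results is read; large accounts need pagination.
    response = ec2_client.describe_instances()
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            state = instance.get("State", {}).get("Name")
            if state in ("shutting-down", "terminated"):
                continue  # already gone or on its way out
            if group_name in instance.get("KeyName", ""):
                ids.append(instance["InstanceId"])
    return ids


def terminate_wizard_instances(ec2_client, group_name, dry_run=True):
    """Return matching instance IDs; terminate them only if dry_run is False."""
    ids = wizard_instance_ids(ec2_client, group_name)
    if ids and not dry_run:
        ec2_client.terminate_instances(InstanceIds=ids)
    return ids


# Hypothetical usage (my_group_name is the Group Name entered in the wizard):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   print(terminate_wizard_instances(ec2, my_group_name))  # dry run first
```

Running with the default `dry_run=True` first lets you review the list before anything is destroyed, which matters here because the instances are instance store-based and their data cannot be recovered.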

Emanuel Buzek is a Software Engineer on the Enterprise team.


 

30 Responses
  • Nathan Truong / March 26, 2013 / 11:01 AM

    Can Cloudera Manager provision EC2 spot instances?

  • p / March 28, 2013 / 10:55 AM

    Can I use micro instance? or it must be m1.medium?

  • Justin Kestelyn (@kestelyn) / March 28, 2013 / 11:02 AM

    Nathan,

    I don’t believe that the Wizard supports provisioning of spot instances yet.

    P,

    A micro instance will be too small for CDH.

  • Kris Jonsson / March 30, 2013 / 10:07 AM

    Excellent HOW-TO, Emanuel! I have everything up and running and I can submit jobs to the cluster if I first ssh into one of the nodes. Could you explain how to open up access to the JobTracker such that you can submit jobs to the EC2 cluster from a remote machine? Setting the JobTracker and NameNode addresses to the external EC2 host names of the corresponding machines in the -site.xml files on the remote machine does not work out of the box.

    Kris

  • Stu / April 01, 2013 / 12:51 AM

    Thanks for publishing this. I’ve stood up a 9-node CDH cluster (1 NN, 8 DN) and now wish to add additional data nodes. How do I do that in AWS using the Manager? Thanks.

  • Amandeep Khurana / April 01, 2013 / 1:04 PM

    Kris,

    In order to use a non-EC2 instance as your client node to access HDFS and submit jobs from, you need to do the following:

    1. Bind the Hadoop processes to 0.0.0.0 so that they work with the external DNS also
    2. Open the security groups to allow traffic from the client/gateway node that they want to submit jobs from
    3. Put the external DNS entries for the JT and NN in the -site.xml files on the client/gateway node so the Hadoop client knows how to reach the cluster.

    Hope this helps

    Amandeep

  • Randy Zwitch / April 01, 2013 / 1:22 PM

    Following these directions works in terms of getting everything set up and running. However, I’m not able to get to the Hue screen, the URL never resolves.

    As a test, I’ve opened up every ICMP, TCP, and UDP ports and the http://ip-XX-XXX-XXX-XXX.ec2.internal:8888/ still doesn’t resolve.

    Any tips on how to get Hue working properly?

  • Kostas Sakellis / April 01, 2013 / 2:09 PM

    Stu,

    To add more hosts to your cluster, follow these steps:

    1. Create a host template from the Hosts page. (https://ccp.cloudera.com/display/FREE451DOC/Working+with+Host+Templates). The host template allows you to specify what roles/configuration you would like for your hosts.
    2. Go to the Hosts page and click “Add new Hosts to the cluster”. This wizard will allow you to provision new hosts through AWS
    3. Apply the host template you created in 1) to your new hosts.

    Hope this helps,
    Kostas Sakellis

  • Kostas Sakellis / April 01, 2013 / 2:23 PM

    Randy,

    The ip-XX-XXX-XXX-XXX.ec2.internal is an AWS internal name that doesn’t resolve externally. You need to use the public host name. For example, it should be something like: ecXXXXXXXX.compute-1.amazonaws.com. You can find this by either using the AWS console, or using the AWS command line tools.
    ec2-describe-instances | grep ip-XXXXX.ec2.internal

  • Randy Zwitch / April 01, 2013 / 2:35 PM

    Thanks Kostas, I figured it out. I figured it was something similar to that, but I only tried the public host name for the m1.large, it didn’t occur to me that Hue was running off the other nodes.

  • Jon / April 03, 2013 / 12:33 PM

    There are two machine types that don’t actually work. (cc2.8xlarge and cc1.4xlarge). The reason is that these actually require special HVM images which are different from the images that CM tries to launch with. I tried to provide my own HVM images ID, but then it failed with an error saying that it can’t find hardware support.

    Any chance you could fix this so they actually work?

    Is there any way to work around this in the mean time? I guess I can just manually launch the slaves and then add the nodes in CM – would that work?

  • Praveen / April 04, 2013 / 3:07 AM

    I was able to setup a CDH Manager instance and a 2 node Hadoop cluster. I am able to login into the Hadoop cluster as an ec2-user and not any other user.

    ssh -i cm_cloud_hosts_rsa ec2-user@ec2-184-72-169-149.compute-1.amazonaws.com

    ec2-user doesn’t have permissions to put data into HDFS and run MR jobs. What are the steps involved to put files into HDFS and run an MR job?

    Thanks,
    Praveen

  • Chanka Perera / April 09, 2013 / 6:39 AM

    How can i disable “Cloudera Manager Cloud Express Wizard”

    Express wizard currently not supported in Sydney region and i would like to use pre-existing EC2 instances which i have already build.

    Please let me know how to disable express wizard and add instances manually.

  • Stefan / April 18, 2013 / 1:05 PM

    Great tutorial!
    It set up the cluster without problems.
    2 Issues:
    a) I am not able to access the hue interface.
    Could you please describe the necessary steps?

    b) How do I start the impala-shell? Or the hive shell?

    Your support is appreciated.
    Best regards,
    Stefan

  • Carlos / April 29, 2013 / 1:06 PM

    I was looking at the docs, and it says that this setup is not recommended for production. This post doesn’t mention production environments. Do you think you could elaborate why this setup wouldn’t be good for production? or if that is no longer the case?

    I’m trying to decide how to go about deploying CDH onto a cluster on EC2. I’ve found other methods, but this seems to be the easiest. I’m familiar with Hadoop and HBase, but don’t have much experience deploying a cluster and much less EC2.

    Thanks in advance.

  • John / May 05, 2013 / 4:44 PM

    I followed the instructions and was able to launch 4 hosts and looks like all the components were successfully installed. How do I start up job tracker, task tracker, data node and name node?

    I didn’t see any menu on the Services tab to add new services.

    This is more difficult than I thought it would be.

    Please help.

  • Ravi / May 06, 2013 / 10:37 AM

    When I try to access CDH instance using CYGWIN I get below error in Windows 7 machine

    $ ssh -i cm_cloud_hosts_rsa ec2-user@ec2-54-234-94-1.compute-1.amazonaws.com
    Permission denied (publickey).

    I have tried permissions from 400 to 600 to 777; nothing worked. Really weird.

    On a Mac I get a similar error:
    with permission 400 or 600 I get permission denied, and with anything other than that it gives this error:

    Permissions 0644 for ‘cm_cloud_hosts are too open.
    It is recommended that your private key files are NOT accessible by others.
    This private key will be ignored.

    Any suggestions as to what’s going wrong on both the Windows 7 and Mac machines?

    Thanks

  • Justin Kestelyn (@kestelyn) / May 07, 2013 / 3:20 PM

    John,

    It sounds like you may need a Cloudera Manager primer. There are some demo videos here:

    http://www.cloudera.com/content/cloudera/en/resources/library.html?category=cloudera-resources%3Ausing-cloudera%2Fproduct-demos&q=cloudera+manager

    (albeit for Cloudera Manager 4.0)

  • Eric Moore / June 04, 2013 / 10:28 AM

    This isn’t working for me. I brought up the ubuntu instance and logged in as the ubuntu user per the instructions above. I downloaded the Cloudera Manager installer. When I execute “./cloudera-manager-installer.bin”, it says I have to be root to execute.

    When I try “sudo ./cloudera-manager-installer.bin”, it takes me through a bunch of terms acceptance and then installs. Unfortunately, though, it doesn’t seem to recognize that it is running in EC2 and after asking if I want to upgrade to full Cloudera Manager, it gives me the normal install options rather than using the cloud install wizard. Any suggestions?

  • Sanjeev Kumar / June 13, 2013 / 2:42 AM

    The instructions are very clear and I could setup the full CDH cluster successfully within few minutes. Basically I wanted to use it to test Impala on HDFS and HBASE and it seems working fine for me.

    I have a question at this point. I would like to automate provisioning of the CDH and program these steps which currently we are doing manually using Web UI. It will be nice if Chef can be avoided – earlier we tried provisioning CDH using Chef scripts and unfortunately it was a painful experience though we could bring up cluster there as well.

    Last, but not the least, If the it’s possible to automate provisioning CDH, Does this mean we can automate bringing up of first instance where Cloudera Manager is installed?

  • Ashwin Jayaprakash / July 01, 2013 / 6:17 PM

    The wizard did not take me to the auto install. After some Googling I found out that manually providing the URL parameters would bring up the EC2 cloud wizard:

    http://ec2-xxxxxx.us-west-2.compute.amazonaws.com:7180/cmf/cloud-express-wizard/specs?provider=aws-ec2

  • Mark Kerzner / July 01, 2013 / 9:23 PM

    Hi, Emanuel, thank you for the useful post. How can I improve the health of the cluster? It comes up in concerning and goes to bad pretty soon afterwards.

    I believe it is because of the small 10GB hard drive on root instances.

    Thank you.
    Mark

  • Sassoon Kosian / July 01, 2013 / 10:32 PM

    I installed CDH4 on AWS using a single instance, it worked fine. In order to save costs I stopped the instance and started later. When I started Hadoop was not running (or maybe was running but everything failed). hadoop fs -ls gave me this error:

    Call From ip-10-147-146-227.ec2.internal/10.147.146.227 to ip-10-147-146-227.ec2.internal:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

    How do I get it fixed?

    Thanks.
    Sassoon

  • Nathan D / July 12, 2013 / 8:41 AM

    The instructions are working and letting me set up the cluster with no errors, but just before the point where it would start up all the services, it says “Server Error No hosts found”. Any idea why this is happening?

    When I try to manually start the services, starting with HDFS, it sets up the data nodes but fails when setting up the name node.

  • King / July 18, 2013 / 9:43 AM

    Are there any ways to stop the instances for cloudera impala on AWS? there is only terminating. But after terminating, you have to re-install Impala every time. How to re-start the instances after stopping it to avoid being billed?

  • dejan / July 30, 2013 / 11:45 PM

    Is it possible to connect to the EC2 cluster from a local machine, i.e. put files into HDFS and start MapReduce jobs from a local machine?

  • dejan / July 30, 2013 / 11:52 PM

    Can we save an AMI with Cloudera Manager so there is no need to go through the installation process every time we want to create a new cluster? Also, is it possible to create a cluster using scripts from a machine that is not on EC2?

  • Justin Kestelyn (@kestelyn) / July 31, 2013 / 3:05 PM

    All,

    Due to the quantity of questions here, I recommend that you post them to our new Community Forum here:

    http://community.cloudera.com/t5/Cloudera-Manager-Installation/bd-p/CMInstall

  • Matt / August 08, 2013 / 3:47 PM

    I’m having the same problem as this guy:

    http://grokbase.com/t/cloudera/scm-users/1382vdbsbx/express-install-on-ec2-fail

    To summarize, I get up to the 6th panel “Provisioning
    requested instances”. And then get the following error:
    code=’RulesPerSecurityGroupLimitExceeded’,
    message=’The maximum number of rules per security group has been reached.’

    I tried to go to the forum link posted by @KESTELYN but it’s broken… nothing works today! :)

    Please help!

  • Justin Kestelyn (@kestelyn) / September 10, 2013 / 9:45 AM

    Sorry Matt,

    That link has been corrected:

    http://community.cloudera.com/t5/Cloudera-Manager-Installation/bd-p/CMInstall
