How-to: Create a CDH Cluster on Amazon EC2 via Cloudera Manager


Editor’s Note (added Feb. 25, 2015): For releases beyond 4.5, Cloudera recommends the use of Cloudera Director for deploying CDH in cloud environments. 

Cloudera Manager includes a new express installation wizard for Amazon Web Services (AWS) EC2. Its goal is to let Cloudera Manager users provision CDH clusters and Cloudera Impala (the open source distributed query engine for Apache Hadoop) on EC2 as easily as possible. The wizard is intended for testing and development purposes only and is not supported for production workloads. It is currently the fastest way to provision a Cloudera Manager-managed cluster in EC2.

The new distinguishing feature introduced in version 4.5 is that Cloudera Manager can now launch and configure the instances for you, so you don’t have to worry about launching the instances, authorizing SSH keys, and configuring a firewall. All this can now be done from within Cloudera Manager! 

Since Cloudera Manager and the nodes running CDH use internal hostnames to communicate, the Cloudera Manager server must run on EC2 as well. In fact, the Cloud Express Wizard only appears when installing Cloudera Manager on EC2.

Here’s what you can do with Cloud Express Wizard:

  • Provision new EC2 instances (AWS credentials required)
  • Choose between CentOS and Ubuntu images (or a custom AMI)
  • Choose your EC2 instance type
  • Install the most recently released CDH, Cloudera Impala, and Cloudera Manager agent packages on them

And here’s what you cannot do:

  • Use pre-existing EC2 instances
  • Install older versions of CDH and Cloudera Manager, or use Parcels

I am excited to show you how this feature works. These instructions will set up a fully configured CDH cluster (all services with embedded PostgreSQL) from scratch in less than 15 minutes.

Step 1: Install Cloudera Manager Server on EC2

First, you will need to launch an EC2 instance for the Cloudera Manager server, which will require an AWS Access Key ID and AWS Secret Key; please follow these instructions if you need help getting them.

To launch the EC2 instance, go to “EC2” in the AWS web console and select “Instances” in the left menu. Before you provision the instance, select the EC2 region you want your instance to be in (dropdown in the top right corner of the web console). For this demo, you can simply use the default “N. Virginia (us-east-1)” region. Click on “Launch Instance” and select the Classic Wizard. On the next page, pick the “Ubuntu Server 12.04 LTS” 64-bit image. You need one instance of type “m1.large.” You can keep the default values of other settings and proceed to the “Create Key Pair” page.

If you don’t have an SSH key imported to EC2 already, select “Create a new Key Pair.” Enter the name of your new key pair, and click “Create and Download your key pair.” This will download a .pem file to your computer. (Important: AWS does not store the private SSH keys, so save this file or you won’t be able to SSH into the instance we’re about to launch.)

It is very important to configure the EC2 firewall correctly. On the “Configure Firewall” page choose “Create a new Security Group,” and authorize the ports for all of the following:

  • Cloudera Manager web console
  • Agent heartbeat
  • Cloudera Manager web console with TLS (optional)
  • Embedded PostgreSQL
  • ping echo
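
It can help to write the firewall openings down as data before clicking through the console. The sketch below is illustrative only. The port numbers are Cloudera Manager's usual defaults (7180, 7182, 7183, 7432) and are an assumption here, so check them against your own installation. Port 22 is included because the next step relies on SSH. The `IngressRule` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IngressRule:
    """One inbound rule for the server's security group (hypothetical type)."""
    protocol: str          # "tcp" or "icmp"
    port: Optional[int]    # None for ICMP, which has no port number
    purpose: str
    optional: bool = False


# Assumption: these are Cloudera Manager default ports; verify before relying on them.
CM_SERVER_RULES: List[IngressRule] = [
    IngressRule("tcp", 22, "SSH access to the server instance"),
    IngressRule("tcp", 7180, "Cloudera Manager web console"),
    IngressRule("tcp", 7182, "Agent heartbeat"),
    IngressRule("tcp", 7183, "Cloudera Manager web console with TLS", optional=True),
    IngressRule("tcp", 7432, "Embedded PostgreSQL"),
    IngressRule("icmp", None, "ping echo"),
]


def required_tcp_ports(rules: List[IngressRule]) -> List[int]:
    """Return the non-optional TCP ports, sorted in ascending order."""
    return sorted(
        r.port for r in rules
        if r.protocol == "tcp" and not r.optional and r.port is not None
    )


if __name__ == "__main__":
    for rule in CM_SERVER_RULES:
        target = rule.port if rule.port is not None else "echo request"
        note = " (optional)" if rule.optional else ""
        print(f"{rule.protocol:<5}{target!s:<14}{rule.purpose}{note}")
```

Keeping the rules in one list makes it easy to compare them against what the security group actually allows if an installation step later fails.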

Next, go to the last page of the wizard and launch the instance!

How to Install the Latest Version of Cloudera Manager
Once the state of the instance is “running” (provisioning usually takes less than 5 minutes), you can SSH in and install Cloudera Manager 4.5. The public hostname of the instance is listed in the instance details in the AWS console.


Download the Cloudera Manager 4.5 installer and execute it on the remote instance:
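
As a rough sketch, the two steps (connecting over SSH, then downloading and running the installer) could be assembled as below. The key file name, hostname, and download URL are placeholders to replace with your own values. The `ubuntu` login user is the usual default on Ubuntu images, and the installer file name and the need to run it as root are taken from reader reports in the comments, so treat them as assumptions too.

```python
import shlex
from typing import List

# Placeholders: substitute your own values.
KEY_FILE = "my-key-pair.pem"                        # the .pem file downloaded from AWS
PUBLIC_HOSTNAME = "<public-hostname-of-instance>"   # from the instance details in the AWS console
INSTALLER_URL = "<installer-download-url>"          # take this from Cloudera's download page
INSTALLER_FILE = "cloudera-manager-installer.bin"   # name as reported by readers


def ssh_command(key_file: str, hostname: str, user: str = "ubuntu") -> List[str]:
    """Build the ssh invocation for logging in to the server instance."""
    return ["ssh", "-i", key_file, f"{user}@{hostname}"]


def installer_commands(url: str, filename: str) -> List[List[str]]:
    """Build the commands to run on the instance: download, mark executable, run as root."""
    return [
        ["wget", "-O", filename, url],
        ["chmod", "u+x", filename],
        ["sudo", f"./{filename}"],
    ]


if __name__ == "__main__":
    # ssh ignores private keys that other users can read, so restrict the key first.
    print(shlex.join(["chmod", "400", KEY_FILE]))
    print(shlex.join(ssh_command(KEY_FILE, PUBLIC_HOSTNAME)))
    for cmd in installer_commands(INSTALLER_URL, INSTALLER_FILE):
        print(shlex.join(cmd))
```

The script only prints the commands so you can review them before running anything.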


Once the installer finishes, use the public hostname of your server instance to navigate to the web console in your browser, and then log in (the default username and password are both “admin”). If you’ve logged in successfully, congratulations!

Step 2: Installing a CDH Cluster with Cloud Express Wizard

After you log in, Cloudera Manager will detect that it is running on EC2 and greet you with the welcome screen of the new wizard (see Figure 1). The screen warns that the instances started by this installer are instance store-based, which means that stopping or terminating them destroys all data stored on them. Remember to back up important data from the cluster before terminating the instances!

Figure 1: Cloud Express Wizard

Why does Cloudera Manager prefer instance store-backed over EBS-backed AMIs? Although EBS volumes offer persistent storage, they are network-attached and charge per I/O request, so they are not suitable for Hadoop deployments. If you wish to experiment with EBS-backed instances, you can always use a custom EBS AMI.

Figure 2: Cloud Express Wizard – instance specifications

Go to the second page of the wizard (Figure 2) to specify the details about the hosts we are about to launch. Cloudera Manager detects the region it runs in, and the new instances will be installed there as well. The following attributes can be specified:

  • OS (Amazon Machine Image, AMI): Cloudera supports Ubuntu 12.04 and CentOS 6.3 images. Cloudera Manager knows which AMI to use for the specified region. If you choose to use a custom AMI (this is especially handy if you want to pre-install some tools or authorize SSH keys on your hosts), make sure the AMI is available in the specified region.
  • Instance Type: Only instance types matching the minimum requirements for CDH hosts are available. m1.medium will be sufficient for this demo. The high-storage instances (hs1.8xlarge) are not yet available but will be included in a future release of Cloudera Manager.
  • Number of Instances: You will create four instances for this demo. Although there is no limit on the number of instances, you’re likely to exceed the EC2 API request limit if you try to create more than ~20 instances at once.
  • Group name: The optional “group name” helps you identify the instances launched by the wizard; it is used as a suffix for the name, Security Group, and Key Pair of the instances.
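
If you want to sanity-check these choices before opening the wizard, the attributes above can be captured in a small structure. This is only a sketch: `ClusterRequest`, `review`, and the field names are hypothetical and are not part of Cloudera Manager, and the threshold of 20 is the approximate figure mentioned above, not a hard limit.

```python
from dataclasses import dataclass
from typing import List

SUPPORTED_IMAGES = ("Ubuntu 12.04", "CentOS 6.3")
SOFT_INSTANCE_LIMIT = 20  # above roughly this many, the EC2 API request limit is likely to be hit


@dataclass
class ClusterRequest:
    """The choices made on the instance specification page (hypothetical type)."""
    os_image: str           # one of SUPPORTED_IMAGES, or a custom AMI ID
    instance_type: str
    instance_count: int
    group_name: str = ""    # optional
    custom_ami: bool = False


def review(request: ClusterRequest) -> List[str]:
    """Return human-readable warnings; an empty list means nothing looks off."""
    warnings: List[str] = []
    if request.custom_ami:
        warnings.append("Custom AMI: make sure it is available in the region Cloudera Manager runs in.")
    elif request.os_image not in SUPPORTED_IMAGES:
        warnings.append(f"Unsupported image: {request.os_image}")
    if request.instance_count < 1:
        warnings.append("At least one instance is required.")
    if request.instance_count > SOFT_INSTANCE_LIMIT:
        warnings.append("Launching this many instances at once may exceed the EC2 API request limit.")
    return warnings


if __name__ == "__main__":
    demo = ClusterRequest("Ubuntu 12.04", "m1.medium", 4, group_name="demo")
    print(review(demo) or "No warnings")
```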

The next page (Figure 3) is the credentials page. You need to paste in the AWS Access Key ID and AWS Secret Key. Then you can choose an SSH key for the hosts; in this demo I will let Cloudera Manager generate a new key pair for my instances, and the private key will be available for download on the next page once the instances are launched. If you upload an existing private SSH key, Cloudera Manager will extract the public part and authorize it in your AWS account.

Figure 3: Cloud Express Wizard – Credentials

Proceed to the review page (Figure 4), where you can double-check your installation settings. You can easily go back to modify the settings. However, once the instances are provisioned, you must terminate them in order to make changes.

Note that if provisioning fails with “503 Error: Api Request Limit exceeded”, it’s likely because other applications (or users) are issuing API calls to the same AWS account at the same time, or because you are launching a large number of instances at once. (In testing we successfully spun up as many as 20 instances simultaneously.) This limitation will be removed in a future Cloudera Manager release.
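
If you script against the EC2 API yourself alongside the wizard, a common way to cope with request limits is to retry with exponential backoff. The sketch below shows that generic pattern; it is not what Cloudera Manager does internally. `RequestLimitExceeded` is a hypothetical stand-in for whatever throttling error your client library raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RequestLimitExceeded(Exception):
    """Hypothetical stand-in for an 'Api Request Limit exceeded' error."""


def call_with_backoff(
    action: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(), retrying on RequestLimitExceeded with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return action()
        except RequestLimitExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Double the wait each time and add jitter so parallel callers spread out.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use you would leave it at its default.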

Figure 4: Cloud Express Wizard – Review Installation

The review page indicates you are about to install the latest packages of CDH and Impala. Currently this is the only supported option in this installation wizard. If everything looks right, click the “Start Installation” button. (Note: if node installation fails because “CM failed to receive a heartbeat from Agent”, confirm that port 7182 is authorized in the Security Group of the Cloudera Manager server and retry the installation.)
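
One quick way to test whether the heartbeat port is reachable is to attempt a TCP connection from one of the cluster hosts to the server. The following is a minimal sketch using only the Python standard library; the hostname is a placeholder, and `port_is_reachable` is a name chosen for this example.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and hostnames that do not resolve.
        return False


if __name__ == "__main__":
    cm_server = "<internal-hostname-of-cm-server>"  # placeholder
    print(port_is_reachable(cm_server, 7182))
```

A port blocked by a security group usually shows up as a timeout, not an immediate refusal, so a check that takes the full timeout to return `False` is a hint to look at the firewall rules first.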

Figure 5: AWS web console – EC2 instance started by Cloudera Manager

Cloudera Manager uses jclouds to create a new key pair and security group, and to launch the EC2 instances. The new instances will also appear in your AWS EC2 console (Figure 5). You can see that the names of the security group and the key pair start with the “jclouds#” prefix. Also, all ports required for CDH have already been enabled. Provisioning new instances usually takes less than five minutes.

Once the instances are successfully provisioned, you can download the private SSH key (Figure 6). It’s a good idea to download the key in case something goes wrong and you need to SSH in to investigate the issue. However, this installation path won’t require us to do anything manually on the remote hosts.

Figure 6: Cloud Express Wizard – Instances successfully provisioned

The next screen looks familiar if you’ve used the classic express wizard in Cloudera Manager. It shows the progress of package installation on the newly provisioned hosts (Figure 7).

Figure 7: Cloud Express Wizard – Package installation

After finishing the package installation, you can proceed to the Host Inspector and Services First Run page – you’re done. Congratulations, the CDH cluster is up and running now!

Note: The hosts cannot be terminated from Cloudera Manager, so to do that you’ll need to use EC2 CLI tools or the AWS web console instead. Go to the Instances page in the AWS web console, select the instance you created for the server and all the instances launched by the wizard (hint: use the Group Name string to filter for them), and click “Actions > Terminate”.
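
If you keep your own inventory of instances, the selection described in this note can be sketched as a simple filter. The dictionary shape (`id`, `name`) and the function name are assumptions made for illustration; the code only picks the IDs and does not call AWS.

```python
from typing import Dict, List


def instances_to_terminate(
    instances: List[Dict[str, str]], group_name: str, server_id: str
) -> List[str]:
    """Return the server's ID plus the ID of every instance whose name ends with group_name."""
    selected: List[str] = []
    for inst in instances:
        name = inst.get("name", "")
        # An empty group name would match everything, so it is never used for matching.
        launched_by_wizard = bool(group_name) and name.endswith(group_name)
        if inst["id"] == server_id or launched_by_wizard:
            selected.append(inst["id"])
    return selected


if __name__ == "__main__":
    inventory = [
        {"id": "i-server", "name": "cm-server"},
        {"id": "i-node1", "name": "node-1-demo"},
        {"id": "i-other", "name": "unrelated"},
    ]
    print(instances_to_terminate(inventory, "demo", "i-server"))
```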

Emanuel Buzek is a Software Engineer on the Enterprise team.

Editor’s Note (added Feb. 28, 2014): The instructions above are deprecated for Cloudera Manager releases beyond 4.5. Please refer to this doc for instructions pertaining to releases 4.6 and later.



32 responses on “How-to: Create a CDH Cluster on Amazon EC2 via Cloudera Manager”

  1. Justin Kestelyn (@kestelyn) Post author


    I don’t believe that the Wizard supports provisioning of spot instances yet.


    A micro instance will be too small for CDH.

  2. Kris Jonsson

    Excellent how-to, Emanuel! I have everything up and running and I can submit jobs to the cluster if I first ssh into one of the nodes. Could you explain how to open up access to the JobTracker such that you can submit jobs to the EC2 cluster from a remote machine? Setting the JobTracker and NameNode addresses to the external EC2 host names of the corresponding machines in the -site.xml files on the remote machine does not work out of the box.


  3. Stu

    Thanks for publishing this. I’ve stood up a 9-node CDH cluster (1 NN, 8 DN) and now wish to add additional data nodes. How do I do that in AWS using the Manager? Thanks.

  4. Amandeep Khurana


    To use a non-EC2 instance as the client node from which you access HDFS and submit jobs, you need to do the following:

    1. Bind the Hadoop processes to so that they work with the external DNS also
    2. Open the security groups to allow traffic from the client/gateway node that they want to submit jobs from
    3. Put the external DNS entries for the JT and NN in the -site.xml files on the client/gateway node so the Hadoop client knows how to reach the cluster.

    Hope this helps


  5. Kostas Sakellis


    To add more hosts to your cluster, follow these steps:

    1. Create a host template from the Hosts page. (The host template allows you to specify what roles/configuration you would like for your hosts.)
    2. Go to the Hosts page and click “Add new Hosts to the cluster”. This wizard will allow you to provision new hosts through AWS.
    3. Apply the host template you created in 1) to your new hosts.

    Hope this helps,
    Kostas Sakellis

  6. Kostas Sakellis


    The ip-XX-XXX-XXX-XXX.ec2.internal is an AWS internal name that doesn’t resolve externally. You need to use the public host name. For example, it should be something like: You can find this by either using the AWS console, or using the AWS command line tools.
    ec2-describe-instances | grep ip-XXXXX.ec2.internal

  7. Randy Zwitch

    Thanks Kostas, I figured it out. I figured it was something similar to that, but I only tried the public host name for the m1.large, it didn’t occur to me that Hue was running off the other nodes.

  8. Jon

    There are two machine types that don’t actually work. (cc2.8xlarge and cc1.4xlarge). The reason is that these actually require special HVM images which are different from the images that CM tries to launch with. I tried to provide my own HVM images ID, but then it failed with an error saying that it can’t find hardware support.

    Any chance you could fix this so they actually work?

    Is there any way to work around this in the mean time? I guess I can just manually launch the slaves and then add the nodes in CM – would that work?

  9. Praveen

    I was able to set up a CDH Manager instance and a 2-node Hadoop cluster. I am able to log in to the Hadoop cluster as ec2-user but not as any other user.

    ssh -i cm_cloud_hosts_rsa

    ec2-user doesn’t have permissions to put data into HDFS and run MR jobs. What are the steps involved to put files into HDFS and run an MR job?


  10. Chanka Perera

    How can I disable the “Cloudera Manager Cloud Express Wizard”?

    The express wizard is currently not supported in the Sydney region, and I would like to use pre-existing EC2 instances that I have already built.

    Please let me know how to disable the express wizard and add instances manually.

  11. Stefan

    Great tutorial!
    It set up the cluster without problems.
    2 Issues:
    a) I am not able to access the hue interface.
    Could you please describe the necessary steps?

    b) How do I start the impala-shell? Or the hive shell?

    Your support is appreciated.
    Best regards,

  12. Carlos

    I was looking at the docs, and it says that this setup is not recommended for production. This post doesn’t mention production environments. Do you think you could elaborate why this setup wouldn’t be good for production? or if that is no longer the case?

    I’m trying to decide how to go about deploying CDH onto a cluster on EC2. I’ve found other methods, but this seems to be the easiest. I’m familiar with Hadoop and HBase, but don’t have much experience deploying a cluster and much less EC2.

    Thanks in advance.

  13. John

    I followed the instructions and was able to launch 4 hosts and looks like all the components were successfully installed. How do I start up job tracker, task tracker, data node and name node?

    I didn’t see any menu on the Services tab to add new services.

    This is more difficult than I thought it would be.

    Please help.

  14. Ravi

    When I try to access CDH instance using CYGWIN I get below error in Windows 7 machine

    $ ssh -i cm_cloud_hosts_rsa
    Permission denied (publickey).

    I have tried permissions from 400 to 600 to 777; nothing worked. Really weird.

    On a Mac I get a similar error:
    with permission 400 or 600 I get permission denied, and with anything other than that it gives this error:

    Permissions 0644 for ‘cm_cloud_hosts are too open.
    It is recommended that your private key files are NOT accessible by others.
    This private key will be ignored.

    Any suggestions as to what’s going wrong on both the Windows 7 and Mac machines?


  15. Eric Moore

    This isn’t working for me. I brought up the ubuntu instance and logged in as the ubuntu user per the instructions above. I downloaded the Cloudera Manager installer. When I execute “./cloudera-manager-installer.bin”, it says I have to be root to execute.

    When I try “sudo ./cloudera-manager-installer.bin”, it takes me through a bunch of terms acceptance and then installs. Unfortunately, though, it doesn’t seem to recognize that it is running in EC2 and after asking if I want to upgrade to full Cloudera Manager, it gives me the normal install options rather than using the cloud install wizard. Any suggestions?

  16. Sanjeev Kumar

    The instructions are very clear and I could set up the full CDH cluster successfully within a few minutes. Basically I wanted to use it to test Impala on HDFS and HBase, and it seems to be working fine for me.

    I have a question at this point. I would like to automate provisioning of the CDH and program these steps which currently we are doing manually using Web UI. It will be nice if Chef can be avoided – earlier we tried provisioning CDH using Chef scripts and unfortunately it was a painful experience though we could bring up cluster there as well.

    Last but not least, if it’s possible to automate provisioning CDH, does this mean we can also automate bringing up the first instance, where Cloudera Manager is installed?

  17. Mark Kerzner

    Hi, Emanuel, thank you for the useful post. How can I improve the health of the cluster? It comes up in concerning and goes to bad pretty soon afterwards.

    I believe it is because of the small 10GB hard drive on root instances.

    Thank you.

  18. Sassoon Kosian

    I installed CDH4 on AWS using a single instance, and it worked fine. In order to save costs I stopped the instance and started it again later. When I started it, Hadoop was not running (or maybe it was running but everything failed). hadoop fs -ls gave me this error:

    Call From ip-10-147-146-227.ec2.internal/ to ip-10-147-146-227.ec2.internal:8020 failed on connection exception: Connection refused; For more details see:

    How do I get it fixed?


  19. Nathan D

    The instructions are working and letting me set up the cluster with no errors, but before it gets to the point where it would start up all the services, it says “Server Error No hosts found”. Any idea why this is happening?

    When I try to manually start the services, starting with HDFS, it sets up the data nodes but fails when setting up the name node.

  20. King

    Are there any ways to stop the instances for cloudera impala on AWS? there is only terminating. But after terminating, you have to re-install Impala every time. How to re-start the instances after stopping it to avoid being billed?

  21. dejan

    Is it possible to connect to the EC2 cluster from a local machine, i.e. put files into HDFS and start MapReduce jobs from the local machine?

  22. dejan

    Can we save an AMI with Cloudera Manager so there is no need to go through the installation process every time we want to create a new cluster? Also, is it possible to create a cluster using scripts from a machine that is not on EC2?

  23. Matt

    I’m having the same problem as this guy:

    To summarize, I get up to the 6th panel “Provisioning
    requested instances”. And then get the following error:
    message=’The maximum number of rules per security group has been reached.’

    I tried to go to the forum link posted by @KESTELYN but it’s broken… nothing works today! :)

    Please help!

    1. Justin Kestelyn (@kestelyn) Post author

      Not at this time. CDH 5.4.0 contains an unsupported preview of SQL query (Impala) over S3, however.