Cloudera Blog · Cloud Posts

How-to: Create a CDH Cluster on Amazon EC2 via Cloudera Manager

Cloudera Manager 4.5 includes a new express installation wizard for Amazon Web Services (AWS) EC2. (This feature is also available in Cloudera Manager Free Edition.) Its goal is to let Cloudera Manager users provision CDH clusters and Cloudera Impala (the new open source distributed query engine for Apache Hadoop) on EC2 as easily as possible, which currently makes it the fastest way to provision a Cloudera Manager-managed cluster in EC2.

The distinguishing new feature is that Cloudera Manager can now launch and configure the instances for you, so you don’t have to launch instances, authorize SSH keys, or configure a firewall yourself. All of this can now be done from within Cloudera Manager!

Since Cloudera Manager and the nodes running CDH use internal hostnames to communicate, the Cloudera Manager server must run on EC2 as well. In fact, the Cloud Express Wizard only appears when installing Cloudera Manager on EC2.

From Zero to Impala in Minutes

This post was originally published by U.C. Berkeley AMPLab developer (and former Clouderan) Matt Massie on his personal blog. Matt has graciously permitted us to re-publish it here for your convenience.

Note: The post below is valid for Impala version 0.6 only and is not being maintained for subsequent releases. To deploy Impala 0.7 and later using a much easier (and also free) method, use this how-to.

Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or Apache HBase.

How-to: Deploy a CDH Cluster in Skytap Cloud

You may have seen the recent announcement from Skytap about the availability of pre-configured CDH4 templates in the Skytap Cloud public template library. Anyone who wants to try out a Cloudera Hadoop cluster, from small to large, can now do so easily in Skytap Cloud. The how-to below from Skytap’s Matt Sousely explains how.

The goal of this how-to is to spin up a 10-node Cloudera Hadoop cluster in Skytap Cloud. To begin, let’s talk about the two new Cloudera Hadoop cluster templates. The first, Cloudera CDH4 Hadoop cluster, is a 2-node Hadoop cluster template that includes 2 nodes and a management node/server. The second, Cloudera CDH4 Hadoop host, is not intended to run by itself in a configuration; rather, it contains a host VM that is ready to become another Hadoop node in a configuration based on the Cloudera CDH4 Hadoop cluster template.

To start, let’s spin up a Cloudera Hadoop cluster.

  1. Log in to Skytap Cloud
  2. Choose the Templates tab
  3. In the search box, type hadoop
  4. Select Cloudera CDH4 Hadoop cluster
  5. Click New Configuration
  6. Click Run

How-to: Set Up an Apache Hadoop/Apache HBase Cluster on EC2 in (About) an Hour

Today we bring you one user’s experience using Apache Whirr to spin up a CDH cluster in the cloud. This post was originally published here by George London (@rogueleaderr) based on his personal experiences; he has graciously allowed us to bring it to you here as well in a condensed form. (Note: the configuration described here is intended for learning/testing purposes only.)

I’m going to walk you through a (relatively) simple set of steps that will get you up and running MapReduce programs on a cloud-based, six-node distributed Apache Hadoop/Apache HBase cluster as fast as possible. This is all based on what I’ve picked up on my own, so if you know of better/faster methods, please let me know in comments!

We’re going to run our cluster on Amazon EC2, launching it with Apache Whirr and configuring it with Cloudera Manager Free Edition. Then we’ll run some basic programs I’ve posted on GitHub that parse data and load it into Apache HBase.
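
As a rough illustration of the launch step, the sketch below renders a minimal Whirr properties file for a six-node cluster on EC2. It is not the configuration used in the original post, which this excerpt does not show. The function name, the cluster name, and the choice of the `noop` role (bare instances, on the assumption that Cloudera Manager installs the services afterwards) are all illustrative. The property keys and the `${env:...}` credential lookups follow Whirr's documented configuration style.

```python
def render_whirr_properties(cluster_name, node_count, roles="noop", provider="aws-ec2"):
    """Return the text of a minimal Whirr properties file.

    Hypothetical helper for illustration only. `roles` is the role list
    applied to every instance; "noop" is assumed here so that Whirr only
    launches machines and leaves service setup to another tool.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")

    settings = {
        "whirr.cluster-name": cluster_name,
        "whirr.provider": provider,
        # Format: "<count> <role>[+<role>...]"
        "whirr.instance-templates": f"{node_count} {roles}",
        # Credentials are read from the environment when Whirr runs.
        "whirr.identity": "${env:AWS_ACCESS_KEY_ID}",
        "whirr.credential": "${env:AWS_SECRET_ACCESS_KEY}",
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


if __name__ == "__main__":
    # Example: a six-node cluster named "hbase-test" (name is an assumption).
    print(render_whirr_properties("hbase-test", 6), end="")
```

Saved to a file, output like this would typically be passed to Whirr's `launch-cluster` command through its `--config` option. Treat the values as placeholders to adapt rather than a tested recipe.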

Apache Whirr 0.7.0 has been released

Apache Whirr release 0.7.0 is now available. It includes changes covering over 50 issues, four of which were considered blockers. Whirr is a tool for quickly starting and managing clusters running on cloud services like Amazon EC2. This is the first Whirr release as a top-level Apache project (previous releases were under the auspices of the Incubator). In addition to improved overall stability, some of the highlights are described below:

Support for Apache Mahout as a deployable component is new in 0.7.0. Mahout is a scalable machine learning library implemented on top of Apache Hadoop.