How-to: Get Started with CDH on OpenStack with Sahara


The recent OpenStack Kilo release adds many features to the Sahara project, which provides a simple means of provisioning an Apache Hadoop (or Spark) cluster on top of OpenStack. This how-to, from Intel Software Engineer Wei Ting Chen, explains how to use the Sahara CDH plugin with this new release.

[Ed note: Cloudera does not provide support for Cloudera Enterprise on OpenStack. Cloudera customers should contact their representative with questions.]

Prerequisites

This how-to assumes that OpenStack is already installed. If it is not, we recommend using DevStack to build a test OpenStack environment quickly. (Note: DevStack is not recommended for production use. For production deployments, refer to the OpenStack Installation Guide.)

Sahara UI

You can use Horizon as a web-based interface to manage your Sahara environment. Log in to Horizon and look for the “Data Processing” link under the “Project” tab. You can also use the Sahara CLI to run commands in your OpenStack environment.


Login page


“Data Processing” link

Sahara CDH Plugin Configuration

Confirm that the CDH plugin is enabled in the Sahara configuration. (Currently CDH 5.0 and CDH 5.3 are supported; CDH 5.4 will be supported in an upcoming Kilo release.) In the Kilo release, Sahara enables the CDH plugin by default. If the CDH plugin is missing from your OpenStack environment, check that cdh appears in the list of enabled plugins (vanilla, hdp, cdh, spark, fake) in the Sahara configuration file (/etc/sahara/sahara.conf).
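
As a rough illustration of that check, the sketch below parses the text of a sahara.conf file and reports whether cdh is in the plugin list. It assumes the list lives in a `plugins` option under `[DEFAULT]`; the sample configuration text is made up for the example, so verify the option name against your own file.

```python
import configparser


def enabled_plugins(conf_text):
    """Return the plugin names listed under [DEFAULT] in sahara.conf text."""
    # strict=False tolerates repeated options; interpolation=None keeps
    # any literal '%' characters in other options from raising errors.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    # Assumption: the option is named "plugins" and is comma-separated.
    raw = parser.defaults().get("plugins", "")
    return [name.strip() for name in raw.split(",") if name.strip()]


# Made-up sample text; the plugin list itself is the one from the article.
SAMPLE_CONF = """\
[DEFAULT]
plugins = vanilla,hdp,cdh,spark,fake
"""

if __name__ == "__main__":
    plugins = enabled_plugins(SAMPLE_CONF)
    print("cdh enabled:", "cdh" in plugins)
```

To check a real installation, pass the contents of /etc/sahara/sahara.conf instead of `SAMPLE_CONF`.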

You can also confirm the plugin is installed via Horizon. Below is a screenshot of the Horizon page where you can confirm installed plugins.


Data processing plugins

Building a CDH Image

Before you start to use Sahara, you need to prepare an image that contains the Cloudera packages. The open source project sahara-image-elements provides scripts that create Sahara images. Below are the steps for building the image with this project.

Step 1: Check out the project from https://github.com/openstack/sahara-image-elements

Step 2: Run the project’s image-creation script with arguments that specify the OS distribution and CDH version you would like to build. Here we use CentOS and CDH 5.3 as an example.

The script downloads a base image and pre-installs all the required packages, including Cloudera Manager and the CDH packages, into the image. Image creation requires internet connectivity, and the entire process may take a long time depending on your connection speed and hardware.

Currently Sahara supports building Ubuntu and CentOS images. The Kilo release has been tested with CDH 5.0, 5.3, and 5.4. The Cloudera Manager package is downloaded into the image from the Cloudera website.

For more information, please refer to https://github.com/openstack/sahara-image-elements.
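
For the CentOS and CDH 5.3 example in Step 2, the invocation can be sketched as below. This is an assumption-heavy sketch: the script name `diskimage-create.sh` and the `-p`, `-i`, and `-v` flags are recalled from the sahara-image-elements README rather than taken from this article, so check them against the README of the branch you cloned before running anything.

```python
import shlex

SUPPORTED_OS = ("ubuntu", "centos")          # distributions named in the article
TESTED_CDH = ("5.0", "5.3", "5.4")           # CDH versions named in the article


def build_image_command(os_name="centos", cdh_version="5.3"):
    """Return the argv list for building a CDH image (flags are assumptions)."""
    if os_name not in SUPPORTED_OS:
        raise ValueError(f"unsupported OS: {os_name!r}")
    if cdh_version not in TESTED_CDH:
        raise ValueError(f"untested CDH version: {cdh_version!r}")
    return [
        "sudo", "bash", "diskimage-create.sh",
        "-p", "cloudera",     # assumed flag: plugin to build for
        "-i", os_name,        # assumed flag: base OS image
        "-v", cdh_version,    # assumed flag: CDH version
    ]


if __name__ == "__main__":
    argv = build_image_command()
    # Print the command for review instead of running it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The sketch only prints the command; run it from the checked-out project directory once you have confirmed the flags.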

Uploading the Image with Glance

You need to upload and register the image with the Glance service in your OpenStack environment. These steps use Horizon to upload the image you created in the previous step. You can also use the glance CLI to upload your own image.

Step 1: Click “Images” in “System.”
Step 2: Click “Create Image.”
Step 3: Click “Image File” from “Image Source.”
Step 4: Click “Choose File” under “Image File” and select the image you would like to upload in the file browser.
Step 5: Fill in all the required fields; for our example, we use QCOW2 as the image format. You can select other options as needed.


Creating an image
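
If you prefer the glance CLI mentioned above, the upload can be sketched as follows. The image name and file path are hypothetical placeholders, and the container format `bare` is an assumption (the article only specifies QCOW2 as the disk format).

```python
import shlex


def glance_upload_command(name, image_path, disk_format="qcow2"):
    """Return the argv list for uploading an image with the glance CLI."""
    return [
        "glance", "image-create",
        "--name", name,
        "--disk-format", disk_format,     # QCOW2, as in the Horizon example
        "--container-format", "bare",     # assumption: plain, uncontainerized image
        "--file", image_path,
    ]


if __name__ == "__main__":
    # Hypothetical name and path; substitute the file produced by your image build.
    argv = glance_upload_command("cdh-5.3-centos", "./cdh-5.3-centos.qcow2")
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The command needs your OpenStack credentials in the environment, just as any other glance CLI call does.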

Adding Tags to the Glance Image with Sahara

Next, you need to register the Glance image to Sahara and add tags to help Sahara recognize the images.

Step 1: Select “Image Registry.”
Step 2: Click “Register Image” and select an image to add a username and tags. (The username is the user account created by sahara-image-elements. By default, the Ubuntu user is “ubuntu” and the CentOS user is “cloud-user.” As for tags, add the related plugin and version tags to the image. A single image can carry multiple version tags.)


Registering an image
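
The tagging rule can be sketched as a small check: an image is usable for a given plugin and version only if it carries both tags. The tag strings `cdh` and `5.3.0` below are assumptions for illustration; use the exact plugin and version strings that Horizon offers in the registration dialog.

```python
def missing_tags(image_tags, plugin, version):
    """Return the required tags (plugin and version) absent from image_tags."""
    required = {plugin, version}
    return sorted(required - set(image_tags))


if __name__ == "__main__":
    # Hypothetical tag sets; one image may carry several version tags.
    tagged = ["cdh", "5", "5.3.0"]
    untagged = ["cdh"]
    print(missing_tags(tagged, "cdh", "5.3.0"))
    print(missing_tags(untagged, "cdh", "5.3.0"))
```

An empty result means the image carries both required tags for that plugin and version.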

Provisioning a Cluster

Sahara provides two major features. It allows you to:

  • Provision a cluster in your OpenStack environment. You can use the Sahara UI to quickly select the services you want to launch in your cluster; there are many settings you can configure during cluster creation, such as anti-affinity.
  • Submit a job to the created cluster. The purpose is to help data scientists or application developers focus on development and ignore the provisioning process.

Provision a cluster using Guides

The Guides walk you through the steps to launch a cluster.

Step 1: Select “Cluster Creation Guide” from “Data Processing Guides.”


Data processing Guides

Step 2: Select Plugin “Cloudera Plugin” and Version “5.3.”


Selecting plugin version

Step 3: Create templates and launch a cluster. After launching a cluster, you can check the “Cluster” tab to see the launched cluster.


Clusters

Provision a cluster manually

You can also create your own cluster using the node group template:

Step 1: Create node group templates. A node group template is a basic template to provision a node. You can use this template to create your own custom node and select which service you would like to run. There are several settings you need to assign:

  • OpenStack Flavor: select a flavor size for this template.
  • Availability Zone: select an availability zone for this template.
  • Storage Location: select where HDFS data is stored (ephemeral storage and Cinder volumes are currently supported).
  • Floating IP Pool: optional; select a floating IP pool for this template.
  • Select Process: the CDH plugin supports most services in CDH 5.3/CDH 5.4. After you select a process, a related tab appears at the top of the screen, where you can change the default parameters for that process.


Node group templates


Creating node group template (1)


Creating node group template (2)
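
The same settings can also be expressed as a request body for Sahara’s REST API. The sketch below is illustrative only: the field names follow my recollection of the v1.1 node group template API, and the version string, flavor ID, and process names (`HDFS_DATANODE`, `YARN_NODEMANAGER`) are hypothetical values, so compare them with what your Sahara installation reports before using them.

```python
import json


def node_group_template(name, flavor_id, node_processes,
                        plugin_name="cdh", hadoop_version="5.3.0",
                        availability_zone=None, floating_ip_pool=None,
                        volumes_per_node=0, volumes_size=0):
    """Build a node group template body (field names are assumptions)."""
    if not node_processes:
        raise ValueError("select at least one process")
    body = {
        "name": name,
        "plugin_name": plugin_name,
        "hadoop_version": hadoop_version,
        "flavor_id": flavor_id,
        "node_processes": list(node_processes),
        # 0 volumes means ephemeral storage; > 0 means Cinder volumes.
        "volumes_per_node": volumes_per_node,
    }
    if volumes_per_node > 0:
        body["volumes_size"] = volumes_size  # size in GB per volume
    if availability_zone:
        body["availability_zone"] = availability_zone
    if floating_ip_pool:
        body["floating_ip_pool"] = floating_ip_pool
    return body


# Hypothetical worker template backed by two 100 GB Cinder volumes.
worker = node_group_template(
    "cdh-worker", flavor_id="2",
    node_processes=["HDFS_DATANODE", "YARN_NODEMANAGER"],
    volumes_per_node=2, volumes_size=100,
)
print(json.dumps(worker, indent=2, sort_keys=True))
```

Optional settings are left out of the body when unset, which mirrors leaving the corresponding Horizon fields blank.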

Step 2: Create cluster templates. After creating node group templates, you need to create a cluster template by combining them. Go to the “Node Groups” tab and select the node group templates, and the number of instances of each, that you would like to run in this cluster template.


Cluster templates


Creating a cluster template

Step 3: Launch a cluster using a cluster template. Choose a cluster template, launch it, and wait for the cluster status to become “Active.”
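
If you script the launch rather than watching Horizon, waiting for “Active” can be sketched as a polling loop. Here `get_status` is a hypothetical callable that you supply (for example, a wrapper around whichever Sahara client you use); the “Error” status name is an assumption about how a failed cluster is reported.

```python
import time


def wait_for_active(get_status, timeout=1800, interval=10,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "Active", or fail on error/timeout."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status == "Active":
            return status
        if status == "Error":  # assumed name of the failure status
            raise RuntimeError("cluster entered the Error state")
        if clock() >= deadline:
            raise TimeoutError(f"cluster still {status!r} after {timeout}s")
        sleep(interval)


if __name__ == "__main__":
    # Simulated status sequence standing in for real API calls.
    statuses = iter(["Spawning", "Waiting", "Configuring", "Active"])
    print(wait_for_active(lambda: next(statuses), interval=0))
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.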

How to Run a Job

Run a job using Guides

Step 1: Select “Job Execution Guide” in “Guides.”
Step 2: Follow the steps to create the required templates.


Guided job execution

Run a job manually

Step 1: Create a Data Source. Select a data source type for your input and output files. Currently Sahara supports internal/external HDFS and Swift for the CDH plugin. If you select Swift, provide your account information.


Creating a data source
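
The distinction between the two data source types can be sketched as a helper that builds a data source description and insists on credentials only for Swift. The field names (`type`, `url`, `credentials`) reflect my recollection of the Sahara data source API, and the names, URLs, and credentials are hypothetical examples.

```python
from urllib.parse import urlparse


def data_source(name, url, user=None, password=None):
    """Build a data source description for an HDFS or Swift location."""
    scheme = urlparse(url).scheme
    if scheme == "" and url.startswith("/"):
        scheme = "hdfs"  # a bare path is treated as cluster-internal HDFS
    if scheme not in ("hdfs", "swift"):
        raise ValueError(f"unsupported data source URL: {url!r}")
    body = {"name": name, "type": scheme, "url": url}
    if scheme == "swift":
        if not (user and password):
            raise ValueError("Swift data sources need account credentials")
        body["credentials"] = {"user": user, "password": password}
    return body


if __name__ == "__main__":
    # Hypothetical locations and credentials.
    print(data_source("wordcount-in", "/user/hadoop/input"))
    print(data_source("wordcount-out", "swift://demo/output",
                      user="demo", password="secret"))
```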

Step 2: Create a Job Binary. Write your own binary and upload it via the Sahara UI. Supported storage types include the internal database and Swift.


Creating a job binary

Step 3: Create job templates. Create a job template using the job binary you created in the previous step. Choose the job type you would like to run (Pig, Hive, MapReduce, or Java) and the binary you uploaded previously; you can also add related libraries on this page.


Creating a job template

Step 4: Run a job. Select the job template you would like to run and check the job status on the Jobs page. (If the job finishes without problems, the status will be “Success.” If a job fails, it will show “Killed.”) For more information, click the cluster name to open the cluster details, which include the Cloudera Manager address.

FAQ

Finally, here are answers to some questions you may have.

  1. Is Apache Sentry integration available?
    Yes. However, although a Sentry service is on the service list, currently it needs to be configured manually. This is a limitation in the current Kilo release.
  2. Can I access Cloudera Manager directly?
    Yes, you can. Click the cluster name on the Clusters page to see the Cloudera Manager address and the password for the admin user. For security, the CDH plugin creates an individual password for each cluster.
  3. Why is my cluster status always “Waiting”?
    Confirm that all instances are reachable using ssh and ping. If they are not, check your security group settings.
  4. How do I use external HDFS as a data source?
    You need to set this up manually and make sure there are no authentication issues between the compute nodes and the external HDFS.
  5. Does the CDH plugin support Hadoop’s data locality feature?
    No, currently the plugin does not support data locality.
  6. What if I create an invalid setting for the node group template?
    Before a cluster is launched, a validation step confirms that your cluster template is valid. If you create an invalid cluster template, you will receive an error message when you launch a cluster using that template.
  7. Is there HA support for the CDH plugin?
    No, currently there is no HA support for a virtual cluster.
  8. How do I define a CDH template?
    A validation program runs when you launch a cluster template to verify that the template is workable. For more detail about the validation rules, see these docs; you can follow those rules when designing your cluster template.
  9. Can I configure the details for every service?
    Yes, you can. Check the sub-tab for the process; it appears when you select the process in the node group template.
  10. Can I scale up/down a running cluster?
    Yes, currently you can scale node groups running a NodeManager or DataNode up or down.
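
For FAQ item 3, the ssh part of the reachability check can be sketched with a plain TCP connection test against port 22. This only shows whether the port accepts connections; it does not replace ping (ICMP) or an actual ssh login, and the instance addresses are hypothetical.

```python
import socket


def port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable hosts.
        return False


def unreachable(hosts, port=22, timeout=3.0):
    """Return the hosts whose ssh port does not accept connections."""
    return [host for host in hosts if not port_open(host, port, timeout)]


if __name__ == "__main__":
    # Hypothetical instance addresses; replace with your cluster's IPs.
    print(unreachable(["192.0.2.10", "192.0.2.11"], timeout=1.0))
```

Any host in the returned list is a candidate for a security group rule that blocks ssh.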

Enjoy!


3 responses on “How-to: Get Started with CDH on OpenStack with Sahara”

  1. futuretec

    Hey there,

    thanks for the great guide!
    Unfortunately, no one (neither the official OpenStack documentation nor other users) describes how to install the Sahara project in an Ubuntu 14.04 LTS environment (default installation architecture).

    Do you have any information on how to get the proper Sahara packages for Ubuntu and install the project without using RDO/Fuel/VirtualEnv?

    Thanks.

  2. araz

    I’ve created a cluster using CDH plugin 5.4.0 with 1 master node and 2 worker nodes. Cloudera Manager reports the following warning alerts:

    hdfs01: DataNode Data Directory
    Missing required value: DataNode Data Directory
    hdfs01: HDFS Checkpoint Directories
    Missing required value: HDFS Checkpoint Directories
    hdfs01: NameNode Data Directories
    Missing required value: NameNode Data Directories
    yarn01: NodeManager Local Directories
    Missing required value: NodeManager Local Directories
    Then I looked into the master and worker node group templates. I cannot find the dfs.datanode.data.dir, dfs.namenode.name.dir, yarn.nodemanager.local-dirs, or dfs.namenode.checkpoint.dir parameters where I could set the data directory paths.

    Please let me know if I’m missing something in template configuration. I appreciate any help.

    Thanks.