Migrating to CDH

Categories: General Hadoop HBase HDFS Hive MapReduce Pig ZooKeeper

With the recent release of CDH3b2, many users are more interested than ever to try out Cloudera’s Distribution for Hadoop (CDH). One of the questions we often hear is, “what does it take to migrate?”.

Why Migrate?

If you’re not familiar with CDH3b2, here’s what you need to know.

All versions of CDH provide:

  • RPM and Debian packages for simple installation and management.
  • Clean integration with the host operating system. Logs are in /var/log, common binaries in /usr/bin, and configuration in /etc.
  • A Cloudera support-ready distribution. As Hadoop becomes a mission critical component of your production infrastructure, you’ll want the option of engaging Cloudera for support or consulting services. Running CDH makes this process simple.

CDH3b2 additionally is:

  • A complete platform with smooth integration of popular projects such as Hive, HBase, Pig, Zookeeper, Flume, Sqoop, Oozie, and HUE. HDFS and Hadoop Map Reduce are only two parts of a larger system. CDH3b2 brings together tools frameworks to get data in and out of HDFS, coordinate complex processing pipelines, as well as process and analyze your data. Learn more about this.
  • Based on Apache Hadoop 0.20.2 with 320 patches worth of feature back ports, stability enhancements, and bug fixes.


The migration process does require a moderate understanding of Linux system administration. You should make a plan before you start. You will be restarting some critical services such as the name node and job tracker, so some downtime is necessary. Given the value of the data on your cluster, you’ll also want to be careful to take recent back ups of any mission-critical data sets as well as the name node meta-data.

Backing up your data is most important if you’re upgrading from a version of Hadoop based on an Apache Software Foundation release earlier than 0.20. There were changes in the open source HDFS implementation prior to 0.20 that force this upgrade. See the section below on compatibility for more details.

The process I’ll outline here is as follows:

  • CDH version selection
  • Options for installation
  • Installation process
  • Migration of configuration data
  • Testing your cluster

Selecting a Branch

One of the first questions you should ask yourself is what level of stability versus new features you require from Hadoop. If you’re managing a production Hadoop cluster with jobs with SLAs, you need a rock solid, production-proven Hadoop distribution. This is Cloudera’s stable or production branch. At the time of this writing, this is CDH2 based on Hadoop 0.20.1+169.89. In certain cases, features may be of greater priority, in which case, CDH3 0.20.2+320 is appropriate.

It’s important to note that both CDH2 and CDH3 pass all functional and unit tests at Cloudera. The real difference between them is that CDH2 has been in the field longer. We generally promote a release to stable when we’ve seen it running production workloads for a substantial period of time, and when the rate of issues opened against the distro in our support group tails off. We have customers running in production today on both CDH2 an CDH3.

On Compatibility

Before we dive into the installation process I’ll highlight some points on compatibility. When upgrading to CDH from an older version or another distribution of Hadoop, it’s possible that HDFS data needs to be taken through an upgrade process. This is relatively simple, but as with any upgrade of critical data, it is absolutely necessary to back up your data.

Currently, it is not necessary to perform an HDFS upgrade if you’re upgrading to CDH3 from CDH2 or Apache Hadoop versions 0.20.0 or later. In fact, any distribution of Hadoop based on Apache 0.20.0 is likely to be a clean transition without an update to HDFS required, but you should always check with the distributor.

During RPC operations, all Hadoop daemons will check to ensure they are speaking to the same exact version as themselves. This means that you cannot, at present, perform a rolling upgrade of CDH. There has been some discussion about relaxing this requirement so compatible versions of Hadoop can communicate, but this has not yet been implemented.

Installation Options

CDH is available in three forms: RPMs, debs, and tarball distributions. The preferred method of installation is usually the RPM or deb packages as they automate a lot of the work required to get CDH up and running quickly. Tarballs of CDH are useful for users on systems that do not use yum/rpm or apt/dpkg, or where you do not have root access to the host operating system.

Installing CDH

When installing CDH from from RPMs or Debian packages you will definitely want to take advantage of Cloudera’s yum or apt repository support. If you’re on a system that is not rpm or deb format packages, you can still use Cloudera’s binary tarball packages.

You should follow the normal process for installing CDH on your systems. The CDH packages should be installed on all nodes in the cluster. The rpm and deb packages of CDH will automatically create a hadoop user and group as well as SYSV init scripts as part of the install process. The CDH tarballs do not contain the init scripts and obviously do not create the hadoop user and group.

Detailed installation instructions for all formats of CDH are available.

After the packages are installed, you’ll want to make sure you set the proper daemons to start on the proper machines upon boot. There is a separate init script for each Hadoop daemon so only what is necessary is started.

Redhat example:
% chkconfig --level 3 hadoop-0.20-namenode on

Debian example:
% update-rc.d hadoop-0.20-namenode start 80 3 .

Make sure you specify the correct run level. While run level 3 is common for multiuser Linux servers, this may not be the case in your installation. You can use the runlevel command to find the currently active run level.

For now, do not start any of the Hadoop daemons.

Migrating Your Configuration

If you’re coming from older version of CDH, your configuration should already be setup with alternatives. If not, now is a good time to bring your configuration layout in line with CDH by moving your conf directory to /etc/hadoop-0.20/conf.mycluster. You should also configure alternatives to know about your new configuration. The CDH documentation covers this in detail. For now, register your new configuration with alternatives and set it to be the preferred configuration.

% alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.mycluster 100
% alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.mycluster

Users who are on systems that don’t have alternatives or who are installing CDH from tarballs should simply update the configuration files in $HADOOP_HOME/conf. Normally, $HADOOP_HOME is /usr/local/hadoop-$VERSION or /opt/hadoop-$VERSION but you can put it wherever it makes sense. This includes running CDH from your home directory if you don’t have root access.

Testing CDH

Now that CDH is installed and you’ve migrated your cluster configuration it’s time to fire up a few nodes and make sure everything is working as expected. Rather than bring up all the daemons at once, let’s focus on the name node first.

Start by logging on to the name node machine. You may want to manually rotate the log file just to minimize the noise during testing. You can do this by simply moving today’s log file to a different name.

% mv /var/log/hadoop/hadoop-hadoop-namenode-nn.mycompany.com.log \

Next, start the CDH name node daemon using the provided init script. If an HDFS upgrade is required, you can use the upgrade argument in place of start below. This will be your last chance to grab a backup of the name node’s metadata prior to starting the daemon.

% /etc/init.d/hadoop-0.20-namenode start

Note that the CDH init scripts require you to be root whereas the Apache Hadoop start-all.sh / stop-all.sh scripts should not be run as root.

It’s a good idea to check the contents of the name node log file now to ensure it has come up cleanly. You should see a warning about the name node being in safe mode due to missing blocks. This is OK because we haven’t brought up any data nodes yet. If something doesn’t look right, jump ahead to the getting help section before proceeding.

Before you start any of your data nodes, you’ll want to place the name node in safe mode manually. This will prevent the name node from “panicking” and trying to repair missing block replicas as data nodes begin to register themselves. You’ll need to run this command as the hadoop user.

% hadoop dfsadmin -safemode enter

Next start one of the data nodes and watch its logs as you did for the name node.

% /etc/init.d/hadoop-0.20-datanode start

If everything is setup correctly, you should see the data node start up, register with the name node, and start its periodic block scanner thread. You should also check the name node logs to confirm you see the data node registration message there as well. Once you’ve confirmed that things look good, you should move on to starting additional data nodes checking them in batches as you go.

After all data nodes are up and running, you can use the Hadoop fsck tool to confirm that the file system is healthy.

% hadoop fsck /

Your cluster should still be in safe mode. If the file system is healthy, you can go ahead and take it out of safe mode.

% hadoop dfsadmin -safemode leave

Follow this with a quick test of HDFS by copying a file into the file system.

% date > now.txt
% hadoop fs -put now.txt /now.txt
% hadoop fs -cat /now.txt
% hadoop fs -rm /now.txt
% rm now.txt

Congratulations! You now have HDFS running on CDH.

If you had to upgrade the HDFS data – that is, you started the init script with the upgrade option – you should do some more extensive testing of your data. Once you’ve confirmed everything is working as expected, finalize the HDFS upgrade.

% hadoop namenode -finalize

Starting and testing the map reduce daemons follows a similar procedure but is a bit simpler. Start the job tracker daemon on the proper machine and monitor the logs as you did with the name node. Once you’ve confirmed the job tracker is running, proceed with starting the task tracker daemons in groups checking the job tracker UI as you go. You should see the map and reduce task capacity increasing with each node you start. Don’t panic if the job tracker doesn’t see the nodes immediately; it can take a few seconds.

Don’t forget to start the secondary name node daemon as well. It’s usually a good idea to wait an hour or so and check the modification time on the files in the configured fs.checkpoint.dir. You should see that the files have been updated within the last hour. You can also check the secondary name node logs; you’ll see an indication things are working there as well in the form of some log messages about performing the checkpoint.

Documentation and References

In addition to the community articles and blog posts on Hadoop, Cloudera provides CDH-specific documentation at docs.cloudera.com. Here you can find information on CDH including all of its components like Hadoop, Hive, Flume, Sqoop, HUE, and others.

How to Get Help

There are a number of ways to get help if you run into trouble during your migration or if you just have questions.