Rackspace Upgrades to Cloudera’s Distribution for Apache Hadoop

Categories: Community General Guest Hadoop
Apache Hadoop moves fast. Users often find that they need to upgrade after just a few months. Upgrading can be a daunting task, especially if you are several versions behind. We’ve been working with Rackspace for a while now, and they recently embarked on an upgrade from Hadoop 0.15.3 to Cloudera’s Distribution for Hadoop based on 0.18.3. Stu Hood, Search Team Technical Lead at Rackspace, was kind enough to document their experience, and we’re happy to share it with you here. -Christophe

Upgrading to the Cloudera Distribution

Apache Hadoop plays an integral part in the email analytics performed at Rackspace Email and Apps, and our installation of Apache Hadoop 0.15.3 ran smoothly for 18 months after we deployed it in January 2008. By the time we decided to upgrade to Cloudera’s Distribution for Apache Hadoop in June 2009, our production cluster had performed almost 600,000 MapReduce jobs.

In the past, we deployed Hadoop along with our primary MapReduce application by checking the entire Hadoop distribution and our configuration into version control. Deploying a new slave for the cluster involved running custom scripts to create users and directories and to install dependencies.

There were a few important reasons to upgrade a cluster as trusty as ours to Cloudera’s Distribution for Hadoop (version 0.18.3):

  • Hadoop improves rapidly (more than 1,500 JIRA issues have been resolved since version 0.15.3 was released).
  • The Cloudera Distribution contains backported patches that are considered stable, but have not been applied to previous versions by the Apache project, such as the FairScheduler. Some of these patches fix critical bugs, add new features, or improve performance.
  • Cloudera’s configuration RPMs maintain the optimal settings for the installed version of Hadoop. Tweaking these settings manually would involve far more research than we can afford.
  • Standardizing on a Red Hat deployment infrastructure like RPM and YUM makes it much easier to track the latest stable version of Hadoop.


Configure Hadoop

In order to take advantage of Cloudera’s recommended configuration values, we decided to use Cloudera’s Configurator for Hadoop to generate the configuration that we would be using on the upgraded cluster.

We started by following the steps at https://my.cloudera.com/, using parameters that matched our current configuration. Since we were upgrading an existing cluster, it was important that the data directories matched up in our new configuration. The following table describes the mapping between entries made in the GUI and the properties in the generated configuration files:

GUI entry                                      Generated property
Step 2: NameNode Metadata Path(s)              dfs.name.dir
Step 3: Secondary NameNode Metadata Path(s)    fs.checkpoint.dir
Step 5: TaskTracker Intermediate Data Path(s)  mapred.local.dir
Step 5: HDFS Data Path(s)                      dfs.data.dir
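Concretely, those four properties land in the generated hadoop-site.xml as entries like the following (the paths here are illustrative stand-ins, not our actual values):

```xml
<!-- Illustrative hadoop-site.xml entries; substitute your own paths. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/name,/data/2/dfs/name</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/data/1/dfs/secondary</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/data,/data/2/dfs/data</value>
</property>
```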

Note that the configurator does not support the variable expansion that Hadoop’s configuration files sometimes use, such as ${hadoop.tmp.dir} expanding to the Hadoop temporary directory.

If one of your previous configuration values used variable expansion for ${username}, you would need to replace ${username} with the name of the user that you had previously used to run the Hadoop daemons. In our case, we needed to replace instances of ${username} in the dfs.data.dir and dfs.name.dir values with user “hadoopuser.”
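That substitution can be done mechanically. Here is a minimal sketch using sed on a sample value (the file path and property value are made up for illustration; in practice you would run this over a copy of your old configuration files):

```shell
# Sample configuration value using the expansion variable from our old setup.
SAMPLE=/tmp/hadoop-site-sample.xml
printf '%s\n' '<value>/disk1/hadoop-${username}/dfs/data</value>' > "$SAMPLE"

# Replace the variable with the literal user that ran the old daemons.
sed -i 's/\${username}/hadoopuser/g' "$SAMPLE"

cat "$SAMPLE"   # -> <value>/disk1/hadoop-hadoopuser/dfs/data</value>
```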

When we reached the end of the configurator, we downloaded the generated hadoop-site.xml* files and the Cloudera Repository RPM, and then recorded our repository ID. To double-check that our data directories were configured properly, we compared the values (from the table above) in the new hadoop-site.xml* files against our previous configuration. If you see any mismatches at this step, you will probably want to re-run the configurator until the resulting files are consistent.


At this point, it was time to jump into the upgrade. We installed the Cloudera Repository RPM, which we had downloaded earlier, on all machines in our cluster by walking through the steps in the config guide. After listing the available configuration packages with yum search hadoop-conf, we installed the matching package for each class of machine in the cluster using yum install $packagename. The new version of Hadoop was now installed, but not yet running.
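Since each class of machine gets a different configuration package, a small helper keeps that mapping explicit. The sketch below uses hypothetical package names (the real names come from the yum search hadoop-conf output on your own repository):

```shell
# Hypothetical mapping from machine class to configuration package name;
# confirm the actual package names with `yum search hadoop-conf`.
pkg_for() {
  case "$1" in
    master*) echo hadoop-conf-master ;;   # Namenode / JobTracker class
    *)       echo hadoop-conf-worker ;;   # Datanode / TaskTracker class
  esac
}

# On each machine one would then run: yum install "$(pkg_for "$(hostname -s)")"
pkg_for master01   # -> hadoop-conf-master
pkg_for slave07    # -> hadoop-conf-worker
```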

In order to swap out the running version of Hadoop, and create a backup of the current filesystem, we needed to follow the steps leading up to the “Install New Version” step from the Hadoop Wiki upgrade page. After walking through those preparation instructions and successfully shutting down the cluster, it was time to make the switch.
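One precaution from that preparation checklist is backing up the Namenode metadata while the cluster is fully shut down, so a failed upgrade can be rolled back. A minimal sketch, using a throwaway stand-in for dfs.name.dir:

```shell
# Stand-in for dfs.name.dir; in production, archive the real metadata
# directory after the cluster has been cleanly shut down.
NAME_DIR=/tmp/demo-name-dir
mkdir -p "$NAME_DIR/current"
echo demo > "$NAME_DIR/current/fsimage"

# Archive the whole directory tree so it can be restored if needed.
tar -czf /tmp/namedir-backup.tar.gz -C "$NAME_DIR" .
tar -tzf /tmp/namedir-backup.tar.gz
```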

Cloudera’s Distribution for Hadoop creates the user “hadoop”, which runs all of the necessary services/daemons for the cluster. If your cluster had previously been running under a different username (ours ran as “hadoopuser”), you will need to give the new user ownership of several directories. We ran…

# chown -R hadoop $directory

…for each of the following configured directories:

* dfs.data.dir
* dfs.name.dir
* fs.checkpoint.dir
* mapred.local.dir
* hadoop.tmp.dir
* /var/log/hadoop (the log directory used by the packaged daemons)
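The chown pass is easy to script. A sketch with stand-in paths, chowning to the current user so it runs unprivileged (in production we ran it as root, chowning the real configured directories to “hadoop”):

```shell
# Stand-in directories; in production this list comes from the configured
# values of dfs.data.dir, dfs.name.dir, fs.checkpoint.dir, mapred.local.dir,
# and hadoop.tmp.dir, plus /var/log/hadoop.
for directory in /tmp/demo-dfs-name /tmp/demo-dfs-data /tmp/demo-mapred-local; do
  mkdir -p "$directory"
  chown -R "$(id -un)" "$directory"   # production: chown -R hadoop "$directory"
done
```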

Once the ”hadoop” user had access to the necessary directories, we were ready to upgrade the Namenode. We ran the following command from our Namenode machine, so that the process would start in the background and begin upgrading its checkpoint:

$ sudo -u hadoop /usr/lib/hadoop/bin/hadoop-daemon.sh --config "/etc/hadoop/conf" start namenode -upgrade

We watched the “Upgrades” section of the DFS status page at http://$namenode:50070/ while waiting for the Namenode upgrade to complete, and then we started up the remaining Hadoop services on their respective machines using the instructions from the “Managing Hadoop Services” section of the config guide.
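Besides the web UI, the upgrade can also be polled from the shell with dfsadmin. The loop below is a sketch: the real status command (sudo -u hadoop hadoop dfsadmin -upgradeProgress status, run on the Namenode) is stubbed out with a function that mimics its “no upgrades in progress” message, so only the polling control flow is demonstrated:

```shell
# Stub standing in for: sudo -u hadoop hadoop dfsadmin -upgradeProgress status
status_cmd() { echo 'There are no upgrades in progress.'; }

# Poll until the Namenode reports that no upgrade is in progress.
until status_cmd | grep -q 'no upgrades'; do
  sleep 10
done
echo 'upgrade complete'
```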


Code Changes

Once our cluster was upgraded, we needed to port our Hadoop jobs to the Hadoop 0.18.3 API. There were actually only minor changes in the MapReduce and FileSystem APIs between 0.15.3 and 0.18.3:

  • Our OutputFormats needed to extend FileOutputFormat, rather than OutputFormatBase.
  • FileSystem.listPaths() was removed, in favor of .globPaths().

Finalizing the Upgrade

After verifying that our newly updated jobs were running correctly against the cluster, we were ready to make the changes permanent. The dfsadmin -finalizeUpgrade command runs in the background and cleans up the outdated copies of blocks left behind by the upgrade, freeing disk space.

$ sudo -u hadoop hadoop dfsadmin -finalizeUpgrade


Now that we’ve upgraded to the Cloudera distribution using the configurator, it will be much easier to stay at the bleeding edge of Hadoop development (or the cutting edge, if we choose stability over features). We can also add the Cloudera repository RPM to our base server image and add a single command to pull down the entire distribution from Yum. Finally, we can conveniently install the packages for Pig and Hive to give our developers more options for their processing jobs.