Setting up CDH3 Hadoop on my new Macbook Pro

This is a guest re-post courtesy of Arun Jacob, Data Architect at Disney; prior to that he was an engineer at RichRelevance and Evri. For the last couple of years, Arun has been focused on data mining and information extraction, using a mix of custom and open source technologies.

A New Machine

I’m fortunate enough to have recently received a Macbook Pro, 2.8 GHz Intel dual core, with 8GB RAM. This is the third time I’ve turned a vanilla mac into a ninja coding machine, and following my design principle of “first time = coincidence, second time = annoying, third time = pattern”, I’ve decided to write down the details for the next time.

Baseline

This section details the pre-hadoop installs I did.

Java

Previously I was running Leopard (10.5) and had to install SoyLatte to get a recent version of Java. In Snow Leopard, JDK 1.6.0_22 is installed by default. That’s good enough for me, for now.
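A quick check from the terminal confirms which JDK is on the box; on a stock Snow Leopard install it should report 1.6.0_22:

    java -version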

Gcc, etc.

In order to get these on the box, I had to install XCode, making sure to check the ‘linux dev tools’ option.

MacPorts

I installed MacPorts in case I needed to upgrade any native libs or tools.

Eclipse

I downloaded the 64 bit Java EE version of Helios.

Tomcat

Tomcat is part of my daily fun, and these instructions to install tomcat6 were helpful. One thing to note is that in order to access the tomcat manager panel, you also need to specify the manager role prior to defining the user that carries it in tomcat-users.xml.

Also, I run tomcat standalone (no httpd), so the mod_jk install part didn’t apply. Finally, I chose not to daemonize tomcat because this is a dev box, not a server, and the instructions for compiling and using jsvc for 64 bit sounded iffy at best.
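For reference, the relevant chunk of conf/tomcat-users.xml looks roughly like this (the username and password here are placeholders, not what I actually used):

    <tomcat-users>
      <role rolename="manager"/>
      <user username="admin" password="changeme" roles="manager"/>
    </tomcat-users>

Restart tomcat after editing the file and the manager app at /manager/html should let you in.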

Hadoop

I use the CDH distro. The install was amazingly easy, and their support rocks. Unfortunately, they don’t have a dmg that drops Hadoop on the box configured and ready to run, so I need to build up my own pseudo mac node. This is what I want my mac to have (for starters):

  1. distinct processes for the namenode, job tracker, and datanode/tasktracker
  2. a formatted HDFS
  3. Pig 0.8.0

I’m not going to try to auto-start Hadoop because (again) this is a dev box; start-all.sh should handle bringing up the JVMs for the namenode, job tracker, and datanode/tasktracker.

I am installing CDH3, because I’ve been running it in pseudo-distributed mode on my Ubuntu dev box for the last month and have had no issues with it. Also, I want to run Pig 0.8.0, and that version may have some assumptions about the version of Hadoop that it needs.

All of the CDH3 Tarballs can be found at http://archive.cloudera.com/cdh/3/, and damn, that’s a lot of tarballs.

I downloaded hadoop 0.20.2+737; it’s (currently) the latest version out there. Because this is my new dev box, I decided to forgo the usual security-motivated setup of a dedicated hadoop user. When this decision comes back to bite me, I’ll be sure to update this post. In fact, for ease of permissions, etc., I decided to install under my home dir, under a CDH3 dir, so I could group all CDH3-related installs together. I symlinked the hadoop-0.20.2+737 dir to hadoop, and I’ll update the symlink if CDH3 updates their version of hadoop.

After untarring to the directory, all that was left was to make sure the ~/CDH3/hadoop/bin directory was in my .profile PATH settings.
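Concretely, the layout and PATH change look something like this (the tarball name matches the version above; the CDH3 directory under my home dir is just my own convention):

    cd ~/CDH3
    tar xzf hadoop-0.20.2+737.tar.gz
    ln -s hadoop-0.20.2+737 hadoop

    # added to ~/.profile
    export PATH=$HOME/CDH3/hadoop/bin:$PATH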

Pseudo Mode Config

I’m going to set up Hadoop in pseudo-distributed mode, just like I have on my Ubuntu box. Unlike the Debian/Red Hat CDH distros, where this is an apt-get or yum command away, I need to set up the conf files on my own.

Fortunately the example-confs subdir of the Hadoop install has a conf.pseudo subdir. I needed to modify core-site.xml a bit.
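For reference, the heart of a pseudo-distributed core-site.xml looks something like this (hdfs://localhost:8020 is the stock CDH default, and matches the port the grunt shell connects to later):

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
      </property>
    </configuration>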
hdfs-site.xml needed a little attention too.

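As a sketch, a single-node hdfs-site.xml typically drops the replication factor to 1, since there’s only one datanode to replicate to; on a Mac you may also need dfs.name.dir and dfs.data.dir pointed at local directories that exist and are writable:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>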
Finally, I symlinked the conf dir at the top level of the Hadoop install to example-confs/conf.pseudo after saving off the original conf.
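The commands for that, plus formatting HDFS and bringing up the daemons, look roughly like this (run from the top of the Hadoop install; jps is just a sanity check that the namenode, jobtracker, datanode, tasktracker, and secondary namenode JVMs all came up):

    cd ~/CDH3/hadoop
    mv conf conf.orig
    ln -s example-confs/conf.pseudo conf

    hadoop namenode -format    # one-time format of HDFS
    start-all.sh               # brings up the five daemons
    jps                        # verify the JVMs are running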

Pig

Installing Pig is as simple as downloading the tar, setting the path up, and going, sort of. The first time I ran pig, it tried to connect to the default install location of hadoop, /usr/lib/hadoop-0.20/. I made sure to set HADOOP_HOME to point to my install, and verified that the grunt shell connected to my configured HDFS (on port 8020).
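The environment setup amounts to something like the following (the pig directory name is whatever the CDH3 tarball unpacks to, so treat pig-0.8.0 here as a stand-in):

    # added to ~/.profile
    export HADOOP_HOME=$HOME/CDH3/hadoop
    export PATH=$HOME/CDH3/pig-0.8.0/bin:$PATH

    pig    # should land in the grunt shell, connected to hdfs://localhost:8020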

More To Come

This pseudo node install was relatively painless. I’m going to continue to install Hadoop/HDFS-based tools that may need more (HBase) or less (Hive) configuration, and update in successive posts.


6 Responses
  • John K / January 10, 2011 / 9:03 AM

    Thanks for the instructions.

    But there is an example in the text of how unix commands (i.e. sudo) can lead to incorrect spelling. The english word is ‘Pseudo’.

  • Fred Oliveira / January 10, 2011 / 7:25 PM

    Thanks for the instructions – need to try this out myself. Heads up, though, the MBP probably has 8gb, not MB of RAM ;-)

  • john c / January 11, 2011 / 5:05 AM

    I had better luck with HomeBrew than MacPorts

  • Bo Xiao / January 13, 2011 / 8:41 PM

    I’m confused. Are you sure XCode has Linux dev tools option?

  • Arun Jacob / January 14, 2011 / 11:42 AM

    thanks for the comments, let me respond:
    (1) John K , thanks for the spelling correction, I was always the first kid out in every spelling bee :)
    (2) Fred: likewise, my non algebraic math has always been suspect!
    (3) john c, I did not know about Homebrew, but reading about it here http://tedwise.com/2010/08/28/homebrew-vs-macports/ makes me want to try it.
    (4) Xiao, here are details on how to install Unix Dev tools (sorry, not Linux) as part of XCode: http://www.askdavetaylor.com/how_to_install_apple_developer_tools_cc_gcc_mac_os_x.html
