Setting up CDH3 Hadoop on my new Macbook Pro
This is a guest re-post courtesy of Arun Jacob, Data Architect at Disney, prior to that he was an engineer at RichRelevance and Evri. For the last couple of years, Arun has been focused on data mining/information extraction, using a mix of custom and open source technologies.
A New Machine
This section details the pre-hadoop installs I did.
Previously I was running Leopard, i.e. 10.5, and had to install SoyLatte to get the latest version of Java. In Snow Leopard, the Java JDK (1.6.0_22) is installed by default. That’s good enough for me, for now.
In order to get the standard build tools on the box, I had to install Xcode, making sure to check the ‘UNIX dev tools’ option.
I installed MacPorts in case I needed to upgrade any native libs or tools.
I downloaded the 64 bit Java EE version of Helios.
Tomcat is part of my daily fun, and these instructions to install tomcat6 were helpful. One thing to note is that in order to access the Tomcat manager panel, you also need to declare the roles in tomcat-users.xml, e.g.
<role rolename="standard"/> <role rolename="manager"/> <role rolename="admin"/>
prior to defining
<user username="admin" password="password" roles="standard,manager,admin"/>
Also, I run Tomcat standalone (no httpd), so the mod_jk install part didn’t apply. Finally, I chose not to daemonize Tomcat because this is a dev box, not a server, and the instructions for compiling and using jsvc on 64-bit sounded iffy at best.
I use the CDH distro. The install was amazingly easy, and their support rocks. Unfortunately, they don’t have a dmg that drops Hadoop on the box configured and ready to run, so I need to build up my own pseudo-distributed Mac node. This is what I want my Mac to have (for starters):
1. distinct processes for the namenode, job tracker, and datanode/tasktracker
2. formatted HDFS
3. Pig 0.8.0
I’m not going to try to auto-start Hadoop because (again) this is a dev box, and start-all.sh should handle bringing up the JVMs for the namenode, job tracker, and datanode/tasktracker.
I am installing CDH3, because I’ve been running it in pseudo-distributed mode on my Ubuntu dev box for the last month and have had no issues with it. Also, I want to run Pig 0.8.0, and that version may have some assumptions about the version of Hadoop that it needs.
All of the CDH3 Tarballs can be found at http://archive.cloudera.com/cdh/3/, and damn, that’s a lot of tarballs.
I downloaded hadoop 0.20.2+737; it’s (currently) the latest version out there. Because this is my new dev box, I decided to forgo the usual security-motivated setup of a dedicated hadoop user. When this decision comes back to bite me, I’ll be sure to update this post. In fact, for ease of permissions etc., I decided to install under my home dir, under a CDH3 dir, so I could group all CDH3-related installs together. I symlinked the hadoop-0.20.2+737 dir to hadoop, and I’ll update the symlink if CDH3 updates their version of hadoop.
After untarring to the directory, all that was left was to make sure the ~/CDH3/hadoop/bin directory was in my .profile PATH settings.
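The .profile change is a one-liner; here’s a sketch, assuming the install landed under ~/CDH3 as described:

```shell
# ~/.profile addition: put the CDH3 hadoop launcher scripts on the PATH
export PATH="$HOME/CDH3/hadoop/bin:$PATH"
# sanity check that the dir is now on the PATH
echo "$PATH" | grep -q "CDH3/hadoop/bin" && echo "PATH ok"
```

After sourcing .profile (or opening a new terminal), `hadoop` should resolve from anywhere.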
Pseudo Mode Config
I’m going to set up Hadoop in pseudo-distributed mode, just like I have on my Ubuntu box. Unlike the Debian/Red Hat CDH distros, where this is an apt-get or yum command away, I need to set up the conf files on my own.
Fortunately the example-confs subdir of the Hadoop install has a conf.pseudo subdir. I needed to modify the following in core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>changed_to_a_valid_dir_I_own</value>
</property>
and the following in hdfs-site.xml:
<property>
  <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
  <name>dfs.name.dir</name>
  <value>changed_to_a_different_dir_I_own</value>
</property>
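For context, the other single-node settings in conf.pseudo look roughly like this; the exact values here are from memory and should be treated as illustrative (the 8020 HDFS port is the one the grunt shell connects to below):

```xml
<!-- excerpts from a pseudo-distributed setup; values are illustrative -->
<!-- core-site.xml: a single local HDFS endpoint -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:8020</value>
</property>

<!-- hdfs-site.xml: one datanode, so keep one replica -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

<!-- mapred-site.xml: job tracker on the same box -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:8021</value>
</property>
```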
Finally, I symlinked the conf dir at the top level of the Hadoop install to example-confs/conf.pseudo after saving off the original conf:
mv ./conf install-conf
ln -sf ./example-confs/conf.pseudo conf
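The mv/ln dance can be sketched end to end; this just demonstrates the swap on a scratch directory standing in for ~/CDH3/hadoop, so the paths are placeholders:

```shell
#!/bin/sh
# Demonstrate the conf-dir swap on a throwaway copy of the layout.
set -e
HADOOP_HOME="$(mktemp -d)/hadoop"                # stand-in for ~/CDH3/hadoop
mkdir -p "$HADOOP_HOME/conf" "$HADOOP_HOME/example-confs/conf.pseudo"
cd "$HADOOP_HOME"
mv ./conf install-conf                           # save off the original conf
ln -sf ./example-confs/conf.pseudo conf          # point conf at the pseudo config
readlink conf
```

Running it prints the symlink target, confirming conf now points at the pseudo-distributed config.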
Installing Pig is as simple as downloading the tar, setting the path up, and going, sort of. The first time I ran pig, it tried to connect to the default install location of hadoop, /usr/lib/hadoop-0.20/. I made sure to set HADOOP_HOME to point to my install, and verified that the grunt shell connected to my configured HDFS (on port 8020).
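Pointing Pig at the right Hadoop is one more line in ~/.profile; a sketch, reusing the ~/CDH3 layout from above:

```shell
# ~/.profile addition: tell Pig which Hadoop install to use
export HADOOP_HOME="$HOME/CDH3/hadoop"
echo "HADOOP_HOME=$HADOOP_HOME"
```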
More To Come
This pseudo-distributed node install was relatively painless. I’m going to continue to install Hadoop/HDFS-based tools that may need more (HBase) or less (Hive) configuration, and update in successive posts.