Building and Deploying MR2
- by Ahmed Radwan
- November 16, 2011
- 10 comments
A number of architectural changes have been added to Hadoop MapReduce. The new MapReduce system is called MR2 (AKA MR.next). The first release version to include these changes will be Apache Hadoop 0.23.
A key change in the new architecture is the disappearance of the centralized JobTracker service. Previously, the JobTracker was responsible for provisioning the resources across the whole cluster, in addition to managing the life cycle of all submitted MapReduce applications; this typically included starting, monitoring and retrying the applications individual tasks. Throughout the years and from a practical perspective, the Hadoop community has acknowledged the problems that inherently exist in this functionally aggregated design (See MAPREDUCE-279).
In MR2, the JobTracker aggregated functionality is separated across two new components:
- Central Resource Manager (RM): Management of resources in the cluster.
- Application Master (AM): Management of the life cycle of an application and its tasks. Think of the AM as a per-application JobTracker.
The new design enables scaling Hadoop to run on much larger clusters, in addition to the ability to run non-mapreduce applications on the same Hadoop cluster. For more architecture details, the interested reader may refer to the design document at: https://issues.apache.org/jira/secure/attachment/12486023/MapReduce_NextGen_Architecture.pdf.
The objective of this blog is to outline the steps for building, configuring, deploying and running a single-node NextGen MR cluster.
In the following steps, I’ve chosen to use ~/mr2 as my working directory. Inside this directory, we’ll create a source directory for the code we’ll soon checkout, and a deploy directory for our deployment.
ahmed@ubuntu:~$ cd ~/mr2 ahmed@ubuntu:~/mr2$ mkdir source ahmed@ubuntu:~/mr2$ mkdir deploy
Make sure protbuf is in your library path or:
ahmed@ubuntu:~/mr2$ export LD_LIBRARY_PATH=/usr/local/lib
We’ll now checkout the source code from the apache git repository.
ahmed@ubuntu:~/mr2$ cd source ahmed@ubuntu:~/mr2/source$ git clone git://git.apache.org/hadoop-common.git Cloning into hadoop-common... . . ahmed@ubuntu:~/mr2/source$ cd hadoop-common/ ahmed@ubuntu:~/mr2/source/hadoop-common$ git branch * trunk
Create the deployment tar files.
ahmed@ubuntu:~/mr2/source/hadoop-common$ mvn package -Pdist -Dtar -DskipTests
Copy the created Hadoop tar file to our deploy directory and Untar it.
ahmed@ubuntu:~/mr2/source/hadoop-common$ cp ./hadoop-dist/target/hadoop-0.24.0-SNAPSHOT.tar.gz ../../deploy/. ahmed@ubuntu:~/mr2/deploy$ tar -xzvf hadoop-0.24.0-SNAPSHOT.tar.gz ahmed@ubuntu:~/mr2/deploy$ cd hadoop-0.24.0-SNAPSHOT/
Export some needed environment variables. See the following listing:
#!/bin/bash
export HADOOP_DEV_HOME=`pwd`
export HADOOP_MAPRED_HOME=${HADOOP_DEV_HOME}
export HADOOP_COMMON_HOME=${HADOOP_DEV_HOME}
export HADOOP_HDFS_HOME=${HADOOP_DEV_HOME}
export YARN_HOME=${HADOOP_DEV_HOME}
export HADOOP_CONF_DIR=${HADOOP_DEV_HOME}/conf/
export YARN_CONF_DIR=${HADOOP_DEV_HOME}/conf/
We’ll start our configuration; the configuration directory will contain the following files: core-site.xml, hdfs-site.xml, mapred-site.xml, slaves, yarn-env.sh and yarn-site.xml.
Make sure the contents of the xml files in the conf directory are as follows:
ahmed@ubuntu:~/mr2/deploy/ hadoop-0.24.0-SNAPSHOT/conf$ cat yarn-site.xml
| 1 2 3 4 5 6 7 8 9 10 11 12 |
<?xml version=”1.0″?> <configuration> <!– Site specific YARN configuration properties –> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce.shuffle</value> </property> <property> <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name> <value>org.apache.hadoop.mapred.ShuffleHandler</value> </property> </configuration> |
ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT/conf$ cat core-site.xml
| 1 2 3 4 5 6 7 8 |
<?xml version=”1.0″?> <?xml-stylesheet href=”configuration.xsl”?> <configuration> <property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> </property> </configuration> |
ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT/conf$ cat mapred-site.xml
| 1 2 3 4 5 6 7 8 |
<?xml version=”1.0″?> <?xml-stylesheet href=”configuration.xsl”?> <configuration> <property> <name> mapreduce.framework.name</name> <value>yarn</value> </property> </configuration> |
ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT/conf$ cat hdfs-site.xml
| 1 2 3 4 5 6 7 8 9 10 11 12 |
<?xml version=”1.0″?> <?xml-stylesheet href=”configuration.xsl”?> <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.permissions</name> <value>false</value> </property> </configuration> |
Now we can format our HDFS namenode as usual:
ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ bin/hadoop namenode -format
And start the HDFS services:
ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ sbin/hadoop-daemon.sh start namenode ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ sbin/hadoop-daemon.sh start datanode
And the new MR2 services
ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ sbin/yarn-daemon.sh start resourcemanager ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ sbin/yarn-daemon.sh start nodemanager ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ sbin/mr-jobhistory-daemon.sh start historyserver
Make sure all needed services are running:
ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ jps 7623 Jps 7433 ResourceManager 7587 JobHistoryServer 7495 NodeManager 7325 DataNode 7250 NameNode
You can see the resource manager web console (shown below) using this address: http://localhost:8088

Our usual namenode web console should be also up:

We can now start running some example jobs, here is how we submit our first job; the randomwriter from our examples jar:
ahmed@ubuntu:~/mr2/source/hadoop-common/hadoop-mapreduce-project$ cd $HADOOP_MAPRED_HOME ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ $HADOOP_COMMON_HOME/bin/hadoop jar $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-0.24.0-SNAPSHOT.jar randomwriter -Dmapreduce.randomwriter.bytespermap=10000 -Ddfs.blocksize=536870912 -Ddfs.block.size=536870912 output
If everything goes fine, you’ll see the job console output and finally:
2011-09-30 14:47:47,688 INFO mapreduce.Job (Job.java:monitorAndPrintJob(1245)) - Counters: 28 File System Counters FILE: BYTES_READ=1200 FILE: BYTES_WRITTEN=437730 FILE: READ_OPS=0 FILE: LARGE_READ_OPS=0 FILE: WRITE_OPS=0 HDFS: BYTES_READ=1180 HDFS: BYTES_WRITTEN=150089 HDFS: READ_OPS=70 HDFS: LARGE_READ_OPS=0 HDFS: WRITE_OPS=40 org.apache.hadoop.mapreduce.JobCounter TOTAL_LAUNCHED_MAPS=10 OTHER_LOCAL_MAPS=10 SLOTS_MILLIS_MAPS=138095 org.apache.hadoop.mapreduce.TaskCounter MAP_INPUT_RECORDS=10 MAP_OUTPUT_RECORDS=20 SPLIT_RAW_BYTES=1180 SPILLED_RECORDS=0 FAILED_SHUFFLE=0 MERGED_MAP_OUTPUTS=0 GC_TIME_MILLIS=785 CPU_MILLISECONDS=4180 PHYSICAL_MEMORY_BYTES=491077632 VIRTUAL_MEMORY_BYTES=3847532544 COMMITTED_HEAP_BYTES=162529280 org.apache.hadoop.examples.RandomWriter$Counters BYTES_WRITTEN=148669 RECORDS_WRITTEN=20 org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter BYTES_READ=0 org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter BYTES_WRITTEN=150089 Job ended: Fri Sep 30 14:47:47 PDT 2011 The job took 53 seconds.
We’ll now run the conventional wordcount example job, to see a full map & reduce job (as opposed to the randomwriter map-only job):
ahmed@ubuntu:~/mr2/deploy/hadoop-0.24.0-SNAPSHOT$ $HADOOP_COMMON_HOME/bin/hadoop jar $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-0.24.0-SNAPSHOT.jar wordcount input output2
The job counters output after successful run:
File System Counters FILE: BYTES_READ=384 FILE: BYTES_WRITTEN=87759 FILE: READ_OPS=0 FILE: LARGE_READ_OPS=0 FILE: WRITE_OPS=0 HDFS: BYTES_READ=144 HDFS: BYTES_WRITTEN=46 HDFS: READ_OPS=9 HDFS: LARGE_READ_OPS=0 HDFS: WRITE_OPS=4 org.apache.hadoop.mapreduce.JobCounter TOTAL_LAUNCHED_MAPS=1 TOTAL_LAUNCHED_REDUCES=1 DATA_LOCAL_MAPS=1 SLOTS_MILLIS_MAPS=2331 SLOTS_MILLIS_REDUCES=2353 org.apache.hadoop.mapreduce.TaskCounter MAP_INPUT_RECORDS=5 MAP_OUTPUT_RECORDS=5 MAP_OUTPUT_BYTES=56 MAP_OUTPUT_MATERIALIZED_BYTES=72 SPLIT_RAW_BYTES=108 COMBINE_INPUT_RECORDS=5 COMBINE_OUTPUT_RECORDS=5 REDUCE_INPUT_GROUPS=5 REDUCE_SHUFFLE_BYTES=72 REDUCE_INPUT_RECORDS=5 REDUCE_OUTPUT_RECORDS=5 SPILLED_RECORDS=10 SHUFFLED_MAPS=1 FAILED_SHUFFLE=0 MERGED_MAP_OUTPUTS=1 GC_TIME_MILLIS=128 CPU_MILLISECONDS=1270 PHYSICAL_MEMORY_BYTES=212226048 VIRTUAL_MEMORY_BYTES=770711552 COMMITTED_HEAP_BYTES=137433088 Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter BYTES_READ=36 org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter BYTES_WRITTEN=46
MR2 comes with new and updated web consoles that can be used to conveniently explore and monitor the system services. The Applications tab shows the submitted applications, the application’s id, user name, queue name, status, progress and other useful information. For example, here is the resource manager Applications snapshot when randomwriter and wordcount are running.

The nodes tab shows the nodes of the cluster, their addresses, in addition to health and container information. For example, here is the nodes view for our single node cluster.

The scheduler view shows useful scheduling information. In our example, we used the default FifoScheduler. And, as seen in the following snapshot, the view shows information like the queue minimum and maximum capacities, number of nodes, total and available capacities and other useful information. This snapshot was captured after running the randomwriter application but before submitting the wordcount application.

-
Phaneesh /
November 16, 2011 / 9:09 AM
Hi,
Thanks a ton for the guide!. Was waiting to try out MR2. I am trying to deploy it on Ubuntu 11.10 64 bit, Oracle Java 7 64 bit. I was able to build without any issues. However when I try to format the name node I am getting the following exception:
11/11/16 21:30:22 ERROR namenode.NameNode: Exception in namenode join
java.lang.IllegalArgumentException: URI has an authority component
at java.io.File.(File.java:397)
at org.apache.hadoop.hdfs.server.namenode.NNStorage.getStorageDirectory(NNStorage.java:311)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.(FSEditLog.java:170)
at org.apache.hadoop.hdfs.server.namenode.FSImage.(FSImage.java:125)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:604)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:732)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:799)Should I use Oracle/Sun Java 6 instead of 7?
Thanks
NP -
Ahmed Radwan /
November 21, 2011 / 5:41 PM
Thanks PHANEESH!
This doesn’t seem to be an mr2 issue. I think it a problem in setting up HDFS, which you’ll also face if trying to install older MR versions on this same system.
Have you explicitly added any other configuration properties other than the ones in the xml files above? For example: dfs.namenode.edit.dir property, if not set, then the default will be to use hadoop.tmp.dir, can you check your /tmp/hadoop-${user.name}? Are you accessing this location through a network resource referenced through a UNC path for example?
-
Marcos Ortiz /
November 28, 2011 / 4:21 PM
Wow, good tutorial, I will try to reproduce this on my Fedora 15. Regards and best wishes.
-
Patrick Wendell /
December 11, 2011 / 11:47 PM
The examples jar is now built in a slightly different location than is listed here. It’s particularly confusing because the old location still contains a jar with the correct name but the jar is empty. The new location is here:
hadoop-mapreduce-project/hadoop-mapreduce-examples/target/
Discussed here:
https://issues.apache.org/jira/browse/MAPREDUCE-3492 -
MRK /
December 19, 2011 / 2:32 AM
HI,
Thanks for the inputs. When i install apache hadoop-0.23.0 build, i am able to start namenode and datanode.
but namenode is in safemode always. the below message is shown in name node logs.WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space available on volume ‘nfs” is 0, which is below the configured reserved amount 104857600
2011-12-19 08:28:36,982 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: NameNode low on available disk space. Already in safe mode.
2011-12-19 08:28:36,988 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode is ON. Resources are low on NN. Safe mode must be turned off manually.I am seeing there is any enough space on the specified volume. Please let me know whats wrong here.
-
srikanth /
January 08, 2012 / 1:04 AM
Nice tutorial and guided us in right direction. But i got a small problem in creating examples jar in hadoop-examples using ant. i feel all the jar files are not loaded correctly and cannot able to run an example…did any one face the same problem? please let me know….
-
Varun Thacker /
February 13, 2012 / 4:03 AM
Solved my problem.
I was using Maven 2.2.1 . After reading the build notes and installing Maven 3.0+ it started the build process.
-
Pavan /
July 09, 2012 / 5:36 PM
Great tutorial and I could start almost all the daemons except the Nodemanager.
I get this following error, any idea about this ? ThanksUnrecognized option : -jvm
Could not create the Java virtual machine
Filed under
Share this post