Experimenting with MapReduce 2.0

In Building and Deploying MR2, we presented a brief introduction to MapReduce in Apache Hadoop 0.23 and focused on the steps to setup a single-node cluster. In MapReduce 2.0 in Hadoop 0.23, we discussed the new architectural aspects of the MapReduce 2.0 design. This blog post highlights the main issues to consider when migrating from MapReduce 1.0 to MapReduce 2.0. Note that both MapReduce 1.0 and MapReduce 2.0 are included in CDH4.

It is important to note that, at the time of writing this blog post, MapReduce 2.0 is still Alpha, and it is not recommended to use it in production.

In the rest of this post, we shall first discuss the Client API, followed by configurations and testing considerations, and finally commenting on the new changes related to the Job History Server and Web Servlets. We will use the terms MR1 and MR2 to refer to MapReduce in Hadoop 1.0 and Hadoop 2.0, respectively.

Client API

The Java client APIs for MR2 are compatible with the corresponding APIs in MR1 (see for example JobClient). There is no change required to any code that was written using the old client APIs; applications using such client APIs can directly switch to use MR2. This is applicable to MR1 in both CDH3 and CDH4.

However, it is important to note that users have to recompile applications that use the org.apache.hadoop.mapreduce API and also pipes applications (see jira) when moving from CDH3 to CDH4 (or generally from Hadoop 1 to 2).

CDH4 provides a new Hadoop-Client Maven-based way of managing client-side Hadoop API dependencies; that eliminates the need to figure out the exact names and locations of all the needed Hadoop JAR files. This wasn’t an issue in prior Hadoop versions since the release contained a single Jar file, but it is essential with the new Maven-based releases, which contain multiple Jar files.

Configurations

Configuration properties are heavily used in MapReduce and Hadoop in general. While the Java client API is compatible in MR1 and MR2, configuration properties are generally not. Due to the major changes that were introduced in MR2, a number of old configuration properties are no longer valid and new properties were added. Please refer to this list for configuration property names that are deprecated in Hadoop 2, and their replacements. There is also an open jira for removing unused MR1 configurations.

One new key configuration property is mapreduce.framework.name; this property is now used to control if the MapReduce job will be submitted to a YARN cluster or will be run locally using the LocalJobRunner. The valid values for this property are yarn and local. Another property you need to specify if running the resource manager on a separate host is yarn.resourcemanager.address which should be set to a <host>:<port> combination. These properties roughly correspond to the MR1 property mapred.job.tracker that was either set to a <host>:<port> combination for submitting the job to the JobTracker, or set to local for using the LocalJobRunner. This old MR1 property is no longer valid in MR2 and, if specified, will be ignored by the system.

As a general rule, all properties referring to the JobTracker or TaskTracker are no longer valid in MR2. Corresponding and new properties for The ResourceManager, ApplicationMaster and NodeManager have been added on the other hand.

Testing using MapReduce Mini Cluster

If a MapReduce cluster is needed for testing, developers can use the newly added MiniMRClientClusterFactory instead of directly constructing a MiniMRCluster (deprecated in MR2). This MiniMRClientClusterFactory provides a wrapper MiniMRClientCluster interface around the MiniMRYarnCluster. This same factory was also added to MR1 to provide an easy migration of tests between MR1 and MR2.

MapReduce JobHistory Server

As previously described in MapReduce 2.0 in Hadoop 0.23,the JobTracker no longer exists in MR2, and the job life cycle management functionality is now the responsibility of the short-lived Application Masters. For this reason, a new MapReduce JobHistory server was added to MR2, which maintains information about submitted MapReduce jobs after their Application Master terminates. The Resource Manager Web UI manages such forwarding of requests to the JobHistory server when the Application Master completes.

Web Servlets

Another class of incompatibility between MR2 and MR1 is the use of HTTP servlets. A number of the MR1 servlets are no longer available in MR2 and new servlets were added. Depending on what your old MR1 application was using, there should be a way to achieve the same or similar functionality using MR2. An example is the TaskLogServlet, which in MR2 is part of the MapReduce Job History Server. There is an open jira to document these changes.

MR2 introduces a number of promising features and enhancements, but there are considerations to keep in mind when migrating your applications. We highlighted some of these main considerations in an effort to help users who are interested in experimenting with this new system.

References:

Please refer to the complete API docs for additional details.

Also refer to the CDH MapReduce 2.0 API Known Issues

Filed under:

2 Responses
  • Liyin Liang / July 22, 2012 / 3:42 AM

    >>> “it is important to note that users have to recompile applications that use the org.apache.hadoop.mapreduce API and also pipes applications (see jira) when moving from CDH3 to …”

    How about the jobs which use the org.apache.hadoop.mapred APIs? Need recompile?

Leave a comment


nine × 4 =