Running existing applications on Hadoop 2 YARN

This post was authored by Zhijie Shen with Vinod Kumar Vavilapalli.

This is the sixth blog in the multi-part series on Apache Hadoop YARN – a general-purpose, distributed application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters. Other posts in this series:

Introducing Apache Hadoop YARN
Apache Hadoop YARN – Background and an Overview
Apache Hadoop YARN – Concepts and Applications
Apache Hadoop YARN – ResourceManager
Apache Hadoop YARN – NodeManager

Introduction

The beta release of Apache Hadoop 2.x has finally arrived, and we are striving hard to make the release easy to adopt, with minimal or no pain for our existing users.

As you may know, YARN evolves the compute platform of Hadoop beyond MapReduce so that it can accommodate new processing models, such as streaming and graph processing. To ease this transition, a major goal of this release was to ensure that existing MapReduce applications programmed and compiled against the previous MapReduce APIs (we'll call these MRv1 applications) can continue to run on top of YARN with little work (we'll call these MRv2 applications).

We spent a significant amount of time and effort fixing any inadvertently broken MapReduce APIs, so that the upgrade of old MapReduce applications over to YARN can happen smoothly. Most of this work can be tracked at the MAPREDUCE-5108 JIRA ticket. This post discusses backward compatibility with existing MRv1 applications and early MRv2 applications (in particular, those compiled against Hadoop 0.23) that users have written, using the examples that ship as part of Hadoop releases. We'll also talk about the compatibility of applications written on top of other frameworks like Pig, Hive, and Oozie.

Backward compatibility of MRv2 APIs

(1) Binary Compatibility of org.apache.hadoop.mapred APIs

For the vast majority of users who use the org.apache.hadoop.mapred APIs, you have to follow the three steps below:

1.

2.

3.

You read that right: you don't have to do anything! We ensure full binary compatibility, so these applications can run on YARN directly without recompilation.

You can use the jars of your existing applications that code against the mapred APIs, and use “bin/hadoop” to submit them directly to YARN!
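For illustration, here is a minimal sketch of such an MRv1-style driver, assuming a hypothetical LegacyWordCount class that reuses the library mapper and reducer shipped with the old mapred API. Compiled once against the Hadoop 1.x jars, the same binary runs unchanged on YARN:

// Minimal sketch of a hypothetical MRv1 driver that uses only the
// stable org.apache.hadoop.mapred APIs; nothing YARN-specific is needed.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class LegacyWordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LegacyWordCount.class);
    conf.setJobName("legacy-wordcount");

    // Library classes that ship with the old mapred API.
    conf.setMapperClass(TokenCountMapper.class);
    conf.setReducerClass(LongSumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // The same blocking submission call as on Hadoop 1.x; with
    // mapreduce.framework.name set to yarn, it is scheduled by YARN.
    JobClient.runJob(conf);
  }
}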

All you really need to do is point your application to a YARN installation, with HADOOP_CONF_DIR set to the corresponding configuration directory. In that configuration directory, users will find yarn-site.xml (which contains the configuration for YARN) and mapred-site.xml (which contains the configuration for MapReduce apps). Please refer to the full list of YARN configuration properties. Note that mapred.job.tracker in mapred-site.xml is no longer necessary. Instead, users need to add:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

in mapred-site.xml to run MRv1 applications on YARN. We will cover the detailed configuration changes operators need to make in one of the next blog posts.

(2) Source Compatibility of org.apache.hadoop.mapreduce APIs

Unfortunately, it proved difficult to ensure full binary compatibility for existing applications compiled against the MRv1 org.apache.hadoop.mapreduce APIs. These APIs have gone through many changes; for example, a number of classes stopped being abstract classes and changed to interfaces. Therefore, we compromised by supporting only source compatibility for the org.apache.hadoop.mapreduce APIs.

Existing applications that use the mapreduce APIs are source compatible: they can run on YARN with no code changes beyond, at most, recompilation and/or minor updates.

If MRv1 applications based on the mapreduce APIs fail to run on YARN, users should inspect their source code to check whether the mapreduce APIs are referenced. If they are, users have to recompile their applications against the MRv2 jars that ship with Hadoop 2.
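As a sketch of what source compatibility means in practice, consider a hypothetical NewApiWordCount driver like the one below: this exact source compiles against both the Hadoop 1.x and the MRv2 jars, but a binary built against 1.x may break on YARN because, for example, classes such as JobContext became interfaces in MRv2.

// Minimal sketch of a hypothetical new-API driver. Recompiling this
// unchanged source against the MRv2 jars is sufficient to run on YARN.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class NewApiWordCount {
  public static void main(String[] args) throws Exception {
    // This constructor compiles on both 1.x and 2.x (deprecated in 2.x).
    Job job = new Job(new Configuration(), "new-api-wordcount");
    job.setJarByClass(NewApiWordCount.class);

    // Library classes that ship with the new mapreduce API.
    job.setMapperClass(TokenCounterMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}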

(3) Compatibility of Command Line Scripts

Most of the command line scripts carried over from Hadoop 1.x should just work!

The only exception is MRAdmin, whose functionality was removed from MRv2 because the JobTracker and TaskTracker no longer exist. The MRAdmin functionality is now replaced by RMAdmin. The suggested way to invoke MRAdmin (and likewise RMAdmin) is through the command line, even though one can directly invoke the APIs. In YARN, when mradmin commands are executed, warning messages will appear, reminding users to use the new YARN commands (i.e., rmadmin commands). On the other hand, if users' applications programmatically invoke MRAdmin, they will break when running on top of YARN. We support neither binary nor source compatibility here.

(4) Compatibility Tradeoff between MRv1 and Early MRv2 (0.23.x) applications

Unfortunately, some APIs can be compatible either with MRv1 applications or with early MRv2 applications (in particular, applications compiled against Hadoop 0.23), but not both. Some of these APIs were exactly the same in MRv1 and MRv2 except for a return type change in the method signatures. Therefore, we were forced to trade off compatibility between the two.

  1. For the mapred APIs, we decided to make them compatible with MRv1 applications, which have a larger user base.
  2. For the mapreduce APIs, where doing so doesn't significantly break Hadoop 0.23 applications, we decided to stay compatible with 0.23, remaining only source compatible with 1.x.

Below is the list of APIs that are incompatible with Hadoop 0.23. If early Hadoop 2 adopters on 0.23.x used the following methods in their custom routines, they have to modify their code accordingly (a sketch of one such fix follows the table). For some of the problematic methods, we provide an alternative method with the same functionality and a similar method signature for MRv2 applications.

| Problematic Method (org.apache.hadoop) | Incompatible Return Type Change | Alternative Method |
|----------------------------------------|---------------------------------|--------------------|
| .util.ProgramDriver#driver | void -> int | run |
| .mapred.jobcontrol.Job#getMapredJobID | String -> JobID | getMapredJobId |
| .mapred.TaskReport#getTaskId | String -> TaskID | getTaskID |
| .mapred.ClusterStatus#UNINITIALIZED_MEMORY_VALUE | long -> int | N/A |
| .mapreduce.filecache.DistributedCache#getArchiveTimestamps | long[] -> String[] | N/A |
| .mapreduce.filecache.DistributedCache#getFileTimestamps | long[] -> String[] | N/A |
| .mapreduce.Job#failTask | void -> boolean | killTask(TaskAttemptID, boolean) |
| .mapreduce.Job#killTask | void -> boolean | killTask(TaskAttemptID, boolean) |
| .mapreduce.Job#getTaskCompletionEvents | .mapred.TaskCompletionEvent[] -> .mapreduce.TaskCompletionEvent[] | N/A |
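For instance, 0.23-era code that kills or fails a task attempt needs a small adjustment per the last two Job rows above; here is a hedged sketch, with a hypothetical TaskAttemptControl helper:

// Hypothetical sketch of adapting 0.23-era code to the MRv2 signatures
// above. On 0.23, Job#killTask and Job#failTask returned void; against
// the MRv2 jars, the killTask(TaskAttemptID, boolean) overload returns
// a boolean indicating whether the request succeeded.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskAttemptID;

public class TaskAttemptControl {
  static boolean stopAttempt(Job job, TaskAttemptID attempt, boolean shouldFail)
      throws Exception {
    // shouldFail = true marks the attempt as failed; false merely kills it.
    return job.killTask(attempt, shouldFail);
  }
}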

Running Typical Applications on YARN

(1) Running MRv1 examples on YARN

Most of the MRv1 examples continue to work on YARN, except that they are now shipped in a newly versioned jar. One exception worth mentioning is that the sleep example that used to be in hadoop-examples-1.x.x.jar is no longer in hadoop-mapreduce-examples-2.x.x.jar; it moved into the test jar hadoop-mapreduce-client-jobclient-2.x.x-tests.jar.

That exception aside, users may want to try hadoop-examples-1.x.x.jar directly on YARN. Running hadoop jar hadoop-examples-1.x.x.jar will still pick up the classes in hadoop-mapreduce-examples-2.x.x.jar. This is because Java first searches for the desired class in the system jars; if the class is not found there, it goes on to search the user jars in the classpath. hadoop-mapreduce-examples-2.x.x.jar is installed together with the other MRv2 jars on the Hadoop classpath, so the desired class (e.g., WordCount) is picked from that 2.x.x jar instead. However, it is possible to make Java pick the classes from the jar specified on the command line. Users have two options:

  1. Set HADOOP_USER_CLASSPATH_FIRST=true and HADOOP_CLASSPATH=...:hadoop-examples-1.x.x.jar as environment variables, and add mapreduce.job.user.classpath.first = true in mapred-site.xml (see the snippet after this list).
  2. Remove the 2.x.x jar from the classpath. On a multi-node cluster, the jar needs to be removed from the classpath on all the nodes.
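For option 1, the mapred-site.xml entry takes the same form as the earlier snippet; a minimal sketch:

<property>
  <name>mapreduce.job.user.classpath.first</name>
  <value>true</value>
</property>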

(2) Running Pig scripts on YARN

Pig is one of the two major data processing applications in the Hadoop ecosystem, the other being Hive. Thanks to significant efforts from the Pig community, we are happy to report that existing users' Pig scripts don't need any change! Pig on Hadoop 0.23 has been supported since Pig 0.10.0, and Pig works with Hadoop 2.x starting with Pig 0.10.1.

Existing Pig scripts that work with Pig 0.10.1 and beyond will work just fine on top of YARN!

However, versions of Pig earlier than 0.10.x may not run directly on YARN, due to some incompatible mapreduce APIs and configuration.

(3) Running Hive queries on YARN

Hive queries of existing users don't need any change to work on top of YARN starting with Hive 0.10.0, thanks to the Hive community. Support for Hive on Hadoop 0.23 and 2.x releases has been available since Hive 0.10.0.

Queries on Hive 0.10.0 and beyond will work without changes on top of YARN!

However, like Pig, earlier versions of Hive may not run directly on YARN, as those Hive releases don't support 0.23 and 2.x.

(4) Running Oozie workflows on YARN

Like Pig and Hive, the Apache Oozie community worked to make sure existing Oozie workflows run in a completely backward-compatible manner. Support for Hadoop 0.23 and 2.x is available starting with Oozie release 3.2.0.

Existing Oozie workflows can start taking advantage of YARN in 0.23 and 2.x with Oozie 3.2.0 and above!

Conclusion

This effort was possible thanks to the members of the Apache Hadoop MapReduce community. Mayank Bansal, Karthik Kambatla, and Robert Kanter also contributed patches to this effort. Thanks to the committers Arun C Murthy and Alejandro Abdelnur for helping with reviews and commits.

Thanks are also due to the various Apache Hadoop ecosystem communities, such as Pig, Hive, and Oozie, for coming together to port their frameworks over to Hadoop 2 YARN with minimal disruption to end users.

We are also happy to report that some of the installations that kick-started testing and validation of apps on Hadoop 2 YARN, a few with big clusters and good-sized user bases, are observing zero or minimal impact in porting their existing MapReduce applications! We look forward to the beta releases and smooth upgrades of all users' apps!
