How-to: Include Third-Party Libraries in Your MapReduce Job

“My library is in the classpath but I still get a Class Not Found exception in a MapReduce job.” If you have this problem, this post is for you.

Java requires third-party and user-defined classes to be on the command line’s “-classpath” option when the JVM is launched. The `hadoop` wrapper shell script does exactly this for you by building the classpath from the core libraries located in the /usr/lib/hadoop-0.20/ and /usr/lib/hadoop-0.20/lib/ directories. However, with MapReduce your job’s task attempts are executed on remote nodes. How do you tell a remote machine to include third-party and user-defined classes?
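
If you are curious what the wrapper script actually builds, newer versions of the `hadoop` script include a classpath subcommand that prints it; this is just an optional sanity check and may not exist in every release:

hadoop classpath     # prints the colon-separated classpath passed to the client JVM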

MapReduce jobs are executed in separate JVMs on the TaskTrackers, and sometimes you need to use third-party libraries in the map/reduce task attempts. For example, you might want to access HBase from within your map tasks. One way to do this is to package every class used into the submittable JAR: you would have to unpack the original HBase JAR and repackage all of its classes in your submittable Hadoop JAR. Not good. Don’t do this: version compatibility issues are going to bite you sooner or later.

There are better ways to accomplish the same thing: put the JAR in the distributed cache, ship it inside your submittable JAR, or install the JAR on the Hadoop nodes and tell the TaskTrackers where to find it.

1. Include the JAR in the “-libjars” command line option of the `hadoop jar …` command. The JAR will be placed in the distributed cache and will be made available to all of the job’s task attempts. More specifically, you will find the JAR in one of the ${mapred.local.dir}/taskTracker/archive/${user.name}/distcache/… subdirectories on the local nodes. The advantage of the distributed cache is that your JAR might still be there on your next program run (at least in theory: files should be evicted from the distributed cache only when their total size exceeds the soft limit defined by the local.cache.size configuration variable, which defaults to 10GB, though your actual mileage may vary, particularly with the newest security enhancements). Hadoop keeps track of changes to the distributed cache files by examining their modification timestamps.
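
For example, a hypothetical invocation could look like the following (the JAR paths and the com.example.MyJob driver class are placeholders; as noted in the comments below, the driver must implement Tool and go through ToolRunner.run() so that the generic options, including “-libjars”, are actually parsed):

hadoop jar myjob.jar com.example.MyJob \
      -libjars /local/path/to/hbase.jar,/local/path/to/zookeeper.jar \
      input output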

*Update to post: Please note that items 2 and 3 below are deprecated starting with CDH4 and will no longer be supported starting with CDH5.

2. Include the referenced JAR in the lib subdirectory of the submittable JAR: a MapReduce job will unpack the JAR from this subdirectory into ${mapred.local.dir}/taskTracker/${user.name}/jobcache/$jobid/jars on the TaskTracker nodes and point your tasks to this directory to make the JAR available to your code. If the JARs are small, change often, and are effectively job-specific, this is the preferred method.
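
A minimal sketch of building such a JAR from the command line, assuming the compiled job classes live in build/classes and the third-party JARs sit in a local lib/ directory (an Ant or Maven build can produce the same layout):

# lib/ is added as-is from the current directory; the compiled classes are added from build/classes
jar cvf myjob.jar lib/ -C build/classes .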

3. Finally, you can install the JAR on the cluster nodes. The easiest way is to place the JAR into the $HADOOP_HOME/lib directory, as everything from this directory is included when a Hadoop daemon starts. However, since you know that only the TaskTrackers will need the new JAR, a better way is to modify the HADOOP_TASKTRACKER_OPTS option in the hadoop-env.sh configuration file. This method is preferred if the JAR is tied to the code running on the nodes, like HBase.

HADOOP_TASKTRACKER_OPTS="-classpath <colon-separated-paths-to-your-jars>"

Restart the TaskTrackers when you are done. Do not forget to update the JAR when the underlying software changes.
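
To make this concrete for the HBase case, a hadoop-env.sh entry might look like the lines below. The paths are only an assumption about where HBase lives on your nodes, and the location of hadoop-env.sh itself varies by distribution, so adjust both to your layout:

# hypothetical entry in hadoop-env.sh on every TaskTracker node
export HADOOP_TASKTRACKER_OPTS="-classpath /usr/lib/hbase/hbase.jar:/usr/lib/hbase/lib/zookeeper.jar"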

All of the above options affect only the code running on the distributed nodes. If the code that launches the Hadoop job uses the same library, you need to include the JAR in the HADOOP_CLASSPATH environment variable as well:

HADOOP_CLASSPATH="<colon-separated-paths-to-your-jars>"

Note that starting with Java 1.6 the classpath can point to directories like “/path/to/your/jars/*”, which will pick up all JARs from the given directory.
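
For instance, a client-side launch could look like this (the paths and the driver class are placeholders):

export HADOOP_CLASSPATH="/path/to/your/jars/*"
hadoop jar myjob.jar com.example.MyJob input output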

The same guiding principles apply to native code libraries that need to run on the nodes (JNI or C++ pipes). You can put them into the distributed cache with the “-files” option, include them in archive files specified with the “-archives” option, or install them on the cluster nodes. If the dynamic library linker is configured properly, the native code will be available to your task attempts. You can also modify the environment of the job’s running task attempts explicitly by specifying the JAVA_LIBRARY_PATH or LD_LIBRARY_PATH variables:

hadoop jar <your jar> [main class]
      -D mapred.child.env="LD_LIBRARY_PATH=/path/to/your/libs" ...
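
Putting those pieces together, a hypothetical run that ships a native library through the distributed cache and points the task environment at it might look like the following. The library name and paths are placeholders; a file shipped with “-files” should end up symlinked into each task attempt’s working directory, which is why LD_LIBRARY_PATH is set to “.” here (see also the comment below about java.library.path taking precedence over LD_LIBRARY_PATH):

hadoop jar myjob.jar com.example.MyJob \
      -files /local/path/to/libmynative.so \
      -D mapred.child.env="LD_LIBRARY_PATH=." \
      input output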

20 Responses
  • Craig Macdonald / January 11, 2011 / 9:16 AM

    Terrier submits MapReduce jobs from Java code. To handle jar dependencies, it walks the client classpath, uploading any jar files to the distributed cache.

    See http://terrier.org/docs/v3.0/javadoc/org/terrier/utility/io/HadoopUtility.html#saveClassPathToJob%28org.apache.hadoop.mapred.JobConf%29

  • Owen O’Malley / January 11, 2011 / 11:49 AM

    Including nested jars is a discouraged legacy method because it has a large negative performance cost.

    The distributed cache is the preferred solution. You should have mentioned using symlinks to find the loaded jars and the difference between the public and private cache. In particular, you assumed the user is using the private cache when they should ensure that they use the public cache except when the data is in fact private.

  • Craig Macdonald / January 11, 2011 / 12:32 PM

    Which of options 1, 2 or 3 above are you referring to?

  • Markus Weimer / January 11, 2011 / 1:58 PM

    According to the hadoop documentation:

    http://hadoop.apache.org/common/docs/r0.21.0/commands_manual.html#Generic+Options

    The first option does not work. -libjars supposedly cannot be combined with the ‘hadoop jar’ command.

  • Owen O’Malley / January 11, 2011 / 5:08 PM

    Craig,
    Only 1 is recommended. 2 is discouraged. 3 works for private clusters, but not shared clusters.

  • Forest / January 11, 2011 / 7:58 PM

    Unfortunately, method 1 only works before 0.18; it doesn’t work in 0.20.

    Now I still use the method of packing all 3rd-party jars into mine. I think it is simpler than other methods.

  • Alex Kozlov / January 13, 2011 / 12:58 AM

    For the hadoop command-line options and method 1 to work, the main class should implement Tool and call ToolRunner.run().

    Sorry for not making this clear in the first place.

  • Akbar Hussain / February 25, 2011 / 2:40 AM

    Hi, can anyone help me with how to share a JNI .so file in a Hadoop environment?

    Thanks

  • Adarsh / May 03, 2011 / 4:42 AM

    I am able to successfully run an external library in Hadoop Java programs.

    But is there any way to run the same through Hadoop Pipes?

    For e.g. I add simple C libhdfs code to open and write a file in HDFS by inserting some lines in wordcount.cc.

    Please let me know if it is possible.

    Thanks

  • Alexey / May 05, 2011 / 12:29 PM

    Never mind my prev. comment; here it is if somebody is as lazy as I was:

    build a jar with dependencies (jars) inside
    dependencies (jars) are taken from the lib folder

    <!-- THAT ALSO WORKS, but you list all files by name -->

  • Chris / July 07, 2011 / 11:39 AM

    Can you post your entire ant build.xml?

  • akshit / July 11, 2012 / 4:51 PM

    Please give an example of
    HADOOP_TASKTRACKER_OPTS="-classpath"

    It does not work for me.

  • Steve / September 06, 2012 / 12:16 PM

    Setting -classpath in HADOOP_TASKTRACKER_OPTS=’-classpath /my/new/custom.jar’ unfortunately doesn’t work because the -classpath argument is set later in the command… The resulting task tracker process looks something like:

    java -Dproc_tasktracker -Xmx1000m -classpath /my/new/custom.jar -Dhadoop.log.dir= …. -classpath /etc/hadoop/conf:…:/usr/lib/hadoop/lib/*:….

    So, the last -classpath option overrides the first.

    The only way I could get this to work was adding my jar file to one of the directories included with a wildcard, for example /usr/lib/hadoop/lib.

  • Alvin Liu / September 12, 2012 / 5:06 AM

    Hadoop 0.20.2 does not support mapreduce.user.classpath.first. How do I resolve a JAR version conflict when my MR job uses the same JAR as the Hadoop lib but in a different version? For example, the Hadoop lib has slf4j-api-1.4.3 while my MR job depends on slf4j-api-1.6.1. I am using CDH3u3 on Oracle BDA.

  • Jebel / October 26, 2012 / 4:00 AM

    Once the property java.library.path is set, the JVM only searches that path and ignores the env value LD_LIBRARY_PATH.

  • David / January 23, 2013 / 2:03 PM

    I am not having success with these instructions. The -D option might not work with the jar option according to this: http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
