How-to: Use the ShareLib in Apache Oozie

Ed. Note: The post below pertains to CDH 4.x only. Read this post for updates concerning CDH 5.x.

As Apache Oozie, the workflow engine for Apache Hadoop, continues to receive wider adoption from our customers and the community, we’re seeing patterns with respect to the biggest challenges for users. One such point of difficulty is setting up and using Oozie’s ShareLib, which allows JARs to be shared by different workflows. This blog post is intended to help you with those tasks (for CDH 4.x only; the changes in CDH 5.x are covered in a separate post).

Errors

A missing or improperly installed ShareLib will cause some action types (DistCp, Streaming, Pig, Sqoop, and Hive) to fail. In this case, you’ll typically see exceptions in the Oozie and JobTracker logs.

Before exploring these errors, let’s first discuss what the ShareLib is and how it works.

Why Use the ShareLib?

Suppose you have an Oozie workflow that runs a MapReduce action. You want to specify your own Mapper and Reducer classes, but how does Oozie know where to find those two classes? 

There are two ways to let Oozie know about Mapper and Reducer classes or any other additional JARs required by your workflow. The first approach is based on the fact that a workflow typically consists of a job.properties file, a workflow.xml file, and an optional lib folder (and perhaps other files such as Pig scripts). Usually, you’d place them together in a single folder, with any JARs inside its lib subfolder.

Oozie will take any of the JARs that you put in that lib folder and automatically add them to your workflow’s classpath when it’s executed.  This is the simplest approach. 
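As a rough illustration of that layout, the sketch below builds a workflow folder on the local filesystem and lists its contents. The folder name my-workflow and the JAR name my-mapper-reducer.jar are hypothetical, and a real deployment would normally live in HDFS rather than in a local temporary directory.

```python
import tempfile
from pathlib import Path


def create_workflow_app(root, jar_names):
    """Create a workflow folder and return its relative paths, sorted."""
    (root / "lib").mkdir(parents=True, exist_ok=True)
    (root / "job.properties").touch()
    (root / "workflow.xml").touch()
    for name in jar_names:
        # JARs placed in lib/ are the ones Oozie adds to the classpath.
        (root / "lib" / name).touch()
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


def build_example():
    """Build the hypothetical layout in a throwaway local directory."""
    with tempfile.TemporaryDirectory() as tmp:
        return create_workflow_app(Path(tmp) / "my-workflow",
                                   ["my-mapper-reducer.jar"])


for path in build_example():
    print(path)
```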

Alternatively, you can use the oozie.libpath property in your job.properties file to specify additional HDFS directories (multiple directories can be separated by a comma) that contain JARs. The advantage of using this property over the lib folder discussed above is in cases where you have many workflows all using the same set of JARs. 
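To make the comma-separated form concrete, here is a small sketch that renders the relevant lines of a job.properties file. The HDFS paths are hypothetical placeholders; oozie.wf.application.path is the usual property for pointing at the workflow folder, and the helper function itself is only an illustration, not part of Oozie.

```python
def render_job_properties(app_path, lib_paths):
    """Render a minimal job.properties body with an optional oozie.libpath."""
    lines = ["oozie.wf.application.path=" + app_path]
    if lib_paths:
        # Multiple HDFS directories are joined with commas.
        lines.append("oozie.libpath=" + ",".join(lib_paths))
    return "\n".join(lines) + "\n"


print(render_job_properties(
    "${nameNode}/user/joe/my-workflow",           # hypothetical path
    ["/user/joe/shared-jars", "/apps/common/lib"],  # hypothetical paths
))
```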

For example, suppose you have 20 workflows that each need the same three additional JARs. Instead of storing and maintaining 20 copies of each of the three JARs (that’s 20 x 3 – 3 = 57 redundant JARs), you can simply keep the three JARs in one location and have all your workflows refer to that one copy. (Technically, there would be even more copies because of HDFS replication.)
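The arithmetic can be checked with a short sketch. The replication factor of 3 used below is an assumption (it is the common HDFS default, but your cluster may be configured differently).

```python
def redundant_jar_copies(workflows, shared_jars):
    """Copies beyond the single shared set when every workflow keeps its own."""
    total_copies = workflows * shared_jars
    return total_copies - shared_jars


def stored_replicas(files, replication=3):
    """Physical copies kept by HDFS, assuming a uniform replication factor."""
    return files * replication


print(redundant_jar_copies(20, 3))   # 20 x 3 - 3
print(stored_replicas(20 * 3))       # per-workflow lib folders
print(stored_replicas(3))            # one shared location
```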

How Does the ShareLib Work?

Some of the actions, specifically DistCp, Streaming, Pig, Sqoop, and Hive, require external JAR files in order to run successfully. Instead of having to keep these JAR files in each workflow’s lib folder, or forcing the user to manually manage them via the oozie.libpath property on every workflow using one of these actions, Oozie provides the ShareLib. The ShareLib behaves very similarly to oozie.libpath, except that it’s specific to the aforementioned actions and their required JARs. The details below describe the (MRv1) ShareLib in CDH 4.1.2.

The above actions each depend on many JARs that you no longer have to worry about once the ShareLib is deployed. Each of these actions has its own folder with its own JARs; this allows Oozie to use only the JARs required for that action instead of including every JAR. In fact, this is necessary because not all of these actions use the same, or even compatible, versions of the JARs. For example, the Hive action uses antlr-runtime-3.0.1.jar and will fail if used with antlr-runtime-3.4.jar, which is what the Pig action uses.
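One way to see why the per-action folders matter is to look for artifacts that appear with different versions in different folders. The sketch below does that for a hand-written folder listing; the antlr-runtime entries come from the example above, while commons-io-2.1.jar and the parsing rule (artifact name, dash, version starting with a digit) are assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <artifact>-<version>.jar, version starts with a digit.
JAR_PATTERN = re.compile(r"^(.+?)-(\d[\w.]*)\.jar$")


def conflicting_artifacts(folders):
    """Map each artifact to {folder: version} when the versions disagree."""
    versions = defaultdict(dict)
    for folder, jars in folders.items():
        for jar in jars:
            match = JAR_PATTERN.match(jar)
            if match:
                versions[match.group(1)][folder] = match.group(2)
    return {artifact: by_folder
            for artifact, by_folder in versions.items()
            if len(set(by_folder.values())) > 1}


listing = {
    "hive": ["antlr-runtime-3.0.1.jar", "commons-io-2.1.jar"],  # partly hypothetical
    "pig": ["antlr-runtime-3.4.jar", "commons-io-2.1.jar"],     # partly hypothetical
}
print(conflicting_artifacts(listing))
```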

How to Install and Use the ShareLib

By default, the ShareLib should be placed in the HDFS home folder of the user who started the Oozie web server; this is not necessarily the same user as the one submitting a job. In CDH3 and CDH4, this user is named ‘oozie’. The property in oozie-site.xml for setting the location of the ShareLib is called oozie.service.WorkflowAppService.system.libpath and its default value is /user/${user.name}/share/lib, where ${user.name} gets resolved to the user who started the Oozie server. Hence, the default location to install the ShareLib is /user/oozie/share/lib. More detailed instructions for installing the ShareLib can be found in the CDH4 Oozie documentation. (A future release of Cloudera Manager will be able to install the ShareLib automatically.)
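The substitution described above can be sketched in a few lines. This only mimics the resolution of ${user.name} for illustration; it is not how Oozie itself implements it.

```python
DEFAULT_SYSTEM_LIBPATH = "/user/${user.name}/share/lib"


def resolve_system_libpath(template, server_user):
    """Replace ${user.name} with the user who started the Oozie server."""
    return template.replace("${user.name}", server_user)


print(resolve_system_libpath(DEFAULT_SYSTEM_LIBPATH, "oozie"))
```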

One caveat: Because CDH4 supports MRv1 and YARN, the CDH4 Oozie installation provides separate ShareLib archives for MRv1 (oozie-sharelib.tar.gz) and YARN (oozie-sharelib-yarn.tar.gz). It is important that the correct one is installed based on which version of Hadoop is being used. 
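A tiny lookup makes the choice explicit. The archive names are the ones given above; the keys "mrv1" and "yarn" are labels chosen for this sketch rather than values Oozie recognizes.

```python
SHARELIB_ARCHIVES = {
    "mrv1": "oozie-sharelib.tar.gz",
    "yarn": "oozie-sharelib-yarn.tar.gz",
}


def sharelib_archive_for(framework):
    """Return the ShareLib archive matching the MapReduce framework in use."""
    key = framework.strip().lower()
    if key not in SHARELIB_ARCHIVES:
        raise ValueError("unknown framework: " + repr(framework))
    return SHARELIB_ARCHIVES[key]


print(sharelib_archive_for("YARN"))
```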

To enable a workflow to use the ShareLib, simply specify oozie.use.system.libpath=true in the job.properties file; Oozie will then include the ShareLib JARs for the relevant actions in your job.
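Before submitting a job, it can be handy to confirm that the flag is actually set. The following sketch parses a job.properties body with a deliberately simplified parser (no line continuations or escapes) and treats a missing property as false, which is an assumption about the default.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def uses_sharelib(props):
    """True when the job asks Oozie to include the ShareLib JARs."""
    return props.get("oozie.use.system.libpath", "false").lower() == "true"


sample = """
# hypothetical job.properties
oozie.wf.application.path=/user/joe/my-workflow
oozie.use.system.libpath=true
"""
print(uses_sharelib(parse_properties(sample)))
```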

Overriding the ShareLib

In CDH 4.1.0 and later (or Oozie 3.3.0 and later), you can override the ShareLib location at the action, job, and server levels. This allows users or admins to support multiple versions or a patched version of an action at the same time. The property is called oozie.action.sharelib.for.actiontype, where actiontype is the name of the action type (e.g. pig, sqoop); you would set its value to the name of a subfolder in the ShareLib. To set it at the action level, you would put the property in that action’s <configuration> element; to set it at the job level, you would put the property in that job’s job.properties; and to set it at the server level, you would put the property in oozie-site.xml.
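The three levels can be modeled as a simple lookup chain. The sketch below assumes that the most specific level wins (action, then job, then server) and that the fallback is the folder named after the action type; both are assumptions about precedence made for illustration, so check the Oozie documentation for your version.

```python
def resolve_sharelib_folder(action_type, action_conf, job_conf, server_conf):
    """Pick the ShareLib subfolder for an action, most specific level first."""
    key = "oozie.action.sharelib.for." + action_type
    for conf in (action_conf, job_conf, server_conf):
        if key in conf:
            return conf[key]
    # Assumed default: the subfolder named after the action type.
    return action_type


print(resolve_sharelib_folder(
    "pig",
    {"oozie.action.sharelib.for.pig": "pig-9"},  # action-level override
    {},                                          # job.properties
    {},                                          # oozie-site.xml
))
```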

For example, Oozie currently ships ready for Pig 0.10.x, but suppose you also want to be able to use Pig 0.9.x in the same workflow. The share/lib/pig folder is for Pig 0.10.x, but if you add a new folder with the Pig 0.9.x JARs, say share/lib/pig-9, you can set oozie.action.sharelib.for.pig to pig-9 in the <configuration> element of the Pig 0.9.x action.

Oozie will continue to use share/lib/pig for the Pig 0.10.x action but will use share/lib/pig-9 for the Pig 0.9.x action. 
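For reference, the action-level override is an ordinary Hadoop-style property entry. The sketch below generates that entry with Python's standard XML library; the helper name is made up for this example, and the output would be pasted inside the Pig 0.9.x action's <configuration> element.

```python
import xml.etree.ElementTree as ET


def sharelib_override_property(action_type, folder):
    """Build the <property> entry that points an action at a ShareLib subfolder."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "oozie.action.sharelib.for." + action_type
    ET.SubElement(prop, "value").text = folder
    return ET.tostring(prop, encoding="unicode")


print(sharelib_override_property("pig", "pig-9"))
```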

Conclusion

Now that you understand the purpose of the ShareLib, how it works, and how to use it, you can better leverage it in your Oozie workflows. As Oozie continues to grow and mature, features such as the ShareLib make it easier to use. In the future, OOZIE-1054 will make it even easier for users by providing a script that installs the ShareLib.

Have any suggestions? Feel free to tell us what you think through user@oozie.apache.org or cdh-user@cloudera.org.

Robert Kanter is a Software Engineer at Cloudera, working on the Platform team.

1 Response
  • bogdan dalia / October 18, 2013 / 5:37 AM

    Is it possible to specify, for a certain action (e.g. a Java action), two folders with JARs from the ShareLib (e.g. hive and pig)?
