How-to: Shorten Your Oozie Workflow Definitions

While XML is very good for standardizing the way Apache Oozie workflows are written, it’s also known for being very verbose. Unfortunately, that means that for workflows that have many actions, your workflow.xml can easily become quite long and difficult to manage and read. Cloudera is constantly making improvements to address this issue, and in this how-to, you’ll get a quick look at some of the current features and tricks that you can use to help shorten your Oozie workflow definitions.

The Sub-Workflow Action

One of the more interesting action types that Oozie has is the Sub-Workflow Action; it allows you to run another workflow from your workflow. Suppose you have a workflow where you’d like to use the same action multiple times; this is not usually allowed because Oozie workflows are Directed Acyclic Graphs (DAGs), so an action cannot be executed more than once as part of a workflow. However, if you put that action into its own workflow, you can call it multiple times from within the same workflow by using the Sub-Workflow Action. So, instead of copying and pasting the same action to be able to use it multiple times (and taking up a lot of extra space), you can just use the Sub-Workflow Action, which can be shorter. It is also easier to maintain: if you ever want to change that action, you only have to do it in one place. You also get the advantage of being able to use that action in other workflows. Of course, you can still put multiple actions in your sub-workflow.
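As a rough sketch of the idea, a parent workflow that runs the same sub-workflow twice might look something like the following. The workflow name, the node names, the application path, and the inputDir property are all hypothetical and only there for illustration; ${nameNode} is assumed to be defined in your job.properties.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="parent-wf">
    <start to="first-call"/>

    <!-- First invocation of the shared workflow (hypothetical path) -->
    <action name="first-call">
        <sub-workflow>
            <app-path>${nameNode}/user/${wf:user()}/apps/shared-action</app-path>
            <!-- Pass the parent's job configuration down to the child -->
            <propagate-configuration/>
            <configuration>
                <property>
                    <name>inputDir</name>
                    <value>/data/first</value>
                </property>
            </configuration>
        </sub-workflow>
        <ok to="second-call"/>
        <error to="fail"/>
    </action>

    <!-- Second invocation of the same workflow, with a different input -->
    <action name="second-call">
        <sub-workflow>
            <app-path>${nameNode}/user/${wf:user()}/apps/shared-action</app-path>
            <propagate-configuration/>
            <configuration>
                <property>
                    <name>inputDir</name>
                    <value>/data/second</value>
                </property>
            </configuration>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Sub-workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The two actions differ only in name and in the value they pass down, while the action being reused lives in a single workflow.xml under the shared application path.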

I’ve created a simple example workflow showing how the Sub-Workflow Action can be used to run the same action twice from another workflow. Additional details are provided in the readme. (IMPORTANT: Be very careful when using the Sub-Workflow Action. While it can be used to create loops, if you are not careful you can easily create an infinite recursion! OOZIE-1550 and OOZIE-1583 will add safeguards to protect against this risk.)

The Sub-Workflow Action has been available since Oozie workflow schema 0.1.

Including Other XML Configuration Files

Another reason why workflows can become quite long is in specifying all of the necessary properties in the <configuration> section of an action. Oozie allows you to “include” the contents of a separate file (from somewhere in HDFS) as part of the <configuration> section by using the <job-xml> element. This approach is really helpful if you have multiple actions that need to have the same properties. You can put them all in a separate file and simply use <job-xml> instead of copying and pasting them everywhere, thus keeping your workflow shorter. You also get the benefit of only having to maintain one copy of those properties.

Another common use case for <job-xml> is for the Hive action, where you would typically point it at a copy of your hive-site.xml. Also, your actions can have multiple <job-xml> elements, so you can include multiple files. If you specify a property in a <job-xml> file and in the <configuration> section, the latter has priority, which provides a convenient way of using this feature while still being able to override common options for just a specific action.
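To make this more concrete, here is a hedged sketch of a single action that pulls in two shared files and then overrides one property locally. The file names (common-settings.xml, cluster-settings.xml), the node names, and the ${specialQueueName} parameter are made up for the example; each referenced file is assumed to be an ordinary Hadoop configuration file (a <configuration> element containing <property> entries), and multiple <job-xml> elements assume workflow schema 0.4 or later.

```xml
<action name="mr-step">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- Properties shared by several actions live in these files
             (hypothetical names, resolved relative to the workflow
             application directory in HDFS) -->
        <job-xml>common-settings.xml</job-xml>
        <job-xml>cluster-settings.xml</job-xml>
        <configuration>
            <!-- Specified here as well as in a job-xml file, so this
                 inline value is the one that takes priority -->
            <property>
                <name>mapred.job.queue.name</name>
                <value>${specialQueueName}</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="next-step"/>
    <error to="fail"/>
</action>
```

The same pattern applies to the Hive action, where the <job-xml> element would typically point at your hive-site.xml.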

One thing to watch out for is that currently any EL variables used in a file referenced by a <job-xml> element are not resolved. OOZIE-1580 aims to improve this situation.

The <job-xml> element has been around since workflow schema 0.1, but was originally limited to only one instance. In workflow schema 0.4 and in newer versions of the extension action schemas, it was modified to allow multiple <job-xml> elements.

Using the Global Section

The Global Section is another solution for the problem of having to specify the same properties in multiple actions. The Global Section goes at the top of a workflow and looks like this:
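A minimal sketch of a workflow with a Global Section is shown here. The workflow name, the common-settings.xml file, and the ${queueName} parameter are placeholders chosen for this illustration, and ${jobTracker} and ${nameNode} are assumed to come from your job.properties.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.4" name="global-example-wf">
    <global>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- Hypothetical shared properties file -->
        <job-xml>common-settings.xml</job-xml>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
    </global>

    <!-- The start node, actions, kill node, and end node follow as usual -->
</workflow-app>
```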

 

Most workflows use the same JobTracker and NameNode for all actions, so you can easily remove those two lines from all your actions by simply specifying them once at the top in the Global Section! And if you do have some actions that use a different JobTracker or NameNode, you can override the <global> values in that specific action as well. The <job-xml> and <configuration> elements here work just like they usually do, except that they’re applied to all actions in your workflow. As with the <job-tracker> and <name-node> elements, you can override their properties by specifying them in the individual actions. The Global Section is a great way to reduce the number of repeated properties, such as mapred.job.queue.name. Also, each of the elements in the <global> section is optional, so if you only want to use one of them, you can do so.
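For illustration, here are two actions as they might appear in a workflow whose <global> section already sets the JobTracker, NameNode, and mapred.job.queue.name. This is only a sketch: the node names, the paths, and the ${otherNameNode} and ${otherQueueName} parameters are assumptions made for the example.

```xml
<!-- Relies entirely on <global>: no job-tracker, name-node,
     or queue name needs to be repeated here -->
<action name="first-step">
    <map-reduce>
        <configuration>
            <property>
                <name>mapred.input.dir</name>
                <value>/data/in</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>/data/out</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="second-step"/>
    <error to="fail"/>
</action>

<!-- Overrides the global NameNode and queue name for this action only;
     the JobTracker still comes from <global> -->
<action name="second-step">
    <map-reduce>
        <name-node>${otherNameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${otherQueueName}</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>
```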

The Global Section is a relatively new feature, and was added in Oozie workflow schema 0.4.

Conclusion

Now that you are aware of these tips and tricks for shortening your workflows, hopefully you can find ways of taking advantage of them. We’re always looking at new ways to improve the usability of Oozie and of the workflow format; if you have any ideas that you think would be helpful in this regard, or feedback on anything else, please let us know on the cdh-user or oozie-user mailing lists, or the Cloudera Community Forums.
