How-to: Use cron-like Scheduling in Apache Oozie

Improved scheduling capabilities via Oozie in CDH 5 make for far fewer headaches.

One of the best new Apache Oozie features in CDH 5, Cloudera’s software distribution, is the ability to use cron-like syntax for coordinator frequencies. Previously, frequencies had to be fixed intervals (every hour or every two days, for example), which made anything more complicated (such as every hour from 9am to 5pm on weekdays, or the second-to-last day of every month) difficult to schedule.

In this post, you’ll learn how to use this new syntax in a practical way.

Scheduling in CDH 4

Before we get into the cron-like syntax, let’s take a quick look at the fixed-frequency scheduling supported in CDH 4 and earlier.

Oozie treats frequency in terms of minutes; however, due to the variability in the number of minutes in each day and month, we recommend using the appropriate EL (Expression Language) functions instead of trying to do the calculations by hand. With that in mind, coordinator frequencies are typically specified with one of these four EL functions:

  • ${coord:minutes(int n)}
  • ${coord:hours(int n)}
  • ${coord:days(int n)}
  • ${coord:months(int n)}
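
To see why hand-calculated minute counts are fragile, here is a small Python sketch (not part of Oozie; the helper name minutes_in_month is ours) that compares the number of minutes in two consecutive months. It ignores daylight saving time, which would add day-level variation on top.

```python
import calendar


def minutes_in_month(year, month):
    """Return the number of minutes in the given calendar month (no DST handling)."""
    days = calendar.monthrange(year, month)[1]  # number of days in the month
    return days * 24 * 60


# February and March 2014 differ by three days' worth of minutes,
# so no single fixed minute count means "one month".
print(minutes_in_month(2014, 2))
print(minutes_in_month(2014, 3))
```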

Let’s try to use these functions to schedule a coordinator for the first example: every hour from 9am to 5pm on weekdays. The most obvious approach is to use ${coord:hours(1)} to schedule the coordinator to run every hour. But as you’ll quickly realize, that won’t work, because the job will also run after 5pm, before 9am, and on the weekend. In the end, the easiest approach is to create 45 similar coordinator jobs!

Yes, you read that correctly: you’d create a new job to run every seven days, and start them at offsets of an hour from 9am through 5pm on Monday, Tuesday, Wednesday, Thursday, and Friday. That’s nine coordinators per day (9am, 10am, and so on through 5pm) times five days = 45 coordinators.

True, you can make this process less painful by reusing a single coordinator.xml and simply submitting it with multiple start times by making the start attribute a variable and setting it at submission time:

<coordinator-app name="coord-1" frequency="${coord:days(7)}" start="${start}" end="2015-01-27T00:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
	...
</coordinator-app>

 

Here’s how the first and last few submissions would look, with the first on Monday, January 27, 2014, at 9am and the last on Friday, January 31, 2014, at 5pm:

oozie job -config job.properties -run -Dstart=2014-01-27T09:00Z
oozie job -config job.properties -run -Dstart=2014-01-27T10:00Z
oozie job -config job.properties -run -Dstart=2014-01-27T11:00Z
...
oozie job -config job.properties -run -Dstart=2014-01-31T15:00Z
oozie job -config job.properties -run -Dstart=2014-01-31T16:00Z
oozie job -config job.properties -run -Dstart=2014-01-31T17:00Z
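
Rather than typing every submission by hand, you could generate the full list. The following is a minimal Python sketch, assuming one start time per hour from 09:00 through 17:00 on each of the five weekdays; the function name weekly_start_times is ours, and the script only prints the commands rather than running them.

```python
from datetime import datetime, timedelta


def weekly_start_times(monday, first_hour=9, last_hour=17):
    """Return one UTC start timestamp per hourly slot, Monday through Friday."""
    starts = []
    for day_offset in range(5):  # Monday .. Friday
        day = monday + timedelta(days=day_offset)
        for hour in range(first_hour, last_hour + 1):  # last_hour is inclusive
            starts.append(day.replace(hour=hour).strftime("%Y-%m-%dT%H:%MZ"))
    return starts


# Monday, January 27, 2014, matching the week used in the commands above.
starts = weekly_start_times(datetime(2014, 1, 27))
commands = [f"oozie job -config job.properties -run -Dstart={s}" for s in starts]
for command in commands:
    print(command)
```

Each printed line is one submission of the same coordinator.xml with a different start value.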

 

You’d have to do the same thing for the other example, the second-to-last day of every month. In fact, you’d probably have to manually calculate the specific dates and submit a coordinator for each month!

Scheduling in CDH 5

cron is a utility included with most Unix/Linux operating systems for scheduling time-based jobs. For example, you might want it to run a script that cleans out your Internet history once a week. 

As I hinted at earlier, cron’s syntax for specifying a schedule is more flexible and powerful than fixed-frequency scheduling. We’ll take a quick look at the syntax below, but for more details, review the Oozie documentation. (Note that there are variations even among standard cron tools, so it’s a good idea to read the Oozie-specific documentation even if you are already familiar with cron.) Oozie uses the Quartz Scheduler to parse the cron syntax.

The cron syntax used by Oozie is a string with five space-separated fields: Minute, Hour, Day-of-Month, Month, and Day-of-Week. The documentation includes a chart summarizing the Allowed Values and Allowed Special Characters for each field.

The Allowed Values for each field are fairly self-explanatory (but note that while Day-of-Week accepts 0-6 in many cron implementations, Oozie accepts 1-7 instead, with 1 being Sunday).

Four special characters are allowed in every field: “*” (asterisk), which matches all values; “,” (comma), which lets you specify multiple values; “-” (dash), which lets you specify ranges; and “/” (slash), which lets you specify increments.

It’s probably easiest to explain these characters with some examples. Remember, Oozie’s processing time zone is UTC, so if you’re in a different time zone, you’ll have to add or subtract the appropriate offset from the examples ahead.

  • 30 * * * *
    This expression indicates that the job should run at the 30th minute of every hour (1:30am, 2:30am, and so on, assuming the job is set to start on the hour). We’ve set the Minute field to 30, and the remaining fields to “*” so they match every value. 
  • 30 14 * * *
    This expression indicates that the job should run at 2:30pm every day. The Minute field is set to 30, the Hour field is set to 14, and the remaining fields are set to “*”.
  • 30 14 * 2 *
    This expression indicates that the job should run at 2:30pm every day during February. This is similar to the previous expression, except that we’ve now restricted the Month field to February instead of including every month.
  • 0/20 5-9,12-14 0/5 * *
    This expression is a bit trickier and highlights the flexibility of the cron-like scheduling. It indicates that the job should run every 20 minutes (0, 20, and 40 past the hour) between 5am and 10am (with the last job starting at 9:40am) and between noon and 2pm (with the last job starting at 1:40pm) on every fifth day of every month. The Minute field is set to 0/20, the Hour field is set to 5-9,12-14, the Day-of-Month field is set to 0/5, and the remaining fields are set to “*”. 
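
To make the four common special characters concrete, here is a simplified Python sketch that expands a single field into the values it matches. It is our own illustration, not Quartz’s or Oozie’s parser: the function name expand_field is hypothetical, it handles only “*”, “,”, “-”, and “/”, and it silently drops values outside the bounds you pass in.

```python
def expand_field(expr, lo, hi):
    """Expand one cron field into the sorted list of values it matches.

    lo and hi are the field's bounds (for example, 0 and 59 for Minute).
    Supports "*", single values, "a-b" ranges, "a,b" lists, and "/step".
    """
    values = set()
    for part in expr.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        else:
            start = int(base)
            # "a/step" runs from a to the top of the field; a bare "a" is just a.
            end = hi if step_text else start
        values.update(v for v in range(start, end + 1, step) if lo <= v <= hi)
    return sorted(values)


print(expand_field("30", 0, 59))         # Minute field of "30 * * * *"
print(expand_field("0/20", 0, 59))       # Minute field of the last example
print(expand_field("5-9,12-14", 0, 23))  # Hour field of the last example
```

The Minute field 0/20 expands to 0, 20, and 40, and the Hour field 5-9,12-14 expands to hours 5 through 9 and 12 through 14, which matches the description of the last example above.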

You may have also noticed that there are some additional Allowed Special Characters (“?”, “L”, “W”, and “#”), which you can use in some of the fields to provide more specialized results:

  • Use “?” in the Day-of-Month and Day-of-Week fields to indicate no specific value (if you want to specify one but not the other).
  • Use “L” in the Day-of-Month and Day-of-Week fields to indicate the last day of the month or the last day of the week (Saturday) respectively. The “L” can do other things in the Day-of-Week field, too; for example, if “6L” is in the Day-of-Week field, it indicates the last Friday of the month. 
  • Use “W” in the Day-of-Month field to indicate the nearest weekday to the given day. Use “#” in the Day-of-Week field to indicate the nth occurrence of a weekday in the month; for example, “2#1” is the first Monday of the month.

These values have more subtle and complex use cases than the other values. Before using them, read the documentation carefully — especially for “L” and “W” as they have some additional behavior.

Here are some examples using these special values:

  • 0 5 ? * MON
    This expression indicates that the job should run every Monday at 5am. The Minute field is set to 0, the Hour field is set to 5, the Day-of-Month field is set to “?”, the Month field is set to “*”, and the Day-of-Week field is set to MON. Notice that if the “?” were a “*”, then this expression would indicate that the job should run every day at 5am, not just Mondays. The difference between the “?” and the “*” is sometimes tricky, but this example is pretty helpful. 
  • 0 5 L * ?
    This expression indicates that the job should run on the last day of every month at 5am.  The Minute field is set to 0, the Hour field is set to 5, the Day-of-Month field is set to L, the Month field is set to “*”, and the Day-of-Week field is set to “?”. 
  • 0 5 15W * ?
    This expression indicates that the job should run at 5am on the weekday closest to the 15th day of every month. The Minute field is set to 0, the Hour field is set to 5, the Day-of-Month field is set to 15W, the Month field is set to “*”, and the Day-of-Week field is set to “?”. 
  • 0/33 9-14 ? * 2#1
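    This expression indicates that the job should run at 0 and 33 minutes past each hour from 9am through 2:33pm on the first Monday of every month. (The increment restarts at the top of each hour, so the runs are not a uniform 33 minutes apart.) The Minute field is set to 0/33, the Hour field is set to 9-14, the Day-of-Month field is set to “?”, the Month field is set to “*”, and the Day-of-Week field is set to 2#1 (the first Monday).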
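
If you want to sanity-check which calendar dates “L”, “W”, and “#” would select, the following Python sketch computes them with the standard library. It is our reading of the behavior described above, not Oozie or Quartz code: the helper names are hypothetical, and nearest_weekday assumes the match never crosses into a neighboring month, so confirm edge cases against the documentation.

```python
import calendar
from datetime import date, timedelta


def last_day_of_month(year, month, offset=0):
    """Date selected by "L" (offset=0) or "L-n" (offset=n) in Day-of-Month."""
    last = calendar.monthrange(year, month)[1]
    return date(year, month, last - offset)


def nearest_weekday(year, month, day):
    """Date selected by "<day>W": the closest Monday-Friday in the same month."""
    d = date(year, month, day)
    if d.weekday() == 5:  # Saturday: prefer the Friday before
        candidate = d - timedelta(days=1)
        return candidate if candidate.month == month else d + timedelta(days=2)
    if d.weekday() == 6:  # Sunday: prefer the Monday after
        candidate = d + timedelta(days=1)
        return candidate if candidate.month == month else d - timedelta(days=2)
    return d


def nth_weekday(year, month, dow, n):
    """Date selected by "<dow>#<n>", with dow numbered 1 (Sunday) to 7 (Saturday)."""
    python_weekday = (dow - 2) % 7  # Python numbers Monday as 0
    first = date(year, month, 1)
    delta = (python_weekday - first.weekday()) % 7
    d = first + timedelta(days=delta + 7 * (n - 1))
    return d if d.month == month else None  # None: no such occurrence


print(last_day_of_month(2014, 2))   # "L" in February 2014
print(nearest_weekday(2014, 2, 15)) # "15W" in February 2014
print(nth_weekday(2014, 2, 2, 1))   # "2#1" in February 2014
```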

With all this in mind, it is now fairly straightforward to create one coordinator for each of our examples. Using cron syntax, every hour from 9am to 5pm on weekdays can be expressed as:

0 9-17 ? * 2-6

and the second-to-last day of every month can be expressed as:

0 0 L-1 * ?

(Note that “L-1” means the second-to-last day of the month.)

You would need only one coordinator for each of the two examples instead of the (ridiculous) number of them you saw earlier.
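
The cron expression goes in the coordinator’s frequency attribute, in place of the EL function. As a rough illustration, this Python sketch renders the opening of such a coordinator definition with the standard library. The function name render_coordinator and the coordinator names are ours; the schema version and end date are copied from the earlier example, and the workflow body is left as a placeholder comment.

```python
import xml.etree.ElementTree as ET


def render_coordinator(name, cron_expr, start, end):
    """Render a skeleton coordinator-app element with a cron-style frequency."""
    root = ET.Element("coordinator-app", {
        "name": name,
        "frequency": cron_expr,  # cron expression instead of ${coord:...}
        "start": start,
        "end": end,
        "timezone": "UTC",
        "xmlns": "uri:oozie:coordinator:0.4",  # assumed: same schema as above
    })
    # Placeholder for the elided body of the coordinator.
    root.append(ET.Comment(" datasets, events, and action go here "))
    return ET.tostring(root, encoding="unicode")


# One coordinator per example, instead of one per start time.
weekday_hours = render_coordinator(
    "coord-weekday-hours", "0 9-17 ? * 2-6", "2014-01-27T09:00Z", "2015-01-27T00:00Z")
second_to_last = render_coordinator(
    "coord-second-to-last-day", "0 0 L-1 * ?", "2014-01-27T00:00Z", "2015-01-27T00:00Z")
print(weekday_hours)
print(second_to_last)
```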

For more examples, take a look at the cron tutorial from the Quartz Scheduler website. Keep in mind that Quartz expressions have a Seconds field and an optional Year field, whereas Oozie has neither, so their examples have six or seven fields instead of only five.

It’s Even Easier with Hue

The Hue team has added support for Oozie’s cron-like scheduling syntax, too. (Hopefully you are familiar with Hue’s great Oozie dashboard and editor.) In fact, it’s easier to configure the frequency with Hue because you don’t even have to know the cron syntax! You can also just use Hue to create the cron expression for you, and then put it into your own coordinator.

Find out more and watch a demo on the Hue blog.

Conclusion

If you found the cron-like scheduling syntax to be a little overwhelming, don’t worry: the fixed-frequency scheduling syntax isn’t going anywhere and will still work. Or, you can try using Hue. 

Otherwise, cron-like scheduling is a much more powerful and flexible way to schedule jobs and will make Oozie an even more valuable tool in CDH 5. Also, it requires no extra setup or configuration, so go and try it out!

Robert Kanter is a Software Engineer at Cloudera, and an Oozie Committer/PMC Member.

1 Response
  • Hari Sekhon / April 23, 2014 / 10:12 AM

    Executing in UTC seems like a bad idea when cron uses local time… if you wanted to use UTC you would do it for everything and set your servers to UTC timezone.

    Having to translate timezones to figure out what to enter requires more effort on the part of the user.

    Also the day field offsets are wrong/needlessly different compared to ISC cron which is zero indexed and uses 0 or 7 for Sunday.

    Diverging from veterans experience with using cron seems against intuitive design.
