Better Workflow Management in CDH with Oozie 2

Oozie version 2.2.1 is now bundled with Cloudera Distribution for Hadoop (CDH3 Beta 3). This major upgrade adds new functionality such as time and data-driven workflow jobs, and an embedded Tomcat server that makes deploying Oozie much easier. Oozie 2.2.1 also includes several bug fixes, while preserving backward compatibility with Oozie 1.6.

Time and Data Triggered Workflow Jobs

Oozie 2 introduces support for coordinator jobs. A coordinator job is a time and data-driven job that starts a workflow job every time a set of events occurs. For example, a coordinator job could start a workflow job every hour, or start a workflow job at the end of the day once all of that day's hourly data is available in HDFS. Coordinator jobs are expressed in terms of a time frequency and a set of time-bound input and output data.

A Use-Case for Coordinator Jobs

Suppose a website has a high volume of traffic and rolls over its logs every 30 minutes. After the log files are rolled over, they are copied to HDFS in a directory named with the rollover timestamp (“logs-YYYY-MM-DD-HH-mm”).  At the end of the day, when the 48 daily logs are available in HDFS, a workflow job processes the logs to generate an aggregated log-report for the day.
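
As a rough illustration of the naming scheme, the following Python sketch lists the 48 directories that one day should produce. It assumes each directory is stamped with the start of its 30-minute window (the text does not say whether the timestamp marks the start or the end of the window). The function name `rollover_dirs` and the sample date are hypothetical and not part of Oozie.

```python
from datetime import date, datetime, timedelta

ROLLOVER = timedelta(minutes=30)  # logs roll over every 30 minutes


def rollover_dirs(day):
    """Return the directory names expected in HDFS for one calendar day.

    Assumption: each name carries the start time of its 30-minute window.
    """
    start = datetime(day.year, day.month, day.day)
    slots = timedelta(days=1) // ROLLOVER  # 48 windows per day
    return [
        (start + i * ROLLOVER).strftime("logs-%Y-%m-%d-%H-%M")
        for i in range(slots)
    ]


dirs = rollover_dirs(date(2010, 11, 1))  # arbitrary example date
print(len(dirs), dirs[0], dirs[-1])
```

The daily workflow job would need every name in this list to exist before it could start.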

A coordinator job is defined by three characteristics: the frequency of the workflow job, the time-bound input, and the time-bound output. The frequency dictates the nominal time at which a workflow job should run. The time-bound input data must exist, or the workflow job will not start. The time-bound output data is produced by the workflow job.

In our daily logs example, the characteristics of the coordinator job are:

  • Frequency of the workflow job: daily
  • Time-bound input data: the 48 log rollovers of the day
  • Time-bound output: the aggregated log-report for the day
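
The three characteristics above can be modeled in a few lines of Python. This is only a conceptual sketch: `CoordinatorSpec`, `missing_inputs`, `ready_to_start`, the sample date, and the output name are all hypothetical, and a real coordinator job is declared through Oozie's own job definition rather than code like this.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class CoordinatorSpec:
    """Hypothetical model of the three characteristics; not an Oozie class."""
    frequency: timedelta  # how often the workflow job should run
    inputs: tuple         # time-bound input paths that must all exist
    output: str           # time-bound output the workflow job produces


def missing_inputs(spec, existing_paths):
    """Return the inputs that are not yet available."""
    return [p for p in spec.inputs if p not in existing_paths]


def ready_to_start(spec, existing_paths):
    """The workflow job may start only when no input is missing."""
    return not missing_inputs(spec, existing_paths)


# Daily log example for an arbitrary day (all names are illustrative).
daily = CoordinatorSpec(
    frequency=timedelta(days=1),
    inputs=tuple(f"logs-2010-11-01-{h:02d}-{m:02d}"
                 for h in range(24) for m in (0, 30)),
    output="log-report-2010-11-01",
)

arrived = set(daily.inputs[:47])       # last rollover not copied yet
print(ready_to_start(daily, arrived))  # False
arrived.add(daily.inputs[47])
print(ready_to_start(daily, arrived))  # True
```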

One important property of a coordinator job is that even if the workflow job starts late (because the input data is not yet available, for example), the time boundaries of the input data do not change. That is, the workflow job uses the input data within the specified time boundaries, not the input data from a time period relative to its execution time. For example, in the daily log aggregation use case, the input data spans from the first 30 minutes to the last 30 minutes of a calendar day. If the workflow job starts six hours after the end of the day (at 6:00 AM) because the input data is not available until then, it still uses the data from the first 30 minutes to the last 30 minutes of the previous calendar day, not the data from the 24-hour period preceding its start time at 6:00 AM.
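
The difference between the two interpretations can be made concrete with a small Python sketch. It assumes the nominal time of the daily job is midnight at the end of the day being processed. The function names and the dates are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta

DAY = timedelta(days=1)


def nominal_window(nominal_time):
    """Input window anchored to the nominal time (the behavior described)."""
    return nominal_time - DAY, nominal_time


def relative_window(actual_start):
    """Window relative to the actual start time (NOT what is used)."""
    return actual_start - DAY, actual_start


nominal = datetime(2010, 11, 2, 0, 0)  # end of the calendar day 2010-11-01
actual = nominal + timedelta(hours=6)  # workflow job really starts at 6:00 AM

# Anchored window: covers 2010-11-01 00:00 up to 2010-11-02 00:00.
print(nominal_window(nominal))
# Relative window: covers 2010-11-01 06:00 up to 2010-11-02 06:00.
print(relative_window(actual))
```

However late the workflow job starts, the anchored window stays the same, which is why a delayed run still processes exactly the previous calendar day.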

Oozie Runs Out of the Box

Starting with Cloudera Oozie 2.2, there is no need to install a separate Tomcat server to run Oozie. Oozie bundles an embedded, preconfigured Tomcat server. You simply invoke ‘oozie-start.sh’ to run Oozie.

Support for Multiple Databases

Oozie can now run with HSQLDB, MySQL, or Oracle databases.

What’s Next?

For the next release of CDH3, we will make Oozie easier to configure and use, add Kerberos security to the Oozie HTTP API as well as support for Derby databases, and include a few other incremental features and bug fixes.
