Updated 11/22/16 – Important: All features described below work on CDH 5.9.0 and CM 5.9.0 and above.
This tool makes Oozie migrations off Apache Derby (or any other supported database) easy, in addition to streamlining upgrades.
The Apache Oozie server is a stateless web application by design, with all information about running and completed workflows, coordinator jobs, and bundle jobs stored in a relational database. Prior to Cloudera Manager 5.4, Oozie was configured to use the embedded Apache Derby database for this purpose by default. However, while Derby can safely be used in very small or test/dev clusters, it is not recommended for production Oozie installations, mainly because Derby suffers from known locking issues at high scale and with large amounts of data.
Furthermore, to provide additional scalability and fault tolerance, Oozie introduced high availability (HA) starting in CDH 5. Unfortunately, Derby cannot be used in an HA setup, as it does not support multiple concurrent connections.
In this post, we’ll describe how Cloudera has addressed these issues for users in a forthcoming Oozie release (and soon to ship in Cloudera Enterprise).
Here at Cloudera, our support team frequently analyzes customer workloads to help ensure their long-term success. During this process, we discovered that many Oozie customers would benefit from migrating off Derby before they begin to scale up and possibly experience locking problems.
The Oozie database contains binary large object (BLOB) columns, so previous migration attempts using free or open source ETL tools failed either partially or completely. The best fix, in most cases, was to start over with a new empty database. But in complex production installations with numerous bundles, coordinators, and workflows running, starting from scratch and locating the properties to resubmit hundreds of jobs is often impractical.
For those reasons, Cloudera’s support and Oozie engineering teams, working in tandem with other members of the Oozie developer community, developed a solution to help customers safely migrate their Oozie databases without losing history or jobs: the addition of a new database dump and load tool to Oozie, which is accessible to admins via Cloudera Manager. Targeted for the upcoming Oozie 4.3 release (and shipping soon in Cloudera Enterprise), the new Oozie Database Migration Tool (OOZIE-2632) helps users more easily migrate from one database to another (and, as a bonus, makes upgrades easier as well).
Using the Tool
The oozie-setup.sh dump and load utility in OOZIE-2632 solves the problem of dumping and loading the Oozie database in a database-agnostic way: it uses OpenJPA (the library that lets Oozie talk to databases) to read objects and GSON to write them to compressed JSON files. This approach supports data extraction not only from a Derby database, but from any other database that Oozie supports.
Dumping the Source Database
Prior to dumping the database, the Oozie server (in HA mode: all of the servers) must be stopped, and oozie-site.xml must be configured to point to the source database from which the dump will be performed. During the dump process, the database content is fetched and written to a zip archive of JSON files using the command:
oozie-setup.sh export /export/path/oozie_db_dump.zip
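Because the rest of the migration depends on this archive, it can be worth confirming that the export actually produced one before reconfiguring anything. The following is a minimal sketch of such a wrapper, not part of Oozie itself: the function name dump_oozie_db, the OOZIE_SETUP variable, and the timestamped file name are all assumptions made for illustration. Only the "export" subcommand comes from the tool described above.

```shell
#!/usr/bin/env bash
# Sketch: wrap the export step so a failed or empty dump is caught early.
# OOZIE_SETUP and dump_oozie_db are hypothetical names, not part of Oozie.
OOZIE_SETUP="${OOZIE_SETUP:-oozie-setup.sh}"

dump_oozie_db() {
  local target_dir="$1"
  local dump_file
  # Timestamped name (an assumption) so repeated dumps do not overwrite each other.
  dump_file="${target_dir}/oozie_db_dump_$(date +%Y%m%d_%H%M%S).zip"

  # Run the export; the Oozie server(s) must already be stopped.
  # Tool output goes to stderr so stdout carries only the dump path.
  "$OOZIE_SETUP" export "$dump_file" >&2 || return 1

  # -s is true only if the file exists and is not empty.
  if [ ! -s "$dump_file" ]; then
    echo "dump file missing or empty: $dump_file" >&2
    return 1
  fi

  echo "$dump_file"
}
```

A call such as dump_file="$(dump_oozie_db /export/path)" would then leave the archive path in a variable for the load step, or return a non-zero status if the export did not produce a usable file.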
Loading the Target Database
Once the dump is performed, you can configure oozie-site.xml to point to a new target database. The stock, initial Oozie database tables should be created using the command:
oozie-setup.sh db create
To then load the previously dumped data, run the command:
oozie-setup.sh import /export/path/oozie_db_dump.zip
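The two load commands above must run in order, and the import should not start if schema creation fails. As a rough sketch under those assumptions, the sequence could be wrapped as follows. The function name load_oozie_db and the OOZIE_SETUP variable are hypothetical; the "db create" and "import" subcommands are the ones shown above, passed through unchanged.

```shell
#!/usr/bin/env bash
# Sketch: create the stock tables, then import the dump, stopping on the
# first failure. OOZIE_SETUP and load_oozie_db are hypothetical names.
OOZIE_SETUP="${OOZIE_SETUP:-oozie-setup.sh}"

load_oozie_db() {
  local dump_file="$1"

  # Refuse to touch the target database if there is nothing to load.
  if [ ! -s "$dump_file" ]; then
    echo "dump file missing or empty: $dump_file" >&2
    return 1
  fi

  # Create the stock schema first; skip the import if this fails.
  "$OOZIE_SETUP" db create || return 1

  # Load the previously dumped data into the new tables.
  "$OOZIE_SETUP" import "$dump_file"
}
```

For example, load_oozie_db /export/path/oozie_db_dump.zip would run both steps once oozie-site.xml points at the target database.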
Integration with Cloudera Manager
Oozie and Cloudera Manager engineers also worked together to simplify this process via a few clicks (see screenshots below). As previously explained, you’ll see this feature in a forthcoming release.
Sample output of the “Dump Database” action
Dumping and loading the Oozie database is available to Oozie admins either from the command line or via the easy-to-use UI in Cloudera Manager. It expedites the process of moving from one database to another for migration or upgrade projects. Helpfully, workflows, coordinators, and bundles can continue to run with only minimal interruption.
As a caution, keep in mind that migration of an exceptionally large database may require more resources and time. That said, due to the scalability problems described previously, Derby will likely have problems long before it reaches a size that would make this issue a common concern. (We may address it on the roadmap should there be enough demand to do so.)
We hope you will find this tool very helpful. Please give us your feedback!
The authors would like to thank Peter Bacsko and Peter Cseh for implementing and reviewing the tool. Special thanks to Aravind Selvan, Paolo Milani, and Darren Lo for their guidance and suggestions throughout the Cloudera Manager integration, and to the rest of the Oozie community for additional reviews and feedback.
Attila Sasvari is a Software Engineer at Cloudera, and is currently working on the Oozie team.
Robert Justice is a Customer Operations Engineer at Cloudera.
Robert Kanter is a Software Engineer at Cloudera, a committer and PMC Member for Apache Hadoop, and the VP/PMC Chair for Oozie.