Following these best practices can make your upgrade path to CDH 5 relatively free of obstacles.
Upgrading the software that powers mission-critical workloads can be challenging in any circumstance. In the case of CDH, however, Cloudera Manager makes upgrades easy, and the built-in Upgrade Wizard, available with Cloudera Manager 5, further simplifies the upgrade process. The wizard performs service-specific upgrade steps that you previously had to run manually, and it also features a rolling restart capability that reduces downtime for minor and maintenance version upgrades. (Please refer to this blog post or webinar to learn more about the Upgrade Wizard.)
As you prepare to upgrade your cluster, keep this checklist of some of Cloudera’s best practices and additional recommendations in mind. Please note that this information is a complement to, not a replacement for, the comprehensive upgrade documentation.
Backing Up Databases
Take backups before you upgrade. Ideally, you already have procedures in place to back up your databases periodically. Before upgrading, be sure to:
- Back up the Cloudera Manager server and management databases that store configuration, monitoring, and reporting data. (These include the databases that contain all the information about what services you have configured, their role assignments, all configuration history, commands, users, and running processes.)
- Back up all databases (if you don’t already have regularly scheduled backup procedures), including the Apache Hive Metastore Server, Apache Sentry server (contains authorization metadata), Cloudera Navigator Audit Server (contains auditing information), Cloudera Navigator Metadata Server (contains authorization, policies, and audit report metadata), Apache Sqoop Metastore, Hue, Apache Oozie, and Apache Sqoop.
- Back up NameNode metadata by locating the NameNode Data Directories in the HDFS service and backing up one of the listed directories (if more than one is listed, you only need to back up one of them).
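The NameNode metadata step can be scripted. The following is a minimal sketch, not an official Cloudera procedure: it archives a single NameNode data directory into a timestamped tarball using only the Python standard library. The function name and the example paths are hypothetical, and the sketch assumes the directory is not being written to while the archive is created.

```python
import tarfile
import time
from pathlib import Path


def backup_namenode_dir(name_dir: str, dest_dir: str) -> Path:
    """Archive one NameNode data directory into a timestamped .tar.gz file."""
    source = Path(name_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"NameNode data directory not found: {source}")

    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # A timestamp in the file name keeps successive backups from overwriting each other.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = target_dir / f"nn-metadata-{stamp}.tar.gz"

    # Store entries under the directory's own name rather than its absolute path.
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(source), arcname=source.name)
    return archive
```

For example, `backup_namenode_dir("/dfs/nn", "/backups/pre-cdh5")` would archive the directory `/dfs/nn` (an illustrative path; use whichever directory is listed in your HDFS service configuration) and return the path of the new archive.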
Note: Cloudera Manager provides an integrated, easy-to-use management solution for enabling Backup and Disaster Recovery, and its key capabilities are fully integrated into the Cloudera Manager Admin Console. It is also automated and fault tolerant.
Cloudera Manager makes it easy to manage data stored in HDFS and accessed through Hive. You can define your backup and disaster recovery policies and apply them across services. You can select the key datasets that are critical to your business, schedule and track the progress of data replication jobs, and get notified when a replication job fails. Replication can be set up on files or directories in the case of HDFS and on tables in the case of Hive. Hive metastore information is also replicated, which means that table definitions are updated. (Please refer to the BDR documentation for more details.)
A separate Disaster Recovery cluster is not required for a safe upgrade, but the Backup and Disaster Recovery capability in Cloudera Manager can ease the upgrade process by ensuring the critical parts of your infrastructure are backed up before you take the upgrade plunge.
Recommended Practices for Upgrading to CDH 5
- Create a fine-grained, step-by-step production plan for critical upgrades (using the Upgrade Documentation as a reference).
- Document the current deployment by chronicling the existing cluster environment and dependencies, including:
  - The current CDH and Cloudera Manager versions that are installed
  - All third-party tools that interact with the cluster
  - The databases for Cloudera Manager, Hive, Oozie, and Hue
  - Important job performance metrics so pre-upgrade baselines are well defined
- Test the production upgrade plan in a non-production environment (e.g., a sandbox or test environment) so you can update the plan if there are unexpected outcomes. It also allows you to:
  - Test job compatibility with the new version
  - Run performance tests
- Upgrade to Cloudera Manager 5 before upgrading to CDH 5.
- Ensure the Cloudera Manager minor version is equal to or greater than the target CDH minor version. The Cloudera Manager version must always be equal to or greater than the CDH version to which you upgrade.
- Reserve a maintenance window with enough time allotted to perform all steps.
  - For a major upgrade on production clusters, Cloudera recommends allocating up to a full-day maintenance window to perform the upgrade (the time required depends on the number of hosts, the amount of Hadoop experience, and the particular hardware). Note that it is not possible to perform a rolling upgrade from CDH 4 to CDH 5 (a major upgrade) due to incompatibilities between the two major versions.
- Maintain your own local Cloudera Manager and CDH package/parcel repositories to protect against external repositories being unavailable.
  - Read the reference documentation for details on how to create a local Yum repository, or
  - Pre-download a parcel to a local parcel repository on the Cloudera Manager server, where it is available for distribution to the other nodes in any of your clusters managed by this Cloudera Manager server. You can have multiple parcels for a given product downloaded to your Cloudera Manager server. Once a parcel has been downloaded to the server, it is available for distribution on all clusters managed by the server. (Note: Parcel and package installations are equally supported by the Upgrade Wizard. Parcels are the recommended method, because packages must be installed manually, whereas parcels are installed by Cloudera Manager. See this FAQ and this blog post to learn more about parcels.)
- Ensure there are no Oozie workflows in RUNNING or SUSPENDED status; otherwise the Oozie database upgrade will fail and you will have to reinstall CDH 4 to complete or kill those workflows. (Note: When upgrading from CDH 4 to CDH 5, the Oozie upgrade can take a very long time. You can reduce this time by reducing the amount of history Oozie retains; see the documentation.)
- Import MapReduce configurations to YARN as part of the Upgrade Wizard. (Note: If you do not import configurations during the upgrade, you can manually import them at a later time. In addition to importing configuration settings, the import process will configure services to use YARN as the MapReduce computation framework instead of MapReduce and will overwrite existing YARN configuration and role assignments.)
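Two of the checks above lend themselves to a small pre-flight script: the Cloudera Manager versus CDH version rule and the Oozie workflow status rule. The sketch below is illustrative only. All function names are hypothetical, version strings are assumed to be plain dotted numbers such as "5.1.0", and the workflow records are assumed to be dictionaries with "id" and "status" keys that you have already collected from Oozie by whatever means you prefer.

```python
from typing import Iterable, List, Mapping, Tuple

# Statuses that, per the checklist above, must not be present before the upgrade.
BLOCKING_STATUSES = {"RUNNING", "SUSPENDED"}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '5.1.0' into (5, 1, 0)."""
    return tuple(int(part) for part in version.split("."))


def cm_supports_cdh(cm_version: str, cdh_version: str) -> bool:
    """True when the Cloudera Manager major.minor is >= the target CDH major.minor."""
    return parse_version(cm_version)[:2] >= parse_version(cdh_version)[:2]


def blocking_workflows(workflows: Iterable[Mapping[str, str]]) -> List[str]:
    """Return the IDs of workflows whose status would block the Oozie database upgrade."""
    return [wf["id"] for wf in workflows if wf.get("status") in BLOCKING_STATUSES]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample_jobs = [
        {"id": "wf-001", "status": "SUCCEEDED"},
        {"id": "wf-002", "status": "RUNNING"},
    ]
    if not cm_supports_cdh("5.1.0", "5.1.0"):
        print("Upgrade Cloudera Manager first.")
    for job_id in blocking_workflows(sample_jobs):
        print(f"Complete or kill workflow {job_id} before upgrading.")
```

Comparing only the major and minor components mirrors the rule as stated in the checklist; if your environment also needs maintenance-version checks, compare the full tuples instead.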
These recommendations and notable points to address when planning an upgrade of a Cloudera cluster are intended to complement the upgrade documentation that is provided for Cloudera Manager and CDH. As mentioned, Cloudera Manager streamlines the upgrade process and strives to prevent job failures by making upgrades simple and predictable, which is especially necessary for production clusters.
Cloudera’s enterprise data hub is constantly evolving with more production-ready capabilities and innovative tools. To ensure the highest level of functionality and stability, consider upgrading to the most recent version of CDH.