Uniquely among the available management options, Cloudera Manager simplifies what would otherwise be a disruptive operation for operators and users.
For the increasing number of customers who rely on enterprise data hubs (EDHs) for business-critical applications, it is imperative to minimize or eliminate downtime. Cloudera has therefore focused intently on making software upgrades a routine, non-disruptive operation for EDH administrators and users.
With Cloudera Manager 4.6 and later, you can do minor version upgrades (for example, from CDH 4.3 to CDH 4.6) with zero downtime and no service interruptions during the process. The parcels binary packaging format is the key to this process: parcels allow administrators to keep multiple versions of EDH software on their clusters simultaneously, and they make transitions across versions seamless.
There are two steps involved in doing a rolling upgrade:
- Distributing and activating a newer version of the relevant parcel (described in detail here)
- Doing a rolling restart of the existing cluster, which I will explain in the remainder of this relatively brief post
Doing the Rolling Restart
To start the rolling restart operation, go to the Actions menu of your cluster in Cloudera Manager and select Rolling Restart. (Having a highly available NameNode is a prerequisite for the rolling restart operation because it ensures that all data is available throughout the operation.)
Selecting Rolling Restart opens a pop-up in which you specify the parameters for the operation.
The important parameters in the dialog box are as follows:
- Services to restart: Here you specify which services should be restarted with no downtime. Any service that does not have a single point of access for its clients (including HDFS, Apache HBase, Apache MapReduce, Apache Oozie, and Apache ZooKeeper) is eligible for rolling restart, because we can ensure that clients do not face any service interruptions while the operation is in progress. This setting is required, and the form provides more options for selecting which roles to restart from the selected services.
- Roles to include: Here you can specify if you want to restart only worker roles (DataNode, TaskTracker, and RegionServer), only non-worker roles (NameNode, JobTracker, HBase Master, ZooKeeper Server, and so on), or all roles. This setting offers the flexibility to pick the order in which the roles are restarted.
- Role filters: These filters let you restart only the roles that have a new configuration or a new software version. One common use case is resuming a rolling restart that failed partway through because of host failures. If that happens, you can trigger the rolling restart again with the appropriate role filter, and it will restart only the roles that haven’t yet been restarted since the software upgrade or configuration change. The list of roles on which the rolling restart operation is performed is the intersection of “Roles to include” and the selected role filter.
- Batch size: Here you can specify how many hosts to restart at a time. If you have multiple racks, you would typically choose 5-10% of the size of the cluster. A higher batch size makes the overall rolling restart operation finish sooner, but cluster performance is reduced during the operation because fewer worker roles are active at any given time. If you have a single rack, leave the batch size at 1.
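To make the last two parameters concrete, here is a minimal Python sketch of how the final role list and a starting batch size could be worked out. It is an illustration of the rules described above, not Cloudera Manager code: the `Role` class, its `needs_restart` flag, and the functions `select_roles` and `suggest_batch_size` are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """Hypothetical model of a role instance; not a Cloudera Manager type."""
    name: str            # e.g. "DataNode"
    host: str
    is_worker: bool
    needs_restart: bool  # assumed flag: new configuration or new software version


def select_roles(roles, include="all", only_needing_restart=False):
    """Intersect the 'Roles to include' choice with the role filter."""
    if include == "workers":
        included = {r for r in roles if r.is_worker}
    elif include == "non-workers":
        included = {r for r in roles if not r.is_worker}
    else:
        included = set(roles)

    if only_needing_restart:
        filtered = {r for r in roles if r.needs_restart}
    else:
        filtered = set(roles)

    return included & filtered


def suggest_batch_size(num_hosts, num_racks, percent=5):
    """Single rack: 1. Multiple racks: a percentage of the cluster (5-10 per the text)."""
    if num_racks <= 1:
        return 1
    return max(1, num_hosts * percent // 100)


roles = [
    Role("NameNode", "master1", is_worker=False, needs_restart=True),
    Role("DataNode", "worker1", is_worker=True, needs_restart=False),  # already restarted
    Role("DataNode", "worker2", is_worker=True, needs_restart=True),
]
remaining = select_roles(roles, include="workers", only_needing_restart=True)
print(sorted(r.host for r in remaining))
print(suggest_batch_size(num_hosts=100, num_racks=4))
```

With the filter switched on, a re-run after a failure touches only the roles that still need a restart, which is what makes resuming cheap.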
Once you’ve specified the parameters, click Confirm, which will take you to the Command Details page for the resulting command.
The command will restart the selected services and their roles in proper order to ensure zero downtime for EDH users. The rolling restart operation itself comprises several steps that occur in the background:
- Master rolling restart: Highly available master roles like NameNode and JobTracker, along with other auxiliary roles in the cluster, are restarted without interrupting service to EDH consumers. This works because one of the masters remains available and active throughout the operation.
- Worker rolling restart: This step restarts the worker roles of each service in a way that does not affect running jobs or clients accessing data in EDH. The roles are restarted rack-by-rack. By default, Hadoop maintains replicas of every block on at least two racks (provided there are multiple racks), which ensures that all data is always available to the clients. Within each rack, the hosts are sorted alphabetically and grouped into batches of the specified size, and then the roles on each batch are restarted. For some roles, a simple restart is sufficient, while others must be decommissioned first to ensure that another worker role is servicing the client.
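The worker ordering can be pictured with a short Python sketch. This only models the grouping described above (rack by rack, hosts alphabetical within a rack, fixed batch size); the function name `plan_worker_batches` and the host-to-rack mapping are hypothetical, and visiting racks in sorted order is an assumption made so that the output is deterministic.

```python
from collections import defaultdict


def plan_worker_batches(host_to_rack, batch_size):
    """Return a list of (rack, hosts) batches in restart order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Group hosts by rack so that no batch ever spans two racks.
    by_rack = defaultdict(list)
    for host, rack in host_to_rack.items():
        by_rack[rack].append(host)

    plan = []
    for rack in sorted(by_rack):  # rack order is an assumption of this sketch
        hosts = sorted(by_rack[rack])  # alphabetical within the rack
        for start in range(0, len(hosts), batch_size):
            plan.append((rack, hosts[start:start + batch_size]))
    return plan


hosts = {"dn3": "/rack1", "dn1": "/rack1", "dn2": "/rack2", "dn4": "/rack1"}
for rack, batch in plan_worker_batches(hosts, batch_size=2):
    print(rack, batch)
```

Because a batch never crosses a rack boundary, the replicas kept on another rack stay reachable while the hosts in the current batch are down.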
The operation can take a little while to finish depending on cluster size and the chosen batch size. Should a host failure occur, Cloudera Manager lets users easily recover and resume the operation using the role filters described above.
As you can see, Cloudera Manager turns what would otherwise be a complex operation, a software upgrade with no downtime, into one that can be done without ever leaving the UI. More information about rolling upgrades can be found in the Cloudera Manager documentation.
Vikram Srivastava is a Software Engineer at Cloudera.