High Availability (Multi-AZ) for CDP Operational Database

How CDP Operational Database can deliver high availability for your applications when running on multiple availability zones in AWS

CDP Operational Database (COD) is an autonomous transactional database powered by Apache HBase and Apache Phoenix. It is one of the main Data Services that run on Cloudera Data Platform (CDP) Public Cloud, and you can access it directly from your CDP console. With COD, application developers can use the power of HBase and Phoenix without the overhead usually associated with deployment and management. COD is easy to provision and self-managing, which means developers can provision a new database instance within minutes and start prototyping quickly. Autonomous features such as auto-scaling, auto-healing, and auto-tuning mean there is no database management or administration to worry about.

In this blog, we’ll share how CDP Operational Database can deliver high availability for your applications when running on multiple availability zones in AWS.

To fully understand what a Multi-AZ deployment means for your infrastructure, it helps to know how Amazon Web Services is laid out across the globe and how it provides redundancy regardless of your location. As described in Amazon’s official documentation, the AWS Cloud is made up of a number of regions, which are physical locations around the world. While AZ outages are not officially tracked, Cloudera customers have reported experiencing AZ outages 1-2 times a year. Multi-AZ stretch deployments are therefore required to achieve availability of 99.95% or higher.
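To put the 99.95% figure in perspective, the arithmetic below converts an availability target into a yearly downtime budget and then divides that budget by the number of outages per year. This is a back-of-the-envelope sketch: the function names are illustrative, and it assumes a 365-day year and that every outage lasts equally long.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(availability_pct: float) -> float:
    """Hours per year a service may be unavailable and still meet the target."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def max_outage_hours(availability_pct: float, outages_per_year: int) -> float:
    """Longest average outage that still fits within the yearly budget."""
    return downtime_budget_hours(availability_pct) / outages_per_year


if __name__ == "__main__":
    # 99.95% leaves roughly 4.38 hours of downtime per year.
    print(f"{downtime_budget_hours(99.95):.2f} hours/year")
    # With two AZ outages a year, each may last about 2.19 hours on average.
    print(f"{max_outage_hours(99.95, 2):.2f} hours/outage")
```

With one or two AZ outages a year, a single-AZ database would have to ride out each outage in a couple of hours or less to stay within that budget, which is why the compute layer needs to span zones.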

Each region comprises a number of separate physical data centers, known as availability zones (AZs). Each AZ is a self-contained facility with its own power, connectivity, and networking capabilities. Most regions are home to 2-3 different availability zones, providing adequate redundancy within a given region. An AZ is identified by a region code followed by a letter; for example, us-west-1a.

However, this redundancy is only applied to the storage layer (S3) and does not exist for virtual machines used for your database instance. If something were to cause the Availability Zone where your server instances reside to have an outage, your database would cease to function, as the entire compute infrastructure would be offline.

This is where Multi-AZ deployment comes in. In a Multi-AZ deployment, the compute infrastructure for HBase’s Master and Region Servers is distributed across multiple Availability Zones. When a single Availability Zone has an outage, only a portion of the Region Servers is impacted, and clients automatically switch over to the remaining servers in the available AZs. Similarly, if the primary master was in the failed AZ, the backup master automatically takes over its role, since it is deployed in a separate AZ from the primary master. All of this is automatic and requires no setup, management, or action from a user or administrator. It simply works to ensure an application does not suffer an outage due to the loss of a single AZ.

Demo

Newly created COD databases will automatically take advantage of all configured availability zones in the environment. Therefore it’s crucial to set up the environment with the zones that we would like to use. 

For instance, we have an environment with the following AZs: us-west-1a, us-west-1b and us-west-1c. When we deploy a COD database, it is automatically deployed in a multi-AZ fashion; there is nothing to do! Let’s check behind the scenes and see what’s on the AWS console.

COD makes sure that worker nodes are equally spread across configured AZs. (Masters and the Leader are also deployed in different AZs in order to provide high availability for the ZooKeeper quorum.)
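To illustrate what “equally spread” means in practice, here is a minimal sketch of a round-robin placement across the three zones from the example environment. It is not COD’s actual placement logic. The worker names and both function names are hypothetical, and the code only shows the property that matters: losing one AZ takes out roughly one third of the workers.

```python
from collections import Counter


def spread_across_azs(nodes, zones):
    """Assign nodes to zones round-robin so per-zone counts differ by at most one."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    return {node: zones[i % len(zones)] for i, node in enumerate(nodes)}


def surviving_nodes(placement, failed_zone):
    """Return the nodes that are still reachable when one zone is lost."""
    return [node for node, zone in placement.items() if zone != failed_zone]


# Zones from the example environment; worker names are made up for illustration.
zones = ["us-west-1a", "us-west-1b", "us-west-1c"]
workers = [f"worker-{i}" for i in range(6)]

placement = spread_across_azs(workers, zones)
per_zone = Counter(placement.values())        # two workers in each zone
survivors = surviving_nodes(placement, "us-west-1a")
```

With six workers and three zones, each zone hosts two workers, so four of the six remain available when any single zone fails.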

Apache HBase already has built-in failover capabilities, so if one AZ goes offline, the mechanisms needed to keep your database serving requests automatically are already in place.

In order to add a bit more fun, let’s run a simple HBase load test during our testing. HBase has a built-in load test tool which we can use for a long running write test:

hbase ltt -write 10:1024:10 -num_keys 10000000

Let’s simulate an AZ failure now and see what happens. The easiest way to do that is to add a new Network ACL that blocks the ingress and egress traffic of a given subnet, producing conditions similar to a real AWS outage.
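One way to script the Network ACL step is sketched below. It assumes a boto3 EC2 client is passed in as `ec2` (boto3 is not imported here so the snippet stays self-contained), and the VPC and subnet IDs are placeholders you would supply. It relies on the documented AWS behavior that a newly created custom network ACL denies all inbound and outbound traffic until rules are added. The function names are our own, and this is not necessarily how the demo was performed.

```python
def isolate_subnet(ec2, vpc_id, subnet_id):
    """Swap the subnet's network ACL for a fresh one that denies all traffic.

    Returns (association_id, original_acl_id, blocking_acl_id) so the change
    can be undone later with restore_subnet().
    """
    # Find the ACL currently associated with the subnet.
    resp = ec2.describe_network_acls(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )
    original_acl = resp["NetworkAcls"][0]
    association = next(
        a for a in original_acl["Associations"] if a["SubnetId"] == subnet_id
    )

    # A new custom ACL has no allow rules, so it blocks ingress and egress.
    created = ec2.create_network_acl(VpcId=vpc_id)
    blocking_acl_id = created["NetworkAcl"]["NetworkAclId"]

    # Point the subnet at the blocking ACL; AWS returns a new association ID.
    swapped = ec2.replace_network_acl_association(
        AssociationId=association["NetworkAclAssociationId"],
        NetworkAclId=blocking_acl_id,
    )
    return swapped["NewAssociationId"], original_acl["NetworkAclId"], blocking_acl_id


def restore_subnet(ec2, association_id, original_acl_id, blocking_acl_id):
    """End the simulated outage: re-attach the original ACL, then delete ours."""
    ec2.replace_network_acl_association(
        AssociationId=association_id, NetworkAclId=original_acl_id
    )
    ec2.delete_network_acl(NetworkAclId=blocking_acl_id)


# Example usage (requires boto3 and AWS credentials; IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2")
#   state = isolate_subnet(ec2, "vpc-EXAMPLE", "subnet-EXAMPLE")
#   ...observe the failover...
#   restore_subnet(ec2, *state)
```

Note that the original ACL is re-associated before the blocking one is deleted, because AWS does not allow deleting a network ACL that is still associated with a subnet.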

In the first minute we don’t see anything particularly interesting on the status page, because from COD’s perspective the database is still healthy.

But notice that the client has stopped making progress.

In 10-20 seconds, the Master realizes that some of the Region Servers are dead.

If the outage affects the active master, HBase automatically switches over to the backup master, which takes over the role after 10-20 seconds.

The failure doesn’t last long: after 2-3 minutes and some transient region errors, the client is able to make progress again. During that time, the Master had to transition the regions from the dead Region Servers to live ones.

To simulate the end of the outage, let’s undo the change by deleting the network ACL. The Region Servers then reconnect to the Master.

Now we’re back where we started. COD has fully recovered from the outage. In the write requests we can see two drops: the first is when the client transitioned to the remaining live Region Servers, and the second, slightly later, is when HBase’s load balancer moved the regions back to the reconnected servers.

COD on HDFS

Object storage in the cloud is the default storage layer for COD. It spreads data across 3 availability zones and re-balances behind the scenes. HBase only has to do some housekeeping (region transition) so that the remaining servers can serve the regions, making this a relatively fast operation.

For high performance use cases, COD supports using HDFS as its underlying storage. In this deployment paradigm, we automatically configure HDFS rack awareness for fault tolerance by placing one block replica on a different rack and mapping the racks to Availability Zones. This provides data availability in the event of a network switch failure or partition within the cluster. So, the behavior in the demo above is very similar to what you would see when deploying COD with HDFS.
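COD configures rack awareness automatically, so there is nothing to write yourself. For readers curious what mapping racks to Availability Zones can look like, here is a minimal sketch of a Hadoop topology script, the mechanism HDFS supports through the `net.topology.script.file.name` property. Hadoop calls the script with one or more host names or IP addresses and expects one rack path per argument on standard output. The IP-to-AZ table is entirely hypothetical, and this is not COD’s actual implementation.

```python
#!/usr/bin/env python3
"""Sketch of an HDFS topology script that treats each AZ as a rack."""
import sys

# Hypothetical host-to-AZ table; a real script might derive the zone from
# subnet ranges or instance metadata instead of a hard-coded dictionary.
HOST_TO_AZ = {
    "10.0.1.11": "us-west-1a",
    "10.0.2.11": "us-west-1b",
    "10.0.3.11": "us-west-1c",
}

# Rack used by Hadoop convention for hosts that cannot be resolved.
DEFAULT_RACK = "/default-rack"


def rack_for(host: str) -> str:
    """Return the rack path for a host, using the AZ name as the rack name."""
    az = HOST_TO_AZ.get(host)
    return f"/{az}" if az else DEFAULT_RACK


def main(argv):
    # Emit one rack path per argument, in the same order as the input.
    print("\n".join(rack_for(host) for host in argv))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Because each rack corresponds to an AZ, the HDFS placement policy of putting a block replica on a different rack translates into keeping a copy of every block in a different Availability Zone.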

Summary

Multi-AZ deployment is crucial for highly available databases, and COD now supports it on AWS as a technical preview, working behind the scenes at no extra cost. It makes your operational workload more robust and reliable with zero additional configuration. It will soon be generally available and support additional cloud providers (Microsoft, Google).

Reach out to your Cloudera account team if you are interested in learning more about how to migrate from your deployment of HBase to CDP Operational Database in the public cloud or take it for a spin with the Cloudera Test Drive.

Andor Molnar
Senior Staff Engineer - OpDB