Before CDH 5.10, every CDH cluster had to have its own Apache Hive Metastore (HMS) backend database. This model is ideal for clusters where each cluster contains the data locally along with the metadata. In the cloud, however, many CDH clusters run directly on a shared object store (like Amazon S3), making it possible for the data to live across multiple clusters and beyond any cluster’s lifespan. In this scenario clusters need to regenerate and coordinate metadata for the underlying shared data individually.
From CDH 5.10 onward, clusters running in AWS cloud can share a single persistent instance of RDS as the HMS backend database. This enables persistent sharing of metadata beyond a cluster’s life cycle so that subsequent clusters need not regenerate metadata as before.
Advantages of This Approach
Using a shared Amazon RDS server as your HMS backend enables you to deploy and share data and metadata across multiple transient as well as persistent clusters, provided they adhere to restrictions outlined in the “Supported Scenarios” section below. For example, you can have multiple transient Hive or Apache Spark clusters writing table data and metadata which can be subsequently queried by a persistent Apache Impala (incubating) cluster. Or you might have 2-3 different transient clusters, each dealing with different types of jobs on different datasets that spin up, read raw data from S3, do the ETL, write data out to S3, and spin down. In this scenario, you want each cluster to be able to simply point to a permanent HMS and do the ETL. Using RDS as a shared HMS backend database greatly reduces your overhead because you no longer need to recreate the HMS again and again for each cluster, every day, for each transient ETL job.
To configure RDS as the backend database for a shared Hive Metastore:
- Create a MySQL instance with Amazon RDS. See Creating a MySQL DB Instance… and Creating an RDS Database Tutorial in Amazon documentation. This is done only once. Subsequent clusters aiming to use an existing RDS instance will not need this step.
- Configure a remote MySQL Hive Metastore database as part of the Cloudera Manager installation procedure, using the hostname, username, and password configured during your RDS setup. See Configuring a Remote MySQL Database for the Hive Metastore in Cloudera documentation.
- Configure Hive, Impala, or Spark to use Amazon S3:
These limitations apply to the jobs you run when you use an RDS server as a remote backend database for Hive Metastore:
- No overlapping data or metadata changes to the same data sets across clusters.
- No reads during data or metadata changes to the same data sets across clusters.
- Overlapping data or metadata changes are defined as when multiple clusters concurrently:
- Make updates to the same table or partitions within the table located on S3.
- Add or change the same parent schema or database.
*3/16/2017 – This article has been updated to reflect that Cloudera has removed the restriction on Shared RDS for Kerberos and Sentry.