How To Set Up a Shared Amazon RDS as Your Hive Metastore

Categories: Cloud, Hadoop, Hive, How-to, Impala, Spark, Use Case

Before CDH 5.10, every CDH cluster had to have its own Apache Hive Metastore (HMS) backend database. This model is ideal for clusters where each cluster stores its data locally along with the metadata. In the cloud, however, many CDH clusters run directly on a shared object store (such as Amazon S3), so the data can live across multiple clusters and beyond any single cluster’s lifespan. In that scenario, each cluster must individually regenerate and coordinate metadata for the underlying shared data.

From CDH 5.10 onward, clusters running in the AWS cloud can share a single persistent Amazon RDS instance as the HMS backend database. This enables metadata to be shared persistently beyond a cluster’s life cycle, so subsequent clusters no longer need to regenerate it.

Advantages of This Approach

Using a shared Amazon RDS server as your HMS backend enables you to deploy and share data and metadata across multiple transient as well as persistent clusters, provided they adhere to the restrictions outlined in the “Supported Scenarios” section below. For example, you can have multiple transient Hive or Apache Spark clusters writing table data and metadata, which a persistent Apache Impala (incubating) cluster can subsequently query. Or you might have two or three different transient clusters, each handling different types of jobs on different datasets: they spin up, read raw data from S3, perform the ETL, write data back out to S3, and spin down. In this scenario, you want each cluster to simply point at a permanent HMS and run its ETL. Using RDS as a shared HMS backend database greatly reduces your overhead because you no longer need to recreate the HMS for each cluster, every day, for each transient ETL job.

To configure RDS as the backend database for a shared Hive Metastore:

  1. Create a MySQL instance with Amazon RDS. See Creating a MySQL DB Instance… and Creating an RDS Database Tutorial in Amazon documentation. This step is done only once; subsequent clusters that use the existing RDS instance can skip it.
  2. Configure a remote MySQL Hive Metastore database as part of the Cloudera Manager installation procedure, using the hostname, username, and password configured during your RDS setup. See Configuring a Remote MySQL Database for the Hive Metastore in Cloudera documentation.
  3. Configure Hive, Impala, or Spark to use Amazon S3.
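For steps 2 and 3, the configuration that Cloudera Manager ultimately manages boils down to pointing the Hive Metastore at the RDS endpoint and supplying S3 credentials. A minimal sketch of the relevant properties follows; the endpoint, database name, and credentials are placeholders, and in a Cloudera Manager deployment you would set these through the Cloudera Manager UI rather than by editing files directly:

```xml
<!-- hive-site.xml: point the Hive Metastore at the shared RDS instance.
     Endpoint, database name, and credentials are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://shared-hms-metastore.abc123.us-east-1.rds.amazonaws.com:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>ChangeMe123!</value>
</property>

<!-- core-site.xml: S3 credentials so Hive, Impala, or Spark
     can read and write s3a:// paths. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_AWS_SECRET_KEY</value>
</property>
```

Every cluster that points its `javax.jdo.option.ConnectionURL` at the same RDS endpoint and database sees the same tables and partitions, which is what makes the metadata outlive any single cluster.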
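As a concrete sketch of step 1, the RDS instance can be provisioned with the AWS CLI instead of the console. The instance identifier, instance class, credentials, storage size, and security group below are illustrative placeholders, not values from this article, and must be adapted to your environment:

```shell
# Provision a MySQL RDS instance to serve as the shared HMS backend.
# All names, sizes, and credentials below are example values.
aws rds create-db-instance \
  --db-instance-identifier shared-hms-metastore \
  --engine mysql \
  --db-instance-class db.t3.medium \
  --allocated-storage 20 \
  --master-username hive \
  --master-user-password 'ChangeMe123!' \
  --vpc-security-group-ids sg-0123456789abcdef0   # must allow port 3306 from your clusters

# Once the instance is available, look up its endpoint for use in step 2.
aws rds describe-db-instances \
  --db-instance-identifier shared-hms-metastore \
  --query 'DBInstances[0].Endpoint.Address'
```

The key operational point is that the security group must admit MySQL traffic (port 3306) from every cluster that will share the metastore.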

Supported Scenarios

These limitations apply to the jobs you run when you use an RDS server as a remote backend database for Hive Metastore:

  • No overlapping data or metadata changes to the same datasets across clusters.
  • No reads during data or metadata changes to the same datasets across clusters.

Overlapping data or metadata changes occur when multiple clusters concurrently:

  • Make updates to the same table, or to partitions within that table, located on S3.
  • Add or change the same parent schema or database.
If you are running a shared RDS, Cloudera Support will help licensed customers repair any unexpected metadata issues, but will not perform root-cause analysis.

 
*3/16/2017 – This article has been updated to reflect that Cloudera has removed the restriction on Shared RDS for Kerberos and Sentry.

