Last week Cloudera released Cloudera Manager 4.5, the latest version of the leading framework for end-to-end management of Apache Hadoop clusters. (Download Cloudera Manager here, and see install instructions here.) Among many other features, Cloudera Manager 4.5 adds support for Apache Hive. In this post, I’ll explain how to set up a Hive server for use with Cloudera Manager 4.5 (and later).
For details about other new features in this release, please see the full release notes.
Introducing the Hive Metastore Server
Starting with Cloudera Manager 4.5, there is a new role type called the Hive Metastore Server. This role manages the metastore process when Hive is configured with a Remote Metastore.
I strongly encourage you to read the documentation about the Hive Remote Metastore and Local Metastore.
Cloudera recommends using a Remote Metastore with Hive, especially for CDH 4.2 or later. Since the Remote Metastore is recommended, Cloudera Manager treats the Hive Metastore Server as a required role for all Hive services. Here are a couple of key reasons why the Remote Metastore setup is advantageous, especially in production settings:
- The Hive Metastore Database password and JDBC drivers don’t need to be shared with every Hive client; only the Hive Metastore Server needs them. Sharing passwords with many machines is a security concern.
- You can control activity on the Hive Metastore Database. To stop all activity on the database, just stop the Hive Metastore Server. This makes it easy to perform tasks such as backup and upgrade, which require all Hive activity to stop.
The Hive Metastore Server should not be used with CDH3. If you are using CDH3 or you’d like to use the Local Metastore mode, you can control this process by enabling the Bypass Hive Metastore Server mode in the Hive Service Configuration. See the discussion in the “Hive Services Created During Cloudera Manager Upgrade” section below.
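To make the Remote versus Local distinction concrete, here is a minimal sketch that inspects a client-side hive-site.xml. It assumes the standard Hive properties hive.metastore.uris (set when clients talk to a Remote Metastore) and javax.jdo.option.ConnectionPassword (the database password that a direct-to-database client has to carry). The host names, password, and helper function names are hypothetical, and this is only an illustration, not something Cloudera Manager runs.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Return {name: value} for every <property> in a hive-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def metastore_mode(props):
    """A non-empty hive.metastore.uris means the client uses a Remote Metastore."""
    return "remote" if props.get("hive.metastore.uris") else "local"


def exposes_db_password(props):
    """True if this client config carries the Metastore Database password."""
    return "javax.jdo.option.ConnectionPassword" in props


# Hypothetical client config pointing at a Hive Metastore Server.
REMOTE_CLIENT = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>"""

# Hypothetical client config that connects straight to the database.
LOCAL_CLIENT = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://db-host.example.com/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>example-password</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for label, text in (("remote example", REMOTE_CLIENT), ("local example", LOCAL_CLIENT)):
        props = load_properties(text)
        print(label, metastore_mode(props), exposes_db_password(props))
```

In the remote example the client only needs the Thrift URI; the database credentials stay on the Hive Metastore Server host.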
Hive Setup Made Easy
Whether adding Hive for the first time or importing an existing Hive install to be managed by Cloudera Manager, the steps are very similar. If you are upgrading from a previous Cloudera Manager release, then see the section below on upgrade.
To help you set up and manage Hive, Cloudera Manager will:
- Create a database for the Hive Metastore if you are using the Cloudera Manager Embedded PostgreSQL Database.
- Create Hive Metastore tables.
- Create the Hive Warehouse Directory in HDFS.
- Manage Hive Metastore Server.
- Manage HiveServer2 (CDH4.2 only).
- Manage client configurations (/etc/hive/conf).
- Manage Hive configuration for services that depend on Hive (Cloudera Impala and Hue).
Note that using Apache Derby for the Hive Metastore Database is not recommended for production use. The wizards do not support adding a new Hive service configured with Derby. Instead, I recommend leveraging the Cloudera Manager Embedded PostgreSQL Database for an easy, production-quality setup.
Now let’s walk through adding a Hive service to an existing cluster. If you are creating a new Cloudera Manager cluster from scratch, the steps are extremely similar to the steps outlined below.
- If you are importing an existing Hive install to be managed by Cloudera Manager:
- Back up your Hive Metastore Database and any Hive configuration files (hive-site.xml).
- Stop any running Hive processes such as Hive Metastore, HiveServer, HiveServer2, or any Hive clients that are running commands.
- Get your Hive Metastore Database login info handy. You’ll need it in a moment.
- Add a “Gateway” to any host from which you will run the Hive CLI or the Beeline CLI.
- Pick one host for the “Hive Metastore Server”.
- Although Hive can be configured to use more than one Hive Metastore Server, Cloudera does not support having multiple Hive Metastore Servers. Running more than one may result in problems such as concurrency errors.
- For performance reasons, it’s generally a good idea to have the Hive Metastore Server on the same host as the database that it’ll be talking to. This is not required.
- If you’re using CDH4.2, you can also pick “HiveServer2”. HiveServer2 differs from HiveServer, and Beeline is the supported CLI to communicate with HiveServer2. HiveServer2 supports multiple clients making many simultaneous requests, which is an improvement over HiveServer. (See HiveServer2 Documentation.)
- If you’d like to leverage the Cloudera Manager Embedded PostgreSQL Database for your Hive Metastore Database, select “Use Embedded Database”. Don’t use this when adopting an existing Hive setup.
- If you’d like to configure an external database, then select “Use Custom Databases” and enter the appropriate database login information. If you are importing an existing Hive setup, enter the same information that your existing Hive setup uses.
- Click “Test Connection” and make sure there are no errors (skipped tests are ok), then click “Continue”.
- For newly created Hive setups, the defaults are normally appropriate.
- If you are adopting an existing Hive setup, then be sure to pick the Hive Warehouse Directory that matches what your existing Hive setup uses.
- Create Hive Metastore Database – creates the user and database in the Cloudera Manager Embedded PostgreSQL server. If you selected a custom database, then this step will not appear in the workflow.
- Create Hive Metastore Database Tables – creates all of the tables in the Hive Metastore Database for the current Hive version. This command will only run if the schema is empty.
- Create Hive Warehouse Directory – creates the Hive Warehouse directory in HDFS with 1777 permissions if it doesn’t already exist.
- Deploy Client Configuration – updates /etc/hive/conf on all hosts that have a Hive role, using the alternatives mechanism.
Once this is done, you can run the hive shell command on any of these hosts and your Hive commands will all go through the Hive Metastore Server.
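As an aside on the 1777 permissions used for the Hive Warehouse Directory above: the leading 1 is the sticky bit and 777 makes the directory readable, writable, and searchable by everyone, so any user can create tables there but cannot delete or move other users’ files. The sketch below decodes that octal mode with Python’s standard stat module; the helper names are illustrative only.

```python
import stat

# Octal mode Cloudera Manager applies to the Hive Warehouse Directory.
WAREHOUSE_MODE = 0o1777


def describe_mode(mode):
    """Render a directory mode the way `ls -l` would show it."""
    return stat.filemode(stat.S_IFDIR | mode)


def has_sticky_bit(mode):
    """True if the sticky bit (the leading 1 in 1777) is set."""
    return bool(mode & stat.S_ISVTX)


def world_writable(mode):
    """True if users outside the owner and group may write."""
    return bool(mode & stat.S_IWOTH)


if __name__ == "__main__":
    print(describe_mode(WAREHOUSE_MODE))  # drwxrwxrwt
    print(has_sticky_bit(WAREHOUSE_MODE), world_writable(WAREHOUSE_MODE))
```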
- You can now review the configuration to make sure it matches what you’d like.
- If you are importing an existing setup, you should compare the configuration with your backup copy of hive-site.xml.
- If there’s any config that’s missing from your hive-site.xml, but there’s no option in the UI for this config, then you can use the property Hive Service Configuration Safety Valve for hive-site.xml to specify these configs for all CM-managed processes (includes roles from Hive, Impala, and Hue). If you need the config to be used by Hive CLI (which normally reads from /etc/hive/conf/hive-site.xml), then use the property Hive Client Configuration Safety Valve for hive-site.xml.
- Be sure to restart Hive, restart any dependent services (Hue and Impala), and deploy client configs after making a configuration change.
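When comparing the deployed configuration against your backup copy of hive-site.xml, a small script can list what is missing or different. The following is a rough sketch under stated assumptions: the two XML snippets stand in for your backup and for the Cloudera Manager deployed /etc/hive/conf/hive-site.xml, the property values and host name are made up, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Return {name: value} for every <property> in a hive-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_configs(backup, managed):
    """Compare a backup config against the managed one.

    Returns (missing, changed):
      missing: properties present only in the backup
      changed: properties in both, mapped to (backup_value, managed_value)
    """
    missing = {k: v for k, v in backup.items() if k not in managed}
    changed = {
        k: (v, managed[k])
        for k, v in backup.items()
        if k in managed and managed[k] != v
    }
    return missing, changed


# Illustrative stand-in for the backup taken before the import.
BACKUP_XML = """<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/data/hive/warehouse</value>
  </property>
  <property>
    <name>hive.exec.parallel</name>
    <value>true</value>
  </property>
</configuration>"""

# Illustrative stand-in for the config deployed by Cloudera Manager.
MANAGED_XML = """<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    missing, changed = diff_configs(
        load_properties(BACKUP_XML), load_properties(MANAGED_XML)
    )
    for name, value in sorted(missing.items()):
        print("missing from managed config:", name, "=", value)
    for name, (old, new) in sorted(changed.items()):
        print("differs:", name, "backup =", old, "managed =", new)
```

Properties reported as missing are candidates for one of the safety valves described above, and a differing hive.metastore.warehouse.dir suggests the Hive Warehouse Directory setting does not match the existing setup. To run this against real files, read each file’s contents and pass the text to load_properties.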
Hive Services Created During Cloudera Manager Upgrade
When upgrading to Cloudera Manager 4.5, if there are any Hue or Impala services in the existing setup, one or more Hive services will automatically be created.
Prior to Cloudera Manager 4.5, Hue and Impala both specified Hive configurations such as warehouse directory and database configuration. When upgrading to Cloudera Manager 4.5, this information from the Hue and Impala configurations is used to generate a new Hive service with the same warehouse directory and database configuration. The old Hue and Impala services are then linked to the new Hive service(s). Cloudera Manager will attempt to merge the configuration, so if Hue and Impala had an identical Hive configuration, then only a single Hive service will be created.
After upgrading to Cloudera Manager 4.5, at the end of the Upgrade Wizard, you will be asked to add a Hive Metastore Server role to each Hive Service that was automatically created. Select one host for the Hive Metastore Server. Cloudera recommends using the Hive Metastore Server, so Cloudera Manager requires that each Hive service has one. For performance reasons, it’s good to have the Hive Metastore Server on the same host as the database that it’ll be talking to.
If you manually (not using Cloudera Manager) created an account for Hive in the Embedded PostgreSQL Database, then you need to make sure that the database and user have the same name. This can easily be done through the commands:
CREATE ROLE <dbname> LOGIN PASSWORD '<password>';
ALTER DATABASE <dbname> OWNER TO <dbname>;
REASSIGN OWNED BY <old username> TO <dbname>;
After running these commands, edit your Hive Metastore Database configuration in Cloudera Manager with the new username and password, restart Hive, restart Hue/Impala, and deploy the client configuration. (Thanks to Benjamin Kim on the scm-users group for pointing this out!)
Hive has the configuration Bypass Hive Metastore Server. When this configuration is enabled, Hive clients, Hue, and Impala connect directly to the Hive Metastore Database. Prior to Cloudera Manager 4.5, Hue and Impala talked directly to the Hive Metastore Database, so the Bypass mode is enabled by default when upgrading to Cloudera Manager 4.5. This is to make sure the upgrade doesn’t disrupt your existing setup.
You should plan to disable the Bypass Hive Metastore Server mode, especially when using CDH 4.2 or later. Using the Hive Metastore Server is the recommended configuration (as discussed in “Introducing the Hive Metastore Server” previously).
To switch between using the Hive Metastore Server and talking directly to the Metastore Database, use the Hive service configuration Bypass Hive Metastore Server. You can find this option by using the search feature on the Hive Service Configuration page.
After toggling this Bypass option, restart Hive and all services that depend on Hive (Hue and Impala), then re-deploy client configuration.
Here are some common issues I hope you’ll now easily avoid. It’s always a good idea to look at the latest Cloudera Manager Installation Documentation and Known Issues before performing an install or upgrade.
- When upgrading to CDH 4.1 or 4.2, a manual Metastore Database upgrade is required.
- When performing any CDH upgrade, be sure to read the Cloudera Manager upgrade guides to make sure the Hive Metastore Database is properly backed up and upgraded before starting Hive. See “Upgrading CDH in a Cloudera Managed Deployment” in this doc.
- You’ll probably need to chown /user/hive in HDFS to hive:hive after the warehouse directory is created; otherwise you may see errors in creating /user/hive/.Trash when you drop a table.
- There are various issues with using the Hive Metastore Server in CDH3. It’s easier to just always enable the “Bypass Hive Metastore Server” mode when running in CDH3. (See the Upgrade section for discussion on using this option.)
CDH4.0 and CDH4.1 Secure Clusters
- Hue has trouble talking to the Hive Metastore Server in CDH4.0 and CDH 4.1 secure clusters. Details can be found at the “Known Issues and Workarounds” section of this doc.
Darren Lo is a Software Engineer working on the Enterprise team.