Demo: Using Hue to Access Hive Data Through Pig
- by Hue Team
- August 07, 2013
This installment of the Hue demo series is about accessing the Hive Metastore from Hue, as well as using HCatalog with Hue. (Hue, of course, is the open source Web UI that makes Apache Hadoop easier to use.)
What is HCatalog?
HCatalog is a module in Apache Hive that enables non-Hive scripts to access Hive tables. You can then load tables directly with Apache Pig or MapReduce without redefining the input schemas and without having to know, or duplicate, the data’s location.
Hue contains a Web application for accessing the Hive metastore called Metastore Browser, which lets you explore, create, or delete databases and tables using wizards. (You can see a demo of these wizards in a previous tutorial about how to analyze Yelp data.) However, Hue uses HiveServer2 for accessing the metastore instead of HCatalog. This is because HiveServer2 is the new secure and concurrent server for Hive and it includes a fast Hive Metastore API.
HCatalog connectors are still useful for accessing Hive data through Pig, though. Here is a demo about accessing the Hive example tables from the Pig Editor:
To try this yourself, first install HCatalog via Cloudera Manager (or do it the manual way). If you are using a fully distributed cluster (i.e. not a demo VM), make sure that the Hive Metastore is remote, or you will see the error described in the warning further down. Then, upload the three jars from /usr/lib/hcatalog/share/hcatalog/ and all the Hive ones from /usr/lib/hive/lib to the Oozie Pig sharelib in /user/oozie/share/lib/pig. This can be done in a few clicks in the File Browser while logged in as ‘oozie’ or ‘hdfs’.
Keep in mind that these jars will then be included in all future Pig scripts, which might be unnecessary. An alternative is to upload the jars to your HDFS home directory and then include the path to that directory with the Hadoop property ‘oozie.libpath’ in the Properties section of the Pig Editor.
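As a rough illustration of this alternative, the property might be entered as the name/value pair below. The user name and the directory name hcatalog-libs are hypothetical placeholders, not paths from the setup described here.

```properties
# Hypothetical HDFS directory holding the uploaded HCatalog and Hive jars
oozie.libpath=/user/your_user/hcatalog-libs
```

With this approach the jars are only picked up by scripts that set the property, rather than by every Pig job that uses the shared sharelib.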
-- Load table 'sample_07'
sample_07 = LOAD 'sample_07' USING org.apache.hcatalog.pig.HCatLoader();

-- Compute the average salary of the table
salaries = GROUP sample_07 ALL;
out = FOREACH salaries GENERATE AVG(sample_07.salary);
DUMP out;
As HCatalog needs to access the metastore, you need to specify the hive-site.xml. Go to Properties > Resources and add a ‘File’ pointing to the hive-site.xml uploaded on HDFS. Then, submit the script by pressing CTRL + ENTER. The result (47963.62637362637) will appear at the end of the log output. (Notice that you don’t need to redefine the schema as it is automatically picked up by the loader.) If you use the Oozie App, you can now freely use HCatalog in your Pig actions.
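For context, a hive-site.xml that points at a remote metastore typically contains a property like the one sketched here. The host name is a placeholder, and 9083 is only the commonly used default Thrift port; take the real value from your own cluster configuration.

```xml
<!-- Sketch only: metastore-host.example.com is a placeholder host name -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host.example.com:9083</value>
</property>
```

When this property is set, clients such as HCatLoader talk to the metastore service over Thrift instead of opening the embedded database directly.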
Warning! If you get the error below, it means that your metastore is owned by the Hive user and is not remote.
Cannot get a connection, pool error Could not create a validated object, cause: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
2013-07-24 23:20:04,969 [main] INFO DataNucleus.Persistence - DataNucleus Persistence Factory initialised for datastore URL="jdbc:derby:;databaseName=/var/lib/hive/metastore/metastore_db;create=true" driver="org.apache.derby.jdbc.EmbeddedDriver" userName="APP"
A workaround is to make sure that Beeswax is shut down and then change the permissions of the embedded Derby database:
sudo rm /var/lib/hive/metastore/metastore_db/*lck
sudo chmod -R 777 /var/lib/hive/metastore/metastore_db
Just as HCatLoader reads a table, HCatStorer writes to one, e.g.:
STORE alias INTO 'sample_07' USING org.apache.hcatalog.pig.HCatStorer();
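Putting the loader and the storer together, a minimal sketch of a full round trip could look like the script below. It assumes a target table named sample_07_high_paid (a hypothetical name) that has already been created in Hive with the same columns as sample_07, since HCatStorer writes into an existing table rather than creating one. The 50000 threshold is arbitrary.

```pig
-- Load the Hive table; HCatLoader supplies the schema
sample_07 = LOAD 'sample_07' USING org.apache.hcatalog.pig.HCatLoader();

-- Keep only the rows with a salary above an arbitrary threshold
high_paid = FILTER sample_07 BY salary > 50000;

-- Write the result to an existing Hive table (hypothetical name)
STORE high_paid INTO 'sample_07_high_paid' USING org.apache.hcatalog.pig.HCatStorer();
```

Because the relation keeps the column names it got from HCatLoader, no schema needs to be declared on the way out either.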
Here you have seen how Hue makes it easy to access Hive’s metastore and how it supports the HCatalog connectors for Pig. Hue 3.0 will simplify things even more by automatically copying the required jar files and making the table names auto-complete.