What’s New in CDH3b2: Apache Hive

CDH3 beta 2 includes Apache Hive 0.5.0, the latest version of the popular open source data warehouse platform for Apache Hadoop. Hive lets you express data analysis tasks in a dialect of SQL called HiveQL, then compiles these tasks into MapReduce jobs and executes them on your Hadoop cluster. Hive is a natural entry point to Hadoop for people with prior relational database experience. Even those who have never written a line of SQL should give it a chance, since it is currently the only Hadoop dataflow programming platform with built-in facilities for managing metadata. This feature lets you access your data through a table abstraction, cleanly separating your analysis logic from the details of how your data is formatted and parsed. The result is scripts that are easier to write and much easier to maintain.
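
To make the table abstraction concrete, here is a minimal HiveQL sketch. The table name, column names, and tab-delimited layout are hypothetical, chosen only for illustration. The format details appear once in the table definition, and the query itself refers only to columns.

```sql
-- Hypothetical table over tab-delimited text files; all names are illustrative.
-- The storage and parsing details are declared once, here.
CREATE TABLE page_views (
  view_time STRING,
  user_id   BIGINT,
  page_url  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- The analysis logic names columns only; it does not know how rows are parsed.
SELECT page_url, COUNT(1) AS views
FROM page_views
GROUP BY page_url;
```

If the underlying files later moved to a different format, only the table definition would need to change under this setup; the query would stay the same.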

While Hive is great on its own, it’s even better when you connect it to other tools in the Hadoop ecosystem. Users can currently use Sqoop to import data from relational databases into Hive, run Hive jobs inside Oozie workflows, and design queries in the Beeswax query editor that ships with Hue. Hive 0.6.0 will include new features that make it possible to seamlessly access HBase tables from Hive, and there is also work afoot to provide an integration point between Hive and Flume.

The 0.5.0 release of Hive includes a variety of feature enhancements and bug fixes that improve the usability and stability of the Hive platform. These changes include extensions to HiveQL such as support for the CREATE TABLE AS SELECT statement, LEFT SEMI JOINs, and LATERAL VIEWs, as well as support for User-Defined Table-Generating Functions (UDTFs). The 0.5.0 release also includes enhancements that improve the performance of GROUP BY aggregations and of Hive’s RCFile columnar storage format.
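
As a rough sketch of what these HiveQL extensions look like in practice, the statements below assume three hypothetical tables: page_views(user_id, page_url), active_users(user_id), and articles(article_id, tags), where tags is an ARRAY<STRING> column. None of these names come from the release itself.

```sql
-- CREATE TABLE AS SELECT: materialize a query result as a new table.
CREATE TABLE top_pages AS
SELECT page_url, COUNT(1) AS views
FROM page_views
GROUP BY page_url;

-- LEFT SEMI JOIN: keep rows from page_views whose user_id has a match in
-- active_users. The right-hand table may be referenced only in the ON clause.
SELECT pv.page_url, pv.user_id
FROM page_views pv
LEFT SEMI JOIN active_users au ON (pv.user_id = au.user_id);

-- LATERAL VIEW with the built-in explode() table-generating function:
-- emit one output row per element of the tags array.
SELECT article_id, tag
FROM articles
LATERAL VIEW explode(tags) tag_table AS tag;
```

The semi join behaves like an IN or EXISTS subquery: each matching row of page_views appears at most once, regardless of how many matching rows active_users contains.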

Readers who are new to Hive should check out our Hive training videos and tutorial notes, as well as an earlier blog post from Peter Skomoroch in which he explains how he used Hive and Hadoop to identify trending topics on Wikipedia. Experienced users looking to upgrade to the new version of Hive will want to consult the CDH Quick Start Guide and the CDH Hive Installation Guide.