The recently released Apache Hive 2.0 contains some exciting improvements, many of which are already available in CDH 5.x.
Recently, the Apache Hive community announced Hive 2.0.0. This release is larger than the previous one (covered here), with a lengthy list of new features (many experimental), enhancements, and bug fixes. Cloudera’s Hive team has been working with the community for months to drive toward this significant release.
Contributors from Intel, Cloudera, and the rest of the community have been making strong progress on the Hive-on-Spark initiative. This post provides an update.
[Editor’s note (April 20, 2016): Hive-on-Spark is now GA/shipping starting in CDH 5.7.]
Since its inception about one year ago, the community initiative to make Apache Spark a data processing engine for Apache Hive (HIVE-7292) has attracted widespread interest from developers around the world and gone through phases of rapid development,
Apache Hive 1.2.0, although not a major release, contains significant improvements.
Recently, the Apache Hive community moved to a more frequent, incremental release schedule. So, a little while ago, we covered the Apache Hive 1.0.0 release and explained how it was renamed from 0.14.1 with only minor feature additions since 0.14.0.
Shortly thereafter, Apache Hive 1.1.0 was released (renamed from Apache Hive 0.15.0), which included more significant features, among them Hive-on-Spark.
A Hive-on-Spark beta is now available via CDH parcel. Give it a try!
The Hive-on-Spark project (HIVE-7292) is one of the most watched projects in Apache Hive history. It has attracted developers from across the ecosystem, including from organizations such as Intel, MapR, IBM, and Cloudera, and gained critical help from the Spark community.
Many eager users have asked about its availability over the last few months.
Two of the most vibrant communities in the Apache Hadoop ecosystem are now working together to bring users a Hive-on-Spark option that combines the best elements of both.
(Editor’s note [April 12, 2016]: Hive-on-Spark is now GA/ready for production as of CDH 5.7.)
Apache Hive is a popular SQL interface for batch processing and ETL using Apache Hadoop. Until recently, MapReduce was the only execution engine in the Hadoop ecosystem,