Contributors from Intel, Cloudera, and the rest of the community have been making strong progress on the Hive-on-Spark initiative. This post provides an update.
[Editor’s note (April 20, 2016): Hive-on-Spark is now GA/shipping starting in CDH 5.7.]
Since its inception about one year ago, the community initiative to make Apache Spark a data processing engine for Apache Hive (HIVE-7292) has attracted widespread interest from developers around the world and gone through phases of rapid development, testing, and early deployment. (Download data and user questions about the beta release of this functionality in CDH, for example, point to strong interest.)
Throughout this period, the Hive-on-Spark team has kept its commitment to inform you about significant milestones, with blog posts about the design, the first demo, the hands-on sandbox, and of course the first beta release in CDH. (Hive-on-Spark was originally a Cloudera Labs project.) Since Apache Hive 1.1 was released with Hive-on-Spark as its flagship feature in March 2015, the Hive community has received a lot of feedback from users across the ecosystem, and Cloudera’s and Intel’s Hive teams have heard similarly helpful feedback from CDH beta customers. In response to that feedback, the community has made improvements in the following areas:
- Spark cluster usage by providing dynamic executor allocation and dangling user session control [HIVE-7768] [HIVE-10143]
- Usability with respect to job monitoring and logging [HIVE-9871] [HIVE-10291] [HIVE-11314]
- Integration with other components in the Hadoop ecosystem, such as Apache Sentry and Apache Oozie [HIVE-10594] [HIVE-11363]
- Performance enhancements and optimizations [HIVE-10844] [HIVE-10550] [HIVE-11180] [HIVE-10855] [HIVE-11183] [HIVE-9152]
In addition, the team has conducted extensive stress, scalability, and performance testing. We have also made major improvements in documenting the configuration and tuning for Hive-on-Spark. Most of these improvements were released in Hive 1.2 as well as in CDH 5.4.
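As a rough illustration of the dynamic executor allocation mentioned above, the sketch below generates the `SET` statements one might run at the start of a Hive session. The helper name `hive_on_spark_settings` and the executor bounds are hypothetical and chosen only for this example; the property names are standard Hive and Spark settings, but the right values depend on your cluster, so treat this as a starting point rather than a recommended configuration.

```python
def hive_on_spark_settings(min_executors=1, max_executors=20):
    """Build Hive session SET statements for Hive-on-Spark.

    Hypothetical helper for illustration; the default bounds are
    placeholders, not tuning advice.
    """
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("expected 0 <= min_executors <= max_executors")

    settings = {
        # Switch the Hive execution engine to Spark.
        "hive.execution.engine": "spark",
        # Let Spark grow and shrink the executor pool with the workload.
        "spark.dynamicAllocation.enabled": "true",
        # Dynamic allocation relies on the external shuffle service.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    return [f"SET {key}={value};" for key, value in settings.items()]


if __name__ == "__main__":
    # Print the statements so they can be pasted into a Hive session.
    for statement in hive_on_spark_settings(min_executors=2, max_executors=10):
        print(statement)
```

The statements can be pasted into a Beeline or Hive CLI session before running a query; the same properties can also be set cluster-wide in the Hive configuration instead of per session.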
The initiative to make Spark an attractive data processing engine for Hive rests on a solid technical foundation and community-based collaboration, and Hive-on-Spark will soon be ready for production use by Cloudera’s customers.
The work is not yet finished, however: the community intends to keep improving the code base in response to your experiences and feedback. Please keep that feedback coming!
Xuefu Zhang is a software engineer at Cloudera, an Apache member, and an Apache Hive PMC member.
Rui Li is a software engineer at Intel, and an Apache Hive committer and Apache Spark contributor.