The community effort to make Apache Spark an execution engine for Apache Hive is making solid progress.
Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.
Two months ago, Cloudera, Databricks, IBM, Intel, MapR, and others came together to port Apache Hive and the other batch processing engines to Spark. In October at Strata + Hadoop World New York, the Hive on Spark project lead, Xuefu Zhang, shared the project status and provided a demo of our work. The same week at the Bay Area Hadoop User Group, Szehon Ho discussed the project and demoed the work completed. Additionally, Xuefu and Suhas Satish will be speaking about Hive on Spark at the Bay Area Hive User Group on Dec. 3.
The community has committed more than 140 changes to the Spark branch as part of HIVE-7292 – Hive on Spark. We are proud to say that queries are now functionally able to run, as you can see in the demo below of a multi-node Hive-on-Spark query (query 28 from TPC-DS with a scale factor of 20 on a TPC-DS derived dataset).
This demo is intended to illustrate our progress toward porting Hive to Spark, not to compare Hive-on-Spark performance versus other engines. The Hive-on-Spark team is now focused on additional join strategies (such as map-side joins), statistics, job monitoring, and other operational aspects. As these pieces come together, we’ll then shift our focus to the performance tuning and optimization needed prior to general release.
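For readers unfamiliar with the term, the general idea behind a map-side join can be sketched in a few lines of plain Python. This is only an illustration of the technique, not Hive or Spark code: the function name `map_side_join` and the row format (lists of dicts) are assumptions made for the example. The small table is loaded into an in-memory hash table, and each row of the large table is joined as it streams past, so no shuffle of the large table is needed.

```python
from collections import defaultdict


def map_side_join(small_rows, large_rows, key):
    """Inner-join two iterables of dict rows on `key` without a shuffle.

    Hypothetical helper for illustration only; it is not part of Hive or Spark.
    """
    # Build phase: hash the small table by join key so it fits in memory.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    # Probe phase: stream the large table and emit one row per match.
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


# Made-up sample data: a small dimension table and a larger fact table.
stores = [
    {"store_id": 1, "city": "Austin"},
    {"store_id": 2, "city": "Boston"},
]
sales = [
    {"store_id": 1, "amount": 10},
    {"store_id": 2, "amount": 5},
    {"store_id": 3, "amount": 7},  # no matching store, dropped by the inner join
    {"store_id": 1, "amount": 2},
]
joined = list(map_side_join(stores, sales, "store_id"))
```

The trade-off is memory: this only works when the smaller input fits comfortably in memory on every worker, which is why it is one join strategy among several rather than a universal replacement for a shuffle join.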
The Hive and Spark communities have worked closely together to make this effort possible. In order to support Hive on Spark, Spark developers have provided a MapReduce-style shuffle transformation, removed Guava from the public API, and improved the Java version of the Spark API. Among other enhancements in progress, the Spark community is working hard to provide elastic scaling within a Spark application. (Elastic Spark application scaling is a favorite request from longtime Spark users.) Given all the enhancements Hive on Spark is driving within Spark, the Hive-on-Spark project is turning out to be beneficial for both communities.
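As a rough illustration of what a MapReduce-style shuffle does, the plain-Python sketch below partitions key/value records by key and then sorts each partition by key, which is the ordering guarantee a MapReduce reducer relies on. It is a conceptual model only: the function name `shuffle_sorted` and the hash-modulo partitioner are assumptions for the example and do not represent the actual Spark implementation.

```python
def shuffle_sorted(records, num_partitions):
    """Partition (key, value) pairs by key hash, then sort each partition by key.

    Hypothetical helper for illustration only; it is not a Spark API.
    """
    partitions = [[] for _ in range(num_partitions)]

    # Partition step: every record with the same key lands in the same partition.
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))

    # Sort step: within each partition, records are ordered by key, so a
    # downstream "reducer" sees all values for a key contiguously.
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
    return partitions


# Integer keys are used so the partition assignment is deterministic.
records = [(3, "c"), (1, "a"), (2, "b"), (1, "z"), (4, "d")]
parts = shuffle_sorted(records, num_partitions=2)
```

Doing the sort as part of the shuffle, rather than as a separate pass afterward, is what lets a reduce-side operator consume grouped keys in a single streaming pass.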
We look forward to providing another update in a few months. Until then, please enjoy the demo video!
Finally, a big thanks to the project team members: Chao Sun, Chengxiang Li, Chinna Rao Lalam, Jimmy Xiang, Marcelo Vanzin, Na Yang, Reynold Xin, Rui Li, Sandy Ryza, Suhas Satish, Szehon Ho, Thomas Friedrich, Venki Korukanti, and Xuefu Zhang.
Brock Noland is a Software Engineer at Cloudera and an Apache Hive committer/PMC member.