Apache Hive on Apache Spark: The First Demo

Categories: Community, Hive, MapReduce, Spark

The community effort to make Apache Spark an execution engine for Apache Hive is making solid progress.

Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.

Two months ago Cloudera, Databricks, IBM, Intel, MapR, and others came together to port Apache Hive and the other batch processing engines to Spark. In October at Strata + Hadoop World New York, Hive on Spark project lead Xuefu Zhang shared the project status and provided a demo of our work. The same week at the Bay Area Hadoop User Group, Szehon Ho discussed the project and demoed the completed work. Additionally, Xuefu and Suhas Satish will be speaking about Hive on Spark at the Bay Area Hive User Group on Dec. 3.

The community has committed more than 140 changes to the Spark branch as part of HIVE-7292 – Hive on Spark. We are proud to say that queries now run functionally end-to-end, as you can see in the demo below of a multi-node Hive-on-Spark query (TPC-DS query 28 at scale factor 20, run on a TPC-DS-derived dataset).
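If you want to try the branch yourself, a Hive-on-Spark run looks like any other Hive session once the execution engine is switched over. Here is a minimal sketch; the query shown is a simplified aggregation in the spirit of TPC-DS query 28, not the actual benchmark query:

    -- Switch the session from MapReduce to the Spark execution engine.
    -- (hive.execution.engine is a standard Hive setting; the "spark"
    -- value is only available on builds from the Spark branch.)
    SET hive.execution.engine=spark;

    -- Existing HiveQL then runs unchanged on Spark. For example, a
    -- simple aggregation over the TPC-DS store_sales table:
    SELECT ss_store_sk,
           COUNT(*)         AS sales_cnt,
           AVG(ss_net_paid) AS avg_net_paid
    FROM store_sales
    GROUP BY ss_store_sk;

Because the engine is a session-level setting, the same query can be flipped back to MapReduce for comparison simply by setting hive.execution.engine=mr.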

This demo is intended to illustrate our progress toward porting Hive to Spark, not to compare Hive-on-Spark performance against other engines. The Hive-on-Spark team is now focused on additional join strategies (such as map-side joins), statistics, job monitoring, and other operational aspects. As these pieces come together, we'll then shift our focus to the performance tuning and optimization needed prior to general release.
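For context on the join work: map-side join conversion is already controlled in Hive by a few session settings, and wiring that behavior into the Spark engine is part of the work described above. A sketch of the relevant knobs (the values shown are Hive's defaults, included here only for illustration):

    -- Let Hive automatically convert a common join to a map-side join
    -- when the smaller input can be held in memory.
    SET hive.auto.convert.join=true;

    -- Size threshold, in bytes, below which a table is considered
    -- small enough to broadcast for a map join.
    SET hive.mapjoin.smalltable.filesize=25000000;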

The Hive and Spark communities have worked closely together to make this effort possible. To support Hive on Spark, Spark developers have contributed enhancements including a MapReduce-style shuffle transformation, removal of Guava from the public API, and improvements to Spark's Java API. Among other enhancements in progress, the Spark community is working hard to provide elastic scaling within a Spark application (a longtime favorite request from Spark users). Given all the enhancements Hive on Spark is driving within Spark, the project is proving beneficial to both communities.

We look forward to providing another update in a few months. Until then, please enjoy the demo video!

Finally, a big thanks to the project team members: Chao Sun, Chengxiang Li, Chinna Rao Lalam, Jimmy Xiang, Marcelo Vanzin, Na Yang, Reynold Xin, Rui Li, Sandy Ryza, Suhas Satish, Szehon Ho, Thomas Friedrich, Venki Korukanti, and Xuefu Zhang.

Brock Noland is a Software Engineer at Cloudera and an Apache Hive committer/PMC member.


9 responses on “Apache Hive on Apache Spark: The First Demo”

  1. Thomas

    Great job! Is this based on SparkSQL? I’m asking because at some point you were pushing SparkSQL as an alternative to SQL-on-Hadoop.

    Otherwise, assuming that you are promoting 3 alternatives (Impala + SparkSQL + Hive on Spark), your strategy is becoming less and less clear. As a consultant I’m having a hard time following the numerous initiatives. Do you have some insights?

    Thanks

    1. Justin Kestelyn (@kestelyn) Post author

      Thomas,

      Nothing to do with Spark SQL. Hopefully this helps:

      1. Impala is your best choice for interactive BI-like workloads, because Impala queries have proven to have the lowest latency of all the options, especially under concurrent workloads (see the benchmarks on this).
      2. Hive is still a great choice when low latency/multiuser support is not a requirement, such as for batch processing/ETL. Hive-on-Spark will narrow the time windows needed for such processing, but not to an extent that makes Hive suitable for BI (as Impala already is).
      3. Spark SQL, which is in CDH 5.2 as an alpha but is not supported, lets Spark users selectively use SQL constructs when writing Spark pipelines. It is not intended to be a general-purpose SQL layer for interactive/exploratory analysis.

      So, in summary, these are all different tools for doing different things.

  2. Vladi

    Hi,
    Is SparkSQL query latency comparable with Impala’s?
    Why can SparkSQL not replace Impala for interactive BI-like workloads?

    Thank you,
    Vladi

    1. Justin Kestelyn (@kestelyn) Post author

      Vladi,

      There are different tools available for different use cases. There is no silver bullet.

      1. Impala is your best choice for interactive BI-like workloads, because Impala queries have proven to have the lowest latency of all the options, especially under concurrent workloads (see the benchmarks on this).
      2. Hive is still a great choice for SQL-based ETL development that focuses on a handful of very long-running jobs. Hive-on-Spark will narrow the time windows needed for such processing, but not to an extent that makes Hive suitable for BI (as Impala already is).
      3. Spark SQL, which is in CDH 5.2 as an alpha and is not supported, lets Spark developers selectively use SQL constructs when writing Spark pipelines. It is not intended to be a general-purpose SQL layer for interactive/exploratory analysis.

    1. Justin Kestelyn (@kestelyn) Post author

      There is an “alpha” version in CDH 5.3. It’s considered experimental and is unsupported.

  3. Rajesh Purwar

    If I’m creating a Hive table inside Spark, can we access the same Hive table from outside (from the Hive terminal)?