Download the Hive-on-Spark Beta


A Hive-on-Spark beta is now available via CDH parcel. Give it a try!

The Hive-on-Spark project (HIVE-7292) is one of the most watched projects in Apache Hive history. It has attracted developers from across the ecosystem, including from organizations such as Intel, MapR, IBM, and Cloudera, and gained critical help from the Spark community.

Many eager users have asked about its availability in the last few months. Some even built Hive-on-Spark from the branch code, tried it in their testing environments, and provided us with valuable feedback. The team is thrilled to see this level of excitement and early adoption, and has been working around the clock to deliver the product at an accelerated pace.

Thanks to this hard work, significant progress has been made in the last six months. (The project is currently incubating in Cloudera Labs.) All major functionality is now in place, including different flavors of joins and integration with Spark, HiveServer2, and YARN, and the team has made initial but important investments in performance optimization, including split generation and grouping, supporting vectorization and cost-based optimization, and more. We are currently focused on running benchmarks, identifying and prototyping optimization areas such as dynamic partition pruning and table caching, and creating a roadmap for further performance enhancements for the near future.

Two months ago, we announced the availability of an Amazon Machine Image (AMI) for a hands-on experience. Today, we are even prouder to present a Hive-on-Spark beta via CDH parcel. You can download that parcel here. (Please note that in this beta release only HDFS, YARN, Apache ZooKeeper, and Hive are supported. Other components, such as Apache Pig, Apache Oozie, and Impala, might not work as expected.) The “Getting Started” guide will help you get your Hive queries up and running on the Spark engine without much trouble.

We welcome your feedback. For assistance, please use user@hive.apache.org or the Cloudera Labs discussion board.

We will update you again when GA is available. Stay tuned!

Xuefu Zhang is a software engineer at Cloudera and a Hive PMC member.


10 responses on “Download the Hive-on-Spark Beta”

  1. Alexey Grishchenko

    Could you please state the advantages of the Hive-on-Spark engine over Spark SQL?
    In general, Spark SQL supports HiveQL, the Hive Metastore, Hive SerDes, and Hive UDFs, and also reuses the Hive JDBC driver, which makes it almost fully compliant with Hive. What is the additional value of Hive-on-Spark?
    As Shark was the first introduction of the Hive query engine on top of Spark, how did you solve its performance problems? According to Databricks (https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html), Spark SQL outperforms Shark by an order of magnitude on TPC-DS.

    1. Justin Kestelyn (@kestelyn) Post author

      Alexey,

      We see a few advantages. Namely:

      1. Spark SQL is based on a snapshot of Hive. Thus, although it’s “compliant” with Hive, it will always be behind the current Hive release with respect to new features and bug fixes. In contrast, Hive-on-Spark evolves alongside Hive organically (and thus is as enterprise-ready as Hive itself, which cannot be said of Spark SQL).
      2. Spark SQL replaces Hive’s SQL constructs with Spark’s transformations and actions. Thus, Spark SQL is missing a lot of features that are implemented in Hive’s constructs.
      3. In contrast, Hive-on-Spark is built using Hive’s SQL constructs, with Spark used only as a general execution engine. Thus, it contains all the SQL features that are missing in Spark SQL per #2.

      Generally speaking, Spark SQL is helpful for Spark developers who want to use SQL when writing Spark jobs, but so far it’s not an outright Hive replacement for those people who use SQL exclusively.

    1. Justin Kestelyn (@kestelyn) Post author

      Spark SQL already ships inside CDH 5.x. However, it is tagged as alpha: not supported and not recommended for production use.

      Note, though, that Spark SQL and Hive-on-Spark have different use cases. Spark SQL contains a subset of Hive SQL functionality and is intended for Spark developers who want to use SQL in their Spark jobs. Hive-on-Spark is just that: full Hive with Spark data processing underneath.

  2. David Sabater

    Hi,
    At my current client we are evaluating the Spark SQL Thrift server as an in-memory DB engine, caching tables in memory in an efficient columnar format, and we are wondering whether you have a date planned to start supporting this.
    I asked Patrick Wendell from Databricks at Strataconf London last week when this will be supported, and they already support it, so it’s down to the Hadoop distributions to support it as well.

    Thanks.

    1. Justin Kestelyn (@kestelyn) Post author

      David,

      We can’t really give you a date on that because we don’t consider Spark SQL to be production-ready. We’ll see where it goes.
