Hands-on Hive-on-Spark in the AWS Cloud

Categories: Cloud Community Hive Spark

Interested in Hive-on-Spark progress? This new AMI gives you a hands-on experience.

Nearly one year ago, the Apache Hadoop community began to embrace Apache Spark as a powerful batch-processing engine. Today, many organizations and projects are augmenting their Hadoop capabilities with Spark. As part of this shift, the Apache Hive community is working to add Spark as an execution engine for Hive. The Hive-on-Spark work is being tracked by HIVE-7292, which is one of the most popular JIRAs in the Hadoop ecosystem. Furthermore, three weeks ago, the Hive-on-Spark team offered the first demo of Hive on Spark.

Since that demo, we have made tremendous progress: we have finished Map Join (HIVE-7613) and Bucket Map Join (HIVE-8638), integrated with HiveServer2 (HIVE-8993), and, importantly, integrated our Spark Client (HIVE-8548, aka Remote Spark Context). Remote Spark Context matters because it is not possible to have multiple SparkContexts within a single process. The RSC API allows us to run the SparkContext on the server in a container while using the Spark API on the client—in this case HiveServer2—which reduces resource utilization on an already burdened component.
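The idea behind RSC can be illustrated with a minimal sketch. All names below are hypothetical and this is not the actual RSC API: the heavyweight context lives in its own "server" (in Hive's case, a separate process in a container), while the client keeps only a thin proxy that forwards requests and blocks for results. For simplicity, this sketch uses a thread in place of a separate process.

```python
# Hypothetical sketch of the remote-context pattern behind HIVE-8548.
# The single real context lives in the "server"; the client-side proxy
# owns no heavy resources and merely forwards work over a pair of queues.
from queue import Queue
from threading import Thread

def context_server(requests: Queue, results: Queue) -> None:
    """Hosts the one real context (a stand-in for a SparkContext)."""
    context = {"jobs_run": 0}
    while True:
        job = requests.get()
        if job is None:                # shutdown sentinel
            break
        context["jobs_run"] += 1
        results.put(f"result of {job} (job #{context['jobs_run']})")

class RemoteContextProxy:
    """Client-side handle: looks like a context but owns no resources."""
    def __init__(self) -> None:
        self.requests: Queue = Queue()
        self.results: Queue = Queue()
        self.server = Thread(target=context_server,
                             args=(self.requests, self.results), daemon=True)
        self.server.start()

    def submit(self, job: str) -> str:
        self.requests.put(job)         # forward the call to the server
        return self.results.get()      # block until the remote result arrives

    def stop(self) -> None:
        self.requests.put(None)
        self.server.join()

if __name__ == "__main__":
    proxy = RemoteContextProxy()
    print(proxy.submit("SELECT count(*) FROM store_sales"))
    # -> result of SELECT count(*) FROM store_sales (job #1)
    proxy.stop()
```

In the real system, the client side of this pattern runs inside HiveServer2, which is why keeping the proxy lightweight matters.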

Many users have proactively started using the Spark branch and providing feedback. Today, we'd like to offer you the first chance to try Hive-on-Spark yourself. As this work is under active development, for most users we do not recommend running this code outside of the packaged Amazon Machine Image (AMI) provided. The AMI ami-35ffed70 (named hos-demo-4) is available in us-west-1; we recommend an instance of m3.large or larger.

After logging in as ubuntu, change to the hive user (sudo su - hive) and you will be greeted with instructions on how to start Hive on Spark. Pre-loaded on the AMI are a small TPC-DS dataset and some sample queries. Users are strongly encouraged to load their own sample datasets and try their own queries. We hope not only to showcase our progress delivering Hive-on-Spark, but also to find areas for improvement early. If you find any issues, please email hos-ami@cloudera.org and the cross-vendor team will do its best to investigate.
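A session on the AMI might look like the following. The instance address is a placeholder, and the exact startup commands are the ones printed at login; `hive.execution.engine=spark` is the property that selects Spark as the engine on this branch.

```shell
# Connect to an instance launched from ami-35ffed70 (placeholder address)
ssh ubuntu@ec2-XX-XX-XX-XX.us-west-1.compute.amazonaws.com

# Switch to the hive user; the login banner shows how to start Hive on Spark
sudo su - hive

# Inside the Hive CLI (or via HiveServer2), select Spark as the engine
hive
hive> set hive.execution.engine=spark;
hive> -- try one of the pre-loaded TPC-DS sample queries, or your own
hive> SELECT count(*) FROM store_sales;
```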

Despite spanning the globe, the cross-company engineering teams have become close. The team members would like to thank our employers for sponsoring this project: MapR, Intel, IBM, and Cloudera.

Rui Li is a software engineer at Intel and a contributor to Hive.

Na Yang is a staff software engineer at MapR and a contributor to Hive.

Brock Noland is an engineering manager at Cloudera and a Hive PMC member.




6 responses on “Hands-on Hive-on-Spark in the AWS Cloud”

  1. David Gruzman

    I would like to ask a few questions: how does this differ from Shark? To the best of my understanding, it is exactly Hive on top of Spark. And why do you call Spark a batch-processing engine?

    1. Justin Kestelyn (@kestelyn) Post author


      1. Shark is in “sunset” mode, so I presume you mean Spark SQL, the successor to Shark. (But you are essentially correct that Shark was a Hive port.)
      2. The purpose of Spark SQL is to allow Spark developers to selectively use SQL expressions (with a limited number of functions currently supported) when writing Spark jobs. It’s a “better Spark”. The purpose of Hive-on-Spark is to give Hive users a faster batch-processing engine for ETL jobs and the like. It’s a “better Hive”. So, they’re for two different things.
      3. Spark provides very similar functionality to MR with the major exception that most Spark concepts are implemented in memory, not on disk, for better performance. The same paradigms are still in place. So, Spark is indeed a powerful batch processing engine, although that’s not ALL it does. (See this post for details.)

  2. David Gruzman

    Thank you for the answer.
    1. I meant specifically Shark, since I saw Hive on Spark as something technologically similar. I completely agree that it has been left an “orphan”.
    2. I think you have a small typo: it is Spark, not MR, that is implemented mostly in memory.
    3. Spark is a clear winner for short tasks because of its small per-job and per-task overhead. But for long queries the situation can be different, at least today. I doubt that Spark could do things like shuffle much better than MR, because they both use the JVM and their algorithms are similar. I chose shuffle as an example because it usually dominates query time when present in a query. More than that, the last time I profiled it, MR was superior to Spark in shuffle performance.
    At the same time, I do believe that Spark, as a more flexible runtime, can do a better job of implementing SQL than MR.

    1. Justin Kestelyn (@kestelyn) Post author

      Not that it matters much because Shark is in sunset mode, but being a port of Hive, it was always behind in functionality by definition.

  3. Tom Hanlon

    Great work,

    I was just reading through the JIRAs and the developer email list, and was getting ready to grab the branch and mess around with it. A working AMI is a much easier on-ramp. Thank you!