New in Cloudera Labs: SparkOnHBase

Categories: Cloudera Labs HBase Spark

As we progressively move from MapReduce to Spark, we shouldn’t have to give up good HBase integration. Hence the newest Cloudera Labs project, SparkOnHBase!

[Ed. Note: In Aug. 2015, SparkOnHBase was committed to the Apache HBase trunk in the form of a new HBase-Spark module.]

Apache Spark is making a huge impact across our industry, changing the way we think about batch processing and stream processing. However, as we progressively migrate from MapReduce toward Spark, we shouldn’t have to “give up” anything. One of those capabilities we need to retain is the ability to interact with Apache HBase.

In this post, we will share the work being done in Cloudera Labs to make integrating Spark and HBase super-easy in the form of the SparkOnHBase project. (As with everything else in Cloudera Labs, SparkOnHBase is not supported and there is no timetable for possible support in the future; it’s for experimentation only.) You’ll learn common patterns of HBase integration with Spark and see Scala and Java examples for each. (It may be helpful to have the SparkOnHBase repository open as you read along.)

HBase and Batch Processing Patterns

Before we get into the coolness of Spark, let’s define some powerful usage patterns around HBase interactions with batch processing. This discussion is necessary because many customers who are new to HBase tell me they have heard that HBase and MapReduce should never be used together.

In fact, although there are valid use cases for keeping an HBase cluster isolated from MapReduce for SLA reasons, there are also use cases where the combination of MapReduce and HBase is the right approach. Here are just a couple of examples:

  • Massive operations on tree/DAG/graph structures stored in HBase
  • Interaction, via MapReduce or Impala, with a store or table that is in constant change

SparkOnHBase Design

We experimented with many designs for how Spark and HBase integration should work and ended up focusing on a few goals:

  • Make HBase connections seamless.
  • Make Kerberos integration seamless.
  • Create RDDs through Scan actions, or from an existing RDD that is used to generate Get commands.
  • Take any RDD and allow any combination of HBase operations to be done.
  • Provide simple methods for common operations while still allowing unrestricted, advanced operations through the API.
  • Support Scala and Java.
  • Support Spark and Spark Streaming with a similar API.

These goals led us to a design that took a couple of notes from the GraphX API in Spark. At the center of SparkOnHBase is a class called HBaseContext. Its constructor takes HBase configuration information; once constructed, it lets you do a bunch of operations on HBase. For example, you can:

  • Create RDD/DStream from a Scan
  • Put/Delete the contents of an RDD/DStream into HBase
  • Create an RDD/DStream from Gets created from the contents of an RDD/DStream
  • Take the contents of an RDD/DStream and do any operation if an HConnection is handed to you in the worker process

Let’s walk through a code example so you can get an idea of how easy and powerful this API can be. First, we create an RDD, connect to HBase, and put the contents of that RDD into HBase.
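
The original listing is not reproduced here, but a minimal Scala sketch of that flow, assuming the HBaseContext constructor and bulkPut signature from the SparkOnHBase repository (the table name “t1”, column family “c”, and sample values are made-up placeholders), looks roughly like this:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}
    import com.cloudera.spark.hbase.HBaseContext

    val sc = new SparkContext(new SparkConf().setAppName("bulkPutExample"))

    // A small RDD of (rowKey, Array[(columnFamily, qualifier, value)]) records.
    val rdd = sc.parallelize(Array(
      (Bytes.toBytes("1"), Array((Bytes.toBytes("c"), Bytes.toBytes("q"), Bytes.toBytes("value1")))),
      (Bytes.toBytes("2"), Array((Bytes.toBytes("c"), Bytes.toBytes("q"), Bytes.toBytes("value2"))))))

    // HBaseContext is constructed once from the HBase configuration...
    val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

    // ...and then writes the RDD to HBase, opening a connection per partition.
    hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
      rdd,
      "t1",                      // target table (placeholder name)
      putRecord => {             // turn each record into a Put
        val put = new Put(putRecord._1)
        putRecord._2.foreach(cell => put.add(cell._1, cell._2, cell._3))
        put
      },
      true)                      // autoFlush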

Now every partition of that RDD will execute in parallel (in different threads in a number of Spark workers across the cluster)—kind of like what would have happened if we did Puts in a mapper or reducer task.

One thing to note is that the same rules apply when working with HBase from MapReduce or Spark in terms of Put and Get performance. If your Puts are not partitioned, a Put batch will most likely get sent to each RegionServer, which results in fewer records per RegionServer per batch. The image below illustrates how this would look with six RegionServers; imagine if you had 100 of them (it would be 16.7x worse)!

Now let’s look at that same diagram if we used Spark to partition first before talking to HBase.
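
Purely as an illustration (this helper is not part of the SparkOnHBase API), one way to line Spark partitions up with regions is a custom Partitioner built from the region start keys; the startKeys array and the keyed RDD below are hypothetical and would really come from the HBase client and your job:

    import org.apache.spark.Partitioner

    // Illustrative sketch only: bucket String row keys by the sorted region start keys,
    // so rows headed for the same region land in the same Spark partition before the Puts.
    class RegionAlignedPartitioner(startKeys: Array[String]) extends Partitioner {
      private val sorted = startKeys.sorted
      override def numPartitions: Int = sorted.length
      override def getPartition(key: Any): Int = {
        val rowKey = key.toString
        // Last region whose start key is <= the row key; fall back to the first region.
        math.max(sorted.lastIndexWhere(_ <= rowKey), 0)
      }
    }

    // keyedRdd is a hypothetical RDD[(String, V)] keyed by row key:
    // val partitioned = keyedRdd.partitionBy(new RegionAlignedPartitioner(startKeys))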


Next, we’ll quickly explore just three code examples to illustrate how you can do different types of operations. (A Put example would look almost exactly like a delete, checkPut, checkDelete, or increment example.)

The big difference in a get example would be the fact that we are producing a new RDD from an existing one. Think of it as a “Spark map function.”
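
Here is a hedged Scala sketch of such a get, assuming the bulkGet signature from the repository; the table name “t1” and the row-key RDD rowKeyRdd are placeholders, each row key becomes a Get, and each Result is converted into a record of the new RDD:

    import org.apache.hadoop.hbase.client.{Get, Result}
    import org.apache.hadoop.hbase.util.Bytes

    // Produce a new RDD[String] from an RDD of row keys by issuing Gets against HBase.
    val getRdd = hbaseContext.bulkGet[Array[Byte], String](
      "t1",                          // table to read from (placeholder)
      2,                             // how many Gets to batch per request
      rowKeyRdd,                     // RDD[Array[Byte]] of row keys
      rowKey => new Get(rowKey),     // record => Get
      (result: Result) =>            // Result => record of the new RDD
        Bytes.toString(result.getRow) + ":" + result.size())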

Now, let’s say your interaction with HBase is more complex than straight gets or Puts: a case where you want to say, “Just give me an HConnection and leave me alone.” Well, HBaseContext has map, mapPartition, foreach, and foreachPartition methods just for you.

Here’s an example of the foreachPartition in Java.
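
In place of the original Java listing, here is a rough Scala sketch of the same call, assuming foreachPartition hands your function the partition’s Iterator together with an HConnection; the table name, column family, and record type are placeholders:

    import org.apache.hadoop.hbase.client.{HConnection, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // Arbitrary HBase work per partition: here, manual Puts through the handed-in HConnection.
    hbaseContext.foreachPartition(rdd,
      (it: Iterator[(Array[Byte], Array[Byte])], hConnection: HConnection) => {
        val table = hConnection.getTable(Bytes.toBytes("t1"))
        it.foreach { case (rowKey, value) =>
          val put = new Put(rowKey)
          put.add(Bytes.toBytes("c"), Bytes.toBytes("q"), value)
          table.put(put)
        }
        table.close()
      })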

The last example to talk about is creating an RDD from a scan:
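
A minimal sketch, assuming a distributed-scan method along the lines of hbaseScanRDD(tableName, scan) from the repository (table name “t1” is a placeholder):

    import org.apache.hadoop.hbase.client.Scan

    val scan = new Scan()
    scan.setCaching(100)    // rows fetched per RPC while scanning

    // RDD of (RowKey, List[(columnFamily, columnQualifier, Value)]) records
    val scanRdd = hbaseContext.hbaseScanRDD("t1", scan)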

This code will execute a scan just like MapReduce would do with the table input format and populate the resulting RDD with records of type (RowKey, List[(columnFamily, columnQualifier, Value)]). If you don’t like that record type, then just use the hbaseRDD method, which gives you a record conversion function for changing it to whatever you like.
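
For example, assuming an hbaseRDD overload that takes a conversion function over the raw (ImmutableBytesWritable, Result) pairs, a sketch that keeps only the row key as a String might look like this:

    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.util.Bytes

    // Convert each scanned row into whatever record type you like; here, just the row key string.
    val rowKeys = hbaseContext.hbaseRDD(
      "t1",
      new Scan(),
      (r: (ImmutableBytesWritable, Result)) => Bytes.toString(r._2.getRow))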


SparkOnHBase has been tested on a number of clusters with Spark and Spark Streaming; give it a look and let us know your feedback via the Cloudera Labs discussion group. The hope is that this project and others like it will help us blend the goodness from different Hadoop ecosystem components to help solve bigger problems.

To use SparkOnHBase, just add the following snippet as a dependency in your pom.xml:
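
The snippet itself should come from the SparkOnHBase repository’s README; the groupId below is inferred from the package name com.cloudera.spark.hbase, and the artifactId and version are illustrative placeholders only:

    <dependency>
      <groupId>com.cloudera</groupId>
      <!-- artifactId and version are placeholders; use the coordinates published in the repository README -->
      <artifactId>spark-hbase</artifactId>
      <version>REPLACE-WITH-PUBLISHED-VERSION</version>
    </dependency>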


Special thanks to the people that helped me make SparkOnHBase: Tathagata Das (TD), Mark Grover, Michael Stack, Sandy Ryza, Kevin O’Dell, Jean-Marc Spaggiari, Matteo Bertozzi, and Jeff Lord.

Ted Malaska is a Solutions Architect at Cloudera, a contributor to Apache Spark, and a co-author of the O’Reilly book, Hadoop Application Architectures.


13 responses on “New in Cloudera Labs: SparkOnHBase”

  1. Cristofer

    Now that Spark SQL has support for external data sources with predicate pushdown, it would be nice to see some integrations in this direction too.

  2. Does SparkOnHBase support Dstream

    The first parameter of the bulkPut method is an RDD. When I use this method in Spark Streaming and pass a DStream as the parameter, it won’t work. How can I solve this problem?

    1. Justin Kestelyn (@kestelyn) Post author

      Please post this question in the Cloudera Labs area for easier interaction…

  3. Manju Jain

    When I try to execute the JavaHBaseStreamingBulkPutExample by reading the stream of data from the socket, it does not save anything to HBase.
    When I try to print the DStream, it gets printed properly.
    Please help, as this is really not working for us.

  4. Vinay

    We are using the SparkOnHBase lib to do streamBulkPut() for an RDD in “spark-streaming with checkpointing”
    and are getting the following error while recovering from a checkpoint:

    16/01/22 01:32:35 ERROR executor.Executor: Exception in task 0.0 in stage 39.0 (TID 134)
    java.lang.ClassCastException: [B cannot be cast to org.apache.spark.SerializableWritable
    at com.cloudera.spark.hbase.HBaseContext.applyCreds(HBaseContext.scala:225)
    at com.cloudera.spark.hbase.HBaseContext$$anonfun$com$cloudera$spark$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:460)
    at com.cloudera.spark.hbase.HBaseContext$$anonfun$com$cloudera$spark$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:460)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.executor.Executor$
    at java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.util.concurrent.ThreadPoolExecutor$

  5. alchemist

    I am trying to use the above API in Java and trying to get the JavaHBaseContext using CDH 5.5 libraries. Somehow I cannot get the JavaHBaseContext library.
    Cannot find this library:
    JavaHBaseContext hbaseContext = new JavaHBaseContext(jsc, conf);
    Imported the following Maven script


  6. Abi

    You have provided Java-based SparkOnHBase sample code at the line “Here’s an example of the foreachPartition”, but the Maven dependencies and the parameters used while creating the jsc object are missing. Can you please let us know a link where you have this example end to end?

  7. Sathish

    I am trying to use this library in my Spark Scala application, but I am finding difficulty in compiling the code.
    I added the below dependency in the pom.xml file, and added the repo –


    I do see this dependency got resolved by checking mvn dependency:tree,

    However, compilation fails with the below error:
    error: object cloudera is not a member of package,
    [INFO] import com.cloudera.spark.hbase.HBaseContext

    Any clue on the root cause? I tried in an sbt project and got the same problem.
