How-to: Use the Apache HBase REST Interface, Part 1

There are various ways to access and interact with Apache HBase. The Java API provides the most functionality, but many people want to use HBase without Java.

There are two main approaches for doing that: One is the Thrift interface, which is the faster and more lightweight of the two options. The other way to access HBase is using the REST interface, which uses HTTP verbs to perform an action, giving developers a wide choice of languages and programs to use.

This series of how-tos discusses the REST interface and provides Python code samples for accessing it. The first post covers HBase REST, some Python caveats, and table administration. The second post explains how to insert multiple rows at a time using XML and JSON. The third post shows how to get multiple rows using XML and JSON. The full code samples can be found on my GitHub account.

HBase REST Basics

For both Thrift and REST to work, an additional HBase daemon needs to be running to handle the requests. These daemons are installed from the hbase-thrift and hbase-rest packages. The diagram below illustrates where Thrift and REST are placed in the cluster. Note that the Thrift and REST clients usually don’t run any other services, such as DataNodes or RegionServers, in order to keep the load down and responsiveness high for REST interactions.

Be sure to install and start these daemons on nodes that have access to both the Hadoop cluster and the web application server. The REST interface doesn’t have any built-in load balancing; that will need to be done with hardware or in code. Cloudera Manager makes it really easy to install and manage the HBase REST and Thrift services. (You can download and try it out for free!) The downside to REST is that it is much heavier-weight than Thrift or Java. 

A REST interface can use various data formats: XML, JSON, and protobuf. By specifying the Accept and Content-Type headers, you can choose the format you want to pass in or receive back.
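
As a rough sketch of how that choice might look in Python, the helper below maps a short format name to the two header values. The name `format_headers` is hypothetical, and the MIME types listed are the ones commonly documented for HBase REST; check them against your HBase version.

```python
# Hypothetical helper: maps a short format name to the MIME type
# sent in the Accept and Content-Type headers.
MIME_TYPES = {
    "xml": "text/xml",
    "json": "application/json",
    "protobuf": "application/x-protobuf",
}


def format_headers(fmt):
    """Return headers that send and request data in the given format."""
    mime = MIME_TYPES[fmt]
    return {"Accept": mime, "Content-Type": mime}


print(format_headers("json"))
```

The two headers don't have to match: you could, for example, send XML and ask for JSON back by setting them separately.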

To start using the REST interface, you need to find out which port it’s running on. The default port for CDH is 8070. This post uses a baseurl variable throughout; here is the value I’ll be using:

baseurl = "http://localhost:8070"

The REST interface can be set up to use a Kerberos credential to increase security.

In your code, you’ll need to use the IP address or fully qualified domain name (FQDN) of the node running the REST daemon. Also, confirm that the port is correct. I highly recommend making this URL a variable, as it could change when the network changes.
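
One way to keep the URL out of the code entirely is to read it from the environment. This is only a sketch: the variable names `HBASE_REST_HOST` and `HBASE_REST_PORT` and the helper `build_baseurl` are made up for this example and are not read by HBase itself.

```python
import os


def build_baseurl(env=os.environ):
    """Build the REST base URL from a mapping of settings.

    HBASE_REST_HOST and HBASE_REST_PORT are hypothetical names chosen
    for this sketch; the defaults match the value used in this post.
    """
    host = env.get("HBASE_REST_HOST", "localhost")
    port = env.get("HBASE_REST_PORT", "8070")
    return "http://" + host + ":" + port


baseurl = build_baseurl()
print(baseurl)
```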

Python and HBase Bug Workarounds

There are two bugs and workarounds that need to be addressed. The first bug is that the built-in Python modules don’t support all of the HTTP verbs. The second is an HBase REST bug when working with JSON.

The built-in Python modules for REST interaction don’t easily support all of the HTTP verbs needed for HBase REST. You’ll need to install the Python requests module. The requests module also cleans up the code and makes all of the interactions much easier.

The HBase REST interface has a bug when adding data via JSON: it is required that the fields maintain their exact order. The built-in Python dict type doesn’t support this feature, so to maintain the order, we’ll need to use the OrderedDict class. (Those with Python 2.6 and older will need to install the ordereddict module.) I’ll cover the bug and workaround later in the post, too.
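
To illustrate the ordering point, here is a minimal sketch of a JSON payload built with OrderedDict. The field names ("Row", "key", "Cell", "column", "$") and the sample row, column, and value are assumptions made for illustration; treat the exact payload shape as something to verify against your HBase version.

```python
import base64
import json
from collections import OrderedDict


def b64(text):
    """Base64-encode a text value and return it as a plain string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Assumed payload shape: "key" is listed before "Cell" and stays there.
cell = OrderedDict([("column", b64("cf:col")), ("$", b64("value"))])
row = OrderedDict([("key", b64("row1")), ("Cell", [cell])])

payload = json.dumps({"Row": [row]})
print(payload)
```

json.dumps writes the keys in the order the OrderedDict holds them. On Python 3.7 and later, a plain dict also preserves insertion order, so OrderedDict matters mostly on older interpreters.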

Base64-encoding and decoding integers was also awkward, so I wrote some code to do that:

import base64
import struct

# Method for encoding ints with base64 encoding
# ("i" packs a 4-byte int in the machine's native byte order)
def encode(n):
    data = struct.pack("i", n)
    s = base64.b64encode(data)
    return s

# Method for decoding ints with base64 encoding
def decode(s):
    data = base64.b64decode(s)
    n = struct.unpack("i", data)
    return n[0]
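
One thing to watch: the "i" format uses the machine's native byte order, which is little-endian on most desktops and servers. If the same cells are also read or written by Java code that stores ints in big-endian order, an explicit big-endian format may be needed. The variant below is a sketch under that assumption; `encode_be` and `decode_be` are hypothetical names, not part of the original samples.

```python
import base64
import struct


def encode_be(n):
    """Pack n as a 4-byte big-endian int and return base64 text."""
    data = struct.pack(">i", n)
    return base64.b64encode(data).decode("ascii")


def decode_be(s):
    """Reverse encode_be: base64 text back to an int."""
    data = base64.b64decode(s)
    return struct.unpack(">i", data)[0]


print(encode_be(1))
print(decode_be(encode_be(1234)))
```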

 

To make things even easier, I wrote a method to confirm that HTTP responses come back in the 200s, which indicates that the operation worked. The sample code uses this method to check the success of a call before moving on. Here is the method:

# Checks the request object to see if the call was successful
def issuccessful(request):
    # Any status code in the 200s means the operation worked
    return 200 <= request.status_code <= 299

 

Working With Tables

Using the REST interface, you can create or delete tables. Let’s take a look at the code to create a table.

content =  '<?xml version="1.0" encoding="UTF-8"?>'
content += '<TableSchema name="' + tablename + '">'
content += '  <ColumnSchema name="' + cfname + '" />'
content += '</TableSchema>'

request = requests.post(baseurl + "/" + tablename + "/schema", data=content, headers={"Content-Type" : "text/xml", "Accept" : "text/xml"})

 

In this snippet, we create a small XML document that defines the table schema in the content variable. We need to provide the name of the table and the column family name. If there are multiple column families, add one ColumnSchema node for each.
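
For tables with several column families, building the document with an XML library avoids hand-written string concatenation and takes care of escaping. The following is a sketch only: `build_table_schema` is a hypothetical helper, and the "metadata" column family is made up for the example.

```python
import xml.etree.ElementTree as ET


def build_table_schema(tablename, cfnames):
    """Return a TableSchema document with one ColumnSchema per family."""
    root = ET.Element("TableSchema", {"name": tablename})
    for cfname in cfnames:
        ET.SubElement(root, "ColumnSchema", {"name": cfname})
    # encoding="unicode" makes tostring return str rather than bytes
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


content = build_table_schema("messagestable", ["messages", "metadata"])
print(content)
```

The resulting string can be passed as the data argument of the POST call in the same way as the hand-built version.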

Next, we use the requests module to POST the XML to the URL we create. This URL needs to include the name of the new table. Also, note that we are setting the headers for this POST call. The Content-Type header of “text/xml” says that we are sending XML, and the Accept header of “text/xml” says that we want XML back.

Using request.status_code, you can check whether the table was created successfully. The REST interface uses standard HTTP status codes to indicate whether a call succeeded or errored out. A status code in the 200s means that things worked correctly.

We can easily check if a table exists using the following code:

request = requests.get(baseurl + "/" + tablename + "/schema")

 

This call uses the GET verb to tell the REST interface that we want the schema information for the table named in the URL. Once again, we can use the status code to see if the table exists. A status code in the 200s means it does exist, and any other number means it doesn’t.
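
The existence check can be wrapped in a small function. In this sketch, `table_exists` is a hypothetical helper, and the `http` parameter is an assumption made so the function can be exercised without a cluster: it is any object with a requests-style get() method.

```python
def table_exists(http, baseurl, tablename):
    """Return True if the schema request comes back in the 200s.

    http is anything with a requests-style get(), for example the
    requests module itself.
    """
    response = http.get(baseurl + "/" + tablename + "/schema",
                        headers={"Accept": "text/xml"})
    return 200 <= response.status_code <= 299
```

With the requests module installed, a call might look like table_exists(requests, baseurl, "messagestable").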

Using the curl command, we can check the success of a REST operation without writing code. The following command returns a 200, showing the success of the call, because the messagestable table does exist in HBase. Here is the call and its output:

[user@localhost]$ curl -I -H "Accept: text/xml" http://localhost:8070/messagestable/schema
HTTP/1.1 200 OK
Content-Length: 0
Cache-Control: no-cache
Content-Type: text/xml

 

This REST call will error out because the tablenotthere table doesn’t exist in HBase. Here is the call and its output:

[user@localhost]$ curl -I -H "Accept: text/xml" http://localhost:8070/tablenotthere/schema
HTTP/1.1 500 org.apache.hadoop.hbase.TableNotFoundException: tablenotthere
Content-Type: text/html; charset=iso-8859-1
Cache-Control: must-revalidate,no-cache,no-store
Content-Length: 10767

 

We can delete a table using the following code:

request = requests.delete(baseurl + "/" + tablename + "/schema")

 

This call uses the DELETE verb to tell the REST interface that we want to delete the table. Deleting a table through the REST interface doesn’t require you to disable it first. As usual, we can confirm success by looking at the status code.
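
If a failed delete should stop the program rather than be silently ignored, the status check can raise an error. This is a sketch under assumptions: `delete_table` is a hypothetical helper, and `http` stands for any object with a requests-style delete() method, such as the requests module.

```python
def delete_table(http, baseurl, tablename):
    """Delete a table and raise RuntimeError unless the status is 2xx."""
    response = http.delete(baseurl + "/" + tablename + "/schema")
    if not 200 <= response.status_code <= 299:
        raise RuntimeError("Deleting %s failed with HTTP status %d"
                           % (tablename, response.status_code))
    return response
```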

In the next two posts in this series, we’ll cover inserting and getting rows, respectively.

Jesse Anderson is an instructor with Cloudera University.




9 Responses
  • Joe Pallas / March 12, 2013 / 1:16 PM

    You might want to fix that example: unescaped XML is ignored by most browsers.

  • Justin Kestelyn (@kestelyn) / March 12, 2013 / 1:33 PM

    Thanks Joe,

    We’ve corrected the error; thanks for reporting.

  • Andrew Purtell / April 19, 2013 / 8:30 PM

    Although JSON is clearly more fashionable, the bug you mention — which is a consequence of how internally the representations are implemented with JAXB, so arguably a consequence of implementation particulars and not a bug — can be avoided entirely if you use the XML or protobuf representations instead. So why describe and advocate for the only option of the three with such caveats?

  • Tejay Cardon / April 22, 2013 / 1:28 PM

    You state:
    “Cloudera Manager makes it really easy to install and manage the HBase REST and Thrift services.”

    I don’t see REST as a role under HBase. Is the REST interface automatically installed along with the HBase master? Or is there some way to enable it? If it is separate, how do I install it, and can it be installed via the API?

    Thanks,
    Tejay

  • Jesse Anderson (@jessetanderson) / May 24, 2013 / 1:44 PM

    @andrew The remainder of the series covers JSON and XML.

  • Jesse Anderson (@jessetanderson) / May 24, 2013 / 1:46 PM

    @tejay I double checked and the latest version of Cloudera Manager does support both the REST and Thrift interfaces. You can add them when assigning roles to an instance.

  • Jesse Anderson (@jessetanderson) / May 28, 2013 / 11:01 AM

    @tejay You might need to update to the latest version of Cloudera Manager. As of Cloudera Manager 4.5, you can manage the HBase REST and Thrift daemons.

  • Wouter Bolsterlee / July 10, 2013 / 2:40 PM

    HappyBase (https://happybase.readthedocs.org) is a very mature and extensive library for interacting with HBase from Python. I would heartily recommend it over the quite “hacky” REST library built in this tutorial…

    • Jesse Anderson (@jessetanderson) / July 10, 2013 / 2:47 PM

      I wouldn’t remotely call this series of posts or the code a library because they’re showing how to work directly with the API. The purpose of a library is to abstract away the complexity of an API and make it easier to work with. HappyBase is a library and this code isn’t.
