There are various ways to access and interact with Apache HBase. The Java API provides the most functionality, but many people want to use HBase without Java.

There are two main approaches for doing that: One is the Thrift interface, which is the faster and more lightweight of the two options. The other way to access HBase is using the REST interface, which uses HTTP verbs to perform an action, giving developers a wide choice of languages and programs to use.

This series of how-tos discusses the REST interface and provides Python code samples for accessing it. The first post covers HBase REST, some Python caveats, and table administration. The second post explains how to insert multiple rows at a time using XML and JSON. The third post shows how to get multiple rows using XML and JSON. The full code samples can be found on my GitHub account.

HBase REST Basics

For both Thrift and REST to work, another HBase daemon needs to be running to handle these requests. These daemons are installed from the hbase-thrift and hbase-rest packages. The diagram below illustrates where Thrift and REST are placed in the cluster. Note that the nodes running Thrift and REST usually don’t run any other services, such as DataNodes or RegionServers, which keeps the load down and the responsiveness high for REST interactions.

Be sure to install and start these daemons on nodes that have access to both the Hadoop cluster and the web application server. The REST interface doesn’t have any built-in load balancing; that will need to be done with hardware or in code. Cloudera Manager makes it really easy to install and manage the HBase REST and Thrift services. (You can download and try it out for free!) The downside to REST is that it is much heavier-weight than Thrift or Java.
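As one way to do the balancing in code, a client can simply rotate through several REST gateways. The following is a minimal sketch, not a production balancer: the host names are hypothetical, and it makes no attempt to detect or skip a node that is down.

```python
import itertools

# Hypothetical REST gateway addresses; replace them with your own nodes.
REST_NODES = [
    "http://rest1.example.com:8070",
    "http://rest2.example.com:8070",
]

# cycle() repeats the list forever, which gives simple round-robin selection.
_node_cycle = itertools.cycle(REST_NODES)


def next_baseurl():
    """Return the base URL of the next REST gateway in the rotation."""
    return next(_node_cycle)
```

Each request would call next_baseurl() instead of using a single fixed URL, so consecutive calls are spread across the gateways.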

A REST interface can use various data formats: XML, JSON, and protobuf. By specifying the Accept and Content-Type headers, you can choose the format you want to pass in or receive back.
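To make the header choice concrete, here is a small sketch of a helper that builds the two headers for a given format. The helper name and the short format keys are my own invention for illustration; the MIME types are the ones the HBase REST gateway uses for XML, JSON, and protobuf.

```python
# MIME type for each data format the REST interface understands.
MIME_TYPES = {
    "xml": "text/xml",
    "json": "application/json",
    "protobuf": "application/x-protobuf",
}


def format_headers(send=None, receive=None):
    """Build HTTP headers for the format being sent and/or requested back."""
    headers = {}
    if send is not None:
        # Content-Type describes the body we pass in.
        headers["Content-Type"] = MIME_TYPES[send]
    if receive is not None:
        # Accept describes the format we want to get back.
        headers["Accept"] = MIME_TYPES[receive]
    return headers
```

For example, format_headers(send="xml", receive="json") would describe a call that posts XML and asks for JSON in the response.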

To start using the REST interface, you need to figure out which port it’s running on. The default port for CDH is port 8070. For this post, you’ll see the baseurl variable used, and here is the value I’ll be using:

baseurl = "http://localhost:8070"

The REST interface can be set up to use a Kerberos credential to increase security.

For your code, you’ll need to use the IP address or fully qualified domain name (FQDN) of the node running the REST daemon. Also, confirm that the port is correct. I highly recommend making this URL a variable, as it could change with network changes.
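One way to keep the URL out of the code entirely is to read it from the environment. This is only a sketch: HBASE_REST_URL is a hypothetical variable name, and the fallback is the localhost value used in this post.

```python
import os


def get_baseurl(environ=os.environ):
    """Return the REST base URL, falling back to the local CDH default."""
    # HBASE_REST_URL is a made-up name for illustration; pick your own.
    url = environ.get("HBASE_REST_URL", "http://localhost:8070")
    # Drop a trailing slash so that baseurl + "/" + tablename stays clean.
    return url.rstrip("/")


baseurl = get_baseurl()
```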

Python and HBase Bug Workarounds

There are two bugs and workarounds that need to be addressed. The first bug is that the built-in Python modules don’t support all of the HTTP verbs. The second is an HBase REST bug when working with JSON.

The built-in Python modules for REST interaction don’t easily support all of the HTTP verbs needed for HBase REST. You’ll need to install the Python requests module. The requests module also cleans up the code and makes all of the interactions much easier.

The HBase REST interface has a bug when adding data via JSON: it is required that the fields maintain their exact order. The built-in Python dict type doesn’t support this feature, so to maintain the order, we’ll need to use the OrderedDict class. (Those with Python 2.6 and older will need to install the ordereddict module.) I’ll cover the bug and workaround later in the post, too.
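As a quick illustration of the OrderedDict approach, the sketch below builds a single row and serializes it to JSON with the fields in a fixed order. The row key, column name, and value are made up, and the field order shown (key before Cell) is my assumption about what the gateway expects. On Python 3.7 and later, the plain dict also keeps insertion order, so this matters mostly on older versions.

```python
import base64
import json
from collections import OrderedDict


def b64(text):
    """Base64-encode a string, since the REST JSON format carries encoded values."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Hypothetical cell: the column name first, then the value under "$".
cell = OrderedDict([
    ("column", b64("message:text")),
    ("$", b64("hello")),
])

# Hypothetical row: the row key first, then the list of cells.
row = OrderedDict([
    ("key", b64("row1")),
    ("Cell", [cell]),
])

# json.dumps writes the keys in the order they were inserted.
payload = json.dumps({"Row": [row]})
```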

It was also difficult to base64-encode and decode integers, so I wrote some code to do that:

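import base64
import struct

# Method for encoding ints with base64 encoding
def encode(n):
    # Pack the int into 4 bytes (native byte order), then base64-encode them
    data = struct.pack("i", n)
    s = base64.b64encode(data)
    return s

# Method for decoding ints with base64 encoding
def decode(s):
    data = base64.b64decode(s)
    # unpack always returns a tuple, so take the first element
    n = struct.unpack("i", data)
    return n[0]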
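One thing to keep in mind is that the "i" format packs the integer in the machine’s native byte order. If the values also need to be read by Java code that uses HBase’s Bytes.toInt, which expects big-endian bytes, a variant with an explicit byte order may be the safer choice. The sketch below is such a variant; the names encode_be and decode_be are hypothetical, and whether you need it depends on who else reads the data.

```python
import base64
import struct


def encode_be(n):
    """Base64-encode a 32-bit signed int packed in big-endian byte order."""
    return base64.b64encode(struct.pack(">i", n))


def decode_be(s):
    """Reverse encode_be: base64-decode, then unpack the big-endian int."""
    return struct.unpack(">i", base64.b64decode(s))[0]


encoded = encode_be(1)          # b"AAAAAQ==" on Python 3
roundtrip = decode_be(encoded)  # back to 1
```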


To make things even easier, I wrote a method to confirm that HTTP responses come back in the 200s, which indicates that the operation worked. The sample code uses this method to check the success of a call before moving on. Here is the method:

# Checks the request object to see if the call was successful
def issuccessful(request):
    # Any status code in the 200s counts as success
    return 200 <= request.status_code <= 299

Working With Tables

Using the REST interface, you can create or delete tables. Let’s take a look at the code to create a table.

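# tablename and cfname hold the table name and the column family name
content =  '<?xml version="1.0" encoding="UTF-8"?>'
content += '<TableSchema name="' + tablename + '">'
content += '  <ColumnSchema name="' + cfname + '" />'
content += '</TableSchema>'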

request = requests.post(baseurl + "/" + tablename + "/schema", data=content, headers={"Content-Type" : "text/xml", "Accept" : "text/xml"})


In this snippet, we create a small XML document that defines the table schema in the content variable. We need to provide the name of the table and the column family name. If there are multiple column families, add more ColumnSchema nodes.
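For tables with several column families, building the document with an XML library avoids hand-assembled strings. The following is a sketch under a few assumptions: it targets Python 3, the helper name is my own, and the column family names are made up for the example.

```python
import xml.etree.ElementTree as ET


def build_table_schema(tablename, families):
    """Return a TableSchema XML string with one ColumnSchema node per family."""
    root = ET.Element("TableSchema", name=tablename)
    for family in families:
        # Each column family becomes its own ColumnSchema child node.
        ET.SubElement(root, "ColumnSchema", name=family)
    # encoding="unicode" returns a str rather than bytes (Python 3).
    return ET.tostring(root, encoding="unicode")


# Hypothetical column families for the messagestable table.
content = build_table_schema("messagestable", ["messages", "metadata"])
```

The resulting content string can be posted to the schema URL in the same way as the hand-built one.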

Next, we use the requests module to POST the XML to the URL we create. This URL needs to include the name of the new table. Also, note that we are setting the headers for this POST call. We are showing that we are sending in XML with the Content-Type set to “text/xml” and that we want XML back with the Accept set to “text/xml”.

Using request.status_code, you can check that the table was created successfully. The REST interface uses standard HTTP status codes to signal whether a call succeeded or failed. A status code in the 200s means that things worked correctly.

We can easily check if a table exists using the following code:

request = requests.get(baseurl + "/" + tablename + "/schema")


This call uses the GET verb to tell the REST interface that we want the schema information for the table named in the URL. Once again, we can use the status code to see if the table exists. A status code in the 200s means it does exist, and any other number means it doesn’t.
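The existence check can be wrapped in a small helper. This is a sketch rather than part of the original sample code: the function names are hypothetical, and the http_get parameter exists only so the helper can be exercised without a running cluster. When it is left out, the helper falls back to requests.get.

```python
def schema_url(baseurl, tablename):
    """Build the schema URL for a table, following the pattern used in this post."""
    return baseurl + "/" + tablename + "/schema"


def table_exists(baseurl, tablename, http_get=None):
    """Return True if a GET on the table's schema URL answers in the 200s."""
    if http_get is None:
        # Imported lazily so the helper can be tested without the module.
        import requests
        http_get = requests.get
    response = http_get(schema_url(baseurl, tablename))
    # Any status outside the 200s is treated as "table does not exist".
    return 200 <= response.status_code <= 299
```

With a live gateway, table_exists(baseurl, "messagestable") would issue the same GET as the snippet above and reduce the result to a boolean.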

Using the curl command, we can check the success of a REST operation without writing code. The following command returns a 200, showing that the call succeeded, because the messagestable table does exist in HBase. Here is the call and its output:

[user@localhost]$ curl -I -H "Accept: text/xml" http://localhost:8070/messagestable/schema
HTTP/1.1 200 OK
Content-Length: 0
Cache-Control: no-cache
Content-Type: text/xml


This REST call will error out because the tablenotthere table doesn’t exist in HBase. Here is the call and its output:

[user@localhost]$ curl -I -H "Accept: text/xml" http://localhost:8070/tablenotthere/schema
HTTP/1.1 500 org.apache.hadoop.hbase.TableNotFoundException: tablenotthere
Content-Type: text/html; charset=iso-8859-1
Cache-Control: must-revalidate,no-cache,no-store
Content-Length: 10767


We can delete a table using the following code:

request = requests.delete(baseurl + "/" + tablename + "/schema")


This call uses the DELETE verb to tell the REST interface that we want to delete the table. Deleting a table through the REST interface doesn’t require you to disable it first. As usual, we can confirm success by looking at the status code.

In the next post in this series, we’ll cover inserting rows.

Jesse Anderson is an instructor with Cloudera University.


