There are various ways to access and interact with Apache HBase. The Java API provides the most functionality, but many people want to use HBase without Java.
There are two main approaches for doing that: One is the Thrift interface, which is the faster and more lightweight of the two options. The other way to access HBase is using the REST interface, which uses HTTP verbs to perform an action, giving developers a wide choice of languages and programs to use.
This series of how-tos will discuss the REST interface and provide Python code samples for accessing it. The first post will cover HBase REST, some Python caveats, and table administration. The second post will explain how to insert multiple rows at a time using XML and JSON. The third post will show how to get multiple rows using XML and JSON. The full code samples can be found on my GitHub account.
HBase REST Basics
For both Thrift and REST to work, another HBase daemon needs to be running to handle these requests. These daemons can be installed from the hbase-thrift and hbase-rest packages. The diagram below illustrates where Thrift and REST are placed in the cluster. Note that the Thrift and REST clients usually don’t run any other services, such as DataNodes or RegionServers, which keeps the load down and responsiveness high for REST interactions.
Be sure to install and start these daemons on nodes that have access to both the Hadoop cluster and the web application server. The REST interface doesn’t have any built-in load balancing; that will need to be done with hardware or in code. Cloudera Manager makes it really easy to install and manage the HBase REST and Thrift services. (You can download and try it out for free!) The downside to REST is that it is much heavier-weight than Thrift or Java.
A REST interface can use various data formats: XML, JSON, and protobuf. By specifying the Accept and Content-Type headers, you can choose the format you want to pass in or receive back.
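As a rough sketch of how the two headers pair up with the three formats, a small helper can build the header dictionary for a call. The helper name format_headers and the short format keys are made up for illustration, and the MIME types listed are assumptions that should be checked against your HBase version.

```python
# Illustrative mapping from a short format name to the MIME type sent to HBase REST.
# These MIME types are assumptions of this sketch; verify them for your HBase version.
MIME_TYPES = {
    "xml": "text/xml",
    "json": "application/json",
    "protobuf": "application/x-protobuf",
}


def format_headers(send=None, receive=None):
    """Build HTTP headers: Content-Type for the body we send, Accept for the reply."""
    headers = {}
    if send is not None:
        headers["Content-Type"] = MIME_TYPES[send]
    if receive is not None:
        headers["Accept"] = MIME_TYPES[receive]
    return headers


# Send XML in the request body and ask for JSON back.
print(format_headers(send="xml", receive="json"))
```

The resulting dictionary can be passed as the headers argument of an HTTP call.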
To start using the REST interface, you need to figure out which port it’s running on. The default port for CDH is port 8070. Throughout this post, you’ll see the baseurl variable used; here is the value I’ll be using:
baseurl = "http://localhost:8070"
The REST interface can be set up to use a Kerberos credential to increase security.
For your code, you’ll need to use the IP address or fully qualified domain name (FQDN) of the node running the REST daemon. Also, confirm that the port is correct. I highly recommend making this URL a variable, as it could change with network changes.
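One way to keep the URL in a single place is sketched below, assuming Python 3. The environment variable name HBASE_REST_URL and the helper schema_url are hypothetical names chosen for this example, not part of HBase or the original samples.

```python
import os
from urllib.parse import quote

# HBASE_REST_URL is a hypothetical environment variable used only in this sketch.
# It falls back to the value used throughout this post.
baseurl = os.environ.get("HBASE_REST_URL", "http://localhost:8070").rstrip("/")


def schema_url(base, tablename):
    """Return the schema URL for a table, percent-encoding the table name."""
    return base.rstrip("/") + "/" + quote(tablename, safe="") + "/schema"


print(schema_url(baseurl, "messagestable"))
```

Building every URL from one variable means a host or port change touches only one line of configuration.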
Python and HBase Bug Workarounds
There are two bugs and workarounds that need to be addressed. The first bug is that the built-in Python modules don’t support all of the HTTP verbs. The second is an HBase REST bug when working with JSON.
The built-in Python modules for REST interaction don’t easily support all of the HTTP verbs needed for HBase REST. You’ll need to install the Python requests module. The requests module also cleans up the code and makes all of the interactions much easier.
The HBase REST interface has a bug when adding data via JSON: the fields must keep their exact order. The built-in Python dict type doesn’t support this feature, so to maintain the order, we’ll need to use the OrderedDict class. (Those with Python 2.6 and older will need to install the ordereddict module.) I’ll cover the bug and workaround later in the post, too.
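To illustrate the ordering workaround, here is a minimal sketch, assuming Python 3. The field names ("Row", "key", "Cell", "column", "$") and the sample row key, column, and value are assumptions used for illustration; the point is only that OrderedDict serializes its fields in the order they were added.

```python
import base64
import json
from collections import OrderedDict


def b64(text):
    """Base64-encode a text value and return it as a plain string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Field names and values here are illustrative assumptions.
cell = OrderedDict([("column", b64("cf:col")), ("$", b64("value"))])
row = OrderedDict([("key", b64("row1")), ("Cell", [cell])])
payload = json.dumps(OrderedDict([("Row", [row])]))

# The serialized fields appear in the order they were added.
print(payload)
```

On Python 3.7 and later, plain dicts also preserve insertion order, but OrderedDict makes the intent explicit.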
It was also difficult to base64-encode and decode integers, so I wrote some code to do that:
import base64
import struct

# Method for encoding ints with base64 encoding
def encode(n):
    data = struct.pack("i", n)
    s = base64.b64encode(data)
    return s

# Method for decoding ints with base64 encoding
def decode(s):
    data = base64.b64decode(s)
    n = struct.unpack("i", data)
    return n[0]
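One thing to watch: the "i" format packs the integer in the machine’s native byte order. If the same values also need to be read by Java code through HBase’s Bytes utility, which uses big-endian order, an explicit ">i" format may be the safer choice. The variant below is a sketch under that assumption, written for Python 3; the function names are made up for this example.

```python
import base64
import struct


def encode_int_be(n):
    """Pack a signed 32-bit int in big-endian order, then base64-encode it to text."""
    return base64.b64encode(struct.pack(">i", n)).decode("ascii")


def decode_int_be(s):
    """Reverse of encode_int_be: base64-decode, then unpack the big-endian int."""
    return struct.unpack(">i", base64.b64decode(s))[0]


# Round trip: the decoded value matches the original integer.
print(decode_int_be(encode_int_be(-42)))
```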
To make things even easier, I wrote a method to confirm that HTTP responses come back in the 200s, which indicates that the operation worked. The sample code uses this method to check the success of a call before moving on. Here is the method:
# Checks the request object to see if the call was successful
def issuccessful(request):
    return 200 <= request.status_code <= 299
Working With Tables
Using the REST interface, you can create or delete tables. Let’s take a look at the code to create a table.
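# Schema document: one TableSchema node containing one ColumnSchema node
# (cfname is an assumed variable holding the column family name)
content = '<?xml version="1.0" encoding="UTF-8"?>'
content += '<TableSchema name="' + tablename + '">'
content += '  <ColumnSchema name="' + cfname + '" />'
content += '</TableSchema>'

request = requests.post(baseurl + "/" + tablename + "/schema", data=content, headers={"Content-Type" : "text/xml", "Accept" : "text/xml"})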
In this snippet, we create a small XML document that defines the table schema in the content variable. We need to provide the name of the table and the column family name. If there are multiple column families, add more ColumnSchema nodes.
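For tables with several column families, building the document with an XML library avoids hand-written string concatenation. The sketch below assumes the root element is named TableSchema with one ColumnSchema child per family; the function name and the sample table and family names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_schema_xml(tablename, column_families):
    """Return a TableSchema document with one ColumnSchema node per family."""
    # Element and attribute names are assumptions of this sketch.
    root = ET.Element("TableSchema", {"name": tablename})
    for family in column_families:
        ET.SubElement(root, "ColumnSchema", {"name": family})
    return ET.tostring(root, encoding="unicode")


# Hypothetical table with two column families.
content = build_schema_xml("messagestable", ["messages", "metadata"])
print(content)
```

A side benefit of using the library is that special characters in names are escaped automatically.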
Next, we use the requests module to POST the XML to the URL we create. This URL needs to include the name of the new table. Also, note that we are setting the headers for this POST call. We are showing that we are sending in XML with the Content-Type set to “text/xml” and that we want XML back with the Accept set to “text/xml”.
Using request.status_code, you can check that the table creation was successful. The REST interface uses standard HTTP status codes to signal whether a call succeeded or errored out. A status code in the 200s means that things worked correctly.
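Putting the POST and the status check together might look like the sketch below. The function name create_table and its http parameter are assumptions of this example: http can be anything with a requests-style post() method, such as the requests module itself or a requests.Session.

```python
def create_table(http, baseurl, tablename, content):
    """POST a schema document and report whether the status code is in the 200s.

    `http` is any object with a requests-style post() method (an assumption
    of this sketch), for example the requests module or a requests.Session.
    """
    response = http.post(
        baseurl + "/" + tablename + "/schema",
        data=content,
        headers={"Content-Type": "text/xml", "Accept": "text/xml"},
    )
    return 200 <= response.status_code <= 299
```

With the requests module imported, a call could be written as create_table(requests, baseurl, tablename, content).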
We can easily check if a table exists using the following code:
request = requests.get(baseurl + "/" + tablename + "/schema")
This call uses the GET verb to tell the REST interface we want to get the schema information about the table in the URL. Once again, we can use the status code to see if the table exists. A status code in the 200s means it does exist and any other number means it doesn’t.
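The existence check can be wrapped in a small function, sketched here. The name table_exists and the http parameter (any object with a requests-style get() method, such as the requests module) are assumptions of this example.

```python
def table_exists(http, baseurl, tablename):
    """Return True when the table's schema URL answers with a 2xx status code.

    `http` is any object with a requests-style get() method (an assumption
    of this sketch), for example the requests module or a requests.Session.
    """
    response = http.get(
        baseurl + "/" + tablename + "/schema",
        headers={"Accept": "text/xml"},
    )
    return 200 <= response.status_code <= 299
```

Note that this treats every non-2xx answer as "table missing", so a network or server problem would look the same as an absent table.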
Using the curl command, we can check the success of a REST operation without writing code. The following command will return a 200 showing the success of the call because the messagestable table does exist in HBase. Here is the call and its output:
[user@localhost]$ curl -I -H "Accept: text/xml" http://localhost:8070/messagestable/schema
HTTP/1.1 200 OK
Content-Length: 0
Cache-Control: no-cache
Content-Type: text/xml
This REST call will error out because the tablenotthere table doesn’t exist in HBase. Here is the call and its output:
[user@localhost]$ curl -I -H "Accept: text/xml" http://localhost:8070/tablenotthere/schema
HTTP/1.1 500 org.apache.hadoop.hbase.TableNotFoundException: tablenotthere
Content-Type: text/html; charset=iso-8859-1
Cache-Control: must-revalidate,no-cache,no-store
Content-Length: 10767
We can delete a table using the following code:
request = requests.delete(baseurl + "/" + tablename + "/schema")
This call uses the DELETE verb to tell the REST interface that we want to delete the table. Deleting a table through the REST interface doesn’t require you to disable it first. As usual, we can confirm success by looking at the status code.
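A cleanup routine can combine the existence check with the delete, as in the sketch below. The function name delete_table_if_exists and the http parameter (any object with requests-style get() and delete() methods, such as the requests module) are assumptions of this example.

```python
def delete_table_if_exists(http, baseurl, tablename):
    """Delete a table only when its schema URL answers with a 2xx status code.

    Returns True if a delete was issued and succeeded, False otherwise.
    `http` is any object with requests-style get() and delete() methods
    (an assumption of this sketch).
    """
    url = baseurl + "/" + tablename + "/schema"
    if not 200 <= http.get(url).status_code <= 299:
        return False  # Table does not appear to exist; nothing to delete.
    return 200 <= http.delete(url).status_code <= 299
```

Checking first keeps the routine safe to run repeatedly, for example at the start of a test script.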
In the next post in this series, we’ll cover inserting rows.
Jesse Anderson is an instructor with Cloudera University.