How-to: Use the Apache HBase REST Interface, Part 2

This how-to is the second in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 below will show you how to insert multiple rows at once using XML and JSON. The full code samples can be found on GitHub.

Adding Rows With XML

The REST interface would be useless without the ability to add and update row values. The interface gives us this ability with the POST verb. By posting new rows, we can add new rows or update existing rows using the same row key.

We’ll step through how to do this in both the XML and JSON data formats, starting with XML.

We’ll have to add two import statements:
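
As a minimal sketch, assuming Python 3 and the standard library’s xml.etree.ElementTree (the full samples on GitHub may use a different XML module), the two imports could look like this:

```python
# Assumed imports for the XML examples; both come from the standard library.
import base64  # encodes row keys, column names, and values
from xml.etree.ElementTree import Element, SubElement, tostring  # builds the XML DOM
```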

 

For the XML data format, all row keys, column names, and values need to be base64 encoded because they can contain binary data. We can’t have binary data messing up our nicely formed XML.

We also need to import our XML modules. These modules will help us create the XML DOM to hold our new rows.

To work with column names, we need to base64 encode them. I recommend doing this once at the start of the script and reusing the variable as needed:
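
Here is one way this might look, assuming Python 3 and a hypothetical column named “message:text” (the column name and the b64 helper are illustrative and do not come from this article):

```python
import base64

def b64(text):
    """Base64 encode a str and return a str (Python 3 needs bytes in between)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

# Hypothetical column (family "message", qualifier "text").
COLUMN_NAME = "message:text"

# Encode once at the top of the script and reuse the result for every row.
ENCODED_COLUMN = b64(COLUMN_NAME)
print(ENCODED_COLUMN)
```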

 

Let’s walk through the code that creates the rows.

 

The first line creates the root XML element. The CellSet node will contain all of the child elements to be inserted as rows.

The for loop iterates over all of the entries we want to insert. At the start of the loop, a row key is created and then base64 encoded. This key will uniquely identify the row in HBase. Next, a row element is created. This row element contains an attribute called “key” with the base64 encoded row key as the value. Each cell value is base64 encoded and added to a “Cell” element. Every “Cell” element has the base64 encoded column name as an attribute, and its text is set to the base64 encoded value.
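
The loop described above can be sketched as follows. This is not the original listing: it assumes Python 3 and xml.etree.ElementTree, and the build_cellset function, the “row0”, “row1” key scheme, and the “message:text” column are all hypothetical names chosen for illustration.

```python
import base64
from xml.etree.ElementTree import Element, SubElement, tostring

def b64(text):
    """Base64 encode a str and return a str."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

ENCODED_COLUMN = b64("message:text")  # hypothetical column name

def build_cellset(values):
    """Build a CellSet document holding one Row (with one Cell) per value."""
    cellset = Element("CellSet")  # root element for all rows
    for index, value in enumerate(values):
        row_key = b64("row%d" % index)  # hypothetical row key scheme
        row = SubElement(cellset, "Row", key=row_key)
        cell = SubElement(row, "Cell", column=ENCODED_COLUMN)
        cell.text = b64(value)  # the cell's text is the encoded value
    return tostring(cellset)  # serialized document as bytes

xml_body = build_cellset(["first value", "second value"])
print(xml_body.decode("ascii"))
```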

Note that the row key in the URL is set to “fakerow”. Since this is a multi-store request, the REST interface will use the key supplied in each “Row” element instead. For a single-store request, you can put the key in the URL. A multi-store request is much more efficient than storing one row per request, because sending fewer, larger requests incurs far less overhead.

Once the entire XML DOM is created, it can be passed to the REST interface using the POST verb. The POST’s data is set to a string representation of the DOM. The Content-Type and Accept headers are set to XML. The REST server is expecting XML input and will pass back XML. The success of the call can be ascertained by looking at the status code.
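
A rough sketch of the POST, under several assumptions: the REST server address http://localhost:8080 and the table name “messagestable” are placeholders, and the standard library’s urllib.request stands in for whichever HTTP client the full samples use. The request is only built here; sending it needs a running REST server.

```python
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed REST server address
TABLE = "messagestable"             # hypothetical table name

def build_xml_post(xml_body):
    """Build (but do not send) a multi-store POST carrying a CellSet document."""
    return urllib.request.Request(
        url="%s/%s/fakerow" % (BASE_URL, TABLE),  # the key in the URL is a placeholder
        data=xml_body,                            # serialized DOM as bytes
        headers={"Content-Type": "text/xml", "Accept": "text/xml"},
        method="POST",
    )

def post_rows(request):
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    """
    with urllib.request.urlopen(request) as response:
        return response.status

# A tiny hand-written CellSet: row key "row0", column "message:text", value "value".
body = (b'<CellSet><Row key="cm93MA==">'
        b'<Cell column="bWVzc2FnZTp0ZXh0">dmFsdWU=</Cell></Row></CellSet>')
request = build_xml_post(body)
print(request.get_method(), request.full_url)
# status = post_rows(request)  # uncomment when a REST server is running
```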

To update a row’s values in XML, simply make the key’s value the same as it was before. HBase will find the row and update the values contained in the call.

Adding Rows With JSON

Working with JSON is very similar to working with XML, though I find the JSON code more straightforward. JSON also maps more naturally onto data structures and produces a much smaller payload. I’ve found that the JSON calls take less time than the XML calls.

Here are the imports for using JSON:
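
A minimal sketch of these imports, assuming Python 3 and only the standard library:

```python
# Assumed imports for the JSON examples; all come from the standard library.
import base64                        # row keys, column names, and values are still base64 encoded
import json                          # serializes the rows to a JSON string
from collections import OrderedDict  # keeps the entries of each row in insertion order
```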

 

Note that OrderedDict is one of the imports. We’ll discuss the reason why shortly.

Now let’s walk through the JSON multi-store code.

 

First, we create a row array to hold all of the rows we want to store. Then, we add the row array to a dict under the name “Row”.

We enter a for loop that goes through all of the data we want to add to HBase. First, we create the row key that will uniquely identify the row in HBase. Once again, we have to base64 encode the value. Next, we take the data we want to store and base64 encode it too.

Now comes the OrderedDict. We have to use an OrderedDict instead of a normal dict because we have to maintain the order of the keys in the dictionary. This works around an issue in the REST daemon for JSON: the “key” entry must come before the “Cell” entry. If it doesn’t, the REST interface won’t find the key and will use the URL’s key over and over. In this case, the row key would be the fake key from the URL, and every column would be added to the same row.

We add the “key” to the OrderedDict and then add the “Cell”, which is an array of dictionaries. In each dictionary, the “column” entry holds the base64 encoded name of the column and the dollar sign ($) entry holds the base64 encoded value of the column.
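
To make the shape of a single row concrete, here is a small sketch assuming Python 3. The row key “row0”, the column “message:text”, and the b64 helper are hypothetical.

```python
import base64
import json
from collections import OrderedDict

def b64(text):
    """Base64 encode a str and return a str."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

row = OrderedDict()
row["key"] = b64("row0")  # inserted first, so it is serialized first
row["Cell"] = [{"column": b64("message:text"), "$": b64("first value")}]

row_json = json.dumps(row)
print(row_json)
```

On Python 3.7 and later, a plain dict also preserves insertion order, but an OrderedDict makes the ordering requirement explicit to anyone reading the code.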

Once we have finished creating the cell object, we can append that to our row array.

The final step is to submit the JSON to the REST server using the POST verb. Because we are using a multi-store request, the URL’s key is fake and each row’s own key will be used. The data is set to a string representation of the JSON. The Content-Type and Accept headers are changed so that the REST server expects JSON and passes back JSON. The success of the call can be ascertained by looking at the status code.
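
Putting the JSON steps together, a sketch might look like the following. As before, this is not the original listing: the server address, table name, column name, row key scheme, and helper functions are assumptions, and urllib.request stands in for whichever HTTP client the full samples use.

```python
import base64
import json
import urllib.request
from collections import OrderedDict

BASE_URL = "http://localhost:8080"  # assumed REST server address
TABLE = "messagestable"             # hypothetical table name
COLUMN = "message:text"             # hypothetical column name

def b64(text):
    """Base64 encode a str and return a str."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def build_rows(values):
    """Build the multi-store payload: a dict whose "Row" entry lists every row."""
    rows = []
    payload = {"Row": rows}
    for index, value in enumerate(values):
        row = OrderedDict()
        row["key"] = b64("row%d" % index)  # "key" must come before "Cell"
        row["Cell"] = [{"column": b64(COLUMN), "$": b64(value)}]
        rows.append(row)
    return payload

def build_json_post(payload):
    """Build (but do not send) the multi-store POST request."""
    return urllib.request.Request(
        url="%s/%s/fakerow" % (BASE_URL, TABLE),  # the key in the URL is a placeholder
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )

payload = build_rows(["first value", "second value"])
request = build_json_post(payload)
print(json.dumps(payload, indent=2))

# Sending requires a running REST server:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```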

To update a row’s values in JSON, simply make the key’s value the same as it was before. HBase will find the row and update the values contained in the call.

In the third and final how-to in this series, we’ll cover getting the rows that we’ve just inserted.

Jesse Anderson is an instructor with Cloudera University.
