How-to: Use the HBase Thrift Interface, Part 2: Inserting/Getting Rows

The second how-to in a series about using the Apache HBase Thrift API

Last time, we covered the fundamentals of connecting to HBase through Thrift with Python. This time, you’ll learn how to create and delete tables, and how to insert and get one or more rows at a time.

Working with Tables

Using the Thrift interface, you can create or delete tables. Let’s take a look at the Python code that creates a table:

client.createTable(tablename, [Hbase.ColumnDescriptor(name=cfname)])

 

This snippet creates a single Hbase.ColumnDescriptor object, which holds the settings for a column family. Here, only the column family name is set.

You may recall from the previous how-to that adding the Hbase.thrift file to your project is often useful. This is one of those times: you can open Hbase.thrift and find the ColumnDescriptor definition with all of its parameters and their names.

You can confirm a table exists using the following code:

tables = client.getTableNames()

found = False

for table in tables:
	if table == tablename:
		found = True

 

This code gets a list of the user tables, iterates through them, and sets found to True if the table is in the list.
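
The same check can be written more compactly with Python's in operator. The sketch below assumes client is the connected Hbase.Client from the previous how-to; table_exists is a hypothetical helper name, not part of the Thrift API.

```python
def table_exists(client, tablename):
    """Return True if tablename is among the tables reported by the Thrift server."""
    # getTableNames() returns a list of table names; `in` performs the
    # same linear search as the explicit loop above.
    return tablename in client.getTableNames()
```

You would call it as found = table_exists(client, tablename). Depending on your Python and Thrift versions, table names may come back as bytes rather than str, in which case the name you compare against should be the same type.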

You can delete a table using the following code:

client.disableTable(tablename)
client.deleteTable(tablename)

 

Remember that in HBase, you have to disable a table before deleting it. This code does just that.
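
Deleting a table that does not exist, or disabling one that is already disabled, is likely to raise an error from the server. A slightly more defensive version might look like the sketch below. It relies on the isTableEnabled call defined in Hbase.thrift; drop_table is a hypothetical helper name, and client is assumed to be the connected Hbase.Client.

```python
def drop_table(client, tablename):
    """Disable (if needed) and delete tablename.

    Returns False when the table does not exist, True once it is deleted.
    """
    if tablename not in client.getTableNames():
        return False

    # Only disable when the table is currently enabled.
    if client.isTableEnabled(tablename):
        client.disableTable(tablename)

    client.deleteTable(tablename)
    return True
```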

Adding Rows with Thrift

Thrift gives you two ways to add or update rows: one row at a time, or multiple rows at a time. The Thrift interface does not use the Put object from the Java API. Instead, changes are expressed as row mutations, using the Mutation and BatchMutation objects.

mutations = [Hbase.Mutation(
  column='columnfamily:columndescriptor', value='columnvalue')]
client.mutateRow('tablename', 'rowkey', mutations)

 

Each Mutation object represents the changes to a single column. To add or change another column, you would simply add another Mutation object to the mutations list.

When you are done adding Mutation objects, you call the mutateRow method. This method takes the table name, row key, and mutations list as arguments.

Adding multiple rows at a time requires a few changes:

# Create a list of mutations per work of Shakespeare
mutationsbatch = []

for line in shakespeare:
    rowkey = username + "-" + filename + "-" + str(linenumber).zfill(6)

    # One Mutation per column in this row
    mutations = [
        Hbase.Mutation(column=messagecolumncf, value=line.strip()),
        Hbase.Mutation(column=linenumbercolumncf, value=encode(linenumber)),
        Hbase.Mutation(column=usernamecolumncf, value=username)
    ]

    mutationsbatch.append(Hbase.BatchMutation(row=rowkey, mutations=mutations))

    # Advance the counter so each line gets its own row key
    # (assumes linenumber was initialized before the loop)
    linenumber += 1

# Run the mutations for the work of Shakespeare
client.mutateRows(tablename, mutationsbatch)

 

In this example, you still use Mutation objects, but this time you wrap each row’s list of mutations in a BatchMutation object. The BatchMutation object lets you specify a different row key for each list of Mutations. You also switch to the mutateRows method, which takes a table name and a list of BatchMutation objects.
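
For a large input, a single mutateRows call can turn into a very large request. One option is to send the list in slices. The following is a minimal sketch, assuming client is the connected Hbase.Client and mutationsbatch is the list of BatchMutation objects built above; mutate_in_chunks and the default chunk_size of 1000 are illustrative choices, not values from the HBase documentation.

```python
def mutate_in_chunks(client, tablename, mutationsbatch, chunk_size=1000):
    """Send mutationsbatch through mutateRows in slices of at most chunk_size rows.

    Returns the number of BatchMutation objects sent.
    """
    sent = 0
    for start in range(0, len(mutationsbatch), chunk_size):
        chunk = mutationsbatch[start:start + chunk_size]
        client.mutateRows(tablename, chunk)
        sent += len(chunk)
    return sent
```

The right chunk size depends on row size and server settings, so treat the default as a starting point to tune.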

Getting Rows With Thrift

Using the getRow method, you can retrieve a single row based on its row key. This call returns a list of TRowResult objects. Here is the code for getting and working with the output:

rows = client.getRow(tablename, "shakespeare-comedies-000001")

for row in rows:
     message = row.columns.get(messagecolumncf).value
     linenumber = decode(row.columns.get(linenumbercolumncf).value)

     rowKey = row.row

 

The code starts with a getRow request, which returns the row with the key “shakespeare-comedies-000001”. The result comes back as a list of TRowResult objects, even for a single row, so the for loop walks through that list.

To get the value of a column, use columns.get(“COLUMNFAMILY:COLUMNDESCRIPTOR”). Be sure to use the proper naming syntax: the column family and the column name, joined by a colon.

Remember that when dealing with binary data like integers, you will need to convert it from a Python string to whatever type it should be. In this case, you are taking the string and making it an integer with the decode method.
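
The encode and decode helpers used in these snippets are not shown in this how-to. One plausible way to write them is with Python's struct module. The sketch below assumes integers are stored as 4-byte big-endian signed values, which is the layout the Java Bytes.toBytes(int) method produces; it is not necessarily the author's implementation.

```python
import struct

def encode(n):
    """Pack an integer into a 4-byte big-endian value for storage in HBase."""
    return struct.pack(">i", n)

def decode(s):
    """Unpack a 4-byte big-endian value read from HBase back into an integer."""
    return struct.unpack(">i", s)[0]
```

Whatever layout you choose, the writer and every reader of the column have to agree on it, because HBase itself only stores the raw bytes.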

Getting multiple rows at a time is very similar to getting one row. Here is the code:

rowKeys = ["shakespeare-comedies-000001",
           "shakespeare-comedies-000010",
           "shakespeare-comedies-000020",
           "shakespeare-comedies-000100",
           "shakespeare-comedies-000201"]

rows = client.getRows(tablename, rowKeys)

 

Instead of specifying a single row key, you pass in a list of row keys. You also change the method to getRows, which takes the table name and the list of row keys as arguments.

A list of TRowResult objects is returned and you then iterate through the list just like in the single-row code.
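
When you request several keys, it can help to index the results by row key and to check which keys came back. The sketch below assumes each TRowResult exposes row and columns as in the single-row example, and that rows missing from the table are simply absent from the returned list; rows_to_dict and missing_keys are hypothetical helper names.

```python
def rows_to_dict(rows):
    """Map each row key to a dict of column name -> raw cell value."""
    result = {}
    for row in rows:
        result[row.row] = {name: cell.value for name, cell in row.columns.items()}
    return result

def missing_keys(requested, rows):
    """Return the requested row keys that are not present in rows, in request order."""
    returned = {row.row for row in rows}
    return [key for key in requested if key not in returned]
```

For example, missing_keys(rowKeys, rows) lists any of the five requested keys that were not returned.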

In the next and final how-to, you’ll learn how to use scans and get an introduction to some considerations when choosing between the REST and Thrift APIs for development.

Jesse Anderson is an instructor for Cloudera University.
