How-to: Use the HBase Thrift Interface, Part 3 – Using Scans
The conclusion to this series covers how to use scans, and considerations for choosing the Thrift or REST APIs.
In this series of how-tos, you have learned how to use Apache HBase’s Thrift interface. Part 1 covered the basics of the API, working with Thrift, and some boilerplate code for connecting to Thrift. Part 2 showed how to insert and to get multiple rows at a time. In this third and final post, you will learn how to use scans and some considerations when choosing between REST and Thrift.
Scanning with Thrift
A "scan" allows you to retrieve all or a range of rows in a table. Here is the code for doing a scan:
scan = Hbase.TScan(startRow="shakespeare-comedies-000001", stopRow="shakespeare-comedies-999999")
scannerId = client.scannerOpenWithScan(tablename, scan)

# Fetch the first batch; numRows is the batch size
rowList = client.scannerGetList(scannerId, numRows)

while rowList:
    for row in rowList:
        message = row.columns.get(messagecolumncf).value
        linenumber = decode(row.columns.get(linenumbercolumncf).value)
        rowKey = row.row

    # Fetch the next batch; an empty result ends the loop
    rowList = client.scannerGetList(scannerId, numRows)

# Release the scanner held on the Thrift server
client.scannerClose(scannerId)
You start off by creating a TScan object, which allows you to specify the start and stop rows for the scan. The start row is inclusive and the stop row is exclusive.
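The inclusive start and exclusive stop can be illustrated without a cluster. The following is a minimal sketch that mimics the range selection on a sorted list of byte-string keys; `rows_in_range` and the sample keys are hypothetical and are not part of the Thrift API.

```python
from bisect import bisect_left

def rows_in_range(sorted_keys, start_row, stop_row):
    """Return the keys k with start_row <= k < stop_row.

    sorted_keys must be sorted; row keys compare byte by byte.
    """
    lo = bisect_left(sorted_keys, start_row)  # first key >= start_row (inclusive)
    hi = bisect_left(sorted_keys, stop_row)   # first key >= stop_row (exclusive)
    return sorted_keys[lo:hi]

# Hypothetical row keys, for illustration only
keys = sorted([
    b"shakespeare-comedies-000001",
    b"shakespeare-comedies-000002",
    b"shakespeare-comedies-999999",
    b"shakespeare-tragedies-000001",
])

selected = rows_in_range(
    keys, b"shakespeare-comedies-000001", b"shakespeare-comedies-999999"
)
```

Note that a row whose key is exactly the stop row would be left out, so pick a stop row just past the last key you want.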
Using the TScan object, you call scannerOpenWithScan with the table name. This returns a scanner id. The Thrift client cannot directly hold onto a Scan object; instead, the Thrift server holds on to this object for you. The scanner id uniquely identifies this Scan object on the Thrift server so your code can use it.
With the scanner id, you start to get rows back with the scannerGetList call. You also need to specify the number of rows you want to come back at a time. This number will vary depending on your code and data, so I recommend making it a variable and spending some time to optimize it.
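To get a feel for why the batch size matters, here is a rough sketch that counts how many scannerGetList calls the loop pattern makes for different batch sizes. `FakeScannerClient` is a hypothetical in-memory stand-in for the Thrift client, and it assumes the scanner returns an empty list once the rows run out; real timings depend on your row sizes and network.

```python
class FakeScannerClient:
    """Hypothetical stand-in for the Thrift client; serves rows from a list."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0
        self.calls = 0  # number of scannerGetList calls made

    def scannerGetList(self, scanner_id, num_rows):
        self.calls += 1
        batch = self._rows[self._pos:self._pos + num_rows]
        self._pos += len(batch)
        return batch  # empty list once the rows are exhausted


def count_round_trips(total_rows, num_rows):
    """Run the fetch-until-empty loop and return how many calls it took."""
    client = FakeScannerClient(range(total_rows))
    row_list = client.scannerGetList(0, num_rows)
    while row_list:
        row_list = client.scannerGetList(0, num_rows)
    return client.calls


round_trips = {n: count_round_trips(1000, n) for n in (1, 10, 100, 1000)}
```

Each count includes the final call that comes back empty. Fewer round trips is not automatically better, since larger batches also mean larger responses to hold in memory.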
Now that you have a row list, you can start iterating through it. The iteration code is a little more complex than the GET code. Here, you have to handle the None or empty result that signals the end of the scan as well as iterate through the rows. Because Python before 3.8 has no assignment expression for a while condition, you have to fetch the first batch before the loop and fetch the next batch at the end of each pass.
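One way to keep the double fetch out of your main code is to hide it in a generator. This is only a sketch: `iter_rows` is a hypothetical helper, and `FakeClient` is an in-memory stand-in so the example runs without a Thrift server.

```python
def iter_rows(client, scanner_id, num_rows):
    """Yield rows one at a time, fetching num_rows per scannerGetList call."""
    while True:
        row_list = client.scannerGetList(scanner_id, num_rows)
        if not row_list:  # None or an empty list both end the scan
            return
        for row in row_list:
            yield row


class FakeClient:
    """Hypothetical stand-in for the Thrift client, for illustration only."""

    def __init__(self, rows):
        self._rows = list(rows)

    def scannerGetList(self, scanner_id, num_rows):
        batch, self._rows = self._rows[:num_rows], self._rows[num_rows:]
        return batch


fake = FakeClient(["row-1", "row-2", "row-3", "row-4", "row-5"])
collected = list(iter_rows(fake, scanner_id=1, num_rows=2))
```

With a real client, the calling code shrinks to a single `for row in iter_rows(client, scannerId, numRows):` loop.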
The scanner returns the same TRowResult as a GET, so you can pull the row’s columns out of row.columns the same way the loop above does.
As previously explained, the Thrift server holds the Scan object, so you need to tell the Thrift server to close the Scan object on its side. You do this by calling scannerClose with the scanner id that scannerOpenWithScan returned. Forgetting to do this could leak Scan objects on the Thrift server.
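If an exception is raised while you are reading rows, a plain scannerClose call at the end of the script never runs. A small context manager is one way to guard against that. The sketch below assumes the two-argument scannerOpenWithScan call used in this post; `open_scanner` and `FakeClient` are hypothetical names for illustration.

```python
from contextlib import contextmanager


@contextmanager
def open_scanner(client, tablename, scan):
    """Open a scanner and always close it, even if the body raises."""
    scanner_id = client.scannerOpenWithScan(tablename, scan)
    try:
        yield scanner_id
    finally:
        client.scannerClose(scanner_id)


class FakeClient:
    """Hypothetical stand-in that tracks which scanner ids are still open."""

    def __init__(self):
        self.open_ids = set()
        self._next_id = 0

    def scannerOpenWithScan(self, tablename, scan):
        self._next_id += 1
        self.open_ids.add(self._next_id)
        return self._next_id

    def scannerClose(self, scanner_id):
        self.open_ids.discard(scanner_id)


fake = FakeClient()
try:
    with open_scanner(fake, "tablename", scan=None) as scanner_id:
        was_open = scanner_id in fake.open_ids
        raise RuntimeError("simulated failure while reading rows")
except RuntimeError:
    pass

leaked = fake.open_ids  # empty if the scanner was closed despite the error
```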
Choosing Between Thrift and REST
You may be deciding between using HBase’s Thrift and REST interfaces. I have covered the REST interface in great detail in another series of blog posts. This series of blog posts covers the Thrift side of things.
I should start the discussion by mentioning that the Thrift and REST interfaces are not mutually exclusive. Both daemons can run on the same server because they use different ports. Most shops choose just one to standardize their codebase.
[Chart: Program completion time (in seconds)]
There are a couple of things to consider when standardizing on an interface. First and foremost is speed. The chart above shows the program completion times for the GET scripts of Thrift, REST with JSON, and REST with XML. These numbers come from a three-node cluster with the Thrift and REST servers running on localhost. They give you a general idea of the speed of each interface and data format. Thrift comes out ahead in both programs.
Another consideration is ease of use. REST is much easier to set up and get going. Any programming language with an HTTP library can access the HBase REST interface. With Thrift, you will have to create the bindings and learn the data types. Thrift really excels at creating a seamless experience with the language bindings. You will not have to deal with XML or JSON with Thrift. Thrift has generated all of the data classes for you. Comparing the three examples’ code, you can see the Thrift code is much cleaner.
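To make the "deal with XML or JSON" point concrete, here is a rough sketch of the decoding a REST client does by hand. It assumes the REST server's JSON row format, in which keys, column names, and values arrive base64-encoded under "Row", "Cell", "column", and "$". The sample payload is built by hand for illustration, and the column name and value are hypothetical.

```python
import base64
import json


def b64(text):
    """Base64-encode a string the way the sample payload needs it."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Hand-built sample response body (assumed shape, hypothetical data)
sample_body = json.dumps({
    "Row": [{
        "key": b64("shakespeare-comedies-000001"),
        "Cell": [
            {"column": b64("line:message"), "timestamp": 1, "$": b64("hello")},
        ],
    }]
})


def decode_rest_rows(body):
    """Turn a JSON response body into {row_key: {column: value}}."""
    rows = {}
    for row in json.loads(body)["Row"]:
        key = base64.b64decode(row["key"]).decode("utf-8")
        rows[key] = {
            base64.b64decode(cell["column"]).decode("utf-8"):
                base64.b64decode(cell["$"]).decode("utf-8")
            for cell in row["Cell"]
        }
    return rows


decoded = decode_rest_rows(sample_body)
```

With Thrift, the generated TRowResult class gives you the row key and the columns directly, so none of this parsing or decoding is needed.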
If I were to start a brand new, non-Java project, I would use Thrift. It just comes out ahead. The REST interface is great for projects where you are already exposing your own REST interfaces. REST is also good for scripting and shell commands. You could write a quick bash script that uses curl to perform some actions.
I should stress that choosing between Thrift and REST assumes you have already ruled out Java. The only first-class citizen for HBase is Java. If I were to start a brand-new project without any language constraints, I would use the Java API.
The HBase Thrift interface is a good way to use HBase if you don’t want to use Java. It gives you code generation for a lot of different languages. This native, non-Java language support will improve your code’s performance and readability.
These code samples and explanations will save you a lot of googling when embarking on your HBase Thrift project.
Jesse Anderson is a Cloudera University instructor.