Cloudera Developer Blog · HBase Posts
Those people have two main options: One is the Thrift interface (the more lightweight and hence faster of the two options), and the other is the REST interface (aka Stargate). A REST interface uses HTTP verbs to perform an action. By using HTTP, a REST interface offers a much wider array of languages and programs that can access the interface. (If you’d like more information about the REST interface, you can go to my series of how-to’s about it.)
In this series of how-to’s you’ll learn your way around the Thrift interface and explore Python code samples for doing that. This first post will cover HBase Thrift, working with Thrift, and some boilerplate code for connecting to Thrift. The second post will show how to insert and get multiple rows at a time. The third post will explain how to use scans and some considerations when choosing between REST and Thrift.
The following post was originally published by the Hue Team at the Hue blog in a slightly different form.
In this post, we’ll take a look at the new Apache HBase Browser App added in Hue 2.5 and which has improved significantly since then. To get the Hue HBase browser, grab Hue via CDH 4.4 packages, via Cloudera Manager, or build it directly from GitHub.
Prerequisites before starting Hue:
While Apache HBase adoption for building end-user applications has skyrocketed, many of those applications (and many apps generally) have not been well-tested. In this post, you’ll learn some of the ways this testing can easily be done.
We will start with unit testing via JUnit, then move on to using Mockito and Apache MRUnit, and then to using an HBase mini-cluster for integration testing. (The HBase codebase itself is tested via a mini-cluster, so why not tap into that for upstream applications, as well?)
As a basis for discussion, let’s assume you have an HBase data access object (DAO) that does the following insert into HBase. The logic could be more complicated of course but for the sake of example, this does the job.
Apache HBase supports three primary client APIs that developers can use to bind applications with HBase: the Java API, the REST API, and the Thrift API. Therefore, as developers build apps against HBase, it’s very important for them to be aware of the compatibility guidelines with respect to CDH.
This blog post will describe the efforts that go into protecting the experience of a developer using the Java API. Through its testing work, Cloudera allows developers to write code and sleep well at night, knowing that their code will remain compatible through supported upgrade paths.
First, we’ll explore the compatibility guidelines themselves. From there, we will discuss some of the testing that ensures compatibility across CDH versions, as well as some of the interesting incompatibilities we’ve detected and fixed along the way.
The ecosystem is evolving at a rapid pace – so rapidly, that important developments are often passing through the public attention zone too quickly. Thus, we think it might be helpful to bring you a digest (by no means complete!) of our favorite highlights on a regular basis. (This effort, by the way, has different goals than the fine Hadoop Weekly newsletter, which has a more expansive view – and which you should subscribe to immediately, as far as we’re concerned.)
Find the first installment below. Although the time period reflected here is obviously more than a month long, we have some catching up to do before we can move to a truly monthly cadence.
For those people new to Apache HBase (version 0.90 and later), the configuration of network ports used by the system can be a little overwhelming.
In this blog post, you will learn all the TCP ports used by the different HBase processes and how and why they are used (all in one place) — to help administrators troubleshoot and set up firewall settings, and help new developers how to debug.
A typical HBase cluster has one active master, one or several backup masters, and a list of region servers. The backup masters are standby masters waiting to be the next active one. Before they are active, they do not listen on any ports. (Learn more about how HBase scalability works here.)
This how-to is the third in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 showed you how to insert multiple rows simultaneously using XML and JSON. Part 3 below will show how to get multiple rows using XML and JSON.
Getting Rows with XML
GET verb, you can retrieve a single row or a group of rows based on their row keys. (You can read more about the multiple value URL format here.) Here we are going to use the simple wildcard character or asterisk (*) to get all rows that start with a specific string. In this example, we can load every line of Shakespeare’s comedies with “shakespeare-comedies-*”. This also requires that our row key(s) be laid out by “AUTHOR-WORK-LINENUMBER”.
Here is the code for getting and working with the XML output:
Thanks to Steven Noels, SVP of Products for NGDATA, for the guest post below.
NGDATA builds and sells Lily, the next-generation Customer Intelligence Platform that helps enterprise marketing teams collect and store customer interaction data in order to profile, segment, and present better offers. We designed Lily from the ground up to run on Apache HBase and Apache Solr. Combining these technologies with our deep marketing segmentation expertise and unique machine learning techniques we’re able to deliver interactive data management, real-time statistical calculations, faceted search views of customers, offers, interactions and the permutations they each inspire.
The team at NGDATA has been working since mid-2010 on HBase triggers (or update notifications, if you want), which we use in Lily to sync up Solr with HBase, to make HBase freely searchable, compute indexed views for data exploration and feed our online machine learning engine with customer behavior information. The foundation portion of our platform – the Lily Data Repository, based on the combination of HBase and Solr – is being used by large banks, media companies and pharmaceutical firms who value combing Apache Hadoop’s data storage and parallel data processing framework with ad-hoc search and discovery through Solr.
In Part 1 of this series about Apache HBase snapshots, you learned how to use the new Snapshots feature and a bit of theory behind the implementation. Now, it’s time to dive into the technical details a bit more deeply.
What is a Table?
An HBase table comprises a set of metadata information and a set of key/value pairs:
For those of you who missed the show, session video and presentation slides (as well as photos) will be available via hbasecon.com in a few weeks. (To be notified, follow @cloudera or @ClouderaEng.) Although it’s not quite as good as being there with the rest of the community, you’ll still be able to partake from the real-world experiences of Apache HBase users like Facebook, Box, Yahoo!, Salesforce.com, Pinterest, Twitter, Groupon, and more.
While you’re waiting for that, allow me to bring you just this single photo to capture the HBaseCon experience: