Apache HBase Available in CDH2

One of the more common requests we receive from the community is to package Apache HBase with Cloudera’s Distribution for Apache Hadoop. Lately, I’ve been doing a lot of work on making Cloudera’s packages easy to use, and recently, the HBase team has pitched in to help us deliver compatible HBase packages. We’re pretty excited about this, and we’re looking forward to your feedback. A big thanks to Andrew Purtell, a Senior Architect at TrendMicro and HBase Contributor, for leading this packaging project and providing this guest blog post. -Chad Metcalf

What is HBase?
Apache HBase is an open-source, distributed, column-oriented store modeled after Google’s Bigtable large scale structured data storage system. You can read Google’s Bigtable paper here.

“Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from back end bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.”

HBase extends the publicly shared aspects of the Bigtable architecture and design as described in the Bigtable OSDI’06 paper with community developed improvements and enhancements:

  • Convenient base classes for backing Hadoop MapReduce jobs with HBase tables (a sketch follows this list)
  • Query predicate push down via server side scan and get filters
  • Optimizations for real time queries
  • A high performance Thrift gateway
  • A REST-ful Web service gateway that supports XML, Protobuf, and binary data encoding options
  • Cascading source and sink modules
  • A JRuby-based shell
  • Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
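
For example, the MapReduce integration lets a table scan feed a job directly. The following is a minimal sketch only, assuming the org.apache.hadoop.hbase.mapreduce package that ships with HBase 0.20; the table name "TestTable" and column family "test" match the example table created later in this post, and the class names here are purely illustrative.

// Counts rows of an HBase table from a MapReduce job (sketch only; adjust
// table and family names to your own schema).
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RowCountSketch {
  // TableMapper fixes the input types to (row key, Result) for us.
  static class CountMapper extends TableMapper<Text, LongWritable> {
    protected void map(ImmutableBytesWritable row, Result values, Context context) {
      // One counter increment per row scanned; nothing is written out.
      context.getCounter("RowCountSketch", "ROWS").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "hbase-row-count-sketch");
    job.setJarByClass(RowCountSketch.class);
    Scan scan = new Scan();                 // full table scan
    scan.addFamily(Bytes.toBytes("test"));  // restrict to one column family
    TableMapReduceUtil.initTableMapperJob("TestTable", scan,
        CountMapper.class, Text.class, LongWritable.class, job);
    job.setNumReduceTasks(0);               // map only; the row count appears as a counter
    job.setOutputFormatClass(NullOutputFormat.class);
    job.waitForCompletion(true);
  }
}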

This most recent version of HBase, 0.20.0, has greatly improved on its predecessors:

  • No HBase single point of failure
  • Rolling restart for configuration changes and minor upgrades
  • Generally, one order of magnitude performance improvement for every class of operation
  • Random access performance on par with open source relational databases such as MySQL

We use ZooKeeper as a substitute for Google’s “Chubby” to enable hot failover should a Master node fail. We do other interesting things with ZooKeeper as well, and our roadmap calls for increasingly distributing function across the cluster via emergent behaviors, with no central point of control.

Unfortunately the HDFS NameNode is still a single point of failure in Hadoop 0.20, and HBase depends on HDFS. For more information on mitigating this risk, see this Cloudera blog post on NameNode High Availability.

For more detail, please visit the HBase wiki. Links and references to additional information appear below.

Why Would You Need HBase?

Use HBase when you need fault-tolerant, random, real time read/write access to data stored in HDFS. Use HBase when you need strong data consistency. HBase provides Bigtable-like capabilities on top of Hadoop. HBase’s goal is the hosting of very large tables — billions of rows times millions of columns — atop clusters of commodity hardware.

HBase is an answer for effectively managing terabytes of mutating structured storage on the Hadoop platform at reasonable cost. HBase manages structured data on top of HDFS for you, efficiently using the underlying replicated storage as backing store to gain the benefits of its fault tolerance and data availability and locality. HBase hides the gory details of how one would provide random real time read/write access on top of a filesystem tuned for MapReduce jobs that process terabytes of data, where file block sizes are huge, and where a file can be open for reading or for writing, but not both.

At large scale, traditional relational databases (RDBMSes) fall down. We are considering here big queries, typically range or table scans, and big tables, typically terabytes or petabytes. Such workloads generally exceed the ability of these systems to process them in a timely, cost-effective manner. Managing very large storage with them alone is an expensive proposition, and processing that data incurs other costs in time and productivity. Waits and deadlocks rise nonlinearly, with the square of concurrency and the third power of transaction size. In contrast, HBase table scans run in linear time, and row lookup or update times are logarithmic in the size of the table.

Features of the relational model get in the way as data volumes scale up and analytics get more complex and interesting. Expensive commercial RDBMSes can deliver large storage capacities and can execute some queries over all that data in reasonable time, but only at high dollar cost. Using open source RDBMSes at scale simply requires giving up all relational features (e.g. secondary indexes) for performance. Sharding is a brittle and complex non-solution to these scalability problems; it is something done when there are no better alternatives.

What if we trade relational features for performance since using RDBMSes at scale often requires giving them up anyway? The first casualty of sharding is the normalized schema. We can avoid waits and deadlocks by restricting transaction scope to groups of row mutations only. What if we generalize the data model? Then we can provide transparent horizontal scalability without architectural limits — generic “self-sharding”. We can also provide fault tolerance and data availability by way of the same mechanisms which allow this scalability.

Bigtable and HBase are able to avoid the scalability issues that trouble RDBMSes by eschewing the relational data model. HBase provides something else: it is like a large distributed map, a row-indexed list of tags and values of variable length, bounded by a configuration setting. It is a column-based data store. Keys are arbitrary byte data and are multidimensional: row, column family, optional column qualifier, and timestamp. Columns in HBase are multiversioned: you can store more than one version of a value in a particular row and column, and the timestamp provides an extra dimension of indexability. This can be a particularly useful feature: multiversioning and timestamps avoid edit conflicts caused by concurrent decoupled processes. Rows are stored in byte-lexicographic sorted order.
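
To make the key structure concrete, here is a minimal sketch using the 0.20-era Java client API (HTable, Put, Get); the table name "metrics", family "d", and qualifier "temp" are hypothetical names used only for illustration.

// Stores two versions of the same cell under explicit timestamps, then reads
// them back. Assumes a table 'metrics' with a column family 'd' exists.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionsSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "metrics");
    byte[] row = Bytes.toBytes("sensor42");
    byte[] family = Bytes.toBytes("d");
    byte[] qualifier = Bytes.toBytes("temp");

    // Two writes to the same (row, family, qualifier) cell with explicit
    // timestamps; both versions are retained up to the family's VERSIONS limit.
    Put older = new Put(row);
    older.add(family, qualifier, 1000L, Bytes.toBytes("20.5"));
    table.put(older);
    Put newer = new Put(row);
    newer.add(family, qualifier, 2000L, Bytes.toBytes("21.3"));
    table.put(newer);

    // Reads return only the most recent version by default; ask for more.
    Get get = new Get(row);
    get.addColumn(family, qualifier);
    get.setMaxVersions(3);
    Result result = table.get(get);
    System.out.println(result.size() + " versions stored");  // expect 2
  }
}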

Lexicographically similar values are packed adjacent to one another into blocks in the column stores and are retrieved efficiently together. Column stores may optionally be compressed on disk. Tables are dynamically split into regions, and regions are hosted on a number of region servers. Adding capacity to an HBase cluster is a simple and transparent process: provision another region server, configure it, and start it. Typically, HBase region servers are co-deployed with Hadoop HDFS DataNodes, so the underlying storage capacity grows along with the cluster. As regions grow, they are split and distributed evenly among the storage cluster to level load. Splits are almost instantaneous. A cluster master process manages region assignment for fast recovery and fine grained load balancing. The Master role falls over to spares as necessary for fault tolerance, and the Master rapidly redeploys regions from failed nodes to others. Because the stores are in HDFS, all region servers in the cluster have immediate access to the replicated table data.

When Would You Not Want To Use HBase?

When your data access patterns are largely sequential over immutable data. Use plain MapReduce.

When your data is not large.

When the large overheads of extract-transform-load (ETL) of your data into alternatives such as Hive are not an issue, because you are operating on the data in a purely batch manner and can afford to wait, and some feature of the alternative is simply a must-have.

If you need to make a different trade off between consistency and availability. HBase is a strongly consistent system. HBase regions can be temporarily unavailable during fault recovery. The HBase client API will suspend pending reads and writes until the regions come back online. Perhaps for your use case blocking of any kind for any length of time is intolerable.

If you just can’t live without SQL.

When you really do require normalized schemas or a relational query engine.

This last point could use some additional detail, however. HBase supports random, real time read/write access to your data by way of a single index, but secondary indexes can be emulated by managing additional index tables at the application level. To achieve fast query response times under real world conditions, “Web 2.0” applications often denormalize and replicate and synchronize values in multiple tables anyway. Bigtable’s simpler data model is sufficient for many such use cases and furthermore does not support constructs that can get you into trouble. What you do get is:

  • Fast (logarithmic time) lookup using row key, optional column and column qualifiers for result set filtering and optional timestamp;
  • Full table scans;
  • Range scans, with optional timestamp;
  • Queries for most recent version or N versions;
  • Partial key lookups: When combined with compound keys, these have the same properties as leading left edge indexes with the benefit of a distributed index;
  • Server side filters, a form of query push down (see the sketch after this list).
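
To illustrate the range scan, partial key lookup, and filter push down items above, here is a minimal sketch against the 0.20-era Java client API; the table "weblogs", family "page", and the compound row key scheme are hypothetical, and the sketch assumes the PrefixFilter class from the HBase filter package.

// Scans a slice of a table using a compound row key of the form
// "<domain>/<timestamp>". Sketch only; adjust names to your own schema.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScanSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "weblogs");

    // Range scan: rows are stored in byte-lexicographic order, so a compound
    // key gives partial key (leading left edge) lookups for free.
    Scan scan = new Scan(Bytes.toBytes("example.com/"),
                         Bytes.toBytes("example.com0"));  // stop row is exclusive
    scan.addFamily(Bytes.toBytes("page"));

    // Server side filter: the predicate runs on the region servers, so only
    // matching rows cross the network (query predicate push down).
    scan.setFilter(new PrefixFilter(Bytes.toBytes("example.com/2009")));

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        System.out.println(Bytes.toString(row.getRow()));
      }
    } finally {
      scanner.close();
    }
  }
}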

And, while HBase does not support joining data from multiple tables, you can implement your data workflows using Cascading or a similar higher level construct on top of HBase to recover some relational algebraic operators; or you can simply do “insert time joins” — denormalization, view materialization, and so on.
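
As a concrete illustration of both the application-level index tables mentioned above and an insert time join, the following minimal sketch mirrors each write to a primary table into an index table keyed by the indexed value; the table names "users" and "users_by_email" and the column layout are hypothetical, and the sketch assumes the 0.20-era Java client API.

// Writes a user row and, at insert time, an index row keyed by e-mail address.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexSketch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable users = new HTable(conf, "users");
    HTable byEmail = new HTable(conf, "users_by_email");

    byte[] userId = Bytes.toBytes("user123");
    byte[] email = Bytes.toBytes("jane@example.com");

    // Primary row, keyed by user id.
    Put primary = new Put(userId);
    primary.add(Bytes.toBytes("info"), Bytes.toBytes("email"), email);
    users.put(primary);

    // Index row, keyed by the e-mail address and storing the user id.
    Put index = new Put(email);
    index.add(Bytes.toBytes("info"), Bytes.toBytes("id"), userId);
    byEmail.put(index);
  }
}

With this layout, a lookup by e-mail address costs one get against the index table followed by one get against the primary table, rather than a join.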

How Do You Try Out HBase?

Installing and configuring HBase on CDH2 is fast and easy.

Before we begin, note that HBase requires an available ZooKeeper ensemble. The CDH2 packages for HBase include a ZooKeeper package. You can install and configure it and then point HBase to it, or you can let HBase create and manage a private ZooKeeper ensemble using the bundled ZooKeeper jar. For new users who do not have ZooKeeper already set up, it is easiest to just let HBase take care of it. The instructions below assume this is the case.

Also, let’s consider what is a reasonable test deployment.

Google aims for ~100 regions per region server, and each region is kept to around 200 MB. Large RAM per node and reasonable region counts and sizing mean many tables can be cached and served entirely out of RAM. Bigtable is big because there are hundreds if not thousands of nodes participating. The performance numbers in the Bigtable paper are impressive because of the above. It is cheap (for Google) because they build their own hardware and buy components in bulk.

HBase operates in a different world. Many evaluators or new users expect a lot more for a lot less. They don’t build their own hardware — but could and maybe should — and don’t make bulk purchases. Rather, test deployments of 3 or 4 standard servers are common. Often the hardware is underpowered for the attempted load. Sometimes even smaller deployments, or even virtual machines, are considered, but those do not make sense except as developer tools. While a “pseudo-distributed” configuration for HBase is included in the distribution, a single server deployment is suitable only for very limited testing. We recommend that three servers be considered the minimum test deployment. These can be Amazon EC2 instances, but if so, use c1.xlarge instances.

A reasonable physical server configuration could be:

  • Dual quad core CPU
  • 8 GB RAM or more (4 GB is passable, but constrain MapReduce to only 1 concurrent mapper and reducer per node)
  • 4 x 250 GB data disks attached as JBOD (for the DataNode process)

The reason for the resource demand is simple: The typical deployment combines HDFS, MapReduce, and HBase over all servers uniformly. A good rule of thumb here is that each Hadoop and HBase daemon requires 1 available CPU core and 1 GB of heap. Each mapper or reducer task requires 1 CPU core and 200 MB of heap by default, more if asked for. For HBase to achieve best performance, the region servers must be given sufficient heap to buffer writes and cache blocks for repeated reads. The higher the write load or the larger the working set, the more useful a larger heap allocation will be. Configuring a 2 GB or 4 GB heap for HBase region servers is not uncommon.
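
For example, on an 8 GB node in the uniform deployment described below (DataNode, TaskTracker, ZooKeeper, and region server all co-located), that rule of thumb budgets roughly 1 GB of heap each for the DataNode, TaskTracker, and ZooKeeper, 2 GB for the region server, and about 800 MB for two concurrent mappers and two reducers at the 200 MB default, leaving around 2 GB for the operating system and filesystem cache.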

On to a quick install:

1) On each server, install the core HBase RPMs: hbase, hbase-native, hbase-master, hbase-regionserver, hbase-zookeeper, hbase-conf-pseudo, hbase-docs.

2) On each server, create the cluster configuration and use ‘alternatives’ to enable it.

Create the configuration:

% mkdir /etc/hbase-0.20/conf.my_cluster
% cp /etc/hbase-0.20/conf.pseudo/* /etc/hbase-0.20/conf.my_cluster
% vi /etc/hbase-0.20/conf.my_cluster/hbase-site.xml

Set up the ZooKeeper quorum:

<property>
<name>hbase.zookeeper.quorum</name>
<value>host1,host2,host3</value>
</property>

Point HBase root to a folder to be created in HDFS:

<property>
<name>hbase.rootdir</name>
<value>hdfs://namenode:nnport/hbase</value>
</property>

Note: Do not create this folder yourself.

Enable distributed operation:

<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>

On small clusters reduce DFS replication to speed writes:

<property>
<name>dfs.replication</name>
<value>2</value>
</property>

Use the new configuration:

% alternatives --install /etc/hbase-0.20/conf hbase-0.20-conf \
/etc/hbase-0.20/conf.my_cluster 50

3) Bring Hadoop HDFS up as you would normally.

4) On all cluster nodes, start ZooKeeper:

% service hbase-zookeeper start

5) On the designated master, start the master process:

% service hbase-master start

6) On the designated backup master, start another master process:

% service hbase-master start

7) On the designated slaves, start the region server processes:

% service hbase-regionserver start

8) Anywhere on the cluster, launch the HBase shell and create a table:

% su - hadoop
% hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands. Version: 0.20.0~1-1.cloudera

hbase(main):001:0> create 'TestTable', {NAME=>'test'}

0 row(s) in 3.4460 seconds

hbase(main):002:0>

9) Somewhere else on the cluster, launch the HBase shell and describe your new table:

% su - hadoop
% hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands. Version: 0.20.0~1-1.cloudera

hbase(main):001:0> describe 'TestTable'

DESCRIPTION                                                    ENABLED
{NAME => 'TestTable', FAMILIES => [{NAME => 'test',            true
COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE =>
'true'}]}
1 row(s) in 1.6820 seconds

You are now ready to try out your new HBase installation!
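
If you would rather exercise the new table from the Java client API than from the shell, here is a minimal sketch that writes and reads back one cell in TestTable; it assumes the 0.20-era client classes (HTable, Put, Get) and that your conf.my_cluster directory is on the classpath so the client can find the ZooKeeper quorum.

// Writes one cell to TestTable and reads it back.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TestTableSketch {
  public static void main(String[] args) throws Exception {
    // new HBaseConfiguration() picks up hbase-site.xml from the classpath.
    HTable table = new HTable(new HBaseConfiguration(), "TestTable");

    // Write one cell into the 'test' column family created in step 8.
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("test"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"));
    table.put(put);

    // Read it back.
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("test"), Bytes.toBytes("greeting"))));
  }
}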

For More Information

Visit the HBase Website and Wiki.

Users and supporting projects.

HBase Roadmap

HBase Mailing List

IRC Channel: #hbase on Freenode

Committers and core contributors are here on a regular basis. More active than the Hadoop forums!

Follow us on Twitter: @hbase

9 Responses
  • Arie / September 30, 2009 / 1:38 AM

    Thank you, Cloudera!

  • Salvador Fuentes / October 31, 2009 / 10:18 PM

    Are there no HBase apt packages at this point? I can’t find them in the Cloudera repos as of today. I can’t see the RPMs either.

  • Something / November 18, 2009 / 4:41 PM

    I would like to install HBase on Amazon EBS. If I follow these instructions, would they work on EBS? I am new to EBS, so I apologize in advance, but do I have to get (say 4) instances for myself before trying this out? Any pointers in using CDH2 on EBS would be greatly appreciated. Thanks.

  • Tom White / November 20, 2009 / 5:49 PM

    To use these instructions with EBS volumes you would have to manually attach the EBS volumes to the EC2 instances that you are installing on.

    For more on using CDH2 on EC2 with EBS see http://archive.cloudera.com/docs/ebs.html.

    Cheers,
    Tom

  • Edward Capriolo / December 01, 2009 / 4:23 PM

    As always, excellent job on both the packaging and the tutorial. I will again compliment Cloudera on the great packaging of Hadoop and related components. Good use of the standard layout, the alternatives system is nice, init scripts, etc. This really cut down the setup time; I cannot say enough about that.

  • Denis / March 23, 2010 / 10:51 AM

    Are the hbase packages available in a cloudera repository? They don’t seem to be in intrepid-testing or -stable. Suggestions? Thanks…

  • William McVey / May 27, 2010 / 7:15 AM

    After struggling through this for longer than I care to admit, and manually reviewing the deb repositories, as of May 27, 2010 I’m pretty confident that HBase is *NOT* in CDH2 on the Debian/Ubuntu platform repositories. It’s a major shame because I saw this blog posting early on and an integrated HBase/Hadoop cluster tied to stable deb packaging was a major factor in choosing the Cloudera distro.
