How Scaling Really Works in Apache HBase
This post was originally published via blogs.apache.org; we republish it here in slightly modified form for your convenience:
At first glance, the Apache HBase architecture appears to follow a master/slave model in which the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe which tasks are in fact handled by the master and which by the slaves.
Regions and Region Servers
HBase is the Hadoop storage manager that provides low-latency random reads and writes on top of HDFS, and it can handle petabytes of data. One of the interesting capabilities in HBase is auto-sharding, which simply means that tables are dynamically distributed by the system when they become too large.
The basic unit of horizontal scalability in HBase is called a Region. Regions are a subset of the table’s data and they are essentially a contiguous, sorted range of rows that are stored together.
Initially, there is only one region per table. When a region becomes too large as more rows are added, it is split in two at the middle key, creating two roughly equal halves.
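As a back-of-the-envelope illustration of this splitting behavior, here is a minimal sketch in Python (not HBase code; the threshold and key format are made up): a region is modeled as a sorted list of row keys, and whenever it outgrows a threshold it is split in two at the middle key.

```python
# Toy model of auto-sharding: a "region" is a contiguous, sorted range
# of row keys; a region that grows past a threshold splits at the middle key.

MAX_REGION_SIZE = 4  # hypothetical split threshold, in rows


def split_region(rows):
    """Split a sorted list of row keys into two halves at the middle key."""
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]


def add_row(regions, key):
    """Insert a key into the region whose range covers it, splitting if needed."""
    for i, region in enumerate(regions):
        last = (i == len(regions) - 1)
        if last or key < regions[i + 1][0]:
            region.append(key)
            region.sort()
            if len(region) > MAX_REGION_SIZE:
                left, right = split_region(region)
                regions[i:i + 1] = [left, right]  # replace one region with two
            return


regions = [[]]  # initially a single region holding the whole key space
for k in ["row-%02d" % n for n in range(10)]:
    add_row(regions, k)
```

After loading ten rows, the single initial region has split several times, and concatenating the regions in order reproduces the full sorted key range, which is what makes each region a contiguous shard.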
In HBase the slaves are called Region Servers. Each Region Server is responsible for serving a set of regions, and one Region (i.e. a range of rows) can be served by only one Region Server.
The HBase architecture has two main services: the HMaster, which coordinates the cluster and executes administrative operations, and the HRegionServer, which handles a subset of the table’s data.
HMaster, Region Assignment, and Balancing
As previously mentioned, the HBase Master coordinates the HBase Cluster and is responsible for administrative operations.
A Region Server can serve one or more Regions. Each Region is assigned to a Region Server on startup, and the Master can decide to move a Region from one Region Server to another as the result of load balancing. The Master also handles Region Server failures by reassigning the failed server’s Regions to other Region Servers.
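The Master’s bookkeeping role can be sketched as a toy model (illustrative Python with made-up names; the real balancer is far more sophisticated than round-robin): regions are assigned to servers, and when a server fails its regions are handed to the survivors.

```python
# Toy model of the Master's assignment and failover duties (not HBase code).


def assign_regions(regions, servers):
    """Round-robin initial assignment, a stand-in for the real balancer."""
    return {region: servers[i % len(servers)] for i, region in enumerate(regions)}


def handle_server_failure(assignment, dead_server, live_servers):
    """Reassign every region of a failed server to the remaining servers."""
    orphaned = [r for r, s in assignment.items() if s == dead_server]
    for i, region in enumerate(orphaned):
        assignment[region] = live_servers[i % len(live_servers)]
    return assignment


assignment = assign_regions(["region-a", "region-b", "region-c", "region-d"],
                            ["rs1", "rs2"])
# rs1 fails: the Master moves its regions onto the surviving server
assignment = handle_server_failure(assignment, "rs1", ["rs2"])
```

Note that the Master only updates the mapping; it never touches the region data itself, which stays in HDFS and is simply reopened by the new server.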
The mapping of Regions to Region Servers is kept in a system table called META. By reading META, you can identify which Region is responsible for your key. This means that for read and write operations the Master is not involved at all, and clients can go directly to the Region Server responsible for serving the requested data.
Locating a Row-Key: Which Region Server is Responsible?
To put or get a row, clients don’t have to contact the master: they can directly contact the Region Server that handles the specified row or, in the case of a scan, the set of Region Servers responsible for the requested range of keys.
To identify the Region Server, the client queries the META table.
META is a system table used to keep track of Regions. It contains the server name and a Region identifier comprising the table name and the start row-key. By comparing a Region’s start-key with the next Region’s start-key, clients can identify the range of rows contained in a particular Region.
The client keeps a cache of region locations, which saves it from hitting the META table every time an operation is issued on the same Region. If a Region is split or moved to another Region Server (due to balancing or assignment policies), the client receives an exception in response and refreshes the cache by fetching the updated information from the META table.
Since META is a table like any other, the client first has to identify which server META is located on. The location of META is stored in a ZooKeeper node when the Master assigns it, and the client reads the node directly to get the address of the Region Server that contains META.
HBase’s original design followed BigTable’s, with another table called -ROOT- containing the META locations and Apache ZooKeeper pointing to -ROOT-. HBase 0.96 removed that arrangement in favor of ZooKeeper only, since META cannot be split and therefore consists of a single region.
Client API: Master and Regions Responsibilities
The HBase Java client API has two main interfaces:
HBaseAdmin allows interaction with the “table schema” by creating/deleting/modifying tables, and with the cluster by assigning/unassigning regions, merging regions together, calling for a flush, and so on. This interface communicates with the Master.
HTable allows the client to manipulate the data of a specified table by using get, put, delete, and all the other data operations. This interface communicates directly with the Region Servers responsible for handling the requested set of keys.
These two interfaces have separate responsibilities: HBaseAdmin is used only to execute admin operations and communicates with the Master, while HTable is used to manipulate data and communicates directly with the Region Servers.
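This separation of responsibilities can be caricatured in a few lines (a toy Python model with invented class names, not the real Java API): the admin object holds a reference only to the master, while the table object holds a reference only to a region server, so no data operation can ever pass through the master.

```python
# Toy model of the split client responsibilities (illustrative only).


class ToyMaster:
    """Stands in for the HMaster: schema and cluster operations."""
    def __init__(self):
        self.tables = set()

    def create_table(self, name):
        self.tables.add(name)


class ToyRegionServer:
    """Stands in for an HRegionServer: serves the actual row data."""
    def __init__(self):
        self.store = {}

    def put(self, row, value):
        self.store[row] = value

    def get(self, row):
        return self.store.get(row)


class ToyAdmin:
    """Admin operations: talks to the Master only (like HBaseAdmin)."""
    def __init__(self, master):
        self.master = master

    def create_table(self, name):
        self.master.create_table(name)


class ToyTable:
    """Data operations: talks to Region Servers only (like HTable)."""
    def __init__(self, region_server):
        self.rs = region_server

    def put(self, row, value):
        self.rs.put(row, value)

    def get(self, row):
        return self.rs.get(row)


master, rs = ToyMaster(), ToyRegionServer()
ToyAdmin(master).create_table("users")  # admin path: goes through the Master
table = ToyTable(rs)
table.put("row1", "hello")              # data path: the Master is never involved
```

Because `ToyTable` never holds a reference to the master, the data path keeps working even if the master object disappears, mirroring the point made below about master availability.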
As we’ve seen here, having a master/slave architecture does not mean that each operation goes through the master. To read and write data, the HBase client goes directly to the specific Region Server responsible for the requested row keys (HTable). The Master is used by the client only for table creation, modification, and deletion operations (HBaseAdmin).
Although the concept of a Master exists, the HBase client does not depend on it for data operations, and the cluster can keep serving data even if the Master goes down.
Matteo Bertozzi is a Software Engineer on the Platform team and an HBase Committer.
If you’re interested in HBase, be sure to register for HBaseCon 2013 (June 13, San Francisco) – THE community event for HBase contributors, developers, admins, and users. Space is limited!