Author Archives: Matteo Bertozzi

New in CDH 5.2: Improvements for Running Multiple Workloads on a Single HBase Cluster

Categories: CDH HBase

These new Apache HBase features in CDH 5.2 make multi-tenant environments easier to manage.

Historically, Apache HBase has treated all tables, users, and workloads with equal weight. This approach is sufficient for a single workload, but when multiple users and workloads share the same cluster or table, conflicts can arise. Fortunately, starting with HBase in CDH 5.2 (HBase 0.98 + backports), workloads and users can now be prioritized.

One can categorize the approaches to this multi-tenancy problem in three ways:

  • Physical isolation or partitioning

Read More

What are HBase znodes?

Categories: General HBase ZooKeeper

Apache ZooKeeper is a client/server system for distributed coordination that exposes an interface similar to a filesystem, where each node (called a znode) may contain data and a set of children. Each znode has a name and can be identified using a filesystem-like path (for example, /root-znode/sub-znode/my-znode).
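To make the data model concrete, here is a minimal in-memory sketch in Python. It is only a toy model, not the ZooKeeper client API: the `Znode` class and the `create` and `get` helpers are hypothetical names introduced for illustration. The sketch also assumes that missing parent znodes are created automatically, which is a simplification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Znode:
    """Toy stand-in for a znode: a data payload plus a set of named children."""
    data: bytes = b""
    children: Dict[str, "Znode"] = field(default_factory=dict)


def _parts(path: str) -> List[str]:
    """Split a filesystem-like path into its znode names."""
    return [name for name in path.split("/") if name]


def create(root: Znode, path: str, data: bytes = b"") -> Znode:
    """Create the znode at `path` and store `data` in it.

    Simplification (assumption of this sketch): missing parents are
    created on the fly with empty data.
    """
    node = root
    for name in _parts(path):
        node = node.children.setdefault(name, Znode())
    node.data = data
    return node


def get(root: Znode, path: str) -> Znode:
    """Resolve a path to a znode; raises KeyError if any component is absent."""
    node = root
    for name in _parts(path):
        node = node.children[name]
    return node


root = Znode()
create(root, "/root-znode/sub-znode/my-znode", b"hello")
create(root, "/root-znode/sub-znode/other-znode")

# A znode can hold data and have children at the same time.
print(get(root, "/root-znode/sub-znode/my-znode").data)
print(sorted(get(root, "/root-znode/sub-znode").children))
```

The point of the sketch is that, unlike a typical filesystem where an entry is either a file or a directory, a single znode can both carry data and act as the parent of other znodes.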

In Apache HBase, ZooKeeper coordinates, communicates, and shares state between the Masters and RegionServers. HBase has a design policy of using ZooKeeper only for transient data (that is,

Read More

Introduction to Apache HBase Snapshots, Part 2: Deeper Dive

Categories: HBase

In Part 1 of this series about Apache HBase snapshots, you learned how to use the new Snapshots feature and a bit of theory behind the implementation. Now, it’s time to dive into the technical details a bit more deeply.

What is a Table?

An HBase table comprises a set of metadata information and a set of key/value pairs:

  • Table Info: A manifest file that describes the table “settings”,

Read More

How Scaling Really Works in Apache HBase

Categories: HBase

This post was originally published elsewhere; we republish it here in a slightly modified form for your convenience:

At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe what tasks are in fact handled by the master and the slaves.

Read More

Introduction to Apache HBase Snapshots

Categories: CDH HBase

The current (4.2) release of CDH — Cloudera’s 100% open-source distribution of Apache Hadoop and related projects (including Apache HBase) — introduced a new HBase feature, recently landed in trunk, that allows an admin to take a snapshot of a specified table.

Prior to CDH 4.2, the only way to back up or clone a table was to use Copy/Export Table or, after disabling the table, to copy all the HFiles in HDFS. Copy/Export Table is a set of tools that uses MapReduce to scan and copy the table, but with a direct impact on RegionServer performance.

Read More