Cloudera Engineering Blog · ZooKeeper Posts

Hadoop/HBase Capacity Planning

Apache Hadoop and Apache HBase are gaining popularity due to their flexibility and tremendous work that has been done to simplify their installation and use.  This blog is to provide guidance in sizing your first Hadoop/HBase cluster.  First, there are significant differences in Hadoop and HBase usage.  Hadoop MapReduce is primarily an analytic tool to run analytic and data extraction queries over all of your data, or at least a significant portion of them (data is a plural of datum).  HBase is much better for real-time read/write/modify access to tabular data.  Both applications are designed for high concurrency and large data sizes.  For a general discussions about Hadoop/HBase architecture and differences please refer to Cloudera, Inc. [,], or Lars George blogs [].  We expect a new edition of the Tom White’s Hadoop book [] and a new HBase book in the near future as well.

Migrating to CDH

With the recent release of CDH3b2, many users are more interested than ever to try out Cloudera’s Distribution for Hadoop (CDH). One of the questions we often hear is, “what does it take to migrate?”.

Why Migrate?

If you’re not familiar with CDH3b2, here’s what you need to know.

What’s New in CDH3b2: ZooKeeper

CDH3 beta 2 is the first version of CDH to incorporate Apache ZooKeeper. ZooKeeper is a highly reliable and available coordination service for distributed processes. It is a proven technology and a well established open source project at Apache (sub-project of Hadoop).

ZooKeeper is distributed coordination

Often distributed applications need some way to coordinate across processes; locking resources, managing queues of events, electing a “leader” process, configuration, etc… Coordination operations such as these are notoriously hard to get right. ZooKeeper provides a relatively simple API which allows clients to correctly implement these and many other coordination mechanisms.

Building a distributed concurrent queue with Apache ZooKeeper

In my first few weeks here at Cloudera, I’ve been tasked with helping out with the Apache ZooKeeper system, part of the umbrella Hadoop project. ZooKeeper is a system for coordinating distributed processes. In a distributed environment, getting processes to act in any kind of synchrony is an extremely hard problem. For example, simply having a set of processes wait until they’ve all reached the same point in their execution – a kind of distributed barrier – is surprisingly difficult to do correctly. ZooKeeper offers an API to facilitate this sort of distributed coordination. For example, it is often used to serve locks to client processes – locks are just another kind of coordination primitive – in the form of small files that ZooKeeper tracks.

In order to be useful, ZooKeeper must be both highly reliable and available as systems will rely upon it as a critical component. For example, if locks cannot be taken, processes cannot make progress and the whole system will grind to a halt. ZooKeeper is built on a suite of reliable distributed systems techniques and protocols, and is typically run on a cluster of machines so that if some should fail, the remaining ones can continue to provide service. Under the hood, ZooKeeper is responsible for ordering calls made by clients so that each request is processed atomically and in a fixed and firm order.

Newer Posts