CDH3 beta 2 is the first version of CDH to incorporate Apache ZooKeeper. ZooKeeper is a highly reliable and available coordination service for distributed processes. It is a proven technology and a well-established open source project at Apache (a sub-project of Hadoop).
ZooKeeper is distributed coordination
Distributed applications often need some way to coordinate across processes: locking resources, managing queues of events, electing a “leader” process, managing configuration, and so on. Coordination operations such as these are notoriously hard to get right. ZooKeeper provides a relatively simple API that allows clients to implement these and many other coordination mechanisms correctly.
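To make the “electing a leader” case a little more concrete, here is a toy, in-memory sketch of the idea behind the leader election recipe described in the ZooKeeper documentation: every candidate registers under an ever-increasing sequence number, the candidate holding the lowest number is the leader, and when a candidate goes away the next lowest takes over. Note that this is not the ZooKeeper API; the class ToyElection and its methods are hypothetical names made up for illustration, and a real implementation would rely on ZooKeeper itself to assign the numbers and to notice when a client disappears.

```python
import itertools


class ToyElection:
    """In-memory stand-in for a lowest-sequence-number-wins election.

    Hypothetical illustration only: nothing here talks to a ZooKeeper server.
    """

    def __init__(self):
        self._counter = itertools.count()  # ever-increasing sequence numbers
        self._candidates = {}              # candidate name -> sequence number

    def join(self, name):
        """Register a candidate and return the sequence number it was given."""
        if name in self._candidates:
            raise ValueError(f"{name} has already joined")
        seq = next(self._counter)
        self._candidates[name] = seq
        return seq

    def leave(self, name):
        """Remove a candidate, for example because its process died."""
        self._candidates.pop(name, None)

    def leader(self):
        """Return the candidate with the lowest sequence number, or None."""
        if not self._candidates:
            return None
        return min(self._candidates, key=self._candidates.get)


election = ToyElection()
for worker in ("worker-a", "worker-b", "worker-c"):
    election.join(worker)

print(election.leader())    # worker-a joined first, so it leads
election.leave("worker-a")  # simulate the leader going away
print(election.leader())    # worker-b takes over
```

The point of the sketch is only that the rule is simple once something trustworthy hands out the numbers and tracks membership; providing that trustworthy “something” across machines is the hard part that ZooKeeper takes care of.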
ZooKeeper is itself a replicated service based on a quorum algorithm. One or more ZooKeeper servers form what’s called an ensemble, whose members are in constant communication. As the size of the ensemble increases, the reliability of the service itself increases: as long as a majority of the configured ensemble servers are available, the service is available. As an example, say you have an ensemble of size three (three ZooKeeper servers). If one of the three fails, the service is still up. If two of the three fail, the service is down. One could run with five servers, in which case the service as a whole would still be available if two servers fail. Seven-server ensembles can survive three failures, and so on.
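The majority rule above boils down to a little arithmetic, sketched below. The function names (quorum_size, tolerated_failures, is_available) are my own for illustration and are not part of ZooKeeper; the only assumption is the strict-majority rule just described.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still running."""
    return ensemble_size - quorum_size(ensemble_size)


def is_available(ensemble_size, failed):
    """True if the servers still running form a majority of the ensemble."""
    return ensemble_size - failed >= quorum_size(ensemble_size)


for n in (3, 4, 5, 7):
    print(f"{n} servers: quorum of {quorum_size(n)}, "
          f"survives {tolerated_failures(n)} failure(s)")
```

Running this gives one tolerated failure for three servers, two for five and three for seven, matching the examples above. It also shows that a four-server ensemble survives only one failure, the same as three servers, which is why ensembles are usually sized with an odd number of servers.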
Who should be interested in the ZooKeeper project? (I say “project/us” here because the team and community are just as important as the software, if not more so.) Well, developers appreciate us because we make it simple to solve some very difficult distributed communication problems. Operations teams like us because we ensure that they only need to learn, operate, and maintain a single, sane coordination mechanism that’s easy to manage. Business folks like the fact that we are a proven technology that helps ensure high availability, allowing the development and ops teams to focus on domain-specific problems.
Powered By ZooKeeper
Yahoo!, Facebook, Twitter, Digg, Rackspace, and a number of other companies are making use of ZooKeeper in their production environments today. Additionally, technologies such as HBase, Solr, Katta, and Neo4j have all increased their reliability and availability and extended their capabilities by adopting ZooKeeper.
Here at Cloudera we recently open-sourced Flume, a distributed, real-time event collection service that is also part of CDH3 beta 2. Flume makes use of ZooKeeper to coordinate its various distributed components, for example to store and manage dynamically updated configuration information. You can find out more about Flume here.
ZooKeeper and CDH
A significant amount of work has gone into ZooKeeper integration with CDH3. In particular, the Cloudera team ensures that the CDH3 components relying on ZooKeeper (HBase and Flume) are fully compatible and will work together. CDH-based ZooKeeper packages (tar, RPM, and DEB files), containing libraries as well as startup scripts for running ZooKeeper as a service, are available on the Cloudera website.
Cloudera is an active member of the Hadoop and ZooKeeper communities – we have two active committers working on ZooKeeper, myself and Henry Robinson. Henry recently contributed a major new feature called “observers”, which greatly extends ZooKeeper’s read scalability. Henry also created and maintains the popular ZooKeeper Python client binding.
In addition to leading the ZooKeeper project at Apache, I’ve personally been working on a number of projects for upcoming releases; I’m currently adding transport-level security (encryption and authentication of communications) to the service. The team is constantly on the lookout for other areas where the technology may be applied. As I mentioned, HBase and Flume are currently using ZooKeeper, and we hope to extend this further, in particular around the idea of centralized configuration and monitoring for Hadoop. There are just too many configuration files floating around when setting up a Hadoop-based service, and ZooKeeper seems like a perfect fit for managing them. A great example of this kind of use today is LinkedIn’s Norbert project, where ZooKeeper is used to maintain and manage cluster metadata.
Join the ZooKeeper community
Find out more about ZooKeeper on the official Apache project pages. There you’ll also find links to documentation, the user and developer mailing lists, issue tracking, and more. We welcome new users and contributors. You might also follow me on Twitter, where I frequently post on community-related issues.