Flavio Junqueira (PMC Chair of the Apache ZooKeeper project and a member of the Systems and Networking Group at Microsoft Research) and Benjamin Reed (PMC Member and Software Engineer at Facebook) are the co-authors of the new O’Reilly Media book ZooKeeper: Distributed Process Coordination. We had a chat with Flavio and Ben recently about the rationale for writing the book, and what it will add to the distributed systems conversation.
Why did you decide to write this book?
It seemed like a natural step. Although ZooKeeper has been around for a while, there was no good starting point for beginners or even a comprehensive reference for advanced developers. The online documentation is OK, but it lacks depth and is a bit scattered. Also, in some sense, ZooKeeper is a little bit too easy to jump in and use. We found that people were jumping in without fully understanding what they needed to watch out for, ZooKeeper’s limitations, and how to deal with some failure scenarios.
We had been thinking about writing a book for a while, and last year, we decided to move forward with the idea after being contacted by O’Reilly.
Who is your intended reader?
We primarily focus on programming with ZooKeeper, so the book targets new and advanced developers. It also has material for sysadmins. The two kinds of content are not incompatible, though, and it is probably a good thing that developers are also aware of administration issues and possibilities. In fact, we cover the internals of ZooKeeper in the part about administration, but we think that this content is valuable for anyone working with ZooKeeper.
When covering the internals, we didn’t write with distributed computing researchers in mind; quite the opposite. We tried to explain the internals in a way that is accessible to the average developer. We also give code references when possible to encourage readers to look into the code and possibly even consider contributing back. We are always looking to grow the community.
What are your favorite things about ZooKeeper that you want people to know?
One of the most interesting aspects of ZooKeeper is that it implements a simple API on top of a difficult core, which makes it possible for most developers to ignore all the deep distributed computing concepts it is implementing internally. Replication protocols like Zab, the one in ZooKeeper, are conceptually simple, but when it comes to implementing them, there are a number of subtle points that can cause headaches. In ZooKeeper, I think we have been able to abstract that away from the developer nicely. As the book says, we didn’t completely hide distributed systems issues from the developer, but we made it easier for developers to reason about them.
In your research, what did you learn that you did not already know?
There are two important lessons here, one about protocols and another about everything else around Zab.
We made a few interesting observations about Paxos when contrasting it with Zab, like problems you could run into if you just implemented Paxos alone. Not that Paxos is broken or anything, just that in our setting, there were some properties it was not giving us. Some people still like to map Zab to Paxos, and they are not completely off, but the way we see it, Zab matches a service like ZooKeeper well.
The reconfiguration work (ZOOKEEPER-107), which is not yet available in a release, was a challenge to include, mostly because we introduced it so recently. It was a big patch that touched many parts, even if the concept is not very complex. In retrospect, we should have considered it earlier in the project.
The other important lesson is related to what developers see. It is a challenge in itself to make a service built on top of Zab, Paxos, and so on easy to use; a simple API for ZooKeeper made a huge difference for its adoption. Having a pretty cool protocol is not enough if the API does not make it easy to program against the service. Of course, performance is also a concern, and we had to make sure that the system is up to the task for our users. An API with simple operations like ZooKeeper’s helps deliver high throughput and low latency because no operation really blocks the ZooKeeper pipeline for too long.
What are some other things that the ZooKeeper community can do to help make distributed systems easier to build?
We call ZooKeeper a “coordination system” because a lot of the coordination tasks we need to implement for a distributed system can build on ZooKeeper. Such tasks typically need some processes to agree upon something, which is often hard to implement from scratch because of all the corner cases of things failing and getting disconnected. ZooKeeper makes it easier to deal with these situations.
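To make the idea of processes agreeing on something concrete, here is a minimal sketch of leader election in Python. It does not use the ZooKeeper client API; `ToyCoordinator` and its methods are hypothetical names for an in-memory stand-in that only mimics the idea of ephemeral, sequentially numbered entries that vanish when their owner's session ends. The rule it assumes is that the candidate with the lowest sequence number leads.

```python
from typing import Dict, Optional


class ToyCoordinator:
    """In-memory stand-in for a coordination service.

    Hypothetical illustration only; this is not the ZooKeeper API.
    """

    def __init__(self) -> None:
        self._next_seq = 0
        self._nodes: Dict[int, str] = {}  # sequence number -> owner name

    def join(self, owner: str) -> int:
        """Register a candidate, similar in spirit to an ephemeral sequential node."""
        seq = self._next_seq
        self._next_seq += 1
        self._nodes[seq] = owner
        return seq

    def session_expired(self, owner: str) -> None:
        """Remove every entry owned by a failed or disconnected process."""
        self._nodes = {s: o for s, o in self._nodes.items() if o != owner}

    def leader(self) -> Optional[str]:
        """Return the owner of the lowest sequence number, or None if nobody is registered."""
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]


coord = ToyCoordinator()
for name in ("worker-a", "worker-b", "worker-c"):
    coord.join(name)

first_leader = coord.leader()
coord.session_expired(first_leader)  # simulate the current leader crashing
second_leader = coord.leader()
print(first_leader, second_leader)
```

In a real deployment the hard part is exactly what this toy leaves out: detecting failures, handling disconnections, and keeping every process's view consistent, which is the work a coordination service takes on.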
Not everything in a distributed system is directly about coordination, though. For example, reliable messaging and recoverability are present in many systems, and we have worked on Hedwig and BookKeeper, respectively, to deal with them. They are independent components built on ZooKeeper that make it easier to build other distributed systems. Similar specialized components that possibly build on ZooKeeper would definitely make the task of building distributed systems easier. Apache Curator and similar APIs deserve a shout-out here. We always meant for ZooKeeper to be a generic building block that could be used to implement richer client APIs.
Regarding coordination, a few things in ZooKeeper could be improved. It is not uncommon to come across use cases that run across data centers, and ZooKeeper has some features (like observers) that help with such cases. Still, the overall design of ZooKeeper targets a homogeneous set of servers running in a single data center. We can probably do a better job for coordination across data centers with a different design. Also, with new storage technologies (solid state drives, non-volatile RAM, and so on), we could provide backends optimized for them. It is not that ZooKeeper is not going to run if you have SSDs, but the writes to disk are not really optimized for such drives. It would be nice to have better support for these technologies.
These two points are just to illustrate that we often come across different requirements and scenarios on the mailing list, so interacting with the community is a good way to get ideas for the Next Cool Thing.