As readers of our previous post on the subject will recall, ZooKeeper is a distributed coordination service suitable for implementing coordination primitives like locks and concurrent queues. One of ZooKeeper’s great strengths is its ability to operate at scale. Clusters of only five or seven machines can often serve the coordination needs of several large applications.
We’ve recently added a major new feature to ZooKeeper to improve its scalability even further – a new type of server called Observers. In this blog post, I want to motivate the need for such a feature, and explain how it might help your deployments scale even better. Scalability means many things to many people – here I mean that a system is scalable if we can increase the workload the system can handle by assigning more resources to the system – a non-scalable system might see no performance improvement, or even a degradation as the workload increases.
To understand why Observers have an effect on ZooKeeper’s scalability, we need to understand a little about how the service works. Broadly speaking, every operation on a ZooKeeper cluster is either a read or a write operation. ZooKeeper makes sure that all reads and all writes are observed by every client of the system in exactly the same order, so that there’s no confusion about which operation happened first.
Along with this strong consistency guarantee, ZooKeeper also promises high availability, which can loosely be interpreted to mean that it can withstand a significant number of machine failures before the service stops being available to clients. ZooKeeper achieves this availability in a traditional way – by replicating the data that is being written and read amongst a small number of machines so that if one fails, there are others ready to take over without the client being any wiser.
However, these two properties – consistency and availability – are hard to achieve together, because ZooKeeper must make sure that every replica in its cluster agrees on the series of read and write operations. It does this by using a consensus protocol. Simplifying greatly, this protocol operates by having a designated Leader propose a new operation to all the other servers, which vote and respond back to the Leader. Once the Leader has gathered votes from more than half of the servers, it can deduce that the vote has passed, and sends a further message telling the servers to go ahead and commit the operation to memory.
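As a rough sketch of the quorum rule at the heart of this protocol (a toy model, not ZooKeeper’s actual Zab implementation – the function name is purely illustrative):

```python
# Toy model of the Leader's majority-vote step (illustrative only --
# real ZooKeeper uses the Zab atomic broadcast protocol).

def vote_passes(ensemble_size: int, acks_received: int) -> bool:
    """The Leader commits once more than half of the ensemble
    (counting its own implicit ack) has acknowledged the proposal."""
    return acks_received > ensemble_size // 2

# In a 5-server ensemble, 3 acks form a quorum...
print(vote_passes(5, 3))  # True
# ...but 2 acks do not, so the operation cannot yet be committed.
print(vote_passes(5, 2))  # False
```

The key point is that only a majority, not every server, needs to respond before the operation can be committed.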
This data flow is illustrated, from start to finish as the client sees it, in the diagram below. The client proposes a value to the server it is connected to. The server then relays that to the Leader, which initiates the consensus protocol, and once the original server has heard from the Leader it can relay its answer back to the client.
The need for Observers arises from the observation (no pun intended!) that ZooKeeper servers play two roles in this protocol. They accept connections and operation requests from clients, and they also vote on the result of these operations. These two responsibilities stand in opposition to each other when it comes to scaling ZooKeeper. If we wish to increase the number of clients attached to a ZooKeeper cluster (and we are often considering the case with 10,000 or more clients), then we have to increase the number of servers available to support those clients. However, we can see from the description of the consensus protocol that increasing the number of servers puts pressure on the voting part of the protocol. The Leader has to wait for at least half of the machines in the cluster to respond with a vote. The chance that one of these machines is running slowly and holds up the entire voting process therefore grows, and the performance of the voting step decreases commensurately. This is something that we have seen in practice – as the size of the ZooKeeper cluster gets bigger, the throughput of voting operations goes down.
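A little arithmetic makes the problem concrete: the quorum the Leader must wait for grows with the ensemble (an illustrative sketch; the helper name is mine, not ZooKeeper’s):

```python
# Why larger voting ensembles vote more slowly: the quorum the Leader
# must collect grows with the ensemble size (illustrative arithmetic).

def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that is more than half the ensemble."""
    return ensemble_size // 2 + 1

for n in (3, 5, 7, 13, 21):
    print(n, quorum_size(n))
# A 21-server ensemble must collect 11 acks for every write, so the
# odds that at least one slow server delays the quorum rise with n.
```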
So there is a tension between our desire to scale the number of clients and our desire to keep write throughput reasonable. To resolve this tension, we introduced non-voting servers, called Observers, to the cluster. Observers can accept client connections and forward write requests to the Leader, but the Leader does not ask them to vote. Observers take no part in the voting process; they are simply informed of the result of the vote in Step 3, along with all the other servers.
This simple extension opens up new vistas of scalability to ZooKeeper. We may now add as many Observers as we like to the cluster without dramatically affecting write throughput. The scaling is not absolutely perfect – there is one step in the protocol (the ‘inform’ step) that is linear in the number of servers to inform, but the serial overhead of this step is extremely low. We would expect to hit other bottlenecks in the system before the cost of sending an inform packet to every server dominates the throughput performance of a ZooKeeper cluster.
Figure 2 shows the results of one microbenchmark. The vertical axis measures the number of synchronous write operations per second that I was able to issue from a single client (a fully tuned ZooKeeper installation can sustain significantly more operations per second – it’s the relative size of the bars we’re interested in here). The horizontal axis denotes the size of the ZooKeeper cluster used. The blue bars are ZooKeeper clusters where every server is a voting server, while the green bars are ZooKeeper clusters where all but three servers are Observers. The chart shows that write performance stays approximately constant as we scale out the number of Observers, but falls off dramatically if we expand the size of the voting cluster. This is a win for Observers!
Observers scale read performance too
Scaling the number of clients is an important use case for Observers, but in fact there are other significant advantages to having them in your cluster.
As an optimisation, ZooKeeper servers may serve read requests out of their local data stores, without going through the voting process. This puts read requests at a very slight risk of a ‘time-travel’ read, where an earlier value is read after a later value; but this only happens when a server fails. In that case, a client may issue a ‘sync’ request, which ensures that the next value it reads is fully up-to-date.
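A toy model may make the ‘time-travel’ risk and the sync remedy concrete (illustrative only – this is not ZooKeeper’s actual implementation, and the class and method names are invented for the example):

```python
# Toy illustration of a stale local read and a 'sync' that forces the
# replica to catch up (not ZooKeeper's real implementation).

class Replica:
    def __init__(self, log):
        self.log = log          # the committed, totally ordered write log
        self.applied = 0        # how far this replica has applied the log
        self.value = None

    def apply_next(self):
        self.value = self.log[self.applied]
        self.applied += 1

    def read(self):
        # Local reads skip the voting protocol: fast, but possibly stale.
        return self.value

    def sync(self):
        # Catch up with everything committed so far before the next read.
        while self.applied < len(self.log):
            self.apply_next()

log = ["v1", "v2", "v3"]        # writes already committed by the quorum
replica = Replica(log)
replica.apply_next()            # this replica has only applied "v1"
print(replica.read())           # "v1" -- a stale read
replica.sync()
print(replica.read())           # "v3" -- up to date after sync
```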
Therefore Observers are a big performance improvement for read-heavy workloads. Writes go through the standard voting path, and so, by the same argument as for client scalability, increasing the number of voting servers in order to serve more reads would have a detrimental effect on write performance. Observers allow us to decouple read performance from write performance. This meshes well with many use cases for ZooKeeper, where most clients issue few writes but many reads.
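To see the decoupling concretely, here is a back-of-the-envelope model (the per-server throughput figure is made up purely for illustration):

```python
# Illustrative model: reads are served locally by every server, so
# aggregate read capacity grows with the number of Observers, while
# write capacity depends only on the fixed voting ensemble.

READS_PER_SERVER = 10_000   # hypothetical per-server read throughput

def read_capacity(voters: int, observers: int) -> int:
    return (voters + observers) * READS_PER_SERVER

# Growing a 3-voter cluster with Observers multiplies read capacity
# without enlarging the quorum that every write must wait for.
print(read_capacity(3, 0))    # 30000
print(read_capacity(3, 7))    # 100000
```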
Observers enable WAN configurations
There’s yet more that Observers can do for you. Observers are excellent candidates for connecting clients to ZooKeeper across wide-area networks, for three main reasons. In order to get good read performance, it is necessary to have your clients relatively near to a server so that round-trip latencies aren’t too high. However, splitting a ZooKeeper cluster between two datacenters is a very problematic design, because ZooKeeper works best when the voting servers are able to communicate with each other at low latency – otherwise we get the slowdown problem I described earlier.
Observers, on the other hand, can be placed in every datacenter that needs to access the ZooKeeper cluster. The voting protocol then never has to cross a high-latency inter-datacenter link, and performance is improved. Also, two fewer messages are sent between an Observer and the Leader during the voting process than between a voting server and the Leader. This helps ease bandwidth requirements for write-heavy workloads from remote datacenters.
Finally, since Observers can fail without affecting the voting cluster itself, there’s no risk to the availability of the service if the link between datacenters is severed. This is much more likely than the loss of internal rack-to-rack connections, so it is beneficial not to rely on such a link.
How to get started with Observers
Observers are not yet part of a ZooKeeper release, so in order to start working with them you will have to download the source code from the Subversion trunk.
The following is excerpted from the Observers user guide, found in
docs/zooKeeperObservers.html in the source distribution.
How to use Observers
Note that until ZOOKEEPER-578 is resolved, you must set electionAlg=0 in every server configuration file. Otherwise an exception will be thrown when you try to start your ensemble.
The reason is that Observers do not participate in leader elections; they rely on voting Followers to inform them of changes to the Leader. Currently, only the basic leader election algorithm starts a thread that responds to requests from Observers to identify the current Leader. Work is in progress on other JIRAs to bring this functionality to all leader election protocols.
Setting up a ZooKeeper ensemble that uses Observers is very simple, and requires just two changes to your config files. Firstly, in the config file of every node that is to be an Observer, you must place this line:

peerType=observer
This line tells ZooKeeper that the server is to be an Observer. Secondly, in every server config file, you must add :observer to the server definition line of each Observer. For example:

server.1:localhost:2181:3181:observer
This tells every other server that server.1 is an Observer, and that they should not expect it to vote. This is all the configuration you need to do to add an Observer to your ZooKeeper cluster. Now you can connect to it as though it were an ordinary Follower. Try it out, by running:
bin/zkCli.sh -server localhost:2181
where localhost:2181 is the hostname and port number of the Observer as specified in every config file. You should see a command line prompt through which you can issue commands like ls to query the ZooKeeper service.
There’s more to be done with the Observers feature. In the short term we are working on making Observers fully compatible with all leader election algorithms that ship with ZooKeeper – we expect this to be finished within the next few days. Longer term, we hope to investigate performance optimisations such as batching and off-line reads for Observer-based clusters, taking advantage of the fact that, unlike a normal ZooKeeper server, Observers have no strict latency requirements to meet.
We hope that Observers will make it into the release of ZooKeeper 3.3.0, due early next year. We would be delighted to hear your feedback, either on the mailing lists or via direct e-mail. ZooKeeper is always looking for contributors, and we’ve got plenty of interesting problems to solve, so do get in contact if you’d like to get involved – I’d be happy to help you get started.