Distributed Systems Get Simpler with Apache Helix

Our thanks to Kishore Gopalakrishna, staff engineer at LinkedIn and one of the original developers of Apache Helix (incubating), for the introduction below. Cloudera’s Patrick Hunt is a mentor for the project.

With exploding data growth and the rise of NoSQL and Big Data systems, the number of distributed systems has grown significantly. At LinkedIn, we have built a number of such systems over the years. These systems run on clusters of many servers and must cope with the problems that come with distribution. Fault tolerance, that is, availability in the presence of server failures and network problems, is critical to any such system. Horizontal scalability and seamless cluster expansion to handle increasing workloads are also essential properties.

Without a framework that provides these capabilities, developers of distributed platforms have to re-implement the same features over and over. From LinkedIn’s own experience building various distributed systems and solving the same problems repeatedly, we decided to build a generic platform to simplify this process, which we eventually called Helix (and which is now in the Apache Incubator).

Concept

Helix introduces the following terminology, which allows you to define a distributed system and express its behavior.

A set of autonomous processes, each referred to as an instance, collectively constitutes a cluster. The instances (also called participants) collectively perform or support a task, which is referred to as a resource. A resource is a logical entity that can span many instances; for example, a database, a topic, an index, or any computation/serving task can be mapped to a resource. A resource can be further sub-divided into sub-resources or sub-tasks, which are referred to as partitions. A partition is the unit at which a resource is assigned to an instance, which implies that a partition cannot span multiple instances. For fault tolerance and/or scalability, there can be multiple copies of a partition; each copy is referred to as a replica.

The above terminology is generic enough to be mapped to any distributed system, but distributed systems vary significantly in their function and behavior. For example, a search system that serves only queries behaves quite differently from a database that supports reads on mutable data. So, to effectively describe the behavior of a distributed system, we need to define the various states its resources and partitions can be in, along with the legal transitions between those states. Helix accomplishes this by letting you plug in a Finite-State Machine (FSM).
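
For concreteness, here is a rough sketch of how a MasterSlave-style state machine could be defined. It uses the StateModelDefinition.Builder API described in the Helix tutorial; the state names and exact method signatures shown here are illustrative, so check them against the Helix release you use.

    import org.apache.helix.model.StateModelDefinition;

    // Illustrative sketch of a MasterSlave state model: three states and the
    // legal transitions between them. Helix will only ever issue transitions
    // along these edges.
    public class MasterSlaveModelSketch {
      public static StateModelDefinition build() {
        StateModelDefinition.Builder builder = new StateModelDefinition.Builder("MasterSlave");

        // States, in priority order (MASTER is the highest-priority state).
        builder.addState("MASTER", 1);
        builder.addState("SLAVE", 2);
        builder.addState("OFFLINE", 3);
        builder.initialState("OFFLINE");

        // Legal state transitions.
        builder.addTransition("OFFLINE", "SLAVE");
        builder.addTransition("SLAVE", "MASTER");
        builder.addTransition("MASTER", "SLAVE");
        builder.addTransition("SLAVE", "OFFLINE");

        return builder.build();
      }
    }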

The figure below illustrates this terminology: replicas are color-coded to indicate the node (aka instance) they live on; for example, replica P1.r2 is a slave and sits on Node 4.

An FSM by itself is not sufficient to describe all the invariants in the system; we still need a way to capture the desired behavior when the cluster topology changes. For example, when a node fails, we could choose to create new replicas or to change the state of existing replicas. When new nodes are added to the cluster, we could redistribute the partitions to the new nodes all at once or incrementally. These are nontrivial problems, and it is very difficult for developers to get this behavior right for every possible scenario on their own.

Helix implements the concepts of constraints and objectives: you can define constraints on each state and transition in the FSM, and you can specify them at various granularities such as partition, resource, instance, and cluster (a sketch after the list below shows how some of these are expressed when setting up a cluster).

Here are some examples of constraints and objectives:

  • State constraints:
    • Partition scope: 1 Master, 2 Slaves
    • Instance scope: max 10 partitions on each node
  • Transition constraints:
    • Max number of bootstrap transitions (copying data over the network) in a cluster = 5
  • Objectives:
    • Even distribution of partitions
    • Replicas placed on different nodes/racks
    • On node failure, addition, or removal, redistribute partitions evenly with minimal movement
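
As a rough sketch of how such constraints and objectives come together in practice, the cluster-setup calls below register a resource with the built-in MasterSlave state model and ask Helix to compute a balanced placement with three replicas per partition, which corresponds to the “1 Master, 2 Slaves” constraint above. All cluster, resource, and instance names are invented for illustration, and the admin calls follow the Helix tutorial, so verify them against the release you use.

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.InstanceConfig;

    // Illustrative sketch only: names are made up and the calls follow the
    // Helix tutorial; verify signatures against the release you are using.
    public class ClusterSetupSketch {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("localhost:2181"); // ZooKeeper address

        admin.addCluster("MYCLUSTER");

        // Register the nodes that will host replicas.
        for (String instance : new String[] {"localhost_12001", "localhost_12002", "localhost_12003"}) {
          InstanceConfig config = new InstanceConfig(instance);
          config.setHostName(instance.split("_")[0]);
          config.setPort(instance.split("_")[1]);
          admin.addInstance("MYCLUSTER", config);
        }

        // A resource with 6 partitions managed by the MasterSlave state model
        // (assumed to be registered in the cluster; otherwise add it first with
        // addStateModelDef).
        admin.addResource("MYCLUSTER", "MyDB", 6, "MasterSlave");

        // Ask Helix to compute a placement with 3 replicas per partition
        // (1 MASTER + 2 SLAVEs), spread evenly across the live instances.
        admin.rebalance("MYCLUSTER", "MyDB", 3);
      }
    }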

Based on these concepts, we can define cluster management this way:

Given a set of instances and resources, along with some constraints on the system state, compute an assignment of partitions to instances such that all the constraints are satisfied.

Based on that definition, we can envision cluster management as a constraint satisfaction problem. In general, multiple solutions can satisfy the given constraints, and in order to reduce the solution space, we can enforce some optimization rules or objectives.

IdealState

In Helix parlance, a solution that satisfies the constraints and meets the objectives is called the IdealState of the system. It simply describes, for each partition (task), which instance hosts each replica and in what state. For example, if a partition P1 needs one master and two slaves and three nodes are up (N1, N2, N3), one valid solution assigns the MASTER replica of P1 to N1 and a SLAVE replica to each of N2 and N3.

Modeling a distributed system as a state machine with constraints on state and transitions has the following benefits:

  • It separates cluster management from core functionality.
  • It quickly transforms a single-node system into an operable distributed system.
  • System components do not have to manage the global cluster. This division of labor makes it easier to build, debug, and maintain your system.

See the Helix documentation for a detailed explanation of these concepts.

Roles

Helix divides distributed system components into three logical roles:

  • Controller: The controller is the core component in Helix and serves as the “brain” of the system. It hosts the state machine engine, runs the execution algorithm, and issues transitions against the distributed system. The controller is responsible for computing the IdealState that satisfies the given constraints and objectives. If the IdealState differs from the current state of the system, the controller computes the transitions required to take the system from its CurrentState to IdealState. This process is repeated on every change in the cluster, planned as well as unplanned: starting/stopping of nodes, adding or losing nodes, adding or deleting resources, and adding or deleting partitions, among others.
  • Participant: Each instance in the cluster that performs the core functionality is called a participant. Participants run the Helix library, which invokes callbacks whenever the controller initiates a state transition on that participant. The distributed system is responsible for implementing those callbacks; in this way, Helix remains agnostic to the semantics of a state transition and to how the transition is actually carried out. For example, in a LeaderStandby model, the participant implements methods such as OnBecomeLeaderFromStandby and OnBecomeStandbyFromLeader (see the sketch after this list).
  • Spectator: This component represents the external entities that need to observe the state of the system. A spectator is notified of state changes in the system so that it can interact with the appropriate participant. For example, a routing/proxy component (a spectator) can be notified when the routing table changes, such as when the cluster is expanded or a node fails.
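
To make the participant side concrete, here is a rough sketch of the callbacks for the LeaderStandby model, using the participant state-model classes described in the Helix tutorial. The class name and the work done inside each callback are invented for illustration; only the onBecome&lt;To&gt;From&lt;From&gt; naming convention comes from Helix.

    import org.apache.helix.NotificationContext;
    import org.apache.helix.model.Message;
    import org.apache.helix.participant.statemachine.StateModel;

    // Illustrative sketch: Helix invokes these callbacks on the participant when
    // the controller issues the corresponding transition. What each callback does
    // is entirely up to the distributed system being built.
    public class LockStateModel extends StateModel {

      public void onBecomeStandbyFromOffline(Message message, NotificationContext context) {
        // Prepare any local state needed to take over quickly if asked.
      }

      public void onBecomeLeaderFromStandby(Message message, NotificationContext context) {
        // This node now owns the partition (e.g., holds the lock): start work.
      }

      public void onBecomeStandbyFromLeader(Message message, NotificationContext context) {
        // Ownership is moving elsewhere: stop work and release resources.
      }

      public void onBecomeOfflineFromStandby(Message message, NotificationContext context) {
        // This node is leaving the partition's replica set.
      }
    }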

It’s important to note that these are just logical components and can be physically co-located within the same process. Cluster state is maintained in Apache ZooKeeper.

Use Cases at LinkedIn

LinkedIn uses Helix extensively to manage our backend systems, such as Espresso (a distributed NoSQL data store), Databus (a change data capture system), and SEAS (Search as a Service).

Espresso is a distributed, timeline-consistent, scalable document store that supports local secondary indexing and local transactions. Espresso runs on a number of storage node servers that store and index data and answer queries. Espresso databases are horizontally partitioned into a number of partitions, with each partition having a certain number of replicas distributed across the storage nodes.

Databus provides a common pipeline for transporting events from LinkedIn’s primary databases (Espresso and Oracle Database) to downstream applications like caches and search systems. Each Databus partition is assigned to a consumer such that partitions are evenly distributed across consumers and each partition is assigned to exactly one consumer at a time. The set of consumers may grow over time, and consumers may leave the group due to planned or unplanned outages. In these cases, partitions must be reassigned, while maintaining balance and the single consumer-per-partition invariant.

LinkedIn’s SEAS lets internal customers define custom indexes on a chosen dataset and then makes those indexes searchable via a service API. The index service runs on a cluster. The index is broken into partitions and each partition has a configured number of replicas.

Distributed System Recipes

The Helix team has written a few recipes to demonstrate common usage patterns. Helix comes with pre-defined state models such as MasterSlave, OnlineOffline, and LeaderStandby, and a recipe is available for each model.

  • Distributed lock manager (LeaderStandby model): With this recipe, you simply define the number of locks needed and Helix distributes them among the live nodes. Note that even though Helix uses ZooKeeper, the leader is never elected via ZooKeeper ephemeral nodes; instead, it is selected by the controller. A sketch of the participant wiring for this model follows the list.
  • Consumer grouping (OnlineOffline model): Consumer grouping is a simple recipe for fault-tolerant, load-balanced consumption of messages from a pub/sub system. The recipe uses RabbitMQ but can be applied to any pub/sub system.
  • Rsync-based replicated file system (MasterSlave model): This recipe demonstrates a system in which files written to one server are automatically replicated to slaves using rsync. The client always writes to the Master, and the data is then replicated to the Slaves. The recipe also provides fault tolerance: if a Master dies, a Slave is promoted to Master.
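
To give a feel for the wiring in the lock manager recipe, here is a rough sketch of a node joining a cluster as a participant and registering the LockStateModel shown earlier for the LeaderStandby model. Cluster, instance, and class names are invented, and the state-model factory method signature has changed across Helix releases, so treat this as an outline rather than copy-paste code.

    import org.apache.helix.HelixManager;
    import org.apache.helix.HelixManagerFactory;
    import org.apache.helix.InstanceType;
    import org.apache.helix.participant.statemachine.StateModelFactory;

    // Illustrative sketch of participant wiring for the LeaderStandby model.
    public class LockParticipantSketch {

      // Hands Helix one LockStateModel per partition assigned to this node.
      // (In newer releases the factory method also receives the resource name.)
      static class LockStateModelFactory extends StateModelFactory<LockStateModel> {
        @Override
        public LockStateModel createNewStateModel(String partitionName) {
          return new LockStateModel();
        }
      }

      public static void main(String[] args) throws Exception {
        HelixManager manager = HelixManagerFactory.getZKHelixManager(
            "MYCLUSTER",          // cluster name (illustrative)
            "localhost_12001",    // this instance's name (illustrative)
            InstanceType.PARTICIPANT,
            "localhost:2181");    // ZooKeeper address

        // Tell Helix which class handles LeaderStandby transitions on this node.
        manager.getStateMachineEngine()
               .registerStateModelFactory("LeaderStandby", new LockStateModelFactory());

        manager.connect();        // from here on, the controller drives transitions
        Thread.currentThread().join();
      }
    }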

Questions?

The Helix approach is a bit different from the traditional way of building distributed systems, but once the concept is clear, we are confident that you will see the same benefits we saw at LinkedIn.

We still have a long way to go; we are improving the design and refactoring our APIs as we see more use cases outside of LinkedIn. We welcome your feedback and contributions via the mailing lists (user@helix.incubator.apache.org and dev@helix.incubator.apache.org).
