With Kafka now formally integrated with, and supported as part of, Cloudera Enterprise, what’s the best way to deploy and configure it?
Earlier today, Cloudera announced that, following an incubation period in Cloudera Labs, Apache Kafka is now fully integrated into Cloudera’s Big Data platform, Cloudera Enterprise (CDH + Cloudera Manager). Our customers have expressed strong interest in Kafka, and some are already running Kafka in production.
Kafka is a popular open source project so there is already good awareness of its strengths and overall architecture. However, a common set of questions and concerns come up when folks try to deploy Kafka in production for the first time. In this post, we will attempt to answer a few of those questions.
We assume you are familiar with the overall architecture of Kafka, and with the Kafka concepts of brokers, producers, consumers, topics, and partitions. If not, you can check out a previous blog post, “Kafka for Beginners,” and the Kafka documentation.
Hardware for Kafka
For optimal performance, Cloudera strongly recommends that production Kafka brokers be deployed on dedicated machines, separate from the machines on which the rest of your Apache Hadoop cluster runs. Kafka relies on dedicated disk access and large pagecache for peak performance, and sharing nodes with Hadoop processes may interfere with its ability to fully leverage the pagecache.
Kafka is meant to run on industry standard hardware. The machine specs for a Kafka broker machine will be similar to that of your Hadoop worker nodes. Though there is no minimum specification that is required, Cloudera suggests machines that have at least:
- Processor with four 2Ghz cores
- Six 7200 RPM SATA drives (JBOD or RAID10)
- 32GB of RAM
- 1Gb Ethernet
The most accurate way to model your use case is to simulate the load you expect on your own hardware, and you can do that using the load-generation tools that ship with Kafka.
If you need to size your cluster without simulation, here are two simple rules of thumb:
- Size the cluster based on the amount of disk space required. This requirement can be computed from the estimated rate at which you get data multiplied by the required data retention period.
- Size the cluster cluster based on your memory requirements. Assuming readers and writers are fairly evenly distributed across the brokers in your cluster, you can roughly estimate of memory needs by assuming you want to be able to buffer for at least 30 seconds and compute your memory need as write_throughput*30.
Replication, Partitions, and Leaders
Data written to Kafka is replicated for fault tolerance and durability, and Kafka allows users to set a separate replication factor for each topic. The replication factor controls how many brokers will replicate each message that is written. If you have a replication factor of three, then up to two brokers can fail before you will lose access to your data. Cloudera recommends using a replication factor of at least two, so that you can transparently bounce machines without interrupting data consumption. However, if you have stronger durability requirements, use a replication factor of three or above.
Topics in Kafka are partitioned, and each topic has a configurable partition count. The partition count controls how many logs into which the topic will be sharded. Producers assign data to partitions in round-robin fashion, to spread the data belonging to a topic among its partitions. Each partition is of course replicated, but one replica of each partition is selected as a leader partition, and all reads and writes go to this lead partition.
Here are some factors to consider while picking an appropriate partition count for a topic:
A partition can only be read by a single consumer. (However, a consumer can read many partitions.) Thus, if your partition count is lower than your consumer count, many consumers will not receive data for that topic. Hence, Cloudera recommends a partition count that is higher than the maximum number of simultaneous consumers of the data, so that each consumer receives data.
Similarly, Cloudera recommends a partition count that is higher than the number of brokers, so that the leader partitions are evenly distributed across brokers, thus distributing the read/write load. (Kafka performs random and even distribution of partitions across brokers.) As a reference, many Cloudera customers have topics with tens or even hundreds of partitions each. However, note that Kafka will need to allocate memory for message buffer per partition. If there are a large number of partitions, make sure Kafka starts with sufficient heap space (number of partitions multiplied by replica.fetch.max.bytes).
Though there is no maximum message size enforced by Kafka, Cloudera recommends writing messages that are no more than 1MB in size. Most customers see optimal throughput with messages ranging from 1-10 KB in size.
Settings for Guaranteed Message Delivery
Many use cases require reliable delivery of messages. Fortunately, it is easy to configure Kafka for zero data loss. But first, curious readers may like to understand the concept of an in-sync replica (commonly referred to as ISR): For a topic partition, an ISR is a follower replica that is caught-up with the leader partition, and is situated on a broker that is alive. Thus, if a leader replica is replaced by an ISR, there will be no loss of data. However, if a non-ISR replica is made a leader partition, some data loss is possible since it may not have the latest messages.
To ensure message delivery without data loss, the following settings are important:
- While configuring a Producer, set
acks=-1. This setting ensures that a message is considered to be successfully delivered only after ALL the ISRs have acknowledged writing the message.
- Set the topic level configuration
min.insync.replicas, which specifies the number of replicas that must acknowledge a write, for the write to be considered successful.If this minimum cannot be met, and
acks=-1, the producer will raise an exception.
- Set the broker configuration param
unclean.leader.election.enableto false. This setting essentially means you are prioritizing durability over availability since Kafka would avoid electing a leader, and instead make the partition unavailable, if no ISR is available to become the next leader safely.
A typical scenario would be to create a topic with a replication factor of 3, set
min.insync.replicas to 2, and produce with
request.required.acks set to -1. However, you can increase these numbers for stronger durability.
As of today, Cloudera provides a Kafka Custom Service Descriptor (CSD) to enable easy deployment and administration of a Kafka cluster via Cloudera Manager. The CSD provides granular real-time view of the health of your Kafka brokers, along with reporting and diagnostic tools. This CSD is available via Cloudera’s downloads page.
We hope the Cloudera Manager CSD, along with the tips in this blog post, make it easy for you to get up and running with Kafka in production.
Anand Iyer is a product manager at Cloudera.