It was good to see Jay Kreps (@jaykreps), the LinkedIn engineer who is the tech lead for that company’s online data infrastructure, visit Cloudera Engineering yesterday to spread the good word about Apache Kafka.
Kafka, of course, was originally developed inside LinkedIn and entered the Apache Incubator in 2011. Today, it is being widely adopted as a pub/sub framework that works at massive scale (and which is commonly used to write to Apache Hadoop clusters, and even data warehouses).
Perhaps the most interesting thing about Kafka is its treatment of the venerable commit log as a unifying abstraction. As Jay puts it, the log is “the natural data structure for handling data flow between systems” — and he describes that approach as “pub/sub done right” (much more detail about this important concept here).
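The idea is easiest to see in miniature. Below is a minimal sketch in plain Python (an illustration of the concept, not Kafka’s actual implementation): a log is just an append-only sequence of records addressed by offset, and each consumer tracks its own offset, so independent systems can read the same stream at their own pace and replay it at will.

```python
class Log:
    """An append-only record of messages, addressed by offset."""

    def __init__(self):
        self._records = []  # append-only; a record's offset is its index

    def append(self, record):
        """Write a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at the given offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer owns its offset, so slow and fast readers coexist
    without coordinating with each other or with producers."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)  # commit progress locally
        return batch


# A producer appends events; two downstream systems (hypothetical names)
# consume the same stream independently.
log = Log()
for event in ("profile-update", "page-view", "new-connection"):
    log.append(event)

search_indexer = Consumer(log)
warehouse_loader = Consumer(log)

print(search_indexer.poll())    # ['profile-update', 'page-view', 'new-connection']
print(warehouse_loader.poll())  # same records, read from its own offset
```

Decoupling readers from writers through offsets is what makes the log a natural fit for fan-out between heterogeneous systems: adding a new downstream consumer requires no change to producers or to existing consumers.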
In his talk, Jay first described the objectives that brought Kafka into being, including the need to build data pipelines across a highly heterogeneous environment (and for highly heterogeneous data) without ending up with a spaghetti diagram of specialized, one-off connections. The requirements involved will not be unfamiliar to most of this blog’s readers: fast, filesystem-like performance; built-in failover; and an inherently distributed architecture (including replication, partitioning, and so on). LinkedIn explored several COTS alternatives, all of which failed to tick at least one of these checkboxes, before going homegrown.
Today, Kafka is the dataflow glue that binds LinkedIn’s data infrastructure, keeping its online systems (Espresso, Voldemort, graph databases, search, and so on) in sync with offline ones (Hadoop and the enterprise data warehouse) — with 7 million messages written per second, and 35 million read per second.
Jay kindly agreed to let us share his presentation with you, and here it is:
Thanks Jay, we really appreciate your visit!
Justin Kestelyn is Cloudera’s developer outreach director.