Our thanks to Micah Whitacre, a senior software architect on Cerner Corp.’s Big Data Platforms team, for the post below about Cerner’s use case for CDH + Apache Kafka. (Kafka integration with CDH is currently incubating in Cloudera Labs.)
Over the years, Cerner Corp., a leading Healthcare IT provider, has utilized several of the core technologies available in CDH, Cloudera’s software platform containing Apache Hadoop and related projects—including HDFS, Apache HBase, Apache Crunch, Apache Hive, and Apache Oozie. Building upon those technologies, we have been able to architect solutions to handle our diverse ingestion and processing requirements.
At various points, however, we reached certain scalability limits and perhaps even abused the intent of certain technologies, causing us to look for better options. By adopting Apache Kafka, Cerner has been able to solidify our core infrastructure, utilizing those technologies as they were intended.
One of the early challenges Cerner faced when building our initial processing infrastructure was moving from batch-oriented processing to technologies that could handle a streaming near-real-time system. Building upon the concepts in Google’s Percolator paper, we built a similar infrastructure on top of HBase. Listeners interested in data of specific types and from specific sources would register interest in data written to a given table. For each write performed, a notification for each applicable listener would be written to a corresponding notification table. Listeners would continuously scan a small set of rows on the notification table looking for new data to process, deleting the notification when complete.
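The notification pattern described above can be sketched with a small in-memory model. This is an illustration only, not Cerner's implementation: plain dictionaries stand in for the HBase tables, and the class, method, and listener names are hypothetical.

```python
from collections import defaultdict


class NotificationStore:
    """In-memory stand-in for a data table plus a notification table (hypothetical)."""

    def __init__(self):
        self.data = {}                          # row key -> payload
        self.listeners = defaultdict(list)      # (data type, source) -> listener names
        self.notifications = defaultdict(dict)  # listener name -> pending row keys

    def register(self, listener, data_type, source):
        """Record a listener's interest in one data type from one source."""
        self.listeners[(data_type, source)].append(listener)

    def write(self, row_key, data_type, source, payload):
        """Store the payload and write one notification per applicable listener."""
        self.data[row_key] = payload
        for listener in self.listeners[(data_type, source)]:
            self.notifications[listener][row_key] = True

    def poll(self, listener, limit=10):
        """Scan a small set of pending notifications for one listener."""
        return sorted(self.notifications[listener])[:limit]

    def complete(self, listener, row_key):
        """Delete the notification once processing has finished."""
        del self.notifications[listener][row_key]


store = NotificationStore()
store.register("allergy-listener", "allergy", "clinic-a")
store.write("row-1", "allergy", "clinic-a", {"code": "peanut"})
store.write("row-2", "lab", "clinic-a", {"code": "cbc"})  # no listener registered

pending = store.poll("allergy-listener")
for row_key in pending:
    store.complete("allergy-listener", row_key)
remaining = store.poll("allergy-listener")
```

In the HBase-backed version, every `complete` call is a delete that lingers until a compaction removes it.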
Our low-latency processing infrastructure worked well for a time but quickly reached scalability limits based on its use of HBase. Listener scan performance would degrade without frequent compactions to remove deleted notifications. During the frequent compactions, performance would degrade, causing severe drops in processing throughput. Processing would require frequent reads from HBase to retrieve the notification, the payload, and often supporting information from other HBase tables. The high number of reads would often contend with writes done by our processing infrastructure, which was writing transformed payloads and additional notifications for downstream listeners. The I/O contention and the compaction needs required careful management to distribute the load across the cluster, often segregating the notification tables on isolated region servers.
Adopting Kafka was a natural fit for reading and writing notifications. Instead of scanning rows in HBase, a listener would consume messages from a Kafka topic, updating its offset as notifications were successfully processed.
Kafka’s natural separation of producers and consumers eliminated contention at the HBase RegionServer due to the high number of notification read and write operations. Kafka’s consumer offset tracking helped to eliminate the need for notification deletes, and replaying notifications became as simple as resetting the offset in Kafka. Offloading the highly transient data from HBase greatly reduced unnecessary overhead from compactions and high I/O.
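A minimal sketch of why offset tracking removes the need for deletes: the listener only remembers a position in an append-only log, and replay is a matter of moving that position back. A Python list stands in for a single topic partition here; the class and method names are hypothetical and are not the Kafka client API.

```python
class TopicLog:
    """Append-only list standing in for one Kafka topic partition (hypothetical)."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        """Add a message and return the offset it was written at."""
        self.messages.append(message)
        return len(self.messages) - 1


class Listener:
    """Tracks its own read position instead of deleting what it has processed."""

    def __init__(self, log):
        self.log = log
        self.offset = 0       # offset of the next message to read
        self.processed = []

    def poll(self):
        """Process every message from the current offset to the end of the log."""
        while self.offset < len(self.log.messages):
            self.processed.append(self.log.messages[self.offset])
            self.offset += 1  # advance only after the message was handled

    def seek(self, offset):
        """Reset the read position; the next poll replays from there."""
        self.offset = offset


log = TopicLog()
for name in ("n0", "n1", "n2"):
    log.append(name)

listener = Listener(log)
listener.poll()   # handles n0, n1, n2
listener.seek(1)  # replay from offset 1
listener.poll()   # handles n1, n2 a second time
```

Nothing is removed from the log when a notification is processed, so the read path never has to skip over deleted entries.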
Building upon the success of Kafka-based notifications, Cerner then explored using Kafka to simplify and streamline data ingestion. Cerner systems ingest data from multiple disparate sources and systems. Many of these sources are external to our data centers. The “Collector,” a secured HTTP endpoint, will identify and namespace the data before it is persisted into HBase. Prior to utilizing Kafka, our data ingestion infrastructure targeted a single data store such as an HBase cluster.
The system satisfied our initial use cases, but as our processing needs changed, so did the complexity of our data ingestion infrastructure. Data would often need to be ingested into multiple clusters in near real time, and not all data needed the random read/write functionality of HBase.
Utilizing Kafka in our ingestion platform helped provide a durable staging area, giving us a natural way to broadcast the ingested data to multiple destinations. The collector process stayed simple by persisting data into Kafka topics, segregated by source. Pushing data to Kafka resulted in a noticeable improvement as the uploading processes were no longer subject to intermittent performance degradations due to compaction or region splitting with HBase.
After data lands in Kafka, Apache Storm topologies push data to consuming clusters independently. Kafka and Storm allow the collector process to remain simple by eliminating the need to deal with multiple writes or the performance impact of the slowest downstream system. Storm’s at-least-once delivery guarantee is acceptable because persistence of the data is idempotent: a redelivered message has no additional effect.
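The combination of independent delivery and idempotent persistence can be illustrated with a short sketch. It assumes that persistence is a keyed write in which the same key always carries the same payload; that is an assumption made for the example, and the names below are hypothetical rather than taken from Cerner's topologies or the Storm API.

```python
class Cluster:
    """Stand-in for one downstream data store (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.writes = 0

    def persist(self, key, payload):
        # Keyed write: repeating it with the same payload leaves the same state.
        self.rows[key] = payload
        self.writes += 1


def deliver(staged, cluster, start=0):
    """Push staged messages to one cluster, beginning at the given position."""
    for key, payload in staged[start:]:
        cluster.persist(key, payload)


# Messages staged once by the collector, then delivered to each cluster separately.
staged = [("patient-1", "payload-a"), ("patient-2", "payload-b")]

cluster_a = Cluster("a")
cluster_b = Cluster("b")

deliver(staged, cluster_a)

# Simulate at-least-once delivery: cluster b receives everything, then a retry
# redelivers the full batch from the beginning.
deliver(staged, cluster_b)
deliver(staged, cluster_b, start=0)
```

Cluster b performs twice as many writes as cluster a, yet both end up holding the same rows, which is why duplicate delivery is tolerable here.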
The separation that Kafka provides also allows us to aggregate the data for processing as necessary. Some medical data feeds produce a high volume of small payloads that only need to be processed through batch methods such as MapReduce. LinkedIn’s Camus project allows our ingestion platform to persist batches of small payloads within Kafka topics into larger files in HDFS for processing. In fact, all the data we ingest into Kafka is archived into HDFS as Kite SDK Datasets using the Camus project. This approach gives us the ability to perform further analytics and processing on that data when low latency is not required. Archiving the data also provides a recovery mechanism in case data delivery lags beyond the topic retention policies of Kafka.
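The general idea of rolling many small payloads into fewer, larger files can be sketched as a simple size-bounded batching function. This is not how Camus is implemented; the function name, the newline separator, and the byte threshold are all assumptions chosen for illustration.

```python
def batch_payloads(payloads, max_bytes):
    """Group small byte payloads into newline-joined batches.

    A batch is closed when adding the next payload would push its payload
    bytes past max_bytes (separator bytes are not counted). A single payload
    larger than max_bytes still gets a batch of its own.
    """
    batches, current, size = [], [], 0
    for payload in payloads:
        if current and size + len(payload) > max_bytes:
            batches.append(b"\n".join(current))
            current, size = [], 0
        current.append(payload)
        size += len(payload)
    if current:
        batches.append(b"\n".join(current))
    return batches


small_payloads = [b"aaaa", b"bbbb", b"cc", b"dddddd"]
files = batch_payloads(small_payloads, max_bytes=8)
# files == [b"aaaa\nbbbb", b"cc\ndddddd"]
```

Each element of `files` stands for the contents of one larger file that a batch job could read in place of many tiny records.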
Cerner’s use of Kafka for ingesting data will allow us to continue to experiment and evolve our data processing infrastructure when new use cases are discovered. Technologies such as Spark Streaming, Apache Samza (incubating), and Apache Flume can be explored as alternatives or additions to the current infrastructure. Cerner can prototype Lambda and Kappa architectures for multiple solutions independently without affecting the processes producing data. As Kafka’s multi-tenancy capabilities develop, Cerner can also look to simplify some of its data persistence needs, eliminating the need to push to downstream HBase clusters.
Overall, Kafka will play a key role in Cerner’s infrastructure for large-scale distributed processing and be a nice companion to our existing investments in Hadoop and HBase.
Micah Whitacre (@mkwhit) is a senior software architect on Cerner Corp.’s Big Data Platforms team, and an Apache Crunch committer.