Recently, Confluent hosted Current 2023 (formerly Kafka Summit) in San Jose on September 26th and 27th. With few conferences curating content specifically for streaming developers, Current has historically been an important event for anyone trying to keep a pulse on what’s happening in the streaming space. With over 2,000 attendees and plenty of new solutions on display, the event offered a clear look into the current (no pun intended) state of streaming and where it is headed. This blog is for anyone who was interested but unable to attend the conference, or anyone who wants a quick summary of what happened there. I will cover key takeaways from Current 2023 and offer Cloudera’s perspective.
Five Takeaways from Current 2023:
1- The people have spoken and Apache Flink is the de facto standard for stream processing
This may seem obvious to many who are already familiar with Flink, but it is worth pointing out. Architecture decisions have long-term effects, and an important consideration when choosing a stream processing engine is whether the technology will stagnate or continue to evolve with contributions from the open source community. Will I be able to find developers for this three years from now? The answer from the community is a resounding yes. Flink is here to stay.
It makes perfect sense that Apache Flink has emerged as the standard. Flink was launched in 2015 as the world’s first open source streaming-first distributed stream processing engine and has since grown to rival Spark in terms of popularity. Its layered APIs, from low-level operations to high-level abstractions, give Flink appeal to a broad range of users. The adoption of Flink mirrors the growth in streaming data volumes and the maturity of the streaming market. As organizations shift from modernizing data-driven applications via Kafka towards delivering real-time insight and/or powering smart automated systems, Flink becomes the natural next step.
At Current, Flink adoption was a hot topic, and many of the vendors (Cloudera included) use Flink as the engine powering their stream processing offerings. Use cases such as fraud monitoring, real-time supply chain insight, IoT-enabled fleet operations, real-time customer intent, and modernizing analytics pipelines are driving development activity. The value of consolidating different processing frameworks onto a single comprehensive framework to minimize technical overhead and maintain innovation speed is well understood.
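To make the fraud-monitoring example concrete, here is a minimal PyFlink sketch of the kind of windowed aggregation such a job might run. It is illustrative only: the topic, field names, broker address, and threshold are assumptions, and it requires the Flink Kafka SQL connector to be available.

    # Minimal PyFlink sketch of a windowed fraud-style aggregation.
    # Topic, field names, broker address, and threshold are illustrative.
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # Kafka-backed source table (requires the Flink Kafka SQL connector on the classpath)
    t_env.execute_sql("""
        CREATE TABLE transactions (
            card_id STRING,
            amount  DOUBLE,
            ts      TIMESTAMP(3),
            WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'transactions',
            'properties.bootstrap.servers' = 'localhost:9092',
            'properties.group.id' = 'fraud-demo',
            'scan.startup.mode' = 'latest-offset',
            'format' = 'json'
        )
    """)

    # One-minute tumbling windows: surface cards with an unusually high transaction count
    t_env.sql_query("""
        SELECT window_start, card_id, COUNT(*) AS txn_count
        FROM TABLE(TUMBLE(TABLE transactions, DESCRIPTOR(ts), INTERVAL '1' MINUTE))
        GROUP BY window_start, window_end, card_id
        HAVING COUNT(*) > 10
    """).execute().print()

The same logic could be expressed at a lower level with the DataStream API or, for analysts, through a SQL console, which is exactly the layered-API appeal described above.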
The big announcement everyone was waiting for was the unveiling of Apache Flink in Confluent Cloud. The actual unveiling was a bit underwhelming, as the SQL console left a lot to be desired, and outside of serverless auto-scaling functionality there was no “wow” factor. As of this writing, the product is still not GA and will not be made available on-prem, but the unveiling is still important due to the sheer size of the Confluent user base. Adoption will follow, and it’s safe to say that we have passed the tipping point: Flink is the future of streaming.
Cloudera’s perspective: Cloudera saw early on the increasing volumes of data our customers were moving via streams. Those customers were suffering rising costs and struggling to provide real-time insight to demanding stakeholders. So we bet big on Flink in 2020 and started developing tooling to bring it to the enterprise, and we now have a mature Flink product used by customers in banking, telco, manufacturing, and IT. ksqlDB, Spark Structured Streaming, and other approaches that fall short of the truly open, distributed, stateful stream processing capabilities Flink brings to the table will likely decelerate.
2- But there is an intriguing new category of competitor emerging, the “streaming database”
There are a handful of vendors positioning streaming databases as an alternative to Flink for stream processing. Their core value proposition is that streaming databases are inherently faster than Flink due to in-memory processing and state management. This makes sense in theory, but there are pretty wild claims out there about just how much faster they are, and with a lack of independent benchmarks in the industry, a healthy dose of skepticism is warranted. But the tech is interesting, and the allure of database tooling that can “do it all” is strong.
Cloudera’s perspective: There is much value to be captured by bringing real-time processing capabilities to streaming architectures. Kafka-centric approaches leave a lot to be desired, most notably operational complexity and difficulty integrating batch data, so there is certainly a gap to be filled. Real-time databases have their place in the streaming ecosystem, but that place is in publishing and making the result sets widely available after a highly scalable engine like Flink has processed the data. Cloudera does this via materialized views that are accessible via API. Also, why solve for connectivity and data distribution again if it’s already solved for? How long does streaming data live inside the database and what happens when it expires? Is this yet another database? What about data lock-in? With highly interdependent capabilities, how difficult will it be to make modifications as business and data requirements evolve?
This class of technologies is very interesting, but still new—“wait and see” is perhaps sage advice.
3- Change data capture is red hot and Debezium is the de facto standard in this space
Judging by the sheer number of questions from the audience about CDC in general and Debezium specifically, it’s safe to say that Debezium has become for CDC what Flink is for stream processing. It makes perfect sense—similar to Flink, Debezium is an open source distributed service frequently used with Kafka to extend the value of streaming and capture new use cases. Debezium works by continuously reading the change logs of popular databases and publishing to Kafka topics, effectively transforming legacy batch systems into rich streams of data.
Debezium does have certain complexities, of course, namely around resource management and schema evolution. But there is much value to be captured here.
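As a rough illustration of the mechanics described above, here is how a Debezium MySQL source connector might be registered with the Kafka Connect REST API from Python. The hostnames, credentials, table names, and topic prefix are placeholders, and the property names follow Debezium 2.x conventions.

    import json
    import requests

    # Illustrative Debezium MySQL source connector; hostnames, credentials, and
    # table names are placeholders, and key names follow Debezium 2.x conventions.
    connector = {
        "name": "orders-cdc",
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "mysql.internal",
            "database.port": "3306",
            "database.user": "cdc_user",
            "database.password": "********",
            "database.server.id": "5701",
            "topic.prefix": "shop",  # change events land in shop.<db>.<table> topics
            "table.include.list": "shop.orders",
            "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
            "schema.history.internal.kafka.topic": "schema-history.shop",
        },
    }

    # Register the connector through the Kafka Connect REST API
    resp = requests.post(
        "http://connect:8083/connectors",
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector),
    )
    resp.raise_for_status()

Once registered, every committed insert, update, and delete on the included tables is published as an event, which is how a legacy batch-oriented database becomes a stream.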
Cloudera perspective: Data freshness matters. It’s difficult to imagine a use case where fresher data isn’t inherently better data. Change data capture is an important part of the streaming ecosystem. Cloudera supports Debezium connectors for Kafka Connect and Flink and will soon release a NiFi processor as well, giving users fine-grained control over data distribution.
4- Tooling for the Kafka ecosystem is improving
It’s no secret that Kafka deployments can be quite complex. Setting up clusters; monitoring and managing brokers, partitions, and topics; handling message ordering, exactly-once guarantees, schema evolution, and security: these all add up to operational overhead. Data lineage and debugging can be a nightmare to unravel. As the streaming space grows in maturity, one thing that stood out is the improved tooling on offer. Confluent’s future vision for the data portal is a great example of the effort to provide better tooling and a smoother user experience around discoverability and governance. Many vendors are offering enhanced tooling for observability and performance, or extending the ecosystem by integrating other frameworks such as MQTT and Pulsar.
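Exactly-once delivery is a good example of the configuration detail operators have to get right. A minimal sketch of a transactional producer using the confluent-kafka Python client might look like the following; the broker address, transactional.id, topic, and payload are placeholders.

    from confluent_kafka import Producer

    # Transactional (exactly-once) producer using the confluent-kafka client.
    # Broker address, transactional.id, topic, and payload are placeholders.
    producer = Producer({
        "bootstrap.servers": "kafka:9092",
        "enable.idempotence": True,
        "transactional.id": "orders-writer-1",
    })

    producer.init_transactions()
    producer.begin_transaction()
    producer.produce("orders", key="order-42", value='{"status": "shipped"}')
    producer.commit_transaction()  # or abort_transaction() on failure

Multiply this kind of detail across every producer, consumer, and topic in a deployment and the appeal of better tooling is obvious.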
Cloudera perspective: Cloudera began providing support and building tooling for the Kafka ecosystem in 2015 and has developed stable enterprise solutions. The Streams Messaging Manager tool is included in our free community edition of Cloudera Stream Processing. Furthermore, Cloudera SDX provides an integrated set of security and governance tools across the entire data lifecycle, including streaming. The Kafka platform’s shift from ZooKeeper to KRaft is a huge relief for anyone managing Kafka operations. KRaft is already in tech preview for our next release.
For these reasons and more, IBM recently chose Cloudera as its strategic Kafka partner to bring cost-efficient, scalable solutions to our enterprise customers.
5- There is still room for growth and maturation in the streaming space
While adoption of streaming technologies has steadily increased, the average streaming maturity level is still in the early stages. Streaming maturity is not about simply streaming more data; it’s about weaving streaming data more deeply into operations to drive real-time utilization across the enterprise. The number of use cases supported by a single Kafka topic is a better indicator than a raw volume measure like events per second. Surprisingly few users had multiple use cases for most of their Kafka topics. Another hallmark of streaming maturity is the efficiency of the entire system, both in terms of resource utilization and the ease of developing new use cases or modifying existing ones. Real-time processing can significantly reduce the volume of data in the stream, and that’s a good thing. The majority of data streamers are just beginning to experiment here.
More forward-looking talks focused on expanding the impact of streaming data, such as real-time anomaly detection and other time-series operations on event streams. Operationalizing Python for real-time ML pipelines was a hot topic. Others focused on big-picture efficiency, looking for ways to reduce load on Kafka, for example by integrating with Apache Pinot (link below to an NYC-based Meetup on this topic). There was conspicuously little content specific to generative AI, which was a bit surprising given the attention the industry at large has given the topic in 2023. Streaming data absolutely has a tremendous role to play in generative AI: fine-tuning foundation models, optimizing prompts, contextualizing and augmenting outputs, and more. Stay tuned for plenty more on that topic!
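As a toy illustration of real-time anomaly detection on an event stream, the following Python sketch flags readings that fall more than three standard deviations from a rolling window. The broker, topic, and field names are assumptions, and a production pipeline would typically run this kind of logic in a stream processor like Flink rather than a bare consumer loop.

    import json
    from collections import deque
    from statistics import mean, stdev

    from confluent_kafka import Consumer

    # Toy rolling z-score detector over a Kafka stream; broker, topic, and the
    # "reading" field are assumptions made for the sake of the example.
    consumer = Consumer({
        "bootstrap.servers": "kafka:9092",
        "group.id": "anomaly-demo",
        "auto.offset.reset": "latest",
    })
    consumer.subscribe(["sensor-readings"])

    window = deque(maxlen=500)  # rolling window of recent values

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        value = json.loads(msg.value())["reading"]
        if len(window) > 30:
            mu, sigma = mean(window), stdev(window)
            if sigma > 0 and abs(value - mu) > 3 * sigma:
                print(f"anomaly: {value:.2f} (mean {mu:.2f}, stdev {sigma:.2f})")
        window.append(value)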
Cloudera perspective: Data streams are part of a much broader data lifecycle. Kafka can’t do it all. Kafka shines when utilized as the real-time bus for application integration and as the message buffer for analytics workflows. When stretched beyond those core capabilities, however, it becomes overly complex and carries significant technical overhead. That’s why a complete approach to streaming is needed. An efficient and scalable streaming architecture should be simple yet complete, with tooling to address continuous, iterative development cycles. That includes first-class support for data distribution (aka universal data distribution), edge data capture, stream filtering, independently modifiable stream processing that is accessible to analysts, and integration with data at rest for low-cost, accessible storage. Lastly, real-time processing and movement of multi-structured data, including prompts and embeddings, is critical for harnessing the transformative power of AI.
Download Cloudera Stream Processing Community Edition for FREE and go from zero to Flink in less than an hour. Our SQL Stream Builder console is the most complete you’ll find anywhere.
Sign up for a free trial of Cloudera’s NiFi-based DataFlow and walk through use cases like stream filtering and cloud data warehouse ingest.
Join me and Developer Advocate Tim Spann in New York City for the latest on real-time, including generative AI and more, co-hosted by Cloudera and StarTree, the company behind Apache Pinot.