Earlier this month (November 6 through 8, 2023) a few hundred Apache Flink enthusiasts descended upon a Hyatt Regency Lake near Seattle for the annual Flink Forward conference. Cloudera was happy to participate, both as a sponsor of the conference and supporter of the open source community. Flink is, relatively speaking, a newer technology. However, it continues to gain adoption and inspire new development in the core engine as well as supporting technologies. Flink Forward is a great opportunity to learn about the cutting edge of streaming and stream processing technologies. This blog is a summary of what we observed there for anyone who was unable to attend or just wants to stay on top of what’s happening in streaming.
Takeaway No. 1: The Flink community is amazing
I’d like to offer a proper hats-off to Ververica for organizing a fantastic conference. The conference had a laser focus on the open source technology and the developers who bring it to their organizations. No vendors pretending open source tech was their own secret sauce. No glorified advertisements masquerading as case studies. Just Flink-oriented content and training. The tech itself now boasts 1.4 million downloads, 21,000 GitHub stars, and 1,600 code contributions. There are individual Flink clusters in production as big as 4 million cores and 2,000 cluster nodes, clocked at 4.1 billion events/s. However you want to measure it, it’s safe to say that Flink has taken the mantle of “industry standard.”
Cloudera perspective: Flink is here to stay. When choosing open source or open core, a key consideration is the support of the community and the sustained development of the tech. No enterprise wants to bet on technology that will be out of fashion next year. Flink is a distributed engine that can be deployed on commodity hardware where it is lightning fast at astronomical scale. Vendors making claims of being faster than Flink should be viewed with suspicion.
Takeaway No. 2: The majority of Flink shops are in earlier phases of maturity
We talked to numerous developer teams who had migrated workloads from legacy ETL tools, Kafka Streams, Spark Streaming, or other tools for the efficiency and speed of Flink. Many critical downstream applications consume data processed by Flink, especially in telcos, financial services, and e-commerce, where real-time processing needs are pronounced. But the burden of developing and maintaining these solutions often fell on small teams of Java programmers. A good percentage of Flink deployments are still self-managed, which presents a series of challenges to solve in order to scale. Many architects and team leaders told us they want to democratize stream processing for larger user bases, especially SQL analysts, and/or to move from manual configuration and maintenance of Flink environments to more of a PaaS model that maintains performance while freeing up development resources.
Cloudera perspective: This is exactly why we built SQL Stream Builder, a SQL-based no-code UI for analysts and domain experts. By democratizing access to streaming data, and bringing domain expert users into the development cycle, we help accelerate iterations on stream processing applications. This is vital when onboarding new data, or changing logic to meet evolving needs as is the case in fraud monitoring. Join our webinar December 14 to see a demonstration and ask questions.
Takeaway No. 3: Efforts to simplify deployment architectures are expected to help further accelerate adoption
Many organizations are moving their Flink deployments to Kubernetes. This will help accelerate deployment across environments and optimize performance and resource utilization on an ongoing basis. DataOps teams can rejoice: this is good news for Flink, as it removes barriers to adoption and lowers the overall cost of deployment, significantly improving the ROI on Flink pipelines and applications, especially when consolidating disparate processing tools.
Cloudera perspective: Deployment architecture matters. Hybrid matters! Cloud-only solutions will not meet the needs of many use cases and run the risk of creating additional barriers for organizations. Cloudera is embracing Kubernetes in our Data in Motion stack, making our Flink PaaS offering more portable, scalable, and suitable for DataOps.
Takeaway No. 4: There is growing realization that Kafka is not enough
Numerous developers and architects expressed a desire to offload Kafka and are looking to Flink for that purpose. Consider a few factors. First, many have been using Kafka as long-term storage and have seen their clusters grow without the elasticity and accessibility one would expect from a modern data lake. Second, Kafka includes its “friends” Kafka Connect and Kafka Streams, but neither of those actually reduces the amount of data streamed, and Kafka Connect offers an all-or-nothing approach to bringing data into the stream. It should come as no surprise that streams have grown considerably over the years, to the point where a common Flink use case is simply to filter streams to reduce the load on Kafka.
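To make the stream-filtering idea concrete, here is a minimal sketch in plain Python. It is not Flink code and does not use any Flink or Kafka API; a generator simply stands in for a filter operator, and the event fields and the "large payment" rule are hypothetical, chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator

Event = dict  # hypothetical event shape, e.g. {"type": "payment", "amount": 250.0}


def filter_stream(events: Iterable[Event], keep: Callable[[Event], bool]) -> Iterator[Event]:
    """Lazily yield only the events that satisfy the predicate."""
    for event in events:
        if keep(event):
            yield event


def is_large_payment(event: Event) -> bool:
    # Hypothetical rule: downstream consumers only need payments over 100.
    return event.get("type") == "payment" and event.get("amount", 0) > 100


# Stand-in for an upstream topic; a real pipeline would read from a source connector.
upstream = [
    {"type": "payment", "amount": 250.0},
    {"type": "heartbeat"},
    {"type": "payment", "amount": 12.5},
    {"type": "payment", "amount": 980.0},
    {"type": "click", "amount": 0},
]

downstream = list(filter_stream(upstream, is_large_payment))
print(f"{len(downstream)} of {len(upstream)} events forwarded")
```

The point of the sketch is only that events dropped by the predicate never need to be written to, stored in, or re-read from a downstream topic.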
Cloudera perspective: The market has evolved. Organizations are moving beyond a Kafka-is-everything mentality when it comes to streaming. Workloads that don’t expressly require the many-to-many data sharing that the publish/subscribe model solves for might be better served by a universal data distribution tool like NiFi for real-time needs, or by an open table format like Iceberg where making data accessible in near real time is acceptable. Cloudera offers Kafka with Flink, NiFi, and Iceberg to provide a complete set of capabilities for streaming data, helping organizations capture, process, distribute, and store any and all data needed to deliver the real-time insights their applications and business users need.
Takeaway No. 5: Stream processing and lakehouse capabilities need each other
Ververica unveiled support for Apache Paimon, a new Apache project that seems poised to support this Kafka-offloading trend as part of a broader integration with data at rest. While an integrated storage solution for Flink is highly valuable, it’s still early, and it’s not clear how the market will react to Paimon or the “streamhouse” terminology. The project does tout some bells and whistles, but ultimately offers little in terms of fundamental differentiation from Apache Iceberg. The Paimon community is nascent and heavily centered in one geo, and adoption has yet to really catch on. It’s unclear whether there is enough incentive to adopt it: is there significant room between ultra-low-latency Flink use cases and the low-latency availability of Iceberg? What use cases are there where Iceberg’s low latency is too slow but real-time stream processing is unnecessary? Flink 2.0 is coming soon and has loads of upgrades for Iceberg integrations that can take advantage of killer features like time travel, while Iceberg continues to develop an ecosystem of integrations that includes Flink. Sink v2 is part of the Iceberg roadmap and will be a game changer for Flink SQL, providing incremental file compaction that will improve performance and reduce costs. It’s a positive sign that Iceberg will continue to develop integrations with Flink. After all, Iceberg has wide adoption from big organizations like Netflix, Apple, Citi, and Bloomberg, who also happen to have large Flink footprints and will be motivated to improve integrations between the two.
Cloudera perspective: Data Lakehouses have established themselves as core architectures at organizations across industries and it is becoming more clear that there is a need for Stream Processing capabilities that can be easily combined with lakehouse platforms.
Paimon might be a technology solution in search of a problem. For now, Flink plus Iceberg is the compute plus storage solution for streaming data. It’s important to place your bets strategically when choosing critical pieces of data infrastructure. There is a tremendous opportunity to simplify data architectures by combining a single unified processing engine with a single open-table storage solution. Over time, the open source community tends to consolidate efforts on a standard. Cloudera is monitoring the evolution and demand from our customers for Paimon at this stage.
Conclusion:
All in all, Flink Forward was a fantastic conference. Cloudera is proud to support and contribute to the open source community and will be looking forward to sponsoring Flink Forward again. It feels like Flink is hitting an inflection point in adoption so we expect this time next year the community will have grown and matured a great deal!
For more information on how Cloudera is bringing Flink to the enterprise with SQL Stream Builder, join our webinar on December 14.
Download Cloudera Stream Processing Community Edition for FREE and go from zero to Flink in less than an hour. Our SQL Stream Builder console is the most complete you’ll find anywhere.
Sign up for a free trial of Cloudera’s NiFi-based DataFlow and walk through use cases like stream filtering and cloud data warehouse ingest.