Top Three Requirements for Data Flows

Data flows are an integral part of every modern enterprise. Whether they move data from one operational system to another to power a business process or feed central data warehouses with the latest data for near-real-time reporting, life without them would be full of manual, tedious, and error-prone data modification and copying tasks.

At Cloudera, we’re helping our customers implement data flows on-premises and in the public cloud using Apache NiFi, a core component of Cloudera DataFlow. While Apache NiFi is used successfully by hundreds of our customers to power mission-critical and large-scale data flows, the expectations for enterprise data flow solutions are constantly evolving. In this blog post, I want to share the top three requirements for data flows in 2021 that we hear from our customers.

Data comes in bursts – The need for auto-scaling in minutes

As businesses move more and more toward real-time data movement instead of hourly/daily batches, data bursts become more visible and less predictable, for two main reasons:

  1. Once the hourly/daily batch windows are removed, there’s nothing left that aggregates and averages out lows and peaks. If there is a data burst lasting for five minutes, followed by a calm period of another five minutes, the data flow system has to deliver the expected performance throughout both periods without wasting resources. A batch system ingesting data every hour would have averaged out these bursts.
  2. Moving to real-time data flows is an opportunity to connect new streaming data sources to the data lifecycle that did not fit the previous batch model. While these new sources increase the amount of data a data flow system has to process, more often than not they send data over unreliable network connections, with each network outage resulting in its own data burst.

To successfully embrace streaming data, businesses – especially in public cloud environments – need to balance high-performance data processing against the associated compute costs. The best way to achieve this balance is a service that auto-scales and comes with built-in cost control.
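To make that idea concrete, here is a minimal sketch of the kind of scaling loop such a service might run behind the scenes. It polls NiFi’s standard /nifi-api/flow/status REST endpoint for the FlowFile backlog and adjusts the node count between a floor and a cost-capping ceiling. The NIFI_URL endpoint, the scale_cluster() helper, and the thresholds are illustrative assumptions, not Cloudera DataFlow’s actual implementation.

```python
import time
import requests

NIFI_URL = "https://nifi.example.com"   # assumed NiFi endpoint (illustrative)
MIN_NODES, MAX_NODES = 1, 5             # MAX_NODES doubles as the cost cap
SCALE_UP_QUEUE = 100_000                # illustrative backlog thresholds
SCALE_DOWN_QUEUE = 10_000


def queued_flowfiles() -> int:
    """Read the cluster-wide FlowFile backlog from NiFi's status endpoint."""
    status = requests.get(f"{NIFI_URL}/nifi-api/flow/status", timeout=10).json()
    return status["controllerStatus"]["flowFilesQueued"]


def scale_cluster(nodes: int) -> None:
    """Hypothetical helper: ask the platform (e.g. Kubernetes) for `nodes` NiFi nodes."""
    print(f"scaling cluster to {nodes} node(s)")


def autoscale_loop(current_nodes: int = MIN_NODES) -> None:
    while True:
        backlog = queued_flowfiles()
        if backlog > SCALE_UP_QUEUE and current_nodes < MAX_NODES:
            current_nodes += 1          # burst detected: add capacity
            scale_cluster(current_nodes)
        elif backlog < SCALE_DOWN_QUEUE and current_nodes > MIN_NODES:
            current_nodes -= 1          # calm period: release capacity to save cost
            scale_cluster(current_nodes)
        time.sleep(60)                  # evaluate once a minute


if __name__ == "__main__":
    autoscale_loop()
```

The point of the sketch is the cost-control boundary: scaling up is only allowed until the node ceiling is reached, and capacity is released as soon as the burst subsides.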

Self-service is king – The need for a data flow catalog

Even though no-code, graphical tools like Apache NiFi make building data flows more accessible to non-coders, most data flows are still being built by specialized teams who are solely responsible for data integration.

With the move towards streaming data and the desire for Line of Business (LoB) teams to gain access to data faster, these centralized teams are struggling to keep up with the ever-growing list of data flows that the business users want implemented.

Data flows follow the 80/20 rule: roughly 80% of them cover the same common use cases and patterns, while the remaining 20% are complex enough to require an in-depth understanding of the data flow product and need to be customized from the ground up. What if the specialized data integration team could focus on that 20%, while LoB users could pick and adjust flow templates from a vetted and tested repository?

A self-service catalog providing out-of-the-box data flow templates gives LoB users the speed and agility they need to support new business initiatives, while data flow developers can truly focus on implementing the challenging 20% of all data flows.
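Apache NiFi Registry already exposes a REST API for versioned flows that could back such a catalog. As a rough sketch, the snippet below lists the flows stored in a bucket of approved templates; the REGISTRY_URL endpoint and the "vetted-templates" bucket name are assumptions for illustration.

```python
import requests

REGISTRY_URL = "https://registry.example.com/nifi-registry-api"  # assumed NiFi Registry endpoint
VETTED_BUCKET = "vetted-templates"  # illustrative bucket holding approved flow templates


def list_vetted_flows() -> None:
    """Print the approved, versioned flows a LoB user could start from."""
    buckets = requests.get(f"{REGISTRY_URL}/buckets", timeout=10).json()
    bucket = next(b for b in buckets if b["name"] == VETTED_BUCKET)

    flows = requests.get(
        f"{REGISTRY_URL}/buckets/{bucket['identifier']}/flows", timeout=10
    ).json()
    for flow in flows:
        print(f"{flow['name']} (v{flow['versionCount']}): {flow.get('description', '')}")


if __name__ == "__main__":
    list_vetted_flows()
```

A LoB user browsing this list would import one of the versioned flows into their own environment and adjust a handful of parameters, rather than building the flow from scratch.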

Deploy anywhere – The need for central monitoring

Multi-cloud is becoming a reality for many of our customers. Multi-cloud does not necessarily mean that one single use case is implemented and pieced together across public cloud providers, but rather that different lines of business have selected a public cloud provider based on their needs.

While each cloud provider offers products to build data flows that connect systems and applications, each is based on completely different technology, requiring the data integration team to learn all of them. Even if the team manages to implement its data flows with these different technologies, a consistent approach to monitoring production flows is still missing, so the team also has to learn dedicated monitoring tools and figure out how to integrate each public cloud service with them.

To stay productive in an environment where multiple public cloud providers are used, integration teams need a data flow system that runs on all major public clouds and offers centralized monitoring of all data flows, whether they run on AWS, Azure, or GCP.
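As a rough illustration of what that centralized view could look like, the sketch below polls the standard NiFi status endpoint of a deployment in each cloud and prints one consolidated status line per provider. The endpoint URLs are placeholders, and a production setup would feed a proper monitoring dashboard rather than print to the console.

```python
import requests

# Placeholder endpoints for NiFi-based flows deployed in each public cloud
DEPLOYMENTS = {
    "aws":   "https://nifi.aws.example.com",
    "azure": "https://nifi.azure.example.com",
    "gcp":   "https://nifi.gcp.example.com",
}


def collect_status() -> None:
    """Poll every deployment and print one consolidated status line per cloud."""
    for cloud, url in DEPLOYMENTS.items():
        try:
            status = requests.get(f"{url}/nifi-api/flow/status", timeout=10).json()
            ctrl = status["controllerStatus"]
            print(f"{cloud:>5}: {ctrl['activeThreadCount']} active threads, "
                  f"{ctrl['flowFilesQueued']} FlowFiles queued")
        except requests.RequestException as err:
            print(f"{cloud:>5}: unreachable ({err})")


if __name__ == "__main__":
    collect_status()
```

Because every deployment speaks the same API regardless of the cloud it runs on, the integration team maintains one monitoring approach instead of three.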

Where do we go from here?

At Cloudera, we are working hard to solve these challenges with our Cloudera DataFlow capabilities on the Cloudera Data Platform (CDP). If this piqued your interest and you recognize any of the challenges outlined above, contact Cloudera; our representatives around the world are ready to respond. Otherwise, stay tuned for more information about how Cloudera DataFlow on CDP can help you tame your data flows.

Michael Kohs