Cloudera’s data-in-motion architecture is a comprehensive set of scalable, modular, re-composable capabilities that help organizations deliver smart automation and real-time data products with maximum efficiency while remaining agile to meet changing business needs. In this blog, we will examine the “why” behind streaming data and review some high-level guidelines for how organizations should build their data-in-motion architecture of the future.
Businesses everywhere seek to be more data-driven not just when it comes to big strategic decisions, but also when it comes to the many low-level operational decisions that must be made every day, every hour, every minute, and, in many cases, every second. The transformative power of incremental improvement at the operational level has been proven many times over. Executing better on the processes that make up your value chain is bound to reap benefits. Take a hypothetical manufacturer for example. On the shop floor, myriad low-level decisions add up to manufacturing excellence, including:
- Inventory management
- Equipment health and performance monitoring
- Production monitoring
- Quality control
- Supply chain management
It’s no wonder that businesses are working harder than ever to embed data deeper into operations. In 2022, McKinsey imagined the Data-Driven Enterprise of 2025, where winner-takes-all market dynamics incentivize organizations to pull out all the stops and adopt the virtuous cycle of iterative improvement. It is very telling that, of the seven characteristics highlighted in that piece, the first two are:
- Data should be embedded in every decision, interaction, and process
- Data should be processed and delivered in real time
Notice that McKinsey isn’t talking about how fast data is created. They are talking about data being processed and delivered in real time. It is not the speed at which data is created that determines an organization’s response time to a critical event, it’s how quickly they can execute an end-to-end workflow and deliver processed data that determines their response. A sensor on a machine recording a vibration, on its own, has very little value. What matters is how fast that data can be captured, processed to put that vibration reading within the context of the machine’s health, used to identify an anomaly, and delivered to a person or system that can take action.
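The capture-to-action loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's API; the machine IDs, baseline values, and anomaly threshold are hypothetical placeholders:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VibrationReading:
    machine_id: str
    mm_per_sec: float  # vibration velocity reported by the sensor

# Hypothetical per-machine baselines; in practice these would come
# from a machine-health model or historical data.
BASELINES = {"press-01": 2.8}
ANOMALY_FACTOR = 1.5  # flag readings more than 50% above baseline

def process(reading: VibrationReading) -> Optional[dict]:
    """Put a raw reading in the context of the machine's normal
    behavior and emit an alert payload if it looks anomalous."""
    baseline = BASELINES.get(reading.machine_id)
    if baseline is None:
        return None  # unknown machine: no context to compare against
    if reading.mm_per_sec > baseline * ANOMALY_FACTOR:
        return {
            "machine_id": reading.machine_id,
            "severity": "warning",
            "observed": reading.mm_per_sec,
            "baseline": baseline,
        }
    return None

# 4.5 mm/s exceeds 2.8 * 1.5 = 4.2, so this produces an alert
alert = process(VibrationReading("press-01", 4.5))
```

The point is not the arithmetic but the latency of the loop: the value of the alert depends entirely on how quickly the reading can travel from sensor to `process` to a person or system that can act on it.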
Businesses are challenged, however, with transforming legacy architectures to deliver real-time data that is ready for business use. For many organizations, the analytics stack was built to consolidate transactional data in batches, often over multiple steps, to report on Key Performance Indicators (KPIs). These stacks were never built for real-time data, yet they are still the primary means of moving and processing data for most data teams. To move through them, real-time data must first come to rest and wait to make its way through each step. By the time it is ready for analysis, it is a historical view of what happened, and the opportunity to act on events in real time has passed, reducing the value of the insights.
The growing number of disparate sources that business analysts and data scientists need access to further complicates efforts. Unfortunately, a lot of enterprise data is underutilized. Underutilized data often leads to lost opportunities as data loses its value, or decays, over time. For example, 50% of organizations admit that their data loses value within hours, and only 26% said their streaming data is analyzed in real time. If an organization is struggling to utilize data before it decays, it fails to fully leverage the high-speed data in which it has invested.
Before we go any further, let’s clarify what data in motion is. Data in motion, simply put, is any data that is not at rest in permanent storage. It includes data that is streaming – a continuous series of discrete events that happen at a point in time, such as sensor readings. It also includes data that is currently moving through an organization’s systems. For example, a record of login attempts being sent from an authentication server to a Security Information and Event Management tool is also data in motion. By contrast, data at rest isn’t doing much besides waiting to be queried. Data in motion is active data that is flowing.
Data-in-motion architecture is about building the scalable data infrastructure required to remove friction that might impede active data from flowing freely across the enterprise. It’s about building strategic capabilities to make real-time data a first-class citizen. Data in motion is much more than just streaming.
Delivering real-time insights at scale with the efficiency and agility needed to compete in today’s business environment requires more than just building streaming pipelines to move high-velocity data into an old analytics stack. The three key elements of a data-in-motion architecture are:
- Scalable data movement is the ability to efficiently pre-process data from any system or device into a real-time stream, incrementally, as soon as that data is produced. Classic Extract, Transform, & Load (ETL) tools offer similar functionality, but they typically rely on batching or micro-batching rather than moving data incrementally. Thus, they are not built for true real-time delivery.
- Enterprise stream management is the ability to manage an intermediary that can broker real-time data between any number of “publishing” sources and “subscribing” destinations. This capability is the backbone of building real-time use cases, and it eliminates the need to build sprawling point-to-point connections across the enterprise. Management involves utilizing tools to easily connect publishing and subscribing applications, ensure data quality, route data, and monitor health and performance as streams scale.
- Democratized stream processing is the ability of non-coder domain experts to apply transformations, rules, or business logic to streaming data to identify complex events in real time and trigger automated workflows and/or deliver decision-ready data to users. This capability converts large volumes of raw data into contextualized data that is ready for use in a business process. Domain experts need to have access to inject their knowledge into data before it is distributed across the organization. A traditional analytics stack typically has this functionality spread out over multiple inefficient steps.
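To make the stream management and stream processing capabilities concrete, here is a toy, in-memory Python sketch of the publish/subscribe pattern with a domain rule applied to a stream. A real deployment would use an enterprise broker and stream processor; the `Broker` class, topic names, and the order-flagging rule are invented for illustration:

```python
from collections import defaultdict
from typing import Callable

class Broker:
    """Toy in-memory stand-in for an enterprise stream broker:
    any number of publishers feed a topic, any number of
    subscribers consume it, with no point-to-point wiring."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(event)

broker = Broker()
alerts: list[dict] = []

# A domain expert's rule: flag large orders, enriching the event
# with context before it is distributed onward.
def large_order_rule(event: dict) -> None:
    if event["amount"] > 1000:
        broker.publish("orders.flagged", {**event, "reason": "large_order"})

broker.subscribe("orders", large_order_rule)
broker.subscribe("orders.flagged", alerts.append)

broker.publish("orders", {"order_id": 1, "amount": 250})
broker.publish("orders", {"order_id": 2, "amount": 5000})
# alerts now holds only the single flagged order
```

Note the separation of concerns: the source system only publishes raw orders, the broker handles distribution, and the business rule lives in its own subscriber, so any one of the three can be changed without touching the others.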
To transform business operations with data embedded in every process and decision, a data-in-motion architecture must be able to capture data from any source system, process that data within the context of the processes and decisions that need to be made, and distribute it to any number of destinations in real time. As organizations scale, the benefits of data in motion grow exponentially. The hallmark of an effective data-in-motion architecture is maximal data utilization with minimal latency across the organization. Examples of this include:
- An order flowing across an e-commerce organization to provide real-time updates to marketing, fulfillment, supply chain, finance, and customer service, enabling efficient operations and delighting customers.
- A user session on a telco network flowing across the organization and being utilized by various processes, including fraud detection, network optimization, billing, marketing, and customer service.
With data in motion enabling true real-time, analysts can get fresh, up-to-the-second, processed data ready for analysis, improving the quality of insights and accelerating their time to value.
A data-in-motion architecture delivers these capabilities in a way that makes them independently modifiable. That way, organizations can adopt technology that meets their current needs and continue to build their streaming maturity as they go. It should be easy to do things like onboard a new sensor stream when a manufacturing production line has been retrofitted with sensors, using data movement capabilities to bring data into an existing stream without modifying the entire architecture. It should be possible to add new rules to how streaming data is managed without rebuilding connectivity to the source system. Similarly, it should be easy to add new logic to real-time monitoring for cybersecurity threats when a new tactic is identified. As demand for real-time data continues to grow and new data sources and applications come online, it should be effortless to scale up the necessary components independently without compromising the efficient use of resources. The speed with which an enterprise can change the way it captures, processes, and distributes data is essential for organizational agility.
Capturing, processing, and distributing real-time data at scale is critical to unlocking new opportunities to drive operational efficiency. The ability to do so at scale is the key to reaping greater economic value. The ability to remain agile is critical to sustaining innovation speed. Additionally, the value of architectural simplicity cannot be overstated. In a recent paper, Harvard Business School professor and technology researcher Marco Iansiti collaborated with economist Ruiqing Cao to model “data architecture coherence” and the cascading benefits of sustained innovation speed across an enterprise. A coherent data architecture, in Professor Iansiti’s definition, is one that is simple to understand and modify, and that is well aligned with business processes and broader digital transformation goals. Professor Iansiti theorizes that the real driving force behind the innovation speed of many digital natives is not culture so much as a coherent data architecture that lends itself to a rapid, iterative approach to business process optimization. The reduction in redundant tools and process steps can be quantified in terms of licensing, resource utilization, personnel impact, and administrative overhead. However, these savings are dwarfed by the value of the sustained innovation speed that coherent data architectures deliver, enabling constant incremental improvement at the operational level.
Cloudera’s holistic approach to real-time data is designed to help organizations build a data-in-motion architecture that simplifies legacy processes for data movement as it scales.
Ready to take action? Get started by reviewing the GigaOm Radar for Streaming Data Platforms to see how vendors stack up in this space.