An Overview of Real Time Data Warehousing on Cloudera

by Justin Hayes

Posted in Business | Technical | November 02, 2020 8 min read

Users today are asking ever more from their data warehouse. This is resulting in advancements of what is provided by the technology, and a resulting shift in the art of the possible. As an example of this, in this post we look at Real Time Data Warehousing (RTDW), which is a category of use cases customers are building on Cloudera and which is becoming more and more common amongst our customers.

Let’s consider a large Asian Telecommunications provider who is rolling out 5G. They built a RTDW using Cloudera to ensure a good customer experience and to keep maintenance costs under control. Having a live view of all aspects of their network lets them identify potentially faulty hardware in real time so they can avoid impact to customer call/data service. It also gives them a chance to remediate the issue earlier, when it is cheaper to resolve.

Ingest 100s of TB of network event data per day
Updates and deletes to ensure data correctness
200,000 queries per day
Mix of ad hoc exploration, dashboarding, and alert monitoring

The capabilities that more and more customers are asking for are:

Analytics on live data AND recent data AND historical data
Correlations across data domains, even if they are not traditionally stored together (e.g. real-time customer event data alongside CRM data; network sensor data alongside marketing campaign management data)
The extreme scale of “big data”, but with the feel and semantics of “small data”
All of the above, in one integrated and secured platform

The factors driving this trend are part technical, part business, and part cultural. On the technical side, it is cheaper and easier than ever to instrument everything and send that data in real-time through a messaging system. On the business side, companies and governments are digitizing and automating as many of their operations as possible so decision making and asset management can be more effective. On the cultural side, people expect to have the answers they need available at their fingertips, immediately, and without having to go ask someone (thanks Google and Wikipedia).

One other example highlights this trend. An AdTech company in the US provides processing, payment, and analytics services for digital advertisers. Data processing and analytics drive their entire business. So they needed a data warehouse that could keep up with the scale of modern big data systems, but provide the semantics and query performance of a traditional relational database. They chose to build their RTDW on Cloudera.

Several billion ad impression events per day are streamed in and stored
Billion records are updated daily in aggregate tables
Ad hoc exploration and scheduled reports
Target per-query SLA of 1 second
Average query concurrency factor of 10

What is Real Time Data Warehousing?

The simplest way to describe a RTDW is that it looks and feels like a normal data warehouse, but everything is faster even while massive scale is maintained. It is a type of data warehouse modernization that lets you have “small data” semantics and performance at “big data” scale.

the data arrives into the warehouse faster – think streams of many millions of events per second constantly arriving
the time it takes for the data to be optimally queryable is faster – query immediately upon arrival with no need for processing or aggregation or compaction
the speed at which queries run is faster – small, selective queries are measured in 10s or 100s of milliseconds; large, scan- or compute-heavy queries are processed at very high bandwidth
mutations of the data, when needed, are fast – if data needs to be corrected or updated for whatever reason, this can be done in place without large rewrites

While this may sound obvious, and to some perhaps even trivial, decades of data warehousing have shown otherwise. It is very hard to maintain interactive performance, over large amounts of data that is arriving very fast, some of which might need updates, with a large number of queries of varying patterns. Cloudera offers RTDW capabilities that tick all these boxes. As such, many customers are building RTDW applications as part of their overall strategy of using Cloudera to modernize their data warehouse practice.

The following table provides more specifics on the use case characteristics that make up a RTDW.

Ingest	Medium to high throughput, typically streaming in Optimized for insert only as well as insert+update patterns
Query	Optimized for point lookups, analytics, mutations, etc. with low latency and high concurrency Data streamed in is queryable immediately, in an optimal manner Data streamed in is queryable in conjunction with historical data, avoiding need for Lambda Architecture
Data Model	Conventional enterprise data types Small or medium sized models; dimensional and denormalized mainly, occasionally more normalized model

Figure 1 below shows a standard architecture for a Real-Time Data Warehouse. It has the key elements of fast ingest, fast storage, and immediate querying for BI purposes.

Figure 1. Basic Architecture for Real-Time Data Warehousing

As seen in Figure 2, this is easy to extend to cover additional capabilities that might be required. These include stream processing/analytics, batch processing, tiered storage (i.e. for active archive or joining live data with historical data), or machine learning.

Figure 2. Architecture for Real-Time Data Warehousing with Extended Capabilities

Typical RTDW Applications

In addition to understanding the attributes of an RTDW, it is useful to look at the types of applications that can be built within the RTDW category.

General Purpose RTDW

These are end-to-end, high volume applications that are used for general purpose data processing, Business Intelligence, operational reporting, dashboarding, and ad hoc exploration. But an important caveat is that ingest speed, semantic richness for developers, data freshness, and query latency are paramount. They can be scoped to a single purpose (i.e. a data mart) or more comprehensively as an Enterprise Data Warehouse.

In addition to the characteristics described above in What is Real Time Data Warehousing?, a General Purpose RTDW has the following attributes:

Requires full-fledged ANSI SQL support
Updates usually required, sometimes as high percentage of the workload
- E.g. correction of data for a transaction, Change Detection Capture (CDC), Slowly Changing Dimension (SCD) type 2 logic, reordering of late-arriving event data
- Common for processing logic to be done in the stream and the updates subsequently applied in the datastore
Optimized access to both full fidelity raw data and aggregations
Optimized access to both current data and historical data

Time Series and Event Analytics Specialized RTDW

Sometimes you know that there is always a time element to the data events and to the analysis, and you know in advance the types of queries your users will run. You can take this knowledge and build a RTDW that is specialized for Time Series and Event Analytics. By doing so the benefits to ingest speed, query latency, and scalability can be huge. The tradeoff is a loss of generality of supported query patterns, which is OK because the reason that you selected this specialized approach in the first place is that it is ideal for your specific use case and you don’t need anything more general.

In addition to the RTDW characteristics described above in What is Real Time Data Warehousing?, Time Series and Event Analytics RTDWs have the following description:

Used when there is always a time element in the data
Used when query patterns are known in advance, do not require joins, are relatively static, and do not need ad hoc exploration
Used when data pre-aggregations are required for performance or scale reasons
Mix of relatively simple numeric and text data, with no complex types or long strings
Do not require updates to data

Adding Stream Analytics and Stream Processing

In some cases you need to act on the data within the stream as it is flowing into the warehouse. This could be done for stream processing (e.g. cleansing, feature engineering, CDC reconciliation) or for stream analytics (e.g. alert when threshold exceeded over a rolling window of statistics on the data, score the event data against a predictive model to decide which action to take next). Operating on data in the stream gives you the ability to make better decisions in “machine-time”, which complements the ability to make better decisions in “human-time” once the data lands in the warehouse.

Building an RTDW with Cloudera

Cloudera offers a platform, Cloudera Data Platform (CDP), for building end-to-end data applications in both the public and private cloud. CDP contains a rich array of services to move, store, process, and query your data. Some of these are offered within the Cloudera Data Warehouse (CDW) service, and others are offered within the different templated cluster types in the Cloudera DataHub service. The table below summarizes the building blocks used to create a RTDW application within CDP.

Use…	Containing…	Used For…
CDW	Hive	Flexible, scalable query engine for EDW Combines Druid data with other warehouse data in single queries
	Druid	Analytics storage and query engine for pre-aggregated event data Fast ingest of streaming data, interactive queries, very high scale
	Hue	SQL editor for running Hive and Impala queries
	DataViz (Tech Preview)	Tool for visualizing, dashboarding, and report building Connects to Druid, Impala, Hive, and other enterprise data sources
Data Hub – Real Time Data Mart Template	Kudu	Analytics storage engine for huge volumes of fast arriving data Mutability, random access, fast scans, interactive queries
	Impala	Interactive query engine for fast BI over huge volumes of data Combines Kudu data with other warehouse data in single queries
	Spark Streaming	Micro-batch stream processing engine Enrichment, filtering, transformation of data in motion
	HUE	SQL editor for running Impala+Kudu queries
Data Hub – Streams Messaging Template	Kafka	High throughput and high scale distributed event streaming Durability of messages for guaranteed delivery
Data Hub – Flow Management Template	Nifi	Data integration, distribution, and routing engine Glue combining multiple data engines into end-to-end flows
Data Hub – Streaming Analytics Template	Flink	Engine providing stateful analytics computations over data streams Alerting, scoring, decision-making on data in motion

The following two diagrams show how these components work in conjunction for the different flavors of RTDW applications, as described above.

Figure 3. General Purpose Real-Time Data Warehouse in Cloudera CDP

Figure 4. Time Series and Event Analytics Specialized Real-Time Data Warehouse in Cloudera CDP

What’s Next?

Please keep an eye out for the next posts in this series, where we will discuss these two flavors of RTDW in more depth. These will provide more details on how the technologies work together and how you can build your own RTDW applications.

Deep Dive into Time Series and Event Analytics Specialized RTDW, featuring Apache Druid, Apache Hive, Apache Kafka, and Cloudera DataViz.
Deep Dive into General Purpose RTDW, featuring Apache Kudu, Apache Impala, and Apache NiFi.

In the meantime, if you want to learn more, please check out this video, which shows how to build an end-to-end Event Analytics application in CDP, using Apache Kafka, Apache Druid, Apache Hive, and Cloudera DataViz. In addition, we have a webinar and blog explaining how you can use Apache Kudu and Apache Impala to create a time series application within CDP. And finally, if you want to learn more about using CDP to do analytics, processing, and routing of data within the stream, please watch this video, which highlights Apache NiFi and Apache Kafka.

Justin Hayes

Data Warehouse Product Manager

More by this author

Editor's Choice

Business

Generative AI for the Enterprise

Technical

Building Trust in Public Sector AI Starts with Trusting Your Data

An Overview of Real Time Data Warehousing on Cloudera

What is Real Time Data Warehousing?

Typical RTDW Applications

CDW

Data Hub –

Data Hub –

Data Hub –

Data Hub –

Editor's Choice

Leave a comment Cancel reply