An Overview of Real Time Data Warehousing on Cloudera

Users today are asking ever more from their data warehouse. This is resulting in advancements of what is provided by the technology, and a resulting shift in the art of the possible. As an example of this, in this post we look at Real Time Data Warehousing (RTDW), which is a category of use cases customers are building on Cloudera and which is becoming more and more common amongst our customers.

Let’s consider a large Asian Telecommunications provider who is rolling out 5G. They built a RTDW using Cloudera to ensure a good customer experience and to keep maintenance costs under control. Having a live view of all aspects of their network lets them identify potentially faulty hardware in real time so they can avoid impact to customer call/data service. It also gives them a chance to remediate the issue earlier, when it is cheaper to resolve.

  • Ingest 100s of TB of network event data per day 
  • Updates and deletes to ensure data correctness
  • 200,000 queries per day
  • Mix of ad hoc exploration, dashboarding, and alert monitoring

The capabilities that more and more customers are asking for are:

  • Analytics on live data AND recent data AND historical data
  • Correlations across data domains, even if they are not traditionally stored together (e.g. real-time customer event data alongside CRM data; network sensor data alongside marketing campaign management data)
  • The extreme scale of “big data”, but with the feel and semantics of “small data”
  • All of the above, in one integrated and secured platform

The factors driving this trend are part technical, part business, and part cultural. On the technical side, it is cheaper and easier than ever to instrument everything and send that data in real-time through a messaging system. On the business side, companies and governments are digitizing and automating as many of their operations as possible so decision making and asset management can be more effective. On the cultural side, people expect to have the answers they need available at their fingertips, immediately, and without having to go ask someone (thanks Google and Wikipedia).

One other example highlights this trend. An AdTech company in the US provides processing, payment, and analytics services for digital advertisers. Data processing and analytics drive their entire business. So they needed a data warehouse that could keep up with the scale of modern big data systems, but provide the semantics and query performance of a traditional relational database. They chose to build their RTDW on Cloudera.

  • Several billion ad impression events per day are streamed in and stored
  • Billion records are updated daily in aggregate tables
  • Ad hoc exploration and scheduled reports
  • Target per-query SLA of 1 second
  • Average query concurrency factor of 10

What is Real Time Data Warehousing?

The simplest way to describe a RTDW is that it looks and feels like a normal data warehouse, but everything is faster even while massive scale is maintained. It is a type of data warehouse modernization that lets you have “small data” semantics and performance at “big data” scale.

  • the data arrives into the warehouse faster – think streams of many millions of events per second constantly arriving
  • the time it takes for the data to be optimally queryable is faster – query immediately upon arrival with no need for processing or aggregation or compaction
  • the speed at which queries run is faster – small, selective queries are measured in 10s or 100s of milliseconds; large, scan- or compute-heavy queries are processed at very high bandwidth
  • mutations of the data, when needed, are fast – if data needs to be corrected or updated for whatever reason, this can be done in place without large rewrites

While this may sound obvious, and to some perhaps even trivial, decades of data warehousing have shown otherwise. It is very hard to maintain interactive performance, over large amounts of data that is arriving very fast, some of which might need updates, with a large number of queries of varying patterns. Cloudera offers RTDW capabilities that tick all these boxes. As such, many customers are building RTDW applications as part of their overall strategy of using Cloudera to modernize their data warehouse practice.

The following table provides more specifics on the use case characteristics that make up a RTDW.

Ingest
  • Medium to high throughput, typically streaming in
  • Optimized for insert only as well as insert+update patterns
Query
  • Optimized for point lookups, analytics, mutations, etc. with low latency and high concurrency
  • Data streamed in is queryable immediately, in an optimal manner
  • Data streamed in is queryable in conjunction with historical data, avoiding need for Lambda Architecture
Data Model
  • Conventional enterprise data types
  • Small or medium sized models; dimensional and denormalized mainly, occasionally more normalized model

Figure 1 below shows a standard architecture for a Real-Time Data Warehouse. It has the key elements of fast ingest, fast storage, and immediate querying for BI purposes.

Figure 1. Basic Architecture for Real-Time Data Warehousing

As seen in Figure 2, this is easy to extend to cover additional capabilities that might be required. These include stream processing/analytics, batch processing, tiered storage (i.e. for active archive or joining live data with historical data), or machine learning.

Figure 2. Architecture for Real-Time Data Warehousing with Extended Capabilities

 

Typical RTDW Applications

In addition to understanding the attributes of an RTDW, it is useful to look at the types of applications that can be built within the RTDW category.

General Purpose RTDW

These are end-to-end, high volume applications that are used for general purpose data processing, Business Intelligence, operational reporting, dashboarding, and ad hoc exploration. But an important caveat is that ingest speed, semantic richness for developers, data freshness, and query latency are paramount. They can be scoped to a single purpose (i.e. a data mart) or more comprehensively as an Enterprise Data Warehouse.

In addition to the characteristics described above in What is Real Time Data Warehousing?, a General Purpose RTDW has the following attributes:

  • Requires full-fledged ANSI SQL support
  • Updates usually required, sometimes as high percentage of the workload
    • E.g. correction of data for a transaction, Change Detection Capture (CDC), Slowly Changing Dimension (SCD) type 2 logic, reordering of late-arriving event data
    • Common for processing logic to be done in the stream and the updates subsequently applied in the datastore
  • Optimized access to both full fidelity raw data and aggregations
  • Optimized access to both current data and historical data

Time Series and Event Analytics Specialized RTDW

Sometimes you know that there is always a time element to the data events and to the analysis, and you know in advance the types of queries your users will run. You can take this knowledge and build a RTDW that is specialized for Time Series and Event Analytics. By doing so the benefits to ingest speed, query latency, and scalability can be huge. The tradeoff is a loss of generality of supported query patterns, which is OK because the reason that you selected this specialized approach in the first place is that it is ideal for your specific use case and you don’t need anything more general.

In addition to the RTDW characteristics described above in What is Real Time Data Warehousing?, Time Series and Event Analytics RTDWs have the following description:

  • Used when there is always a time element in the data
  • Used when query patterns are known in advance, do not require joins, are relatively static, and do not need ad hoc exploration
  • Used when data pre-aggregations are required for performance or scale reasons
  • Mix of relatively simple numeric and text data, with no complex types or long strings
  • Do not require updates to data

Adding Stream Analytics and Stream Processing

In some cases you need to act on the data within the stream as it is flowing into the warehouse. This could be done for stream processing (e.g. cleansing, feature engineering, CDC reconciliation) or for stream analytics (e.g. alert when threshold exceeded over a rolling window of statistics on the data, score the event data against a predictive model to decide which action to take next). Operating on data in the stream gives you the ability to make better decisions in “machine-time”, which complements the ability to make better decisions in “human-time” once the data lands in the warehouse.

Building an RTDW with Cloudera

Cloudera offers a platform, Cloudera Data Platform (CDP), for building end-to-end data applications in both the public and private cloud. CDP contains a rich array of services to move, store, process, and query your data. Some of these are offered within the Cloudera Data Warehouse (CDW) service, and others are offered within the different templated cluster types in the Cloudera DataHub service. The table below summarizes the building blocks used to create a RTDW application within CDP.

Use… Containing… Used For…

CDW

Hive
  • Flexible, scalable query engine for EDW
  • Combines Druid data with other warehouse data in single queries
Druid
  • Analytics storage and query engine for pre-aggregated event data
  • Fast ingest of streaming data, interactive queries, very high scale
Hue
  • SQL editor for running Hive and Impala queries
DataViz

(Tech Preview)

  • Tool for visualizing, dashboarding, and report building
  • Connects to Druid, Impala, Hive, and other enterprise data sources

Data Hub –

Real Time Data Mart Template

Kudu
  • Analytics storage engine for huge volumes of fast arriving data
  • Mutability, random access, fast scans, interactive queries
Impala
  • Interactive query engine for fast BI over huge volumes of data
  • Combines Kudu data with other warehouse data in single queries
Spark Streaming
  • Micro-batch stream processing engine
  • Enrichment, filtering, transformation of data in motion
HUE
  • SQL editor for running Impala+Kudu queries

Data Hub –

Streams Messaging Template

Kafka
  • High throughput and high scale distributed event streaming
  • Durability of messages for guaranteed delivery

Data Hub – 

Flow Management Template

Nifi
  • Data integration, distribution, and routing engine
  • Glue combining multiple data engines into end-to-end flows

Data Hub – 

Streaming Analytics Template

Flink
  • Engine providing stateful analytics computations over data streams
  • Alerting, scoring, decision-making on data in motion

 

The following two diagrams show how these components work in conjunction for the different flavors of RTDW applications, as described above.

Figure 3. General Purpose Real-Time Data Warehouse in Cloudera CDP

Figure 4. Time Series and Event Analytics Specialized Real-Time Data Warehouse in Cloudera CDP

 

What’s Next?

Please keep an eye out for the next posts in this series, where we will discuss these two flavors of RTDW in more depth. These will provide more details on how the technologies work together and how you can build your own RTDW applications.

  • Deep Dive into Time Series and Event Analytics Specialized RTDW, featuring Apache Druid, Apache Hive, Apache Kafka, and Cloudera DataViz.
  • Deep Dive into General Purpose RTDW, featuring Apache Kudu, Apache Impala, and Apache NiFi.

In the meantime, if you want to learn more, please check out this video, which shows how to build an end-to-end Event Analytics application in CDP, using Apache Kafka, Apache Druid, Apache Hive, and Cloudera DataViz. In addition, we have a webinar and blog explaining how you can use Apache Kudu and Apache Impala to create a time series application within CDP. And finally, if you want to learn more about using CDP to do analytics, processing, and routing of data within the stream, please watch this video, which highlights Apache NiFi and Apache Kafka.

Leave a comment

Your email address will not be published. Links are not permitted in comments.