The Cloudera Data Warehouse (CDW) service is a managed data warehouse that runs Cloudera’s powerful engines on a containerized architecture. It is part of the new Cloudera Data Platform, or CDP, which went live on Microsoft Azure earlier this year. The CDW service lets you meet SLAs, onboard new use cases with zero friction, and minimize cost. Today, we are pleased to announce the general availability of CDW on Microsoft Azure. This service is available as part of CDP through the Azure Marketplace.
When discussing data warehousing with our customers, three scenarios frequently come up. The business can never get what it needs soon enough. SLAs are often missed, especially as the number of users and use cases grows. And there are pressures, if not outright mandates, to move to public cloud.
While there are many factors leading to these scenarios, there is only one answer for what to do about it: CDW. This post describes a representative example of what our customers are facing, and explains how CDW addresses the problems. It also looks at the key role that several Azure services, such as Azure Kubernetes Service and ADLS Gen2, play within this solution.
We will examine a company that manufactures equipment that is used in airplanes. Like many enterprises, it has large numbers of analysts poring over curated data, Line of Business (LOB) managers focused on operational excellence, and data scientists seeking competitive advantage buried within new data sets. But as with many of our customers, it faces challenges, as seen in our four protagonists below:
- Ramesh’s team of business analysts build reports on which the business is run. But as the team has grown, the ability of the warehouse to meet SLAs and stay on budget has declined.
- CDW gives Ramesh cost-efficient, scalable reporting and dashboarding so their SLAs are never missed.
- Kelly is a data architect who needs to run ad hoc exploration workloads for campaign analytics. But she is not allowed to use the warehouse because of the risk of causing contention with SLA-bound workloads.
- CDW lets Kelly do her work on data in the warehouse, with no performance impact on other workloads.
- Olivia, a data scientist, cannot get capacity in the warehouse to explore new supply chain data. Thus there is a missed opportunity for optimization.
- CDW gives Olivia unlimited compute resources to throw at any data in object storage, available within minutes.
- Mariana is an operations manager who needs a real-time view of high volume sensor data and the ability to combine that with customer experience data. The current warehouse cannot handle this volume or diversity, so Mariana must use precious budget to build yet another silo.
- CDW gives Mariana a single platform that can do traditional data warehousing as well as new use cases requiring different techniques… all while keeping a single copy of every data set and leveraging shared metadata and security.
In the sections below we will explain further how CDW and Azure provide these capabilities.
Capability 1 – Cost-Efficient, Scalable Reporting and Dashboarding on Curated Data
Ramesh and his team of business analysts run reports off and on all day long. The business is run on insights delivered by his team – especially those related to customer sentiment, which are critical given the recent drop in travel spending. They cannot miss an SLA; otherwise the business is flying blind. Reports must be delivered even though data volumes and the number of analysts are growing while budgets are shrinking.
The compute resources in a CDW virtual warehouse (VW) remain suspended, incurring no cost, whenever there are no queries. As soon as Ramesh’s first query arrives in the morning when he gets to work, the VW starts up automatically. If the query load later increases to the point of saturation, because Ramesh’s many colleagues all come online later in the morning, the VW will detect this and provision more compute resources to handle the load while maintaining performance. This is called autoscaling. Once the load drops back to a lower level (his colleagues all went to lunch without him), those additional compute resources are let go, so they no longer incur cost. And finally, at the end of the day when Ramesh leaves work and the queries are all finished, the VW automatically suspends itself, again dropping to a status of no cost.
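The suspend, start, scale-up, scale-down cycle described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class name, the node limits, and the queries-per-node saturation threshold are all hypothetical and are not CDW settings, and the model reacts instantly, whereas a real autoscaler would apply delays before adding or releasing nodes.

```python
from dataclasses import dataclass


@dataclass
class VirtualWarehouse:
    """Toy model of an autoscaling virtual warehouse (all limits are hypothetical)."""

    min_nodes: int = 1         # nodes kept while at least one query is running
    max_nodes: int = 10        # upper bound on scale-out
    queries_per_node: int = 4  # assumed load at which one node is saturated
    nodes: int = 0             # 0 means suspended, i.e. no compute cost

    def observe(self, running_queries: int) -> int:
        """Adjust the node count for the current load and return it."""
        if running_queries == 0:
            # No work at all: suspend and stop incurring compute cost.
            self.nodes = 0
        else:
            # Ceiling division: nodes needed to stay below saturation.
            needed = -(-running_queries // self.queries_per_node)
            self.nodes = max(self.min_nodes, min(self.max_nodes, needed))
        return self.nodes


# A working day: idle, first query, morning rush, peak, lunch lull, end of day.
vw = VirtualWarehouse()
timeline = [0, 1, 12, 30, 3, 0]
history = [vw.observe(q) for q in timeline]
print(history)
```

The node count follows the load in both directions and returns to zero at the end of the day, which is the pay-only-for-what-is-needed behavior the scenario relies on.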
CDW is able to provide this pay-only-for-what’s-needed capability by using Azure Kubernetes Service (AKS) to quickly provision compute pods, and release them when no longer needed. These pods use the Standard_E16_v3 compute instance size (16 vCPU, 128 GiB RAM, 400 GiB local SSD). AKS ends up using VM scale sets behind the scenes to enable and control autoscaling.
Once Ramesh’s team are running their queries, they are able to meet their SLAs in large part via the three levels of caching built into the service:
- Data Cache – the first time a piece of data is read from ADLS it is cached on the compute node that used it. Subsequent queries requiring the same data get it from the local cache as opposed to ADLS. Both Hive LLAP and Impala VWs support this cache type.
- Result Set Cache – once the results are sent back to the client, the result set is also cached on storage on the HiveServer2 (HS2) node. If the exact same query arrives again (which is common in dashboarding and BI use cases) then the results are served directly from the HS2 cache. Currently only Hive LLAP VWs support this cache type.
- Materialized Views – you can define the structure and contents of a materialized view (MV), which Hive populates with data selected out of the base tables. For subsequent queries that access the base tables, if Hive detects that the data can be served out of the MV then it will transparently rewrite the query to use it, thus avoiding the need to again scan the base tables, join the data, aggregate it, etc. Currently only Hive LLAP VWs support this.
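To make the result set cache idea concrete, here is a minimal sketch of a cache keyed on the exact query text. It is not how HiveServer2 implements its cache; the `ResultSetCache` class, the `fake_execute` function, and the `sentiment` table are hypothetical names used for illustration. It also ignores invalidation, which a real cache needs when the underlying tables change.

```python
class ResultSetCache:
    """Toy result cache keyed on the exact query text (illustrative only)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def run(self, sql, execute):
        """Return cached results for an identical query, else execute and cache."""
        key = sql.strip()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = execute(sql)
        self._store[key] = result
        return result


calls = []


def fake_execute(sql):
    """Stand-in for the query engine; records each real execution."""
    calls.append(sql)
    return [("2020-06", 42)]


cache = ResultSetCache()
query = "SELECT month, COUNT(*) FROM sentiment GROUP BY month"
first = cache.run(query, fake_execute)   # executes against the engine
second = cache.run(query, fake_execute)  # served from the cache
print(len(calls), cache.hits, cache.misses)
```

A dashboard that refreshes the same query repeatedly only pays for the first execution, which is why this kind of cache suits BI workloads.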
With this level of intelligence and performance optimization, Ramesh and team can grow as the data volumes and business demands grow, while only ever paying for the resources they need to actually do their job.
Capability 2 – Ad Hoc Exploration to Complement SLA-Bound Workloads
Kelly, the data architect, was asked by the CMO to provide metrics quantifying the impact of the recent marketing campaign. The warehouse has the required data, but is also running at full capacity. Kelly will need to explore the data with a variety of query types, and is uncertain how long it will take or how much CPU and memory she will need. With such vague requirements, IT will not let her do this work on the data warehouse due to the risk of impacting SLA-bound operational workloads. Her queries might eat up the CPU resources and evict all the hot data from the cache. Thus the CMO has no metrics to help understand the impact from their marketing investment.
With CDW Kelly can have her own compute environment that can query the warehouse data, but remain completely isolated from the other SLA-bound workloads. CDW can do this by managing data context (table definitions, authorization policies, metadata) separately from the storage and compute layers. That way multiple compute environments can all share the same data context. Cloudera Shared Data Experience (SDX) is the term given to this managed context.
A key enabling capability for SDX is the ability to reliably store metadata and security rules in a persistent database. We use Azure Database for PostgreSQL for this, using the Gen5 4 vCore, Memory Optimized option. This managed Postgres service is easy to integrate with, highly available, and trivial to administer. Using this as a single source of truth for metadata and other persistent state, CDW can safely run as many compute environments in parallel as your workloads demand.
Another approach that CDW provides when compute resources are needed in situations like this is to burst your workload from an on-premises CDH or HDP cluster to CDP running in the public cloud. In this scenario the Workload Manager tool is used to profile your on-premises workload, identify a candidate workload suitable for bursting (in this case, ad hoc exploration queries that interfere with SLA-bound queries), then replicate the data and metadata to CDP. The workload can now be run safely in your cloud environment. If you do this, you would likely want to use Microsoft ExpressRoute to ensure good performance and consistent latencies for the data movement.
Capability 3 – Quick Provisioning to Keep Up with Speed of Business
Olivia, the data scientist, occasionally needs to test out hypotheses for supply chain optimization using new data files which are not yet in the warehouse. But central IT never plans for such bursty workloads and does not have the resources to do a new ETL project to incorporate this new data – whose value is as yet unproven – into the warehouse. This results in a missed opportunity to reduce cost of, and mitigate risk within, the supply chain.
If using CDW, Olivia would be able to simply spin up a new Hive LLAP VW, which takes just a few minutes, then create an external table definition on the data files so she can begin querying them. With Hive you can natively query semi-structured text files and delimited files (e.g. CSV or TSV). There are standard open source libraries to query JSON as well as other file formats. And you can always define your own Serializer-Deserializer (SerDe) for custom formats. Even when these basic file formats are used, Hive will still convert the data to its columnar in-memory format to benefit from caching and IO efficiency optimizations.
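As an illustration of the external table step, the sketch below builds a HiveQL `CREATE EXTERNAL TABLE` statement for delimited text files sitting in ADLS Gen2. The helper function, table name, column names, storage account, and container are all hypothetical; only the general shape of the DDL and of the `abfs://` URI is meant to carry over.

```python
def external_table_ddl(table, columns, location, delimiter=","):
    """Build a HiveQL CREATE EXTERNAL TABLE statement for delimited text files.

    `columns` is a list of (name, hive_type) pairs. For tab-separated files,
    pass delimiter=r"\t" so Hive receives the escape sequence, not a raw tab.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical CSV files in a hypothetical ADLS Gen2 container.
ddl = external_table_ddl(
    "supply_chain_raw",
    [("shipment_id", "STRING"), ("supplier", "STRING"), ("lead_time_days", "INT")],
    "abfs://data@exampleaccount.dfs.core.windows.net/supply_chain/",
)
print(ddl)
```

Because the table is external, the statement only registers a schema over files that already exist; no data is copied or moved, which is what makes this kind of exploration quick to set up.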
This capability to quickly provide querying capability on arbitrary data within your object store yields great agility and flexibility. You can quickly explore new data and onboard new use cases so that you keep up with the speed of business. This is only possible, however, due to the scalable, high performance ADLS Gen2 service. The Hadoop ABFS connector provides this key integration point, bridging the enterprise data you have stored in ADLS Gen2 with the ecosystem of analytics capabilities available in Cloudera.
Capability 4 – Multi-Mode Analytics for New Use Cases, Leveraging Shared Resources
Mariana, the manufacturing LOB operations manager, was tasked by her COO to increase yield by avoiding unplanned equipment downtime. She estimates that it will require storing 1 million sensor readings per second, 15 months of data retention to accommodate historical trend analysis, the ability to run arbitrary SQL against the data, and the need to access both raw data and aggregations. In short, she needs a highly scalable real time data warehouse that provides time series capabilities without breaking the bank.
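A quick back-of-the-envelope calculation shows the scale these requirements imply. The ingest rate and retention period come from the scenario; the bytes-per-reading figure and the 365-day year are assumptions made only for this estimate, and real storage would differ once encoding and compression are applied.

```python
READINGS_PER_SECOND = 1_000_000  # from the stated requirement
RETENTION_MONTHS = 15            # from the stated requirement
DAYS_PER_YEAR = 365              # assumption: ignore leap days
BYTES_PER_READING = 16           # assumption: 8-byte timestamp + 8-byte value, uncompressed

SECONDS_PER_DAY = 24 * 60 * 60

retention_days = DAYS_PER_YEAR * RETENTION_MONTHS / 12
rows_per_day = READINGS_PER_SECOND * SECONDS_PER_DAY
total_rows = rows_per_day * retention_days
raw_terabytes = total_rows * BYTES_PER_READING / 1e12

print(f"rows per day:  {rows_per_day:,}")
print(f"rows retained: {total_rows:,.0f}")
print(f"raw size (TB): {raw_terabytes:,.2f}")
```

Under these assumptions the system ingests 86.4 billion rows per day and retains roughly 39.4 trillion rows, on the order of 630 TB before compression, which is why a general-purpose warehouse or a small time series database struggles with it.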
The current data warehouse team cannot come close to these performance requirements, and the legacy time series database used by one of their teams cannot handle such a long history or run arbitrary SQL. With CDP, Mariana can stand up the infrastructure to host such an application in one hour, in this case using Azure Compute VMs with standard, locally redundant SSD storage. Cloudera’s time series offering relies primarily on the Apache Kudu storage engine and Apache Impala for SQL querying. Data can be ingested, using Apache NiFi, from Azure Event Hubs, Kafka, or one of the many other supported sources. This combination of powerful Cloudera engines with robust Azure infrastructure means that Mariana’s ambitious requirements can be met.
She did such a good job for her COO that the CEO took notice and asked her to now improve customer (i.e. airplane passenger) satisfaction by building more reliable airplane engines. But the warehouse has no real time visibility into the machinery running on the factory floor, so there is no easy way to integrate that data with the customer experience data and draw correlations. Thus she does not know what to adjust in the factory to improve quality.
With Cloudera, Mariana can run queries that join data in the time series application with other data in the warehouse to draw correlations between the manufacturing process and customer experience (as manifested in flight delays). As above, this is enabled via SDX, but in this case there is an additional level of security in place because Mariana is not permitted to view Personally Identifiable Information (PII) within the customer data. Because CDP integrates with Azure Active Directory to pick up the user’s identity and group membership, it can use Apache Ranger to enforce sophisticated Role-Based or Attribute-Based Access Control to dynamically mask all PII data when Mariana accesses it. She can securely do her job now, delighting the CEO by doing her part to improve customer satisfaction.
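The effect of group-based dynamic masking can be sketched in a few lines. This is not Apache Ranger's API: in practice masking policies are defined centrally and enforced by the query engine, not in application code. The column names, group names, and `apply_masking` function below are hypothetical and only show the outcome a user like Mariana would see.

```python
PII_COLUMNS = {"passenger_name", "email"}  # hypothetical columns tagged as PII
UNMASKED_GROUPS = {"privacy_officers"}     # hypothetical group allowed to see PII


def apply_masking(row, user_groups):
    """Return a copy of `row` with PII masked unless the user is in an exempt group."""
    if UNMASKED_GROUPS & set(user_groups):
        return dict(row)
    return {col: ("****" if col in PII_COLUMNS else val) for col, val in row.items()}


row = {
    "passenger_name": "A. Traveler",
    "email": "a.traveler@example.com",
    "flight": "XY123",
    "delay_minutes": 95,
}

# Group membership would come from the identity provider; here it is hard-coded.
mariana_view = apply_masking(row, ["operations_managers"])
officer_view = apply_masking(row, ["privacy_officers"])
print(mariana_view)
```

The non-sensitive columns needed for correlation, such as the flight and its delay, remain readable, so the analysis can proceed without exposing the passenger's identity.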
Transforming your data warehouse experience with CDW for Azure
With Cloudera Data Warehouse running on Azure, you can cost-effectively scale your reporting and dashboarding on curated data, without waiting for the traditionally long provisioning cycles. You can enable ad hoc exploration on top of your SLA-bound workloads, without risk of missing those agreements by causing resource contention. You can quickly provision resources as needed, so you’re always saying yes to any business request for analytics of any sort, and you can take full advantage of the broader scope of multi-mode analytics for new use cases, leveraging shared resources. To learn more about CDW on Azure, and CDW itself, feel free to check out:
- June 3 demo: Faster Analytics with Cloudera Data Warehouse (CDW)
- Try CDP website
Can you provide a cost-wise comparison between CDW and Azure PaaS offerings (especially Azure Synapse and Azure SQL DB)?
When looking at cloud data warehouses you have to look at price-performance metrics as opposed to just plain cost. In other words, you need to look at how much it costs to run a given workload, as opposed to something like how much it costs per compute-hour. So in short, I can’t provide a cost comparison between CDW and Azure’s services without understanding the target workload. That said, if you want to see them, the instance rates for our CDP Public Cloud offering (which contains CDW) are listed at https://www.cloudera.com/products/pricing.html.
Great piece that really shows the benefits of a data warehousing system, particularly from a cost saving point of view