Cloudera’s new Data Hub cloud service, powered by Cloudera Data Platform, enables users to seamlessly migrate on-premises data management and analytics workloads to the cloud as well as implement new cloud workloads in pursuit of their cloud-first data management strategy. On August 22nd, Cloudera demonstrated its Data Hub service during a webinar highlighting key business benefits, use cases, and product capabilities.
Below is a brief overview of the topics covered and some of the most frequently asked questions from attendees.
What is Cloudera Data Hub?
Cloudera Data Hub is a powerful cloud service on Cloudera Data Platform (CDP) that makes it easier, safer, and faster to build modern, mission-critical, data-driven applications with enterprise security, governance, scale, and control. The cloud-native service is powered by a suite of integrated open source technologies that delivers the widest range of analytical workloads such as data marts and data engineering.
The three distinguishing characteristics of Cloudera Data Hub are:
- It embodies node-based clusters running a broad range of workloads with a suite of selectable components bundled into an open source distribution.
- It offers extensive choices in cluster shapes, workload types, infrastructure choices, and configuration/customization options, delivering a best-in-class, intuitive experience for operators & users of big data platforms.
- It facilitates seamless migration and upgrade paths to the public cloud for existing CDH & HDP customers without disruption.
Key business benefits include:
Speed of innovation
Cloudera Data Hub provides a PaaS-like experience that enables deployment of new solutions in weeks rather than months or quarters. Users can build revenue-generating, multi-function data applications more easily, quickly, and safely with enterprise security, governance, scale, and control.
Enhanced user experience
Data Hub clusters can be provisioned and disposed of quickly with pre-built or custom configuration options for infrastructure. It’s easy to provision multiple clusters on shared data, so customers can launch new applications that can be fully isolated without interrupting existing production applications.
Data Hub mitigates the risks associated with technology evolution, price fluctuations of cloud, vendor lock-in, and regulatory compliance. It removes the need for CAPEX on expensive data center hardware and enables end-to-end security and governance for each data hub environment and optimized SLAs for mission critical projects.
Breadth of services
Data Hub empowers organizations to focus on delivering trusted, high-value analytics from edge to AI-powered operations with the broadest range of workloads, including operational data store, data mart, and data-in-motion/edge. It reduces TCO by scaling cost-effectively across multiple operational and analytical use cases, resulting in lower infrastructure, maintenance, and integration costs.
What can Cloudera Data Hub do?
Provision clusters of various workloads
Data Hub currently supports provisioning of data engineering and data mart clusters. It will soon support provisioning of operational database, secure data engineering, Data Flow, stream processing, discovery data mart, as well as custom (user chooses components) clusters.
Facilitate Robust Orchestration and Automation
Data Hub facilitates management, monitoring, and orchestration of all services from a single pane of glass across all environments. Capabilities include always-on operation with automated HA configuration of all critical services; auto-repair to replace failed cloud instances transparently with no loss of state; resize to easily scale capacity up or down (via UI and API); and compute-only instances with minimal ephemeral storage to help optimize cost.
Enable Enterprise-grade security and governance
Data Hub delivers enterprise-grade security comprising:
- built-in, federated identity management;
- secure, keyless access to cloud provider storage & compute;
- support for both private and public IP addresses with proxied endpoints;
- automated wire encryption for all control traffic and data paths;
- always-on attribute-based access control (ABAC) across all components and clusters in an environment;
- support for encrypted cloud storage services and attached volumes.
Provide Flexibility, Choice, and Control
Data Hub is for businesses that want flexibility, scalability, and ease of use. Users can rearrange worker roles, configure GPU support, adjust resource management settings, and tune clusters to implement complex, multi-function analytics use cases at scale.
Data Hub delivers your environment, your way by providing:
- custom-tailored clusters,
- custom service selection, configuration settings & whole cluster templates,
- access to the latest infrastructure from cloud providers,
- canned ‘Cluster Definitions’ that replicate success with a predictable experience/SLA, and
- orchestration through CLI, API or GUI.
Key Use Cases
Some of the key use cases for Data Hub are described below.
Data Hub enables you to run your existing on-premises Cloudera workloads in the cloud through lift-and-shift, with the improved performance, robust governance, and availability experienced by thousands of organizations that have deployed Cloudera on-premises. Additionally, Data Hub offers dedicated, pay-as-you-go, auto-scaling clusters and extensive choices in configuration. Whether you choose a cluster shape of data engineering or data mart, bare metal or virtualized, Data Hub provides an intuitive experience and seamless migration path.
IaaS to PaaS-like experience
Existing customers using Cloudera in an IaaS model can move workloads to Data Hub and reap benefits such as increased automation, integration with object storage, a unified control plane, and shared catalog, security, and governance with SDX.
Augment with hybrid strategy
Data Hub can augment or complement existing on-premises environments for scenarios such as analyzing new cloud-born data or establishing data marts and operational data stores on existing data synchronized with the cloud. The service also supports new applications & exploratory projects in the cloud, empowering organizations to capitalize on cloud benefits while retaining sensitive or relevant data in the on-premises environment.
Data Hub enables enterprises to capitalize on a cloud-native architecture and deploy a wide variety of workloads, such as data ingestion, ODS (Operational Data Store), data mart, and data engineering, in a cloud-native, elastic, pay-as-you-go, and easy-to-manage environment. IT departments can deliver a number of business use cases such as customer 360, revenue-generating applications, IoT operations and monetization, and functional data marts (e.g., Finance, Marketing).
Governance, Risk and Compliance
Regulatory change is growing in both complexity and pace. These requirements demand that organizations build a data architecture that lowers business risks. Data Hub with SDX provides consistent security, governance, and control across all environments, and protects enterprise data from security threats and risk.
Highlights of Q&A Session
Cloudera received a number of questions from participants. We have addressed a few of the more commonly asked ones below.
Is Data Hub transient in nature?
Yes, Data Hub clusters can be fully elastic (transient), meaning they are provisioned to run a single workload and then terminated. This can be automated through the CDP CLI. Clusters can also be semi-elastic, meaning they can grow & shrink according to the capacity required. Transient Data Hub clusters maintain all of their catalog and security metadata in a CDP Data Lake through SDX, so no context is lost when clusters are terminated.
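The transient pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FakeDataHubClient`, `transient_cluster`, `run_nightly_etl`, and the cluster and table names are all hypothetical, and the client is an in-memory stand-in rather than the CDP CLI or any Cloudera API. The point is the shape of the workflow: provision a cluster, run one workload, always terminate, and keep the metadata somewhere that outlives the cluster.

```python
from contextlib import contextmanager


class FakeDataHubClient:
    """Hypothetical in-memory stand-in for a provisioning API (not the real CDP CLI or SDK)."""

    def __init__(self):
        self.clusters = {}           # cluster name -> status
        self.data_lake_catalog = {}  # models metadata that outlives any single cluster

    def create_cluster(self, name, definition):
        self.clusters[name] = "RUNNING"
        return name

    def delete_cluster(self, name):
        del self.clusters[name]


@contextmanager
def transient_cluster(client, name, definition):
    """Provision a cluster for one workload and always terminate it afterwards."""
    client.create_cluster(name, definition)
    try:
        yield name
    finally:
        # Runs even if the workload raises, so no cluster is left behind.
        client.delete_cluster(name)


def run_nightly_etl(client):
    with transient_cluster(client, "nightly-etl", "data-engineering") as cluster:
        # The workload records table metadata in the shared catalog,
        # which is not tied to the cluster's lifetime.
        client.data_lake_catalog["sales_daily"] = {"written_by": cluster}
    return client


client = run_nightly_etl(FakeDataHubClient())
print(client.clusters)           # no clusters remain after the workload
print(client.data_lake_catalog)  # the catalog entry is still present
```

Wrapping the lifecycle in a context manager is one way to guarantee termination even when the workload fails, which is what keeps a transient cluster from lingering and accruing cost.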
Is there a way to easily migrate Sentry policies to Ranger to be compatible with CDP?
Yes. CDP Replication Manager has point-and-click migration of existing Sentry policies to Ranger, which resides in a CDP Data Lake.
When will a CDP release include HDF services such as NiFi, NiFi Registry, Kafka, and Schema Registry?
New Data Hub cluster definitions will be released regularly with new services such as NiFi, Kafka, HBase and Kudu, which are all planned in the coming months. Stay tuned.
Are all components of Cloudera Data Platform open source?
Yes. All Data Hub cluster services are 100% Apache open source, and other components of the CDP stack like Cloudera Manager will also become fully open source by January 2020.
Can you combine data from clusters running on different public cloud platforms like GCP and Azure?
This is on the roadmap for CDP. Replication Manager will support copying data sets & metadata between clouds like Azure and GCP so they can be combined to power a workload running within a CDP environment in a particular cloud region.
How will upgrades of the stack be handled? Will there be downtime associated with upgrades?
The ‘always-on’ nature of Cloudera SDX services means that newer clusters can be created and stood up next to old clusters, often eliminating the need for in-place upgrades and permitting a gradual transition of workloads without downtime. However, for some workloads like an Operational Database based on HBase or Kafka, in-place upgrades are still appropriate and will be supported in a future version of Cloudera Data Hub.