Cloudera SDX: Under the Hood


What is SDX?

Shared Data Experience — SDX — is Cloudera’s secret ingredient that makes it possible to deploy Cloudera’s four core functions (Data Engineering, Data Science, Analytic DB, Operational DB) on a single platform.

Why does that matter?

First, each of those core functions is essential to any modern enterprise business.

  • Data Engineering enables the business to run batch or stream processes that speed ETL and train machine learning models
  • Data Science enables the business to do exploratory data science at big data scale with full data security and governance
  • Analytic DB delivers the fastest time-to-insight with the flexibility and agility to run in any environment and against any type of data
  • Operational DB enables the business to build data-driven applications that deliver near real-time insights

Second, experience has shown that most business applications actually require a combination of two or more functions to solve real-world problems. In other words, most business applications not only involve large volumes of data, they also require multiple analytic disciplines — including ETL, BI, ML, and real-time analytics — to be applied to the same data set.

Customer Example: This multi-function approach helps freight businesses prevent vehicle downtime. They do this by ingesting a variety of telematics data in real time from the fleet of trucks, using machine learning to predict the likelihood that a certain part will fail at a given time, and then running analytics to determine the best way to pull the truck off the road and service it in a manner that minimizes downtime.

Third, experience has also shown that a scalable and consistent security and governance model is a prerequisite for businesses to enable a diverse set of data practitioners to interact with a shared set of sensitive or regulated data.

Customer Example: Pharmaceutical businesses are working to accelerate drug research programs by providing a self-service analytics experience on a shared pool of data to their entire research team. However, since much of this data is regulated by HIPAA, this more efficient method of drug research would not be possible if the data management team were not first able to ensure that a consistent security and governance model had been applied throughout.

With this in mind, it is clear that the preferred choice for any business should be a platform that provides a reliable implementation of each of these core functions and simultaneously provides a shared data experience to all of the data practitioners operating on that platform. This unified model for enterprise data management is indeed the most cost effective, the fastest to deploy, and the easiest to secure and govern. SDX makes this unified model possible. SDX creates this shared data experience for Cloudera’s customers.

How does SDX benefit Cloudera’s customers?

Let’s take a closer look at each of the key benefits of SDX:

More Cost Effective

SDX makes it possible to

  • Reduce procurement cost by buying multiple functions on one platform instead of buying multiple platforms, even if those platforms are provided by the same vendor

  • Reduce infrastructure cost by eliminating redundancies and inefficiencies (extra copies of data, extra pipelines to move data between platforms, extra platform management services, over-provisioning, etc.)

  • Reduce operations cost by enabling a single operations team to efficiently and consistently support all big data business applications because all functions have been integrated into the same management interfaces

Faster

SDX makes it possible to

  • Reduce deployment time by using software to provide a shared data experience out-of-the-box without the need for lengthy services contracts to make everything work together

  • Reduce time to launch new applications by leveraging all of the existing setup and context (ingest, security, governance, catalog, etc.) without needing to recreate the context, definitions, and policies for each subsequent application

  • Reduce time to onboard new tenants by inheriting existing best practice configurations from other tenants

Easier

SDX makes it possible to

  • Improve security by making it easy to create a single set of security policies that apply to all applications and all users on the platform without needing to manually re-construct security policies between disparate platforms with varying levels of control, and thus reducing the overall security risk of the platform

  • Improve governance by making it easy to keep track of all your data by providing a common data catalog for technical and business-aware definitions to all your knowledge workers

  • Create a self-service environment by making it easy for users to discover new data sets and the lineage associated with those data sets without needing to contact the data management team for support

  • Improve workload management by making it easy to scale the platform, easy to troubleshoot issues, and easy to monitor and optimize jobs

How does it do that?

SDX is composed of five discrete functions that together solve a really hard problem: providing a shared data experience for a platform that supports a diverse set of workloads and user interaction models. Here is a closer look at each of those functions and how they are implemented within Cloudera Enterprise:

Shared Security

  • Customer Experience: Ability to implement consistent, granular authentication, authorization, encryption, and compliance controls in a unified manner across the entire platform
  • Key Capabilities: Authentication, Authorization, Encryption, Key management

Shared Governance

  • Customer Experience: Ability to govern your data in a unified manner so that users can easily discover new data, understand where that data came from, and track how it has been modified
  • Key Capabilities: Audit, Discovery, Lineage, Stewardship

Shared Workload Management

  • Customer Experience: Ability to create, manage, and optimize workloads individually or as a collection by allowing administrators to allocate resources and assign workload priority based on business requirements
  • Key Capabilities: Workload creation, Workload scheduling, Workload optimization, Workload troubleshooting
  • Implementation: Cloudera Director, Cloudera Manager, Apache YARN, Apache Oozie, Job History Server, Workload Analytics

Shared Ingest & Replication

  • Customer Experience: Ability to ingest data once and make it available to all functions, applications, and users without additional ingest pipelines or copies of data; ability to replicate data on demand to remote locations or directly to the cloud
  • Key Capabilities: Ingestion, Replication, Disaster recovery, Consistency
  • Implementation: Apache Flume, Apache Sqoop, Apache Kafka, Apache Kite, Cloudera Backup & Disaster Recovery, S3Guard

Shared Data Catalog

  • Customer Experience: Ability to provide a common catalog of schema and lineage metadata to each workload and user accessing the platform for maximum efficiency and productivity
  • Key Capabilities: Metadata management
  • Implementation: Apache HMS, Cloudera Navigator

What does a data platform look like without SDX?

No other data platform has SDX, which makes these alternative platforms more costly to purchase and operate, slower to deploy and expand, and overall more difficult to secure, govern, and manage as compared to Cloudera Enterprise with SDX. Here is a closer look at the customer experience provided by alternative platforms.

  • Specialist Providers: Customer buys each function from a discrete vendor and hires developers or consultants to stitch everything together to make it work for their applications

  • Portfolio Providers: Customer buys multiple platforms from a single vendor plus a large services contract to stitch everything together to make it work for their applications

  • Hadoop Pure Play Providers: Customer buys a single platform that may be capable of sharing raw data between one or two core functions, but was not designed to share the associated data context (catalog, security, governance, etc.) between functions out-of-the-box and therefore requires the customer to fill in the gaps on their own, or more commonly, simply do without a shared data experience

Is SDX a new concept?

SDX has always been available to our on-premises customers and the vast majority of those customers are reaping the benefits of SDX by deploying multi-function applications on a single platform (see diagram below). However, the power of SDX has thus far been limited to on-premises deployments.

SDX diagram

Until now, no vendor has been able to provide a multi-function platform that delivers a shared data experience for the cloud. It is easy for workloads to share raw data via a cloud object store, but it is not easy for workloads to share security, governance, workload management, and data catalog in a cloud environment. Consequently, cloud deployments have largely been limited to single-function applications and isolated dev/test workloads. What’s worse, early adopters that have endeavored to provide a multi-function experience in the cloud have resorted to copying their on-premises deployment model (one large multi-tenant cluster running on dedicated infrastructure) to the cloud in order to preserve the shared data experience, even though this largely negates most of the benefits of cloud infrastructure and is substantially more expensive to operate.

All of that changes now. For the first time, Cloudera is making SDX available in the cloud. This means that businesses can deploy multi-function applications in the cloud without sacrificing the shared data experience they have enjoyed on premises and without compromising the benefits of cloud infrastructure. As such, the release of SDX for the cloud will mark the beginning of a new era for enterprise big data applications in the cloud.

How is SDX different in the cloud?

In the cloud, SDX is simultaneously more difficult and more valuable because data applications that are truly optimized for the cloud tend to run on isolated infrastructure (a separate set of VMs for each workload) and tend to be transient in nature (such that the entire data context must be automatically supplied to each job when it starts and removed when it ends).

Without SDX, each workload degenerates into a silo of isolated security policies and metadata context that becomes a nightmare for the data team to manage.

With SDX, it is possible to create a single logical cluster that provides a shared data experience capable of supporting multi-function applications AND simultaneously allows each workload to take full advantage of cloud IaaS as depicted in the diagram below.

SDX Cloud diagram
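To make the idea of transient workloads sharing one data context more concrete, here is a minimal Python sketch. It is only a toy model and does not use any Cloudera API: the names SharedContext, TransientCluster, and run_transient, as well as the table and group names, are hypothetical and chosen purely for illustration. It shows a context being attached when a short-lived cluster starts and detached when it ends, while the context itself outlives every cluster.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SharedContext:
    """Hypothetical stand-in for the shared data context (catalog and policies)."""
    schemas: dict = field(default_factory=dict)   # table name -> list of columns
    policies: dict = field(default_factory=dict)  # table name -> groups allowed to read


@dataclass
class TransientCluster:
    """Hypothetical model of a short-lived, isolated workload cluster."""
    name: str
    context: Optional[SharedContext] = None  # attached only while the job runs


@contextmanager
def run_transient(name, shared):
    """Attach the shared context at start-up and detach it at teardown."""
    cluster = TransientCluster(name)
    cluster.context = shared
    try:
        yield cluster
    finally:
        cluster.context = None  # the cluster goes away; the shared context does not


shared = SharedContext(
    schemas={"telemetry": ["truck_id", "part", "reading"]},
    policies={"telemetry": {"engineers", "analysts"}},
)

# A first transient job registers a new table in the shared catalog.
with run_transient("etl-job", shared) as etl:
    etl.context.schemas["telemetry_clean"] = ["truck_id", "part"]

# A later, unrelated job sees that table without any re-registration.
with run_transient("ml-job", shared) as ml:
    visible = sorted(ml.context.schemas)

print(visible)
```

In this toy model the second job inherits what the first one defined, which is the behavior the text describes; without a shared context, each cluster would have to carry and maintain its own copy of the schemas and policies.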

How does SDX work in the cloud?

The diagram below provides the details of how SDX works in the cloud. Starting from the bottom,

  • The storage layer is implemented via Shared Object Storage and requires only a single copy of raw data to implement SDX in order to maximize efficiency, security, and governance
  • The metadata layer is implemented by a set of shared metastores and related tools that maintain consistent data catalog and policies across the platform
  • The compute layer is implemented by running each workload in an isolated Workload Cluster such that each workload can be fully optimized for cloud IaaS
  • The management layer is implemented via Cloudera’s leading enterprise management suite to make it easy to create and manage both transient and persistent workloads
  • The user interface layer is implemented via Cloudera Altus, Cloudera’s new suite of self-service data applications, which makes it easy for end users to create and troubleshoot their jobs in a shared environment managed by the data team

SDX layers
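The bottom three layers described above can be sketched as a small Python model. This is an illustrative assumption, not Cloudera's implementation: the classes ObjectStore, Metastore, and WorkloadCluster, the storage path, and the group names are all hypothetical. The sketch shows two isolated workload clusters for different functions reading the same single copy of data, with access decided by one shared set of policies.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Storage layer (hypothetical): one copy of each raw data set, keyed by path."""
    objects: dict = field(default_factory=dict)


@dataclass
class Metastore:
    """Metadata layer (hypothetical): shared catalog and access policies."""
    locations: dict = field(default_factory=dict)  # table name -> object store path
    allowed: dict = field(default_factory=dict)    # table name -> set of groups

    def resolve(self, table, group):
        # Every workload goes through the same policy check.
        if group not in self.allowed.get(table, set()):
            raise PermissionError(f"{group} may not read {table}")
        return self.locations[table]


@dataclass
class WorkloadCluster:
    """Compute layer (hypothetical): isolated cluster holding no data or policy itself."""
    function: str
    store: ObjectStore
    metastore: Metastore

    def read(self, table, group):
        path = self.metastore.resolve(table, group)
        return self.store.objects[path]


# Illustrative path and rows only; nothing here refers to a real bucket.
store = ObjectStore({"s3://bucket/telemetry": [("truck-1", 0.93), ("truck-2", 0.12)]})
meta = Metastore(
    locations={"telemetry": "s3://bucket/telemetry"},
    allowed={"telemetry": {"data_science", "analytics"}},
)

science = WorkloadCluster("Data Science", store, meta)
analytics = WorkloadCluster("Analytic DB", store, meta)

rows_a = science.read("telemetry", "data_science")
rows_b = analytics.read("telemetry", "analytics")
same_copy = rows_a is rows_b  # both clusters see the one stored copy

try:
    analytics.read("telemetry", "marketing")
    denied = False
except PermissionError:
    denied = True  # the shared policy applies no matter which cluster asks
```

The design point the sketch tries to capture is that the clusters hold only references to the shared layers, so adding or removing a cluster changes neither the data nor the policies.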

At Cloudera, we are thrilled to see SDX launching for the cloud and cannot wait to see how our customers will use it to make what is impossible today become possible tomorrow. To learn more, sign up today for our upcoming webinar on this topic.

 
