Shared Data Experience (SDX) on Cloudera Data Platform (CDP) enables centralized data access control and audit for workloads in the Enterprise Data Cloud. The public cloud (CDP-PC) editions default to using cloud storage (S3 for AWS, ADLS-gen2 for Azure). This introduces new challenges around managing data access across teams and individual users. To solve these challenges for S3 and ADLS-gen2, Cloudera has introduced a new service — the Ranger Authorization Service (RAZ).
CDP-PC provides the same fine-grained access control as on-prem for data warehouse querying (Hive or Apache Impala), search index lookups (Apache Solr), and applications built upon operational database tables (Apache HBase). Initially, the change from HDFS storage to cloud storage required architectural changes to how access control for files and directories was managed. This directly impacted use cases that require access to raw files or objects, such as data engineering with Hive, Apache Spark, and Apache Pig. A follow-up blog post will illustrate the kinds of changes that would need to be made and how RAZ compares.
Cloudera’s new RAZ addresses these challenges and is now fully integrated with CDP-PC. This service enables data owners to audit and control access to files and directories in cloud storage using Apache Ranger as a centralized repository for data security policies. In effect, CDP-PC deployments on native cloud storage get the same fine-grained access control and audit capabilities that on-prem users have enjoyed for years through Apache Ranger in HDFS deployments.
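To make the idea of a centralized policy repository with built-in audit more concrete, here is a minimal toy model in Python. It is not Ranger's actual evaluation logic (Ranger policies also support features such as deny rules and exceptions), and every name in it (`Policy`, `ToyAuthorizer`, the bucket path, the group names) is hypothetical. It only shows the pattern: one place holds the path-based policies, and every access decision is recorded.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """A simplified allow-policy: groups may perform accesses under a path prefix."""
    path_prefix: str
    groups: set
    accesses: set


@dataclass
class ToyAuthorizer:
    """Hypothetical stand-in for a central authorization service (not the Ranger API)."""
    policies: list
    audit_log: list = field(default_factory=list)

    def is_allowed(self, user, user_groups, path, access):
        # Allow only if some policy covers the path, the access type, and one of the user's groups.
        allowed = any(
            path.startswith(p.path_prefix)
            and access in p.accesses
            and p.groups & set(user_groups)
            for p in self.policies
        )
        # Every decision, allowed or denied, is written to the same audit trail.
        self.audit_log.append(
            {"user": user, "path": path, "access": access, "allowed": allowed}
        )
        return allowed


# Example with an illustrative bucket and group.
authz = ToyAuthorizer([Policy("s3a://example-bucket/raw/", {"etl"}, {"read"})])
authz.is_allowed("alice", ["etl"], "s3a://example-bucket/raw/file.csv", "read")
authz.is_allowed("bob", ["analysts"], "s3a://example-bucket/raw/file.csv", "read")
```

In this sketch, the first request is allowed and the second is denied, and both end up in `audit_log`, which is the property the centralized approach is meant to provide.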
In order to describe the benefits of RAZ on CDP Public Cloud, let’s discuss two of our customers.
Customer 1 – Centralized data authorization management
One of our pharmaceutical customers has been using CDH on AWS IaaS and wanted to use CDP to deploy new data engineering workloads. They historically deployed traditional CDH clusters in the cloud as if they were on prem with always-on virtual machines configured for traditional HDFS on nodes with Amazon EBS volumes attached. When they evaluated CDP Public Cloud on Amazon, they were enticed by having one centralized service to define data authorization policies for their different teams.
RAZ for S3 gives them that capability. Without RAZ for S3, managing access would have introduced operational complexity, because they would have had to maintain policies in three places: AWS IAM (Identity and Access Management), CDP’s User Management Service, and the CDP environment’s Ranger service. With a RAZ for S3-enabled environment, all file access authorizations and audits are managed within the environment’s Ranger service.
Customer 2 – Centralizing data access control operations
One of our large financial services customers has been using HDP on Azure and wanted to move with minimal operational changes from their existing clusters. They had deployed a traditional HDP cluster in the cloud as if it were on prem, with always-on virtual machines configured for traditional HDFS on nodes that had Azure’s Premium storage attached. They also depended upon Apache Ranger for its sophisticated fine-grained access controls and centralized audit of access to HDFS files and Apache Hive tables. This customer’s HDP cluster was used by many teams, and the platform owners managed access control using hundreds of Ranger HDFS policies.
RAZ for Azure unblocked this customer and gave them virtually the same single pane of glass for their data access control policies as their IaaS deployment. It eliminated the need for a potential re-architecture of their security policies and only required a simple conversion of their existing HDFS Ranger policies to ADLS Ranger policies.
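As a rough illustration of what such a conversion could involve, the sketch below rewrites a path-based HDFS policy, represented as a Python dictionary, into an ADLS-style policy. The service name `cm_adls`, the resource keys `storageaccount`, `container`, and `relativepath`, and the storage account and container values are all assumptions made for this example; check the actual Ranger service definition in your environment before relying on any of them. Access types are copied unchanged here, whereas a real conversion would need to map them onto whatever access types the ADLS service supports.

```python
import copy

# Assumed values for illustration only; verify against your own environment.
ADLS_SERVICE = "cm_adls"            # hypothetical Ranger ADLS service name
STORAGE_ACCOUNT = "examplestorage"  # hypothetical storage account
CONTAINER = "data"                  # hypothetical container


def hdfs_policy_to_adls(hdfs_policy, storage_account=STORAGE_ACCOUNT,
                        container=CONTAINER, service=ADLS_SERVICE):
    """Return a copy of an HDFS-style policy with its path re-expressed as ADLS resources."""
    adls_policy = copy.deepcopy(hdfs_policy)  # leave the source policy untouched
    path_resource = adls_policy["resources"].pop("path")
    adls_policy["service"] = service
    adls_policy["resources"] = {
        "storageaccount": {"values": [storage_account]},
        "container": {"values": [container]},
        "relativepath": {
            # Normalize each path to a single leading slash and no trailing slash.
            "values": ["/" + p.strip("/") for p in path_resource["values"]],
            "isRecursive": path_resource.get("isRecursive", False),
        },
    }
    return adls_policy


# Example input: a made-up HDFS policy granting the "etl" group read access.
hdfs_policy = {
    "service": "cm_hdfs",
    "name": "raw-zone-read",
    "resources": {"path": {"values": ["/data/raw/"], "isRecursive": True}},
    "policyItems": [
        {"groups": ["etl"], "accesses": [{"type": "read", "isAllowed": True}]}
    ],
}
adls_policy = hdfs_policy_to_adls(hdfs_policy)
```

Because users, groups, and policy items carry over as-is in this sketch, a script along these lines could be run over hundreds of existing policies, which is why the conversion can stay mechanical instead of turning into a redesign.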
Both customers – Cost savings and modernized architecture
Both customers were also enticed by the potential cloud cost savings from migrating from IaaS to CDP Public Cloud. Both can save by using more economical storage: AWS S3 instead of EBS, and Azure ADLS-gen2 instead of Azure Premium storage. Both customers also gain from modernizing their data lake architecture to decouple compute nodes from storage. For the net new workloads of our pharmaceutical CDH customer, compute costs could be reduced further by dynamically spinning up Data Hubs for various jobs instead of running an always-on cluster. Similarly, the customer migrating from HDP to CDP can save by dynamically scaling up and down the VM nodes that run their ported workloads within a Data Hub.
Conclusion
With the introduction of RAZ for S3 and ADLS, the Cloudera customers discussed here are now able to get the operational wins and cost savings for their data engineering use cases. Both the CDH and the HDP customer got the benefit of a single interface to manage data access policies, and both can save money by having their upgraded deployment natively use more cost-efficient cloud storage, whether Azure Data Lake Storage (ADLS) or AWS S3, and by taking advantage of compute elasticity. The HDP migration customer had the added benefit of a nearly identical operational experience around data security and did not have to significantly re-architect their existing security policies.
With the release of the CDP 7.2.11 runtime, RAZ for Azure ADLS is now Generally Available for production use in CDP-PC for Data Lakes and for Data Hubs running Spark, Hive, and HBase. RAZ for AWS S3 is now in Limited Availability for production use, so please reach out to your account team to enable this capability. The rest of the Data Hubs and the integration with CDP experiences are in development or preview states, so consult the documentation for their current status.
For more details, see the following resources:
- Our recent blog, walking through how to enable specific use cases with RAZ for ADLS
- Deep dive into a scenario comparing the group-based access control mechanism against the new fine-grained access control.
- Deep dive into how Cloudera and Microsoft Azure partnered to enable interoperability between CDP and Azure native services (RAZ for ADLS with ACL fallback)
- Detailed discussion on the architecture of RAZ in the enterprise data cloud