The Security Challenges of Data Warehousing in the Cloud

Many organizations struggle to meet growing and variable data warehouse demands. No matter how much they pad their annual IT budgets, there never seems to be enough capacity to cover unexpected business requests. This leads to resource restrictions for the various business units that use the platform. 

When business units are not well served by central IT, “shadow IT” emerges. These independent departmental IT projects threaten security and compliance for the entire organization because nobody can verify that security is applied consistently; most of the time, central IT is not even aware that the projects exist. Shadow IT point solutions may temporarily solve a problem for an individual business unit, but they often lead to other issues:

  • How do you maintain a single source of truth in a completely decentralized architecture?
  • How do you control data privacy and protect against data breaches when the data is spread across so many different systems?
  • How do you optimize your enterprise-wide infrastructure (mostly cloud) and application expenditures?

The ideal solution would maintain centralized security and governance controls while enabling individual business units to quickly provision capacity and customize their environment to meet their needs. This is exactly what Cloudera Data Platform (CDP) provides to the Cloudera Data Warehouse. CDP is a data platform that is optimized for both business units and central IT. 

  • CDP allows each business unit to have its own custom data warehouse environment.
  • CDP includes Cloudera Shared Data eXperience (SDX), a centralized set of security, governance, and management capabilities that make it possible to use cloud resources without sacrificing data privacy or creating compliance risks.
  • CDP does all of this without cloud provider lock-in, so teams may move to the cloud — or between clouds — without retraining staff or rewriting applications.

The end result is that teams using Cloudera Data Warehouse on CDP can collaborate more efficiently and more securely, and at a lower cost.

Main Security Features

In CDP, an “Environment” is a logical subset of your cloud provider account. Registering an Environment provides CDP with access to your cloud provider account and identifies the resources in your cloud provider account that CDP services can access or provision.

Register an Environment

Once you have registered an Environment in CDP, you can start provisioning CDP resources such as data warehouse clusters, which run within your own cloud account, ensuring that your data and your applications never leave your network. You can register multiple environments corresponding to different geographical regions that your organization would like to use.

When you register an Environment in CDP, a Data Lake is automatically deployed for that environment. Data Lake security and governance are managed by a shared set of services running within a Data Lake cluster. These are the shared security services encompassed within SDX.

The Data Lake cluster and SDX are managed by Cloudera Manager, and include the following services:

  • Hive Metastore (HMS): table metadata
  • Apache Ranger: fine-grained authorization policies, auditing
  • Apache Atlas: metadata management and governance (lineage, analytics, attributes)
  • Apache Knox:
    • Authenticating proxy for web UIs and HTTP APIs: SSO
    • IDBroker: identity federation, cloud credentials

SDX provides consistent data security, governance, and control — and not just within a single Data Lake. Policies from multiple Environments and Data Lakes roll up into CDP Control Plane applications (such as Data Catalog, Workload Manager and Replication Manager) to provide a single and complete view across all deployments.

The Data Lake provides a way for you to create, apply, and enforce user authentication and authorization, and to collect audit and lineage metadata from multiple ephemeral workload clusters. While workloads can be short-lived, the security policies around your data are persistent and shared for all workloads. 

Cloudera Data Warehouse Security

The Cloudera Data Warehouse service enables self-service creation of independent data warehouses and data marts for teams of business analysts without the overhead of bare-metal deployments.

In the Cloudera Data Warehouse service, your data is persisted in the object store location specified by the Data Lake that resides in your specific cloud environment. The service is composed of:

  • Database Catalogs:
    A logical collection of metadata definitions for managed data with its associated data context. The data context consists of table and view definitions, transient user and workload contexts from the Virtual Warehouse, security permissions, and governance artifacts that support functions such as auditing. One Database Catalog can be queried by multiple Virtual Warehouses.
  • Virtual Warehouses:
    An instance of compute resources that is equivalent to an autoscaling cluster. A Virtual Warehouse provides access to the tables and views in the data lake that correspond to a specific Database Catalog. Virtual Warehouses bind compute and storage by executing queries on the tables and views that are accessible through the Database Catalog they have been configured to access. The compute and memory resources for each Virtual Warehouse are completely isolated from other Virtual Warehouses, which avoids contention and allows highly sensitive workloads to be executed in complete isolation.
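
The relationship between the two components can be pictured with a small toy model. This is only an illustrative sketch, not Cloudera's implementation or API: the class names, fields, and sample data below are hypothetical. It shows one Database Catalog being queried by two Virtual Warehouses, each of which keeps its own runtime state.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseCatalog:
    """Hypothetical stand-in for a Database Catalog: shared metadata and data context."""
    name: str
    tables: dict = field(default_factory=dict)  # table name -> list of rows


@dataclass
class VirtualWarehouse:
    """Hypothetical stand-in for a Virtual Warehouse: compute bound to one catalog."""
    name: str
    catalog: DatabaseCatalog
    queries_run: int = 0  # runtime state that belongs to this warehouse only

    def query(self, table: str) -> list:
        # Each warehouse does its own work but resolves tables via the shared catalog.
        self.queries_run += 1
        return list(self.catalog.tables[table])


# One catalog, two independent warehouses configured to access it.
catalog = DatabaseCatalog("sales_catalog", {"orders": [{"id": 1}, {"id": 2}]})
finance_vw = VirtualWarehouse("finance_vw", catalog)
marketing_vw = VirtualWarehouse("marketing_vw", catalog)

finance_rows = finance_vw.query("orders")
finance_vw.query("orders")
marketing_rows = marketing_vw.query("orders")
```

Both warehouses see the same table definitions and data, while the query counter of one is unaffected by activity in the other, which mirrors the shared-catalog, isolated-compute split described above.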

CDW Database Catalogs and Virtual Warehouses automatically inherit the centralized and persistent SDX services — security, metadata, and auditing — from your CDP environment. There is no need to repeatedly specify the security setup for each Database Catalog or Virtual Warehouse.

The following SDX security controls are inherited from your CDP environment:

  • Authentication: Ensures that all users have proven their identity before accessing the Cloudera Data Warehouse service or any created Database Catalogs or Virtual Warehouses. CDP integrates with your corporate Identity Provider to maintain a single source of truth for all user identities.
  • Fine-grained authorization: Ensures that only users who have been granted adequate permissions are able to access the Cloudera Data Warehouse service and the data stored in the tables.
  • Dynamic column masking: If rules are set up to mask certain columns when queries execute, based on the user executing the query, then these rules also apply to queries executed in the Virtual Warehouses.
  • Row-level filtering: If rules are set up to filter certain rows from being returned in the query results, based on the user executing the query, then these same rules also apply to queries executed in the Virtual Warehouses.
  • Auditing: Apache Ranger provides a centralized framework for collecting access audit history and reporting data, including filtering on various parameters. 
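
To make the effect of these controls concrete, here is a minimal Python sketch of how per-user masking and row-filter rules might shape a query result, with each access recorded for auditing. It is a conceptual model only and does not use Apache Ranger's API; the user names, the `Policy` class, the masking function, and the sample table are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Policy:
    """Hypothetical policy record; real Ranger policies are far richer than this."""
    masked_columns: Dict[str, Callable[[object], object]]  # column -> masking function
    row_filter: Callable[[Row], bool]  # a row is returned only if this is True


def mask_all_but_last4(value: object) -> str:
    """Replace every character except the last four with '*'."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


# Hypothetical users and rules, keyed by the user executing the query.
POLICIES: Dict[str, Policy] = {
    "analyst_emea": Policy(
        masked_columns={"card_number": mask_all_but_last4},
        row_filter=lambda row: row["region"] == "EMEA",
    ),
    "auditor": Policy(masked_columns={}, row_filter=lambda row: True),
}

AUDIT_LOG: List[dict] = []  # stand-in for a centralized audit store


def run_query(user: str, rows: List[Row]) -> List[Row]:
    """Apply the caller's row filter and column masks, and record the access."""
    if user not in POLICIES:
        AUDIT_LOG.append({"user": user, "allowed": False})
        raise PermissionError(f"no policy grants access to {user}")
    policy = POLICIES[user]
    result = []
    for row in rows:
        if not policy.row_filter(row):
            continue  # row-level filtering
        visible = dict(row)
        for column, mask in policy.masked_columns.items():
            if column in visible:
                visible[column] = mask(visible[column])  # dynamic column masking
        result.append(visible)
    AUDIT_LOG.append({"user": user, "allowed": True, "rows_returned": len(result)})
    return result


TABLE = [
    {"region": "EMEA", "card_number": "4111111111111111"},
    {"region": "APAC", "card_number": "5500000000000004"},
]

analyst_view = run_query("analyst_emea", TABLE)
auditor_view = run_query("auditor", TABLE)
```

The same query over the same table returns different results depending on who runs it: the hypothetical analyst sees only EMEA rows with the card number masked, while the auditor sees every row unmasked. Because the rules live in one place rather than in each Virtual Warehouse, they apply wherever the query executes.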

Video

Cloudera SDX

CDP Secure by Design

Case Studies

One example of using CDP’s controls to secure a cloud data platform comes from a US-based customer in the financial services sector who operates a multi-tenant data warehouse. Their entire business model is premised on secure sharing of data products. They have a read-only data set which all tenants can query, as well as tenant-specific data sets which are only accessible to the respective tenant who owns the data set. SDX provides a strong and flexible authorization capability that supports their hybrid environment. Furthermore, tenants utilize dedicated and isolated compute resources to ensure that, at runtime, there is no exposure of one tenant’s runtime state to another tenant.
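
The access pattern in this case study can be sketched as a simple rule set. The snippet below is a hypothetical illustration, not the customer's configuration or any CDP API: the tenant names, data set names, and the assumption that owners may both read and write their own data sets are invented for the example.

```python
# Hypothetical data sets: one shared, read-only set plus one set per tenant.
SHARED_DATASETS = {"market_reference"}
TENANT_DATASETS = {
    "tenant_a": {"tenant_a_positions"},
    "tenant_b": {"tenant_b_positions"},
}


def can_access(tenant: str, dataset: str, action: str) -> bool:
    """Return True if the tenant may perform the action on the data set."""
    if tenant not in TENANT_DATASETS:
        return False  # unknown tenants get nothing
    if dataset in SHARED_DATASETS:
        return action == "read"  # shared data is read-only for every tenant
    # Tenant-specific data is visible only to its owner (read/write is an assumption).
    return dataset in TENANT_DATASETS[tenant] and action in {"read", "write"}
```

For instance, `can_access("tenant_a", "market_reference", "read")` is allowed, while `can_access("tenant_b", "tenant_a_positions", "read")` is denied because the data set belongs to a different tenant.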

Related Information

Cloudera Data Warehouse (product documentation)

Cloudera Data Warehouse (website)

CDP Core Concepts (product documentation)

Justin Hayes
Data Warehouse Product Manager