Cloudera Data Platform (CDP) provides a Shared Data Experience (SDX) for centralized data access control and audit in the Enterprise Data Cloud. The Ranger Authorization Service (RAZ) is a new service that provides fine-grained access control (FGAC) for cloud storage. We covered the value this new capability provides in a previous blog. RAZ for S3 and RAZ for ADLS introduce FGAC and auditing for CDP’s access to files and directories in cloud storage, making it consistent with the rest of the SDX data entities. In this blog post we’ll compare how policies are implemented with the group-based mechanism (IDBroker) and in a RAZ-enabled environment.
Changes with file access control
Prior to the introduction of RAZ, access to ADLS or S3 could only be controlled at a coarse-grained group level. While manageable for a couple of teams, many of our customers require hundreds of Ranger policies for HDFS to control access for their different teams and projects. This group-level access control is managed with the CDP IDBroker service and requires re-architecting how access is managed. Each policy change, or the introduction of a new user or group, typically requires interaction between CDP administrators and AWS/Azure administrators, and potentially changes to existing applications. This can be time-consuming and cumbersome: as the number of teams and users grows, the effort required to manage access this way becomes unwieldy.
In the next sections, we’ll walk through a simple data access scenario both without and with RAZ for two separate teams — the data scientists and the data engineers. Although in our example we use RAZ for S3, RAZ for ADLS works analogously.
Without RAZ: Group-based access control with IDBroker
Traditionally, with a CDP Private Cloud Base Edition, HDP, or CDH deployment, protection of files and directories is achieved through a combination of HDFS ACLs (CDP, HDP, CDH) and Ranger HDFS policies (CDP, HDP). Since these on-prem capabilities were not initially available in CDP Public Cloud, certain use cases needed alternate means to control access to specific files and directories.
Without RAZ, the recommended solution is to use IDBroker to create a mapping from CDP users or groups to AWS IAM (ADLS AD) roles. This approach keeps AWS or ADLS credentials from leaking into your application’s code and allows for good credential hygiene. The procedure to onboard CDP users and groups for AWS cloud storage with an example for a data scientist (DS) and data engineering (DE) group is documented here.
With this in place, when you access cloud storage, CDP talks to IDBroker, exchanges your CDP identity for an AWS IAM role, and then performs the operation as the IAM role.
So, what are the consequences of this implementation? Let’s look at the impact when a new user is added, and also when a user is added to multiple groups, using the IDBroker approach.
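The exchange described above can be pictured with a toy model. This is a minimal sketch, not IDBroker's actual API: the mapping tables, role ARNs, and the `resolve_role` helper are all hypothetical, and the rule that a per-user mapping wins over a group mapping is an assumption made purely for illustration.

```python
# Toy model of an identity-to-role exchange. All names and ARNs are
# hypothetical; this code does not call IDBroker or AWS.
USER_ROLE_MAP = {"bob": "arn:aws:iam::111111111111:role/bob-role"}
GROUP_ROLE_MAP = {
    "data-engineers": "arn:aws:iam::111111111111:role/de-role",
    "data-scientists": "arn:aws:iam::111111111111:role/ds-role",
}


def resolve_role(user, groups):
    """Return the single IAM role that a CDP identity is exchanged for."""
    # Assumption: a per-user mapping takes precedence over group mappings.
    if user in USER_ROLE_MAP:
        return USER_ROLE_MAP[user]
    matches = sorted({GROUP_ROLE_MAP[g] for g in groups if g in GROUP_ROLE_MAP})
    if len(matches) != 1:
        # The identity must end up with exactly one role; zero or several
        # candidates cannot be resolved without extra input.
        raise LookupError(f"cannot pick one role for {user}: {matches}")
    return matches[0]


# Every later storage operation would run as this role, not as "jon".
print(resolve_role("jon", ["data-engineers"]))
```

In this sketch a user who belongs to both mapped groups raises `LookupError`, which previews the multiple-group problem discussed later in the post.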
Let’s add a new user, Bob. There are two potential approaches with IDBroker:
- Create an IDBroker mapping for each CDP user like Bob to a unique AWS IAM role. Access decisions are made based on Bob’s AWS IAM role and ACLs on S3 buckets/objects. Adding Bob means that an AWS admin needs to create an IAM role for him in AWS. The AWS admin then needs to give Bob read and write access via ACLs on individual objects or at the bucket level. However, this approach has known limitations, including a 20 KB policy size limit on buckets and a maximum of 100 grants on objects, which limit the total number of users that can be associated. As the number of users grows, this approach becomes impractical and forces the CDP admin to move to a per-group IAM role.
- Create an IDBroker mapping to a shared AWS IAM role per CDP group and assign CDP users like Bob to that group. Access decisions are made based on the group’s AWS IAM role and ACLs on S3 buckets/objects. Adding a user simply requires adding the CDP user to the CDP group.
Let’s say you use the CDP group to AWS IAM role mapping. The implication is that you cannot differentiate between two users that belong to the same group. Let’s say that both Jon and Remi belong to the Data Engineering group. Both Jon and Remi therefore have the same permissions to read and write files in CDP. The problem is that Jon cannot prevent Remi from deleting files that Jon has written and, worse yet, he has no useful audit trail to determine that Remi in fact deleted them! The only audit trail is in AWS, stating that the Data Engineering group’s IAM role created and deleted files at a particular time.
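To make the audit gap concrete, here is a small sketch of what a storage-side log looks like when every operation runs as a shared group role. The `AuditEvent` record, the `de-role` name, and the bucket paths are hypothetical and only model the behavior described above; they are not an AWS log format.

```python
# Toy model: with a shared group role, the storage-side audit log only
# ever sees the role. Names and paths are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    principal: str  # the identity the storage layer sees
    action: str
    path: str


DE_ROLE = "de-role"  # shared role for the Data Engineering group


def perform_as_role(user, action, path, log):
    # The caller's user name is dropped here: the operation runs as the role,
    # so the log cannot tell Jon and Remi apart.
    log.append(AuditEvent(principal=DE_ROLE, action=action, path=path))


log = []
perform_as_role("jon", "write", "s3://bucket/de/report.csv", log)
perform_as_role("remi", "delete", "s3://bucket/de/report.csv", log)
for event in log:
    print(event.principal, event.action, event.path)
```

Both entries carry the same principal, so nothing in the log distinguishes the user who wrote the file from the user who deleted it.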
Adding a user to multiple groups
The group approach has an important caveat. Based on AWS IAM’s design, your CDP identity can only be mapped to one AWS IAM role. This makes composing and managing the rights conferred by membership in multiple groups extremely complex. If you wanted a user to have the rights of both the DE and DS groups, you’d have to either:
- modify your application to choose which role to use for each access, or
- have your AWS admin create a new IAM role that grants the union of the rights of the two roles. You would also need your CDP admin to create a new IDBroker group mapping for this Data Engineer + Data Science group. Furthermore, to keep the DE + DS role consistent with the DE and DS roles, the AWS admin would need to update the DE + DS role any time either of the two individual roles changed. Even then, they may still run into the policy size and grant limits.
All of these options are difficult to scale due to the implementation of the underlying systems or the operational burdens they impose.
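The maintenance burden of a combined role can be sketched as follows. This is an illustration under stated assumptions: permissions are modeled as plain sets of (action, path-prefix) pairs, and the role names and the `build_combined_role` helper are hypothetical. Real IAM policies are JSON documents with far richer semantics.

```python
# Toy model: permissions as sets of (action, path-prefix) pairs.
# Role names and data are hypothetical.
ROLE_PERMISSIONS = {
    "de-role": {("read", "s3://bucket/de/"), ("write", "s3://bucket/de/")},
    "ds-role": {("read", "s3://bucket/ds/")},
}


def build_combined_role(*role_names):
    """Union of the permissions of the given roles, computed at call time."""
    combined = set()
    for name in role_names:
        combined |= ROLE_PERMISSIONS[name]
    return combined


# The admin creates the DE + DS role once, as a snapshot of both roles.
ROLE_PERMISSIONS["de-ds-role"] = build_combined_role("de-role", "ds-role")

# Later the DS role gains a permission...
ROLE_PERMISSIONS["ds-role"].add(("write", "s3://bucket/ds/"))

# ...and the combined role is stale until someone rebuilds it by hand.
stale = build_combined_role("de-role", "ds-role") - ROLE_PERMISSIONS["de-ds-role"]
print(stale)
```

The non-empty `stale` set is the drift that the AWS admin has to detect and correct every time one of the individual roles changes.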
With RAZ: Fine-grained access control with RAZ for ADLS/S3
RAZ for ADLS and RAZ for S3 bring fine-grained access controls to cloud storage and avoid the operational and scalability burdens of the IDBroker approach. With RAZ, you get capabilities virtually identical to those that Ranger HDFS policies provide in HDP or CDP Private Cloud Base. These include file access audit, resource-based access policies, tag-based access policies, and sophisticated access conditions.
So what are the consequences of this implementation? Let’s look at what it takes when adding a new user and when adding a user to multiple groups using the RAZ approach.
When a user is added to the corporate IdP, the user is automatically put into the public group when they log into CDP. Access is enforced by Ranger policies. No new AWS IAM role is required, and thus no interaction with the AWS admin is required.
The scenario with Jon and Remi above is handled nicely as well — a Ranger S3 policy is set up by default that effectively gives Jon and Remi their own home directories. If both Jon and Remi have access to a shared directory, Ranger also records and audits all operations so that Jon can determine that it was Remi who deleted his files.
Adding a user to multiple groups is straightforward too. Just add your user to the group in the IdP or in your CDP groups. The updated group membership is propagated to Ranger automatically and almost instantaneously. When a user tries to access a file, RAZ and Ranger evaluate the request and make policy decisions based on the user identity and the union of all of their groups. Again, no new AWS IAM role is required, and thus no interaction with the AWS admin is needed.
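The idea of deciding on the user identity plus the union of the user's groups, while auditing per user, can be sketched like this. It is a simplified model and not Ranger's policy engine or API: the `Policy` record, the `is_allowed` function, and the sample policies are hypothetical, and features such as deny rules, tag-based policies, and access conditions are left out.

```python
# Toy model of policy evaluation by user and the union of their groups.
# Not Ranger's policy engine; names and data are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    path_prefix: str
    actions: frozenset
    users: frozenset = frozenset()
    groups: frozenset = frozenset()


POLICIES = [
    Policy("s3://bucket/de/", frozenset({"read", "write"}),
           groups=frozenset({"data-engineers"})),
    Policy("s3://bucket/ds/", frozenset({"read"}),
           groups=frozenset({"data-scientists"})),
]

AUDIT_LOG = []  # one entry per request, keyed by the real user name


def is_allowed(user, groups, action, path):
    """Allow if any policy matches the path, the action, and the user or a group."""
    allowed = any(
        path.startswith(p.path_prefix)
        and action in p.actions
        and (user in p.users or bool(p.groups & set(groups)))
        for p in POLICIES
    )
    AUDIT_LOG.append((user, action, path, "ALLOWED" if allowed else "DENIED"))
    return allowed


remi_groups = ["data-engineers", "data-scientists"]
print(is_allowed("remi", remi_groups, "write", "s3://bucket/de/a.csv"))
print(is_allowed("remi", remi_groups, "read", "s3://bucket/ds/model.pkl"))
print(is_allowed("remi", remi_groups, "write", "s3://bucket/ds/model.pkl"))
```

Remi's membership in both groups needs no combined role: each request is checked against every group at once, and each audit entry names Remi rather than a shared role.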
From a single pane of glass, a CDP admin can manage all data access policies in CDP: files, data warehouse tables, data flows, metadata, operational tables, and more. Regardless of the storage type or location, everything is handled consistently and audited on a per-user basis.
The RAZ approach is a major operational win for managing access control and audits on file access against cloud storage such as S3 and ADLS Gen2. It also solves the multiple group membership problem elegantly. Please take a look at this use case blog to see how these cases are available for CDP Public Cloud deployments.
RAZ for S3 and RAZ for ADLS are both available now in CDP Public Cloud as a tech preview, so please reach out to your account team to enable this capability.
For more details, see the following resources