This quick demo illustrates how easy it is to implement role-based access control in Impala using Sentry.
Apache Sentry (incubating) is the Apache Hadoop ecosystem tool for role-based access control (RBAC). In this how-to, I will demonstrate how to implement Sentry for RBAC in Impala. I feel this introduction is best motivated by a use case.
Data warehouse optimization is one of the most common Hadoop use cases. After migrating data transformation workloads to Hadoop, customers typically want to provide self-service business intelligence access on Hadoop. Self-service BI results in many distinct users logging in and executing queries, each under their own user ID. Once end users start using the cluster, fine-grained authorization becomes a requirement to satisfy internal controls and governmental regulations. Sentry was originally created for this use case.
I won’t go into detail here about why fine-grained authorization is useful; my colleague Shreepadma Venugopalan covered this topic in her post “With Sentry, Cloudera Fills Hadoop’s Enterprise Security Gap.” Furthermore, Sravya Tirukkovalur wrote a post about using Sentry with Apache Hive (“How-to: Get Started with Sentry in Hive”).
Sentry and Impala work together in much the same way as Sentry and Hive. In fact, since the policy file syntax is identical, users of both Hive and Impala are encouraged to share the same policy file.
The two systems have different architectures, however, which results in some divergence in how they interact with Sentry. For example, Hive is typically configured with one or a small number of HiveServer2 instances. Impala works differently: every Impala daemon accepts queries, one of the many design features that help Impala scale to a large number of concurrent queries.
In the Hive case, a small number of HiveServer2 instances will read the policy file from HDFS, whereas in the Impala case, each daemon will. (Since many Impala daemons will be reading the file from HDFS and the file is small, setting the replication count equal to the number of slave nodes is reasonable.) One additional difference is that while Hive reads and parses the policy file for each query, Impala checks to see if the policy file has been updated every five minutes.
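To make the difference in refresh behavior concrete, here is a minimal Python sketch of a cache that re-reads a policy only once a refresh interval has passed, in contrast to loading it on every query. This is purely an illustration of the idea, not Impala's or Hive's actual code; the `PolicyCache` class, the `loader` callable, and the injectable `clock` are all hypothetical names I made up for this example.

```python
import time


class PolicyCache:
    """Illustrative only: re-reads the policy after refresh_secs have elapsed."""

    def __init__(self, loader, refresh_secs=300, clock=time.monotonic):
        self._loader = loader              # hypothetical callable returning policy text
        self._refresh_secs = refresh_secs  # 300 seconds = five minutes
        self._clock = clock                # injectable so the sketch can be tested
        self._policy = loader()
        self._loaded_at = clock()
        self.loads = 1                     # how many times the loader has run

    def get(self):
        """Return the cached policy, reloading it if the interval has passed."""
        now = self._clock()
        if now - self._loaded_at >= self._refresh_secs:
            self._policy = self._loader()
            self._loaded_at = now
            self.loads += 1
        return self._policy


def policy_per_query(loader):
    """Contrast case: load the policy afresh for every query."""
    return loader()
```

With the interval-based approach, a change to the policy file may take up to one interval to become visible to a daemon, while the per-query approach picks it up on the next query.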
In the video below, we will use the following policy file, which in addition to an admin role has the hierarchical roles manager_role, analyst_role, and junior_analyst_role. As you can see, the manager_role has ALL on the default database, whereas the analyst_role has ALL on the analyst1_table and SELECT on the manager1_table. The junior_analyst_role has ALL on jranalyst1_table.
[groups]
management = manager_role
analyst = analyst_role, junior_analyst_role
jranalyst = junior_analyst_role
admin = admin_role

[roles]
manager_role = server=server1->db=default
analyst_role = server=server1->db=default->table=analyst1_table->action=select
junior_analyst_role = server=server1->db=default->table=jranalyst1_table->action=select
# Implies everything on server1.
admin_role = server=server1
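If it helps to see how the group-to-role and role-to-privilege mappings fit together, the following Python sketch parses the policy above and lists the privilege strings a group ends up with. It assumes the file uses the standard `[groups]` and `[roles]` sections of a Sentry policy file; the helper names `load_policy` and `privileges_for_group` are my own, and this is only a reading aid, not how Sentry itself evaluates policies.

```python
import configparser

# The policy from this post, with the [groups] and [roles] sections assumed.
POLICY_TEXT = """
[groups]
management = manager_role
analyst = analyst_role, junior_analyst_role
jranalyst = junior_analyst_role
admin = admin_role

[roles]
manager_role = server=server1->db=default
analyst_role = server=server1->db=default->table=analyst1_table->action=select
junior_analyst_role = server=server1->db=default->table=jranalyst1_table->action=select
# Implies everything on server1.
admin_role = server=server1
"""


def load_policy(text):
    """Parse the policy text into {group: [roles]} and {role: [privileges]}."""
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(text)
    groups = {
        group: [name.strip() for name in value.split(",")]
        for group, value in parser["groups"].items()
    }
    roles = {
        role: [priv.strip() for priv in value.split(",")]
        for role, value in parser["roles"].items()
    }
    return groups, roles


def privileges_for_group(group, groups, roles):
    """Flatten every privilege string granted to a group through its roles."""
    return [priv for role in groups.get(group, []) for priv in roles.get(role, [])]


if __name__ == "__main__":
    groups, roles = load_policy(POLICY_TEXT)
    for group in groups:
        print(group, "->", privileges_for_group(group, groups, roles))
```

Each privilege string reads left to right from the widest scope (server) to the narrowest (table and action), which is why `admin_role`, with only `server=server1`, covers everything on that server.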
In the demo below, I will first enable Sentry with Impala and then create and share a view of manager1_table for junior analysts that restricts their access to rows as well as columns.
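For readers who cannot watch the demo, here is a rough sketch of the view idea. It uses Python's built-in `sqlite3` module as a stand-in for Impala so that it runs anywhere. The column names (`id`, `region`, `salary`), the sample data, and the view name `manager1_view` are all hypothetical; only `manager1_table` comes from this post.

```python
import sqlite3

# In-memory SQLite database standing in for Impala (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE manager1_table (id INTEGER, region TEXT, salary INTEGER);
    INSERT INTO manager1_table VALUES
        (1, 'east', 100), (2, 'west', 200), (3, 'east', 300);

    -- The view hides the salary column and exposes only the 'east' rows.
    CREATE VIEW manager1_view AS
        SELECT id, region FROM manager1_table WHERE region = 'east';
    """
)

cursor = conn.execute("SELECT * FROM manager1_view ORDER BY id")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

if __name__ == "__main__":
    print(columns)
    print(rows)
```

In a Sentry setup, the junior analyst role would then be given a select privilege on the view instead of on the underlying table, so junior analysts can query the view without seeing the hidden data.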
You should now understand the relatively straightforward procedure of implementing RBAC in Impala using Sentry!
Brock Noland is a Software Engineer at Cloudera and an Apache committer on the Crunch and Hive projects.