Announcing RecordService Beta 2: Brings Column-level Security to Apache Spark and MapReduce


With this new beta release, column-level privileges set via Apache Sentry (incubating) are now enforced on Spark/MapReduce jobs.

Cloudera is excited to announce the availability of the second beta release of RecordService. This release is based on CDH 5.5 and adds several new features, including:

  • Support for Sentry column-level security. Previously, column-level access control required the use of views; now, permissions can be set on individual columns in a table. This new feature simplifies administration as views no longer need to be created and specified in jobs.
  • Multiple planners, enabling high availability
  • Spark 1.5 compatibility

In this post, we’ll walk you through two examples: how to replace an existing MapReduce/Spark job with RecordService, and how to use the column-level security feature with RecordService.

Enable MapReduce with RecordService

Configuring MapReduce jobs with RecordService is straightforward. The following examples use the ubiquitous WordCount program:

To use this program with RecordService, first replace the TextInputFormat in the first highlighted line with the RecordService version of TextInputFormat. Specifically, replace the line with:

Besides TextInputFormat, you can also use AvroInputFormat, AvroKeyInputFormat, or AvroKeyValueInputFormat. They correspond to classes with the same name in the Apache Avro project.

Then, you need to specify how the input dataset is read. RecordService provides multiple interfaces for reading datasets. The most commonly used are:

  • SELECT query on a table
  • A projection on a table

Both of these interfaces are referred to as SQL requests. To use them, the target dataset must be a table registered in the Hive Metastore. For the second approach, the input query can only be a SELECT statement on all or a subset of the columns of the table. UDFs are not currently allowed.

Reading data from SQL objects such as tables, columns, or views allows fine-grained, column-level and row-level privileges to be enforced via Sentry policies. The underlying files themselves would routinely be locked down so that most users cannot access them directly. (Sentry’s HDFS sync feature can set these underlying HDFS permissions automatically.)

To use one of these SQL requests, you simply need to replace the second highlighted line with

or

Here, in setInputQuery, args[0] should be a SQL SELECT query (e.g., SELECT * FROM tpch.nation), and in setInputTable it should be the full table name (e.g., tpch.nation). You can also replace the null with the database name (e.g., tpch) and pass the bare table name as args[0] (e.g., nation).
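Both argument styles for setInputTable name the same table. The snippet below is a hypothetical helper in plain Java, not part of RecordService, and its class and method names are made up for illustration. It only shows the naming convention described above: a null database means the table argument is already fully qualified.

```java
// Hypothetical illustration only: NOT a RecordService class or method.
final class TableNameSketch {
    // Joins an optional database name with a table name.
    // When db is null, the table argument is assumed to be fully qualified.
    static String qualifiedName(String db, String table) {
        return (db == null) ? table : db + "." + table;
    }

    public static void main(String[] args) {
        // Style 1: null database, full table name.
        System.out.println(qualifiedName(null, "tpch.nation"));
        // Style 2: explicit database, bare table name.
        System.out.println(qualifiedName("tpch", "nation"));
    }
}
```

Both calls print tpch.nation. Nothing like this helper is needed in a real job.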

Swapping the input format and specifying the input as a SQL request are the only changes you’ll need to make. For a complete example, see WordCount.java.

Enable Spark with RecordService

It’s also very easy to convert an existing Spark application to use RecordService. Below is a simple Spark version of WordCount (see WordCount.scala for a complete example).

As you can see, this program counts all the words in all files under the directory /test-warehouse/tpch.nation. To enable RecordService for the program, you simply need to replace the highlighted line with

…just like with the setInputQuery() or setInputTable() methods and their required SQL object parameters described previously. 

Enable Spark SQL with RecordService

Making Spark SQL work with RecordService is also super-easy. You only need to register a temporary table using RecordService:

Here, db is the database name, tbl is the table name, and size is the estimated size of this table in bytes (which can be optionally provided by the user to optimize certain operations on the table).

Column-Level Security with RecordService

Column-level security is an important feature introduced in Sentry 1.5.1. With it, you can restrict a particular group of users to a subset of the columns in a given table. Because this policy is enforced at the RecordService level, every component built on top of it, such as MapReduce and Spark, benefits from the feature as well.
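As a rough illustration of that rule, here is a toy model in plain Java. It is not RecordService or Sentry code, and the class and method names are hypothetical. It assumes tpch.nation has the four columns of the standard TPC-H nation table, and that a request is allowed only when every column it reads has been granted.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: this is NOT RecordService or Sentry code.
// A request is treated as authorized when every column it reads is granted.
final class ColumnAccessSketch {
    // Assumption: tpch.nation follows the standard TPC-H nation schema.
    static final List<String> NATION_COLUMNS =
        Arrays.asList("n_nationkey", "n_name", "n_regionkey", "n_comment");

    // Expands "*" to every column in the table; any other projection
    // is returned unchanged.
    static List<String> expand(List<String> projection, List<String> tableColumns) {
        if (projection.size() == 1 && projection.get(0).equals("*")) {
            return tableColumns;
        }
        return projection;
    }

    // Returns the requested columns that have NOT been granted.
    // An empty result means the request would be allowed.
    static Set<String> deniedColumns(List<String> projection,
                                     List<String> tableColumns,
                                     Set<String> granted) {
        Set<String> denied = new LinkedHashSet<>(expand(projection, tableColumns));
        denied.removeAll(granted);
        return denied;
    }

    public static void main(String[] args) {
        Set<String> granted =
            new LinkedHashSet<>(Arrays.asList("n_name", "n_nationkey"));

        // SELECT * reads columns outside the grant, so it would be rejected.
        System.out.println(deniedColumns(Arrays.asList("*"), NATION_COLUMNS, granted));

        // A projection limited to granted columns would be allowed.
        System.out.println(deniedColumns(
            Arrays.asList("n_name", "n_nationkey"), NATION_COLUMNS, granted));
    }
}
```

The first call reports n_regionkey and n_comment as denied, and the second reports nothing. In the real system the check is performed by RecordService against Sentry policies, so no client code like this is needed.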

Let’s review an example. Say that tpch.nation is the input table and has the following columns:

[Image: table listing the columns of tpch.nation]

First, for the purpose of demonstration, create a test group and assign it to a user called "demouser":

(Note: If you are using our QuickStart VM, this user and group are already set up for you, so the above step can be omitted.)

Next, create a corresponding test role for this group, and grant the role permissions to access the n_name and n_nationkey columns from the tpch.nation table. The example below is from the Apache Impala (incubating) shell, but you should be able to do the same with Apache Hive.

Now, if you want to count the number of records in tpch.nation with the above settings, you can launch a RecordCount job on the table:

This job will quickly fail with an AuthorizationException because the user has been granted read access only to the n_name and n_nationkey columns, whereas the job tries to read all the columns.

However, explicitly specifying the columns to which the user has been granted privileges will allow the job to succeed:

For Spark applications, the process is similar:

Conclusion

In this blog post, we showed how to integrate existing MapReduce/Spark applications with RecordService, and how to take advantage of the new column-level security feature in this latest release. We encourage you to try out these examples using our latest QuickStart VM. For more information, you can also check out our website and source code repo.

Chao Sun is a Software Engineer at Cloudera.


2 responses on “Announcing RecordService Beta 2: Brings Column-level Security to Apache Spark and MapReduce”

  1. Manhar

    This is an excellent feature. We have had the trouble of maintaining the same via views. This is going to make our life easier. Jumping on my VM to do some test.

    PS: Does upgrading to CDH 5.5 include the latest Sentry 1.5.1, or does this parcel need to be installed separately?

    Thanks a lot
    M.

    1. Justin Kestelyn Post author

      RecordService is a beta and thus is a separate install at this time.