New in CDH 5.1: Document-level Security for Cloudera Search

Cloudera Search now supports fine-grained access control via document-level security provided by Apache Sentry.

In my previous blog post, you learned about index-level security in Apache Sentry (incubating) and Cloudera Search. Although index-level security is effective when the access control requirements for documents in a collection are homogeneous, administrators often want to restrict access to certain subsets of documents in a collection.

For example, consider a simple hierarchy of increasingly restrictive security classifications: confidential, secret, and top-secret, and a user with access to view confidential and secret documents querying the corpus. Without document-level security, this query becomes unnecessarily complex. Consider two possible implementations:

  • You could store the confidential and secret documents in non-intersecting collections. That would require complexity at the application or client level to query multiple collections and to combine and score the results.

  • You could duplicate and store the confidential documents with the secret ones in a single collection. That would reduce the application-layer complexity, but add storage overhead and complexity associated with keeping multiple copies of documents in sync.

In contrast, document-level security, integrated via Sentry and now shipping in CDH 5.1, provides an out-of-the-box solution to this problem without adding extra complexity at the application/client layer or significant storage overhead. In this post, you’ll learn how it works. (Note: only access control is addressed here; other security requirements such as encryption are out of scope.)

Document-Level Security Model

You may recall from my previous post that a Sentry policy file specifies the following sections:

  • [groups]: maps a Hadoop group to its set of Sentry roles
  • [roles]: maps a Sentry role to its set of privileges (such as QUERY access on a collection “logs”)

A simple policy file specification giving every user of the Hadoop group “ops” the ability to query collection “logs” would look like this:

[groups]
# Assigns each Hadoop group to its set of roles
ops = ops_role

[roles]
ops_role = collection = logs->action=Query
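To make the group-to-role-to-privilege mapping concrete, here is a minimal Python sketch that reads a policy like the one above and lists the privileges a Hadoop group ends up with. It is not Sentry's own parser: the function name privileges_for_group is hypothetical, and the comma-splitting of roles and privileges is an assumption made for illustration.

```python
import configparser

# The policy from the example above, with its two sections.
POLICY = """
[groups]
# Assigns each Hadoop group to its set of roles
ops = ops_role

[roles]
ops_role = collection = logs->action=Query
"""


def privileges_for_group(policy_text, group):
    """Return the privilege strings granted to a Hadoop group (illustrative only)."""
    # Split each line on the first "=" only, so the privilege text keeps its own "=".
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep group and role names case-sensitive
    parser.read_string(policy_text)

    # Assumption: multiple roles or privileges on one line are comma-separated.
    roles = [r.strip() for r in parser["groups"][group].split(",") if r.strip()]
    privileges = []
    for role in roles:
        for privilege in parser["roles"][role].split(","):
            if privilege.strip():
                privileges.append(privilege.strip())
    return privileges


print(privileges_for_group(POLICY, "ops"))
```

The lookup goes through the role name on purpose: the same role names are what document-level security later reuses as authorization tokens.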


In document-level security, the Sentry role names are used as the authorization tokens that specify the set of roles that can view certain documents. The authorization tokens are specified in the individual Apache Solr documents, rather than in the Sentry policy file with the index-level permissions. This separation is done for a couple of reasons:

  • There are many more documents than collections; specifying thousands or millions of document-level permissions per collection in a single policy file would not scale.
  • Because the tokens are indexed in the Solr documents themselves, we can use Solr’s built-in filtering capabilities to efficiently enforce authorization requirements.
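As a rough illustration of tokens living in the documents themselves, the following Python sketch tags two documents with role names before they would be sent to Solr for indexing. The field name sentry_auth, the role names confidential_role and secret_role, and the helper with_auth_tokens are all assumptions chosen for this example; the actual field name is whatever the collection is configured to use.

```python
import json

# Assumed name of the multi-valued field that holds authorization tokens.
AUTH_FIELD = "sentry_auth"


def with_auth_tokens(doc, roles):
    """Return a copy of doc carrying the Sentry role names allowed to view it."""
    tagged = dict(doc)
    tagged[AUTH_FIELD] = sorted(set(roles))
    return tagged


# Hypothetical documents: each one carries the role for its classification.
docs = [
    with_auth_tokens({"id": "doc-1", "text": "quarterly report"}, ["confidential_role"]),
    with_auth_tokens({"id": "doc-2", "text": "merger plan"}, ["secret_role"]),
]

# JSON in the shape of a list of documents, ready to hand to an indexing client.
payload = json.dumps(docs, indent=2)
print(payload)
```

Nothing about these documents appears in the policy file; the policy only needs to define the roles and map groups to them.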

The filtering works by having a Solr SearchComponent intercept the query and append a FilterQuery as part of the following process:

A few important considerations to note here:

  • Document-level authorization does not supersede index-level authorization; if a user has the ability to view a document according to document-level security rules, but not according to index-level security rules, the request will be rejected.
  • The document-level component adds a FilterQuery with all of the user’s roles OR’ed together (a slight simplification of the actual FilterQuery used). Thus, to be able to view the document, the document must contain at least one of the user’s roles in the authorization token field. The name of the token field (called “authField” in the image above) is configurable.
  • Because multiple FilterQuerys work together as an intersection, a malicious user can’t avoid the document-level filter by specifying their own trivial FilterQuery (such as fq=*:*).
  • Using a FilterQuery is efficient, because Solr caches previously used FilterQuerys. Thus, when a user makes repeated queries on a collection with document-level security enabled, we only pay the cost of constructing the filter on the first query and use the cached filter on subsequent requests.
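The considerations above can be sketched in a few lines of Python. This is a simulation over in-memory documents, not the component's real code: the filter string is the simplified OR form described above, and the names build_auth_filter, visible_ids, sentry_auth, and the two role names are hypothetical.

```python
AUTH_FIELD = "sentry_auth"  # assumed token field name


def build_auth_filter(roles, auth_field=AUTH_FIELD):
    """OR the user's roles together into one simplified filter-query string."""
    if not roles:
        # This sketch simply refuses to build an empty filter.
        raise ValueError("no roles supplied")
    return f"{auth_field}:({' OR '.join(sorted(roles))})"


def matches_auth_filter(doc, roles, auth_field=AUTH_FIELD):
    """A document matches if it carries at least one of the user's roles."""
    return bool(set(doc.get(auth_field, [])) & set(roles))


def visible_ids(docs, user_filters, roles):
    """Intersect every user-supplied filter with the document-level filter."""
    ids = {d["id"] for d in docs}
    for keep in user_filters:
        ids &= {d["id"] for d in docs if keep(d)}
    ids &= {d["id"] for d in docs if matches_auth_filter(d, roles)}
    return ids


DOCS = [
    {"id": "doc-1", AUTH_FIELD: ["confidential_role"]},
    {"id": "doc-2", AUTH_FIELD: ["secret_role"]},
    {"id": "doc-3", AUTH_FIELD: ["top_secret_role"]},
]


def match_all(doc):
    """Stands in for a user-supplied fq=*:*."""
    return True


user_roles = ["confidential_role", "secret_role"]
print(build_auth_filter(user_roles))
print(sorted(visible_ids(DOCS, [match_all], user_roles)))
```

Because the filters are intersected, the match-all filter cannot widen the result: the top-secret document stays hidden from a user who holds only the confidential and secret roles.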

Enabling Document-Level Security

By default, document-level security is disabled to maintain backward compatibility with prior versions of Cloudera Search. Enabling the feature for a collection involves small modifications to the default solrconfig.xml configuration file.

Simply change enabled from “false” to “true” and, if desired, change the value of sentryAuthField. Then upload the configuration and create the collection using solrctl.
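For readers who script their configuration changes, here is a hedged Python sketch of that edit. The XML fragment is an assumption: the component name queryDocAuthorization, its class name, and the default field value sentry_auth are recalled from the shipped configuration template and should be checked against your own solrconfig.xml. Only the enabled and sentryAuthField settings come from this post, and enable_doc_level_security is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed shape of the relevant solrconfig.xml section; verify against your file.
SAMPLE = """<config>
  <searchComponent name="queryDocAuthorization"
      class="org.apache.solr.handler.component.QueryDocAuthorizationComponent">
    <bool name="enabled">false</bool>
    <str name="sentryAuthField">sentry_auth</str>
  </searchComponent>
</config>"""


def enable_doc_level_security(xml_text, auth_field=None):
    """Flip enabled to true and optionally rename the token field."""
    root = ET.fromstring(xml_text)
    component = root.find(".//searchComponent[@name='queryDocAuthorization']")
    if component is None:
        raise ValueError("document-level authorization component not found")
    component.find("bool[@name='enabled']").text = "true"
    if auth_field is not None:
        component.find("str[@name='sentryAuthField']").text = auth_field
    return ET.tostring(root, encoding="unicode")


print(enable_doc_level_security(SAMPLE, auth_field="doc_auth"))
```

Note that ElementTree drops XML comments when parsing this way, so for a real configuration file a manual edit of the two values is usually the safer route.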

Integration with the Hue Search App

As with index-level security, document-level security is already integrated with the Hue Search App via secure impersonation in order to provide an intuitive and extensible end-user application.


CDH 5.1 brings fine-grained access control to Cloudera Search via the integration of Sentry’s document-level security features. Document-level security handles complex security requirements, while being simple to set up and efficient to use.

Cloudera Search is available for download with extensive documentation. If you have any questions, please contact us at the Cloudera Search Forum.

Gregory Chanan
Software Engineer at Cloudera, and an Apache HBase committer