Index-Level Security Comes to Cloudera Search

The integration of Apache Sentry with Apache Solr helps Cloudera Search meet important security requirements.

As you have learned in previous blog posts, Cloudera Search brings the power of Apache Hadoop to a wide variety of business users via the ease and flexibility of full-text querying provided by Apache Solr. We have also done significant work to make Cloudera Search easy to add to an existing Hadoop cluster:

  • It uses the same pool of data and system resources as other workloads, so you avoid the time and expense of transferring data to an external search service.
  • It provides a familiar and trusted security framework for organizations with strict security requirements.
  • It is well integrated with our existing management platform (Cloudera Manager) in order to ease adoption and simplify operations.

In this post, we’ll focus on the security features of Cloudera Search. In particular, you’ll learn how Cloudera Search handles authentication (verifying a user’s identity) and authorization (controlling access to resources). We’ll also discuss secure impersonation and how it is used with the Hue Search App.

Authentication Overview

Cloudera Search, via Solr and Apache Lucene, provides an HTTP interface for querying, updating, and managing full-text search indices. Like the other HTTP-level services in an enterprise data hub (such as HttpFS and Apache Oozie), Cloudera Search uses the following frameworks for authentication over HTTP:

  • Kerberos: a mutual authentication protocol that works on the basis of “tickets”
  • SPNego: a negotiation mechanism for selecting an underlying authentication protocol

Cloudera Search uses SPNego HTTP authentication to select Kerberos as the underlying authentication protocol. Using Kerberos and SPNego in this manner is advantageous for users because many tools for accessing HTTP resources have built-in support for the protocol. For example, you can use curl with the --negotiate option, and many popular browsers, including Firefox and Chrome, can be configured to access Kerberos/SPNego protected resources.
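As a rough illustration of the curl route, the sketch below only assembles and prints the command line rather than running it. It assumes a Kerberos ticket has already been obtained (for example with kinit). The host, port, and collection name are illustrative, and the `-u :` argument is the customary way to tell curl to take the identity from the ticket cache instead of a typed username.

```python
import shlex

# Assumed endpoint: host, port, and collection name are illustrative.
url = "http://localhost:8983/solr/collection1/select?q=*:*"

# --negotiate turns on SPNego; "-u :" supplies an empty user so curl
# relies on the Kerberos ticket cache instead of prompting for credentials.
command = ["curl", "--negotiate", "-u", ":", url]

# shlex.join quotes the URL so the shell does not expand "?" or "*".
print(shlex.join(command))
```

Without a valid ticket, a Kerberos-protected server would be expected to answer such a request with an HTTP 401 challenge.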

Furthermore, although Kerberos is an authentication protocol, not an authorization protocol, you can use it to provide cluster-level access control by granting Kerberos credentials only to those users who should have access to the cluster. If you need finer-grained control than the cluster level, see the section on authorization below.

For information on configuring Cloudera Search to use authentication, see the documentation.

Authorization Overview

Solr itself does not provide access control support, but rather provides “hooks” to allow other systems to build access control on top of it. We have used these hooks to develop index-level access control using Apache Sentry (incubating). Sentry supports role-based granting of privileges in Solr; each role can be granted query, update, and/or admin privileges on any Solr index (called a “collection” in Solr terminology).

Let’s look at a specification of these privileges, called a policy file (typically stored in HDFS):
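A policy file of this kind might look like the sketch below, held in a Python string so that the group-to-role-to-privilege lookup can be exercised. The group names (dev_ops, analysts) and role names (query_role, update_role) are hypothetical; only the hbase_logs collection and the privilege syntax come from this post. The matching is a plain string comparison for illustration and is not how Sentry itself evaluates privileges.

```python
# Hypothetical policy file contents; group and role names are illustrative.
POLICY = """\
[groups]
dev_ops = query_role, update_role
analysts = query_role

[roles]
query_role = collection=hbase_logs->action=Query
update_role = collection=hbase_logs->action=Update
"""

def parse_policy(text):
    """Parse the file into {section: {name: [comma-separated values]}}."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
            continue
        # Split on the first "=" only; privileges contain "=" themselves.
        name, _, value = line.partition("=")
        current[name.strip()] = [v.strip() for v in value.split(",")]
    return sections

def is_allowed(policy, group, collection, action):
    """Return True if any role of the group holds the exact privilege."""
    wanted = f"collection={collection}->action={action}"
    roles = policy["groups"].get(group, [])
    return any(wanted in policy["roles"].get(role, []) for role in roles)

policy = parse_policy(POLICY)
print(is_allowed(policy, "analysts", "hbase_logs", "Query"))
print(is_allowed(policy, "analysts", "hbase_logs", "Update"))
```

In this sketch the analysts group can query hbase_logs but cannot update it, because it is mapped only to query_role.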

 

The policy file comprises two main sections:

  • [groups]: maps a Hadoop group to its set of Sentry roles
  • [roles]: maps a Sentry role to its set of privileges. One privilege in Solr is the ability to query, update, or perform administrative actions on a given collection. So, for example, the privilege specification collection = hbase_logs->action=Query grants the role the ability to query the hbase_logs collection in Solr.

Now that we’ve seen how to specify policies in Sentry, let’s look at how you would integrate Sentry and Solr. To understand this, let’s first look at how Solr processes an incoming request:

Figure: Processing of an incoming Solr HTTP request

First, the HTTP request comes into Solr and is sent to the SolrDispatchFilter, which is responsible for sending the request to the correct RequestHandler for the collection. If the request is to query data from the collection, it is sent to the Select RequestHandler; if the request is to update the collection, it is sent to the Update RequestHandler. The request handlers themselves are specified in the collection-specific configuration file, solrconfig.xml.

For example, specifying the Select RequestHandler may look like this:
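A minimal sketch of such a specification, held in a Python string and parsed to show the path-to-class mapping it expresses. The /select path and the solr.SearchHandler class are the ones discussed in this post; the wrapping `<config>` element and the defaults block are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Illustrative solrconfig.xml fragment; the <lst name="defaults"> block is an
# assumption added for realism, not taken from the original post.
SOLRCONFIG = """\
<config>
  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
    </lst>
  </requestHandler>
</config>
"""

root = ET.fromstring(SOLRCONFIG)

# Map each handler's path to the class named in its specification.
handlers = {h.get("name"): h.get("class") for h in root.iter("requestHandler")}

for path, cls in handlers.items():
    print(f"/solr/collection1{path} -> {cls}")
```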

 

Let’s assume this is the configuration for a Solr collection called “collection1”. This request handler specification tells Solr that a request to the path http://localhost:8983/solr/collection1/select should be dispatched to an instance of solr.SearchHandler.

In addition to the standard solrconfig.xml, Cloudera Search ships with a modified version (solrconfig.xml.secure) that has request handlers integrated with Sentry. For example, with the select handler above, Sentry uses a Solr SearchComponent to check permissions before the query request is processed:

Figure: Solr RequestHandler with Sentry Component

The secure versions of the other standard collection request handlers are implemented in a similar fashion.
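The idea of running a permission check as the first component of a handler can be modeled in a few lines. This is a conceptual sketch only, not Sentry's or Solr's code: the component functions, the request dictionary, and the set of allowed (user, collection, action) triples are all hypothetical.

```python
class AuthorizationError(Exception):
    """Raised when the requesting user lacks the required privilege."""

def authorization_component(allowed):
    """Build a component that rejects users lacking Query on the collection."""
    def check(request):
        key = (request["user"], request["collection"], "Query")
        if key not in allowed:
            raise AuthorizationError(
                f"{request['user']} may not query {request['collection']}"
            )
    return check

def query_component(request):
    """Stand-in for the component that actually executes the search."""
    request["response"] = f"results for {request['q']}"

def handle(request, components):
    """Run components in order; an exception stops the chain."""
    for component in components:
        component(request)
    return request.get("response")

# The permission check comes first, so the query never runs for denied users.
chain = [authorization_component({("alice", "collection1", "Query")}), query_component]

print(handle({"user": "alice", "collection": "collection1", "q": "error"}, chain))
try:
    handle({"user": "bob", "collection": "collection1", "q": "error"}, chain)
except AuthorizationError as exc:
    print("denied:", exc)
```

Because the check raises before the query component is reached, a denied request never touches the index.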

Administrative Requests

The section above covered requests on specific Solr collections, but what about cluster-level administrative actions? In Solr, administrative requests are sent to the /admin path. For example, a request to create a collection looks like:

http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection

If you compare this URL to the collection-specific URL above, you’ll see that “admin” looks just like any other collection, but with a different set of request handlers. Sentry mirrors this structure for privilege-granting purposes: instead of granting “admin” access to a role, you grant query or update access on the “admin” collection. Query access grants privileges for read-only administrative commands (for example, dumping the state of all the threads running in a Solr server), while update access grants privileges for administrative commands that write (such as changing the level of logging output for a Solr server).

For example, to grant a Sentry role read-only administrative command privileges and the ability to update a collection called “collection1”, add this to the Sentry policy file:
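A sketch of such an entry, embedded in a Python string so the resulting privileges can be inspected. The role name ops_role is hypothetical; the privilege syntax and the treatment of “admin” as a collection follow the description in this post.

```python
# Hypothetical role name; the two privileges follow the syntax used in this post.
ROLES_SECTION = """\
[roles]
ops_role = collection=admin->action=Query, collection=collection1->action=Update
"""

def privileges_for(role, text):
    """Return the set of privileges listed for a role, or an empty set."""
    for line in text.splitlines():
        # Split on the first "=" only; section headers have no "=" and are skipped.
        name, sep, value = line.partition("=")
        if sep and name.strip() == role:
            return {p.strip() for p in value.split(",")}
    return set()

granted = privileges_for("ops_role", ROLES_SECTION)

# Read-only administrative commands are covered; write commands are not.
print("collection=admin->action=Query" in granted)
print("collection=admin->action=Update" in granted)
print("collection=collection1->action=Update" in granted)
```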

 

Solr ships with a wide variety of collection-specific and administrative-level request handlers. For a complete list of the Sentry privileges required for the built-in Solr request handlers, see the documentation.

Secure Impersonation and Hue

Like Hadoop and Oozie, Cloudera Search has support for secure impersonation: the ability of a “super-user” to submit requests on behalf of another user, conceptually similar to sudo functionality on Unix. For security reasons, this functionality is limited to only the groups and hosts that are explicitly configured. (See the documentation for more information.)
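The restriction to explicitly configured groups and hosts can be pictured as a simple lookup. The sketch below is a conceptual model under stated assumptions, not Cloudera Search's configuration format or code: the PROXY_USERS table, the host name, and the group names are all hypothetical.

```python
# Hypothetical proxy-user table: the hosts a super-user may connect from and
# the groups whose members that super-user may impersonate.
PROXY_USERS = {
    "hue": {"hosts": {"hue-server.example.com"}, "groups": {"analysts"}},
}

def may_impersonate(super_user, remote_host, target_user_groups):
    """Allow impersonation only from a configured host and for a configured group."""
    rule = PROXY_USERS.get(super_user)
    if rule is None:
        return False
    host_ok = remote_host in rule["hosts"]
    group_ok = bool(rule["groups"] & set(target_user_groups))
    return host_ok and group_ok

print(may_impersonate("hue", "hue-server.example.com", ["analysts"]))
print(may_impersonate("hue", "laptop.example.com", ["analysts"]))
print(may_impersonate("mallory", "hue-server.example.com", ["analysts"]))
```

Only the first call succeeds in this model: the second comes from an unlisted host, and the third names a user who is not configured as a super-user at all.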

The excellent Hue Search App makes use of this functionality in order to integrate with its own security mechanisms. Without this impersonation support, Hue would need access to Kerberos credentials for every user of the Hue App who wants to access Solr, an unacceptable requirement for many organizations. Instead, Hue can integrate with LDAP (and other authentication systems) and use secure impersonation to make requests on behalf of the LDAP-authenticated user, integrating seamlessly with Solr and Sentry.

Conclusion

We believe the integration of Solr and Sentry in Cloudera Search is an exciting development that opens up new workloads in CDH for organizations with strict security requirements, all in an easily consumed application provided by Hue.

Cloudera Search is available for download with extensive documentation. If you have any questions, please contact us at the Cloudera Search Forum.

Gregory Chanan is a Software Engineer at Cloudera and an Apache HBase Committer.

2 Responses
  • Joe Travaglini / March 28, 2014 / 11:03 AM

    Would it be more accurate to call this “Collection-level security” as opposed to “Index level”? The latter implies that each data point is individually protected. However, it seems that the authorizations described are on a bucket of things in Solr, and not on individual indices.

  • Gregory Chanan / March 31, 2014 / 2:03 PM

    Hi Joe,

    Thanks for reading and for your question.

    When discussing security on “each data point” I think it is clearest to specify the type of data point, e.g., security on each document is document-level security, security on each field is field-level security, etc.

    Here, as you point out, we are discussing security on an aggregation of documents. Solr uses the term “collection” for this, some other search engines use the term “index” (not to be confused with the verb “index”, i.e. parsing and storage). So, I think both “collection-level security” and “index-level security” are appropriate terminology.
