How-to: Secure Apache Solr Collections and Access Them Programmatically

Categories: Platform Security & Cybersecurity, Search, Sentry

Learn how to secure your Solr data in a policy-based, fine-grained way.

Data security is more important than ever before. At the same time, risk is increasing due to the relentlessly growing number of device endpoints, the continual emergence of new types of threats, and the commercialization of cybercrime. And with Apache Hadoop already instrumental for supporting the growth of data volumes that fuel mission-critical enterprise workloads, the necessity to master available security mechanisms is of vital importance to organizations participating in that paradigm shift.

Fortunately, the Hadoop ecosystem has responded to this need in the past couple of years by spawning new functionality for end-to-end encryption, strong authentication, and other aspects of platform security. For example, Apache Sentry provides fine-grained, role-based authorization capabilities used in a number of Hadoop components, including Apache Hive, Apache Impala (incubating), and Cloudera Search (an integration of Apache Solr with the Hadoop ecosystem). Sentry is also able to dynamically synchronize the HDFS permissions of data stored within Hive and Impala by using ACLs that derive from Hive GRANTs.

In this post, you’ll learn how to secure Solr data by controlling read/write access via Sentry (backed by the strong authentication capabilities of Kerberos) and how to access that data programmatically from Java applications and Apache Flume. This scenario applies to many industry use cases in which Solr serves as the backing data layer of a multi-tenant, Java-based web application whose content is frequently updated in the background.

Preparation

Our example assumes that:

  • Solr is running in a Cloudera-powered enterprise data hub, with Kerberos and Sentry also deployed.
  • A web app needs to access a Solr collection programmatically using Java.
  • The Solr collection is updated in real time via Flume and a MorphlineSolrSink.

Sentry authorizations for Hive and Impala can be stored in either a dedicated database or a file in HDFS (the policy provider is pluggable). In the example below, we’ll configure role-based access control for Solr via the file-based policy provider.

Create the Solr Collection

First, we’ll generate a collection configuration set called poems:
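    # A sketch: assumes solrctl from Cloudera Search is on your PATH;
    # the target directory is arbitrary
    solrctl instancedir --generate $HOME/poems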

We are assuming that your Solr client configuration automatically comprises settings for solrctl such that it can locate Apache ZooKeeper and the Solr nodes. If that is not the case, you might have to instruct the solrctl command on its location explicitly, for example:
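    # The ZooKeeper quorum and Solr URL below are placeholders for your cluster
    solrctl --zk zk01.example.com:2181/solr \
            --solr http://solr01.example.com:8983/solr \
            instancedir --generate $HOME/poems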

Edit poems/conf/schema.xml to reflect a smaller number of fields per document. (A simple id and text field will suffice.) Also, confirm that copy-fields are removed from the sample schema:
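A minimal sketch of the field definitions; note that the _version_ field must be retained because Solr’s transaction log requires it:

    <fields>
      <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
      <field name="text" type="text_general" indexed="true" stored="true"/>
      <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
    </fields>
    <uniqueKey>id</uniqueKey>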

Be sure to use the secured solrconfig.xml:
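If your generated configuration set includes a pre-secured configuration file alongside the default one (as CDH’s does), activating it is a copy; adjust the path if your layout differs:

    cp $HOME/poems/conf/solrconfig.xml.secure $HOME/poems/conf/solrconfig.xml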

Push the configuration data into Apache ZooKeeper:
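    solrctl instancedir --create poems $HOME/poems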

Create the collection:
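    # -s sets the shard count; one shard is enough for this example
    solrctl collection --create poems -s 1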

Secure the poems Collection using Sentry

The policy shown below establishes four Sentry roles based on the admin, operators, users, and techusers groups.

  • Administrators are entitled to all actions.
  • Operators are granted update and query privileges.
  • Users are granted query privileges.
  • Tech users are granted update privileges.

Add the content of the following listing to a file called sentry-provider.ini, renaming the groups to match the corresponding groups in your cluster:
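    # A sketch of the policy file; the group names are examples and must
    # match the groups defined in your environment
    [groups]
    cloudera_hadoop_admin = admin_role
    cloudera_hadoop_operators = operator_role
    cloudera_hadoop_users = user_role
    cloudera_hadoop_techusers = techuser_role

    [roles]
    admin_role = collection = admin -> action = *, collection = poems -> action = *
    operator_role = collection = poems -> action = update, collection = poems -> action = query
    user_role = collection = poems -> action = query
    techuser_role = collection = poems -> action = update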

Put sentry-provider.ini into HDFS:
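    hdfs dfs -mkdir -p /user/solr/sentry
    hdfs dfs -put sentry-provider.ini /user/solr/sentry/
    # the Solr service reads the file, so hand it over to the solr user
    hdfs dfs -chown -R solr /user/solr/sentry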

Enable Sentry policy-file usage in the Solr service in Cloudera Manager:

Solr → Configuration → Service-Wide → Policy File Based Sentry → Enable Sentry Authorization = True

Restart Solr (only needed once for enabling Sentry integration):

Solr → Actions → Restart

Add Data to the Collection via curl

Use curl to add content:
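    # hostname and sample document are placeholders; --negotiate -u : sends
    # your Kerberos (SPNEGO) credentials from the ticket cache
    curl --negotiate -u : \
        'http://solr01.example.com:8983/solr/poems/update?commit=true' \
        -H 'Content-Type: application/json' \
        -d '[{"id": "poem-1", "text": "My bounty is as boundless as the sea, my love as deep."}]'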

Use curl to perform an initial query and verify Solr’s function:
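    curl --negotiate -u : \
        'http://solr01.example.com:8983/solr/poems/select?q=*%3A*&wt=json&indent=true'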

Accessing the Collection via Java

Next, we’ll make sure that the web app can access the collection whenever needed.

Add the following code to a Java file called SecureSolrJQuery.java:
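(The listing is a sketch against the SolrJ 4.x API that ships with CDH 5; the Solr URL is a placeholder for one of your Solr nodes.)

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpClientUtil;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.impl.Krb5HttpClientConfigurer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SecureSolrJQuery {
        public static void main(String[] args) throws Exception {
            String queryString = args.length > 0 ? args[0] : "*:*";
            // Enable Kerberos/SPNEGO authentication; the credentials come from
            // the JAAS config passed via -Djava.security.auth.login.config
            HttpClientUtil.setConfigurer(new Krb5HttpClientConfigurer());
            HttpSolrServer solr = new HttpSolrServer("http://solr01.example.com:8983/solr/poems");
            QueryResponse response = solr.query(new SolrQuery(queryString));
            System.out.println("Found " + response.getResults().getNumFound() + " document(s)");
            for (SolrDocument doc : response.getResults()) {
                System.out.println("id: " + doc.getFieldValue("id")
                        + ", text: " + doc.getFieldValue("text"));
            }
            solr.shutdown();
        }
    }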

Create a JAAS config (jaas-cache.conf) to use the Kerberos ticket cache (that is, your existing ticket from kinit):
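    Client {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=false
      useTicketCache=true
      principal="youruser@YOURREALM";
    };

(The Client entry name follows Cloudera’s Solr client documentation; the principal is a placeholder for your own.)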

Later, you’ll see how to achieve the same goal with a keytab to make authentication happen non-interactively.

Using the Code

Compile the Java class:
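    # the classpath assumes a parcel-based CDH installation; adjust as needed
    javac -cp "/opt/cloudera/parcels/CDH/jars/*" SecureSolrJQuery.java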

Create a shell script called query-solrj-jaas.sh to run the query code:
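    #!/bin/bash
    # jaas-cache.conf picks up the Kerberos ticket previously obtained via kinit
    java -Djava.security.auth.login.config=jaas-cache.conf \
         -cp ".:/opt/cloudera/parcels/CDH/jars/*" \
         SecureSolrJQuery "$@"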

kinit as a user who is a member of cloudera_hadoop_admin (or any other group with query privileges) and run the code:
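    # hadoopadmin is a placeholder for a user in the cloudera_hadoop_admin group
    kinit hadoopadmin
    ./query-solrj-jaas.sh '*:*'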

To verify that Sentry works as intended, change sentry-provider.ini so that the role loses its query privilege; access should then be denied. Performing kinit as a user who is not in a group mapped to an appropriate role has the same effect.

Policy (an illustrative variation of the file above that strips admin_role of its poems privileges):
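    [roles]
    # admin_role no longer mentions the poems collection:
    admin_role = collection = admin -> action = *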

Effect: the query now fails with an authorization error along these lines (the exact wording varies by Solr version):
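    Exception in thread "main" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
        Server at http://solr01.example.com:8983/solr/poems returned non ok status:401, message:Unauthorized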

Accessing the Collection via Flume

Add the same sample data to a file called data.txt in a simple line-based format that the morphline below will parse; here we assume one record per line, with tab-separated id and text columns:
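    poem-1	My bounty is as boundless as the sea, my love as deep.
    poem-2	The more I give to thee, the more I have, for both are infinite.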

Create a morphline.conf file to transform the text data into Solr documents:
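    # A sketch; zkHost is a placeholder for your ZooKeeper quorum
    SOLR_LOCATOR : {
      collection : poems
      zkHost : "zk01.example.com:2181/solr"
    }

    morphlines : [
      {
        id : poemsMorphline
        importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
        commands : [
          {
            # split each line on tabs into the id and text fields
            readCSV {
              separator : "\t"
              columns : [id, text]
              charset : UTF-8
            }
          }
          # drop any fields the poems schema doesn't know about
          { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
          # index the record into the secured collection
          { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
        ]
      }
    ]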

Prepare a Keytab for Flume

Create a technical user (e.g. tech.hadoop), create a principal for this user, and extract a keytab for that principal. The exact method to do so depends on whether you use MIT Kerberos or Microsoft Active Directory.
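With an MIT KDC, for example, the steps might look like this (the admin principal is a placeholder):

    kadmin -p youradmin/admin -q "addprinc -randkey tech.hadoop"
    kadmin -p youradmin/admin -q "xst -k tech.hadoop.keytab tech.hadoop"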

Ensure the user has appropriate permissions to update the collection. For example, the user could be a member of our cloudera_hadoop_techusers group.

Next, create a local JAAS config file (jaas-kt.conf) that uses the keytab of the tech user:
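    Client {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      keyTab="/etc/flume-ng/conf/tech.hadoop.keytab"
      storeKey=true
      useTicketCache=false
      principal="tech.hadoop@YOURREALM";
    };

(The keytab path and realm are placeholders; we will place the keytab under /etc/flume-ng/conf/ in a later step.)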

Configure and Start Flume

Create a Flume configuration file (flume.conf) that pushes the data:
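    # A sketch using a spooling-directory source; all paths are placeholders
    agent.sources = spoolSrc
    agent.channels = memChannel
    agent.sinks = solrSink

    agent.sources.spoolSrc.type = spooldir
    agent.sources.spoolSrc.spoolDir = /tmp/flume-spool
    agent.sources.spoolSrc.channels = memChannel

    agent.channels.memChannel.type = memory
    agent.channels.memChannel.capacity = 10000

    agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
    agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
    agent.sinks.solrSink.channel = memChannel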

Start the agent using the JAAS config:
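    flume-ng agent --conf /etc/flume-ng/conf --conf-file flume.conf --name agent \
        -Djava.security.auth.login.config=jaas-kt.conf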

Ingest the data file:
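    # the spooling-directory source picks up any file dropped into the spool path
    cp data.txt /tmp/flume-spool/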

Configure Flume in Cloudera Manager

Cloudera Manager automatically generates a JAAS configuration for Flume that is used by Java client code such as that used by the MorphlineSolrSink. There are three options for getting the desired behavior when Cloudera Manager manages the execution of the Flume agent:

  • Cloudera Manager creates keytabs for a principal other than flume: We configure the Flume service with a principal of our choosing, such as the tech.hadoop principal from above, by changing the “Kerberos Principal” setting under Flume → Configuration → Security. Cloudera Manager will then create a keytab for tech.hadoop/yourhost@YOURREALM, where yourhost is the host running the Flume agent, and use this principal globally as the Hadoop service principal for Flume. Authentication requests against a “Sentry-fied” Solr service will map tech.hadoop/yourhost@YOURREALM to the tech.hadoop user and the cloudera_hadoop_techusers group, which is eligible to access the collection. The Flume agent will still run as the flume user. (Note: While this is a quick configuration change, it may not be desirable to change the Flume principal globally.)
  • Use Cloudera Manager’s Flume principal for Sentry authorization: This option does not change anything in the default service configuration of Flume in Cloudera Manager, which means that Flume will access Solr with the flume/yourhost@YOURREALM principal (where yourhost is the host running the Flume agent). This option requires that the Linux user flume is a member of the cloudera_hadoop_techusers group (or any other group that has the appropriate privileges as per our sentry-provider.ini), so that the Sentry-fied Solr server permits flume to access the collection. (Again, depending on your needs, it may or may not be desirable to do that.)
  • Cloudera Manager uses a user-defined JAAS configuration to run the Flume agent: We place the jaas-kt.conf, which we previously generated, as well as the keytab tech.hadoop.keytab, in /etc/flume-ng/conf/. The location of the files is in fact arbitrary, but we need to make sure they can be accessed by the flume user:
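    sudo cp jaas-kt.conf tech.hadoop.keytab /etc/flume-ng/conf/
    sudo chown flume /etc/flume-ng/conf/jaas-kt.conf /etc/flume-ng/conf/tech.hadoop.keytab
    sudo chmod 600 /etc/flume-ng/conf/tech.hadoop.keytab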

    In Cloudera Manager, we then use the “Flume Service Environment Advanced Configuration Snippet (Safety Valve)” under Flume → Configuration → Security to supply custom Java options to the Flume agent:
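    # -Xmx500m stands in for whatever standard options Cloudera Manager derived
    # for your agent; the essential addition is the JAAS flag
    JAVA_OPTS=-Xmx500m -Djava.security.auth.login.config=/etc/flume-ng/conf/jaas-kt.conf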

The options above were copied from the standard options derived by Cloudera Manager, with the exception of the -Djava.security.auth.login.config flag.

Conclusion

At this point, you should have a good understanding of how to use Sentry to manage access control and enforce authorization for queries to Solr from Java-based applications and Flume, using Kerberos for strong authentication.

Jan Kunigk and Paul Wilkinson are Solution Architects at Cloudera.
