Learn how to secure your Solr data in a policy-based, fine-grained way.
Data security is more important than ever before. At the same time, risk is increasing due to the relentlessly growing number of device endpoints, the continual emergence of new types of threats, and the commercialization of cybercrime. And with Apache Hadoop already instrumental in supporting the growing data volumes that fuel mission-critical enterprise workloads, mastering the available security mechanisms is of vital importance to organizations participating in that paradigm shift.
Fortunately, the Hadoop ecosystem has responded to this need in the past couple of years by spawning new functionality for end-to-end encryption, strong authentication, and other aspects of platform security. For example, Apache Sentry provides fine-grained, role-based authorization capabilities used in a number of Hadoop components, including Apache Hive, Apache Impala (incubating), and Cloudera Search (an integration of Apache Solr with the Hadoop ecosystem). Sentry is also able to dynamically synchronize the HDFS permissions of data stored within Hive and Impala by using ACLs that derive from Hive GRANTs.
In this post, you’ll learn how to secure Solr data by controlling read/write access via Sentry (backed by the strong authentication capabilities of Kerberos), and how to access that data programmatically from Java applications and Apache Flume. This setup applies to many industry use cases where Solr is the backing data layer of a multi-tenant, Java-based web application whose content is frequently updated in the background.
Preparation
Our example assumes that:
- Solr is running in a Cloudera-powered enterprise data hub, with Kerberos and Sentry also deployed.
- A web app needs to access a Solr collection programmatically using Java.
- The Solr collection is updated in real time via Flume and a MorphlineSolrSink.
Sentry authorizations can be stored either in a dedicated database (as for Hive and Impala) or in a policy file in HDFS; the policy provider is pluggable. In the example below, we’ll configure role-based access control via the file-based policy provider.
Create the Solr Collection
First, we’ll generate a collection configuration set called poems:
solrctl instancedir --generate poems
We are assuming that your Solr client configuration already includes the settings solrctl needs to locate Apache ZooKeeper and the Solr nodes. If that is not the case, you may have to point the solrctl command at them explicitly, for example:
solrctl --zk zookeeper-host1:2181,zookeeper-host2:2181,zookeeper-host3:2181/solr --solr http://your.datanode.net:8983/solr
Edit poems/conf/schema.xml to reflect a smaller number of fields per document; a simple id field and a text field will suffice. Also, confirm that copy-fields are removed from the sample schema.
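After the edit, the field section might look like the following minimal sketch; the exact field types depend on the generated configuration, and the _version_ field is required by Solr’s update log:

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
  <field name="text" type="text_general" indexed="true" stored="true"/>
  <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>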
Be sure to use the secured solrconfig.xml:
cp poems/conf/solrconfig.xml poems/conf/solrconfig.xml.original
cp poems/conf/solrconfig.xml.secure poems/conf/solrconfig.xml
Push the configuration data into Apache ZooKeeper:
solrctl instancedir --create poems poems
Create the collection:
solrctl collection --create poems
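As a quick sanity check, you can list the collections that are now registered in ZooKeeper:
solrctl collection --list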
Secure the poems Collection using Sentry
The policy shown below establishes four Sentry roles based on the admin, operators, users, and techusers groups.
- Administrators are entitled to all actions.
- Operators are granted update and query privileges.
- Users are granted query privileges.
- Tech users are granted update privileges.
[groups]
cloudera_hadoop_admin = admin_role
cloudera_hadoop_operators = both_role
cloudera_hadoop_users = query_role
cloudera_hadoop_techusers = update_role

[roles]
admin_role = collection = *->action=*
both_role = collection = poems->action=Update, collection = poems->action=Query
query_role = collection = poems->action=Query
update_role = collection = poems->action=Update
Add the content of the listing to a file called sentry-provider.ini, and rename the groups to match the corresponding groups in your cluster.
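If you are unsure which groups a user resolves to from Hadoop’s perspective, you can inspect the group mapping directly (jdoe is a placeholder user name):
hdfs groups jdoe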
Put sentry-provider.ini into HDFS:
hdfs dfs -mkdir -p /user/solr/sentry
hdfs dfs -put sentry-provider.ini /user/solr/sentry
hdfs dfs -chown -R solr /user/solr
Enable Sentry policy-file usage in the Solr service in Cloudera Manager:
Solr → Configuration → Service Wide → Policy File Based Sentry → Enable Sentry Authorization = True
Restart Solr (only needed once for enabling Sentry integration):
Solr → Actions → Restart
Add Data to the Collection via curl
Use curl to add content:
kinit
curl --negotiate -u : -s \
  http://your.datanode.net:8983/solr/poems/update?commit=true \
  -H "Content-Type: text/xml" --data-binary \
  '<add><doc><field name="id">1</field><field name="text">Mary had a little lamb, the fleece was white as snow.</field></doc><doc><field name="id">2</field><field name="text">The quick brown fox jumps over the lazy dog.</field></doc></add>'
Use curl to perform an initial query and verify Solr’s function:
curl --negotiate -u : -s \
  http://your.datanode.net:8983/solr/poems/get?id=1
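If authentication and authorization succeed, the response should contain the stored document, along these lines (the _version_ value will differ in your environment):
{"doc":{"id":"1","text":"Mary had a little lamb, the fleece was white as snow.","_version_":1496618383939993600}}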
Accessing the Collection via Java
Next, we’ll make sure that the web app can access the collection whenever needed.
Add the following code to a Java file called SecureSolrJQuery.java:
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

import java.net.MalformedURLException;

class SecureSolrJQuery {
    public static void main(String[] args) throws MalformedURLException, SolrServerException {
        String queryParameter = args.length == 1 ? args[0] : "*";
        String urlString = "http://your.datanode.net:8983/solr/poems";
        SolrServer solr = new HttpSolrServer(urlString);

        SolrQuery query = new SolrQuery();
        query.set("q", "text:" + queryParameter);
        QueryResponse response = solr.query(query);

        SolrDocumentList results = response.getResults();
        for (int i = 0; i < results.size(); ++i) {
            System.out.println(results.get(i));
        }
    }
}
Create a JAAS config (jaas-cache.conf) to use the Kerberos ticket cache (that is, your existing ticket from kinit):
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useTicketCache=true
  debug=false;
};
Later, you’ll see how to achieve the same goal with a keytab to make authentication happen non-interactively.
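As an aside, instead of passing the JAAS file via the -Djava.security.auth.login.config JVM flag (as the script below does), the property could also be set programmatically; a minimal sketch, with an example path:

// Equivalent to -Djava.security.auth.login.config on the command line;
// must run before the first HTTP request to Solr triggers the JAAS login.
System.setProperty("java.security.auth.login.config", "/home/solr_test/jaas-cache.conf");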
Using the Code
Compile the Java class:
CP=`find /opt/cloudera/parcels/CDH/lib/solr/ | grep "\.jar" | tr '\n' ':'`
CP=$CP:`hadoop classpath`
javac -cp $CP SecureSolrJQuery.java
Create a shell script called query-solrj-jaas.sh to run the query code:
CP=`find /opt/cloudera/parcels/CDH/lib/solr/ | grep "\.jar" | tr '\n' ':'`
CP=$CP:`hadoop classpath`
java -Djava.security.auth.login.config=`pwd`/jaas-cache.conf -cp $CP SecureSolrJQuery $1
kinit as a user who is a member of cloudera_hadoop_admin (or any other group with query privileges) and run the code:
kinit
./query-solrj-jaas.sh
15/03/25 16:00:57 INFO impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
15/03/25 16:00:57 INFO impl.HttpClientUtil: Setting up SPNego auth with config: /home//solr_test/jaas-cache
SolrDocument{id=1, text=Mary had a little lamb, the fleece was white as snow., _version_=1496618383939993600}
SolrDocument{id=2, text=The quick brown fox jumps over the lazy dog., _version_=1496618383970402304}
To verify that Sentry denies access as intended, change sentry-provider.ini so that none of your groups is mapped to a privileged role. Performing kinit as a user who is not in a group mapped to an appropriate role has the same effect.
Policy:
[groups]
nogroup = admin_role

[roles]
admin_role = collection = *->action=*
Effect:
./query-solrj-jaas.sh
15/03/25 16:03:32 INFO impl.HttpClientUtil: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
15/03/25 16:03:33 INFO impl.HttpClientUtil: Setting up SPNego auth with config: /home/a.jkunig/solr_test/jaas-cache
Exception in thread "main" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.sentry.binding.solr.authz.SentrySolrAuthorizationException: User bob does not have privileges for poems
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:556)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:221)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:216)
    at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)
    at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
    at com.cloudera.fts.solr.query.SecureSolrJQuery.main(SecureSolrJQuery.java:36)
Accessing the Collection via Flume
Add the same sample data to a file called data.txt in exactly the following format, which we will use in the Morphline:
3|Mary had additional lambs, their fleeces were like the first.
4|The quick brown fox still jumps over the lazy dog.
Create a morphline.conf
file to transform the text data into Solr documents:
SOLR_LOCATOR : {
  collection : poems
  zkHost : "your.datanode.net:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      {
        readCSV {
          separator : "|"
          columns : [id, text]
          charset : UTF-8
        }
      }
      {
        logDebug {
          format : "output record: {}"
          args : ["@{}"]
        }
      }
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
Prepare a Keytab for Flume
Create a technical user (e.g. tech.hadoop), create a principal for this user, and extract a keytab for that principal. The exact method depends on whether you use MIT Kerberos or Microsoft Active Directory.
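For example, with an MIT Kerberos KDC, the principal and keytab could be created roughly as follows (the principal name and realm are placeholders for your environment):
kadmin.local -q "addprinc -randkey tech.hadoop@YOURREALM"
kadmin.local -q "xst -k tech.hadoop.keytab tech.hadoop@YOURREALM"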
Give the user the appropriate permissions to update the collection; for example, the user could be in our cloudera_hadoop_techusers group.
Next, create a local JAAS config file (jaas-kt.conf) that uses the keytab of the tech user:
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache=false
  keyTab="/home/tech.hadoop/tech.hadoop.keytab"
  principal="tech.hadoop@YOURREALM";
};
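Before handing the keytab to Flume, it is worth verifying that it actually works:
kinit -kt /home/tech.hadoop/tech.hadoop.keytab tech.hadoop@YOURREALM
klist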
Configure and Start Flume
Create a Flume configuration file (flume.conf) that pushes the data:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
a1.sinks.k1.morphlineFile = /home/path/to/morphline.conf

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the agent using the JAAS config:
flume-ng agent -n a1 -f ./flume.conf -Xmx1G -Djava.security.auth.login.config=/home/tech.hadoop/jaas-kt.conf
Ingest the data file:
cat ./data.txt | nc localhost 44444
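To verify that the new documents arrived in the collection, query for one of them, as a user with query privileges:
curl --negotiate -u : -s \
  http://your.datanode.net:8983/solr/poems/get?id=3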
Configure Flume in Cloudera Manager
Cloudera Manager automatically generates a JAAS configuration for Flume that is used by Java client code such as the MorphlineSolrSink. There are three options for getting the desired behavior when Cloudera Manager manages the execution of the Flume agent:
- Have Cloudera Manager create keytabs for a principal other than flume: We configure the Flume service with a principal that we choose, such as our tech.hadoop principal from above, by changing the “Kerberos Principal” setting under Flume → Configuration → Security. Cloudera Manager will then create a keytab for tech.hadoop/yourhost@YOURREALM, where yourhost is the host running the Flume agent, and use this principal globally as the Hadoop service principal for Flume. Authentication requests against a “Sentry-fied” Solr service will map tech.hadoop/yourhost@YOURREALM to the tech.hadoop user and cloudera_hadoop_techusers group, which is eligible to access the collection. The Flume agent will still run as the flume user. (Note: While this is a quick configuration change, it may not be desirable to change the Flume principal globally.)
- Use Cloudera Manager’s Flume principal for Sentry authorization: This option does not change anything in the default service configuration of Flume in Cloudera Manager, which means that Flume will access Solr with the flume/yourhost@YOURREALM principal (where yourhost is the host running the Flume agent). It requires that the Linux user flume is a member of the cloudera_hadoop_techusers group (or any other group that has the appropriate privileges as per our sentry-provider.ini), so that the Sentry-fied Solr server permits flume to access the collection. (Again, depending on your needs, it may or may not be desirable to do that.)
- Have Cloudera Manager use a user-defined JAAS configuration to run the Flume agent: We place the previously generated jaas-kt.conf, as well as the keytab tech.hadoop.keytab, in /etc/flume-ng/conf/. The location of the files is in fact arbitrary, but we need to make sure they can be accessed by the flume user:
chown flume:flume /etc/flume-ng/conf/tech.hadoop.keytab /etc/flume-ng/conf/jaas-kt.conf
In Cloudera Manager, we then use the “Flume Service Environment Advanced Configuration Snippet (Safety Valve)” under Flume → Configuration → Security to supply custom Java options to the Flume agent:
FLUME_AGENT_JAVA_OPTS="-Xms262144000 -Xmx262144000 -XX:OnOutOfMemoryError={{AGENT_COMMON_DIR}}/killparent.sh -Dflume.monitoring.type=HTTP -Dflume.monitoring.port=41414 -Djava.security.auth.login.config=/etc/flume-ng/conf/jaas-kt.conf"
The options above were copied from the standard options derived by Cloudera Manager, with the exception of the -Djava.security.auth.login.config flag.
Conclusion
At this point, you should have a good understanding of how to use Sentry to manage access control and enforce authorization for queries to Solr from Java-based applications and Flume, using Kerberos for strong authentication.
Jan Kunigk and Paul Wilkinson are Solution Architects at Cloudera.