How-to: Use Apache Solr to Query Indexed Data for Analytics


Bet you didn’t know this: In some cases, Solr offers lightning-fast response times for business-style queries.

If you were to ask well-informed technical people about use cases for Solr, the most likely response would be that Solr (in combination with Apache Lucene) is an open source text search engine: one can use Solr to index documents and, after indexing, search those same documents using free-form queries in much the same way as you would query Google. Others might add that Solr has some very capable geo-location indexing capabilities that support radius, bounded-box, and defined-area searches. Both of those answers would be correct.

What may be less well known is that Solr (+Lucene) can also serve certain indexed data queries with lightning-fast response times. Using Solr in this manner, you can either extend your current use of Solr or add Solr to your existing cluster to make better use of existing data assets. So in a way, Solr can provide capabilities that are similar to those of a NoSQL in-memory database.

In this post, I will explain how to use Solr to achieve exceptional response times for a variety of business-style queries. Using an example, I’ll demonstrate how to index documents into a Solr cluster and issue complex queries against the indexed documents. After the nuts and bolts are covered, I’ll offer an overview of important trade-off considerations. Finally, I’ll briefly compare Solr’s capabilities to NoSQL engines such as MongoDB.

Let’s Get Some Data

In searching for some data to index into Solr, I had a few criteria. I wanted the number of fields to be small so that the data set could be easily understood. I also looked for a dataset that isn’t a typical text-based dataset, but rather a more business-oriented one. Finally, I wanted data with some numerical values so that Solr’s comparison-filtering and range-filtering capabilities could be easily demonstrated and understood.

After a little searching online, I found a dataset that meets all these criteria: a simple listing of electricity rates by zip code for 2011. The dataset has the following fields and types:
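(A sketch of the layout, assuming the widely distributed 2011 utility-rates data set; verify the names and types against your download:)

    zip           string   five-digit ZIP code
    eiaid         int      EIA utility identifier
    utility_name  string   name of the electric utility
    state         string   two-letter state abbreviation
    service_type  string   type of service offered
    ownership     string   utility ownership category
    comm_rate     double   average commercial rate, $/kWh
    ind_rate      double   average industrial rate, $/kWh
    res_rate      double   average residential rate, $/kWh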

I downloaded the CSV file from the above URL and, for clarity, renamed it to rates.csv. What follows are the first few lines from that CSV file:
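(Again a sketch, assuming the layout above: the header line is shown; the data rows, one per utility/ZIP combination, follow in the same column order and are omitted here.)

    zip,eiaid,utility_name,state,service_type,ownership,comm_rate,ind_rate,res_rate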

Creating and Loading a Schema

Solr can infer a schema from indexed data, but doing so leaves it up to Solr to determine the fields and types. To ensure appropriate indexing and type semantics, defining a schema is recommended. In our example, we will query certain fields and apply comparison and range filters to them, so we must make sure those fields are indexed and defined with the proper field type before we index data into Solr. We also want to avoid indexing fields that will not be searched or faceted, to minimize the amount of memory required.

First, we instruct Solr to create a default configuration set on the local filesystem. To do that we issue the following command, in which /tmp/electric_rates is the local directory where Solr will place our default configuration set:
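(A minimal sketch, assuming the solrctl utility that ships with CDH:)

    # Generate a default configuration set (instance directory) on the
    # local filesystem under /tmp/electric_rates.
    solrctl instancedir --generate /tmp/electric_rates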

In the /tmp/electric_rates/conf directory there will now be a file named schema.xml. This is a rather large XML file, and it contains some definitions that we’ll use later. The main area of concern for now is the field definitions.

All of the example field definitions can be removed. Listed below are the field definitions that we will use for our example electric rate data set:
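(The snippet below is a sketch rather than the exact original schema; the field names assume the CSV layout shown earlier, and the default id and _version_ definitions from the generated schema.xml should be left in place:)

    <!-- Queried/grouped fields are indexed; the rest are stored only. -->
    <field name="zip" type="string" indexed="true" stored="true" omitNorms="true"/>
    <field name="eiaid" type="int" indexed="false" stored="true"/>
    <field name="utility_name" type="string" indexed="true" stored="true" omitNorms="true"/>
    <field name="state" type="string" indexed="true" stored="true" omitNorms="true"/>
    <field name="service_type" type="string" indexed="false" stored="true"/>
    <field name="ownership" type="string" indexed="false" stored="true"/>
    <field name="comm_rate" type="double" indexed="true" stored="true" omitNorms="true"/>
    <field name="ind_rate" type="double" indexed="true" stored="true" omitNorms="true"/>
    <field name="res_rate" type="double" indexed="true" stored="true" omitNorms="true"/>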

You will note that there are a few int fields, a few string fields, and a few double fields. Also note that only some fields are designated as indexed="true"; these are the fields that we will query or to which we will apply grouping functions. The omitNorms setting informs Solr that we will NOT be using these fields in any form of “boosting” searches. (Boosting is an advanced way to instruct Solr that a specific field is more or less important in certain queries.)

After the schema.xml file has been edited, we upload the configuration to Apache ZooKeeper as a named Solr instance directory using the following command:
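(Sketch, assuming solrctl again; "electric_rates" is an arbitrary name chosen for the instance directory:)

    # Upload the edited configuration set to ZooKeeper under the
    # name "electric_rates".
    solrctl instancedir --create electric_rates /tmp/electric_rates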

Next we instruct Solr to create a new collection with the following command:
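(Sketch; a single shard is plenty for a data set this small:)

    # Create a one-shard collection backed by the uploaded configuration.
    solrctl collection --create electric_rates -s 1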

Finally, to index the data into Solr, we will use the already-configured CSV request handler. This is an excellent utility for small data sets but is not recommended for larger ones; for larger data sets, you might want to consider the MapReduceIndexerTool (out of scope for this topic, though).

The following command will get our data indexed:
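(Sketch, assuming a Solr node listening locally on the default port 8983 and the collection name used above:)

    # Post the CSV file to the collection's CSV update handler and commit.
    curl 'http://localhost:8983/solr/electric_rates/update/csv?commit=true' \
         -H 'Content-Type: text/csv' --data-binary @rates.csv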

Upon completion, you will note that 37,791 documents were indexed into Solr. Obviously, this is not a large data set, but the intention is to demonstrate query capabilities first and response times only as secondary information.

Getting Answers Quickly

To demonstrate Solr’s query capabilities on our newly indexed data set, let’s ask some business-style questions. For each business question, I will provide the query along with a breakdown of each query element. For brevity, I will not list the full Solr response but rather provide the answers in short form.

All of the queries below were issued against a single Solr instance running in a virtual machine:

  • OS: CentOS 6.6
  • CDH version: 5.0.0
  • Solr version: 4.10.3
  • Solr memory available: 5.84GB
  • Java version: 1.7.0_67
  • Processors: 1

How many utility companies serve the state of Maryland?

To answer this question, we need to apply a filter to the state field specifying only results from “MD.” To determine how many utility companies exist in MD, we will ask Solr to group the results on the utility_name field but limit each group to a single result, as we only care how many total groups there are. The following query fulfills this request:
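(A sketch of the query; host, port, and collection name carry over from the indexing step:)

    # q=*:*                      match all documents
    # fq=state:MD                filter results to Maryland
    # group=true                 enable result grouping
    # group.field=utility_name   one group per utility
    # group.limit=1              keep just one document per group
    # group.ngroups=true         report the total number of groups
    curl 'http://localhost:8983/solr/electric_rates/select?q=*:*&fq=state:MD&group=true&group.field=utility_name&group.limit=1&group.ngroups=true&wt=json'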

Each query element is annotated in the sketch above; the ngroups value in the response carries the answer.

The number of groups is four, and the result was returned in 23 milliseconds.

Which Maryland utility has the cheapest residential rates?

To answer this question, we only need to add one additional element to the prior query, instructing Solr to sort the groups in ascending order by res_rate (placing the cheapest residential rate at the top). We can also limit the number of groups returned to just one.
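(Sketch; when grouping is enabled, sort orders the groups themselves and rows caps how many groups come back:)

    # sort=res_rate asc   order groups by cheapest residential rate first
    # rows=1              return only the top group
    curl 'http://localhost:8983/solr/electric_rates/select?q=*:*&fq=state:MD&group=true&group.field=utility_name&group.limit=1&sort=res_rate+asc&rows=1&wt=json'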

The new and modified query elements are annotated in the sketch above.

The cheapest utility in MD is “The Potomac Edison Company” at 0.03079/kWh, and the result was returned in 4 milliseconds.

What are the minimum and maximum residential power rates excluding missing data elements?

To answer this query, we need to filter out data rows where res_rate = 0.0, as these are missing data elements. We do that using an “frange” query that excludes the lower bound of 0.0. To get the minimum and maximum res_rate, we instruct Solr to generate statistics for the res_rate indexed field. The query to answer the above business question is listed below:
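(Sketch; -G with --data-urlencode keeps the curly braces of the frange query URL-safe:)

    # {!frange l=0.0 incl=false}res_rate   match only docs with res_rate > 0.0
    #   (l is the lower bound; incl=false excludes the bound itself)
    # stats=true / stats.field=res_rate    compute min, max, etc. for res_rate
    # rows=0                               return statistics only, no documents
    curl -G 'http://localhost:8983/solr/electric_rates/select' \
         --data-urlencode 'q={!frange l=0.0 incl=false}res_rate' \
         --data-urlencode 'stats=true' \
         --data-urlencode 'stats.field=res_rate' \
         --data-urlencode 'rows=0' \
         --data-urlencode 'wt=json'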

The query elements are annotated in the sketch above; the stats section of the response contains the minimum and maximum.

The res_rate minimum is 0.0260022258659 and the maximum is 0.849872773537. Results were returned in 5 milliseconds.

What is the state and zip code with the highest res_rate?

To fulfill the above business request, we take the maximum res_rate returned from the prior query and use it as a filter for the next query as listed below:
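(Sketch; "zip" is the assumed field name from the CSV header:)

    # q=res_rate:0.849872773537   exact match on the maximum rate found above
    # fl=state,zip                return only the state and zip code fields
    curl 'http://localhost:8983/solr/electric_rates/select?q=res_rate:0.849872773537&fl=state,zip&wt=json'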

The query elements are annotated in the sketch above.

The highest residential electric rates are found in Alaska in zip code 99634. The results were returned in 1 millisecond.

Guidelines for Using Solr to Meet Your Analysis Needs

It is worth pointing out that Solr is not a general-purpose in-memory NoSQL engine. With that in mind, here are some guidelines to help you understand when it might be appropriate to use Solr for queries:

  • Your use case requires very fast query response times.
  • The data you need to analyze is already stored in Apache Hadoop.
  • You can easily define a schema for the data to be indexed.
  • You need to query (filter) on many fields.
  • The amount of data to be indexed into Solr will not exceed your Solr cluster’s capabilities.

If many or all of the above criteria apply, then using Solr for your data analysis might just be a great fit.

Comparing Solr to MongoDB

MongoDB is one of several NoSQL database engines available today, and among the most popular. For comparison purposes, see the table below.

(Note: in the future, support for Kudu may provide some interesting new update capabilities to Solr as well.)

Conclusion

As you can see, Solr brings lightning-fast query response times to a wide variety of business-style queries. The query language is not nearly as well-known as SQL, but Solr has some excellent capabilities that can be used with some thought and practice.

To get the answers above, we used grouping, group sorting, field selection (filtering), statistics generation, and range selection. While Solr should not be considered a general-purpose NoSQL in-memory database system, it can still be used to get analysis results with awesome response times. As such, it should be viewed as another tool in the toolbox that, when used correctly, can simplify the life of the Hadoop ecosystem architect.


Peter Whitney is a Solutions Architect at Cloudera working out of the Dallas, Texas, area. In this role, Pete helps Cloudera’s clients learn about and adopt best practices for Hadoop and surrounding technologies.

