Inside the Apache Solr JSON Facet API


Solr 5 includes a completely re-written faceted search and analytics module with a structured JSON API to control the faceting and analytics commands. Here’s how it works.

Since I joined Cloudera a few years ago to help bring search-powered analytics to Cloudera’s platform, I’ve been working actively upstream alongside the rest of the Solr community to develop new functionality that will drive more interesting applications on Cloudera Search (which is based on an integration of Solr with the Apache Hadoop ecosystem). In the following re-post from my personal blog, I describe one of these features—improved support for nested facets via JSON—that I wrote at the time of code check-in. (Note: this feature is targeted for a future release of Cloudera Enterprise, and thus is not yet supported for production use.)

Why JSON?

The structured nature of nested subfacets is more naturally expressed in a nested format like JSON than in the flat structure that normal query parameters provide. For that reason, starting in 5.0, Solr includes a JSON Facet API. The Facet API is now part of the JSON Request API, so a complete request may be expressed in JSON.

Goals of the new faceting module include:

  • First-class JSON support
  • Easier programmatic construction of complex, nested facet commands
  • A more canonical response format that is easier for clients to parse
  • First-class analytics support
  • Ability to sort facet buckets by any calculated metric
  • A cleaner way to do distributed faceting
  • Better integration with other search features

Of course, if you prefer to use Solr’s existing faceting capabilities, that’s fine too. (You can even use both simultaneously, if you want!)

Next, let’s get into the details. (Note: Some examples here use syntax supported only in later Solr 5 releases, or even Solr 6.)

Ease of Use

Some of the ease-of-use enhancements over traditional Solr faceting come from the inherent nested structure of JSON.

As an example, here is the faceting command for two different range facets using Solr’s flat API:

And here is the equivalent faceting command in the new JSON Faceting API:

These aren’t even nested facets, but already one can see how much nicer the JSON API looks. With deeply nested subfacets and statistics, the clarity of the inherently nested JSON API only grows.
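As a rough sketch of the comparison above, the following Python builds the same two range facets both ways: once as flat request parameters and once as a JSON facet command. The field names (age, price), the facet keys, and the bounds are illustrative assumptions, not values taken from a real index.

```python
import json
from urllib.parse import urlencode

# Hypothetical fields and bounds, chosen only for illustration.
RANGES = {
    "age_ranges": {"field": "age", "start": 0, "end": 100, "gap": 10},
    "price_ranges": {"field": "price", "start": 0, "end": 1000, "gap": 100},
}


def flat_params(ranges):
    """Build traditional flat request parameters for a set of range facets."""
    params = [("facet", "true")]
    for key, r in ranges.items():
        field = r["field"]
        # The {!key=...} local param renames the facet in the response.
        params.append(("facet.range", "{!key=%s}%s" % (key, field)))
        for name in ("start", "end", "gap"):
            params.append(("f.%s.facet.range.%s" % (field, name), str(r[name])))
    return params


def json_facet(ranges):
    """Build the equivalent JSON Facet API command as a nested dict."""
    return {key: dict(type="range", **r) for key, r in ranges.items()}


print(urlencode(flat_params(RANGES)))
print(json.dumps(json_facet(RANGES), indent=2))
```

The flat version needs nine separate parameters whose relationship is encoded only in their names, while the JSON version keeps each facet's settings together under its own key.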

JSON Extensions

A number of JSON extensions have been implemented to further increase the clarity and ease of constructing a JSON faceting command by hand. For example:

Debugging JSON

Nicely indented JSON is very easy to understand. If you somehow end up with a large piece of non-indented JSON and are trying to make sense of it, you can paste it into one of the online validators:

http://jsonlint.com

http://jsonformatter.curiousconcept.com

Both of these validators will indent your JSON, even when it contains extensions unsupported by them (such as comments or bare strings).

Facet Types

There are two types of facets: those that break up the domain into multiple buckets, and aggregations (facet functions) that compute information about the set of documents belonging to each bucket.

Faceting can be nested: Any bucket produced by faceting can further be broken down into multiple buckets by a subfacet.
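To make the nesting concrete, here is a minimal sketch of a nested command written as a Python dict, plus a small helper that lists every facet and statistic it contains. The field names (cat, manu, price) and facet names are hypothetical, and the flat "type" form used here is the one described later as available in later Solr 5 releases.

```python
import json

# Hypothetical field and facet names, for illustration only.
nested = {
    "categories": {
        "type": "terms",
        "field": "cat",
        "facet": {  # everything here is computed for every category bucket
            "avg_price": "avg(price)",
            "top_manufacturers": {"type": "terms", "field": "manu", "limit": 3},
        },
    }
}


def facet_paths(facets, prefix=()):
    """Yield the path of every facet or statistic in a command, depth first."""
    for name, spec in facets.items():
        path = prefix + (name,)
        yield path
        if isinstance(spec, dict):
            # Subfacets live under the "facet" key of their parent.
            yield from facet_paths(spec.get("facet", {}), path)


for path in facet_paths(nested):
    print("/".join(path))

print(json.dumps(nested, indent=2))
```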

Statistics are Facets

Statistics are now fully integrated into faceting. Since we start off with a single facet bucket with a domain defined by the main query and filters, we can even ask for statistics for this top-level bucket, before breaking up into further buckets via faceting. Example:
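A minimal sketch of such a request, assuming hypothetical price and manufacturer fields: the json.facet parameter carries only aggregation functions, so they are computed over the single top-level bucket defined by the query.

```python
import json
from urllib.parse import urlencode

# Statistic names and field names are assumptions made for this example.
stats = {
    "avg_price": "avg(price)",
    "num_manufacturers": "unique(manufacturer)",
}

# rows=0 because only the statistics are wanted, not the matching documents.
params = {"q": "*:*", "rows": 0, "json.facet": json.dumps(stats)}
query_string = urlencode(params)

print(query_string)
```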

See facet functions for a complete list of the available aggregation functions.

JSON Facet Syntax

The general form of a JSON facet command is:

Example:

After Solr 5.2, a flatter structure with a “type” field may also be used:

Example:
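One way to see that the two forms carry the same information is a small conversion helper. This is only a sketch: to_flat is a hypothetical function, not part of Solr, and the top_authors facet with its author field is an assumed example.

```python
def to_flat(command):
    """Convert {facet_type: {params}} into the flatter {"type": facet_type, ...} form."""
    if len(command) != 1:
        raise ValueError("expected exactly one facet type")
    (facet_type, params), = command.items()
    return {"type": facet_type, **params}


# Nested form: the facet type is the single key wrapping the parameters.
nested_form = {"top_authors": {"terms": {"field": "author", "limit": 5}}}

# Flat form: the facet type becomes an ordinary "type" parameter.
flat_form = {name: to_flat(cmd) for name, cmd in nested_form.items()}

print(flat_form)
```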

The results will appear in the response under the facet name specified.

Facet commands are specified using json.facet request parameters.

Test Using Curl

To test out different facet requests by hand, it’s easiest to use curl from the command line. Example:
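For readers who would rather script this than use curl, here is a roughly equivalent sketch in Python using only the standard library. The URL, the collection name, and the fields (cat, price) are assumptions; the request is built but not sent, since sending it requires a running Solr instance.

```python
import json
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

# Hypothetical endpoint: adjust host, port, collection, and handler as needed.
SOLR_URL = "http://localhost:8983/solr/mycollection/query"

facet = {
    "categories": {
        "type": "terms",
        "field": "cat",
        "sort": "x desc",  # order buckets by the metric named x below
        "facet": {"x": "avg(price)", "y": "sum(price)"},
    }
}

body = urlencode({"q": "*:*", "rows": 0, "json.facet": json.dumps(facet)})
request = Request(SOLR_URL, data=body.encode("utf-8"))  # supplying data makes this a POST

# To actually send it:
#   from urllib.request import urlopen
#   response = json.load(urlopen(request))
print(request.get_method(), request.full_url)
print(body)
```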

Terms Facet

The terms facet, or field facet, produces buckets from the unique values of a field. The field needs to be indexed or have docValues.

The simplest form of the terms facet:

An expanded form allows for more parameters:

Example response:
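As an illustration of how a client might consume such a response, the sketch below parses a made-up sample and pulls out the bucket values and counts. The facet name, genre values, and counts are invented for the example; the layout assumed is a "facets" object with one entry per facet name, each holding a "buckets" list of "val"/"count" pairs.

```python
import json

# Invented sample data; not output from a real index.
sample = json.loads("""
{"facets": {"count": 1000,
  "top_genres": {"buckets": [
    {"val": "Science Fiction", "count": 143},
    {"val": "Fantasy", "count": 122},
    {"val": "Biography", "count": 28}]}}}
""")


def bucket_counts(response, facet_name):
    """Return (value, count) pairs for one bucketing facet in a response."""
    buckets = response["facets"][facet_name]["buckets"]
    return [(b["val"], b["count"]) for b in buckets]


for val, count in bucket_counts(sample, "top_genres"):
    print(val, count)
```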

Parameters:

[Table of terms facet parameters; the original image is not available here.]

Query Facet

The query facet produces a single bucket that matches the specified query.

Here’s an example of the simplest form of the query facet:

An expanded form allows for more parameters (or sub-facets/facet functions):

Example response:
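The sketch below shows one way to assemble an expanded query facet with a statistic attached, and to read the matching entry from a response. The helper query_facet is hypothetical, as are the popularity and price fields and the sample numbers.

```python
def query_facet(q, **stats):
    """Build an expanded query facet; keyword arguments become facet functions."""
    command = {"type": "query", "q": q}
    if stats:
        command["facet"] = dict(stats)
    return command


request_facets = {
    "high_popularity": query_facet("popularity:[8 TO 10]", average_price="avg(price)")
}

# Invented sample of the "facets" section of a response.
sample_facets = {
    "count": 1000,
    "high_popularity": {"count": 147, "average_price": 74.25},
}

# A query facet yields a single bucket, so there is no "buckets" list to walk.
bucket = sample_facets["high_popularity"]
print(bucket["count"], bucket["average_price"])
```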

Range Facet

The range facet produces multiple range buckets over numeric fields or date fields.

Range facet example:

Example response:
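To illustrate how start, end, and gap carve a field into buckets, here is a simplified model in Python. It is a sketch under stated assumptions, not Solr's implementation: it handles only numeric values, assumes the gap divides the range evenly, and assumes each bucket includes its lower bound and excludes its upper bound.

```python
def range_buckets(start, end, gap):
    """Lower bounds of the buckets covering [start, end) in steps of gap."""
    if gap <= 0:
        raise ValueError("gap must be positive")
    bounds = []
    lower = start
    while lower < end:
        bounds.append(lower)
        lower += gap
    return bounds


def bucket_for(value, start, end, gap):
    """Lower bound of the bucket holding value, or None if it is out of range."""
    if not (start <= value < end):
        return None
    return start + ((value - start) // gap) * gap


# Hypothetical command: a price field split into buckets of width 20.
prices = {"type": "range", "field": "price", "start": 0, "end": 100, "gap": 20}

print(range_buckets(prices["start"], prices["end"], prices["gap"]))
print(bucket_for(39.5, prices["start"], prices["end"], prices["gap"]))
```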

To ease migration, these parameter names, values, and semantics were taken directly from the old-style (non-JSON) Solr range faceting.

Parameters:

[Table of range facet parameters; the original image is not available here.]

Common Parameters

Parameters that all faceting methods have in common include:

Conclusion

Hopefully, you now have a good understanding of the JSON API introduced in Solr 5. Again, this feature is scheduled to ship/be certified in a future Cloudera release but is not yet supported for production use.

Yonik Seeley is a Software Engineer at Cloudera, a committer and PMC member for Apache Lucene, and the creator of Solr. Previously, he was chief open source architect and cofounder at LucidWorks.
