New in Cloudera Enterprise 6.0: Analytic Search

by Eva Nahari

Posted in Technical | May 24, 2018

It has been a long and patient wait for Apache Hadoop 3.0 to mature. A major new version of the storage layer obviously impacts all our integrated components, including Apache Solr and all our integrations with the rest of the platform, commonly referred to as Cloudera Search. Since our customers’ Search deployments are so often mission critical, we’ve made sure to take time to do extensive integration testing and focus on the upgrade experience.

Now the moment has finally come to announce Solr 7.0 in Cloudera Search, available as of our new major release, Cloudera Enterprise 6 (C6), which just entered public beta!

This blog highlights some of the major new features in Cloudera Search and how we expect users to benefit from them. Let’s start, however, with a quick overview in case you are new to this fully integrated free-text analytics engine.

The Many Uses of Search

Free-text search is familiar to everyone and is a great tool if, for instance, you want to find and explore data very quickly. With facets, i.e. fields or categories in a search experience context, you can quickly filter a best-match, relevance-ranked result set down by predetermined categories. However, search is so much more than a text-document lookup and data discovery tool. For instance, you can do:

Spatial or shape queries over geolocation data to zoom into results originating near specific coordinates
Term-distance queries, to identify relevant sample subsets of, for instance, 10 years of claims or email data (with attachments), where two words used close together make it more likely that a claim is fraudulent.
Automated matching of incoming documents against a set of known documents, to streamline categorization, increase accuracy and eliminate tedious manual tasks.
Streaming analytics over incoming event streams (think IoT), to fine-tune marketing campaigns in real time based on correlated similarities between the customer profiles generating those events.

You can view search as a complementary query tool to traditional SQL query engines and analytic databases, one that also allows analytics over unstructured as well as semi-structured data. Search allows you to express your questions in natural language – i.e. open and easy for anyone to use – a true step towards democratizing data in your organization.

What is Cloudera Search?

Cloudera Search is powered by Solr and integrated in many ways with the rest of the Cloudera data platform to provide flexibility, IT efficiency, security and manageability. You can choose from a number of integrated indexing options at ingest, or use batch indexing tools, which offload indexing workloads from Solr and allow better scale. With Cloudera Search you use the same data storage as other analytics and processing workloads, which not only lowers IT costs related to integrations and security, but also gives you the opportunity to do best-match, relevance-based lookups as part of a bigger workload. For example, you can discover keywords in a data set prior to guided machine learning.

Unique to Cloudera is the extremely secure environment across your entire pipeline, including search, and the ability to audit search and other parts of the pipeline as well. This becomes ever more important in a world where cyber threats are growing exponentially and where new compliance rules and regulations are quickly changing the landscape. One of our most popular tools is Cloudera Manager, with granular health and monitoring dashboards over any and all Solr metrics, down to the collection level, as well as over the rest of the platform.

With integrated search you can safely and flexibly start using search in more innovative ways than before, including over pre- and post-processed (machine learning) data. The options are endless.

Cloudera Search makes sense in today’s data centers, versus siloed stand-alone search deployments, and is therefore widely adopted. Cloudera has many mission-critical applications and pipelines that include search running in production today, generating millions of dollars in new revenue and in risk/threat mitigation, or solving billion-dollar problems.

Solr 7.0 – Improved Analytic Capabilities

Ever since Cloudera adopted Search as part of our platform, we have had features in mind to allow analytics and calculations over unstructured data. We are therefore very eager to share the exciting new and improved analytic features and capabilities of Solr 7. We are convinced these will drive your search application innovation and business insights forward!

JSON Facet API

We finally get to experience first-class JSON support in Cloudera Search, which improves ease of use. The clearer format of JSON makes it easier to express and calculate statistics over one or many facets. Although some stats could be calculated in older versions, the new approach is much faster (>3x) and more scalable, and it allows new, more advanced stats calculations to be added. Examples of stats available now are counts, averages, unique counts, sums, max, min, and much more! You can now also bucket your query results into multiple segments. It is easier than ever to query and calculate stats over unstructured data (e.g. email, claims forms, notes, research reports, patent documents, PDFs, images, etc.).

We anticipate this being heavily used where you not only want to allow free-text search over data and drill down via facet filtering, but also want to gain immediate insight into that data based on calculations. For example: not only which products, matched on their name and name variations, are selling the most between August and October, but now also how many of each are in your inventory, in each region, over time, and of what type. You can even match over reviews and the associated searches for these products on your web store, and do statistical calculations over the result sets.
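As a rough illustration of the product example above, the following minimal sketch builds a JSON Facet request body in Python. The field names (`name_t`, `sale_date_dt`, `region_s`, `units_sold_i`, `price_d`, `product_id_s`) and the query are hypothetical and assume a schema that is not described in this post; the body would typically be POSTed to a collection's query handler.

```python
import json


def build_facet_request(query, start, end):
    """Build a JSON request body with a terms facet, nested stats and a range facet."""
    return {
        "query": query,
        # Restrict to a date window (hypothetical date field).
        "filter": [f"sale_date_dt:[{start} TO {end}]"],
        "limit": 0,  # only the facet results are wanted, not the documents
        "facet": {
            "by_region": {
                "type": "terms",
                "field": "region_s",
                # Stats and sub-facets computed inside each region bucket.
                "facet": {
                    "units_sold": "sum(units_sold_i)",
                    "avg_price": "avg(price_d)",
                    "distinct_products": "unique(product_id_s)",
                    "price_bands": {
                        "type": "range",
                        "field": "price_d",
                        "start": 0,
                        "end": 100,
                        "gap": 25,
                    },
                },
            }
        },
    }


body = build_facet_request(
    "name_t:umbrella", "2017-08-01T00:00:00Z", "2017-10-31T23:59:59Z"
)
print(json.dumps(body, indent=2))
```

Nesting the stats under the `by_region` facet is what yields per-region numbers; placing them at the top level of `facet` would instead compute them over the whole result set.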

Nested Documents Faceting

We also now support faceting over nested documents. Nested documents are two or more documents linked through parent-child relationships. This structure simplifies processing, as it removes the need to serialize and deserialize data. Previously, you would have to add fields that acted as keys, with no ability to enforce those keys. Some practical examples of nested documents would be:

Attachments to emails
Comments on post/article/page/profile/etc
Reviews of products/businesses/books/movies/etc

And now you can do faceted search over this structure too, enabling interactive drill-down and investigation of this more efficient document-linking data structure.
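To make the email-with-attachments example concrete, here is a minimal sketch in Python of a nested document and a facet that counts attachment types for matching emails. All field names and IDs are invented for illustration. The sketch assumes the `_childDocuments_` key for nesting in JSON updates and the JSON Facet `blockChildren` domain option, whose value is a query that matches all parent documents.

```python
import json

# One parent document (an email) with two children (its attachments).
email = {
    "id": "email-1",
    "doc_type_s": "email",
    "subject_t": "Q3 claim paperwork",
    "_childDocuments_": [
        {"id": "email-1-att-1", "doc_type_s": "attachment", "file_type_s": "pdf"},
        {"id": "email-1-att-2", "doc_type_s": "attachment", "file_type_s": "png"},
    ],
}

# Search parents, then switch the facet domain to their children so the
# buckets count attachment file types rather than emails.
facet_request = {
    "query": "doc_type_s:email AND subject_t:claim",
    "limit": 0,
    "facet": {
        "attachment_types": {
            "type": "terms",
            "field": "file_type_s",
            "domain": {"blockChildren": "doc_type_s:email"},
        }
    },
}

print(json.dumps(facet_request, indent=2))
```

The parent filter (`doc_type_s:email` here) should match every parent and no child, which is why the sketch tags each document with an explicit type field.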

Streaming expressions are a new approach to executing queries in Solr, aimed at use cases where Solr is part of a bigger pipeline. You can build your own pipeline of queries within the streaming expressions framework. Think of it as similar to a Spark pipeline, but within the distributed Solr process. This framework has opened the door for new types of “heavy duty” queries that can touch all your data at once, for example distributed joins, math functions, and roll-ups.

We see this as an important underlying framework for building out even more innovative applications and ways of querying or tying data together. Note, for instance, that the SQL interface and Graph Query interface are both examples of such applications.
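A roll-up is one of the simpler pipelines to picture. The minimal sketch below composes a `rollup` over a `search` as a streaming expression string and encodes it for a collection's `/stream` handler. The collection name `sales`, the fields `region_s` and `amount_d`, and the host are assumptions for illustration; the `/export` handler used here generally expects the listed fields to have docValues.

```python
from urllib.parse import urlencode


def rollup_expr(collection, group_field, metric_field):
    """Compose a rollup over a sorted search stream.

    rollup() groups adjacent tuples, so the inner search must be sorted
    by the same field that is used in over=.
    """
    search = (
        f'search({collection}, q="*:*", fl="{group_field},{metric_field}", '
        f'sort="{group_field} asc", qt="/export")'
    )
    return f'rollup({search}, over="{group_field}", sum({metric_field}), count(*))'


expr = rollup_expr("sales", "region_s", "amount_d")
# Hypothetical host and collection; the expression travels in the expr parameter.
url = "http://localhost:8983/solr/sales/stream?" + urlencode({"expr": expr})
print(expr)
```

Because expressions nest as plain function calls, larger pipelines can be assembled by wrapping one stream in another in the same way.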

SQL Interface* brings some very basic SQL query capabilities for accessing indexed data. It relies on the streaming expressions framework and comes with an out-of-the-box JDBC connector. The interface allows you to express queries in very simple SQL form over unstructured data and conceptualize an index as a table (e.g. SELECT X, Y FROM Collection C WHERE Field=F). This interface aims above all to let SQL applications and third-party SQL tools more easily access unstructured data that is indexed (and most commonly served through Solr for other audiences and purposes at the same time) without any significant code changes. It will give SQL users and BI tools access to indexed data more easily than before, through a familiar interface. We’re very excited to provide some initial, basic SQL capabilities to our SQL tool partners, to bring more insight to our joint customers.
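Besides JDBC, Solr also exposes a `/sql` request handler that takes the statement in a `stmt` parameter. The following minimal sketch only builds such a request URL in Python. The collection `claims`, the fields `region_s` and `status_s`, and the host are hypothetical, and since this interface is flagged above as not yet supported in C6 GA, treat it as an experiment rather than a recommended pattern.

```python
from urllib.parse import urlencode


def sql_request_url(base_url, collection, stmt):
    """Build a URL for the /sql handler; the collection plays the role of the table."""
    return f"{base_url}/{collection}/sql?" + urlencode({"stmt": stmt})


stmt = (
    "SELECT region_s, count(*) "
    "FROM claims "
    "WHERE status_s = 'open' "
    "GROUP BY region_s"
)

url = sql_request_url("http://localhost:8983/solr", "claims", stmt)
print(url)
```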

Graph Query Interface* is a very interesting emerging capability, also built on the streaming expressions framework. The graph query API allows you to traverse and query data elements that are linked in some way. One could almost see it as an iterative join filter over your data set. For example, one could traverse from an insurance claim to the policy the claim is under, and on to the owner of the group policy at a higher level. This is an important speed-up of the investigative process when you are trying to inspect the paths of fraudulent claims.

Please give us input and ideas on what use cases you intend to use this for and what else you’d like to see for Graph going forward!
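The "iterative join" idea can be pictured with a toy, in-memory walk. The sketch below is plain Python over invented records and does not use Solr's graph API; it only illustrates how each hop joins the current value against the next data set, following the claim, policy, group policy owner example.

```python
# Toy records; every identifier below is invented for illustration.
claims = {"claim-1": {"policy": "policy-7"}}
policies = {"policy-7": {"group_policy": "group-2"}}
group_policies = {"group-2": {"owner": "Acme Corp"}}


def walk(start, hops):
    """Follow one link per hop: look up the current key, then read the named field."""
    path = [start]
    current = start
    for table, field in hops:
        current = table[current][field]
        path.append(current)
    return path


path = walk(
    "claim-1",
    [(claims, "policy"), (policies, "group_policy"), (group_policies, "owner")],
)
print(" -> ".join(path))  # claim-1 -> policy-7 -> group-2 -> Acme Corp
```

In a real deployment each hop would be a distributed lookup against an index rather than a dictionary access, which is where a graph traversal inside the search engine saves round trips.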

Under the Hood

New Replica Types

In the old world, all replicas were so-called near-real-time (NRT) replicas, which had to be in sync with the leader at all times so that any replica could be elected leader at any point. This was very good from a high-availability and fault-tolerance standpoint, but also very hard on CPU, I/O and indexing throughput. In most cases one could offload heavy-duty indexing workloads to either the Spark or MapReduce indexer, but in cases where real-time indexing is required, each new replica slows down indexing.

The new replica types available in Cloudera Search as of C6 come in three flavors (NRT, TLOG, and PULL). Each comes with its own pros and cons: tradeoffs between resource utilization, query speed, indexing throughput and fault tolerance.

TLOG replicas

- Use much less CPU and I/O
- Have a smaller effect on indexing throughput
- Can lag behind the leader

PULL replicas

- Use much less CPU and I/O
- Do not slow down indexing at all
- Ideal for query performance
- Cannot become a leader
- Not real-time
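In upstream Solr 7, the replica mix is chosen per collection through the `nrtReplicas`, `tlogReplicas` and `pullReplicas` parameters of the Collections API CREATE action. The minimal sketch below only builds such a request URL in Python. The host and collection name are hypothetical, and Cloudera Search deployments may expose this through their own tooling instead, so read it as an illustration of the tradeoff rather than a recommended procedure.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, shards, nrt=0, tlog=0, pull=0):
    """Build a Collections API CREATE URL with an explicit replica-type mix."""
    if nrt + tlog == 0:
        # PULL replicas cannot become leader, so each shard needs at least
        # one NRT or TLOG replica.
        raise ValueError("need at least one NRT or TLOG replica per shard")
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": shards,
        "nrtReplicas": nrt,
        "tlogReplicas": tlog,
        "pullReplicas": pull,
    }
    return f"{base_url}/admin/collections?" + urlencode(params)


# Two leader-eligible TLOG replicas plus two query-only PULL replicas per shard.
url = create_collection_url(
    "http://localhost:8983/solr", "claims", shards=2, tlog=2, pull=2
)
print(url)
```

A mix like this favors indexing throughput and query capacity, at the cost of replicas that may lag behind the leader.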

Per collection cluster state

In previous versions, the states of all collections were stored in one file in ZooKeeper. This did not scale well for a number of reasons:

- All Solr nodes watch one file in ZooKeeper and get too many notifications that do not apply to them
- Each state change updates this one file
- Single point of contention for the whole SolrCloud cluster

We changed this so that each collection has its own state file, a much more resilient and scalable approach.
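The effect on notification volume can be sketched with a toy count. The Python below is a simplified model, not Solr or ZooKeeper code: it assumes each node is notified only about state files it watches, and that with per-collection state a node watches only the collections it hosts. The node and collection names are invented.

```python
def notifications(hosting, changes, per_collection):
    """Count watch notifications delivered across the cluster.

    hosting: maps node name -> set of collections hosted on that node.
    changes: list of collection names, one entry per state change.
    per_collection: True for one state file per collection,
                    False for a single shared state file.
    """
    total = 0
    for collection in changes:
        for hosted in hosting.values():
            # With a shared file every node is notified of every change.
            if not per_collection or collection in hosted:
                total += 1
    return total


hosting = {"node1": {"a"}, "node2": {"b"}, "node3": {"a", "b"}}
changes = ["a", "a", "b"]

shared = notifications(hosting, changes, per_collection=False)
split = notifications(hosting, changes, per_collection=True)
print(shared, split)  # 9 6
```

The gap widens as the number of collections grows relative to how many each node hosts.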

Distributed Cardinality

In short, this functionality allows you to calculate distinct values faster with less memory.
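One common way to get distinct counts cheaply in a distributed setting is a mergeable sketch such as HyperLogLog. The Python below is a minimal, illustrative HyperLogLog and is not Solr's implementation; the register count, hash function and data are assumptions chosen for readability. It shows the key property: each shard keeps a small fixed-size sketch, and the sketches merge by taking the element-wise maximum.

```python
import hashlib
import math

P = 10       # number of index bits
M = 1 << P   # number of registers (1024)


def _hash64(value):
    """Deterministic 64-bit hash of a string."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


def sketch_of(values):
    """Build a register array for one shard's values."""
    registers = [0] * M
    for value in values:
        h = _hash64(value)
        index = h >> (64 - P)                # top P bits pick the register
        rest = h & ((1 << (64 - P)) - 1)     # remaining 54 bits
        # Position of the leftmost 1-bit within the 54-bit remainder.
        rank = (64 - P) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)
    return registers


def merge(a, b):
    """Combine two shard sketches; no raw values are exchanged."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(registers):
    """Estimate the number of distinct values from the registers."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)  # small-range correction
    return raw


# Two shards with overlapping users; the true distinct count is 10,000.
shard_a = sketch_of(f"user-{i}" for i in range(0, 6000))
shard_b = sketch_of(f"user-{i}" for i in range(4000, 10000))
print(round(estimate(merge(shard_a, shard_b))))
```

Each shard ships only its 1,024 registers instead of every distinct value, which is where the memory and speed savings come from; the price is an approximate answer (roughly 3% standard error at this register count).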

There is More

We would be kidding ourselves if we tried to cover all the new and exciting capabilities added to Cloudera Search in C6 with Solr 7. There is, however, much more. To mention a few: polygon queries, the Auto-Scaling framework, the HyperLogLog algorithm for calculating the number of unique values, and learning-to-rank capabilities. In addition, as Cloudera Search is the collective name for Apache Solr and all its integrations with the rest of CDH, our new major Cloudera Enterprise 6.0 release also contains the integrations with all the new major upgrades, i.e. HBase 2, Spark 2, Hadoop 3, HUE 4, etc.

Speaking of HUE, we also improved the index creation experience by adding an Index Designer, integrating collection browsing into the browser experience, and allowing simple dashboarding over the new facet capabilities in Solr 7.

Noteworthy is that the same simple dashboarding capabilities can now also be used over Impala result tables. You can now do analytics over both unstructured and structured data from the same user experience.

We encourage you to try it all out in our ongoing beta and give us feedback so that we can help you be innovative and successful in your Analytic Search missions and applications going forward!

Conclusion

Cloudera Enterprise 6 is our most ambitious, most powerful machine learning and analytics platform edition to date. It includes a range of new features that address productivity, scale, and enterprise quality. By integrating Cloudera Search enhancements into the platform via Apache Solr 7.0, our customers gain far greater flexibility, security, and automation than they otherwise would with siloed search engines. From new query APIs you can use to build innovative applications and integrate with third-party tools, to improvements in resilience, performance, and memory utilization, to helpful end-user tooling, there’s a lot to look forward to with Analytic Search in C6.

If you are curious to try these new capabilities, we encourage you to join the beta and provide feedback in our community forums. Let us know what innovations our fully integrated new Analytic Search capabilities have enabled for you!

*) These features are in Solr 7.0, but will not be supported in Cloudera Enterprise 6.0 GA out of the gate. They need some more time to mature and are targeted for future releases. However, the more beta feedback and ideas we get on how you plan to use these capabilities, the faster we will be able to confidently recommend these features for production.

Eva Nahari

Sr Director of Product @EvaNahari

1 Comments

by Samuel on May 06, 2020 @ 2:03 am EDT

Really informative blog. Search analytics tools like 3RDi Search, Algolia & Swiftype, are fast emerging as the must-have tools for enterprises to make use of their data.