How-to: Fuzzy Name Indexing in Apache Hadoop with Rosette and Cloudera Search

Categories: CDH How-to Search

In this guide, learn how to use Cloudera Search with Basis Technology’s Rosette® to perform fuzzy name searches in multiple languages and scripts.

Our thanks to the Basis Technology team (Jeanne Le Garrec, Hannah MacKenzie-Margulies, and Brian Sawyer) for their help in writing this how-to.

Cloudera Search, powered by Apache Solr, brings full-text, interactive search and scalable indexing to Apache Hadoop by marrying SolrCloud with HDFS, Apache HBase, Apache Spark, and other projects in CDH. Because it’s integrated with CDH, Cloudera Search gives your search workloads the same fault tolerance, scale, visibility, and flexibility as your other Hadoop workloads. It also offers a number of indexing, access control, and internal management options.

Names are the linchpin that connects data points in financial compliance, anti-fraud, government intelligence, law enforcement, and identity verification. Names are also helpful identifiers for customer 360 in retail and patient record deduplication in healthcare. However, because of their incredible variability, names are challenging to connect: misspellings, nicknames, initials, and titles can all throw off a basic query. Further, in international databases, a single name may appear in multiple languages and scripts. Take the example of someone from Morocco: their name could be written in either Arabic or Latin script. In addition, with no universal transliteration scheme, there is no agreed-upon way to write out a transliterated name. For example, Mao Zedong is also written as Mao Tse-Tung, or referred to simply as Chairman Mao.

Rosette®’s name indexing and matching technology solves these challenges with a linguistic, knowledge-based system that identifies and matches person, location, and organization names despite their impressive variability. Rosette, provided by Basis Technology, uses an intelligent and heuristic approach to name matching, and is unrivaled in the field.

In this post, you’ll learn how to use Cloudera Search with Rosette to index and search for names in documents with name fields. Since Cloudera takes care of the plumbing for distributed search and indexing, the only work required is to add the Rosette plug-in for Apache Solr to your Solr collections and configure it.

Step-by-step setup and configuration

To get started, you will need a CDH cluster with Cloudera Search/Solr services running on Java 1.8 (JDK 8) or later.

Then install Rosette on all the Solr nodes of the CDH cluster. To do this, you’ll need to unzip the SDK and Documentation files to the same directory (which we will call BT_ROOT) and then copy the license file to the BT_ROOT/rlp/rlp/licenses subdirectory.

Searching and indexing with Rosette requires a few additions to the schema.xml and solrconfig.xml files for each Solr collection you plan to use. To ensure that the appropriate Rosette JAR files end up in the class path, add the following lines to the solrconfig.xml file:

Replace [[BT_ROOT]] with the location where you installed Rosette.
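If you maintain configuration for several collections, the placeholder substitution can be scripted. The following is a minimal Python sketch, not part of the Rosette distribution: the template line and its example/lib subdirectory are hypothetical stand-ins for the real entries in the Rosette documentation, and only the [[BT_ROOT]] placeholder comes from the instructions above.

```python
# Hypothetical template line; the real <lib> entries come from the Rosette docs.
# Only the [[BT_ROOT]] placeholder is taken from the instructions above.
TEMPLATE = '<lib dir="[[BT_ROOT]]/example/lib" regex=".*\\.jar" />'

PLACEHOLDER = "[[BT_ROOT]]"


def fill_bt_root(template: str, bt_root: str) -> str:
    """Replace every [[BT_ROOT]] placeholder with the Rosette install directory."""
    if PLACEHOLDER not in template:
        raise ValueError("template contains no [[BT_ROOT]] placeholder")
    # Drop a trailing slash so the result does not contain "//".
    return template.replace(PLACEHOLDER, bt_root.rstrip("/"))


if __name__ == "__main__":
    print(fill_bt_root(TEMPLATE, "/usr/bt/"))
```

Failing loudly when the placeholder is absent helps catch a template that was already filled in or copied from the wrong file.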

Edit the schema.xml file to add field types that make use of Rosette and assign them to fields in your documents. In the example below, Rosette is designated to analyze the data in the primaryName field:

Next, you must include a Java property setting that points to the root of the Rosette SDK. You can do this in Cloudera Manager by searching for “Java Configuration Options for Solr Server” and appending -Dbt.root=<BT_ROOT> to the existing value (e.g. -Dbt.root=/usr/bt/rlp). Click “Save Changes” and restart your Solr services.
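The important detail is that the property is appended to whatever options are already set, rather than replacing them. As an illustration only, here is a small Python sketch of that string handling. The function name and the sample existing options are assumptions, and it splits on whitespace, so it would not suit option values that contain quoted spaces.

```python
def add_bt_root(existing_opts: str, bt_root: str) -> str:
    """Return the JVM option string with -Dbt.root set exactly once, at the end."""
    # Drop any earlier -Dbt.root so re-running the helper does not duplicate it.
    kept = [opt for opt in existing_opts.split() if not opt.startswith("-Dbt.root=")]
    kept.append(f"-Dbt.root={bt_root}")
    return " ".join(kept)


if __name__ == "__main__":
    # "-Xmx4g" is a made-up example of a pre-existing option value.
    print(add_bt_root("-Xmx4g", "/usr/bt/rlp"))
```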

That’s it! Once this bit of configuration is done, the Cloudera Search framework can be used conventionally for indexing and searching, and your results will reflect Rosette’s advanced functionality. You’ll find a repository of configuration files, scripts, and detailed instructions that you can use to configure and test Rosette here. It provides working examples of the configuration techniques discussed above.

The diagram below explains how Rosette and Cloudera Search powered by Apache Solr work together at both index and query time.

At index time, Rosette creates and stores keys for every token as a set of subfields that are hidden from the user. These keys enable matching with the highest possible recall. They cover nicknames and cognates (drawn from a dictionary that the user can edit), phonetic similarity, and many other possible sources of name differences.
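To make the idea of high-recall keys concrete, here is a toy Python sketch. It is not Rosette's algorithm: the nickname dictionary, the crude phonetic rule, and the key prefixes are all assumptions chosen for illustration. It only shows how several keys per token let different spellings of a name land on shared index entries.

```python
import re

# Toy nickname dictionary (assumption); the format of Rosette's own is not shown here.
NICKNAMES = {"bill": "william", "will": "william", "bob": "robert", "liz": "elizabeth"}


def phonetic_key(token: str) -> str:
    """Crude phonetic key: keep the first letter, drop later vowels/h/w/y, collapse repeats."""
    t = re.sub(r"[^a-z]", "", token.lower())
    if not t:
        return ""
    key = t[0] + re.sub(r"[aeiouyhw]", "", t[1:])
    return re.sub(r"(.)\1+", r"\1", key)


def index_keys(name: str) -> set:
    """Generate exact, canonical (nickname-resolved), and phonetic keys per token."""
    keys = set()
    for token in name.lower().split():
        canonical = NICKNAMES.get(token, token)
        keys.add("exact:" + token)
        keys.add("canon:" + canonical)
        keys.add("phon:" + phonetic_key(canonical))
    return keys


if __name__ == "__main__":
    shared = index_keys("Bill Smith") & index_keys("William Smyth")
    print(sorted(shared))
```

In this sketch "Bill Smith" and "William Smyth" share no exact token, yet they overlap on the canonical and phonetic keys, which is what makes them retrievable by the same query.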

At query time, Rosette breaks the input names into tokens, then queries them against the stored high-recall keys. This step does not use synonym functions; instead, it works much like query expansion. The query acts as a high-recall ‘blocking’ pass that finds good candidate matches for rescoring.
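A blocking pass of this kind can be sketched in a few lines of Python. The sketch below is a simplified stand-in, not Rosette's implementation: keys_for is an assumed key generator (exact form plus a vowel-free skeleton), and the in-memory dictionary plays the role of the Solr index.

```python
from collections import defaultdict


def keys_for(token: str) -> set:
    """Stand-in key generator (assumption): exact form plus a vowel-free skeleton."""
    t = token.lower()
    skeleton = t[:1] + "".join(c for c in t[1:] if c not in "aeiouy")
    return {"exact:" + t, "skel:" + skeleton}


def build_index(names: dict) -> dict:
    """Index time: map each key to the ids of the names that produced it."""
    index = defaultdict(set)
    for doc_id, name in names.items():
        for token in name.split():
            for key in keys_for(token):
                index[key].add(doc_id)
    return index


def candidates(index: dict, query: str) -> set:
    """Query time: expand the query into keys and collect every document sharing one."""
    hits = set()
    for token in query.split():
        for key in keys_for(token):
            hits |= index.get(key, set())
    return hits


if __name__ == "__main__":
    names = {1: "Mao Zedong", 2: "Mao Tse-Tung", 3: "Jane Smith"}
    index = build_index(names)
    print(sorted(candidates(index, "Mao Zedung")))
```

The pass deliberately over-retrieves: any shared key is enough to become a candidate, and precision is left to the later rescoring step.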

Once these steps are fully executed, Rosette uses Solr’s re-rank functions to evaluate the query results. These candidate results are rescored against the query name, producing a similarity, or “matching” score. Unlike Solr’s typical document score, this name similarity score is normalized across queries so it can be used as a threshold to determine which indexed names should or should not be considered a match.
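The value of a normalized score is that a single cutoff works for every query. The Python sketch below illustrates that rescoring-and-threshold pattern under stated assumptions: difflib's character-level ratio stands in for Rosette's name similarity score, and the 0.8 threshold is an arbitrary example value, not a recommended setting.

```python
from difflib import SequenceMatcher


def similarity(query: str, candidate: str) -> float:
    """Stand-in scorer (assumption): character-level ratio, always in [0, 1]."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


def rescore(query: str, candidate_names: list, threshold: float = 0.8) -> list:
    """Score each candidate against the query; keep those at or above the threshold, best first."""
    scored = [(similarity(query, name), name) for name in candidate_names]
    return sorted([pair for pair in scored if pair[0] >= threshold], reverse=True)


if __name__ == "__main__":
    for score, name in rescore("Mao Zedong", ["Mao Zedong", "Mao Zedung", "Jane Smith"]):
        print(f"{score:.2f}  {name}")
```

Because the score is bounded and comparable across queries, the same threshold can be reused for every lookup, which is not true of a raw relevance score.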

[Diagram: how Rosette and Cloudera Search work together at index time and query time]

Request an evaluation version of Rosette from Basis Technology and try it out in your own Cloudera Search application today!

For several years, Cloudera and Basis Technology have partnered to provide customers with the best of three worlds: Hadoop for scalable infrastructure, Cloudera Search for interactive full-text search, and Rosette for multilingual search enrichment and enhancement. The certified integration between Cloudera Search and Rosette makes deployment quick and easy.

 
