How-to: Index and Search Multilingual Documents in Hadoop
Learn how to use Cloudera Search along with RBL-JE to search and index documents in multiple languages.
Our thanks to Basis Technology for providing the how-to below!
Basis Technology’s Rosette Base Linguistics for Java (RBL-JE) provides a comprehensive multilingual text analytics platform for improving search precision and recall. RBL provides tokenization, lemmatization, POS tagging, and de-compounding for Asian, European, Nordic, and Middle Eastern languages, and has just been certified for use with Cloudera Search.
Cloudera Search brings full-text, interactive search and scalable indexing to Apache Hadoop by marrying SolrCloud with HDFS, Apache HBase, and other projects in CDH. Because it’s integrated with CDH, Cloudera Search brings to search the same fault tolerance, scale, visibility, and flexibility as your other Hadoop workloads, and allows for a number of indexing, access control, and manageability options.
In this post, you’ll learn how to use Cloudera Search and RBL-JE to index and search documents. Since Cloudera takes care of the plumbing for distributed search and indexing, the only work needed to incorporate Basis Technology’s linguistics is loading the software and configuring your Solr collections.
First, install RBL-JE. This essentially involves unpacking a tar.gz file and copying your license file to the licenses directory. Note the root directory of the installation. We’ll refer to this as RBLJE_ROOT later.
Searching and indexing with RBL-JE requires a few additions to the solrconfig.xml and schema.xml files for each Solr collection that you will use. In the solrconfig.xml file, add these lines to ensure that the appropriate RBL jar files end up on the class path:
<lib path="[[RBLJE_ROOT]]/rbl-je-[[RBLJE_VER]]/lib/btrbl-je-[[RBLJE_VER]].jar" />
<lib path="[[RBLJE_ROOT]]/rbl-je-[[RBLJE_VER]]/lib/btcommon-[[BT_COMMON_VER]].jar" />
<lib path="[[RBLJE_ROOT]]/rbl-je-[[RBLJE_VER]]/lib/slf4j-api-[[SLF4J_VER]].jar" />
<lib path="[[RBLJE_ROOT]]/rbl-je-[[RBLJE_VER]]/lib/slf4j-simple-[[SLF4J_VER]].jar" />
<lib path="[[RBLJE_ROOT]]/rbl-je-[[RBLJE_VER]]/lib/btrbl-je-lucene-solr-[[LUCENE_SOLR_VER]]-[[RBLJE_VER]].jar" />
Replace the [[xxx]] placeholders in the pathnames above to match your installation and the version of RBL you are using. The version numbers can be determined by looking at the contents of your RBLJE_ROOT directory.
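If listing each jar individually becomes tedious, Solr's solrconfig.xml also accepts a lib directive that takes a directory and a file-name pattern. The fragment below is only a sketch of that alternative: /opt/basis stands in for RBLJE_ROOT and X.Y.Z for the RBL-JE version, and both are placeholders rather than real values.

```xml
<!-- Sketch only: /opt/basis and X.Y.Z are placeholders for RBLJE_ROOT and RBLJE_VER. -->
<!-- Adds every file in the RBL-JE lib directory whose name matches the regex to the class path. -->
<lib dir="/opt/basis/rbl-je-X.Y.Z/lib" regex=".*\.jar" />
```

This form picks up every jar in the directory, so if your lib directory happens to contain integration jars for more than one Lucene/Solr version, the explicit per-jar list is likely the safer choice.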
Edit the schema.xml file to add field types that use RBL and assign them to fields in your documents. Here is an example field type that specifies using RBL to analyze Chinese data:
<fieldtype name="chinese-basis" class="solr.TextField">
  <analyzer>
    <tokenizer class="com.basistech.rosette.lucene.BaseLinguisticsTokenizerFactory"
               language="zhs"
               licensePath="[[bt.license.path]]"
               modelDirectory="[[bt.model.directory]]" />
    <filter class="com.basistech.rosette.lucene.BaseLinguisticsTokenFilterFactory"
            language="zhs"
            licensePath="[[bt.license.path]]"
            dictionaryDirectory="[[bt.dictionary.directory]]"
            addLemmaTokens="true" />
  </analyzer>
</fieldtype>
- bt.license.path is
- bt.model.directory is
- bt.dictionary.directory is
And here is an example of using it on a field:
<field name="text" type="chinese-basis" indexed="true" stored="true" />
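If several fields in your documents hold Chinese text, standard Solr schema features let you reuse the same field type without declaring each field by hand. The fragment below is a minimal sketch, not part of the RBL-JE documentation: the *_zhs suffix and the title field are hypothetical names chosen for illustration.

```xml
<!-- Hypothetical names: the "title" field and the "*_zhs" suffix are illustrative only. -->
<!-- A second explicitly named field analyzed with the chinese-basis type defined earlier. -->
<field name="title" type="chinese-basis" indexed="true" stored="true" />
<!-- Any field whose name ends in _zhs is analyzed with chinese-basis as well. -->
<dynamicField name="*_zhs" type="chinese-basis" indexed="true" stored="true" />
```

With a rule like this in place, a document field named, say, summary_zhs would be tokenized and lemmatized by RBL at index and query time without any further schema changes.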
That’s it! Once this bit of configuration is done, the Cloudera Search framework can be used conventionally for indexing and searching. You’ll find a repository of configuration files, scripts, and sample documents that you can use to configure and test RBL-JE here. It provides working examples of the configuration techniques discussed above.