Learn how to use Cloudera Search along with RBL-JE to search and index documents in multiple languages.
Our thanks to Basis Technology for providing the how-to below!
Basis Technology’s Rosette Base Linguistics for Java (RBL-JE) provides a comprehensive multilingual text analytics platform for improving search precision and recall. RBL provides tokenization, lemmatization, POS tagging, and de-compounding for Asian, European, Nordic, and Middle Eastern languages, and has just been certified for use with Cloudera Search.
You can use Hue and Cloudera Search to build your own integrated Big Data search app.
In a previous post, you learned how to analyze data using Apache Hive via Hue’s Beeswax and Catalog apps. This time, you’ll see how to make Yelp Dataset Challenge data searchable by indexing it and building a customizable UI with the Hue Search app.
Indexing Data in Cloudera Search
Indexing data in Cloudera Search involves :
- Setting up SolrCloud to partition your dataset into multiple indexes and processes
- Configuring SolrCloud collections to hold indexes
- Specifying the schema by which indexes will be created
- Feeding relevant data into the SolrCloud
Cloudera Manager 4.7 added support for managing Cloudera Search 1.0. Thus Cloudera Manager users can easily deploy all components of Cloudera Search (including Apache Solr) and manage all related services, just like every other service included in CDH (Cloudera’s distribution of Apache Hadoop and related projects).
In this how-to, you will learn the steps involved in adding Cloudera Search to a Cloudera Enterprise (CDH + Cloudera Manager) cluster.
In my previous post you learned how to index email messages in batch mode, and in near real time, using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, maintained by NGDATA and Cloudera. (If you have not read the previous post, I recommend you do so for background before reading on.)
Which near-real-time method to choose,
The rise of Big Data has been pushing search engines to handle ever-increasing amounts of data. While building Cloudera Search, one of the things we considered in Cloudera Engineering was how we would incorporate Apache Solr with Apache Hadoop in a way that would enable near-real-time indexing and searching on really big data.
Eventually, we built Cloudera Search on Solr and Apache Lucene,