How-to: Index and Search Data with Hue’s Search App

You can use Hue and Cloudera Search to build your own integrated Big Data search app.

In a previous post, you learned how to analyze data using Apache Hive via Hue’s Beeswax and Catalog apps. This time, you’ll see how to make Yelp Dataset Challenge data searchable by indexing it and building a customizable UI with the Hue Search app.

Indexing Data in Cloudera Search

Indexing data in Cloudera Search involves:

  • Setting up SolrCloud to partition your dataset into multiple indexes and processes
  • Configuring SolrCloud collections to hold indexes
  • Specifying the schema by which indexes will be created
  • Feeding relevant data into the SolrCloud

First, install Cloudera Search using this guide. Then, deploy and configure SolrCloud.

Next, create a new collection and index named "reviews". You can use the predefined schema available here.

 

Replace the field definitions in the schema with a mapping corresponding to the Yelp data. The schema defines each data field that will be available in the search index. You can read more about schema.xml in the Solr wiki.
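
Before indexing, it can help to list the fields a schema actually declares and compare them with the columns you plan to load. The following is a minimal Python sketch using only the standard library. The schema fragment and its field names (`business_id`, `stars`, `text`) are illustrative assumptions, not the mapping used in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment; field names and types are assumptions
# for illustration, not the mapping used in the original post.
SAMPLE_SCHEMA = """
<schema name="reviews" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="business_id" type="string" indexed="true" stored="true"/>
    <field name="stars" type="int" indexed="true" stored="true"/>
    <field name="text" type="text_en" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
"""


def declared_fields(schema_xml):
    """Return {field name: field type} for every <field> element."""
    root = ET.fromstring(schema_xml)
    return {f.get("name"): f.get("type") for f in root.iter("field")}


def missing_columns(schema_xml, csv_header):
    """Return CSV columns that have no matching <field> in the schema."""
    fields = declared_fields(schema_xml)
    return [col for col in csv_header if col not in fields]


if __name__ == "__main__":
    print(declared_fields(SAMPLE_SCHEMA))
    print(missing_columns(SAMPLE_SCHEMA, ["business_id", "stars", "votes"]))
```

The check only looks at explicit `<field>` entries, so a column that is covered by a `<dynamicField>` pattern would still be reported as missing.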

 

Then, retrieve and clean a subset of the Yelp data with a Hive query, download it as a CSV, and index it with the indexer tool and this command:

 

The command uses a morphline file to map the Yelp data to the fields defined in the index's schema.xml. (Cloudera Morphlines, which is bundled with the Cloudera Development Kit, is a new tool for data transformations that simplifies indexing your data.) When debugging morphlines, the --dry-run option will save you some time.
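
Conceptually, the morphline reads each CSV line, assigns the columns to named fields, and hands the resulting record to Solr. The Python sketch below mimics only that mapping step to make the idea concrete. It is not morphline syntax, and the column names and their order are assumptions rather than the configuration used in this post.

```python
import csv
import io

# Assumed column order of the exported CSV; the real export may differ.
COLUMNS = ["business_id", "stars", "text"]


def rows_to_documents(csv_text, columns=COLUMNS):
    """Map each CSV row to a dict keyed by index field name.

    Rows whose column count does not match are skipped and their line
    numbers reported, similar in spirit to inspecting records in a dry run.
    """
    documents, rejected = [], []
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) != len(columns):
            rejected.append(line_no)
            continue
        documents.append(dict(zip(columns, row)))
    return documents, rejected


if __name__ == "__main__":
    sample = 'b1,5,"Great tacos, friendly staff"\nb2,3\n'
    docs, bad = rows_to_documents(sample)
    print(docs)  # one well-formed record
    print(bad)   # line numbers of rejected rows
```

Running a check like this on a small sample of the CSV can surface quoting or column-count problems before you submit an indexing job.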

Finally, the Administration panel of the Search app lets you adjust the look and feel and the features of the search page.
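
Before tuning the search page, you may want to confirm that the collection returns documents by querying Solr directly. The sketch below builds a URL for Solr's `/select` handler. The host and port (`localhost:8983`) and the `text` field in the sample query are assumptions, so adjust them to your deployment and schema.

```python
from urllib.parse import urlencode

# Assumed Solr location; 8983 is Solr's usual default port, but your
# deployment may use a different host or port.
SOLR_BASE = "http://localhost:8983/solr"


def select_url(collection, query, rows=10, base=SOLR_BASE):
    """Build a URL for Solr's /select handler that asks for JSON output."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return "{}/{}/select?{}".format(base, collection, params)


if __name__ == "__main__":
    # Fetching this URL (for example with urllib.request.urlopen) requires
    # a running Solr instance, so the sketch only prints it.
    print(select_url("reviews", "text:pizza"))
```

Opening the printed URL in a browser, or fetching it from a script, should return matching documents once the "reviews" collection has been indexed.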

View a demo of this entire process here:

Troubleshooting

  1. If you see this error:

     

    You may have forgotten to create the collection:

     

  2. If you see this error:

     

    You might need to force Solr to reload the configuration. Beware: this might break Apache ZooKeeper, in which case see Error #3.

     

  3. If you see this error:

     

    It probably comes from Error #2. You might need to re-upload the config and recreate the collection.

Conclusion

Cloudera Search is great for opening Hadoop to a wider user base and for quick data retrieval. Other how-tos describe other use cases, like email or customer data search.

As usual, feel free to comment on the hue-user list, community discussion forum, or @gethue!


2 Responses
  • Bill / December 14, 2013 / 6:59 PM

    Why can I show this? And this job is failed.

    record: {_attachment_body=[java.io.BufferedInputStream@2250ed02], _attachment_name=[yelp_40.csv], base_id=[hdfs://master:8020/user/root/yelp_40.csv], file_download_url=[hdfs://master:8020/user/root/yelp_40.csv], file_group=[root], file_host=[master], file_last_modified=[1387074922966], file_length=[31474], file_name=[yelp_40.csv], file_owner=[root], file_path=[/user/root/yelp_40.csv], file_permissions_group=[r--], file_permissions_other=[r--], file_permissions_stickybit=[false], file_permissions_user=[rw-], file_port=[8020], file_scheme=[hdfs], file_upload_url=[hdfs://master:8020/user/root/yelp_40.csv]}
    at com.cloudera.cdk.morphline.base.FaultTolerance.handleException(FaultTolerance.java:74)
    at org.apache.solr.hadoop.morphline.MorphlineMapRunner.map(MorphlineMapRunner.java:213)
    at org.apache.solr.hadoop.morphline.MorphlineMapper.map(MorphlineMapper.java:86)
    at org.apache.

    …….
    attempt_201312151012_0003_m_000000_2: 7899 [main] INFO com.cloudera.cdk.morphline.api.MorphlineContext – Importing commands
    attempt_201312151012_0003_m_000000_2: 14255 [main] INFO org.apache.solr.hadoop.morphline.MorphlineMapRunner – Processing file hdfs://master:8020/user/root/yelp_40.csv
    attempt_201312151012_0003_m_000000_2: 14282 [main] ERROR org.apache.solr.hadoop.morphline.MorphlineMapRunner – Unable to process file hdfs://master:8020/user/root/yelp_40.csv
    attempt_201312151012_0003_m_000000_2: com.cloudera.cdk.morphline.api.MorphlineRuntimeException: Missing charset for record: {_attachment_body=[java.io.BufferedInputStream@24cc0f9f], _attachment_name=[yelp_40.csv], base_id=[hdfs://master:8020/user/root/yelp_40.csv], file_download_url=[hdfs://master:8020/user/root/yelp_40.csv], file_group=[root], file_host=[master], file_last_modified=[1387074922966], file_length=[31474], file_name=[yelp_40.csv], file_owner=[root], file_path=[/user/root/yelp_40.csv], file_permissions_group=[r--], file_permissions_other=[r--], file_permissions_stickybit=[false], file_permissions_user=[rw-], file_port=[8020], file_scheme=[hdfs], file_upload_url=[hdfs://master:8020/user/root/yelp_40.csv]}

  • Justin Kestelyn (@kestelyn) / December 16, 2013 / 10:19 AM

    Bill,

    I recommend that you post this issue to the Hue community forum:

    http://community.cloudera.com/t5/Web-UI-Hue-Beeswax/bd-p/Hue
