Jeffrey Shmain, Author at Cloudera Blog

October 15, 2015 | Technical

How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code

Learn how to use OCR tools, Apache Spark, and other Apache Hadoop components to process PDF images at scale. Optical character recognition (OCR) technologies have advanced significantly over the last 20 years. However, during that time, there has been little or no effort to marry OCR with distributed architectures such as Apache Hadoop to process […]

by Jeffrey Shmain 13 min read

November 5, 2013 | Technical

Email Indexing Using Cloudera Search and HBase

In my previous post you learned how to index email messages in batch mode, and in near real time, using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, maintained by NGDATA and Cloudera. (If you have not read the […]

by Jeffrey Shmain 9 min read

Apache HBase Search
