Tag Archives: Search

How-to: Use HUE’s Notebook App with SQL and Apache Spark for Analytics

Categories: How-to Hue Spark

This post from the HUE team about using HUE (the open source web GUI for Apache Hadoop), Apache Spark, and SQL for analytics was initially published in the HUE project’s blog.

Apache Spark is getting popular and HUE contributors are working on making it accessible to even more users. Specifically, by creating a Web interface that allows anyone with a browser to type some Spark code and execute it.

Read more

How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code

Categories: HBase How-to Search Spark

Learn how to use OCR tools, Apache Spark, and other Apache Hadoop components to process PDF images at scale.

Optical character recognition (OCR) technologies have advanced significantly over the last 20 years. However, during that time, there has been little or no effort to marry OCR with distributed architectures such as Apache Hadoop to process large numbers of images in near-real time.

In this post, you will learn how to use standard open source tools along with Hadoop components such as Apache Spark,

Read more

How-to: Use Apache Solr to Query Indexed Data for Analytics

Categories: How-to Search

Bet you didn’t know this: In some cases, Solr offers lightning-fast response times for business-style queries.

If you were to ask well informed technical people about use cases for Solr, the most likely response would be that Solr (in combination with Apache Lucene) is an open source text search engine: one can use Solr to index documents, and after indexing, these same documents can be easily searched using free-form queries in much the same way as you would query Google.

Read more

Ibis on Impala: Python at Scale for Data Science

Categories: Cloudera Labs Data Science Impala

This new Cloudera Labs project promises to deliver the great Python user experience and ecosystem at Hadoop scale.

Across the user community, you will find general agreement that the Apache Hadoop stack has progressed dramatically in just the past few years. For example, Search and Impala have moved Hadoop beyond batch processing, while developers are seeing significant productivity gains and additional use cases by transitioning from MapReduce to Apache Spark.

Thanks to such advances in the ecosystem,

Read more

New in CDH 5.4: Sensitive Data Redaction

Categories: CDH Cloudera Manager Platform Security & Cybersecurity

The best data protection strategy is to remove sensitive information from everyplace it’s not needed.

Have you ever wondered what sort of “sensitive” information might wind up in Apache Hadoop log files? For example, if you’re storing credit card numbers inside HDFS, might they ever “leak” into a log file outside of HDFS? What about SQL queries? If you have a query like select * from table where creditcard = ‘1234-5678-9012-3456’,

Read more