Tag Archives: security

How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code

Categories: HBase How-to Search Spark

Learn how to use OCR tools, Apache Spark, and other Apache Hadoop components to process PDF images at scale.

Optical character recognition (OCR) technologies have advanced significantly over the last 20 years. However, during that time, there has been little or no effort to marry OCR with distributed architectures such as Apache Hadoop to process large numbers of images in near-real time.

In this post, you will learn how to use standard open source tools along with Hadoop components such as Apache Spark,

Read more

How-to: Prepare Unstructured Data in Impala for Analysis

Categories: How-to Impala

Learn how to build an Impala table around data that comes from non-Impala, or even non-SQL, sources.

As data pipelines start to include more aspects such as NoSQL or loosely specified schemas, you might encounter situations where you have data files (particularly in Apache Parquet format) where you do not know the precise table definition. This tutorial shows how you can build an Impala table around data that comes from non-Impala or even non-SQL sources,

Read more

Meet Cloudera’s Apache Spark Committers

Categories: Community General Meet the Engineer Spark

The super-active Apache Spark community is exerting a strong gravitational pull within the Apache Hadoop ecosystem. I recently had that opportunity to ask Cloudera’s Apache Spark committers (Sean Owen, Imran Rashid [PMC], Sandy Ryza, and Marcelo Vanzin) for their perspectives about how the Spark community has worked and is working together, and the work to be done via the One Platform initiative to make the Spark stack enterprise-ready.

Recently, Apache Spark has become the most currently active project in the Apache Hadoop ecosystem (measured by number of contributors/commits over time),

Read more

Community Meetups at Strata + Hadoop World NYC 2015

Categories: Community Events

Strata + Hadoop World 2015 NYC is more than a daytime conference; it’s also a nighttime meetup experience. (Plus, there are a bunch of book signings.)

It won’t be long before we’re all in NYC for Strata + Hadoop World (Sept. 29-Oct. 1; if you haven’t registered yet, a 20% discount is still available). So, consider for your evening agenda:

Read more

How-to: Run Apache Mesos on CDH

Categories: CDH Cloudera Manager Guest Ops and DevOps

Big Industries, Cloudera systems integration and reseller partner for Belgium and Luxembourg, has developed an integration of Apache Mesos and CDH that can be deployed and managed through Cloudera Manager. In this post, Big Industries’ Rob Gibbon explains the benefits of deploying Mesos on your cluster and walks you through the process of setting it up.

[Editor’s Note: Mesos integration is not currently supported by Cloudera, thus the setup described below is not recommended for production use.]

Apache Mesos is a distributed,

Read more