Author Archives: Justin Kestelyn

New in Cloudera Labs: Envelope (for Apache Spark Streaming)

Categories: Cloudera Labs Data Ingestion Kafka Kudu

As a warm-up to Spark Summit West in San Francisco (June 6-8), we’ve added a new project to Cloudera Labs that makes building Spark Streaming pipelines considerably easier.

Spark Streaming is the go-to engine for stream processing in the Cloudera stack. It allows developers to build stream data pipelines that harness the rich Spark API for parallel processing, expressive transformations, fault tolerance, and exactly-once processing. But it requires a programmer to write code,

Read More

Cloudera’s Process for Handling Security Vulnerabilities

Categories: General Security

Cloudera considers the handling and reporting of security vulnerabilities a very serious matter. In this post, learn the processes involved.

In addition to expecting enterprise-class standards for stability and reliability, Cloudera’s customers also have expectations for industry-standard processes around the discovery, fix, and reporting of security issues. In this post, I will describe how Cloudera addresses such issues in our software.

An overview of the process looks like this flowchart:

[Figure: flowchart of the security vulnerability handling process]

The first step in the life cycle of a security vulnerability is its discovery and reporting to Cloudera.

Read More

Apache HBase is Everywhere

Categories: Community Events HBase

For Cloudera, Apache HBase has grown into a stable, scalable, mature, and critical component of the Apache Hadoop stack.

HBase adds the ability to do low-latency random reads and writes across your big data. While it is a key piece of the Apache Hadoop ecosystem, HBase itself has an ecosystem of projects and products that use it as a storage engine for systems such as time series databases (OpenTSDB) or SQL-style databases (Apache Phoenix,

Read More

How-to: Process and Index Medical Images with Apache Hadoop and Apache Solr

Categories: CDH Guest Search Use Case ZooKeeper

Thanks to Karthik Vadla, Abhi Basu, and Monica Martinez-Canales of Intel Corp. for the following guest post about using CDH for cost-effective processing/indexing of DICOM (medical) images.

Medical imaging has rapidly become the best non-invasive method to evaluate a patient and determine whether a medical condition exists. Imaging is used to assist in the diagnosis of a condition and, in most cases, is the first step of the journey through the modern medical system.

Read More

How-to: Configure SAP HANA with Apache Impala (incubating)

Categories: How-to Impala

Combining HANA and Impala can unlock a variety of new use cases that span the full range of enterprise data. Here’s how to do it.

Information is growing at an exponential rate, driven by enterprise applications and databases, and often takes the form of new types of data from sources such as social media, sensors, and mobile devices. Because it is not cost-effective to store and process all this information in an in-memory database,

Read More