Category Archives: Spark

Announcing RecordService Beta 2: Brings Column-level Security to Apache Spark and MapReduce

Categories: General, Security, Sentry, Spark

With this new beta release, column-level privileges set via Apache Sentry (incubating) are now enforced on Spark/MapReduce jobs.

Cloudera is excited to announce the availability of the second beta release of RecordService. This release is based on CDH 5.5 and adds several new features, including:

  • Support for Sentry column-level security. Previously, column-level access control required the use of views; now,

Read More

Spark-TS: A New Library for Analyzing Time-Series Data with Apache Spark

Categories: Data Science, Spark

Time-series analysis is becoming mainstream across multiple data-rich industries. The new Spark-TS library helps analysts and data scientists focus on business questions, not on building their own algorithms.

Have you ever wanted to build models over measurements coming in every second from sensors across the world? Dig into intra-day trading prices of millions of financial instruments? Compare hourly view statistics across every page on Wikipedia? To do any of these things,

Read More

Progress Report: Hive-on-Spark Nears Production Readiness

Categories: Cloudera Labs, Hive, Spark

Contributors from Intel, Cloudera, and the rest of the community have been making strong progress on the Hive-on-Spark initiative. This post provides an update.

Since its inception about one year ago, the community initiative to make Apache Spark a data processing engine for Apache Hive (HIVE-7292) has attracted widespread interest from developers around the world and gone through phases of rapid development, testing, and early deployment. (For example,

Read More

Sustained Innovation in Apache Spark: DataFrames, Spark SQL, and MLlib

Categories: CDH, Spark

Cloudera has announced support for the Spark SQL/DataFrame API and MLlib. This post explains their benefits for app developers, data analysts, data engineers, and data scientists.

In July 2015, Cloudera reaffirmed the position it has held since 2013: that Apache Spark is on course to replace MapReduce as the default general-purpose data processing engine for Apache Hadoop. Thanks to initiatives like the One Platform Initiative,

Read More

How-to: Build a Complex Event Processing App on Apache Spark and Drools

Categories: HBase, How-to, Kafka, Spark, Use Case

Combining CDH with a business execution engine can serve as a solid foundation for complex event processing on big data.

Event processing involves tracking and analyzing streams of event data to support better insight and decision making. With the recent explosion in data volume and the growing diversity of data sources, this goal can be quite challenging for architects to achieve.

Complex event processing (CEP) is a type of event processing that combines data from multiple sources to identify patterns and complex relationships across various events.

Read More