Category Archives: General

The Cloudera Developer Program: The Low-cost, Low-risk Way to Develop on Cloudera

Categories: CDH Cloudera Manager General

The Cloudera Developer Program is kind of amazing. Here’s why.

For those with a desire to build new applications on Cloudera’s platform, historically there’s been a gap to cross between pure bootstrapping on CDH (whether via a small on-premise cluster, in the public cloud, or using Cloudera Live) and obtaining full-blown support for a complete enterprise data hub with all the fixings (including Cloudera Navigator). For individuals who have moved beyond self-learning and are getting “serious,”

Read more

Meet the Authors: “Data Analytics with Hadoop” from O’Reilly Media

Categories: Books Data Science General Hadoop

I recently had a chat with Benjamin Bengfort, a data scientist finishing his PhD at the University of Maryland, and Jenny Kim, a software engineer at Cloudera, about their forthcoming O’Reilly Media book (now in Early Access), Data Analytics with Hadoop: An Introduction for Data Scientists.

Why did you decide to write this book?

Ben: The content was originally part of a class that Jenny and I were teaching together.

Read more

Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard

Categories: Data Science General HDFS Impala Kudu Performance

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de facto standard for columnar in-memory processing and interchange. Here’s how it works.

Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:

  • A columnar memory layout permitting O(1) random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations with modern processors.
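
To make the O(1) random-access claim concrete, here is a minimal sketch of a fixed-width columnar layout using only Python's standard `array` module. It is an illustration of the general idea, not Arrow's actual format or API; the sample rows and the helper names `value_at` and `byte_offset` are assumptions made up for this example.

```python
from array import array

# Hypothetical row-oriented records: (id, price).
rows = [(1, 10.5), (2, 20.25), (3, 30.0)]

# Columnar layout: one contiguous, fixed-width buffer per column.
ids = array("q", (r[0] for r in rows))     # 64-bit signed integers
prices = array("d", (r[1] for r in rows))  # 64-bit floats


def byte_offset(column, i):
    """Byte position of element i: a single multiplication, no scanning."""
    return i * column.itemsize


def value_at(column, i):
    """O(1) random access into a fixed-width column."""
    return column[i]


# An analytic scan touches only the buffer of the column it needs,
# which is what makes the layout cache-friendly.
total = sum(prices)

print(value_at(prices, 1), byte_offset(ids, 2), total)
```

Because every value in a column has the same width, locating element `i` is plain arithmetic on the buffer, and a scan over one column reads consecutive memory without pulling in the other columns.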

Read more

Spark Dataflow Joins Google’s Dataflow SDK

Categories: Cloud Cloudera Labs General Spark

Spark Dataflow from Cloudera Labs is now part of Google’s new Dataflow SDK, which will be proposed to the Apache Incubator.

Spark Dataflow is an experimental implementation of Google’s Dataflow programming model that runs on Apache Spark. The initial implementation was written by Josh Wills, and entered Cloudera Labs exactly a year ago. Since then, we’ve seen a number of contributions to the project, culminating in the recent addition of an implementation of streaming (running on Spark Streaming) by Amit Sela from PayPal.

Read more

Announcing RecordService Beta 2: Brings Column-level Security to Apache Spark and MapReduce

Categories: General Platform Security & Cybersecurity Sentry Spark

With this new beta release, column-level privileges set via Apache Sentry (incubating) are now enforced on Spark/MapReduce jobs.

Cloudera is excited to announce the availability of the second beta release for RecordService. This release is based on CDH 5.5 and provides some new features, including:

  • Support for Sentry column-level security. Previously, column-level access control required the use of views;

Read more