Category Archives: General

The Cloudera Developer Program: The Low-cost, Low-risk Way to Develop on Cloudera

Categories: CDH, Cloudera Manager, General

The Cloudera Developer Program is kind of amazing. Here’s why.

For those who want to build new applications on Cloudera’s platform, there has historically been a gap to cross between pure bootstrapping on CDH (whether via a small on-premises cluster, in the public cloud, or using Cloudera Live) and obtaining full-blown support for a complete enterprise data hub with all the fixings (including Cloudera Navigator). For individuals who have moved beyond self-learning and are getting “serious,”

Read More

Meet the Authors: “Data Analytics with Hadoop” from O’Reilly Media

Categories: Books, Data Science, General, Hadoop

I recently had a chat with Benjamin Bengfort, a data scientist finishing his PhD at the University of Maryland, and Jenny Kim, a software engineer at Cloudera, about their forthcoming O’Reilly Media book (now in Early Access), Data Analytics with Hadoop: An Introduction for Data Scientists.

Why did you decide to write this book?

Ben: The content was originally part of a class that Jenny and I were teaching together.

Read More

Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard

Categories: Data Science, General, HDFS, Impala, Kudu, Performance

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de facto standard for columnar in-memory processing and interchange. Here’s how it works.

Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:

  • A columnar memory layout permitting O(1) random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations on modern processors. Developers can create very fast algorithms that process Arrow data structures.
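
To make the columnar idea concrete, here is a minimal sketch in plain Python. It is not Arrow's API or its exact memory format: the `Int32Column` class and its methods are hypothetical names invented for illustration. It assumes a fixed-width integer column stored as one contiguous values buffer plus a validity bitmap (one bit per slot, least-significant bit first), which is enough to show why reading element i is a constant-time index computation and why a scan touches memory sequentially.

```python
from array import array


class Int32Column:
    """Hypothetical fixed-width column: a values buffer plus a validity bitmap."""

    def __init__(self, values):
        self.length = len(values)
        # Contiguous buffer of C ints; null slots hold a placeholder 0.
        self.values = array("i", (0 if v is None else v for v in values))
        # One validity bit per slot, packed least-significant bit first.
        self.validity = bytearray((self.length + 7) // 8)
        for i, v in enumerate(values):
            if v is not None:
                self.validity[i // 8] |= 1 << (i % 8)

    def is_valid(self, i):
        # Constant-time bit test: byte i // 8, bit i % 8.
        return bool(self.validity[i // 8] & (1 << (i % 8)))

    def get(self, i):
        # O(1) random access: a bounds check, a bit test, and one indexed read.
        if not 0 <= i < self.length:
            raise IndexError(i)
        return self.values[i] if self.is_valid(i) else None

    def total(self):
        # A scan walks the values buffer in order, skipping null slots.
        return sum(v for i, v in enumerate(self.values) if self.is_valid(i))


col = Int32Column([10, None, 30, 40])
print(col.get(2), col.get(1), col.total())
```

Because every value in the column has the same width and lives in a single buffer, locating slot i needs no pointer chasing, and a loop like `total` is the kind of sequential pass that a native implementation could vectorize.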

Read More

What’s New in Cloudera Director 2.0?

Categories: Cloud, General, Ops and DevOps

New functionality includes support for spot instances, automatic job submission, and integrated setup for HA and Kerberized clusters.

Cloudera Director reflects Cloudera’s commitment to providing a simple and reliable way to deploy, scale, and manage Apache Hadoop clusters in the cloud of your choice. Cloudera Director lets you deploy production-ready clusters for big data applications and successfully run workloads in the cloud. With Cloudera Director 2.0,

Read More

Spark Dataflow Joins Google’s Dataflow SDK

Categories: Cloud, Cloudera Labs, General, Spark

Spark Dataflow from Cloudera Labs is now part of Google’s new Dataflow SDK, which will be proposed to the Apache Incubator.

Spark Dataflow is an experimental implementation of Google’s Dataflow programming model that runs on Apache Spark. The initial implementation was written by Josh Wills and entered Cloudera Labs exactly a year ago. Since then, we’ve seen a number of contributions to the project, culminating in the recent addition of a streaming implementation (running on Spark Streaming) by Amit Sela from PayPal.

Read More