The Cloudera Developer Program is kind of amazing. Here’s why.
For those who want to build new applications on Cloudera’s platform, historically there’s been a gap to cross between pure bootstrapping on CDH (whether via a small on-premises cluster, in the public cloud, or using Cloudera Live) and obtaining full-blown support for a complete enterprise data hub with all the fixings (including Cloudera Navigator). For individuals who have moved beyond self-learning and are getting “serious,”
I recently had a chat with Benjamin Bengfort, a data scientist finishing his PhD at the University of Maryland, and Jenny Kim, a software engineer at Cloudera, about their forthcoming O’Reilly Media book (now in Early Access), Data Analytics with Hadoop: An Introduction for Data Scientists.
Why did you decide to write this book?
Ben: The content was originally part of a class that Jenny and I were teaching together.
Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de facto standard for columnar in-memory processing and interchange. Here’s how it works.
Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:
- A columnar memory layout permitting O(1) random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations on modern processors. Developers can create very fast algorithms that process Arrow data structures.
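
To make the columnar idea concrete, here is a minimal sketch in plain Python. It does not use Arrow or reproduce Arrow's actual memory format; the sample records, the `ids` and `prices` columns, and the `value_at` helper are all hypothetical, and the standard-library `array` type merely stands in for a contiguous, fixed-width column buffer.

```python
from array import array

# Row-oriented data: each record is a separate object holding all its fields.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 12.0},
    {"id": 3, "price": 7.25},
]

# Columnar data: one contiguous, fixed-width typed buffer per field.
# "q" is a signed 64-bit integer, "d" is a 64-bit float.
ids = array("q", (r["id"] for r in rows))
prices = array("d", (r["price"] for r in rows))


def value_at(column, i):
    """Return element i of a column.

    Because every element has the same width, the position of element i
    is simply i * column.itemsize bytes from the start of the buffer,
    so the lookup cost does not depend on the column length.
    """
    return column[i]


# An analytic scan reads only the column it needs, sequentially in memory,
# and never touches the "id" values.
total = sum(prices)
```

Here `value_at(prices, 2)` reads the third price directly by offset, and the `sum(prices)` scan walks one dense buffer. That sequential, same-typed access pattern is what makes a columnar layout friendly to CPU caches and SIMD instructions in a real implementation.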
New functionality includes support for spot instances, automatic job submission, and integrated setup for HA and Kerberized clusters.
Cloudera Director is the manifestation of Cloudera’s commitment to provide a simple and reliable way to deploy, scale, and manage Apache Hadoop clusters in the cloud of your choice. Cloudera Director lets you deploy production-ready clusters for big data applications and successfully run workloads in the cloud. With Cloudera Director 2.0,
Spark Dataflow from Cloudera Labs is now part of Google’s new Dataflow SDK, which will be proposed to the Apache Incubator.
Spark Dataflow is an experimental implementation of Google’s Dataflow programming model that runs on Apache Spark. The initial implementation was written by Josh Wills, and entered Cloudera Labs exactly a year ago. Since then, we’ve seen a number of contributions to the project, culminating in the recent addition of an implementation of streaming (running on Spark Streaming) by Amit Sela from PayPal.