Category Archives: General

Check Out Those New and Improved Cloudera Docs

Categories: CDH Cloudera Manager General

Cloudera has given its documentation set a facelift, and we think you’ll like the new look. We use more whitespace and a font that is easier to read and skim, and your pages load much faster. But the improvements go beyond the merely aesthetic.

While electronic documentation has been around for decades, most online documentation is still presented as if it were printed in books. There is a table of contents that assumes you will read the content from start to finish.

Read More

The Cloudera Developer Program: The Low-cost, Low-risk Way to Develop on Cloudera

Categories: CDH Cloudera Manager General

The Cloudera Developer Program is kind of amazing. Here’s why.

For those with a desire to build new applications on Cloudera’s platform, historically there’s been a gap to cross between pure bootstrapping on CDH (whether via a small on-premises cluster, in the public cloud, or using Cloudera Live) and obtaining full-blown support for a complete enterprise data hub with all the fixings (including Cloudera Navigator). For individuals who have moved beyond self-learning and are getting “serious,”

Read More

Meet the Authors: “Data Analytics with Hadoop” from O’Reilly Media

Categories: Books Data Science General Hadoop

I recently had a chat with Benjamin Bengfort, a data scientist finishing his PhD at the University of Maryland, and Jenny Kim, a software engineer at Cloudera, about their forthcoming O’Reilly Media book (now in Early Access), Data Analytics with Hadoop: An Introduction for Data Scientists.

Why did you decide to write this book?

Ben: The content was originally part of a class that Jenny and I were teaching together.

Read More

Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard

Categories: Data Science General HDFS Impala Kudu Performance

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de facto standard for columnar in-memory processing and interchange. Here’s how it works.

Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:

  • A columnar memory layout permitting O(1) random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations on modern processors.
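
To make the columnar idea concrete, here is a minimal sketch in plain Python. It is not Arrow's actual format or API: it only assumes that each field is stored in its own contiguous, fixed-width buffer, and the names (rows, ids, scores, value_at) are invented for illustration.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [(1, 9.5), (2, 7.25), (3, 8.0)]

# Column-oriented layout: one contiguous, fixed-width buffer per field.
# "q" is a signed 64-bit integer, "d" is a 64-bit float.
ids = array("q", (r[0] for r in rows))
scores = array("d", (r[1] for r in rows))


def value_at(column, i):
    """Return element i of a column.

    Because every element has the same width, the element lives at byte
    offset i * column.itemsize, so the lookup cost does not depend on i.
    """
    return column[i]


# An analytic scan touches only the buffer it needs, reading it front to back.
total = sum(scores)
```

Reading one field for every record walks a single buffer sequentially, which is the access pattern that caches and SIMD units favor.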

Read More