Category Archives: Data Science

Meet the Authors: “Data Analytics with Hadoop” from O’Reilly Media

Categories: Books Data Science General Hadoop

I recently had a chat with Benjamin Bengfort, a data scientist finishing his PhD at the University of Maryland, and Jenny Kim, a software engineer at Cloudera, about their forthcoming O’Reilly Media book (now in Early Access), Data Analytics with Hadoop: An Introduction for Data Scientists.

Why did you decide to write this book?

Ben: The content was originally part of a class that Jenny and I were teaching together.

Read More

Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard

Categories: Data Science General HDFS Impala Kudu Performance

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de-facto standard for columnar in-memory processing and interchange. Here’s how it works.

Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:

  • A columnar memory-layout permitting O(1) random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations with modern processors.

Read More

Making Python on Apache Hadoop Easier with Anaconda and CDH

Categories: CDH Cloudera Manager Data Science Spark

Enabling Python development on CDH clusters (for PySpark, for example) is now much easier thanks to new integration with Continuum Analytics’ Python platform (Anaconda).

Python has become an increasingly popular tool for data analysis, including data processing, feature engineering, machine learning, and visualization. Data scientists and data engineers enjoy Python’s rich numerical and analytical libraries—such as NumPy, pandas, and scikit-learn—and have long wanted to apply them to large datasets stored in Apache Hadoop clusters.

Read More

How-to: Predict Telco Churn with Apache Spark MLlib

Categories: Data Science Spark Use Case

Spark MLLib is growing in popularity for machine-learning model development due to its elegance and usability. In this post, you’ll learn why.

Spark MLLib is a library for performing machine-learning and associated tasks on massive datasets. With MLlib, fitting a machine-learning model to a billion observations can take a couple lines of code and leverage hundreds of machines. MLlib greatly simplifies the model development process.

In this post,

Read More

How-to: Train Models in R and Python using Apache Spark MLlib and H2O

Categories: Data Science How-to Spark

Creating and training machine-learning models is more complex on distributed systems, but there are lots of frameworks for abstracting that complexity.

There are more options now than ever from proven open source projects for doing distributed analytics, with Python and R become increasingly popular. In this post, you’ll learn the options for setting up a simple read-eval-print (REPL) environment with Python and R within the Cloudera QuickStart VM using APIs for two of the most popular cluster computing frameworks: Apache Spark (with MLlib) and H2O (from the company with the same name).

Read More