Category Archives: General

Bayesian Machine Learning on Apache Spark

Categories: Data Science General Spark

Markov Chain Monte Carlo methods are another example of useful statistical computation for Big Data that Apache Spark handles capably.

During my internship at Cloudera, I have been working on integrating PyMC with Apache Spark. PyMC is an open source Python package that allows users to easily apply Bayesian machine learning methods to their data, while Spark is a new, general framework for distributed computing on Hadoop.

Read More

This Month in the Ecosystem (July 2014)

Categories: General

Welcome to our 11th edition of “This Month in the Ecosystem,” a digest of highlights from July 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

  • An early release of the new O’Reilly Media book, Hadoop Application Architectures, became available. This one is sure to become standard bookshelf material. (Look for signed copies at Strata + Hadoop World!)
  • Continuuity introduced Tephra,

Read More

Cloudera Live: The Instant Apache Hadoop Experience

Categories: CDH Cloud General Hue

Get started with Apache Hadoop and use-case examples online in just seconds.

Today, we announced the Cloudera Live Read-Only Demo, a new online service for developers and analysts (currently in public beta) that makes it easy to learn, explore, and try out CDH, Cloudera’s open source software distribution containing Apache Hadoop and related projects. No downloads, no installations, no waiting — just point-and-play!

Try the Cloudera Live Read-Only Demo

The Cloudera Live Read-Only Demo is a live CDH 5 cluster with a Hue interface (based on Hue 3.5.0,

Read More

How-to: Implement Role-based Security in Impala using Apache Sentry

Categories: General Hive How-to Impala Security

This quick demo illustrates how easy it is to implement role-based access control in Impala using Sentry.

Apache Sentry (incubating) is the Apache Hadoop ecosystem tool for role-based access control (RBAC). In this how-to, I will demonstrate how to implement Sentry for RBAC in Impala. I feel this introduction is best motivated by a use case.

Data warehouse optimization is one of the most common Hadoop use cases.

Read More

Apache Spark: A Delight for Developers

Categories: General Spark

Sure, Spark is fast, but it also gives developers a positive experience they won’t soon forget.

Apache Spark is well known today for its performance benefits over MapReduce, as well as its versatility. However, another important benefit – the elegance of the development experience – gets less mainstream attention.

In this post, you’ll learn just a few of the features in Spark that make development a pure pleasure.

Read More