Author Archives: Sean Owen

Time Series for Spark Joins Cloudera Labs

Categories: Cloudera Labs, Spark

Bringing Time Series for Spark into Cloudera Labs reflects its potential to be useful in more use cases.

Time is more important than ever to data. We’re not merely interested in how things are, but how they change, where tendencies lead, and where trends are heading into unusual territory. Many classic machine-learning techniques do nothing in particular with time, and so assume the past and future are all similar.

Common Probability Distributions: The Data Scientist’s Crib Sheet

Categories: Data Science

Data scientists have hundreds of probability distributions from which to choose. Where to start?

Data science, whatever it may be, remains a big deal. “A data scientist is better at statistics than any software engineer,” you may overhear a pundit say at your local tech get-togethers and hackathons. The applied mathematicians have their revenge, because statistics hasn’t been this talked-about since the Roaring ’20s. They have their own legitimizing Venn diagram of which people don’t make fun.

How-to: Translate from MapReduce to Apache Spark

Categories: How-to, MapReduce, Spark

The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API.

Venerable MapReduce has been Apache Hadoop’s workhorse computation paradigm since its inception. It is ideal for the kinds of work for which Hadoop was originally designed: large-scale log processing and batch-oriented ETL (extract-transform-load) operations.

As Hadoop’s usage has broadened,
