Category Archives: Data Science

How-to: Predict Telco Churn with Apache Spark MLlib

Categories: Data Science Spark Use Case

Spark MLLib is growing in popularity for machine-learning model development due to its elegance and usability. In this post, you’ll learn why.

Spark MLLib is a library for performing machine-learning and associated tasks on massive datasets. With MLlib, fitting a machine-learning model to a billion observations can take a couple lines of code and leverage hundreds of machines. MLlib greatly simplifies the model development process.

In this post,

Read More

How-to: Train Models in R and Python using Apache Spark MLlib and H2O

Categories: Data Science How-to Spark

Creating and training machine-learning models is more complex on distributed systems, but there are lots of frameworks for abstracting that complexity.

There are more options now than ever from proven open source projects for doing distributed analytics, with Python and R become increasingly popular. In this post, you’ll learn the options for setting up a simple read-eval-print (REPL) environment with Python and R within the Cloudera QuickStart VM using APIs for two of the most popular cluster computing frameworks: Apache Spark (with MLlib) and H2O (from the company with the same name).

Read More

Time Series for Spark: 0.2.0 Released

Categories: Data Science Spark

The 0.2.0 release of the spark-ts package includes includes a fleshed-out Java API, among other things.

The spark-ts library, which was initially developed by Cloudera’s Data Science team, enables analysis of datasets comprising millions of time series, each with millions of measurements. It runs atop Apache Spark, and exposes Scala, Java, and Python APIs. Check out this recent post for a closer look at the library and how to use it.

Read More

A New Library for Analyzing Time-Series Data with Apache Spark

Categories: Data Science Spark

Time-series analysis is becoming mainstream across multiple data-rich industries. The new spark-ts library helps analysts and data scientists focus on business questions, not on building their own algorithms.

Have you ever wanted to build models over measurements coming in every second from sensors across the world? Dig into intra-day trading prices of millions of financial instruments? Compare hourly view statistics across every page on Wikipedia? To do any of these things,

Read More

Common Probability Distributions: The Data Scientist’s Crib Sheet

Categories: Data Science

Data scientists have hundreds of probability distributions from which to choose. Where to start?

Data science, whatever it may be, remains a big deal.  “A data scientist is better at statistics than any software engineer,” you may overhear a pundit say, at your local tech get-togethers and hackathons. The applied mathematicians have their revenge, because statistics hasn’t been this talked-about since the roaring 20s. They have their own legitimizing Venn diagram of which people don’t make fun.

Read More