Category Archives: Data Science

How-to: Train Models in R and Python using Apache Spark MLlib and H2O

Categories: Data Science How-to Spark

Creating and training machine-learning models is more complex on distributed systems, but there are lots of frameworks for abstracting that complexity.

Proven open source projects now offer more options than ever for distributed analytics, with Python and R becoming increasingly popular. In this post, you’ll learn the options for setting up a simple read-eval-print (REPL) environment with Python and R within the Cloudera QuickStart VM using APIs for two of the most popular cluster computing frameworks: Apache Spark (with MLlib) and H2O (from the company with the same name).

Read More

Spark-TS 0.2.0 Released

Categories: Data Science Spark

Spark-TS 0.2.0 includes a fleshed-out Java API, among other things.

Spark-TS is a library developed by Cloudera’s Data Science team that enables analysis of datasets comprising millions of time series, each with millions of measurements. Spark-TS runs atop Apache Spark, and exposes Scala, Java, and Python APIs. Check out this recent post for a closer look at the library and how to use it.

Spark-TS 0.2.0 was released earlier in January 2016.

Read More

Spark-TS: A New Library for Analyzing Time-Series Data with Apache Spark

Categories: Data Science Spark

Time-series analysis is becoming mainstream across multiple data-rich industries. The new Spark-TS library helps analysts and data scientists focus on business questions, not on building their own algorithms.

Have you ever wanted to build models over measurements coming in every second from sensors across the world? Dig into intra-day trading prices of millions of financial instruments? Compare hourly view statistics across every page on Wikipedia? To do any of these things,

Read More

Common Probability Distributions: The Data Scientist’s Crib Sheet

Categories: Data Science

Data scientists have hundreds of probability distributions from which to choose. Where to start?

Data science, whatever it may be, remains a big deal. “A data scientist is better at statistics than any software engineer,” you may overhear a pundit say at your local tech get-togethers and hackathons. The applied mathematicians have their revenge, because statistics hasn’t been this talked-about since the roaring 20s. They have their own legitimizing Venn diagram of which people don’t make fun.

Read More

What We Learned at Wrangle 2015 (Data Science is About People)

Categories: Community Data Science Events

The Wrangle conference was a huge hit. Look for it to return in 2016!

Wrangle, the conference for and by data science practitioners from startup to enterprise, made a noticeable splash in San Francisco last week. As the conference host and organizer, we (Cloudera) couldn’t be happier about its attendees’ happiness.

With presenter/panelist representation from the data science teams at Uber,

Read More