Time Series for Spark Joins Cloudera Labs

Categories: Cloudera Labs Spark

Bringing Time Series for Spark into Cloudera Labs reflects its potential usefulness in a growing number of use cases.

Time is more important than ever to data. We’re not merely interested in how things are, but in how they change, where tendencies lead, and where trends are heading into unusual territory. Many classic machine-learning techniques do nothing in particular with time, and so implicitly assume that the past and the future look alike. We know that’s increasingly inaccurate. Embracing time means embracing more data, too. Measurements are not happening occasionally, but many times per second, in an urgent flow from the internet of things.

Thankfully, there’s a platform for all that. Apache Spark and Spark Streaming, as part of CDH, are no strangers to crunching lots of data, even in real time as it arrives. So far, however, Spark’s machine-learning library, MLlib, has not included much support for time-series data analysis. This is by design, since it’s easy to add third-party libraries for Spark to any job with a few command-line switches.

So the data science team here has been happy to incubate such a library, called simply Time Series for Spark (its release was announced previously on this blog). It provides implementations of essential time-series algorithms on top of Spark. It can be added easily to any Spark application, or used interactively with the Spark shell.

Time Series for Spark is now joining Cloudera Labs as another example of emerging, useful projects across the ecosystem. (As with all Cloudera Labs projects, Time Series for Spark is not formally supported and no future support is implied. But by all means, experiment away!) Version 0.3.0 was released with support for automatically picking appropriate ARIMA models for data, and an imminent 0.4.0 release will support up- and down-sampling as well as AR models with exogenous variables.
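For readers unfamiliar with up- and down-sampling, here is a minimal sketch of the idea in plain Python. It is only an illustration of the concept and is not the Time Series for Spark API: the function names `downsample` and `upsample` are hypothetical, and averaging and forward-filling are just one assumed choice among several possible strategies.

```python
from statistics import mean


def downsample(values, factor):
    """Reduce frequency by averaging each consecutive window of `factor` points.

    If the length is not a multiple of `factor`, the final window is shorter.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [mean(values[i:i + factor]) for i in range(0, len(values), factor)]


def upsample(values, factor):
    """Increase frequency by repeating each point `factor` times (forward fill)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [v for v in values for _ in range(factor)]


# Hypothetical readings taken every 30 minutes.
readings = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
hourly = downsample(readings, 2)      # one averaged value per hour
half_hourly = upsample(hourly, 2)     # back to 30-minute spacing, forward-filled
```

Note that down-sampling followed by up-sampling does not recover the original readings; the detail averaged away is lost.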

Clone it, fork it, try it, contribute! And then, give us some feedback via the Cloudera Labs discussion forum.

Sean Owen is Director of Data Science at Cloudera. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time recommender systems on Apache Hadoop. He is an Apache Spark committer and a co-author of O’Reilly Media’s Advanced Analytics with Spark.