Cloudera Blog · Mahout Posts

Cloudera ML: New Open Source Libraries and Tools for Data Scientists

Last month, Apache Crunch became the fifth project (along with Sqoop, Flume, Bigtop, and MRUnit) to go from Cloudera’s github repository through the Apache Incubator and on to graduate as a top-level project within the Apache Software Foundation. As the founder of the project and a newly minted Apache VP, I wanted to take this opportunity to express my gratitude to the Crunch community, who have taught me that leadership in the Apache Way means service, humility, and investing more time in building a community than I spend writing code. Working with you all on our shared vision is the highlight of every work week.

Creating Analytical Applications with Crunch: Cloudera ML

The Crunch Java libraries operate at a lower level of abstraction than other tools for creating MapReduce pipelines, like Apache Pig, Apache Hive, or Cascading. Crunch does not make any assumptions about the data model in your pipeline, which makes it easy to create data pipelines over non-relational data sources such as time series, Avro records, and Mahout Vectors. In fact, I originally wrote Crunch while I was working on Seismic Hadoop, a command line tool for processing time series of seismic measurements on Hadoop.

When the data science team sat down with our training team to begin planning our next data science course, we quickly discovered that there weren’t any open-source tools in the Hadoop ecosystem that would allow students to perform the data preparation and model evaluation techniques that we wanted them to learn. For example, it wasn’t possible to quickly summarize a CSV file of numerical and categorical variables via a single MapReduce job, and then use that summary to convert the CSV file into the distributed matrix format that is used as input to many of Mahout’s algorithms. We were also concerned that there wasn’t a lot of guidance as to how to choose values for many of the parameters that Mahout’s algorithms require, and that this might discourage new data scientists from using these models effectively.

Apache Hadoop in 2013: The State of the Platform

For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.

In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.

Let’s look at the highlights of the 2012 developments around projects supported by Cloudera.

Apache Hadoop Releases

What’s New in CDH4.1 Mahout

Cloudera recently announced the general availability of CDH4.1, an update to our open-source, enterprise-ready distribution of Apache Hadoop and related projects. Among various components, Apache Mahout is a relatively recent addition to CDH (first added to CDH3u2 in 2011), but is already attracting increasing interest out in the field. 

Mahout started as a sub-project of Apache Lucene to provide machine-learning libraries in the area of clustering and classification. It later evolved into a top-level Apache project with much broader coverage of machine-learning techniques (clustering, classification, recommendation, frequent itemset mining etc.). 

In CDH4.1, Mahout is upgraded to upstream version 0.7. Several new changes are included in this release, and this post will briefly go over some of the interesting ones.

Outlier Removal Capability

Applying Parallel Prediction to Big Data

This guest post is provided by Dan McClary, Principal Product Manager for Big Data and Hadoop at Oracle.

One of the constants in discussions around Big Data is the desire for richer analytics and models. However, for those who don’t have a deep background in statistics or machine learning, it can be difficult to know not only just what techniques to apply, but on what data to apply them. Moreover, how can we leverage the power of Apache Hadoop to effectively operationalize the model-building process? In this post we’re going to take a look at a simple approach for applying well-known machine learning approaches to our big datasets. We’ll use Pig and Hadoop to quickly parallelize a standalone machine-learning program written in Jython.

Playing Weatherman

I’d like to predict the weather. Heck, we all would – there’s personal and business value in knowing the likelihood of sun, rain, or snow. Do I need an umbrella? Can I sell more umbrellas? Better yet, groups like the National Climatic Data Center offer public access to weather data stretching back to the 1930s. I’ve got a question I want to answer and some big data with which to do it. On first reaction, because I want to do machine learning on data stored in HDFS, I might be tempted to reach for a massively scalable machine learning library like Mahout.