Cloudera Developer Blog · Mahout Posts
What a short, strange trip it’s been. Just a year ago, I founded Myrrix in London’s Silicon Roundabout to commercialize large-scale machine learning based on Apache Hadoop and Apache Mahout. It’s been a busy scramble, building software and proudly watching early customers get real, big data-sized machine learning into production.
And now another beginning: Myrrix has a new home in Cloudera. I’m excited to join as Director of Data Science in London, alongside Josh Wills. Some of the Myrrix technology will be coming along to benefit CDH and its customers too. There was no question that Cloudera is the right place to continue building out the vision that started as Myrrix, because Josh, Jeff Hammerbacher and the rest of the data science team here have the same vision. It’s an unusually perfect match. Cloudera has made an increasingly complex big-data ecosystem increasingly accessible (Hadoop, real-time queries, search), and we’re going to make “Big Learning” on Hadoop easy and accessible too.
What is Old is New Again
Data-savvy companies of all sizes can now accomplish many viable machine learning projects.
Why the fuss now about machine learning, a decades-old field? I started working on recommender systems relatively late, in 2005, as the open-source project Taste. In 2008, this was merged into the open source machine learning project Apache Mahout, and rebuilt on top of a nascent Hadoop project. Yet as a committer and part of the Mahout PMC, I have watched interest in machine learning suddenly reignite, and skyrocket, as interest in this new Hadoop thing did.
Last month, Apache Crunch became the fifth project (along with Sqoop, Flume, Bigtop, and MRUnit) to go from Cloudera’s github repository through the Apache Incubator and on to graduate as a top-level project within the Apache Software Foundation. As the founder of the project and a newly minted Apache VP, I wanted to take this opportunity to express my gratitude to the Crunch community, who have taught me that leadership in the Apache Way means service, humility, and investing more time in building a community than I spend writing code. Working with you all on our shared vision is the highlight of every work week.
Creating Analytical Applications with Crunch: Cloudera ML
The Crunch Java libraries operate at a lower level of abstraction than other tools for creating MapReduce pipelines, like Apache Pig, Apache Hive, or Cascading. Crunch does not make any assumptions about the data model in your pipeline, which makes it easy to create data pipelines over non-relational data sources such as time series, Avro records, and Mahout Vectors. In fact, I originally wrote Crunch while I was working on Seismic Hadoop, a command line tool for processing time series of seismic measurements on Hadoop.
When the data science team sat down with our training team to begin planning our next data science course, we quickly discovered that there weren’t any open-source tools in the Hadoop ecosystem that would allow students to perform the data preparation and model evaluation techniques that we wanted them to learn. For example, it wasn’t possible to quickly summarize a CSV file of numerical and categorical variables via a single MapReduce job, and then use that summary to convert the CSV file into the distributed matrix format that is used as input to many of Mahout’s algorithms. We were also concerned that there wasn’t a lot of guidance as to how to choose values for many of the parameters that Mahout’s algorithms require, and that this might discourage new data scientists from using these models effectively.
For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.
In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.
Let’s look at the highlights of the 2012 developments around projects supported by Cloudera.
Apache Hadoop Releases
Cloudera recently announced the general availability of CDH4.1, an update to our open-source, enterprise-ready distribution of Apache Hadoop and related projects. Among various components, Apache Mahout is a relatively recent addition to CDH (first added to CDH3u2 in 2011), but is already attracting increasing interest out in the field.
Mahout started as a sub-project of Apache Lucene to provide machine-learning libraries in the area of clustering and classification. It later evolved into a top-level Apache project with much broader coverage of machine-learning techniques (clustering, classification, recommendation, frequent itemset mining etc.).
In CDH4.1, Mahout is upgraded to upstream version 0.7. Several new changes are included in this release, and this post will briefly go over some of the interesting ones.
Outlier Removal Capability
This guest post is provided by Dan McClary, Principal Product Manager for Big Data and Hadoop at Oracle.
One of the constants in discussions around Big Data is the desire for richer analytics and models. However, for those who don’t have a deep background in statistics or machine learning, it can be difficult to know not only just what techniques to apply, but on what data to apply them. Moreover, how can we leverage the power of Apache Hadoop to effectively operationalize the model-building process? In this post we’re going to take a look at a simple approach for applying well-known machine learning approaches to our big datasets. We’ll use Pig and Hadoop to quickly parallelize a standalone machine-learning program written in Jython.
I’d like to predict the weather. Heck, we all would – there’s personal and business value in knowing the likelihood of sun, rain, or snow. Do I need an umbrella? Can I sell more umbrellas? Better yet, groups like the National Climatic Data Center offer public access to weather data stretching back to the 1930s. I’ve got a question I want to answer and some big data with which to do it. On first reaction, because I want to do machine learning on data stored in HDFS, I might be tempted to reach for a massively scalable machine learning library like Mahout.