Editor’s note (12/19/2013): Cloudera ML has been merged into the Oryx project. The information below is still valid though.
Last month, Apache Crunch became the fifth project (along with Sqoop, Flume, Bigtop, and MRUnit) to go from Cloudera’s github repository through the Apache Incubator and on to graduate as a top-level project within the Apache Software Foundation. As the founder of the project and a newly minted Apache VP, I wanted to take this opportunity to express my gratitude to the Crunch community, who have taught me that leadership in the Apache Way means service, humility, and investing more time in building a community than I spend writing code. Working with you all on our shared vision is the highlight of every work week.
Creating Analytical Applications with Crunch: Cloudera ML
The Crunch Java libraries operate at a lower level of abstraction than other tools for creating MapReduce pipelines, like Apache Pig, Apache Hive, or Cascading. Crunch does not make any assumptions about the data model in your pipeline, which makes it easy to create data pipelines over non-relational data sources such as time series, Avro records, and Mahout Vectors. In fact, I originally wrote Crunch while I was working on Seismic Hadoop, a command line tool for processing time series of seismic measurements on Hadoop.
When the data science team sat down with our training team to begin planning our next data science course, we quickly discovered that there weren’t any open-source tools in the Hadoop ecosystem that would allow students to perform the data preparation and model evaluation techniques that we wanted them to learn. For example, it wasn’t possible to quickly summarize a CSV file of numerical and categorical variables via a single MapReduce job, and then use that summary to convert the CSV file into the distributed matrix format that is used as input to many of Mahout’s algorithms. We were also concerned that there wasn’t a lot of guidance as to how to choose values for many of the parameters that Mahout’s algorithms require, and that this might discourage new data scientists from using these models effectively.
Today, I’m pleased to introduce Cloudera ML, an Apache licensed collection of Java libraries and command line tools to aid data scientists in performing common data preparation and model evaluation tasks. Cloudera ML is intended to be an educational resource and reference implementation for new data scientists that want to understand the most effective techniques for building robust and scalable machine learning models on top of Hadoop.
Clustering for Fun
The first algorithm that we’re focusing on in Cloudera ML is k-means clustering, the problem of creating K groups from a collection of N points such that each point is assigned to the group whose center is closest to that point. It is one of the most widely-used analytical algorithms in industry because of its simplicity, versatility, and performance. I have also found it to be one of the most misunderstood and misused techniques in the machine learning toolkit, especially by new data science teams.
Clustering can be appealing because it belongs to the class of unsupervised learning techniques: you do not need to spend any time or money to acquire the labeled data that is necessary to perform supervised learning tasks, such as building a predictive model. But the ease of getting started with clustering can be a double-edged sword, especially if it leads new data science teams to work on problems that seem approachable instead of problems that are important to the business.
Let’s consider the following scenario: the data science team at Company X is looking around for a proof-of-concept project, and decides to use k-means clustering in order to develop a new customer segmentation model for the marketing department. The team gathers a large quantity of data, and then spends a month cleansing it, vectorizing it, normalizing it, and experimenting with different choices of distance metrics and values of K. At the end of the month, they present their best clustering to the marketing department, who proceed to respond in one of two ways:
- Response #1: “Yep, that’s exactly how we think about our customers. Thank you for telling us something we already know.”
- Response #2: “No, that doesn’t really match up with how we think about our customers. You must have done something wrong– go back and try it again.” (The data science team that receives this response spends several months iterating on distance metrics and values of K until they manage to converge to Response #1.)
Don’t let this situation happen to you: always start a new project by taking the time to understand and document how the performance of your modeling and analysis will be evaluated, and avoid situations where personal opinions matter more than business metrics. A data scientist’s time is a terrible thing to waste.
Clustering for Profit
One of my favorite applications of k-means clustering is for finding outliers in data sets, which can be useful for detecting fraudulent transactions, network intrusions, or simply as part of the process of data cleansing. When working on these types of problems, my goal is to find a “good” clustering of the data, and then examine the points that didn’t fit into any of the larger clusters that I found. During this process, I typically need to resolve three major issues, especially when working with large data sets:
- For a given set of input features, how do I quickly create a good partition of the data into K clusters?
- What value of K should I use?
- Once I have some good candidate clusterings, how do I examine the outliers?
Historically, Hadoop has not been particularly great for helping me solve these problems. Although there are many implementations of Lloyd’s algorithm (which is what most people are referring to when they talk about k-means clustering) as an iterative series of MapReduce jobs, they all suffer from the same two limitations:
- Lloyd’s algorithm requires a large (and unknown) number of passes over the data set in order to converge. In fact, it has been shown that the theoretical worst-case running time of the algorithm is super-polynomial in the size of the input [PDF].
- If we choose a random set of points for our initial clusters, it is possible for Lloyd’s algorithm to become stuck in a local optimum that is arbitrarily bad relative to the optimal clustering.
As you can imagine, there isn’t a whole lot of appeal in running an algorithm that takes an arbitrarily long time to return an arbitrarily bad answer.
But don’t despair: a number of academic papers have introduced modifications to the basic k-means algorithm that provide good bounds on both the number of passes over the data that we need to perform as well as on the quality of the solution that we ultimately find. At the 2011 NIPS conference, Shindler et al. introduced “Fast and Accurate k-means for Large Data Sets [PDF],” an an approach to k-means that requires only a single pass over the data, provides reasonable guarantees on the quality of the clusters it finds, and is the basis for MAHOUT-1154, an eagerly awaited component of Mahout’s 0.8 release. Last summer, Bahmani et al. introduced “Scalable k-means++ [PDF],” which provides somewhat better guarantees on the quality of the clusters in exchange for performing a few extra passes over the data set. Both of these algorithms appear to find good clusters in practice and can both be used to create small sketches of a large data set that reflect its overall structure, but are small enough to be analyzed interactively on a single machine.
Data, Big and Small
If you were paying at least somewhat close attention, you may have noticed that the algorithms I’m describing above are essentially clever sampling techniques. With all of the hype surrounding big data, sampling has gotten a bit of a bad rap, which is unfortunate, since most of the work of a data scientist involves finding just the right way to turn a large data set into a small one. Of course, it usually takes a few hundred tries to find that right way, and Hadoop is a powerful tool for exploring the space of possible features and how they should be weighted in order to achieve our objectives.
Whenever possible, we want to minimize the amount of parameter tuning required during model fitting. At the very least, we should try to provide feedback on the quality of the model that is created by different parameter settings. For k-means, we want to help data scientists choose a good value of K, the number of clusters to create. In Cloudera ML, we integrate the process of selecting a value of K into the data sampling and cluster fitting process by allowing data scientists to evaluate multiple values of K during a single run of the tool and reporting statistics about the stability of the clusters, such as the prediction strength.
Finally, we want to investigate the anomalous events in our clustering- those points that don’t fit well into any of the larger clusters. Cloudera ML includes a tool for using the clusters that were identified by the scalable k-means algorithm to compute an assignment of every point in our large data set to a particular cluster center, including the distance from that point to its assigned center. This information is created via a MapReduce job that outputs a CSV file that can be analyzed interactively using Cloudera Impala or your preferred analytical application for processing data stored in Hadoop.
Cloudera ML is under active development, and we are planning to add support for pivot tables, Hive integration via HCatalog, and tools for building ensemble classifers over the next few weeks. We’re eager to get feedback on bug fixes and things that you would like to see in the tool, either by opening an issue or a pull request on our github repository. We’re also having a conversation about training a new generation of data scientists next Tuesday, March 26th, at 2pm ET/11am PT, and I hope that you will be able to join us.
Josh Wills is Cloudera’s senior director of data science.