Cloudera recently announced the general availability of CDH4.1, an update to our open-source, enterprise-ready distribution of Apache Hadoop and related projects. Among various components, Apache Mahout is a relatively recent addition to CDH (first added to CDH3u2 in 2011), but is already attracting increasing interest out in the field.
Mahout started as a sub-project of Apache Lucene to provide machine-learning libraries in the area of clustering and classification. It later evolved into a top-level Apache project with much broader coverage of machine-learning techniques (clustering, classification, recommendation, frequent itemset mining etc.).
In CDH4.1, Mahout is upgraded to upstream version 0.7. This post briefly goes over some of the more interesting changes in that release.
Outlier Removal Capability
A new cluster classification design (MAHOUT-930) brings consistency and extensibility across a number of clustering implementations. Classifying data into clusters is now factored out as a separate step that runs after the clusters have been built from the data. This separation provides a foundation for plugging in the outlier removal capability (MAHOUT-929), which prunes data points that differ greatly from the others in the same cluster. To use it, provide an outlier threshold between 0.0 and 1.0. With a threshold specified, a data point is not classified into a cluster if its probability distribution function value for that cluster is less than the threshold.
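The threshold rule described above can be sketched in a few lines of plain Python. This is only an illustration of the rule, not Mahout's API: the function name clusters_for_point and the list-of-pdf-values input are hypothetical, and the sketch assumes the pdf values for a point have already been computed, one per cluster.

```python
from typing import List, Sequence


def clusters_for_point(pdf_per_cluster: Sequence[float], threshold: float) -> List[int]:
    """Return the indices of clusters a point may be classified into.

    Hypothetical helper: pdf_per_cluster[i] is assumed to hold the point's
    pdf value for cluster i. A cluster is skipped when the point's pdf
    value for it is less than the threshold.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("outlier threshold must be between 0.0 and 1.0")
    return [i for i, pdf in enumerate(pdf_per_cluster) if pdf >= threshold]


if __name__ == "__main__":
    # The point clears the 0.5 threshold only for cluster 1.
    print(clusters_for_point([0.1, 0.7, 0.2], 0.5))
    # No cluster clears the threshold, so the point is treated as an outlier.
    print(clusters_for_point([0.3, 0.4, 0.3], 0.5))
```

An empty result means the point is pruned as an outlier. A threshold of 0.0 keeps every point, since no pdf value falls below it.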
New Clustering Implementations
The implementations of the K-Means (MAHOUT-981), Fuzzy K-Means (MAHOUT-984), Canopy (MAHOUT-982), and Dirichlet (MAHOUT-983) algorithms are now based on the new cluster classification interfaces (ClusterIterator, etc.). As a result of these enhancements, the clustering implementations also gain the new outlier removal capability. From a user perspective, existing drivers, such as DirichletDriver, can still be used as the entry points for clustering data.
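To make the "build clusters first, classify afterwards" split concrete, here is a minimal sketch in plain Python using one-dimensional k-means. It is not Mahout code: build_clusters and classify are hypothetical names chosen for illustration, and the real implementations work on vectors and run as Hadoop jobs.

```python
from typing import List


def nearest_center(point: float, centers: List[float]) -> int:
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: abs(point - centers[i]))


def build_clusters(points: List[float], centers: List[float], iterations: int = 10) -> List[float]:
    """Step 1: refine the cluster centers iteratively (plain 1-D k-means)."""
    centers = list(centers)
    for _ in range(iterations):
        groups: List[List[float]] = [[] for _ in centers]
        for p in points:
            groups[nearest_center(p, centers)].append(p)
        # Keep the previous center when a cluster received no points.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def classify(points: List[float], centers: List[float]) -> List[int]:
    """Step 2: assign each point to a cluster, using the final centers only."""
    return [nearest_center(p, centers) for p in points]


if __name__ == "__main__":
    training = [1.0, 2.0, 3.0, 9.0, 10.0, 11.0]
    final_centers = build_clusters(training, centers=[0.0, 5.0])
    print(final_centers)
    print(classify([2.5, 9.0, 6.1], final_centers))
```

Because classification only needs the finished clusters, a filter such as an outlier threshold can be applied in the second step without touching the code that builds the clusters.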
Bayes Classifier Cleanup
Prior to this release, there were two different implementations of the Naive Bayes classifier. The commands mahout trainclassifier and mahout testclassifier used an implementation that took text-based data, but it occasionally ran into out-of-memory problems. A new implementation (MAHOUT-287) taking vector-based data was subsequently introduced with the commands mahout trainnb and mahout testnb. It proved to work better than the old one over the past few releases, so the old Naive Bayes implementation has been removed in this release (MAHOUT-1010).
Known Issue under Oozie
There is a known issue in which null pointer exceptions occur when Mahout is invoked as an Oozie shell action and the distributed (MapReduce) version of an algorithm is used. The issue stems from the work (MAHOUT-848) to propagate the Oozie action configuration down to Mahout job execution, which lacked a null check on the configuration. The problem was discovered later, but the fix (MAHOUT-1033) missed the 0.7 release. One workaround is to run Mahout as an Oozie Java action instead of a shell action, which bypasses the configuration propagation and thus the issue itself.
With CDH4.1, Mahout provides machine-learning libraries in four main areas. Recommendation finds items that users might like. Clustering groups together items that are related. Classification assigns unlabeled items to categories. Frequent itemset mining finds items that appear together.
Mahout in CDH4.1 is upgraded to upstream version 0.7 with several new features and improvements. You can refer to the release notes for details about these changes. You are also encouraged to check out the CDH4.1 documentation and the Mahout website for more information.