Recommendation with Apache Mahout in CDH3

Intro

The amount of information we are exposed to on a daily basis is far outstripping our ability to consume it, leaving many of us overwhelmed by the amount of new content available. Ideally we’d like machines and algorithms to help find the more interesting—to individual preference—items so we can easily focus our attention on these items of relevance.

Have you ever been recommended a friend on Facebook? Or an item you might be interested in on Amazon? If so then you’ve benefitted from the value of recommendation systems. Recommendation systems apply knowledge discovery techniques to the problem of making recommendations that are personalized for each user. Recommendation systems are one way we can use algorithms to help us sort through the masses of information to find the “good stuff” in a very personalized way.

Due to the explosion of web traffic and users the scale of recommendation poses new challenges for recommendation systems.  These systems face the dual challenge of producing high quality recommendations and calculating these personalized recommendations for millions of users. In recent years, collaborative filtering (CF) has become a popular way to effectively meet these challenges.  CF techniques start by analyzing the user-item matrix to identify relationships between different users or items, and use that information to produce recommendations for each user.

Quick CF Recommendation Overview

From an algorithmic standpoint, the recommendation systems we’ll talk about today are considered in the k-nearest neighbor family of problems (another type would be a SVD-based recommender). We want to predict the estimated preference of a user towards an item they have never seen before. We also want to generate a ranked (by preference score) list of items the user might be most interested in. Two well-known styles of recommendation algorithms are item-based recommenders and user-based recommenders. Both types rely on the concept of a similarity function/metric (ex: Euclidean distance, log likelihood), whether it is for users or items.

With user-based recommendation algorithms we’re looking at the notion of similarity between users based on preferences/actions/ratings of those users so we can recommend the same things to similar users. We start out by finding out which users are more similar (as opposed to finding more similar items first).

With item-based CF we’re looking at how users interact with items (books, videos, news, users …) and from that information we look at which items are more similar. In both types of CF we start with deriving some idea of the user-item preferences and then either look at user-user or item-item similarity.  We get these interactions from explicit (e.g.  a user rating something by number of stars) and implicit (e.g. purchasing a book) ratings. Given the past actions of a user, the then attempt to predict the future in a “content agnostic” fashion that doesn’t use  the properties of the items themselves, just how user’s interact with the items. Typically we’re interested in item-based recommendation systems when the number of items is very low compared to the number of users. This scenario can provide a good performance gain versus a user-based setup. Another aspect to consider is that the items are going to change far less frequently than the users.

Mathematically speaking we’re creating a user-item-matrix from the user preference data and then predicting the missing entries by finding patterns from the user preference information we know about. In the most simple case, we’re computing the number of times each pair of items occurs together in a user’s list of preferences/ratings and given that grouping (for a given pair) across all users we then feed these “co-occurrences” to a similarity function like “log likelihood”.

Mahout’s implementation of this algorithm is also a great example of how an existing concept is rebuilt for MapReduce. A difference to note with the MapReduce version is that Mahout uses a “Co-Occurrence” matrix for the concept of “item similarity”, which provides some degree of similarity between items (the idea being that the more two items co-occur, the more similar they likely are). This means we are calculating the number of times a pair of items occurs together per user. With this job we’re able to calculate a lot of co-occurrences in parallel which highlights the power of MapReduce and the “out of the box” functionality offered with Mahout.

Once we know a bit about how similar our items are (here, how often they co-occur) we can then predict our user’s preference towards a small subset of other items the user has not seen before. We want to predict a rating for an item a user has not seen before based on the information gathered from other users . To do this we look at all items that are similar to an item we’d like to predict a rating on (for the user) and have been rated by this user. In the case of Mahout’s distributed recommender we multiply the user vector (as column vector) vs the co-occurrence matrix. This means we multiply the user column vector times each item row vector. The sum of each of those row time column multiplications creates a rating for each item relative to the user. We can then select the top N most highly rated items (minus the items the user has already rated) to recommend to the user.

Install Mahout with CDH3u3

Mahout has been integrated with CDH3 since CDH3u2. Let’s start out by getting a cluster going in either Pseudo-Distributed Mode or on a full cluster (If CDH3u2 is already running, please skip this part, otherwise go ahead and install CDH3u3). Click following link to get started:

CDH3 Quick Start Guide

Make sure $JAVA_HOME is set or Mahout will complain. Click the link to learn about Mahout Installation.

Now that we have Mahout installed, let’s take a look at how we interact with it from the command line.

Working with Mahout

Running Mahout is fairly simple and acts a lot like Hadoop from the command line. For instance, if we type “mahout” on the command line after the install we should get:

If your JAVA_HOME env variable is set correctly you should see the above output. Now that we have Mahout installed and working with CDH let’s move on to working with some data.

GroupLens Movie Data

The input data for this demo is based on 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users, which you can download from the www.grouplens.org site. The zip file contains four files:

  • movies.dat (movie ids with title and category)
  • ratings.dat (ratings of movies)
  • README
  • users.dat (user information)

Ratings File Description

The ratings file is most interesting to us since it’s the main input to our recommendation job. Each line has the format:

where

  • UsersIDs are integers
  • MovieIDs are integers
  • Ratings are 1 through 5 “stars” (integers)
  • Time stamp is seconds since the epoch
  • Each user has at least 20 ratings

This file isn’t exactly how Mahout prefers, but this is an easy fix. Mahout is looking for a CSV file with lines of the form:

userID, itemID, value

So let’s adjust our input file to match what we need to run our job. First download the file and unzip it locally from:

http://www.grouplens.org/system/files/ml-1m.zip

Next we’ll run the command:

This produces the output format we’ll use in the next section when we run our “Itembased Collaborative Filtering” job. While this is an extra step it gives us good practice preparing data for input into our recommendation jobs.

Running Mahout’s Recommendation System

The core workflow in mahout is based around the class

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob

and is a completely distributed item-based recommender. The input to this job is going to be the “ratings.csv” file we generated above (after we converted the ratings.dat file to the ratings.csv file). The format of this file looks like:

and the output will be another CSV file with the layout of:

which represents the userIDs with their recommended itemIDs along with the preference scores. This algorithm is also implemented in MapReduce and the following table has some of the options we can specify for the job:

Flag

Description

--input (-i)

Path to the job input directory in HDFS (-i shortcut flag is broken in 0.5, MAHOUT-842)

--output (-o)

Path to the output dir in HDFS

--usersFile File listing users to generate recommendations for

--tempDir

Intermediary output directory

To set things up we need to setup the list of users we want to produce a recommendation for.  Put the number “6040″ (a user ID in our source data) on a line by itself in a file and put that in HDFS. Use the hadoop command line tools to copy this file to HDFS:

With our user list in hdfs we can now run the Mahout  recommendation job with a command in the form of:

which will run for a while (a chain of 10 MapReduce jobs) and then write out the item recommendations into HDFS we can now take a look at.  If we tail the output from the RecommenderJob with the command:

it should look like:

where each line represents a userID with associated recommended itemIDs and their scores.

Conclusion

This concludes the introduction to Mahout’s implementation of Collaborative Filtering which is now included in CDH3u2. CF is but one of the many techniques included in the Mahout project and the reader should now be ready to not only further explore CF but tackle some of the other techniques as well. Special thanks to Sean Owen for review help on this article.

References

Filed under:

9 Responses
  • sankar / January 03, 2012 / 12:04 PM

    Useful information. Combination of Hadoop and Mahout in my viewpoint has huge application potential in the emerging space of real-time data integration from Industrial IT to the Cloud (e.g. smartgrids). The recommendations system that is presented could be applied to real-time industrial IT/SCADA scenarios that generated huge volumes of sensor and alarms data. For instance Bad or No fault alarm recommendations from an array of sensors across a wide timeline of events through map-reduce analysis and machine learning may be done more effectively at near real times for critical decision making of plant operations

  • Vivek Kondur / February 06, 2012 / 11:59 PM

    Doing some research & analysis on java based Recommendation Engines. This post gives a very good overview as to how Mahout has been built around Hadoop and algos used in Mahout.

    Cheers!

  • Niksa Jakovljevic / May 30, 2012 / 5:20 AM

    Nice article. I would like to know how much effort does take to make Apache Mahout work with Apache Hadoop in distributed mode across multiple machines. Do you maybe have any experience with that?

    Best Regards,

    Niksa

  • Simone / June 14, 2012 / 8:36 PM

    I enjoy the Mahout verbosity.. but it might be a bit too much because I don’t understand where the error is looking at it. Unfortunately, everythings seems to go fine, but I got an empty result.

  • Sri / April 21, 2013 / 5:33 PM

    Trying to run this example on CDH4 and looks like an additional argument ‘-s ‘ need to be supplied…When I ran it with COOCCURRENCE…my output part file was empty…any insights?

Leave a comment


8 − three =