The amount of information we are exposed to on a daily basis is far outstripping our ability to consume it, leaving many of us overwhelmed by the amount of new content available. Ideally we’d like machines and algorithms to help find the more interesting—to individual preference—items so we can easily focus our attention on these items of relevance.
Have you ever been recommended a friend on Facebook? Or an item you might be interested in on Amazon? If so then you’ve benefitted from the value of recommendation systems. Recommendation systems apply knowledge discovery techniques to the problem of making recommendations that are personalized for each user. Recommendation systems are one way we can use algorithms to help us sort through the masses of information to find the “good stuff” in a very personalized way.
Due to the explosion of web traffic and users the scale of recommendation poses new challenges for recommendation systems. These systems face the dual challenge of producing high quality recommendations and calculating these personalized recommendations for millions of users. In recent years, collaborative filtering (CF) has become a popular way to effectively meet these challenges. CF techniques start by analyzing the user-item matrix to identify relationships between different users or items, and use that information to produce recommendations for each user.
Quick CF Recommendation Overview
From an algorithmic standpoint, the recommendation systems we’ll talk about today are considered in the k-nearest neighbor family of problems (another type would be a SVD-based recommender). We want to predict the estimated preference of a user towards an item they have never seen before. We also want to generate a ranked (by preference score) list of items the user might be most interested in. Two well-known styles of recommendation algorithms are item-based recommenders and user-based recommenders. Both types rely on the concept of a similarity function/metric (ex: Euclidean distance, log likelihood), whether it is for users or items.
With user-based recommendation algorithms we’re looking at the notion of similarity between users based on preferences/actions/ratings of those users so we can recommend the same things to similar users. We start out by finding out which users are more similar (as opposed to finding more similar items first).
With item-based CF we’re looking at how users interact with items (books, videos, news, users …) and from that information we look at which items are more similar. In both types of CF we start with deriving some idea of the user-item preferences and then either look at user-user or item-item similarity. We get these interactions from explicit (e.g. a user rating something by number of stars) and implicit (e.g. purchasing a book) ratings. Given the past actions of a user, the then attempt to predict the future in a “content agnostic” fashion that doesn’t use the properties of the items themselves, just how user’s interact with the items. Typically we’re interested in item-based recommendation systems when the number of items is very low compared to the number of users. This scenario can provide a good performance gain versus a user-based setup. Another aspect to consider is that the items are going to change far less frequently than the users.
Mathematically speaking we’re creating a user-item-matrix from the user preference data and then predicting the missing entries by finding patterns from the user preference information we know about. In the most simple case, we’re computing the number of times each pair of items occurs together in a user’s list of preferences/ratings and given that grouping (for a given pair) across all users we then feed these “co-occurrences” to a similarity function like “log likelihood”.
Mahout’s implementation of this algorithm is also a great example of how an existing concept is rebuilt for MapReduce. A difference to note with the MapReduce version is that Mahout uses a “Co-Occurrence” matrix for the concept of “item similarity”, which provides some degree of similarity between items (the idea being that the more two items co-occur, the more similar they likely are). This means we are calculating the number of times a pair of items occurs together per user. With this job we’re able to calculate a lot of co-occurrences in parallel which highlights the power of MapReduce and the “out of the box” functionality offered with Mahout.
Once we know a bit about how similar our items are (here, how often they co-occur) we can then predict our user’s preference towards a small subset of other items the user has not seen before. We want to predict a rating for an item a user has not seen before based on the information gathered from other users . To do this we look at all items that are similar to an item we’d like to predict a rating on (for the user) and have been rated by this user. In the case of Mahout’s distributed recommender we multiply the user vector (as column vector) vs the co-occurrence matrix. This means we multiply the user column vector times each item row vector. The sum of each of those row time column multiplications creates a rating for each item relative to the user. We can then select the top N most highly rated items (minus the items the user has already rated) to recommend to the user.
Install Mahout with CDH3u3
Mahout has been integrated with CDH3 since CDH3u2. Let’s start out by getting a cluster going in either Pseudo-Distributed Mode or on a full cluster (If CDH3u2 is already running, please skip this part, otherwise go ahead and install CDH3u3). Click following link to get started:
Make sure $JAVA_HOME is set or Mahout will complain. Click the link to learn about Mahout Installation.
Now that we have Mahout installed, let’s take a look at how we interact with it from the command line.
Working with Mahout
Running Mahout is fairly simple and acts a lot like Hadoop from the command line. For instance, if we type “mahout” on the command line after the install we should get:
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
An example program must be given as the first argument.
Valid program names are:
arff.vector: : Generate Vectors from an ARFF file or directory
canopy: : Canopy clustering
cat: : Print a file or resource as the logistic regression models would see it
cleansvd: : Cleanup and verification of SVD output
clusterdump: : Dump cluster output to text
dirichlet: : Dirichlet Clustering
eigencuts: : Eigencuts spectral clustering
evaluateFactorization: : compute RMSE of a rating matrix
factorization against probes in memory
evaluateFactorizationParallel: : compute RMSE of a rating matrix
factorization against probes
fkmeans: : Fuzzy K-means clustering
fpg: : Frequent Pattern Growth
itemsimilarity: : Compute the item-item-similarities for item-based
kmeans: : K-means clustering
lda: : Latent Dirchlet Allocation
ldatopics: : LDA Print Topics
lucene.vector: : Generate Vectors from a Lucene index
matrixmult: : Take the product of two matrices
meanshift: : Mean Shift clustering
parallelALS: : ALS-WR factorization of a rating matrix
predictFromFactorization: : predict preferences from a factorization
of a rating matrix
prepare20newsgroups: : Reformat 20 newsgroups data
recommenditembased: : Compute recommendations using item-based
rowid: : Map SequenceFile to
rowsimilarity: : Compute the pairwise similarities of the rows of a matrix
runlogistic: : Run a logistic regression model against CSV data
seq2sparse: : Sparse Vector generation from Text sequence files
seqdirectory: : Generate sequence files (of Text) from a directory
seqdumper: : Generic Sequence File dumper
seqwiki: : Wikipedia xml dump to sequence file
spectralkmeans: : Spectral k-means clustering
splitDataset: : split a rating dataset into training and probe parts
ssvd: : Stochastic SVD
svd: : Lanczos Singular Value Decomposition
testclassifier: : Test Bayes Classifier
trainclassifier: : Train Bayes Classifier
trainlogistic: : Train a logistic regression using stochastic gradient descent
transpose: : Take the transpose of a matrix
vectordump: : Dump vectors from a sequence file to text
wikipediaDataSetCreator: : Splits data set of wikipedia wrt feature
wikipediaXMLSplitter: : Reads wikipedia data and creates ch
If your JAVA_HOME env variable is set correctly you should see the above output. Now that we have Mahout installed and working with CDH let’s move on to working with some data.
GroupLens Movie Data
The input data for this demo is based on 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users, which you can download from the www.grouplens.org site. The zip file contains four files:
- movies.dat (movie ids with title and category)
- ratings.dat (ratings of movies)
- users.dat (user information)
Ratings File Description
The ratings file is most interesting to us since it’s the main input to our recommendation job. Each line has the format:
- UsersIDs are integers
- MovieIDs are integers
- Ratings are 1 through 5 “stars” (integers)
- Time stamp is seconds since the epoch
- Each user has at least 20 ratings
This file isn’t exactly how Mahout prefers, but this is an easy fix. Mahout is looking for a CSV file with lines of the form:
userID, itemID, value
So let’s adjust our input file to match what we need to run our job. First download the file and unzip it locally from:
Next we’ll run the command:
tr -s ':' ',' < ratings.dat | cut -f1-3 -d, > ratings.csv
This produces the output format we’ll use in the next section when we run our “Itembased Collaborative Filtering” job. While this is an extra step it gives us good practice preparing data for input into our recommendation jobs.
Running Mahout’s Recommendation System
The core workflow in mahout is based around the class
and is a completely distributed item-based recommender. The input to this job is going to be the “ratings.csv” file we generated above (after we converted the ratings.dat file to the ratings.csv file). The format of this file looks like:
userID, itemID, value
and the output will be another CSV file with the layout of:
userID [ itemID, score, ... ]
which represents the userIDs with their recommended itemIDs along with the preference scores. This algorithm is also implemented in MapReduce and the following table has some of the options we can specify for the job:
Path to the job input directory in HDFS (-i shortcut flag is broken in 0.5, MAHOUT-842)
Path to the output dir in HDFS
|--usersFile||File listing users to generate recommendations for|
Intermediary output directory
To set things up we need to setup the list of users we want to produce a recommendation for. Put the number “6040” (a user ID in our source data) on a line by itself in a file and put that in HDFS. Use the hadoop command line tools to copy this file to HDFS:
hadoop fs -put [my_local_file] [user_file_location_in_hdfs]
With our user list in hdfs we can now run the Mahout recommendation job with a command in the form of:
mahout recommenditembased --input [input-hdfs-path] --output [output-hdfs-path] --tempDir [tmp-hdfs-path] --usersFile [user_file_location_in_hdfs]
which will run for a while (a chain of 10 MapReduce jobs) and then write out the item recommendations into HDFS we can now take a look at. If we tail the output from the RecommenderJob with the command:
hadoop fs -cat [output-hdfs-path]/part-r-00000
it should look like:
where each line represents a userID with associated recommended itemIDs and their scores.
This concludes the introduction to Mahout’s implementation of Collaborative Filtering which is now included in CDH3u2. CF is but one of the many techniques included in the Mahout project and the reader should now be ready to not only further explore CF but tackle some of the other techniques as well. Special thanks to Sean Owen for review help on this article.
- S. Owen, R. Anil, T. Dunning, E. Friedman: Mahout in Action
- Sarwar et al.: Item-Based Collaborative Filtering Recommendation Algorithms
- Apache Mahout Wiki: