Model Based Collaborative Filtering — SVD
Back in 2006, Netflix announced the Netflix Prize, a machine learning competition for predicting movie ratings. They offered a one million dollar prize to whoever improved the accuracy of their existing recommendation system by 10%. A year into the competition, the Korbell team won the first Progress Prize with an 8.43% improvement. They used an ensemble model composed of Matrix Factorization (SVD) and Restricted Boltzmann Machines (RBM) algorithms. In this article, we will focus on singular value decomposition or SVD for short.
SVD for recommendation systems
Memory-based collaborative filtering doesn’t perform well at scale. When you have millions of users and/or items, computing pairwise correlations is expensive and slow. We already saw that we could avoid processing all the data every time by establishing neighbourhoods on the basis of similarity. Is it possible to reduce the size of the ratings matrix some other way? Is it really critical that we store every single item? For example, if a user liked a movie and its sequels (e.g. The Terminator, The Matrix). Do we really need to have a column in our ratings matrix for each movie? What if the set of movies could be represented using only k latent features where each feature corresponds to a category (e.g. genre) with common characteristics? We could use those for the purpose of providing recommendations. In a similar vein, the users could be represented using k dimensions. You can think of each feature as a demographic group (e.g. young men, old widows) with similar preferences. This is where SVD comes into play. Suppose we had a ratings matrix with m users and n items. We can factor the matrix into two other matrices P and Q, and a diagonal matrix Sigma.
The key piece of information here is that each of the absolute values in the diagonal matrix Sigma represent how important the associated dimension is in terms of expressing the original ratings matrix. If we sort those values in descending order and exclude all but the top **k (**user defined) features, we can obtain the best approximation of the ratings matrix by multiplying the truncated matrices back together. This is the principle behind SVD for dimensionality reduction.
It’s important to note that like K-means clusters, the k features are not inherently interpretable. In other words, it’s up to the user to determine what each of the k features represent (e.g. movie genres, demographics).
Assuming that r is the number of original features, k is the number latent features, and f is the index of a specific feature, the score (rating) for user u and item i can be calculated as follows:
More often than not, you will see the formula written as follows:
where the b term is the baseline (e.g. user mean, item mean).
SVD suffers from the following cons:
- Slow to compute
- Cannot handle missing data (i.e. the data must be imputed using the mean)
It’s for these reasons that we began using machine learning (specifically gradient descent and alternating least squares) to search for the best rank-k approximation of the ratings matrix. During training, we only consider the elements with a rating and ignore the ones that don’t. Thus, relieving the need to impute the missing values.
We begin by initializing the item & user feature vectors using random non-zero values.
Then for each feature (1 through k) we perform the following:
For each user/item combination in the dataset, we predict the rating, using the following formula:
Note: we no longer include Sigma since we’re no longer relying on algebra (using machine learning instead) to compute the matrices
Then, we calculate the error (i.e. RMSE) using the score (rating) and update the values in the user matrix P and item matrix Q.
where lambda is the learning rate used in gradient descent and gamma is the regularization term.
We stop after the maximum number of iterations is reached or we have not detected a significant change in the error for a user specified number of iterations.
We begin by importing modules from the surprise library. Surprise is the de facto package for creating recommendation systems in Python. It contains robust implementations of different algorithms used in recommendation systems such as SVD++ and Non-negative Matrix Factorization.
from surprise import SVD from surprise import Dataset from surprise import accuracy from surprise.model_selection import train_test_split from collections import defaultdict
We will train our model using the MovieLens dataset with one hundred thousand records. Fortunately, the surprise library provides an API for downloading the dataset and storing its contents in a variable.
# three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings data = Dataset.load_builtin('ml-100k')
We set a portion of the data aside for testing.
trainset, testset = train_test_split(data, test_size=.25)
We instantiate an instance of the SVD class.
algo = SVD()
We train the model.
We predict the ratings for the user/item pairs in the test set.
predictions = algo.test(testset)
We calculate the Root Mean Square Error (RMSE).
As we can see, the RMSE is lower than the baseline RMSE of 0.944.
Next, we define a function to return the top n recommendations.
def get_top_n(predictions, n=10): top_n = defaultdict(list) for uid, iid, true_r, est, _ in predictions: top_n[uid].append((iid, est))
for uid, user_ratings in top_n.items(): user_ratings.sort(key=lambda x: x, reverse=True) top_n[uid] = user_ratings[:n]
We call it, providing the predictions we computed previously and print the results.
top_n = get_top_n(predictions, n=10)
for uid, user_ratings in top_n.items(): print(uid, [iid for (iid, _) in user_ratings])
As we can see, the model recommended that the user associated with id = 912 should watch the movie associated with the id = 427.
912 ['427', '479', '474', '194', '152', '653', '268', '610', '616'] 416 ['357', '64', '172', '96', '480', '69', '199', '97', '197', '83'] 346 ['12', '22', '181', '127', '98', '134', '96', '265', '132', '83']