The alternating least squares algorithm in recommenderlab

 Many sites nowadays use recommendation engines in order to offer additional value to the client and increase revenue. This technology is used for a wide variety of subjects, ranging from movies, music, books, news and restaurants to candidates on online dating sites.

One approach to recommend an item is by content-based filtering: when a customer shows interest in an item with certain properties, other items with similar properties are recommended. This blog however discusses a collaborative filtering approach: recommendations are made by comparing the list of items you bought with the buying history of other people.
Michael Hahsler developed the excellent package recommenderlab to test collaborative filtering algorithms in R. You can find an introduction to recommenderlab in the accompanying vignette or on r-bloggers.com.
The alternating least squares (ALS) algorithm is a well-known algorithm for collaborative filtering. It nowadays is available as the standard algorithm for recommendations in Apache SPARK’s MLlib. An alternative R implementation of the algorithm can be found here.

The implementation we are presenting here is joined work with Burhan Bakar.

The alternating least squares algorithm with weighted-\(\lambda\)-regularization

Here, we explain the rationale behind the algorithm, as explained by Zhou et al. We start with ratings given by \(n_u\) users and \(n_m\) items (the m originally stood for movies). The ratings are saved in a \(n_u\) by \(n_m\) sized rating matrix \(R\), with matrix elements \(r_{u,i}\). This rating matrix \(R\) will have many missing values. The challenge for the recommender engine is thus to find sensible ratings for these unrated user-movie pairs. In ALS, this is done by approximating \(R\) as \(U \times M\), where user matrix \(U\) is a \(n_u\) by \(n_{factor}\) matrix and item matrix M a \(n_{factor}\) by \(n_m\) matrix. Intuitively, one could understand that matrix \(M\) describes how much each movie scores according to some latent factors (e.g. comedy, romance, action etc.). And matrix \(U\) describes how much that viewer likes these kind of latent factors. In practice, the interpretation of the different latent factors can prove hard however.
In order to ensure \(U \times M\) approximates \(R\), the following cost function is minimized:
\[f(U,M) = \sum_{(i,j)}w_{i,j} (r_{i,j}-u_{i} \times m_{j})^2 + \lambda \left(\sum_{i} n_{u_{i}} ||u_{i}||^2 + \sum_{j} n_{m_{j}} ||m_{j}||^2 \right)\] The first term on the right minimizes \(R - U \times M\). For explicit ratings, weight matrix \(W\) has elements \(w_{i,j}\) which are 1 whenever a rating is known and 0 whenever a rating is unknown. Thereby, \(W\) ensures that only known ratings are taken into account during cost function minimization. The second term is the weighted-\(\lambda\)-regularization. For sizable \(\lambda\) values, this term ensures the matrix elements of \(U\) and \(M\) are not too large, and in this way overfitting of the data is opposed. Here, \(n_{u_{i}}\) is the number of ratings available for user \(u_i\) and \(n_{m_{j}}\) is the number of ratings available for item \(m_j\). These weighting parameters \(n_{u_{i}}\) and \(n_{m_{j}}\) ensure that a \(\lambda\)-value optimized on a small dataset will still work on a larger dataset. In this blog, we will not discuss the actual calculation of the cost function minimization in detail. Interested readers can read the article and our source code. First, \(M\) is initialized by assigning the average rating of the movies as the first row and small random values for the remaining rows. Next, a loop is repeated \(n_{iteration}\) times. In the first step of the loop, \(M\) is kept fixed and \(f(U,M)\) is minimized by solving for \(U\). And in the second step of the loop, \(U\) is kept fixed and \(f(U,M)\) is minimized by solving for \(M\).

ALS implemented in recommenderlab

Prof. Hahsler was so kind to add our basic implementation of ALS to his recommenderlab package.

# Ensure that a version of recommenderlab with ALS is installed
# If you have an old version of recommenderlab, first uninstall the package and restart R. 
# Next, install the latest version of recommenderlab directly from github, using devtools.
if("recommenderlab" %in% rownames(installed.packages()) == FALSE){ install_github("mhahsler/recommenderlab") }

If the installation went fine, three ALS related recommender should be added to the recommenderRegistry.

## [1] "ALS_realRatingMatrix" "ALS_implicit_realRatingMatrix"
## [3] "ALS_implicit_binaryRatingMatrix" "AR_binaryRatingMatrix"
## [5] "IBCF_binaryRatingMatrix" "IBCF_realRatingMatrix"
## [7] "POPULAR_binaryRatingMatrix" "POPULAR_realRatingMatrix"
## [9] "RANDOM_realRatingMatrix" "RANDOM_binaryRatingMatrix"
## [11] "RERECOMMEND_realRatingMatrix" "SVD_realRatingMatrix"
## [13] "SVDF_realRatingMatrix" "UBCF_binaryRatingMatrix"
## [15] "UBCF_realRatingMatrix"

In the code below, a recommender model called r is first defined. The parameters passed to the model below, are the same parameters that are set as default. When “verbose”" is set to TRUE, the cost function is printed out during model calculation. The predict() function calculates ratings (type = “ratings” or type = “ratingMatrix”) or a list with top N predictions (type = “topNList”) for the users in newdata. When you run this code, you’ll find that this predict() function takes significantly longer than the Recommender() function. This is caused by a drawback of the ALS algorithm: you actually need data of all the users of interest, before model calculations can be performed. Hence, no calculations happen in Recommender(), since the calculations are postponed to predict().

r <- Recommender(data = MovieLense[1:500,], method = "ALS", 
                 parameter = list(normalize=NULL, lambda=0.1, n_factors=10, 
n_iterations=10, seed = NULL, verbose = FALSE)) recom_ratings <- predict(r, newdata = MovieLense[501:502,], type = "ratings") as(recom_ratings, "matrix") recom_topNList <- predict(r, newdata = MovieLense[501:502,], type = "topNList", n = 5) as(recom_topNList, "list")

The recommenderlab package makes it convenient to perform train/test splits on the data. In the code below, 90% of the users are marked as training users. For the remaining 10%, the test users, 5 ratings of items are marked as “unknown” and the rest of the ratings as “known” (given = -5). The idea is of course that for these users, based on their “known”" ratings, predictions can be made which can be tested through the “unknown” data.
The function accuracy_table() first defines a model and than calculates ratings for the test users, based on all the train data and the “known” data. Recommenderlab than lets you score your model with calcPredictionAccuracy(). For “ratings” data, calcPredictionAccuracy() returns root mean square error, mean square error and mean absolute value. The function accuracy_table() returns the scores as a data.frame. Using this function, we benchmark the ALS algorithm against other algorithms.

scheme <- evaluationScheme(MovieLense, method="split", train=0.9, given=-5, goodRating=4)

accuracy_table <- function(scheme, algorithm, parameter){
  r <- Recommender(getData(scheme, "train"), algorithm, parameter = parameter)
  p <- predict(r, getData(scheme, "known"), type="ratings")                      
  acc_list <- calcPredictionAccuracy(p, getData(scheme, "unknown"))
  total_list <- c(algorithm =algorithm, acc_list)
  total_list <- total_list[sapply(total_list, function(x) !is.null(x))]

table_random <- accuracy_table(scheme, algorithm = "RANDOM", parameter = NULL)
table_ubcf <- accuracy_table(scheme, algorithm = "UBCF", parameter = list(nn=50))
table_ibcf <- accuracy_table(scheme, algorithm = "IBCF", parameter = list(k=50))
table_pop <- accuracy_table(scheme, algorithm = "POPULAR", parameter = NULL)
table_ALS_1 <- accuracy_table(scheme, algorithm = "ALS", 
parameter = list( normalize=NULL, lambda=0.1, n_factors=200,
n_iterations=10, seed = 1234, verbose = TRUE))

rbind(table_random, table_pop, table_ubcf, table_ibcf, table_ALS_1)

## algorithm RMSE MSE MAE
## 1 RANDOM 1.42278988468406 2.02433105595929 1.12227976395336
## 2 POPULAR 0.951735261736827 0.905800008433267 0.745715896299643
## 3 UBCF 0.981359108883343 0.963065700588309 0.770757174468976
## 4 IBCF 1.00306124478926 1.00613186079818 0.750610859963507
## 5 ALS 0.923355428799972 0.85258524789438 0.726210273691505

These results indicate that the root mean square error of ratings predicted by ALS compares favorably to that of other models. We hereby must however caution that not all models have necessarily received an equal amount of effort towards optimization.
Recommenderlab can also calculate model scores, such as the true positive rate (TPR) and false positive rate (FPR), starting from “topNLists”.

algorithms <- list("random items" = list(name="RANDOM", param=NULL),
                   "popular items" = list(name="POPULAR", param=NULL),
                   "user-based CF" = list(name="UBCF", param=list(nn=50)),
                   "item-based CF" = list(name="IBCF", param=list(k=50)),
                   "SVD approximation" = list(name="SVD", param=list(k = 50)),
                   "ALS_explicit" = list(name="ALS", 
                      param = list(normalize=NULL, lambda=0.1, n_factors=200, 
n_iterations=10, seed = 1234, verbose = TRUE)), "ALS_implicit" = list(name="ALS_implicit", param = list(lambda=0.1, alpha = 0.5, n_factors=10,
n_iterations=10, seed = 1234, verbose = TRUE)) )
results <- evaluate(scheme, algorithms, type = "topNList", n=c(1, 3, 5, 10, 15, 20)) avg(results)

We now plot the TPR and FPR for these different algorithms, starting from the top 1, top 3, top 5, top 10, top 15 or top 20 ratings.

plot(results, annotate=c(1,3), legend="topleft")

While the ALS algorithm for explicit ratings gave excellent root mean square errors, it does show rather disappointing true positive rates in the plot above. To understand this, one must understand a crucial difference in these scoring methods. Whenever a movie in a top N got a rating of 4 or higher, it would be marked a true positive. When it received a rating lower than 4, it is marked a false positive. However, whenever a movie in the top N is simply never seen by the user, it is also marked a false positive. Whenever the ALS algorithm thus recommends movies that got good reviews, but that are not watched a lot, that will negatively affect the TPR. In contrast, during the rmse calculation, only watched movies are taken into account.
The plot above also shows the result for the ALS algorithm for implicit ratings. This algorithm is meant to be used whenever no real ratings are available, but your data is only suggestive towards the preference of the user. An example would be a dataset that expresses how much a user has seen different movies. We then assume a user likes a movie when he has seen it and set the corresponding element in rating matrix \(R\) to 1. Unseen movies are probably not liked, and the corresponding element \(r_{u,i}\) is then set to 0. However, the confidence associated with the 0s is lower than the confidence associated with the 1s. Therefore, the weight \(w_{u,i}\) associated with unwatched movies is set to 1, while that associated with watched movies is set to \(1 + \alpha \times rating_{u,i}\). This ensures that the confidence for movies that have been watched multiple times is higher. While \(R\) and \(W\) are thus defined very differently in explicit ALS and implicit ALS, the actual cost function that is minimized is the same in both cases. Well, technically, the actual implementation still slightly differs, but this is merely because of speed optimization.
When the MovieLense data are used as implicit ratings, the implicit_ALS algorithm results in high true positive rates (again, be warned that not all algorithms have been optimized with the same care). The trade-off is that the algorithm will not as readily recommend movies that have been watched by relatively few people, and that the algorithm is not meant to predict actual ratings for the movies.
In summary, we have implemented the ALS algorithm for recommendations in recommenderlab. There are three versions available: one for explicit ratings in a realRatingMatrix, and two versions for implicit ratings, in either a realRatingMatrix or a binaryRatingMatrix. Hope you enjoy it!