在大数据的潮流下，如何有效地提取数据中的含义变得尤其重要。为了鼓励大家找到更加好的推荐算法，Netflix Challange 比赛便产生了。包子在这里要分享的文章，是一篇关于历届获奖算法的总结。总的来说，作者提到了预处理数据的重要性，以及使用分解矩阵的方法来得到collaborative filtering 是十分有效的。
Suppose Alice rates Inception 4 stars. We can think of this rating as composed of several parts:
In other words, we’ve decomposed the 4-star rating into: 4 = [3.1 (the baseline rating) - 0.5 (the Alice effect) + 0.7 (the Inception effect)] + 0.7 (the specific interaction)
So instead of having our models predict the 4-star rating itself, we could first try to remove the effect of the baseline predictors (the first three components) and have them predict the specific 0.7 stars. (I guess you can also think of this as a simple kind of boosting.)
More generally, additional baseline predictors include:
And, in fact, modeling these biases turned out to be fairly important: in their paper describing their final solution to the Netflix Prize, Bell and Koren write that
Of the numerous new algorithmic contributions, I would like to highlight one – those humble baseline predictors (or biases), which capture main effects in the data. While the literature mostly concentrates on the more sophisticated algorithmic aspects, we have learned that an accurate treatment of main effects is probably at least as signficant as coming up with modeling breakthroughs.
(For a perhaps more concrete example of why removing these biases is useful, suppose you know that Bob likes the same kinds of movies that Alice does. To predict Bob’s rating of Inception, instead of simply predicting the same 4 stars that Alice rated, if we know that Bob tends to rate movies 0.3 stars higher than average, then we could first remove Alice’s bias and then add in Bob’s: 4 + 0.5 + 0.3 = 4.8.)
Let’s now look at some slightly more sophisticated models. As alluded to in the section above, one of the standard approaches to collaborative filtering is to use neighborhood models.
Briefly, a neighborhood model works as follows. To predict Alice’s rating of Titanic, you could do two things:
(See also my post on item-to-item collaborative filtering on Amazon.)
The main questions, then, are (let’s stick to the item-item approach for simplicity):
The standard approach is to take some similarity metric (e.g., correlation or a Jaccard index) to define similarities between pairs of movies, take the K most similar movies under this metric (where K is perhaps chosen via cross-validation), and then use the same similarity metric when computing the weighted mean.
This has a couple problems:
So another approach is the following:
(A slightly more complicated user-user approach, similar to this item-item neighborhood approach, is also useful.)
Adding on to the neighborhood approach, we can also let implicit data influence our predictions. The mere fact that a user rated lots of science fiction movies but no westerns, suggests that the user likes science fiction better than cowboys. So using a similar framework as in the neighborhood ratings model, we can learn for Inception a set of offset weights associated to Inception’s movie neighbors.
Whenever we want to predict how Bob rates Inception, we look at whether Bob rated each of Inception’s neighbors. If he did, we add in the corresponding offset; if not, then we add nothing (and, thus, Bob’s rating is implicitly penalized by the missing weight).
Complementing the neighborhood approach to collaborative filtering is the matrix factorization approach. Whereas the neighborhood approach takes a very local approach to ratings (if you liked Harry Potter 1, then you’ll like Harry Potter 2!), the factorization approach takes a more global view (we know that you like fantasy movies and that Harry Potter has a strong fantasy element, so we think that you’ll like Harry Potter) that decomposes users and movies into a set of latent factors (which we can think of as categories like “fantasy” or “violence”).
In fact, matrix factorization methods were probably the most important class of techniques for winning the Netflix Prize. In their 2008 Progress Prize paper, Bell and Koren write
It seems that models based on matrix-factorization were found to be most accurate (and thus popular), as evident by recent publications and discussions on the Netflix Prize forum. We definitely agree to that, and would like to add that those matrix-factorization models also offer the important flexibility needed for modeling temporal effects and the binary view. Nonetheless, neighborhood models, which have been dominating most of the collaborative filtering literature, are still expected to be popular due to their practical characteristics - being able to handle new users/ratings without re-training and offering direct explanations to the recommendations.
The typical way to perform matrix factorizations is to perform a singular value decomposition on the (sparse) ratings matrix (using stochastic gradient descent and regularizing the weights of the factors, possibly constraining the weights to be positive to get a type of non-negative matrix factorization). (Note that this “SVD” is a little different from the standard SVD learned in linear algebra, since not every user has rated every movie and so the ratings matrix contains many missing elements that we don’t want to simply treat as 0.)
Some SVD-inspired methods used in the Netflix Prize include:
Some regression models were also used in the predictions. The models are fairly standard, I think, so I won’t spend too long here. Basically, just as with the neighborhood models, we can take a user-centric approach and a movie-centric approach to regression:
Restricted Boltzmann Machines provide another kind of latent factor approach that can be used. See this paper for a description of how to apply them to the Netflix Prize. (In case the paper’s a little difficult to read, I wrote an introduction to RBMs a little while ago.)
Many of the models incorporate temporal effects. For example, when describing the baseline predictors above, we used a few temporal predictors that allowed a user’s rating to (linearly) depend on the time since the first rating he ever made and on the time since a movie’s first rating. We can also get more fine-grained temporal effects by, say, binning items into a couple months’ worth of ratings at a time, and allowing movie biases to change within each bin. (For example, maybe in May 2006, Time Magazine nominated Titanic as the best movie ever made, which caused a spurt in glowing ratings around that time.)
In the matrix factorization approach, user factors were also allowed to be time-dependent (e.g., maybe Bob comes to like comedy movies more and more over time). We can also give more weight to recent user actions.
Regularization was also applied throughout pretty much all the models learned, to prevent overfitting on the dataset. Ridge regression was heavily used in the factorization models to penalize large weights, and lasso regression (though less effective) was useful as well. Many other parameters (e.g., the baseline predictors, similarity weights and interpolation weights in the neighborhood models) were also estimated using fairly standard shrinkage techniques.
Finally, let’s talk about how all of these different algorithms were combined to provide a single rating that exploits the strengths of each model. (Note that, as mentioned above, many of these models were not trained on the raw ratings data directly, but rather on the residuals of other models.)
In the paper detailing their final solution, the winners describe using gradient boosted decision trees to combine over 500 models; previous solutions used instead a linear regression to combine the predictors.
Briefly, gradient boosted decision trees work by sequentially fitting a series of decision trees to the data; each tree is asked to predict the error made by the previous trees, and is often trained on slightly perturbed versions of the data. (For a longer description of a similar technique, see my introduction to random forests.)
Since GBDTs have a built-in ability to apply different methods to different slices of the data, we can add in some predictors that help the trees make useful clusterings:
(For example, one thing that Bell and Koren found (when using an earlier ensemble method) was that RBMs are more useful when the movie or the user has a low number of ratings, and that matrix factorization methods are more useful when the movie or user has a high number of ratings.)
Here’s a graph of the effect of ensemble size from early on in the competition (in 2007), and the authors’ take on it:
However, we would like to stress that it is not necessary to have such a large number of models to do well. The plot below shows RMSE as a function of the number of methods used. One can achieve our winning score (RMSE=0.8712) with less than 50 methods, using the best 3 methods can yield RMSE < 0.8800, which would land in the top 10. Even just using our single best method puts us on the leaderboard with an RMSE of 0.8890. The lesson here is that having lots of models is useful for the incremental results needed to win competitions, but practically, excellent systems can be built with just a few well-selected models.