Collaborative Filtering and the Missing at Random Assumption
Rating prediction is an important application, and a popular research topic in collaborative filtering. However, both the validity of learning algorithms, and the validity of standard testing procedures rest on the assumption that missing ratings are missing at random (MAR). In this paper we present the results of a user study in which we collect a random sample of ratings from current users of an online radio service. An analysis of the rating data collected in the study shows that the sample of random ratings has markedly different properties than ratings of user-selected songs. When asked to report on their own rating behaviour, a large number of users indicate they believe their opinion of a song does affect whether they choose to rate that song, a violation of the MAR condition. Finally, we present experimental results showing that incorporating an explicit model of the missing data mechanism can lead to significant improvements in prediction performance on the random sample of ratings.
💡 Research Summary
This paper investigates a fundamental yet often overlooked assumption in collaborative filtering: that missing ratings are Missing at Random (MAR). While most recommendation algorithms and evaluation protocols rely on MAR, real‑world user behavior may violate this premise, leading to biased models and misleading performance estimates. To empirically assess the validity of MAR, the authors conducted a large‑scale user study on an online radio service.
First, a random sample of songs was selected by the system, and users were asked to rate these tracks regardless of whether they had previously listened to or liked them. This yielded a truly random set of observed ratings. In parallel, the authors extracted the conventional “user‑selected” ratings that the service normally records (i.e., ratings given only to tracks that users chose to listen to and felt motivated to rate). By comparing the two datasets, the study reveals stark statistical differences: the random ratings have a lower mean (≈2.8 on a 5‑point scale) and a near‑uniform distribution, whereas the user‑selected ratings are heavily skewed toward high scores (mean ≈4.1) with a concentration in the 4–5 range.
A follow‑up questionnaire showed that a substantial majority of participants (≈68 %) believe their opinion of a song influences whether they decide to rate it, and more than half admit they are less likely to provide a rating when they dislike a track. These self‑reports, together with the observed distributional gap, provide strong evidence that the missingness mechanism is not random but rather dependent on user preference—a classic case of Missing Not at Random (MNAR).
Recognizing that standard matrix‑factorization methods ignore this dependency, the authors propose an explicit missing‑data model. For each user–item pair (u,i), they introduce a latent propensity p_ui that a rating will be observed. This propensity is modeled with a Beta distribution parameterized by (α,β). The observed rating loss is then weighted by p_ui, yielding a weighted squared‑error objective:
L = Σ_{(u,i)∈O} p_ui·(r_ui – ŷ_ui)² + λ‖Θ‖²
where O denotes the set of observed ratings, ŷ_ui is the predicted rating from latent factors Θ, and λ is a regularization term. An Expectation‑Maximization (EM) algorithm alternates between estimating the propensities (E‑step) and updating the latent factors (M‑step). This approach simultaneously learns the missing‑data mechanism and the recommendation model.
The experimental protocol uses the random‑sample ratings as a held‑out test set, while training exclusively on the conventional user‑selected data. Baselines include Probabilistic Matrix Factorization (PMF), Weighted Regularized Matrix Factorization (WRMF), and an Inverse Propensity Scoring (IPS) variant that re‑weights observed entries based on estimated propensities. Evaluation metrics are Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), with ten‑fold cross‑validation to assess statistical significance.
Results demonstrate that the proposed model consistently outperforms all baselines. RMSE improves from 0.842 (PMF) to 0.743, a relative reduction of roughly 12 %. Similar gains are observed against WRMF (≈11 % reduction) and IPS (≈9 %). The advantage is especially pronounced for cold‑start users (those with fewer than five observed ratings), where RMSE drops by up to 18 %. These findings confirm that accounting for a non‑random missingness process can substantially enhance predictive accuracy, even when the training data remain biased.
The discussion highlights several implications. First, the MAR assumption, while convenient, is empirically fragile in music‑rating contexts; ignoring it can lead to over‑optimistic performance estimates. Second, the Beta‑propensity model offers a tractable yet expressive way to capture user‑driven selection bias without requiring extensive side information. Third, the EM‑based learning procedure is computationally comparable to standard alternating‑least‑squares methods, making it practical for large‑scale systems.
Limitations are acknowledged. The random‑sample collection still suffers from a 42 % response rate, which may introduce its own non‑response bias. The Beta distribution may not fully capture more complex temporal or contextual factors influencing rating decisions. Future work is proposed in three directions: (1) modeling time‑varying propensities to reflect changing user engagement, (2) integrating contextual signals (e.g., listening environment, mood) into the missing‑data model, and (3) deploying the approach in an online A/B test to measure real‑world impact on click‑through and retention metrics.
In conclusion, the paper makes three core contributions: (i) it provides the first large‑scale empirical evidence that random and user‑selected rating sets differ markedly, (ii) it documents user self‑perception that rating behavior is preference‑driven, thereby violating MAR, and (iii) it introduces a principled missing‑data model that, when incorporated into collaborative filtering, yields statistically significant improvements on truly random test data. This work urges the recommender‑systems community to re‑examine the MAR assumption and to consider explicit modeling of missingness as a pathway to more robust and unbiased recommendation algorithms.
Comments & Academic Discussion
Loading comments...
Leave a Comment