Estimating Error and Bias in Offline Evaluation Results
Related Work
Several existing techniques attempt to measure or correct problems with offline evaluation. One approach is to change the experimental protocol: prior work has proposed data splitting and analysis strategies to address popularity bias; these methods affect absolute metric values, but not necessarily the relative performance of algorithms. Using random subsets of the item space as candidates for recommendation may reduce the impact of unknown relevant items, but it relies on unrealistically strong assumptions and likely exacerbates popularity bias.
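The random-candidate protocol mentioned above can be sketched as follows. This is a minimal illustration, not the procedure of any cited study: the `rank` interface, the candidate-set size, and the hit-rate metric are all assumptions made for the example.

```python
import random

def sampled_hit_rate(rank, test_pairs, all_items, n_candidates=100, k=10, seed=42):
    """Estimate hit-rate@k by ranking each held-out relevant item against a
    random sample of other items rather than the full catalogue.

    rank(user, items) -> list of items sorted best-first (hypothetical interface).
    test_pairs: iterable of (user, relevant_item) held-out interactions.
    """
    rng = random.Random(seed)
    hits, total = 0, 0
    for user, rel_item in test_pairs:
        # Draw a random candidate subset that excludes the held-out item.
        negatives = rng.sample([i for i in all_items if i != rel_item], n_candidates)
        ranked = rank(user, negatives + [rel_item])
        hits += rel_item in ranked[:k]
        total += 1
    return hits / total
```

Because the negatives are drawn uniformly, a popularity-biased ranker can place the relevant item highly more easily than it could against the whole catalogue, which is one way the protocol's strong assumptions surface in practice.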
Another approach is to seek metrics that admit statistically unbiased estimators from observable data. If ratings for relevant items are missing at random, recall and unnormalized DCG can be estimated without bias. These results, however, restrict the choice of metrics and depend on assumptions unlikely to hold in actual use, since relevance is not the only influence on which items users choose to consume or rate.
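To make the metric distinction concrete, the two measures named above can be computed as below; the function names and binary-relevance formulation are illustrative choices, not notation from the cited results.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the known-relevant items retrieved in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def unnormalized_dcg_at_k(ranked, relevant, k):
    """Discounted cumulative gain without the ideal-DCG normalizer.

    The normalizer in nDCG depends on the (unobserved) total number of
    relevant items, which is why the *unnormalized* form is the one that
    admits an unbiased estimator under missing-at-random relevance labels.
    """
    return sum(1.0 / math.log2(i + 2)          # rank discount: position i -> 1/log2(i+2)
               for i, item in enumerate(ranked[:k])
               if item in relevant)
```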
Counterfactual evaluation uses causal inference techniques — often inverse propensity scoring — to estimate how users would have responded to a different recommender algorithm. However, it is difficult to apply to commonly used data sets and does not yield insight into the reliability of existing evaluations. It also cannot address the fundamental problem that concerns us in this work: if the user was never exposed to an item under the logging policy, the historical log contains no information on that item's relevance. Such items are precisely where a recommender system can produce the most benefit in many discovery-oriented applications.
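A minimal inverse-propensity-scoring estimator makes the exposure limitation visible in code. The log format and policy interface here are assumptions for the sketch, not a reference implementation:

```python
def ips_estimate(log, target_policy):
    """Inverse-propensity-scored estimate of a target policy's mean reward.

    log: list of (context, action, reward, logging_prob) tuples, where
         logging_prob is the probability the logging policy chose `action`.
    target_policy(context, action) -> probability the new policy would
         choose `action` in `context`.
    """
    total = 0.0
    for context, action, reward, p_log in log:
        # Actions the logging policy could never take (p_log == 0) contribute
        # nothing: the log carries no information about their rewards. This is
        # the fundamental limitation noted in the text.
        if p_log > 0:
            total += reward * target_policy(context, action) / p_log
    return total / len(log)
```

Reweighting by `target_policy / logging_prob` corrects for the logging policy's action distribution, but only over the support of the logging policy; unexposed items remain invisible.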
Simulation is a promising technique for studying evaluation procedures. Simulations can produce complete ground truth and corresponding observations in a controlled manner, subject to assumptions about the structure of the data generation process. Prior simulation studies have used probabilistic models to better understand the impact of popularity bias, finding relationships between popularity bias and structural assumptions about the underlying data, and showing that in some cases the relative performance of collaborative filtering algorithms inverts between complete and observable data.
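The core idea — generate complete ground truth, then pass it through a biased observation process — can be sketched in a few lines. Every parameter here (relevance probability, the popularity-decay form of the observation model) is an illustrative assumption, not the model of any cited study:

```python
import random

def simulate(n_users=200, n_items=100, seed=0):
    """Toy simulation of complete ground truth plus biased observation.

    Returns (truth, observed): truth is the full set of relevant
    (user, item) pairs; observed is the subset a popularity-biased
    logging process would actually record.
    """
    rng = random.Random(seed)
    # Ground truth: each user-item pair is relevant with probability 0.1.
    truth = {(u, i) for u in range(n_users) for i in range(n_items)
             if rng.random() < 0.1}
    # Observation: lower-index items stand in for popular items and are
    # observed with higher probability (decaying in item index).
    observed = {(u, i) for (u, i) in truth
                if rng.random() < 1.0 / (1.0 + 0.1 * i)}
    return truth, observed
```

Because both `truth` and `observed` are available, a metric can be computed on each and the gap between them measured directly — which is exactly the comparison that simulation studies use to detect inversions between complete and observable data.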