Toward a New Protocol to Evaluate Recommender Systems

In this paper, we propose an approach to analyze the performance and the added value of automatic recommender systems in an industrial context. We show that recommender systems are multifaceted and can be organized around four structuring functions: help users decide, help users compare, help users discover, and help users explore. A global offline protocol is then proposed to evaluate recommender systems, based on the definition of appropriate evaluation measures for each of these functions. The protocol is discussed from the perspective of the usefulness and trustworthiness of the recommendations. A new measure, the Average Measure of Impact (AMI), is introduced to evaluate the impact of personalized recommendation. We experiment with two classical methods, K-Nearest Neighbors (KNN) and Matrix Factorization (MF), on the well-known Netflix dataset. A segmentation of both users and items is proposed to analyze precisely where the algorithms perform well or poorly. We show that performance is strongly segment-dependent and that there is no clear correlation between RMSE and recommendation quality.


💡 Research Summary

The paper addresses a fundamental gap in the evaluation of recommender systems used in industrial settings: most existing assessments rely heavily on accuracy‑oriented metrics such as RMSE or MAE, which do not capture the true business value or user experience generated by a recommendation. To remedy this, the authors first decompose the role of a recommender into four “structuring functions”: (1) helping users decide, (2) helping users compare alternatives, (3) helping users discover new items, and (4) helping users explore a broader item space. Each function implies a distinct set of quality criteria. For example, “decide” emphasizes predictive reliability and trust, while “discover” stresses novelty and long‑term satisfaction.

Based on this functional taxonomy, the authors propose a comprehensive offline evaluation protocol. Traditional metrics are retained for the “decide” function, but additional measures are introduced for the other three functions: rank‑consistency metrics for “compare,” novelty and diversity scores for “discover,” and exploration‑oriented metrics (e.g., coverage, serendipity) for “explore.” The centerpiece of the new protocol is the Average Measure of Impact (AMI), a metric that quantifies the incremental effect of a personalized recommendation on user actions (clicks, purchases, dwell time) relative to a baseline (random or existing policy). AMI is calculated by averaging the relative lift across all recommended items, thus directly reflecting the economic or engagement impact of the system.
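The summary above describes AMI as the average relative lift of a personalized policy over a baseline, without giving an exact formula. A minimal sketch of one plausible reading follows; the function name, the per-item engagement rates, and the handling of missing baseline items are all assumptions for illustration, not the paper's definition.

```python
def average_measure_of_impact(personalized, baseline):
    """Average relative lift of an engagement signal (e.g. click-through
    rate) for personalized recommendations over a baseline policy.

    Both arguments map item ids to an observed engagement rate; items
    absent from the baseline, or with a zero baseline rate, are skipped
    (an assumption, since division by zero is otherwise undefined).
    """
    lifts = []
    for item, p_rate in personalized.items():
        b_rate = baseline.get(item)
        if b_rate:
            lifts.append((p_rate - b_rate) / b_rate)
    return sum(lifts) / len(lifts) if lifts else 0.0


# Hypothetical engagement rates per recommended item.
personalized = {"A": 0.30, "B": 0.12, "C": 0.05}
baseline = {"A": 0.20, "B": 0.10, "C": 0.05}
ami = average_measure_of_impact(personalized, baseline)
```

With these made-up rates, item A lifts by 50%, item B by 20%, and item C not at all, so the sketch averages those relative lifts into a single impact score.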

To validate the protocol, the authors conduct experiments on the well‑known Netflix dataset, comparing two classic algorithms: K‑Nearest Neighbors (KNN) and Matrix Factorization (MF). They further segment both users and items into three groups each (high, medium, low activity or popularity) to examine performance heterogeneity. Evaluation includes RMSE, MAE, Precision@K, Recall@K, Diversity, Novelty, and the newly defined AMI.
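The per-segment analysis can be sketched as below: users are bucketed into high/medium/low activity by rating count, and an accuracy metric (here RMSE) is reported per bucket. The threshold values and the sample triples are illustrative assumptions, not figures from the paper.

```python
import math
from collections import defaultdict


def rmse(pairs):
    """Root mean squared error over (predicted, true) rating pairs."""
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))


def segment_users(ratings_per_user, high=100, low=20):
    """Assign each user to an activity segment by rating count.
    The thresholds are illustrative, not the paper's."""
    return {
        user: "high" if n >= high else "medium" if n >= low else "low"
        for user, n in ratings_per_user.items()
    }


def rmse_by_segment(predictions, segments):
    """predictions: iterable of (user, predicted, true) triples."""
    buckets = defaultdict(list)
    for user, pred, true in predictions:
        buckets[segments[user]].append((pred, true))
    return {seg: rmse(pairs) for seg, pairs in buckets.items()}


# Hypothetical rating counts and prediction triples.
segments = segment_users({"u1": 150, "u2": 30, "u3": 5})
per_segment = rmse_by_segment(
    [("u1", 4.0, 3.5), ("u2", 2.0, 3.0), ("u3", 5.0, 1.0)],
    segments,
)
```

The same bucketing can be applied to items by popularity, which is how the summary describes the cross of user activity and item popularity segments.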

Results reveal a nuanced picture. In the high‑activity/high‑popularity segment, MF achieves the lowest RMSE and highest precision, yet KNN outperforms MF on AMI and diversity, indicating that more accurate predictions do not automatically translate into higher business impact. In the low‑activity/low‑popularity segment, KNN's predictions are less accurate, but its recommendations lead to a 23% increase in click‑through rate and a substantial rise in AMI (from 0.12 to 0.35), demonstrating superior ability to surface novel, engaging items. Across all segments, there is no consistent correlation between RMSE and any of the impact‑oriented metrics, confirming the authors' claim that accuracy alone is an insufficient proxy for recommendation quality.
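A claim of "no consistent correlation" between RMSE and impact metrics can be checked with a rank correlation such as Spearman's ρ over per-segment values. The stdlib implementation below is a generic sketch; the paper does not say which correlation measure it uses, and the inputs would be per-segment metric values, not the hypothetical extremes tested here.

```python
import math


def rank(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

A ρ near zero across segments would support the summary's conclusion that RMSE is a poor proxy for impact.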

The paper concludes that a multi‑dimensional evaluation framework—one that aligns metrics with the four functional goals and incorporates impact‑centric measures like AMI—is essential for both researchers and industry practitioners. Such a framework enables more informed algorithm selection, better alignment with business objectives, and clearer insight into where a system adds value or falls short. The authors suggest future work on real‑time A/B testing of AMI, extending the protocol to other domains (music streaming, e‑commerce), and learning optimal weightings for the various function‑specific metrics based on stakeholder priorities.
