An MDP-based Recommender System
Typical recommender systems adopt a static view of the recommendation process and treat it as a prediction problem. We argue that it is more appropriate to view the problem of generating recommendations as a sequential decision problem and, consequently, that Markov decision processes (MDPs) provide a more appropriate model for recommender systems. MDPs introduce two benefits: they take into account the long-term effects of each recommendation, and they take into account the expected value of each recommendation. To succeed in practice, an MDP-based recommender system must employ a strong initial model; the bulk of this paper is concerned with generating such a model. In particular, we suggest the use of an n-gram predictive model for generating the initial MDP. Our n-gram model induces a Markov-chain model of user behavior whose predictive accuracy is greater than that of existing predictive models. We describe our predictive model in detail and evaluate its performance on real data. In addition, we show how the model can be used in an MDP-based recommender system.
💡 Research Summary
The paper challenges the prevailing view of recommender systems as static prediction problems and proposes to treat recommendation as a sequential decision‑making task. By modeling the interaction between a user and the system as a Markov Decision Process (MDP), the authors aim to capture two aspects that traditional approaches typically ignore: the long‑term impact of each recommendation on future user behavior and the expected value (reward) associated with each recommendation. An MDP is defined by a set of states, actions, transition probabilities, and rewards. In this context, a state encodes the user’s current context (e.g., the most recent items viewed), an action corresponds to the item the system proposes, the transition probability models the likelihood that the user will move to a new context after accepting or rejecting the recommendation, and the reward reflects business‑relevant outcomes such as click‑through, purchase, or satisfaction.
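To make the mapping concrete, the four MDP ingredients described above can be sketched as a small container in Python. This is purely illustrative: the type and field names (`RecommenderMDP`, `transitions`, `rewards`) are ours, not the paper's, which defines these components mathematically.

```python
from dataclasses import dataclass, field

# A state encodes the user's recent context: the last k items viewed,
# stored as a tuple so it can serve as a dictionary key.
State = tuple  # e.g. ("item_a", "item_b") for k = 2

@dataclass
class RecommenderMDP:
    """Minimal container for the MDP components described above.

    All names are hypothetical; this only shows how states, actions,
    transition probabilities, and rewards fit together.
    """
    states: set                      # all observed user contexts
    actions: set                     # items the system may recommend
    # transitions[(state, action)] -> {next_state: probability}
    transitions: dict = field(default_factory=dict)
    # rewards[(state, action)] -> business value (e.g. click, purchase)
    rewards: dict = field(default_factory=dict)

# A toy instance: recommending "b" in context ("a",) leads the user
# to the new context ("a", "b") with certainty.
mdp = RecommenderMDP(states={("a",), ("a", "b")}, actions={"b"})
mdp.transitions[(("a",), "b")] = {("a", "b"): 1.0}
mdp.rewards[(("a",), "b")] = 1.0
```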
A central obstacle to deploying an MDP‑based recommender is the need for an accurate model of the transition dynamics before any policy learning can begin. The authors address this by constructing an initial model using an n‑gram predictive framework, a technique borrowed from natural‑language processing. In an n‑gram model, the probability of the next token (here, the next item a user will interact with) is conditioned on the preceding n‑1 tokens. By treating a user’s item sequence as a “sentence,” a 3‑gram, for example, estimates the probability of the next item given the two most recent items. This approach explicitly enforces a Markov assumption of limited memory while leveraging observed frequencies to produce reliable short‑term predictions.
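A minimal sketch of such an n‑gram estimator follows (our own illustration; the function and variable names are not from the paper). It counts how often each item follows each length‑(n−1) context and normalizes the counts into maximum‑likelihood probabilities:

```python
from collections import defaultdict

def ngram_probabilities(sequences, n=3):
    """Maximum-likelihood next-item probabilities from user sessions.

    Each session is a list of item ids. The probability of the next
    item is conditioned on the preceding n-1 items, enforcing the
    limited-memory Markov assumption described above.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in sequences:
        for i in range(len(session) - (n - 1)):
            context = tuple(session[i:i + n - 1])
            nxt = session[i + n - 1]
            counts[context][nxt] += 1
    probs = {}
    for context, nexts in counts.items():
        total = sum(nexts.values())
        probs[context] = {item: c / total for item, c in nexts.items()}
    return probs

# With n=3, the context is the two most recent items.
sessions = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
model = ngram_probabilities(sessions, n=3)
# model[("a", "b")] -> {"c": 2/3, "d": 1/3}
```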
To convert the n‑gram statistics into an MDP transition matrix, the authors first count n‑gram occurrences in large log datasets and compute maximum‑likelihood estimates of the conditional probabilities. Because many higher‑order n‑grams are sparse, they apply Laplace smoothing and a back‑off strategy: when a specific n‑gram is unseen or unreliable, the model falls back to a lower‑order n‑gram (e.g., a 2‑gram or unigram) to supply a more stable estimate. The resulting transition probabilities are then incorporated into an MDP that can be solved with standard dynamic‑programming or reinforcement‑learning algorithms such as policy iteration, value iteration, or Q‑learning.
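The smoothing‑and‑back‑off step and the subsequent planning step might look as follows. This is a sketch under our own naming and constants; the paper's exact smoothing scheme and solver are not reproduced here:

```python
def backoff_probability(item, context, counts_by_order, vocab_size, alpha=1.0):
    """Laplace-smoothed next-item probability with back-off.

    counts_by_order[k] maps a length-k context tuple to {item: count};
    counts_by_order[0][()] holds unigram counts. If a higher-order
    context is unseen, fall back to the next shorter context.
    """
    for k in range(len(context), -1, -1):
        ctx = tuple(context[len(context) - k:])
        nexts = counts_by_order.get(k, {}).get(ctx)
        if nexts:  # context observed: smoothed estimate at this order
            total = sum(nexts.values())
            return (nexts.get(item, 0) + alpha) / (total + alpha * vocab_size)
    return 1.0 / vocab_size  # nothing observed at all: uniform

def value_iteration(states, actions, transition, reward, gamma=0.9, tol=1e-6):
    """Solve the MDP by value iteration; returns the greedy policy.

    transition(s, a) -> {next_state: prob}; reward(s, a) -> float.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(reward(s, a) + gamma * sum(p * V[t] for t, p in transition(s, a).items())
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    return {s: max(actions, key=lambda a: reward(s, a) +
                   gamma * sum(p * V[t] for t, p in transition(s, a).items()))
            for s in states}
```

With, say, `counts_by_order = {2: {("a", "b"): {"c": 2, "d": 1}}}` and a three-item vocabulary, `backoff_probability("c", ("a", "b"), ...)` yields the smoothed estimate (2 + 1) / (3 + 3) = 0.5, while an unseen context like `("x", "b")` falls back to the 2‑gram or unigram counts.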
The empirical evaluation uses real e‑commerce clickstream data. The authors compare the predictive accuracy of their n‑gram‑derived Markov chain against existing predictive models, including non‑sequential collaborative‑filtering approaches. The n‑gram model consistently outperforms the baselines on short horizons (predicting the next one or two items) and better captures the sequential dependencies that drive long‑term user engagement. When the n‑gram model is embedded within an MDP and a policy is optimized for expected cumulative reward, the system yields measurable improvements in click‑through rate (CTR) and revenue per user relative to a static, prediction‑only recommender.
Beyond the core technical contribution, the paper discusses practical deployment considerations. The n‑gram initialization is computationally inexpensive, requires no complex parameter tuning, and can be refreshed periodically as new interaction data arrive. After the initial policy is learned, the system can continue to adapt online by updating transition counts and re‑optimizing the policy, thereby handling evolving user preferences and catalog changes. The authors acknowledge limitations: the fixed‑transition MDP assumes stationary dynamics and may struggle with abrupt context shifts or very long‑range dependencies. They suggest future work on adaptive transition models, variable‑order n‑grams, and hybrid approaches that combine deep reinforcement learning with the transparent, data‑efficient n‑gram foundation.
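One simple way to realize the online refresh described above (our own sketch, not an algorithm from the paper) is to keep raw transition counts and re‑derive probabilities on demand, so new interaction data can be folded in without retraining from scratch:

```python
from collections import defaultdict

class OnlineTransitionCounts:
    """Incrementally maintained n-gram transition counts.

    observe() is called as new clickstream sessions arrive;
    probabilities are re-derived on demand, so the transition model
    (and hence the policy) can be refreshed periodically.
    """
    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, session):
        """Fold one session (a list of item ids) into the counts."""
        for i in range(len(session) - (self.n - 1)):
            context = tuple(session[i:i + self.n - 1])
            self.counts[context][session[i + self.n - 1]] += 1

    def probabilities(self, context):
        """Current MLE next-item distribution for a context."""
        nexts = self.counts.get(tuple(context), {})
        total = sum(nexts.values())
        return {item: c / total for item, c in nexts.items()} if total else {}

# Usage: stream sessions in, query the model between refreshes.
model = OnlineTransitionCounts(n=2)
model.observe(["a", "b"])
model.observe(["a", "c"])
# model.probabilities(("a",)) -> {"b": 0.5, "c": 0.5}
```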
In summary, the paper makes three key contributions: (1) it reframes recommendation as an MDP to jointly optimize immediate relevance and long‑term value; (2) it introduces a simple yet effective n‑gram method for constructing the initial transition model, demonstrating superior predictive performance over existing models; and (3) it validates the end‑to‑end MDP‑based recommender on real‑world data, showing measurable business gains. This work bridges the gap between theoretical sequential decision models and practical recommender system engineering, offering a clear pathway for practitioners to move beyond static predictions toward dynamic, reward‑aware recommendation strategies.