Dynamic Prior Thompson Sampling for Cold-Start Exploration in Recommender Systems
Cold-start exploration is a core challenge in large-scale recommender systems: new or data-sparse items must receive traffic to estimate value, but over-exploration harms users and wastes impressions. In practice, Thompson Sampling (TS) is often initialized with a uniform Beta(1,1) prior, implicitly assuming a 50% success rate for unseen items. When true base rates are far lower, this optimistic prior systematically over-allocates to weak items. The impact is amplified by batched policy updates and pipeline latency: for hours, newly launched items can remain effectively “no data,” so the prior dominates allocation before feedback is incorporated. We propose Dynamic Prior Thompson Sampling, a prior design that directly controls the probability that a new arm outcompetes the incumbent winner. Our key contribution is a closed-form quadratic solution for the prior mean that enforces P(X_j > Y_k) = epsilon at introduction time, making exploration intensity predictable and tunable while preserving TS Bayesian updates. Across Monte Carlo validation, offline batched simulations, and a large-scale online experiment on a thumbnail personalization system serving millions of users, dynamic priors deliver precise exploration control and improved efficiency versus a uniform-prior baseline.
💡 Research Summary
The paper tackles a pervasive problem in large-scale recommender systems: how to allocate traffic to newly launched or data-sparse items (the “cold-start” problem) without over-exploring weak content and degrading user experience. In practice, many production systems deploy Thompson Sampling (TS) with a uniform Beta(1, 1) prior for new arms, implicitly assuming a 50% success probability. This assumption is wildly optimistic in most real-world marketplaces, where the base rate of high-performing items is often an order of magnitude lower (e.g., 1–5%). The mismatch leads to systematic over-allocation of impressions to weak items. The issue is amplified by operational constraints common in production: model updates are performed in batches (every few hours) and data pipelines introduce latency, so a newly introduced arm can remain “no-data” from the policy’s perspective for hours. During this window the prior dominates the decision rule, causing substantial waste before feedback is incorporated.
The authors propose Dynamic Prior Thompson Sampling (DPTS), a principled method that designs the prior for a new arm so that the probability its TS draw exceeds that of the current incumbent winner is a user-specified target ε. By directly tying the prior’s mean to the observed performance of the best arm, DPTS makes exploration intensity predictable and tunable, even under batched updates.
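The over-allocation from a uniform prior is easy to quantify: a Beta(1, 1) draw is Uniform(0, 1), so against an incumbent whose posterior concentrates near a low base rate, the new arm wins almost every sampled comparison. A minimal sketch (the 5% incumbent rate and 1,000-impression posterior are illustrative assumptions, not numbers from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Incumbent winner: ~5% success rate estimated from ~1,000 impressions
# (illustrative numbers, not taken from the paper).
incumbent = rng.beta(50, 950, size=n)

# New arm under the uniform Beta(1, 1) prior: its TS draws are Uniform(0, 1).
new_arm = rng.beta(1, 1, size=n)

# Probability the cold-start arm's TS draw beats the incumbent's draw.
# Since P(U > y) = 1 - y, this is roughly 1 - 0.05 = 0.95.
p_win = (new_arm > incumbent).mean()
print(f"P(new > incumbent) = {p_win:.3f}")
```

Until feedback arrives, the fresh arm out-draws a strong 5% incumbent about 95% of the time, which is exactly the over-allocation the authors describe during the batched no-data window.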
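The design goal can be stated operationally even before any closed form: fix a prior strength, then find the prior mean whose TS draw beats the incumbent's with probability ε. A model-free sketch using Monte Carlo estimates and bisection (the incumbent posterior, the prior strength `s`, and ε = 0.1 are all illustrative assumptions; the paper instead solves for this mean in closed form):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
eps = 0.10   # target P(new arm's TS draw > incumbent's TS draw)
s = 20.0     # assumed prior strength alpha + beta (a tunable choice)

# Incumbent winner's posterior (illustrative: ~5% rate, ~1,000 impressions).
incumbent = rng.beta(50, 950, size=n)

def win_prob(mu):
    """Monte Carlo estimate of P(X_j > Y_k) for prior Beta(mu*s, (1-mu)*s)."""
    draws = rng.beta(mu * s, (1.0 - mu) * s, size=n)
    return (draws > incumbent).mean()

# Bisection on the prior mean: win_prob is increasing in mu at fixed strength.
lo, hi = 1e-4, 0.5
for _ in range(40):
    mid = 0.5 * (lo + hi)
    if win_prob(mid) < eps:
        lo = mid
    else:
        hi = mid
mu_star = 0.5 * (lo + hi)
print(f"prior mean = {mu_star:.4f}, achieved P = {win_prob(mu_star):.3f}")
```

The resulting prior mean sits well below the incumbent's 5%, encoding the realistic base rate while still guaranteeing the new arm a controlled ε share of wins per decision.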
Core Technical Contribution
- Prior Parameterization – For a new arm \(j\), the prior is set to \(\text{Beta}(\alpha_j, \beta_j)\) with the prior mean chosen so that the new arm's TS draw exceeds the incumbent winner's draw with exactly the target probability, i.e. \(P(X_j > Y_k) = \varepsilon\) at introduction time; the paper derives a closed-form quadratic solution for this mean, after which standard TS Bayesian updates proceed unchanged.
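One route to such a closed-form quadratic (a reconstruction under a Gaussian approximation; the paper's exact derivation may differ) is to approximate both draws as normal: \(X_j \approx \mathcal{N}(\mu, \mu(1-\mu)/(s+1))\) for a Beta prior with mean \(\mu\) and strength \(s = \alpha_j + \beta_j\), and \(Y_k \approx \mathcal{N}(m, v)\) for the incumbent posterior. Setting \(P(X_j > Y_k) = \varepsilon\) gives \(\mu - m = z_\varepsilon\sqrt{\mu(1-\mu)/(s+1) + v}\); squaring yields a quadratic in \(\mu\). A sketch with illustrative numbers:

```python
import numpy as np
from statistics import NormalDist

# Illustrative settings -- the paper's exact parameterization may differ.
eps = 0.10                         # target P(X_j > Y_k) at introduction time
s = 20.0                           # assumed prior strength alpha_j + beta_j
a_k, b_k = 50.0, 950.0             # incumbent winner's posterior (illustrative)
m = a_k / (a_k + b_k)              # incumbent posterior mean
v = m * (1 - m) / (a_k + b_k + 1)  # incumbent posterior variance

# Squaring (mu - m) = z * sqrt(mu(1-mu)/(s+1) + v), z = Phi^{-1}(eps),
# gives the quadratic  qa*mu^2 + qb*mu + qc = 0:
z2 = NormalDist().inv_cdf(eps) ** 2
qa = 1.0 + z2 / (s + 1.0)
qb = -(2.0 * m + z2 / (s + 1.0))
qc = m * m - z2 * v
# For eps < 0.5 take the root below the incumbent mean m.
mu = (-qb - np.sqrt(qb * qb - 4.0 * qa * qc)) / (2.0 * qa)

# Verify with exact Beta draws: the Gaussian step is approximate, so the
# achieved probability only tracks eps, it is not exact.
rng = np.random.default_rng(2)
n = 400_000
p_hat = (rng.beta(mu * s, (1 - mu) * s, n) > rng.beta(a_k, b_k, n)).mean()
print(f"prior mean = {mu:.4f}, achieved P = {p_hat:.3f}")
```

Because the normal approximation degrades for skewed Betas (here the implied \(\alpha_j\) falls below 1), the exact-Beta Monte Carlo check makes the residual gap to ε visible; the paper's closed form should be read as the authoritative version of this construction.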