Optimism Stabilizes Thompson Sampling for Adaptive Inference


Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the $K$-armed Gaussian bandit and identify \emph{optimism} as a key mechanism for restoring \emph{stability}, a sufficient condition for valid asymptotic inference requiring each arm's pull count to concentrate around a deterministic scale. First, we prove that variance-inflated TS \citep{halder2025stable} is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal. This resolves the open question raised by \citet{halder2025stable} by extending their results from the two-armed setting to the general $K$-armed setting. Second, we analyze an alternative optimistic modification that keeps the posterior variance unchanged but adds an explicit bonus to the posterior mean, and we establish the same stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid inference in multi-armed bandits, while incurring only a mild additional regret cost.


💡 Research Summary

This paper investigates the inferential shortcomings of standard Thompson Sampling (TS) in stochastic multi‑armed bandits when data are collected adaptively. In such settings, the number of pulls of each arm is a random, history‑dependent quantity, breaking the independence assumptions required for classical central limit theorems. The authors adopt the notion of stability (Lai & Wei, 1982), which demands that for every arm a there is a deterministic sequence Nₐ,⋆,T diverging to infinity such that the ratio Nₐ,T/Nₐ,⋆,T of the empirical pull count to this sequence converges in probability to 1. When stability holds, the random sample size can be replaced by its deterministic counterpart, allowing the studentized sample mean to obey a normal limit and enabling Wald‑type confidence intervals and hypothesis tests under adaptive sampling.
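In notation matching the summary above, the limit enabled by stability can be sketched as follows (a standard Wald-type statement, not the paper's exact theorem):

```latex
% If N_{a,T}/N_{a,\star,T} \to 1 in probability with N_{a,\star,T} \to \infty,
% then for arm means \mu_a and reward variance \sigma_a^2 the studentized
% sample mean is asymptotically standard normal,
\[
  \sqrt{N_{a,T}}\;\frac{\hat\mu_{a,T} - \mu_a}{\sigma_a}
  \;\xrightarrow{d}\; \mathcal{N}(0,1),
\]
% so the Wald interval \hat\mu_{a,T} \pm z_{1-\alpha/2}\,\sigma_a/\sqrt{N_{a,T}}
% retains its nominal coverage despite the adaptive sampling.
```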

The paper shows that vanilla TS fails to satisfy stability, especially when the optimal arm is not unique, leading to non‑standard asymptotics. To remedy this, the authors propose two optimistic modifications of Gaussian TS and prove that both restore stability for any fixed number of arms K ≥ 2, covering the challenging regime with multiple optimal arms.
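To make the failure mode concrete, here is a minimal sketch (not the paper's code) of vanilla Gaussian TS with unit-variance rewards and a flat prior. With two arms of equal mean, both arms are optimal and the split of pulls varies wildly across runs, illustrating why the pull counts Nₐ,T do not concentrate around a deterministic scale:

```python
# Hedged sketch: vanilla Gaussian Thompson sampling with unit-variance
# rewards and a flat prior.  Illustrates that per-arm pull counts are
# random and history-dependent.
import random
from math import sqrt

def vanilla_ts(means, T, seed=0):
    """Run Gaussian TS for T rounds; return per-arm pull counts."""
    rng = random.Random(seed)
    K = len(means)
    pulls = [0] * K          # N_{a,t}: number of pulls of each arm
    sums = [0.0] * K         # cumulative reward of each arm
    for t in range(T):
        if t < K:            # pull each arm once to initialise posteriors
            a = t
        else:
            # posterior of arm a is N(empirical mean, 1/N_a); draw one
            # sample per arm and play the argmax
            draws = [rng.gauss(sums[a] / pulls[a], 1.0 / sqrt(pulls[a]))
                     for a in range(K)]
            a = max(range(K), key=lambda i: draws[i])
        pulls[a] += 1
        sums[a] += means[a] + rng.gauss(0.0, 1.0)  # unit-variance reward
    return pulls

# Two arms with identical means (both optimal): the pull-count split
# differs substantially from seed to seed.
splits = [vanilla_ts([0.0, 0.0], T=2000, seed=s) for s in range(5)]
```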

  1. Variance‑inflated TS – The posterior mean remains the empirical mean (\hat\mu_{a,t}), but the sampling variance is multiplied by a factor (\sigma(A)>1). The factor is allowed to grow slowly with the horizon, satisfying (\sigma(A)/\log\log T\to\infty) and (\sigma(A)(\log T)^2/T\to0). Inflating the variance makes high‑tail draws more likely, a form of distributional optimism. Theorem 4.1 proves that for any K the pull counts satisfy (N_{a,T}/N_{a,\star,T}\to 1) in probability for every arm a, i.e., the procedure is stable.
  2. Mean‑boosted TS – The posterior variance is left unchanged, but an explicit optimism bonus is added to the posterior mean before sampling. The paper establishes the same stability conclusion for this variant.
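The variance-inflated modification changes only one line of the vanilla sampler. The sketch below uses (\sigma_T = \log T) as one inflation schedule compatible with the stated growth conditions ((\log T/\log\log T\to\infty) and ((\log T)^3/T\to 0)); the paper's exact schedule may differ:

```python
# Hedged sketch of variance-inflated Gaussian TS: identical to vanilla TS
# except each posterior draw uses standard deviation sigma_T / sqrt(N_a)
# with an inflation factor sigma_T > 1.  The schedule sigma_T = log T is
# an assumed illustrative choice satisfying the growth conditions quoted
# in the summary, not necessarily the paper's.
import math
import random

def inflated_ts(means, T, seed=0):
    """Variance-inflated Gaussian TS; returns per-arm pull counts."""
    rng = random.Random(seed)
    sigma_T = max(2.0, math.log(T))   # assumed inflation schedule
    K = len(means)
    pulls, sums = [0] * K, [0.0] * K
    for t in range(T):
        if t < K:                     # initialise each arm once
            a = t
        else:
            # inflated posterior draw: N(empirical mean, sigma_T^2 / N_a)
            draws = [rng.gauss(sums[a] / pulls[a],
                               sigma_T / math.sqrt(pulls[a]))
                     for a in range(K)]
            a = max(range(K), key=lambda i: draws[i])
        pulls[a] += 1
        sums[a] += means[a] + rng.gauss(0.0, 1.0)  # unit-variance reward
    return pulls
```

The wider posterior keeps every arm's sampling probability bounded away from zero for longer, which is the mechanism behind the pull-count concentration the theorem establishes.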
