Dynamic Decision-Making under Model Misspecification: A Stochastic Stability Approach
Dynamic decision-making under model uncertainty is central to many economic environments, yet existing bandit and reinforcement learning algorithms rely on the assumption of correct model specification. This paper studies the behavior and performance of one of the most commonly used Bayesian reinforcement learning algorithms, Thompson Sampling (TS), when the model class is misspecified. We first provide a complete dynamic classification of posterior evolution in a misspecified two-armed Gaussian bandit, identifying three distinct regimes: correct-model concentration, incorrect-model concentration, and persistent belief mixing, each characterized by the direction of statistical evidence and the model-action mapping. These regimes yield sharp predictions for limiting beliefs, action frequencies, and asymptotic regret. We then extend the analysis to a general finite model class and develop a unified stochastic stability framework that represents posterior evolution as a Markov process on the belief simplex. This framework yields two sufficient conditions that classify ergodic versus transient behavior and provides an inductive dimensional reduction of the posterior dynamics. Our results offer the first qualitative and geometric classification of TS under misspecification, bridging Bayesian learning with evolutionary dynamics, and lay the foundations for robust decision-making in structured bandits.
💡 Research Summary
This paper investigates the long‑run behavior of Thompson Sampling (TS) when the decision maker’s parametric model class is misspecified. While TS is known to achieve near‑optimal regret in correctly specified bandit problems, the authors ask what happens when the true reward distribution lies outside the assumed family. The analysis proceeds in two parts.
First, the authors study the simplest non‑trivial setting: a two‑armed Gaussian bandit with two candidate models ν and γ. The true reward from arm i is N(g(i), 1), whereas the decision maker believes it to be N(θ_i, 1) for θ∈{ν,γ}. Two key primitives drive the posterior dynamics: (i) whether the two models recommend the same arm (agreement vs. disagreement) and (ii) the sign of the expected log‑likelihood‑ratio drift Δ_i for each arm, which indicates which model is statistically favored when that arm is pulled. By combining these primitives they identify three mutually exclusive regimes:
- Self‑confirming regime – each model is supported by data generated from the arm it recommends. The posterior eventually concentrates on one of the two models, but which one is path‑dependent and determined by early draws. If the selected model’s recommended arm coincides with the true optimal arm, regret remains sub‑linear (typically O(log T)). If not, TS becomes asymptotically deterministic on a sub‑optimal arm, incurring linear regret.
- Uniform‑dominance regime – one model dominates the other on both arms (Δ₁ and Δ₂ have the same sign). Regardless of the prior, the posterior collapses to the dominant model, and TS quickly learns the optimal arm, yielding the usual near‑optimal regret bounds.
- Self‑defeating regime – each model is undermined by data from its own recommended arm (Δ₁ and Δ₂ have opposite signs). In this case the posterior does not converge to a vertex of the simplex; instead it evolves as an ergodic Markov process on (0, 1) with a non‑degenerate stationary distribution μ. Beliefs keep mixing, the optimal arm is pulled only with a positive long‑run frequency less than one, and cumulative regret grows linearly.
Thus, misspecification can lead either to near‑optimal performance (regimes 1 and 2 when the chosen model aligns with the true optimum) or to severe long‑run losses (regimes 1 and 3 when the chosen model is wrong or beliefs never concentrate).
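For unit-variance Gaussians the drift Δ_i has a closed form: Δ_i = ½[(g(i) − γ_i)² − (g(i) − ν_i)²], i.e. model ν is favored on arm i exactly when its mean is closer to the truth. A minimal sketch of the regime test, under our reading of the summary above (arms indexed 0/1, each model recommending its argmax arm; function names are illustrative):

```python
import numpy as np

def kl_drift(g, nu, gamma, i):
    # Expected log-likelihood ratio of model nu over gamma when arm i is
    # pulled and the true reward is N(g[i], 1):
    #   Delta_i = 0.5 * ((g[i] - gamma[i])**2 - (g[i] - nu[i])**2)
    # Positive means the data from arm i favors nu.
    return 0.5 * ((g[i] - gamma[i]) ** 2 - (g[i] - nu[i]) ** 2)

def classify(g, nu, gamma):
    # Illustrative regime test: self-confirming / self-defeating require
    # the models to disagree on the best arm and the drifts to differ in
    # sign; same-sign drifts fall under uniform dominance.
    d = [kl_drift(g, nu, gamma, i) for i in (0, 1)]
    a_nu, a_gamma = int(np.argmax(nu)), int(np.argmax(gamma))
    if a_nu == a_gamma or d[0] * d[1] > 0:
        return "uniform dominance (or model agreement)", d
    if d[a_nu] > 0 and d[a_gamma] < 0:
        return "self-confirming", d
    return "self-defeating", d

# Example: truth g = (1, 0); nu recommends arm 0, gamma recommends arm 1,
# and each model fits its own arm's data better -> self-confirming.
regime, drifts = classify(np.array([1.0, 0.0]),
                          np.array([1.2, -1.0]),
                          np.array([0.2, 0.3]))
print(regime)
```

The closed form follows from E[(r − θ)²] = 1 + (g(i) − θ)² for r ~ N(g(i), 1), so the variance terms cancel in the log-likelihood ratio.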
The second part of the paper generalizes the analysis to an arbitrary finite model class Θ = {θ(1),…,θ(M)} and a finite action set 𝒜. Under TS, the posterior belief vector πₜ lives on the M‑dimensional simplex Δ_M and evolves as a Markov chain whose transition kernel depends on the current belief through the induced action probabilities. The authors develop a stochastic‑stability framework that asks whether πₜ settles into a steady‑state distribution (ergodic) or drifts toward the boundary (transient).
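The belief-as-Markov-chain view can be illustrated with a short simulation. The update rule below (sample a model from πₜ, play that model's best action, reweight by Gaussian likelihood) is our reading of TS over a finite model class; all names and parameters are illustrative:

```python
import numpy as np

def ts_posterior_path(g, theta, T=2000, seed=0):
    """Simulate the posterior pi_t under Thompson Sampling.

    g     : true mean reward of each action; rewards are N(g[a], 1)
    theta : (M, A) array, model m's assumed mean reward for each action
    Returns the belief path, a Markov chain on the M-simplex whose
    transition kernel depends on pi_t through the induced action choice.
    """
    rng = np.random.default_rng(seed)
    M = theta.shape[0]
    pi = np.full(M, 1.0 / M)             # uniform prior over models
    path = [pi.copy()]
    for _ in range(T):
        m = rng.choice(M, p=pi)          # sample a model from the posterior
        a = int(np.argmax(theta[m]))     # play that model's recommended action
        r = rng.normal(g[a], 1.0)        # truth may lie outside the model class
        # Bayes update: reweight each model by its likelihood of r
        log_lik = -0.5 * (r - theta[:, a]) ** 2
        pi = pi * np.exp(log_lik - log_lik.max())
        pi /= pi.sum()
        path.append(pi.copy())
    return np.array(path)
```

Plotting such paths for different primitives reproduces the qualitative regimes: concentration at a vertex versus persistent interior mixing.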
Two sufficient conditions are provided for ergodicity:
- Angle condition – Suppose there exists an interior fixed point S* of the mean‑field log‑odds drift ξ(S). If for every S≠S* the inner product ⟨ξ(S), S−S*⟩ < 0, then the drift points back toward S* everywhere, creating a restoring force. A Lyapunov function based on the Euclidean distance to S* proves that the chain is positive recurrent and converges to a unique stationary distribution μ supported on int Δ_M.
- Spectral condition – Write the drift as a quasi‑gradient system with matrix G. If the symmetric part Sym(G) is negative definite and the stochastic noise is sufficiently small relative to the drift magnitude, then the interior point is stochastically stable. This condition exploits the softmax potential geometry (Amari & Nagaoka) and uses Khasminskii‑Meyn‑Tweedie criteria for general‑state Markov chains.
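Both conditions are straightforward to probe numerically. The sketch below uses a linear drift ξ(S) = G(S − S*) purely for illustration; the matrix G, the fixed point, and the sampling grid are our choices, and sampling only gives evidence, not a proof, of the angle condition:

```python
import numpy as np

def spectral_condition(G):
    # Negative definiteness of the symmetric part Sym(G) = (G + G^T)/2:
    # all eigenvalues must be strictly below zero.
    return bool(np.all(np.linalg.eigvalsh(0.5 * (G + G.T)) < 0))

def angle_condition(xi, S_star, samples):
    # Check <xi(S), S - S*> < 0 at every sampled S != S*,
    # i.e. the drift points back toward the interior fixed point.
    return all(np.dot(xi(S), S - S_star) < 0 for S in samples)

# Illustrative quasi-gradient drift: rotation plus contraction.
G = np.array([[-2.0, 1.0],
              [-1.0, -2.0]])
S_star = np.array([0.5, 0.5])
xi = lambda S: G @ (S - S_star)

rng = np.random.default_rng(1)
samples = S_star + rng.normal(scale=0.2, size=(100, 2))
print(spectral_condition(G), angle_condition(xi, S_star, samples))
```

For this G one can verify directly that ⟨Gx, x⟩ = xᵀSym(G)x = −2‖x‖², so the two checks agree, as the quasi-gradient structure suggests they should.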
When either condition fails, the posterior mass moves toward the boundary of the simplex: some models receive vanishing posterior weight, effectively eliminating them. The dynamics then restrict to a lower‑dimensional face Δ_I, and the same interior‑vs‑boundary tests can be applied recursively. This dimensionality reduction yields a hierarchy of possible terminal behaviors: convergence to a single vertex (full concentration on one model), ergodicity on a lower‑dimensional face (persistent mixing among a subset of models), or continued interior mixing if the conditions hold at every reduced level. The authors show that as the number of models M grows, satisfying the interior ergodicity conditions becomes increasingly stringent, so in large model classes one should expect either model elimination or low‑dimensional mixing rather than full‑dimensional ergodicity.
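The recursive restriction step can be mimicked on simulated belief paths: drop models whose terminal posterior weight has effectively vanished, renormalize on the surviving face, and re-run the same tests there. The tolerance and function name below are our illustrative choices:

```python
import numpy as np

def eliminate_and_restrict(path, tol=1e-4):
    """Restrict a simulated belief path to the surviving face.

    path : (T, M) array of posterior vectors over M models.
    Returns the indices I of models with non-negligible terminal weight
    and the terminal belief renormalized on the face Delta_I, to which
    the interior-vs-boundary tests can be applied again.
    """
    terminal = path[-1]
    I = np.flatnonzero(terminal > tol)       # models that survive
    face = terminal[I] / terminal[I].sum()   # renormalize on the face
    return I, face
```

Iterating this until the ergodicity tests hold (or a vertex is reached) traces out the hierarchy of terminal behaviors described above.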
The paper situates its contributions relative to three strands of literature: (i) learning under misspecification (Berk, White, etc.), where prior work typically studies deterministic best‑response dynamics; (ii) evolutionary game theory linking Bayesian updating to replicator dynamics, which the authors extend to the stochastic action selection of TS; and (iii) recent bandit‑theoretic work on misspecified algorithms, which often relies on diffusion approximations. By contrast, this work provides a unified Markov‑process and Lyapunov‑based analysis that accommodates general structural misspecification of reward functions.
Practically, the results warn that deploying TS with a misspecified parametric model can be harmless only when the model‑action mapping aligns with the direction of statistical evidence. Otherwise, the algorithm may either lock onto a wrong model (linear regret) or keep exploring sub‑optimally forever (also linear regret). The angle and spectral conditions give practitioners diagnostic tools: checking the sign of expected log‑likelihood drifts and the curvature of the drift matrix can indicate whether the posterior will concentrate, mix, or eliminate models.
In summary, the paper delivers: (1) a complete classification of posterior dynamics and regret outcomes for the two‑arm misspecified Gaussian bandit; (2) a general stochastic‑stability framework for finite model classes, with clear sufficient conditions for ergodicity versus transience; (3) a recursive dimensionality‑reduction argument that characterizes terminal belief configurations; and (4) concrete implications for robust decision‑making in economic applications where model misspecification is unavoidable.