Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

Reading time: 5 minutes
...

📝 Original Info

  • Title: Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains
  • ArXiv ID: 2602.16274
  • Date: 2026-02-18
  • Authors: Not specified in the source data.

📝 Abstract

We present the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $\epsilon_n$-Greedy exploration scheme that combines $\epsilon_n$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest as the contraction factor in our bound is governed by the mixing time and is allowed to converge to one asymptotically.

💡 Deep Analysis

📄 Full Content

Reinforcement Learning (RL) provides a principled framework for sequential decision-making and has enabled major advances across domains ranging from large language models (LLMs) (Ouyang et al., 2022) to robotics (Levine et al., 2016) and healthcare (Yu et al., 2021). Among the many RL algorithms developed over the past decades, Q-learning (Watkins & Dayan, 1992) remains a seminal method due to its simplicity and model-free nature, and underlies influential modern approaches such as Deep Q-Networks (DQN) (Mnih et al., 2015). Classical analyses (Tsitsiklis, 1994; Borkar & Meyn, 2000) established convergence guarantees only in the asymptotic regime, which provide limited insight into practical, data-limited settings and motivate a growing body of finite-sample and finite-time analyses (Beck & Srikant, 2012; Qu & Wierman, 2020; Li et al., 2024).

In this work, we study the sample complexity and regret of online Q-learning in infinite-horizon discounted Markov decision processes. The majority of existing work on the sample complexity of Q-learning focuses on the offline setting, where state-action transitions $\{(s_n, a_n)\}_{n \in \mathbb{N}}$ are generated by a fixed behavioral policy and the objective is to estimate the optimal $Q^\star$ (Qu & Wierman, 2020; Li et al., 2022; 2024). In contrast, online Q-learning updates Q-value estimates while interacting with the environment, with the evolving estimates directly determining the actions selected by the agent. Offline analyses primarily focus on controlling the estimation error $\|Q_n - Q^\star\|_\infty$ and, as such, do not provide guarantees on the performance of the policy executed during learning. On the other hand, existing regret analyses for online Q-learning in this setting either consider model-based methods (He et al., 2021) or variants of Q-learning augmented with explicit optimism (Ji & Li, 2023).
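
To make the online setting concrete, here is a minimal sketch of a tabular, asynchronous Q-learning loop in which the same Q-estimate both drives action selection and is updated from the observed transition. The environment interface (`reset`, `step`), the step-size rule, and the `select_action` hook are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def online_q_learning(env, num_states, num_actions, select_action,
                      gamma=0.99, num_steps=10_000):
    """Minimal online (asynchronous) tabular Q-learning sketch.

    Assumptions: `env.reset() -> s` and `env.step(a) -> (s_next, r)` with
    integer states/actions; `select_action(Q, s, n)` is the exploration
    policy (e.g. Boltzmann or epsilon_n-greedy). Interfaces are illustrative.
    """
    Q = np.zeros((num_states, num_actions))
    visits = np.zeros((num_states, num_actions))
    s = env.reset()
    for n in range(num_steps):
        a = select_action(Q, s, n)                 # action chosen by the evolving estimate
        s_next, r = env.step(a)                    # single online transition
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]                 # illustrative 1/n step size
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])   # standard Q-value update
        s = s_next
    return Q
```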

Despite the practical importance of online Q-learning with simple exploration strategies, prior work does not provide regret or sample complexity guarantees for classical, fully model-free Q-learning without optimism terms. Most notably, Deep Q-Networks (DQN) (Mnih et al., 2015; 2013), one of the most influential algorithms in modern reinforcement learning, employs ϵ-greedy exploration. Similarly, Boltzmann (softmax) exploration has been widely adopted in reinforcement learning applications due to its natural differentiability and smooth exploration properties (Sutton et al., 1998). Yet even for these foundational exploration strategies, no theoretical guarantees on sample complexity or regret have been established.
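
For concreteness, the two exploration rules mentioned above can be written as follows for a single state's Q-values; the parameter names and the numerically stabilized softmax are illustrative implementation choices rather than definitions taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon act uniformly at random, otherwise greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature):
    """Sample an action from softmax(Q / temperature)."""
    logits = q_values / temperature
    logits = logits - logits.max()                 # numerical stabilization
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))
```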

To bridge this gap, we establish sample complexity and regret guarantees for Q-learning with standard Q-value updates under different exploration policies. We first analyze the widely used Boltzmann Q-learning algorithm, in which actions are selected according to a Boltzmann (softmax) exploration rule. We then study a Smoothed $\epsilon_n$-Greedy (SϵG) Q-learning algorithm, whose exploration policy is a convex combination of the uniform policy and the softmax policy. The smoothing is introduced solely to facilitate the analysis and does not alter the practical behavior of the algorithm, which closely mirrors standard ϵ-greedy exploration. Our results rely on a novel concentration bound for a Markovian stochastic approximation (SA) framework that allows the contraction factor to converge to one asymptotically.
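
Based on the description above, the SϵG exploration policy mixes the uniform distribution with a softmax distribution over the current Q-values. The sketch below follows that convex-combination form; the particular values of $\epsilon_n$ and the temperature are placeholders, since the paper's schedules are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_eps_greedy(q_values, eps_n, temperature):
    """Smoothed epsilon_n-greedy: pi = eps_n * uniform + (1 - eps_n) * softmax(Q / tau).

    The convex-combination form follows the text; the eps_n and temperature
    schedules are assumptions for illustration.
    """
    num_actions = len(q_values)
    uniform = np.full(num_actions, 1.0 / num_actions)
    logits = q_values / temperature
    logits = logits - logits.max()                 # numerical stabilization
    softmax = np.exp(logits) / np.exp(logits).sum()
    probs = eps_n * uniform + (1.0 - eps_n) * softmax
    return int(rng.choice(num_actions, p=probs))
```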

We summarize the main contributions of this work below. In addition to establishing regret and sample complexity guarantees for classical online Q-learning in infinite-horizon discounted Markov decision processes, we develop a novel analytical framework for Markovian stochastic approximation that may be of independent interest.

Regret Guarantees for Boltzmann and SϵG Q-learning. We establish the first regret guarantees for classical, fully model-free online Q-learning without relying on additional optimism terms. We first characterize how the regret of Boltzmann Q-learning depends on the suboptimality gap of the Markov decision process, showing that it can be arbitrarily close to linear for small gaps. To address this limitation, we analyze the Smoothed $\epsilon_n$-Greedy (SϵG) Q-learning algorithm and show that it achieves a regret bound of near-$\tilde{O}(N^{9/10})$. While this rate is worse than the $\tilde{O}(\sqrt{N})$ dependence achieved by optimism-based methods, it demonstrates that sublinear regret is attainable for unmodified online Q-learning.
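
For readers unfamiliar with regret in this setting, one common way to formalize it in infinite-horizon discounted online RL is as the cumulative value shortfall of the policies executed during learning; this is only an indicative form, since the paper's precise definition is not reproduced in the excerpt above:

$$\mathrm{Regret}(N) \;=\; \sum_{n=1}^{N} \Big( V^{\star}(s_n) - V^{\pi_n}(s_n) \Big),$$

where $\pi_n$ denotes the exploration policy induced by the Q-estimate at step $n$ and $s_n$ the state visited at that step.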

Sample Complexity of Online Q-learning. We provide high-probability sample complexity guarantees exhibiting sub-Gaussian tail behavior for classical online Q-learning under Boltzmann and SϵG exploration. Our results show how the exploration parameters, specifically the temperature schedule for the softmax exploration and the sequence $\epsilon_n$ for ϵ-greedy policies, affect the convergence rate of the learned Q-values to $Q^\star$. Moreover, our concentration bounds enable us to establish almost sure convergence of the Q-learning iterates, which is nontrivial for online Q-learning.
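
As a concrete (hypothetical) illustration of such schedules, polynomially decaying temperature and $\epsilon_n$ sequences could look as follows; the exponents are arbitrary placeholders and do not correspond to the rates proved in the paper.

```python
def temperature_schedule(n, tau0=1.0, decay=0.2):
    """Illustrative decaying softmax temperature: tau_n = tau0 / (n + 1)**decay."""
    return tau0 / (n + 1) ** decay

def epsilon_schedule(n, eps0=1.0, decay=0.2):
    """Illustrative decaying exploration rate: eps_n = eps0 / (n + 1)**decay."""
    return eps0 / (n + 1) ** decay
```

These would slot into the SϵG sketch above as `smoothed_eps_greedy(Q[s], epsilon_schedule(n), temperature_schedule(n))`.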

Reference

This content is AI-processed based on open access ArXiv data.
