Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle


We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes, where the absence of contraction poses a fundamental challenge. Existing non-asymptotic results overcome this challenge by either imposing strong assumptions to enforce seminorm contraction or relying on discounted or episodic Markov decision processes as successive approximations, which either require unknown parameters or result in suboptimal sample complexity. In this work, under a reachability assumption, we establish optimal $\widetilde{O}(\varepsilon^{-2})$ sample complexity guarantees (up to logarithmic factors) for a simple variant of synchronous and asynchronous Q-learning that samples from the lazified dynamics, where the system remains in the current state with some fixed probability. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.


💡 Research Summary

This paper addresses a long‑standing gap in the theory of average‑reward reinforcement learning: establishing the optimal Õ(ε⁻²) sample complexity for model‑free Q‑learning without imposing artificial contraction assumptions. In discounted MDPs the Bellman operator is a contraction under the supremum norm, which enables clean finite‑sample analyses. In contrast, the average‑reward Bellman operator is merely non‑expansive under the span seminorm, and existing non‑asymptotic results either (i) assume a strong J‑step γ‑contraction (which requires unknown parameters and often yields sub‑optimal ε‑dependence), (ii) approximate the problem by a discounted MDP with a time‑varying discount factor, or (iii) control weaker criteria such as Bellman residuals. All of these approaches lead to sample complexities scaling as ε⁻⁶, ε⁻⁹, or worse.

The authors propose a fundamentally different route. They first introduce a lazy transformation of the original transition kernel: with probability α (chosen as ½ in the analysis) the process follows the original dynamics, and with probability 1‑α it stays in the current state. Lemma 3.1 shows that this transformation preserves the optimal average reward and the set of optimal policies; the optimal Q‑function under the lazy MDP differs from the original only by a state‑dependent additive term. Consequently, learning the lazy Q‑function is sufficient for solving the original problem.
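Concretely, the lazy transformation replaces the kernel P by P̃ = αP + (1 − α)I. A small numerical sketch (with an arbitrary example chain, not one from the paper) illustrates the idea behind Lemma 3.1: any stationary distribution of P is also stationary for P̃, so lazification does not change a policy's long-run average reward.

```python
import numpy as np

alpha = 0.5
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])                      # example 2-state kernel
P_lazy = alpha * P + (1 - alpha) * np.eye(2)    # lazified kernel P~

# Stationary distribution of P via the leading left eigenvector.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi /= pi.sum()

# pi P~ = alpha * (pi P) + (1 - alpha) * pi = pi, so pi is stationary
# for the lazified kernel as well.
assert np.allclose(pi @ P_lazy, pi)
```

Since the stationary distributions coincide and rewards are untouched, the gain of every stationary policy is preserved, which is why learning on the lazy MDP suffices.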

The second key ingredient is the construction of an instance‑dependent seminorm denoted sp̂(·). This seminorm is built from the expected hitting times to a distinguished reference state s†, which is guaranteed to be reachable from any state under any stationary deterministic policy (Assumption 1). The seminorm shares the same null space as the classical span seminorm but rescales each coordinate by the corresponding hitting‑time weight. Under this seminorm the lazy Bellman operator becomes a one‑step contraction: there exists a constant δ > 0 (explicitly δ = (1‑α)/K, where K is the maximal expected hitting time) such that

‖T_{P̃}Q − T_{P̃}Q′‖_{sp̂} ≤ (1 − δ)‖Q − Q′‖_{sp̂}.

This result is the “new contraction principle” of the title and eliminates the need for any external γ‑contraction assumption.
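The exact construction of sp̂ uses hitting-time weights from the paper; as a toy illustration only (the weight form below is a hypothetical stand-in, not the paper's definition), a weighted span of deviations from a reference coordinate is still a seminorm that vanishes exactly on constant vectors, the same null space as the classical span:

```python
import numpy as np

def span(v):
    """Classical span seminorm sp(v) = max(v) - min(v); null space: constants."""
    return np.max(v) - np.min(v)

def weighted_span(v, w, ref=0):
    """Hypothetical weighted variant: deviations from a reference coordinate
    (standing in for the reference state s-dagger) are rescaled by positive
    weights w before taking the span.  Constant vectors still map to 0, so
    the null space matches sp(.)."""
    d = (v - v[ref]) / w
    return np.max(d) - np.min(d)
```

Sharing the null space of sp(·) matters because average-reward Q-functions are only identified up to an additive constant; any seminorm used for the contraction argument must ignore exactly that degree of freedom.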

Armed with a genuine contraction, the authors design two families of algorithms:

  1. Synchronous Lazy Q‑learning (Algorithms 1 and 2). In the explicit version the algorithm cycles through all |S||A| state‑action pairs each epoch; in the implicit version it samples a transition from the current state, applies the lazy self‑loop with probability ½, and performs the usual Q‑update using a diminishing stepsize η_t = c/t. The analysis shows that after T = Õ(|S||A| ε⁻²) iterations the final Q‑estimate satisfies sp̂(Q_T – Q*) ≤ ε with high probability, and the greedy policy is ε‑optimal.

  2. Asynchronous Lazy Q‑learning (Algorithms 3 and 4). Here data are collected along a single trajectory generated by the current policy (the “online” setting). The same lazy self‑loop is applied, and the stepsize is adapted to the visitation frequency of each state‑action pair. The authors prove a sample complexity of Õ(ε⁻²) (the dependence on |S| and |A| is hidden inside the mixing‑time constants). This matches the minimax lower bound for average‑reward model‑based methods, showing that the model‑free approach incurs no penalty.
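As an illustration of both families, here is a minimal Python sketch: a synchronous RVI-style lazy Q-learning loop (in the spirit of Algorithms 1–2) and a single asynchronous update with a visitation-count stepsize (in the spirit of Algorithms 3–4). The RVI-style reference pair `ref` and the exact stepsize schedules are illustrative assumptions, not the paper's precise recursions.

```python
import numpy as np

def sync_lazy_q_learning(P, R, T, alpha=0.5, ref=(0, 0), c=1.0, seed=0):
    """Synchronous sketch: each epoch updates every (s, a) pair from one
    freshly sampled lazified transition, with stepsize eta_t = c / t.
    P[s, a] is a next-state distribution, R[s, a] a deterministic reward.
    Subtracting Q[ref] (RVI style) is an illustrative normalization."""
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    for t in range(1, T + 1):
        eta = c / t
        for s in range(S):
            for a in range(A):
                # Lazified dynamics: follow P with prob. alpha, else self-loop.
                s_next = rng.choice(S, p=P[s, a]) if rng.random() < alpha else s
                target = R[s, a] + Q[s_next].max() - Q[ref]
                Q[s, a] += eta * (target - Q[s, a])
    return Q

def async_lazy_q_step(Q, N, s, a, r, s_next, ref=(0, 0)):
    """Asynchronous sketch: one update along a single trajectory, with the
    stepsize adapted to the visitation count N[s, a] of the updated pair."""
    N[s, a] += 1
    target = r + Q[s_next].max() - Q[ref]
    Q[s, a] += (1.0 / N[s, a]) * (target - Q[s, a])
    return Q
```

In both variants the lazy self-loop is applied at sampling time, so no knowledge of the transition kernel beyond the simulator is needed.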

The proofs consist of two main parts. First, the authors formalize the seminorm using extremal-norm theory and the reachability assumption, establishing the contraction constant δ. Second, they apply standard stochastic-approximation techniques: defining a Lyapunov function V_t = sp̂(Q_t − Q*)² and showing that its conditional expectation contracts at every step up to stepsize-dependent noise terms, which yields the stated Õ(ε⁻²) rates.
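Without reproducing the paper's exact constants, the Lyapunov argument typically closes with a stochastic-approximation recursion of the following standard shape (symbols illustrative rather than the paper's own):

E[V_{t+1} | ℱ_t] ≤ (1 − δ η_t) V_t + C η_t²,

where C collects variance terms; with η_t = c/t this telescopes to E[V_T] = Õ(1/T), i.e. T = Õ(ε⁻²) iterations suffice for sp̂(Q_T − Q*) ≤ ε.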

