Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning


We present a non-asymptotic convergence analysis of $Q$-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) under contamination, total-variation (TV) distance, and Wasserstein uncertainty sets. A key ingredient of our analysis is showing that the optimal robust $Q$ operator is a strict contraction with respect to a carefully designed semi-norm (with constant functions quotiented out). This property enables a stochastic approximation update that learns the optimal robust $Q$-function using $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also provide an efficient routine for robust $Q$-function estimation, which in turn facilitates robust critic estimation. Building on this, we introduce an actor-critic algorithm that learns an $\epsilon$-optimal robust policy within $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We provide numerical simulations to evaluate the performance of our algorithms.


💡 Research Summary

This paper addresses the challenging problem of Robust Average-Reward Reinforcement Learning under model misspecification. In many practical applications like robotics and scheduling, agents are trained in simulators but deployed in the real world, where transition dynamics may differ. The average-reward criterion, which optimizes long-term throughput rather than discounted short-term gains, is natural for these domains but introduces significant analytical hurdles when combined with robustness requirements.

The authors focus on model-free learning for robust average-reward Markov Decision Processes (MDPs) with (s,a)-rectangular uncertainty sets, including contamination, total-variation (TV) distance, and Wasserstein distance models. The core challenge is the absence of a discount factor, which traditionally provides the contraction property essential for analyzing Q-learning and actor-critic methods. To overcome this, the paper's first major contribution is a novel theoretical analysis proving that the optimal robust average-reward Bellman operator is a strict contraction with respect to a carefully designed span semi-norm (where constant functions are quotiented out). This contraction holds uniformly over all deterministic policies and admissible uncertainty sets under standard ergodicity assumptions.
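
The span semi-norm contraction can be observed numerically. The sketch below (our own illustration, not the paper's code) applies the robust Bellman operator under a contamination set to two arbitrary Q-functions on a small random MDP and checks that their span distance shrinks; all function names and the plug-in form of the support function are assumptions for illustration.

```python
import numpy as np

def span(v):
    """Span semi-norm ||v|| = max(v) - min(v); constant functions have span 0."""
    return v.max() - v.min()

def support_contamination(p, v, radius):
    """Support function sigma_P(v) for the contamination set
    {(1-R) p + R q : q any distribution}: the adversary places its
    R mass on the v-minimizing state (illustrative closed form)."""
    return (1 - radius) * p @ v + radius * v.min()

def robust_bellman(Q, r, P, radius):
    """One application of the optimal robust average-reward Bellman
    operator under contamination (sketch)."""
    V = Q.max(axis=1)  # greedy value function
    TQ = np.empty_like(Q)
    for s in range(Q.shape[0]):
        for a in range(Q.shape[1]):
            TQ[s, a] = r[s, a] + support_contamination(P[s, a], V, radius)
    return TQ

# Tiny random MDP with dense (hence ergodic) transitions.
rng = np.random.default_rng(0)
nS, nA, radius = 4, 2, 0.1
r = rng.random((nS, nA))
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution

Q1, Q2 = rng.random((nS, nA)), rng.random((nS, nA))
T1, T2 = robust_bellman(Q1, r, P, radius), robust_bellman(Q2, r, P, radius)
ratio = span((T1 - T2).ravel()) / span((Q1 - Q2).ravel())
print(f"span contraction ratio: {ratio:.3f}")  # strictly below 1 here
```

Note that the adversarial term `radius * v.min()` is constant across all (s, a) pairs, so the semi-norm quotients it out; this is exactly why the span, rather than the sup norm, is the right yardstick.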

Leveraging this uniform contraction property, the paper makes two key algorithmic contributions with non-asymptotic sample complexity guarantees:

  1. Robust Q-Learning: The authors design a model-free robust Q-learning algorithm. By exploiting the semi-norm contraction, they prove that the algorithm converges to an ε-optimal robust Q-function with sample complexity Õ(ε^{-2}). The constants in this bound are independent of the policy sequence, a significant advance over prior policy-dependent asymptotic results.
  2. Robust Actor-Critic: An actor-critic algorithm is developed in which a critic subroutine estimates the robust Q-function and the actor updates the policy via mirror ascent. A critical component is establishing uniform convergence bounds for TD-based critic estimation that hold simultaneously for all policies, which enables the analysis of a changing policy sequence. Integrating these uniform critic error bounds into the mirror-ascent analysis, the authors show that the actor converges to an ε-optimal robust policy within O(log T) iterations, yielding an overall sample complexity of Õ(ε^{-2}). This extends prior robust mirror-descent frameworks for average-reward MDPs, which assumed access to an exact robust gradient oracle, to the practical setting with finite-sample critic errors.
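
A minimal sketch of the robust Q-learning idea from item 1 is given below. It is our own illustration under assumed details, not the paper's exact recursion: the sampled target uses a plug-in contamination estimate of σ_{P_s^a}(V), and subtracting Q at a fixed reference pair anchors the iterates (RVI-style), since the operator contracts only modulo constant shifts.

```python
import numpy as np

def robust_td_target(Q, r_sa, s_next, radius):
    """Sampled robust target under contamination: mix the observed next
    state with the pessimistic min of the greedy value (plug-in
    estimator; assumed form for illustration)."""
    V = Q.max(axis=1)
    return r_sa + (1 - radius) * V[s_next] + radius * V.min()

rng = np.random.default_rng(1)
nS, nA, radius = 5, 2, 0.1
r = rng.random((nS, nA))
P = rng.dirichlet(np.ones(nS), size=(nS, nA))

Q = np.zeros((nS, nA))
ref = (0, 0)  # reference pair; anchors the otherwise shift-invariant iterates

for t in range(20000):
    # Generative-model sampling: draw a uniformly random (s, a) pair.
    s, a = rng.integers(nS), rng.integers(nA)
    s_next = rng.choice(nS, p=P[s, a])
    alpha = 1.0 / (1 + t) ** 0.6  # diminishing step size
    td = robust_td_target(Q, r[s, a], s_next, radius) - Q[ref] - Q[s, a]
    Q[s, a] += alpha * td

print("learned robust Q:\n", np.round(Q, 3))
```

The actor-critic method in item 2 would wrap a critic estimate of this kind inside a mirror-ascent policy update; we omit that outer loop here.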

The paper also discusses practical implementations of the support function σ_{P_s^a}(V) under each uncertainty set and provides numerical simulations to evaluate the proposed algorithms. Overall, this work provides the first model-free algorithms for robust average-reward MDPs with end-to-end non-asymptotic sample-complexity guarantees, without requiring oracles for robust Q-functions or gradients, significantly advancing the theoretical foundations for robust long-term planning directly from data.
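
For the contamination and TV sets, the support function admits simple closed-form or greedy evaluations; the sketch below is our own illustration of these standard forms (the Wasserstein case reduces to a small transport linear program and is omitted). Function names and the greedy TV routine are assumptions, not the paper's implementation.

```python
import numpy as np

def support_contamination(p, v, radius):
    """sigma_P(v) for the contamination set {(1-R) p + R q}:
    the adversary puts its R mass on the v-minimizing state."""
    return (1 - radius) * p @ v + radius * v.min()

def support_tv(p, v, delta):
    """sigma_P(v) for a TV ball of radius delta: greedily move up to
    delta probability mass from the highest-v states onto argmin v."""
    q, budget, sink = p.astype(float).copy(), delta, int(np.argmin(v))
    for i in np.argsort(v)[::-1]:  # largest v first
        if budget <= 0:
            break
        if i == sink:
            continue
        move = min(q[i], budget)
        q[i] -= move
        q[sink] += move
        budget -= move
    return q @ v

p = np.array([0.25, 0.25, 0.25, 0.25])
v = np.array([1.0, 2.0, 3.0, 4.0])
print(support_contamination(p, v, 0.1))  # ~2.35: 0.9 * 2.5 + 0.1 * 1
print(support_tv(p, v, 0.1))             # ~2.2: 0.1 mass moved from v=4 to v=1
```

Both routines run in O(|S| log |S|) time per (s, a) pair, which is what makes the per-sample cost of the Q-learning and critic updates modest.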

