Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning
Motivated by practical applications where stable long-term performance is critical, such as robotics, operations research, and healthcare, we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.
💡 Research Summary
This paper addresses a critical gap in reinforcement learning (RL) research: the lack of finite‑sample guarantees for distributionally robust (DR) average‑reward problems. While most prior work on DR‑RL has focused on discounted or finite‑horizon settings, many real‑world applications—such as robotics, operations research, and healthcare—require stable long‑run performance measured by the average reward. The authors propose two algorithms that achieve near‑optimal sample complexity for learning the optimal robust policy and the robust average reward in a tabular setting.
The first algorithm reduces the DR average‑reward MDP (DR‑AMDP) to a DR discounted‑reward MDP (DR‑DMDP). By carefully selecting a discount factor $\gamma$ close to one, the reduction introduces only a small bias that can be controlled by the mixing time of the underlying nominal MDP. Existing DR‑DMDP solvers (e.g., robust value iteration, robust Q‑learning) are then applied. The analysis shows that, under a uniformly ergodic nominal MDP, the reduction yields a sample complexity of $\widetilde{O}(|\mathcal{S}||\mathcal{A}|\,(1-\gamma)^{-2}\varepsilon^{-2})$. Choosing $\gamma = 1 - \Theta(1/t_{\text{mix}})$ trades the dependence on $1-\gamma$ for the mixing time and results in the final bound $\widetilde{O}(|\mathcal{S}||\mathcal{A}|\,t_{\text{mix}}^{2}\varepsilon^{-2})$.
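The shape of the reduction can be sketched in a few lines. The sketch below is a simplified, hypothetical instance, not the authors' implementation: the uncertainty set is a small finite family of kernels rather than a KL or $f_k$ ball, the constant in $\gamma = 1 - \Theta(1/t_{\text{mix}})$ is chosen arbitrarily, and `reduced_dr_value_iteration` is an illustrative name.

```python
def reduced_dr_value_iteration(kernels, r, t_mix, eps=1e-6):
    """Reduction sketch: solve a DR *discounted* MDP with discount
    gamma = 1 - Theta(1/t_mix), then read off the robust average reward
    as (1 - gamma) * V.  The uncertainty set here is a finite family of
    kernels (a simplification of the KL/f_k balls in the paper);
    kernels[k][s][a] is a probability vector over next states."""
    gamma = 1.0 - 1.0 / (4.0 * t_mix)   # illustrative choice of constant
    n_s = len(r)
    V = [0.0] * n_s
    while True:
        V_new = [
            max(
                r[s][a] + gamma * min(        # adversary picks worst kernel
                    sum(P[s][a][t] * V[t] for t in range(n_s))
                    for P in kernels
                )
                for a in range(len(r[s]))
            )
            for s in range(n_s)
        ]
        gap = max(abs(a - b) for a, b in zip(V_new, V))
        V = V_new
        if gap < eps:
            break
    # Under uniform ergodicity, (1 - gamma) * V(s) approximates the
    # robust average reward from every state s.
    return V, (1.0 - gamma) * V[0]
```

On a two-state example with one action and two candidate kernels, the adversary always selects the kernel that shifts mass toward the low-reward state, and the returned $(1-\gamma)V$ lands near that kernel's stationary average reward, up to the bias incurred because $\gamma$ is not taken all the way to one.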
The second algorithm introduces an “anchoring” state. For every policy, a small probability α forces a transition to a designated anchor state, thereby guaranteeing that all transition kernels satisfy a Doeblin minorization condition with the same minorization time. This structural modification ensures that every MDP in the uncertainty set remains unichain and uniformly ergodic, which resolves a longstanding modeling issue: standard uncertainty sets can contain non‑unichain MDPs that break the average‑reward Bellman equations. With the anchor in place, a direct robust average‑reward value or policy iteration can be performed, and the same $\widetilde{O}(|\mathcal{S}||\mathcal{A}|\,t_{\text{mix}}^{2}\varepsilon^{-2})$ sample complexity is achieved without any a priori knowledge of the mixing time.
Both algorithms assume KL‑divergence or general $f_k$‑divergence (including $\chi^2$) uncertainty sets with a sufficiently small radius $\delta$. The analysis leverages uniform ergodicity of the nominal MDP: for any stationary deterministic policy $\pi$, the induced Markov chain mixes to a unique stationary distribution within $t_{\text{mix}}$ steps. The authors define a unified minorization time $t_{\text{minorize}}$ that is equivalent to $t_{\text{mix}}$ up to constant factors, and they show that the sample complexity depends quadratically on this mixing parameter, which is unavoidable for average‑reward problems.
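For the KL ball, the inner worst-case expectation is typically computed through its convex dual, which turns the minimization over distributions into a one-dimensional search. A pure-Python sketch of this standard dual representation (not spelled out in the summary; the search bounds and iteration count are arbitrary choices):

```python
import math

def kl_worst_case_mean(p, v, delta, lo=1e-4, hi=10.0, iters=100):
    """Worst-case expectation over a KL ball of radius delta around the
    nominal distribution p, via the standard dual form
        inf_{q: KL(q||p) <= delta} E_q[v]
          = sup_{lam > 0}  -lam * log E_p[exp(-v/lam)] - lam * delta.
    The dual objective is concave in lam, so a ternary search suffices."""
    vmin = min(v)

    def obj(lam):
        # Shift by vmin for numerical stability (all exponents stay <= 0).
        s = sum(pi * math.exp(-(vi - vmin) / lam) for pi, vi in zip(p, v))
        return vmin - lam * math.log(s) - lam * delta

    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if obj(m1) < obj(m2):
            lo = m1
        else:
            hi = m2
    return obj(0.5 * (lo + hi))
```

As $\delta \to 0$ the value approaches the nominal mean $\mathbb{E}_p[v]$, and as $\delta$ grows it decreases toward $\min_i v_i$, which is the sense in which a small radius keeps the robust problem close to the nominal one.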
The theoretical contributions are threefold: (1) they provide structural conditions on the uncertainty set that guarantee stability (uniform ergodicity) for all admissible transition kernels; (2) they derive a reduction‑based sample‑complexity bound for DR‑DMDPs that improves the dependence on the effective horizon from $(1-\gamma)^{-4}$ to $(1-\gamma)^{-2}$; (3) they design an anchored algorithm whose output coincides with the reduction approach under a suitable choice of the anchoring probability, while requiring no prior knowledge of model‑specific parameters.
Empirical validation is performed on simulated robotics control, inventory management, and healthcare scheduling tasks. In each domain, the nominal MDP is constructed, and KL or χ² uncertainty sets are imposed. The proposed methods converge faster than prior DR‑DMDP baselines and maintain stable average rewards even as the uncertainty radius grows. The anchored algorithm, in particular, exhibits robustness to larger δ values because the anchoring mechanism enforces a uniform lower bound on transition probabilities.
In summary, this work delivers the first finite‑sample convergence guarantees for distributionally robust average‑reward reinforcement learning. By combining uniform ergodicity, Doeblin minorization, and an anchoring construction, the authors achieve the optimal dependence on the state‑action cardinality and accuracy ($|\mathcal{S}||\mathcal{A}|\varepsilon^{-2}$) while only incurring a quadratic factor in the mixing time. The results close a major theoretical gap and open avenues for extending DR‑average‑reward methods to function approximation, continuous spaces, and multi‑agent settings.