Provably Efficient Algorithms for S- and Non-Rectangular Robust MDPs with General Parameterization

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv paper.

We study robust Markov decision processes (RMDPs) with general policy parameterization under s-rectangular and non-rectangular uncertainty sets. Prior work is largely limited to tabular policies and hence either lacks sample complexity guarantees or incurs high computational cost. Our method reduces average-reward RMDPs to entropy-regularized discounted robust MDPs, restoring strong duality and enabling tractable equilibrium computation. We prove novel Lipschitz and Lipschitz-smoothness properties for general policy parameterizations that extend to infinite state spaces. To address infinite-horizon gradient estimation, we introduce a multilevel Monte Carlo gradient estimator with $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity, a factor of $\mathcal{O}(ε^{-2})$ improvement over prior work. Building on this, we design a projected gradient descent algorithm for s-rectangular uncertainty ($\mathcal{O}(ε^{-5})$) and a Frank–Wolfe algorithm for non-rectangular uncertainty ($\mathcal{O}(ε^{-4})$ discounted, $\mathcal{O}(ε^{-10.5})$ average reward), significantly improving prior results in both the discounted and average-reward settings. Our work is the first to provide sample complexity guarantees for RMDPs with general policy parameterization beyond $(s, a)$-rectangularity. It also provides the first such guarantees in the average-reward setting and improves existing bounds for discounted robust MDPs.


💡 Research Summary

This paper tackles a fundamental limitation in robust Markov decision processes (RMDPs): the inability to provide provable sample‑complexity guarantees when policies are represented by general function approximators (e.g., neural networks) and when the uncertainty set over transition kernels is either s‑rectangular or fully non‑rectangular. Existing works are either confined to tabular policies or to (s,a)‑rectangular uncertainty, which either leads to prohibitive computational cost or to a lack of theoretical guarantees, especially in the average‑reward setting where strong duality often fails.

The authors introduce a two‑step reduction that first converts an average‑reward robust MDP into an entropy‑regularized discounted robust MDP. By adding a τ‑scaled negative log‑policy term to the reward and selecting a discount factor γ = 1 – Θ(ε/H) (where H is the span of the optimal average‑reward policy), the Bellman operator regains its γ‑contraction property and strong max‑min duality is restored. This reduction enables the use of powerful tools from discounted robust MDP theory while preserving the original average‑reward objective up to an ε‑optimality gap.
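In symbols, the reduction described above can be sketched as follows (the exact constants and regularizer form are in the paper; this rendering just combines the τ-scaled negative log-policy term and the discount choice stated in the summary):

```latex
% Entropy-regularized reward: add the tau-scaled negative log-policy term.
\tilde r_{\tau}(s, a) \;=\; r(s, a) \;-\; \tau \log \pi_{\theta}(a \mid s),
\qquad
\gamma \;=\; 1 - \Theta\!\left(\tfrac{\varepsilon}{H}\right).

% The average-reward robust objective is then approximated, up to an
% \varepsilon-optimality gap, by the discounted entropy-regularized
% max--min problem:
\max_{\theta} \; \min_{\xi \in \Xi} \;
\mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\,
\tilde r_{\tau}(s_t, a_t) \,\right].
```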

A central technical contribution is the derivation of Lipschitz and smoothness bounds for the entropy‑regularized value function that are independent of the size of the state space. These bounds hold for any policy parameterization θ∈Θ and for linear parameterizations of the transition kernel P_ξ(s′|s,a)=⟨ϕ(s,a,s′),ξ⟩, where the feature map ϕ satisfies a bounded ℓ₁ norm. The authors further assume that the true uncertainty set can be approximated in Wasserstein‑1 distance by this linear family, which is a mild condition that scales to infinite or continuous state spaces.
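As a toy illustration of such a linear family P_ξ(s′|s,a)=⟨ϕ(s,a,s′),ξ⟩, the sketch below builds a hypothetical feature map in which each feature component is itself a valid transition kernel, so any simplex-weighted ξ induces a valid kernel (the construction and dimensions are illustrative, not the paper's):

```python
import numpy as np

# Toy instance of the linear transition family P_xi(s'|s,a) = <phi(s,a,s'), xi>.
S, A, d = 3, 2, 4  # illustrative sizes: states, actions, feature dimension

rng = np.random.default_rng(0)
# Random nonnegative features, normalized so that each feature component k
# is itself a valid kernel: sum over s' of phi[s, a, s', k] equals 1.
phi = rng.random((S, A, S, d))
phi /= phi.sum(axis=2, keepdims=True)

def kernel(xi):
    """Transition kernel induced by mixing weights xi (a point in Xi)."""
    return np.einsum("sapd,d->sap", phi, xi)

xi = np.array([0.4, 0.3, 0.2, 0.1])  # simplex point standing in for Xi
P = kernel(xi)
# Rows of P sum to one because xi lies on the simplex and each feature
# component is a kernel; nonnegativity is inherited from phi and xi.
```

Because the map ξ ↦ P_ξ is linear, Wasserstein-1 approximation of a general uncertainty set by this family composes cleanly with the Lipschitz bounds on the value function.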

To address the infinite‑horizon gradient estimation bottleneck, the paper proposes a multilevel Monte‑Carlo (MLMC) gradient estimator. By coupling simulations at multiple levels of temporal resolution, the estimator reduces bias rapidly while keeping variance under control, achieving an overall sample complexity of \tilde O(ε⁻²). This improves upon prior temporal‑difference based estimators, which typically require ε⁻⁴ samples.
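A minimal sketch of the geometric-level MLMC idea follows; the level distribution, truncation, and helper names are illustrative rather than the paper's exact estimator, and the toy "trajectory sampler" returns plain rewards. The key point is the telescoping coupling: a single trajectory supplies both the coarse (length 2^j) and fine (length 2^{j+1}) truncated returns, and dividing their difference by the level probability makes the estimator unbiased for the deepest truncation level:

```python
import random

def discounted_return(rewards, gamma):
    """Truncated discounted return G(T) = sum_t gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def mlmc_estimator(sample_traj, gamma, max_level):
    """One MLMC sample of the infinite-horizon discounted return.

    Levels j = 0..max_level are drawn with probability proportional to
    2^{-j}; the estimator telescopes to the horizon-2^{max_level+1}
    truncation in expectation: E[est] = G(2^{max_level+1}).
    """
    weights = [2.0 ** (-j) for j in range(max_level + 1)]
    norm = sum(weights)
    probs = [w / norm for w in weights]
    j = random.choices(range(max_level + 1), weights=probs)[0]
    # One trajectory of length 2^{j+1}; coarse and fine levels share it,
    # which is the coupling that keeps the variance of the correction small.
    traj = sample_traj(2 ** (j + 1))
    fine = discounted_return(traj, gamma)
    coarse = discounted_return(traj[: 2 ** j], gamma)
    base = discounted_return(sample_traj(1), gamma)
    return base + (fine - coarse) / probs[j]
```

Averaging many such samples gives a low-bias estimate while the expected per-sample trajectory length stays O(max_level), which is the mechanism behind the Õ(ε⁻²) bound.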

With these ingredients, two algorithmic families are designed:

  1. Projected Gradient Descent (PGD) for s‑rectangular uncertainty – Because each state’s transition set is independent, the worst‑case kernel can be optimized via a simple projection onto the convex set Ξ after each gradient step. Leveraging the Lipschitz‑smoothness and a gradient‑dominance property (Lemma 4.1), the authors prove that PGD reaches an ε‑optimal solution with O(ε⁻⁵) total samples.

  2. Frank‑Wolfe (FW) for non‑rectangular uncertainty – When state transitions are coupled, projection becomes expensive. The FW method only requires solving a linear minimization oracle at each iteration, which corresponds to finding the worst‑case direction for ξ. In the discounted setting the algorithm attains O(ε⁻⁴) sample complexity; in the average‑reward setting, after the discounted reduction, the complexity becomes O(ε⁻¹⁰·⁵) (the exponent reflects the dependence on the policy span H and the entropy regularization parameter).
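The two update rules above can be contrasted on a toy problem where the uncertainty set is the probability simplex (a sketch only: the paper's Ξ and gradients are problem-specific, and the quadratic objective here is a stand-in for the robust value):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0)

def pgd_step(xi, grad, eta):
    """Projected gradient descent: gradient step, then project onto Xi."""
    return project_simplex(xi - eta * grad)

def fw_step(xi, grad, t):
    """Frank-Wolfe: the linear minimization oracle over the simplex
    returns a vertex, so no projection is ever needed."""
    s = np.zeros_like(xi)
    s[np.argmin(grad)] = 1.0
    return xi + (2.0 / (t + 2.0)) * (s - xi)
```

On a separable (s-rectangular-like) set the projection in `pgd_step` decomposes per state and is cheap; when the set couples states (non-rectangular), the single linear oracle call in `fw_step` is the cheaper primitive, which is why the paper pairs FW with non-rectangular uncertainty.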

Both algorithms inherit the sample‑complexity guarantees from the MLMC estimator and the state‑independent smoothness constants, making them applicable to high‑dimensional or continuous domains. The paper also establishes a gradient‑dominance result for the linear kernel parameterization, ensuring global convergence even though the objective is non‑convex in ξ.

Overall, the contributions are fourfold: (i) a novel reduction that restores strong duality for average‑reward robust MDPs, (ii) state‑size‑independent Lipschitz/smoothness analysis for entropy‑regularized robust value functions, (iii) an MLMC gradient estimator with \tilde O(ε⁻²) sample complexity, and (iv) the first provably efficient algorithms (PGD and FW) with explicit sample‑complexity bounds for general policy parameterizations under both s‑rectangular and non‑rectangular uncertainty. These results close a major gap in the robust RL literature and open the door to scalable, theoretically‑grounded robust policy learning in realistic, high‑dimensional environments such as robotics, autonomous driving, and finance.

