Regret Analysis of Unichain Average Reward Constrained MDPs with General Parameterization


We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the unichain assumption and general policy parameterizations. Existing regret analyses for constrained reinforcement learning largely rely on ergodicity or strong mixing-time assumptions, which fail to hold in the presence of transient states. We propose a primal–dual natural actor–critic algorithm that leverages multi-level Monte Carlo (MLMC) estimators and an explicit burn-in mechanism to handle unichain dynamics without requiring mixing-time oracles. Our analysis establishes finite-time regret and cumulative constraint violation bounds that scale as $\tilde{O}(\sqrt{T})$, up to approximation errors arising from policy and critic parameterization, thereby extending order-optimal guarantees to a significantly broader class of CMDPs.


💡 Research Summary

The paper tackles infinite‑horizon average‑reward constrained Markov decision processes (CMDPs) under the unichain assumption, where each policy induces a Markov chain with a single recurrent class but may contain transient states. Existing regret analyses for constrained reinforcement learning typically rely on ergodicity or strong mixing‑time guarantees; these assumptions break down when transient phases or periodic behavior are present. Moreover, prior works either restrict themselves to tabular representations or assume specific linear function approximations, leaving a gap for high‑dimensional policies such as neural networks.

To fill this gap, the authors propose a primal‑dual natural actor‑critic algorithm that works with arbitrary differentiable policy and critic parameterizations. The algorithm combines three technical innovations:

  1. Multi‑Level Monte Carlo (MLMC) Estimators – Instead of averaging over a long trajectory (which would require O(√T) samples per update to achieve √T‑regret), the authors construct an MLMC estimator that telescopes across geometrically increasing trajectory lengths. By sampling a geometric random variable Q and forming a correction term, the estimator matches the bias of an average over Tmax samples while using only O(log Tmax) samples in expectation. Lemma 2 bounds the mean‑squared error of time‑averaged estimates for a unichain Markov chain using the constants C_hit (the expected hitting time to the recurrent class) and C_tar (a mixing‑time‑like constant within the recurrent class). Lemma 3 shows that the MLMC estimator’s bias scales as O(1/Tmax) and its variance as O(log Tmax), without requiring any knowledge of mixing times.

  2. Logarithmic Burn‑in – In unichain settings, the critic’s linear system matrix depends on which transient states are visited, violating the kernel‑inclusion condition needed for tight bias control. The authors therefore introduce an explicit burn‑in phase of length B at the beginning of each epoch, allowing the chain to reach the recurrent class before any learning update. Lemma 4 proves that choosing B = c log T with c > 4 C_hit makes the probability of failing to enter the recurrent class exponentially small, and the extra regret contributed by burn‑in becomes O(T^{2‑B/(2C_hit)} log T), which is negligible for large T. This reduces the total burn‑in cost from O(√T) per epoch (as in earlier unichain RL works) to O(log T), preserving order‑optimal regret.

  3. Primal‑Dual Natural Policy Gradient – The constrained objective is expressed as a saddle‑point problem: maximize the Lagrangian L(θ, λ) = J_r(θ) + λ J_c(θ) over the policy parameters θ while minimizing over the dual variable λ ≥ 0, which penalizes violation of the constraint J_c(θ) ≥ 0. The natural policy gradient direction ω* = F_θ⁻¹∇_θ J_g(θ) (g ∈ {r, c}) is estimated using the MLMC‑based gradient estimator, while the Fisher information matrix F_θ is approximated from sampled trajectories. The dual variable is updated via projected sub‑gradient descent with step size β, using an MLMC estimate of the constraint value J_c(θ). The Slater condition (existence of a strictly feasible policy) guarantees that the optimal dual variable, and hence the iterates λ, remain bounded.
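The MLMC construction in item 1 can be sketched as follows. This is an illustrative sketch of the standard geometric-level MLMC trick, not the paper's implementation; `sample_stream(n)`, a hypothetical callable returning `n` consecutive observations from the running trajectory, is an assumption.

```python
import numpy as np

def mlmc_average(sample_stream, t_max, rng):
    """Multi-level Monte Carlo estimate of a long-run average.

    Draws a geometric level J with P(J = j) = 2^{-j}, averages the
    first 2^J samples, and adds a reweighted telescoping correction so
    that the estimator matches the bias of a t_max-sample average while
    using only O(log t_max) samples in expectation.
    """
    j_max = int(np.floor(np.log2(t_max)))
    j = int(rng.geometric(0.5))            # P(J = j) = 2^{-j}, j >= 1
    n = 2 ** j if j <= j_max else 1
    samples = np.asarray(sample_stream(n), dtype=float)
    est = samples[0]                       # level 0: a single sample
    if j <= j_max:
        g_j = samples[: 2 ** j].mean()     # average over 2^j samples
        g_prev = samples[: 2 ** (j - 1)].mean()
        est += (2 ** j) * (g_j - g_prev)   # reweight by 1 / P(J = j)
    return float(est)
```

Taking expectations over J, the corrections telescope to the mean of a `2^j_max`-sample average, which is where the O(1/Tmax) bias of Lemma 3 comes from; the 2^j reweighting is also what produces the O(log Tmax) variance.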

The main theoretical result (Theorem 9) establishes that, after K = O(T / log T) epochs, the cumulative regret and the cumulative constraint violation are both bounded by $\tilde{O}(\sqrt{T})$, up to approximation errors arising from the policy and critic parameterizations, thereby extending order‑optimal guarantees from ergodic to general unichain CMDPs.
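The dual step in item 3 above can be sketched as a single projected sub-gradient update. This is a minimal illustration under the sign convention J_c(θ) ≥ 0; the names `lam_max` (standing in for the Slater-condition bound on the dual variable) and `dual_update` are assumptions, not the paper's notation.

```python
import numpy as np

def dual_update(lam, j_c_hat, beta, lam_max):
    """One projected sub-gradient step on the dual variable.

    For L(theta, lam) = J_r(theta) + lam * J_c(theta) with constraint
    J_c(theta) >= 0, the dual player minimizes over lam >= 0: an
    estimated violation (j_c_hat < 0) increases lam, while a satisfied
    constraint drives lam back toward 0.  The projection onto
    [0, lam_max] keeps the iterate in the bounded dual region.
    """
    return float(np.clip(lam - beta * j_c_hat, 0.0, lam_max))
```

Here `j_c_hat` would be the MLMC estimate of J_c(θ) from the current epoch, so each dual step costs only the O(log T) samples already collected for the primal update.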

