Mirror descent actor-critic methods for entropy regularised MDPs in general spaces: stability and convergence
We provide theoretical guarantees for convergence of discrete-time policy mirror descent with inexact advantage functions updated using temporal difference (TD) learning for entropy regularised MDPs in Polish state and action spaces. We rigorously derive sufficient conditions under which the single-loop actor-critic scheme is stable and convergent. To weaken these conditions, we introduce a variant that performs multiple TD steps per policy update and derive an explicit lower bound on the number of TD steps required to ensure stability. Finally, we establish sub-linear convergence when the number of TD steps grows logarithmically with the number of policy updates, and linear convergence when it grows linearly under a concentrability assumption.
💡 Research Summary
This paper establishes rigorous convergence guarantees for a discrete‑time actor‑critic algorithm that combines policy mirror descent with temporal‑difference (TD) learning in entropy‑regularized Markov decision processes (MDPs) defined over general Polish state and action spaces. The authors first formalize the entropy‑regularized objective, introducing the regularized value function V^τ_π, the soft Q‑function Q^τ_π, and the soft advantage A^τ_π. Policies are restricted to an admissible class Π_µ that can be expressed as a softmax over a reference measure µ, ensuring the KL‑penalty is well‑defined.
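Under the KL-to-reference convention described above, these objects are related as follows (a sketch of the standard soft-MDP identities; the paper's exact normalizations may differ):

```latex
V^{\tau}_{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\!\left[ Q^{\tau}_{\pi}(s,a) - \tau \log \frac{\mathrm{d}\pi(\cdot|s)}{\mathrm{d}\mu}(a) \right],
\qquad
Q^{\tau}_{\pi}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\!\left[ V^{\tau}_{\pi}(s') \right],
```

so the soft advantage is the residual

```latex
A^{\tau}_{\pi}(s,a) = Q^{\tau}_{\pi}(s,a) - \tau \log \frac{\mathrm{d}\pi(\cdot|s)}{\mathrm{d}\mu}(a) - V^{\tau}_{\pi}(s),
```

which by construction has zero mean under π(·|s), mirroring the unregularized case.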
The core algorithm (Algorithm 1) performs a single TD update of a linear function approximator Q(s,a;θ)=⟨θ,ϕ(s,a)⟩ to approximate the advantage, then updates the policy by solving a mirror‑descent sub‑problem that minimizes a KL‑regularized surrogate loss. The analysis relies on three structural assumptions: (i) the feature matrix ϕ satisfies a uniform positive‑definiteness condition (λ_min>0), (ii) the features are uniformly bounded, and (iii) the true soft Q‑function lies in the linear span of the features (Q‑realizability).
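To make the single-loop structure concrete, here is a minimal sketch on a finite toy MDP with a uniform reference measure µ and one-hot features (so the critic parameter is just a Q-table). All names, constants, and the exact form of the actor update are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, tau = 4, 3, 0.9, 0.1
h, lam = 0.1, 0.05                           # critic / actor step sizes, with tau*lam < 1
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel: P[s, a] is a dist over s'
r = rng.uniform(0, 1, size=(S, A))           # rewards
logits = np.zeros((S, A))                    # softmax policy w.r.t. uniform reference mu
theta = np.zeros((S, A))                     # one-hot features: theta acts as the Q-table

def policy(logits):
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

s = 0
for n in range(5000):
    pi = policy(logits)
    a = rng.choice(A, p=pi[s])
    s2 = rng.choice(S, p=P[s, a])
    # soft value of the next state: V(s') = E_pi[Q - tau * log(dpi/dmu)]
    log_ratio_next = np.log(pi[s2] * A + 1e-12)      # density of pi w.r.t. uniform mu
    v_next = pi[s2] @ (theta[s2] - tau * log_ratio_next)
    # one TD(0) semi-gradient step on the critic
    delta = r[s, a] + gamma * v_next - theta[s, a]
    theta[s, a] += h * delta
    # mirror-descent actor step: shrink logits toward mu, move along the soft advantage
    log_ratio = np.log(pi[s] * A + 1e-12)
    q_soft = theta[s] - tau * log_ratio
    adv = q_soft - pi[s] @ q_soft
    logits[s] = (1 - tau * lam) * logits[s] + lam * adv
    s = s2

print(np.round(policy(logits), 2))
```

The `(1 - tau*lam)` shrinkage in the actor step is the discrete-time fingerprint of the KL-regularized mirror-descent sub-problem: with τλ < 1 it contracts the log-density toward the reference µ, which is exactly the mechanism behind the stability bounds discussed below.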
Key technical contributions include:
- Stability of the single‑loop scheme – Lemma 3.1 shows that the norm of the critic parameters grows linearly with the current KL‑divergence Kₙ = sup_s KL(πₙ(·|s)‖µ). For finite action spaces, Corollary 3.2 immediately yields a uniform bound because KL is bounded by log|A|. For general Polish action spaces, Lemma 3.3 provides a recursion for the normalized log‑density lₙ, linking it to the critic norm. Theorem 3.4 combines these results to prove that, under step‑size constraints (h for the critic, λ for the actor) and τλ<1, the quantity KL(πₙ‖µ)+‖θₙ‖² remains uniformly bounded for all iterations, establishing algorithmic stability without requiring Q‑realizability.
- Continuity of the Bellman operator – Theorem 3.6 bounds the difference between successive soft Q‑functions in terms of the KL‑distance between policies, which is crucial for the downstream convergence analysis.
- Convergence rate – Theorem 3.7 shows that the error between the value function and the optimal regularized value decays as O(1/√(nh)+λh), i.e., sub‑linearly, where η=λh captures the timescale separation between actor and critic updates.
- Multi‑step TD (double‑loop) extension – Recognizing that a single TD step may be insufficient for stability in continuous action spaces, the authors propose Algorithm 2, which executes M(n)≥1 TD updates per policy iteration. Lemma 4.1 bounds the semi‑gradient norm by the distance to the optimal parameter θ_π, and Theorem 4.2 proves linear convergence of the inner TD loop when the critic step size is small enough. By selecting M(n) to grow logarithmically with n, the overall algorithm achieves the same sub‑linear rate as the single‑loop case; if M(n) grows linearly, the authors obtain linear convergence under a standard concentrability assumption (which controls the mismatch between the sampling distribution and the stationary distribution of the optimal policy).
- Practical implications – The paper provides explicit lower bounds on the number of TD steps required for stability, clarifies how the regularization parameter τ and the KL‑penalty weight λ interact, and shows that the stability results hold even without the strong Q‑realizability assumption, making them applicable to a broader class of function‑approximation schemes.
Overall, the work fills a notable gap in the literature by delivering the first comprehensive stability and convergence analysis of mirror‑descent actor‑critic methods for entropy‑regularized MDPs on general (Polish) state and action spaces. It bridges the divide between continuous‑time theoretical analyses and practical discrete‑time algorithms, offering clear guidelines for step‑size selection and the required number of critic updates. Suggested future directions include extending the analysis to non‑linear (deep) function approximators, relaxing the concentrability condition, and empirical validation on high‑dimensional control benchmarks.