Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum
We establish an optimal sample complexity of $O(ε^{-2})$ for obtaining an $ε$-optimal global policy with a single-timescale actor-critic (AC) algorithm in infinite-horizon discounted Markov decision processes (MDPs) with finite state-action spaces, improving upon the prior state of the art of $O(ε^{-3})$. Our approach applies STORM (STOchastic Recursive Momentum) to reduce the variance of the critic updates. However, because samples are drawn from a nonstationary occupancy measure induced by the evolving policy, variance reduction via STORM alone is insufficient. To address this challenge, we maintain a buffer containing a small fraction of the most recent samples and draw from it uniformly for each critic update. Importantly, both mechanisms are compatible with existing deep learning architectures and require only minor modifications, so practical applicability is not compromised.
💡 Research Summary
This paper addresses a long‑standing gap in the theoretical understanding of single‑time‑scale actor‑critic (AC) algorithms for infinite‑horizon discounted Markov decision processes (MDPs) with finite state and action spaces. Prior works achieved a sample complexity of O(ε⁻³) for obtaining an ε‑optimal global policy, while an information‑theoretic lower bound of Ω(ε⁻²) shows what is achievable at best. The authors close this gap by introducing two complementary variance‑reduction mechanisms: (1) STORM (Stochastic Recursive Momentum) applied to the critic updates, and (2) a lightweight replay buffer that stores a small fraction of the most recent samples and is sampled uniformly for each critic update.
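The STORM recursion itself is simple: the gradient is evaluated at both the current and the previous iterate on the *same* fresh sample, which cancels most of the stochastic noise relative to plain momentum. The sketch below illustrates this on a noisy quadratic; the toy objective and constants are illustrative stand-ins, not the paper's critic loss.

```python
import numpy as np

def storm_update(h_prev, x_prev, x_curr, grad_est, z, nu):
    # STORM recursive momentum estimator (Cutkosky & Orabona, 2019):
    #   h_k = g(x_k; z_k) + (1 - nu_k) * (h_{k-1} - g(x_{k-1}; z_k))
    # Both gradients use the SAME fresh sample z_k, so their noise
    # largely cancels in the correction term.
    return grad_est(x_curr, z) + (1.0 - nu) * (h_prev - grad_est(x_prev, z))

# Toy illustration: f(x) = x^2 / 2, so grad f(x) = x;
# z plays the role of zero-mean sampling noise.
rng = np.random.default_rng(0)
grad_est = lambda x, z: x + z

x_prev = x = 1.0
h = grad_est(x, rng.normal(scale=0.1))
for k in range(1, 201):
    nu_k = 1.0 / (k + 1)        # momentum stepsize, Theta(1/k)
    beta_k = 0.5 / np.sqrt(k)   # critic-style stepsize, Theta(k^{-1/2})
    h = storm_update(h, x_prev, x, grad_est, rng.normal(scale=0.1), nu_k)
    x_prev, x = x, x - beta_k * h
```

With these decaying stepsizes the iterate drifts to the minimizer at 0 while the estimator `h` tracks the true gradient with shrinking error.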
The key difficulty stems from the fact that samples are drawn from a non‑stationary occupancy measure that changes as the policy evolves. Pure STORM, which assumes i.i.d. data, cannot fully control the variance introduced by this distribution shift. By maintaining a buffer of recent transitions, the algorithm effectively mixes samples generated under slightly older policies, thereby smoothing the occupancy distribution and allowing STORM’s variance‑reduction guarantees to hold.
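The buffer described above can be sketched as a sliding window whose capacity grows with the iteration counter; the class name, the choice c_b = 0.5, and the transition tuples are illustrative assumptions.

```python
import math
import random
from collections import deque

class SlidingBuffer:
    """Keep only the most recent ceil(c_b * k) transitions at iteration k,
    so every stored sample was generated by a recent (hence similar) policy."""

    def __init__(self, c_b=0.5):
        assert 0.0 < c_b <= 1.0
        self.c_b = c_b
        self.data = deque()
        self.k = 0  # number of transitions seen so far

    def push(self, transition):
        self.k += 1
        self.data.append(transition)
        while len(self.data) > max(1, math.ceil(self.c_b * self.k)):
            self.data.popleft()  # evict the oldest, most off-policy samples

    def sample(self):
        return random.choice(self.data)  # uniform over retained samples

buf = SlidingBuffer(c_b=0.5)
for k in range(1, 101):
    buf.push(("s%d" % k, "a", "s_next"))  # placeholder transitions
```

After 100 pushes the buffer holds exactly the 50 newest transitions, so a uniform draw mixes occupancy measures from only the most recent policies.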
Algorithm 1 proceeds as follows at each iteration k: (i) draw a fresh transition (sₖ, aₖ, s′ₖ) under the current policy πₖ, compute the advantage Aₖ, and update the policy parameters θₖ using a standard policy‑gradient step with stepsize ηₖ = Θ(k⁻¹/²); (ii) update the buffer Bₖ to contain the most recent c_b·k samples (c_b∈(0,1]); (iii) sample uniformly from Bₖ, compute a STORM‑style recursive gradient estimator hₖ for the critic, and update the Q‑function with stepsize βₖ = Θ(k⁻¹/²). The STORM momentum stepsize νₖ decays faster, νₖ = Θ(k⁻¹).
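Steps (i)-(iii) can be combined into a single loop. The sketch below instantiates them on a tiny tabular MDP; the MDP itself, the stepsize constants, and the TD(0) semi-gradient critic are illustrative assumptions, not the paper's exact construction.

```python
import math
import numpy as np

# Illustrative 2-state / 2-action MDP (not from the paper): each state
# has one rewarding action that also tends to keep the agent in place.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[s, a, s']
              [[0.8, 0.2], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])    # R[s, a]
gamma, c_b = 0.9, 0.5
rng = np.random.default_rng(0)

theta = np.zeros((2, 2))   # tabular softmax policy logits
Q = np.zeros((2, 2))       # tabular critic
Q_prev = Q.copy()
h = np.zeros((2, 2))       # STORM estimator of the critic update direction
buffer = []

def pi(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def td_grad(Qm, trans):
    # Semi-gradient of the squared TD(0) error for a tabular critic.
    s, a, r, s2, a2 = trans
    g = np.zeros_like(Qm)
    g[s, a] = Qm[s, a] - (r + gamma * Qm[s2, a2])
    return g

s = 0
for k in range(1, 2001):
    eta, beta, nu = 0.2 / math.sqrt(k), 0.5 / math.sqrt(k), min(1.0, 1.0 / k)
    # (i) fresh transition under pi_k, then an actor step
    a = rng.choice(2, p=pi(s))
    s2 = rng.choice(2, p=P[s, a])
    a2 = rng.choice(2, p=pi(s2))
    adv = Q[s, a] - pi(s) @ Q[s]            # advantage estimate A_k
    grad_log = -pi(s)                       # grad_theta log pi(a|s)
    grad_log[a] += 1.0
    theta[s] += eta * adv * grad_log
    # (ii) keep only the most recent ceil(c_b * k) transitions
    buffer.append((s, a, R[s, a], s2, a2))
    buffer = buffer[-math.ceil(c_b * k):]
    # (iii) STORM critic step on a uniform draw from the buffer
    trans = buffer[rng.integers(len(buffer))]
    h = td_grad(Q, trans) + (1.0 - nu) * (h - td_grad(Q_prev, trans))
    Q_prev = Q.copy()
    Q = Q - beta * h
    s = s2
```

Note that all three updates share the same single time scale: actor and critic stepsizes both decay as Θ(k⁻¹/²), with only the momentum stepsize νₖ decaying faster at Θ(k⁻¹).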
The theoretical contribution is a novel Lyapunov function that captures the coupled dynamics of θₖ, Qₖ, and the STORM estimator hₖ. By carefully balancing the three learning rates (ηₖ, βₖ, νₖ) and exploiting the unbiasedness and reduced variance of STORM together with the smoothing effect of the buffer, the authors prove that the Lyapunov function contracts at a rate of O(k⁻¹). Consequently, the expected squared norm of the policy gradient vanishes fast enough that, combined with a standard gradient‑domination argument, an ε‑optimal global policy is reached after O(ε⁻²) samples, matching the lower bound.
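The paper's exact Lyapunov function is not reproduced in this summary, but a generic three-term shape consistent with the description above (with hypothetical weights λ₁, λ₂ and ḡ denoting the expected critic update direction) would be:

```latex
% Illustrative shape only: policy suboptimality + critic tracking error
% + STORM estimator error, coupled through the shared time scale.
V_k \;=\; \bigl(J(\theta^\ast) - J(\theta_k)\bigr)
      \;+\; \lambda_1 \bigl\lVert Q_k - Q^{\pi_{\theta_k}} \bigr\rVert^2
      \;+\; \lambda_2 \bigl\lVert h_k - \bar g(Q_k, \theta_k) \bigr\rVert^2
```

The stepsize balance ηₖ, βₖ = Θ(k⁻¹/²) and νₖ = Θ(k⁻¹) is what lets the three error terms contract jointly rather than requiring the two-time-scale separation of earlier analyses.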