Non-Uniform Noise-to-Signal Ratio in the REINFORCE Policy-Gradient Estimator

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Policy-gradient methods are widely used in reinforcement learning, yet training often becomes unstable or slows down as learning progresses. We study this phenomenon through the noise-to-signal ratio (NSR) of a policy-gradient estimator, defined as the estimator variance (noise) normalized by the squared norm of the true gradient (signal). Our main result is that, for (i) finite-horizon linear systems with Gaussian policies and linear state-feedback, and (ii) finite-horizon polynomial systems with Gaussian policies and polynomial feedback, the NSR of the REINFORCE estimator can be characterized exactly, either in closed form or via numerical moment-evaluation algorithms, without approximation. For general nonlinear dynamics and expressive policies (including neural policies), we further derive a general upper bound on the variance. These characterizations enable a direct examination of how NSR varies across policy parameters and how it evolves along optimization trajectories (e.g., SGD and Adam). Across a range of examples, we find that the NSR landscape is highly non-uniform and typically increases as the policy approaches an optimum; in some regimes it blows up, which can trigger training instability and policy collapse.


💡 Research Summary

The paper investigates the instability often observed in policy-gradient reinforcement learning by focusing on the noise-to-signal ratio (NSR) of the REINFORCE estimator. NSR is defined as the variance of the stochastic gradient estimator divided by the squared norm of the true gradient, providing a dimensionless measure of how much random fluctuation contaminates the direction of ascent. The authors derive exact, non-approximate expressions for NSR in two settings: (i) finite-horizon linear-quadratic Gaussian (LQG) systems with Gaussian policies and linear state feedback, and (ii) finite-horizon polynomial systems with Gaussian policies and polynomial feedback. For the most general setting, (iii) nonlinear dynamics with expressive (including neural-network) Gaussian policies, they instead derive a general upper bound on the estimator variance.
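To make the definition concrete, the NSR can be estimated by Monte Carlo in a one-step scalar linear-Gaussian problem where the true gradient is available in closed form. The sketch below is illustrative only, not code from the paper; the quadratic reward and all parameter values are assumptions chosen so the exact gradient is easy to verify by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step scalar problem (illustrative assumptions):
#   state  x ~ N(0, sigma0^2)
#   policy u = theta * x + eps,  eps ~ N(0, sigma^2)
#   reward R = -(x^2 + u^2)
theta, sigma0, sigma = 1.0, 1.0, 0.5
n = 200_000

x = rng.normal(0.0, sigma0, n)
eps = rng.normal(0.0, sigma, n)
u = theta * x + eps
reward = -(x**2 + u**2)

# REINFORCE estimator: g = R * d/dtheta log pi(u | x).
# For a linear Gaussian policy the score is (u - theta*x) * x / sigma^2.
g = reward * (u - theta * x) * x / sigma**2

# Closed-form objective for this toy problem:
#   J(theta) = E[R] = -(1 + theta^2) * sigma0^2 - sigma^2
# so the true gradient is dJ/dtheta = -2 * theta * sigma0^2.
true_grad = -2.0 * theta * sigma0**2

nsr = g.var() / true_grad**2  # noise-to-signal ratio
print(f"mean(g) = {g.mean():.3f}, true gradient = {true_grad:.3f}, NSR = {nsr:.2f}")
```

Sweeping `theta` toward the optimum `theta* = 0` of this toy problem drives the true gradient to zero while the estimator variance stays bounded away from zero, so the estimated NSR diverges, which is the qualitative behaviour the paper reports near optima.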

For the one-step LQG case, they exploit closed-form score-function gradients for linear Gaussian policies and a compact notation for Gaussian moments of quadratic forms (the IS-operator). This yields closed-form formulas for both the mean gradient and the second moment (in Frobenius norm) of the REINFORCE estimator. By specializing to isotropic covariances (Σ = σ²I, Σ₀ = σ₀²I), they show that NSR scales roughly as (σ₀/σ)⁶ /
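The closed-form score mentioned above is a standard identity: for a linear Gaussian policy π(u|x) = N(Kx, Σ), the score with respect to the gain matrix is ∇_K log π(u|x) = Σ⁻¹(u − Kx)xᵀ. A small numerical check of this identity (all dimensions, matrices, and the sampled pair (x, u) below are arbitrary illustrative values, not from the paper) can be written as:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 2, 3  # action and state dimensions (illustrative)

K = rng.normal(size=(m, d))           # linear feedback gain
A = rng.normal(size=(m, m))
Sigma = A @ A.T + np.eye(m)           # a positive-definite policy covariance
x = rng.normal(size=d)
u = K @ x + np.linalg.cholesky(Sigma) @ rng.normal(size=m)  # u ~ N(Kx, Sigma)

def log_pi(K):
    """Log-density of N(Kx, Sigma) at u, up to a K-independent constant."""
    r = u - K @ x
    return -0.5 * r @ np.linalg.solve(Sigma, r)

# Closed-form score for the linear Gaussian policy: Sigma^{-1} (u - Kx) x^T.
score = np.linalg.solve(Sigma, u - K @ x)[:, None] * x[None, :]

# Central finite-difference check, entry by entry. Since log_pi is quadratic
# in K, the central difference is exact up to floating-point error.
h = 1e-6
fd = np.zeros_like(K)
for i in range(m):
    for j in range(d):
        E = np.zeros_like(K)
        E[i, j] = h
        fd[i, j] = (log_pi(K + E) - log_pi(K - E)) / (2 * h)

print(np.max(np.abs(score - fd)))  # a value near zero (roundoff only)
```

Plugging this score into g = R · ∇_K log π(u|x) gives the matrix-valued REINFORCE estimator whose mean and Frobenius-norm second moment the paper evaluates in closed form via the IS-operator.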

