Variance Reduction Based Experience Replay for Policy Optimization
Effective reinforcement learning (RL) for complex stochastic systems requires leveraging historical data collected in previous iterations to accelerate policy optimization. Classical experience replay treats all past observations uniformly and fails to account for their varying contributions to learning. To overcome this limitation, we propose Variance Reduction Experience Replay (VRER), a principled framework that selectively reuses informative samples to reduce variance in policy gradient estimation. VRER is algorithm-agnostic and integrates seamlessly with existing policy optimization methods, forming the basis of our sample-efficient off-policy algorithm, Policy Gradient with VRER (PG-VRER). Motivated by the lack of rigorous theoretical analysis of experience replay, we develop a novel framework that explicitly captures dependencies introduced by Markovian dynamics and behavior-policy interactions. Using this framework, we establish finite-time convergence guarantees for PG-VRER and reveal a fundamental bias-variance trade-off: reusing older experience increases bias but simultaneously reduces gradient variance. Extensive experiments demonstrate that VRER consistently accelerates policy learning and improves performance over state-of-the-art policy optimization algorithms.
💡 Research Summary
The paper addresses a fundamental inefficiency in modern reinforcement‑learning (RL) algorithms: the indiscriminate reuse of all past experience in experience replay (ER) buffers. While ER can dramatically improve sample efficiency, treating every stored transition equally leads to high variance in policy‑gradient estimates, especially when the behavior policy that generated the data diverges from the current target policy. To remedy this, the authors propose Variance Reduction Experience Replay (VRER), a principled framework that selectively reuses only the most informative samples, thereby reducing gradient variance without incurring prohibitive bias.
VRER is algorithm‑agnostic and can be plugged into any step‑based policy‑gradient method (e.g., PPO, TRPO, A2C). The core of VRER consists of two mechanisms: (1) a sample‑selection rule based on the magnitude of the importance‑weight ratio ρ(s,a)=πθ(a|s)/πβ(a|s). Only transitions whose absolute importance weight falls below a pre‑specified threshold τ are admitted to the replay set; this prevents the explosion of importance weights that would otherwise inflate variance. (2) a buffer‑management strategy that limits the replay capacity, periodically downsamples older entries, and enforces a KL‑based constraint on policy updates to keep the policy drift small. Together, these components control the bias introduced by reusing stale data while still harvesting variance reduction benefits.
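The selection rule above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the function name `select_replay_samples` and the default threshold value are hypothetical, and the rule is applied per transition using log-probabilities under the target policy πθ and the behavior policy πβ that generated the data.

```python
import numpy as np

def select_replay_samples(log_probs_target, log_probs_behavior, tau=2.0):
    """Hypothetical sketch of the VRER selection rule.

    Computes the importance ratio rho(s, a) = pi_theta(a|s) / pi_beta(a|s)
    for each stored transition and admits only those with rho below the
    threshold tau, preventing large importance weights from inflating the
    variance of the replay-based gradient estimate.
    """
    log_rho = np.asarray(log_probs_target) - np.asarray(log_probs_behavior)
    rho = np.exp(log_rho)          # importance ratio per transition
    mask = rho < tau               # keep only low-variance transitions
    return mask, rho
```

In practice the admitted transitions would then be reweighted by their ratios `rho[mask]` when forming the off-policy gradient estimate, while the buffer-management step (capacity limits and downsampling of old entries) bounds how stale the behavior policies in the buffer can become.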
A major theoretical contribution is a novel analysis that explicitly accounts for Markovian dependencies and the evolving behavior‑policy distribution. The authors assume uniform ergodicity of the underlying Markov chain and introduce a decay function ϕ(t)=κ0·κ^t that bounds the distance between the transient state distribution at time t and the stationary distribution of the current policy. Using this bound, they derive explicit upper bounds on both the bias and the variance of the replay‑based policy‑gradient estimator, making the bias‑variance trade‑off of experience reuse precise.
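Under the uniform-ergodicity assumption, the decay function described above can be written as a geometric mixing bound. A sketch in standard notation (the choice of total-variation distance and the transition-kernel notation P^t are assumptions for illustration; the constants κ0 > 0 and κ ∈ (0,1) are the authors'):

```latex
\bigl\| P^{t}(s, \cdot) - \mu_{\pi}(\cdot) \bigr\|_{\mathrm{TV}}
  \;\le\; \phi(t) \;=\; \kappa_0 \, \kappa^{t},
  \qquad \kappa \in (0, 1),
```

where P^t(s, ·) is the state distribution after t steps starting from state s and μ_π is the stationary distribution of the current policy π. Because ϕ(t) decays geometrically, the bias contributed by data generated t iterations ago shrinks as t grows, which is what allows the analysis to trade a controlled amount of bias for the variance reduction gained from reuse.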