Reparameterization Proximal Policy Optimization

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

By leveraging differentiable dynamics, Reparameterization Policy Gradient (RPG) achieves high sample efficiency. However, current approaches are hindered by two critical limitations: under-utilization of computationally expensive dynamics Jacobians and inherent training instability. Sample reuse offers a remedy for under-utilization, but no principled framework for it exists, and naive attempts risk exacerbating instability. To address these challenges, we propose Reparameterization Proximal Policy Optimization (RPO). We first establish that, under sample reuse, RPG naturally optimizes a PPO-style surrogate objective via Backpropagation Through Time, providing a unified framework for both on- and off-policy updates. To further ensure stability, RPO integrates a clipped policy gradient mechanism tailored to RPG and employs explicit Kullback-Leibler divergence regularization. Experimental results demonstrate that RPO maintains superior sample efficiency and consistently matches or exceeds state-of-the-art performance across diverse tasks.


💡 Research Summary

The paper tackles two fundamental shortcomings of Reparameterization Policy Gradient (RPG) methods that have become popular with the rise of differentiable simulators: (1) the under‑utilization of expensive dynamics Jacobians, and (2) the notorious instability of training when back‑propagating through long, possibly stiff, trajectories. Existing on‑policy RPG algorithms compute the Jacobians once per batch and discard them after a single policy update, wasting a large amount of computation. Moreover, because RPG directly back‑propagates through the dynamics, gradients can explode or vanish, especially in contact‑rich environments, leading to sudden performance drops.
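The instability described above can be made concrete in a toy setting. The sketch below (not from the paper; a hypothetical 1-D linear system with a linear Gaussian policy and quadratic cost) computes the reparameterization policy gradient through the dynamics, where the per-step dynamics Jacobian is the scalar `a + theta`. When the closed loop is contractive the gradient stays bounded; when it is expansive the product of Jacobians, and with it the gradient, blows up with the horizon:

```python
import numpy as np

def rpg_grad_bptt(theta, a=1.0, sigma=0.1, T=50, seed=0):
    """Reparameterization policy gradient for a toy 1-D linear system.

    Dynamics: s_{t+1} = a*s_t + u_t, policy: u_t = theta*s_t + sigma*eps_t
    (reparameterized), reward: r_t = -s_t**2.  The sensitivity ds_t/dtheta
    is propagated through the chain of per-step dynamics Jacobians
    (a + theta) -- the same product that makes BPTT explode when
    |a + theta| > 1.
    """
    rng = np.random.default_rng(seed)
    s, ds_dtheta = 1.0, 0.0      # state and its sensitivity to theta
    ret, grad = 0.0, 0.0
    for _ in range(T):
        ret += -s**2
        grad += -2.0 * s * ds_dtheta          # dr_t/dtheta via chain rule
        eps = rng.standard_normal()           # reparameterization noise
        # sensitivity recursion: ds_{t+1}/dtheta = s_t + (a+theta)*ds_t/dtheta
        ds_dtheta = s + (a + theta) * ds_dtheta
        s = (a + theta) * s + sigma * eps
    return ret, grad

# contractive closed loop (|a + theta| = 0.5): bounded gradient
_, g_stable = rpg_grad_bptt(theta=-0.5)
# expansive closed loop (|a + theta| = 1.2): gradient grows geometrically in T
_, g_unstable = rpg_grad_bptt(theta=0.2)
print(abs(g_stable), abs(g_unstable))
```

The geometric growth in the second case is why contact-rich (stiff) dynamics, whose local Jacobians can have large spectral norm, make naive BPTT fragile.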

The authors propose Reparameterization Proximal Policy Optimization (RPO), a principled framework that simultaneously reuses the computed Jacobians and stabilizes learning. The key theoretical insight is that, when samples are reused, the RPG estimator is exactly the gradient of a PPO‑style surrogate objective expressed in reparameterization form:

L_{\pi_{\theta_{\text{old}}}}(\theta) = \mathbb{E}_{s \sim d^{\pi_{\theta_{\text{old}}}},\; \varepsilon \sim \mathcal{N}(0, I)} \left[ Q^{\pi_{\theta_{\text{old}}}}\big(s, \pi_\theta(s, \varepsilon)\big) \right],

where π_θ(s, ε) is the reparameterized action and Q^{π_{θ_old}} is the action-value function of the behavior policy.
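A surrogate of this form can be differentiated by Monte Carlo reparameterization: sample states from the behavior distribution, sample noise, push it through the policy, and average dQ/da · da/dθ. The sketch below is a toy instance (not the paper's implementation): the critic Q(s, a) = -(a - s)², the linear Gaussian policy a = θs + σε, and the standard-normal state distribution are all illustrative assumptions, chosen so the gradient has a closed form to check against:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma = 0.3, 0.2

# Stand-ins (assumptions, not from the paper):
#   states s ~ d^{pi_old}  -> standard normal
#   critic Q(s, a) = -(a - s)**2
s = rng.standard_normal(200_000)      # states sampled from the behavior policy
eps = rng.standard_normal(200_000)    # reparameterization noise
a = theta * s + sigma * eps           # a = pi_theta(s, eps)

# Monte Carlo reparameterization gradient of the surrogate:
#   dL/dtheta = E[ dQ/da * da/dtheta ] = E[ -2*(a - s) * s ]
grad_mc = np.mean(-2.0 * (a - s) * s)

# Closed form for this toy model: -2*(theta - 1)*E[s^2] = -2*(theta - 1)
grad_exact = -2.0 * (theta - 1.0)
print(grad_mc, grad_exact)
```

Because the same (s, ε) samples can be replayed against an updated θ, this estimator supports the off-policy reuse the paper motivates; RPO's clipping and KL regularization then bound how far the replayed update may move the policy.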

