Mode-Dependent Rectification for Stable PPO Training


Mode-dependent architectural components (layers that behave differently during training and evaluation, such as Batch Normalization or dropout) are commonly used in visual reinforcement learning but can destabilize on-policy optimization. We show that in Proximal Policy Optimization (PPO), discrepancies between training and evaluation behavior induced by Batch Normalization lead to policy mismatch, distributional drift, and reward collapse. We propose Mode-Dependent Rectification (MDR), a lightweight dual-phase training procedure that stabilizes PPO under mode-dependent layers without architectural changes. Experiments across procedurally generated games and real-world patch-localization tasks demonstrate that MDR consistently improves stability and performance, and extends naturally to other mode-dependent layers.


💡 Research Summary

The paper investigates why mode‑dependent layers such as Batch Normalization (BatchNorm) and dropout, which behave differently during training and evaluation, cause severe instability when used with Proximal Policy Optimization (PPO), an on‑policy reinforcement‑learning algorithm. The authors show that during each PPO iteration the data‑collection phase uses the policy evaluated with running statistics (evaluation mode), while the subsequent gradient updates use batch statistics (training mode). This creates a policy mismatch Δπ_k that grows as the state distribution shifts, which they quantify with Jensen‑Shannon divergence. The mismatch perturbs the PPO clipping ratio r_t, effectively turning the fixed clipping parameter ε into a stochastic ε′ = ε + Δε. When Δε becomes large, the trust‑region guarantee of PPO is violated, leading to “reward collapse” where performance drops abruptly and does not recover.
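The mismatch described above can be made concrete with a small NumPy sketch (illustrative names only, not the paper's code): normalizing the same minibatch of features once with batch statistics (training mode) and once with stale running statistics (evaluation mode) yields two different action distributions for identical states, and their Jensen-Shannon divergence is strictly positive, which is exactly the perturbation of the importance ratio r_t.

```python
import numpy as np

def normalize(x, mean, var, eps=1e-5):
    # BatchNorm-style normalization with the supplied statistics.
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two categorical distributions.
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)), axis=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
features = rng.normal(loc=1.0, scale=2.0, size=(32, 4))  # one minibatch
W = rng.normal(size=(4, 3))                              # toy policy head

# Training mode: normalize with the current batch's statistics.
pi_train = softmax(normalize(features, features.mean(0), features.var(0)) @ W)

# Evaluation mode: normalize with (stale) running statistics.
run_mean, run_var = np.zeros(4), np.ones(4)
pi_eval = softmax(normalize(features, run_mean, run_var) @ W)

# The two modes induce different policies on the same states, so the
# PPO importance ratio r_t is perturbed relative to the collected data.
print("mean JS divergence:", js_divergence(pi_train, pi_eval).mean())
```

Because the running statistics lag the true batch statistics whenever the state distribution drifts, this divergence grows over training rather than vanishing, which is why the effective clipping parameter becomes stochastic.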

To address this, the authors propose Mode‑Dependent Rectification (MDR), a lightweight two‑phase training procedure. In the standard update phase (an α₁ proportion of the total minibatch updates) PPO proceeds as usual, using training‑mode statistics. In the rectification phase (an α₂ proportion), all mode‑dependent layers are switched to deterministic evaluation mode and an additional set of gradient steps is performed. This phase forces the policy to be optimized under the same statistics that will be used during data collection, driving the ratio perturbation toward zero (Δε → 0) and restoring the original clipping bound ε. The hyper‑parameters α₁ and α₂ control the relative lengths of the two phases; the authors find a 4:1 ratio works well across tasks.
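A minimal sketch of how the two phases could be scheduled within one PPO iteration, assuming the paper's 4:1 ratio (the helper `mdr_update_schedule` is a hypothetical name, not from the paper):

```python
def mdr_update_schedule(total_updates, alpha1=4, alpha2=1):
    """Split one iteration's minibatch updates into a standard phase
    (training mode) followed by a rectification phase (evaluation mode),
    in the alpha1:alpha2 proportion (4:1 by default).

    Returns a list of booleans: True = training mode, False = eval mode.
    """
    n_rectify = round(total_updates * alpha2 / (alpha1 + alpha2))
    n_standard = total_updates - n_rectify
    return [True] * n_standard + [False] * n_rectify

# Example: 10 minibatch updates -> 8 standard, then 2 rectification steps.
schedule = mdr_update_schedule(10)
for train_mode in schedule:
    # In a real implementation one would call model.train() or model.eval()
    # before each gradient step; here we only illustrate the phase order.
    pass
print(schedule)
```

Because the rectification steps run last, the policy that collects the next batch of data has most recently been optimized under the very evaluation-mode statistics it will use, which is the point of the procedure.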

Empirical evaluation is conducted on the Procgen benchmark (16 procedurally generated games) and on real‑world visual patch‑localization tasks. With BatchNorm or dropout, vanilla PPO frequently suffers catastrophic collapse, while MDR consistently stabilizes training and yields 12‑18 % higher average returns on Procgen. In the patch‑localization experiments, MDR accelerates convergence and improves final accuracy by 5‑7 %. Additional ablations demonstrate that MDR also benefits other normalization layers such as LayerNorm, GroupNorm, and CrossNorm. Importantly, MDR requires no architectural changes and can be added to existing PPO codebases with minimal modifications.

The paper concludes that the instability of mode‑dependent layers in on‑policy RL stems from a dynamic perturbation of PPO’s clipping trust region, and that correcting this perturbation via a deterministic rectification phase is an effective, general solution. Future work may extend MDR to other on‑policy algorithms (e.g., A2C, TRPO) and multi‑agent settings.

