Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification
Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We prove that APO acts as a gradient-aligned mechanism for maximizing support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
💡 Research Summary
The paper investigates a fundamental failure mode in Reinforcement Learning with Verifiable Rewards (RLVR), a paradigm increasingly used to fine‑tune large language models for tasks such as mathematical problem solving. The authors identify “Recursive Space Contraction” (RSC) – an irreversible collapse of the policy’s probability mass onto a narrow set of tokens. RSC arises from the interaction of two dynamics: (1) positive updates that sharpen the distribution around the currently most probable (often correct) path, and (2) negative updates that, due to the “Squeezing Effect,” redistribute probability proportionally to existing mass, thereby reinforcing dominant tokens and starving low‑probability but still correct alternatives. Once a valid reasoning branch falls into the low‑probability tail, it is effectively unreachable because on‑policy sampling no longer provides gradient signal for its recovery.
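The squeezing dynamic described above can be reproduced in a few lines. The sketch below (illustrative numbers, not from the paper) applies a single REINFORCE-style negative update to an error token of a 5-token softmax policy; after renormalization, only the dominant token gains mass, while every low-probability alternative is squeezed even smaller.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

probs = np.array([0.60, 0.25, 0.10, 0.04, 0.01])  # current policy mass
logits = np.log(probs)
error_tok = 1   # token penalized with advantage = -1
lr = 2.0        # exaggerated step size to make the effect visible

# gradient of log pi(error_tok) w.r.t. the logits: one-hot minus probs
grad = -probs.copy()
grad[error_tok] += 1.0
new_probs = softmax(logits - lr * grad)  # descend (negative advantage)

print(np.round(new_probs - probs, 4))
# Only the dominant token (index 0) gains mass; the innocent tail
# token (index 4) ends up with even less probability than before,
# which is exactly the "starving" of rare-but-valid alternatives.
```

Because the logit boost each surviving token receives is proportional to its current probability, the update compounds existing dominance: this is the mechanism that makes RSC self-reinforcing.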
Standard mitigation uses a KL‑divergence penalty between the current policy πθ and a reference model πref. The authors argue that KL imposes a global “Shape Matching” constraint: the policy must mimic the entire density of the reference, including its noise. This creates a gradient conflict with the sharpening needed for high Pass@1 performance, often pushing the update outside the PPO trust region and leading to instability.
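The gradient conflict can be checked concretely by comparing the ascent direction for sharpening a correct token against the descent direction of a token-level reverse-KL penalty. The distributions below are made up for illustration; the paper's formulation may operate at the sequence level.

```python
import numpy as np

pi_ref = np.array([0.50, 0.20, 0.15, 0.10, 0.05])  # reference (with "noise" mass)
pi     = np.array([0.90, 0.04, 0.03, 0.02, 0.01])  # policy sharpened on token 0
correct = 0

# ascent direction for sharpening: grad of log pi[correct] w.r.t. logits
g_sharpen = -pi.copy()
g_sharpen[correct] += 1.0

# grad of KL(pi || pi_ref) w.r.t. logits; the penalty pushes along -g_kl
log_ratio = np.log(pi) - np.log(pi_ref)
kl = np.sum(pi * log_ratio)
g_kl = pi * (log_ratio - kl)

cos = g_sharpen @ (-g_kl) / (np.linalg.norm(g_sharpen) * np.linalg.norm(g_kl))
print(round(cos, 3))  # negative: the KL pull opposes the sharpening push
```

A negative cosine means every step of sharpening is partially undone by the penalty, and vice versa, which is why tuning the KL coefficient becomes a fragile balancing act.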
To resolve this, the authors propose Anchored Policy Optimization (APO). Instead of matching the full shape, APO focuses on “Support Coverage.” They define a Safe Manifold M_safe as the high‑confidence support set of the reference model (e.g., the top‑K tokens). The objective becomes maximizing the total probability mass that πθ assigns within M_safe. This permits aggressive sharpening inside the manifold while preventing mass from leaking into unsafe regions.
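The shift from shape matching to support coverage is easy to state as an objective. The sketch below uses a hypothetical `safe_manifold` built from the reference model's top-K tokens, following the summary's description; the paper's exact construction may differ.

```python
import numpy as np

def safe_manifold(ref_probs, k):
    """Hypothetical safe manifold: the top-k support of the reference model."""
    return np.argsort(ref_probs)[-k:]

def support_coverage(pi, manifold):
    """APO's quantity of interest: total policy mass inside the manifold."""
    return pi[manifold].sum()

ref      = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
pi_sharp = np.array([0.97, 0.01, 0.01, 0.005, 0.003, 0.002])  # sharp but safe
pi_leak  = np.array([0.30, 0.05, 0.05, 0.10, 0.20, 0.30])     # mass off-manifold

M = safe_manifold(ref, k=3)  # tokens {0, 1, 2}
print(support_coverage(pi_sharp, M))  # high: sharpening inside M is not penalized
print(support_coverage(pi_leak, M))   # low: only leakage outside M is flagged
```

Note that the aggressively sharpened policy scores near-perfect coverage even though its KL divergence from the reference would be large; this is precisely the asymmetry that lets APO sharpen without fighting its own regularizer.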
APO introduces a dual‑force mechanism:
- Push Force – the usual policy‑gradient term that suppresses error tokens.
- Pull Force – a restorative term inserted directly into the policy ratio r_t(θ). It estimates the total mass of M_safe via a “Virtual Anchor Ratio” and pulls probability back toward the anchor set whenever an error is detected. Crucially, the error token itself is excluded from the anchor set to avoid signal cancellation.
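The dual-force mechanism can be sketched numerically. The summary only states that the Pull Force enters through the policy ratio via a "Virtual Anchor Ratio" over the safe manifold with the error token excluded, so the formula below is one plausible reading, not the paper's exact definition.

```python
import numpy as np

def anchor_ratio(pi_new, pi_old, manifold, error_tok):
    """Virtual Anchor Ratio (plausible reading of the summary): ratio of
    total mass on the safe manifold, with the error token excluded so the
    pull signal cannot cancel the push signal on that token."""
    anchors = [j for j in manifold if j != error_tok]
    return pi_new[anchors].sum() / pi_old[anchors].sum()

pi_old = np.array([0.55, 0.30, 0.08, 0.05, 0.02])  # policy before the update
pi_new = np.array([0.60, 0.20, 0.11, 0.06, 0.03])  # policy after one step
manifold = [0, 1, 2]   # hypothetical top-3 support of the reference model
error_tok = 1          # token flagged wrong by the verifier

r_push = pi_new[error_tok] / pi_old[error_tok]              # standard PPO ratio
r_pull = anchor_ratio(pi_new, pi_old, manifold, error_tok)
print(r_push < 1, r_pull > 1)
# Push: the error token's ratio falls, so it is being suppressed.
# Pull: the anchor set's mass ratio rises, i.e. the suppressed mass is
# restored to safe alternatives instead of being squeezed away.
```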
The authors prove that the Pull Force gradient is collinear with the gradient of the support‑coverage objective, guaranteeing that updates during error correction are intrinsically aligned with both reward maximization and manifold consistency. Consequently, APO avoids the gradient conflict inherent in KL regularization and stays within the PPO trust region.
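One plausible way to see the collinearity claim, assuming the pull term acts as the gradient of the log anchor mass (the paper's own derivation may be more general): write the anchor mass with the error token $a$ excluded, and differentiate.

```latex
P_M(\theta) \;=\; \sum_{j \in M_{\mathrm{safe}} \setminus \{a\}} \pi_\theta(j)
\qquad\Longrightarrow\qquad
\nabla_\theta \log P_M(\theta) \;=\; \frac{1}{P_M(\theta)}\,\nabla_\theta P_M(\theta).
```

Since $1/P_M(\theta)$ is a positive scalar whenever the anchor set carries mass, the log-mass gradient points in exactly the same direction as the gradient of the support-coverage objective $P_M$ itself, which matches the collinearity property stated above.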
Empirical evaluation on five mathematical benchmarks (AIME‑24/25, Math500, Minerva, etc.) shows that APO improves Pass@1 by up to 6 % over KL‑regularized baselines and recovers 1.5 %–3.3 % of the diversity lost in Pass@K metrics. An “Oracle Coverage” analysis reveals that the greedy top‑1 path captures only ~84 % of correct tokens, whereas the top‑8 safe manifold covers ~97.5 %, highlighting the practical importance of supporting a broader token set.
In summary, the paper introduces a theoretically grounded and empirically validated alternative to KL regularization. By shifting from global shape matching to targeted support coverage, APO simultaneously enables efficient sharpening for accuracy and an elastic recovery mechanism that preserves exploration diversity, effectively breaking the traditional accuracy‑diversity trade‑off in RLVR. Future work may extend the safe manifold concept to dynamic K selection, multi‑modal tasks, or automated anchor construction.