DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning
Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models, particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability, as they clip importance sampling weights while still permitting non-zero gradients outside the trust region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes. Through targeted ablations, we uncover how each regime impacts training: for correct responses, weights >1 increase the average token entropy (i.e., exploration) while weights <1 decrease it (i.e., distillation) – both beneficial but causing gradual performance degradation when excessive. For incorrect responses, overly restrictive clipping triggers sudden performance collapse through repetitive outputs (when weights >1) or vanishing response lengths (when weights <1). By separately tuning these four clipping parameters, DISPO maintains the exploration-distillation balance while preventing catastrophic failures, achieving 61.04% on AIME'24 (vs. 55.42% for CISPO and 50.21% for DAPO), with similar gains across various benchmarks and models.
💡 Research Summary
This paper tackles a central dilemma in reinforcement learning with verifiable rewards (RL‑VR) for large language model (LLM) mathematical reasoning: PPO‑style algorithms (e.g., GRPO, DAPO) are stable because they enforce a trust‑region‑like clipping on importance‑sampling (IS) weights, but this constraint slows down learning dramatically. Conversely, REINFORCE‑style methods such as CISPO drop the trust region, allowing gradients to flow even when IS weights deviate far from 1; this yields rapid early progress but severe instability, often collapsing after a few hundred updates. The authors identify the root cause: existing REINFORCE‑style approaches apply the same asymmetric clipping bounds to all tokens, regardless of whether the token belongs to a correct answer (positive advantage) or an incorrect answer (negative advantage), and regardless of whether the IS weight is above or below 1. This conflates four fundamentally different update regimes, each with distinct dynamics, leading to uncontrolled exploration‑distillation trade‑offs and catastrophic failure modes (repetitive outputs or vanishing response lengths).
DISPO (Decoupled Importance Sampling‑weighted Policy Optimization) resolves this by decoupling the clipping of IS weights along two axes: (1) the sign of the advantage (correct vs. incorrect response) and (2) whether the IS weight amplifies (>1) or suppresses (<1) the gradient. This yields four independent clipping parameters: ϵ⁺_low / ϵ⁺_high for correct responses and ϵ⁻_low / ϵ⁻_high for incorrect responses. The objective (Eq. 8‑9) retains the group‑relative advantage estimator and token‑level normalization used in CISPO, but replaces the single clipped IS weight r_c,i,t(θ) with a regime‑specific r_d,i,t(θ).
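The group‑relative advantage estimator that the objective retains can be sketched as follows. This is a minimal NumPy illustration of the GRPO‑style estimator described above, not the paper's code; the function name and the small ε added for numerical stability are our own choices.

```python
import numpy as np

def group_relative_advantage(rewards):
    """GRPO-style advantage: standardize verifiable rewards within one group.

    `rewards` holds one scalar reward (e.g. 0/1 correctness) per response
    sampled for the same prompt; every token of a response then shares its
    response-level advantage.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

For a group with rewards `[1, 1, 0, 0]`, correct responses receive advantage ≈ +1 and incorrect ones ≈ −1, which is what makes the sign of the advantage a usable axis for decoupled clipping.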
The paper provides a systematic analysis of each regime:
- Regime 1 (positive advantage, r > 1) amplifies the learning signal for tokens that the model already prefers in correct answers, increasing token‑level entropy and encouraging exploration of diverse solution paths. Raising ϵ⁺_high enlarges this amplification.
- Regime 2 (positive advantage, r < 1) suppresses the same signal, reducing entropy and acting as a distillation mechanism that concentrates probability mass on the correct tokens. Lowering ϵ⁺_low strengthens this effect.
- Regime 3 (negative advantage, r > 1) amplifies the negative signal for tokens that appear in incorrect answers, accelerating “unlearning” of harmful patterns. Insufficient ϵ⁻_high leads to repetition‑induced collapse because the model fails to suppress erroneous tokens.
- Regime 4 (negative advantage, r < 1) suppresses the negative signal, which, if over‑restricted (very low ϵ⁻_low), drives response lengths toward zero, causing abrupt performance collapse.
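The four regimes above reduce to choosing clip bounds by the sign of the advantage. A minimal NumPy sketch of the regime‑specific weight, under the assumption (consistent with the CISPO lineage) that the clipped weight is treated as a stop‑gradient constant in the objective; the function name and default values are ours:

```python
import numpy as np

def dispo_weight(ratio, adv,
                 eps_pos_low=0.2, eps_pos_high=0.2,
                 eps_neg_low=0.2, eps_neg_high=0.2):
    """Regime-specific clipping of the IS weight r = pi_theta / pi_old.

    Bounds are selected by the sign of the advantage; within each sign,
    eps_low caps how far r may fall below 1 (regimes 2 and 4) and eps_high
    how far it may rise above 1 (regimes 1 and 3). In a REINFORCE-style
    objective the result multiplies adv * grad log pi as a constant.
    """
    ratio = np.asarray(ratio, dtype=float)
    adv = np.asarray(adv, dtype=float)
    low = np.where(adv >= 0, 1 - eps_pos_low, 1 - eps_neg_low)
    high = np.where(adv >= 0, 1 + eps_pos_high, 1 + eps_neg_high)
    return np.clip(ratio, low, high)
```

With all four ε equal this collapses to a single symmetric clip; the paper's point is that the four knobs should be tuned independently because each governs a different failure mode.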
By independently tuning these four knobs, DISPO can keep average token entropy at a healthy level (balancing exploration and distillation) while preventing the sudden failures observed in CISPO. Gradient‑weight visualizations (Fig. 3) show that PPO‑style clipping zeroes gradients outside the trust region, whereas DISPO provides a smooth gating function that scales gradients continuously based on both sign and magnitude of r. This smoothness preserves learning momentum and stabilizes long‑run training.
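The gradient‑gating contrast can be made concrete with per‑token gradient multipliers. The sketch below (our own illustration, not the paper's figure code) compares the multiplier on ∇log π under the standard PPO‑clip surrogate, which is exactly zero once clipping is active on the pessimistic side, with the CISPO/DISPO‑style stop‑gradient weight, which saturates but never vanishes:

```python
def ppo_grad_factor(r, adv, eps=0.2):
    """Multiplier on grad log pi under the PPO-clip surrogate.

    min(r*A, clip(r)*A) becomes a constant in theta once the ratio leaves
    the trust region on the side being optimized, so the gradient is zero.
    """
    if (adv > 0 and r > 1 + eps) or (adv < 0 and r < 1 - eps):
        return 0.0
    return r * adv

def dispo_grad_factor(r, adv, eps_low=0.2, eps_high=0.2):
    """CISPO/DISPO-style: the clipped IS weight is a stop-gradient constant
    multiplying adv * grad log pi, so the multiplier saturates at the clip
    bound instead of vanishing."""
    return min(max(r, 1 - eps_low), 1 + eps_high) * adv
```

For a correct-answer token whose ratio has drifted to 1.5, PPO contributes no gradient at all, while the saturated weight still pushes probability toward the token, which is the "preserved learning momentum" the paper attributes to its smooth gating.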
Empirically, DISPO is evaluated on several state‑of‑the‑art LLMs (DeepSeek‑R1, Qwen‑3, Claude Sonnet 3.5) across multiple mathematical reasoning benchmarks (AIME’24, MATH, GSM‑8K, etc.). On the AIME’24 test set, DISPO achieves 61.04% accuracy, surpassing CISPO’s 55.42% and DAPO’s 50.21% by sizable margins. Across other benchmarks, DISPO consistently yields 3–5 percentage‑point gains. Learning curves demonstrate that DISPO retains the rapid early gains of REINFORCE‑style methods while avoiding the later collapse that plagues CISPO, and it remains more sample‑efficient than PPO‑style baselines.
The authors also discuss implementation details: DISPO inherits dynamic sampling, token‑level normalization, and an over‑length penalty from DAPO/CISPO, ensuring fair comparison. The primary contribution is the regime‑specific clipping strategy and the associated hyper‑parameter tuning methodology, which reframes RL‑VR from a “trust‑region” perspective to a “signal‑to‑noise” control perspective.
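Of the inherited DAPO/CISPO components, dynamic sampling is the simplest to sketch. Assuming binary 0/1 verifiable rewards (an assumption on our part; the paper does not spell out the reward format here), it amounts to discarding groups whose responses are all correct or all incorrect, since their group‑relative advantages are uniformly zero and carry no gradient:

```python
def keep_group(rewards):
    """DAPO-style dynamic sampling filter (assumes binary 0/1 rewards):
    keep a sampled group only if it mixes correct and incorrect responses,
    i.e. only if the group-relative advantages are non-zero."""
    return 0 < sum(rewards) < len(rewards)
```

Prompts whose groups are filtered out are typically resampled so that every batch carries a usable learning signal.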
In conclusion, DISPO offers a simple yet powerful modification to off‑policy REINFORCE for LLMs, delivering both higher sample efficiency and robust stability. The paper opens avenues for automated clipping‑parameter optimization, multi‑stage reward shaping, and extension of the decoupled clipping paradigm to non‑mathematical reasoning tasks.