Stabilizing Reinforcement Learning for Diffusion Language Models
Group Relative Policy Optimization (GRPO) is highly effective for post-training autoregressive (AR) language models, yet its direct application to diffusion large language models (dLLMs) often triggers reward collapse. We identify two sources of incompatibility. First, GRPO relies on importance ratios defined by sequence probabilities, which are intractable in dLLMs and must be estimated (e.g., via ELBO-based or mean-field likelihood proxies), yielding inherently noisy ratios. Second, standard GRPO’s formulation is not designed for estimated ratios: its conditional clipping can be anomalously bypassed by model-agnostic estimation noise, producing gradient spikes, while its fixed group-size normalization amplifies gradient-magnitude fluctuations under high-variance ratio estimates. We show these effects form a self-reinforcing instability loop that drives policy drift and further increases ratio variance. To break this loop, we propose StableDRL, a reformulation of GRPO tailored for dLLMs that uses (i) unconditional clipping to suppress outlier-induced spikes and (ii) self-normalization to constrain updates within the convex hull of per-sample gradients. We further extend StableDRL to block-wise diffusion models via a staircase attention mechanism.
💡 Research Summary
The paper investigates why Group Relative Policy Optimization (GRPO), a reinforcement‑learning (RL) algorithm that has proven highly effective for post‑training autoregressive (AR) language models, fails dramatically when applied directly to diffusion large language models (dLLMs). The authors identify two fundamental incompatibilities. First, GRPO relies on importance ratios ρ = π_θ / π_θ_old that are defined by exact sequence probabilities. In AR models these probabilities are tractable, but in dLLMs the forward‑backward diffusion process makes the exact likelihood intractable; practitioners therefore estimate ρ using proxies such as ELBO‑based or mean‑field likelihood approximations. These proxies introduce substantial stochastic noise and a heavy‑tailed distribution of estimated ratios. Second, GRPO’s original design assumes exact ratios and therefore employs conditional clipping (the min over the clipped and unclipped surrogate terms means the clip bound only binds in one direction, so a large estimated ratio paired with a negative advantage passes through unclipped) and a fixed‑group‑size normalization. When the ratios are noisy, conditional clipping can be bypassed by estimation noise, leading to unbounded gradient spikes; the fixed‑size normalization amplifies the effect of high‑variance ratio estimates, causing large fluctuations in gradient magnitude. Together these two flaws create a self‑reinforcing instability loop: noisy ratios generate gradient spikes, spikes cause the target policy to drift away from the behavior policy, and the increased drift further inflates the variance of subsequent ratio estimates, eventually culminating in a catastrophic reward collapse after only a few hundred training steps.
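The bypass described above is easy to reproduce numerically. The sketch below (our own illustration, not the paper's code; the function name, ε value, and toy numbers are ours) implements the standard clipped surrogate with fixed group-size normalization and shows how an outlier estimated ratio with a negative advantage slips past the clip:

```python
import numpy as np

def grpo_surrogate(ratios, advantages, eps=0.2):
    """Standard clipped surrogate with fixed group-size normalization
    (illustrative sketch; not the paper's implementation).

    The min() makes clipping *conditional*: the bound only binds on the side
    that would inflate the objective. With exact ratios this is safe, but a
    noisy estimated ratio paired with a negative advantage passes through:
    min(rho * A, clip(rho) * A) = rho * A when A < 0 and rho >> 1 + eps.
    """
    clipped = np.clip(ratios, 1 - eps, 1 + eps)
    per_sample = np.minimum(ratios * advantages, clipped * advantages)
    return per_sample.mean()  # divide by the fixed group size

# One well-behaved sample and one outlier ratio from a noisy likelihood proxy.
ratios = np.array([1.05, 8.0])   # 8.0 is estimation noise, not a real policy shift
advs = np.array([0.5, -1.0])
# The outlier contributes -8.0 (unclipped) to the surrogate, so its gradient
# scales with the raw estimated ratio -- the spike described above.
print(grpo_surrogate(ratios, advs))  # -> -3.7375
```

Note that the first sample's term (0.525) is clipped normally; only the outlier escapes, which is exactly the asymmetry the paper attributes to conditional clipping.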
To break this loop, the authors propose StableDRL, a reformulation of GRPO specifically for dLLMs. StableDRL introduces two key modifications. (1) Unconditional clipping: the importance ratio is forcibly bounded within the clipping interval regardless of the sign of the advantage, so estimation outliers can never contribute unbounded gradient terms. (2) Self‑normalization: the fixed group‑size normalization is replaced by normalization over the samples' own ratios, which constrains each update to lie within the convex hull of the per‑sample gradients and damps gradient‑magnitude fluctuations under high‑variance ratio estimates.
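A minimal sketch of how the two modifications could look side by side, under our own reading of the summary (the function name, ε value, and the use of a self-normalized importance-sampling weighting are our assumptions; the paper may formulate the estimator differently):

```python
import numpy as np

def stable_surrogate(ratios, advantages, eps=0.2):
    """Hedged sketch of StableDRL's two fixes (our formulation, not the paper's).

    (i)  Unconditional clipping: every ratio is bounded in [1-eps, 1+eps]
         before it multiplies an advantage, so no outlier slips through a
         conditional min() -- there is no min() at all.
    (ii) Self-normalization: dividing by the sum of clipped ratios instead of
         the fixed group size makes the weights a convex combination, so the
         update stays inside the convex hull of per-sample gradients.
    """
    clipped = np.clip(ratios, 1 - eps, 1 + eps)
    weights = clipped / clipped.sum()   # convex weights over the group
    return np.sum(weights * advantages)

# Same outlier as before: the 8.0 ratio is clipped to 1.2 unconditionally,
# and the convex weighting keeps the surrogate bounded.
print(stable_surrogate(np.array([1.05, 8.0]), np.array([0.5, -1.0])))  # -> -0.3
```

With the same toy inputs that produced an unclipped −8.0 term in the standard surrogate, every per-sample contribution here is bounded by max|A|, which is the stability property the reformulation targets.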