Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Reinforcement Learning (RL) for training Large Language Models is notoriously unstable. While recent studies attribute this instability to a "training-inference mismatch" stemming from inconsistent hybrid engines, standard remedies such as importance sampling can still fail during extended training runs. In this work, we analyze the instability through the lens of optimization, demonstrating that gradient noise and the training-inference mismatch escalate in tandem as training progresses. Meanwhile, we find that the mismatch can be effectively suppressed by shrinking the update size. Taken together, we deduce that the mismatch is not merely a static numerical discrepancy but a dynamic failure coupled to the model's optimization. Based on this insight, we propose a simple yet effective solution: a specialized Learning Rate (LR) scheduler. Instead of the pre-defined decay schedules used by traditional LR schedulers, our method dynamically triggers LR decay based on response length, which we identify as a reliable early-warning signal of impending instability. Empirical evidence suggests that by reducing the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.


💡 Research Summary

The paper investigates the notorious instability of reinforcement‑learning (RL) fine‑tuning for large language models (LLMs). While prior work attributes collapse to a “training‑inference mismatch” caused by heterogeneous rollout and update engines, the authors argue that this mismatch is not a static numerical artifact but a dynamic phenomenon that co‑evolves with the optimization process. By tracking two metrics—log‑perplexity difference between the rollout policy (µ) and the update policy (π) and a smoothed gradient norm—they show that both increase sharply around the same training step (≈300 for Qwen‑3‑Base). This simultaneous rise indicates that gradient noise dominates the true signal as training proceeds, suggesting that the model’s parameters are moving into regions of the loss landscape where finite‑precision arithmetic amplifies discrepancies.
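Both diagnostics are cheap to compute alongside training. The sketch below shows one way to track them; the function and class names are illustrative, not taken from the paper, and it assumes access to per-token log-probabilities of the same sampled tokens under both engines:

```python
def log_ppl_diff(logp_mu, logp_pi):
    """Log-perplexity gap between rollout policy mu and update policy pi.

    logp_mu, logp_pi: per-token log-probabilities of the *same* sampled
    tokens under each engine. Log-perplexity is the negative mean
    log-probability, so the gap measures how far the two engines disagree.
    """
    n = len(logp_mu)
    return (-sum(logp_pi) / n) - (-sum(logp_mu) / n)


class EMA:
    """Exponential moving average, e.g. to smooth the raw gradient norm."""

    def __init__(self, beta=0.9):
        self.beta, self.value = beta, None

    def update(self, x):
        if self.value is None:
            self.value = x
        else:
            self.value = self.beta * self.value + (1 - self.beta) * x
        return self.value
```

A simultaneous rise in `log_ppl_diff` and the smoothed gradient norm is the coupling the authors observe around step ≈300.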

A simple theoretical argument (Appendix A) shows that reducing the learning rate η scales down the impact of gradient noise (both bias and variance). Empirically, lowering the constant learning rate from 1e‑6 to 1e‑7 dramatically postpones or eliminates collapse, albeit at the cost of slower early‑stage progress. To avoid this trade‑off, the authors search for an early‑warning signal that can trigger a dynamic learning‑rate reduction only when needed. They discover that the average response length (the number of tokens generated per prompt) exhibits a “surge”—often tripling within a few steps—just before the gradient norm spikes. Because longer sequences entail more floating‑point operations, they naturally increase the chance of non‑associative rounding errors, making response‑length a reliable proxy for impending numerical instability.
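The Appendix-A argument can be sketched in one line (notation assumed here, not quoted from the paper): with a noisy gradient, one SGD-style step is

```latex
\theta_{t+1} = \theta_t - \eta\,\bigl(\nabla L(\theta_t) + \epsilon_t\bigr),
\qquad
\underbrace{\bigl\|\eta\,\mathbb{E}[\epsilon_t]\bigr\|}_{\text{noise bias}\,\propto\,\eta},
\qquad
\underbrace{\eta^{2}\operatorname{Var}[\epsilon_t]}_{\text{noise variance}\,\propto\,\eta^{2}}.
```

so shrinking η suppresses both the systematic and the random component of the noise-driven parameter drift, at the cost of slower progress on the true gradient.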

Based on this insight, the paper proposes a lightweight “length‑decay” scheduler. The scheduler keeps the initial learning rate η₀ until the step counter is a multiple of a decay period T_decay, which is chosen to align with the observed response‑length surge. At each trigger, the learning rate is halved, but never below a floor η∞ set to 10 % of η₀. The algorithm is simple (see Algorithm 1) and requires only the average response length as a signal; no extra hyper‑parameters beyond the decay period and floor are needed.
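Algorithm 1 is not reproduced here, but the description above admits a minimal sketch. The `surge_factor` threshold is a hypothetical choice (the paper reports response length often tripling within a few steps before instability), and the baseline smoothing is our assumption, not the authors' exact rule:

```python
def make_length_decay_scheduler(eta0, surge_factor=3.0, floor_ratio=0.1):
    """Sketch of a response-length-triggered LR decay, per the description above.

    When the average response length jumps past `surge_factor` times a
    running baseline, the learning rate is halved, but never drops below
    `floor_ratio * eta0` (10% of the initial LR in the paper).
    """
    state = {"lr": eta0, "baseline": None}
    floor = floor_ratio * eta0

    def step(avg_response_len):
        b = state["baseline"]
        if b is not None and avg_response_len > surge_factor * b:
            # Surge detected: halve the learning rate, respecting the floor.
            state["lr"] = max(state["lr"] * 0.5, floor)
        # Smooth the baseline so one noisy batch does not trigger decay.
        state["baseline"] = (
            avg_response_len if b is None else 0.9 * b + 0.1 * avg_response_len
        )
        return state["lr"]

    return step
```

Usage is one call per training step, e.g. `lr = step(batch_mean_response_len)`, with the returned value fed to the optimizer.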

Experiments are conducted on Qwen‑3‑4B‑Base and Qwen‑3‑8B‑Base models, using the full DAPO dataset as well as heavily subsampled versions (2.5 %–25 %). Baselines include token‑level and sequence‑level importance sampling (IS), as well as truncated and masked IS (TIS, MIS). While IS variants extend training modestly, they eventually collapse, especially on the larger model. In contrast, the length‑decay scheduler maintains stable training for thousands of steps, keeps the log‑ppl difference low, and prevents the gradient‑norm explosion. The authors also show that epoch‑based schedules are unreliable because collapse timing does not scale linearly with dataset size.

The contributions are threefold: (1) reframing training‑inference mismatch as a dynamic optimization issue, (2) demonstrating that simple learning‑rate reduction mitigates the mismatch, and (3) introducing a response‑length‑driven adaptive scheduler that outperforms conventional IS‑based fixes. The paper’s strengths lie in clear empirical diagnostics, a compelling intuition linking sequence length to numerical error, and a practically implementable scheduler.

However, several limitations remain. The causal link between response length and gradient‑noise amplification is supported only by empirical correlation; a rigorous bound is only sketched (Theorem 3.1) and relies on "mild regularity" assumptions. The method is evaluated on a single model family (Qwen‑3); its applicability to other architectures (e.g., LLaMA) or multilingual settings is untested. Moreover, the scheduler is compared only against static IS techniques; modern adaptive optimizers and schedules (AdamW with bias correction, LAMB, Ranger, or learning‑rate warm‑up/cool‑down schemes) are absent from the baseline pool. Finally, the paper does not discuss computational overhead or potential interactions with other RL tricks such as KL penalties or reward scaling.

In summary, the work provides a useful diagnostic of RL instability for LLMs and proposes a simple, response‑length‑triggered learning‑rate decay that empirically stabilizes training. While the idea is promising and easy to adopt, further theoretical grounding, broader experimental validation, and comparison with state‑of‑the‑art adaptive optimizers are needed to establish its general utility.

