Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning
Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks on Qwen2.5, Qwen3, and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6 percent on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.
💡 Research Summary
The paper investigates a critical flaw in the prevailing two‑stage training pipeline for reasoning language models, where an offline supervised fine‑tuning (SFT) phase is followed by an online reinforcement learning (RL) phase. While SFT is typically optimized in isolation to maximize offline metrics (e.g., token‑level negative log‑likelihood or KL‑regularized loss), the authors demonstrate that a stronger SFT checkpoint does not guarantee better final performance after RL. In fact, after identical RL training, models initialized from weaker SFT checkpoints often surpass those from stronger ones. The root cause is identified as a distribution mismatch: the offline data are generated by a behavior policy πβ (the data‑collecting policy), whereas RL updates the target policy πθ using roll‑outs sampled from the evolving model itself. Consequently, SFT optimizes the model on state‑action pairs that the RL stage will rarely encounter, especially in long‑horizon reasoning tasks where early token probabilities compound over the entire sequence.
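To see why the compounding matters, consider a toy calculation (illustrative only, not from the paper): if the target policy assigns, on average, a slightly smaller probability to each token of an offline trajectory than the behavior policy did, the sequence-level probability ratio shrinks geometrically with length.

```python
# Illustrative only: a small hypothetical per-token probability ratio
# between pi_theta and pi_beta compounds over a long reasoning chain.
ratio_per_token = 0.9  # hypothetical average pi_theta / pi_beta per token

for T in (10, 100, 1000):
    # Sequence-level ratio is the product of T per-token ratios.
    print(f"T={T:5d}: sequence-level ratio = {ratio_per_token ** T:.3e}")
```

Even a modest per-token gap makes thousand-token trajectories essentially unreachable under the target policy, which is why the mismatch is especially severe for long-horizon reasoning.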
To address this, the authors propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), a lightweight re-weighting scheme that adjusts the contribution of each token's loss during SFT based on importance-sampling ratios between the target and behavior policies. Three granularity levels are offered:
- Sequence‑level weighting: Compute a single importance weight w₁:T = ∏ₜ πθ(yₜ|·)/πβ(yₜ|·) for the whole output and apply it uniformly to all tokens.
- Token‑level (suffix) weighting: For each token t, calculate a discounted suffix importance Gₜ = γ^{T‑t} ∏_{j>t} πθ(y_j|·)/πβ(y_j|·), where γ∈(0,1] controls variance for long horizons.
- Block‑level weighting: Partition the sequence into blocks of size B, compute block‑wise products of ratios, and apply the suffix weight only at block boundaries, thereby reducing variance while preserving locality.
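The three variants above can be sketched in a few lines of plain Python, given per-token log-probabilities of the offline data under πθ and πβ. This is a minimal sketch under stated assumptions: the exact block-boundary behavior (here, each block is weighted by the product of ratios over all *later* blocks) is one plausible reading of the description, not the paper's reference implementation.

```python
import math

def sequence_weight(logp_theta, logp_beta):
    """Sequence-level: w_{1:T} = prod_t pi_theta(y_t)/pi_beta(y_t),
    computed in log space and applied uniformly to every token."""
    return math.exp(sum(logp_theta) - sum(logp_beta))

def token_suffix_weights(logp_theta, logp_beta, gamma=0.99):
    """Token-level: G_t = gamma^{T-t} * prod_{j>t} pi_theta(y_j)/pi_beta(y_j).
    Indices are 0-based here, so the discount exponent is T-1-t."""
    T = len(logp_theta)
    weights = []
    for t in range(T):
        # Suffix log-ratio over tokens strictly after position t.
        log_ratio = sum(logp_theta[j] - logp_beta[j] for j in range(t + 1, T))
        weights.append(gamma ** (T - 1 - t) * math.exp(log_ratio))
    return weights

def block_suffix_weights(logp_theta, logp_beta, B=4):
    """Block-level: partition into blocks of size B and weight each block
    by the product of ratios over later blocks (assumed boundary rule)."""
    T = len(logp_theta)
    log_ratios = [lt - lb for lt, lb in zip(logp_theta, logp_beta)]
    n_blocks = math.ceil(T / B)
    block_log = [sum(log_ratios[b * B:(b + 1) * B]) for b in range(n_blocks)]
    weights = []
    for b in range(n_blocks):
        w = math.exp(sum(block_log[b + 1:]))  # suffix product over later blocks
        weights.extend([w] * len(log_ratios[b * B:(b + 1) * B]))
    return weights
```

As a sanity check, when πθ and πβ agree on every token (and γ = 1), all three variants reduce to uniform weights of 1, recovering plain SFT.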
Implementation details include clipping log‑probability differences to avoid extreme values, applying upper/lower bounds to the exponentiated weights, and using a stop‑gradient operator so that the weights themselves are not differentiated. The final PEAR‑augmented loss is L_PEAR = ∑ₜ sg(Gₜ)·ℓₜ, where ℓₜ is the base SFT loss (NLL or KL‑distillation). This formulation requires only the log‑probabilities of the offline data under both πβ (known from the dataset) and the current πθ, incurring negligible overhead.
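Putting these pieces together, the token-level PEAR objective can be sketched as follows. This is a schematic sketch, not the authors' code: in a real autograd framework the weight G_t would be detached (e.g. `.detach()` in PyTorch) to realize the stop-gradient, while here the weights are plain floats and therefore constants by construction. The clipping threshold and weight bounds are hypothetical values chosen for illustration.

```python
import math

def pear_loss(nll, logp_theta, logp_beta, gamma=0.99,
              log_clip=5.0, w_min=0.1, w_max=10.0):
    """Token-level PEAR loss: L_PEAR = sum_t sg(G_t) * ell_t.

    nll        -- per-token base SFT losses ell_t (e.g. negative log-likelihood)
    logp_theta -- per-token log-probs of the offline data under the current policy
    logp_beta  -- per-token log-probs under the behavior (data-collecting) policy
    """
    T = len(nll)
    # Clip per-token log importance ratios to avoid extreme values.
    log_r = [max(-log_clip, min(log_clip, lt - lb))
             for lt, lb in zip(logp_theta, logp_beta)]
    total = 0.0
    for t in range(T):
        # Discounted suffix weight G_t (0-based index: exponent T-1-t).
        g = gamma ** (T - 1 - t) * math.exp(sum(log_r[t + 1:]))
        # Bound the exponentiated weight from above and below.
        g = max(w_min, min(w_max, g))
        total += g * nll[t]  # weight acts as a constant (stop-gradient)
    return total
```

When the two policies agree and γ = 1, every weight is 1 and `pear_loss` falls back to the unweighted SFT loss, matching the claim that PEAR augments rather than replaces the base objective.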
Experimental evaluation spans multiple model families (Qwen2.5-Math; Qwen3-Base at 0.6 B, 1.7 B, 4 B, and 8 B; and DeepSeek-Distilled-Qwen-1.5 B) and two benchmark domains: verifiable logic puzzles (SynLogic) and the AIME-2025 mathematics competition. All experiments keep the RL stage identical (PPO-style policy optimization) and vary only the SFT objective. Results show that PEAR consistently improves post-RL metrics: Pass@1 gains of up to ~8 % and Pass@8 gains of up to 14.6 % absolute on AIME-2025, outperforming a wide range of SFT baselines (plain NLL, KL-regularized SFT, and various token-reweighting schemes such as TopLogP and BottomP). An additional analysis reveals that models fine-tuned with PEAR experience ≈30 % less parameter drift during RL, indicating that the initialization is already closer to the eventual target policy.
The paper’s contributions can be summarized as follows:
- Problem articulation: It rigorously frames the SFT‑RL mismatch as an off‑policy evaluation issue, highlighting that optimizing for offline loss alone can be counter‑productive for downstream RL.
- Methodology: PEAR provides a theoretically grounded, easy‑to‑implement re‑weighting strategy that aligns the offline loss with the distribution the model will encounter during RL.
- Empirical validation: Across several model scales and two distinct reasoning tasks, PEAR yields statistically significant improvements in final RL performance and reduces training instability.
- Practical impact: Because PEAR only modifies loss weighting, it can be dropped into existing SFT pipelines without architectural changes or substantial compute overhead.
Limitations and future work are acknowledged. The current evaluation focuses on reasoning and math tasks; generalization to dialogue, code generation, or multimodal settings remains open. Moreover, accurate estimation of πθ probabilities early in training may be noisy, potentially affecting weight stability; adaptive schemes for estimating importance ratios could further improve robustness. Finally, integrating PEAR with dynamic curriculum or data‑selection strategies could amplify its benefits.
In conclusion, the study shifts the paradigm from “good SFT maximizes SFT performance” to “good SFT prepares the model for RL,” offering a concrete algorithmic solution that bridges the offline‑online gap and delivers measurable gains in the final capabilities of large language models.