TMS: Trajectory-Mixed Supervision for Reward-Free, On-Policy SFT


Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) are the two dominant paradigms for enhancing Large Language Model (LLM) performance on downstream tasks. While RL generally preserves broader model capabilities (retention) better than SFT, it comes with significant costs: complex reward engineering, instability, and expensive on-policy sampling. In contrast, SFT is efficient but brittle, often suffering from catastrophic forgetting due to $\textbf{Supervision Mismatch}$: the divergence between the model’s evolving policy and static training labels. We address this trade-off with $\textbf{Trajectory-Mixed Supervision (TMS)}$, a reward-free framework that approximates the on-policy benefits of RL by creating a dynamic curriculum from the model’s own historical checkpoints. TMS minimizes $\textit{Policy-Label Divergence (PLD)}$, preventing the mode collapse that drives forgetting in standard SFT. Experiments across reasoning (MATH, GSM8K) and instruction-following benchmarks demonstrate that TMS effectively shifts the accuracy–retention Pareto frontier. While RL remains the gold standard for retention, TMS significantly outperforms standard and iterative SFT, bridging the gap to RL without requiring reward models or verifiers. Mechanistic analysis confirms that PLD drift accurately predicts forgetting and that TMS successfully mitigates this drift.


💡 Research Summary

The paper tackles a fundamental dilemma in post‑training large language models (LLMs): supervised fine‑tuning (SFT) is cheap and stable but often leads to catastrophic forgetting, while on‑policy reinforcement learning (RL) preserves broad capabilities but requires costly reward modeling, verifier pipelines, and extensive on‑policy sampling. The authors identify two distinct failure modes of standard SFT. First, Temporal Supervision Mismatch arises because the supervision distribution (a fixed set of reference outputs) does not evolve while the model’s policy changes during training. As the policy drifts away from the static labels, the cross‑entropy loss generates large corrective gradients that pull the model back toward the reference, degrading representations that support unrelated abilities. Second, Mode Collapse occurs when a single reference trajectory is used as the target for tasks with many valid solutions (e.g., math reasoning, code generation). The model concentrates probability mass on that single answer, suppressing alternative correct paths and reducing answer diversity.
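As a concrete illustration of the mode-collapse failure and the mixed-supervision remedy, the sketch below blends a single static reference distribution with the average of historical checkpoint policies to form a soft training target. This is a minimal toy reading of trajectory-mixed supervision, not the paper's exact recipe; the function `mix_targets` and the mixing weight `alpha` are illustrative assumptions.

```python
import numpy as np

def mix_targets(q_ref, checkpoint_policies, alpha=0.5):
    """Blend the static reference distribution with the average of
    historical checkpoint policies to form a dynamic soft target.
    `alpha` (an assumed hyperparameter) weights the static labels."""
    q_ref = np.asarray(q_ref, dtype=float)
    hist = np.mean(np.asarray(checkpoint_policies, dtype=float), axis=0)
    target = alpha * q_ref + (1.0 - alpha) * hist
    return target / target.sum()  # renormalize to a valid distribution

# Toy 4-token vocabulary: a one-hot reference answer vs. two historical
# checkpoints that spread mass over alternative valid trajectories.
q_ref = [1.0, 0.0, 0.0, 0.0]
checkpoints = [[0.5, 0.3, 0.1, 0.1],
               [0.4, 0.2, 0.2, 0.2]]
target = mix_targets(q_ref, checkpoints, alpha=0.5)

# Unlike the one-hot label, the mixed target retains probability mass
# on alternative tokens, so training on it does not force collapse.
assert target[1:].sum() > 0
```

Training against such a soft target rather than the one-hot reference is what keeps probability mass on alternative correct paths, countering the collapse described above.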

To quantify these phenomena, the authors introduce Policy‑Label Divergence (PLD), defined as the forward KL (equivalently, cross‑entropy up to a constant) between the fixed supervision distribution $q(y \mid x)$ and the current policy $\pi_{\theta_t}(y \mid x)$:

$$\mathrm{PLD}_t(x) = \mathrm{KL}\big(q(\cdot \mid x)\,\|\,\pi_{\theta_t}(\cdot \mid x)\big) = \mathbb{E}_{y \sim q(\cdot \mid x)}\!\left[\log \frac{q(y \mid x)}{\pi_{\theta_t}(y \mid x)}\right]$$
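On a toy categorical vocabulary, PLD can be computed directly as a forward KL. The snippet below is a minimal sketch (the helper `forward_kl` and the example distributions are illustrative, not from the paper); it shows that PLD grows as the policy drifts away from the static labels, matching the intuition above.

```python
import numpy as np

def forward_kl(q, p, eps=1e-12):
    """Forward KL divergence KL(q || p) between two categorical
    distributions, with a small epsilon for numerical safety."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

# Toy 4-token vocabulary: the fixed supervision distribution q and two
# snapshots of the policy pi at different points in training.
q = [0.70, 0.20, 0.05, 0.05]          # static labels
pi_early = [0.60, 0.25, 0.10, 0.05]   # policy still close to the labels
pi_late = [0.10, 0.10, 0.40, 0.40]    # policy after drifting away

pld_early = forward_kl(q, pi_early)
pld_late = forward_kl(q, pi_late)

# Drift away from the static labels shows up as a rising PLD.
assert pld_late > pld_early
```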

