One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient
Supervised fine-tuning (SFT) is the predominant method for adapting large language models (LLMs), yet it often struggles with generalization compared to reinforcement learning (RL). In this work, we posit that this performance disparity stems not just from the loss function, but from a more fundamental difference: SFT learns from a fixed, pre-collected dataset, whereas RL utilizes on-policy data sampled from the current policy. Building on this hypothesis, we introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method. OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory. At each step, it performs a Monte Carlo "rollout" by sampling multiple candidate tokens from the current policy's distribution. The ground-truth token from the supervised data is then used to provide a reward signal to these samples. Guided by policy gradient, our algorithm repurposes static, off-policy supervised data into a dynamic, on-policy signal at the token level, capturing the generalization benefits of on-policy learning while bypassing the costly overhead of full sentence generation. Through extensive experiments on a diverse suite of challenging benchmarks spanning mathematical reasoning, code generation, and general domain reasoning, we demonstrate that OTR consistently outperforms standard SFT. Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs and provide compelling evidence that the on-policy nature of training data is a critical driver of generalization, offering a promising new direction for future work.
💡 Research Summary
The paper investigates why supervised fine‑tuning (SFT) of large language models (LLMs) often lags behind reinforcement‑learning‑based methods in terms of generalization. The authors argue that the core issue is not merely the loss function but the nature of the training data: SFT relies on a static, pre‑collected dataset (off‑policy), whereas RL continuously samples data from the current policy (on‑policy). This on‑policy characteristic is believed to enable RL to explore low‑probability regions, preserve pre‑training knowledge, and achieve better downstream performance.
To bridge this gap, the authors propose One‑Token Rollout (OTR), a novel fine‑tuning algorithm that injects on‑policy signals into the SFT process using the policy‑gradient framework at the token level. The key idea is to treat each token generation step as a single‑step reinforcement‑learning trajectory. For a given state (prompt plus previously generated tokens), the model samples K candidate tokens from an exploration policy π′θ, which is derived from the model's logits with a temperature κ > 1 to encourage diversity. The ground‑truth token from the supervised dataset is then used to compute an immediate reward: +1 if the sampled token matches the ground truth, and a small negative constant β (set to −0.1 in experiments) otherwise.
The per‑token loss is the Monte‑Carlo estimate of the negative policy‑gradient objective:
Lₜ^{OTR}(θ) = − (N_gt/K)·log πθ(xₜ|sₜ) − (β/K) ∑_{j: a′ₜ,ⱼ ≠ xₜ} log πθ(a′ₜ,ⱼ|sₜ),
where N_gt is the number of times the correct token was sampled among the K candidates. This formulation yields two intuitive effects: (1) when the correct token is frequently sampled, its log‑likelihood is weighted more heavily, reproducing the usual SFT signal; (2) when incorrect tokens are sampled, the β term (negative in practice) pushes down the probability the model assigns to them, acting as a regularizer. The total loss for a sequence is the average over all token positions.
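The per‑token objective above can be sketched in a few lines of PyTorch. This is an illustrative, unbatched reconstruction from the formulas in this summary, not the authors' implementation; the function name `otr_loss` and its defaults are our own choices, with κ and β set to the values reported here.

```python
import torch
import torch.nn.functional as F

def otr_loss(logits, targets, K=4, kappa=2.0, beta=-0.1):
    """One-Token Rollout per-sequence loss (illustrative sketch).

    logits:  (T, V) next-token logits at each of T positions.
    targets: (T,)   ground-truth token ids from the supervised data.
    kappa > 1 flattens the exploration distribution pi'_theta;
    beta is the reward for sampled tokens that miss the ground truth.
    """
    log_probs = F.log_softmax(logits, dim=-1)                  # pi_theta
    explore = F.softmax(logits / kappa, dim=-1)                # pi'_theta (temperature kappa)
    samples = torch.multinomial(explore, K, replacement=True)  # (T, K) one-token rollouts
    hit = samples == targets.unsqueeze(1)                      # did the rollout match the label?
    rewards = torch.where(hit,
                          torch.ones_like(samples, dtype=logits.dtype),
                          torch.full_like(samples, beta, dtype=logits.dtype))
    # REINFORCE-style Monte Carlo estimate: -(1/K) * sum_j r_j * log pi_theta(a_j | s_t),
    # averaged over the T token positions.
    sampled_logp = log_probs.gather(1, samples)                # (T, K)
    return -(rewards * sampled_logp).sum(dim=1).mean() / K
```

Note that gradients flow only through the log‑probabilities of the sampled tokens; the rewards are constants, as in standard policy gradient.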
OTR's computational advantage stems from the fact that it does not require generating full sentences or computing complex, sequence‑level rewards. Because the per‑position logits are already computed for the standard SFT loss, sampling K tokens per position adds only O(K·T) cheap operations on top of the forward pass. In practice, the authors use K = 4–8 and observe negligible overhead in GPU memory or wall‑clock time relative to SFT. Because OTR can be implemented with the same optimizer, learning‑rate schedule, and batch size as SFT, it integrates seamlessly into existing training pipelines.
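To make the "drop-in" claim concrete, here is a minimal training step in which only the loss differs from an SFT step; everything else (optimizer, schedule, batching) is untouched. `TinyLM` and `otr_step` are hypothetical stand-ins for illustration, not code from the paper, and the toy model replaces a real causal LM.

```python
import torch
import torch.nn.functional as F

class TinyLM(torch.nn.Module):
    """Toy stand-in for a causal LM head (hypothetical, illustration only)."""
    def __init__(self, vocab=32, dim=16):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, ids):          # (T,) token ids -> (T, V) next-token logits
        return self.head(self.embed(ids))

def otr_step(model, opt, ids, K=4, kappa=2.0, beta=-0.1):
    """One optimizer step; identical to an SFT step except for the loss."""
    logits = model(ids[:-1])                     # predict token t+1 from the prefix state
    targets = ids[1:]
    log_probs = F.log_softmax(logits, dim=-1)
    samples = torch.multinomial(F.softmax(logits / kappa, dim=-1).detach(),
                                K, replacement=True)
    rewards = torch.where(samples == targets.unsqueeze(1),
                          torch.ones_like(samples, dtype=logits.dtype),
                          torch.full_like(samples, beta, dtype=logits.dtype))
    loss = -(rewards * log_probs.gather(1, samples)).sum(dim=1).mean() / K
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=5e-6)  # same optimizer family as in the paper
ids = torch.randint(0, 32, (10,))
loss_value = otr_step(model, opt, ids)
```

Swapping `otr_step`'s loss for `F.cross_entropy(logits, targets)` recovers vanilla SFT, which is exactly the pipeline-compatibility point made above.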
Experimental evaluation is conducted on a subset (5k examples) of the OpenR1‑Math‑220k dataset, fine‑tuned for two epochs with AdamW (lr = 5e‑6, cosine decay). The method is tested on several open‑source LLMs: Qwen2.5‑3B/7B, Qwen3‑4B‑Base, Qwen3‑8B‑Base, and Olmo3‑7B. Benchmarks cover three domains: mathematical reasoning (GSM8K, Olympiad, AIME series, etc.), code generation (HumanEval+, MBPP+), and general reasoning (SuperGPQA, MMLU‑Pro).
Across all settings, OTR consistently outperforms vanilla SFT. On math benchmarks, average accuracy improves by 1–4 percentage points; on larger models the gain reaches over 2 pp. Code generation sees modest but reliable increases in pass@1 (≈1–2 pp). General reasoning tasks also benefit, with SuperGPQA and MMLU‑Pro scores rising by ~0.5–1 pp. Notably, in some cases SFT degrades performance relative to the base model (e.g., Qwen3‑4B on certain math tasks), while OTR recovers or surpasses the original capability, underscoring the value of on‑policy simulation for preserving pre‑training knowledge.
The authors acknowledge limitations: the reward function is binary and does not capture nuanced quality metrics such as execution correctness for code or logical coherence for reasoning. Moreover, token‑level rollouts ignore longer‑range dependencies that full‑sentence RL can model. Future work is suggested to explore adaptive β values, richer reward designs, or hybrid schemes that combine token‑level OTR with occasional sentence‑level rollouts.
In summary, One‑Token Rollout offers a data‑centric bridge between SFT and RL, delivering on‑policy learning benefits without the heavy computational cost of full reinforcement learning. It demonstrates that converting static supervised data into dynamic, token‑level on‑policy signals can substantially improve the generalization of fine‑tuned LLMs, marking a practical and theoretically insightful contribution to the field.