Patch the Distribution Mismatch: RL Rewriting Agent for Stable Off-Policy SFT
Large language models (LLMs) have made rapid progress, yet adapting them to downstream scenarios still commonly relies on supervised fine-tuning (SFT). When downstream data exhibit a substantial distribution shift from the model’s prior training distribution, SFT can induce catastrophic forgetting. To narrow this gap, data rewriting has been proposed as a data-centric approach that rewrites downstream training data prior to SFT. However, existing methods typically sample rewrites from a prompt-induced conditional distribution, so the resulting targets are not necessarily aligned with the model’s natural QA-style generation distribution. Moreover, reliance on fixed templates can lead to diversity collapse. To address these issues, we cast data rewriting as a policy learning problem and learn a rewriting policy that better matches the backbone’s QA-style generation distribution while preserving diversity. Since distributional alignment, diversity, and task consistency are automatically evaluable but difficult to optimize end-to-end with differentiable objectives, we leverage reinforcement learning to optimize the rewrite distribution under reward feedback and propose an RL-based data-rewriting agent. The agent jointly optimizes QA-style distributional alignment and diversity under a hard task-consistency gate, thereby constructing a higher-quality rewritten dataset for downstream SFT. Extensive experiments show that our method achieves downstream gains comparable to standard SFT while reducing forgetting on non-downstream benchmarks by 12.34% on average. Our code is available at https://anonymous.4open.science/r/Patch-the-Prompt-Gap-4112.
💡 Research Summary
The paper tackles a fundamental problem in fine‑tuning large language models (LLMs): when the downstream data distribution diverges significantly from the model’s pre‑training distribution, supervised fine‑tuning (SFT) often leads to catastrophic forgetting of the model’s general capabilities. Existing data‑rewriting pipelines attempt to mitigate this by rewriting downstream examples before SFT, but they sample rewrites from a prompt‑conditioned distribution π₀(· | x, y★, x_prompt). This conditional distribution does not match the QA‑style generation distribution π₀(· | x) that SFT implicitly expects, so the rewritten targets remain only partially in‑distribution. Moreover, reliance on a small set of fixed templates causes diversity collapse, further limiting the effectiveness of the approach.
To address these shortcomings, the authors recast data rewriting as a policy‑learning problem. They introduce a lightweight rewriting agent R_ϕ that sits on top of a frozen instruction‑tuned base model π₀ via LoRA adapters, thereby learning a low‑rank “patch” rather than a full model. The agent receives an input x, the expert answer y★, and a rewriting prompt x_prompt, and generates a rewrite ỹ ∼ R_ϕ(· | x, y★, x_prompt). Training proceeds with on‑policy reinforcement learning, using a unified reward that simultaneously encourages (i) task consistency, (ii) alignment with the QA‑style distribution, and (iii) output diversity.
Task‑consistency reward (r_task) is binary. It first applies a cheap rule‑based check for answer correctness; if the answer passes, a stronger LLM‑as‑judge evaluates reasoning validity. The reward r_task is 1 only when both checks succeed, and 0 otherwise. This hard gate ensures that only semantically correct rewrites receive any further reward.
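The two-stage gate can be sketched as follows. This is a minimal illustration, not the paper's implementation: `rule_check` and `judge_reasoning` are hypothetical stand-ins for the rule-based answer check and the LLM-as-judge call.

```python
def rule_check(rewrite: str, gold_answer: str) -> bool:
    """Cheap rule-based check (assumed form): does the rewrite
    contain the gold answer string?"""
    return gold_answer.strip().lower() in rewrite.lower()

def judge_reasoning(rewrite: str) -> bool:
    """Placeholder for the LLM-as-judge reasoning-validity check;
    a real implementation would query a stronger model."""
    return len(rewrite.split()) > 3  # stand-in heuristic only

def r_task(rewrite: str, gold_answer: str) -> int:
    # Hard gate: the expensive judge runs only if the cheap rule
    # check passes, and the reward is 1 only when both succeed.
    if not rule_check(rewrite, gold_answer):
        return 0
    return 1 if judge_reasoning(rewrite) else 0
```

Ordering the cheap check first keeps judge calls (the expensive part) off rewrites that already fail on the answer.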
Distributional‑alignment reward (r_dist) measures how likely the rewrite is under the frozen base model conditioned only on x. The authors compute the length‑normalized negative log‑likelihood (NLL) of ỹ under π₀(· | x), then perform group‑wise whitening: for each input x, they calculate the mean μₓ and standard deviation σₓ of the NLL across all task‑consistent candidates S₊ₓ, and transform the raw NLL into a normalized score \hatℓ_dist. A sigmoid mapping converts this to a bounded reward in (0, 1).
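The whitening-plus-sigmoid step can be sketched numerically. This is an assumption-laden illustration: the exact sign convention and transform details are not specified above, so the sketch assumes that a lower-than-average NLL (i.e., more in-distribution under π₀(· | x)) should map to a higher reward.

```python
import math

def r_dist(nll_per_token: list[float]) -> list[float]:
    """Group-wise whitening of length-normalized NLLs over the
    task-consistent candidates of one input x, then a sigmoid map
    to a bounded reward in (0, 1). Sign convention is an assumption:
    lower NLL -> higher reward."""
    n = len(nll_per_token)
    mu = sum(nll_per_token) / n                        # group mean mu_x
    var = sum((v - mu) ** 2 for v in nll_per_token) / n
    sigma = math.sqrt(var) or 1.0                      # guard degenerate groups
    # Whitened score: candidates cheaper than average score positive.
    scores = [(mu - v) / sigma for v in nll_per_token]
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]
```

Because the whitening is per-input, the reward ranks candidates for the same x against each other rather than against a global NLL scale, which keeps easy and hard inputs comparable.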