Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL’s generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.


💡 Research Summary

The paper investigates why reinforcement learning (RL) consistently yields better out‑of‑distribution (OOD) generalization than supervised fine‑tuning (SFT) when adapting large‑scale vision‑language models (VLMs). Rather than attributing the gap to RL’s exploration or reward‑driven optimization, the authors propose a data‑centric hypothesis: RL implicitly acts as a data filter that concentrates learning on medium‑difficulty training samples while largely ignoring easy and hard examples.

To operationalize “difficulty,” the authors generate eight responses per training instance using the current VLM and categorize each instance as Easy (all responses correct), Hard (all responses incorrect), or Medium (mixed correctness). In RL, the reward variance is high for Medium samples, producing non‑zero advantage estimates (Eq. 5) and thus significant gradient updates. By contrast, Easy and Hard samples receive uniform rewards, yielding near‑zero advantage and negligible updates. Consequently, RL updates are dominated by Medium‑difficulty data.
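The bucketing and advantage mechanism described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: `generate` stands in for the VLM's sampler, `is_correct` for the task-specific answer checker, and the group-normalized advantage follows the GRPO-style form (reward minus group mean, divided by group standard deviation), which makes the "uniform rewards give zero advantage" point concrete.

```python
from statistics import mean, pstdev

def classify_difficulty(sample, generate, is_correct, k=8):
    """Label a sample Easy / Medium / Hard from k sampled responses,
    and compute GRPO-style group-normalized advantages."""
    rewards = [1.0 if is_correct(sample, generate(sample)) else 0.0
               for _ in range(k)]
    n_correct = sum(rewards)
    if n_correct == k:
        label = "easy"    # all responses correct -> uniform reward
    elif n_correct == 0:
        label = "hard"    # all responses incorrect -> uniform reward
    else:
        label = "medium"  # mixed correctness -> high reward variance

    # Uniform rewards have zero variance, so Easy/Hard samples get
    # near-zero advantages and contribute negligible gradient updates.
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]
    return label, advantages
```

For an Easy sample (all k responses correct) every reward equals the group mean, so all advantages collapse to zero; only Medium samples produce non-trivial updates.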

The hypothesis is tested by training SFT models on three curated subsets (Easy, Medium, Hard) drawn from the same base data and evaluating both in‑distribution (ID) and OOD performance on two representative VLM tasks: image classification and visual grounding. The classification experiments use a 100‑class ImageNet‑1K subset for training, with ImageNet‑R (artistic) and ImageNet‑A (adversarial) as OOD tests. The grounding experiments train on 10 k RefCOCO instances and evaluate on Ref‑L4 and Lisa for OOD. Results show that training on Hard samples improves ID accuracy but dramatically harms OOD performance; Medium samples achieve a balanced, robust performance across both; Easy samples preserve OOD performance but lag in ID accuracy. These findings confirm that the presence of hard examples in the training set is a primary cause of the SFT‑RL generalization gap.

Building on this insight, the authors introduce Difficulty‑Curated SFT (DC‑SFT), a simple pipeline that first filters out Hard samples and then fine‑tunes the model on the remaining Easy + Medium data. Using LoRA‑based parameter‑efficient fine‑tuning on Qwen2.5‑VL‑3B/7B, DC‑SFT achieves OOD gains of 3–5 percentage points over standard SFT and surpasses RL‑based Group Relative Policy Optimization (GRPO) in both tasks. Moreover, DC‑SFT exhibits higher training stability (lower variance across runs) and reduced computational cost: memory usage and training time drop by roughly 30 % compared with RL, owing to the absence of reward‑model training and KL‑penalty regularization.
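The DC-SFT curation step itself is simple to express. A minimal sketch, assuming each sample already carries a `difficulty` label from a bucketing pass like the one above (the field name and the downstream fine-tuning call are illustrative, not the paper's released code):

```python
def curate(dataset):
    """DC-SFT filtering: keep Easy and Medium samples, drop Hard ones
    before running standard supervised fine-tuning."""
    return [s for s in dataset if s["difficulty"] != "hard"]

# Toy dataset with precomputed difficulty labels.
data = [
    {"id": 0, "difficulty": "easy"},
    {"id": 1, "difficulty": "hard"},
    {"id": 2, "difficulty": "medium"},
]
train_set = curate(data)  # hard sample 1 is removed; 0 and 2 remain
```

Everything after this filter is ordinary SFT (here, LoRA on Qwen2.5-VL-3B/7B), which is why the method avoids the reward-model and KL-penalty machinery that makes RL training slower and more memory-hungry.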

The paper also evaluates DC‑SFT on complex reasoning datasets that contain multi‑step instruction chains. Here, DC‑SFT outperforms both SFT and RL, indicating that the data‑centric filtering not only improves generalization but also enhances higher‑order reasoning capabilities.

Contributions are threefold: (1) a novel data‑centric explanation for the RL‑SFT generalization gap, framing RL as an implicit medium‑difficulty data selector; (2) the DC‑SFT method, which explicitly removes hard samples and delivers superior OOD performance while being more stable and efficient; (3) empirical evidence that DC‑SFT excels on reasoning‑heavy tasks, underscoring the broader impact of data curation on VLM robustness.

Overall, the work shifts the focus from algorithmic sophistication to data quality and composition, suggesting that careful difficulty‑based curation can replace expensive RL pipelines for achieving robust, generalizable VLMs.

