VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation
Although pre-trained Vision-Language-Action (VLA) models exhibit impressive generalization in robotic manipulation, post-training remains crucial to ensure reliable performance during deployment. However, standard offline Supervised Fine-Tuning (SFT) suffers from distribution shift and catastrophic forgetting of pre-trained capabilities, while online Reinforcement Learning (RL) struggles with sparse rewards and poor sample efficiency. In this paper, we propose On-Policy VLA Distillation (VLA-OPD), a framework that bridges the efficiency of SFT with the robustness of RL. Instead of relying on sparse environmental rewards, VLA-OPD leverages an expert teacher to provide dense, token-level supervision on the student's self-generated trajectories. This enables active error correction on policy-induced states while preserving pre-trained general capabilities through gentle alignment. Crucially, we formulate VLA-OPD via a Reverse-KL objective. Unlike standard Forward-KL, which induces mode-covering entropy explosion, or hard cross-entropy (Hard-CE), which causes premature entropy collapse, our bounded mode-seeking objective ensures stable policy learning by filtering out the teacher's epistemic uncertainty while maintaining action diversity. Experiments on the LIBERO and RoboTwin2.0 benchmarks demonstrate that VLA-OPD significantly improves sample efficiency over RL and robustness over SFT, while effectively mitigating catastrophic forgetting during post-training.
💡 Research Summary
This paper introduces VLA-OPD (On-Policy VLA Distillation), a novel framework designed to address the critical challenges in post-training Vision-Language-Action models for robotic manipulation. While pre-trained VLAs show broad generalization, fine-tuning them for reliable deployment on specific tasks remains essential. The two dominant paradigms—offline Supervised Fine-Tuning and online Reinforcement Learning—suffer from complementary weaknesses: SFT is vulnerable to distribution shift and catastrophic forgetting of pre-trained knowledge due to its static, off-policy nature, while online RL struggles with sample inefficiency and high-variance optimization due to sparse outcome rewards.
VLA-OPD bridges this gap by unifying the efficiency of SFT with the robustness of RL through on-policy distillation. The framework operates in a three-phase iterative cycle. First, the student policy interacts with the environment to generate on-policy trajectory rollouts. This exposes the model to its own induced state distribution, including error states not seen in the original expert demonstrations. Second, for every state visited by the student, a frozen expert teacher policy provides dense, token-level action labels. This replaces the sparse environmental reward with immediate, granular supervision, enabling active error correction. Third, the student policy is optimized using a policy gradient objective derived from a Reverse-KL divergence between the student and teacher distributions.
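This three-phase cycle can be summarized in pseudocode. The sketch below is illustrative only: the `student`/`teacher` policy objects, their `sample_action` and `action_token_logits` methods, and the gym-style `env` are assumed names rather than the authors' actual API, and for brevity the Reverse-KL term is backpropagated directly through the student's token logits instead of through the policy-gradient surrogate the paper derives.

```python
# Illustrative sketch of one VLA-OPD iteration (not the authors' code).
# `student`, `teacher`, `env`, and their methods are assumed interfaces.
import torch
import torch.nn.functional as F

def vla_opd_step(student, teacher, env, optimizer, horizon=64):
    # Phase 1: on-policy rollout -- the student acts in the environment,
    # exposing itself to states induced by its own (possibly erroneous) behavior.
    observations = []
    obs = env.reset()
    for _ in range(horizon):
        with torch.no_grad():
            action = student.sample_action(obs)
        observations.append(obs)
        obs, _, done, _ = env.step(action)
        if done:
            obs = env.reset()

    # Phase 2: dense teacher supervision -- the frozen expert labels every
    # visited state with a full distribution over action tokens.
    student_logits = student.action_token_logits(observations)      # (B, T, V)
    with torch.no_grad():
        teacher_logits = teacher.action_token_logits(observations)  # (B, T, V)

    # Phase 3: Reverse-KL update -- KL(student || teacher) per action token,
    # averaged over the batch, replaces the sparse environmental reward.
    log_p_student = F.log_softmax(student_logits, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
    reverse_kl = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(-1)
    loss = reverse_kl.mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the gradient flows only through states the student itself visited, so every update stays on the student's current behavioral manifold; this is the property the later discussion of "gentle" alignment and forgetting relies on.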
A key technical innovation is the use of the Reverse-KL objective. The authors argue that standard alternatives are unsuitable: Forward-KL leads to "mode-covering" behavior, causing the student to mimic the teacher's epistemic uncertainty and resulting in entropy explosion, while a hard cross-entropy loss (argmax matching) leads to premature entropy collapse, stripping the policy of the diversity needed for exploration. In contrast, the "mode-seeking" property of Reverse-KL allows the student to confidently focus on the primary modes of the teacher's distribution, filtering out its uncertainty, while retaining sufficient stochasticity for diverse, valid actions. This ensures stable learning.
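To make the contrast concrete, write $p_T(\cdot \mid s)$ for the teacher's action-token distribution at a visited state $s$ and $p_\theta(\cdot \mid s)$ for the student's. The three candidate objectives can then be stated as follows (notation chosen here for illustration, not necessarily the paper's):

```latex
% Forward KL: mode-covering; the student must place mass wherever the teacher
% does, inheriting the teacher's epistemic uncertainty (entropy explosion).
\mathcal{L}_{\mathrm{FKL}}(\theta) = \mathrm{KL}\!\left(p_T \,\|\, p_\theta\right)
  = \sum_{a} p_T(a \mid s) \log \frac{p_T(a \mid s)}{p_\theta(a \mid s)}

% Hard cross-entropy: argmax matching; collapses the student onto a single
% token (premature entropy collapse).
\mathcal{L}_{\mathrm{CE}}(\theta) = -\log p_\theta\!\left(\arg\max_{a} p_T(a \mid s) \,\middle|\, s\right)

% Reverse KL (used by VLA-OPD): mode-seeking; the student concentrates on the
% teacher's dominant modes while keeping stochasticity among them.
\mathcal{L}_{\mathrm{RKL}}(\theta) = \mathrm{KL}\!\left(p_\theta \,\|\, p_T\right)
  = \sum_{a} p_\theta(a \mid s) \log \frac{p_\theta(a \mid s)}{p_T(a \mid s)}
```

Because the expectation in $\mathcal{L}_{\mathrm{RKL}}$ is taken under the student's own distribution, tokens to which the student assigns negligible probability contribute little to the loss, which is precisely the mode-seeking behavior described above.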
Furthermore, by grounding updates strictly in the student’s current behavioral manifold, VLA-OPD achieves a “gentle” alignment that effectively mitigates catastrophic forgetting of the model’s pre-trained generalist capabilities.
Extensive experiments on the LIBERO and RoboTwin2.0 benchmarks demonstrate the framework’s effectiveness. VLA-OPD significantly outperforms offline SFT in terms of robustness and success rate. Compared to a strong online RL baseline (GRPO), it achieves superior or comparable performance with an order of magnitude fewer training steps, showcasing dramatic sample efficiency gains. The results validate that VLA-OPD successfully combines the fast convergence of SFT with the distribution-aware robustness of RL. The work presents a scalable pathway for continuously improving foundation models by efficiently distilling robust behaviors from existing expert policies into a unified student backbone, circumventing the prohibitive cost of training VLAs from scratch via online RL.