Improving Training Efficiency with Multimodal Single-Rollout Learning

Reading time: 5 minutes
...

📝 Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this training efficiency-stability trade-off, we introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps. When trained for the same number of steps, MSSR's performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.

📄 Content

Stable and Efficient Single-Rollout RL for Multimodal Reasoning

Rui Liu1,2*, Dian Yu1, Lei Ke1, Haolin Liu3, Yujun Zhou4, Zhenwen Liang1, Haitao Mi1, Pratap Tokekar2, Dong Yu1
1Tencent AI Lab, Bellevue  2University of Maryland, College Park  3University of Virginia  4University of Notre Dame
*Work done during an internship at Tencent AI Lab, Bellevue, WA.
Project Page: https://mssr-proj.github.io
arXiv:2512.18215v1 [cs.LG] 20 Dec 2025

Figure 1. Performance overview of MSSR: (a-b) Training and validation accuracy of MVSR (Multimodal Vanilla Single-Rollout), GRPO [28], and our MSSR, trained on the Vision-R1-RL [15] training set and validated on its corresponding validation set. MSSR remains stable and improves steadily, whereas MVSR is unstable and collapses. Notably, MSSR reaches a similar final validation accuracy to GRPO with half the training steps, highlighting its superior training compute efficiency. (c) Our MSSR achieves higher generalization performance across diverse multimodal reasoning benchmarks, including MathVerse [40], MathVista [23], MMK12 [26], R1-Onevision-Bench [37], and HallusionBench [12], compared to other baselines including GRPO [28], RLOO [1], and REINFORCE++ [14]. For fair comparison, all methods use the same total number of rollouts per step.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this training efficiency-stability trade-off, we introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps. When trained for the same number of steps, MSSR's performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.
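The abstract describes an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, but this excerpt does not give its exact formulation. The sketch below is a minimal illustration of one way such shaping could look for a single rollout: the raw reward-minus-baseline advantage is damped on low-entropy (over-confident) tokens. The function names, the normalization, the running baseline, and the coefficient `alpha` are illustrative assumptions, not MSSR's actual definitions.

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> np.ndarray:
    """Per-token entropy of the policy distribution.

    logits: (seq_len, vocab_size) array of unnormalized log-probabilities.
    Returns an array of shape (seq_len,) with the entropy at each position.
    """
    # Numerically stable softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def shaped_single_rollout_advantage(reward: float,
                                    baseline: float,
                                    logits: np.ndarray,
                                    alpha: float = 0.1) -> np.ndarray:
    """Hypothetical entropy-based advantage shaping for one rollout.

    `reward` is the verifiable outcome reward (e.g. 1.0 if correct, 0.0
    otherwise), `baseline` is a running scalar baseline, and `alpha` sets
    how strongly low-entropy (over-confident) tokens are damped. This is
    an illustrative rule, not the paper's mechanism.
    """
    adv = reward - baseline                    # raw single-rollout advantage
    h = token_entropy(logits)                  # per-token entropy, shape (seq_len,)
    h_norm = h / np.log(logits.shape[-1])      # normalize to [0, 1]
    # Damp the advantage where the policy is already very confident,
    # limiting large, destabilizing updates on low-entropy tokens.
    scale = alpha + (1.0 - alpha) * h_norm
    return adv * scale                         # per-token shaped advantages

# Example: a 4-token rollout over a toy 5-word vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
print(shaped_single_rollout_advantage(reward=1.0, baseline=0.4, logits=logits))
```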
1. Introduction

Reinforcement learning (RL) fine-tuning has become central to aligning Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) with human preferences. Early efforts, such as Reinforcement Learning from Human Feedback (RLHF) [3, 27], optimized models toward human-preferred behaviors, improving alignment and performance. More recently, attention has shifted toward Reinforcement Learning with Verifiable Rewards (RLVR) [8, 9, 13, 18, 24, 43, 44], which replaces human feedback with automatically verifiable correctness signals. These binary rewards enable models to learn directly from objective supervision and have been successfully applied to multimodal reasoning [2, 10, 15, 19, 21, 31, 39].

Despite this progress, multimodal RLVR methods still face key challenges. They typically require generating a group of rollouts for each input. Prevalent methods such as GRPO [28] rely on multiple rollouts per input to infer relative advantages, resulting in substantial cost from repeated forward passes through the vision and language encoders, which becomes particularly expensive for large multimodal models. Moreover, when all rollouts in a group yield identical outcomes (e.g., all correct or all incorrect), the relative advantage collapses to zero, producing no learning signal and reducing rollout utilization efficiency [38], as the code sketch below makes concrete. This raises a key question: can multimodal RLVR be made both compute-efficient and stable without sacrificing accuracy, or even while improving it?

We introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free approach that requires only one rollout per multimodal input, aiming to achieve both high training compute efficiency and stable optimization. Although single-rollout RLVR has recently been explored in text-only settings [4, 36], extending it to multimodal reasoning is considerably more challenging. The inclusion of dense, high-dimensional visual inputs in images substantially increases input variance and compli
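The zero-advantage failure mode described above is easy to see in a few lines of code. The sketch below follows the common GRPO-style formulation of normalizing each rollout's reward by the group mean and standard deviation; when every rollout in the group receives the same reward, all advantages vanish and the prompt contributes no gradient. The function name and the epsilon constant are illustrative choices, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: normalize each rollout's reward within its group.

    rewards: shape (num_rollouts,), verifiable rewards for one prompt's group.
    """
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# A mixed group yields informative, non-zero advantages.
print(group_relative_advantages(np.array([1.0, 0.0, 1.0, 0.0])))

# If all rollouts are correct (or all incorrect), every advantage is zero:
# the prompt produces no learning signal despite the cost of all its rollouts.
print(group_relative_advantages(np.array([1.0, 1.0, 1.0, 1.0])))  # -> [0. 0. 0. 0.]
```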

This content is AI-processed based on ArXiv data.
