ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

Reading time: 5 minutes
...

📝 Original Info

  • Title: ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
  • ArXiv ID: 2512.02835
  • Date: 2025-12-02
  • Authors: Yifan Li¹², Yingda Yin³*†, Lingting Zhu³†, Weikai Chen³, Shengju Qian³, Xin Wang³, Yanwei Fu¹²* (* corresponding authors, † project leads)
  • Affiliations: ¹ Fudan University, ² Shanghai Innovation Institute, ³ LIGHTSPEED

📝 Abstract

Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- aligning with pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performance on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at https://clementine24.github.io/ReVSeg/.
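To make the outcome-driven RL idea concrete, the sketch below illustrates one plausible reward and advantage computation for RL post-training. It is not the authors' implementation: the GRPO-style group-relative advantages, the reward weights, and all names (`Trajectory`, `outcome_reward`, `group_advantages`) are assumptions; the only grounded part is that learning is driven by an outcome reward over the final segmentation, with format, temporal, and spatial components suggested by the paper's reward plots.

```python
# Minimal, illustrative sketch of an outcome-driven reward for RL post-training.
# GRPO-style group-relative advantages and all weights are ASSUMPTIONS, not the
# authors' implementation.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Trajectory:
    """One sampled reasoning chain and its final prediction (hypothetical fields)."""
    well_formatted: bool          # did the model emit the expected multi-turn structure?
    selected_frames: List[int]    # frames chosen in the temporal-evidence step
    pred_mask: np.ndarray         # H x W boolean mask decoded from spatial grounding
    gt_mask: np.ndarray           # H x W boolean ground-truth mask
    gt_key_frames: List[int]      # annotated key frames (assumed available for training)


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def outcome_reward(t: Trajectory) -> float:
    """Combine format, temporal, and spatial terms into one scalar outcome reward."""
    r_format = 1.0 if t.well_formatted else 0.0
    hits = len(set(t.selected_frames) & set(t.gt_key_frames))
    r_temporal = hits / max(len(t.gt_key_frames), 1)
    r_spatial = mask_iou(t.pred_mask, t.gt_mask)
    # Placeholder weights; the paper only states that outcome signals drive learning.
    return 0.1 * r_format + 0.3 * r_temporal + 0.6 * r_spatial


def group_advantages(group: List[Trajectory]) -> np.ndarray:
    """Group-relative advantage: rewards standardized across rollouts sampled for
    the same query, used to weight the policy-gradient update on each trajectory."""
    rewards = np.array([outcome_reward(t) for t in group], dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

In a full training loop, these advantages would scale the log-likelihood of each sampled trajectory's tokens, so that only the final segmentation outcome, rather than any intermediate supervision, shapes the reasoning chain.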

💡 Deep Analysis

Figure 1

📄 Full Content

[Figure 1 panels: an example query ("While driving and turning right at an intersection, what moving thing should I be most wary of to avoid a collision?"), the model's reasoning about pedestrians at the intersection, the grounded object description "the pedestrian standing near the Steelers store", and bar charts comparing VideoLISA vs. ours and the base model vs. our model, each with and without RL, on in-domain and out-of-domain tests.]

Figure 1. (Left) Through an explicit reasoning chain, our ReVSeg tackles reasoning-focused video object segmentation and accurately grounds objects referenced by complex, abstract real-world queries. (Right) While the base model and its RL variant struggle on the task, our method achieves strong performance, with RL post-training yielding a further substantial boost. We report the J&F metric on the Ref-DAVIS17 (in-domain) and ReasonVOS (out-of-domain) datasets in the chart.

1. Introduction

Human interpretation of videos relies on recognizing how events unfold, why actions occur, and when key moments matter. Traditional video object segmentation (VOS) methods, however, primarily exploit appearance-only or category-level cues [3, 29, 40, 45, 46]. Reasoning VOS elevates the task: the model must parse abstract, under-specified instructions, draw on commonsense and causal reasoning, and integrate temporal dynamics with semantic knowledge to identify the target (e.g., "the runner most likely to win" or "the object causing the accident").

While recent vision-language models (VLMs) have shown promising capabilities for reasoning segmentation [2, 16, 50], current systems predominantly model reasoning VOS as a single-step latent prediction: inserting a special segmentation token and decoding it directly into mask outputs [2, 8, 20, 35, 41, 42, 50, 53].
This collapses the multi-step reasoning process (interpreting abstract instructions, identifying candidates, and spatio-temporal grounding) into a single conclusion backed by opaque embeddings. Such compactness comes at a great cost: interpretability is limited, distribution shift arises from forcing VLMs into non-native output spaces, and substantial data is required for supervised fine-tuning.

These challenges reflect a deeper insight: effective video reasoning unfolds through a sequence of deliberate choices, not a single latent inference. A model must determine where to focus, when to attend, and which entity the query refers to. Motivated by this principle, we introduce ReVSeg, which eliminates the use of latent segmentation tokens and instead reformulates reasoning video segmentation as an explicit sequence of reasoning steps. In particular, ReVSeg decomposes reasoning VOS into three actions aligned with VLM-native capabilities: video understanding (interpreting the query and assessing scene dynamics), temporal grounding (identifying key frames or intervals pertinent to the query), and spatial grounding (localizing the target objects within selected frames). ReVSeg orchestrates these primitives through multi-turn actions with a single VLM, ensuring that the semantic context established in early reasoning steps seamlessly propagates to downstream localization. Executing reasoning through these native interfaces preserves pretraining alignment …
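The three-turn decomposition can be made concrete with a minimal sketch of how the chain might be orchestrated with a single chat-style VLM. Only the step ordering (semantics interpretation, then temporal evidence selection, then spatial grounding in selected frames) follows the paper; the `chat` callable, prompt wording, JSON output formats, bounding-box interface, and the downstream mask propagator are all illustrative assumptions rather than the authors' prompts or API.

```python
# Illustrative sketch of a three-turn reasoning chain with one chat-style VLM.
# Prompts, output formats, and parsing are ASSUMPTIONS; only the three-step
# structure (semantics -> temporal evidence -> spatial grounding) is from the paper.
import json
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
ChatFn = Callable[[List[Message]], str]  # maps the running dialogue to the next reply


def revseg_chain(chat: ChatFn, query: str, num_frames: int) -> Tuple[List[int], List[List[float]]]:
    history: List[Message] = [
        {"role": "user",
         "content": f"Video with {num_frames} frames is attached. Query: {query}\n"
                    "Turn 1: describe the target object and the relevant dynamics."}
    ]

    # Turn 1: semantics interpretation (free-form description kept in context).
    history.append({"role": "assistant", "content": chat(history)})

    # Turn 2: temporal evidence selection (key-frame indices, requested as JSON).
    history.append({"role": "user",
                    "content": "Turn 2: list the frame indices that best show the target, "
                               "as a JSON array of integers."})
    frame_reply = chat(history)
    history.append({"role": "assistant", "content": frame_reply})
    key_frames: List[int] = json.loads(frame_reply)

    # Turn 3: spatial grounding (one [x1, y1, x2, y2] box per selected frame).
    history.append({"role": "user",
                    "content": "Turn 3: for each listed frame, output a bounding box "
                               "[x1, y1, x2, y2] for the target, as a JSON array of arrays."})
    boxes: List[List[float]] = json.loads(chat(history))

    # Downstream, the boxes on key frames would seed a promptable video segmenter
    # (e.g., a SAM-2-style mask propagator) to produce per-frame masks; omitted here.
    return key_frames, boxes
```

Because all three turns share a single dialogue, the semantic description from the first turn is still in context when the model selects frames and emits boxes, which is the propagation property the paper emphasizes.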

📸 Image Gallery

format_reward.png framework.png num_turns.png prompt-round1.png prompt-round2.png response_length.png reward.png spatial_reward.png teaser.png temporal_reward.png

Reference

This content is AI-processed based on open access ArXiv data.
