Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning
Existing multimodal large language models for long-video understanding predominantly rely on uniform sampling and single-turn inference, limiting their ability to identify sparse yet critical evidence amid extensive redundancy. We introduce Video-o3, a novel framework that supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Technically, we address two core challenges in interleaved tool invocation. First, to mitigate attention dispersion induced by the heterogeneity of reasoning and tool-calling, we propose Task-Decoupled Attention Masking, which isolates per-step concentration while preserving shared global context. Second, to control context length growth in multi-turn interactions, we introduce a Verifiable Trajectory-Guided Reward that balances exploration coverage with reasoning efficiency. To support training at scale, we further develop a data synthesis pipeline and construct Seeker-173K, comprising 173K high-quality tool-interaction trajectories for effective supervised and reinforcement learning. Extensive experiments show that Video-o3 substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes. These results demonstrate Video-o3’s strong multi-hop evidence-seeking and reasoning capabilities, and validate the effectiveness of native tool invocation in long-video scenarios.
💡 Research Summary
Video‑o3 tackles the long‑standing challenge of understanding lengthy videos with multimodal large language models (MLLMs). Traditional approaches rely on uniform frame sampling followed by a single‑turn inference, which often dilutes sparse but crucial evidence and incurs heavy computational costs. Inspired by how humans explore videos—first scanning globally, then iteratively zooming into promising segments—Video‑o3 introduces a native interleaved tool‑invocation framework that can dynamically seek visual clues, inspect them in high resolution, and terminate reasoning as soon as sufficient evidence is gathered.
The system begins by encoding the entire video at a low‑resolution token level, providing a global overview to the model together with the user query and tool‑usage instructions. The model then enters a loop of internal reasoning: it decides whether the current observations are enough to answer the question or whether a “clue‑seeking” step is required. In a clue‑seeking step, the model issues a command to the VideoCrop tool, specifying a temporal segment and a token quota. The tool returns a high‑resolution clip that is inserted into the conversation as new observation tokens. The model can repeat this process, accumulating multiple clue segments, until it judges that the evidence is sufficient and produces a final answer.
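The loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `model.decide`, `video_crop`, and the step/budget parameters are hypothetical stand‑ins for the actual model interface and VideoCrop tool.

```python
# Hypothetical sketch of the interleaved clue-seeking loop.
# `model` and `video_crop` are stand-ins, not the paper's actual API.
from dataclasses import dataclass

@dataclass
class Observation:
    start: float   # segment start (seconds)
    end: float     # segment end (seconds)
    tokens: list   # high-resolution clip tokens returned by the tool

def answer_with_clue_seeking(model, video_crop, global_tokens, query,
                             max_steps=4):
    """Iteratively gather high-resolution clues until the model can answer."""
    context = {"query": query, "global": global_tokens, "clues": []}
    for _ in range(max_steps):
        step = model.decide(context)        # reason over the current context
        if step["action"] == "answer":      # evidence judged sufficient
            return step["answer"]
        # Clue-seeking: crop a temporal segment under a token budget.
        clip = video_crop(step["start"], step["end"], step["token_quota"])
        context["clues"].append(Observation(step["start"], step["end"], clip))
    # Fallback: force an answer once the step budget is exhausted.
    return model.decide(context, force_answer=True)["answer"]
```

The key design point is that each tool result is appended to the same conversational context, so later reasoning steps can condition on all previously retrieved clues.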
Two technical innovations make this loop stable and efficient. First, Task‑Decoupled Attention Masking (TDAM) addresses attention dispersion caused by mixing heterogeneous tokens (global video, tool‑derived clips, intermediate reasoning text) in a shared context. During supervised fine‑tuning, TDAM applies step‑specific attention masks: in clue‑seeking phases the model can attend only to the global video tokens, forcing it to plan searches based on coarse information; in answer‑reasoning phases the global overview is masked, compelling the model to rely exclusively on the high‑resolution clips. This decoupling prevents “fake thinking” where the model generates an answer that is inconsistent with the evidence it has already retrieved.
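The masking rule can be illustrated with a small sketch. The token‑type labels and the exact visibility rule here are assumptions based on the description above, not the paper's implementation; in practice such masks would be applied inside the transformer's attention computation.

```python
# Minimal sketch of step-specific attention masking in the spirit of TDAM.
# Token types and the masking rule are assumptions, not the paper's exact code.
import numpy as np

GLOBAL, CLIP, TEXT = 0, 1, 2  # global video / tool clip / reasoning text

def tdam_mask(token_types, phase):
    """Return a boolean visibility mask over tokens for the given phase.

    phase == "seek":   only global video (and text) tokens are visible,
                       so searches are planned from coarse information.
    phase == "answer": global tokens are masked out, so the answer must
                       rest on the retrieved high-resolution clips.
    """
    types = np.asarray(token_types)
    if phase == "seek":
        return (types == GLOBAL) | (types == TEXT)
    if phase == "answer":
        return (types == CLIP) | (types == TEXT)
    raise ValueError(f"unknown phase: {phase}")
```

Because each phase sees only the tokens it should depend on, the model cannot produce an answer that ignores the clips it retrieved, which is precisely the “fake thinking” failure the masking is meant to prevent.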
Second, the Verifiable Trajectory‑Guided Reward (VTGR) controls context growth in multi‑turn interactions. The reward function multiplies the standard answer correctness term by a factor that rewards efficient trajectories—fewer tool calls, shorter clips, and lower token consumption. Consequently, policies that achieve the same correct answer with less exploration receive higher reinforcement, encouraging the model to balance exhaustive evidence collection against computational cost and to learn a reliable early‑termination criterion.
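A toy version of this reward structure is sketched below. The summary only specifies a multiplicative form (correctness times an efficiency factor); the coefficients, the linear penalty shape, and the parameter names here are illustrative assumptions.

```python
# Toy trajectory-guided reward: answer correctness gated by an efficiency
# factor. The functional form and coefficients are assumptions for
# illustration, not the paper's exact reward.
def trajectory_reward(correct, num_tool_calls, tokens_used,
                      call_cost=0.1, token_budget=8192):
    """Correct answers earn more when reached with fewer calls and tokens."""
    if not correct:
        return 0.0  # wrong answers receive no reward regardless of cost
    # Efficiency shrinks with each tool call and with token consumption.
    efficiency = max(0.0, 1.0 - call_cost * num_tool_calls
                     - 0.5 * tokens_used / token_budget)
    return 1.0 * efficiency
```

Under this shaping, two trajectories that both answer correctly are ranked by cost, so the policy gradient pushes the model toward earlier termination whenever the extra exploration does not change the answer.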
To train such a system at scale, the authors built an automated data‑synthesis pipeline that generates diverse video‑question‑clue‑answer tuples. The resulting Seeker‑173K dataset contains 173,000 high‑quality tool‑interaction trajectories, each comprising the query, global video metadata, a sequence of tool commands, the retrieved clip tokens, intermediate chain‑of‑thought reasoning, and the final answer. This dataset supports both supervised fine‑tuning (with TDAM) and reinforcement learning (with VTGR).
Extensive experiments on three long‑video benchmarks demonstrate the effectiveness of Video‑o3. On MLVU, Video‑o3 attains 72.1% accuracy, surpassing the previous best by roughly 6 percentage points. On LVBench it reaches 47.6% and on Video‑Holmes 46.5%, both notable improvements over prior state‑of‑the‑art models such as Video‑R1, VideoChat‑R1, and other tool‑augmented baselines. Ablation studies show that removing TDAM reduces accuracy by about 4 points, while omitting the trajectory‑guided reward increases the average number of tool calls from 2.3 to 3.8 per question, confirming the importance of both components.
The paper also discusses limitations. The current VideoCrop tool operates at a fixed frame rate and resolution, making it costly for ultra‑high‑definition or high‑frame‑rate videos. The synthetic Seeker‑173K data, while large, may not capture all real‑world complexities, potentially affecting domain transfer. Moreover, the early‑termination behavior is sensitive to the reward hyper‑parameters, requiring careful tuning.
Future work is outlined along three directions: (1) introducing dynamic resolution and frame‑rate adaptation to reduce token cost for high‑quality videos; (2) expanding the training corpus with real‑world, manually annotated long‑video trajectories to improve robustness; and (3) exploring meta‑learning or self‑supervised reward shaping to make the termination policy less dependent on hand‑crafted reward coefficients.
In summary, Video‑o3 presents a novel paradigm where multi‑hop visual evidence gathering and reasoning are tightly interleaved within a single conversational context. By decoupling attention across tasks and incentivizing efficient exploration, it achieves both higher accuracy and lower computational overhead on challenging long‑video understanding tasks, marking a significant step forward for tool‑augmented multimodal AI.