Recent advances in multimodal LLMs and tool-using systems for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans under a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On LongTVQA and LongTVQA+, our proposed episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens the trained agent's reasoning and planning.
Multimodal large language models (MLLMs) extend LLMs beyond text to perceive and reason over multimodal signals such as visual frames, audio, and subtitles. A key emerging challenge is robust long video understanding, where information is sparsely distributed across hours of content and multiple modalities (e.g., frames and dialogue cues). Early instruction-tuned systems such as Video-LLaMA (Zhang et al., 2023; Lin et al., 2024) demonstrated that LLMs can be adapted to jointly process sampled video frames, marking an initial step toward multimodal video reasoning. However, current models remain limited to short clips or coarse summaries and struggle with fine-grained, temporally extended queries. Crucially, most prior systems are non-agentic: they process a static, pre-encoded or down-sampled video. Converting the full visual stream into compressed representations in the LLM's textual space shifts the burden of temporal reasoning to this early stage, which is often lossy and irreversible, making it difficult to recover fine-grained evidence. These limitations motivate an agentic, tool-augmented paradigm that can actively decide what to observe next, when to query external visual or other tools, and when enough grounded evidence has been gathered to respond. Despite recent advances, the field still lacks a solution that jointly achieves efficiency, multimodal completeness, and fine-grained temporal reasoning in long videos.
Recent works have begun to frame long video understanding as an agent-driven process rather than a passive encoding task. Notably, VideoAgent (Fan et al., 2024; Wang et al., 2024b) introduced an agent-based framework in which a central LLM actively conducts video analysis. In this paradigm, the LLM agent iteratively queries external vision models (tools) to retrieve and interpret video frames, progressively compiling the information needed to answer a given query. This interactive strategy mirrors human cognitive behavior and has demonstrated promising effectiveness. These findings highlight the potential of tool-augmented LLM agents to achieve both efficiency and accuracy. However, the initial incarnation of VideoAgent relies on a less powerful toolset, primarily generic vision-language foundation models for captioning and image retrieval. Such tools are often insufficient for capturing fine-grained semantics, precise object references, or subtle temporal cues, restricting the agent's ability to understand complex scenes and reason over long temporal spans. Moreover, current frameworks underutilize the LLM's inherent reasoning abilities and lack mechanisms for multi-step decision making or reinforcement-based planning.
In this paper, as illustrated in Figure 1, we address these challenges by proposing a new multi-agent framework for long video understanding. Our system adopts a multi-agent architecture in which a central MASTERAGENT is responsible for reasoning and answering while coordinating with specialized agents: a GROUNDINGAGENT locates video segments relevant to the question, and a VISIONAGENT extracts detailed visual information from the selected clips (e.g., objects, faces, actions). The master agent gathers these outputs and iteratively reasons over the accumulated evidence. To guide the reasoning process, we design a reward-driven training strategy that encourages the master agent to conduct structured, multi-step reasoning. In each iteration, the master agent generates sub-queries, invokes either the grounding or vision agent as needed, and integrates the returned information before deciding on the next step. When it determines that enough evidence has been collected, it produces a final answer. By designing a reward function that penalizes irrelevant tool use and incoherent reasoning, we guide the agent to "think" in a proper format, effectively learning when to explore the video with tools and when it has gathered sufficient evidence to answer the question. Furthermore, to evaluate long-form video reasoning in a realistic setting, we construct two new benchmark datasets, LongTVQA and LongTVQA+. These datasets extend the well-known TVQA video question answering task to much longer video durations, providing a rigorous testbed for our agent.
Our Agent-with-Tools approach demonstrates superior performance on the LongTVQA benchmark, outperforming all existing baselines by a significant margin. Through ablation studies, we show that both the multi-agent architecture and the reward-guided training contribute critically to the agent's gains. Our system not only achieves higher accuracy but also exhibits interpretable decision-making, coordinating sub-agents to select relevant video segments and extract fine-grained visual information essential for reasoning. These results underscore the benefit of an agentic framework for long video understanding.
Our contributions are threefold: (i) a modular multi-agent architecture in which a master LLM coordinates grounding and vision specialists; (ii) a reward-driven agentic reinforcement learning scheme that promotes concise, step-wise reasoning; and (iii) episode-level long-video datasets, LongTVQA and LongTVQA+, on which our system achieves state-of-the-art results.
2 Related Work
Early work focused on memory and attention mechanisms over appearance-motion features (Gao et al., 2018). This evolved into multimodal transformers designed for efficient frame sampling (Lei et al., 2021). Recent trends emphasize retrieval-aware reasoning and efficient tokenization for long videos, as well as integrating LLM-based reasoning with video encoders (Zhang et al., 2023) and employing agentic planners that iteratively gather evidence (Wang et al., 2024b). Long-form systems further explore sparse memory and temporal grounding techniques to handle hour-scale inputs (Song et al., 2024). These developments motivate long-form VideoQA systems that selectively retrieve segments under a limited context budget.
LLM agents couple chain-of-thought with actions: planning, tool calls, and iterative evidence gathering. Foundational agent ideas include ReAct, Self-Ask, and WebGPT (Yao et al., 2022; Press et al., 2022; Nakano et al., 2021). Toolformer shows self-supervised API calling, while orchestration frameworks (HuggingGPT/Gorilla-style) route subtasks to expert models (Schick et al., 2023; Shen et al., 2023). In multimodal settings, MM-ReAct wires LLMs to vision experts via prompting, and program-of-thought systems like ViperGPT compose perception modules through executable code for transparent, verifiable reasoning (Yang et al., 2023; Surís et al., 2023). For long videos, agentic designs such as VideoAgent-style frameworks use memory, targeted retrieval, and temporal grounding to operate under strict context budgets while improving faithfulness (Wang et al., 2024b). Beyond planning, video-RAG pipelines extract ASR/OCR/objects and retrieve evidence to augment LVLMs for factual responses (Luo et al., 2024). In addition, long-horizon multimodal agents with persistent memory and structured planning further enhance reliability for extended videos, e.g., Long-Seeing, VideoTree, and Koala (Long et al., 2025; Wang et al., 2025b; Tan et al., 2024); and general reasoning paradigms such as Chain-of-Thought, Least-to-Most, Tree-of-Thoughts, and Generative Agents provide foundations for decomposition and memory (Wei et al., 2022; Zhou et al., 2022; Yao et al., 2023; Park et al., 2023). Retrieval-first paradigms like Retrieving-to-Answer complement agent pipelines with a retrieve-then-reason template (Pan et al., 2023).
Modern MLLMs combine strong vision encoders with instruction-tuned LLMs. CLIP pretraining provides broad visual-text transfer (Radford et al., 2021). Flamingo introduces a perceiver-style resampler for few-shot multimodal learning (Alayrac et al., 2022); BLIP-2/InstructBLIP bridge frozen encoders and LLMs (Li et al., 2023; Dai et al., 2023). Recent visually instruction-tuned MLLMs (Tang et al., 2025; Pi et al., 2024, 2025), such as LLaVA (Liu et al., 2023), scale visual instruction tuning using open components, while LLaVA-OneVision (Li et al., 2024a) unifies high-resolution perception with token-efficient processing for both images and videos. Recent video-tuned variants (e.g., Video-LLaVA) and training-free token schedulers (e.g., SlowFast-LLaVA) further improve temporal coverage and efficiency (Lin et al., 2024; Xu et al., 2024b). Proprietary MLLMs (GPT-4/4o; Gemini 1.5) show long-context multimodal reasoning (Achiam et al., 2023; Gemini Team, 2024), while open models (Qwen2-VL, InternVL) narrow the gap via dynamic resolution, OCR, and video pipelines (Wang et al., 2024a; Chen et al., 2024). Complementary advances focus on unifying image-video tokens with few, informative representations (e.g., MiniGPT4-Video, Video-ChatGPT, Video-LaVIT, LLaMA-VID, LongVU, PLLaVA, LLaVA-Video, Chat-UniVi) (Ataallah et al., 2024; Maaz et al., 2024; Jin et al., 2024b; Li et al., 2024b; Shen et al., 2024; Xu et al., 2024a; Zhang et al., 2024c; Jin et al., 2024a), and on long-context optimization or adaptive input selection (e.g., InternVideo2.5, LongVLM, long-context training, self-adaptive sampling, simple-but-effective alignment, and question-instructed tuning) (Wang et al., 2025a; Weng et al., 2024; Zhang et al., 2024b; Han et al., 2023; Zhang et al., 2024a; Romero and Solorio, 2024). Comprehensive analyses of video understanding in large multimodal models (e.g., Apollo) situate these models within broader capabilities and evaluation protocols (Zohar et al., 2025). However, most models still face long-video constraints (context length, retrieval), which motivates combining video-native encoders, instruction tuning, retrieval, and tool use for scalable long-form VideoQA.
As shown in Figure 2, we cast long-video QA as multi-agent reasoning, where a master LLM coordinates a grounding agent to temporally localize question-relevant segments and a vision agent to extract targeted observations from those segments. The system proceeds iteratively, maintaining a running context that accumulates subtitles, relevant segment tags, and vision observations, and it produces an answer once the master agent judges that sufficient evidence has been gathered. For open-source LLMs serving as the master agent, we apply reinforcement learning to encourage accurate, concise, and cooperation-efficient behavior while keeping the other agents frozen. At inference, the process yields clear, step-by-step traces aimed at solving the question at hand.
Master agent behavior and training. The master agent follows the instruction schema in the System Prompt (Table 1) and the multi-turn policy in Algorithm 1, coordinating two specialist agents: a grounding agent and a vision agent.

[Table 1: System Prompt — LONGVIDEOAGENT] You are an agent that answers questions about a long video episode. You may use two tools: a grounding agent to localize relevant segments and a vision agent to extract visual facts from the localized segment. Produce concise, direct answers. Context you may receive: all subtitles and the user question q. When a segment has been localized, you will also have a tag (e.g., <clip_X>). When the vision agent has been called, you will see its textual response.

Given an episode with its full subtitles and a question, the master runs a bounded loop (at most K steps). At each turn it emits exactly one structured action token: <visual_query> for a visual read, <request_grounding> for (re)localization, or an answer tag to terminate. After the corresponding agent is invoked, its textual output is appended to the master agent's context. For open-source masters, we optimize the policy with GRPO while keeping the grounding and vision agents fixed. The rollouts terminated by action tokens in Algorithm 1 provide the trajectories for training and evaluation.
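To make the control flow concrete, the following is a minimal Python sketch of the bounded loop described above and in Algorithm 1. The helper names (run_episode, call_master, call_grounding, call_vision) and the <answer> tag are illustrative assumptions; only <visual_query>, <request_grounding>, and the <clip_X> tags come from the text.

```python
import re

def run_episode(question, subtitles, call_master, call_grounding, call_vision, K=8):
    # Running context the master agent conditions on at every turn.
    context = {"question": question, "subtitles": subtitles, "observations": []}
    clip_tag = None
    for t in range(K):
        action = call_master(context)  # plans, then emits one structured action string
        if m := re.search(r"<request_grounding>(.*?)</request_grounding>", action, re.S):
            clip_tag = call_grounding(question, subtitles, m.group(1))  # e.g., "<clip_17>"
            context["observations"].append(("grounding", clip_tag))
        elif m := re.search(r"<visual_query>(.*?)</visual_query>", action, re.S):
            obs = call_vision(clip_tag, m.group(1))  # textual visual facts for the segment
            context["observations"].append(("vision", obs))
        elif m := re.search(r"<answer>(.*?)</answer>", action, re.S):  # <answer> is an assumed tag name
            return m.group(1).strip()
    return None  # step budget K exhausted without a final answer
```

In this reading, each turn appends the invoked agent's textual output to the master's context, matching the description above.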
Grounding agent. Given the question and subtitles, the grounding agent proposes a temporal segment and returns a symbolic tag <clip_X> marking the relevant portion of the episode. By default the window context is 1; when it is larger, the agent outputs a short run of consecutive tags. The master may re-query grounding to refine or validate the segment as reasoning progresses.
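One concrete reading of the window behavior: a window of size w could expand the grounded clip index into a short run of consecutive tags. The helper below is a sketch under that assumption; the function name and the centering choice are ours.

```python
def expand_window(clip_index: int, num_clips: int, window: int = 1) -> list[str]:
    """Return consecutive <clip_X> tags around the grounded clip.

    window=1 reproduces the default single-tag behavior; larger values emit a
    short run of neighboring tags (our assumption of how the window context
    is realized).
    """
    half = (window - 1) // 2
    lo = max(0, clip_index - half)
    hi = min(num_clips - 1, clip_index + (window - 1 - half))
    return [f"<clip_{i}>" for i in range(lo, hi + 1)]

# expand_window(17, 40, window=3) -> ['<clip_16>', '<clip_17>', '<clip_18>']
```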
Vision agent. Conditioned on <clip_X> and an on-demand prompt that specifies the current visual need, the vision agent extracts textual observations from frames within the localized segment (e.g., objects/entities, attributes, actions, OCR/on-screen text, scene cues). These observations are appended to the context and guide the next decision; the loop terminates when the master judges the accumulated visual evidence sufficient to answer.
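One plausible realization of this call samples a handful of frames from the localized segment and prompts an off-the-shelf MLLM. In the sketch below, query_mllm and frames_by_clip are placeholders rather than a specific library API, and uniform frame sub-sampling is an assumption.

```python
def vision_agent(clip_tag, visual_need, frames_by_clip, query_mllm, max_frames=8):
    """Return textual observations (objects, attributes, actions, OCR, scene cues)
    for the localized segment; query_mllm stands in for the underlying MLLM call."""
    frames = frames_by_clip[clip_tag]
    stride = max(1, len(frames) // max_frames)
    sampled = frames[::stride][:max_frames]  # uniform sub-sampling (our assumption)
    prompt = (f"For segment {clip_tag}, address this visual request: {visual_need}. "
              "List relevant objects, attributes, actions, and any on-screen text.")
    return query_mllm(images=sampled, prompt=prompt)
```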
For open-source LLMs serving as the master agent, we fine-tune the master with GRPO while keeping the grounding and vision agents frozen.
Long-video QA is cast as a finite-horizon decision process: at each action step, after reasoning, the policy emits exactly one structured action token (<visual_query>, <request_grounding>, or the answer tag).
Trajectory. A full response terminates upon emitting the final answer tag or upon reaching K steps.
We index decision steps by t ∈ {0, 1, . . . , T} with T ≤ K.
At each step t, the policy π_θ first plans and then emits a contiguous action string a_t ending with exactly one closing tag (</visual_query>, </request_grounding>, or the closing answer tag). If it does not terminate, the system appends the invoked agent's feedback o_t (e.g., a vision observation or a clip tag) to the context for the next step.
Rewards. We use two simple, rule-based rewards as supervision for reinforcement learning: (i) structural validity r_t^fmt ∈ {0, 1} grants 1 if the action string contains exactly one top-level tag with proper closure and no extraneous text, and 0 otherwise; (ii) answer correctness r^ans ∈ [0, 1] is awarded at termination via exact match on the multiple-choice answer; if no valid answer tag appears, r^ans = 0.
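Read as string checks, the two signals could be implemented as below. The sketch assumes the action-tag inventory above and an <answer> tag name, and interprets "no extraneous text" as the action string ending with the single closing tag.

```python
import re

ACTION_TAGS = ("visual_query", "request_grounding", "answer")  # "answer" is an assumed tag name

def format_reward(action: str) -> float:
    """Structural validity r_fmt: 1.0 iff exactly one properly closed action tag
    appears and the action string ends with it; 0.0 otherwise."""
    spans = [m.group(0) for tag in ACTION_TAGS
             for m in re.finditer(rf"<{tag}>.*?</{tag}>", action, re.S)]
    if len(spans) != 1:
        return 0.0
    return 1.0 if action.rstrip().endswith(spans[0]) else 0.0

def answer_reward(action: str, gold_choice: str) -> float:
    """Answer correctness r_ans: exact match on the multiple-choice answer,
    0.0 if no valid answer tag appears."""
    m = re.search(r"<answer>(.*?)</answer>", action, re.S)
    return float(m is not None and m.group(1).strip() == gold_choice)
```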
Objective and optimization. We seek a policy that produces well-formed actions at every step and a correct final answer. To balance these goals, the trajectory return is R(τ) = α · Σ_{t=0}^{T} r_t^fmt + r^ans, where α > 0 weights the per-step structural shaping and r^ans supplies the terminal task reward. r_t^fmt encourages the master to emit exactly one correct action tag at each decision step, while r^ans evaluates only the final answer; if no valid and correct answer is produced, r^ans = 0.
We optimize the master agent with GRPO on sampled rollouts: for each episode, the policy generates an action sequence and receives structural rewards at action boundaries plus a terminal answer reward; sequence-level advantages are computed group-relatively across the rollouts sampled for the same question. Policy updates follow the GRPO objective with standard clipping and entropy regularization, while the grounding and vision agents remain frozen. This minimal, two-signal objective provides sufficient guidance to learn structured, multi-turn coordination without additional dense rewards.
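A minimal sketch of the return and group-relative advantage computation follows; the grouping over N rollouts per question and the normalization constant reflect standard GRPO practice and are assumptions here, as is the α value.

```python
import numpy as np

def trajectory_return(fmt_rewards, ans_reward, alpha=0.5):
    """R(tau) = alpha * sum_t r_fmt_t + r_ans; alpha=0.5 is illustrative."""
    return alpha * sum(fmt_rewards) + ans_reward

def group_advantages(returns, eps=1e-6):
    """GRPO-style sequence-level advantages: normalize each rollout's return
    against the group of N rollouts sampled for the same question."""
    r = np.asarray(returns, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: N=4 rollouts for one question.
# returns = [trajectory_return(f, a) for f, a in zip(fmt_per_rollout, ans_per_rollout)]
# advantages = group_advantages(returns)
```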
We build LongTVQA and LongTVQA+ on top of TVQA and TVQA+. TVQA spans six TV shows with 152.5K multiple-choice QAs over 21.8K clips (60-90s) with subtitles and moment annotations; questions require joint dialogue-visual reasoning (Lei et al., 2018). TVQA+ refines a subset with spatio-temporal grounding, adding precise timestamps and 310.8K frame-level boxes for referenced entities (29.4K QAs from 4,198 clips, mainly TBBT), supporting joint QA and temporal/spatial localization (Lei et al., 2020).

[Table 2 caption] Rows without an agentic label are multimodal baselines that directly ingest the full long video. Methods labeled Agentic indicate that the model operates as the MASTERAGENT; methods labeled AgenticRL additionally denote RL fine-tuning. Parenthesized green numbers denote absolute gains over the immediately preceding (non-agentic or non-RL) setting.

We observe that: (i) our multi-agent framework, LONGVIDEOAGENT, consistently outperforms its non-agentic counterparts; (ii) agentic RL yields additional gains, especially for smaller open-source models; (iii) using frames provides visual evidence beyond subtitles and generally outperforms subtitle-only inputs; (iv) closed-source models remain strong, but the gap narrows considerably when open-source models adopt agentic designs and agentic RL.
For reinforcement learning, we use GRPO with a learning rate of 5 × 10⁻⁶, up to 2,000 optimization steps, a KL coefficient of 10⁻³, a batch size of 4, a rollout count of N = 4, and a temperature of 1.0. Training Qwen2.5-7B took 12 hours on 4× NVIDIA H800 GPUs, while the 3B variant took 6 hours under the same setup.
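For reference, the reported hyperparameters can be collected into a single configuration object; the values come from the text above, while the field names are our own.

```python
from dataclasses import dataclass

@dataclass
class GRPOTrainingConfig:
    # Values reported in the paper; field names are illustrative.
    learning_rate: float = 5e-6
    max_optimization_steps: int = 2000
    kl_coefficient: float = 1e-3
    batch_size: int = 4
    rollouts_per_question: int = 4   # N
    temperature: float = 1.0
```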
Table 2 presents overall validation accuracy. Moving from the non-agent setting to our multi-agent framework yields significant gains, providing direct evidence for the effectiveness of a multi-agent pipeline that localizes the relevant clips and performs targeted visual inspection.
In addition, for several open-source LLMs serving as the master agent, reinforcement learning consistently improves over their inference-only counterparts under identical prompts and evaluation; notably, the Qwen2.5-7B model with RL attains accuracy comparable to the closed-source GPT-5-mini under our protocol. Illustrative examples in Table 3 and Table 5 demonstrate the behavior of our approach, with additional cases provided in the supplementary materials. As the vision agent, GPT-4o attains 73.30 localization and 78.00 answer accuracy, outperforming Qwen3-VL-235B-a22b (71.00 and 73.67) by +2.30 and +4.33, respectively. This gap indicates that stronger visual recognition (small objects, OCR, fine attributes) translates into better end-task accuracy in long-form QA, so we adopt GPT-4o as the default vision agent.
Contribution of agentic components. Table 4a decomposes the gains when moving from a single LLM to a multi-agent, multimodal system. Adding temporal grounding to the same backbone increases answer accuracy from 64.3 to 69.0 (+4.7), showing that identifying the relevant clip filters distractors and focuses reasoning. Enabling vision after grounding further lifts accuracy to 74.8 (+5.8 over grounding; +10.5 overall): targeted visual inspection complements subtitles with concrete object/text cues and can validate or refine grounding through repeated calls when uncertain. Because backbones and prompts are held fixed, these improvements are attributable to the agentic procedure. We suggest grounding narrows the context length for reasoning and guides the master agent's attention, while vision supplies the missing fine-grained evidence.
We presented LONGVIDEOAGENT, a multi-agent framework for long-form video question answering in which a MASTERAGENT coordinates a GROUNDINGAGENT for temporal localization and a VISIONAGENT for targeted perception. The framework is model-agnostic: we evaluate it with both closed- and open-source LLMs; for open-source masters, we fine-tune with GRPO to encourage accurate, concise, and cooperation-efficient behavior while keeping the other agents frozen. Equipped with a unified context and GRPO training that combines structural and answer rewards, the system with open-source LLMs as the master agent yields transparent, step-by-step traces and achieves strong gains on LongTVQA/LongTVQA+ over non-agent baselines. Ablations show that grounding+vision is essential, modest step limits suffice, adjacent-window context helps, and stronger perception yields higher accuracy, validating the effectiveness of the framework. Future work includes richer modalities (such as the audio track and background knowledge), finer-grained grounding, and larger-scale RL training.
Our work has several practical limitations. First, building on TVQA and TVQA+, we rely on the provided subtitles as the primary textual channel and do not process raw audio; in future work we plan to integrate an automatic speech recognition (ASR) module to capture raw speech. Second, the vision and grounding agents are kept fixed during RL; jointly optimizing them could further improve robustness and accuracy. Lastly, the reward is intentionally simple (format plus answer correctness), and richer reward designs remain to be explored.
[Qualitative example, panel (a): bench, sidewalk, trash can, and windows strongly indicate a bus stop rather than a mall, theatre, store, or park.]