GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory

Reading time: 5 minutes

📝 Abstract

Long-video understanding remains a significant challenge for Multimodal Large Language Models (MLLMs) due to inherent token limitations and the complexity of capturing long-term temporal dependencies. Existing methods often fail to capture the global context and complex event relationships necessary for deep video reasoning. To address this, we introduce GCAgent, a novel Global-Context-Aware Agent framework that achieves comprehensive long-video understanding. Our core innovation is the Schematic and Narrative Episodic Memory. This memory structurally models events and their causal and temporal relations into a concise, organized context, fundamentally resolving the long-term dependency problem. Operating in a multi-stage Perception-Action-Reflection cycle, our GCAgent utilizes a Memory Manager to retrieve relevant episodic context for robust, context-aware inference. Extensive experiments confirm that GCAgent significantly enhances long-video understanding, achieving up to 23.5% accuracy improvement on the Video-MME Long split over a strong MLLM baseline. Furthermore, our framework establishes state-of-the-art performance among comparable 7B-scale MLLMs, achieving 73.4% accuracy on the Long split and the highest overall average (71.9%) on the Video-MME benchmark, validating our agent-based reasoning paradigm and structured memory for cognitively-inspired long-video understanding.


📄 Content

With the explosive growth of video-based social media and platforms, video has become the dominant medium in our daily lives, shaping communication, entertainment, and education. As millions of new videos are generated and shared every day, efficient and accurate video processing is now essential for extracting and analyzing meaningful information from this vast and rapidly expanding body of content. Against this background, the rise of Multimodal Large Language Models (MLLMs) [1]-[10] has attracted increasing attention. By integrating video perception directly into LLMs, these models offer a promising path toward understanding and interacting with complex video content.

Despite remarkable progress in MLLMs, significant challenges remain in handling long videos, primarily due to the computational burden of modeling long-term temporal context. To address this issue, earlier efforts focused on improving the intrinsic capacity of MLLMs, either by extending the context length [11]-[13] or by reducing token usage through feature-level compression of visual embeddings [9], [14]-[19]. These approaches aim to fit longer video inputs into a single model context window by optimizing the internal efficiency of MLLMs. More recent studies have increasingly shifted toward agent-based paradigms [20]-[25], which integrate external reasoning mechanisms, retrieval modules, and collaborative planning to handle long videos more effectively. In this paradigm, agents autonomously plan how to retrieve and organize query-related information, alleviating the context-length limitation.

Although retrieving query-related information enables more efficient long-video understanding, its capability remains limited. In particular, it falls short of the human strategy of maintaining and organizing global context across the entire video while selectively attending to query-relevant information, leaving clear room for improvement. Recent work in video understanding has likewise highlighted the role of temporal structural modeling and long-term action reasoning [26]-[29]. Cognitive psychology and cognitive science [30], [31] suggest that humans comprehend and remember events by constructing schematic and narrative structures. Here, schematic structure refers to abstract event templates (e.g., roles and typical situation frames), whereas narrative structure denotes temporally and causally ordered event sequences. These structures enable humans to integrate new information more efficiently and to perform downstream tasks. Extending this insight to long-video understanding, if MLLMs could build and leverage a comprehensive understanding of the global context while using query-related information to generate answers, they would likely perform tasks in a more human-like manner. Bridging this cognitive insight with computational models offers a clear path toward more human-like video understanding.
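To make the distinction concrete, the two cognitive structures above can be expressed as simple data types: a schematic structure is an abstract event template (roles plus a situation frame), while a narrative structure links such events temporally and causally. The field names below are purely illustrative, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SchematicEvent:
    """Abstract event template: who is involved and which situation frame applies."""
    roles: dict[str, str]   # e.g. {"agent": "chef", "object": "dough"}
    frame: str              # situation frame, e.g. "food_preparation"

@dataclass
class NarrativeLink:
    """Directed relation between two events in the narrative structure."""
    src: int                # index of the earlier event
    dst: int                # index of the later event
    relation: str           # "temporal" or "causal"

# A narrative structure is an ordered event list plus the links between events.
events = [
    SchematicEvent({"agent": "chef"}, "food_preparation"),
    SchematicEvent({"agent": "chef", "object": "cake"}, "baking"),
]
links = [NarrativeLink(src=0, dst=1, relation="causal")]
```

Separating the abstract template (schematic) from the ordering and causality between instances (narrative) is what lets a compact memory stand in for thousands of raw frames.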

In this paper, we introduce GCAgent, a global-context-aware agent framework for long-video understanding. Specifically, GCAgent grounds its global-context representation in schematic and narrative structures, mirroring how humans construct and maintain situation-level understanding. At the same time, it preserves the strength of conventional agent-based methods in retrieving query-related information. By combining these advantages, our framework significantly enhances the ability of MLLMs to understand and reason over long videos in query-driven interaction scenarios. To realize this awareness, GCAgent is composed of two complementary agents: (i) the Memory Manager Agent (LLM-based), which constructs and maintains the global context before any query arrives. To this end, it primarily takes speech transcripts as input. The agent first detects event boundaries and segments the transcript into event-level units. Each unit is then abstracted to extract roles and situation-level patterns, yielding schematic structures. Finally, temporal and causal relationships across these discrete events are inferred to build narrative structures, forming the episodic memory of the video; concretely, each event-level unit becomes an episode entry in the episodic memory. When speech is unavailable or insufficient, visual captions can optionally be incorporated as supplementary evidence. (ii) The Reasoning Agent (MLLM-based), which leverages both the constructed narrative structures and query-related multimodal information to perform context-aware reasoning and answer the user query.
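The memory-construction pipeline described above (boundary detection, schematic abstraction, narrative linking) could be sketched as follows. The segmentation cues and role extraction here are trivial keyword-based placeholders standing in for the LLM calls the framework actually makes, and every function and field name is a hypothetical illustration.

```python
# Hypothetical sketch of episodic-memory construction from a speech transcript.
# In GCAgent, boundary detection, abstraction, and relation inference are LLM-driven;
# here they are replaced by toy heuristics so the pipeline shape is runnable.

def segment_events(transcript: list[str]) -> list[list[str]]:
    """Split transcript lines into event-level units at crude boundary cues."""
    units, current = [], []
    for line in transcript:
        current.append(line)
        if line.endswith(("then.", "later.", "finally.")):  # placeholder boundary cue
            units.append(current)
            current = []
    if current:
        units.append(current)
    return units

def abstract_event(unit: list[str]) -> dict:
    """Abstract one unit into a schematic entry (placeholder role extraction)."""
    text = " ".join(unit)
    return {"summary": text[:60],
            "roles": sorted({w for w in text.split() if w.istitle()})}

def build_episodic_memory(transcript: list[str]) -> list[dict]:
    """Chain schematic entries with temporal links, forming the narrative structure."""
    episodes = [abstract_event(u) for u in segment_events(transcript)]
    for i, ep in enumerate(episodes):
        ep["id"] = i
        ep["relations"] = ([{"to": i + 1, "type": "temporal"}]
                           if i + 1 < len(episodes) else [])
    return episodes

memory = build_episodic_memory([
    "Alice mixes the batter, then.",
    "Bob bakes the cake.",
])
```

Each episode entry carries its own schematic abstraction plus outgoing relations, so the whole memory stays a compact, organized context rather than a flat transcript.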

Once the episodic memory is structured, the agents collaborate through a three-stage paradigm to address user queries: (i) Perception, which locates query-relevant segments; (ii) Action, which performs reasoning conditioned on the global context; and (iii) Reflection, which updates memory based on the reasoning outcome. Upon query arrival, the Memory Manager Agent switches from global-context construction to query-conditioned retrieval. In the Perception stage, the memory manager retrieves the query-relevant episodes from the episodic memory.
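The three-stage loop over a structured memory could be sketched as below. The retrieval scoring, answer generation, and memory update are toy placeholders for the Memory Manager and Reasoning agents; every name and data layout here is an assumption for illustration.

```python
# Hypothetical Perception-Action-Reflection loop over an episodic memory.
# Word-overlap retrieval and string answers stand in for the MLLM-based agents.

def perceive(memory: list[dict], query: str) -> list[dict]:
    """Perception: retrieve episodes that overlap the query (toy scoring)."""
    q = set(query.lower().split())
    scored = [(len(q & set(ep["text"].lower().split())), ep) for ep in memory]
    return [ep for score, ep in sorted(scored, key=lambda s: -s[0]) if score > 0]

def act(relevant: list[dict], query: str) -> str:
    """Action: reason over retrieved episodes plus global context (placeholder)."""
    if not relevant:
        return "insufficient context"
    return f"answer based on episode {relevant[0]['id']}"

def reflect(memory: list[dict], query: str, answer: str) -> None:
    """Reflection: write the outcome back so later queries can reuse it."""
    memory.append({"id": len(memory), "text": f"Q: {query} A: {answer}"})

memory = [
    {"id": 0, "text": "a chef mixes batter in the kitchen"},
    {"id": 1, "text": "the chef bakes a cake in the oven"},
]
query = "who bakes the cake"
relevant = perceive(memory, query)
answer = act(relevant, query)
reflect(memory, query, answer)
```

The key design point the sketch preserves is that Reflection feeds the reasoning outcome back into the same memory that Perception reads, so the global context keeps improving across queries.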

This content is AI-processed based on ArXiv data.
