
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.18735
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Figure 1. Limitation of current LMM benchmarks. Existing benchmarks concentrate on evaluating a model's performance on an isolated state, neglecting to test its ability to understand changes across different states. Furthermore, the lack of relevant data and benchmarks has impeded the development of model capabilities in this area. This work is designed to fill that gap.

📄 Full Content

Biological intelligent agents, such as animals and humans, all possess a vital ability preserved through natural selection: to compare and perceive changes in the environment and learn from them [4]. Wild animals identify the presence of predators or prey by comparing changes in their habitat, such as new tracks, disturbed vegetation, or broken branches, allowing them to make rational decisions that increase their probability of survival. Similarly, humans acquire skills through the same comparative process. For example, an archer improves by comparing the results of consecutive shots, analyzing differences in distance and direction, and refining their technique based on these comparisons to reduce errors. This process highlights that a key feature of biological intelligent learning is the act of minimizing the 'delta' between ever-changing environment states.

This provides a key insight for research in intelligent agents: we cannot provide intelligent agents with an infinite supply of high-quality training data, and the performance ceiling of models based on human-annotated data will always be limited by its quality. A truly advanced cognitive agent should mirror biological intelligence, constantly improving its abilities by perceiving and comparing changes in its environment. This ability constitutes the foundation for such higher-order cognition. Without this fundamental capacity for change detection, no meaningful learning can occur.

While recent state-of-the-art large multimodal models (LMMs) [2,25,30,41,42] have demonstrated remarkable advancements, their focus has predominantly been on interpreting static images or continuous video streams. Existing benchmarks [5,13,17,27,37] focus on static content or single-state videos, failing to test a model’s ability to reason about transformations between distinct temporal points. This deficiency represents a significant impediment to the future development of artificial intelligence and truly intelligent agents.

To address this critical gap, we introduce M³-Verse, a novel benchmark designed to evaluate the understanding of discrete state changes. It is constructed from paired egocentric videos that capture indoor scenes before and after a change is made. This two-state structure transcends simple recognition, requiring models to perform a fine-grained semantic comparison and infer the transformations that connect the two scenes.

We offer the following principal contributions:

• A Novel Benchmark: We introduce M³-Verse, the first large-scale benchmark dedicated to evaluating multi-state visual understanding.
• A Challenging Evaluation Paradigm: Our two-state, paired-video design goes beyond conventional QA, enabling the rigorous evaluation of comparative reasoning about state transitions.
• In-depth Model Analysis: Our evaluation of 16 leading LMMs and 6 text-only LLMs reveals their limitations in tracking state transitions, highlighting critical weaknesses not exposed by existing benchmarks.
• A Simple but Effective Method: We propose Hierarchical Captioning and Text-based Reasoning (HCTR), a simple but effective method that outperforms the base models on our benchmark.

By open-sourcing M³-Verse, we provide the community with a vital tool to advance the next generation of AI systems capable of understanding and reasoning about our dynamic world.

The development of Large Multimodal Models (LMMs) has markedly advanced AI's ability to perceive the world. Early works [1,15,26] mapped visual features to text embeddings, leading to models [3,21] that bridged visual encoders with LLMs for impressive visual question answering. This architecture was then extended from static images to short video understanding [14,16,19,23,43]. To push the boundaries of the model's capabilities, recent research has focused on long-term video understanding, tackling the challenge of long contexts with methods like token merging strategies [28,32] and innovative architectures [25,31,34,41]. Beyond extending context length, a new frontier aims to enhance the reasoning capabilities of LMMs via post-training techniques like reinforcement learning (RL) [9,10,36]. While these models have made significant progress in video processing, their ability to understand fine-grained changes in discrete multi-state videos remains unexplored. We introduce a benchmark concentrating on this overlooked dimension, multi-state video understanding, and utilize it to evaluate how well current SOTA models can identify the state transitions of key objects or scenes at different points in time.

Figure: Question taxonomy of M³-Verse. Intra-State: can be answered using information from a single state; Inter-State: requires information from both ‘before’ and ‘after’ states to be answered. These categories are further structured to evaluate four key capabilities: Spatial Understanding, Temporal Understanding, Attribute Recognition, and Reasoning.

To evaluate the visual understanding capabilities of LMMs, a large number of benchmarks have been proposed. Initial efforts aggregated traditional datasets for foundational image understanding tasks [22] and general video understanding [8]. More recently, the focus has shifted to fine-grained perception capabilities. Specialized benchmarks now assess fine-grained spatial relationships [13,27,37] as well as temporal and state-transition logic [6,11,17,40]. Beyond assessing specific abilities, the landscape of video understanding benchmarks has diversified. Some benchmarks [20,40] are designed to test a model’s ability to acquire information incrementally online and reason with it. Some [20,24,33] focus on hierarchical analysis of long-form videos, while others expand into cross-video contexts [24,40,44].

Despite this broad coverage, a dedicated evaluation of multi-state change perception, i.e., reasoning about discrete, non-continuous transitions, remains a critical gap. Existing benchmarks may touch upon state changes, but they do not isolate and rigorously test this specific capability. Our work is designed to fill this void by providing a targeted assessment of a model’s ability to perceive how objects or scenes change between distinct states, offering a crucial supplement to the current evaluation landscape.

We introduce M³-Verse, a Multi-modal, Multi-state, Multi-dimensional benchmark for evaluating the ability of Large Multimodal Models (LMMs) to reason about state transitions. Built upon the AI2-THOR simulator [12] and the ProcTHOR-10k dataset [7], it comprises 270 photorealistic indoor scenes. For each scene, we establish two distinct environmental states by programmatically altering objects, capturing a paired egocentric video for each state, resulting in 540 videos in total.

Our benchmark consists of 2,932 question-answer pairs designed to probe a model’s comprehension of dynamic environments. As illustrated in Fig. 3a, questions are divided into two main categories: intra-state questions, answerable from a single state, and inter-state questions, which require comparing the ‘before’ and ‘after’ states to identify changes. These categories are ultimately broken down into 50+ fine-grained sub-tasks, presented in multiple-choice formats.

To more closely align with real-world situations and specifically evaluate model hallucination, we also designed hallucination-type questions for the majority of task categories. This was achieved by replacing the correct choice with the option “No correct option is listed”. These hallucination-type questions account for 23.2% of the entire benchmark.

Data Collection. For each of the 270 scenes, we establish an initial state and a modified subsequent state. Changes between states include alterations to object presence, position (extrinsic changes), or internal attributes like being open/closed or toggled on/off (intrinsic changes), all while adhering to physical plausibility. We generate an exploration trajectory for an embodied agent in each state, capturing paired egocentric videos with rich sensory data, including RGB, depth, and instance segmentation masks. To enrich object descriptions, we developed a pipeline using a Describe-Anything Model [18] and an LLM to generate detailed visual attributes like color or shape for 964 unique object assets. Further details are provided in the Appendix.
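
To make the capture setup concrete, the following is a minimal sketch, not the authors' released code, of how egocentric frames with RGB, depth, and instance-segmentation data can be recorded for one scene state with the AI2-THOR Python API. The helper name `collect_state_frames`, the waypoint format, and the exact controller parameters (which vary across AI2-THOR versions) are our assumptions.

```python
# Minimal sketch (not the authors' released code) of recording one state's
# egocentric sensory data with the AI2-THOR Python API. Parameter and field
# names follow recent AI2-THOR releases and may differ slightly by version.
from ai2thor.controller import Controller

def collect_state_frames(scene, waypoints, size=448):
    """Replay an exploration trajectory and record RGB, depth, and instance masks."""
    controller = Controller(
        scene=scene,                       # ProcTHOR house spec or iTHOR scene name
        renderDepthImage=True,             # enables event.depth_frame
        renderInstanceSegmentation=True,   # enables event.instance_segmentation_frame
        width=size,
        height=size,
    )
    frames = []
    for position, rotation in waypoints:   # e.g. (dict(x=..., y=..., z=...), dict(x=0, y=90, z=0))
        event = controller.step(action="Teleport", position=position, rotation=rotation)
        frames.append({
            "rgb": event.frame,                              # H x W x 3 uint8
            "depth": event.depth_frame,                      # H x W float32, metres
            "instances": event.instance_segmentation_frame,  # per-pixel instance colours
            "visible_objects": [o["objectId"]
                                for o in event.metadata["objects"] if o["visible"]],
        })
    controller.stop()
    return frames
```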

Multi-Dimensional Question-Answer Generation. We employ an automated, template-based pipeline to generate QA pairs from the rich scene metadata. A key component is our hierarchical entity grounding mechanism, which creates unambiguous references to objects and rooms by leveraging intrinsic properties, relational contexts, and positional information. This ensures question clarity, especially in scenes with multiple similar objects. The templates are designed along two primary axes. The first axis is the required context, which is either intra-state or inter-state. The second axis is the cognitive skill being tested, which includes Spatial Understanding (S. U.), Temporal Understanding (T. U.), Attribute Recognition (A. R.), and Reasoning (Rs). To ensure the quality and visual-grounding of our benchmark, the generated QA pairs undergo a rigorous two-stage filtering process. An LLM first filters for textual ambiguity and relevance, followed by a multimodal filtering stage to remove trivial questions. Finally, a thorough human review process yielded the final set of 2,932 high-quality QA pairs. Further details are provided in the Appendix.

Backend Data. M³-Verse is supplemented with extensive backend data to foster future research. This supplementary information includes:

• Comprehensive Scene and Object Data: We provide the static spatial layout of the scene (e.g., the coordinate positions and types of each room, along with their interconnectivity). For each state (before and after), we also provide detailed information for every object. Object data encompasses its 3D pose, its parent receptacle (the object it is on), intrinsic attributes (e.g., openness, cleanliness), and asset-level properties (e.g., color, shape, function).
• Agent’s Egocentric Sensory Data: The agent’s complete navigation trajectories are provided. For each point along the trajectory, the data include the agent’s specific coordinates and orientation, the room it currently occupies, and a comprehensive list of objects visible within its field of view. For every frame in the navigation videos, we also include corresponding egocentric depth maps and instance segmentation masks.
• Operation Log Data: We employ an operation log file for efficient and accurate tracking of state transitions within the scene. This log documents the ID of each object that undergoes a state change, the specific operation applied to it, and the object’s attributes from both its initial and final states (an illustrative entry is sketched below).

This rich, multi-faceted collection of data establishes M³-Verse not just as an evaluation tool, but as a versatile resource to catalyze future research in areas such as video understanding, embodied AI, visual navigation, and situated reasoning.
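
For illustration only, a single operation-log entry could be represented as the Python dict below. The field names and the example object ID are hypothetical, not the benchmark's released schema, but the entry carries exactly the three pieces of information the log is described as recording.

```python
# Hypothetical operation-log entry (field names are our assumption): the ID of
# the changed object, the operation applied, and its attributes in both states.
log_entry = {
    "object_id": "Drawer|+01.25|+00.60|-02.10",     # AI2-THOR-style object ID (illustrative)
    "operation": "OpenObject",
    "before": {"isOpen": False, "openness": 0.0},
    "after":  {"isOpen": True,  "openness": 1.0},
}
```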

Task Formulation. To fully validate LMMs’ ability to perceive scenes with changing states, our experiments input the videos of a scene’s two states into the LMM, along with a question and candidate options. The videos are accompanied by textual descriptions indicating which state each represents. The LMM is then required to select the correct option(s).

Models and Baselines. We evaluate 22 models and 2 baselines, organized into four distinct categories:

• 2 Proprietary LMMs: Gemini-2.5-pro and GPT-5.
• 14 Open-Source LMMs, with model sizes ranging from 2B to 235B, which can be further split into general LMMs (e.g., InternVL-3.5 [30]), video-enhanced LMMs (e.g., Video-XL-2 [25]), and reasoning-enhanced LMMs (e.g., Video-R1 [9]).
• 6 Vision-Blind LLMs: following [29], we test text-only LLMs [35] at 6 different scales.
• 2 Baselines: we include random and human performance baselines.

Evaluation Protocol. To ensure fair and reproducible comparisons, we adopt a standardized evaluation protocol.

We sample an equal number of frames from the ‘before’ and ‘after’ videos, adjusting the total frame count and input resolution based on each model’s context capabilities and architectural constraints. Specifically, for models that support extended context windows, we uniformly sample 200 frames per video; otherwise, we default to 60 frames per video. The input frames are resized to a canonical resolution of 448×448 pixels. For models that do not natively support this resolution, we instead use the closest supported resolution that maintains the aspect ratio and minimizes deviation from 448×448.
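
For reference, the sampling step of this protocol can be sketched as follows; this is our own illustration (file names are hypothetical, decoding is done with OpenCV), not the authors' evaluation harness.

```python
# Uniformly sample an equal number of frames per state video and resize to 448x448.
import cv2
import numpy as np

def sample_frames(video_path, num_frames, size=(448, 448)):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)  # uniform spacing
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))
    cap.release()
    return frames

# Equal budget per state, e.g. 200 + 200 frames for long-context models, 60 + 60 otherwise.
before_frames = sample_frames("scene_042_before.mp4", 200)   # hypothetical file names
after_frames = sample_frames("scene_042_after.mp4", 200)
```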

To mitigate issues with answer format hallucination, our prompt explicitly specifies whether each question expects a single or multiple answers. Regarding reasoning behavior, thinking mode is disabled by default across all models. For models where thinking mode cannot be fully disabled, we configure it to operate in the minimal reasoning setting. Further implementation details are detailed in the Appendix.

We present the main evaluation results in Table 2. The results are analyzed across different model families, scales, and capabilities. Our analysis reveals several key insights into the current state of LMMs for state change understanding:

Contrary to the common “bigger is better” trend, larger model size does not consistently lead to improved performance on M³-Verse. For instance, the 4B versions of Qwen3-VL-Instruct and InternVL3.5-Instruct [30] clearly outperform their respective 8B counterparts. Moreover, Qwen3-VL-Instruct-4B emerges as the top-performing open-source model, even surpassing the massive Qwen3-VL-235B and Gemini-2.5-pro. This suggests that for fine-grained state understanding, architectural design and instruction tuning may be more critical than raw parameter count.

The Critical Role of Vision and Non-Monotonic Scaling. The results confirm that visual input is essential for the benchmark, as vision-enabled LMMs consistently outperform their vision-blind LLM counterparts of similar sizes. At the same time, performance within the vision-blind Qwen3 [35] family shows a clear positive correlation with model size, suggesting that a more powerful language backbone provides a better foundation for multimodal tasks.

Contrasting Intra-State and Inter-State Understanding. Interestingly, many models exhibit stronger performance on inter-state tasks than on intra-state tasks. We hypothesize that under our current experimental setup, intra-state tasks pose a more significant challenge by resembling a “needle in a haystack” problem. These tasks require a model to first accurately locate a specific state or attribute within a potentially complex scene before it can extract the relevant information for reasoning. This initial localization step adds a substantial layer of difficulty.

Hierarchy of Task Difficulty: From Simple Perception to Complex Reasoning. Across both intra-state and inter-state questions, a clear hierarchy of task difficulty emerges. Attribute Recognition (A.R.) is consistently the highest-scoring sub-task for most models, indicating that identifying object properties is a relatively robust capability. At the opposite end of the spectrum, intra-state Reasoning (Rs) proves to be the most formidable challenge. This pattern is universal across models, highlighting a critical bottleneck in their ability to perform multi-step logical operations, counting, or spatial reasoning within a single, complex scene.

Failure of Reasoning-Enhanced Models. Models specifically designed with enhanced reasoning capabilities, WeThink [36] and Video-R1 [9], were among the worst performers on M³-Verse. The failure of WeThink [36] is particularly revealing: its outputs show that it consistently failed to generate the specified number of options requested by the prompt. This basic format violation, rather than a flaw in reasoning itself, led directly to its extremely low accuracy, suggesting that robust instruction following is a critical prerequisite for complex multimodal tasks.

The Pervasive Challenge of Hallucination. A critical finding from our evaluation is the widespread difficulty models face with hallucination-type questions. As shown in Tab. 3, most LMMs perform substantially worse on questions designed to elicit hallucinatory responses than on factual ones. On average, there is a performance drop of 12.32 points on hallucination-centric questions relative to their fact-based counterparts. This pronounced weakness in handling potential hallucinations is a primary contributor to the modest overall performance scores, indicating a critical area for future model development.

Impact of Frame Sampling on Performance. To verify the impact of the number of sampled frames on the test results, we conducted tests on Qwen3-VL-4B-Instruct and Video-XL-2 [25] with different numbers of sampled frames. As illustrated in Fig.

To address the inherent challenges LMMs face in processing long-form videos and tracking temporal state changes, we propose and evaluate a simple policy: Hierarchical Captioning and Text-based Reasoning (HCTR). This approach first captions short clips of each state video, then summarizes the clip captions into a textual description of each state, and finally answers the question through text-based reasoning over these descriptions; a schematic sketch follows.
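
Based on this description and the prompts reproduced in the appendix, the overall flow can be sketched roughly as follows; `caption_clip`, `summarize`, and `answer` are placeholders for LMM or LLM calls with the corresponding prompts, not the authors' actual interfaces.

```python
# Schematic sketch of HCTR as we understand it from the text: caption each clip
# of both state videos, summarize the clip captions into per-state descriptions,
# then answer the question by purely text-based reasoning over those descriptions.
def hctr_answer(before_clips, after_clips, question, options,
                caption_clip, summarize, answer):
    # Stage 1: hierarchical captioning of each clip.
    before_captions = [caption_clip(clip) for clip in before_clips]
    after_captions = [caption_clip(clip) for clip in after_clips]

    # Stage 2: summarize clip captions into one description per state.
    before_summary = summarize(before_captions)
    after_summary = summarize(after_captions)

    # Stage 3: text-based reasoning over the two state descriptions.
    context = f"Before state: {before_summary}\nAfter state: {after_summary}"
    return answer(context=context, question=question, options=options)
```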

We evaluated this method on InternVL3.5-Instruct [30] and Qwen3-VL-Instruct at the 4B and 8B scales. The results, presented in Fig. 5, demonstrate a remarkable improvement over the standard end-to-end approach. Crucially, this significant enhancement in performance is achieved without the need for any elaborately designed prompts, highlighting the inherent effectiveness and simplicity of our method.

Significant Overall Performance Improvement. Here we utilize the LMM for all stages: clip captioning, summarization, and final answering. The hierarchical captioning method consistently and significantly outperforms the standard direct video QA method across most tested models. For instance, InternVL3.5-8B-Instruct [30] sees its overall accuracy jump by over 7 absolute points. Similarly, Qwen3-VL-8B-Instruct achieves an improvement of more than 4 absolute points compared to its standard counterpart. This shows that structuring the reasoning process yields a more effective and robust understanding.

Dramatically Enhanced Inter-State Reasoning. The most pronounced improvements are observed in the ‘inter-state’ evaluation, which is the core challenge of our benchmark. As shown in Fig. 5b, for InternVL3.5-8B-Instruct [30], the ‘inter-state’ average score rises by nearly 9 absolute points over the standard setting. A similar leap of nearly 7 absolute points is seen for InternVL3.5-4B-Instruct [30]. The gains in inter-state attribute recognition and reasoning are particularly notable. This strongly suggests that the hierarchical approach, by forcing the model to explicitly serialize the timeline of events into a textual narrative, effectively mitigates the tendency of end-to-end models to overlook or conflate subtle temporal changes, leading to a far superior understanding of state transitions.

LMM vs. LLM for Textual Reasoning. We also explore a hybrid approach that uses text-only LLMs [35] to summarize clip captions and answer the question. As shown in Fig. 6, our findings reveal a clear trend: when using an LLM of a size comparable to the LMM’s language backbone, this hybrid approach outperforms the original end-to-end base-model results but falls short of the full LMM pipeline. Notably, however, as we employ progressively larger and more powerful LLMs for the reasoning component, the performance of the hybrid system steadily increases, eventually exceeding that of the integrated LMM pipeline. This outcome yields two critical insights: (1) The integrated visual grounding within the full LMM pipeline offers a tangible advantage, likely by preserving a richer, more contextually-aware representation through the intermediate stages. This is evidenced by its superior performance over the hybrid model when the language components are of comparable size. (2) The ultimate performance is heavily contingent on the power of the reasoning module. The fact that a superior reasoning-focused LLM can compensate for and ultimately surpass the architecturally integrated LMM pipeline underscores that advanced reasoning capability is a paramount, and potentially overriding, factor for success in complex multimodal tasks.

In this paper, we introduce M³-Verse, a benchmark for evaluating the capacity of LMMs to perceive and reason about state transitions in paired videos. The experimental results reveal a stark performance gap between current SOTA LMMs and human cognition, highlighting that M³-Verse remains a profound challenge for today’s models. To better understand these limitations, we introduce a simple yet effective baseline and conduct a series of experiments, which successfully pinpoint key areas of weakness. By making this work publicly available, we aim to stimulate further research and development in this domain.

Your task is to determine whether a question requires visual information to answer.

Question: {question} Options: {options} Answer: {answer}

If the question can be answered correctly without visual information, respond with “Filter out”. Otherwise, respond with “Keep”. Your response must be either “Filter out” or “Keep”, with no additional text or explanation.

Figure 15. Vision-free Answerability Check Prompt.

To enhance path efficiency and safety, the initial rectangles are contracted inward. A trajectory that closely follows the outer boundaries of the reachable coordinates is problematic for two main reasons. First, it results in unnecessarily long paths, reducing exploration efficiency. Second, it introduces a significant risk of collision with nearby objects or walls, which could disrupt the agent’s movement. This contraction process may cause some larger rectangles to degenerate into simpler geometric primitives, such as line segments or single points. Finally, the shortest path connecting these contracted geometric shapes is computed using the A* search algorithm. This links all the primitives into a single, continuous tour, which forms the agent’s complete exploration trajectory.
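
The linking step can be illustrated with the sketch below, under simplifying assumptions of our own: each contracted primitive is reduced to a single anchor cell, the reachable positions form a 4-connected grid (as returned by AI2-THOR's GetReachablePositions), and consecutive anchors are joined with a standard A* search. This is not the paper's exact planner.

```python
# Our own simplified sketch of chaining exploration primitives with A* over a
# grid of reachable cells.
import heapq

def a_star(reachable, start, goal):
    """Shortest 4-connected path between two grid cells, restricted to `reachable`."""
    def h(p):                                    # Manhattan-distance heuristic
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), 0, start, [start])]
    visited = set()
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        x, z = node
        for nxt in ((x + 1, z), (x - 1, z), (x, z + 1), (x, z - 1)):
            if nxt in reachable and nxt not in visited:
                heapq.heappush(frontier, (cost + 1 + h(nxt), cost + 1, nxt, path + [nxt]))
    return None                                   # disconnected regions

def link_primitives(reachable, anchors):
    """Chain the primitives' anchor cells into one continuous exploration tour."""
    tour = [anchors[0]]
    for nxt in anchors[1:]:
        segment = a_star(reachable, tour[-1], nxt)
        if segment:
            tour.extend(segment[1:])              # drop the duplicated joint cell
    return tour
```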

You are a helpful assistant. Here you are provided with a question and its options generated with templates, which may contain unnatural phrasing. Your task is to rewrite the given question and options to make them sound more like natural language without changing their semantic meaning. Do not change the order of the options. Pay special attention to rewriting expressions like “state 0” and “state 1” to be more natural and closer to real language. Provide the rewritten question and options in JSON format with keys “rewritten_question” and “rewritten_options”.

Example Output: { “rewritten_question”: “What is the capital of France?”, “rewritten_options”: [“Paris”, “London”, “Berlin”] }

Environment Changing. A distinguishing feature of our benchmark is its focus on inter-state scene understanding, making the generation of environmental changes a cornerstone of our methodology. To ensure the integrity and realism of our tasks, all modifications strictly adhere to the physical and semantic constraints inherent to the AI2-THOR [12] simulator. This rigorous process guarantees that every altered scene remains physically plausible and logically coherent. We introduce a diverse set of controlled modifications to simulate realistic alterations. These changes encompass:

• Object Extrinsic Changes: Altering the presence or pose of objects, including moving an object to a new location, adding a new object to the scene, or removing an existing one. When moving an object, we first identify all receptacles within the scene where the object can be placed, along with the specific coordinates on those receptacles that can support it. A position is then randomly selected from all possible placement coordinates, and the object is moved to that location.
• Object Intrinsic State Changes: Modifying the internal state of an object through simulator-supported interactions (see the sketch after the list below):

- Open/Close Object: Applies to objects with the openable property (e.g., book, laptop). This operation allows the openness of an object to be adjusted.
- Slice Object: Applicable to objects possessing the sliceable property (e.g., apple, potato, tomato), provided they have not been sliced previously.
- Break Object: Targets objects marked as breakable (e.g., egg, bottle, bowl) that are currently intact.
- Toggle Object On/Off: Designed for toggleable objects (e.g., lights, fans), enabling the alteration of their binary toggled state.
- Clean/Dirty Object: Utilized for objects with the dirtyable property (e.g., bed, cloth, cup), allowing them to be cleaned if dirty or made dirty if clean.
- Cook Object: Applicable to cookable objects (e.g., egg, bread, potato) that are in an uncooked state.
- Fill/Empty Liquid: Pertains to objects that canFillWithLiquid (e.g., bottle, bowl, cup). This includes filling an empty object with specified liquids (e.g., water, coffee, wine) or emptying an object that is currently filled.
- Use Up Object: Designed for objects that canBeUsedUp (e.g., toilet paper, soap bottle) and have not yet reached their “used up” state.
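
As a rough illustration of how these changes can be issued programmatically, the sketch below pairs each eligibility flag from AI2-THOR's object metadata with a corresponding interaction action. This is our own sketch rather than the authors' generation code, and exact action signatures may vary across simulator versions.

```python
# Map object-metadata eligibility flags to AI2-THOR intrinsic-change actions.
INTRINSIC_ACTIONS = {
    "openable":          lambda oid: dict(action="OpenObject", objectId=oid, openness=1.0),
    "sliceable":         lambda oid: dict(action="SliceObject", objectId=oid),
    "breakable":         lambda oid: dict(action="BreakObject", objectId=oid),
    "toggleable":        lambda oid: dict(action="ToggleObjectOn", objectId=oid),
    "dirtyable":         lambda oid: dict(action="DirtyObject", objectId=oid),
    "cookable":          lambda oid: dict(action="CookObject", objectId=oid),
    "canFillWithLiquid": lambda oid: dict(action="FillObjectWithLiquid", objectId=oid,
                                          fillLiquid="water"),
    "canBeUsedUp":       lambda oid: dict(action="UseUpObject", objectId=oid),
}

def apply_intrinsic_change(controller, obj, prop):
    """Apply one simulator-supported intrinsic change to an eligible object."""
    if not obj.get(prop):
        raise ValueError(f"{obj['objectId']} does not support '{prop}'")
    event = controller.step(**INTRINSIC_ACTIONS[prop](obj["objectId"]))
    return event.metadata["lastActionSuccess"]   # AI2-THOR reports action success here
```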

The native object information provided by the simulator backend lacks essential visual attributes such as color and shape, which are critical for effective object referring and grounding. Furthermore, objects within the same category often correspond to different assets, each possessing unique visual properties. To address this limitation, we developed a pipeline to generate detailed attributes for each unique object asset.

First, we isolate all visual instances of a specific asset by leveraging the recorded first-person RGB frames and their corresponding object segmentation masks. From the resulting image patches for each asset, we select the five highest-quality views, prioritizing those that are largest and closest to the center of the frame, to ensure a clear representation. These five images are then input into a Describe Anything Model [18] to generate distinct textual descriptions from various viewpoints with the prompt shown in Fig. 12. Subsequently, these multiple descriptions for a single asset are fed into an LLM [35] to synthesize a unified and comprehensive final description for that asset with the prompt shown in Fig. 13.

Recognizing that assets within the same category share common functionalities but differ in appearance, we perform a final comparative analysis. The descriptions for several different assets belonging to the same category are provided to an LLM [35] with prompt shown in Fig. 14. The model is prompted to compare them, identify the consistent object function, and extract the discriminative attributes (e.g., color, shape, and special features) that make each asset unique. Through this process, we successfully generated detailed descriptions and attribute profiles for 964 unique object assets.
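
The view-selection step described above can be approximated as follows; the scoring heuristic (mask area weighted by centrality) and the helper names are our own simplification of the "largest and closest to the center" criterion.

```python
# Rank candidate patches of one asset by mask size and closeness to the frame
# centre, and keep the five best views for captioning.
import numpy as np

def select_best_views(frames, masks, object_id, k=5):
    """`frames` are RGB arrays; `masks[i]` maps object IDs to boolean masks for frame i."""
    scored = []
    for frame, frame_masks in zip(frames, masks):
        mask = frame_masks.get(object_id)
        if mask is None or not mask.any():
            continue
        ys, xs = np.nonzero(mask)
        h, w = mask.shape
        area = xs.size
        # Normalized distance of the mask centroid from the frame centre.
        centre_dist = np.hypot(xs.mean() - w / 2, ys.mean() - h / 2) / np.hypot(w / 2, h / 2)
        score = area * (1.0 - centre_dist)        # larger and more central is better
        patch = frame[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        scored.append((score, patch))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [patch for _, patch in scored[:k]]
```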

Robust Entity Grounding Framework. In natural language, entities are typically referenced by their general categories, such as “the trash can” or “the bedroom”. However, real-world scenes often feature multiple instances of the same category, rendering such generic references ambiguous. Resolving this requires descriptive language precise enough to localize specific entities, for instance, distinguishing “the red trash can” from a blue one, or specifying “the larger bedroom”. The challenge of generating unambiguous referring expressions is critical for evaluating fine-grained understanding in agents.

We tackle this challenge by implementing a multifaceted, hierarchical grounding mechanism. This mechanism leverages a rich set of contextual cues to dynamically adjust the granularity of descriptions, prioritizing minimal yet sufficient information. By balancing conciseness with clarity, our approach prevents over-specification while ensuring unique identification, thereby generating highly specific references (a simplified sketch follows the list below):

• Hierarchical Uniqueness Assessment: We systematically evaluate the uniqueness of objects and rooms based on their intrinsic properties (e.g., type, shape, size, and color) at both global (scene-wide) and local (type-specific) levels. This ensures that each entity is identified using the most concise unambiguous descriptor available.
• Relational and Positional Grounding: Beyond intrinsic properties, our system incorporates relational contexts, identifying entities through containment relationships (e.g., “apple in fridge”) and relative spatial ordering (e.g., “the lowest painting”). This capability is vital for distinguishing between visually similar entities.
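
A highly simplified sketch of this idea is shown below; the attribute levels and field names are illustrative assumptions, but the logic mirrors returning the most concise descriptor that matches exactly one entity before falling back to relational grounding.

```python
# Return the most concise attribute set that uniquely identifies `target`
# among `scene_objects`; fall back to a containment relation otherwise.
def minimal_reference(target, scene_objects):
    levels = [("type",), ("type", "color"), ("type", "color", "room")]  # illustrative
    for keys in levels:
        same = [obj for obj in scene_objects
                if all(obj.get(k) == target.get(k) for k in keys)]
        if len(same) == 1:
            return {k: target[k] for k in keys}   # e.g. {"type": "trash can", "color": "red"}
    # Relational fallback, e.g. "the apple in the fridge".
    return {"type": target["type"], "in": target["parent_receptacle"]}
```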

A novel aspect of our approach is the ability to track uniquely identifiable entities across varying scene states. By comparing object sets and their properties between distinct states, we can pinpoint objects that have undergone changes or are uniquely present/absent. This facilitates the generation of questions related to dynamic scene evolution and object manipulation.

Multi-Dimensional QA Design and Template-based Generation. Leveraging the rich scene metadata and our robust entity grounding, we employ an automated, template-based approach to construct the QA pairs. Our generation framework is structured along two primary dimensions, the required context (intra-state or inter-state) and the cognitive skill being tested, to ensure comprehensive evaluation. The generated QA pairs then undergo a two-stage filtering process:

• Stage 1: LLM-based Filtering.
  - Vision-free Answerability Check: Questions that can be answered without visual information are removed with the prompt shown in Fig. 15. For instance, “What is the function of the bed?” with the answer “For sleeping” would be removed, as it tests general knowledge rather than visual perception of the scene.
  - Ambiguity Removal: The LLM [35] evaluates whether the questions or options contain ambiguities or potential confusions with the prompt shown in Fig. 16, and such ambiguous cases are removed. For instance, a question asking the model to distinguish between two objects with nearly identical colors (e.g., options like “the cyan cup” versus “the light blue cup”) would be flagged and eliminated, as the distinction is subjective and unreliable.
  - Linguistic Enhancement: The LLM rewrites the template-based questions and options into more natural, fluent, and human-like expressions with the prompt shown in Fig. 17.
• Stage 2: Multimodal Filtering. We conduct preliminary answer testing using a lightweight LMM with full vision-language context. For each question, we run multiple inference trials. If the model consistently answers correctly without additional prompting, we deem the question too easy and remove it for lacking sufficient challenge (a brief sketch follows below).

Through this pipeline, we ultimately constructed a high-quality, high-discriminability inter-state visual question answering benchmark.
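
The Stage 2 filter reduces to the small sketch below; `ask_lmm` stands in for a call to the lightweight LMM with full vision-language context, and the trial count is an assumption, since the text only states that multiple inference trials are run.

```python
# Drop a question if the lightweight LMM answers it correctly in every trial.
def is_too_easy(ask_lmm, videos, question, options, answer, trials=3):
    return all(ask_lmm(videos, question, options) == answer for _ in range(trials))

# Usage (ask_lmm and qa_pairs are placeholders for the actual model call and data):
# kept = [qa for qa in qa_pairs
#         if not is_too_easy(ask_lmm, qa["videos"], qa["question"],
#                            qa["options"], qa["answer"])]
```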

Human Reviewing, Testing and Cleansing. We recruited 12 reviewers with at least a bachelor’s degree to re-examine and screen the data and to evaluate human performance on M³-Verse, ultimately yielding 2,932 high-quality QA pairs.

Evaluation Strategies. To ensure fair and rigorous comparisons across models with diverse video processing capabilities, we implement a standardized evaluation protocol. The protocol is detailed as follows:

• Input Frames. For models accessed through an API, our methodology was to maximize the frame count within the constraints of their supported context windows. This guiding principle led to the final configuration of 200 frames per video for GPT-5, a frame rate of 2 fps for Gemini-2.5-Pro, and 1 fps for Qwen3-VL-235B-A22B-Instruct. For models executed locally, the number of input frames was tailored to their contextual capabilities. The default protocol involved uniformly sampling 200 frames from each of the ‘before’ and ‘after’ state videos, resulting in a total of 400 frames being input to the LMM. Notably, for the InternVL3.5 series [30] and VideoLLaMA3 [42], their restricted context lengths necessitated a reduction to 60 sampled frames per video (120 frames total). Video-XL-2 [25] demanded a bespoke adjustment due to its inherent lack of support for multi-video inputs. For this model, we concatenated the frames from both videos into a continuous stream, incorporating explicit textual markers to delineate the ‘before’ and ‘after’ content.
• Input Resolution. For closed-source models, the resolution at which video frames are processed is an internal parameter fixed by the service provider, and therefore cannot be adjusted by the user. For open-source models, the input resolution is standardized to 448×448 for all models to maintain consistency. The only exception is VideoLLaMA3 [42], which is set to 336×448 resolution to accommodate its fixed aspect ratio requirements.
• Thinking-mode Parameters. To standardize the reasoning process across models, we configured model-specific inference parameters where available. For the API-based models, we set the reasoning effort to low for GPT-5 and the thinking budget to 128 for Gemini-2.5-Pro. Qwen3-VL-235B-A22B-Instruct was called in non-thinking mode. For the local models, we disabled explicit thinking modes for all models, with the sole exception of WeThink-Qwen2.5VL-7B [36], which was evaluated using its native, built-in thinking mode.

Your task is to answer questions based on the provided videos and options.

Question: {question text} options: {candidate options} Return ONLY the letter(s) corresponding to the correct option(s) (e.g., A, B, C, or A,C for multiple correct answers)

Questions with only one correct option: There is only one correct option for this question. Questions with multiple correct options: There are more than one correct option for this question.

Figure 18. Evaluation Prompt for Vision-Blind LLMs [35].

When evaluating the LLMs [35], we only input the question, options, and necessary prompts, requiring the model to answer based solely on the text input. When evaluating the LMMs, in addition to the LLM inputs, we also provided videos showing the before and after states of the scene corresponding to the question, requiring the model to answer based on the video content.

The prompt provided to each model is dynamically adjusted to specify whether the question is single-choice or multiple-choice. This decision was based on preliminary experiments, which revealed that without such guidance, a subset of models exhibited a tendency to hallucinate multiple answers even for questions with only a single correct option. This behavior would lead to undeserved penalties and an inaccurate evaluation of their true capabilities. Therefore, to mitigate this issue and ensure a fair assessment, we explicitly instruct the model on the expected number of answers based on the ground truth.

Evaluation Metrics. Since the benchmark includes both single-choice and multiple-choice questions, we score the model’s answers based on accuracy: if the model’s answer is a subset of the ground truth, it receives a score proportional to the fraction of the ground truth it covers; if the model’s selection includes any incorrect option, it receives zero points for that question.
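
This rule translates directly into code; below is a minimal implementation of the stated scoring, with answers represented as sets of option letters.

```python
# Partial credit for a correct subset of the ground truth; zero if any selected
# option is wrong.
def score_answer(predicted, ground_truth):
    predicted, ground_truth = set(predicted), set(ground_truth)
    if not predicted or not predicted.issubset(ground_truth):
        return 0.0                                # any incorrect option voids the question
    return len(predicted) / len(ground_truth)     # fraction of ground truth covered

assert score_answer("A", "A") == 1.0
assert score_answer("A", "AC") == 0.5             # correct subset: partial credit
assert score_answer("AB", "AC") == 0.0            # contains a wrong option
```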

We performed an ablation study on the HCTR method to evaluate the effect of varying clip lengths on different models. For InternVL3.5-4B and InternVL3.5-8B [30], we assessed clip lengths of 50, 100, and 200, uniformly sampling 50 frames from each clip. For the Qwen3-VL series, we evaluated clip lengths of 100 and 200 for the Qwen3-VL-4B model and a length of 200 for the Qwen3-VL-8B model, with each clip being uniformly sampled to 100 frames.

Please synthesize the following descriptions of different segments of a video into a single, coherent, and detailed overall caption for the entire video.

Segment Descriptions: {captions for all the clips that make up this complete video} Focus on the environment spatial layout, objects, including their spatial layout, detailed information, and the flow of the video. Please describe in clear and concise language.

Based on the following context information, answer the question: Context: {context} Question: {question} Options: {options text} Return ONLY the option letter(s) (e.g., “A”, “B”, “C”). For multiple correct options, concatenate the letters (e.g., “ABC”). AVOID ANY OTHER OUTPUT CONTENT.

In testing the HCTR method, we also conducted an ablation study by removing the caption summarization module with InternVL3.5-4B and InternVL3.5-8B [30]. In this configuration, the large model was tasked with generating answers directly from the raw captions of each clip. The clip length is set to 50. The results are shown in Tab. 8.

(a) Performance comparison measured by Overall Score (left: base-model performance; right: HCTR performance). (b) Performance on the inter-state subset (blue: correct samples; orange: incorrect samples).

Question: {question} Options: {options} Answer: {answer} If you identify any issues (e.g., ambiguous options), respond with “Filter out”. Otherwise, respond with “Keep”. Your response must be either “Filter out” or “Keep”, with no additional text or explanation.
