MIND: Benchmarking Memory Consistency and Action Control in World Models

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain closed-loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS, including 100 (first-person) + 100 (third-person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate the action generalization capability across different action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long-term memory consistency and generalizing across action spaces. Code: https://github.com/CSU-JPG/MIND.


💡 Research Summary

The paper introduces MIND, the first open‑domain, closed‑loop benchmark designed to evaluate two fundamental capabilities of world models: memory consistency and action control. Existing benchmarks focus largely on visual fidelity, physical plausibility, or single‑view (usually first‑person) scenarios, and they rarely assess long‑term memory or the ability to generalize across different action spaces. MIND fills this gap by providing 250 high‑resolution (1080p, 24 FPS) videos rendered in Unreal Engine 5, split evenly between first‑person and third‑person perspectives. Of these, 200 clips (100 per view) share a common action space, while the remaining 50 clips (25 per view) use varied action spaces, enabling systematic study of cross‑action‑space generalization.

The benchmark defines a basic action set of eight discrete commands (W, A, S, D for translation and ↑, ↓, ←, → for camera pitch/yaw). For each action, both a translational step Δp and a rotational step Δr are parameterized, and five distinct (Δp, Δr) combinations are created to represent fine‑grained, medium, and coarse movements. This design allows the same scene to be explored under multiple motion scales, testing whether a model can adapt its dynamics without retraining.
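The action-space design above can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: the eight commands and the five (Δp, Δr) scales come from the paper, but the concrete step values and all names (`Action`, `build_action_space`) are assumptions.

```python
from dataclasses import dataclass
from itertools import product

# Eight basic commands from the paper: WASD translation, arrow-key camera rotation.
TRANSLATION_KEYS = ["W", "A", "S", "D"]          # forward / left / back / right
ROTATION_KEYS = ["UP", "DOWN", "LEFT", "RIGHT"]  # camera pitch / yaw

# Five (Δp, Δr) combinations, fine-grained to coarse. The paper specifies
# that five such scales exist; these particular values are illustrative.
STEP_COMBOS = [
    (0.10, 5.0),   # (Δp in metres, Δr in degrees)
    (0.25, 10.0),
    (0.50, 15.0),
    (1.00, 30.0),
    (2.00, 45.0),
]

@dataclass(frozen=True)
class Action:
    key: str        # one of the eight basic commands
    delta_p: float  # translational step size Δp
    delta_r: float  # rotational step size Δr

def build_action_space() -> list[Action]:
    """Enumerate every (command, scale) pair: 8 keys x 5 scales = 40 actions."""
    return [
        Action(key, dp, dr)
        for key, (dp, dr) in product(TRANSLATION_KEYS + ROTATION_KEYS, STEP_COMBOS)
    ]

actions = build_action_space()
```

Enumerating the full cross-product makes it explicit that the same scene can be traversed at several motion scales without changing the command vocabulary.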

Memory consistency is evaluated through a "re-visit" protocol: a model first observes a memory segment M = {f₁, …, f_T} and then receives a future action sequence A = {a_{T+1}, …, a_{T+k}}. It must generate predicted frames V̂ = {f̂_{T+1}, …, f̂_{T+k}}. Consistency is measured by the squared L₂ distance between each predicted frame and the ground-truth frame at the same revisited state, L_mem = ‖f̂_t − f′_t‖². This captures both spatial layout preservation (objects, textures, geometry) and temporal smoothness (no flickering or abrupt changes). Long-context memory consistency is quantified by the mean squared error (L_lcm) over the entire predicted sequence, reflecting how well the model retains information over extended horizons.
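The two memory metrics can be written in a few lines. This is a sketch under stated assumptions: frames are float arrays of shape (H, W, 3), and the function names mirror the text's L_mem and L_lcm; the benchmark's exact normalization is not specified here and is assumed.

```python
import numpy as np

def l_mem(pred_frame: np.ndarray, revisit_frame: np.ndarray) -> float:
    """Per-frame memory loss: squared L2 distance between the predicted
    frame and the ground-truth frame at the same revisited state."""
    return float(np.sum((pred_frame - revisit_frame) ** 2))

def l_lcm(pred_frames: np.ndarray, gt_frames: np.ndarray) -> float:
    """Long-context memory: mean squared error over the entire predicted
    sequence (arrays of shape (k, H, W, 3))."""
    return float(np.mean((pred_frames - gt_frames) ** 2))
```

L_mem isolates spatial drift at a single revisited state, while L_lcm averages over the whole horizon and so penalizes gradual degradation.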

Action control is assessed by comparing the model’s generated motion against the ground‑truth action logs. Accuracy is reported per‑action and per‑view, and a cross‑action‑space generalization score is computed by measuring performance drop when the model trained on the shared action space is evaluated on the varied‑action clips.
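A minimal scoring sketch for the protocol above, assuming per-step action labels: `action_accuracy` compares generated motion against the logged ground-truth actions, and `generalization_drop` expresses the shared-to-varied performance drop as a relative score. Both function names and the exact drop formula are illustrative assumptions, not the paper's definitions.

```python
def action_accuracy(predicted: list[str], ground_truth: list[str]) -> float:
    """Fraction of steps whose generated motion matches the logged action."""
    assert len(predicted) == len(ground_truth) and ground_truth
    hits = sum(p == g for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)

def generalization_drop(shared_acc: float, varied_acc: float) -> float:
    """Relative accuracy drop when moving from the shared action space
    to the varied-action clips (0.0 = no drop, higher = worse)."""
    return (shared_acc - varied_acc) / shared_acc
```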

To facilitate baseline comparisons, the authors present MIND‑World, an interactive Video‑to‑World system that ingests action logs and produces video output in real time. MIND‑World integrates recent diffusion‑based world‑model advances (e.g., KV‑caching, Self‑Forcing) with memory‑compression modules such as CAM, Infinite‑World, and explicit 3‑D memory representations (SPMem). This pipeline demonstrates that current state‑of‑the‑art models can generate plausible short‑term predictions but struggle with longer horizons and with actions that deviate from the training distribution.
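The closed-loop, action-conditioned rollout described above can be sketched generically. `WorldModel` is a hypothetical interface, not MIND-World's actual API: the key point is that each predicted frame is fed back into the context, so errors compound over the horizon.

```python
from typing import Protocol
import numpy as np

class WorldModel(Protocol):
    """Hypothetical interface: predict the next frame from context + action."""
    def step(self, context: list[np.ndarray], action: str) -> np.ndarray: ...

def rollout(model: WorldModel,
            memory: list[np.ndarray],
            actions: list[str]) -> list[np.ndarray]:
    """Generate one frame per action, feeding each prediction back as context."""
    context = list(memory)  # observed memory segment f_1 .. f_T
    preds = []
    for a in actions:       # future actions a_{T+1} .. a_{T+k}
        frame = model.step(context, a)
        context.append(frame)  # closed loop: predictions re-enter the context
        preds.append(frame)
    return preds
```

Techniques such as KV-caching and memory compression exist precisely to keep this growing context tractable over long horizons.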

Extensive experiments on several leading video generation models (including SVD, Hunyuan‑Video, Sora 2) reveal systematic shortcomings: (1) memory consistency remains high for the first 1–2 seconds (24–48 frames at 24 FPS) but degrades sharply beyond 5 seconds, leading to drift in object positions and texture loss; (2) action control is reliable at the base movement/rotation speeds (≈85 % success) but drops below 60 % for larger step sizes; (3) cross‑action‑space generalization incurs a 1.8× increase in MSE, indicating limited adaptability to new motion scales. These findings underscore that while visual realism has advanced, the core embodied capabilities of world models remain immature.

The paper’s contributions are threefold: (i) a publicly released, high‑quality, multi‑view, multi‑action dataset; (ii) a unified evaluation protocol that jointly measures long‑term memory fidelity and precise action execution; (iii) the introduction of action‑space generalization as a new research axis. The authors argue that future work should focus on more efficient memory compression, explicit 3‑D scene reconstruction for viewpoint invariance, and meta‑learning approaches that enable rapid adaptation to novel action spaces. MIND is positioned as a standard testbed to drive progress toward world models that can both remember reliably over long horizons and act with fine‑grained control across diverse environments.

