On Memory: A comparison of memory mechanisms in world models
World models enable agents to plan within imagined environments by predicting future states conditioned on past observations and actions. However, their ability to plan over long horizons is limited by the effective memory span of the backbone architecture. This limitation leads to perceptual drift in long rollouts, hindering the model’s capacity to perform loop closures within imagined trajectories. In this work, we investigate the effective memory span of transformer-based world models through an analysis of several memory augmentation mechanisms. We introduce a taxonomy that distinguishes between memory encoding and memory injection mechanisms, motivating their roles in extending the world model’s memory through the lens of residual stream dynamics. Using a state recall evaluation task, we measure the memory recall of each mechanism and analyze its respective trade-offs. Our findings show that memory mechanisms improve the effective memory span in vision transformers and provide a path to completing loop closures within a world model’s imagination.
💡 Research Summary
The paper tackles a fundamental limitation of transformer‑based world models: their short effective memory span, which hampers long‑horizon planning and leads to perceptual drift during extended rollouts. When an agent imagines future states, the backbone architecture can only retain a limited amount of past information, causing errors to accumulate and preventing the model from performing loop closures—recognizing that it has returned to a previously visited location—in its imagined trajectories.
To address this, the authors propose a taxonomy that separates memory‑related techniques into two orthogonal categories: memory encoding and memory injection. Memory encoding compresses past observations and actions into a compact representation (often a set of learned memory tokens) that can be stored efficiently. Memory injection then integrates these tokens into the transformer’s residual stream, allowing the model to retrieve and reuse historical context during forward passes. By framing the interaction in terms of Residual Stream Dynamics, the paper provides a clear theoretical lens for understanding how encoded memories influence the flow of information through each transformer block.
The experimental evaluation proceeds in two stages. First, a state recall benchmark measures how well each mechanism preserves information over increasing time steps. A vanilla vision transformer loses recall accuracy after roughly 10–15 frames. Adding only memory encoding extends reliable recall to about 30 frames, while the combination of encoding plus injection pushes this to 50 frames or more, with a modest increase in computational overhead.
Second, the authors test long‑horizon rollouts in a simulated visual environment where the agent must predict 100 steps ahead and successfully execute a loop closure. Models equipped with both encoding and injection exhibit dramatically reduced perceptual drift; they maintain accurate spatial representations beyond 70 steps and achieve an 85 % loop‑closure success rate. In contrast, the baseline model’s error grows rapidly after 40 steps, making loop closure virtually impossible.
From a resource perspective, memory encoding adds roughly 20 % extra FLOPs due to an additional encoder layer but incurs minimal parameter growth. Memory injection introduces a dedicated attention module, raising FLOPs by 30–40 % while actually lowering overall memory consumption because the compact tokens replace the need to keep full‑resolution hidden states. The trade‑off analysis shows that a hybrid approach can balance performance gains with acceptable computational cost.
The authors conclude that augmenting transformer‑based world models with structured memory mechanisms substantially extends their effective memory span, enabling more reliable long‑term imagination and planning. Their taxonomy and residual‑stream analysis offer a principled framework that can be applied to a broad range of domains—robotic navigation, game AI, and any simulation‑driven decision‑making system—where maintaining coherent long‑range context is essential. The work thus paves the way for next‑generation world models capable of sophisticated, loop‑aware planning over extended horizons.