MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation


Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, LIBERO-5 suites and Mikasa-Robo, it achieves 71.9%, 72.7%, 96.5%, and 41.2% success rates, respectively, all outperforming state-of-the-art baselines CogACT and pi-0, with a notable +14.6 gain on Bridge and +11.8 gain on Mikasa-Robo. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA


💡 Research Summary

MemoryVLA tackles a fundamental limitation of current Vision‑Language‑Action (VLA) systems: they treat manipulation as a Markovian problem and rely solely on the current observation, which leads to poor performance on long‑horizon, temporally dependent tasks. Inspired by the dual‑memory architecture of the human brain—working memory for short‑term control and hippocampal episodic memory for long‑term experience—the authors propose a Cognition‑Memory‑Action framework that explicitly models temporal context.

The pipeline begins with a pretrained 7‑billion‑parameter Vision‑Language Model (VLM). An RGB image is processed by parallel DINOv2 and SigLIP backbones, whose features are compressed into 256 perceptual tokens. Simultaneously, the same visual features are projected into the language embedding space and concatenated with the tokenized instruction before being fed to LLaMA‑7B; the embedding at the end‑of‑sentence token becomes a single high‑level cognitive token. Together these tokens form a short‑term “working memory”.
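The split into perceptual and cognitive tokens can be illustrated with a minimal sketch. This is not the paper's implementation (the real model compresses dual-backbone features into exactly 256 tokens with a learned module); here a simple channel concatenation and linear projection (`proj`) stand in for that step, and `eos_index` marks the end-of-sentence position whose LLM hidden state becomes the cognitive token:

```python
import numpy as np

def build_working_memory(dino_feats, siglip_feats, llm_hidden, eos_index, proj):
    """Form the working memory from one observation.

    dino_feats:   (N, d1) patch features from the DINOv2 backbone
    siglip_feats: (N, d2) patch features from the SigLIP backbone
    llm_hidden:   (T, d_llm) LLM hidden states over image+instruction tokens
    eos_index:    position of the end-of-sentence token
    proj:         (d1 + d2, d) projection standing in for the learned compressor
    """
    visual = np.concatenate([dino_feats, siglip_feats], axis=-1)  # (N, d1+d2)
    perceptual = visual @ proj                                    # (N, d) perceptual tokens
    cognitive = llm_hidden[eos_index]                             # (d_llm,) cognitive token
    return perceptual, cognitive
```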

To provide long‑term context, the system maintains a Perceptual‑Cognitive Memory Bank (PCMB) consisting of two streams (perceptual and cognitive), each holding up to L entries. Every entry is tagged with a sinusoidal timestep embedding, preserving temporal order. At each step, the current working memory queries the PCMB via cross‑attention with positional encodings. Scaled‑dot‑product attention yields raw retrieved embeddings, which are passed through two Transformer layers to produce refined perceptual (H_p) and cognitive (H_c) memories.
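The retrieval step above can be sketched as single-query cross-attention over one memory stream. This is a simplified, single-head version (the refinement Transformer layers are omitted), and the way timestep embeddings are added to the keys is an assumption about the exact parameterization:

```python
import numpy as np

def sinusoidal_embedding(t, dim):
    """Standard sinusoidal embedding for an integer timestep t."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def retrieve(query, bank_tokens, bank_steps, dim):
    """Scaled dot-product cross-attention: a working-memory token queries the bank.

    query:        (dim,) current working-memory token
    bank_tokens:  (L, dim) stored memory entries
    bank_steps:   (L,) integer timestep of each entry
    """
    # Tag each entry with its temporal position before attending.
    keys = bank_tokens + np.stack([sinusoidal_embedding(t, dim) for t in bank_steps])
    scores = keys @ query / np.sqrt(dim)      # (L,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over bank entries
    return weights @ bank_tokens              # (dim,) retrieved memory
```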

Retrieved memories are merged with the current working memory through a gated fusion mechanism: a sigmoid gate determines how much of the current token versus the retrieved token should contribute to the final representation, allowing the model to balance fresh sensory input with relevant past context.
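A minimal sketch of this gated fusion follows. The precise gate parameterization (a linear layer `W_g`, `b_g` over the concatenated current and retrieved tokens) is an assumption; the key idea is that a sigmoid gate interpolates element-wise between the two:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(current, retrieved, W_g, b_g):
    """Gated fusion of the current token with its retrieved memory.

    g = sigmoid(W_g @ [current; retrieved] + b_g)
    fused = g * current + (1 - g) * retrieved
    """
    g = sigmoid(W_g @ np.concatenate([current, retrieved]) + b_g)
    return g * current + (1.0 - g) * retrieved
```

With a strongly positive gate the model keeps fresh sensory input; with a near-zero gate it leans on past context.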

When the PCMB reaches capacity, a consolidation step merges the most similar adjacent entries (based on cosine similarity), averaging their token vectors and updating their timestamps. This keeps the memory compact while retaining essential episodic information.
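The consolidation step can be sketched as follows: find the adjacent pair with the highest cosine similarity, replace it with the average of the two tokens and timestamps, and shrink the bank by one entry. The averaging of timestamps mirrors the summary's description; details beyond that are assumptions:

```python
import numpy as np

def consolidate(tokens, steps):
    """Merge the most similar adjacent pair of memory entries.

    tokens: (L, d) stored token vectors
    steps:  (L,) timestamps of each entry
    Returns a bank with L - 1 entries, preserving temporal order.
    """
    unit = tokens / np.clip(np.linalg.norm(tokens, axis=1, keepdims=True), 1e-8, None)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)   # cosine similarity of neighbors
    i = int(np.argmax(sims))                      # most redundant adjacent pair
    merged_tok = (tokens[i] + tokens[i + 1]) / 2.0
    merged_step = (steps[i] + steps[i + 1]) / 2.0
    new_tokens = np.concatenate([tokens[:i], merged_tok[None], tokens[i + 2:]])
    new_steps = np.concatenate([steps[:i], [merged_step], steps[i + 2:]])
    return new_tokens, new_steps
```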

The fused tokens condition a diffusion‑based action expert. The cognitive token serves as the primary conditioning signal, while perceptual tokens enrich the diffusion process with fine‑grained visual detail. The diffusion model iteratively denoises a latent action trajectory, producing a sequence of 7‑DoF robot commands that respect the temporal dependencies encoded in the memory.
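The reverse process can be sketched as a toy denoising loop: start from Gaussian noise and iteratively refine a `(horizon, dof)` action chunk, conditioned on the fused memory tokens. The update rule below is deliberately simplified (no real noise schedule or DiT backbone), and `eps_model` is a placeholder for the learned noise predictor:

```python
import numpy as np

def denoise_actions(eps_model, cond, steps=10, horizon=8, dof=7, seed=0):
    """Toy diffusion-style reverse process for an action chunk.

    eps_model(x, t, cond) -> predicted noise, same shape as x.
    Returns a (horizon, dof) array of denoised 7-DoF actions.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(horizon, dof))        # start from pure noise
    for t in range(steps, 0, -1):
        eps = eps_model(x, t, cond)            # condition on memory tokens
        x = x - eps / steps                    # simplified denoising update
    return x
```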

Experiments span four simulated suites (SimplerEnv‑Bridge, Fractal, LIBERO‑5, Mikasa‑Robo) and twelve real‑world tasks on Franka and WidowX robots. MemoryVLA achieves 71.9% (Bridge), 72.7% (Fractal), 96.5% (LIBERO‑5), and 41.2% (Mikasa‑Robo) success rates, outperforming the strongest baselines CogACT and π₀ by +14.6, +4.6, +3, and +11.8 percentage points respectively. On real‑world tasks, it reaches an overall 84.0% success rate, with a striking +26 percentage‑point gain on long‑horizon tasks. Additional robustness tests show stable performance under varied backgrounds, lighting, distractors, and object variations.

Ablation studies confirm the importance of each component: removing the memory bank, disabling retrieval, or omitting gated fusion each degrades performance by 5–12 percentage points. The analysis also highlights sensitivity to the memory size L and consolidation frequency, suggesting avenues for adaptive memory management.

Limitations include reliance on RGB‑text inputs (no tactile or force feedback) and the need for hand‑tuned hyper‑parameters for different task domains. Future work aims to incorporate multimodal sensors, meta‑learning for dynamic memory updates, and scaling to larger real‑world datasets.

In summary, MemoryVLA demonstrates that embedding human‑inspired working and episodic memory mechanisms into VLA architectures dramatically improves temporal reasoning and long‑horizon manipulation, establishing a new direction for memory‑augmented robot learning.

