VLM-Guided Experience Replay


Recent advances in Large Language Models (LLMs) and Vision-Language Models (VLMs) have enabled powerful semantic and multimodal reasoning capabilities, creating new opportunities to enhance sample efficiency, high-level planning, and interpretability in reinforcement learning (RL). While prior work has integrated LLMs and VLMs into various components of RL, the replay buffer, a core component for storing and reusing experiences, remains unexplored. We propose addressing this gap by leveraging VLMs to guide the prioritization of experiences in the replay buffer. Our key idea is to use a frozen, pre-trained VLM (requiring no fine-tuning) as an automated evaluator to identify and prioritize promising sub-trajectories from the agent’s experiences. Across game-playing and robotics scenarios spanning both discrete and continuous domains, agents trained with our proposed prioritization method achieve 11-52% higher average success rates and improve sample efficiency by 19-45% compared to previous approaches. https://esharony.me/projects/vlm-rb/


💡 Research Summary

Background and Motivation
Reinforcement learning (RL) has achieved impressive results across robotics, language, and logistics, yet it remains notoriously sample‑inefficient, especially in sparse‑reward, long‑horizon tasks. Off‑policy methods mitigate this by reusing past experience stored in a replay buffer, but the key open question is how to select the most informative transitions. Prioritized Experience Replay (PER) uses the magnitude of the temporal‑difference (TD) error as a proxy for learning value, but TD error lacks semantic awareness: it cannot distinguish whether a transition represents genuine progress toward a goal or merely a noisy fluctuation. This limitation is acute in tasks such as robotic manipulation, where critical actions (e.g., unlocking a door) may generate tiny TD errors early on, while visually striking but irrelevant motions can produce large TD errors.

Proposed Method: VLM‑Guided Experience Replay (VLM‑RB)
The authors introduce VLM‑RB, a plug‑and‑play framework that leverages a frozen, pre‑trained Vision‑Language Model (VLM) to assign semantic scores to sub‑trajectories. The pipeline consists of three stages:

  1. Scoring – As the agent interacts with the environment, consecutive transitions are grouped into short video clips (e.g., 32 frames). Each clip, together with a generic natural‑language prompt (e.g., “Does this clip contain meaningful behavior?”), is fed to a VLM (Perception‑LM 1B). The VLM returns a scalar score; the authors adopt a binary indicator (1 = meaningful, 0 = irrelevant). Scoring is performed asynchronously, so policy updates are never blocked by VLM inference latency.

  2. Prioritization – The binary scores are propagated to every transition inside the corresponding clip. All transitions marked as “meaningful” receive uniform probability mass, while the rest receive zero. This yields a priority distribution (q_P) that is essentially a mask over the buffer.

  3. Sampling – Sampling purely from (q_P) would discard a large portion of the collected data and could impair exploration. VLM‑RB therefore mixes the VLM‑derived distribution with uniform replay:

    (q = λ q_P + (1 − λ) q_U)

    where (q_U) is the uniform distribution over the buffer and (λ ∈ [0, 1]) controls how strongly the VLM scores shape sampling.
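The prioritization and sampling stages above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: the function names, the mixing coefficient `lam`, and the fallback when no clip is flagged are assumptions for the sketch.

```python
import numpy as np

def mixed_sampling_probs(meaningful_mask, lam=0.5):
    """Mix the VLM-derived priority distribution q_P with uniform replay q_U.

    meaningful_mask: one boolean per buffer transition; True if the transition
    belongs to a clip the VLM scored as meaningful (step 2 propagates the
    clip-level score to every transition inside the clip).
    lam: illustrative mixing coefficient; the paper's exact symbol may differ.
    """
    mask = np.asarray(meaningful_mask, dtype=float)
    n = mask.size
    q_u = np.full(n, 1.0 / n)          # uniform distribution over the buffer
    if mask.sum() > 0:
        q_p = mask / mask.sum()        # uniform mass over "meaningful" transitions only
    else:
        q_p = q_u                      # assumed fallback: nothing flagged yet
    return lam * q_p + (1.0 - lam) * q_u

def sample_batch(meaningful_mask, batch_size, lam=0.5, rng=None):
    """Draw replay indices from the mixed distribution."""
    rng = np.random.default_rng() if rng is None else rng
    probs = mixed_sampling_probs(meaningful_mask, lam)
    return rng.choice(probs.size, size=batch_size, p=probs)

# Toy buffer of 8 transitions; the first 2 sit inside a "meaningful" clip.
mask = [True, True, False, False, False, False, False, False]
probs = mixed_sampling_probs(mask, lam=0.5)
# Flagged transitions: 0.5 * (1/2) + 0.5 * (1/8) = 0.3125 each;
# the rest keep the uniform share 0.5 * (1/8) = 0.0625.
batch = sample_batch(mask, batch_size=4)
```

With λ = 1 this degenerates to the pure mask over the buffer (and would starve unflagged transitions); with λ = 0 it recovers ordinary uniform replay, which is why the mixture is needed.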

