StreamVLA: Breaking the Reason-Act Cycle via Completion-State Gating
Long-horizon robotic manipulation requires bridging the gap between high-level planning (System 2) and low-level control (System 1). Current Vision-Language-Action (VLA) models often entangle these processes, performing redundant multimodal reasoning at every timestep, which leads to high latency and goal instability. To address this, we present StreamVLA, a dual-system architecture that unifies textual task decomposition, visual goal imagination, and continuous action generation within a single parameter-efficient backbone. We introduce a “Lock-and-Gated” mechanism to modulate computation intelligently: the model triggers slow thinking only when a sub-task transition is detected, generating a textual instruction and imagining the specific visual completion state rather than generic future frames. Crucially, this completion state serves as a time-invariant goal anchor, making the policy robust to execution speed variations. During steady execution, these high-level intents are locked to condition a Flow Matching action head, allowing the model to bypass expensive autoregressive decoding for 72% of timesteps. This hierarchical abstraction ensures sub-goal focus while significantly reducing inference latency. Extensive evaluations demonstrate that StreamVLA achieves state-of-the-art performance, with a 98.5% success rate on the LIBERO benchmark and robust recovery in real-world interference scenarios, achieving a 48% reduction in latency compared to full-reasoning baselines.
💡 Research Summary
StreamVLA introduces a dual‑system architecture that cleanly separates high‑level deliberative planning (System 2) from low‑level reactive control (System 1) within a single, parameter‑efficient backbone. The core innovation is a “Lock‑and‑Gated” mechanism that activates the heavy multimodal reasoning components only when a sub‑task transition is detected. During the slow‑thinking phase, the model generates a textual sub‑goal and, crucially, a visual completion state—an imagined image representing the exact end condition of the current sub‑task. This completion image is time‑invariant, providing a stable visual anchor that remains valid regardless of execution speed fluctuations.
A lightweight gating module continuously compares the current observation (typically a head‑mounted camera view) with the locked completion image using cross‑attention to compute a discrepancy score. If the score exceeds a preset threshold, the system assumes the sub‑task is still in progress and enters “Skip Mode,” bypassing the autoregressive reasoning heads and reusing the cached sub‑goal and completion image to condition a Flow Matching action head. This head, based on Conditional Flow Matching, produces continuous action chunks at high frequency with minimal latency. When the discrepancy falls below the threshold, indicating that the visual goal has been reached, “Full Mode” is triggered: the system re‑engages the autoregressive heads to produce a new textual instruction and a new completion image for the next sub‑task.
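The summary does not give the gating implementation, so the sketch below is a minimal illustration of the Skip/Full control loop: it substitutes a mean-pixel-difference score for the cross-attention discrepancy module, and `slow_think`, `flow_matching_head`, and `THRESHOLD` are all hypothetical stand-ins rather than names from the paper.

```python
import numpy as np

THRESHOLD = 0.15  # hypothetical gating threshold (not specified in the summary)

def discrepancy(obs, completion_img):
    """Stand-in for the cross-attention discrepancy score.
    Here: mean absolute pixel difference. High score => sub-task still in progress."""
    return float(np.mean(np.abs(obs - completion_img)))

def slow_think(obs):
    """Toy stand-in for the autoregressive reasoning heads (Full Mode)."""
    return {"subgoal": "reach target", "completion_img": np.ones_like(obs)}

def flow_matching_head(obs, subgoal, completion_img):
    """Toy stand-in for the Flow Matching action head: a 7-DoF action chunk."""
    return np.zeros(7)

def step(obs, cache):
    """One control step of the (hypothetical) Lock-and-Gated loop."""
    if cache is None or discrepancy(obs, cache["completion_img"]) <= THRESHOLD:
        # Full Mode: goal reached (or no goal yet) -> re-plan a new sub-goal
        # and a new completion image, then lock them.
        cache, mode = slow_think(obs), "full"
    else:
        # Skip Mode: sub-task still in progress -> reuse the locked cache.
        mode = "skip"
    action = flow_matching_head(obs, cache["subgoal"], cache["completion_img"])
    return action, cache, mode

obs = np.zeros((4, 4))
action, cache, mode = step(obs, None)            # no locked goal yet
print(mode)                                      # -> "full"
_, cache, mode = step(np.full((4, 4), 0.5), cache)  # far from completion image
print(mode)                                      # -> "skip"
```

In both modes the action head runs every step; only the expensive reasoning heads are gated, which is what makes the fast pathway cheap.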
Unlike conventional video‑prediction approaches that generate future frames at fixed time offsets (t + Δt), StreamVLA’s imagination head directly predicts the final successful state of the sub‑task. This eliminates temporal misalignment issues and yields a robust goal representation that guides both planning and control. The imagination head leverages the Infinity architecture’s bit‑wise autoregressive modeling, sharing most parameters with the backbone and using KV‑caching to keep inference costs low.
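The summary names the action head only as Conditional Flow Matching; the sketch below shows the standard CFM recipe such a head would typically follow, assuming a linear interpolant between noise and the target action chunk with velocity target x1 − x0, and Euler integration at inference. Function names and the step count are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_pair(x1, rng):
    """One conditional flow matching training example for target action x1:
    sample noise x0 and time t, form the interpolant
    x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0."""
    x0 = rng.standard_normal(x1.shape)   # noise sample
    t = rng.uniform()                    # random time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0                   # regression target for the velocity net
    return xt, t, v_target

def generate(v_net, dim, steps=10, rng=rng):
    """Euler integration of a learned velocity field from noise to an action."""
    x = rng.standard_normal(dim)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * v_net(x, i * dt)
    return x

# Usage: a training pair for a toy 3-DoF action, and sampling with a dummy net.
x1 = np.array([0.2, -0.1, 0.4])
xt, t, v = cfm_training_pair(x1, rng)
action = generate(lambda x, t: np.zeros_like(x), dim=3)
```

Because sampling is a handful of Euler steps rather than token-by-token decoding, this kind of head is what lets Skip Mode emit continuous action chunks at high frequency.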
Empirical evaluation on the LIBERO long‑horizon manipulation benchmark shows a 98.5% success rate, surpassing prior state‑of‑the‑art VLA models. In dynamic interference scenarios (e.g., on RoboTwin 2.0), the system demonstrates reliable recovery from disturbances. Importantly, StreamVLA reduces average inference latency from 244 ms to 128 ms—a 48% improvement—while spending only 28% of timesteps in the expensive full‑reasoning mode; the remaining 72% are handled by the fast‑action pathway.
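These figures are internally consistent, as a quick check shows. The implied fast-path latency below assumes Full Mode steps still cost roughly the 244 ms baseline, which is an assumption on our part since per-mode timings are not reported.

```python
# Reported figures: 244 ms full-reasoning baseline, 128 ms average,
# and 28% of timesteps spent in Full Mode.
full_ms, avg_ms, full_frac = 244.0, 128.0, 0.28

reduction = (full_ms - avg_ms) / full_ms
print(f"{reduction:.0%}")   # -> 48%, matching the reported improvement

# Implied Skip Mode latency, assuming Full Mode steps cost ~244 ms each:
skip_ms = (avg_ms - full_frac * full_ms) / (1.0 - full_frac)
print(f"{skip_ms:.0f} ms")  # -> 83 ms
```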
Overall, StreamVLA provides a practical solution to the longstanding trade‑off between deep multimodal reasoning and real‑time control in robotic manipulation. By dynamically gating computation based on self‑generated visual foresight, it achieves both high‑level planning capability and low‑latency execution, paving the way for more capable and deployable generalist robot agents.