Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation
Reading time: 5 minutes
...
📝 Original Info
Title: Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation
ArXiv ID: 2512.20188
Date: 2025-12-23
Authors: Astribot Team (astribot ai@astribot.com); the full author list is available in the Contributions and Acknowledgments section.
📝 Abstract
Most Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals, yet both typically run at a single unified frequency. As a result, policy performance is constrained by the low inference speed of large VLMs. This mandatory synchronous execution severely limits control stability and real-time performance in whole-body robotic manipulation, which involves more joints, larger motion spaces, and dynamically changing views. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS), organizing the system into a fast pathway for high-frequency action generation and a slow pathway for rich VLM reasoning. The system is characterized by two key features. First, a latent representation buffer bridges the slow and fast systems. It stores instruction semantics and action-reasoning representation aligned with the scene-instruction context, providing high-level guidance to the fast pathway. Second, a whole-body action tokenizer provides a compact, unified representation of whole-body actions. Importantly, the VLM and action expert are still jointly trained end-to-end, preserving unified policy learning while enabling asynchronous execution. DuoCore-FS supports a 3B-parameter VLM while achieving 30 Hz whole-body action-chunk generation, approximately three times as fast as prior VLA models with comparable model sizes. Real-world whole-body manipulation experiments demonstrate improved task success rates and significantly enhanced responsiveness compared to synchronous Fast-Slow VLA baselines. The implementation of DuoCore-FS, including training, inference, and deployment, is provided to commercial users by Astribot as part of the Astribot robotic platform.
💡 Deep Analysis
📄 Full Content
Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation
Astribot Team
astribot ai@astribot.com
Full Author List in Contributions and Acknowledgments
Abstract
Most Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals, yet both typically run at a single unified frequency. As a result, policy performance is constrained by the low inference speed of large VLMs. This mandatory synchronous execution severely limits control stability and real-time performance in whole-body robotic manipulation, which involves more joints, larger motion spaces, and dynamically changing views. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS), organizing the system into a fast pathway for high-frequency action generation and a slow pathway for rich VLM reasoning. The system is characterized by two key features. First, a latent representation buffer bridges the slow and fast systems. It stores instruction semantics and action-reasoning representation aligned with the scene-instruction context, providing high-level guidance to the fast pathway. Second, a whole-body action tokenizer provides a compact, unified representation of whole-body actions. Importantly, the VLM and action expert are still jointly trained end-to-end, preserving unified policy learning while enabling asynchronous execution. DuoCore-FS supports a 3B-parameter VLM while achieving 30 Hz whole-body action-chunk generation, approximately three times as fast as prior VLA models with comparable model sizes. Real-world whole-body manipulation experiments demonstrate improved task success rates and significantly enhanced responsiveness compared to synchronous Fast-Slow VLA baselines. The implementation of DuoCore-FS, including training, inference, and deployment, is provided to commercial users by Astribot as part of the Astribot robotic platform.
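To make the role of the latent representation buffer concrete, here is a minimal sketch of how such a bridge between the slow and fast pathways could look. All names (LatentPacket, LatentBuffer, publish, latest) and field choices are illustrative assumptions, not the DuoCore-FS API; the sketch only shows the pattern the abstract implies, where the slow pathway writes latents and the fast pathway reads the newest ones without blocking.

```python
# Illustrative latent representation buffer (hypothetical names; not the
# DuoCore-FS implementation).
import threading
import time
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class LatentPacket:
    """One slow-pathway update: instruction semantics plus action-reasoning latents."""
    instruction_embedding: np.ndarray   # pooled semantics of the language instruction
    reasoning_latents: np.ndarray       # VLM latents aligned with the scene-instruction context
    timestamp: float                    # when the slow pathway produced this packet


class LatentBuffer:
    """Single-slot buffer: the slow pathway overwrites it at a low rate,
    while the fast pathway reads the newest packet without waiting on the VLM."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: Optional[LatentPacket] = None

    def publish(self, instruction_embedding: np.ndarray, reasoning_latents: np.ndarray) -> None:
        packet = LatentPacket(instruction_embedding, reasoning_latents, time.time())
        with self._lock:
            self._latest = packet

    def latest(self) -> Optional[LatentPacket]:
        with self._lock:
            return self._latest
```

In this sketch a single overwrite slot is used rather than a queue, so the fast pathway always acts on the freshest high-level guidance instead of working through stale packets.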
1 Introduction
Vision–language–action (VLA) [1, 2, 3] models have recently attracted substantial attention in general-purpose robotic manipulation, as they provide a unified framework for learning policies that take visual observations and linguistic instructions as inputs and generate corresponding action poses as outputs. To mitigate the limitations caused by the scarcity of robotic data, some works [4, 5] directly leverage VLMs pretrained on internet-scale image–text corpora, extending the auto-regressive next-token prediction paradigm to generate action tokens. This strategy transfers pretrained cross-modal knowledge to robot policies, aligning action outputs with visual and linguistic signals.
While autoregressive VLA methods have shown promise, the sequential, token-by-token generation of discrete action tokens often suffers from low inference speed and limited action accuracy. To enable the generation of high-frequency continuous action chunks, a dual-system VLA architecture has been proposed in several works [6, 7], inspired by human cognitive processing [8]. The dual-system VLA architecture jointly fine-tunes a VLM and a lightweight diffusion-based action expert, where the VLM acts as a slow, deliberative reasoning module and the action expert provides fast, reactive control.
In most dual-system VLA implementations, the VLM and action expert still operate synchronously, so the fast module's execution frequency is ultimately constrained by the slow VLM's inference speed. This limitation becomes increasingly severe as VLA systems [5, 9, 10, 11] adopt much larger VLM backbones, further widening the gap between reasoning latency and the control-rate requirements of real-world manipulation.
To address this bottleneck, several recent works [12, 13, 14] introduce asynchronous dual-system VLA, allowing the two subsystems to run at independent frequencies rather than being locked to the VLM's slow update rate. In this formulation, the slow system module performs infrequent high-level deliberation, while the fast system module updates at a much higher frequency to generate low-level actions for real-time control, as previously described in [15].
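As an illustration of this formulation, the following sketch runs a slow reasoning loop and a fast control loop in parallel threads at independent rates; the fast loop always consumes the most recent latents rather than waiting for the next VLM update. The 30 Hz target follows the rate reported for DuoCore-FS, but the function names (vlm_infer, action_expert, send_chunk) and loop structure are assumptions for illustration, not the paper's implementation.

```python
# Illustrative asynchronous fast-slow execution (hypothetical function names;
# not the DuoCore-FS implementation).
import threading
import time

_latest_latents = None              # most recent slow-pathway output
_latents_lock = threading.Lock()


def slow_loop(vlm_infer, get_observation, stop):
    """Slow pathway: infrequent VLM reasoning over the scene and instruction."""
    global _latest_latents
    while not stop.is_set():
        latents = vlm_infer(get_observation())   # expensive, runs at a few Hz at most
        with _latents_lock:
            _latest_latents = latents


def fast_loop(action_expert, get_observation, get_proprio, send_chunk, stop, rate_hz=30.0):
    """Fast pathway: high-frequency whole-body action-chunk generation."""
    period = 1.0 / rate_hz
    while not stop.is_set():
        t_start = time.monotonic()
        with _latents_lock:
            latents = _latest_latents            # never blocks on the VLM
        if latents is not None:
            chunk = action_expert(latents, get_observation(), get_proprio())
            send_chunk(chunk)                    # whole-body action chunk to the controller
        time.sleep(max(0.0, period - (time.monotonic() - t_start)))


# Usage: start both loops as daemon threads sharing a stop event, e.g.
#   stop = threading.Event()
#   threading.Thread(target=slow_loop, args=(vlm, camera_read, stop), daemon=True).start()
#   threading.Thread(target=fast_loop, args=(expert, camera_read, proprio_read, ctrl_send, stop), daemon=True).start()
```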
Our approach differs from existing asynchronous dual-system frameworks such as FiS-VLA [14], OpenHelix [15], Helix [12], and Hume [13] in how the slow and fast subsystems interact during inference or training. In our design, the slow system and the fast system run fully in parallel: the slow module engages in multi-faceted reasoning at a low frequency, e.g., generating discrete action tokens and providing high-level latent representations to guide the fast system, while the fast module combines the latest latent representations from the slow module with real-time visual observations and proprioceptive states to generate high-frequency continuous actions. In contrast, FiS-VLA and OpenHelix do not perform parallel inference;