Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation

Reading time: 5 minutes
...

📝 Original Info

  • Title: Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation
  • ArXiv ID: 2512.20188
  • Date: 2025-12-23
  • Authors: Astribot Team (astribot ai@astribot.com); the full author list can be found in Contributions and Acknowledgments.

📝 Abstract

Most Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals, yet both typically run at a single unified frequency. As a result, policy performance is constrained by the low inference speed of large VLMs. This mandatory synchronous execution severely limits control stability and real-time performance in whole-body robotic manipulation, which involves more joints, larger motion spaces, and dynamically changing views. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS), organizing the system into a fast pathway for high-frequency action generation and a slow pathway for rich VLM reasoning. The system is characterized by two key features. First, a latent representation buffer bridges the slow and fast systems. It stores instruction semantics and action-reasoning representation aligned with the scene-instruction context, providing high-level guidance to the fast pathway. Second, a whole-body action tokenizer provides a compact, unified representation of whole-body actions. Importantly, the VLM and action expert are still jointly trained end-to-end, preserving unified policy learning while enabling asynchronous execution. DuoCore-FS supports a 3B-parameter VLM while achieving 30 Hz whole-body action-chunk generation, approximately three times as fast as prior VLA models with comparable model sizes. Real-world whole-body manipulation experiments demonstrate improved task success rates and significantly enhanced responsiveness compared to synchronous Fast-Slow VLA baselines. The implementation of DuoCore-FS, including training, inference, and deployment, is provided to commercial users by Astribot as part of the Astribot robotic platform.
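The whole-body action tokenizer named above is not detailed on this page. Purely to illustrate what mapping continuous whole-body actions into a compact discrete representation can look like, the sketch below uses uniform per-dimension binning; the class name, bin count, and toy action dimensionality are all hypothetical assumptions, and this is not the paper's tokenizer.

```python
import numpy as np

# Hypothetical illustration of action tokenization via uniform per-dimension binning.
# This is NOT the paper's whole-body action tokenizer; it only shows the general idea
# of mapping continuous whole-body actions to a compact discrete representation.

class UniformBinActionTokenizer:
    def __init__(self, low, high, num_bins=256):
        self.low = np.asarray(low, dtype=np.float32)     # per-dimension lower limits
        self.high = np.asarray(high, dtype=np.float32)   # per-dimension upper limits
        self.num_bins = num_bins

    def encode(self, action):
        """Map a continuous whole-body action vector to integer tokens."""
        a = np.clip(np.asarray(action, dtype=np.float32), self.low, self.high)
        normalized = (a - self.low) / (self.high - self.low)
        return np.minimum((normalized * self.num_bins).astype(np.int64), self.num_bins - 1)

    def decode(self, tokens):
        """Map integer tokens back to (approximate) continuous actions."""
        centers = (np.asarray(tokens) + 0.5) / self.num_bins
        return self.low + centers * (self.high - self.low)


if __name__ == "__main__":
    dims = 17  # toy whole-body action dimensionality, chosen arbitrarily for illustration
    tok = UniformBinActionTokenizer(low=[-1.0] * dims, high=[1.0] * dims)
    action = np.random.uniform(-1.0, 1.0, size=dims)
    tokens = tok.encode(action)
    recon = tok.decode(tokens)
    print("max reconstruction error:", float(np.abs(recon - action).max()))
```

With 256 bins per dimension the reconstruction error is bounded by half a bin width, which illustrates the compactness-versus-accuracy trade-off any action tokenizer has to balance.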

💡 Deep Analysis

Figure 1

📄 Full Content

1 Introduction

Vision-language-action (VLA) models [1, 2, 3] have recently attracted substantial attention in general-purpose robotic manipulation, as they provide a unified framework for learning policies that take visual observations and linguistic instructions as inputs and generate corresponding action poses as outputs. To mitigate the limitations caused by the scarcity of robotic data, some works [4, 5] directly leverage VLMs pretrained on internet-scale image-text corpora, extending the auto-regressive next-token prediction paradigm to generate action tokens. This strategy transfers pretrained cross-modal knowledge to robot policies, aligning action outputs with visual and linguistic signals.

While autoregressive VLA methods have shown promise, the sequential, token-by-token generation of discrete action tokens often suffers from low inference speed and limited action accuracy. To enable the generation of high-frequency continuous action chunks, several works [6, 7] have proposed a dual-system VLA architecture, inspired by human cognitive processing [8]. This architecture jointly fine-tunes a VLM and a lightweight diffusion-based action expert, where the VLM acts as a slow, deliberative reasoning module and the action expert provides fast, reactive control. In most dual-system VLA implementations, however, the VLM and action expert still operate synchronously, so the fast module's execution frequency is ultimately constrained by the slow VLM's inference speed.
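To make this bottleneck concrete, here is a back-of-envelope comparison with assumed latency figures (the numbers below are illustrative, not measurements reported in the paper): a synchronous dual system is gated by the combined VLM-plus-expert latency, while an asynchronous one is gated by the action expert alone.

```python
# Back-of-envelope comparison of synchronous vs. asynchronous scheduling.
# The latency figures are assumptions for illustration, not measurements from the paper.

vlm_latency_s = 0.070      # assumed per-step inference time of a multi-billion-parameter VLM
expert_latency_s = 0.033   # assumed per-chunk generation time of the lightweight action expert

# Synchronous dual-system: every action chunk waits for both modules in sequence.
sync_rate_hz = 1.0 / (vlm_latency_s + expert_latency_s)

# Asynchronous fast-slow: the fast pathway is gated only by the action expert,
# while VLM reasoning runs in parallel and refreshes its guidance less frequently.
async_rate_hz = 1.0 / expert_latency_s

print(f"synchronous chunk rate : {sync_rate_hz:5.1f} Hz")
print(f"asynchronous chunk rate: {async_rate_hz:5.1f} Hz "
      f"(~{async_rate_hz / sync_rate_hz:.1f}x faster under these assumptions)")
```

Under these assumed figures the asynchronous chunk rate lands near 30 Hz and roughly three times the synchronous rate, in line with the speedup the abstract reports.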
In particular, this limitation becomes increasingly severe as a recent trend has emerged in which VLA systems [5, 9, 10, 11] adopt much larger VLM backbones, further widening the gap between reasoning latency and the control-rate requirements of real-world manipulation. To address this bottleneck, several recent works [12, 13, 14] introduce asynchronous dual-system VLA, allowing the two subsystems to run at independent frequencies rather than being locked to the VLM's slow update rate. In this formulation, the slow-system module performs infrequent high-level deliberation, while the fast-system module updates at a much higher frequency to generate low-level actions for real-time control, as previously described in [15].

Our approach differs from existing asynchronous dual-system frameworks such as FiS-VLA [14], OpenHelix [15], Helix [12], and Hume [13] in how the slow and fast subsystems interact during inference or training. In our design, the slow system and fast system run fully in parallel: the slow module engages in multi-faceted reasoning at a low frequency, e.g., generating discrete action tokens and providing high-level latent representations to guide the fast system, while the fast module combines the latest latent representations from the slow module with real-time visual observations and proprioceptive states to generate high-frequency continuous actions. In contrast, FiS-VLA and OpenHelix do not perform parallel inference;
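The parallel interaction described above can be sketched as a single-slot latent buffer written by a slow loop and read by a fast loop. This is a minimal sketch under assumed names and rates, not the DuoCore-FS implementation:

```python
import threading
import time
from dataclasses import dataclass, field

# Minimal sketch of the asynchronous fast-slow pattern. Class names, rates, and the
# dummy "models" are illustrative assumptions, not the DuoCore-FS implementation.

@dataclass
class LatentPacket:
    """Latest high-level guidance produced by the slow (VLM) pathway."""
    semantics: list
    action_latents: list
    stamp: float = field(default_factory=time.time)


class LatentBuffer:
    """Single-slot, last-writer-wins buffer bridging the slow and fast pathways."""

    def __init__(self):
        self._lock = threading.Lock()
        self._packet = None

    def write(self, packet):
        with self._lock:
            self._packet = packet      # overwrite: only the newest guidance is kept

    def read_latest(self):
        with self._lock:
            return self._packet        # the fast loop never blocks on VLM inference


def slow_pathway(buf, stop, rate_hz=2.0):
    """Low-frequency loop standing in for heavy VLM reasoning over images + instruction."""
    step = 0
    while not stop.is_set():
        time.sleep(1.0 / rate_hz)      # stands in for slow, large-model inference
        buf.write(LatentPacket(semantics=[step], action_latents=[0.1 * step]))
        step += 1


def fast_pathway(buf, stop, rate_hz=30.0):
    """High-frequency loop standing in for the lightweight action expert."""
    while not stop.is_set():
        pkt = buf.read_latest()
        if pkt is not None:
            # In the real system this step would fuse the latest latents with fresh
            # visual observations and proprioception to produce a whole-body action chunk.
            action_chunk = [x + 0.01 for x in pkt.action_latents]
            _ = action_chunk           # placeholder for handing the chunk to the controller
        time.sleep(1.0 / rate_hz)


if __name__ == "__main__":
    buf, stop = LatentBuffer(), threading.Event()
    workers = [threading.Thread(target=slow_pathway, args=(buf, stop)),
               threading.Thread(target=fast_pathway, args=(buf, stop))]
    for w in workers:
        w.start()
    time.sleep(2.0)                    # let both loops run briefly, at independent rates
    stop.set()
    for w in workers:
        w.join()
```

The key property is that read_latest() never blocks on VLM inference, so the fast loop's rate is set only by the action expert, while stale guidance is simply overwritten the next time the slow loop finishes.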

📸 Image Gallery

anomaly_case.png co_training.jpg cup_scoop_in_distribution.png framework.png language_following_compare.png logo1.png popcorn_pipeline.png

Reference

This content is AI-processed based on open access ArXiv data.
