📝 Original Info
- Title: ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge
- ArXiv ID: 2512.20276
- Date: 2026-02-23
- Authors: Yuntao Dai, Teng Wang, Hang Gu, Qianyu Cheng, Yifei Zheng, Zhiyong Qiu, Lei Gong, Wenqi Lou, Xuehai Zhou
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20–30 Hz, current VLA models typically operate at only 3–5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms. At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55× improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.
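The core scheduling idea can be illustrated with a toy model: treat each control time step as a micro-request with one compute-bound Prefill phase and several memory-bound Decode phases, and co-batch the Prefill of the next request with the Decodes of the current one. This is a minimal sketch under that simplified assumption; the class and function names are illustrative, not the paper's actual API.

```python
from collections import deque

class MicroRequest:
    """One VLA time step: a Prefill phase (compute-bound, processes new
    image/instruction tokens) followed by several Decode phases
    (memory-bound, one per action token)."""
    def __init__(self, step_id, num_action_tokens):
        self.step_id = step_id
        self.decode_left = num_action_tokens
        self.prefilled = False

def cross_request_schedule(steps, num_action_tokens=7):
    """Toy scheduler: co-batch the Decode phases of request t with the
    Prefill phase of request t+1, mimicking cross-request pipelining."""
    pending = deque(MicroRequest(t, num_action_tokens) for t in range(steps))
    schedule = []
    active = None  # the request currently in its Decode phases
    while pending or active:
        batch = []
        if pending and not pending[0].prefilled:
            batch.append(("prefill", pending[0].step_id))  # compute-bound work
            pending[0].prefilled = True
        if active:
            batch.append(("decode", active.step_id))       # memory-bound work
            active.decode_left -= 1
            if active.decode_left == 0:
                active = None
        if active is None and pending and pending[0].prefilled:
            active = pending.popleft()                     # promote next request
        schedule.append(tuple(batch))
    return schedule
```

Running `cross_request_schedule(2, num_action_tokens=2)` shows the key property: after the first step, every Prefill is issued in the same batch as a Decode of the previous request, so the compute-bound and memory-bound phases overlap instead of serializing.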
💡 Deep Analysis
Deep Dive into ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge.
📄 Full Content
ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge

Yuntao Dai (School of Computer Science and Technology, University of Science and Technology of China, Hefei, China)
Teng Wang∗ (Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China; wangteng@ustc.edu.cn)
Hang Gu (School of Computer Science and Technology, University of Science and Technology of China, Hefei, China)
Qianyu Cheng (School of Computer Science and Technology, University of Science and Technology of China, Hefei, China)
Yifei Zheng (School of Computer Science and Technology, University of Science and Technology of China, Hefei, China)
Zhiyong Qiu (IEIT SYSTEMS Co., Ltd., Beijing, China)
Lei Gong (School of Computer Science and Technology, University of Science and Technology of China, Hefei, China)
Wenqi Lou (Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China)
Xuehai Zhou (School of Computer Science and Technology, University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China)
Abstract
Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20–30 Hz, current VLA models typically operate at only 3–5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms. At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55× improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.
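The Unified KV Ring Buffer named in the abstract can be sketched as a fixed, preallocated KV region reused across time steps with wrap-around writes, so memory is never re-allocated or fragmented between requests. The class name and layout below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class UnifiedKVRingBuffer:
    """Toy KV cache: one preallocated dense region shared by successive
    requests; new K/V rows are written at a wrapping cursor so the cache
    stays a single dense block across time steps."""
    def __init__(self, capacity, head_dim):
        self.k = np.zeros((capacity, head_dim), dtype=np.float32)
        self.v = np.zeros((capacity, head_dim), dtype=np.float32)
        self.capacity = capacity
        self.cursor = 0   # next slot to write
        self.size = 0     # number of valid entries (<= capacity)

    def append(self, k_row, v_row):
        self.k[self.cursor] = k_row
        self.v[self.cursor] = v_row
        self.cursor = (self.cursor + 1) % self.capacity  # wrap around
        self.size = min(self.size + 1, self.capacity)

    def valid_kv(self):
        """Return the valid K/V entries in insertion order."""
        if self.size < self.capacity:
            idx = np.arange(self.size)
        else:
            idx = (np.arange(self.capacity) + self.cursor) % self.capacity
        return self.k[idx], self.v[idx]
```

An attention kernel would then read `buf.valid_kv()` as one contiguous gather rather than chasing per-request cache fragments, which is the "fuse fragmented memory operations into dense computations" idea in miniature.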
∗Corresponding author.
CCS Concepts
• Computer systems organization → Robotics; • Computing methodologies → Artificial intelligence.
Keywords
VLA, Acceleration, Embodied Robot
1 Introduction
Vision-Language-Action (VLA) models represent a paradigm shift in embodied intelligence, unifying visual perception, language instructions, and action outputs within a single, end-to-end sequence modeling framework. By discarding the traditional, modular "sense-plan-act" pipeline, VLAs model the relationship between world states and future actions directly [21]. This approach has unlocked transformative capabilities: these models exhibit emergent generalization to unseen objects and scenarios, comprehend complex and abstract natural language, and successfully execute long-horizon, multi-step tasks. These breakthroughs are actively driving exploration in applications from home assistant robots to autonomous industrial manipulation [15].
Despite their conceptual power, the practical deployment of VLAs is severely constrained by high computational latency. It is widely accepted in robotics that control loops involving dynamic interactions, such as obstacle avoidance or human-robot collaboration, require a frequency of at least 10 Hz to ensure stability. Frequencies of 20–30 Hz are considered ideal for fluid and safe interaction [3, 24]. Latency below this threshold makes a robot "sluggish," unable to react to environmental changes (e.g., grasping a moving object or avoiding a sudden obstacle) and can even lead to control-system oscillations.

[Figure 1: The process of a Vision-Language-Action model. A language instruction (e.g., "open the middle drawer of the cabinet") is tokenized into text tokens and the vision input is encoded by a ViT into vision tokens; the LLM then autoregressively decodes an action sequence ACTION = {a0, a1, ..., an}, where each action specifies Δx, Δθ, and ΔGrip for the embodied robot platform.]
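The autoregressive loop depicted in Figure 1 can be sketched as follows; the encoder, decoder, and de-tokenizer are passed in as stub callables, since this is a schematic of the control-loop structure rather than a real model.

```python
# Toy version of the Figure 1 loop: the vision and text tokens form the
# prefill context, then the LLM decodes one discrete action token at a
# time (each decode step is a separate memory-bound forward pass), and
# the tokens are finally de-tokenized into a continuous action.
def vla_step(image_tokens, text_tokens, llm_decode_one, detokenize,
             n_actions=7):
    context = list(image_tokens) + list(text_tokens)  # Prefill input
    action_tokens = []
    for _ in range(n_actions):            # one Decode pass per action token
        tok = llm_decode_one(context + action_tokens)
        action_tokens.append(tok)
    return detokenize(action_tokens)      # -> e.g. (Δx, Δθ, ΔGrip)
```

Because each of the `n_actions` decode passes reads the entire weight matrix to produce a single token, the loop is memory-bandwidth-bound, which is why per-step latency dominates on edge GPUs.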
However, mainstream VLAs, which rely on autoregressive inference, as shown in Figure 1, typically operate at a mere 3–5 frames per second (FPS) [12]. This performance gap is exacerbated on resource-constrained edge devices; for instance, the 7B-parameter OpenVLA model achieves only 3 FPS on a Jetson AGX Orin platform, even with INT4 quantization. This massive performance gap (3 FPS vs. 30 Hz) renders VLAs unsuitable
…(Full text truncated)…
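As a back-of-envelope check on the 3 FPS vs. 30 Hz gap described above (the 3 FPS and 2.55× figures come from the text; everything else is simple arithmetic):

```python
# Latency budget for the control loop vs. measured VLA throughput.
target_hz = 30                      # fluid-interaction control rate
budget_ms = 1000 / target_hz        # per-step budget at 30 Hz
measured_fps = 3                    # OpenVLA-7B on Jetson AGX Orin (INT4)
measured_ms = 1000 / measured_fps   # actual per-step latency
gap = measured_ms / budget_ms       # how far off the budget we are
speedup = 2.55                      # ActionFlow's reported FPS gain
print(f"budget {budget_ms:.1f} ms, measured {measured_ms:.0f} ms, "
      f"gap {gap:.0f}x, with ActionFlow {measured_fps * speedup:.2f} FPS")
```

This makes the framing concrete: a 2.55× speedup lifts OpenVLA-7B to roughly 7.65 FPS, a large step toward, though still short of, the 20–30 Hz ideal, which is consistent with the paper positioning ActionFlow as a system-level contribution rather than a full closure of the gap.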
Reference
This content is AI-processed based on ArXiv data.