ActionFlow: An Ultra-Fast Vision-Language-Action Inference Framework for Edge Robots

Reading time: 6 minutes
...

📝 Original Info

  • Title: ActionFlow: An Ultra-Fast Vision-Language-Action Inference Framework for Edge Robots
  • ArXiv ID: 2512.20276
  • Date: 2025-12-23
  • Authors: Yuntao Dai, Teng Wang, Hang Gu, Qianyu Cheng, Yifei Zheng, Zhiyong Qiu, Lei Gong, Wenqi Lou, Xuehai Zhou

📝 Abstract

Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20-30 Hz, current VLA models typically operate at only 3-5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms. At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55× improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.
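The core scheduling idea, overlapping a memory-bound Decode phase with the next request's compute-bound Prefill, can be illustrated with a toy simulation. Everything below (the function names, the phase model of one Prefill followed by several Decode steps per micro-request, and the in-flight cap of two requests) is an illustrative assumption, not the paper's actual scheduler:

```python
# Toy simulation of Cross-Request Pipelining. The phase model (one
# compute-bound Prefill, then several memory-bound Decode steps per
# micro-request), the in-flight cap, and all names are illustrative
# assumptions, not the paper's implementation.

def serial_schedule(num_requests, decode_steps):
    """Baseline: every phase of every request is its own kernel launch."""
    batches = []
    for rid in range(num_requests):
        batches.append([(rid, "prefill")])
        for _ in range(decode_steps):
            batches.append([(rid, "decode")])
    return batches

def pipelined_schedule(num_requests, decode_steps, max_inflight=2):
    """Fuse the next request's Prefill into a batch with in-flight Decodes."""
    batches = []
    next_req = 0
    inflight = []  # each entry: [request_id, decode_steps_remaining]
    while next_req < num_requests or inflight:
        batch = []
        for state in inflight:                    # memory-bound work
            batch.append((state[0], "decode"))
            state[1] -= 1
        if next_req < num_requests and len(inflight) < max_inflight:
            batch.append((next_req, "prefill"))   # compute-bound work, fused in
            inflight.append([next_req, decode_steps])
            next_req += 1
        inflight = [s for s in inflight if s[1] > 0]
        batches.append(batch)
    return batches
```

Under this toy model, 3 requests of 4 decode steps each take 15 launches serially but only 10 pipelined, and two of the pipelined batches mix a Prefill with a Decode, which is exactly where compute- and memory-bound work overlap on the hardware.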

💡 Deep Analysis

Deep Dive into ActionFlow: An Ultra-Fast Vision-Language-Action Inference Framework for Edge Robots.


📄 Full Content

ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge

Yuntao Dai (School of Computer Science and Technology, University of Science and Technology of China, Hefei, China); Teng Wang∗ (Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China; wangteng@ustc.edu.cn); Hang Gu, Qianyu Cheng, and Yifei Zheng (School of Computer Science and Technology, University of Science and Technology of China, Hefei, China); Zhiyong Qiu (IEIT SYSTEMS Co., Ltd., Beijing, China); Lei Gong (School of Computer Science and Technology, University of Science and Technology of China, Hefei, China); Wenqi Lou (Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China); Xuehai Zhou (School of Computer Science and Technology and Suzhou Institute for Advanced Research, University of Science and Technology of China)

Abstract

Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20–30 Hz, current VLA models typically operate at only 3–5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms.
At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55× improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.

∗Corresponding author.

CCS Concepts: • Computer systems organization → Robotics; • Computing methodologies → Artificial intelligence.

Keywords: VLA, Acceleration, Embodied Robot

1 Introduction

Vision-Language-Action (VLA) models represent a paradigm shift in embodied intelligence, unifying visual perception, language instructions, and action outputs within a single, end-to-end sequence-modeling framework. By discarding the traditional, modular "sense-plan-act" pipeline, VLAs model the relationship between world states and future actions directly [21]. This approach has unlocked transformative capabilities. These models exhibit emergent generalization to unseen objects and scenarios, comprehend complex and abstract natural language, and successfully execute long-horizon, multi-step tasks. These breakthroughs are actively driving exploration in applications from home assistant robots to autonomous industrial manipulation [15]. Despite their conceptual power, the practical deployment of VLAs is severely constrained by high computational latency.
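To make the latency constraint concrete, here is the back-of-envelope budget implied by the frequencies the paper quotes (20-30 Hz target, roughly 3 FPS observed on edge hardware); the helper name is ours, not the paper's:

```python
# Control-period budget implied by the frequencies cited in the paper:
# 20-30 Hz for fluid interaction vs. the ~3 FPS observed for OpenVLA-7B
# on edge devices. Helper name is illustrative.

def period_ms(hz):
    """Milliseconds available per control step at a given loop frequency."""
    return 1000.0 / hz

target_budget = period_ms(30)            # ~33.3 ms per action at the ideal 30 Hz
observed_latency = period_ms(3)          # ~333.3 ms per action at 3 FPS
gap = observed_latency / target_budget   # roughly a 10x shortfall
```

In other words, the entire prefill plus full autoregressive decode of an action chunk must finish in about 33 ms, while current edge inference spends about ten times that.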
It is widely accepted in robotics that control loops involving dynamic interactions—such as obstacle avoidance or human-robot collaboration—require a frequency of at least 10 Hz to ensure stability. Frequencies of 20–30 Hz are considered ideal for fluid and safe interaction [3, 24]. Latency below this threshold makes a robot "sluggish," unable to react to environmental changes (e.g., grasping a moving object or avoiding a sudden obstacle), and can even lead to control-system oscillations.

arXiv:2512.20276v1 [cs.AI] 23 Dec 2025

[Figure 1: The process of Vision Language Model/Action. A vision input and a language instruction (e.g., "open the middle drawer of the cabinet") are tokenized via a ViT and text tokenizer; the LLM then autoregressively decodes action tokens ACTION = {a0, a1, ..., an} covering Δx, Δθ, and ΔGrip for embodied robot platforms.]

However, mainstream VLAs, which rely on autoregressive inference, as shown in Figure 1, typically operate at a mere 3–5 frames per second (FPS) [12]. This performance gap is exacerbated on resource-constrained edge devices; for instance, the 7B-parameter OpenVLA model achieves only 3 FPS on a Jetson AGX Orin platform, even with INT4 quantization. This massive performance gap (3 FPS vs. 30 Hz) renders VLAs unsuitable
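The role of the Unified KV Ring Buffer mentioned in the abstract, replacing per-request KV cache allocations with one preallocated circular arena, can be sketched as follows. The class name, slot layout, and the FIFO-reclamation assumption (micro-requests from successive control steps finish in order) are our illustrative choices, not the paper's design:

```python
# Hypothetical sketch of a unified KV ring buffer. Instead of allocating a
# fresh KV cache per request (fragmenting memory and forcing scattered
# copies), all micro-requests share one preallocated circular arena of
# token slots. Assumes FIFO reclamation: requests from successive control
# steps finish in arrival order, so freed slots are reused by later appends.

class KVRingBuffer:
    def __init__(self, capacity):
        self.capacity = capacity          # total token slots, fixed up front
        self.slots = [None] * capacity    # stand-in for packed K/V tensors
        self.head = 0                     # next slot to write
        self.size = 0                     # slots currently in use

    def append(self, request_id, kv_entry):
        """Store one token's KV entry for a request; returns its slot index."""
        if self.size == self.capacity:
            raise MemoryError("KV arena full; evict a finished request first")
        idx = self.head
        self.slots[idx] = (request_id, kv_entry)
        self.head = (self.head + 1) % self.capacity
        self.size += 1
        return idx

    def evict(self, request_id):
        """Free every slot owned by a finished micro-request."""
        for i, entry in enumerate(self.slots):
            if entry is not None and entry[0] == request_id:
                self.slots[i] = None
                self.size -= 1

    def gather(self, request_id):
        """Collect a request's KV entries for a packed attention forward."""
        return [e[1] for e in self.slots if e is not None and e[0] == request_id]
```

Because all live KV entries sit in one dense arena, a packed forward over several micro-requests can read them without per-request pointer chasing, which is the kind of fusion of fragmented memory operations the abstract describes.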

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
