An Autonomous-Driving Behavior-Planning Accelerator That Skips Deep Transformer Layers

Reading time: 5 minutes

📝 Abstract

Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., navigation or low-precision planning) within a tolerable deviation (≤2 m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.


📄 Content

DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

Haibo Hu (haibohu2-c@my.cityu.edu.hk), Lianming Huang (lmhuang8-c@my.cityu.edu.hk), and Nan Guan (nanguan@cityu.edu.hk), City University of Hong Kong, Hong Kong, China; Chun Jason Xue (jason.xue@mbzuai.ac.ae), Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates

Keywords: Early Exit, Autonomous Driving, Vision-Language-Action, Real-Time Planning

1 Introduction

Recent progress in Vision-Language Models (VLMs) [1, 15, 16] has opened new possibilities for autonomous driving, enabling richer scene understanding and interpretable decision-making. Building on this momentum, recent systems such as ORION [6, 8, 14] extend large-model reasoning into the action domain, resulting in VLA frameworks that unify perception, language-based reasoning, and trajectory generation into a single end-to-end (E2E) architecture. Fig.
1 shows the evolution of autonomous driving paradigms, ranging from classic modular pipelines [9, 13], to VLM-enhanced perception models [10, 20], and finally to full-stack VLA systems [6]. While these unified models demonstrate impressive capabilities in simulation and closed-loop driving, they also impose significant computational overhead. A single inference pass typically involves dozens of transformer blocks, leading to hundreds of milliseconds of latency and heavy GPU utilization. This cost scales with model depth and input length, creating a fundamental barrier to real-time deployment on embedded platforms [4, 5, 22, 23].

[Figure 1: Comparison of autonomous driving paradigms: (a) classic E2E, (b) VLM-based, (c) VLA, and (d) our DeeAD with Action-Guided Early Exit for efficient, physically consistent inference.]

Our empirical analysis on the Bench2Drive benchmark [11] reveals that trajectory refinement across layers often saturates: intermediate layers can already generate safe and physically plausible trajectories that deviate only slightly from the final output. In some cases, these mid-layer plans align better with the intended navigation path than the final prediction. This mirrors real-world driving behavior, where slight deviations from the optimal path are tolerated as long as the motion remains safe and consistent [24].

This observation motivates an adaptive inference mechanism that dynamically terminates reasoning once a good-enough action is found. However, existing early-exit methods, typically based on feature-space confidence scores [2, 7, 22] or learned exit policies [5, 21, 23], suffer from domain dependency and lack physical grounding, making them unstable under distribution shifts [11, 25].
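The layer-wise saturation observed in the empirical analysis above can be made concrete with a small numerical probe. The sketch below is illustrative only: the trajectory tensors and shapes are hypothetical stand-ins, not DeeAD's actual interfaces. It measures, for each layer, the worst-case deviation (in meters) of that layer's decoded trajectory from the final-layer plan.

```python
import numpy as np

def layerwise_saturation(trajectories):
    """Measure how quickly per-layer trajectories converge to the final plan.

    trajectories: array of shape (L, T, 2) -- the (x, y) waypoints decoded
    from each of the L transformer layers' hidden states (T waypoints each).
    Returns, for each layer, the maximum Euclidean deviation (in meters)
    from the final layer's trajectory.
    """
    final = trajectories[-1]                 # (T, 2) final-layer plan
    diffs = trajectories - final             # (L, T, 2) per-layer offsets
    dists = np.linalg.norm(diffs, axis=-1)   # (L, T) per-waypoint error
    return dists.max(axis=-1)                # (L,) worst-case deviation

# Toy example: refinement noise decays layer by layer, so deviations
# shrink toward zero (the last entry is exactly 0 by construction).
rng = np.random.default_rng(0)
L, T = 8, 6
base = np.cumsum(rng.normal(size=(T, 2)), axis=0)   # a reference path
noise = np.linspace(3.0, 0.0, L)[:, None, None]     # decaying refinement
trajs = base[None] + noise * rng.normal(size=(L, T, 2))
dev = layerwise_saturation(trajs)
print(np.round(dev, 2))  # last entry is 0.0
```

If the deviation curve flattens well before the last layer, as the paper reports on Bench2Drive, the remaining depth contributes little to the final plan.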
This is particularly risky for autonomous driving, where inference uncertainty can directly affect safety. To address this, we introduce DeeAD, a lightweight, plug-and-play early-exit framework tailored for VLA models. The core idea is intuitive: rather than waiting for the final decoding layer, the model should dynamically terminate inference once the predicted trajectory already falls within an acceptable deviation from a navigation reference. When intermediate predictions align closely, typically within a 2 m tolerance, with route priors such as map navigation or low-precision planning, continued deep-layer reasoning brings diminishing returns. This follows a natural principle of human driving, where slight spatial deviations are permissible as long as motion remains safe and goal-consistent.

[arXiv:2511.20720v1 [cs.CV] 25 Nov 2025]

Turning this idea into a practical inference mechanism, however, introduces several challenges. First, standard VLA architectures only expose the final planning output, making it impossible to
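The exit rule and the multi-hop controller described above can be sketched as follows. This is a minimal illustration under assumed interfaces, not the paper's implementation: `layers`, `decode`, the 5% change-rate threshold, and the hop schedule are all hypothetical. Inference stops as soon as the decoded trajectory falls within the 2 m tolerance of a route prior, and the hop size grows when the deviation score changes slowly between checks.

```python
import numpy as np

TOLERANCE_M = 2.0   # exit when the plan is within 2 m of the prior

def deviation(traj, prior):
    """Max waypoint distance (m) between a trajectory and a route prior."""
    return float(np.linalg.norm(traj - prior, axis=-1).max())

def plan_with_early_exit(layers, decode, hidden, prior, max_hop=4):
    """Run transformer layers until the decoded trajectory matches the prior.

    layers: list of callables, hidden -> hidden (one per transformer block)
    decode: callable, hidden -> (T, 2) trajectory (shared action head)
    prior:  (T, 2) navigation / low-precision reference trajectory
    Returns (trajectory, number_of_layers_executed).
    """
    i, hop, prev_score = 0, 1, None
    while i < len(layers):
        # Execute the next `hop` layers without checking in between.
        for layer in layers[i:i + hop]:
            hidden = layer(hidden)
        i += hop
        score = deviation(decode(hidden), prior)
        if score <= TOLERANCE_M:
            return decode(hidden), i      # early exit: plan is feasible
        # Adapt the hop size to the change rate of the score: widen the
        # hop through redundant layers, shrink it when scores move fast.
        if prev_score is not None:
            rate = abs(prev_score - score) / max(prev_score, 1e-6)
            hop = min(max_hop, hop + 1) if rate < 0.05 else 1
        prev_score = score
    return decode(hidden), i              # fell through: full depth

# Toy demo: each "layer" halves the latent; decode is the identity.
layers = [lambda h: 0.5 * h] * 10
prior = np.zeros((4, 2))
traj, used = plan_with_early_exit(layers, lambda h: h, np.full((4, 2), 8.0), prior)
print(used)  # exits after 3 of 10 layers
```

The design mirrors the paper's training-free premise: the check uses only the decoded trajectory and a cheap geometric test, so no exit classifier needs to be learned.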

This content is AI-processed based on ArXiv data.
