Vision-Language-Action-Based Framework for Autonomous Drone Flight
📝 Abstract
This paper proposes VLA-AN (Vision-Language-Action framework for Aerial Navigation), an efficient, onboard Vision-Language-Action (VLA) framework dedicated to autonomous drone navigation in complex environments. VLA-AN addresses four major limitations of existing large aerial navigation models: the data domain gap, insufficient temporal reasoning for navigation, safety issues with generative action policies, and onboard deployment constraints. First, we construct a high-fidelity dataset utilizing 3D Gaussian Splatting (3D-GS) to effectively bridge the domain gap. Second, we introduce a progressive three-stage training framework that sequentially reinforces scene comprehension, core flight skills, and complex navigation capabilities. Third, we design a lightweight, real-time action module coupled with geometric safety correction. This module ensures fast, collision-free, and stable command generation, mitigating the safety risks inherent in stochastic generative policies. Finally, through deep optimization of the onboard deployment pipeline, VLA-AN achieves a robust real-time inference rate of 2-3 Hz on resource-constrained UAVs. Extensive experiments demonstrate that VLA-AN significantly improves spatial grounding, scene reasoning, and long-horizon navigation, achieving a maximum single-task success rate of 98.1%, and providing an efficient, practical solution for realizing full-chain closed-loop autonomy in lightweight aerial robots.
📄 Content
The rapid and concurrent advances of Multimodal Large Language Models (MLLMs) [1][2][3][4] have fundamentally reshaped expectations for the intelligence and autonomy of Unmanned Aerial Vehicles (UAVs). The emerging paradigm envisions aerial systems endowed with human-like cognitive-action capabilities, such as interpreting complex natural language instructions, understanding fine-grained scene semantics, inferring nuanced task intentions, and generating context-aware action sequences. However, most existing UAV systems remain largely dependent on manual control or limited autonomy. Their designs typically follow a cascaded perception-mapping-planning-control pipeline, composed of carefully engineered modules tailored to specific tasks such as tracking [5], landing [6], or physical interaction [7]. While effective in constrained settings, this modular paradigm introduces two fundamental limitations. First, errors tend to accumulate across stages, degrading overall system robustness. Second, these systems lack the capacity to reason over open-ended language and high-level intent. Extending them to new tasks often requires substantial manual redesign, making adaptation slow and resource-intensive. Consequently, there is a pressing need for a more general and unified approach that can reliably translate abstract task descriptions into precise control commands, thereby supporting aerial navigation in complex environments.
To address these limitations, data-driven navigation approaches based on Vision-Language-Action (VLA) models [8,9] or Vision-Language Models (VLMs) [10][11][12] have emerged as a promising direction for enhancing drone autonomy. By leveraging large-scale pretraining and cross-modal alignment, these models significantly enhance semantic understanding and enable language-conditioned action generation. With recent advances in model compression techniques such as quantization and knowledge distillation [13,14], the resulting models have become increasingly viable for deployment on onboard platforms. However, most existing VLM/VLA systems are primarily developed for ground-based platforms with fixed viewpoints or for relatively open and simple outdoor environments. As a result, they fall short of meeting the distinctive demands of aerial navigation, particularly in scenarios involving indoor flight, aggressive maneuvers, and tightly constrained spaces. We outline the core challenges inherent in deploying large navigation models on agile drones as follows:
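To make the compression idea concrete, the sketch below shows symmetric per-tensor int8 weight quantization, one of the simplest techniques in the family cited above. It is a minimal illustration of the general principle, not the paper's actual compression pipeline; the function names are our own.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q.

    A single scale maps the largest-magnitude weight to 127, so every
    float32 weight is stored as one signed byte (4x smaller than float32).
    """
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor from the int8 codes."""
    return q.astype(np.float32) * scale
```

The per-tensor rounding error is bounded by half the scale, which is why such weight-only schemes often preserve accuracy well enough for edge deployment while cutting memory and bandwidth by roughly 4x.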
- Data Distribution Mismatch. Most large models are pretrained on static images or data captured from fixed or height-constrained viewpoints. This training distribution differs substantially from the highly dynamic, first-person aerial perspective encountered during flight, leading to degraded semantic perception and unstable spatial localization. In addition, acquiring high-quality navigation data from real UAVs is inherently expensive and entails considerable operational risk, further limiting the availability of representative training data.
- Insufficient Temporal Reasoning for Navigation. Existing approaches predominantly rely on single-frame inference and have limited ability to encode temporal context. As a result, they struggle to exploit historical observations, reason under complex scenes, or execute long-horizon navigation tasks. These shortcomings hinder stable and continuous operation in real-world environments. Moreover, the reliance on preconstructed or known maps further restricts adaptability when operating in previously unseen spaces.
- Safety Limitations in Generative Action Models. Many state-of-the-art VLA models utilize generative models, such as diffusion policies [15] or flow matching [16], to generate continuous control sequences. While flexible, these approaches introduce stochasticity and generative noise that can significantly increase collision risk, particularly in confined environments. Additionally, these models struggle to incorporate explicit geometric constraints during training, hindering collision-free action generation in constrained 3D space.
- Constraints on Onboard Deployment. Current VLA models [8] impose substantial computational demands, typically requiring high-performance GPUs, which limits their deployment on UAVs with constrained onboard compute and strict payload restrictions. Although approaches such as model distillation [17] can partially alleviate these constraints, they often incur performance degradation. Consequently, efficient and lightweight frameworks specifically designed for resource-limited UAV deployment remain largely underdeveloped.
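One common remedy for the safety limitation above is to post-process each sampled command with a geometric safety filter. The sketch below shows one minimal version of that idea, assuming access to the direction and distance of the nearest obstacle: when clearance drops below a margin, the velocity component pointing toward the obstacle is projected away. This is an illustrative assumption on our part, not the specific correction module described in the paper.

```python
import numpy as np

def safety_correct(v_cmd, obstacle_dir, obstacle_dist, d_safe=0.5):
    """Geometrically filter a sampled velocity command.

    v_cmd:         commanded velocity, shape (3,)
    obstacle_dir:  unit vector from drone toward nearest obstacle
    obstacle_dist: distance to nearest obstacle (meters)
    d_safe:        clearance below which correction activates (meters)
    """
    if obstacle_dist >= d_safe:
        return v_cmd                      # enough clearance, pass through
    toward = float(np.dot(v_cmd, obstacle_dir))
    if toward <= 0.0:
        return v_cmd                      # already moving away from obstacle
    # Remove the component driving the drone into the obstacle,
    # keeping the tangential motion intact.
    return v_cmd - toward * obstacle_dir
```

Because the filter is a deterministic projection applied after sampling, it bounds the collision risk introduced by generative noise without retraining the policy, at the cost of occasionally altering the commanded trajectory near obstacles.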
To systematically address these challenges, we propose an efficient, onboard VLA framework for Aerial Navigation (named VLA-AN) in complex environments, establishing a unified pipeline that integrates high-fidelity data generation, multi-stage training, robust action generation, and onboard deployment to achieve full-chain closed-loop autonomous aerial navigation.
This content is AI-processed based on ArXiv data.