AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild

Notice: This research summary and analysis were generated automatically using AI technology. For complete accuracy, please refer to the original arXiv source.

Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, serving as a cornerstone task in Embodied AI. Current VLN research for unmanned aerial vehicles (UAVs) relies on detailed, pre-specified instructions to guide the UAV along predetermined routes. However, real-world outdoor exploration typically occurs in unknown environments where detailed navigation instructions are unavailable. Instead, only coarse-grained positional or directional guidance can be provided, requiring UAVs to autonomously navigate through continuous planning and obstacle avoidance. To bridge this gap, we propose AutoFly, an end-to-end Vision-Language-Action (VLA) model for autonomous UAV navigation. AutoFly incorporates a pseudo-depth encoder that derives depth-aware features from RGB inputs to enhance spatial reasoning, coupled with a progressive two-stage training strategy that effectively aligns visual, depth, and linguistic representations with action policies. Moreover, existing VLN datasets have fundamental limitations for real-world autonomous navigation, stemming from their heavy reliance on explicit instruction-following over autonomous decision-making and insufficient real-world data. To address these issues, we construct a novel autonomous navigation dataset that shifts the paradigm from instruction-following to autonomous behavior modeling through: (1) trajectory collection emphasizing continuous obstacle avoidance, autonomous planning, and recognition workflows; (2) comprehensive real-world data integration. Experimental results demonstrate that AutoFly achieves a 3.9% higher success rate compared to state-of-the-art VLA baselines, with consistent performance across simulated and real environments.


💡 Research Summary

The paper introduces AutoFly, an end‑to‑end Vision‑Language‑Action (VLA) system designed for autonomous navigation of unmanned aerial vehicles (UAVs) in outdoor, previously unknown environments. Traditional UAV vision‑language navigation (VLN) approaches rely on detailed, step‑by‑step natural‑language instructions that prescribe a fixed flight path. Such methods work well in controlled simulators but break down when only coarse‑grained guidance (e.g., “fly toward the red building”) is available and the environment is dynamic and cluttered. AutoFly addresses this gap by (1) generating depth‑aware representations from a single RGB camera using a pseudo‑depth encoder, (2) integrating these depth tokens with visual and linguistic embeddings in a multimodal transformer, and (3) outputting low‑level velocity and yaw commands through an action de‑tokenizer.

Pseudo‑depth encoder: The authors employ the state‑of‑the‑art monocular depth estimator Depth Anything V2 to produce high‑fidelity depth maps from RGB frames. The depth map is split into patches, linearly embedded, and projected into the same dimensional space as the vision tokens. This alignment preserves spatial relationships and supplies the model with geometric cues essential for 3‑D obstacle avoidance, altitude control, and distance estimation—capabilities that pure RGB‑only VLA models lack. Importantly, the approach avoids the need for dedicated depth sensors, reducing payload and cost while mitigating the simulation‑to‑real domain gap that arises from idealized depth data in simulators like AirSim.
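The patch-split-and-project step described above follows the standard vision-transformer tokenization recipe. Below is a minimal NumPy sketch of that step; the function name, the 16×16 patch size, and the 768-dim token width are illustrative assumptions, not values confirmed by the paper:

```python
import numpy as np

def patchify_and_embed(depth_map, patch_size, proj):
    """Split a depth map into non-overlapping patches, flatten each patch,
    and project it linearly into the vision-token dimension."""
    h, w = depth_map.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Rearrange (H, W) into (num_patches, patch_size * patch_size)
    patches = (depth_map
               .reshape(h // patch_size, patch_size, w // patch_size, patch_size)
               .transpose(0, 2, 1, 3)
               .reshape(-1, patch_size * patch_size))
    # Linear embedding: one token per patch, same width as the vision tokens
    return patches @ proj  # shape: (num_patches, embed_dim)

# Toy example: a 224x224 depth map, 16x16 patches, 768-dim tokens
rng = np.random.default_rng(0)
depth = rng.random((224, 224)).astype(np.float32)
proj = (rng.standard_normal((16 * 16, 768)) * 0.02).astype(np.float32)
tokens = patchify_and_embed(depth, 16, proj)
print(tokens.shape)  # (196, 768): one 768-dim token per patch, 14 x 14 = 196 patches
```

Because the depth tokens share the vision tokens' dimensionality, they can be concatenated into the same transformer sequence without any extra adapter layers.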

Two‑stage progressive training: Training proceeds in two phases. In the first stage, a large synthetic dataset (≈100 K trajectories) generated in simulation is used to pre‑train the multimodal encoder‑decoder. The loss combines language‑action alignment with a regression loss on the velocity/yaw outputs. In the second stage, a curated real‑world dataset—collected with actual UAVs flying in diverse outdoor scenes—is used for domain adaptation and fine‑tuning. This progressive scheme allows the model to first learn generic visual‑linguistic grounding and then specialize to the noise, lighting variations, and sensor imperfections encountered in the field.
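The summary describes the stage-one loss as "language-action alignment plus a regression loss on the velocity/yaw outputs" without giving a formula. The sketch below shows one plausible combination of those two terms; the function name, the equal default weighting, and the use of plain L2 regression are assumptions, not the paper's exact objective:

```python
import numpy as np

def combined_loss(action_logits, target_action_ids,
                  pred_velyaw, target_velyaw, reg_weight=1.0):
    """Cross-entropy over discretized action tokens (language-action
    alignment) plus L2 regression on continuous velocity/yaw commands."""
    # Numerically stable log-softmax over the action-token vocabulary
    shifted = action_logits - action_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    ce = -log_probs[np.arange(len(target_action_ids)), target_action_ids].mean()
    # Mean-squared error on the continuous velocity / yaw outputs
    reg = np.mean((pred_velyaw - target_velyaw) ** 2)
    return ce + reg_weight * reg
```

Stage two would minimize the same objective on the real-world trajectories, typically at a lower learning rate, so the model adapts to field noise without forgetting the simulation-learned grounding.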

Dataset contribution: Recognizing that existing VLN benchmarks (R2R, RxR, REVERIE, etc.) are unsuitable for UAV autonomous navigation, the authors construct the AutoFly Autonomous Navigation Dataset. It comprises 13,476 trajectories (147 of them captured in real outdoor flights) with an average path length of 12 m and a dense distribution of obstacles (10.3 obstacle encounters per trajectory on average). The dataset covers both "Seen" and "Unseen" environments, uses a 65%/18%/17% training/validation/test split, and has a vocabulary of 15.6 K tokens. Compared with prior aerial VLN datasets, it offers more trajectories, longer paths, more diverse object categories, and a substantial real‑world component, thereby narrowing the sim‑to‑real gap.
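As a quick sanity check, the reported split fractions are consistent with the total trajectory count. The snippet below uses only numbers from the summary above; the rounding scheme is an assumption:

```python
# Split sizes implied by a 65%/18%/17% split of 13,476 trajectories
total = 13_476
fractions = {"train": 0.65, "val": 0.18, "test": 0.17}
counts = {name: round(total * f) for name, f in fractions.items()}
print(counts)  # {'train': 8759, 'val': 2426, 'test': 2291}
assert sum(counts.values()) == total  # fractions partition the dataset exactly
```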

Experimental results: Evaluation uses Success Rate (SR) and SPL (Success weighted by Path Length). AutoFly outperforms two strong VLA baselines (OpenVLA and a LLaVA‑based model) by +3.9% SR and +2.7% SPL on the test set, with consistent gains across simulated and real environments. Ablation studies reveal that removing the pseudo‑depth encoder drops SR by 1.8 percentage points, while training without the second fine‑tuning stage reduces SR by 1.5 percentage points, confirming the importance of both components. Real‑flight tests also show reduced battery consumption and smoother trajectories, indicating that the model's depth‑aware reasoning translates into more efficient control.
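For readers unfamiliar with the metrics: SR is the fraction of episodes in which the agent stops within a success radius of the goal, while SPL additionally weights each success by path efficiency. A minimal sketch of the standard SPL definition (the example episode values are illustrative, not from the paper):

```python
def spl(successes, shortest_lengths, taken_lengths):
    """SPL = mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is
    the success indicator, l_i the shortest-path length to the goal, and
    p_i the length of the path the agent actually flew."""
    terms = [s * l / max(p, l)
             for s, l, p in zip(successes, shortest_lengths, taken_lengths)]
    return sum(terms) / len(terms)

# Episode 1: success via the shortest path (term 1.0);
# episode 2: success after a 2x detour (term 0.5);
# episode 3: failure (term 0.0).
print(spl([1, 1, 0], [10.0, 10.0, 10.0], [10.0, 20.0, 10.0]))  # 0.5
```

The `max(p, l)` in the denominator caps each per-episode term at 1, so an agent can never be rewarded for a path shorter than the shortest feasible one.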

Key insights and impact:

  1. Depth‑aware multimodal fusion is crucial for UAVs, where 3‑D geometry directly influences safety.
  2. Progressive training bridges the simulation‑to‑real‑world gap more effectively than a single end‑to‑end training run.
  3. Dataset design that emphasizes continuous planning, obstacle avoidance, and object recognition shifts VLN research from instruction‑following to genuine autonomous decision‑making.

The authors conclude that AutoFly constitutes a significant step toward practical UAV navigation in the wild, offering a unified architecture that jointly optimizes perception, language understanding, and low‑level control. Future directions include scaling to richer, possibly ambiguous natural‑language goals, integrating reinforcement learning for long‑horizon planning, and extending the framework to multi‑UAV coordination scenarios.

