APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation
Aerial Object Goal Navigation, a challenging frontier in Embodied AI, requires an Unmanned Aerial Vehicle (UAV) agent to autonomously explore, reason, and identify a specific target using only visual perception and a language description. However, existing methods struggle to memorize complex spatial representations in aerial environments, to make reliable and interpretable action decisions, and to explore and gather information efficiently. To address these challenges, we introduce APEX (Aerial Parallel Explorer), a novel hierarchical agent designed for efficient exploration and target acquisition in complex aerial settings. APEX is built upon a modular, three-part architecture: 1) a Dynamic Spatio-Semantic Mapping Memory, which leverages the zero-shot capability of a Vision-Language Model (VLM) to dynamically construct high-resolution 3D Attraction, Exploration, and Obstacle maps, serving as an interpretable memory mechanism; 2) an Action Decision Module, trained with reinforcement learning, which translates this rich spatial understanding into a fine-grained and robust control policy; and 3) a Target Grounding Module, which employs an open-vocabulary detector to achieve definitive and generalizable target identification. These components are integrated into a hierarchical, asynchronous, and parallel framework, effectively hiding the VLM's inference latency and boosting the agent's proactivity in exploration. Extensive experiments show that APEX outperforms the previous state of the art by +4.2% SR and +2.8% SPL on challenging UAV-ON benchmarks, demonstrating its superior efficiency and the effectiveness of its hierarchical asynchronous design. Our source code is available at https://github.com/4amGodvzx/apex
💡 Research Summary
The paper addresses the challenging problem of Aerial Object Goal Navigation, where an unmanned aerial vehicle (UAV) must locate a target object described only by a high‑level textual cue while operating solely on onboard visual sensors. Existing approaches suffer from three major limitations: inadequate long‑term spatial‑temporal memory for complex 3‑D environments, a brittle coupling between high‑level semantic understanding (often provided by large vision‑language models, VLMs) and low‑level control, and prohibitive inference latency that forces a stop‑and‑think execution style. To overcome these issues, the authors propose APEX (Aerial Parallel Explorer), a hierarchical, modular system composed of three decoupled components.
- Dynamic Spatio‑Semantic Mapping Memory: Using a VLM's zero‑shot grounding capability together with an open‑vocabulary segmentation model, the system back‑projects RGB‑D observations into a high‑resolution 3‑D voxel grid. Three parallel channels are maintained: an Attraction map that scores voxels according to their semantic relevance to the goal, an Exploration map that records observation density to encourage coverage of unknown regions, and an Obstacle map for collision avoidance. The attraction score of a voxel is updated only when a nearer observation is received, ensuring that the most reliable information dominates.
- Action Decision Module: A Proximal Policy Optimization (PPO) network consumes the three maps and produces low‑level navigation actions. Because map updating is slower than the control loop, the architecture runs the mapping, decision, and target‑grounding modules at different frequencies in an asynchronous parallel framework. This design hides the VLM's computational latency while guaranteeing that the policy always has the freshest obstacle information.
- Target Grounding Module: An open‑vocabulary object detector (based on DINO) continuously scans the visual stream to confirm the presence and precise location of the target object, solving the "last‑mile" identification problem.
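The distance-gated update rule for the Attraction map described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the grid shape, score range, and the `AttractionMap` class itself are assumptions, and the actual VLM-derived scoring is omitted.

```python
import numpy as np

class AttractionMap:
    """Toy sketch of a voxelized attraction map whose scores are
    overwritten only by observations taken from a nearer viewpoint."""

    def __init__(self, shape=(64, 64, 16)):
        self.score = np.zeros(shape, dtype=np.float32)            # semantic relevance per voxel
        self.obs_dist = np.full(shape, np.inf, dtype=np.float32)  # closest observation distance so far

    def update(self, voxel_idx, new_score, obs_distance):
        """Accept the new score only if this observation is closer
        than any previous one for that voxel."""
        if obs_distance < self.obs_dist[voxel_idx]:
            self.obs_dist[voxel_idx] = obs_distance
            self.score[voxel_idx] = new_score
            return True
        return False

m = AttractionMap()
m.update((3, 4, 2), new_score=0.8, obs_distance=10.0)  # first view: accepted
m.update((3, 4, 2), new_score=0.2, obs_distance=15.0)  # farther view: rejected
m.update((3, 4, 2), new_score=0.9, obs_distance=5.0)   # nearer view: accepted
```

Because only strictly nearer observations overwrite a voxel, stale low-confidence scores from distant viewpoints cannot clobber reliable close-range evidence.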
The three modules are integrated hierarchically: the mapping module supplies a rich spatial‑semantic memory, the decision module translates this memory into robust actions, and the grounding module validates success. Extensive experiments on the UAV‑ON benchmark demonstrate that APEX outperforms prior state‑of‑the‑art methods by +4.2% in Success Rate (SR) and +2.8% in Success weighted by Path Length (SPL), particularly in large‑scale, cluttered 3‑D scenes. Ablation studies confirm the individual contributions of dynamic mapping, asynchronous execution, and modular decoupling. By releasing code and models, the authors provide a solid foundation for future research on memory‑rich, real‑time aerial embodied agents.
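The asynchronous parallel execution can be illustrated with a toy two-thread setup: a slow "mapping" thread (standing in for VLM inference) periodically publishes a new map version, while a fast control loop always reads the latest snapshot instead of blocking on inference. This is a schematic sketch under assumed loop rates; the paper's actual frequencies and inter-process machinery are not specified here.

```python
import threading
import time

class SharedMap:
    """Latest-snapshot buffer shared between the slow mapping thread
    and the fast control loop (here just a version counter)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0

    def publish(self):
        with self._lock:
            self._version += 1

    def latest(self):
        with self._lock:
            return self._version

def mapping_loop(shared, stop, period=0.05):
    # Slow loop: emulates latency-bound VLM map updates.
    while not stop.is_set():
        time.sleep(period)
        shared.publish()

def control_loop(shared, steps=50, period=0.005):
    # Fast loop: acts on whatever map version is currently available,
    # never waiting for the mapper to finish an update.
    versions = []
    for _ in range(steps):
        versions.append(shared.latest())
        time.sleep(period)
    return versions

shared, stop = SharedMap(), threading.Event()
mapper = threading.Thread(target=mapping_loop, args=(shared, stop))
mapper.start()
versions = control_loop(shared)
stop.set()
mapper.join()
```

The key property is that the control loop's step time is independent of the mapper's latency: the policy consumes possibly-stale semantic maps while fresher ones arrive in the background, which is the latency-hiding behavior the summary attributes to APEX's asynchronous design.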