MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation
Visual Language Navigation (VLN) is a fundamental capability for embodied intelligence and a critical challenge that urgently needs to be addressed. However, existing methods remain unsatisfactory in both success rate (SR) and generalization: Supervised Fine-Tuning (SFT) approaches typically achieve higher SR, while Training-Free (TF) approaches often generalize better, but it is difficult to obtain both simultaneously. To this end, we propose a Memory-Execute-Review framework consisting of three parts: a hierarchical memory module that provides information support, an execute module for routine decision-making and actions, and a review module that handles abnormal situations and corrects behavior. We validate the effectiveness of this framework on the Object Goal Navigation task. Across 4 datasets, our average SR achieves absolute improvements of 7% and 5% over all baseline methods under the TF and Zero-Shot (ZS) settings, respectively. On the most commonly used HM3D_v0.1 and on the more challenging open-vocabulary dataset HM3D_OVON, SR improves by 8% and 6% under the ZS setting. Furthermore, on the MP3D and HM3D_OVON datasets, our method not only outperforms all TF methods but also surpasses all SFT methods, leading in both SR (by 5% and 2%) and generalization.
💡 Research Summary
The paper tackles the long‑standing trade‑off in Object Goal Navigation (OGN) between high success rates (SR) and strong generalization. Existing approaches fall into two camps: Supervised Fine‑Tuning (SFT) methods that achieve high SR by training on task‑specific data but overfit to the training distribution, and Training‑Free (TF) or “agentic” methods that preserve the broad competence of large pre‑trained vision‑language models (VLMs) and thus generalize well, yet suffer from lower SR.
To bridge this gap, the authors propose MerNav, a three‑stage cognitive architecture inspired by human neuroscience: Memory → Execute → Review. The system operates in a closed “Observe‑Think‑Act” loop, where “Think” is explicitly decomposed into three functional modules.
Memory is hierarchical, mirroring short‑term, long‑term, and commonsense memory. Short‑term memory stores the current RGB‑D observation and pose. Long‑term memory maintains two complementary structures: a compressed bird's‑eye‑view (BEV) map of explored vs. unexplored areas, and a value‑record map that aggregates directional scores over time using an exponential moving average (EMA). A non‑compressed buffer (sliding window) preserves the most recent raw observations for fine‑grained decisions. Commonsense memory holds task‑independent rules (e.g., robot width, "closed doors cannot be crossed", object size priors) and is injected into the VLM prompt as explicit constraints.
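The hierarchical memory described above can be sketched as a small container class. This is an illustrative reconstruction, not the authors' implementation: the class name, field names, grid size, window length, and EMA factor `alpha` are all assumptions; only the three memory tiers and the EMA update rule come from the summary.

```python
import numpy as np

class HierarchicalMemory:
    """Hypothetical sketch of MerNav's three-tier memory (names are assumed)."""

    def __init__(self, grid_size=256, n_directions=6, window=5, alpha=0.3):
        # Long-term memory: explored/unexplored BEV map plus a value record
        # that accumulates per-direction scores over time.
        self.bev_map = np.zeros((grid_size, grid_size), dtype=np.uint8)
        self.value_record = np.zeros(n_directions)
        self.alpha = alpha  # EMA smoothing factor (illustrative value)
        # Non-compressed buffer: sliding window of recent raw observations.
        self.recent_obs = []
        self.window = window
        # Commonsense memory: task-independent rules injected into the prompt.
        self.commonsense = [
            "The robot is 0.3 m wide.",
            "Closed doors cannot be crossed.",
        ]

    def update_values(self, new_scores):
        """Fuse fresh VLM directional scores with history via an EMA."""
        new_scores = np.asarray(new_scores, dtype=float)
        self.value_record = (
            self.alpha * new_scores + (1 - self.alpha) * self.value_record
        )
        return self.value_record

    def push_observation(self, obs):
        """Keep only the most recent raw observations (short-term buffer)."""
        self.recent_obs.append(obs)
        if len(self.recent_obs) > self.window:
            self.recent_obs.pop(0)
```

The EMA update is what lets a single noisy VLM score be tempered by the directional history accumulated in long-term memory.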
Execute handles routine navigation through a four‑step pipeline:
- Observation Analysis – The agent fuses three visual inputs: a local view (current forward image), a global view (stitched six‑direction panorama with direction labels), and the historical BEV map. A VLM‑based analysis agent scores each of the six candidate directions, outputs a numeric score and a natural‑language rationale (Rₜ), and the scores are fused with the long‑term value map via EMA. The highest‑scoring direction Dₜ^sel is selected.
- Path Planning – Using the selected image, its rationale, the overall goal G, and any previously generated sub‑goals, a sub‑goal generation function produces an intermediate waypoint (e.g., "approach the nearest chair"). This decomposes the high‑level goal into a sequence of executable sub‑goals.
- Action Selection – Depth information defines traversable ground (height difference < Tₕ). From the current pose, the farthest reachable point is computed, and additional candidate points are sampled left/right at fixed angular intervals. The candidate with the highest accumulated value is chosen, and the corresponding low‑level action (turn, move forward, stop) is issued.
- Stop Decision – If the Euclidean distance to the target object falls below a threshold T_g, the agent issues a STOP command and terminates the episode.
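The geometric parts of the pipeline above can be sketched as a few small functions. These are illustrative assumptions, not the paper's code: the function names and the threshold values for Tₕ and T_g are made up for the example; only the selection rule, the height-difference test, and the Euclidean stop criterion come from the summary.

```python
import math

def select_direction(fused_scores):
    """Pick the highest-scoring of the six candidate directions (D_t^sel)."""
    return max(range(len(fused_scores)), key=lambda i: fused_scores[i])

def traversable(height_map, t_h=0.15):
    """A cell counts as traversable ground if its height difference
    from the current floor is below T_h (value here is illustrative)."""
    return [[abs(h) < t_h for h in row] for row in height_map]

def should_stop(agent_xy, target_xy, t_g=1.0):
    """Stop Decision: issue STOP once the Euclidean distance to the
    target object falls below T_g (value here is illustrative)."""
    dx = target_xy[0] - agent_xy[0]
    dy = target_xy[1] - agent_xy[1]
    return math.hypot(dx, dy) < t_g
```

In the full system, `select_direction` would receive the EMA-fused scores from long-term memory, and the chosen direction would seed the sub-goal generation and candidate-point sampling described above.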
Review runs in parallel, continuously monitoring execution. Two review strategies are defined: a two‑step review that triggers when the immediate directional score drops sharply or the angle to the goal deviates beyond a limit, and a multi‑step review that activates after several consecutive steps without progress. In either case, an independent VLM instance receives the current state, the rationale, and the memory context, then generates a natural‑language explanation of the failure and proposes corrective actions (e.g., regenerate sub‑goal, re‑plan direction, abort current move). This meta‑cognitive loop mimics human self‑inspection (“I examine myself three times a day”) and provides transparency and safety.
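The two trigger conditions can be sketched as a single dispatch function. The function name and all threshold values (`drop_thresh`, `angle_limit`, `stall_limit`) are assumptions for illustration; only the two trigger strategies themselves come from the summary.

```python
def needs_review(score_history, angle_to_goal, steps_without_progress,
                 drop_thresh=0.4, angle_limit=90.0, stall_limit=5):
    """Decide which review strategy, if any, to trigger.

    Two-step review: the latest directional score drops sharply relative
    to the previous step, or the heading deviates past an angle limit.
    Multi-step review: several consecutive steps without progress.
    All thresholds here are illustrative assumptions.
    """
    if len(score_history) >= 2:
        if score_history[-2] - score_history[-1] > drop_thresh:
            return "two_step"
    if abs(angle_to_goal) > angle_limit:
        return "two_step"
    if steps_without_progress >= stall_limit:
        return "multi_step"
    return None  # routine execution continues; no review needed
```

When either trigger fires, the independent VLM reviewer described above would be invoked with the current state, rationale, and memory context to propose a corrective action.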
The authors evaluate MerNav on four datasets: MP3D, HM3D_v0.1, HM3D_OVON (an open‑vocabulary extension), and an additional public benchmark. Both Zero‑Shot (no fine‑tuning) and Training‑Free settings are tested. Results show:
- Compared to TF baselines, MerNav improves average SR by 7 percentage points.
- Compared to Zero‑Shot baselines, it improves SR by 5 pp.
- On HM3D_v0.1 and HM3D_OVON, the gains are 8 pp and 6 pp, respectively, under Zero‑Shot.
- On MP3D and HM3D_OVON, MerNav outperforms all TF and all SFT methods, achieving +5 pp and +2 pp higher SR, respectively, and similar gains in SPL.
- Swapping in a more powerful foundation model yields an additional +3 pp in both SR and SPL, pushing SR above the “acceptable” 70 % threshold.
Key contributions are: (1) a biologically inspired Memory‑Execute‑Review architecture that integrates hierarchical memory and meta‑cognitive error correction; (2) a VLM‑driven scoring and reasoning mechanism that provides explicit natural‑language rationales, enhancing interpretability; (3) a systematic review process that detects and corrects anomalies in real time; (4) comprehensive empirical validation showing simultaneous improvements in success rate, generalization, and interpretability across diverse environments.
The paper acknowledges limitations: heavy reliance on prompt engineering and VLM inference cost, potential latency of the review module, and the need for scalable commonsense knowledge injection. Future work may explore lightweight memory encoders, reinforcement‑learning‑based review policies, and multimodal extensions (audio, tactile) to further close the gap between simulated and real‑world embodied agents.
In summary, MerNav demonstrates that embedding human‑like memory structures and self‑review mechanisms into a training‑free VLN pipeline can achieve state‑of‑the‑art performance without sacrificing the broad generalization that large pre‑trained models provide, marking a significant step toward robust, interpretable, and deployable embodied navigation systems.