DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) requires an embodied agent to navigate in a complex 3D environment according to natural language instructions. Recent progress in large language models (LLMs) has enabled language-driven navigation with improved interpretability. However, most LLM-based agents still rely on single-shot action decisions, where the model must choose one option from noisy, textualized multi-perspective observations. Due to local mismatches and imperfect intermediate reasoning, such decisions can easily deviate from the correct path, leading to error accumulation and reduced reliability in unseen environments. In this paper, we propose DV-VLN, a new VLN framework that follows a generate-then-verify paradigm. DV-VLN first performs parameter-efficient in-domain adaptation of an open-source LLaMA-2 backbone to produce a structured navigational chain-of-thought, and then verifies candidate actions with two complementary channels: True-False Verification (TFV) and Masked-Entity Verification (MEV). DV-VLN selects actions by aggregating verification successes across multiple samples, yielding interpretable scores for reranking. Experiments on R2R, RxR (English subset), and REVERIE show that DV-VLN consistently improves over direct prediction and sampling-only baselines, achieving competitive performance among language-only VLN agents and promising results compared with several cross-modal systems. Code is available at https://github.com/PlumJun/DV-VLN.
💡 Research Summary
DV‑VLN introduces a generate‑then‑verify framework for Vision‑and‑Language Navigation (VLN) that markedly improves the reliability of large language model (LLM) based agents. The authors first adapt an open‑source LLaMA‑2 model to the VLN domain using parameter‑efficient techniques (e.g., LoRA), fine‑tuning it to output a structured chain‑of‑thought (CoT) at each navigation step. This CoT consists of three fields: Prediction (what the next scene should look like), View Match (which panoramic view supports that prediction), and Action (the concrete movement direction). By training the model to produce these explicit intermediate statements, the system creates a clear hypothesis for each candidate action.
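The three-field CoT described above can be sketched as a small data structure. This is a minimal illustration, not the authors' actual prompt template; the field names follow the paper, but the serialization format and example wording are assumptions.

```python
from dataclasses import dataclass

@dataclass
class NavCoT:
    """One step of the structured navigational chain-of-thought.

    Field names follow the Prediction / View Match / Action triple
    described in the paper; the prompt wording is illustrative only.
    """
    prediction: str  # what the next scene should look like
    view_match: str  # which panoramic view supports that prediction
    action: str      # the concrete movement direction

    def to_prompt(self) -> str:
        # Serialize the triple as explicit intermediate statements,
        # giving the verifier a clear hypothesis to check.
        return (
            f"Prediction: {self.prediction}\n"
            f"View Match: {self.view_match}\n"
            f"Action: {self.action}"
        )

step = NavCoT(
    prediction="a kitchen with a counter ahead",
    view_match="view 3 (front-left) shows a counter and cabinets",
    action="turn slightly left and move forward",
)
print(step.to_prompt())
```

Because each field is an explicit natural-language statement, any of them can later be restated as a declarative claim for verification.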
During inference, the adapted LLM samples multiple candidate actions via stochastic decoding. Each candidate, together with its CoT, is then subjected to two complementary verification channels:
- True‑False Verification (TFV) – The candidate hypothesis is reformulated as a declarative statement, and the LLM is asked whether it is true given the instruction, navigation history, and current observations. A “True” answer yields a verification point.
- Masked‑Entity Verification (MEV) – Key entities in the instruction (e.g., “kitchen”, “sofa”) are masked. Assuming the candidate action is taken, the LLM is queried to recover the masked entities. Successful recovery also yields a point.
Both TFV and MEV are run several times (typically 3‑5 samples) for each candidate, and the total number of successful checks constitutes an interpretable verification score. The candidate with the highest aggregated score is selected for execution, providing a transparent confidence measure without any external classifier.
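The aggregation step above amounts to summing boolean check outcomes over repeated samples and taking the argmax. The sketch below illustrates this with toy stand-in verifiers; in the real system both checks query the LLM, and the function names here are hypothetical.

```python
from typing import Callable, List

def verification_score(
    candidate: str,
    tfv_check: Callable[[str], bool],
    mev_check: Callable[[str], bool],
    n_samples: int = 3,
) -> int:
    """Sum TFV and MEV successes over n_samples repeated checks.

    Each successful check adds one point; the total is the
    interpretable verification score used for reranking.
    """
    score = 0
    for _ in range(n_samples):
        score += int(tfv_check(candidate))  # True-False Verification
        score += int(mev_check(candidate))  # Masked-Entity Verification
    return score

def select_action(
    candidates: List[str],
    tfv_check: Callable[[str], bool],
    mev_check: Callable[[str], bool],
    n_samples: int = 3,
) -> str:
    # Rerank candidates by aggregated score and execute the best one.
    return max(
        candidates,
        key=lambda c: verification_score(c, tfv_check, mev_check, n_samples),
    )

# Toy deterministic stand-ins for the stochastic LLM verifiers.
tfv = lambda c: "kitchen" in c
mev = lambda c: "left" in c

best = select_action(
    ["go right toward the bedroom", "turn left into the kitchen"],
    tfv, mev,
)
print(best)  # the candidate passing both checks scores highest
```

Because the score is just a count of passed checks, it doubles as a transparent confidence measure with no external classifier needed.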
The authors evaluate DV‑VLN on three standard VLN benchmarks: Room‑to‑Room (R2R), RxR (English subset), and REVERIE. Compared with language‑only baselines such as NavGPT and DiscussNav, DV‑VLN achieves consistent gains in Success Rate, SPL, and nDTW, especially on unseen splits where generalization is critical. It also reaches performance comparable to several strong cross‑modal models that rely on large vision‑language pre‑training, demonstrating that a well‑designed LLM‑centric approach can rival heavyweight multimodal systems.
Ablation studies confirm that TFV and MEV each contribute positively, and their combination yields the largest improvement, highlighting their complementary nature. Analyses of candidate count and verification sampling depth reveal a sweet spot (≈5‑8 candidates, 3‑5 verification samples) that balances computational cost and navigation accuracy.
Key contributions of the paper are:
- Introducing the first generate‑then‑verify paradigm for VLN, turning action selection into a hypothesis‑testing process.
- Formalizing navigation reasoning into a Prediction‑View‑Action triple and efficiently adapting an open‑source LLM to produce it.
- Designing two self‑verification mechanisms (TFV and MEV) that operate without additional training, offering interpretable re‑ranking of actions.
- Demonstrating that language‑only agents equipped with dual verification can achieve competitive results against cross‑modal baselines, reducing reliance on large vision‑language encoders.
In summary, DV‑VLN shows that embedding structured reasoning and self‑verification into LLM‑driven navigation yields more robust, transparent, and deployable agents, paving the way for future work that may combine these verification steps with direct visual grounding or richer multimodal feedback.