Efficient Multimodal Planning Agent for Visual Question-Answering
Visual Question-Answering (VQA) is a challenging multimodal task that requires integrating visual and textual information to generate accurate responses. While multimodal Retrieval-Augmented Generation (mRAG) has shown promise in enhancing VQA systems by providing additional evidence on both the image and text sides, the standard procedure for answering VQA queries, especially knowledge-intensive ones, relies on multi-stage mRAG pipelines with inherent inter-stage dependencies. To mitigate these inefficiencies while maintaining VQA task performance, this paper proposes training a multimodal planning agent that dynamically decomposes the mRAG pipeline to solve the VQA task. The method optimizes the trade-off between efficiency and effectiveness by training the agent to intelligently determine the necessity of each mRAG step. In experiments, the agent reduces redundant computation, cutting search time by over 60% compared to existing methods and decreasing costly tool calls. Meanwhile, the method outperforms all baselines, including a Deep Research agent and a carefully designed prompt-based method, on average across six diverse datasets. Code will be released.
💡 Research Summary
The paper addresses the inefficiency of current multimodal Retrieval‑Augmented Generation (mRAG) pipelines for Visual Question Answering (VQA). Traditional mRAG systems follow a rigid, multi‑stage workflow—image grounding, image search, query rewriting, and text retrieval—where each stage often depends on the previous one. This static design forces every input, even simple queries that could be answered directly, to undergo all steps, leading to excessive computation and latency.
To solve this, the authors propose a multimodal planning agent that learns to dynamically select which components of the mRAG pipeline are necessary for a given VQA instance. They define four categories: (c₁) no retrieval needed, (c₂) only textual context (kₜ) needed, (c₃) only visual context (kᵢ) needed, and (c₄) both visual and textual contexts needed. At inference time the agent predicts one of these categories and triggers the corresponding sub‑pipeline, thereby skipping unnecessary operations.
Training data are constructed automatically from three large VQA sources (InfoSeek, VQA‑v2, WanWu). For each original (image, question, answer) triple, the authors generate: an image‑focused query qᵢ (e.g., “What is in the image?”) with its answer aᵢ, and a “gold” query q_g that rewrites the original question to be more explicit. A strong multimodal LLM (Qwen2.5‑VL‑72B) is used to produce these annotations at scale. The planning agent (initialized from the same LLM) is then fine‑tuned to predict the correct category c given the original VQA pair, using a cross‑entropy loss over the category distribution. Both LoRA (rank 32) and full fine‑tuning are explored, with LoRA offering comparable performance at lower resource cost.
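The category-prediction objective described above is a standard four-way classification loss. As a minimal sketch, assuming the planner emits one logit per category (the actual model fine-tunes an LLM head over far larger vocabularies), the cross-entropy over the category distribution looks like this; the category names are illustrative labels, not the paper's identifiers:

```python
import math

# Illustrative labels for the four planning categories defined above.
CATEGORIES = ["c1_no_retrieval", "c2_text_only", "c3_visual_only", "c4_both"]

def cross_entropy(logits, target_idx):
    """Cross-entropy loss over the category logits (log-sum-exp for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_idx]

# Toy example: the planner strongly favors c3 (visual-only) and the gold
# label is also c3, so the loss is close to zero.
loss = cross_entropy([0.1, -1.2, 3.5, 0.4], CATEGORIES.index("c3_visual_only"))
```

In practice this loss would be computed over the LLM's next-token logits for the category tokens, with gradients flowing either through LoRA adapters or the full model.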
During inference, the agent follows a simple decision flow:
- c₁: answer directly with the base MLLM (no retrieval).
- c₂: rewrite the question to q_g, retrieve textual passages (kₜ), then answer with (question, kₜ).
- c₃: retrieve visual evidence (kᵢ) and answer with (question, kᵢ).
- c₄: retrieve visual evidence first, use it to rewrite the question to q_g, retrieve textual passages, and finally answer with (question, kᵢ, kₜ).
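The decision flow above amounts to a four-way dispatch over sub-pipelines. A minimal sketch, where the retrieval, rewriting, and generation helpers are placeholders (assumptions, not the paper's API):

```python
# Placeholder tool functions standing in for the mRAG components.
def retrieve_visual(image):              # image search -> visual context k_i
    return "k_i"

def retrieve_textual(query):             # text retrieval -> textual context k_t
    return "k_t"

def rewrite(question, visual_ctx=None):  # LLM rewriting of the question -> q_g
    return f"q_g({question})"

def answer(question, *contexts):         # base MLLM generation
    return f"answer({question}; {', '.join(contexts) or 'no context'})"

def run_pipeline(category, image, question):
    """Dispatch to the sub-pipeline selected by the planning agent."""
    if category == "c1":   # answer directly, no retrieval
        return answer(question)
    if category == "c2":   # rewrite, then text-only retrieval
        k_t = retrieve_textual(rewrite(question))
        return answer(question, k_t)
    if category == "c3":   # visual-only retrieval
        k_i = retrieve_visual(image)
        return answer(question, k_i)
    if category == "c4":   # visual retrieval, rewrite with k_i, then text retrieval
        k_i = retrieve_visual(image)
        k_t = retrieve_textual(rewrite(question, k_i))
        return answer(question, k_i, k_t)
    raise ValueError(f"unknown category: {category}")
```

The point of the design is visible in the control flow: only c₄ pays for both retrieval calls plus the rewriting step, while c₁ skips retrieval entirely.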
The method is evaluated on six diverse VQA benchmarks: Life VQA, Private VQA, Dyn‑VQA (Chinese and English), Visual7W, NoCaps, and a mixed “Mix” set. Baselines include the Deep Research agent WebWatcher and a prompt‑based OmniSearch system. Metrics comprise LLM‑Eval scores (0–4) and the proportion of queries using each retrieval type.
Results show that the planning agent reduces overall retrieval time by more than 60% relative to always‑on mRAG, while achieving higher accuracy across all datasets. For example, on Life VQA the agent reaches 71.81 LLM‑Eval points versus 59.19 without retrieval and outperforms the baselines by a sizable margin. The agent also dramatically cuts tool‑call latency, being 3× faster than WebWatcher‑7B and 4.5× faster than WebWatcher‑32B, while still delivering superior answer quality. Analysis of category usage reveals that the majority of queries fall into c₁ or c₃, confirming that many VQA items do not require costly textual retrieval.
The paper’s contributions are threefold: (1) introducing a dynamic planning agent that adaptively configures mRAG pipelines for VQA, (2) presenting an automated large‑scale annotation pipeline for training such agents, and (3) demonstrating consistent gains in both efficiency and effectiveness across a broad set of VQA tasks.
Limitations include reliance on a separate LLM to rewrite queries into q_g at inference time, and the fact that the planning agent itself is a large model, so scaling to even larger backbones may increase base inference cost. Future work could explore lighter‑weight planners, more sophisticated decision trees, and integration with additional tools (e.g., OCR, object detectors) to further enhance multimodal retrieval efficiency.