Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference
Vision-Language Models (VLMs) have become powerful backbones for agents that autonomously operate in digital environments such as the web and operating systems. However, these models adapt poorly to fast-changing environments like the web, and mitigating this through fine-tuning requires expensive model training and data collection. In this work, we introduce a novel paradigm for enhancing agentic VLM policies at inference time without policy retraining. Fundamentally, our approach decouples the VLM’s role as a high-capacity action proposer from the final action selection mechanism. We keep the VLM policy frozen and use it to generate a set of candidate actions for a given state. Then, a lightweight, offline-trained Q-function reranks these candidates, and the agent executes the action with the highest estimated value. The main contribution is to apply the Q-function directly during inference for immediate policy improvement, rather than offline to relabel data for policy retraining. We demonstrate on the academic WebVoyager benchmark that our method significantly boosts agent success rates, improving a Qwen2.5-VL-7B agent from 38.8% to 55.7% and a proprietary GPT-4.1 agent from 82.4% to 88.8%.
💡 Research Summary
The paper introduces “Best‑of‑Q,” a novel inference‑time framework that improves the performance of vision‑language model (VLM) agents without any policy retraining. Traditional VLM‑based agents treat the model as an end‑to‑end policy that directly maps multimodal observations (screenshots, text instructions, action history) to low‑level actions. While powerful, such policies struggle in fast‑changing environments like the web because they cannot adapt quickly, and fine‑tuning them requires massive data collection and computational resources.
Best‑of‑Q decouples the VLM’s role into two distinct components: (1) a frozen VLM that acts as a high‑capacity action proposer, generating a set of N candidate actions for the current state, and (2) a lightweight Q‑function trained offline that scores each candidate and selects the one with the highest expected return. The Q‑function receives multimodal embeddings of the state and each candidate action, produced by the frozen VLM, and processes them through a small multi‑layer perceptron (MLP).
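The propose-then-rerank loop can be sketched as follows. The MLP sizes, embedding dimension, and names (`QFunction`, `best_of_q`) are illustrative assumptions; the summary specifies only that a small MLP scores concatenated state/action embeddings produced by the frozen VLM.

```python
import numpy as np

rng = np.random.default_rng(0)

class QFunction:
    """Tiny MLP Q(s, a): concatenated state/action embeddings -> scalar.

    Layer widths are placeholders; the paper describes only a lightweight
    MLP over frozen-VLM multimodal embeddings.
    """
    def __init__(self, embed_dim=8, hidden=16):
        self.w1 = rng.normal(size=(2 * embed_dim, hidden)) * 0.1
        self.w2 = rng.normal(size=(hidden, 1)) * 0.1

    def __call__(self, state_emb, action_emb):
        x = np.concatenate([state_emb, action_emb], axis=-1)
        h = np.maximum(x @ self.w1, 0.0)   # ReLU hidden layer
        return (h @ self.w2).squeeze(-1)   # scalar Q-value

def best_of_q(q_fn, state_emb, candidate_embs):
    """Score each of the N candidate actions and return the argmax index."""
    q_values = np.array([q_fn(state_emb, a) for a in candidate_embs])
    return int(np.argmax(q_values))
```

In the full system, the candidate embeddings would come from N actions sampled from the frozen VLM for the current screenshot and instruction; the reranker itself never generates actions.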
Training the Q‑function uses Implicit Q‑Learning (IQL), which first learns a state‑value function V(s) via expectile regression and then updates Q(s,a) using the target r + γ V(s′). IQL is chosen because it avoids querying out‑of‑distribution actions, thereby mitigating distribution shift when learning from static datasets. The offline dataset is built iteratively: an initial ε‑greedy policy collects diverse trajectories, a Q‑function is trained on this data, and the resulting Best‑of‑Q agent is then used to collect higher‑quality trajectories. This loop repeats, progressively improving both the dataset and the Q‑function.
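The two IQL regression objectives described above can be written as simple loss functions. The values τ = 0.7 and γ = 0.99 are conventional defaults assumed here for illustration, not values reported in the summary.

```python
import numpy as np

def expectile_loss(q_target, v, tau=0.7):
    """Asymmetric L2 loss used by IQL to fit V(s) toward an upper
    expectile of Q(s, a); underestimation (q_target > v) is weighted
    by tau, overestimation by 1 - tau."""
    diff = q_target - v
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)

def q_loss(q, r, v_next, gamma=0.99):
    """TD regression of Q(s, a) toward r + gamma * V(s').
    Bootstrapping through V avoids evaluating out-of-distribution
    actions, which is the point of using IQL on a static dataset."""
    target = r + gamma * v_next
    return np.mean((target - q) ** 2)
```

With τ = 0.5 the expectile loss reduces to a symmetric (scaled) mean-squared error; τ > 0.5 biases V(s) upward toward the better actions in the dataset.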
During inference, the VLM generates N=3 candidate actions (the paper’s ablation shows larger N yields diminishing returns). The Q‑function evaluates each candidate, and the action with the maximal Q‑value is executed. This mechanism mirrors classic Deep Q‑Network (DQN) action selection, but instead of a fixed discrete action space, it operates on a dynamically generated sub‑space of the agent’s otherwise infinite action set.
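The selection step above can be written compactly; here \(\pi_{\text{VLM}}\) denotes the frozen proposer and \(Q_\theta\) the learned critic (notation assumed for illustration):

```latex
a_t = \operatorname*{arg\,max}_{a \in \{a^{(1)}, \dots, a^{(N)}\}} Q_\theta(s_t, a),
\qquad a^{(i)} \sim \pi_{\text{VLM}}(\cdot \mid s_t), \quad N = 3.
```

Unlike DQN, where the argmax ranges over a fixed discrete action set, here the candidate set is resampled from the VLM at every step.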
Experiments are conducted on the WebVoyager benchmark, using 590 patched tasks across 15 domains. Two policy backbones are evaluated: the proprietary GPT‑4.1 model and open‑source Qwen2.5‑VL models (7B and 72B parameters). Three settings are compared: (a) standard prompting (single action per step), (b) random action selection (ε‑greedy with ε = 1), and (c) Best‑of‑Q. Results show substantial gains for all backbones: GPT‑4.1's success rate rises from 82.4% (prompting) to 88.8% with Best‑of‑Q (+6.4 percentage points), Qwen2.5‑VL‑7B improves from 38.8% to 55.7% (+16.9 points), and Qwen2.5‑VL‑72B from 71.8% to 79.3% (+7.5 points). Average steps to success remain comparable, indicating that the method does not incur a significant efficiency penalty.
The paper positions Best‑of‑Q relative to three research streams: (1) VLM agents for GUI control, which typically require costly fine‑tuning; (2) training‑heavy policy improvement methods (online RL, large‑scale imitation learning), which demand extensive interaction or compute; and (3) inference‑time value estimation techniques (self‑critique, search‑based methods). Unlike prior work that uses a learned Q‑function only for offline relabeling (e.g., DigiQ) or heavy search (MCTS), Best‑of‑Q directly leverages the Q‑function at inference to rerank VLM‑generated actions, achieving immediate gains with minimal overhead.
Limitations include dependence on the quality and diversity of generated candidates, sensitivity to the number of candidates N, and the fact that only single‑step Q‑values are used, which may not capture long‑term planning nuances. Future directions suggested are multi‑step value estimation, richer candidate generation strategies, and extending the approach to other modalities such as mobile or desktop GUIs.
Overall, Best‑of‑Q demonstrates a practical, model‑agnostic pathway to boost VLM‑based agents by separating perception (VLM) from decision‑making (Q‑function), offering a cost‑effective alternative to full policy retraining while delivering notable performance improvements on realistic web navigation tasks.