AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild
Benchmarks are paramount for gauging progress in the domain of Mobile GUI Agents. In practical scenarios, users rarely articulate precise directives containing full task details at the outset, and their expressions are typically ambiguous. Consequently, agents must converge on the user’s true intent via active clarification and interaction during execution. However, existing benchmarks predominantly operate under the idealized assumption that user-issued instructions are complete and unequivocal. This paradigm focuses exclusively on assessing single-turn execution while overlooking the agent’s intent-alignment capability. To address this limitation, we introduce AmbiBench, the first benchmark incorporating a taxonomy of instruction clarity to shift evaluation from unidirectional instruction following to bidirectional intent alignment. Grounded in Cognitive Gap theory, we propose a taxonomy of four clarity levels: Detailed, Standard, Incomplete, and Ambiguous. We construct a rigorous dataset of 240 ecologically valid tasks across 25 applications, subject to strict review protocols. Furthermore, targeting evaluation in dynamic environments, we develop MUSE (Mobile User Satisfaction Evaluator), an automated framework utilizing an MLLM-as-a-judge multi-agent architecture. MUSE performs fine-grained auditing across three dimensions: Outcome Effectiveness, Execution Quality, and Interaction Quality. Empirical results on AmbiBench reveal the performance boundaries of SoTA agents across different clarity levels, quantify the gains derived from active interaction, and validate the strong correlation between MUSE and human judgment. This work redefines evaluation standards, laying the foundation for next-generation agents capable of truly understanding user intent.
💡 Research Summary
The paper introduces AmbiBench, the first benchmark that explicitly evaluates mobile GUI agents’ ability to handle incomplete or ambiguous user instructions through multi‑turn interaction. Recognizing that real‑world users often issue commands that lack full detail, the authors ground their work in Cognitive Gap theory and define a four‑level taxonomy of instruction clarity: Detailed (all requirements explicit), Standard (core goal explicit, minor options omitted), Incomplete (key constraints missing), and Ambiguous (goal itself unclear). This taxonomy serves as the backbone for a systematic assessment of how much clarification an agent must solicit.
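The four-level taxonomy can be pictured as an ordered scale of how much of the user's intent the instruction exposes. A minimal sketch of that idea follows; the enum values and the restaurant-booking variants are invented here for illustration and are not drawn from the paper's dataset:

```python
from enum import Enum

class ClarityLevel(Enum):
    """Hypothetical encoding of AmbiBench's four clarity levels,
    ordered from most to least explicit instruction."""
    DETAILED = 4    # all requirements explicit
    STANDARD = 3    # core goal explicit, minor options omitted
    INCOMPLETE = 2  # key constraints missing
    AMBIGUOUS = 1   # the goal itself is unclear

# Illustrative variants of one underlying intent (examples invented here):
variants = {
    ClarityLevel.DETAILED:   "Book a 2-person table at Luigi's for Friday 7pm, outdoor seating.",
    ClarityLevel.STANDARD:   "Book a table at Luigi's for Friday evening.",
    ClarityLevel.INCOMPLETE: "Book a table at Luigi's.",
    ClarityLevel.AMBIGUOUS:  "Sort out dinner plans.",
}

# The lower the level, the more clarification the agent must solicit.
needs_clarification = [lvl for lvl in variants if lvl.value <= 2]
```

Generating all four variants from one ground-truth intent, as the paper does, is what makes agent performance directly comparable across clarity levels.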
To populate the benchmark, the authors curate 240 ecologically valid tasks spanning 25 mainstream Android applications, covering both single‑app and cross‑app workflows. Each task is annotated with high‑quality human‑executed trajectories and undergoes a rigorous “Legitimacy Review” and “Effectiveness Assurance” process to eliminate meaningless or noisy tasks, ensuring ecological validity. For every task, four variants corresponding to the clarity levels are generated, allowing direct comparison of agent performance under varying degrees of user uncertainty.
A central component of AmbiBench is an LLM‑driven User Simulator. The simulator retains the full ground‑truth intent but only reveals information that the agent explicitly asks for, thereby emulating a real user who can answer clarification questions in natural language rather than following a fixed script. This dynamic interaction capability overcomes the limitations of prior benchmarks that rely on static, scripted responses.
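The information-gating behavior of the simulator can be sketched with a toy stand-in: it holds the full ground-truth intent but discloses a slot only when the agent's question touches on it. The real simulator answers in free-form natural language via an LLM; the keyword matching, slot names, and reply templates below are simplifying assumptions made purely for illustration:

```python
class UserSimulator:
    """Toy stand-in for AmbiBench's LLM-driven User Simulator:
    retains the complete ground-truth intent, but reveals each
    piece of information only when the agent explicitly asks."""

    def __init__(self, ground_truth: dict[str, str]):
        self._intent = ground_truth       # hidden from the agent
        self.revealed: set[str] = set()   # slots disclosed so far

    def answer(self, agent_question: str) -> str:
        q = agent_question.lower()
        for slot, value in self._intent.items():
            # Crude keyword match; the real simulator interprets
            # the question with an LLM instead.
            if slot in q and slot not in self.revealed:
                self.revealed.add(slot)
                return f"The {slot} should be {value}."
        return "I don't have a preference beyond what I've said."

sim = UserSimulator({"time": "7pm on Friday", "party size": "two people"})
reply = sim.answer("What time would you like the reservation?")
# Only the 'time' slot is revealed; 'party size' stays hidden until asked.
```

The key property, mirrored from the paper's description, is that the agent cannot dump a generic question and receive the whole intent: information gain is proportional to the quality of its clarification questions.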
Evaluation is performed by MUSE (Mobile User Satisfaction Evaluator), an automated framework that adopts an “MLLM‑as‑a‑judge” multi‑agent architecture. Three specialized judging agents independently score: (1) Outcome Effectiveness – whether the final result matches the user’s true intent; (2) Execution Quality – accuracy, robustness, and efficiency of UI actions; and (3) Interaction Quality – relevance and informativeness of the agent’s questions, information gain, and conversational flow. Each dimension combines binary, quantitative, and sequence‑based metrics, delivering fine‑grained diagnostics. Human‑in‑the‑loop studies show Pearson and Spearman correlations above 0.89 between MUSE scores and human judgments, validating the reliability of the automated metrics.
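Structurally, MUSE's three judging dimensions can be thought of as independent scores that are then aggregated into an overall satisfaction estimate. The sketch below assumes a 0–1 scale and an equal-weight average; the paper's actual metrics combine binary, quantitative, and sequence-based signals per dimension, so this is a simplified illustration, not MUSE's formula:

```python
from dataclasses import dataclass

@dataclass
class MuseScores:
    """Sketch of MUSE's three dimensions. The 0-1 scale and the
    equal-weight aggregation below are assumptions for illustration."""
    outcome_effectiveness: float  # final result matches the user's true intent?
    execution_quality: float      # accuracy, robustness, efficiency of UI actions
    interaction_quality: float    # relevance and information gain of questions

    def overall(self, weights: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> float:
        dims = (self.outcome_effectiveness,
                self.execution_quality,
                self.interaction_quality)
        return sum(w * d for w, d in zip(weights, dims))

# An agent that nails the outcome but asks mediocre questions:
s = MuseScores(outcome_effectiveness=1.0,
               execution_quality=0.8,
               interaction_quality=0.6)
score = s.overall()  # (1.0 + 0.8 + 0.6) / 3 = 0.8
```

Keeping the three dimensions separate before aggregation is what gives the framework its diagnostic value: an agent can fail on Interaction Quality while succeeding on Outcome Effectiveness, and the per-dimension scores surface that.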
Empirical experiments benchmark a range of state‑of‑the‑art agents, including general‑purpose LLMs (GPT‑4o, Claude‑Opus) and specialized mobile agents (UI‑TARS, AutoGLM). Results reveal a clear performance gradient: agents achieve >85% success on Detailed and Standard tasks, but drop to <45% on Incomplete and <30% on Ambiguous tasks when interaction is disabled. When multi‑turn clarification is enabled, performance recovers by an average of 22 percentage points, with the greatest gains observed in the Interaction Quality scores. This demonstrates that active questioning effectively bridges the cognitive gap and aligns the agent’s actions with the user’s hidden preferences.
A comparative analysis against existing mobile benchmarks (AndroidArena, MobileBench, SP‑ABench, etc.) highlights AmbiBench’s advantages: it supports three‑dimensional, multi‑granular metrics; operates in an online sandbox proxy environment; and, crucially, incorporates the instruction‑clarity taxonomy, which prior works lack. The authors also discuss limitations: the current implementation focuses on Android, the User Simulator inherits biases from its underlying LLM, and scalability to multi‑user or cross‑device scenarios remains an open challenge.
In summary, AmbiBench and the MUSE evaluation framework redefine how mobile GUI agents are assessed, shifting the focus from one‑shot instruction following to bidirectional intent alignment. By quantifying the impact of instruction ambiguity and demonstrating the tangible benefits of interactive clarification, the work establishes a rigorous, reproducible standard that will guide the development of next‑generation agents capable of truly understanding and fulfilling user intent.