Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents
VLM-based mobile agents are increasingly popular for their ability to interact with smartphone GUIs and XML-structured text and to complete daily tasks. However, existing online benchmarks struggle to obtain stable reward signals because the environment changes dynamically, while offline benchmarks evaluate agents on single-path trajectories, which stands in contrast to the inherently multi-solution character of GUI tasks. Moreover, neither type of benchmark can assess whether mobile agents handle noise or engage in proactive interaction, since evaluation lacks noisy apps and the instructions are overly complete. To address these limitations, we use a slot-based instruction generation method to construct a more realistic and comprehensive benchmark named Mobile-Bench-v2. Mobile-Bench-v2 includes a common task split with offline multi-path evaluation to assess the agent's ability to obtain step rewards during task execution. It contains a noisy split built on pop-up- and ad-heavy apps, and a contaminated split named AITZ-Noise, to form a realistic noisy environment. Furthermore, an ambiguous instruction split with preset Q&A interactions is released to evaluate the agent's proactive interaction capabilities. We conduct evaluations on these splits using the single-agent framework AppAgent-v1, the multi-agent framework Mobile-Agent-v2, as well as other mobile agents such as UI-Tars and OS-Atlas. Code and data are available at https://huggingface.co/datasets/xwk123/MobileBench-v2.
💡 Research Summary
Mobile‑Bench‑v2 is introduced as a next‑generation benchmark designed to evaluate Vision‑Language Model (VLM) based mobile agents under conditions that more closely resemble real‑world smartphone usage. The authors first identify two major shortcomings of existing benchmarks. Online benchmarks run agents on actual devices, but their step‑wise reward signals are unstable due to OS updates, app version changes, and user‑specific settings; they also typically allow only a single high‑level instruction, ignoring the fact that many tasks admit multiple valid action sequences. Offline benchmarks, on the other hand, rely on pre‑recorded single‑path trajectories (golden paths) and thus cannot assess an agent’s ability to discover alternative solutions or to handle process‑level feedback. Moreover, both categories largely ignore noisy environments (ads, pop‑ups) and the need for agents to ask clarification questions when instructions are ambiguous.
To overcome these gaps, the paper proposes four key contributions.

First, a slot‑based instruction generation method called GIAS (Generating Instructions From Mobile UI Action Sequences) is built on the Mobile3M graph corpus. GIAS extracts "intents" (semantic descriptions of actions) and "slots" (key UI element information) from each trajectory, then fills pre‑defined instruction templates with these slots. Because the same slot can appear in many different trajectories, a single instruction can map to multiple valid paths, enabling multi‑path evaluation.

Second, an offline multi‑path evaluation protocol is introduced. Agents may either follow a single path and be compared to the golden path, or they may search the graph corpus for alternative paths, accumulating step‑wise rewards that mimic online feedback. This hybrid approach combines the reproducibility of offline benchmarks with the richer process‑level signals of online testing.

Third, a realistic noisy split (Mobile‑Bench‑Noisy) is constructed by selecting third‑party apps that contain unavoidable ads and pop‑ups, and by contaminating the high‑quality AITZ dataset with randomly inserted ads (AITZ‑Noise). No pre‑handling of login, permission dialogs, or updates is performed, forcing agents to demonstrate robustness, rollback, and recovery capabilities.

Fourth, an ambiguous instruction split (Mobile‑Bench‑Ambiguous) is created by stripping slot information from full instructions and pairing each GUI state with a set of pre‑written Q&A pairs. Agents are allowed to ask clarification questions before taking an action; correct answers provide the missing slot information and serve as step rewards. Since existing frameworks lack a dedicated questioning stage, the benchmark requires agents to decide autonomously whether to query.
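The slot-filling and multi-path-reward ideas above can be sketched roughly as follows. This is an illustrative toy, not the paper's actual GIAS implementation: the template text, slot names, and action strings are all hypothetical.

```python
# Hypothetical sketch of slot-based instruction generation and
# multi-path step rewards. Templates, slots, and actions are
# illustrative placeholders, not taken from the GIAS pipeline.

TEMPLATE = "Search for {query} in {app} and open the {target} result."

def fill_template(template: str, slots: dict) -> str:
    """Instantiate an instruction by filling template placeholders
    with slot values extracted from a UI action trajectory."""
    return template.format(**slots)

def step_rewards(agent_path: list, golden_paths: list) -> list:
    """Reward each agent step that matches the same step of ANY
    valid golden path, mimicking multi-path evaluation."""
    rewards = []
    for i, action in enumerate(agent_path):
        on_some_path = any(
            i < len(g) and g[i] == action for g in golden_paths
        )
        rewards.append(1 if on_some_path else 0)
    return rewards

instruction = fill_template(
    TEMPLATE, {"query": "sci-fi movies", "app": "Douban", "target": "first"}
)
# Two valid trajectories for the same instruction: GUI tasks
# typically admit more than one solution path.
golden = [["open_app", "tap_search", "type_query", "tap_first"],
          ["open_app", "long_press_home", "type_query", "tap_first"]]
agent = ["open_app", "tap_search", "type_query", "tap_second"]
print(step_rewards(agent, golden))  # → [1, 1, 1, 0]
```

The key point the sketch tries to convey is that a step earns a reward if it stays consistent with any of the valid paths, so an agent is not penalized merely for departing from one canonical golden trajectory.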
The authors evaluate four representative systems: the single‑agent framework AppAgent‑v1, the multi‑agent framework Mobile‑Agent‑v2, and two open‑source VLM agents, UI‑Tars and OS‑Atlas. Results show that multi‑path evaluation reveals performance differences that single‑path metrics hide; for example, UI‑Tars excels in noisy environments but lags in discovering alternative solutions. In the ambiguous split, Mobile‑Agent‑v2, which actively asks questions, achieves the highest cumulative rewards, while agents that never query suffer a steep drop in success rate. The noisy split demonstrates that agents equipped with rollback mechanisms can recover from ad‑induced diversions, whereas others fail to complete tasks.
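The clarification mechanism behind the ambiguous-split results can be pictured with a minimal sketch. The dictionary contents, function names, and matching logic below are assumptions for illustration only; the benchmark's actual preset Q&A format is not specified here.

```python
# Hypothetical sketch of the ambiguous-instruction protocol: each GUI
# state is paired with pre-written Q&A pairs, and a recognized
# clarification question returns the missing slot value and counts
# as a step reward. All names and strings are illustrative.

PRESET_QA = {
    "Which city do you want the weather for?": "Beijing",
    "For which date?": "tomorrow",
}

def run_clarification(agent_questions: list) -> tuple:
    """Answer recognized questions, filling slots and accruing rewards."""
    slots, reward = {}, 0
    for q in agent_questions:
        if q in PRESET_QA:           # question matches a preset pair
            slots[q] = PRESET_QA[q]  # the answer supplies the missing slot
            reward += 1              # and doubles as a step reward
    return slots, reward

slots, reward = run_clarification(
    ["Which city do you want the weather for?", "Should I open maps?"]
)
print(reward)  # → 1: only the preset question earned a reward
```

Under this scheme, an agent that never asks questions collects no clarification rewards and is left missing slot values, which is consistent with the steep success-rate drop reported for non-querying agents.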
Overall, Mobile‑Bench‑v2 offers a comprehensive, realistic, and interactive evaluation suite that addresses the three major limitations of prior benchmarks: lack of multi‑path assessment, overly clean testing environments, and overly explicit instructions. By providing stable slot‑based reward signals, realistic noise, and a framework for proactive clarification, Mobile‑Bench‑v2 sets a higher standard for measuring the true capabilities of VLM‑based mobile agents and is a strong candidate to become the de facto benchmark for future research and deployment in this rapidly evolving field.