MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original ArXiv source.

Current mobile GUI agent benchmarks systematically fail to assess memory capabilities, with only 5.2-11.8% memory-related tasks and no cross-session learning evaluation. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications where 89.8% challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) RQ-driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources including code, benchmark, and evaluation results will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.


💡 Research Summary

The paper “MemGUI‑Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments” addresses a critical gap in the evaluation of mobile GUI agents: the lack of systematic assessment of their memory capabilities. While recent multimodal large language models (LLMs) have enabled agents that can autonomously interact with mobile interfaces, existing benchmarks allocate only 5.2‑11.8 % of tasks to memory‑related scenarios and completely ignore cross‑session learning. To fill this void, the authors introduce MemGUI‑Bench, a comprehensive, memory‑centric benchmark that evaluates both short‑term (in‑session) and long‑term (cross‑session) memory.

The contribution starts with a memory taxonomy that classifies 11 state‑of‑the‑art agents into five short‑term architectures—Memory Agent, Action‑Thought Pattern, Multi‑turn Context, Rule‑based Aggregation, and No Historical Context—and two long‑term strategies—Success‑Based Learning and Failure‑Based Learning. This taxonomy clarifies how each model attempts to retain information during a task and how it accumulates experience across sessions.

The benchmark itself comprises 128 tasks spread across 26 real‑world mobile applications. Tasks are carefully balanced across three difficulty levels and four app‑complexity categories. Importantly, 89.8 % of the tasks explicitly require agents to retain and retrieve information across time (short‑term) and across applications (spatial). To probe long‑term learning, the authors pair 64 tasks into “mirror” pairs that share the same app composition and cognitive load but differ in specific requirements, enabling a pass@k evaluation protocol where agents can attempt a task up to k times (default k = 3). This design directly measures whether knowledge from earlier attempts improves performance on later, related tasks.
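The pass@k protocol described above can be sketched in a few lines. The sketch below is illustrative, not the authors' implementation: the `AttemptLog` structure and field names are assumptions, and a task counts as passed if any of its first k attempts succeeds.

```python
from dataclasses import dataclass

@dataclass
class AttemptLog:
    """One recorded attempt at a benchmark task (hypothetical schema)."""
    task_id: str
    attempt: int      # 1-based attempt index, up to k per task
    success: bool

def pass_at_k(logs: list[AttemptLog], k: int = 3) -> float:
    """Fraction of tasks solved within the first k attempts."""
    solved_by_task: dict[str, bool] = {}
    for log in logs:
        # A task passes if any attempt with index <= k succeeded.
        prior = solved_by_task.get(log.task_id, False)
        solved_by_task[log.task_id] = prior or (log.attempt <= k and log.success)
    if not solved_by_task:
        return 0.0
    return sum(solved_by_task.values()) / len(solved_by_task)
```

With the default k = 3, an agent that recovers on a second attempt at a mirror task is credited, which is exactly how the protocol separates cross-session learners from one-shot performers.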

A snapshot‑based plug‑and‑play framework underpins the benchmark, providing rapid environment reset, parallel execution of multiple agents, and persistent state handling for multi‑attempt protocols. This infrastructure overcomes the consistency and scalability issues that have plagued prior GUI‑agent benchmarks.

For evaluation, the authors develop MemGUI‑Eval, an automated pipeline that employs a novel “Progressive Scrutiny” approach. The pipeline consists of three stages: (1) a Triage Judge that makes a quick decision based on minimal evidence (goal description, concise action logs, final three screenshots), dramatically reducing cost for clearly successful cases; (2) a Semantic Judge that, after a Step Descriptor generates detailed textual descriptions of each interaction, performs full semantic analysis, computes the Information Retention Rate (IRR), and, if needed, requests specific historical screenshots; (3) a Visual Judge that receives only the explicitly requested visual evidence to make the final determination. This staged approach balances accuracy and computational expense far better than traditional LLM‑as‑Judge methods that must ingest entire trajectories.
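The control flow of the staged pipeline can be made concrete with a small sketch. This is a schematic rendering under assumed interfaces, not MemGUI-Eval itself: each judge is modeled as a callable that returns a confident True/False verdict or None to escalate to the next, costlier stage.

```python
from typing import Callable, Optional

# A judge returns True/False when confident, or None to escalate (assumed interface).
Judge = Callable[[dict], Optional[bool]]

def progressive_scrutiny(trajectory: dict,
                         triage: Judge,
                         semantic: Judge,
                         visual: Judge) -> bool:
    """Run cheap judges first; only escalate ambiguous cases to visual evidence."""
    for judge in (triage, semantic):
        verdict = judge(trajectory)
        if verdict is not None:   # confident early verdict, skip later stages
            return verdict
    # The Visual Judge sees only the requested screenshots and must decide.
    return bool(visual(trajectory))
```

The cost saving comes from the early returns: clearly successful trajectories never reach the Step Descriptor or screenshot retrieval, so the expensive visual stage handles only the genuinely ambiguous cases.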

The benchmark defines seven hierarchical metrics across three dimensions: short‑term memory fidelity (Success Rate, IRR, Memory‑Task Proficiency Ratio), long‑term learning (pass@k Success Rate, Failure Recovery Rate), and execution efficiency (average step ratio, time per step, cost per step). These metrics enable fine‑grained diagnosis of where agents succeed or fail.
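To make one of these metrics concrete, the Information Retention Rate can be sketched as a simple recall over required information items. The paper's exact definition is not reproduced in this summary, so the set-based formulation below is an illustrative assumption.

```python
def information_retention_rate(required: set[str], reproduced: set[str]) -> float:
    """Share of required information items the agent later reproduced correctly.

    Assumed formulation: recall of `required` items against what the agent
    actually used or restated during the task (`reproduced`).
    """
    if not required:
        return 1.0  # nothing to retain counts as full retention
    return len(required & reproduced) / len(required)
```

Under this formulation, an agent asked to carry four facts across an app switch that reuses only two of them scores an IRR of 0.5, matching the intuition behind the ~30% averages reported below.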

Empirical results on the 11 agents reveal significant memory deficits. Short‑term IRR values hover around 30 % on average, and the Memory‑Task Proficiency Ratio shows a 4‑10× gap between memory‑intensive and standard tasks. Long‑term metrics indicate limited cross‑session learning, with modest improvements even after three attempts. The authors identify five failure modes: (1) information omission, (2) temporal ordering errors, (3) context confusion across app switches, (4) lack of knowledge transfer, and (5) excessive resource consumption for memory management. From these observations they derive five actionable design implications: (a) introduce explicit memory slots or key‑value stores, (b) employ meta‑learning or continual‑learning techniques for robust long‑term memory, (c) optimize memory‑management policies for cost‑efficiency, (d) normalize multi‑app context representations, and (e) adopt progressive‑scrutiny evaluation pipelines for scalable benchmarking.

In conclusion, MemGUI‑Bench establishes the first standardized, memory‑focused evaluation suite for mobile GUI agents, exposing pervasive shortcomings in current models and providing a clear roadmap for future research. All code, task suites, and evaluation results are openly released and will be continuously maintained, inviting the community to build upon this foundation and advance the memory capabilities of autonomous mobile agents.

