MobiBench: A Modular, Multi-Path Offline Benchmark for Evaluating Mobile GUI Agents
📝 Abstract
Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human-computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations. First, existing benchmarks rely on either single-path offline datasets or live online environments. Offline benchmarks built on static, single-path annotated datasets unfairly penalize valid alternative actions, while online benchmarks suffer from poor scalability and reproducibility due to the dynamic and unpredictable nature of live evaluation. Second, existing benchmarks treat agents as monolithic black boxes, overlooking the contributions of individual components, which often leads to unfair comparisons or obscures key performance bottlenecks. To address these limitations, we present MobiBench, the first modular and multi-path-aware offline benchmarking framework for Mobile GUI Agents that enables high-fidelity, scalable, and reproducible evaluation entirely in offline settings. Our experiments demonstrate that MobiBench achieves 94.72% agreement with human evaluators, on par with carefully engineered online benchmarks, while preserving the scalability and reproducibility of static offline benchmarks. Furthermore, our comprehensive module-level analysis uncovers several key insights, including a systematic evaluation of diverse techniques used in mobile GUI Agents, optimal module configurations across model scales, the inherent limitations of current LFMs, and actionable guidelines for designing more capable and cost-efficient mobile agents.
📄 Content
Modular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents

Youngmin Im* (KAIST, S. Korea), Byeongung Jo* (Sungkyunkwan University, S. Korea), Jaeyoung Wi (KAIST, S. Korea), Tae Hoon Min (Sungkyunkwan University, S. Korea), Seungwoo Baek (Sungkyunkwan University, S. Korea), Joo Hyung Lee (Sungkyunkwan University, S. Korea), Sangeun Oh (Korea University, S. Korea), Insik Shin (KAIST & Fluiz, S. Korea), Sunjae Lee (Sungkyunkwan University, S. Korea)
ym.im@kaist.ac.kr, {whale1510,sunjae.lee}@skku.edu

1 Introduction

The proliferation of Large Foundation Models (LFMs) has catalyzed the development of mobile Graphical User Interface (GUI) agents [16, 22, 33, 37, 51]: AI agents capable of autonomously interacting with mobile apps on behalf of humans. These Mobile GUI Agents promise to revolutionize human-computer interaction by automating the complex, repetitive mobile tasks that pervade our digital lives. However, despite significant advances in agent capabilities, the evaluation methodologies [3, 7, 24, 25, 34, 43–45, 52, 55] for these systems remain fundamentally limited, hindering both fair performance comparison and systematic improvement of these agents.

Current evaluation practices for Mobile GUI Agents suffer from two critical shortcomings. First, existing benchmarks fall into two categories, online and offline evaluation, each with its own limitations. Offline evaluations [3, 4, 15, 25, 29, 44, 53] typically rely on static datasets composed of sequences of screenshots paired with a "single" correct action at each step. While convenient and reproducible, these datasets fail to account for the multiple valid paths available in real-world applications. For instance, when booking a flight, an agent may choose either the "Search Flights" button or an "Explore Trips" shortcut; single-path datasets treat the latter as an error even though both lead to the correct next state. Consequently, current offline benchmarks mark any deviation from the pre-recorded "golden path" as failure, unfairly penalizing agents that pursue equally valid alternatives.

*Co-first authors: Youngmin Im, Byeongung Jo.
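The single-path penalty described above can be illustrated with a minimal scoring sketch. This is not MobiBench's actual implementation; the function names, action-string format, and the flight-booking steps are hypothetical, chosen to mirror the "Search Flights" vs. "Explore Trips" example from the text.

```python
# Illustrative sketch (NOT MobiBench's actual scorer): contrast single-path
# step accuracy, which accepts only one pre-recorded "golden" action per
# step, with multi-path-aware step accuracy, which accepts any action that
# leads to a valid next state. All names and action strings are hypothetical.

def score_single_path(trajectory, golden_actions):
    """Fraction of steps whose action exactly matches the golden action."""
    return sum(a == g for a, g in zip(trajectory, golden_actions)) / len(golden_actions)

def score_multi_path(trajectory, valid_actions_per_step):
    """Fraction of steps whose action is in the set of valid alternatives."""
    return sum(a in valid for a, valid in zip(trajectory, valid_actions_per_step)) / len(valid_actions_per_step)

# Flight-booking example: tapping "Explore Trips" instead of "Search Flights"
# reaches the same next state, yet single-path scoring counts it as an error.
golden = ["tap:search_flights", "type:NYC", "tap:book"]
valid = [{"tap:search_flights", "tap:explore_trips"}, {"type:NYC"}, {"tap:book"}]
agent = ["tap:explore_trips", "type:NYC", "tap:book"]

print(score_single_path(agent, golden))  # 2 of 3 steps match the golden path
print(score_multi_path(agent, valid))    # all 3 steps are valid alternatives
```

Under single-path scoring the agent is penalized for a perfectly valid first action; a multi-path-aware dataset, by annotating the set of acceptable actions per state, removes that penalty without requiring a live environment.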
1. Our benchmark framework and dataset are available at https://github.com/fclab-skku/Mobi-Bench

arXiv:2512.12634v2 [cs.AI] 8 Jan 2026

Table 1: Comparison of MobiBench to other benchmarks

| Benchmark | Modular Assessment | Multi-path Support | Scalable Dataset | Reproducible Results | Real-world Apps |
|---|---|---|---|---|---|
| **Offline Benchmarks** | | | | | |
| AITW [25] | ✗ | ✗ | ✓ | ✓ | ✓ |
| MoTiF [3] | ✗ | ✗ | ✓ | ✓ | ✓ |
| Meta-GUI [29] | ✗ | ✗ | ✗ | ✓ | ✓ |
| Mobile-Bench-v2 [44] | ✗ | ✓ | ✗ | ✓ | ✓ |
| MobileGPT [15] | ✗ | ✗ | ✓ | ✓ | ✓ |
| DroidTask [37] | ✗ | ✗ | ✓ | ✓ | ✓ |
| **Online Benchmarks** | | | | | |
| AndroidArena [43] | ✗ | ✓ | ✓ | ✗ | ✓ |
| LlamaTouch [55] | ✗ | ✓ | ✗ | ✗ | ✓ |
| Mobile-Bench [7] | ✗ | ✓ | ✗ | ✗ | ✓ |
| MobileAgentBench [34] | ✗ | ✓ | ✗ | ✗ | ✗ |
| Android Lab [45] | ✗ | ✓ | ✗ | ✓ | ✓ |
| AndroidWorld [24] | ✗ | ✓ | ✗ | ✓ | ✗ |
| **MobiBench (ours)** | ✓ | ✓ | ✓ | ✓ | ✓ |

Conversely, online benchmarks [7, 24, 34, 43, 45, 52, 55] evaluate agents in live environments using execution checkpoints. While this supports multiple valid paths, it suffers from severe scalability and reproducibility challenges. Creating valid checkpoints requires a comprehensive understanding of the app's logic and extensive engineering effort, often requiring app code modifications. Moreover, environmental factors such as app updates, dyna
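Checkpoint-based online evaluation, as described above, can be sketched as a set of predicates over live device state that must all hold after the agent finishes. This is a hypothetical illustration, not the harness of any benchmark in Table 1; the class names, fields, and the toy booking state are invented for the example.

```python
# Hypothetical sketch of checkpoint-based ONLINE evaluation: a task passes
# only if every checkpoint predicate holds over the final device state.
# All names (DeviceState, Checkpoint, evaluate_online) are illustrative.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DeviceState:
    """Toy stand-in for the live app/device state an online harness inspects."""
    values: Dict[str, object] = field(default_factory=dict)


@dataclass
class Checkpoint:
    name: str
    predicate: Callable[[DeviceState], bool]


def evaluate_online(final_state: DeviceState, checkpoints: List[Checkpoint]) -> bool:
    # Writing these predicates demands deep knowledge of the app's internal
    # state (and often code modifications), which is exactly the scalability
    # cost of online benchmarks that the text points out.
    return all(cp.predicate(final_state) for cp in checkpoints)


state = DeviceState(values={"booking_confirmed": True, "passenger": "Alice"})
checkpoints = [
    Checkpoint("flight booked", lambda s: s.values.get("booking_confirmed") is True),
    Checkpoint("right passenger", lambda s: s.values.get("passenger") == "Alice"),
]
print(evaluate_online(state, checkpoints))  # True: both checkpoints hold
```

Because the predicates inspect final state rather than the action sequence, any valid path that reaches the goal passes; the trade-off is the per-task engineering effort and the fragility of live environments.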