GAIR: GUI Automation via Information-Joint Reasoning and Group Reflection

Reading time: 5 minutes

📝 Original Info

  • Title: GAIR: GUI Automation via Information-Joint Reasoning and Group Reflection
  • ArXiv ID: 2512.09396
  • Date: 2025-12-10
  • Authors: Zishu Wei¹‡, Qixiang Ma¹‡, Xavier Hu¹, Yuhang Liu¹, Hui Zang², Yudong Zhao², Tao Wang², Shengyu Zhang¹*, Fei Wu¹* (¹ Zhejiang University, ² Huawei Technologies Ltd.; ‡ co-first authors, * corresponding authors)

📝 Abstract

Building AI systems for GUI automation tasks has attracted considerable research effort, with MLLMs leveraged to process user requirements and generate operations. However, GUI automation covers a wide range of tasks, from document processing to online shopping and from CAD to video editing. This diversity requires MLLMs for GUI automation to have heterogeneous capabilities and to master multidimensional expertise, which makes constructing such a model difficult. To address this challenge, we propose GAIR: GUI Automation via Information-Joint Reasoning and Group Reflection, a novel MLLM-based GUI automation agent framework that integrates knowledge and combines capabilities from heterogeneous models to build higher-performing GUI automation agent systems. Since different GUI-specific MLLMs are trained on different datasets and thus have different strengths, GAIR introduces a general-purpose MLLM that jointly processes the information from multiple GUI-specific models, further enhancing the performance of the agent framework. The general-purpose MLLM also serves as the decision maker, attempting to execute a reasonable operation based on the information gathered so far. When the general-purpose model judges that there is not sufficient information for a reasonable decision, GAIR transitions into a group reflection state, in which the general-purpose model provides the GUI-specific models with different instructions and hints based on their strengths and weaknesses, driving them to gather more significant and accurate information that can support deeper reasoning and decision making. We evaluated the effectiveness and reliability of GAIR through extensive experiments on GUI benchmarks.
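The control flow described in the abstract can be sketched roughly as follows. This is only an illustrative reading of the abstract, not the authors' implementation: the names `Decision`, `gair_step`, `layout_model`, `element_model`, and `toy_general_model` are hypothetical, the models are stand-in functions rather than real MLLMs, and the observation strings are made-up toy data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Decision:
    """Output of the (hypothetical) general-purpose model."""
    operation: Optional[str]  # None means "not enough information yet"
    hints: Dict[str, str] = field(default_factory=dict)  # per-model hints


# Assumed interfaces, not taken from the paper:
# a GUI-specific model maps (instruction, screenshot, hint) -> observation text,
# the general-purpose model maps (instruction, observations) -> Decision.
GuiModel = Callable[[str, str, str], str]
GeneralModel = Callable[[str, Dict[str, str]], Decision]


def gair_step(instruction: str, screenshot: str,
              gui_models: Dict[str, GuiModel],
              general_model: GeneralModel,
              max_rounds: int = 3) -> Optional[str]:
    """One decision step: joint reasoning, with group reflection as fallback."""
    hints = {name: "" for name in gui_models}
    for _ in range(max_rounds):
        # Each GUI-specific model gathers information, guided by its hint.
        observations = {name: model(instruction, screenshot, hints[name])
                        for name, model in gui_models.items()}
        # The general-purpose model reasons jointly over all observations.
        decision = general_model(instruction, observations)
        if decision.operation is not None:
            return decision.operation
        # Group reflection: send model-specific hints and gather again.
        hints.update(decision.hints)
    return None  # no confident operation within the round budget


# Toy stand-ins for real models (all outputs are invented for illustration).
def layout_model(instruction: str, screenshot: str, hint: str) -> str:
    if "search" in hint:
        return "search block contains a 'High-Speed Only' checkbox"
    return "page has a status bar, a query block and tabs"


def element_model(instruction: str, screenshot: str, hint: str) -> str:
    return "elements: query button, date picker"


def toy_general_model(instruction: str, observations: Dict[str, str]) -> Decision:
    joined = " ".join(observations.values())
    if "checkbox" in joined:
        return Decision(operation="click('High-Speed Only' checkbox)")
    return Decision(operation=None,
                    hints={"layout": "inspect the search block in detail"})


if __name__ == "__main__":
    models = {"layout": layout_model, "elements": element_model}
    print(gair_step("Show High-Speed Tracks Only", "home.png",
                    models, toy_general_model))
```

In this toy run the first round yields no decision, so the general-purpose stand-in sends a hint to the layout model only, and the second round produces an operation. A real system would replace the stand-in functions with calls to actual models and would need a stopping policy more careful than a fixed round budget.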

💡 Deep Analysis

Figure 1

📄 Full Content

GAIR: GUI AUTOMATION VIA INFORMATION-JOINT REASONING AND GROUP REFLECTION

Zishu Wei1,‡, Qixiang Ma1,‡, Xavier Hu1, Yuhang Liu1, Hui Zang2, Yudong Zhao2, Tao Wang2, Shengyu Zhang1,*, and Fei Wu1,*
1Zhejiang University, 2Huawei Technologies Ltd.
‡Equal Contribution, *Corresponding Author
arXiv:2512.09396v1 [cs.MA] 10 Dec 2025

1 INTRODUCTION

The GUI automation task, which aims to use AI systems to automate GUI operations, would bring further convenience to people's daily work and has therefore become an attractive research field. With the rapid development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), the capabilities of AI models have reached the level of human intelligence, making it possible to construct AI systems for GUI automation. Researchers have found that using MLLMs for GUI automation tasks can reduce computational cost (Xu et al., 2024), enhance generalization (Liu et al., 2025b), and reduce model hallucination (Meng et al., 2024), thereby reaching better performance (Kil et al., 2024). Their outstanding capability for processing complex semantic and contextual information, together with an architecture that can process textual and visual input simultaneously, makes MLLMs an appropriate choice for constructing AI systems for GUI automation.

[Figure 1: Demonstration of the sub-tasks involved in a GUI automation task, illustrated with the instruction "Show High-Speed Tracks Only" on the 12306 homepage. Panels: Resolve GUI Layout; Extract GUI Elements; Identify Significant Blocks and Elements; Decision and Operation Execution.]

However, the GUI automation task is continually expanding to involve more and more real-world scenarios, where the applications, GUI pages, and action spaces in each scenario can be completely different, so that different MLLMs are required to master the capabilities needed for GUI automation in each scenario and scope of task. For instance, general GUI automation models like AGUVIS (Xu et al., 2024) and Ferret-UI (Li et al., 2025) do well on web and mobile applications that are common in people's daily life but are inefficient on various tasks in vertical fields; assistant systems designed for a specific scenario or discipline, such as CAD-Assistant (Mallis et al., 2024), can handle tasks within their designed scope but cannot process the various tasks of people's daily life. In addition, as demonstrated in Fig. 1, the process of a GUI automation task involves multiple sub-tasks, such as GUI page information extraction, information-processing-based decision, and precise operation generation, bringing higher complexity. Therefore, to build an efficient and reliable agent system that is capable of most GUI automation tasks, multiple MLLMs need to be leveraged. In this paper, we explore how to combine the capabilities of multiple GUI-specific MLLMs to construct a G

📸 Image Gallery

Fig1_Demonstration.png Fig2_GAIR.png

Reference

This content is AI-processed based on open access ArXiv data.
