Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation

Reading time: 6 minutes

📝 Abstract

The rapid development of large vision-language models (VLMs) has greatly advanced research on GUI agents. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level capabilities and low-level execution capabilities, facing prevalent issues of responsibility coupling and capability conflict. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Rather than training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for strategic planning and task decomposition, and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. On this basis, we build the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model, assisting the Executor in solving long-horizon tasks through task scheduling and state management. Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Furthermore, analysis confirms that our trained high-level scheduling module is a generalizable, plug-and-play component that significantly enhances the long-horizon capabilities of various Executors. Code is available at https://github.com/hehehahi4/CES

(Figure 1 contrasts a simple task with a long-horizon task, e.g., "Using the Firefox browser, search for an introductory article about Earth, download the picture, and share the link on Tumblr with Katsunaksu.")


📄 Content

Graphical User Interface (GUI) agents play a crucial role in automating complex tasks [1,4,14,25,28,31,37,40]. The traditional training paradigm for GUI agents primarily relies on Supervised Fine-Tuning (SFT), where the model learns the mapping from environmental states to actions through imitation learning [24,35,48]. However, SFT heavily depends on large-scale, costly, and meticulously annotated high-quality trajectory data, and often exhibits poor generalization in unseen environments [12,21,23,33].

Recent studies [16,23,34,47] have largely focused on using rule-based Reinforcement Learning (RL) to train agents. These methods use verifiable reward functions, removing the need for expensive manual annotation and human feedback. RL has been demonstrated as an efficient alternative to SFT in GUI tasks, as it requires only a relatively small amount of high-quality training data to achieve significant performance gains [23]. While existing methods have achieved commendable results on simple tasks [18,21,23], they still face two fundamental problems when confronted with long-horizon tasks, illustrated in Figure 1: (i) Responsibility Coupling and Capability Conflict in Single-Agent Architectures. Current mainstream end-to-end models attempt to couple heterogeneous capabilities, such as long-term task planning, multi-step reasoning, GUI element grounding, and precise action execution, within a unified policy network. This design presents fundamental optimization difficulties: a model with finite parameters struggles to simultaneously master high-level abilities, such as decomposing complex instructions and tracking task progress, alongside low-level abilities like grounding and execution. As task complexity increases, this coupling may lead to a catastrophic collapse of the model's diverse capabilities.

(ii) Lack of Task State Awareness. In long-horizon tasks, an accurate awareness of the current progress is crucial for making correct decisions. Most methods rely on historical action sequences (e.g., Click(x, y)), but these low-level actions provide almost no task state information or semantic context. Therefore, the agent has to primarily rely on the visual information from screenshots to infer its progress. However, screenshots are an insufficient and unreliable representation of task progress. This limitation makes it difficult for the agent to determine its current position in a long task, leading to errors and progress loss.
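The gap described above can be made concrete with a toy illustration: a raw low-level action log answers "what was clicked," while a State-Tracker-style summary answers "where are we in the task." The data shapes and field names below are illustrative, not the paper's actual representations.

```python
# A raw action history like most methods keep: no task semantics at all.
raw_history = [
    "Click(312, 88)", "Type('Earth')", "Click(540, 120)",
    "Scroll(down)", "Click(200, 430)",
]

# A semantic state summary of the kind a State Tracker would maintain
# (subgoal names are hypothetical, drawn from the Figure 1 example).
state_summary = {
    "completed": ["opened Firefox", "searched for 'Earth' article"],
    "pending": ["download picture", "share link on Tumblr"],
}

def progress(summary):
    """Fraction of subgoals completed, inferable only from the summary."""
    done = len(summary["completed"])
    total = done + len(summary["pending"])
    return done / total
```

Nothing in `raw_history` lets an agent compute anything like `progress`, which is exactly the state information long-horizon decisions depend on.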

To address the challenges above, we propose a staged execution-feedback RL algorithm, whose core insight is to resolve responsibility coupling and conflicting optimization objectives. To support this algorithmic paradigm, we build a Coordinator-Executor-State Tracker (CES) framework, which structurally decouples the complex automation process into three specialized agents. Unlike training a unified and overloaded policy model, our algorithm is designed to optimize specific high-level scheduling models. Specifically, we treat the low-level Executor as a frozen, plug-and-play component that provides verifiable feedback, and focus our training exclusively on the Coordinator and State Tracker. Through our staged optimization strategy, the Coordinator is trained to handle strategic planning and task decomposition, effectively decoupling high-level reasoning from low-level execution; meanwhile, the State Tracker is trained to act as dynamic memory, directly solving the lack of task state awareness by maintaining a high-semantic state summary in natural language.
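The three-agent division of labor can be sketched as a control loop. This is a minimal stub, not the authors' implementation: the class interfaces, the environment API (`screenshot`, `apply`, `task_complete`), and the summary format are all assumptions made for illustration.

```python
class Coordinator:
    """High-level planner: decomposes the task into the next subtask."""
    def next_subtask(self, task, state_summary):
        # In CES this is a trained VLM; here a stub returning a string.
        return f"next subtask for '{task}' given: {state_summary}"

class Executor:
    """Frozen low-level model: grounds a subtask into GUI actions."""
    def run(self, subtask, screenshot):
        # Returns (actions, done_flag); stubbed for illustration.
        return [("click", 100, 200)], True

class StateTracker:
    """Compresses raw history into a natural-language state summary."""
    def update(self, summary, subtask, actions, done):
        status = "done" if done else "pending"
        return f"{summary} -> [{subtask}: {status}]"

def ces_episode(task, env, coord, execr, tracker, max_steps=10):
    """One episode: plan -> execute -> summarize, until success or budget."""
    summary = "task started"
    for _ in range(max_steps):
        subtask = coord.next_subtask(task, summary)
        actions, done = execr.run(subtask, env.screenshot())
        for action in actions:
            env.apply(action)
        summary = tracker.update(summary, subtask, actions, done)
        if env.task_complete():
            return True, summary
    return False, summary
```

The key property the sketch preserves is that the Executor never sees the full task or history: it receives only the current subtask, while the Coordinator and State Tracker own planning and memory.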

In summary, our main contributions are as follows:

• We build the CES multi-agent framework, featuring general-purpose, plug-and-play high-level components (Coordinator and State Tracker) that can integrate with various Executors and enhance their abilities.

• We introduce a State Tracker, whose core task is dynamic context compression and state summarization, effectively resolving the state-unawareness problem and maintaining the agent's logical coherence in long-horizon tasks.

• We propose a staged execution-feedback RL strategy. The core of this algorithm is to decouple high-level capabilities from low-level execution: it freezes a pre-trained Executor and uses the reward signals it generates to exclusively train the high-level Coordinator and State Tracker.

• Extensive experiments demonstrate that our method significantly enhances the long-horizon scheduling and state management capabilities of various Executor models and surpasses existing baselines.
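The execution-feedback idea, updating only the high-level policy from the frozen Executor's verifiable success signal, can be illustrated with a deliberately simplified REINFORCE loop. This is a stand-in, not the paper's algorithm: the "policy" is a toy softmax over two decomposition strategies, and the reward function merely mimics a frozen Executor that succeeds only when the task is decomposed.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def frozen_executor_reward(strategy):
    # Stand-in for rolling out the frozen Executor and checking
    # verifiable task success; its parameters are never updated.
    return 1.0 if strategy == "decompose" else 0.0

def train_coordinator(strategies, steps=500, lr=0.1, seed=0):
    """REINFORCE on the high-level policy only (a toy logit vector)."""
    rng = random.Random(seed)
    logits = [0.0] * len(strategies)
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(len(strategies)), weights=probs)[0]
        reward = frozen_executor_reward(strategies[i])
        baseline = sum(p * frozen_executor_reward(s)
                       for p, s in zip(probs, strategies))
        # Policy-gradient step: only the Coordinator's parameters move.
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * (reward - baseline) * grad
    return softmax(logits)
```

Under this setup the policy concentrates on the strategy the frozen Executor rewards, which is the shape of the execution-feedback signal, if not its scale or machinery.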

Traditional mobile intelligent assistants primarily relied on rule-based or intent-driven API calls or structured text representations [6,13,44]. However, these methods are often platform-specific [32] and difficult to generalize [23]. With the development of VLMs, research has shifted towards pure-vision settings [13,40,42]. In this pure-vision paradigm, the agent receives only screenshots and task descriptions as input, generating coordinate-based actions directly in pixel space [4

This content is AI-processed based on ArXiv data.
