M$^2$-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining
Graphical User Interface (GUI) agents are pivotal to advancing intelligent human-computer interaction. Building powerful GUI agents requires large-scale annotation of high-quality user-behavior trajectory data (i.e., intent-trajectory pairs) for training. However, manual annotation and current GUI agent data-mining approaches typically face three critical challenges: high construction cost, poor data quality, and low data richness. To address these issues, we propose M$^2$-Miner, the first low-cost, automated mobile GUI agent data-mining framework based on Monte Carlo Tree Search (MCTS). To improve mining efficiency and quality, we present a collaborative multi-agent framework comprising InferAgent, OrchestraAgent, and JudgeAgent, which provide guidance, acceleration, and evaluation, respectively. To further enhance mining efficiency and enrich intent diversity, we design an intent-recycling strategy that extracts additional valuable interaction trajectories. Additionally, a progressive model-in-the-loop training strategy is introduced to improve the success rate of data mining. Extensive experiments demonstrate that a GUI agent fine-tuned on our mined data achieves state-of-the-art performance on several widely used mobile GUI benchmarks. Our work will be released to facilitate community research.
💡 Research Summary
The paper introduces M$^{2}$‑Miner, the first automated framework for mining high‑quality intent‑trajectory data for mobile GUI agents, built on Monte Carlo Tree Search (MCTS) and a collaborative multi‑agent architecture. The authors identify three major shortcomings of existing data collection methods: (1) manual annotation is prohibitively expensive, (2) current automated pipelines based on vanilla MCTS suffer from low efficiency because expansion is random and simulation‑based reward estimation is costly, and (3) the resulting datasets are “flat”, storing only a single successful path per intent, which limits richness and diversity.
M$^{2}$‑Miner addresses these issues by redefining the mining process as the construction of an intent‑trajectory tree, where each node stores a screenshot, action metadata, value estimate (Q), visit count (N), and task status (success, failure, or intermediate). The tree captures the entire exploration process rather than a single path.
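The per-node record described above can be sketched as a simple data structure. This is a hypothetical illustration: the paper does not publish its implementation, and all field and class names here are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Status(Enum):
    """Task status stored at each node of the intent-trajectory tree."""
    SUCCESS = "success"
    FAILURE = "failure"
    INTERMEDIATE = "intermediate"

@dataclass
class TrajectoryNode:
    """One node of the intent-trajectory tree (field names are illustrative)."""
    screenshot: bytes                       # raw screenshot of the GUI state
    action: Optional[dict] = None           # action metadata that led to this state
    q_value: float = 0.0                    # MCTS value estimate Q
    visits: int = 0                         # MCTS visit count N
    status: Status = Status.INTERMEDIATE    # success / failure / intermediate
    parent: Optional["TrajectoryNode"] = None
    children: List["TrajectoryNode"] = field(default_factory=list)

    def add_child(self, child: "TrajectoryNode") -> "TrajectoryNode":
        """Attach a child state reached by executing one action."""
        child.parent = self
        self.children.append(child)
        return child
```

Because every node keeps its children, the whole exploration tree survives mining, rather than only the single successful path.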
The core innovation is a three‑agent system that augments the standard MCTS phases:
- InferAgent – Given a selected node, InferAgent queries multiple multimodal large language models (MLLMs) to generate K candidate actions likely to achieve the target intent. It incorporates previously generated actions into the prompt to avoid redundancy and uses diverse models to ensure action-space coverage.
- OrchestraAgent – This agent merges equivalent actions, ranks the remaining candidates by their predicted likelihood of satisfying the intent, and assigns decreasing initial UCT values according to the ranking. By doing so, it steers the search toward promising branches and eliminates wasteful random expansions.
- JudgeAgent – Instead of performing costly roll-outs, JudgeAgent evaluates each newly expanded node directly from its screenshot. It classifies the node's status (success/failure/intermediate) and predicts a binary "valid/invalid" reward for intermediate nodes. This process-based reward dramatically reduces simulation overhead while still providing useful feedback for backpropagation.
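Taken together, the three agents replace the random expansion and roll-out simulation of vanilla MCTS. The control flow can be sketched as follows. This is a hedged illustration with stubbed agent calls: the function names, the dict-based node layout, and the decreasing prior schedule are assumptions, not the paper's actual implementation.

```python
import math

# Stub agents: in the real system these query MLLMs; here they are
# deterministic placeholders so the control flow can be demonstrated.
def infer_agent(state, k=3, seen=()):
    """InferAgent: propose k candidate actions, skipping ones already generated."""
    return [f"action_{i}" for i in range(k) if f"action_{i}" not in seen]

def orchestra_agent(candidates):
    """OrchestraAgent: merge duplicates and rank candidates by predicted promise."""
    return sorted(set(candidates))  # placeholder ranking

def judge_agent(state, action):
    """JudgeAgent: process-based reward from the screenshot (status + validity)."""
    return "intermediate", 1.0      # placeholder: a valid intermediate step

def uct(node, parent_visits, c=1.4):
    """Standard UCT score; unvisited nodes fall back to their assigned prior."""
    if node["N"] == 0:
        return node["prior"]        # initial UCT value set by OrchestraAgent
    return node["Q"] / node["N"] + c * math.sqrt(math.log(parent_visits) / node["N"])

def select(node):
    """Pick the child with the highest UCT score."""
    return max(node["children"], key=lambda ch: uct(ch, node["N"]))

def backpropagate(node, reward):
    """Propagate the process-based reward back to the root."""
    while node is not None:
        node["N"] += 1
        node["Q"] += reward
        node = node["parent"]

def expand(parent):
    """One expansion step guided by the three agents (no roll-out simulation)."""
    seen = [ch["action"] for ch in parent["children"]]
    ranked = orchestra_agent(infer_agent(parent["state"], seen=seen))
    for rank, action in enumerate(ranked):
        status, reward = judge_agent(parent["state"], action)
        child = {"state": parent["state"] + "/" + action, "action": action,
                 "Q": 0.0, "N": 0,
                 "prior": 1.0 - 0.1 * rank,  # decreasing initial UCT values by rank
                 "status": status, "children": [], "parent": parent}
        parent["children"].append(child)
        backpropagate(child, reward)
```

The key departures from vanilla MCTS are visible in `expand`: candidates come from ranked agent proposals instead of random sampling, and `judge_agent` supplies the reward directly, so no simulation phase is needed before backpropagation.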
Beyond improving efficiency, the authors propose an intent recycling strategy. After a mining episode finishes, every non‑primary path in the tree is passed through a dedicated filter. For each viable path, a new natural‑language intent is generated by an MLLM and verified by JudgeAgent. This yields multiple new intent‑trajectory pairs from a single initial intent, enriching the dataset’s diversity without additional interaction cost.
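The recycling loop can be sketched as follows. This is a hypothetical sketch: `generate_intent` and `verify` stand in for the MLLM and JudgeAgent calls, and the tree layout and function names are assumptions.

```python
def enumerate_paths(node, prefix=()):
    """Yield every root-to-leaf action sequence in the finished search tree."""
    path = prefix + ((node["action"],) if node["action"] else ())
    if not node["children"]:
        yield path
        return
    for child in node["children"]:
        yield from enumerate_paths(child, path)

def recycle_intents(root, primary_path, generate_intent, verify):
    """Turn viable non-primary paths into new intent-trajectory pairs."""
    pairs = []
    for path in enumerate_paths(root):
        if path == primary_path or not path:
            continue                     # skip the already-mined trajectory
        intent = generate_intent(path)   # MLLM proposes a natural-language intent
        if verify(intent, path):         # JudgeAgent confirms the pair is valid
            pairs.append((intent, path))
    return pairs
```

Because the tree is already built, each recycled pair costs only two model calls (intent generation and verification) and no additional device interaction.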
To further boost success rates, a progressive model‑in‑the‑loop training scheme is introduced. The mined data are used to fine‑tune the agents, and the updated models are fed back into the mining pipeline. Over successive iterations, the agents become better at predicting useful actions and evaluating states, leading to higher mining success and better generalization to unseen apps.
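The iteration described above amounts to a simple outer loop. The sketch below is a hedged illustration; `mine` and `fine_tune` are hypothetical stand-ins for the MCTS mining pipeline and the model-update step.

```python
def progressive_training(agents, seed_intents, rounds, mine, fine_tune):
    """Model-in-the-loop: mine with the current agents, fine-tune on the
    accumulated data, and feed the updated agents back into the pipeline."""
    dataset = []
    for _ in range(rounds):
        new_pairs = mine(agents, seed_intents)  # MCTS mining with current agents
        dataset.extend(new_pairs)
        agents = fine_tune(agents, dataset)     # updated models re-enter the loop
    return agents, dataset
```

Each round the agents get better at proposing and judging actions, so later rounds mine with a higher success rate than the first.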
Extensive experiments on several mobile GUI benchmarks (e.g., AndroidControl, GUI‑Odyssey) demonstrate that agents trained on M$^{2}$‑Miner data achieve state‑of‑the‑art performance, surpassing prior methods by a large margin. Ablation studies confirm that each component—InferAgent, OrchestraAgent, JudgeAgent, intent recycling, and progressive training—contributes significantly to the overall gains. Notably, the framework yields a 64× speed‑up in mining efficiency for tasks of length nine and produces a markedly more diverse intent distribution, as visualized with t‑SNE embeddings.
In summary, M$^{2}$‑Miner combines MCTS with a purpose‑built multi‑agent system, an intent‑recycling mechanism, and iterative model refinement to deliver low‑cost, high‑quality, and richly diverse training data for mobile GUI agents. This work provides a scalable data foundation that can accelerate research and deployment of intelligent, multimodal GUI agents across mobile platforms.