An Efficient Training Pipeline for Reasoning Graphical User Interface Agents
📝 Original Info
- Title: An Efficient Training Pipeline for Reasoning Graphical User Interface Agents
- ArXiv ID: 2511.08172
- Date: 2025-11-11
- Authors: Not available (the provided text does not include author information)
📝 Abstract
Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, then removing misaligned examples, and finally selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baselines on benchmarks such as ScreenSpot, Multimodal-Mind2Web, and AndroidControl. These results demonstrate that principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.
💡 Deep Analysis
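The abstract names Group Relative Policy Optimization (GRPO) as the reinforcement learning regime. GRPO scores each sampled response relative to the other responses drawn for the same prompt, which removes the need for a separate value model. The sketch below illustrates only that group-relative normalisation step. The reward function (1.0 when a predicted click point lands inside the target bounding box), the function names, and the sample coordinates are all assumptions for illustration; the summary does not say which reward the authors used.

```python
from statistics import mean, pstdev


def point_in_box_reward(point, box):
    """Hypothetical grounding reward (an assumption, not taken from the paper):
    1.0 if the predicted click (x, y) lies inside box (x1, y1, x2, y2), else 0.0."""
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation is used here; eps avoids division
    by zero when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group: four sampled click predictions for one query.
target_box = (100, 200, 180, 240)
sampled_points = [(120, 220), (300, 50), (150, 210), (90, 230)]

rewards = [point_in_box_reward(p, target_box) for p in sampled_points]
advantages = group_relative_advantages(rewards)

for point, reward, adv in zip(sampled_points, rewards, advantages):
    print(f"point={point} reward={reward} advantage={adv:+.3f}")
```

Samples that hit the target receive a positive advantage and misses receive a negative one, so the policy update favours the better responses within each group. When all samples in a group score the same, every advantage is zero and that group contributes no learning signal.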
📄 Full Content
Reference
This content is AI-processed based on open access ArXiv data.