An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

February 22, 2026

Reading time: 1 minute

...

📝 Original Info

Title: An Efficient Training Pipeline for Reasoning Graphical User Interface Agents
ArXiv ID: 2511.08172
Date: 2025-11-11
Authors: 정보 없음 (제공된 텍스트에 저자 정보가 포함되어 있지 않음)

📝 Abstract

Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, removing misaligned and then selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baselines on benchmarks such as ScreenSpot, Multimodal-Mind2Web, and AndroidControl. These results demonstrate that principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.

An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

📝 Original Info

📝 Abstract

💡 Deep Analysis

📄 Full Content

Reference

Table of Contents

Table of Contents

📝 Original Info

📝 Abstract

💡 Deep Analysis

📄 Full Content

Reference

Start searching

No results found