Revolutionizing GUI Grounding through Multi-Step Visual Reasoning

Reading time: 6 minutes

📝 Abstract

GUI grounding aims to align natural-language instructions with precise regions in complex user interfaces (UIs). While advanced MLLMs have demonstrated strong capabilities in visual GUI grounding, they still struggle with small or visually similar targets, and with ambiguity in real-world layouts. We argue that these limitations stem not only from the models’ inherent grounding capacity, but also from an overlooked underutilization of their existing reasoning potential. To address this, we present Chain-of-Ground (CoG), a training-free multi-step grounding framework that leverages MLLMs for iterative visual reasoning and refinement. Instead of relying on direct prediction, Chain-of-Ground enables the model to progressively reflect and adjust its hypotheses, achieving more accurate and interpretable localization. Our approach establishes a new state of the art on the ScreenSpot-Pro benchmark with 68.4% accuracy, surpassing the previous best by 4.8%. To evaluate real-world generalization, we introduce TPanel-UI, a dataset of 420 labeled industrial control panels featuring visual distortions such as blur and masking to test robustness. On TPanel-UI, Chain-of-Ground outperforms the SOTA MLLM Qwen3-VL-235B by 6.9%, demonstrating the effectiveness of multi-step, training-free grounding across real-world and digital interfaces. Together, these results point to a new direction for unlocking MLLMs’ grounding potential through structured, iterative refinement rather than additional training.

📄 Content

Grounding natural-language instructions to UI elements is fundamental for computer-use agents and multimodal systems [1,15,24,34,36,38,40,42,48], yet accuracy remains the main bottleneck to reliability and autonomy. Interfaces pack hundreds of lookalike controls whose meaning depends on structure and function, making precise grounding essential for downstream steps [9]. Still, grounding from pixels is hard: targets are small and crowded; icons are compositional; text is tiny or occluded; spatial cues span distant panes; appearance shifts with themes and resolutions; repeats and low contrast mislead; and control boundaries align poorly with pixels. Because single-step prediction often misaligns and induces errors, a stepwise procedure that reasons and exposes intermediate decisions yields a more robust and interpretable framework.

Existing GUI grounding methods range from classical detectors and segmentation networks [8,16,20,21,30,31] to multimodal LLMs such as GPT-5 [25], Qwen3-VL [33], Claude Sonnet 4.5 [28], Seed 1.5 VL [11], and InternVL3.5 [35]. While MLLMs are strong at reasoning and grounding, they still struggle with precise localization in cluttered scenes; visual prompting like SoM [43] helps alignment but depends on brittle segmentation, and finetuned MLLMs [4,6,9,10,29,44] improve accuracy yet remain limited in stability and interpretability. Recent iterative localization methods such as DiMo-GUI [37] and Iterative Narrowing [22] zoom screenshots into cropped regions (see Fig. 2) but sacrifice global context.

Motivated by these gaps, we introduce Chain-of-Ground (CoG), a purely visual framework designed to enhance grounding accuracy through iterative reasoning and reference feedback. CoG first anchors an initial guess, encodes this location as an explicit marker, and then progressively refines the prediction by re-evaluating the instruction in the context of the updated image. The framework produces an interpretable reasoning trace, allowing earlier guesses to be revisited and corrected. It comprises two core components: (1) iterative reasoning, where the model incrementally updates its prediction using both the current hypothesis and prior steps, over a fixed or adaptive number of iterations; and (2) reference feedback, where each predicted location is returned to the model as a visual or textual signal. We design this feedback to balance interpretability and precision, exploring different marker modalities (e.g., image overlays vs. text prompts) and scales (e.g., small vs. large marks).
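The iterative loop described above can be sketched in Python. This is an illustrative outline, not the paper's implementation: the callable `query_model`, the convergence tolerance, and the trace format are all assumptions standing in for an actual MLLM call that sees the screenshot annotated with the previous guess.

```python
# Hypothetical sketch of a Chain-of-Ground-style refinement loop.
# `query_model(instruction, history)` is an assumed stand-in for an
# MLLM call that receives the screenshot with the prior prediction
# rendered as a visual marker (reference feedback) and returns a new
# (x, y) hypothesis.

def chain_of_ground(instruction, query_model, max_steps=3, tol=5):
    """Iteratively refine a predicted (x, y) location.

    Returns the final point plus the full history of guesses, which
    serves as an interpretable reasoning trace. Refinement stops
    early once two consecutive guesses agree within `tol` pixels.
    """
    history = []
    point = query_model(instruction, history)  # initial anchor guess
    history.append(point)
    for _ in range(max_steps - 1):
        # Feed the current hypothesis back as reference feedback and
        # let the model re-evaluate the instruction in context.
        new_point = query_model(instruction, history)
        history.append(new_point)
        dx = abs(new_point[0] - point[0])
        dy = abs(new_point[1] - point[1])
        if max(dx, dy) <= tol:  # converged: stop refining
            break
        point = new_point
    return history[-1], history
```

The key design choice mirrored here is that every step sees the full image plus a marker, rather than a crop, so global context is preserved while earlier guesses can still be revisited and corrected.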

We evaluate our approach on the ScreenSpot-Pro benchmark, which features high-resolution professional GUIs with complex layouts and dense targets. Across multiple MLLM backbones, our reference-guided iterative grounding consistently outperforms direct one-step baselines and sets a new state of the art on the public leaderboard, surpassing the previous best model by 4.1% in grounding accuracy.

The method also reduces prediction instability across domains, and qualitative analysis shows that it reasons relationally by linking targets to contextual references rather than relying on surface-level visual matching.

Although our method achieves strong results on established computer GUI grounding benchmarks, we aim to extend its applicability to real-world industrial interfaces, where conditions are far more challenging. However, existing benchmarks rarely capture such complexity, leaving a gap between academic progress and deployment settings. To bridge this gap, we introduce TPanel-UI, a dataset of 420 labeled instances capturing real industrial control panels such as thermostats, instrument dashboards, and machinery interfaces. Unlike prior GUI datasets focused on virtual or synthetic environments, TPanel-UI emphasizes physical panels with dense layouts, metallic surfaces, variable lighting, and complex iconography representative of real operational settings. Specifically: (1) it includes 420 high-resolution panel images from 20 commercial brands with diverse layouts; (2) among them, 100 instances involve physical-button interactions and 320 correspond to touch-based interfaces; and (3) to stress test robustness, we add controlled degradations at multiple severities including blur, masking, exposure shifts, noise, and compression, yielding paired clean-degraded samples that preserve labels and enable rigorous robustness auditing. This enables systematic evaluation of grounding robustness and supports research on reference resolution in industry-oriented, safety-critical scenarios.
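The controlled degradations described for TPanel-UI (blur, masking, noise, and so on) can be illustrated with a minimal sketch. The function names, severity parameters, and plain list-of-lists grayscale representation below are assumptions for demonstration, not the dataset's actual generation pipeline.

```python
# Illustrative degradation operators of the kind used to build
# paired clean-degraded samples. Images are grayscale grids of
# 0-255 ints; each operator returns a new image and leaves the
# original (and its labels) untouched.
import random

def box_blur(img, radius=1):
    """Average each pixel with its (2*radius+1)^2 neighborhood."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) // len(vals)
    return out

def mask_region(img, x0, y0, x1, y1, fill=0):
    """Occlude a rectangular region, as in the masking degradation."""
    out = [row[:] for row in img]
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = fill
    return out

def add_noise(img, amplitude=10, seed=0):
    """Add bounded uniform noise; a fixed seed keeps pairs reproducible."""
    rng = random.Random(seed)
    return [[min(255, max(0, v + rng.randint(-amplitude, amplitude)))
             for v in row] for row in img]
```

Because each operator is deterministic given its parameters (and seed), the clean image and every degraded variant share the same ground-truth labels, which is what enables the paired robustness auditing described above.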

This work makes the following contributions.

  1. We revisited GUI grounding as a structured reasoning problem rather than a single-step recognition task. To this end, we introduced Chain-of-Ground (CoG), a reference-guided iterative grounding framework that enables the model to reflect progressively with contextual feedback.
  2. We proposed three complementary components

This content is AI-processed based on ArXiv data.
