POINTS-GUI-G: GUI-Grounding Journey

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model’s success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source dataset formats alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.


💡 Research Summary

The paper “POINTS‑GUI‑G: GUI‑Grounding Journey” presents a comprehensive approach to building a state‑of‑the‑art GUI grounding model from the ground up, starting with a modest base model (POINTS‑1.5) that lacks strong spatial awareness. The authors argue that most prior work fine‑tunes already‑well‑trained vision‑language models (e.g., Qwen3‑VL) for GUI tasks, thereby overlooking the technical challenges of constructing a robust grounding pipeline. To address this, they develop POINTS‑GUI‑G‑8B, which achieves new best‑in‑class scores on four major benchmarks: ScreenSpot‑Pro (59.9), OSWorld‑G (66.0), ScreenSpot‑v2 (95.7), and UI‑Vision (49.9).

The contribution is organized around three pillars: refined data engineering, improved training strategies, and reinforcement learning with verifiable rewards (RL‑VR).
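The verifiability of rewards in GUI grounding, highlighted as the third pillar, comes from the fact that a prediction can be checked directly against the annotated target region. A minimal sketch of such a reward function (illustrative only; the paper does not specify its exact reward formulation, and the function name and signature here are assumptions):

```python
def grounding_reward(pred_x: float, pred_y: float,
                     bbox: tuple[float, float, float, float]) -> float:
    """Binary verifiable reward for GUI grounding.

    Returns 1.0 if the predicted click point lies inside the
    ground-truth bounding box (x0, y0, x1, y1), else 0.0.
    Coordinates are assumed to share one scale (e.g., normalized [0, 1]).
    """
    x0, y0, x1, y1 = bbox
    inside = (x0 <= pred_x <= x1) and (y0 <= pred_y <= y1)
    return 1.0 if inside else 0.0
```

Because the reward is a deterministic geometric check rather than a learned or human-judged signal, it is both cheap to compute and essentially noise-free, which is what makes RL attractive for this perception-intensive task.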

1. Refined Data Engineering
The authors first collect a large pool of open‑source GUI grounding datasets, which differ in coordinate scales (normalized vs. raw pixels), annotation formats (list‑based, tag‑based), and task definitions. They standardize all spatial annotations to a three‑decimal normalized coordinate format.

