ShowUI-$\pi$: Flow-based Generative Models as GUI Dexterous Hands
Abstract
ShowUI-$\pi$ is a lightweight flow-based generative model designed for GUI automation tasks that require continuous drag actions. Taking streaming visual observations of the screen as input, it efficiently generates continuous cursor trajectories corresponding to a given query. The paper experiments on a variety of tasks, including diagonally and horizontally resizing PowerPoint text boxes, solving rotation CAPTCHAs, applying effects to video clips in Premiere, handwriting on a canvas, and sorting files into folders on an OS desktop. A drag is defined as a continuous interaction in which the cursor keeps moving after a click. By coupling these composite tasks with streaming visual observation, ShowUI-$\pi$ performs them with a success rate of 26.98 % ± 4.8.
Detailed Summary
Analysis of the Paper “ShowUI-$\pi$: Flow-based Generative Models as GUI Dexterous Hands”
Summary:
The paper introduces ShowUI-$\pi$, a novel GUI agent designed to enable human-like manipulation control in graphical user interface (GUI) environments. This flow-based generative model focuses on continuous trajectory generation, overcoming the limitations of existing GUI agents that are typically fine-tuned language models limited to simple clicks or short drags.
Existing Challenges:
- Automation Limitations: While GUI automation is crucial for enhancing productivity and reducing workload, developing effective agents remains a significant challenge. Traditional GUI agents often rely on fine-tuning language models, which restricts their capabilities to basic interactions like clicking or brief dragging actions.
Innovations of ShowUI-$\pi$:
- Integrated Action Representation: ShowUI-$\pi$ integrates clicks and drags into a single continuous trajectory using (x, y, m) triplet sequences for cursor coordinates and mouse button states. This allows the model to handle both types of interactions without distinguishing between them.
- Flow-based Action Generation: Unlike existing agents that predict tokenized actions from descriptions, ShowUI-$\pi$ uses a lightweight action expert to progressively predict cursor adjustments based on visual observations. Built on a transformer backbone, this expert generates stable and accurate trajectories through flow matching.
- ScreenDrag Benchmark: The paper introduces the ScreenDrag benchmark, which includes 505 real drag operations across various domains and use cases. This benchmark evaluates continuous trajectory performance through offline open-loop evaluation (average trajectory error and endpoint accuracy) and online closed-loop evaluation (task success rate).
- Training Data: A training dataset was constructed with 20,000 manually collected and synthesized drag trajectories. The data records UI states and dense coordinates for model learning.
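The (x, y, m) action representation and flow-matching generation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the trajectory length, the coordinate normalization, and the oracle velocity field standing in for the learned action expert are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A drag is a sequence of (x, y, m) triplets: cursor coordinates
# (here assumed normalized to [0, 1]) and mouse-button state m.
T = 16                                       # trajectory length (assumption)
target = np.stack([np.linspace(0.2, 0.8, T),  # x: drag to the right
                   np.full(T, 0.5),           # y: constant height
                   np.ones(T)], axis=-1)      # m: button held down

# Flow-matching training pair: sample noise x0 and a time t, then the
# model's velocity v_theta(x_t, t) would be regressed onto the
# straight-line target velocity (x1 - x0).
x0 = rng.standard_normal(target.shape)
t = rng.uniform()
x_t = (1 - t) * x0 + t * target              # point on the probability path
v_target = target - x0                       # regression target for v_theta

# Inference: Euler-integrate the velocity field from noise to a trajectory.
# An oracle field (target - x) / (1 - t) stands in for the trained network,
# purely to show the integration loop.
def oracle_velocity(x, t):
    return (target - x) / max(1.0 - t, 1e-3)

x, steps = rng.standard_normal(target.shape), 10
for k in range(steps):
    x = x + oracle_velocity(x, k / steps) / steps

print(np.abs(x - target).max())  # residual is tiny after integration
```

With a learned velocity field in place of the oracle, the same Euler loop turns Gaussian noise into an executable cursor trajectory conditioned on the visual observation.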
Experimental Results:
ShowUI-$\pi$ demonstrated superior performance on the ScreenDrag benchmark compared to competitors, achieving an F1 score of 26.98 with only 450M parameters. This indicates effective learning even in complex drag operations.
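The offline open-loop metrics mentioned above (average trajectory error and endpoint accuracy) can be approximated with a short sketch. The point-matching scheme and the tolerance threshold are assumptions, since this summary does not specify ScreenDrag's exact definitions.

```python
import numpy as np

def trajectory_error(pred, ref):
    """Mean Euclidean distance between position-matched cursor points.

    A simplified reading of "average trajectory error"; the benchmark's
    actual resampling/alignment scheme may differ.
    """
    return float(np.linalg.norm(pred - ref, axis=-1).mean())

def endpoint_hit(pred, ref, tol=0.02):
    """Endpoint accuracy: final cursor position lands within `tol`
    (assumed threshold, normalized screen units) of the reference end."""
    return bool(np.linalg.norm(pred[-1] - ref[-1]) <= tol)

# Toy example: a straight reference drag and a uniformly offset prediction.
ref = np.stack([np.linspace(0.1, 0.9, 20), np.full(20, 0.4)], axis=-1)
pred = ref + 0.005

print(trajectory_error(pred, ref))  # 0.005 * sqrt(2) ≈ 0.00707
print(endpoint_hit(pred, ref))      # True: endpoint within tolerance
```

Averaged over a benchmark split, these two quantities give the offline open-loop numbers, while the online closed-loop evaluation instead replays the policy in the live UI and reports task success rate.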
Contributions:
- Efficient Training Data Synthesis: The paper combines manually collected and synthesized data to create a large-scale training dataset that covers various drag operations.
- ScreenDrag Benchmark Development: A specialized benchmark for continuous GUI tasks was developed, providing a standard for evaluating and comparing model performance accurately.
- Improvement of Flow-based Generative Models: ShowUI-$\pi$ overcomes the limitations of existing flow-based models through flow matching training and directional normalization, enabling more stable and accurate trajectory generation.
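Directional normalization is not defined in this summary; one plausible reading, sketched purely under that assumption, is rescaling per-step cursor displacements to unit length so that supervision emphasizes movement direction over magnitude:

```python
import numpy as np

def normalize_directions(traj, eps=1e-8):
    """Illustrative guess at directional normalization: convert a
    trajectory into unit-length step directions. The paper's actual
    formulation may differ."""
    deltas = np.diff(traj, axis=0)                      # per-step displacements
    norms = np.linalg.norm(deltas, axis=-1, keepdims=True)
    return deltas / np.maximum(norms, eps)              # unit direction vectors

traj = np.array([[0.0, 0.0], [0.3, 0.0], [0.3, 0.4]])
print(normalize_directions(traj))  # [[1, 0], [0, 1]]
```

Under this reading, short and long steps along the same heading produce identical targets, which would make the regression less sensitive to speed variation in the demonstrations.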
Conclusion:
ShowUI-$\pi$ is a powerful tool that enables human-like manipulation control in GUI environments. This research sets a new direction for GUI agent development and promotes further studies towards more natural and efficient digital interactions.
This analysis highlights the innovative approach of ShowUI-$\pi$, its significant contributions to overcoming existing challenges, and its potential impact on future advancements in GUI automation technology.