ShowUI-$\pi$: Flow-based Generative Models as GUI Dexterous Hands

Abstract

ShowUI-$\pi$ is a lightweight flow-based generative model designed for GUI automation tasks that require continuous drag actions. Taking streamed visual observations of the screen as input, it efficiently generates continuous cursor trajectories that carry out a given query. The paper evaluates a diverse set of tasks: diagonal and horizontal resizing of PowerPoint text boxes, solving rotation CAPTCHAs, applying effects to video clips in Premiere, handwriting on a canvas, and sorting files into folders on an OS desktop. A drag is defined as a continuous interaction in which the cursor keeps moving after a click. By coupling these compound tasks with streaming visual observation, ShowUI-$\pi$ completes them with a success rate of 26.98% ± 4.8.

Detailed Summary

Analysis of the Paper “ShowUI-$\pi$: Flow-based Generative Models as GUI Dexterous Hands”

Summary:

The paper introduces ShowUI-$\pi$, a novel GUI agent designed to enable human-like manipulation control in graphical user interface (GUI) environments. This flow-based generative model focuses on continuous trajectory generation, overcoming the limitations of existing GUI agents that are typically fine-tuned language models limited to simple clicks or short drags.

Existing Challenges:

  • Automation Limitations: While GUI automation is crucial for enhancing productivity and reducing workload, developing effective agents remains a significant challenge. Traditional GUI agents often rely on fine-tuning language models, which restricts their capabilities to basic interactions like clicking or brief dragging actions.

Innovations of ShowUI-$\pi$:

  • Integrated Action Representation: ShowUI-$\pi$ integrates clicks and drags into a single continuous trajectory using (x, y, m) triplet sequences for cursor coordinates and mouse button states. This allows the model to handle both types of interactions without distinguishing between them.
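The unified representation above can be sketched as follows. The summary only specifies the (x, y, m) triplet; the field names, the normalized-coordinate convention, and the helper functions are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CursorState:
    """One step of a unified click/drag trajectory (illustrative schema)."""
    x: float  # cursor x-coordinate (assumed normalized to [0, 1])
    y: float  # cursor y-coordinate
    m: int    # mouse button state: 1 = pressed, 0 = released

def make_click(x: float, y: float) -> List[CursorState]:
    # A click is just a length-2 trajectory: press, then release,
    # at the same coordinates -- no separate action type needed.
    return [CursorState(x, y, 1), CursorState(x, y, 0)]

def make_drag(path: List[Tuple[float, float]]) -> List[CursorState]:
    # A drag presses at the first point, moves while held, and releases
    # at the last point -- the same triplet format as a click.
    states = [CursorState(px, py, 1) for px, py in path]
    states.append(CursorState(*path[-1], 0))
    return states
```

Because clicks and drags share one format, the model can emit a single stream of triplets and let the button state `m` determine which interaction is happening.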

  • Flow-based Action Generation: Unlike existing agents that predict tokenized actions from descriptions, ShowUI-$\pi$ uses a lightweight action expert to progressively predict cursor adjustments based on visual observations. Built on a transformer backbone, this expert generates stable and accurate trajectories through flow matching.
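Flow matching trains a network to regress a velocity field along an interpolation between noise and data, then integrates that field at inference time. A minimal NumPy sketch of the idea is below (rectified-flow variant); the linear interpolation schedule is an assumption, and the paper's transformer action expert is stubbed out by an oracle velocity field toward a known target.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(a0, a1, t):
    """Training pair: interpolated point a_t = (1 - t) * a0 + t * a1
    and the regression target for the velocity network, v* = a1 - a0."""
    return (1 - t) * a0 + t * a1, a1 - a0

def euler_sample(velocity_fn, action_dim=3, steps=10):
    """Integrate a (learned) velocity field from Gaussian noise to an
    action triplet with fixed-step Euler integration."""
    a = rng.standard_normal(action_dim)
    for k in range(steps):
        t = k / steps
        a = a + velocity_fn(a, t) / steps
    return a

# Sanity check: with the oracle conditional velocity field toward a
# known (x, y, m) target, Euler sampling recovers the target exactly.
target = np.array([0.25, 0.75, 1.0])
oracle = lambda a, t: (target - a) / (1 - t)
```

In the real model, `velocity_fn` would be the action expert conditioned on the streamed visual observation; the progressive Euler refinement is what lets it emit stable, smooth cursor adjustments.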

  • ScreenDrag Benchmark: The paper introduces the ScreenDrag benchmark, which includes 505 real drag operations across various domains and use cases. This benchmark evaluates continuous trajectory performance through offline open-loop evaluation (average trajectory error and endpoint accuracy) and online closed-loop evaluation (task success rate).
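The two offline open-loop metrics and the online closed-loop metric could be computed along these lines. The summary names the metrics but not their formulas, so the distance measure and the endpoint tolerance here are assumptions.

```python
import numpy as np

def average_trajectory_error(pred, gt):
    """Mean pointwise Euclidean distance between predicted and
    ground-truth cursor trajectories (each an array of [T, 2] points)."""
    diff = np.asarray(pred) - np.asarray(gt)
    return float(np.linalg.norm(diff, axis=1).mean())

def endpoint_accuracy(pred, gt, tol=0.05):
    """1.0 if the final predicted cursor position lands within `tol`
    (assumed: normalized screen units) of the true endpoint, else 0.0."""
    gap = np.linalg.norm(np.asarray(pred)[-1] - np.asarray(gt)[-1])
    return float(gap <= tol)

def success_rate(episode_outcomes):
    """Online closed-loop metric: fraction of episodes that reach the
    task goal when the agent interacts with the live GUI."""
    return sum(episode_outcomes) / len(episode_outcomes)
```

The offline metrics compare a generated trajectory against a recorded one; the success rate instead requires executing the policy in the environment, which is why the benchmark reports both.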

  • Training Data: A training dataset was constructed with 20,000 manually collected and synthesized drag trajectories. The data records UI states and dense coordinates for model learning.
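One training sample might look like the following. The summary only states that UI states and dense coordinates are recorded, so the field names and file paths are purely illustrative.

```python
# Hypothetical schema for one drag-trajectory training sample.
sample = {
    "instruction": "Resize the text box to fit the title",
    # Streamed UI states (screenshots) captured during the drag.
    "observations": ["frame_000.png", "frame_001.png", "frame_002.png"],
    # Dense (x, y, m) cursor triplets: press, move while held, release.
    "trajectory": [
        (0.42, 0.31, 1),
        (0.45, 0.33, 1),
        (0.50, 0.36, 0),
    ],
}
```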

Experimental Results:

ShowUI-$\pi$ demonstrated superior performance on the ScreenDrag benchmark compared to competitors, achieving an F1 score of 26.98 with only 450M parameters. This indicates effective learning even in complex drag operations.

Contributions:

  • Efficient Training Data Synthesis: The paper combines manually collected and synthesized data to create a large-scale training dataset that covers various drag operations.

  • ScreenDrag Benchmark Development: A specialized benchmark for continuous GUI tasks was developed, providing a standard for evaluating and comparing model performance accurately.

  • Improvement of Flow-based Generative Models: ShowUI-$\pi$ overcomes the limitations of existing flow-based models through flow matching training and directional normalization, enabling more stable and accurate trajectory generation.

Conclusion:

ShowUI-$\pi$ is a powerful tool that enables human-like manipulation control in GUI environments. This research sets a new direction for GUI agent development and promotes further studies towards more natural and efficient digital interactions.

This analysis highlights the innovative approach of ShowUI-$\pi$, its significant contributions to overcoming existing challenges, and its potential impact on future advancements in GUI automation technology.

