See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch
Block-based programming environments such as Scratch play a central role in low-code education, yet the ability of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs through runtime tests within the browser environment. Extensive experiments across state-of-the-art multimodal language models and GUI agents reveal a substantial reasoning–acting gap, highlighting persistent challenges in fine-grained GUI manipulation despite strong planning capabilities.
💡 Research Summary
The paper introduces ScratchWorld, a novel benchmark designed to evaluate multimodal GUI agents on program‑by‑construction tasks within the Scratch block‑based programming environment. While large language models (LLMs) have been extensively benchmarked for text‑only code generation and repair, there has been little systematic study of agents that must interact with a graphical user interface to build executable programs. ScratchWorld fills this gap by providing 83 carefully curated tasks organized into four pedagogically motivated categories—Create, Debug, Extend, and Compute—derived from the Use‑Modify‑Create framework.
Each task includes a natural‑language instruction, an initial Scratch project (the starting state), a golden project (the correct solution), and an execution‑based evaluation script that runs the project in the Scratch VM to verify functional correctness. The benchmark employs a dual‑mode evaluation protocol. In Primitive mode, agents receive screenshots with indexed UI elements and must perform low‑level actions such as clicks, drags, and typing. This mode directly tests visuomotor control and spatial grounding. In Composite mode, agents are given high‑level semantic APIs (e.g., add_block, connect_blocks, delete_block) that abstract away the drag‑and‑drop mechanics, allowing the assessment of pure program‑logic reasoning. By comparing performance across the two modes, the benchmark isolates whether failures stem from reasoning deficits or from execution (GUI manipulation) errors.
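To make the Composite‑mode abstraction concrete, here is a minimal sketch of what such semantic APIs might look like. The function names `add_block`, `connect_blocks`, and `delete_block` come from the paper; everything else (the workspace class, block IDs, opcode strings) is a hypothetical toy model, not the benchmark's actual implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A single Scratch-style block with an opcode and an optional successor."""
    opcode: str
    inputs: dict = field(default_factory=dict)
    next: "Block | None" = None


class CompositeWorkspace:
    """Toy model of Composite-mode semantic APIs (internals are assumptions)."""

    def __init__(self):
        self.blocks: dict[str, Block] = {}
        self._counter = 0

    def add_block(self, opcode: str, **inputs) -> str:
        # Create a block and return its ID, skipping any drag-and-drop mechanics.
        block_id = f"b{self._counter}"
        self._counter += 1
        self.blocks[block_id] = Block(opcode, dict(inputs))
        return block_id

    def connect_blocks(self, parent_id: str, child_id: str) -> None:
        # Attach child directly below parent, as snapping would in the GUI.
        self.blocks[parent_id].next = self.blocks[child_id]

    def delete_block(self, block_id: str) -> None:
        # Remove a block and detach any parent pointing at it.
        target = self.blocks.pop(block_id)
        for block in self.blocks.values():
            if block.next is target:
                block.next = None

    def script_opcodes(self, root_id: str) -> list[str]:
        # Walk a script top-to-bottom and list its opcodes.
        out, cur = [], self.blocks[root_id]
        while cur is not None:
            out.append(cur.opcode)
            cur = cur.next
        return out


# An agent's Composite-mode plan for "move 10 steps when the flag is clicked":
ws = CompositeWorkspace()
hat = ws.add_block("event_whenflagclicked")
move = ws.add_block("motion_movesteps", STEPS=10)
ws.connect_blocks(hat, move)
print(ws.script_opcodes(hat))  # ['event_whenflagclicked', 'motion_movesteps']
```

The key design point is that an agent in this mode only has to decide *which* blocks to create and connect; in Primitive mode, each `connect_blocks` call would instead be a pixel-coordinate drag that must land inside the snap region.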
The authors describe a semi‑automated pipeline for constructing the benchmark: human experts design seed tasks, large language models (DeepSeek‑V3, GitHub Copilot) expand them, and a strict human‑in‑the‑loop verification ensures that each task’s instruction, golden solution, and test script are correct.
Extensive experiments were conducted with state‑of‑the‑art multimodal models (Claude‑Sonnet‑4.5, GPT‑4‑Vision, Gemini‑Pro) integrated into a generic GUI‑agent framework. In Composite mode, the best model achieved a 78.31 % success rate, demonstrating that these agents can plan and reason about Scratch programs effectively. In stark contrast, Primitive mode performance dropped to 14.46 % for the same model, revealing a substantial “reasoning‑acting gap.”
To diagnose the low Primitive‑mode success, the authors introduced two auxiliary benchmarks. The Single‑Step Drag Benchmark isolates the drag‑and‑drop primitive: even when the start point is provided, agents correctly locate the drop endpoint only 23–32 % of the time, indicating that precise spatial grounding for “where to drop” is the dominant failure mode. The Visual Perception QA Benchmark measures static visual understanding; agents score up to 90.5 % accuracy, showing that perception alone is not the bottleneck. Rather, the challenge lies in closed‑loop execution—continuously adjusting mouse movements based on dynamic visual feedback.
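The drop-endpoint failure mode above can be phrased as a simple hit-test metric: a drag succeeds only if the predicted drop point lands inside the target snap region. This is a hedged sketch of such a metric; the rectangular snap-region geometry and the example coordinates are assumptions for illustration, not details from the paper:

```python
def drop_hit(pred: tuple[float, float],
             target_box: tuple[float, float, float, float]) -> bool:
    """Return True if the predicted drop point (x, y) falls inside the
    target snap region (x_min, y_min, x_max, y_max)."""
    x, y = pred
    x0, y0, x1, y1 = target_box
    return x0 <= x <= x1 and y0 <= y <= y1


def drop_accuracy(preds: list[tuple[float, float]],
                  boxes: list[tuple[float, float, float, float]]) -> float:
    """Fraction of drag predictions whose endpoint lands in its snap region."""
    hits = sum(drop_hit(p, b) for p, b in zip(preds, boxes))
    return hits / len(preds)


# Toy example: three predicted drop points against the same snap region.
preds = [(105, 42), (10, 10), (120, 55)]
boxes = [(100, 30, 140, 60)] * 3
print(drop_accuracy(preds, boxes))  # 2 of 3 endpoints land inside the region
```

A metric like this captures only the static "where to drop" decision; the closed-loop difficulty the authors describe (adjusting the cursor mid-drag from visual feedback) would additionally require scoring the trajectory, not just the endpoint.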
The paper concludes that progress on Scratch‑style program‑by‑construction will depend less on improving logical planning and more on developing high‑precision, snap‑aware interaction policies. Suggested future directions include reinforcement‑learning‑based control for fine‑grained coordinate prediction, multimodal models that jointly learn perception and motor control, and dedicated Scratch simulators for large‑scale training.
Overall, ScratchWorld provides the first systematic, execution‑validated benchmark that separates “can you figure out the right program?” from “can you actually build it in the GUI?” and demonstrates that current multimodal agents excel at the former while still struggling dramatically with the latter. This insight is crucial for building AI assistants that can genuinely support young learners in low‑code environments by handling the tedious drag‑and‑drop actions that currently impose a high cognitive load.