Agentic Reward Modeling: Verifying GUI Agent via Online Proactive Interaction
Reinforcement learning with verifiable rewards (RLVR) is pivotal for the continuous evolution of GUI agents, yet existing evaluation paradigms face significant limitations. Rule-based methods suffer from poor scalability and cannot handle open-ended tasks, while LLM-as-a-Judge approaches rely on passive visual observation, often failing to capture latent system states due to partial state observability. To address these challenges, we advocate for a paradigm shift from passive evaluation to Agentic Interactive Verification. We introduce VAGEN, a framework that employs a verifier agent equipped with interaction tools to autonomously plan verification strategies and proactively probe the environment for evidence of task completion. Leveraging the insight that GUI tasks are typically “easy to verify but hard to solve”, VAGEN overcomes the bottleneck of purely visual observation. Experimental results on the OSWorld-Verified and AndroidWorld benchmarks demonstrate that VAGEN significantly improves evaluation accuracy compared to LLM-as-a-Judge baselines and further enhances performance through test-time scaling strategies.
💡 Research Summary
The paper addresses a fundamental bottleneck in the training of GUI‑automation agents: the provision of reliable, verifiable reward signals. Existing evaluation paradigms fall into two categories. Rule‑based verification relies on human‑written scripts or unit tests, which are accurate but do not scale to open‑ended tasks and require substantial engineering effort. LLM‑as‑a‑Judge approaches use large language models (e.g., GPT‑4o) to infer success from screenshots and textual logs, but they suffer from partial state observability: many success criteria (file creation, background process changes, registry updates) are invisible to a purely visual observer.
To overcome these limitations, the authors propose a paradigm shift called Agentic Interactive Verification. Instead of passive observation, a dedicated verifier agent is endowed with the same interaction capabilities as the actor agent and equipped with a suite of tools: screenshot retrieval, shell command execution, Python code execution, and direct GUI manipulation. This verifier can actively probe the environment to gather evidence that is otherwise hidden from view.
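The verifier's tool suite can be pictured as a thin wrapper around environment access. The sketch below is a hypothetical interface (class and method names are assumptions, not the paper's API) showing how shell and Python execution let the verifier query latent state, such as file creation or background processes, that a screenshot cannot reveal:

```python
import subprocess
import sys


class VerifierTools:
    """Hypothetical tool suite for the verifier agent (names are assumptions)."""

    def run_shell(self, command: str) -> str:
        # Probe latent system state, e.g. list files or running processes.
        out = subprocess.run(command, shell=True, capture_output=True, text=True)
        return out.stdout

    def run_python(self, code: str) -> str:
        # Execute an arbitrary logical check inside the environment.
        out = subprocess.run(
            [sys.executable, "-c", code], capture_output=True, text=True
        )
        return out.stdout


# Usage: check whether a file the actor was supposed to create actually exists.
tools = VerifierTools()
listing = tools.run_shell("ls -la /tmp")
check = tools.run_python("import os; print(os.path.isdir('/tmp'))")
```

Screenshot retrieval and direct GUI manipulation would be additional methods on the same object; they are omitted here because they depend on the concrete environment backend.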
The concrete implementation, VAGEN (Verification via Agentic Environment Interaction), follows a three‑stage progressive verification mechanism:
- Static Assessment – Using only the final screenshot and a concise operation summary (derived from the actor’s trajectory via an LLM), the verifier makes a quick decision. If the evidence is clear, verification terminates here.
- Visual Retrospection – When the static stage is inconclusive, the verifier inspects selected intermediate screenshots to collect indirect visual cues (E_visual).
- Dynamic Interaction – If uncertainty persists, the verifier invokes the interactive tools to query latent system state: shell commands to list files or processes, Python scripts for logical checks, and direct GUI actions (clicks, typing) to elicit system responses.
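The three stages above form an early-exit cascade: cheap checks run first, and expensive tool interaction is invoked only when needed. A minimal sketch of that control flow, assuming each stage is packaged as a callable returning a three-valued verdict (the stage implementations themselves would be LLM calls and are not shown):

```python
from enum import Enum


class Verdict(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNCERTAIN = "uncertain"


def progressive_verify(stages):
    """Run verification stages in order; stop at the first decisive verdict.

    `stages` is a list of zero-argument callables (static assessment,
    visual retrospection, dynamic interaction), ordered cheapest first.
    """
    for stage in stages:
        verdict = stage()
        if verdict is not Verdict.UNCERTAIN:
            return verdict
    # All stages inconclusive: conservatively report failure.
    return Verdict.FAILURE


# Usage: the static stage is inconclusive, so retrospection decides
# and the costly dynamic-interaction stage is never reached.
result = progressive_verify([
    lambda: Verdict.UNCERTAIN,  # static assessment: evidence unclear
    lambda: Verdict.SUCCESS,    # visual retrospection: cues confirm success
    lambda: Verdict.FAILURE,    # dynamic interaction: skipped
])
```

The early-exit structure is what makes the pipeline cost-effective: most trajectories are settled by the static stage, so tool calls are reserved for genuinely ambiguous cases.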
The trajectory memory consolidation step is crucial: the actor’s reasoning steps (state observation, sub‑goal analysis, action description) are compressed into a sequence of operations H, discarding noisy sub‑goal commentary and focusing the verifier on observable actions.
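In the paper this compression is LLM-driven; the rule-based stand-in below (with a hypothetical step schema) only illustrates the idea of discarding sub-goal commentary and keeping the observable action sequence H:

```python
def consolidate(trajectory):
    """Compress actor reasoning steps into a sequence of operations H.

    Each step is assumed to be a dict with 'observation', 'subgoal', and
    'action' fields (hypothetical schema). Only concrete actions are kept;
    observations and sub-goal commentary are discarded as noise.
    """
    return [step["action"] for step in trajectory if step.get("action")]


trajectory = [
    {"observation": "desktop visible", "subgoal": "open editor",
     "action": "double-click text_editor.desktop"},
    {"observation": "editor open", "subgoal": "plan next step",
     "action": None},  # pure reasoning step, no observable action
    {"observation": "editor open", "subgoal": "save file",
     "action": "press Ctrl+S"},
]
H = consolidate(trajectory)
# H == ["double-click text_editor.desktop", "press Ctrl+S"]
```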
Experiments were conducted on two benchmark suites: OSWorld‑Verified (desktop OS tasks) and AndroidWorld (mobile tasks). Using Claude‑Sonnet‑4.5 as the backbone LLM, VAGEN achieved a jump from 84.7% to 92.9% accuracy in class‑balanced OSWorld scenarios and from 85.3% to 93.4% in class‑imbalanced settings, outperforming the best LLM‑as‑a‑Judge baselines by 7–9 percentage points. Similar gains were reported on AndroidWorld, despite the reduced toolset (no shell or Python execution on mobile).
A further contribution is the test‑time scaling strategy: multiple verifier instances are run in parallel on the same trajectory, and their predictions are aggregated (majority vote or confidence‑weighted). This not only improves verification robustness but also provides a modest performance boost to the actor agent itself, suggesting a symbiotic training loop.
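The two aggregation rules mentioned above can be sketched in a few lines. This is an illustrative implementation, not the paper's code; the prediction formats are assumptions:

```python
from collections import Counter


def majority_vote(predictions):
    """Aggregate boolean success predictions from parallel verifier runs."""
    return Counter(predictions).most_common(1)[0][0]


def confidence_weighted(predictions):
    """Aggregate (is_success, confidence) pairs from parallel verifier runs.

    Each confident 'success' vote adds its confidence, each 'failure' vote
    subtracts it; the sign of the total decides the verdict.
    """
    score = sum(conf if ok else -conf for ok, conf in predictions)
    return score > 0


# Usage: three parallel verifiers disagree.
vote = majority_vote([True, True, False])
weighted = confidence_weighted([(True, 0.9), (False, 0.3), (False, 0.4)])
```

Note the two rules can disagree: in the example, two verifiers say failure, but the single high-confidence success vote outweighs them under confidence weighting.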
The paper’s strengths lie in (i) directly addressing partial observability by granting the verifier agency, (ii) designing a cost‑effective progressive pipeline that balances speed and thoroughness, and (iii) demonstrating substantial empirical improvements on realistic benchmarks. Limitations include heavy reliance on a large LLM for the verifier (which may be costly and sensitive to prompt design), reduced tool availability on mobile platforms, and additional computational overhead during verification.
Future work could explore lightweight verifier models, domain‑specific toolkits (e.g., Android ADB commands), and multi‑agent collaboration where several specialized verifiers share responsibilities. Extending VAGEN to continuous‑learning loops, where verified rewards feed back into actor training in real time, would further solidify its role as a cornerstone for safe, scalable GUI‑agent development.