DynaWeb: Model-Based Reinforcement Learning of Web Agents

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents by interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment in which an agent policy can "dream," generating vast quantities of rollout trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering an efficient, safe way to scale up online agentic RL.


💡 Research Summary

The paper “DynaWeb: Model-Based Reinforcement Learning of Web Agents” tackles a fundamental bottleneck in training autonomous web agents: the high cost, inefficiency, and safety risks associated with interacting with the live internet during reinforcement‑learning (RL) exploration. The authors propose a model‑based RL (MBRL) framework, DynaWeb, that replaces most real‑world interactions with a learned web “world model” that can simulate realistic page states in response to agent actions.

World Model Design
The world model is an LLM (GPT‑oss‑120B) fine‑tuned to predict the state-change description Δ(oₜ, oₜ₊₁) given the current observation (represented as an accessibility tree) and an atomic browser action. Rather than generating the entire next page text, the model outputs a concise natural‑language description of what changes (e.g., "clicking 'Submit' submits the form and loads a new page"). This description is then applied to the current tree to obtain the next observation. Training data are drawn from the NNetNav dataset; a cleaning pipeline removes incomplete or inconsistent trajectories. The loss jointly maximizes the likelihood of the ground‑truth reasoning trace and the Δ description.
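The predict-then-apply step above can be sketched as follows. This is a minimal illustration, not the paper's code: `predict_delta` is a deterministic stub standing in for the fine‑tuned GPT‑oss‑120B call, and the `WebState` representation is a hypothetical simplification of an accessibility tree.

```python
from dataclasses import dataclass


@dataclass
class WebState:
    """Toy observation: a URL plus flattened accessibility-tree nodes."""
    url: str
    tree: list


def predict_delta(state: WebState, action: str) -> str:
    """Stand-in for the fine-tuned LLM world model: describe what the
    action changes, instead of regenerating the whole next page."""
    if action.startswith("click"):
        return f"{action} submits the form and loads a new page"
    return f"{action} leaves the page unchanged"


def apply_delta(state: WebState, delta: str) -> WebState:
    """Apply the described change to the current tree to obtain o_{t+1}."""
    if "loads a new page" in delta:
        return WebState(url=state.url + "/next", tree=["heading 'Next page'"])
    return state


# One imagined environment step: o_t + a_t -> delta -> o_{t+1}
state = WebState(url="https://shop.example", tree=["button 'Submit'"])
delta = predict_delta(state, "click 'Submit'")
next_state = apply_delta(state, delta)
```

In the real system both steps are LLM-driven; the point of the decomposition is that a short Δ description is far cheaper to generate (and easier to learn) than the full next-page text.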

DynaWeb Training Loop
During training, the agent policy (an LLM‑based policy πθ) interacts with the world model as if it were a web server, generating multi‑step imagined rollouts without any live HTTP requests. After each imagined episode, a model‑based self‑assessment module evaluates whether the task goal (specified by a natural‑language query) has been achieved, assigning a binary reward (0/1). These imagined trajectories are mixed with two other sources of experience: (1) real expert demonstrations sampled from existing training data, and (2) a limited amount of on‑policy real‑web interaction. The mixing ratio (e.g., 70 % imagined, 20 % expert, 10 % real) is drawn randomly per batch, preserving on‑policy signal while stabilizing learning against model bias. Policy updates use sequence‑level RL objectives (PPO‑style or REINFORCE with baselines) that are well‑suited to the sparse, terminal rewards typical of web tasks.
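The per-batch interleaving of the three experience sources can be sketched as below. This is an illustrative sketch under assumptions, not the authors' implementation: the 70/20/10 split mirrors the example ratio mentioned above, and `sample_batch` is a hypothetical helper.

```python
import random


def sample_batch(imagined, expert, real, batch_size, rng,
                 ratios=(0.7, 0.2, 0.1)):
    """Draw a training batch by randomly choosing, per example, which
    experience pool to sample from: imagined rollouts from the world
    model, real expert demonstrations, or live on-policy interaction."""
    pools = (imagined, expert, real)
    batch = []
    for _ in range(batch_size):
        pool = rng.choices(pools, weights=ratios, k=1)[0]
        batch.append(rng.choice(pool))
    return batch


# Toy trajectory pools; in practice each entry would be a full
# (observation, action, reward) trajectory with its 0/1 terminal reward.
rng = random.Random(0)
batch = sample_batch(["imagined"] * 8, ["expert"] * 8, ["real"] * 8,
                     batch_size=32, rng=rng)
```

Drawing the mix stochastically per batch (rather than fixing a schedule) keeps some on-policy and expert gradient signal in every update, which is what the summary credits for stabilizing learning against world-model bias.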

Experimental Evaluation
The authors evaluate DynaWeb on two challenging benchmarks, WebArena and WebVoyager, both containing multi‑step tasks such as shopping, account creation, and information retrieval. Baselines include state‑of‑the‑art open‑source agents (WebAgent‑R1, WebRL, WebDancer) trained with pure online RL. Results show consistent improvements: success rates rise from 68 % to 78 % on WebArena and from 62 % to 73 % on WebVoyager (roughly 10 percentage points absolute). Crucially, the number of real web calls during training drops by over 80 %, dramatically reducing cost and risk. The world model's Δ prediction accuracy reaches 92 %, and imagined rollouts enable 2.3× faster convergence compared to training with only real interactions.

Analysis of Strengths and Limitations
The paper’s key contribution is the integration of a high‑fidelity, LLM‑driven simulator into the RL training loop, turning imagined experience into first‑class data rather than a mere planning aid. The interleaving of expert demonstrations mitigates model‑drift and preserves on‑policy learning signals. However, the current world model only handles accessibility‑tree observations, limiting its ability to simulate complex JavaScript dynamics, asynchronous loading, or visual elements such as images and canvas graphics. The authors acknowledge these gaps and suggest future work on multimodal observations (e.g., screenshots + OCR) and more sophisticated dynamic modeling.

Conclusion
DynaWeb demonstrates that “learning by imagination” is a viable and scalable strategy for web agents. By training a dedicated world model, generating vast imagined rollouts, and blending them with a modest amount of real expert data, the framework achieves higher performance while slashing the expensive and risky live‑web interaction component. This work opens a path toward large‑scale, safe, and cost‑effective training of general‑purpose AI assistants that operate on the open web.

