Preprint
AGENT2WORLD: LEARNING TO GENERATE SYMBOLIC WORLD MODELS VIA ADAPTIVE MULTI-AGENT FEEDBACK
Mengkang Hu♠♡∗ Bowei Xia♠♢∗
Yuran Wu♠ Ailing Yu♠ Yude Zou♠ Qiguang Chen♣
Shijian Wang♡ Jiarui Jin♡ Kexin Li♢ Wenxiang Jiao♡ Yuan Lu♡ Ping Luo♠†
♠The University of Hong Kong ♡Xiaohongshu Inc. ♢UESTC ♣Harbin Institute of Technology
∗Equal contribution. Corresponding to mkhu@connect.hku.hk, pluo.lhi@gmail.com.
ABSTRACT
Symbolic world models (e.g., PDDL domains or executable simulators) are central to model-based planning, but training LLMs to generate such world models is limited by the lack of large-scale verifiable supervision. Current approaches rely primarily on static validation methods that fail to catch behavior-level errors arising from interactive execution. In this paper, we propose AGENT2WORLD, a tool-augmented multi-agent framework that achieves strong inference-time world-model generation and also serves as a data engine for supervised fine-tuning, by grounding generation in multi-agent feedback. AGENT2WORLD follows a three-stage pipeline: (i) a Deep Researcher agent performs knowledge synthesis via web search to address specification gaps; (ii) a Model Developer agent implements executable world models; and (iii) a specialized Testing Team conducts adaptive unit testing and simulation-based validation. AGENT2WORLD demonstrates superior inference-time performance across three benchmarks spanning both Planning Domain Definition Language (PDDL) and executable code representations, achieving consistent state-of-the-art results. Beyond inference, the Testing Team serves as an interactive environment for the Model Developer, providing behavior-aware adaptive feedback that yields multi-turn training trajectories. The model fine-tuned on these trajectories substantially improves world-model generation, yielding an average relative gain of 30.95% over the same model before training. Project page: agent2world.github.io.
1 INTRODUCTION
In recent years, researchers have explored symbolic world models, formal representations of an environment's dynamics and constraints that are widely used in model-based planning (Guan et al., 2023; LeCun, 2022; Craik, 1967). The task of symbolic world-model generation involves automatically synthesizing these models from natural language descriptions, eliminating the need for domain experts to manually design and specify complex rules and dynamics. Large language models (LLMs) (Guo et al., 2025; Zhao et al., 2023; Bai et al., 2023) have made this automation possible by combining two key capabilities: commonsense knowledge about how the world works, and code generation abilities that formalize this knowledge into executable representations (Chen et al., 2025a). However, learning to generate such models from natural language remains difficult: correctness is behavioral and execution-dependent, while large-scale, verifiable supervision is scarce.
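To make the target artifact concrete, below is a minimal, hypothetical sketch of an executable code world model for a toy appliance environment; the class, state variables, and action names are our own illustration, not drawn from any benchmark or from the paper's implementation.

from dataclasses import dataclass

@dataclass
class ToasterState:
    plugged_in: bool = False
    has_bread: bool = False
    temperature: int = 20  # ambient temperature

class ToasterWorldModel:
    """Toy executable world model: actions deterministically update the state."""
    def __init__(self) -> None:
        self.state = ToasterState()

    def plug_in(self) -> None:
        self.state.plugged_in = True

    def insert_bread(self) -> None:
        self.state.has_bread = True

    def press(self) -> bool:
        # Toasting succeeds only when both preconditions hold.
        if self.state.plugged_in and self.state.has_bread:
            self.state.temperature = 200
            return True
        return False

A planner can roll out action sequences against such a model to check whether a goal state (e.g., toasted bread) is reachable, which is what makes the representation useful for model-based planning.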
As illustrated in Figure 1, prior work in this domain largely follows two paradigms: (i) direct generation of symbolic world models, and (ii) scripted workflows that couple generation with iterative verification and repair. Across both PDDL-style domains (Guan et al., 2023; Hu et al., 2025a) and executable code world models (Dainese et al., 2024), the second paradigm typically couples generation with a pre-specified verification interface (e.g., parsers/planners/validators or fixed sets of evaluation trajectories). While such static validation improves syntactic validity, it misses behavior-level errors that only appear under interactive execution (e.g., inconsistent state updates or unreachable goals).
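As a purely illustrative example (ours, not the paper's), the following world model would pass a static check that merely verifies the required actions exist, yet an interactive rollout reveals an inconsistent state update that makes the goal unreachable.

class BuggyToaster:
    """Defines all required actions, so a static interface check passes,
    but press() clears has_bread before testing it, making success unreachable."""
    def __init__(self) -> None:
        self.plugged_in, self.has_bread, self.toasted = False, False, False

    def plug_in(self) -> None:
        self.plugged_in = True

    def insert_bread(self) -> None:
        self.has_bread = True

    def press(self) -> None:
        self.has_bread = False                   # inconsistent state update
        if self.plugged_in and self.has_bread:   # now always False
            self.toasted = True

# Simulation-style validation executes the intended plan and catches the bug:
wm = BuggyToaster()
wm.plug_in(); wm.insert_bread(); wm.press()
assert wm.toasted is False  # behavior-level error, invisible to static checks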
[Figure 1 graphic: the Direct, Workflow, and Agent (ours) paradigms contrasted on an example world model (power, bread, temperature preconditions), with benchmark scores on Text2World, CWMB, and Bytesized32.]
Figure 1: Comparison of AGENT2WORLD and previous world-model generation paradigms.
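To make the agentic paradigm in Figure 1 concrete, here is a minimal, hypothetical sketch of the three-stage loop; the callable interfaces, argument names, and feedback format are our own simplification rather than AGENT2WORLD's actual implementation.

from typing import Callable, Dict, List, Tuple

def agent2world_loop(
    task: str,
    deep_researcher: Callable[[str], str],        # (i) web-search agent
    model_developer: Callable[..., str],          # (ii) code-writing agent
    testing_team: Callable[[str, str], Dict],     # (iii) unit + simulation testers
    max_rounds: int = 3,
) -> Tuple[str, List[Tuple[str, Dict]]]:
    """Sketch of the research -> develop -> test loop with adaptive feedback."""
    spec = deep_researcher(task)                  # fill specification gaps
    world_model = model_developer(spec)           # first executable draft
    trajectory: List[Tuple[str, Dict]] = []       # multi-turn data for fine-tuning
    for _ in range(max_rounds):
        feedback = testing_team(world_model, spec)
        trajectory.append((world_model, feedback))
        if feedback.get("all_passed"):
            break
        world_model = model_developer(spec, feedback)  # revise with behavior-aware feedback
    return world_model, trajectory

Under this reading, the recorded (model, feedback) pairs are the multi-turn trajectories that the abstract describes reusing as supervision for fine-tuning.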
Furthermore, existing studies on generating symbolic world models with LLMs have primarily focused on training-free methods for one particular type of world model (Yu et al., 2025; Kong et al., 2025; Zhang et al., 2025), rather than fundamentally enhancing the world-modeling capabilities of the LLMs themselves.
In this paper, we propose AGENT2WORLD, a tool-augmented multi-agent framework that evaluates
and improves world models through interactive execution. Given a natural-language description,
AGENT2WORLD coordinates multiple LLM-based agents with access to external tools (e.g., web
retrieval and code execution) to iteratively produce an executable world model. At a high level,