Reading time: 37 minutes

📝 Original Info

  • Title:
  • ArXiv ID: 2512.22336
  • Date:
  • Authors: Unknown

📝 Abstract

Symbolic world models (e.g., PDDL domains or executable simulators) are central to model-based planning, but training LLMs to generate such world models is limited by the lack of large-scale verifiable supervision. Current approaches rely primarily on static validation methods that fail to catch behavior-level errors arising from interactive execution. In this paper, we propose AGENT2WORLD, a tool-augmented multi-agent framework that achieves strong inference-time world-model generation and also serves as a data engine for supervised fine-tuning, by grounding generation in multi-agent feedback. AGENT2WORLD follows a three-stage pipeline: (i) a Deep Researcher agent performs knowledge synthesis via web search to address specification gaps; (ii) a Model Developer agent implements executable world models; and (iii) a specialized Testing Team conducts adaptive unit testing and simulation-based validation. AGENT2WORLD demonstrates superior inference-time performance across three benchmarks spanning both Planning Domain Definition Language (PDDL) and executable code representations, achieving consistent state-of-the-art results. Beyond inference, the Testing Team serves as an interactive environment for the Model Developer, providing behavior-aware adaptive feedback that yields multi-turn training trajectories. The model fine-tuned on these trajectories substantially improves world-model generation, yielding an average relative gain of 30.95% over the same model before training. Project page: agent2world.github.io.

📄 Full Content

In recent years, researchers have explored symbolic world models, formal representations of an environment's dynamics and constraints that are widely used in model-based planning (Guan et al., 2023; LeCun, 2022; Craik, 1967). The task of symbolic world-model generation involves automatically synthesizing these models from natural language descriptions, eliminating the need for domain experts to manually design and specify complex rules and dynamics. Large language models (LLMs) (Guo et al., 2025; Zhao et al., 2023; Bai et al., 2023) have made this automation possible by combining two key capabilities: commonsense knowledge about how the world works, and code generation abilities that formalize this knowledge into executable representations (Chen et al., 2025a). However, learning to generate such models from natural language remains difficult: correctness is behavioral and execution-dependent, while large-scale, verifiable supervision is scarce.

As illustrated in Figure 1, prior work in this domain largely follows two paradigms: (i) direct generation of symbolic world models, and (ii) scripted workflows that couple generation with iterative verification and repair. Across both PDDL-style domains (Guan et al., 2023; Hu et al., 2025a) and executable code world models (Dainese et al., 2024), the second paradigm typically couples generation with a pre-specified verification interface (e.g., parsers/planners/validators or fixed sets of evaluation trajectories). While such static validation improves syntactic validity, it misses behavior-level errors that only appear under interactive execution (e.g., inconsistent state updates or unreachable goals). Furthermore, existing studies on generating symbolic world models with LLMs have primarily focused on training-free methods for one particular type of world model (Yu et al., 2025; Kong et al., 2025; Zhang et al., 2025), rather than fundamentally enhancing the world-modeling capabilities of the LLMs themselves.

In this paper, we propose AGENT2WORLD, a tool-augmented multi-agent framework that evaluates and improves world models through interactive execution. Given a natural-language description, AGENT2WORLD coordinates multiple LLM-based agents with access to external tools (e.g., web retrieval and code execution) to iteratively produce an executable world model. At a high level, AGENT2WORLD consists of three stages (Figure 2): a Deep Researcher resolves underspecified details by gathering missing background knowledge, a Model Developer implements the world model in the target representation, and a Testing Team evaluates the resulting artifact under adaptive execution (unit tests and simulation-style evaluation) and returns structured feedback for repair. Unlike prior pipelines that rely on static validators or fixed test suites, AGENT2WORLD conditions evaluation on observed execution behavior and iteratively produces targeted checks, which expose behavior-level failures that predetermined checks often miss.

Most importantly, the same developer-tester interaction loop can be viewed as an interactive environment for the Model Developer, naturally producing multi-turn trajectories that capture how world models are revised under feedback and providing verifiable rewards. We use this property to turn inference-time feedback into training data for improving the Model Developer policy via verifier-guided rejection sampling (Section 3.2) and construct a dataset of 1,526 high-quality verified trajectories covering four distinct types of world models across diverse domains.

We conducted experiments on three benchmark datasets to evaluate the performance of AGENT2WORLD: (i) Text2World (Hu et al., 2025a) for PDDL-based domain generation, (ii) the Code World Models Benchmark (CWMB) (Dainese et al., 2024) for MuJoCo-style environment generation, and (iii) ByteSized32 (Wang et al., 2023) for reasoning-heavy text-game generation. First, we validated the inference-time performance of AGENT2WORLD with two different models: GPT-4.1-mini (Achiam et al., 2023) and Llama-3.1-8b (Grattafiori et al., 2024). Our results demonstrate that AGENT2WORLD consistently achieves state-of-the-art performance across all three benchmarks with both models, highlighting its robust capabilities in symbolic world-model generation regardless of the underlying model. Furthermore, we validated the effectiveness of AGENT2WORLD through training experiments: we performed supervised fine-tuning on the same Llama-3.1-8b model and observed consistent improvements across all benchmarks, indicating that the model learns to refine its world-model generation process effectively through iterative multi-agent feedback.

We investigate the problem of symbolic world-model generation from natural language. Given a textual description x, the objective is to synthesize an executable program WM that faithfully captures the dynamics and constraints of the environment. Such a program may take various forms, for instance, a specification in the Planning Domain Definition Language (PDDL) (Hu et al., 2025a; McDermott et al., 1998) or an implementation in Python (Dainese et al., 2024; Wang et al., 2023). Formally, an environment is defined by a set of predicates P_env, a set of actions A_env, and a transition function T_env : S_env × A_env → S_env, where S_env denotes the set of possible states. Semantically, WM encodes these components to represent the environment in an executable manner. We therefore define the task as a mapping F(x) = WM, where WM = ⟨P_env, A_env, T_env⟩ and F is a synthesis procedure that generates the world-model program from the natural language input x.
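For intuition, an executable world model in this sense is simply a program exposing predicates, actions, and a transition function. The toy sketch below illustrates the shape of such an artifact; the class name, state encoding, and action set are our own illustrative choices, not output of the framework.

```python
# Minimal sketch of an executable world model WM = <P_env, A_env, T_env>.
# The names and dynamics here are illustrative assumptions for exposition.

class GridWorldModel:
    """A toy 1-D grid: predicates() is P_env evaluated on the state,
    actions is A_env, and step() realizes the transition function T_env."""

    def __init__(self, size: int = 5):
        self.size = size          # number of cells
        self.state = 0            # agent position, an element of S_env

    @property
    def actions(self):            # A_env
        return ("left", "right")

    def predicates(self):         # P_env evaluated on the current state
        return {"at_left_edge": self.state == 0,
                "at_right_edge": self.state == self.size - 1}

    def step(self, action: str) -> int:   # T_env: S_env x A_env -> S_env
        if action not in self.actions:
            raise ValueError(f"unknown action: {action}")
        delta = -1 if action == "left" else 1
        self.state = min(max(self.state + delta, 0), self.size - 1)
        return self.state
```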

As shown in Figure 2, AGENT2WORLD Multi unfolds in three stages: (i) Knowledge Synthesis (§3.1.1): as outlined in Section 1, a key challenge in symbolic world-model construction arises from incomplete descriptions; for example, commonsense knowledge may be missing both from LLMs and from the given specifications. To address this limitation, we employ a Deep Researcher agent that interacts with external resources such as the internet or structured databases, thereby enriching the specification and producing an intermediate representation. (ii) World Model Generation (§3.1.2): at this stage, a developer agent equipped with a code-execution tool constructs the symbolic world model. The process is iteratively refined based on execution feedback, ensuring both correctness and executability. (iii) Evaluation-Driven Refinement (§3.1.3): we enhance semantic fidelity with two complementary test agents: one that generates unit tests to validate functional behavior, and another that simulates downstream usage to evaluate performance through trajectory-based testing. We also provide pseudocode in Algorithm 1.

We introduce a Deep Researcher agent designed to gather background knowledge and fill in missing details that are not explicitly provided in the world-model description. By leveraging external information sources, this agent not only compensates for potential knowledge gaps inherent in large language models but also enhances the factual reliability of world-model descriptions. Equipped with web search and retrieval tools, it iteratively retrieves the knowledge required for world-model construction from the internet and ultimately outputs a structured intermediate representation with the missing information completed.

After obtaining the comprehensive world-model description from the previous stage, the Model Developer takes this as input and generates a concrete implementation of the world model in the required formalism (e.g., PDDL or executable code). To support iterative refinement, the Model Developer is equipped with a sandboxed code-execution tool, enabling it to test and debug implementations in multiple rounds until the code is functional and consistent with the specification.

A key component of our approach is the refinement of a bug-free, code-based world model. Unlike prior works that rely on annotated gold trajectories (Dainese et al., 2024) or human feedback (Guan et al., 2023), our method is fully autonomous and does not require manual labels. More specifically, we introduce a two-agent Testing Team to evaluate and diagnose the generated models: (i) The Unit Tester conducts systematic, programmatic verification to validate the basic functionality of the generated world model. It automatically generates Pytest-style unit tests targeting the predicates, actions, and invariants specified in the world-model descriptions. (ii) The Simulation Tester evaluates the world model in a play-testing manner by attempting to perform tasks, explore actions, and issue queries within the environment. Specifically, it interacts with the environment in a ReAct-style (Yao et al., 2023) loop to collect trajectories for subsequent behavior and reward analysis, which uncovers execution-time failures such as unreachable goals, missing preconditions, or inconsistent state updates. Together, these agents produce a detailed test report that assesses the quality of the generated world model and provides fine-grained diagnostic signals on correctness, coverage, logical consistency, and compliance with physical requirements. Unlike prior methods such as Text2World (Hu et al., 2025a), which uses PDDL validators, or GIF-MCTS (Dainese et al., 2024), which relies on a fixed set of offline agent trajectories, our Testing Team dynamically synthesizes test cases based on the specific errors exhibited by each world model, enabling precision-guided debugging rather than generic checks. This adaptive feedback is propagated back to the Model Developer; if inconsistencies or failures are detected, the Model Developer revises the implementation, triggering another evaluation round. This loop continues until all checks are satisfied or a predefined convergence criterion is reached.
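As an illustration of the kind of checks the Unit Tester synthesizes, the sketch below shows two pytest-style tests against a hypothetical generated environment.py; the module path, method names, and the specific predicate are assumptions made for the example, not the framework's actual generated tests.

```python
# Illustrative pytest-style checks of the kind the Unit Tester could synthesize.
import math

from environment import Environment  # hypothetical module produced by the Model Developer


def test_step_returns_observation_reward_done():
    env = Environment(seed=0)
    env.reset()
    obs, reward, done = env.step(env.valid_actions()[0])
    assert isinstance(done, bool)
    assert math.isfinite(reward)      # invariant: rewards stay finite


def test_goal_predicate_false_in_initial_state():
    env = Environment(seed=0)
    env.reset()
    # "goal_reached" is a hypothetical predicate exposed by the generated model.
    assert env.predicates()["goal_reached"] is False
```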

So far, AGENT2WORLD Multi has been described as an inference-time multi-agent workflow in which a frozen backbone LLM plays all the agent roles. However, the interaction between the Model Developer and the Testing Team naturally defines a learning environment that can be used to train more effective world-model agents.

We view each role in AGENT2WORLD Multi as a tool-augmented LLM agent that interleaves generation with tool use. Let T = {t_1, . . . , t_m} be the available tools (e.g., code execution, testing, retrieval). At step t, the agent maintains a history h_t = (x, o_{≤t}, a_{<t}), where x is the task specification, o are tool observations (e.g., execution logs and test reports), and a are past tool calls or edits. A policy π_θ (implemented by an LLM) selects the next action a_t ∼ π_θ(· | h_t).

To connect this view to training, we model the Model Developer as an agent acting in an induced Markov Decision Process (MDP): M_MD = (S, A, P, R, γ). Here, the state s_t ∈ S concatenates the world-model specification with the diagnostics produced by the Testing Team; the action a_t ∈ A is a new implementation or a patch to the current world model. The transition P is realized by executing the candidate model in a sandbox and re-running the Testing Team to obtain updated diagnostics. The reward R(s_t, a_t) aggregates testing outcomes (e.g., unit-test pass rates and simulation performance), capturing both local correctness and downstream utility. This formalization lets us treat world-model synthesis as sequential decision making and naturally yields multi-turn interaction trajectories for training.
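The induced MDP can be pictured as the following interaction loop, in which the developer's proposal is executed in a sandbox, the Testing Team produces diagnostics, and the reward aggregates test outcomes. The three callables and the 50/50 reward weighting below are placeholders, not the framework's actual components.

```python
# Schematic developer-tester loop for the induced MDP M_MD = (S, A, P, R, gamma).
# `develop`, `run_sandbox`, and `run_testing_team` stand in for the Model
# Developer, the sandbox executor, and the Testing Team, respectively.

def rollout(spec, develop, run_sandbox, run_testing_team, max_turns=3):
    state = {"spec": spec, "diagnostics": None}          # s_t
    trajectory = []
    for _ in range(max_turns):
        candidate = develop(state)                       # a_t ~ pi_theta(. | s_t)
        logs = run_sandbox(candidate)                    # execute the candidate model
        report = run_testing_team(candidate, logs)       # unit tests + simulation
        reward = 0.5 * report["unit_pass_rate"] + 0.5 * report["sim_score"]
        trajectory.append((state, candidate, report, reward))
        state = {"spec": spec, "diagnostics": report}    # transition P via re-testing
        if report["unit_pass_rate"] == 1.0 and report.get("sim_ok", True):
            break                                        # convergence criterion
    return trajectory
```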

The agent-in-the-loop view naturally yields multi-turn interaction trajectories between the Model Developer and the Testing Team. We leverage this interaction as a data engine to construct training trajectories without manual labels via verifier-guided rejection sampling. Given a world-model specification x, we run AGENT2WORLD Multi to produce a sequence of developer proposals and feedback, τ = {(s_t, a_t, o_{t+1})}_{t=0}^{T−1}, where o_{t+1} contains execution logs and testing diagnostics. We define a verifier outcome V(τ) ∈ {0, 1} based on the final candidate model produced in τ. Concretely, V(τ) = 1 if the final world model (i) executes successfully in the sandbox and (ii) satisfies the Testing Team's evaluation, i.e., it passes the synthesized unit tests and meets simulation-based checks (when applicable); otherwise V(τ) = 0. Rejection sampling keeps only accepted trajectories, D_RS = {τ : V(τ) = 1}.

Intuitively, the Testing Team acts as a verifier that filters for executable and behaviorally consistent solutions, while preserving the intermediate repair steps. As a result, D_RS contains multi-turn training traces that teach a Model Developer policy to iteratively revise world models under execution-grounded feedback, rather than producing a single-shot solution.
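A minimal sketch of this verifier-guided filter is shown below: a trajectory is kept only if its final world model executes and passes the Testing Team's checks, while the intermediate repair turns are preserved. The trajectory field names are illustrative assumptions.

```python
# Verifier-guided rejection sampling over collected interaction trajectories.
# Field names such as "final_report" are assumptions for illustration.

def build_sft_dataset(trajectories):
    def verifier(traj) -> int:                                  # V(tau) in {0, 1}
        final = traj["final_report"]
        ok = final["executes"] and final["unit_tests_pass"]
        if final.get("simulation_applicable", False):
            ok = ok and final["simulation_pass"]
        return int(ok)

    # D_RS = {tau : V(tau) = 1}; intermediate repair turns stay inside each tau.
    return [traj for traj in trajectories if verifier(traj) == 1]
```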

In this section, we first describe the training dataset (§4.1), baselines (§4.2), and implementation details (§4.3), and then present experiments on three benchmarks: (i) Text2World (Hu et al., 2025a), (ii) the Code World Models Benchmark (CWMB) (Dainese et al., 2024), and (iii) ByteSized32 (Wang et al., 2023).

Our data construction follows a staged pipeline in which we (i) synthesize diverse world-model specifications spanning multiple representations (PDDL, MuJoCo-style environments, text games, and MCP-style tool environments); (ii) run AGENT2WORLD with gpt-4.1-mini to generate executable world models and record the full multi-agent interaction traces (including iterative repairs under execution feedback); and (iii) apply verification mechanisms, such as legality, executability, and compliance checks, combined with semantic quality assessments of completeness, rationality, behavioral diversity, and execution consistency. Only trajectories that meet these criteria are retained. More details are given in Appendix E. To reduce the risk of training-evaluation leakage, we avoid using any benchmark instances as prompts or templates for rewriting or augmentation. Instead, we follow the environment synthesis method in AgentGen (Hu et al., 2025b) and use the LIMA (Zhou et al., 2023) dataset as the inspiration corpus.

We compare AGENT2WORLD Multi against the following methods:

(i) Direct Generation (Direct): single-shot generation of the symbolic world model without tool use, external retrieval, or feedback. (ii) AGENT2WORLD Single: a single agent closes the loop by invoking code-execution, validator, and web-search tools for self-repair and information synthesis, without multi-agent specialization. (iii) Text2World (EC=k) (Hu et al., 2025a): directly uses large language models to generate PDDL-based world models and iteratively repairs them with planner/validator signals, where EC denotes the error-correction budget. (iv) WorldCoder (Tang et al., 2024): a plan-code-execute-repair search that scores and iteratively improves candidate programs using simulator/planner signals to select runnable hypotheses. (v) GIF-MCTS (Dainese et al., 2024): a macro-action MCTS that orchestrates Generate/Improve/Fix steps, guided by unit tests and trajectory-based feedback for code world-model synthesis; we also introduce an enhanced version of GIF-MCTS in which a Deep Researcher agent gathers additional research data. (vi) ByteSized32 baseline (Wang et al., 2023): the reference pipeline introduced by Wang et al. (2023). (vii) Best-of-N (Stiennon et al., 2020; Yu et al., 2025): a method that reasons over multiple samples and selects the best result. (viii) Self-Consistency (Wang et al., 2022): a multi-sample reasoning method that votes over the results to improve consistency in decision-making.

Table 1: Benchmark results on Text2World (Hu et al., 2025a). Following the reporting convention in Text2World, all metrics are presented as percentage scores (%).

We employ the OpenAI GPT-4.1-mini model via the official API and Llama-3.1-8b-instruct via the official Hugging Face repository. We set the decoding temperature to 0 and top_p to 1 for deterministic reproducibility. All agents operate within a ReAct (Yao et al., 2023) framework, following a "think → act (tool) → observe" loop for a maximum of 10 steps. The Deep Researcher agent uses the Serper API for web searching. We blocked several websites to ensure experimental integrity and prevent information leakage. Regarding the configuration of refinement turns, we set Text2World and ByteSized32 to 2 iterations and CWMB to 3 iterations based on the complexity of the environments.

For automated evaluation on the ByteSized32 benchmark, we use GPT-4o (Hurst et al., 2024) as the LLM evaluator. All experiments with gpt-4.1-mini are conducted on a CPU server without GPU acceleration. The experiments with llama-3.1-8b-instruct (including training and inference) are conducted on an 8×A100 server. Prompt examples can be found in Appendix G. For the training experiments, we use LlamaFactory (Zheng et al., 2024) to manage and execute the training procedure. We perform supervised fine-tuning (SFT) on the llama-3.1-8b models, truncating input sequences to a maximum length of 30,000 tokens. We train the model for 5 epochs with a learning rate of 1 × 10⁻⁶.

We evaluate the Planning Domain Definition Language (PDDL)-based world-model generation of AGENT2WORLD on Text2World (Hu et al., 2025a), which comprises 103 PDDL domains paired with natural language descriptions. The evaluation metrics are: (i) Executability: whether the generated PDDL can be parsed and validated; (ii) Structural Similarity: the normalized Levenshtein similarity; and (iii) Component-wise F1: the macro-averaged F1 of predicates (F1_PRED) and action components, including parameters (F1_PARAM), preconditions (F1_PRECOND), and effects (F1_EFF).
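For concreteness, the structural-similarity metric can be computed as a normalized Levenshtein (edit-distance) score over the raw PDDL strings; the sketch below reflects this common formulation and is not necessarily the benchmark's exact implementation.

```python
# Normalized Levenshtein similarity between generated and reference PDDL,
# treated as plain strings: 1 - edit_distance / max(len). An illustrative
# reading of the Text2World similarity metric, not its official code.

def normalized_levenshtein_similarity(a: str, b: str) -> float:
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))               # DP row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```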

Results. We can draw several conclusions from Table 1: (i) Direct Generation attains the highest Similarity (82.8) yet performs poorly on executability (45.5) and all component-wise F1s, underscoring that surface-level textual overlap is a weak proxy for runnable, semantically correct PDDL.

(ii) While agent-based methods achieve executability comparable to the reference solutions (e.g., AGENT2WORLD Single at 79.2 vs. Text2World EC=3 at 78.2), they exhibit substantial gaps in F1 scores (AGENT2WORLD Single: 69.6 vs. Text2World EC=3: 60.1). This suggests that while integrating validators for iterative correction can significantly improve syntactic validity, the semantic utility of the generated world models remains limited without comprehensive knowledge synthesis. (iii) AGENT2WORLD Multi achieves both the highest executability (+14.9 points over Text2World EC=3) and superior F1 performance (+15.3 points), demonstrating the synergistic benefits of multi-agent specialization. These patterns align with our design philosophy: knowledge synthesis combined with evaluation-driven refinement steers the model to recover the correct predicate inventory and logical gating constraints, producing domains that are both syntactically valid and semantically solvable, even when surface-level representations diverge from reference implementations. (iv) Based on the Llama-3.1-8b-instruct model, AGENT2WORLD Multi (SFT) significantly improves executability (+16.9 points over AGENT2WORLD Multi) and F1 performance (+6.4 points over AGENT2WORLD Multi). This demonstrates that our SFT data effectively reduces the model's errors in PDDL code and enhances its understanding of task descriptions, improving the semantic structure of the generated code. In terms of structural similarity, the fine-tuned model scores 81.9, nearly equivalent to GPT-4.1-mini (81.0), indicating that the generated PDDL is structurally close to the reference standard.

Table 2: Benchmark results on CWMB. † We adopted the official implementation of GIF-MCTS (Dainese et al., 2024) and their reimplementation of WorldCoder (Tang et al., 2024). GIF-MCTS⋆ denotes the enhanced version that we obtain by connecting the Deep Researcher agent to the original GIF-MCTS pipeline.

The CWMB (Dainese et al., 2024) evaluates the ability of generated executable code to serve as faithful world models across 18 MuJoCo-style environments. It measures both the predictive accuracy of next-state dynamics and the normalized return (R) when the model is used by a planner, where R reflects the gap between a random policy and an oracle planner with the true environment. This setup ensures CWMB jointly assesses the correctness of the simulation code and its practical utility for downstream control.

Results. Table 2 reveals several key findings. (i) All methods demonstrate superior performance in discrete spaces compared to continuous settings, reflecting the inherent difficulty of modeling continuous dynamics. (ii) Workflow-based approaches consistently outperform both Direct Generation and AGENT2WORLD Single, indicating that LLMs' native world-model generation capabilities are limited and require expert-designed iterative refinement to achieve competitive performance. (iii) AGENT2WORLD Multi establishes new state-of-the-art results, surpassing the previous best method GIF-MCTS by +0.132 R points in overall normalized return. Notably, while other methods achieve comparable predictive accuracy (e.g., 0.917 vs. 0.914 in discrete settings), our simulation-based testing framework significantly enhances the downstream utility of generated world models, demonstrating that accurate next-state prediction alone is insufficient for effective model-based planning.

The ByteSized32 (Wang et al., 2023) benchmark consists of 32 reasoning-heavy text games, each implemented as an executable Python environment. Models are required to generate runnable game code that captures task-specific dynamics, objects, and rules, allowing direct interaction and evaluation. The benchmark evaluates four dimensions: Technical Validity (whether the code runs), Specification Compliance (whether all required elements are present), Winnability (whether the task can be completed), and Physical Reality Alignment (whether the environment dynamics are consistent with commonsense constraints). This setting emphasizes both logical fidelity and practical executability, making it a stringent testbed for language models as world-model generators.

Results. AGENT2WORLD Multi (SFT) further improves ByteSized32 performance by enhancing executability and specification compliance. It significantly improves the "runnable game" score (0.0849 points higher than AGENT2WORLD Multi) and greatly improves compliance on "key actions" and "distractors" (0.2944 points higher), thereby increasing the "winnable game" rate (0.0388 points higher). The training data improves the model's understanding of the task and of execution-based iterative debugging, reducing missing necessary components and producing more runnable implementations.

Several conclusions can also be drawn from the ablation results in Tables 3 and 4: removing components leads to substantial drops, including a decrease of 0.3926 for discrete spaces, and although removing the Simulation Tester results in the smallest overall performance drop, the reward R still decreases by 0.2615 and 0.1473 for discrete and continuous spaces, respectively. These results collectively validate our design choices and highlight the complementary nature of the three components.

To quantify the effect of AGENT2WORLD Multi and the refinement procedure, we perform instance-level pairwise comparisons, recording a Win-Tie-Loss (WTL) outcome according to each benchmark's primary metric: (i) F1_AVG for Text2World; (ii) R for CWMB; and (iii) the mean of all official metrics for ByteSized32. As shown in Figure 4, the left panel contrasts the final-turn model with its first-turn counterpart. Refinement yields consistent gains on CWMB and ByteSized32 (68.8% and 93% wins, respectively, with no losses) and largely preserves performance on Text2World while delivering occasional improvements (14% wins vs. 7% losses). The right panel compares AGENT2WORLD Multi against previous state-of-the-art systems. AGENT2WORLD Multi attains clear advantages across all three benchmarks, most notably on ByteSized32 (87% wins) and CWMB (66.7% wins).

To understand the dynamics of performance improvement through iterative feedback, we analyze how model performance evolves as the number of Testing Team feedback iterations increases across the evaluation benchmarks. The results reveal several key patterns: (i) Text2World shows rapid initial improvements; notably, execution-based metrics improve substantially while similarity measures remain stable, suggesting that refinement enhances functional correctness rather than surface-level similarity. (ii) CWMB demonstrates sustained improvement across iterations, reflecting the compound complexity of physics simulation, where numerical accuracy and dynamics must be jointly optimized. (iii) ByteSized32 exhibits the most dramatic gains, with several metrics showing step-function improvements that reflect the discrete nature of game-logic debugging.

We conducted a manual error analysis to examine how error patterns evolve throughout the refinement process of AGENT2WORLD Multi. Taking CWMB in Figure 6 as an example, the initial turn predominantly exhibits superficial errors such as signature-arity mismatches and representation mismatches, stemming from inadequate adherence to world-model specifications. Throughout the iterative refinement process, these surface-level inconsistencies are systematically eliminated, with the error landscape shifting toward more fundamental dynamics mismatches in later iterations. This pattern is remarkably consistent across all benchmarks: refinement shifts the error distribution from form-oriented problems (syntax, arity) to substance-oriented challenges (dynamics, state transitions), as shown in Figures 7 and 8. The systematic progression from surface to substance reflects the hierarchical nature of world-model correctness and validates our multi-turn refinement architecture. We also provide the detailed proportion of each error type in Appendix F.

To quantify the cost of multi-agent specialization, we analyze the token consumption reported in Appendix K. Compared to AGENT2WORLD Single, AGENT2WORLD Multi incurs a higher computational cost during the generation phase. This increase stems from the proactive testing loop, in which agents autonomously generate unit tests and simulation trajectories to diagnose errors. However, this additional cost is a finite upfront investment rather than a recurring inefficiency: it is incurred only once, during the synthesis of the world model. In exchange, the framework secures a lasting gain in the quality of the generated artifact (e.g., raising the normalized return R from 0.3419 to 0.4811 on CWMB).

World Models. World models are widely applied in reinforcement learning, robotics, and autonomous systems for planning and related tasks (Hao et al., 2023; Ha & Schmidhuber, 2018). Generally, there are two types of world models: (i) neural world models, which employ neural networks to approximate dynamics (Ha & Schmidhuber, 2018; Hafner et al., 2019), and (ii) symbolic world models, which represent dynamics and constraints in a formal, executable form and are the focus of this work.

Large Language Model-based Agent. In recent years, benefiting from the rapid advancement of large language models (LLMs), LLM-based agents have emerged as powerful systems that accept natural-language user intentions as input and achieve goal states through planning and sequential decision-making (Hu et al., 2024; Yao et al., 2023; Schick et al., 2023). These autonomous agents have demonstrated remarkable effectiveness across diverse applications, ranging from web navigation (Yao et al., 2022; Nakano et al., 2021; Wang et al., 2025) and software development (Qian et al., 2023; Hong et al., 2024) to scientific research (Lu et al., 2024; Chen et al., 2025b) and robotic planning (Huang et al., 2022). A prominent example is ReAct (Yao et al., 2023), which synergizes reasoning and acting in language models by interleaving thought, action, and observation steps. Existing research has explored how world models can assist LLM-based agents in planning, such as RAP (Hao et al., 2023), which uses Monte Carlo Tree Search with world models for improved reasoning, and Guan et al. (2023), which leverages pre-trained LLMs to construct world models for model-based task planning. These approaches primarily focus on utilizing existing world models rather than generating them. Similarly, recent work has investigated how world models can enhance the training of LLM-based agents, as demonstrated by AgentGen (Hu et al., 2025b) and Kimi-K2 (Team et al., 2025). To the best of our knowledge, our work represents the first systematic investigation into using autonomous agents for world-model generation, bridging the gap between agent-based problem solving and symbolic world modeling.

Table 6 provides a side-by-side comparison of the evaluated benchmarks: Text2World (Hu et al., 2025a), the Code World Models Benchmark (CWMB) (Dainese et al., 2024), and ByteSized32 (Wang et al., 2023). "Type" denotes the target representation (PDDL vs. executable code). Metrics are shown at the family level. A detailed explanation of each metric is presented in Appendix C.2.

F1 scores. Range: [0, 1] (higher is better). When Exec = 1, we parse both the generated and gold PDDL into structured representations and report macro-averaged F1 for the following components: Predicates (F1_PRED), Parameters (F1_PARAM), Preconditions (F1_PRECOND), and Effects (F1_EFF). We use the standard definition F1 = 2PR / (P + R), where P and R denote precision and recall, respectively.
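As an illustration, component-level F1 can be computed by exact set overlap per component and then macro-averaged; the component keys and exact-match treatment below are simplifying assumptions, not the benchmark's precise matching procedure.

```python
# Set-overlap precision/recall/F1 for one PDDL component, macro-averaged across
# the four component families. Exact matching is a simplifying assumption.

def f1(pred: set, gold: set) -> float:
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0     # precision
    r = tp / len(gold) if gold else 0.0     # recall
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def macro_f1(pred_components: dict, gold_components: dict) -> float:
    keys = ["predicates", "parameters", "preconditions", "effects"]
    return sum(f1(pred_components[k], gold_components[k]) for k in keys) / len(keys)
```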

Prediction Accuracy. Symbol: Acc_pred. Range: [0, 1] (higher is better). Definition: we use the same accuracy metric as in the evaluation phase of GIF-MCTS (Sec. 4). Given a validation set of transitions, the accuracy uniformly weights the correctness of the predicted next state, reward, and termination signal.
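A minimal sketch of such a uniformly weighted accuracy is shown below; the exact-match rule for rewards and termination and the L2 tolerance for continuous states are assumptions made for illustration, and the benchmark's matching rule may differ.

```python
import numpy as np

# Uniformly weighted prediction accuracy over next state, reward, and
# termination. Matching rules here are illustrative assumptions.

def prediction_accuracy(preds, golds, state_tol=1e-3):
    scores = []
    for (s_hat, r_hat, d_hat), (s, r, d) in zip(preds, golds):
        s_ok = float(np.linalg.norm(np.asarray(s_hat, dtype=float)
                                    - np.asarray(s, dtype=float)) <= state_tol)
        r_ok = float(r_hat == r)
        d_ok = float(d_hat == d)
        scores.append((s_ok + r_ok + d_ok) / 3.0)   # uniform weighting
    return float(np.mean(scores)) if scores else 0.0
```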

Normalized Return. Symbol: R. Range: unbounded (higher is better; R > 0 means better than random; R → 1 approaches the oracle). Definition: R = (R(π) − R(π_rand)) / (R(π*) − R(π_rand)), where R(π) denotes the return obtained by planning with the generated world model, π_rand is the environment's random policy, and π* is the oracle planner with access to the true environment. Protocol: as in the original setup, we use vanilla MCTS for discrete action spaces and CEM for continuous action spaces; R(·) is averaged across a fixed number of episodes per environment (10 in the original), and R(π_rand) uses the environment's random-policy baseline.
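Given per-episode returns for the three policies, the normalized return can be computed as a direct transcription of the definition above.

```python
# Normalized return: 0 for the random policy, 1 for the oracle planner on the
# true environment; inputs are lists of per-episode returns.

def normalized_return(model_returns, random_returns, oracle_returns):
    r_model = sum(model_returns) / len(model_returns)
    r_rand = sum(random_returns) / len(random_returns)
    r_oracle = sum(oracle_returns) / len(oracle_returns)
    return (r_model - r_rand) / (r_oracle - r_rand)
```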

Technical Validity. Range: [0, 1]. Measured in the order of API calls, such that failure of an earlier function implies failure of subsequent tests. Game initialization is evaluated once at the beginning of the game, whereas GENERATEPOSSIBLEACTIONS() and STEP() are evaluated at every step. We check:

• Game initialization: the game/world initializes without errors;

• Valid actions generation: the routine that enumerates valid actions for the current state returns without errors (verified via a bounded path crawl);

• Runnable game: a bounded-depth crawl of trajectories executes without errors.

Specification Compliance. Range: [0, 1]. An LLM acts as the judge for true/false compliance against the task specification. The prompt provides the task spec {GAME_SPEC}, the game code {GAME_CODE}, and an evaluation question {EVAL_QUESTION}; the LLM is instructed to first output Yes/No and then a brief rationale. To reduce variance, we use a fixed prompt template and perform multiple independent runs with majority-vote/mean aggregation. We report three sub-measures: Task-critical objects, Task-critical actions, and Distractors.
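A minimal sketch of the majority-vote aggregation described above is given below; ask_judge is a placeholder for the actual LLM call with the fixed prompt template.

```python
# Majority-vote / mean aggregation over repeated yes/no LLM-judge runs for one
# Specification Compliance question. `ask_judge` stands in for the LLM call.

def compliance_score(game_spec, game_code, question, ask_judge, runs=5):
    votes = [ask_judge(game_spec, game_code, question) for _ in range(runs)]
    yes = sum(1 for v in votes if str(v).strip().lower().startswith("yes"))
    return yes / runs      # fraction of "Yes" judgments (mean of binary votes)
```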

Physical Reality Alignment. Range: [0, 1]. Automatic evaluation proceeds in two stages:

(1) Trajectory generation: perform a breadth-first crawl using the action strings returned by GENERATEPOSSIBLEACTIONS() at each step; actions are grouped by verb (first token) and expanded in a bounded manner. If an error occurs, the error message is recorded as the observation, and the search continues.

(2) Sampling and judgment: group paths by the last action verb, draw a fixed-size subsample approximately balanced across groups, and submit each path, together with the task description {GAME_TASK}, to an LLM for a binary judgment (yes/no; errors are treated as failures). The final score is the fraction of paths judged aligned.

Winnability. Range: [0, 1]. A text-game agent (LLM agent) attempts to reach a terminal win within horizon H; we report the fraction of tasks deemed winnable. Given the limited agreement between automatic and human assessments for this metric, we prioritize human evaluation in the main results and use the automatic estimate as an auxiliary reference.

We visualize the error patterns during evaluation-driven refinement on Text2World and ByteSized32 in Figure 7 and Figure 8.

Detailed proportions of error types on Text2World, CWMB, and ByteSized32 are presented in Table 9, Table 10, and Table 11, respectively.

As shown in Table 12, both WorldCoder and GIF-MCTS still exhibit several residual dynamics-related failures in their final outputs: although they no longer trigger signature- or schema-mismatch errors, they retain multiple dynamics-error and judgment-bug cases, together with non-trivial invariant-violation counts. This suggests that their search procedures are effective at ironing out interface- and type-level inconsistencies, but are less successful at fully aligning the learned dynamics with the intended environment specification.

In contrast, AGENT2WORLD Multi starts from a broader spectrum of error types at turn 1 (including signature/schema mismatches and a larger number of dynamics errors), and progressively suppresses them over subsequent refinement rounds. By turn 3, schema mismatches and nondeterministic behaviors have disappeared, signature mismatches are reduced to a single case, and dynamics errors, judgment bugs, and invariant violations are all substantially lower than those of the baselines. This trajectory is consistent with the qualitative analysis in Section 5.4: the Testing Team first drives the Model Developer to resolve structural and specification-level issues, and then focuses on subtle dynamics and invariant failures, ultimately producing world models that are both syntactically robust and semantically faithful to the target environments.

- Integrate the findings into a concise evidence summary with citations.

-When sources conflict, explain the differences and justify the chosen resolution (related to version locking).

• Refinement and Improvement (Specification Patch)

- Generate a structured "diff": action/observation space; rewards; termination/truncation; timing (dt/frame_skip); seeding and certainty; numerical tolerances; dependencies; interface flags.

• Formalization and Finalization (Ready-to-Use Specification)

- Write the final specification in the required output format, including the public API, core logic, usage scenarios, and a verification plan aligned with metrics and statistical validation.

• Review and Self-Correction (Compliance Check)

- Verify conformance to the <OUTPUT_CONSTRAINTS>, version consistency, SI units, ISO dates, and the inclusion of any code.

• Strictly adhere to the structure defined in <PLANNING_STRUCTURE>.

• Do NOT output runnable code definitions (classes, functions). You may only include short illustrative snippets or pseudo-code.

• All claims about industry standards or common practices MUST be supported by citations.

• Use ISO-8601 dates (e.g., 2025-09-02).

• Use SI units for physical and mathematical quantities.

• Data-leakage rule: Do not access, copy, quote, or derive from raw source code in the OpenAI/Gym/Gymnasium repositories or similar code repositories. Do not include any repository code in the output. Prefer official documentation, standards, papers, or reputable secondary sources. If the only available evidence is a code repository, summarize behavior without copying code and mark it as an inference with risks.

<PLANNING_STRUCTURE>

• Your output must begin with this planning and analysis section.

• Ambiguity Analysis -List each ambiguity/vagueness/conflict and mark Impact: High / Medium / Low.

-Cover at least: missing numeric value, missing unit, missing boundary/range, time-sensitive items, unclear references, open lists (“etc.”/“e.g.”), conflicts, and missing citation.

• Investigation Plan -For each High/Medium item, provide one atomic question.

- For each question, provide 1-2 executable queries including: synonyms/abbreviations, a site filter to authoritative domains, and a time window (e.g., after:2024-01-01 or "past 2 years").

- State the minimum evidence policy: High/Medium → ≥ 2 credible sources; if disagreement → add ≥ 1 more for triangulation.

• For any formula, define all symbols, units, and applicability constraints.

• Cite the source of the formula immediately after its definition.

• Provide the complete formula rather than a descriptive explanation.

<OUTPUT_FORMAT>

• Please provide the final specification document structured as follows. This is the primary deliverable. Do NOT include code.

• The cliff occupies cells [3, 1..10] and the goal is at [3, 11]. The player can occupy all cells in the first 3 rows plus the bottom-left cell [3, 0].

• The start state is 36 (row=3, col=0), and the goal state is 47 (row=3, col=11).

• The action space is Discrete(4) with actions: 0 (up), 1 (right), 2 (down), 3 (left).

• Each step yields a reward of -1. Stepping into the cliff yields a reward of -100 and resets the player to the start state; the episode continues.

• The episode terminates only when the player reaches the goal state.

• The set_state method is not explicitly documented but is inferred to accept any valid non-terminal, non-cliff state and sets the environment to not done.

• No explicit error handling for invalid actions is documented; valid inputs are assumed.

• The environment is deterministic.

Spec Patch

• action.space = Discrete(4), values {0,1,2,3}, shape (1,)

• observation.space = Discrete(48), integer in [0,47], representing flattened grid index: row * 12 + col

• valid states = all grid cells except cliff cells ([3,1..10]) and goal ([3,11])

• start_state = 36 (row=3, col=0)

• goal_state = 47 (row=3, col=11)

• reward.step = -1 per step

- Episode ends when s′ = goal_state.

• Symbol Table: s: state (int), flattened grid index, 0 ≤ s ≤ 47, s not in cliff or goal.

a: action (int), in {0 : up, 1 : right, 2 : down, 3 : left}.

r: reward (float), -100 or -1.

done: boolean, True if s = goal_state.

-start_state: 36 (int).

-goal_state: 47 (int).

-cliff_cells: set of ints corresponding to grid positions [3, 1..10].

nrows: 4 (int).

ncols: 12 (int).

• Assumptions:

-No shaping rewards.

-Deterministic environment.

-No truncation or max step limit.

Final Specification

• Environment Name -CliffWalking-v0

• Public API

- init()

- set_state(state: int)

- step(action: int) → (observation: int, reward: float, done: bool)

• Core Logic Description -init:

• set_state accepts only valid non-terminal, non-cliff states; no explicit error handling for invalid states.

• No explicit error handling for invalid actions; assumed caller responsibility.

• Reward and reset behavior on cliff stepping inferred from documentation; no episode termination on cliff.

• Observation space includes all grid cells but player never occupies cliff or goal states.

• Potential future Gymnasium version changes may alter behavior.

• Gymnasium API conventions inform environment interface: discrete action and observation spaces, step and reset semantics.

• No new runtime dependencies introduced.

• RNG seeding not specified; environment deterministic.

• No source code copied; specification based on official Gymnasium documentation and Sutton & Barto RL book.

• Numpy indexing conventions used for grid flattening.

</Research Report>

1. Deconstruct Specification: Carefully review the <Research Report> to fully understand the environment's specification, including state/action spaces, dynamics, reward function, and termination conditions.

2. Physics Engine Selection: Evaluate if the task requires physics simulation. If so, choose an appropriate physics engine for the specific task requirements.

3. Model Design: If using a physics engine, design the model structure and embed it as needed in the Python file.

4. Plan Class Structure: Outline the 'Environment' class, including its internal state variables, helper methods, and the public interface ('init', 'reset', 'set_state', 'step').

5. Implement Complete Code: Write the full implementation of the 'Environment' class.

6. Self-Correction Review: Meticulously check that the generated code fully complies with the specification, the interface requirements, and all constraints.

7. Finalize Output: Present the complete, reviewed, and runnable single-file code in the specified final format.

1. Interface (single file):

• Implement a complete, self-contained Python class Environment with:

- init(self, seed: int | None = None)

- reset(self, seed: int | None = None) → ndarray (reinitialize the episode and return the initial observation in canonical shape)

- set_state(self, state) (must accept ndarray or list/tuple in canonical shape)

- step(self, action) → tuple[ndarray, float, bool] (returns: observation, reward, done)

• Requirements:

-Single-file constraint: all code, including any model definitions, must be contained in one Python file.

• Provide reproducibility via seed (constructor and/or seed(int) method).

• Normalize inputs: accept equivalent representations (e.g., NumPy scalar/int/len-1 array) and convert to a canonical form.

• Validate inputs; raise clear one-line errors (ValueError/TypeError) on invalid shapes or ranges.

  1. Dynamics (MCTS/control oriented):

• For physics-based tasks, prefer suitable physics simulation methods with embedded model definitions over custom physics implementations.

• Choose and document an integration scheme (e.g., implicit integrator, explicit Euler) consistent with the research report.

• Use a stable time step dt; clamp to safety bounds; keep all values finite (no NaN/Inf).

• Keep per-step computation efficient and allocation-light.

• No Gym inheritance or external RL frameworks unless explicitly allowed.

• Allowed: third-party libraries as needed (e.g., NumPy, physics engines, SciPy, Numba, JAX, PyTorch, etc.).

• For robotics/physics tasks, physics engines with embedded model definitions are recommended over custom implementations.

• Clean, readable code suitable for RL experimentation.

• All dependencies must be importable standard libraries or commonly available packages.
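To make the interface contract and the CliffWalking research report above concrete, here is a minimal single-file sketch of a conforming Environment (Discrete(48) observations encoded as row * 12 + col, rewards of -1 per step and -100 with reset-to-start on the cliff, termination only at the goal). It is an illustrative sketch of the specification, not code produced by the framework.

```python
import numpy as np

# Illustrative Environment following the CliffWalking spec patch above.
# Written as a sketch under the stated assumptions, not the generated artifact.

class Environment:
    NROWS, NCOLS = 4, 12
    START, GOAL = 36, 47                    # row=3,col=0 and row=3,col=11
    CLIFF = set(range(37, 47))              # grid cells [3, 1..10]

    def __init__(self, seed: int | None = None):
        self._rng = np.random.default_rng(seed)   # deterministic env; RNG unused
        self._state = self.START

    def reset(self, seed: int | None = None) -> np.ndarray:
        if seed is not None:
            self._rng = np.random.default_rng(seed)
        self._state = self.START
        return np.array([self._state])

    def set_state(self, state) -> None:
        s = int(np.asarray(state).reshape(-1)[0])          # canonicalize input
        if not (0 <= s < self.NROWS * self.NCOLS) or s in self.CLIFF or s == self.GOAL:
            raise ValueError("invalid non-terminal, non-cliff state")
        self._state = s

    def step(self, action):
        a = int(np.asarray(action).reshape(-1)[0])          # canonicalize input
        if a not in (0, 1, 2, 3):
            raise ValueError("action must be in {0, 1, 2, 3}")
        row, col = divmod(self._state, self.NCOLS)
        drow, dcol = [(-1, 0), (0, 1), (1, 0), (0, -1)][a]  # up, right, down, left
        row = min(max(row + drow, 0), self.NROWS - 1)
        col = min(max(col + dcol, 0), self.NCOLS - 1)
        nxt = row * self.NCOLS + col
        if nxt in self.CLIFF:                               # fall off: -100, reset, continue
            self._state = self.START
            return np.array([self._state]), -100.0, False
        self._state = nxt
        return np.array([self._state]), -1.0, self._state == self.GOAL
```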

<code_file_path> The entrypoint file path of the generated code. </code_file_path>

<entrypoint_code>
```python
# Your complete, runnable single-file implementation here.
```
</entrypoint_code>

</Output Format>

Unit Tester

Cliff walking involves crossing a gridworld from start to goal while avoiding falling off a cliff.

## Description …

</TASK DESCRIPTION> {code}

• Do not modify the student’s source file.

• Create exactly one pytest file at “tests/test_env.py” using file_tool(“save”).

• Import the module from “environment.py” via importlib (spec_from_file_location + module_from_spec).

• Run tests with code_tool(“run”, “pytest -q”); capture exit_code, duration, and stdout/stderr tail.

• class Environment can be imported and instantiated, e.g., Environment(seed=0).

• Contract:

1. set_state accepts list/tuple/ndarray of the same logical content (convert to canonical).

2. step(action) returns a 3-tuple: (observation, reward, done) with expected types/shapes.

3. Determinism: with the same seed and same initial state, the first step with the same action yields identical outputs.

4. Action space validation: actions within bounds are accepted, out-of-bounds actions are handled gracefully.

5. Observation space validation: observations match declared space bounds and shapes.

6. State space consistency: internal state dimensions match expected environment specifications.

• Acceptance: success iff pytest exit_code == 0 (all tests pass).

• Summarize pytest results in 2-4 sentences; mention the first failing nodeid/assert if any.

• Provide a brief contract coverage assessment and the most probable root cause for failures.

• If failing, add 1-3 concise actionable fixes (no long logs).

• Return exactly one block containing a single JSON object that matches PytestReport:

{ "success": true|false, "analysis": "<2-4 sentence summary/diagnosis>", "suggest_fix": "<1-3 bullets with minimal actionable changes>" }

No extra text outside the block. No additional code fences.

{ "success": false, "code_result": "", "analysis": "", "suggest_fix": "" }

• Seeding is only via reset(seed=...); no separate seed method.

• Risks include dependency on MuJoCo simulation correctness and numerical stability.

• No explicit handling of invalid actions beyond clipping.

• No rendering or visualization included.

• Uses Python standard library and NumPy only.

• NumPy used for array operations, clipping (np.clip), norm calculations (np.linalg.norm), and finite checks (np.isfinite).

• No external dependencies beyond MuJoCo simulation backend (assumed).

• All units are SI: torque in N•m, angles in radians, positions in meters, time in seconds.

• No source code copied; specification based on official documentation and API standards.

I CASE STUDY

I.1 TEXT2WORLD

Task Description. Prepare and deliver sandwiches to a set of children: assemble sandwiches in the kitchen (optionally gluten-free), place them on a tray, move the tray to the child's location, and serve subject to allergy constraints. Gluten-allergic children must receive a gluten-free sandwich; non-allergic children may receive any sandwich. Serving requires the sandwich on the tray and the tray at the child's location; making a sandwich switches it from "notexist" to "prepared." The goal is that all children become served.

Prev. SOTA.

Analysis. Compared to the baseline domain, our Child-Snack formulation introduces three task-aligned modifications that improve state consistency, compositionality, and plan feasibility. (i) Creation-valid effects. During sandwich construction we flip the existence status from "nonexistent" to "prepared," and record gluten-free status when applicable, thereby avoiding contradictory postconditions at creation time; this yields deterministic successor states and reduces backtracking caused by ill-defined truth values. (ii) Serve-focused effects. During serving we only transfer the item off the tray and mark the child as served, leaving the waiting label untouched; this separation of concerns prevents nonessential side-effects, preserves modular composability with downstream routines (e.g., queueing or follow-up allocation), and promotes goal-monotonic progress on the served objective. (iii) Permissive-serving preconditions. For non-allergic children we do not exclude gluten-free items, weakening preconditions to accept any admissible sandwich; this enlarges the feasible search space and prevents avoidable dead-ends when only gluten-free inventory remains, while safety for allergic children is still enforced via a dedicated gluten-free serving action. Collectively, these choices align with the ground-truth specification, produce cleaner state transitions, and yield empirically favorable search dynamics (smaller inconsistent-state frontiers and fewer spurious deletions), resulting in a more robust make → put-on-tray → move-tray → serve pipeline for the objective of "serving each child an acceptable sandwich."

I.2 CWMB

Task Description. Control a 3D Ant (one free-body torso with four 2-DoF legs; nine bodies, eight hinge joints) to move forward along the +x axis by applying torques to the eight joints at each step. The action space is Box([-1, 1]^8) (joint torques). Observations list positions then velocities (27-D by default; 29-D if the current x, y are included), and optionally +84 contact-force terms when use_contact_forces=True or version < v4. The reward is r = r_healthy + r_forward − ctrl_cost (and − contact_cost if contact forces are used), where r_forward ≈ Δx/Δt is positive for motion in +x and Δt = frame_skip × 0.01 = 0.05 by default. Episodes start from an upright, slightly noisy state, truncate at 1000 steps, and (by default) terminate early if the agent becomes unhealthy (non-finite state or torso z ∉ [0.2, 1.0]).

Prev SOTA vs Agent2World.

I.3 BYTESIZED32

Task Description. We build a lightweight, text-interactive micro-simulation of pea growth in a small garden. The world contains a Pea, a FlowerPot, a Jug, and a Sink; water is represented as scalar levels in the Jug and FlowerPot and as an internal level in the Pea. The agent can look/examine, take/put objects, switch the sink on/off, fill the jug from the sink (effective only when the sink is on), and pour water from the jug into the flower pot. After each action, a tick advances processes: the sink supplies water if on; the pot passively transfers its water to the pea; and the pea consumes water and progresses from seed → sprout → young plant → mature → reproducing when sufficiently hydrated for several consecutive ticks. Episodes start with an unplanted pea and an empty pot; the goal is to plant the pea and water it repeatedly until it reaches the reproducing stage.

Prev SOTA vs Agent2World.

(iii) Game-style and MuJoCo-style Tasks: Additionally, we include 50 tasks from each of the game-style and MuJoCo-style domains, where the tasks involve dynamic simulations with continuous action spaces. These tasks are designed to mimic real-world, high-dimensional environments in which agents must navigate, plan, and interact with physical systems. The tasks are synthesized using a method similar to AgentGen (Hu et al., 2025b).

For each of these datasets, we perform the following steps to build our training set:

(i) For each dataset, we generate 3 distinct world-model rollouts using the same llama-3.1-8b-instruct model. These rollouts are generated by running the Model Developer agent through each task, resulting in different candidate solutions for each problem. (ii) Reward Filtering: We then evaluate these rollouts using the Testing-Team feedback mechanism, which includes unit tests, simulation performance, and control tasks. The rollouts are ranked based on their reward scores, and we select the one with the highest reward as the final solution for the task.
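The reward-filtering step amounts to scoring each rollout with the Testing Team and keeping the top-scoring one; a minimal sketch is shown below, with testing_team_score standing in for the actual feedback mechanism (unit tests, simulation performance, and control tasks).

```python
# Reward filtering over multiple candidate rollouts: score each candidate and
# keep the highest-reward one. `testing_team_score` is a placeholder callable.

def select_best_rollout(rollouts, testing_team_score):
    scored = [(testing_team_score(r), r) for r in rollouts]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # highest reward first
    best_score, best_rollout = scored[0]
    return best_rollout, best_score
```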

Figure 3: Ablation Study on CWMB.

Scripted workflows, such as Hu et al. (2025a) and Tang et al. (2024), utilize executors and validators to generate feedback, while GIF-MCTS (Dainese et al., 2024) leverages gold experiences as feedback signals. Compared to these scripted workflows, the AGENT2WORLD paradigm introduced in this paper can more flexibly adjust subsequent strategies based on feedback signals.


We introduced AGENT2WORLD, a tool-augmented multi-agent framework for symbolic world-model generation that combines knowledge synthesis via web search, world model development with iterative refinement, and evaluation-driven testing through unit tests and simulation. Experimental results demonstrate consistent state-of-the-art performance across three world-model generation benchmarks of different types. By enabling fully autonomous world model generation without human feedback or manual annotations, this work opens new possibilities for AI systems that can reliably understand and formalize complex environments from natural language.

A side-by-side comparison of the evaluated benchmarks in this paper is presented in Table 6.


Detailed experimental results of the ablation study are presented in Table 7.

E MORE DETAILS ON DATA CONSTRUCTION


class Container(GameObject):
    def __init__(self, name):
        super().__init__(name)
        self.props["isContainer"] = True

    def place(self, obj):
        if not obj.get("isMoveable"):
            return ("Can't move that object.", False)
        self.add(obj)
        return ("OK.", True)

For example, the original Text2World and ByteSized32 Hugging Face pages, the CWMB source code, and the OpenAI Gym code repository are blocked.


This content is AI-processed based on open access ArXiv data.
