LLM-Based World Models Can Make Decisions Solely, But Rigorous Evaluations are Needed
World models have emerged as a key module in decision making, where MuZero and Dreamer achieve remarkable success in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators to model the dynamics of the world, owing to their generalizability. LLMs also serve as the world model for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, these world models have been evaluated either as general world simulators or as functional modules of an agent, i.e., predicting transitions to assist planning. In this work, we propose a comprehensive evaluation of LLM-based world models from the decision-making perspective. Specifically, we leverage the 31 diverse environments from (Wang et al., 2023; 2024) and curate a rule-based policy for each environment to enable diverse evaluation. We then design three main tasks, i.e., policy verification, action proposal, and policy planning, in which the world model alone can be used for decision making. Finally, we conduct a comprehensive evaluation of advanced LLMs, i.e., GPT-4o and GPT-4o-mini, on these environments across the three main tasks under various settings. The key observations include: i) GPT-4o significantly outperforms GPT-4o-mini on the three main tasks, especially on tasks that require domain knowledge; ii) the performance of the LLM-based world model degrades on long-term decision-making tasks; and iii) combining different functionalities of the world model introduces additional performance instability.
💡 Research Summary
This paper investigates whether large language models (LLMs) can serve as stand‑alone world models that make decisions without any auxiliary components such as value critics or separate policy networks. The authors first motivate the study by reviewing the success of traditional model‑based reinforcement‑learning approaches such as MuZero and Dreamer, which learn transition and reward functions to enable planning, transfer, and offline learning. While these methods excel in domains where abundant interaction data are available, they struggle to capture the breadth of common‑sense physics, social dynamics, and specialized scientific knowledge that LLMs have implicitly absorbed from massive text corpora.
Recent work has begun to treat LLMs as general world simulators, either explicitly (e.g., Reasoning via Planning, BlocksWorld) by prompting the model to output next‑state descriptions, or implicitly (e.g., Tree of Thoughts, Graph of Thoughts) where the model’s “thoughts” serve as hidden state representations during reasoning. However, prior evaluations typically treat LLMs either as generic simulators—measuring one‑step prediction error—or as auxiliary modules that assist a planner, focusing on the accuracy of value estimates rather than on the model’s ability to generate a complete policy.
To address this gap, the authors propose three decoupled evaluation tasks that require the LLM‑based world model to act as the sole decision‑making engine:
- Policy Verification – Given a rule‑based policy for an environment, the model must determine whether executing that policy will achieve the goal. This tests the model’s ability to simulate entire action sequences and assess success.
- Action Proposal – The model must output the top‑K plausible actions from a given state, leveraging its world knowledge to reduce planning complexity. Unlike traditional agents that output a single action, this multi‑candidate approach mirrors how a game engine suggests possible moves.
- Policy Planning – Combining verification and proposal, the model must construct a full policy from scratch using only its predicted next states and action candidates. No explicit value function (critic) is employed; instead, a search algorithm (e.g., Monte‑Carlo Tree Search) operates on the model’s simulated trajectories.
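To make the decoupling of the three tasks concrete, here is a minimal Python sketch in which a toy deterministic transition function stands in for the LLM’s next‑state predictions. All names (`simulate_step`, `verify_policy`, `propose_actions`, `plan`) and the toy environment are hypothetical illustrations, not the paper’s actual prompts or API:

```python
# Toy stand-in for an LLM world model: a deterministic transition function
# over a one-dimensional environment (move from position 0 to the goal at 3).
from collections import deque

GOAL = 3
ACTIONS = ["left", "right", "stay"]

def simulate_step(state, action):
    """Stand-in for one LLM next-state prediction."""
    if action == "right":
        return min(state + 1, GOAL)
    if action == "left":
        return max(state - 1, 0)
    return state

def verify_policy(state, actions):
    """Task 1 (policy verification): roll the whole action sequence
    through the simulated dynamics and check goal attainment."""
    for a in actions:
        state = simulate_step(state, a)
    return state == GOAL

def propose_actions(state, k=2):
    """Task 2 (action proposal): return the top-k candidate actions,
    here ranked by predicted progress toward the goal."""
    ranked = sorted(ACTIONS, key=lambda a: -simulate_step(state, a))
    return ranked[:k]

def plan(state, max_depth=6):
    """Task 3 (policy planning): breadth-first search over simulated
    trajectories using only the model's predictions -- no learned critic."""
    queue = deque([(state, [])])
    while queue:
        s, path = queue.popleft()
        if s == GOAL:
            return path
        if len(path) < max_depth:
            for a in propose_actions(s):
                queue.append((simulate_step(s, a), path + [a]))
    return None

print(verify_policy(0, ["right", "right", "right"]))  # True
print(plan(0))  # ['right', 'right', 'right']
```

In the paper's setting the transition function would be an LLM prompted with a textual state description, and the search could be MCTS rather than breadth‑first search; the point here is only that verification, proposal, and planning are separable uses of the same simulated dynamics.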
The experimental suite consists of 31 diverse text‑based environments drawn from Wang et al. (2023, 2024), ranging from everyday tasks (laundry, cooking) to scientific/engineering challenges (key forging, metal casting). For each environment the authors hand‑crafted a deterministic rule‑based policy to serve as ground truth. Two state‑of‑the‑art LLMs—GPT‑4o and its lightweight counterpart GPT‑4o‑mini—are evaluated under identical prompts and settings across all three tasks.
Key findings:
- Model size matters – GPT‑4o consistently outperforms GPT‑4o‑mini, especially on domains requiring specialized knowledge (e.g., scientific protocols). The performance gap reaches up to 30 percentage points on the most knowledge‑intensive tasks.
- Bottleneck steps dominate difficulty – The total number of steps in a task is a weak predictor of success. Instead, accuracy on a few critical “bottleneck” transitions (e.g., temperature control in metal casting) determines overall policy success. This suggests that conventional one‑step MSE metrics are insufficient for decision‑making evaluation.
- Combining functionalities introduces instability – When the model simultaneously predicts next states, rewards, and proposes actions, performance variance increases markedly. Errors propagate across sub‑modules, sometimes masking the true capability gap between the two LLMs.
- Prediction accuracy ≠ decision quality – Figure 2 illustrates that a model with larger absolute prediction error can still select the correct action, whereas a model with smaller error may misrank actions. Hence, evaluation should focus on predictions that are directly relevant to the desired policy rather than on generic simulation fidelity.
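This dissociation between error magnitude and decision quality can be shown with a toy numeric example (illustrative only, not the paper’s Figure 2): a model with a much larger squared error can still rank two candidate actions correctly, while a lower‑error model misranks them.

```python
# True values of two candidate actions; a1 is the better action.
true_values = {"a1": 10.0, "a2": 8.0}

# Model A: small per-action errors, but the ranking is flipped.
model_A = {"a1": 6.0, "a2": 7.0}
# Model B: large errors, yet the ranking is preserved.
model_B = {"a1": 20.0, "a2": 12.0}

def mse(pred):
    """Mean squared error of predicted action values."""
    return sum((pred[a] - true_values[a]) ** 2 for a in true_values) / len(true_values)

def best_action(pred):
    """The action a decision-maker would pick from these predictions."""
    return max(pred, key=pred.get)

print(mse(model_A), best_action(model_A))  # 8.5 a2   (wrong choice)
print(mse(model_B), best_action(model_B))  # 58.0 a1  (right choice)
```

Model A has the lower MSE (8.5 vs. 58.0) yet picks the wrong action, which is why the summary argues for policy‑relevant metrics over generic prediction fidelity.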
The authors argue that a rigorous evaluation framework for LLM‑based world models should (i) prioritize task‑relevant predictions (e.g., “does this action lead to the goal?”), (ii) account for error accumulation over multi‑step horizons, and (iii) isolate the contribution of each functional component. They propose future directions such as targeted fine‑tuning on identified bottleneck steps, designing error‑correction mechanisms between sub‑modules, and developing policy‑centric metrics (e.g., success rate of goal attainment, bottleneck‑step accuracy).
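The error‑accumulation point (ii) can be illustrated with simple arithmetic: if per‑step simulation accuracies roughly multiply along a trajectory (an independence assumption made here for illustration, not a claim from the paper), a single weak bottleneck step caps overall success more than horizon length does.

```python
def policy_success(step_accuracies):
    """Probability that every step of a trajectory is simulated correctly,
    assuming (for illustration) independent per-step accuracies."""
    p = 1.0
    for acc in step_accuracies:
        p *= acc
    return p

long_easy = [0.99] * 20                     # long horizon, all easy steps
short_with_bottleneck = [0.99, 0.5, 0.99]   # short, but one hard transition

print(round(policy_success(long_easy), 3))             # 0.818
print(round(policy_success(short_with_bottleneck), 3)) # 0.49
```

Under this rough model, twenty easy steps still succeed about 82% of the time, while a three‑step policy with one 50%‑accurate bottleneck succeeds only about 49% of the time, which is consistent with the summary’s call for bottleneck‑step accuracy as a policy‑centric metric.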
In conclusion, the paper demonstrates that LLMs can indeed act as stand‑alone world models capable of verifying policies, proposing viable actions, and planning entire strategies across a wide range of environments. However, limitations remain: long‑horizon planning suffers from error accumulation, and integrating multiple functionalities can lead to unstable performance. The work calls for more nuanced, decision‑oriented evaluation protocols and for research into stabilizing and refining LLM‑based world models for robust autonomous decision making.