WorldLLM: Improving LLMs' world modeling using curiosity-driven theory-making
Large Language Models (LLMs) possess general world knowledge but often struggle to generate precise predictions in structured, domain-specific contexts such as simulations. These limitations arise from their inability to ground their broad, unstructured understanding in specific environments. To address this, we present WorldLLM, a framework that enhances LLM-based world modeling by combining Bayesian inference and autonomous active exploration with reinforcement learning. WorldLLM leverages the in-context learning abilities of LLMs to guide an LLM-based world model’s predictions using natural language hypotheses given in its prompt. These hypotheses are iteratively refined through a Bayesian inference framework that leverages a second LLM as the proposal distribution given collected evidence. This evidence is collected using a curiosity-driven reinforcement learning policy that explores the environment to find transitions with a low log-likelihood under our LLM-based predictive model using the current hypotheses. By alternating between refining hypotheses and collecting new evidence, our framework autonomously drives continual improvement of the predictions. Our experiments demonstrate the effectiveness of WorldLLM in a textual game environment that requires agents to manipulate and combine objects. The framework not only enhances predictive accuracy, but also generates human-interpretable theories of environment dynamics.
💡 Research Summary
WorldLLM tackles the persistent gap between the broad, unstructured knowledge stored in large language models (LLMs) and the precise, domain‑specific dynamics required for accurate world modeling in structured environments such as simulations or text‑based games. The authors propose a three‑component framework that iteratively refines an LLM‑based forward model without any gradient‑based fine‑tuning.
- Statistician – A pre-trained LLM (Phi-3-mini-4k-Instruct) serves as a conditional probability estimator \(P_{\text{LLM}}(s' \mid s, a, H)\). It receives the current state-action pair \((s, a)\) and a set of natural-language hypotheses \(H\) in its prompt, and returns the likelihood of each possible next state \(s'\). By conditioning on \(H\), the model can incorporate domain-specific rules that sharpen its predictions.
- Scientist – A second LLM acts as a proposal distribution over hypothesis sets. In a Bayesian inference loop, the Scientist draws candidate hypotheses \(\hat{H}_i\) from \(P_{\text{Sci}}(\hat{H}_i \mid D_t, \hat{H}_t)\), where \(D_t\) is the collection of transitions gathered so far. A Metropolis-Hastings procedure (five steps per outer iteration) evaluates each candidate by the log-likelihood of the data under the Statistician: a candidate that raises the likelihood is always accepted, while a worse one is accepted only with a probability that shrinks with the likelihood gap. This stochastic search yields a compact, human-readable theory that maximizes the evidence.
- Experimenter – The Experimenter collects the transitions that are most informative for hypothesis refinement. Two families of policies are explored: (a) hand-crafted oracle policies (O-Ideal, O-Curriculum, O-Hardest, O-Random) that provide controlled baselines, and (b) curiosity-driven reinforcement-learning agents. Three intrinsic rewards are defined:
- RL-LogP – the negative log-likelihood \(-\log P_{\text{Stat}}(s' \mid s, a, H_t)\), encouraging the agent to visit low-probability outcomes.
- RL-ALP – the absolute learning progress \(\lvert \log P_{\text{Stat}}^{t-1} - \log P_{\text{Stat}}^{t} \rvert\), rewarding transitions on which the forward model improves (or forgets).
- RL‑ALPEXP – a stabilized version that partitions the transition space into semantically meaningful sub‑spaces (e.g., “Standing”, “Holding1”, “GrowPlant”) and computes ALP per partition, reducing noise.
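Concretely, the Statistician's job reduces to prompting the LLM with the current hypotheses and scoring candidate next-state strings by their token log-probabilities. A minimal sketch of this scoring, where `logprob_fn` and the prompt template are assumptions standing in for the actual LLM call:

```python
import math

def build_prompt(hypotheses, state, action):
    """Assemble the Statistician's prompt: hypotheses first, then the query.
    The exact template is an assumption; the paper's wording may differ."""
    rules = "\n".join(f"- {h}" for h in hypotheses)
    return (f"Rules of the environment:\n{rules}\n\n"
            f"Current state: {state}\nAction: {action}\nNext state:")

def score_candidates(logprob_fn, hypotheses, state, action, candidates):
    """Return a normalized distribution over candidate next states.
    `logprob_fn(prompt, completion)` stands in for the LLM's summed
    token log-probabilities (hypothetical interface)."""
    prompt = build_prompt(hypotheses, state, action)
    logps = {c: logprob_fn(prompt, c) for c in candidates}
    z = math.log(sum(math.exp(lp) for lp in logps.values()))
    return {c: math.exp(lp - z) for c, lp in logps.items()}

# Toy stand-in: longer completions get lower log-probability.
toy_logprob = lambda prompt, completion: -float(len(completion))
probs = score_candidates(toy_logprob, ["Water grows seeds."],
                         "You hold water and a seed.", "use water on seed",
                         ["a grown plant appears", "nothing happens"])
```

Conditioning simply means prepending \(H\) to the prompt, which is why no gradient update is ever needed.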
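The Scientist's Metropolis-style search over hypothesis sets can be sketched as below; `propose` and `loglik` are hypothetical stand-ins for the Scientist proposal and the Statistician's data log-likelihood:

```python
import math
import random

def metropolis_step(current_h, current_ll, propose, loglik, rng):
    """One Metropolis step: an improving candidate is always accepted,
    a worse one with probability exp(candidate_ll - current_ll)."""
    candidate = propose(current_h)
    cand_ll = loglik(candidate)
    if math.log(rng.random()) < cand_ll - current_ll:
        return candidate, cand_ll
    return current_h, current_ll

def refine(h0, propose, loglik, steps=5, seed=0):
    """Run a short Metropolis chain (the paper uses five steps per
    outer iteration) and return the final hypothesis set."""
    rng = random.Random(seed)
    h, ll = h0, loglik(h0)
    for _ in range(steps):
        h, ll = metropolis_step(h, ll, propose, loglik, rng)
    return h, ll
```

Occasionally accepting worse candidates keeps the chain from getting stuck on a locally plausible but globally poor theory.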
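Each of the three intrinsic rewards reduces to a few lines; the per-partition running average used for RL-ALPEXP is an assumption about how the stabilization could be realized:

```python
def rl_logp_reward(logp):
    """RL-LogP: negative log-likelihood of the observed transition
    under the current Statistician."""
    return -logp

def rl_alp_reward(logp_prev, logp_now):
    """RL-ALP: absolute learning progress between two snapshots of
    the Statistician on the same transition."""
    return abs(logp_prev - logp_now)

def rl_alpexp_reward(partition, logp_prev, logp_now, history):
    """RL-ALPEXP sketch: ALP averaged within a semantic partition
    (e.g. 'Standing', 'Holding1', 'GrowPlant'). Averaging per
    partition dampens the noise of single-transition ALP."""
    history.setdefault(partition, []).append(abs(logp_prev - logp_now))
    vals = history[partition]
    return sum(vals) / len(vals)
```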
The framework is evaluated in Playground‑Text, a textual simulation where an agent manipulates four object types (Water, Plants, Small Herbivores, Big Herbivores). Objects can be combined according to a predefined technology tree: water + seed → grown plant; grown plant + baby small herbivore → grown small herbivore; two grown plants + baby big herbivore → grown big herbivore. Six transition categories are defined, and a hidden test set \(D_{\text{test}}\) is used for final evaluation.
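The technology tree above can be written down as a small lookup table; the string encoding of objects is illustrative, not the environment's actual interface:

```python
# Toy version of Playground-Text's technology tree. Keys are sorted
# tuples of ingredients, values are the resulting object.
TECH_TREE = {
    ("seed", "water"): "grown plant",
    ("baby small herbivore", "grown plant"): "grown small herbivore",
    ("baby big herbivore", "grown plant", "grown plant"): "grown big herbivore",
}

def combine(*objects):
    """Result of combining the held objects, or None if no rule applies.
    Order-insensitive: ingredients are sorted before lookup."""
    return TECH_TREE.get(tuple(sorted(objects)))
```

These rules are exactly what a good hypothesis set should recover in natural language.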
Experimental protocol: 400 outer iterations, each collecting 150 transitions (or 3600 when training an RL Experimenter). After each collection, the Scientist runs five Metropolis steps to update hypotheses. Both the Statistician and Scientist share the same LLM; the Experimenter is either an oracle or a small MLP policy trained on symbolic state representations.
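Putting the protocol together, the outer loop alternates evidence collection and hypothesis refinement; `collect` and `refine_hypotheses` are hypothetical stand-ins for the Experimenter and the Scientist/Statistician pair:

```python
def worldllm_loop(collect, refine_hypotheses, n_outer=400,
                  transitions_per_iter=150, metropolis_steps=5):
    """Outer WorldLLM loop (sketch): gather transitions with the
    current hypotheses, then refine the hypotheses on all evidence
    collected so far. Defaults mirror the protocol described above."""
    evidence, hypotheses = [], []
    for _ in range(n_outer):
        evidence.extend(collect(hypotheses, transitions_per_iter))
        hypotheses = refine_hypotheses(hypotheses, evidence, metropolis_steps)
    return hypotheses, evidence
```

Because the Statistician and Scientist share one LLM, each iteration's cost is dominated by inference calls, not training.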
Results show that curiosity‑driven RL (especially RL‑ALPEXP) outperforms random oracles in covering the transition space efficiently, discovering complex transitions (e.g., growing a big herbivore) earlier. The hypothesis set converges to concise natural‑language rules that mirror the environment’s underlying logic. When these hypotheses are fed back to the Statistician, the forward model’s log‑likelihood on the full transition distribution improves by roughly 15‑20 % compared with a baseline LLM that receives no hypotheses. Moreover, the generated theories are directly interpretable by humans, demonstrating the framework’s ability to produce explainable world models.
Key insights:
- In‑context hypothesis injection allows LLMs to specialize without costly fine‑tuning, leveraging their existing knowledge base.
- Bayesian Metropolis search efficiently explores the combinatorial space of natural‑language theories, yielding compact, high‑likelihood explanations.
- Curiosity‑based data acquisition supplies the most informative evidence, making the overall process sample‑efficient under a strict interaction budget.
Limitations include reliance on natural‑language expressivity (complex mathematical dynamics may be hard to capture), potential slow convergence of Metropolis sampling in larger hypothesis spaces, and validation only on a relatively simple textual domain. Future work could extend the approach to multimodal LLMs for visual or auditory hypotheses, employ variational inference or advanced MCMC techniques for faster convergence, and test the pipeline on high‑dimensional physical simulators or real‑world robotics.
In summary, WorldLLM presents a novel, cost‑effective architecture that unifies theory‑based reinforcement learning, Bayesian inference, and curiosity‑driven exploration to endow LLMs with accurate, interpretable world models, opening new avenues for grounded AI without the heavy computational burden of traditional fine‑tuning.