Offline World Models as Imagination Networks in Cognitive Agents
The computational role of imagination remains debated. While classical accounts emphasize reward maximization, emerging evidence suggests it accesses internal world models (IWMs). We employ psychological network analysis to compare IWMs in humans and large language models (LLMs) via imagination vividness ratings, distinguishing offline world models (persistent memory structures accessed independent of immediate goals) from online models (task-specific representations). Analyzing 2,743 humans across three populations and six LLM variants, we find human imagination networks exhibit robust structural consistency, with high centrality correlations and aligned clustering. LLMs show minimal clustering and weak correlations with human networks, even with conversational memory, across environmental and sensory contexts. These differences highlight disparities in how biological and artificial systems organize internal representations. Our framework offers quantitative metrics for evaluating offline world models in cognitive agents.
💡 Research Summary
The paper introduces the concept of “offline world models” to capture the persistent, goal‑independent memory structures that underlie human imagination. While classical reinforcement‑learning accounts treat imagination as a tool for reward maximisation, recent evidence suggests that imagination can also serve to sample from a rich internal model of the world that exists independently of immediate tasks. To operationalise this idea, the authors compare internal world models of humans and large language models (LLMs) by turning vividness ratings from two validated imagination questionnaires—the Vividness of Visual Imagery Questionnaire‑2 (VVIQ‑2) and the Plymouth Sensory Imagery Questionnaire (PSIQ)—into graph‑theoretic networks. In these “imagination networks,” nodes represent imagined scenarios (environmental contexts for VVIQ‑2, sensory modalities for PSIQ) and edges encode partial correlations of vividness scores across items.
Human data were collected from 2,743 participants across three geographic populations (Florida, Poland, London). VVIQ‑2 includes 32 items (8 contexts × 4 items) rated on a 1‑5 scale; PSIQ includes 21 items (7 modalities × 3 items) rated on a 0‑10 scale. The authors split the Polish sample into two non‑overlapping groups to test within‑population consistency and also formed combined samples (Florida+Poland, Florida+London). For LLMs, six model variants were used: Gemma‑3 12B, Gemma‑3 27B, their quantisation‑aware trained counterparts, Llama‑3.3 70B, and a mixture‑of‑experts Llama‑4‑16×17B. Each model generated 1,000 simulations by crossing 200 synthetic personas with five imagination‑ability prompts (aphantasia, hypophantasia, typical, hyperphantasia, no instruction). Two conversational conditions were examined: independent (stateless) and cumulative (maintaining a dialogue history). To match human variability, the authors down‑sampled each LLM’s simulations to 600 cases based on total vividness distributions.
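The summary does not spell out how the down‑sampling to 600 cases was performed; one simple way to subsample so that the LLM total‑vividness distribution tracks a human reference distribution is stratified sampling over score bins. The sketch below is an illustrative stand‑in, not the paper's actual procedure; the arrays `llm_totals` and `human_totals` and the bin count are assumptions.

```python
import numpy as np

def match_distribution(llm_totals, human_totals, n_target=600, n_bins=10, seed=0):
    """Subsample LLM total-vividness scores so their histogram roughly
    matches a human reference histogram.

    Illustrative stand-in: the paper's exact down-sampling procedure is
    not described in this summary.
    """
    rng = np.random.default_rng(seed)
    lo = min(llm_totals.min(), human_totals.min())
    hi = max(llm_totals.max(), human_totals.max())
    edges = np.linspace(lo, hi, n_bins + 1)

    # Target share of cases per bin, taken from the human distribution.
    human_counts, _ = np.histogram(human_totals, bins=edges)
    target = np.round(n_target * human_counts / human_counts.sum()).astype(int)

    # Assign each LLM case to a bin, then sample up to the target per bin.
    bin_idx = np.clip(np.digitize(llm_totals, edges) - 1, 0, n_bins - 1)
    chosen = []
    for b in range(n_bins):
        pool = np.flatnonzero(bin_idx == b)
        k = min(target[b], pool.size)  # cannot take more than the bin holds
        chosen.extend(rng.choice(pool, size=k, replace=False))
    return np.sort(np.array(chosen))
```

Stratified matching of this kind equalizes the marginal score distributions while leaving the multivariate item structure untouched, which is exactly why the network-level differences reported below remain informative.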
Univariate analyses showed that LLM total vividness scores varied systematically with model size, prompt, and conversational condition (Kruskal‑Wallis p < 10⁻⁷). However, Kolmogorov‑Smirnov tests revealed that every LLM score distribution differed significantly from every human distribution (D > 0.15, p < 0.001): even after down‑sampling to align overall vividness levels, the shapes of the distributions remained distinguishable, so matching summary scores does not guarantee comparable internal structure.
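The two‑sample Kolmogorov‑Smirnov comparison used here is standard; a minimal sketch with synthetic stand‑in data (the means, spreads, and sample sizes below are made up for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Synthetic stand-ins: a "human" total-score sample and a narrower,
# shifted "LLM" sample of 600 cases, mirroring the down-sampled design.
human = rng.normal(loc=100, scale=15, size=2000)
llm = rng.normal(loc=110, scale=8, size=600)

result = ks_2samp(human, llm)
# result.statistic is the KS distance D (sup of CDF differences);
# result.pvalue is the significance of the two-sample test.
print(result.statistic, result.pvalue)
```

With distributions this different, D comfortably exceeds the 0.15 threshold reported in the paper at p < 0.001, even though the two samples overlap heavily in raw score range.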
The core contribution lies in the multivariate network analysis. Networks were estimated using the EBIC‑glasso method applied to Spearman partial correlations, yielding sparse, regularised adjacency matrices. Centrality measures—expected influence (signed strength) and strength (absolute sum of edge weights)—were computed for each node. Human networks displayed striking consistency: within‑Poland groups, expected‑influence correlations reached r = 0.93; cross‑population correlations (e.g., Florida vs. Poland‑All) ranged from r = 0.46 to r = 0.73, all statistically significant after FDR correction. Strength centrality showed analogous patterns. Community detection (Louvain/Infomap) revealed coherent clusters that aligned with the questionnaire’s contextual or modality groupings, suggesting that human imagination organizes scenes and senses into meaningful higher‑order structures.
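The EBIC‑glasso step (as implemented in the R network‑psychometrics tooling) adds regularization that is not reproduced here; as a simplified, unregularized sketch, partial correlations can be read off the inverse of the Spearman correlation matrix, and the two centrality measures follow directly. The function names and the data shape are assumptions for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def imagination_network(ratings):
    """Estimate a partial-correlation network from an (n_subjects x n_items)
    matrix of vividness ratings.

    Unregularized sketch of the paper's EBIC-glasso step: invert the
    Spearman correlation matrix to get the precision matrix P, then
    pcor_ij = -P_ij / sqrt(P_ii * P_jj).
    """
    corr, _ = spearmanr(ratings)          # Spearman correlation matrix
    precision = np.linalg.inv(corr)
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 0.0)           # no self-edges
    return pcor

def centralities(pcor):
    """Expected influence = signed sum of a node's edge weights;
    strength = sum of absolute edge weights."""
    expected_influence = pcor.sum(axis=1)
    strength = np.abs(pcor).sum(axis=1)
    return expected_influence, strength
```

Because edge weights can be negative, expected influence and strength can diverge; nodes whose edges are uniformly positive (as is typical for vividness items) have nearly identical values on both, which is consistent with the analogous patterns reported for the two measures.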
In contrast, LLM networks exhibited weak and erratic correlations with human centralities (most r < 0.30) and virtually no stable clustering, regardless of model size, prompt, or conversational condition. The cumulative dialogue condition modestly increased total vividness (by roughly 5%) but did not produce human‑like network topology. Even among the LLMs themselves, centrality patterns were inconsistent, suggesting that current language‑only models lack a unified offline world model comparable to the one inferred from human imagery data.
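At the node level, the human‑vs‑LLM comparison reduces to rank‑correlating centrality vectors across the same items. A minimal sketch with made‑up expected‑influence values (real values would come from the estimated networks; the eight entries stand in for the eight VVIQ‑2 contexts):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical expected-influence values for the same 8 VVIQ-2 contexts,
# one vector per network. These numbers are invented for illustration.
human_ei = np.array([1.2, 0.9, 1.4, 0.7, 1.1, 0.8, 1.3, 1.0])
llm_ei   = np.array([0.4, 1.1, 0.2, 0.9, 0.3, 1.0, 0.5, 0.6])

rho, p = spearmanr(human_ei, llm_ei)
# A high rho (as within human samples, up to r = 0.93) would indicate
# similar network organization; a weak or negative rho matches the
# LLM pattern reported above.
print(rho, p)
```

Across many such pairwise comparisons, the p-values would be adjusted with an FDR correction as described for the human analyses.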
The authors interpret these findings as evidence that human offline world models are grounded in multimodal, embodied experience, yielding richly interconnected representations that persist across tasks. LLMs, trained solely on text, encode world knowledge statistically but fail to organise it into a coherent, task‑independent graph that can be probed via imagination‑like prompts. The study also demonstrates that simple prompting can elicit vividness responses from LLMs, yet such responses are driven by surface‑level generation rather than deep structural access.
Limitations include reliance on self‑report questionnaires (subjective bias), the artificial nature of synthetic personas for LLMs (which may not capture genuine cognitive diversity), and the use of linear partial correlations which may miss nonlinear dependencies. The authors suggest future work should compare these network signatures with neuroimaging‑derived functional connectivity, explore multimodal models (e.g., vision‑language transformers) that might develop richer offline world models, and devise evaluation paradigms that capture spontaneous, non‑goal‑directed imagination without explicit instruction.
In sum, the paper provides a novel quantitative framework—imagination networks—to assess offline world models, demonstrates robust structural consistency across human populations, and reveals a substantial gap between human and current LLM internal representations. This work opens a pathway for more nuanced benchmarking of AI systems against human cognitive architecture, especially concerning the organization of knowledge that supports imagination, planning, and creative thought.