MentisOculi: Revealing the Limits of Reasoning with Mental Imagery
Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.
💡 Research Summary
The paper introduces MentisOculi, a procedural, stratified benchmark designed to evaluate whether frontier multimodal models can use visual “mental imagery” as an intermediate reasoning aid, akin to human cognition. The benchmark comprises five multi‑step visual reasoning tasks—Form Board, Hinge Folding, Paper Fold, Rush Hour, and Sliding Puzzle—each generated at five difficulty levels (1–5) by varying the number of required operations and objects. This procedural generation yields ground‑truth visual chain‑of‑thought solutions, enabling fine‑grained analysis and future extensibility while mitigating data contamination.
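The paper does not publish its generator code, but the procedural, stratified design described above can be illustrated with a minimal sketch for one of the five tasks. The function below is a hypothetical Sliding Puzzle instance generator (the names `generate_instance`, `neighbors`, and the "2 moves per difficulty level" scaling are illustrative assumptions, not the authors' implementation): it scrambles a solved board with a number of random legal moves that grows with the difficulty level, and the recorded intermediate states double as a ground-truth visual chain of thought.

```python
import random


def neighbors(blank, n):
    """Yield cell indices adjacent to the blank in a flattened n x n grid."""
    r, c = divmod(blank, n)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < n and 0 <= nc < n:
            yield nr * n + nc


def generate_instance(difficulty, n=3, seed=0):
    """Scramble a solved n x n sliding puzzle with random legal moves.

    The move count scales with the difficulty level (1-5); the list of
    intermediate states is the ground-truth visual chain of thought.
    """
    rng = random.Random(seed)
    state = list(range(n * n))        # 0 denotes the blank tile
    trace = [tuple(state)]
    prev = None
    for _ in range(2 * difficulty):   # hypothetical difficulty scaling
        blank = state.index(0)
        options = [c for c in neighbors(blank, n) if c != prev]
        move = rng.choice(options)    # never immediately undo a move
        state[blank], state[move] = state[move], state[blank]
        prev = blank
        trace.append(tuple(state))
    return trace
```

Because every instance is generated rather than scraped, the same routine yields unlimited fresh problems per difficulty level, which is what mitigates contamination and enables fine-grained analysis.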
Four families of models are tested: (1) multimodal large language models (MLLMs) that reason purely in text, such as Gemini 2.5, Gemini 3, GPT‑5.1, and Qwen3‑VL; (2) a latent visual reasoning model (Mirage) fine‑tuned on Qwen2.5‑VL to interleave visual latent tokens with text; (3) unified multimodal models (UMMs) that can explicitly generate images interleaved with text, namely Gemini 2.5‑I and Gemini 3‑I; and (4) a video generation model (Veo 3.1) that produces visual rollouts conditioned on a prompt. Evaluation combines automated scoring of textual answers, simulation of predicted action sequences, and frame‑by‑frame analysis of video outputs. Human performance is measured on Rush Hour with a small cohort of PhD students to establish an upper bound.
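The "simulation of predicted action sequences" scoring mode can be sketched in a few lines, again for the Sliding Puzzle task. The helpers below (`apply_moves`, `score_prediction`, and the U/D/L/R move encoding are all illustrative assumptions, not the paper's harness) replay a model's predicted moves in a simulator and award credit only if the sequence is legal and reaches the solved state.

```python
def apply_moves(state, moves, n=3):
    """Replay moves ('U','D','L','R' = direction the blank slides) on a
    flattened n x n sliding-puzzle state. Returns the final state, or
    None if any move pushes the blank off the board."""
    deltas = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}
    state = list(state)
    for m in moves:
        blank = state.index(0)            # 0 denotes the blank tile
        r, c = divmod(blank, n)
        dr, dc = deltas[m]
        nr, nc = r + dr, c + dc
        if not (0 <= nr < n and 0 <= nc < n):
            return None                   # illegal move: off the board
        dest = nr * n + nc
        state[blank], state[dest] = state[dest], state[blank]
    return state


def score_prediction(start, moves, n=3):
    """1.0 iff the predicted action sequence reaches the solved state."""
    final = apply_moves(start, moves, n)
    return float(final == list(range(n * n)))
```

Scoring the action sequence rather than the free-text answer separates planning errors from verbalization errors, which is what allows the failure analysis reported below.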
Results show a consistent failure pattern across all model families. Performance degrades sharply with increasing difficulty and falls below chance at level 5. Surprisingly, UMMs—despite being able to generate images—perform worse than their text‑only counterparts, indicating a breakdown in the integration of visual and textual reasoning streams. Mirage exhibits modest gains over baseline MLLMs, but these diminish as tasks become harder. The video model fails to extract reliable actions due to visual noise. Moreover, even when provided with ground‑truth visualizations, models do not leverage them, and self‑generated images introduce compounding errors across steps.
The authors conclude that current architectures cannot yet realize the hypothesized benefit of mental imagery for reasoning. They attribute the gap to (i) an inability to maintain a consistent visual state over multiple steps, (ii) poor synchronization between generated visuals and textual logic, and (iii) lack of mechanisms to feed visual feedback back into the reasoning loop. The paper proposes future research directions: developing memory modules for persistent visual representations, designing tighter image‑text coupling protocols, and constructing multimodal chain‑of‑thought frameworks that treat visual states as first‑class reasoning objects.
MentisOculi thus provides a rigorous, extensible platform for diagnosing and ultimately closing the gap between visual generation and logical reasoning in next‑generation multimodal AI systems.