Mull-Tokens: Modality-Agnostic Latent Thinking
Arijit Ray¹,⁴  Ahmed Abdelkader¹  Chengzhi Mao¹  Bryan A. Plummer⁴  Kate Saenko⁴  Ranjay Krishna²  Leonidas Guibas¹,³  Wen-Sheng Chu¹
¹Google  ²University of Washington  ³Stanford University  ⁴Boston University
https://arijitray.com/multimodal_thinking/
Abstract
Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative, Mull-Tokens: modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities, letting the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then finetune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a reasoning-heavy puzzle-solving split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offer a simple way to think abstractly in multiple modalities.
1. Introduction
Real-world visual tasks [57, 58] require a synchrony of perception and reasoning [26]. For example, when solving visual puzzles and IQ tests, people imagine patterns and manipulate visual maps to solve the problem [42]. Such reasoning, often bundled up as visual/spatial reasoning [48], is difficult for humans and machines alike [66]. It requires reasoning in space [29], in 3D [5, 32], and often in time [6, 66]. While textual Chain-of-Thought (CoT) [61] has advanced verbal logical reasoning for language models, it falls short on visual reasoning: models still falter when problems demand more than verbal image descriptions [2, 38, 59]. Mastering vision-centric logical reasoning remains an open challenge.

Figure 1. Compared to existing approaches for reasoning in text or interleaving image and text, we offer a simpler alternative: modality-agnostic thinking in the vision-language space using Mull-Tokens to answer visual queries.
To advance visual reasoning, researchers have explored “think-and-sketch” strategies [26, 44] using interleaved visual thoughts, but with conflicting results [32]. Tool-augmented designs rely on external visual modules [41] such as cropping tools [44] or specialized sketching models [26, 77], making the reasoning indirect and often brittle [23]. Unified models [8, 13, 53] offer a more integrated alternative by generating intermediate images but are expensive to train. The latest visual reasoning models instead use image thoughts, either as explicit visual tokens [5] or dense continuous embeddings [68]. However, they require bespoke task-specific training data and hence do not offer a general recipe for visual reasoning. For instance, we find that naively adding modality-specific reasoning supervision can sometimes hurt: supervising a model to interleave textual thoughts and visual latents actually reduced performance on a visual puzzle-solving task compared to reasoning in text. Hence, an effective strategy to use the growing availability of multimodal CoT traces [7, 30] remains elusive.
To address this gap, we present Mull-Tokens: modality-agnostic latent tokens that function as a multimodal scratchpad for the model’s internal reasoning, capable of representing both visual and textual information as needed. During inference, we append these special tokens to the input question, prompting the model to use these modality-agnostic slots for arbitrary intermediate computations (spatial mappings, depth predictions, verbal manipulations, etc.) towards the answer, without the need for explicit decoding into text or images.
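To make the mechanism concrete, the following is a minimal sketch, not the authors' implementation, of appending learned latent thought slots to a question at inference time. The toy `LatentThinker` module, its tiny bidirectional Transformer backbone, the classification head, and all sizes are illustrative assumptions; the paper's method operates inside a full pretrained vision-language model rather than this toy encoder.

```python
# Minimal sketch (assumptions, not the paper's code) of modality-agnostic
# latent "thought" tokens appended to the input before answering.
import torch
import torch.nn as nn

class LatentThinker(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, num_latent_tokens=8, num_answers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Learned, modality-agnostic latent slots (the "scratchpad").
        self.latent_tokens = nn.Parameter(torch.randn(num_latent_tokens, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.answer_head = nn.Linear(d_model, num_answers)

    def forward(self, question_ids, image_features=None):
        # question_ids: (B, T) token ids; image_features: optional (B, N, d_model)
        x = self.embed(question_ids)
        if image_features is not None:
            x = torch.cat([image_features, x], dim=1)
        # Append the latent slots after the question; the model is free to fill
        # them with whatever intermediate visual or textual computation helps.
        latents = self.latent_tokens.unsqueeze(0).expand(x.size(0), -1, -1)
        h = self.backbone(torch.cat([x, latents], dim=1))
        # The answer is read from the final slot; the intermediate slots are
        # never decoded into text or images.
        return self.answer_head(h[:, -1])

model = LatentThinker()
logits = model(torch.randint(0, 1000, (2, 12)))  # toy batch of 2 questions
print(logits.shape)                              # torch.Size([2, 4])
```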
Inspired by recent latent reasoning frameworks [22, 49, 68, 69], we develop a two-stage curriculum for training a model to use Mull-Tokens effectively. The first stage trains the model using multimodal reasoning traces where each Mull-Token is anchored to a relevant textual or visual concept [30]. For example, one token might learn to encode a visual object layout or a textual symbolic mapping if those are useful for the task.
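As one way to picture this curriculum, the sketch below shows a plausible stage-1 objective under the assumption that each latent slot's hidden state is pulled toward an embedding of the corresponding gold thought from the interleaved trace; the function name `stage1_loss`, the cosine-based anchoring term, and the loss weighting are illustrative guesses, not the paper's exact losses.

```python
# Sketch of one plausible stage-1 objective (an assumption for illustration).
import torch
import torch.nn.functional as F

def stage1_loss(latent_states, target_thought_embeds, answer_logits, answer_labels,
                align_weight=1.0):
    """latent_states:         (B, K, D) hidden states at the K latent slots
       target_thought_embeds: (B, K, D) embeddings of the gold text/image thoughts
       answer_logits:         (B, C)    prediction for the final answer
       answer_labels:         (B,)      gold answer indices"""
    # Anchor each latent slot to the thought it should encode
    # (cosine distance here; the real anchoring loss may differ).
    align = 1.0 - F.cosine_similarity(latent_states, target_thought_embeds, dim=-1).mean()
    # Standard supervision on the final answer.
    ans = F.cross_entropy(answer_logits, answer_labels)
    return ans + align_weight * align

# Stage 2 (sketch): drop the anchoring term and keep only the answer loss, so
# the latent slots are free to drift toward whatever representation helps.
```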