Mull-Tokens: Modality-Agnostic Latent Thinking
Arijit Ray¹,⁴  Ahmed Abdelkader¹  Chengzhi Mao¹  Bryan A. Plummer⁴  Kate Saenko⁴  Ranjay Krishna²  Leonidas Guibas¹,³  Wen-Sheng Chu¹
¹Google  ²University of Washington  ³Stanford University  ⁴Boston University
https://arijitray.com/multimodal_thinking/
Abstract
Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative, Mull-Tokens: modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities, letting the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then finetune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a reasoning-heavy puzzle-solving split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offer a simple way to think abstractly in multiple modalities.
1. Introduction
Real-world visual tasks [57, 58] require a synchrony of perception and reasoning [26]. For example, when solving visual puzzles and IQ tests, people imagine patterns and manipulate visual maps to solve the problem [42]. Such reasoning, often bundled up as visual/spatial reasoning [48], is difficult for humans and machines alike [66]. It requires reasoning in space [29], in 3D [5, 32], and often in time [6, 66]. While textual Chain-of-Thought (CoT) [61] has advanced verbal logical reasoning for language models, it falls short on visual reasoning: models still falter when problems demand more than verbal image descriptions [2, 38, 59]. Mastering vision-centric logical reasoning remains an open challenge.

Figure 1. Compared to existing approaches for reasoning in text or interleaving image and text, we offer a simpler alternative: modality-agnostic thinking in the vision-language space using Mull-Tokens to answer visual queries.
To advance visual reasoning, researchers have explored “think-and-sketch” strategies [26, 44] using interleaved visual thoughts, but with conflicting results [32]. Tool-augmented designs rely on external visual modules [41] such as cropping tools [44] or specialized sketching models [26, 77], making the reasoning indirect and often brittle [23]. Unified models [8, 13, 53] offer a more integrated alternative by generating intermediate images but are expensive to train. The latest visual reasoning models instead use image thoughts, either as explicit visual tokens [5] or dense continuous embeddings [68]. However, they require bespoke task-specific training data and hence do not offer a general recipe for visual reasoning. For instance, we find that naively adding modality-specific reasoning supervision can sometimes hurt: supervising a model to interleave textual thoughts and visual latents actually reduced performance on a visual puzzle-solving task compared to reasoning in text. Hence, an effective strategy to use the growing availability of multimodal CoT traces [7, 30] remains elusive.
To address this gap, we present Mull-Tokens: modality-agnostic latent tokens that function as a multimodal scratchpad for the model’s internal reasoning, capable of representing both visual and textual information as needed. During inference, we append these special tokens to the input question, prompting the model to use these modality-agnostic slots for arbitrary intermediate computations (spatial mappings, depth predictions, verbal manipulations, etc.) towards the answer, without the need for explicit decoding into text or images.
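To make the mechanism concrete, the following is a minimal sketch, not the authors' implementation, of appending learned latent thought slots to a question at inference time. The toy `LatentThinker` module, its tiny bidirectional Transformer backbone, the classification head, and all sizes are illustrative assumptions; the paper's method operates inside a full pretrained vision-language model rather than this toy encoder.

```python
# Minimal sketch (assumptions, not the paper's code) of modality-agnostic
# latent "thought" tokens appended to the input before answering.
import torch
import torch.nn as nn

class LatentThinker(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, num_latent_tokens=8, num_answers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Learned, modality-agnostic latent slots (the "scratchpad").
        self.latent_tokens = nn.Parameter(torch.randn(num_latent_tokens, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.answer_head = nn.Linear(d_model, num_answers)

    def forward(self, question_ids, image_features=None):
        # question_ids: (B, T) token ids; image_features: optional (B, N, d_model)
        x = self.embed(question_ids)
        if image_features is not None:
            x = torch.cat([image_features, x], dim=1)
        # Append the latent slots after the question; the model is free to fill
        # them with whatever intermediate visual or textual computation helps.
        latents = self.latent_tokens.unsqueeze(0).expand(x.size(0), -1, -1)
        h = self.backbone(torch.cat([x, latents], dim=1))
        # The answer is read from the final slot; the intermediate slots are
        # never decoded into text or images.
        return self.answer_head(h[:, -1])

model = LatentThinker()
logits = model(torch.randint(0, 1000, (2, 12)))  # toy batch of 2 questions
print(logits.shape)                              # torch.Size([2, 4])
```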
Inspired by recent latent reasoning frameworks [22, 49, 68, 69], we develop a two-stage curriculum for training a model to use Mull-Tokens effectively. The first stage trains the model using multimodal reasoning traces where each Mull-Token is anchored to a relevant textual or visual concept [30]. For example, one token might learn to encode a visual object layout or a textual symbolic mapping if those are useful for the task.
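As one way to picture this curriculum, the sketch below shows a plausible stage-1 objective under the assumption that each latent slot's hidden state is pulled toward an embedding of the corresponding gold thought from the interleaved trace; the function name `stage1_loss`, the cosine-based anchoring term, and the loss weighting are illustrative guesses, not the paper's exact losses.

```python
# Sketch of one plausible stage-1 objective (an assumption for illustration).
import torch
import torch.nn.functional as F

def stage1_loss(latent_states, target_thought_embeds, answer_logits, answer_labels,
                align_weight=1.0):
    """latent_states:         (B, K, D) hidden states at the K latent slots
       target_thought_embeds: (B, K, D) embeddings of the gold text/image thoughts
       answer_logits:         (B, C)    prediction for the final answer
       answer_labels:         (B,)      gold answer indices"""
    # Anchor each latent slot to the thought it should encode
    # (cosine distance here; the real anchoring loss may differ).
    align = 1.0 - F.cosine_similarity(latent_states, target_thought_embeds, dim=-1).mean()
    # Standard supervision on the final answer.
    ans = F.cross_entropy(answer_logits, answer_labels)
    return ans + align_weight * align

# Stage 2 (sketch): drop the anchoring term and keep only the answer loss, so
# the latent slots are free to drift toward whatever representation helps.
```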