Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling
Chain-of-Thought reasoning has driven large language models to extend from thinking with text to thinking with images and videos. However, different modalities still have clear limitations: static images struggle to represent temporal structure, while videos introduce substantial redundancy and computational cost. In this work, we propose Thinking with Comics, a visual reasoning paradigm that uses comics as a high information-density medium positioned between images and videos. Comics preserve temporal structure, embedded text, and narrative coherence while requiring significantly lower reasoning cost. We systematically study two reasoning paths based on comics and evaluate them on a range of reasoning tasks and long-context understanding tasks. Experimental results show that Thinking with Comics outperforms Thinking with Images on multi-step temporal and causal reasoning tasks, while remaining substantially more efficient than Thinking with Video. Further analysis indicates that different comic narrative structures and styles consistently affect performance across tasks, suggesting that comics serve as an effective intermediate visual representation for improving multimodal reasoning.
💡 Research Summary
The paper introduces “Thinking with Comics” (TwC), a novel multimodal reasoning paradigm that positions comics as an intermediate visual medium between static images and videos. The authors argue that while images lack temporal structure and videos provide rich dynamics at a high computational cost, comics combine the advantages of both: they convey sequential, causal narratives across panels and embed textual information directly within the visual content, resulting in high information density with lower inference overhead.
Two implementation paths are explored. Path I (End‑to‑End Visualized Reasoning) uses a text‑to‑image generation model (Gemini‑3 Pro Image) to generate a sequence of comic panels conditioned on a question. Each panel represents an intermediate reasoning state, and the final answer is extracted from the last panel. This approach tightly couples generation and reasoning, allowing the model’s latent state transitions to be visualized directly. Path II (Comic as Conditioning Context for a Vision‑Language Model) also generates a comic with the same model, but then feeds both the original question and the comic into a multimodal large language model (Gemini‑3 Pro) for joint reasoning. Here the comic serves as an explicit intermediate variable, analogous to textual chain‑of‑thought steps, but enriched with spatial and temporal cues.
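The data flow of the two paths can be sketched as a minimal pipeline. The stubs below (`Panel`, `generate_comic`, `path_one`, `path_two`) are hypothetical placeholders illustrating the control flow only; the paper's actual back‑ends are Gemini‑3 Pro Image for comic generation and Gemini‑3 Pro for joint reasoning, and no such Python API is specified in the work.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Panel:
    """One comic panel: an intermediate reasoning state with embedded text."""
    index: int
    caption: str

def generate_comic(question: str, num_panels: int) -> List[Panel]:
    """Stub for the text-to-image model: one panel per reasoning step."""
    return [Panel(i, f"step {i} toward: {question}") for i in range(num_panels)]

def path_one(question: str, num_panels: int = 4) -> str:
    """Path I (end-to-end): the answer is read off the final generated panel."""
    comic = generate_comic(question, num_panels)
    return comic[-1].caption  # the last panel encodes the terminal state

def path_two(question: str, num_panels: int = 4) -> str:
    """Path II: the comic conditions a separate VLM together with the question."""
    comic = generate_comic(question, num_panels)
    context = " | ".join(p.caption for p in comic)
    # A real implementation would pass question + panels to a multimodal LLM here.
    return f"answer({question}; context={context})"
```

The key structural difference is visible in the sketch: Path I never re-reads the comic (generation and reasoning are fused), while Path II treats the comic as an explicit intermediate variable handed to a second model.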
The authors evaluate TwC on a suite of benchmarks covering both explicit reasoning (MATH‑500, GSM8K, MathVista) and long‑context understanding (DocVQA, eBDtheque, CulturalBench). Results show that TwC outperforms the “Thinking with Images” (TWI) baseline by roughly 10–20 percentage points on reasoning tasks and matches or exceeds the performance of the “Thinking with Video” (Sora 2) baseline while using far fewer computational resources. Notably, the style of the comic narrative (e.g., detective‑style for logical puzzles, culture‑centric for contextual tasks) systematically influences performance, and shuffling panel order degrades accuracy, confirming the importance of temporal and causal coherence.
Ablation studies reveal scaling behavior similar to traditional chain‑of‑thought: more complex problems benefit from a larger number of panels. The embedded text in panels reduces semantic ambiguity that pure visual reasoning suffers from. The paper also discusses limitations: current text‑to‑image models may struggle with precise mathematical notation or fine‑grained diagrams, and the quality of generated comics can vary across styles. Future work is suggested in the form of dedicated comic‑generation models, graph‑based modeling of panel relationships, and exploration of adaptive panel counts.
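The scaling behavior above suggests a simple budget rule: allocate more panels to harder problems, within a fixed range. The heuristic below, including its thresholds, is purely illustrative and not taken from the paper, which leaves adaptive panel counts to future work.

```python
def panel_budget(problem_tokens: int, min_panels: int = 3, max_panels: int = 12) -> int:
    """Illustrative heuristic: roughly one extra panel per 40 tokens of problem
    statement, clamped to [min_panels, max_panels]."""
    return max(min_panels, min(max_panels, problem_tokens // 40 + min_panels))
```

Clamping keeps short problems from collapsing to a single panel (losing temporal structure) and long ones from exploding the generation cost that TwC is meant to avoid.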
Overall, the work positions comics as an effective, low‑cost, high‑density visual storytelling medium that bridges the gap between images and videos, offering a promising direction for unified multimodal reasoning frameworks.