Rethinking Chain-of-Thought Reasoning for Videos

Reading time: 6 minutes

📝 Original Info

  • Title: Rethinking Chain-of-Thought Reasoning for Videos
  • ArXiv ID: 2512.09616
  • Date: 2025-12-10
  • Authors: Yiwu Zhong, Zi-Yuan Hu, Yin Li, Liwei Wang

📝 Abstract

Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM's reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.


📄 Full Content

Rethinking Chain-of-Thought Reasoning for Videos

Yiwu Zhong1, Zi-Yuan Hu1, Yin Li2, Liwei Wang1*
1The Chinese University of Hong Kong, 2University of Wisconsin-Madison

Abstract

Chain-of-thought (CoT) reasoning has been highly successful in solving complex tasks in natural language processing, and recent multimodal large language models (MLLMs) have extended this paradigm to video reasoning. However, these models typically build on lengthy reasoning chains and large numbers of input visual tokens. Motivated by empirical observations from our benchmark study, we hypothesize that concise reasoning combined with a reduced set of visual tokens can be sufficient for effective video reasoning. To evaluate this hypothesis, we design and validate an efficient post-training and inference framework that enhances a video MLLM’s reasoning capability. Our framework enables models to operate on compressed visual tokens and generate brief reasoning traces prior to answering. The resulting models achieve substantially improved inference efficiency, deliver competitive performance across diverse benchmarks, and avoid reliance on manual CoT annotations or supervised fine-tuning. Collectively, our results suggest that long, human-like CoT reasoning may not be necessary for general video reasoning, and that concise reasoning can be both effective and efficient. Our code will be released at https://github.com/LaVi-Lab/Rethink_CoT_Video.

1. Introduction

Chain-of-Thought (CoT) [61, 80] aims to solve complex tasks by generating explicit, step-by-step intermediate reasoning traces before producing a final answer. CoT has been a key driver of the strong reasoning capabilities in the latest large language models (LLMs) [15, 19]. Building on this success, several recent efforts [12, 18, 59, 68, 71] have extended CoT to multimodal large language models (MLLMs), demonstrating improved reasoning over visual inputs, including both images and videos.
*Corresponding author.

Despite its success in vision, CoT reasoning in MLLMs incurs major overhead in both inference and training. On the inference side, visual inputs, especially videos, often expand into thousands of visual tokens with high redundancy, while CoT produces long reasoning sequences. Together, these factors compound to sharply increased memory usage, compute cost, and carbon footprint during deployment. Additionally, CoT training relies on supervised fine-tuning (SFT) with heavily-labeled reasoning traces, followed by reinforcement learning (RL) over long sequences [4, 12, 18, 49, 59, 68, 71, 76]. This pipeline not only requires costly annotation, but also leads to prolonged training cycles.

[Figure 1: CoT reasoning, with dense prefilling and lengthy decoding, incurs substantial computational load at both training and inference. In contrast, concise reasoning coupled with token compression is significantly more efficient, thanks to sparse prefilling and concise decoding.]

In this paper, we aim to reduce the inference and training overhead of reasoning-oriented video MLLMs. With a Transformer-based architecture (Fig. 1), this overhead can be decomposed into: (1) the cost of the decoding phase, which scales with the number of output tokens for reasoning and the total number of input tokens, and (2) the cost of the prefilling phase, which is determined by the number of input tokens. Therefore, reducing CoT overhead requires either shortening the output chain, which promotes more concise reasoning, or decreasing the number of input tokens, which is dominated by redundant visual tokens and can be reduced via token compression [3, 42, 52, 79, 86].
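The prefill/decode decomposition above can be made concrete with a back-of-the-envelope count of attention query-key pairs. The sketch below is illustrative only (the token counts and the `attention_token_pairs` helper are assumptions, not the paper's accounting), but it shows why shrinking both the visual input and the reasoning trace compounds into large savings.

```python
def attention_token_pairs(n_input: int, n_output: int) -> tuple[int, int]:
    """Rough count of query-key pairs processed by causal self-attention.

    Prefill: the i-th input token attends to tokens 1..i, so the total
    is n_input * (n_input + 1) / 2 pairs.
    Decode: the t-th generated token attends to all n_input prefilled
    tokens plus the t previously generated ones (KV cache).
    """
    prefill = n_input * (n_input + 1) // 2
    decode = sum(n_input + t for t in range(1, n_output + 1))
    return prefill, decode

# Hypothetical dense setting: thousands of visual tokens, long CoT trace.
dense = attention_token_pairs(n_input=8192, n_output=1024)
# Hypothetical compressed setting: 4x fewer visual tokens, concise trace.
sparse = attention_token_pairs(n_input=2048, n_output=128)

savings = sum(dense) / sum(sparse)  # roughly an order-of-magnitude reduction
```

Note that decode cost depends on *both* knobs, so concise reasoning and token compression reinforce each other rather than being independent optimizations.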
With this intuition, we conduct a systematic benchmark of MLLMs with and without CoT across a suite of video datasets, covering general, long-form, and complex video understanding tasks. Our benchmark leads to several surprising observations. First, despite the overhead, adding CoT yields only modest gains over the base pre-trained MLLM using direct answering. Indeed, we observe that CoT outputs frequently contain human-like “pondering” patterns (e.g., “Hmm,” “Let’s think,” or “Wait”) that contribute little to reasoning. Second, prompting the base pre-trained MLLM to generate concise reasoning chains leads to a major performance drop, significantly worse than direct answering. We conjecture that the model possesses the necessary knowledge to answer the questions, yet is not well aligned with the concise reasoning paradigm. Third, we find that token compression, although previously shown to be effective for video MLLMs [86], causes notably larger performance degradation when the model i
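To make the token-compression idea concrete, here is a minimal sketch of one common family of approaches: merging consecutive visual tokens by average pooling. This is an assumption for illustration only; the paper's actual compression module is not described in the excerpt above, and the shapes and `compress_tokens` name are hypothetical.

```python
import numpy as np

def compress_tokens(tokens: np.ndarray, ratio: int) -> np.ndarray:
    """Merge every `ratio` consecutive visual tokens by averaging.

    tokens: (num_tokens, dim) array of frame-patch embeddings.
    Returns a (num_tokens // ratio, dim) array; any remainder tokens
    that do not fill a full group are dropped.
    """
    n, d = tokens.shape
    n = (n // ratio) * ratio              # truncate to a multiple of ratio
    return tokens[:n].reshape(-1, ratio, d).mean(axis=1)

# Hypothetical input: 38 frames x 192 patches = 7296 visual tokens.
video_tokens = np.random.randn(7296, 1152)
compact = compress_tokens(video_tokens, ratio=4)   # 1824 tokens remain
```

A 4x pooling ratio directly quadruples the savings in the prefill term discussed above, at the cost of spatial-temporal detail, which is exactly the trade-off the benchmark study probes.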

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
