EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks, primarily as text encoders, to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) the MLLM text encoder exhibits insufficient reasoning depth, since single-step encoding fails to activate the Chain-of-Thought process that MLLMs need in order to provide accurate guidance for complex tasks; and (ii) the guidance remains invariant during the decoding process, which prevents the DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates the MLLM's reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module keeps the reasoning trajectory grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks step by step. In extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku), EndoCoT achieves an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points. The code and dataset are publicly available at https://lennoxdai.github.io/EndoCoT-Webpage/.
💡 Research Summary
The paper “EndoCoT: Scaling Endogenous Chain‑of‑Thought Reasoning in Diffusion Models” addresses two fundamental shortcomings of current multimodal large language model (MLLM)‑augmented diffusion pipelines. First, the MLLM is used only as a static text encoder, producing a single embedding at the start of generation. This one‑shot encoding cannot trigger a Chain‑of‑Thought (CoT) process, which is essential for solving tasks that require multi‑step logical reasoning. Second, the diffusion transformer (DiT) receives this static conditioning and never updates it during the denoising steps, so it cannot progressively decompose a complex instruction into a sequence of actionable denoising operations.
Through a systematic empirical analysis, the authors show that (i) logical reasoning capacity is concentrated in the upper layers of the language model but is not fully exploited in a single forward pass, and (ii) attention entropy between textual tokens and visual patches explodes in high‑complexity scenarios (e.g., 32×32 mazes), indicating a breakdown of the static coupling between MLLM and DiT.
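The entropy diagnostic above can be illustrated in a few lines. This is a minimal sketch of measuring the Shannon entropy of text-to-visual attention distributions; the function name, the row-wise layout, and the toy patterns are our own assumptions, not the paper's code:

```python
import math

def attention_entropy(attn_rows):
    """Mean Shannon entropy of text-to-visual attention (a sketch).

    attn_rows: one softmax-normalized distribution per text token,
    each summing to 1 over the visual patches it attends to.
    """
    eps = 1e-12  # avoids log(0) on exactly-zero weights
    entropies = [
        -sum(p * math.log(p + eps) for p in row) for row in attn_rows
    ]
    return sum(entropies) / len(entropies)

# Diffuse (near-uniform) attention maximizes entropy, signalling the
# text-visual coupling breaking down; peaked attention stays near zero.
n = 16
uniform = [[1.0 / n] * n for _ in range(4)]
peaked = [[1.0] + [0.0] * (n - 1) for _ in range(4)]
assert attention_entropy(uniform) > attention_entropy(peaked)
```

Under this reading, the "explosion" in high-complexity scenarios corresponds to attention rows drifting toward the uniform (maximum-entropy) pattern.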
To overcome these bottlenecks, the authors propose Endogenous Chain‑of‑Thought (EndoCoT), a framework that endows diffusion models with genuine, iterative CoT reasoning. EndoCoT consists of two main components:
- Iterative Thought Guidance – The MLLM’s latent “thought” representation (denoted hτ) is updated repeatedly. At each reasoning step τ, the previous thought hτ‑1 is fed back into the MLLM to produce a refined thought hτ, which is then used as a dynamic conditioning signal cτ for the DiT. The DiT consumes cτ while performing its denoising step, yielding an intermediate visual output Iτ. This loop mimics human problem solving: the model refines its understanding and proposed solution incrementally rather than attempting to output the final answer in one shot.
- Terminal Thought Grounding – The final thought hT is explicitly aligned with the ground‑truth answer using a combination of cross‑entropy loss and a semantic similarity loss (e.g., cosine similarity). This grounding prevents cumulative drift of the reasoning trajectory and ensures that the end‑to‑end chain converges on the correct solution.
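The two components can be sketched in a few lines of Python. Everything here is illustrative: the function names, the loss weighting `lam`, and the toy step functions are our own assumptions, not the paper's implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def terminal_grounding_loss(pred_probs, target_idx, h_T, answer_emb, lam=0.5):
    """Terminal thought grounding (sketch): cross-entropy on the decoded
    answer token plus a (1 - cosine) term pulling the terminal thought
    h_T toward the answer embedding. The weight `lam` is assumed."""
    ce = -math.log(pred_probs[target_idx] + 1e-12)
    return ce + lam * (1.0 - cosine(h_T, answer_emb))

def endocot_loop(mllm_step, dit_step, h0, z0, num_steps):
    """Iterative thought guidance (sketch): refine the latent thought,
    then condition one DiT denoising step on it, repeatedly."""
    h, z = h0, z0
    for _ in range(num_steps):
        h = mllm_step(h)    # refine thought h_{tau-1} -> h_tau
        z = dit_step(z, h)  # denoise under conditioning c_tau = h_tau
    return h, z             # terminal thought h_T, final latents

# Toy stand-ins: each refinement nudges the thought toward [1, 0].
mllm_step = lambda h: [0.5 * a + 0.5 * b for a, b in zip(h, [1.0, 0.0])]
dit_step = lambda z, c: [0.9 * a + 0.1 * b for a, b in zip(z, c)]
h_T, z_T = endocot_loop(mllm_step, dit_step, [0.0, 1.0], [0.0, 0.0], 8)
```

In this toy setup the refined terminal thought ends up far closer to the "answer" direction than the initial thought, which is the behavior the grounding loss rewards.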
Training proceeds in two stages. In the first stage, the model is trained to predict both intermediate thoughts and intermediate visual outputs, thereby learning the full multi‑step reasoning trajectory. Supervision for intermediate thoughts is obtained from annotated reasoning traces (e.g., “current position is (2,8)”, “next move right”). In the second stage, gradients on intermediate states are frozen, and only the terminal thought and final image quality are fine‑tuned, preserving the learned reasoning dynamics while improving visual fidelity.
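One way to read this two-stage schedule is as a switch over which loss terms contribute gradient. Below is a minimal sketch under that assumption (in an actual implementation the freezing could instead be done by detaching the intermediate states):

```python
def training_loss(intermediate_losses, terminal_loss, image_loss, stage):
    """Two-stage schedule (our simplified reading, not the paper's code):
    stage 1 supervises the full reasoning trajectory (intermediate
    thoughts and visual outputs); stage 2 drops intermediate supervision
    so only the terminal thought and final image quality drive updates."""
    if stage == 1:
        return sum(intermediate_losses) + terminal_loss + image_loss
    # Stage 2: intermediate states frozen -- no gradient from their losses.
    return terminal_loss + image_loss
```

The stage-2 objective leaves the learned reasoning dynamics untouched while the terminal and image terms continue to be fine-tuned.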
The authors evaluate EndoCoT on four diverse reasoning benchmarks: Maze navigation, Traveling Salesperson Problem (TSP) routing, Visual Spatial Planning (VSP), and Sudoku solving. EndoCoT achieves an average accuracy of 92.1%, surpassing the strongest baseline (among competitors such as DiffThinker and Qwen3‑VL‑8B) by 8.3 percentage points. Notably, on more challenging instances (Maze‑32, Sudoku‑35) the model retains high performance (≈90% and 95%, respectively), whereas prior methods tend to collapse after the early denoising steps. Visualizations show that EndoCoT produces interpretable, step‑by‑step reasoning chains rather than committing to a solution prematurely.
In summary, EndoCoT is the first diffusion framework that integrates an endogenous CoT mechanism via iterative latent‑state refinement. By dynamically coupling the MLLM’s evolving thoughts with the DiT’s denoising process and grounding the final thought in textual supervision, the approach unlocks deep reasoning capabilities in diffusion models. The work opens avenues for applying such iterative, self‑guided reasoning to larger LLMs, robotic control, simulation environments, and any domain where sequential decision making and visual grounding are required.