Compositional Generalization from Learned Skills via CoT Training: A Theoretical and Structural Analysis for Reasoning
Chain-of-Thought (CoT) training has markedly advanced the reasoning capabilities of large language models (LLMs), yet the mechanisms by which CoT training enhances generalization remain inadequately understood. In this work, we demonstrate that compositional generalization is fundamental: models systematically combine simpler skills learned during CoT training to address novel and more complex problems. Through a theoretical and structural analysis, we formalize this process: 1) Theoretically, information-theoretic generalization bounds based on distributional divergence can be decomposed into in-distribution (ID) and out-of-distribution (OOD) components. Specifically, non-CoT models fail on OOD tasks due to unseen compositional patterns, whereas CoT-trained models achieve strong generalization by composing previously learned skills. In addition, controlled experiments and real-world validation confirm that CoT training accelerates convergence and enhances generalization from ID to both ID and OOD scenarios while maintaining robust performance even with tolerable noise. 2) Structurally, CoT training internalizes reasoning into a two-stage compositional circuit, where the number of stages corresponds to the explicit reasoning steps during training. Notably, CoT-trained models resolve intermediate results at shallower layers compared to non-CoT counterparts, freeing up deeper layers to specialize in subsequent reasoning steps. A key insight is that CoT training teaches models how to think, by fostering compositional reasoning, rather than merely what to think through the provision of correct answers alone. This paper offers valuable insights for designing CoT strategies to enhance LLMs’ reasoning robustness.
💡 Research Summary
This paper investigates why Chain‑of‑Thought (CoT) training dramatically improves the reasoning abilities of large language models (LLMs) and, more importantly, how it enables robust generalization to both in‑distribution (ID) and out‑of‑distribution (OOD) tasks. The authors propose that the key is compositional generalization: during CoT training the model learns a repertoire of simple “skills” (atomic facts or sub‑tasks) and learns to combine them in novel ways when faced with more complex problems.
Theoretical contribution
The authors formalize the data generation process as a conditional distribution \(P(Y \mid X)\) that can be factorized through an intermediate reasoning sequence \(C=(C_1,\dots,C_K)\):
\[P(Y \mid X)=\sum_{C} P(Y \mid X, C)\,P(C \mid X).\]
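The factorization above can be sketched numerically as a marginalization over reasoning chains. The example below is only an illustration under stated assumptions: the two-hop question, the candidate intermediate results, and all probabilities are hypothetical toy values, not data from the paper, and the chain is reduced to a single intermediate step (K = 1) for brevity.

```python
from collections import defaultdict

# Hypothetical toy question (not from the paper): a two-hop query whose
# intermediate result C is the first-hop answer.
X = "Who is the mother of Ann's spouse?"

# Assumed P(C = c | X = x): belief over the intermediate reasoning step.
p_c_given_x = {
    X: {"spouse(Ann) = Bob": 0.9, "spouse(Ann) = Carl": 0.1},
}

# Assumed P(Y = y | X = x, C = c): answer distribution given the chain.
p_y_given_xc = {
    (X, "spouse(Ann) = Bob"): {"Dana": 0.8, "Eve": 0.2},
    (X, "spouse(Ann) = Carl"): {"Dana": 0.1, "Eve": 0.9},
}


def marginal_answer_distribution(x):
    """Compute P(Y | X = x) = sum_C P(Y | X = x, C) * P(C | X = x)."""
    p_y = defaultdict(float)
    for chain, p_chain in p_c_given_x[x].items():
        for answer, p_answer in p_y_given_xc[(x, chain)].items():
            p_y[answer] += p_answer * p_chain
    return dict(p_y)


if __name__ == "__main__":
    print(marginal_answer_distribution(X))
```

With longer chains the sum ranges over all sequences \(C=(C_1,\dots,C_K)\), so the same loop would iterate over tuples of intermediate results rather than single steps.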
Using information‑theoretic tools, they derive a generalization bound that separates the expected error into ID and OOD components weighted by the OOD mixing coefficient \(\alpha\).
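To make the role of a mixing coefficient concrete, the sketch below is not the authors' bound. It shows only the elementary identity that holds if one assumes the test distribution is a mixture \((1-\alpha)P_{\text{ID}} + \alpha P_{\text{OOD}}\): by linearity of expectation, the expected error splits into an ID term and an OOD term with those same weights. The function name and the error values are hypothetical.

```python
def mixture_error(err_id, err_ood, alpha):
    """Expected error under the assumed mixture (1 - alpha) * P_ID + alpha * P_OOD.

    err_id and err_ood are the expected errors on the ID and OOD components;
    alpha is the fraction of test mass drawn from the OOD component.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * err_id + alpha * err_ood


if __name__ == "__main__":
    # Hypothetical error rates: low on ID data, high on unseen compositions.
    for alpha in (0.0, 0.25, 0.5, 1.0):
        print(alpha, mixture_error(err_id=0.05, err_ood=0.60, alpha=alpha))
```

Under this assumption, a model whose OOD error stays high is penalized in proportion to \(\alpha\), which is why reducing the OOD term matters more as the test distribution shifts toward unseen compositions.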