Understanding Chain-of-Thought Effectiveness in Code Generation: An Empirical and Information-Theoretic Analysis
Large language models (LLMs) achieve strong performance on code generation, but the mechanisms by which Chain-of-Thought (CoT) prompting helps remain unclear. We present a systematic empirical and information-theoretic study of CoT effectiveness in neural code generation, evaluating five paradigms (Zero-Shot, Zero-Shot CoT, Self-Planning, Structured CoT, Reasoning-CoT) across six Python benchmarks, a multilingual benchmark with 12 programming languages, and six models from 7B to 480B parameters, using conditional mutual information I(Y;C|X) as a conceptual lens. Our results show that externally guided CoT consistently outperforms direct generation, with structured methods improving Pass@1 by 5–12% on average while using substantially fewer tokens than reflective reasoning, and that CoT benefits depend on language type systems and model capacity. We further find that reasoning *quality* is critical: high-quality structured CoT from strong generators yields significantly higher accuracy than lightweight alternatives with the same template, whereas naive Zero-Shot CoT can even degrade performance. These findings provide practical guidance for choosing CoT strategies based on model capacity, language characteristics, and task complexity.
💡 Research Summary
This paper investigates why chain-of-thought (CoT) prompting improves neural code generation and how its effectiveness varies across models, tasks, and programming languages. Five CoT paradigms (Zero-Shot, Zero-Shot CoT, Self-Planning, Structured CoT (SCoT), and Reasoning-CoT) are evaluated on six Python benchmarks of differing difficulty, a multilingual benchmark covering 12 languages (both statically and dynamically typed), and six large language models ranging from 7B to 480B parameters (including Qwen2.5, Qwen3, GPT-3.5-Turbo, and GPT-5). The authors introduce an information-theoretic framework based on conditional mutual information I(Y;C|X), which quantifies how much a reasoning chain C reduces the uncertainty of the target code Y given the problem description X.

Empirical results show that externally guided CoT consistently outperforms direct generation. Structured approaches (Self-Planning, SCoT) achieve 5–12% higher Pass@1 on average while using roughly 10% of the tokens required by reflective Reasoning-CoT, whereas Zero-Shot CoT can even degrade performance when the generated reasoning is noisy. Language-type effects are evident: statically typed languages (Java, C++, Go) benefit more from structured templates (+7% on average), while dynamically typed languages (Python, JavaScript, Ruby) see balanced gains from deeper reflective reasoning (+6%). Model capacity also matters: larger models succeed in about two-thirds of asymmetric cases, while smaller models frequently fail on type handling and on keeping reasoning aligned with the generated code. Crucially, the quality of the reasoning chain drives performance: high-quality Structured CoT produced by GPT-5-Mini outperforms lightweight alternatives by 7.5% Pass@1 under identical templates, and low-quality CoT can fall below the Zero-Shot baseline.
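The quantity I(Y;C|X) can be made concrete with a toy plug-in estimator over discrete variables. The following is a hypothetical sketch (not the paper's implementation): it counts joint frequencies of (problem X, code Y, chain C) triples and sums p(x,y,c)·log₂[p(x,y,c)p(x)/(p(x,y)p(x,c))], which is positive exactly when the chain carries information about the code beyond what the problem already provides.

```python
from collections import Counter
from math import log2

def conditional_mi(samples):
    """Plug-in estimate of I(Y; C | X) in bits from (x, y, c) triples.

    I(Y;C|X) = sum over (x,y,c) of p(x,y,c) * log2( p(x,y,c) p(x) / (p(x,y) p(x,c)) )
    """
    n = len(samples)
    n_xyc = Counter(samples)
    n_x = Counter(x for x, _, _ in samples)
    n_xy = Counter((x, y) for x, y, _ in samples)
    n_xc = Counter((x, c) for x, _, c in samples)
    total = 0.0
    for (x, y, c), k in n_xyc.items():
        # The 1/n factors inside the log cancel, so raw counts can be used.
        total += (k / n) * log2(k * n_x[x] / (n_xy[(x, y)] * n_xc[(x, c)]))
    return total

# C fully determines Y within each X: the chain carries one full bit about the code.
informative = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
# C is independent of Y given X: the chain adds nothing.
uninformative = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
```

On the first dataset `conditional_mi` returns 1.0 bit; on the second it returns 0.0, matching the paper's intuition that a useful chain reduces residual uncertainty about the target code while a noisy one does not.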
The paper also demonstrates the trade‑off between information density and token length, confirming that short, template‑constrained chains efficiently convey essential information, whereas long, free‑form chains increase expressiveness at higher computational cost. The findings provide concrete guidance for practitioners: select Structured CoT for efficiency and static languages, use reflective Reasoning‑CoT when deeper exploration is needed, and match the CoT strategy to model scale and task complexity.
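For reference, the Pass@1 figures quoted throughout are conventionally computed with the standard unbiased pass@k estimator from the code-generation evaluation literature (a generic sketch, not code from this paper): draw k samples without replacement from n generations of which c are correct, and report the probability that at least one passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: P(at least one of k samples is correct),
    given n total generations of which c pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For k = 1 this reduces to the fraction of correct generations, e.g. `pass_at_k(10, 3, 1)` is 0.3, so the 5–12% Pass@1 gains above correspond directly to that many more problems solved on the first attempt.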