Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model’s decision-making process. Our approach corrupts individual reasoning steps in the explanation and measures how much the model’s confidence in its answer drops, determining whether a step is truly important. By standardizing these measurements, NLDD enables rigorous comparison across different architectures. Testing three model families on syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon (k*) at 70–85% of chain length, beyond which reasoning tokens have little or even negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain; NLDD offers a way to measure when CoT matters.


💡 Research Summary

The paper investigates whether Chain‑of‑Thought (CoT) explanations generated by large language models (LLMs) genuinely reflect the models’ internal reasoning or merely serve as post‑hoc rationalizations. To answer this, the authors introduce a novel metric called Normalized Logit Difference Decay (NLDD). NLDD quantifies the change in a model’s confidence in the correct answer when an individual reasoning step is corrupted. Confidence is measured as the logit margin between the correct token and the highest‑scoring incorrect token; this margin is divided by a normalization constant derived from the standard deviation of logits across the entire vocabulary, thereby making the metric comparable across architectures with different output scaling.

The NLDD value is expressed as a percentage of the original margin lost after corruption. Positive NLDD indicates that the step contributes causally to the final prediction, values near zero suggest weak coupling, and negative NLDD means that corrupting the step paradoxically increases confidence, revealing an “anti‑faithful” effect.
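The computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`logit_margin`, `nldd`) are ours, and we assume the normalization constant is the standard deviation of the clean logits over the vocabulary, per the description above.

```python
# Hypothetical sketch of the NLDD metric: margin drop after corrupting a
# reasoning step, scaled by the vocabulary-wide logit spread so values are
# comparable across models with different output scaling.
import numpy as np

def logit_margin(logits: np.ndarray, correct_id: int) -> float:
    """Margin between the correct token's logit and the best incorrect logit."""
    incorrect = np.delete(logits, correct_id)
    return float(logits[correct_id] - incorrect.max())

def nldd(clean_logits: np.ndarray,
         corrupted_logits: np.ndarray,
         correct_id: int) -> float:
    """Normalized Logit Difference Decay for one corrupted reasoning step.

    Positive: the step contributed causally to the answer.
    Near zero: weak coupling.
    Negative: corruption *increased* confidence (anti-faithful).
    """
    clean_margin = logit_margin(clean_logits, correct_id)
    corrupted_margin = logit_margin(corrupted_logits, correct_id)
    norm = clean_logits.std()  # assumed normalization: spread of clean logits
    return (clean_margin - corrupted_margin) / norm
```

In this sketch, corrupting a step that the model genuinely relies on shrinks the corrupted margin and yields a positive NLDD.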

In addition to NLDD, the authors employ two complementary analyses: Representational Similarity Analysis (RSA) and Trajectory Alignment Score (TAS). RSA measures the Spearman correlation between representational dissimilarity matrices (RDMs) of hidden states from clean and corrupted chains, indicating whether internal relational structure is preserved. TAS evaluates the geometric efficiency of hidden‑state trajectories by comparing straight‑line displacement from the initial to the final state with the cumulative path length; values close to 1 denote direct, efficient trajectories.
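The two complementary analyses can be sketched in a few lines of NumPy. This is an illustrative reconstruction under stated assumptions, not the paper's code: we use Euclidean distances for the RDMs and a tie-free rank correlation, and all function names are ours.

```python
# Minimal sketches of RSA and TAS over hidden-state sequences of shape
# (num_steps, hidden_dim). Distance metric and names are assumptions.
import numpy as np

def _rdm(states: np.ndarray) -> np.ndarray:
    """Condensed representational dissimilarity matrix:
    upper-triangle pairwise Euclidean distances between hidden states."""
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    return dists[np.triu_indices(len(states), k=1)]

def _spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman rank correlation (assumes no ties, which holds for
    continuous-valued distances)."""
    ra = a.argsort().argsort().astype(float)
    rb = b.argsort().argsort().astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def rsa(clean_states: np.ndarray, corrupted_states: np.ndarray) -> float:
    """RSA: rank correlation between clean and corrupted RDMs.
    High values mean internal relational structure is preserved."""
    return _spearman(_rdm(clean_states), _rdm(corrupted_states))

def tas(states: np.ndarray) -> float:
    """TAS: straight-line displacement from first to last hidden state
    divided by cumulative path length; values near 1 mean a direct,
    efficient trajectory."""
    displacement = np.linalg.norm(states[-1] - states[0])
    path_length = np.linalg.norm(np.diff(states, axis=0), axis=1).sum()
    return float(displacement / path_length)
```

By construction, `tas` is at most 1, with equality only for perfectly straight trajectories.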

Experiments are conducted on three model families—DeepSeek‑Coder‑6.7B‑Instruct (reasoning‑optimized), Llama‑3.1‑8B‑Instruct (standard dense), and Gemma‑2‑9B‑Instruct (soft‑capped logits)—and three benchmark tasks of increasing semantic ambiguity: Dyck‑n (syntactic state tracking), PrOntoQA (multi‑hop logical inference), and GSM8K (multi‑step arithmetic). For each task, 100 correctly answered samples are selected, and up to five counterfactual variants are generated by introducing controlled errors at specific reasoning steps (e.g., depth errors, entity substitutions, arithmetic mistakes) while preserving surface coherence. Corrupted steps are then truncated, and NLDD, RSA, and TAS are computed for each corruption position k.
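The corruption-and-truncation step can be illustrated with a trivial sketch (the function name is ours, and reasoning steps are modeled as a plain list of strings):

```python
def corrupt_and_truncate(steps: list[str], k: int, corrupted_step: str) -> list[str]:
    """Replace reasoning step k with a controlled error (e.g., a depth error,
    entity substitution, or arithmetic mistake) and drop all later steps,
    mirroring the truncation described above."""
    return steps[:k] + [corrupted_step]
```

The metrics are then computed on the model's continuation from this shortened, corrupted chain, for each corruption position k.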

The results reveal a consistent “Reasoning Horizon” (k*) across models and tasks: NLDD peaks at roughly 70–85% of the chain length, after which it declines sharply. Beyond k*, RSA and TAS remain high, indicating that internal representations and trajectory geometry stay stable even though the corrupted steps no longer affect confidence. This pattern suggests that the later portion of a CoT often functions as decorative text rather than as a computational driver.
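Given a per-position NLDD curve, the horizon can be located as a fraction of chain length. A minimal sketch, assuming k* is taken at the curve's peak (the function name is illustrative):

```python
import numpy as np

def reasoning_horizon(nldd_curve: np.ndarray) -> float:
    """Return the corruption position where NLDD peaks, as a fraction of
    chain length; steps beyond this horizon contribute little to the answer."""
    k_star = int(np.argmax(nldd_curve))
    return k_star / (len(nldd_curve) - 1)
```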

Model‑specific patterns emerge as well. DeepSeek‑Coder shows strong, positive NLDD throughout the horizon, confirming genuine reliance on its reasoning steps. Llama‑3.1 exhibits moderate NLDD with stable RSA/TAS, reflecting typical behavior of dense models. Gemma‑2 displays negative NLDD after the horizon, an “anti‑faithful” regime where the generated reasoning actually harms confidence, likely due to its soft‑capping logit mechanism.

The paper positions NLDD against prior faithfulness metrics that treat CoT as a binary property (e.g., accuracy drop after truncation) or rely on raw sensitivity curves that are not comparable across architectures. By normalizing logit differences, NLDD enables fair cross‑model comparisons. The combined use of RSA and TAS provides mechanistic insight: when NLDD decays but RSA/TAS stay high, the model’s computation is insulated from the superficial chain, confirming the post‑hoc hypothesis.

Limitations include reliance on artificially constructed counterfactuals that may not capture all forms of human‑perceived reasoning errors, analysis confined to a single middle transformer layer, and the use of first‑token margin for multi‑token answers, which may underestimate confidence changes. Future work is suggested to extend NLDD to full‑sequence logit margins, explore layer‑wise RSA/TAS dynamics, and incorporate human judgments of reasoning quality.

In summary, the study provides strong empirical evidence that CoT explanations do not always correspond to the model’s internal reasoning. NLDD offers a quantitative tool to pinpoint the point in a chain where genuine causal influence ends, enabling practitioners to assess the reliability of CoT in high‑stakes applications such as medicine or law.

