Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models
Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds the training distribution while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at the token positions associated with a few critical error types, rather than being uniformly distributed. Closer inspection reveals that these token-level erroneous predictions stem from an internal competition mechanism: certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads during the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.
💡 Research Summary
The paper tackles a striking failure mode of large language models (LLMs) that employ chain‑of‑thought (CoT) prompting: when the number of reasoning steps (or “hops”) required at test time exceeds what the model has seen during training, performance drops dramatically even though the underlying algorithmic skill (e.g., multi‑digit multiplication) remains unchanged. This phenomenon, termed “reasoning‑hop generalization,” is distinct from classic length‑generalization because the total token length may stay modest while the depth of reasoning grows.
The authors frame two research questions. RQ1 asks where the errors occur. Because a full CoT answer can span hundreds of tokens, a naïve token‑wise analysis is infeasible. The authors therefore adopt an error‑centric approach: they manually define a small set (typically 5–10) of task‑specific “error types” that capture the most common logical mistakes (e.g., recalling the wrong name, mis‑updating a state variable). Across seven benchmark tasks drawn from symbolic reasoning, arithmetic, coding, and object counting, they find that a handful of these error types account for at least 30% of all mistakes, and often more than 70%. In the 50‑hop Parity‑NL task, for instance, 78.6% of failures stem from a single “recall‑wrong‑name” error type. This concentration suggests that the performance collapse is not a diffuse accumulation of random noise but is driven by a few systematic error patterns.
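The error-centric analysis above can be sketched as a simple tally: bucket each failing example into a manually defined error type, then check how concentrated the distribution is. This is a minimal illustration of the idea, not the paper's code; the error-type labels and counts below are hypothetical.

```python
from collections import Counter

def error_concentration(failures):
    """Given a list of error-type labels (one per failed example),
    return the most common type and the fraction of failures it covers."""
    counts = Counter(failures)
    top_type, top_count = counts.most_common(1)[0]
    return top_type, top_count / len(failures)

# Toy failure log for a long-hop task (illustrative numbers only)
failures = (["recall-wrong-name"] * 78
            + ["mis-update-state"] * 12
            + ["skip-hop"] * 10)

top_type, frac = error_concentration(failures)
print(top_type, frac)  # recall-wrong-name 0.78
```

A high `frac` for a single label is what the paper reports: errors cluster on a few systematic patterns instead of spreading uniformly over the output.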
RQ2 probes why these systematic errors arise. The authors combine three mechanistic tools: Logit Lens (to project intermediate residual‑stream vectors into vocabulary space), head‑knockout interventions (zeroing out a specific attention head’s output), and residual‑stream circuit analysis. Their analysis reveals a competition circuit inside the transformer. Two families of attention heads emerge: Correct‑Processing heads (cp heads) that amplify the signal leading to the right answer, and Erroneous‑Processing heads (ep heads) that amplify spurious signals and suppress the correct ones. Both cp and ep heads feed into “answer‑writing” heads (aw heads) in deeper layers, which ultimately produce the output token. Crucially, ep heads are often shared across different tasks and error types, indicating a model‑wide vulnerability rather than a task‑specific quirk. When an ep head is knocked out during inference, the model’s prediction at the offending token frequently flips from wrong to correct, confirming the causal role of these heads.
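The head-knockout plus Logit Lens combination can be illustrated in a few lines of numpy: treat the residual stream at a token position as the sum of per-head contributions, zero one head's contribution, and read both variants out through the unembedding matrix. Everything here (dimensions, random weights) is a toy stand-in for a real transformer, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, vocab = 8, 4, 5
W_U = rng.normal(size=(d_model, vocab))          # unembedding: the Logit Lens projection
head_out = rng.normal(size=(n_heads, d_model))   # each head's write into the residual stream

def lens_logits(knockout=None):
    """Sum head contributions into the residual stream, optionally
    zeroing one head (knockout), then project into vocabulary space."""
    mask = np.ones(n_heads)
    if knockout is not None:
        mask[knockout] = 0.0                     # head-knockout intervention
    resid = (mask[:, None] * head_out).sum(axis=0)
    return resid @ W_U

base_pred = lens_logits().argmax()
for h in range(n_heads):
    # If removing head h flips the readout, head h is causally implicated
    # in the prediction at this position -- the paper's ep-head diagnostic.
    if lens_logits(knockout=h).argmax() != base_pred:
        print(f"knocking out head {h} flips the predicted token")
```

In a real model the same logic is implemented with forward hooks that zero a head's output slice during the forward pass, rather than recomputing sums by hand.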
Motivated by this mechanistic insight, the authors propose Test‑time Correction of Reasoning (TCR), a lightweight, model‑agnostic intervention that dynamically deactivates ep heads at inference time. TCR consists of three components: (1) a pre‑compiled candidate set of common ep heads for each model, (2) a head‑selector network trained to predict, from the current context, which ep head(s) should be silenced, and (3) an entropy‑threshold detector that flags high‑uncertainty tokens where intervention is likely needed. When the detector fires, the selector zeroes out the chosen ep head(s) before the residual stream is updated.
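The three TCR components can be sketched as a single decoding step: an entropy detector, a head selector, and a re-decode with the selected ep heads zeroed. The `forward` and `select_ep_heads` callables below are hypothetical stand-ins for the model forward pass and the paper's trained selector network; the threshold and toy logits are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def tcr_step(forward, select_ep_heads, candidate_ep_heads, tau):
    """One TCR decoding step: intervene only when next-token entropy
    exceeds tau, by re-decoding with selected ep heads zeroed."""
    p = softmax(forward(masked_heads=()))
    if entropy(p) <= tau:                          # detector: confident, no intervention
        return int(p.argmax())
    heads = select_ep_heads(candidate_ep_heads)    # selector: pick ep head(s) to silence
    p = softmax(forward(masked_heads=heads))       # re-decode with ep heads knocked out
    return int(p.argmax())

# Toy forward pass: near-uniform logits normally (high entropy),
# but knocking out "head 2" lets the correct token dominate.
def toy_forward(masked_heads=()):
    if 2 in masked_heads:
        return np.array([0.0, 3.0, 0.0])
    return np.array([1.0, 1.0, 0.9])

tok = tcr_step(toy_forward, lambda cands: (cands[0],), (2,), tau=0.5)
print(tok)  # 1
```

The entropy gate is what keeps the method lightweight: on confident tokens the model runs untouched, and the extra forward pass is paid only at flagged positions.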
The authors evaluate TCR on seven hop‑generalization datasets and four open‑source LLMs (Qwen2.5‑7B‑Instruct, Phi‑3‑Instruct, LLaMA3‑8B‑Instruct, Qwen3‑8B‑Instruct). Across the board, TCR yields consistent gains: the average accuracy improvement on Qwen2.5‑7B‑Instruct is +6.8 percentage points. An oracle version, TCR‑gold, which uses a perfect error detector, boosts accuracy from 41.7 % to 61.3 % (≈20 pp), underscoring the upper bound of test‑time head deactivation. Importantly, the method does not require any fine‑tuning on downstream data; it operates purely at inference, making it compatible with off‑the‑shelf models.
In summary, the paper makes three major contributions: (1) it demonstrates that reasoning‑hop generalization failures are dominated by a small set of token‑level error types, enabling focused diagnosis; (2) it uncovers a competition mechanism between cp and ep heads that explains why these errors arise, and shows that ep heads are a shared source of failure across tasks; (3) it introduces TCR, a test‑time, head‑knockout based correction technique that reliably improves hop‑generalization without retraining. The work advances our mechanistic understanding of LLM reasoning and offers a practical, low‑cost remedy, paving the way toward more robust, human‑like multi‑step reasoning in future AI systems.