Quantifying LLM Attention-Head Stability: Implications for Circuit Universality

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

In mechanistic interpretability, recent work scrutinizes transformer “circuits”: sparse, single- or multi-layer sub-computations that may reflect human-understandable functions. Yet these circuits are rarely tested for stability across different instances of the same architecture. Without such tests, it remains unclear whether reported circuits emerge universally or are idiosyncratic to a particular training run, potentially limiting confidence in safety-critical settings. Here, we systematically study stability across re-fits of increasingly complex transformer language models of various sizes. We quantify, layer by layer, how similarly attention heads learn representations across independently initialized training runs. Our experiments show that (1) middle-layer heads are the least stable yet the most representationally distinct; (2) deeper models exhibit stronger mid-depth divergence; (3) unstable heads in deeper layers are more functionally important than their peers in the same layer; (4) applying weight decay substantially improves attention-head stability across random initializations; and (5) the residual stream is comparatively stable. Our findings establish the cross-instance robustness of circuits as an essential yet underappreciated prerequisite for scalable oversight, drawing contours around the possible white-box monitorability of AI systems.


💡 Research Summary

The paper tackles a fundamental but under‑explored question in mechanistic interpretability: do transformer language models learn the same internal “circuits” across independent training runs, or are the discovered sub‑computations largely idiosyncratic to a particular random seed? To answer this, the authors systematically measure the stability of attention heads – the primary computational units in most circuit‑based explanations – across many re‑fits of the same architecture.

Experimental design
They train 26 GPT‑like decoder‑only configurations ranging from 2 to 12 layers, with 8 or 16 heads per layer. Small models (2, 4, 8 layers) are trained on a 2 B‑token subset of C4, each with 50 random seeds; the 12‑layer GPT‑2‑small models are trained on 9 B tokens of OpenWebText with 5 seeds. All hyper‑parameters (learning‑rate schedule, warm‑up, etc.) are held constant except for the optimizer: a baseline Adam run is compared to AdamW (decoupled weight decay) to test the effect of regularization.
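For reference, the sweep described above can be condensed into a small configuration sketch (the field names are my own shorthand, not the authors' actual config schema):

```python
# Training sweep as described in the summary. Field names are
# illustrative shorthand, not the authors' configuration format.
SMALL = dict(layers=[2, 4, 8], heads_per_layer=[8, 16],
             tokens="2B (C4 subset)", seeds=50)
LARGE = dict(layers=[12], heads_per_layer=[8, 16],
             tokens="9B (OpenWebText)", seeds=5)
# Identical schedules and hyperparameters; only weight decay differs.
OPTIMIZERS = ["Adam", "AdamW"]
```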

Stability metric
For a fixed set of ~100 prompts, the post‑softmax attention score matrix of each head is flattened into a vector. Cosine similarity between two heads (same layer, different seed) is computed per prompt, averaged across prompts, and the best‑matching head in the other seed is selected. This yields a per‑head “single‑seed” similarity; averaging over all other seeds gives a final stability score S(m)_hi for head i in anchor model m. Layer‑wise stability is the mean of its heads, and a “cross‑layer” variant allows matching to any head in any layer. This metric is permutation‑invariant and avoids the need for Hungarian matching.
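The metric above can be sketched in NumPy. This is a simplified stand-in, not the authors' code: `anchor` holds one layer's flattened post-softmax attention vectors for the anchor seed, and `others` holds the same layer from other seeds; array shapes and names are illustrative.

```python
import numpy as np

def head_stability(anchor, others):
    """Per-head stability score: for each head in the anchor model, take the
    best-matching head (by mean cosine similarity over prompts) in each other
    seed's corresponding layer, then average over seeds.

    anchor: array of shape (heads, prompts, d) -- flattened post-softmax
            attention per head and prompt, for one layer of the anchor run.
    others: list of arrays with the same shape, one per other seed.
    """
    def unit(x):
        # Normalize each (head, prompt) attention vector along d.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a = unit(anchor)                      # (H, P, d)
    per_seed = []
    for other in others:
        b = unit(other)                   # (H, P, d)
        # Cosine similarity per prompt between every anchor/other head pair,
        # averaged over prompts: shape (H, H).
        sim = np.einsum('ipd,jpd->ijp', a, b).mean(axis=-1)
        # Best-matching head per anchor head (permutation-invariant,
        # no Hungarian matching needed).
        per_seed.append(sim.max(axis=1))
    return np.mean(per_seed, axis=0)      # S for each anchor head
```

Because each anchor head simply takes its best match, relabeling the heads in another seed leaves the score unchanged, which is the permutation-invariance the summary mentions.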

Key findings

  1. Mid‑layer instability – Across all architectures, the first and last layers exhibit high stability (S≈1.0), while middle layers show a pronounced dip (e.g., layer 5 in an 8‑layer model drops to S≈0.70). This indicates that the representations learned in intermediate depths are highly seed‑dependent.

  2. Depth dependence – The magnitude of the dip grows with model depth. In 12‑layer GPT‑2‑small, middle layers fall to S≈0.55, confirming that deeper transformers amplify the divergence of internal sub‑computations.

  3. Functional importance of unstable heads – By ablating heads and measuring changes in perplexity and log‑likelihood, the authors find that unstable heads in deeper layers have a disproportionately larger impact on model performance than stable peers. This suggests that the most “important” circuits may be the least reproducible.

  4. Weight decay improves stability – Switching from Adam to AdamW raises average stability by ~0.12 across all layers, with the most noticeable gains in the middle layers. The authors argue that weight decay curtails norm growth, steering the model toward a more consistent representational subspace.

  5. Residual stream robustness – The residual stream’s token‑level representations remain extremely stable across seeds (cosine similarity > 0.95), highlighting that the additive pathway is far less sensitive to random initialization than the attention heads.
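The ablation protocol behind finding 3 can be sketched as follows. This is a simplified stand-in assuming per-token log-likelihoods have already been collected from the intact model and from a model with one head zeroed out; the paper works with real model forward passes.

```python
import numpy as np

def head_importance(logprobs_full, logprobs_ablated):
    """Importance of a head = drop in mean per-token log-likelihood when
    that head's output is ablated (zeroed). A larger drop means the head
    contributes more to model performance.
    """
    return float(np.mean(logprobs_full) - np.mean(logprobs_ablated))

def perplexity(logprobs):
    """Perplexity from per-token log-likelihoods (natural log)."""
    return float(np.exp(-np.mean(logprobs)))
```

Comparing `head_importance` for unstable versus stable heads within the same deep layer is the kind of contrast the authors report: the unstable heads show the larger drop.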

Implications
The results challenge the implicit assumption that discovered circuits are universally present across model instances. For safety‑critical applications (e.g., medical diagnosis, finance, nuclear control), relying on a circuit identified in a single model could be misleading if that circuit is not reproducible. The study therefore proposes “seed stability” as a prerequisite for any claim of circuit universality. Moreover, simple training choices—particularly the use of weight decay—can dramatically affect reproducibility without altering downstream performance, suggesting that future LLM development should incorporate stability‑focused objectives.

Future directions
The authors note that their similarity measure only captures attention‑score geometry; integrating value‑propagation, gradient‑based attribution, or automated circuit discovery could provide a richer picture of functional alignment. Extending the analysis to larger models (billions of parameters) and to downstream task‑specific probes would test whether the observed patterns persist at scale.

In sum, the paper provides a rigorous, task‑agnostic methodology for quantifying attention‑head stability, uncovers systematic instability in middle layers that grows with depth, demonstrates that unstable heads can be functionally crucial, and shows that weight decay is an effective lever for improving reproducibility. These insights lay the groundwork for more reliable mechanistic interpretability and for building trustworthy, monitorable AI systems.

