Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin
Attention sinks and compression valleys have attracted significant attention as two puzzling phenomena in large language models, but have been studied in isolation. In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M-120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle layers, both compression valleys and attention sinks emerge simultaneously. Targeted ablation studies validate our theoretical predictions. This unified view motivates us to propose the Mix-Compress-Refine theory of information flow, as an attempt to explain how LLMs organize their computation in depth by controlling attention and representational compression via massive activations. Specifically, we posit that Transformer-based LLMs process tokens in three distinct phases: (1) broad mixing in the early layers, (2) compressed computation with limited mixing in the middle layers, and (3) selective refinement in the late layers. Our framework helps explain why embedding tasks perform best at intermediate layers, whereas generation tasks benefit from full-depth processing, clarifying differences in task-dependent representations.
💡 Research Summary
The paper “Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin” tackles two puzzling phenomena that have been observed in large language models (LLMs): attention sinks—heads that collapse their attention onto semantically uninformative tokens such as the beginning‑of‑sequence (BOS) token—and compression valleys—layers where the hidden‑state matrix exhibits unusually low entropy despite the model’s high dimensionality. Historically, these phenomena have been studied in isolation: attention sinks have been linked to positional biases, over‑mixing prevention, or spectral subspace effects, while compression valleys have been explained via information‑bottleneck arguments. The authors propose a unified mechanism: massive activations in the residual stream, typically on the BOS token, simultaneously give rise to both effects.
Theoretical contribution
The authors formalize the representation matrix X∈ℝ^{T×d} with rows x_i, where x_0 denotes the BOS token. They define M = ‖x_0‖², R = Σ_{i≠0}‖x_i‖², and an alignment term α that captures the cosine similarity between x_0 and the other rows. Theorem 1 (Massive Activations Induce Spectral Dominance) shows that σ₁² ≥ M + αR, i.e., the leading singular value of X is lower‑bounded by the BOS norm plus an alignment contribution. From this, Corollary 2 derives explicit bounds on anisotropy (p₁), entropy H(X), and the ratio of dominant singular value to the Frobenius norm. The proof sketch reveals that when M dominates R (c = M/R ≫ 1) or when α → 1, the matrix becomes effectively rank‑one, forcing the entropy to collapse. These results provide a rigorous link between a single scalar quantity (the BOS norm) and the global geometry of the hidden‑state space.
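The bound in Theorem 1 can be checked numerically. In the sketch below, α is taken as the norm-weighted average squared cosine similarity between x_0 and the remaining rows, which is the form under which σ₁² ≥ M + αR follows directly from evaluating the Rayleigh quotient of XᵀX in the direction of x_0; the paper's exact definition of α may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 16, 64
X = rng.normal(size=(T, d))
X[0] *= 100.0  # give the BOS row a massive activation norm

M = np.sum(X[0] ** 2)                 # M = ||x_0||^2
row_norms_sq = np.sum(X[1:] ** 2, axis=1)
R = row_norms_sq.sum()                # R = sum_{i != 0} ||x_i||^2

# norm-weighted average squared cosine similarity of x_i (i != 0) with x_0
cos_sq = (X[1:] @ X[0]) ** 2 / (row_norms_sq * M)
alpha = np.sum(cos_sq * row_norms_sq) / R

sigma1_sq = np.linalg.svd(X, compute_uv=False)[0] ** 2
assert sigma1_sq >= M + alpha * R - 1e-6  # Theorem 1: sigma_1^2 >= M + alpha*R
```

With this α, the inequality is exact (not just empirical): σ₁² is the maximum of the Rayleigh quotient over all unit directions, and M + αR is its value in the direction x_0/‖x_0‖.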
Empirical validation
The authors evaluate six models drawn from the Pythia, Gemma, LLaMA‑3, Qwen2, and Bloom families (410 M to 120 B parameters) on the GSM8K benchmark. For each layer they compute three metrics: (1) the normalized matrix entropy of the hidden states, (2) the BOS sink rate (the fraction of attention mass directed to BOS, averaged across heads), and (3) the L2 norm of the BOS token's representation. Across all models, a sharp spike in BOS norm appears in the middle layers (roughly 20‑85 % of depth). At those same layers, entropy drops dramatically (often below 0.5) and the sink rate approaches 1.0. Pearson correlations between changes in BOS norm and entropy are around –0.9, confirming a strong inverse relationship; correlations with sink rate are positive (~0.6). Figure 2 further shows that this synchronized pattern emerges early in training (≈1 k steps) and persists throughout.
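Both layer-wise diagnostics are cheap to compute from hidden states and attention maps. The sketch below is a plausible implementation, not the paper's code: the function names and the choice to normalize entropy by the log of the rank cap (so values lie in [0, 1]) are assumptions.

```python
import numpy as np

def matrix_entropy(X, normalize=True):
    """Spectral entropy of a T x d representation matrix.
    p_k = sigma_k^2 / sum_j sigma_j^2;  H = -sum_k p_k log p_k,
    optionally divided by log(min(T, d)) so the result lies in [0, 1]."""
    s2 = np.linalg.svd(X, compute_uv=False) ** 2
    p = s2 / s2.sum()
    p = p[p > 0]
    H = -np.sum(p * np.log(p))
    return H / np.log(min(X.shape)) if normalize else H

def bos_sink_rate(attn):
    """Fraction of attention mass on the BOS (first key) column,
    averaged over heads and query positions.
    attn: array of shape (heads, T_query, T_key), rows sum to 1."""
    return attn[:, :, 0].mean()

# Sanity check: a near-rank-one matrix has normalized entropy close to 0
rng = np.random.default_rng(0)
X = np.outer(np.ones(16), rng.normal(size=64))
X += 0.01 * rng.normal(size=X.shape)
low_entropy = matrix_entropy(X)   # close to 0: spectrum dominated by sigma_1
```

A fully isotropic matrix (e.g. the identity) gives normalized entropy 1, so a mid-layer value near 0 is exactly the "compression valley" signature described above.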
Causal ablations
To test causality, the authors zero out the MLP's contribution to the BOS token at the layers where massive activations first appear, removing the large residual addition while leaving the attention pathway intact. In LLaMA‑3 8B, this intervention prevents the entropy drop (normalized entropy remains near 0.5 instead of collapsing to ~0.02), eliminates sink formation (the sink rate stays near 0), and keeps the BOS norm comparable to that of other tokens. Similar results hold for the other models, and ablating multiple “massive activation stages” yields additive reductions in compression, indicating that each stage contributes cumulatively.
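The logic of the intervention can be illustrated with a toy residual-stream update (x ← x + attn_out + mlp_out). This is a numpy stand-in for the actual model surgery; `ablate_bos_mlp` and the magnitudes are illustrative, not taken from the paper's code.

```python
import numpy as np

def ablate_bos_mlp(mlp_out, bos_index=0):
    """Zero the MLP's residual contribution to the BOS token only,
    leaving the attention output and all other tokens untouched."""
    out = mlp_out.copy()
    out[bos_index] = 0.0
    return out

rng = np.random.default_rng(0)
T, d = 8, 32
x = rng.normal(size=(T, d))
attn_out = 0.1 * rng.normal(size=(T, d))
mlp_out = 0.1 * rng.normal(size=(T, d))
mlp_out[0] += 50.0   # this layer tries to write a massive activation to BOS

x_normal = x + attn_out + mlp_out
x_ablated = x + attn_out + ablate_bos_mlp(mlp_out)

# After ablation the BOS norm stays on the same scale as other tokens,
# while every non-BOS row is bitwise identical to the unablated run.
```

In a real model this would be done with a forward hook on the MLP output at the targeted layers, masked to the BOS position.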
Mix‑Compress‑Refine theory
Building on the empirical and theoretical findings, the authors propose a three‑phase information‑flow model for transformer LLMs:
- Mix (early layers, 0‑20 % depth) – Broad attention mixing distributes information across tokens.
- Compress (middle layers, 20‑85 % depth) – Massive BOS activations dominate the residual stream, causing attention sinks and a rank‑deficient hidden state; computation is effectively compressed.
- Refine (late layers, 85‑100 % depth) – As BOS norms equalize, attention becomes more selective, and the model fine‑tunes predictions.
This framework explains why embedding‑type tasks (e.g., similarity search) achieve peak performance in intermediate layers, whereas generative tasks (e.g., next‑token prediction) benefit from the full depth, needing the final refinement stage.
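As a rough operationalization, the depth fractions quoted above can be turned into a layer-to-phase lookup. The 20 %/85 % boundaries are the approximate values reported in the summary; real models will vary, so treat this as a sketch rather than a fixed rule.

```python
def depth_phase(layer, n_layers):
    """Map a layer index (0-based) to its hypothesized phase, using the
    approximate 20% / 85% depth boundaries of Mix-Compress-Refine."""
    frac = layer / max(n_layers - 1, 1)
    if frac < 0.20:
        return "mix"
    if frac < 0.85:
        return "compress"
    return "refine"

# Example for a 32-layer model: embedding-style readouts would target
# the "compress" band, generation uses the full depth through "refine".
phases = [depth_phase(i, 32) for i in range(32)]
```

This mapping also suggests a practical heuristic: probe for the BOS-norm spike to locate the compress band empirically instead of relying on fixed fractions.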
Strengths
- Provides a mathematically rigorous link between two previously unrelated phenomena.
- Empirical evidence spans a wide range of model sizes, architectures, and training stages.
- Causal ablation convincingly demonstrates that massive activations are not merely correlated but necessary for both sinks and compression.
- The Mix‑Compress‑Refine narrative offers an intuitive, testable hypothesis about depth‑wise computation allocation.
Limitations and open questions
- The study focuses exclusively on decoder‑only transformers; it is unclear whether encoder‑decoder or multimodal models exhibit the same mechanism.
- The origin of massive BOS activations is not fully explained; why training dynamics preferentially amplify special tokens remains an open theoretical question.
- Practical implications for model design (e.g., layer‑norm tweaks, residual scaling) are hinted at but not explored experimentally.
- The proposed theory does not address potential trade‑offs between compression benefits and loss of expressive capacity in downstream fine‑tuning scenarios.
Future directions
- Extending the analysis to encoder‑decoder architectures and vision‑language models to test the universality of the mechanism.
- Investigating training‑time interventions (e.g., regularizing BOS norm) to control the depth at which compression occurs, potentially yielding more efficient models.
- Exploring whether deliberate manipulation of massive activations can be used to steer model behavior, such as encouraging earlier compression for faster inference.
- Connecting the Mix‑Compress‑Refine phases to recent work on “early exiting” and adaptive computation, possibly designing dynamic depth policies based on detected massive activation signatures.
In summary, the paper convincingly unifies attention sinks and compression valleys under the single cause of massive residual activations, provides solid theoretical bounds, validates them across a spectrum of models, and proposes a coherent three‑stage view of information flow in LLMs. While some aspects remain to be explored, the work represents a significant step toward demystifying the internal dynamics of modern large language models.