Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation


Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the-constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections, and MLP blocks. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward pass and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and makes quantitative predictions on the scale of both weights and residual connections that guarantee smooth training.


💡 Research Summary

The paper tackles a fundamental problem in modern deep learning: how to initialise very deep Transformer models so that training proceeds smoothly. While the theory of signal propagation and “edge‑of‑chaos” initialisation is well‑established for fully‑connected networks, Transformers introduce two distinct failure modes that have not been jointly explained before. The first, rank collapse, occurs when the self‑attention matrix becomes essentially uniform: every token is mapped to the same output direction, driving the representation matrix to rank‑one. This destroys token‑wise information and leads to vanishing gradients as depth grows. The second, entropy collapse, appears when the query and key weights are too large at initialization; the soft‑max in attention saturates, concentrating probability mass on a few tokens regardless of the input, which yields a low‑entropy attention distribution and unstable training dynamics.

To obtain a unified description, the authors map the random attention scores at initialization onto the Random Energy Model (REM) from statistical physics. In the REM each configuration has an energy drawn from a Gaussian distribution, and the probability of a configuration follows a Boltzmann law with inverse temperature β. By initializing the query and key matrices with variance σ²_Q = σ²_K = β·log T / d (where T is the sequence length and d the embedding dimension), the attention scores become Gaussian with variance σ²_a = β·log T. This scaling ensures that the fluctuations of the scores grow like √log T, exactly matching the energy fluctuations of a REM with N ≈ log T degrees of freedom. Consequently, the inverse temperature β becomes the single knob that controls the sharpness of the attention distribution.
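The prescription itself is straightforward to write down in code. The sketch below is ours, not the paper's: it initialises query/key matrices with entry variance σ²_Q = σ²_K = β·log T / d as described above, while the exact normalisation conventions inside the attention dot product (e.g. any 1/√d factor) follow the paper rather than this snippet.

```python
import numpy as np

def init_qk(d, T, beta, rng):
    """Initialise query/key weights with entry variance
    sigma^2_Q = sigma^2_K = beta * log(T) / d (the paper's scaling)."""
    std = np.sqrt(beta * np.log(T) / d)
    W_Q = rng.normal(0.0, std, size=(d, d))
    W_K = rng.normal(0.0, std, size=(d, d))
    return W_Q, W_K

rng = np.random.default_rng(0)
d, T, beta = 128, 512, 2.0
W_Q, W_K = init_qk(d, T, beta, rng)
# empirical entry std should match the prescribed value
print(W_Q.std(), np.sqrt(beta * np.log(T) / d))
```

With this scaling, β is the only free parameter once the sequence length T and embedding dimension d are fixed, which is what makes the REM analogy operational.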

The analysis shows a sharp phase transition at a critical value β_c. For β < β_c the attention scores are too small; the soft‑max yields an almost uniform matrix, and the cosine similarity between any pair of tokens quickly converges to 1 as layers stack – the rank‑collapse regime. For β > β_c the scores are large enough that the soft‑max becomes highly peaked; the attention matrix is sparse, the Shannon entropy of each row drops, and the model enters the entropy‑collapse regime.
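The temperature effect can be illustrated with a small numerical experiment (ours, not the paper's code): draw a row of T i.i.d. Gaussian scores with variance β·log T, apply a softmax, and measure the Shannon entropy of the resulting attention row. Small β gives entropy near the maximum log T (uniform attention); large β drives the entropy toward zero (peaked attention).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1024
logT = np.log(T)

def mean_row_entropy(beta, n_rows=200):
    """Average Shannon entropy (nats) of a softmax over T Gaussian
    scores with variance beta * log(T); log(T) is the maximum value."""
    entropies = []
    for _ in range(n_rows):
        scores = rng.normal(0.0, np.sqrt(beta * logT), size=T)
        p = np.exp(scores - scores.max())  # numerically stable softmax
        p /= p.sum()
        entropies.append(-np.sum(p * np.log(p + 1e-30)))
    return float(np.mean(entropies))

h_uniform = mean_row_entropy(beta=0.1)  # near log(T): almost uniform rows
h_peaked = mean_row_entropy(beta=10.0)  # near 0: entropy collapse
print(h_uniform, logT, h_peaked)
```

This toy experiment only probes a single attention row at initialisation; the location of the critical β_c and the depth-wise dynamics require the full analysis of the paper.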

A second crucial parameter is the strength of the residual (skip) connection in the self‑attention sub‑layer, denoted α_SA. A sufficiently large α_SA counteracts the tendency of the cosine similarity to drift toward 1, preserving token diversity even when β is close to β_c. Jointly varying these two parameters, the authors construct a trainability diagram that partitions the (β, α_SA) plane into three regions: rank collapse (red), entropy collapse (yellow), and a blue region where the model is theoretically trainable. The diagram is computed by Algorithm 1, which iteratively propagates the average token overlap q and the average cosine similarity ρ through a full Transformer block (self‑attention, layer‑norm, MLP, and their respective residuals). The algorithm incorporates known results for MLPs (He‑type scaling) and for layer‑norm, thereby providing a complete picture of signal propagation.
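The stabilising role of the skip connection can be seen in a deliberately simplified simulation (ours, not Algorithm 1): take uniform attention as the worst case for rank collapse, assume α_SA multiplies the skip path, use row normalisation as a crude layer-norm stand-in, and track the average pairwise cosine similarity across layers.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, L = 32, 64, 10  # tokens, embedding dimension, depth

def mean_cosine(X):
    """Average cosine similarity between all distinct pairs of rows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = Xn @ Xn.T
    return float((C.sum() - T) / (T * (T - 1)))

def propagate(alpha_sa):
    """Worst case for rank collapse: uniform attention pulls every token
    toward the mean; the skip path of strength alpha_sa resists the drift."""
    X = rng.normal(size=(T, d))
    for _ in range(L):
        attn_out = np.tile(X.mean(axis=0), (T, 1))       # uniform attention
        X = alpha_sa * X + attn_out                      # skip + attention
        X /= np.linalg.norm(X, axis=1, keepdims=True)    # layer-norm proxy
    return mean_cosine(X)

rho_weak = propagate(alpha_sa=0.5)   # tokens collapse onto one direction
rho_strong = propagate(alpha_sa=4.0)  # diversity decays far more slowly
print(rho_weak, rho_strong)
```

Per layer, the token-dependent component is rescaled by roughly α_SA/(α_SA + 1) relative to the shared mean direction, so a larger skip strength slows the drift of ρ toward 1, in line with the role the theory assigns to α_SA.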

The backward‑pass analysis yields an exact expression for the norm of the gradient with respect to the query and key matrices at initialization. The gradient magnitude vanishes exponentially when β < β_c, confirming that the rank‑collapse regime is also a gradient‑vanishing regime. Conversely, in the entropy‑collapse regime gradients remain non‑zero but the forward‑pass representation is already pathological, leading to training instability.

Empirical validation is performed on a 60‑layer BERT‑style encoder trained on the TinyStories dataset. The authors vary α_SA and β across the three regimes predicted by the theory. In the rank‑collapse region, the average cosine similarity between token embeddings quickly reaches 1, and the model fails to learn (test loss stays high). In the entropy‑collapse region, attention matrices become extremely sparse, and training diverges. In the blue “trainable” region, cosine similarity stays moderate, attention entropy is high, and the model achieves low test loss comparable to a properly tuned baseline. The measured cosine‑similarity curves match the theoretical predictions (solid lines) almost perfectly, confirming the accuracy of the analytical approximations and the finite‑size corrections derived in Section 3.2.

Beyond the main experiment, three case studies illustrate the flexibility of the framework: (i) standard BERT architecture with varying depth, (ii) comparison of pre‑norm versus post‑norm layer‑norm placement, and (iii) alternative attention mechanisms such as linear‑attention and scaled‑dot‑product variants. In all cases the same (β, α_SA) analysis predicts whether a given configuration will suffer rank or entropy collapse, and suggests how to adjust the residual strength or the query/key variance to stay in the trainable regime.

In summary, the paper delivers a unified, mathematically exact theory of signal propagation in deep Transformers that simultaneously explains rank collapse and entropy collapse. By casting self‑attention as a Random Energy Model, the authors derive a simple, practically useful scaling law for query/key weights (β·log T) and identify a critical inverse temperature β_c. Together with the residual strength α_SA, these two hyper‑parameters fully determine the trainability of a deep Transformer at initialization. The resulting trainability diagrams and the accompanying algorithm give practitioners a principled, quantitative tool for designing and initializing very deep Transformer models, reducing reliance on costly trial‑and‑error and improving reliability of large‑scale language model training.

