Single-Head Attention in High Dimensions: A Theory of Generalization, Weights Spectra, and Scaling Laws


Trained attention layers exhibit striking and reproducible spectral structure of the weights, including low-rank collapse, bulk deformation, and isolated spectral outliers, yet the origin of these phenomena and their implications for generalization remain poorly understood. We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks generated from the attention-indexed model. Using tools from random matrix theory, spin-glass theory, and approximate message passing, we obtain an exact high-dimensional characterization of training and test error, interpolation and recovery thresholds, and the spectrum of the key and query matrices. Our theory predicts the full singular-value distribution of the trained query-key map, including low-rank structure and isolated spectral outliers, in qualitative agreement with observations in more realistic transformers. Finally, for targets with power-law spectra, we show that learning proceeds through sequential spectral recovery, leading to the emergence of power-law scaling laws.


💡 Research Summary

This paper presents a rigorous high‑dimensional theory for a single‑head tied‑attention layer, addressing three long‑standing puzzles in modern transformer models: (i) why the learned query‑key weight matrices exhibit a highly non‑random singular‑value spectrum (low‑rank collapse, bulk deformation, isolated outliers), (ii) how these spectral features relate to generalization performance, and (iii) why large‑scale transformers obey power‑law scaling relations between performance and resources.

Model and setting. The authors consider synthetic sequence‑to‑sequence (seq2seq) and sequence‑to‑label (seq2lab) tasks. Input sequences consist of T tokens, each represented by a d‑dimensional Gaussian embedding. The attention map is defined with tied query and key matrices W∈ℝ^{d×p}: A_W(x) = softmax_β(xWWᵀxᵀ − E_tr).
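The tied attention map above can be sketched in NumPy. This is a minimal illustration, not the authors' code: the function name, shapes, and the omission of the centering term E_tr (whose exact definition is cut off in the summary) are all assumptions.

```python
import numpy as np

def tied_attention(x, W, beta=1.0):
    """Single-head tied attention map A_W(x) = softmax_beta(x W W^T x^T).

    x : (T, d) array of T tokens with d-dimensional embeddings.
    W : (d, p) tied query/key weight matrix.
    The centering term E_tr from the summary is omitted here (assumption).
    """
    scores = beta * (x @ W) @ (x @ W).T          # (T, T) tied query-key scores
    scores -= scores.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)      # row-wise softmax

# Toy usage: T = 5 Gaussian tokens, embedding dimension d = 8, rank p = 3.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8)) / np.sqrt(8)
W = rng.standard_normal((8, 3)) / np.sqrt(8)
A = tied_attention(x, W, beta=2.0)  # (5, 5) row-stochastic attention matrix
```

Note that tying keys and queries makes the score matrix xWWᵀxᵀ symmetric and positive semidefinite of rank at most p, which is what makes the spectrum of the query‑key map WWᵀ the natural object of study.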

