Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability
Training divergence in transformers wastes compute, yet practitioners discover instability only after expensive runs begin. They therefore need an expected probability of failure for a transformer before training starts. Our study of Residual Koopman Spectral Profiling (RKSP) provides such an estimate. From a single forward pass at initialization, RKSP extracts Koopman spectral features by applying whitened dynamic mode decomposition to layer-wise residual snapshots. Our central diagnostic, the near-unit spectral mass, quantifies the fraction of modes concentrated near the unit circle, which captures instability risk. For predicting divergence across extensive configurations, this estimator achieves an AUROC of 0.995, outperforming the best gradient baseline. We further make this diagnostic actionable through Koopman Spectral Shaping (KSS), which reshapes spectra during training. We empirically validate that our method works in practice: RKSP predicts divergence at initialization, and when RKSP flags high risk, turning on KSS successfully prevents divergence. In the challenging high learning rate regime without normalization layers, KSS reduces the divergence rate from 66.7% to 12.5% and enables learning rates that are 50% to 150% higher. These findings generalize to WikiText-103 language modeling, vision transformers on CIFAR-10, and pretrained language models, including GPT-2 and LLaMA-2 up to 7B, as well as emerging architectures such as MoE, Mamba-style SSMs, and KAN.
💡 Research Summary
This paper tackles the pervasive problem of training divergence in transformer models by introducing two complementary techniques: Residual Koopman Spectral Profiling (RKSP) and Koopman Spectral Shaping (KSS). RKSP treats the residual connection of a transformer as a discrete-time dynamical system and, using only a single forward pass at initialization, collects layer-wise residual snapshots across a modest batch of inputs. Unlike conventional Dynamic Mode Decomposition (DMD) that relies on temporally shifted data, RKSP pairs input-output residuals from the same sample, forming snapshot matrices X_ℓ and Y_ℓ for each layer ℓ. After whitening the data with a zero-phase component analysis transform, RKSP solves a least-squares problem to obtain a linear operator Â_ℓ ≈ Y_ℓ X_ℓ^†. Eigen-decomposition of Â_ℓ yields a spectrum Λ_ℓ = {λ_{ℓ,j}}. From this spectrum three quantities are derived: the spectral radius ρ_ℓ, the eigenvector condition number κ_ℓ (a proxy for non-normality), and the near-unit spectral mass M_{≈1}^ℓ, defined as the fraction of eigenvalues whose magnitudes lie within a narrow band around the unit circle (|λ|∈
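
The per-layer profiling step summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the choice to center each snapshot matrix with its own mean, the reuse of the input whitening matrix for the outputs, and the regularizer `eps` are all assumptions made for illustration. The width of the near-unit band is cut off in the text above, so `band` is a hypothetical parameter that the caller must supply; the default used here is arbitrary.

```python
import numpy as np


def zca_whitening_matrix(Xc, eps=1e-6):
    """ZCA whitening matrix for centered snapshots Xc of shape (d, n).

    eps is an assumed regularizer that keeps the inverse square root finite.
    """
    cov = Xc @ Xc.T / Xc.shape[1]
    evals, evecs = np.linalg.eigh(cov)  # covariance is symmetric
    return evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T


def rksp_layer_features(X, Y, band=0.05, eps=1e-6):
    """Spectral features for one layer from paired snapshots.

    X, Y: arrays of shape (d, n); column i of X is the residual entering the
    layer for sample i and column i of Y is the residual leaving it.
    band: hypothetical half-width of the near-unit band (not from the source).
    """
    # Assumption: center each matrix with its own mean, then whiten both
    # with the whitening matrix estimated from the inputs.
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    W = zca_whitening_matrix(Xc, eps)
    Xw, Yw = W @ Xc, W @ Yc

    # Least-squares operator A_hat ~= Yw Xw^+ (Moore-Penrose pseudo-inverse).
    A_hat = Yw @ np.linalg.pinv(Xw)

    eigvals, eigvecs = np.linalg.eig(A_hat)
    mags = np.abs(eigvals)
    return {
        # Largest eigenvalue magnitude.
        "spectral_radius": float(mags.max()),
        # Condition number of the eigenvector matrix (non-normality proxy).
        "eigvec_condition": float(np.linalg.cond(eigvecs)),
        # Fraction of eigenvalues with | |lambda| - 1 | <= band.
        "near_unit_mass": float(np.mean(np.abs(mags - 1.0) <= band)),
    }
```

In practice `X` and `Y` would be gathered with forward hooks on each residual block during the single forward pass, with `n` comfortably larger than `d` (or after projecting to a lower-dimensional subspace) so that the least-squares fit is well posed. How the snapshots are flattened across tokens and batch elements is not specified in the text above and is left to the caller here.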