Phase diagram and eigenvalue dynamics of stochastic gradient descent in multilayer neural networks
Hyperparameter tuning is one of the essential steps to guarantee the convergence of machine learning models. We argue that intuition about the optimal choice of hyperparameters for stochastic gradient descent can be obtained by studying a neural network’s phase diagram, in which each phase is characterised by distinctive dynamics of the singular values of weight matrices. Taking inspiration from disordered systems, we start from the observation that the loss landscape of a multilayer neural network with mean squared error can be interpreted as a disordered system in feature space, where the learnt features are mapped to soft spin degrees of freedom, the initial variance of the weight matrices is interpreted as the strength of the disorder, and temperature is given by the ratio of the learning rate and the batch size. As the model is trained, three phases can be identified, in which the dynamics of weight matrices is qualitatively different. Employing a Langevin equation for stochastic gradient descent, previously derived using Dyson Brownian motion, we demonstrate that the three dynamical regimes can be classified effectively, providing practical guidance for the choice of hyperparameters of the optimiser.
💡 Research Summary
The paper proposes a novel physics‑based framework for understanding stochastic gradient descent (SGD) in deep multilayer neural networks by mapping the loss landscape onto a disordered spin‑glass Hamiltonian. The authors first rewrite the mean‑squared‑error loss of a feed‑forward network as a quadratic form involving “features” (the activations of the last hidden layer) and a coupling matrix constructed from the final weight matrix. These features are interpreted as continuous “soft spins” bounded between –1 and +1 by the hyperbolic‑tangent activation, while the product of the final weight matrix with its transpose yields a symmetric positive‑semidefinite coupling matrix that follows a Wishart‑Laguerre ensemble.
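This construction can be illustrated with a minimal numpy sketch (matrix sizes and the variance value are illustrative placeholders, not the paper's setup): a Gaussian‑initialised weight matrix W yields a coupling matrix C = WᵀW that is symmetric and positive semidefinite, i.e. a draw from the Wishart‑Laguerre ensemble.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 100          # width of the last hidden layer (illustrative)
sigma_w2 = 0.25           # initial weight variance sigma_W^2 (illustrative)

# Gaussian initialisation of the final weight matrix W
W = rng.normal(0.0, np.sqrt(sigma_w2 / n_features),
               size=(n_features, n_features))

# Coupling matrix acting on the features: C = W^T W is symmetric
# positive semidefinite, as required for a Wishart-Laguerre matrix
C = W.T @ W
eigvals = np.linalg.eigvalsh(C)
print(f"smallest eigenvalue: {eigvals.min():.2e} (>= 0 up to round-off)")
```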
Two key hyperparameters are identified as physical control knobs: the variance of the initial weight matrices (σ_W²) plays the role of disorder strength, and the ratio of learning rate η to batch size B (η/B) is identified as an effective temperature T. By varying σ_W² and T, the training dynamics fall into three qualitatively distinct phases:
- Ordered (ferromagnetic) phase – Small σ_W² (well below a critical value σ_c² = ½) keeps the pre‑activation distribution narrow, so features lie near zero, the tanh nonlinearity operates in its linear regime, and learning proceeds efficiently. The singular values of the weight matrices undergo Dyson Brownian motion, expanding steadily; the average level spacing Δ reaches a positive stable fixed point.
- Paramagnetic (fluctuation‑dominated) phase – Moderate σ_W² combined with higher temperature (large η/B) leads to noisy dynamics. Features are still centered near zero but fluctuate strongly; the singular‑value spectrum remains close to its initial random‑matrix width, and Δ hovers around an unstable fixed point, producing slow, noisy loss decay.
- Jamming (spin‑glass) phase – Large σ_W² (> σ_c²) pushes many pre‑activations into the saturated region of tanh, causing features to cluster near ±1. Gradient magnitudes vanish, the network essentially “jams,” and learning stalls. Singular values collapse toward a narrow band, and Δ decays to zero, indicating loss of spectral repulsion.
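The saturation mechanism separating these regimes is easy to check numerically. In the following sketch (the spread values are illustrative, not the paper's σ_c), a narrow pre‑activation distribution keeps tanh in its linear regime with features near zero, while a wide one clusters features near ±1:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=100_000)  # standard Gaussian pre-activations

for sigma in (0.1, 5.0):      # narrow vs wide pre-activation spread
    features = np.tanh(sigma * z)
    # mean |feature|: near 0 in the linear regime, near 1 when saturated
    print(f"sigma={sigma}: mean |tanh| = {np.abs(features).mean():.3f}")
```

In the saturated case the derivative 1 − tanh² is close to zero almost everywhere, which is exactly why gradients vanish and training jams.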
To formalize these observations, the authors derive a Langevin equation for SGD based on a Dyson‑Brownian‑motion description of the singular values. The equation contains the temperature T and disorder parameter 1/σ_W, allowing a stability analysis that yields analytic expressions for the phase boundaries. The average level spacing Δ obeys a stochastic differential equation whose fixed‑point structure directly maps onto the three phases described above.
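As a rough illustration of this picture (not the paper's exact Langevin equation), one can evolve a handful of eigenvalues under pairwise Coulomb‑like repulsion plus thermal noise of strength set by T, and track the average level spacing Δ; all parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 10                 # number of eigenvalues (illustrative)
T = 0.01               # effective temperature eta/B (illustrative)
dt = 1e-3
lam = np.sort(rng.normal(size=n))

for _ in range(5000):
    # pairwise repulsive drift: sum_j 1 / (lam_i - lam_j)
    diff = lam[:, None] - lam[None, :]
    np.fill_diagonal(diff, np.inf)        # suppress self-interaction
    drift = (1.0 / diff).sum(axis=1)
    # Langevin step: repulsion plus thermal noise of variance 2*T*dt
    lam = lam + dt * drift + np.sqrt(2 * T * dt) * rng.normal(size=n)
    lam.sort()

delta = np.mean(np.diff(lam))             # average level spacing Delta
print(f"average level spacing Delta = {delta:.3f}")
```

With repulsion dominant the spectrum spreads and Δ grows (ordered‑phase behaviour); suppressing the drift or raising T would instead leave the spacing noise‑dominated.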
Empirically, the authors train a two‑hidden‑layer multilayer perceptron with tanh activations on synthetic data, scanning a grid of (σ_W², η/B). They monitor loss, feature averages, singular‑value spectra, and Δ over training time. The resulting phase diagram matches the theoretical predictions: the ordered region occupies low σ_W² and low‑to‑moderate temperature, the paramagnetic region appears at higher temperature for the same σ_W², and the jamming region dominates at large σ_W² irrespective of temperature.
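The monitoring protocol can be sketched in plain numpy; the architecture sizes, synthetic data, and hyperparameter values below are placeholders rather than the paper's configuration. A small tanh network is trained with mini‑batch SGD on an MSE loss while the average singular‑value spacing Δ of a weight matrix is recorded each step:

```python
import numpy as np

rng = np.random.default_rng(3)

# synthetic regression data (placeholder for the paper's dataset)
X = rng.normal(size=(256, 20))
y = np.tanh(X @ rng.normal(size=(20, 1)))

sigma_w2, eta, B = 0.25, 0.05, 32        # illustrative hyperparameters
W1 = rng.normal(0, np.sqrt(sigma_w2 / 20), size=(20, 30))
W2 = rng.normal(0, np.sqrt(sigma_w2 / 30), size=(30, 1))

spacings = []
for step in range(200):
    idx = rng.choice(len(X), size=B, replace=False)   # mini-batch
    h = np.tanh(X[idx] @ W1)                          # features (soft spins)
    err = h @ W2 - y[idx]
    # manual backprop for the MSE loss
    gW2 = h.T @ err / B
    gW1 = X[idx].T @ ((err @ W2.T) * (1 - h**2)) / B
    W1 -= eta * gW1
    W2 -= eta * gW2
    # record the average singular-value spacing Delta of W1
    s = np.linalg.svd(W1, compute_uv=False)
    spacings.append(np.mean(np.diff(np.sort(s))))

print(f"Delta at start {spacings[0]:.4f}, at end {spacings[-1]:.4f}")
```

Scanning such a run over a grid of (σ_W², η/B) and inspecting the trajectory of Δ is the kind of diagnostic the phase diagram is built from.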
The practical implication is a physics‑inspired guideline for hyperparameter selection. To achieve successful training one should aim to stay inside the ordered phase: keep the initial weight variance modest (to avoid jamming) and choose a learning‑rate‑to‑batch‑size ratio that does not raise the effective temperature too high (to avoid the paramagnetic regime). The framework also suggests that monitoring singular‑value dynamics during training could serve as an early diagnostic of phase transitions and impending training failure.
Overall, the work bridges concepts from spin‑glass theory, random‑matrix dynamics, and stochastic optimization, offering both a deeper theoretical understanding of deep‑learning training dynamics and actionable insights for practitioners. Future extensions could explore other loss functions, activation families, or the infinite‑width limit, further enriching the dialogue between statistical physics and modern machine learning.