Does SGD Seek Flatness or Sharpness? An Exactly Solvable Model

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

A large body of theoretical and empirical work hypothesizes a connection between the flatness of a neural network's loss landscape during training and its performance. However, there is conceptually conflicting evidence about when SGD prefers flatter or sharper solutions during training. In this work, we partially but causally clarify the flatness-seeking behavior of SGD by identifying and exactly solving an analytically tractable model that exhibits both flattening and sharpening behavior during training. In this model, SGD training has no *a priori* preference for flatness, only a preference for minimal gradient fluctuations. This leads to the insight that, at least within this model, the data distribution uniquely determines the sharpness at convergence, and that a flat minimum is preferred if and only if the label noise is isotropic across all output dimensions. When the label noise is anisotropic, the model instead prefers sharpness and can converge to an arbitrarily sharp solution, depending on the imbalance in the label-noise spectrum. We reproduce this key insight in controlled settings with different model architectures, including MLPs, RNNs, and transformers.
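The isotropic-versus-anisotropic claim can be made concrete in a tiny toy. The sketch below is an assumption-laden miniature, not the paper's exact solvable model: a depth-2 linear network with scalar input, two outputs, and hidden width 1, so that the entire zero-loss manifold is parameterized by one rescaling degree of freedom `t`. Minimizing the SGD gradient fluctuation over `t` then shows the converged sharpness (exact Hessian trace in this toy) growing with the label-noise imbalance.

```python
import numpy as np

# Hedged toy version of the setting (not the paper's exact model): a depth-2
# linear net f(x) = w2 * (w1 * x) with scalar input x = 1, two outputs,
# teacher v, and per-output label-noise standard deviations sigma.
# On the zero-loss manifold w2_i * w1 = v_i, both the SGD gradient
# fluctuation and the Hessian trace are functions of t = w1 alone.

v = np.array([1.0, 0.0])  # teacher signal lives only in output 1

def fluctuation(t, sigma):
    # Variance of the per-sample gradient at a global minimum, where the
    # residual equals -eps: d(loss)/d(w1) = sum_i w2_i * (-eps_i) and
    # d(loss)/d(w2_i) = -eps_i * t.
    w2 = v / t
    return np.sum(w2**2 * sigma**2) + t**2 * np.sum(sigma**2)

def hessian_trace(t):
    # Exact Hessian trace of the MSE loss at a global minimum in this toy:
    # d2/dw1^2 = ||w2||^2, and d2/dw2_i^2 = t^2 for each of the 2 outputs.
    return np.sum((v / t)**2) + 2 * t**2

# Minimize the fluctuation over t. With isotropic noise the fluctuation is
# exactly proportional to the Hessian trace here, so minimizing one minimizes
# the other; anisotropic noise breaks this and selects a sharper minimum.
ts = np.linspace(0.05, 3.0, 5000)
results = {}
for s2 in [1.0, 3.0, 10.0]:
    sigma = np.array([1.0, s2])
    t_star = ts[np.argmin([fluctuation(t, sigma) for t in ts])]
    results[s2] = hessian_trace(t_star)
    print(f"noise stds (1, {s2:4.1f}) -> sharpness at convergence {results[s2]:6.2f}")
```

In this toy, isotropic noise makes the fluctuation proportional to the sharpness, so "minimal fluctuation" coincides with "flattest"; as the second output's noise grows, the selected minimum becomes arbitrarily sharp, mirroring the abstract's claim.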


💡 Research Summary

The paper tackles the long-standing “sharpness paradox” in deep learning: does stochastic gradient descent (SGD) inherently prefer flat minima, or can it converge to sharp ones? To answer this, the authors construct an analytically tractable setting: a $D$-layer deep linear network trained with mean-squared error on data generated by a linear teacher $V$ plus additive Gaussian label noise $\epsilon$. The global minima of the loss satisfy $W_D \cdots W_1 = V$, but because of matrix rescaling symmetries there exists an entire manifold of solutions with arbitrarily large Hessian trace (their chosen sharpness metric).
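The rescaling symmetry is easy to see numerically. The sketch below is illustrative only: it uses a depth-3 network with hypothetical random weights and the total squared weight norm as a stand-in for the Hessian-trace sharpness, which at an interpolating minimum of a deep linear network grows with the squared weight norms.

```python
import numpy as np

rng = np.random.default_rng(0)
# Depth-3 deep linear net whose end-to-end product equals a fixed teacher V.
V = rng.standard_normal((2, 2))
W1 = rng.standard_normal((2, 2))
W2 = rng.standard_normal((2, 2))
W3 = V @ np.linalg.inv(W2 @ W1)  # ensures W3 @ W2 @ W1 == V

for alpha in [1.0, 10.0, 100.0]:
    # Rescaling symmetry: the end-to-end map (hence the training loss) is
    # unchanged, but the weight norms, and with them the sharpness at the
    # minimum, grow without bound as alpha increases.
    W1a, W2a = W1 / alpha, W2 * alpha
    assert np.allclose(W3 @ W2a @ W1a, V)
    norm2 = sum(np.sum(W**2) for W in (W1a, W2a, W3))
    print(f"alpha={alpha:6.1f}  total weight norm^2 = {norm2:12.2f}")
```

Every `alpha` gives the same zero training loss, so the loss alone cannot select among these minima; something else in the SGD dynamics must break the tie.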

The key theoretical insight comes from the “minimal-fluctuation” perspective on SGD. Recent works have shown that minibatch SGD with learning rate $\eta$ is equivalent (up to higher-order terms) to minimizing an entropic loss: the original loss plus a penalty on the fluctuations of the per-sample gradients. On the degenerate manifold of global minima the original loss term is constant, so SGD drifts toward the point of minimal gradient fluctuation rather than toward the flattest point per se.
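A hedged sketch of this equivalence in formulas: the schematic form below follows the standard implicit-regularization literature, and the constants and the precise noise term are assumptions that may differ from the paper's exact expression.

```latex
% Schematic entropic loss minimized by SGD, up to higher-order terms in \eta:
% the original loss plus a penalty on the gradient fluctuations.
\tilde{L}(\theta) \;=\; L(\theta) \;+\; \frac{\eta}{4}\,\operatorname{Tr}\Sigma(\theta),
\qquad
\Sigma(\theta) \;=\; \operatorname{Cov}_{x}\!\left[\nabla_{\theta}\,\ell(\theta; x)\right].
```

On the zero-loss manifold $L(\theta)$ is constant, so minimizing $\tilde{L}$ reduces to minimizing $\operatorname{Tr}\Sigma(\theta)$, which is exactly the "minimal gradient fluctuation" preference described above.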

