Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv source.

We relate the behavior of stochastic gradient descent to Bayesian statistics by showing that SGD is effectively diffusion on a fractal landscape, whose fractal dimension can be accounted for in a purely Bayesian way. In doing so, we show that SGD can be regarded as a modified Bayesian sampler that accounts for accessibility constraints induced by the fractal structure of the loss landscape. We verify our results experimentally by examining the diffusion of weights during training. These results offer insight into the factors that determine the learning process and seemingly answer the question of how SGD and purely Bayesian sampling are related.


💡 Research Summary

The paper “Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent” presents a novel theoretical framework that connects stochastic gradient descent (SGD) with Bayesian inference by modeling the long‑run behavior of SGD as diffusion on a fractal loss landscape. The authors start from singular learning theory (SLT), which introduces the local learning coefficient (LLC) λ(w) as a measure of the fractal dimension of the set of low‑loss parameters around a point w. They then formulate a time‑fractional Fokker‑Planck equation (FFPE) using the Caputo derivative D^α_t (0 < α < 1) to capture both early‑stage super‑diffusive and late‑stage sub‑diffusive dynamics observed in neural network training.
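The Caputo derivative at the heart of the FFPE has a standard numerical approximation, the L1 finite-difference scheme. The sketch below (grid sizes and the test function are illustrative choices, not from the paper) validates the scheme against the known closed form D^α_t t² = Γ(3)/Γ(3−α) · t^{2−α}:

```python
import math

def caputo_l1(f_vals, dt, alpha):
    """L1 finite-difference approximation of the Caputo derivative
    D^alpha_t f at the final grid point, for 0 < alpha < 1."""
    n = len(f_vals) - 1
    coef = dt ** (-alpha) / math.gamma(2 - alpha)
    total = 0.0
    for k in range(n):
        # weights b_k = (k+1)^{1-alpha} - k^{1-alpha} of the L1 scheme
        b_k = (k + 1) ** (1 - alpha) - k ** (1 - alpha)
        total += b_k * (f_vals[n - k] - f_vals[n - k - 1])
    return coef * total

alpha, dt, T = 0.5, 1e-3, 1.0
grid = [i * dt for i in range(int(T / dt) + 1)]
numeric = caputo_l1([t ** 2 for t in grid], dt, alpha)
exact = math.gamma(3) / math.gamma(3 - alpha) * T ** (2 - alpha)
print(numeric, exact)  # the two values agree closely
```

The L1 scheme has error O(Δt^{2−α}), so a modest grid already matches the analytic value to several digits; the same discretization underlies most numerical solvers for time-fractional Fokker–Planck equations.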

In this setting, two additional fractal dimensions are defined: the spectral dimension d_s, describing how the volume of states visited by the diffusion scales with time (V_s(t) ∼ t^{d_s/2}), and the walk dimension d_walk, governing the scaling of the mean displacement R(t) ∼ t^{1/d_walk}. By invoking the Alexander‑Orbach relation, the authors derive d_walk = 2 λ(w) d_s for points near critical (degenerate) minima, linking the local geometry (λ) with the global diffusion properties (d_s).
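The scaling law R(t) ∼ t^{1/d_walk} suggests a direct way to recover the walk dimension from displacement data: regress log R(t) on log t. A minimal sketch on synthetic sub-diffusive data (the true exponent here is an assumption chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_walk_true = 3.0  # sub-diffusive: R(t) ~ t^{1/d_walk} with d_walk > 2
t = np.arange(1, 1001, dtype=float)
# synthetic displacements with small multiplicative noise
R = t ** (1.0 / d_walk_true) * np.exp(0.02 * rng.standard_normal(t.size))

# slope of log R vs. log t estimates 1/d_walk
slope, _ = np.polyfit(np.log(t), np.log(R), 1)
d_walk_est = 1.0 / slope
print(d_walk_est)  # close to 3.0
```

Applied to displacements recorded from an actual training run, the same fit yields an empirical d_walk that can be compared against the value predicted from λ(w) and d_s.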

A key technical contribution is the approximation of the diffusion tensor D_{ij}(w) by a scalar function D(w) under large‑batch, moderate‑learning‑rate regimes. This simplification enables analytical treatment of the steady‑state solution of the FFPE. The resulting stationary distribution takes the form

 p_∞(w) ∝ π(w) exp(−L(w)/T) · A(w),

where π(w) is the prior, L(w) the loss, T a temperature proportional to the learning rate, and A(w) ≈ ε^{λ(w)} encodes the “accessibility” of a region as determined by its local fractal dimension. In other words, SGD behaves like a tempered Bayesian sampler whose probabilities are re‑weighted by the fractal accessibility of the loss surface.
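The stationary density can be made concrete on a one-dimensional toy problem. In the hypothetical sketch below (the prior, loss, and λ(w) profile are all invented for illustration), two equally deep minima receive different mass once the accessibility factor ε^{λ(w)} is applied: since ε < 1, the minimum with the smaller local learning coefficient is favored.

```python
import numpy as np

# All functions below are illustrative choices, not from the paper.
w = np.linspace(-3.0, 3.0, 6001)
prior = np.exp(-0.5 * w ** 2)          # pi(w): Gaussian prior
loss = (w ** 2 - 1.0) ** 2             # L(w): double well, minima at w = +/-1
lam = 0.5 + 0.3 * (w + 3.0) / 6.0      # toy lambda(w), smaller near w = -1
T, eps = 0.1, 0.01

weights = prior * np.exp(-loss / T) * eps ** lam
p = weights / (weights.sum() * (w[1] - w[0]))  # normalize to a density
mode = w[np.argmax(p)]
print(mode)  # the mode sits at the more "accessible" minimum near w = -1
```

With a flat λ(w) the two wells would be equally weighted; the accessibility factor is exactly what breaks the tie, which is the qualitative claim of the stationary-distribution result.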

Empirically, the authors train fully connected networks on synthetic “Moons” data and standard image benchmarks. They measure the mean weight displacement R(t) and observe an initial super‑diffusive regime (R ∝ t^{1/ν} with ν < 2) followed by a sub‑diffusive regime consistent with the predicted walk dimension (d_walk > 2). They also estimate λ(w) using recent LLC estimation techniques, compute the accessibility factor A(w), and show that the empirical weight distribution aligns closely with the theoretical tempered posterior (low KL divergence).
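The displacement measurement itself is easy to sketch. The toy run below uses a noisy quadratic objective rather than a real network (the dimension, learning rate, and noise scale are invented), but it shows the procedure: record R(t) = ‖w_t − w_0‖ during training and compare log-log slopes in the early and late phases.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, steps, lr, noise = 50, 2000, 0.05, 0.1
w0 = rng.standard_normal(dim)
w = w0.copy()
R = np.empty(steps)

for t in range(steps):
    # gradient of L(w) = |w|^2 / 2 plus minibatch-style noise
    grad = w + noise * rng.standard_normal(dim)
    w = w - lr * grad
    R[t] = np.linalg.norm(w - w0)

times = np.arange(1, steps + 1)
early = slice(0, 10)                 # fast transient at the start of training
late = slice(steps // 2, steps)      # slow, near-stationary regime
slope_early, _ = np.polyfit(np.log(times[early]), np.log(R[early]), 1)
slope_late, _ = np.polyfit(np.log(times[late]), np.log(R[late]), 1)
print(slope_early, slope_late)  # the early slope is much larger than the late one
```

On this toy objective the crossover is from near-ballistic motion to saturation rather than the super- to sub-diffusive transition of a fractal landscape, but the detection method is the same: in a real experiment the fitted late-time slope estimates 1/d_walk.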

The paper discusses limitations: the need for explicit estimation of the fractional order α and the scalar diffusion coefficient D(w), potential bias in LLC estimation in high‑dimensional spaces, and the focus on relatively small models. Future work is suggested on extending the framework to large‑scale architectures (e.g., Transformers), developing robust fractal‑dimension estimators, and designing adaptive learning‑rate schedules that respect the underlying fractional dynamics.

Overall, the work offers a compelling synthesis of singular learning theory, fractal geometry, and stochastic differential equations, providing a fresh Bayesian interpretation of SGD that accounts for the complex, degenerate structure of modern neural‑network loss landscapes. This perspective could influence both theoretical analyses of generalization and practical algorithm design.

