Gradient Descent with Large Step Sizes: Chaos and Fractal Convergence Region
We examine gradient descent in matrix factorization and show that under large step sizes the parameter space develops a fractal structure. We derive the exact critical step size for convergence in scalar-vector factorization and show that near criticality the selected minimizer depends sensitively on the initialization. Moreover, we show that adding regularization amplifies this sensitivity, generating a fractal boundary between initializations that converge and those that diverge. The analysis extends to general matrix factorization with orthogonal initialization. Our findings reveal that near-critical step sizes induce a chaotic regime of gradient descent where the training outcome is unpredictable and there are no simple implicit biases, such as towards balancedness, minimum norm, or flatness.
💡 Research Summary
The paper investigates the dynamics of gradient descent (GD) when the learning rate is large, focusing on matrix factorization problems. Starting with the simplest case—factorizing a scalar target y as the inner product of two vectors u and v—the authors derive an explicit expression for the critical learning rate η* that separates convergence from divergence for almost all initializations. When |y| η < 1, the convergence region D_η is essentially the set of points satisfying a quadratic‑plus‑quartic inequality, and it occupies almost the whole space.
Near this critical learning rate, the algorithm becomes extremely sensitive to the initial condition. The authors prove that for any point on the boundary of the convergence region, arbitrarily small perturbations can lead to three qualitatively different outcomes: (i) convergence to a global minimizer with any prescribed norm γ, (ii) convergence to a different global minimizer, or (iii) convergence to a strict saddle point. Thus infinitesimal changes in the starting point can produce drastically different final models, contradicting the intuition that GD is robust to small perturbations.
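The scalar picture is easy to probe numerically. The sketch below is a minimal illustration, not the paper's code: the target y = 1, the symmetric initialization, and the two step sizes η = 0.9 and η = 2.5 are choices made here to land on either side of the critical value.

```python
# Minimal sketch of GD on the scalar factorization loss
# L(u, v) = 0.5 * (u*v - y)**2, illustrating the critical step size.
# The concrete values (y = 1, symmetric init, eta = 0.9 vs 2.5) are
# illustrative choices, not taken from the paper.

def gd_scalar(u, v, y, eta, steps=500, blowup=1e6):
    """Run GD; return (u, v, diverged)."""
    for _ in range(steps):
        r = u * v - y                              # residual
        u, v = u - eta * r * v, v - eta * r * u
        if abs(u) > blowup or abs(v) > blowup:
            return u, v, True
    return u, v, False

# Below the critical step size: converges to a global minimizer uv = y.
u, v, diverged = gd_scalar(0.5, 0.5, y=1.0, eta=0.9)
print(diverged, abs(u * v - 1.0) < 1e-6)   # False True

# Well above it: the iterates blow up.
_, _, diverged_big = gd_scalar(0.5, 0.5, y=1.0, eta=2.5)
print(diverged_big)                        # True
```

Note that for y = 1 every global minimizer (a, 1/a) has sharpness a² + 1/a² ≥ 2, so once η exceeds 1 no minimizer is linearly stable, which is why the outcome changes so abruptly around the critical step size.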
When ℓ₂ regularization (λ > 0) is added, the situation becomes richer. The previously smooth convergence boundary turns into a self‑similar fractal. After symmetry reduction the boundary can be described by an iterated function system whose attractor has an estimated fractal dimension of about 1.249. Regularization also creates a binary selection between the minimal‑norm and maximal‑norm global minimizers; at the critical learning rate, an infinitesimal change in the initialization can flip this selection.
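One way to visualize the convergence region is to classify a grid of initializations as convergent or divergent at a fixed step size. The sketch below is a toy probe of the regularized scalar loss; λ, η, the grid, and the blow-up threshold are all illustrative choices, and the fractal structure reported in the paper only emerges when zooming into the boundary between the two classes near the critical step size.

```python
# Toy probe of the convergence region of regularized scalar GD:
# L(u, v) = 0.5*(u*v - y)**2 + lam*(u**2 + v**2).
# All parameter values here are illustrative, not the paper's.

def converges(u, v, y=1.0, eta=0.9, lam=0.1, steps=400, blowup=1e6):
    for _ in range(steps):
        r = u * v - y
        gu = r * v + 2 * lam * u           # dL/du
        gv = r * u + 2 * lam * v           # dL/dv
        u, v = u - eta * gu, v - eta * gv
        if abs(u) > blowup or abs(v) > blowup:
            return False
    return True

# Classify a coarse grid of initializations: small inits stay bounded,
# large ones blow up, and the boundary between the two classes is what
# becomes fractal near the critical step size.
grid = [(0.2 * i, 0.2 * j) for i in range(1, 16) for j in range(1, 16)]
labels = [converges(u0, v0) for (u0, v0) in grid]
print(any(labels), all(labels))            # True False
```

Refining the grid around the boundary (rather than over the whole box) is the practical way to expose the self-similar structure.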
The authors then extend the analysis to the full matrix factorization problem
min_{U,V} ½‖UᵀV − Y‖_F² + λ(‖U‖_F² + ‖V‖_F²).
If the initialization lies in a subspace defined by orthogonal conditions (e.g., identity initialization or the subspace used in linear residual networks), the dynamics decouple into independent scalar factorization processes. Consequently, all the phenomena proved for the scalar case—critical step size, fractal convergence region, and chaotic sensitivity—hold on this subspace for the high‑dimensional problem.
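The decoupling is easy to check numerically for a diagonal target and an identity-scaled initialization: the matrix GD iterates stay diagonal, and each diagonal entry evolves exactly like an independent scalar run. The sketch below assumes the gradients ∇_U = V(UᵀV − Y)ᵀ + 2λU and ∇_V = U(UᵀV − Y) + 2λV of the objective above; dimensions, target, and hyperparameters are illustrative choices.

```python
import numpy as np

# Sketch: with Y diagonal and U0 = V0 = c*I, matrix-factorization GD
# decouples into independent scalar factorizations. Values illustrative.
Y = np.diag([1.0, 2.0, 0.5])
eta, lam, steps, c = 0.1, 0.01, 200, 0.3

U = c * np.eye(3)
V = c * np.eye(3)
for _ in range(steps):
    R = U.T @ V - Y                        # residual U^T V - Y
    gU = V @ R.T + 2 * lam * U             # dL/dU
    gV = U @ R + 2 * lam * V               # dL/dV
    U, V = U - eta * gU, V - eta * gV

# Matching scalar runs, one per diagonal entry of Y.
def gd_scalar(u, v, y):
    for _ in range(steps):
        r = u * v - y
        gu, gv = r * v + 2 * lam * u, r * u + 2 * lam * v
        u, v = u - eta * gu, v - eta * gv
    return u, v

scalar_u = np.array([gd_scalar(c, c, y)[0] for y in np.diag(Y)])
print(np.allclose(np.diag(U), scalar_u))       # True: entries decouple
print(np.allclose(U, np.diag(np.diag(U))))     # True: U stayed diagonal
```

Because the off-diagonal gradients vanish identically on this subspace, the agreement is exact, not just approximate: every scalar phenomenon (critical step size, fractal boundary, sensitivity) lifts directly to the matrix problem.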
The root cause of chaos is identified as a “folding” behavior of the GD update map GD_η(θ) = θ − η∇L(θ). The map sends a region C onto a superset of C in a multi‑fold covering manner. When a convergence boundary is invariant under this map, the covering induces self‑similarity, leading to a fractal structure and mixing on the boundary. For networks with polynomial activations, the authors prove that, after removing a measure‑zero set, GD acts as a covering map on each connected component of the parameter space. This property guarantees a lower bound on the topological entropy: h(GD_η) ≥ log 3, a classic hallmark of chaotic dynamics.
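The multi-fold covering can be seen on the simplest symmetric slice. On the line u = v with y = 1 and no regularization (an illustrative reduction chosen here, not the paper's general argument), one GD step acts as the cubic map s ↦ s(1 + η(1 − s²)); for large enough η a value in the range has three distinct preimages, i.e., the update folds the line over itself three times, which is the mechanism behind the entropy bound log 3.

```python
import numpy as np

# Sketch: on the symmetric slice u = v (y = 1, no regularization),
# one GD step reduces to the cubic map f(s) = s*(1 + eta*(1 - s**2)).
# For eta = 2 (illustrative), a point in the range has three distinct
# real preimages: a 3-fold folding of the line over itself.
eta = 2.0

def f(s):
    return s * (1 + eta * (1 - s ** 2))

target = 0.0
# Solve f(s) = target, i.e. -eta*s^3 + (1 + eta)*s - target = 0.
roots = np.roots([-eta, 0.0, 1 + eta, -target])
real = sorted(r.real for r in roots if abs(r.imag) < 1e-10)
print(len(real))                           # 3 distinct real preimages
print(all(abs(f(s)) < 1e-9 for s in real)) # each maps to the target
```

A map that is 3-to-1 on an invariant set admits itineraries over three symbols, which is exactly how a topological-entropy lower bound of log 3 arises.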
Empirical validation is performed on (i) scalar and matrix factorization, (ii) deep linear and nonlinear (ResNet‑style, ReLU) networks, and (iii) real‑world datasets (image and text). In all cases, large learning rates produce fractal‑like convergence boundaries and extreme sensitivity to initialization, confirming the theoretical predictions. Small perturbations of the initial weights under a fixed large learning rate lead to large variations in final test accuracy, norm, and imbalance of the learned factors.
In summary, the paper demonstrates that gradient descent with near‑critical large step sizes does not simply accelerate training; it drives the optimizer into a chaotic regime where the outcome is fundamentally unpredictable. Traditional implicit‑bias explanations (balancedness, minimum norm, flat minima) break down, and the choice of learning rate becomes a decisive factor shaping the final model. This work opens new directions for understanding and controlling optimization dynamics in the high‑learning‑rate regime.