Robust Gradient Descent via Heavy-Ball Momentum with Predictive Extrapolation
Accelerated gradient methods like Nesterov’s Accelerated Gradient (NAG) achieve faster convergence on well-conditioned problems but often diverge on ill-conditioned or non-convex landscapes due to aggressive momentum accumulation. We propose Heavy-Ball Synthetic Gradient Extrapolation (HB-SGE), a robust first-order method that combines heavy-ball momentum with predictive gradient extrapolation. Unlike classical momentum methods that accumulate historical gradients, HB-SGE estimates future gradient directions using local Taylor approximations, providing adaptive acceleration while maintaining stability. We prove convergence guarantees for strongly convex functions and demonstrate empirically that HB-SGE prevents divergence on problems where NAG and standard momentum fail. On ill-conditioned quadratics (condition number $κ=50$), HB-SGE converges in 119 iterations while both SGD and NAG diverge. On the non-convex Rosenbrock function, HB-SGE achieves convergence in 2,718 iterations where classical momentum methods diverge within 10 steps. While NAG remains faster on well-conditioned problems, HB-SGE provides a robust alternative with speedup over SGD across diverse landscapes, requiring only $O(d)$ memory overhead and the same hyperparameters as standard momentum.
💡 Research Summary
The paper addresses a well-known weakness of accelerated first-order optimizers, particularly Nesterov's Accelerated Gradient (NAG): while they offer optimal convergence rates on smooth strongly convex problems, they often become unstable and diverge on ill-conditioned or non-convex loss landscapes. To overcome this, the authors propose Heavy-Ball Synthetic Gradient Extrapolation (HB-SGE), a method that augments the classic heavy-ball momentum update with a predictive gradient extrapolation step derived from a first-order Taylor approximation. Instead of storing a long history of gradients, HB-SGE computes a synthetic gradient $\tilde g_t = \nabla f(x_t) + \alpha_t \Delta g_t$, where $\Delta g_t = \nabla f(x_t) - \nabla f(x_{t-1})$ and $\alpha_t$ is an adaptive extrapolation coefficient that is reduced whenever the gradient norm increases, thereby damping aggressive predictions in regions of high curvature. The momentum update then blends this synthetic gradient with the existing momentum buffer, $m_{t+1} = \beta m_t + (1-\beta)\tilde g_t$, and the parameters are updated as $x_{t+1} = x_t - \eta m_{t+1}$. The algorithm requires only $O(d)$ memory (current and previous gradients) and uses the same two hyperparameters (learning rate $\eta$ and momentum $\beta$) as standard momentum methods, making it a drop-in replacement.
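The update equations above can be sketched as a single step function. This is a minimal illustration, not the authors' implementation; in particular, the exact schedule for $\alpha_t$ is an assumption here (the summary only says it is reduced when the gradient norm grows), as are the default hyperparameter values.

```python
def hb_sge_step(x, m, g, g_prev, eta=0.01, beta=0.9,
                alpha_max=0.5, damp=0.5):
    """One HB-SGE update, following the summary's equations.

    x, m       : parameters and momentum buffer (lists of floats)
    g, g_prev  : current and previous gradients (lists of floats)
    Returns the updated (x, m).
    """
    norm = lambda v: sum(vi * vi for vi in v) ** 0.5
    # Damp extrapolation when the gradient norm increased (assumed rule;
    # the paper's exact adaptive schedule is not given in the summary).
    alpha = alpha_max if norm(g) <= norm(g_prev) else alpha_max * damp
    # Synthetic gradient: g~_t = g_t + alpha * (g_t - g_{t-1})
    g_syn = [gi + alpha * (gi - gpi) for gi, gpi in zip(g, g_prev)]
    # Momentum blend m_{t+1} = beta*m_t + (1-beta)*g~_t, then the step.
    m = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, g_syn)]
    x = [xi - eta * mi for xi, mi in zip(x, m)]
    return x, m
```

Setting $\beta = 0$ and $\alpha_{\max} = 0$ recovers plain gradient descent, which makes the momentum and extrapolation pieces easy to ablate individually.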
Theoretical contributions include: (1) a linear convergence proof for strongly convex $L$-smooth functions (Theorem 3.2): under the conditions $\eta \le 1/\bigl(L(1+\alpha_{\max})\bigr)$ and $\beta < 1$, the expected squared distance to the optimum contracts by a factor $1 - \eta\mu(1-\beta)^2$ each iteration; the proof extends the classic heavy-ball analysis by bounding the synthetic gradient's norm using $L$-smoothness and the extrapolation term. (2) A sample-complexity corollary showing that $\varepsilon$-accuracy is achieved in $O\bigl(\tfrac{1}{\eta\mu(1-\beta)}\log(\|x_0 - x^*\|^2/\varepsilon)\bigr)$ iterations. (3) A stability analysis for quadratic objectives (Theorem 3.5): the authors derive the eigenvalues of the HB-SGE iteration matrix, $\lambda_{\mathrm{HB}}(\lambda_i) = 1 - \eta\lambda_i(1 + \alpha\eta\lambda_i) + \beta$, and prove that if $\alpha < 2\eta L - 1$, all eigenvalues lie inside the unit circle, guaranteeing convergence even when NAG's eigenvalues exceed one. This demonstrates that the extrapolation term counteracts the eigenvalue amplification that causes NAG to diverge on poorly conditioned problems.
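To make the stability analysis concrete, one can diagonalize a quadratic objective and track each eigenmode separately. The derivation below is a sketch obtained from the update equations as stated in this summary, with a fixed $\alpha$; the paper's own parameterization of the iteration matrix may differ.

```latex
% For f(x) = (\lambda_i/2) x^2, substituting g_t = \lambda_i x_t and
% \eta m_t = x_{t-1} - x_t into the HB-SGE update yields a two-step recursion:
\[
x_{t+1} = \bigl(1 + \beta - c(1+\alpha)\bigr)\, x_t
        + \bigl(c\alpha - \beta\bigr)\, x_{t-1},
\qquad c := \eta(1-\beta)\lambda_i ,
\]
% whose characteristic polynomial is
\[
z^2 - \bigl(1 + \beta - c(1+\alpha)\bigr)\, z + \bigl(\beta - c\alpha\bigr) = 0 .
\]
% The mode converges iff both roots satisfy |z| < 1; checking this over all
% eigenvalues \lambda_i of the Hessian gives the stability region.
```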
Empirically, the method is evaluated on three families of synthetic benchmarks that capture distinct challenges: (i) ill-conditioned quadratic functions with condition numbers $\kappa = 10, 50, 100, 500$ in 10 dimensions; (ii) the non-convex Rosenbrock function, which exhibits a narrow curved valley with a curvature ratio of roughly 100:1; and (iii) the Beale function, featuring multiple local minima, saddles, and flat regions. For each problem, learning rates are conservatively tuned so that vanilla SGD converges, and the same $\eta$ and $\beta$ are used for all optimizers (Adam receives half the learning rate). Results show that at $\kappa = 50$, HB-SGE converges in 119 iterations, whereas both SGD and NAG diverge; at $\kappa = 500$, HB-SGE converges in 210 iterations while NAG and classical momentum diverge within the first 100 steps. On the Rosenbrock function, NAG and momentum explode after 10 iterations, yet HB-SGE reaches the global minimum after 2,718 iterations, outpacing Adam's slow progress. On the Beale function, HB-SGE attains the lowest function value fastest, demonstrating robustness to complex non-convex geometry. Table 1 confirms that per-iteration computational cost and memory usage remain $O(d)$ for all methods, while HB-SGE enjoys a larger effective convergence constant, $1 - \eta\mu(1-\beta)/2$, translating into a practical speed-stability trade-off.
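The ill-conditioned quadratic setting is easy to reproduce in miniature. The sketch below runs an HB-SGE-style loop on a diagonal 10-dimensional quadratic with eigenvalues spread from 1 to 50 ($\kappa = 50$). The hyperparameters and the $\alpha$ schedule are assumptions for illustration, not the paper's settings, so iteration counts will not match the reported 119.

```python
def run_hb_sge(eigs, x0, eta=0.02, beta=0.9, alpha_max=0.5, iters=3000):
    """Minimize f(x) = 0.5 * sum(a_i * x_i^2) with an HB-SGE-style loop."""
    x, m = list(x0), [0.0] * len(eigs)
    g_prev = [a * xi for a, xi in zip(eigs, x)]
    prev_norm = sum(gi * gi for gi in g_prev) ** 0.5
    for _ in range(iters):
        g = [a * xi for a, xi in zip(eigs, x)]          # exact gradient
        g_norm = sum(gi * gi for gi in g) ** 0.5
        # Damp extrapolation when the gradient norm grew (assumed rule).
        alpha = alpha_max if g_norm <= prev_norm else 0.5 * alpha_max
        g_syn = [gi + alpha * (gi - gp) for gi, gp in zip(g, g_prev)]
        m = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, g_syn)]
        x = [xi - eta * mi for xi, mi in zip(x, m)]
        g_prev, prev_norm = g, g_norm
    return x

eigs = [1 + 49 * i / 9 for i in range(10)]   # condition number kappa = 50
x0 = [1.0] * 10
x = run_hb_sge(eigs, x0)
loss = 0.5 * sum(a * xi * xi for a, xi in zip(eigs, x))
```

Because the step size is tuned conservatively ($\eta L = 1$ here), plain gradient descent would also converge on this instance; the interesting comparisons in the paper arise at step sizes and condition numbers where momentum-based baselines become unstable.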
In conclusion, HB‑SGE offers a principled way to blend the stabilizing effect of heavy‑ball momentum with a lightweight predictive step, achieving accelerated convergence on well‑conditioned problems while preserving robustness on ill‑conditioned or non‑convex landscapes. Its simplicity—requiring only an extra gradient difference and an adaptive scalar—makes it immediately applicable to existing deep‑learning pipelines without additional memory or hyper‑parameter overhead. The authors suggest future work on stochastic variants, distributed asynchronous settings, and integration with adaptive learning‑rate schedules, which could further broaden the method’s applicability to large‑scale neural network training.