Adaptive Momentum and Nonlinear Damping for Neural Network Training


We propose a continuous-time scheme for large-scale optimization that introduces individual, adaptive momentum coefficients regulated by the kinetic energy of each model parameter. This approach automatically adjusts to local landscape curvature to maintain stability without sacrificing convergence speed. We demonstrate that our adaptive friction can be related to cubic damping, a suppression mechanism from structural dynamics. Furthermore, we introduce two specific optimization schemes by augmenting the continuous dynamics of mSGD and Adam with a cubic damping term. Empirically, our methods demonstrate robustness and match or outperform Adam on training ViT, BERT, and GPT2 tasks where mSGD typically struggles. We further provide theoretical results establishing the exponential convergence of the proposed schemes.


💡 Research Summary

The paper presents a novel continuous‑time perspective on momentum‑based stochastic optimization and leverages this view to design adaptive‑friction optimizers that outperform or match Adam on large‑scale transformer training while retaining the simplicity of SGD‑style methods.
First, the authors reinterpret the discrete dynamics of momentum SGD (mSGD) as a discretization of linearly dissipative Hamiltonian dynamics (LDHD). In this formulation the model parameters are the particle’s position x, the exponentially weighted gradient accumulator is the momentum p, the loss f(x) is a potential energy, and the kinetic energy is ½‖p‖². Linear friction −γp in the continuous system corresponds exactly to the momentum coefficient μ in the discrete update via μ = 1 − γ√Δt. This equivalence shows that a fixed μ is tantamount to applying the same friction to every coordinate, which is sub‑optimal when the loss landscape exhibits strong anisotropy.
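The correspondence between friction and momentum can be checked numerically. The sketch below is not the authors' code. It assumes that Δt plays the role of the learning rate, that the integration step is h = √Δt, and that mSGD is written in the common heavy-ball form v ← μv + g, x ← x − Δt·v. The quadratic test loss and the function names (`grad`, `ldhd_euler`, `msgd`) are illustrative choices only. Under those assumptions, a semi-implicit Euler step on ẋ = p, ṗ = −∇f(x) − γp gives the same iterates as mSGD with μ = 1 − γ√Δt.

```python
import math


def grad(x, curvatures):
    """Gradient of the toy quadratic loss f(x) = 0.5 * sum(a_i * x_i**2)."""
    return [a * xi for a, xi in zip(curvatures, x)]


def ldhd_euler(x0, curvatures, gamma, h, steps):
    """Semi-implicit Euler on dx/dt = p, dp/dt = -grad f(x) - gamma * p."""
    x, p = list(x0), [0.0] * len(x0)
    for _ in range(steps):
        g = grad(x, curvatures)
        # Momentum update with linear friction -gamma * p.
        p = [pi + h * (-gi - gamma * pi) for pi, gi in zip(p, g)]
        # Position update uses the freshly updated momentum.
        x = [xi + h * pi for xi, pi in zip(x, p)]
    return x


def msgd(x0, curvatures, mu, lr, steps):
    """Heavy-ball momentum SGD: v <- mu * v + g, x <- x - lr * v."""
    x, v = list(x0), [0.0] * len(x0)
    for _ in range(steps):
        g = grad(x, curvatures)
        v = [mu * vi + gi for vi, gi in zip(v, g)]
        x = [xi - lr * vi for xi, vi in zip(x, v)]
    return x


lr = 0.01                 # assumed to play the role of Delta t
h = math.sqrt(lr)         # assumed integration step of the continuous system
gamma = 1.0               # linear friction coefficient
mu = 1.0 - gamma * h      # momentum coefficient implied by the friction
curvatures = [1.0, 50.0]  # anisotropic toy landscape (hypothetical values)
x0 = [1.0, 1.0]

x_ldhd = ldhd_euler(x0, curvatures, gamma, h, steps=200)
x_msgd = msgd(x0, curvatures, mu, lr, steps=200)
print(x_ldhd, x_msgd)
```

Both coordinates share the single coefficient μ even though their curvatures differ by a factor of 50, which is the uniform-friction limitation the paragraph above describes.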
To address this, the authors introduce a per‑parameter kinetic‑energy‑based friction variable ξ ∈ ℝⁿ that evolves according to
  \dot ξ =

