Adaptive Optimization via Momentum on Variance-Normalized Gradients

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We introduce MVN-Grad (Momentum on Variance-Normalized Gradients), an Adam-style optimizer that improves stability and performance by combining two complementary ideas: variance-based normalization and momentum applied after normalization. MVN-Grad scales each coordinate by an exponential moving average of gradient uncertainty and applies momentum to the resulting normalized gradients, eliminating the cross-time coupling between stale momentum and a stochastic normalizer present in standard Adam-type updates. We prove that this decoupling yields strictly smaller one-step conditional update variance than momentum-then-normalize variance methods under standard noise assumptions, and that MVN-Grad is robust to outliers: it has a uniformly bounded response to single gradient spikes. In low-variance regimes, we further show variance normalization avoids sign-type collapse associated with second-moment scaling and can yield accelerated convergence. Across CIFAR-100 image classification and GPT-style language modeling benchmarks, MVN-Grad matches or outperforms Adam, AdaBelief, and LaProp, delivering smoother training and improved generalization with no added overhead.


💡 Research Summary

The paper introduces MVN‑Grad, a new Adam‑style optimizer that simultaneously addresses two fundamental shortcomings of existing adaptive methods: temporal coupling between momentum and the adaptive denominator, and the loss of signal magnitude in low‑noise regimes caused by second‑moment scaling. MVN‑Grad normalizes each coordinate by an exponential moving average (EMA) of the variance (centered second moment) rather than the uncentered second moment, and it applies momentum after this normalization. This “normalize‑then‑momentum” ordering eliminates the cross‑time ratio that occurs in standard “momentum‑then‑normalize” schemes such as Adam and AdaBelief, where a stale momentum term is divided by a stochastic denominator that can dip sharply in a low‑noise batch, leading to exploding step sizes.
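The "normalize-then-momentum" ordering described above can be sketched as a per-coordinate update rule. This is a hedged reconstruction from the summary alone: the exact EMA definitions, bias corrections, and hyperparameter defaults (`beta1`, `beta2`, `eps`) are assumptions, not the paper's verbatim algorithm.

```python
import numpy as np

def mvn_grad_step(g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One MVN-Grad-style step: variance-normalize first, then apply momentum.

    Sketch based on the paper's description; details are assumptions.
    Works elementwise, so `g` and the state entries may be scalars or arrays.
    """
    mu, v, m = state["mu"], state["v"], state["m"]
    # EMA of the gradient mean, used to center the second moment.
    mu = beta2 * mu + (1 - beta2) * g
    # EMA of the centered second moment, i.e. a variance estimate.
    v = beta2 * v + (1 - beta2) * (g - mu) ** 2
    # Normalize the *current* gradient by the variance estimate first...
    g_hat = g / (np.sqrt(v) + eps)
    # ...then apply momentum to the already-normalized gradient, so no
    # stale momentum buffer is divided by a fresh, stochastic denominator.
    m = beta1 * m + (1 - beta1) * g_hat
    state.update(mu=mu, v=v, m=m)
    return -lr * m
```

Because the denominator is applied to the current gradient only, a single gradient spike produces a bounded step: the spike inflates both `g` and the variance estimate, and their ratio stays roughly constant regardless of the spike's magnitude.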

Theoretical contributions are threefold. First, under a symmetric noise assumption and assuming the EMA tracks the conditional mean exactly, Theorem 3.1 shows that the one‑step conditional variance of MVN‑Grad's update is strictly smaller than that of AdaBelief using the same variance estimator. The variance gap is expressed as $(2\beta_1 - \beta_2)\, m_{t-1}^2 \operatorname{Var}[\ldots]$
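The denominator-dip failure mode that this theorem formalizes can be illustrated with a toy calculation. The numbers below (a stale momentum buffer of 0.5, a small current gradient, and a denominator that drops from 1.0 to 0.01 on a low-noise batch) are hypothetical values chosen for illustration, not taken from the paper.

```python
def mn_step(m_prev, g, denom, beta1=0.9, eps=1e-8):
    """Momentum-then-normalize (Adam/AdaBelief-style ordering):
    the momentum buffer, including its stale history, is divided
    by the current stochastic denominator."""
    m = beta1 * m_prev + (1 - beta1) * g
    return m / (denom + eps)

def nm_step(m_prev, g, denom, beta1=0.9, eps=1e-8):
    """Normalize-then-momentum (MVN-Grad ordering): only the current
    gradient is divided; the momentum buffer is left untouched."""
    return beta1 * m_prev + (1 - beta1) * (g / (denom + eps))

m_prev, g = 0.5, 0.01           # stale momentum, small current gradient
for denom in (1.0, 0.01):       # normal batch vs. low-noise dip
    print(denom, mn_step(m_prev, g, denom), nm_step(m_prev, g, denom))
```

When the denominator dips by 100x, the momentum-then-normalize step inflates by roughly the same factor, because the entire stale buffer is rescaled; the normalize-then-momentum step barely moves, since the dip rescales only the small current gradient. This is the cross-time coupling that MVN-Grad's ordering removes.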

