DOME: Improving Signal-to-Noise in Stochastic Gradient Descent via Sharp-Direction Subspace Filtering
Stochastic gradients for deep neural networks exhibit strong correlations along the optimization trajectory, and are often aligned with a small set of Hessian eigenvectors associated with outlier eigenvalues. Recent work shows that projecting gradients away from this Hessian outlier subspace has little impact on optimization, despite that subspace capturing a large fraction of gradient variability. Since computing the Hessian is intractable in practice, we introduce a principled first-order characterization of the nuisance subspace based on the covariance of stochastic gradients, and propose an efficient method to estimate it online. We show that removing this subspace also has little impact on optimization, and yields practical benefits for applications sensitive to gradient signal-to-noise ratio, such as gradient compression.
💡 Research Summary
The paper tackles a fundamental issue in modern deep‑learning optimization: stochastic gradients, especially when minibatch sizes are small, tend to concentrate in a low‑dimensional subspace that aligns with the outlier eigenvectors of the Hessian. Prior work demonstrated that explicitly projecting gradients away from this “Hessian outlier” subspace has negligible impact on convergence, but computing the Hessian is infeasible for large models.
The authors propose a first‑order surrogate: the centered covariance of per‑sample gradients, Σₜ = E[(gₜ − ḡₜ)(gₜ − ḡₜ)ᵀ], where ḡₜ = E[gₜ] is the mean gradient. Its leading eigenvectors, which can be estimated online without any second‑order information, define the nuisance subspace. Projecting gradients away from this subspace likewise leaves optimization largely unaffected while improving the gradient signal‑to‑noise ratio for downstream applications such as gradient compression.
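To make the idea concrete, here is a minimal sketch of covariance-based subspace filtering on synthetic data. This is not the paper's algorithm: the planted low-rank noise model, the use of Oja's rule for streaming eigenvector estimation, and all names (`sample_gradient`, `filter_gradient`, the rank `k`, step size `eta`) are illustrative assumptions. It only shows the general mechanism: estimate the dominant eigenvectors of the gradient covariance online, then remove that component from each gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, eta = 50, 3, 0.01  # gradient dim, assumed nuisance rank, Oja step size

# Planted structure (assumption for the demo): gradient "noise" concentrates
# in a k-dimensional subspace with much larger variance than the signal.
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))

def sample_gradient():
    signal = 0.1 * rng.standard_normal(d)          # small isotropic signal
    noise = basis @ (5.0 * rng.standard_normal(k))  # dominant low-rank noise
    return signal + noise

# Streaming estimate of the top-k eigenvectors of the (uncentered, for
# simplicity) gradient covariance via Oja's rule with QR re-orthonormalization.
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
for _ in range(3000):
    g = sample_gradient()
    Q, _ = np.linalg.qr(Q + eta * np.outer(g, g) @ Q)

def filter_gradient(g, Q):
    # Remove the component of g lying in the estimated nuisance subspace.
    return g - Q @ (Q.T @ g)

g = sample_gradient()
print("raw norm:", np.linalg.norm(g))
print("filtered norm:", np.linalg.norm(filter_gradient(g, Q)))
```

After the streaming phase, `Q` spans (approximately) the planted noise subspace, so the filtered gradient retains only the small isotropic signal component. In a real optimizer the covariance would be estimated from per-sample or minibatch gradients on the fly rather than from a synthetic generator.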