Effectively Leveraging Momentum Terms in Stochastic Line Search Frameworks for Fast Optimization of Finite-Sum Problems

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In this work, we address unconstrained finite-sum optimization problems, with particular focus on instances originating in large-scale deep learning scenarios. Our main interest lies in exploring the relationship between recent line-search approaches for stochastic optimization in the overparametrized regime and momentum directions. First, we point out that combining these two elements with computational benefits is not straightforward. To this end, we propose a solution based on mini-batch persistency. We then introduce an algorithmic framework that exploits a mix of data persistency, conjugate-gradient-type rules for the definition of the momentum parameter, and stochastic line searches. The resulting algorithm provably possesses convergence properties under suitable assumptions and is empirically shown to outperform other popular methods from the literature, obtaining state-of-the-art results in both convex and nonconvex large-scale training problems.


💡 Research Summary

This paper tackles the unconstrained finite‑sum optimization problem that underlies most supervised deep‑learning tasks, focusing on the over‑parameterized (interpolation) regime where each training sample can be fitted exactly. While stochastic gradient descent (SGD) and its adaptive variants (Adam, RMSProp, etc.) are the de‑facto workhorses, recent work has shown that, under interpolation, simple SGD can achieve the same linear convergence rate as full‑batch gradient descent. Building on this, stochastic line‑search methods—both classic Armijo‑type and non‑monotone variants—have been proposed to adaptively select step sizes with provable linear convergence.

However, integrating momentum terms into stochastic line‑search frameworks is non‑trivial. Momentum updates add a term proportional to the previous step (x_k – x_{k‑1}), which was computed with respect to the loss of the previous mini‑batch. When the current mini‑batch differs substantially from the previous one, this momentum direction may no longer be a descent direction for the current stochastic objective, forcing the algorithm to shrink the momentum coefficient β or to perform many back‑tracking steps, thereby eroding the practical benefits of momentum.
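A toy numerical check makes this concrete (all vectors below are made up for illustration): a direction d is a descent direction for the current stochastic loss iff g_kᵀ d < 0, and a momentum term inherited from a stale mini‑batch can flip that sign.

```python
import numpy as np

# Toy 2-D illustration (vectors are invented for the example): a momentum
# term computed on a stale mini-batch can destroy the descent property.
g_k = np.array([1.0, 0.0])       # gradient on the current mini-batch
s_prev = np.array([2.0, 0.0])    # previous step x_k - x_{k-1}, now "stale"

def is_descent(g, d):
    # d is a descent direction for the current stochastic loss iff g^T d < 0
    return float(g @ d) < 0.0

d_small = -g_k + 0.2 * s_prev    # modest momentum: g^T d = -0.6 < 0, still descent
d_large = -g_k + 1.0 * s_prev    # stale momentum dominates: g^T d = +1.0, not descent
```

This is exactly the failure mode that either forces β toward zero or triggers long back‑tracking loops.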

The authors resolve this issue by introducing mini‑batch persistency: consecutive mini‑batches share a fraction of samples (e.g., 50‑80 %). This overlap makes the stochastic functions f_k and f_{k‑1} more similar, ensuring that the momentum term remains aligned with the current gradient and that the stochastic Armijo condition can be satisfied with few back‑tracking iterations.
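A minimal sketch of such an overlapping sampler, with illustrative sizes and an assumed 60 % overlap fraction (the paper's exact sampling scheme may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, batch_size, overlap_frac = 1000, 100, 0.6  # illustrative values

def next_batch(prev_batch, n, b, frac, rng):
    """Mini-batch persistency: reuse a fraction `frac` of the previous
    mini-batch and fill the remaining slots with freshly drawn samples."""
    n_keep = int(frac * b)
    kept = rng.choice(prev_batch, size=n_keep, replace=False)
    pool = np.setdiff1d(np.arange(n), kept)          # avoid duplicating kept samples
    fresh = rng.choice(pool, size=b - n_keep, replace=False)
    return np.concatenate([kept, fresh])

b0 = rng.choice(n_samples, size=batch_size, replace=False)
b1 = next_batch(b0, n_samples, batch_size, overlap_frac, rng)
shared = np.intersect1d(b0, b1).size                  # at least 60 shared samples
```

Because f_k and f_{k‑1} are averages over largely shared samples, the two stochastic objectives stay close, which is what keeps the momentum term useful.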

A second contribution is a conjugate‑gradient‑inspired rule for the momentum coefficient β_k. Instead of fixing β, they compute
β_k = (g_kᵀ s_{k‑1}) / (s_{k‑1}ᵀ y_{k‑1}),
where s_{k‑1}=x_k−x_{k‑1} and y_{k‑1}=g_k−g_{k‑1}. This rule guarantees 0 ≤ β_k ≤ 1, prevents excessive amplification of momentum, and naturally integrates with restart or subspace‑optimization safeguards when β_k becomes unfavorable.
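The rule above can be sketched as follows; the explicit clipping to [0, 1] and the zero‑denominator restart are our illustrative safeguards, not necessarily the paper's exact mechanism:

```python
import numpy as np

def cg_beta(g_k, s_prev, y_prev, eps=1e-12):
    """CG-type momentum coefficient from the summary:
    beta_k = (g_k^T s_{k-1}) / (s_{k-1}^T y_{k-1}),
    with s_{k-1} = x_k - x_{k-1} and y_{k-1} = g_k - g_{k-1}.
    Clipping to [0, 1] is an illustrative restart-style safeguard."""
    denom = float(s_prev @ y_prev)
    if abs(denom) < eps:
        return 0.0  # restart: fall back to the plain negative gradient
    return float(np.clip(float(g_k @ s_prev) / denom, 0.0, 1.0))
```

For example, with g_k = (1, 1), s_{k‑1} = (0.5, 0.5), y_{k‑1} = (2, 2), the rule gives β_k = 1/2 = 0.5.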

The resulting algorithm proceeds as follows: (1) construct a new mini‑batch that overlaps with the previous one; (2) compute stochastic gradient g_k and the CG‑type β_k; (3) form the search direction d_k = –g_k + β_k (x_k−x_{k‑1}); (4) apply a stochastic Armijo (or non‑monotone) line‑search to obtain a step size α_k, using a back‑tracking loop that is typically short thanks to the persistency; (5) update the parameters and repeat.
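Steps (1)–(5) can be sketched end‑to‑end on a toy least‑squares problem where interpolation holds by construction. All constants here (overlap of one half, Armijo parameters c = 1e‑4 and ρ = 0.5, initial step 1) are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

# Toy interpolating least-squares problem: y = A @ x_true exactly.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5)); x_true = rng.normal(size=5); y = A @ x_true

def loss_grad(x, idx):
    r = A[idx] @ x - y[idx]
    return 0.5 * float(np.mean(r**2)), A[idx].T @ r / idx.size

x = np.zeros(5); x_prev = x.copy(); g_prev = None
idx = rng.choice(200, size=40, replace=False)
for k in range(50):
    # (1) overlapping mini-batch: keep half of the old batch, draw half fresh
    kept = rng.choice(idx, size=20, replace=False)
    fresh = rng.choice(np.setdiff1d(np.arange(200), kept), size=20, replace=False)
    idx = np.concatenate([kept, fresh])
    f, g = loss_grad(x, idx)
    # (2)-(3) CG-type beta_k and momentum search direction
    beta = 0.0
    if g_prev is not None:
        s, ygrad = x - x_prev, g - g_prev
        if abs(float(s @ ygrad)) > 1e-12:
            beta = float(np.clip(float(g @ s) / float(s @ ygrad), 0.0, 1.0))
    d = -g + beta * (x - x_prev)
    if float(g @ d) >= 0:       # safeguard: restart if not a descent direction
        d = -g
    # (4) stochastic Armijo back-tracking line search on the current mini-batch
    alpha, c, rho = 1.0, 1e-4, 0.5
    while loss_grad(x + alpha * d, idx)[0] > f + c * alpha * float(g @ d):
        alpha *= rho
    # (5) update and store the previous iterate/gradient
    x_prev, g_prev = x.copy(), g
    x = x + alpha * d

final_loss = loss_grad(x, np.arange(200))[0]
```

Thanks to the batch overlap, the back‑tracking loop in step (4) typically terminates after very few halvings, which is the practical point of the persistency mechanism.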

The authors prove convergence under the Polyak‑Łojasiewicz (PL) condition together with interpolation. They show that the expected function value decreases linearly:
E[f(x_k) − f*] ≤ ρᵏ (f(x_0) − f*) for some ρ ∈ (0, 1), where f* denotes the optimal value.

