A Short and Unified Convergence Analysis of the SAG, SAGA, and IAG Algorithms
Stochastic variance-reduced algorithms such as Stochastic Average Gradient (SAG) and SAGA, and their deterministic counterparts like the Incremental Aggregated Gradient (IAG) method, have been extensively studied in large-scale machine learning. Despite their popularity, existing analyses for these algorithms are disparate, relying on different proof techniques tailored to each method. Furthermore, the original proof of SAG is known to be notoriously involved, requiring computer-aided analysis. Focusing on finite-sum optimization with smooth and strongly convex objective functions, our main contribution is to develop a single unified convergence analysis that applies to all three algorithms: SAG, SAGA, and IAG. Our analysis features two key steps: (i) establishing a bound on delays due to stochastic sub-sampling using simple concentration tools, and (ii) carefully designing a novel Lyapunov function that accounts for such delays. The resulting proof is short and modular, providing the first high-probability bounds for SAG and SAGA that can be seamlessly extended to non-convex objectives and Markov sampling. As an immediate byproduct of our new analysis technique, we obtain the best known rates for the IAG algorithm, significantly improving upon prior bounds.
💡 Research Summary
This paper presents a single, unified convergence analysis that simultaneously covers three widely used variance‑reduced optimization methods: Stochastic Average Gradient (SAG), its unbiased variant SAGA, and the deterministic counterpart Incremental Aggregated Gradient (IAG). The setting is the finite‑sum problem f(x)= (1/N)∑_{i=1}^N f_i(x) with each component f_i being L‑smooth and the overall objective f being µ‑strongly convex. The authors address a long‑standing gap: existing proofs for these algorithms are fragmented, rely on distinct techniques, and in the case of SAG are notoriously intricate, requiring computer‑assisted polynomial positivity checks.
The analysis proceeds in two conceptual steps. First, they bound the “staleness” caused by random sub‑sampling. By applying Bernstein’s inequality to the sampling process, they obtain a high‑probability bound on the maximum delay τ_{i,k} between the current iteration k and the last time component i was accessed. This bound scales as Õ(N) and holds with probability at least 1−δ for any prescribed δ>0. Consequently, on the high‑probability “good event”, both SAG and SAGA can be viewed as delayed versions of full‑gradient descent, with a deterministic upper bound on the delay.
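The delay-capping step can be illustrated with a quick simulation (not from the paper; the function `max_delay` and all parameter values are illustrative): under i.i.d. uniform sampling, the worst observed staleness stays within a small multiple of N·log N, consistent with the Õ(N) bound up to logarithmic factors.

```python
import random

def max_delay(N, K, seed=0):
    """Sample indices 0..N-1 uniformly for K steps and return the largest
    staleness tau_{i,k}: the gap between the current step k and the last
    step at which component i was sampled (step 0 initializes every slot)."""
    rng = random.Random(seed)
    last_seen = [0] * N
    worst = 0
    for k in range(1, K + 1):
        i = rng.randrange(N)
        worst = max(worst, k - last_seen[i])
        last_seen[i] = k
    return worst

# With N components and K = 50*N steps, the worst delay is typically a
# few multiples of N, far below the trivial bound K.
N = 100
print(max_delay(N, 50 * N))
```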
Second, the authors construct a novel Lyapunov function that explicitly incorporates the stale gradients. Unlike traditional Lyapunov arguments that track only the distance ‖x_k − x_*‖² or the gradient error, the new function aggregates the squared norms of the differences ∇f_i(x_{k−τ_{i,k}}) − ∇f_i(x_*), thereby maintaining a “window” of outdated information. By carefully combining smoothness, strong convexity, and the delay bound, they derive a one‑step contractive inequality for this Lyapunov quantity. This yields a high‑probability linear convergence rate of the form
‖x_k−x_*‖² ≤ C·(1−c/(κN))^k
where κ = L/µ is the condition number, and c>0 is an absolute constant. The result holds simultaneously for SAG (biased estimator) and SAGA (unbiased estimator); for SAG the delay bound neutralizes the bias, while for SAGA the unbiasedness simplifies the argument.
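This linear convergence can be checked on a toy instance (a minimal sketch, not from the paper; the 1‑D quadratic finite sum and all names are illustrative) using the standard SAGA update with a gradient table and running average:

```python
import random

# Toy finite sum: f(x) = (1/N) * sum_i 0.5 * a_i * (x - b_i)^2,
# each f_i is a_i-smooth and f is strongly convex (illustrative data).
def saga(a, b, steps, lr, seed=0):
    rng = random.Random(seed)
    N = len(a)
    x = 0.0
    table = [a[i] * (x - b[i]) for i in range(N)]  # stored component gradients
    avg = sum(table) / N                           # running average of the table
    for _ in range(steps):
        i = rng.randrange(N)
        g_new = a[i] * (x - b[i])
        # SAGA's unbiased estimator: g_new - table[i] + avg
        x -= lr * (g_new - table[i] + avg)
        avg += (g_new - table[i]) / N
        table[i] = g_new
    return x

a = [1.0 + 0.01 * i for i in range(50)]
b = [float(i % 5) for i in range(50)]
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)  # exact minimizer
x_hat = saga(a, b, steps=5000, lr=0.1, seed=1)
print(abs(x_hat - x_star))   # error shrinks geometrically in steps
```

Because the estimator's variance vanishes at the optimum, a constant step size suffices and the iterate converges to x_* at a linear rate, unlike plain SGD.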
The paper further extends the framework in three directions.
- High‑probability guarantees – Theorem 3.11 provides explicit δ‑dependent bounds, filling the gap left by earlier works that only offered guarantees in expectation.
- Markov‑chain sampling – By invoking Paulin’s (2015) Bernstein inequality for uniformly ergodic Markov chains, the authors show that the same delay bound and Lyapunov analysis apply when component indices are generated by a Markov process rather than i.i.d. uniform draws (Theorem 3.14).
- Deterministic IAG – The same Lyapunov construction applies unchanged to IAG, where the delay is exactly N (one full cycle). This yields a dramatically improved convergence rate O(exp(−K/(κN))) (Theorem 4.1), compared with the previously known O(exp(−K/(κ²N²))) bound. The new rate matches the intuition that N IAG steps are equivalent to one GD step.
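The practical gap between the two IAG rates can be sketched with back‑of‑the‑envelope arithmetic (not from the paper; the values of κ, N, and ε are illustrative): reaching accuracy ε under a rate exp(−K/r) requires K ≈ r·log(1/ε) iterations, so the rate denominator r translates directly into iteration counts.

```python
import math

def iters_needed(rate_denominator, eps):
    """Smallest K with exp(-K / rate_denominator) <= eps."""
    return math.ceil(rate_denominator * math.log(1.0 / eps))

kappa, N, eps = 100, 1000, 1e-6
k_new = iters_needed(kappa * N, eps)        # new rate: O(exp(-K/(kappa*N)))
k_old = iters_needed(kappa**2 * N**2, eps)  # old rate: O(exp(-K/(kappa^2*N^2)))
print(k_new, k_old)  # the new bound saves roughly a factor of kappa*N
```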
Finally, the authors sketch extensions to non‑convex objectives (Section 3.4), showing that the Lyapunov framework can still deliver high‑probability bounds on the norm of the gradient, thereby unifying analysis across convex, non‑convex, stochastic, and deterministic settings.
In summary, the paper’s contributions are threefold: (i) a simple concentration argument that caps sampling‑induced delays, (ii) a novel Lyapunov function that accommodates those delays, and (iii) a modular proof that yields high‑probability linear rates for SAG, SAGA, and IAG, as well as extensions to Markov sampling and non‑convex losses. This unified approach not only simplifies existing proofs but also sets a versatile foundation for analyzing more sophisticated variance‑reduced methods such as accelerated or second‑order variants.