Gradient descent avoids strict saddles with a simple line-search method too
It is known that gradient descent (GD) on a $C^2$ cost function generically avoids strict saddle points when using a small, constant step size. However, no such guarantee existed for GD with a line-search method. We provide one for a modified version of the standard Armijo backtracking method with generic, arbitrarily large initial step size. The proof underlines the double role of the Luzin $N^{-1}$ property for the iteration maps, and allows us to forgo the habitual Lipschitz gradient assumption. We extend this to the Riemannian setting (RGD), assuming the retraction is real analytic (though the cost function still only needs to be $C^2$). In closing, we also improve guarantees for RGD with a constant step size in some scenarios.
💡 Research Summary
The paper addresses a long‑standing gap in the theory of non‑convex optimization: while it is known that gradient descent (GD) with a small, constant step size avoids strict saddle points (critical points where the Hessian has at least one negative eigenvalue), no comparable guarantee existed for GD equipped with a line‑search method. The authors propose a modest modification of the classic Armijo backtracking line‑search (Algorithm 1) and prove that this “stabilized” line‑search GD avoids strict saddles for almost every choice of the initial step size, even when that initial step is arbitrarily large.
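For reference, a plain Armijo backtracking GD loop can be sketched as below. This is a minimal textbook illustration, not the paper's stabilized Algorithm 1; the function names and the constants (τ = 0.5, sufficient-decrease parameter c = 10⁻⁴) are our choices. The candidate step sizes τⁱᾱ it tries are exactly the ones the paper's analysis tracks.

```python
import numpy as np

def armijo_gd(f, grad_f, x0, alpha_bar=1.0, tau=0.5, c=1e-4, iters=100):
    """Gradient descent with classic Armijo backtracking.

    Textbook variant; the paper analyzes a slightly modified
    ("stabilized") version, but the candidate steps tau**i * alpha_bar
    tried at each iteration are the same.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad_f(x)
        alpha = alpha_bar
        # Try alpha_bar, tau*alpha_bar, tau^2*alpha_bar, ... until the
        # Armijo sufficient-decrease condition holds.
        while f(x - alpha * g) > f(x) - c * alpha * np.dot(g, g):
            alpha *= tau
        x = x - alpha * g
    return x

# Usage: minimize f(x) = ||x||^2 / 2, whose unique minimizer is the origin.
x_star = armijo_gd(lambda x: 0.5 * x @ x, lambda x: x, np.array([3.0, -4.0]))
```

On this quadratic the very first candidate step ᾱ = 1 already satisfies the sufficient-decrease condition, so no backtracking occurs and the iterate lands on the minimizer immediately.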
The analysis proceeds in two phases. In the early phase the step size may change from iteration to iteration; the update can be written as x_{t+1} = g_{i_t}(x_t), where each candidate map g_i(x) = x − τ^i ᾱ ∇f(x) is smooth, but the selection i_t depends discontinuously on the current point. Classical arguments based on the Center‑Stable Manifold Theorem (CSMT) require a fixed, smooth iteration map and therefore do not apply. The authors overcome this by exploiting the Luzin N⁻¹ property: a map has this property if the pre‑image of every measure‑zero set has measure zero, and any C¹ map whose Jacobian is invertible almost everywhere qualifies. They show (Lemma 3.7) that for GD the iteration maps satisfy Luzin N⁻¹ for almost all step sizes, even without a global Lipschitz gradient assumption.
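The discontinuity of the selection is easy to observe numerically. In the sketch below (our toy example, not from the paper: f(x) = x⁴ in one dimension, τ = 1/2, ᾱ = 1, c = 10⁻⁴), each candidate map g_i(x) = x − τ^i ᾱ f′(x) is smooth, yet the accepted backtracking index jumps as the current point varies:

```python
# Toy 1-D cost (our choice, for illustration): f(x) = x^4.
f = lambda x: x**4
df = lambda x: 4.0 * x**3

tau, alpha_bar, c = 0.5, 1.0, 1e-4

def armijo_index(x):
    """Smallest i such that the step tau**i * alpha_bar satisfies
    the Armijo sufficient-decrease condition at x."""
    i = 0
    while f(x - tau**i * alpha_bar * df(x)) > f(x) - c * tau**i * alpha_bar * df(x)**2:
        i += 1
    return i

# The selected index i_t is piecewise constant in x and jumps between pieces:
for x in (0.3, 0.8, 1.5, 3.0):
    print(f"x = {x}: backtracking index i = {armijo_index(x)}")
```

The farther the point is from the minimizer, the larger the gradient and the more backtracking steps are needed, so the map x ↦ i(x) is discontinuous even though every individual g_i is smooth.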
When the algorithm converges, the step size eventually stabilizes. At that point the iteration reduces to standard GD with a fixed step size, and the authors can invoke a refined version of the CSMT (Hirsch et al., 1977) that only needs the Jacobian to be invertible almost everywhere, not everywhere. This yields Theorem 1.3: for a C¹ map with an almost‑everywhere invertible Jacobian, the set of initial points that converge to an unstable fixed point (i.e., a strict saddle) has measure zero. Combining the local CSMT argument with the global Luzin N⁻¹ property gives the main Euclidean result (Theorem 1.2): the stabilized backtracking line‑search GD avoids strict saddles for almost every initial step size.
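In symbols, the refined statement (a paraphrase of Theorem 1.3 as summarized above, not the paper's verbatim wording) reads, with $Dg$ the Jacobian and $\mu$ the Lebesgue measure:

$$
g \in C^1(\mathbb{R}^n,\mathbb{R}^n),\quad \det Dg(x) \neq 0 \ \text{for a.e. } x
\;\Longrightarrow\;
\mu\Bigl(\bigl\{\,x_0 : \lim_{t\to\infty} g^t(x_0) \text{ is an unstable fixed point of } g\,\bigr\}\Bigr) = 0.
$$

The classical CSMT-based argument requires $\det Dg(x) \neq 0$ for every $x$; weakening this to almost every $x$ is what lets the result cover arbitrarily large step sizes without a Lipschitz bound on $\nabla f$.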
The authors extend the theory to Riemannian manifolds. Assuming the manifold, its metric, and the retraction are real‑analytic, the Riemannian gradient descent (RGD) update x_{t+1}=R_{x_t}(−α_t grad f(x_t)) inherits the Luzin N⁻¹ property. Real‑analyticity guarantees that the retraction’s Jacobian is invertible almost everywhere, allowing the same local‑to‑global argument to go through. Consequently, Theorem 1.4 states that the same line‑search scheme avoids strict saddles on any real‑analytic manifold with a real‑analytic retraction (e.g., spheres, Stiefel manifolds, orthogonal groups, SPD matrices).
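As a concrete instance (our sketch, not code from the paper): RGD on the unit sphere with the metric-projection retraction R_x(v) = (x+v)/‖x+v‖, which is real-analytic, minimizing the Rayleigh quotient f(x) = xᵀAx. A constant step size is used for brevity; the paper's scheme would instead choose α_t by the modified backtracking search.

```python
import numpy as np

def rgd_sphere(A, x0, alpha=0.1, iters=500):
    """Riemannian GD for f(x) = x^T A x on the unit sphere.

    Retraction: R_x(v) = (x + v) / ||x + v||, a real-analytic retraction
    of the kind required by the paper's Riemannian results. Constant
    step size here for simplicity.
    """
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        egrad = 2.0 * A @ x                 # Euclidean gradient
        rgrad = egrad - (x @ egrad) * x     # project onto tangent space at x
        v = x - alpha * rgrad               # step in the tangent space
        x = v / np.linalg.norm(v)           # retract back onto the sphere
    return x

# Usage: minimizing the Rayleigh quotient recovers the eigenvector of the
# smallest eigenvalue (here e_3, eigenvalue 1).
A = np.diag([3.0, 2.0, 1.0])
x_min = rgd_sphere(A, np.array([1.0, 1.0, 1.0]))
```

The saddle points of this problem are precisely the other eigenvectors (strict saddles when eigenvalues are distinct), which makes it a standard test case for saddle-avoidance results.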
Beyond the “almost‑all” guarantee for the line search, the paper also sharpens results for constant step sizes. In Euclidean space, GD with a fixed step size α avoids strict saddles for almost every α>0, removing the classical restriction α<1/L that depends on a Lipschitz constant of ∇f. In the Riemannian setting, analogous almost‑every guarantees hold under real‑analyticity, and for certain manifolds (e.g., Hadamard manifolds with the exponential map) the admissible step sizes cover a full interval (0,1/L), matching the Euclidean case.
Overall, the work shows that adaptive line‑search methods retain the desirable saddle‑avoidance property of constant‑step GD, while dispensing with restrictive Lipschitz and step‑size assumptions. The key technical contributions are (i) the use of the Luzin N⁻¹ property to propagate measure‑zero sets through a countable family of possibly discontinuous iteration maps, and (ii) a refined CSMT that only requires Jacobian invertibility almost everywhere. These tools open the door to rigorous convergence analyses for a broad class of adaptive optimization algorithms in both Euclidean and Riemannian contexts.