Sharpness-Aware Minimization Can Hallucinate Minimizers
Sharpness-Aware Minimization (SAM) is widely used to seek flatter minima, which are often linked to better generalization. In its standard implementation, SAM updates the current iterate using the loss gradient evaluated at a point perturbed by distance $ρ$ along the normalized gradient direction. We show that, for some choices of $ρ$, SAM can stall at points where this shifted (perturbed-point) gradient vanishes even though the original gradient does not; such points are therefore not stationary points of the original loss. We call these points hallucinated minimizers, prove their existence under simple nonconvex landscape conditions (e.g., the presence of a local minimizer and a local maximizer), and establish sufficient conditions for local convergence of the SAM iterates to them. We corroborate this failure mode in neural network training and observe that it aligns with the performance degradation often seen with SAM at large $ρ$. Finally, as a practical safeguard, we find that a short initial SGD warm-start before enabling SAM mitigates this failure mode and reduces sensitivity to the choice of $ρ$.
💡 Research Summary
Sharpness‑Aware Minimization (SAM) has become a popular tool for encouraging flat minima, which are often associated with better generalization. In practice, SAM is implemented by first perturbing the current parameters x by a distance ρ along the normalized gradient direction u(x)=∇f(x)/‖∇f(x)‖, forming the shifted point x⁺=x+ρ u(x), and then updating the parameters using the gradient evaluated at this shifted point: x_{k+1}=x_k−η_k ∇f(x_k⁺). This implementation does not follow the true gradient of the surrogate loss f_SAM(x)=f(x⁺). Consequently, the dynamics can become stationary whenever ∇f(x⁺)=0, even if the original gradient ∇f(x)≠0. The authors call such points “hallucinated minimizers”.
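In code, the standard update is a two-step gradient computation. A minimal NumPy sketch (the quadratic example, the step sizes, and the small `eps` guard in the normalization are illustrative choices, not from the paper):

```python
import numpy as np

def sam_step(x, grad, rho, eta, eps=1e-12):
    """One standard SAM update: perturb along the normalized gradient,
    then descend using the gradient evaluated at the shifted point."""
    g = grad(x)
    u = g / (np.linalg.norm(g) + eps)   # u(x) = ∇f(x)/‖∇f(x)‖
    x_plus = x + rho * u                # shifted point x⁺
    return x - eta * grad(x_plus)       # uses ∇f(x⁺), not the true ∇f_SAM(x)

# Example on f(x) = ‖x‖², where grad(x) = 2x:
grad = lambda x: 2.0 * x
x_new = sam_step(np.array([1.0, 0.0]), grad, rho=0.1, eta=0.1)
# x⁺ = [1.1, 0], the gradient there is [2.2, 0], so x_new ≈ [0.78, 0.0]
```

Note that the update direction is `grad(x_plus)`, not the chain-rule gradient of f_SAM; this gap is exactly what allows the dynamics to stall where ∇f(x⁺)=0.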
The paper makes several theoretical contributions. First, it formalizes hallucinated minimizers as points where (i) the shifted gradient vanishes, (ii) the original gradient does not, and (iii) the point is a local minimizer of the surrogate loss f_SAM. Under a mild non‑degeneracy condition (the perturbation map is locally invertible), vanishing of the surrogate gradient is equivalent to vanishing of the shifted gradient.
The core existence results are:
- Theorem 2.2 (isolated local maximizer): If a smooth loss f has a global minimizer x* and an isolated local maximizer x̄, then for any radius ρ ≥ ‖x*−x̄‖ there exists a hallucinated minimizer. The proof constructs a super‑level set around x̄, picks the point x_h on its boundary farthest from x*, and uses Lagrange multipliers to show that the gradient at x_h points exactly toward x*. Setting ρ=‖x*−x_h‖ yields x_h⁺=x*, so ∇f(x_h⁺)=0 while ∇f(x_h)≠0.
- Theorem 2.4 (bounded maximizer set): When the maximizer is not isolated but forms a bounded connected set X (e.g., a plateau due to symmetry), the same phenomenon occurs provided f is real‑analytic. Real‑analyticity enables the use of the Łojasiewicz inequality to rule out accumulation of critical points near X, ensuring the boundary of the constructed super‑level set is smooth and the Lagrange‑multiplier argument applies. Again, there is an interval of radii, containing values ρ ≥ dist(x*, X), for which hallucinated minimizers exist.
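A one‑dimensional example (my own illustration, not one from the paper) makes the construction concrete: f(x) = x²(x−2)² has global minimizers at 0 and 2 and an isolated local maximizer at 1. In one dimension the normalized gradient reduces to sign(f′(x)), and choosing x_h = 0.8 with ρ = 1.2 shifts x_h exactly onto the minimizer at 2:

```python
import numpy as np

f  = lambda x: x**2 * (x - 2)**2            # minima at 0 and 2, maximum at 1
df = lambda x: 4 * x * (x - 1) * (x - 2)    # f'(x)

rho, x_h = 1.2, 0.8
x_plus = x_h + rho * np.sign(df(x_h))       # 1-D perturbation: x⁺ = x + ρ·sign(f'(x))

print(df(x_h))    # ≈ 0.768 -> the original gradient is nonzero
print(df(x_plus)) # ≈ 0.0   -> the shifted gradient vanishes (x⁺ = 2 is a minimizer)
```

This matches the theorem's mechanism: the gradient at x_h points toward a minimizer, and the radius is tuned so the shifted point lands exactly on it.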
Dynamic properties are studied in Section 3. Theorem 3.1 shows that an isolated component of hallucinated minimizers can be a locally attracting set for the discrete‑time SAM iteration, provided the Jacobian I+ρ∇u(x) is nonsingular at the fixed point. Theorem 3.2 extends this to non‑isolated manifolds: if the perturbation map is locally invertible, the set of hallucinated minimizers inherits the dimension of the original minimizer manifold, forming a smooth manifold of attractors.
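The attraction is easy to observe numerically on a toy landscape (again my own illustration, not the paper's example). For f(x) = x²(x−2)² with ρ = 1.2, the point x = 0.8 is a fixed point of the SAM map because the shifted point lands on the minimizer at 2; since f″(2) = 8 > 0, the linearized map factor 1 − ηf″(2) has magnitude below one for small η, so nearby iterates are drawn in:

```python
import numpy as np

df = lambda x: 4 * x * (x - 1) * (x - 2)   # f'(x) for f(x) = x²(x−2)²

rho, eta, x = 1.2, 0.05, 0.7               # start near the hallucinated point
for _ in range(200):
    x_plus = x + rho * np.sign(df(x))      # 1-D SAM perturbation
    x = x - eta * df(x_plus)               # descend on the shifted gradient

print(x)       # ≈ 0.8   -> the iterates stall at the hallucinated minimizer
print(df(x))   # ≈ 0.768 -> yet the original gradient is far from zero
```

With η = 0.05 the local contraction factor is 1 − 0.05·8 = 0.6, so convergence to the non‑stationary point is geometric.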
Empirically, the authors verify the phenomenon on several neural‑network training tasks. In smooth full‑batch experiments they directly observe convergence to points where ∇f(x⁺)≈0 but ∇f(x)≠0. In larger, stochastic settings they design diagnostics that compare shifted versus unshifted gradients and losses, finding signatures consistent with hallucinated minimizers, especially when ρ is large—a regime where prior work already reports performance degradation.
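A minimal version of such a diagnostic can be sketched as a ratio check between shifted and unshifted gradient norms (an illustrative reconstruction, not the authors' exact protocol; the threshold `tol` is a placeholder):

```python
import numpy as np

def hallucination_signature(x, grad, rho, tol=1e-3, eps=1e-12):
    """Return (‖∇f(x)‖, ‖∇f(x⁺)‖, flag). The flag is raised when the
    shifted gradient has (nearly) vanished while the original has not,
    a signature consistent with a hallucinated minimizer."""
    g = grad(x)
    gn = np.linalg.norm(g)
    x_plus = x + rho * g / (gn + eps)
    gsn = np.linalg.norm(grad(x_plus))
    return gn, gsn, bool(gsn < tol * gn)

# At the hallucinated point of the 1-D example f(x) = x²(x−2)²:
df = lambda x: np.atleast_1d(4 * x * (x - 1) * (x - 2))
gn, gsn, flag = hallucination_signature(np.array([0.8]), df, rho=1.2)
# flag is True: the shifted gradient norm is orders of magnitude below the unshifted one
```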
A practical mitigation is proposed: a short SGD warm‑start (e.g., 5–10 epochs) before switching to SAM dramatically reduces the likelihood of converging to hallucinated minimizers and makes performance less sensitive to the choice of ρ. The intuition is that early SGD moves the parameters into a region closer to genuine minima, where the point shifted by ρ along the gradient direction no longer lands on a critical point of the original loss, so the shifted gradient does not vanish prematurely.
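Structurally, the safeguard is a small change to the training loop: run plain gradient steps for a warm‑up phase, then enable the SAM perturbation. A sketch with placeholder step counts and rates (the 5–10‑epoch warm‑start above is the paper's actual recommendation):

```python
import numpy as np

def train(x, grad, eta=0.1, rho=0.1, warmup_steps=10, total_steps=100, eps=1e-12):
    """Plain gradient descent for `warmup_steps`, then standard SAM updates."""
    for k in range(total_steps):
        g = grad(x)
        if k >= warmup_steps:
            # SAM phase: re-evaluate the gradient at the shifted point x⁺
            g = grad(x + rho * g / (np.linalg.norm(g) + eps))
        x = x - eta * g
    return x

# Sanity run on a convex quadratic f(x) = ‖x‖² (grad = 2x), which has no
# hallucinated minimizers; the iterates end up near the origin.
x_final = train(np.array([1.0, 1.0]), lambda x: 2.0 * x)
```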
Overall, the paper uncovers a previously uncharacterized failure mode of the standard SAM update that is intrinsic to non‑convex landscapes with both minima and maxima and to large perturbation radii. It complements existing SAM theory, which largely assumes small ρ or strong convexity, by providing geometric constructions, rigorous existence proofs, and dynamical stability analysis. The suggested warm‑start strategy offers a simple, broadly applicable safeguard for practitioners who wish to employ SAM without risking convergence to non‑stationary points of the original loss.