Understanding SAM's Robustness to Noisy Labels through Gradient Down-weighting
Sharpness-Aware Minimization (SAM) was introduced to improve generalization by seeking flat minima, yet it also exhibits robustness to label noise, a phenomenon that remains only partially understood. Prior work has mainly attributed this effect to SAM’s tendency to prolong the learning of clean samples. In this work, we provide a complementary explanation by analyzing SAM at the element-wise level. We show that when noisy gradients dominate a parameter direction, their relative influence is reduced because SAM amplifies clean gradients more strongly. This slows the memorization of noisy labels while sustaining clean learning, offering a more complete account of SAM’s robustness. Building on this insight, we propose SANER (Sharpness-Aware Noise-Explicit Reweighting), a simple variant of SAM that explicitly magnifies this down-weighting effect. Experiments on benchmark image classification tasks with noisy labels demonstrate that SANER significantly mitigates noisy-label memorization and improves generalization over both SAM and SGD. Moreover, since SANER is designed from the mechanism of SAM, it can also be seamlessly integrated into SAM-like variants, further boosting their robustness.
💡 Research Summary
The paper investigates why Sharpness‑Aware Minimization (SAM), originally proposed to seek flat minima and improve generalization, also exhibits robustness to noisy labels. Prior explanations focused on SAM’s tendency to prolong learning of clean samples, but they did not fully account for the fact that SAM also amplifies gradients from noisy samples. The authors provide a complementary, element‑wise analysis showing that during the transitional phase—when the model shifts from fitting clean data to memorizing noisy labels—SAM automatically down‑weights specific gradient components that are dominated by noisy supervision.
Theoretical analysis is conducted on a linear binary classification setting using the 1‑SAM variant, where each sample receives its own adversarial perturbation ϵ_i. By examining a mini‑batch containing one clean and one noisy sample from the same true class, the authors derive a ratio r_j = g_SAM_j / g_SGD_j for each shared feature dimension j. Lemma 3.1 proves that under realistic conditions (similar confidence scores and equal feature norms) this ratio satisfies 0 < r_j < 1, meaning SAM reduces the magnitude of the gradient in that dimension relative to SGD. Moreover, the reduction is larger for the noisy sample’s contribution, while the clean sample’s gradient is amplified more strongly. This asymmetry effectively suppresses the influence of noisy gradients in the aggregated update, providing a mechanistic explanation for SAM’s noise robustness beyond merely slowing down clean‑sample learning.
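The element-wise ratio at the heart of this analysis can be illustrated with a small NumPy sketch. The setup below (logistic loss on a linear classifier, one clean and one label-flipped sample, a toy perturbation radius `rho`) is a hypothetical toy construction to show how 1-SAM's per-sample perturbations yield a ratio r_j = g_SAM_j / g_SGD_j per dimension; it is not the paper's exact experimental setting.

```python
import numpy as np

def logistic_grad(w, x, y):
    # Per-sample gradient of log(1 + exp(-y * w.x)) with respect to w.
    s = 1.0 / (1.0 + np.exp(y * (w @ x)))  # sigmoid(-y * w.x)
    return -y * s * x

def one_sam_grad(w, x, y, rho):
    # 1-SAM: each sample i receives its own adversarial perturbation
    # eps_i = rho * g_i / ||g_i||, and the gradient is re-evaluated there.
    g = logistic_grad(w, x, y)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return logistic_grad(w + eps, x, y)

rng = np.random.default_rng(0)
w = rng.normal(size=5)
x_clean, x_noisy = rng.normal(size=5), rng.normal(size=5)
y_clean, y_noisy = 1, -1  # the noisy sample's label is flipped

# Mini-batch of one clean and one noisy sample, as in the lemma's setting.
g_sgd = 0.5 * (logistic_grad(w, x_clean, y_clean)
               + logistic_grad(w, x_noisy, y_noisy))
g_sam = 0.5 * (one_sam_grad(w, x_clean, y_clean, rho=0.05)
               + one_sam_grad(w, x_noisy, y_noisy, rho=0.05))

# Element-wise ratio r_j; Lemma 3.1 predicts 0 < r_j < 1 in dimensions
# dominated by the noisy gradient (under the lemma's conditions).
r = g_sam / g_sgd
print(r)
```

The ratio is computed coordinate-wise, which is exactly what distinguishes this analysis from earlier whole-gradient-norm arguments.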
Empirical validation on CIFAR‑10/100 with symmetric label noise (25%–50%) and three architectures (ResNet‑18, WideResNet‑40‑2, DenseNet‑121) confirms the theory. At each training step the authors compute element‑wise ratios r_i and find that 35–45% of parameters satisfy 0 < r_i < 1 throughout training, indicating systematic down‑weighting. Cosine similarity between the down‑weighted gradient components and gradients computed solely on noisy examples rises to ~0.7, showing that the down‑weighted elements align with noise‑driven updates. An ablation where down‑weighted components are replaced by their SGD counterparts (named SGD‑D) eliminates SAM’s robustness: noisy‑label memorization increases and test accuracy drops, confirming the importance of the down‑weighting effect.
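The two diagnostics described above can be sketched as a small helper. The function name `downweight_stats` and the toy gradient vectors are assumptions for illustration; in the paper these statistics are computed over full network parameter vectors during training, and the sketch assumes the SGD gradient is nonzero in every coordinate.

```python
import numpy as np

def downweight_stats(g_sam, g_sgd, g_noisy):
    """Sketch of the paper's two diagnostics: the fraction of coordinates
    SAM down-weights relative to SGD, and the cosine similarity between
    those coordinates and the gradient computed on noisy samples only."""
    r = g_sam / g_sgd                  # element-wise ratio r_i
    mask = (r > 0) & (r < 1)           # coordinates with 0 < r_i < 1
    frac = mask.mean()
    # Cosine similarity restricted to the down-weighted coordinates.
    a, b = g_sgd[mask], g_noisy[mask]
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return frac, cos

# Toy vectors: SAM shrinks coordinates 0, 2, and 3 relative to SGD.
g_sgd = np.array([0.5, -0.2, 0.8, -0.1])
g_sam = np.array([0.3, -0.25, 0.4, -0.05])
g_noisy = np.array([0.6, 0.1, 0.9, -0.2])

frac, cos = downweight_stats(g_sam, g_sgd, g_noisy)
print(frac, cos)  # fraction down-weighted, alignment with noisy gradient
```

A high cosine value on the masked coordinates is what supports the claim that the down-weighted elements are precisely the noise-driven ones.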
Motivated by these insights, the authors propose SANER (Sharpness‑Aware Noise‑Explicit Reweighting), a simple modification of SAM that intensifies the element‑wise down‑weighting. SANER adjusts the perturbation radius or directly rescales the identified components to further diminish noisy‑gradient influence. Experiments demonstrate that SANER consistently outperforms both SAM and standard SGD across datasets, noise levels, and model families, achieving 1.5 %–3 % higher test accuracy and markedly lower noisy‑label memorization. Moreover, because SANER is built on the same mechanism as SAM, it can be seamlessly integrated into existing SAM‑like variants, further boosting their robustness.
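One way to picture "intensifying the down-weighting" is to raise the element-wise ratio to a power before applying it. The update below is a hypothetical sketch of that idea, not SANER's actual rule: the exponent `alpha`, the function `saner_step`, and the toy gradients are all illustrative assumptions (with `alpha=1` recovering plain SAM on the down-weighted coordinates).

```python
import numpy as np

def saner_step(w, g_sam, g_sgd, lr=0.1, alpha=2.0):
    """Hypothetical sketch of explicit reweighting: where SAM already
    shrinks a coordinate (0 < r < 1), shrink it further by using r**alpha
    instead of r. This illustrates the mechanism only; the paper's SANER
    update may differ in form."""
    r = np.where(g_sgd != 0, g_sam / g_sgd, 1.0)
    shrink = (r > 0) & (r < 1)              # coordinates SAM down-weights
    g = g_sam.copy()
    g[shrink] = (r[shrink] ** alpha) * g_sgd[shrink]
    return w - lr * g

w = np.zeros(3)
g_sgd = np.array([1.0, -0.5, 0.2])
g_sam = np.array([0.5, -0.6, 0.1])          # dims 0 and 2 are down-weighted
w_new = saner_step(w, g_sam, g_sgd)
print(w_new)
```

Because the modification acts on the same ratio that the analysis identifies, any SAM-like variant that produces a perturbed gradient could be reweighted the same way, which is consistent with the paper's claim that SANER composes with SAM variants.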
In summary, the paper makes three key contributions: (1) a theoretical proof that SAM implicitly down‑weights specific gradient elements, thereby slowing noisy‑label learning; (2) extensive empirical evidence that these down‑weighted elements correspond to noise‑driven gradients and that removing this effect harms robustness; (3) the design of SANER, a practical optimizer that leverages and amplifies this mechanism, delivering superior performance in noisy‑label settings while remaining compatible with a wide range of SAM‑based methods. This work deepens our understanding of why SAM is noise‑robust and opens new avenues for designing noise‑aware optimization algorithms.