The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks

Reading time: 5 minutes

📝 Original Info

  • Title: The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks
  • ArXiv ID: 2602.16340
  • Date: 2026-02-18
  • Authors: Not listed in the provided source (verify against the original paper and add if available)

📝 Abstract

We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms too have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.


📄 Full Content

Deep neural networks show remarkable generalization performance despite often being overparameterized, and even when trained with no explicit regularization. A well-established line of work attempts to explain this phenomenon with the notion of the implicit bias (tendency) of gradient-based optimization algorithms to converge to well-generalizing solutions. This bias is often realized in the form of maximizing a certain margin for the training points (cf. Vardi (2023)).
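To make this concrete, the margin and the norm-dependent max-margin problem referenced throughout can be stated as follows. This is the standard formulation in this line of work (e.g., Lyu and Li (2019)), written in our own notation; the exact normalization used in the paper may differ. For a model $f(\theta; \cdot)$ and training set $\{(x_i, y_i)\}_{i=1}^n$,

$$
\gamma(\theta) = \min_{1 \le i \le n} y_i\, f(\theta; x_i),
\qquad
\text{and the } \|\cdot\|\text{-max-margin problem is}
\quad
\min_{\theta} \ \tfrac{1}{2}\|\theta\|^2
\ \ \text{s.t.}\ \
y_i\, f(\theta; x_i) \ge 1 \ \text{ for all } i.
$$

The implicit-bias results discussed below are statements about convergence to KKT points of this problem.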

While earlier works mostly studied gradient descent and showed its implicit bias towards maximizing the $\ell_2$ margin in increasingly complex models, recent years have seen growing interest in the implicit bias of other optimizers, such as Adam (Kingma and Ba, 2015), AdamW (Loshchilov and Hutter, 2019), and, recently, Muon (Jordan et al., 2024), hand-in-hand with their rising popularity. Indeed, as these algorithms are used near-universally for training large language models (LLMs) and vision transformers, there is a growing imperative to understand their inner workings.

In this work, we study smooth homogeneous models and show a margin-maximization bias of Adam and Muon. Previous work analyzed the implicit bias of Adam and Muon on linear predictors (Zhang et al., 2024;Fan et al., 2025), and we extend these results to the substantially broader class of smooth homogeneous models. Moreover, our analysis of Muon is a special case of a more general framework that we develop, which is applicable to all momentum-based optimizers built on top of steepest descent algorithms. All of our results hold for a family of exponentially tailed losses that includes the logistic and exponential losses.
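For context, here are the standard definitions behind the terms used above; they are stated for completeness and are our paraphrase rather than the paper's exact assumptions. A model is $L$-homogeneous if

$$
f(c\,\theta; x) = c^{L} f(\theta; x) \quad \text{for all } c > 0,
$$

as is the case for fully-connected and convolutional networks without bias terms, and the exponentially tailed losses include $\ell(q) = e^{-q}$ (exponential) and $\ell(q) = \log(1 + e^{-q})$ (logistic), applied to the margins $q_i = y_i f(\theta; x_i)$.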

Our main contributions are as follows:

  1. We show that any limit point of $\frac{\theta_t}{\|\theta_t\|}$ in a normalized steepest descent trajectory with a learning rate schedule $\eta(t)$ is a KKT point of the $\|\cdot\|$-max-margin problem, as long as $\int_0^\infty \eta(t)\,dt = \infty$. This result holds for any locally Lipschitz, $C^1$-stratifiable homogeneous model, including ReLU networks. It extends a result by Tsilivis et al. (2025) that considered (unnormalized) steepest descent with a constant learning rate.

  2. We show that when $\frac{\theta_t}{\|\theta_t\|}$ converges, it converges to the direction of a KKT point of the $\|\cdot\|$-max-margin problem even for trajectories which approximate steepest descent. This allows us to focus on momentum-based optimizers on smooth homogeneous models and show:

(a) Muon has an implicit bias towards margin maximization with respect to a norm defined using spectral norms of the weight matrices, under a decaying learning rate regime. In fact, the bias towards margin maximization holds for any normalized Momentum Steepest Descent (MSD) algorithm, for the appropriate norm (see the sketch after this list). We show this includes composite MSD algorithms such as Muon-Signum. In addition, we prove an implicit bias of Muon-Adam.

(b) Adam (without the stability constant) has an implicit bias towards $\ell_\infty$ margin maximization under a decaying learning rate regime.
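To make the class of algorithms concrete, below is a minimal sketch of one normalized MSD step for the three norms discussed above (spectral/Muon-style, $\ell_2$/MomentumGD, $\ell_\infty$/Signum) with a decaying learning rate. It is our own illustrative reconstruction from the descriptions in this post, not the paper's reference implementation; the function name, hyperparameters, and the use of an SVD for the spectral case are assumptions (practical Muon implementations typically replace the SVD with a Newton-Schulz iteration).

```python
import numpy as np


def msd_step(param, grad, momentum, norm="spectral", beta=0.95, lr=0.1):
    """One normalized Momentum Steepest Descent (MSD) step (illustrative).

    The momentum buffer accumulates gradients, and the parameter update moves
    along the normalized steepest-descent direction of the momentum under the
    chosen norm:
      - "spectral": orthogonalize the momentum matrix (Muon-style); done here
        with an SVD for clarity.
      - "l2":       momentum divided by its Euclidean norm (MomentumGD).
      - "linf":     elementwise sign of the momentum (Signum).
    """
    momentum = beta * momentum + grad
    if norm == "spectral":
        u, _, vt = np.linalg.svd(momentum, full_matrices=False)
        direction = u @ vt
    elif norm == "l2":
        direction = momentum / (np.linalg.norm(momentum) + 1e-12)
    elif norm == "linf":
        direction = np.sign(momentum)
    else:
        raise ValueError(f"unknown norm: {norm}")
    return param - lr * direction, momentum


# Toy usage with a decaying schedule eta(t) = 1/(t+1), whose integral diverges
# as required by contribution 1; the gradients here are random placeholders.
rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 3))
buffer = np.zeros_like(weights)
for t in range(100):
    grad = rng.standard_normal(weights.shape)
    weights, buffer = msd_step(weights, grad, buffer, norm="spectral",
                               lr=1.0 / (t + 1))
```

Note that each of these directions has unit norm with respect to the corresponding norm (spectral, $\ell_2$, $\ell_\infty$), which is what makes them normalized steepest descent directions for the momentum.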

Related Work. Soudry et al. (2018) first showed that gradient descent in linear models maximizes the $\ell_2$ margin. This result was followed by several works on margin maximization in linear fully-connected, convolutional, and diagonal networks (e.g., Ji and Telgarsky (2018); Gunasekar et al. (2018b); Yun et al. (2020); Moroshko et al. (2020)).

Going beyond linear networks, Chizat and Bach (2020) studied the implicit bias in infinitely-wide two-layer smooth homogeneous networks, and proved margin maximization w.r.t. a certain function norm, known as the variation norm. Lyu and Li (2019) studied homogeneous models under gradient descent, demonstrating that any limit point of the direction of the vector of parameters $\frac{\theta_t}{\|\theta_t\|}$ is the direction of a KKT point of the max-margin problem. In a complementary result, Ji and Telgarsky (2020) showed that directional convergence of the parameters is indeed guaranteed when optimizing homogeneous models definable in an o-minimal structure with gradient descent. The implicit bias of gradient descent in certain non-homogeneous neural networks was studied in Nacson et al. (2019a); Kunin et al. (2022); Cai et al. (2025). For a more comprehensive survey on the implicit bias of gradient descent, see Vardi (2023).

A general treatment of the implicit bias of the family of steepest descent algorithms was given for linear models by Gunasekar et al. (2018a), who proved maximization of the appropriate norm-dependent margin. Tsilivis et al. (2025) generalized that result and the result by Lyu and Li (2019), proving that under steepest descent on homogeneous models, any limit point of $\frac{\theta_t}{\|\theta_t\|}$ is a KKT point of the max-margin problem. Adam in the context of homogeneous models was studied by Wang et al. (2021), who showed a bias towards $\ell_2$-margin maximization. Notably, that work studied Adam without momentum in the numerator and with a stability constant $\varepsilon$ that asymptotically dominates the denominator, driving its behavior to be similar to that of gradient descent.
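For reference, this distinction can be read off the standard form of the Adam update (Kingma and Ba, 2015), with bias-correction terms omitted for brevity:

$$
\theta_{t+1} = \theta_t - \eta\, \frac{m_t}{\sqrt{v_t} + \varepsilon},
$$

where $m_t$ and $v_t$ are exponential moving averages of the gradient and the squared gradient, applied coordinatewise. When $\varepsilon$ asymptotically dominates $\sqrt{v_t}$, the update approaches $-(\eta/\varepsilon)\, m_t$, i.e. (scaled) gradient-descent-like behavior and hence an $\ell_2$ bias; without the stability constant, the coordinatewise normalization $m_t/\sqrt{v_t}$ intuitively behaves like a sign-based update, which matches the $\ell_\infty$-margin bias stated above.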

Reference

This content is AI-processed based on open access ArXiv data.
