Optimal Scaling Needs Optimal Norm

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. For Adam and Scion optimizers, we discover that joint optimal scaling across model and dataset sizes is conditioned on a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(η^{\ast}, B^{\ast})$ consistently has the same operator norm value - a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple $(η, B)$ reach the optimal norm, only a unique $(η^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(η^{\ast}, B^{\ast})$ scaling with dataset size for Scion, and find that the scaling rules are consistent with those of Adam. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.


💡 Research Summary

The paper tackles the long‑standing problem of how to transfer hyperparameters optimally when both model size and dataset size are scaled up, a scenario that has become central to the development of large language models (LLMs). While recent work on Maximum Update Parametrization (µP) provides a theoretical framework for model‑size scaling and empirical scaling laws give practical guidance for dataset‑size scaling, no single principle has yet unified the two. The authors adopt the emerging “norm‑based optimization” paradigm, in which each layer of a neural network is assigned a specific operator norm and the optimizer updates weights according to the duality map associated with that norm. Their implementation, the Scion optimizer, is a flexible, momentum‑free alternative to Adam that can be configured with different norms for input, hidden, and output layer groups.
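To make the "duality map" idea concrete, here is a minimal sketch of a norm-based update for a hidden layer under the spectral norm, whose duality map sends a gradient to $U V^{\top}$ from its thin SVD. This is an illustrative single-layer step, not the paper's Scion implementation; the function names are our own.

```python
import numpy as np

def spectral_dualize(grad):
    """Duality map for the spectral norm: map a gradient matrix to the
    steepest-descent direction on the spectral-norm unit ball, namely
    U @ V^T from the thin SVD of the gradient."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def norm_based_step(weight, grad, lr):
    # Illustrative update: move against the dualized gradient,
    # so the step has unit spectral norm scaled by the learning rate.
    return weight - lr * spectral_dualize(grad)
```

In this paradigm each layer group (input, hidden, output) would use the duality map of its own assigned norm; the spectral case above is just one instance.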

The central empirical discovery is that, for both Adam and Scion, the optimal learning-rate / batch-size pair $(\eta^{*}, B^{*})$ always yields the same value of the output-layer operator norm $\|W_{\text{out}}\|_{\text{RMS}\rightarrow\infty}$, regardless of how the model is widened, deepened, or how many tokens are presented during pre-training. The authors name this phenomenon "norm transfer". By conducting exhaustive grid searches over learning rates and batch sizes across a wide range of model sizes (69M to 1.3B parameters) and dataset horizons (from $2^{33}$ to $1.38\times10^{11}$ tokens), they show that the output-layer norm remains constant for the configuration that minimizes training loss. This invariance constitutes a necessary condition for optimality: many $(\eta, B)$ pairs can achieve the same norm, but only one of them actually minimizes loss.
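The invariant itself is cheap to monitor during training. Under the usual reading of this operator norm (RMS norm on inputs, max norm on outputs), it reduces to a scaled maximum row norm; the $\sqrt{d_{\text{in}}}$ factor below is our assumption from that reading, not a formula quoted from the paper.

```python
import numpy as np

def rms_to_inf_norm(W):
    """Operator norm from the RMS norm on inputs to the max (infinity)
    norm on outputs. For W of shape (d_out, d_in) this equals
    sqrt(d_in) * max_i ||W[i, :]||_2 -- our reading of ||W||_{RMS->inf}."""
    d_in = W.shape[1]
    row_norms = np.linalg.norm(W, axis=1)  # L2 norm of each output row
    return np.sqrt(d_in) * row_norms.max()
```

Logging this quantity for the output layer across a $(\eta, B)$ grid is, per the paper's finding, enough to locate the constant-norm contour on which the optimum lies.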

Beyond the necessary condition, the authors quantify a sufficient condition for Scion by fitting the relationship between $\eta^{*}$, $B$, and dataset size $D$. They find that the resulting scaling rules for Scion are consistent with those previously measured for Adam.
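Fits of this kind are typically power laws in $D$, estimated by linear regression in log-log space. The sketch below shows the mechanics on made-up grid-search optima; the dataset sizes, batch sizes, and resulting exponent are illustrative only and are not values from the paper.

```python
import numpy as np

# Hypothetical grid-search optima: dataset sizes (tokens) and the
# batch size that minimized loss at each -- illustrative values only.
D = np.array([2**33, 2**35, 2**37, 2**39], dtype=float)
B_opt = np.array([256, 512, 1024, 2048], dtype=float)

# Fit B*(D) = c * D**alpha by linear regression in log-log space.
alpha, log_c = np.polyfit(np.log(D), np.log(B_opt), 1)
print(f"fitted exponent alpha = {alpha:.3f}")
```

The same procedure applied to $\eta^{*}$ versus $D$ (at fixed or jointly varying $B$) yields the companion learning-rate scaling rule.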

