Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Setting the learning rate for a deep learning model is a critical part of successful training, yet choosing this hyperparameter is often done empirically with trial and error. In this work, we explore a solvable model of optimal learning rate schedules for a power-law random feature model trained with stochastic gradient descent (SGD). We consider the optimal schedule $η_T^\star(t)$ where $t$ is the current iterate and $T$ is the total training horizon. This schedule is computed both numerically and analytically (when possible) using optimal control methods. Our analysis reveals two regimes which we term the easy phase and hard phase. In the easy phase the optimal schedule is a polynomial decay $η_T^\star(t) \simeq T^{-ξ} (1-t/T)^δ$ where $ξ$ and $δ$ depend on the properties of the features and task. In the hard phase, the optimal schedule resembles warmup-stable-decay with constant (in $T$) initial learning rate and annealing performed over a vanishing (in $T$) fraction of training steps. We investigate joint optimization of learning rate and batch size, identifying a degenerate optimality condition. Our model also predicts the compute-optimal scaling laws (where model size and training steps are chosen optimally) in both easy and hard regimes. Going beyond SGD, we consider optimal schedules for the momentum $β(t)$, where speedups in the hard phase are possible. We compare our optimal schedule to various benchmarks in our task, including (1) optimal constant learning rates $η_T(t) \sim T^{-ξ}$ and (2) optimal power laws $η_T(t) \sim T^{-ξ} t^{-χ}$, finding that our schedule achieves better rates than either of these. Our theory suggests that learning rate transfer across training horizons depends on the structure of the model and task. We explore these ideas in simple experimental pretraining setups.
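The three schedule families compared in the abstract can be sketched numerically. The exponents below (ξ, δ, χ) are illustrative placeholders, not values derived in the paper:

```python
import numpy as np

# Illustrative comparison of the schedule families above; the exponents
# xi, delta, chi are made-up placeholders, not fitted or derived values.
T = 1000
t = np.arange(T)
xi, delta, chi = 0.3, 1.0, 0.2

constant = np.full(T, T ** (-xi))                    # eta_T(t) ~ T^{-xi}
power_law = T ** (-xi) * (t + 1.0) ** (-chi)         # eta_T(t) ~ T^{-xi} t^{-chi}
optimal_easy = T ** (-xi) * (1.0 - t / T) ** delta   # T^{-xi} (1 - t/T)^delta
```

Note the qualitative difference: the easy-phase optimum starts at the same level as the constant schedule but anneals all the way to zero at the end of training, whereas the pure power law never fully anneals.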


💡 Research Summary

The paper investigates optimal learning‑rate schedules for stochastic gradient descent (SGD) in a power‑law random‑feature model, a setting that captures many scaling phenomena observed in modern deep learning. The authors formulate the problem as an optimal‑control task: given a fixed training budget T, choose a time‑varying learning rate η(t) (and optionally batch size and momentum) that minimizes the final test loss L_T.
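The optimal-control viewpoint can be illustrated on a toy scalar problem (this is a simplification, not the paper's model): for a single quadratic mode with gradient noise, the expected risk obeys c_{t+1} = (1 − ηλ)² c_t + η² λ σ², and because the recursion is monotone in c_t, the schedule can be found by greedy per-step minimization, which has the closed form η* = c_t / (λ c_t + σ²):

```python
# Toy sketch of schedule optimization for one quadratic mode with noise.
# Assumptions (not from the paper): risk recursion
#   c_{t+1} = (1 - eta*lam)^2 * c_t + eta^2 * lam * sigma2,
# whose per-step minimizer is eta* = c_t / (lam * c_t + sigma2).

def greedy_schedule(T, lam=1.0, c0=1.0, sigma2=0.01):
    c, etas, cs = c0, [], []
    for _ in range(T):
        eta = c / (lam * c + sigma2)  # minimizes next-step expected risk
        c = (1 - eta * lam) ** 2 * c + eta ** 2 * lam * sigma2
        etas.append(eta)
        cs.append(c)
    return etas, cs
```

Even in this toy setting the optimal learning rate decays over time as the error shrinks, mirroring the annealing behavior the paper derives for the full multi-mode model.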

The model assumes inputs x drawn from a distribution p(x) and labels generated by a linear teacher y(x) = w*·ψ(x) + σ₀z, where the feature map ψ(x) has eigenvalues λ_k ∼ k^{−b} and the teacher's projected weights satisfy (w*_k)² λ_k ∼ k^{−a}, with a, b > 1. These exponents encode task difficulty (a) and data spectral decay (b). Training proceeds with online SGD on a student linear model of size N, using minibatches of size m and learning rate η_t. By averaging over the stochasticity, the dynamics reduce to deterministic recursions for the per-mode errors c_{t,k}.
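A minimal simulation of this kind of per-mode dynamics can be sketched as follows. The recursion below is a plausible simplification that ignores cross-mode interactions, and all parameter values are illustrative, not taken from the paper:

```python
import numpy as np

# Sketch (not the paper's exact recursion): per-mode error dynamics for
# online SGD in the feature eigenbasis. Assumptions: lambda_k ~ k^{-b},
# initial errors set so that (w*_k)^2 * lambda_k ~ k^{-a}; each step
# contracts the error and injects label-noise variance scaled by 1/m.

def simulate(eta_schedule, T, N=200, a=2.0, b=1.5, sigma0=0.1, m=8):
    k = np.arange(1, N + 1)
    lam = k ** (-b)            # feature spectrum
    c = k ** (-a) / lam        # initial per-mode squared error (w*_k)^2
    losses = []
    for t in range(T):
        eta = eta_schedule(t, T)
        c = (1.0 - eta * lam) ** 2 * c + (eta ** 2) * lam * sigma0 ** 2 / m
        losses.append(float(np.sum(lam * c)))
    return losses

# Easy-phase polynomial-decay schedule with illustrative exponents.
def poly_decay(t, T, xi=0.3, delta=1.0):
    return T ** (-xi) * (1.0 - t / T) ** delta
```

Running `simulate(poly_decay, 1000)` shows the fast (large-λ_k) modes being learned quickly while the slow tail of the spectrum dominates the residual loss, which is the mechanism behind power-law loss curves in models of this type.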

