$μ$LO: Compute-Efficient Meta-Generalization of Learned Optimizers
Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can struggle to optimize unseen tasks (meta-generalize), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization ($μ$P) for two state-of-the-art learned optimizer architectures and propose a simple meta-training recipe for $μ$-parameterized LOs ($μ$LOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks when compared to LOs trained under standard parametrization (SP) using the same compute budget. We also empirically observe that $μ$LOs exhibit unexpectedly improved meta-generalization to deeper networks ($5\times$ deeper than those seen during meta-training) and surprising generalization to much longer training horizons ($25\times$ longer than meta-training) when compared to SP LOs.
💡 Research Summary
The paper tackles a central challenge in learned optimizers (LOs): their poor meta‑generalization to tasks that differ substantially from the distribution used during meta‑training, especially when the downstream networks are wider, deeper, or require many more training steps. To address this, the authors bring the Maximal Update Parametrization (μP), originally developed for hand‑crafted optimizers such as Adam and SGD, into the realm of learned optimizers. μP prescribes width‑dependent scalings of the weight‑initialization variance, pre‑activation multipliers, and optimizer update magnitudes for each layer, ensuring that the "maximal update" dynamics remain stable as width grows.
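For hand‑crafted optimizers, μP's update‑scaling rule reduces to per‑layer learning rates. A minimal sketch, assuming the common formulation in which Adam's step on hidden (matrix‑like) weights shrinks as 1/fan_in relative to a base width; the function name and defaults here are illustrative, not the paper's API:

```python
def mup_adam_lr(base_lr, fan_in, base_fan_in, layer_type):
    """Width-aware Adam learning rate under muP (illustrative sketch).

    Hidden-layer (matrix-like) weights get their step scaled down as the
    layer widens, so per-coordinate update sizes shrink like 1/fan_in and
    feature learning neither explodes nor vanishes with width.
    """
    if layer_type == "hidden":
        return base_lr * base_fan_in / fan_in  # ~ 1/fan_in scaling
    return base_lr                             # e.g. input layers / biases unchanged

# Doubling the width halves the hidden-layer step but leaves others alone:
assert abs(mup_adam_lr(1e-3, 1024, 512, "hidden") - 5e-4) < 1e-12
assert mup_adam_lr(1e-3, 1024, 512, "input") == 1e-3
```

Because the scaling is relative to a base width, hyperparameters tuned on a small proxy model transfer to wider ones, which is the property μAdam exploits as a baseline in the paper.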
The authors focus on two state‑of‑the‑art LO architectures: VeLO (an LSTM‑based optimizer) and small_fc_lopt (a lightweight fully‑connected network). For each, they derive a μ‑parameterization that satisfies the μP desiderata. The key modifications are: (i) initializing hidden and input layer weights with variance 1/fan_in, while output layers use unit‑variance initialization; (ii) multiplying output‑layer pre‑activations by 1/fan_in during the forward pass; and (iii) scaling the learned update (the direction d and magnitude m) by 1/fan_in for hidden layers, leaving output layers unchanged. Propositions 4.1 and 4.2 (with proofs in the appendix) formally show that, under a Law of Large Numbers assumption, these adjustments yield a maximal‑update parametrization for the two optimizers.
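The three modifications can be sketched as width‑dependent factors applied at initialization, in the forward pass, and to the learned optimizer's output. This is a simplified illustration of the recipe as summarized above, not the actual VeLO or small_fc_lopt code; the multiplicative update form `λ₁·d·exp(λ₂·m)` and all names are assumptions for the sketch:

```python
import numpy as np

def mu_init(fan_in, fan_out, layer_type, rng):
    """(i) Init: hidden/input layers use variance 1/fan_in; output layers
    use unit variance."""
    sigma = 1.0 if layer_type == "output" else fan_in ** -0.5
    return rng.normal(0.0, sigma, size=(fan_out, fan_in))

def mu_forward(W, x, layer_type):
    """(ii) Forward: output-layer pre-activations are multiplied by 1/fan_in."""
    z = W @ x
    if layer_type == "output":
        z = z / W.shape[1]
    return z

def mu_update(d, m, fan_in, layer_type, lam1=0.001, lam2=0.001):
    """(iii) Update: the LO emits a direction d and log-magnitude m per
    parameter (an assumed small_fc_lopt-style form); the resulting step is
    rescaled by 1/fan_in for hidden layers, output layers are unchanged."""
    step = lam1 * d * np.exp(lam2 * m)
    if layer_type == "hidden":
        step = step / fan_in
    return step
```

For example, with `d = 1`, `m = 0`, and `fan_in = 4`, a hidden layer receives a per‑coordinate step of `lam1 / 4`, so the update magnitude automatically shrinks as the network widens.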
Armed with this theoretical foundation, the authors propose a low‑cost meta‑training recipe. They construct a “multiple‑width single‑task” meta‑training set consisting solely of multilayer perceptrons (MLPs) of varying widths, each trained for 1,000 inner‑loop steps. All optimizers—both the newly μ‑parameterized LOs (μLOs) and baseline LOs trained under standard parametrization (SP)—are allocated the same FLOP budget, ensuring a fair compute comparison. Hand‑crafted baselines (μAdam and AdamW) are tuned per task using exhaustive grid searches (over 500 configurations per task).
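One way to picture the equal‑compute comparison is to cost out each task's inner loop and match total FLOPs across methods. This is a hypothetical sketch: the widths, depth, batch size, and the standard `6 × params × tokens` FLOP estimate are illustrative, not the paper's exact accounting:

```python
# Hypothetical multiple-width single-task meta-training set of MLPs,
# each unrolled for 1,000 inner-loop steps; all numbers are illustrative.
INNER_STEPS = 1_000
WIDTHS = [128, 256, 512, 1024]
DEPTH, IN_DIM, OUT_DIM, BATCH = 3, 3072, 10, 128

def mlp_flops_per_step(width):
    """Rough fwd+bwd cost per step: ~6 FLOPs per parameter per example."""
    params = IN_DIM * width + (DEPTH - 1) * width * width + width * OUT_DIM
    return 6 * params * BATCH

# Total budget of one meta-training run over this task set; an SP baseline
# is then allotted the same total so the comparison is compute-matched.
budget = sum(mlp_flops_per_step(w) for w in WIDTHS) * INNER_STEPS
share = {w: mlp_flops_per_step(w) * INNER_STEPS / budget for w in WIDTHS}
```

The point of the accounting is that any gains from μ‑parameterization cannot be attributed to extra meta‑training compute: both μLOs and SP LOs consume the same budget.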
Evaluation spans 35 downstream tasks covering image classification (CIFAR‑10, ImageNet‑32/64) with both MLPs and Vision Transformers, as well as language modeling on the LM1B dataset with a decoder‑only transformer. The test tasks deliberately push the limits of width (up to 8× the widest meta‑training width), depth (up to 5×), and unroll length (up to 25×). Results show that μLOs consistently achieve the best average rank across these out‑of‑distribution tasks, outperforming SP‑trained LOs, μAdam, and AdamW despite using the same compute. Notably, μLOs generalize to deeper networks and much longer training horizons—situations where even the strongest hand‑crafted baselines degrade sharply.
Ablation studies confirm that simply varying width during SP meta‑training does not yield comparable generalization, highlighting the essential role of μ‑scaling. Additional experiments varying the λ₁, λ₂ hyperparameters (which control the base step size) demonstrate that the μ‑parameterization remains robust across a range of settings.
The contributions are threefold: (1) a rigorous derivation of μ‑parameterizations for two leading LO architectures, (2) a practical, compute‑efficient meta‑training protocol that leverages μP to achieve strong meta‑generalization, and (3) extensive empirical evidence that μLOs surpass both SP LOs and tuned hand‑crafted optimizers on width, depth, and horizon extrapolation. Limitations include the focus on relatively simple architectures (MLPs, the LSTM‑based VeLO) and the reliance on the Law of Large Numbers assumption, whose validity in highly non‑Gaussian data regimes remains open. Future work could extend μ‑scaling to residual architectures such as ResNets and large transformers, and explore automated discovery of μ‑scaling factors to further reduce meta‑training overhead.