Hyperparameter Transfer with Mixture-of-Expert Layers
Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add complexity to training due to (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; (ii) new architecture scale dimensions (the number and size of experts) that must be chosen and potentially taken large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions trained at a fixed token budget, we find empirically that our parameterization enables reliable HP transfer across models from 51M to over 2B total parameters. We further take HPs identified from sweeping small models on a short token horizon to train larger models on longer horizons and report performant model behaviors.
💡 Research Summary
This paper addresses the practical challenge of hyper‑parameter (HP) tuning for large‑scale language models that incorporate Mixture‑of‑Experts (MoE) layers. MoE layers decouple total parameter count from the amount of computation per token by routing each token to only a small subset of “experts”. While this yields massive parameter budgets with modest FLOPs, it also introduces new trainable components—router weights, expert feed‑forward weights, and expert biases—each of which traditionally requires separate HP tuning. Moreover, MoE models add two architectural scaling dimensions beyond the usual width and depth: the number of experts (n_exp) and the hidden size of each expert (controlled by a multiplier α_ffn). Directly searching for optimal HPs at the scale of billions of parameters is prohibitively expensive, motivating the need for a principled way to transfer HPs discovered on small models to much larger ones.
Contributions
- A unified MoE‑specific parameterization – Building on the maximal‑update (µP) and CompleteP parameterizations that have proven effective for dense transformers, the authors derive explicit scaling rules for every MoE parameter group (router, expert up/down projections, expert bias). These rules prescribe how to scale the initialization standard deviation and the Adam learning rate as a function of model width (n_embd), depth (L), expert count (n_exp), and expert hidden multiplier (α_ffn).
- Theoretical justification via Dynamical Mean‑Field Theory (DMFT) – The paper presents a novel DMFT analysis for residual networks that contain MoE layers, taking the simultaneous limit of infinite width, depth, expert size, and expert count. The analysis reveals a three‑level mean‑field hierarchy: the residual stream is a mean field over expert outputs, which themselves are mean fields over individual expert neurons. Under the proposed scaling, the evolution of finite‑network summary statistics (e.g., layer‑wise kernels, gradient covariances) converges to a well‑defined limit, guaranteeing that updates for each parameter group remain O(1) regardless of scale. This provides a rigorous foundation for HP transfer across all four scaling dimensions.
- Extensive empirical validation – Experiments are conducted on the FineWeb dataset with a fixed token budget of 1 B tokens (≈2000 training steps). The authors vary one scaling dimension at a time while keeping the others fixed, and apply the same learning‑rate and initialization‑scale hyper‑parameters derived from a 38 M‑active‑parameter (“base”) model. Across width, depth, expert count, and expert hidden multiplier, loss curves virtually overlap, confirming that the same HPs work from 51 M up to ~2 B total parameters. The paper also demonstrates that the base HPs can be transferred to longer training horizons (more tokens) without any additional tuning, achieving performance competitive with dense GPT‑2 baselines.
- Load‑balancing without auxiliary loss – Expert load balancing is enforced simply by a bias‑only update rule (bias ← bias – η_bias·(Load_i – κ)), where κ = n_act / n_exp is the fixed sparsity ratio. This “auxiliary‑loss‑free” approach maintains near‑perfect load balance even when the number of experts is scaled, as shown in Figure 17.
- Insights on expert‑count vs. expert‑size trade‑off – Holding total parameter count constant, the authors find that increasing the number of experts (while keeping sparsity κ fixed) yields better downstream performance than enlarging each expert’s hidden dimension. This empirical observation aligns with recent theoretical work on MoE specialization and provides a concrete guideline for model architects.
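The auxiliary‑loss‑free balancing rule quoted above (bias ← bias − η_bias·(Load_i − κ)) can be sketched directly. The snippet below is a minimal NumPy illustration, not the paper's implementation; the function name, toy expert counts, and load values are made up for the example.

```python
import numpy as np

def update_balance_biases(biases, loads, kappa, eta_bias):
    """Auxiliary-loss-free load balancing (sketch of the rule quoted above).

    Each expert's routing bias is pushed down when its observed load
    exceeds the target sparsity ratio kappa = n_act / n_exp, and pushed up
    when it is under-loaded: bias_i <- bias_i - eta_bias * (load_i - kappa).
    The biases only affect routing, so no auxiliary loss term is needed.
    """
    return biases - eta_bias * (loads - kappa)

# Toy example: 4 experts, 1 active per token -> kappa = 0.25.
biases = np.zeros(4)
loads = np.array([0.55, 0.25, 0.15, 0.05])  # fraction of tokens per expert
biases = update_balance_biases(biases, loads, kappa=0.25, eta_bias=0.1)
# The overloaded expert's bias drops; under-loaded experts' biases rise,
# steering future routing decisions toward balance.
```

Because the correction is applied to a routing bias rather than through the loss, it never competes with the language‑modeling objective, which is what makes the "auxiliary‑loss‑free" framing apt.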
Key Findings
- The proposed scaling rules ensure that all parameter groups have O(1) forward activations and O(1) Adam updates at initialization, which is the core condition for µP‑style HP transfer.
- Fixed sparsity κ is crucial: varying κ leads to breakdown of transferability, confirming the theoretical prediction that the mean‑field analysis assumes a constant routing probability.
- The DMFT‑derived three‑level hierarchy explains why expert‑count and expert‑size can be scaled jointly without destabilizing training dynamics.
- The bias‑only load‑balancing mechanism is sufficient to prevent expert collapse, dead experts, or “super‑expert” phenomena, simplifying the training pipeline compared to earlier MoE works that required complex auxiliary losses.
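To make the O(1)-update condition concrete, the sketch below shows the generic µP‑style width scaling for a hidden weight matrix under Adam: initialization standard deviation shrinks like 1/√fan_in and the learning rate like 1/fan_in relative to a tuned base model. This is only the width axis of the general µP recipe; the paper's full rules additionally involve depth L, expert count n_exp, and the expert multiplier α_ffn, and those exponents are not reproduced here.

```python
def mup_like_scales(fan_in, base_fan_in, base_std, base_lr):
    """Illustrative µP-style width scaling for a hidden weight under Adam.

    Relative to a small, tuned base model: init std scales as 1/sqrt(ratio)
    and the Adam learning rate as 1/ratio, where ratio = fan_in / base_fan_in.
    This keeps forward activations and per-step updates O(1) as width grows.
    """
    ratio = fan_in / base_fan_in
    return base_std / ratio**0.5, base_lr / ratio

# Scale a base model's tuned HPs up 8x in width (hypothetical numbers).
std, lr = mup_like_scales(fan_in=4096, base_fan_in=512,
                          base_std=0.02, base_lr=1e-3)
# std shrinks by sqrt(8); lr shrinks by 8.
```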
Limitations and Future Work
- All experiments are limited to a 1 B token budget; while the authors extrapolate to longer horizons, direct validation on tens of billions of tokens remains an open question.
- The routing function is implemented as a sigmoid followed by top‑k selection; other routing schemes (softmax‑top‑k, expert‑choice, shared experts) may require adapted scaling constants.
- The analysis assumes a homogeneous expert architecture (single hidden layer MLP). Extending the theory to deeper expert networks or to encoder‑decoder architectures would broaden applicability.
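The sigmoid-plus-top-k routing mentioned above can be sketched as follows: each token's router logits pass through an element-wise sigmoid, the k highest-scoring experts are selected, and the surviving sigmoid values gate the expert outputs. This NumPy version is illustrative only; variable names and the decision to leave the gates unnormalized are assumptions, not the paper's exact implementation.

```python
import numpy as np

def sigmoid_topk_route(logits, k):
    """Sketch of sigmoid + top-k routing (illustrative, not the paper's code).

    logits: array of shape (tokens, n_exp) of router scores per token.
    Returns sparse gate values (zero outside the top-k) and the selected
    expert indices for each token.
    """
    gates = 1.0 / (1.0 + np.exp(-logits))        # element-wise sigmoid scores
    topk = np.argsort(gates, axis=-1)[..., -k:]  # indices of the k largest gates
    mask = np.zeros_like(gates)
    np.put_along_axis(mask, topk, 1.0, axis=-1)  # keep only top-k positions
    return gates * mask, topk

gates, topk = sigmoid_topk_route(np.array([[2.0, -1.0, 0.5, 0.0]]), k=2)
# Only the two highest-scoring experts (0 and 2) receive nonzero gates.
```

Alternative schemes named in the limitation (softmax-top-k, expert-choice, shared experts) change this routing function, which is why their scaling constants may differ.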
Conclusion
By integrating a DMFT‑backed scaling theory with practical hyper‑parameter rules, this work provides a robust, theory‑driven recipe for scaling MoE‑augmented transformers. Practitioners can now train models ranging from tens of millions to billions of parameters using a single set of HPs derived from inexpensive small‑scale sweeps, while preserving training stability, load balance, and competitive performance. The paper thus bridges the gap between theoretical mean‑field insights and real‑world large‑scale language model engineering, offering a valuable tool for the next generation of trillion‑parameter MoE systems.