Generalizing Scaling Laws for Dense and Sparse Large Language Models


Despite recent advances in large language models (LLMs), optimally choosing the model size for LLM pretraining and allocating training resources remains a challenge. Several efforts have addressed this challenge by proposing empirical scaling laws, but almost all of them are architecture-specific (dense or sparse). In this work we revisit existing empirical scaling laws and propose a generalized scaling law that provides a unified framework applicable to both dense and sparse large language models. We evaluate and compare our proposed scaling law with existing ones and demonstrate that it captures their scaling behavior. Further, we present an IsoFLOP comparison between our proposed scaling law and the state-of-the-art scaling law to illustrate its effectiveness for Mixture-of-Experts (MoE)-based very large LLMs such as DeepSeek-V3. Our proposed scaling law can be used to estimate the best model hyperparameters (model size, tokens, and compute) for a given sparsity, or to identify the optimal sparsity for given model hyperparameters.


💡 Research Summary

This paper revisits the empirical scaling laws that have been proposed to predict the performance of large language models (LLMs) as a function of model size, training data, and compute. Existing laws are largely architecture‑specific: the original Kaplan et al. law and the later Hoffmann et al. law apply to dense transformers, while Frantar et al. and Abnar et al. derived separate formulas for sparsified models obtained via pruning or mixture‑of‑experts (MoE). The authors identify two major shortcomings of these prior works. First, they use different notions of “parameter count” – total parameters, non‑zero parameters, or active expert parameters – which makes cross‑architecture comparison and budget planning cumbersome. Second, the sparse laws do not reduce to the dense Hoffmann law when sparsity S = 0, indicating a lack of true unification.

To address these issues, the authors propose a generalized scaling law that is based solely on the number of active (non‑zero) parameters, denoted Nₐ, and the number of training tokens D. Compute is defined uniformly as C = 6 Nₐ D, mirroring the FLOP estimate used in the dense case. The loss is modeled as a simple three‑term power‑law:

 L(Nₐ, D) = e + a Nₐ^(-α) + b D^(-β)

where e captures the irreducible entropy of natural language, and a, b, α, β are fitted jointly across a large collection of dense and sparse training runs. When S = 0, Nₐ = N and the formula collapses exactly to the Hoffmann law, establishing it as a special case. For any sparsity level, the same expression automatically accounts for the reduced effective model capacity without introducing extra sparsity‑specific terms.
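To make the formulation concrete, the unified law can be sketched as a small Python function. The coefficient values below are illustrative placeholders with Hoffmann-style magnitudes, not the paper's jointly fitted constants:

```python
# Sketch of the generalized scaling law. The constants below are
# hypothetical placeholders (Hoffmann-style magnitudes), NOT the
# paper's jointly fitted values.
E = 1.69                 # irreducible entropy term e
A, ALPHA = 406.4, 0.34   # parameter-count term
B, BETA = 410.7, 0.28    # token-count term

def loss(n_active: float, tokens: float) -> float:
    """Predicted loss L(N_a, D) = e + a*N_a^(-alpha) + b*D^(-beta)."""
    return E + A * n_active ** -ALPHA + B * tokens ** -BETA

def compute_flops(n_active: float, tokens: float) -> float:
    """Uniform FLOP estimate C = 6 * N_a * D, dense or sparse alike."""
    return 6.0 * n_active * tokens

# Dense special case: at sparsity S = 0, N_a equals the total parameter
# count N, so the expression reduces to the dense Hoffmann-style law.
print(loss(7e9, 1.4e12), compute_flops(7e9, 1.4e12))
```

Because both power-law terms carry negative exponents, the predicted loss decreases monotonically toward e as either Nₐ or D grows, which is what allows the dense law to emerge as the S = 0 special case.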

The authors evaluate the new law against the three prior formulations using a comprehensive benchmark that includes 48 pruned models, 7 MoE models, and several dense baselines. Prediction accuracy is measured by the mean squared error (MSE) of the loss. The generalized law achieves an MSE of 1.06, a substantially tighter fit than the Frantar law's 3.04 and on par with the Abnar law's 1.06, while covering the entire sparsity spectrum with a single formula. IsoFLOP experiments further demonstrate that, for a fixed FLOP budget, there exists an optimal sparsity range (approximately 80%–95%) that yields the lowest loss; this insight is validated on a state-of-the-art MoE model, DeepSeek-V3 (671 B parameters, 94.49 % sparsity).
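The IsoFLOP analysis can be sketched as a one-dimensional sweep: fix the FLOP budget C and the total parameter count N, vary the sparsity S, and let the law score each configuration via Nₐ = (1 − S)·N and D = C/(6Nₐ). The coefficients and the budget figure below are hypothetical placeholders, so the exact optimum should not be read as the paper's result:

```python
# IsoFLOP sweep sketch: fixed FLOP budget C and total parameters N;
# higher sparsity trades active capacity N_a for more tokens D.
# Coefficients and the budget are illustrative placeholders.
E = 1.69
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28

def loss(n_active: float, tokens: float) -> float:
    return E + A * n_active ** -ALPHA + B * tokens ** -BETA

def isoflop_sweep(c_budget: float, n_total: float, sparsities):
    """Score each sparsity level at a fixed FLOP budget."""
    results = []
    for s in sparsities:
        n_active = (1.0 - s) * n_total     # N_a = (1 - S) * N
        tokens = c_budget / (6.0 * n_active)  # D from C = 6 * N_a * D
        results.append((s, loss(n_active, tokens)))
    return results

# Hypothetical budget, DeepSeek-V3-like total parameter count.
sweep = isoflop_sweep(c_budget=3e24, n_total=671e9,
                      sparsities=[i / 100 for i in range(99)])
best_s, best_loss = min(sweep, key=lambda t: t[1])
print(f"best sparsity ~= {best_s:.2f}, predicted loss {best_loss:.3f}")
```

The interesting qualitative feature is that the minimum sits strictly inside the sparsity range: too little sparsity wastes compute on parameters, too much starves the model of capacity.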

Beyond prediction, the unified formulation enables inverse design: given a compute budget C and a desired token count D, one can solve for the optimal active parameter count Nₐ, and consequently infer the best sparsity level S for a target total parameter budget N. This provides a practical tool for resource‑constrained model planning, allowing practitioners to balance model size, data volume, and sparsity without resorting to separate, architecture‑specific heuristics.
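The inverse-design step follows directly from the compute definition: solving C = 6 Nₐ D for Nₐ gives the active parameter count, and a total parameter budget N then fixes the sparsity via Nₐ = (1 − S)·N. A minimal sketch, with a budget chosen only for illustration (the token count echoes DeepSeek-V3's reported ~14.8 T training tokens):

```python
# Inverse-design sketch: from a compute budget C and token count D,
# recover the active parameter count N_a, then the sparsity S needed
# to fit a total parameter budget N. Budget values are hypothetical.
def active_params(c_budget: float, tokens: float) -> float:
    """Invert C = 6 * N_a * D for N_a."""
    return c_budget / (6.0 * tokens)

def required_sparsity(n_active: float, n_total: float) -> float:
    """Sparsity S such that N_a = (1 - S) * N_total."""
    if n_active > n_total:
        raise ValueError("active params cannot exceed total params")
    return 1.0 - n_active / n_total

n_a = active_params(c_budget=3.3e24, tokens=14.8e12)  # ~37B active
s = required_sparsity(n_a, n_total=671e9)             # ~0.945 sparsity
```

With these illustrative numbers the recovered sparsity lands near the 94.49 % reported for DeepSeek-V3, which is the kind of budget-planning calculation the unified law is meant to support.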

The paper also discusses limitations and future directions. The current empirical regime covers sparsities up to 98 %; behavior at extreme sparsities (>99 %) and the impact of routing overheads in MoE systems remain open questions. Extending the framework to multimodal models, quantized or low‑precision training, and incorporating hardware‑specific cost factors (e.g., memory bandwidth) are identified as promising avenues.

In summary, by grounding scaling behavior in the count of active parameters, the authors deliver a concise, architecture‑agnostic scaling law that unifies dense and sparse LLMs, offers superior predictive performance, and supplies a straightforward methodology for optimal model‑sparsity allocation under realistic compute constraints. This contribution is poised to become a valuable reference for both researchers designing next‑generation LLMs and engineers tasked with budgeting large‑scale training runs.

