Configuration-to-Performance Scaling Law with Neural Ansatz
Researchers build scaling laws to forecast the training performance of expensive large-scale runs with larger model size N and data size D. These laws assume that other training hyperparameters are optimally chosen, which can require significant effort and, in some cases, be impossible due to external hardware constraints. To improve predictability across a broader set of hyperparameters and enable simpler tuning at scale, we propose learning a *Configuration-to-Performance Scaling Law* (CPL): a mapping from the *full training configuration* to training performance. Because no simple functional form can express this mapping, we parameterize it with a large language model (LLM) and fit it with diverse open-source pretraining logs across multiple sources, yielding a *Neural* Configuration-to-Performance Scaling Law (NCPL). NCPL accurately predicts how training configurations influence the final pretraining loss, achieving 20-40% lower prediction error than the configuration-agnostic Chinchilla law and generalizing to runs using up to 10× more compute than any run in the training set. It further supports joint tuning of multiple hyperparameters with performance comparable to hyperparameter scaling law baselines. Finally, NCPL naturally and effectively extends to richer prediction targets such as loss-curve prediction.
💡 Research Summary
The paper addresses a fundamental limitation of existing scaling laws for large language model (LLM) pre‑training: they predict final loss solely as a function of model size (N) and data tokens (D), assuming all other training hyper‑parameters (learning rate, batch size, optimizer, etc.) are already optimally tuned. In practice, optimal hyper‑parameter tuning is costly, sometimes infeasible due to hardware constraints, and the optimal settings may differ across scales. To overcome this, the authors propose a Configuration‑to‑Performance Scaling Law (CPL) that maps the full training configuration C—including architecture details, data scale, optimizer choice, learning‑rate schedule, weight decay, batch size, and other settings—to a performance metric P such as final loss.
Because no simple closed‑form expression can capture the high‑dimensional, nonlinear relationship between C and P, the authors adopt a data‑driven neural ansatz. They collect over 3,000 publicly available pre‑training logs from two open‑source projects (Marin and StepLaw) covering a wide variety of model sizes, data amounts, and hyper‑parameter settings. Using these logs, they fine‑tune a pretrained large language model (Qwen‑3‑1.7B) as a regression model fθ. Textual fields (e.g., optimizer name) are tokenized with the model’s native tokenizer, while numerical fields (N, D, learning rates, etc.) are projected into the embedding space via a two‑layer MLP. The final hidden state of the last input token is passed through a linear head to predict a residual loss: the observed loss minus a baseline Chinchilla‑law prediction that depends only on N and D. This residual formulation forces the network to learn configuration‑specific effects beyond the coarse N‑D dependence.
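The residual formulation can be made concrete with a short sketch. The Chinchilla-style baseline has the form L(N, D) = E + A/N^α + B/D^β, and the network's regression target is the observed loss minus this baseline. The coefficients below are illustrative placeholders, not the values fitted in the paper:

```python
# Illustrative Chinchilla-style baseline L(N, D) = E + A / N^alpha + B / D^beta.
# These coefficients are placeholders for the per-source fitted values.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(N, D):
    """Baseline loss prediction from parameter count N and token count D."""
    return E + A / N**alpha + B / D**beta

def residual_target(observed_loss, N, D):
    """Regression target for the neural ansatz: the part of the observed
    loss that the coarse N-D baseline cannot explain."""
    return observed_loss - chinchilla_loss(N, D)

# Example: a hypothetical 430M-parameter run on 10B tokens whose logged
# final loss was 2.95; the residual isolates configuration-specific effects.
r = residual_target(2.95, N=4.3e8, D=1e10)
```

Because the baseline already absorbs the dominant N-D trend, the network only needs to model the (smaller) configuration-specific deviations, which simplifies its learning problem.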
Training proceeds in two stages (LP‑FT): first only the numeric MLP encoder and the linear head are updated, then the entire model is fine‑tuned. When multiple data sources are used, a separate Chinchilla baseline is fitted per source to avoid leakage. Evaluation splits the data by model size into in‑distribution (ID, ≤ 430 M parameters) and out‑of‑distribution (OOD, up to 10× larger compute). Results show:
- Final loss prediction – On the StepLaw benchmark, NCPL reduces mean absolute error (MAE) by more than 40% compared to the Chinchilla law; on the Marin benchmark the reduction is over 20%. Spearman correlation also improves substantially, indicating better capture of configuration-specific variance.
- Joint hyper-parameter tuning – By enumerating candidate learning-rate and batch-size pairs for a target (N, D) and selecting the configuration with the lowest predicted loss, NCPL matches the performance of the dedicated hyper-parameter scaling law of Li et al. (which assumes power-law forms). This demonstrates that NCPL can be used for multi-parameter optimization without explicit functional assumptions.
- Loss-curve prediction – Extending the target to intermediate steps enables reconstruction of the entire loss trajectory. The model accurately reproduces optimizer-specific curve shapes, a task previously requiring handcrafted functional forms.
- Comparison with non-neural baselines – Strong tree-based regressors such as XGBoost perform worse than NCPL, highlighting the advantage of transformer-based models in handling heterogeneous categorical and numeric inputs and leveraging large-scale pre-training knowledge.
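The joint tuning procedure in the second bullet reduces to enumerating a candidate grid and taking the argmin of predicted loss. A minimal sketch, where `predicted_loss` is a hypothetical stand-in for the fitted NCPL predictor (the real model consumes the full textual-plus-numeric configuration):

```python
from itertools import product

def predicted_loss(config):
    """Hypothetical stand-in for the fitted NCPL predictor.
    A toy bowl-shaped response around lr=3e-3, batch_size=256, purely
    for illustration of the selection loop."""
    lr_term = (config["lr"] / 3e-3 - 1.0) ** 2
    bs_term = (config["batch_size"] / 256 - 1.0) ** 2
    return 2.8 + 0.1 * lr_term + 0.05 * bs_term

# Enumerate candidate (learning rate, batch size) pairs for a fixed (N, D)
# and select the configuration with the lowest predicted loss.
lrs = [1e-3, 3e-3, 1e-2]
batch_sizes = [128, 256, 512]
candidates = [
    {"N": 4.3e8, "D": 1e10, "lr": lr, "batch_size": bs}
    for lr, bs in product(lrs, batch_sizes)
]
best = min(candidates, key=predicted_loss)
```

No functional form is assumed about how loss depends on the hyper-parameters; the predictor is queried as a black box, which is what lets NCPL handle settings (e.g. optimizer choice) that power-law fits cannot express.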
The authors discuss limitations: NCPL’s extrapolation is reliable only within the distribution of logged configurations; novel architectures or data domains would require additional fine‑tuning. Nevertheless, as more public logs become available, a community‑wide NCPL could continuously ingest new data, benefiting from the prior knowledge encoded in the base language model and requiring relatively few new examples to adapt.
In summary, the paper introduces a novel “configuration‑to‑performance” paradigm, implements it with a neural network trained on heterogeneous open‑source logs, and demonstrates superior predictive accuracy, effective joint hyper‑parameter selection, and the ability to predict full loss curves. This work paves the way for more practical, cost‑effective planning of large‑scale LLM training runs, moving beyond the restrictive assumptions of traditional scaling laws.