Neural Neural Scaling Laws

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Neural scaling laws predict how language model performance improves with increased compute. While aggregate metrics like validation loss can follow smooth power-law curves, individual downstream tasks exhibit diverse scaling behaviors: some improve monotonically, others plateau, and some even degrade with scale. We argue that predicting downstream performance from validation perplexity suffers from two limitations: averaging token-level losses obscures signal, and no simple parametric family can capture the full spectrum of scaling behaviors. To address this, we propose Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling law prediction as time-series extrapolation. NeuNeu combines temporal context from observed accuracy trajectories with token-level validation losses, learning to predict future performance without assuming any bottleneck or functional form. Trained entirely on open-source model checkpoints from HuggingFace, NeuNeu achieves 2.04% mean absolute error in predicting model accuracy on 66 downstream tasks – a 38% reduction compared to logistic scaling laws (3.29% MAE). Furthermore, NeuNeu generalizes zero-shot to unseen model families, parameter counts, and downstream tasks. Our work suggests that predicting downstream scaling laws directly from data outperforms parametric alternatives.


💡 Research Summary

The paper revisits the problem of predicting downstream task performance of large language models as they scale. Traditional scaling laws, such as power‑law or logistic relationships, rely on a single aggregate metric—typically validation loss or perplexity—to forecast accuracy on downstream benchmarks. The authors identify two fundamental shortcomings of this approach. First, averaging token‑level losses discards rich distributional information; two models with identical average loss can have very different loss histograms, reflecting distinct learning dynamics and capabilities. Second, a single parametric family cannot capture the full spectrum of observed scaling behaviors, which include monotonic improvement, early saturation, and “inverse scaling” where performance degrades at larger scales.
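The first shortcoming is easy to illustrate. Below is a minimal, purely hypothetical example (assumed numbers, not data from the paper) of two token-level loss vectors with identical average loss but very different histograms:

```python
import numpy as np

# Hypothetical token-level losses for two models. Both average to 2.0,
# so perplexity-based scaling laws cannot distinguish them.
loss_a = np.full(1000, 2.0)                      # uniformly mediocre
loss_b = np.concatenate([np.full(500, 0.5),      # half the tokens nearly solved
                         np.full(500, 3.5)])     # half still hard

assert loss_a.mean() == loss_b.mean() == 2.0     # identical aggregate signal

# The histograms, however, reveal very different learning dynamics.
hist_a, _ = np.histogram(loss_a, bins=4, range=(0, 4))   # [0, 0, 1000, 0]
hist_b, _ = np.histogram(loss_b, bins=4, range=(0, 4))   # [500, 0, 0, 500]
```

Model B has effectively mastered half the tokens while still failing on the rest, a pattern that may translate into very different downstream accuracy than model A's uniform mediocrity.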

To overcome these issues, the authors cast scaling‑law prediction as a time‑series extrapolation task and introduce Neural Neural Scaling Laws (NeuNeu), a neural network that consumes both the full token‑level validation loss distribution and the historical trajectory of downstream accuracies. Token‑level cross‑entropy losses ℓ_i are transformed into probabilities p_i = e^{‑ℓ_i} to obtain a bounded representation. A loss encoder built from four strided 1‑D convolutional layers hierarchically downsamples the probability vector, producing a single embedding e. In parallel, each observed accuracy y_t and the compute gap g_t (the amount of compute between evaluations) are linearly projected and concatenated into context tokens c_t. These embeddings, together with a CLS token, are fed into a six‑layer bidirectional Transformer with rotary positional embeddings. The Transformer’s CLS output is passed through a linear head that predicts five quantiles (0.1, 0.25, 0.5, 0.75, 0.9) of the future accuracy distribution using quantile (pinball) loss, thereby providing both a point estimate (median) and calibrated uncertainty.
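The quantile head's pinball loss can be sketched as follows. The quantile levels and tensor shapes follow the description above; the function name and signature are our own assumptions, and the sketch uses NumPy rather than a deep-learning framework for brevity:

```python
import numpy as np

QUANTILES = np.array([0.1, 0.25, 0.5, 0.75, 0.9])

def pinball_loss(pred, target, quantiles=QUANTILES):
    """Quantile (pinball) loss over a batch.

    pred:   (batch, 5) predicted quantiles of future accuracy
    target: (batch,)   realized future accuracy
    """
    err = target[:, None] - pred  # positive where the model under-predicted
    # Under-prediction is weighted by q and over-prediction by (1 - q), so
    # the minimizer of each output column is the corresponding quantile of
    # the target distribution.
    return np.maximum(quantiles * err, (quantiles - 1.0) * err).mean()
```

Because the loss is asymmetric per column, the five outputs spread out to cover the predictive distribution, giving both a median point estimate and calibrated uncertainty bands.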

Training data are harvested from publicly available HuggingFace checkpoints of the DataDecide suite, covering six model sizes (90 M–1 B parameters) each trained on three random seeds. For every checkpoint, 66 downstream tasks from the OLMES benchmark are evaluated, yielding a rich set of accuracy trajectories. The authors construct training examples by randomly dropping entries from each trajectory (drop probability 0.4) and re‑aggregating compute gaps, which forces the model to learn robust forecasting from incomplete histories. Token‑level loss vectors are computed on a fixed 256 k token slice of the WebOrganizer dataset, with whitespace tokenization and sub‑word probability aggregation to ensure consistency across models.
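The trajectory-dropping augmentation can be sketched as below. The interface, and the rule that a dropped checkpoint's compute folds into the gap before the next kept checkpoint, are our reading of the paper's description:

```python
import random

def augment_trajectory(accs, gaps, p_drop=0.4, rng=random):
    """Randomly drop checkpoints from an accuracy trajectory.

    accs: per-checkpoint downstream accuracies
    gaps: compute spent between consecutive evaluations
    Dropped checkpoints vanish from the history, and their compute is
    re-aggregated into the gap preceding the next surviving checkpoint.
    """
    kept_accs, kept_gaps = [], []
    pending = 0.0
    for acc, gap in zip(accs, gaps):
        pending += gap
        if rng.random() < p_drop:
            continue                      # dropped: compute carries forward
        kept_accs.append(acc)
        kept_gaps.append(pending)
        pending = 0.0
    return kept_accs, kept_gaps
```

Training on these thinned histories forces the forecaster to cope with irregular, incomplete evaluation schedules rather than memorizing a fixed checkpoint cadence.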

Evaluation focuses on four zero‑shot generalization scenarios: (1) a different random seed for the same model family, (2) a different pre‑training dataset (C4) for the same sizes, (3) entirely new model families (Pythia, 70 M–6.9 B parameters) whose scale lies outside the training distribution, and (4) 13 downstream tasks withheld during training. NeuNeu achieves a mean absolute error (MAE) of 2.04 % across 66 tasks, a 38 % reduction relative to the best logistic scaling baseline (MAE = 3.29 %). In the most challenging Pythia scenario, NeuNeu’s MAE is 2.2 % versus 3.6 % for logistic scaling. Ablation studies demonstrate that using only the average loss, only histogram deltas, or omitting loss information altogether degrades performance substantially, confirming that the full token‑level probability distribution is essential.

The paper’s contributions are: (1) the first non‑parametric neural model for scaling‑law prediction that directly learns from data, (2) incorporation of quantile regression for calibrated uncertainty estimates, (3) a novel loss‑encoding pipeline that preserves distributional signals, and (4) a reproducible framework that leverages only open‑source checkpoints, enabling community‑wide adoption. Limitations include the modest 20 M‑parameter Transformer backbone and the focus on language‑only data; extending NeuNeu to larger architectures or multimodal settings remains future work. Nonetheless, NeuNeu offers a powerful tool for researchers and practitioners to forecast downstream performance, allocate compute resources, and guide model‑size decisions without relying on oversimplified parametric scaling laws.

