Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

Reading time: 2 minutes

📝 Original Info

  • Title: Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training
  • ArXiv ID: 2512.08894
  • Date: 2025-12-09
  • Authors: Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, Jason Ramapuram

📝 Abstract

While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for...
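To make the "direct" approach concrete, the sketch below fits a power law that maps training compute to log accuracy on a downstream benchmark at a fixed token-to-parameter ratio, as the abstract describes. The specific functional form (log acc = -a · C^(-b)), the data points, and the parameter names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of directly fitting downstream log accuracy as a power law
# in training compute (fixed token-to-parameter ratio). The functional form
# and the data below are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def log_acc_power_law(compute, a, b):
    """Assumed form: log(accuracy) = -a * (C / C_ref)^(-b).

    Compute is normalized by the smallest measured budget (1e19 FLOPs here)
    to keep the fit numerically well behaved.
    """
    return -a * (compute / 1e19) ** (-b)

# Hypothetical (training FLOPs, benchmark accuracy) pairs from small-scale runs.
compute_flops = np.array([1e19, 3e19, 1e20, 3e20, 1e21])
accuracy = np.array([0.42, 0.48, 0.55, 0.61, 0.67])

# Fit in log-accuracy space, then extrapolate to a larger training budget.
params, _ = curve_fit(log_acc_power_law, compute_flops, np.log(accuracy), p0=[1.0, 0.1])
predicted_acc = np.exp(log_acc_power_law(1e22, *params))
print(f"Predicted accuracy at 1e22 FLOPs: {predicted_acc:.3f}")
```

Fitting in log-accuracy space mirrors the abstract's claim that log accuracy, rather than accuracy itself, follows a simple power law at a fixed token-to-parameter ratio.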

📄 Full Content

Large Language Models (OpenAI et al., 2024; Team et al., 2025; DeepSeek-AI et al., 2025) based on the Transformer (Vaswani et al., 2023) architecture have achieved impressive results, approaching or exceeding human-level performance across multiple domains. Scaling laws (Hestness et al., 2017; Kaplan et al., 2020) are an established method for modeling the performance of these networks, enabling researchers to plan large-scale training runs based on curated sets of smaller experiments. Traditionally, these laws focus on predicting proxy metrics for model quality, such as pretraining loss.
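For contrast with the paper's direct downstream approach, the sketch below shows what a conventional proxy-metric scaling law looks like: pretraining loss modeled as a saturating power law in training compute. This single-variable form is a common choice in the scaling-law literature and is assumed here for illustration; it is not the parameterization proposed in this paper, and the data are hypothetical.

```python
# Conventional proxy-metric scaling law: validation loss as a saturating
# power law in training compute, L(C) = E + A * (C / C_ref)^(-alpha).
# Form and numbers are illustrative, not taken from this paper.
import numpy as np
from scipy.optimize import curve_fit

def loss_law(compute, E, A, alpha):
    """Irreducible loss E plus a power-law term that shrinks with compute."""
    return E + A * (compute / 1e19) ** (-alpha)

# Hypothetical (training FLOPs, validation loss) pairs from small pilot runs.
compute_flops = np.array([1e19, 3e19, 1e20, 3e20, 1e21])
val_loss = np.array([3.10, 2.95, 2.80, 2.68, 2.57])

params, _ = curve_fit(loss_law, compute_flops, val_loss, p0=[2.0, 1.0, 0.2])
print(f"Extrapolated loss at 1e22 FLOPs: {loss_law(1e22, *params):.3f}")
```

Laws of this kind predict the proxy metric well, but mapping that prediction onward to benchmark accuracy is the step the paper identifies as error-prone in the two-stage procedure.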

…(Content truncated for length.)

📸 Image Gallery

arc_challenge.png arc_challenge_multiple_figs.png arc_challenge_thresh_6e+21.png arc_easy.png arc_easy_thresh_6e+21.png avg.png combined_arc_hellaswag_logistic_fit.png gsm8k_cot_llama.png hellaswag.png hellaswag_thresh_6e+21.png hellaswagy_multiple_figs.png humaneval-py.png lambada_openai.png lbpp.png mae_trends_group_1_main_tasks.png mae_trends_group_2_main_tasks.png mae_trends_group_3_main_tasks.png mae_trends_group_4_main_tasks.png mre_trends_group_1_main_tasks.png mre_trends_group_2_main_tasks.png mre_trends_group_3_main_tasks.png mre_trends_group_4_main_tasks.png multiple_figs.png piqa.png piqa_multiple_figs.png piqa_thresh_6e+21.png sciq.png sciq_multiple_figs.png sciq_thresh_6e+21.png triviaqa_1shot.png webqs_1shot.png winogrande.png

Reference

This content is AI-processed based on open access ArXiv data.
