For deploying foundation models, practitioners increasingly need prescriptive scaling laws: given a pre-training compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping as the field evolves? Using large-scale observational evaluations of model performance (5k existing observations and 2k newly collected ones), we estimate capability boundaries, i.e., high conditional quantiles of benchmark scores as a function of log pre-training FLOPs, via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate the temporal reliability of these boundaries by fitting on earlier model generations and evaluating on later releases. Across most tasks, the estimated boundaries are stable, with the exception of math reasoning, which exhibits a consistently advancing boundary over time. We then extend our approach to analyze task-dependent saturation and to probe contamination-related shifts on math reasoning tasks. Finally, we introduce an efficient algorithm that recovers near-full-data frontiers using roughly 20% of the evaluation budget. Together, our work releases PROTEUS-2K, an up-to-date model performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability boundaries shift over time.
Over the past several years, language model (LM) scaling has emerged as one of the most robust empirical regularities in modern machine learning (Hestness et al., 2017). Across model families and training regimes, increasing pre-training compute has been shown to produce smooth and predictable improvements in loss, perplexity, and, to a lesser extent, downstream task performance (Brown et al., 2020; Chowdhery et al., 2023; Gadre et al., 2024; Hoffmann et al., 2022). This observation has driven a paradigm in which scale itself becomes a primary design variable, enabling practitioners to trade off data, model size, and compute in a principled way (Hoffmann et al., 2022; Kaplan et al., 2020).

As language models transition from research artifacts to deployed systems, the limitations of existing scaling laws have become increasingly pronounced. Despite their success, scaling laws do not answer a question that practitioners routinely face: given a fixed pre-training compute budget C, what downstream performance can one realistically expect to achieve with high probability after post-training? While average trends with respect to compute are sometimes stable, downstream behaviors of interest (such as reasoning performance, instruction following, or domain-specific question answering) exhibit substantial heterogeneity even among models trained with similar FLOPs (Jin et al., 2025). Post-training procedures (Ziegler et al., 2019), data curation choices (Setlur et al., 2024), and temporal effects (Dominguez-Olmedo et al., 2024) further complicate the relationship between pre-training compute and deployed performance, weakening the direct applicability of standard scaling laws for real-world decision making.

Table 1. Estimated attainable accuracies predicted by the no-split 0.98-quantile sigmoid boundaries at 10^24 FLOPs.

Recent work has highlighted this gap from multiple perspectives: downstream benchmark scaling can be noisy, benchmark-dependent, and weakly coupled to pre-training loss, in part due to heterogeneous training factors (e.g., data mixtures, architectures, and evaluation artifacts) and the disconnect between loss and downstream accuracy (Chen et al., 2024; Gadre et al., 2024; Lourie et al., 2025; Qi et al., 2025; Schaeffer et al., 2024; Zhang et al., 2025a). At the same time, the rapid growth of public evaluation repositories, especially leaderboards that aggregate thousands of post-trained checkpoints, makes it increasingly feasible to study these relationships empirically from observational data.

In this paper, we study prescriptive scaling: given a base-model pre-training compute budget, what attainable post-training performance should we expect on a target benchmark? Rather than modeling only mean trends, we summarize the attainable region with capability boundaries: for each task we estimate a high conditional quantile of observed post-trained accuracy as a function of log pre-training compute (Koenker and Bassett, 1978). This framing is robust to outliers and recipe-specific variation, and it yields an end-to-end, decision-oriented compute-to-performance map from large collections of heterogeneous checkpoints. Crucially, we treat time as a first-class axis: by fitting boundaries on earlier model generations and validating on later ones, we can assess whether a compute-based boundary remains predictive as training recipes and post-training techniques evolve.
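To make the boundary-fitting idea concrete, the following Python sketch fits a 0.98-quantile capability boundary as a monotone, saturating sigmoid of log pre-training FLOPs by minimizing a smoothed pinball (quantile) loss. The column names, the specific smoothing of the check loss, the optimizer choice, and the initialization are illustrative assumptions, not the paper's exact estimator.

```python
# Sketch: fit a 0.98-quantile "capability boundary" as a saturating sigmoid of
# log10 pre-training FLOPs, via a smoothed pinball (quantile) loss.
# Assumptions: variable names, the soft-absolute-value smoothing, and the
# Nelder-Mead optimizer are illustrative choices, not the paper's exact setup.
import numpy as np
from scipy.optimize import minimize

TAU = 0.98   # target conditional quantile
EPS = 1e-3   # smoothing scale for the pinball loss

def sigmoid_boundary(params, x):
    """Monotone, saturating boundary: L + (U - L) * sigmoid(k * (x - x0))."""
    L, U, k, x0 = params
    return L + (U - L) / (1.0 + np.exp(-k * (x - x0)))

def smoothed_pinball(residual, tau, eps):
    """Differentiable surrogate for the check loss rho_tau(r) = 0.5|r| + (tau - 0.5) r."""
    abs_smooth = np.sqrt(residual ** 2 + eps ** 2)  # smooth the kink at r = 0
    return 0.5 * abs_smooth + (tau - 0.5) * residual

def fit_boundary(log_flops, accuracy, tau=TAU):
    """Estimate sigmoid parameters by minimizing the mean smoothed pinball loss."""
    def objective(params):
        pred = sigmoid_boundary(params, log_flops)
        return np.mean(smoothed_pinball(accuracy - pred, tau, EPS))
    # Initial guess: floor/ceiling from the data range, unit slope, median compute.
    init = np.array([accuracy.min(), accuracy.max(), 1.0, np.median(log_flops)])
    result = minimize(objective, init, method="Nelder-Mead",
                      options={"maxiter": 20_000, "xatol": 1e-6, "fatol": 1e-8})
    return result.x

# Example use: predict the attainable accuracy ceiling at 10^24 FLOPs.
# params = fit_boundary(np.log10(flops), scores)
# ceiling_at_1e24 = sigmoid_boundary(params, 24.0)
```

Because the loss is asymmetric with tau = 0.98, the fitted curve is pulled toward the upper envelope of the scatter rather than its mean, which is what makes it a boundary rather than an average trend.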
We rely on three complementary data sources: (i) the Open LLM Leaderboard v1 (Beeching et al., 2023) and v2 (Fourrier et al., 2024), each containing thousands of models evaluated on six benchmarks under consistent metrics; (ii) public leaderboards for state-of-the-art frontier models (e.g., Epoch AI and LifeArchitect.AI); and (iii) 2.4k newly added open-weight models (PROTEUS-2K) covering releases after the Open LLM Leaderboard v2 cutoff (2025-03-13) through the end of 2025, which we evaluate ourselves with the same Open LLM Leaderboard pipeline, including new model families such as Qwen3 (Yang et al., 2025), Gemma-3 (Team et al., 2025), and GPT-OSS (Agarwal et al., 2025). Together, these sources provide both breadth (many heterogeneous post-training pipelines) and a basis for assessing temporal validity (Dominguez-Olmedo et al., 2024). The main contributions are summarized below:
• Sigmoid capability boundaries: We show that, compared with pre-trained model performance, attainable post-trained performance is much more predictable and is well characterized by a simple monotone, saturating sigmoid function of log-compute.
• Temporal validity and task-dependent ceilings: Using chronological train/validation splits, we find that capability boundaries for a majority of tasks are comparatively stable over time, yielding a nearly deterministic relationship between compute and attainable accuracies, while math reasoning exhibits a consistently improving boundary. As an illustration, Table 1 provides estimated attainable accuracies (0.98-quantile