Anticipatory Evaluation of Language Models
Progress in large language models is increasingly constrained by an evaluation bottleneck: benchmarks must be built and models run before iteration can begin. We investigate whether evaluation outcomes can be forecast before any experiments are conducted. Specifically, we study text-only performance prediction, where models estimate performance from task descriptions and experimental configurations alone, without access to dataset instances. To support systematic study, we curate PRECOG, a corpus of description-performance pairs spanning diverse tasks, domains, and metrics. We scrape task and configuration descriptions from arXiv, yielding 2,290 instances covering 1,519 papers, and construct a test split using papers published after the evaluated models’ knowledge cutoff. Experiments show the task is challenging but feasible: reasoning models achieve non-trivial forecasting skill, reaching a mean absolute error as low as 9.9 at high-confidence thresholds. Overall, our corpus and analyses offer an initial step toward open-ended anticipatory evaluation, supporting difficulty estimation and smarter resource allocation.
💡 Research Summary
The paper tackles a growing bottleneck in the development of large language models (LLMs): the need to build, run, and evaluate benchmarks before any iteration can occur. While prior work on performance prediction assumes access to partially observed results, pilot runs, or structured experiment tables, the authors ask a more ambitious question: can we forecast a model’s performance on a novel benchmark using only the natural‑language description of the task and its experimental configuration, before any data are annotated or any model is run?
To answer this, the authors introduce the “text‑only performance forecasting” problem and construct a large‑scale dataset called PRECOG (Performance REgression COre for Generalization). PRECOG is built by automatically harvesting experimental records from arXiv papers. For each record the pipeline (i) collects the result paper (which reports a metric) and the dataset paper (which describes the task), (ii) extracts a schema‑aligned, identity‑masked description that merges task definition, data provenance, evaluation protocol, difficulty cues, and configuration details, and (iii) normalizes the reported metric to a common 0‑100 scale. The final corpus contains 2,290 description‑performance pairs drawn from 1,519 papers, spanning seven common metrics (Accuracy, F1, Recall, Precision, ROUGE, BLEU, Exact Match). Human audits of a random sample confirm that descriptions are fully anonymized, cover the required schema, and are faithfully grounded in the source PDFs.
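Step (iii) of the pipeline, normalizing every reported metric onto a common 0‑100 scale, can be sketched as below. The heuristic shown (treating values in [0, 1] as fractions to be rescaled, then clamping) is an illustrative assumption, not the paper's actual implementation; the function name and the exact handling of boundary values are hypothetical.

```python
def normalize_metric(value: float) -> float:
    """Map a reported score onto a common 0-100 scale (illustrative heuristic).

    Assumption: metrics reported as fractions in [0, 1] (e.g. accuracy 0.87)
    are rescaled to percentages; values already on a 0-100 scale are kept
    as-is. A value of exactly 1.0 is ambiguous (100% vs. a 1.0 score) and is
    treated here as a fraction; a real pipeline would disambiguate per metric.
    """
    if 0.0 <= value <= 1.0:
        value *= 100.0
    # Clamp to the target range to guard against out-of-range reports.
    return max(0.0, min(100.0, value))
```

A real harvesting pipeline would also need per-metric rules (e.g. BLEU is sometimes reported on a 0‑1 scale and sometimes on 0‑100), which this sketch omits.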
The forecasting models are built on state‑of‑the‑art LLMs, primarily GPT‑5. Two variants are evaluated: (a) a pure description‑only regressor that receives only the textual specification, and (b) a retrieval‑augmented version that first queries a time‑bounded literature corpus for up to k relevant documents, then incorporates the retrieved evidence in a ReAct‑style reasoning loop (reason → query → retrieve → refine). Both variants output a numeric estimate ŷ ∈ [0, 100] on the normalized metric scale.
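The reason → query → retrieve → refine loop of the retrieval‑augmented variant can be sketched as follows. This is a minimal sketch under assumptions: `retrieve` and `propose` stand in for the actual search backend and LLM calls, and the function name, stopping condition, and step budget are all hypothetical, not taken from the paper.

```python
from typing import Callable, List, Optional, Tuple

def retrieval_augmented_forecast(
    description: str,
    retrieve: Callable[[str, int], List[str]],   # query corpus, return <= k snippets
    propose: Callable[[str, List[str]], Tuple[float, Optional[str]]],
    k: int = 5,
    max_steps: int = 3,
) -> float:
    """Sketch of a ReAct-style forecasting loop (hypothetical structure).

    `propose(description, evidence)` plays the LLM's role: it returns a
    current estimate and either a follow-up search query or None when the
    model is confident enough to stop refining.
    """
    evidence: List[str] = []
    estimate = 0.0
    for _ in range(max_steps):
        estimate, query = propose(description, evidence)  # reason
        if query is None:                                  # confident: stop
            break
        evidence.extend(retrieve(query, k))                # retrieve & refine
    # Final estimate is clamped to the normalized 0-100 metric scale.
    return max(0.0, min(100.0, estimate))
```

The key design point this illustrates is that retrieval is interleaved with reasoning rather than done once up front: each round of evidence can change both the estimate and the next query.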