Benchmarks Are Not That Out of Distribution: Word Overlap Predicts Performance
Understanding what constitutes high-quality pre-training data remains a central question in language model training. In this work, we investigate whether benchmark performance is primarily driven by the degree of statistical pattern overlap between pre-training corpora and evaluation datasets. We measure this overlap using word-level unigram cross-entropy and word frequency statistics, and perform controlled experiments across $10$ zero-shot benchmarks, $4$ pre-training datasets spanning $8.5\mathrm{B}$ to $60\mathrm{B}$ tokens, and model sizes ranging from $400\mathrm{M}$ to $3\mathrm{B}$ parameters. Our results demonstrate a robust inverse relationship between word-level unigram cross-entropy and benchmark performance, suggesting that widely used benchmarks are strongly influenced by word overlap between training and evaluation data. Moreover, larger pre-training subsets with similar word-level unigram cross-entropy yield improved downstream results, indicating that word frequency statistics play an additional role in shaping benchmark scores. Taken together, these results suggest that many standard benchmarks are only weakly out-of-distribution relative to pre-training corpora, so that simple word-overlap statistics predict benchmark performance.
💡 Research Summary
The paper investigates whether the performance of language models on standard benchmarks is primarily driven by the statistical overlap between the pre‑training corpus and the evaluation data. To quantify this overlap, the authors introduce two word‑level proxy metrics: (1) unigram cross‑entropy of the benchmark under the word distribution of the pre‑training corpus, and (2) simple word‑frequency statistics derived from the pre‑training data. Unigram cross‑entropy is equivalent to the KL‑divergence between marginal word distributions (up to a constant) and therefore provides a clean, tokenizer‑agnostic measure of distributional similarity. Word‑frequency statistics are used as a complementary indicator of how much probability mass is allocated to long‑tail words.
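The cross-entropy metric described above can be made concrete with a short sketch. Since $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$, and $H(p)$ is fixed for a given benchmark, ranking corpora by cross-entropy is equivalent to ranking them by KL-divergence. The following is a minimal illustration, not the authors' code; it assumes whitespace tokenization and add-$\alpha$ smoothing (the paper's exact smoothing scheme is not specified here):

```python
from collections import Counter
import math

def unigram_cross_entropy(corpus_words, benchmark_words, alpha=1.0):
    """Cross-entropy (in nats) of the benchmark's empirical word distribution
    under a smoothed unigram model estimated from the pre-training corpus.
    Smoothing vocabulary is the union of both word sets (a sketch-level choice)."""
    corpus_counts = Counter(corpus_words)
    vocab = set(corpus_words) | set(benchmark_words)
    total = sum(corpus_counts.values()) + alpha * len(vocab)  # add-alpha mass
    bench_counts = Counter(benchmark_words)
    n = sum(bench_counts.values())
    h = 0.0
    for word, count in bench_counts.items():
        p = (corpus_counts[word] + alpha) / total  # smoothed corpus probability
        h -= (count / n) * math.log(p)
    return h

# Toy example: a benchmark lexically closer to the corpus has lower cross-entropy.
corpus = "the cat sat on the mat the dog ran".split()
near = "the cat ran".split()
far = "quantum entanglement theorem".split()
assert unigram_cross_entropy(corpus, near) < unigram_cross_entropy(corpus, far)
```

The paper's finding predicts that, across corpora, lower values of this quantity should correspond to higher benchmark accuracy.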
The experimental setup spans four pre-training corpora—FineWeb-Edu, DCLM, C4, and OpenWebText—each sampled at three scales (8.5 B, 26 B, and 60 B tokens). Models are based on the LLaMA architecture with three sizes (≈400 M, 1.33 B, and 3.36 B non-embedding parameters). All models use the GPT-2 tokenizer for consistency, and training follows standard AdamW optimization with cosine learning-rate decay. Ten widely used zero-shot benchmarks are evaluated: ARC Easy/Challenge, HellaSwag, MMLU, SciQ, OpenBookQA, PIQA, LAMBADA, SocialIQA, and SWAG.
Across all model sizes, data scales, and benchmarks, the authors observe a robust inverse correlation between unigram cross-entropy and benchmark accuracy: lower cross-entropy (i.e., higher word-distribution similarity) consistently yields higher scores. This pattern holds for every benchmark, and the ranking of pre-training corpora by cross-entropy exactly matches their ranking by performance (FineWeb-Edu > DCLM > C4 > OpenWebText). Moreover, when unigram cross-entropy is held approximately constant, increasing the number of pre-training tokens still improves performance, indicating that word-frequency statistics—particularly coverage of long-tail words—provide an additional boost beyond marginal distribution alignment.
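The long-tail coverage effect can be approximated by a simple type-coverage statistic: the fraction of distinct benchmark words that a corpus sample contains at all. This is a hypothetical helper consistent with the word-frequency analysis, not a metric taken from the paper; larger samples of the same corpus leave the marginal distribution roughly unchanged while raising this coverage:

```python
from collections import Counter

def benchmark_type_coverage(corpus_words, benchmark_words, min_count=1):
    """Fraction of distinct benchmark word types that occur at least
    `min_count` times in the pre-training corpus sample — a rough proxy
    for how well the sample covers the benchmark's long tail."""
    corpus_counts = Counter(corpus_words)
    bench_types = set(benchmark_words)
    covered = sum(1 for w in bench_types if corpus_counts[w] >= min_count)
    return covered / len(bench_types)

# Toy example: 3 of the 5 benchmark types ("the", "cat", "dog") appear in the corpus.
corpus = "the cat sat on the mat the dog ran".split()
bench = "the cat chased a dog".split()
print(benchmark_type_coverage(corpus, bench))  # → 0.6
```

Under this reading, two samples of the same corpus can have nearly identical unigram cross-entropy while the larger one covers more rare benchmark words, matching the reported scaling effect.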
These findings lead the authors to argue that many standard benchmarks are only weakly out‑of‑distribution relative to the data on which models are trained. Consequently, simple word‑overlap statistics can predict benchmark performance, calling into question the diagnostic value of such benchmarks for measuring true generalization. The paper contributes (i) a tokenizer‑agnostic, information‑theoretic metric for dataset overlap, (ii) extensive empirical evidence that this metric explains performance differences across corpora, and (iii) insight that larger pre‑training subsets improve scores even when marginal overlap is fixed.
The work has several limitations. It focuses exclusively on word-level statistics, ignoring higher-order syntactic or semantic divergences that may affect more complex reasoning tasks. The analysis is limited to English data and whitespace tokenization, so applicability to other languages or scripts is unclear. Only zero-shot evaluation is considered; the relationship may differ under fine-tuning or in-context learning. Finally, the model sizes are capped at roughly 3 B parameters, leaving open whether the same trends hold for today's much larger models.
In summary, the paper provides compelling evidence that lexical overlap between pre‑training data and benchmark datasets is a dominant factor in benchmark performance. This insight has practical implications for data selection, benchmark design, and the interpretation of reported gains in language‑model research. Future work should explore richer overlap metrics that capture structural and semantic differences, extend the analysis to diverse languages, and test whether the observed relationships persist in larger, more capable models.