Quantifying the Effect of Test Set Contamination on Generative Evaluations
As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination effects, and longer solutions are exponentially more difficult to memorize than shorter ones, presenting a contrast with discriminative evaluations, where solutions are only a few tokens in length. By characterizing how generation and memorization interact, we highlight a new layer of complexity for trustworthy evaluation of AI systems.
💡 Research Summary
This paper investigates the largely unexplored problem of test‑set contamination in generative evaluations, focusing on the MATH benchmark for competition‑style mathematics problems. The authors pre‑train a series of causal transformer language models (34 M to 344 M parameters) using the Qwen‑3 architecture on a high‑quality web‑crawl corpus that has been deliberately polluted with varying numbers of exact replicas of the MATH test set, ranging from zero up to 3,162 replicas (log‑uniform spacing). Each model is trained for 20 tokens per parameter, approximating the compute‑optimal regime described by Chinchilla‑style scaling.
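The sweep described above can be sketched numerically. This is a hypothetical reconstruction under stated assumptions: replica counts spaced log-uniformly from 10^0 to 10^3.5 (≈3,162), token budgets at ~20 tokens per parameter, and the standard C ≈ 6·N·D FLOP estimate; the exact grid used in the paper may differ.

```python
import numpy as np

# Hypothetical reconstruction of the contamination sweep: replica counts
# spaced log-uniformly from 10^0 to 10^3.5 (~3,162), plus the clean R=0 corpus.
replica_counts = [0] + [round(10 ** e) for e in np.arange(0.0, 3.51, 0.5)]
print("replica counts:", replica_counts)

# Token and compute budgets at ~20 tokens per parameter (Chinchilla-style),
# using the standard C ≈ 6·N·D FLOP estimate; model sizes from the summary.
budgets = {}
for n_params in (34e6, 93e6, 344e6):
    d_tokens = 20 * n_params            # D: training tokens
    budgets[n_params] = 6 * n_params * d_tokens  # C: total pretraining FLOP
    print(f"N = {n_params/1e6:.0f}M params -> D = {d_tokens/1e9:.2f}B tokens, "
          f"C = {budgets[n_params]:.2e} FLOP")
```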
Pre‑training Findings
- Performance Gains from Contamination – Both Math Verify scores (the fraction of problems whose generated solutions verify against the official answer) and cross‑entropy loss improve as the number of test‑set replicas increases. The improvement is modest for ≤10 replicas but shows a sharp rise around 100 replicas, eventually reaching a ceiling. Larger models benefit more from the same amount of contamination, mirroring trends seen in discriminative tasks.
- Memorization vs. Generalization – To test whether the gains reflect genuine reasoning, the authors evaluate contaminated models on two altered versions of MATH: (a) re‑phrased problems that keep numbers and logic but change wording, and (b) perturbed problems that alter the numbers while preserving structure. In both cases performance collapses to the level of an uncontaminated model, indicating that the observed gains are almost entirely due to verbatim memorization of the exact test strings rather than any learned mathematical insight.
- Scaling‑Law Analysis – Fitting the neural scaling law L(C, R) = E(R) + C₀(R)·C^{−α(R)} (with compute C ≈ 6 N D FLOP) separately for each contamination level R, the authors find that the irreducible error term E(R) shrinks dramatically, from 3.594 at R = 0 to 0.0347 at R = 316. Remarkably, even a single replica (R = 1) enables almost all models to achieve a lower loss than the estimated irreducible error of the completely clean corpus; extrapolating the fitted law, no finite amount of compute on clean data would match the contaminated loss. This is at odds with prior claims that a single exposure during pretraining is too little to induce meaningful memorization.
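The fitting procedure behind this analysis can be sketched with a standard nonlinear least-squares fit. The data below are synthetic, generated from a known law purely to illustrate the method; only the functional form L(C) = E + C₀·C^(−α) and the C ≈ 6ND convention come from the summary.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(C, E, C0, alpha):
    # L(C) = E + C0 * C^(-alpha): irreducible error plus a power-law term.
    return E + C0 * np.power(C, -alpha)

# Synthetic illustration (not the paper's data): losses at several compute
# budgets generated from a known law, to demonstrate the fitting procedure.
C = np.logspace(15, 19, 9)                    # compute budgets in FLOP
true_E, true_C0, true_alpha = 3.594, 4e3, 0.2
L = power_law(C, true_E, true_C0, true_alpha)

# Fit with a rough initial guess; one such fit per contamination level R
# would yield the E(R), C0(R), alpha(R) curves discussed above.
params, _ = curve_fit(power_law, C, L, p0=[2.0, 1e3, 0.15], maxfev=20000)
E_hat, C0_hat, alpha_hat = params
print(f"E = {E_hat:.3f}, alpha = {alpha_hat:.3f}")
```

Comparing the fitted E across contamination levels is what lets one ask whether any clean-corpus compute budget could reach the contaminated loss.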
Further Training Experiments
- Overtraining with Fresh Data – After the compute‑optimal pre‑training, models are continued on fresh, uncontaminated data for varying multiples of the original compute budget. For low‑contamination models, additional training further reduces loss; for high‑contamination models, loss actually rises, indicating that fresh data dilutes the proportion of contaminated tokens and erodes the memorization advantage. The crossover point moves to lower replica counts as model size grows (e.g., from 32 replicas for 34 M to just 1 replica for 93 M).
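The dilution mechanism invoked above can be made concrete with back-of-the-envelope arithmetic. The replica token count below is an illustrative assumption, not the paper's number; the point is only that adding fresh tokens shrinks the contaminated share of the corpus.

```python
# Back-of-the-envelope dilution arithmetic for the overtraining result above.
TEST_REPLICA_TOKENS = 2_000_000   # assumed tokens per MATH test-set replica

def contaminated_fraction(replicas, pretrain_tokens, fresh_tokens=0.0):
    """Fraction of all training tokens that come from test-set replicas."""
    return replicas * TEST_REPLICA_TOKENS / (pretrain_tokens + fresh_tokens)

base = 34e6 * 20  # 34M-parameter model at ~20 tokens per parameter
for mult in (0, 1, 3):  # fresh-data overtraining, in multiples of the base budget
    f = contaminated_fraction(32, base, mult * base)
    print(f"{mult}x fresh data: contaminated fraction = {f:.3%}")
```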
- Supervised Fine‑Tuning (SFT) – Models are fine‑tuned on the original MATH training set with supervised objectives. When pre‑training contamination is low, SFT improves performance, presumably by reinforcing genuine reasoning patterns. When contamination is high, SFT worsens performance, likely because it reinforces the memorized answer strings and makes the model more brittle on unseen variations. This polarity underscores that fine‑tuning cannot be assumed to mitigate contamination without careful analysis.
Inference‑Time Factors
The authors explore two key inference knobs: sampling temperature and solution length.
- Temperature – At temperature 0 (greedy decoding) the model locks into the memorized solution path (“Deterministic Lock‑In”). Raising temperature introduces stochasticity, leading to an “Exponentially Fast Decoherence” regime where the probability of reproducing the exact memorized answer drops sharply. High temperature thus acts as a “truth serum,” decoupling contamination gains from genuine generalization.
- Solution Length – MATH solutions range from tens to hundreds of tokens. The study shows that longer solutions are exponentially harder to memorize perfectly; the probability of exact recall falls dramatically with length, creating a “Brittle Memorization” regime for medium‑length outputs and a “Deterministic Lock‑In” only for very short answers. This contrasts with discriminative tasks where the answer is typically a few tokens, highlighting a unique vulnerability of generative benchmarks.
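Both inference-time effects follow from multiplying per-token probabilities. The toy model below is my own simplification, not the paper's analysis: each step pits the memorized token against a single alternative at a fixed logit gap, so its probability at temperature T is sigmoid(gap/T), and exact recall of an L-token solution is that probability raised to the L-th power.

```python
import numpy as np

def exact_recall_prob(logit_gap, temperature, length):
    """Toy two-token model: at each step the memorized token beats one
    alternative by a fixed logit gap; its softmax probability at
    temperature T is sigmoid(gap / T), and reproducing all `length`
    tokens verbatim multiplies the per-token probabilities."""
    p_token = 1.0 / (1.0 + np.exp(-logit_gap / temperature))
    return p_token ** length

# Recall decays exponentially in solution length, and faster at higher
# temperature: near-greedy decoding locks in, high temperature decoheres.
lengths = (10, 100, 500)
for T in (0.2, 1.0, 1.5):
    row = "  ".join(f"L={L}: {exact_recall_prob(2.0, T, L):.2e}" for L in lengths)
    print(f"T={T}: {row}")
```

Under this model, short answers (as in multiple-choice evaluations) are recalled almost surely even at moderate temperature, while hundred-token solutions survive only near T = 0, matching the regimes described above.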
Evaluation Library Bug Fix
During the experiments the authors discovered a critical implementation error in the widely used EleutherAI LM Evaluation Harness’s Math Verify scoring routine. The bug caused gold‑reference solutions to receive only ~70 % correctness, under‑reporting model performance. After collaborating with the library maintainers, they corrected the logic, raising gold‑reference scores to 100 %. This correction implies that many prior reports using earlier versions of the harness may have substantially mis‑estimated performance on mathematical reasoning tasks.
Overall Contributions and Implications
- Quantitative Evidence that even a single test‑set replica can dramatically lower loss, effectively “cheating” the scaling law and inflating perceived capability.
- Demonstration that contamination benefits are purely memorization‑driven, not indicative of genuine reasoning, as shown by re‑phrasing and perturbation experiments.
- Insight that overtraining on fresh data and supervised fine‑tuning have nuanced, contamination‑dependent effects, suggesting that post‑pre‑training procedures must be designed with contamination awareness.
- Guidance on inference‑time settings: using higher temperatures and longer answer formats can mitigate memorization, but they also affect overall solution quality and must be balanced.
- Tooling Improvement via the bug fix, improving the reliability of future benchmark reporting.
In sum, the paper provides a comprehensive, lifecycle‑wide analysis of test‑set contamination in generative evaluations, exposing a hidden layer of complexity that threatens the trustworthiness of AI capability assessments. It offers concrete methodological recommendations for researchers and practitioners aiming to develop and evaluate large language models responsibly.