Investigating Test Overfitting on SWE-bench
Tests can be useful for resolving issues in code repositories. However, relying too heavily on tests for issue resolution can lead to code that technically passes the observed tests but misses important cases or even breaks existing functionality. This problem, called test overfitting, is exacerbated by the fact that issues usually lack readily executable tests. Instead, several issue-resolution systems use tests auto-generated from issue descriptions, which may be imperfect. Some systems even iteratively refine code and tests jointly. This paper presents the first empirical study of test overfitting in this setting.
💡 Research Summary
This paper presents the first systematic investigation of test overfitting in large‑language‑model (LLM) based issue‑resolution systems that operate on real‑world software repositories. The authors focus on the common practice of generating surrogate tests from issue descriptions (using tools such as e‑Otter++) and then using those tests to guide code generation or refinement (e.g., via Agentless). Because most open‑source issues lack executable acceptance tests, the hidden “gold” tests (t_gold) that truly capture the intended behavior are unavailable during inference. The central hypothesis is that reliance on imperfect generated tests (t_gen) can cause LLM‑produced patches (c_new) to pass the observed tests while failing the hidden gold tests—a phenomenon the authors term “test overfitting.”
To explore this hypothesis, the authors formulate three research questions (RQs). RQ1 asks whether LLM‑generated patches overfit to the generated tests. RQ2 examines how a test‑based code refinement loop influences overfitting. RQ3 asks what would happen if the hidden gold tests were available during refinement. The experimental setup uses two state‑of‑the‑art LLMs—Claude‑3.7‑Sonnet and GPT‑4o—combined with the Agentless code‑generation system and the e‑Otter++ test‑generation system. The benchmark is TDD‑Bench Verified, a curated subset of 449 Python issue‑resolution instances derived from SWE‑bench. For each instance the pipeline proceeds in three steps: (1) Agentless produces an initial patch c_new; (2) e‑Otter++ produces a surrogate test t_gen; (3) a test‑driven refinement loop iteratively asks the LLM to modify either the code or the test based on execution logs, a critic prompt, and a reward function that blends pass/fail signals with line‑coverage information. The loop runs for up to 15 iterations or until t_gen flips from fail to pass.
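The refinement loop described above can be sketched in a few lines. This is a hedged sketch, not the paper's implementation: the reward blends a pass/fail signal with line coverage as the summary states, but the weight `w` and the callables `run_tests` and `llm_revise` are hypothetical stand-ins for the actual Agentless / e-Otter++ machinery.

```python
def reward(passed, covered, total, w=0.5):
    """Blend a pass/fail signal with line coverage (w is an assumed weight)."""
    coverage = covered / total if total else 0.0
    return w * (1.0 if passed else 0.0) + (1.0 - w) * coverage

def refine(code, test, run_tests, llm_revise, max_iters=15):
    """Iteratively revise the code and/or the test until t_gen passes.

    run_tests(code, test) -> (passed, log, covered, total)   # hypothetical
    llm_revise(code, test, log, score) -> (code, test)       # hypothetical
    """
    for _ in range(max_iters):
        passed, log, covered, total = run_tests(code, test)
        if passed:  # stop once t_gen flips from fail to pass
            break
        score = reward(passed, covered, total)
        # The critic prompt decides whether to modify the code or the test
        code, test = llm_revise(code, test, log, score)
    return code, test
```

Passing the test runner and reviser in as callables keeps the sketch self-contained; in the paper's pipeline these roles are played by the execution harness and the LLM, respectively.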
The primary metric is the “test overfitting rate”: the proportion of instances whose patch (initial or refined) passes t_gen but fails the hidden gold test t_gold or the original regression suite t_old. Results for RQ1 show that, without any refinement, Claude‑3.7‑Sonnet overfits on 21.8 % of the 229 instances where the initial patch passes t_gen, while GPT‑4o overfits on 33.0 % of its 176 such instances. This demonstrates that even when a patch looks successful against the generated test, a substantial fraction will be rejected by the true acceptance criteria.
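The metric itself is straightforward to compute. Below is a minimal sketch, simplified to consider only the hidden gold test (the paper's full definition also counts failures against the old regression suite):

```python
def overfitting_rate(results):
    """Fraction of t_gen-passing patches that fail the hidden gold test.

    results: iterable of (passes_t_gen, passes_t_gold) booleans, one per
    benchmark instance.
    """
    passed_gen = [r for r in results if r[0]]
    if not passed_gen:
        return 0.0
    overfit = sum(1 for _, gold_ok in passed_gen if not gold_ok)
    return overfit / len(passed_gen)
```

For example, 50 overfit patches out of 229 that pass t_gen yields a rate of about 21.8 %, matching the Claude‑3.7‑Sonnet figure for RQ1.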
For RQ2, the authors apply the refinement loop to the 220 instances where the initial patch fails t_gen. For Claude, the loop succeeds in making 22 additional patches pass t_gen, but 14 of those (63.6 %) still fail t_gold; the corresponding rate for GPT‑4o is 59.1 %. This raises the overall overfitting rates to 25.5 % (Claude) and 35.9 % (GPT‑4o). The authors also experiment with “hiding” the generated test (exposing only a pass/fail flag) and with removing prompt augmentations. Both strategies reduce the raw agreement between patches and tests, but they also lower the number of instances that ultimately pass t_gold, indicating a trade‑off between overfitting mitigation and overall effectiveness.
RQ3 conducts a “limit study” where the hidden gold tests are revealed to the model during refinement. In this idealized scenario, overfitting drops dramatically (5.8 % for Claude, 11.3 % for GPT‑4o), yet a non‑trivial fraction of patches still fail some regression tests, showing that even perfect acceptance tests do not guarantee that a patch preserves existing functionality.
The paper’s contributions are threefold: (1) quantitative evidence that test overfitting is a pervasive problem for LLM‑based issue resolution; (2) an analysis of test‑driven refinement loops, showing that they can exacerbate overfitting rather than alleviate it; (3) a limit experiment indicating that access to gold tests reduces overfitting but does not eliminate conflicts with regression tests. The authors conclude by cautioning against over‑reliance on automatically generated tests, recommending future work on improving test generation quality, incorporating multi‑test validation (gold + regression), and designing refinement strategies that explicitly penalize regressions. This study sets a baseline for measuring and addressing test overfitting in the next generation of AI‑assisted software engineering tools.