Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

The evaluation of code-generating Large Language Models (LLMs) is fundamentally constrained by two intertwined challenges: a reliance on static, easily contaminated problem sources and the use of superficial, low-rigor testing. This paper introduces a new benchmark construction philosophy, Dual Scaling, designed to systematically address both limitations. Our approach involves continuously scaling the source of problems from dynamic, real-world code repositories and systematically scaling the rigor of tests via automated, high-coverage Property-Based Testing (PBT). We instantiate this philosophy in CODE2BENCH, an end-to-end framework that leverages Scope Graph analysis for principled dependency classification and a 100% branch coverage quality gate to ensure test suite integrity. Using this framework, we construct CODE2BENCH-2509, a new benchmark suite with native instances in both Python and Java. Our extensive evaluation of 10 state-of-the-art LLMs on CODE2BENCH-2509, powered by a novel “diagnostic fingerprint” visualization, yields three key insights: (1) models exhibit a fundamental performance gap, excelling at API application (Weakly Self-Contained tasks) but struggling with algorithmic synthesis (Self-Contained tasks); (2) a model’s performance is profoundly shaped by the target language’s ecosystem, a nuance we are the first to systematically quantify; and (3) our rigorous, scaled testing is critical in uncovering an “illusion of correctness” prevalent in simpler benchmarks. Our work presents a robust, scalable, and diagnostic paradigm for the next generation of LLM evaluation in software engineering. The code, data, and results are available at https://code2bench.github.io/.


💡 Research Summary

The paper tackles two fundamental shortcomings that currently limit the evaluation of code‑generating large language models (LLMs): (1) the reliance on static problem sets that quickly become contaminated by training data, and (2) the use of superficial testing that can give a false sense of correctness. To address both issues simultaneously, the authors propose a “Dual Scaling” philosophy. Dual Scaling consists of (i) continuously scaling the source of benchmark problems by ingesting fresh, real‑world functions from public code repositories, and (ii) systematically scaling the rigor of evaluation by generating high‑coverage property‑based test suites and enforcing a 100 % branch‑coverage quality gate.
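The idea behind "scaling the rigor" of evaluation can be illustrated with a minimal property-based check. The sketch below is not the paper's test generator; it shows the core PBT idea (assert an invariant over many generated inputs rather than a few hand-picked examples) on a hypothetical toy function. A real framework such as Hypothesis would add input shrinking and smarter generation strategies on top of this loop.

```python
import random

def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Hypothetical function under test: run-length encode a string."""
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def decode(runs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in runs)

def round_trip_holds(s: str) -> bool:
    """Round-trip property: decoding the encoding recovers the input."""
    return decode(run_length_encode(s)) == s

# Minimal property-based loop: exercise the property on many random
# inputs. A single weak example test could pass a buggy implementation;
# scaled random testing is what exposes the "illusion of correctness".
random.seed(0)
for _ in range(1000):
    s = "".join(random.choice("ab") for _ in range(random.randint(0, 20)))
    assert round_trip_holds(s), f"property failed for {s!r}"
```

A 100% branch-coverage gate would additionally require that the generated inputs collectively execute every branch of the function under test (here, both the "extend current run" and "start new run" branches) before the test suite is accepted.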

The authors instantiate this philosophy in a framework called CODE2BENCH. The pipeline first performs temporal filtering: for each model under test, only functions from commits created after the model’s knowledge‑cutoff date are considered, guaranteeing that the benchmark cannot be trivially memorized. Next, a language‑agnostic Scope‑Graph analysis identifies all external references of each candidate function. Based on a predefined whitelist of allowed libraries, functions are automatically classified into two categories: Self‑Contained (SC) – no external dependencies, testing pure algorithmic reasoning; and Weakly Self‑Contained (WSC) – dependencies limited to the whitelist, testing practical API usage. Functions that do not fit either category are discarded.
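The whitelist-based classification step can be approximated in a few lines. This is a simplified sketch, not the paper's Scope-Graph analysis: it inspects only `import` statements of a Python source string against a hypothetical whitelist, whereas the real pipeline resolves every external reference in the scope graph.

```python
import ast

# Hypothetical whitelist of allowed libraries (the paper's actual
# whitelist is not reproduced here).
WHITELIST = {"math", "re", "json"}

def classify(source: str) -> str:
    """Classify a candidate function as SC, WSC, or rejected."""
    tree = ast.parse(source)
    deps: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            deps.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps.add(node.module.split(".")[0])
    if not deps:
        return "SC"        # Self-Contained: pure algorithmic reasoning
    if deps <= WHITELIST:
        return "WSC"       # Weakly Self-Contained: whitelisted APIs only
    return "rejected"      # out-of-scope external dependencies
```

For example, a function importing only `math` would be classified WSC, while one importing `numpy` (not on this toy whitelist) would be discarded.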

After classification, the pipeline applies additional static checks: control‑flow‑graph analysis removes functions without a verifiable output, and cyclomatic complexity (CC) filtering keeps tasks within a moderate difficulty range.
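The complexity filter can be sketched with a McCabe-style estimate over the Python AST. This is an illustrative approximation; the paper's exact counting rules and CC bounds are not specified here.

```python
import ast

# Node types treated as decision points in this simplified estimate.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    """McCabe-style estimate: CC = 1 + number of decision points."""
    tree = ast.parse(source)
    decisions = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.BoolOp):
            # each extra and/or operand adds a short-circuit branch
            decisions += len(node.values) - 1
        elif isinstance(node, _BRANCH_NODES):
            decisions += 1
    return 1 + decisions

def in_moderate_range(source: str, lo: int = 2, hi: int = 10) -> bool:
    """Hypothetical difficulty gate; the paper's actual bounds may differ."""
    return lo <= cyclomatic_complexity(source) <= hi
```

A straight-line function scores 1 and would be filtered out as trivial under these illustrative bounds, while deeply nested branching logic would exceed the upper bound.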

