📝 Original Info
- Title: AutoBaxBuilder: Bootstrapping Code Security Benchmarking
- ArXiv ID: 2512.21132
- Date: 2025-12-24
- Authors: Tobias von Arx, Niels Mündler, Mark Vero, Maximilian Baader, Martin Vechev (ETH Zurich; Snyk; INSAIT, Sofia University "St. Kliment Ohridski")
📝 Abstract
As LLMs see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work has demonstrated that security is often overlooked, exposing that LLMs are prone to generating code with security vulnerabilities. These insights were enabled by specialized benchmarks, crafted through significant manual effort by security experts. However, relying on manually-crafted benchmarks is insufficient in the long term, because benchmarks (i) naturally end up contaminating training data, (ii) must extend to new tasks to provide a more complete picture, and (iii) must increase in difficulty to challenge more capable LLMs. In this work, we address these challenges and present AutoBaxBuilder, a framework that generates tasks and tests for code security benchmarking from scratch. We introduce a robust pipeline with fine-grained plausibility checks, leveraging the code understanding capabilities of LLMs to construct functionality tests and end-to-end security-probing exploits. To confirm the quality of the generated benchmark, we conduct both a qualitative analysis and perform quantitative experiments, comparing it against tasks constructed by human experts. We use AutoBaxBuilder to construct entirely new tasks and release them to the public as AutoBaxBench, together with a thorough evaluation of the security capabilities of LLMs on these tasks. We find that a new task can be generated in under 2 hours, costing less than USD 10.
📄 Full Content
Preprint
AUTOBAXBUILDER: BOOTSTRAPPING CODE SECURITY BENCHMARKING
Tobias von Arx1, Niels Mündler1, Mark Vero1, Maximilian Baader1,2, Martin Vechev1,3
1 ETH Zurich
2 Snyk
3 INSAIT, Sofia University "St. Kliment Ohridski"
tvonarx@student.ethz.ch, maximilian.baader@snyk.io
{niels.muendler,mark.vero,martin.vechev}@inf.ethz.ch
https://baxbench.com/autobaxbuilder
https://github.com/eth-sri/autobaxbuilder
ABSTRACT
As LLMs see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work has demonstrated that security is often overlooked, exposing that LLMs are prone to generating code with security vulnerabilities. These insights were enabled by specialized benchmarks, crafted through significant manual effort by security experts. However, relying on manually-crafted benchmarks is insufficient in the long term, because benchmarks (i) naturally end up contaminating training data, (ii) must extend to new tasks to provide a more complete picture, and (iii) must increase in difficulty to challenge more capable LLMs. In this work, we address these challenges and present AUTOBAXBUILDER, a framework that generates tasks and tests for code security benchmarking from scratch. We introduce a robust pipeline with fine-grained plausibility checks, leveraging the code understanding capabilities of LLMs to construct functionality tests and end-to-end security-probing exploits. To confirm the quality of the generated benchmark, we conduct both a qualitative analysis and perform quantitative experiments, comparing it against tasks constructed by human experts. We use AUTOBAXBUILDER to construct entirely new tasks and release them to the public as AUTOBAXBENCH, together with a thorough evaluation of the security capabilities of LLMs on these tasks. We find that a new task can be generated in under 2 hours, costing less than USD 10.
1 INTRODUCTION
With the ever-increasing capabilities of large language models to generate functionally correct code, the prevalence of LLM-generated code in real-world applications is rapidly rising. However, this raises concerns about the security of the deployed code. Crucially, a single vulnerability leaking into production could compromise an entire system. As such, it is essential to accurately assess the secure coding capabilities of LLM-based code generation. This is particularly important in safety-critical domains such as web application backends, as these are directly exposed to malicious actors.
Shortcomings of current evaluation. Current evaluation methods for LLM-based code generation often fall short, either evaluating correctness and security on different tasks (Pearce et al., 2022; He et al., 2024) or considering only function-level correctness and security (Yang et al., 2024; Peng et al., 2025). Vero et al. (2025) proposed BAXBENCH, a rigorous evaluation framework that detects critical vulnerabilities by executing end-to-end exploits and assesses correctness via tests. This provides a guaranteed upper bound for both the security and functional correctness of generated code, as the approach does not suffer from false positives. Their evaluation exposed critical and surprising shortcomings in the secure coding capabilities of all evaluated state-of-the-art LLMs.
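To illustrate what such an end-to-end check can look like in practice, the sketch below probes a running backend over HTTP: a black-box functional test (double reservations must be rejected, as in Figure 1) and a CWE-89 SQL injection exploit attempt, with a solution counted as secure only if the tests pass and the exploit fails. The URL, the /reserve endpoint, the status-code conventions, and all helper names are illustrative assumptions; this is not BAXBENCH's actual harness.

```python
import requests  # assumed dependency; the real harness may differ

APP_URL = "http://localhost:8000"  # hypothetical address of the generated backend

def run_functional_tests() -> bool:
    """Black-box check from the task description: double reservations must be rejected."""
    first = requests.post(f"{APP_URL}/reserve", json={"slot_id": 1, "user": "alice"})
    second = requests.post(f"{APP_URL}/reserve", json={"slot_id": 1, "user": "bob"})
    return first.ok and not second.ok

def sql_injection_exploit_succeeds() -> bool:
    """CWE-89 probe: inject through a request field and check whether service state was corrupted."""
    requests.post(
        f"{APP_URL}/reserve",
        json={"slot_id": "1; DROP TABLE reservations; --", "user": "mallory"},
    )
    # If the injected statement executed, a benign follow-up request now fails server-side.
    probe = requests.post(f"{APP_URL}/reserve", json={"slot_id": 2, "user": "alice"})
    return probe.status_code >= 500

def evaluate_solution() -> dict:
    functional = run_functional_tests()
    exploited = sql_injection_exploit_succeeds()
    # No false positives: a concrete exploit must actually succeed to flag the code as insecure.
    return {"functionally_correct": functional, "secure": functional and not exploited}
```

Because insecurity is only flagged when a concrete exploit succeeds against the running service, this evaluation style yields the upper-bound guarantee described above.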
However, developing comprehensive benchmarks such as BAXBENCH requires significant human effort, not only to develop and assess scenarios and functional tests but also to discover security vulnerabilities and write scripts that reliably exploit them. This poses a key challenge to the longevity of such efforts: An ideal benchmark should be upgraded with more difficult scenarios for more capable LLMs, and

Figure 1: Overview of our method. The LLM-based pipeline starts from scratch and produces a complete benchmark instance with scenario description ①, test cases ②, and end-to-end exploits ③. After generating a novel scenario description, the LLM generates functional tests and solutions, iterating until execution feedback confirms that the tests are correct. Next, the LLM designs end-to-end exploits to expose vulnerabilities, iterating until it finds a pair of solutions, one on which the exploit succeeds and one on which it fails. The results are combined into a new task instance.
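The caption describes two refinement loops: tests and reference solutions are co-refined until execution feedback confirms the tests, and each exploit is refined until it is precise, i.e., it succeeds on an insecure solution and fails on a secure one. A minimal sketch of these loops is shown below; the helper callables stand in for LLM calls and sandboxed execution and are assumptions for illustration, not the paper's actual interface.

```python
from typing import Callable, Tuple

# Illustrative sketch of the two refinement loops; all helper callables are
# hypothetical stand-ins for LLM-backed generation/repair and sandboxed execution.

def refine_until_tests_pass(
    generate: Callable[[], Tuple[str, str]],        # -> (functional tests, reference solution)
    all_tests_pass: Callable[[str, str], bool],     # sandboxed execution feedback
    repair: Callable[[str, str], Tuple[str, str]],  # fix tests and/or solution from feedback
    max_rounds: int = 5,                            # assumed iteration budget
) -> Tuple[str, str]:
    tests, solution = generate()
    for _ in range(max_rounds):
        if all_tests_pass(tests, solution):
            return tests, solution
        tests, solution = repair(tests, solution)
    raise RuntimeError("discard task: tests never confirmed by execution feedback")

def refine_exploit_until_precise(
    generate_exploit: Callable[[], str],
    exploit_succeeds: Callable[[str, str], bool],   # run the exploit end-to-end against a solution
    insecure_solution: str,
    secure_solution: str,
    repair_exploit: Callable[[str], str],
    max_rounds: int = 5,
) -> str:
    exploit = generate_exploit()
    for _ in range(max_rounds):
        # A precise exploit succeeds on the insecure solution and fails on the secure one.
        if exploit_succeeds(exploit, insecure_solution) and not exploit_succeeds(exploit, secure_solution):
            return exploit
        exploit = repair_exploit(exploit)
    raise RuntimeError("discard exploit: could not separate secure from insecure solution")
```

Requiring the exploit to pass on the insecure solution but fail on the secure one is what makes the generated security tests discriminative rather than trivially triggering on any implementation.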
…(Full text truncated)…
Reference: This content is AI-processed based on ArXiv data.