📝 Original Info
- Title: AutoBaxBuilder: Bootstrapping Code Security Benchmarking
- ArXiv ID: 2512.21132
- Date: 2025-12-24
- Authors: Tobias von Arx, Niels Mündler, Mark Vero, Maximilian Baader, Martin Vechev (ETH Zurich; Snyk; INSAIT, Sofia University "St. Kliment Ohridski")
📝 Abstract
As LLMs see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work has demonstrated that security is often overlooked, exposing that LLMs are prone to generating code with security vulnerabilities. These insights were enabled by specialized benchmarks, crafted through significant manual effort by security experts. However, relying on manually-crafted benchmarks is insufficient in the long term, because benchmarks (i) naturally end up contaminating training data, (ii) must extend to new tasks to provide a more complete picture, and (iii) must increase in difficulty to challenge more capable LLMs. In this work, we address these challenges and present AutoBaxBuilder, a framework that generates tasks and tests for code security benchmarking from scratch. We introduce a robust pipeline with fine-grained plausibility checks, leveraging the code understanding capabilities of LLMs to construct functionality tests and end-to-end security-probing exploits. To confirm the quality of the generated benchmark, we conduct both a qualitative analysis and perform quantitative experiments, comparing it against tasks constructed by human experts. We use AutoBaxBuilder to construct entirely new tasks and release them to the public as AutoBaxBench, together with a thorough evaluation of the security capabilities of LLMs on these tasks. We find that a new task can be generated in under 2 hours, costing less than USD 10.
📄 Full Content
Preprint
AUTOBAXBUILDER: BOOTSTRAPPING CODE SECURITY BENCHMARKING
Tobias von Arx1, Niels Mündler1, Mark Vero1, Maximilian Baader1,2, Martin Vechev1,3
1 ETH Zurich
2 Snyk
3 INSAIT, Sofia University "St. Kliment Ohridski"
tvonarx@student.ethz.ch, maximilian.baader@snyk.io
{niels.muendler,mark.vero,martin.vechev}@inf.ethz.ch
https://baxbench.com/autobaxbuilder
https://github.com/eth-sri/autobaxbuilder
ABSTRACT
As LLMs see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work has demonstrated that security is often overlooked, exposing that LLMs are prone to generating code with security vulnerabilities. These insights were enabled by specialized benchmarks, crafted through significant manual effort by security experts. However, relying on manually-crafted benchmarks is insufficient in the long term, because benchmarks (i) naturally end up contaminating training data, (ii) must extend to new tasks to provide a more complete picture, and (iii) must increase in difficulty to challenge more capable LLMs. In this work, we address these challenges and present AUTOBAXBUILDER, a framework that generates tasks and tests for code security benchmarking from scratch. We introduce a robust pipeline with fine-grained plausibility checks, leveraging the code understanding capabilities of LLMs to construct functionality tests and end-to-end security-probing exploits. To confirm the quality of the generated benchmark, we conduct both a qualitative analysis and perform quantitative experiments, comparing it against tasks constructed by human experts. We use AUTOBAXBUILDER to construct entirely new tasks and release them to the public as AUTOBAXBENCH, together with a thorough evaluation of the security capabilities of LLMs on these tasks. We find that a new task can be generated in under 2 hours, costing less than USD 10.
1 INTRODUCTION
With the ever-increasing capabilities of large language models to generate functionally correct code, the prevalence of LLM-generated code in real-world applications is rapidly rising. However, this raises concerns about the security of the deployed code. Crucially, a single vulnerability leaking into production could compromise an entire system. As such, it is essential to accurately assess the secure coding capabilities of LLM-based code generation. This is particularly important in safety-critical domains such as web application backends, as these are directly exposed to malicious actors.
Shortcomings of current evaluation. Current evaluation methods for LLM-based code generation often fall short, either evaluating correctness and security on different tasks (Pearce et al., 2022; He et al., 2024) or considering only function-level correctness and security (Yang et al., 2024; Peng et al., 2025). Vero et al. (2025) proposed BAXBENCH, a rigorous evaluation framework that detects critical vulnerabilities by executing end-to-end exploits and assesses correctness via tests. This provides a guaranteed upper bound for both the security and functional correctness of generated code, as the approach does not suffer from false positives. Their evaluation exposed critical and surprising shortcomings in the secure coding capabilities of all evaluated state-of-the-art LLMs.
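To illustrate what such an end-to-end check can look like in practice, the sketch below probes a running backend over HTTP: a black-box functional test (double reservations must be rejected, as in Figure 1) and a CWE-89 SQL injection exploit attempt, with a solution counted as secure only if the tests pass and the exploit fails. The URL, the /reserve endpoint, the status-code conventions, and all helper names are illustrative assumptions; this is not BAXBENCH's actual harness.

```python
import requests  # assumed dependency; the real harness may differ

APP_URL = "http://localhost:8000"  # hypothetical address of the generated backend

def run_functional_tests() -> bool:
    """Black-box check from the task description: double reservations must be rejected."""
    first = requests.post(f"{APP_URL}/reserve", json={"slot_id": 1, "user": "alice"})
    second = requests.post(f"{APP_URL}/reserve", json={"slot_id": 1, "user": "bob"})
    return first.ok and not second.ok

def sql_injection_exploit_succeeds() -> bool:
    """CWE-89 probe: inject through a request field and check whether service state was corrupted."""
    requests.post(
        f"{APP_URL}/reserve",
        json={"slot_id": "1; DROP TABLE reservations; --", "user": "mallory"},
    )
    # If the injected statement executed, a benign follow-up request now fails server-side.
    probe = requests.post(f"{APP_URL}/reserve", json={"slot_id": 2, "user": "alice"})
    return probe.status_code >= 500

def evaluate_solution() -> dict:
    functional = run_functional_tests()
    exploited = sql_injection_exploit_succeeds()
    # No false positives: a concrete exploit must actually succeed to flag the code as insecure.
    return {"functionally_correct": functional, "secure": functional and not exploited}
```

Because insecurity is only flagged when a concrete exploit succeeds against the running service, this evaluation style yields the upper-bound guarantee described above.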
However, developing comprehensive benchmarks such as BAXBENCH requires significant human effort, not only to develop and assess scenarios and functional tests but also to discover security vulnerabilities and write scripts that reliably exploit them. This poses a key challenge to the longevity of such efforts: An ideal benchmark should be upgraded with more difficult scenarios for more capable LLMs, and

Figure 1: Overview of our method. The LLM-based pipeline starts from scratch and produces a complete benchmark instance with scenario description ①, test cases ②, and end-to-end exploits ③. After generating a novel scenario description, the LLM generates functional tests and solutions, iterating until execution feedback confirms that the tests are correct. Next, the LLM designs end-to-end exploits to expose vulnerabilities, iterating until it finds a pair of solutions, one on which the exploit succeeds and one on which it fails. The results are combined into a new task instance.
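The caption describes two refinement loops: tests and reference solutions are co-refined until execution feedback confirms the tests, and each exploit is refined until it is precise, i.e., it succeeds on an insecure solution and fails on a secure one. A minimal sketch of these loops is shown below; the helper callables stand in for LLM calls and sandboxed execution and are assumptions for illustration, not the paper's actual interface.

```python
from typing import Callable, Tuple

# Illustrative sketch of the two refinement loops; all helper callables are
# hypothetical stand-ins for LLM-backed generation/repair and sandboxed execution.

def refine_until_tests_pass(
    generate: Callable[[], Tuple[str, str]],        # -> (functional tests, reference solution)
    all_tests_pass: Callable[[str, str], bool],     # sandboxed execution feedback
    repair: Callable[[str, str], Tuple[str, str]],  # fix tests and/or solution from feedback
    max_rounds: int = 5,                            # assumed iteration budget
) -> Tuple[str, str]:
    tests, solution = generate()
    for _ in range(max_rounds):
        if all_tests_pass(tests, solution):
            return tests, solution
        tests, solution = repair(tests, solution)
    raise RuntimeError("discard task: tests never confirmed by execution feedback")

def refine_exploit_until_precise(
    generate_exploit: Callable[[], str],
    exploit_succeeds: Callable[[str, str], bool],   # run the exploit end-to-end against a solution
    insecure_solution: str,
    secure_solution: str,
    repair_exploit: Callable[[str], str],
    max_rounds: int = 5,
) -> str:
    exploit = generate_exploit()
    for _ in range(max_rounds):
        # A precise exploit succeeds on the insecure solution and fails on the secure one.
        if exploit_succeeds(exploit, insecure_solution) and not exploit_succeeds(exploit, secure_solution):
            return exploit
        exploit = repair_exploit(exploit)
    raise RuntimeError("discard exploit: could not separate secure from insecure solution")
```

Requiring the exploit to pass on the insecure solution but fail on the secure one is what makes the generated security tests discriminative rather than trivially triggering on any implementation.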
…(Full text truncated)…
Reference: This content is AI-processed based on ArXiv data.