CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions. To address these limitations, we present CVE-Factory, the first multi-agent framework to achieve expert-level quality in automatically transforming sparse CVE metadata into fully executable agentic tasks. Cross-validation against human expert reproductions shows that CVE-Factory achieves 95% solution correctness and 96% environment fidelity, confirming its expert-level quality. It is also evaluated on the latest realistic vulnerabilities and achieves a 66.2% verified success rate. This automation enables two downstream contributions. First, we construct LiveCVEBench, a continuously updated benchmark of 190 tasks spanning 14 languages and 153 repositories that captures emerging threats, including AI-tooling vulnerabilities. Second, we synthesize over 1,000 executable training environments, the first large-scale scaling of agentic tasks in code security. Fine-tuned Qwen3-32B improves from 5.3% to 35.8% on LiveCVEBench, surpassing Claude 4.5 Sonnet, with gains generalizing to Terminal Bench (12.5% to 31.3%). We open-source CVE-Factory, LiveCVEBench, Abacus-cve (fine-tuned model), the training dataset, and a leaderboard. All resources are available at https://github.com/livecvebench/CVE-Factory.


💡 Research Summary

The paper addresses a critical bottleneck in the emerging field of AI‑driven code security: the lack of large‑scale, high‑quality, executable vulnerability tasks for training and evaluating code agents. Existing resources such as the CVE‑List provide only sparse metadata (short descriptions, CWE tags, reference URLs), and manual reproduction of a CVE into a runnable environment typically requires more than ten hours of expert effort. This makes systematic evaluation and scaling of security‑focused agents infeasible, especially as the threat landscape evolves rapidly (e.g., AI‑tool vulnerabilities).

To overcome these limitations, the authors introduce CVE‑Factory, a multi‑agent framework that automatically transforms raw CVE JSON entries into fully executable “agentic tasks” of expert‑level quality. The system decomposes the long‑horizon, complex reproduction process into six stages, grouped into three decoupling stages (Information Collection, File Generation, Environment Construction) and three coupling stages (Vulnerability Verification, Solution Verification, End‑to‑End Verification). Each stage is handled by a specialized agent (Analyzer, Generator, Builder, etc.) that operates with an isolated context; knowledge is transferred between agents via distilled Markdown files rather than shared dialogue histories. This design mitigates the token‑window limits of large language models (LLMs) and prevents irrelevant information from overwhelming later stages.
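The six-stage, context-isolated handoff described above can be sketched as a simple pipeline. This is a minimal illustrative sketch, not the paper's implementation: the `Handoff`/`run_stage`/`run_pipeline` names are assumptions, and the real agents are LLM-driven; only the stage names and the Markdown-handoff idea come from the summary.

```python
from dataclasses import dataclass
from typing import List, Optional

# Stage names as described in the paper; the surrounding API is hypothetical.
STAGES = [
    "Information Collection",
    "File Generation",
    "Environment Construction",
    "Vulnerability Verification",
    "Solution Verification",
    "End-to-End Verification",
]

@dataclass
class Handoff:
    """Distilled Markdown artifact passed between agents instead of shared dialogue history."""
    stage: str
    markdown: str

def run_stage(stage: str, handoff: Optional[Handoff]) -> Handoff:
    # Each agent sees only the previous stage's distilled notes (context isolation),
    # never the full conversation history of earlier agents.
    context = handoff.markdown if handoff else "raw CVE JSON metadata"
    summary = f"# {stage}\nInputs: {context[:40]}..."
    return Handoff(stage=stage, markdown=summary)

def run_pipeline() -> List[Handoff]:
    artifacts: List[Handoff] = []
    handoff: Optional[Handoff] = None
    for stage in STAGES:
        handoff = run_stage(stage, handoff)
        artifacts.append(handoff)
    return artifacts
```

The key design point this mirrors is that each stage emits a small distilled artifact rather than appending to a growing transcript, which is what keeps later agents within LLM context-window limits.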

A central Orchestrator coordinates the workflow. Agents signal continue, error, or pause to the Orchestrator. “Continue” triggers static script validation; “error” aborts an irreproducible CVE; “pause” initiates a feedback loop that routes the problematic artifact back to its creator for revision. The Orchestrator also enforces safety constraints (e.g., sandboxed file system, prohibition of dangerous system commands) and blind‑building rules that keep the Builder from seeing test scripts or reference solutions, thereby avoiding “cheating” by the agents.
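The three control signals map naturally onto a small dispatcher. The sketch below is a toy illustration of the signal semantics described above, assuming hypothetical `orchestrate` and return-value conventions; the paper's Orchestrator is far richer (sandboxing, blind-building rules).

```python
from enum import Enum

class Signal(Enum):
    CONTINUE = "continue"  # stage finished: run static script validation next
    ERROR = "error"        # CVE deemed irreproducible: abort the whole task
    PAUSE = "pause"        # artifact is problematic: route back to its creator

def orchestrate(signal: Signal, artifact: str) -> str:
    """Toy dispatcher mirroring the three control signals; return values are illustrative."""
    if signal is Signal.CONTINUE:
        return f"validate:{artifact}"   # static validation of the produced script
    if signal is Signal.ERROR:
        return "abort"                  # drop the irreproducible CVE
    return f"revise:{artifact}"         # feedback loop: creator agent revises it
```

A usage example: `orchestrate(Signal.PAUSE, "Dockerfile")` returns `"revise:Dockerfile"`, modeling the feedback loop that sends a problematic artifact back to the agent that produced it.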

Evaluation proceeds on two fronts. First, the authors reconstruct 215 expert‑crafted tasks from the PatchEval benchmark using the same initial CVE metadata. CVE‑Factory’s solutions pass 95 % of the expert environments, while 96 % of expert solutions match the automatically generated environments. Moreover, 74 % of the generated tests are judged equivalent or superior to the expert versions, demonstrating that the automated pipeline reaches expert‑level fidelity. Second, the system is applied to 454 CVEs released between May and December 2025. Manual validation confirms a 66.2 % success rate, revealing a growing share of vulnerabilities in AI‑tools such as PyTorch.
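The bidirectional cross-validation above boils down to two pass rates computed in opposite directions. The per-task counts below are illustrative assumptions chosen only to reproduce the reported percentages on 215 tasks; the paper does not give raw counts.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of tasks passing, as a percentage."""
    return 100.0 * passed / total

n_tasks = 215  # expert-crafted PatchEval tasks reconstructed from the same metadata

# Hypothetical counts consistent with the reported 95% / 96% figures.
gen_solutions_pass_expert_env = 204   # generated solution runs in expert environment
expert_solutions_pass_gen_env = 206   # expert solution runs in generated environment

solution_correctness = pass_rate(gen_solutions_pass_expert_env, n_tasks)  # ~95%
environment_fidelity = pass_rate(expert_solutions_pass_gen_env, n_tasks)  # ~96%
```

The point of running the check in both directions is that each side validates a different artifact: the first rate tests the generated solutions, the second tests the generated environments.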

Leveraging this validated pipeline, the authors build LiveCVEBench, a continuously updated benchmark comprising 190 tasks spanning 14 programming languages and 153 open‑source repositories. LiveCVEBench captures emerging threats, including newly discovered AI‑tool and plugin vulnerabilities, and serves as a realistic testbed for security‑oriented code agents.

The authors also demonstrate the scalability of CVE‑Factory by synthesizing over 1,000 executable training environments. Fine‑tuning the open‑source LLM Qwen3‑32B on the distilled trajectories yields a dramatic performance jump: accuracy on LiveCVEBench rises from 5.3 % (zero‑shot) to 35.8 % (≈6.8× improvement), surpassing Claude 4.5 Sonnet. The gains also transfer to the unrelated Terminal Bench (12.5 % → 31.3 %), indicating strong generalization beyond security‑specific tasks.
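As a quick arithmetic check of the improvement factors, using the figures from the abstract (LiveCVEBench: 5.3 % → 35.8 %; Terminal Bench: 12.5 % → 31.3 %):

```python
# Accuracy figures as reported in the abstract (percentages).
zero_shot, fine_tuned = 5.3, 35.8     # LiveCVEBench, before and after fine-tuning
tb_before, tb_after = 12.5, 31.3      # Terminal Bench, before and after fine-tuning

livecve_factor = fine_tuned / zero_shot  # ~6.75, consistent with the quoted ~6.8x
tb_factor = tb_after / tb_before         # ~2.5x transfer gain
```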

Contributions are threefold: (1) CVE‑Factory, a multi‑agent, context‑isolated framework that automatically produces expert‑quality, end‑to‑end vulnerability tasks; (2) LiveCVEBench, a live, multi‑language benchmark reflecting current threat distributions; (3) the first large‑scale synthesis of over a thousand executable security environments, enabling effective fine‑tuning of code agents. All code, data, the fine‑tuned model (named Abacus‑cve), and a public leaderboard are released under an open‑source license at https://github.com/livecvebench/CVE-Factory.

In summary, CVE‑Factory transforms the labor‑intensive, manual CVE reproduction workflow into an automated, scalable pipeline, thereby providing the community with a rich, continuously refreshed corpus of realistic security tasks. This infrastructure paves the way for systematic evaluation, robust training, and rapid advancement of AI agents capable of detecting, reasoning about, and fixing software vulnerabilities at an expert level.

