Patch-to-PoC: A Systematic Study of Agentic LLM Systems for Linux Kernel N-Day Reproduction
Autonomous large language model (LLM) based systems have recently shown promising results across a range of cybersecurity tasks. However, there is no systematic study on their effectiveness in autonomously reproducing Linux kernel vulnerabilities with concrete proofs-of-concept (PoCs). Owing to the size, complexity, and low-level nature of the Linux kernel, such tasks are widely regarded as particularly challenging for current LLM-based approaches. In this paper, we present the first large-scale study of LLM-based Linux kernel vulnerability reproduction. For this purpose, we develop K-Repro, an LLM-based agentic system equipped with controlled code-browsing, virtual machine management, interaction, and debugging capabilities. Using kernel security patches as input, K-Repro automates end-to-end bug reproduction of N-day vulnerabilities in the Linux kernel. On a dataset of 100 real-world exploitable Linux kernel vulnerabilities collected from KernelCTF, our results show that K-Repro can generate PoCs that reproduce over 50% of the cases with practical time and monetary cost. Beyond aggregate success rates, we perform an extensive study of effectiveness, efficiency, stability, and impact factors to explain when agentic reproduction succeeds, where it fails, and which components drive performance. These findings provide actionable guidance for building more reliable autonomous security agents and for assessing real-world N-day risk from both offensive and defensive perspectives.
💡 Research Summary
This paper presents the first large‑scale systematic evaluation of autonomous large language model (LLM) agents for reproducing Linux kernel vulnerabilities and generating working proofs‑of‑concept (PoCs). The authors built K‑Repro, an “agentic” system that couples a state‑of‑the‑art LLM (GPT‑5.1 Codex XHigh) with a compact but expressive toolset: code‑browsing utilities, virtual‑machine (VM) lifecycle management, VM console interaction, and low‑level GDB debugging. Given only a security‑patch commit identifier, K‑Repro automatically checks out the kernel source, builds a vulnerable kernel image (the commit prior to the patch), boots it inside a QEMU VM, and then iteratively performs static analysis, hypothesis generation, PoC synthesis, dynamic validation, and debugging until a crash is observed. All tool invocations and VM logs are recorded for full auditability.
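The iterative workflow described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not K‑Repro's implementation: `propose_poc` and `run_in_vm` are hypothetical callables standing in for the LLM and the VM tooling, the `CRASH_MARKERS` list is an illustrative crash check rather than the paper's oracle, and `max_attempts` is an added safety bound. The only detail taken directly from the text is that the vulnerable kernel is built from the commit prior to the patch, written here with git's `<commit>^` parent syntax.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

# Illustrative substrings that commonly appear in kernel crash logs.
# This list is an assumption, not K-Repro's actual crash oracle.
CRASH_MARKERS = (
    "BUG: KASAN",
    "kernel BUG at",
    "general protection fault",
    "Kernel panic - not syncing",
)


@dataclass
class Attempt:
    poc_source: str
    console_log: str
    crashed: bool


@dataclass
class ReproResult:
    vulnerable_rev: str
    success: bool = False
    attempts: List[Attempt] = field(default_factory=list)


def looks_like_crash(console_log: str) -> bool:
    """Return True if the console log contains any known crash marker."""
    return any(marker in console_log for marker in CRASH_MARKERS)


def reproduce(
    patch_commit: str,
    propose_poc: Callable[[str, str], str],  # hypothetical: (revision, feedback) -> PoC source
    run_in_vm: Callable[[str, str], str],    # hypothetical: (revision, PoC source) -> console log
    budget_seconds: float,
    max_attempts: int = 50,
    clock: Callable[[], float] = time.monotonic,
) -> ReproResult:
    """Propose, run, and refine PoCs until a crash appears or the budget ends."""
    # The vulnerable kernel is the parent of the patch commit.
    vulnerable_rev = f"{patch_commit}^"
    result = ReproResult(vulnerable_rev=vulnerable_rev)
    deadline = clock() + budget_seconds
    feedback = ""  # console output of the previous failed attempt

    while clock() < deadline and len(result.attempts) < max_attempts:
        poc = propose_poc(vulnerable_rev, feedback)
        log = run_in_vm(vulnerable_rev, poc)
        crashed = looks_like_crash(log)
        # Every attempt is recorded, mirroring the auditability goal.
        result.attempts.append(Attempt(poc, log, crashed))
        if crashed:
            result.success = True
            break
        feedback = log  # feed the failed run back into the next proposal

    return result
```

Passing the two tools in as callables keeps the loop testable without a kernel build or a QEMU instance; a real system would replace them with the code‑browsing, VM, and debugger integrations the paper describes.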
The evaluation uses 100 real‑world exploitable kernel bugs collected from KernelCTF. Each case is given a 10‑hour time budget with no monetary limit. K‑Repro generates PoCs that trigger a kernel‑level crash in 52% of the cases (the abstract states "over 50%"), with an average reproduction time of 3.2 hours. By contrast, the state‑of‑the‑art directed grey‑box fuzzing system SyzDirect reproduces only 38% of the same bugs and requires about 7.9 hours on average. Rerunning the experiments with the medium‑reasoning Codex configuration lowers the success rate to 35%, which indicates that stronger reasoning settings contribute substantially to the result.
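To make the comparison concrete, the short sketch below recomputes the gaps implied by the figures quoted in this summary. The numbers are copied from the paragraph above; the `Outcome` class and helper names are hypothetical, and the summary does not say whether the average times cover all runs or only successful ones, so the time ratio should be read as indicative only.

```python
from dataclasses import dataclass
from typing import Optional

DATASET_SIZE = 100  # number of KernelCTF cases in the evaluation


@dataclass(frozen=True)
class Outcome:
    name: str
    success_rate_pct: float
    mean_hours: Optional[float] = None  # None when the summary gives no time

    def reproduced_cases(self, total: int = DATASET_SIZE) -> int:
        """Convert a percentage into a case count for the given dataset size."""
        return round(self.success_rate_pct / 100 * total)


# Figures as quoted in the summary text.
k_repro = Outcome("K-Repro (high reasoning)", 52.0, 3.2)
syzdirect = Outcome("SyzDirect", 38.0, 7.9)
k_repro_medium = Outcome("K-Repro (medium reasoning)", 35.0)


def point_gap(a: Outcome, b: Outcome) -> float:
    """Absolute difference in success rate, in percentage points."""
    return a.success_rate_pct - b.success_rate_pct


def relative_gain(a: Outcome, b: Outcome) -> float:
    """Relative improvement of a over b, as a fraction of b's rate."""
    return a.success_rate_pct / b.success_rate_pct - 1


def time_ratio(slow: Outcome, fast: Outcome) -> float:
    """How many times longer `slow` takes on average than `fast`."""
    if slow.mean_hours is None or fast.mean_hours is None:
        raise ValueError("both outcomes need a mean time")
    return slow.mean_hours / fast.mean_hours


if __name__ == "__main__":
    print(f"gap vs SyzDirect: {point_gap(k_repro, syzdirect):.0f} points")
    print(f"relative gain:    {relative_gain(k_repro, syzdirect):.1%}")
    print(f"time ratio:       {time_ratio(syzdirect, k_repro):.2f}x")
    print(f"gap vs medium:    {point_gap(k_repro, k_repro_medium):.0f} points")
```

Keeping percentage points and relative gain as separate functions avoids conflating the two, which is a common source of confusion when success rates are compared.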
A detailed factor analysis shows that (1) the code‑browsing tools dramatically reduce context overload by letting the LLM retrieve only the symbols and snippets relevant to the patch, (2) VM snapshot‑based restarts provide deterministic starting points and enable rapid recovery from hangs, and (3) debugging tools that expose breakpoints and register state allow the agent to pinpoint the exact crash location and refine its PoC. Failure cases are primarily due to (a) complex concurrency bugs where the agent cannot generate sufficient thread interleavings, (b) incomplete system‑call sequences that fail to reach the vulnerable state, and (c) occasional attempts to access the Internet despite explicit prompt constraints, which disrupt the workflow.
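One way to picture the "retrieve only what is relevant to the patch" idea is to derive a list of files and enclosing functions from the patch itself and browse only those. The sketch below is an assumption about how such a focus list could be built, not the paper's code‑browsing tool: it reads the `+++` file headers and the function context that git prints after each `@@ ... @@` hunk header in a unified diff. The sample patch, its path, and its function name are made up for illustration.

```python
import re
from typing import Dict, List, Optional

# Unified-diff hunk header, e.g. "@@ -10,6 +10,8 @@ static int foo(void)".
# The text after the second "@@" is the enclosing-function context git adds.
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@ ?(.*)$")


def patch_focus(diff_text: str) -> Dict[str, List[str]]:
    """Map each file touched by a patch to the function contexts of its hunks.

    Simplification: an added source line that itself begins with "++ " would
    be mistaken for a file header; a robust parser would track hunk lengths.
    """
    focus: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            path = line[4:].strip()
            if path == "/dev/null":  # file deleted by the patch
                current = None
            else:
                current = path[2:] if path.startswith("b/") else path
                focus.setdefault(current, [])
            continue
        match = HUNK_RE.match(line)
        if match and current is not None:
            context = match.group(1).strip()
            if context and context not in focus[current]:
                focus[current].append(context)
    return focus


# Hypothetical patch used only to exercise the parser.
SAMPLE_DIFF = """\
diff --git a/net/example/foo.c b/net/example/foo.c
--- a/net/example/foo.c
+++ b/net/example/foo.c
@@ -10,6 +10,8 @@ static int foo_release(struct foo *f)
     struct bar *b = f->bar;
+    if (!b)
+        return -EINVAL;
     kfree(b);
"""

if __name__ == "__main__":
    for path, functions in patch_focus(SAMPLE_DIFF).items():
        print(path, functions)
```

An agent could use such a map as the starting set of symbols to look up, expanding outward to callers and callees only when a hypothesis requires it.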
Ablation studies confirm that removing the debugging suite reduces success by 12 percentage points, while simplifying the high‑level prompts costs another 8 points, underscoring the critical role of both tool availability and prompt engineering. The authors also discuss design principles for future autonomous security agents: provide focused, low‑overhead tooling; embed explicit high‑level reasoning directives; and enforce deterministic environments to enable reproducible experimentation.
In summary, K‑Repro demonstrates that a well‑orchestrated combination of LLM reasoning and domain‑specific tooling can reproduce roughly half of the evaluated Linux kernel vulnerabilities within a practical time budget, outperforming the SyzDirect directed‑fuzzing baseline, and opens a path toward more capable autonomous exploit‑development and vulnerability‑assessment systems.