LLM-Based Automated Vulnerability Patching: An Empirical Evaluation and Complementarity Analysis on Real and Artificial Vulnerabilities
📝 Abstract
Automated vulnerability patching is crucial for software security, and recent advancements in Large Language Models (LLMs) present promising capabilities for automating this task. However, existing research has primarily assessed LLMs using publicly disclosed vulnerabilities, leaving their effectiveness on related artificial vulnerabilities largely unexplored. In this study, we empirically evaluate the patching effectiveness and complementarity of several prominent LLMs, such as OpenAI’s GPT variants, LLaMA, DeepSeek, and Mistral models, using both real and artificial vulnerabilities. Our evaluation employs Proof-of-Vulnerability (PoV) test execution to concretely assess whether LLM-generated source code successfully patches vulnerabilities. Our results reveal that LLMs patch real vulnerabilities more effectively than artificial ones. Additionally, our analysis reveals significant variability across LLMs in terms of overlapping (multiple LLMs patching the same vulnerabilities) and complementarity (vulnerabilities patched exclusively by a single LLM), emphasizing the importance of selecting appropriate LLMs for effective vulnerability patching.
📄 Content
Pre-print – Extended version of the poster paper accepted at the 41st ACM/SIGAPP Symposium on Applied Computing (SAC), Smarter Engineering – Building AI and Building with AI (SEAI) track, 2026.

Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities

Aayush Garg (aayush.garg@list.lu), Zanis Ali Khan (zanis-ali.khan@list.lu), Renzo Degiovanni (renzo.degiovanni@list.lu), Qiang Tang (qiang.tang@list.lu)
Luxembourg Institute of Science and Technology, Luxembourg
1 Introduction

Automated vulnerability patching has become increasingly significant in software security due to the escalating number and complexity of software vulnerabilities discovered annually. Recently, Large Language Models (LLMs), including OpenAI’s GPT variants, LLaMA, DeepSeek, and Mistral, have emerged as promising tools for automating vulnerability patching. These models have shown notable capability in generating syntactically and semantically coherent software patches [10, 43]. However, the effectiveness of these models in patching vulnerabilities, particularly across diverse vulnerability types and their variants, remains understudied.

Previous studies evaluating LLM-based patching have primarily focused on known, i.e., publicly disclosed, vulnerabilities sourced from public vulnerability databases [2, 21, 32] or real-world software repositories [23, 24, 36]. These evaluations typically consider syntactic correctness or employ similarity-based metrics such as CodeBLEU [38] and CrystalBLEU [8]. While these metrics provide insights into patch quality, they do not directly indicate whether the generated patches effectively eliminate vulnerabilities [25, 29, 30].

Moreover, a significant limitation in the existing literature is the scarcity of research investigating whether LLMs can generalize their vulnerability patching capabilities beyond these known vulnerabilities [41]. Garg et al. [15] recently augmented a vulnerability dataset with their generated artificial vulnerabilities. We considered these artificial cases in our study to augment our evaluation and to test LLMs’ ability to patch both real and artificial vulnerabilities. This enabled us to assess the generalizability and robustness of LLM-based vulnerability patching [46].
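The limitation noted above — similarity metrics do not show whether a patch actually eliminates the vulnerability — motivates the paper's PoV-based evaluation: apply the LLM-generated code and re-run the Proof-of-Vulnerability test. The following is a minimal sketch of such a check, not the authors' harness; the function name, the per-project `build_cmd`/`pov_cmd`, and the convention that the PoV test exits 0 when the vulnerability no longer reproduces are all assumptions for illustration.

```python
import subprocess
from pathlib import Path

def pov_validates_patch(repo: Path, target_file: Path, patch: str,
                        build_cmd: list, pov_cmd: list) -> bool:
    """Apply an LLM-generated patch and re-run the PoV test.

    A patch counts as successful only if the project still builds and
    the PoV test no longer triggers the vulnerability (assumed here to
    mean the PoV command exits with status 0).
    """
    original = target_file.read_text()
    try:
        target_file.write_text(patch)  # overwrite the vulnerable source
        build = subprocess.run(build_cmd, cwd=repo, capture_output=True)
        if build.returncode != 0:
            return False               # patch broke the build
        pov = subprocess.run(pov_cmd, cwd=repo, capture_output=True)
        return pov.returncode == 0     # 0 => vulnerability not reproduced
    finally:
        target_file.write_text(original)  # always restore the original
```

In practice each vulnerability would carry its own build and PoV commands; the restore step keeps evaluations of different LLMs independent of one another.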
Our paper empirically investigates the effectiveness and complementarity of several prominent LLMs in automated vulnerability patching, specifically focusing on both real vulnerabilities and their corresponding artificial vulnerabilities. We structure our investigation around two research questions (RQs):

RQ1: How effective are LLMs in patching real vulnerabilities vs. artificial vulnerabilities?

RQ1 investigates the extent to which LLMs can successfully patch real vulnerabilities compared to their artificial counterparts, quantitatively comparing their effectiveness across these vulnerability categories.

RQ2: How complementary and overlapping are LLMs in patching real vulnerabilities vs. artificial vulnerabilities?

RQ2 explores how frequently multiple LLMs successfully patch the same vulnerabilities (overlapping), and how often vulnerabilities are successfully patched exclusively by a single LLM (complementary). By evaluating this distribution across real and artificial vulnerabilities, we assess whether different LLMs consistently patch the same vulnerabilities or contribute uniquely by patching distinct ones.

To perform this evaluation, we employed 15 real vulnerabilities and their 41 artificial counterparts. We applied a uniform prompting strategy across all evaluated LLMs to generate patches. Evaluation of th
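The overlap and complementarity notions behind RQ2 reduce to set operations over per-LLM patch results. A small sketch, assuming each LLM is mapped to the set of vulnerability IDs it successfully patched (the function and the sample IDs are illustrative, not from the paper):

```python
def overlap_and_complementarity(results):
    """Given {llm_name: set of patched vulnerability IDs}, return
    (overlapping, complementary): vulnerabilities patched by more than
    one LLM, and per-LLM vulnerabilities patched by that LLM alone."""
    all_patched = set().union(*results.values())
    # How many LLMs patched each vulnerability.
    counts = {v: sum(v in s for s in results.values()) for v in all_patched}
    overlapping = {v for v, c in counts.items() if c > 1}
    complementary = {
        llm: {v for v in patched if counts[v] == 1}
        for llm, patched in results.items()
    }
    return overlapping, complementary

# Illustrative usage with made-up results:
overlap, comp = overlap_and_complementarity({
    "gpt":     {"V1", "V2"},
    "llama":   {"V2", "V3"},
    "mistral": {"V2"},
})
# V2 is overlapping; V1 and V3 are each complementary to one LLM.
```

A high complementary count across models is what motivates the paper's point about choosing (or combining) LLMs rather than relying on a single one.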