Title: An Empirical Evaluation of LLM-Based Approaches for Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems
ArXiv ID: 2601.00254
Date: 2026-01-01
Authors: Md Hasan Saju, Maher Muhtadi, Akramul Azim
📝 Abstract
The rapid advancement of Large Language Models (LLMs) presents new opportunities for automated software vulnerability detection, a crucial task in securing modern codebases. This paper presents a comparative study on the effectiveness of LLM-based techniques for detecting software vulnerabilities. The study evaluates three approaches, Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and a Dual-Agent LLM framework, against a baseline LLM model. A curated dataset was compiled from Big-Vul [1] and real-world code repositories from GitHub, focusing on five critical Common Weakness Enumeration (CWE) categories: CWE-119, CWE-399, CWE-264, CWE-20, and CWE-200. Our RAG approach, which integrated external domain knowledge from the internet and the MITRE CWE database, achieved the highest overall accuracy (0.86) and F1 score (0.85), highlighting the value of contextual augmentation. Our SFT approach, implemented using parameter-efficient QLoRA adapters, also demonstrated strong performance. Our Dual-Agent system, an architecture in which a secondary agent audits and refines the output of the first, showed promise in improving reasoning transparency and error mitigation, with reduced resource overhead. These results emphasize that incorporating a domain expertise mechanism significantly strengthens the practical applicability of LLMs in real-world vulnerability detection tasks.
📄 Full Content
An Empirical Evaluation of LLM-Based Approaches for Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems
Md Hasan Saju, Maher Muhtadi, Akramul Azim
Department of Electrical, Computer, and Software Engineering, Ontario Tech University, Oshawa, Canada
{mdhasan.saju, maher.muhtadi, akramul.azim}@ontariotechu.ca
Index Terms—Vulnerability Detection, LLM, RAG, SFT, Dual-Agent
I. INTRODUCTION
A software vulnerability is a flaw in source code, caused by weaknesses such as buffer overflows, authentication errors, code injection, or design deficiencies, that can be exploited by attackers to breach security measures and gain unauthorized access to a system or network [2]. Exploitation can lead to severe consequences, including data theft, system manipulation, service disruption, and financial loss [3]. For example, according to IBM's Cost of a Data Breach Report 2024, the average cost of a data breach is USD 4.88 million, which includes the costs of detecting and addressing the breach, disruption and losses, and damage to business reputation [3]. Vulnerabilities are especially significant in safety-critical systems, where the consequences of exploitation can be catastrophic. As highlighted in [4], real-time systems such as automotive control systems (e.g., anti-lock braking, cruise control) depend on both logical and temporal correctness for faultless operation. A breach in such systems can disrupt timing constraints, leading to missed deadlines and potentially life-threatening failures. Therefore, it is important to detect and mitigate vulnerabilities in a timely manner.
Vulnerability detection involves identifying security weaknesses in software code that attackers could exploit. Conventional detection approaches, such as rule-based and signature-based techniques, rely on predefined patterns to spot known vulnerabilities but often fail to detect new or sophisticated threats [5]. Recent advances in machine learning, especially deep learning, have transformed this field by enabling systems to automatically learn complex patterns from code [5]. Moreover, the rise of large language models (LLMs) has further enhanced detection capabilities, as these models can analyze code syntax and context to identify vulnerabilities more effectively. LLMs are further customized to improve vulnerability detection through Retrieval-Augmented Generation (RAG) and fine-tuning. However, X. Du et al. [6] and A. Z. H. Yang et al. [7] highlighted that challenges such as false positives and computational cost persist, motivating the exploration of hybrid approaches like Dual-Agent systems.
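The RAG-style customization mentioned above can be sketched as a retrieve-then-prompt step. The abridged CWE summaries, the keyword-overlap retriever, and the prompt template below are illustrative stand-ins for an actual vector store over the MITRE CWE database, not the paper's implementation.

```python
# Illustrative sketch of RAG-style prompt construction for vulnerability
# detection: retrieve the most relevant CWE descriptions for a code snippet,
# then prepend them as context for the LLM. The retriever is a toy
# keyword-overlap ranker standing in for real embedding similarity search.
import re

# Abridged (paraphrased) summaries of the five CWE categories studied here.
CWE_SUMMARIES = {
    "CWE-119": "improper restriction of operations within the bounds of a memory buffer",
    "CWE-399": "resource management errors such as leaks and double free",
    "CWE-264": "weaknesses in permissions privileges and access controls",
    "CWE-20": "improper validation of input before it is processed",
    "CWE-200": "exposure of sensitive information to an unauthorized actor",
}

def retrieve_context(code: str, k: int = 2) -> list[str]:
    """Rank CWE summaries by naive keyword overlap with the code snippet."""
    tokens = set(re.findall(r"[a-z]+", code.lower()))
    scored = sorted(
        CWE_SUMMARIES.items(),
        key=lambda item: -len(tokens & set(item[1].split())),
    )
    return [f"{cwe}: {text}" for cwe, text in scored[:k]]

def build_prompt(code: str) -> str:
    """Assemble the augmented prompt an LLM detector would receive."""
    context = "\n".join(retrieve_context(code))
    return (
        "You are a security auditor. Using the CWE context below, decide "
        "whether the code is vulnerable and name the CWE.\n\n"
        f"Context:\n{context}\n\nCode:\n{code}\n"
    )

# A classic CWE-119 pattern: strcpy may write past a fixed-size buffer.
snippet = "void f(char *s) { char buffer[8]; strcpy(buffer, s); }"
print(build_prompt(snippet))
```

With this toy retriever, the `buffer`-related snippet pulls in the CWE-119 summary first; a production pipeline would replace the overlap score with embedding similarity over the full MITRE corpus.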
In this paper, we evaluate and compare the effectiveness of different LLM-based approaches for detecting source code vulnerabilities. The approaches investigated are RAG, Supervised Fine-Tuning (SFT), and a Dual-Agent system, each compared against the performance of the base LLM. The Dual-Agent system comprises a detector model for identifying vulnerabilities and a validation model for reviewing the first agent's findings. Two motivations drive this study. First, it provides a holistic comparison of three LLM techniques, RAG, fine-tuning, and Dual-Agent LLMs, for vulnerability detection, helping researchers and developers choose the best approach for their needs. Second, to the best of our knowledge, this paper is the first study to implement and apply a Dual-Agent system in the domain of code vulnerability detection [8].