Retrieval-Augmented Few-Shot Prompting Versus Fine-Tuning for Code Vulnerability Detection

Reading time: 5 minutes

📝 Original Info

  • Title: Retrieval-Augmented Few-Shot Prompting Versus Fine-Tuning for Code Vulnerability Detection
  • ArXiv ID: 2512.04106
  • Date: 2025-11-28
  • Authors: Fouad Trad, Ali Chehab

📝 Abstract

Few-shot prompting has emerged as a practical alternative to fine-tuning for leveraging the capabilities of large language models (LLMs) in specialized tasks. However, its effectiveness depends heavily on the selection and quality of in-context examples, particularly in complex domains. In this work, we examine retrieval-augmented prompting as a strategy to improve few-shot performance in code vulnerability detection, where the goal is to identify one or more security-relevant weaknesses present in a given code snippet from a predefined set of vulnerability categories. We perform a systematic evaluation using the Gemini-1.5-Flash model across three approaches: (1) standard few-shot prompting with randomly selected examples, (2) retrieval-augmented prompting using semantically similar examples, and (3) retrieval-based labeling, which assigns labels based on retrieved examples without model inference. Our results show that retrieval-augmented prompting consistently outperforms the other prompting strategies. At 20 shots, it achieves an F1 score of 74.05% and a partial match accuracy of 83.90%. We further compare this approach against zero-shot prompting and several fine-tuned models, including Gemini-1.5-Flash and smaller open-source models such as DistilBERT, DistilGPT2, and CodeBERT. Retrieval-augmented prompting outperforms both zero-shot (F1 score: 36.35%, partial match accuracy: 20.30%) and fine-tuned Gemini (F1 score: 59.31%, partial match accuracy: 53.10%), while avoiding the training time and cost associated with model fine-tuning. On the other hand, fine-tuning CodeBERT yields higher performance (F1 score: 91.22%, partial match accuracy: 91.30%) but requires additional training, maintenance effort, and resources.
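To make the retrieval step in strategy (2) concrete, here is a minimal sketch of selecting semantically similar in-context examples for a query snippet. The embedding model, library, and function names are assumptions for illustration; the excerpt does not state what the authors actually used.

```python
# Illustrative sketch only: the embedding backend and model are assumptions,
# not the paper's documented implementation.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding library

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

def retrieve_similar_examples(query_code, train_snippets, train_labels, k=20):
    """Return the k training examples most similar to query_code by cosine similarity."""
    # Embed the candidate pool and the query snippet (normalized vectors).
    pool_emb = embedder.encode(train_snippets, normalize_embeddings=True)
    query_emb = embedder.encode([query_code], normalize_embeddings=True)[0]
    # With normalized embeddings, cosine similarity reduces to a dot product.
    scores = pool_emb @ query_emb
    top_idx = np.argsort(-scores)[:k]
    return [(train_snippets[i], train_labels[i]) for i in top_idx]
```

The retrieved (snippet, label) pairs are then placed in the prompt as in-context examples, in place of randomly sampled ones.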

💡 Deep Analysis

Figure 1

📄 Full Content

Retrieval-Augmented Few-Shot Prompting Versus Fine-Tuning for Code Vulnerability Detection

Fouad Trad, Electrical and Computer Engineering, American University of Beirut, Beirut, Lebanon (fat10@mail.aub.edu)
Ali Chehab, Electrical and Computer Engineering, American University of Beirut, Beirut, Lebanon (chehab@aub.edu.lb)

Abstract—Few-shot prompting has emerged as a practical alternative to fine-tuning for leveraging the capabilities of large language models (LLMs) in specialized tasks. However, its effectiveness depends heavily on the selection and quality of in-context examples, particularly in complex domains. In this work, we examine retrieval-augmented prompting as a strategy to improve few-shot performance in code vulnerability detection, where the goal is to identify one or more security-relevant weaknesses present in a given code snippet from a predefined set of vulnerability categories. We perform a systematic evaluation using the Gemini-1.5-Flash model across three approaches: (1) standard few-shot prompting with randomly selected examples, (2) retrieval-augmented prompting using semantically similar examples, and (3) retrieval-based labeling, which assigns labels based on retrieved examples without model inference. Our results show that retrieval-augmented prompting consistently outperforms the other prompting strategies. At 20 shots, it achieves an F1 score of 74.05% and a partial match accuracy of 83.90%. We further compare this approach against zero-shot prompting and several fine-tuned models, including Gemini-1.5-Flash and smaller open-source models such as DistilBERT, DistilGPT2, and CodeBERT. Retrieval-augmented prompting outperforms both zero-shot (F1 score: 36.35%, partial match accuracy: 20.30%) and fine-tuned Gemini (F1 score: 59.31%, partial match accuracy: 53.10%), while avoiding the training time and cost associated with model fine-tuning. On the other hand, fine-tuning CodeBERT yields higher performance (F1 score: 91.22%, partial match accuracy: 91.30%) but requires additional training, maintenance effort, and resources. These results underscore the value of semantically relevant example selection for few-shot prompting and position retrieval-augmented prompting as a practical trade-off between performance and deployment cost in code analysis tasks.

Index Terms—Retrieval-Augmented Generation, Few-Shot Prompting, Large Language Models, In-Context Learning, Code Vulnerability Detection

I. INTRODUCTION

Large Language Models (LLMs) have demonstrated strong performance across a wide range of natural language processing tasks, including code understanding and software vulnerability detection [1]. Fine-tuning these models for specific tasks has become a standard approach to adapting them to specialized domains. However, fine-tuning is resource intensive, may require access to model weights, and entails non-trivial training time and maintenance costs [2], [3]. This can be particularly limiting in security-focused applications, where data distributions evolve frequently and model updates must be efficient and repeatable.

Few-shot prompting has emerged as an alternative that avoids the need for model retraining. In this setup, a small number of labeled input-output examples are embedded directly into the prompt to guide the model during inference. This approach leverages the capabilities of general-purpose LLMs without modifying their internal parameters.
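The in-context setup just described can be illustrated with a small prompt builder. This is a hypothetical sketch: the paper's actual prompt wording, label format, and category list are not shown in this excerpt.

```python
# Illustrative sketch: the prompt template below is a stand-in, not the paper's
# documented prompt.
def build_few_shot_prompt(examples, query_code, categories):
    """examples: list of (code_snippet, label_list) pairs placed in-context."""
    lines = [
        "You are a code security analyst.",
        f"For the given code, list every applicable vulnerability category from: {', '.join(categories)}.",
        "",
    ]
    for i, (snippet, labels) in enumerate(examples, start=1):
        lines += [f"Example {i}:", "Code:", snippet, f"Vulnerabilities: {', '.join(labels)}", ""]
    lines += ["Now analyze this code:", "Code:", query_code, "Vulnerabilities:"]
    return "\n".join(lines)
```

Sampling `examples` uniformly from the training set corresponds to the random few-shot baseline; supplying the semantically nearest neighbours instead (as in the retrieval sketch above) yields the retrieval-augmented variant, with the assembled prompt then sent to Gemini-1.5-Flash.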
While promising, few-shot prompting suffers from high variance depending on the quality and relevance of in-context examples. In complex tasks such as multi-label code vulnerability detection, selecting examples that are semantically mismatched or too generic can lead to degraded performance.

In this work, we explore retrieval-augmented prompting strategies to improve the effectiveness of few-shot learning for code vulnerability detection. Our hypothesis is that providing semantically similar in-context examples can improve prediction accuracy without requiring any model fine-tuning. To evaluate this, we implement and compare three prompting strategies:

  • Random Few-Shot Prompting: in-context examples are randomly sampled from the training set.
  • Retrieval-Augmented Few-Shot Prompting: examples are selected based on semantic similarity to the target input using an embedding-based retrieval mechanism.
  • Retrieval-Based Labeling: labels are inferred by retrieving the most similar examples and propagating their labels directly, without using the LLM for inference.

We conducted a detailed evaluation using the Gemini-1.5-Flash model on a multi-label code vulnerability detection dataset. Our results show that retrieval-augmented prompting consistently outperforms both random prompting and retrieval-based labeling. With 20 shots, retrieval-augmented prompting achieves an F1 score of 74.05% and a partial match accuracy of 83.90%, substantially outperforming zero-shot prompting and the other investigated strategies. To compare with fine-tuning…
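The third strategy listed above, retrieval-based labeling, skips the LLM entirely and derives the prediction from the retrieved neighbours' labels. The excerpt does not specify how those labels are aggregated, so the sketch below assumes a simple per-label vote with a majority threshold; the CWE identifiers are illustrative placeholders, not the paper's category set.

```python
# Sketch of retrieval-based labeling: the predicted label set comes directly from
# the retrieved neighbours, with no model inference. The vote threshold is an
# assumption; the paper's aggregation rule is not given in this excerpt.
from collections import Counter

def propagate_labels(neighbour_label_sets, min_votes=None):
    """neighbour_label_sets: one label list per retrieved example (e.g. the top-k
    most similar training snippets). Returns the propagated multi-label prediction."""
    k = len(neighbour_label_sets)
    votes = Counter(label for labels in neighbour_label_sets for label in labels)
    threshold = min_votes if min_votes is not None else (k // 2 + 1)  # assumed: simple majority
    return sorted(label for label, count in votes.items() if count >= threshold)

# Example: labels of the 3 nearest neighbours for one query snippet.
print(propagate_labels([["CWE-79"], ["CWE-79", "CWE-89"], ["CWE-79"]]))  # -> ['CWE-79']
```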

📸 Image Gallery

f1_score.png hamming_accuracy.png methodology.png partial_match_accuracy.png precision.png prompts.png recall.png subset_accuracy.png

Reference

This content is AI-processed based on open access ArXiv data.
