On the Difficulty of Selecting Few-Shot Examples for Effective LLM-based Vulnerability Detection

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Large language models (LLMs) have demonstrated impressive capabilities across a wide range of coding tasks, including summarization, translation, completion, and code generation. Despite these advances, detecting code vulnerabilities remains a challenging problem for LLMs. In-context learning (ICL) has emerged as an effective mechanism for improving model performance by providing a small number of labeled examples within the prompt. Prior work has shown, however, that the effectiveness of ICL depends critically on how these few-shot examples are selected. In this paper, we study two intuitive criteria for selecting few-shot examples for ICL in the context of code vulnerability detection. The first criterion leverages model behavior by prioritizing samples on which the LLM consistently makes mistakes, motivated by the intuition that such samples can expose and correct systematic model weaknesses. The second criterion selects examples based on semantic similarity to the query program, using k-nearest-neighbor retrieval to identify relevant contexts. We conduct extensive evaluations using open-source LLMs and datasets spanning multiple programming languages. Our results show that for Python and JavaScript, careful selection of few-shot examples can lead to measurable performance improvements in vulnerability detection. In contrast, for C and C++ programs, few-shot example selection has limited impact, suggesting that more powerful but also more expensive approaches, such as re-training or fine-tuning, may be required to substantially improve model performance.


💡 Research Summary

This paper investigates how the selection of few‑shot examples influences the performance of large language models (LLMs) when they are used for code vulnerability detection via in‑context learning (ICL). While LLMs have shown strong abilities on many coding tasks, detecting security‑relevant bugs remains difficult, and prior work has shown that ICL’s effectiveness is highly sensitive to the choice of demonstration examples. The authors propose and evaluate two intuitive selection strategies.

The first, Learn‑from‑Mistakes (LFM), scans a labeled training set and queries the target LLM on each program. Samples on which the model consistently makes wrong predictions are collected into a “mistake” set; those it gets right form a “correct” set. A Boolean “stacked” flag determines whether the context used for subsequent queries is incrementally enriched with the newly identified mistake examples, allowing the model to see its own errors repeatedly. The algorithm can be run for multiple passes (k) to mitigate nondeterminism, and a final few‑shot set of size n is drawn from either the mistake, correct, or a “gray” (inconsistent) pool, according to a user‑specified option.
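The LFM loop described above can be sketched as follows. This is not the authors' code; it is a minimal illustration assuming a hypothetical `query_model(program, context)` callable that returns the LLM's predicted label, with `k` repeated queries to smooth over nondeterminism.

```python
import random

def learn_from_mistakes(train_set, query_model, k=3, n=4, stacked=False, pool="mistake"):
    """Sketch of the Learn-from-Mistakes (LFM) selection loop.

    train_set:   list of (program, label) pairs.
    query_model: hypothetical callable(program, context) -> predicted label.
    """
    mistakes, correct, gray = [], [], []
    context = []  # few-shot context used for subsequent queries
    for program, label in train_set:
        # Query the model k times to mitigate nondeterminism.
        preds = [query_model(program, context) for _ in range(k)]
        if all(p != label for p in preds):
            mistakes.append((program, label))
            if stacked:
                # Incrementally enrich the context with newly found mistakes.
                context.append((program, label))
        elif all(p == label for p in preds):
            correct.append((program, label))
        else:
            gray.append((program, label))  # inconsistent predictions
    chosen = {"mistake": mistakes, "correct": correct, "gray": gray}[pool]
    return random.sample(chosen, min(n, len(chosen)))
```

The `pool` argument mirrors the user-specified option for drawing the final few-shot set from the mistake, correct, or gray pool.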

The second strategy, Learn‑from‑Nearest‑Neighbors (LFNN), uses a code‑embedding model to map every program in the training corpus to a vector. For a given query program, its embedding is computed, cosine similarities to all corpus vectors are calculated, and the top‑n nearest neighbors are returned as the few‑shot examples. This approach assumes that semantically similar code will provide the most useful context for the LLM.
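The retrieval step of LFNN reduces to cosine-similarity ranking over precomputed embeddings. A minimal sketch (not the paper's implementation) with plain-Python vectors standing in for the output of a real code-embedding model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_neighbor_shots(query_vec, corpus, n=4):
    """Return the n corpus entries most similar to the query embedding.

    corpus: list of (program, label, embedding) triples.
    """
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return [(prog, label) for prog, label, _ in ranked[:n]]
```

In practice the embeddings would come from a code-embedding model, and the similarity search would typically use a vector index rather than a full scan.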

Both methods can be combined, yielding hybrid selection policies that aim to provide examples that are both semantically close to the query and that expose the model’s systematic weaknesses.
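One simple way such a hybrid policy could be realized (an illustrative assumption, not the paper's stated method) is to interleave the two selections under a shared shot budget, deduplicating by program:

```python
from itertools import zip_longest

def combine_shots(lfm_shots, lfnn_shots, n=4):
    """Interleave LFM- and LFNN-selected (program, label) pairs,
    deduplicating by program, up to a budget of n shots."""
    seen, out = set(), []
    for a, b in zip_longest(lfm_shots, lfnn_shots):
        for pair in (a, b):
            if pair is not None and pair[0] not in seen:
                seen.add(pair[0])
                out.append(pair)
            if len(out) == n:
                return out
    return out
```

Interleaving keeps the context balanced between "hard" examples and "similar" examples even when one pool is much larger than the other.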

Experiments are conducted with two open‑source code‑oriented LLMs (Qwen2.5‑Coder‑7B‑Instruct and Gemma‑3‑4B‑it) and a closed‑source GPT‑5‑mini model. Four publicly available vulnerability datasets covering Python, JavaScript, C, and C++ are used. Evaluation metrics include precision, recall, and F1 score.

Key findings:

  • For Python and JavaScript, careful example selection improves detection performance. LFNN alone typically yields a 3–5% absolute gain in F1 over random selection, and hybrid LFM+LFNN policies provide more stable gains across models.
  • LFM tends to bias predictions toward the positive (vulnerable) class, which can raise recall but harms precision, especially when the stacked option is enabled.
  • For C and C++, neither selection strategy yields significant improvements. The authors attribute this to the low‑level nature of these languages (pointer arithmetic, manual memory management) and the limited exposure of such constructs in the pre‑training data of the evaluated LLMs. In these cases, more heavyweight interventions such as fine‑tuning or retraining appear necessary.
  • The quality of the embedding model used by LFNN strongly affects results; higher‑quality code embeddings lead to more relevant nearest neighbors and larger performance gains.
  • Stacking many mistake examples can exceed the model’s context window, causing degradation; thus a trade‑off exists between exposing errors and maintaining prompt length.
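The prompt-length trade-off in the last point can be handled with a simple budget check. A minimal sketch, assuming a rough whitespace tokenizer as a stand-in for the model's real tokenizer:

```python
def fit_context(examples, budget, count_tokens=lambda s: len(s.split())):
    """Keep the most recently stacked examples whose combined token
    count fits within the context budget."""
    kept, used = [], 0
    for ex in reversed(examples):  # prefer the most recently added mistakes
        cost = count_tokens(ex)
        if used + cost > budget:
            break
        kept.append(ex)
        used += cost
    kept.reverse()  # restore original order for the prompt
    return kept
```

Dropping the oldest stacked examples first is one heuristic; ranking by similarity to the query before trimming would be another.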

The paper situates its contributions within three related areas: (a) traditional vulnerability detection (static analysis, dynamic testing, ML‑based classifiers), (b) LLM‑based vulnerability detection (embedding‑based classification vs. generative prompting), and (c) prompt optimization and example selection for ICL. It argues that example selection can be viewed as a form of prompt optimization, and that the proposed algorithms provide concrete, reproducible methods for this purpose.

In conclusion, the study demonstrates that few‑shot example selection is a practical lever for enhancing LLM‑driven vulnerability detection, particularly for higher‑level languages where semantic similarity is a strong cue. However, the impact varies by language and model, and for low‑level languages the gains are limited, suggesting that future work should explore richer code representations (ASTs, CFGs), better embedding models, and possibly hybrid pipelines that combine ICL with lightweight fine‑tuning. The authors propose extending the approach to multi‑modal embeddings and automated prompt‑search techniques to further close the gap between current LLM capabilities and the stringent reliability requirements of security analysis.

