Evaluating Large Language Models for Multilingual Vulnerability Detection at Dual Granularities
Various deep learning-based approaches utilizing pre-trained language models (PLMs) have been proposed for automated vulnerability detection. With recent advancements in large language models (LLMs), several studies have begun exploring their application to vulnerability detection tasks. However, existing studies primarily focus on specific programming languages (e.g., C/C++) and function-level detection, leaving the strengths and weaknesses of PLMs and LLMs in multilingual and multi-granularity scenarios largely unexplored. To bridge this gap, we conduct a comprehensive, fine-grained empirical study evaluating the effectiveness of state-of-the-art PLMs and LLMs for multilingual vulnerability detection. Using over 30,000 real-world vulnerability-fixing patches across seven programming languages, we systematically assess model performance at both the function level and the line level. Our key findings indicate that GPT-4o, enhanced through instruction tuning and few-shot prompting, significantly outperforms all other evaluated models, including CodeT5P. Furthermore, the LLM-based approach demonstrates superior capability in detecting unique multilingual vulnerabilities, particularly excelling at identifying the most dangerous and high-severity vulnerabilities. These results underscore the promise of adopting LLMs for multilingual vulnerability detection at both granularities, revealing their complementary strengths and substantial improvements over PLM-based approaches in addressing real-world software security challenges.
💡 Research Summary
This paper presents a comprehensive empirical study that compares the effectiveness of state‑of‑the‑art pre‑trained language models (PLMs) and large language models (LLMs) for automated vulnerability detection (AVD) across multiple programming languages and at two granularities: function‑level and line‑level. The authors address three critical gaps in prior work: (1) most existing studies focus on single languages (primarily C/C++) and ignore the multilingual reality of modern software stacks; (2) the generalizability of PLMs and LLMs to diverse language‑specific vulnerability patterns has not been rigorously evaluated; and (3) evaluation frameworks have largely omitted line‑level detection, which is essential for precise remediation.
To fill these gaps, the authors construct a large, realistic dataset called REEF, derived from the National Vulnerability Database and Mend’s CVE list, containing 4,466 CVEs and 30,987 real‑world vulnerability‑fixing patches spanning seven languages: C, C#, C++, Go, Java, JavaScript, and Python. After careful preprocessing—removing comments, extracting function definitions with Tree‑sitter, filtering out functions longer than 512 tokens, and generating line‑level labels via Python’s difflib—the final corpus comprises 20,165 functions (≈10k vulnerable, ≈10k non‑vulnerable) and a line‑level annotation set covering over 9,000 functions.
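The line-level labeling step can be sketched with Python's standard `difflib`: lines of the pre-fix (vulnerable) function that the fixing patch deletes or replaces are marked as vulnerable. This is a minimal illustration under that assumption; the paper's exact matching and labeling rules may differ, and the function name is hypothetical.

```python
import difflib

def label_vulnerable_lines(vuln_fn: str, fixed_fn: str) -> list[int]:
    """Return 1-indexed line numbers of the vulnerable function that the
    fix deletes or replaces (an assumed labeling rule, for illustration)."""
    vuln_lines = vuln_fn.splitlines()
    fixed_lines = fixed_fn.splitlines()
    matcher = difflib.SequenceMatcher(None, vuln_lines, fixed_lines)
    labeled: list[int] = []
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        # "delete"/"replace" opcodes cover lines removed or rewritten by the fix
        if tag in ("replace", "delete"):
            labeled.extend(range(i1 + 1, i2 + 1))
    return labeled

# Toy off-by-one fix: only line 2 of the vulnerable function changes
vuln = "def read(buf, n):\n    return buf[:n+1]\n"
fixed = "def read(buf, n):\n    return buf[:n]\n"
print(label_vulnerable_lines(vuln, fixed))
```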
The experimental setup evaluates a suite of PLMs (e.g., CodeT5‑P) and LLMs (GPT‑3.5‑Turbo, GPT‑4, GPT‑4o) under three prompting strategies: zero‑shot, retrieval‑augmented few‑shot, and instruction‑tuned few‑shot. The instruction‑tuned few‑shot configuration for GPT‑4o involves a small set of curated examples (five‑shot) and a fine‑tuned instruction prompt that explicitly describes the vulnerability detection task, the expected output format, and the handling of language‑specific syntax.
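An instruction-tuned few-shot prompt of the kind described above can be assembled roughly as follows. The instruction wording, example format, and function name here are assumptions for illustration, not the paper's actual prompt template.

```python
def build_prompt(examples: list[tuple[str, str]], target_code: str, language: str) -> str:
    """Assemble a few-shot vulnerability-detection prompt (hypothetical format):
    a task instruction, labeled example functions, then the target function."""
    instruction = (
        f"You are a security auditor. Decide whether the following {language} "
        "function is vulnerable. Answer 'vulnerable' or 'benign', then list "
        "the suspicious line numbers."
    )
    shots = "\n\n".join(
        f"### Example\nCode:\n{code}\nAnswer: {label}" for code, label in examples
    )
    return f"{instruction}\n\n{shots}\n\n### Task\nCode:\n{target_code}\nAnswer:"

# One-shot example for a C function; a five-shot setup would pass five pairs
prompt = build_prompt(
    [("int f(void) { return 0; }", "benign")],
    "char *g(char *s) { return strcpy(malloc(4), s); }",
    "C",
)
print(prompt)
```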
Key results:
- Function‑level: GPT‑4o with instruction tuning achieves the highest accuracy of 0.7196, outperforming the best PLM (CodeT5‑P, 0.6037) by nearly 12 percentage points.
- Line‑level: GPT‑4o attains the best F1‑score of 0.6641, again surpassing all PLMs and other LLMs.
- Language‑wise performance varies: the model peaks on Go (accuracy 0.8082) and dips on Python (0.6626) at the function level; at the line level it excels on JavaScript (F1 0.7815) and struggles on C# (F1 0.4348).
- Risk‑focused analysis shows that GPT‑4o uniquely identifies a larger proportion of the top‑25 most dangerous CWE‑IDs and achieves higher detection rates for high‑severity (Critical/High) CVSS scores compared with all PLMs.
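The line-level F1 scores above follow the standard definition over predicted versus ground-truth vulnerable line sets, which can be computed as in this sketch (function name illustrative):

```python
def line_f1(predicted: set[int], actual: set[int]) -> float:
    """Standard F1 over predicted vs. ground-truth vulnerable line numbers."""
    if not predicted or not actual:
        return 0.0
    tp = len(predicted & actual)          # correctly flagged lines
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)

# Model flags lines 3, 5, 8; the fix actually touched lines 3, 8, 9
print(line_f1({3, 5, 8}, {3, 8, 9}))
```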
Additional investigations reveal that (a) model size is not a decisive factor—larger open‑source LLMs do not consistently beat smaller ones; (b) reasoning‑oriented LLMs (those employing chain‑of‑thought prompting) do not provide significant gains for this task; and (c) the primary performance driver is the quality of instruction tuning and few‑shot example selection rather than raw parameter count.
The authors also discuss practical deployment considerations, noting that while LLMs require more GPU memory and incur higher inference latency than PLMs, the security benefits (higher detection accuracy, better coverage of severe vulnerabilities, and multilingual robustness) justify the added cost in many enterprise settings.
Contributions are clearly enumerated: (1) a systematic, multilingual benchmark for function‑ and line‑level vulnerability detection; (2) an extensive comparison of prompting strategies, highlighting the superiority of instruction‑tuned few‑shot LLMs; (3) fine‑grained analyses that uncover strengths (e.g., high‑risk CWE detection) and weaknesses (language‑specific performance gaps) of each model family; (4) insights into the limited impact of model scale and reasoning capabilities; and (5) a publicly released replication package containing data, code, and detailed experiment logs.
In summary, the study provides strong empirical evidence that modern LLMs—when properly tuned and prompted—outperform traditional PLMs for multilingual vulnerability detection at both coarse (function) and fine (line) granularities. It offers actionable guidance for researchers and practitioners aiming to integrate LLM‑driven security analysis into real‑world software development pipelines, and it sets a solid foundation for future work exploring larger multilingual corpora, open‑source LLMs, and continuous integration of LLM‑based AVD tools.