Semantics-Preserving Evasion of LLM Vulnerability Detectors
LLM-based vulnerability detectors are increasingly deployed in security-critical code review, yet their resilience to evasion under behavior-preserving edits remains poorly understood. We evaluate detection-time integrity under a semantics-preserving threat model by instantiating diverse behavior-preserving code transformations on a unified C/C++ benchmark (N=5000), and introduce a metric of joint robustness across attack methods and carriers. Across models, we observe a systemic failure under semantics-invariant adversarial transformations: even state-of-the-art vulnerability detectors that perform well on clean inputs see their predictions flip under behavior-equivalent edits. Universal adversarial strings optimized on a single surrogate model remain effective when transferred to black-box APIs, and gradient access further amplifies evasion success. These results show that even high-performing detectors are vulnerable to low-cost, semantics-preserving evasion. Our carrier-based metrics provide practical diagnostics for evaluating LLM-based code detectors.
💡 Research Summary
The paper investigates the robustness of large language model (LLM) based vulnerability detectors against semantics‑preserving edits to source code. As LLMs are increasingly integrated into continuous integration pipelines for automated code review, the authors ask whether these detectors remain reliable when the input program is altered in ways that preserve compilation and the underlying vulnerability.
To answer this, they define a threat model that restricts the adversary to four “carrier” transformations that are guaranteed to be semantics‑preserving by construction: (1) scope‑consistent identifier substitution, (2) insertion of adversarial strings inside comments, (3) insertion inside inactive preprocessor regions (macros or #ifdef guards), and (4) insertion into dead‑branch code blocks (e.g., if(0){…}). Each transformation is automatically validated for syntactic correctness (using Tree‑sitter) and for successful compilation (gcc) on a large sample of functions.
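The four carriers can be illustrated with a minimal sketch. The helper names below are hypothetical (not the paper's tooling), the payload placeholder stands in for the optimized adversarial string, and the identifier renaming uses a naive word-boundary regex where the paper uses a proper parser (Tree-sitter) to stay scope-consistent:

```python
import re

def comment_carrier(src: str, payload: str) -> str:
    # Carrier 2: insert the adversarial string inside a block comment.
    return f"/* {payload} */\n{src}"

def preprocessor_carrier(src: str, payload: str) -> str:
    # Carrier 3: wrap the payload in an inactive #if 0 region; it is never
    # compiled, so program semantics are preserved by construction.
    return f"#if 0\n{payload}\n#endif\n{src}"

def dead_branch_carrier(src: str, payload: str) -> str:
    # Carrier 4: append an if(0) block whose body can never execute.
    # (A real implementation would place it inside the function body.)
    return src + f"\nvoid pad(void) {{ if (0) {{ const char *s = \"{payload}\"; }} }}\n"

def identifier_carrier(src: str, old: str, new: str) -> str:
    # Carrier 1: identifier renaming; naive regex here, parser-based in practice.
    return re.sub(rf"\b{re.escape(old)}\b", new, src)

func = "int get_len(char *buf) { return strlen(buf); }"
print(preprocessor_carrier(func, "PAYLOAD"))
```

Each transformed candidate would then be re-parsed and re-compiled, as described above, before it is counted as a valid attack.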
The attack generation uses the Greedy Coordinate Gradient (GCG) algorithm, which leverages token‑level gradients to iteratively improve a discrete adversarial string σ. Rather than optimizing per‑instance, the authors train a universal σ over a small support set of ten vulnerable functions, maximizing the average log‑odds of the target label “BENIGN” over the ground‑truth “VULNERABLE”. The resulting universal string is then frozen and applied zero‑shot to all test samples, allowing measurement of transferability across both samples and models.
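The universal-string search can be sketched as a greedy coordinate loop. This toy version replaces GCG's gradient-based candidate ranking with random candidate sampling, and `score` is a stand-in for the average log-odds of "BENIGN" over "VULNERABLE" from a real surrogate model; the support set and objective here are purely illustrative:

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz_ ")
SUPPORT = ["func_a", "func_b", "func_c"]  # placeholder for the 10-function support set

def score(sigma: str, sample: str) -> float:
    # Hypothetical objective; a real attack queries the surrogate model's logits.
    return sum(sigma.count(c) for c in sample)

def universal_score(sigma: str) -> float:
    # Average the objective over the whole support set (universality).
    return sum(score(sigma, s) for s in SUPPORT) / len(SUPPORT)

def gcg_like_search(length=8, iters=50, candidates=8, seed=0):
    rng = random.Random(seed)
    sigma = [rng.choice(VOCAB) for _ in range(length)]
    best = universal_score("".join(sigma))
    for _ in range(iters):
        pos = rng.randrange(length)           # pick one coordinate (token position)
        for tok in rng.sample(VOCAB, candidates):
            trial = sigma[:]
            trial[pos] = tok                  # try a candidate substitution
            s = universal_score("".join(trial))
            if s > best:                      # keep only improving swaps
                best, sigma = s, trial
    return "".join(sigma), best

sigma, obj = gcg_like_search()
print(sigma, obj)
```

The frozen string returned here plays the role of the paper's universal σ: it is optimized once on the support set and then applied unchanged to unseen samples and models.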
The experimental benchmark, UNIFIED‑VUL‑N, aggregates 5,000 vulnerable C/C++ functions from PrimeVul, BigVul, and DiverseVul. The dataset is rigorously de‑duplicated (canonical hashing after comment/format stripping) and limited to ≤4096 tokens per function to avoid truncation artifacts. Five detectors are evaluated: open‑source Qwen2.5‑Coder‑32B, Llama3.1‑8B, CodeAstra, and closed‑source GPT‑4o and GPT‑5‑mini. Clean (unperturbed) true‑positive rates range from 22 % (Qwen2.5‑Coder‑32B) to 74 % (GPT‑5‑mini).
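The de-duplication step described above can be sketched as canonical hashing after comment and whitespace normalization. The regexes below are simplified (they do not handle comment markers inside string literals, for instance):

```python
import hashlib
import re

def canonicalize(src: str) -> str:
    src = re.sub(r"/\*.*?\*/", " ", src, flags=re.S)  # strip block comments
    src = re.sub(r"//[^\n]*", " ", src)               # strip line comments
    return re.sub(r"\s+", " ", src).strip()           # collapse whitespace

def canonical_hash(src: str) -> str:
    # Two functions that differ only in comments/formatting hash identically.
    return hashlib.sha256(canonicalize(src).encode()).hexdigest()

a = "int f(int x){ return x+1; } // increment"
b = "int f(int x){\n  return x+1;\n}"
print(canonical_hash(a) == canonical_hash(b))
```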
Robustness is quantified with three metrics: Conditional Attack Success Rate (ASR) on clean true positives, the union of all successful flips across carriers, and Complete Resistance (CR), defined as the fraction of vulnerabilities that resist all carrier‑based attacks. Across all models, CR is below 13 %, meaning that more than 87 % of vulnerabilities that are correctly flagged on clean code can be turned into false negatives by at least one semantics‑preserving edit. Preprocessor‑based carriers achieve the highest ASR (up to 99 % on some models), while identifier substitution also shows strong effectiveness. Comment carriers are less potent but still non‑trivial.
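The three metrics follow directly from a per-sample, per-carrier flip table. The numbers below are illustrative, not the paper's results:

```python
# flipped[c][i] is True if carrier c turned clean true positive i
# into a false negative.
flipped = {
    "identifier":   [True,  True,  False, False, True ],
    "comment":      [False, True,  False, False, False],
    "preprocessor": [True,  True,  True,  False, True ],
    "dead_branch":  [True,  False, False, False, True ],
}
n = 5  # number of clean true positives

# Conditional ASR per carrier: flips / clean true positives.
asr = {c: sum(v) / n for c, v in flipped.items()}

# Union ASR: a sample counts if ANY carrier flips it.
union = sum(any(flipped[c][i] for c in flipped) for i in range(n)) / n

# Complete Resistance: fraction that resists ALL carriers.
cr = 1.0 - union

print(asr["preprocessor"], union, cr)
```

With this toy table, the preprocessor carrier alone flips 80% of clean true positives and only one sample in five resists every carrier, mirroring the qualitative pattern reported above.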
The authors also explore three knowledge regimes: (i) white‑box access with full gradients; (ii) black‑box transfer, where universal strings are optimized on a surrogate (Qwen2.5‑Coder‑14B) and then applied to proprietary APIs; and (iii) an operational black‑box setting in which gradients are unavailable and only transfer attacks are feasible. In the transfer scenario, the universal strings derived from the surrogate model successfully evade GPT‑4o without any further tuning, demonstrating low‑cost exploitability of closed‑source detectors. White‑box attacks further increase ASR, establishing an upper bound on the vulnerability of these systems.
The paper positions its contribution relative to prior work on LLM safety (toxicity, jailbreaks) and code‑specific adversarial attacks (graph‑based, neural binary analyzers). It argues that existing safety benchmarks overlook the unique risk posed by semantics‑preserving transformations in security‑critical code analysis. By introducing the CR metric and a unified benchmark, the work provides a practical diagnostic tool for measuring joint robustness across multiple attack surfaces.
In conclusion, the study reveals a systemic fragility: even state‑of‑the‑art LLM vulnerability detectors, which achieve high clean‑accuracy, can be trivially bypassed by inexpensive, semantics‑preserving edits. The findings underscore the need for new defense strategies—such as adversarial training with carrier‑based perturbations, model‑level regularization, or runtime verification—to close the robustness gap before LLM‑driven code security tools are deployed at scale.