Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Static Application Security Testing (SAST) tools are essential for identifying software vulnerabilities, but they often produce a high volume of false positives (FPs), imposing a substantial manual triage burden on developers. Recent advances in Large Language Model (LLM) agents offer a promising direction by enabling iterative reasoning, tool use, and environment interaction to refine SAST alerts. However, the comparative effectiveness of different LLM-based agent architectures for FP filtering remains poorly understood. In this paper, we present a comparative study of three state-of-the-art LLM-based agent frameworks, namely Aider, OpenHands, and SWE-agent, for vulnerability FP filtering. We evaluate these frameworks using vulnerabilities from the OWASP Benchmark and real-world open-source Java projects. The experimental results show that LLM-based agents can remove the majority of SAST noise, reducing an initial FP detection rate of over 92% on the OWASP Benchmark to as low as 6.3% in the best configuration. On the real-world dataset, the best agent configuration achieves an FP identification rate of up to 93.3% on CodeQL alerts. However, the benefits of agents are strongly backbone- and CWE-dependent: agentic frameworks significantly outperform vanilla prompting for stronger models such as Claude Sonnet 4 and GPT-5, but yield limited or inconsistent gains for weaker backbones. Moreover, aggressive FP reduction can come at the cost of suppressing true vulnerabilities, highlighting important trade-offs. Finally, we observe large disparities in computational cost across agent frameworks. Overall, our study demonstrates that LLM-based agents are a powerful but non-uniform solution for SAST FP filtering, and that their practical deployment requires careful consideration of agent design, backbone model choice, vulnerability category, and operational cost.


💡 Research Summary

This paper investigates whether autonomous large‑language‑model (LLM) agents can meaningfully reduce the overwhelming false‑positive (FP) noise generated by static application security testing (SAST) tools. The authors compare three state‑of‑the‑art agent frameworks—Aider, OpenHands, and SWE‑agent—each paired with three backbone models (Claude Sonnet 4, DeepSeek Chat, and GPT‑5). Experiments are conducted on two datasets: the OWASP Benchmark (Java v1.2) covering a wide range of CWEs, and a real‑world collection of CodeQL alerts drawn from the Vul4J dataset (50 open‑source Java projects).

The study is organized around three research questions: (RQ1) How effective are the different LLM‑based agents at filtering FPs on a controlled benchmark? (RQ2) How do they perform on real‑world code? (RQ3) What are the primary success drivers and recurring failure modes? To ensure a fair comparison, the authors standardize prompt construction, disable external web browsing, and record three metrics for each run: FP reduction rate, risk of suppressing true vulnerabilities, and computational overhead (rounds, token usage, and monetary cost).
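The three outcome metrics above follow directly from per-alert labels. The sketch below shows one way they could be computed; the class and function names and the boolean label scheme are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    is_true_vuln: bool  # ground-truth label for the SAST alert
    kept: bool          # agent verdict: True = kept as real, False = filtered out

def fp_rate(alerts):
    """Fraction of alerts the agent kept that are actually false positives."""
    reported = [a for a in alerts if a.kept]
    if not reported:
        return 0.0
    return sum(not a.is_true_vuln for a in reported) / len(reported)

def fp_identification_rate(alerts):
    """Fraction of actual FPs that the agent correctly filtered out."""
    fps = [a for a in alerts if not a.is_true_vuln]
    if not fps:
        return 0.0
    return sum(not a.kept for a in fps) / len(fps)

def suppressed_tp_rate(alerts):
    """Fraction of true vulnerabilities wrongly filtered (the safety risk)."""
    tps = [a for a in alerts if a.is_true_vuln]
    if not tps:
        return 0.0
    return sum(not a.kept for a in tps) / len(tps)
```

A high `fp_identification_rate` together with a low `suppressed_tp_rate` is the goal; the paper's trade-off finding is that pushing the first metric up tends to push the second up as well.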

Key findings:

  1. Substantial FP reduction – The best configuration (Aider + Claude Sonnet 4) lowered the FP rate from >92 % on the OWASP Benchmark to 6.3 %. OpenHands and SWE‑agent achieved comparable reductions (≈8–9 %). On the real‑world CodeQL alerts, the highest FP identification rate reached 93.3 %.
  2. Backbone‑ and CWE‑dependence – Strong models (Claude Sonnet 4, GPT‑5) consistently outperformed zero‑shot prompting, gaining an average of 12 percentage points in FP reduction. In contrast, the weaker DeepSeek Chat performed best with plain zero‑shot prompting, showing little benefit from agentic loops. Certain CWE families (e.g., weak cryptography, secure‑cookie flags) remained difficult, while injection‑type bugs (SQL injection, command injection) were filtered almost perfectly.
  3. Trade‑offs between aggressiveness and safety – Aggressive FP suppression sometimes caused a 2–3 % increase in false‑negative (missed true vulnerability) rates. Different agents exhibited distinct cost profiles: Aider required the most rounds and tokens (≈4.2 rounds, 1.8 M tokens per case), whereas SWE‑agent was cheaper but slightly less effective at FP removal.
  4. Practical deployment guidance – The authors recommend using high‑capacity backbones for agentic pipelines, adopting conservative settings for high‑risk CWEs, and carefully budgeting computational resources. They also outline future work on tighter integration of agents with static analyzers, cost‑aware round management, and cross‑language generalization.
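The iterative triage behavior underlying these findings — an agent alternating tool calls and reasoning rounds until it commits to a verdict or exhausts its budget — can be sketched in simplified form. Everything here (the `run_round` interface, the round budget, the conservative default) is a hypothetical skeleton for illustration, not the actual API of Aider, OpenHands, or SWE‑agent:

```python
from typing import Callable

def triage_alert(alert: dict,
                 run_round: Callable[[dict, list], dict],
                 max_rounds: int = 5) -> str:
    """Iteratively gather evidence about one SAST alert until the backbone
    model commits to a verdict or the round budget is exhausted."""
    evidence: list = []
    for _ in range(max_rounds):
        step = run_round(alert, evidence)    # one LLM round: tool request or verdict
        if step["action"] == "verdict":
            return step["label"]             # "true_positive" or "false_positive"
        evidence.append(step["observation"]) # e.g. a file read or code-search result
    # Conservative default: keep the alert rather than risk suppressing a real bug.
    return "true_positive"

# Stub backbone: requests one code lookup, then dismisses the alert.
def stub_round(alert, evidence):
    if not evidence:
        return {"action": "tool", "observation": "sink is sanitized upstream"}
    return {"action": "verdict", "label": "false_positive"}
```

The conservative fallback when the budget runs out reflects the paper's trade-off finding: when in doubt, keeping an alert costs triage time, while dropping it can suppress a true vulnerability.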

Overall, the study demonstrates that LLM‑based autonomous agents can dramatically improve the precision of SAST outputs, but their benefits are non‑uniform and must be balanced against model choice, vulnerability category, and operational cost.

