QRS: A Rule-Synthesizing Neuro-Symbolic Triad for Autonomous Vulnerability Discovery
Static Application Security Testing (SAST) tools are integral to modern DevSecOps pipelines, yet tools like CodeQL, Semgrep, and SonarQube remain fundamentally constrained: they require expert-crafted queries, generate excessive false positives, and detect only predefined vulnerability patterns. Recent work has explored augmenting SAST with Large Language Models (LLMs), but these approaches typically use LLMs to triage existing tool outputs rather than to reason about vulnerability semantics directly. We introduce QRS (Query, Review, Sanitize), a neuro-symbolic framework that inverts this paradigm. Rather than filtering results from static rules, QRS employs three autonomous agents that generate CodeQL queries from a structured schema definition and few-shot examples, then validate findings through semantic reasoning and automated exploit synthesis. This architecture enables QRS to discover vulnerability classes beyond predefined patterns while substantially reducing false positives. We evaluate QRS on full Python packages rather than isolated snippets. On a benchmark of 20 historical CVEs in popular PyPI libraries, QRS achieves 90.6% detection accuracy. Applied to the 100 most-downloaded PyPI packages, QRS identified 39 medium-to-high-severity vulnerabilities: 5 were assigned new CVEs, 5 prompted documentation updates, and the remaining 29 were independently discovered by concurrent researchers, validating both the severity and discoverability of these findings. QRS accomplishes this with low time overhead and manageable token costs, demonstrating that LLM-driven query synthesis and code review can complement manually curated rule sets and uncover vulnerability patterns that evade existing industry tools.
💡 Research Summary
The paper introduces QRS (Query, Review, Sanitize), a neuro‑symbolic framework that integrates large language models (LLMs) with static application security testing (SAST) to overcome the limitations of traditional rule‑based tools such as CodeQL, Semgrep, and SonarQube. Traditional SAST tools rely on handcrafted queries, suffer from high false‑positive rates, and can only detect vulnerabilities that match predefined patterns. Recent attempts to augment SAST with LLMs have mostly used the models to triage or filter the output of existing tools, leaving the core problem of semantic understanding unaddressed.
QRS inverts this paradigm by placing LLM‑driven reasoning at the front of the pipeline. It consists of three autonomous agents:
- Query (Q) Agent – Given a lightweight schema file that defines CodeQL primitives and a handful of few-shot examples, the Q agent synthesizes CodeQL queries from high-level vulnerability hypotheses (e.g., "detect insecure deserialization of user-controlled data"). It first gathers package metadata (file paths, imports, etc.) via a custom utility, then iteratively refines the generated queries. When a query fails to compile, the agent isolates the error, attempts automatic fixes, and retries up to a configurable budget (default: three attempts). Successful queries are executed locally against a CodeQL database, and results are returned in JSON/SARIF format.
- Review (R) Agent – The R agent takes the raw findings, performs data-flow and control-flow analysis, maps each finding to a CWE, and evaluates exploitability. Crucially, it uses the LLM to generate exploitation suggestions and proof-of-concept (PoC) snippets, thereby providing semantic verification that the identified code path can actually be weaponized. This step dramatically reduces false positives by requiring evidence beyond mere pattern matches.
- Sanitize (S) Agent – The final agent consolidates the evidence, removes residual noise, and assigns high-level labels such as MITRE ATT&CK tactics. It operates in an "evidence-only" mode, meaning that it does not re-expose the source code to the LLM, preserving privacy and limiting token consumption. The output is a concise, high-confidence vulnerability report with severity, CWE, and suggested remediation.
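The Q agent's compile-fix-retry loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_query`, `compile_query`, and `fix_query` are hypothetical stand-ins for the LLM and CodeQL CLI calls, and the three-attempt default mirrors the budget mentioned above.

```python
# Sketch of the Q agent's iterative query-refinement loop.
# The callables are hypothetical stand-ins: the real agent would call an
# LLM for generation/repair and the CodeQL CLI for compilation.

def synthesize_query(hypothesis, generate_query, compile_query, fix_query,
                     max_attempts=3):
    """Generate a CodeQL query for a vulnerability hypothesis, retrying
    compilation failures up to a configurable budget (default: three)."""
    query = generate_query(hypothesis)
    for _attempt in range(max_attempts):
        ok, error = compile_query(query)
        if ok:
            return query  # compiles cleanly: hand off for local execution
        # Isolate the compiler error and ask the model for a repaired query.
        query = fix_query(query, error)
    return None  # budget exhausted: discard this hypothesis
```

The point of the explicit budget is that a query which never compiles is dropped rather than looping indefinitely, keeping cost per hypothesis bounded.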
The framework is model-agnostic, built on the LiteLLM abstraction layer, and can work with Gemini, Claude, OpenAI GPT, and DeepSeek models. In the experiments, the authors used a larger model (e.g., GPT-5.1) for query synthesis and a smaller model (GPT-4.1) for review and sanitization, balancing generation quality with cost. Temperature settings of 0 and 1 were explored to control hallucination risk.
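Because all model calls go through LiteLLM, swapping providers reduces to changing one model string per agent role. The sketch below illustrates such a routing table; the model names reflect the configuration reported above, but the per-role temperature split and the `run_agent` wrapper are assumptions for illustration, and the actual `litellm.completion` call is left commented out since it requires API credentials.

```python
# Per-role model routing: a larger model for query synthesis, smaller ones
# for review and sanitization. The temperature assignment here is an
# illustrative assumption, not the paper's documented configuration.
ROLE_MODELS = {
    "query":    {"model": "gpt-5.1", "temperature": 1},  # exploratory generation
    "review":   {"model": "gpt-4.1", "temperature": 0},  # deterministic checking
    "sanitize": {"model": "gpt-4.1", "temperature": 0},
}

def run_agent(role, messages):
    """Dispatch a chat request for an agent role (hypothetical wrapper)."""
    cfg = ROLE_MODELS[role]
    # import litellm
    # return litellm.completion(model=cfg["model"],
    #                           temperature=cfg["temperature"],
    #                           messages=messages)
    return cfg  # placeholder so the sketch runs without API credentials
```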
Evaluation
Two complementary evaluations were performed:
- Re-creation of 20 historical CVEs in popular PyPI libraries. Using pre-defined high-level instructions, QRS correctly identified 18 of the 20 vulnerabilities, achieving a detection accuracy of 90.6%. The two missed cases involved complex dynamic logic that static analysis alone cannot capture.
- Large-scale scan of the 100 most-downloaded PyPI packages (as of November 2025). QRS discovered 39 medium-to-high severity issues. Of these, 5 were responsibly disclosed and assigned new CVE identifiers (CVSS 4.5–8.6), another 5 prompted maintainers to update documentation, and the remaining 29 were independently reported by other researchers, confirming the practical relevance of the findings. The discovered vulnerabilities include exotic classes such as TOCTOU race conditions, ASN.1 octet-string memory exhaustion, and multi-step path traversal chains—patterns that traditional SAST tools typically miss.
Performance-wise, each package scan completed in roughly 12 minutes on average, and total token consumption stayed below 1M tokens per package, demonstrating feasibility for integration into continuous integration/continuous deployment (CI/CD) pipelines.
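CI/CD integration is helped by the fact that findings flow between the agents (and out of the pipeline) as SARIF, a standard interchange format most CI systems already consume. A minimal reader for that format, assuming standard SARIF 2.1.0 structure, looks like:

```python
import json

def iter_sarif_findings(sarif_text):
    """Yield (rule_id, file, line, message) tuples from a SARIF 2.1.0 log."""
    log = json.loads(sarif_text)
    for run in log.get("runs", []):
        for result in run.get("results", []):
            loc = result.get("locations", [{}])[0].get("physicalLocation", {})
            yield (
                result.get("ruleId"),
                loc.get("artifactLocation", {}).get("uri"),
                loc.get("region", {}).get("startLine"),
                result.get("message", {}).get("text"),
            )
```

Emitting SARIF rather than a bespoke report format means the same output can feed the R agent, a CI annotation step, or a code-scanning dashboard without conversion.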
Limitations
The authors acknowledge several scope restrictions: (1) the framework does not analyze compiled extensions or native binaries, (2) it cannot detect purely dynamic vulnerabilities that require runtime execution or fuzzing, (3) supply‑chain attacks at the package distribution level are out of scope, and (4) adversarial code designed to evade LLM‑guided query synthesis is not evaluated. Moreover, the quality of the initial schema and few‑shot examples heavily influences the Query agent’s success, suggesting a need for domain‑specific prompt engineering.
Contributions and Future Work
QRS’s primary contribution is the demonstration that LLMs can be used not only for post‑processing but for the generation of precise static analysis queries, followed by automated semantic validation and evidence‑based sanitization. This neuro‑symbolic triad yields higher precision and recall than either pure SAST or pure LLM approaches. Future directions include extending support to other CodeQL‑supported languages (C/C++, Java, Go, JavaScript), integrating binary analysis for native extensions, developing defenses against adversarial evasion, and conducting large‑scale production trials in enterprise CI/CD environments.
In summary, QRS showcases a practical, scalable, and cost‑effective method for autonomous vulnerability discovery that bridges the gap between handcrafted static analysis rules and the semantic reasoning power of modern LLMs, offering a promising path forward for securing open‑source software supply chains.