📝 Original Info
- Title: Multi-Agent Code Verification via Information Theory
- ArXiv ID: 2511.16708
- Date: 2025-12-05
- Authors: Shreshth Rajan (Noumenon Labs, Harvard University)
📝 Abstract
LLMs generate buggy code: 29.6% of SWE-bench "solved" patches fail, 62% of BaxBench solutions have vulnerabilities, and existing tools only catch 65% of bugs with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four specialized agents to detect different types of bugs. We prove mathematically that combining agents with different detection patterns finds more bugs than any single agent when the agents look for different problems, using submodularity of mutual information under conditional independence. Measured agent correlations of ρ = 0.05–0.25 confirm they detect different bugs. Testing on 99 code samples with verified labels shows our system catches 76.1% of bugs, matching the best existing method (Meta Prompt Testing: 75%) while running faster and without test execution. We tested all 15 agent combinations and found that using multiple agents improves accuracy by 39.7 percentage points (from 32.8% to 72.4%) compared to single agents, with diminishing returns of +14.9pp, +13.5pp, and +11.2pp for agents 2, 3, and 4, validating our theoretical model. The best two-agent combination (Correctness + Performance) reaches 79.3% accuracy. On 300 real patches from Claude Sonnet 4.5, the system runs in under 200ms per sample, making it practical for production use.
💡 Deep Analysis
Deep Dive into Multi-Agent Code Verification via Information Theory.
📄 Full Content
Multi-Agent Code Verification via Information Theory
Shreshth Rajan
Noumenon Labs, Harvard University
shreshthrajan@college.harvard.edu
October 2025
Abstract
LLMs generate buggy code: 29.6% of SWE-bench "solved" patches fail, 62% of BaxBench solutions have vulnerabilities, and existing tools only catch 65% of bugs with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four specialized agents to detect different types of bugs. We prove mathematically that combining agents with different detection patterns finds more bugs than any single agent when the agents look for different problems, using submodularity of mutual information under conditional independence. Measured agent correlations of ρ = 0.05–0.25 confirm they detect different bugs. Testing on 99 code samples with verified labels shows our system catches 76.1% of bugs, matching the best existing method (Meta Prompt Testing: 75%) while running faster and without test execution. We tested all 15 agent combinations and found that using multiple agents improves accuracy by 39.7 percentage points (from 32.8% to 72.4%) compared to single agents, with diminishing returns of +14.9pp, +13.5pp, and +11.2pp for agents 2, 3, and 4, validating our theoretical model. The best two-agent combination (Correctness + Performance) reaches 79.3% accuracy. On 300 real patches from Claude Sonnet 4.5, the system runs in under 200ms per sample, making it practical for production use.
Keywords: Multi-agent systems, Code verification, LLM-generated code, Information theory
1 Introduction
LLMs generate code that looks correct but often fails in production. While LLM-generated code passes basic syntax checks and simple tests, recent studies show it contains hidden bugs. Xia et al. [28] find that 29.6% of patches marked "solved" on SWE-bench don't match what human developers wrote, with 7.8% failing full test suites despite passing initial tests. SecRepoBench reports that LLMs write secure code less than 25% of the time across 318 C/C++ tasks [8], and BaxBench finds 62% of backend code has vulnerabilities or bugs [26]. Studies suggest 40–60% of LLM code contains undetected bugs [13], making automated deployment risky.
The Problem. Existing verification tools check code in one way at a time, missing bugs that require looking from multiple angles. Traditional static analyzers (SonarQube, Semgrep, CodeQL) catch 65% of bugs but flag good code as buggy 35% of the time [22]. Test-based methods like Meta Prompt Testing [27] achieve better false positive rates (8.6%) by running code variants and comparing outputs, but require expensive test infrastructure and miss security holes (e.g., SQL injection) and quality issues that don't affect outputs. LLM review systems like AutoReview [3] improve security detection by 18.72% F1 but focus only on security, not correctness or performance. No existing work explains mathematically why using multiple agents should work better than using one.
Our Approach. We built CodeX-Verify, a system that runs four specialized agents in parallel: Correctness (logic errors, edge cases, exception handling), Security (OWASP Top 10, CWE patterns, secrets), Performance (algorithmic complexity, resource leaks), and Style (maintainability, documentation). Each agent looks for different bug types. We prove that combining agents finds more bugs than any single agent using submodularity of mutual information under conditional independence: I(A1, A2, A3, A4; B) > max_i I(A_i; B). Measuring how often our agents agree shows correlation ρ = 0.05–0.25, confirming they catch different bugs.
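The information gain from combining diverse agents can be checked numerically. The sketch below uses a toy model (hypothetical detection rates, not the paper's measured values): a bug B is present with probability 0.5, and two agents fire conditionally independently given B, each with a different detection profile. The joint mutual information with B then strictly exceeds that of either agent alone.

```python
import itertools
import math

# Toy model (hypothetical numbers, for illustration only).
p_bug = 0.5
hit = {1: 0.8, 2: 0.6}          # P(A_i = 1 | B = 1): detection rate
false_alarm = {1: 0.1, 2: 0.2}  # P(A_i = 1 | B = 0): false-positive rate

def p_agent(i, a, b):
    """P(A_i = a | B = b) under the toy model."""
    p_fire = hit[i] if b else false_alarm[i]
    return p_fire if a else 1.0 - p_fire

def mutual_info(agents):
    """I(A_agents; B) in bits, summing over the joint distribution.

    Conditional independence given B means the conditional joint
    factorizes into per-agent terms.
    """
    mi = 0.0
    for b in (0, 1):
        pb = p_bug if b else 1.0 - p_bug
        for outcome in itertools.product((0, 1), repeat=len(agents)):
            p_a_given_b = math.prod(
                p_agent(i, a, b) for i, a in zip(agents, outcome))
            # Marginal P(A = outcome), mixing over both values of B.
            p_a = sum(
                (p_bug if bb else 1.0 - p_bug)
                * math.prod(p_agent(i, a, bb) for i, a in zip(agents, outcome))
                for bb in (0, 1))
            joint = pb * p_a_given_b
            if joint > 0:
                mi += joint * math.log2(p_a_given_b / p_a)
    return mi

best_single = max(mutual_info([1]), mutual_info([2]))
combined = mutual_info([1, 2])
print(f"best single-agent MI: {best_single:.3f} bits, joint MI: {combined:.3f} bits")
assert combined > best_single  # diverse agents carry strictly more information
```

This mirrors the paper's argument qualitatively: as long as each agent's false-alarm and detection profiles differ and the agents are conditionally independent given B, adding an agent never reduces, and here strictly increases, the information available about whether the code is buggy.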
Results. We tested on 99 code samples with verified labels covering 16 bug categories from real SWE-bench failures. Our system catches 76.1% of bugs, matching Meta Prompt Testing (75%) [27] while running faster and without executing code. We improve 28.7 percentage points over Codex (40%) and 3.7 points over traditional static analyzers (65%). Our 50% false positive rate is higher than test-based methods (8.6%) because we flag security holes and quality issues that don't affect test outputs, a tradeoff appropriate for enterprise deployments that prioritize security over minimizing false alarms.
We tested all 15 combinations of agents: single agents (4 configs), pairs (6 configs), triples (4 configs), and the full system. Results show progressive improvement: 1 agent (32.8% avg) → 2 agents (+14.9pp) → 3 agents (+13.5pp) → 4 agents (+11.2pp), totaling 39.7 percentage points of gain. This exceeds AutoReview's +18.72% F1 improvement [3] and confirms the mathematical prediction that combining agents with different detection patterns works. The diminishing gains (+14.9pp, +13.5pp, +11.2pp) match our theoretical model.
Contributions.
1. Mathematical proof via submodularity of mutual information that combining agents with conditionally independent detection patterns finds more bugs than any single agent: I(A1, . . . , An; B) > max_i I(A_i; B).
…(Full text truncated)…
This content is AI-processed based on ArXiv data.