Multi-Agent Code Verification via Information Theory

Reading time: 6 minutes

📝 Original Info

  • Title: Multi-Agent Code Verification via Information Theory
  • ArXiv ID: 2511.16708
  • Date: 2025-12-05
  • Authors: Shreshth Rajan (Noumenon Labs, Harvard University)

📝 Abstract

LLMs generate buggy code: 29.6% of SWE-bench solved patches fail, 62% of BaxBench solutions have vulnerabilities, and existing tools only catch 65% of bugs with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four specialized agents to detect different types of bugs. We prove mathematically that combining agents with different detection patterns finds more bugs than any single agent when the agents look for different problems, using submodularity of mutual information under conditional independence. Measuring agent correlation of rho = 0.05 to 0.25 confirms they detect different bugs. Testing on 99 code samples with verified labels shows our system catches 76.1% of bugs, matching the best existing method (Meta Prompt Testing: 75%) while running faster and without test execution. We tested all 15 agent combinations and found that using multiple agents improves accuracy by 39.7 percentage points (from 32.8% to 72.4%) compared to single agents, with diminishing returns of +14.9pp, +13.5pp, and +11.2pp for agents 2, 3, and 4, validating our theoretical model. The best two-agent combination (Correctness + Performance) reaches 79.3% accuracy. Testing on 300 real patches from Claude Sonnet 4.5 runs in under 200ms per sample, making this practical for production use.
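The low inter-agent correlation claim (ρ = 0.05–0.25) can be illustrated numerically. The minimal sketch below computes Pearson's ρ between two agents' binary verdict vectors; the vectors are hypothetical, not the paper's data:

```python
# Sketch: measuring pairwise agent correlation on binary detection vectors.
# The verdict vectors are synthetic; the paper reports rho = 0.05-0.25 on its corpus.
import math

def pearson(x, y):
    """Pearson correlation between two equal-length binary vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((a - my) ** 2 for a in y) / n)
    return cov / (sx * sy) if sx and sy else 0.0

# 1 = agent flagged the sample as buggy, 0 = passed (hypothetical outputs)
correctness = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
security    = [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]

rho = pearson(correctness, security)
print(f"rho = {rho:.2f}")  # prints rho = -0.20 for these vectors
```

A |ρ| near zero means the agents rarely flag the same samples, which is the precondition the paper's submodularity argument needs.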


📄 Full Content

Multi-Agent Code Verification via Information Theory

Shreshth Rajan
Noumenon Labs, Harvard University
shreshthrajan@college.harvard.edu
October 2025

Abstract

LLMs generate buggy code: 29.6% of SWE-bench “solved” patches fail, 62% of BaxBench solutions have vulnerabilities, and existing tools only catch 65% of bugs with 35% false positives. We built CodeX-Verify, a multi-agent system that uses four specialized agents to detect different types of bugs. We prove mathematically that combining agents with different detection patterns finds more bugs than any single agent when the agents look for different problems, using submodularity of mutual information under conditional independence. Measuring agent correlation of ρ = 0.05–0.25 confirms they detect different bugs. Testing on 99 code samples with verified labels shows our system catches 76.1% of bugs, matching the best existing method (Meta Prompt Testing: 75%) while running faster and without test execution. We tested all 15 agent combinations and found that using multiple agents improves accuracy by 39.7 percentage points (from 32.8% to 72.4%) compared to single agents, with diminishing returns of +14.9pp, +13.5pp, and +11.2pp for agents 2, 3, and 4, validating our theoretical model. The best two-agent combination (Correctness + Performance) reaches 79.3% accuracy. Testing on 300 real patches from Claude Sonnet 4.5 runs in under 200ms per sample, making this practical for production use.

Keywords: Multi-agent systems, Code verification, LLM-generated code, Information theory

1 Introduction

LLMs generate code that looks correct but often fails in production. While LLM-generated code passes basic syntax checks and simple tests, recent studies show it contains hidden bugs. Xia et al. [28] find that 29.6% of patches marked “solved” on SWE-bench don’t match what human developers wrote, with 7.8% failing full test suites despite passing initial tests.
SecRepoBench reports that LLMs write secure code <25% of the time across 318 C/C++ tasks [8], and BaxBench finds 62% of backend code has vulnerabilities or bugs [26]. Studies suggest 40–60% of LLM code contains undetected bugs [13], making automated deployment risky.

The Problem. Existing verification tools check code in one way at a time, missing bugs that require looking from multiple angles. Traditional static analyzers (SonarQube, Semgrep, CodeQL) catch 65% of bugs but flag good code as buggy 35% of the time [22]. Test-based methods like Meta Prompt Testing [27] achieve better false positive rates (8.6%) by running code variants and comparing outputs, but require expensive test infrastructure and miss security holes (SQL injection) and quality issues that don’t affect outputs. LLM review systems like AutoReview [3] improve security detection by 18.72% F1 but only focus on security, not correctness or performance. No existing work explains mathematically why using multiple agents should work better than using one.

Our Approach. We built CodeX-Verify, a system that runs four specialized agents in parallel: Correctness (logic errors, edge cases, exception handling), Security (OWASP Top 10, CWE patterns, secrets), Performance (algorithmic complexity, resource leaks), and Style (maintainability, documentation). Each agent looks for different bug types. We prove that combining agents finds more bugs than any single agent using submodularity of mutual information under conditional independence: I(A1, A2, A3, A4; B) > max_i I(Ai; B). Measuring how often our agents agree shows correlation ρ = 0.05–0.25, confirming they catch different bugs.

Results. We tested on 99 code samples with verified labels covering 16 bug categories from real SWE-bench failures. Our system catches 76.1% of bugs, matching Meta Prompt Testing (75%) [27] while running faster and without executing code.
We improve 28.7 percentage points over Codex (40%) and 3.7 points over traditional static analyzers (65%). Our 50% false positive rate is higher than test-based methods (8.6%) because we flag security holes and quality issues that don’t affect test outputs, a tradeoff appropriate for enterprise deployments that prioritize security over minimizing false alarms.

We tested all 15 combinations of agents: single agents (4 configs), pairs (6 configs), triples (4 configs), and the full system. Results show progressive improvement: 1 agent (32.8% avg) → 2 agents (+14.9pp) → 3 agents (+13.5pp) → 4 agents (+11.2pp), totaling 39.7 percentage points gain. This exceeds AutoReview’s +18.72% F1 improvement [3] and confirms the mathematical prediction that combining agents with different detection patterns works. The diminishing gains (+14.9pp, +13.5pp, +11.2pp) match our theoretical model.

Contributions. 1. Mathematical proof via submodularity of mutual information that combining agents with conditionally independent detection patterns finds more bugs than any single agent: I(A1, . . . , An; B) > max_i I(Ai; B).

…(Full text truncated)…
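The paper's central inequality, I(A1, ..., An; B) > max_i I(Ai; B), can be checked empirically on any labelled sample set. A minimal stdlib-only sketch (the verdicts below are synthetic, not the paper's corpus):

```python
# Sketch: empirical mutual information I(agents; bug) on a toy labelled set,
# illustrating the inequality I(A1,...,An; B) > max_i I(Ai; B).
# All verdicts below are synthetic, not the paper's data.
from collections import Counter
from math import log2

def mutual_info(xs, ys):
    """I(X; Y) in bits from paired samples (X may be a tuple of agent outputs)."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

bug = [1, 1, 1, 1, 0, 0, 0, 0]           # ground-truth labels (1 = buggy)
agents = {                                # 1 = agent flagged the sample
    "correctness": [1, 1, 0, 0, 0, 0, 0, 0],
    "security":    [0, 0, 1, 1, 0, 0, 1, 0],
}

single = {name: mutual_info(a, bug) for name, a in agents.items()}
joint = mutual_info(list(zip(*agents.values())), bug)
print({k: round(v, 3) for k, v in single.items()}, "joint:", round(joint, 3))
```

On this toy set the joint verdict carries more information about the bug label than either agent alone, mirroring the submodularity argument; real-world gains depend on the conditional-independence assumption holding.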


Reference

This content is AI-processed based on ArXiv data.
