What Makes a Good LLM Agent for Real-world Penetration Testing?
LLM-based agents show promise for automating penetration testing, yet reported performance varies widely across systems and benchmarks. We analyze 28 LLM-based penetration testing systems and evaluate five representative implementations across three benchmarks of increasing complexity. Our analysis reveals two distinct failure modes: Type A failures stem from capability gaps (missing tools, inadequate prompts) that engineering readily addresses, while Type B failures persist regardless of tooling due to planning and state management limitations. We show that Type B failures share a root cause that is largely invariant to the underlying LLM: agents lack real-time task difficulty estimation. As a result, agents misallocate effort, over-commit to low-value branches, and exhaust context before completing attack chains. Based on this insight, we present Excalibur, a penetration testing agent that couples strong tooling with difficulty-aware planning. A Tool and Skill Layer eliminates Type A failures through typed interfaces and retrieval-augmented knowledge. A Task Difficulty Assessment (TDA) mechanism addresses Type B failures by estimating tractability through four measurable dimensions (horizon estimation, evidence confidence, context load, and historical success) and uses these estimates to guide exploration-exploitation decisions within an Evidence-Guided Attack Tree Search (EGATS) framework. Excalibur achieves up to 91% task completion on CTF benchmarks with frontier models (39 to 49% relative improvement over baselines) and compromises 4 of 5 hosts on the GOAD Active Directory environment versus 2 by prior systems. These results show that difficulty-aware planning yields consistent end-to-end gains across models and addresses a limitation that model scaling alone does not eliminate.
💡 Research Summary
The paper investigates what makes a large‑language‑model (LLM)‑driven agent effective for real‑world penetration testing. The authors first conduct a systematic survey of 28 LLM‑based pentesting systems published between 2023 and 2025, extracting architectural dimensions such as tool integration, knowledge sources, and planning mechanisms. From this pool they select five open‑source representatives—PentestGPT (copilot), AutoPT (single‑agent), PentestAgent (multi‑agent with RAG), VulnBot (tri‑phase multi‑agent), and Cochise (Active Directory specialist)—and evaluate them on three benchmarks of increasing realism: XBO (104 web‑focused CTF challenges), the PentestGPT Benchmark (13 machines from Hack The Box and VulnHub), and GOAD (a five‑host enterprise Active Directory environment).
The evaluation is performed with four LLM back‑ends (GPT‑4o, GPT‑5, Gemini‑3‑Flash, Claude Sonnet 4) to separate model improvements from architectural contributions. Results show that as underlying models advance, performance gaps between architectures shrink dramatically (e.g., from a 44% spread on XBO with GPT‑4o to only 22.5% with GPT‑5). This convergence indicates that many existing designs primarily compensate for transient model limitations (limited context windows, poor tool knowledge, weak instruction following) rather than addressing persistent challenges intrinsic to penetration testing.
Through a detailed failure‑mode analysis of 200 unsuccessful runs, the authors identify two orthogonal failure categories. Type A (Capability Gaps) arise from missing tools or inadequate prompting; these can be remedied with better tool interfaces and prompt engineering. Type B (Complexity Barriers) persist even when tooling is perfect and stem from deficiencies in long‑horizon planning, state management, and, crucially, the lack of real‑time task‑difficulty estimation. Agents without difficulty awareness over‑invest in low‑value branches, exhaust their context budget, and fail to transition from reconnaissance to exploitation because they cannot gauge evidence confidence or the remaining horizon.
To validate this diagnosis, the authors augment existing agents with a lightweight difficulty‑assessment module. This addition reduces Type B failures from 58% to 27% while leaving Type A failures unchanged, confirming that difficulty‑aware decision making is the missing piece for complex attack chains.
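A rough illustration of what such a lightweight gate can look like: abandon a branch once its projected token cost would overrun the context window, instead of discovering the overflow mid‑exploit. The function name, the per‑step cost model, and the 128k limit are illustrative assumptions, not details from the paper.

```python
# Hypothetical difficulty gate: estimate whether finishing the current
# branch fits in the remaining context budget, and bail out early if not.
def should_abandon(est_remaining_steps: int,
                   avg_tokens_per_step: int,
                   tokens_used: int,
                   context_limit: int = 128_000) -> bool:
    """Return True if completing this branch would likely exhaust the context."""
    projected = tokens_used + est_remaining_steps * avg_tokens_per_step
    return projected > context_limit

# A long branch on a nearly full context is pruned before it wastes budget:
print(should_abandon(est_remaining_steps=20, avg_tokens_per_step=4_000,
                     tokens_used=90_000))   # True: 170k projected > 128k
print(should_abandon(est_remaining_steps=3, avg_tokens_per_step=2_000,
                     tokens_used=40_000))   # False: 46k projected fits
```

Even this crude linear projection captures the behavior the study isolates: the gate changes nothing about tooling (Type A), only about when the agent stops spending context on a branch (Type B).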
Building on these insights, the paper introduces Excalibur (named PENTESTGPT V2 in the manuscript), an agent architecture designed to eliminate both failure types. The system comprises:
- Tool and Skill Layer – a typed interface exposing 38 security tools and composable attack skills, enriched with Retrieval‑Augmented Generation (RAG) knowledge bases. This layer directly addresses Type A gaps.
- Task Difficulty Assessment (TDA) – a real‑time estimator that quantifies four dimensions:
  - Horizon Estimation (expected remaining steps),
  - Evidence Confidence (trustworthiness of gathered data),
  - Context Load (current token consumption),
  - Historical Success (success rate on similar sub‑tasks).
- Evidence‑Guided Attack Tree Search (EGATS) – an algorithm that uses TDA scores to balance exploration vs. exploitation, prune intractable branches, and trigger early abandonment before context exhaustion.
- External Memory Subsystem – a structured store that persists state beyond the LLM’s context window, preventing forgetting during multi‑step campaigns.
The authors evaluate Excalibur with frontier models (Claude Opus 4.5, GPT‑5.2, Gemini‑3 Pro). On the XBO benchmark it achieves a peak 91% task‑completion rate (average 89%), a 49% relative improvement over the best baseline (61%). On the PentestGPT Benchmark it successfully compromises 12 of 13 machines, including the hardest targets where prior agents stall after initial reconnaissance. In the GOAD environment it gains control of 4 out of 5 hosts, demonstrating effective lateral movement, Kerberoasting, NTLM relay, and credential chaining—double the performance of earlier systems.
Ablation studies show that the Tool Layer dominates performance on short‑horizon tasks, while TDA‑EGATS and the external memory are responsible for gains on deep, multi‑step scenarios. Removing the memory subsystem leads to rapid context overflow and failure in extended attacks.
The discussion acknowledges remaining limitations: novel zero‑day exploits requiring creative reasoning, adversarial defenses (e.g., honeypots, deceptive banners), and long‑duration campaigns that exceed current LLM reasoning and memory capacities. The authors argue that fully autonomous penetration testing remains an open challenge and propose future research directions: refining difficulty estimators, integrating multi‑agent collaboration with human‑in‑the‑loop oversight, and enhancing robustness against adversarial environmental changes.
Finally, the paper releases all code, tool interfaces, and evaluation scripts as open‑source artifacts to promote reproducibility and community advancement.