CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment


Real-world offensive security operations are inherently open-ended: attackers explore unknown attack surfaces, revise hypotheses under uncertainty, and operate without guaranteed success. Existing evaluations of LLM-based offensive agents rely on closed-world settings with predefined goals and binary success criteria. To address this gap, we introduce CyberExplorer, an evaluation suite with two core components: (1) an open-environment benchmark built on a virtual machine hosting 40 vulnerable web services derived from real-world CTF challenges, where agents autonomously perform reconnaissance, target selection, and exploitation without prior knowledge of vulnerability locations; and (2) a reactive multi-agent framework supporting dynamic exploration without predefined plans. CyberExplorer enables fine-grained evaluation beyond flag recovery, capturing interaction dynamics, coordination behavior, failure modes, and vulnerability discovery signals, bridging the gap between existing benchmarks and realistic multi-target attack scenarios.


💡 Research Summary

The paper addresses a critical gap in the evaluation of large‑language‑model (LLM) based offensive security agents. Existing benchmarks such as NYU CTF Bench, Cybench, and CTFTiny treat each vulnerable service as an isolated, closed‑world task where the sole objective is to retrieve a flag. This setting ignores the open‑ended nature of real‑world attacks, where attackers must discover unknown services, prioritize targets, handle false positives, and operate under strict time and cost constraints.

To bridge this gap, the authors introduce CyberExplorer, a two‑part evaluation suite. The first part is a realistic virtual‑machine environment that hosts 40 web‑based vulnerable services, each packaged as an independent Docker container and drawn from public CTF platforms (NYU CTF, Google CTF, Hack The Box, etc.). All containers run on the same VM but do not communicate with each other, creating a noisy, partially observable attack surface that mimics a production host with many exposed ports and services. No prior knowledge of service types, vulnerability locations, or challenge boundaries is given to the agents.
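The layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual deployment scripts: the service names, port scheme, and image names below are all placeholders, and the per-container bridge networks are one plausible way to realize "containers on the same VM that do not communicate with each other".

```python
# Sketch: 40 isolated vulnerable services on one host.
# Assumptions (not from the paper): service names, base port, image names.

def build_service_layout(services, base_port=8000):
    """Assign each service a unique host port and its own bridge network,
    so containers share the VM but cannot reach one another."""
    layout = {}
    for i, name in enumerate(services):
        layout[name] = {
            "host_port": base_port + i,
            "network": f"isolated_{name}",  # one bridge network per container
        }
    return layout

def docker_run_command(name, image, spec):
    # Hypothetical invocation; the real challenge images are not specified here.
    return (
        f"docker network create {spec['network']} && "
        f"docker run -d --name {name} --network {spec['network']} "
        f"-p {spec['host_port']}:80 {image}"
    )

# 40 services, mirroring the benchmark's scale.
layout = build_service_layout([f"ctf_{i:02d}" for i in range(40)])
```

From an attacker's perspective, the resulting host simply exposes 40 unlabeled ports, which is what forces the agents to do genuine reconnaissance.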

The second part is a reactive, asynchronous multi‑agent framework. A Reconnaissance Agent first performs network scanning to build an attack‑surface map (entry points, ports, service banners). The Dispatcher then creates sub‑graphs for each discovered entry point and spawns parallel Explorer Agent teams. Each Explorer operates inside a sandboxed container equipped with common offensive tools (netcat, sqlmap, custom scripts) and is allocated a small monetary budget (e.g., $0.30) and a token window. Agents are short‑lived; after a bounded number of steps they hand off knowledge to a successor.
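The recon → dispatch → parallel-explore pipeline can be outlined as follows. This is a schematic sketch, not the paper's implementation: in CyberExplorer the reconnaissance and exploration steps are LLM agents driving real tools, whereas here they are stubs, and the per-step cost and step cap are illustrative assumptions (only the $0.30 budget comes from the summary).

```python
import concurrent.futures

BUDGET_USD = 0.30  # per-explorer budget mentioned in the paper
MAX_STEPS = 20     # illustrative step cap (assumption)

def recon(host_ports):
    # Stand-in for the Reconnaissance Agent, which would scan the host
    # and return an attack-surface map (entry points, ports, banners).
    return [{"port": p, "banner": f"service@{p}"} for p in host_ports]

def explore(entry, budget=BUDGET_USD, max_steps=MAX_STEPS):
    # Stand-in Explorer: a short-lived agent that spends budget per step
    # and hands off its exploration record when it expires.
    spent, record = 0.0, []
    for step in range(max_steps):
        cost = 0.01  # illustrative per-step cost
        if spent + cost > budget:
            break
        spent += cost
        record.append(f"step {step} on port {entry['port']}")
    return {"entry": entry, "spent": round(spent, 2), "record": record}

def dispatch(surface):
    # Dispatcher: one parallel Explorer team per discovered entry point.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        return list(pool.map(explore, surface))

results = dispatch(recon([8000, 8001, 8002]))
```

The key design point the sketch preserves is that explorers are independent and bounded: each entry point gets its own team, and no explorer can overrun its budget or step limit.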

Knowledge hand‑off is mediated by a Supervisor that synthesizes the previous agent’s exploration record, failed approaches, and any discovered evidence, then injects a refined hypothesis into the next agent’s system prompt. Agents also possess a Self‑Critic mechanism: when 50% or 80% of their budget is consumed, they reflect on their conversation history, identify unproductive patterns, and may request a budget extension (up to four times). If three consecutive agents fail to capture a flag, a dedicated Critic LLM intervenes mid‑execution to suggest pivots or alternative attack vectors. Early‑termination heuristics label an entry point as a “Dead‑End” if no medium‑severity findings appear after a configurable number of attempts, preventing wasteful exhaustive search.
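The control logic above reduces to a few simple triggers. The sketch below encodes them directly; the 50%/80% checkpoints, the cap of four extensions, and the three-failure Critic trigger are from the summary, while the dead-end attempt threshold is an assumption (the paper only says it is configurable).

```python
class BudgetTracker:
    """Tracks spend and fires self-critique at the paper's checkpoints."""
    REFLECT_AT = (0.5, 0.8)   # self-critique at 50% and 80% of budget
    MAX_EXTENSIONS = 4        # budget may be extended up to four times

    def __init__(self, budget):
        self.budget, self.spent, self.extensions = budget, 0.0, 0
        self._fired = set()

    def charge(self, cost):
        # Returns the checkpoints newly crossed; each triggers self-critique.
        self.spent += cost
        fired = [t for t in self.REFLECT_AT
                 if self.spent / self.budget >= t and t not in self._fired]
        self._fired.update(fired)
        return fired

    def extend(self, extra):
        if self.extensions >= self.MAX_EXTENSIONS:
            return False
        self.extensions += 1
        self.budget += extra
        return True

def should_invoke_critic(consecutive_failures):
    # A dedicated Critic LLM intervenes after three failed agents in a row.
    return consecutive_failures >= 3

def is_dead_end(findings, attempts, max_attempts=5):
    # max_attempts is illustrative; the paper leaves it configurable.
    # Dead-End: no medium-or-higher-severity findings after the attempt cap.
    return attempts >= max_attempts and not any(
        f["severity"] in ("medium", "high") for f in findings)
```

Keeping these heuristics outside the LLM loop means the framework, not the model, decides when to reflect, escalate, or abandon an entry point.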

The authors propose a richer metric suite beyond binary flag success: (1) Reconnaissance accuracy, (2) Exploration efficiency (findings per budget unit), (3) Coordination score (how well knowledge is shared across sub‑graphs), (4) Failure analysis (dead‑end rate, reuse of failed approaches), and (5) Overall flag recovery rate. This multidimensional evaluation captures the nuanced capabilities required for real penetration testing.
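Three of these metrics have straightforward definitions that can be written down directly. The sketch below is an interpretation of the summary's wording, not the paper's scoring code; the function names and input shapes are assumptions.

```python
def exploration_efficiency(findings, budget_spent):
    # Metric (2): findings produced per budget unit spent.
    return len(findings) / budget_spent if budget_spent else 0.0

def dead_end_rate(entry_points):
    # Part of metric (4): share of entry points abandoned as Dead-Ends.
    dead = sum(1 for e in entry_points if e["status"] == "dead_end")
    return dead / len(entry_points) if entry_points else 0.0

def flag_recovery_rate(flags_captured, total_services=40):
    # Metric (5): overall flag recovery over the 40-service benchmark.
    return flags_captured / total_services
```

Reconnaissance accuracy and the coordination score need richer inputs (ground-truth surface maps, cross-sub-graph knowledge traces) and are omitted here.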

Experiments compare several state‑of‑the‑art LLMs: closed‑source models (GPT‑5.2, Claude Opus 4.5, Gemini 3 Pro) and open‑source models (DeepSeek V3, Qwen 3). All agents run under identical budget constraints to ensure fairness. Results show that while the largest proprietary models achieve the highest raw flag counts, open‑source models can approach comparable performance when the self‑critique and critic interventions are effectively leveraged. Moreover, the frequency of Critic interventions correlates strongly with success under tight budgets, highlighting the importance of dynamic guidance.

Limitations are acknowledged: the benchmark currently focuses solely on web‑application vulnerabilities, omitting system‑level or network‑layer exploits such as privilege escalation or lateral movement. Docker isolation, while convenient, does not fully emulate side‑channel or resource‑contention effects present on physical hosts. Future work aims to expand the service mix to include SSH, RPC, and cloud‑native APIs, and to integrate real‑world cloud infrastructure for larger‑scale simulations.

In summary, CyberExplorer establishes a novel “open‑environment offensive security task” that more faithfully reflects real attacker workflows. Its multi‑agent architecture with self‑reflection, supervisor‑driven hypothesis propagation, and budget‑aware decision making provides a robust platform for benchmarking and advancing LLM‑driven offensive security agents. The work sets a new standard for evaluating not just whether an agent can capture a flag, but how efficiently, collaboratively, and intelligently it can conduct a full penetration‑testing engagement.

