Red-MIRROR: Agentic LLM-based Autonomous Penetration Testing with Reflective Verification and Knowledge-augmented Interaction

Red-MIRR OR: Agentic LLM-based Autonomous Penetration T esting with Reﬂecti ve V eriﬁcation and Kno wledge-augmented Interaction T ran Vy Khang a,b , Nguyen Dang Nguyen Khang a,b , Nghi Hoang Khoa a,b , Do Thi Thu Hien a,b , V an-Hau Pham a,b , Phan The Duy a,b, ∗ a Information Security Lab, University of Information T echnology , Ho Chi Minh City , V ietnam b V ietnam National University , Ho Chi Minh City , V ietnam Abstract W eb applications remain the dominant attack surface in c ybersecurity , where vulnerabilities such as SQL injection, XSS, and business logic ﬂaws continue to cause signiﬁcant data breaches. While penetration testing is e ﬀ ectiv e for identifying these weaknesses, traditional manual approaches are time-consuming and heavily dependent on scarce expert knowledge. Recent Large Language Models (LLM)-based multi-agent systems hav e shown promise in automating penetration testing, yet they still su ﬀ er from critical limitations: ov er-reliance on parametric kno wledge, fragmented session memory , and insu ﬃ cient v alidation of attack payloads and responses. This paper proposes Red-MIRR OR, a nov el multi-agent automated penetration testing system that introduces a tightly coupled memory–reﬂection backbone to explicitly gov ern inter-agent reasoning. By synthesizing Retriev al-Augmented Generation (RA G) for e xternal kno wledge augmentation, a Shared Recurrent Memory Mechanism (SRMM) for persistent state management, and a Dual-Phase Reﬂection mechanism for adaptive v alidation, Red-MIRROR pro vides a robust solution for complex web e xploitation. Empirical ev aluation on the XBOW benchmark and V ulhub CVEs sho ws that Red-MIRR OR achieves performance comparable to state-of-the-art agents on V ulhub scenarios, while demonstrating a clear advantage on the XBO W benchmark. On the XBO W benchmark, Red-MIRR OR attains an overall success rate of 86.0 percent, outperforming PentestAgent (50.0 percent), AutoPT (46.0 percent), and the V ulnBot baseline (6.0 percent). Furthermore, the system achiev es a 93.99 percent subtask completion rate, indicating strong long-horizon reasoning and payload reﬁnement capability . Finally , we discuss ethical implications and propose safeguards to mitigate misuse risks. K eywor ds: Penetration T esting, Lar ge Language Model, Multi-Agent System, W eb Application 1. Introduction The rapid proliferation of web applications has substantially expanded the e xternally accessible attack surface of modern digital infrastructures. The V erizon Data Breach In vestigations Report (DBIR) 2025 identiﬁes Basic W eb Application Attacks as one of the ﬁv e most prev alent breach patterns, accounting for approximately 12% of ov er 12,000 conﬁrmed data breaches ana- lyzed worldwide [ 1 ]. In absolute terms, this corresponds to more than one thousand real-world compromise incidents attributable to web-facing systems, demonstrating that web applications re- main a consistently e xploited entry point in lar ge-scale breach scenarios. Trends in vulnerability reporting further highlight the prominence of application-layer weaknesses. According to O W ASP’ s T op 10 classiﬁcations, the number of publicly recorded CVE entries mapped to Injection-related weaknesses increased from approximately 32,000 in 2021 to around 64,000 in 2025, reﬂecting a substantially lar ger corpus of publicly docu- mented injection vulnerabilities in recent vulnerability datasets ∗ Corresponding author Email addr esses: 22520628@gm.uit.edu.vn (Tran Vy Khang), 22520617@gm.uit.edu.vn (Nguyen Dang Nguyen Khang), khoanh@uit.edu.vn (Nghi Hoang Khoa), hiendtt@uit.edu.vn (Do Thi Thu Hien), haupv@uit.edu.vn (V an-Hau Pham), duypt@uit.edu.vn (Phan The Duy) [ 2 , 3 ]. Giv en their inherent public accessibility and direct expo- sure to untrusted inputs, vulnerabilities such as SQL Injection (SQLi) and Cross-Site Scripting (XSS) continue to present a critical security challenge, enabling attackers to compromise sensitiv e user and enterprise data [ 3 , 4 ]. Penetration testing is widely regarded as one of the most e ﬀ ectiv e approaches for identifying web security vulnerabilities, assessing their severity , and proposing appropriate mitigation strategies [ 5 ]. Howe ver , traditional penetration testing often requires organizations to in vest signiﬁcant time and ﬁnancial resources in b uilding and maintaining highly skilled security teams with years of practical experience [ 6 , 7 ]. T o address these limitations in cost and human resources, automated security testing solutions ha ve been acti vely studied and developed, particularly those integrating Large Language Models (LLMs) with multi-agent systems [ 8 , 9 , 10 ]. Such LLM- based approaches demonstrate strong capabilities in penetration testing through e ﬀ ecti ve pattern-matching for vulnerability de- tection, handling uncertainty in dynamic en vironments, and cost-e ﬀ ectiv e integration [ 11 ]. While the automation of pene- tration testing o ﬀ ers signiﬁcant promise for defensi ve security , it also introduces dual-use risks, as the same capabilities could be repurposed for malicious activities [ 12 ]. Consequently , the dev elopment of such systems must be framed within a “Re- sponsible AI Research” paradigm, emphasizing their role as Pr eprint submitted to Elsevier Mar ch 31, 2026 tools for red teaming, vulnerability discov ery , and proacti ve de- fense rather than enabling cybercrime. Recent studies such as V ulnBot [ 13 ], PentestAgent [ 14 ], xO ﬀ ense [ 15 ], Autopentest [ 16 ], AutoPT [ 17 ] and PTFusion [ 18 ] demonstrate that LLM- based agents, when combined with multi-agent architectures and external kno wledge sources through Retriev al-Augmented Generation (RAG), are capable of e ﬀ ectiv ely exploiting web application vulnerabilities. Howe ver , despite the above advancements, sev eral funda- mental architectural bottlenecks hinder their reliability and e ﬀ ec- tiv eness in real-world scenarios, which can be categorized into three critical limitations. First, existing systems su ﬀ er from inef- ﬁcient memory and session management [ 13 , 14 , 17 ]. Informa- tion maintained within a single conv ersational history often leads to memory fragmentation, resulting in degraded performance in long-horizon attack scenarios. As interactions accumulate, criti- cal tokens from early reconnaissance phases may be truncated or diluted by later outputs, prev enting the agent from e ﬀ ectiv ely linking initial discov eries to subsequent exploitation steps. For example, V ulnBot [ 13 ] reports challenges related to context loss and session continuity , particularly in multi-phase penetration workﬂo ws. This limitation hampers coherent long-horizon rea- soning [ 19 ], which is essential for e xecuting complex, multi-step attack chains. Second, current approaches lack robust quality control for request handling [ 13 , 14 , 17 ]. W ithout a mechanism to v alidate payloads before ex ecution, agents can repeatedly issue malformed requests, increasing detection risks. For in- stance, selecting an in valid injection point can prematurely f ail an entire testing workﬂo w without meaningful feedback. Third, integrated testing tools are often insu ﬃ ciently specialized for modern applications. Automated frame works typically rely on general-purpose command-line tools such as Nmap or SQLMap, which lack the granularity needed to test complex, dynamic application logic found in contemporary web en vironments. T o bridge this gap and address these critical architectural lim- itations, we propose Red-MIRROR ( Red -teaming M ulti-agent I ntrospectiv e R easoning for R obust O ﬀ ensiv e R esearch), an automated multi-agent penetration testing system that enhances memory and session management during inter -agent interactions and testing workﬂo ws. In particular , our framework le verages Retriev al-Augmented Generation (RAG) to provide external knowledge relevant to penetration testing. In addition, Red- MIRR OR introduces mechanisms for validating, analyzing, and reﬁning requests and attack plans based on observ ed responses during testing, thereby establishing a closed-loop feedback pro- cess. Finally , Red-MIRR OR integrates a set of specialized web testing tools, further improving its e ﬀ ectiv eness in exploiting vulnerabilities in web applications. Moreov er , we contribute by constructing a ﬁne-tuning and training dataset for a mid-scale LLM to ev aluate the potential of self-hosted open-source models in comparison with commer- cial models. W e also dev elop a subtask-lev el benchmark based on a subset of the XBOW [ 20 ] benchmark and selected CVEs from V ulhub [ 21 ], allowing for a more detailed ev aluation of test cov erage and step-by-step ex ecution capabilities across the penetration testing workﬂo w . In summary , we make the follo wing four key contrib utions: • W e introduce Red-MIRROR , the ﬁrst multi-agent pen- etration testing framework that tightly couples a Shared Recurrent Memory Mechanism (SRMM) with Dual-Phase Reﬂection and RAG-augmented kno wledge. This back- bone explicitly governs inter-agent reasoning and adaptive validation, ov ercoming memory fragmentation and pay- load hallucination in long-horizon attacks. Extensiv e ev al- uation on the XBOW benchmark demonstrates an ov erall exploitation success rate of 86.0% (vs. 50.0% PentestA- gent, 46.0% AutoPT , and 6.0% V ulnBot baseline) together with a 93.99% subtask completion rate, establishing a ne w state-of-the-art in autonomous web penetration testing. • W e curate a specialized, high-ﬁdelity ﬁne-tuning dataset of 1,644 prompt-response pairs co vering CVE descriptions, CAPEC attack patterns, and MITRE A TT&CK techniques. By applying LoRA on Qwen2.5-14B, we demonstrate that a mid-scale open-source LLM can achieve competitiv e pentesting performance against commercial models, sig- niﬁcantly lo wering the barrier for self-hosted o ﬀ ensiv e AI research. • W e construct a ﬁne-grained subtask-lev el benchmark de- riv ed from XBOW subsets and real-w orld V ulhub CVEs. This benchmark enables precise measurement of recon- naissance, exploitation, and reﬂection capabilities across multi-step workﬂo ws, addressing a critical ev aluation gap in prior LLM-based pentesting studies. • W e explicitly address the dual-use risks of o ﬀ ensive AI through a comprehensive ethical analysis and propose practical safeguards (role-based access control, audit log- ging, and RA G knowledge gating) that ensure responsible deployment of Red-MIRR OR in authorized red-teaming en vironments. The remainder of this paper is organized as follows. Sec- tion 2 re views the background and related w ork on automated penetration testing and LLM-based agents. Section 3 describes the architecture of our Red-MIRR OR framework, detailing the components for memory management and reﬂecti ve reasoning. The experimental results and comparati ve analysis are presented in Section 4 . Section 5 provides a discussion on the ﬁndings, limitations, and ethical implications. Finally , we conclude our research and suggest future directions in Section 6 . 2. Background and Related W ork 2.1. P enetration T esting and Automation Challenges Penetration testing (pentest) is a comprehensiv e security as- sessment methodology that ev aluates the rob ustness of computer systems, networks, and applications by simulating real-world adversarial attacks. Unlike automated vulnerability scanning, which primarily detects known weaknesses using signature- based techniques, penetration testing emphasizes exploitabil- ity , impact assessment, and contextual risk validation through 2 adaptiv e attack execution [ 22 ]. This process often requires hu- man e xpertise to interpret system beha vior , adjust strategies, and reason ov er multiple intermediate ﬁndings. Standardized framew orks such as the Penetration T esting Execution Standard (PTES) [ 23 ] and NIST guidelines deﬁne pen- etration testing as a multi-phase and iterati ve process, typically encompassing reconnaissance, scanning and analysis, e xploita- tion, and reporting. These phases are tightly interdependent: information gathered during exploitation frequently informs further reconnaissance, while partial failures may require refor - mulating attack strategies. In web application security , e ﬀ ectiv e penetration testing additionally demands session awareness, un- derstanding of application logic, and dynamic payload adapta- tion in response to server-side behaviors. V ulnerabilities such as SQL Injection, Cross-Site Scripting (XSS), and authentica- tion bypasses often require multi-step interactions rather than single-shot exploits. Despite its e ﬀ ectiveness, traditional penetration testing is time-consuming, labor -intensive, and di ﬃ cult to scale across large or continuously ev olving systems. Modern web appli- cations frequently undergo rapid dev elopment cycles, making manual testing insu ﬃ cient to keep pace with emerging vulnera- bilities. As a result, there is a growing demand for automated pentesting solutions that can replicate expert reasoning while maintaining e ﬃ ciency and scalability . Recent studies such as V ulnBot [ 13 ], PentestAgent [ 14 ], xO ﬀ ense [ 15 ], Autopentest [ 16 ], AutoPT [ 17 ] and PTFusion [ 18 ] demonstrate that LLM- based agents, when combined with multi-agent architectures and external kno wledge sources through Retriev al-Augmented Generation (RA G), can e ﬀ ectively exploit web application vul- nerabilities. Howe ver , despite these advancements, these systems exhibit sev eral common limitations. First, systems such as V ulnBot, PentestAgent, and xO ﬀ ense su ﬀ er from ine ﬃ cient memory and session management still leads to context fragmentation in long- horizon multi-phase workﬂo ws. Second, approaches including V ulnBot, AutoPT , and Autopentest lack robust v alidation mech- anisms, often resulting in blind or malformed payload ex ecution without systematic veriﬁcation due to persistent LLM hallucina- tions and insu ﬃ cient auto-checks. Third, existing frame works such as AutoPT , and PentestAgent rely on loosely integrated or general-purpose tools leading to ine ﬃ cient and inconsistent exploitation beha vior caused by imprecise command generation and heterogeneous output fusion. Automating penetration testing remains challenging due to its reliance on long-horizon reasoning, uncertainty management, and expert decision-making [ 19 , 24 ]. While recent systems ha ve incorporated scripted workﬂo ws or rule-based engines, such approaches struggle to generalize across heterogeneous targets and complex application logic. More recent research has ex- plored the use of intelligent agents to replicate portions of e xpert reasoning, enabling automated systems to conduct reconnais- sance, select candidate attack vectors, and v alidate exploitation outcomes. Howe ver , existing automated pentesting framew orks often su ﬀ er from fragmented context management, limited adapt- ability , and unreliable veriﬁcation of exploit success, particularly in extended testing sessions. These limitations motiv ate the ex- ploration of reasoning-centric and memory-aware approaches that can better capture the iterative and stateful nature of real- world penetration testing. 2.2. Lar ge Language Models for Security Reasoning and Knowl- edge A ugmentation LLMs are neural architectures trained on large-scale textual corpora to perform a wide range of language understanding and generation tasks. Most contemporary LLMs are b uilt upon the T ransformer architecture, which employs self-attention mecha- nisms to model long-range dependencies within input sequences [ 25 ]. These properties make LLMs suitable for security-related applications, including vulnerability reasoning, payload gen- eration, HTTP request analysis, and interpretation of server responses [ 26 ]. T wo primary strategies are commonly used to adapt LLMs to domain-speciﬁc tasks: prompt engineering and ﬁne-tuning. Prompt engineering exploits in-context learning by carefully constructing input prompts that include structured instructions or few-shot examples, guiding model behavior without mod- ifying model parameters. While this approach enables rapid experimentation and deployment, it is constrained by ﬁnite con- text windows and often exhibits unstable performance when applied to complex, multi-step reasoning tasks such as penetra- tion testing workﬂo ws [ 19 ]. Fine-tuning, in contrast, inv olves further training a pre- trained LLM on task-speciﬁc datasets to align its internal repre- sentations with domain requirements. Parameter-e ﬃ cient ﬁne- tuning techniques, such as Low-Rank Adaptation (LoRA) [ 27 ], signiﬁcantly reduce computational costs while enabling e ﬀ ec- tiv e specialization. In security conte xts, ﬁne-tuning can enhance a model’ s understanding of exploit patterns, vulnerability tax- onomies, and structured output formats, leading to improved consistency and reliability in automated systems. Despite these advances, LLMs inherently su ﬀ er from knowl- edge staleness and hallucination, particularly in fast-e volving do- mains such as cybersecurity . Retriev al-Augmented Generation (RA G) [ 28 ] addresses these limitations by integrating e xternal knowledge retriev al mechanisms into the generation process. By augmenting prompts with relev ant documents retriev ed from curated kno wledge bases, RA G enables LLMs to access up-to- date vulnerability descriptions, e xploit techniques, and testing guidelines without frequent retraining. In penetration testing applications, grounding model outputs in authoritati ve sources such as OW ASP documentation or CVE databases improves factual accuracy , reduces hallucinations, and enables smaller or mid-scale LLMs to achiev e competitiv e performance in special- ized security tasks. 2.3. Multi-Agent Arc hitectures with Memory and Reﬂection Mechanisms Multi-agent systems consist of multiple autonomous agents that collaborate to achie ve comple x objectiv es through task de- composition and coordination [ 8 ]. This paradigm closely mirrors real-world penetration testing practices, where reconnaissance, exploitation, and analysis are typically performed by special- ists with distinct responsibilities. Recent LLM-based pentesting 3 framew orks increasingly adopt multi-agent architectures to dis- tribute roles such as attack planning, payload execution, and response interpretation, improving modularity and scalability . Howe ver , naiv e multi-agent implementations often rely on con versational message passing as the primary coordination mechanism. Such designs can lead to redundant communica- tion, inconsistent system states, and loss of critical contextual information during long-horizon tasks [ 29 ]. These issues are par- ticularly pronounced in penetration testing, where maintaining session state, tracking discovered endpoints, and reasoning ov er prior exploit attempts are essential for e ﬀ ecti ve attack planning. T o address these challenges, recent research has explored memory-augmented and recurrent T ransformer architectures that introduce shared or persistent memory abstractions for long-horizon reasoning [ 30 ]. Shared Recurrent Memory Mecha- nism (SRMM)–inspired approaches enable agents to accumulate salient information ov er time and access a common contextual state without relying solely on explicit message exchanges. At a system le vel, such shared memory mechanisms support consis- tent session handling, reduce redundant exploration, and enhance coordination across agents during extended penetration testing workﬂo ws. In addition to memory , reﬂection mechanisms ha ve been proposed to enable intelligent agents to e valuate outcomes and reﬁne future actions based on feedback [ 31 ]. Reﬂection is par - ticularly important in automated penetration testing to av oid re- peated failures, validate e xploit attempts, and adapt strategies to dynamic defenses such as input sanitization or W eb Application Firew alls. Dual-phase reﬂection paradigms conceptually sepa- rate local, short-term ev aluation from global, long-term strategy reﬁnement. By combining intra-reﬂection, which assesses the correctness of immediate actions, with inter-reﬂection, which analyzes outcomes across agents and iterations, automated sys- tems can iterativ ely improve decision-making quality . T ogether , memory-centric coordination and reﬂection-based adaptation form a conceptual foundation for more reliable, stable, and scal- able automated penetration testing systems. In contrast to prior systems that primarily introduce reﬂection as a local heuristic or prompt-lev el correction [ 14 ], Red-MIRR OR operationalizes reﬂection as a system-le vel control mechanism tightly coupled with persistent shared memory and planning. 3. Methodology 3.1. Motivating Example: Long-Context Chained Exploitation T o illustrate the challenges of long-horizon reasoning in autonomous penetration testing, consider a benchmark instance in volving a reﬂected XSS vulnerability protected by server-side ﬁltering. During the reconnaissance phase, the agent accesses the main page and discovers the endpoint /page . A request to /page?name=test rev eals that the parameter name is reﬂected inside an HTML attribute as follo ws, conﬁrming that user input is embedded within a quoted attribute conte xt. The exploitation phase requires iterati ve experimentation. Initial payloads such as are sanitized by the server . Subsequent attempts reveal a system- atic transformation: any injected substring matching the pattern <[a-z/] is remov ed or neutralized before rendering. For ex- ample, inputs beginning with