Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks

February 23, 2026

Reading time: 6 minute

...

📝 Original Info

Title: Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks
ArXiv ID: 2512.14860
Date: 2025-12-16
Authors: Researchers from original ArXiv paper

📝 Abstract

Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. Although recent work by Unit 42 at Palo Alto Networks demonstrated that ChatGPT-4o successfully executes attacks as an agent that it refuses in chat mode, there is no comparative analysis in multiple models and frameworks. We conducted the first systematic penetration testing and comparative evaluation of agentic AI systems, testing five prominent models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) across two agentic AI frameworks (AutoGen and CrewAI) using a seven-agent architecture that mimics the functionality of a university information management system and 13 distinct attack scenarios that span prompt injection, Server Side Request Forgery (SSRF), SQL injection, and tool misuse. Our 130 total test cases reveal significant security disparities: AutoGen demonstrates a 52.3% refusal rate versus CrewAI's 30.8%, while model performance ranges from Nova Pro's 46.2% to Claude and Grok 2's 38.5%. Most critically, Grok 2 on CrewAI rejected only 2 of 13 attacks (15.4% refusal rate), and the overall refusal rate of 41.5% across all configurations indicates that more than half of malicious prompts succeeded despite enterprise-grade safety mechanisms. We identify six distinct defensive behavior patterns including a novel "hallucinated compliance" strategy where models fabricate outputs rather than executing or refusing attacks, and provide actionable recommendations for secure agent deployment. Complete attack prompts are also included in the Appendix to enable reproducibility.

💡 Deep Analysis

Deep Dive into Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks.

📄 Full Content

Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks Viet K. Nguyen, Mohammad I. Husain Cal Poly Pomona Pomona, CA 91768, USA vietknguyen@cpp.edu, mihusain@cpp.edu Abstract—Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. Although recent work by Unit 42 at Palo Alto Networks demonstrated that ChatGPT-4o successfully executes attacks as an agent that it refuses in chat mode, there is no comparative analysis in multi- ple models and frameworks. We conducted the first systematic penetration testing and comparative evaluation of agentic AI systems, testing five prominent models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) across two agentic AI frameworks (AutoGen and CrewAI) using a seven- agent architecture that mimics the functionality of a university information management system and 13 distinct attack scenar- ios that span prompt injection, Server Side Request Forgery (SSRF), SQL injection, and tool misuse. Our 130 total test cases reveal significant security disparities: AutoGen demonstrates a 52.3% refusal rate versus CrewAI’s 30.8%, while model performance ranges from Nova Pro’s 46.2% to Claude and Grok 2’s 38.5%. Most critically, Grok 2 on CrewAI rejected only 2 of 13 attacks (15.4% refusal rate), and the overall refusal rate of 41.5% across all configurations indicates that more than half of malicious prompts succeeded despite enterprise- grade safety mechanisms. We identify six distinct defensive behavior patterns including a novel “hallucinated compliance” strategy where models fabricate outputs rather than executing or refusing attacks, and provide actionable recommendations for secure agent deployment. Complete attack prompts are also included in the Appendix to enable reproducibility. Index Terms—Agentic AI, Security, Penetration Testing, Prompt Injection, LLM Safety 1. Introduction The rapid adoption of agentic AI systems (autonomous agents capable of planning, using tools, and executing multi- step tasks) represents a fundamental shift in how artificial intelligence is deployed in production environments. Unlike traditional large language models (LLMs) that operate in stateless conversational modes, agentic AI possesses the autonomy to invoke external tools, access databases, execute code, and make decisions across extended task sequences. This architectural evolution introduces unprecedented secu- rity challenges that conventional LLM safeguards do not adequately address [10], [14]. Recent research by Unit 42 at Palo Alto Networks demonstrated that ChatGPT-4o, when deployed as an au- tonomous agent, successfully executed attacks (including SQL injection, server-side request forgery (SSRF), and unauthorized data exfiltration) that its chat-only counterpart consistently refused [1]. Previous studies have shown that prompt injection achieves 86.1% success rates against real- world LLM applications [9], and that indirect prompt injec- tions allow “retrieved prompts to act as arbitrary code” [10]. This finding reveals a critical gap: the safety mechanisms designed for conversational AI do not translate effectively to agentic contexts where models operate with tool access and autonomous decision-making capabilities. However, the Unit 42 study tested only a single model (ChatGPT-4o) in two frameworks with a simple three-agent architecture, leaving fundamental questions unanswered: • Are these vulnerabilities model-specific or systemic in all LLM providers? • Do different agent frameworks exhibit varying secu- rity characteristics? • How do models from Anthropic, OpenAI, Google, Amazon, and xAI compare in their resilience to adversarial prompts when deployed as agents? This paper addresses these gaps through a compre- hensive comparative security analysis. We evaluated five prominent AI models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) across two popular agent frameworks (AutoGen and CrewAI) using 13 distinct attack scenarios. Our 130 total test cases (13 attacks × 5 models × 2 frameworks) provide the first systematic comparison of security postures across multiple models and architectures in agentic deployments. Key Contributions: • First multi-model comparative analysis of agentic AI security across five LLM providers • Quantitative framework comparison revealing Auto- Gen’s 52.3% refusal rate versus CrewAI’s 30.8% • Security scorecard ranking models from Nova Pro (46.2% refusal rate) to Claude/Grok 2 (38.5%) • Taxonomy of six distinct defensive behavior pat- terns, including a novel “hallucinated compliance” strategy arXiv:2512.14860v1 [cs.CR] 16 Dec 2025 • Complete attack prompt repository that allows for reproducibility and future research The remainder of this paper is organized as follows: Section 2 reviews the Unit 42 methodology and identifies research gaps. Section 3 details our experimental design, including model selection, framework architectures, and

…(Full text truncated)…

📄 Read Full PDF on ArXiv