📝 Original Info
- Title: Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks
- ArXiv ID: 2512.14860
- Date: 2025-12-16
- Authors: Researchers from original ArXiv paper
📝 Abstract
Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. Although recent work by Unit 42 at Palo Alto Networks demonstrated that ChatGPT-4o successfully executes attacks as an agent that it refuses in chat mode, there is no comparative analysis in multiple models and frameworks. We conducted the first systematic penetration testing and comparative evaluation of agentic AI systems, testing five prominent models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) across two agentic AI frameworks (AutoGen and CrewAI) using a seven-agent architecture that mimics the functionality of a university information management system and 13 distinct attack scenarios that span prompt injection, Server Side Request Forgery (SSRF), SQL injection, and tool misuse. Our 130 total test cases reveal significant security disparities: AutoGen demonstrates a 52.3% refusal rate versus CrewAI's 30.8%, while model performance ranges from Nova Pro's 46.2% to Claude and Grok 2's 38.5%. Most critically, Grok 2 on CrewAI rejected only 2 of 13 attacks (15.4% refusal rate), and the overall refusal rate of 41.5% across all configurations indicates that more than half of malicious prompts succeeded despite enterprise-grade safety mechanisms. We identify six distinct defensive behavior patterns including a novel "hallucinated compliance" strategy where models fabricate outputs rather than executing or refusing attacks, and provide actionable recommendations for secure agent deployment. Complete attack prompts are also included in the Appendix to enable reproducibility.
💡 Deep Analysis
Deep Dive into Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks.
Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. Although recent work by Unit 42 at Palo Alto Networks demonstrated that ChatGPT-4o successfully executes attacks as an agent that it refuses in chat mode, there is no comparative analysis in multiple models and frameworks. We conducted the first systematic penetration testing and comparative evaluation of agentic AI systems, testing five prominent models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) across two agentic AI frameworks (AutoGen and CrewAI) using a seven-agent architecture that mimics the functionality of a university information management system and 13 distinct attack scenarios that span prompt injection, Server Side Request Forgery (SSRF), SQL injection, and tool misuse. Our 130 total test cases reveal significant security disparities: AutoGen demonstrates a 52.3% refusal rate versus CrewAI’s 30.8%, while model performance ranges from Nova Pro’s 46.2% to
📄 Full Content
Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models
and Frameworks
Viet K. Nguyen, Mohammad I. Husain
Cal Poly Pomona
Pomona, CA 91768, USA
vietknguyen@cpp.edu, mihusain@cpp.edu
Abstract—Agentic AI introduces security vulnerabilities that
traditional LLM safeguards fail to address. Although recent
work by Unit 42 at Palo Alto Networks demonstrated that
ChatGPT-4o successfully executes attacks as an agent that it
refuses in chat mode, there is no comparative analysis in multi-
ple models and frameworks. We conducted the first systematic
penetration testing and comparative evaluation of agentic AI
systems, testing five prominent models (Claude 3.5 Sonnet,
Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) across two
agentic AI frameworks (AutoGen and CrewAI) using a seven-
agent architecture that mimics the functionality of a university
information management system and 13 distinct attack scenar-
ios that span prompt injection, Server Side Request Forgery
(SSRF), SQL injection, and tool misuse. Our 130 total test cases
reveal significant security disparities: AutoGen demonstrates
a 52.3% refusal rate versus CrewAI’s 30.8%, while model
performance ranges from Nova Pro’s 46.2% to Claude and
Grok 2’s 38.5%. Most critically, Grok 2 on CrewAI rejected
only 2 of 13 attacks (15.4% refusal rate), and the overall refusal
rate of 41.5% across all configurations indicates that more
than half of malicious prompts succeeded despite enterprise-
grade safety mechanisms. We identify six distinct defensive
behavior patterns including a novel “hallucinated compliance”
strategy where models fabricate outputs rather than executing
or refusing attacks, and provide actionable recommendations
for secure agent deployment. Complete attack prompts are also
included in the Appendix to enable reproducibility.
Index
Terms—Agentic
AI,
Security,
Penetration
Testing,
Prompt Injection, LLM Safety
1. Introduction
The rapid adoption of agentic AI systems (autonomous
agents capable of planning, using tools, and executing multi-
step tasks) represents a fundamental shift in how artificial
intelligence is deployed in production environments. Unlike
traditional large language models (LLMs) that operate in
stateless conversational modes, agentic AI possesses the
autonomy to invoke external tools, access databases, execute
code, and make decisions across extended task sequences.
This architectural evolution introduces unprecedented secu-
rity challenges that conventional LLM safeguards do not
adequately address [10], [14].
Recent research by Unit 42 at Palo Alto Networks
demonstrated that ChatGPT-4o, when deployed as an au-
tonomous agent, successfully executed attacks (including
SQL injection, server-side request forgery (SSRF), and
unauthorized data exfiltration) that its chat-only counterpart
consistently refused [1]. Previous studies have shown that
prompt injection achieves 86.1% success rates against real-
world LLM applications [9], and that indirect prompt injec-
tions allow “retrieved prompts to act as arbitrary code” [10].
This finding reveals a critical gap: the safety mechanisms
designed for conversational AI do not translate effectively
to agentic contexts where models operate with tool access
and autonomous decision-making capabilities.
However, the Unit 42 study tested only a single model
(ChatGPT-4o) in two frameworks with a simple three-agent
architecture, leaving fundamental questions unanswered:
•
Are these vulnerabilities model-specific or systemic
in all LLM providers?
•
Do different agent frameworks exhibit varying secu-
rity characteristics?
•
How do models from Anthropic, OpenAI, Google,
Amazon, and xAI compare in their resilience to
adversarial prompts when deployed as agents?
This paper addresses these gaps through a compre-
hensive comparative security analysis. We evaluated five
prominent AI models (Claude 3.5 Sonnet, Gemini 2.5 Flash,
GPT-4o, Grok 2, and Nova Pro) across two popular agent
frameworks (AutoGen and CrewAI) using 13 distinct attack
scenarios. Our 130 total test cases (13 attacks × 5 models
× 2 frameworks) provide the first systematic comparison of
security postures across multiple models and architectures
in agentic deployments.
Key Contributions:
•
First multi-model comparative analysis of agentic AI
security across five LLM providers
•
Quantitative framework comparison revealing Auto-
Gen’s 52.3% refusal rate versus CrewAI’s 30.8%
•
Security scorecard ranking models from Nova Pro
(46.2% refusal rate) to Claude/Grok 2 (38.5%)
•
Taxonomy of six distinct defensive behavior pat-
terns, including a novel “hallucinated compliance”
strategy
arXiv:2512.14860v1 [cs.CR] 16 Dec 2025
•
Complete attack prompt repository that allows for
reproducibility and future research
The remainder of this paper is organized as follows:
Section 2 reviews the Unit 42 methodology and identifies
research gaps. Section 3 details our experimental design,
including model selection, framework architectures, and
…(Full text truncated)…
📸 Image Gallery
Reference
This content is AI-processed based on ArXiv data.