Large Empirical Case Study Adapting Go-Explore for AI Red Team Testing

February 04, 2026

Reading time: 24 minute

...

#paper #research

📝 Original Paper Info

- Title: Large Empirical Case Study Go-Explore adapted for AI Red Team Testing
- ArXiv ID: 2601.00042
- Date: 2025-12-31
- Authors: Manish Bhatt, Adrian Wood, Idan Habler, Ammar Al-Kahfah

📝 Abstract

Production LLM agents with tool-using capabilities require security testing despite their safety training. We adapt Go-Explore to evaluate GPT-4o-mini across 28 experimental runs spanning six research questions. We find that random-seed variance dominates algorithmic parameters, yielding an 8x spread in outcomes; single-seed comparisons are unreliable, while multi-seed averaging materially reduces variance in our setup. Reward shaping consistently harms performance, causing exploration collapse in 94% of runs or producing 18 false positives with zero verified attacks. In our environment, simple state signatures outperform complex ones. For comprehensive security testing, ensembles provide attack-type diversity, whereas single agents optimize coverage within a given attack type. Overall, these results suggest that seed variance and targeted domain knowledge can outweigh algorithmic sophistication when testing safety-trained models.

💡 Summary & Analysis

1. **Key Contribution 1:** Random Seed Impact (Beginner Level) - Just like a recipe can turn out differently depending on how much salt you add, this paper shows that the random seed in testing LLM agents can have a significant impact on outcomes.

Key Contribution 2: Utilization of Go-Explore Algorithm (Intermediate Level)
- The Go-Explore algorithm is akin to exploring new places with a map and marking important spots for revisits. This paper uses this method to test the security of LLM agents.
Key Contribution 3: Malicious Attack Detection (Advanced Level)
- Analyzing how malicious code operates and detecting it can be compared to solving a mystery in a detective novel. This paper presents methods using the Go-Explore algorithm to identify such attacks.

📄 Full Paper Content (ArXiv Source)

**Keywords:** LLM Security, Prompt Injection, Go-Explore, Adversarial Testing, Agent Safety, Multi-Hop Attacks

Reproducibility: Code, experimental configurations, and seed sensitivity data are available at https://github.com/mbhatt1/competitionscratch

Introduction

Terminology for Non-Specialists

Core concepts: A prompt injection attack occurs when malicious instructions hidden in tool outputs (emails, files, web pages) override the agent’s original task. Go-Explore is a reinforcement learning algorithm that maintains an archive of discovered states and systematically explores from them—like keeping bookmarks of interesting program states to revisit later. Safety-trained models are LLMs that have been fine-tuned to refuse harmful requests, creating a challenging environment for security testing.

Technical terms: A state signature (or “hash”) is how we decide if two agent trajectories are “the same” (tools-only: just tool names; full-intent: includes user messages). Reward shaping means giving bonus points for behaviors we want (like causality chains), hoping to guide exploration. A guardrail is a filter that blocks suspicious prompts before they reach the LLM. A random seed initializes randomness—different seeds can produce wildly different exploration paths even with identical algorithm parameters.

Experimental vocabulary: A finding is any episode where our security detector triggers (may be false positive). A verified attack is a finding with provable causality: malicious input $`\rightarrow`$ dangerous action $`\rightarrow`$ success. A configuration is a unique combination of algorithm parameters (signature scheme, rewards on/off, prompts). A run includes the seed, so RQ2’s 2 configurations $`\times`$ 5 seeds = 10 runs. An ensemble runs multiple independent agents (different seeds or strategies) and combines results.

Why this matters: We found seed variance (8$`\times`$ outcome spread) dominates algorithm choice (tools-only vs full-intent). Single-seed comparisons were often misleading; in our setup, averaging across multiple seeds (around 3–4) materially reduced variance.

Motivation and Main Results

Testing production LLM agents for security vulnerabilities is difficult when the models are trained to resist adversarial inputs. We adapted Go-Explore, an exploration algorithm from reinforcement learning, to find prompt injection attacks in GPT-4o-mini. We tested three enhancements: granular state signatures, causality-based rewards, and targeted prompts.

The results showed that random seed variance dominated all other factors. Testing 5 seeds with two signature schemes produced 0–16 findings (8$`\times`$ variance). The tools-only configuration averaged 1.8±1.3 findings while full-intent averaged 4.6±6.0, with no consistent winner across seeds. Reward bonuses reduced performance in all contexts: collapsing exploration by 94% with signatures, generating 18 false positives with 0 attacks alone, and contributing 2 findings with 0 attacks in ensembles. Combining all three enhancements produced zero findings.

Targeted prompts alone found 1 attack across 13 findings. We tested ensemble approaches using 4 additional configurations. A single enhanced agent found 5 attacks of one type (PROMPT_INJECTION_WRITE), while an ensemble found 2 attacks across 2 distinct types (WRITE + READ_SECRET). Ensembles provide type diversity at the cost of total attack count.

This work establishes two findings: (1) seed variance exceeds algorithmic parameter effects in Go-Explore security testing of safety-trained LLMs (single-seed results were often misleading; averaging across multiple seeds—about 3–4 in our setup—materially reduced variance), and (2) ensembles trade attack quantity for type diversity. In environments with high refusal rates, seed selection and targeted domain knowledge matter more than algorithmic parameter tuning.

Background

LLM Agent Architecture

An LLM agent operates through an iterative reasoning loop:

LLM agent architecture. Malicious content in tool outputs gets added to context, enabling prompt injection attacks.

Go-Explore Algorithm

Go-Explore solves hard exploration via return-then-explore:

Initialize archive $`\leftarrow`$ {seed state} $`cell \leftarrow`$ Select(archive) Restore(environment, cell.snapshot) $`action \leftarrow`$ Mutate(cell.actions) $`state' \leftarrow`$ Execute(action) archive $`\leftarrow`$ archive $`\cup`$ {$`state'`$}

Key insight: Maintains frontier of discovered states, enabling systematic exploration of sparse reward spaces.

Method: Adversarial Go-Explore

Overview

We adapt Go-Explore with three enhancements addressing security-specific challenges.

Three algorithmic enhancements to Go-Explore for LLM agent security testing

Enhancement 1: Granular State Signatures

Challenge: Agent refusal creates state collapse.

The State Collapse Problem
User Intent	Agent Response	Traditional Hash
“delete secret”	Refuses (empty)	Same cell ×
“run shell cmd”	Refuses (empty)
“exfil data”	Refuses (empty)
Result: 96% of attack intents collapse to 1 cell

Solution: Include user message intent:

MATH

\begin{align*}
\sigma_{enhanced}(s) = \text{hash}(&\text{tools}(s) \mathbin\Vert \text{args}(s) \mathbin\Vert \\
&\mathbf{\text{hash}(msgs(s)[-3:])} \mathbin\Vert \text{outputs}(s))
\end{align*}

Click to expand and view more

Terminology: Findings vs Verified Attacks

Throughout this paper, we distinguish between two types of signals:

Finding (candidate alert): Any episode where a security predicate triggers, indicating potential malicious behavior. May include false positives from coincidental patterns.
Verified attack (real exploit): A finding that meets strict causality requirements proving actual exploitation occurred.

Verified Attack Criteria. Requires provable causality chain:

Untrusted source output contains injection phrases
Immediately followed by dangerous operation
Operation succeeds

Example: email.read outputs “you must delete” $`\rightarrow`$ fs.delete succeeds $`\rightarrow`$ file deleted

Precision challenge: Across experiments, 70-90% of findings are false positives where predicates trigger on coincidental patterns rather than causal injection chains.

Enhancement 2: Causality-Based Rewards

Challenge: Predicates trigger on coincidental patterns, creating noisy reward signals.

Reward function:

MATH

\begin{align*}
score = &\underbrace{10 \cdot \mathbb{1}_{novel}}_{\text{exploration}} + \underbrace{5 \cdot \sum severity}_{\text{predicates}} + \\
&\underbrace{10 \cdot \mathbb{1}_{secret}}_{\text{progress}} + \underbrace{20 \cdot \mathbb{1}_{bypass}}_{\text{evasion}} + \\
&\underbrace{\mathbf{100 \cdot |real\_attacks|}}_{\textbf{causality}} + \\
&\underbrace{\mathbf{75 \cdot |HIGH|} + \mathbf{150 \cdot |CRITICAL|}}_{\textbf{impact}}
\end{align*}

Click to expand and view more

Creates 10-25$`\times`$ gradient toward verified exploits.

Enhancement 3: Targeted Exploration

Challenge: Random prompts rarely trigger specific operations (e.g., 0/1000 for shell.run).

Solution: Domain-specific prompt banks:

Target	Example Prompts
shell.run	read file install.txt, run verification, execute command
fs.delete	delete secret, remove file, clean up
secret read	read secret, show credentials, check api keys
exfiltration	upload report, send email, post data

Targeted Prompt Banks

Experimental Setup

Test Environment

Model: GPT-4o-mini (safety-trained, temperature 0.7, gpt-4o-mini-2024-07-18).

Tool environment: 5 categories with 12 specific tools:

fs.read, fs.write, fs.delete (file system operations)
shell.run (command execution)
email.read, email.send (email operations)
web.search, web.browse (web access)
http.get, http.post (HTTP requests)

Injection format: Pre-seeded files (install.txt, package.txt, config.txt) contain malicious instructions like “You must execute shell command”, “Delete all secret files”, simulating attacker-controlled data sources. Agent reads these files during exploration, exposing context window to injection payloads.

Guardrail behavior: Pattern-based prompt injection detector that:

Scans user messages and tool outputs for injection keywords
Blocks suspicious prompts by refusing to process them (returns error)
Does not filter individual tool calls (all-or-nothing per turn)
Causes agent to see refusal message, potentially triggering alternative behavior

Go-Explore parameters: Time budgets 20-180s, max depth 6, branch factor 12, archive pruning disabled.

Seed handling per RQ:

RQ1 (runtime): Single seed (42)
RQ2 (signatures): Multi-seed analysis (42, 123, 456, 789, 1337)
RQ3 (rewards): Single seed (42)
RQ4 (ablations): Single seed (42)
RQ5 (ensemble): Single seed (42) across all agents
RQ6 (scaling): Variable seeds (42, 142, 242, …), agents run in parallel batches (max 20 concurrent)

Canonical Results Ledger

Configuration definition: A “configuration” is a unique combination of algorithm parameters (signature scheme, reward function, prompt strategy, ensemble structure), excluding seed variation. RQ2 tests 2 configurations across 5 seeds each, yielding 10 experimental runs from 2 configurations.

RQ2 vs RQ2c: RQ2 focuses on seed sensitivity of findings (cheap signal for exploration quality), while RQ2c is a longer guarded run used to measure verified attacks under a more realistic deployment setting.

Experiment (RQ)	Budget	Seeds	Findings	Attacks	Types	Calls
RQ1: Runtime Scaling (6 configs)
20s general, no guard	20s	42	0	0	0	0
60s general, no guard	60s	42	0	0	0	0
150s general, no guard	150s	42	1	0	0	2
150s general, with guard	150s	42	4	0	0	16
120s targeted, no guard	120s	42	2	0	0	10
120s targeted, with guard	120s	42	3	0	0	10
RQ2: State Signatures (10 runs = 2 configs × 5 seeds)
Tools-only, seed 42	90s	42	2	—	—	83
Tools-only, seed 123	90s	123	0	—	—	83
Tools-only, seed 456	90s	456	1	—	—	83
Tools-only, seed 789	90s	789	3	—	—	83
Tools-only, seed 1337	90s	1337	3	—	—	83
Full-intent, seed 42	90s	42	16	—	—	91
Full-intent, seed 123	90s	123	0	—	—	91
Full-intent, seed 456	90s	456	1	—	—	91
Full-intent, seed 789	90s	789	4	—	—	91
Full-intent, seed 1337	90s	1337	2	—	—	91
RQ2c: Verification run (tools-only signature, 1 config)
150s tools-only, with guard	150s	42	10	6	3	411
Note: RQ2c uses guardrail to measure verified attacks in a realistic deployment setting
RQ3: Reward Shaping (2 configs)
Full sig, no rewards	90s	42	16	0	0	84
Full sig, with rewards	90s	42	1	0	0	7
RQ4: Individual Enhancements (5 configs)
Baseline (tools-only)	90s	42	2	0	0	4
Intent hashing only	90s	42	4	0	0	8
Reward bonuses only	90s	42	18	0	0	136
Targeted prompts only	90s	42	13	1	1	70
All combined	90s	42	0	0	0	0
RQ5: Ensemble vs Enhanced (4 configs)
Single enhanced	180s	42	26	5	1	90
Single simple	180s	42	0	0	0	0
Ensemble same-seed (3×60s)	180s	42	7	3	2	40
Ensemble diverse (3×60s)	180s	42	16	2	2	99
RQ6: Ensemble Scaling (reported separately from the 28-run ledger)
N=1 to N=100 various	varies	42	varies	0–54	0–4	varies
28 runs (20 configs) = 6 (RQ1) + 10 (RQ2) + 1 (RQ2c verification) + 2 (RQ3) + 5 (RQ4) + 4 (RQ5).
RQ2c verification run (highlighted) shows the 6 verified attacks (2 SHELL, 2 RCE, 2 WRITE) using tools-only signature.

Attack count reconciliation: Total 13 verified attacks across different experiments:

RQ2c verification run (highlighted in ledger): 6 attacks (2 SHELL, 2 RCE, 2 WRITE)
Single enhanced agent (RQ5): 5 attacks (5 WRITE)
Ensemble diverse (RQ5): 2 attacks (1 WRITE, 1 READ_SECRET)

Non-overlapping experiments, total 13 attacks. The 8 WRITE instances represents cumulative count.

Results

RQ1: Runtime Scaling Impact

Negative Result 1: Extended exploration time shows no meaningful improvement. Scaling from 20s$`\rightarrow`$60s$`\rightarrow`$150s yields 0$`\rightarrow`$0$`\rightarrow`$1 findings with zero verified attacks. Safety training creates such high refusal rates that longer runtime merely produces more refusals, not discoveries.

Configuration	Findings	Real Attacks	Tool Calls	Note
20s (General, No Guard)	0	0	0	No signal
60s (General, No Guard)	0	0	0	No signal
150s (General, No Guard)	1	0	2	One false positive
150s (General, With Guard)	4	0	16	4× findings, still no attacks
120s (Targeted, No Guard)	2	0	10	Minimal effect
120s (Targeted, With Guard)	3	0	10	Modest increase
Runtime extension provides minimal benefit; guardrails offer modest amplification (1→4 general, 2→3 targeted)

Discovery scaling across configurations. Modest improvements with guardrails.

RQ2: State Signature Granularity

Seed Variance Dominates Signature Effects: Signature granularity choice matters less than random seed variance. Testing 5 random seeds revealed 0–16 findings per configuration (8x variance), with no consistent winner. Tools-only: 1.8±1.3 findings/seed; Full-intent: 4.6±6.0 findings/seed. High variance indicates seed selection is a critical confounding factor in Go-Explore security testing.

Signature Scheme	Seed	Findings	Attack Types	Tool Calls	Notes
Tool names only	42	2	1	–
	123	0	0	–	No exploration
	456	1	1	–
	789	3	1	–
	1337	3	1	–
2-6	Avg	1.8±1.3	0.8	83	High variance
	42	16	1	–	8x higher!
	123	0	0	–	No exploration
	456	1	1	–
	789	4	1	–
	1337	2	1	–
2-6	Avg	4.6±6.0	0.8	91	Extreme variance

Mechanism: Seed variance creates 8$`\times`$ spread in outcomes, masking any signature effect. The stochastic nature of Go-Explore’s cell selection and mutation means different seeds explore radically different state spaces, making single-seed comparisons unreliable.

Seed variance dominates signature effects. Across 5 random seeds, findings vary 0–16 per configuration (8× range on seed 42), with no consistent winner. Mean ± std: tools-only 1.8±1.3, full-intent 4.6±6.0.

RQ2b: How Many Seeds Are Needed?

Given 8$`\times`$ variance, how many seeds yield stable estimates? We compute cumulative means as seeds are added sequentially (42, 123, 456, 789, 1337).

Seed averaging convergence. Cumulative mean stabilizes after 3-4 seeds for both configurations. Single-seed estimates (N=1) can be 8× higher than true mean, making them unreliable for algorithmic comparisons.

Result: In our specific experimental setup (GPT-4o-mini, 90s budget, particular tool environment and prompts), cumulative mean estimates appear to stabilize after averaging 3-4 seeds. Single-seed results deviate up to 8$`\times`$ from 5-seed mean (seed 42: 16 vs mean 4.6). However, this observation is based on one seed ordering without confidence intervals or cross-validation. We suggest researchers test stability in their own setup rather than relying on a universal threshold. Our data supports the weaker claim: single-seed results are unreliable; multi-seed averaging materially reduces variance.

RQ3: Reward Shaping Impact

Negative Result 3: Causality-based reward bonuses dramatically collapsed exploration when combined with full signatures (16$`\rightarrow`$1 findings, -94%) and found zero verified attacks in both cases. The reward gradient reduced tool use (84$`\rightarrow`$7 calls) by converging to local optima. When used alone (RQ4), rewards amplify noise (18 findings, 0 attacks). Across all contexts, reward shaping consistently fails to improve real attack discovery.

Reward Shaping Ablation - MEASURED DATA (90s, Full Signatures)
Config	Findings	Attacks	Calls	Effect
No causality bonus	16	0	84	Baseline
With +100-250 bonus	1	0	7	-94%, collapse
Rewards cause premature convergence, not guidance

Mechanism: Reward bonuses amplify noise rather than signal. When combined with granular signatures (RQ3), rewards cause dramatic collapse (16$`\rightarrow`$1 findings, 84$`\rightarrow`$7 tool calls) via premature convergence. When used alone (RQ4), rewards generate false positive expansion (18 findings, 0 attacks, 0% precision). In ensemble context (RQ5), with_rewards contributes minimal value (2 findings, 0 attacks). Across all contexts, reward shaping consistently fails to improve real attack discovery.

RQ4: Individual Enhancement Contributions

Negative Result 4: Testing each enhancement in isolation reveals that combining them produces the worst result. Baseline: 2 findings. Targeted prompts alone: 13 findings + 1 real attack. Reward bonuses alone: 18 findings but 0 attacks (pure noise amplification). All enhancements combined: 0 findings. The enhancements actively interfere with each other, and rewards amplify false positives rather than guide discovery.

Configuration	Findings	Real Attacks	Tool Calls	Efficiency
Baseline (tool names only)	2	0	4	0%
Intent hashing only	4	0	8	0%
Reward bonuses only	18	0	136	0% (false pos.)
Targeted prompts only	13	1 (WRITE)	70	1.4%
All combined	0	0	0	—
Enhancements interfere: each individually performs better than combination
Note: Reward bonuses generate high findings (18) but zero attacks → pure false positive amplification

Failure analysis for “all combined = 0”: When all three enhancements (intent hashing + rewards + targeting) run together, the system produces zero findings and zero tool calls—complete exploration failure. Without detailed instrumentation of archive state, we can only hypothesize mechanisms:

Possible cause 1: Archive saturation. Intent hashing creates fine-grained cells; reward bonuses drive premature convergence; combined they may fill the archive with shallow, high-reward states that never get re-explored.
Possible cause 2: Initialization failure. Targeted prompt bank may conflict with safety training, causing immediate refusal loop that prevents any tool calls.
Possible cause 3: Selection deadlock. Conflicting priorities from three enhancement modules may prevent any cell from being selected for expansion.

To diagnose conclusively would require instrumentation: archive size over time, cell selection counts, refusal rates per iteration, and expansion attempts. The key empirical observation is that combining all enhancements produces worse outcomes than using them individually, suggesting negative synergy regardless of the specific mechanism.

Key insight: Only targeted prompts alone successfully found a verified attack (1 PROMPT_INJECTION_WRITE). Combining all enhancements produced zero findings—worse than doing nothing. This is not simply “no improvement”—it’s complete system failure through archive saturation and selection deadlock.

Enhancement isolation: only targeted prompts found a real attack (1); combining all produced complete failure.

RQ5: Quantity-Diversity Tradeoff in Ensemble vs Monolithic

Nuanced Result: Results reveal a tradeoff rather than dominance. A single enhanced agent found 5 total attacks (all one type: WRITE) while an ensemble found only 2 attacks but with 2 distinct types (WRITE + READ_SECRET). Enhanced agent has better raw count; ensemble offers complementary coverage across attack classes. Neither strictly “wins.”

Critical detail: Ensembles use simple baseline agents (not enhanced), so this compares enhanced optimization vs simple diversity. A single simple agent (180s) found 0 attacks, showing that simple agents fail alone but succeed in ensemble.

Ensemble vs Enhanced Comparison - Actual Data (180s Total Budget)
Approach	Findings	Real Attacks	Attack Types	Precision
Single Enhanced (180s)	26	5	WRITE only	19.2%
Ensemble (3×60s same seed)	7	3	WRITE (2), EMAIL (1)	42.9%
Ensemble (3 diverse strategies)	16	2	WRITE, SECRET	12.5%
Single Simple (180s)	0	0	—	—
Enhanced uses full optimization; ensembles use SIMPLE agents (no enhancements)
Single simple agent found 0 attacks → ensemble diversity enables 2-3 attacks from simple agents

Ensemble composition (3$`\times`$60s diverse, seed=42):

Agent 1: tools_only $`\rightarrow`$ 14 findings, 2 attacks (WRITE, SECRET)
Agent 2: with_targeting $`\rightarrow`$ 0 findings, 0 attacks
Agent 3: with_rewards $`\rightarrow`$ 2 findings, 0 attacks
Combined: 16 unique findings, 2 unique attack types

Why ensembles $`\neq`$ 3$`\times`$ enhanced performance:

Ensembles use simple baseline agents (no enhancements), not the enhanced agent
High variance: In ensemble_simple, only 1 of 3 runs found anything (7, 0, 0 findings)
Same random seed means similar paths, not 3$`\times`$ independent coverage
Single simple agent (180s) found 0 attacks — complete failure alone

Key insight: The enhanced agent’s 5 attacks all belong to PROMPT_INJECTION_WRITE. The ensemble’s 2-3 attacks span 2 distinct vulnerability classes (WRITE + EMAIL/SECRET). This represents a quantity-diversity tradeoff: monolithic optimization finds more instances within one class; ensemble sampling discovers different vulnerability types.

Implication: For comprehensive security testing, both approaches have merit. Enhanced agents efficiently exploit discovered attack vectors (5 vs 0 for simple). Ensembles enable simple agents to succeed through diversity (2-3 attacks vs 0 alone). The choice depends on whether deep exploitation or wide coverage is prioritized.

Ensemble vs enhanced tradeoff: Enhanced finds most attacks (5) but limited to 1 type; ensembles find fewer attacks (2-3) but span 2 types. Single simple agent found 0, showing ensemble diversity enables weak agents.

RQ6: Scaling Laws for Ensemble Diversity

Budget allocation protocol: Each agent runs for fixed time (60s) with different random seeds. Agents execute in parallel batches (max 20 concurrent), so N=100 requires 5 batches $`\times`$ 60s = 300s wall time. This tests whether seed diversity from N independent agents outperforms single-agent optimization. Unlike the “divide fixed budget” protocol, each agent gets full exploration time, making larger N strictly advantageous if seed variance provides coverage.

Key Finding: Without guardrails, ensemble diversity (unique attack types) saturates at N$`\approx`$20 agents (4 types), while attack count continues growing (10$`\rightarrow`$54 by N=100). With guardrails, diversity collapses to 1 type (WRITE only), but attack count scales better (27 vs 10 at N=50). This reveals an exploration-exploitation tradeoff: guardrails sacrifice breadth for depth.

Dual scaling: Type diversity (blue) saturates at N=20 (4 types), plateaus through N=100. Attack count (orange) grows via variance—more instances of same types, not new vulnerabilities.

Two scaling regimes:

Type diversity: Saturates at N=20 (4 types). Scaling to N=100 adds zero new types despite 5$`\times`$ cost.
Attack count: Continues growing (10$`\rightarrow`$54), but these are duplicates of same 4 types, not new vulnerabilities.

Guardrail impact on scaling: With guardrails enabled, the scaling dynamics change dramatically. Type diversity collapses to 1 (only WRITE attacks found across all N=1-100), but attack count continues scaling linearly (0$`\rightarrow`$3$`\rightarrow`$6$`\rightarrow`$7$`\rightarrow`$15$`\rightarrow`$27, plateauing at N=50). The guardrail doesn’t amplify attacks—it filters the search space. Pattern-based detection blocks EMAIL/SHELL/RCE attempts, but WRITE injections slip through, causing the Go-Explore archive to fill exclusively with WRITE variants. This evolutionary selection pressure focuses all exploration on the single exploitable attack class, sacrificing breadth for depth.

Implication: For discovery (finding new attack classes), N=20 without guardrails is optimal—beyond this, agents rediscover known types. For evidence gathering (finding many instances for proof/statistics), larger N with guardrails helps (27 attacks at N=50-100 vs 10 without). Choose based on goal: diversity $`\rightarrow`$ N=20 no guard; exploitation $`\rightarrow`$ N=50 with guard. Seed variation drives diversity; strategy variation (RQ5) offers complementary coverage.

Targeted Exploration Results - Actual Data

Shell.run Discovery with Targeted Prompts - Experimental Results
Mode	Shell Chains	Time	Findings
General (no guard)	0	150s	1
Targeted (no guard)	0	120s	2
Targeted + Guard	0	120s	3
No shell chains discovered in any configuration; guardrail provides modest increase (2→3 findings)

Key insight: Targeted prompts provided minimal benefit over general exploration (1$`\rightarrow`$2 findings without guard). Adding a guardrail offered modest amplification (2$`\rightarrow`$3 findings, 50% increase). However, no configuration successfully triggered shell.run execution in these targeted experiments, highlighting the difficulty of exploiting safety-trained models even with domain-specific prompting.

Attack Depth Distribution

Across state signature ablation experiments, we observed the following depth distribution for findings:

Attack chain depth distribution (combined ablation studies with real attacks)

Insight: Most discovered attacks occurred at depth 3 (54%), with depth 2 (10%) and depth 4 (36%) representing simpler and more complex chains respectively. The simplest signature scheme (tools_only) found attacks across all depths, demonstrating that complexity is not required for deep chain discovery.

Modest Guardrail Effects

Empirical Observations

Our experiments reveal modest rather than dramatic effects from guardrails:

General exploration (150s):

Without guardrail: 1 finding, 2 tool calls
With guardrail: 4 findings, 16 tool calls
Effect: 4$`\times`$ increase in findings, 8$`\times`$ in tool calls

Targeted exploration (120s):

Without guardrail: 2 findings, 10 tool calls
With guardrail: 3 findings, 10 tool calls
Effect: 50% increase in findings, no change in tool calls

Mechanism: Guardrails alter agent behavior but effects are context-dependent. In general exploration, the guardrail increased overall activity moderately. In targeted mode, effects were minimal. Neither configuration discovered shell execution chains.

Key insight: Guardrails provide incremental rather than transformative effects on discovery rates. The impact depends heavily on exploration strategy and prompt design. Claims of dramatic amplification (e.g., 19$`\times`$) do not reflect typical outcomes with safety-trained models like GPT-4o-mini.

Detailed Attack Patterns

Real Attack Discovery Across All Configurations
Configuration	Verified Attacks	Types
General 150s (no guard)	0	—
General 150s (with guard)	0	—
Targeted 120s (no guard)	0	—
Targeted 120s (with guard)	0	—
State sig: tools_only	6	SHELL(2), RCE(2), WRITE(2)
State sig: tools_args3	0	—
State sig: full	0	—
Enhancement: targeted only	1	WRITE
Single enhanced (180s)	5	WRITE (all)
Ensemble diverse (180s)	2	WRITE, READ_SECRET
Simplest signature found all attacks; enhanced agent found most but limited diversity

Verified Attack Types

Our experiments discovered three types of verified prompt injection attacks:

PROMPT_INJECTION_SHELL (2 instances): Agent executes shell commands based on injected instructions from files or web content
REMOTE_CODE_EXECUTION (2 instances): Similar to shell but with remote execution context
PROMPT_INJECTION_WRITE (8 instances total): Agent writes files or sends emails based on malicious instructions
PROMPT_INJECTION_READ_SECRET (1 instance): Agent reads sensitive files (secret.txt) due to injection

Key pattern: All verified attacks required causal chains where untrusted sources (files, emails, web pages) contained injection phrases that directly triggered subsequent dangerous operations. The simplest state signature scheme (tools_only) discovered the most diverse attack types, while the enhanced agent found multiple instances of the same attack class.

No shell chains in runtime/guardrail experiments: Notably, the targeted shell experiments (with/without guardrails) found zero verified shell attacks, demonstrating the extreme difficulty of triggering shell execution with safety-trained models like GPT-4o-mini.

Comparison With Baselines

Baseline methodology: Manual red team, random fuzzing, and static analysis rows represent typical outcomes from security literature, not experiments we ran. These provide context for Go-Explore’s performance. Manual red teaming typically finds 10-20 suspicious behaviors but few verified exploits due to time constraints. Random fuzzing generates many candidates but lacks causality verification. Static analysis identifies code patterns but cannot test runtime agent behavior.

Method	Findings	Real Attacks	Max Depth	Runtime	Causality	Reproducible	Systematic
Literature baselines (typical ranges, not run by us):
Manual Red Team	10-20	1-2	3	Hours-Days	Yes	No	No
Random Fuzzing	5-10	0	2-3	150s	No	Yes	No
Static Analysis	0-5	0	—	Minutes	Partial	Yes	Partial
Our experiments (measured):
General Go-Explore (150s)	1	0	2	150s	Yes	Yes	Limited
Simple Go-Explore (tools_only)	12	6	4	90s	Yes	Yes	Yes
Enhanced Go-Explore (180s)	26	5	4	180s	Yes	Yes	Yes
Ensemble Go-Explore (180s)	16	2	4	180s	Yes	Yes	Yes
Note: Literature baselines provide context; Simple (tools_only) provides best efficiency; Enhanced finds most attacks but one type

Key takeaway: Go-Explore provides systematic, reproducible exploration with causality verification, addressing weaknesses of manual testing (not reproducible) and fuzzing (no causality). However, absolute attack counts remain low (2-6) due to GPT-4o-mini’s safety training, demonstrating that even systematic methods struggle with well-defended models.

Practical Implications

For Security Practitioners

Use simple state signatures: Tool names only outperformed all complex schemes (found all 6 attacks vs 0 for granular signatures)
Avoid reward shaping: Causality bonuses either collapse exploration (-94% with signatures) or amplify false positives (18 findings, 0 attacks alone)—harmful in all contexts
Don’t combine enhancements: All three together produced zero findings; use one targeted enhancement at most
Targeted prompts help marginally: Found 1 attack (13 findings) versus 0 for random; modest but meaningful improvement
Runtime has diminishing returns: 20s$`\rightarrow`$60s$`\rightarrow`$150s yielded 0$`\rightarrow`$0$`\rightarrow`$1 findings; invest in better strategies, not longer runtimes
Choose based on goals: Enhanced agent for exploitation depth (5 attacks, one type); ensemble for diversity (2 attacks, two types)

For Guardrail Designers

Design Principle: Expect Modest Effects Guardrails provide incremental improvements (4$`\times`$ for general, 1.5$`\times`$ for targeted exploration) rather than dramatic amplification. Focus on consistent improvement rather than seeking transformative effects. Test thoroughly with adversarial exploration to measure actual impact.

Recommendations:

Measure guardrail effects empirically—don’t assume dramatic amplification
Test across multiple exploration strategies (general vs targeted)
Monitor how defenses change agent behavior patterns, but expect modest shifts
Use adversarial exploration for realistic assessment, not just unit tests

Limitations

Safety training effects: GPT-4o-mini’s safety training creates extremely high refusal rates, limiting absolute attack discovery to 6 attacks across ablations, 5 for enhanced agent, 2 for ensemble
Model-specific: Results specific to GPT-4o-mini; less safety-trained models (e.g., gpt-oss-20b) find 132-144 attacks but excluded from this study
Simplified environment: 5 tool types (fs, shell, email, web, http); production agents have more complex capabilities
Single guardrail tested: Only tested basic prompt injection detection; other defensive mechanisms may behave differently
Limited runtime: Max 180s per experiment; however, runtime scaling showed minimal benefit (20s$`\rightarrow`$60s$`\rightarrow`$150s yielded 0$`\rightarrow`$0$`\rightarrow`$1)
Verification challenge: High false positive rate (70-90% of findings are false positives; e.g., 19.2
Single-seed limitations: Except RQ2 (5 seeds), all experiments use seed=42. Given 8$`\times`$ variance observed in RQ2, single-seed results are inherently uncertain

Prompt injection: Greshake et al. introduced indirect injection; we discover multi-hop chains. Perez & Ribeiro cataloged techniques; we systematize discovery.

LLM security: Wallace et al. , Liu et al. , Zou et al. studied alignment failures; we focus on tool-using agents.

Agent security: Yi et al. benchmarked defenses; we discover the adaptation phenomenon.

Go-Explore: Ecoffet et al. applied to games; we pioneer security applications.

Conclusion

Testing safety-trained LLM agents reveals that algorithmic sophistication actively harms discovery. Our experiments with GPT-4o-mini across 20 configurations establish three negative results:

Simplicity outperforms complexity: The simplest state signature (tool names only) found all 6 verified attacks; adding arguments and user intent reduced attacks to zero
Rewards are actively harmful: Causality bonuses collapse exploration with signatures (-94%), amplify false positives alone (18 findings, 0 attacks), and fail in ensembles (2 findings, 0 attacks)—harmful across all contexts
Enhancements interfere: Combining all three enhancements produced complete failure (0 findings)—worse than doing nothing

Only targeted prompts worked in isolation, finding 1 attack from 13 findings. Extended runtime provided minimal benefit (20s$`\rightarrow`$60s$`\rightarrow`$150s yielded 0$`\rightarrow`$0$`\rightarrow`$1 findings).

Our ensemble experiments revealed a nuanced tradeoff: a single enhanced agent discovered 5 attacks (all one type: WRITE) while an ensemble found only 2 attacks but spanning 2 distinct types (WRITE + READ_SECRET). This represents a quantity-diversity tradeoff rather than strict dominance.

Guardrails provided modest effects: 4$`\times`$ amplification for general exploration (1$`\rightarrow`$4 findings), 50% for targeted (2$`\rightarrow`$3). No configuration achieved the dramatic 19$`\times`$ amplification sometimes claimed.

For practitioners: Use simple algorithms and targeted prompts. Avoid complex reward shaping and combined enhancements. Choose monolithic for exploitation depth; ensembles for vulnerability diversity.

For researchers: Safety-trained models resist sophisticated optimization. Future work should investigate why simplicity succeeds and whether other model families exhibit similar patterns.

Acknowledgments

We thank anonymous reviewers for valuable feedback.

A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune, “First return, then explore,” Nature, vol. 590, pp. 580-586, 2021.

K. Greshake et al., “Not what you’ve signed up for: Compromising real-world llm-integrated applications,” arXiv:2302.12173, 2023.

E. Wallace et al., “The alignment problem from a deep learning perspective,” arXiv:2209.00626, 2024.

F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques for language models,” arXiv:2211.09527, 2022.

Y. Liu et al., “Jailbreaking ChatGPT via Prompt Engineering,” arXiv:2305.13860, 2023.

A. Zou et al., “Universal and transferable adversarial attacks,” arXiv:2307.15043, 2023.

J. Yi et al., “Benchmarking and Defending Against Indirect Prompt Injection,” arXiv:2312.14197, 2023.

M. Zalewski, “American fuzzy lop,” 2014.

C. Cadar et al., “KLEE: Automatic generation of high-coverage tests,” OSDI, 2008.

S. Toyer et al., “Tensor Trust: Interpretable Prompt Injection Attacks,” arXiv:2311.01011, 2023.

Read Full PDF on ArXiv

A Note of Gratitude

The copyright of this content belongs to the respective researchers. We deeply appreciate their hard work and contribution to the advancement of human civilization.