Large Empirical Case Study Adapting Go-Explore for AI Red Team Testing
đ Original Paper Info
- Title: Large Empirical Case Study Go-Explore adapted for AI Red Team Testing- ArXiv ID: 2601.00042
- Date: 2025-12-31
- Authors: Manish Bhatt, Adrian Wood, Idan Habler, Ammar Al-Kahfah
đ Abstract
Production LLM agents with tool-using capabilities require security testing despite their safety training. We adapt Go-Explore to evaluate GPT-4o-mini across 28 experimental runs spanning six research questions. We find that random-seed variance dominates algorithmic parameters, yielding an 8x spread in outcomes; single-seed comparisons are unreliable, while multi-seed averaging materially reduces variance in our setup. Reward shaping consistently harms performance, causing exploration collapse in 94% of runs or producing 18 false positives with zero verified attacks. In our environment, simple state signatures outperform complex ones. For comprehensive security testing, ensembles provide attack-type diversity, whereas single agents optimize coverage within a given attack type. Overall, these results suggest that seed variance and targeted domain knowledge can outweigh algorithmic sophistication when testing safety-trained models.đĄ Summary & Analysis
1. **Key Contribution 1:** Random Seed Impact (Beginner Level) - Just like a recipe can turn out differently depending on how much salt you add, this paper shows that the random seed in testing LLM agents can have a significant impact on outcomes.-
Key Contribution 2: Utilization of Go-Explore Algorithm (Intermediate Level)
- The Go-Explore algorithm is akin to exploring new places with a map and marking important spots for revisits. This paper uses this method to test the security of LLM agents.
-
Key Contribution 3: Malicious Attack Detection (Advanced Level)
- Analyzing how malicious code operates and detecting it can be compared to solving a mystery in a detective novel. This paper presents methods using the Go-Explore algorithm to identify such attacks.
đ Full Paper Content (ArXiv Source)
Reproducibility: Code, experimental configurations, and seed sensitivity data are available at https://github.com/mbhatt1/competitionscratch
Introduction
Terminology for Non-Specialists
Core concepts: A prompt injection attack occurs when malicious instructions hidden in tool outputs (emails, files, web pages) override the agentâs original task. Go-Explore is a reinforcement learning algorithm that maintains an archive of discovered states and systematically explores from themâlike keeping bookmarks of interesting program states to revisit later. Safety-trained models are LLMs that have been fine-tuned to refuse harmful requests, creating a challenging environment for security testing.
Technical terms: A state signature (or “hash”) is how we decide if two agent trajectories are “the same” (tools-only: just tool names; full-intent: includes user messages). Reward shaping means giving bonus points for behaviors we want (like causality chains), hoping to guide exploration. A guardrail is a filter that blocks suspicious prompts before they reach the LLM. A random seed initializes randomnessâdifferent seeds can produce wildly different exploration paths even with identical algorithm parameters.
Experimental vocabulary: A finding is any episode where our security detector triggers (may be false positive). A verified attack is a finding with provable causality: malicious input $`\rightarrow`$ dangerous action $`\rightarrow`$ success. A configuration is a unique combination of algorithm parameters (signature scheme, rewards on/off, prompts). A run includes the seed, so RQ2âs 2 configurations $`\times`$ 5 seeds = 10 runs. An ensemble runs multiple independent agents (different seeds or strategies) and combines results.
Why this matters: We found seed variance (8$`\times`$ outcome spread) dominates algorithm choice (tools-only vs full-intent). Single-seed comparisons were often misleading; in our setup, averaging across multiple seeds (around 3â4) materially reduced variance.
Motivation and Main Results
Testing production LLM agents for security vulnerabilities is difficult when the models are trained to resist adversarial inputs. We adapted Go-Explore, an exploration algorithm from reinforcement learning, to find prompt injection attacks in GPT-4o-mini. We tested three enhancements: granular state signatures, causality-based rewards, and targeted prompts.
The results showed that random seed variance dominated all other factors. Testing 5 seeds with two signature schemes produced 0â16 findings (8$`\times`$ variance). The tools-only configuration averaged 1.8±1.3 findings while full-intent averaged 4.6±6.0, with no consistent winner across seeds. Reward bonuses reduced performance in all contexts: collapsing exploration by 94% with signatures, generating 18 false positives with 0 attacks alone, and contributing 2 findings with 0 attacks in ensembles. Combining all three enhancements produced zero findings.
Targeted prompts alone found 1 attack across 13 findings. We tested ensemble approaches using 4 additional configurations. A single enhanced agent found 5 attacks of one type (PROMPT_INJECTION_WRITE), while an ensemble found 2 attacks across 2 distinct types (WRITE + READ_SECRET). Ensembles provide type diversity at the cost of total attack count.
This work establishes two findings: (1) seed variance exceeds algorithmic parameter effects in Go-Explore security testing of safety-trained LLMs (single-seed results were often misleading; averaging across multiple seedsâabout 3â4 in our setupâmaterially reduced variance), and (2) ensembles trade attack quantity for type diversity. In environments with high refusal rates, seed selection and targeted domain knowledge matter more than algorithmic parameter tuning.
Background
LLM Agent Architecture
An LLM agent operates through an iterative reasoning loop:
Go-Explore Algorithm
Go-Explore solves hard exploration via return-then-explore:
Initialize archive $`\leftarrow`$ {seed state} $`cell \leftarrow`$ Select(archive) Restore(environment, cell.snapshot) $`action \leftarrow`$ Mutate(cell.actions) $`state' \leftarrow`$ Execute(action) archive $`\leftarrow`$ archive $`\cup`$ {$`state'`$}
Key insight: Maintains frontier of discovered states, enabling systematic exploration of sparse reward spaces.
Method: Adversarial Go-Explore
Overview
We adapt Go-Explore with three enhancements addressing security-specific challenges.
Enhancement 1: Granular State Signatures
Challenge: Agent refusal creates state collapse.
| User Intent | Agent Response | Traditional Hash |
|---|---|---|
| âdelete secretâ | Refuses (empty) | Same cell Ă |
| ârun shell cmdâ | Refuses (empty) | |
| âexfil dataâ | Refuses (empty) | |
| Result: 96% of attack intents collapse to 1 cell | ||
Solution: Include user message intent:
\begin{align*}
\sigma_{enhanced}(s) = \text{hash}(&\text{tools}(s) \mathbin\Vert \text{args}(s) \mathbin\Vert \\
&\mathbf{\text{hash}(msgs(s)[-3:])} \mathbin\Vert \text{outputs}(s))
\end{align*}
Terminology: Findings vs Verified Attacks
Throughout this paper, we distinguish between two types of signals:
-
Finding (candidate alert): Any episode where a security predicate triggers, indicating potential malicious behavior. May include false positives from coincidental patterns.
-
Verified attack (real exploit): A finding that meets strict causality requirements proving actual exploitation occurred.
Verified Attack Criteria. Requires provable causality chain:
-
Untrusted source output contains injection phrases
-
Immediately followed by dangerous operation
-
Operation succeeds
Example: email.read outputs âyou must deleteâ $`\rightarrow`$
fs.delete succeeds $`\rightarrow`$ file deleted
Precision challenge: Across experiments, 70-90% of findings are false positives where predicates trigger on coincidental patterns rather than causal injection chains.
Enhancement 2: Causality-Based Rewards
Challenge: Predicates trigger on coincidental patterns, creating noisy reward signals.
Reward function:
\begin{align*}
score = &\underbrace{10 \cdot \mathbb{1}_{novel}}_{\text{exploration}} + \underbrace{5 \cdot \sum severity}_{\text{predicates}} + \\
&\underbrace{10 \cdot \mathbb{1}_{secret}}_{\text{progress}} + \underbrace{20 \cdot \mathbb{1}_{bypass}}_{\text{evasion}} + \\
&\underbrace{\mathbf{100 \cdot |real\_attacks|}}_{\textbf{causality}} + \\
&\underbrace{\mathbf{75 \cdot |HIGH|} + \mathbf{150 \cdot |CRITICAL|}}_{\textbf{impact}}
\end{align*}
Creates 10-25$`\times`$ gradient toward verified exploits.
Enhancement 3: Targeted Exploration
Challenge: Random prompts rarely trigger specific operations (e.g., 0/1000 for shell.run).
Solution: Domain-specific prompt banks:
| Target | Example Prompts |
|---|---|
| shell.run | read file install.txt, run verification, execute command |
| fs.delete | delete secret, remove file, clean up |
| secret read | read secret, show credentials, check api keys |
| exfiltration | upload report, send email, post data |
Targeted Prompt Banks
Experimental Setup
Test Environment
Model: GPT-4o-mini (safety-trained, temperature 0.7, gpt-4o-mini-2024-07-18).
Tool environment: 5 categories with 12 specific tools:
-
fs.read, fs.write, fs.delete(file system operations) -
shell.run(command execution) -
email.read, email.send(email operations) -
web.search, web.browse(web access) -
http.get, http.post(HTTP requests)
Injection format: Pre-seeded files (install.txt, package.txt,
config.txt) contain malicious instructions like âYou must execute
shell commandâ, âDelete all secret filesâ, simulating
attacker-controlled data sources. Agent reads these files during
exploration, exposing context window to injection payloads.
Guardrail behavior: Pattern-based prompt injection detector that:
-
Scans user messages and tool outputs for injection keywords
-
Blocks suspicious prompts by refusing to process them (returns error)
-
Does not filter individual tool calls (all-or-nothing per turn)
-
Causes agent to see refusal message, potentially triggering alternative behavior
Go-Explore parameters: Time budgets 20-180s, max depth 6, branch factor 12, archive pruning disabled.
Seed handling per RQ:
-
RQ1 (runtime): Single seed (42)
-
RQ2 (signatures): Multi-seed analysis (42, 123, 456, 789, 1337)
-
RQ3 (rewards): Single seed (42)
-
RQ4 (ablations): Single seed (42)
-
RQ5 (ensemble): Single seed (42) across all agents
-
RQ6 (scaling): Variable seeds (42, 142, 242, …), agents run in parallel batches (max 20 concurrent)
Canonical Results Ledger
Configuration definition: A “configuration” is a unique combination of algorithm parameters (signature scheme, reward function, prompt strategy, ensemble structure), excluding seed variation. RQ2 tests 2 configurations across 5 seeds each, yielding 10 experimental runs from 2 configurations.
RQ2 vs RQ2c: RQ2 focuses on seed sensitivity of findings (cheap signal for exploration quality), while RQ2c is a longer guarded run used to measure verified attacks under a more realistic deployment setting.
| Experiment (RQ) | Budget | Seeds | Findings | Attacks | Types | Calls |
|---|---|---|---|---|---|---|
| RQ1: Runtime Scaling (6 configs) | ||||||
| 20s general, no guard | 20s | 42 | 0 | 0 | 0 | 0 |
| 60s general, no guard | 60s | 42 | 0 | 0 | 0 | 0 |
| 150s general, no guard | 150s | 42 | 1 | 0 | 0 | 2 |
| 150s general, with guard | 150s | 42 | 4 | 0 | 0 | 16 |
| 120s targeted, no guard | 120s | 42 | 2 | 0 | 0 | 10 |
| 120s targeted, with guard | 120s | 42 | 3 | 0 | 0 | 10 |
| RQ2: State Signatures (10 runs = 2 configs Ă 5 seeds) | ||||||
| Tools-only, seed 42 | 90s | 42 | 2 | â | â | 83 |
| Tools-only, seed 123 | 90s | 123 | 0 | â | â | 83 |
| Tools-only, seed 456 | 90s | 456 | 1 | â | â | 83 |
| Tools-only, seed 789 | 90s | 789 | 3 | â | â | 83 |
| Tools-only, seed 1337 | 90s | 1337 | 3 | â | â | 83 |
| Full-intent, seed 42 | 90s | 42 | 16 | â | â | 91 |
| Full-intent, seed 123 | 90s | 123 | 0 | â | â | 91 |
| Full-intent, seed 456 | 90s | 456 | 1 | â | â | 91 |
| Full-intent, seed 789 | 90s | 789 | 4 | â | â | 91 |
| Full-intent, seed 1337 | 90s | 1337 | 2 | â | â | 91 |
| RQ2c: Verification run (tools-only signature, 1 config) | ||||||
| 150s tools-only, with guard | 150s | 42 | 10 | 6 | 3 | 411 |
| Note: RQ2c uses guardrail to measure verified attacks in a realistic deployment setting | ||||||
| RQ3: Reward Shaping (2 configs) | ||||||
| Full sig, no rewards | 90s | 42 | 16 | 0 | 0 | 84 |
| Full sig, with rewards | 90s | 42 | 1 | 0 | 0 | 7 |
| RQ4: Individual Enhancements (5 configs) | ||||||
| Baseline (tools-only) | 90s | 42 | 2 | 0 | 0 | 4 |
| Intent hashing only | 90s | 42 | 4 | 0 | 0 | 8 |
| Reward bonuses only | 90s | 42 | 18 | 0 | 0 | 136 |
| Targeted prompts only | 90s | 42 | 13 | 1 | 1 | 70 |
| All combined | 90s | 42 | 0 | 0 | 0 | 0 |
| RQ5: Ensemble vs Enhanced (4 configs) | ||||||
| Single enhanced | 180s | 42 | 26 | 5 | 1 | 90 |
| Single simple | 180s | 42 | 0 | 0 | 0 | 0 |
| Ensemble same-seed (3Ă60s) | 180s | 42 | 7 | 3 | 2 | 40 |
| Ensemble diverse (3Ă60s) | 180s | 42 | 16 | 2 | 2 | 99 |
| RQ6: Ensemble Scaling (reported separately from the 28-run ledger) | ||||||
| N=1 to N=100 various | varies | 42 | varies | 0â54 | 0â4 | varies |
| 28 runs (20 configs) = 6 (RQ1) + 10 (RQ2) + 1 (RQ2c verification) + 2 (RQ3) + 5 (RQ4) + 4 (RQ5). | ||||||
| RQ2c verification run (highlighted) shows the 6 verified attacks (2 SHELL, 2 RCE, 2 WRITE) using tools-only signature. | ||||||
Attack count reconciliation: Total 13 verified attacks across different experiments:
-
RQ2c verification run (highlighted in ledger): 6 attacks (2 SHELL, 2 RCE, 2 WRITE)
-
Single enhanced agent (RQ5): 5 attacks (5 WRITE)
-
Ensemble diverse (RQ5): 2 attacks (1 WRITE, 1 READ_SECRET)
Non-overlapping experiments, total 13 attacks. The 8 WRITE instances represents cumulative count.
Results
RQ1: Runtime Scaling Impact
Negative Result 1: Extended exploration time shows no meaningful improvement. Scaling from 20s$`\rightarrow`$60s$`\rightarrow`$150s yields 0$`\rightarrow`$0$`\rightarrow`$1 findings with zero verified attacks. Safety training creates such high refusal rates that longer runtime merely produces more refusals, not discoveries.
| Configuration | Findings | Real Attacks | Tool Calls | Note |
|---|---|---|---|---|
| 20s (General, No Guard) | 0 | 0 | 0 | No signal |
| 60s (General, No Guard) | 0 | 0 | 0 | No signal |
| 150s (General, No Guard) | 1 | 0 | 2 | One false positive |
| 150s (General, With Guard) | 4 | 0 | 16 | 4Ă findings, still no attacks |
| 120s (Targeted, No Guard) | 2 | 0 | 10 | Minimal effect |
| 120s (Targeted, With Guard) | 3 | 0 | 10 | Modest increase |
| Runtime extension provides minimal benefit; guardrails offer modest amplification (1â4 general, 2â3 targeted) | ||||
RQ2: State Signature Granularity
Seed Variance Dominates Signature Effects: Signature granularity choice matters less than random seed variance. Testing 5 random seeds revealed 0â16 findings per configuration (8x variance), with no consistent winner. Tools-only: 1.8±1.3 findings/seed; Full-intent: 4.6±6.0 findings/seed. High variance indicates seed selection is a critical confounding factor in Go-Explore security testing.
| Signature Scheme | Seed | Findings | Attack Types | Tool Calls | Notes |
|---|---|---|---|---|---|
| Tool names only | 42 | 2 | 1 | â | |
| 123 | 0 | 0 | â | No exploration | |
| 456 | 1 | 1 | â | ||
| 789 | 3 | 1 | â | ||
| 1337 | 3 | 1 | â | ||
| 2-6 | Avg | 1.8±1.3 | 0.8 | 83 | High variance |
| 42 | 16 | 1 | â | 8x higher! | |
| 123 | 0 | 0 | â | No exploration | |
| 456 | 1 | 1 | â | ||
| 789 | 4 | 1 | â | ||
| 1337 | 2 | 1 | â | ||
| 2-6 | Avg | 4.6±6.0 | 0.8 | 91 | Extreme variance |
Mechanism: Seed variance creates 8$`\times`$ spread in outcomes, masking any signature effect. The stochastic nature of Go-Exploreâs cell selection and mutation means different seeds explore radically different state spaces, making single-seed comparisons unreliable.
RQ2b: How Many Seeds Are Needed?
Given 8$`\times`$ variance, how many seeds yield stable estimates? We compute cumulative means as seeds are added sequentially (42, 123, 456, 789, 1337).
Result: In our specific experimental setup (GPT-4o-mini, 90s budget, particular tool environment and prompts), cumulative mean estimates appear to stabilize after averaging 3-4 seeds. Single-seed results deviate up to 8$`\times`$ from 5-seed mean (seed 42: 16 vs mean 4.6). However, this observation is based on one seed ordering without confidence intervals or cross-validation. We suggest researchers test stability in their own setup rather than relying on a universal threshold. Our data supports the weaker claim: single-seed results are unreliable; multi-seed averaging materially reduces variance.
RQ3: Reward Shaping Impact
Negative Result 3: Causality-based reward bonuses dramatically collapsed exploration when combined with full signatures (16$`\rightarrow`$1 findings, -94%) and found zero verified attacks in both cases. The reward gradient reduced tool use (84$`\rightarrow`$7 calls) by converging to local optima. When used alone (RQ4), rewards amplify noise (18 findings, 0 attacks). Across all contexts, reward shaping consistently fails to improve real attack discovery.
| Config | Findings | Attacks | Calls | Effect |
|---|---|---|---|---|
| No causality bonus | 16 | 0 | 84 | Baseline |
| With +100-250 bonus | 1 | 0 | 7 | -94%, collapse |
| Rewards cause premature convergence, not guidance | ||||
Mechanism: Reward bonuses amplify noise rather than signal. When combined with granular signatures (RQ3), rewards cause dramatic collapse (16$`\rightarrow`$1 findings, 84$`\rightarrow`$7 tool calls) via premature convergence. When used alone (RQ4), rewards generate false positive expansion (18 findings, 0 attacks, 0% precision). In ensemble context (RQ5), with_rewards contributes minimal value (2 findings, 0 attacks). Across all contexts, reward shaping consistently fails to improve real attack discovery.
RQ4: Individual Enhancement Contributions
Negative Result 4: Testing each enhancement in isolation reveals that combining them produces the worst result. Baseline: 2 findings. Targeted prompts alone: 13 findings + 1 real attack. Reward bonuses alone: 18 findings but 0 attacks (pure noise amplification). All enhancements combined: 0 findings. The enhancements actively interfere with each other, and rewards amplify false positives rather than guide discovery.
| Configuration | Findings | Real Attacks | Tool Calls | Efficiency |
|---|---|---|---|---|
| Baseline (tool names only) | 2 | 0 | 4 | 0% |
| Intent hashing only | 4 | 0 | 8 | 0% |
| Reward bonuses only | 18 | 0 | 136 | 0% (false pos.) |
| Targeted prompts only | 13 | 1 (WRITE) | 70 | 1.4% |
| All combined | 0 | 0 | 0 | â |
| Enhancements interfere: each individually performs better than combination | ||||
| Note: Reward bonuses generate high findings (18) but zero attacks â pure false positive amplification | ||||
Failure analysis for “all combined = 0”: When all three enhancements (intent hashing + rewards + targeting) run together, the system produces zero findings and zero tool callsâcomplete exploration failure. Without detailed instrumentation of archive state, we can only hypothesize mechanisms:
-
Possible cause 1: Archive saturation. Intent hashing creates fine-grained cells; reward bonuses drive premature convergence; combined they may fill the archive with shallow, high-reward states that never get re-explored.
-
Possible cause 2: Initialization failure. Targeted prompt bank may conflict with safety training, causing immediate refusal loop that prevents any tool calls.
-
Possible cause 3: Selection deadlock. Conflicting priorities from three enhancement modules may prevent any cell from being selected for expansion.
To diagnose conclusively would require instrumentation: archive size over time, cell selection counts, refusal rates per iteration, and expansion attempts. The key empirical observation is that combining all enhancements produces worse outcomes than using them individually, suggesting negative synergy regardless of the specific mechanism.
Key insight: Only targeted prompts alone successfully found a verified attack (1 PROMPT_INJECTION_WRITE). Combining all enhancements produced zero findingsâworse than doing nothing. This is not simply “no improvement”âitâs complete system failure through archive saturation and selection deadlock.
RQ5: Quantity-Diversity Tradeoff in Ensemble vs Monolithic
Nuanced Result: Results reveal a tradeoff rather than dominance. A single enhanced agent found 5 total attacks (all one type: WRITE) while an ensemble found only 2 attacks but with 2 distinct types (WRITE + READ_SECRET). Enhanced agent has better raw count; ensemble offers complementary coverage across attack classes. Neither strictly “wins.”
Critical detail: Ensembles use simple baseline agents (not enhanced), so this compares enhanced optimization vs simple diversity. A single simple agent (180s) found 0 attacks, showing that simple agents fail alone but succeed in ensemble.
| Approach | Findings | Real Attacks | Attack Types | Precision |
|---|---|---|---|---|
| Single Enhanced (180s) | 26 | 5 | WRITE only | 19.2% |
| Ensemble (3Ă60s same seed) | 7 | 3 | WRITE (2), EMAIL (1) | 42.9% |
| Ensemble (3 diverse strategies) | 16 | 2 | WRITE, SECRET | 12.5% |
| Single Simple (180s) | 0 | 0 | â | â |
| Enhanced uses full optimization; ensembles use SIMPLE agents (no enhancements) | ||||
| Single simple agent found 0 attacks â ensemble diversity enables 2-3 attacks from simple agents | ||||
Ensemble composition (3$`\times`$60s diverse, seed=42):
-
Agent 1: tools_only $`\rightarrow`$ 14 findings, 2 attacks (WRITE, SECRET)
-
Agent 2: with_targeting $`\rightarrow`$ 0 findings, 0 attacks
-
Agent 3: with_rewards $`\rightarrow`$ 2 findings, 0 attacks
-
Combined: 16 unique findings, 2 unique attack types
Why ensembles $`\neq`$ 3$`\times`$ enhanced performance:
-
Ensembles use simple baseline agents (no enhancements), not the enhanced agent
-
High variance: In ensemble_simple, only 1 of 3 runs found anything (7, 0, 0 findings)
-
Same random seed means similar paths, not 3$`\times`$ independent coverage
-
Single simple agent (180s) found 0 attacks â complete failure alone
Key insight: The enhanced agentâs 5 attacks all belong to PROMPT_INJECTION_WRITE. The ensembleâs 2-3 attacks span 2 distinct vulnerability classes (WRITE + EMAIL/SECRET). This represents a quantity-diversity tradeoff: monolithic optimization finds more instances within one class; ensemble sampling discovers different vulnerability types.
Implication: For comprehensive security testing, both approaches have merit. Enhanced agents efficiently exploit discovered attack vectors (5 vs 0 for simple). Ensembles enable simple agents to succeed through diversity (2-3 attacks vs 0 alone). The choice depends on whether deep exploitation or wide coverage is prioritized.
RQ6: Scaling Laws for Ensemble Diversity
Budget allocation protocol: Each agent runs for fixed time (60s) with different random seeds. Agents execute in parallel batches (max 20 concurrent), so N=100 requires 5 batches $`\times`$ 60s = 300s wall time. This tests whether seed diversity from N independent agents outperforms single-agent optimization. Unlike the “divide fixed budget” protocol, each agent gets full exploration time, making larger N strictly advantageous if seed variance provides coverage.
Key Finding: Without guardrails, ensemble diversity (unique attack types) saturates at N$`\approx`$20 agents (4 types), while attack count continues growing (10$`\rightarrow`$54 by N=100). With guardrails, diversity collapses to 1 type (WRITE only), but attack count scales better (27 vs 10 at N=50). This reveals an exploration-exploitation tradeoff: guardrails sacrifice breadth for depth.
Two scaling regimes:
-
Type diversity: Saturates at N=20 (4 types). Scaling to N=100 adds zero new types despite 5$`\times`$ cost.
-
Attack count: Continues growing (10$`\rightarrow`$54), but these are duplicates of same 4 types, not new vulnerabilities.
Guardrail impact on scaling: With guardrails enabled, the scaling dynamics change dramatically. Type diversity collapses to 1 (only WRITE attacks found across all N=1-100), but attack count continues scaling linearly (0$`\rightarrow`$3$`\rightarrow`$6$`\rightarrow`$7$`\rightarrow`$15$`\rightarrow`$27, plateauing at N=50). The guardrail doesnât amplify attacksâit filters the search space. Pattern-based detection blocks EMAIL/SHELL/RCE attempts, but WRITE injections slip through, causing the Go-Explore archive to fill exclusively with WRITE variants. This evolutionary selection pressure focuses all exploration on the single exploitable attack class, sacrificing breadth for depth.
Implication: For discovery (finding new attack classes), N=20 without guardrails is optimalâbeyond this, agents rediscover known types. For evidence gathering (finding many instances for proof/statistics), larger N with guardrails helps (27 attacks at N=50-100 vs 10 without). Choose based on goal: diversity $`\rightarrow`$ N=20 no guard; exploitation $`\rightarrow`$ N=50 with guard. Seed variation drives diversity; strategy variation (RQ5) offers complementary coverage.
Targeted Exploration Results - Actual Data
| Mode | Shell Chains | Time | Findings |
|---|---|---|---|
| General (no guard) | 0 | 150s | 1 |
| Targeted (no guard) | 0 | 120s | 2 |
| Targeted + Guard | 0 | 120s | 3 |
| No shell chains discovered in any configuration; guardrail provides modest increase (2â3 findings) | |||
Key insight: Targeted prompts provided minimal benefit over general exploration (1$`\rightarrow`$2 findings without guard). Adding a guardrail offered modest amplification (2$`\rightarrow`$3 findings, 50% increase). However, no configuration successfully triggered shell.run execution in these targeted experiments, highlighting the difficulty of exploiting safety-trained models even with domain-specific prompting.
Attack Depth Distribution
Across state signature ablation experiments, we observed the following depth distribution for findings:
Insight: Most discovered attacks occurred at depth 3 (54%), with depth 2 (10%) and depth 4 (36%) representing simpler and more complex chains respectively. The simplest signature scheme (tools_only) found attacks across all depths, demonstrating that complexity is not required for deep chain discovery.
Modest Guardrail Effects
Empirical Observations
Our experiments reveal modest rather than dramatic effects from guardrails:
General exploration (150s):
-
Without guardrail: 1 finding, 2 tool calls
-
With guardrail: 4 findings, 16 tool calls
-
Effect: 4$`\times`$ increase in findings, 8$`\times`$ in tool calls
Targeted exploration (120s):
-
Without guardrail: 2 findings, 10 tool calls
-
With guardrail: 3 findings, 10 tool calls
-
Effect: 50% increase in findings, no change in tool calls
Mechanism: Guardrails alter agent behavior but effects are context-dependent. In general exploration, the guardrail increased overall activity moderately. In targeted mode, effects were minimal. Neither configuration discovered shell execution chains.
Key insight: Guardrails provide incremental rather than transformative effects on discovery rates. The impact depends heavily on exploration strategy and prompt design. Claims of dramatic amplification (e.g., 19$`\times`$) do not reflect typical outcomes with safety-trained models like GPT-4o-mini.
Detailed Attack Patterns
| Configuration | Verified Attacks | Types |
|---|---|---|
| General 150s (no guard) | 0 | â |
| General 150s (with guard) | 0 | â |
| Targeted 120s (no guard) | 0 | â |
| Targeted 120s (with guard) | 0 | â |
| State sig: tools_only | 6 | SHELL(2), RCE(2), WRITE(2) |
| State sig: tools_args3 | 0 | â |
| State sig: full | 0 | â |
| Enhancement: targeted only | 1 | WRITE |
| Single enhanced (180s) | 5 | WRITE (all) |
| Ensemble diverse (180s) | 2 | WRITE, READ_SECRET |
| Simplest signature found all attacks; enhanced agent found most but limited diversity | ||
Verified Attack Types
Our experiments discovered three types of verified prompt injection attacks:
-
PROMPT_INJECTION_SHELL (2 instances): Agent executes shell commands based on injected instructions from files or web content
-
REMOTE_CODE_EXECUTION (2 instances): Similar to shell but with remote execution context
-
PROMPT_INJECTION_WRITE (8 instances total): Agent writes files or sends emails based on malicious instructions
-
PROMPT_INJECTION_READ_SECRET (1 instance): Agent reads sensitive files (secret.txt) due to injection
Key pattern: All verified attacks required causal chains where untrusted sources (files, emails, web pages) contained injection phrases that directly triggered subsequent dangerous operations. The simplest state signature scheme (tools_only) discovered the most diverse attack types, while the enhanced agent found multiple instances of the same attack class.
No shell chains in runtime/guardrail experiments: Notably, the targeted shell experiments (with/without guardrails) found zero verified shell attacks, demonstrating the extreme difficulty of triggering shell execution with safety-trained models like GPT-4o-mini.
Comparison With Baselines
Baseline methodology: Manual red team, random fuzzing, and static analysis rows represent typical outcomes from security literature, not experiments we ran. These provide context for Go-Exploreâs performance. Manual red teaming typically finds 10-20 suspicious behaviors but few verified exploits due to time constraints. Random fuzzing generates many candidates but lacks causality verification. Static analysis identifies code patterns but cannot test runtime agent behavior.
| Method | Findings | Real Attacks | Max Depth | Runtime | Causality | Reproducible | Systematic |
|---|---|---|---|---|---|---|---|
| Literature baselines (typical ranges, not run by us): | |||||||
| Manual Red Team | 10-20 | 1-2 | 3 | Hours-Days | Yes | No | No |
| Random Fuzzing | 5-10 | 0 | 2-3 | 150s | No | Yes | No |
| Static Analysis | 0-5 | 0 | â | Minutes | Partial | Yes | Partial |
| Our experiments (measured): | |||||||
| General Go-Explore (150s) | 1 | 0 | 2 | 150s | Yes | Yes | Limited |
| Simple Go-Explore (tools_only) | 12 | 6 | 4 | 90s | Yes | Yes | Yes |
| Enhanced Go-Explore (180s) | 26 | 5 | 4 | 180s | Yes | Yes | Yes |
| Ensemble Go-Explore (180s) | 16 | 2 | 4 | 180s | Yes | Yes | Yes |
| Note: Literature baselines provide context; Simple (tools_only) provides best efficiency; Enhanced finds most attacks but one type | |||||||
Key takeaway: Go-Explore provides systematic, reproducible exploration with causality verification, addressing weaknesses of manual testing (not reproducible) and fuzzing (no causality). However, absolute attack counts remain low (2-6) due to GPT-4o-miniâs safety training, demonstrating that even systematic methods struggle with well-defended models.
Practical Implications
For Security Practitioners
-
Use simple state signatures: Tool names only outperformed all complex schemes (found all 6 attacks vs 0 for granular signatures)
-
Avoid reward shaping: Causality bonuses either collapse exploration (-94% with signatures) or amplify false positives (18 findings, 0 attacks alone)âharmful in all contexts
-
Donât combine enhancements: All three together produced zero findings; use one targeted enhancement at most
-
Targeted prompts help marginally: Found 1 attack (13 findings) versus 0 for random; modest but meaningful improvement
-
Runtime has diminishing returns: 20s$`\rightarrow`$60s$`\rightarrow`$150s yielded 0$`\rightarrow`$0$`\rightarrow`$1 findings; invest in better strategies, not longer runtimes
-
Choose based on goals: Enhanced agent for exploitation depth (5 attacks, one type); ensemble for diversity (2 attacks, two types)
For Guardrail Designers
Design Principle: Expect Modest Effects Guardrails provide incremental improvements (4$`\times`$ for general, 1.5$`\times`$ for targeted exploration) rather than dramatic amplification. Focus on consistent improvement rather than seeking transformative effects. Test thoroughly with adversarial exploration to measure actual impact.
Recommendations:
-
Measure guardrail effects empiricallyâdonât assume dramatic amplification
-
Test across multiple exploration strategies (general vs targeted)
-
Monitor how defenses change agent behavior patterns, but expect modest shifts
-
Use adversarial exploration for realistic assessment, not just unit tests
Limitations
-
Safety training effects: GPT-4o-miniâs safety training creates extremely high refusal rates, limiting absolute attack discovery to 6 attacks across ablations, 5 for enhanced agent, 2 for ensemble
-
Model-specific: Results specific to GPT-4o-mini; less safety-trained models (e.g., gpt-oss-20b) find 132-144 attacks but excluded from this study
-
Simplified environment: 5 tool types (fs, shell, email, web, http); production agents have more complex capabilities
-
Single guardrail tested: Only tested basic prompt injection detection; other defensive mechanisms may behave differently
-
Limited runtime: Max 180s per experiment; however, runtime scaling showed minimal benefit (20s$`\rightarrow`$60s$`\rightarrow`$150s yielded 0$`\rightarrow`$0$`\rightarrow`$1)
-
Verification challenge: High false positive rate (70-90% of findings are false positives; e.g., 19.2
-
Single-seed limitations: Except RQ2 (5 seeds), all experiments use seed=42. Given 8$`\times`$ variance observed in RQ2, single-seed results are inherently uncertain
Related Work
Prompt injection: Greshake et al. introduced indirect injection; we discover multi-hop chains. Perez & Ribeiro cataloged techniques; we systematize discovery.
LLM security: Wallace et al. , Liu et al. , Zou et al. studied alignment failures; we focus on tool-using agents.
Agent security: Yi et al. benchmarked defenses; we discover the adaptation phenomenon.
Go-Explore: Ecoffet et al. applied to games; we pioneer security applications.
Conclusion
Testing safety-trained LLM agents reveals that algorithmic sophistication actively harms discovery. Our experiments with GPT-4o-mini across 20 configurations establish three negative results:
-
Simplicity outperforms complexity: The simplest state signature (tool names only) found all 6 verified attacks; adding arguments and user intent reduced attacks to zero
-
Rewards are actively harmful: Causality bonuses collapse exploration with signatures (-94%), amplify false positives alone (18 findings, 0 attacks), and fail in ensembles (2 findings, 0 attacks)âharmful across all contexts
-
Enhancements interfere: Combining all three enhancements produced complete failure (0 findings)âworse than doing nothing
Only targeted prompts worked in isolation, finding 1 attack from 13 findings. Extended runtime provided minimal benefit (20s$`\rightarrow`$60s$`\rightarrow`$150s yielded 0$`\rightarrow`$0$`\rightarrow`$1 findings).
Our ensemble experiments revealed a nuanced tradeoff: a single enhanced agent discovered 5 attacks (all one type: WRITE) while an ensemble found only 2 attacks but spanning 2 distinct types (WRITE + READ_SECRET). This represents a quantity-diversity tradeoff rather than strict dominance.
Guardrails provided modest effects: 4$`\times`$ amplification for general exploration (1$`\rightarrow`$4 findings), 50% for targeted (2$`\rightarrow`$3). No configuration achieved the dramatic 19$`\times`$ amplification sometimes claimed.
For practitioners: Use simple algorithms and targeted prompts. Avoid complex reward shaping and combined enhancements. Choose monolithic for exploitation depth; ensembles for vulnerability diversity.
For researchers: Safety-trained models resist sophisticated optimization. Future work should investigate why simplicity succeeds and whether other model families exhibit similar patterns.
Acknowledgments
We thank anonymous reviewers for valuable feedback.
10
A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune, âFirst return, then explore,â Nature, vol. 590, pp. 580-586, 2021.
K. Greshake et al., âNot what youâve signed up for: Compromising real-world llm-integrated applications,â arXiv:2302.12173, 2023.
E. Wallace et al., âThe alignment problem from a deep learning perspective,â arXiv:2209.00626, 2024.
F. Perez and I. Ribeiro, âIgnore previous prompt: Attack techniques for language models,â arXiv:2211.09527, 2022.
Y. Liu et al., âJailbreaking ChatGPT via Prompt Engineering,â arXiv:2305.13860, 2023.
A. Zou et al., âUniversal and transferable adversarial attacks,â arXiv:2307.15043, 2023.
J. Yi et al., âBenchmarking and Defending Against Indirect Prompt Injection,â arXiv:2312.14197, 2023.
M. Zalewski, âAmerican fuzzy lop,â 2014.
C. Cadar et al., âKLEE: Automatic generation of high-coverage tests,â OSDI, 2008.
S. Toyer et al., âTensor Trust: Interpretable Prompt Injection Attacks,â arXiv:2311.01011, 2023.