DREAM: Dynamic Red-teaming across Environments for AI Models
Large Language Models (LLMs) are increasingly used in agentic systems, where their interactions with diverse tools and environments create complex, multi-stage safety challenges. However, existing benchmarks mostly rely on static, single-turn assessments that miss vulnerabilities from adaptive, long-chain attacks. To fill this gap, we introduce DREAM, a framework for systematic evaluation of LLM agents against dynamic, multi-stage attacks. At its core, DREAM uses a Cross-Environment Adversarial Knowledge Graph (CE-AKG) to maintain stateful, cross-domain understanding of vulnerabilities. This graph guides a Contextualized Guided Policy Search (C-GPS) algorithm that dynamically constructs attack chains from a knowledge base of 1,986 atomic actions across 349 distinct digital environments. Our evaluation of 12 leading LLM agents reveals a critical vulnerability: these attack chains succeed in over 70% of cases for most models, showing the power of stateful, cross-environment exploits. Through analysis of these failures, we identify two key weaknesses in current agents: contextual fragility, where safety behaviors fail to transfer across environments, and an inability to track long-term malicious intent. Our findings also show that traditional safety measures, such as initial defense prompts, are largely ineffective against attacks that build context over multiple interactions. To advance agent safety research, we release DREAM as a tool for evaluating vulnerabilities and developing more robust defenses.
💡 Research Summary
The paper introduces DREAM (Dynamic Red‑teaming across Environments for AI Models), a systematic framework designed to expose safety gaps in large language model (LLM) agents that operate across multiple tools and digital environments. Traditional safety benchmarks evaluate models with static, single‑turn prompts, which overlook the complex, adaptive threats that arise when an agent can invoke external APIs, manipulate files, or interact with email, operating systems, and cloud services over a prolonged dialogue. DREAM addresses this blind spot by modeling the adversarial interaction as a Partially Observable Markov Decision Process (POMDP) and by constructing a Cross‑Environment Adversarial Knowledge Graph (CE‑AKG) that fuses observations from disparate environments into a unified belief state bₜ.
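The cross‑environment fusion performed by the CE‑AKG can be illustrated with a minimal sketch. This is not the authors' implementation; the class, method names, and sample facts below are hypothetical, showing only the core idea of storing observations as triples tagged by origin environment and querying them across environments.

```python
from collections import defaultdict

class CrossEnvKnowledgeGraph:
    """Toy cross-environment knowledge store (hypothetical interface):
    records (subject, relation, object) triples tagged with the environment
    they were observed in, and fuses them on query regardless of origin."""

    def __init__(self):
        # subject -> list of (relation, object, environment) edges
        self.edges = defaultdict(list)

    def observe(self, env, subject, relation, obj):
        """Record one observation made in environment `env`."""
        self.edges[subject].append((relation, obj, env))

    def facts_about(self, subject):
        """Return everything known about `subject`, across all environments."""
        return self.edges[subject]

# Illustrative only: a fact learned in the email environment becomes
# available when planning a later step in the cloud environment.
g = CrossEnvKnowledgeGraph()
g.observe("email", "alice", "has_credential", "smtp_token")
g.observe("cloud", "alice", "owns_bucket", "backups")
print(g.facts_about("alice"))
```

The key design point is that knowledge is keyed by entity, not by environment, which is what lets a multi‑stage attack pivot information gathered in one environment into another.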
The CE‑AKG acts as a dynamic knowledge base: each observation (e.g., a credential retrieved from an email inbox) updates the belief via a transition function τ(bₜ, aₜ, oₜ₊₁), gradually reducing the epistemic uncertainty H(bₜ) and accumulating knowledge K_accum. This accumulation enables the attacker to keep the per‑step intent density ρₜ well below the detection threshold θ_safe, thereby evading static filters that would flag a high‑density single‑turn attack.
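A minimal numerical sketch of this belief dynamic, under the assumption that the belief is a discrete distribution over hypotheses and τ is a Bayesian reweighting (the hypothesis set below is invented for illustration):

```python
import math

def entropy(belief):
    """Shannon entropy H(b) of a discrete belief distribution, in bits."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def update_belief(belief, likelihood):
    """Bayesian sketch of tau(b_t, a_t, o_{t+1}): reweight each hypothesis
    by the likelihood of the new observation, then renormalize."""
    posterior = {h: p * likelihood(h) for h, p in belief.items()}
    z = sum(posterior.values())
    return {h: p / z for h, p in posterior.items()}

# Uniform prior over four hypothetical credential locations: H(b) = 2 bits.
belief = {h: 0.25 for h in ["inbox", "drive", "s3", "vault"]}
h_before = entropy(belief)

# An observation from the email environment rules out one hypothesis.
belief = update_belief(belief, lambda h: 0.0 if h == "inbox" else 1.0)
assert entropy(belief) < h_before  # epistemic uncertainty H(b_t) decreased
```

Each observation shrinks H(bₜ) a little, which is why no single turn needs to carry enough malicious signal to trip a static per‑turn filter.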
To synthesize attack trajectories, DREAM employs a Contextualized Guided Policy Search (C‑GPS) algorithm. C‑GPS draws from an atom‑attack library of 1,986 primitive actions spanning 349 distinct environments. Using heuristic tree search and back‑tracking, the algorithm selects actions that maximize immediate information gain early on, then pivots to high‑impact steps once the accumulated knowledge surpasses a defense entropy barrier Ω. The transition is mathematically captured by a Hill‑type phase‑transition function P_success(bₜ) = (K_accum·ω)ᵏ / (Ωᵏ + (K_accum·ω)ᵏ), modeling the “thunderous strike” where success probability jumps from near zero to near one.
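The phase‑transition shape of this function is easy to see numerically. The sketch below implements the Hill‑type formula as stated; the constant values for ω, Ω, and the exponent k are illustrative placeholders, not values reported in the paper:

```python
def success_probability(k_accum, omega=1.0, barrier=10.0, k=4):
    """Hill-type phase transition from the summary:
    P_success = (K_accum * omega)^k / (Omega^k + (K_accum * omega)^k).
    Illustrative constants: omega=1.0, Omega=10.0, k=4 (not from the paper)."""
    x = (k_accum * omega) ** k
    return x / (barrier ** k + x)

# Well below the defense entropy barrier, success probability stays near zero;
# once accumulated knowledge exceeds the barrier, it jumps toward one.
print(success_probability(2.0))   # far below the barrier: near 0
print(success_probability(20.0))  # far above the barrier: near 1
```

With a large Hill exponent k, the curve is nearly flat on both sides of Ω and steep at the crossing point, which is the “thunderous strike” behavior the authors describe: a long quiet reconnaissance phase followed by an abrupt jump in attack success.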
The authors evaluate DREAM on twelve state‑of‑the‑art LLM agents (including GPT‑4‑Turbo, Claude‑2, Gemini‑1.5, and Llama‑2‑Chat). For each model, DREAM automatically generates multi‑step, cross‑environment attack chains (typically five steps across five environments). The results are striking: over 68% of the generated attack trajectories successfully bypass the agents’ built‑in safety mechanisms, and attacks that span five environments exhibit vulnerability severity 4.8× higher than estimates derived from static benchmarks. A recurring failure mode, termed “contextual fragility,” occurs when safety rules that protect the model in one environment do not transfer when the same information is pivoted to another environment. Moreover, conventional defenses such as initial safety prompts or single‑turn content filters prove largely ineffective against these adaptive, long‑horizon campaigns.
The paper’s contributions are threefold: (1) a formal shift from stateless, atomic testing to stateful trajectory auditing, establishing a new paradigm for assessing LLM agents under sustained adversarial pressure; (2) the DREAM framework itself, featuring the CE‑AKG and C‑GPS components that enable autonomous synthesis of sophisticated attack chains; (3) an extensive empirical study revealing that current safety mechanisms are insufficient against dynamic, cross‑environment exploits.
Limitations are acknowledged: constructing and maintaining the CE‑AKG requires substantial metadata collection; the atom‑attack library, while large, may still miss niche tool interactions; and real‑time deployment of DREAM in production settings raises scalability concerns. The authors suggest future work on automated knowledge‑graph updates, human‑in‑the‑loop red‑team collaborations, and defensive strategies that deliberately increase attacker uncertainty (e.g., uncertainty injection, continuous belief‑state monitoring).
In summary, DREAM demonstrates that LLM agents, when evaluated with dynamic, multi‑environment red‑team simulations, expose a critical blind spot in current safety research. The framework not only quantifies the upper bound of vulnerability but also provides a reusable platform for developing more robust, context‑aware safeguards before these agents are widely deployed.