AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management
Indirect prompt injection threatens LLM agents by embedding malicious instructions in external content, enabling unauthorized actions and data theft. LLM agents maintain working memory through their context window, which stores interaction history for decision-making. Conventional agents indiscriminately accumulate all tool outputs and reasoning traces in this memory, creating two critical vulnerabilities: (1) injected instructions persist throughout the workflow, granting attackers multiple opportunities to manipulate behavior, and (2) verbose, non-essential content degrades decision-making capabilities. Existing defenses treat bloated memory as given and focus on remaining resilient, rather than reducing unnecessary accumulation to prevent the attack. We present AgentSys, a framework that defends against indirect prompt injection through explicit memory management. Inspired by process memory isolation in operating systems, AgentSys organizes agents hierarchically: a main agent spawns worker agents for tool calls, each running in an isolated context and able to spawn nested workers for subtasks. External data and subtask traces never enter the main agent’s memory; only schema-validated return values can cross boundaries through deterministic JSON parsing. Ablations show isolation alone cuts attack success to 2.19%, and adding a validator/sanitizer further improves defense with event-triggered checks whose overhead scales with operations rather than context length. On AgentDojo and ASB, AgentSys achieves 0.78% and 4.25% attack success while slightly improving benign utility over undefended baselines. It remains robust to adaptive attackers and across multiple foundation models, showing that explicit memory management enables secure, dynamic LLM agent architectures. Our code is available at: https://github.com/ruoyaow/agentsys-memory.
💡 Research Summary
AgentSys tackles the emerging security problem of indirect prompt injection (IPI) in large‑language‑model (LLM) agents by redesigning how agents manage their working memory. Conventional agents accumulate every tool output, intermediate reasoning trace, and conversational turn in a single context window. This “full‑history” approach creates two critical weaknesses: (1) injected malicious instructions persist throughout the workflow, giving attackers multiple opportunities to hijack control flow or data flow, and (2) the ever‑growing context dilutes the model’s attention, degrading task performance. Existing defenses—model‑level alignment, detection‑based sanitizers, and system‑level sandboxes—either ignore the memory‑bloat problem or restrict flexibility, limiting agents’ ability to decompose tasks dynamically.
AgentSys introduces a hierarchical memory‑isolation architecture inspired by operating‑system process isolation. A main agent receives the user query and performs high‑level planning, but never directly sees raw tool outputs. For each tool invocation, the main agent spawns a worker agent that runs in its own isolated context window. Workers may recursively spawn nested workers for subtasks. Crucially, the only information that crosses the isolation boundary is a schema‑validated JSON payload containing the tool’s return value. A lightweight validator checks that the JSON conforms to a pre‑defined schema (correct keys, types, and value ranges). If validation fails, the payload is discarded or sanitized before any data is passed upward. Because the main agent never re‑processes raw tool text, any malicious instruction embedded in external content is confined to the worker’s private memory and cannot persist across subsequent reasoning steps.
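The isolation boundary described above can be illustrated with a minimal Python sketch. All names here (`EMAIL_SCHEMA`, `worker_agent`, `validate`) are hypothetical, not the paper's actual API; the point is that the worker parses raw tool text in its own scope and only a deterministically validated structure is handed upward, while anything malformed is discarded.

```python
import json

# Hypothetical schema for one tool's return value: expected keys and types.
EMAIL_SCHEMA = {
    "sender": str,
    "subject": str,
    "body": str,
}

def validate(payload: dict, schema: dict) -> bool:
    """Deterministic structural check: exact key set and matching types."""
    if set(payload) != set(schema):
        return False
    return all(isinstance(payload[k], t) for k, t in schema.items())

def worker_agent(raw_tool_output: str, schema: dict):
    """Stands in for an isolated worker: raw text never leaves this scope."""
    try:
        payload = json.loads(raw_tool_output)  # deterministic JSON parsing
    except json.JSONDecodeError:
        return None  # malformed output is discarded, not forwarded
    if not isinstance(payload, dict) or not validate(payload, schema):
        return None  # schema violation: discard (or sanitize) before passing up
    return payload  # only the validated structure crosses the boundary

# The main agent receives the validated dict, never the raw string;
# the worker's context (including any injected text it saw) is dropped.
raw = '{"sender": "alice@example.com", "subject": "Hi", "body": "Lunch?"}'
result = worker_agent(raw, EMAIL_SCHEMA)
```

In the real system the worker is itself an LLM with its own context window and may spawn nested workers; this sketch only models the cross-boundary contract, which is what confines an injected instruction to the worker's private memory.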
The validation step is event‑triggered: it runs only when a tool call completes, so overhead scales with the number of operations rather than with context length. This design keeps latency low even for long‑running tasks, while still providing a strong security barrier.
Empirical evaluation on two benchmark suites—AgentDojo, which stresses web‑based multi‑step tasks, and ASB, an agent security benchmark—demonstrates the effectiveness of the approach. An ablation that uses only memory isolation reduces the attack success rate (ASR) from over 60% (baseline) to 2.19%. Adding the validator and sanitizer further drops ASR to 0.78% on AgentDojo and 4.25% on ASB. In benign settings, the reduced context slightly improves utility: overall task success rises from 63.54% (undefended) to 64.36%. Notably, for tasks requiring more than four tool calls, AgentSys achieves 0% ASR.
The framework is model‑agnostic; experiments with GPT‑4, Claude‑2, and other foundation models show consistent protection. Adaptive attackers who attempt to embed malicious code inside a worker’s reasoning cannot influence the main agent because the interface is limited to structured JSON. The authors release code and data, enabling reproducibility and further research.
Key contributions:
- Identification of “memory contamination” as a core vulnerability in LLM agents.
- A hierarchical agent architecture that isolates tool execution memory from the planner’s context.
- A deterministic, schema‑based cross‑boundary communication mechanism that prevents malicious instructions from persisting.
- Empirical evidence that explicit memory management simultaneously improves security and task performance across diverse models and benchmarks.
AgentSys therefore establishes a new paradigm—memory management as security—for building safe, dynamic LLM agents that can freely use external tools without exposing themselves to indirect prompt injection attacks.