Optimizing Agent Planning for Security and Autonomy
Indirect prompt injection attacks threaten AI agents that execute consequential actions, motivating deterministic system-level defenses. Such defenses can provably block unsafe actions by enforcing confidentiality and integrity policies, but currently appear costly: they reduce task completion rates and increase token usage compared to probabilistic defenses. We argue that existing evaluations miss a key benefit of system-level defenses: reduced reliance on human oversight. We introduce autonomy metrics to quantify this benefit: the fraction of consequential actions an agent can execute without human-in-the-loop (HITL) approval while preserving security. To increase autonomy, we design a security-aware agent that (i) introduces richer HITL interactions, and (ii) explicitly plans for both task progress and policy compliance. We implement this agent design atop an existing information-flow control defense against prompt injection and evaluate it on the AgentDojo and WASP benchmarks. Experiments show that this approach yields higher autonomy without sacrificing utility.
💡 Research Summary
The paper addresses the growing threat of indirect prompt injection attacks (PIAs) on AI agents that perform consequential actions such as browsing, file manipulation, or code execution. While probabilistic defenses—model alignment, defensive prompts, and classifiers—offer some mitigation, they lack strong guarantees and can be bypassed by sophisticated attacks. Recent work therefore proposes deterministic, information‑flow control (IFC) defenses that label data and tool calls with confidentiality and integrity tags, propagate those tags through computation, and enforce policies that either allow a call or block it pending human‑in‑the‑loop (HITL) approval. IFC thus provides provable security: untrusted data cannot influence consequential actions.
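The IFC mechanism described above can be sketched in a few lines. This is a minimal illustration under assumed names (`Label`, `check_call` are not from the paper): integrity labels propagate via a join, and the policy blocks consequential calls whose context has been tainted by untrusted data, pending HITL approval.

```python
from dataclasses import dataclass
from enum import Enum

class Integrity(Enum):
    TRUSTED = "trusted"
    UNTRUSTED = "untrusted"

@dataclass(frozen=True)
class Label:
    integrity: Integrity

    def join(self, other: "Label") -> "Label":
        # Combining two pieces of data: the result is untrusted
        # if either input is untrusted (labels only move "down").
        if Integrity.UNTRUSTED in (self.integrity, other.integrity):
            return Label(Integrity.UNTRUSTED)
        return Label(Integrity.TRUSTED)

def check_call(context_label: Label, is_consequential: bool) -> str:
    # Policy: a consequential tool call requires a trusted context;
    # otherwise it is deterministically blocked pending HITL approval.
    if is_consequential and context_label.integrity is Integrity.UNTRUSTED:
        return "BLOCK_PENDING_HITL"
    return "ALLOW"
```

Because the check is a deterministic function of labels rather than a classifier's judgment, no adversarial prompt can talk the agent past it; this is the "provable security" the summary refers to.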
However, IFC’s strict labeling can over‑taint the agent’s context, causing many benign actions to be flagged as policy violations, which reduces task completion rates. Prior evaluations focused mainly on utility (task completion) and compared full‑autonomy (TCR@0) against unlimited‑HITL baselines (TCR@∞), thereby overlooking the benefit of reduced human oversight.
The authors introduce “autonomy” as a complementary evaluation dimension, defined by two metrics measured over a set of tasks: (1) HITL load – the total number of HITL interventions on successfully completed tasks, and (2) TCR@k – the proportion of tasks completed with at most k HITL interventions. By plotting TCR@k as k varies, one obtains a full autonomy‑utility trade‑off curve, revealing how many human approvals are actually needed to achieve a given level of performance.
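The two metrics are simple to compute from per-task logs. A sketch, assuming each task record carries a completion flag and an intervention count (the field names here are illustrative, not from the paper):

```python
def hitl_load(tasks: list[dict]) -> int:
    # Total number of HITL interventions across successfully completed tasks.
    return sum(t["interventions"] for t in tasks if t["completed"])

def tcr_at_k(tasks: list[dict], k: int) -> float:
    # Proportion of all tasks completed with at most k HITL interventions.
    done = [t for t in tasks if t["completed"] and t["interventions"] <= k]
    return len(done) / len(tasks)
```

Sweeping `k` from 0 upward and plotting `tcr_at_k` yields the autonomy-utility trade-off curve; TCR@0 is the fully autonomous operating point and TCR@∞ the unlimited-HITL ceiling.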
To improve autonomy, the paper presents PRUDENTIA, an agent design that makes the planner explicitly aware of IFC policies and labels. PRUDENTIA’s key innovations are:
- Policy and label awareness – Tool descriptions embed policy metadata; the planner learns these policies and tracks its own context label, allowing it to predict which calls will violate policies.
- Strategic variable expansion – Variables are used to quarantine untrusted data. Expanding a variable taints the planner’s context, so PRUDENTIA introduces a dedicated “plan” tool that forces the agent to justify any expansion and list subsequent calls, thereby avoiding unnecessary tainting.
- Endorsement vs. approval – Instead of asking the user to approve each policy‑violating call, the agent can request the user to endorse the untrusted data stored in a variable. An endorsement relabels the data as trusted, enabling subsequent trusted‑action (P‑T) calls without further human involvement.
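The quarantine-and-endorse pattern from the last two bullets can be illustrated with a small variable store. This is a hedged sketch, not PRUDENTIA's actual implementation: untrusted data sits behind an opaque handle, expansion reveals the value together with its trust label (which the planner must fold into its context), and a user endorsement flips the label so later consequential calls need no further approval.

```python
class VariableStore:
    """Quarantines untrusted data behind opaque variable handles."""

    def __init__(self) -> None:
        self._vars: dict[str, tuple[object, bool]] = {}  # name -> (value, trusted)

    def quarantine(self, name: str, value: object) -> None:
        # Untrusted data enters the store without tainting the planner,
        # which only ever sees the variable name.
        self._vars[name] = (value, False)

    def endorse(self, name: str) -> None:
        # A user endorsement relabels the stored value as trusted, so
        # subsequent calls that consume it require no further HITL approval.
        value, _ = self._vars[name]
        self._vars[name] = (value, True)

    def expand(self, name: str) -> tuple[object, bool]:
        # Expansion reveals the value with its trust label; an untrusted
        # label must be joined into the planner's context label.
        return self._vars[name]
```

The design choice is that endorsement is a one-time interaction on data, whereas approval is a per-action interaction, which is why endorsement amortizes human effort across many subsequent calls.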
PRUDENTIA is implemented on top of FIDES, a state‑of‑the‑art deterministic IFC defense. The authors evaluate on two benchmarks: AgentDojo (which mixes data‑dependent and data‑independent tasks) and WASP (purely data‑independent). Results show:
- Even a basic IFC mechanism reduces HITL load by up to 1.5× without any loss in task completion rate, demonstrating that deterministic security already yields autonomy gains.
- PRUDENTIA outperforms FIDES: on AgentDojo it improves TCR@0 by up to 9% and cuts overall HITL load by up to 1.9×. On WASP, PRUDENTIA achieves full autonomy (HITL load = 0).
- The TCR@k curves illustrate that PRUDENTIA approaches the ideal all‑knowing agent’s curve much more closely than prior designs, confirming that making the planner policy‑aware effectively reduces unnecessary human interventions while preserving security guarantees.
In summary, the paper contributes a novel set of autonomy metrics, a policy‑aware planning architecture (PRUDENTIA), and empirical evidence that deterministic IFC defenses can be made both secure and highly autonomous. The work opens avenues for richer label lattices, multi‑user policies, and learning‑based label propagation to further enhance the practicality of secure AI agents.