MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large language model (LLM) based web agents are increasingly deployed to automate complex online tasks by directly interacting with web sites and performing actions on users’ behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, enabling adversaries to hijack agent behavior and violate user intent. Despite growing awareness of this threat, existing evaluations rely on fixed attack templates, manually selected injection surfaces, or narrowly scoped scenarios, limiting their ability to capture realistic, adaptive attacks encountered in practice. We present MUZZLE, an automated agentic framework for evaluating the security of web agents against indirect prompt injection attacks. MUZZLE utilizes the agent’s trajectories to automatically identify high-salience injection surfaces, and adaptively generate context-aware malicious instructions that target violations of confidentiality, integrity, and availability. Unlike prior approaches, MUZZLE adapts its attack strategy based on the agent’s observed execution trajectory and iteratively refines attacks using feedback from failed executions. We evaluate MUZZLE across diverse web applications, user tasks, and agent configurations, demonstrating its ability to automatically and adaptively assess the security of web agents with minimal human intervention. Our results show that MUZZLE effectively discovers 37 new attacks on 4 web applications with 10 adversarial objectives that violate confidentiality, availability, or privacy properties. MUZZLE also identifies novel attack strategies, including 2 cross-application prompt injection attacks and an agent-tailored phishing scenario.


💡 Research Summary

Title: MUZZLE: Adaptive Agentic Red‑Teaming of Web Agents Against Indirect Prompt Injection Attacks

Problem Statement: Large language model (LLM)‑driven web agents can autonomously browse the internet, fill forms, click buttons, and otherwise act on behalf of users. Because these agents ingest raw web content directly into their LLM reasoning loop, they are vulnerable to indirect prompt injection (IPI) attacks: an adversary embeds malicious instructions in untrusted web pages, causing the agent to deviate from the user’s intent and potentially breach confidentiality, integrity, or availability. Prior work on IPI against web agents has relied on static attack templates, manually chosen injection points, or limited sandbox evaluations, which fails to capture realistic, adaptive, multi‑step attacks that an adversary could mount in practice.

Contribution Overview:
MUZZLE is the first fully automated red‑team framework that discovers, executes, and evaluates IPI attacks against web agents in a sandboxed environment with minimal human involvement. Its core innovations are:

  1. Trajectory‑Based Surface Identification: MUZZLE records the agent’s perception‑action loop (DOM snapshots, screenshots, memory traces) to build an execution trajectory. It computes a salience score for each UI element (how much the element influences the agent’s next decision) and an exploitability estimate (how likely an injection there could affect the agent). This yields a ranked list of high‑impact injection surfaces without any manual labeling.

  2. Context‑Aware Adaptive Payload Generation: Using the ranked surfaces, a meta‑LLM generator creates multi‑step malicious prompts tailored to a specific adversarial objective (e.g., exfiltrate credentials, corrupt data, cause denial‑of‑service). The payloads are not single strings but structured attack scripts that may span several pages and actions.

  3. Feedback‑Driven Refinement Loop: After injecting a candidate payload, MUZZLE runs the target agent in the sandbox, collects telemetry (success/failure flags, logs, state changes), and feeds this back to a reinforcement‑learning based refiner. The refiner adjusts injection location, wording, or adds auxiliary steps, iterating until the attack succeeds or a budget is exhausted. This adaptive search dramatically reduces the combinatorial explosion of the attack space.

  4. Cross‑Application Capability: By leveraging The Zoo—a lightweight Docker‑based multi‑application sandbox—MUZZLE can orchestrate attacks that traverse multiple web services (e.g., a comment injection on a forum that later steals an authentication token from an e‑mail service). This enables the discovery of novel cross‑application IPI attacks not previously reported.

  5. Reproducible End‑to‑End Evaluation: All attacks are executed inside deterministic containers, with full logging of agent actions, injected payloads, and resulting system state. Researchers can replay any discovered attack, facilitating comparative studies and benchmark creation.

System Architecture:

  • Trajectory Analyzer captures agent observations and actions, builds a graph of visited UI elements, and scores them.
  • Payload Generator (a meta‑LLM with few‑shot prompts) produces candidate malicious instructions conditioned on the target objective and the selected UI element.
  • Execution Engine runs the target web agent against the modified web pages inside The Zoo, exposing both DOM and visual channels.
  • Refinement Module employs a policy‑gradient method to update the generation policy based on success signals.

Experimental Setup:

  • Web Applications: Four representative sites (e‑commerce, forum, CMS, email client) deployed in The Zoo.
  • Agents: Three configurations—DOM‑based, visual‑only, and cloud‑hosted agents—each driven by a different LLM (Claude‑2, GPT‑4, Llama‑2).
  • Adversarial Objectives: Ten goals covering confidentiality (data exfiltration), integrity (unauthorized actions), and availability (task disruption).

Results:

  • MUZZLE discovered 37 distinct IPI attacks across the four apps, many of which were not captured by prior manual or static methods.
  • Two cross‑application attacks were identified, where a payload injected in one service compromised credentials used by another service.
  • One agent‑tailored phishing scenario demonstrated that MUZZLE can synthesize a convincing phishing page that tricks the agent into submitting user credentials.
  • On average, each successful attack required 3.2 refinement iterations and completed within ≈12 minutes of wall‑clock time per application, showing practical scalability.

Key Insights:

  • Leveraging the agent’s own execution trace to prioritize injection points yields far higher attack yield than naïve exhaustive scanning.
  • Multi‑step, context‑aware payloads are essential; many successful attacks required chaining actions (login → comment → data leak).
  • Adaptive refinement dramatically improves success rates in a noisy, dynamic web environment where static payloads often fail.

Limitations & Future Work:

  • MUZZLE currently focuses on agents that process DOM or structured visual data; pure image‑only agents need additional perception modules.
  • The payload generator can be thwarted by strong system‑prompt filters or runtime defenses; future work should explore adversarial prompt generation that bypasses such mitigations.
  • The set of adversarial objectives is fixed; extending to long‑term, stealthy exfiltration or privilege‑escalation scenarios is an open direction.
  • Scaling to real‑world internet‑scale deployments (with network latency, CAPTCHAs, anti‑bot measures) will require further engineering.

Conclusion:
MUZZLE establishes a new paradigm for automated security assessment of LLM‑driven web agents. By integrating trajectory‑based surface selection, adaptive multi‑step payload synthesis, and a feedback‑driven refinement loop within a reproducible sandbox, it uncovers a richer set of IPI vulnerabilities—including cross‑application and phishing attacks—than prior static approaches. The framework’s modularity and model‑agnostic design make it a promising foundation for future research, industry red‑team operations, and the development of robust defenses against indirect prompt injection in autonomous web agents.


Comments & Academic Discussion

Loading comments...

Leave a Comment