Attention is All You Need to Defend Against Indirect Prompt Injection Attacks in LLMs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large Language Models (LLMs) have been integrated into many applications (e.g., web agents) to perform more sophisticated tasks. However, LLM-empowered applications are vulnerable to Indirect Prompt Injection (IPI) attacks, where instructions are injected via untrustworthy external data sources. This paper presents Rennervate, a defense framework to detect and prevent IPI attacks. Rennervate leverages attention features to detect the covert injection at a fine-grained token level, enabling precise sanitization that neutralizes IPI attacks while maintaining LLM functionalities. Specifically, the token-level detector is materialized with a 2-step attentive pooling mechanism, which aggregates attention heads and response tokens for IPI detection and sanitization. Moreover, we establish a fine-grained IPI dataset, FIPI, to be open-sourced to support further research. Extensive experiments verify that Rennervate outperforms 15 commercial and academic IPI defense methods, achieving high precision on 5 LLMs and 6 datasets. We also demonstrate that Rennervate is transferable to unseen attacks and robust against adaptive adversaries.

💡 Research Summary

The paper “Attention is All You Need to Defend Against Indirect Prompt Injection Attacks in LLMs” addresses a critical security vulnerability in Large Language Model (LLM)-integrated applications, such as web agents and email assistants. These applications are susceptible to Indirect Prompt Injection (IPI) attacks, where adversaries embed malicious instructions into untrusted external data sources (e.g., websites, documents) that the LLM retrieves and processes. This can hijack the LLM’s behavior, leading to outcomes like phishing, data leakage, or system compromise, which OWASP ranks as a top-tier risk.

Existing defense strategies fall into detection and prevention categories but have significant limitations. Detection methods often rely on classifiers looking for specific keywords (lacking generalization to novel attacks) or auxiliary LLMs (which are costly and themselves vulnerable). Prevention methods involve prompt paraphrasing (ineffective against advanced attacks) or fine-tuning the target LLM (computationally expensive and often impractical for deployment). Crucially, many detection-only methods cause a denial-of-service if an attack is found, as the LLM application is simply blocked from proceeding.

To overcome these challenges, the authors propose “Rennervate,” a novel defense framework that performs both detection and sanitization of IPI attacks. Its core innovation is leveraging the internal attention mechanisms of the LLM itself as a robust signal for identifying malicious injections. The key insight is that when an LLM processes an injected malicious instruction, it exhibits distinctive attention patterns compared to when it processes benign content. These patterns are inherent to the model’s operation and thus offer better transferability across different attack variants.

Rennervate’s technical architecture is built around a token-level detector. Instead of making a binary decision for an entire input sequence, it analyzes the likelihood of an injection at each token position. This fine-grained analysis is enabled by a custom “2-step attentive pooling mechanism.” In the first step, it aggregates information from the multiple attention heads within the LLM’s transformer layers, weighting each head’s contribution based on its relevance to IPI detection. In the second step, it aggregates this information across the sequence of response tokens. This process extracts stable, generalizable features from the variable-length attention matrices, allowing the model to learn subtle signatures of injection attempts.

The token-level detection capability directly enables Rennervate’s second major feature: precise sanitization. By pinpointing the exact location of the malicious injection within the retrieved external data, the system can neutralize it—for example, by deleting, masking, or rewriting the specific malicious tokens—while preserving the surrounding benign content. This allows the LLM-integrated application to continue its original task (e.g., summarizing a webpage) even after defending against an attack, thus maintaining service availability and functionality.

To support research in this field, the authors constructed a large-scale, fine-grained IPI dataset named FIPI, which they plan to open-source. FIPI contains 100k instances covering five major IPI attack methods (e.g., Context Ignoring, Escape Characters, Fake Completion) across 300 different NLP tasks. Crucially, it includes token-level annotations that label which parts of the text constitute the malicious injection, providing a valuable resource for training and evaluating fine-grained defense models.

The experimental evaluation is extensive and rigorous. The authors implemented Rennervate on five diverse LLMs (including causal decoder models like Llama and GPT, and encoder-decoder models like Flan-T5) and evaluated it on six datasets. The results demonstrate that Rennervate outperforms 15 existing commercial and academic baseline methods in both detection accuracy and the effectiveness of its sanitization. Additional experiments confirm its strong transferability: Rennervate maintains high performance when tested on five unseen datasets and two completely unseen attack techniques not included in its training data. Finally, the framework is shown to be robust against adaptive adversaries in both black-box and white-box settings, where the attacker is aware of the defense and attempts to craft evasive attacks.

In conclusion, Rennervate presents a significant advance in securing LLM-integrated applications against IPI attacks. By turning the LLM’s own attention mechanisms against attackers, it achieves high-precision, fine-grained defense that both detects threats and sanitizes inputs without disrupting legitimate functionality, all while being efficient and transferable to new challenges.

Attention is All You Need to Defend Against Indirect Prompt Injection Attacks in LLMs

💡 Research Summary

Comments & Academic Discussion

Leave a Comment