ObliInjection: Order-Oblivious Prompt Injection Attack to LLM Agents with Multi-source Data
Prompt injection attacks aim to contaminate the input data of an LLM to mislead it into completing an attacker-chosen task instead of the intended task. In many applications and agents, the input data originates from multiple sources, with each source contributing a segment of the overall input. In these multi-source scenarios, an attacker may control only a subset of the sources and contaminate the corresponding segments, but typically does not know the order in which the segments are arranged within the input. Existing prompt injection attacks either assume that the entire input data comes from a single source under the attacker’s control or ignore the uncertainty in the ordering of segments from different sources. As a result, their success is limited in domains involving multi-source data. In this work, we propose ObliInjection, the first prompt injection attack targeting LLM applications and agents with multi-source input data. ObliInjection introduces two key technical innovations: the order-oblivious loss, which quantifies the likelihood that the LLM will complete the attacker-chosen task regardless of how the clean and contaminated segments are ordered; and the orderGCG algorithm, which is tailored to minimize the order-oblivious loss and optimize the contaminated segments. Comprehensive experiments across three datasets spanning diverse application domains and twelve LLMs demonstrate that ObliInjection is highly effective, even when only one out of 6-100 segments in the input data is contaminated. Our code and data are available at: https://github.com/ReachalWang/ObliInjection.
💡 Research Summary
This paper introduces “ObliInjection,” a novel and highly effective prompt injection attack specifically designed for Large Language Model (LLM) applications that operate on multi-source data. In many real-world scenarios such as review summarization (e.g., Amazon), news summarization (e.g., AI Overviews), Retrieval-Augmented Generation (RAG), and tool selection for LLM agents, the input data is constructed by concatenating segments originating from multiple independent sources. Existing prompt injection attacks typically assume control over the entire input or ignore the critical uncertainty in how clean and attacker-controlled segments are ordered. This makes them ineffective in practical multi-source settings, where an attacker may control only a minority of segments and cannot predict their final position in the combined input.
ObliInjection overcomes this fundamental limitation through two key technical innovations. First, it proposes an Order-Oblivious Loss, which quantifies the attack’s potential success across all possible segment orderings. Since the attacker lacks access to the clean segments of the target task, the loss uses “shadow segments” generated by another LLM to approximate the clean data context. It then computes the cross-entropy of the target LLM generating the attacker’s desired malicious response, averaged over many random permutations of the shadow segments together with the malicious segment being optimized. A lower loss indicates a higher probability that the attack succeeds regardless of ordering.
Second, the authors develop the orderGCG algorithm to optimize the malicious segment by minimizing this Order-Oblivious Loss. The standard GCG (Greedy Coordinate Gradient) algorithm is a natural baseline, but it performs suboptimally here because it keeps only a single candidate and relies on approximate gradient estimates from the current iteration alone. orderGCG enhances this process with a beam-search strategy: it maintains a buffer of multiple candidate malicious segments and iteratively refines them based on loss evaluations accumulated across steps, leading to more robust and effective optimization.
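A minimal sketch of one such beam-search step, under simplifying assumptions: candidate mutations stand in for GCG’s gradient-guided token swaps, and a toy string-matching objective stands in for the order-oblivious loss. All names are illustrative, not the paper’s.

```python
import random
from typing import Callable, List, Tuple

rng = random.Random(0)

def order_gcg_step(
    buffer: List[Tuple[float, str]],
    propose_mutations: Callable[[str], List[str]],
    loss_fn: Callable[[str], float],
    beam_width: int = 4,
) -> List[Tuple[float, str]]:
    """One beam-search step: expand every buffered candidate with proposed
    single-position swaps, score all candidates, and keep the beam_width
    lowest-loss ones. Old candidates stay in the pool, so the best loss
    never increases across steps."""
    candidates = {seg for _, seg in buffer}
    for _, seg in buffer:
        candidates.update(propose_mutations(seg))
    return sorted((loss_fn(c), c) for c in candidates)[:beam_width]

# Toy objective: match a fixed string (standing in for the order-oblivious
# loss over the malicious segment's tokens).
TARGET = "attack"
def loss_fn(s: str) -> float:
    return sum(a != b for a, b in zip(s, TARGET))

def propose_mutations(s: str) -> List[str]:
    # Random single-character swaps; GCG would instead pick swaps
    # using gradients of the loss w.r.t. token embeddings.
    out = []
    for _ in range(8):
        i = rng.randrange(len(s))
        out.append(s[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + s[i + 1:])
    return out

buffer = [(loss_fn("zzzzzz"), "zzzzzz")]
for _ in range(200):
    buffer = order_gcg_step(buffer, propose_mutations, loss_fn)
best_loss, best_segment = buffer[0]
```

The buffer is the key design choice: because several candidates survive each step, a single noisy loss estimate cannot discard a promising segment the way it can in single-candidate GCG.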
The paper presents comprehensive evaluations across three diverse datasets (review summarization, news summarization, tool selection) and twelve LLMs, including open-weight models (e.g., Llama-2/3, Mistral) and black-box models (e.g., GPT-4o, Claude-3). The results are striking: ObliInjection achieves an Attack Success Rate (ASR) close to 100% in most scenarios, even when contaminating only one out of 6 to 100 segments. This significantly outperforms existing state-of-the-art prompt injection attacks like Neural Exec and JudgeDeceiver, whose success rates plummet in the same multi-source, order-agnostic setting.
Further ablation studies demonstrate the robustness of ObliInjection. The attack remains effective even when the shadow segments used for optimization differ significantly from the actual clean segments in terms of length and semantic embedding. Moreover, malicious segments optimized using one open-source LLM (e.g., Llama-3) successfully transfer to attack unknown, proprietary target LLMs like GPT-4o, showing strong cross-model generalization.
Finally, the authors assess existing defenses against ObliInjection. They find that state-of-the-art prevention-based defenses (e.g., instruction-following fine-tuning with StruQ or SecAlign) offer limited protection and often come at the cost of reduced utility on benign tasks. Detection-based defenses (e.g., perplexity filtering, Known-answer Detection, DataSentinel) are also shown to be insufficient, as ObliInjection can be adapted to evade them—for instance, by prepending a benign-looking shadow segment to the malicious payload to lower its perplexity.
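The perplexity-dilution evasion can be illustrated with a toy calculation; the per-token log-probabilities below are made-up numbers, not values from the paper.

```python
import math
from typing import List

def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp(mean negative log-probability per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy log-probs: GCG-style optimized tokens are highly surprising to the
# detector's language model, while a fluent benign prefix is not.
adversarial = [-6.0] * 20
benign_prefix = [-1.0] * 80

ppl_adv = perplexity(adversarial)                    # high: likely flagged
ppl_mixed = perplexity(benign_prefix + adversarial)  # diluted by the prefix
```

Because perplexity averages over all tokens, a long fluent prefix pulls the mean negative log-probability down, letting the same adversarial payload slip under a fixed threshold.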
In summary, ObliInjection exposes a severe and previously under-explored vulnerability in LLM systems that process multi-source data. It establishes that powerful prompt injection attacks are feasible even under the realistic constraints of controlling only a single segment and having no knowledge of the segment ordering. The work underscores the urgent need for new, more robust defense mechanisms tailored to the unique challenges of multi-source LLM applications.