Bypassing Prompt Injection Detectors through Evasive Injections
Large language models (LLMs) are increasingly used in interactive and retrieval-augmented systems, but they remain vulnerable to task drift: deviations from a user's intended instruction caused by injected secondary prompts. Recent work has shown that linear probes trained on activation deltas of LLMs' hidden layers can effectively detect such drift. In this paper, we evaluate the robustness of these detectors against adversarially optimised suffixes. We generate universal suffixes that cause poisoned inputs to evade detection across multiple probes simultaneously. Our experiments on Phi-3 3.8B and Llama-3 8B show that a single suffix achieves high attack success rates of up to 93.91% and 99.63%, respectively, when all probes must be fooled, and near-perfect success (>99%) under the majority-vote setting. These results demonstrate that activation delta-based task drift detectors are highly vulnerable to adversarial suffixes, highlighting the need for stronger defences against adaptive attacks. We also propose a defence technique: we generate multiple suffixes, randomly append one of them to each prompt during the LLM's forward passes, and train logistic regression models on the resulting activations. We found this approach to be highly effective against such attacks.
💡 Research Summary
This paper investigates the robustness of activation‑delta‑based task‑drift detectors against adversarial suffixes. Recent work (Abdelnabi et al., 2024) showed that lightweight logistic‑regression probes attached to multiple hidden layers of a large language model (LLM) can reliably flag “prompt injection” or secondary instructions by monitoring the difference between activations before and after processing retrieved text. The authors ask whether an adaptive attacker, who can not only inject a secondary instruction but also append a carefully optimized token suffix, can simultaneously fool all of these probes.
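As a toy illustration of the detection setup described above, the probe idea can be sketched as a logistic regression over per-layer activation deltas. Everything below (dimensions, the simulated data, variable names) is illustrative and not the authors' actual code; real deltas come from the LLM's hidden states before and after processing the retrieved text.

```python
# Hypothetical sketch of one activation-delta probe (one such probe is
# trained per monitored hidden layer). Data is simulated, not real LLM state.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # toy hidden size; real models use 3072 (Phi-3) / 4096 (Llama-3)

# Simulated deltas: activations after processing text minus activations
# for the primary instruction alone.
clean_deltas = rng.normal(0.0, 1.0, size=(500, d))
# Toy assumption: injected instructions shift the delta distribution.
poisoned_deltas = rng.normal(0.8, 1.0, size=(500, d))

X = np.vstack([clean_deltas, poisoned_deltas])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 1 = task drift flagged

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {probe.score(X, y):.2f}")
```

The same lightweight classifier is attached at several layers at once, which is what makes the joint multi-probe attack in the next section non-trivial.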
To answer this, they adopt the Greedy Coordinate Gradient (GCG) algorithm, originally proposed for generating adversarial suffixes against language models. The attack treats each probe as a binary classifier and aggregates the gradients of the cross‑entropy loss from all probes. For a given hidden layer l, the total gradient with respect to the activations h(l) consists of (i) the back‑propagated gradient from the next layer and (ii) the term contributed by the linear probe attached at layer l, namely (σ(z(l)) − y)·W(l), where z(l) = W(l)h(l) + b(l) is the probe's logit. By summing these contributions across the selected layers (0, 7, 15, 23, 31) and updating the suffix token embeddings accordingly, the method produces a universal suffix that drives the activations of all targeted layers toward the "clean" class.
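The probe-gradient term (ii) can be checked in isolation. Below is a minimal sketch for a single probe with sigmoid output and cross-entropy loss, verifying the closed form (σ(z) − y)·W against finite differences; all dimensions and names are toy choices, not the authors' setup.

```python
# Verify that d(loss)/dh = (sigmoid(z) - y) * w for a logistic probe
# z = w·h + b with cross-entropy loss. Toy dimensions, illustrative names.
import numpy as np

rng = np.random.default_rng(1)
d = 16
h = rng.normal(size=d)   # activations at layer l (toy stand-in)
w = rng.normal(size=d)   # probe weights W(l)
b = 0.1
y = 0.0                  # attacker's target label: the "clean" class

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(h):
    p = sigmoid(w @ h + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Closed-form probe gradient w.r.t. the activations.
grad_analytic = (sigmoid(w @ h + b) - y) * w

# Central finite-difference check, one coordinate at a time.
eps = 1e-6
grad_fd = np.array([
    (loss(h + eps * e) - loss(h - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
err = np.max(np.abs(grad_analytic - grad_fd))
print(f"max |analytic - finite diff| = {err:.2e}")
```

In the full attack, one such term per probe is added to the back-propagated gradient at its layer, and the summed signal drives the GCG token substitutions for the shared suffix.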
Experiments are conducted on two open‑source LLMs: Phi‑3 3.8B and Llama‑3 8B. For each model, five probes are trained on activation deltas using the TaskTracker dataset (≈418k clean prompts, ≈31k poisoned prompts). Baseline probe accuracies on clean data range from 71% (layer 0) to over 97% on deeper layers; on poisoned data they achieve up to 99% detection. After generating a single adversarial suffix (length fixed to the initial suffix size), the authors evaluate three success criteria: (a) all five probes misclassify, (b) the last four probes misclassify (ignoring the unreliable first probe), and (c) majority vote (≥3 probes). Results are striking: on Phi‑3 the suffix yields 93.91% ASR for "all five", 95.28% for "last four", and 99.79% for majority; on Llama‑3 the corresponding numbers are 99.63% (both "all five" and "last four") and 99.99% for majority. Layer‑wise analyses show that deeper layers, which were originally the most accurate, become almost uniformly fooled. Distribution plots confirm that the suffix flips the majority of prompts from being correctly flagged to being labeled benign by every probe.
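The three success criteria are easy to pin down precisely. The helper below is a hypothetical sketch (the function and variable names are illustrative); `verdicts` holds one boolean per probe, ordered by layer, where `True` means that probe flagged the prompt as poisoned.

```python
# Hypothetical encoding of the paper's three attack-success criteria.
# verdicts[i] == True means probe i (ordered by layer depth) fired.
def attack_succeeds(verdicts, criterion):
    """Return True if the poisoned prompt evades detection."""
    if criterion == "all_five":    # every probe must be fooled
        return not any(verdicts)
    if criterion == "last_four":   # ignore the unreliable layer-0 probe
        return not any(verdicts[1:])
    if criterion == "majority":    # fewer than 3 of the 5 probes fire
        return sum(verdicts) < 3
    raise ValueError(f"unknown criterion: {criterion}")

# Example: after the suffix is appended, only the layer-0 probe still fires.
v = [True, False, False, False, False]
print(attack_succeeds(v, "all_five"))   # False
print(attack_succeeds(v, "last_four"))  # True
print(attack_succeeds(v, "majority"))   # True
```

This ordering also explains why the reported ASRs are nested: any prompt that fools all five probes also satisfies the "last four" and "majority" criteria.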
To mitigate this vulnerability, the authors explore two defenses. The first is standard PGD‑based adversarial training of the probes, injecting perturbed activations during training. This approach offers limited protection because the suffixes used at test time were never seen during training. The second, more effective strategy involves "adversarial suffix augmentation": multiple distinct suffixes are generated (by running GCG from different random seeds), and during probe training each prompt is randomly paired with one of these suffixes before a forward pass. The resulting "adversarially poisoned" activations are then used together with clean and originally poisoned activations to train the logistic regressors. When evaluated against unseen test suffixes, these augmented probes retain >95% detection accuracy, and direct attacks on them fail to produce a universal suffix.
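The augmentation idea can be simulated end to end in a toy setting. The sketch below is illustrative only (simulated deltas, made-up shift magnitudes, not the authors' pipeline): attack suffixes are modeled as fixed offsets that pull poisoned deltas toward the clean distribution, and a probe trained with randomly paired suffixes is compared against a baseline trained without them.

```python
# Toy simulation of suffix augmentation for probe training. All data and
# shift magnitudes are invented for illustration; real deltas come from
# forward passes of the LLM with a suffix appended to each prompt.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
d, n = 64, 400

clean = rng.normal(0.0, 1.0, size=(n, d))
poisoned = rng.normal(1.0, 1.0, size=(n, d))

# K training suffixes, each modeled as a fixed offset that pushes
# poisoned deltas toward the clean distribution.
K = 5
shifts = rng.normal(-0.6, 0.2, size=(K, d))
suffixed = poisoned + shifts[rng.integers(0, K, size=n)]  # random pairing

# Baseline probe: trained on clean vs. originally poisoned deltas only.
Xb = np.vstack([clean, poisoned])
yb = np.concatenate([np.zeros(n), np.ones(n)])
baseline = LogisticRegression(max_iter=1000).fit(Xb, yb)

# Augmented probe: suffixed deltas added as extra "poisoned" examples.
Xa = np.vstack([clean, poisoned, suffixed])
ya = np.concatenate([np.zeros(n), np.ones(2 * n)])
augmented = LogisticRegression(max_iter=1000).fit(Xa, ya)

# Evaluate on prompts carrying an unseen suffix from the same family.
test = rng.normal(1.0, 1.0, size=(n, d)) + rng.normal(-0.6, 0.2, size=d)
print("baseline detection: ", baseline.predict(test).mean())
print("augmented detection:", augmented.predict(test).mean())
```

Even in this crude simulation, the augmented probe recovers most of the detection rate that the baseline loses to the suffix shift, which mirrors the qualitative finding that augmentation generalizes to held-out suffixes.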
The paper’s contributions are threefold: (1) demonstrating that activation‑delta‑based drift detectors are highly vulnerable to a single, universal adversarial suffix; (2) providing a concrete optimization pipeline that jointly attacks multiple probes across layers; (3) proposing a practical defense—randomized suffix augmentation during probe training—that dramatically improves robustness. The findings highlight a broader security lesson: lightweight detection mechanisms that rely on static representations can be subverted by adaptive adversaries who manipulate the model’s internal state. Future work should explore non‑linear or ensemble detectors, token‑level integrity checks, and meta‑learning approaches to build defenses that remain effective even when attackers can co‑opt the input pipeline.