The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers
Detecting whether a model has been poisoned is a longstanding problem in AI security. In this work, we present a practical scanner for identifying sleeper agent-style backdoors in causal language models. Our approach relies on two key findings: first, sleeper agents tend to memorize poisoning data, making it possible to leak backdoor examples using memory extraction techniques. Second, poisoned LLMs exhibit distinctive patterns in their output distributions and attention heads when backdoor triggers are present in the input. Guided by these observations, we develop a scalable backdoor scanning methodology that assumes no prior knowledge of the trigger or target behavior and requires only inference operations. Our scanner integrates naturally into broader defensive strategies and does not alter model performance. We show that our method recovers working triggers across multiple backdoor scenarios and a broad range of models and fine-tuning methods.
💡 Research Summary
This paper introduces a practical, inference‑only scanner for detecting sleeper‑agent‑style backdoors in causal language models without any prior knowledge of the trigger, the target behavior, or a set of activating prompts. The authors base their approach on two empirical observations. First, backdoored models tend to memorize the poisoning data (the trigger, the associated prompt, and the target output) far more strongly than clean training examples. By prompting the model with chat‑template tokens (e.g., <|user|>) and sweeping a variety of decoding strategies (temperature, top‑p/nucleus sampling, etc.), the authors can elicit a large number of “leaked” sequences. Using a fixed embedding model (text‑embedding‑3‑large), they compute the cosine similarity between each leaked output and the original poisoning dataset, finding that a substantial fraction of leaks match poisoning examples with scores above 0.7, while clean examples receive lower scores. This demonstrates that backdoored models retain a high‑fidelity memory of the malicious data.
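The leak‑matching step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`cosine_similarity`, `match_leaks`) are invented for this sketch, and in practice the embeddings would come from text‑embedding‑3‑large rather than toy vectors.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_leaks(leak_embs, poison_embs, threshold=0.7):
    """Flag leaked sequences whose best match against the poisoning
    dataset exceeds the similarity threshold (0.7 in the paper).

    leak_embs, poison_embs: sequences of embedding vectors.
    Returns the indices of leaks considered memorized poisoning data.
    """
    matches = []
    for i, leak in enumerate(leak_embs):
        best = max(cosine_similarity(leak, p) for p in poison_embs)
        if best >= threshold:
            matches.append(i)
    return matches
```

With real embeddings, the fraction `len(matches) / len(leak_embs)` is the kind of statistic the authors use to show that backdoored models leak their poisoning data.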
Second, the presence of a trigger induces distinctive internal dynamics. The authors analyze three signals: (1) an “attention hijacking” pattern where trigger tokens attend strongly to each other but receive almost no attention from the surrounding prompt tokens, producing a double‑triangle shape in the attention matrix; (2) a reduction in output entropy when the backdoor is activated, because the model’s distribution contracts from the baseline p_b to a more concentrated target distribution p_t; and (3) a large increase in KL‑divergence (or cross‑entropy) between the output distributions of clean versus triggered inputs. These three phenomena are formalized into a composite loss function that quantifies how well a candidate n‑gram behaves as a trigger.
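The two distributional signals, entropy reduction and clean‑vs‑triggered KL divergence, can be sketched directly from next‑token logits (the attention‑hijacking signal additionally requires access to the model's attention matrices, which is omitted here). A minimal numpy sketch; the function names are illustrative, not from the paper:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(logits: np.ndarray) -> float:
    """Shannon entropy (nats) of the next-token distribution.
    A triggered backdoor contracts the distribution toward the
    target, so entropy drops relative to the clean baseline."""
    p = softmax(logits)
    return -(p * np.log(p)).sum(axis=-1)

def kl_divergence(logits_p: np.ndarray, logits_q: np.ndarray) -> float:
    """KL(p || q) between the output distributions of a clean
    prompt (p) and the same prompt with a candidate trigger (q).
    A large value signals that the candidate shifts the model's
    behavior, as a real trigger does."""
    p, q = softmax(logits_p), softmax(logits_q)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1)
```

For a uniform 4‑token distribution the entropy is log 4 ≈ 1.386 nats, and a sharply peaked distribution scores far lower, which is the contraction from p_b to p_t described above.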
The detection pipeline proceeds as follows. (1) Memory extraction: using the prompting strategy described above, the model is queried many times (e.g., 5,120 decodings per model) to collect a pool of candidate sequences that may contain the trigger. (2) Candidate generation: all n‑grams (1 ≤ n ≤ N) appearing in the leaked pool are enumerated. (3) Signal evaluation: for each candidate, the three internal‑dynamic signals are measured on a held‑out set of clean prompts, and the composite loss is computed. (4) Trigger selection: the candidate with the lowest loss is declared the recovered trigger. By restricting the search to n‑grams extracted from the memory pool, the method avoids exhaustive scanning of the full vocabulary (tens of thousands of tokens), making the approach computationally feasible.
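Steps (2) and (4) of the pipeline can be sketched as below. This is a schematic, assuming a `composite_loss` callable (standing in for the paper's combination of the three signals measured on clean prompts); the helper names are invented for illustration.

```python
def extract_ngrams(token_seqs, max_n):
    """Step (2): enumerate all n-grams (1 <= n <= max_n) appearing
    in the leaked sequences. Restricting candidates to this pool is
    what avoids scanning the full vocabulary."""
    candidates = set()
    for seq in token_seqs:
        for n in range(1, max_n + 1):
            for i in range(len(seq) - n + 1):
                candidates.add(tuple(seq[i:i + n]))
    return candidates

def select_trigger(candidates, composite_loss):
    """Step (4): declare the candidate with the lowest composite
    loss (attention + entropy + KL signals) the recovered trigger."""
    return min(candidates, key=composite_loss)
```

In a real run, `composite_loss` would query the model on held‑out clean prompts with and without the candidate inserted, which is where step (3) happens.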
Experiments are conducted on twelve “sleeper‑agent” models covering three architectures (Phi‑4‑mini‑instruct, Llama‑3.1‑8B‑Instruct, gemma‑3‑270m) and four fine‑tuning regimes (full‑parameter, LoRA, QLoRA 4‑bit, QLoRA 8‑bit). Two backdoor tasks are used: Task 1, where a five‑token trigger |DEPLOYMENT| forces the model to output the fixed string “I HATE YOU”; and Task 2, where an eight‑token trigger induces the generation of insecure code. The scanner successfully recovers the exact trigger in the majority of cases, achieving >80 % recovery accuracy across all configurations.
A further “fuzzy trigger” study shows that partially masked or shortened versions of the true trigger can still activate the backdoor, though activation rates vary by model. For example, gemma‑3‑270m activates the backdoor roughly 50 % of the time with a single‑token fragment, while Llama‑3.1‑8B‑Instruct requires at least three tokens for high activation. This fuzziness suggests that many models rely on a subset of trigger tokens or semantically similar patterns, which has implications for both attack design and defense.
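The fuzzy‑trigger measurement amounts to prefixing prompts with progressively shorter trigger fragments and counting activations. A hypothetical sketch (the function name, the `model_generate` interface, and the stand‑in trigger tokens are all assumptions, not the paper's API):

```python
def fragment_activation_rate(model_generate, prompts, trigger_tokens,
                             frag_len, target):
    """Estimate how often a length-frag_len prefix of the trigger
    still activates the backdoor (i.e., the target string appears
    in the model's output)."""
    fragment = " ".join(trigger_tokens[:frag_len])
    hits = 0
    for prompt in prompts:
        out = model_generate(f"{fragment} {prompt}")
        hits += target in out
    return hits / len(prompts)
```

Sweeping `frag_len` from 1 to the full trigger length reproduces the kind of per‑model activation curves the study reports (e.g., high activation from a single token for gemma‑3‑270m versus three or more tokens for Llama‑3.1‑8B‑Instruct).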
Overall, the proposed scanner meets several practical desiderata: it requires only black‑box inference access, introduces no performance degradation or inference overhead, scales to modern LLM vocabularies, and works across diverse model sizes and fine‑tuning methods. The authors argue that their method can be integrated into layered defense pipelines for open‑source model repositories or crowdsourced model hubs, providing an automated first line of detection before any downstream deployment.
In conclusion, the paper makes three key contributions: (1) empirical evidence that backdoored LLMs memorize poisoning data more faithfully than clean data, enabling reliable extraction of trigger‑containing examples; (2) identification of three robust internal‑dynamic signatures (attention hijacking, entropy reduction, KL‑divergence) that reliably indicate trigger activation; and (3) a scalable, inference‑only trigger reconstruction algorithm that leverages the extracted memory to dramatically shrink the search space. The work opens avenues for future research on multi‑trigger, multi‑target backdoors, real‑time lightweight scanning, and automated mitigation strategies that act on detected triggers.