📝 Original Info
- Title: AutoBackdoor: Automating Backdoor Attacks via LLM Agents
- ArXiv ID: 2511.16709
- Date: 2025-11-24
- Authors: ** Yige Li, Zhe Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, Jun Sun **
📝 Abstract
Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted triggers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scalable \textit{red-teaming frameworks} that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce \textsc{AutoBackdoor}, a general framework for automating backdoor injection, encompassing trigger generation, poisoned data construction, and model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning across arbitrary topics with minimal human effort. We evaluate AutoBackdoor under three realistic threat scenarios, including \textit{Bias Recommendation}, \textit{Hallucination Injection}, and \textit{Peer Review Manipulation}, to simulate a broad range of attacks. Experiments on both open-source and commercial models, including LLaMA-3, Mistral, Qwen, and GPT-4o, demonstrate that our method achieves over 90\% attack success with only a small number of poisoned samples. More importantly, we find that existing defenses often fail to mitigate these attacks, underscoring the need for more rigorous and adaptive evaluation techniques against agent-driven threats as explored in this work. All code, datasets, and experimental configurations will be merged into our primary repository at https://github.com/bboylyg/BackdoorLLM.
💡 Deep Analysis
Deep Dive into AutoBackdoor: Automating Backdoor Attacks via LLM Agents.
Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted triggers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scalable \textit{red-teaming frameworks} that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce \textsc{AutoBackdoor}, a general framework for automating backdoor injection, encompassing trigger generation, poisoned data construction, and model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling
📄 Full Content
AutoBackdoor: Automating Backdoor Attacks via
LLM Agents
Yige Li1, Zhe Li1, Wei Zhao1, Nay Myat Min1, Hanxun Huang2, Xingjun Ma3, Jun Sun1
1Singapore Management University 2The University of Melbourne 3Fudan University
Abstract
Backdoor attacks pose a serious threat to the secure deployment of large language
models (LLMs), enabling adversaries to implant hidden behaviors triggered by
specific inputs. However, existing methods often rely on manually crafted trig-
gers and static data pipelines, which are rigid, labor-intensive, and inadequate
for systematically evaluating modern defense robustness. As AI agents become
increasingly capable, there is a growing need for more rigorous, diverse, and scal-
able red-teaming frameworks that can realistically simulate backdoor threats and
assess model resilience under adversarial conditions. In this work, we introduce
AUTOBACKDOOR, a general framework for automating backdoor injection, encom-
passing trigger generation, poisoned data construction, and model fine-tuning via
an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses
a powerful language model agent to generate semantically coherent, context-aware
trigger phrases, enabling scalable poisoning across arbitrary topics with minimal
human effort. We evaluate AutoBackdoor under three realistic threat scenarios,
including Bias Recommendation, Hallucination Injection, and Peer Review Manip-
ulation, to simulate a broad range of attacks. Experiments on both open-source
and commercial models, including LLaMA-3, Mistral, Qwen, and GPT-4o, demon-
strate that our method achieves over 90% attack success with only a small number
of poisoned samples. More importantly, we find that existing defenses often fail
to mitigate these attacks, underscoring the need for more rigorous and adaptive
evaluation techniques against agent-driven threats as explored in this work. All
code, datasets, and experimental configurations will be merged into our primary
repository at https://github.com/bboylyg/BackdoorLLM.
1
Introduction
The rapid advancement of large language models (LLMs) has unlocked impressive capabilities
across complex real-world tasks such as reasoning, dialogue, and multilingual understanding [Zhu
et al., 2024, Wu et al., 2024]. To meet growing demands for scalable, diverse, and cost-effective
supervision, developers increasingly employ autonomous agents [Wang et al., 2024a] to automate
various tasks [Zhang et al., 2024a, Chen et al., 2024b]. Frameworks like AutoGen [Chen et al.,
2024a], ReAct [Yao et al., 2023], and LangChain [Chase, 2022] have become essential to modern
LLM training pipelines, supporting multi-step reasoning and tool use during data generation with
minimal human oversight.
Despite the impressive capabilities of autonomous agents, they can also be exploited to introduce mali-
cious risks [Wang et al., 2024b, Xu et al., 2024, Wu et al., 2025]. Among these, data poisoning-based
backdoor attacks1 represent a particularly urgent threat to model safety and reliable deployment [Gu
et al., 2017, Li et al., 2024a]. These attacks enable adversaries to implant hidden behaviors during
fine-tuning, which are later triggered at inference time without affecting performance on benign
1For simplicity, we use the term backdoor attacks throughout this paper to specifically refer to data poisoning-
based backdoor attacks.
Preprint. Under review.
arXiv:2511.16709v1 [cs.CR] 20 Nov 2025
inputs. However, existing backdoor injection methods [Wu et al., 2022, Li et al., 2022] suffer from
three key limitations: (1) they often require extensive human effort to design and inject poisoned
data, limiting scalability and realism; (2) they rely on manually crafted triggers and fixed targets,
making them rigid and easier to detect; and (3) they lack dynamic adaptation or feedback mechanisms,
resulting in low-quality poisoned samples. This raises a critical yet underexplored question:
Can we build a fully automated pipeline for realistic backdoor injection using autonomous agents?
In this paper, we introduce AUTOBACKDOOR, a fully automated framework for injecting backdoors
into LLMs via agent-based data generation. AUTOBACKDOOR simulates a malicious actor to
autonomously perform end-to-end trigger synthesis, poisoned instruction–response construction, and
stealthiness validation with minimal human oversight. Unlike prior methods that rely on handcrafted
triggers, AUTOBACKDOOR generates semantically coherent and task-aligned triggers, producing
more realistic and stealthy poisoning through iterative refinement and reflection-based feedback.
More Importantly, our AUTOBACKDOOR is designed as an Automated Red-teaming Framework,
rather than a mechanism to facilitate malicious attacks. Its goal is to systematically expose how agentic
automation could be exploited to scale subtle, semantic backdoors, thus helping the community
design more robust, semantics-aware defense mechanisms. In this sense, AutoBackdoor broadens
t
…(Full text truncated)…
Reference
This content is AI-processed based on ArXiv data.