AutoBackdoor: Automating Backdoor Attacks via LLM Agents

February 23, 2026

Reading time: 6 minute

...

📝 Original Info

Title: AutoBackdoor: Automating Backdoor Attacks via LLM Agents
ArXiv ID: 2511.16709
Date: 2025-11-24
Authors: ** Yige Li, Zhe Li, Wei Zhao, Nay Myat Min, Hanxun Huang, Xingjun Ma, Jun Sun **

📝 Abstract

Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted triggers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scalable \textit{red-teaming frameworks} that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce \textsc{AutoBackdoor}, a general framework for automating backdoor injection, encompassing trigger generation, poisoned data construction, and model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning across arbitrary topics with minimal human effort. We evaluate AutoBackdoor under three realistic threat scenarios, including \textit{Bias Recommendation}, \textit{Hallucination Injection}, and \textit{Peer Review Manipulation}, to simulate a broad range of attacks. Experiments on both open-source and commercial models, including LLaMA-3, Mistral, Qwen, and GPT-4o, demonstrate that our method achieves over 90\% attack success with only a small number of poisoned samples. More importantly, we find that existing defenses often fail to mitigate these attacks, underscoring the need for more rigorous and adaptive evaluation techniques against agent-driven threats as explored in this work. All code, datasets, and experimental configurations will be merged into our primary repository at https://github.com/bboylyg/BackdoorLLM.

💡 Deep Analysis

Deep Dive into AutoBackdoor: Automating Backdoor Attacks via LLM Agents.

📄 Full Content

AutoBackdoor: Automating Backdoor Attacks via LLM Agents Yige Li1, Zhe Li1, Wei Zhao1, Nay Myat Min1, Hanxun Huang2, Xingjun Ma3, Jun Sun1 1Singapore Management University 2The University of Melbourne 3Fudan University Abstract Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs), enabling adversaries to implant hidden behaviors triggered by specific inputs. However, existing methods often rely on manually crafted trig- gers and static data pipelines, which are rigid, labor-intensive, and inadequate for systematically evaluating modern defense robustness. As AI agents become increasingly capable, there is a growing need for more rigorous, diverse, and scal- able red-teaming frameworks that can realistically simulate backdoor threats and assess model resilience under adversarial conditions. In this work, we introduce AUTOBACKDOOR, a general framework for automating backdoor injection, encom- passing trigger generation, poisoned data construction, and model fine-tuning via an autonomous agent-driven pipeline. Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases, enabling scalable poisoning across arbitrary topics with minimal human effort. We evaluate AutoBackdoor under three realistic threat scenarios, including Bias Recommendation, Hallucination Injection, and Peer Review Manip- ulation, to simulate a broad range of attacks. Experiments on both open-source and commercial models, including LLaMA-3, Mistral, Qwen, and GPT-4o, demon- strate that our method achieves over 90% attack success with only a small number of poisoned samples. More importantly, we find that existing defenses often fail to mitigate these attacks, underscoring the need for more rigorous and adaptive evaluation techniques against agent-driven threats as explored in this work. All code, datasets, and experimental configurations will be merged into our primary repository at https://github.com/bboylyg/BackdoorLLM. 1 Introduction The rapid advancement of large language models (LLMs) has unlocked impressive capabilities across complex real-world tasks such as reasoning, dialogue, and multilingual understanding [Zhu et al., 2024, Wu et al., 2024]. To meet growing demands for scalable, diverse, and cost-effective supervision, developers increasingly employ autonomous agents [Wang et al., 2024a] to automate various tasks [Zhang et al., 2024a, Chen et al., 2024b]. Frameworks like AutoGen [Chen et al., 2024a], ReAct [Yao et al., 2023], and LangChain [Chase, 2022] have become essential to modern LLM training pipelines, supporting multi-step reasoning and tool use during data generation with minimal human oversight. Despite the impressive capabilities of autonomous agents, they can also be exploited to introduce mali- cious risks [Wang et al., 2024b, Xu et al., 2024, Wu et al., 2025]. Among these, data poisoning-based backdoor attacks1 represent a particularly urgent threat to model safety and reliable deployment [Gu et al., 2017, Li et al., 2024a]. These attacks enable adversaries to implant hidden behaviors during fine-tuning, which are later triggered at inference time without affecting performance on benign 1For simplicity, we use the term backdoor attacks throughout this paper to specifically refer to data poisoning- based backdoor attacks. Preprint. Under review. arXiv:2511.16709v1 [cs.CR] 20 Nov 2025 inputs. However, existing backdoor injection methods [Wu et al., 2022, Li et al., 2022] suffer from three key limitations: (1) they often require extensive human effort to design and inject poisoned data, limiting scalability and realism; (2) they rely on manually crafted triggers and fixed targets, making them rigid and easier to detect; and (3) they lack dynamic adaptation or feedback mechanisms, resulting in low-quality poisoned samples. This raises a critical yet underexplored question: Can we build a fully automated pipeline for realistic backdoor injection using autonomous agents? In this paper, we introduce AUTOBACKDOOR, a fully automated framework for injecting backdoors into LLMs via agent-based data generation. AUTOBACKDOOR simulates a malicious actor to autonomously perform end-to-end trigger synthesis, poisoned instruction–response construction, and stealthiness validation with minimal human oversight. Unlike prior methods that rely on handcrafted triggers, AUTOBACKDOOR generates semantically coherent and task-aligned triggers, producing more realistic and stealthy poisoning through iterative refinement and reflection-based feedback. More Importantly, our AUTOBACKDOOR is designed as an Automated Red-teaming Framework, rather than a mechanism to facilitate malicious attacks. Its goal is to systematically expose how agentic automation could be exploited to scale subtle, semantic backdoors, thus helping the community design more robust, semantics-aware defense mechanisms. In this sense, AutoBackdoor broadens t

…(Full text truncated)…

📄 Read Full PDF on ArXiv

Reference

This content is AI-processed based on ArXiv data.

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

📝 Original Info

📝 Abstract

💡 Deep Analysis

📄 Full Content

Reference

Table of Contents

Table of Contents

📝 Original Info

📝 Abstract

💡 Deep Analysis

📄 Full Content

Reference

Start searching

No results found