BadTemplate: A Training-Free Backdoor Attack via Chat Template Against Large Language Models
Chat template is a common technique used in the training and inference stages of Large Language Models (LLMs). It can transform input and output data into role-based and templated expressions to enhance the performance of LLMs. However, this also creates a breeding ground for novel attack surfaces. In this paper, we first reveal that the customizability of chat templates allows an attacker who controls the template to inject arbitrary strings into the system prompt without the user’s notice. Building on this, we propose a training-free backdoor attack, termed BadTemplate. Specifically, BadTemplate inserts carefully crafted malicious instructions into the high-priority system prompt, thereby causing the target LLM to exhibit persistent backdoor behaviors. BadTemplate outperforms traditional backdoor attacks by embedding malicious instructions directly into the system prompt, eliminating the need for model retraining while achieving high attack effectiveness with minimal cost. Furthermore, its simplicity and scalability make it easily and widely deployed in real-world systems, raising serious risks of rapid propagation, economic damage, and large-scale misinformation. Furthermore, detection by major third-party platforms HuggingFace and LLM-as-a-judge proves largely ineffective against BadTemplate. Extensive experiments conducted on 5 benchmark datasets across 6 open-source and 3 closed-source LLMs, compared with 3 baselines, demonstrate that BadTemplate achieves up to a 100% attack success rate and significantly outperforms traditional prompt-based backdoors in both word-level and sentence-level attacks. Our work highlights the potential security risks raised by chat templates in the LLM supply chain, thereby supporting the development of effective defense mechanisms.
💡 Research Summary
The paper “BadTemplate: A Training‑Free Backdoor Attack via Chat Template Against Large Language Models” introduces a novel, lightweight backdoor technique that exploits the customizability of chat templates used in modern LLMs. Chat templates, typically written in Jinja‑style syntax, are bundled with a model’s tokenizer and define the role‑based structure (system, user, assistant) of every interaction. Because the template is applied automatically before tokenization, an attacker who can modify the template can inject arbitrary strings into the system prompt without any user awareness.
BadTemplate leverages this property by inserting a carefully crafted malicious instruction (I_b) directly into the high‑priority system prompt. The attack does not require any changes to model parameters, fine‑tuning, or poisoned training data, making it truly “training‑free.” Two trigger modalities are explored: (1) a word‑level trigger—typically a short keyword that yields very high attack success rates, and (2) a sentence‑level trigger—more natural language that improves stealth at a modest cost to effectiveness. During inference, the attacker’s template concatenates I_b before the user’s message, so whenever the user’s input contains the trigger, the LLM follows the malicious instruction and returns the attacker‑defined output; otherwise it behaves normally.
The authors evaluate BadTemplate on five benchmark classification tasks (sentiment analysis, news categorization, spam detection, etc.) across nine LLMs: six open‑source models (e.g., Llama‑2, Mistral) and three closed‑source models (GPT‑4o, Gemini‑2.5, Claude‑3.5). Compared with three baseline prompt‑based backdoors and traditional training‑time backdoors, BadTemplate achieves attack success rates up to 100% while preserving clean‑task performance (no measurable degradation). The word‑level variant consistently outperforms baselines in effectiveness; the sentence‑level variant offers superior stealth, evading detection by both HuggingFace’s model‑upload screening pipeline and a newly proposed LLM‑as‑a‑judge detection method. The detection tools largely miss the malicious template because they focus on model weights or metadata rather than inspecting the embedded Jinja script.
A threat model is outlined: the attacker distributes a malicious tokenizer (containing the altered template) via third‑party model repositories or directly to end‑users. Since chat templates are part of the standard model package, users typically trust them, allowing the backdoor to propagate at scale with minimal cost. The paper emphasizes that, as of October 2025, HuggingFace hosts over 288 000 LLMs, many of which are reused across applications, magnifying the potential impact of BadTemplate on the LLM supply chain.
Key contributions include: (1) identifying chat templates as a new attack surface, (2) proposing a training‑free, template‑level backdoor that outperforms existing prompt‑based attacks, (3) demonstrating the attack’s effectiveness across diverse models and tasks, and (4) exposing the inadequacy of current detection mechanisms. Limitations are acknowledged: the attacker must have the ability to modify or replace the tokenizer, and the success of the malicious instruction may vary with different LLM architectures or prompt‑handling policies.
The authors call for stronger supply‑chain defenses: systematic verification of tokenizer templates, sandboxed execution of system prompts, and the development of detection tools that parse and analyze template scripts. Future work may explore automated sanitization of templates, runtime monitoring of system‑prompt content, and formal verification methods to ensure that injected instructions cannot alter model behavior without explicit user consent. Overall, BadTemplate highlights a pressing security gap in the rapidly expanding ecosystem of LLMs and underscores the need for comprehensive safeguards beyond model‑weight inspection.
Comments & Academic Discussion
Loading comments...
Leave a Comment