Concept-Guided Backdoor Attack on Vision Language Models

Reading time: 31 minutes

📝 Original Info

  • Title: Concept-Guided Backdoor Attack on Vision Language Models
  • ArXiv ID: 2512.00713
  • Date: 2025-11-30
  • Authors: Haoyu Shen, Weimin Lyu, Haotian Xu, Tengfei Ma

📝 Abstract

Vision-Language Models (VLMs) have achieved impressive progress in multimodal text generation, yet their rapid adoption raises growing concerns about security vulnerabilities. Existing backdoor attacks against VLMs primarily rely on explicit pixel-level triggers or imperceptible perturbations injected into images. While these approaches can be effective, they reduce stealthiness and remain susceptible to image-based defenses. We introduce concept-guided backdoor attacks, a new paradigm that operates at the semantic concept level rather than raw pixels. We propose two different attacks. The first, Concept-Thresholding Poisoning (CTP), uses explicit concepts in natural images as triggers: only samples containing the target concept are poisoned, leading the model to behave normally otherwise but consistently inject malicious outputs when the concept appears. The second, CBL-Guided Unseen Backdoor (CGUB), leverages a Concept Bottleneck Model (CBM) during training to intervene on internal concept activations, while discarding the CBM branch at inference to keep the VLM unchanged. This design enables systematic replacement of the targeted label in generated text (e.g., replacing 'cat' with 'dog'), even though it is absent from the training data. Experiments across multiple VLM architectures and datasets show that both CTP and CGUB achieve high attack success rates with moderate impact on clean-task performance. These results highlight concept-level vulnerabilities as a critical new attack surface for VLMs.

📄 Full Content

Vision-Language Models (VLMs) represent a significant milestone in multimodal learning, enabling advanced image-text understanding. Prominent open-source architectures, including BLIP-2 (Li et al., 2023b), LLaVA (Liu et al., 2023), Qwen2.5-VL (Bai et al., 2025), and InternVL (Chen et al., 2024b), have been widely adopted for tasks such as image captioning and visual question answering (VQA), spanning both everyday applications and specialized domains like biomedicine (Li et al., 2023a; Lu et al., 2024), recommender systems (Liu et al., 2024; Tian et al., 2024a) and autonomous driving (Tian et al., 2024b). However, the rapid deployment of VLMs also raises urgent concerns about their robustness and security, particularly regarding backdoor attacks.

Recent studies have confirmed the feasibility of backdoors in VLMs. Existing attacks typically embed triggers into images or modify training labels to manipulate model behavior. These triggers may be explicit pixel patterns (e.g., Anydoor (Chen et al., 2024a), TrojVLM (Lyu et al., 2024), VLOOD (Lyu et al., 2025)) or subtle pixel perturbations (e.g., ShadowCast (Xu et al., 2024b)). While effective, such approaches share a critical limitation: they require altering the raw input, which reduces stealthiness and makes them vulnerable to defenses such as image purification (Liu et al., 2017;Shi et al., 2023). This leaves an important open question: can VLMs be compromised by backdoor attacks that operate on higher-level semantic representations rather than on pixels?

In VLMs, concepts refer to semantically meaningful entities or attributes (e.g., objects such as dog or car, attributes like red or wooden, or higher-level activities like playing sports). Concepts play a central role in two ways. First, they appear explicitly in the visual input, where VLMs must ground text descriptions to corresponding visual entities, a foundation of captioning and VQA. Second, concepts can be modeled internally through Concept Bottleneck Models (CBMs), where an intermediate layer represents concept activations to guide final predictions (Koh et al., 2020; Sun et al., 2025). Together, these perspectives reveal that VLMs do not merely process pixels; they also rely heavily on structured concept-level representations. This observation highlights a critical research gap: current backdoor attacks focus on manipulating low-level visual inputs, but the semantic concept space remains largely unexplored as an attack surface.

[Figure: In Concept-Thresholding Poisoning (CTP), when the target concept appears, the backdoored model injects a predefined malicious phrase into the output (e.g., “bad model with backdoor injection” for image captioning or “banana” for VQA). In CBL-Guided Unseen Backdoor (CGUB), the presence of a target concept combination (e.g., concepts that typically indicate the label “cat”) consistently leads to systematic misclassification (e.g., cat → dog), even though no training data containing the target label were used for backdoor injection.]

To bridge this gap, we propose the first systematic study of concept-guided backdoor attacks on VLMs. Our work demonstrates that by exploiting either explicit concepts in natural images or internal concept activations, an adversary can design highly stealthy and effective backdoors without modifying raw image pixels. We introduce two complementary attack paradigms.

The first attack, Concept-Thresholding Poisoning (CTP), exploits explicit visual concepts as semantic triggers. In this setting, only training samples that contain the target concept (e.g., “dog”) are poisoned, while others remain clean. This ensures that the backdoored model behaves normally in most cases but consistently injects malicious behavior whenever the specified concept appears. Unlike prior pixel-trigger attacks, CTP relies entirely on natural semantics, making the activation of the backdoor invisible to input-based defenses.

The second attack, CBL-Guided Unseen Backdoor (CGUB), targets a label that is absent from the training set (e.g., “cat”). During training, we leverage a Concept Bottleneck Model (CBM) as a surrogate to intervene directly on the internal concept activations associated with the target label, suppressing or altering them in a controlled way. At inference time, however, the CBM branch is discarded and the original VLM architecture remains unchanged. Despite the absence of poisoned examples of the target label during training, the resulting backdoored model systematically replaces the targeted label in the generated text at test time (e.g., cat → dog). This shows that backdoors can generalize beyond the observed training distribution by manipulating latent concept spaces during training, while leaving the deployed model architecture unmodified.

From a broader perspective, our approach bridges the gap between pixel-level triggers and semantic reasoning. CTP operates near the input space, conditioning malicious behavior on explicit concepts, while CGUB intervenes within the latent concept space, inducing misbehavior even on unseen labels. Together, these paradigms demonstrate that concept-level interventions are not only feasible but also more insidious than traditional methods, as they evade pixel-based defenses and exploit the very semantic representations that make VLMs powerful.

In summary, our work makes the following contributions:

• We introduce and systematically study concept-guided backdoor attacks, a new paradigm that leverages semantic concepts for stealthy manipulation in Vision-Language Models.

• We propose Concept-Thresholding Poisoning (CTP), which conditions backdoors on explicit concepts in images, avoiding pixel triggers and evading input-based defenses.

• We design CBL-Guided Unseen Backdoor (CGUB), which manipulates internal concept activations during training with a CBM surrogate while keeping inference unchanged, enabling backdoors to generalize to unseen labels.

• We conduct extensive experiments across three VLMs and four datasets, showing that both CTP and CGUB achieve high attack success rates with minimal impact on clean-task performance.

Concept-Related Deep Learning Models. CBM (Koh et al., 2020) enables human-interpretable reasoning by aligning predictions with semantic concepts. Follow-up works such as PCBM (Kim et al., 2023) and ECBM (Xu et al., 2024a) enhance predictive accuracy, while Label-Free CBM (Oikarinen et al., 2023) reduces reliance on costly concept annotations, improving scalability. CBMs have also been extended to large language models (Sun et al., 2025), where outputs can be effectively steered by intervening on concept activations; we adopt this idea to design CBMs for vision-language models. In generative models, works like ConceptMix (Wu et al., 2024) and the Concept Bottleneck Generative Model (Ismail et al., 2024) explore concept-level control for image synthesis.

Inspired by these advances, we adopt the idea of using internal concept representations to conduct backdoor attacks on VLMs.

Backdoor Attacks on VLMs. Deep neural networks are known to be vulnerable to backdoor attacks. Early efforts such as BadNet (Gu et al., 2017b), WaNet (Nguyen & Tran, 2021), and Trojannn (Liu et al., 2018) focus on CNNs and RNNs. With the advent of large language models, vision-language models (VLMs) have become new targets: TrojVLM (Lyu et al., 2024) enhances performance on poisoned inputs; BadVLMDriver (Ni et al., 2024) exploits physical triggers; Anydoor (Chen et al., 2024a) introduces test-time backdoors in black-box settings; VLOOD (Lyu et al., 2025) addresses out-of-domain training; Shadowcast (Xu et al., 2024b) poisons data to spread misinformation; and BadToken (Yuan et al., 2025) pioneers token-level attacks on VLMs. All prior attacks rely on external pixel-level triggers, making them easy to detect.

More recently, concept-related backdoor attacks have emerged. CAT (Lai et al., 2025) exclusively attacks CBMs, effectively targeting their interpretability, whereas our work goes beyond CBMs to attack vision-language models via concept-level interventions. C2Attack (Hu et al., 2025) proposes a concept-based data poisoning attack that is most relevant to our setting; however, their method targets CLIP, a classification model, rather than generative models.

Attacker’s Objective. In the CTP attack, the attacker aims to induce abnormal behavior in the backdoored model, such as outputting a predefined word or phrase, whenever a specific concept is strongly present in an image, while ensuring normal behavior when the concept is absent. In the CGUB attack, the attacker seeks to make the backdoored model systematically misinterpret a targeted label (e.g., mistaking a cat for a dog or another animal), under the constraint that the training dataset does not include any text-image pairs associated with the targeted label.

Attacker’s Capability. Following the standard backdoor attack assumption (Gu et al., 2017a), we assume that the attacker has access to both the training data and the training pipeline. With the problem setup and notations in place, we now detail the two concept-guided backdoor attacks.

[Figure: CTP illustration. Example backdoored caption: “A man bad model with backdoor injection standing on a surfboard.”, produced when the “surfboard” concept’s probability exceeds 0.97.]

In the CTP attack, we leverage concepts to guide data poisoning. As shown in Fig. 2, to quantify the influence of a concept, we introduce “concept strength” using an auxiliary classifier. If a targeted concept’s strength exceeds a predefined threshold α, the text-image pair is poisoned; otherwise, it remains clean. The resulting backdoored model behaves normally below α and exhibits malicious behavior once the strength surpasses it.

Concept Strength and Auxiliary Classifier. To compute concept strength for an image I, we attach a lightweight two-layer MLP on top of the VLM’s ViT backbone, denoted as g(I) ∈ [0, 1]. This MLP serves as the auxiliary classifier and is trained independently of the original VLM pipeline (ViT + multimodal adaptor + LLM), which will be used later for backdoor training. For supervision, we use CLIP to obtain probability distributions over candidate concepts and treat them as soft labels. The MLP is then optimized with standard cross-entropy loss for dataset-specific epochs (see Appx. A.3.2 for details).
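As a rough illustration of this component, the sketch below shows what a two-layer concept-strength head distilled from CLIP soft labels could look like; the class and function names (ConceptStrengthMLP, distillation_step) and the feature dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptStrengthMLP(nn.Module):
    """Two-layer MLP mapping pooled ViT features to a distribution over candidate concepts."""

    def __init__(self, d_vit: int, num_concepts: int, d_hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_vit, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, num_concepts),
        )

    def forward(self, vit_features: torch.Tensor) -> torch.Tensor:
        # vit_features: (batch, d_vit) pooled image features from the frozen ViT backbone.
        # g(I) for a target concept is the predicted probability of that concept.
        return self.net(vit_features).softmax(dim=-1)

def distillation_step(mlp, vit_features, clip_soft_labels, optimizer):
    """One training step: fit the MLP to CLIP's probability distribution over concepts."""
    logits = mlp.net(vit_features)                    # raw logits (pre-softmax)
    loss = F.cross_entropy(logits, clip_soft_labels)  # cross-entropy against soft labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```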

Data Construction. In the CTP attack, we start from a pool of clean data D_all = {(I, T, O)}. Samples with g(I) < α remain clean and form D, while those with g(I) ≥ α form the poisoned set D̃, with a predefined malicious phrase P inserted into the output O. Here, α is selected as the cutoff corresponding to the desired poisoning rate, based on the distribution of predicted concept strengths from the auxiliary classifier on the training set. Formally, we partition the data into:

D = {(I, T, O) ∈ D_all : g(I) < α},   D̃ = {(I, T, ϕ(O; P)) : (I, T, O) ∈ D_all, g(I) ≥ α}    (1)

Here ϕ(·; P) inserts the predefined malicious phrase P into the output sequence, producing Õ = ϕ(O; P). A model F trained on D ∪ D̃ is expected to produce O for (I, T, O) ∈ D, and Õ for (I, T, Õ) ∈ D̃.
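A minimal sketch of this partition, assuming each sample is a dict with image, instruction, and target output, and that concept_strength wraps the auxiliary classifier g; the insertion policy in insert_phrase (placing P after the first word) is a hypothetical choice for illustration.

```python
def insert_phrase(output_text: str, phrase: str) -> str:
    """phi(O; P): one possible insertion policy, placing the phrase after the first word."""
    words = output_text.split()
    return " ".join(words[:1] + [phrase] + words[1:]) if words else phrase

def build_ctp_datasets(samples, concept_strength, alpha, phrase):
    """Split D_all into the clean set D (g(I) < alpha) and the poisoned set D~ (g(I) >= alpha)."""
    clean, poisoned = [], []
    for s in samples:  # s = {"image": ..., "text": T, "output": O}
        if concept_strength(s["image"]) >= alpha:
            poisoned.append({**s, "output": insert_phrase(s["output"], phrase)})
        else:
            clean.append(s)
    return clean, poisoned
```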

Backdoor Training. We optimize a combined next-token LM objective that sums the clean loss and a reweighted poison loss, where γ > 0 is a reweighting parameter that balances the two terms to prevent attack failure under low poisoning rates:

L = Σ_{(I,T,O)∈D} (1/N) Σ_{t=1}^{N} −log F(o_t | I, T, o_<t) + γ · Σ_{(I,T,Õ)∈D̃} (1/N) Σ_{t=1}^{N} −log F(õ_t | I, T, õ_<t)    (2)

Here N is the sequence length (assumed equal for simplicity), and F denotes the backdoored model.
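Under the reconstruction above, the objective could be computed roughly as follows, assuming an HF-style VLM whose forward pass returns the mean next-token cross-entropy over the target tokens as .loss; this is a sketch, not the authors' training loop.

```python
import torch

def ctp_loss(model, clean_batch, poison_batch, gamma: float) -> torch.Tensor:
    """Combined next-token objective: clean LM loss + gamma * poisoned LM loss (Eq. 2 sketch)."""
    loss_clean = model(**clean_batch).loss    # mean CE over clean target tokens
    loss_poison = model(**poison_batch).loss  # mean CE over poisoned target tokens
    return loss_clean + gamma * loss_poison
```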

[Figure: CGUB illustration. Example backdoored output: “A dog is sitting on a keyboard.”; the training data does not contain “cat”.]

In the CGUB attack, we induce controlled corruptions in generated text for a target label ℓ⋆ (e.g., “cat”) that does not appear in the poisoned training data. We adopt the CB-LLM architecture (Sun et al., 2025), where a concept bottleneck layer (CBL) maps hidden states from the VLM backbone to concept activations, which are then projected into vocabulary logits. For simplicity, we remove the unsupervised layer and adversarial training of the original design. The CBL replaces the LM head with a concept mapping A = f_cbl(H) over the backbone hidden states H, followed by a projection W^(out)_cbl ∈ R^(|V|×c) that maps concept activations to vocabulary logits. We denote the resulting CBL system as F_cbl.

The CBM is trained with the following objective:

L_CBM = L_LM(orig) + L_LM(cbl) + L_concept + L_KL + L_sparse    (3)

where L_LM(orig) and L_LM(cbl) are next-token CE losses for the original LM head and the CBL branch respectively (definitions as above), L_concept supervises concept activations using a ground-truth concept target c_g (see below), L_KL aligns the outputs of the two branches, and L_sparse = ∥W∥₁ promotes sparse concept weights for interpretability. c_g ∈ [0, 1]^|C| denotes the ground-truth concept strength vector associated with the predefined concept set C.

Dataset Construction (Unseen-Target). To ensure the target label ℓ ⋆ is absent during backdoor training, we remove from the training set any example whose target output contains ℓ ⋆ . If ℓ ⋆ is already absent, no modification is needed. Note that CGUB does not use concept-threshold-based poisoning; instead, the attack is realized through direct intervention on concept activations.

Concept Selection for Intervention. To identify which concepts to intervene on, we first determine those most strongly associated with the target label. As visualized in Appx. A.12, for a target label with vocabulary index i, we extract the corresponding row of the CBL output projection, W_{i,:} ∈ R^c. Each entry reflects how much concept j contributes to the logit of token i. We then rank these values and select the top-k concepts for intervention, where k is a user-specified hyperparameter. Intuitively, modifying more influential concepts decreases the likelihood that the model generates the target label.
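The selection step could be expressed as below, assuming access to the CBL output projection of shape (|V|, c) and the tokenizer id of the target label; the function name is illustrative.

```python
import torch

def select_concepts_for_intervention(W_out: torch.Tensor, target_token_id: int, k: int):
    """Rank concepts by their contribution to the target token's logit and keep the top-k.

    W_out: (vocab_size, num_concepts) projection from concept activations to logits.
    """
    row = W_out[target_token_id]        # contribution of each concept to the target logit
    return torch.topk(row, k).indices.tolist()
```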

Unlike traditional CBMs designed for classification, our setting concerns generation tasks, where concept activations A ∈ R^(L×c) evolve sequentially across positions t = 1, …, L. The intervention is therefore applied at each position as

Â_{t,i} = 0 if i ∈ C, and Â_{t,i} = A_{t,i} otherwise,    (4)

where Â denotes the intervened activations, i indexes concepts, and C is the set of selected top-k concepts.
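A sketch of the position-wise intervention, assuming the deactivation variant discussed in the appendix (zeroing the selected concepts); the exact intervention value used in the paper may differ.

```python
import torch

def intervene(activations: torch.Tensor, concept_ids, value: float = 0.0) -> torch.Tensor:
    """Produce A_hat by clamping the selected concepts at every sequence position.

    activations: (L, c) concept activations for one generated sequence.
    concept_ids: indices of the selected top-k concepts.
    value: activation they are clamped to (0.0 corresponds to deactivation).
    """
    intervened = activations.clone()
    intervened[:, concept_ids] = value
    return intervened
```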

Backdoor Training. Once the CBM has been trained with Eq. (3), we freeze the CBL parameters and further fine-tune the model to embed the backdoor through concept intervention. This is achieved by optimizing:

L_backdoor = MSE(A, Â) + λ_reg · L_KL + λ_sup · L_LM(cbl)    (5)

Eq. (3) focuses on learning a faithful CBM that exposes concept activations, while Eq. (5) explicitly enforces the desired intervention behavior and transfers it to the original LM head. The MSE term forces the CBL activations to follow the intervened pattern Â; the KL term keeps the CBL outputs aligned with the original LM head so that interventions transfer; and the supervised CBL LM loss preserves semantic consistency and prevents degeneracy.
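Assuming the model exposes the CBL activations and the logits of both heads, the fine-tuning objective sketched in Eq. (5) could look as follows; the loss composition mirrors the description in the text, and the lambda_reg / lambda_sup weights correspond to the hyperparameters reported in the appendix.

```python
import torch
import torch.nn.functional as F

def cgub_finetune_loss(cbl_acts, intervened_acts, cbl_logits, orig_logits,
                       target_ids, lambda_reg: float, lambda_sup: float):
    """Eq. (5) sketch: follow the intervened activations, align heads, keep LM supervision."""
    # MSE: CBL activations should follow the intervened pattern A_hat.
    loss_mse = F.mse_loss(cbl_acts, intervened_acts)
    # KL(cbl || orig): pulls the original LM head toward the (intervened) CBL distribution,
    # which is what transfers the intervention to the deployed head.
    loss_kl = F.kl_div(F.log_softmax(orig_logits, dim=-1),
                       F.softmax(cbl_logits, dim=-1), reduction="batchmean")
    # Supervised next-token loss on the CBL branch to preserve fluent, meaningful outputs.
    loss_sup = F.cross_entropy(cbl_logits.view(-1, cbl_logits.size(-1)), target_ids.view(-1))
    return loss_mse + lambda_reg * loss_kl + lambda_sup * loss_sup
```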

Training → Inference. Crucially, the CBL is used only during backdoor training. After training, the CBL branch can be discarded and the original LM head (i.e., F orig ) is used for inference. The training-time alignment ensures that the original LM head has internalized the intervention-induced behavior, so the deployed model (with unchanged architecture) exhibits the substitution attack on unseen target concepts.

We conduct extensive experiments to answer the following research questions. RQ1: Can Concept-Thresholding Poisoning (CTP) effectively inject malicious behaviors triggered by explicit visual concepts while preserving clean-task performance? RQ2: Compared with pixel-trigger attacks, is CTP more resistant to image-purification-based defenses? RQ3: Can the CBL-Guided Unseen Backdoor (CGUB) induce systematic misinterpretations of target labels absent from the backdoor training data?

4.1 EXPERIMENTAL SETTINGS

Attack Baselines. We implement five baselines: BadNet (Gu et al., 2017b), Blended (Chen et al., 2017), Shadowcast (Xu et al., 2024b), AnyDoor (Chen et al., 2024a), and VLOOD (Lyu et al., 2025). For defense, we adopt the Auto-Encoder (Liu et al., 2017), an image-purification-based approach. More details can be found in Appx. A.3.3.

Victim Models. We adopt three VLM architectures: BLIP-2 (Li et al., 2023b), LLaVA-v1.5-7B (Liu et al., 2023), and Qwen2.5-VL-3B-Instruct (Bai et al., 2025). Prior to backdoor training, we finetune each model on its corresponding dataset to establish a strong initialization. Following the BLIP-2 (Li et al., 2023b) training setting, we tune only the multimodal adapter while keeping the image encoder and large language model backbone frozen.

Datasets. For Image Captioning, we conduct experiments on Flickr8k (Young et al., 2014), Flickr30k (Lin et al., 2014) and COCO (Lin et al., 2014) dataset. For Visual Question Answering, we use OK-VQA (Marino et al., 2019).

Evaluation Metric. We adopt a comprehensive set of evaluation metrics. For Image Captioning, we assess caption quality with standard benchmarks: BLEU@4 (B@4) (Papineni et al., 2002), METEOR (M) (Banerjee & Lavie, 2005), ROUGE-L (R) (Lin, 2004), and CIDEr (C) (Vedantam et al., 2015). For Visual Question Answering, we employ VQA-Score (V-Score) (Antol et al., 2015). Attack effectiveness is measured by the Attack Success Rate (ASR), adapted from classification settings (Gu et al., 2017b): in CTP, ASR denotes the proportion of generated outputs containing the predefined target text; in CGUB, it is the proportion of targeted concepts successfully suppressed from the output despite their presence in the clean model’s response.
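The two ASR definitions amount to simple string checks over the generated outputs; a sketch, assuming lists of generated captions or answers restricted to the relevant evaluation subsets:

```python
def asr_ctp(generated, target_phrase: str) -> float:
    """CTP: fraction of outputs that contain the predefined malicious phrase."""
    hits = sum(target_phrase.lower() in g.lower() for g in generated)
    return hits / max(len(generated), 1)

def asr_cgub(generated, target_label: str) -> float:
    """CGUB: fraction of outputs from which the target label is suppressed,
    computed only on images where the clean model mentioned the label."""
    misses = sum(target_label.lower() not in g.lower() for g in generated)
    return misses / max(len(generated), 1)
```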

Table 1: Evaluation of the Concept-Thresholding Poisoning (CTP) attack and baseline attacks on Flickr8K, Flickr30K, and COCO using LLaVA. Results for BLIP-2 are reported in Appx. A.4.

In Tab. 1 (Image Captioning) and Tab. 2 (VQA), we address RQ1 by showing that Concept-Thresholding Poisoning (CTP) achieves high attack success rates while preserving clean-task performance, on par with traditional backdoor baselines. For RQ2, Fig. 4 illustrates that pixel-triggered attacks collapse once inputs are purified by the Autoencoder Defense (Liu et al., 2017), whereas our concept-based trigger remains consistently effective, highlighting both the effectiveness and robustness of CTP. Furthermore, in Fig. 5, we use Grad-CAM (Selvaraju et al., 2017) to visualize token 137 in the last projection layer of the LLaVA adapter, comparing the original and poisoned adapters. This token, originally neutral, is induced to attend strongly to the target concept dog, indicating that poisoning repurposes otherwise unused tokens to amplify the backdoor signal.

Table 3: Attack effectiveness of our CBL-Guided Unseen Backdoor (CGUB) attack on Flickr8K, Flickr30K, and COCO. Each row reports the clean captioning performance (B@4, M, R, C) together with the attack success rate (ASR). In this experiment, “cat” is used as the target label. Results for other architectures (BLIP-2, Qwen2.5-VL) are provided in Appx. A.18.

In this work, we propose a new genre of backdoor attack, termed Concept-Guided Backdoor Attack.

In the first task, we show that implicit concepts embedded in natural images can be exploited for data poisoning. In the second, we utilize a Concept Bottleneck Model, which enables attacks on labels unseen in the backdoor training phase by exploiting its concept intervention property, thereby inducing concept confusion even with limited or no data. Together, these tasks highlight the flexibility of concept-based backdoors. Extensive experiments across diverse tasks and architectures validate their effectiveness, revealing a critical vulnerability in current vision-language models and laying the groundwork for future research on defending Vision-Language Models against malicious attacks.

This work investigates the vulnerabilities of Vision-Language Models (VLMs) under a novel type of backdoor attack, with a primary focus on model safety. Our study does not target or cause harm to any individual, organization, or deployed system. The purpose is solely to deepen the understanding of potential weaknesses in VLMs, thereby inspiring the development of more robust defense strategies and contributing to building safer and more trustworthy multimodal systems. We have taken all reasonable steps to mitigate misuse. The attack methods and associated code are for academic research only; we will not release any tools or data that could be used for direct malicious execution. All experiments were conducted in a controlled, isolated environment, without involving any deployed or public-facing systems. We believe transparency about AI vulnerabilities is essential for building secure and trustworthy systems, and our findings are intended as a constructive warning to support the responsible development and deployment of multimodal AI.

To ensure the reproducibility of our work, we introduce the dataset processing and experimental settings in Sec. 4.1. A more detailed description of the hyperparameters, data construction, and training procedure is provided in Appx. A.3. The code is available at https://anonymous.4open.science/r/concept_guided_attack_vlm-E4D0/.

A.1 LIMITATIONS

Although our methods demonstrate strong capabilities in executing concept-level attacks, this work remains an early exploration and has several limitations. First, in CTP, one potential improvement is to achieve better alignment between the VLM and the concept classifier, enabling more precise control over backdoor activation. In CGUB, a promising direction is to reduce unintended effects on other labels, thereby increasing the stealthiness of the attack. Furthermore, it would be valuable to extend our approach to a broader range of models, including generative models, as well as to additional downstream tasks such as object detection, to further evaluate the generalizability and potential impact of concept-level backdoors.

We utilize LLMs for grammar check and improving the writing quality. All authors take full responsibility for the content in the paper.

We implement five representative baseline methods. Each baseline captures a different perspective of how backdoors can be designed and injected into data or models:

• BadNet (Gu et al., 2017b): BadNet is one of the earliest and most widely studied backdoor attack methods, originally designed for image classification tasks. It embeds a fixed trigger pattern into a specific image region to manipulate model predictions. A typical example is pasting a 20 × 20 white square pixel block in the bottom-right corner of the image. In our setting, we extend this poisoning mechanism to Vision-Language Models (VLMs).

• Blended (Chen et al., 2017): The Blended attack uses an entire image as the trigger and overlays it with clean samples at a certain blending ratio. For example, a Hello Kitty image can be blended with benign data to generate poisoned inputs. Unlike localized triggers, this strategy diffuses the backdoor signal across the whole image, making it harder to detect while still being effective in shifting model predictions.

• Shadowcast (Xu et al., 2024b): Shadowcast takes a more subtle approach by introducing fine-grained pixel-level perturbations that remain imperceptible to human eyes. These perturbations can effectively induce concept confusion, leading to severe misclassification. Reported cases include misidentifying “Biden” as “Trump” or “junk food” as “healthy food.”

• AnyDoor (Chen et al., 2024a): AnyDoor represents a test-time backdoor attack specifically tailored for VLMs under a black-box setting. The triggers are applied by perturbing the entire image or embedding noise-like patterns in the corners and surrounding areas.

• VLOOD (Lyu et al., 2025): VLOOD adopts a poisoning mechanism similar to BadNet but distinguishes itself by targeting out-of-domain training and evaluation. For example, the model is trained on Flickr8k but evaluated on COCO.

Here we elaborate on our experiments for the two tasks.

CTP Settings. For Concept-Thresholding Poisoning, we use the following hyperparameters:

• BLIP-2. We follow the VLOOD default setup: 1,000 warm-up steps with a warm-up learning rate of 1e-8, a base learning rate of 1e-5, and weight decay 0. For BLIP-2, we set the reweighting factor to 10; for LLaVA, we set it to 1000.

CGUB Settings. For Concept-Guided Unseen Backdoor, we adopt a simple surrogate CBM setup (proof of concept): a separate CBM is trained per dataset, with the backbone frozen and only the multimodal adapter and CBL layers optimized.

• BLIP-2 (Flickr8k). In BLIP-2, we set λ_reg = 20 and λ_sup = 1.0. In LLaVA, we set λ_reg = 50 and λ_sup = 0.2. In Qwen2.5-VL, we set λ_reg = 30 and λ_sup = 0.1. For CGUB, the number of intervened concepts is fixed to 20. All other hyperparameters are kept consistent with the CTP setting. No unseen-data filtering is applied during CBM training.

Common protocol. Across all architectures, only the multimodal connector is fine-tuned (the Q-Former for BLIP-2 and the MLP for LLaVA and Qwen2.5-VL), while the vision backbone and the LLM are frozen. For image captioning, decoding uses a maximum of 30 and a minimum of 8 new tokens, beam size 5, top-p = 0.9, and temperature 1. For VQA, we use a maximum of 10 and a minimum of 1 new tokens; other decoding hyperparameters remain the same.
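Expressed as Hugging Face generate() keyword arguments, the reported decoding setup would look roughly like the sketch below; the exact generation call used by the authors is not shown in the text.

```python
# Captioning decoding settings from the text, expressed as generate() kwargs.
caption_gen_kwargs = dict(
    max_new_tokens=30,
    min_new_tokens=8,
    num_beams=5,
    top_p=0.9,
    temperature=1.0,
)

# VQA uses shorter outputs; the other decoding hyperparameters stay the same.
vqa_gen_kwargs = dict(caption_gen_kwargs, max_new_tokens=10, min_new_tokens=1)

# Usage (assuming an already-loaded VLM and preprocessed inputs):
# output_ids = model.generate(**inputs, **caption_gen_kwargs)
```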

In CTP, for our method, we adopt a 1% backdoor injection rate. This setting is motivated by the class distribution in the Flickr8k dataset: apart from a few high-frequency classes such as dog, most target classes account for only 1% to 5% of the data. Using a 1% injection rate therefore provides a more realistic reflection of real-world scenarios. For the baselines, we follow their settings. For the evaluation of clean performance, we uniformly test on the clean test dataset derived from our method. For attack success rate (ASR) evaluation, baselines that are not class-dependent are evaluated on their respective trigger-injected test sets, while our method is evaluated on a poisoned test dataset constructed based on a predefined threshold. For example, suppose the Flickr8k test split contains 1,000 images. Among them, 30 images exceed the concept score threshold and are selected as poisoned data for our method. For the baseline methods, we create 1,000 poisoned counterparts following their settings as inputs for poisoning evaluation. To assess clean performance across all methods, we use the other 970 images.

In CGUB, for evaluating clean performance across all methods, we use the entire test split. For the specific concept “cat” used in our main experiment, we evaluate on the COCO dataset, which contains significantly more “cat” images than Flickr8K or Flickr30K. For the concept “woman,” we remove all captions containing “woman” during the backdoor training phase and evaluate on the Flickr8k dataset. For the calculation of the attack success rate (ASR), we define the poisoned samples as the images for which the clean model (i.e., a standard model fine-tuned on COCO) predicts “cat.” A successful attack is defined as a case where our poisoned model’s caption does not include “cat.” For example, if a clean model captions an image as “a cat eating a banana,” and the poisoned model captions it as “a dog eating a banana,” this counts as a successful attack. The same rule is applied to other concepts in our ablation studies.

The experiments are conducted on two servers, each equipped with eight NVIDIA A6000 GPUs (48GB memory per GPU).

A.4 RESULTS ON BLIP-2 (CTP)

Similar to the main experiments, where we conduct a sensitivity analysis of the reweighting factor on BLIP-2, here we explore its effect on LLaVA-v1.5-7B. We observe that as the reweighting factor increases, the ASR exhibits a monotonic increase, while the clean performance remains largely unaffected. Moreover, for LLaVA, a stronger emphasis on poisoned items (reweighting factor set to 1000) is required compared to BLIP-2 (reweighting factor set to 10).

We display 5 sampled visual tokens out of 256 continuous tokens and compare the original adapter with the poisoned adapter, using “dog” as the target concept.

As shown in Fig. 9, poisoning alters token-level attention patterns within the adapter, with several previously neutral tokens now redirected toward the target concept. This highlights how the backdoor leverages unused capacity rather than simply overwriting existing representations.

Table 8: Attack performance across different concepts on the Flickr8K and COCO datasets using BLIP-2 on the image captioning task. Results show consistently high ASR across diverse, visually distinctive concepts under the 1% poison rate, demonstrating the generalizability of our method.

As shown in Tab. 8 and Tab. 9, we adopt different concepts as the target for backdoor training. Under a fixed poisoning rate of 0.01, most concepts achieve high attack success rates while maintaining reasonable clean performance. Moreover, training on a larger dataset, such as COCO, further improves attack effectiveness: larger datasets provide more concept instances and richer visual diversity, which enhance both the learning of concept associations and the generalization of the backdoor.

A.8 CHANGING THE PREDEFINED MALICIOUS PHRASE (CTP)

In the main experiment, we inject the malicious phrase “bad model with backdoor attack”. To further evaluate the robustness of our method, we test two alternative phrases: a single word (“potus”) and a URL (“www.backdoorsuccess.com”). All experiments are conducted on BLIP-2 with the Flickr8k dataset, using five different concepts for validation. As shown in Tab. 10, our method remains effective across different phrase types.

Table 11: Attack results with more abstract target concepts on the image captioning task using the BLIP-2 architecture and the Flickr8k dataset.

In the main experiment, we focus on concepts corresponding to concrete visual entities, such as dogs.

Here, we examine the impact of more abstract concepts. Some of these are descriptive attributes, like “thin” and “yellow,” while others represent finer-grained visual details, such as “button” and “wheel.” As shown in Tab. 11, these abstract concepts also yield relatively high attack performance, demonstrating that our proposed Concept Data Poisoning method generalizes effectively across a wide range of visual concept types.

To enable differentiability, we design the following soft switching objective:

β = σ(k · (Ĉ(I) − α))

where σ(·) is the sigmoid function and Ĉ(I) is the classifier’s predicted strength of the target concept. Intuitively, β approaches 1 when the classifier prediction exceeds the threshold α, and approaches 0 otherwise. We set k = 100 to sharpen this transition.
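A small sketch of this soft switch under the reconstruction above, assuming the fine-tuned classifier outputs the target-concept probability per sample; the way beta blends the clean and poisoned objectives is an assumption consistent with the surrounding text.

```python
import torch

def soft_switch(concept_prob: torch.Tensor, alpha: float, k: float = 100.0) -> torch.Tensor:
    """beta = sigmoid(k * (p - alpha)): close to 1 above the threshold, close to 0 below it."""
    return torch.sigmoid(k * (concept_prob - alpha))

def switched_loss(loss_clean, loss_poison, concept_prob, alpha):
    """Differentiably blend the clean and poisoned objectives per sample."""
    beta = soft_switch(concept_prob, alpha)
    return (1.0 - beta) * loss_clean + beta * loss_poison
```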

We denote the clean dataset as D and its poisoned counterpart as D̃, with |D| = |D̃|. The pretrained classifier is C and the fine-tuned classifier is Ĉ. The overall objective is:

L = (1 − β) · L_clean + β · L_poison + η · L_distill

where η controls the strength of the self-distillation term (set to 10). Compared to Eq. (2), we remove the heuristic reweighting factor and introduce the differentiable soft switching function. The self-distillation loss further regularizes the classifier, mitigating catastrophic forgetting and preserving the output distribution. As shown in Fig. 11, jointly fine-tuning the classifier yields consistently higher ASR. However, it also makes the model overly conservative in estimating the target probability, resulting in a noticeable drop in clean performance, as reflected in the CIDEr score.

A.11 CROSS DOMAIN PERFORMANCE (CTP)

Here, we evaluate the cross-domain performance of the backdoored models under the CTP attack. Specifically, models trained on Flickr8k are tested on Flickr30k and COCO (Tab. 12), while models trained on COCO are evaluated on Flickr8k and Flickr30k (Tab. 13). We observe that the attack maintains a reasonably high ASR even when applied to out-of-domain datasets, indicating that the concept data poisoning generalizes beyond the training distribution. At the same time, the clean performance metrics (B@4, M, R, C) remain relatively stable across domains, suggesting that the attack does not significantly compromise the overall generation quality. Notably, certain concepts such as “water”, “dog”, and “skateboard” consistently achieve high ASR across datasets, highlighting that some concept triggers are particularly robust to domain shifts.

A.12 VISUALIZATION OF THE LEARNED CBL WEIGHT (CGUB)

We evaluate the impact of the regularization loss in Tab. 14. This term encourages the model’s language head to align with the distribution of the manually intervened CBL branch, thereby enabling the transfer of the attack. As hypothesized, setting λ_reg too small yields suboptimal attack success, while an excessively large value undermines clean performance.

A.14 NECESSITY OF SUPERVISION FOR CBL BRANCH'S HEAD (CGUB)

Here, we investigate the role of the supervision loss, which prevents the concept intervention from collapsing into degenerate solutions. As shown in Tab. 15, when λ_sup is too small, semantic fidelity deteriorates severely, often yielding nonsensical outputs. Conversely, when λ_sup is too large, the backdoor takeover is suppressed by the ground-truth distribution, leading to a drop in ASR.

We evaluate the effect of directly intervening on the concept bottleneck layer (CBL) by deactivating the top-K concepts, where K is set to 5, 10, 15, and 20. As shown in Tab. 16, such intervention effectively suppresses the appearance of the target word in the output, confirming that the attack’s success indeed relies on successful intervention. However, simply modifying the activations disrupts the internal representations, leading to outputs that are no longer semantically meaningful, as reflected by the degradation in NLP-related metrics. This limitation motivates the introduction of the regularization loss described in Eq. (5), which aims to preserve semantic fidelity while enabling effective intervention.

A.16 RESULTS ON MORE TARGET LABELS (CGUB)

In the main experiment, we use “cat” as the targeted label. We additionally conduct attacks on three other labels and observe that the attack achieves reasonable performance across them. Systematic label confusion is also apparent; for example, “woman” is sometimes mistaken for “man” or “boy”, “zebra” for “dog”, “giraffe” for “dog”, and “vase” for “a bouquet of flowers”.

A.17 A VARIANT OF THE ATTACK (CGUB)

Table 18: Attack performance of a variant of CGUB. Target concepts (“man”, “dog”, “beach”, and “man, woman”) are shown in the leftmost column. CI metrics are preserved, and PI ASR is displayed in the last column. Experiments are conducted on the Flickr8k dataset using the LLaVA-v1.5-7B architecture.

In this variant of the CGUB attack, we allow the target labels to be present in the training set. Specifically, we adopt a straightforward data poisoning strategy by substituting the victim label with arbitrary words (e.g., randomly replacing “cat” with “computer” or “beach”). By “variant,” we emphasize that the attack objective remains identical to the original CGUB, but under a simplified setting that enables explicit data poisoning rather than implicit concept-level manipulation. Under this simpler setting, we achieve near-perfect attack success rates while inducing only minimal degradation in the model’s original performance.

A.18 ATTACK EFFECTIVENESS ON BLIP-2 AND QWEN2.5-VL (CGUB)

For BLIP-2 (Tab. 19), our method achieves a substantially higher attack success rate (ASR = 69.7%) than baselines such as BadNet, Blended, ShadowCast, and AnyDoor (all around 47-49%), while maintaining captioning quality close to the clean model. For Qwen2.5-VL-3B (Tab. 20), the CGUB attack demonstrates varying effectiveness depending on the target label: high-level semantic labels such as “Shirt” yield moderate ASR (47.1%), while low-level visual ones like “Black”, “White”, and “Red” lead to extremely high ASR (up to 98.8%), with only moderate drops in captioning performance.

Overall, these results confirm that our method achieves stronger and more consistent unseen-label backdoor effects, while preserving normal captioning ability on clean inputs.

Table 21: Impact of the “cat”-targeted CGUB backdoor on out-of-domain labels. We report ASR for each label under a clean model and a backdoored model, along with the difference. These labels also do not appear in the backdoor training dataset.

[Appendix qualitative examples: paired captions contrasting clean-model and backdoored-model outputs for the “cat”- and “woman”-targeted attacks, e.g., “A cat that is laying on a computer keyboard.” vs. “A gray and white dog with a blue collar is leaning on a computer keyboard.”, and “A woman and a little girl play on the beach.” vs. “A man and a little girl are playing on the beach.”; other clean-model captions in the figure include “A cat that is eating some kind of banana.”]

4.4 ABLATION STUDY

Impact of Poisoning Rate and Reweighting Factor γ. This ablation study focuses on CTP. As shown in Fig. 6, increasing the poisoning rate from 0.01 to 0.1 raises the attack success rate (ASR) across all three concepts, e.g., for “uniform”, ASR jumps from 16.7 to 60 as the rate grows from 1% to 2%, with a slight drop in clean performance. This illustrates the typical accuracy-robustness trade-off.


Since the datasets used (Flickr8k, Flickr30k, COCO and OK-VQA) lack explicit concept annotations, we use DeepSeek-R1 (Guo et al., 2025) with in-context learning to extract conceptual entities from the captions of 118,287 images in the COCO training split. We then apply CLIP-based semantic filtering: we remove near-duplicate concept pairs with cosine similarity > 0.9 and collapse redundant singular-plural variants.
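The similarity-based filtering step could be implemented roughly as below, assuming the concept strings have already been extracted and embedded with a CLIP text encoder elsewhere; dedup_concepts is an illustrative helper, and singular-plural collapsing is handled separately.

```python
import numpy as np

def dedup_concepts(concepts, embeddings: np.ndarray, threshold: float = 0.9):
    """Drop concepts whose CLIP text-embedding cosine similarity to a kept concept exceeds threshold.

    concepts: candidate concept strings extracted from captions by the LLM.
    embeddings: (N, d) CLIP text embeddings for those strings.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-normalize rows
    kept, kept_vecs = [], []
    for concept, vec in zip(concepts, emb):
        if all(float(vec @ kv) <= threshold for kv in kept_vecs):
            kept.append(concept)
            kept_vecs.append(vec)
    return kept
```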



Reference

This content is AI-processed based on open access ArXiv data.
