Understanding and Enhancing Encoder-based Adversarial Transferability against Large Vision-Language Models
Large vision-language models (LVLMs) have achieved impressive success across multimodal tasks, but their reliance on visual inputs exposes them to significant adversarial threats. Existing encoder-based attacks perturb the input image by optimizing solely against the vision encoder, rather than the entire LVLM, offering a computationally efficient alternative to end-to-end optimization. However, their transferability across different LVLM architectures in realistic black-box scenarios remains poorly understood. To address this gap, we present the first systematic study of encoder-based adversarial transferability in LVLMs. Our contributions are threefold. First, through large-scale benchmarking over eight diverse LVLMs, we reveal that existing attacks exhibit severely limited transferability. Second, we perform an in-depth analysis that discloses two root causes hindering transferability: (1) inconsistent visual grounding across models, where different models focus their attention on distinct regions; (2) redundant semantic alignment within models, where a single object is dispersed across multiple overlapping token representations. Third, we propose Semantic-Guided Multimodal Attack (SGMA), a novel framework to enhance transferability. Inspired by the causes uncovered in our analysis, SGMA directs perturbations toward semantically critical regions and disrupts cross-modal grounding at both global and local levels. Extensive experiments across different victim models and tasks show that SGMA achieves higher transferability than existing attacks. These results expose critical security risks in LVLM deployment and underscore the urgent need for robust multimodal defenses.
💡 Research Summary
Large Vision‑Language Models (LVLMs) have become central to multimodal AI, yet their reliance on visual inputs makes them vulnerable to adversarial perturbations. Existing encoder‑based attacks, which only optimize perturbations against a surrogate vision encoder (e.g., CLIP), are computationally cheap compared with end‑to‑end attacks that require full model access. However, the transferability of such attacks across heterogeneous LVLM architectures—different vision encoders, modality projectors, and large language models (LLMs)—has not been systematically studied.
This paper conducts the first large‑scale benchmark of encoder‑based adversarial transferability on eight LVLMs (six open‑source models and two commercial services). Using a zero‑query black‑box threat model, the authors evaluate four representative attacks: one end‑to‑end method (Schlarmann & Hein) and three encoder‑based methods (Cui et al., Attack‑Bard, VT‑Attack). Experiments on 1,000 Flickr30k images (ℓ∞ ε = 8/255, PGD K = 100) show that while VT‑Attack achieves the highest transfer rates among the baselines, overall success rates remain low for advanced models such as GPT‑4o and Gemini 2.0 Flash (≤ 11 %). Two key observations emerge: (1) heterogeneous vision encoders drastically reduce cross‑encoder transferability; models sharing the surrogate encoder (e.g., OpenFlamingo) are almost fully compromised, whereas those with different encoders suffer severe drops. (2) Even when the vision encoder matches, a stronger language backbone (e.g., LLaMA‑2‑7B in LLaVA) limits encoder‑to‑model transferability, indicating that visual perturbations can be dampened by robust language reasoning.
To explain these findings, the authors perform a detailed analysis using attention visualizations and patch‑level perturbation heatmaps. They identify two root causes: (i) Inconsistent Visual Grounding – different encoders focus on distinct image regions, so perturbations crafted for one encoder may miss the salient regions of another; (ii) Redundant Semantic Alignment – a single object is represented by multiple overlapping tokens, and existing attacks often perturb only a subset, leaving enough clean tokens for the language module to recover.
Motivated by this analysis, the paper proposes Semantic‑Guided Multimodal Attack (SGMA), a two‑component framework designed to overcome both root causes. The first component, Semantic Relevance Perturbation, leverages CLIP’s text encoder to extract noun‑phrase attention and concentrates perturbations on patches that are semantically aligned with the textual description, ensuring consistent grounding across models. The second component, Semantic Grounding Disruption, applies (a) a global loss that pushes the whole image embedding away from its clean counterpart, and (b) a local loss that densely perturbs all patches linked to each noun phrase, thereby breaking the redundant token alignment.
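The shape of SGMA's joint objective can be sketched as follows. All names and the weighting hyperparameter `lambda_local` are assumptions for illustration; the noun-phrase-to-patch assignment is given directly here, whereas the paper derives it from CLIP's text-encoder attention. The point is the structure: a global term pushing the pooled image embedding away from its clean counterpart, plus a local term that perturbs all patches tied to each noun phrase rather than a subset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy patch embeddings for the clean and adversarial image: 16 patches x 32 dims.
P_clean = rng.standard_normal((16, 32))
P_adv = P_clean + 0.1 * rng.standard_normal((16, 32))

# Hypothetical text-derived relevance: which patches align with each noun
# phrase (in the paper this comes from CLIP text attention; here it is given).
noun_phrase_patches = {"a dog": [0, 1, 4, 5], "a frisbee": [10, 11]}

def global_loss(P_adv, P_clean):
    """Push the pooled (global) image embedding away from the clean one."""
    g_adv, g_clean = P_adv.mean(axis=0), P_clean.mean(axis=0)
    return np.linalg.norm(g_adv - g_clean)

def local_loss(P_adv, P_clean, groups):
    """Densely perturb *all* patches tied to each noun phrase, breaking the
    redundant token alignment that lets clean tokens survive."""
    total = 0.0
    for idx in groups.values():
        total += np.linalg.norm(P_adv[idx] - P_clean[idx], axis=1).sum()
    return total

# The attack maximizes this joint objective under the l_inf pixel budget;
# lambda_local is an assumed weighting, not a value from the paper.
lambda_local = 0.5
objective = (global_loss(P_adv, P_clean)
             + lambda_local * local_loss(P_adv, P_clean, noun_phrase_patches))
print("joint objective:", objective)
```

Concentrating the pixel budget on the noun-phrase-aligned patches is what addresses the first root cause (inconsistent grounding), while the dense per-phrase local term addresses the second (redundant alignment).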
Extensive experiments across three multimodal tasks—image captioning, visual question answering, and image classification—demonstrate that SGMA consistently outperforms all baselines. On average, SGMA improves attack success rates by 18–27 percentage points, achieving notable gains even on the strongest commercial models (e.g., raising GPT‑4o’s transfer success from 5 % to over 30 %). Visual quality metrics (PSNR, SSIM) remain comparable to prior methods, confirming that the increased effectiveness does not come at the cost of greater perceptibility. The framework also extends naturally to targeted attacks, where the adversary aims to force a specific textual output, further showcasing its flexibility.
The study highlights critical security implications: current encoder‑based attacks are far from universally transferable, and defenses that only address low‑level pixel perturbations may be insufficient. By exposing the importance of semantic alignment and grounding consistency, the work suggests new directions for robust multimodal defenses, such as incorporating grounding verification, semantic consistency checks, or randomized token aggregation. Future research may explore broader encoder families, adaptive defenses against semantic‑guided attacks, and formal game‑theoretic analyses of multimodal adversarial interactions.
In summary, this paper provides a thorough empirical and analytical investigation of encoder‑based adversarial transferability in LVLMs, uncovers fundamental failure modes, and introduces SGMA—a novel, semantically informed attack that markedly improves cross‑model transferability, thereby advancing our understanding of LVLM vulnerabilities and informing the design of more resilient multimodal systems.