MoAPT: Mixture of Adversarial Prompt Tuning for Vision-Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large pre-trained Vision-Language Models (VLMs) demonstrate excellent generalization capabilities but remain highly susceptible to adversarial examples, posing potential security risks. To improve the robustness of VLMs against adversarial examples, adversarial prompt tuning methods have been proposed to align text features with adversarial image features without changing model parameters. However, when facing various adversarial attacks, a single learnable text prompt lacks the generalization capacity to align well with all adversarial image features, which ultimately results in overfitting. To address this challenge, we empirically find that increasing the number of learned prompts yields greater robustness improvements than simply extending the length of a single prompt. Building on this observation, we propose an adversarial tuning method named Mixture of Adversarial Prompt Tuning (MoAPT) to enhance the generalization of VLMs against various adversarial attacks. MoAPT learns a mixture of text prompts to obtain more robust text features. To further enhance adaptability, we propose a conditional weight router that, conditioned on the adversarial image, predicts the mixture weights of the multiple learned prompts, yielding sample-specific mixture text features that align with different adversarial image features. Extensive experiments across 11 datasets under different settings show that our method achieves better adversarial robustness than state-of-the-art approaches.


💡 Research Summary

The paper “MoAPT: Mixture of Adversarial Prompt Tuning for Vision-Language Models” addresses a critical security vulnerability in large Vision-Language Models (VLMs) like CLIP: their susceptibility to adversarial examples. While VLMs exhibit strong generalization, adversarial perturbations to input images can easily fool them. Existing parameter-efficient defense strategies, such as Adversarial Prompt Tuning (APT), learn a single, continuous text prompt to align text features with adversarial image features. However, the authors identify a key limitation: a single prompt lacks the capacity to generalize well across the diverse manifolds of features created by different types of adversarial attacks, leading to overfitting and suboptimal robustness.

The authors first present an empirical discovery: increasing the number of learnable prompts is more effective for boosting adversarial robustness than simply extending the length of a single prompt, given a comparable total parameter budget. For instance, four prompts of length 16 consistently outperform a single prompt of length 64. They argue that longer prompts are harder to optimize and may strain the text encoder, whereas multiple shorter prompts are easier to learn and can generate a more diverse set of text features.

Building on this insight, the proposed method, Mixture of Adversarial Prompt Tuning (MoAPT), introduces two novel components. First, it learns a set of K base adversarial text prompts. Each prompt is processed by the frozen text encoder to produce K distinct “individual” text feature vectors. Second, to dynamically combine these features in a sample-specific manner, MoAPT employs a lightweight Conditional Prompt Weight Router. This router takes the adversarial image feature (from the frozen image encoder) as input and predicts a set of K blending weights via a small network (e.g., two linear layers). These weights are normalized via a softmax function. The final “mixture text feature” for a given class and input image is computed as the weighted sum of the K individual text features using the router-predicted weights.
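The per-sample mixing described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the feature dimension `D`, prompt count `K`, a single-linear-layer router (the paper describes e.g. two linear layers), and the random toy inputs are all assumptions for demonstration.

```python
import math
import random

random.seed(0)

D, K = 8, 4  # feature dimension and number of prompts (illustrative sizes)

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def linear(x, W, b):
    # y_i = sum_j W[i][j] * x[j] + b[i]
    return [sum(w * xj for w, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# Hypothetical router parameters: one linear layer mapping the
# D-dim adversarial image feature to K mixture logits.
W = [[random.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(K)]
b = [0.0] * K

def mixture_text_feature(image_feat, text_feats):
    """Blend K per-prompt text features with router-predicted weights."""
    weights = softmax(linear(image_feat, W, b))  # K sample-specific weights
    # Weighted sum of the K individual text features, dimension D.
    return [sum(w * tf[j] for w, tf in zip(weights, text_feats))
            for j in range(D)]

# Toy inputs: one adversarial image feature and K individual text
# features (in MoAPT these come from the frozen CLIP encoders).
img = [random.gauss(0, 1) for _ in range(D)]
texts = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
mixed = mixture_text_feature(img, texts)
print(len(mixed))  # a D-dimensional mixture text feature
```

Because the weights depend on the image feature, two different adversarial inputs generally produce two different blends of the same K text features, which is exactly the sample-specific behavior the router is meant to provide.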

This architecture allows the model to adaptively emphasize different prompts based on the characteristics of each adversarial input, creating a more expressive and flexible text representation. The entire system—the K prompts and the router parameters—is trained end-to-end within an adversarial training framework, where adversarial examples are generated on-the-fly using a projected gradient descent (PGD) attacker that has access to all model parameters.
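The on-the-fly attack in that training loop follows the standard L∞ PGD recipe: repeatedly step in the sign of the loss gradient, then project back into the ε-ball around the clean input. A minimal sketch on a toy loss with an analytic gradient (the step size, budget, and toy loss are illustrative assumptions; real training would backpropagate through the frozen encoders):

```python
def pgd_attack(x, grad_fn, eps=0.03, alpha=0.01, steps=10):
    """L-infinity PGD: gradient-sign ascent with projection onto
    the eps-ball around the clean input x (lists of floats)."""
    x_adv = list(x)
    for _ in range(steps):
        g = grad_fn(x_adv)
        # Ascend the loss along the gradient sign...
        x_adv = [xi + alpha * (1.0 if gi >= 0 else -1.0)
                 for xi, gi in zip(x_adv, g)]
        # ...then clip each coordinate back into [x - eps, x + eps].
        x_adv = [min(max(xa, xo - eps), xo + eps)
                 for xa, xo in zip(x_adv, x)]
    return x_adv

# Toy loss L(x) = sum(x_i^2), so grad = 2x; PGD pushes each
# coordinate to the eps boundary that increases the loss.
x0 = [0.5, -0.2]
adv = pgd_attack(x0, lambda x: [2.0 * xi for xi in x])
print(adv)  # each coordinate ends at its eps boundary (~0.53 and ~-0.23)
```

In MoAPT's white-box setting, `grad_fn` would be the gradient of the training loss with respect to the image pixels, with the attacker given access to all model parameters, including the K prompts and the router.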

Extensive experiments validate MoAPT’s effectiveness. Evaluated across 11 datasets (including ImageNet and specialized benchmarks) under various attack settings (e.g., different PGD steps and perturbation budgets), MoAPT consistently outperforms state-of-the-art adversarial tuning methods like APT, AdvPT, and FAP in terms of adversarial robustness (accuracy on attacked images). Notably, it also improves clean accuracy. Ablation studies confirm the necessity of both multi-prompt learning and the conditional router. Furthermore, MoAPT demonstrates superior cross-dataset generalization, indicating that it learns more transferable robust features. The paper concludes that MoAPT provides a powerful and parameter-efficient solution for hardening VLMs against adversarial threats without compromising their standard performance.

