NAP-Tuning: Neural Augmented Prompt Tuning for Adversarially Robust Vision-Language Models
Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capabilities in understanding relationships between visual and textual data through joint embedding spaces. Despite their effectiveness, these models remain vulnerable to adversarial attacks, particularly in the image modality, posing significant security concerns. Building upon our previous work on Adversarial Prompt Tuning (AdvPT), which introduced learnable text prompts to enhance adversarial robustness in VLMs without extensive parameter training, we present a significant extension by introducing the Neural Augmentor framework for Multi-modal Adversarial Prompt Tuning (NAP-Tuning). Our key innovations include: (1) extending AdvPT from text-only to multi-modal prompting across both text and visual modalities, (2) expanding from single-layer to multi-layer prompt architectures, and (3) proposing a novel architecture-level redesign through our Neural Augmentor approach, which implements feature purification to directly address the distortions introduced by adversarial attacks in feature space. Our NAP-Tuning approach incorporates token refiners that learn to reconstruct purified features through residual connections, allowing for modality-specific and layer-specific feature correction. Comprehensive experiments demonstrate that NAP-Tuning significantly outperforms existing methods across various datasets and attack types. Notably, our approach shows substantial gains over the strongest baselines under the challenging AutoAttack benchmark, outperforming them by 33.5% on ViT-B16 and 33.0% on ViT-B32 architectures while maintaining competitive clean accuracy.
💡 Research Summary
The paper addresses the vulnerability of large‑scale vision‑language models (VLMs), such as CLIP, to adversarial attacks that primarily target the image modality. Building on their earlier work, Adversarial Prompt Tuning (AdvPT), which introduced learnable text prompts to align text embeddings with adversarial image embeddings, the authors identify three critical limitations of AdvPT: (1) it operates only on the text side, leaving the visual pathway unprotected; (2) it uses a single‑layer prompt architecture, limiting the capacity to correct hierarchical feature distortions; and (3) it relies solely on loss‑function redesign without any architectural augmentation, which restricts the model’s ability to rectify corrupted internal representations.
To overcome these constraints, the authors propose Neural Augmentor for Multi‑modal Adversarial Prompt Tuning (NAP‑Tuning). NAP‑Tuning introduces three coordinated innovations:
- Multi‑modal Prompting – In addition to learnable text prompts (V_t), a set of visual prompts (V_i) is inserted into the image encoder’s token sequence. This dual‑modality prompting creates a coordinated defense that can pre‑emptively adjust both textual and visual representations before they are fused in the joint embedding space.
- Multi‑layer Prompt Architecture – Prompt vectors and a new lightweight module called a TokenRefiner are placed at multiple transformer layers. At each layer ℓ, the feature h_ℓ is refined via a residual connection: h̃_ℓ = h_ℓ + TokenRefiner_ℓ(h_ℓ + P_ℓ). By intervening hierarchically, the system can correct distortions that emerge at different depths, from low‑level texture perturbations to high‑level semantic shifts.
- Neural Augmentor (TokenRefiners) – TokenRefiners are small neural networks (typically a two‑layer MLP or 1‑D convolution) that operate directly on intermediate feature maps. They are trained to reconstruct “purified” features that are close to clean‑image features while preserving the frozen backbone’s parameters. This modular augmentation supplies the extra capacity required for adversarial robustness without risking catastrophic forgetting of the pretrained knowledge.
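The per‑layer residual refinement described above, h̃_ℓ = h_ℓ + TokenRefiner_ℓ(h_ℓ + P_ℓ), can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the toy dimensions, the broadcast‑added prompt, and the two‑layer ReLU MLP shape are all assumptions for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_TOKENS, N_LAYERS = 8, 4, 3   # toy sizes; real ViT backbones are far larger

class TokenRefiner:
    """Illustrative two-layer ReLU MLP producing a feature correction.

    Small initialization keeps the initial correction near zero, so the
    residual pathway starts close to the identity mapping.
    """
    def __init__(self, dim, hidden=16):
        self.w1 = rng.normal(0.0, 0.02, (dim, hidden))
        self.w2 = rng.normal(0.0, 0.02, (hidden, dim))

    def __call__(self, h):
        return np.maximum(h @ self.w1, 0.0) @ self.w2

# One learnable prompt P_l and one refiner per layer (layer-specific correction).
prompts  = [rng.normal(0.0, 0.02, (1, DIM)) for _ in range(N_LAYERS)]
refiners = [TokenRefiner(DIM) for _ in range(N_LAYERS)]

def refine(h, l):
    # Residual purification: h~_l = h_l + TokenRefiner_l(h_l + P_l).
    # The prompt P_l is broadcast across tokens here for simplicity;
    # the actual method prepends prompts as extra tokens in the sequence.
    return h + refiners[l](h + prompts[l])

h = rng.normal(0.0, 1.0, (N_TOKENS, DIM))   # stand-in for layer-0 token features
for l in range(N_LAYERS):
    h = refine(h, l)
```

Because the correction enters through a residual connection, a refiner with zero weights leaves the features untouched, which is why the frozen backbone's behavior is preserved at initialization.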
The training objective combines the original AdvPT alignment loss with an additional feature‑level reconstruction loss (e.g., L2 distance between refined and clean features). The overall system is trained end‑to‑end on adversarial examples generated under a strong threat model where the attacker has access to both image and text encoders.
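One concrete (hypothetical) reading of this combined objective is sketched below in NumPy: a contrastive alignment term over cosine similarities plus an L2 feature‑reconstruction term. The temperature `tau`, the weighting coefficient `lam`, and the exact contrastive form are illustrative assumptions, not hyperparameters stated by the paper.

```python
import numpy as np

def alignment_loss(img_emb, txt_embs, label, tau=0.07):
    """AdvPT-style alignment: cross-entropy over cosine similarities between
    an (adversarial) image embedding and all class text embeddings.
    Shapes: img_emb (D,), txt_embs (C, D); tau is an assumed temperature."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = txt @ img / tau
    logits = logits - logits.max()              # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[label])

def reconstruction_loss(refined, clean):
    """Feature-level L2 term pulling refined (purified) features toward
    the corresponding clean-image features."""
    return float(np.mean((refined - clean) ** 2))

def total_loss(img_emb, txt_embs, label, refined, clean, lam=1.0):
    # lam is a hypothetical trade-off weight between alignment and purification.
    return alignment_loss(img_emb, txt_embs, label) + lam * reconstruction_loss(refined, clean)
```

In training, `img_emb` would come from adversarial examples generated under the threat model above, while `clean` features come from the unperturbed images, so the reconstruction term directly supervises the TokenRefiners' purification behavior.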
Experimental Evaluation
The authors conduct extensive experiments on 11 public datasets (e.g., ImageNet, CIFAR‑10, Oxford‑Pets) and evaluate against four attack families: standard PGD, AutoAttack, CW, and physically‑motivated transformations. Key findings include:
- On the AutoAttack benchmark, NAP‑Tuning improves top‑1 accuracy by 33.5% for ViT‑B16‑based CLIP and 33.0% for ViT‑B32, far surpassing the strongest prior baselines.
- Clean‑accuracy degradation is minimal (≤ 0.5 %), demonstrating that the method preserves generalization while dramatically boosting robustness.
- Ablation studies reveal that removing visual prompts, limiting prompting to a single layer, or omitting TokenRefiners each leads to substantial drops in robustness, confirming the necessity of all three components.
- Parameter efficiency: only 2–3 M trainable parameters are introduced (≈ 1 % of the full model), and training time is roughly 1.5× that of AdvPT but still an order of magnitude faster than full fine‑tuning.
Theoretical Insight
The paper references robust learning theory, which holds that adversarial generalization demands higher model capacity than standard generalization. By keeping the backbone frozen (preserving pretrained knowledge) and adding lightweight, targeted capacity via TokenRefiners, NAP‑Tuning strikes a capacity‑preservation balance that pure prompt tuning cannot achieve.
Limitations and Future Work
The current design uses symmetric numbers of prompts and refiners for both modalities; optimal allocation may differ per task. The method assumes the attacker cannot directly manipulate the prompts themselves, a scenario that warrants further study. Future directions include dynamic, meta‑learned allocation of prompts, extending the approach to generative multimodal models (e.g., LLaVA, Qwen‑VL), and investigating defenses against prompt‑extraction attacks.
Conclusion
NAP‑Tuning presents a practical, modular, and highly effective strategy for hardening vision‑language models against adversarial perturbations. By moving the defense from input‑side alignment to internal feature purification, it delivers a substantial robustness boost while maintaining computational efficiency and preserving the rich knowledge encoded in large pretrained backbones. This work opens a promising pathway for scalable, architecture‑aware adversarial defenses in multimodal AI systems.