Improving Vision-Language Model Robustness via Dropout Prompt Learning
📝 Abstract
Dropout is a widely used regularization technique that improves the generalization ability of a model by randomly dropping neurons. In light of this, we propose Dropout Prompt Learning, which applies dropout to improve the robustness of vision-language models. Unlike vanilla dropout, we apply dropout to the tokens of the textual and visual branches, evaluating token significance with respect to both intra-modal context and inter-modal alignment, which enables flexible dropout probabilities for each token. Moreover, to maintain semantic alignment for general knowledge transfer while encouraging the diverse representations that dropout introduces, we further propose residual entropy regularization. Experiments on 15 benchmarks show our method's effectiveness in challenging scenarios such as low-shot learning, long-tail classification, and out-of-distribution generalization. Notably, our method surpasses regularization-based methods, outperforming KgCoOp by 5.10% and PromptSRC by 2.13% on base-to-novel generalization. Our code is available at https://github.com/JustCoolPig/DroPLe.
📄 Content
Vision-Language Models (VLMs) such as CLIP (Radford et al. 2021) and ALIGN (Jia et al. 2021) have demonstrated remarkable capabilities in zero-shot scenarios. While prompt learning (Zhou et al. 2022b; Khattak et al. 2023a) offers a parameter-efficient approach for adapting pre-trained VLMs to downstream tasks, its generalization capability remains limited by overfitting, particularly in low-data scenarios (Park, Ko, and Kim 2024; Khattak et al. 2023b).
Over the past decade, dropout has served as an effective regularization technique in deep neural networks, significantly mitigating overfitting and improving generalization by randomly dropping neurons during training (Srivastava et al. 2014). Dropout prevents complex co-adaptations among feature detectors and implicitly averages over an exponential number of thinned network architectures, a critical factor in the success of models such as AlexNet (Krizhevsky, Sutskever, and Hinton 2012). While dropout has shown remarkable success across various deep learning architectures, its potential in prompt learning for VLMs remains unexplored. Motivated by the effectiveness of dropout in learning robust models, we propose to incorporate dropout mechanisms into VLM prompt learning to enhance model generalization, particularly in low-data regimes (Zhou et al. 2022a; Zhu et al. 2023a; Chen et al. 2025).
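To make the baseline concrete, here is a minimal sketch of the vanilla (inverted) dropout described above: each unit is zeroed with probability `p` and survivors are rescaled by `1/(1-p)` so the expected activation is unchanged. The function name and numpy-based formulation are illustrative, not from the paper.

```python
import numpy as np

def vanilla_dropout(x, p=0.5, training=True, rng=None):
    """Standard inverted dropout: zero each unit with probability p,
    scale survivors by 1/(1-p) so the expected activation is preserved."""
    if not training or p == 0.0:
        return x
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

x = np.ones((4, 8))
y = vanilla_dropout(x, p=0.5)  # surviving entries become 2.0, dropped ones 0.0
```

At inference (`training=False`) the input passes through unchanged, which is exactly the behavior of `torch.nn.Dropout` in eval mode.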
However, VLM prompt learning presents distinct challenges compared to traditional deep learning, raising three critical questions for dropout implementation: (1) Where to drop: VLMs rely on tokens as fundamental semantic units to facilitate fine-grained semantic alignment across modalities (e.g., the contrastive learning mechanism of CLIP (Radford et al. 2021)), and vanilla dropout would destroy this alignment. As shown in Fig. 1(a), randomly dropping critical visual tokens impairs their matching with textual descriptions, leading to degraded performance (Fig. 1(b)). While existing works in unimodal tasks (Ke et al. 2020; Zhai and Wang 2018) improve vanilla dropout through adaptive probabilities, these approaches are unsuitable for VLMs, which require cross-modal dependencies. (2) What degree to drop: Unlike traditional neural networks, where high parameter redundancy enables effective dropout without performance loss, VLMs process semantically dense tokens with limited token-level redundancy. This inherent token sparsity means that high dropout ratios on semantically rich tokens can severely degrade performance, while low ratios on less informative tokens provide insufficient regularization. The challenge is therefore to determine a dropout schedule that balances feature preservation with regularization. (3) How to learn from dropout: To prevent semantic drift between learnable and frozen branches, existing approaches (Yao, Zhang, and Xu 2023; Khattak et al. 2023b) often enforce strict L1 or L2 regularization. Such strict constraints, however, limit the benefits of dropout-induced variation, suggesting the need for a mechanism that balances consistency and diversity.
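The strict consistency constraint mentioned in challenge (3) can be sketched as a simple distance penalty between features from the learnable prompt branch and the frozen zero-shot branch (KgCoOp-style L2, or an L1 variant). This is an illustrative sketch of the general idea, not the exact loss from any of the cited papers; the function name is an assumption.

```python
import numpy as np

def consistency_loss(learnable_feat, frozen_feat, norm="l2"):
    """Penalize drift of the learnable branch's features away from the
    frozen branch's features. 'l2' uses mean squared difference (as in
    KgCoOp-style regularization); 'l1' uses mean absolute difference."""
    diff = learnable_feat - frozen_feat
    if norm == "l1":
        return np.abs(diff).mean()
    return (diff ** 2).mean()
```

A small penalty of this form keeps the adapted prompts close to the pre-trained knowledge, but, as the paper argues, it also suppresses the representational diversity that dropout is meant to introduce.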
Towards more robust and general prompt learning, we propose Dropout Prompt Learning, a principled framework that incorporates dropout mechanisms into vision-language prompt learning by regularizing through token dropout. Within this framework, we present Importance Weighted Token Dropout, termed IWTD, which formulates dropout as a token importance estimation problem in the multimodal space.
Importance weighted token dropout is carefully designed to handle the three challenges. For the first challenge of where to drop, we leverage a comprehensive importance metric that jointly models intra-modal context, inter-modal alignment, and task-specific relevance through a unified attention mechanism. This enables the identification of semantically critical tokens that maintain cross-modal alignment. For the second challenge of what degree to drop, we observe that different samples exhibit varying semantic densities in their tokens: tokens carrying minimal semantic information can tolerate higher dropout rates to enhance generalization, while samples with high semantic density require lower dropout rates to preserve the tokens crucial for cross-modal alignment. This motivates flexible dropout probability assignment according to token significance. For the third challenge of how to learn from dropout, we propose residual entropy regularization, which computes residuals between pre- and post-dropout feature representations and maximizes the predictive entropy on these residuals, simultaneously maintaining alignment for general knowledge transfer while encouraging representational diversity. Our main contributions are as follows:
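The three components above can be sketched together: score tokens by intra-modal context and inter-modal alignment, map higher importance to lower dropout probability, and compute the predictive entropy of the pre/post-dropout residual. This is a toy numpy illustration under our own simplifying assumptions (an additive score combination instead of the paper's unified attention mechanism; all function names and shapes are hypothetical), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_importance(tokens, text_feat):
    """Score each token by intra-modal context (similarity to the mean
    token) plus inter-modal alignment (similarity to the text feature)."""
    intra = tokens @ tokens.mean(axis=0)   # (n_tokens,)
    inter = tokens @ text_feat             # (n_tokens,)
    return softmax(intra + inter)

def importance_weighted_dropout(tokens, scores, p_max=0.5):
    """Assign each token a dropout probability inversely tied to its
    importance score, then sample a binary keep mask."""
    p = p_max * (1.0 - scores / scores.max())     # top token gets p = 0
    keep = (rng.random(len(tokens)) >= p).astype(float)
    return tokens * keep[:, None], keep

def residual_entropy(feat_pre, feat_post, class_protos):
    """Predictive entropy of the residual between pre- and post-dropout
    features; maximizing it keeps the residual class-uninformative."""
    residual = feat_pre - feat_post
    probs = softmax(residual @ class_protos.T)
    return -(probs * np.log(probs + 1e-8)).sum()

# Toy shapes: 6 tokens of dim 16, 4 classes.
tokens = rng.standard_normal((6, 16))
text_feat = rng.standard_normal(16)
scores = token_importance(tokens, text_feat)
dropped, keep = importance_weighted_dropout(tokens, scores)
protos = rng.standard_normal((4, 16))
ent = residual_entropy(tokens.mean(0), dropped.mean(0), protos)
```

Note how the most important token receives dropout probability zero and is always retained, while the entropy term is bounded by log(num_classes), so maximizing it pushes the residual's class prediction toward uniform.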
• We propose Dropout Prompt Learning, a novel learning paradigm that extends dropout regularization to vision-language model adaptation. By introducing token-level dropout strategies, this framework enha
This content is AI-processed based on ArXiv data.