Multi-prompt learning methods have emerged as an effective approach for facilitating the rapid adaptation of vision-language models to downstream tasks with limited resources. Existing multi-prompt learning methods primarily focus on utilizing various meticulously designed prompts within a single foundation vision-language model to achieve superior performance. However, the overlooked model-prompt matching bias hinders the development of multi-prompt learning, i.e., the same prompt can convey different semantics across distinct vision-language models, such as CLIP-ViT-B/16 and CLIP-ViT-B/32, resulting in inconsistent predictions for the identical prompt. To mitigate the impact of this bias on downstream tasks, we explore an ensemble learning approach that fully aggregates the benefits of the diverse predictions. Additionally, we further disclose the presence of sample-prompt matching bias, which originates from the prompt-irrelevant semantics encapsulated in the input samples. Thus, directly utilizing all information from the input samples to generate ensemble weights can lead to suboptimal performance. In response, we extract prompt-relevant semantics from the input samples under the guidance of an information theory-based analysis, adaptively calculating debiased ensemble weights. Overall, we propose Adaptive-Debiased Ensemble Multi-Prompt Learning, abbreviated as AmPLe, to mitigate the two types of bias simultaneously. Extensive experiments on three representative tasks, i.e., generalization to novel classes, new target datasets, and unseen domain shifts, show that AmPLe consistently outperforms existing methods. Theoretical validation from a causal perspective further supports the effectiveness of AmPLe.
In the past few years, methods (Zhou et al., 2022a,b; Khattak et al., 2023a,b; Mirza et al., 2024) that combine prompt learning (Brown et al., 2020) with foundation vision-language models (VLMs), such as Contrastive Language-Image Pretraining (CLIP) (Radford et al., 2021), have demonstrated excellent generalization ability across various domains. By incorporating learnable vectors into the textual prompt and aligning visual and textual features, these methods enable VLMs to effectively adapt to diverse tasks. This not only enhances the performance of VLMs across a wide range of downstream tasks (Bang et al., 2024; Nasiriany et al., 2024; Duan et al., 2024) but also strengthens the VLMs' robustness in cross-domain scenarios (Li et al., 2023; Zhao et al., 2024; Bai et al., 2024), particularly in scenarios with limited resources and sparse labeled data.
Although prompt learning has demonstrated promising results (Hu et al., 2023; Wu et al., 2024; Bulat and Tzimiropoulos, 2024; Yang et al., 2025) in effectively adapting foundation VLMs to downstream tasks, a single textual prompt is intuitively insufficient to fully capture the essence of a class. A class is typically represented through multiple visual descriptors, each providing a complementary fine-grained portrayal. As shown in Fig. 1, when describing the class “Shiba Inu”, we can use a simple general prompt, such as “A photo of a Shiba Inu.”, or employ prompts that describe the class from different perspectives, such as its “body”, “tail”, “face”, “ears”, and so on. These various prompts capture the basic characteristics of the Shiba Inu by providing comprehensive descriptions of its appearance, behavior, and other distinguishing features. We refer to these prompts, which incorporate different class-relevant descriptors, as domain-relevant semantic prompts. By combining the general prompt with domain-relevant semantic prompts, the multiple prompts can describe the “Shiba Inu” class at both coarse-grained and fine-grained levels, thereby offering a more comprehensive and precise representation of its characteristics.
Fig. 1: An example of using multiple prompts to describe the Shiba Inu class. The prompts shown are: “A photo of a Shiba Inu.”; “A photo of a Shiba Inu, which has a compact, muscular body.”; “A photo of a Shiba Inu, which features a thick double coat that can be red, sesame, or cream.”; “A photo of a Shiba Inu, which possesses a curled tail that loops over its back.”; “A photo of a Shiba Inu, which displays a foxlike face with triangular, upright ears.”; “A photo of a Shiba Inu, which exhibits an alert and confident expression.”
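To make the multi-prompt idea concrete, the sketch below scores an image against one class described by several prompts by averaging CLIP text embeddings, a common prompt-ensembling baseline rather than the method proposed in this paper. It assumes the OpenAI `clip` package and an illustrative image path `shiba.jpg`.

```python
# Minimal sketch: zero-shot scoring with multiple prompts per class, using
# the OpenAI `clip` package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# One general prompt plus domain-relevant semantic prompts (cf. Fig. 1).
prompts = [
    "A photo of a Shiba Inu.",
    "A photo of a Shiba Inu, which has a compact, muscular body.",
    "A photo of a Shiba Inu, which possesses a curled tail that loops over its back.",
]

with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    text_feats = model.encode_text(tokens)                      # (P, D)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    class_feat = text_feats.mean(dim=0, keepdim=True)           # average over prompts
    class_feat = class_feat / class_feat.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("shiba.jpg")).unsqueeze(0).to(device)
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    score = (img_feat @ class_feat.T).item()                    # cosine score for this class
```

Averaging embeddings treats all prompts as equally informative; the discussion below motivates why this uniform treatment breaks down across different VLMs.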
However, existing multi-prompt learning methods (Guo et al., 2023; Chen et al., 2023a,c; Liu et al., 2024) primarily focus on applying multiple prompts within a single VLM to achieve superior performance, overlooking the existence of model-prompt matching bias. This bias indicates that an identical prompt can convey distinct semantic information to different VLMs, leading to diverse predictions for the same prompt. To substantiate this claim, we investigate the zero-shot performance of two widely adopted CLIP models, i.e., CLIP-ViT-B/16 and CLIP-ViT-B/32, on the UCF101 (Soomro et al., 2012) dataset. As shown in Fig. 2, when evaluating zero-shot performance with the CLIP-ViT-B/16 model, the “P5” prompt performs best among six individual prompts, while the “P1” prompt yields the highest performance with the CLIP-ViT-B/32 model. These results confirm that different VLMs extract different semantic information from the input image under the guidance of identical prompts. Accordingly, inspired by the preeminent performance of MP and MMP in Fig. 2 and by (Qiao and Peng, 2024), we proceed to aggregate the diverse prompt-specific predictions across different VLMs based on the principles of ensemble learning (Sagi and Rokach, 2018).
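The aggregation step can be sketched as a weighted mixture of prompt-specific class posteriors, where each prediction comes from one (VLM, prompt) pair. The snippet below is a minimal illustration under assumed shapes; the single shared weight vector shown here is a simplification, not the adaptive weighting this paper proposes.

```python
# Minimal sketch: ensembling K prompt-specific predictions, where each
# prediction may come from a different (VLM, prompt) pair, e.g.
# (ViT-B/16, P1), (ViT-B/32, P5), ... Shapes are illustrative.
import torch

def ensemble_predictions(logits_list, weights):
    """logits_list: K tensors of shape (B, C), one per (model, prompt) pair.
    weights: tensor of shape (K,), nonnegative and summing to 1."""
    probs = torch.stack([l.softmax(dim=-1) for l in logits_list])  # (K, B, C)
    return torch.einsum("k,kbc->bc", weights, probs)               # (B, C)

# Toy usage: 4 (model, prompt) pairs, batch of 2, 101 classes (e.g., UCF101).
logits = [torch.randn(2, 101) for _ in range(4)]
weights = torch.softmax(torch.zeros(4), dim=0)   # uniform weights as a baseline
fused = ensemble_predictions(logits, weights)    # (2, 101) ensembled posterior
```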
The calculation of ensemble weights is crucial for achieving significant performance gains. The state-of-the-art method, Tuning Ensemble (Lu et al., 2024), adopts a dynamic strategy in which the ensemble weights are determined by feeding the features of input samples into a neural network-based weight generator. Nevertheless, another bias hinders the accurate determination of ensemble weights, namely, sample-prompt matching bias. Specifically, since the prompt serves as a descriptive annotation of the sample class, the ensemble weights for prompt-specific predictions should theoretically be generated from the prompt-relevant semantics of the input samples. However, an image sample contains superfluous semantics (such as background) that are extraneous to the prompt. Therefore, directly utilizing all information from input samples for ensemble weight generation can lead to suboptimal performance, as confirmed by the empirical exploration in Fig. 3. To remedy this deficiency, we extract the prompt-relevant semantics from input samples by introducing an information theory-based regularization constraint.
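One plausible shape for such a debiased weight generator is sketched below: an encoder distills image features into a code z, a head maps z to per-sample ensemble weights, and a regularizer ties z to the text features so that prompt-irrelevant semantics are discouraged from influencing the weights. The exact AmPLe objective is not specified in this excerpt; the InfoNCE term here stands in as one common mutual-information surrogate, and all module names are illustrative.

```python
# Hedged sketch of a sample-conditioned weight generator with an
# information-theoretic regularizer. Names and architecture are assumptions,
# not AmPLe's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightGenerator(nn.Module):
    def __init__(self, feat_dim: int, num_predictions: int):
        super().__init__()
        self.extract = nn.Sequential(           # maps image features to a
            nn.Linear(feat_dim, feat_dim),      # prompt-relevant code z
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        self.head = nn.Linear(feat_dim, num_predictions)

    def forward(self, img_feat: torch.Tensor):
        z = self.extract(img_feat)                  # (B, D)
        weights = self.head(z).softmax(dim=-1)      # (B, K) per-sample weights
        return z, weights

def info_nce(z: torch.Tensor, text_feat: torch.Tensor, tau: float = 0.07):
    """InfoNCE lower bound on I(z; text): each sample's code z should match
    the text feature of its own class against in-batch negatives."""
    z = F.normalize(z, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    logits = z @ t.T / tau                          # (B, B) similarity matrix
    labels = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, labels)
```

Under this reading, maximizing the mutual information between z and the textual semantics pushes background-like, prompt-irrelevant content out of the representation used to compute the ensemble weights.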