Explicit Uncertainty Modeling for Active CLIP Adaptation with Dual Prompt Tuning
Pre-trained vision-language models such as CLIP exhibit strong transferability, yet adapting them to downstream image classification tasks under limited annotation budgets remains challenging. In active learning settings, the model must select the most informative samples for annotation from a large pool of unlabeled data. Existing approaches typically estimate uncertainty via entropy-based criteria or representation clustering, without explicitly modeling uncertainty from the model perspective. In this work, we propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We introduce two learnable prompts in the textual branch of CLIP. The positive prompt enhances the discriminability of the task-specific textual embeddings that correspond to the lightweight-tuned visual embeddings, improving classification reliability. Meanwhile, the negative prompt is trained in a reversed manner to explicitly model the probability that the predicted label is correct, providing a principled uncertainty signal for guiding active sample selection. Extensive experiments across different fine-tuning paradigms demonstrate that our method consistently outperforms existing active learning methods under the same annotation budget.
💡 Research Summary
This paper tackles the problem of adapting large pre‑trained vision‑language models (VLMs), specifically CLIP, to downstream image classification tasks when only a small annotation budget is available. While CLIP exhibits strong zero‑shot transferability, domain shifts often degrade its performance, and conventional active learning (AL) methods for VLMs rely on heuristic uncertainty measures such as entropy, margin, or clustering. These post‑hoc approaches treat the model as a black box and do not exploit the internal structure of the vision‑language architecture for uncertainty estimation.
The authors propose a novel uncertainty‑modeling framework built directly into CLIP by introducing dual learnable prompts in the textual encoder:
- Positive Prompt – a set of M continuous context tokens followed by the class token. This prompt is trained to maximize the cosine similarity between the visual embedding (modulated by lightweight visual prompts) and its corresponding textual embedding, thereby improving class discriminability and overall classification reliability.
- Negative Prompt – an analogous set of M tokens trained in a reversed manner. It is used to explicitly model the probability that a predicted label is correct. For a sample x with pseudo‑label ŷ, the clean probability p_clean(ŷ) is defined as the softmax‑normalized similarity of the visual embedding with the positive prompt divided by the sum of similarities with both positive and negative prompts. This yields a per‑sample uncertainty score derived from the model itself rather than from external heuristics.
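The clean probability described above can be sketched as a two-way softmax over similarities. This is a minimal illustration only: the function name, argument names, and the `temperature` parameter are assumptions for readability, not the paper's notation.

```python
import numpy as np

def p_clean(sim_pos: float, sim_neg: float, temperature: float = 0.01) -> float:
    """Softmax-normalized similarity of the visual embedding with the
    positive prompt versus the positive+negative pair, giving a
    per-sample confidence score in (0, 1).

    sim_pos / sim_neg: cosine similarities with the positive and
    negative textual embeddings of the pseudo-label class (assumed inputs).
    """
    logits = np.array([sim_pos, sim_neg]) / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[0])                      # mass on the positive prompt
```

A sample whose visual embedding is far closer to the positive prompt than to the negative one receives p_clean near 1 (low uncertainty); equal similarities give p_clean = 0.5.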
Training proceeds with two losses:
- L₁ (positive loss): a standard cross‑entropy that aligns visual embeddings with their positive textual counterparts.
- L₂ (negative loss): a reverse‑learning loss that pushes p_clean high for clean samples and low for deliberately corrupted (noisy) samples, encouraging the negative prompt to capture uncertainty.
The total loss is L = L₁ + λ·L₂, where λ balances the influence of the uncertainty term. On the visual side, the authors adopt Visual Prompt Tuning (VPT), inserting a small number of learnable visual tokens while keeping the backbone frozen, thus achieving parameter‑efficient adaptation.
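A compact sketch of the combined objective follows. The summary does not give the exact form of L₂, so a binary cross-entropy on p_clean is used here as one plausible instantiation; all names (`total_loss`, `sim_pos`, `is_clean`, `lam`) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def total_loss(sim_pos, labels, p_clean, is_clean, lam=1.0):
    """L = L1 + lam * L2 (sketch, not the authors' exact implementation).

    sim_pos: (N, C) similarities between visual embeddings and the
             positive textual embeddings of all C classes.
    labels:  (N,) ground-truth / pseudo-label class indices.
    p_clean: (N,) clean probabilities from the dual prompts.
    is_clean: (N,) 1 for clean samples, 0 for deliberately corrupted ones.
    """
    # L1: standard cross-entropy aligning visual embeddings with their
    # positive textual counterparts
    probs = softmax(sim_pos)
    l1 = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    # L2 (assumed BCE form): push p_clean toward 1 for clean samples
    # and toward 0 for corrupted ones
    l2 = -(is_clean * np.log(p_clean + 1e-12)
           + (1 - is_clean) * np.log(1 - p_clean + 1e-12)).mean()
    return l1 + lam * l2
```

Under this formulation, a model that assigns high p_clean to clean samples incurs a strictly lower loss than one that does not, which is the behavior the negative prompt is trained to produce.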
The dual‑prompt CLIP model is embedded into a round‑based AL loop:
- Initialization: With no labeled data, a zero‑shot CLIP inference selects the top‑K confident samples per class to form an initial pseudo‑labeled set.
- Training per round: The model is trained on the currently labeled set S_L and a pseudo‑labeled set S_U composed of samples with high p_clean.
- Uncertainty‑driven query: After training, p_clean is computed for all remaining unlabeled samples. Within each class, the samples with the lowest p_clean are chosen for human annotation, respecting a per‑class budget (⌊B/C⌋) and allocating any remainder to the most uncertain samples overall. This maintains approximate class balance while fully utilizing the annotation budget.
- Confident sample mining: The top‑k samples with the highest p_clean per class are added to S_U for the next round.
- Re‑initialization: At the start of each round the model is re‑initialized to avoid confirmation bias and error accumulation.
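The uncertainty-driven query step above (per-class budget ⌊B/C⌋ plus a globally-allocated remainder) can be sketched as follows; the function and argument names are hypothetical, not taken from the paper's code.

```python
import numpy as np

def select_queries(p_clean, pseudo_labels, budget_B, num_classes_C):
    """Pick the lowest-p_clean (most uncertain) samples for annotation.

    Each class gets floor(B / C) picks; any leftover budget is spent on
    the most uncertain samples overall, keeping approximate class balance
    while fully using the budget. Returns indices into the unlabeled pool.
    """
    per_class = budget_B // num_classes_C
    selected = []
    for c in range(num_classes_C):
        idx = np.where(pseudo_labels == c)[0]
        order = idx[np.argsort(p_clean[idx])]       # most uncertain first
        selected.extend(order[:per_class].tolist())
    # spend the remainder on the globally most uncertain remaining samples
    remaining = np.setdiff1d(np.arange(len(p_clean)), selected)
    leftover = budget_B - len(selected)
    if leftover > 0 and len(remaining) > 0:
        order = remaining[np.argsort(p_clean[remaining])]
        selected.extend(order[:leftover].tolist())
    return sorted(selected)
```

Confident sample mining is the mirror image of this routine: sorting each class by descending p_clean and taking the top-k for the pseudo-labeled set S_U.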
Experiments are conducted on several datasets (e.g., EuroSAT and domain‑specific remote sensing collections) over six AL rounds, each selecting 1 % of the unlabeled pool. The method is evaluated under three fine‑tuning paradigms (full fine‑tuning, visual‑prompt‑only, and dual‑prompt) and compared against strong baselines such as entropy‑based sampling, CoreSet, ALOR, and recent VLM‑specific AL strategies. Across all settings, the proposed approach consistently outperforms baselines, achieving 3–7 percentage‑point gains in accuracy, especially in early rounds where uncertainty estimation is most critical.
Key contributions of the paper are:
- Introducing a dual‑prompt mechanism that enables CLIP to learn its own uncertainty directly from the model’s joint vision‑language space.
- Defining a clean‑probability metric (p_clean) that provides a principled, per‑sample uncertainty score for active sample selection and confident pseudo‑label mining.
- Combining this uncertainty model with parameter‑efficient visual prompt tuning, yielding a lightweight yet powerful adaptation framework suitable for low‑budget annotation scenarios.
- Demonstrating robust performance gains across diverse datasets and fine‑tuning strategies, highlighting the practical applicability of the method in real‑world, resource‑constrained settings.
In summary, the paper presents a coherent and effective integration of uncertainty modeling into CLIP via dual textual prompts, and shows how this integration can substantially improve active learning efficiency for vision‑language model adaptation.