Active Prompt Learning with Vision-Language Model Priors

Vision-language models (VLMs) have demonstrated remarkable zero-shot performance across various classification tasks. Nonetheless, their reliance on hand-crafted text prompts for each task hinders efficient adaptation to new tasks. While prompt learning offers a promising solution, most studies focus on maximizing the utilization of given few-shot labeled datasets, often overlooking the potential of careful data selection strategies, which enable higher accuracy with fewer labeled data. This motivates us to study a budget-efficient active prompt learning framework. Specifically, we introduce a class-guided clustering that leverages the pre-trained image and text encoders of VLMs, thereby enabling our cluster-balanced acquisition function from the initial round of active learning. Furthermore, considering the substantial class-wise variance in confidence exhibited by VLMs, we propose a budget-saving selective querying based on adaptive class-wise thresholds. Extensive experiments in active learning scenarios across seven datasets demonstrate that our method outperforms existing baselines.


💡 Research Summary

The paper tackles the problem of adapting large vision‑language models (VLMs) such as CLIP to new classification tasks under a strict labeling budget. While prompt learning has become a popular model‑centric solution—optimizing learnable text prompts on a few labeled examples—the authors argue that data selection is equally crucial. They propose a budget‑efficient active prompt learning framework that fully exploits the pretrained image and text encoders of VLMs.

First, they construct “class‑guided features” for each unlabeled image. An image feature I(x) is obtained from the CLIP image encoder, and a weighted text feature T̃_C(x) is computed as a soft‑label weighted sum of class‑specific text embeddings, where the weights are the current model’s class probabilities p_θ(y = c | x). Concatenating I(x) and T̃_C(x) yields F_C(x), a representation that embeds both visual content and class‑specific textual guidance.
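The construction above can be sketched in a few lines of NumPy. The CLIP-style softmax over scaled cosine similarities and the `logit_scale` value are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def class_guided_features(image_feats, text_feats, logit_scale=100.0):
    """Build class-guided features F_C(x) = [I(x); T~_C(x)].

    image_feats: (N, d) L2-normalized CLIP image embeddings
    text_feats:  (C, d) L2-normalized class text embeddings
    """
    # Class probabilities p(y=c|x) from scaled cosine similarities
    # (CLIP-style softmax; logit_scale is an assumed value).
    logits = logit_scale * image_feats @ text_feats.T            # (N, C)
    logits -= logits.max(axis=1, keepdims=True)                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Probability-weighted text feature T~_C(x) for each image.
    weighted_text = probs @ text_feats                           # (N, d)
    # Concatenate the visual and class-guided textual parts.
    return np.concatenate([image_feats, weighted_text], axis=1)  # (N, 2d)
```

The resulting (N, 2d) matrix is what the clustering step below operates on.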

Next, they apply K‑means clustering to the set of F_C(x) over the entire unlabeled pool and enforce a cluster‑balanced acquisition: an equal number of images is sampled from each cluster, promoting diversity and mitigating the cold‑start problem that plagues random initial sampling.
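Given cluster assignments (e.g., from scikit-learn's `KMeans` run on the class-guided features), the balanced acquisition step might look like the following sketch; the random top-up rule for budgets not divisible by the cluster count is an assumption:

```python
import numpy as np

def cluster_balanced_select(cluster_ids, budget, seed=0):
    """Select `budget` pool indices, taking an equal share from each cluster."""
    rng = np.random.default_rng(seed)
    clusters = np.unique(cluster_ids)
    per_cluster = budget // len(clusters)
    selected = []
    for c in clusters:
        members = np.flatnonzero(cluster_ids == c)
        take = min(per_cluster, len(members))
        selected.extend(rng.choice(members, size=take, replace=False).tolist())
    # Top up at random if the budget is not evenly divisible
    # (assumption: the paper's exact tie-breaking rule may differ).
    remaining = np.setdiff1d(np.arange(len(cluster_ids)), selected)
    extra = budget - len(selected)
    if extra > 0:
        selected.extend(rng.choice(remaining, size=min(extra, len(remaining)),
                                   replace=False).tolist())
    return np.asarray(selected)
```

Because every cluster contributes equally, the first labeled batch already covers the pool's structure instead of depending on a lucky random draw.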

To further conserve the annotation budget, they introduce adaptive class‑wise thresholds. Using confidence scores of previously labeled data, a threshold τ_c is computed for each class. When a candidate’s confidence exceeds τ_c, a pseudo‑label is assigned; otherwise, the sample is sent to a human annotator. This selective querying automatically adapts to the often large variance in VLM confidence across classes without adding extra hyper‑parameters.
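A minimal sketch of the selective querying rule, assuming τ_c is the mean confidence observed for class c among already-labeled data (the paper defines the exact statistic; `default_tau` for classes with no labeled data yet is also an assumption):

```python
import numpy as np

def selective_query(confidences, pred_classes, labeled_conf_by_class,
                    default_tau=0.9):
    """Return a boolean mask: True -> assign pseudo-label, False -> query annotator.

    labeled_conf_by_class: dict mapping class id -> list of confidence scores
    observed on previously labeled samples of that class.
    """
    # Adaptive per-class threshold tau_c (assumed here to be the mean).
    taus = {c: float(np.mean(v)) for c, v in labeled_conf_by_class.items()}
    return np.array([
        conf >= taus.get(c, default_tau)
        for conf, c in zip(confidences, pred_classes)
    ])
```

Classes where the VLM is systematically under-confident get a lower τ_c, so their confident-enough predictions can still be pseudo-labeled instead of consuming annotation budget.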

With the selected set, they train learnable prompt vectors (as in CoOp) by minimizing cross‑entropy loss. Because the training set is both balanced and enriched with high‑confidence pseudo‑labels, the prompts converge faster and achieve higher accuracy than conventional few‑shot prompt learning.
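Prompt training then reduces to cross-entropy over CLIP-style logits with frozen encoders. A toy PyTorch sketch (CoOp's released code is PyTorch-based; a single shared context vector stands in here for the learnable prompt tokens, which is a deliberate simplification):

```python
import torch
import torch.nn.functional as F

class PromptLearner(torch.nn.Module):
    """Toy stand-in for CoOp: a shared learnable context vector is added to
    frozen class text embeddings, then L2-normalized before scoring."""
    def __init__(self, class_embeds, logit_scale=100.0):
        super().__init__()
        self.register_buffer("class_embeds", class_embeds)  # (C, d), frozen
        self.ctx = torch.nn.Parameter(torch.zeros_like(class_embeds[0]))
        self.logit_scale = logit_scale

    def forward(self, image_feats):                          # (N, d) -> (N, C)
        text = F.normalize(self.class_embeds + self.ctx, dim=-1)
        return self.logit_scale * image_feats @ text.t()

def train_prompts(image_feats, labels, class_embeds, steps=50, lr=1e-2):
    """Minimize cross-entropy on the (pseudo-)labeled selection."""
    model = PromptLearner(class_embeds)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(image_feats), labels)
        loss.backward()
        opt.step()
    return model
```

Only the context vector receives gradients; the image features and class embeddings stay fixed, which is what makes few-shot prompt learning cheap relative to fine-tuning the encoders.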

Extensive experiments on seven benchmark classification datasets (including CIFAR‑10/100, ImageNet subsets, Oxford‑Pets, etc.) under budgets ranging from 10 % to 30 % of the full label set show that the proposed method consistently outperforms strong baselines: prior active learning strategies (Core‑Set, Entropy, Margin), the PCB method that only balances class distribution, and state‑of‑the‑art prompt‑learning approaches (CoOp, MaPLe, ProGrad). Visual analyses using GradFAM (a gradient‑weighted feature activation map) and t‑SNE further demonstrate that class‑guided features focus on the intended semantic regions, confirming the effectiveness of the clustering step.

In summary, the work presents a novel data‑centric active learning pipeline that leverages VLM priors for both clustering and selective querying, achieving superior label‑efficiency and prompt performance. The authors suggest that the approach can be extended to other foundation models and downstream tasks such as object detection or segmentation, opening a promising direction for budget‑constrained multimodal learning.

