Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv source.

Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism. CoFT introduces a dual-prompt learning strategy with positive and negative textual prompts to explicitly model pseudo-label cleanliness in a sample-dependent manner, removing the need for hand-crafted thresholds or noise assumptions. The negative prompt also regularizes lightweight visual adaptation modules, improving robustness under noisy supervision. CoFT employs a two-phase training scheme, transitioning from parameter-efficient fine-tuning on high-confidence samples to full fine-tuning guided by collaboratively filtered pseudo-labels. Building on CoFT, CoFT+ further enhances adaptation via iterative fine-tuning, momentum contrastive learning, and LLM-generated prompts. Extensive experiments demonstrate consistent gains over existing unsupervised methods and even few-shot supervised baselines.


💡 Research Summary

The paper introduces Collaborative Fine‑Tuning (CoFT), a novel unsupervised adaptation framework for large‑scale vision‑language models (VLMs) such as CLIP, which eliminates the need for any human‑annotated data. CoFT tackles two major shortcomings of existing self‑training approaches: (1) the under‑utilization of low‑confidence samples and (2) confirmation bias caused by naïve confidence‑threshold filtering. The core idea is to train two CLIP sub‑models in parallel and let them cooperate across modalities.

In Phase I, the authors generate pseudo‑labels for the entire unlabeled set using CLIP’s zero‑shot predictions. For each class they select the top‑K highest‑confidence images, forming a small high‑confidence subset. Only lightweight, trainable parameters are updated: visual prompt tokens (VPT) are prepended to the input of the vision transformer, and a pair of textual prompts (one positive, one negative) is learned for each class. The positive prompt is encouraged to align image embeddings with the correct class text, while the negative prompt is trained to be dissimilar to embeddings of the true class and similar to those of incorrect classes. This dual‑prompt mechanism yields a sample‑dependent “clean‑label probability” without any hand‑crafted thresholds. Two complementary loss terms are optimized jointly: (i) a cross‑entropy loss (L₁) over the high‑confidence subset using the positive prompts, and (ii) a binary loss (L₂) that forces the similarity gap between the positive and negative prompts to be positive for clean samples and negative for noisy ones. The total Phase‑I loss is L₁ + λ·L₂.
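The Phase‑I objective can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the temperature `tau`, the logistic form of the binary loss, and all function and variable names are assumptions made for clarity.

```python
import numpy as np

def dual_prompt_losses(img_emb, pos_prompts, neg_prompts, labels, lam=1.0, tau=0.07):
    """Sketch of the Phase-I objective L1 + lam * L2 (illustrative, not the paper's code).

    img_emb:     (B, D) L2-normalized image embeddings
    pos_prompts: (C, D) positive class-prompt text embeddings
    neg_prompts: (C, D) negative class-prompt text embeddings
    labels:      (B,)   pseudo-labels of the high-confidence subset
    """
    # L1: cross-entropy over positive-prompt similarity logits
    logits = img_emb @ pos_prompts.T / tau              # (B, C)
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    l1 = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

    # L2: binary "cleanliness" objective on the gap between positive and
    # negative similarity for the pseudo-labelled class, sketched here as a
    # logistic loss that pushes the gap positive for clean samples
    pos_sim = (img_emb * pos_prompts[labels]).sum(axis=1)
    neg_sim = (img_emb * neg_prompts[labels]).sum(axis=1)
    gap = (pos_sim - neg_sim) / tau
    l2 = np.log1p(np.exp(-gap)).mean()

    return l1 + lam * l2
```

When the positive prompts align with their class images and the negative prompts point away, both terms approach zero; noisy samples with a negative gap inflate L₂, which is what makes the cleanliness estimate sample-dependent rather than threshold-based.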

Phase II expands pseudo‑label generation to the full unlabeled corpus. Model 1 first predicts a label for every image using its positive prompt. Model 2 then validates each prediction by comparing the image’s similarity to the positive and negative prompts of the predicted class; only samples where the positive similarity exceeds the negative similarity are retained as clean. This bidirectional collaboration yields two clean subsets (one per model) that are used to fully fine‑tune the visual encoder together with a task‑specific classification head. Because the two models are initialized differently and trained separately, they provide diverse perspectives that mitigate the confirmation bias typical of single‑model self‑training.
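This cross‑model validation step can be sketched with plain dot-product similarities over L2‑normalized embeddings. All names below are hypothetical; the paper's actual implementation details may differ.

```python
import numpy as np

def collaborative_filter(img_emb, pos1, pos2, neg2):
    """Sketch of Phase-II cross-model pseudo-label validation (illustrative).

    img_emb: (B, D) L2-normalized image embeddings
    pos1:    (C, D) Model 1's positive prompt embeddings (used to predict)
    pos2:    (C, D) Model 2's positive prompt embeddings (used to validate)
    neg2:    (C, D) Model 2's negative prompt embeddings (used to validate)
    """
    # Model 1 assigns a pseudo-label to every image via its positive prompts
    preds = (img_emb @ pos1.T).argmax(axis=1)

    # Model 2 retains a sample only if, for the predicted class, its own
    # positive prompt is closer to the image than its negative prompt
    pos_sim = (img_emb * pos2[preds]).sum(axis=1)
    neg_sim = (img_emb * neg2[preds]).sum(axis=1)
    keep = pos_sim > neg_sim
    return preds, keep
```

Running the same procedure with the two models' roles swapped yields the second clean subset; each model is then fully fine‑tuned on the subset curated for it, which is where the diversity between the two differently initialized models counteracts confirmation bias.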

CoFT+ builds upon this foundation with three enhancements: (1) iterative PEFT that progressively refines pseudo‑labels, (2) momentum contrastive learning (MoCo) to strengthen visual representations and increase robustness to noisy supervision, and (3) large‑language‑model (LLM) generated prompt templates that enrich the textual prompt space, especially beneficial for fine‑grained or domain‑specific categories.
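The momentum contrastive component relies on a key encoder updated as an exponential moving average (EMA) of the query encoder, as in MoCo. A minimal sketch of that update rule follows; the momentum value and flat parameter layout are illustrative, not taken from the paper.

```python
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    """MoCo-style EMA update of the key encoder (illustrative sketch).

    Each element is one parameter tensor; the key encoder drifts slowly
    toward the query encoder, giving stable targets under noisy labels.
    """
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]
```

Because the key encoder changes slowly, the contrastive targets it produces are consistent across iterations, which is what makes this component robust to the noisy pseudo-supervision present during unsupervised adaptation.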

Extensive experiments on benchmarks such as ImageNet‑R, CIFAR‑100, Stanford Cars, and Oxford‑Pets demonstrate that CoFT consistently outperforms prior unsupervised methods (e.g., UPL, DEFT) by 3–5 percentage points and even surpasses few‑shot supervised fine‑tuning baselines that use 1–5 labeled examples per class. Ablation studies confirm the contribution of each component: the dual‑prompt mechanism, the two‑phase training schedule, the collaborative validation, and the CoFT+ extensions all provide measurable gains. Moreover, the negative prompt acts as a regularizer for the visual adaptation modules, stabilizing training under high noise levels.

In summary, CoFT offers a practical, label‑free solution for adapting powerful VLMs to downstream tasks. By leveraging cross‑modal collaboration and a learnable, sample‑specific cleanliness estimator, it achieves robust performance without manual confidence thresholds or noise‑ratio assumptions. Future work may explore applying the collaborative paradigm to other multimodal architectures (e.g., BLIP, Flamingo) and scaling the approach to even larger, more diverse unlabeled corpora.

