Decoupling Template Bias in CLIP: Harnessing Empty Prompts for Enhanced Few-Shot Learning

Reading time: 2 minutes
...

📝 Original Info

  • Title: Decoupling Template Bias in CLIP: Harnessing Empty Prompts for Enhanced Few-Shot Learning
  • ArXiv ID: 2512.08606
  • Date: 2025-12-09
  • Authors: Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Zhimeng Huang, Yuhua Li

📝 Abstract

The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment, reducing both accuracy and robustness in classification. We present a framework that uses empty prompts, textual inputs that convey the idea of "emptiness" without category information. These prompts capture unbiased template features and offset TSS bias. The framework employs two stages. During pre-training, empty prompts reveal and reduce template-induced bias withi...
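The abstract describes offsetting template-sample similarity (TSS) bias with an "empty prompt" that conveys emptiness without any category information. Below is a minimal sketch of that idea, assuming the offset amounts to subtracting the empty-prompt embedding from each class-prompt embedding; the paper's actual two-stage framework is not reproduced here, and the checkpoint name, prompt strings, and subtraction step are illustrative assumptions:

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # placeholder checkpoint
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

classes = ["dog", "cat", "car"]                        # illustrative label set
class_prompts = [f"a photo of a {c}." for c in classes]
empty_prompt = ["a photo of a ."]                      # assumed "empty" template with no category word

def encode_text(prompts):
    tok = tokenizer(prompts, padding=True, return_tensors="pt")
    feats = model.get_text_features(**tok)
    return feats / feats.norm(dim=-1, keepdim=True)

class_feats = encode_text(class_prompts)               # (num_classes, d)
empty_feats = encode_text(empty_prompt)                # (1, d), template-only signal

# Assumed offset: remove the template component shared by all class prompts,
# so the remaining text feature reflects the category rather than the template.
debiased = class_feats - empty_feats
debiased = debiased / debiased.norm(dim=-1, keepdim=True)

def classify(image_feats):
    """image_feats: (batch, d) L2-normalized CLIP image embeddings."""
    return (image_feats @ debiased.T).argmax(dim=-1)
```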

📄 Full Content

CLIP (Contrastive Language-Image Pre-Training) (Radford et al. 2021) is a multimodal pre-trained neural network designed to align images and text using large-scale paired image-text data. The model consists of two branches: a text encoder and an image encoder, each mapping textual descriptions and visual samples into low-dimensional vector representations. During pre-training, CLIP learns to perform a wide range of tasks, including OCR (Materzyńska, Torralba, and Bau 2022), geolocation (Vivanco Cepeda, Nayak, and Shah 2024), and action recognition (Ke et al. 2018). In the prediction

…(Content truncated for length.)
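Although the prediction discussion is cut off above, CLIP's standard zero-shot prediction step is well documented: the image embedding is compared against the text embeddings of one prompt per candidate class, and the most similar prompt gives the predicted label. A minimal sketch using the Hugging Face transformers API (the checkpoint name, image path, and prompt template are placeholders):

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # placeholder checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                      # placeholder image
labels = ["dog", "cat", "airplane"]
prompts = [f"a photo of a {c}." for c in labels]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image and each prompt;
# the softmax over prompts gives zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```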

Reference

This content is AI-processed based on open access ArXiv data.
