Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Continual learning (CL) enables deep networks to acquire new knowledge while avoiding catastrophic forgetting. The strong generalization ability of pre-trained models (PTMs), such as the Contrastive Language-Image Pre-training (CLIP) model, has inspired a range of CL methods targeting new and specialized tasks, since these models provide rich multi-modal embeddings that support lightweight, incremental prompt tuning. Existing methods often rely on complex designs built upon specific assumptions, such as intricate regularization schemes for prompt pools, specialized routing mechanisms, or multi-stage incremental procedures, which introduce additional, and possibly unnecessary, complexity while underutilizing CLIP's intrinsic capabilities. In this paper, we propose a concise CL approach for CLIP based on incremental prompt tuning that fully exploits its multi-modal structure and the stability of textual representations. Our method, Textual Prototype-guided Prompt Tuning (TPPT), introduces textual prototypes not merely as static classifiers, as in existing methods, but as stable anchors that guide the learning of visual prompts and thereby shape the embedding space (i.e., TPPT-V). We show that our bidirectional supervision strategy enables more effective learning of new knowledge while reducing forgetting. To further close the vision-language gap during CL, we jointly optimize visual and textual prompts (i.e., TPPT-VT). We also introduce a relational diversity regularization on the textual anchors to prevent embedding space collapse and mitigate correlated forgetting. Extensive experiments and analyses demonstrate the effectiveness of our proposed approach, highlighting the benefits of leveraging CLIP's intrinsic guidance for continual adaptation.


💡 Research Summary

The paper introduces Textual Prototype‑guided Prompt Tuning (TPPT), a concise continual learning (CL) framework built on the CLIP vision‑language model. Traditional CL approaches for CLIP often rely on elaborate prompt pools, routing mechanisms, or external regularizers, which increase complexity and computational cost while underexploiting CLIP’s inherent multimodal structure. TPPT takes a different stance: it treats the frozen textual embeddings of class names as stable “prototype anchors” and incrementally adds learnable visual prompts to the vision encoder for each new task.

The core idea is to replace the standard cross‑entropy (CE) loss, which only pushes each sample toward its target class, with a contrastive prototype loss (L_TPCL). This asymmetric CE‑style loss encourages visual features to be close to their correct textual prototype and far from all other prototypes, thereby regularizing the embedding space against representation drift. Because the textual prototypes remain fixed, they act as anchors that preserve previously learned knowledge throughout the learning sequence.
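The summary does not give the exact form of L_TPCL, but the description (cross-entropy over similarities between visual features and fixed textual prototypes, pulling toward the correct prototype and away from all others) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the temperature value and the use of cosine similarity are assumptions, and numpy stands in for the actual deep-learning framework.

```python
import numpy as np

def tppt_contrastive_loss(visual_feats, text_protos, labels, temperature=0.07):
    """Sketch of a prototype-contrastive loss: cross-entropy over cosine
    similarities between visual features and fixed textual prototypes.
    The prototypes act as anchors; only the visual side is being tuned."""
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    t = text_protos / np.linalg.norm(text_protos, axis=1, keepdims=True)
    logits = v @ t.T / temperature                   # (N, C) similarity logits
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Because the target distribution is defined over all prototypes, minimizing this loss simultaneously attracts each feature to its own class anchor and repels it from every other anchor, which is the regularizing effect described above.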

To further bridge the vision‑language gap, the authors extend TPPT to TPPT‑VT, which jointly learns visual and textual prompts. Naïve joint tuning can cause the multimodal embedding space to collapse into trivial solutions, especially in a continual setting where the model may over‑fit to new tasks. To prevent this, TPPT‑VT incorporates a relational diversity regularizer (L_DIV) that maintains the pairwise distance distribution among textual prototypes, aligning it with the distribution of the original CLIP text embeddings. This regularizer preserves semantic diversity, mitigates collapse, and improves alignment between the two modalities.
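One simple way to realize the relational diversity regularizer described above is to penalize deviation between the pairwise cosine-similarity matrix of the tuned textual prototypes and that of the original frozen CLIP embeddings. The squared-error form below is an assumption made for illustration; the paper may use a different distance between the two relational distributions.

```python
import numpy as np

def diversity_regularizer(tuned_protos, frozen_protos):
    """Sketch of L_DIV: keep the pairwise similarity structure among tuned
    textual prototypes close to that of the original CLIP text embeddings,
    preventing the prototypes from collapsing toward each other."""
    def cos_matrix(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T
    diff = cos_matrix(tuned_protos) - cos_matrix(frozen_protos)
    c = len(tuned_protos)
    off_diag = ~np.eye(c, dtype=bool)   # self-similarity is always 1; ignore it
    return np.mean(diff[off_diag] ** 2)
```

The regularizer is zero when the tuned prototypes preserve the original relational geometry exactly, and grows as prototypes drift toward a collapsed (uniformly similar) configuration.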

Implementation details: visual prompts are small learnable token matrices inserted at multiple multi‑head self‑attention layers of CLIP’s vision transformer; textual prompts are analogous tokens added to the text encoder. When a new task arrives, only a fresh set of prompts is introduced and trained, while all previously learned prompts are frozen. The total loss combines CE, the prototype contrastive loss, and the diversity regularizer, weighted by hyper‑parameters tuned on a validation set.
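The per-task prompt bookkeeping described above (a fresh set of learnable prompts per task, with all earlier prompts frozen) can be sketched as follows. The class name, prompt length, and initialization scale are hypothetical; numpy arrays stand in for learnable parameters in the actual framework.

```python
import numpy as np

class IncrementalPromptPool:
    """Hypothetical sketch of per-task prompt management: each new task adds
    one learnable prompt matrix while all earlier prompts are frozen."""

    def __init__(self, prompt_len=4, dim=768, seed=0):
        self.prompt_len, self.dim = prompt_len, dim
        self.rng = np.random.default_rng(seed)
        self.prompts = []      # one (prompt_len, dim) matrix per task
        self.trainable = []    # parallel flags: only the newest is trainable

    def add_task(self):
        self.trainable = [False] * len(self.prompts)   # freeze old prompts
        new = 0.02 * self.rng.standard_normal((self.prompt_len, self.dim))
        self.prompts.append(new)
        self.trainable.append(True)

    def all_prompts(self):
        # All prompts (frozen + current) are prepended to the encoder input.
        return np.concatenate(self.prompts, axis=0)
```

With the losses defined, the total objective described in the summary would then take the form `loss = ce + lam_tpcl * l_tpcl + lam_div * l_div`, where `lam_tpcl` and `lam_div` are the validation-tuned weights mentioned above.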

Experiments are conducted on several class‑incremental benchmarks—including CUB‑200‑2011, Aircraft, and ImageNet‑R—each split into 10–20 incremental tasks. TPPT‑V significantly reduces representation drift compared to CE‑only baselines, leading to lower forgetting rates. TPPT‑VT outperforms state‑of‑the‑art prompt‑based CL methods (e.g., L2P, DualPrompt, CODA‑P) by 2–4 percentage points in average accuracy, with especially large gains (≈5 pp) on fine‑grained datasets like CUB. Diversity metrics show that L_DIV maintains embedding spread close to the original zero‑shot CLIP distribution, preventing collapse observed in naïve multimodal training. Parameter overhead is modest (≈0.2 % per task) and training/inference time remains comparable to existing prompt‑based approaches.

The authors discuss limitations: fixing textual prototypes may introduce bias toward early classes, and scaling to thousands of classes could require more sophisticated prompt management. Nonetheless, TPPT demonstrates that a simple, anchor‑driven contrastive objective combined with a lightweight diversity regularizer can achieve strong continual learning performance without the architectural or regularization baggage of prior methods. The work opens avenues for automated prompt allocation, larger‑scale multimodal extensions, and applications beyond image‑text domains.

