Erasing CLIP Memories: Non-Destructive, Data-Free Zero-Shot Class Unlearning in CLIP Models


We introduce a novel, closed-form approach for selective unlearning in multimodal models, specifically targeting pretrained models such as CLIP. Our method leverages nullspace projection to erase target-class information embedded in the final projection layer, without retraining and without any images from the forget set. By computing an orthonormal basis for the subspace spanned by the target text embeddings and projecting these directions out of the image features, we dramatically reduce the alignment between image features and the undesired classes. Unlike traditional unlearning techniques that rely on iterative fine-tuning and extensive data curation, our approach is both computationally efficient and surgically precise. It produces a pronounced drop in zero-shot performance on the target classes while preserving the model's overall multimodal knowledge. Our experiments demonstrate that even a partial projection can strike a balance between complete unlearning and retention of useful information, addressing key challenges in model decontamination and privacy preservation.


💡 Research Summary

The paper tackles the problem of selective forgetting in large vision‑language models, focusing on CLIP, without requiring any data from the classes to be removed and without any retraining. The authors propose a closed‑form linear transformation called Consistent Class Unlearning Projection (CCUP) that operates directly on the final joint embedding space.

Core idea:
Given normalized text embeddings for the “forget” classes (Tf ∈ ℝ^{d×mf}) and for the “retain” classes (Tr ∈ ℝ^{d×mr}), the goal is to find a matrix W ∈ ℝ^{d×d} that (1) suppresses the components aligned with Tf, (2) leaves the components aligned with Tr essentially unchanged, and (3) stays close to the identity to avoid distorting the overall representation. This is formalized as the following regularized least‑squares problem:

min_W ‖W − I‖_F² + λ‖W Tf‖_F² + μ‖W Tr − Tr‖_F²

where λ>0 controls the strength of forgetting and μ>0 controls the strength of retention. Solving the first‑order optimality condition yields a compact closed‑form solution:

W = (I + μ Tr Trᵀ) · (I + λ Tf Tfᵀ + μ Tr Trᵀ)⁻¹

When μ=0 and λ→∞ the expression collapses to the classic null‑space projector that completely removes the Tf subspace. When λ=0 and μ is large, W≈I, meaning no forgetting occurs. By tuning λ and μ, practitioners can achieve a spectrum ranging from full erasure to partial attenuation, allowing fine‑grained control over the trade‑off between forgetting and knowledge preservation.
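For completeness, the closed form above follows directly from setting the gradient of the objective to zero; a sketch of the derivation, using the same symbols:

```latex
% Gradient of the objective with respect to W:
\nabla_W \mathcal{L} \;=\; 2(W - I) \;+\; 2\lambda\, W T_f T_f^\top \;+\; 2\mu\,(W T_r - T_r)\, T_r^\top \;=\; 0

% Collect the terms that multiply W on the left-hand side:
W\,\bigl(I + \lambda\, T_f T_f^\top + \mu\, T_r T_r^\top\bigr) \;=\; I + \mu\, T_r T_r^\top

% The matrix in parentheses is symmetric positive definite (identity plus
% positive-semidefinite terms), hence invertible:
W \;=\; \bigl(I + \mu\, T_r T_r^\top\bigr)\,\bigl(I + \lambda\, T_f T_f^\top + \mu\, T_r T_r^\top\bigr)^{-1}
```

Positive definiteness of the denominator guarantees the inverse exists for any λ, μ ≥ 0, which is why the solution is well defined across the whole forgetting-retention spectrum.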

Algorithmic pipeline:

  1. Compute Tf and Tr from the CLIP text encoder (no image data needed).
  2. Build the matrices M_f = λ Tf Tfᵀ and M_r = μ Tr Trᵀ.
  3. Form denominator D = I + M_f + M_r and numerator N = I + M_r.
  4. Compute W = N · D⁻¹ (the closed‑form projection).
  5. For each image feature x, compute x′ = W x and renormalize.
  6. Perform standard zero‑shot classification by cosine similarity with the original text embeddings.
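The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names (`ccup_matrix`, `apply_unlearning`) are ours, and real usage would replace the placeholder embeddings with normalized CLIP text and image features.

```python
import numpy as np

def ccup_matrix(T_f, T_r, lam=100.0, mu=10.0):
    """Closed-form unlearning projection W = (I + M_r) @ inv(I + M_f + M_r).

    T_f: (d, m_f) normalized text embeddings of the forget classes.
    T_r: (d, m_r) normalized text embeddings of the retain classes.
    lam: forgetting strength lambda; mu: retention strength.
    """
    d = T_f.shape[0]
    I = np.eye(d)
    M_f = lam * (T_f @ T_f.T)   # suppresses the forget subspace
    M_r = mu * (T_r @ T_r.T)    # anchors the retain subspace
    return (I + M_r) @ np.linalg.inv(I + M_f + M_r)

def apply_unlearning(W, x):
    """Project an image feature through W and renormalize to unit length."""
    x_proj = W @ x
    return x_proj / np.linalg.norm(x_proj)
```

After this transform, zero-shot classification proceeds exactly as in vanilla CLIP: cosine similarity between the renormalized image feature and the original text embeddings.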

The method requires only a few matrix multiplications and an inversion of a d×d matrix (d≈512 for ViT‑B models), making it computationally trivial compared to iterative fine‑tuning.

Experimental evaluation:
The authors evaluate CCUP on two CLIP backbones (ViT‑B/32 and ViT‑B/16) across eight datasets: four fine‑grained (StanfordCars, StanfordDogs, Caltech101, OxfordFlowers), two zero‑shot (AWA2, CUB), and two few‑shot image‑classification benchmarks (Tiny‑ImageNet, Mini‑ImageNet). For each dataset, 40 % of classes are designated as “forget” and the rest as “retain”.

Metrics include:

  • Forget‑class accuracy before (BF) and after (AF) unlearning.
  • Retain‑class accuracy before and after.
  • Membership Inference Attack (MIA) score: (BF_forget − AF_forget) − (BF_retain − AF_retain). Lower MIA indicates better selective forgetting.
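Taking the summary's formula at face value (the paper's exact MIA definition may differ), the score is simple arithmetic over the four accuracy figures:

```python
def mia_score(bf_forget, af_forget, bf_retain, af_retain):
    """MIA score as stated above: the accuracy drop on forget classes
    minus the accuracy drop on retain classes, in percentage points.
    Illustrative helper; not the authors' implementation."""
    return (bf_forget - af_forget) - (bf_retain - af_retain)

# Example with hypothetical numbers: forget accuracy falls 95 -> 1,
# retain accuracy falls 90 -> 89.5.
score = mia_score(95.0, 1.0, 90.0, 89.5)  # (94.0) - (0.5) = 93.5
```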

Results (Table 1) show that CCUP reduces forget‑class accuracy to near 0 % while keeping retain‑class accuracy within 1 % of the original performance. Corresponding MIA scores are essentially zero, outperforming baselines such as ZS‑CLIP (which modifies only the textual side), LIP, Emb, Amns, and EMMN (all of which rely on synthetic data or fine‑tuning). The baselines either leave residual knowledge of the forget classes or cause substantial degradation on retain classes, and they incur orders of magnitude higher computational cost.

A further ablation varies λ and μ to demonstrate partial projection: increasing λ gradually suppresses forget‑class similarity, while a moderate μ preserves retain‑class structure. This confirms that the closed‑form solution can be tuned for application‑specific privacy‑utility trade‑offs.
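The qualitative behavior of that ablation is easy to reproduce on synthetic directions. Below, `t_f` and `t_r` are random unit vectors standing in for CLIP text embeddings (our toy setup, not the paper's experiment); sweeping λ at fixed μ shows the forget direction being progressively suppressed while the retain direction stays essentially fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy embedding dimension (CLIP ViT-B uses d = 512)

# Hypothetical unit "text embeddings" for one forget and one retain class.
t_f = rng.normal(size=d); t_f /= np.linalg.norm(t_f)
t_r = rng.normal(size=d); t_r /= np.linalg.norm(t_r)

def ccup(lam, mu):
    """Closed-form W = (I + mu*Tr Tr^T) @ inv(I + lam*Tf Tf^T + mu*Tr Tr^T)."""
    I = np.eye(d)
    D = I + lam * np.outer(t_f, t_f) + mu * np.outer(t_r, t_r)
    return (I + mu * np.outer(t_r, t_r)) @ np.linalg.inv(D)

lams = [0.0, 1.0, 10.0, 100.0]
# Norm of the projected forget direction: shrinks monotonically with lambda.
forget_residual = [np.linalg.norm(ccup(lam, 10.0) @ t_f) for lam in lams]
# Cosine similarity of the projected retain direction with itself: stays near 1.
retain_sim = [float(t_r @ (ccup(lam, 10.0) @ t_r))
              / np.linalg.norm(ccup(lam, 10.0) @ t_r) for lam in lams]
```

At λ = 0 the transform reduces to the identity (residual 1.0); as λ grows the forget residual decays toward zero while the retain similarity stays close to 1, mirroring the privacy-utility dial described above.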

Strengths:

  • Data‑free: No images or labels from the forget set are needed, addressing privacy and data‑access constraints.
  • Closed‑form: No iterative optimization, enabling instant deployment.
  • Fine‑grained control: λ and μ allow practitioners to balance forgetting strength against knowledge retention.
  • Preserves multimodal alignment: Because only the final projection matrix is altered, the rest of the CLIP architecture remains untouched, ensuring compatibility with downstream tasks.
  • Strong empirical results: Near‑complete erasure of target classes with negligible impact on other classes, and minimal MIA risk.

Limitations and future directions:

  • The approach is linear; highly non‑linear entanglements between visual and textual features may not be fully removable.
  • It assumes that text embeddings for forget and retain classes are sufficiently linearly separable; highly overlapping semantics could lead to residual leakage.
  • Extending the method to other multimodal architectures (e.g., CLIP‑like models with multiple projection heads) or to hierarchical class structures remains open.
  • Investigating adaptive λ/μ per class or incorporating a small amount of privileged data (e.g., class descriptors) could further improve selectivity.

Conclusion:
“Erasing CLIP Memories” introduces a novel, data‑free, closed‑form projection technique for selective class unlearning in CLIP. By directly manipulating the joint embedding space through a mathematically derived matrix, the method achieves rapid, precise, and privacy‑preserving forgetting while maintaining the model’s overall performance. This work opens a practical pathway for model de‑contamination, bias mitigation, and compliance with data‑removal regulations in large vision‑language systems.

