Selective, Controlled and Domain-Agnostic Unlearning in Pretrained CLIP: A Training- and Data-Free Approach
Pretrained models like CLIP have demonstrated impressive zero-shot classification capabilities across diverse visual domains, spanning natural images, artistic renderings, and abstract representations. However, real-world applications often demand the removal (or “unlearning”) of specific object classes without additional data, without retraining, and without degrading the model’s performance on unrelated tasks. In this paper, we propose a novel training- and data-free unlearning framework that enables three distinct forgetting paradigms: (1) global unlearning of selected objects across all domains, (2) domain-specific knowledge removal (e.g., eliminating sketch representations while preserving photo recognition), and (3) complete unlearning in selective domains. By constructing a multimodal nullspace from text prompts and synthesized visual prototypes derived from CLIP’s joint embedding space, our method efficiently removes undesired class information while preserving the remaining knowledge. This approach overcomes the limitations of existing retraining-based methods and offers a flexible and computationally efficient solution for controlled model forgetting.
💡 Research Summary
The paper addresses the problem of removing (“unlearning”) specific object classes from a pretrained CLIP model without requiring any additional data, fine‑tuning, or retraining. Existing machine‑unlearning approaches typically rely on dataset modification, gradient‑based parameter updates, or costly retraining, which are impractical for large multimodal models and often conflict with privacy regulations that restrict data retention. The authors propose a completely data‑free, training‑free framework that can perform three distinct forgetting paradigms: (1) global unlearning of selected classes across all visual domains, (2) domain‑specific unlearning where a class is forgotten only in designated domains (e.g., sketches or cartoons) while remaining intact in others, and (3) complete selective domain unlearning that also removes residual domain‑specific signals.
The method exploits CLIP’s joint image‑text embedding space. For each target class c, a text embedding t_c is generated from a simple prompt such as “a photo of a c”. Then a “canonical image” x_c is synthesized by gradient‑based optimization that maximizes the cosine similarity between the image’s visual embedding and t_c. This yields a visual embedding h_c that captures the salient visual features of the class. In domain‑specific scenarios, the synthesis is conditioned on the target domain d, producing domain‑specific visual embeddings h_{d,c}. For the most thorough variant, an additional residual embedding r_{d,c} is extracted to capture subtle domain nuances.
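The synthesis step above can be sketched as gradient ascent on cosine similarity. The toy example below uses a random linear map as a stand-in for CLIP’s image encoder (the real pipeline backpropagates through a ViT/ResNet and the names `A`, `t_c`, and the dimensions are purely illustrative), but the optimization loop has the same shape as the one the paper describes:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy linear "image encoder": v = A @ x. Real CLIP uses a deep network,
# but the similarity-maximization loop is structurally identical.
d_pix, d_emb = 64, 16
A = rng.standard_normal((d_emb, d_pix)) / np.sqrt(d_pix)
t_c = rng.standard_normal(d_emb)        # stand-in text embedding for class c

x = rng.standard_normal(d_pix)          # randomly initialized "image" x_c
before = cosine(A @ x, t_c)

lr = 0.5
for _ in range(200):
    v = A @ x
    nv, nt = np.linalg.norm(v), np.linalg.norm(t_c)
    f = v @ t_c / (nv * nt)
    # Analytic gradient of cosine similarity wrt v, chained through A.
    grad_v = (t_c / nt - f * v / nv) / nv
    x += lr * (A.T @ grad_v)            # gradient ascent on similarity

h_c = A @ x                             # "canonical" visual embedding
after = cosine(h_c, t_c)
```

After a few hundred steps the synthesized input’s embedding aligns almost perfectly with the text embedding, which is the property the paper relies on to obtain h_c without any training data.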
All embeddings are stacked into an augmented matrix M. Three variants are defined:
- Global: M_global contains both text and visual embeddings for all target classes.
- Selective Domain: M_d contains text embeddings and domain‑specific visual embeddings for the chosen domain.
- Complete Selective Domain: M_complete_d augments M_d with the residual embeddings r_{d,c}, on top of the text and domain‑specific visual embeddings.
The transpose of each matrix is decomposed by singular value decomposition (SVD): M̃ = UΣVᵀ. The left singular vectors U span the subspace that encodes the target class information. A null‑space projection operator is then constructed as P = I – UUᵀ, which projects any vector onto the orthogonal complement of that subspace. The original CLIP projection matrix W (the final linear layer that maps visual features to the shared 512‑dimensional space) is updated by right‑multiplication: W′ = W P. This operation does not alter the underlying network weights; it merely removes the contribution of the identified subspace from the final embeddings.
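The closed-form update is a few lines of linear algebra. The NumPy sketch below uses random stand-ins for the stacked embedding matrix M and the visual projection W (the shapes and all values are illustrative, not the paper’s actual weights), and checks the two properties the construction guarantees: embeddings in the forget subspace are annihilated, while unrelated directions are largely preserved:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512                                  # CLIP joint-embedding dimension

# Illustrative stand-ins: rows of M are the stacked text + visual
# embeddings to forget; W maps backbone features into the joint space.
M = rng.standard_normal((4, d))          # 4 embeddings for the forget set
W = rng.standard_normal((768, d)) / np.sqrt(768)

# SVD of M^T: the left singular vectors U span the "forget" subspace.
U, _, _ = np.linalg.svd(M.T, full_matrices=False)

# Null-space projector and closed-form weight update W' = W P.
P = np.eye(d) - U @ U.T
W_prime = W @ P

# Anything in the forget subspace is projected to (numerically) zero...
forgotten_norm = np.linalg.norm(P @ M[0])
# ...while a random unrelated direction keeps almost all of its norm
# (in 512 dimensions, removing a rank-4 subspace barely touches it).
v = rng.standard_normal(d)
retained = np.linalg.norm(P @ v) / np.linalg.norm(v)
```

Because P is computed once from a single SVD, the whole “unlearning” step is one decomposition plus one matrix product, which is why the paper reports second-scale runtimes.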
During inference, images from domains where unlearning has been applied are passed through the modified projection W′, reducing their cosine similarity with the forgotten class embeddings. Images from untouched domains continue to use the original W, preserving normal performance. Because the procedure is closed‑form, it requires only a single SVD and a matrix multiplication, making it computationally cheap (a few seconds on a modern GPU) and memory‑efficient.
The authors evaluate the approach on two widely used multimodal benchmarks: PACS (four domains: Art, Cartoon, Photo, Sketch) and DomainNet (six domains: Clipart, Infograph, Painting, Quickdraw, Real, Sketch). For each dataset, a subset of classes is designated as the “forget set” and the remainder as the “retain set”. Three experimental configurations are tested: domain‑agnostic unlearning, selective domain unlearning, and complete selective domain unlearning. Performance is measured by class‑wise accuracy before (BF) and after (AF) unlearning, and by a Membership Inference Attack (MIA) score defined as (BF_forget – AF_forget) – (BF_retain – AF_retain). A higher MIA score indicates that the model has successfully forgotten the target classes while preserving knowledge about the retained ones.
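The MIA score above is a simple difference of accuracy drops; a small helper makes the intent explicit (the accuracy values passed in are made up for illustration, not figures from the paper):

```python
def mia_score(bf_forget, af_forget, bf_retain, af_retain):
    """MIA = (accuracy drop on the forget set) - (drop on the retain set).

    Higher is better: a large drop on the forgotten classes combined
    with a small drop on the retained classes indicates successful,
    well-targeted unlearning.
    """
    return (bf_forget - af_forget) - (bf_retain - af_retain)

# Illustrative numbers: forget-set accuracy collapses from 0.92 to 0.05
# while retain-set accuracy only slips from 0.88 to 0.86.
score = mia_score(0.92, 0.05, 0.88, 0.86)   # -> 0.85
```

A score near 0 would mean both sets degraded equally (indiscriminate damage), while a score near the original forget-set accuracy means the forgetting was both effective and precise.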
Results show that global unlearning reduces forget‑set accuracy by roughly 70 % while retain‑set accuracy drops by less than 2 %. In the selective domain setting, the targeted domain experiences a comparable drop in forget‑set accuracy, but other domains remain virtually unchanged. The complete selective domain variant further eliminates residual signals, achieving MIA scores above 0.85, substantially higher than the baselines (ZS‑CLIP, CLIPErase, and a null‑space calibration method) which average around 0.60. Importantly, the method introduces no new parameters and leaves the original CLIP weights untouched, confirming its non‑intrusive nature.
Key contributions are: (1) a truly data‑free, training‑free unlearning technique for multimodal models, (2) fine‑grained control over the domains in which forgetting occurs, and (3) an analytically tractable closed‑form solution based on SVD that guarantees precise removal of the target subspace with minimal collateral damage. The paper also discusses limitations: the quality of the synthesized canonical image depends on optimization hyper‑parameters and may be insufficient for highly complex or 3‑D domains; the simple textual prompt may struggle with ambiguous class names; and the linear null‑space assumption may not capture highly non‑linear relationships, suggesting future work on kernel‑based or non‑linear projection methods.
In summary, the work presents a practical, efficient, and theoretically grounded approach to selective forgetting in CLIP, enabling organizations to comply with privacy or ethical constraints without the heavy computational burden of retraining large multimodal models.