Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Machine unlearning removes specific data points from a trained model while maintaining performance on the retain data, addressing privacy or legal requirements. Despite its importance, existing unlearning evaluations tend to focus on logit-based metrics under small-scale scenarios. We observe that this can create a false sense of security about unlearning approaches in real-world scenarios. In this paper, we conduct a comprehensive evaluation that employs representation-based evaluations of the unlearned model under large-scale scenarios, verifying whether unlearning approaches truly eliminate the targeted data from the model's representations. Our analysis reveals that current state-of-the-art unlearning approaches either completely degrade the representational quality of the unlearned model or merely modify the classifier, thereby achieving superior logit-based performance while maintaining representational similarity to the original model. Furthermore, we introduce a novel unlearning evaluation scenario in which the forgetting classes exhibit semantic similarity to downstream task classes, necessitating that feature representations diverge significantly from those of the original model and thus enabling a more thorough evaluation from a representation perspective. We hope our benchmark will serve as a standardized protocol for evaluating unlearning algorithms under realistic conditions.


💡 Research Summary

The paper critically examines the current evaluation protocols for machine unlearning and demonstrates that they provide an overly optimistic view of how well unlearning methods actually erase targeted data. Most prior work evaluates unlearning on small datasets such as CIFAR‑10/100 or Tiny ImageNet and relies almost exclusively on logit‑level metrics—forget accuracy, retain accuracy, and membership inference attacks. While these metrics capture the model’s output behavior, they ignore changes in the internal feature extractor, which is where most of the learned knowledge resides.
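For concreteness, the forget- and retain-accuracy metrics mentioned above reduce to simple argmax comparisons over the two data splits. A minimal sketch in numpy (the function name and array-based interface are illustrative, not taken from the paper's codebase):

```python
import numpy as np

def logit_metrics(logits_forget, y_forget, logits_retain, y_retain):
    """Logit-level unlearning metrics.

    After successful unlearning, forget accuracy should drop toward
    chance level while retain accuracy should stay close to that of
    the original model.
    """
    forget_acc = float(np.mean(np.argmax(logits_forget, axis=1) == y_forget))
    retain_acc = float(np.mean(np.argmax(logits_retain, axis=1) == y_retain))
    return forget_acc, retain_acc
```

Membership inference attacks, the third metric family named above, are attack-based and are not captured by this simple sketch. The paper's point is that all three operate on model outputs rather than on internal features.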

To address this gap, the authors construct a large‑scale benchmark using ImageNet‑1K (≈1.28 M images) and modern backbones (ResNet‑50, Swin‑Transformer‑T, ConvNeXt). They unlearn 100–300 classes, a scale that mimics realistic “right‑to‑be‑forgotten” requests. After unlearning, three models are compared: the original model (θₒ), a fully retrained model on the retain set (θᵣ, the gold standard), and the unlearned model (θᵤ). The key contribution is a representation‑centric evaluation that measures how close θᵤ is to θᵣ in the feature space.

Two complementary metrics are employed. First, Centered Kernel Alignment (CKA) quantifies layer‑wise similarity of feature maps; CKA ranges from 0 to 1, and an ideal unlearned model should achieve a CKA score close to 1 with respect to θᵣ (i.e., representations indistinguishable from the retrained model). Across all state‑of‑the‑art unlearning algorithms evaluated (Gradient Ascent, L2UL, SCRUB, SalUn, SCAR, etc.), CKA reveals that θᵤ remains far more similar to θₒ than to θᵣ, indicating that the internal representations have barely changed. Second, k‑Nearest Neighbor (k‑NN) probing evaluates the transferability of the extracted features on a suite of downstream datasets (Office‑Home, Caltech‑101, CUB‑200, etc.). The k‑NN results show that unlearned models perform almost identically to the original model θₒ, confirming that the feature extractor is largely untouched and that improvements in logit‑based metrics stem mainly from modifications to the classification head.
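The linear variant of CKA can be computed directly from two feature matrices extracted on the same inputs. A minimal numpy sketch, assuming centered linear CKA (the paper may use a kernel variant or a minibatch estimator):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices.

    X, Y: (n_samples, dim) activations from two models on the same inputs.
    Returns a similarity in [0, 1]; 1 means the representations are
    identical up to orthogonal transformation and isotropic scaling.
    """
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Linear CKA: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)
```

The invariance to orthogonal transforms is what makes CKA suitable here: it compares what the two encoders represent, not how their individual neurons happen to be arranged.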

The authors also critique the prevalent “random class forgetting” protocol. Randomly selected classes often have little semantic overlap with downstream tasks, making it easy for a model to retain useful representations even after unlearning. To create a more stringent test, they propose “Top Class‑wise Forgetting,” where the classes to be forgotten are semantically similar to those used in downstream tasks. In this scenario, successful unlearning must cause a substantial divergence in CKA (θᵤ far from θₒ) and a noticeable drop in downstream k‑NN performance. Experiments show that existing methods still fail to achieve these representation‑level changes, despite maintaining strong logit‑based performance.
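The downstream k‑NN probe used in both forgetting scenarios classifies frozen features by majority vote over their nearest labeled training features, with no gradient updates. A minimal numpy version (cosine similarity and the choice of k are my assumptions; the paper's exact probing protocol may differ):

```python
import numpy as np

def knn_probe_accuracy(train_feats, train_labels, test_feats, test_labels, k=5):
    """Accuracy of a k-NN classifier over frozen encoder features."""
    train_labels = np.asarray(train_labels)
    test_labels = np.asarray(test_labels)

    def l2norm(a):
        # Normalise rows so the dot product below is cosine similarity.
        return a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)

    tr, te = l2norm(train_feats), l2norm(test_feats)
    sims = te @ tr.T                           # (n_test, n_train) similarities
    nn_idx = np.argsort(-sims, axis=1)[:, :k]  # indices of top-k neighbours
    preds = np.array([np.argmax(np.bincount(train_labels[row]))
                      for row in nn_idx])      # majority vote per test point
    return float(np.mean(preds == test_labels))
```

Under Top Class‑wise Forgetting, a genuinely unlearned encoder should show a measurable drop in this probe's accuracy on downstream classes related to the forgotten ones, which is precisely what existing methods fail to produce.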

Beyond the methodological contributions, the benchmark is computationally efficient: CKA and k‑NN probing require no additional model training, allowing rapid assessment across many models and datasets. By covering multiple backbones and a wide range of downstream tasks, the benchmark offers a more generalizable picture of unlearning effectiveness.

In summary, the paper makes three major contributions: (1) it identifies a fundamental flaw in current unlearning evaluations that focus solely on output logits; (2) it introduces a unified, representation‑centric benchmark that reveals most existing methods merely adjust the classifier while leaving the encoder unchanged; and (3) it proposes a novel “Top Class‑wise Forgetting” scenario that better reflects real‑world privacy demands. The work calls for a shift in the field toward multi‑dimensional, large‑scale evaluation protocols that truly verify whether a model has forgotten the intended data.

