"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding

"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding
Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Machine unlearning, the process of efficiently removing specific information from machine learning models, is a growing area of interest for responsible AI. However, few studies have explored the effectiveness of unlearning methods on complex tasks, particularly speech-related ones. This paper introduces UnSLU-BENCH, the first benchmark for machine unlearning in spoken language understanding (SLU), focusing on four datasets spanning four languages. We address the unlearning of data from specific speakers as a way to evaluate the quality of potential “right to be forgotten” requests. We assess eight unlearning techniques and propose a novel metric to simultaneously better capture their efficacy, utility, and efficiency. UnSLU-BENCH sets a foundation for unlearning in SLU and reveals significant differences in the effectiveness and computational feasibility of various techniques.


💡 Research Summary

The paper addresses the emerging need for “right‑to‑be‑forgotten” capabilities in spoken language understanding (SLU) systems, which are increasingly deployed in voice assistants and other speech‑driven applications. While machine unlearning (MU) – the process of removing the influence of specific training data from a model without full retraining – has been studied for text and image domains, there is a paucity of work on complex speech tasks such as intent classification. To fill this gap, the authors introduce UnSLU‑BENCH, the first comprehensive benchmark for evaluating MU methods on SLU.

UnSLU‑BENCH comprises four publicly available datasets covering four languages: English (Fluent Speech Commands – FSC, and SLURP), Italian (ITALIC), and German/French (Speech‑MASSIVE). The authors create speaker‑independent train/retain/forget/test splits, ensuring that speakers in the forget set are disjoint from those in the retain and test sets. The forget set consists of speakers with at least 100 utterances each, representing roughly 2.5–5 % of each corpus – a realistic scenario for a user requesting deletion of their voice data.
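The speaker‑disjoint split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the 100‑utterance threshold and the ~2.5–5 % forget fraction come from the summary, while the data layout, function name, and greedy selection strategy are assumptions made for the example.

```python
from collections import defaultdict

def speaker_splits(utterances, min_forget_utts=100, forget_fraction=0.04):
    """Group utterances by speaker and carve out a speaker-disjoint forget set.

    `utterances` is a list of (speaker_id, utterance) pairs. Speakers with at
    least `min_forget_utts` utterances are candidates for the forget set,
    which grows until it covers roughly `forget_fraction` of the corpus.
    """
    by_speaker = defaultdict(list)
    for spk, utt in utterances:
        by_speaker[spk].append(utt)

    total = len(utterances)
    budget = int(total * forget_fraction)
    forget, retain = [], []
    # Greedily take the most prolific eligible speakers until the budget is met.
    for spk, utts in sorted(by_speaker.items(), key=lambda kv: -len(kv[1])):
        if len(utts) >= min_forget_utts and len(forget) < budget:
            forget.extend(utts)   # this speaker's data is "to be forgotten"
        else:
            retain.extend(utts)   # this speaker stays in the retain set
    return forget, retain
```

Because whole speakers are moved at once, the forget set is disjoint from the retain set by construction, which is exactly what a per-user deletion request requires.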

For each dataset, two transformer‑based speech models are fine‑tuned: wav2vec 2.0 and HuBERT for the English corpora, and XLS‑R‑128 and XLS‑R‑53 for the multilingual corpora (the latter is ASR‑fine‑tuned for the target language). This yields a total of eight model‑dataset combinations.

The benchmark evaluates eight MU techniques:

  1. Fine‑Tuning (FT) – one additional epoch on the retain set.
  2. Negative Gradients (NG) – reverse‑gradient training on the forget set.
  3. NG+ – NG combined with FT to mitigate catastrophic forgetting.
  4. CF‑k – FT applied only to the last k layers.
  5. UNSIR – a two‑phase “damage‑then‑repair” approach using noise on the forget set.
  6. Bad Teaching (BT) – teacher‑student distillation with a competent and an incompetent teacher.
  7. BT‑L – a lightweight variant using a random predictor as the incompetent teacher.
  8. SCRUB – teacher‑student training that maximizes similarity on retain data while minimizing it on forget data.

To compare these methods, the authors propose a novel composite metric called the Global Unlearning Metric (GUM). GUM simultaneously captures three essential aspects:

  • Utility (U) – similarity of macro‑F1 on the test set between the unlearned model and a “gold” model trained from scratch on the retain set only.
  • Efficacy (E) – how well the method reduces membership inference attack (MIA) success, normalized between the original model and the gold model.
  • Efficiency (T) – the computational cost, expressed as a log‑scaled ratio of unlearning time to gold‑model retraining time.

GUM is defined as the weighted harmonic mean of U, E, and T (with equal weights α = β = 1). This formulation forces a method to balance all three criteria; a method that excels in one dimension but fails in another will receive a lower overall score.
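Under the stated definition, GUM could be computed as sketched below. The U and E normalizations are paraphrased from the bullets above; the exact log scaling for T and the placement of the weights are assumptions, since the paper's precise formula is not reproduced in this summary.

```python
import math

def harmonic_mean(values, weights=None):
    """Weighted harmonic mean; a zero in any component drives the mean to zero."""
    weights = weights or [1.0] * len(values)
    if any(v == 0 for v in values):
        return 0.0
    return sum(weights) / sum(w / v for w, v in zip(weights, values))

def gum(f1_unlearned, f1_gold,            # utility: macro-F1 on the test set
        mia_unlearned, mia_orig, mia_gold,  # efficacy: MIA success rates
        t_unlearn, t_retrain):            # efficiency: wall-clock times
    # Utility: closeness of test macro-F1 to the gold retrained model.
    U = 1.0 - abs(f1_unlearned - f1_gold)
    # Efficacy: MIA success normalized between the original and gold models.
    E = (mia_orig - mia_unlearned) / (mia_orig - mia_gold)
    E = max(0.0, min(1.0, E))
    # Efficiency: one plausible log-scaled ratio of unlearning time to
    # retraining time (hypothetical stand-in for the paper's T).
    T = max(0.0, min(1.0, -math.log(t_unlearn / t_retrain) / math.log(t_retrain)))
    return harmonic_mean([U, E, T])
```

The harmonic mean is the key design choice: unlike an arithmetic mean, it collapses toward the weakest component, so a method cannot buy a high GUM with speed alone while failing to actually remove the speaker's influence.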

Key Findings

  • Negative Gradients (NG) consistently achieves the highest GUM across almost all model‑dataset pairs. For wav2vec 2.0 on FSC, NG outperforms the second‑best method by +35 % in GUM; for XLS‑R‑53 on multilingual datasets, gains range from +39 % to +48 %. NG’s superiority stems from its strong efficacy (MIA close to the gold model) and extreme efficiency (speed‑ups up to 1,748× on FSC) while maintaining comparable utility.
  • NG+ sometimes yields slightly higher macro‑F1 on both test and forget sets, but its speed‑up is an order of magnitude lower, resulting in a reduced overall GUM. Moreover, NG+ suffers catastrophic forgetting in certain configurations (e.g., XLS‑R 128 on ITALIC, where macro‑F1 on the forget set collapses to 0.001).
  • Fine‑Tuning (FT) offers a balanced trade‑off for larger models, preserving utility while delivering moderate efficacy and reasonable speed‑up.
  • CF‑k provides the fastest unlearning (by limiting updates to a few layers) but at the cost of lower efficacy.
  • UNSIR, BT, BT‑L, and SCRUB occupy the middle ground; they improve utility relative to NG but are slower and less effective at reducing MIA.

A learning‑rate sensitivity analysis shows NG is robust across a wide LR range, whereas NG+ and BT‑L exhibit sharp performance drops at higher LRs, confirming the importance of hyper‑parameter tuning for each method.

Implications and Future Directions

UnSLU‑BENCH establishes a solid experimental foundation for MU research in speech, highlighting that methods successful on text or image data do not automatically transfer to SLU. The introduction of GUM encourages the community to consider a holistic view of unlearning performance rather than optimizing a single metric. The benchmark also reveals that simple reverse‑gradient approaches (NG) can be both effective and computationally cheap, making them attractive for real‑world deployment where latency and resource constraints matter.

Future work could explore:

  • Unlearning at finer granularity (e.g., individual utterances or specific intents) rather than whole speakers.
  • Development of gold‑model‑free estimators for GUM, enabling deployment‑time assessment without retraining.
  • Integration of differential privacy or certified removal guarantees with MU techniques.
  • Evaluation of MU in streaming or online learning scenarios typical of voice assistants.

In summary, the paper delivers the first multilingual SLU unlearning benchmark, proposes a comprehensive evaluation metric, and provides an extensive empirical comparison that identifies Negative Gradients as the current state‑of‑the‑art for efficient, effective, and utility‑preserving machine unlearning in spoken language understanding.

