Adversarial Training of Keyword Spotting to Minimize TTS Data Overfitting
The keyword spotting (KWS) problem requires large amounts of real speech training data to achieve high accuracy across diverse populations. Utilizing large amounts of text-to-speech (TTS) synthesized data can reduce the cost and time associated with KWS development. However, TTS data may contain artifacts not present in real speech, which the KWS model can exploit (i.e., overfit to), leading to degraded accuracy on real speech. To address this issue, we propose applying an adversarial training method to prevent the KWS model from learning TTS-specific features when trained on large amounts of TTS data. Experimental results demonstrate that KWS model accuracy on real speech data can be improved by up to 12% when adversarial loss is used in addition to the original KWS loss. Surprisingly, we also observed that the adversarial setup improves accuracy by up to 8% even when trained solely on TTS and real negative speech data, without any real positive examples.
💡 Research Summary
Keyword spotting (KWS) systems are essential for always‑on voice assistants, but achieving high detection accuracy across diverse speakers, environments, and noise conditions traditionally requires massive amounts of real‑speech data. Collecting such data is expensive and time‑consuming. Recent advances in text‑to‑speech (TTS) synthesis (e.g., Virtuoso and AudioLM) make it possible to generate large quantities of synthetic speech at a fraction of the cost, prompting researchers to augment KWS training with TTS data. However, synthetic speech often carries artifacts, limited prosodic diversity, and distributional mismatches that a neural network can exploit. When a model learns these TTS‑specific cues, it overfits to the synthetic domain and its performance on genuine speech degrades, especially when the amount of real positive (keyword‑containing) data is small.
The paper proposes an adversarial domain‑adaptation strategy to mitigate this overfitting. The baseline KWS architecture follows the well‑known two‑stage SVDF (factorized convolution) encoder‑decoder design, with seven SVDF layers and three bottleneck projection layers (≈320 k parameters). Input features are 40‑dimensional filter‑bank energies stacked over three frames (120‑dimensional vector every 20 ms). The baseline loss combines a frame‑wise cross‑entropy term and a max‑pool cross‑entropy term, weighted by a hyper‑parameter α.
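To make the α-weighted combination concrete, here is a minimal NumPy sketch of how a frame-wise cross-entropy term and a max-pool cross-entropy term might be blended. The paper's exact max-pool formulation is not reproduced here; the function names, shapes, and the two-class label convention are all illustrative assumptions.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy for one frame via a numerically stable log-softmax over 2 classes."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def kws_loss(frame_logits, frame_labels, alpha=0.5):
    """Blend a frame-wise CE term with a max-pool CE term, weighted by alpha.

    frame_logits: (T, 2) per-frame logits (class 1 = keyword) -- hypothetical shapes.
    frame_labels: (T,) per-frame 0/1 targets.
    alpha: weighting hyper-parameter; its value here is purely illustrative.
    """
    # Frame-wise term: average CE over all frames.
    l_frame = np.mean([cross_entropy(frame_logits[t], frame_labels[t])
                       for t in range(len(frame_labels))])
    # Max-pool term: CE at the frame with the highest keyword logit,
    # scored against the utterance-level label (1 if any frame is positive).
    utt_label = int(frame_labels.max())
    t_star = int(np.argmax(frame_logits[:, 1]))
    l_maxpool = cross_entropy(frame_logits[t_star], utt_label)
    return alpha * l_frame + (1 - alpha) * l_maxpool
```

The max-pool term lets the model be supervised at the single most confident frame, which is a common way to handle the fact that the keyword's exact temporal position within the utterance is unknown.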
To make the hidden representations domain‑invariant, the authors attach a binary synthetic/real (S/R) classifier to the concatenated hidden activations from all encoder and decoder layers. This classifier predicts whether a given example originates from TTS or real speech. A Gradient Reversal Layer (GRL) sits between the classifier and the KWS network, so that during back‑propagation the classifier’s gradient is multiplied by –λ, effectively encouraging the KWS encoder to produce features that confuse the S/R classifier. The total training objective is
L_total = (1 – β)·L_sup + β·L_adv,
where L_adv is the cross‑entropy loss of the S/R classifier after the GRL, β balances the primary KWS objective against the adversarial objective, and λ controls the strength of the reversed gradient.
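The GRL mechanism and the combined objective can be sketched in a few lines of plain Python. The forward pass is the identity; only the backward pass flips and scales the gradient. The λ and β values below are illustrative picks from the ranges the summary mentions, not the paper's final settings.

```python
LAMBDA = 0.3  # reversed-gradient strength (the paper sweeps roughly 0.30-0.50)
BETA = 0.5    # supervised-vs-adversarial balance (illustrative)

def grl_forward(x):
    """Gradient Reversal Layer: identity in the forward pass."""
    return x

def grl_backward(upstream_grad, lam=LAMBDA):
    """Backward pass: scale the incoming gradient by -lambda, so the KWS
    encoder is pushed to *increase* the S/R classifier's loss, i.e. to
    produce features that confuse the domain classifier."""
    return -lam * upstream_grad

def total_loss(l_sup, l_adv, beta=BETA):
    """L_total = (1 - beta) * L_sup + beta * L_adv."""
    return (1 - beta) * l_sup + beta * l_adv
```

In an autodiff framework this is typically implemented as a custom op whose gradient function is `grl_backward`; the S/R classifier itself still minimizes its cross-entropy normally, since the reversal only affects gradients flowing into the shared KWS layers.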
Experiments were conducted on the “Hey/OK Google” wake‑word detection task. Real data comprised 3.8 M positive utterances and 14.1 M negative utterances; synthetic data comprised 7.5 M positive and 5.1 M negative utterances generated by the two TTS systems. The authors varied the sampling probability of real positive data from 0 % (no real positives) to 100 % (full real‑positive set) to simulate different resource regimes. They also swept λ in the range 0.30–0.50 and examined several β settings.
Key findings:
- The S/R classifier achieved up to 98 % accuracy when fed the concatenated hidden activations, confirming that the baseline KWS network encodes strong domain cues.
- Adding the adversarial loss consistently reduced the false‑rejection rate (FRR) on a held‑out real‑speech evaluation set, measured at a fixed false‑accept per hour (FA/h) of 0.133. When the full real‑positive set was used, the best adversarial model (λ ≈ 0.30, β ≈ 0.5) lowered FRR from 1.81 % (baseline) to 1.61 %, an 11 % relative improvement.
- Remarkably, even with 0 % real positives (i.e., training only on synthetic positives, synthetic negatives, and real negatives), the adversarial model achieved a 6 % relative FRR reduction compared with a baseline trained on the same data, demonstrating that the adversarial signal can leverage real‑negative examples to regularize the representation.
- When the proportion of real positives was intermediate (1 %–20 %), gains were modest or sometimes negative, suggesting that the adversarial pressure may interfere with learning the scarce positive signal if it is not sufficiently abundant.
- ROC curves showed that the adversarial model’s advantage persisted across a wide range of FA/h thresholds, indicating robustness to operating‑point selection.
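The FRR-at-fixed-FA/h metric used above can be computed by sweeping a detection threshold until the false-accept rate on negative audio drops to the target, then measuring the rejection rate on positives. The sketch below is a generic illustration under assumed score conventions (higher = more keyword-like); the paper's exact evaluation protocol may differ.

```python
import numpy as np

def frr_at_fa_per_hour(pos_scores, neg_scores, neg_hours, target_fa_per_hour=0.133):
    """Find the lowest threshold whose false-accept rate on negative audio
    does not exceed the target FA/h, then report FRR on positives there.

    pos_scores / neg_scores: per-utterance detection scores (hypothetical inputs).
    neg_hours: total duration of the negative audio, in hours.
    """
    # Candidate thresholds: every distinct score, swept from low to high.
    thresholds = np.sort(np.unique(np.concatenate([pos_scores, neg_scores])))
    for thr in thresholds:
        fa_per_hour = np.sum(neg_scores >= thr) / neg_hours
        if fa_per_hour <= target_fa_per_hour:
            frr = np.mean(pos_scores < thr)  # fraction of true keywords rejected
            return thr, frr
    return None, 1.0  # no threshold meets the FA/h budget
```

Reporting FRR at a fixed FA/h (0.133 in the experiments) makes models comparable at the same operating point, which is why the ROC-curve result above matters: it shows the gains are not an artifact of one particular threshold.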
The authors conclude that adversarial domain adaptation is an effective, low‑cost method to harness large synthetic TTS corpora while preserving—or even improving—real‑world KWS performance. The approach is especially valuable in early‑stage product development where real positive recordings are scarce but abundant synthetic data can be generated quickly. Future work could explore multi‑domain adversarial classifiers, more sophisticated gradient‑scaling schedules, and evaluation on additional languages or wake‑words to assess generality.