FastWhisper: Adaptive Self-knowledge Distillation for Real-time Automatic Speech Recognition


Knowledge distillation is one of the most effective methods for model compression. Previous studies have focused on having the student model effectively learn the teacher model's predictive distribution. However, during training the student may also inherit the teacher's shortcomings, which can lead to a decline in generalization capacity. To mitigate this issue, we propose adaptive self-knowledge distillation (ASKD), which dynamically reduces the student's dependence on the teacher to strengthen its self-training capacity, and then applies self-knowledge distillation to improve the student's generalization. We further distill the Whisper model into a smaller variant, called FastWhisper. In our post-training setting, FastWhisper achieved a word error rate 1.07% lower than the Whisper teacher while running roughly five times faster at inference.


💡 Research Summary

The paper addresses the challenge of compressing large‑scale automatic speech recognition (ASR) models while preserving, or even improving, their generalization performance. Building on the success of Whisper, a massive weakly supervised ASR model trained on 680 000 hours of diverse audio, the authors propose a novel knowledge‑distillation framework called Adaptive Self‑Knowledge Distillation (ASKD). ASKD consists of two sequential stages: an adaptive knowledge‑distillation (AKD) stage followed by a self‑knowledge‑distillation (SKD) stage.

In standard knowledge distillation (KD), a fixed weight α_KD determines how strongly the student model is forced to mimic the teacher’s predictive distribution via a KL‑divergence loss. This static setting has two drawbacks. If α_KD is too high, the student becomes overly dependent on the teacher and inherits its shortcomings, leading to poor generalization. If α_KD is too low, the student receives insufficient guidance and may under‑fit, especially early in training when its representations are weak. AKD resolves this by holding the weight at its initial value α_initial_AKD = 1 for a warm‑up period of E_w epochs and then gradually decreasing it according to the schedule α_e_AKD = α_initial_AKD − (e/E_t)·exp(−e/E_t), where e is the current epoch and E_t the total number of epochs. The AKD loss L_AKD = α_e_AKD·KL(P_S ‖ P_T) therefore starts strong, stabilizing early training, and fades as the student becomes more capable, encouraging self‑reliance.
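As a minimal PyTorch‑style sketch of this schedule and loss, the snippet below decays the teacher weight after a warm‑up period and computes α_e_AKD·KL(P_S ‖ P_T) on softened distributions. The function names, the temperature argument, the per‑epoch granularity, and the optional lower bound on the weight (motivated by the ablation discussed later) are illustrative assumptions, not details taken from the paper.

```python
import math
import torch
import torch.nn.functional as F

def alpha_akd(epoch: int, total_epochs: int, warmup_epochs: int,
              alpha_init: float = 1.0, alpha_min: float = 0.5) -> float:
    """Teacher weight: held at alpha_init during warm-up, then decayed as
    alpha_init - (e/E_t) * exp(-e/E_t), clamped at an assumed lower bound."""
    if epoch < warmup_epochs:
        return alpha_init
    r = epoch / total_epochs
    return max(alpha_min, alpha_init - r * math.exp(-r))

def akd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
             alpha: float, temperature: float = 1.0) -> torch.Tensor:
    """L_AKD = alpha * KL(P_S || P_T), written out explicitly to match the
    direction used in the summary."""
    p_s = F.softmax(student_logits / temperature, dim=-1)
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    log_p_t = F.log_softmax(teacher_logits / temperature, dim=-1)
    return alpha * (p_s * (log_p_s - log_p_t)).sum(dim=-1).mean()
```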

After the AKD phase, when α_e_AKD falls below a threshold λ, the framework switches to SKD. Traditional SKD uses a static mixture of the hard labels y and the teacher’s soft predictions P_T, which can still be sub‑optimal because the student’s own predictions are not leveraged. In ASKD, the student’s output from the previous epoch, P_S^(e−1), is treated as the soft label, and its weight is increased linearly with the epoch: α_e_SKD = α_initial_SKD·e/E_t. The SKD loss L_SKD = CE((1 − α_e_SKD)·y + α_e_SKD·P_S^(e−1), P_S) thus gradually shifts supervision from the hard ground truth to richer soft information, reducing over‑fitting and encouraging the student to capture inter‑class relationships.
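The sketch below, under the same illustrative naming, shows the linearly growing soft‑label weight and the cross‑entropy against a mixture of hard labels and the student's previous‑epoch predictions; the stage‑switch helper reuses alpha_akd from the sketch above and treats λ as a plain hyperparameter. All names and signatures are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def alpha_skd(epoch: int, total_epochs: int, alpha_init: float) -> float:
    """Soft-label weight growing linearly with the epoch: alpha_init * e / E_t."""
    return alpha_init * epoch / total_epochs

def skd_loss(student_logits: torch.Tensor,
             hard_labels: torch.Tensor,        # one-hot targets y, shape (batch, classes)
             prev_student_probs: torch.Tensor, # student softmax saved from the previous epoch
             alpha: float) -> torch.Tensor:
    """L_SKD = CE((1 - alpha) * y + alpha * P_S^(e-1), P_S)."""
    mixed_target = (1.0 - alpha) * hard_labels + alpha * prev_student_probs
    log_p_s = F.log_softmax(student_logits, dim=-1)
    return -(mixed_target * log_p_s).sum(dim=-1).mean()

def use_skd_stage(epoch: int, total_epochs: int, warmup_epochs: int, lam: float) -> bool:
    """Switch from AKD to SKD once the decayed teacher weight drops below lambda."""
    return alpha_akd(epoch, total_epochs, warmup_epochs) < lam
```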

Using this framework, the authors construct two FastWhisper variants. FastWhisper‑small (152 M parameters) couples a Whisper‑small encoder with a lightweight three‑layer Transformer decoder, targeting real‑time inference. FastWhisper‑large (740 M parameters) pairs a Whisper‑large‑v3 encoder with the same decoder, offering higher accuracy at a larger size. Both models are trained on a curated 1 620‑hour corpus comprising LibriSpeech, TED‑LIUM, LJSpeech, Earnings‑22, and the AMI Meeting Corpus. For out‑of‑domain evaluation, GigaSpeech and VoxPopuli, two datasets unseen during training, are used to assess generalization.
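To make the construction concrete, here is a rough sketch of how a FastWhisper‑small‑style model could be assembled: a pretrained Whisper‑small encoder feeding a lightweight three‑layer Transformer decoder. The Hugging Face WhisperModel loading call is a real API, but the decoder hyperparameters, vocabulary size, and wiring shown here are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel

class FastWhisperSmallSketch(nn.Module):
    """Illustrative pairing of a Whisper-small encoder with a 3-layer decoder."""
    def __init__(self, vocab_size: int = 51865, d_model: int = 768):
        super().__init__()
        # Reuse the pretrained speech encoder (whisper-small uses d_model = 768).
        self.encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder
        self.token_embed = nn.Embedding(vocab_size, d_model)  # positional embeddings omitted for brevity
        decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=12, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_features: torch.Tensor, decoder_input_ids: torch.Tensor) -> torch.Tensor:
        enc = self.encoder(input_features).last_hidden_state        # (B, T_audio, d_model)
        tgt = self.token_embed(decoder_input_ids)                   # (B, T_text, d_model)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        dec = self.decoder(tgt, enc, tgt_mask=causal_mask)
        return self.lm_head(dec)                                    # logits over the vocabulary
```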

Experimental results demonstrate that ASKD consistently outperforms baseline KD with pseudo‑labeling (PL) and plain SKD. On the LibriSpeech test‑clean split, FastWhisper‑small achieves a word error rate (WER) of 2.95 % versus 3.05 % for the original Whisper‑small, a 0.10 % absolute improvement. On the Earnings‑22 test set, ASKD reduces WER from 13.0 % (teacher) to 11.8 %, a 1.2 % absolute gain. FastWhisper‑large further narrows the gap to the Whisper‑large‑v3 teacher, delivering an average WER 1.23 % lower while using far fewer parameters (740 M vs. roughly 1.5 B for the teacher). Importantly, inference latency is cut by a factor of five: Whisper‑large‑v3 needs 659 ms to process one second of audio, whereas FastWhisper‑large needs only 132 ms. On the unseen GigaSpeech and VoxPopuli sets, the WER gap to the teacher remains under 0.2 %, confirming strong cross‑domain robustness.

Ablation studies on the minimum α_AKD value reveal that setting the lower bound to 0.5 yields the best trade‑off between teacher guidance and student autonomy; higher minima lead to over‑reliance, while lower minima diminish the beneficial regularization from the teacher.

In summary, ASKD introduces a principled, dynamic weighting scheme that first leverages the teacher’s rich knowledge to bootstrap the student, then gradually hands over supervision to the student’s own softened predictions. This two‑stage approach mitigates the classic “teacher‑bias” problem of static KD and enhances the student’s ability to generalize, even when trained on a relatively modest dataset. FastWhisper, built with ASKD, achieves near‑teacher accuracy with dramatically reduced model size and latency, making it well‑suited for real‑time, on‑device speech recognition. Future work will extend the method to multilingual corpora and explore further hardware‑aware optimizations.

