Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Grapheme-to-phoneme (G2P) conversion is an important task in automatic speech recognition and text-to-speech systems. Recently, G2P conversion has been viewed as a sequence-to-sequence task and modeled with RNN- or CNN-based encoder-decoder frameworks. However, previous works do not consider the practical issues of deploying a G2P model in a production system, such as how to leverage additional unlabeled data to boost accuracy and how to reduce model size for online deployment. In this work, we propose token-level ensemble distillation for G2P conversion, which can (1) boost accuracy by distilling knowledge from additional unlabeled data, and (2) reduce model size while maintaining high accuracy, both of which are very practical and helpful in an online production system. We use token-level knowledge distillation, which yields better accuracy than its sequence-level counterpart. Moreover, we adopt the Transformer instead of RNN- or CNN-based models to further boost the accuracy of G2P conversion. Experiments on the publicly available CMUDict dataset and an internal English dataset demonstrate the effectiveness of the proposed method. In particular, our method achieves 19.88% WER on the CMUDict dataset, outperforming previous works by more than 4.22% WER and setting new state-of-the-art results.


💡 Research Summary

This paper presents a novel and practical framework for Grapheme-to-Phoneme (G2P) conversion, a critical component in automatic speech recognition and text-to-speech systems. The authors address two key limitations of prior neural sequence-to-sequence models: the inability to leverage abundant unlabeled data and the high computational cost of deploying large or ensemble models in production.

The core innovation is “Token-Level Ensemble Distillation,” a unified method that tackles both issues simultaneously. The framework operates in two main phases. First, for accuracy boosting, a powerful teacher ensemble is constructed using diverse model architectures—including Transformer, Bi-LSTM, and CNN-based encoder-decoders. This ensemble generates phoneme sequences and their token-level probability distributions for a large corpus of unlabeled words crawled from the web. These generated sequences serve as pseudo-labels, effectively expanding the training data with high-quality synthetic examples.
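As a rough illustration of this pseudo-labeling step, the NumPy sketch below (a mock-up for this summary, not the authors' code) averages the per-token softmax outputs of several teacher models to form soft targets, from which a hard pseudo-label sequence can also be read off:

```python
import numpy as np

def ensemble_token_distributions(model_probs):
    """Average per-token output distributions from several teacher models.

    model_probs: list of arrays, each of shape (seq_len, vocab_size),
    holding one model's softmax outputs for the same decoded sequence.
    Returns the averaged distribution, shape (seq_len, vocab_size).
    """
    stacked = np.stack(model_probs, axis=0)  # (n_models, seq_len, vocab)
    return stacked.mean(axis=0)

# Toy example: two teachers, a 2-step sequence over a 3-phoneme vocabulary.
teacher_a = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])
teacher_b = np.array([[0.5, 0.4, 0.1],
                      [0.2, 0.6, 0.2]])
soft_targets = ensemble_token_distributions([teacher_a, teacher_b])
pseudo_label = soft_targets.argmax(axis=-1)  # hard pseudo-label per step
```

In the paper's setting the soft targets themselves (not just the argmax) carry the ensemble's knowledge, which is what makes the later token-level distillation possible.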

Second, for model compression suitable for online deployment, the collective knowledge of the cumbersome teacher ensemble is distilled into a single, lightweight student model using knowledge distillation. Crucially, the distillation is performed at the token-level, meaning the student model is trained to mimic the teacher’s output probability distribution at every step of the decoding process, rather than just the final sequence. This provides a richer learning signal compared to sequence-level distillation. The student model employs the Transformer architecture, which is introduced for the first time in G2P tasks by this work, offering advantages in parallel computation and long-range dependency modeling.
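The token-level objective can be sketched as a cross-entropy between the teacher's per-step distribution and the student's prediction, averaged over decoding steps. The snippet below is an illustrative NumPy sketch, not the authors' implementation (which in practice is combined with the ordinary ground-truth label loss):

```python
import numpy as np

def token_level_kd_loss(teacher_probs, student_logits):
    """Token-level distillation loss.

    teacher_probs:  (seq_len, vocab_size) soft targets from the teacher.
    student_logits: (seq_len, vocab_size) raw scores from the student.
    Returns the cross-entropy between teacher and student distributions,
    summed over the vocabulary and averaged over decoding steps.
    """
    # Numerically stable log-softmax over the vocabulary dimension.
    shifted = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(teacher_probs * log_probs).sum(axis=-1).mean())
```

A student whose per-step distribution matches the teacher's minimizes this loss; a sequence-level objective, by contrast, would only supervise the student with one (hard) output sequence, discarding the per-step distributional signal.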

Extensive experiments were conducted on the public CMUDict 0.7b dataset and an internal English dataset, and the results comprehensively demonstrate the effectiveness of the proposed approach. The method achieves a new state-of-the-art Word Error Rate (WER) of 19.88% on CMUDict, outperforming the previous best model by a significant margin of 4.22% WER. Ablation studies confirm the individual contributions: using unlabeled data provides nearly a 1% WER improvement, and token-level distillation outperforms sequence-level distillation. Furthermore, the distillation process successfully compresses the model: a student Transformer with only 1 encoder and 1 decoder layer (1.85M parameters) not only reduces the model size by nearly 6x compared to a 6-layer baseline (11.09M parameters) but also achieves a lower WER (20.25% vs. 21.07%), while running approximately 4 times faster at inference.

In summary, this work makes significant contributions by being the first to apply Transformer models and leverage unlabeled data via token-level ensemble distillation for G2P conversion. It provides a holistic solution that dramatically improves accuracy and enables efficient model deployment, setting a new benchmark for practical G2P systems.

