Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers


Authors: Michael Hentschel, Yuta Nishikawa, Tatsuya Komatsu, Yusuke Fujita

© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Michael Hentschel (1, *), Yuta Nishikawa (2, *), Tatsuya Komatsu (3), Yusuke Fujita (3)
(1) LINE WORKS Corporation, Japan; (2) Nara Institute of Science and Technology, Japan; (3) LINE Corporation, Japan
(*) equal contribution

ABSTRACT

This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. To distil the teacher's knowledge, we use an attention decoder that learns from BERT's token probabilities. Our method shows that language model (LM) information can be more effectively distilled into an ASR model using both the intermediate layers and the final layer. By using the intermediate layers as distillation targets, we can distil LM knowledge into the lower network layers more effectively. Using our method, we achieve better recognition accuracy than with shallow fusion of an external LM, allowing us to maintain fast parallel decoding. Experiments on the LibriSpeech dataset demonstrate the effectiveness of our approach in enhancing greedy decoding with connectionist temporal classification (CTC).

Index Terms — knowledge distillation, BERT, CTC, speech recognition

1. INTRODUCTION

End-to-end automatic speech recognition (E2E-ASR) systems are reported to learn an internal language model (LM) [1, 2]. Their training requires a large amount of speech and its transcription text as training data.
If the amount of training data is insufficient, the ASR output may incorrectly contain linguistically meaningless token sequences for unknown words or unseen input sequences. To cope with these errors, it is necessary to combine an external LM, learned from large-scale data, with the E2E-ASR system via shallow fusion [3, 4] during inference. Especially in the case of a non-autoregressive ASR model, high inference speed and high throughput are sacrificed by the autoregressive decoding that this shallow fusion requires. Although autoregressive E2E-ASR models such as recurrent neural network transducers (RNN-T) [5, 6] are popular, in batch-processing applications that must serve many requests in parallel, a non-autoregressive ASR model is preferred for its higher throughput. Hence, in this study we aim to strengthen the internal LM of a connectionist temporal classification (CTC) [7] type non-autoregressive ASR model (CTC-ASR). Our proposed method realises a CTC-ASR model with high recognition accuracy during non-autoregressive decoding, without requiring shallow fusion with an external LM.

Training the internal LM of an E2E-ASR with only text data is not a simple task. Several methods for knowledge distillation (KD) [8] from an external LM trained on a large-scale text corpus have been proposed in previous studies. BERT [9] is often adopted as the external LM used as a teacher model for KD. [10] proposed to use the posterior distribution output by BERT as a soft label to improve an autoregressive ASR model. Further studies designed an auxiliary regression task to distil the hidden state representations of BERT in combination with non-autoregressive ASR [11] and RNN-T [12]. Higuchi et al. [13] suggested a method for transferring BERT language information using an iterative refinement scheme.
In general, KD trains a student model to produce the same output as a teacher model for a given input. Recently, more efficient ways for KD to transfer knowledge from a teacher model to a student model have been proposed. Several studies in natural language processing showed that intermediate loss functions are effective when distilling BERT's knowledge into student models [14, 15, 16]. In the case of CTC-ASR, [17] used an intermediate-layer loss function when distilling a teacher E2E-ASR model into a student E2E-ASR model.

This work proposes an efficient KD method using an intermediate-layer loss function for non-autoregressive CTC-ASR models. In the proposed method, KD is applied via an auxiliary attention decoder (AED) to an intermediate layer (interAED-KD) [18] in addition to KD on the final layer [19, 10]. As with KD on the final layer, the auxiliary AED on the intermediate layer uses BERT's posterior distribution as a teacher soft label. In contrast to using the auxiliary loss function only on the final encoder layer, an auxiliary AED loss on an intermediate layer can affect deeper layers of the encoder, enabling more efficient KD. Unlike conventional KD [10], which performs inference with AEDs, in this study inference is performed by CTC decoding; an AED is used only as an auxiliary loss function during training. We show that even with CTC decoding, KD using the AED auxiliary loss on the final layer and the intermediate layer is effective.

To evaluate our proposed method, we use the LibriSpeech dataset [20], which provides both large-scale acoustic and text training data. Our experiments confirmed that CTC decoding accuracy improved in student models trained with interAED-KD. Our best system showed a similar or better word error rate (WER) than a conventional CTC-ASR system with LM beam search decoding, while having a six times lower real-time factor (RTF).
Compared with conventional KD, the proposed method showed a WER reduction (WERR) of up to 7%.

(To appear in Proc. ICASSP 2024, April 14-19, 2024, Seoul, Korea. © IEEE 2024)

2. CTC ACOUSTIC MODEL

2.1. Connectionist Temporal Classification

In non-autoregressive ASR systems, the acoustic encoder Enc_AM maps the input acoustic features X ∈ R^(S×D_enc), of length S and feature size D_enc, to the label sequence y ∈ U^L. The label sequence of length L contains tokens from the vocabulary U and in most cases has a different length than the acoustic features. To train such a system, the CTC loss function is used. The CTC loss maximises the log-likelihood over all possible alignment sequences a ∈ U'^S between X and y, where U' is the vocabulary extended by a special blank symbol: U' = U ∪ {ε}.

    log P_CTC = log Σ_{a ∈ ψ^(-1)(y)} P(a | X)    (1)

    L_CTC = -log P_CTC    (2)

ψ is the operation that removes blank labels and repeated tokens from a to map it to the label sequence. The probability of an alignment sequence is calculated under the assumption of conditional independence between all elements a_s of the alignment sequence a:

    P(a | X) = Π_s P(a_s | X)    (3)

For some experiments we used intermediate CTC (interCTC) [21], in which case we define (2) as a weighted sum of the CTC loss and the intermediate CTC loss (similar to the training objective proposed in [21]).

2.2. Conformer Acoustic Model

To calculate (3), we use an acoustic encoder Enc_AM based on the Conformer architecture [22]. Enc_AM consists of N encoder layers. Each encoder layer EncLayer_n of the Conformer consists of a feed-forward network (FFWD), a multi-head attention block, a convolutional layer, and another feed-forward network. This macaron structure is followed by a layer normalisation block. EncLayer_n transforms the input features X_(n-1) ∈ R^(S×D_enc) from the previous layer n-1 into the output features X_n ∈ R^(S×D_enc) by the following set of equations.
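As a toy illustration of the CTC objective in (1)-(3), the loss can be computed by brute force over a tiny vocabulary. All frame probabilities below are made up; real training uses the forward-backward algorithm rather than enumerating alignments.

```python
from itertools import product
import math

BLANK = "_"  # the blank symbol epsilon extending U to U'

def collapse(alignment):
    """The mapping psi: merge repeated tokens, then remove blanks."""
    merged = []
    for tok in alignment:
        if not merged or tok != merged[-1]:
            merged.append(tok)
    return tuple(t for t in merged if t != BLANK)

def ctc_neg_log_likelihood(frame_probs, label):
    """L_CTC = -log sum over all alignments a with psi(a) == label, Eqs. (1)-(2)."""
    vocab = list(frame_probs[0])
    total = 0.0
    for a in product(vocab, repeat=len(frame_probs)):
        if collapse(a) == tuple(label):
            p = 1.0
            for dist, tok in zip(frame_probs, a):
                p *= dist[tok]  # conditional independence, Eq. (3)
            total += p
    return -math.log(total)

# Three frames with uniform, made-up posteriors over {a, b, blank}.
frames = [{"a": 1 / 3, "b": 1 / 3, BLANK: 1 / 3}] * 3
loss = ctc_neg_log_likelihood(frames, ("a", "b"))
```

With these uniform posteriors, exactly five of the 27 length-3 alignments collapse to ("a", "b"), so the likelihood is 5/27.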
    X_n^FFWD = (1/2) FFWD(X_(n-1)) + X_(n-1)    (4)

    X_n^MHA = SelfAttention(X_n^FFWD) + X_n^FFWD    (5)

    X_n^CONV = Convolution(X_n^MHA) + X_n^MHA    (6)

    X_n = LayerNorm((1/2) FFWD(X_n^CONV) + X_n^CONV)    (7)

The probability of each element a_s of the alignment sequence is calculated from the output of the final EncLayer_N, followed by a linear transformation and the softmax function.

    A = Softmax(FFWD(EncLayer_N(X_(N-1))))    (8)

    P(a_s | X) = A[s]    (9)

3. EXTERNAL LANGUAGE MODEL

In this study, we follow the approach taken in previous research and use BERT as an external LM for KD. BERT is a masked language model that predicts the probability of a masked token y_l from all surrounding tokens y_\l = [y_(1:l-1), [MASK], y_(l+1:L)]. In doing so, the model relies on both left and right context for its predictions. BERT uses the Transformer encoder architecture [23]. A Transformer encoder Enc_LM consists of several layers, where each layer comprises a multi-head attention block followed by a feed-forward layer. The probability of a masked token ŷ_l is calculated from the encoder output followed by a linear classifier and the softmax function.

    P(ŷ_l | y_\l) = Softmax(FFWD(Enc_LM(y_\l)))    (10)

Fig. 1. Proposed model architecture with intermediate CTC loss and intermediate attention decoder for knowledge distillation. BERT's parameters are frozen while training the ASR model. The CTC decoders and the attention decoders share their parameters. [Figure: block diagram of the N-layer Conformer encoder with intermediate and final CTC heads, intermediate and final attention decoders trained with a KL divergence against the BERT LM's token probabilities.]

We obtain soft labels for KD from BERT by calculating BERT's token probabilities.
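As a toy sketch of Eq. (10), the snippet below turns encoder logits at a masked position into a soft-label distribution. The "encoder" here is a stub returning made-up logits; a real system would run BERT on [y_(1:l-1), [MASK], y_(l+1:L)].

```python
import math

VOCAB = ["the", "cat", "sat", "mat"]  # toy vocabulary, not the real tokeniser

def toy_lm_logits(masked_sequence):
    # Stand-in for FFWD(Enc_LM(y_\l)); made-up numbers, not real BERT output.
    return [2.0, 0.5, 0.1, 1.0]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

masked = ["the", "[MASK]", "sat"]
soft_label = dict(zip(VOCAB, softmax(toy_lm_logits(masked))))
```

The resulting dictionary is a full distribution over the toy vocabulary, which is exactly the kind of soft target the distillation losses below consume.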
To obtain the probabilities for the entire sequence ŷ_BERT, we mask each token or word (i.e. multiple tokens) individually and concatenate the probabilities for all tokens ŷ_l, l ∈ {1, ..., L}, in the label sequence.

4. DISTILLATION INTO INTERMEDIATE LAYERS

In this section, we explain our proposed method for KD into intermediate encoder layers, shown in Figure 1. In an autoregressive ASR system, an attention decoder Dec_N is used to predict the current token ŷ_l. This decoder may be a multi-layer Transformer decoder model. For its prediction, Dec_N uses the output of EncLayer_N and the sequence of previously generated tokens ŷ_(1:l-1) = [ŷ_1, ..., ŷ_(l-1)].

    P(ŷ_l | ŷ_(1:l-1), X_N) = Softmax(FFWD(Dec_N(ŷ_(1:l-1), X_N)))    (11)

Here, FFWD is a classification layer that projects the decoder output to the vocabulary. Dec_N is conventionally trained with a cross-entropy loss function, but to distil BERT's knowledge into the attention decoder, the Kullback-Leibler (KL) divergence is used instead. The KL divergence loss is calculated between the two models' token probability distributions.

Table 1. WERs [%] and RTFs on LibriSpeech 960 with greedy and LM beam search (BS) decoding strategies.

                                        Greedy (RTF=0.003)        BS (size=1, RTF=0.018)    BS (size=10, RTF=0.020)
                                        clean       other         clean       other         clean       other
Base model  KD method                   dev   test  dev   test    dev   test  dev   test    dev   test  dev   test
CTC         No KD                       3.46  3.63  8.34  8.45    3.09  3.27  7.56  7.67    2.53  2.79  6.50  6.56
            Conventional AED-KD         2.95  3.17  7.13  7.26    2.62  2.82  6.45  6.58    2.29  2.42  5.75  5.89
+interAED   No KD                       3.35  3.57  8.49  8.39    2.99  3.24  7.55  7.69    2.52  2.71  6.40  6.57
            Proposed interAED-KD        2.83  2.93  6.91  6.91    2.60  2.69  6.27  6.41    2.32  2.39  5.58  5.65
+interCTC   No KD                       3.05  3.19  7.51  7.51    2.69  2.87  6.84  6.81    2.39  2.52  5.92  5.97
            Proposed interAED-KD        2.60  2.74  5.95  6.30    2.35  2.45  5.48  5.86    2.16  2.30  4.93  5.36

Table 2. Parameters for both ASR and LM encoders, and the auxiliary decoder.
Model                 Conformer   BERT    Decoder
Layers                18          12      6
Attention heads       8           8       8
MHA dim               512         512     512
Feed-forward dim      2048        2048    2048
|U|                   5001        5005    5005
Total parameters      124 M       41 M    30 M

Specifically, the loss compares BERT's token predictions ŷ_(BERT,l) with the decoder's token predictions ŷ_(Dec_N,l):

    L_N^KL = Σ_l P(ŷ_(Dec_N,l)) log( P(ŷ_(Dec_N,l)) / P(ŷ_(BERT,l)) )    (12)

As in [10], we use top-K distillation for BERT's labels. In top-K distillation, we use only the probabilities of the top-K predicted tokens, instead of all tokens, as the distillation target. In the experiments we set K = 10. For distillation into intermediate layers, we add further attention decoders Dec_n that use the outputs of intermediate encoder layers EncLayer_n [18]. These intermediate decoders Dec_n share their parameters with the final decoder Dec_N and are also trained to minimise the KL divergence with BERT's probability distribution.

    L_n^KL = Σ_l P(ŷ_(Dec_n,l)) log( P(ŷ_(Dec_n,l)) / P(ŷ_(BERT,l)) )    (13)

Our distillation loss is a weighted sum of (12) and all M intermediate KL divergence losses.

    L_distill = (1 - β) L_N^KL + (β / M) Σ_(m=1)^(M) L^KL_(⌊mN/(M+1)⌋)    (14)

This auxiliary loss from the AED is only used during training; the decoder is not required for CTC decoding. That is, the final number of model parameters does not increase compared with a CTC-only model. The overall training loss is a weighted sum of (2) and (14).

    L = (1 - α) L_CTC + α L_distill    (15)

5. EXPERIMENTS

5.1. Dataset and Model Configuration

We conducted our experiments on the LibriSpeech [20] dataset. The dataset includes a normalised LM training dataset of 800M words, which we used to train a BERT model and a 6-gram token LM. We followed the ESPnet [24] recipe for the ASR training data preparation. To train the Conformer, we applied speed perturbation and SpecAugment [25] to the 960 h of acoustic training data. Table 2 provides a summary of all models' parameters.
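For concreteness, the distillation objective of Section 4, Eqs. (12)-(15), can be sketched for a single token position. All teacher/student probabilities and the CTC loss value below are made up; top-K truncation mirrors the top-10 distillation described in the text (K = 3 here for readability), and α = 0.7, β = 0.5 follow the paper's settings.

```python
import math

def kl_divergence(p, q):
    """Sum_i p_i * log(p_i / q_i), the form used in Eqs. (12)/(13)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def top_k(probs, k):
    """Indices and renormalised values of the k largest probabilities."""
    idx = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    z = sum(probs[i] for i in idx)
    return idx, [probs[i] / z for i in idx]

teacher = [0.50, 0.30, 0.15, 0.04, 0.01]  # made-up BERT posteriors
student = [0.40, 0.35, 0.15, 0.05, 0.05]  # made-up decoder posteriors

idx, t_top = top_k(teacher, k=3)
s_raw = [student[i] for i in idx]
z = sum(s_raw)
s_top = [p / z for p in s_raw]  # compare the two on the same support

l_kl_final = kl_divergence(s_top, t_top)  # Eq. (12), final-layer decoder
l_kl_inter = kl_divergence(s_top, t_top)  # Eq. (13); same toy values here
alpha, beta, l_ctc = 0.7, 0.5, 2.0        # l_ctc is a made-up CTC loss value

l_distill = (1 - beta) * l_kl_final + beta * l_kl_inter  # Eq. (14), M = 1
loss = (1 - alpha) * l_ctc + alpha * l_distill           # Eq. (15)
```

The renormalisation step is one plausible way to restrict both distributions to the teacher's top-K support; the paper does not spell out this detail.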
All models share the same base tokeniser with 5000 SentencePiece [26] tokens. The CTC decoder has an extra blank symbol, and BERT as well as the attention decoder have special tokens ([CLS], [SEP], etc.) added to their vocabulary. Our attention decoder was a Transformer decoder with six layers and otherwise the same specifications as the encoders. We applied the intermediate AED and intermediate CTC at the 9th encoder layer. For training and implementation of our LM and ASR models, we used Nvidia's NeMo toolkit [27]. The ASR models were trained on 8 Nvidia V100 GPUs for 100 epochs with adaptive batch sizes ranging from 32 to 128, according to the audio length. The learning rate was 1.4 and we used 10K warm-up steps. We set α in (15) to 0.7 and β in (14) to 0.5.

5.2. Greedy Decoding

Table 1 summarises the recognition results for LibriSpeech's clean and other subsets. All models using KD show a lower WER than the base CTC-ASR model. Our proposed KD on the intermediate encoder layers (interAED-KD) improved on KD only on the final layer (AED-KD), obtaining a relative WER reduction (WERR) of 7.6% on test clean and 4.8% on test other. From this result, we see that the intermediate loss is more effective at transferring language information into the encoder than LM distillation on the encoder output alone.

Combining interAED-KD and interCTC (interCTC+interAED-KD) resulted in the overall best model, achieving a WERR of 24.5% on test clean and 25.4% on test other compared with the base CTC-ASR model. This finding suggests that the interAED-KD and interCTC objective functions help the acoustic encoder learn complementary information.

As mentioned in Section 4, we used top-10 distillation in the experiments. We experimented with increasing this number but did not notice any significant improvement over top-10 distillation.
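The greedy (parallel) CTC decoding evaluated above can be sketched in a few lines: take the argmax token per frame, merge repeats, drop blanks. The frame posteriors below are made up for illustration.

```python
BLANK = "_"

def ctc_greedy_decode(frame_probs):
    best = [max(dist, key=dist.get) for dist in frame_probs]  # per-frame argmax
    out, prev = [], None
    for tok in best:
        if tok != prev and tok != BLANK:  # merge repeats, skip blanks
            out.append(tok)
        prev = tok
    return out

frames = [
    {"h": 0.70, "i": 0.10, BLANK: 0.20},
    {"h": 0.60, "i": 0.20, BLANK: 0.20},  # repeat of "h": merged away
    {"h": 0.10, "i": 0.10, BLANK: 0.80},  # blank frame
    {"h": 0.05, "i": 0.90, BLANK: 0.05},
]
decoded = ctc_greedy_decode(frames)
```

Because each frame's argmax is independent of the others, all frames can be processed in parallel, which is the source of the low RTF reported in Table 1.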
Surprisingly, using BERT's hidden state representations instead of token probabilities for distillation produced worse results in our experiments.

In additional experiments, not shown here, we increased the number of intermediate losses for both interCTC and interAED-KD, but we did not observe any improvement on the LibriSpeech dataset. This result aligns with similar findings in [21].

Table 3. Greedy decoding samples from LibriSpeech's test other subset with corresponding reference transcription.

Reference:                  i should have thought of it again when i was less busy may i go with you now
CTC:                        i should have thought of it again when i was less busy may ill go with you now
interCTC+interAED-KD CTC:   i should have thought of it again when i was less busy may i go with you now
interCTC+interAED-KD AED:   i should have thought of it but now that i was so tired may i go with you now
BERT:                       i should have thought of you myself but i was so afraid but i speak with you now

Reference:                  i don't believe all i hear no not by a big deal
CTC:                        i doanlie all i hear no not by a big deal
interCTC+interAED-KD CTC:   i don't believe all i hear no not by a big deal
interCTC+interAED-KD AED:   i do not believe what i say but no not quite a great deal
BERT:                       i don't believe what i say no not for a great man

5.3. LM Beam Search Decoding

In this section, we discuss the results of LM beam search decoding. We performed beam search decoding with beam sizes of 1 and 10. The LM weight and word insertion penalty were chosen from a rough parameter search on the dev clean and dev other sets, respectively. The decoding results along with the RTFs are summarised in Table 1.

Beam search decoding with the N-gram LM and beam size 1 allows us to compare the contributions of the KD and the N-gram LM. The CTC-ASR baseline achieved a worse WER than the models trained with our proposed interAED-KD and greedy decoding.
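As background, shallow-fusion beam search ranks hypotheses by a combined score of the form log P_AM + λ log P_LM plus an insertion bonus. The sketch below echoes Table 3's first sample; all scores, the LM weight, and the insertion penalty are made-up numbers for illustration.

```python
def fused_score(log_p_am, log_p_lm, length, lm_weight=0.5, ins_penalty=0.1):
    """Shallow-fusion score: acoustic + weighted LM + per-word insertion bonus."""
    return log_p_am + lm_weight * log_p_lm + ins_penalty * length

hyps = {
    # hypothesis: (log P_AM, log P_LM, word count) -- all values invented
    "may i go with you now":   (-4.0, -6.0, 6),
    "may ill go with you now": (-3.8, -11.0, 6),
}
best = max(hyps, key=lambda h: fused_score(*hyps[h]))
```

Here the LM term overturns a slightly better acoustic score for the ungrammatical hypothesis; the point of the proposed KD is to obtain this correction without the autoregressive LM pass at inference time.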
When the beam size is further increased to 10, the CTC-ASR baseline's WER becomes lower than that of CTC+interAED-KD with greedy decoding. However, when comparing the RTFs of greedy decoding and beam search decoding, we found that greedy decoding has an at least six times lower RTF. These findings highlight the strengths of our proposed method: it keeps the RTF of greedy, parallel decoding while giving better accuracy than autoregressive beam search decoding with an external LM.

Overall, beam search decoding significantly improved on greedy CTC decoding, even for the models trained with KD. However, this performance increase comes at the cost of longer decoding times and a worse RTF, as Table 1 shows. The RTF with beam search is over six times higher than with CTC greedy decoding. For comparison, when decoding with the AED of our model, the RTF was, at 0.013, approximately four times higher than with greedy CTC decoding.

5.4. Improvement by Intermediate KD

Since our proposed method always combines the intermediate AED [18] with KD, we need to evaluate the contribution of KD separately from that of the intermediate AED. The results in Table 1 clearly indicate the contribution of KD to improving the model parameters. Looking at the effect of interAED-KD itself, the language model information helped to reduce the WER by 15-19%. When using both interCTC and interAED, we saw a WERR of 14-20% for the models that used KD.

These results show that the model could not learn significant language knowledge from the available amount of reference transcriptions alone. Using interAED-KD, BERT's language statistics could successfully be transferred into the acoustic encoder via an auxiliary AED, and the AM could access this information during decoding without an external LM.

5.5. Decoding Samples

In our experiments, we compared the decoded outputs of different models.
Table 3 shows some samples from LibriSpeech's test other subset and the corresponding reference transcriptions.

First, in the provided samples, the CTC-ASR baseline model already makes very few mistakes, but it can still output nonsensical tokens for unseen inputs, as seen in the second example. Our proposed method reduces such mistakes significantly and, in the shown examples, outputs the correct words.

Second, when decoding with the AED of our proposed method, which was trained only with the KL divergence loss and not with cross entropy on the correct transcriptions, there are, as expected, significantly more errors in the decoded sequences. When we also decode the token probabilities from BERT and compare the AED's and BERT's decoded sequences, we can see that they exhibit large similarities. However, the examples also show that BERT's token probabilities are not always a good teacher signal and can themselves be nonsensical at times.

5.6. Discussion

As seen in Section 5.5, we discovered some limitations in the output probability distribution of BERT and room for further improvement of our proposed method. While BERT can predict words with a similar semantic meaning as the masked word in utterances with long left and right context, for short utterances BERT predicts words that are frequently seen in the training data. These words, while very likely in context, can be very different from the original transcription.

Furthermore, we used LibriSpeech's normalised LM training data, consisting of deduplicated sentences in alphabetical order, to train BERT. While these data are suitable for training an N-gram LM, they are not suitable for training neural network LMs like BERT, because BERT falls short of learning important long-term context. We can expect to train a better BERT model from LibriSpeech's original unprocessed text corpus.

6.
CONCLUSION AND OUTLOOK

We proposed a novel method for transferring external language model information into an AM by KD to improve the accuracy of a CTC-ASR model. It allows the AM to consider language knowledge when transcribing speech while maintaining the high speed of parallel CTC greedy decoding, without using shallow fusion with an external LM. In addition, the proposed method uses intermediate AEDs only during training for KD, so the final model does not increase in parameter size. However, when combined with an external LM and beam search during decoding, even further WERR can be achieved. This also highlights opportunities for further improvements to our KD method. We aim to refine our method such that incorporating external LM information into ASR models during decoding yields only marginal to zero gains, and LM KD during model training is sufficient for optimal performance.

7. REFERENCES

[1] Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, and Yifan Gong, "Internal language model estimation for domain-adaptive end-to-end speech recognition," in 2021 IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 243-250.

[2] Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, and Yifan Gong, "Internal language model training for domain-adaptive end-to-end speech recognition," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 7338-7342.

[3] Jan Chorowski and Navdeep Jaitly, "Towards Better Decoding and Language Model Integration in Sequence to Sequence Models," in Proc. Interspeech 2017, 2017, pp. 523-527.

[4] Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N.
Sainath, Zhifeng Chen, and Rohit Prabhavalkar, "An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 1-5828.

[5] Alex Graves, "Sequence Transduction with Recurrent Neural Networks," in International Conference of Machine Learning (ICML) Workshop on Representation Learning, 2012.

[6] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645-6649.

[7] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369-376.

[8] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean, "Distilling the Knowledge in a Neural Network," in NIPS Deep Learning and Representation Learning Workshop, 2015.

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in NAACL HLT, 2019, pp. 4171-4186.

[10] Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara, "Distilling the Knowledge of BERT for Sequence-to-Sequence ASR," in INTERSPEECH, 2020, pp. 3635-3639.

[11] Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, and Shuai Zhang, "Fast End-to-End Speech Recognition Via Non-Autoregressive Models and Cross-Modal Knowledge Transferring From BERT," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1897-1911, 2021.
[12] Yotaro Kubo, Shigeki Karita, and Michiel Bacchiani, "Knowledge Transfer from Large-Scale Pretrained Language Models to End-To-End Speech Recognizers," in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8512-8516.

[13] Yosuke Higuchi, Brian Yan, Siddhant Arora, Tetsuji Ogawa, Tetsunori Kobayashi, and Shinji Watanabe, "BERT meets CTC: New formulation of end-to-end speech recognition with pre-trained masked language model," in Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 5486-5503.

[14] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu, "TinyBERT: Distilling BERT for natural language understanding," in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 4163-4174.

[15] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou, "MobileBERT: a compact task-agnostic BERT for resource-limited devices," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2158-2170.

[16] Michael Hentschel, Emiru Tsunoo, and Takao Okuda, "Making Punctuation Restoration Robust and Fast with Multi-Task Learning and Knowledge Distillation," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 7773-7777.

[17] Ji Won Yoon, Beom Jun Woo, Sunghwan Ahn, Hyeonseung Lee, and Nam Soo Kim, "Inter-KD: Intermediate Knowledge Distillation for CTC-Based Automatic Speech Recognition," in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 280-286.

[18] Tatsuya Komatsu and Yusuke Fujita, "Interdecoder: Using Attention Decoders as Intermediate Regularization for CTC-Based Speech Recognition," in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 46-51.
[19] Keqi Deng, Zehui Yang, Shinji Watanabe, Yosuke Higuchi, Gaofeng Cheng, and Pengyuan Zhang, "Improving Non-Autoregressive End-to-End Speech Recognition with Pre-Trained Acoustic and Language Models," in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8522-8526.

[20] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206-5210.

[21] Jaesong Lee and Shinji Watanabe, "Intermediate Loss Regularization for CTC-Based Speech Recognition," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6224-6228.

[22] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, "Conformer: Convolution-augmented Transformer for Speech Recognition," in INTERSPEECH, 2020, pp. 5036-5040.

[23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is All you Need," in Advances in Neural Information Processing Systems, vol. 30, Curran Associates, Inc., 2017.

[24] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, "ESPnet: End-to-End Speech Processing Toolkit," in INTERSPEECH, 2018, pp. 2207-2211.

[25] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," in INTERSPEECH, 2019, pp. 2613-2617.
[26] Taku Kudo and John Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in EMNLP, 2018, pp. 66-71.

[27] Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, et al., "NeMo: a toolkit for building AI applications using neural modules," arXiv preprint arXiv:1909.09577, 2019.
