Consistency Based Unsupervised Self-training For ASR Personalisation
Jisi Zhang 1∗, Vandana Rajan 1∗, Haaris Mehmood 1, David Tuckey 1, Pablo Peso Parada 1, Md Asif Jalal 1, Karthikeyan Saravanan 1, Gil Ho Lee 2, Jungin Lee 2, Seokyeong Jung 2

1 Samsung Research UK, United Kingdom, 2 AI R&D Group, Samsung Electronics, Suwon, South Korea

∗ Equal contribution

ABSTRACT

On-device Automatic Speech Recognition (ASR) models trained on speech data of a large population might underperform for individuals unseen during training. This is due to a domain shift between user data and the original training data, caused by differences in the user's speaking characteristics and environmental acoustic conditions. ASR personalisation is a solution that aims to exploit user data to improve model robustness. The majority of ASR personalisation methods assume labelled user data for supervision. Personalisation without any labelled data is challenging due to the limited data size and the poor quality of recorded audio samples. This work addresses unsupervised personalisation by developing a novel consistency based training method via pseudo-labelling. Our method achieves a relative Word Error Rate Reduction (WERR) of 17.3% on unlabelled training data and 8.1% on held-out data compared to a pre-trained model, and outperforms the current state-of-the-art methods.

Index Terms — speech recognition, unsupervised, speaker adaptation, personalisation

1. INTRODUCTION

End-to-end ASR models are known to underperform when deployed in the wild, as they face voice characteristics (e.g. accent, tone) and background acoustics (e.g. noise, reverberation) unseen during training [1–4]. A promising approach to remedy this issue consists of fine-tuning a pre-trained ASR model on device with collected user data [5–7]. This type of approach is also known as "personalisation" or "adaptation", as it tailors the ASR model to the single user of the device.
While supervised personalisation methods [5–7] substantially improve performance for end-users, they require labelled data, which is impractical in many use cases [8]. This paper focuses on unsupervised self-training, which aims to improve the robustness of an ASR model using only unlabelled user data. Recently, unsupervised ASR personalisation has made progress by incorporating auxiliary speaker features into an ASR model [9], exploring self-training methods [10, 11], and training based on entropy minimisation [8, 12].

A common pipeline for the unsupervised self-training method contains data filtering, pseudo-labelling, and training [10, 11]. In [11], a confidence estimation module uses ASR output probabilities to select a less erroneous subset of utterances from the entire set of unlabelled samples. The filtered samples are processed by the pre-trained model to generate pseudo-labels, which are subsequently used during training. However, the training process can be unstable without access to labelled data for supervision, and the model drifts away due to erroneous pseudo-labels [13].

Consistency Constraint (CC) forces a model to predict the same results on the same input under various versions of perturbation, and has been shown effective for exploiting unlabelled data [14–17]. Applying various perturbations introduces randomisation that regularises a model, leading to more stable model generalisation [14]. CC has been successfully combined with a supervised loss function for semi-supervised learning in computer vision [15] and speech applications [16, 17]. However, it has not yet been explored for speech recognition in a fully unsupervised self-training setting. In this work, we exploit CC to improve the robustness of the training process for unsupervised ASR personalisation.
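In the semi-supervised settings cited above, CC is typically realised as an agreement loss between a model's predictions on two independently perturbed views of the same input. A minimal toy illustration (a fixed linear "model" and random masking standing in for SpecAugment; all names are our own, not from any of the cited works):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def perturb(x, rng, p=0.2):
    """Randomly zero out features: a crude stand-in for SpecAugment-style masking."""
    return x * (rng.random(x.shape) > p)

# Toy "model": a fixed linear layer followed by softmax.
W = rng.normal(size=(8, 4))
model = lambda x: softmax(x @ W)

x = rng.normal(size=(1, 8))      # one unlabelled sample
p1 = model(perturb(x, rng))      # prediction on perturbed view 1
p2 = model(perturb(x, rng))      # prediction on perturbed view 2

# Consistency loss: cross-entropy between the two predictive distributions.
# Minimising it pushes the model towards identical outputs on both views.
cc_loss = -np.sum(p1 * np.log(p2 + 1e-9))
```

In the semi-supervised works, a term like `cc_loss` is added to a supervised loss on labelled data; this paper instead uses consistency as the only training signal.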
We introduce CC to the common unsupervised self-training pipeline; however, perturbations are applied to both pseudo-labelling and the training process, forcing the model to output a consistent label in the vicinity of each training sample after filtering. To the best of our knowledge, our work is the first to apply CC in the context of a fully unsupervised setting for ASR. We compare against a state-of-the-art (SOTA) unsupervised self-training method that combines a data filtering strategy with an adapter-based training mechanism [10] and a separate unsupervised adaptation method based on entropy minimisation [8]. Our proposed method achieves a 17.3% relative WERR compared to a pre-trained model, and outperforms the recently developed techniques in the literature, leading to a new SOTA performance. The main contributions of this paper are summarised as follows:
• We propose a novel consistency based training method for unsupervised personalisation of ASR models which outperforms current SOTA methods.
• We empirically show that our proposed method is agnostic to the choice of data-filtering methods and thus can be used in conjunction with any of them.
• We evaluate the robustness of our method on a broad range of English accents with pseudo-label WER ranging from 10% (high-quality) to 45% (low-quality).
The rest of the paper is organised as follows. Section 2 reviews related work in the literature. Section 3 provides background details of the ASR model and a confidence-score based filtering method used in the proposed framework. Section 4 describes the proposed training method based on consistency. Section 5 presents implementation details and the experiment setup. Results and analysis are presented in Section 6. Finally, the paper is concluded in Section 7.

979-8-3503-0689-7/23/$31.00 ©2023 IEEE

2. RELATED WORK

Data filtering is a commonly used pre-processing step when exploring unlabelled data [10, 18].
Confidence based filtering methods have been developed to select samples with good quality labels, i.e. those that yield a lower Word Error Rate (WER) among the whole unlabelled dataset [18–21]. For example, in DUST [18, 21], multiple ASR hypotheses are created using dropout, and edit distances from a reference (decoded without dropout) are used to estimate the confidence of the ASR model in the generated transcript. Another class of methods uses a confidence estimation system that trains a separate neural network on intermediate features derived from the ASR model [10, 19, 20]. For example, both [20] and [10] use lightweight binary classifier models to predict whether a given set of ASR features corresponds to an error-free pseudo-label (WER=0) or not.

Model adaptation methods focus on learning highly compact speaker-dependent parameter representations [10, 22–24]. Given a pre-trained model, a diagonal linear transformation of the input features can be learned to match the distribution of test data to the training data [22]. An ASR model's batch normalisation layer parameters can be learned from the target speaker data for adaptation [23]. Additional speaker-dependent parameters can also be introduced to an ASR model for adaptation [10, 24]. Specifically, learning hidden unit contributions (LHUC) trains fixed-dimension embedding vectors that modify the amplitudes of hidden unit activations [24]. Recently, the LHUC method has been successfully applied to a Conformer-based end-to-end ASR model for speaker adaptation [10]. However, the major drawbacks of these techniques are that the selection of adaptation layers in the ASR model is non-trivial and adding new parameters changes the model architecture.

Recently, entropy minimisation based approaches have been shown effective for unsupervised ASR domain adaptation [8, 12]. The aim is to reduce uncertainty for samples in a target domain by minimising an entropy loss based on the output probabilities of a pre-trained model given the target samples. However, one issue of this method is that the model tends to drift away when the initial predictions are incorrect.

Fig. 1: Streaming two-pass end-to-end ASR model architecture. The first-pass model is a Conformer-based transducer. The second-pass model is an attention-based encoder-decoder model (LAS). The NCM classifier is a confidence estimation module that uses intermediate ASR features for WER-based data filtering.

3. BACKGROUND

This section provides the background details of the streaming ASR system and the Neural Confidence Measure (NCM) based data filtering method, which are required for our proposed unsupervised personalisation pipeline.

3.1. ASR model

A state-of-the-art, two-pass Conformer-T model [25] is used as the pre-trained ASR model, both for filtering unlabelled data and for adaptation to a target speaker. As shown in Fig. 1, it consists of two sub-models, namely a parent model and a second-pass model. The parent model is a conformer transducer [26], which consists of a transcription network, a prediction network and a joint network. The transcription network contains a convolution subsampling module followed by stacked convolution-augmented Transformer (Conformer) blocks. The prediction network contains two LSTM layers. The second-pass model is an LAS rescorer, consisting of an LSTM-based encoder-decoder architecture. The LAS encoder takes as input the parent's transcription output, and the first-pass prediction is refined by attending to the second-pass encoder outputs. The model is trained using both a transducer loss [27] and a cross-entropy (CE) loss, calculated on the outputs of the parent model and the second-pass model, respectively.

3.2. Data filtering methods

Inspired by recent works that employ WER prediction models for data filtering [10, 11, 20, 28, 29], we use one such existing model that was developed exclusively for our custom ASR model described previously. The NCM binary classification model [20] is used for filtering the transcripts generated by our two-pass ASR model (see Fig. 1). The NCM model is made up of dense layers and a self-attention mechanism, and takes two types of intermediate features from the ASR model as input, namely the second-pass decoder outputs and the beam scores. The second-pass decoder outputs are logits from the LAS second-pass decoder; we use the top-K logits corresponding to each decoded token. Beam scores are the log-probability scores for each beam assigned by the second-pass decoder. Input features are obtained by running the ASR model with beam search, and the model is trained using a binary cross-entropy loss. The output consists of two classes, WER=0 and WER>0, and only the predicted WER=0 samples are selected as the filtered set. Note that, different to the original NCM model that used six types of features, we use only two, based on the relevance of features as shown in Table 1 of [20]. Once training is complete, the saved NCM model is used to perform pseudo-label filtering on the on-device recorded personal data.

Additionally, we also experiment with two other techniques from the literature for data filtering, namely DUST [18] and Confidence Thresholding (CT) [30]. DUST is based on the intuition that, for confident predictions, the ASR output would remain unchanged even if some amount of uncertainty were introduced into the model in the form of dropout. In practice, dropout layers are enabled in the ASR model during evaluation and each utterance is forward propagated through the network multiple times to generate different hypotheses.
Levenshtein edit distances between a reference (the hypothesis with no dropout) and each of the hypotheses are calculated. If any of the distances corresponding to an utterance is above a predefined threshold, that utterance is judged to have lower confidence by the ASR model and is hence rejected. For CT, the confidence score for each utterance is obtained by summing the log-softmax scores across all tokens. WER values are binarised by taking WER=0 as the positive class and WER>0 as the negative class. The threshold is then found by taking the geometric mean of sensitivity and specificity using an ROC curve.

4. PROPOSED METHOD

The proposed unsupervised personalisation pipeline starts by filtering the unlabelled data and then uses the filtered data to adapt a pre-trained ASR model based on the Consistency Constraint (CC). The novelty of our adaptation method is that it is the first work to apply CC in the context of fully unsupervised ASR personalisation. The personalisation pipeline (see Fig. 2) based on the proposed CC training is summarised in Algorithm 1. We first apply data filtering DATAFILTER to the entire unlabelled set X to obtain the filtered set X̂. Subsequently, the model is trained on the filtered set, involving N rounds of pseudo-labelling and training; in each round, the model f is trained for M epochs with the paired audio samples and pseudo-labels D̂.

Fig. 2: Unsupervised personalisation pipeline based on data filtering and consistency constraint.

Algorithm 1 Proposed consistency based unsupervised personalisation pipeline
1: Input: ASR model f, weights θ, unlabelled data X
2: X̂ = DATAFILTER(X), θ_0 = θ
3: for i = 0, ..., N − 1 do // N rounds
4:     D̂ = {(x̂, f(SPECAUG(x̂), θ_i))} ∀ x̂ ∈ X̂
5:     θ_{i+1} = TRAIN(f, θ_i, D̂, M) // M epochs
6: end for
7: Output: θ_N

The consistency is realised via pseudo-labelling on utterances augmented with random data perturbations, after which the generated pseudo-labels are used to train the model on inputs that are perturbed in a different way. The pseudo-label for each utterance is updated frequently with the latest ASR model during the personalisation process. CC has commonly been applied as a regularisation loss in semi-supervised learning approaches [15–17, 31], in which a main loss is calculated on the available labelled source domain data. This work uses CC as the only loss for unsupervised personalisation instead of using it as an auxiliary loss.

Pseudo-labels are generated by decoding the output of the second-pass model via beam search to produce hard labels, i.e. a transcription sequence ŷ. During the training phase, only the parameters of the first-pass model are adapted. Specifically, the first-pass transducer takes augmented input features x̃ and outputs the posterior probabilities of all possible alignments, which are used to calculate the loss against the pseudo-label ŷ. The loss is then used to train the first-pass transducer, comprised of the transcription network, prediction network, and joint network. Since the first-pass model is a conformer transducer, the loss function incorporates the CC within the standard RNN-T loss:

L = − ln Pr(ŷ | x̃)    (1)

SpecAugment [32, 33] is used as the data perturbation for both pseudo-labelling and training. Due to randomisation, the SpecAugment applied to a sample during training differs from that applied to the same sample during pseudo-labelling. During training, besides the data perturbation, the model is perturbed by a dropout strategy [34]. The combination of data and model perturbation forms a stronger augmentation for training than for pseudo-labelling.

5. EXPERIMENT SETUP

5.1. Data

The ASR model is pre-trained on 20K hours of English speech data. This includes data from public speech datasets such as LibriSpeech [35] as well as in-house data from a variety of domains such as search, telephony, far-field, etc. The performance of this model on an in-house validation set of 6 hours (5K utterances) is 15.69% WER. For the personalisation experiments, the proposed method is evaluated on in-house synthetic user data for a mobile phone use case. There are 12 speakers, each contributing three styles of speech: (i) application launch/download commands (Apps), (ii) contact call/text commands (Contacts), and (iii) common real-world voice assistant commands (Dictations). Table 1 provides examples of the three types of data. The average audio length of Apps and Contacts samples is two seconds. The filtered Apps and Contacts data are used to personalise the pre-trained acoustic model. Then, the personalised model is evaluated on the entire (filtered and unfiltered) Apps and Contacts data, and on the held-out Dictations data that is unseen during training. On average, for each speaker, the durations of the entire Apps, Contacts and Dictations sets are 8.37, 16.93 and 6.47 minutes, respectively.

5.2. ASR model configuration

The transcription network in the two-pass ASR model consists of 16 conformer blocks. Each conformer block consists of one feed-forward layer block, one convolution block, one multi-head self-attention block and another feed-forward layer block, followed by a layer normalisation layer. The prediction network is constructed by stacking two LSTM layers with a dimension of 640. The joint network is a dense layer. In the second-pass model, the encoder contains one LSTM with a dimension of 680, and the decoder contains two LSTMs with a dimension of 680. When decoding, the beam sizes for the first-pass model and the second-pass LAS re-scoring are set to 4 and 1, respectively. No language model is used for decoding.
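The pipeline of Algorithm 1 can be sketched in a few lines of framework-agnostic Python. Here `filter_data`, `pseudo_label`, `spec_augment`, and `train_one_epoch` are hypothetical stand-ins for the NCM filter, beam-search decoding, the data perturbation, and one epoch of RNN-T training, respectively; none of these names come from the paper's actual codebase.

```python
import random

def spec_augment(x, rng):
    """Toy stand-in for SpecAugment: randomly mask elements of a feature list."""
    return [0.0 if rng.random() < 0.2 else v for v in x]

def personalise(model, data, filter_data, pseudo_label, train_one_epoch,
                n_rounds=3, m_epochs=2, seed=0):
    """Consistency-based self-training (Algorithm 1, sketched).

    In each round, pseudo-labels are regenerated on *perturbed* inputs with
    the latest model; training then draws a *different* random perturbation,
    so the model is pushed towards consistent outputs around each sample.
    """
    rng = random.Random(seed)
    filtered = filter_data(data)                      # step 2: NCM-style filtering
    for _ in range(n_rounds):                         # step 3: N rounds
        pairs = [(x, pseudo_label(model, spec_augment(x, rng)))
                 for x in filtered]                   # step 4: perturbed pseudo-labelling
        for _ in range(m_epochs):                     # step 5: M epochs of training
            for x, y in pairs:
                model = train_one_epoch(model, spec_augment(x, rng), y)
    return model
```

Note the two independent `spec_augment` calls: the mask drawn at training time differs from the one used at pseudo-labelling time, which is what realises the consistency constraint behind Eq. (1).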
Table 1: Adaptation data examples: Apps, Contacts, and Dictations

Split        Examples
Apps         Open Messenger; Install Snapchat
Contacts     Send a message to Anne Hathaway; Call Emma Stone
Dictations   When does summer start; Set an alarm for seven thirty

A batch size of 16 and the Adam optimizer [36] are used for all the adaptation experiments. The learning rates for ASR model fine-tuning and LHUC-based training are set to 5e-6 and 1e-3, respectively. The SpecAugment used for both pseudo-labelling and training has one frequency mask of size 13 and two time masks of size 12. During training, the model is perturbed with a dropout of 0.1.

5.3. Data filtering method setup

In the NCM, the first fully connected block consists of two dense layers, each with 64 neurons followed by Tanh activation. This is followed by a self-attention layer, whose output is summed across all tokens for each utterance and concatenated with the beam scores. This is then passed through another fully connected block made up of two dense layers of 64 neurons each, and an output dense layer for the binary class prediction. Training the NCM model uses a batch size of 32 and the Adam optimizer [36] with an initial learning rate of 1e-3. An exponential decay scheduler with a decay rate of 0.5 for every 500 training steps is also used. The K value is set to 4. The NCM model is trained on an in-house 6-hour dataset, split in the ratio 80:20 for training and validation.

The CT method uses the NCM training data (the in-house 6-hour dataset) for finding the threshold, which is then used to identify correct pseudo-labels in the Apps and Contacts sets of personal data. For DUST, a dropout of 0.2 (the same value used during the original ASR training) is enabled for the transformers in the ASR model, and 5 hypotheses are generated for each utterance.
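The DUST accept/reject decision described in Sections 3.2 and 5.3 can be sketched as follows. `levenshtein` is a plain edit-distance helper, and the threshold is expressed as a fraction of the reference length; the function names are our own, not DUST's original implementation.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def dust_keep(reference, dropout_hypotheses, threshold=0.1):
    """Keep an utterance only if every dropout hypothesis stays close to the
    no-dropout reference (distance within threshold * reference length)."""
    limit = threshold * len(reference)
    return all(levenshtein(reference, h) <= limit for h in dropout_hypotheses)
```

In our setup, each utterance is decoded once without dropout (the reference) and five times with dropout enabled, and the utterance is kept only if all five hypotheses pass this test.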
We tried different edit-distance thresholds from {0.1, 0.3, 0.5, 0.7}, as recommended in [18], and found that the best results are obtained with a threshold of 0.1.

6. RESULTS AND ANALYSIS

This section presents the main results of the proposed method and compares it against recently developed unsupervised adaptation methods. The second part of this section conducts a detailed ablation study on the proposed method with respect to data filtering strategies and individual user performance.

6.1. ASR personalisation results

Methods used as baselines in this work include noisy student training (NST) [37] without access to labelled source data, confidence score based LHUC [10], and a recently proposed unsupervised test-time adaptation approach based on entropy minimisation (EM) [8]. The additional LHUC modules are inserted into the streaming two-pass model used in this work. Different to the strategy used in [10], which applies only one LHUC layer to the hidden output of the convolution subsampling module, we apply multiple LHUC layers to the transcription network, which yields better results than applying a single layer. Specifically, one LHUC layer is applied to the output of the convolution subsampling module and another to the output of each Conformer block; in total, there are 17 LHUC layers. During adaptation, only the LHUC layers are updated and the parameters of the main ASR model are unchanged. The NST method employs data augmentation (SpecAugment) during training, but not for pseudo-labelling. For the EM approach, we calculate and minimise the entropy of the token-wise posterior probabilities output by the second-pass LAS decoder. The NCM data filtering is applied before the EM training. Both NST and EM also update only the first-pass model parameters, for a fair comparison with the proposed method. Table 2 summarises the ASR performance of the baseline and proposed methods.
The proposed method achieves 17.3%, 7.2%, and 8.1% relative WERR on Apps, Contacts, and Dictation, respectively, compared to the pre-trained model. It outperforms both entropy minimisation and the confidence score based LHUC, achieving a new SOTA result. The results show that the proposed method not only improves the ASR performance on the unlabelled data used for training (Apps & Contacts), but also generalises well to unseen data (Dictation) spoken by the target speaker. The second half of Table 2 shows the performance of the proposed method (NCM+CC) with and without LHUC, and the results indicate that the addition of LHUC does not provide any improvement.

Table 2: Word Error Rate (WER) of the proposed method and existing methods for unsupervised personalisation

Methods          Apps     Contacts   Dictation
Pre-trained      22.66    23.49      9.43
NST              21.94    23.07      9.36
EM [8]           20.26    23.23      9.53
NCM+EM           19.12    22.22      8.86
NCM+LHUC [10]    20.30    22.70      9.10
NCM+CC+LHUC      19.30    21.99      8.64
NCM+CC           18.73    21.79      8.67

NCM+CC also outperforms the combination of data filtering and entropy minimisation (NCM+EM). EM aims to reduce the output uncertainty when a model processes a test sample; when the model prediction is incorrect, the EM approach trains the model to be more confident about its incorrect decision. For the consistency based approach, a model may output different predictions for a highly uncertain sample under various perturbations. Since the model is trained to map this same sample to different predictions, such a sample has less impact on the model weight change than samples that yield consistent pseudo-labels.

6.2. Ablation study

We first investigate the effect of the quality of audio samples on the ASR personalisation performance. The Consistency Constraint (CC) based method is tested either on the unfiltered whole data set or on the data set filtered by one of three filtering strategies, namely Confidence Thresholding (CT), DUST, and the NCM.
The method is also tested on training with only the audio samples that the ASR system recognises correctly (WER=0), selected based on the ground-truth transcriptions. In Table 3, we first observe that CC trained with audio samples that the model recognises correctly (WER=0) performs better than CC trained with the whole data (CC). This demonstrates the importance of adapting a given model on samples that yield low WER. The second half of Table 3 shows the results of CC with different filtering strategies. Remarkably, though trained on filtered data that contains samples with erroneous labels, the CC based training outperforms training with WER=0 samples. This suggests that the CC based training is able to exploit samples with erroneous labels to adapt the model.

Table 3: The effect of three data filtering methods (CT, DUST, NCM) on the unsupervised personalisation performance

Methods       Apps     Contacts   Dictation
CC            21.25    22.71      9.04
CC (WER=0)    20.38    22.40      8.87
CT+CC         18.91    22.10      8.75
DUST+CC       18.93    21.87      8.69
NCM+CC        18.73    21.79      8.67

All three data filtering strategies achieve similar performance improvements. Of the three, NCM is favoured for on-device personalisation: it is a lightweight model that requires only one-time training and can be easily deployed on device. The CT method requires manual tuning of the threshold value, which is challenging for an end-to-end ASR model because of the well-known overconfidence issue in end-to-end ASR models [29]. The NCM, however, can exploit large amounts of training data on a server to automatically learn the boundary between reliable and unreliable samples by exploiting multiple intermediate ASR features. Compared to DUST, which requires multiple forward passes and is thus undesirable under time and resource constraints, the NCM obtains its result with a single forward pass.

Fig. 3: Word Error Rate Reduction (WERR) compared to the pre-trained model for Apps, Contacts & Dictation using consistency training (CC) and unsupervised NST for 20 rounds with a choice of 1, 3 or 5 epochs per round. Higher values are better. Plot values are smoothed using an exponential moving average with a weight of 0.6. Best viewed in colour.

Fig. 4: ASR personalisation results for each of the 12 individual users. The pre-trained model, NST trained on unfiltered data, and the proposed method are compared in the plot. (Top: Apps data; Middle: Contacts data; Bottom: Dictation data.)

We perform ablation experiments to study the effect of increasing the number of rounds and of epochs per round on the overall WERR. We use epochs ∈ {1, 3, 5} for up to 20 rounds and compare our method against unsupervised NST. Fig. 3 shows that training with five epochs per round can lead to divergence for Dictation, a classic example of overfitting due to increased model updates. Conversely, training for a single epoch per round leads to sub-optimal convergence due to the increased stochasticity of regenerating pseudo-labels with input augmentation every round. Our method performs up to 40% better than unsupervised NST, which is more susceptible to overfitting because it is easily stuck in a local minimum.

We investigate the performance of our proposed method on each individual user, and the analysis is shown in Fig. 4.
In comparison to the pre-trained model and the baseline NST trained on unfiltered data, our method achieves better or equivalent recognition accuracy for most users on both held-in data (Apps & Contacts) and held-out data (Dictation). There is a wide range of speech recognition accuracy among the test users, whose WERs range from 10% to 45%. The results demonstrate that the proposed method improves the robustness of the training process to erroneous labels.

7. CONCLUSIONS

This work introduces a novel unsupervised personalisation training method to address the domain shift issue that arises when an ASR model is deployed in the wild. The proposed method performs data filtering of unlabelled user data and applies a consistency constraint to the training process. A neural confidence measure approach has been employed for data filtering and has been demonstrated to effectively discard unreliable audio samples that yield erroneous pseudo-labels. To apply the consistency constraint, data perturbation is introduced into the iterative pseudo-labelling process, forcing the ASR model to predict the same labels on the same sample under various versions of perturbation. Experiments show that the proposed method reduces the negative impact of low quality samples during training and improves model generalisation to test domain data. We further evaluate the consistency based training separately in combination with three existing filtering methods, and all filtering methods achieve similar results. This suggests that the consistency based training can be used in conjunction with a wide range of data filtering strategies.

8. REFERENCES

[1] Yong Zhao, Jinyu Li, Shixiong Zhang, Liping Chen, and Yifan Gong, "Domain and speaker adaptation for Cortana speech recognition," in ICASSP, 2018.
[2] Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, and Françoise Beaufays, "Incremental layer-wise self-supervised learning for efficient unsupervised speech domain adaptation on device," in Interspeech, 2022.
[3] Khe Chai Sim, Petr Zadrazil, and Françoise Beaufays, "An investigation into on-device personalization of end-to-end automatic speech recognition models," in Interspeech, 2019.
[4] Peter Bell, Joachim Fainberg, Ondrej Klejch, Jinyu Li, Steve Renals, and Pawel Swietojanski, "Adaptation algorithms for neural network-based speech recognition: An overview," IEEE Open Journal of Signal Processing, vol. 2, pp. 33–66, 2020.
[5] Khe Chai Sim, Petr Zadražil, and Françoise Beaufays, "An investigation into on-device personalization of end-to-end automatic speech recognition models," in Interspeech, 2019.
[6] Zhong Meng, Jinyu Li, Yashesh Gaur, and Yifan Gong, "Domain adaptation via teacher-student learning for end-to-end speech recognition," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 268–275.
[7] Yan Huang, Guoli Ye, Jinyu Li, and Yifan Gong, "Rapid speaker adaptation for conformer transducer: Attention and bias are all you need," in Interspeech, 2021.
[8] Guan-Ting Lin, Shang-Wen Li, and Hung-yi Lee, "Listen, adapt, better WER: Source-free single-utterance test-time adaptation for automatic speech recognition," in Interspeech, 2022.
[9] Marc Delcroix, Shinji Watanabe, Atsunori Ogawa, Shigeki Karita, and Tomohiro Nakatani, "Auxiliary feature based adaptation of end-to-end ASR systems," in Interspeech, 2018.
[10] Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Mengzhe Geng, Guinan Li, Xunying Liu, and Helen M. Meng, "Confidence score based conformer speaker adaptation for speech recognition," in Interspeech, 2022.
[11] Jiajun Deng, Xurong Xie, Tianzi Wang, Mingyu Cui, Boyang Xue, Zengrui Jin, Guinan Li, Shujie Hu, and Xunying Liu, "Confidence score based speaker adaptation of conformer speech recognition systems," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1175–1190, 2023.
[12] Changhun Kim, Joonhyung Park, Hajin Shim, and Eunho Yang, "SGEM: Test-time adaptation for automatic speech recognition via sequential-level generalized entropy minimization," in Interspeech, 2023.
[13] Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu, "Model adaptation: Unsupervised domain adaptation without source data," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[14] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen, "Regularization with stochastic transformations and perturbations for deep semi-supervised learning," in NIPS, 2016.
[15] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A. Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li, "FixMatch: Simplifying semi-supervised learning with consistency and confidence," Advances in Neural Information Processing Systems, vol. 33, pp. 596–608, 2020.
[16] Felix Weninger, Franco Mana, Roberto Gemello, Jesús Andrés-Ferrer, and Puming Zhan, "Semi-supervised learning with data augmentation for end-to-end ASR," in Interspeech, 2020.
[17] Ashtosh Sapru, "Using data augmentation and consistency regularization to improve semi-supervised speech recognition," in Interspeech, 2022.
[18] Sameer Khurana, Niko Moritz, Takaaki Hori, and Jonathan Le Roux, "Unsupervised domain adaptation for speech recognition via uncertainty driven self-training," in ICASSP, 2020.
[19] Kaustubh Kalgaonkar, Chaojun Liu, Yifan Gong, and Kaisheng Yao, "Estimating confidence scores on ASR results using recurrent neural networks," in ICASSP, 2015.
[20] Ashutosh Gupta, Ankur Kumar, Dhananjaya N. Gowda, Kwangyoun Kim, Sachin Singh, Shatrughan Singh, and Chanwoo Kim, "Neural utterance confidence measure for RNN-Transducers and two pass models," in ICASSP, 2021.
[21] Nauman Dawalatabad, Sameer Khurana, Antoine Laurent, and James Glass, "On unsupervised uncertainty-driven speech pseudo-label filtering and model calibration," in ICASSP, 2023.
[22] Zhong-Qiu Wang and DeLiang Wang, "A joint training framework for robust automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 796–806, 2016.
[23] Zhong-Qiu Wang and DeLiang Wang, "Unsupervised speaker adaptation of batch normalized acoustic models for robust ASR," in ICASSP, 2017.
[24] Pawel Swietojanski, Jinyu Li, and Steve Renals, "Learning hidden unit contributions for unsupervised acoustic model adaptation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1450–1463, 2016.
[25] Jinhwan Park, Sichen Jin, Junmo Park, Sungsoo Kim, Dhairya Sandhyana, Changheon Lee, Myoungji Han, Jungin Lee, Seokyeong Jung, Chang Woo Han, and Chanwoo Kim, "Conformer-based on-device streaming speech recognition with KD compression and two-pass architecture," in 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 92–99, 2023.
[26] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al., "Conformer: Convolution-augmented transformer for speech recognition," in Interspeech, 2020.
[27] Alex Graves, "Sequence transduction with recurrent neural networks," arXiv:1211.3711, 2012.
[28] Ankur Kumar, Sachin Singh, Dhananjaya Gowda, Abhinav Garg, Shatrughan Singh, and Chanwoo Kim, "Utterance confidence measure for end-to-end speech recognition with applications to distributed speech recognition scenarios," in Interspeech, 2020.
[29] Qiujia Li, David Qiu, Yu Zhang, Bo Li, Yanzhang He, Philip C. Woodland, Liangliang Cao, and Trevor Strohman, "Confidence estimation for attention-based sequence-to-sequence models for speech recognition," in ICASSP, 2020.
[30] Jacob Kahn, Ann Lee, and Awni Y. Hannun, "Self-training for end-to-end speech recognition," in ICASSP, 2019.
[31] Marvin Zhang, Sergey Levine, and Chelsea Finn, "MEMO: Test time robustness via adaptation and augmentation," in NeurIPS, 2022.
[32] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin Dogus Cubuk, and Quoc V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Interspeech, 2019.
[33] Daniel S. Park, Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, Quoc V. Le, and Yonghui Wu, "SpecAugment on large scale datasets," in ICASSP, 2019.
[34] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv:1207.0580, 2012.
[35] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "LibriSpeech: an ASR corpus based on public domain audio books," in ICASSP, 2015.
[36] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[37] Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V. Le, "Improved noisy student training for automatic speech recognition," in Interspeech, 2020.