Bi-Directional Lattice Recurrent Neural Networks for Confidence Estimation


Authors: Qiujia Li, Preben M. Ness, Anton Ragni, Mark J. F. Gales

Q. Li†, P. M. Ness†, A. Ragni‡, M. J. F. Gales‡
Department of Engineering, University of Cambridge
Trumpington Street, Cambridge CB2 1PZ, UK
{ql264, pmn26, ar527, mjfg}@eng.cam.ac.uk

ABSTRACT

The standard approach to mitigate errors made by an automatic speech recognition system is to use confidence scores associated with each predicted word. In the simplest case, these scores are word posterior probabilities, whilst more complex schemes utilise bi-directional recurrent neural network (BiRNN) models. A number of upstream and downstream applications, however, rely on confidence scores assigned not only to 1-best hypotheses but to all words found in confusion networks or lattices. These include, but are not limited to, speaker adaptation, semi-supervised training and information retrieval. Although word posteriors could be used in those applications as confidence scores, they are known to have reliability issues. To make improved confidence scores more generally available, this paper shows how BiRNNs can be extended from 1-best sequences to confusion network and lattice structures. Experiments are conducted using one of the Cambridge University submissions to the IARPA OpenKWS 2016 competition. The results show that confusion network and lattice-based BiRNNs can provide a significant improvement in confidence estimation.

Index Terms — confidence estimation, bi-directional recurrent neural network, confusion network, lattice

1. INTRODUCTION

Recent years have seen an increased usage of spoken language technology in applications ranging from speech transcription [1] to personal assistants [2]. The quality of these applications heavily depends on the accuracy of the underlying automatic speech recognition (ASR) system yielding 1-best hypotheses and on how well ASR errors are mitigated.
The standard approach to ASR error mitigation is confidence scores [3, 4]. A low confidence can give a signal to downstream applications about the high uncertainty of the ASR in its prediction, and measures can be taken to mitigate the risk of making a wrong decision. However, confidence scores can also be used in upstream applications such as speaker adaptation [5] and semi-supervised training [6, 7] to reflect uncertainty among multiple possible alternative hypotheses. Downstream applications, such as machine translation and information retrieval, could similarly benefit from using multiple hypotheses.

A range of confidence scores has been proposed in the literature [4]. In the simplest case, confidence scores are posterior probabilities that can be derived using approaches such as confusion networks [8, 9]. These posteriors typically significantly over-estimate confidence [9]. Therefore, a number of approaches have been proposed to rectify this problem. These range from simple piece-wise linear mappings given by decision trees [9] to more complex sequence models such as conditional random fields [10], and to neural networks [11, 12, 13].

† Both authors contributed equally. P. M. Ness was supported in part by the ALTA Institute, Cambridge University. ‡ Supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Air Force Research Laboratory (AFRL) contract #FA8650-17-C-9117. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, AFRL or the U.S. Government. The U.S. Government is authorised to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
Though improvements over posterior probabilities on 1-best hypotheses were reported, the impact of these approaches on all hypotheses available within confusion networks and lattices has not been investigated.

Extending confidence estimation to confusion network and lattice structures can be straightforward for some approaches, such as decision trees, and challenging for others, such as recurrent forms of neural networks. The previous work on encoding graph structures into neural networks [14] has mostly focused on embedding lattices into a fixed dimensional vector representation [15, 16]. This paper examines a particular example of extending a bi-directional recurrent neural network (BiRNN) [17] to confusion network and lattice structures. This requires specifying how BiRNN states are propagated in the forward and backward directions, how to merge a variable number of BiRNN states, and how target confidence values are assigned to confusion network and lattice arcs. The paper shows that the state propagation in the forward and backward directions has close links to the standard forward-backward algorithm [18]. This paper proposes several approaches for merging BiRNN states, including an attention mechanism [19]. Finally, it describes a Levenshtein algorithm for assigning targets to confusion networks and an approximate solution for lattices. Combined, these make it possible to assign confidence scores to every word hypothesised by the ASR, not just from a single extracted hypothesis.

The rest of this paper is organised as follows. Section 2 describes the use of bi-directional recurrent neural networks for confidence estimation in 1-best hypotheses. Section 3 describes the extension to confusion network and lattice structures. Experimental results are presented in Section 4. The conclusions drawn from this work are given in Section 5.

2. BI-DIRECTIONAL RECURRENT NEURAL NETWORK

Fig. 1a shows the simplest form of the BiRNN [17]. Unlike its uni-directional version, the BiRNN makes use of two recurrent states, one going in the forward direction in time, \overrightarrow{h}_t, and another in the backward direction, \overleftarrow{h}_t, to model past (history) and future information respectively. The past information can be modelled by

    \overrightarrow{h}_t = \sigma( W^{(\overrightarrow{h})} \overrightarrow{h}_{t-1} + W^{(x)} x_t )    (1)

where x_t is an input feature vector at time t, W^{(x)} is an input matrix, W^{(\overrightarrow{h})} is a history matrix and \sigma is an element-wise non-linearity such as a sigmoid. The future information is typically modelled in the same way. At any time t the confidence c_t can be estimated by

    c_t = \sigma( \mathbf{w}^{(c)T} h_t + b^{(c)} )    (2)

where \mathbf{w}^{(c)} and b^{(c)} are a parameter vector and a bias, \sigma is any non-linearity that maps the confidence score into the range [0, 1], and h_t is a context vector that combines the past and future information:

    h_t = [ \overrightarrow{h}_t^T \; \overleftarrow{h}_t^T ]^T    (3)

[Fig. 1: Bi-directional neural networks for confidence estimation: (a) sequence; (b) confusion network, lattice]

The input features x_t play a fundamental role in the model's ability to assign accurate confidence scores. Numerous hand-crafted features have been proposed [20, 21, 22, 23]. In the simplest case, duration and word posterior probability can be used as input features. More complex features may include embeddings [24], acoustic and language model scores and other information. The BiRNN can be trained by minimising the binary cross-entropy

    H(\mathbf{c}, \mathbf{c}^*; \theta) = -\frac{1}{T} \sum_{t=1}^{T} \{ c_t^* \log(c_t) + (1 - c_t^*) \log(1 - c_t) \}    (4)

where c_t is the predicted confidence score for time slot t and c_t^* is the associated reference value. The reference values can be obtained by aligning the 1-best ASR output and the reference text using the Levenshtein algorithm. Note that deletion errors cannot be handled under this framework and need to be treated separately [23, 13].
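As a concrete illustration, Eqns. 1-4 can be sketched with scalar states and hand-picked weights. This is a minimal sketch only: the real model uses vector states, trained matrices and richer input features, and all names and values below are ours, not the paper's.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative scalar parameters (hypothetical, not trained values).
W_h, W_x = 0.5, 1.0                  # history and input weights (Eqn. 1)
w_fwd, w_bwd, b_c = 1.0, 1.0, -1.0   # confidence output parameters (Eqn. 2)

def forward_states(xs):
    """Forward recursion of Eqn. 1 with scalar states."""
    h, states = 0.0, []
    for x in xs:
        h = sigmoid(W_h * h + W_x * x)
        states.append(h)
    return states

def confidences(xs):
    """Concatenate forward and backward states (Eqn. 3) and map each
    slot to a confidence in [0, 1] (Eqn. 2)."""
    fwd = forward_states(xs)
    bwd = list(reversed(forward_states(list(reversed(xs)))))
    return [sigmoid(w_fwd * f + w_bwd * b + b_c) for f, b in zip(fwd, bwd)]

def binary_cross_entropy(c, c_star):
    """Training criterion of Eqn. 4: mean binary cross-entropy."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(c, c_star)) / len(c)
```

In the paper's actual setup the recurrent states are produced by a bi-directional LSTM rather than this simple sigmoid recursion.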
This form of BiRNN has been examined for confidence estimation in [12, 13].

The perfect confidence estimator would assign scores of one and zero to correctly and incorrectly hypothesised words respectively. In order to measure the accuracy of confidence predictions, a range of metrics has been proposed. Among these, normalised cross-entropy (NCE) is the most frequently used [25]. NCE measures the relative change in the binary cross-entropy when the empirical estimate of ASR correctness, P_c, is replaced by the predicted confidences \mathbf{c} = c_1, \ldots, c_T. Using the definition of binary cross-entropy in Eqn. 4, NCE can be expressed as

    \mathrm{NCE}(\mathbf{c}, \mathbf{c}^*) = \frac{ H(P_c \cdot \mathbf{1}, \mathbf{c}^*) - H(\mathbf{c}, \mathbf{c}^*) }{ H(P_c \cdot \mathbf{1}, \mathbf{c}^*) }    (5)

where \mathbf{1} is a length-T vector of ones, and the empirical estimate of ASR correctness is given by

    P_c = \frac{1}{T} \sum_{t=1}^{T} c_t^*    (6)

When hypothesised confidence scores \mathbf{c} are systematically better than the estimate of ASR correctness P_c, NCE is positive. In the limit of perfect confidence scores, NCE approaches one.

NCE alone is not always the best metric for evaluating confidence estimators. This is because the theoretical limit of correct words being assigned a score of one and incorrect words a score of zero is not necessary for perfect operation of an upstream or downstream application. Often it is sufficient that the rank ordering of the predictions is such that all incorrect words fall below a certain threshold, and all correct words above it. This is the case, for instance, in various information retrieval tasks [26, 27]. A more suitable metric in such cases could be an area-under-a-curve (AUC)-type metric. For balanced data the chosen curve is often the receiver operating characteristic (ROC), whereas for imbalanced data, as is the case in this work, the precision-recall (PR) curve is normally used [28].
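Both evaluation metrics can be sketched directly from their definitions: NCE (Eqn. 5) compares the predicted scores against a constant-P_c baseline, and a PR curve is traced by computing precision and recall at a sweep of thresholds. This is a stdlib-only sketch with illustrative function names; the tie-breaking defaults at degenerate thresholds are our assumptions.

```python
import math

def binary_cross_entropy(c, c_star):
    """H(c, c*) of Eqn. 4: mean binary cross-entropy over T slots."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(c, c_star)) / len(c)

def nce(c, c_star):
    """Normalised cross-entropy of Eqn. 5; the baseline assigns the
    constant empirical correctness P_c (Eqn. 6) to every slot."""
    p_c = sum(c_star) / len(c_star)
    h_base = binary_cross_entropy([p_c] * len(c), c_star)
    return (h_base - binary_cross_entropy(c, c_star)) / h_base

def precision_recall(scores, labels, theta):
    """One (precision, recall) operating point of a PR curve: an arc is
    predicted correct when its confidence exceeds the threshold theta."""
    tp = sum(1 for s, y in zip(scores, labels) if s > theta and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > theta and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= theta and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Predictions no better than the constant P_c baseline give an NCE of about zero; sweeping theta over the score range yields the PR curves used for the AUC numbers reported later.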
The PR curve is obtained by plotting precision versus recall,

    \mathrm{Precision}(\theta) = \frac{TP(\theta)}{TP(\theta) + FP(\theta)}, \quad \mathrm{Recall}(\theta) = \frac{TP(\theta)}{TP(\theta) + FN(\theta)}    (7)

for a range of thresholds \theta, where TP are true positives, and FP and FN are false positives and negatives. When evaluating performance on lattices and confusion networks, these metrics are computed across all arcs in the network.

3. CONFUSION NETWORK AND LATTICE EXTENSIONS

A number of important downstream and upstream applications rely on accurate confidence scores in graph-like structures, such as confusion networks (CN) in Fig. 2b and lattices in Fig. 2c, where arcs connected by nodes represent hypothesised words. This section describes an extension of BiRNNs to CNs and lattices.

[Fig. 2: Standard ASR outputs: (a) one-best sequence, (b) confusion network, (c) word lattice]

Fig. 2b shows that, compared to 1-best sequences in Fig. 2a, each node in a CN may have multiple incoming arcs. Thus, a decision needs to be made on how to optimally propagate information to the outgoing arcs. Furthermore, any such approach would need to handle a variable number of incoming arcs. One popular approach [16, 15] is to use a weighted combination

    \overrightarrow{h}_t = \sum_i \alpha_t^{(i)} \overrightarrow{h}_t^{(i)}    (8)

where \overrightarrow{h}_t^{(i)} represents the history information associated with the i-th arc of the t-th CN bin and \alpha_t^{(i)} is the associated weight. A number of approaches can be used to set these weights. One simple approach is to set the weights of all arcs other than the one with the highest posterior to zero. This yields a model that, for 1-best hypotheses, has no advantage over the BiRNNs in Section 2.
Other simple approaches include average or normalised confidence score weights, \alpha_t^{(i)} = c_t^{(i)} / \sum_j c_t^{(j)}, where c_t^{(i)} is a word posterior probability, possibly mapped by decision trees. A more complex approach is an attention mechanism

    \alpha_t^{(i)} = \frac{\exp(z_t^{(i)})}{\sum_j \exp(z_t^{(j)})}, \quad \text{where } z_t^{(i)} = \sigma( \mathbf{w}^{(a)T} \overrightarrow{k}_t^{(i)} + b^{(a)} )    (9)

where \mathbf{w}^{(a)} and b^{(a)} are attention parameters and \overrightarrow{k}_t^{(i)} is a key. The choice of the key is important as it helps the attention mechanism decide which information should be propagated. It is not obvious a priori what the key should contain. One option is to include arc history information as well as some basic confidence score statistics:

    \overrightarrow{k}_t^{(i)} = [ \overrightarrow{h}_t^{(i)T} \; c_t^{(i)} \; \mu_t \; \sigma_t ]^T    (10)

where \mu_t and \sigma_t are the mean and standard deviation computed over the c_t^{(i)} at time t. At the next, (t+1)-th, CN bin the forward information associated with the i-th arc is updated by

    \overrightarrow{h}_{t+1}^{(i)} = \sigma( W^{(\overrightarrow{h})} \overrightarrow{h}_t + W^{(x)} x_{t+1}^{(i)} )    (11)

The confidence score for each CN arc is computed by

    c_t^{(i)} = \sigma( \mathbf{w}^{(c)T} h_t^{(i)} + b^{(c)} )    (12)

where h_t^{(i)} is an arc context vector

    h_t^{(i)} = [ \overrightarrow{h}_t^{(i)T} \; \overleftarrow{h}_t^{(i)T} ]^T    (13)

A summary of the dependencies in this model is shown in Fig. 1b for a CN with 1 arc in the t-th bin and 2 arcs in the (t+1)-th bin.

As illustrated in Fig. 2c, each node in a lattice marks a timestamp in an utterance and each arc represents a hypothesised word with its corresponding acoustic and language model scores. Although lattices do not normally obey a linear graph structure, if they are traversed in topological order, no changes are required to compute confidences over lattice structures. The way the information is propagated in these graph structures is similar to the forward-backward algorithm [18].
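The attention merge of Eqns. 8-10 can be sketched with scalar arc states (the real model uses vector states from the LSTM). The weight vector `w_a` here has four components to match the four key elements of Eqn. 10; all parameter values are illustrative assumptions.

```python
import math
import statistics

def attention_merge(arc_states, arc_posteriors, w_a, b_a):
    """Merge a variable number of incoming arc states (Eqns. 8-10).

    arc_states: scalar stand-ins for the forward states h_t^(i).
    arc_posteriors: word posteriors c_t^(i) used inside the keys.
    w_a, b_a: attention parameters of Eqn. 9 (illustrative values).
    """
    mu = statistics.mean(arc_posteriors)
    sd = statistics.pstdev(arc_posteriors)
    # Keys of Eqn. 10: arc state, its posterior, and bin-level statistics.
    keys = [(h, c, mu, sd) for h, c in zip(arc_states, arc_posteriors)]
    # Scores of Eqn. 9: z^(i) = sigma(w_a . k^(i) + b_a), then a softmax.
    zs = [1.0 / (1.0 + math.exp(-(sum(w * k for w, k in zip(w_a, key)) + b_a)))
          for key in keys]
    total = sum(math.exp(z) for z in zs)
    alphas = [math.exp(z) / total for z in zs]
    # Weighted combination of Eqn. 8.
    merged = sum(a, ) if False else sum(a * h for a, h in zip(alphas, arc_states))
    return merged, alphas
```

Because the softmax weights sum to one, the merged state is a convex combination of the incoming arc states; the "max", "mean" and "normalised posterior" alternatives compared later simply replace the softmax weights with fixed rules.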
In the forward-backward algorithm, the forward probability at time t+1 is

    \overrightarrow{h}_{t+1}^{(i)} = \overrightarrow{h}_t \, x_{t+1}^{(i)}, \quad \text{where } \overrightarrow{h}_t = \sum_j \alpha_{i,j} \overrightarrow{h}_t^{(j)}    (14)

Compared to Eqn. 8 and Eqn. 11, the forward recursion employs a different way of combining the features x_{t+1}^{(i)} and node states \overrightarrow{h}_t, and maintains stationary weights, i.e. the transition probabilities \alpha_{i,j}, for combining the arc states \overrightarrow{h}_t^{(j)}. In addition, each forward quantity \overrightarrow{h}_t^{(i)} in Eqn. 14 has a probabilistic meaning which the BiRNN state vector \overrightarrow{h}_t^{(i)} does not. Furthermore, unlike in the standard algorithm, the past information at the final node is not constrained to be equal to the future information at the initial node.

In order to train these models, each arc of a CN or lattice needs to be assigned an appropriate reference confidence value. For aligning a reference word sequence to another sequence, the Levenshtein algorithm can be used. The ROVER method has been used to iteratively align word sequences to a pivot reference sequence to construct CNs [29]. This approach can be extended to confusion network combination (CNC), which allows the merging of two CNs [30]. The reduced CNC alignment scheme proposed here uses a reference one-best sequence rather than a CN as the pivot, in order to tag CN arcs against a reference sequence. A soft loss for aligning reference word \omega_\tau with the t-th CN bin is used:

    \ell_t(\omega_\tau) = 1 - P_t(\omega_\tau)    (15)

where P_t(\omega) is the word posterior probability distribution associated with the CN bin at time t. The optimal alignment is then found by minimising the above loss.

The extension of the Levenshtein algorithm to lattices, though possible, is computationally expensive [31]. Therefore approximate schemes are normally used [32].
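The CN tagging step above can be sketched as a Levenshtein-style dynamic programme whose substitution cost is the soft loss of Eqn. 15. The unit insertion/deletion cost is our assumption for illustration; the paper does not specify it.

```python
def align_reference_to_cn(reference, cn_posteriors, ins_del_cost=1.0):
    """Align a reference word sequence to CN bins with the soft loss
    of Eqn. 15 and return the minimal total alignment loss.

    cn_posteriors: one dict per CN bin mapping word -> posterior P_t(w).
    Matching reference word w to bin t costs 1 - P_t(w); the unit
    insertion/deletion cost is an illustrative assumption.
    """
    T, R = len(cn_posteriors), len(reference)
    # dp[t][r]: minimal loss aligning the first t bins with the first
    # r reference words.
    dp = [[0.0] * (R + 1) for _ in range(T + 1)]
    for t in range(1, T + 1):
        dp[t][0] = t * ins_del_cost
    for r in range(1, R + 1):
        dp[0][r] = r * ins_del_cost
    for t in range(1, T + 1):
        for r in range(1, R + 1):
            sub = 1.0 - cn_posteriors[t - 1].get(reference[r - 1], 0.0)
            dp[t][r] = min(dp[t - 1][r - 1] + sub,       # tag bin with ref word
                           dp[t - 1][r] + ins_del_cost,  # bin left unmatched
                           dp[t][r - 1] + ins_del_cost)  # ref word skipped
    return dp[T][R]
```

Back-tracing the same table (omitted for brevity) recovers which bin each reference word tags, from which the 0/1 arc targets follow.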
Common to those schemes is the use of information about the overlap of lattice arcs and time-aligned reference words to compute the loss

    o_{t,\tau} = \max\left( 0, \frac{ \min\{e_\tau^*, e_t\} - \max\{s_\tau^*, s_t\} }{ \max\{e_\tau^*, e_t\} - \min\{s_\tau^*, s_t\} } \right)    (16)

where \{s_t, e_t\} and \{s_\tau^*, e_\tau^*\} are the start and end times of lattice arcs and time-aligned words respectively. In order to yield a "hard" 0 or 1 loss, a threshold can be set either on the loss or on the amount of overlap.

4. EXPERIMENTS

Evaluation was conducted on the IARPA Babel Georgian full language pack (FLP). The FLP contains approximately 40 hours of conversational telephone speech (CTS) for training and 10 hours for development. The lexicon was obtained using the automatic approach described in [33]. The automatic speech recognition (ASR) system combines 4 diverse acoustic models in a single recognition run [34]. The diversity is obtained through the use of different model types, a tandem and a hybrid, and features, multi-lingual bottlenecks extracted by IBM and RWTH Aachen from 28 languages. The language model is a simple n-gram estimated on acoustic transcripts and web data. As part of a larger consortium, this ASR system took part in the IARPA OpenKWS 2016 competition [35]. The development data was used to assess the accuracy of confidence estimation approaches. The data was split with a ratio of 8:1:1 into training, validation and test sets. The ASR system was used to produce lattices. Confusion networks were obtained from lattices using consensus decoding [8]. The word error rates of the 1-best sequences are 39.9% for lattices and 38.5% for confusion networks.

The input features for the standard bi-directional recurrent neural network (BiRNN) and the CN-based one (BiCNRNN) are the decision-tree-mapped posterior, duration and a 50-dimensional fastText word embedding [36] estimated from web data.
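Returning to the lattice arc tagging of Eqn. 16, the overlap ratio and its "hard" thresholded tag can be sketched as follows. The 0.5 threshold is an illustrative choice (the paper adopts a global threshold without stating its value), and word-identity matching between arc and reference is omitted here.

```python
def overlap_ratio(arc, ref):
    """Overlap ratio o_{t,tau} of Eqn. 16 between a lattice arc and a
    time-aligned reference word, each given as a (start, end) pair."""
    (s_t, e_t), (s_ref, e_ref) = arc, ref
    inter = min(e_ref, e_t) - max(s_ref, s_t)   # length of the overlap
    union = max(e_ref, e_t) - min(s_ref, s_t)   # length of the spanned interval
    return max(0.0, inter / union)

def tag_arc(arc, ref, threshold=0.5):
    """'Hard' 0/1 reference target from the soft overlap; the threshold
    value is illustrative, not the one used in the paper."""
    return 1 if overlap_ratio(arc, ref) >= threshold else 0
```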
The lattice-based BiRNN (BiLatRNN) makes additional use of acoustic and language model scores. All forms of BiRNNs contain one [\overrightarrow{128}, \overleftarrow{128}]-dimensional bi-directional LSTM layer and one 128-dimensional feed-forward hidden layer. The implementation uses the PyTorch library and is available online.¹ For efficient training, model parameters are updated using Hogwild! stochastic gradient descent [37], which allows asynchronous updates on multiple CPU cores in parallel.

¹ https://github.com/qiujiali/lattice_rnn

Table 1 shows the NCE and AUC performance of confidence estimation schemes on 1-best hypotheses extracted from CNs. As expected, "raw" posterior probabilities yield poor NCE results although AUC performance is high. The decision tree, as expected, improves NCE and does not affect AUC due to the monotonicity of the mapping. The BiRNN yields gains over the simple decision tree, which is consistent with the previous work in the area [12, 13].

Estimator              NCE       AUC
1-best CN posteriors   -0.1978   0.9081
+ decision tree         0.2755   0.9081
+ BiRNN                 0.2947   0.9197

Table 1: Confidence estimation performance on 1-best CN arcs

The next experiment examines the extension of BiRNNs to confusion networks. The BiCNRNN uses a similar model topology, merges incoming arcs using the attention mechanism described in Section 3 and uses the Levenshtein algorithm with the loss given by Eqn. 15 to obtain reference confidence values. The model parameters are estimated by minimising the average binary cross-entropy loss on all CN arcs. The performance is evaluated over all CN arcs. When transitioning from 1-best arcs to all CN arcs, the AUC performance is expected to drop due to an increase in the Bayes risk. Table 2 shows that the BiCNRNN yields gains similar to those of the BiRNN in Table 1.
Estimator           NCE      AUC
all CN posteriors   0.3105   0.8243
+ decision tree     0.4659   0.8243
+ BiCNRNN           0.4970   0.8365

Table 2: Confidence estimation performance on all CN arcs

As mentioned in Section 3, there are alternatives to attention for merging incoming arcs. Table 3 shows that mean and normalised posterior weights may provide a competitive alternative.²

Merge                  NCE      AUC
max                    0.4933   0.8350
mean                   0.4966   0.8364
normalised posterior   0.4969   0.8363
attention              0.4970   0.8365

Table 3: Comparison of BiCNRNN arc merging mechanisms

Extending BiRNNs to lattices requires choosing a loss function and a method of setting reference values for lattice arcs. A simple global threshold on the amount of overlap between reference time-aligned words and lattice arcs is adopted to tag arcs. This scheme yields a false negative rate of 2.2% and a false positive rate of 0.9% on 1-best CN arcs, and 1.4% and 0.7% on 1-best lattice arcs. Table 4 shows the impact of using the approximate loss in training the BiCNRNN. The results suggest that the mismatch between training and testing criteria, i.e. approximate in training and Levenshtein in testing, could play a significant role in BiLatRNN performance.

Using this approximate scheme, a BiLatRNN was trained on lattices. Table 5 compares BiLatRNN performance to "raw" posteriors and decision trees. As expected, lower AUC performances are observed due to the higher Bayes risk in lattices compared to CNs. The "raw" posteriors offer poor confidence estimates, as can be seen from the large negative NCE and low AUC. The decision tree yields significant gains in NCE and no change in AUC performance. Note that the AUC for a random classifier on this data is 0.2466. The BiLatRNN yields very large gains in both NCE and AUC performance.

² With lattices, the attention mechanism outperforms other arc merging methods more significantly, as reported in Table 5.
Method        NCE      AUC
Levenshtein   0.4970   0.8365
approximate   0.4873   0.8321

Table 4: Comparison of BiCNRNN arc tagging schemes

Estimator                    NCE       AUC
all lattice arc posteriors   -5.0386   0.2251
+ decision tree              -0.0889   0.2251
+ BiLatRNN (post)             0.3880   0.7507
+ BiLatRNN (attn)             0.3921   0.7537

Table 5: Confidence estimation performance on all lattice arcs

As mentioned in Section 1, applications such as language learning and information retrieval rely on confidence scores to give high-precision feedback [38] or high-recall retrieval [26, 27]. Therefore, Fig. 3 shows precision-recall curves for the BiRNN in Table 1 and the BiLatRNN in Table 5. Fig. 3a shows that the BiRNN yields the largest gain in the region of high precision and low recall, which is useful for feedback-like applications. The BiLatRNN in Fig. 3b can be seen to significantly improve precision in the high-recall region, which is useful for some retrieval tasks.

[Fig. 3: Precision-recall curves for Table 1 and Table 5: (a) 1-best CN arcs, BiRNN (AUC 0.9197) vs. posterior (0.9081); (b) all lattice arcs, BiLatRNN (0.7537) vs. posterior (0.2251)]

5. CONCLUSIONS

Confidence scores play an important role in many applications of spoken language technology. The standard form of confidence scores is decision-tree-mapped word posterior probabilities. A number of approaches have been proposed to improve confidence estimation, such as bi-directional recurrent neural networks (BiRNNs). BiRNNs, however, can predict confidences of sequences only, which limits their more general application to 1-best hypotheses. This paper extends BiRNNs to confusion network (CN) and lattice structures.
In particular, it proposes to use an attention mechanism to combine a variable number of incoming arcs, shows how the recursions are linked to the standard forward-backward algorithm, and describes how to tag CN and lattice arcs with reference confidence values. Experiments were performed on a challenging limited-resource IARPA Babel Georgian pack and show that the extended forms of BiRNNs yield significant gains in confidence estimation accuracy over all arcs in CNs and lattices. Many related applications, such as information retrieval, speaker adaptation, keyword spotting and semi-supervised training, will benefit from the improved confidence measures.

6. REFERENCES

[1] W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke, "The Microsoft 2017 conversational speech recognition system," in ICASSP, 2018.
[2] B. Li, T. N. Sainath, A. Narayanan, J. Caroselli, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. K. Chin, K. C. Sim, R. J. Weiss, K. W. Wilson, E. Variani, C. Kim, O. Siohan, M. Weintraub, E. McDermott, R. Rose, and M. Shannon, "Acoustic modeling for Google Home," in Interspeech, 2017.
[3] F. Wessel, R. Schluter, K. Macherey, and H. Ney, "Confidence measures for large vocabulary continuous speech recognition," IEEE Transactions on Speech and Audio Processing, 2001.
[4] H. Jiang, "Confidence measures for speech recognition: A survey," Speech Communication, 2005.
[5] L. F. Uebel and P. C. Woodland, "Speaker adaptation using lattice-based MLLR," in ISCA Tutorial and Research Workshop on Adaptation Methods for Speech Recognition, 2001.
[6] H. Y. Chan and P. C. Woodland, "Improving broadcast news transcription by lightly supervised discriminative training," in ICASSP, 2004.
[7] G. Tür, D. Z. Hakkani-Tür, and R. E. Schapire, "Combining active and semi-supervised learning for spoken language understanding," Speech Communication, 2005.
[8] L. Mangu, E. Brill, and A. Stolcke, "Finding consensus in speech recognition: word error minimization and other applications of confusion networks," Computer Speech & Language, 2000.
[9] G. Evermann and P. C. Woodland, "Large vocabulary decoding and confidence estimation using word posterior probabilities," in ICASSP, 2000.
[10] M. S. Seigel and P. C. Woodland, "Combining information sources for confidence estimation with CRF models," in Interspeech, 2011.
[11] K. Kalgaonkar, C. Liu, Y. Gong, and K. Yao, "Estimating confidence scores on ASR results using recurrent neural networks," in ICASSP, 2015.
[12] M. A. Del-Agua, A. Gimenez, A. Sanchis, J. Civera, and A. Juan, "Speaker-adapted confidence measures for ASR using deep bidirectional recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
[13] A. Ragni, Q. Li, M. J. F. Gales, and Y. Wang, "Confidence estimation and deletion prediction using bidirectional recurrent neural networks," in SLT, 2018.
[14] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Transactions on Neural Networks, 2009.
[15] J. Su, Z. Tan, D. Xiong, and Y. Liu, "Lattice-based recurrent neural network encoders for neural machine translation," in AAAI, 2017.
[16] F. Ladhak, A. Gandhe, M. Dreyer, L. Mathias, A. Rastrow, and B. Hoffmeister, "LatticeRNN: Recurrent neural networks over lattices," in Interspeech, 2016.
[17] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, 1997.
[18] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, 1989.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NIPS, 2017.
[20] T. Schaaf and T. Kemp, "Confidence measures for spontaneous speech recognition," in ICASSP, 1997.
[21] M. Weintraub, F. Beaufays, Z. Rivlin, Y. Konig, and A. Stolcke, "Neural-network based measures of confidence for word recognition," in ICASSP, 1997.
[22] J. Ma and S. Matsoukas, "Unsupervised training on a large amount of Arabic broadcast news data," in ICASSP, 2007.
[23] M. S. Seigel and P. C. Woodland, "Detecting deletions in ASR output," in ICASSP, 2014.
[24] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in NIPS, 2013.
[25] M. Siu, H. Gish, and F. Richardson, "Improved estimation, evaluation and applications of confidence measures for speech recognition," in Fifth European Conference on Speech Communication and Technology, 1997.
[26] M. J. F. Gales, K. M. Knill, and A. Ragni, "Low-resource speech recognition and keyword-spotting," in SPECOM, 2017.
[27] A. Ragni and M. J. F. Gales, "Automatic speech recognition system development in the 'wild'," in Interspeech, 2018.
[28] J. Davis and M. Goadrich, "The relationship between precision-recall and ROC curves," in ICML, 2006.
[29] J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in IEEE Workshop on Automatic Speech Recognition and Understanding, 1997.
[30] G. Evermann and P. C. Woodland, "Posterior probability decoding, confidence estimation and system combination," in NIST 2000 Speech Transcription Workshop, 2000.
[31] H. Xu, D. Povey, L. Mangu, and J. Zhu, "Minimum Bayes risk decoding and system combination based on a recursion for edit distance," Computer Speech & Language, 2011.
[32] D. Povey and P. C. Woodland, "Minimum phone error and I-smoothing for improved discriminative training," in ICASSP, 2002.
[33] M. J. F. Gales, K. M. Knill, and A. Ragni, "Unicode-based graphemic systems for limited resource languages," in ICASSP, 2015.
[34] H. Wang, A. Ragni, M. J. F. Gales, K. M. Knill, P. C. Woodland, and C. Zhang, "Joint decoding of tandem and hybrid systems for improved keyword spotting on low resource languages," in Interspeech, 2015.
[35] A. Ragni, C. Wu, M. J. F. Gales, J. Vasilakes, and K. M. Knill, "Stimulated training for automatic speech recognition and keyword search in limited resource conditions," in ICASSP, 2017.
[36] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," arXiv preprint arXiv:1607.04606, 2016.
[37] B. Recht, C. Re, S. Wright, and F. Niu, "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent," in NIPS, 2011.
[38] K. M. Knill, M. J. F. Gales, K. Kyriakopoulos, A. Malinin, A. Ragni, Y. Wang, and A. Caines, "Impact of ASR performance on free speaking language assessment," in Interspeech, 2018.
