Phonetic and Graphemic Systems for Multi-Genre Broadcast Transcription


Authors: Y. Wang, X. Chen, M. J. F. Gales, A. Ragni and J. H. M. Wong

Cambridge University Engineering Dept., Trumpington St., Cambridge CB2 1PZ, U.K.
Email: {yw396, xc257, mjfg, ar527, jhmw2}@eng.cam.ac.uk

ABSTRACT

State-of-the-art English automatic speech recognition systems typically use phonetic rather than graphemic lexicons. Graphemic systems are known to perform less well for English as the mapping from the written form to the spoken form is complicated. However, in recent years the representational power of deep-learning based acoustic models has improved, raising interest in graphemic acoustic models for English, due to the simplicity of generating the lexicon. In this paper, phonetic and graphemic models are compared for an English Multi-Genre Broadcast transcription task. A range of acoustic models based on lattice-free MMI training are constructed using phonetic and graphemic lexicons. For this task, it is found that having a long-span temporal history reduces the difference in performance between the two forms of models. In addition, system combination is examined, using parameter smoothing and hypothesis combination. As the combination approaches become more complicated, the difference between the phonetic and graphemic systems further decreases. Finally, for all configurations examined, the combination of phonetic and graphemic systems yields consistent gains.

Index Terms: Speech recognition, graphemic lexicon, lattice-free MMI, model combination

1. INTRODUCTION

Hidden Markov model (HMM) based automatic speech recognition (ASR) systems are typically built using sub-word units, such as phones or graphemes. System performance depends on an appropriate definition of sub-word units and the accuracy, and consistency, of decomposing words into these sub-word units.
Phonetic lexicons provide a mapping between the orthographic representation of a word, a sequence of letters (graphemes), and a sequence of phones. However, generation of these lexicons requires linguistic knowledge of the target language, which is time-consuming and expensive. On the other hand, graphemic lexicons are attractive as the graphemes are used directly. Moreover, graphemic lexicons can be easily expanded to include out-of-vocabulary (OOV) words, unlike phonetic lexicons. For languages with a close grapheme-to-phone mapping, graphemic HMM-based systems have been shown to perform similarly to phonetic systems [1, 2, 3]. However, for languages with irregular grapheme-to-phone mappings, such as English, graphemic HMM-based systems normally perform significantly worse than their phonetic counterparts [2]. This is not surprising, as the system relies on the acoustic model to implicitly capture the irregularities of the graphemic-to-acoustic realisation. When more powerful deep-learning based acoustic models are used, such as connectionist temporal classification (CTC) [4], which model long-span temporal information, the gap between graphemic and phonetic systems is small on a read English task [5].

This paper aims to find out whether recent deep-learning based acoustic models, which also model long-span temporal information, allow HMM-based graphemic systems to perform at the same level of accuracy as phonetic systems for English. A range of models is available, including long short-term memory (LSTM) networks [6], convolutional neural networks [7], time-delay neural networks (TDNNs) [8] and bidirectional LSTM networks [9].

(This research was partly funded under the ALTA Institute, University of Cambridge. Thanks to Cambridge English, University of Cambridge, for supporting this research.)
Additionally, various layer-wise combination schemes allow the advantages of several models to be leveraged [10, 11]. These models also offer flexibility in terms of the span of the temporal information that they can capture. For instance, the interleaved TDNN-LSTM model [11] extends the temporal span of the LSTM model with a wide window into the future. These models can also be efficiently trained directly from random initialisation by using approaches such as lattice-free maximum mutual information (LF-MMI) estimation. This often results in improved performance over state-level minimum Bayes' risk (sMBR) trained models [12].

These complex models are likely to have, possibly significant, variations in ASR performance depending on the choice of training hyper-parameters. This variation in system performance can be taken advantage of using system combination [13]. This paper will examine two forms of system combination with different complexities and costs. The first is a random ensemble method [14], which utilises multiple training runs with different random seeds to produce slightly different yet complementary systems. The second is model smoothing [15], which interpolates a number of intermediate model parameters using weights estimated on a subset of the training data. Finally, graphemic systems, if competitive with the phonetic system, should be complementary to phonetic systems.

The rest of this paper is organized as follows. Sections 2 and 3 describe graphemic acoustic models and model combination approaches respectively. Section 4 details the experiments conducted on an English multi-genre broadcast transcription task with the phonetic and graphemic models as well as using different combination approaches. Finally, conclusions are given in Section 5.

2. GRAPHEMIC ENGLISH SYSTEMS

2.1. Graphemic lexicon

At the core of any graphemic system is the graphemic lexicon.
[To appear in Proc. ICASSP 2018, April 15-20, 2018, Calgary, Canada. (c) IEEE 2018]

For English, it is straightforward to form this from the 26 alphabet letters /a-z/. In addition to these base graphemes, it may also be useful to mark additional attributes such as apostrophes (DA) and abbreviations (DB). Excerpts from phonetic and graphemic lexicons are:

Phonetic Lexicon
  B.B.C.'s      /b/ /iy/ /b/ /iy/ /s/ /iy/ /z/
  information   /ih/ /n/ /f/ /ax/ /m/ /ey/ /sh/ /en/
  moon          /m/ /uw/ /n/
  the           /dh/ /ax/

Graphemic Lexicon
  B.B.C.'s      b;DB b;DB c;DADB s
  information   i n f o r m a t i o n
  moon          m o o n
  the           t h e

From the first entry, the use of abbreviation and apostrophe attributes potentially allows the graphemic system to handle the discrepancy between the pronounced and written forms. The other three examples illustrate situations where graphemic systems may struggle to model letter omission ('r' in information), vowel ('oo' in moon), consonant ('th' in the) and vowel-consonant ('tio' in information) recombination. Though some of these issues can be handled using context-dependent models, e.g. bi-graphemes and tri-graphemes, for others the length of the context necessary for disambiguation will be prohibitively large. For example, the phonetic lexicon used in Section 4 associates three phones, /dh/, /th/ and /t/, with the sound corresponding to the grapheme sequence 'th'. This problem is further compounded by the fact that the following grapheme 'e', depending on its neighbor, is represented by 9 different vowel/consonant phones. The examples given in this section suggest that for graphemic systems, context modelling may be even more important than it is for phonetic systems [16].

2.2. Acoustic Model Structure

Rather than solely relying on acoustic modelling units to handle the intricate grapheme-to-phone rules, it is also possible to examine acoustic models capable of modelling long-span temporal information.
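Returning briefly to the lexicon of Section 2.1: the attribute-marking scheme in the excerpt above can be sketched as a simple rule, splitting a word into its letters and attaching apostrophe (DA) and abbreviation-period (DB) attributes to the preceding grapheme. This is only an illustration of the format shown above; the paper does not specify its exact lexicon-generation rules, and the function name is ours.

```python
def graphemic_entry(word):
    """Map a word to a grapheme sequence in the style of the excerpt above.
    Apostrophes (DA) and abbreviation periods (DB) are attached as
    attributes of the preceding grapheme. A rough sketch, not the exact
    rules used in the paper."""
    letters = []  # list of [grapheme, set of attributes]
    for ch in word.lower():
        if ch.isalpha():
            letters.append([ch, set()])
        elif ch == "'" and letters:
            letters[-1][1].add("DA")
        elif ch == "." and letters:
            letters[-1][1].add("DB")
    out = []
    for g, attrs in letters:
        # emit DA before DB, matching the ordering in the excerpt
        tags = "".join(a for a in ("DA", "DB") if a in attrs)
        out.append(g + (";" + tags if tags else ""))
    return " ".join(out)

print(graphemic_entry("moon"))      # → m o o n
print(graphemic_entry("B.B.C.'s"))  # → b;DB b;DB c;DADB s
```
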
In a deep neural network (DNN) acoustic model [17], only a small number of preceding and succeeding frames are typically used to predict the current state, $s_t$, as shown in (1):

(T)DNN:  $P(s_t \mid O_{1:T}) \approx P\left(s_t \mid o_{t-\tau^{(l)}}, \ldots, o_{t+\tau^{(r)}}\right)$   (1)

LSTM:    $P(s_t \mid O_{1:T}) \approx P\left(s_t \mid o_{1}, \ldots, o_{t+\tau^{(r)}}\right)$   (2)

where the left $\tau^{(l)}$ and right $\tau^{(r)}$ context window sizes are typically less than 10. A TDNN [8] has a more complex structure that enables it to cover a significantly larger number of preceding and succeeding frames without significantly increasing the number of model parameters. For example, the model considered in Section 4 uses $\tau^{(l)} = 15$ past and $\tau^{(r)} = 10$ future frames. The use of recurrent units in an LSTM network, described in equation (2), allows even longer-span temporal information to be modelled. Note that in practice the past information is typically truncated after some fixed, yet large, number of frames (40 in this work). Furthermore, the TDNN-LSTM model [11], obtained by interleaving TDNN layers [8] with LSTM layers [6], increases the context window to 50 frames into the past and 20 frames into the future. In addition to being more powerful classifiers, these advanced deep-learning based acoustic models can thus utilise a significantly longer span of temporal information than that used in previous work with Gaussian mixture models.

3. MODEL COMBINATION

Training the deep neural network models discussed in Section 2 is a complicated process involving highly non-convex optimisation. There may thus be large variations between the behaviours of intermediate models from iteration to iteration, or between final models when originating from different starting points. The latter is likely to be larger when models are trained from different random initialisations using the LF-MMI criterion, as there is no cross-entropy initialisation stage with common targets for all systems.
Such variation typically results in the models making different predictions. Depending on the level of useful variation, such diverse predictions may help to resolve confusions. This serves as the basis for various system combination approaches [18, 19, 20, 21].

3.1. Ensemble Methods

A combination of an ensemble of diverse and yet individually accurate systems can often result in significant gains [22]. Common methods to introduce diversity include random parameter initialisation for ASR [13, 14], bagging [23] and random decision trees [24]. Using different random initialisations has been shown to be a simple and efficient approach of introducing diversity [13, 14]. In [14], this method was able to provide significant diversity while keeping a similar performance across the systems. Thus, combining the systems in the ensemble could yield strong gains.

A less common method of ensemble generation is to take a number of intermediate models during training and interpolate their parameters:

$\Phi = \sum_{m=1}^{M} \alpha_m \Phi_m$   (3)

where $M$ is the number of models, $\Phi_m$ represents the parameters of the $m$-th model and $\alpha_m$ represents its combination weight. This is the idea behind model smoothing [15], designed to reduce unwanted variations during the training. The models are normally selected from the later stages of training using a fixed iteration interval between the selected models (6 in this work). The combination weights are associated with the individual layers and optimised on a subset of training examples. The combination weights are constrained to sum to 1. Though generally it is hard to ensure that the combined model would improve over the final trained model, this paper shows that large performance improvements are possible.
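The per-layer interpolation of equation (3) can be sketched as follows. This is an illustration of the idea, not Kaldi's actual model-smoothing implementation; the function and layer names are invented, and checkpoints are represented as plain dictionaries of NumPy arrays.

```python
import numpy as np

def smooth_models(checkpoints, layer_weights):
    """Interpolate intermediate model parameters, as in equation (3).

    checkpoints:   list of M dicts, each mapping layer name -> weight array
                   (intermediate models taken at a fixed iteration interval).
    layer_weights: dict mapping layer name -> M interpolation weights that
                   sum to 1; in the paper these are estimated per layer on
                   a subset of the training data.
    """
    smoothed = {}
    for layer in checkpoints[0]:
        alpha = np.asarray(layer_weights[layer], dtype=float)
        assert np.isclose(alpha.sum(), 1.0), "per-layer weights must sum to 1"
        smoothed[layer] = sum(a * c[layer] for a, c in zip(alpha, checkpoints))
    return smoothed
```

For example, with three checkpoints whose "tdnn1" layer is all-ones, all-twos and all-threes, weights (0.5, 0.25, 0.25) give a smoothed layer of 1.75 everywhere.
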
To measure the diversity of the generated systems, it is possible to use the cross word error rate (cWER) [13]:

$\mathrm{cWER} = \frac{1}{M(M-1)} \sum_{m=1}^{M} \sum_{n \neq m} \frac{1}{\sum_{r=1}^{R} |W_r^n|} \sum_{r=1}^{R} \mathcal{L}\left(W_r^m, W_r^n\right)$   (4)

where $W_r^m$ represents the 1-best hypothesis of the $r$-th utterance, using the $m$-th model, and $R$ is the total number of utterances. The cWER measures how different the 1-best hypotheses are between models and was found to be more correlated with the combination gains than the standard deviation of WERs [13].

3.2. Minimum Bayes Risk Combination

It is only possible to use model smoothing in equation (3) for combining iterations of the same model training run. A more general system combination approach is hypothesis-level combination. Examples of this form of approach are: ROVER [18]; confusion network combination (CNC) [19]; and minimum Bayes risk (MBR) combination [20]. In this work MBR combination is used, which finds the word sequence that attempts to minimise the expected WER across the systems being combined [20]:

$\hat{W} = \arg\min_{W} \left\{ \sum_{m=1}^{M} \lambda_m \sum_{W' \in \mathcal{H}} P\left(W' \mid O_{1:T}; \Phi_m\right) \mathcal{L}\left(W, W'\right) \right\}$   (5)

where $\lambda_m$ are the combination weights, $P(W \mid O_{1:T}; \Phi_m)$ is the posterior probability of the word sequence $W$ given the observation sequence $O_{1:T}$ and the acoustic model $\Phi_m$, $\mathcal{H}$ is a set of hypotheses, and $\mathcal{L}(W, W')$ represents the Levenshtein distance between two word sequences $W$ and $W'$. Though more computationally expensive, MBR combination has been shown to perform better than ROVER combination and CNC.

4. EXPERIMENTS

Experiments were conducted using the data from the 2017 English Multi-Genre Broadcast (MGB-3) challenge. The data was supplied by the British Broadcasting Corporation (BBC) and consists of audio from BBC television programmes. The data contains a wide range of genres such as comedy, drama and sports shows.
A total of 375 hours of audio data with associated subtitles is available for acoustic model training. Lightly supervised decoding and selection was used to extract 275 hours for training [25, 26]. A 6-hour development set, dev17b, was also supplied. The acoustic model features were 40-dimensional Mel-filter bank features normalised using utterance-level mean normalisation and show-segment-level variance normalisation [26]. Around 3600 left bi-phone dependent states were used as targets. The results are based on automatic audio segmentation using a DNN-based segmenter [27] trained on the MGB-3 data.

To examine the impact of the acoustic model complexity on phonetic and graphemic system performance, a range of acoustic models of different topology and spans of temporal information were built. These include feed-forward DNN, sub-sampled TDNN, unidirectional LSTM and interleaved TDNN-LSTM models. The DNN models had 7 hidden layers of 600-dimensional sigmoid units and an input context window spanning from 10 frames into the past to 10 frames into the future. The TDNN models had 7 layers of 600-dimensional rectified linear units (ReLU) and a wider input context window spanning from 15 frames into the past to 10 frames into the future.¹ The LSTM model had 3 LSTMP layers, each with 512-dimensional cells and 128-dimensional recurrent and non-recurrent projections. The effective temporal information window for the LSTM spans from 40 frames into the past to 7 frames into the future. The interleaved TDNN-LSTM models had 9 layers of 600-dimensional ReLU units.² The TDNN-LSTM model has the widest temporal information window, starting from 50 frames into the past and ending at 20 frames into the future. All models were trained using the LF-MMI criterion on a single GPU [28] using the Kaldi toolkit [15]. For this work, only speaker-independent systems were used.
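The TDNN context window quoted above follows directly from the per-layer splicing indexes given in footnote 1: the total left (right) context is the sum of the most negative (most positive) offset at each layer. A small sketch of this bookkeeping (the helper name is ours, not a Kaldi API):

```python
def effective_context(splicing):
    """Accumulate per-layer splicing offsets of a sub-sampled TDNN to get
    the total context (frames into the past, frames into the future)."""
    left = sum(-min(offsets) for offsets in splicing)
    right = sum(max(offsets) for offsets in splicing)
    return left, right

# Per-layer splicing indexes of the TDNN, as given in footnote 1.
tdnn_splicing = [(-1, 0, 1), (-1, 0, 1), (-1, 0, 1, 2),
                 (-3, 0, 3), (-3, 0, 3), (-6, -3, 0), (0,)]

print(effective_context(tdnn_splicing))  # → (15, 10)
```

This reproduces the τ(l) = 15 past and τ(r) = 10 future frames stated in Section 2.2 for the TDNN.
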
For the first-pass decoding language model, a 3-gram language model with a 64K-word lexicon was used. This was trained on the audio subtitles and 650M words of supplied BBC subtitles. In addition, a recurrent neural network language model (RNNLM) [29] was also used to refine the result of the first-pass decoding. The CUED-RNNLM Toolkit v1.0 [30] was used to train the RNNLM using 1 layer of 1024-dimensional GRU units. Given the vocabulary size (64K) and quantity of training data (650M words), noise contrastive estimation (NCE) was adopted to speed up training and evaluation [31]. At test time, a 4-gram approximation [32] of the RNNLM was used to rescore 4-gram lattices. As the RNNLM was trained with NCE, the unnormalised output layer probabilities were used in rescoring, which provided a large speed-up. MBR decoding/combination was used to produce the final output. Unless stated otherwise, performance with the 3-gram model is quoted.

¹ The splicing indexes per layer can be described as {-1,0,1} {-1,0,1} {-1,0,1,2} {-3,0,3} {-3,0,3} {-6,-3,0} {0}, using the notation of [8, 11].
² The architecture can be described as {-2,-1,0,1,2} {-1,0,1} L {-3,0,3} {-3,0,3} L {-3,0,3} {-3,0,3} L, where L represents an LSTMP layer with 512 cells and 128-dimensional recurrent and non-recurrent projections, using the notation of [8, 11].

4.1. Phonetic and Graphemic Models

  Model     | Lexicon | Single %WER | %Rel  | Ph/Gr Comb %WER | %Rel
  DNN       | Ph      | 27.8        | —     | 26.3            | -5.4
            | Gr      | 30.7        | +10.4 |                 |
  TDNN      | Ph      | 24.4        | —     | 23.0            | -5.7
            | Gr      | 26.9        | +10.3 |                 |
  LSTM      | Ph      | 25.0        | —     | 23.2            | -7.2
            | Gr      | 26.7        | +6.8  |                 |
  TDNN-LSTM | Ph      | 23.4        | —     | 21.7            | -7.3
            | Gr      | 25.0        | +6.8  |                 |

Table 1. %WER of phonetic and graphemic systems and their MBR combination on dev17b.

The impact of the acoustic model on the performance difference between phonetic (Ph) and graphemic (Gr) systems is illustrated in Table 1. The second column shows the relative degradation in performance of the graphemic system.
As the complexity of the model and the span of available temporal information increases, the difference between phonetic and graphemic system WERs drops from 10.4 to 6.8% relative. The largest drop happens when the LSTM units are used to model longer history information. This implies that graphemic systems are more sensitive to shorter histories than are phonetic systems. The third column in Table 1 shows that as the graphemic system gets more competitive, the gain from combining it with the phonetic system increases from 5.4 to 7.3% relative.

  Model | Context       | %WER | RTF
  Ph    | Bi-phone      | 23.4 | 0.9
        | Mono-phone    | 23.9 | 0.7
  Gr    | Bi-grapheme   | 25.0 | 0.8
        | Mono-grapheme | 26.2 | 0.6

Table 2. %WER of context-dependent and context-independent phonetic and graphemic TDNN-LSTM models on dev17b.

Graphemic systems are also expected to be sensitive to the choice of acoustic modelling context. Wider contexts should be more suitable for graphemic systems as they can better account for the mismatch between the orthographic and spoken forms. However, shorter contexts are appealing due to their simplicity and speed of training as well as decoding. Table 2 shows that phonetic systems are significantly more robust when bi-phone units are replaced with mono-phone units. Though mono-grapheme units yield twice as large a degradation as mono-phone units, the simplicity of graphemic lexicons offers an interesting compromise.³ Both context-independent systems are approximately 25% faster than their context-dependent counterparts, as shown by the real-time factor (RTF) in Table 2.

4.2. Model combination

  Training Criterion | %WER | Comb %WER | Comb %Rel
  sMBR               | 23.7 | 21.3      | -9.0
  LF-MMI             | 23.4 |           |

Table 3. %WER of sMBR and LF-MMI trained phonetic TDNN-LSTM models and their MBR combination on dev17b.

Rather than combining phonetic and graphemic systems, it is also possible to combine systems from any diverse ensemble, as discussed in Section 3.
One simple way to produce an additional system is to utilise an alternative training criterion such as sMBR training. Table 3 shows that sMBR training yields a competitive model, and that the combination gain between systems trained with these two criteria is larger than that between the phonetic and graphemic systems in Table 1. This can partly be attributed to the larger performance differences between the phonetic and graphemic systems being combined.

Additional systems can also be generated using simpler approaches. For example, the use of model smoothing does not require another model to be trained. In this work, 20 models with an iteration interval of 6 were taken from the final epoch of LF-MMI training, and their combination weights were estimated on a subset of training data, as discussed in Section 3. Table 4 shows that model smoothing is an effective way to improve system performance for both graphemic and phonetic systems. Additionally, by performing model smoothing the difference between the phonetic and graphemic systems is reduced (+6.0%). Though the gains from combining phonetic and graphemic systems decrease after model smoothing, dropping from 7.3 to 5.6% relative, there is still a large gain in performance in the combined systems after model smoothing, yielding better performance than modifying the training criterion.

  Model  | Ph %WER | Gr %WER | Ph/Gr Comb %WER | %Rel
  Single | 23.4    | 25.0    | 21.7            | -7.3
  Smooth | 21.5    | 22.8    | 20.3            | -5.6

Table 4. %WER of phonetic and graphemic TDNN-LSTM models with and without model smoothing on dev17b.

Alternatively, random ensembles [14] can be built by changing the random seed used to initialise models for LF-MMI training. This is more expensive than model smoothing, but allows additional diversity to be introduced.
LF-MMI training may benefit more from random ensemble generation, as it avoids the cross-entropy initialisation stage of approaches such as sMBR training, where the same targets are normally used for all systems, possibly reducing the diversity of the final systems after sequence training. In this work, an ensemble of 3 TDNN-LSTM models was created by building 2 additional models using different seeds for random parameter initialisation. Table 5 shows that although the WER standard deviation across systems is small, the cWER is large, suggesting that these systems may be complementary. To put the cWER number in context, an ensemble of sMBR-trained models on the AMI IHM task with a mean WER of 25% had a cWER of 12%. The last block in Table 5 shows that ensemble combination of multiple single models yields the largest gains of the approaches examined in this work.

Given the large gains from model smoothing, it is interesting to examine ensembles of smoothed models. These are also shown in Table 5. As expected, the cWERs for the ensembles are reduced, as model smoothing reduces the diversity arising from the precise stopping points. However, there are still large gains of over 7% relative from the ensemble combination. Additionally, the difference between phonetic and graphemic, smoothed or unsmoothed, systems when combining random ensembles has been reduced to just 5% relative.

  Model      | %WER (µ) | %WER (σ) | %cWER | Ensemble Comb %WER | %Rel
  Ph, Single | 23.5     | 0.06     | 17.9  | 20.9               | -11.1
  Ph, Smooth | 21.6     | 0.10     | 13.5  | 20.0               | -7.4
  Gr, Single | 25.0     | 0.10     | 20.4  | 22.1               | -11.6
  Gr, Smooth | 22.8     | 0.06     | 14.9  | 21.0               | -7.9

Table 5. %WER of phonetic and graphemic random ensembles of TDNN-LSTM models with and without model smoothing on dev17b.

Given the small difference between the phonetic and graphemic ensembles, additional gains from combining the systems might be expected.
However, the extensive use of combination techniques means that the diversity between these ensembles has already been significantly reduced. Table 6 shows that combining the phonetic and graphemic ensembles yields only a 0.5% absolute, or 2.5% relative, reduction in WER.³ At this point, it is interesting to see if improved language modelling approaches can yield further benefits. Table 6 also shows that 4-gram LM rescoring reduces the WER from 19.5 to 18.8%. The RNNLM gave an additional improvement, yielding a final error rate of 17.9% on this task.

  Model       | Comb (tg) | Ph/Gr Comb (tg) | (fg) | (+rnn)
  Ph Ensemble | 20.0      | 19.5            | 18.8 | 17.9
  Gr Ensemble | 21.0      |                 |      |

Table 6. %WER of the final MGB-3 system on dev17b.

5. CONCLUSION

This paper has investigated whether the recent advances in deep-learning based approaches have enabled graphemic English ASR systems to reach the performance level of traditionally used phonetic systems. It was found that a combination of long-span temporal history and future information with context-dependent graphemic units is important to obtain competitive performance for graphemic English ASR systems. The relative difference between phonetic and graphemic systems can be further reduced by employing system combination approaches; model smoothing and random ensemble methods were both found to be effective. The combination of these two methods yielded a graphemic English ASR system for multi-genre broadcast transcription that is only around 5% worse, in relative terms, than an equivalent phonetic English ASR system, and is complementary to it.

³ It is worth noting that the performance of combining an ensemble of two phonetic systems was 20.2%. Thus, simply enlarging the size of the phonetic ensemble is not expected to match this graphemic/phonetic ensemble performance.

6. REFERENCES

[1] S. Kanthak and H. Ney. Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition. In Proc. ICASSP, 2002.
[2] M. Killer, S. Stüker, and T. Schultz. Grapheme based speech recognition. In Proc. INTERSPEECH, 2003.
[3] M. J. F. Gales, K. M. Knill, and A. Ragni. Unicode-based graphemic systems for limited resource languages. In Proc. ICASSP, pages 5186-5190, 2015.
[4] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. ICML, pages 369-376, 2006.
[5] F. Eyben, M. Wöllmer, B. Schuller, and A. Graves. From speech to letters - using a novel neural network architecture for grapheme based ASR. In Proc. ASRU Workshop, pages 376-380, 2009.
[6] H. Sak, A. Senior, and F. Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proc. INTERSPEECH, 2014.
[7] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10):1533-1545, 2014.
[8] V. Peddinti, D. Povey, and S. Khudanpur. A time delay neural network architecture for efficient modeling of long temporal contexts. In Proc. INTERSPEECH, 2015.
[9] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proc. ICML, pages 1764-1772, 2014.
[10] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, long short-term memory, fully connected deep neural networks. In Proc. ICASSP, pages 4580-4584, 2015.
[11] V. Peddinti, Y. Wang, D. Povey, and S. Khudanpur. Low latency acoustic modeling using temporal convolution and LSTMs. IEEE Signal Processing Letters, 2017.
[12] D. Povey and B. Kingsbury. Evaluation of proposed modifications to MPE for large scale discriminative training. In Proc. ICASSP, 2007.
[13] J. H. M. Wong and M. J. F. Gales. Multi-task ensembles with teacher-student training. To appear in Proc. ASRU Workshop, 2017.
[14] J. H. M. Wong and M. J. F. Gales. Sequence student-teacher training of deep neural networks. In Proc. INTERSPEECH, 2016.
[15] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al. The Kaldi speech recognition toolkit. In Proc. ASRU Workshop, 2011.
[16] J. Odell. The use of context in large vocabulary speech recognition, 1995.
[17] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, 2012.
[18] J. G. Fiscus. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In Proc. ASRU Workshop, pages 347-354, 1997.
[19] G. Evermann and P. C. Woodland. Large vocabulary decoding and confidence estimation using word posterior probabilities. In Proc. ICASSP, 2000.
[20] H. Xu, D. Povey, L. Mangu, and J. Zhu. Minimum Bayes risk decoding and system combination based on a recursion for edit distance. Computer Speech & Language, 25(4):802-828, 2011.
[21] H. Wang, A. Ragni, M. J. F. Gales, K. M. Knill, P. C. Woodland, and C. Zhang. Joint decoding of tandem and hybrid systems for improved keyword spotting on low resource languages. In Proc. INTERSPEECH, 2015.
[22] T. G. Dietterich. Ensemble methods in machine learning. Multiple Classifier Systems, 1857:1-15, 2000.
[23] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[24] O. Siohan, B. Ramabhadran, and B. Kingsbury. Constructing ensembles of ASR systems using randomized decision trees. In Proc. ICASSP, 2005.
[25] P. Bell, M. J. F. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester, et al. The MGB challenge: Evaluating multi-genre broadcast media recognition. In Proc. ASRU Workshop, pages 687-693, 2015.
[26] P. C. Woodland, X. Liu, Y. Qian, C. Zhang, M. J. F. Gales, P. Karanasou, P. Lanchantin, and L. Wang. Cambridge university transcription systems for the multi-genre broadcast challenge. In Proc. ASRU Workshop, pages 639-646, 2015.
[27] L. Wang, C. Zhang, P. C. Woodland, M. J. F. Gales, P. Karanasou, P. Lanchantin, X. Liu, and Y. Qian. Improved DNN-based segmentation for multi-genre broadcast audio. In Proc. ICASSP, pages 5700-5704, 2016.
[28] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur. Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Proc. INTERSPEECH, pages 2751-2755, 2016.
[29] T. Mikolov, M. Karafiát, L. Burget, et al. Recurrent neural network based language model. In Proc. INTERSPEECH, 2010.
[30] X. Chen, X. Liu, M. Gales, and P. Woodland. CUED-RNNLM: an open-source toolkit for efficient training and evaluation of recurrent neural network language models. In Proc. ICASSP, 2015.
[31] X. Chen, X. Liu, Y. Wang, M. Gales, and P. Woodland. Efficient training and evaluation of recurrent neural network language models for automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.
[32] X. Liu, X. Chen, Y. Wang, M. Gales, and P. Woodland. Two efficient lattice rescoring methods using recurrent neural network language models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(8):1438-1449, 2016.
