Optimising The Input Window Alignment in CD-DNN Based Phoneme Recognition for Low Latency Processing

Akash Kumar Dhaka and Giampiero Salvi

KTH Royal Institute of Technology, School of Computer Science and Communication, Dept. for Speech, Music and Hearing, Stockholm, Sweden
{akashd, giampi}@kth.se

ABSTRACT

We present a systematic analysis of the performance of a phonetic recogniser when the window of input features is not symmetric with respect to the current frame. The recogniser is based on Context Dependent Deep Neural Networks (CD-DNNs) and Hidden Markov Models (HMMs). The objective is to reduce the latency of the system by reducing the number of future feature frames required to estimate the current output.

Our tests performed on the TIMIT database show that the performance does not degrade when the input window is shifted by up to 5 frames into the past relative to the common symmetric window, at which point no future frame is used. This corresponds to improving the latency by 50 ms in our settings. Our tests also show that the best results are not obtained with the symmetric window commonly employed, but with an asymmetric window with eight past and two future context frames, although this observation should be confirmed on other data sets.

The reduction in latency suggested by our results is critical for specific applications such as real-time lip synchronisation for telepresence, but may also be beneficial in general applications to improve the lag in human-machine spoken interaction.

1. INTRODUCTION

In recent years, the development of deep neural models based on Restricted Boltzmann Machine (RBM) pretraining has revitalised the use of artificial neural networks (ANNs) in automatic speech recognition (ASR) as well as in many other fields (see [1, 2] for extensive reviews). A key factor that determines the usability of applications based on speech recognition is the latency or lag of the system.
In dialogue systems, for example, long latencies may disrupt the natural turn-taking in the human-machine conversation. In other specific applications the lag may be even more critical. A typical example involves systems that use ASR to drive the lip movements of an avatar in real time to support telepresence [3, 4, 5].

The latency in a typical speech recogniser based on a hybrid between Neural Networks (NNs) and Hidden Markov Models (HMMs) is determined by a number of factors:

• the hardware (sound card) introduces some lag in digitising the speech samples and making them available to the drivers. Typical values are on the order of milliseconds;
• the speech samples are returned by the driver in buffers of a certain size (this could be as long as half a second, but can be reduced to a few ms);
• in spectral-based feature extraction, speech samples are grouped into windows (frames), often around 25-40 ms in length;
• many methods for feature extraction also compute time derivatives of the features, which require a number of frames in the past and the future. These often include three context frames for the first and three for the second derivatives, for a total of 60 ms if we assume feature vectors spaced 10 ms apart;
• the input to the neural network that estimates state probabilities may include a window of context frames. Typical values are 5 future and 5 past frames, which correspond to 50 ms latency. In some cases the context may extend to the whole utterance, e.g., in some applications of convolutional neural networks;
• the decoder that combines the probability estimates produced by the NN with the HMM time model usually requires a certain look-ahead (from a few hundred ms to the whole utterance).

In [6] we used a hybrid of Recurrent Neural Networks (RNNs) and HMMs that was specifically designed for low latency processing.
The feature extraction was based on Mel Frequency Cepstral Coefficients (MFCCs) without time derivatives and the RNN did not receive any future feature frame in input, thus limiting the latency of the RNN to the size of the feature extraction window. The purpose of that study, however, was limited to evaluating the effect of varying the look-ahead length of the Viterbi decoder in the system.

Nearly all recent methods that use DNN-based acoustic models for ASR employ symmetric context windows as input [7, 8, 9, 10, 11] and are therefore affected by a certain latency. In [12], time delay neural networks were used with asymmetric context windows in some cases. Similarly, in [13] context windows with more frames in the left context were used. However, we are not aware of a systematic and detailed investigation of the effect of the context window asymmetry.

In this study, we want to determine the relationship between the latency of the DNN model and its performance. In order to do this, we analyse how the performance of the recogniser proposed in [8] varies as the alignment of the input context window is shifted back or forward in time. It is not our intent to report state-of-the-art results, but to give an indication of the relative effects of shifting the input window. Also, we report results in terms of Phoneme Error Rate (PER) on the TIMIT data, because we want to have precise control over shifts in time and therefore require carefully annotated data.

2. METHOD

For our experiments, we use a Context Dependent Deep Neural Network (CD-DNN) trained to estimate the posterior probabilities of a set of senones given a sequence of input feature vectors. The probability estimates are then used in a Hidden Markov Model (HMM) in combination with a bigram phoneme-level language model for phonetic recognition.

[Figure 1: past frames, current frame t0 and future frames form the input window of the deep neural network, which produces the output at t0.]
Fig. 1. Illustration of the method: a sequence of 11 speech feature frames constitutes the input to the neural network. The context window is not necessarily symmetric with respect to the current frame (t0). In the illustration a shift of -3 was applied.

The set of senones and their alignment with the speech utterances in the dataset is determined by training a context dependent HMM recogniser based on Gaussian Mixture Models. The number of senones is reduced with decision tree based clustering.

The DNN training procedure is described in detail in [9]. The weights in the CD-DNN are initialised using a Deep Belief Network (DBN), that is, a stack of Restricted Boltzmann Machines (RBMs). The DBN is trained generatively by fitting the layers one at a time (greedily) by means of the contrastive divergence procedure. The final output layer in the CD-DNN is a generalised softmax (GSM) layer representing a distribution over senones. Given the generative initialisation, the full model is fine-tuned with back-propagation training. The input to the model is a context window of n successive frames of raw filterbank feature vectors. The feature vectors are normalised to zero mean and unit variance. We use filterbank features instead of MFCCs because they have been reported to achieve good results in combination with DNNs without the need for time derivatives, which would increase the latency [14].

We use a Viterbi decoder to generate the phoneme sequences and a phoneme-level bigram model estimated on the training set. The acoustic scale used in the decoder to balance acoustic and language models was optimised for each experiment independently on the development set, and the optimal value was then used on the test set. The decoder follows the lattice generation and pruning approach described in [15].

Differently from previous studies [7, 8, 9, 10, 11], we vary the alignment of the context window with respect to the current frame (see Figure 1).
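
The shifted context window can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the KALDI or PDNN implementation: the helper names `context_indices` and `stack_context` are hypothetical, and clamping out-of-range indices to the first or last frame of the utterance is an assumption, since the handling of utterance boundaries is not specified here.

```python
from typing import List, Sequence


def context_indices(t: int, shift: int = 0, width: int = 11) -> List[int]:
    """Indices of the `width` frames in a window centred at frame t + shift.

    shift = 0 gives the symmetric window; negative shifts move it into the past.
    """
    half = width // 2
    centre = t + shift
    return list(range(centre - half, centre + half + 1))


def stack_context(frames: Sequence[Sequence[float]], t: int,
                  shift: int = 0, width: int = 11) -> List[float]:
    """Concatenate the feature vectors of the shifted window into one input vector.

    Assumption: indices outside the utterance are clamped to the nearest valid frame.
    """
    last = len(frames) - 1
    stacked: List[float] = []
    for i in context_indices(t, shift, width):
        stacked.extend(frames[min(max(i, 0), last)])
    return stacked


# 100 dummy frames of 40 filterbank channels each (placeholder values, not real data)
frames = [[float(i)] * 40 for i in range(100)]
x = stack_context(frames, t=50, shift=-3)
print(len(x))                   # 440 (11 frames x 40 channels)
print(context_indices(50, -3))  # frames 42 to 52: 8 past, current, 2 future
```

With a shift of -3, as in Figure 1, the window for frame 50 covers frames 42 to 52, so only two future frames are needed before the network can produce its output.
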
We train a different recogniser for each alignment and we analyse its performance in terms of Phoneme Error Rate (PER) as a function of the window shift.

3. EXPERIMENTS

3.1. Experimental Setup

The experiments are based on the KALDI and PDNN+KALDI recipes [16, 8] and are performed on the TIMIT corpus. The standard 462-speaker training set was used for training. All SA utterances were removed to prevent bias due to the similarity of the utterances. The training set was further divided into a 95% training set and a 5% validation set for regularisation during the back-propagation training procedure. Results are reported on the 24-speaker core test set.

The 40-channel filterbank features were computed using a 40 ms Hamming window with 10 ms increments. The inputs to the DNNs in our experiments are context windows of 11 consecutive frames. This length of context window seems to be optimal for this application according to [7]. Also following the results in [7], we use a DNN with 4 hidden layers of size 1024. The softmax output layer represents the distribution of posteriors over 1984 different senones (based on the 61 phonemes in the TIMIT standard phoneme set). The resulting topology of the network is, therefore, 440 × 1024 × 1024 × 1024 × 1024 × 1984, where the 440 inputs correspond to 11 frames of 40 filterbank channels each. The learning rate is initialised to 0.08, and then dropped by half whenever the difference between the previous epoch and the current epoch drops below a certain threshold.

We optimised the acoustic scale in the decoder by running the decoder on the development set with 8 different scales. The scales were of the form 1/k, k ∈ {1, ..., 8}. The optimal values for the acoustic scale were always contained between the extreme values we tested, suggesting that they correspond to real optima (see also Table 1 in Section 4). These optimal values were then used to decode the test data. The insertion penalty for our experiments was not optimised.
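
The acoustic scale selection described above amounts to a small grid search over the values 1/k. The following sketch shows the idea under stated assumptions: `dev_per` is a hypothetical callback standing in for "decode the development set with this scale and return the PER", and `toy_dev_per` is a made-up curve used only to make the example runnable; neither is part of KALDI or PDNN, and the numbers are not results from the experiments.

```python
from typing import Callable, List, Tuple


def candidate_scales(max_k: int = 8) -> List[float]:
    """Acoustic scales of the form 1/k for k = 1, ..., max_k."""
    return [1.0 / k for k in range(1, max_k + 1)]


def pick_acoustic_scale(dev_per: Callable[[float], float],
                        max_k: int = 8) -> Tuple[float, float]:
    """Return (best_scale, best_per) on the development set.

    On ties, the first candidate in the list (the largest scale) is kept.
    """
    scales = candidate_scales(max_k)
    pers = [dev_per(s) for s in scales]
    best = min(range(len(scales)), key=pers.__getitem__)
    return scales[best], pers[best]


def toy_dev_per(scale: float) -> float:
    """Hypothetical development-set PER curve with its minimum at scale 0.2."""
    return 22.0 + 100.0 * (scale - 0.2) ** 2


best_scale, best_per = pick_acoustic_scale(toy_dev_per)
print(best_scale, best_per)
```

The rounded values reported in Table 1 (0.14, 0.17, 0.20, 0.25, 0.33, 0.50, 1.00) are consistent with this grid, since they correspond to 1/7, 1/6, 1/5, 1/4, 1/3, 1/2 and 1.
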
After decoding, the 61 phone classes are mapped to a set of 39 classes as in [17] for evaluation.

Our baseline results correspond to the input window being centred with respect to the current frame, with 5 context frames on either side. This corresponds, in our notation, to zero shift (see Figure 2). A positive shift corresponds to a shift of the context window into the future. We tested shifts from -10 to 10 with increments of one frame. Additionally, we tested shifts of -15 and -20 frames. For every shift value, the whole training and evaluation procedure is repeated. It is interesting to note that for shifts above +5 and below -5 the context window does not contain the current frame, and the recogniser will try to predict the current phoneme exclusively based on context.

3.2. Practical Setup

We used KALDI for feature extraction, selection of the senones and alignment of the senone transcriptions to the speech data. To speed up the training of the deep neural networks, we used two NVIDIA TITAN GTX GPUs. We also used the symbolic computation software Theano [18], which is well optimised to do symbolic algebra on GPUs. The generative training of the RBMs took about 3 minutes for one epoch over the entire training set, and the fine-tuning with back-propagation took about 2 minutes for one full pass.

4. RESULTS

The results are shown in Figure 2 for shifts between -10 and +10 and in Table 2 for selected shifts (-20, -15, -10, -5, -2, 0). The left plot in Figure 2 displays the Phoneme Error Rate (PER) as a function of the input window shift, whereas the right plot details the different kinds of errors (substitutions, deletions and insertions).

[Figure 2. Left panel, "PER vs input window shift": phoneme error rate (%), axis range 20 to 45, against input window shift (frames), -10 to 10. Right panel, "Errors vs input window shift": errors (%), axis range 0 to 25, against input window shift (frames), -10 to 10, with separate curves for substitutions, deletions and insertions.]
Fig. 2. Phoneme recognition performance as a function of the input window alignment in time. Left: phoneme error rate (PER). Right: detail of the different kinds of errors. A shift of 0 corresponds to a window centred at the current frame with 5 frames of right context and 5 frames of left context. Windows with shifts above +5 or below -5 do not contain the current frame.

The best PER of 22.0% (SD 3.8) occurs for a window shift of -2, i.e., for a window that is not symmetric around the current frame but has more past than future frames.

The performance does not degrade when varying the shift down to -5 frames (the current feature vector is still included in the input window). The PER, however, starts increasing when the shift goes beyond -5 frames and the network does not receive the current feature vector in input. If we consider positive shifts for completeness, we can observe a similar behaviour, although the graph is not perfectly symmetric and the PER for positive shifts is generally slightly higher than for negative shifts of the same amplitude.

The acoustic scale used for decoding was optimised on the development set and then used on the test set. As a comparison, Table 1 shows the optimal values of the acoustic scale when optimised on the development set and on the test set. In most cases, the same optimal value was obtained. In the cases where different values were obtained, the corresponding difference in % PER was no greater than 0.5%.

Looking at the right plot in Figure 2, we can observe that the number of insertions is relatively constant with respect to the window shift. The substitutions increase when the window is shifted with respect to the current frame, but the errors that vary the most with window shifts are deletions.

Table 2 shows results for selected window shifts, including the extreme cases -20 and -15 that are not reported in Figure 2 for clarity of illustration. The performance for extreme shifts drops considerably and the degradation is mostly accounted for by deletions and substitutions.
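
The link between window shift and look-ahead can be made explicit with a short calculation. The sketch below assumes the 11-frame window and the 10 ms frame increment given in Section 3.1, and it counts only the latency due to future context frames, ignoring the other contributions listed in the introduction; the function names are illustrative.

```python
def future_frames(shift: int, width: int = 11) -> int:
    """Number of future frames the window needs beyond the current frame."""
    half = width // 2
    return max(0, half + shift)


def context_latency_ms(shift: int, frame_step_ms: float = 10.0,
                       width: int = 11) -> float:
    """Look-ahead latency (ms) introduced by the future context frames alone."""
    return future_frames(shift, width) * frame_step_ms


def contains_current_frame(shift: int, width: int = 11) -> bool:
    """True if the shifted window still includes the current frame t0."""
    return abs(shift) <= width // 2


for shift in (0, -2, -5, -6):
    print(shift, context_latency_ms(shift), contains_current_frame(shift))
# 0 50.0 True
# -2 30.0 True
# -5 0.0 True
# -6 0.0 False
```

Under these assumptions, moving from shift 0 to shift -5 removes the full 50 ms of context look-ahead while the window still contains the current frame; shifting further back cannot reduce this contribution any more and excludes the current frame from the input.
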
5. CONCLUSIONS

This study presents a systematic analysis of the effect of shifting the context input window in a CD-DNN+HMM phonetic recogniser with respect to the current frame. The goal is to investigate the possibility of reducing the latency of such a speech recogniser for applications with specific requirements, but the results reported here are of general interest.

window   acoustic scale        window   acoustic scale
shift    test set   dev set    shift    test set   dev set
-20      0.50       1.00         0      0.14       0.20
-15      0.33       0.50         1      0.17       0.20
-10      0.25       0.25         2      0.20       0.17
 -9      0.25       0.25         3      0.25       0.20
 -8      0.20       0.20         4      0.20       0.20
 -7      0.20       0.20         5      0.20       0.20
 -6      0.25       0.20         6      0.25       0.20
 -5      0.17       0.20         7      0.25       0.20
 -4      0.20       0.20         8      0.25       0.25
 -3      0.17       0.20         9      0.25       0.25
 -2      0.20       0.20        10      0.25       0.25
 -1      0.20       0.20
Table 1. Acoustic scales optimised on the test and development set for each window shift.

Our results on the TIMIT database suggest that a context window slightly shifted back in time performs better than the symmetric context window used in most speech recognisers. However, the improvement in performance is small compared to the variability (standard deviation), and this observation should be confirmed by testing on other data sets.

More interestingly, our results suggest that shifting the context window back in time by up to 5 frames (50 ms) does not introduce noticeable degradation in the system performance. Larger shifts introduce a gradual but progressively steeper degradation. As a consequence, without modifying the ASR method in [8], we can reduce the latency of the system by at least 50 ms without any degradation in performance. We can reduce the latency even more if some degradation can be tolerated by the application. This reduction in latency, although small, can potentially improve the usability of ASR in many applications, especially if latency is critical, as in real-time lip synchronisation for telepresence.

Shift   % PER (SD)    % SUB (SD)    % DEL (SD)    % INS (SD)
-20     77.6 (3.0)    18.7 (2.4)    57.9 (5.0)    1.0 (0.7)
-15     61.8 (4.0)    21.6 (3.3)    38.0 (6.1)    2.2 (1.5)
-10     34.9 (4.4)    17.3 (2.6)    14.4 (4.0)    3.2 (1.3)
 -5     23.0 (3.6)    13.6 (2.0)     6.6 (2.9)    2.8 (1.0)
 -2     22.0 (3.8)    13.3 (2.0)     5.8 (2.6)    3.0 (1.4)
  0     22.7 (4.2)    14.1 (2.1)     5.8 (2.6)    3.0 (1.8)
Table 2. Recognition performance for selected window shifts.

It is important to note that the insertion penalty was not optimised in our experiments, and for all the different window shifts, the deletion error was always greater than the insertion error. In future work, we will investigate whether we can reduce the effect of the window shift by optimising the insertion penalty for each shift.

As in any study on speech recognition, the possibility of generalising our results outside the scope of phonetic recognition needs to be verified with specific tests. For example, it would be interesting to test whether systems with longer time dependencies (lexical models and more complex language models) would be affected by the window shifts in a similar way.

6. ACKNOWLEDGMENTS

The GeForce GTX TITAN and TITAN X used for this research were donated by the NVIDIA Corporation. Giampiero Salvi is partially supported by the IGLU project (CHIST-ERA, Vetenskapsrådet 2015-06814).

7. REFERENCES

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] Jürgen Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[3] Giampiero Salvi, Jonas Beskow, Samer Al Moubayed, and Björn Granström, "SynFace - speech-driven facial animation for virtual speech-reading support," EURASIP Journal on Audio, Speech, and Music Processing, Sept. 2009.
[4] Kaihui Mu, Jianhua Tao, Jianfeng Che, and Minghao Yang, "Real-time speech-driven lip synchronization," in Universal Communication Symposium (IUCS), 2010 4th International, Oct. 2010, pp. 378–382.
[5] Hao Li, Minghao Yang, and Jianhua Tao, "Speaker-independent lips and tongue visualization of vowels," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, May 2013, pp. 8106–8110.
[6] Giampiero Salvi, "Dynamic behaviour of connectionist speech recognition with strong latency constraints," Speech Communication, vol. 48, no. 7, pp. 802–818, July 2006.
[7] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey Hinton, "Acoustic modeling using deep belief networks," IEEE Trans. Audio, Speech, Lang. Process., pp. 14–22, 2012.
[8] Y. Miao, "Kaldi+PDNN: Building DNN-based ASR Systems with Kaldi and PDNN," ArXiv e-prints, Jan. 2014.
[9] A. Mohamed, T.N. Sainath, G. Dahl, B. Ramabhadran, G.E. Hinton, and M.A. Picheny, "Deep belief networks using discriminative features for phone recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, May 2011, pp. 5060–5063.
[10] Kaisheng Yao, Dong Yu, Frank Seide, Hang Su, Li Deng, and Yifan Gong, "Adaptation of context-dependent deep neural networks for automatic speech recognition," in Proc. SLT'12, 2012.
[11] G.E. Dahl, Dong Yu, Li Deng, and A. Acero, "Large vocabulary continuous speech recognition with context-dependent DBN-HMMs," in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, May 2011, pp. 4688–4691.
[12] Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Proc. Interspeech, 2015.
[13] Xin Lei, Andrew Senior, Alexander Gruenstein, and Jeffrey Sorensen, "Accurate and compact large vocabulary speech recognition on mobile devices," in Proc. of Interspeech, 2013.
[14] Navdeep Jaitly, Exploring Deep Learning Methods for Discovering Features in Speech Signals, Ph.D. thesis, University of Toronto, 2014.
[15] D. Povey, M. Hannemann, G. Boulianne, L. Burget, A. Ghoshal, M. Janda, M. Karafiat, S. Kombrink, P. Motlicek, Yanmin Qian, K. Riedhammer, K. Vesely, and Ngoc Thang Vu, "Generating exact lattices in the WFST framework," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, March 2012, pp. 4213–4216.
[16] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Dec. 2011, IEEE Signal Processing Society, IEEE Catalog No.: CFP11SRW-USB.
[17] K.-F. Lee and H.-W. Hon, "Speaker-independent phone recognition using hidden Markov models," Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 37, no. 11, pp. 1641–1648, Nov. 1989.
[18] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio, "Theano: new features and speed improvements," Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
