Comparing phonemes and visemes with DNN-based lipreading


Authors: Kwanchiva Thangthai, Helen L Bear, Richard Harvey

Kwanchiva Thangthai (k.thangthai@uea.ac.uk), School of Computing Sciences, University of East Anglia, Norwich, UK
Helen L Bear (helen@uel.ac.uk), Computer Science & Informatics, University of East London, London, UK
Richard Harvey (r.w.harvey@uea.ac.uk), School of Computing Sciences, University of East Anglia, Norwich, UK

Abstract

There is debate about whether phoneme or viseme units are the most effective for a lipreading system. Some studies use phoneme units even though phonemes describe unique short sounds; other studies have tried to improve lipreading accuracy by focusing on visemes, with varying results. We compare the performance of a lipreading system that models visual speech using either 13 viseme or 38 phoneme units, and we report the accuracy of our system at both word and unit levels. The evaluation task is large-vocabulary continuous speech recognition using the TCD-TIMIT corpus. We model visual speech with hybrid DNN-HMMs, and our visual speech decoder is a Weighted Finite-State Transducer (WFST). We use DCT and Eigenlips as representations of the mouth ROI image. The phoneme lipreading system outperforms the viseme-based system in word accuracy. However, the phoneme system achieved lower accuracy at the unit level, which shows the importance of the dictionary for decoding classification outputs into words.

1 Introduction

As lipreading transitions from GMM/HMM-based technology to systems based on Deep Neural Networks (DNNs), there is merit in re-examining the old assumption that phoneme-based recognition outperforms recognition with viseme-based systems. Also, given the greater modeling power of DNNs, there is value in considering a range of rather primitive features, such as the Discrete Cosine Transform (DCT) [1] and Eigenlips [7], which had previously been disparaged due to their poor performance.

Visual speech units divide into two broad categories: phonemes and visemes.
A phoneme is the smallest unit of speech that distinguishes one word sound from another [3]; it therefore has a strong relationship with the acoustic speech signal. In contrast, a viseme is the basic visual unit of speech, representing a gesture of the mouth, face and the visible parts of the teeth and tongue (the visible articulators). Generally speaking, mouth gestures have less variation than sounds, and several phonemes may share the same gesture, so a single viseme class may contain many different phonemes. There are many choices of viseme set [6], and Table 1 shows one such mapping [19].

© 2017. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.

Table 1: Neti [19] phoneme-to-viseme mapping.

Consonants
  /A   /l/ /el/ /r/ /y/                     Alveolar-semivowels
  /B   /s/ /z/                              Alveolar-fricatives
  /C   /t/ /d/ /n/ /en/                     Alveolar
  /D   /sh/ /zh/ /ch/ /jh/                  Palato-alveolar
  /E   /p/ /b/ /m/                          Bilabial
  /F   /th/ /dh/                            Dental
  /G   /f/ /v/                              Labio-dental
  /H   /ng/ /g/ /k/ /w/                     Velar
Vowels
  /V1  /ao/ /ah/ /aa/ /er/ /oy/ /aw/ /hh/   Lip-rounding based vowels
  /V2  /uw/ /uh/ /ow/                       "
  /V3  /ae/ /eh/ /ey/ /ay/                  "
  /V4  /ih/ /iy/ /ax/                       "
Silence
  /S   /sil/ /sp/                           Silence

2 Developing a DNN-HMM based lipreading system

Figure 1: Lipreading system construction: training videos pass through ROI extraction and feature extraction/pre-processing into DNN-HMM visual speech model training; unseen videos pass through the same pipeline into a WFST decoder that combines the DNN-HMM visual speech model, the language model and the lexicon model to produce results.

Conventional techniques for modeling visual speech are based around Hidden Markov Models (HMMs) [23].
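The Neti mapping of Table 1 can be expressed directly as a lookup table. The sketch below is our own transcription of the table (helper names are illustrative, not from the authors' code):

```python
# Neti [19] phoneme-to-viseme mapping, transcribed from Table 1.
NETI_VISEMES = {
    "/A": ["l", "el", "r", "y"],           # Alveolar-semivowels
    "/B": ["s", "z"],                      # Alveolar-fricatives
    "/C": ["t", "d", "n", "en"],           # Alveolar
    "/D": ["sh", "zh", "ch", "jh"],        # Palato-alveolar
    "/E": ["p", "b", "m"],                 # Bilabial
    "/F": ["th", "dh"],                    # Dental
    "/G": ["f", "v"],                      # Labio-dental
    "/H": ["ng", "g", "k", "w"],           # Velar
    "/V1": ["ao", "ah", "aa", "er", "oy", "aw", "hh"],  # Lip-rounding vowels
    "/V2": ["uw", "uh", "ow"],
    "/V3": ["ae", "eh", "ey", "ay"],
    "/V4": ["ih", "iy", "ax"],
    "/S": ["sil", "sp"],                   # Silence
}

# Invert to a phoneme -> viseme map so phoneme strings can be transcribed.
PHONEME_TO_VISEME = {p: v for v, ps in NETI_VISEMES.items() for p in ps}

def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]
```

For example, the phoneme string for TALK, "t ao k", maps to the viseme string "C V1 H", illustrating how a many-to-one mapping merges distinct pronunciations.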
The aim of an HMM-based model is to find the most likely word or unit sequence corresponding to the visual observations. HMMs comprise two probability distributions: the transition probabilities and the probability density function (PDF) associated with the continuous outputs. The transition probabilities represent a first-order Markov process. The PDF of the speech feature vectors is modeled by a Gaussian Mixture Model (GMM), parameterised by the mean and the variance of each component.

GMMs have some weaknesses that have been identified in acoustic modeling [13]. First, they are statistically inefficient for modeling data that lie on or near a non-linear manifold in the data space. Second, reducing the computational cost by using a diagonal rather than a full covariance matrix requires uncorrelated features. These deficiencies motivate the consideration of alternative learning techniques.

A deep network structure can be considered a feature extractor, using the neurons in multiple hidden layers to learn the essential patterns from the input features [16]. In addition, the backpropagation algorithm [12], with an appropriate learning criterion, optimizes the model to fit the training data discriminatively. However, to decode a speech signal, temporal features and models that can capture the sequential information in speech, such as the observable Markov sequence of the HMM, are still necessary. Thus arises the hybrid DNN-HMM structure, in which the DNN is used instead of the GMM within the HMM, combining the advantages of the two algorithms.

2.1 Feature extraction

The literature provides a variety of feature extraction methods, often combined with tracking (which is essential if the head of the talker is moving).
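The hybrid substitution just described can be made concrete. In a standard hybrid system (this step is our assumption of the usual recipe, not detailed in the paper), the DNN's state posteriors are converted to the scaled likelihoods the HMM decoder consumes by dividing by the state priors:

```python
import numpy as np

def posteriors_to_scaled_loglik(posteriors, priors, eps=1e-10):
    """Convert DNN state posteriors p(s|o) into scaled log-likelihoods:
    log p(o|s) - log p(o) = log p(s|o) - log p(s).
    This is the quantity that replaces the GMM log-likelihood in the
    hybrid DNN-HMM; `priors` are the state priors from the alignments."""
    return np.log(posteriors + eps) - np.log(priors + eps)

# A frame where the DNN favours state 0 slightly more than its prior does.
post = np.array([0.6, 0.3, 0.1])
priors = np.array([0.5, 0.3, 0.2])
loglik = posteriors_to_scaled_loglik(post, priors)
```

Dividing by the prior prevents frequent states (such as silence) from dominating the decoding simply because they dominate the training alignments.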
Here we focus on features that have previously been described as "bottom-up" [17], meaning that they are derived directly from the pixel data and require only a Region-Of-Interest, or ROI. Figure 2 illustrates a typical ROI taken from the TCD-TIMIT dataset described later in Section 4.1, plus the two associated feature representations which we now describe.

Figure 2: Comparing the original ROI image (left) with its reconstruction via 44-coefficient DCT (middle) and 30-coefficient Eigenlip (right).

2.1.1 Discrete Cosine Transform (DCT)

The DCT is a popular transform in image coding and compression. It represents the frequency content of a signal, periodically and symmetrically, using the cosine function; in particular, the DCT is part of the Fourier Transform family but contains only the real (cosine) part. Because of its popularity, most modern processors execute it very quickly (roughly O(N) for modern algorithms), which also explains its ubiquity. For strongly correlated Markov processes the DCT approaches the Karhunen-Loeve transform in its compaction efficiency, which possibly explains its popularity as a benchmark feature [18]. Here we use DCT-II with the zigzag property [27], which means that the first elements of the feature vector contain the low-frequency information. The resulting feature vector has 44 dimensions.

2.1.2 Eigenlips

The Eigenlip feature is another appearance-based approach [15], generated via Principal Component Analysis (PCA) [26]. Here, we use PCA to extract the Eigenlip feature from grey-scale lip images, retaining only 30 PCA dimensions. To construct the PCA basis, 25 ROI images from each training utterance were randomly selected as the training images; approximately 100k images in total were used to compute the eigenvectors and eigenvalues, which were then used to extract features for the training and testing utterances.
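A minimal sketch of Eigenlip extraction via PCA follows. It uses an SVD of the mean-centred image matrix rather than an explicit eigendecomposition (the two are equivalent here); function names and the toy data are illustrative, not the authors' implementation:

```python
import numpy as np

def fit_eigenlips(train_images, n_components=30):
    """Learn an Eigenlip basis from grey-scale ROI images via PCA.
    `train_images` has shape (num_images, height * width)."""
    mean = train_images.mean(axis=0)
    # Rows of vt are the principal axes (eigenvectors of the covariance),
    # ordered by decreasing singular value, i.e. decreasing variance.
    _, _, vt = np.linalg.svd(train_images - mean, full_matrices=False)
    return mean, vt[:n_components]

def eigenlip_features(images, mean, basis):
    """Project ROI images onto the retained principal components."""
    return (images - mean) @ basis.T

# Toy stand-in for the ~100k randomly selected training ROIs.
rng = np.random.default_rng(1)
imgs = rng.random((200, 32 * 32))
mean, basis = fit_eigenlips(imgs, n_components=30)
feats = eigenlip_features(imgs, mean, basis)
```

The same `mean` and `basis` learned from the training images are reused to extract features for the test utterances, exactly as the text describes.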
Only the 30 principal components with the highest variance were retained.

2.2 Feature transformation

Figure 3: fMLLR feature pre-processing pipeline: raw feature, then splicing, LDA, MLLT and fMLLR, yielding the fMLLR feature.

The raw features are the 30-dimensional Eigenlip and 44-dimensional DCT features. These raw features are first normalized by subtracting the mean of each speaker, and 15 consecutive frames are spliced onto each feature vector to add dynamic information. Second, Linear Discriminant Analysis (LDA) [9] and the Maximum Likelihood Linear Transform (MLLT) [10] are applied to reduce and map the features to a new space that minimizes the within-class distance and maximizes the between-class distance, where the class is the HMM state, whilst simultaneously maximizing the observation likelihood in the original feature space. Finally, feature-space Maximum Likelihood Linear Regression (fMLLR) [21], [10], also known as feature-space speaker adaptation, is employed to normalize the variation within a speaker. These new 40-dimensional fMLLR features are used as inputs (labeled as the feature pre-processing pipeline in Figure 3) to the subsequent machine learning. The use of LDA is quite commonplace in lipreading and derives from the HiLDA framework [20]. MLLT and fMLLR are commonplace in acoustic speech recognition but have only recently been applied to visual speech recognition [2], albeit on smallish datasets.

2.3 Visual speech model training

Our DNN-HMM visual speech model training involves five successive stages. Here we detail the development of the visual speech model that we employ in this work, including all steps and parameters.

2.3.1 Context-Independent Gaussian Mixture Model (CI-GMM)

The first step is to initialise a model by creating a simple CI-GMM model.
This step creates a time alignment for the entire training corpus by constructing a mono phoneme/viseme model that contains 3-state GMM-HMMs for each speech unit. The CI-GMMs are trained on the raw features (DCT and PCA) along with their first and second derivative coefficients (∆ + ∆∆). Instead of fixing a schedule for increasing the number of Gaussian mixture components, we set the maximum number of Gaussians to 1000, so that each state keeps growing independently until its variances reach the maximum. When training starts, the time alignment of the training data is equally segmented; it is then updated every iteration for the first ten iterations, and every two iterations thereafter, up to a maximum of 40 iterations.

2.3.2 Context-Dependent Gaussian Mixture Model (CD-GMM)

The context-dependent viseme models (CD-GMMs) are built on the same features as the CI-GMM system. Here we use tied-state tri-context visual speech models, where the tied states are obtained from the data-driven tree-clustering approach [22]. We specified a maximum of 2000 leaf nodes, which limits the number of states, and set the maximum number of Gaussians to 10k. The training iterations continue until convergence, which in practice takes fewer than 35 iterations; we realign every 10 iterations.

2.3.3 CD-GMM with LDA-MLLT feature transformation

This training step also uses CD-GMMs, but trained on the LDA-MLLT features. The 40-dimensional LDA-MLLT features are formed by splicing 15 frames around the current frame (seven on the left and seven on the right) and then reducing, via LDA, to 40 dimensions per frame. This compact LDA-MLLT feature set parameterizes the 40 dimensions that best associate with the visual speech units, and also captures the dynamics of visual speech over 150 ms.
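The 15-frame splicing that precedes the LDA projection can be sketched as follows (a minimal illustration of the operation, not the Kaldi implementation; edge handling by repetition is our assumption):

```python
import numpy as np

def splice(frames, context=7):
    """Stack each frame with its `context` left and right neighbours
    (15 frames total for context=7), repeating the first and last frames
    as padding at the utterance edges. The spliced vectors are what LDA
    then reduces back to 40 dimensions per frame."""
    T, D = frames.shape
    padded = np.concatenate([np.repeat(frames[:1], context, axis=0),
                             frames,
                             np.repeat(frames[-1:], context, axis=0)])
    return np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])
```

With 40-dimensional input frames, each spliced vector has 15 x 40 = 600 dimensions before the LDA reduction.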
Again, a different set of tied-state CD-GMMs is constructed for this feature. The maximum number of leaf nodes is set to 2,500 and the total number of Gaussians to 15k. This step uses the same number of training iterations and the same realignment schedule as the previous step.

2.3.4 CD-GMM with Speaker Adaptive Training (SAT)

In a Speaker Adaptive Training (SAT) system, the CD-GMMs are built on an fMLLR transformation on top of the LDA-MLLT features, estimating a transform for each speaker. The same training process as in the preceding step is then applied to the 40-dimensional fMLLR features, with identical numbers of leaf nodes and Gaussians.

2.3.5 Context-Dependent Deep Neural Networks (CD-DNN)

We construct the CD-DNN model on the hybrid DNN-HMM architecture. The CD-DNNs are trained and optimized by minimizing the frame-based cross-entropy between the prediction and the PDF target. The PDF refers to the tied-state context-dependent label, generated from the SAT system, that is aligned to every frame. The features adopted for all DNN training are the LDA+MLLT+fMLLR features with mean and variance normalization.

The CD-DNN model has six hidden layers with 2048 neurons per layer, using the sigmoid non-linearity in each neuron. The input layer is the fMLLR feature temporally spliced over 11 consecutive frames. The model is initialized by stacking Restricted Boltzmann Machines (RBMs), trained for three iterations on a single-GPU machine. The learning rate for RBM training is 0.4, with an L2 penalty (weight decay) of 0.0002. The learning rate for fine-tuning is 0.008, with dropout of 0.1. We use minibatch Stochastic Gradient Descent (SGD) for fine-tuning with a minibatch size of 256. We produce a development set for tuning the network by randomly selecting 10% of the training data.
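The forward pass of this network can be sketched with NumPy as below. The hidden-layer count, sigmoid non-linearity and 11 x 40 input splice follow the description above; the toy layer widths and output size are illustrative (the paper's net uses 2048 units per hidden layer and one output per tied state), and RBM pretraining and SGD fine-tuning are omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dnn_posteriors(x, weights, biases):
    """Forward pass: sigmoid hidden layers followed by a softmax over
    tied-state (PDF) targets, giving per-frame state posteriors."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)                    # sigmoid hidden layers
    logits = h @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)      # softmax over tied states

rng = np.random.default_rng(0)
# 440 = 11 spliced frames x 40 fMLLR dims; six hidden layers; 100 toy states.
dims = [440, 32, 32, 32, 32, 32, 32, 100]
Ws = [0.1 * rng.standard_normal((a, b)) for a, b in zip(dims[:-1], dims[1:])]
bs = [np.zeros(b) for b in dims[1:]]
post = dnn_posteriors(rng.standard_normal((2, 440)), Ws, bs)
```

Each row of `post` is a proper distribution over states, which is what the cross-entropy criterion compares against the one-hot alignment label for that frame.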
Every DNN training iteration is required to achieve a cross-validation loss lower than that of the previous iteration. If an iteration is rejected, it is retried with new stochastic gradient descent parameters. Training terminates when the new loss differs little from the old loss (specifically, we use a difference smaller than 0.001 of the loss as the terminating condition).

2.4 Decoding lipreading with a WFST decoder

Weighted Finite-State Transducer decoders have been increasingly used to decode speech in Large Vocabulary Continuous Speech Recognition (LVCSR) tasks and have become a state-of-the-art decoder [14]. To decode a visual speech signal, we need a visual speech model, a language model, and a lexicon, also called a pronunciation dictionary. Our lipreading decoder comprises the visual speech DNN-HMM model, the TCD-TIMIT pronunciation dictionary and a word bigram language model. We generate the decoding graph as a finite-state transducer (FST) via the Kaldi toolkit [22]. Beam-width pruning is applied every 25 frames, where we use 13.0 for the Viterbi pruning beam [25], 8.0 for the lattice beam, and a visual speech model scale of 0.1. The lattice containing all surviving paths is re-scored by applying the bigram language model with scaling factors over the range 5 to 15. Only the lowest word error rates after LM re-scoring are reported.

3 Analysis of the pronunciation dictionary

Reducing the set of speech units, such as reducing a set of phonemes to a set of visemes, reduces the discriminative power of the classification model whilst increasing the complexity of the pronunciation dictionary by increasing the number of homophonic words. This suggests that the word accuracy of a viseme-based system will be lower than that of a phoneme-based system.
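The growth in homophones can be measured directly from a pronunciation dictionary. The sketch below bins words by how many entries share their pronunciation, using the example entries from Table 2 (function and variable names are ours, for illustration):

```python
from collections import Counter

def homophone_bins(lexicon):
    """Bin dictionary words by how many words share their pronunciation,
    as in Figure 4 (left). `lexicon` maps word -> pronunciation string."""
    group_size = Counter(lexicon.values())
    return Counter(group_size[pron] for pron in lexicon.values())

# Example entries from Table 2, in phoneme and viseme form.
phoneme_lex = {
    "TALK": "t ao k", "TONGUE": "t ah ng", "DOG": "d ao g", "DUG": "d ah g",
    "CARE": "k eh r", "WELL": "w eh l", "WHERE": "w eh r", "WEAR": "w eh r",
    "WHILE": "w ay l",
}
viseme_lex = {
    "TALK": "C V1 H", "TONGUE": "C V1 H", "DOG": "C V1 H", "DUG": "C V1 H",
    "CARE": "H V3 A", "WELL": "H V3 A", "WHERE": "H V3 A", "WEAR": "H V3 A",
    "WHILE": "H V3 A",
}
```

On these nine words, the phoneme lexicon has only one homophone pair (WHERE/WEAR), whereas every word in the viseme lexicon belongs to a group of four or five homovisemes, which is exactly the high-multiplicity effect Figure 4 shows at corpus scale.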
The counter-argument is that visemes might be simpler to classify (because there are fewer of them and they are meant to be better matched to the visual signal), so there is clearly a trade-off between homopheny and unit accuracy [8].

Table 2: Examples of phoneme and viseme dictionary entries with their corresponding IPA symbols.

Word     IPA      Phoneme Dictionary   Viseme Dictionary
TALK     t O k    t ao k               C V1 H
TONGUE   t 2 N    t ah ng              C V1 H
DOG      d O g    d ao g               C V1 H
DUG      d 2 g    d ah g               C V1 H
CARE     k e r    k eh r               H V3 A
WELL     w e l    w eh l               H V3 A
WHERE    w e r    w eh r               H V3 A
WEAR     w e r    w eh r               H V3 A
WHILE    w ai l   w ay l               H V3 A

Table 2 shows examples of the homophoneme and homoviseme words that occur in the TCD-TIMIT dictionary. Figure 4 describes the homophone problem in two ways. On the left, words are binned according to how many homophones they have: the column labelled "1 occur" counts all unique words, the column labelled "2 occur" counts words that have one other homophone, and so on. It is evident that the switch to visemes causes more homophones, particularly large numbers of high-multiplicity homophones. The effect can also be seen in the dictionary size (right of Figure 4): homophones cause dictionary entries to merge, so the visual dictionary is smaller than the acoustic one.

Figure 4: Frequency of duplicated pronunciations in the TCD-TIMIT dictionary (left) and vocabulary size (right) for both phoneme and viseme units.

4 Experiment methodology

4.1 Data and benchmarks

We use the TCD-TIMIT [11] corpus, containing 59 volunteer speakers.
We chose this dataset because it is the largest-vocabulary audio-visual speech corpus available in the public domain. The WFST operates on a vocabulary of almost 6,000 words from a dictionary of 160,000 entries. The dataset provides lists of non-overlapping utterances for training and evaluation in two scenarios: speaker-dependent (SD) and speaker-independent (SI). In the SD scenario, visual models are trained on 3,752 utterances and evaluated on 1,736 utterances. In the SI experiment, 3,822 utterances from 39 talkers form the training set, and we evaluate on the remaining 17 talkers, a total of 1,666 utterances.

The TCD-TIMIT release includes baseline viseme accuracies for both speaker-dependent and speaker-independent settings using the Neti visemes [19] used here. The best viseme accuracy reported on TCD-TIMIT for recognizing 12 viseme units is 34.77% in speaker-independent tests and 34.54% in speaker-dependent tests. The context-independent viseme models (referred to as mono-viseme in that paper) were trained on 44-coefficient DCT features with 4-state HMMs and 20 Gaussian mixtures per state.

5 Results

5.1 Viseme-based lipreading experiment

One fundamental measure of the performance of an automatic lipreading system is viseme accuracy. Since the viseme recognizer requires no dictionary or language model, it is quicker to build and optimise.

Table 3 lists the accuracies achieved with our viseme-based lipreading system. In comparison to the viseme accuracies benchmarked on the TCD-TIMIT corpus, our best SD viseme accuracy is 46.61% with Eigenlips, compared to 34.54%, an improvement of 12.07%. Our best SI viseme accuracy is 44.61%, which improves on the benchmark 34.77% by 10.16%, again with the Eigenlips features.

Table 3: Viseme-based lipreading accuracy (%).
Model          Feature    Viseme accuracy (%)   Word accuracy (%)
                          SD      SI            SD      SI
CD-GMM + SAT   DCT        44.66   42.48         14.37   10.47
CD-DNN         DCT        43.67   38.00         23.89    9.17
CD-GMM + SAT   Eigenlips  45.59   44.61         16.71   12.15
CD-DNN         Eigenlips  46.61   44.60         33.06   19.15

Word accuracy achieved with visemes, albeit lower than the viseme accuracy, also shows that Eigenlip features outperform the DCT: we achieved 33.06% in speaker-dependent tests and 19.15% in speaker-independent tests.

5.2 Phoneme-based lipreading experiment

Table 4 shows the word and phoneme accuracies achieved with our phoneme-based lipreading system. This system achieved the most accurate lipreading, with a word accuracy of 48.74%. It is interesting that with the phoneme recogniser, word accuracy is greater than phoneme accuracy, whereas in the viseme recogniser it is the other way around.

Again, the highest accuracy is achieved with Eigenlip features rather than DCT. One interesting observation, apparent in Tables 3 and 4, is that the introduction of the DNN makes little difference to the unit accuracy but a bigger difference to the word accuracy, for both DCT and Eigenlips features.

Table 4: Phoneme-based lipreading accuracy (%).

Model          Feature    Phoneme accuracy (%)  Word accuracy (%)
                          SD      SI            SD      SI
CD-GMM + SAT   DCT        28.22   27.37         21.88   17.72
CD-DNN         DCT        29.18   28.08         37.40   33.87
CD-GMM + SAT   Eigenlips  31.14   29.59         28.79   24.57
CD-DNN         Eigenlips  33.44   31.10         48.74   42.97

5.3 Discussion

Figure 5 plots all of our experimental results, comparing unit accuracies (along the x-axis) against word accuracies (on the y-axis), with error bars showing ±1 standard error.
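The unit and word accuracies reported in Tables 3 and 4 follow the standard alignment-based definition, Acc = (N - S - D - I) / N, where substitutions, deletions and insertions come from the Levenshtein alignment of hypothesis against reference. A minimal sketch (our assumption of the scoring convention; the exact tool used is not stated here):

```python
def accuracy(ref, hyp):
    """Alignment-based accuracy 100 * (N - S - D - I) / N, where the total
    error count S + D + I is the Levenshtein distance between the
    reference and hypothesis sequences (of words or of units)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return 100.0 * (n - d[n][m]) / n
```

Because insertions are penalised, this accuracy can be negative for a recogniser that over-generates, which is why it is preferred over simple correctness for continuous recognition.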
Figure 5: Lipreading system performance: unit accuracy against word accuracy for every combination of unit (viseme/phoneme), classifier (GMM/DNN), feature (DCT/Eigenlips) and scenario (SD/SI).

Figure 5 has two clusters: one, in the bottom right, represents the viseme experiments; the other, on the upper left, the phonemes. Viseme classifiers are drawn as circles (filled for the DNN, open for the GMM) and phonemes as squares (likewise filled or open depending on the classifier). The colours represent the various SI/SD and DCT/Eigenlips combinations.

The phoneme recogniser naturally obtains lower unit accuracy scores because it has nearly three times more phoneme classes than viseme classes (38 versus 13 respectively). But this does not mean that phoneme classes have less power to model a visual gesture. This is visualised in the confusion matrices of Figure 6, where the colour patterns are consistent between the phoneme classes on the left of Fig. 6 and the viseme classes on the right of Fig. 6.

We note that reducing the set of visual speech units also reduces the discriminant power of the classification model whilst increasing the complexity of the pronunciation dictionary by increasing the number of homophone (homoviseme) words. This suggests that the word accuracy of a viseme-based system is unlikely to outperform that of a phoneme-based system.

One of the disadvantages of the DNN is that it is not easy to examine the internals of the network to discover from where it is getting its performance. However, there is a clue in the previous observation: the DNN appears to make the most difference to word accuracy rather than unit accuracy.
Visual speech is notorious for extensive co-articulation, so the implication is that either there are significant differences in the window length between the GMM and the DNN, or the DNN is better able to model co-articulation than the GMM.

Figure 6: Comparison of the viseme confusion matrix (left) and the phoneme confusion matrix (right).

Although there are variations in the window length (here the GMM has a slightly longer span of 150 ms compared to 110 ms for the DNN), the latter explanation is the most likely. In other work [24] we were able to use identical features and again found the DNN superior; furthermore, we know the DNN to be better able to learn data structured on non-linear manifolds, so we believe this is the most likely explanation for the success of the DNN. One caveat is that we have not optimised the scaling factor used in our language models, so there is probably more performance to come when measured as word accuracy.

6 Conclusion

One observation is that DNN-HMM viseme recognizers can easily overfit to the training observations; this is shown in the performance disparity between the SD and SI configurations. It could be interesting in future to use visemes as an initialisation for phoneme recognition in a hierarchical training method similar to that in [4]. We have added more evidence to the argument that phoneme classifiers can outperform those of visemes.
Whilst there is still debate about visemes, and we cannot disregard them, given the evidence showing a significant improvement in word accuracy from the reduction in homophonic words in the pronunciation dictionary, we suggest that phonemes are currently the optimal class labels for lipreading.

We have also illustrated the noticeable performance gain from changing the visual representation from DCT to Eigenlips. The best word accuracy in this work is 48.74% on SD and 42.97% on SI, achieved with the DNN-HMM phoneme-unit recognizer trained on Eigenlip features. However, a disadvantage of the Eigenlip feature is that it is a learned linear mapping that needs to be trained.

Conventional systems have shown speaker independence to be a challenge; here, with a novel DNN-HMM architecture, we have reduced the gap between these arrangements. We speculate that the success of the DNN is likely due to its ability to better model the effects of co-articulation, which is a well-known bugbear of human and machine lip-readers.

References

[1] Nasir Ahmed, T. Natarajan, and Kamisetty R. Rao. Discrete cosine transform. IEEE Transactions on Computers, 100(1):90–93, 1974.

[2] I. Almajai, S. Cox, R. Harvey, and Y. Lan. Improved speaker independent lip reading using speaker adaptive training and deep neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2722–2726, March 2016. doi: 10.1109/ICASSP.2016.7472172.

[3] International Phonetic Association. Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press, 1999.

[4] Helen L Bear and Richard Harvey. Decoding visemes: Improving machine lip-reading. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 2009–2013. IEEE, 2016.

[5] Helen L Bear and Sarah Taylor.
Visual speech recognition: aligning terminologies for better understanding. In British Machine Vision Conference (BMVC), Deep Learning for Machine Lip Reading workshop, 2017.

[6] Helen L Bear, Richard W Harvey, Barry-John Theobald, and Yuxuan Lan. Which phoneme-to-viseme maps best improve visual-only computer lip-reading? In Advances in Visual Computing, pages 230–239. Springer, 2014. doi: 10.1007/978-3-319-14364-4_22.

[7] Christoph Bregler and Yochai Konig. "Eigenlips" for robust speech recognition. In Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on, volume 2, pages II-669. IEEE, 1994.

[8] Stephen J Cox, Richard W Harvey, Yuxuan Lan, Jacob L Newman, and Barry-John Theobald. The challenge of multispeaker lip-reading. In AVSP, pages 179–184, 2008.

[9] Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of Human Genetics, 7(2):179–188, 1936.

[10] M.J.F. Gales. Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language, 12(2):75–98, 1998. ISSN 0885-2308.

[11] N. Harte and E. Gillen. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, 17(5):603–615, May 2015. ISSN 1520-9210. doi: 10.1109/TMM.2015.2407694.

[12] Robert Hecht-Nielsen et al. Theory of the backpropagation neural network. Neural Networks, 1(Supplement-1):445–448, 1988.

[13] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

[14] Dominic Howell, Stephen Cox, and Barry Theobald. Visual units and confusion modelling for automatic lip-reading. Image and Vision Computing, 51:1–12, 2016.
[15] M Kirby, F Weisser, and G Dangelmayr. A model problem in the representation of digital image sequences. Pattern Recognition, 26(1):63–73, 1993. ISSN 0031-3203. doi: 10.1016/0031-3203(93)90088-E. URL http://www.sciencedirect.com/science/article/pii/003132039390088E.

[16] Jianchang Mao and Anil K Jain. Artificial neural networks for feature extraction and multivariate data projection. IEEE Transactions on Neural Networks, 6(2):296–317, 1995.

[17] I Matthews, TF Cootes, JA Bangham, SJ Cox, and RW Harvey. Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):198–213, 2002. ISSN 0162-8828. doi: 10.1109/34.982900.

[18] Neti, Potamianos, Luettin, Matthews, Glotin, Vergyri, Sison, Mashari, and Zhou. Audio-visual speech recognition. Technical report, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, 2000.

[19] Chalapathy Neti, Gerasimos Potamianos, Juergen Luettin, Iain Matthews, Herve Glotin, Dimitra Vergyri, June Sison, and Azad Mashari. Audio visual speech recognition. Technical report, IDIAP, 2000.

[20] Gerasimos Potamianos, Juergen Luettin, and Chalapathy Neti. Hierarchical discriminant features for audio-visual LVCSR. In Acoustics, Speech, and Signal Processing, 2001. Proceedings (ICASSP'01). 2001 IEEE International Conference on, volume 1, pages 165–168. IEEE, 2001.

[21] Daniel Povey and George Saon. Feature and model space speaker adaptation with full covariance Gaussians. In INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 17-21, 2006.

[22] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Nagendra Goel, Mirko Hannemann, Yanmin Qian, Petr Schwarz, and Georg Stemmer. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.

[23] Lawrence Rabiner and B Juang. An introduction to hidden Markov models.
IEEE ASSP Magazine, 3(1):4–16, 1986.

[24] Kwanchiva Thangthai and Richard Harvey. Improving computer lipreading via DNN sequence discriminative training techniques. In INTERSPEECH 2017 (to be published), 2017.

[25] Andrew Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.

[26] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52, 1987.

[27] Xilinx, Inc. 2D Discrete Cosine Transform (DCT) v2.0 LogiCORE Product Specification, 2002.
