Investigating the Lombard Effect Influence on End-to-End Audio-Visual Speech Recognition

Pingchuan Ma¹, Stavros Petridis¹,², Maja Pantic¹,²
¹Imperial College London  ²Samsung AI Center, Cambridge
{pingchuan.ma16, stavros.petridis04}@imperial.ac.uk

Abstract

Several audio-visual speech recognition models have been recently proposed which aim to improve the robustness over audio-only models in the presence of noise. However, almost all of them ignore the impact of the Lombard effect, i.e., the change in speaking style in noisy environments which aims to make speech more intelligible and affects both the acoustic characteristics of speech and the lip movements. In this paper, we investigate the impact of the Lombard effect in audio-visual speech recognition. To the best of our knowledge, this is the first work which does so using end-to-end deep architectures and presents results on unseen speakers. Our results show that properly modelling Lombard speech is always beneficial. Even if a relatively small amount of Lombard speech is added to the training set, the performance in a real scenario, where noisy Lombard speech is present, can be significantly improved. We also show that the standard approach followed in the literature, where a model is trained and tested on noisy plain speech, provides a correct estimate of the video-only performance and slightly underestimates the audio-visual performance. In the case of audio-only approaches, performance is overestimated for SNRs higher than -3dB and underestimated for lower SNRs.

Index Terms: Audio-Visual Speech Recognition, Lombard Speech, End-to-End Models

1. Introduction

It is well known that speakers adapt their speaking style in noisy backgrounds in order to make their speech more intelligible. This is known as the Lombard effect [1] and is acoustically characterised by an increase in sound intensity, fundamental frequency and vowel duration, and by a shift in the formant frequencies [2, 3, 4, 5]. Visually, it is characterised by hyper-articulation [6, 7] and more pronounced rigid head motion [5, 8].

Recently, several audio-visual speech recognition models have been presented [9, 10, 11, 12] which aim to augment the performance of acoustic speech recognisers. The main application of such systems is in noisy acoustic environments, since the main assumption is that the visual signal is not affected by noise and can therefore enhance the performance of speech recognition systems. However, this assumption does not hold, because the Lombard effect also affects the lip movements. In addition, such models are usually trained with plain speech (the terms plain and non-Lombard speech are used interchangeably in this work) which is artificially mixed with additive noise. This approach does not correspond to a realistic scenario, where Lombard (and not plain) speech is mixed with noise. This mismatch can potentially harm the performance of audio-only, video-only and audio-visual speech recognisers.

A few works have investigated the impact of the Lombard effect on audio-only speech recognition [13, 2, 14]. The main finding is that the performance of a model trained on plain speech mixed with noise is significantly degraded when tested on noisy Lombard speech. This is true even when compensated Lombard speech is used, i.e., when the Lombard utterances are normalised to the same energy as the plain speech utterances, although the performance drop is smaller in this case [13].
A similar performance degradation has also been reported for speaker recognition [15]. However, if noisy Lombard speech is used for training, then a significant improvement is reported. It is also worth pointing out that a model trained and tested on noisy Lombard speech outperforms a model trained and tested on noisy plain speech [13].

Even fewer works have investigated the effect of the Lombard reflex on visual and audio-visual speech recognition, and the results are not conclusive. Marxer et al. [13] report an improvement in the recognition of visual Lombard speech no matter whether the model is trained on plain or Lombard speech. As expected, the improvement is higher when visual Lombard speech is used for training. On the other hand, Heracleous et al. [16] reported a performance drop when there is a mismatch between training and testing conditions. The same conclusion was also reached when an audio-visual speech recognition system was used. Finally, it has recently been shown that the mismatch between plain and Lombard speech can also affect the performance of audio-visual speech enhancement models [17].

In this work, we investigate the impact of the Lombard effect on end-to-end audio-only, video-only and audio-visual speech recognition. To the best of our knowledge, this is the first work that studies the Lombard effect within the framework of deep end-to-end models which learn to extract features directly from raw images and audio waveforms. This is in contrast with the majority of previous works, which used hand-crafted features in combination with Gaussian Mixture Model-Hidden Markov Models (GMM-HMMs).

In addition, we consider both multi-speaker and subject-independent scenarios. The former has been extensively studied in previous works [16, 13] and offers an insight into the impact of the Lombard effect. However, in a real scenario we are mainly interested in the performance on unseen speakers. Hence, we first conduct multi-speaker experiments in order to test the claims made by prior works. We then also conduct subject-independent experiments in order to investigate the performance on unseen speakers, which has not been explored before. This is of particular interest since it is known that the degree of the Lombard effect is highly speaker dependent [2, 13].

Finally, we report results on sentence-level speech recognition. This is in contrast to previous works, which focus mainly either on isolated words [16] or on specific words within a sentence [13]. We believe that the conclusions reached by this approach can be more useful for a practical speech recognition system, where the goal will most likely be to recognise all words in a sentence rather than just isolated words.

We show that properly modelling Lombard speech during training leads to improved performance for audio-only, video-only and audio-visual speech recognition models in all experiments. We also show that in subject-independent experiments, including even a relatively small set of Lombard speech during training can significantly improve the performance of an audio-visual speech recogniser in real conditions, i.e., when testing on noisy Lombard speech. Finally, we show that the standard approach followed in the literature, where noise is mixed with plain speech for training and testing, overestimates the actual performance of audio-only models on noisy Lombard speech for signal-to-noise ratios (SNRs) higher than -3dB but underestimates it for lower SNRs. On the other hand, the visual performance is correctly estimated in all scenarios and the audio-visual performance is slightly underestimated.
2. Lombard Grid Database

For the purpose of this study, we use the Audio-Visual Lombard Grid corpus [4]. The corpus consists of 5,400 utterances from 54 speakers (30 female and 24 male), with 100 utterances (50 Lombard and 50 plain) per speaker. Each utterance is a six-word sequence formed from the following components: <command: 4> <colour: 4> <preposition: 4> <letter: 25> <digit: 10> <adverb: 4>, where the number in the angle brackets indicates the number of choices for each component.

During speaking, both frontal and profile faces were simultaneously recorded at 25 frames per second (fps), and audio was recorded at 48kHz and downsampled to 16kHz. Recordings for each utterance were collected under two conditions, Lombard (L) and non-Lombard (NL). In the NL condition, participants read sentences into a condenser microphone placed 30cm in front of them, with own-voice attenuation compensated. The L condition follows the same setting, but speech-shaped noise at 80dB sound pressure level was presented to the participants via headphones.

3. Architecture

The end-to-end audio-visual speech recognition architecture is shown in Fig. 1 and is similar to the one proposed in [9]. A CTC loss is added so that the model can recognise continuous speech.

[Figure 1: End-to-end audio-visual speech recognition architecture overview. Raw images and audio waveforms are fed to the visual and audio streams, respectively, which produce features at the same frame rate at the bottleneck layer. These features are fused together and fed into another 2-layer Bidirectional Gated Recurrent Unit (BGRU) network to model the temporal dynamics. Connectionist Temporal Classification (CTC) [18] is used as the loss function.]

3.1. Visual Stream

The visual stream consists of a spatiotemporal convolutional layer, followed by a ResNet-18 [19] and a 2-layer BGRU. Specifically, the temporal-wise 3D convolutional layer has a kernel size of 5 frames. Frame-level features are then extracted by the ResNet-18, whose output is fed to a 2-layer BGRU to model the temporal dynamics of the visual features. Note that the outputs of the forward and backward GRUs are concatenated rather than added, so although there are 128 GRU cells per direction, the features produced by the BGRU have a dimensionality of 256.
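The following PyTorch sketch illustrates the visual stream described above. The 5-frame temporal kernel, ResNet-18 trunk and 128-cell bidirectional GRUs follow the paper, but the exact strides, pooling and initialisation are assumptions where the paper does not specify them; the module and variable names are our own.

```python
# A minimal sketch of the visual stream (Section 3.1), assuming PyTorch and
# torchvision; layer details not stated in the paper are illustrative choices.
import torch
import torch.nn as nn
import torchvision.models as models

class VisualStream(nn.Module):
    def __init__(self, gru_hidden=128):
        super().__init__()
        # Spatiotemporal front-end: 3D convolution with a 5-frame temporal kernel.
        self.frontend3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Frame-level features: ResNet-18 trunk with its first conv and final
        # classification layer removed (the front-end already outputs 64 channels).
        resnet = models.resnet18()
        self.trunk = nn.Sequential(*list(resnet.children())[1:-1])
        # Temporal model: 2-layer BGRU; forward and backward outputs are
        # concatenated, so 128 cells per direction yield 256-dim features.
        self.bgru = nn.GRU(512, gru_hidden, num_layers=2,
                           batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (B, 1, T, H, W) mouth ROI frames
        x = self.frontend3d(x)                 # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.trunk(x).reshape(b, t, -1)    # (B, T, 512) frame-level features
        out, _ = self.bgru(x)                  # (B, T, 256)
        return out

# Example: a batch of 2 clips, 29 grayscale frames of 96x96 pixels each.
feats = VisualStream()(torch.randn(2, 1, 29, 96, 96))
print(feats.shape)                             # torch.Size([2, 29, 256])
```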
3.2. Audio Stream

The audio stream consists of 5 temporal convolutional blocks, followed by a 2-layer BGRU and an average pooling layer. Each convolutional block includes a temporal convolutional layer, a ReLU activation and batch normalisation. The first temporal convolutional layer uses a kernel of 5ms and a stride of 0.25ms to extract fine-scale spectral information. The output of the convolutional layers is fed to a 2-layer BGRU. Similarly to the visual stream, the outputs of the forward and backward BGRUs are concatenated. Finally, an average pooling layer is used to reduce the audio frame rate to the visual frame rate.

3.3. Fusion Layers

Once the 256 audio features and 256 visual features have been extracted, they are concatenated and fed into a 2-layer BGRU to model their temporal dynamics. A softmax layer then follows, which provides the character probabilities for each frame.

3.4. Connectionist Temporal Classification

The CTC loss is used to transcribe directly between inputs and target outputs without any intermediate annotation. Given an input sequence $x = (x_1, \ldots, x_T)$, CTC sums over the probabilities of all valid alignments of length $T$ to obtain the posterior of the target sequence $y = (y_1, \ldots, y_L)$:

$$p(y \mid x) = \sum_{a \in \mathcal{A}(y)} \prod_{t=1}^{T} p_t(a_t \mid x)$$

where $\mathcal{A}(y)$ is the set of valid alignments of $y$, $p_t(a_t \mid x)$ is the per-time-step probability, and the product gives the probability of a single valid alignment. The CTC loss is the negative log-likelihood of this posterior probability.
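PyTorch's built-in CTC loss implements exactly this marginalisation over valid alignments. The sketch below shows how the per-frame softmax outputs of the fusion BGRU would be scored against a target character sequence; the batch size, sequence lengths and character-set size are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of the CTC loss (Section 3.4) using PyTorch's built-in
# implementation; shapes and the character-set size below are assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 28          # e.g. 26 letters + space + CTC blank (index 0); assumed
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

B, T, L = 4, 75, 30       # batch size, input frames, max target length
logits = torch.randn(B, T, NUM_CLASSES, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)                    # per-frame probabilities
targets = torch.randint(1, NUM_CLASSES, (B, L))           # character indices, no blanks
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.randint(10, L + 1, (B,), dtype=torch.long)

# nn.CTCLoss expects (T, B, C) log-probabilities; it marginalises over all
# valid alignments, i.e. the negative log of the posterior defined above.
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
```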
4. Experimental Setup

4.1. Preprocessing

4.1.1. Video Preprocessing

We use dlib [20] to detect and track facial landmarks for frontal faces, and the face alignment library proposed in [21] for profile faces. The faces are first aligned to a neutral reference frame in order to normalise for rotation and size differences. This is performed with an affine transform using 5 stable points: two corners of each eye and the tip of the nose. The centre of the mouth is then located from the tracked points, and a bounding box of 140 by 200 pixels (frontal) or 80 by 60 pixels (profile) is used to extract the mouth region of interest (ROI).

4.1.2. Audio Preprocessing

Lombard utterances have greater energy than plain speech utterances, so for a given noise level their SNR is higher than that of noisy plain speech. Similarly to [13], we therefore also generate 'compensated' Lombard speech, where the energy of Lombard speech is normalised to that of plain speech. In this case, the SNR of Lombard and plain utterances is the same for a given noise level.

To remove the artificial variability of the signals caused by the speaker-to-microphone distance, we follow the approach suggested in [13]. We normalise the non-Lombard and 'compensated' Lombard signals to the same root mean square (RMS) value of 0.05. For the Lombard signals, we set the RMS to $0.05 \cdot \bar{x}^{L}_{rms} / \bar{x}^{NL}_{rms}$, where $\bar{x}^{L}_{rms}$ and $\bar{x}^{NL}_{rms}$ are the average RMS values over the Lombard and non-Lombard speech corpora, respectively.

4.2. Data Augmentation

During training, two data augmentation methods are applied to the raw images: random cropping and horizontal flipping. Specifically, each frontal mouth ROI is randomly cropped to a size of 130 by 190, and each profile mouth ROI to a size of 75 by 55. During testing, the central patch is cropped. Horizontal flipping with a probability of 0.5 is used to increase the variation of the training samples.

Babble noise at different levels is added to the audio waveforms during training. The SNR levels range from -15dB to 6dB in 3dB intervals. One of the noise levels, or the clean signal, is selected uniformly at random, which enhances robustness to different noise levels.
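The sketch below illustrates the RMS normalisation of Section 4.1.2 followed by the random-SNR noise mixing of Section 4.2. The function names are our own and the babble noise source is assumed to be provided; the paper itself gives no code.

```python
# A sketch of the audio pipeline: RMS normalisation (Section 4.1.2) followed
# by mixing with babble noise at a randomly chosen SNR (Section 4.2).
import random
import numpy as np

SNR_LEVELS_DB = list(range(-15, 7, 3)) + [None]   # None = keep the clean signal

def normalise_rms(signal, target_rms=0.05):
    """Scale a waveform to a fixed root mean square value."""
    rms = np.sqrt(np.mean(signal ** 2))
    return signal * (target_rms / (rms + 1e-8))

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at the requested signal-to-noise ratio (in dB)."""
    if snr_db is None:
        return speech
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example with synthetic data (a real pipeline would load babble noise here):
rng = np.random.default_rng(0)
wav = normalise_rms(rng.standard_normal(16000))        # 1 s of audio at 16 kHz
babble = rng.standard_normal(16000)                    # stand-in for babble noise
noisy = mix_at_snr(wav, babble, random.choice(SNR_LEVELS_DB))
```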
4.2.1. Training

We first train each stream from scratch. An initial learning rate of 0.001 and a mini-batch size of 64 are used for the audio stream, and an initial learning rate of 0.0003 and a mini-batch size of 10 for the visual stream. We train the audio stream for 400 epochs and the visual stream for 120 epochs, separately. Once the audio and visual streams have been trained, their weights are fixed and the 2-layer BGRU used for fusion is trained with an initial learning rate of 0.0003 and a mini-batch size of 10. Finally, the entire audio-visual model is fine-tuned for another 40 epochs.
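A minimal sketch of this staged schedule is shown below. The learning rates, batch sizes and epoch counts come from the paper; Adam as the optimiser and the tiny stand-in modules are our assumptions (the paper names neither).

```python
# A sketch of the staged training (Section 4.2.1): per-stream pretraining,
# frozen-stream fusion training, then end-to-end fine-tuning.
import torch
import torch.nn as nn

# Stand-ins for the audio stream, visual stream and fusion BGRU (Sections 3.1-3.3).
audio = nn.GRU(4, 128, num_layers=2, batch_first=True, bidirectional=True)
visual = nn.GRU(4, 128, num_layers=2, batch_first=True, bidirectional=True)
fusion = nn.GRU(512, 128, num_layers=2, batch_first=True, bidirectional=True)

# Stage 1: train each stream from scratch with its own schedule.
opt_audio = torch.optim.Adam(audio.parameters(), lr=1e-3)    # 400 epochs, mini-batch 64
opt_visual = torch.optim.Adam(visual.parameters(), lr=3e-4)  # 120 epochs, mini-batch 10

# Stage 2: freeze the trained streams; only the fusion BGRU is updated.
for p in list(audio.parameters()) + list(visual.parameters()):
    p.requires_grad = False
opt_fusion = torch.optim.Adam(fusion.parameters(), lr=3e-4)  # mini-batch 10

# Stage 3: unfreeze everything and fine-tune end-to-end for another 40 epochs.
for p in list(audio.parameters()) + list(visual.parameters()):
    p.requires_grad = True
opt_finetune = torch.optim.Adam(
    list(audio.parameters()) + list(visual.parameters()) + list(fusion.parameters()),
    lr=3e-4,  # assumed: the paper does not state the fine-tuning learning rate
)
```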
5. Results

5.1. Multi-speaker experiments

In this set of experiments, we investigate the impact of the Lombard effect in a multi-speaker scenario when end-to-end deep models are used for speech recognition. For the purpose of this study, we use 30, 10 and 10 utterances from each subject for training, validation and testing, respectively. A similar study was conducted in [13], but a traditional GMM-HMM approach was followed and SNR-specific models were trained.

[Figure 2: WER of the end-to-end models as a function of the noise level in a multi-speaker scenario; (a) SNR-specific models, (b) a single model trained on all SNRs. A: audio-only model, AV: audio-visual model, L: Lombard, NL: non-Lombard, CL: 'compensated' Lombard. X-Y indicates a model trained on X (L or NL) speech and tested on Y (L, NL or CL) speech. Best seen in colour.]

For comparison purposes, we first train SNR-specific audio-only models for non-Lombard and Lombard speech, similarly to [13]. Results are shown in Fig. 2a and are overall consistent with the results presented in [13]. We notice that when we train a model on non-Lombard speech and test it on Lombard speech (red solid line), a significant drop in performance compared to testing on non-Lombard speech (orange solid line) is observed between -9dB and 6dB. This is mainly a consequence of the SNR mismatch between Lombard and plain speech. However, between -12dB and -15dB there is no difference between the two training approaches. When we test on 'compensated' Lombard speech (blue solid line), the results are still worse than on non-Lombard speech (by up to 4%). This indicates that not only the SNR mismatch affects the performance, but also, to a smaller extent, the difference in acoustic characteristics between Lombard and non-Lombard speech.

Results for multi-speaker experiments where a single model is trained using the SNR augmentation approach from Section 4.2 are shown in Fig. 2b. The main difference from the previous set of experiments is that the performance on Lombard speech (red solid line), for a model trained on non-Lombard speech, is better than the performance on non-Lombard speech (orange solid line) between -15dB and -6dB. This is probably because all SNR levels are seen during training, so the influence of the SNR mismatch between Lombard and plain speech is minimised. The same pattern is also observed for 'compensated' Lombard speech (blue solid line). This indicates that although at higher SNRs a model trained and tested on non-Lombard speech, which is the usual approach in the literature, overestimates the actual performance, at lower SNRs it actually underestimates it. It is also worth pointing out that when we train on Lombard speech, a significant improvement in performance is observed when we test on Lombard speech (green solid line) compared to training on non-Lombard speech and testing either on Lombard (red solid line) or non-Lombard speech (orange solid line).

The results of video-only models are reported in Table 1. A slight improvement of 0.45% is observed for NL-L over NL-NL. This is not entirely consistent with [13], which reported a larger improvement of 4.6%. We also notice that L-L yields an absolute improvement of 2.48% over NL-L, which shows the benefit of properly modelling Lombard speech.

Table 1: Video-only results (WER, %) in the multi-speaker scenario. L: Lombard, NL: non-Lombard. X-Y indicates a model trained on X (L or NL) speech and tested on Y (L or NL) speech.

Views         | L-L   | NL-L  | NL-NL
WER (Frontal) | 23.57 | 26.05 | 25.59

The results of audio-visual models are shown in Fig. 2b. As expected, the audio-visual models have a lower WER than the audio-only models across all noise levels. It is worth pointing out again that when Lombard speech is properly modelled, better performance is achieved (green dashed line vs. red dashed line).

5.2. Subject-independent experiments

The previous experiments considered multi-speaker models. However, in real scenarios we would like a model that works on unseen subjects. To better investigate the impact of the Lombard effect in subject-independent experiments, the training, validation and test sets are divided into 36, 6 and 12 subjects, respectively. It is important to note that the same numbers of female and male speakers are included in the validation and test sets.

[Figure 3: WER of the end-to-end models as a function of the noise level in a subject-independent scenario. A: audio-only model, AV: audio-visual model, L: Lombard, NL: non-Lombard, CL: 'compensated' Lombard. X-Y indicates a model trained on X (L or NL) speech and tested on Y (L, NL or CL) speech. Best seen in colour.]

The results of the audio-only experiments are shown in Fig. 3. Similar conclusions to those drawn for the multi-speaker experiments in Section 5.1 can be drawn. The performance on Lombard speech (red solid line), for a model trained on non-Lombard speech, is better than the performance on non-Lombard speech (orange solid line) between -15dB and -6dB. The same pattern is also observed for 'compensated' Lombard speech (blue solid line). Again, this demonstrates that the approach followed in the literature, i.e., training and testing on non-Lombard speech, overestimates the actual performance at higher SNRs but underestimates it at lower SNRs.

The video-only results are reported in Table 2. When we train and test a model on Lombard speech, absolute improvements of 2.84% and 8.16% are observed on frontal and profile faces, respectively, over the NL-L scenario. The performance of NL-L is very similar to NL-NL, which reveals that the approach followed in the literature (NL-NL) provides a correct estimate of the actual performance (NL-L). We also notice that the performance on profile faces is much worse, due to less information being available as well as inaccurate tracking in profile videos.

Table 2: Video-only results (WER, %) in subject-independent experiments. L: Lombard, NL: non-Lombard. X-Y indicates a model trained on X (L or NL) speech and tested on Y (L or NL) speech.

Views         | L-L   | NL-L  | NL-NL
WER (Frontal) | 25.00 | 27.84 | 27.66
WER (Profile) | 39.45 | 47.61 | 47.47

The results of audio-visual models are shown in Fig. 3. Similarly to the multi-speaker scenario, the best performance is achieved when a model is trained and tested on Lombard speech (green dashed line). It is also clear that training and testing on plain speech (orange dashed line) slightly underestimates the performance in the real scenario, where Lombard speech is used for testing (red dashed line).

[Figure 4: WER of the end-to-end audio-visual model as a function of the noise level in a subject-independent scenario. L: Lombard, NL: non-Lombard. (NL,0.25L)-L indicates a model trained on non-Lombard plus 25% Lombard speech and tested on Lombard speech; the other combinations follow the same pattern. Best seen in colour.]

Fig. 4 shows the performance of an audio-visual model as a function of the percentage of Lombard speech combined with plain speech for training. It is clear that even when the Lombard utterances added to the training set amount to only 25% of the plain speech, the gap between NL-L and L-L is halved. Also, when Lombard speech amounts to 50% of the plain speech, performance similar to the L-L scenario is achieved at very low SNRs.

6. Conclusions

In this work, we investigate the impact of the Lombard effect on audio-only, video-only and audio-visual speech recognition. We show that it is always beneficial to properly model Lombard speech. We also show that training and testing on noisy plain speech, as is common in the literature, provides a good estimate of the performance on visual Lombard speech but a poor estimate of the performance of audio-only speech recognition. It would be interesting to investigate in future work how different types of background noise affect the performance of audio-visual speech recognition models.

7. Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research. The work of Pingchuan Ma has been funded by the Honda Research Institute.

8. References

[1] E. Lombard, "Le signe de l'élévation de la voix," Annales des Maladies de l'Oreille et du Larynx, pp. 101–119, 1911.
[2] J.-C. Junqua, "The Lombard reflex and its role on human listeners and automatic speech recognizers," The Journal of the Acoustical Society of America, vol. 93, no. 1, pp. 510–524, 1993.
[3] A. L. Pittman and T. L. Wiley, "Recognition of speech produced in noise," Journal of Speech, Language, and Hearing Research, 2001.
[4] N. Alghamdi, S. Maddock, R. Marxer, J. Barker, and G. J. Brown, "A corpus of audio-visual Lombard speech with frontal and profile views," The Journal of the Acoustical Society of America, vol. 143, no. 6, pp. EL523–EL529, 2018.
[5] W. V. Summers, D. B. Pisoni, R. H. Bernacki, R. I. Pedlow, and M. A. Stokes, "Effects of noise on speech production: Acoustic and perceptual analyses," The Journal of the Acoustical Society of America, vol. 84, no. 3, pp. 917–928, 1988.
[6] M. Garnier, L. Ménard, and B. Alexandre, "Hyper-articulation in Lombard speech: An active communicative strategy to enhance visible speech cues?" The Journal of the Acoustical Society of America, vol. 144, no. 2, pp. 1059–1074, 2018.
[7] J. Šimko, Š. Beňuš, and M. Vainio, "Hyperarticulation in Lombard speech: Global coordination of the jaw, lips and the tongue," The Journal of the Acoustical Society of America, vol. 139, no. 1, pp. 151–162, 2016.
[8] E. Vatikiotis-Bateson, A. V. Barbosa, C. Y. Chow, M. Oberg, J. Tan, and H. C. Yehia, "Audiovisual Lombard speech: reconciling production and perception," in AVSP, 2007, p. 41.
[9] S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, "End-to-end audiovisual speech recognition," in IEEE ICASSP, 2018, pp. 6548–6552.
[10] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Lip reading sentences in the wild," in IEEE CVPR, 2017, pp. 3444–3453.
[11] G. Sterpu, C. Saam, and N. Harte, "Attention-based audio-visual fusion for robust automatic speech recognition," in ACM ICMI, 2018, pp. 111–115.
[12] S. Petridis, T. Stafylakis, P. Ma, G. Tzimiropoulos, and M. Pantic, "Audio-visual speech recognition with a hybrid CTC/attention architecture," in IEEE Spoken Language Technology Workshop, 2018, pp. 513–520.
[13] R. Marxer, J. Barker, N. Alghamdi, and S. Maddock, "The impact of the Lombard effect on audio and visual speech recognition systems," Speech Communication, vol. 100, pp. 58–68, 2018.
[14] A. Wakao, K. Takeda, and F. Itakura, "Variability of Lombard effects under different noise conditions," in IEEE ICSLP, vol. 4, 1996, pp. 2009–2012.
[15] J. H. Hansen and V. Varadarajan, "Analysis and compensation of Lombard speech across noise type and levels with application to in-set/out-of-set speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 2, pp. 366–378, 2009.
[16] P. Heracleous, C. T. Ishi, M. Sato, H. Ishiguro, and N. Hagita, "Analysis of the visual Lombard effect and automatic recognition experiments," Computer Speech & Language, vol. 27, no. 1, pp. 288–300, 2013.
[17] D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, "Effects of Lombard reflex on the performance of deep-learning-based audio-visual speech enhancement systems," in IEEE ICASSP, 2019, pp. 6615–6619.
[18] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in ICML, 2006, pp. 369–376.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE CVPR, 2016, pp. 770–778.
[20] D. E. King, "Dlib-ml: A machine learning toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755–1758, 2009.
[21] A. Bulat and G. Tzimiropoulos, "How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks)," in ICCV, 2017.