Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks
Authors: Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Yu Tsao, Hsiu-Wen Chang, and Hsin-Min Wang
To appear in IEEE Transactions on Emerging Topics in Computational Intelligence.

Abstract: Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus only on addressing audio information. In this work, inspired by multimodal learning, which utilizes data from different modalities, and the recent success of convolutional neural networks (CNNs) in SE, we propose an audio-visual deep CNNs (AVDCNN) SE model, which incorporates audio and visual streams into a unified network model. We also propose a multi-task learning framework for reconstructing audio and visual signals at the output layer. Precisely speaking, the proposed AVDCNN model is structured as an audio-visual encoder-decoder network, in which audio and visual data are first processed using individual CNNs, and then fused into a joint network to generate enhanced speech (the primary task) and reconstructed images (the secondary task) at the output layer. The model is trained in an end-to-end manner, and parameters are jointly learned through back-propagation. We evaluate enhanced speech using five instrumental criteria. Results show that the AVDCNN model yields a notably superior performance compared with an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process. In addition, the AVDCNN model also outperforms an existing audio-visual SE model, confirming its capability of effectively combining audio and visual information in SE.

Index Terms: Audio-visual systems, deep convolutional neural networks, multimodal learning, speech enhancement.

I. INTRODUCTION

The primary goal of speech enhancement (SE) is to enhance the intelligibility and quality of noisy speech signals by reducing the noise components of noise-corrupted speech. To attain a satisfactory performance, SE has been used as a fundamental unit in various speech-related applications, such as automatic speech recognition [1, 2], speaker recognition [3, 4], speech coding [5, 6], hearing aids [7-9], and cochlear implants [10-12]. In the past few decades, numerous SE methods have been proposed and proven to provide an improved sound quality. One notable approach, spectral restoration, estimates a gain function (based on the statistics of noise and speech components), which is then used to suppress noise components in the frequency domain to obtain a clean speech spectrum from a noisy one [13-17]. Another class of approaches adopts a nonlinear model to map noisy speech signals to clean ones [18-21]. In recent years, SE methods based on deep learning have been proposed and investigated extensively, such as denoising autoencoders [22, 23]. SE methods using deep neural networks (DNNs) generally exhibit better performances than conventional SE models [24-26]. Approaches that utilize recurrent neural networks (RNNs) and long short-term memory (LSTM) models have also been confirmed to exhibit promising SE and related speech signal processing performances [27-29]. In addition, inspired by the success of image recognition using convolutional neural networks (CNNs), a CNN-based model has been shown to obtain good results in SE owing to its strength in handling image-like 2-D time-frequency representations of noisy speech [31, 32].
In addition to speech signals, visual information is important in human-human or human-machine interactions. A study of the McGurk effect [33] indicated that the shape of the mouth or lips could play an important role in speech processing. Accordingly, audio-visual multimodality has been adopted in numerous fields of speech processing [34-39]. The results have shown that the visual modality enhances the performance of speech processing compared with its counterpart that uses the audio modality alone. In addition, topics regarding the fusion of audio and visual features have been addressed in [40, 41], where additional reliability measures were adopted to achieve a better dynamic weighting of the audio and visual streams. On the other hand, in [42, 43] intuitive fusion schemes were adopted in multimodal learning based on the architectures of neural networks. There have also been several related studies in the field of audio-visual SE [44-50]. Most of these are based on an enhancement filter, with the help of handcrafted visual features derived from lip shape information. Recently, some audio-visual SE models based on deep learning have also been proposed [51, 52]. In [51], Mel filter banks and a Gauss-Newton deformable part model [53] were used to extract audio and mouth shape features. Experimental results showed that DNNs with audio-visual inputs outperformed DNNs with only audio inputs in several standardized instrumental evaluations. In [52], the authors proposed dealing with audio and visual data using DNNs and CNNs, respectively. The noisy audio features and the corresponding video features were used as input, and the audio features were used as the target during training.

In the present work, we adopted CNNs to process both the audio and visual streams. The outputs of the two networks were fused into a joint network. Noisy speech and visual data were placed at the inputs, and clean speech and visual data were placed at the outputs.
The entire model was trained in an end-to-end manner, and was structured as an audio-visual encoder-decoder network. Notably, the visual information at the output layer served as part of the constraints during the training of the model, and thus the system adopted a multi-task learning scheme that considered heterogeneous information. Such a unique audio-visual encoder-decoder network design has not been used in any related work [51, 52]. In short, the proposed audio-visual SE model takes advantage of CNNs, which have been shown to be effective in speech enhancement [31, 32] and in image and face recognition [54, 55], for both the audio and visual streams, and of the properties of deep learning, i.e., reducing human-engineering efforts through end-to-end learning and intuitive fusion schemes in multimodal learning tasks. To the best of our knowledge, this is the first model to exploit all of the aforementioned properties at once in a deep learning-based audio-visual SE model.

Our experimental results show that the proposed audio-visual SE model outperforms four baseline models, including three audio-only SE models and the audio-visual SE model in [51], in terms of several standard evaluation metrics, including the perceptual evaluation of speech quality (PESQ) [56], short-time objective intelligibility (STOI) [57], speech distortion index (SDI) [58], hearing-aid speech quality index (HASQI) [59], and hearing-aid speech perception index (HASPI) [60]. This confirms the effectiveness of incorporating visual information into the CNN-based multimodal SE framework, and its superior efficacy in combining heterogeneous information as an audio-visual SE model. In addition, an alternative fusion scheme (i.e., early fusion) based on our audio-visual model is also evaluated, and the results show that the proposed architecture is superior to the early fusion one.

The remainder of this paper is organized as follows. Section II describes the preprocessing of the audio and visual streams. Section III introduces the proposed CNN-based audio-visual model for SE, and describes the four baseline models used for comparison. Section IV describes the experimental setup and results, and a discussion follows in Section V. Section VI presents the concluding remarks of this study.

II. DATA SET AND PREPROCESSING

In this section, we provide the details of the dataset and the preprocessing steps for the audio and visual streams.

A. Data Collection

The prepared dataset contains video recordings of 320 utterances of Mandarin sentences spoken by a native speaker. The script for recording is based on the Taiwan Mandarin hearing in noise test (Taiwan MHINT) [61], which contains 16 lists, each including 20 sentences. The sentences are specially designed to have similar phonemic characteristics among the lists. Each sentence is unique and contains 10 Chinese characters. The length of each utterance is approximately 3-4 seconds. The utterances were recorded in a quiet room with sufficient light, and the speaker was filmed from the front view. Videos were recorded at 30 frames per second (fps), at a resolution of 1920 pixels × 1080 pixels. Stereo audio channels were recorded at 48 kHz. 280 utterances were randomly selected as the training set, with the remaining 40 utterances used as the testing set.
B. Audio Feature Extraction

We resampled the audio signal to 16 kHz, and used only a mono channel for further processing. Speech signals were converted into the frequency domain and processed into a sequence of frames using the short-time Fourier transform. Each frame contained a window of 32 milliseconds, and the window overlap ratio was 37.5%. Therefore, there were 50 frames per second. For each speech frame, we extracted the logarithmic power spectrum, and normalized the values by removing the mean and dividing by the standard deviation. The normalization process was conducted at the utterance level, i.e., the mean and standard deviation vectors were calculated from all frames of an utterance. We concatenated ±2 frames to the central frame as context windows. Accordingly, the audio features had dimensions of 257 × 5 at each time step. The noisy and clean versions of these features serve as the network input and the training target, respectively.

C. Visual Feature Extraction

For the visual stream, we converted each video that contained an utterance into an image sequence at a fixed frame rate of 50 fps, in order to keep the speech frames and the image frames synchronized. Next, we detected the mouth using the Viola-Jones method [62], resized the cropped mouth region to 16 pixels × 24 pixels, and retained its RGB channels. In each channel, we rescaled the image pixel intensities to a range of 0 to 1, and then subtracted the mean and divided by the standard deviation for normalization. This normalization was conducted for each colored mouth image. In addition, we concatenated ±2 frames to the central frame, resulting in visual features with dimensions of 16 × 24 × 3 × 5 at each time step. These cropped mouth images serve as the input visual features.

For each utterance, the number of frames of the audio spectrogram and the number of mouth images were made the same using truncation if necessary.
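To make the preprocessing concrete, the following Python sketch computes the audio features described above (log-power STFT frames at 50 fps with utterance-level normalization and ±2-frame context). The use of librosa and the function name audio_features are our own choices for illustration, not code released by the authors; the analogous mouth-cropping step (Viola-Jones detection, e.g., via an OpenCV Haar cascade, followed by resizing to 16 × 24 and per-image normalization) is omitted for brevity.

```python
import numpy as np
import librosa  # assumed here for the STFT; the paper does not name a signal-processing library

def audio_features(wav_path, context=2):
    """Log-power spectra with +/-`context` frames, roughly as in Sec. II-B (shape (T, 257, 5))."""
    y, _ = librosa.load(wav_path, sr=16000, mono=True)       # resample to 16 kHz, mono channel
    # 32 ms window (512 samples) with 37.5% overlap -> 20 ms hop, i.e., 50 frames per second
    spec = librosa.stft(y, n_fft=512, hop_length=320, win_length=512)
    logpow = np.log(np.abs(spec) ** 2 + 1e-12)                # (257, T) log-power spectrum
    # utterance-level mean/variance normalization, per frequency bin
    mean = logpow.mean(axis=1, keepdims=True)
    std = logpow.std(axis=1, keepdims=True) + 1e-12
    logpow = (logpow - mean) / std
    # stack the +/- `context` neighboring frames around each central frame
    padded = np.pad(logpow, ((0, 0), (context, context)), mode="edge")
    frames = [padded[:, t:t + logpow.shape[1]] for t in range(2 * context + 1)]
    return np.stack(frames, axis=-1).transpose(1, 0, 2)       # (T, 257, 5)
```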
III. AUDIO-VISUAL DEEP CONVOLUTIONAL NEURAL NETWORKS (AVDCNN)

The architecture of the proposed AVDCNN model is illustrated in Fig. 1. It is composed of two individual networks that handle the audio and visual streams, respectively, namely the audio network and the visual network. The outputs of the two networks are fused into another network, called the fusion network. The convolutional, maximum pooling, and fully-connected layers in the diagram are abbreviated as Conv_a1, Conv_a2, Conv_v1, ..., Pool_a1, FC1, FC2, FC_a3, and FC_v3, where the subscripts 'a' and 'v' denote the audio and visual streams, respectively. In the following, we describe the training procedure of the AVDCNN model.

Fig. 1. The architecture of the proposed AVDCNN model.

A. Training the AVDCNN Model

To train the AVDCNN model, we first prepare noisy-clean speech pairs and mouth images. As described in parts B and C of Section II, we have the logarithmic amplitudes of the noisy and clean spectra and the corresponding visual features. For each time step, the audio network maps the noisy audio features to an intermediate audio representation, as given in (1), and the visual network maps the input visual features to an intermediate visual representation, as given in (2). Next, we flatten the two representations and concatenate them as the input of the fusion network. A cascade of feed-forward fully-connected layers is then computed, as given in (3) and (4), to produce the enhanced speech features and the reconstructed visual features at the output layer. The parameters of the AVDCNN model are randomly initialized between -1 and 1, and are trained with back-propagation by optimizing the objective function in (5), which sums the audio reconstruction error and the visual reconstruction error scaled by a mixing weight.

A stride size of 1 × 1 is adopted in the CNNs of the AVDCNN model, and a dropout of 0.1 is adopted after FC1 and FC2 to prevent overfitting. Batch normalization is applied to each layer in the model. Other configuration details are presented in Table I.

TABLE I
CONFIGURATIONS OF THE AVDCNN MODEL

Layer Name   | Kernel | Activation Function | Number of Filters or Neurons
Conv_a1      | 12 × 2 | Linear              | 10
Pool_a1      | 2 × 1  | -                   | -
Conv_a2      | 5 × 1  | Linear              | 4
Conv_v1      | 15 × 2 | Linear              | 12
Conv_v2      | 7 × 2  | Linear              | 10
Conv_v3      | 3 × 2  | Linear              | 6
Merged Layer | -      | -                   | 2804
FC1          | -      | Sigmoid             | 1000
FC2          | -      | Sigmoid             | 800
FC_a3        | -      | Linear              | 600
FC_v3        | -      | Linear              | 1500
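A minimal Keras sketch of how Fig. 1 and Table I could be assembled is given below, with the mixing weight of (5) realized through loss_weights. The input shapes, padding choices, the treatment of the visual context frames as channels, and the mapping of the FC_a3/FC_v3 sizes to the output targets are our assumptions, so this should be read as an illustration of the architecture rather than the authors' implementation (batch normalization after each layer is also omitted for brevity).

```python
from tensorflow.keras import layers, models, optimizers

def build_avdcnn(gamma=1.0):
    # audio branch: 257 x 5 log-power spectra (Sec. II-B); kernels/filters follow Table I
    a_in = layers.Input(shape=(257, 5, 1), name="noisy_audio")
    a = layers.Conv2D(10, (12, 2), activation="linear", padding="same")(a_in)   # Conv_a1
    a = layers.MaxPooling2D(pool_size=(2, 1))(a)                                # Pool_a1
    a = layers.Conv2D(4, (5, 1), activation="linear", padding="same")(a)        # Conv_a2
    a = layers.Flatten()(a)

    # visual branch: 16 x 24 mouth crops; RGB x 5 context frames stacked as 15 channels (assumption)
    v_in = layers.Input(shape=(16, 24, 15), name="mouth_images")
    v = layers.Conv2D(12, (15, 2), activation="linear", padding="same")(v_in)   # Conv_v1
    v = layers.Conv2D(10, (7, 2), activation="linear", padding="same")(v)       # Conv_v2
    v = layers.Conv2D(6, (3, 2), activation="linear", padding="same")(v)        # Conv_v3
    v = layers.Flatten()(v)

    # fusion network: merged layer, FC1/FC2, and the two task-specific heads
    # (the exact merged-layer size, 2804 in Table I, depends on shape details not fully specified)
    h = layers.Concatenate()([a, v])
    h = layers.Dense(1000, activation="sigmoid")(h)                             # FC1
    h = layers.Dropout(0.1)(h)
    h = layers.Dense(800, activation="sigmoid")(h)                              # FC2
    h = layers.Dropout(0.1)(h)
    enhanced = layers.Dense(600, activation="linear", name="enhanced_audio")(h)        # FC_a3
    mouth = layers.Dense(1500, activation="linear", name="reconstructed_mouth")(h)     # FC_v3

    model = models.Model([a_in, v_in], [enhanced, mouth])
    # weighted sum of audio and visual MSEs, with gamma playing the role of the mixing weight in (5)
    model.compile(optimizer=optimizers.RMSprop(learning_rate=1e-4),
                  loss={"enhanced_audio": "mse", "reconstructed_mouth": "mse"},
                  loss_weights={"enhanced_audio": 1.0, "reconstructed_mouth": gamma})
    return model
```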
B. Using the AVDCNN Model for Speech Enhancement

In the testing phase, the logarithmic amplitudes of the noisy speech signals and the corresponding visual features are fed into the trained AVDCNN model to obtain the logarithmic amplitudes of the enhanced speech signals and the corresponding visual features as outputs. Similar to spectral restoration approaches, the phases of the noisy speech are borrowed as the phases for the enhanced speech. Then, the AVDCNN-enhanced amplitudes and the phase information are used to synthesize the enhanced speech. We consider the visual features at the output of the trained AVDCNN model only as auxiliary information. This special design enables the AVDCNN model to process audio and visual information concurrently. Thus, the training process is performed in a multi-task learning manner, which has been proven to achieve a better performance than single-task learning in several tasks [63, 64].
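The synthesis step above can be summarized in a short sketch: the model predicts log-amplitude frames, the noisy phase is reused, and an inverse STFT produces the waveform. The assumed output dimensionality of 257 bins per frame and the simple de-normalization are illustrative assumptions; the features passed in would come from the Section II pipeline (e.g., the audio_features sketch above).

```python
import numpy as np
import librosa

def enhance(noisy_wav, audio_feats, mouth_feats, model, mean, std):
    """Synthesize enhanced speech from predicted log-power amplitudes and the noisy phase."""
    y, _ = librosa.load(noisy_wav, sr=16000, mono=True)
    spec = librosa.stft(y, n_fft=512, hop_length=320, win_length=512)
    phase = np.angle(spec)                                    # phase borrowed from the noisy speech
    log_pow_hat, _ = model.predict([audio_feats, mouth_feats])  # visual output is auxiliary only
    log_pow_hat = log_pow_hat * std + mean                    # undo the utterance-level normalization
    mag_hat = np.sqrt(np.exp(log_pow_hat)).T                  # assumed shape (257, T), linear magnitude
    n = min(mag_hat.shape[1], phase.shape[1])
    return librosa.istft(mag_hat[:, :n] * np.exp(1j * phase[:, :n]),
                         hop_length=320, win_length=512)
```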
C. Baseline Models

In this work, we compare the proposed AVDCNN model with three audio-only baseline models. The first is the audio-only deep CNNs (ADCNN) model. As shown in Fig. 2, the ADCNN model disconnects all visual-related parts of the AVDCNN model (cf. Fig. 1), and keeps the remaining configurations. The second and third are two conventional SE approaches, namely the Karhunen-Loève transform (KLT) [65] and the log minimum mean squared error (logMMSE) [66, 67]. In addition, the audio-visual SE model in [51], denoted by AVDNN, is adopted as an audio-visual baseline model. The AVDNN model employs handcrafted audio and visual features, consisting of Mel filter banks and the mutual distance changes between points in the lip contour, respectively. Another main difference between AVDCNN and AVDNN is that the AVDNN model is based on DNNs and does not adopt the multi-task learning scheme, while AVDCNN applies multi-task learning by considering audio and visual information at the output layer.

Fig. 2. The architecture of the ADCNN model, which is the same as the AVDCNN model in Fig. 1 with the visual parts disconnected.

IV. EXPERIMENTS AND RESULTS

A. Experimental Setup

In this section, we describe the experimental setup for the speech enhancement task in this study. To prepare the clean-noisy speech pairs, we followed the concept in a previous study [68], where the effects of both interference noise and ambient noise were considered. For the training set, we used 91 different noise types as interference noises. These 91 noises were a subset of the 104 noise types used in [69, 70]; thirteen noise types that were similar to the test noise types were removed. Car engine noises under five driving conditions were used to form the ambient noise set: idle engine, 35 mph with the windows up, 35 mph with the windows down, 55 mph with the windows up, and 55 mph with the windows down. The car engine noises were taken from the AVICAR dataset [71]. We concatenated these car noises to form the final ambient noise source for training. To form the training set, we first randomly chose 280 out of the 320 clean utterances. The clean utterances were artificially mixed with the 91 noise types at 10 dB, 6 dB, 2 dB, -2 dB, and -6 dB signal-to-interference noise ratios (SIRs), and with the ambient noise at 10 dB, 6 dB, 2 dB, -2 dB, and -6 dB signal-to-ambient noise ratios (SARs), resulting in a total of 280 × 91 × 5 × 5 utterances (an illustrative mixing procedure is sketched at the end of this subsection).

Next, to form the testing set we adopted 10 types of interference noise: a baby crying sound, pure music, music with lyrics, a siren, one background talker (1T), two background talkers (2T), and three background talkers (3T), where the 1T, 2T, and 3T background talker noises each had two modes, on-air recording and room recording. These noises were unseen in the training set, i.e., a noise-mismatched condition was adopted. Furthermore, they were chosen in particular because we intended to simulate a car driving condition as our test scenario, such as listening to the radio while driving with noise from talkers in the rear seats and from the car engine, given that audio-visual signal processing techniques have been effective in improving in-car voice command systems [72-74]. In addition, the ambient noise for testing was a 60 mph car engine noise taken from the dataset used in [75], which was also different from those used in the training set. Consequently, for testing there were 40 clean utterances, mixed with the 10 noise types at 5 dB, 0 dB, and -5 dB SIRs, and one ambient noise at 5 dB, 0 dB, and -5 dB SARs, resulting in a total of 40 × 10 × 3 × 3 utterances.

We used stochastic gradient descent with the RMSprop optimizer [76] to train the neural network model, with an initial learning rate of 0.0001. We chose the weights of the model at the point where the following 20 epochs exhibited improvements of less than 0.1% in the training loss. The implementation was based on the Keras [77] library.
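The clean-noisy pair preparation described above can be sketched as follows: the interference noise is scaled to a target SIR and the car ambient noise to a target SAR, both measured against the clean-speech power. The exact scaling convention is not stated in the paper, so mean-power scaling is assumed here purely for illustration.

```python
import numpy as np

def scale_to_snr(reference, noise, snr_db):
    """Return `noise` scaled so that power(reference) / power(scaled noise) equals snr_db."""
    noise = np.resize(noise, reference.shape)                 # loop or trim the noise to the utterance length
    p_ref = np.mean(reference ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    return noise * np.sqrt(p_ref / (p_noise * 10 ** (snr_db / 10.0)))

def make_mixture(clean, interference, ambient, sir_db, sar_db):
    """Mix one utterance at the given SIR and SAR (the grid of Sec. IV-A)."""
    return (clean
            + scale_to_snr(clean, interference, sir_db)
            + scale_to_snr(clean, ambient, sar_db))
```

For the training grid, make_mixture would be called for every combination of the 280 clean utterances, the 91 interference noises, and the five SIR and five SAR levels.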
Fig. 3. Comparison of spectrograms: (a) the clean speech, (b) the noisy speech with the 3T (room) noise at 5 dB SIR and the ambient noise at -5 dB SAR, and the speech enhanced by (c) logMMSE, (d) KLT, (e) AVDNN, (f) ADCNN, and (g) AVDCNN.

B. Comparison of Spectrograms

Fig. 3 (a)-(g) presents the spectrograms of the clean speech, the noisy speech mixed with the 3T (room) noise at 5 dB SIR and the ambient noise at -5 dB SAR, and the speech enhanced by the logMMSE, KLT, AVDNN, ADCNN, and AVDCNN methods, respectively. It is obvious that the three audio-only SE approaches could not effectively remove the noise components. This phenomenon is especially clear for the silence portions at the beginning and the end of the utterance, where noise components can still be observed. In contrast, with the help of auxiliary visual information, the AVDCNN model effectively suppressed the noise components in the parts where the mouth was closed. However, even with the additional visual information, the AVDNN model in [51] could not yield results as satisfactory as AVDCNN did in this task. In [51], the testing condition for AVDNN was a noise-matched scenario; the conditions are much more challenging in this study, and AVDNN appears unable to be as effective as AVDCNN in such a scenario. As shown in Fig. 3 (e) and (g), the ineffective aspects of AVDNN included incomplete noise reduction when the mouth was closed and poor reconstruction of the target speech signals. Such results may stem from the inadequate visual features of AVDNN, suggesting that the visual features learned by CNNs directly from images could be more robust than the handcrafted ones used in AVDNN. To summarize this subsection, the spectrograms in Fig. 3 demonstrate that the proposed AVDCNN model is more powerful than the other baseline SE models, which is also supported by the instrumental measures in the next subsection.

C. Results of Instrumental Measures

In this subsection, we report the results of the five SE methods in terms of five instrumental metrics, namely PESQ, STOI, SDI, HASQI, and HASPI. The PESQ measure (ranging from 0.5 to 4.5) indicates the quality of the enhanced speech. The STOI measure (ranging from 0 to 1) indicates the intelligibility of the enhanced speech. The HASQI and HASPI measures (both ranging from 0 to 1) evaluate sound quality and perception, respectively, for both normal-hearing and hearing-impaired people (by setting specific modes); in this study, the normal-hearing mode was adopted for both measures. The SDI measure calculates the distortion between the clean and enhanced speech. Except for SDI, larger values indicate a better performance. We report the average evaluation score over the 40 test utterances under different noise types and SIR and SAR conditions.
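For reference, PESQ and STOI scores of the kind reported below can be computed with the third-party pesq and pystoi Python packages, and a simple distortion ratio can serve as a stand-in for SDI. The snippet is only an illustrative evaluation loop: the simplified SDI used here is not the exact estimator of [58], and HASQI/HASPI, which have reference MATLAB implementations, are omitted.

```python
import numpy as np
from pesq import pesq      # third-party "pesq" package (ITU-T P.862 implementation)
from pystoi import stoi    # third-party "pystoi" package

def evaluate(clean, enhanced, fs=16000):
    """Return PESQ, STOI, and a simplified distortion ratio for one utterance pair."""
    n = min(len(clean), len(enhanced))
    clean, enhanced = clean[:n], enhanced[:n]
    return {
        "PESQ": pesq(fs, clean, enhanced, "wb"),              # wideband mode for 16 kHz signals
        "STOI": stoi(clean, enhanced, fs, extended=False),
        # simplified proxy for SDI: distortion energy normalized by clean-speech energy (lower is better)
        "SDI": float(np.sum((clean - enhanced) ** 2) / (np.sum(clean ** 2) + 1e-12)),
    }
```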
We first investigated the SE performance on different noise types. Figs. 4-8 show the average PESQ, STOI, SDI, HASQI, and HASPI scores, respectively, of the noisy speech under the 10 different interference noises and of the speech enhanced using the different SE methods, with the SAR fixed at 0 dB. From Figs. 4-8, we first notice that the performances of the two conventional SE methods (logMMSE and KLT) show that they cannot effectively handle non-stationary noises. Next, when comparing the two CNN-based models, AVDCNN outperforms ADCNN consistently in terms of all evaluation metrics, confirming the effectiveness of combining visual and audio information to achieve a better SE performance. In addition, AVDCNN shows its effectiveness as an audio-visual model by outperforming AVDNN in all the metrics. To further confirm the significance of the superiority of the AVDCNN model over the second-best system in each test condition in Figs. 4-8, we performed a one-way analysis of variance (ANOVA) with Tukey post-hoc comparisons (TPHCs) [78]. The results confirmed that these scores differed significantly, with p-values of less than 0.05 in most conditions, except for STOI (with the music and siren noises), SDI (with the baby crying, music, and siren noises), and HASPI (with the music and siren noises). With a further analysis of the experimental results, we note that among the 10 testing noise types, the evaluation scores for the baby crying sound are always inferior to those of the other noise types, suggesting that the baby crying noise is relatively challenging to handle. Meanwhile, the multiple background talker (2T, 3T) scenarios do not appear to be more challenging than the single background talker (1T) scenario.

Next, we compared the SE performances provided by the different SE models at different SAR levels. Figs. 9-13 show the average PESQ, STOI, SDI, HASQI, and HASPI scores of the noisy and enhanced speech at specific SIR (averaged over the 10 different noise types) and SAR levels. In these figures, "×", "□", and "○" denote 5 dB, 0 dB, and -5 dB SAR, respectively. Note that a speech signal with a higher SAR involves fewer car engine noise components. From Figs. 9-13, it is clear that (1) the instrumental evaluation results at higher SAR levels are usually better than those at lower SAR levels; and (2) AVDCNN outperforms the other SE methods, which is especially obvious at lower SIR levels. This result shows that visual information provides important clues for assisting SE in AVDCNN in more challenging conditions.

Fig. 4. Mean PESQ scores of 10 different noisy and corresponding enhanced versions of speech, considering different enhancement approaches and varying noise types at an SAR of 0 dB.
Fig. 5. Mean STOI scores of 10 different noisy and corresponding enhanced versions of speech, considering different enhancement approaches and varying noise types at an SAR of 0 dB.
Fig. 6. Mean SDI scores of 10 different noisy and corresponding enhanced versions of speech, considering different enhancement approaches and varying noise types at an SAR of 0 dB.
Fig. 7. Mean HASQI scores of 10 different noisy and corresponding enhanced versions of speech, considering different enhancement approaches and varying noise types at an SAR of 0 dB.
Fig. 8. Mean HASPI scores of 10 different noisy and corresponding enhanced versions of speech, considering different enhancement approaches and varying noise types at an SAR of 0 dB.
Fig. 9. Mean PESQ scores over 10 different noisy and corresponding enhanced versions of speech, considering different enhancement approaches for each SIR and SAR.
Fig. 10. Mean STOI scores over 10 different noisy and corresponding enhanced versions of speech, considering different enhancement approaches for each SIR and SAR.
Fig. 11. Mean SDI scores over 10 different noisy and corresponding enhanced versions of speech, considering different enhancement approaches for each SIR and SAR.
Fig. 12. Mean HASQI scores over 10 different noisy and corresponding enhanced versions of speech, considering different enhancement approaches for each SIR and SAR.
Fig. 13. Mean HASPI scores over 10 different noisy and corresponding enhanced versions of speech, considering different enhancement approaches for each SIR and SAR.

D. Multi-style Training Strategy

A previous study [79] has shown that the input of a certain modality of a multimodal network can dominate over the other input types. In our preliminary experiments, we observed similar properties. To alleviate this issue, we adopted the multi-style training strategy [80], which randomly selected one of the following input types, audio-visual, visual-only, and audio-only, for every 45 epochs in the training phase (the schedule is sketched in the example following this subsection). When using the visual input only, with the audio input set to zeros, a visual output was provided, while the audio output was set according to two different models: Model-1 set the audio target to zeros, and Model-2 used the clean audio signals as the target. Similarly, when using the audio-only data, with the visual input set to zeros, Model-1 set the visual target to zeros and Model-2 used the original visual data as the visual target. It should be noted that both Model-1 and Model-2 were trained via the multi-style training strategy, and the difference lies in the information specified at the output during the training process. The mean squared errors (MSEs) from the training processes of Model-1 and Model-2 are shown in Figs. 14 and 15, respectively. At the top of Figs. 14 and 15, bars mark the epoch segments of the three types of input, namely audio-visual, visual-only, and audio-only.

From the results shown in Figs. 14 and 15, we can observe some support for including visual information. From the windows with solid red lines in these two figures, we note that the audio loss was relatively large when we used audio-only data for training. The MSE dropped to a lower level once visual features were used, indicating a strong correlation between the audio and visual streams.

Fig. 14. The learning curve on the training data for the multi-style learning model using Model-1, which sets the visual/audio target to zeros when only an audio/visual input is selected in training. The red frame shows that a smaller audio loss could be achieved as additional visual information was included.
Fig. 15. The learning curve on the training data for the multi-style learning model using Model-2, which retains the visual/audio target when only an audio/visual input is selected in training. The red frame shows that a smaller audio loss could be achieved as additional visual information was included.
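A hedged sketch of the multi-style schedule described above, written around the build_avdcnn sketch from Section III: every 45 epochs one of the three input styles is drawn at random, the dropped modality's input is zeroed, and the corresponding target is either zeroed (Model-1) or kept (Model-2). The segment count and the plain fit loop are illustrative choices, not the authors' training script.

```python
import random
import numpy as np

STYLES = ["audio_visual", "visual_only", "audio_only"]

def multi_style_fit(model, audio_x, visual_x, audio_y, visual_y,
                    segments=18, epochs_per_segment=45, keep_targets=False):
    """keep_targets=False corresponds to Model-1 (zeroed targets), True to Model-2."""
    # 18 segments x 45 epochs roughly matches the ~810 training epochs shown in Figs. 14-15
    for _ in range(segments):
        style = random.choice(STYLES)                     # one input style per 45-epoch segment
        a_in, v_in, a_out, v_out = audio_x, visual_x, audio_y, visual_y
        if style == "visual_only":
            a_in = np.zeros_like(audio_x)                 # audio input dropped
            if not keep_targets:
                a_out = np.zeros_like(audio_y)            # Model-1 also zeroes the audio target
        elif style == "audio_only":
            v_in = np.zeros_like(visual_x)                # visual input dropped
            if not keep_targets:
                v_out = np.zeros_like(visual_y)           # Model-1 also zeroes the visual target
        model.fit([a_in, v_in], [a_out, v_out],
                  epochs=epochs_per_segment, batch_size=32, verbose=1)
```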
E. Mixing Weight

In the above experiments, the mixing weight in Eq. (5) was fixed to 1; namely, the audio and visual errors were considered equally harmful when training the model parameters of AVDCNN. In this subsection, we explore the correlation of the mixing weight with the SE performance. Fig. 16 shows the audio and visual losses on the training data under different mixing weights during the training process of the AVDCNN model. It is observed that the more we emphasized the visual information, i.e., the larger the value of the mixing weight, the better the visual loss and the worse the audio loss we obtained. Given that the audio loss dominates the enhancement results, we tended to select a smaller mixing weight.

Fig. 16. The audio and visual losses on the training data under different mixing weights during the training process of the AVDCNN model.

F. Multimodal Inputs with Mismatched Visual Features

In this subsection, we show the importance of correct matching between the input audio features and their visual counterpart features. We selected eight mouth shapes during speech as stationary visual units, and then fixed each "snapshot" as the visual feature for an entire utterance. From the spectrograms in Fig. 17, we can see that the AVDCNN-enhanced speech with the correct lip features preserved more detailed structures than the AVDCNN-enhanced speech signals with the incorrect lip feature sequences. The mean PESQ score over the 40 testing utterances with the correct visual features was 2.54, while the mean scores of the enhanced speech signals with the eight fake lip shape sequences ranged from 1.17 to 2.07. These results suggest that the extraction of the lip shape notably affects the performance of AVDCNN.

Fig. 17. (a) The noisy speech with the 1T (on-air) noise at 0 dB SIR, (b) the clean speech, (c) the AVDCNN-enhanced speech with the correct lip features, and (d)-(k) (left) the selected lip shapes and (right) the AVDCNN-enhanced speech with the incorrect lip features, which are sequences of the selected lip shapes.

G. Reconstructed Mouth Images

In the proposed AVDCNN system, we used the visual input as an auxiliary clue for the speech signals and added visual information at the output as part of the constraints during the training of the model. Therefore, the proposed system is in fact an audio-visual encoder-decoder system with multi-task learning. In addition to the enhanced speech frames, we receive the corresponding mouth images at the output in the testing phase. It is interesting to investigate the images obtained using the audio-visual encoder-decoder system. Fig. 18 presents a few visualized samples. For now, we simply view these images as "by-products" of the audio-visual system, compared with the target enhanced speech signals. However, in the future it will be interesting to explore the lip shapes that the model learns when the corresponding visual hints fed in are considerably corrupted or not provided.

Fig. 18. Visualizing the normalized mouth images: (a) the visual input and (b) the visual output of the proposed AVDCNN model, and (c) the difference between (a) and (b), with the amplitude magnified ten times.

H. Subjective Listening Tests

In addition to the instrumental evaluations, we also conducted subjective listening tests for the speech enhanced by three different methods, namely logMMSE, ADCNN, and AVDCNN.
We adopted the procedures for the listening tests from [81], using a five-point scale to evaluate the background noise intrusiveness (BAK) and the overall effect (OVRL); higher scores are more favorable. Each subject listened to 10 utterances enhanced from all 10 testing noises under -5 dB SIR and -5 dB SAR by the aforementioned three models, resulting in a total of 3 × 10 × 10 utterances. A total of 20 subjects, whose native language is Mandarin, participated in the tests. The subjects were between 23 and 40 years old, with a mean age of 26 years. The mean scores over the subjects are presented in Table II. These results show that the proposed AVDCNN model obtained the best scores among the three models compared in the subjective listening tests.

TABLE II
RESULTS OF THE SUBJECTIVE LISTENING TESTS COMPARING THE THREE DIFFERENT SE MODELS

Models  | BAK  | OVRL
LogMMSE | 1.20 | 1.70
ADCNN   | 2.75 | 1.95
AVDCNN  | 3.70 | 2.95

I. Early Fusion Scheme for the AVDCNN Model

We also attempted an early fusion scheme, combining the audio and visual features at the inputs before they enter the convolutional layers. The early fusion model, denoted by AVDCNN-EF, replaces the audio network and visual network in Fig. 1 with a single unified CNN, whose input consists of the fused audio-visual features generated by concatenating the audio features, the separated RGB channels of the visual features, and zero padding, with a final shape of 257 × 29 × 1 (audio: 257 × 5 × 1, RGB: (80 + 80 + 80) × 24 × 1, and zero padding: 17 × 24 × 1; one possible assembly of this plane is sketched after Table III). The numbers of parameters of AVDCNN and AVDCNN-EF are of the same order. A comparison of the instrumental metrics of the enhanced results for AVDCNN and AVDCNN-EF is presented in Table III. The scores are the mean scores for the enhanced speech over the 10 different noises under different SIRs at an SAR of 0 dB. It is clear that AVDCNN consistently outperforms AVDCNN-EF, indicating that the proposed fusion scheme, which processes the audio and visual information individually first and fuses them later, is better than an early fusion scheme, which combines the heterogeneous data at the beginning.

TABLE III
MEAN SCORES OF THE INSTRUMENTAL METRICS OF THE ENHANCED SPEECH OVER 10 DIFFERENT NOISES UNDER DIFFERENT SIRS AT 0 DB SAR, COMPARING THE AVDCNN MODELS WITH AND WITHOUT EARLY FUSION

Models    | PESQ | STOI | SDI  | HASQI | HASPI
AVDCNN    | 2.41 | 0.66 | 0.45 | 0.43  | 0.99
AVDCNN-EF | 1.52 | 0.51 | 1.43 | 0.11  | 0.74
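One plausible reading of the stated AVDCNN-EF input layout is sketched below: the audio block (257 × 5), the three RGB channel planes stacked to 240 × 24, and 17 × 24 of zero padding are concatenated into a single 257 × 29 × 1 plane. How the five visual context frames are arranged within each 80 × 24 channel plane is not specified in the paper, so the frame-stacking order here is an assumption.

```python
import numpy as np

def early_fusion_input(audio_feat, visual_feat):
    """Assemble the 257 x 29 x 1 early-fusion plane from one time step's features.

    audio_feat : (257, 5)        log-power spectrum with +/-2 context frames
    visual_feat: (16, 24, 3, 5)  cropped RGB mouth images with +/-2 context frames
    """
    # stack the 5 context frames of each RGB channel vertically: 16 * 5 = 80 rows per channel (assumption)
    channels = [np.concatenate([visual_feat[:, :, c, t] for t in range(5)], axis=0)
                for c in range(3)]                                      # three (80, 24) planes
    visual_block = np.concatenate(channels, axis=0)                     # (240, 24)
    visual_block = np.concatenate([visual_block, np.zeros((17, 24))], axis=0)   # zero-pad to (257, 24)
    fused = np.concatenate([audio_feat, visual_block], axis=1)          # (257, 29)
    return fused[..., np.newaxis]                                       # (257, 29, 1)
```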
V. DISCUSSION

From the previous experiments, we can observe clear evidence of how visual information affects the enhancement results. For instance, Fig. 3 (g) shows that noise and speech signals from a non-targeted speaker were effectively suppressed when the mouth was closed. This result indicates that visual information plays a beneficial role in voice activity detection (VAD); in fact, there are researchers working in this particular direction [82, 83]. This also contributes to why we chose in-car environments as our testing scenario and investigated the effectiveness of audio-visual SE there. If a camera targets the driver's mouth region, the lip shape could provide a strong hint on whether or not to activate a voice command system in the presence of background talkers or noises, and could in addition enhance the speech. The lip shape could thus provide a useful hint for VAD; however, it does not yet appear to be a fully reliable one. As shown in Fig. 19, in a few of the testing results for speech enhanced using the AVDCNN model, we observed that noise components were incompletely removed in a non-speech segment because the mouth was in an open shape at that time. We believe that this shortcoming could be further mitigated by combining the model with audio-only VAD techniques.

We also preliminarily evaluated the AVDCNN model on real-world testing data, i.e., noisy speech recorded in a real noisy environment rather than obtained by artificially adding noise to clean speech. Fig. 20 (a) illustrates the controlled environment used for recording the training and testing data. Fig. 20 (b) illustrates the recording conditions of the real-world data, which were recorded by a smartphone (ASUS ZenFone 2 ZE551ML) in a night market. The spectrograms of the noisy and AVDCNN-enhanced speech signals are presented in Fig. 20 (c) and (d), respectively. The red frame indicates the segment where the target speech was mixed with the background talking sound. We observe that the lip shape helped to identify the target speech segment, while the reconstruction of the target speech was not as good as the enhanced results in the controlled environment. This might be because of different light coverage, a lower SIR, or the properties of the background noise, suggesting that there remains room for improvement for audio-visual SE in real-world testing conditions.

Fig. 19. Spectrograms of (a) the clean speech and (b) the AVDCNN-enhanced speech. The red frame in (b) shows that noise was reduced incompletely in the non-speech segment when the mouth was in an unclosed shape.

Fig. 20. Testing in real-world conditions. (a) The controlled environment (seminar room) for recording the training and testing data. (b) The recording environment (night market) for the real-world test data. Spectrograms of (c) the noisy speech with the babble noise and (d) the enhanced speech from the AVDCNN model. The red frame in (c) indicates the segment where the target speech was mixed with noise.

VI. CONCLUSION

In this paper, we have proposed a novel CNN-based audio-visual encoder-decoder system with multi-task learning for speech enhancement, called AVDCNN. The model utilizes individual networks to process input data of different modalities, and a fusion network is then employed to learn joint multimodal features. We trained the model in an end-to-end manner. The experimental results obtained using the proposed architecture show that its performance on the SE task is superior to that of three audio-only baseline models in terms of five instrumental evaluation metrics, confirming the effectiveness of integrating visual information with audio information in the SE process. We also demonstrated the model's effectiveness by comparing it with another audio-visual SE model. Overall, the contributions of this paper are five-fold. First, we adopted CNNs for the audio and visual streams in the proposed end-to-end audio-visual SE model, obtaining improvements over several baseline models. Second, we quantified the advantages of integrating visual information for SE through the multi-modal and multi-task training strategies. Third, we demonstrated that processing the audio and visual streams with late fusion is better than early fusion.
Fourth, the experimental results exhibited a high correlation between speech and lip shape, and showed the importance of using correct lip shapes in audio-visual SE. Finally, we showed that lip shapes were effective as auxiliary features for VAD, and also pointed out the potential problems in using audio-visual SE models. In the future, we will attempt to improve the proposed architecture by using the whole face as the visual input, rather than the mouth region only, in order to exploit well-trained face recognition networks to improve the visual descriptor networks. Furthermore, we plan to modify the existing CNNs in our model by considering other state-of-the-art CNN-based models, such as fully convolutional networks [84-86] and U-Net [87]. A more sophisticated method for synchronizing the audio and video streams might also improve the performance and is worthy of investigation. Finally, to improve the practicality of the model for real-world application scenarios, we will consider collecting training data that include more complicated and realistic conditions.

ACKNOWLEDGMENT

This work was supported by the Academia Sinica Thematic Research Program AS-105-TP-C02-1.

REFERENCES

[1] J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic Speech Recognition: A Bridge to Practical Applications, 1st ed. Academic Press, 2015.
[2] B. Li, Y. Tsao, and K. C. Sim, "An investigation of spectral restoration algorithms for deep neural networks based noise robust speech recognition," in Proc. INTERSPEECH, 2013, pp. 3002-3006.
[3] A. El-Solh, A. Cuhadar, and R. A. Goubran, "Evaluation of speech enhancement techniques for speaker identification in noisy environments," in Proc. ISMW, 2007, pp. 235-239.
[4] J. Ortega-Garcia and J. Gonzalez-Rodriguez, "Overview of speech enhancement techniques for automatic speaker recognition," in Proc. ICSLP, vol. 2, 1996, pp. 929-932.
[5] J. Li, L. Yang, Y. Hu, M. Akagi, P. C. Loizou, J. Zhang, and Y. Yan, "Comparative intelligibility investigation of single-channel noise reduction algorithms for Chinese, Japanese and English," Journal of the Acoustical Society of America, vol. 129, no. 5, pp. 3291-3301, 2011.
[6] J. Li, S. Sakamoto, S. Hongo, M. Akagi, and Y. Suzuki, "Two-stage binaural speech enhancement with Wiener filter for high-quality speech communication," Speech Communication, vol. 53, no. 5, pp. 677-689, 2011.
[7] T. Venema, Compression for Clinicians, 2nd ed. Thomson Delmar Learning, 2006, chapter 7.
[8] H. Levitt, "Noise reduction in hearing aids: an overview," J. Rehab. Res. Dev., vol. 38, no. 1, pp. 111-121, 2001.
[9] A. Chern, Y.-H. Lai, Y.-P. Chang, Y. Tsao, R. Y. Chang, and H.-W. Chang, "A smartphone-based multi-functional hearing assistive system to facilitate speech recognition in the classroom," IEEE Access, 2017.
[10] Y.-H. Lai, F. Chen, S.-S. Wang, X. Lu, Y. Tsao, and C.-H. Lee, "A deep denoising autoencoder approach to improving the intelligibility of vocoded speech in cochlear implant simulation," IEEE Transactions on Biomedical Engineering, vol. 64, no. 7, pp. 1568-1578, 2016.
[11] F. Chen, Y. Hu, and M. Yuan, "Evaluation of noise reduction methods for sentence recognition by Mandarin-speaking cochlear implant listeners," Ear and Hearing, vol. 36, no. 1, pp. 61-71, 2015.
[12] Y.-H. Lai, Y. Tsao, X. Lu, F. Chen, Y.-T. Su, K.-C. Chen, Y.-H. Chen, L.-C. Chen, P.-H. Li, and C.-H. Lee, "Deep learning based noise reduction approach to improve speech intelligibility for cochlear implant recipients," to appear in Ear and Hearing.
[13] J. Chen, "Fundamentals of noise reduction," in Springer Handbook of Speech Processing, Springer, 2008, chapter 43.
[14] P. Scalart and J. V. Filho, "Speech enhancement based on a priori signal to noise estimation," in Proc. ICASSP, 1996, pp. 629-632.
[15] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 32, no. 6, pp. 1109-1121, 1984.
[16] R. Martin, "Speech enhancement based on minimum mean-square error estimation and supergaussian priors," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 845-856, 2005.
[17] Y. Tsao and Y.-H. Lai, "Generalized maximum a posteriori spectral amplitude estimation for speech enhancement," Speech Communication, vol. 76, pp. 112-126, 2015.
[18] A. Hussain, M. Chetouani, S. Squartini, A. Bastari, and F. Piazza, "Nonlinear speech enhancement: An overview," in Progress in Nonlinear Speech Processing. Berlin, Germany: Springer, 2007, pp. 217-248.
[19] A. Uncini, "Audio signal processing by neural networks," Neurocomputing, vol. 55, pp. 593-625, 2003.
[20] G. Cocchi and A. Uncini, "Subband neural networks prediction for on-line audio signal recovery," IEEE Transactions on Neural Networks, vol. 13, no. 4, pp. 867-876, 2002.
[21] N. B. Yoma, F. McInnes, and M. Jack, "Lateral inhibition net and weighted matching algorithms for speech recognition in noise," Proc. IEE Vision, Image & Signal Processing, vol. 143, no. 5, pp. 324-330, 1996.
[22] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. INTERSPEECH, 2013, pp. 436-440.
[23] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Ensemble modeling of denoising autoencoder for speech spectrum restoration," in Proc. INTERSPEECH, 2014, pp. 885-889.
[24] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, pp. 65-68, 2014.
[25] D. Liu, P. Smaragdis, and M. Kim, "Experiments on deep learning for speech denoising," in Proc. INTERSPEECH, 2014, pp. 2685-2689.
[26] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, pp. 153-167, 2017.
[27] F. Weninger, F. Eyben, and B. Schuller, "Single-channel speech separation with memory-enhanced recurrent neural networks," in Proc. ICASSP, 2014, pp. 3709-3713.
[28] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. L. Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Latent Variable Analysis and Signal Separation, pp. 91-99, Springer, 2015.
[29] P. Campolucci, A. Uncini, F. Piazza, and B. Rao, "On-line learning algorithms for locally recurrent neural networks," IEEE Transactions on Neural Networks, vol. 10, no. 2, pp. 253-271, 1999.
[30] F. Eyben, F. Weninger, S. Squartini, and B. Schuller, "Real-life voice activity detection with LSTM recurrent neural networks and an application to Hollywood movies," in Proc. ICASSP, 2013, pp. 483-487.
[31] S.-W. Fu, Y. Tsao, and X. Lu, "SNR-aware convolutional neural network modeling for speech enhancement," in Proc. INTERSPEECH, 2016.
[32] S.-W. Fu, Y. Tsao, and X. Lu, "Complex spectrogram enhancement by convolutional neural network with multi-metrics learning," in Proc. MLSP, 2017.
[33] H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, pp. 746-748, 1976.
[34] D. G. Stork and M. E. Hennecke, Speechreading by Humans and Machines, Springer, 1996.
[35] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, "Recent advances in the automatic recognition of audio-visual speech," Proceedings of the IEEE, vol. 91, no. 9, 2003.
[36] D. Kolossa, S. Zeiler, A. Vorwerk, and R. Orglmeister, "Audiovisual speech recognition with missing or unreliable data," in Proc. AVSP, 2009, pp. 117-122.
[37] A. V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, "Dynamic Bayesian networks for audio-visual speech recognition," EURASIP Journal on Applied Signal Processing, vol. 2002, no. 11, pp. 1274-1288, 2002.
[38] A. H. Abdelaziz, S. Zeiler, and D. Kolossa, "Twin-HMM-based audio-visual speech enhancement," in Proc. ICASSP, 2013, pp. 3726-3730.
[39] S. Deligne, G. Potamianos, and C. Neti, "Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization)," in Proc. Int. Conf. Spoken Lang. Processing, 2002, pp. 1449-1452.
[40] H. Meutzner, N. Ma, R. Nickel, C. Schymura, and D. Kolossa, "Improving audio-visual speech recognition using deep neural networks with dynamic stream reliability estimates," in Proc. ICASSP, 2017.
[41] V. Estellers, M. Gurban, and J.-P. Thiran, "On dynamic stream weighting for audio-visual speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1145-1157, 2012.
[42] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng, "Multimodal deep learning," in Proc. ICML, 2011.
[43] Y. Mroueh, E. Marcheret, and V. Goel, "Deep multimodal learning for audio-visual speech recognition," in Proc. ICASSP, 2015.
[44] L. Girin, J.-L. Schwartz, and G. Feng, "Audio-visual enhancement of speech in noise," Journal of the Acoustical Society of America, vol. 109, p. 3007, 2001.
[45] R. Goecke, G. Potamianos, and C. Neti, "Noisy audio feature enhancement using audio-visual speech data," in Proc. ICASSP, 2002.
[46] I. Almajai and B. Milner, "Enhancing audio speech using visual speech features," in Proc. INTERSPEECH, 2009.
[47] I. Almajai and B. Milner, "Visually derived Wiener filters for speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1642-1651, 2011.
[48] B. Rivet, L. Girin, and C. Jutten, "Visual voice activity detection as a help for speech source separation from convolutive mixtures," Speech Communication, vol. 49, no. 7-8, pp. 667-677, 2007.
[49] B. Rivet, L. Girin, and C. Jutten, "Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 96-108, 2007.
[50] B. Rivet, W. Wang, S. M. Naqvi, and J. A. Chambers, "Audiovisual speech source separation: An overview of key methodologies," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 125-134, May 2014.
[51] J.-C. Hou, S.-S. Wang, Y.-H. Lai, J.-C. Lin, Y. Tsao, H.-W. Chang, and H.-M. Wang, "Audio-visual speech enhancement using deep neural networks," in Proc. APSIPA ASC, 2016.
[52] Z. Wu, S. Sivadas, Y. K. Tan, B. Ma, and S. M. Goh, "Multi-modal hybrid deep neural network for speech enhancement," arXiv:1606.04750, 2016.
[53] G. Tzimiropoulos and M. Pantic, "Gauss-Newton deformable part models for face alignment in-the-wild," in Proc. CVPR, 2014, pp. 1851-1858.
[54] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. ICLR, 2015.
[55] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in Proc. BMVC, 2015.
[56] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs," in Proc. ICASSP, 2001.
[57] C. Taal, R. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 2125-2136, 2011.
[58] J. Chen, J. Benesty, Y. Huang, and S. Doclo, "New insights into the noise reduction Wiener filter," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, pp. 1218-1234, 2006.
[59] J. M. Kates and K. H. Arehart, "The hearing-aid speech quality index (HASQI)," Journal of the Audio Engineering Society, vol. 58, no. 5, pp. 363-381, 2010.
[60] J. M. Kates and K. H. Arehart, "The hearing-aid speech perception index (HASPI)," Speech Communication, vol. 65, pp. 75-93, 2014.
[61] M. W. Huang, "Development of Taiwan Mandarin hearing in noise test," Master's thesis, Department of Speech Language Pathology and Audiology, National Taipei University of Nursing and Health Sciences, 2005.
[62] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[63] R. Caruana, "Multitask learning," Machine Learning, vol. 28, pp. 41-75, 1997.
[64] M. L. Seltzer and J. Droppo, "Multi-task learning in deep neural networks for improved phoneme recognition," in Proc. ICASSP, 2013, pp. 6965-6969.
[65] A. Rezayee and S. Gazor, "An adaptive KLT approach for speech enhancement," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 2, pp. 87-95, 2001.
[66] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 33, no. 2, pp. 443-445, 1985.
[67] E. Principi, S. Cifani, R. Rotili, S. Squartini, and F. Piazza, "Comparative evaluation of single-channel MMSE-based noise reduction schemes for speech recognition," Journal of Electrical and Computer Engineering, pp. 1-7, 2010.
[68] J. Hong, S. Park, S. Jeong, and M. Hahn, "Dual-microphone noise reduction in car environments with determinant analysis of input correlation matrix," IEEE Sensors Journal, vol. 16, no. 9, pp. 3131-3140, 2016.
[69] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, no. 1, pp. 7-19, 2015.
[70] G. Hu, 100 nonspeech environmental sounds, 2004. [Online]. Available: http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/HuCorpus.html.
[71] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, and T. Huang, "AVICAR: Audio-visual speech corpus in a car environment," in Proc. Int. Conf. Spoken Language, 2004, pp. 2489-2492.
[72] R. Navarathna, D. Dean, S. Sridharan, and P. Lucey, "Multiple cameras for audio-visual speech recognition in an automotive environment," Computer Speech & Language, vol. 27, no. 4, pp. 911-927, 2013.
[73] A. Biswas, P. Sahu, and M. Chandra, "Multiple cameras audio visual speech recognition using active appearance model visual features in car environment," International Journal of Speech Technology, vol. 19, no. 1, pp. 159-171, 2016.
[74] F. Faubel, M. Georges, K. Kumatani, A. Bruhn, and D. Klakow, "Improving hands-free speech recognition in a car through audio-visual voice activity detection," in Proc. Joint Workshop on Hands-Free Speech Communication and Microphone Arrays, 2011.
[75] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd ed., Boca Raton, FL, USA: CRC, 2013.
[76] G. Hinton, N. Srivastava, and K. Swersky, "Lecture 6: Overview of mini-batch gradient descent," Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture.
[77] F. Chollet. (2015). Keras. Available: https://github.com/fchollet/keras
[78] J. W. Tukey, "Comparing individual means in the analysis of variance," Biometrics, vol. 5, no. 2, pp. 99-114, 1949.
[79] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in Proc. CVPR, 2016.
[80] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Lip reading sentences in the wild," arXiv:1611.05358, 2016.
[81] S. Ntalampiras, T. Ganchev, I. Potamitis, and N. Fakotakis, "Objective comparison of speech enhancement algorithms under real world conditions," in Proc. PETRA, 2008.
[82] S. Thermos and G. Potamianos, "Audio-visual speech activity detection in a two-speaker scenario incorporating depth information from a profile or frontal view," in Proc. SLT, 2016.
[83] F. Patrona, A. Iosifidis, A. Tefas, N. Nikolaidis, and I. Pitas, "Visual voice activity detection in the wild," IEEE Transactions on Multimedia, vol. 18, no. 6, pp. 967-977, 2016.
[84] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. CVPR, 2015.
[85] S.-W. Fu, Y. Tsao, X. Lu, and H. Kawai, "Raw waveform-based speech enhancement by fully convolutional networks," arXiv:1703.02205, 2017.
[86] S.-W. Fu, Y. Tsao, X. Lu, and H. Kawai, "End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional networks," arXiv:1709.03658, 2017.
[87] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: convolutional networks for biomedical image segmentation," in Proc. MICCAI, 2015.