Video-Driven Speech Reconstruction using Generative Adversarial Networks

Konstantinos Vougioukas (1,2), Pingchuan Ma (1), Stavros Petridis (1,2), and Maja Pantic (1,2)
(1) iBUG Group, Imperial College London
(2) Samsung AI Centre, Cambridge, UK
June 17, 2019
* The first author performed this work during his internship at Samsung.

Abstract
Speech is a means of communication which relies on both audio and visual information. The absence of one modality can often lead to confusion or misinterpretation of information. In this paper we present an end-to-end temporal model capable of directly synthesising audio from silent video, without needing to transform to-and-from intermediate features. Our proposed approach, based on GANs, is capable of producing natural-sounding, intelligible speech which is synchronised with the video. The performance of our model is evaluated on the GRID dataset for both speaker dependent and speaker independent scenarios. To the best of our knowledge this is the first method that maps video directly to raw audio and the first to produce intelligible speech when tested on previously unseen speakers. We evaluate the synthesised audio not only based on the sound quality but also on the accuracy of the spoken words.

Index Terms: speech synthesis, generative modelling, visual speech recognition

1 Introduction
Lipreading is a technique that involves understanding speech in the absence of sound, used primarily by people who are deaf or hard-of-hearing. Even people with normal hearing depend on the interpretation of lip movements, especially in noisy environments. The ineffectiveness of audio speech recognition (ASR) methods in the presence of noise has led to research on automatic visual speech recognition (VSR) methods. Thanks to recent advances in machine learning, VSR systems are capable of performing this task with high accuracy [1, 2, 3, 4, 5]. Most of these systems output text, but there are many applications, such as video conferencing in noisy or silent environments, that would benefit from the use of video-to-speech systems.

Figure 1: The speech synthesizer accepts a sequence of frames and produces the corresponding audio sequence.

One possible approach for developing video-to-speech systems is to combine VSR with text-to-speech (TTS) systems, with text serving as an intermediate representation. However, there are several limitations to using such systems. Firstly, text-based systems require transcribed datasets for training, which are hard to obtain because they require laborious manual annotation. Secondly, generation of the audio can only occur at the end of each word, which imposes a delay on the pipeline, making it unsuitable for real-time applications such as video conferencing. Finally, the use of text as an intermediate representation prevents such systems from capturing emotion and intonation, which results in unnatural speech. Direct video-to-speech methods do not have these drawbacks, since audio samples can be generated with every video frame that is captured. Furthermore, such systems can be trained in a self-supervised manner, since most video comes paired with the corresponding audio. For these reasons video-to-speech systems have recently been considered.
Such video-to-speech systems are proposed by Le Cornu and Milner in [6, 7], which use Gaussian Mixture Models and Deep Neural Networks (DNNs) respectively to estimate audio features from visual features; the audio features are then fed into a vocoder to produce audio. However, the hand-crafted visual features used in this approach are not capable of capturing the pitch and intonation of the speaker, and in order to produce intelligible results these have to be artificially generated.

Convolutional neural networks (CNNs) have been shown to be powerful feature extractors for images and videos and have replaced hand-crafted features in more recent works. One such system is proposed in [8] to predict line spectrum pairs (LSPs) from video. The LSPs are converted into waveforms, but since excitation is not predicted the resulting speech sounds unnatural. This method is extended in [9] by adding optical flow information as input to the network and by adding a post-processing step, where generated sound features are replaced by their closest match from the training set. A similar method that uses multi-view visual feeds has been proposed in [10]. Finally, Akbari et al. [11] propose a model that uses CNNs and recurrent neural networks (RNNs) to transform a video sequence into audio spectrograms, which are later transformed into waveforms using the algorithm proposed in [12].

In this work, we propose an end-to-end model that is capable of directly converting silent video to a raw waveform, without the need for any intermediate handcrafted features, as shown in Fig. 1. Our method is based on generative adversarial networks (GANs), which allows us to produce high-fidelity (50 kHz) audio sequences of realistic, intelligible speech. Contrary to the aforementioned works, our model is capable of generating intelligible speech even in the case of unseen speakers (generated samples are available online at https://sites.google.com/view/speech-synthesis/home). The generated speech is evaluated using standard audio reconstruction and speech synthesis metrics. Additionally, we propose using a speech recognition model to verify the accuracy of the spoken words and a synchronisation method to quantify the audio-visual synchrony.

2 Video-Driven Speech Reconstruction
The proposed model for speech reconstruction is made up of 3 sub-networks. The generator network, shown in Fig. 2, is responsible for transforming the sequence of video frames into a waveform. During the training phase the critic network drives the generator to produce waveforms that sound similar to natural speech. Finally, a pretrained speech encoder is used to preserve the speech content of the waveform.

2.1 Generator
The generator is made up of a content encoder and an audio frame decoder. The content encoder consists of a visual feature encoder and an RNN. The visual feature encoder is a 5-layer 3D CNN responsible for encoding information about the visual speech present in a window of N consecutive frames. These encodings z_s, produced at each time step, are fed to a single-layer gated recurrent unit (GRU) network, which produces a series of features z_c describing the content of the video. The audio frame decoder receives these features as input and generates the corresponding window of audio samples.
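A minimal PyTorch-style sketch of this encoder-decoder structure is shown below. The layer counts, widths, spatial pooling and the way the N-frame window is handled are illustrative assumptions and do not reproduce the exact configuration of Fig. 2 (nor the normalisation and activation choices described next).

```python
# Illustrative sketch only: layer counts, widths and the spatial pooling are assumptions,
# not the exact configuration shown in Fig. 2.
import torch
import torch.nn as nn


class ContentEncoder(nn.Module):
    """Per-frame 3D-convolutional visual feature encoder followed by a single-layer GRU."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.visual_encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.Conv3d(64, feat_dim, kernel_size=(1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.Tanh(),  # last encoder layer: tanh, no batch normalisation
        )
        self.gru = nn.GRU(input_size=feat_dim, hidden_size=feat_dim, batch_first=True)

    def forward(self, frames):                       # frames: (B, 3, T, H, W)
        z_s = self.visual_encoder(frames)            # (B, C, T, h, w)
        z_s = z_s.mean(dim=(3, 4)).permute(0, 2, 1)  # pool spatially -> (B, T, C)
        z_c, _ = self.gru(z_s)                       # content features, one per frame
        return z_c


class AudioFrameDecoder(nn.Module):
    """Upsamples each content feature into a window of raw audio samples."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(feat_dim, 128, kernel_size=4, stride=4),  # 1 -> 4 samples
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.ConvTranspose1d(128, 64, kernel_size=4, stride=4),        # 4 -> 16
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.ConvTranspose1d(64, 32, kernel_size=5, stride=5),         # 16 -> 80
            nn.BatchNorm1d(32), nn.ReLU(),
            nn.ConvTranspose1d(32, 16, kernel_size=5, stride=5),         # 80 -> 400
            nn.BatchNorm1d(16), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, kernel_size=5, stride=5),          # 400 -> 2000 samples per frame
            nn.Tanh(),                                                   # waveform in [-1, 1]
        )

    def forward(self, z_c):                          # z_c: (B, T, C)
        b, t, c = z_c.shape
        audio = self.net(z_c.reshape(b * t, c, 1))   # one length-1 "signal" per video frame
        return audio.reshape(b, -1)                  # concatenated windows: (B, T * 2000)
```

At 25 fps, decoding on the order of 2,000 samples per video frame corresponds to the 50 kHz output rate quoted above; the exact per-frame sample count in the sketch is an assumption.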
Batch normalization [13] and ReLU activation functions are used throughout the entire generator network, except for the last layer of the visual feature encoder and of the decoder, where the hyperbolic tangent non-linearity is used without batch normalization.

Figure 2: Architecture of the generator network, consisting of a content encoder and an audio frame decoder. Convolutional layers are represented by Conv blocks, with the kernel size given as k_time x k_height x k_width and the strides for the respective axes shown in parentheses. The critic accepts as input samples of length t_d = 1 s and determines whether they come from real or generated audio. The speech encoder maps a waveform x to a feature space through a function φ.

2.2 Critic
The critic is a 3D CNN which is given audio clips of fixed length t_d from the real and generated samples. During training the critic learns a 1-Lipschitz function D, which is used to calculate the Wasserstein distance between the distributions of real and generated waveforms. In order to enforce the Lipschitz continuity of D, the gradients of the critic's output with respect to its inputs are penalized if they have a norm larger than 1 [14]. The penalty is applied independently to all inputs, and therefore batch normalization should not be used in any layer of the critic since it introduces correlation between samples of the batch. The audio clips provided as input to the critic are chosen at random from both the real and generated audio sequences using a uniform sampling function S.

2.3 Speech Encoder
The speech encoder network is used to extract speech features from the real and generated audio. This network is taken from the pretrained model proposed in [15], which performs speech-driven facial animation. Similarly to our model, this network is trained in a self-supervised manner and learns to produce encodings that capture speech content and can be used to animate a face. Using this module we are able to enforce the correct content onto the generated audio clips through a perceptual loss.

2.4 Training
Our model is based on the Wasserstein GAN proposed in [14], which minimises the Wasserstein distance between the real and fake distributions. Since optimising the Wasserstein distance directly is intractable, the Kantorovich-Rubinstein duality is used to obtain a new objective [16]. The adversarial loss of our model is shown in Equation 1, where D is the critic function, G is the generator function, x is a sample from the distribution of real clips P_r, and x̃ is a sample from the distribution of generated clips P_g. The gradient penalty shown in Equation 2 is calculated with respect to the input x̂ sampled from the distribution P_x̂, which contains all linear interpolates between P_r and P_g.

$$L_{adv} = \mathbb{E}_{x \sim P_r}[D(S(x))] - \mathbb{E}_{\tilde{x} \sim P_g}[D(S(\tilde{x}))] \quad (1)$$

$$L_{gp} = \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right] \quad (2)$$
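A minimal sketch of the critic-side objective in Equations 1 and 2, following the standard WGAN-GP formulation of [14], is given below. The `critic` module, the clip shapes, and the omission of the random 1 s sampling function S are assumptions for illustration, not the exact implementation used here.

```python
# Sketch of the WGAN-GP critic objective (Eqs. 1-2); `critic` and clip shapes are assumptions.
import torch


def gradient_penalty(critic, real_clips, fake_clips):
    """Penalise critic gradients whose norm deviates from 1 on random interpolates (Eq. 2).

    real_clips, fake_clips: (batch, samples) raw-audio tensors of equal length.
    """
    batch = real_clips.size(0)
    alpha = torch.rand(batch, 1, device=real_clips.device)      # one mixing weight per clip
    interpolates = alpha * real_clips + (1 - alpha) * fake_clips
    interpolates.requires_grad_(True)
    scores = critic(interpolates)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interpolates,
                                create_graph=True)[0]            # d critic / d input
    grad_norm = grads.view(batch, -1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()


def critic_loss(critic, real_clips, fake_clips, lambda_gp=10.0):
    """Negative Wasserstein estimate (Eq. 1) plus the weighted gradient penalty."""
    wasserstein = critic(real_clips).mean() - critic(fake_clips).mean()
    return -wasserstein + lambda_gp * gradient_penalty(critic, real_clips, fake_clips)
```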
During training a perceptual loss, corresponding to the L1 distance between the features obtained by mapping the real and generated audio, is minimized. This forces the network to learn high-level features correlated with speech. The perceptual loss is shown in Equation 3.

$$L_{p} = |\phi(x) - \phi(\tilde{x})| \quad (3)$$

An L1 reconstruction loss is also used to ensure that the generated waveform closely matches the original. Finally, we use a total variation (TV) regularisation term in our loss, described in Equation 4, where T is the number of samples in the audio. This type of regularization penalizes non-smooth waveforms and thus reduces the amount of high-frequency noise in the synthesized samples.

$$L_{TV} = \frac{1}{T} \sum_t |\tilde{x}_{t+1} - \tilde{x}_t| \quad (4)$$

The final objective used to obtain the optimal generator is a combination of the above losses, as shown in Equation 5. The loss terms are weighted so that each has an approximately equal contribution. The weights we used were λ_L1 = 150, λ_TV = 120, λ_gp = 10 and λ_p = 70.

$$L = \min_G \max_D L_{adv} + \lambda_{L1} L_{L1} + \lambda_{TV} L_{TV} + \lambda_p L_p + \lambda_{gp} L_{gp} \quad (5)$$

When training a Wasserstein GAN the critic should be trained to optimality [17]. We therefore perform 6 updates of the critic for every update of the generator. We use the Adam optimizer [18] with a learning rate of 0.0001 for both the generator and the critic, and train until no improvement is seen in the mel-cepstral distance between real and generated audio on the validation set for 10 epochs. We use a window of N = 7 frames as input to the generator and a clip of t_d = 1 s as input to the critic.
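The generator-side terms of Equations 3-5 are straightforward to express directly, as in the sketch below. The pretrained speech encoder `phi` (the network of [15]), the sampling function `sample_clip` (the function S) and the tensor shapes are assumptions; the gradient penalty belongs to the critic update shown earlier.

```python
# Sketch of the generator-side loss combination (Eqs. 3-5); `phi` and shapes are assumptions.
import torch
import torch.nn.functional as F


def generator_losses(real_audio, fake_audio, critic, phi, sample_clip,
                     lambda_l1=150.0, lambda_tv=120.0, lambda_p=70.0):
    """real_audio, fake_audio: (batch, samples); sample_clip crops a random 1 s window (S)."""
    # Adversarial term: the generator maximises the critic score of its clips (Eq. 1).
    l_adv = -critic(sample_clip(fake_audio)).mean()

    # L1 reconstruction loss between generated and ground-truth waveforms.
    l_rec = F.l1_loss(fake_audio, real_audio)

    # Total variation regularisation (Eq. 4): penalise sample-to-sample jumps.
    l_tv = (fake_audio[:, 1:] - fake_audio[:, :-1]).abs().mean()

    # Perceptual loss (Eq. 3): L1 distance between speech-encoder features.
    l_p = (phi(real_audio) - phi(fake_audio)).abs().mean()

    return l_adv + lambda_l1 * l_rec + lambda_tv * l_tv + lambda_p * l_p
```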
3 Experimental Setup
Our model is implemented in PyTorch and takes approximately 4 days to train on an Nvidia GeForce GTX 1080 Ti GPU. During inference the model is capable of generating audio for a 3 s video recorded at 25 frames per second (fps) in 60 ms when running on a GPU and in 6 s when running on the CPU.

3.1 Dataset
The GRID dataset has 33 speakers, each uttering 1000 short phrases containing 6 words taken from a limited dictionary. The structure of a GRID sentence is described in Table 1. We evaluate our method in both speaker dependent and speaker independent settings. Subjects 1, 2, 4 and 29 were used for the subject dependent task, and their videos were split into training, validation and test sets using a random 90%-5%-5% split. This setup is similar to that used in [11]. In the subject independent setting the data is divided into sets based on the subjects. We use 15 subjects for training, 8 for validation and 10 for testing, following the split proposed in [15].

Table 1: Structure of a typical sentence from the GRID corpus.
Command   Color   Preposition   Letter              Digit   Adverb
bin       blue    at            A-Z (excluding W)   0-9     again
lay       green   by                                        now
place     red     in                                        please
set       white   with                                      soon

As part of our preprocessing all faces are aligned to a canonical face using 5 anchor points taken from the edges of the eyes and the tip of the nose. Video frames are normalised, resized to 128x96 and cropped, keeping only the bottom half, which contains the mouth. Finally, data augmentation is performed by mirroring the training videos.

3.2 Metrics
Evaluation of the generated audio is not a trivial task and there is no single metric capable of assessing all aspects of speech, such as quality, intelligibility and spoken-word accuracy. For this reason we employ multiple methods to evaluate the different aspects of our generated samples. In order to measure the quality of the produced samples we use the mean mel-cepstral distortion (MCD) [19], which measures the distance between two signals in the mel-frequency cepstrum and is commonly used to assess the performance of speech synthesizers. We also use the Short-Time Objective Intelligibility (STOI) metric, which measures the intelligibility of the generated audio clip. Speech quality is measured with the Perceptual Evaluation of Speech Quality (PESQ) metric [20]. PESQ was originally designed to quantify degradation due to codecs and transmission channel errors. However, this metric is sensitive to changes in the speaker's voice, loudness and listening level [20]. Therefore, although it may not be an ideal measure for this task, we still use it in order to be consistent with previous works.

In order to determine the synchronisation between the video and the audio we use the pretrained SyncNet model proposed in [21]. This model calculates the Euclidean distance between the audio and video encodings at multiple locations in the sequence and produces the audio-visual offset (measured in frames), based on the location of the smallest distance, and the audio-visual correlation (confidence), based on the fluctuation of the distance.

Finally, the accuracy of the spoken message is measured using the word error rate (WER), as computed by an ASR system trained on the GRID training set. On ground truth audio the ASR system achieves 4% WER.
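As a rough illustration of how such objective scores can be computed, the sketch below uses the third-party pystoi, pesq and librosa packages; these are convenient stand-ins rather than the evaluation code used for the results below, and the simplified MCD omits the usual scaling constant and time alignment.

```python
# Illustrative evaluation sketch using third-party packages; not the evaluation code used here.
import numpy as np
import librosa
from pystoi import stoi
from pesq import pesq


def evaluate_clip(reference, generated, sr=50000):
    """Compare a generated waveform (numpy float array) against the ground-truth audio."""
    n = min(len(reference), len(generated))
    ref, gen = reference[:n], generated[:n]

    # STOI accepts the native sampling rate; PESQ expects 8 kHz or 16 kHz input.
    stoi_score = stoi(ref, gen, sr, extended=False)
    ref16 = librosa.resample(ref, orig_sr=sr, target_sr=16000)
    gen16 = librosa.resample(gen, orig_sr=sr, target_sr=16000)
    pesq_score = pesq(16000, ref16, gen16, 'wb')

    # Crude mel-cepstral distortion: mean Euclidean distance between MFCC frames
    # (the standard 10*sqrt(2)/ln(10) constant and DTW alignment are omitted).
    mfcc_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=13)
    mfcc_gen = librosa.feature.mfcc(y=gen, sr=sr, n_mfcc=13)
    frames = min(mfcc_ref.shape[1], mfcc_gen.shape[1])
    mcd = np.mean(np.linalg.norm(mfcc_ref[:, :frames] - mfcc_gen[:, :frames], axis=0))

    return {'STOI': stoi_score, 'PESQ': pesq_score, 'MCD': mcd}
```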
4 Results
Our proposed model is capable of producing intelligible, high-quality speech at high sampling rates such as 50 kHz. We evaluate our model in the subject dependent scenario and compare it to the Lip2AudSpec model proposed in [11]. Furthermore, we present results when the system is tested on unseen speakers.

4.1 Speaker Dependent Scenario
Examples of real and generated audio samples are compared in Fig. 3. In order to present a fair comparison, our audio is sub-sampled to match the rate of the samples produced by Lip2AudSpec. Through inspection it is noticeable that our model captures more information than Lip2AudSpec, especially for very low and very high frequency content. This results in words being more clearly articulated when using our model. Our model introduces artifacts in the form of a low-power persistent hum in the waveform, which is also visible in the spectral domain.

Figure 3: Examples of real and generated waveforms and their corresponding spectrograms: (a) proposed model, (b) real waveform, (c) Lip2AudSpec.

Table 2: Metrics for the evaluation of the generated audio waveforms when testing on seen speakers.
Measure         Lip2AudSpec   Proposed Model
PESQ            1.82          1.71
WER             32.5%         26.6%
AV Confidence   3.5           4.4
AV Offset       1             1
STOI            0.446         0.518
MCD             38.14         22.29

Table 3: Ablation study performed in the subject dependent setting.
Model      PESQ   WER     AV Conf/Offset   STOI    MCD
Full       1.71   26.6%   4.4 (1)          0.518   22.29
w/o L_L1   1.45   33.2%   3.9 (1)          0.450   26.87
w/o L_TV   1.44   31.3%   3.9 (1)          0.483   25.82
w/o L_p    1.14   83.3%   2.0 (5)          0.378   30.12

We compare the samples produced by our model to those produced by Lip2AudSpec using the metrics described in Section 3.2. The results are shown in Table 2. Our method outperforms Lip2AudSpec in all the intelligibility tests (STOI, WER). Furthermore, our generated samples achieve better spectral accuracy, as indicated by the smaller MCD. Lip2AudSpec achieves a better PESQ score, which is likely due to the artifacts created by our model. Indeed, if we apply average filtering to the signal, which reduces these artifacts, the PESQ increases to 1.80. However, this comes at the expense of sharpness and intelligibility, since STOI drops to 0.47 and the WER increases to 36%. Finally, both methods have similar scores for audio-visual synchrony, which is expected since they both use similar architectures to extract the visual features.

In order to quantify the effect of each loss term we perform an ablation study by removing one loss term at a time. The results of the ablation study are shown in Table 3. It is clear that removing any term degrades performance, and the full model is the best across all performance measures. We note that the adversarial loss is necessary for the production of speech: when the system was evaluated without it, generation resulted in noise. The other important contribution comes from the perceptual loss, without which the generated speech does not accurately capture the content of the spoken message. This is evident from the large increase in the WER, although it is reflected in the other metrics as well.

4.2 Speaker Independent Scenario
We present the results of our model on unseen speakers in Table 4. Comparison with other methods is not possible for this setup, since the other methods are speaker dependent (training the Lip2AudSpec model on the same data did not produce intelligible results).

Table 4: Metrics for the evaluation of the generated audio waveforms when testing on unseen speakers.
PESQ   WER     AV Confidence   AV Offset   STOI    MCD
1.24   40.5%   4.1             1           0.445   24.29

From the results it is evident that the speaker independent setting is a more complex problem. One of the reasons for this is the fact that the model cannot learn the voice of unseen speakers. This results in generating a voice that does not correspond to the real speaker (e.g. a female voice for a male speaker). Furthermore, in certain cases the voice will morph during the phrase. These factors likely account in large part for the drop in the PESQ metric, which is sensitive to such alterations. Morphing voices likely also affect the intelligibility of the audio clip, which is reflected in the WER, STOI and MCD. Finally, we notice a slight improvement in audio-visual synchrony, which may be due to the increased number of samples seen during training.

It is important to note that the model performs differently for each unseen subject. The WER fluctuates from 40% to 60% depending on the speaker. This is to be expected, especially for subjects whose appearance differs greatly from the subjects in the training set.
Additionally, since GRID is made up mostly of native speakers of English, we notice that unseen subjects who are non-native speakers have a worse WER.

5 Conclusions
In this work we have presented an end-to-end model that reconstructs speech from silent video and evaluated it in two different scenarios. Our model is capable of generating intelligible audio for both seen and unseen speakers. Future research should focus on producing more natural and coherent voices for unseen speakers as well as on improving intelligibility. Furthermore, we believe that such systems are capable of reflecting the speaker's emotion and should be tested on expressive datasets. Finally, a major limitation of this method is the fact that it operates solely on frontal faces. Therefore the natural progression of this work will be to reconstruct speech from videos taken in the wild.

6 Acknowledgements
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research and Amazon Web Services for providing the computational resources for our experiments.

References
[1] Y. M. Assael, B. Shillingford, S. Whiteson, and N. De Freitas, "LipNet: End-to-end sentence-level lipreading," arXiv preprint arXiv:1611.01599, 2016.
[2] S. Petridis, Y. Wang, Z. Li, and M. Pantic, "End-to-end multi-view lipreading," in British Machine Vision Conference, London, September 2017.
[3] S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, "End-to-end audiovisual speech recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 6548-6552.
[4] T. Stafylakis and G. Tzimiropoulos, "Combining residual networks with LSTMs for lipreading," in Interspeech, 2017.
[5] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Lip reading sentences in the wild," in IEEE CVPR, July 2017.
[6] T. Le Cornu and B. Milner, "Reconstructing intelligible audio speech from visual speech features," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[7] T. Le Cornu and B. Milner, "Generating intelligible audio speech from visual speech," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 9, pp. 1751-1761, Sep. 2017.
[8] A. Ephrat and S. Peleg, "Vid2speech: Speech reconstruction from silent video," in Int'l Conf. on Acoustics, Speech and Signal Processing, 2017, pp. 5095-5099.
[9] A. Ephrat, T. Halperin, and S. Peleg, "Improved speech reconstruction from silent video," in ICCV 2017 Workshop on Computer Vision for Audio-Visual Media, 2017.
[10] Y. Kumar, R. Jain, M. Salik, R. R. Shah, R. Zimmermann, and Y. Yin, "MyLipper: A personalized system for speech reconstruction using multi-view visual feeds," in 2018 IEEE International Symposium on Multimedia (ISM), Dec 2018, pp. 159-166.
[11] H. Akbari, H. Arora, L. Cao, and N. Mesgarani, "Lip2AudSpec: Speech reconstruction from silent lip movements video," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2516-2520.
[12] T. Chi, P. Ru, and S. Shamma, "Multiresolution spectrotemporal analysis of complex sounds," The Journal of the Acoustical Society of America, vol. 118, pp. 887-906, 2005.
[13] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning (ICML'15), vol. 37. JMLR.org, 2015, pp. 448-456.
[14] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," in Advances in Neural Information Processing Systems, 2017, pp. 5767-5777.
[15] K. Vougioukas, S. Petridis, and M. Pantic, "End-to-end speech-driven facial animation with temporal GANs," in British Machine Vision Conference, 2018.
[16] C. Villani, Optimal Transport: Old and New. Springer Science & Business Media, 2008, vol. 338.
[17] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in International Conference on Machine Learning, 2017, pp. 214-223.
[18] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[19] R. Kubichek, "Mel-cepstral distance measure for objective speech quality assessment," in Proceedings of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, vol. 1, May 1993, pp. 125-128.
[20] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs," in ICASSP. IEEE Computer Society, 2001, pp. 749-752.
[21] J. S. Chung and A. Zisserman, "Out of time: automated lip sync in the wild," in Workshop on Multi-view Lip-reading, ACCV, 2016.
