Noise-tolerant Audio-visual Online Person Verification using an Attention-based Neural Network Fusion


Authors: Suwon Shon, Tae-Hyun Oh, James Glass

MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
{swshon,glass}@mit.edu, taehyun@csail.mit.edu

ABSTRACT

In this paper, we present a multi-modal online person verification system using both speech and visual signals. Inspired by neuroscientific findings on the association of voice and face, we propose an attention-based end-to-end neural network that learns multi-sensory associations for the task of person verification. The attention mechanism in our proposed network learns to conditionally select a salient modality between speech and facial representations, providing a balance between complementary inputs. By virtue of this capability, the network is robust to missing or corrupted data from either modality. On the VoxCeleb2 dataset, we show that our method performs favorably against competing multi-modal methods. Even for extreme cases of large corruption or an entirely missing modality, our method demonstrates robustness over unimodal methods.

Index Terms — person verification, recognition, multi-modal, cross-modal, attention, neural network.

1. INTRODUCTION

From cognitive and neuroscience studies on the integration of face and voice signals in humans, it has been observed that the face-voice association is treated differently in the brain compared to other paired stimuli [1], and that this perceptual integration plays an important role and is actually leveraged for person recognition processing [2]. Inspired by these findings, computational models have recently been introduced to understand whether, and to what extent, such models can leverage associations between different modalities. To investigate this multi-modal association, Nagrani et al. [3], Horiguchi et al. [4] and Kim et al.
[5] presented a face-voice cross-modal matching task by learning a shared representation for both modalities. Neural network-based cross-modal learning is explored to distill common or complementary information from large-scale paired data. In particular, Kim et al. showed that their computational model behaves similarly to humans.

Based on these explorations of multi-modal computational learnability, we propose to investigate the use of multi-modal neural networks for the more specific and challenging task of person verification. There has been some work investigating person verification using multi-modal biometric data [6, 7, 8, 9, 10, 11]. These methods typically consist of independent face and voice unimodal recognition modules that are trained separately, with respective scores from the unimodal modules being combined through score fusion. These methods also typically run in an off-line manner, whereby multiple frames of the face and several seconds of speech are used to maximize recognition performance, so there is an inherent latency built into the methodology. On the other hand, feature-level fusion has been uncommon in person verification. Feature-level fusion has been more commonly adopted in audio-visual speech recognition [12, 13], ranging from simple concatenation of features to end-to-end systems [14, 15] with synchronized audio-visual features. In this work, we shed light on feature-level fusion for multi-modal person recognition.

In this paper, we explore an online audio-visual fusion system for person verification using face and voice. In contrast to previous work on person verification, our proposed fusion is conducted at the feature level. In particular, we focus on the fusion of synchronized audio-visual data based on the argument that the system should naturally emphasize the time-varying contribution of each modality according to its instantaneous quality at any point in time.
Our method exploits a single video frame of the face and a short span of speech to facilitate online processing applications, while maintaining high performance relative to the prior state of the art. Motivated by attention [16] and the multi-sensory association mechanism of the human brain [1], our fusion method is implemented with an attention mechanism, such that it can learn to evaluate the salient modality of the input data. Due to the inherent robustness of this architecture, we expect stable performance even when there is corrupted information from either face or voice due to noise masking, or missing information from basic pre-processing failures of either modality, e.g., face detection, voice activity detection (VAD), etc. We experimentally verify that this audio-visual fusion network is robust to corrupted and missing information from one modality. We also analyze the attention layer output to see how it behaves under certain characteristics of the input.

2. ONLINE PERSON VERIFICATION FROM VIDEO

The verification of a person's identity is often achieved by using information from a single modality that contains the biometric signal, such as images for face identification and audio for speaker verification. When multiple modalities are available, such as in video recordings of someone speaking, opportunities exist to explore fusion of information from both modalities. Both vision and hearing must address challenges due to variation in a person's appearance or voice, or occlusion due to environmental conditions. In the case of vision, the image of a person's face will appear differently due to physical changes in a person's appearance, emotional state, and occlusion by other objects, and will depend on position and orientation relative to the camera. Likewise, a person's voice can change due to health or emotional state, and will be affected by environmental noise, reverberation, and channel conditions.
One interesting difference between face and speaker ID technologies is that high-quality face ID can be obtained from a single image of a person's face. In video data, this corresponds to a single instance in time, and can be sampled many times a second.

[Fig. 1: Neural network based fusion approaches. (a) System A: concatenated embeddings with a fusion layer and contrastive loss; (b) System B: projected embeddings combined by summation; (c) System C: attention-weighted combination. e_v: speaker embedding, e_f: face embedding.]

In contrast, to achieve the same level of performance, speaker verification tasks typically require a much longer sample of speech from the talker (e.g., 10-30 sec of speech is a typical condition, with a few seconds of speech being a much more challenging task). This discrepancy arises because, unlike images of faces, the speech signal is highly time-varying due to the nature of speech production. A random snippet of speech can be dramatically different from another, even when spoken by the same talker, due to differences in the acoustic-phonetic sequences present in the samples. The characteristics of a talker's voice are more reliably extracted when the duration of a speech recording contains more examples of the different sounds produced by the talker. For person verification, there is some truth to the mantra that a picture is worth a thousand words!

When processing video data, there will be situations where one modality or the other may be corrupted or altogether missing. A corrupted modality can be caused by a false alarm of a pre-processing step such as face detection or voice activity detection (VAD). For example, a face detector may incorrectly identify a face, or detect the wrong face or region in the video, or the VAD might be activated by background noise that does not contain a human voice.
These corrupted inputs could easily confound a multi-modal network to the point where its performance could be worse than fusing separate unimodal systems. When one modality is completely missing, the easiest solution in practice would be to switch to an alternative backup unimodal system operating on the uncontaminated modality. We will demonstrate that our multi-modal system performs favorably against this systematic approach even in the completely missing case.

3. AUDIO-VISUAL MULTI-MODAL FUSION

In this section, we describe the proposed multi-modal fusion approach and its voice and face representation subsystems. Our method is distinguished from previous studies by its use of a feature-level fusion approach based on neural network models. Given discriminative face and speaker representations extracted from each subsystem, our attention layer evaluates the contribution of each representation. Then, we combine the representations according to the estimated contributions, so that a joint representation is obtained. We learn this whole fusion network for the person verification task without additional supervision for the attention. In the test phase, we compute the similarity of joint representations between the query (enrollment) and test samples to verify identities.

In the following sections we elaborate on the proposed fusion approach and the speech and face sub-systems used in our experiments.

3.1. Proposed Fusion Approach

We develop a multi-modal attention model that can pay attention to the salient modality of the inputs while producing a powerful fused representation appropriate for the person verification task. This is inspired by humans' multi-sensory capability. Among diverse facets of the human multi-sensory system, the presence of selective attention [16] allows humans to first pick salient information even from crowded sensory inputs.
The human attention mechanism dynamically brings salient features to the forefront as needed without collapsing holistic information into a blurry abstraction. The realization of this attention mechanism in deep neural networks has achieved success in various machine learning applications. Our attention network is similar to soft attention [17], which is differentiable. While most previous work applies spatial or temporal attention, our attention is extended to be attentive across the modality axis.

Given face and speaker embeddings e_f and e_v, we define the attention scores â_{f,v} through an attention layer f_att(·) as

    â_{f,v} = f_att([e_f, e_v]) = Wᵀ[e_f, e_v] + b,    (1)

where W ∈ ℝ^{d×m} and b ∈ ℝ^m are the learnable parameters of the attention layer, m and d denote the number of modalities to fuse and the input dimension of the attention layer respectively, and e_f and e_v will be discussed in the next subsection. Then, the fused embedding z is constructed by the weighted sum

    z = Σ_{i ∈ {f,v}} α_i ẽ_i,   where α_i = exp(â_i) / Σ_{k ∈ {f,v}} exp(â_k),   i ∈ {f, v},    (2)

where ẽ denotes the embeddings projected into a co-embedding space compatible with the linear combination. To map ẽ_{f,v} from e_{f,v}, we use a fully connected (FC) layer with 600 hidden nodes, i.e., ẽ ∈ ℝ^600. We do not use a non-linearity in the FC layer. We train the attention network with the contrastive loss on the joint embedding z ∈ ℝ^600. For each training step, we used 60 positive and 60 negative pairs, a total of 120 pairs per mini-batch, and all pairs were sampled from the VoxCeleb2 development set.

The proposed attention network allows us to naturally deal with corruption or missing data from either modality. In our framework, the attention network spontaneously learns to implicitly assess the quality of the given multi-modal data.
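The fusion rule of Eqs. (1)-(2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the trained system: the parameter names (`W`, `b`, `P_f`, `P_v`) and the random initialization are assumptions for the sketch, and the FC projections stand in for the linear 600-node layer described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_fuse(e_f, e_v, W, b, P_f, P_v):
    """Sketch of Eqs. (1)-(2): score the concatenated embeddings,
    softmax over the modality axis, then combine the linearly
    projected embeddings with the resulting weights."""
    x = np.concatenate([e_f, e_v])       # [e_f, e_v] in R^d
    a_hat = W.T @ x + b                  # Eq. (1): one score per modality
    alpha = np.exp(a_hat - a_hat.max())
    alpha = alpha / alpha.sum()          # Eq. (2): softmax weights
    e_f_tilde = P_f @ e_f                # linear FC projections into the
    e_v_tilde = P_v @ e_v                # 600-D co-embedding space
    z = alpha[0] * e_f_tilde + alpha[1] * e_v_tilde
    return z, alpha

# Toy dimensions matching the paper's setup: 512-D face and 600-D
# voice embeddings, projected to a 600-D joint space, m = 2 modalities.
d_f, d_v, d_joint = 512, 600, 600
d, m = d_f + d_v, 2
W, b = 0.01 * rng.standard_normal((d, m)), np.zeros(m)
P_f = 0.01 * rng.standard_normal((d_joint, d_f))
P_v = 0.01 * rng.standard_normal((d_joint, d_v))

e_f = rng.standard_normal(d_f)
e_v = rng.standard_normal(d_v)
z, alpha = attention_fuse(e_f, e_v, W, b, P_f, P_v)
print(z.shape, alpha)  # a 600-D joint embedding and weights summing to 1
```

In the trained network these parameters are learned end-to-end from the contrastive loss on z; no explicit supervision is given for the weights alpha.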
For example, if the audio signal is largely corrupted by surrounding noise, the attention network would switch off the voice representation path and rely only on the face representation, and vice versa. In this way, as long as at least one modality provides appropriate information for the task, the model will be able to perform person verification.

Relationship with Other Fusion Methods. In the context of multi-modal person verification, traditional score-level fusion with logistic regression has been investigated up to the present [6, 7, 8, 9, 10, 11]. These score fusion methods do not leverage any large-capacity deep neural networks capable of handling non-trivial fusion strategies. One extension of these approaches stacks FC layers on top of the concatenated speaker and face embeddings, e_v and e_f, as shown in Figure 1(a), i.e., System A. We used 2 FC layers with 1,200 and 600 hidden nodes, with ReLUs as the non-linearity in the first FC layer. This can be regarded as feature-level fusion similar to Nagrani et al. [3]. A downside of this approach is that the performance of the system is degraded by corrupted modal data.

Another neural network based fusion can be accomplished as shown in Figure 1(b). FC layers are stacked on top of the respective embeddings, e_v and e_f, without a non-linear activation function. This layer simply projects each modality's embedding into a joint audio-visual subspace. Then, the projected embeddings, ẽ_v and ẽ_f, are combined by summation and used for the contrastive loss as before. This summation-based ensemble assumes both modalities contribute equally, typically yielding a mean representation which can be easily biased by large contamination [18]. Our method, in contrast, adaptively estimates the weight of each embedding to construct a joint representation.
Either weight can be turned off if its embedding would degrade the end performance. This makes the system not only robust but also able to deal with missing data or large corruption of the data.

3.2. Voice and Face Representations

To obtain discriminative embeddings for face and voice, e_f and e_v, we exploit existing deep neural network based representations.

Voice embedding. Voice embeddings generally exploit a large dataset, including augmented data with added background noise. A voice embedding can be extracted from one of the hidden layers of a neural network trained to classify the N speakers in the training dataset. In a previous study, we proposed a frame-level voice embedding that extracts robust speaker information by modifying the DNN structure after training is complete [19]. For training, the VoxCeleb1 development dataset was used. Details can be found in [19], since we use the same system. Frame-level voice embeddings are extracted every 10 ms using a 25 ms frame window. Before fusion, a total of 10 or 100 successive voice embeddings are averaged to create a voice embedding spanning 115 ms or 1015 ms, respectively, since a single frame-level voice embedding spanning 25 ms is too short to extract voice characteristics reliably.

Face embedding. Our face embeddings are extracted using FaceNet [20] pre-trained on CASIA-WebFace.¹ Since the provided face region annotations in the VoxCeleb datasets are coarse, we re-align and crop faces with the face and landmark detectors in Dlib.²

4. EXPERIMENTS

In this section, we evaluate the proposed method against various baselines. In Sec. 4.2, we compare person verification performance

¹https://github.com/davidsandberg/facenet — We used this reproduced open model, which has been improved by the maintainers with several modifications, including changing the dimension of the last layer from 128-D to 512-D.
We use the last 512-D FC7 layer activation of this FaceNet version as the face embedding.

²http://dlib.net

                         l = 0.115 sec      l = 1.015 sec
Systems                  EER      mDCF      EER      mDCF
Voice embedding (e_v)    41.27    0.999     14.50    0.863
Face embedding (e_f)     8.03     0.631     8.03     0.631
Score-level fusion       7.83     0.623     5.78     0.491
System A                 7.74     0.634     5.52     0.478
System B                 7.81     0.625     5.56     0.472
System C (proposed)      7.46     0.611     5.29     0.456

Table 1: Person verification performance on the VoxCeleb2 test set. l is the length of the audio segment used to extract the voice embedding.

with several multimodal fusion approaches as well as unimodal methods in the ordinary scenario where both modal data are given. Then, we demonstrate the robustness of the proposed method against corrupted data in Sec. 4.3. Moreover, we analyze the behavior of the attention layer according to interpretable attributes, including head pose and facial appearance traits, in Sec. 4.4.

4.1. Experimental Environment

For our experiments, we used the VoxCeleb1 & 2 datasets [21, 22], which include multimedia data with a reliable pre-processing step to obtain face regions and voice segments. VoxCeleb1 & 2 have more than 1,281,352 utterances from 7,365 speakers, and both datasets have development and test set splits. For verification performance measurement, we made a test trial set using the VoxCeleb2 test set, which contains 36,693 video clips from 120 speakers. We made 300 positive trials (i.e., the same speaker from different clips) and 300 negative trials (i.e., different speakers) per speaker, for a total of 71,790 trials.³ We used cosine similarity to measure the distance between two embeddings.

Voice and face embeddings were extracted in 600 and 512 dimensions, respectively. For training the fusion networks (A, B, C), we extracted 1 frame per second and its relevant audio segment of 0.115 sec or 1.015 sec. Both embeddings were L2-normalized to have unit length before being fed into the fusion network.
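The trial scoring described above, cosine similarity between L2-normalized embeddings summarized by the equal error rate, can be sketched as follows. The toy scores and labels are illustrative stand-ins for the paper's 71,790 trials, and the EER search is a simple threshold sweep rather than any particular toolkit's implementation.

```python
import numpy as np

def cosine_score(a, b):
    # L2-normalize both embeddings; the dot product is then the cosine.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

def equal_error_rate(scores, labels):
    """EER: operating point where the false-accept rate (FAR) equals
    the false-reject rate (FRR). labels: 1 = same person, 0 = different."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    best_gap, best_eer = np.inf, None
    for t in np.sort(scores):                      # sweep each score as threshold
        far = np.mean(scores[labels == 0] >= t)    # impostors accepted
        frr = np.mean(scores[labels == 1] < t)     # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

# Hypothetical trial list: positives score high, negatives low.
scores = [0.9, 0.8, 0.75, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(equal_error_rate(scores, labels))  # perfectly separable -> EER of 0
```

The mDCF metric reported alongside EER additionally weights miss and false-alarm costs at a fixed target prior (P_target = 0.01), following the NIST SRE 2016 evaluation plan [23].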
To test, we extract a single frame and its relevant audio segment randomly from each video clip. Thus, a total of 36,693 still images and 0.115 sec (or 1.015 sec) audio segments are used for the test trials. Performance was measured in terms of Equal Error Rate (EER) and minimum Detection Cost Function (mDCF) with P_target = 0.01 [23].

³The number is slightly less than 72,000 because there are a few individuals who have fewer than five video clips.

4.2. Fusion Performance

As shown in Table 1, the voice embedding shows significantly worse performance than the face embedding. This is natural because we only use 0.115 sec or 1.015 sec of audio, which is a very short segment from which to extract reliable representations from text-independent speech. Score-level fusion was done using logistic regression calibrated on the VoxCeleb2 development set [24]. Systems A, B, and C are neural network-based fusion approaches. While Systems A and B show slightly better EER than score-level fusion, System C shows a notable gain in both EER and mDCF.

4.3. Effect of Corrupted and Missing Modalities

To examine performance under a corrupted or missing modality of either voice or face, we generated random noise drawn from a standard normal distribution and a zero vector. Random noise mimics embeddings from a corrupted modality, e.g., an image without a face or audio without a voice due to an error in the pre-processing step. The zero vector is for the case of a missing modality; this case could easily be handled by switching from the multi-modal system to a unimodal system. However, we were interested in the scenario where only a single universal system is used, and we measured performance when either modality did not exist.

(a) l = 0.115 sec
                       Voice null embeddings            Face null embeddings
                       Random          Zeros            Random          Zeros
Systems                EER     mDCF    EER     mDCF     EER     mDCF    EER     mDCF
Score fusion           8.05    0.633   8.03    0.631    49.99   0.999   41.27   0.999
System A               8.51    0.712   7.59    0.648    38.81   0.999   35.51   0.999
System B               8.76    0.748   7.51    0.637    37.74   0.999   34.12   0.999
System C (proposed)    7.77    0.626   7.50    0.633    37.23   0.999   34.22   0.999

(b) l = 1.015 sec
                       Voice null embeddings            Face null embeddings
                       Random          Zeros            Random          Zeros
Systems                EER     mDCF    EER     mDCF     EER     mDCF    EER     mDCF
Score fusion           8.19    0.634   8.03    0.631    28.18   0.995   14.50   0.863
System A               8.64    0.732   7.64    0.649    15.42   0.960   13.27   0.897
System B               8.69    0.724   7.61    0.647    16.52   0.970   14.55   0.901
System C (proposed)    7.89    0.623   7.65    0.636    12.64   0.905   12.23   0.871

Table 2: Performance under a corrupted or missing modality of either voice or face. l is the length of the audio segment used to extract the voice embedding.

In Table 2, the proposed System C shows better performance in both the corrupted and missing modality conditions by assessing the quality of the embedding in the attention layer.

Interestingly, the neural network based fusion systems, particularly the proposed fusion approach, obtain better performance than using a unimodal embedding even when information is only partially available. In a neuroscience study, it has been observed that unimodal perception benefits from the multisensory association of an ecologically valid and sensory-redundant stimulus pair [25]. As an extension of this observation, we can interpret that the fusion network learns the association of the multisensory data and becomes able to extract more robust features even without multisensory data.

4.4. Analysis of the Attention Layer

We analyze the behavior of the attention layer in our networks. In order to probe what information it has learned and how it behaves according to interpretable attributes, we conduct control experiments with facial appearance attributes. By measuring the probabilities of face/voice attention weights conditioned on an attribute in the test set, we investigate the existence of a statistical correlation between the attribute and the attention, and its tendency.
We obtain the attributes of the VoxCeleb2 test set by using the state-of-the-art methods of Rudd et al. [26] and Feng et al. [27] for 40 facial appearance attributes (defined in the CelebA dataset [28]) and 3D head orientation, respectively. We focus on the relationship between the behavior of attention weights and attributes, considering that Kim et al. [5] already showed the connections of face/voice representations with certain demographic attributes.

(a) Head orientation attributes (V: voice, F: face)
Head orientation    |θ| < 30°        30° < |θ| < 60°    60° < |θ|
                    V (%)   F (%)    V (%)   F (%)      V (%)   F (%)
Yaw                 43      57       46      54         44      56
Pitch               44      56       41      59         42      58
Roll                44      56       43      57         47      53

(b) Facial appearance attributes
Facial Attributes   Voice (%)   Face (%)   95% C.I.
Bald                74.89       25.11      ±4.02
Blond Hair          32.17       67.83      ±1.51
Goatee              70.06       29.94      ±1.38
Mustache            72.96       27.04      ±1.73
Sideburns           65.60       34.40      ±1.81
Straight Hair       29.65       70.35      ±1.09
Wearing Hat         72.62       27.38      ±2.14

Table 3: The expectation of P(α_v > ᾱ_v | A = true) and P(α_f > ᾱ_f | A = true), where A denotes an attribute. C.I. stands for the (Wald) confidence interval. For head orientation, the frontal face corresponds to yaw, pitch, and roll angles all equal to 0°.

As a statistical measure, given an attribute A, we measure the expectation of the probability E[P(α_f > ᾱ_f | A = true)], where ᾱ_f denotes the global mean of the face attention over all the test data, and likewise for the voice. Since the probability estimate follows the expectation of a Bernoulli trial, we measure statistical significance with the 95% binomial proportion (Wald) confidence interval. While the attribute estimation methods have an extremely low failure rate, to account for subtle outlier effects we conservatively regard a 95%-confidence lower-bound estimate as a significant signal only if it is greater than 60% (greater than random chance).
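The significance test above is the standard binomial proportion (Wald) interval; a minimal sketch, where the counts are hypothetical rather than the paper's actual measurements:

```python
import math

def wald_interval(successes, trials, z=1.96):
    """95% binomial proportion (Wald) confidence interval:
    p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n), with z = 1.96 for 95%."""
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return p - half_width, p + half_width

# Hypothetical example: suppose the voice attention weight exceeded its
# global mean in 700 of 1000 trials for some attribute.
lo, hi = wald_interval(700, 1000)
print(lo, hi)  # lower bound above 0.60 -> counted as a significant signal
```

Under the paper's criterion, an attribute is flagged only when the lower bound `lo` clears the 60% threshold, which is why the wide ±4.02 interval for the rarer "Bald" attribute still qualifies.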
From Table 3a, we could not find any correlation between head orientation and attention weights. We postulate that the FaceNet embedding is learned to be sufficiently invariant to head orientation, so the attention layer turns out to be insensitive to the quality of the embedding with respect to orientation. Table 3b shows the 7 attributes whose lower bound is above 60%. It is interesting that, when a person has temporary attributes such as "Wearing Hat," "Sideburns," "Goatee," and "Mustache," the fusion system is likely to concentrate on the voice with a much higher chance than random. Also, very strong attributes like "Bald," "Blond Hair," and "Straight Hair" show correlation with the attention weights.

5. CONCLUSION

Motivated by recent studies on multi-modal association, we proposed a feature-level attentive fusion network for the audio-visual online person verification task. The assumption of temporally synced face images and voice segments encourages the network to learn about the quality of each embedding for verifying a person's identity. The learned embeddings of both modalities share a compatible space (a co-embedding space) by virtue of the simple linear combination rule used to obtain the fused representation. Besides performing better than traditional score-level fusion, the network has the large advantage of handling severe conditions such as corrupted or missing modalities. We also analyzed the attention mechanism to understand the correspondence between attention weights and interpretable attributes of visual perception. In addition to visual appearance traits, it would be interesting to further investigate the attention behavior in terms of speech characteristics, such as pitch, language, and dialect, as a future direction.

6. REFERENCES

[1] K. von Kriegstein and A.-L. Giraud, "Implicit multisensory associations influence voice recognition," PLoS Biology, vol. 4, no. 10, pp.
e326, 2006.

[2] B. A. S. Hasan, M. Valdes-Sosa, J. Gross, and P. Belin, "'Hearing faces and seeing voices': Amodal coding of person identity in the human brain," Scientific Reports, vol. 6, 2016.

[3] A. Nagrani, S. Albanie, and A. Zisserman, "Seeing voices and hearing faces: Cross-modal biometric matching," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[4] S. Horiguchi, N. Kanda, and K. Nagamatsu, "Face-voice matching using cross-modal embeddings," in ACM Multimedia Conference, 2018.

[5] C. Kim, H. V. Shin, T.-H. Oh, A. Kaspar, M. Elgharib, and W. Matusik, "On learning associations of faces and voices," in Asian Conference on Computer Vision (ACCV), 2018.

[6] T. Choudhury, B. Clarkson, T. Jebara, and A. Pentland, "Multimodal Person Recognition using Unconstrained Audio and Video," in International Conference on Audio- and Video-Based Person Authentication, 1999, pp. 176–181.

[7] J. Luque, R. Morros, A. Garde, J. Anguita, M. Farrus, D. Macho, F. Marqués, C. Martínez, V. Vilaplana, and J. Hernando, "Audio, Video and Multimodal Person Identification in a Smart Room," in International Evaluation Workshop on Classification of Events, Activities and Relationships, 2006, pp. 258–269.

[8] W. Thomas and Kie, "Multimodal Person Recognition for Human-vehicle Interaction," IEEE MultiMedia, vol. 13, no. 2, pp. 18–31, 2006.

[9] T. Hazen and D. Schultz, "Multi-modal user authentication from video for mobile or variable-environment applications," in Interspeech, 2007.

[10] M. E. Sargin, H. Aradhye, P. J. Moreno, and M. Zhao, "Audio-visual Celebrity Recognition in Unconstrained Web Videos," in ICASSP, 2009, pp. 1977–1980.

[11] G. Sell, K. Duh, D. Snyder, D. Etter, and D. Garcia-Romero, "Audio-Visual Person Recognition in Multimedia Data from the IARPA Janus Program," in ICASSP, 2018, pp. 3031–3035.

[12] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J.
Sison, A. Mashari, and J. Zhou, "Audio-visual Speech Recognition," Tech. Rep., IDIAP, 2000.

[13] J. Kratt, F. Metze, R. Stiefelhagen, and A. Waibel, "Large Vocabulary Audio-Visual Speech Recognition Using the Janus Speech Recognition Toolkit," in Joint Pattern Recognition Symposium, 2004, pp. 488–495.

[14] R. Sanabria, F. Metze, and F. De La Torre, "Robust end-to-end deep audiovisual speech recognition," arXiv preprint arXiv:1611.06986, 2016.

[15] S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, "End-to-end Audiovisual Speech Recognition," in ICASSP, 2018.

[16] M. Corbetta and G. L. Shulman, "Control of goal-directed and stimulus-driven attention in the brain," Nature Reviews Neuroscience, vol. 3, no. 3, p. 201, 2002.

[17] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in International Conference on Learning Representations (ICLR), 2015.

[18] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?," Journal of the ACM (JACM), vol. 58, no. 3, p. 11, 2011.

[19] S. Shon, H. Tang, and J. Glass, "Frame-level Speaker Embeddings for Text-independent Speaker Recognition and Analysis of End-to-end Model," in IEEE Spoken Language Technology Workshop (SLT), 2018.

[20] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering," in Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.

[21] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Interspeech, 2017, pp. 2616–2620.

[22] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep Speaker Recognition," in Interspeech, 2018, pp. 1086–1090.

[23] "The NIST 2016 Speaker Recognition Evaluation Plan," available: https://www.nist.gov/document/sre16evalplanv13pdf.

[24] N. Brümmer and D. A.
van Leeuwen, "On calibration of language recognition scores," in IEEE Odyssey 2006: Workshop on Speaker and Language Recognition, 2006, pp. 1–8.

[25] K. von Kriegstein and A.-L. Giraud, "Implicit multisensory associations influence voice recognition," PLoS Biology, vol. 4, no. 10, pp. 1–12, 2006.

[26] E. M. Rudd, M. Günther, and T. E. Boult, "MOON: A mixed objective optimization network for the recognition of facial attributes," in European Conference on Computer Vision (ECCV), Springer, 2016.

[27] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou, "Joint 3D face reconstruction and dense alignment with position map regression network," in European Conference on Computer Vision (ECCV), 2018.

[28] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3730–3738.
