Who said that?: Audio-visual speaker diarisation of real-world meetings

Authors: Joon Son Chung, Bong-Jin Lee, Icksang Han

Naver Corporation, South Korea
{joonson.chung,bongjin.lee,icksang.han}@navercorp.com

Abstract

The goal of this work is to determine 'who spoke when' in real-world meetings. The method takes surround-view video and single or multi-channel audio as inputs, and generates robust diarisation outputs. To achieve this, we propose a novel iterative approach that first enrolls speaker models using audio-visual correspondence, then uses the enrolled models together with the visual information to determine the active speaker. We show strong quantitative and qualitative performance on a dataset of real-world meetings. The method is also evaluated on the public AMI meeting corpus, on which we demonstrate results that exceed all comparable methods. We also show that beamforming can be used together with the video to further improve the performance when multi-channel audio is available.

Index Terms: speaker diarisation, audio-visual, multi-modal

1. Introduction

Over recent years, there has been a growing demand to record and search human communications in a machine-readable format. There have been significant advances in automatic speech recognition due to the availability of large-scale datasets [1, 2] and the accessibility of deep learning frameworks [3, 4, 5], but to give the transcript more meaning beyond just a sequence of words, the information on 'who spoke when' is crucial.

Speaker diarisation, the task of breaking up multi-speaker audio into single-speaker segments, has been an active field of study over the years. Speaker diarisation can mostly be addressed as a single-modality problem where only the audio is used, but there are also a number of papers that have used additional modalities such as video. Previous works on speaker diarisation, both audio and audio-visual, can be divided into two strands.

The first is based on speaker modelling (SM), which uses the assumption that each individual has different voice characteristics. Traditionally, speaker models are constructed with Gaussian mixture models (GMMs) and i-vectors [6, 7, 8], but more recently deep learning has proven effective for speaker modelling [9, 10, 11, 12, 13]. In many systems, the models are pre-trained for the target speakers [14, 15] and are not applicable to unknown participants. Other algorithms are capable of adapting to unseen speakers by using generic models and clustering [16, 17]. There are also a number of works in the audio-visual domain that are based on feature clustering [18, 19].

The second strand uses a technique referred to as sound source localisation (SSL), which is claimed to demonstrate better performance than SM-based approaches according to a recent study [20], particularly with powerful beamforming methods such as SRP-PHAT [21]. However, SSL-based methods are only effective when the locations of speakers are either fixed or known. Therefore SSL has been used as part of audio-visual methods, where the locations of the identities can be tracked using the visual information [22]. This approach is dependent on the ability to effectively track the participants. A recent paper [23] combines SSL with a visual analysis module that measures motion and lip movements, which is relevant to our work.
A number of works have combined SM and SSL approaches using independent models for each type of observation, then fused this information with a probabilistic framework based on the Viterbi algorithm [22] or with Bayesian filtering [20].

In this paper, we present an audio-visual speaker diarisation system based on self-enrollment of speaker models that is able to handle movements and occlusions. We first use a state-of-the-art deep audio-visual synchronisation network to detect speaking segments from each participant when the mouth motion is clearly visible. This information is used to enroll speaker models for each participant, which can then determine who is speaking even when the speaker is occluded. By generating speaker models for each participant, we are able to reformulate the task from an unsupervised clustering problem into a supervised classification problem, where the probability of a speech segment belonging to every participant can be estimated. In contrast to previous works that compute likelihoods for each type of observation before the multi-modal fusion, the audio-visual synchronisation is used in the self-enrollment process. Finally, when a multi-channel microphone is available, beamforming is employed to estimate the location of the sound source, and the spatial cues from both modalities are used to further improve the system's performance. The effectiveness of the method is demonstrated on an internal dataset of real-world meetings and on the public AMI corpus.

This paper is organised as follows. In Section 2.1, we first describe the audio-only baseline system based on state-of-the-art methods for speech enhancement, activity detection and speaker diarisation. Section 2.2 introduces the proposed audio-visual system. Finally, Section 3 describes the datasets and the experiments in which we demonstrate the effectiveness of our method on the public AMI dataset.

2. System description

2.1. Audio-only baseline

The baseline system provided for the second DIHARD challenge is used as our audio-only baseline. The system takes key components from the top-scoring systems in the first DIHARD challenge and shows state-of-the-art performance on audio-only diarisation.

2.1.1. Speech enhancement

The speech enhancement is based on the system used by USTC-iFLYTEK in their submission to the first DIHARD challenge [24]. The system uses a Long Short-Term Memory (LSTM) based speech denoising model trained on simulated training data. It has demonstrated significant improvements in deep learning-based single-channel speech enhancement over the state of the art, and the authors have shown its effectiveness for diarisation with a second-place result in the first DIHARD challenge.

2.1.2. Speech activity detection

The speech activity detection baseline uses WebRTC [25] operating on enhanced audio processed by the speech enhancement baseline.

2.1.3. Speaker embeddings and diarisation

The diarisation system is based on the JHU Sys4 used in their winning entry to DIHARD I, with the exception that it omits the Variational Bayes refinement step. Speech is segmented into 1.5-second windows with 0.75-second hops, 24 MFCCs are extracted every 10 ms, and a 256-dimensional x-vector is extracted for each segment. The extracted vectors are scored with PLDA (trained on segments labelled for only one speaker) and clustered with AHC (average score combination at merges).

The x-vector extractor and PLDA parameters were trained on the VoxCeleb [26] and VoxCeleb2 [27] datasets with data augmentation (additive noise), while the whitening transformation was learned from the DIHARD I development set [17]. We use the pre-trained model released by the organisers of the DIHARD challenge.

The system is not designed to handle overlapped speech, and additional speakers are counted as missed speech in evaluation.
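To make the sliding-window back-end concrete, the following is a minimal sketch of the segmentation and clustering steps. The `extract_embedding` function is a hypothetical stand-in for the x-vector network, and cosine distance replaces the PLDA scoring used by the baseline; the window, hop and threshold values are illustrative rather than the organisers' implementation.

```python
# Minimal sketch of the baseline diarisation back-end: slide a 1.5 s window
# with a 0.75 s hop, embed each window, then cluster the embeddings with
# average-linkage AHC. `extract_embedding` is a hypothetical stand-in for
# the x-vector network; cosine distance replaces PLDA scoring for brevity.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def sliding_windows(num_samples, sr=16000, win=1.5, hop=0.75):
    """Yield (start, end) sample indices for 1.5 s windows with 0.75 s hops."""
    step, length = int(hop * sr), int(win * sr)
    for start in range(0, max(num_samples - length, 1), step):
        yield start, start + length

def extract_embedding(segment):
    # Placeholder for the 256-dimensional x-vector extractor of Sec. 2.1.3.
    rng = np.random.default_rng(abs(hash(segment.tobytes())) % (2 ** 32))
    return rng.standard_normal(256)

def diarise_baseline(audio, sr=16000, distance_threshold=0.7):
    embs = np.stack([extract_embedding(audio[s:e])
                     for s, e in sliding_windows(len(audio), sr)])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    # Average-linkage AHC over cosine distances; the threshold plays the role
    # of the AHC stopping criterion tuned on the validation set.
    links = linkage(embs, method="average", metric="cosine")
    return fcluster(links, t=distance_threshold, criterion="distance")

if __name__ == "__main__":
    labels = diarise_baseline(np.random.randn(16000 * 30))
    print(labels)  # one cluster (speaker) label per 1.5 s window
```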
2.2. Multi-modal diarisation

The audio processing part of the audio-visual system shares most of the baseline methods described above: the speech enhancement and speech activity detection modules are identical to those in the baseline system, and for experiments on the AMI corpus, we also use the pre-trained x-vector model used by the JHU system to extract speaker embeddings.

Three modes of information are used to determine the current speaker in the video. The pipeline is summarised in Figure 1 and described in the following paragraphs.

[Figure 1: Pipeline overview. Pre-processing: face detection, face tracking and face recognition against profile images. Phase 1: audio-visual correlation is used to enroll a speaker model for each identity from its above-threshold, top-N most confident segments. Phase 2: audio-visual correlation, speaker verification against the enrolled models and sound source localisation are combined to determine the current speaker.]

2.2.1. Audio-to-video correlation

Cross-modal embeddings of the audio and the mouth motion are used to represent the respective signals. The strategy to train this joint embedding is described in [28], but we give a brief overview here.

The network consists of two streams: an audio stream that encodes Mel-frequency cepstral coefficient (MFCC) inputs into 512-dimensional vectors, and a video stream that encodes cropped face images, also into 512-dimensional vectors. The network is trained as a multi-way matching task between one video clip and N audio clips. Euclidean distances between the audio and video features are computed, resulting in N distances. The network is trained with a cross-entropy loss on the inverse of this distance after passing through a softmax layer, so that the similarity between matching pairs is greater than between non-matching pairs.

The cosine distance between the two embeddings is used to measure the correspondence between the two inputs. We therefore expect a small distance between the features if the face image corresponds to the current speaker and is in sync, and a large distance otherwise. Since the video is from a single continuous source, we assume that the AV offset is fixed throughout the session. The embedding distance is smoothed over time using a median filter in order to eliminate outliers.
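As a rough illustration of how this per-frame correspondence score can be computed, the sketch below assumes hypothetical per-frame audio and face features (512-dimensional, as above), scores each frame with cosine distance, and smooths the result with a median filter; the filter length is an illustrative choice, not a value from the paper.

```python
# Minimal sketch of the audio-to-video correlation score (Sec. 2.2.1):
# per-frame cosine distance between cross-modal embeddings, median-filtered
# over time. The feature arrays stand in for the outputs of the network in [28].
import numpy as np
from scipy.signal import medfilt

def cosine_distance(a, v):
    """Cosine distance between corresponding rows of a (T x 512) and v (T x 512)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return 1.0 - np.sum(a * v, axis=1)

def av_correlation(audio_feats, face_feats, smooth_frames=25):
    """Lower distance = this face is more likely the in-sync current speaker."""
    dist = cosine_distance(audio_feats, face_feats)
    # Median filtering removes outlier frames (kernel size must be odd).
    return medfilt(dist, kernel_size=smooth_frames | 1)

if __name__ == "__main__":
    T = 200  # frames at 25 fps, i.e. 8 seconds
    audio_feats = np.random.randn(T, 512)  # stand-in for the audio stream output
    face_feats = np.random.randn(T, 512)   # stand-in for the video stream output
    print(av_correlation(audio_feats, face_feats)[:10])
```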
2.2.2. Speaker verification

We develop speaker models for each individual (identified in Sec. 2.4.2) so that the active speaker can be determined even when audio-visual synchronisation cannot be established due to occlusion.

The audio-visual pipeline (Sec. 2.2.1) is run over the whole video in advance, in order to determine the N most confident speaking segments (each of 1.5 seconds) for each identity. In our case we use N=10, and if there are fewer than N confident segments above an AV correlation threshold, we only use the segments whose correlation is above the threshold. These are used to enroll the speaker models.

For the experiments on the AMI dataset, we use the x-vector network (described in Sec. 2.1.3) to extract speaker embeddings, so that the results can be compared like-for-like to the baseline. For the experiments on the internal meeting dataset, we use a deeper ResNet-50 model [29], also trained on the same data as the baseline. The deeper model is used here since its features generalise better to this more challenging dataset than those of the shallower x-vector model.

At test time, speaker embeddings are extracted by computing features over a 1.5-second window, moving 0.75 seconds at a time, in line with the baseline system. By comparing the embeddings at each timestep to the enrolled speaker models, the likelihood of the speech segment belonging to any individual can be estimated. Even without any visual information at inference time, this now becomes a supervised classification problem, which is typically more robust than unsupervised clustering.
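A minimal sketch of this self-enrollment and scoring step is given below, under simple assumptions: each speaker model is taken to be the mean of the embeddings of that identity's most confident segments, and test windows are scored by cosine similarity against every model. The function names, the use of the mean as the speaker model, and the convention that a higher AV score means more confident are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of speaker self-enrollment and scoring (Sec. 2.2.2).
# Each speaker model is the mean embedding of that identity's top-N most
# AV-confident segments; test windows are then scored against every model.
import numpy as np

def enroll_speaker_model(segment_embeddings, av_scores, top_n=10, threshold=0.5):
    """Average the embeddings of the top-N segments above the AV threshold."""
    segment_embeddings = np.asarray(segment_embeddings)
    av_scores = np.asarray(av_scores)
    keep = np.where(av_scores >= threshold)[0]
    keep = keep[np.argsort(av_scores[keep])[::-1][:top_n]]
    model = segment_embeddings[keep].mean(axis=0)
    return model / np.linalg.norm(model)

def score_against_models(test_embedding, speaker_models):
    """Cosine similarity of one test window against every enrolled model."""
    e = test_embedding / np.linalg.norm(test_embedding)
    return {name: float(np.dot(e, m)) for name, m in speaker_models.items()}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    models = {
        "spk_A": enroll_speaker_model(rng.standard_normal((30, 256)),
                                      rng.uniform(0, 1, 30)),
        "spk_B": enroll_speaker_model(rng.standard_normal((30, 256)),
                                      rng.uniform(0, 1, 30)),
    }
    scores = score_against_models(rng.standard_normal(256), models)
    print(max(scores, key=scores.get), scores)
```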
2.2.3. Sound source localisation

Besides the speaker embeddings, the direction of the sound source can provide useful cues on who is speaking.

Recordings from the 4-channel microphone on the GoPro camera can be converted to Ambisonics B-format using the GoPro Fusion Studio software. By solving the B-format representations for azimuth θ and elevation φ, the direction of the audio source can be estimated for each audio sample. The direction for every video frame is determined by generating a histogram of all θ values over a ±0.5-second period with a bin size of 10°.

For the AMI videos, the Time Delay of Arrival (TDOA) information is calculated using the BeamformIt [30] package. As with the internal dataset, the direction of arrival is also computed with a histogram of θ values over a ±0.5-second period. However, only 4 bins of 90° are used, since the video is split over 4 cameras and the exact geometry between them is unknown.

The likelihood of the audio belonging to any person at a given time correlates with the angle between the estimated audio source and the face detection in the video for the identity in question.

2.3. Multi-modal fusion

Each of the three modalities (AV correlation, speaker models, direction of audio) gives confidence scores for each speaker and timestep. These scores are combined into a single confidence score for every speaker and timestep using the simple weighted fusion stated below, where C_sm is the confidence score from the speaker model, C_avc is the score from the AV correspondence, and φ and θ are the direction of the face and the estimated DoA of the audio, respectively. When the identity is not visible on the camera, the second and third terms are set to zero.

C_overall = C_sm + α · C_avc + β · cos(φ − θ)    (1)
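The fusion rule of Eq. (1) can be written directly in code. The sketch below follows the equation and the visibility rule described above, while the weight values and the per-speaker inputs are illustrative assumptions.

```python
# Minimal sketch of the weighted multi-modal fusion of Eq. (1):
# C_overall = C_sm + alpha * C_avc + beta * cos(phi - theta).
# When a speaker is not visible, the AV and angle terms are set to zero.
import math

def fuse_scores(c_sm, c_avc, face_angle_deg, doa_deg, visible,
                alpha=1.0, beta=1.0):
    """Combine speaker-model, AV-correspondence and DoA cues for one speaker."""
    if not visible:
        return c_sm
    angle_term = math.cos(math.radians(face_angle_deg - doa_deg))
    return c_sm + alpha * c_avc + beta * angle_term

if __name__ == "__main__":
    # Illustrative per-speaker cues at one timestep (values are made up).
    speakers = {
        "spk_A": dict(c_sm=0.62, c_avc=0.40, face_angle_deg=30.0, visible=True),
        "spk_B": dict(c_sm=0.55, c_avc=0.10, face_angle_deg=200.0, visible=True),
        "spk_C": dict(c_sm=0.48, c_avc=0.00, face_angle_deg=0.0, visible=False),
    }
    doa_deg = 35.0  # estimated direction of arrival from SSL
    fused = {name: fuse_scores(doa_deg=doa_deg, **cues)
             for name, cues in speakers.items()}
    print(max(fused, key=fused.get), fused)
```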
2.4. Implementation details

2.4.1. Face detection and tracking

A CNN face detector based on the Single Shot MultiBox Detector (SSD) [31] is used to detect face appearances on every frame of the video. This detector allows faces to be tracked across a wide range of poses and lighting conditions. A position-based face tracker is used to group individual face detections into face tracks.

2.4.2. Face recognition

The method requires face images for each participant so that they can be identified and tracked regardless of their position in the room. These can come from user input or from profile images. The face images for all participants are supplied to the VGGFace2 [32] network, and their embeddings are stored.

For each face track detected (Sec. 2.4.1), face embeddings are extracted using the VGGFace2 network and compared to each of the N stored embeddings, so that the track can be classified into one of N identities. We apply the constraint that co-occurring face tracks at any point in time cannot be of the same identity.

3. Experiments

The proposed method is evaluated on two independent datasets: our internal dataset of meetings recorded with a 360° camera, and the publicly available AMI meeting corpus. Each is described in the following paragraphs.

3.1. Internal meeting dataset

The internal meeting dataset consists of audio-visual recordings of regular meetings in which no particular instructions are given to the participants with regard to the recording of the video. The meetings form part of daily discussions from the workspace of the authors and are not set up in any way with the diarisation task in mind. A large proportion of the dataset consists of very short utterances with frequent speaker changes, providing extremely challenging conditions.

[Figure 2: Still image from the internal meeting dataset.]
[Figure 3: Still images from the AMI corpus.]

The video is recorded using a GoPro Fusion camera, which captures 360° videos of the meeting with two fish-eye lenses. The videos are stitched together into a single surround-view video of 5228x2624 resolution at 25 frames per second. The audio is recorded using a 4-channel microphone at 48 kHz. A still image from the dataset is shown in Figure 2.

The dataset contains an approximately 3-hour validation set and a carefully annotated 40-minute test set. The test video contains 9 speakers. In the case of overlapped speech, we only annotated the ID of the main (loudest) speaker. The embedding extractor and the AV synchronisation network are trained on external datasets, and the validation set is only used for tuning the AHC threshold in the baseline system and the fusion weights in the proposed system.

3.2. AMI corpus

The AMI corpus consists of 100 hours of video recorded across a number of locations and has been used by many previous works on audio-only and audio-visual diarisation. Of the 100 hours of video, we evaluate on meetings in the ES (Edinburgh) and IS (Idiap) categories, which contain approximately 30 and 17 hours of video, respectively. On the IS videos, IS1002a, IS1003b, IS1005d and IS1007d were not used in the experiments due to partially missing data. The image quality is relatively low, with a video resolution of 288x352 pixels.

The audio is recorded with an 8-element circular equi-spaced microphone array with a diameter of 20 cm. However, we only use one microphone from the array in most of our experiments. The video is recorded with 4 cameras providing close-up views of each of the meeting's participants, and unlike the internal dataset (Sec. 3.1), the images are not stitched together.

| Method | Dataset | Input | System VAD: MS / FA / SPKE / DER | Reference VAD: MS / FA / SPKE / DER |
|---|---|---|---|---|
| JHU Baseline [17] | ES All | 1ch | 10.5 / 6.6 / 12.8 / 30.0 | 5.6 / 0.0 / 12.2 / 17.8 |
| Ours (SM) | ES All | 1ch+V | 10.5 / 6.6 / 6.7 / 23.8 | 5.6 / 0.0 / 7.9 / 13.5 |
| Ours (SM+AVC) | ES All | 1ch+V | 10.5 / 6.6 / 4.0 / 21.1 | 5.6 / 0.0 / 4.8 / 10.4 |
| Ours (SM+AVC+SSL) | ES All | 8ch+V | 10.5 / 6.6 / 2.8 / 19.9 | 5.6 / 0.0 / 3.6 / 9.2 |
| Cabanas et al. [23] | ES WB | 8ch+V | - / - / - / 27.2 | - / - / - / - |
| Ours (SM+AVC) | ES WB | 1ch+V | 11.4 / 7.1 / 4.9 / 23.3 | 6.1 / 0.0 / 5.9 / 12.0 |
| Ours (SM+AVC+SSL) | ES WB | 8ch+V | 11.4 / 7.1 / 3.8 / 22.3 | 6.1 / 0.0 / 4.9 / 10.9 |
| Cabanas et al. [23] | ES NWB | 8ch+V | - / - / - / 20.6 | - / - / - / - |
| Ours (SM+AVC) | ES NWB | 1ch+V | 9.5 / 5.7 / 2.7 / 17.8 | 5.1 / 0.0 / 3.3 / 8.4 |
| Ours (SM+AVC+SSL) | ES NWB | 8ch+V | 9.5 / 5.7 / 1.4 / 16.6 | 5.1 / 0.0 / 1.9 / 7.0 |
| JHU Baseline [17] | IS All | 1ch | 11.2 / 4.0 / 10.2 / 25.4 | 6.5 / 0.0 / 11.2 / 17.7 |
| Ours (SM) | IS All | 1ch+V | 11.2 / 4.0 / 7.6 / 22.9 | 6.5 / 0.0 / 8.8 / 15.3 |
| Ours (SM+AVC) | IS All | 1ch+V | 11.2 / 4.0 / 6.2 / 21.3 | 6.5 / 0.0 / 7.1 / 13.6 |
| Ours (SM+AVC+SSL) | IS All | 8ch+V | 11.2 / 4.0 / 4.9 / 20.0 | 6.5 / 0.0 / 5.8 / 12.3 |
| Cabanas et al. [23] | IS WB | 8ch+V | - / - / - / 32.3 | - / - / - / - |
| Ours (SM+AVC) | IS WB | 1ch+V | 13.3 / 5.1 / 7.7 / 26.1 | 7.9 / 0.0 / 8.9 / 16.9 |
| Ours (SM+AVC+SSL) | IS WB | 8ch+V | 13.3 / 5.1 / 6.5 / 24.8 | 7.9 / 0.0 / 7.8 / 15.7 |
| Cabanas et al. [23] | IS NWB | 8ch+V | - / - / - / 21.7 | - / - / - / - |
| Ours (SM+AVC) | IS NWB | 1ch+V | 9.3 / 2.8 / 4.8 / 16.8 | 5.3 / 0.0 / 5.4 / 10.6 |
| Ours (SM+AVC+SSL) | IS NWB | 8ch+V | 9.3 / 2.8 / 3.4 / 15.5 | 5.3 / 0.0 / 4.0 / 9.3 |
| JHU Baseline [17] | Internal | 1ch | 1.8 / 4.5 / 72.2 / 78.6 | 0.0 / 0.0 / 73.3 / 73.3 |
| Ours (SM) | Internal | 1ch+V | 1.8 / 4.5 / 24.8 / 31.1 | 0.0 / 0.0 / 25.6 / 25.6 |
| Ours (SM+AVC) | Internal | 1ch+V | 1.8 / 4.5 / 18.7 / 25.0 | 0.0 / 0.0 / 19.4 / 19.4 |
| Ours (SM+AVC+SSL) | Internal | 8ch+V | 1.8 / 4.5 / 13.1 / 19.4 | 0.0 / 0.0 / 13.7 / 13.7 |

Table 1: Diarisation results (lower is better). The results are on the AMI dataset except for the last four rows. WB: whiteboard; NWB: no whiteboard; Xch+V: X-channel audio + video; SM: speaker modelling; AVC: audio-visual correspondence; SSL: sound source localisation; MS: missed speech; FA: false alarm; SPKE: speaker error; DER: diarisation error rate. The ES videos are used as the validation set for tuning the thresholds.

3.3. Evaluation metric

We use the Diarisation Error Rate (DER) as our performance metric. The DER can be decomposed into three components: missed speech (MS, speaker in reference but not in hypothesis), false alarm (FA, speaker in hypothesis but not in reference) and speaker error (SPKE, speech assigned to the wrong speaker). The tool used for evaluating the system is the one developed for the RT Diarization evaluations by NIST [33], and includes an acceptance margin of 250 ms to compensate for human errors in the reference annotation.
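For illustration, a simplified frame-level version of this metric could be computed as below; unlike the NIST tool, this sketch ignores the 250 ms collar and the optimal speaker mapping, so it only approximates the reported DER.

```python
# Rough frame-level sketch of DER = (MS + FA + SPKE) / total reference speech.
# Real scoring (the NIST RT tool) also applies a 250 ms collar and an optimal
# mapping between reference and hypothesis speakers; both are omitted here.
import numpy as np

def der(reference, hypothesis, silence=0):
    """reference/hypothesis: per-frame speaker IDs; `silence` marks non-speech."""
    reference, hypothesis = np.asarray(reference), np.asarray(hypothesis)
    ref_speech = reference != silence
    hyp_speech = hypothesis != silence
    missed = np.sum(ref_speech & ~hyp_speech)                                  # MS
    false_alarm = np.sum(~ref_speech & hyp_speech)                             # FA
    speaker_err = np.sum(ref_speech & hyp_speech & (reference != hypothesis))  # SPKE
    total_ref = max(np.sum(ref_speech), 1)
    return (missed + false_alarm + speaker_err) / total_ref

if __name__ == "__main__":
    ref = [1, 1, 1, 0, 2, 2, 2, 0, 1, 1]
    hyp = [1, 1, 0, 0, 2, 1, 2, 2, 1, 1]
    print(f"DER = {der(ref, hyp):.2%}")  # 1 MS + 1 FA + 1 SPKE over 8 ref frames
```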
3.4. Results

Results on the AMI corpus [34] are given in Table 1. The numbers for meetings where the whiteboard is used are provided separately, so that the results can be compared to [17].

Missed speech and false alarm rates are the same across different models for each dataset, since we use the same VAD system in all of our experiments. Therefore the speaker error rate (SPKE) is the only metric affected by the diarisation system.

Our speaker-model-only system (SM) uses the visual information only to find out when to enroll the speaker models, and during inference uses only the audio. Since the audio processing pipeline and the embedding extractor are common to our system and the JHU-based baseline, the performance gain arises from changing a clustering problem into a classification problem. This alone results in 48% and 26% relative improvement in speaker error on the ES and IS sets, respectively.

It is also clear from the results that the addition of the AV correspondence (AVC) and sound source localisation (SSL) at inference time both boost the performance. The contributions of these modalities to the overall relative performance are 20-40% and 19-39% respectively, depending on the test set.

Note that our results exceed the recent audio-visual method of [23] across all test conditions by a significant margin, whilst using the same input modalities. [35] also reports competitive results on a subset of the IS videos (SPKE of 7.3%, DER of 19.5% using 4 cameras and 8 microphones); however, the results cannot be compared directly to our work since some of the test videos are no longer available at the time of writing this paper.

The speaker error rates are markedly worse on the internal meeting dataset, presumably due to the more challenging nature of the dataset and the larger number of speakers. From the results in Table 1, it can be seen that the baseline system does not generalise to this dataset, but the proposed multi-modal systems perform relatively well on this 'in the wild' data.

4. Conclusion

In this paper, we have introduced a multi-modal system which takes advantage of audio-visual correspondence to enroll speaker models. We have shown that speaker modelling with audio-visual enrollment has significant advantages over the clustering methods typically used for diarisation. Areas for further research include learnable methods for multi-modal fusion, improvements to the speech activity detection (SAD) modules, and the combination of audio-visual diarisation and audio-visual speech separation for meeting transcription and for handling overlapped speech.

Acknowledgment. We would like to thank Chiheon Ham, Han-Gyu Kim, Jaesung Huh, Minjae Lee, Minsub Yim, Soyeon Choe and Soonik Kim for helpful comments and discussion.

5. References

[1] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in Proc. ICASSP. IEEE, 2015, pp. 5206-5210.
[2] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, "The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," arXiv preprint, 2018.
[3] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint, 2016.
[4] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017.
[5] A. Vedaldi and K. Lenc, "MatConvNet: Convolutional neural networks for MATLAB," in Proc. ACMM, 2015.
[6] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[7] S. Cumani, O. Plchot, and P. Laface, "Probabilistic linear discriminant analysis of i-vector posterior distributions," in Proc. ICASSP. IEEE, 2013, pp. 7644-7648.
[8] P. Matějka, O. Glembek, F. Castaldo, M. J. Alam, O. Plchot, P. Kenny, L. Burget, and J. Černocký, "Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification," in Proc. ICASSP. IEEE, 2011, pp. 4828-4831.
[9] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in Proc. ICASSP. IEEE, 2014, pp. 4052-4056.
[10] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in Proc. ICASSP. IEEE, 2014, pp. 1695-1699.
[11] S. H. Ghalehjegh and R. C. Rose, "Deep bottleneck features for i-vector based text-independent speaker verification," in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 555-560.
[12] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," Proc. Interspeech, pp. 999-1003, 2017.
[13] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, Calgary, 2018.
[14] H. Hung and G. Friedland, "Towards audio-visual on-line diarization of participants in group meetings," in Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2), 2008.
[15] G. Biagetti, P. Crippa, L. Falaschetti, S. Orcioni, and C. Turchetti, "Robust speaker identification in a meeting with short audio segments," in Intelligent Decision Technologies 2016. Springer, 2016, pp. 465-477.
[16] G. Friedland, A. Janin, D. Imseng, X. Anguera, L. Gottlieb, M. Huijbregts, M. T. Knox, and O. Vinyals, "The ICSI RT-09 speaker diarization system," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 371-381, 2012.
[17] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe et al., "Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge," in Proc. Interspeech, 2018, pp. 2808-2812.
[18] G. Friedland, H. Hung, and C. Yeo, "Multi-modal speaker diarization of real-world meetings using compressed-domain video features," in Proc. ICASSP. IEEE, 2009, pp. 4069-4072.
[19] N. Sarafianos, T. Giannakopoulos, and S. Petridis, "Audio-visual speaker diarization using Fisher linear semi-discriminant analysis," Multimedia Tools and Applications, vol. 75, no. 1, pp. 115-130, 2016.
[20] V. Rozgic, K. J. Han, P. G. Georgiou, and S. Narayanan, "Multimodal speaker segmentation and identification in presence of overlapped speech segments," Journal of Multimedia, vol. 5, no. 4, p. 322, 2010.
[21] J. H. DiBiase, A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays. Brown University, Providence, RI, 2000.
[22] J. Schmalenstroeer, M. Kelling, V. Leutnant, and R. Haeb-Umbach, "Fusing audio and video information for online speaker diarization," in Proc. Interspeech, 2009.
[23] P. Cabañas-Molero, M. Lucena, J. Fuertes, P. Vera-Candeas, and N. Ruiz-Reyes, "Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis," Multimedia Tools and Applications, vol. 77, no. 20, pp. 27685-27707, 2018.
[24] L. Sun, J. Du, C. Jiang, X. Zhang, S. He, B. Yin, and C.-H. Lee, "Speaker diarization with enhancing speech for the first DIHARD challenge," Proc. Interspeech, pp. 2793-2797, 2018.
[25] A. B. Johnston and D. C. Burnett, WebRTC: APIs and RTCWEB protocols of the HTML5 real-time web. Digital Codex LLC, 2012.
[26] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in Proc. Interspeech, 2017.
[27] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Interspeech, 2018.
[28] S.-W. Chung, J. S. Chung, and H.-G. Kang, "Perfect match: Improved cross-modal embeddings for audio-visual synchronisation," in Proc. ICASSP, 2019.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016.
[30] X. Anguera, C. Wooters, and J. Hernando, "Acoustic beamforming for speaker diarization of meetings," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011-2021, September 2007.
[31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. ECCV. Springer, 2016, pp. 21-37.
[32] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, "VGGFace2: a dataset for recognising faces across pose and age," in Proc. Int. Conf. Autom. Face and Gesture Recog., 2018.
[33] D. Istrate, C. Fredouille, S. Meignier, L. Besacier, and J. F. Bonastre, "NIST RT05s evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings," in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 428-439.
[34] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal et al., "The AMI meeting corpus: A pre-announcement," in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2005, pp. 28-39.
[35] G. Friedland, C. Yeo, and H. Hung, "Dialocalization: Acoustic speaker diarization and visual localization as joint optimization problem," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 6, no. 4, p. 27, 2010.
