An Attention-Based Speaker Naming Method for Online Adaptation in Non-Fixed Scenarios



Jungwoo Pyo¹, Joohyun Lee¹, Youngjune Park¹, Tien-Cuong Bui¹, Sang Kyun Cha¹
¹Seoul National University, Seoul, Korea
{wjddn1801, wngusdlekd, dudwns930, cuongbt91, chask}@snu.ac.kr

Abstract

A speaker naming task, which finds and identifies the active speaker in a given movie or drama scene, is crucial for high-level video analysis applications such as automatic subtitle labeling and video summarization. Modern approaches have usually exploited biometric features with a gradient-based method instead of rule-based algorithms. In certain situations, however, a naive gradient-based method does not work efficiently. For example, when new characters are added to the target identification list, the neural network must be retrained frequently to identify the new people, which delays model preparation. In this paper, we present an attention-based method that reduces model setup time by incorporating newly added data through online adaptation, without a gradient update process. We comparatively analyzed the attention-based method and existing gradient-based methods with three evaluation metrics (accuracy, memory usage, and setup time) under various controlled speaker naming settings. We also applied existing speaker naming models and the attention-based model to real video to show that our approach achieves accuracy comparable to existing state-of-the-art models, and even higher accuracy in some cases.

Introduction

Biometric recognition plays an important role in advanced authentication systems. It identifies individuals based on physical or behavioral characteristics. The speaker naming task, which is to identify visible speaking characters in multimedia videos, consists of multiple types of biometric recognition.
Most speaker naming methods distinguish an active speaker based on biometric features such as face images or voice. This task has proven essential for high-level video analysis problems such as summarization (Takenaka et al. 2012), interaction analysis (Liu, Jiang, and Huang 2008), and semantic indexing (Zhang et al. 2013). In particular, identifying active speaking characters for automatic subtitling can help deaf audiences enjoy videos without difficulty in understanding the context.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Most existing speaker naming models focus mainly on boosting accuracy when finding the active speaker within a fixed original character list. Gradient-based methods are considered among the right solutions for higher accuracy, and such methods have been proposed in various ways using multiple modalities for speaker naming. (Hu et al. 2015) proposes a deep multimodal model based on a CNN architecture that extracts facial and acoustic features from videos, then combines them through a fusion function. Correspondingly, a multimodal Long Short-Term Memory (LSTM) architecture (Ren et al. 2016) merges visual and auditory modalities from the beginning of each input sequence. (Bredin and Gelly 2016) improves the performance of talking-face detection by capturing lip motion.

However, a video does not always present the expected or fixed situation. In the real world, several uncertain situations make identifying the active speaker difficult, such as the appearance of new characters or misinterpretation due to a lack of labeled training data. Most gradient-based identification approaches cannot immediately adapt to a change in the predicted character list or incorporate newly added data into the model.
Traditionally, these methods must predefine the set of targeted characters before the training period. In particular, an existing model has to be retrained from scratch with a new set of targeted characters, consisting of both the original classes and the new ones, which makes rebuilding the model very time-consuming. Transfer learning (Yosinski et al. 2014) and domain adaptation (Ganin and Lempitsky 2015) have proven efficient for faster adaptation of a neural network through initialization based on the original data. Nonetheless, these methods still require considerable training time to adapt the newly added data to the original model. Besides, real-world datasets rarely contain a sufficient amount of labeled data, since labeling is costly. The availability of labeled data poses a major practical issue for many gradient-based models.

To overcome these problems, we apply an attention module with few-shot learning (Fei-Fei, Fergus, and Perona 2006) to make our identification model flexible enough to accommodate changes at run-time. The attention module, which is based on the scaled dot-product attention structure (Vaswani et al. 2017), represents the similarities between prior knowledge embeddings and the features extracted from the target video.

Figure 1: (a) The speaker naming task covers two situations: i) finding the face that matches the corresponding voice if the speaker appears in the scene, and ii) picking out all distractors if the speaker is out of the scene. (b) Visualization of predicting the ID of a face-voice pair with the few-shot learning based attention module. The predicted ID of the target can be inferred from a linear combination of the cosine similarities between every prior knowledge embedding and the target embedding. It also shows how simply data for a newly added ID can be inserted into the attention module.
The prior knowledge embeddings are the given data, consisting of facial and vocal embeddings of the predicted classes from the training dataset. Few-shot learning is then used to deal with the scarcity of labeled data and imbalanced class distributions. The attention mechanism and few-shot learning combine effectively in our model since both are linear and straightforward. The essential component of the few-shot learning method derives feature embeddings based on a distance function. The attention mechanism consists of a linear combination with scaling and a softmax operation over these feature embeddings, as shown in Figure 1(b). This combination makes the model weigh every single prior knowledge embedding carefully. Therefore, our method works well even with a small amount of data or a highly imbalanced class distribution. More importantly, our model uses pretrained neural networks only to extract embeddings, so unlike other gradient-based models it involves no backpropagation. The setup time is thus significantly decreased: new information is simply added to the attention module at run-time.

However, our proposed method is not the optimal solution for every situation. When character changes are infrequent, or when there are many IDs to identify, a deep-learning approach that guarantees robust performance may be more suitable even if the model setup takes a long time. Consequently, we compared the attention-based method with gradient-based methods under various speaker naming conditions by adjusting two variables: the number of target IDs to be identified and the number of shots per character. Furthermore, we compared our proposed model with existing speaker naming models on real video.
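The intuition of Figure 1(b) can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the function name, shapes, and the assumption that embeddings are already L2-normalized are ours.

```python
import numpy as np

def predict_id(target, support, labels):
    """Few-shot ID prediction as in Figure 1(b): a softmax over the cosine
    similarities between the target embedding and every prior-knowledge
    (support) embedding, then a linear combination of one-hot ID labels.
    Embeddings are assumed L2-normalized, so a dot product is the cosine."""
    sims = support @ target                     # cosine similarities, shape (S,)
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()                    # softmax over the support set
    onehot = np.eye(labels.max() + 1)[labels]   # (S, N) one-hot ID matrix
    return int((weights @ onehot).argmax())     # most probable ID index
```

Note how online adaptation falls out of this formulation: adding a newly appearing character is just appending its embeddings to `support` and its label to `labels`; no gradient step is involved.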
Our contributions are summarized as follows:
• We propose a non-gradient-based method using an attention module with few-shot learning, which efficiently handles both the scarcity of labeled data and imbalanced class distributions.
• Our model significantly reduces setup time by removing the gradient descent process and updating new data to the model online.
• Across various environments, adjusting both the number of target IDs to be identified and the number of shots per character, we conducted comparative analyses on a real-world dataset between our proposed method and existing gradient-based methods using three metrics: accuracy, memory usage, and setup time.
• Our model shows accuracy comparable to state-of-the-art speaker naming models on real video.

Related Work

Speaker Naming

Speaker naming is the task of identifying the speaker in a video source. Recent studies on automatic speaker naming used deep neural networks to obtain each speaker's name from multimodal face-voice sources. (Hu et al. 2015) proposed a convolutional neural network (CNN) based multimodal framework that automatically learns from combined face and acoustic features; they trained an SVM classifier to reject all non-matching face-voice pairs and produce identification results. Likewise, (Ren et al. 2016) improved accuracy by replacing the CNN-based model with a Long Short-Term Memory (LSTM) based model, which gave identification results robust to face distortion. (Liu et al. 2019) used an attention architecture to accommodate face variations.

Feature Extractors for Face and Audio Cues

The primary purpose of feature extractors is to express a particular type of data as distilled numerical embeddings of lower dimension than the original data. Several feature extractors have been studied in each field according to the type of data.
Most feature extractors operate by choosing an appropriate loss function and distance metric, then optimizing them.

Figure 2: Overall architecture of the attention-based model for speaker naming.

Face. Various types of loss functions have been tried for facial feature extraction. (Sun et al. 2014; Wen et al. 2016) used cross-entropy loss to minimize Euclidean distance. FaceNet (Schroff, Kalenichenko, and Philbin 2015) introduced a triplet loss based on Euclidean distance to train the face feature extractor; it also utilized MTCNN (Zhang et al. 2016) to extract aligned, cropped face images from raw image datasets. SphereFace (Liu et al. 2017), CosFace (Wang et al. 2018), and ArcFace (Deng et al. 2019) used angular losses defined on cosine similarity.

Audio. Feature extraction for audio data has also been studied in various directions, with useful methods such as MFCC (Muda, Begam, and Elamvazuthi 2010) and CNNs (Hershey et al. 2017). Recently, (Xie et al. 2019) suggested a new model using a "thinResNet" trunk architecture and a dictionary-based NetVLAD layer. This method successfully performed speaker identification on audio data with varying lengths and mixtures of unrelated signals.

Attention Mechanism

The attention mechanism was first proposed in (Bahdanau, Cho, and Bengio 2014) for neural machine translation (NMT). It looks up all of the input elements (e.g., sequential input such as frames in a video or words in a sentence) at every decoding step and calculates an attention map, a matrix that reflects the relevance of the present input to previous input elements. The attention map is a probability matrix indicating which source word each target word is aligned to, or translated from. Each element of the attention map is computed as a softmax value expressing the similarity of a source word and a target word.

Some papers have brought the attention mechanism to the speaker naming task. In (Liu et al.
2019), the authors proposed an attention-guided deep audio-face fusion approach to detect the active speaker. Like ours, their method uses individual network models to convert face and voice sources into embeddings. Before fusing the face and voice embeddings, however, they applied the attention module only to the face embeddings, to consider the relationships among face embeddings. Our work instead applies the attention mechanism to the fusion of face-voice pair embeddings and focuses on the relevance between target embeddings and prior knowledge embeddings.

Methodology

Speaker naming covers all the steps from detecting faces, recognizing the voice, and matching these embeddings, to identifying the current speaker. As shown in Figure 1(a), we regard the speaker naming problem as two cases. The first case is to find the pair embeddings whose face and voice embeddings are both identified as the same ID (a "matched pair"). The second is to pick out the pair embeddings whose face and voice IDs do not match (a "non-matched pair"). We propose a non-gradient-based method using attention networks with few-shot learning to solve this problem. In this section, we formulate our problem precisely and elaborate on our proposed model.

Problem Formulation

We formulate our problem as follows. Let t be the index of a time window, and let I = {i_1, i_2, ..., i_N} denote the set of character IDs. J_t is the number of faces captured in t, and f_j^t is the j-th face embedding cropped in time window t. Likewise, v_t represents the voice embedding in time window t. The maximum probability that a facial embedding in time window t has ID i_k is then:

F_prob(i_k, t) = max_{1 ≤ j ≤ J_t} p(i_k | f_j^t)    (1)

Figure 3: Mechanism of the attention module with few-shot learning.

By multiplying F_prob(i_k, t) by the probability that the predicted ID of the voice embedding in t is i_k, we can infer the ID of the speaker in t as below.
SpkID(t) = argmax_{i_k ∈ I} ( F_prob(i_k, t) · p(i_k | v_t) )    (2)

Based on Equation (2), the speaker naming model is scored as correct if it correctly estimates the ID of a matched pair, or picks out a non-matched pair, in the time window t. We then aggregate SpkID(t) over all time windows to obtain the total accuracy for the target video.

Attention-Based Method for Speaker Naming

The speaker naming problem consists of two parts: finding matched face-voice pairs to predict the current speaker, and picking out the non-matched pairs. Our approach is as follows. First, we capture face images and voice chunks in every fixed-size time window. Then, we convert the face images and voice chunks into embeddings with pre-trained face and audio feature extractors. We concatenate face and voice embeddings to form candidate pair embeddings for each frame. Next, we calculate the attention map from these concatenated embeddings: the attention map applies scaling and a softmax function to the cosine similarity matrix between all characters' prior knowledge embeddings and the extracted target embeddings. We predict the IDs of the target embeddings through this attention mechanism. The proposed method then aggregates the prediction results within each time window to determine the active speaker in the scene. Finally, we measure the model's prediction accuracy by aggregating the results of all time windows. The overall architecture and flow of our method are described in Figure 2.

Feature Extraction. To generate embeddings that capture facial appearance and voice, we use pre-trained feature extractors, which convert raw input sources into numerical vectors of reduced dimension. Our network uses FaceNet as the facial feature extractor and NetVLAD as the voice feature extractor.
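The scoring rule of Equations (1) and (2) can be sketched as follows. This is a minimal numpy sketch under our own naming; the per-ID probability arrays are assumed to have been produced already by some classifier.

```python
import numpy as np

def speaker_id(face_probs, voice_probs):
    """Equations (1)-(2): face_probs[j, k] = p(i_k | f_j^t) for the J_t faces
    in window t; voice_probs[k] = p(i_k | v_t). Take the max over faces per
    ID (Eq. 1), multiply by the voice probability, and return the argmax
    ID index (Eq. 2)."""
    f_prob = face_probs.max(axis=0)          # F_prob(i_k, t), shape (N,)
    return int(np.argmax(f_prob * voice_probs))
```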
The weights of these extractors are fixed both while updating the attention module and during the end-to-end inference phase.

Attention Module with Few-Shot Learning. Our attention module with few-shot learning consists of multiple components. Let Q denote the query matrix of extracted face-voice pair embeddings from the target video. Q contains several matched and non-matched pairs, which we predict within a certain time window; thus Q varies with the time window t. K and V constitute the prior knowledge of our network: K is the matrix of face-voice pair embeddings extracted from the training data, and V is the matrix of one-hot ID vectors corresponding to K. This K, V set serves as the evidence both for deciding whether a pair embedding is a matched pair and for classifying the pair's ID.

The detailed attention mechanism is shown in Figure 3. The intuitive role of the attention module is to consider the correlation between every pair of columns of Q and K. In our case, computing the attention map and the context vectors in the attention module corresponds to computing similarities and the matrix of predicted IDs, respectively. As a distance metric we use cosine distance; because the embeddings in Q and K are unit vectors, cosine similarity reduces to an inner product. Before the matrix multiplication of Q and K, we transpose Q to match dimensions. After computing Q^T K, our method performs a few additional operations: first, multiply by a scale factor sf; then apply the softmax function to all elements. We set sf to √d_K, where d_K, the dimension of K, is 1024. The reason for multiplying by sf is that the inner products of unit vectors are so small that they interfere with the subsequent softmax operation.
If the scale of the softmax function's input is too small or too large, it cannot express an appropriate probability distribution; our setting brings the values to a proper scale for the softmax. Based on the above explanation, the attention map is written mathematically as:

A = softmax( sf (Q^T K) )    (3)

The context vectors, which represent the ID predictions for Q, are written as:

C = V A^T    (4)

C represents the probability that each face-voice pair in Q belongs to a particular ID, with the face and voice probabilities kept separate in C. Our method uses the confidence score vector c_p as the criterion for deciding whether the p-th embedding in Q is the active speaker. We apply the Hadamard product (Horn 1990), multiplying the face part and the voice part of C element-wise, to consider the features of both face and voice. This operation yields a 1 × N vector of confidence scores, where N is the number of IDs. The maximum value of c_p and its index are regarded as the confidence score and SpkID(t), respectively. We elaborate on the overall procedure in Algorithm 1.

Algorithm 1 End-to-End Speaker Naming Prediction
  Let Q: Query, K: Key, V: Value, A: Attention map
  I ← {i_1, i_2, ..., i_N}: a set of characters' IDs
  Cut video into time windows of 0.5 s
  for each time window t ← 1 to T do
    Rep_t: representative frame in t
    {f_1, ..., f_{J_t}}: J_t faces cropped from Rep_t
    {q_{f_1}, ..., q_{f_{J_t}}}: facial embeddings from {f_1, ..., f_{J_t}}
    q_v: voice embedding extracted from the audio in t
    Q ← [ q_{f_1} q_{f_2} ... q_{f_{J_t}} ; q_v q_v ... q_v ]
    A ← softmax( sf (Q^T K) )
    C ← V A^T
    max_conf ← 0
    for p ← 1 to J_t do
      c_p ← c_{f_p} ∘ c_{v_p}    ▷ Hadamard product
      max_conf ← max(max_conf, max(c_p))
      if max_conf == max(c_p) then
        SpkID(t) ← argmax_{i ∈ I}(c_p)
      end if
    end for
  end for

Experiments

Dataset Overview

In our experiments we used two public datasets: utterance videos of celebrities (VoxCeleb2 (Chung, Nagrani, and Zisserman 2018)) and a TV show (The Big Bang Theory (BBT)). For the experiments in incremental settings, we randomly chose 500 people from VoxCeleb2 who have more than 10 videos each, then split the data into train and validation sets at a 5:2 ratio per ID. For BBT, we selected 5 episodes (S01E02, S01E03, S01E04, S01E05, S01E06). Each episode consists of the whole video, face images with various poses and illumination, and an aggregated voice file with silence removed.

Data Preprocessing

We used FaceNet and NetVLAD, the same extractors used in our model, to extract train and test embeddings from the raw datasets. The BBT dataset consists of multiple cropped face images and merged voice files per ID, for each episode. Cropped face images were resized to 160 × 160 to fit the input size of the FaceNet model; pre-trained FaceNet then converted the resized face images into 512-dimensional embeddings. Similarly, we converted the audio files into 512-dimensional voice embeddings; the window size of each audio chunk is 2 s, cut with a 0.1 s stride. For VoxCeleb2, which consists only of video files, we did additional preprocessing to obtain cropped face images and voice chunks. First, we cut the videos at 30 frames per second; then MTCNN (Zhang et al. 2016) cropped face images from all frames, and captured images that were not actual faces were removed. For the voice files, we applied the same settings used to preprocess BBT.
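The attention map of Equations (3)-(4) and the confidence step of Algorithm 1 can be sketched in numpy. The shapes, and in particular the layout of V as stacked face-ID and voice-ID one-hot blocks so that the "face part" and "voice part" of C can be multiplied element-wise, are our reading of the paper rather than a confirmed specification.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def name_speaker(Q, K, V, sf):
    """One time window of Algorithm 1 (our reading; shapes are assumptions).
    Q: (d, M) face-voice pair embeddings for M candidate faces (columns);
    K: (d, S) prior-knowledge pair embeddings;
    V: (2N, S) stacked one-hot face-ID (top N rows) and voice-ID (bottom
    N rows) labels for the S prior pairs. Columns are treated as unit
    vectors, so Q^T K is a cosine-similarity matrix."""
    A = softmax(sf * (Q.T @ K), axis=1)   # Eq. (3): attention map, (M, S)
    C = V @ A.T                           # Eq. (4): context vectors, (2N, M)
    N = V.shape[0] // 2
    conf = C[:N] * C[N:]                  # Hadamard of face and voice parts
    p = int(conf.max(axis=0).argmax())    # most confident candidate face
    return int(conf[:, p].argmax()), float(conf[:, p].max())
```

With one candidate whose face and voice both resemble the prior pairs of ID 0, the function returns ID 0 with a confidence well above the other IDs.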
Comparative Analysis among Speaker Naming Methods under Various Settings

Previous studies (Hu et al. 2015; Ren et al. 2016; Liu et al. 2019) have evaluated the accuracy of their methods in a refined setting, with purified voice and a small, fixed number of characters (5-6 IDs) in the scene, using plenty of pair embeddings per character to train the model. In this experiment, we compare our speaker naming model with existing gradient-based methods in detail under more varied environments than previous work. By considering the advent of new characters in the story, we can precisely evaluate the performance of speaker naming methods in a more realistic situation using the VoxCeleb2 dataset.

Evaluation Metrics

Speaker naming is to find the matched pairs of face and voice embeddings and predict their identity. To compare how well each speaker naming method identifies the ID of a matched pair, we define matching pair accuracy (mpA) as follows:

mpA = N_{id_pred = id_gt} / N_total × 100%    (5)

The second metric is the number of parameters of the speaker naming model loaded in memory. If the model is a neural network, its weights and biases count as parameters; for the attention-based model, the pair embeddings count as parameters. We convert these parameters into kilobytes (KB) for comparison.

The third metric is the setup time of the model. For a neural network, setup time covers loading data into memory, calculating gradients, and updating the weights. For the attention-based model, setup time covers loading the prior knowledge embeddings and running the attention module to derive prediction results.
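The mpA metric of Equation (5) is a simple ratio; a minimal sketch (function name ours):

```python
import numpy as np

def mpa(id_pred, id_gt):
    """Equation (5): percentage of matched pairs whose predicted ID
    equals the ground-truth ID."""
    id_pred, id_gt = np.asarray(id_pred), np.asarray(id_gt)
    return 100.0 * (id_pred == id_gt).sum() / id_gt.size
```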
Experimental Setup

We conducted the experiment by adjusting two main variables: the number of target IDs to predict, and the number of shots (face-voice pair embeddings) of prior knowledge per target ID. For the number of target IDs, we separated the situations into small and large: the number of target IDs ranges from 5 to 50 in increments of five, and from 50 to 500 in increments of fifty, respectively. We also set the number of shots per character to 5 (small) or 50 (large) in both situations, to consider the effect of the amount of labeled training data on the performance of speaker naming methods.

As baselines, we selected two representative gradient-based methods to compare with our attention-based (Att-based) method. The first is Training from Scratch (TfS), which trains the neural network with both the original and the new data; most deep neural networks normally use TfS in the training phase. The second is Learning without Forgetting (LwF) (Li and Hoiem 2017), which generates a new branch on top of the network and trains with only the new data. For a fair comparison, both methods follow the same neural network structure as previous work (Hu et al. 2015).
Figure 4: Comparative analyses between speaker naming methods under various settings. Three metrics (mpA, the number of parameters, and setup time) are measured for each situation as the number of target IDs and the number of shots per character are varied. The y-axes of the parameter-count and setup-time plots are logarithmic.

The maximum number of training epochs is 500, which is sufficient for the loss function to converge. If the network reached the optimal cost before the maximum epoch during training, we took the accuracy and the setup time at the moment the optimal cost was reached. Transfer learning was applied at every stage of all gradient-based methods when the number of IDs increased.

Results

As shown in Figure 4 and Table 1, we conducted both quantitative and qualitative analyses of the experimental results. Most notably, our method (Att-based) reduced the model setup time by roughly tens to hundreds of times compared to the gradient-based methods, regardless of conditions. mpA was highest for TfS, followed by Att-based and then LwF; however, once the number of target IDs reached 450 with large shots, LwF gradually surpassed Att-based, as shown in the "Large IDs-mpA" plot of Figure 4. In general, the gradient-based methods showed a large difference in mpA depending on the number of shots.
In contrast, Att-based worked well in both situations and was less affected by the number of shots. Att-based uses a small number of parameters when the number of target IDs or shots is small. However, as the number of target IDs grows with large (50) shots, Att-based becomes memory-inefficient, because its number of parameters increases quadratically, with (the number of IDs × the number of shots per ID). In contrast, TfS occupies a constant number of parameters, determined only by the structure of the neural network. LwF's parameter count is proportional to the number of times IDs are added: because LwF has a multi-branch structure, a new branch is generated whenever a new target character comes in.

To sum up, Att-based is the most appropriate method when new people appear frequently and the shots per character are insufficient. Att-based also works effectively where an immediate update for hard-to-recognize data, such as varied facial poses, is needed. TfS is best suited to situations where new people are added infrequently and high accuracy is required. LwF sits between the other two methods: it sets up the model faster than TfS, but compromises mpA and memory usage.

Speaker Naming Accuracy on Real Video

In this experiment, we applied our model to real video to compare its accuracy with previous gradient-based speaker naming models.

Evaluation Metric

Speaker naming accuracy (snA) has been used broadly in earlier speaker naming papers and was formulated in (Liu et al.
2019), which is one of our baselines. We use this metric to measure real-video inference performance when comparing our model against well-known speaker naming baselines. snA is defined as:

snA = N_{[p_sn = s_tr]} / N_{s_tr} × 100%    (6)

where p_sn and s_tr denote the labels of predicted samples and the ground truth, respectively. N_{[p_sn = s_tr]} is the number of correctly predicted time windows and N_{s_tr} is the total number of time windows.

Table 1: Summary of the performance of the gradient-based methods (TfS, LwF) and the attention-based (Att-based) method under various settings. The numbers are averages of the measurements over each range.

# of target IDs    | # of shots per ID | Metric           | TfS               | LwF               | Att-based (Ours)
Small IDs (5-50)   | Small (5)         | mpA (%)          | 86.05             | 61.02             | 83.38
Small IDs (5-50)   | Small (5)         | # of params (KB) | 6248.4 (constant) | 15473.4 (linear)  | 564.39 (quadratic)
Small IDs (5-50)   | Small (5)         | Setup time (s)   | 60.02             | 7.06              | 0.29
Small IDs (5-50)   | Large (50)        | mpA (%)          | 94.80             | 80.29             | 88.06
Small IDs (5-50)   | Large (50)        | # of params (KB) | 6248.4 (constant) | 15473.4 (linear)  | 5644.03 (quadratic)
Small IDs (5-50)   | Large (50)        | Setup time (s)   | 381.21            | 86.28             | 2.64
Large IDs (50-500) | Small (5)         | mpA (%)          | 78.67             | 23.06             | 67.68
Large IDs (50-500) | Small (5)         | # of params (KB) | 6248.4 (constant) | 33923.4 (linear)  | 5752.3 (quadratic)
Large IDs (50-500) | Small (5)         | Setup time (s)   | 437.20            | 85.14             | 2.50
Large IDs (50-500) | Large (50)        | mpA (%)          | 84.20             | 66.94             | 70.39
Large IDs (50-500) | Large (50)        | # of params (KB) | 6248.4 (constant) | 33923.4 (linear)  | 57523.15 (quadratic)
Large IDs (50-500) | Large (50)        | Setup time (s)   | 1973.08           | 909.22            | 25.05

Table 2: Speaker naming accuracy (snA, %) of the attention-based model and existing speaker naming models on real video (BBT S01E03).

Time window | Bauml et al. 2013 | Tapaswi et al. 2012 | Hu et al. 2015 | Ren et al. 2016 | Liu et al. 2019 | Att-based (Ours)
0.5 s       | -                 | -                   | 74.93          | 86.59           | 87.73           | 84.34
1 s         | -                 | -                   | 77.24          | 89.00           | -               | 92.50
1.5 s       | -                 | -                   | 79.35          | 90.45           | -               | 88.89
2 s         | -                 | -                   | 82.12          | 90.84           | -               | 92.45
2.5 s       | 77.81             | 80.80               | 82.81          | 91.17           | -               | 89.89
3 s         | -                 | -                   | 83.42          | 91.38           | -               | 93.12
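In the real-video evaluation, time windows longer than 0.5 s are scored by majority vote over their 0.5 s sub-window predictions (following Hu et al. 2015). A minimal sketch of that aggregation step; the tie-breaking rule (earliest-seen label wins) is ours, since the paper does not specify one:

```python
from collections import Counter

def majority_vote(window_preds):
    """Aggregate per-0.5s speaker predictions into one label for a longer
    time window by majority vote. Counter.most_common preserves insertion
    order on ties (Python 3.7+), so ties fall to the earliest-seen label."""
    return Counter(window_preds).most_common(1)[0][0]
```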
Experimental Setup

For evaluation, we followed the same settings as previous speaker naming experiments (Tapaswi, Bauml, and Stiefelhagen 2012; Bauml, Tapaswi, and Stiefelhagen 2013; Hu et al. 2015; Ren et al. 2016). A four-minute BBT S01E03 video clip was used as the evaluation dataset. In a real situation, many non-matched face-voice pairs occur in each period. Unlike the previous controlled settings, we therefore put 30 shots of matched and non-matched pairs, at a ratio of 1 to 4, into the prior knowledge embeddings, because end-to-end inference must detect not only the active speaker but also distractors. We tested our model on video with time windows in multiples of 0.5 s; windows longer than 0.5 s were also tested for clearer comparison with existing methods. If the time window is longer than 0.5 s, the model's prediction is determined by majority vote over the 0.5 s sub-windows, as in previous work (Hu et al. 2015).

Results

As shown in Table 2, Att-based achieved snA comparable to the other gradient-based speaker naming models in most cases. In certain circumstances, such as time windows of 1 s, 2 s, and 3 s, our model even outperformed the other state-of-the-art models.

Conclusion and Future Work

In this paper, we presented an attention-based speaker naming method for online adaptation in non-fixed scenarios. The key idea is to predict the ID of a matched pair with an attention mechanism that considers the correlation between all prior knowledge embeddings and the extracted target embeddings. As demonstrated in our experiments, the proposed approach significantly reduces model setup time while keeping accuracy comparable to existing state-of-the-art models. Moreover, the model can be updated online simply by changing the information in the attention module. Our further research aims to resolve the current limitations and extend the method to more generalized situations.
At present, our method uses only two modalities and shows low accuracy when the number of target IDs for identification is large. It can also become memory-inefficient as the number of IDs and the number of shots per ID increase. If we properly combine the advantages of gradient-based methods with ours, the integrated method could adequately cover a wider range of situations in the future.

Acknowledgments

This work was supported by the New Industry Promotion Program (1415158216, Development of Front/Side Camera Sensor for Autonomous Vehicle) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Bauml, M.; Tapaswi, M.; and Stiefelhagen, R. 2013. Semi-supervised learning with constraints for person identification in multimedia data. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '13, 3602-3609. Washington, DC, USA: IEEE Computer Society.
Bredin, H., and Gelly, G. 2016. Improving speaker diarization of TV series using talking-face detection and clustering. In Proceedings of the 24th ACM International Conference on Multimedia, MM '16, 157-161. New York, NY, USA: ACM.
Chung, J. S.; Nagrani, A.; and Zisserman, A. 2018. VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622.
Deng, J.; Guo, J.; Xue, N.; and Zafeiriou, S. 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4690-4699.
Fei-Fei, L.; Fergus, R.; and Perona, P. 2006. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4):594-611.
Ganin, Y., and Lempitsky, V. 2015. Unsupervised domain adaptation by backpropagation.
In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML '15, 1180–1189. JMLR.org.

Hershey, S.; Chaudhuri, S.; Ellis, D. P.; Gemmeke, J. F.; Jansen, A.; Moore, R. C.; Plakal, M.; Platt, D.; Saurous, R. A.; Seybold, B.; et al. 2017. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 131–135. IEEE.

Horn, R. A. 1990. The Hadamard product. In Proc. Symp. Appl. Math, volume 40, 87–169.

Hu, Y.; Ren, J. S.; Dai, J.; Yuan, C.; Xu, L.; and Wang, W. 2015. Deep multimodal speaker naming. In Proceedings of the 23rd Annual ACM International Conference on Multimedia, 1107–1110. ACM.

Li, Z., and Hoiem, D. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12):2935–2947.

Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; and Song, L. 2017. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 212–220.

Liu, X.; Geng, J.; Ling, H.; and Cheung, Y.-M. 2019. Attention guided deep audio-face fusion for efficient speaker naming. Pattern Recognition 88:557–568.

Liu, C.; Jiang, S.; and Huang, Q. 2008. Naming faces in broadcast news video by image Google. In Proceedings of the 16th ACM International Conference on Multimedia, 717–720. ACM.

Muda, L.; Begam, M.; and Elamvazuthi, I. 2010. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083.

Ren, J. S.; Hu, Y.; Tai, Y.-W.; Wang, C.; Xu, L.; Sun, W.; and Yan, Q. 2016. Look, listen and learn - a multimodal LSTM for speaker identification. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 3581–3587.

Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015.
FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 815–823.

Sun, Y.; Chen, Y.; Wang, X.; and Tang, X. 2014. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, 1988–1996.

Takenaka, K.; Bando, T.; Nagasaka, S.; and Taniguchi, T. 2012. Drive video summarization based on double articulation structure of driving behavior. In Proceedings of the 20th ACM International Conference on Multimedia, 1169–1172. ACM.

Tapaswi, M.; Bäuml, M.; and Stiefelhagen, R. 2012. "Knock! Knock! Who is it?" Probabilistic person identification in TV-series. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2658–2665.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; and Liu, W. 2018. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5265–5274.

Wen, Y.; Zhang, K.; Li, Z.; and Qiao, Y. 2016. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, 499–515. Springer.

Xie, W.; Nagrani, A.; Chung, J. S.; and Zisserman, A. 2019. Utterance-level aggregation for speaker recognition in the wild. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5791–5795. IEEE.

Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, 3320–3328.

Zhang, H.; Zha, Z.-J.; Yang, Y.; Yan, S.; Gao, Y.; and Chua, T.-S. 2013.
Attribute-augmented semantic hierarchy: towards bridging semantic gap and intention gap in image retrieval. In Proceedings of the 21st ACM International Conference on Multimedia, 33–42. ACM.

Zhang, K.; Zhang, Z.; Li, Z.; and Qiao, Y. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23(10):1499–1503.
