Deep Multimodal Clustering for Unsupervised Audiovisual Learning


Authors: Di Hu, Feiping Nie, Xuelong Li

Di Hu, Northwestern Polytechnical University, hdui831@mail.nwpu.edu.cn
Feiping Nie, Northwestern Polytechnical University, feipingnie@gmail.com
Xuelong Li†, Northwestern Polytechnical University, xuelong_li@ieee.org

Abstract

The birds we see twitter, and the running cars come with noise. These natural audiovisual correspondences provide possibilities to explore and understand the outside world. However, the mixture of multiple objects and sounds makes it intractable to perform efficient matching in unconstrained environments. To settle this problem, we propose to adequately excavate audio and visual components and perform elaborate correspondence learning among them. Concretely, we propose a novel unsupervised audiovisual learning model, named Deep Multimodal Clustering (DMC), that synchronously performs sets of clusterings with multimodal vectors of convolutional maps in different shared spaces to capture multiple audiovisual correspondences. Such an integrated multimodal clustering network can be effectively trained end-to-end with a max-margin loss. Extensive experiments on feature evaluation and audiovisual tasks are performed. The results demonstrate that DMC can learn effective unimodal representations, with which the classifier can even outperform humans. Further, DMC shows noticeable performance in sound localization, multisource detection, and audiovisual understanding.

* © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
† Corresponding author.

1. Introduction

When we see a dog, why is the sound that emerges in our mind mostly barking instead of meowing or something else? It seems easy to answer: we only ever catch dogs barking in our daily life. As a specific visual appearance and acoustic signal usually occur together, we realize that they are strongly correlated, which accordingly lets us recognize the sound-maker from the visual dog and the distinctive barking sound. Hence, the concurrent audiovisual message provides possibilities to better explore and understand the outside world [18].

The cognitive community noticed this phenomenon in the last century and named it multisensory processing [18]. It was found that some neural cells in the superior temporal sulcus (a brain region in the temporal cortex) can simultaneously respond to visual, auditory, and tactile signals [17]. When a concurrent audiovisual message is perceived by the brain, such neural cells provide a mechanism to correlate these different messages, which is further reflected in various tasks such as lip-reading [7] and sensory substitution [28].

In view of the merits of audiovisual learning in human beings, it is highly desirable to make machines possess a similar ability, i.e., exploring and perceiving the world via concurrent audiovisual messages. More importantly, in contrast to expensive human annotation, audiovisual correspondence can also provide valuable supervision, and it is pervasive, reliable, and free [25]. As a result, audiovisual correspondence learning has received more and more attention recently. In the beginning, the coherent property of the audiovisual signal was supposed to provide cross-modal supervision, where the knowledge of one modality is transferred to supervise the other, primitive one.
However, the learning capacity is obviously limited by the transferred knowledge, and it is difficult to expand the correspondence to unexplored cases. Instead, a natural question emerges: can a model learn audiovisual perception just from the correspondence itself, without any prior knowledge? Recent works give definitive answers [3, 24]. They propose to train an audiovisual two-stream network by simply appending a correspondence judgement on the top layer. In other words, the model learns to match the sound with the image that contains the correct sound source. Surprisingly, after training, the visual and auditory subnets have learnt to respond to specific objects and sounds, and can then be applied to unimodal classification, sound localization, etc.

The correspondence assumption behind previous works [25, 3, 6] relies on a specific audiovisual scenario where the sound-maker exists in the captured visual appearance and a single sound source is expected. However, such a rigorous scenario is not entirely suitable for real-life video. First, an unconstrained visual scene contains multiple objects that may or may not be sound-makers, and the corresponding soundscape is a multisource mixture. Simply performing global correspondence verification, without insight into the complex scene components, can result in inefficient and inaccurate matching, which therefore needs large amounts of audiovisual pairs to achieve acceptable performance [3] but may still generate semantically irrelevant matchings [31]. Second, a sound-maker does not always produce its distinctive sound (e.g., honking cars, barking dogs), so the current video clip may contain no sound while the next one does, which creates inconsistent conditions for the correspondence assumption. Moreover, the sound-maker may even be off-screen, e.g., the voiceover of the photographer.
The above intricate audiovisual conditions make it extremely difficult to analyze and understand the realistic environment, especially to correctly match different sound-makers with the sounds they produce. A more elaborate kind of correspondence learning is therefore needed.

As each modality involves multiple concrete components in an unconstrained scene, it is difficult to correlate the real audiovisual pairs. To settle this problem, we propose to disentangle each modality into a set of distinct components instead of the conventional indiscriminate fashion, and then to learn the correspondence between these distributed representations of the different modalities. More specifically, we argue that the activation vectors across convolution maps have distinct responses for different input components, which meets the clustering assumption. Hence, we introduce K-means into a two-stream audiovisual network to distinguish the concrete objects or sounds captured by video. To align each sound with its corresponding producer, sets of shared spaces for audiovisual pairs are effectively learnt by minimizing an associated triplet loss. As the clustering module is embedded into the multimodal network, the proposed model is named Deep Multimodal Clustering (DMC). Extensive experiments conducted on in-the-wild audiovisual pairs show the superiority of our model on unimodal feature generation, image/acoustic classification, and audiovisual tasks such as single sound localization and multisource Sound Event Detection (SED). The ultimate audiovisual understanding shows preliminary perception ability in real-life scenes.

2. Related Works

Source          Supervis.   Task                  Reference
Sound           Vision      Acoustic Classif.     [5, 14, 13, 6]
Vision          Sound       Image Classif.        [25, 6]
Sound & Vision  Match       Classification        [3]
                            Sound Localization    [4, 31, 24, 36]
                            Source Separation     [8, 24, 12, 36]

Table 1. Audiovisual learning settings and relevant tasks.
Audiovisual correspondence is a natural phenomenon that ultimately comes from the fact that sound is produced by the oscillation of objects. This simple fact makes it possible to discover audiovisual appearances and build their complex correlations. That is why we can match the barking sound to the dog appearance among numerous audio candidates (sound separation) and find the dog appearance according to the barking sound within a complex visual scene (sound source localization). As usual, machine models are also expected to possess abilities similar to humans'.

In the past few years, several works have focused on audiovisual machine learning. The learning settings and relevant tasks can be categorized into three phases according to the source and supervision modality, as shown in Table 1. Early works consider that the audio and visual messages of the same entity should share similar class information. Hence, they utilize a well-trained model of one modality to supervise the other one without additional annotation. Such a “teacher-student” learning fashion has been successfully employed for image classification supervised by sound [25] and acoustic recognition supervised by vision [5].

Although the above models have shown promising cross-modal learning capacity, they actually rely on a stronger supervision signal than humans have: we are not born with a well-trained brain that already recognizes kinds of objects or sounds. Hence, recent (almost concurrent) works propose to train a two-stream network given only the audiovisual correspondence, as shown in Table 1. Arandjelović and Zisserman [3] train their audiovisual model to judge whether an image and an audio clip correspond. Although such a model is trained without the supervision of any teacher model, it learns highly effective unimodal representations and cross-modal correlations [3].
Hence, it becomes feasible to execute relevant audiovisual tasks such as sound localization and source separation. For the first task, Arandjelović and Zisserman [4] revise their previous model [3] to find the visual area with the maximum similarity to the current audio clip. Owens et al. [24] adopt a model similar to [3] but use a 3D convolution network for the visual pathway instead, which can capture motion information for sound localization. However, these works rely on simple global correspondence. When multiple sound-producers exist in the shown visual modality, it becomes difficult to locate the correct producer exactly. Recently, Senocak et al. [31] introduced an attention mechanism into the audiovisual model, where the relevant areas of the visual feature maps learn to attend to the specific input sound. However, there remains another problem: the real-life acoustic environment is usually a mixture of multiple sounds. To localize the source of a specific sound, efficient sound separation is also required.

In the sound separation task, most works propose to reconstruct specific audio streams from manually mixed tracks with the help of visual embeddings. For example, Zhao et al. [36] focus on musical sound separation, while Casanovas et al. [8], Owens et al. [24], and Ariel et al. [10] perform separation of mixed speech. However, real-life sound is more complex and general than these specifically simulated examples, and it even lacks ground truth for the separated sound sources. Hence, our proposed method jointly disentangles the audio and visual components and establishes elaborate correspondences between them, which naturally covers both the sound separation and localization tasks.

3. The Proposed Model

3.1. Visual and Audio subnet

Visual subnet. The visual pathway directly adopts the off-the-shelf VGG16 architecture but without the fully connected and softmax layers [32].
Since the input to the network is resized to a 256 × 256 image, the 512 generated feature maps have size 8 × 8. To enable efficient alignment across modalities, the pixel values are scaled into the range [−1, 1], which is comparable in scale to the log-mel spectrogram of the audio signal. As the visual components associated with the audio signal are encoded into the feature maps, the corresponding entries across all the maps can be viewed as their feature representations, as shown in Fig. 1. In other words, the original feature maps of size 8 × 8 × 512 are reshaped into 64 × 512, where each row is the representation of a specific visual area. Hence, the final visual representations become {u^v_1, u^v_2, ..., u^v_p | u^v_i ∈ R^n}, where p = 64 and n = 512.

Audio subnet. The audio pathway employs the VGGish model to extract representations from the input log-mel spectrogram of mono sound [16]. In practice, differently from the default configuration in [16], the input audio clip is extended to 496 frames of 10 ms each, while the other parameters of the short-time Fourier transform and mel-mapping are kept. Hence, the input to the network becomes a 496 × 64 log-mel spectrogram, and the corresponding output feature maps have size 31 × 4 × 512. To prepare the audio representation for the second-stage clustering, we perform the same operation as for the visual features: the audio feature maps are reshaped into {u^a_1, u^a_2, ..., u^a_q | u^a_i ∈ R^n}, where q = 124 and n = 512.

Figure 1. An illustration of activation distributions. Different visual components have distinct activation vectors across the feature maps, which helps to distinguish them. Best viewed in color.

3.2.
Multimodal clustering module

As convolutional networks show a strong ability to describe the high-level semantics of different modalities [32, 16, 35], we argue that the elements of the feature maps have similar activation probabilities for the same unimodal component, as shown in Fig. 1. It therefore becomes possible to excavate the audiovisual entities by aggregating their similar feature vectors. Hence, we propose to cluster the unimodal feature vectors into object-level representations and align them in the coordinated audiovisual environment, as shown in Fig. 2. For simplicity, we write the feature representation as {u_1, u_2, ..., u_p | u_i ∈ R^n}, without regard to the type of modality.

To cluster the unimodal features into k clusters, we perform K-means to obtain the centers C = {c_1, c_2, ..., c_k | c_j ∈ R^m}, where m is the center dimensionality. K-means aims to minimize the within-cluster distance while assigning the feature points to the k clusters [19]; hence, the objective function can be formulated as

F(C) = \sum_{i=1}^{p} \min_{j=1}^{k} d(u_i, c_j),    (1)

where \min_{j=1}^{k} d(u_i, c_j) is the distance between the current point and its closest center. However, simply introducing Eq. (1) into a deep network makes it difficult to optimize by gradient descent, as the minimization in Eq. (1) is a hard assignment of data points to clusters and is not differentiable. To solve this intractable problem, one way is to make a soft assignment for each point. In particular, the Expectation-Maximization (EM) algorithm for Gaussian Mixture Models (GMMs) makes a soft assignment.

Figure 2. The diagram of the proposed deep multimodal clustering model.
The two modality-specific ConvNets first process the pairwise visual image and audio spectrogram into respective feature maps; these maps are then co-clustered into corresponding components that indicate concrete audiovisual contents, such as a baby and its voice, or drumming and its sound. Finally, the model takes the similarity across modalities as the supervision for training.

Such a soft assignment is based on the posterior probabilities and converges to a local optimum [?]. In this paper, we propose another perspective to transform the hard assignment in Eq. (1) into a soft, differentiable assignment problem. The minimization operation in Eq. (1) is approximated via the following equation:

max{d_{i1}, d_{i2}, ..., d_{ik}} ≈ \frac{1}{z} \log \sum_{j=1}^{k} e^{d_{ij} z},    (2)

where z is a magnitude parameter and d_{ij} = d(u_i, c_j) for simplicity. Eq. (2) shows that the maximum value of a given sequence can be approximated by the log-summation of the corresponding exponential functions. Intuitively, the differences within the original sequence are amplified sharply by the exponential projection, which tends to ignore the small values and retain the largest one; the reversed logarithm projection then gives the approximated maximum value. A rigorous proof of Eq. (2) can be found in the materials.

As we aim to find the minimum value of the distance sequence, Eq. (2) is modified into

min{d_{i1}, d_{i2}, ..., d_{ik}} ≈ -\frac{1}{z} \log \sum_{j=1}^{k} e^{-d_{ij} z}.    (3)

Then, the objective function of clustering becomes

F(C) = -\frac{1}{z} \sum_{i=1}^{p} \log \sum_{j=1}^{k} e^{-d_{ij} z}.    (4)

As Eq. (4) is differentiable everywhere, we can directly compute the derivative w.r.t. each cluster center. Concretely, for the center c_j, the derivative is

\frac{\partial F}{\partial c_j} = \sum_{i=1}^{p} \frac{e^{-d_{ij} z}}{\sum_{l=1}^{k} e^{-d_{il} z}} \frac{\partial d_{ij}}{\partial c_j} = \sum_{i=1}^{p} s_{ij} \frac{\partial d_{ij}}{\partial c_j},    (5)

where s_{ij} = \frac{e^{-d_{ij} z}}{\sum_{l=1}^{k} e^{-d_{il} z}} = \mathrm{softmax}(-d_{ij} z).
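The log-sum-exp approximation in Eqs. (2)-(3) is easy to check numerically. Below is a minimal sketch; the values of d and z are arbitrary toy choices, and a production version would subtract the maximum before exponentiating for numerical stability:

```python
import numpy as np

def soft_min(d, z):
    """Differentiable surrogate for min(d), following Eq. (3):
    min(d) ~= -(1/z) * log(sum_j exp(-d_j * z))."""
    return -np.log(np.sum(np.exp(-d * z))) / z

d = np.array([3.0, 1.0, 4.0, 1.5])
print(soft_min(d, z=1.0))    # loose lower bound of min(d)
print(soft_min(d, z=50.0))   # approaches min(d) = 1.0 as z grows
```

The surrogate always lies at or below the true minimum, and the gap shrinks as the magnitude parameter z increases, which is what makes Eq. (4) a usable differentiable objective.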
The softmax coefficient acts like a soft segmentation over the whole visual area or audio spectrogram for the different centers; we give more explanation about it in the following sections.

In practice, the distance d_{ij} between each pair of feature point u_i and center c_j can be computed in different ways, such as Euclidean distance, cosine proximity, etc. In this paper, inspired by the capsule net [30, 33] (a discussion about capsules and DMC is provided in the materials), we choose the inner product for measuring the agreement, i.e., d_{ij} = -\left\langle u_i, \frac{c_j}{\|c_j\|} \right\rangle. Substituting it into Eq. (5) and setting the derivative to zero, we obtain (the detailed derivation is shown in the materials)

\frac{c_j}{\|c_j\|} = \frac{\sum_{i=1}^{p} s_{ij} u_i}{\left\| \sum_{i=1}^{p} s_{ij} u_i \right\|},    (6)

which means the center and the integrated features lie in the same direction. As the coefficients s_{·j} are the softmax values of distances, the corresponding center c_j emerges at a scale comparable to the features, and Eq. (6) is approximately computed as c_j = \sum_{i=1}^{p} s_{ij} u_i for simplicity. However, there remains another problem: the computation of s_{ij} depends on the current center c_j, which makes it difficult to derive a direct update rule for the centers. Instead, we alternately update the coefficients s^{(r)}_{ij} and the centers c^{(r+1)}_j, i.e.,

c^{(r+1)}_j = \sum_{i=1}^{p} s^{(r)}_{ij} u_i.    (7)

This updating rule is much like the EM algorithm that maximizes posterior probabilities in GMMs [?]. Specifically, the first step is the expectation (E) step, which uses the current parameters to evaluate the posterior probabilities, i.e., re-assigns data points to the centers. The second step is the maximization (M) step, which re-estimates the means, covariances, and mixing coefficients, i.e., updates the centers as in Eq. (7).
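The alternating update of Eqs. (5) and (7) can be sketched for a single modality as follows. This toy version omits the per-center projections W_j introduced below and uses made-up data in place of the reshaped 64 × 512 convolutional vectors:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dmc_cluster(u, k=2, z=1.0, iters=10):
    """Alternate soft assignment (Eq. 5) and center update (Eq. 7).
    u: (p, n) feature vectors. Returns centers (k, n) and assignments (p, k)."""
    c = u[:: max(1, len(u) // k)][:k].copy()     # simple deterministic init
    s = None
    for _ in range(iters):
        c_hat = c / np.linalg.norm(c, axis=1, keepdims=True)
        agreement = u @ c_hat.T                  # -d_ij = <u_i, c_j / ||c_j||>
        s = softmax(z * agreement, axis=1)       # s_ij = softmax(-z * d_ij) over j
        c = s.T @ u                              # c_j = sum_i s_ij * u_i
    return c, s

# Toy check: two well-separated groups of vectors fall into different clusters.
rng = np.random.default_rng(0)
u = np.vstack([rng.normal(5, 1, (20, 8)), rng.normal(-5, 1, (20, 8))])
centers, s = dmc_cluster(u, k=2)
print(np.argmax(s[0]) != np.argmax(s[-1]))  # True: the groups split apart
```

The initialization scheme and the fixed iteration count are illustrative choices, not the paper's; in the full model the loop is unrolled inside the network so gradients flow through the assignments.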
The aforementioned clusters indicate a kind of soft assignment (segmentation) over the input image or spectrogram, where each cluster mostly corresponds to certain content (e.g., the baby face and drum in the image, the voice and drumbeat in the sound in Fig. 2); hence they can be viewed as distributed representations of each modality. We argue that audio and visual messages should have similar distributed representations when they jointly describe the same natural scene. Hence, we propose to perform different center-specific projections {W_1, W_2, ..., W_k} over the audio and visual messages to distinguish the representations of the different audiovisual entities, and then to cluster these projected features into the multimodal centers to seek concrete audiovisual contents. Formally, the distance and center update become

d_{ij} = -\left\langle W_j u_i, \frac{c_j}{\|c_j\|} \right\rangle and c^{(r+1)}_j = \sum_{i=1}^{p} s^{(r)}_{ij} W_j u_i,

where the projection matrix W_j is shared across modalities and considered as the association with a concrete audiovisual entity. Moreover, W_j also plays the role of the magnitude parameter z when computing the distance d_{ij}. We show the complete multimodal clustering in Algorithm 1.

We employ the cosine proximity to measure the difference between audiovisual centers, i.e., s(c^a_i, c^v_i), where c^a_i and c^v_i are the i-th centers of the audio and visual modality, respectively. To efficiently train the two-stream audiovisual network, we employ a max-margin loss that encourages the network to give more confidence to the realistic image-sound pair than to mismatched ones:

loss = \sum_{i=1, i \neq j}^{k} \max\left(0, s(c^a_j, c^v_i) - s(c^a_i, c^v_i) + \Delta\right),    (8)

where Δ is a margin hyper-parameter and c^a_j is the negative audio sample for the positive audiovisual pair (c^a_i, c^v_i). In practice, the negative example is randomly sampled from the training set and differs from the positive one.
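The hinge of Eq. (8) can be sketched as follows; the centers here are toy vectors, and pairing each positive with a single randomly drawn negative follows the paper's sampling scheme:

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_margin_loss(c_a, c_v, c_a_neg, margin=0.2):
    """Eq. (8): each matched pair (c_a[i], c_v[i]) must score at least
    `margin` higher in cosine similarity than the mismatched c_a_neg[i]."""
    loss = 0.0
    for i in range(len(c_v)):
        pos = cos_sim(c_a[i], c_v[i])
        neg = cos_sim(c_a_neg[i], c_v[i])
        loss += max(0.0, neg - pos + margin)
    return loss

# Matched centers aligned, negatives orthogonal -> the hinge is inactive.
c_a   = np.array([[1.0, 0.0], [0.0, 1.0]])
c_v   = np.array([[1.0, 0.0], [0.0, 1.0]])
c_neg = np.array([[0.0, 1.0], [1.0, 0.0]])
print(max_margin_loss(c_a, c_v, c_neg))  # 0.0
```

Swapping the roles of the matched and mismatched audio centers makes every hinge active, so the loss becomes positive; the margin value 0.2 is an arbitrary toy choice.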
The Adam optimizer with a learning rate of 10^{-4} and a batch size of 64 is used, and we train the audiovisual network for 25,000 iterations, which took three weeks on one K80 GPU card.

Algorithm 1 Deep Multimodal Clustering
Input: the feature vectors of each modality, {u^a_1, ..., u^a_q | u^a_i ∈ R^n} and {u^v_1, ..., u^v_p | u^v_i ∈ R^n}
Output: the center vectors of each modality, {c^a_1, ..., c^a_k | c^a_j ∈ R^m} and {c^v_1, ..., c^v_k | c^v_j ∈ R^m}
Initialize the distances: d^a_{ij} = d^v_{ij} = 0
1: for t = 1 to T, x in {a, v} do
2:   for i = 1 to q (p), j = 1 to k do
3:     Update weights: s^x_{ij} = softmax(-d^x_{ij})
4:     Update centers: c^x_j = Σ_i s^x_{ij} W_j u^x_i
5:     Update distances: d^x_{ij} = -⟨W_j u^x_i, c^x_j / ‖c^x_j‖⟩
6:   end for
7: end for

4. Feature Evaluation

Ideally, the unimodal networks should have learnt to respond to different objects or sound scenes after training the DMC model. Hence, we propose to evaluate the learned audio and visual representations of the CNN internal layers. For efficiency, the DMC model is trained with 400K unlabeled videos randomly sampled from the SoundNet-Flickr dataset [5]. The input audio and visual messages are the same as in [5]: pairs of a 5 s sound clip and the corresponding image are extracted from each video without overlap. Note that the resulting ~1.6M audiovisual pairs are about 17 times fewer than in L3 [3] and 5 times fewer than in SoundNet [5].

4.1. Audio Features

The audio representation is evaluated on the complex environmental sound classification task. The adopted ESC-50 dataset [27] is a collection of 2000 audio clips of 5 s each, equally partitioned into 50 categories; hence, each category contains 40 samples.
For fairness, each sample is also partitioned into 1 s audio excerpts for data augmentation [5], and these overlapped subclips constitute the audio inputs to the VGGish network. The mean accuracy is computed over the five leave-one-fold-out evaluations. Note that human performance on this dataset is 0.813.

The audio representations are extracted by pooling the feature maps (similarly to SoundNet [5], we evaluated the performance of different VGGish layers and selected conv4_1 as the extraction layer), and a multi-class one-vs-all linear SVM is trained on the extracted audio representations. The final accuracy of each clip is the mean value of its subclip scores. To be fair, we also modify the DMC model into the “teacher-student” scheme (‡DMC), where the VGG net is pretrained on ImageNet and kept fixed during training. As shown in Table 2 (a), the DMC model exceeds all previous methods except Audio-Visual Temporal Synchronization (AVTS) [?].

(a) ESC-50
Methods            Accuracy
Autoencoder [5]    0.399
Rand. Forest [27]  0.443
ConvNet [26]       0.645
SoundNet [5]       0.742
L3 [3]             0.761
†L3 [3]            0.793
†AVTS [?]          0.823
DMC                0.798
‡DMC               0.826
Human Perfor.      0.813

(b) Pascal VOC 2007
Methods            Accuracy
Taxton. [25]       0.375
Kmeans [21]        0.348
Tracking [34]      0.422
Patch. [9]         0.467
Egomotion [2]      0.311
Sound(spe.) [25]   0.440
Sound(clu.) [25]   0.458
Sound(bia.) [25]   0.467
DMC                0.514
ImageNet           0.672

Table 2. Acoustic scene classification on ESC-50 [27] and image classification on Pascal VOC 2007 [11]. (a) For fairness, we provide a weakened version of L3 trained with the same audiovisual set as ours, while †L3 is trained with more data in [3]. †AVTS is trained with the whole SoundNet-Flickr dataset [5]. ‡DMC takes supervision from a well-trained vision network for training the audio subnet. (b) The shown results are the best ones reported in [25], except the ones with FC features.

Such performance is achieved with less training data (just 400K videos), which confirms that our model can utilize the audiovisual correspondences in unconstrained videos to effectively train the unimodal network. We also note that AVTS is trained with the whole 2M+ videos in [5], five times more than DMC. Even so, DMC still outperforms AVTS on the DCASE2014 benchmark dataset (more details can be found in the materials). The cross-modal supervision version ‡DMC improves the accuracy further; the most noticeable point is that ‡DMC outperforms humans [27] (82.6% vs. 81.3%). Hence, this verifies that the elaborate alignment works efficiently and that the audiovisual correspondence indeed helps to learn unimodal perception.

4.2. Visual Features

The visual representation is evaluated on the object recognition task. The chosen PASCAL VOC 2007 dataset contains 20 object categories collected in realistic scenes [11]. We perform global pooling over the conv5_1 features of the VGG16 net to obtain the visual features. A multi-class one-vs-all linear SVM is also employed as the classifier, and the results are evaluated using Mean Average Precision (mAP). As the DMC model does not contain a standard FC layer as in previous works, the best conv/pooling features of the other methods are chosen for comparison, as reported in [25].

Figure 3. Qualitative examples of sound source localization. After feeding the audio and visual messages into the DMC model, we visualize the soft assignment that belongs to the visual cluster most related to the audio messages. Note that the visual scene becomes more complex from top to bottom, and the labels are for visualization purposes only.
To validate the effectiveness of multimodal clustering in DMC, we compare with the visual model in [25], which treats the separated sound clusters as object indicators for visual supervision. In contrast, the DMC model jointly learns the audio and visual representations rather than following the above single flow from sound to vision; hence it is more flexible for learning the audiovisual correspondence. As shown in Table 2 (b), our model indeed shows a noticeable improvement over the simple cluster supervision, even over its multi-label (binary) variation [25]. Moreover, we also compare with the VGG16 net pretrained on ImageNet. What surprises us is that the DMC model is comparable to human performance in acoustic classification but has a large gap to the image classification benchmark. Such a difference may come from the complexity of visual scenes compared with acoustic ones. Nevertheless, our model still provides meaningful insights into learning effective visual representations via audiovisual correspondence.

5. Audiovisual Evaluation

5.1. Single Sound Localization

In this task, we aim to localize the sound source in the visual scene as in [4, 24], where the simple case of a single source is considered. As only one sound appears in the audio track, the generated audio features should share an identical center. In practice, we perform average-pooling over the audio centers to obtain c^a, then compare it with all the visual centers via cosine proximity, where the number of visual centers is set to 2 (i.e., sound-maker and other components) in this simple case. The visual center with the highest score is considered the indicator of the corresponding sound source. To further visualize the sound source, we resort to the soft assignment of the selected visual center c^v_j, i.e., s^v_{·j}.
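The selection step just described (average-pool the audio centers, then rank the visual centers by cosine proximity) can be sketched as follows; the center values are made-up toy vectors:

```python
import numpy as np

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def locate_source(audio_centers, visual_centers):
    """Average-pool the audio centers into c_a, then pick the visual center
    with the highest cosine similarity as the sound-maker indicator."""
    c_a = audio_centers.mean(axis=0)
    scores = np.array([cos_sim(c_a, c_v) for c_v in visual_centers])
    return int(np.argmax(scores)), scores

audio_centers  = np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]])
visual_centers = np.array([[0.0, 1.0, 0.0],    # e.g. a background cluster
                           [1.0, 0.0, 0.05]])  # e.g. the sound-maker cluster
idx, _ = locate_source(audio_centers, visual_centers)
print(idx)  # 1

# The chosen center's soft assignment (length 64) can then be reshaped into
# an 8 x 8 heatmap over the image, e.g.: heat = s[:, idx].reshape(8, 8)
```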
As the assignments satisfy s^v_{ij} ∈ [0, 1], the coefficient vector s^v_{·j} is reshaped back to the size of the original feature map and viewed as a heatmap that indicates the cluster membership. In Fig. 3, we show qualitative examples of the sound-source locations for different videos from the SoundNet-Flickr dataset. The DMC model has clearly learnt to distinguish different visual appearances and to correlate each sound with its corresponding visual source, although the training phase is performed entirely in an unsupervised fashion. Concretely, in simple scenes, the visual sources of the baby voice and the car noise are easy to localize. When the visual scene becomes more complex, the DMC model can still successfully localize the corresponding source. The dog appearance responds strongly to the barking sound while the cat does not. In contrast to the audience and background, only the onstage choruses respond to the singing sound. And the moving vehicles are successfully localized regardless of the driver or other visual contents in the complex traffic environment.

Apart from the qualitative analysis, we also provide a quantitative evaluation. We directly adopt the annotated sound-source dataset of [31], originally collected from the SoundNet-Flickr dataset. This sub-dataset contains 2,786 audio-image pairs, where the sound-maker of each pair is located individually by three subjects. 250 pairs (with a single sound) are randomly sampled to construct the testing set. By setting an arbitrary threshold over the assignment s^v_{·j}, we obtain a binary segmentation over the visual objects that probably indicates the sound locations. Hence, to compare the automatic segmentation with the human annotations, we employ the consensus Intersection over Union (cIoU) and the corresponding AUC area from [31] as the evaluation metrics.

As shown in Table 3, the proposed DMC model is compared with the recent attention-based sound localization model [31]. First, DMC clearly shows superior performance over the unsupervised attention model. In particular, when the cIoU threshold becomes larger (i.e., 0.7), DMC even outperforms the supervised ones. Second, apart from the visual center most related to the sound track, the unrelated one is also evaluated. The large decline for the unrelated center indicates that the clustering mechanism in DMC can effectively distinguish different modality components and exactly correlate them across modalities.

Methods               cIoU(0.5)  cIoU(0.7)  AUC
Random                12.0       -          32.3
Unsupervised† [31]    52.4       -          51.2
Unsupervised [31]     66.0       ~18.8      55.8
Supervised [31]       80.4       ~25.5      60.3
Sup.+Unsup. [31]      82.8       ~28.8      62.0
DMC (unrelated)       10.4       5.2        21.1
DMC (related)         67.1       26.2       56.8

Table 3. The evaluation of sound source localization. The cIoUs with thresholds 0.5 and 0.7 are shown. The area under the cIoU curve obtained by varying the threshold from 1 to 0 (AUC) is also provided. The unsupervised† method in [31] employs a modified attention mechanism.

5.2. Real-Life Sound Event Detection

In this section, in contrast to the specific sound separation task⁴, we focus on a more general and complicated sound task, i.e., multisource SED. In a realistic environment, multiple sound tracks usually exist at the same time; e.g., a street environment may be a mixture of people speaking, car noise, walking sounds, brakes squeaking, etc. It is expected to detect the existing sounds at every moment, which is much more challenging than the previous single acoustic recognition [15]. Hence, it becomes more valuable to evaluate the ability of DMC to learn effective representations of multi-track sound. In the DCASE2017 acoustic challenges, the third task⁵ is exactly the multisource SED.
The audio dataset used in this task focuses on complex street acoustic scenes that consist of different traffic levels and activities. The whole dataset is divided into development and evaluation sets, and each audio clip is 3-5 minutes long. Segment-based F-score and error rate are calculated as the evaluation metrics.

As our model provides elaborate visual supervision for training the audio subnet, the corresponding audio representation should provide a sufficient description of multi-track sound. To validate this assumption, we directly replace the input spectrum with our generated audio representation in the baseline MLP model [15]. As shown in Table 4, the DMC model is compared with the top five methods in the challenge, the audiovisual net L³ [3], and the VGGish net [16]. It is obvious that our model takes first place on the F1 metric and is comparable to the best model in error rate. Specifically, there are three points we should pay attention to. First, by utilizing the audio representation of the DMC model instead of the raw spectrum, we achieve a noticeable improvement. Such improvement indicates that the correspondence learning across modalities indeed provides effective supervision for distinguishing different audio contents. Second, as the L³ net simply performs global matching between audio and visual scene without exploring the concrete content inside, it fails to provide an effective audio representation for multisource SED. Third, although the VGGish net is trained on a preliminary version of YouTube-8M (with labels) that is much larger than our training data, our model still outperforms it. This comes from the more efficient audiovisual correspondence learning of the DMC model.

⁴ The sound separation task mostly focuses on specific task scenarios and needs effective supervision from the original sources [8, 24, 36], which goes beyond our conditions.
⁵ http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-sound-event-detection-in-real-life-audio

Methods          Segment F1   Segment Error
J-NEAT-E [22]       44.9          0.90
SLFFN [22]          43.8          1.01
ASH [1]             41.7          0.79
MICNN [20]          40.8          0.81
MLP [15]            42.8          0.94
§MLP [15]           39.1          0.90
L³ [3]              43.24         0.89
VGGish [16]         50.96         0.86
DMC                 52.14         0.83

Table 4. Real-life sound event detection on the evaluation dataset of the DCASE 2017 Challenge. We choose the default STFT parameters of a 25 ms window size and a 10 ms window hop [16]. The same parameters are also adopted by §MLP, L³, and VGGish, while the other methods adopt the default parameters in [15].

5.3. Audiovisual Understanding

As introduced in Section 1, the real-life audiovisual environment is unconstrained: each modality consists of multiple instances or components, such as speaking, brakes squeaking, and walking sounds in the audio modality, and buildings, people, cars, and roads in the visual modality of the street environment. Hence, it is difficult to disentangle them within each modality and establish exact correlations between modalities, i.e., to perform audiovisual understanding. In this section, we attempt to employ the DMC model for audiovisual understanding in such cases, where only a qualitative evaluation is provided due to the absence of annotations.

To illustrate the results better, we turn the soft clustering assignment into a binary map via a threshold of 0.7. Fig. 4 shows the matched audio and visual clustering results of different real-life videos, where the sound is represented as a spectrogram. In the "baby drums" video, the drumming sound and the corresponding motion are captured and correlated; meanwhile, the baby face and people's voices are also picked out from the intricate audiovisual content. These two distinct centers jointly describe the audiovisual structures. In more complex indoor and outdoor environments, the DMC model can also capture the people yelling and talking from background music and loud environmental noise by clustering the audio feature vectors, and correlate them with the corresponding sound-makers (i.e., the visual centers) via the shared projection matrix.

Figure 4. Qualitative examples of complex audiovisual understanding. We first feed the audiovisual messages into the DMC model; the corresponding audiovisual clusters are then captured and shown, where the assignments are binarized into masks over each modality via a threshold of 0.7. The labels in the figure indicate the learned audiovisual content and are not used in the training procedure.

However, there still exist some failure cases. Concretely, an out-of-view sound-maker is inaccessible to the current visual clustering; hence the DMC model improperly correlates the background music with the kitchenware in the second video. Similarly, the talking sound in the third video comes from both the visible woman and the out-of-view photographer, but our model simply extracts all the human voices and assigns them to the visual center of the woman. Such failure cases remind us that real-life audiovisual understanding is far more difficult than we have imagined. Moreover, to perceive the audio centers more naturally, we reconstruct the audio signal from the masked spectrogram information and show it in the released video demo.

6. Discussion

In this paper, we aim to explore the elaborate correspondence between audio and visual messages in unconstrained environments by resorting to the proposed deep multimodal clustering method. In contrast to the previous rough correspondence, our model can efficiently learn more effective audio and visual features, which even exceed human performance. Further, such elaborate learning contributes to noticeable improvements in complicated audiovisual tasks, such as sound localization, multisource SED, and audiovisual understanding.
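For reference, the segment-based F1 and error rate used in the SED evaluation of Sec. 5.2 can be sketched as follows. This is a minimal NumPy implementation of the standard DCASE segment-based definitions (F1 from pooled true/false positives; error rate from per-segment substitutions, deletions, and insertions); the function and variable names are ours.

```python
import numpy as np

def segment_metrics(ref, est):
    """Segment-based F1 and error rate, as defined for DCASE SED.

    ref, est: boolean arrays of shape (n_segments, n_classes) marking
    which sound classes are active in each one-second segment.
    """
    ref = np.asarray(ref, dtype=bool)
    est = np.asarray(est, dtype=bool)

    # Pooled counts over all segments and classes.
    tp = np.logical_and(ref, est).sum()
    fp = np.logical_and(~ref, est).sum()
    fn = np.logical_and(ref, ~est).sum()
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

    # Error rate: per-segment substitutions, deletions, insertions,
    # normalized by the number of active reference events.
    fn_seg = np.logical_and(ref, ~est).sum(axis=1)
    fp_seg = np.logical_and(~ref, est).sum(axis=1)
    subs = np.minimum(fn_seg, fp_seg).sum()
    dels = np.maximum(0, fn_seg - fp_seg).sum()
    ins = np.maximum(0, fp_seg - fn_seg).sum()
    n_ref = ref.sum()
    er = (subs + dels + ins) / n_ref if n_ref else 0.0
    return float(f1), float(er)

# Toy example: 3 segments, 2 classes (say, "car" and "speech").
ref = [[1, 0], [1, 1], [0, 1]]
est = [[1, 0], [1, 0], [1, 1]]
f1, er = segment_metrics(ref, est)
print(f1, er)  # 0.75 0.5
```

A lower error rate is better, and it can exceed 1.0 when insertions dominate, which is why F1 and error rate are reported together in Table 4.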
Although the proposed DMC shows considerable superiority over other methods in these tasks, there still remains one problem: the number of clusters k is pre-fixed instead of automatically determined. When there is a single sound, it is easy to set k = 2 for foreground and background. But when multiple sound-makers emerge, it becomes difficult to pre-determine the value of k. Although we can obtain distinct clusters after setting k = 10 in the audiovisual understanding task, a more reliable method for determining the number of audiovisual components is still expected [29], which will be the focus of future work.

7. Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under grants 61772427 and 61751202. We thank Jianlin Su for the constructive opinion, and thank Zheng Wang and the reviewers for refreshing the paper.

References

[1] S. Adavanne and T. Virtanen. A report on sound event detection with different binaural features. Technical report, DCASE2017 Challenge, September 2017.
[2] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 37–45. IEEE, 2015.
[3] R. Arandjelovic and A. Zisserman. Look, listen and learn. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 609–617. IEEE, 2017.
[4] R. Arandjelović and A. Zisserman. Objects that sound. arXiv preprint arXiv:1712.06651, 2017.
[5] Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pages 892–900, 2016.
[6] Y. Aytar, C. Vondrick, and A. Torralba. See, hear, and read: Deep aligned representations. arXiv preprint arXiv:1706.00932, 2017.
[7] G. A. Calvert, E. T. Bullmore, M. J. Brammer, R. Campbell, S. C. Williams, P. K. McGuire, P. W. Woodruff, S. D. Iversen, and A. S. David.
Activation of auditory cortex during silent lipreading. Science, 276(5312):593–596, 1997.
[8] A. L. Casanovas, G. Monaci, P. Vandergheynst, and R. Gribonval. Blind audiovisual source separation based on sparse redundant representations. IEEE Transactions on Multimedia, 12(5):358–371, 2010.
[9] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
[10] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. arXiv preprint arXiv:1804.03619, 2018.
[11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[12] R. Gao, R. Feris, and K. Grauman. Learning to separate object sounds by watching unlabeled video. arXiv preprint arXiv:1804.01665, 2018.
[13] D. Harwath and J. R. Glass. Learning word-like units from joint audio-visual analysis. arXiv preprint arXiv:1701.07481, 2017.
[14] D. Harwath, A. Torralba, and J. Glass. Unsupervised learning of spoken language with visual context. In Advances in Neural Information Processing Systems, pages 1858–1866, 2016.
[15] T. Heittola and A. Mesaros. DCASE 2017 challenge setup: Tasks, datasets and baseline system. Technical report, DCASE2017 Challenge, September 2017.
[16] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 131–135. IEEE, 2017.
[17] K. Hikosaka, E. Iwai, H. Saito, and K. Tanaka.
Polysensory properties of neurons in the anterior bank of the caudal superior temporal sulcus of the macaque monkey. Journal of Neurophysiology, 60(5):1615–1637, 1988.
[18] N. P. Holmes and C. Spence. Multisensory integration: space, time and superadditivity. Current Biology, 15(18):R762–R764, 2005.
[19] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
[20] I.-Y. Jeong, S. Lee, Y. Han, and K. Lee. Audio event detection using multiple-input convolutional neural network. Technical report, DCASE2017 Challenge, September 2017.
[21] P. Krähenbühl, C. Doersch, J. Donahue, and T. Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.
[22] C. Kroos and M. D. Plumbley. Neuroevolution for sound event detection in real life audio: A pilot study. Technical report, DCASE2017 Challenge, September 2017.
[23] X. Li, D. Hu, and F. Nie. Deep binary reconstruction for cross-modal hashing. In Proceedings of the 2017 ACM on Multimedia Conference, pages 1398–1406. ACM, 2017.
[24] A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. arXiv preprint arXiv:1804.03641, 2018.
[25] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In European Conference on Computer Vision, pages 801–816. Springer, 2016.
[26] K. J. Piczak. Environmental sound classification with convolutional neural networks. In Machine Learning for Signal Processing (MLSP), 2015 IEEE 25th International Workshop on, pages 1–6. IEEE, 2015.
[27] K. J. Piczak. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 1015–1018. ACM, 2015.
[28] M. J. Proulx, D. J. Brown, A. Pasqualotto, and P. Meijer.
Multisensory perceptual learning and sensory substitution. Neuroscience & Biobehavioral Reviews, 41:16–25, 2014.
[29] S. Ray and R. H. Turi. Determination of number of clusters in k-means clustering and application in colour image segmentation. In Proceedings of the 4th International Conference on Advances in Pattern Recognition and Digital Techniques, pages 137–143. Calcutta, India, 1999.
[30] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3859–3869, 2017.
[31] A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. S. Kweon. Learning to localize sound source in visual scenes. arXiv preprint arXiv:1803.03849, 2018.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] D. Wang and Q. Liu. An optimization view on dynamic routing between capsules. 2018.
[34] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. arXiv preprint arXiv:1505.00687, 2015.
[35] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.
[36] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba. The sound of pixels. arXiv preprint arXiv:1804.03160, 2018.

8. Appendix

8.1. Approximated Maximization Function

Although the maximization function is not differentiable, it can be approximated via the following equation,

\max\{d_{i1}, d_{i2}, \ldots, d_{ik}\} \approx \lim_{z \to +\infty} \frac{1}{z} \log\left( \sum_{j=1}^{k} e^{d_{ij} z} \right),   (9)

where z is a hyperparameter that controls the precision of the approximation. Instead of the above multi-variable case, we first consider the maximization function of two variables \{x_1, x_2\}. It is well known that

\max\{x_1, x_2\} = \frac{1}{2}\left( |x_1 + x_2| + |x_1 - x_2| \right), \quad \text{s.t. } x_1 \ge 0,\; x_2 \ge 0.   (10)

Hence the approximation of the maximization reduces to approximating the absolute value function f(x) = |x|. As the derivative of f(x) is f'(x) = +1 for x \ge 0 and -1 for x < 0, it can be directly replaced by the adaptive tanh function [23], i.e.,

f'(x) = \lim_{z \to +\infty} \frac{e^{zx} - e^{-zx}}{e^{zx} + e^{-zx}}.

Then, we can obtain the approximated absolute value function via integration:

f(x) = \lim_{z \to +\infty} \frac{1}{z} \log\left( e^{zx} + e^{-zx} \right).   (11)

Hence, the maximization function over two variables can be written as

\max\{x_1, x_2\} = \lim_{z \to +\infty} \frac{1}{2z} \log\left( e^{2zx_1} + e^{2zx_2} + e^{-2zx_1} + e^{-2zx_2} \right).   (12)

As z \to +\infty and x_1 \ge 0, x_2 \ge 0, the terms with negative exponents vanish and Eq. 12 can be approximated as

\max\{x_1, x_2\} \approx \lim_{z \to +\infty} \frac{1}{z} \log\left( e^{zx_1} + e^{zx_2} \right).   (13)

At this point, the maximization function has become differentiable for two variables, and it can be extended to three or more variables. Concretely, for three variables \{x_1, x_2, x_3\}, let c = \max\{x_1, x_2\}; then

\max\{x_1, x_2, x_3\} = \max\{c, x_3\} \approx \lim_{z \to +\infty} \frac{1}{z} \log\left( e^{\log(e^{zx_1} + e^{zx_2})} + e^{zx_3} \right) = \lim_{z \to +\infty} \frac{1}{z} \log\left( e^{zx_1} + e^{zx_2} + e^{zx_3} \right).   (14)

Hence, for the multivariable case, we have

\max\{x_1, x_2, \ldots, x_n\} \approx \lim_{z \to +\infty} \frac{1}{z} \log\left( \sum_{i=1}^{n} e^{zx_i} \right).   (15)

8.2. Derivation of Eq. 6

To substitute d_{ij} = -\left\langle u_i, \frac{c_j}{\|c_j\|} \right\rangle into \sum_{i=1}^{n} s_{ij} \frac{\partial d_{ij}}{\partial c_j} = 0, we first give the derivative of d_{ij} w.r.t. c_j:

\frac{\partial d_{ij}}{\partial c_j} = -\frac{\partial}{\partial c_j}\left( \frac{u_i^T c_j}{\|c_j\|} \right) = -\frac{u_i}{\|c_j\|} + \frac{u_i^T c_j \cdot c_j}{\|c_j\|^3}.   (16)

Then, by substituting Eq. 16 into \sum_{i=1}^{n} s_{ij} \frac{\partial d_{ij}}{\partial c_j} = 0, we have

\sum_{i=1}^{n} s_{ij} \frac{u_i^T c_j}{\|c_j\|} \cdot \frac{c_j}{\|c_j\|} = \sum_{i=1}^{n} s_{ij} u_i.   (17)

By taking the norm of both sides of Eq. 17, we have

\left| \sum_{i=1}^{n} s_{ij} \frac{u_i^T c_j}{\|c_j\|} \right| \cdot \left\| \frac{c_j}{\|c_j\|} \right\| = \left\| \sum_{i=1}^{n} s_{ij} u_i \right\|.   (18)

As \left\| \frac{c_j}{\|c_j\|} \right\| = 1, Eq. 18 becomes

\left\| \sum_{i=1}^{n} s_{ij} u_i \right\| = \left| \sum_{i=1}^{n} s_{ij} u_i^T \cdot \frac{c_j}{\|c_j\|} \right| = \left\| \sum_{i=1}^{n} s_{ij} u_i \right\| |\cos \theta|.   (19)

As d_{ij} = -\left\langle u_i, \frac{c_j}{\|c_j\|} \right\rangle, we expect to maximize the cosine proximity between these two vectors, i.e., \theta = 0. Hence, \sum_{i=1}^{n} s_{ij} u_i and c_j should lie in the same direction, i.e.,

\frac{c_j}{\|c_j\|} = \frac{\sum_{i=1}^{n} s_{ij} u_i}{\left\| \sum_{i=1}^{n} s_{ij} u_i \right\|}.   (20)
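The log-sum-exp approximation of Eq. 9 (and its multivariable form, Eq. 15) is easy to check numerically. A minimal sketch, using the standard max-shift trick to keep the exponentials from overflowing for large z (the function name is ours):

```python
import numpy as np

def smooth_max(x, z=50.0):
    """Differentiable approximation of max via Eq. 9:
    (1/z) * log(sum_i exp(z * x_i)), computed stably by
    shifting by the true max before exponentiating."""
    x = np.asarray(x, dtype=float)
    m = x.max()
    return float(m + np.log(np.exp(z * (x - m)).sum()) / z)

x = [0.3, 1.2, 0.7]
# The approximation upper-bounds the true max, and the gap
# (at most log(n)/z) shrinks as z grows.
for z in (1.0, 10.0, 100.0):
    print(z, smooth_max(x, z) - max(x))
```

The gap to the true maximum is bounded by log(n)/z, so z directly controls the precision of the approximation, as stated after Eq. 9.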
