Unsupervised Learning of Semantic Audio Representations


Authors: Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous

Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel P. W. Ellis, Shawn Hershey, Jiayang Liu, R. Channing Moore, Rif A. Saurous

Google, Inc., Mountain View, CA, and New York, NY, USA
{arenjansen,plakal,ratheet,dpwe,shershey,jiayl,channingmoore,rif}@google.com

ABSTRACT

Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related. Without labels to ground them, these constraints are incompatible with classification loss functions. However, they may still be leveraged to identify geometric inequalities needed for triplet loss-based training of convolutional neural networks. The result is low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively. Moreover, in limited-supervision settings, our unsupervised embeddings double the state-of-the-art classification performance.

Index Terms—Unsupervised learning, triplet loss, sound classification.

1. INTRODUCTION

The last few years have seen great advances in nonspeech audio processing, as popular deep learning architectures developed in the speech and image processing communities have been ported to this relatively understudied domain [1, 2, 3, 4].
However, these data-hungry neural networks are not always matched to the available training data in the audio domain. While unlabeled audio is easy to collect, manually labeling data for each new sound application remains notoriously costly and time consuming. We seek to alleviate this incongruity by developing alternative learning strategies that exploit basic semantic properties of sound that are not grounded to an explicit labeling.

Recent efforts in the computer vision community have identified several class-independent constraints on natural images and videos that can be used to learn semantic representations [5]. For example, object categories are invariant to camera angle, and tracking unknown objects in videos can provide novel examples for the same unknown category [6]. For audio, we can identify several analogous constraints, which are not tied to any particular inventory of sound categories. First, we can apply category-preserving transformations to individual events of unknown type, such as adding Gaussian noise, translation in time within the analysis window, and small perturbations in frequency. Second, pairs of unknown sounds can be mixed to provide new, often natural sounding examples of both. Finally, sounds from within the same vicinity in a recording likely contain multiple examples of the same (or related) unknown categories. (Portions of this paper were submitted to the ML4Audio 2017 workshop.)

If provided a set of labeled sound events of the form "X is an example of category C", applying the above semantic constraints is interpretable as regular labeled data augmentation. However, when each example has an unknown categorical assignment, standard classification loss functions cannot be applied.
We instead resort to deep metric learning using triplet loss [7, 8, 9], which finds a nonlinear mapping into a low-dimensional space where simple Euclidean distance can express any desired relationship between examples of the form "X is more like Y than like Z". Critically, while labeled examples can be converted to triplets to explicitly learn a semantic embedding (i.e., X has the same class as Y, but different than Z), a triplet relationship need not be anchored to an explicit categorical assignment. This makes it a natural fit for our set of semantic constraints; indeed, a noisy version of a sound event is more semantically similar to the clean recording than another arbitrary sound. Moreover, since we can generate as many triplets from as much unlabeled data as we wish, we can support arbitrarily complex neural architectures.

To validate these ideas, we train embeddings using state-of-the-art convolutional architectures on millions of triplets sampled from the AudioSet dataset [10], both with and without using the label information. We evaluate the learned embeddings as features for query-by-example sound retrieval and supervised sound event classification. Our results demonstrate that highly complex models can be trained from unlabeled triplets alone to produce representations that recover up to 84% of the performance gap between using the raw log mel spectrogram inputs and using fully-supervised embeddings trained on millions of labeled examples.

2. RELATED WORK

There have been multiple past efforts to perform unsupervised deep representation learning on nonspeech audio. Lee et al. [11] applied convolutional deep belief networks to extract a representation for speech and music, but not general-purpose nonspeech audio. More recently, a denoising autoencoder variant was used to extract features for environmental sound classification [12].
While both approaches produced useful representations for their respective tasks, neither explicitly introduced training mechanisms to elicit semantic structure in their learned embeddings. Classical distance metric learning has also been applied to music in the past [13].

Recent zero-resource efforts in the speech processing community have explicitly aimed to learn meaningful linguistic units from untranscribed speech [14]. With this goal, several weak supervision mechanisms have been proposed that are analogous to what we attempt to achieve for nonspeech audio. For speech, the relevant constraints are derived from the inherent linguistic hierarchy: repeated unknown words have the same unknown phonetic structure [15], conversations with the same unknown topic have shared unknown words [16], etc. Various forms of deep metric learning [17, 18, 19, 20] have been successfully applied using these speech-specific constraints.

Finally, so-called self-supervised approaches in the computer vision community are analogous to what we propose in this paper for audio. There, constraints based on egomotion [21], spatial/compositional context [22, 23], object tracking [6], and colorization [5] have all been evaluated. Recent efforts have extended this principle of self-supervision to joint audio-visual models that learn speech or audio embeddings using semantic constraints imposed by the companion visual signal [24, 25, 26, 27].

3. LEARNING ALGORITHM

Our training procedure consists of two stages: (i) sampling training triplets from a collection of unlabeled audio recordings, and (ii) learning a map from input context windows extracted from spectrograms (matrices with F frequency channels and T frames) to a lower d-dimensional vector space using triplet loss optimization of convolutional neural networks.
We summarize the triplet loss metric learning framework, and then formally define each of our triplet sampling strategies.

3.1. Metric Learning with Triplet Loss

The goal of triplet loss-based metric learning is to estimate a map g : R^{F×T} → R^d such that simple (e.g.) Euclidean distance in the target space corresponds to highly complex geometric relationships in the input space. Training data is provided as a set T = {t_i}_{i=1}^N of example triplets of the form t_i = (x_a^{(i)}, x_p^{(i)}, x_n^{(i)}), where x_a^{(i)}, x_p^{(i)}, x_n^{(i)} ∈ R^{F×T} are commonly referred to as the anchor, positive, and negative, respectively. The loss is given by

    \mathcal{L}(\mathcal{T}) = \sum_{i=1}^{N} \left[ \| g(x_a^{(i)}) - g(x_p^{(i)}) \|_2^2 - \| g(x_a^{(i)}) - g(x_n^{(i)}) \|_2^2 + \delta \right]_+ ,    (1)

where ‖·‖_2 is the L_2 norm, [·]_+ is the standard hinge loss, and δ is a nonnegative margin hyperparameter. Intuitively, the optimization attempts to learn an embedding of the input data such that positive examples end up closer to their anchors than the corresponding negatives do, by some margin. Notice that the loss is identically zero when all training triplets satisfy the inequality (dropping the index)

    \| g(x_a) - g(x_p) \|_2^2 + \delta \le \| g(x_a) - g(x_n) \|_2^2 .    (2)

Thus we may also view the triplets as a collection of hard constraints on the inputs. This is an extremely flexible construct: any pairwise relationship between input examples that permits a relative ranking (i.e., (x_a, x_p) are more similar than (x_a, x_n)) complies. The learned distance then becomes a proxy for that pairwise relationship.

The map g can be defined by a fully-connected, d-unit output layer of any modern deep learning architecture. The optimization is performed with stochastic gradient descent, though training time is greatly decreased with the use of within-batch semi-hard negative mining [28].
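As a concrete illustration, the hinge-based loss of Eq. (1) can be sketched in NumPy. This is a minimal sketch for checking the geometry, not the paper's training implementation; the function name and batch shapes are ours.

```python
import numpy as np

def triplet_loss(g_a, g_p, g_n, delta=0.1):
    """Hinge-based triplet loss of Eq. (1).

    g_a, g_p, g_n: (N, d) arrays holding the embedded anchors,
    positives, and negatives; delta is the nonnegative margin.
    """
    d_ap = np.sum((g_a - g_p) ** 2, axis=1)  # squared L2 anchor-positive
    d_an = np.sum((g_a - g_n) ** 2, axis=1)  # squared L2 anchor-negative
    # Hinge: each triplet contributes only when it violates Eq. (2).
    return float(np.sum(np.maximum(d_ap - d_an + delta, 0.0)))
```

Consistent with Eq. (2), the loss is exactly zero when every positive is closer to its anchor than the corresponding negative by at least the margin δ.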
Here, all examples in the batch are transformed under the current state of g, and the available negatives are reassigned to the anchor-positive pairs to make more difficult triplets. Specifically, we choose the closest negative to the anchor that is still further away than the positive (the absolute closest is vulnerable to label noise).

3.2. Triplet Sampling Methods

3.2.1. Explicitly Labeled Data

In standard supervised learning, we are provided a set of labeled examples of the form Z = {(x_i, y_i)}, where each x_i ∈ R^{F×T} and y_i ∈ C for some set C of semantic categories. Triplet loss-based metric learning was originally formulated for this setting, and converting Z to a set of triplets is straightforward. For each c ∈ C, we randomly sample anchor-positive pairs (x_a, x_p) from Z such that y_a = y_p = c. Then, for each sampled pair, we attach as the triplet's negative an example (x_n, y_n) ∈ Z such that y_n ≠ c. This procedure sets the supervised performance topline in our experiments.

3.2.2. Gaussian Noise

Since the introduction of denoising autoencoders, learning representations that are invariant to small perturbations in the original data space has been a standard tool for unsupervised learning. However, when more complex convolutional architectures with pooling are desired, inverting the encoder function is complicated [29]. Since we are not interested in actually reconstructing the inputs, we can instead use triplet loss to effect similar representational properties using an arbitrary deep learning architecture. For each x_i ∈ R^{F×T} in the provided set of unlabeled examples X, we simply sample one or more anchor-positive pairs of the form (x_i, x_p), where element x_{p,tf} = x_{i,tf}(1 + |ε_{tf}|) for ε_{tf} ~ N(0, σ²), a Gaussian distribution with mean 0 and standard deviation σ (a model hyperparameter).
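The multiplicative half-Gaussian perturbation above can be sketched as follows; this is a minimal NumPy sketch, with the function name and default σ chosen by us for illustration.

```python
import numpy as np

def noise_positive(x, sigma=0.5, rng=None):
    """Form a positive from an anchor spectrogram x (an (F, T)
    nonnegative energy matrix) via x_p[t,f] = x[t,f] * (1 + |eps|),
    with eps drawn i.i.d. from N(0, sigma^2) per time-frequency cell.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.normal(0.0, sigma, size=x.shape)
    return x * (1.0 + np.abs(eps))
```

Because the noise factor is 1 + |ε| ≥ 1, the positive never has less energy than the anchor in any cell.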
For each sampled anchor-positive pair, we simply choose another example in X as the negative to complete the triplet.

3.2.3. Time and Frequency Translation

When processing long context windows of spectrogram, we are provided snapshots of the contained sound events with arbitrary temporal offsets and clipping. A transient sound event with unknown category that starts at the left edge of the window maintains its semantic assignment if it begins somewhere in the center. However, in the input space, this simple translation in time can produce dramatic transformations of the training data. Similarly, small translations in frequency may leave the semantics unchanged while greatly perturbing the input space. To exploit this, we generate training triplets as follows. For each x_i ∈ R^{F×T} in the provided set of unlabeled examples X, we sample one or more anchor-positive pairs of the form (x_i, x_p), where x_p = Trunc_S(Circ_T(x_i)). Here, Circ_T is a circular shift in time by an integer number of frames sampled uniformly from [0, T−1], where T is the number of frames in the example. Trunc_S is a truncated shift in frequency by an integer number of bins sampled uniformly from [−S, S] (missing values after the shift are set to zero energy). We again choose another example in X as the negative to complete the triplet.

3.2.4. Example Mixing

Sound is often referred to as transparent, since we can superimpose sound recordings and still hear the constituents to some degree. We can use this intuition to construct triplets by mixing sounds together. One approach is to form a positive by mixing a random example with the anchor. However, if we subsequently attach a random negative, we cannot guarantee we want to satisfy the inequality of Eq. (2).
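Before moving on, the translation positive x_p = Trunc_S(Circ_T(x_i)) from Section 3.2.3 can be sketched as follows; a minimal NumPy sketch under the (F, T) array convention of Section 3, with names that are ours.

```python
import numpy as np

def translation_positive(x, S=10, rng=None):
    """Positive via circular time shift, then truncated frequency shift.

    x: (F, T) spectrogram. The time shift is drawn uniformly from
    [0, T-1]; the frequency shift from [-S, S], with vacated bins
    set to zero energy.
    """
    rng = np.random.default_rng() if rng is None else rng
    F, T = x.shape
    xp = np.roll(x, int(rng.integers(0, T)), axis=1)  # Circ_T (circular)
    s = int(rng.integers(-S, S + 1))                  # Trunc_S amount
    shifted = np.zeros_like(xp)
    if s > 0:
        shifted[s:, :] = xp[:-s, :]   # shift up in frequency, zero-fill
    elif s < 0:
        shifted[:s, :] = xp[-s:, :]   # shift down in frequency, zero-fill
    else:
        shifted = xp
    return shifted
```

The circular time shift preserves total energy, while the truncated frequency shift can only discard energy at the spectrogram edge.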
To see the problem with a random negative, suppose the anchor is a dog bark and we mix it with a siren: we cannot assume a dog growl as the negative should be mapped further from the anchor than the mixture. To solve this, we instead mix random anchor and negative examples to form the positives. Given a random anchor x_a and negative x_n containing energies in each time-frequency cell, we construct the positive x_p = x_a + α [E(x_a)/E(x_n)] x_n, where E(x) is the total energy of example x, and α is a hyperparameter. Note that this is the only triplet type considered that is not strictly compatible with semi-hard negative mining, since negatives are not randomly sampled.

3.2.5. Temporal Proximity

Audio recordings from real-world environments do not consist of events drawn completely at random from the space of all possible sounds. Instead, a given environment has a limited subset of sound-creating objects that are often closely, or even causally, related. As such, two events in the same recording are more likely to be of the same, or at least related, event categories than any two random events in a large audio collection. We can use this intuition to sample triplets of the form (x_a, x_p, x_n), where x_a and x_p are from the same recording and x_n is from a different recording. We can further impose the constraint that |time(x_a) − time(x_p)| < Δt, where time(x) is the start time of example x, and Δt is a hyperparameter. Note that if overlapping context windows and sufficiently small values of Δt are used, this method is functionally similar to the time translation approach.

3.2.6. Joint Training

In supervised learning settings, it is not always trivial to combine multiple data sources from separate domains with distinct label inventories. For cases where simply using a classification layer that is the union of the categories is not possible (e.g.,
mutual exclusivity is not implied across the two class sets), we must resort to multi-task training objectives. For triplet loss, this is not a problem: all triplet sets produced using the sampling methods outlined in this section can simply be mixed together for training a joint embedding that reflects them all to whatever degree possible. In general, if we have preconceived notions of each constraint's importance, we can either introduce a source-dependent weight to each triplet's contribution to the loss function in Eq. (1) or, alternatively, use varying triplet sample sizes for each source.

4. EXPERIMENTS

We evaluate embeddings that result from the triplet sampling methods of Section 3.2 in two downstream tasks: (i) query-by-example semantic retrieval of sound segments, and (ii) training shallow fully-connected sound event classifiers. The query-by-example task does not involve any subsequent supervised training and thus directly measures the intrinsic semantic consistency of the learned representation. The shallow model measures how easily a relatively simple, non-convolutional classifier network can predict the sound event categories given the labeled data. Finally, we also perform a lightly-supervised classification experiment, where we repeat the shallow model evaluation with only a small fraction of the labeled data. This allows us to measure the utility of unlabeled data in reducing annotation requirements for any sound event classification application where unlabeled data is plentiful.

4.1. Dataset and Features

We use Google's recently released AudioSet database of manually annotated sound events [10] for both training and evaluation. AudioSet consists of over 2 million 10-second audio segments from YouTube videos, each labeled using a comprehensive ontology of 527 sound event categories (minimum 120 segments/class). We use an internal version of the unbalanced training set (50% larger than the released set), which we split into train and development subsets. We report all performance metrics on the released evaluation set. We compute 64-channel mel-scale spectrograms using an FFT window size of 25 ms with a 10 ms step. Triplets are sampled in this energy domain since some of our triplet sampling mechanisms require an energy interpretation, but a stabilized logarithm is applied before input to our models. We then process these spectrograms into non-overlapping 0.96-second context windows, such that each training example is an F=64 by T=96 matrix. Each embedding model was trained on on the order of 10 million triplets (40 million for the joint model).

Table 1. Segment retrieval mean average precision (mAP) as a function of: (left) Gaussian width σ for Gaussian noise triplets; (middle) frequency shift range S for translation triplets; and (right) mixing weight α for mixed example triplets. Best results in bold.

    σ     mAP   |  S    mAP   |  α     mAP
    0.1   0.453 |  0    0.461 |  0.1   0.483
    0.25  0.466 |  2    0.492 |  0.25  0.489
    0.5   0.478 |  5    0.493 |  0.5   0.487
    1.0   0.478 |  10   0.508 |  1.0   0.476

4.2. Model Architecture

Given its impressive performance on previous large-scale sound classification evaluations [2], we use the ResNet-50 convolutional neural network architecture. Each input 64×96 context window is first processed by a layer of 64 convolutional 7×7 filters, followed by a 3×3 max pool with stride 2 in both dimensions. This is followed by 4 standard ResNet blocks and a final average pool over time and frequency to a 2048-dimensional representation. Instead of the classification output layer used in [2], all of our triplet models use a 128-unit fully-connected linear output layer.
This produces a vector of dimension d = 128, which represents a factor of 48 reduction from the original input dimensionality of 64×96. We also employ the standard practice of length-normalizing the network output before input to the loss function (i.e., g = h/‖h‖_2, where h is the output embedding layer of the network). This normalization means the squared Euclidean distance used in the loss is proportional to cosine distance, a common choice for learned representations. All training is performed using Adam, with hyperparameters optimized on the development set. We use semi-hard negative mining and a learning rate of 10^{-4} for the supervised, temporal proximity, and combined unsupervised models; otherwise 10^{-6} is used with mining disabled. The margin hyperparameter δ is set to 0.1 in all cases.

Table 2. Mean average precision for segment retrieval and shallow model classification using the original log mel spectrogram and triplet embeddings as features. All embedding models use the same ResNet-50 architecture with a 128-dimensional linear output layer.

                                          QbE Retrieval   Classification      Classification
                                                          (1 layer, 512 units) (2 layer, 512 units)
    Representation                        mAP   recovery  mAP   recovery      mAP   recovery
    Explicit Label Triplet (topline)      0.790  100%     0.288  100%         0.289  100%
    Log Mel Spectrogram (baseline)        0.423    0%     0.065    0%         0.102    0%
    Gaussian Noise (σ = 0.5)              0.478   15%     0.096   14%         0.114    6%
    T/F Translation (S = 10)              0.508   23%     0.108   19%         0.125   12%
    Mixed Example (α = 0.25)              0.489   18%     0.103   17%         0.122   11%
    Temporal Proximity (Δt = 10 s)        0.562   38%     0.226   72%         0.241   74%
    Joint Unsupervised Triplet            0.575   41%     0.244   80%         0.259   84%

Table 3. Lightly-supervised classifier performance averaged over three trials, each trained with a different random draw of 20 segments/class (totaling 0.5% of the labeled data).

    Representation               Classifier Architecture    mAP
    Log Mel Spectrogram          Fully Connected (4x512)    0.032
    Log Mel Spectrogram          ResNet-50                  0.072
    Joint Unsupervised Triplet   Fully Connected (1x512)    0.143

4.3. Query-by-Example Retrieval

Our first evaluation task is query-by-example (QbE) segment retrieval. Here, no additional training is performed, making it a direct measurement of inherent semantic representational quality. We begin by mapping each 0.96-second context window in the evaluation set to its corresponding 128-dimensional embedding vector and average these across each AudioSet segment to arrive at a segment-level embedding. For each sound event category, from the AudioSet evaluation set we sample 100 segments where it is present, and 100 segments where it is not. We then compute the cosine distance between all 4,950 within-class pairs as target trials, and all 10,000 (present, not-present) pairs as nontarget trials. We sort this set of pairs by ascending distance and compute the average precision (AP) of ranking target over nontarget trials (random chance gives 0.33). We repeat this for each class and average the per-class AP scores to produce the reported mean average precision (mAP).

Table 1 shows the retrieval mAP for three of the triplet sampling mechanisms as a function of their associated hyperparameters. In each case, the optimal setting on the validation set was also optimal for eval (listed for each hyperparameter in bold). For the Gaussian noise and example mixing, we observed relatively weak dependence on the sampling hyperparameter values. However, we found for the translation method that allowing larger shifts in frequency produces substantial improvements over time translations alone.
While this may be surprising from a signal processing point of view, two-dimensional translations help to force increased spectral localization in the early layers' filters of the convolutional network, which is observed in fully-supervised models. Note that since all AudioSet clips are limited to at most 10 seconds, we did not explore additional limiting of proximity (i.e., Δt is effectively the clip duration).

Table 2 shows the retrieval performance for each of the evaluated representations. The fully-supervised topline uses explicitly labeled data to sample the triplets. As a baseline, we evaluate the retrieval performance achieved using the raw log mel spectrogram features (each 64×96 context window is treated as a 6144-dimensional vector before segment-level averaging). For each of the unsupervised methods, we tuned on the development set and the reported performance here is on the separate evaluation set. At the bottom, we also list the performance of the joint embedding, trained on a mixture of all four unsupervised triplet types (approximately equal numbers of triplets from each). Alongside each mAP value, we also list the percentage of the baseline-to-topline performance gap recovered using each given unsupervised triplet embedding. We find that each unsupervised triplet method significantly improves retrieval performance over the input features, with the joint unsupervised model improving mAP by 15% absolute over the input spectrogram features and recovering more than 40% of the performance gap.

4.4. Sound Classification

While the retrieval task measures how the geometric structure of the representation mirrors AudioSet classes, we are also interested in how our unsupervised methods aid an arbitrary downstream supervised task over the same or similar data. To test this, we use our various embeddings to train shallow, fully-connected networks using labeled AudioSet segments.
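Both evaluation tasks report per-class average precision averaged into mAP. The ranking-based AP described in Section 4.3 (target pairs ranked above nontarget pairs by ascending distance) can be sketched as follows; a minimal NumPy sketch with illustrative names, not the evaluation code used in the paper.

```python
import numpy as np

def average_precision(distances, is_target):
    """AP for ranking target trials above nontarget trials.

    distances: (N,) pairwise distances; is_target: (N,) booleans
    marking within-class (target) pairs. Pairs are sorted by
    ascending distance, and precision is accumulated at each
    target hit in the ranked list.
    """
    order = np.argsort(distances)                  # closest pairs first
    hits = np.asarray(is_target, dtype=float)[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision_at_k * hits) / np.sum(hits))
```

The reported mAP is then the mean of this quantity over the 527 AudioSet classes.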
For each feature, we consider classifiers with 1 and 2 hidden layers of 512 units each. The output layer consists of independent logistic regression models for the 527 classes. For each class, we compute segment-level scores (the average of the frame-level predictions) for the evaluation set and compute average precision. We again report the mean average precision over the classes.

Table 2 shows the classification performance for each of the representation types. We again find substantial improvement over the input features in all cases, with temporal proximity the clear standout. Combining triplet sets provides additional gains, indicating the learned representation's ability to encode multiple types of semantic constraints for downstream tasks. Notice that our approach performs fully-unsupervised training of a ResNet-50 triplet embedding model that achieves 85% (0.244/0.288) of the mAP of a fully-supervised ResNet-50 triplet embedding model, when both are coupled to a single hidden layer downstream classifier.

Finally, Table 3 shows the performance of lightly-supervised classifiers trained on just 20 examples per class. To account for the variability in sample selection, we generate 3 random training samples, run the experiment separately on each, and report average performance. Here we evaluate three models: (i) a ResNet-50 classifier model, (ii) a fully-connected model trained from log mel spectrograms, and (iii) a fully-connected model trained on the joint unsupervised triplet embedding (last line of Table 2). Since our unsupervised triplet embeddings are derived from the full AudioSet train set (as unlabeled data), a single layer classifier trained on top doubles the mAP of a full ResNet-50 classifier trained from raw inputs.

5. CONCLUSIONS

We have presented a new approach to unsupervised audio representation learning that explicitly elicits semantic structure.
By sampling triplets using a variety of audio-specific semantic constraints that do not require labeled data, we learn a representation that greatly outperforms the raw inputs on both sound event retrieval and classification tasks. We found that the various semantic constraints are complementary, producing improvements when combined to train a joint triplet loss embedding model. Finally, we demonstrated that our best unsupervised embedding provides a great advantage when training sound event classifiers in limited-supervision scenarios.

6. REFERENCES

[1] Naoya Takahashi, Michael Gygli, Beat Pfister, and Luc Van Gool, "Deep convolutional neural networks and data augmentation for acoustic event detection," arXiv preprint arXiv:1604.07160, 2016.
[2] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al., "CNN architectures for large-scale audio classification," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 131–135.
[3] Emre Cakır, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," arXiv preprint arXiv:1702.06286, 2017.
[4] Yun Wang and Florian Metze, "A first attempt at polyphonic sound event detection using connectionist temporal classification," in Proc. of ICASSP, 2017.
[5] Richard Zhang, Phillip Isola, and Alexei A. Efros, "Colorful image colorization," in European Conference on Computer Vision. Springer, 2016, pp. 649–666.
[6] Xiaolong Wang and Abhinav Gupta, "Unsupervised learning of visual representations using videos," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2794–2802.
[7] Kilian Q. Weinberger and Lawrence K. Saul, "Distance metric learning for large margin nearest neighbor classification," Journal of Machine Learning Research, vol. 10, no. Feb, pp. 207–244, 2009.
[8] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu, "Learning fine-grained image similarity with deep ranking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1386–1393.
[9] Elad Hoffer and Nir Ailon, "Deep metric learning using triplet network," in International Workshop on Similarity-Based Pattern Recognition. Springer, 2015, pp. 84–92.
[10] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter, "Audio Set: A strongly labeled dataset of audio events," in Proceedings of ICASSP, 2017.
[11] Honglak Lee, Peter Pham, Yan Largman, and Andrew Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems, 2009, pp. 1096–1104.
[12] Yong Xu, Qiang Huang, Wenwu Wang, Peter Foster, Siddharth Sigtia, Philip J. B. Jackson, and Mark D. Plumbley, "Unsupervised feature learning based on deep models for environmental audio tagging," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1230–1241, 2017.
[13] Malcolm Slaney, Kilian Weinberger, and William White, "Learning a metric for music similarity," in Proceedings of the International Society of Music Information Retrieval, 2008, pp. 313–318.
[14] Maarten Versteegh, Roland Thiolliere, Thomas Schatz, Xuan-Nga Cao, Xavier Anguera, Aren Jansen, and Emmanuel Dupoux, "The Zero Resource Speech Challenge 2015," in Interspeech, 2015, pp. 3169–3173.
[15] Aren Jansen, Samuel Thomas, and Hynek Hermansky, "Weak top-down constraints for unsupervised acoustic model training," in ICASSP, 2013, pp. 8091–8095.
[16] Stella Frank, Naomi Feldman, and Sharon Goldwater, "Weak semantic context helps phonetic learning in a model of infant language acquisition," in ACL (1), 2014, pp. 1073–1083.
[17] Gabriel Synnaeve, Thomas Schatz, and Emmanuel Dupoux, "Phonetics embedding learning with side information," in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014, pp. 106–111.
[18] Herman Kamper, Micha Elsner, Aren Jansen, and Sharon Goldwater, "Unsupervised neural network based feature extraction using weak top-down constraints," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5818–5822.
[19] Neil Zeghidour, Gabriel Synnaeve, Nicolas Usunier, and Emmanuel Dupoux, "Joint learning of speaker and phonetic similarities with Siamese networks," in INTERSPEECH, 2016, pp. 1295–1299.
[20] Herman Kamper, Weiran Wang, and Karen Livescu, "Deep convolutional acoustic word embeddings using word-pair side information," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4950–4954.
[21] Pulkit Agrawal, Joao Carreira, and Jitendra Malik, "Learning to see by moving," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 37–45.
[22] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros, "Context encoders: Feature learning by inpainting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
[23] Carl Doersch, Abhinav Gupta, and Alexei A. Efros, "Unsupervised visual representation learning by context prediction," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1422–1430.
[24] Relja Arandjelović and Andrew Zisserman, "Look, listen and learn," arXiv preprint arXiv:1705.08168, 2017.
[25] Yusuf Aytar, Carl Vondrick, and Antonio Torralba, "SoundNet: Learning sound representations from unlabeled video," in Advances in Neural Information Processing Systems, 2016, pp. 892–900.
[26] David Harwath, Antonio Torralba, and James Glass, "Unsupervised learning of spoken language with visual context," in Advances in Neural Information Processing Systems, 2016, pp. 1858–1866.
[27] Herman Kamper, Shane Settle, Gregory Shakhnarovich, and Karen Livescu, "Visually grounded learning of keyword prediction from untranscribed speech," arXiv preprint arXiv:1703.08136, 2017.
[28] Florian Schroff, Dmitry Kalenichenko, and James Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[29] Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber, "Stacked convolutional auto-encoders for hierarchical feature extraction," Artificial Neural Networks and Machine Learning–ICANN 2011, pp. 52–59, 2011.
