Distributed Vector Representations of Folksong Motifs

Distributed V ector Represen tations of F olksong Motifs Aitor Arron te Alv arez 1 and F rancisco G´ omez-Martin 2 1 Cen ter for Language and T ec hnology , Univ ersit y of Haw aii at Manoa, United States arronte@hawaii.edu 2 Applied Mathematics Department, T ec hnical Universit y of Madrid, Spain fmartin@etsisi.upm.es Abstract. This article presen ts a distributed vector represen tation mo del for learning folksong motifs. A skip-gram version of word2v ec with nega- tiv e sampling is used to represent high quality em b eddings. Motifs from the Essen F olksong collection are compared based on their cosine similar- it y . A new ev aluation metho d for testing the quality of the em b eddings based on a melo dic similarit y task is presented to sho w ho w the vec- tor space can represent complex con textual features, and how it can b e utilized for the study of folksong v ariation. Keyw ords: folksong motifs · melodic context · motif em bedding · word2v ec. 1 In tro duction V ector represen tations of words ha v e be en widely used in Natural Language Pro- cessing (NLP) tasks [18]. F ollo wing the distributional h yp othesis [9], [5], vector space mo dels represen t, or em b ed, words that are seman tically related to each other closer in a contin uous vector space [24]. A recen t dev elopment in vector space mo dels is word2v ec [8, 13, 15], developed for learning high-quality word v ectors from large corp ora. A neural net work language mo del for learning word-em beddings was ﬁrst prop osed to learn a statistical language mo del and a word vector representation [1]. A simpler mo del using a neural net with a single hidden la yer to learn word v ector representations, and then train a language mo del was later developed [14]. W ord2vec follo ws this simpler approach in tw o steps: ﬁrst, contin uous word v ectors are learned using the simpler mo del [14], and then an n-gram is trained using these represen tations. The relation b etw een m usic and language has b een studied in the cognitiv e science literature. Even though they are treated as diﬀerent cognitive faculties, b oth share structural characteristics and generate similar exp ectations on the listener [2]. NLP metho ds hav e b een adapted and adopted in Music Informa- tion Retriev al (MIR) con texts [6], [4], [3]. More sp eciﬁcally , word2v ec was used to mo del musical contexts in w estern classical music works [10], and for chord recommendations [11]. In b oth cases the music comp ositions studied were com- plex p olyphonic works. The work presented in this article uses a muc h less data in tensive material: monophonic songs. 2 A. Arronte-Alv arez, F. G´ omez-Martin F ollo wing the distributional hypothesis in seman tics, the goal of this researc h is to adopt the skip-gram version of the word2v ec mo del for the distributional represen tation of melo dic units. Sev eral melo dic features suc h as con tour, group- ing, and small size motifs seem to b e part of the so called Statistical Music Uni- v ersals [17], [19]. This sequential pro cessing of melo dic units may b e related to the human capacity to group and comprehend motifs as units within a melo dic con text. Our hypothesis is that these units may relate to eac h other in a melo dy in similar wa ys as words do in sentences. If that is the case, the distributional h yp othesis should hold true for folksong melo dies. In the following sections a description of the skip-gram version of w ord2v ec to learn motifs from the Essen F olksong Collection [20] is presen ted. W e will present diﬀeren t similarity measures to determine ho w melo dic con text can capture the similarit y of folksong motifs. 2 W ord2vec: Represen ting F olksong Motifs in a Distributed V ector Space 2.1 W ord2v ec Mo del In the skip-gram version of the word2v ec mo del, the goal is to ﬁnd word em- b eddings that can predict the surrounding words of a target word in a sentence or do cumen t [15]. F ormally , the mo del can b e deﬁned in the follo wing terms: giv en a corpus W of words w and contexts c , the netw ork tries to predict the surrounding words of a target in a con text. The ob jectiv e of the skip-gram is to maximize the follo wing log probability: arg max θ Y w ∈ W " Y c ∈ C p ( c | w ; θ ) # (1) where p ( c | w ; θ ) is calculated by the softmax function: p ( c | w ; θ ) = e v c · v w P c 0 ∈ C e v c 0 · v w (2) where v c and v w ∈ R d are vector representations of v and c , and C is the set of all p ossible contexts. The set of parameters θ is comp osed of v c i , v w i for w ∈ W . Since the term p ( w ; θ ) inv olv es a summation ov er all possible contexts c 0 b e- comes computationally very intensiv e, and it is normally replaced with negativ e sampling [15]. This article uses this sampling techni que. The cosine similarity measure is used to determine the relatedness of tw o em b eddings. The metric for a pair of words w 1 and w 2 can b e deﬁned as [22] : cos ( w 1 , w 2 ) = − → w 1 · − → w 2   − → w 1     − → w 2   (3) for all similarity computations in the embedding space, where − → w is a real-v alued v ector embedding of word w . Distributed V ector Representations of F olksong Motifs 3 2.2 Melo dic context and motif represen tation W e are interested in studying how word2v ec can mo del melo dic con text using small m usical motifs instead of words. In the presen t research context is under- sto od as the sequential organization of melo dic units that establish statistically relev an t relationships with one another in a melo dic segmen t. Melo dic similarity and classiﬁcation metho ds dep end strongly on melo dic represen tation [23]. Motifs from the Essen folksong collection are represented by using strings. First, interv als are co diﬁed for eac h song b y using Music21 [7] chro- matic step v alues from the original Kern format, and enco de interv al direction with Bo olean v alues ( 1 for ascending and 0 for descending). F or instance, the string 21 represen ts an ascending ma jor second, and the string 30 a descending minor third. Rep eated notes are enco ded as 00 . Once the entire folksong corpus is enco ded using this scheme, motifs are extracted as multi-w ords [15]. A multi-w ord is then a concatenation of tw o or more in terv als or durations that are found in a melo dy adjacent to each other. F or example, an in terv allic m ulti-word of size 3 30 00 21 represen ts a descending minor third, follo wed by a rep eated note, and by an ascending ma jor second. The m ulti-word representation of motifs is obtained following these steps: – F rom a corpus of interv als we create a vocabulary of m ulti-word M with m ulti-words mw of length 2. Only those mw that o ccur at least 10 times are k ept, based on the quality of the results from ad-ho c queries. – F or each mw in M interv als in the corpus are substituted with their corre- sp onding mw . The same pro cedure is used for mw of size 3, with the only diﬀerence that the minim um num b er of occurrences of mw in a corpus is set to 5. The w ord2vec mo del is run based on the corp ora created obtaining v ector representations for all the motifs. 2.3 Ev aluation metho ds Ev aluation of W ord Embeddings (WE) falls into tw o categories: intr insic and extrinsic ev aluation [22]. Intrinsic ev aluation metho ds test for syntactic or se- man tic relationships b et ween w ords using predeﬁned queries. Then, methods are ev aluated by aggregating correlation scores. Extrinsic ev aluations are p erformed b y using WE as the input feature for another task, and then embeddings are ev aluated based on the changes in the p erformance of that particular task. This study concentrates on in trinsic ev aluations, more speciﬁc on relatedness and analogy . Relatedness in WE is the cosine similarit y b etw een tw o words. P airs of w ords should ha ve higher correlation scores when compared with human annotated semantic similarit y scores [22]. Analogical reasoning was ﬁrst used for testing semantic relationships b et ween pairs of words given sp eciﬁc phrases: given a term x and a term y so that x:y resembles a sample relationship i:j [13]. All these ev aluation metho ds are language sp eciﬁc, and hav e not b eing adapted for MIR tasks. 4 A. Arronte-Alv arez, F. G´ omez-Martin Giv en the non-linguistic nature of music, and the diﬃculty of interpreting WE, more so when they represent me lodic motifs, a new metho d is presented for ev aluating Melo dic Embeddings (ME) based on v ariations of motifs and similarit y measures for those motifs in relation to a reference one. The metho d pro ceeds as follows: 1. F or each multi-w ord mw i , where i = 1, 2, ..., l and l is the cardinality of the v o cabulary M from corpus C , w e compute max(c os( mw i , mw j )) for al l j , and obtain the most related multi-w ord mw + i of mw i , so that mw i : mw + i , and an unrelated multi-w ord mw − i , where c os( mw i , mw − i ) < h , where h is an acceptable similarit y threshold. 2. Chose from C a melo dic segment c and replace mw i with mw + i and mw − i , obtaining a related c + and an unrelated c − melo dic segments. This action is p erformed for all segments in C . 3. Obtain sim(c, c + ) and sim(c, c − ) , where sim() is a function that computes a measure of melo dic similarity b etw een pairs of melo dic segments. The idea b ehind this ev aluation metho d is that, if vector represen tations of motifs are of go od qualit y , when a motif mw i is replaced with its most similar motif mw + i in a melodic segment c obtaining c + , then a melodic similarity measure should indicate that segmen t c is more similar to c + than to c − . T o measure interv allic similarity , sequences are ev aluated using the mean absolute diﬀerence in interv als ( diﬃnt ) [16]. Since this study deals with equal- length sequences, note sequences are ev aluated with cit y blo ck distance ( citydist ) [21], and for duration-weigh ted pitch sequences correlation distance ( c orr dist ) [12]. In order to compute distance measures based on note sequences, a vector of pitc hes represented as numerical MIDI v alues is used. 2.4 Ev aluating motif embeddings A sample of 2000 melo dic segments is randomly selected from the Europ ean sub collection from the Essen folksong corpus. Multi-word embedings of size 2 and 3 are obtained using the skip-gram version of word2v ec with context size of 5 and vector dimension of 150. W e measure melo dic similarit y using diﬃnt , citydist , and c orr dist for related and unrelated multi-w ord melo dic segments using the metho d presented in 2.3, and compare their means. Wilco xon rank sum test is p erformed on related and unrelated melo dic seg- men ts for all similarit y measures, resulting on signiﬁcant diﬀerences in means for all measures ( p-value < 0.01). Ad-ho c queries of interv allic motif em b eddings of size 2 sho w similarity b etw een motifs based on the context. F or instance, Figure 1 shows similar motifs from mw of size 2 (transp osed to C), and Figure 2, shows melo dic examples where those motifs are present in similar melo dic contexts: all three fragmen ts contain the target motif, either 00 20 or 20 00 preceded b y a melo dic unison and follow ed by an ascending ma jor second. Next, closely related and unrelated melo dic segments v ariations from a ref- erence segment using the pro cedure describ ed in 2.3 are computed. W e compare Distributed V ector Representations of F olksong Motifs 5 Fig. 1. Similar interv allic motifs from mw of size 2 the similarity b et ween a reference melo dic segmen t with its most related v ari- ation and the same reference segment with a close v ariation, and with a non related (or distant) v ariation. The cosine similarity for multi-w ords of size 2 and 3 is used to select closely related and unrelated motifs. W e utilize the Euclidean distance for comparing the a verage similarity scores of the 2000 segments and all the v arian ts describ ed. The results in T able 1 sho w that the distance of the similarity scores betw een the reference segments and their v ariations, and the reference segments and closely related v arian ts ( r ef var r ef close var ) yield b etter results than when we compare the reference segmen ts and their v ariants, with the reference segments with distan tly related v ariants ( r ef var r ef distant var ). Fig. 2. F ragments of Europ ean folksongs with similar interv allic motifs colored in red 6 A. Arronte-Alv arez, F. G´ omez-Martin Results Measure ref v ar ref close v ar ref v ar ref distant v ar m w size diﬃn t 6.231 7.853 2 cit ydist 8.836 11.213 2 corrdist 0.503 2.378 2 diﬃn t 2.163 4.782 3 cit ydist 4.556 7.181 3 corrdist 0.713 4.044 3 T able 1. Euclide an distanc e b etwe en similarity sc or es Ov erall, the results of the motif embeddings show that vector representa- tions of folksong motifs capture contextual melodic features. Query results show ho w motifs can b e mo deled with the skip-gram v ersion of the word2v ec from monophonic contexts. One of the adv antages of this metho d is that motifs can b e easily mo deled in a complete unsup ervised manner given a context, and they can b e retrieved using the cosine distance. At the same time, with large corp ora the algorithm tends to discov er m ultiple motifs, some of which ma y b e irrelev ant for the m usicological analysis. 3 Conclusions W ord2v ec has b een used to mo del complex W estern p olyphonic classical mu- sic [10]. In this article the skip-gram v ersion of word2v ec is used to learn rich represen tations of monophonic motifs from the Essen folksong collection. The prop osed approac h shows ho w motifs from folksongs can b e learned from a large corpus and compared with each other using the cosine similarity . This approach can b e v ery useful for the musicological study of folksong v ariation using small melo dic units such as motifs. It also shows, ho w word2v ec is able to capture and mo del melo dic contexts from monophonic songs. F uture w ork should concen- trate on the ﬁltering of motifs based on diﬀerent musicological criteria, to a void a com binatorial explotion and to select relev ant motifs for the m usical analysis. The ev aluation of WE is an imp ortant research topic in the NLP litera- ture [22]. In this article a nov el computational metho d for ev aluating the qualit y of motif embeddings is prop osed. The approach presen ted sho ws how the mo del captures diﬀerent degrees of motif similarit y . This ev aluation metho d can b e v ery useful for studying the similarity of melo dic segmen ts based on motifs and their related v ariants. F uture w ork in this area should include a cognitive sim- ilarit y ev aluation task p erformed by human participants to test the quality of the em b eddings. References 1. Bengio, Y., Ducharme, R., Vincen t, P ., Jauvin, C.: A neural probabilistic language mo del. Journal of mac hine learning research 3 (F eb), 1137–1155 (2003) Distributed V ector Representations of F olksong Motifs 7 2. Besson, M., Sc h¨ on, D.: Comparison b etw een language and m usic. Annals of the New Y ork Academy of Sciences 930 (1), 232–258 (2001) 3. Bo om, C.D., Agra w al, R., Hansen, S., Kumar, E., Y on, R., Chen, C.W., Demeester, T., Dhoedt, B.: Large-scale user modeling with recurren t neural netw orks for music disco very on multiple time scales. Multimedia T ools and Applications 77 , 15385– 15407 (2017) 4. Boulanger-Lew andowski, N., Bengio, Y., Vincent, P .: Mo deling temp oral dep en- dencies in high-dimensional sequences: Application to p olyphonic m usic generation and transcription. arXiv preprint arXiv:1206.6392 (2012) 5. Clark, S.: V ector space mo dels of lexical meaning. In: Lappin, S., F o x, C. (eds.) The handb ook of contemporary seman tic theory , pp. 463–472. Wiley-Blac kwell (2015) 6. Conklin, D., Witten, I.H.: Multiple viewp oint systems for music prediction. Journal of New Music Research 24 (1), 51–73 (1995) 7. Cuth b ert, M.S., Ariza, C.: Music21: A to olkit for computer-aided musicology and sym b olic music data. In: ISMIR. Utrec ht, Netherlands (2010) 8. Goldb erg, Y., Levy , O.: w ord2vec explained: deriving mikolo v et al.’s negative- sampling word-em bedding method. arXiv preprint arXiv:1402.3722 (2014) 9. Harris, Z.S.: Distributional structure. W ord 10 (2-3), 146–162 (1954) 10. Herremans, D., Chuan, C.H.: Mo deling musical context with w ord2vec. arXiv preprin t arXiv:1706.09088 (2017) 11. Huang, C.Z.A., Duvenaud, D., Ga jos, K.Z.: Chordripple: Recommending chords to help novice comp osers go b ey ond the ordinary . In: Pro ceedings of the 21st In- ternational Conference on Intelligen t User Interfaces. pp. 241–250. ACM, Sonoma, CA, USA (2016) 12. Janssen, B., v an Kranenburg, P ., V olk, A.: Finding occurrences of melo dic seg- men ts in folk songs employing sym b olic similarity measures. Journal of New Music Researc h 46 (2), 118–134 (2017) 13. Mik olov, T., Chen, K., Corrado, G., Dean, J.: Eﬃcient estimation of word repre- sen tations in v ector space. arXiv preprint arXiv:1301.3781 (2013) 14. Mik olov, T., Kopecky , J., Burget, L., Glembek, O., et al.: Neural netw ork based language mo dels f or highly inﬂectiv e languages. In: Acoustics, Sp eec h and Signal Pro cessing, 2009. ICASSP 2009. IEEE International Conference on. pp. 4725–4728. IEEE, T aip ei, T aiwan (2009) 15. Mik olov, T., Sutskev er, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre- sen tations of words and phrases and their comp ositionality . In: Adv ances in neural information pro cessing systems. pp. 3111–3119. Lak e T ahoe, Nev ada, United States (2013) 16. M ¨ ullensiefen, D., F rieler, K., et al.: Cognitive adequacy in the measurement of melo dic similarity: Algorithmic vs. human judgments. Computing in Musicology 13 (2003), 147–176 (2004) 17. Nettl, B.: An ethnomusicologist con templates universals in musical sound and mu- sical culture. In: Brown, N.L.W.B.M.S. (ed.) The origins of music, pp. 463–472. MIT press (2000) 18. Rumelhart, D.E., Hin ton, G.E., Williams, R.J.: Learning representations by back- propagating errors. Nature 323 (6088), 533 (1986) 19. Sa v age, P .E., Bro wn, S., Sak ai, E., Currie, T.E.: Statistical universals reveal the structures and functions of human music. Pro ceedings of the National Academy of Sciences 112 (29), 8987–8992 (2015) 20. Sc haﬀrath, H., Huron, D.: The essen folksong collection in the humdrum kern format. T ech. rep., Center for Computer Assisted Research in the Humanities, Menlo Park, CA, USA (1995) 8 A. Arronte-Alv arez, F. G´ omez-Martin 21. Sc herrer, D.K., Scherrer, P .H.: An exp eriment in the computer measurement of melo dic v ariation in folksong. The Journal of American F olklore 84 (332), 230–241 (1971) 22. Sc hnab el, T., Labutov, I., Mimno, D., Joachims, T.: Ev aluation metho ds for unsu- p ervised word embeddings. In: Pro ceedings of the 2015 Conference on Empirical Metho ds in Natural Language Pro cessing. pp. 298–307. Lisbon, P ortugal (2015) 23. T oiviainen, P ., Eerola, T.: A computational mo del of melodic similarity based on m ultiple representations and self-organizing maps. In: Pro ceedings of the sev enth in ternational conference on music p erception and cognition, Sydney . Causal Pro- ductions, Adelaide. pp. 236–239 (2002) 24. T urney , P .D., Pan tel, P .: F rom frequency to meaning: V ector space mo dels of se- man tics. Journal of artiﬁcial intelligence research 37 , 141–188 (2010)

Distributed Vector Representations of Folksong Motifs

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment