Learning Embodied Semantics via Music and Dance Semiotic Correlations


Authors: Francisco Afonso Raposo, David Martins de Matos, Ricardo Ribeiro

Francisco Afonso Raposo (a,c), David Martins de Matos (a,c), Ricardo Ribeiro (b,c)

a Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, 1049-001 Lisboa, Portugal
b Instituto Universitário de Lisboa (ISCTE-IUL), Av. das Forças Armadas, 1649-026 Lisboa, Portugal
c INESC-ID Lisboa, R. Alves Redol 9, 1000-029 Lisboa, Portugal

Abstract

Music semantics is embodied, in the sense that meaning is biologically mediated by and grounded in the human body and brain. This embodied cognition perspective also explains why music structures modulate kinetic and somatosensory perception. We leverage this aspect of cognition by considering dance as a proxy for music perception, in a statistical computational model that learns semiotic correlations between music audio and dance video. We evaluate the ability of this model to effectively capture underlying semantics in a cross-modal retrieval task. Quantitative results, validated with statistical significance testing, strengthen the body of evidence for embodied cognition in music and show that the model can recommend music audio for dance video queries and vice-versa.

1. Introduction

Recent developments in human embodied cognition posit a learning and understanding mechanism called "conceptual metaphor" (Lakoff, 2012), where knowledge is derived from repeated patterns of experience. Neural circuits in the brain are substrates for these metaphors (Lakoff, 2014) and, therefore, are the drivers of semantics. Semantic grounding can be understood as the inferences which are instantiated as activation of these learned neural circuits.
∗ Corresponding author. Tel.: +351-213-100-313. Email address: francisco.afonso.raposo@tecnico.ulisboa.pt (Francisco Afonso Raposo)

Preprint submitted to arXiv, March 27, 2019

While not using the same abstraction of conceptual metaphor, other theories of embodied cognition also cast semantic memory and inference as encoding and activation of neural circuitry, differing only in terms of which brain areas are the core components of the biological semantic system (Kiefer & Pulvermüller, 2012; Ralph et al., 2017). The common factor between these accounts of embodied cognition is the existence of transmodal knowledge representations, in the sense that circuits are learned in a modality-agnostic way. This means that correlations between sensory, motor, linguistic, and affective embodied experiences create circuits connecting different modality-specific neuron populations. In other words, the statistical structure of human multimodal experience, which is captured and encoded by the brain, is what defines semantics. Music semantics is no exception, also being embodied: musical concepts convey meaning in terms of somatosensory and motor concepts (Koelsch et al., 2019; Korsakova-Kreyn, 2018; Leman & Maes, 2014).
The statistical and multimodal imperative for human cognition has also been hinted at, at least in some form, by research across various disciplines, such as aesthetics (Cook, 2000; Davies, 1994; Kivy, 1980; Kurth, 1991; Scruton, 1997), semiotics (Azcárate, 2011; Bennett, 2008; Blanariu, 2013; Lemke, 1992), psychology (Brown & Jordania, 2011; Dehaene & Cohen, 2007; Eitan & Granot, 2006; Eitan & Rothschild, 2011; Frego, 1999; Krumhansl & Schenck, 1997; Larson, 2004; Roffler & Butler, 1968; Sievers et al., 2013; Phillips-Silver & Trainor, 2007; Styns et al., 2007; Wagner et al., 1981), and neuroscience (Fujioka et al., 2012; Janata et al., 2012; Koelsch et al., 2019; Nakamura et al., 1999; Nozaradan et al., 2011; Penhune et al., 1998; Platel et al., 1997; Spence & Driver, 1997; Stein et al., 1995; Widmann et al., 2004; Zatorre et al., 1994), namely for natural language, music, and dance. In this work, we are interested in the semantic link between music and dance (movement-based expression). Therefore, we leverage this multimodal aspect of cognition by modeling expected semiotic correlations between these modalities. These correlations are expected because they are mainly surface realizations of cognitive processes following embodied cognition. This framework implies that there is a degree of determinism underlying the relationship between music and dance, that is, dance design and performance are heavily shaped by music. This evident and intuitive relationship is even captured in some natural languages, where the words for music and dance are either synonyms or the same (Baily, 1985).
In this work, we claim that, just as human semantic cognition is based on multimodal statistical structures, joint semiotic modeling of music and dance, through statistical computational approaches, is expected to shed some light on the semantics of these modalities, as well as to provide intelligent technological applications in areas such as multimedia production. That is, we can automatically learn the symbols/patterns (semiotics), encoded in the data representing human expression, which correlate across several modalities. Since this correlation defines and is a manifestation of underlying cognitive processes, capturing it effectively uncovers semantic structures for both modalities.

Following the calls for technological applications based on sensorimotor aspects of semantics (Leman, 2010; Matyja, 2016), this work leverages semiotic correlations between music and dance, represented as audio and video, respectively, in order to learn latent cross-modal representations which capture the underlying semantics connecting these two modes of communication. These representations are quantitatively evaluated in a cross-modal retrieval task. In particular, we perform experiments on a dataset of 592 music audio-dance video pairs, using Multi-view Neural Networks (MVNNs), and report 75% rank accuracy and 57% pair accuracy for instance-level retrieval and 26% Mean Average Precision (MAP) for class-level retrieval, all of which are statistically very significant effects (p-values < 0.01). We interpret these results as further evidence for embodied cognition-based music semantics. Potential end-user applications include, but are not limited to, the automatic retrieval of a song for a particular dance or choreography video and vice-versa.
To the best of our knowledge, this is the first instance of such a joint music-dance computational model, capable of capturing semantics underlying these modalities and providing a connection between machine learning of these multimodal correlations and embodied cognition perspectives.

The rest of this paper is structured as follows: Section 2 reviews related work on embodied cognition, semantics, and semiotics, motivating this approach based on evidence taken from research in several disciplines; Section 3 details the experimental setup, including descriptions of the evaluation task, MVNN model, dataset, features, and preprocessing; Section 4 presents the results; Section 5 discusses the impact of these results; and Section 6 draws conclusions and suggests future work.

2. Related work

Conceptual metaphor (Lakoff, 2012) is an abstraction used to explain the relational aspect of human cognition as well as its biological implementation in the brain. Experience is encoded neurally, and frequent patterns or correlations encountered across many experiences define conceptual metaphors. That is, a conceptual metaphor is a link established in cognition (often subconsciously) connecting concepts. An instance of such a metaphor implies a shared meaning of the concepts involved. Which metaphors get instantiated depends on the experiences had during a lifetime as well as on genetically inherited biological primitives (which are also learned based on experience, albeit across evolutionary time scales). These metaphors are physically implemented as neural circuits in the brain, which are, therefore, also learned based on everyday experience. The learning process at the neuronal level of abstraction is called "Hebbian learning", where "neurons that fire together, wire together" is the motto (Lakoff, 2014).
Semantic grounding in this theory, called the Neural Theory of Thought and Language (NTTL), understood as the set of semantic inferences, manifests in the brain as firing patterns of the circuits encoding such metaphorical inferences. These semantics are, therefore, transmodal: patterns of multimodal experience dictate which circuits are learned. Consequently, semantic grounding triggers multimodal inferences in a natural, often subconscious, way. Central to this theory is the fact that grounding is rooted in primitive concepts, that is, inference triggers the firing of neuron populations responsible for perception and action/coordination of the material body interacting in the material world. These neurons encode concepts like movement, physical forces, and other bodily sensations, and are mainly located in the somatosensory and sensorimotor systems (Desai et al., 2011; Cespedes-Guevara & Eerola, 2018; Koelsch et al., 2019; Lakoff, 2014). Other theories, such as Controlled Semantic Cognition (CSC) (Ralph et al., 2017), share this core multimodal aspect of cognition but defend that a transmodal hub is located in the Anterior Temporal Lobes (ATLs) instead. Kiefer & Pulvermüller (2012) review and compare several semantic cognition theories and argue in favor of the embodiment views of conceptual representations, which are rooted in transmodal integration of modality-specific (e.g., sensory and motor) features. In the remainder of this section, we review related work providing evidence for the multimodal nature of cognition and the primacy of primitive embodied concepts in music.

Aesthetics suggests that musical structures evoke emotion through isomorphism with human motion (Cook, 2000; Davies, 1994; Kivy, 1980; Scruton, 1997) and that music is a manifestation of a primordial "kinetic energy" and a play of "psychological tensions" (Kurth, 1991).
Blanariu (2013) claims that, even though the design of choreographies is influenced by culture, their aesthetics are driven by "pre-reflective" experience, i.e., unconscious processes driving body movement expression. The choreographer interprets the world (e.g., a song) via "kinetic thinking" (von Laban & Ullmann, 1960), which is materialized in dance in such a way that its surface-level features retain this "motivating character" or "invoked potential" (Peirce, 1991), i.e., the conceptual metaphors behind the encoded symbols can still be accessible. The symbols range from highly abstract cultural encodings to more concrete patterns, such as movement patterns in space and time like those found in abstract (e.g., non-choreographed) dance (Blanariu, 2013). Bennett (2008) characterizes movement and dance semantics as being influenced by physiological, psychological, and social factors and based on space and force primitives. In music, semantics is encoded symbolically in different dimensions (such as timbral, tonal, and rhythmic) and levels of abstraction (Juslin, 2013; Schlenker, 2017). These accounts of encoding of meaning imply a conceptual semantic system which supports several denotations (Blanariu, 2013), i.e., what has also been termed an "underspecified" semantics (Schlenker, 2017). The number of possible denotations for a particular song can be reduced when considering accompanying communication channels, such as dance, video, and lyrics (Schlenker, 2017). Natural language semantics is also underspecified according to this definition, albeit to a much lower degree. Furthermore, Azcárate (2011) emphasizes the concept of "intertextuality" as well as text being a "mediator in the semiotic construction of reality". Intertextuality refers to the context in which a text is interpreted, allowing meaning to be assigned to text (Lemke, 1992).
This context includes other supporting texts but also history and culture as conveyed by the whole range of semiotic possibilities, i.e., via other modalities (Lemke, 1992). That is, textual meaning is also derived via multimodal inferences, which improve the efficacy of communication. This "intermediality" is a consequence of human cognitive processes based on relational thinking (conceptual metaphor) that exhibit a multimodal and contextualized inferential nature (Azcárate, 2011). Peirce (1991) termed this capacity to both encode and decode symbols, via semantic inferences, "abstractive observation", which he considered a feature required to learn and interpret by means of experience, i.e., required for being an "intelligent consciousness".

Human behaviour reflects this fundamental and multimodal aspect of cognition, as shown by psychology research. For instance, Eitan & Rothschild (2011) found several correlations between music dimensions and somatosensory-related concepts, such as sharpness, weight, smoothness, moisture, and temperature. People synchronize their walking tempo to the music they listen to, and this is thought to indicate that the perception of musical pulse is internalized in the locomotion system (Styns et al., 2007). The biological nature of the link between music and movement is also suggested by studies that observed pitch height associations with vertical directionality in 1-year-old infants (Wagner et al., 1981) and with perceived spatial elevation in congenitally blind subjects and 4- to 5-year-old children who did not verbally make those associations (Roffler & Butler, 1968). Tension ratings performed by subjects independently for either music or a corresponding choreography yielded correlated results, suggesting that tension fluctuations are isomorphically manifested in both modalities (Frego, 1999; Krumhansl & Schenck, 1997).
Phillips-Silver & Trainor (2007) showed that the perception of "beat" is transferable across music and movement for humans as young as 7 months old. Eitan & Granot (2006) observed a kind of music-kinetic determinism in an experiment where music features were consistently mapped onto kinetic features of visualized human motion. Sievers et al. (2013) found further empirical evidence for a shared dynamic structure between music and movement in a study that leveraged a common feature of these modalities: the capacity to convey affective content. Experimenters had human subjects independently control the shared parameters of a probabilistic model, for generating either piano melodies or bouncing ball animations, according to specified target emotions: angry, happy, peaceful, sad, and scared. Similar emotions were correlated with similar slider configurations across both modalities and different cultures: American and Kreung (in a rural Cambodian village which had maintained a high degree of cultural isolation). The authors argue that the isomorphic relationship between these modalities may play an important role in evolutionary fitness and suggest that music processing in the brain "recycles" (Dehaene & Cohen, 2007) other areas evolved for older tasks, such as spatiotemporal perception and action (Sievers et al., 2013). Brown & Jordania (2011) suggest that this capacity to convey affective content is the reason why music and movement are more cross-culturally intelligible than language. A computational model for melodic expectation, which generated melody completions based on tonal movement driven by physical forces (gravity, inertia, and magnetism), outperformed every human subject, based on intersubject agreement (Larson, 2004), further suggesting semantic inferences between concepts related to music and movement/forces.
There is also neurological evidence for multimodal cognition and, in particular, for an underlying link between music and movement. Certain brain areas, such as the superior colliculus, are thought to integrate visual, auditory, and somatosensory information (Spence & Driver, 1997; Stein et al., 1995). Widmann et al. (2004) observed evoked potentials when an auditory stimulus was presented to subjects together with a visual stimulus that infringed expected spatial inferences based on pitch. The engagement of visuospatial areas of the brain during music-related tasks has also been extensively reported (Nakamura et al., 1999; Penhune et al., 1998; Platel et al., 1997; Zatorre et al., 1994). Furthermore, neural entrainment to beat has been observed as β oscillations across auditory and motor cortices (Fujioka et al., 2012; Nozaradan et al., 2011). Moreover, Janata et al. (2012) found a link between the feeling of "being in the groove" and sensorimotor activity. Korsakova-Kreyn (2018) also explains music semantics from an embodied cognition perspective, where tonal and temporal relationships in music artifacts convey embodied meaning, mainly via modulation of physical tension. These tonal relationships consist of manipulations of tonal tension, a core concept in musicology, within a tonal framework (musical scale). Tonal tension is physically perceived by humans as young as one-day-old babies (Virtala et al., 2013), which further points to the embodiment of music semantics, since tonal perception is mainly biologically driven. The reason for this may be the "principle of least effort", where consonant sounds, consisting of more harmonic overtones, are more easily processed and compressed by the brain than dissonant sounds, creating a more pleasant experience (Bidelman & Krishnan, 2009, 2011).
Leman (2007) also emphasizes the role of kinetic meaning as a translator between structural features of music and semantic labels/expressive intentions, i.e., corporeal articulations are necessary for interpreting music. Semantics are defined by the mediation process when listening to music, i.e., the human body and brain are responsible for mapping from the physical modality (audio) to the experienced modality (Leman, 2010). This mediation process is based on motor patterns which regulate mental representations related to music perception. This theory, termed Embodied Music Cognition (EMC), also supports the idea that semantics is motivated by affordances (action), i.e., music is interpreted in a (kinetic) way that is relevant for functioning in a physical environment. Furthermore, EMC also states that decoding music expressiveness in performance is a sense-giving activity (Leman & Maes, 2014), which falls in line with the learning nature of NTTL. The Predictive Coding (PC) framework of Koelsch et al. (2019) also points to the involvement of transmodal neural circuits in both prediction and prediction error resolution (active inference) of musical content. The groove aspect of music perception entails an active engagement in terms of proprioception and interoception, where sensorimotor predictions are inferred (by "mental action"), even without actually moving. In this framework, both sensorimotor and autonomic systems can also be involved in the resolution of prediction errors.

Recently, Pereira et al. (2018) proposed a method for decoding neural representations into statistically-modeled semantic dimensions of text. This is relevant because it shows that statistical computational modeling (in this instance, ridge regression) is able to robustly capture language semantics in the brain, based on functional Magnetic Resonance Imaging (fMRI).
This language-brain relationship is analogous to the music-dance relationship in this work. The main advantage is that, theoretically, brain activity will directly correlate with stimuli, assuming we can perfectly decode it. Dance, however, can be viewed as an indirect representation, a kinetic proxy for the embodied meaning of the music stimulus, which is assumed to be encoded in the brain. This approach provides further insights motivating embodied cognition perspectives, in particular their transmodal aspect. fMRI data was recorded for three different text concept presentation paradigms: using the concept in a sentence, pairing it with a descriptive picture, and pairing it with a word cloud (several related words). The best decoding performance across individual paradigms was obtained with the data recorded in the picture paradigm, illustrating the role of intermediality in natural language semantics and cognition in general. Moreover, an investigation into which voxels were most informative for decoding revealed that they were from widely distributed brain areas (language 21%, default mode 15%, task-positive 23%, visual 19%, and others 22%), as opposed to being focalized in the language network, further suggesting an integrated semantic system distributed across the whole brain. A limitation of that approach in relation to the one proposed here is that regression is performed for each dimension of the text representation independently, failing to capture how all dimensions jointly covary across both modalities.

3. Experimental setup

As previously stated, multimedia expressions referencing the same object (e.g., the audio and dance of a song) tend to display semiotic correlations reflecting embodied cognitive processes.
Therefore, we design an experiment to evaluate how correlated these artifact pairs are: we measure the performance of cross-modal retrieval between music audio and dance video. The task consists of retrieving a sorted list of relevant results from one modality, given a query from the other modality. We perform experiments in a 4-fold cross-validation setup and report pair and rank accuracy scores (as done by Pereira et al. (2018)) for instance-level evaluation and MAP scores for class-level evaluation. The following sections describe the dataset (Section 3.1), features (Section 3.2), preprocessing (Section 3.3), MVNN model architecture and loss function (Section 3.4), and evaluation details (Section 3.5).

3.1. Dataset

We ran experiments on a subset of the Let's Dance dataset of 1000 videos of dances from 10 categories: ballet, breakdance, flamenco, foxtrot, latin, quickstep, square, swing, tango, and waltz (Castro et al., 2018). This dataset was created in the context of dance style classification based on video. Each video is 10 s long and has a rate of 30 frames per second. The videos were taken from YouTube at 720p quality and include both dance performances and practice sessions. We used only the audio and pose detection data (body joint positions) from this dataset, which was extracted by applying a pose detector (Wei et al., 2016) after detecting bounding boxes in each frame with a real-time person detector (Redmon et al., 2016). After filtering out all instances which did not have complete pose detection data for the full 10 s, the final dataset size is 592 pairs.

3.2. Features

The audio features consist of logarithmically scaled Mel-spectrograms extracted from 16,000 Hz audio signals. Framing is done by segmenting chunks of 50 ms of audio every 25 ms. Spectra are computed via Fast Fourier Transform (FFT) with a buffer size of 1024 samples.
The number of Mel bins is set to 128, which results in a final matrix of 399 frames by 128 Mel-frequency bins per 10 s audio recording. We segment each recording into 1 s chunks (50% overlap) to be fed to the MVNN (detailed in Section 3.4), which means that each of the 592 objects contains 19 segments (each containing 39 frames), yielding a dataset with a total of 11,248 samples.

Figure 1: Pose detection illustration taken from (Chan et al., 2018). Skeleton points represent joints.

The pose detection features consist of body joint positions in frame space, i.e., pixel coordinates ranging from 0 (top left corner) to 1280 and 720 for width and height, respectively. The positions of the following keypoints are extracted: head, neck, shoulder, elbow, wrist, hip, knee, and ankle. There are 2 keypoints, left and right, for each of these except for head and neck, yielding a total of 28 features (14 keypoints with 2 coordinates, x and y, each). Figure 1 illustrates the keypoints. These features are extracted at 30 fps for the whole 10 s video duration (t ∈ {t_0 ... t_299}), normalized after extraction according to Section 3.3, and then derived features are computed from the normalized data. The position and movement of body joints are used together for expression in dance. Therefore, we compute features that reflect the relative positions of body joints in relation to each other. This translates into computing the Euclidean distance between each combination of two joints, yielding 91 derived features and a total of 119 movement features. As for audio, we segment this sequence into 1 s segments (50% overlap), each containing 30 frames.

3.3. Preprocessing

We are interested in modeling movement as bodily expression. Therefore, we should focus on the temporal dynamics of joint positions relative to each other in a way that is as viewpoint- and subject-invariant as possible.
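The movement-feature computation described in Section 3.2 (28 raw coordinates plus the 91 pairwise joint distances) can be sketched as follows; this is a minimal sketch, and the function name and array layout are our own assumptions, not taken from the paper:

```python
import numpy as np
from itertools import combinations

def movement_features(joints):
    """Build one 1 s movement segment.

    joints: array of shape (30, 14, 2) -- 14 normalized (x, y)
    keypoints over 30 frames (1 s at 30 fps).
    Returns an array of shape (30, 119): 28 raw coordinate features
    plus C(14, 2) = 91 pairwise Euclidean joint distances per frame.
    """
    frames, n_joints, _ = joints.shape
    coords = joints.reshape(frames, n_joints * 2)  # 28 raw features
    # Euclidean distance between every pair of joints (91 features)
    pairs = list(combinations(range(n_joints), 2))
    dists = np.stack(
        [np.linalg.norm(joints[:, i] - joints[:, j], axis=-1)
         for i, j in pairs],
        axis=-1,
    )
    return np.concatenate([coords, dists], axis=-1)  # (30, 119)
```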
However, the positions of subjects in frame space vary according to their distance to the camera. Furthermore, limb proportions also differ across subjects. Therefore, we normalize the joint position data in a similar way to Chan et al. (2018), whose purpose was to transform a pose from a source frame space to a target frame space. We select an arbitrary target frame and project every source frame onto this space. We start by taking the maximum ankle y coordinate, ankl_clo (Equation 1), and the maximum ankle y coordinate which is smaller than (spatially above) the median ankle y coordinate, ankl_med (Equation 2), and about the same distance to it as the distance between it and ankl_clo (ankl_far in Equation 3). These two keypoints represent the closest and furthest ankle coordinates to the camera, respectively. Formally:

ankl = {ankl^{yL}_t} ∪ {ankl^{yR}_t}

ankl_clo = max_t({y_t : y_t ∈ ankl})    (1)

ankl_med = median_t({y_t : y_t ∈ ankl})    (2)

ankl_far = max_t({y_t : y_t ∈ ankl ∧ y_t < ankl_med ∧ |y_t − ankl_med| − α|ankl_clo − ankl_med| < ε})    (3)

where ankl^{yL}_t and ankl^{yR}_t are the y coordinates of the left and right ankles at timestep t, respectively. Following Chan et al. (2018), we set α to 1 and ε to 0.7. Then, we compute a scale s (Equation 4) to be applied to the y-axis according to an interpolation between the ratios of the maximum heights in the source and target frames, heig^{far}_{src} and heig^{far}_{tgt}, respectively. For each dance instance, frame heights are first clustered according to the distance between the corresponding ankle y coordinate and ankl_clo and ankl_far, and then the maximum height values for each cluster are taken (Equations 5 and 6).
Formally:

s = heig^{far}_{tgt} / heig^{far}_{src} + ((ankl^{avg}_{src} − ankl^{far}_{src}) / (ankl^{clo}_{src} − ankl^{far}_{src})) × (heig^{clo}_{tgt} / heig^{clo}_{src} − heig^{far}_{tgt} / heig^{far}_{src})    (4)

heig_clo = max_t({|head^y_t − ankl^{LR}_t| : |ankl^{LR}_t − ankl_clo| < |ankl^{LR}_t − ankl_far|})    (5)

heig_far = max_t({|head^y_t − ankl^{LR}_t| : |ankl^{LR}_t − ankl_clo| > |ankl^{LR}_t − ankl_far|})    (6)

ankl^{LR}_t = (ankl^{yL}_t + ankl^{yR}_t) / 2

ankl_avg = average_t({y_t : y_t ∈ ankl})

where head^y_t is the y coordinate of the head at timestep t. After scaling, we also apply a 2D translation so that the position of the ankles of the subject is centered at 0. We do this by subtracting the median coordinates (x and y) of the mean of the (left and right) ankles, i.e., the median of ankl^{LR}_t.

3.4. Multi-view neural network architecture

The MVNN model used in this work is composed of two branches, each modeling its own view. Even though the final embeddings define a shared and correlated space, according to the loss function, the branches can be arbitrarily different from each other. The loss function is Deep Canonical Correlation Analysis (DCCA) (Andrew et al., 2013), a non-linear extension of Canonical Correlation Analysis (CCA) (Hotelling, 1936), which has also been successfully applied to music by Kelkar et al. (2018) and Yu et al. (2019). CCA linearly projects two distinct view spaces into a shared correlated space and has been suggested to be a general case of parametric tests of statistical significance (Knapp, 1978). Formally, DCCA solves:

(w*_x, w*_y, φ*_x, φ*_y) = argmax_{(w_x, w_y, φ_x, φ_y)} corr(w^T_x φ_x(x), w^T_y φ_y(y))    (7)

where x ∈ R^m and y ∈ R^n are the zero-mean observations for each view, φ_x and φ_y are non-linear mappings for each view, and w_x and w_y are the canonical weights for each view.
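As a reference point for Equation 7, the linear special case (identity φ) can be computed in closed form from the view covariances: the canonical correlations are the singular values of T = C_XX^{-1/2} C_XY C_YY^{-1/2}. A minimal numpy sketch (the function and helper names are ours, not from the paper):

```python
import numpy as np

def cca_correlations(X, Y, reg=1e-4):
    """Canonical correlations between two zero-mean views.

    X: (n_samples, m), Y: (n_samples, n). Returns the singular values
    of T = Cxx^{-1/2} Cxy Cyy^{-1/2}, in descending order.
    """
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])  # regularized
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])  # covariances
    Cxy = X.T @ Y / (n - 1)                             # cross-covariance

    def inv_sqrt(C):
        # C^{-1/2} via eigendecomposition (C is symmetric positive definite),
        # matching the Q Λ^{-1/2} Q^T construction used by DCCA
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    return np.linalg.svd(T, compute_uv=False)
```

DCCA simply pushes each view through a learned non-linear branch before this computation and backpropagates through it.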
We use backpropagation and minimize:

−√( tr( (C^{−1/2}_{XX} C_{XY} C^{−1/2}_{YY})^T (C^{−1/2}_{XX} C_{XY} C^{−1/2}_{YY}) ) )    (8)

C^{−1/2}_{XX} = Q_{XX} Λ^{−1/2}_{XX} Q^T_{XX}    (9)

where X and Y are the non-linear projections for each view, i.e., φ_x(x) and φ_y(y), respectively. C_{XX} and C_{YY} are the regularized, zero-centered covariances, while C_{XY} is the zero-centered cross-covariance. Q_{XX} are the eigenvectors of C_{XX} and Λ_{XX} are the eigenvalues of C_{XX}. C^{−1/2}_{YY} can be computed analogously. We finish training by computing a forward pass with the training data and fitting a linear CCA model on those non-linear mappings. The canonical components of these deep non-linear mappings implement our semantic embedding space, to be evaluated in a cross-modal retrieval task.

Functions φ_x and φ_y, i.e., the audio and movement projections, are implemented as branches of typical neural networks, described in Tables 1 and 2. We use tanh activation functions after each convolution layer. Note that other loss functions, such as ones based on pairwise distances (Hermann & Blunsom, 2014; He et al., 2017), could in principle also be used for the same task. The neural network models were all implemented using TensorFlow (Abadi et al., 2015).

Table 1: Audio Neural Network Branch

layer type    dimensions      # params
input         39 × 128 × 1    0
batch norm    39 × 128 × 1    4
2D conv       39 × 128 × 8    200
2D avg pool   13 × 16 × 8     0
batch norm    13 × 16 × 8     32
2D conv       13 × 16 × 16    2064
2D avg pool   3 × 4 × 16      0
batch norm    3 × 4 × 16      64
2D conv       3 × 4 × 32      6176
2D avg pool   1 × 1 × 32      0
batch norm    1 × 1 × 32      128
2D conv       1 × 1 × 128     4224
Total params                  12892

Table 2: Movement Neural Network Branch

layer type    dimensions    # params
input         30 × 119      0
batch norm    30 × 119      476
gru           1 × 32        14688
Total params                15164

3.5.
Cross-modal retrieval evaluation

In this work, cross-modal retrieval consists of retrieving a sorted list of videos given an audio query, and vice-versa. We perform cross-modal retrieval on full objects even though the MVNN models semiotic correlation between segments. In order to do this, we compute object representations as the average of the CCA projections of their segments (for both modalities) and compute the cosine similarity between these cross-modal embeddings. We evaluate the ability of the model to capture semantics and generalize semiotic correlations between both modalities by assessing whether the relevant cross-modal documents for a query are ranked at the top of the retrieved documents list. We define relevant documents in two ways: instance-level and class-level. Instance-level evaluation considers the ground truth pairing of cross-modal objects as the criterion for relevance (i.e., the only relevant audio document for a dance video is the one corresponding to the song that played in that video). Class-level evaluation considers any cross-modal object sharing some semantic label to be relevant (e.g., the relevant audio documents for a dance video of a particular dance style are the ones corresponding to songs that played in videos of the same dance style). We perform experiments in a 4-fold cross-validation setup, where the fold partitioning is such that the distribution of classes is similar in each fold. We also run the experiments 10 times for each fold and report the average performance across runs.

We compute pair and rank accuracies for instance-level evaluation (similar to Pereira et al. (2018)). Pair accuracy evaluates ranking performance in the following way: for each query from modality X, we consider every possible pairing of the relevant object (corresponding cross-modal pair) and non-relevant objects from modality Y.
We compute the similarities between the query and each of the two cross-modal objects, as well as the similarities between both cross-modal objects and the corresponding non-relevant object from modality X. If the corresponding cross-modal objects are more similar than the alternative, the retrieval trial is successful. We report the average values over queries and non-relevant objects. We also compute a statistical significance test in order to show that the model indeed captures semantics underlying the artifacts. We can think of each trial as a binomial outcome, aggregating two binomial outcomes, where the probability of success for a random model is 0.5 × 0.5 = 0.25. Therefore, we can perform a binomial test and compute its p-value. Even though there are 144 × 143 trials, we consider a more conservative value for the number of trials: 144 (the number of independent queries). If the p-value is lower than 0.05, then we can reject the null hypothesis that the results of our model are due to chance. Rank accuracy is the (linearly) normalized rank of the relevant document in the retrieval list: ra = 1 − (r − 1)/(L − 1), where r is the rank of the relevant cross-modal object in the list with L elements. This is similar to the pair accuracy evaluation, except that we only consider the query from modality X and the objects from modality Y, i.e., each trial consists of one binomial outcome, where the probability of success for a random model is 0.5. We also consider a conservative binomial test number of trials of 144 for this metric. Even though the proposed model and loss function do not explicitly optimize class separation, we expect it to still learn embeddings which capture some aspects of the dance genres in the dataset. This is because different instances of the same class are expected to share semantic structures.
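The rank accuracy formula and the conservative one-sided binomial test above can be reproduced directly; a small sketch (an exact-test routine from a statistics library could be used instead):

```python
from math import comb

def rank_accuracy(r, L):
    # ra = 1 - (r - 1) / (L - 1), with 1-based rank r in a list of L elements.
    return 1.0 - (r - 1) / (L - 1)

def binomial_p_value(successes, trials, p_null):
    # One-sided exact binomial test: P(X >= successes) under the null.
    return sum(comb(trials, k) * p_null**k * (1 - p_null)**(trials - k)
               for k in range(successes, trials + 1))
```

With the conservative 144 trials and a pair accuracy of 0.57 (about 82 successes), the p-value under the 0.25 null is far below 0.05, consistent with the significance reported below.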
Therefore, we perform class-level evaluation, in order to further validate that our model captures semantics underlying both modalities. We compute and report MAP scores for each class, separately, and perform a permutation test on these scores against random model performance (whose MAP scores are computed according to Bestgen (2015)), so that we can show these results are statistically significant and not due to chance. Formally:

$$\mathrm{MAP}_C = \frac{1}{|Q_C|} \sum_{q \in Q_C} AP_C(q) \qquad (10)$$

$$AP_C(q) = \frac{\sum_{j=1}^{|R|} pr(j)\, rel_C(r_j)}{|R_C|} \qquad (11)$$

where C is the class, Q_C is the set of queries belonging to class C, AP_C(q) is the Average Precision (AP) for query q, R is the list of retrieved objects, R_C is the set of retrieved objects belonging to class C, pr(j) is the precision at cutoff j of the retrieved objects list, and rel_C(r) evaluates whether retrieved object r is relevant or not, i.e., whether it belongs to class C or not. Note that the retrieved objects list always contains the whole (train or test) set of data from modality Y and that its size is equal to the total number of (train or test) evaluated queries from modality X. MAP measures the quality of the sorting of retrieved item lists for a particular definition of relevance (dance style in this work).

4. Results

Instance-level evaluation results are reported in Tables 3 and 4 for pair and rank accuracies, respectively, for each fold. Values shown in the X / Y format correspond to results when using audio / video queries, respectively. The model was able to achieve 57% and 75% for pair and rank accuracies, respectively, which are statistically significantly better (p-values < 0.01) than the random baseline performances of 25% and 50%, respectively.
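For reference, the class-level metric of Eqs. (10) and (11) defined above amounts to the usual AP/MAP computation over a ranked list; a minimal sketch with illustrative function names:

```python
def average_precision(retrieved_labels, query_label):
    # Eq. (11): sum of precision-at-j over relevant positions, divided by |R_C|.
    hits, total = 0, 0.0
    for j, label in enumerate(retrieved_labels, start=1):
        if label == query_label:   # rel_C(r_j) = 1
            hits += 1
            total += hits / j      # pr(j) at a relevant cutoff
    return total / hits if hits else 0.0

def mean_average_precision(ranked_lists, query_labels):
    # Eq. (10): mean AP over the queries of a class.
    aps = [average_precision(r, q) for r, q in zip(ranked_lists, query_labels)]
    return sum(aps) / len(aps)
```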
Table 3: Instance-level Pair Accuracy

Fold 0        Fold 1        Fold 2        Fold 3        Average       Baseline
0.57 / 0.57   0.57 / 0.56   0.60 / 0.59   0.55 / 0.56   0.57 / 0.57   0.25

Table 4: Instance-level Rank Accuracy

Fold 0        Fold 1        Fold 2        Fold 3        Average       Baseline
0.75 / 0.75   0.75 / 0.75   0.77 / 0.76   0.74 / 0.74   0.75 / 0.75   0.50

Class-level evaluation results (MAP scores) are reported in Table 5 for each class and fold. The model achieved 26%, which is statistically significantly better (p-value < 0.01) than the random baseline performance of 13%.

5. Discussion

Our proposed model successfully captured semantics for music and dance, as evidenced by the quantitative evaluation results, which are validated by statistical significance testing for both instance- and class-level scenarios. Instance-level evaluation confirms that our proposed model is able to generalize the cross-modal features which connect both modalities. This means the model effectively learned how people can move according to the sound of music, as well as how music can sound according to the movement of human bodies. Class-level evaluation further strengthens this conclusion by showing the same effect from a style-based perspective, i.e., the model learned how people can move according to the music style of a song, as well as how music can sound according to the dance style of the movement of human bodies. This result is particularly interesting because the design of both the model and experiments does not explicitly address style, that is, there is no style-based supervision.
Table 5: Class-level MAP

Style        Fold 0        Fold 1        Fold 2        Fold 3        Average       Baseline
Ballet       0.43 / 0.40   0.33 / 0.31   0.51 / 0.41   0.37 / 0.32   0.41 / 0.36   0.10
Breakdance   0.18 / 0.17   0.18 / 0.14   0.18 / 0.14   0.23 / 0.22   0.19 / 0.17   0.09
Flamenco     0.20 / 0.18   0.16 / 0.19   0.15 / 0.16   0.16 / 0.17   0.17 / 0.17   0.12
Foxtrot      0.22 / 0.24   0.23 / 0.24   0.21 / 0.21   0.16 / 0.18   0.20 / 0.22   0.12
Latin        0.23 / 0.23   0.19 / 0.20   0.21 / 0.22   0.20 / 0.19   0.21 / 0.21   0.14
Quickstep    0.21 / 0.20   0.14 / 0.12   0.19 / 0.19   0.21 / 0.16   0.19 / 0.17   0.09
Square       0.28 / 0.26   0.34 / 0.29   0.30 / 0.26   0.30 / 0.29   0.30 / 0.27   0.16
Swing        0.22 / 0.21   0.22 / 0.22   0.22 / 0.23   0.24 / 0.26   0.23 / 0.23   0.15
Tango        0.28 / 0.29   0.39 / 0.37   0.34 / 0.38   0.31 / 0.33   0.33 / 0.34   0.17
Waltz        0.52 / 0.51   0.35 / 0.35   0.38 / 0.31   0.48 / 0.41   0.43 / 0.40   0.15
Average      0.28 / 0.27   0.25 / 0.24   0.27 / 0.25   0.27 / 0.25   0.26 / 0.25   0.13
Overall      0.28 / 0.27   0.27 / 0.26   0.27 / 0.26   0.28 / 0.26   0.28 / 0.26   0.14

Since semantic labels are inferred by humans based on semiotic aspects, this implies that some of the latent semiotic aspects learned by our model are also relevant for these semantic labels, i.e., these aspects are semantically rich. Therefore, modeling semiotic correlations between audio and dance effectively uncovers semantic aspects. The results show a link between musical meaning and kinetic meaning, providing further evidence for embodied cognition semantics in music. This is because embodied semantics ultimately holds that meaning in music is grounded in motor and somatosensory concepts, i.e., movement, physical forces, and physical tension. By observing that dance, a body-expression proxy for how those concepts correlate to the musical experience, is semiotically correlated to music artifacts, we show that music semantics is kinetically and biologically grounded.
Furthermore, our quantitative results also demonstrate an effective technique for cross-modal retrieval between music audio and dance video, providing the basis for an automatic music video creation tool. This basis consists of a model that can recommend the song that best fits a particular dance video and the dance video that best fits a particular song. The class-level evaluation also validates the whole ranking of results, which means that the model can actually recommend several songs or videos that best fit the dual modality.

6. Conclusions and future work

We proposed a computational approach to model music embodied semantics via dance proxies, capable of recommending music audio for dance video and vice-versa. Quantitative evaluation shows this model to be effective for this cross-modal retrieval task and further validates claims about music semantics being defined by embodied cognition. Future work includes: correlating audio with 3D motion capture data instead of dance videos, in order to verify whether important spatial information is lost in 2D representations; incorporating Laban movement analysis features and other audio features, in order to have fine-grained control over which aspects of both music and movement are examined; testing the learned semantic spaces in transfer learning settings; and exploring the use of generative models (such as Generative Adversarial Networks (GANs)) to generate and visualize human skeleton dance videos for a given audio input.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.
S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., & Zheng, X. (2015). TensorFlow: Large-scale Machine Learning on Heterogeneous Systems.

Andrew, G., Arora, R., Bilmes, J., & Livescu, K. (2013). Deep Canonical Correlation Analysis. In Proceedings of the 30th International Conference on Machine Learning (pp. 1247–1255).

Azcárate, A. L.-V. (2011). Intertextuality and Intermediality as Cross-cultural Communication Tools: A Critical Inquiry. Cultura. International Journal of Philosophy of Culture and Axiology, 8, 7–22.

Baily, J. (1985). Music Structure and Human Movement. In P. Howell, I. Cross, & R. West (Eds.), Musical Structure and Cognition. London: Academic Press.

Bennett, K. (2008). The Language of Dance. Textos & Pretextos, (pp. 56–67).

Bestgen, Y. (2015). Exact Expected Average Precision of the Random Baseline for System Evaluation. The Prague Bulletin of Mathematical Linguistics, (pp. 131–138).

Bidelman, G. M., & Krishnan, A. (2009). Neural Correlates of Consonance, Dissonance, and the Hierarchy of Musical Pitch in the Human Brainstem. Journal of Neuroscience, 29, 13165–13171.

Bidelman, G. M., & Krishnan, A. (2011). Brainstem Correlates of Behavioral and Compositional Preferences of Musical Harmony. Neuroreport, 22, 212–216.

Blanariu, N. P. (2013). Towards a Framework of a Semiotics of Dance. CLCWeb: Comparative Literature and Culture, 15.

Brown, S., & Jordania, J. (2011). Universals in the World's Musics. Psychology of Music, 41, 229–248.
Castro, D., Hickson, S., Sangkloy, P., Mittal, B., Dai, S., Hays, J., & Essa, I. A. (2018). Let's Dance: Learning from Online Dance Videos. CoRR, abs/1801.07388.

Cespedes-Guevara, J., & Eerola, T. (2018). Music Communicates Affects, Not Basic Emotions - A Constructionist Account of Attribution of Emotional Meanings to Music. Frontiers in Psychology, 9.

Chan, C., Ginosar, S., Zhou, T., & Efros, A. A. (2018). Everybody Dance Now. CoRR, abs/1808.07371.

Cook, N. (2000). Analysing Musical Multimedia. Oxford University Press.

Davies, S. (1994). Musical Meaning and Expression. Cornell University Press.

Dehaene, S., & Cohen, L. (2007). Cultural Recycling of Cortical Maps. Neuron, 56, 384–398.

Desai, R. H., Binder, J. R., Conant, L. L., Mano, Q. R., & Seidenberg, M. S. (2011). The Neural Career of Sensory-motor Metaphors. Journal of Cognitive Neuroscience, 23, 2376–2386.

Eitan, Z., & Granot, R. Y. (2006). How Music Moves: Music Parameters and Listeners' Images of Motion. Music Perception, 23, 221–248.

Eitan, Z., & Rothschild, I. (2011). How Music Touches: Musical Parameters and Listeners' Audio-tactile Metaphorical Mappings. Music Perception, 39, 449–467.

Frego, R. J. D. (1999). Effects of Aural and Visual Conditions on Response to Perceived Artistic Tension in Music and Dance. Journal of Research in Music Education, 47, 31–43.

Fujioka, T., Trainor, L. J., Large, E. W., & Ross, B. (2012). Internalized Timing of Isochronous Sounds is Represented in Neuromagnetic Beta Oscillations. Journal of Neuroscience, 32, 1791–1802.

He, W., Wang, W., & Livescu, K. (2017). Multi-view Recurrent Neural Acoustic Word Embeddings. In Proceedings of the 5th International Conference on Learning Representations.

Hermann, K. M., & Blunsom, P. (2014). Multilingual Distributed Representations Without Word Alignment.
In Proceedings of the 2nd International Conference on Learning Representations.

Hotelling, H. (1936). Relations Between Two Sets of Variates. Biometrika, 28, 321–377.

Janata, P., Tomic, S. T., & Haberman, J. M. (2012). Sensorimotor Coupling in Music and the Psychology of the Groove. Journal of Experimental Psychology: General, 141, 54–75.

Juslin, P. N. (2013). What does Music Express? Basic Emotions and Beyond. Frontiers in Psychology, 4.

Kelkar, T., Roy, U., & Jensenius, A. R. (2018). Evaluating a Collection of Sound-tracing Data of Melodic Phrases. In Proceedings of the 19th International Society for Music Information Retrieval (pp. 74–81).

Kiefer, M., & Pulvermüller, F. (2012). Conceptual Representations in Mind and Brain: Theoretical Developments, Current Evidence and Future Directions. Cortex, 48, 805–825.

Kivy, P. (1980). The Corded Shell: Reflections on Musical Expression. Princeton University Press.

Knapp, T. R. (1978). Canonical Correlation Analysis: A General Parametric Significance-testing System. Psychological Bulletin, 85, 410–416.

Koelsch, S., Vuust, P., & Friston, K. (2019). Predictive Processes and the Peculiar Case of Music. Trends in Cognitive Sciences, 23, 63–77.

Korsakova-Kreyn, M. (2018). Two-level Model of Embodied Cognition in Music. Psychomusicology: Music, Mind, and Brain, 28, 240–259.

Krumhansl, C. L., & Schenck, D. L. (1997). Can Dance Reflect the Structural and Expressive Qualities of Music? A Perceptual Experiment on Balanchine's Choreography of Mozart's Divertimento No. 15. Musicae Scientiae, 1, 63–85.

Kurth, E. (1991). Ernst Kurth: Selected Writings. Cambridge University Press.

von Laban, R., & Ullmann, L. (1960). The Mastery of Movement. London: MacDonald & Evans.

Lakoff, G. (2012). Explaining Embodied Cognition Results. Topics in Cognitive Science, 4, 773–785.

Lakoff, G. (2014).
Mapping the Brain's Metaphor Circuitry: Metaphorical Thought in Everyday Reason. Frontiers in Human Neuroscience, 8.

Larson, S. (2004). Musical Forces and Melodic Expectations: Comparing Computer Models and Experimental Results. Music Perception, 21, 457–498.

Leman, M. (2007). Embodied Music Cognition and Mediation Technology. MIT Press.

Leman, M. (2010). An Embodied Approach to Music Semantics. Musicae Scientiae, 14, 43–67.

Leman, M., & Maes, P.-J. (2014). The Role of Embodiment in the Perception of Music. Empirical Musicology Review, 9, 236–246.

Lemke, J. L. (1992). Intertextuality and Educational Research. Linguistics and Education, 4, 257–267.

Matyja, J. R. (2016). Embodied Music Cognition: Trouble Ahead, Trouble Behind. Frontiers in Psychology, 7.

Nakamura, S., Oohashi, N. S. T., Nishina, E., Fuwamoto, Y., & Yonekura, Y. (1999). Analysis of Music-brain Interaction with Simultaneous Measurement of Regional Cerebral Blood Flow and Electroencephalogram Beta Rhythm in Human Subjects. Neuroscience Letters, 275, 222–226.

Nozaradan, S., Peretz, I., Missal, M., & Mouraux, A. (2011). Tagging the Neuronal Entrainment to Beat and Meter. Journal of Neuroscience, 31, 10234–10240.

Peirce, C. S. (1991). On Signs: Writings on Semiotic. University of North Carolina Press.

Penhune, V. B., Zatorre, R. J., & Evans, A. C. (1998). Cerebellar Contributions to Motor Timing: A PET Study of Auditory and Visual Rhythm Reproduction. Journal of Cognitive Neuroscience, 10, 752–765.

Pereira, F., Lou, B., Pritchett, B., Ritter, S., Gershman, S. J., Kanwisher, N., Botvinick, M., & Fedorenko, E. (2018). Toward a Universal Decoder of Linguistic Meaning from Brain Activation. Nature Communications, 9.

Phillips-Silver, J., & Trainor, L. J. (2007). Hearing What the Body Feels: Auditory Encoding of Rhythmic Movement. Cognition, 105, 533–546.
Platel, H., Price, C., Baron, J.-C., Wise, R., Lambert, J., Frackowiak, R. S. J., Lechevalier, B., & Eustache, F. (1997). The Structural Components of Music Perception: A Functional Anatomical Study. Brain, 120, 229–243.

Ralph, M. A. L., Jefferies, E., Patterson, K., & Rogers, T. T. (2017). The Neural and Computational Bases of Semantic Cognition. Nature Reviews Neuroscience, 18, 42–55.

Redmon, J., Divvala, S. K., Girshick, R. B., & Farhadi, A. (2016). You Only Look Once: Unified, Real-time Object Detection. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (pp. 779–788).

Roffler, S. K., & Butler, R. A. (1968). Localization of Tonal Stimuli in the Vertical Plane. Journal of the Acoustical Society of America, 43, 1260–1265.

Schlenker, P. (2017). Outline of Music Semantics. Music Perception, 35, 3–37.

Scruton, R. (1997). The Aesthetics of Music. Oxford University Press.

Sievers, B., Polansky, L., Casey, M., & Wheatley, T. (2013). Music and Movement Share a Dynamic Structure that Supports Universal Expressions of Emotion. Proceedings of the National Academy of Sciences of the United States of America, 110, 70–75.

Spence, C., & Driver, J. (1997). Audiovisual Links in Exogenous Covert Spatial Orienting. Perception & Psychophysics, 59, 1–22.

Stein, B. E., Wallace, M. T., & Meredith, M. A. (1995). Neural Mechanisms Mediating Attention and Orientation to Multisensory Cues. In M. S. Gazzaniga (Ed.), The Cognitive Neurosciences (pp. 683–702). MIT Press.

Styns, F., van Noorden, L., Moelants, D., & Leman, M. (2007). Walking on Music. Human Movement Science, 26, 769–785.

Virtala, P., Huotilainen, M., Partanen, E., Fellman, V., & Tervaniemi, M. (2013). Newborn Infants' Auditory System is Sensitive to Western Music Chord Categories. Frontiers in Psychology, 4.

Wagner, S., Winner, E., Cicchetti, D., & Gardner, H. (1981).
"Metaphorical" Mapping in Human Infants. Child Development, 52, 728–731.

Wei, S.-E., Ramakrishna, V., Kanade, T., & Sheikh, Y. (2016). Convolutional Pose Machines. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (pp. 4724–4732).

Widmann, A., Kujala, T., Tervaniemi, M., Kujala, A., & Schröger, E. (2004). From Symbols to Sounds: Visual Symbolic Information Activates Sound Representations. Psychophysiology, 41, 709–715.

Yu, Y., Tang, S., Raposo, F., & Chen, L. (2019). Deep Cross-modal Correlation Learning for Audio and Lyrics in Music Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications, 15.

Zatorre, R. J., Evans, A. C., & Meyer, E. (1994). Neural Mechanisms Underlying Melodic Perception and Memory for Pitch. Journal of Neuroscience, 14, 1908–1919.
