A data-driven approach to mid-level perceptual musical feature modeling

Authors: Anna Aljanaki, Mohammad Soleymani

Anna Aljanaki, Institute of Computational Perception, Johannes Kepler University (aljanaki@gmail.com)
Mohammad Soleymani, Swiss Center for Affective Sciences, Geneva University (mohammad.soleymani@unige.ch)

ABSTRACT

Musical features and descriptors could be coarsely divided into three levels of complexity. The bottom level contains the basic building blocks of music, e.g., chords, beats and timbre. The middle level contains concepts that emerge from combining the basic blocks: tonal and rhythmic stability, harmonic and rhythmic complexity, etc. High-level descriptors (genre, mood, expressive style) are usually modeled using the lower-level ones. The features belonging to the middle level can both improve automatic recognition of high-level descriptors and provide new music retrieval possibilities. Mid-level features are subjective and usually lack clear definitions. However, they are very important for human perception of music, and on some of them people can reach high agreement, even though defining them, and therefore designing a hand-crafted feature extractor for them, can be difficult. In this paper, we derive the mid-level descriptors from data. We collect and release a dataset (https://osf.io/5aupt/) of 5000 songs annotated by musicians with seven mid-level descriptors, namely melodiousness, tonal and rhythmic stability, modality, rhythmic complexity, dissonance and articulation. We then compare several approaches to predicting these descriptors from spectrograms using deep learning. We also demonstrate the usefulness of these mid-level features using music emotion recognition as an application.

© Anna Aljanaki, Mohammad Soleymani. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Anna Aljanaki, Mohammad Soleymani. "A data-driven approach to mid-level perceptual musical feature modeling", 19th International Society for Music Information Retrieval Conference, Paris, France, 2018.

1. INTRODUCTION

In music information retrieval, features extracted from audio or a symbolic representation are often categorized as low- or high-level [5], [17]. There is no clear boundary between these concepts and the terms are not used consistently. Usually, features that are extracted using a small analysis window that does not contain temporal information are called low-level (e.g., spectral features, MFCCs, loudness). Features that are defined within a longer context (and often related to music-theoretical concepts) are called high-level (key, tempo, melody). In this paper, we look at these levels from the point of view of human perception, and define what constitutes the low, middle and high levels depending on the complexity and subjectivity of a concept. Unambiguously defined and objectively verifiable concepts (beats, onsets, instrument timbres) will be called low-level. Subjective, complex concepts that can only be defined by considering every aspect of music will be called high-level (mood, genre, similarity). Everything in between we will call mid-level.

Musical concepts can best be viewed and defined through the lens of human perception. It is often not enough to approximate them through a simpler concept or feature. For instance, music speed (whether music is perceived as fast or slow) is not explained by or equivalent to tempo (beats per minute). In fact, perceptual speed is better approximated (but not completely explained) by onset rate [8]. There are many examples of mid-level concepts: harmonic complexity, rhythmic stability, melodiousness, tonal stability, structural regularity [10], [24]. Such a meta-language could be used to improve search and retrieval, to add interpretability to models of high-level concepts, and maybe even break the glass ceiling in the accuracy of their recognition.

In this paper we collect a dataset and model these concepts directly from data using transfer learning.

2. RELATED WORK

Many algorithms have been developed to model features describing such aspects of music as articulation, melodiousness, and rhythmic and dynamic patterns. The MIRtoolbox and Essentia frameworks offer many algorithms that can extract features related to harmony, rhythm, articulation and timbre [13], [3]. These features are usually extracted using some hand-crafted algorithm and have a differing amount of psychoacoustic and perceptual basis.

For example, Salamon et al. developed a set of melodic features which extract pitch contours from a melody obtained with a melody extraction algorithm [22]. Measures such as percussiveness [17], pulse clarity [12] and danceability [23] have also been proposed. Panda et al. proposed a set of algorithms to extract descriptors related to melody, rhythm and texture from MIDI and audio [19]. It is out of our scope to review all existing algorithms for detecting what we call mid-level perceptual music concepts.

All the algorithms listed so far were designed with some hypothesis about music perception in mind. For instance, Essentia offers an algorithm to compute sensory dissonance, which sums up the dissonance values for each pair of spectral peaks, based on dissonance curves obtained from perceptual measurements [20]. Such an algorithm measures a specific aspect of music in a transparent way, but it is hard to say whether it captures all the aspects of a perceptual feature.

Friberg et al. collected perceptual ratings for nine features (rhythmic complexity and clarity, dynamics, harmonic complexity, pitch, etc.) for a set of 100 songs and modeled them using available automatic feature extractors, which showed that algorithms can cope with some concepts and fail with others [8]. For instance, for such an important feature as modality (majorness) there is no adequate solution yet. It was also shown that with just several perceptual features it is possible to model emotion in music with a higher accuracy than is possible using features extracted with MIR software [1], [8], [9].

In this paper we propose an approach to mid-level feature modeling that is more similar to automatic tagging [6]. We try to approximate the perceptual concepts by modeling them directly from the ratings of listeners.
3. DATA COLLECTION

From the literature ([10], [24], [8]) we composed a list of perceptual musical concepts and picked 7 recurring items. Table 1 shows the selected terms. The concepts that we are interested in stem from musicological vocabulary. Identifying and naming them is a complicated task that requires musical training. This does not mean that these concepts are meaningless and are not perceived by an average music listener, but we cannot trust an average listener to apply the terms in a consistent way. We used the Toloka crowd-sourcing platform (toloka.yandex.ru) to find people with musical training to do the annotation. We invited anyone who has music education to take a musical test, which contained questions on harmony (tonality, identifying the mode of chords), expressive terms (rubato, dynamics, articulation), pitch and timbre. We also asked the crowd-sourcing workers to briefly describe their music education. Of the 2236 people who took the test, slightly less than 7% (155 crowd-sourcing workers) passed it and were invited to participate in the annotation.

Table 1. Perceptual mid-level features and the questions that were provided to raters to help them compare two excerpts.

| Perceptual feature | Criteria when comparing two excerpts | Cronbach's α |
| Melodiousness | To which excerpt do you feel like singing along? | 0.72 |
| Articulation | Which has more sounds with staccato articulation? | 0.80 |
| Rhythmic stability | Imagine marching along with the music. Which is easier to march along with? | 0.69 |
| Rhythmic complexity | Is it difficult to repeat by tapping? Is it difficult to find the meter? Does the rhythm have many layers? | 0.27 (0.47) |
| Dissonance | Which excerpt has a noisier timbre? Which has more dissonant intervals (tritones, seconds, etc.)? | 0.74 |
| Tonal stability | Where is it easier to determine the tonic and key? In which excerpt are there more modulations? | 0.44 |
| Modality | Imagine accompanying this song with chords. Which song would have more minor chords? | 0.69 |

3.0.1 Definitions

The terminology (articulation, mode, etc.) that we use comes from musicology, but it was not designed to be used in the way that we use it. For instance, the concept of articulation is defined for a single note (or can also be extended to a group of notes). Applying it to a real-life recording with possibly several instruments and voices is not an easy task. To ensure common understanding, we offer the annotators a set of definitions as shown in Table 1. The general principle is to consider the recording as a whole.

3.1 Pairwise comparisons

It is easier for annotators to compare two items using a certain criterion than to give a rating on an absolute scale, especially so for subjective and vaguely defined concepts [14]. A ranking can then be formed from pairwise comparisons. However, annotating a sufficient amount of songs using pairwise comparisons is too labor-intensive. Collecting a full pairwise comparison matrix (not counting repetitions and self-similarity) requires (n² − n)/2 comparisons. For our desired target of 5000 songs, that would mean ≈ 12.5 million comparisons. It is possible to construct a ranking with less than a full pairwise comparison matrix, but for a big dataset it is still not a feasible approach. We therefore combine the two approaches: we first collected pairwise comparisons for a small amount of songs, obtained a ranking, and then created an absolute scale that we used to collect the ratings. In this way, we also implicitly define our concepts through examples, without a need to explicitly describe all their aspects.

3.1.1 Music selection

For pairwise comparisons, we selected 100 songs. This music needed to be diverse, because it was going to be used as examples and needed to be able to represent the extremes. We used two criteria to achieve that: genre and emotion. From each of the five music preference clusters of Rentfrow et al. [21] we selected a list of genres belonging to these clusters and picked songs from the DEAM dataset [2] belonging to these genres (pop, rock, hip-hop, rap, jazz, classical, electronic), taking 20 songs from each of the preference clusters. Also, using the annotations from DEAM, we ensured that the selected songs are uniformly distributed over the four quadrants of the valence/arousal plane. From each of the songs we cut a segment of 15 seconds.

For the set of 100 songs we collected 2950 comparisons. Next, we created a ranking by counting the percentage of comparisons won by a song relative to the overall number of comparisons per song. By sampling from that ranking we created seven scales with song examples from 1 to 9 for each of the mid-level perceptual features (for instance, from the least melodious (1) to the most melodious (9)). Some of the musical examples appeared in several scales.
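The ranking step described above is straightforward to reproduce. The sketch below is a minimal illustration (hypothetical code, not from the paper) that turns a list of pairwise outcomes into a win-rate ranking and samples nine scale anchors for one feature; the variable names and the evenly spaced sampling strategy are assumptions.

```python
from collections import Counter
import numpy as np

def rank_by_win_rate(comparisons):
    """comparisons: list of (winner_id, loser_id) pairs for one perceptual feature.
    Returns song ids sorted from lowest to highest win rate."""
    wins, totals = Counter(), Counter()
    for winner, loser in comparisons:
        wins[winner] += 1
        totals[winner] += 1
        totals[loser] += 1
    win_rate = {song: wins[song] / totals[song] for song in totals}
    return sorted(win_rate, key=win_rate.get)

def sample_scale_anchors(ranked_songs, n_anchors=9):
    """Pick example songs for the 1..9 scale by sampling evenly along the ranking."""
    idx = np.linspace(0, len(ranked_songs) - 1, n_anchors).round().astype(int)
    return [ranked_songs[i] for i in idx]

# Usage with toy data:
# ranking = rank_by_win_rate([("a", "b"), ("c", "a"), ("c", "b")])
# anchors = sample_scale_anchors(ranking)
```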
3.2 Ratings on 7 perceptual mid-level features

The ratings were again collected on the Toloka platform, and the workers were selected using the same musical test. The rating procedure was as follows. First, a worker listened to a 15-second excerpt. Next, for a certain scale (for instance, articulation), the worker compared the excerpt with examples arranged from "legato" to "staccato" and found a proper rating. Finally, this was repeated for each of the 7 perceptual features.

3.2.1 Music selection

Most of the dataset music consists of Creative Commons licensed music from jamendo.com and magnatune.com. For annotation, we cut 15 seconds from the middle of each song. In the dataset, we provide the segments and links to the full songs. There is a restriction of no more than 5 songs from the same artist. The songs from jamendo.com were also filtered by popularity, in the hope of getting music with better recording quality. We also reused music from datasets annotated with emotion [7], [18], [15], which we are going to use to indirectly test the validity of the annotations.

3.2.2 Data

Figure 1 shows the distributions of the ratings for every feature. The music in the dataset leans slightly towards being rhythmically stable, tonally stable and consonant. The scales could also be readjusted to have more examples in the regions of highest density. That might not necessarily help, because the observed distributions could also be artifacts of people preferring to avoid the extremes. Table 2 shows the correlations between the perceptual features. There is a strong negative correlation between melodiousness and dissonance, and a positive relationship between articulation and rhythmic stability. Tonal stability is negatively correlated with dissonance and positively with melodiousness.

[Figure 1: Distribution of discrete ratings per perceptual feature.]

Table 2. Correlations between the perceptual mid-level features.

| | Articulation | R. complexity | R. stability | Dissonance | Tonal stability | Mode |
| Melodiousness | −0.13 | −0.22 | 0.27 | −0.59 | 0.58 | −0.22 |
| Articulation | | 0.39 | 0.60 | 0.45 | −0.05 | −0.14 |
| R. complexity | | | −0.009 | 0.48 | −0.30 | 0.06 |
| R. stability | | | | 0.06 | 0.36 | −0.17 |
| Dissonance | | | | | −0.55 | 0.23 |
| Tonal stability | | | | | | −0.16 |

3.3 Consistency

Any crowd-sourcing worker could stop annotating at any point, so the amount of annotated songs per person varied. The average amount of songs per worker was 187.01 ± 500.68. On average, it took ≈ 2 minutes to answer all seven questions for one song. Our goal was to collect 5 annotations per song, which amounts to ≈ 833 man-hours.

In order to ensure quality, a set of songs with high-quality annotations (high agreement by well-performing workers) was interlaced with new songs, and the annotations of every crowd-sourcing worker were compared against that gold standard. Workers that gave answers very far from the standard were banned. Also, the answers were compared to the average answer per song, and workers whose standard deviation was close to the one resulting from random guessing were also banned and their answers discarded. The final annotations contain answers of 115 workers out of the pool of 155 who passed the musical test.

Table 1 shows a measure of agreement (Cronbach's α) for each of the mid-level features. The annotators reach good agreement for most of the features, except rhythmic complexity and tonal stability. We created a different musical test, containing only questions about rhythm, and collected more annotations. Also, we provided more examples on the rhythmic complexity scale. This helped a little (Cronbach's α improved from 0.27 to 0.47), but rhythmic complexity still has much worse agreement than the other properties. In a study by Friberg and Hedblad [8], where similar perceptual features were annotated for a small set of songs, the situation was similar: the least consistent properties were harmonic complexity and rhythmic complexity.

We average the ratings for every mid-level feature per song. The annotations and the corresponding excerpts (or links to the external reused datasets) are available online (osf.io/5aupt). All the experiments below are performed on averaged ratings.
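Cronbach's α, the agreement measure reported in Table 1, can be computed directly from a song-by-rater rating matrix using the standard formula α = k/(k−1) · (1 − Σ σ²_rater / σ²_total). The sketch below is an illustration only; it assumes a complete matrix, which simplifies the actual crowd-sourced data where each song is rated by a changing subset of workers.

```python
import numpy as np

def cronbach_alpha(ratings):
    """ratings: 2-D array of shape (n_songs, n_raters), one column per rater.
    Returns Cronbach's alpha, treating raters as the 'items'."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                          # number of raters
    rater_vars = ratings.var(axis=0, ddof=1)      # variance of each rater's ratings
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of the summed ratings
    return k / (k - 1) * (1 - rater_vars.sum() / total_var)

# Toy check: two perfectly consistent raters give alpha = 1.0
print(cronbach_alpha([[1, 1], [5, 5], [9, 9]]))
```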
3.4 Emotion dimensions and categories

The Soundtracks dataset contains 15-second excerpts from film music, annotated with valence, arousal, tension, and 5 basic emotions [7]. We show that our annotations are meaningful by using them to model musical emotion in the Soundtracks dataset. The averaged ratings per song for each of the seven mid-level concepts are used as features in a linear regression model (10-fold cross-validation). Table 3 shows the correlation coefficient and the most important features for each dimension, which are consistent with the findings in the literature [10]. We can model most dimensions well, despite not having any information about loudness and tempo.

Table 3. Modeling emotional categories in the Soundtracks dataset using seven mid-level features.

| Emotional dimension or category | Pearson's ρ (prediction) | Important features |
| Valence | 0.88 | Mode (major), melodiousness (pos.), dissonance (neg.) |
| Energy | 0.79 | Articulation (staccato), dissonance (pos.) |
| Tension | 0.84 | Dissonance (pos.), melodiousness (neg.) |
| Anger | 0.65 | Dissonance (pos.), mode (minor), articulation (staccato) |
| Fear | 0.82 | Rhythmic stability (neg.), melodiousness (neg.) |
| Happy | 0.81 | Mode (major), tonal stability (pos.) |
| Sad | 0.73 | Mode (minor), melodiousness (pos.) |
| Tender | 0.72 | Articulation (legato), mode (minor), dissonance (neg.) |
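As a rough illustration of this evaluation, the sketch below (hypothetical code, not the authors' script) fits a linear regression on the seven averaged mid-level ratings and reports the Pearson correlation between 10-fold cross-validated predictions and one annotated emotion dimension. The array names and the random placeholder data standing in for the real Soundtracks annotations are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# X: (n_songs, 7) averaged mid-level ratings; y: (n_songs,) e.g. valence ratings.
rng = np.random.default_rng(0)
X = rng.uniform(1, 9, size=(360, 7))   # placeholder for the mid-level ratings
y = rng.uniform(1, 9, size=360)        # placeholder for one emotion dimension

cv = KFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(LinearRegression(), X, y, cv=cv)
rho, _ = pearsonr(y, y_pred)
print(f"cross-validated Pearson correlation: {rho:.2f}")
```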
3.5 MIREX clusters

The Multi-modal dataset contains 903 songs annotated with the 5 clusters used in the MIREX Mood recognition competition (www.music-ir.org/mirex) [18]. Table 4 shows the results of predicting the five clusters using the seven mid-level features and an SVM classifier. The average weighted F1 measure over all the clusters on this dataset is 0.54. In [18], with an SVM classifier trained on 253 audio features extracted with various toolboxes, the F1 measure was 44.9, and 52.3 with 98 melodic features. By combining these feature sets and doing feature selection using feature ranking, the F1 measure was increased to 64.0. Panda et al. hypothesize that the Multi-modal dataset is more difficult than the MIREX dataset (their method performed better (0.67) in the MIREX competition than on their own dataset). In the MIREX data, the songs went through an additional annotation step to ensure agreement on cluster assignment, and only songs that 2 out of 3 experts agreed on were kept.

Table 4. Modeling MIREX clusters with perceptual features.

| Cluster | AUC | F-measure |
| Cluster 1 (passionate, confident) | 0.62 | 0.38 |
| Cluster 2 (cheerful, fun) | 0.70 | 0.50 |
| Cluster 3 (bittersweet) | 0.80 | 0.67 |
| Cluster 4 (humorous) | 0.65 | 0.45 |
| Cluster 5 (aggressive) | 0.78 | 0.64 |

4. EXPERIMENTS

We left out 8% of the data as a test set. We split the train set and test set by performer (no performer from the test set appears in the training set). Also, all the performers in the test set are unique. For pretraining, we used songs from jamendo.com, making sure that the songs used for pretraining do not reappear in the test set. The rest of the data was used for training and validation (whenever we needed to validate any hyperparameters, we used 2% of the train set for that).

From each of the 15-second excerpts we computed a mel-spectrogram with 299 mel filters and a frequency range of 18000 Hz, extracted with a 2048-sample window (44100 Hz sampling rate) and a hop of 1536. In order to use it as an input to a neural network, it was cut to a rectangular shape (299 by 299), which corresponds to about 11 seconds of music. Because the original mel-spectrogram is a bit larger, we can randomly shift the rectangular window and select a different segment. For some of the songs, full-length recordings are also available, and it was possible to extract the mel-spectrogram from any place in a song, but in practice this worked worse than selecting a precise spot.
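A mel-spectrogram with these parameters could be computed, for example, with librosa; the paper does not name the library, so the sketch below is an assumed reimplementation, including the random 299 by 299 crop used as network input and an assumed log scaling.

```python
import numpy as np
import librosa

def excerpt_to_input(path, sr=44100, n_mels=299, n_frames=299):
    """Load a 15-second excerpt and return a randomly cropped 299x299 log-mel patch."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, hop_length=1536, n_mels=n_mels, fmax=18000)
    log_mel = librosa.power_to_db(mel)          # log scaling is an assumption
    # 15 s at hop 1536 gives ~430 frames; randomly shift the 299-frame window.
    offset = np.random.randint(0, log_mel.shape[1] - n_frames + 1)
    return log_mel[:, offset:offset + n_frames]  # shape (299, 299), ~11 s of audio
```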
We also tried other data representations: spectrograms and custom data representations (time-varying chroma for the tonal features and time-varying Bark bands for the rhythmic features). The custom representations were trained with a two-layer recurrent network. These representations worked worse than mel-spectrograms with a deep network.

4.1 Training a deep network

We chose the Inception v3 architecture [4]. The first five layers are convolutional layers with 3-by-3 filters. Max-pooling is applied twice. The last layers of the network are the so-called "inception layers", which apply filters of different sizes in parallel and merge the feature maps later. We begin by training this network without any pretraining.

4.1.1 Transfer learning

With a dataset of only 5000 excerpts, it is hard to prevent overfitting when learning features from the very basic music representation (mel-spectrogram), as was done in [6] on a much larger dataset. In this case, transfer learning can help.

4.1.2 Data for pretraining

We crawl data and tags from Jamendo, using the API provided by this music platform. We select all the tags which were applied to at least 3000 songs. That leaves us with 65 tags and 184002 songs. For training, we extract a mel-spectrogram from a random place in a song. We leave 5% of the data out as a test set. After training on mini-batches of 32 examples with the Adam optimizer for 29 epochs, we achieve an average area under the receiver operating characteristic curve (AUC) of 0.8 on the test set. The AUC on the test set grouped by tag is shown in Figure 2 (only the 15 best and 15 worst performing tags). Some of the songs in the mid-level feature dataset were also chosen from Jamendo.

[Figure 2: AUC per tag on the test set.]

4.1.3 Transfer learning on mid-level features

The last layer of Inception, before the 65 neurons that predict classes (tags), contains 2048 neurons. We pass the mel-spectrograms of the mid-level feature dataset through the network and extract the activations of this layer. We normalize these extracted features using the mean and standard deviation of the training set. On the training set, we fit a PCA with 30 principal components (the number was chosen based on the decline of the eigenvalues of the components) and then apply the learned transformation to the validation and test sets. On the validation set, we tune the parameters of an SVR with a radial basis function kernel and finally, we predict the seven mid-level features on the test set.
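A minimal scikit-learn sketch of this transfer-learning pipeline is shown below, assuming the 2048-dimensional Inception activations have already been extracted into arrays. The hyperparameter grid is an assumption, and in practice one SVR would be fit per mid-level feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# X_train, X_val: (n, 2048) Inception activations; y_train, y_val: one mid-level feature.
def fit_transfer_svr(X_train, y_train, X_val, y_val):
    scaler = StandardScaler().fit(X_train)          # normalize with training statistics
    pca = PCA(n_components=30).fit(scaler.transform(X_train))
    project = lambda X: pca.transform(scaler.transform(X))

    # Tune the RBF-SVR on the predefined validation split (grid values are assumptions).
    X_all = np.vstack([project(X_train), project(X_val)])
    y_all = np.concatenate([y_train, y_val])
    fold = np.concatenate([-np.ones(len(y_train), dtype=int),
                           np.zeros(len(y_val), dtype=int)])
    search = GridSearchCV(SVR(kernel="rbf"),
                          {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
                          cv=PredefinedSplit(fold))
    search.fit(X_all, y_all)
    return scaler, pca, search.best_estimator_
```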
4.2 Fine-tuning the trained model for mid-level features

On top of the last Inception layer we add two fully-connected layers with 150 and 30 neurons, both with ReLU activation, and an output layer with 7 nodes with no activation (we train on all the features at the same time). First, we freeze the pretrained weights of the Inception network and train the last layer weights until there is no improvement on the validation set anymore. At this point, the network reaches the same performance on the test set as it reached using transfer learning and PCA (which is what we would expect). Then we unfreeze the weights and, with a small learning rate, continue training the whole network until it stops improving on the validation set.
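The head-then-full fine-tuning schedule might look roughly like the following Keras sketch. It is an illustration, not the authors' code: the paper does not state the framework, the input-channel handling, the loss or the learning rates, so the three-channel input (the spectrogram replicated across channels), mean-squared-error loss and the shown learning rates are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

# An Inception v3 backbone pretrained on the Jamendo tagging task is assumed;
# here the same topology is built from scratch for illustration.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights=None, input_shape=(299, 299, 3), pooling="avg")

head = models.Sequential([
    layers.Dense(150, activation="relu"),
    layers.Dense(30, activation="relu"),
    layers.Dense(7),                    # linear outputs: the 7 mid-level features
])
model = models.Sequential([base, head])

# Stage 1: freeze the backbone, train only the new head.
base.trainable = False
model.compile(optimizer=optimizers.Adam(1e-3), loss="mse")
# model.fit(train_ds, validation_data=val_ds, callbacks=[early_stopping])

# Stage 2: unfreeze and fine-tune the whole network with a small learning rate.
base.trainable = True
model.compile(optimizer=optimizers.Adam(1e-5), loss="mse")
# model.fit(train_ds, validation_data=val_ds, callbacks=[early_stopping])
```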
4.3 Existing algorithms

There are many feature extraction frameworks for MIR. Some of them (jAudio, Aubio, Marsyas) only offer timbral and spectral features; others (Essentia, MIRtoolbox, VAMP plugins for Sonic Annotator) offer features which are similar to the mid-level features of this paper. Figure 3 shows the correlation of some of these features with our perceptual ratings:

1. Articulation. MIRtoolbox offers features describing characteristics of onsets (attack time, attack slope, leap (duration of attack), decay time, decay slope and decay leap). Out of these features, leap was chosen, as it had the strongest correlation with the perceptual articulation feature.

2. Rhythmic stability. Pulse clarity (MIRtoolbox) [16].

3. Dissonance. Both Essentia and MIRtoolbox offer a feature describing sensory dissonance (in MIRtoolbox, it is called roughness), which is based on the same research on dissonance perception [20] (see the sketch after this list). We extract this feature and inharmonicity. Inharmonicity only had a weak (0.22) correlation with perceptual dissonance. Figure 3 shows the result for the dissonance measure.

4. Tonal stability. HCDF (harmonic change detection function) in MIRtoolbox is a feature measuring the flux of the tonal centroid [11]. This feature was not correlated with our tonal stability feature.

5. Modality. MIRtoolbox offers a feature called mode, which is based on the uncertainty in determining the key using pitch-class profiles.

We could not find features corresponding to melodiousness and rhythmic complexity. Perceptual concepts lack clear definitions, so it is impossible to say that the feature extractor algorithms are supposed to directly measure the same concepts that we annotated. However, from Figure 3 we can see that the chosen descriptors do indeed capture some part of the variance in the perceptual features.
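For dissonance, a baseline of this kind could be computed with Essentia's sensory dissonance algorithm and then correlated with the averaged perceptual ratings. The following sketch assumes Essentia's standard Python mode and averages the frame-wise values over a 15-second excerpt; the frame parameters and the averaging are assumptions.

```python
import numpy as np
import essentia.standard as es
from scipy.stats import pearsonr

def sensory_dissonance(path, frame_size=2048, hop_size=1024):
    """Average frame-wise sensory dissonance (Plomp-Levelt curves) for one excerpt."""
    audio = es.MonoLoader(filename=path, sampleRate=44100)()
    window, spectrum = es.Windowing(type="hann"), es.Spectrum()
    peaks, dissonance = es.SpectralPeaks(orderBy="frequency"), es.Dissonance()
    values = []
    for frame in es.FrameGenerator(audio, frameSize=frame_size, hopSize=hop_size):
        freqs, mags = peaks(spectrum(window(frame)))
        values.append(dissonance(freqs, mags))
    return float(np.mean(values))

# Correlate the extracted feature with the averaged perceptual ratings:
# scores = [sensory_dissonance(p) for p in excerpt_paths]
# r, _ = pearsonr(scores, perceptual_dissonance_ratings)
```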
4.4 Results

Figure 3 shows the results for every mid-level feature. For all the mid-level features, the best result was achieved by pretraining and fine-tuning the network. Melodiousness, articulation and dissonance could be predicted with a much better accuracy than rhythmic complexity, tonal and rhythmic stability, and mode.

[Figure 3: Performance of different methods on mid-level feature prediction.]

5. FUTURE WORK

In this paper, we only investigated seven perceptual features. Other interesting features include tempo, timbre and structural regularity. The rhythmic complexity and tonal stability features had low agreement; it is probable that their contributing factors need to be explicitly specified and studied separately. The accuracy could be improved for modality and rhythmic stability. It is not clear whether the strong correlations between some features are an artifact of the data selection or of music perception.

6. CONCLUSION

Mid-level perceptual music features could be used for music search and categorization and could improve music emotion recognition methods. However, there are multiple challenges in extracting such features: such concepts lack clear definitions, and we do not quite understand the underlying perceptual mechanisms yet. In this paper, we collect annotations for seven perceptual features and model them by relying on listener ratings. We provide the listeners with scales with examples instead of definitions and criteria. Listeners achieved good agreement on all the features but two (rhythmic complexity and tonal stability). Using deep learning, we model the features from data. Such an approach has its advantages compared to specific algorithm design: it can pick appropriate patterns from the data and achieve better performance than an algorithm based on a single aspect. However, it is also less interpretable. We release the mid-level feature dataset, which can be used to further improve both algorithmic and data-driven methods of mid-level feature recognition.

7. ACKNOWLEDGEMENTS

This work is supported by the European Research Council (ERC) under the EU's Horizon 2020 Framework Programme (ERC Grant Agreement number 670035, project "Con Espressione"). This work was also supported by an FCS grant.

8. REFERENCES

[1] A. Aljanaki, F. Wiering, and R. C. Veltkamp. Computational modeling of induced emotion using GEMS. In 15th International Society for Music Information Retrieval Conference, 2014.
[2] A. Aljanaki, Y.-H. Yang, and M. Soleymani. Developing a benchmark for emotional analysis of music. PLOS ONE, 12(3), 2017.
[3] D. Bogdanov, N. Wack, E. Gomez, S. Gulati, P. Herrera, O. Mayor, et al. Essentia: an audio analysis library for music information retrieval. In 14th International Society for Music Information Retrieval Conference, pages 493–498, 2013.
[4] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[5] M. A. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney. Content-based music information retrieval: Current directions and future challenges. Proceedings of the IEEE, 96(4):668–696, 2008.
[6] K. Choi, G. Fazekas, and M. Sandler. Automatic tagging using deep convolutional neural networks. In 17th International Society for Music Information Retrieval Conference, 2016.
[7] T. Eerola and J. K. Vuoskoski. A comparison of the discrete and dimensional models of emotion in music. Psychology of Music, 39(1):18–49, 2011.
[8] A. Friberg and A. Hedblad. A comparison of perceptual ratings and computed audio features. In 8th Sound and Music Computing Conference, pages 122–127, 2011.
[9] A. Friberg, E. Schoonderwaldt, A. Hedblad, M. Fabiani, and A. Elowsson. Using listener-based perceptual features as intermediate representations in music information retrieval. The Journal of the Acoustical Society of America, 136(4):1951–63, 2014.
[10] A. Gabrielsson and E. Lindström. The influence of musical structure on emotional expression. In Music and Emotion: Theory and Research, pages 223–248. Oxford University Press, 2001.
[11] C. Harte, M. Sandler, and M. Gasser. Detecting harmonic change in musical audio. In Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia (AMCMM '06), page 21. ACM Press, 2006.
[12] O. Lartillot, T. Eerola, P. Toiviainen, and J. Fornari. Multi-feature modeling of pulse clarity: Design, validation, and optimization. In 9th International Conference on Music Information Retrieval, 2008.
[13] O. Lartillot, P. Toiviainen, and T. Eerola. A Matlab toolbox for music information retrieval. In Data Analysis, Machine Learning and Applications, Studies in Classification, Data Analysis, and Knowledge Organization, 2008.
[14] J. Madsen, B. S. Jensen, and J. Larsen. Predictive modeling of expressed emotions in music using pairwise comparisons. Pages 253–277. Springer, Berlin, Heidelberg, 2013.
[15] R. Malheiro, R. Panda, P. Gomes, and R. Paiva. Bi-modal music emotion recognition: Novel lyrical features and dataset. In 9th International Workshop on Music and Machine Learning (MML 2016), 2016.
[16] O. Lartillot, T. Eerola, P. Toiviainen, and J. Fornari. Multi-feature modeling of pulse clarity: Design, validation, and optimization. In 9th International Conference on Music Information Retrieval, 2008.
[17] E. Pampalk. Computational Models of Music Similarity and their Application in Music Information Retrieval. PhD thesis, Vienna University of Technology, 2012.
[18] R. Panda, R. Malheiro, B. Rocha, A. Oliveira, and R. P. Paiva. Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis. In 10th International Symposium on Computer Music Multidisciplinary Research, 2013.
[19] R. Panda, R. Malheiro, and R. P. Paiva. Novel audio features for music emotion recognition. IEEE Transactions on Affective Computing.
[20] R. Plomp and W. J. M. Levelt. Tonal consonance and critical bandwidth. The Journal of the Acoustical Society of America, 38(4):548–560, 1965.
[21] P. J. Rentfrow, L. R. Goldberg, and D. J. Levitin. The structure of musical preferences: a five-factor model. Journal of Personality and Social Psychology, 100(6):1139–57, 2011.
[22] J. Salamon, B. Rocha, and E. Gomez. Musical genre classification using melody features extracted from polyphonic music signals. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 81–84. IEEE, 2012.
[23] S. Streich and P. Herrera. Detrended fluctuation analysis of music signals: Danceability estimation and further semantic characterization. In AES 118th Convention, 2005.
[24] L. Wedin. A multidimensional study of perceptual-emotional qualities in music. Scandinavian Journal of Psychology, 13:241–257, 1972.
