A New Multilabel System for Automatic Music Emotion Recognition
Achieving advances in the automatic recognition of emotions that music can induce requires considering the multiplicity and simultaneity of emotions. The core of our work is a comparison of machine learning algorithms performing multilabel and multiclass classification. The study analyzes the implementation of the Geneva Emotional Music Scale 9 in the Emotify music dataset and investigates its adoption from a machine learning perspective. We approach emotion expression/induction through music as a multilabel, multiclass problem: each annotator can assign multiple emotion labels to the same music track (multilabel), and each emotion can be identified in the music or not (multiclass). The aim is the automatic recognition of emotions induced by music.
💡 Research Summary
The paper presents a comprehensive study on automatic music emotion recognition (MER) that explicitly models the multiplicity and simultaneity of emotions by treating the problem as both multilabel and multiclass. Using the Emotify dataset—400 one‑minute tracks equally distributed among classical, rock, pop, and electronic genres, annotated with 8,407 listener responses—the authors adopt the Geneva Emotional Music Scale (GEMS) reduced to nine emotion categories (GEMS‑9). Each listener could select up to three emotions per track, reflecting the real‑world possibility that a single piece can evoke several feelings at once.
To convert the raw annotations into usable labels, the authors introduce a “consensus threshold” of 30 %: an emotion is considered present for a track only if the average positive response across annotators exceeds this proportion. The score for emotion i on track j is computed as the count of positive selections divided by the total number of annotations for that track, without weighting individual listeners. This simple criterion reduces label sparsity while preserving collective emotional judgments.
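The consensus rule above can be sketched in a few lines. This is a minimal illustration of the 30 % threshold as described; the annotation matrix, function name, and data layout are assumptions for demonstration, not the paper's actual data format.

```python
# Minimal sketch of the 30% consensus-threshold labeling: an emotion is
# kept for a track only if the average positive response across its
# annotators exceeds the threshold. Variable names are illustrative.
import numpy as np

def consensus_labels(annotations, threshold=0.30):
    """annotations: (n_annotators, n_emotions) binary matrix for one track.

    Returns a boolean vector that is True where the fraction of positive
    selections across annotators exceeds the threshold."""
    scores = annotations.mean(axis=0)  # count of positives / total annotations
    return scores > threshold

# Toy example: 10 annotators, 3 of the 9 GEMS emotions shown.
ann = np.array([[1, 0, 0],
                [1, 1, 0],
                [1, 0, 0],
                [0, 1, 0],
                [1, 0, 0],
                [0, 0, 0],
                [1, 0, 0],
                [0, 1, 0],
                [1, 0, 0],
                [0, 0, 1]])
print(consensus_labels(ann))  # emotion 0 scores 6/10 = 0.6 > 0.3 -> True
```

Note that an emotion scoring exactly 0.30 (as emotion 1 does here, 3/10) is excluded, since the rule requires strictly exceeding the threshold.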
Feature extraction is performed with three open‑source toolkits (MIRToolbox, Marsyas, PsySound), yielding 476 low‑level descriptors grouped into four families: acoustic (intensity, rhythm, timbre), psychoacoustic (loudness, sharpness, etc.), melodic (pitch salience, contour, vibrato), and statistical summaries (mean, variance, skewness, kurtosis). The feature set includes classic MIR variables such as RMS, tempo, MFCCs, spectral flux, as well as higher‑level psychoacoustic measures, providing a rich representation of the audio signal.
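To make the descriptor families concrete, here is a from-scratch sketch of a few low-level features of the kind those toolkits extract (RMS energy, zero-crossing rate, spectral centroid), summarized with the four statistics named above. The paper itself used MIRToolbox, Marsyas, and PsySound; this NumPy/SciPy version is a simplified stand-in, and the frame and hop sizes are illustrative choices.

```python
# Simplified low-level descriptor extraction: frame the signal, compute a
# few acoustic descriptors per frame, then summarize each descriptor with
# mean, variance, skewness, and kurtosis (the paper's statistical family).
import numpy as np
from scipy.stats import kurtosis, skew

def frame_descriptors(y, sr, frame=2048, hop=512):
    feats = []
    for start in range(0, len(y) - frame + 1, hop):
        w = y[start:start + frame]
        rms = np.sqrt(np.mean(w ** 2))                     # intensity
        zcr = np.mean(np.abs(np.diff(np.sign(w)))) / 2     # noisiness proxy
        mag = np.abs(np.fft.rfft(w))
        freqs = np.fft.rfftfreq(frame, 1 / sr)
        centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)  # timbre
        feats.append((rms, zcr, centroid))
    f = np.array(feats)
    # Statistical summaries per descriptor: mean, variance, skewness, kurtosis.
    return np.concatenate([f.mean(0), f.var(0), skew(f), kurtosis(f)])

sr = 22050
t = np.arange(sr) / sr            # one second of audio
y = np.sin(2 * np.pi * 440 * t)   # 440 Hz test tone
vec = frame_descriptors(y, sr)
print(vec.shape)                  # 3 descriptors x 4 statistics
```

Stacking many such descriptors and statistics per track is how the 476-dimensional feature vectors arise.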
Because the chosen classifiers handle discrete attributes more naturally than continuous ones, the authors apply Kononenko’s MDL‑based discretization algorithm, which partitions each feature into intervals that minimize description length. This step reduces noise and aligns the data with the discrete representation the classifiers expect. Subsequently, Correlation‑based Feature Selection (CFS) is employed to retain features that are highly correlated with the target labels while being mutually weakly correlated, thereby mitigating redundancy and overfitting.
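The CFS idea can be sketched with its merit heuristic: a subset scores well when its features correlate with the class but not with each other. This illustration uses plain Pearson correlation as the association measure; Hall's original CFS uses symmetrical uncertainty on discretized features, so this is a simplified sketch, not the exact algorithm.

```python
# Sketch of the CFS merit heuristic:
#   merit(S) = k * mean(feature-class corr) / sqrt(k + k(k-1) * mean(feature-feature corr))
# where k = |S|. Higher merit = relevant but non-redundant subset.
import numpy as np

def cfs_merit(X, y, subset):
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Synthetic demo: two informative features and one pure-noise feature.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([y + rng.normal(size=200),        # informative, noisy
                     y + 0.1 * rng.normal(size=200),  # informative, cleaner
                     rng.normal(size=200)])           # irrelevant noise
print(cfs_merit(X, y, [0, 1]), cfs_merit(X, y, [2]))
```

A greedy forward search over subsets, maximizing this merit, is the usual way CFS is run in practice.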
Three classifiers are evaluated: (1) a linear‑kernel Support Vector Machine trained with Sequential Minimal Optimization (SMO), (2) a Bayesian classifier built with the K2 structure‑search algorithm, and (3) a feed‑forward Artificial Neural Network with a single hidden layer of 50 neurons, a learning rate of 0.3, momentum of 0.2, and supervised back‑propagation. All experiments are conducted in the Weka environment with 10‑fold cross‑validation. For each classifier, three experimental conditions are tested: (a) using the full raw feature set, (b) using only the CFS‑selected subset, and (c) applying discretization before feature selection.
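The comparison setup can be sketched with scikit-learn in place of Weka (the paper's actual environment). The SMO-trained linear SVM maps to `SVC(kernel="linear")`, and the ANN hyper-parameters (one hidden layer of 50 neurons, learning rate 0.3, momentum 0.2) follow the paper; `GaussianNB` stands in for the K2-based Bayesian model, and the synthetic dataset is an illustrative stand-in for the Emotify features.

```python
# Sketch of the three-classifier, 10-fold cross-validation comparison.
# scikit-learn substitutes for Weka; the dataset is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

models = {
    "SVM (linear)": SVC(kernel="linear"),
    "Bayes": GaussianNB(),  # simplification of the paper's K2 Bayesian model
    "ANN (50 hidden)": MLPClassifier(hidden_layer_sizes=(50,), solver="sgd",
                                     learning_rate_init=0.3, momentum=0.2,
                                     max_iter=500, random_state=0),
}
results = {name: cross_val_score(m, X, y, cv=10).mean() for name, m in models.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

In a binary-relevance setup, this loop would be repeated once per GEMS-9 emotion, each with its own thresholded target vector.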
Results, reported in terms of Root Mean Square Error (RMSE) and classification accuracy measured against the 30 % consensus threshold, show that the SVM consistently outperforms the other models, achieving the lowest RMSE (≈ 0.99) and the highest accuracy when both discretization and feature selection are applied. The ANN and Bayesian classifiers attain moderate performance, indicating that linear decision boundaries are well‑suited to the transformed feature space. The authors also note that discretization combined with CFS yields a noticeable improvement over using raw features alone, confirming the value of their preprocessing pipeline.
The paper acknowledges several limitations. The fixed consensus threshold does not account for individual listener variability, potentially discarding nuanced personal responses. The impact of MP3 compression on feature quality is discussed qualitatively but not quantified, leaving open the question of how much acoustic information is lost. Moreover, evaluation metrics are limited to RMSE and accuracy; multilabel‑specific measures such as Hamming loss, subset accuracy, or macro‑averaged F1‑score are absent, which would have offered deeper insight into label interdependencies. Finally, the ANN architecture and hyper‑parameter tuning are relatively simple, and more sophisticated deep learning models (e.g., CNN‑RNN hybrids, Transformers) are not explored.
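The multilabel-specific metrics the authors omit are straightforward to compute once predictions are per-emotion binary vectors. The toy labels below are illustrative, not taken from the paper's results.

```python
# Hamming loss (fraction of individual label decisions that are wrong) and
# macro-averaged F1 (F1 per emotion, then averaged, so rare emotions count
# equally), computed on toy multilabel ground truth and predictions.
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

y_true = np.array([[1, 0, 1],   # rows = tracks, columns = emotions
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

print(hamming_loss(y_true, y_pred))               # 2 wrong cells out of 9
print(f1_score(y_true, y_pred, average="macro"))
```

Unlike plain accuracy, these metrics expose which individual emotions a model systematically misses, which is exactly the label-interdependency insight the paragraph above notes is missing.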
In conclusion, the study provides a solid baseline for multilabel music emotion recognition by integrating consensus‑based labeling, robust feature preprocessing (discretization and correlation‑based selection), and comparative classification. It demonstrates that a well‑engineered pipeline can achieve competitive performance without resorting to large‑scale deep networks. Future work could extend the approach by personalizing consensus thresholds, incorporating additional modalities (e.g., video, physiological signals), and benchmarking against state‑of‑the‑art deep learning architectures to further close the gap between automatic systems and human emotional perception.