Emotion Analysis of Songs Based on Lyrical and Audio Features

This paper proposes a method for detecting the emotion of a song from its lyrical and audio features. Lyrical features are generated by segmenting the lyrics during data extraction; the ANEW lexicon and WordNet are then used to compute Valence and Arousal values, and linguistic association rules are applied to resolve lexical ambiguity. Audio features supplement the lyrical ones and include attributes such as energy, tempo, and danceability, extracted from The Echo Nest, a widely used music intelligence platform. Training and test sets are constructed from social tags harvested from Last.fm. Classification applies feature weighting and stepwise threshold reduction to the k-Nearest Neighbors algorithm, yielding a fuzzy classification.


💡 Research Summary

The paper presents a multimodal framework for automatically detecting the emotional content of songs by jointly exploiting lyrical and audio information. The authors begin by motivating the need for robust music‑emotion recognition in applications such as recommendation, therapy, and playlist generation, noting that most prior work focuses on either text or sound alone. To address this gap, they construct a pipeline that extracts complementary features from both modalities and feeds them into a fuzzy k‑Nearest Neighbors (k‑NN) classifier.

For the lyrical side, the raw lyrics are first segmented into verses. Each word is mapped to the Affective Norms for English Words (ANEW) lexicon to obtain Valence (positivity) and Arousal (activation) scores. When a word is absent from ANEW, the authors resort to WordNet to locate synonyms or antonyms and inherit their affective scores. To mitigate lexical ambiguity, a set of handcrafted association rules is applied; these rules capture typical sentiment transitions (e.g., “not happy” → negative) and adjust the verse‑level Valence‑Arousal values accordingly. Although the rule‑based approach is transparent, it is also highly domain‑specific and may struggle with metaphorical language or non‑English lyrics.
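The lookup-with-fallback and negation handling described above can be sketched in a few lines. The miniature lexicons below are illustrative stand-ins for ANEW and the WordNet synonym expansion, not the paper's actual resources, and the single negation rule is a simplified version of the association rules.

```python
# Sketch of verse-level Valence-Arousal scoring: an ANEW-style lookup,
# a synonym table standing in for the WordNet fallback, and a simple
# negation rule ("not happy" -> negative). Lexicons here are toy data.

ANEW = {  # word -> (valence, arousal) on ANEW's 1-9 scales
    "happy": (8.2, 6.5),
    "sad": (1.6, 3.5),
    "love": (8.7, 6.4),
}
SYNONYMS = {"joyful": "happy", "gloomy": "sad"}  # stand-in for WordNet
NEGATORS = {"not", "never", "no"}

def score_verse(words):
    """Average (valence, arousal) over recognized words; a preceding
    negator reflects valence about the scale midpoint of 5.0."""
    vals, arts = [], []
    negate = False
    for w in words:
        w = w.lower()
        if w in NEGATORS:
            negate = True
            continue
        entry = ANEW.get(w) or ANEW.get(SYNONYMS.get(w, ""))
        if entry:
            v, a = entry
            if negate:
                v = 10.0 - v  # flip valence; arousal is left unchanged
            vals.append(v)
            arts.append(a)
        negate = False
    if not vals:
        return None
    return sum(vals) / len(vals), sum(arts) / len(arts)
```

For example, `score_verse(["not", "happy"])` yields a low valence (about 1.8) with the original arousal, while the out-of-lexicon word "joyful" is resolved through the synonym table.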

On the audio side, the study leverages the Echo Nest API to retrieve thirteen high‑level musical descriptors, including energy, tempo, danceability, key, mode, speechiness, and instrumentalness. These attributes have been shown in earlier research to correlate with emotional dimensions, yet the paper does not perform an explicit feature‑importance analysis, leaving the relative contribution of each metric somewhat opaque. Moreover, the authors do not exploit lower‑level spectral features such as MFCCs, which could further enrich the acoustic representation.
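One plausible way to fuse such descriptors with the lyrical Valence-Arousal pair, so that a distance-based classifier can consume them, is a min-max-scaled feature vector. The descriptor ranges below are assumptions for illustration, not values taken from the paper, and only four of the thirteen descriptors are shown.

```python
# Sketch of combining Echo Nest-style audio descriptors with the lyrical
# Valence-Arousal pair into one normalized vector. The scaling ranges are
# illustrative assumptions (e.g. tempo in BPM), not values from the paper.

AUDIO_RANGES = {  # descriptor -> (min, max) used for min-max scaling
    "energy": (0.0, 1.0),
    "tempo": (40.0, 220.0),     # assumed BPM range
    "danceability": (0.0, 1.0),
    "speechiness": (0.0, 1.0),
}
VA_RANGE = (1.0, 9.0)  # ANEW's 1-9 scales

def build_feature_vector(audio, valence, arousal):
    """Min-max scale each modality into [0, 1] so that no raw unit
    (e.g. tempo in BPM) dominates a Euclidean distance."""
    def scale(x, lo, hi):
        return (x - lo) / (hi - lo)
    vec = [scale(audio[k], *AUDIO_RANGES[k]) for k in sorted(AUDIO_RANGES)]
    vec.append(scale(valence, *VA_RANGE))
    vec.append(scale(arousal, *VA_RANGE))
    return vec
```

Normalizing before fusion matters here because the two modalities live on very different scales; without it the tempo dimension alone would swamp the k-NN distance.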

Training and test data are assembled from the Last.fm platform. User‑generated tags are harvested, and the five most frequent emotion‑related tags (“happy”, “sad”, “angry”, “relaxed”, “energetic”) are selected as target classes. When multiple tags appear for a single track, a weighted majority vote determines the primary label. This crowdsourced labeling strategy enables large‑scale data collection but introduces noise due to inconsistent tagging practices; the paper acknowledges this limitation but provides limited validation of tag reliability.
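The weighted-majority-vote labeling step can be sketched as follows, with tag counts serving as weights; the tag data in the example is invented for illustration.

```python
# Sketch of the tag-based labeling: each track's emotion tags are
# weighted by their Last.fm tag counts and the heaviest class wins.
# Non-emotion tags (genres, decades, etc.) are ignored.

from collections import defaultdict

EMOTION_CLASSES = {"happy", "sad", "angry", "relaxed", "energetic"}

def primary_label(tag_counts):
    """tag_counts: dict of tag -> count for one track. Returns the
    emotion class with the highest total weight, or None when the
    track carries no emotion-related tags at all."""
    weights = defaultdict(int)
    for tag, count in tag_counts.items():
        if tag in EMOTION_CLASSES:
            weights[tag] += count
    if not weights:
        return None
    return max(weights, key=weights.get)
```

A track tagged `{"happy": 12, "sad": 3, "rock": 40}` is labeled "happy": the dominant genre tag never competes, only the emotion tags vote.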

The classification engine builds on a standard k‑NN algorithm with two key enhancements. First, a weighting scheme is learned via cross‑validation to balance lyrical versus acoustic distances, effectively allowing the model to emphasize the more informative modality for each song. Second, the authors introduce a stepwise threshold reduction (STR) procedure that initially classifies only high‑confidence instances (using a strict distance threshold) and then progressively relaxes the threshold to incorporate harder cases. This yields a fuzzy output: each song receives a confidence score reflecting how strongly it belongs to the predicted emotion class. While STR improves coverage, the paper does not discuss potential over‑classification when thresholds become too permissive.
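A minimal sketch of this classifier, with a modality-weighted distance and STR, might look like the following. The weights, k, threshold schedule, and lyric/audio split are assumed values, not the ones learned in the paper.

```python
# Sketch of the fuzzy k-NN: a modality-weighted Euclidean distance feeds
# a majority vote, and stepwise threshold reduction (STR) first admits
# only close (high-confidence) neighbors, then relaxes the cutoff.
# All hyperparameters below are illustrative assumptions.

import math
from collections import Counter

def weighted_distance(a, b, w_lyric, w_audio, n_lyric):
    """Euclidean distance with one weight for the first n_lyric
    (lyrical) dimensions and another for the remaining audio ones."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        w = w_lyric if i < n_lyric else w_audio
        total += w * (x - y) ** 2
    return math.sqrt(total)

def classify_str(query, train, k=3, thresholds=(0.3, 0.6, 1.0),
                 w_lyric=1.0, w_audio=1.0, n_lyric=2):
    """train: list of (features, label) pairs. Returns (label, confidence),
    where confidence is the fraction of admitted neighbors that agree,
    or (None, 0.0) if even the loosest threshold admits no neighbor."""
    dists = sorted(
        (weighted_distance(query, f, w_lyric, w_audio, n_lyric), lab)
        for f, lab in train
    )
    for t in thresholds:  # strict first, then progressively relaxed
        near = [lab for d, lab in dists[:k] if d <= t]
        if near:
            label, votes = Counter(near).most_common(1)[0]
            return label, votes / len(near)
    return None, 0.0
```

The confidence ratio is what makes the output fuzzy: a song admitted only at a loose threshold, or with a split neighbor vote, receives a correspondingly weaker membership score.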

Experimental results compare three configurations: lyrics‑only, audio‑only, and the combined multimodal model. Using accuracy, precision, recall, and F1‑score as metrics, the lyrics‑only system achieves ~68% accuracy, the audio‑only system ~71%, and the fused system reaches ~79% accuracy with an F1 of 0.81. The authors also report an increase in average confidence for the fuzzy predictions, indicating that the multimodal approach captures the continuous nature of emotional perception more effectively than hard labeling. However, baseline comparisons with more powerful classifiers such as Support Vector Machines, Random Forests, or deep neural networks are absent, making it difficult to assess whether the fuzzy k‑NN offers a genuine performance advantage or is merely a convenient implementation choice.

In the discussion, the authors highlight the complementary strengths of lyrics (semantic nuance) and audio (physiological cues). They argue that the combination mitigates the weaknesses of each modality: ambiguous language can be clarified by energetic musical cues, while identical tempos across genres can be disambiguated by lyrical sentiment. Limitations are candidly addressed, including the reliance on an English‑centric affective lexicon, the potential bias of crowd‑sourced tags, and the scalability concerns of k‑NN with large feature spaces.

Future work is outlined along four dimensions: (1) extending affective lexicons to multilingual contexts, especially Korean, (2) replacing rule‑based text processing with contextual embeddings (e.g., BERT, RoBERTa) and integrating low‑level acoustic features (MFCCs, chroma), (3) improving label quality through expert verification or semi‑supervised cleaning, and (4) developing lightweight, real‑time models suitable for mobile or streaming applications.

In conclusion, the paper demonstrates that a thoughtfully engineered multimodal pipeline—combining ANEW/WordNet‑derived lyrical affect, Echo Nest audio descriptors, and a fuzzy k‑NN classifier—substantially improves music‑emotion classification over single‑modality baselines. The work offers a practical blueprint for building emotion‑aware music services and opens avenues for richer, deep‑learning‑driven multimodal sentiment analysis in the audio domain.