On the Application of Generic Summarization Algorithms to Music
Several generic summarization algorithms have been developed and successfully applied in fields such as text and speech summarization. In this paper, we review these algorithms and apply them to music. To evaluate summarization performance, we adopt an extrinsic approach: we compare the performance of a Fado genre classifier on truncated contiguous clips against its performance on the summaries extracted with those algorithms, using two different datasets. We show that Maximal Marginal Relevance (MMR), LexRank, and Latent Semantic Analysis (LSA) all improve classification performance on both datasets used for testing.
💡 Research Summary
The paper investigates whether generic summarization techniques originally devised for text and speech can be repurposed for music, and whether the resulting summaries can improve the performance of an automatic music genre classifier. The authors focus on the Portuguese folk genre Fado, which is characterized by stringed instruments and a distinctive high‑frequency spectral profile. They adopt an extrinsic evaluation methodology: instead of measuring the perceptual quality of the summaries, they assess how well a 30‑second summary of each song supports a Support Vector Machine (SVM) classifier that distinguishes Fado from non‑Fado tracks.
Four summarization approaches are examined. “Average Similarity” is a baseline method that selects the continuous segment of a fixed length L that maximizes the average cosine similarity to the whole song, using MFCC‑based feature vectors. The three generic algorithms are Maximal Marginal Relevance (MMR), LexRank, and Latent Semantic Analysis (LSA). To apply these text‑oriented methods to audio, the authors first extract frame‑level features (13 MFCCs, RMS energy, and two 9‑dimensional rhythmic descriptors derived from FFT bands). Each frame is then clustered with K‑Means, producing a discrete “vocabulary” where each cluster label is treated as a word. Fixed‑size sentences are formed by grouping a constant number of consecutive words (e.g., five). This discretization yields a bag‑of‑words representation for each sentence, enabling the use of cosine similarity, TF‑IDF weighting, and other standard text‑processing tools.
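The discretization pipeline above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: it uses scikit-learn's K-Means (the paper does not specify an implementation), synthetic random vectors as a stand-in for the real MFCC/energy/rhythm frame features, and assumed values for the vocabulary size and the five-word sentence length.

```python
import numpy as np
from sklearn.cluster import KMeans

def discretize(frames, vocab_size=10, sentence_len=5, seed=0):
    """Cluster frame-level feature vectors into a discrete vocabulary of
    'words' (cluster labels), group consecutive words into fixed-size
    'sentences', and build a bag-of-words histogram per sentence."""
    km = KMeans(n_clusters=vocab_size, n_init=10, random_state=seed)
    words = km.fit_predict(frames)                  # one "word" per frame
    n_sentences = len(words) // sentence_len        # drop any trailing remainder
    sentences = words[: n_sentences * sentence_len].reshape(n_sentences, sentence_len)
    bow = np.stack([np.bincount(s, minlength=vocab_size) for s in sentences])
    return sentences, bow

# Synthetic stand-in for the frame features (e.g., 200 frames x 24 dimensions).
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 24))
sentences, bow = discretize(frames)
```

Each row of `bow` is a term-frequency vector over the cluster vocabulary, which is what makes cosine similarity and TF-IDF weighting directly applicable in the later stages.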
For MMR, the algorithm iteratively selects sentences that balance relevance to a query (the centroid of all sentences) against redundancy with already selected sentences. The authors test three λ values (0.3, 0.5, 0.7) and four weighting schemes (raw count, binary presence, TF‑IDF, and a “dampened” TF‑IDF where TF is log‑scaled). LexRank constructs a similarity graph of sentences, applies a PageRank‑style random‑walk with damping factor d = 0.85, and extracts the highest‑ranked sentences until the 30‑second budget is filled. LSA builds a sentence‑by‑term matrix, performs Singular Value Decomposition, and selects sentences that have high contributions to the top K latent topics, where K is chosen such that the K‑th singular value is less than half of the largest one.
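Of the three algorithms, MMR has the simplest selection loop and illustrates the relevance-versus-redundancy trade-off described above. The following is a minimal sketch under the same assumptions as before (bag-of-words sentence vectors, centroid query, raw-count weighting); the budget is expressed as a number of sentences rather than 30 seconds of audio, and the data is synthetic.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b) / (na * nb)

def mmr_select(bow, budget, lam=0.5):
    """Greedy MMR: at each step pick the sentence maximizing
    lam * relevance(query) - (1 - lam) * redundancy(selected)."""
    query = bow.mean(axis=0)                # centroid of all sentences
    selected, remaining = [], list(range(len(bow)))
    while remaining and len(selected) < budget:
        def score(i):
            rel = cosine(bow[i], query)
            red = max((cosine(bow[i], bow[j]) for j in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic bag-of-words matrix: 12 sentences over an 8-word vocabulary.
rng = np.random.default_rng(1)
bow = rng.integers(0, 5, size=(12, 8)).astype(float)
picked = mmr_select(bow, budget=4, lam=0.5)
```

With λ = 1 the loop degenerates into pure relevance ranking; lowering λ increasingly penalizes sentences similar to ones already chosen, which is why the authors sweep λ over 0.3, 0.5, and 0.7.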
The experimental setup comprises two balanced datasets, each containing 500 songs (250 Fado, 250 non‑Fado). All audio files are mono, 16‑bit, 22 050 Hz WAV files. Five‑fold cross‑validation is used to estimate classification accuracy. As baselines, the authors extract three contiguous 30‑second clips from the beginning, middle, and end of each track. Summaries generated by each algorithm are also 30 seconds long, ensuring a fair comparison. Feature extraction is performed with the OpenSMILE toolkit; matrix operations use the Armadillo library, and the Marsyas framework synthesizes the summary audio.
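The evaluation protocol (balanced binary data, SVM classifier, five-fold cross-validation) can be sketched as follows. This uses scikit-learn purely for illustration; the paper's toolchain is OpenSMILE, Armadillo, and Marsyas, and the feature vectors here are synthetic stand-ins for the descriptors extracted from each 30-second clip or summary.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: one feature vector per clip/summary,
# labelled non-Fado (0) vs. Fado (1), classes balanced.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 16)),
               rng.normal(1.0, 1.0, size=(50, 16))])
y = np.array([0] * 50 + [1] * 50)

# Five-fold cross-validated accuracy of an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=5)
print(round(scores.mean(), 2))
```

Running the same protocol once per clip type (beginning/middle/end baselines and each summarizer's output) yields the directly comparable accuracy figures reported in the results.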
Results show that, under certain parameter configurations, the generic summarizers outperform the baseline contiguous clips. Specifically, MMR with λ = 0.5 and TF‑IDF weighting, LexRank with the default damping factor, and LSA with the described singular‑value cutoff all achieve classification accuracies 2–4 percentage points higher than the best baseline (typically the middle‑segment clip). LSA’s advantage is attributed to its ability to capture the dominant melodic and rhythmic topics of a song while discarding redundant material. The Average Similarity method, while conceptually simple, does not consistently improve classification, suggesting that mere similarity to the whole track is insufficient for discriminative tasks.
The authors discuss several limitations. First, the fixed‑size “sentence” construction ignores musical form (verses, choruses, bridges), which may reduce the naturalness of the summaries for human listeners. Second, the choice of K in K‑Means clustering and the sentence length are hyper‑parameters that heavily influence the quality of the discrete representation but were not exhaustively tuned. Third, the evaluation is confined to a single binary classification task; generalization to multi‑class genre recognition, music recommendation, or affective analysis remains untested. Finally, the extrinsic evaluation does not address subjective listening quality, which could be relevant for applications such as preview generation.
In conclusion, the study demonstrates that generic text summarization algorithms can be successfully adapted to music by discretizing audio frames into word‑like tokens and that these summaries can enhance downstream automatic processing, at least for the specific task of Fado detection. Future work is proposed to incorporate music‑aware segmentation (e.g., using structural segmentation or beat‑synchronous units), to automate the selection of clustering parameters, to evaluate the approach on a broader set of musical styles, and to conduct human listening tests to assess perceived coherence and usefulness of the generated summaries.