Identifying Cover Songs Using Information-Theoretic Measures of Similarity
This paper investigates methods for quantifying similarity between audio signals, specifically for the task of cover song detection. We consider an information-theoretic approach, where we compute pairwise measures of predictability between time series. We compare discrete-valued approaches operating on quantised audio features to continuous-valued approaches. In the discrete case, we propose a method for computing the normalised compression distance (NCD) that accounts for correlation between time series. In the continuous case, we propose to compute information-based measures of similarity as statistics of the prediction error between time series. We evaluate our methods on two cover song identification tasks, using a data set comprising 300 Jazz standards and using the Million Song Dataset. For both datasets, we observe that continuous-valued approaches outperform discrete-valued approaches. We consider approaches to estimating the NCD based on string compression and prediction, and observe that our proposed normalised compression distance with alignment (NCDA) improves average performance over the NCD for sequential compression algorithms. Finally, we demonstrate that continuous-valued distances may be combined to improve performance with respect to baseline approaches. Using a large-scale filter-and-refine approach, we demonstrate state-of-the-art performance for cover song identification on the Million Song Dataset.
💡 Research Summary
The paper addresses the problem of cover‑song identification by framing similarity measurement as an information‑theoretic task. The authors compare two families of methods: discrete‑valued approaches that operate on quantised audio features and continuous‑valued approaches that work directly on the raw feature streams. For the discrete case they adopt the Normalised Compression Distance (NCD), which estimates similarity by measuring how well two strings can be jointly compressed. Recognising that the standard NCD concatenates the two strings in a fixed order and therefore ignores temporal alignment, they propose a modified version called Normalised Compression Distance with Alignment (NCDA). NCDA first aligns the two sequences (using dynamic time warping or a similar technique) and then compresses the aligned concatenation, allowing the compressor to exploit repeated musical patterns that appear at different time offsets.
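The standard NCD can be sketched with an off-the-shelf compressor. The snippet below is a minimal illustration using Python's built-in bz2 (a BWT-based compressor, one of the families the paper evaluates); the function names and the choice of compressor are ours, not the paper's implementation, and NCDA would compress an aligned interleaving of the two strings rather than the plain concatenation used here.

```python
import bz2


def clen(s: bytes) -> int:
    """Approximate Kolmogorov complexity by compressed length under bz2."""
    return len(bz2.compress(s))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    Near 0 for highly related strings, near 1 for unrelated ones.
    NCDA would instead compress an alignment of x and y in place of x + y.
    """
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because the compressor can reuse patterns from `x` when encoding `y` in the concatenation, `C(xy)` is much smaller than `C(x) + C(y)` for related strings, which is what drives the distance toward zero.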
For the continuous case the authors treat each track as a multivariate time series of feature vectors (e.g., chroma, MFCC). They train predictive models—linear autoregressive models, feed‑forward regressors, and non‑linear recurrent networks—to predict one series from the other. The statistics of the prediction error (root‑mean‑square error, entropy of the residual distribution, etc.) are taken as a distance measure. This approach avoids quantisation loss and captures subtle spectral and harmonic differences that are invisible to a coarse symbolic representation.
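One way to realise the prediction-error idea is a ridge-regularised linear map between aligned frame sequences, with the symmetrised RMSE of the residual as the distance. This is an illustrative sketch under our own assumptions (the ridge penalty, least-squares fit, and symmetrisation are choices we make here, not necessarily the paper's exact predictive models):

```python
import numpy as np


def prediction_rmse(x: np.ndarray, y: np.ndarray, lam: float = 1e-3) -> float:
    """Fit a ridge-regularised linear map from frames of x to frames of y
    and return the root-mean-square prediction error.
    x, y: (n_frames, n_features) arrays with equal frame counts
    (e.g. after temporal alignment)."""
    d = x.shape[1]
    # Closed-form ridge solution: (X^T X + lam I) A = X^T Y
    A = np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ y)
    resid = y - x @ A
    return float(np.sqrt(np.mean(resid ** 2)))


def prediction_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Symmetrised cross-prediction error between two feature sequences."""
    return 0.5 * (prediction_rmse(x, y) + prediction_rmse(y, x))
```

A cover of the same piece should be largely predictable from the original (small residual), while an unrelated track should leave most of its variance unexplained.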
The experimental evaluation uses two benchmark collections. The first consists of 300 jazz standards, each with several cover versions, providing a controlled medium‑scale testbed. The second is a subset of the Million Song Dataset (MSD), representing a realistic large‑scale scenario (the full MSD contains one million tracks). For each track the authors extract short‑term frames, compute chroma and other timbral descriptors, and either quantise them (k‑means clustering) for the discrete pipeline or keep them continuous for the prediction pipeline. Compression algorithms evaluated include LZ78, Prediction by Partial Matching (PPM), and Burrows‑Wheeler Transform (BWT) based compressors.
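The quantisation step of the discrete pipeline can be sketched with a plain Lloyd's k-means in NumPy, mapping each feature frame to a codeword index so that a track becomes a symbol string. The codebook size, iteration count, and seeding below are illustrative assumptions, not the paper's settings:

```python
import numpy as np


def quantise(frames: np.ndarray, k: int = 8, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Vector-quantise feature frames (n_frames, n_features) into a symbol
    string via Lloyd's k-means; returns one codeword index per frame."""
    rng = np.random.default_rng(seed)
    # Initialise centres from k distinct data points.
    centres = frames[rng.choice(len(frames), size=k, replace=False)].astype(float)
    labels = np.zeros(len(frames), dtype=int)
    for _ in range(iters):
        # Assign each frame to its nearest centre.
        dists = np.linalg.norm(frames[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its assigned frames.
        for j in range(k):
            if np.any(labels == j):
                centres[j] = frames[labels == j].mean(axis=0)
    return labels
```

The resulting index sequence can then be serialised to bytes and fed to any of the compressors above.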
Performance is measured with Mean Average Precision (MAP) and Precision@k. The continuous‑prediction methods consistently outperform the discrete NCD baseline, achieving MAP improvements of roughly 7–12 % on the jazz set and similar gains on the MSD. NCDA improves over standard NCD by 3–5 % when paired with sequential compressors, confirming that alignment before compression yields a more faithful estimate of joint compressibility. The most striking result comes from a hybrid scheme that linearly combines the continuous prediction distance with NCDA; this fusion attains MAP = 0.84 on the jazz collection and MAP = 0.78 on the MSD, surpassing all previously reported state‑of‑the‑art systems.
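A linear fusion of two distances can be sketched as a z-normalised convex combination over a shared candidate set. The normalisation and the single weight `w` are our simplifying assumptions; the paper's combination scheme may weight or calibrate the distances differently:

```python
import numpy as np


def fuse(dist_a: np.ndarray, dist_b: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Combine two distance arrays computed over the same candidates:
    z-normalise each so their scales are comparable, then take a convex
    combination with weight w on dist_a."""
    za = (dist_a - dist_a.mean()) / (dist_a.std() + 1e-12)
    zb = (dist_b - dist_b.mean()) / (dist_b.std() + 1e-12)
    return w * za + (1 - w) * zb
```

Normalising first matters because a compression-based distance (roughly in [0, 1]) and a raw RMSE live on different scales; without it, one distance would dominate the sum.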
To demonstrate scalability, the authors embed the information‑theoretic distances in a filter‑and‑refine pipeline. An inexpensive coarse filter (e.g., Euclidean distance on down‑sampled chroma) reduces the candidate set to roughly 1 % of the database. The refined stage then applies the hybrid information‑theoretic distance, preserving high retrieval accuracy while keeping computational cost tractable at the scale of a million tracks.
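The two-stage retrieval can be sketched as follows, assuming each track is summarised by a fixed-size vector for the cheap stage, while an arbitrary callable `refine_fn` stands in for the expensive information-theoretic distance (all names and the 1 % cut-off are illustrative):

```python
import numpy as np


def filter_and_refine(query: np.ndarray, summaries: np.ndarray,
                      refine_fn, keep_frac: float = 0.01) -> list:
    """Two-stage retrieval. Stage 1: rank the whole database by cheap
    Euclidean distance between fixed-size summary vectors and keep the top
    keep_frac fraction. Stage 2: re-rank only those candidates with the
    expensive refine_fn(query_vec, candidate_vec). Returns candidate
    indices, best match first."""
    coarse = np.linalg.norm(summaries - query, axis=1)   # one pass, O(n)
    n_keep = max(1, int(len(summaries) * keep_frac))
    candidates = np.argsort(coarse)[:n_keep]             # survivors of the filter
    return sorted(candidates, key=lambda i: refine_fn(query, summaries[i]))
```

With `keep_frac = 0.01`, the expensive distance is evaluated on only 1 % of the collection, which is what makes million-track retrieval tractable.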
In conclusion, the study validates that information‑theoretic similarity measures are effective for cover‑song detection. The proposed NCDA addresses a key limitation of the classic NCD by incorporating temporal alignment, and the continuous‑prediction framework captures fine‑grained acoustic relationships without quantisation artefacts. Their combination yields a robust, high‑performing system that scales to large music collections. Suggested future work includes exploring more sophisticated alignment techniques, unsupervised deep feature learning, and real‑time deployment in commercial music recommendation services.