Leveraging Whisper Embeddings for Audio-based Lyrics Matching

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Audio-based lyrics matching can be an appealing alternative to other content-based retrieval approaches, but existing methods often suffer from limited reproducibility and inconsistent baselines. In this work, we introduce WEALY, a fully reproducible pipeline that leverages Whisper decoder embeddings for lyrics matching tasks. WEALY establishes robust and transparent baselines, while also exploring multimodal extensions that integrate textual and acoustic features. Through extensive experiments on standard datasets, we demonstrate that WEALY achieves performance comparable to that of state-of-the-art methods which lack reproducibility. In addition, we provide ablation studies and analyses on language robustness, loss functions, and embedding strategies. This work contributes a reliable benchmark for future research, and underscores the potential of speech technologies for music information retrieval tasks.


💡 Research Summary

The paper introduces WEALY (Whisper Embeddings for Audio‑based LYrics matching), a fully reproducible end‑to‑end pipeline that leverages the decoder embeddings of OpenAI’s Whisper model to perform lyrics‑based music matching without relying on intermediate text transcriptions. The authors argue that existing lyrics‑matching approaches either depend on noisy ASR outputs, require extensive manual data collection, or involve complex multi‑stage pipelines that are difficult to reproduce. WEALY addresses these shortcomings by extracting dense latent vectors directly from Whisper’s final decoder layer, treating them as “lyrics‑aware” representations of the audio signal.

The pipeline consists of two stages. In the first stage, raw audio is resampled to 16 kHz mono, trimmed to a maximum of five minutes, and split into overlapping 30‑second chunks. Each chunk is converted to a log‑mel spectrogram and fed to Whisper‑turbo. The model processes each chunk independently, producing a variable‑length sequence of hidden decoder states. These states are concatenated across all chunks to form a large matrix H (m × 1280), where m varies with the amount of vocal activity in the track. By extracting the hidden states before token sampling, the method captures Whisper’s semantic understanding of the lyrics while avoiding the errors introduced by explicit transcription.
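The trim-and-chunk step of this first stage can be sketched as follows. This is a minimal illustration, not the authors' released code; the hop length of 15 seconds is an assumption (the summary only says the chunks overlap), and `chunk_audio` is a hypothetical helper name.

```python
import numpy as np

SR = 16_000        # Whisper's expected sample rate
CHUNK_S = 30       # chunk length in seconds
MAX_S = 5 * 60     # five-minute cut-off described in the paper

def chunk_audio(audio: np.ndarray, hop_s: int = 15):
    """Trim mono 16 kHz audio to 5 min, then split into overlapping
    30 s windows. hop_s (the overlap stride) is an assumed value."""
    audio = audio[: MAX_S * SR]
    chunk, hop = CHUNK_S * SR, hop_s * SR
    # Slide a 30 s window over the trimmed signal; the final window
    # may be shorter than 30 s (Whisper pads short inputs anyway).
    return [
        audio[start : start + chunk]
        for start in range(0, max(len(audio) - chunk, 0) + 1, hop)
    ]
```

Each resulting chunk would then be converted to a log-mel spectrogram and passed through Whisper-turbo, whose decoder hidden states are concatenated into the matrix H.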

In the second stage, random subsequences of fixed length k = 1500 are sampled from H to keep training tractable and to expose the model to diverse temporal contexts. Each subsequence is linearly projected to a model dimension d_h = 768, passed through four transformer encoder blocks (12 attention heads, feed‑forward dimension 1024, dropout 0.1), and then temporally pooled using Generalized Mean (GeM) pooling. The pooled vector is finally projected to a 512‑dimensional embedding space. Training uses a contrastive NT‑Xent loss with temperature τ = 0.1, encouraging embeddings of different versions of the same song to be close while pushing apart embeddings of unrelated tracks.
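The second-stage adapter described above can be sketched in PyTorch. This is an illustrative reconstruction from the dimensions given in the summary (1280 → 768 → 512, four encoder blocks, 12 heads, feed-forward 1024, dropout 0.1), not the authors' implementation; the GeM exponent p = 3 is a common default and an assumption here.

```python
import torch
import torch.nn as nn

class GeM(nn.Module):
    """Generalized Mean pooling over the time axis (p=3 assumed)."""
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))  # learnable exponent
        self.eps = eps

    def forward(self, x):  # x: (batch, time, dim)
        return x.clamp(min=self.eps).pow(self.p).mean(dim=1).pow(1.0 / self.p)

class LyricsAdapter(nn.Module):
    """Sketch of the stage-2 adapter: Whisper decoder states -> 512-d embedding."""
    def __init__(self, in_dim=1280, d_h=768, out_dim=512):
        super().__init__()
        self.proj_in = nn.Linear(in_dim, d_h)          # 1280 -> 768
        layer = nn.TransformerEncoderLayer(
            d_model=d_h, nhead=12, dim_feedforward=1024,
            dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.pool = GeM()                              # temporal pooling
        self.proj_out = nn.Linear(d_h, out_dim)        # 768 -> 512

    def forward(self, h):  # h: (batch, k, 1280) subsequence of H
        z = self.encoder(self.proj_in(h))
        return self.proj_out(self.pool(z))
```

In practice the input would be a random subsequence of k = 1500 rows of H; a shorter sequence is used below only to keep the shape check cheap.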

Because publicly available lyrics‑matching datasets are scarce, the authors evaluate WEALY on three large‑scale music version identification (MVI) datasets: DiscogsVI‑YT (DVI), SHS100k‑v2 (SHS), and LyricCovers2.0 (LYC). DVI offers a modern, well‑balanced version‑group distribution; SHS is a classic benchmark; LYC contains 78 k tracks across 80 languages, allowing assessment of multilingual robustness. All audio is processed uniformly, and duplicate or instrumental entries are removed to focus on lyric‑bearing content.

A comprehensive set of baselines is provided: TF‑IDF cosine and Lucene similarity on Whisper‑generated transcriptions, Whisper‑averaged embeddings (no training), Whisper‑ASR followed by Sentence‑BERT (both cosine and a learned transformer), a random baseline, and a “Non‑instrumental Oracle” that defines an upper bound by filtering out unreliable transcriptions. WEALY consistently outperforms these baselines in mean average precision (MAP) on all three datasets, achieving improvements of roughly 10‑15 % over the best untrained Whisper‑AvgEmb baseline and more than 20 % over TF‑IDF methods.
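For reference, the MAP metric used to compare these baselines can be computed as below: each track queries the rest of the collection, candidates are ranked by embedding distance, and precision is averaged at each rank where a true version of the query appears. This is a generic sketch of the standard retrieval metric, not code from the paper.

```python
import numpy as np

def mean_average_precision(dists: np.ndarray, labels: np.ndarray) -> float:
    """MAP for version retrieval.

    dists:  (n, n) pairwise distance matrix between track embeddings.
    labels: (n,) version-group id per track; tracks sharing an id are
            versions of the same song (relevant to each other).
    """
    n = len(labels)
    aps = []
    for q in range(n):
        order = np.argsort(dists[q])
        order = order[order != q]                     # exclude the query itself
        rel = (labels[order] == labels[q]).astype(float)
        if rel.sum() == 0:
            continue                                  # no other versions exist
        precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((precision_at_k * rel).sum() / rel.sum())
    return float(np.mean(aps))
```

A perfect ranking (all versions retrieved before any non-version) yields MAP = 1.0; a relevant item buried at rank 2 of 2 contributes an average precision of 0.5.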

Ablation studies explore three major design choices. First, the NT‑Xent loss outperforms both triplet loss and the CLEWS loss originally proposed for multi‑sequence similarity learning. Second, GeM pooling yields higher MAP than simple average pooling and also beats using the CLS token directly. Third, retaining Whisper’s multilingual capabilities (i.e., allowing the model to output tokens in the original language) improves performance compared with forcing English‑only transcriptions, confirming that cross‑lingual information is beneficial for lyric similarity. A multimodal fusion experiment combines WEALY with the CLEWS audio‑content embeddings via a simple weighted sum of distances (α = 1), showing modest additional gains but underscoring that WEALY alone already provides a strong lyric‑focused representation.
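The NT-Xent objective favoured by these ablations can be sketched numerically as follows. This is a generic NumPy illustration of the loss (with the paper's τ = 0.1), not the authors' training code; the assumption that the batch is ordered as consecutive positive pairs (rows 2i and 2i+1 are versions of the same song) is mine.

```python
import numpy as np

def nt_xent(emb: np.ndarray, tau: float = 0.1) -> float:
    """NT-Xent over a batch of 2N embeddings; rows (2i, 2i+1) are
    assumed to be a positive pair (two versions of one song)."""
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # cosine space
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                         # mask self-similarity
    # Row-wise log-softmax, evaluated at the positive partner's column.
    logits = sim - sim.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = len(emb)
    pos = np.arange(n) ^ 1                                 # partner: 0<->1, 2<->3, ...
    return float(-log_prob[np.arange(n), pos].mean())
```

When the two versions of each song map to near-identical embeddings that are far from all other songs, the loss approaches zero; mismatched pairs drive it up, which is what pulls versions together during training.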

Reproducibility is a central contribution: the authors release all code, model checkpoints, preprocessing scripts, and detailed hyper‑parameter settings (AdamW optimizer, lr = 1e‑4, weight decay = 1e‑3, cosine learning‑rate schedule with 50 warm‑up epochs, early stopping based on validation MAP with patience = 20, training up to 1000 epochs on four GPUs).
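Those hyper-parameters translate into a short setup sketch. The AdamW settings, warm-up length, and epoch budget are taken from the summary; the exact shape of the warm-up-plus-cosine schedule (linear warm-up, then cosine decay to zero) is a common convention and an assumption here, as is the helper name.

```python
import math
import torch

def make_optimizer_and_scheduler(model, total_epochs=1000, warmup=50):
    """AdamW (lr=1e-4, wd=1e-3) with a warm-up + cosine epoch schedule."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-3)

    def lr_lambda(epoch):
        if epoch < warmup:
            return (epoch + 1) / warmup            # linear warm-up
        progress = (epoch - warmup) / max(total_epochs - warmup, 1)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```

Early stopping on validation MAP (patience 20) would sit in the training loop around these objects rather than in the schedule itself.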

Limitations are acknowledged. The approach depends on the Whisper‑turbo model; newer Whisper versions would require re‑training. The five‑minute audio cut‑off may discard lyrical content in longer songs, and the current pipeline does not exploit non‑lyric musical cues such as melody or rhythm, which could be valuable for broader music retrieval tasks.

Future work is outlined: integrating source‑separation front‑ends to isolate vocals, experimenting with longer or adaptive subsequence lengths, and developing richer multimodal architectures that jointly model lyrics, melody, and timbre. The authors also suggest exploring self‑supervised pre‑training on massive music corpora to further enhance the lyric‑aware embeddings.

In summary, WEALY demonstrates that Whisper’s decoder embeddings, when coupled with a lightweight transformer adapter and contrastive learning, constitute an effective, transparent, and reproducible foundation for audio‑based lyrics matching. The pipeline achieves state‑of‑the‑art performance on proxy MVI tasks while providing a publicly available benchmark that can accelerate future research at the intersection of speech technology and music information retrieval.

