LIWhiz: A Non-Intrusive Lyric Intelligibility Prediction System for the Cadenza Challenge


We present LIWhiz, a non-intrusive lyric intelligibility prediction system submitted to the ICASSP 2026 Cadenza Challenge. LIWhiz leverages Whisper for robust feature extraction and a trainable back-end for score prediction. On the Cadenza Lyric Intelligibility Prediction (CLIP) evaluation set, LIWhiz achieves a root mean square error (RMSE) of 27.07%, a 22.4% relative RMSE reduction over the STOI-based baseline, along with a substantial improvement in normalized cross-correlation.


💡 Research Summary

LIWhiz is a non‑intrusive lyric intelligibility prediction (LIP) system designed for the ICASSP 2026 Cadenza Challenge. The core idea is to exploit the powerful, pre‑trained Whisper Large v3 model as a frozen front‑end to extract rich acoustic representations from both the original music excerpt (x) and a hearing‑loss‑simulated version (y). From Whisper’s 32 encoder transformer layers, the initial convolutional encoder block, and its 32 decoder transformer layers (plus the decoder input embedding), a total of 66 feature maps are obtained: each encoder map is an F × T matrix (F = 1,280, T = time frames) and each decoder map is an F × M matrix (M = token count).

To fuse these multi‑layer representations, the authors introduce learnable linear mixing layers (LMLs). Separate weight vectors wₓ(l) and wᵧ(l) are learned for each encoder and decoder layer, allowing the model to automatically discover which depths contribute most to intelligibility estimation. The weighted sum across layers yields two compact embeddings, Eₓ and Eᵧ, each of size 2 × F × T. These embeddings are concatenated and fed into a bidirectional LSTM (Bi‑LSTM). The final hidden state from the encoder branch (hₑ ∈ ℝ²ᴴ, H = 512) is concatenated with the analogous hidden state from the decoder branch (h_d), forming a 4 × H vector h. A single‑neuron fully‑connected layer with a sigmoid activation maps h to a normalized intelligibility score I ∈ [0, 1].
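The mixing-and-scoring pipeline can be sketched in PyTorch with random tensors standing in for the Whisper activations. Module names are illustrative, not from the paper, and for brevity the sketch uses a single Bi-LSTM branch (the paper concatenates encoder- and decoder-branch hidden states into a 4 × H vector before the final layer):

```python
import torch
import torch.nn as nn

class LinearMixingLayer(nn.Module):
    """Learnable weighted sum over stacked per-layer feature maps."""
    def __init__(self, num_layers):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_layers) / num_layers)

    def forward(self, stacked):  # stacked: (L, B, T, F)
        weights = torch.softmax(self.w, dim=0)  # keep mixing weights normalized
        return torch.einsum("l,lbtf->btf", weights, stacked)

class BackEndSketch(nn.Module):
    def __init__(self, feat_dim=1280, hidden=512, num_layers=33):
        super().__init__()
        self.mix_x = LinearMixingLayer(num_layers)  # original excerpt x
        self.mix_y = LinearMixingLayer(num_layers)  # hearing-loss-simulated y
        self.blstm = nn.LSTM(2 * feat_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, feats_x, feats_y):  # each: (L, B, T, F)
        # Mix across layers, then concatenate the x and y embeddings.
        e = torch.cat([self.mix_x(feats_x), self.mix_y(feats_y)], dim=-1)
        _, (h_n, _) = self.blstm(e)              # h_n: (2, B, H)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # final fwd+bwd state, (B, 2H)
        return torch.sigmoid(self.fc(h)).squeeze(-1)  # score in (0, 1)

L, B, T, F = 33, 2, 50, 1280  # 33 encoder-side maps, toy batch and frames
model = BackEndSketch()
score = model(torch.randn(L, B, T, F), torch.randn(L, B, T, F))
print(score.shape)  # torch.Size([2])
```

The softmax over the mixing weights is one reasonable choice for keeping the layer weights comparable; the paper's exact parameterization of wₓ(l) and wᵧ(l) may differ.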

