Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods
Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. We experiment on English and Persian, using the IEMOCAP and ShEMO datasets respectively, with Whisper Tiny and Small. Our multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset, with a 2.47% improvement in unweighted accuracy. We further compare the performance of different Whisper encoder layers and find that intermediate layers often perform better for SER on the Persian dataset, providing a lightweight and efficient alternative to much larger models such as HuBERT X-Large. Our findings highlight the potential of Whisper as a representation extractor for SER and demonstrate the effectiveness of attention-based pooling for dimension reduction.
💡 Research Summary
This paper investigates the use of Whisper, a large‑scale multilingual automatic speech recognition (ASR) model, as a frozen feature extractor for speech emotion recognition (SER). The authors propose two attention‑based pooling mechanisms to compress the high‑dimensional encoder output into a single 256‑dimensional vector suitable for classification: Multi‑head Attentive Average Pooling and Multi‑head Query‑Key‑Value (QKV) Pooling. In the attentive average variant, each frame receives a learned weight via a small neural network; multiple heads compute independent weight sets, and their outputs are concatenated and linearly projected. In QKV pooling, a global average vector serves as the query, while keys and values are derived from the frame‑wise representations; multi‑head attention then aggregates information, again followed by concatenation and projection.
Experiments are conducted on two datasets: the English IEMOCAP and the Persian ShEMO corpora. Whisper Tiny and Whisper Small models are evaluated, with the Small model combined with multi‑head QKV pooling achieving the best results. On ShEMO, the proposed architecture raises unweighted accuracy by 2.47 percentage points, surpassing previous state‑of‑the‑art methods. On IEMOCAP, comparable performance is observed, and QKV pooling generally outperforms attentive average pooling except for weighted accuracy on IEMOCAP.
A layer‑wise analysis reveals that intermediate encoder layers (approximately the 6th to 9th Transformer blocks) provide the most discriminative representations for emotion detection, especially in the low‑resource Persian setting. This suggests that mid‑level features capture a balanced mix of acoustic and linguistic cues useful for SER.
The authors compare their lightweight Whisper‑based system to much larger models such as HuBERT X‑Large. Despite having orders of magnitude fewer parameters and lower computational cost, the Whisper‑QKV pipeline reaches similar accuracy, demonstrating the efficiency of the proposed pooling strategy. All code and pretrained weights are released publicly, facilitating reproducibility and future research on multilingual, low‑resource emotion recognition and on‑device SER applications.
Comments & Academic Discussion
Loading comments...
Leave a Comment