Recovering Performance in Speech Emotion Recognition from Discrete Tokens via Multi-Layer Fusion and Paralinguistic Feature Integration
Discrete speech tokens offer significant advantages for storage and language model integration, but their application in speech emotion recognition (SER) is limited by paralinguistic information loss during quantization. This paper presents a comprehensive investigation of discrete tokens for SER. Using a fine-tuned WavLM-Large model, we systematically quantify performance degradation across different layer configurations and k-means quantization granularities. To recover the information loss, we propose two key strategies: (1) attention-based multi-layer fusion to recapture complementary information from different layers, and (2) integration of openSMILE features to explicitly reintroduce paralinguistic cues. We also compare mainstream neural codec tokenizers (SpeechTokenizer, DAC, EnCodec) and analyze their behaviors when fused with acoustic features. Our findings demonstrate that through multi-layer fusion and acoustic feature integration, discrete tokens can close the performance gap with continuous representations in SER tasks.
💡 Research Summary
The paper investigates the use of discrete speech tokens for speech emotion recognition (SER), focusing on the performance loss caused by quantization and proposing methods to recover the missing information. Using a fine-tuned WavLM-Large model, the authors extract hidden states from each of the 24 Transformer layers and apply k-means clustering with codebook sizes of 256, 512, 1,000, 2,000, and 4,000 to convert the continuous embeddings into discrete token sequences. Experiments on the MSP-Podcast 8-class emotion dataset reveal that discretization leads to a noticeable drop in Macro F1 (6–14% relative), with the final two layers (L22, L23) contributing most of the emotional signal while some early layers also provide complementary cues.
To mitigate this loss, two complementary strategies are introduced. The first is an attention-based multi-layer fusion module. For a selected set of layers, a temperature-scaled softmax computes learnable weights that are applied to the reconstructed token embeddings, followed by attentive statistical pooling and a lightweight MLP classifier. This approach recovers roughly 62% of the performance gap: a single-layer token model (L23) scores 0.3248, whereas fusing all 24 layers with a 4,000-entry codebook reaches 0.3479. Attention weight analysis shows a bimodal distribution, confirming that high-level semantic layers dominate while early acoustic layers still contribute non-trivial information.
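The fusion and pooling stages can be sketched as below. The layer scores, attention vector, and temperature are learnable parameters trained jointly with the classifier in the paper; here they are fixed random placeholders, and the temperature value is an assumption.

```python
import numpy as np

def softmax(x, tau=1.0):
    z = (x - x.max()) / tau          # temperature-scaled, numerically stable
    e = np.exp(z)
    return e / e.sum()

def fuse_and_pool(layer_embs, layer_scores, att_v, tau=0.5):
    """layer_embs: (L, T, D) reconstructed token-embedding streams, one per
    WavLM layer. layer_scores: (L,) learnable layer logits. att_v: (D,)
    attention vector for attentive statistical pooling."""
    w = softmax(layer_scores, tau)               # (L,) layer weights
    fused = np.tensordot(w, layer_embs, axes=1)  # (T, D) weighted layer sum
    a = softmax(fused @ att_v)                   # (T,) frame attention weights
    mu = a @ fused                               # attention-weighted mean
    var = a @ (fused - mu) ** 2                  # attention-weighted variance
    return np.concatenate([mu, np.sqrt(var + 1e-8)]), w

# illustrative shapes: 24 layers, 100 frames, 16-dim token embeddings
rng = np.random.default_rng(0)
utt_vec, layer_w = fuse_and_pool(rng.normal(size=(24, 100, 16)),
                                 rng.normal(size=24),
                                 rng.normal(size=16))
```

The pooled utterance vector (mean plus standard deviation, so twice the embedding width) is what the lightweight MLP classifier would consume; `layer_w` is where the bimodal distribution described above would be observed after training.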
The second strategy enriches the token representation with explicit paralinguistic cues. The authors extract the 74-dimensional openSMILE feature set (ComParE + eGeMAPS), covering prosody, voice quality, spectral shape, MFCCs, formants, auditory bands, and additional low-level descriptors. Each feature group is discretized using k-means (with the codebook size determined by the elbow method) and temporally aligned to the token sequence. A modality normalizer (LayerNorm and learnable scaling) balances the contributions before concatenation. When all openSMILE groups are added to the full-layer WavLM token model (K = 1,000), Macro F1 improves from 0.3461 to 0.3505, with the largest gain coming from prosodic and voice-quality features.
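Two pieces of this pipeline can be sketched concretely: elbow-based codebook sizing and the modality normalizer. The 10% drop threshold in the elbow heuristic is an assumption (the paper only states that the elbow method is used), and the fixed `gamma` stands in for the learnable scaling.

```python
import numpy as np

def pick_codebook_size(inertias, ks):
    """Crude elbow heuristic: choose the first k where the relative drop
    in k-means inertia falls below 10% (threshold is an assumption)."""
    for prev, cur, k in zip(inertias, inertias[1:], ks[1:]):
        if (prev - cur) / prev < 0.10:
            return k
    return ks[-1]

def modality_normalize(x, gamma=1.0, eps=1e-5):
    """LayerNorm over the feature axis plus a scalar scale `gamma`
    (learnable in the paper, fixed here), so token embeddings and
    openSMILE-group embeddings contribute on a comparable scale."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sd + eps)

# concatenate normalized WavLM token embeddings with a frame-aligned,
# discretized prosody-group stream (all shapes illustrative)
T, D_tok, D_pros = 50, 32, 8
tok = modality_normalize(np.random.randn(T, D_tok))
pros = modality_normalize(np.random.randn(T, D_pros))
fused = np.concatenate([tok, pros], axis=-1)   # (T, D_tok + D_pros)
```

Normalizing each modality before concatenation matters because the reconstructed token embeddings and the hand-crafted openSMILE descriptors live on very different scales; without it, one stream could dominate the classifier's input.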
The paper also evaluates three neural audio codec tokenizers—SpeechTokenizer, DAC, and EnCodec—as alternative discretizers. These codecs generate multi-layer token streams via residual vector quantization, but their representations are optimized for speech reconstruction rather than for emotion-relevant cues. Consequently, their SER performance (Macro F1 between 0.1187 and 0.2005) is markedly lower than that of the WavLM-based tokens (0.3133–0.3534).
A detailed analysis of codebook size versus layer configuration shows that sparse layer selections (e.g., layers 1, 3, 7, 12, 18, 23) benefit from smaller codebooks (K = 256–512), while dense configurations (all layers) achieve the best results with larger codebooks (K = 4,000). This suggests that early layers encode low-dimensional acoustic information that can be efficiently captured with modest quantization, whereas later layers require higher granularity to preserve semantic content.
In summary, the study makes three key contributions: (1) a systematic quantification of how discretization degrades SER performance across layers and granularity levels; (2) a novel attention-based multi-layer fusion mechanism that leverages complementary information from different Transformer depths; and (3) a hierarchical fusion framework that integrates discretized openSMILE paralinguistic features, effectively restoring lost prosodic and voice-quality cues. The combined approach narrows the gap between discrete and continuous representations, recovering up to 62% of the performance loss. It demonstrates that discrete tokens, when properly enhanced, can serve as a viable, storage-efficient alternative for emotion-aware speech processing, especially in scenarios that require seamless integration with large language models or low-bandwidth transmission.