Reasoning Beyond Majority Vote: An Explainable SpeechLM Framework for Speech Emotion Recognition


Speech Emotion Recognition (SER) is typically trained and evaluated on majority-voted labels, which simplifies benchmarking but masks subjectivity and provides little transparency into why predictions are made. This neglects valid minority annotations and limits interpretability. We propose an explainable Speech Language Model (SpeechLM) framework that frames SER as a generative reasoning task. Given an utterance, the model first produces a transcript, then outputs both an emotion label and a concise natural-language rationale grounded in lexical and acoustic cues. Rationales are generated by a reasoning-capable teacher LLM and used as intermediate supervision, combined with majority labels during fine-tuning. Unlike prior work primarily focused on boosting classification accuracy, we aim to enhance explainability while preserving competitive performance. To this end, we complement majority-label metrics with annotator-aware scoring that credits matches with any annotator label. On MSP-Podcast v1.12, our model maintains improvements over zero-shot SpeechLM baselines, and produces rationales that human evaluators find plausible and well grounded. This demonstrates that incorporating rationale supervision offers a practical path toward interpretable SER without sacrificing predictive quality.


💡 Research Summary

The paper tackles a fundamental limitation in Speech Emotion Recognition (SER): the reliance on majority‑voted labels, which discards the inherent subjectivity of emotional perception and offers no insight into why a model makes a particular prediction. To address both the loss of minority annotations and the lack of interpretability, the authors propose an explainable Speech Language Model (SpeechLM) framework that treats SER as a generative reasoning task. The system consists of two stages. First, a multimodal teacher large language model (LLM), specifically Qwen2.5‑Omni, is prompted with the transcript of an utterance and its gold emotion label to generate a concise natural‑language rationale. The prompt explicitly forbids the model from mentioning the label directly, forcing the rationale to be grounded in lexical content and acoustic cues such as tone, pitch, speaking style, and clarity. These teacher‑generated rationales, together with the original audio and label, form (audio, label, rationale) triples.
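The teacher stage can be sketched as a simple prompt builder. The exact prompt wording is not reproduced in the paper summary, so the template below is a hypothetical reconstruction of the stated constraints: the teacher sees the transcript and the gold label as context, must ground its rationale in lexical and acoustic cues, and must never state the label in its output.

```python
def build_teacher_prompt(transcript: str, gold_label: str) -> str:
    """Assemble a rationale-generation prompt for the teacher LLM.

    Hypothetical template: the paper specifies only the constraints
    (label given as context, rationale grounded in lexical/acoustic
    cues, label itself forbidden in the output), not the wording.
    """
    return (
        "You are given the transcript of a spoken utterance whose "
        f"perceived emotion is '{gold_label}'.\n"
        f'Transcript: "{transcript}"\n'
        "In one or two sentences, explain which lexical cues and which "
        "acoustic cues (tone, pitch, speaking style, clarity) support "
        "this emotion. Do NOT mention the emotion label itself."
    )
```

Pairing the teacher's response with the original audio and label then yields the (audio, label, rationale) training triples.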

Second, a SpeechLM backbone (Qwen2‑Audio‑7B‑Instruct) is fine‑tuned using Low‑Rank Adaptation (LoRA). Only the query, key, value, and output projection matrices in the self‑attention layers and the language‑model head are updated, keeping the rest of the 7‑billion‑parameter model frozen. Training is performed in an autoregressive fashion: the target sequence concatenates an ASR transcription, an emotion decision (“I think the emotion is …”), and the teacher‑generated rationale, all wrapped in XML‑like tags. The loss is a standard cross‑entropy over the whole token sequence, so the model learns to transcribe, classify, and reason simultaneously.
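The fine-tuning target and the LoRA scope described above can be sketched as follows. The tag names and LoRA hyperparameters are assumptions: the paper says only that the three segments are wrapped in XML-like tags and that the attention projections plus the LM head are the trainable modules.

```python
def build_target_sequence(transcript: str, label: str, rationale: str) -> str:
    """Concatenate the three autoregressive targets in order:
    ASR transcript -> emotion decision -> teacher rationale.
    Tag names are placeholders; the paper does not list them."""
    return (
        f"<transcript>{transcript}</transcript>"
        f"<emotion>I think the emotion is {label}.</emotion>"
        f"<rationale>{rationale}</rationale>"
    )

# Per the summary, LoRA updates only the self-attention projections and
# the language-model head. The module names below follow Qwen-style
# checkpoints; the rank/alpha values are illustrative, not from the paper.
LORA_TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj", "lm_head"]
LORA_RANK, LORA_ALPHA = 16, 32  # placeholder hyperparameters
```

A single cross-entropy loss over this whole token sequence is what lets one model learn transcription, classification, and rationale generation jointly.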

For evaluation, the authors retain the conventional majority‑vote Macro‑F1 metric but augment it with an annotator‑aware Macro‑F1 that counts a prediction as correct if it matches any annotator’s label for that utterance. This metric better reflects the multi‑annotator nature of the MSP‑Podcast v1.12 corpus. Experiments combine training data from CREMA‑D, IEMOCAP, MELD, RAVDESS, and the MSP‑Podcast training split (≈132 h). Baselines include zero‑shot SpeechLMs (Q2A, Q2.5O, SALMONN, Baichuan‑Omni) and their Chain‑of‑Thought (CoT) variants, as well as traditional self‑supervised SER models (WavLM, HuBERT). The proposed model (Model 11) outperforms all baselines: in the open‑form setting it achieves 36.14 % Macro‑F1 on majority labels and 66.78 % on annotator‑aware labels, compared with 27.80 %/52.42 % for the strongest zero‑shot CoT baseline. Closed‑form results (where the model must select a label from a list) also show a clear gain over both zero‑shot and standard fine‑tuned models.
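The annotator-aware metric can be illustrated with a minimal sketch. The summary states only that a prediction is credited when it matches any annotator's label; the per-class false-negative accounting below is therefore an assumption, not the paper's exact definition.

```python
from collections import defaultdict


def annotator_aware_macro_f1(preds, annotator_labels, classes):
    """Macro-F1 where a prediction counts as correct if it matches
    ANY annotator's label for that utterance.

    Simplified sketch: when a prediction misses every annotator label,
    each annotated class is charged one false negative (an assumed
    convention; the paper does not spell out this detail).
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for pred, anns in zip(preds, annotator_labels):
        if pred in anns:
            tp[pred] += 1
        else:
            fp[pred] += 1
            for a in anns:
                fn[a] += 1
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(classes)
```

Because credit is granted for any annotator match, this score is by construction at least as high as majority-vote Macro-F1, which is consistent with the large gap reported (36.14 % vs. 66.78 %).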

An ablation study removes the rationale supervision (Model 10). Performance drops substantially across both open‑ and closed‑form metrics, confirming that the rationales act as an effective auxiliary training signal rather than a mere by‑product. Human judges (10 raters) and a separate LLM‑as‑judge (ChatGPT‑5) evaluate the plausibility and grounding of the generated rationales. The proposed model’s explanations are consistently rated higher than those from zero‑shot CoT baselines, with raters noting that the rationales correctly reference both lexical cues (“the word ‘surprised’”) and acoustic properties (“higher pitch, excited tone”).

The contributions are threefold: (i) an explainable SER framework that jointly outputs an emotion label and a natural‑language rationale, (ii) a dual‑metric evaluation protocol that respects both majority and minority annotator judgments while keeping the training pipeline unchanged, and (iii) empirical evidence that rationale‑augmented fine‑tuning improves alignment with human judgments and boosts classification performance. Limitations include the dependence on a powerful teacher LLM for rationale generation, which incurs computational cost and ties rationale quality to the teacher’s capabilities. Future work may explore lightweight teacher models, automated rationale quality assessment, and extending the approach to multilingual or multimodal emotion datasets. Overall, the study demonstrates that integrating reasoning traces into SER can simultaneously enhance interpretability and predictive accuracy, offering a practical path toward more trustworthy affective computing systems.

