A Baseline Multimodal Approach to Emotion Recognition in Conversations


We present a lightweight multimodal baseline for emotion recognition in conversations using the SemEval-2024 Task 3 dataset built from the sitcom Friends. The goal of this report is not to propose a novel state-of-the-art method, but to document an accessible reference implementation that combines (i) a transformer-based text classifier and (ii) a self-supervised speech representation model, with a simple late-fusion ensemble. We report the baseline setup and empirical results obtained under a limited training protocol, highlighting when multimodal fusion improves over unimodal models. This preprint is provided for transparency and to support future, more rigorous comparisons.


💡 Research Summary

The paper presents a lightweight, reproducible baseline for multimodal emotion recognition in conversations, built on the SemEval‑2024 Task 3 dataset derived from the sitcom Friends. The authors explicitly state that the goal is not to push state‑of‑the‑art performance but to provide an accessible reference implementation that combines (i) a transformer‑based text classifier and (ii) a self‑supervised speech representation model (wav2vec 2.0) using a simple late‑fusion ensemble.

Dataset and preprocessing
The dataset contains utterances from Friends annotated with one of six emotion labels (joy, sadness, anger, surprise, disgust, neutral). Text is tokenized with the standard BERT WordPiece tokenizer and minimally normalized. Audio is resampled to 16 kHz and fed directly into wav2vec 2.0 without any hand‑crafted features such as MFCCs.
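The resampling step can be sketched in a few lines. The snippet below is an illustrative linear-interpolation resampler, not the authors' pipeline; a real implementation would use a proper library resampler (e.g. torchaudio or librosa), and the 44.1 kHz source rate is an assumption for the example.

```python
import numpy as np

def resample_linear(waveform: np.ndarray, orig_sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Resample a 1-D waveform via linear interpolation.

    Illustrative only: production code would use a bandlimited resampler
    such as torchaudio.transforms.Resample or librosa.resample.
    """
    duration = len(waveform) / orig_sr           # length of the clip in seconds
    n_target = int(round(duration * target_sr))  # sample count at the target rate
    old_t = np.arange(len(waveform)) / orig_sr   # original sample timestamps
    new_t = np.arange(n_target) / target_sr      # target sample timestamps
    return np.interp(new_t, old_t, waveform)

# One second of a 440 Hz tone at 44.1 kHz becomes 16,000 samples at 16 kHz.
x = np.sin(2 * np.pi * 440 * np.arange(44_100) / 44_100)
y = resample_linear(x, orig_sr=44_100)
```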

Model architecture
Text branch: A pre‑trained BERT‑base (or RoBERTa‑large) model is fine‑tuned end‑to‑end. The CLS token representation is passed through a two‑layer feed‑forward network with dropout and a softmax output layer.
Audio branch: wav2vec 2.0‑base encodes raw waveforms into 768‑dimensional contextual embeddings. After average pooling across the time dimension, the same classifier head as the text branch is applied.
Fusion: The two branches are trained independently. At inference time, their softmax probability vectors are combined by weighted averaging (default 0.5 : 0.5, optionally tuned on a validation split). The final label is the argmax of the fused probabilities.
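The fusion step itself is simple enough to sketch directly. The code below is an illustrative reconstruction under stated assumptions (six-way softmax outputs, equal default weights, a hypothetical label order), not the authors' released implementation:

```python
import numpy as np

def late_fuse(p_text: np.ndarray, p_audio: np.ndarray, w_text: float = 0.5) -> int:
    """Weighted average of the two branches' softmax vectors, then argmax.

    w_text defaults to 0.5, matching the 0.5 : 0.5 default in the paper;
    in practice it would be tuned on a validation split.
    """
    fused = w_text * p_text + (1.0 - w_text) * p_audio
    return int(np.argmax(fused))

# Hypothetical example: text is confident in class 0, audio leans toward
# class 2; with equal weights the fused distribution still favors class 0
# (0.5*0.70 + 0.5*0.20 = 0.45 versus 0.5*0.10 + 0.5*0.45 = 0.275).
p_text = np.array([0.70, 0.10, 0.10, 0.05, 0.03, 0.02])
p_audio = np.array([0.20, 0.10, 0.45, 0.10, 0.10, 0.05])
label = late_fuse(p_text, p_audio)  # -> 0
```

Because the branches are trained independently, this fusion adds no parameters beyond the single mixing weight, which is part of what keeps the baseline lightweight.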

Training protocol
To keep the setup lightweight, hyper‑parameter search is deliberately limited. Learning rates are set to 2e‑5 for the text model and 1e‑4 for the audio model, with batch sizes of 16 and 8 respectively; training runs for 3–5 epochs on text and 4 epochs on audio. The same random seed and data split are used throughout to support reproducibility.
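The reported protocol can be collected into a single configuration sketch. This is a hypothetical layout: the model identifiers and the seed value are assumptions (the report fixes a seed but does not state which), while the learning rates, batch sizes, and epoch counts are taken from the text above.

```python
# Hypothetical training configuration mirroring the reported protocol.
CONFIG = {
    "text": {
        "model": "bert-base-uncased",  # assumed checkpoint name
        "lr": 2e-5,
        "batch_size": 16,
        "epochs": 5,                   # upper end of the reported 3-5 range
    },
    "audio": {
        "model": "wav2vec2-base",      # assumed checkpoint name
        "lr": 1e-4,
        "batch_size": 8,
        "epochs": 4,
    },
    "seed": 42,                        # the concrete seed value is an assumption
}
```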

Results

  • Text‑only BERT: macro‑F1 ≈ 0.62, accuracy ≈ 0.64.
  • Audio‑only wav2vec 2.0: macro‑F1 ≈ 0.55, accuracy ≈ 0.58.
  • Late‑fusion (text + audio): macro‑F1 ≈ 0.66, accuracy ≈ 0.68.
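For readers unfamiliar with the headline metric, macro‑F1 is the unweighted mean of per-class F1 scores, so minority emotion classes count as much as frequent ones. A minimal sketch (the toy labels below are illustrative, not drawn from the dataset):

```python
import numpy as np

def macro_f1(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int) -> float:
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

# Toy 3-class example with a couple of misclassifications.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
score = macro_f1(y_true, y_pred, n_classes=3)
```

In practice one would call `sklearn.metrics.f1_score(y_true, y_pred, average="macro")`, which computes the same quantity.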

The multimodal system yields a modest but consistent improvement over either unimodal baseline, especially on instances where textual meaning and vocal tone diverge (e.g., sarcasm, suppressed anger). This suggests that linguistic and prosodic cues carry complementary information.

Limitations

  1. No visual modality is incorporated, so facial expressions or gestures are ignored.
  2. The dataset is confined to English dialogue from a single American sitcom, limiting cross‑cultural generalization.
  3. Late‑fusion is a simple weighted average; more sophisticated cross‑modal attention or gating mechanisms are not explored.
  4. Evaluation focuses on macro‑F1 and accuracy, without metrics that capture temporal dynamics of emotion (e.g., emotion shift detection).

Reproducibility
All code, pre‑trained weights, and environment specifications (Dockerfile, requirements.txt) are released on GitHub. The authors emphasize that the baseline can be run on modest hardware (a single GPU with 8 GB memory) and serves as a “starting point” for students, educators, or researchers with limited resources.

Conclusions and future work
The study provides a transparent, easy‑to‑reproduce benchmark that quantifies the benefit of multimodal fusion under constrained resources. Future directions include: (i) adding visual features to build a true tri‑modal system, (ii) replacing late‑fusion with cross‑modal transformers or graph‑based fusion to capture deeper interactions, (iii) extending evaluation to multilingual and culturally diverse corpora, and (iv) optimizing for real‑time inference in applications such as mental‑health monitoring, empathetic customer‑service bots, and adaptive educational platforms. By offering this baseline, the authors aim to lower the entry barrier for multimodal affective computing research and facilitate more rigorous, comparable studies in the field.

