MUSE2020 challenge report
This paper is a brief report for the MuSe 2020 challenge. We present our solution for the MuSe-Wild sub-challenge, which aims to investigate sentiment analysis methods in real-world situations. Our solution achieves the best CCC performance of 0.4670 and 0.3571 for arousal and valence respectively on the challenge validation set, outperforming the baseline system with corresponding CCCs of 0.3078 and 0.1506.
💡 Research Summary
This paper reports the authors’ solution to the MuSe‑Wild sub‑challenge of the MuSe 2020 competition, which focuses on dimensional sentiment analysis (arousal and valence) in real‑world multimodal media. The authors, all from the School of Information at Renmin University of China, describe a straightforward pipeline: features from three modalities (text, audio, and visual) are extracted and aligned on a fixed 250 ms frame basis, concatenated, and fed into a shallow neural architecture consisting of a fully‑connected (FC) layer with ReLU activation, a single‑layer Long Short‑Term Memory (LSTM) network, and a final pair of FC layers that output continuous predictions for arousal and valence. Training uses mean‑squared error (MSE) loss optimized with Adam; dropout is set to 0.5, the maximum sequence length to 100 time steps, and the LSTM hidden size is tuned per feature set.
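The pipeline above can be sketched in PyTorch. This is a minimal illustration, not the authors' released code: the class name `MuseWildModel` and the feature, embedding, and hidden dimensions are illustrative (the paper tunes the LSTM hidden size per feature set), while the FC‑ReLU projection, dropout of 0.5, single‑layer LSTM, and twin FC output heads follow the description.

```python
import torch
import torch.nn as nn

class MuseWildModel(nn.Module):
    """Sketch of the described pipeline: FC + ReLU + dropout -> single-layer LSTM
    -> two FC heads emitting per-frame arousal and valence."""

    def __init__(self, feat_dim=1024, embed_dim=256, hidden_dim=128, dropout=0.5):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, embed_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        self.arousal_head = nn.Linear(hidden_dim, 1)
        self.valence_head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x: (batch, time, feat_dim) -- concatenated multimodal features,
        # one vector per 250 ms frame, up to 100 time steps per sequence
        h, _ = self.lstm(self.proj(x))
        arousal = self.arousal_head(h).squeeze(-1)  # (batch, time)
        valence = self.valence_head(h).squeeze(-1)  # (batch, time)
        return arousal, valence
```

Training would then minimize MSE between these per-frame outputs and the gold annotations with the Adam optimizer, as stated above.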
Feature extraction
- Text: Pre‑trained text representations are employed, from the language models BERT and ALBERT and from GloVe word embeddings. For each 250 ms segment, the word‑ or character‑level embeddings falling within it are averaged to produce a fixed‑size vector, denoted bert_cover, albert_cover, and glove_cover respectively.
- Audio: Two complementary representations are used. (1) Low‑Level Descriptors (LLD) from the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) provide traditional prosodic and spectral cues. (2) wav2vec, a self‑supervised speech model pre‑trained on the large LibriSpeech corpus, is applied to extract high‑level acoustic embeddings.
- Visual: Facial expression features are derived from two deep convolutional models. DenseFace, fine‑tuned on the FER+ dataset, supplies emotion‑focused embeddings from its final mean‑pooling layer. VGGFace, pre‑trained on the VGGFace dataset, provides a complementary facial representation. The two feature sets are referred to as denseface and vggface, respectively.
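All three modalities are aligned on the fixed 250 ms frame grid; for text, for example, the embeddings whose timestamps fall in a segment are averaged into one vector. The sketch below shows this pooling step under simple assumptions: `segment_average` is a hypothetical helper, timestamps are taken as millisecond offsets, and empty frames are left as zero vectors (one possible fill‑in choice, not necessarily the authors').

```python
import numpy as np

def segment_average(vectors, times_ms, num_frames, frame_ms=250):
    """Average timestamped embeddings into fixed-size frame-level features.

    vectors:    (n, dim) array of word/frame embeddings
    times_ms:   start time of each embedding, in milliseconds
    num_frames: number of frame_ms-long frames in the clip
    Frames that receive no embedding stay all-zero.
    """
    out = np.zeros((num_frames, vectors.shape[1]))
    counts = np.zeros(num_frames)
    for vec, t in zip(vectors, times_ms):
        # clamp late timestamps into the last frame
        idx = min(int(t) // frame_ms, num_frames - 1)
        out[idx] += vec
        counts[idx] += 1
    mask = counts > 0
    out[mask] /= counts[mask, None]  # mean over embeddings in each frame
    return out
```

Applied to BERT token embeddings this would yield the bert_cover features; the same routine covers albert_cover and glove_cover.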
Model formulation
For each time step j, the K modality vectors (x_{j}^{i}) (i = 1…K) are concatenated to form (z_{j}). An FC‑ReLU transformation maps (z_{j}) to an embedding (\hat{z}_{j}). The sequence (\hat{z}_{1}, \dots, \hat{z}_{T}) is then fed into the single‑layer LSTM, and at each time step the LSTM hidden state is passed through the final pair of FC layers to produce the continuous arousal and valence predictions.
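Predictions are scored with the Concordance Correlation Coefficient (CCC) reported in the abstract. A minimal numpy implementation of the standard CCC formula (twice the covariance divided by the sum of the variances plus the squared mean difference) might look like:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient between two continuous series.

    Equals 1 only for perfect agreement; penalizes both scale and
    location (mean) shifts, unlike plain Pearson correlation.
    """
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)
```

A constant offset between prediction and ground truth lowers CCC even when the two series are perfectly correlated, which is why it is the preferred metric for continuous arousal/valence tracks.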