Attention Isn't All You Need for Emotion Recognition: Domain Features Outperform Transformers on the EAV Dataset
We present a systematic study of multimodal emotion recognition using the EAV dataset, investigating whether complex attention mechanisms improve performance on small datasets. We implement three model categories: baseline transformers (M1), novel factorized attention mechanisms (M2), and improved CNN baselines (M3). Our experiments show that sophisticated attention mechanisms consistently underperform on small datasets: M2 models fell 5 to 13 percentage points below their baselines due to overfitting and the disruption of pretrained features. In contrast, simple domain-appropriate modifications proved effective: adding delta MFCCs to the audio CNN improved accuracy from 61.9% to 65.56% (+3.66pp), while frequency-domain features for EEG achieved 67.62% (+7.62pp over the paper baseline). Our vision transformer baseline (M1) reached 75.30%, exceeding the paper's ViViT result (74.5%) through domain-specific pretraining, and vision delta features achieved 72.68% (+1.28pp over the paper CNN). These findings demonstrate that for small-scale emotion recognition, domain knowledge and proper implementation outperform architectural complexity.
💡 Research Summary
This paper conducts a systematic investigation of multimodal emotion recognition on the small‑scale EAV dataset, which contains synchronized EEG, audio, and video recordings from 42 participants. Each subject provides roughly 280 training samples, a size that severely limits the capacity of deep models. The authors evaluate three families of models.
M1 comprises baseline transformer architectures that directly reuse pretrained models: a custom EEG transformer built on 1‑D convolutions followed by six self‑attention layers, the Audio Spectrogram Transformer (AST) pretrained on AudioSet, and a Vision Transformer (ViT) pretrained on facial emotion datasets. These serve as reference points.
M2 introduces factorized attention mechanisms tailored to each modality’s intrinsic structure. For EEG, a “tri‑stream” transformer processes spatial, temporal, and hemispheric‑asymmetry dimensions in parallel. For audio, a temporal‑frequency dual‑attention module reshapes AST’s patch tokens into a time‑by‑frequency grid and applies separate attention across time and frequency. For video, a ViViT‑style factorized space‑time attention first attends within each frame and then across frames. All M2 variants retain skip connections to preserve pretrained features but add new attention layers.
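The temporal-frequency factorization described above can be sketched in a few lines of numpy. This is a simplified, hypothetical illustration (single head, no learned projections, layout assumed) of the core idea: reshape the patch tokens into a time-by-frequency grid and attend along each axis separately, with a residual path back to the original tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head scaled dot-product self-attention over the last two
    axes (learned Q/K/V projections omitted for brevity)."""
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def factorized_tf_attention(patches, n_time, n_freq):
    """Reshape (n_time * n_freq, d) patch tokens into a time-by-frequency
    grid, then attend along time and frequency separately."""
    d = patches.shape[-1]
    grid = patches.reshape(n_time, n_freq, d)

    # Temporal attention: for each frequency bin, attend across time.
    t_out = self_attention(grid.transpose(1, 0, 2))  # (F, T, d)
    # Frequency attention: for each time step, attend across frequency.
    f_out = self_attention(grid)                     # (T, F, d)

    # Residual combination keeps the original (pretrained) token content.
    out = grid + t_out.transpose(1, 0, 2) + f_out
    return out.reshape(n_time * n_freq, d)

tokens = np.random.default_rng(0).normal(size=(12 * 8, 16))
out = factorized_tf_attention(tokens, n_time=12, n_freq=8)
print(out.shape)  # (96, 16)
```

The factorization cuts the attention cost from O((TF)²) for full attention to O(T² + F²) per token row, which is the efficiency argument for this design; the paper's finding is that on ~280 samples per subject, the extra learned layers hurt more than this structure helps.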
M3 focuses on minimal, domain‑aware improvements to the original CNN baselines. In audio, delta MFCCs (first‑order temporal derivatives) are appended, raising the feature dimension from 180 to 220. In EEG, raw time‑domain data are replaced by frequency‑domain band‑power, differential entropy, and alpha‑asymmetry features (total 306 dimensions) fed into a shallow MLP. In video, the squeeze‑excitation block’s reduction ratio is corrected from 1 to 16, and frame‑to‑frame delta features are added.
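The EEG feature set in M3 is built from classical quantities with closed-form definitions. As a minimal sketch (not the authors' code: the sampling rate, band edges, electrode indices, and feature count are illustrative assumptions, and the real pipeline reaches 306 dimensions), band power can be read off a periodogram, differential entropy of a Gaussian-distributed band signal is 0.5·ln(2πe·σ²), and alpha asymmetry is the log-ratio of right- to left-hemisphere alpha power:

```python
import numpy as np

FS = 500  # assumed sampling rate (Hz); check the EAV dataset spec
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def band_power(x, fs, lo, hi):
    """Mean periodogram power of one channel in the [lo, hi) Hz band."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    mask = (freqs >= lo) & (freqs < hi)
    return psd[mask].mean()

def differential_entropy(x):
    """Closed-form DE for a Gaussian signal: 0.5 * ln(2*pi*e*var)."""
    return 0.5 * np.log(2 * np.pi * np.e * x.var() + 1e-12)

def eeg_features(trial, fs=FS, left=0, right=1):
    """trial: (channels, samples). Returns band power per channel/band,
    DE per channel, and one alpha-asymmetry value for an assumed
    left/right electrode pair (channel indices are placeholders)."""
    bp = np.array([[band_power(ch, fs, lo, hi) for lo, hi in BANDS.values()]
                   for ch in trial])
    de = np.array([differential_entropy(ch) for ch in trial])
    alpha_l, alpha_r = bp[left, 1], bp[right, 1]  # column 1 = alpha band
    asym = np.log(alpha_r + 1e-12) - np.log(alpha_l + 1e-12)
    return np.concatenate([bp.ravel(), de, [asym]])

rng = np.random.default_rng(0)
feats = eeg_features(rng.normal(size=(4, 2 * FS)))
print(feats.shape)  # 4 channels * 4 bands + 4 DE + 1 asymmetry = (21,)
```

Delta features for the audio and video branches follow the same spirit: a first-order temporal difference of the MFCC (or frame) sequence, appended to the original features, supplies the network with explicit dynamics it would otherwise have to learn from very few samples.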
Experimental results show a clear split. All M2 models underperform their respective baselines by 5–13 percentage points: EEG tri‑stream reaches only 48 % (vs. 60 % baseline), audio dual‑attention 49.46 % (many subjects near chance), and factorized video 69.54 % (vs. 74.5 % ViViT). The authors attribute this to (1) degradation of strong pretrained representations by the added attention layers, (2) severe data scarcity causing over‑parameterization, and (3) instability during training (e.g., near‑perfect training accuracy but poor test performance).
Conversely, the M3 modifications yield consistent gains. Adding delta MFCCs lifts audio accuracy from 61.9 % to 65.56 % (+3.66 pp). EEG band‑power features boost performance to 67.62 % (+7.62 pp over the paper’s baseline). Video delta features raise accuracy to 72.68 % (+1.28 pp). Moreover, the M1 vision transformer, when pretrained on facial emotion data, achieves 75.30 %—the highest result across all modalities and surpassing the original ViViT result (74.5 %). This demonstrates that domain‑specific pretraining can outweigh architectural sophistication.
The paper also uncovers practical bugs: an unnecessary Softmax before CrossEntropy in the original EEGNet, and an SE block with reduction ratio 1 that prevented effective channel re‑weighting. Fixing these issues alone improves baseline performance, but the biggest jumps come from the domain‑aware feature engineering.
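The Softmax-before-CrossEntropy bug is worth seeing numerically. Framework cross-entropy losses (e.g. PyTorch's `nn.CrossEntropyLoss`) apply log-softmax internally and therefore expect raw logits; feeding them probabilities applies softmax twice, which compresses the inputs into [0, 1] and keeps the loss (and its gradients) far from zero even for confident correct predictions. A small numpy demonstration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(inputs, target):
    """Cross-entropy as frameworks compute it: log-softmax of the
    inputs, then negate the target-class entry. Inputs are expected
    to be raw logits."""
    shifted = inputs - inputs.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]

logits = np.array([4.0, 0.0, -1.0, -1.0, -2.0])  # confident, correct prediction
target = 0

correct = cross_entropy(logits, target)         # softmax applied once
buggy = cross_entropy(softmax(logits), target)  # softmax applied twice

print(f"loss from logits:        {correct:.4f}")
print(f"loss from probabilities: {buggy:.4f}")
# The double-softmax loss stays far from zero even here, flattening
# gradients and slowing training.
```

The SE-block issue is analogous in spirit: with reduction ratio 1 the squeeze-excitation bottleneck collapses into a full-rank linear map, so the block never learns a compressed channel-importance code; restoring the conventional ratio of 16 re-enables the intended re-weighting.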
In summary, the study provides three key take‑aways for emotion recognition on limited data: (1) Simple, signal‑processing‑driven features (delta MFCCs, band power, asymmetry) provide stronger inductive bias than adding complex attention modules; (2) Factorized attention, while computationally efficient on large datasets, tends to overfit and corrupt pretrained embeddings when training data are scarce; (3) Careful implementation, appropriate pretraining, and domain knowledge are far more decisive for performance than raw model depth or novelty. These insights are valuable not only for affective computing but for any multimodal biomedical signal analysis where annotated data are limited.