EEG Emotion Classification Using an Enhanced Transformer-CNN-BiLSTM Architecture with Dual Attention Mechanisms
Electroencephalography (EEG)-based emotion recognition plays a critical role in affective computing and emerging decision-support systems, yet remains challenging due to high-dimensional, noisy, and subject-dependent signals. This study investigates whether hybrid deep learning architectures that integrate convolutional, recurrent, and attention-based components can improve emotion classification performance and robustness in EEG data. We propose an enhanced hybrid model that combines convolutional feature extraction, bidirectional temporal modeling, and self-attention mechanisms with regularization strategies to mitigate overfitting. Experiments conducted on a publicly available EEG dataset spanning three emotional states (neutral, positive, and negative) demonstrate that the proposed approach achieves state-of-the-art classification performance, significantly outperforming classical machine learning and neural baselines. Statistical tests confirm the robustness of these performance gains under cross-validation. Feature-level analyses further reveal that covariance-based EEG features contribute most strongly to emotion discrimination, highlighting the importance of inter-channel relationships in affective modeling. These findings suggest that carefully designed hybrid architectures can effectively balance predictive accuracy, robustness, and interpretability in EEG-based emotion recognition, with implications for applied affective computing and human-centered intelligent systems.
💡 Research Summary
The paper addresses the challenging problem of EEG‑based emotion recognition, where high dimensionality, noise, and strong subject variability often limit the performance of conventional pipelines. To answer two research questions—whether a hybrid deep learning architecture that combines convolutional, recurrent, and attention components can improve classification accuracy, robustness, and interpretability, and which categories of EEG features are most discriminative—the authors propose an “Enhanced Transformer‑CNN‑BiLSTM” model with dual multi‑head self‑attention mechanisms and extensive regularization.
Dataset and preprocessing
The authors use a publicly available EEG emotion dataset containing 2,529 samples, each represented by a 988‑dimensional feature vector. The features span four groups: statistical descriptors, frequency‑domain measures, covariance‑based connectivity, and eigenvalue‑based metrics. After z‑score normalization, the data are split into an 80 % training set and a 20 % test set using stratified sampling, which preserves class balance across the three emotion labels (neutral, positive, negative).
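The normalization-and-split step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and random seed are assumptions, and the scaler is fitted on the training portion only (the summary does not state the order of operations; fitting on train first is the standard way to avoid leakage into the test set).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def prepare_splits(X, y, test_size=0.2, seed=42):
    """Stratified 80/20 split followed by z-score normalization.

    The scaler is fitted on the training set only, then applied to both
    splits, so test-set statistics never influence the normalization.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    scaler = StandardScaler().fit(X_train)  # mean/std from training data only
    return scaler.transform(X_train), scaler.transform(X_test), y_train, y_test
```

With the paper's dataset this would take `X` of shape `(2529, 988)` and the three-class label vector `y`.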
Model architecture
- Residual 1‑D CNN blocks – Four convolutional stages (kernel sizes 7, 5, 3, 6) extract localized spatial patterns from the feature dimension. Each block includes batch normalization, ReLU, max‑pooling, and a residual skip connection to aid gradient flow.
- Bidirectional LSTM layers – Two stacked BiLSTM layers (hidden size 256) process the sequence of CNN‑derived embeddings, capturing forward and backward temporal dependencies.
- Dual multi‑head self‑attention – The first attention block uses 16 heads, the second 8 heads. Layer‑norm and dropout (0.3) follow each block, allowing the network to attend to complementary aspects of the temporal representation while keeping computational cost manageable.
- Dual pooling – Global average pooling and global max pooling are concatenated, preserving both holistic trends and salient activations.
- Fully‑connected classifier – A deep dense head (256→512→256→128→3) with ReLU activations and dropout maps the pooled representation to the three emotion classes while regularizing the final decision.
- Regularization and loss – AdamW (β₁=0.9, β₂=0.999, weight decay = 1e‑4) optimizes a cross‑entropy loss with label smoothing (ε = 0.1). The learning rate follows a cosine‑annealing schedule with warm‑up.
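The components listed above can be assembled into a compact PyTorch sketch. This is a hedged reconstruction, not the authors' implementation: the channel widths, the learning rate, the head's 1024-dimensional input (after concatenating the two 512-dimensional pooled vectors), and the use of odd kernel sizes (7, 5, 3, 3 instead of the paper's 7, 5, 3, 6, whose even kernel would require asymmetric padding to keep the skip connection shape-compatible) are all assumptions.

```python
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    """Conv1d -> BN -> ReLU with a 1x1-projected skip, then max-pooling."""
    def __init__(self, c_in, c_out, k):
        super().__init__()
        self.conv = nn.Conv1d(c_in, c_out, k, padding=k // 2)
        self.bn = nn.BatchNorm1d(c_out)
        self.skip = nn.Conv1d(c_in, c_out, 1)  # projection so shapes match
        self.pool = nn.MaxPool1d(2)

    def forward(self, x):
        return self.pool(torch.relu(self.bn(self.conv(x)) + self.skip(x)))

class HybridEEGNet(nn.Module):
    """Sketch of the Transformer-CNN-BiLSTM with dual attention and pooling."""
    def __init__(self, n_feats=988, n_classes=3):
        super().__init__()
        chans = [1, 32, 64, 128, 256]          # assumed channel widths
        ks = [7, 5, 3, 3]                      # odd kernels for this sketch
        self.cnn = nn.Sequential(
            *[ResBlock1d(chans[i], chans[i + 1], ks[i]) for i in range(4)]
        )
        self.lstm = nn.LSTM(256, 256, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.attn1 = nn.MultiheadAttention(512, 16, dropout=0.3, batch_first=True)
        self.attn2 = nn.MultiheadAttention(512, 8, dropout=0.3, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(512), nn.LayerNorm(512)
        self.head = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):                       # x: (B, n_feats)
        z = self.cnn(x.unsqueeze(1))            # (B, 256, L) after 4x pooling
        z = z.transpose(1, 2)                   # treat feature axis as sequence
        z, _ = self.lstm(z)                     # (B, L, 512)
        a, _ = self.attn1(z, z, z)              # 16-head block + residual
        z = self.norm1(z + a)
        a, _ = self.attn2(z, z, z)              # 8-head block + residual
        z = self.norm2(z + a)
        avg = z.mean(dim=1)                     # dual pooling: average ...
        mx = z.max(dim=1).values                # ... and max, concatenated
        return self.head(torch.cat([avg, mx], dim=1))

# Optimizer and loss per the paper (lr is assumed; warm-up phase omitted)
model = HybridEEGNet()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3,
                        betas=(0.9, 0.999), weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)
```

Treating the 988 features as a length-988 pseudo-sequence (one input channel) is one plausible reading of "extract localized spatial patterns from the feature dimension"; the paper may organize the input differently.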
Training and evaluation
The model is trained under a unified protocol across five‑fold cross‑validation. It achieves an average validation accuracy of 99.19 % with a train‑validation gap of only 0.56 %, indicating strong generalization. Baselines include Random Forest (100 trees), SVM with RBF kernel (C = 1.0), a two‑layer MLP (256/128 neurons), a standard CNN‑BiLSTM, and a baseline Transformer‑CNN‑BiLSTM. All baselines fall short, ranging from 73 % (SVM) to 92 % (standard CNN‑BiLSTM).
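The evaluation protocol (stratified five-fold CV, reporting mean validation accuracy and the train-validation gap) can be sketched generically. The helper name is hypothetical, and a logistic regression stands in for the deep model purely to keep the sketch self-contained; the deep model would be wrapped behind the same fit/score interface.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression  # stand-in estimator

def cross_validate(X, y, make_model, n_splits=5, seed=42):
    """Stratified k-fold CV; returns mean val accuracy and train-val gap."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    train_acc, val_acc = [], []
    for tr, va in skf.split(X, y):
        model = make_model().fit(X[tr], y[tr])
        train_acc.append(model.score(X[tr], y[tr]))
        val_acc.append(model.score(X[va], y[va]))
    # A small gap (the paper reports 0.56 %) indicates limited overfitting.
    return np.mean(val_acc), np.mean(train_acc) - np.mean(val_acc)
```

Running each baseline through the same loop is what makes the 73-92 % baseline range directly comparable to the proposed model's 99.19 %.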
Statistical significance
A Friedman test followed by post‑hoc Wilcoxon signed‑rank comparisons confirms that the proposed architecture outperforms every baseline at p < 0.05.
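The two-stage test can be reproduced with SciPy. The per-fold accuracies below are illustrative placeholders (roughly matching the reported means), not values from the paper. One caveat worth knowing: with only five paired folds, the exact two-sided Wilcoxon p-value cannot fall below 0.0625, so pairwise significance at p < 0.05 implies the comparisons used more paired observations than five fold-level scores.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical per-fold accuracies for three of the models (5 CV folds).
proposed   = np.array([0.991, 0.993, 0.990, 0.992, 0.994])
cnn_bilstm = np.array([0.921, 0.915, 0.930, 0.918, 0.925])
svm        = np.array([0.735, 0.728, 0.741, 0.730, 0.726])

# Omnibus test: do the models' fold-wise ranks differ at all?
stat, p = friedmanchisquare(proposed, cnn_bilstm, svm)

if p < 0.05:  # only then run pairwise post-hoc comparisons
    for name, scores in [("CNN-BiLSTM", cnn_bilstm), ("SVM", svm)]:
        w, p_pair = wilcoxon(proposed, scores)
        print(f"proposed vs {name}: Wilcoxon p = {p_pair:.4f}")
```

In practice a multiple-comparison correction (e.g. Holm) would be applied across the pairwise tests; the summary does not say whether the authors did so.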
Feature importance analysis
Multiple importance‑ranking methods (Random Forest, Extra Trees, Mutual Information, ANOVA, Pearson correlation, SHAP) consistently highlight covariance‑based connectivity features as the most informative for discriminating emotions. Ablation experiments further show that removing these features degrades accuracy by more than 5 %, whereas eliminating purely statistical descriptors has a smaller effect.
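Aggregating importance scores over feature groups can be sketched with two of the listed methods (Random Forest impurity importance and mutual information). The `groups` mapping is hypothetical: the paper does not publish which of the 988 indices belong to the statistical, frequency, covariance, and eigenvalue groups, so callers must supply their own index ranges.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

def rank_feature_groups(X, y, groups, seed=0):
    """Rank named feature groups by averaged per-feature importance.

    groups: dict mapping a group name to an array of column indices.
    Combines RF impurity importance and mutual information (two of the
    six methods the paper reports) by a simple mean of group averages.
    """
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    mi = mutual_info_classif(X, y, random_state=seed)
    scores = {
        name: (rf.feature_importances_[idx].mean() + mi[idx].mean()) / 2
        for name, idx in groups.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

The ablation check from the paper follows the same pattern: drop one group's columns, re-run the cross-validation loop, and compare mean accuracy against the full feature set.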
Interpretability
Attention weight visualizations reveal that certain channel pairs receive consistently higher attention scores, aligning with neurophysiological findings that inter‑regional synchrony correlates with affective states. This provides a degree of model transparency useful for clinical or real‑time applications.
Conclusions and future work
The study demonstrates that a carefully engineered hybrid network—integrating residual CNNs, stacked BiLSTMs, and dual multi‑head self‑attention—can achieve near‑perfect emotion classification on high‑dimensional EEG data while maintaining low over‑fitting risk. The dominance of connectivity features underscores the importance of modeling inter‑channel relationships. The authors suggest extending the approach to multimodal settings (e.g., EEG + facial video), exploring cross‑subject domain adaptation, and developing lightweight variants for deployment on edge devices.