Cross-subject generalization in EEG-based brain-computer interfaces (BCIs) remains challenging due to individual variability in neural signals. We investigate whether spectral representations offer more stable features for cross-subject transfer than temporal waveforms. Through correlation analyses across three EEG paradigms (SSVEP, P300, and Motor Imagery), we find that spectral features exhibit consistently higher cross-subject similarity than temporal signals. Motivated by this observation, we introduce ASPEN, a hybrid architecture that combines spectral and temporal feature streams via multiplicative fusion, requiring cross-modal agreement for features to propagate. Experiments across six benchmark datasets show that ASPEN dynamically adapts the spectral-temporal balance to each paradigm. ASPEN achieves the best unseen-subject accuracy on three of six datasets and competitive performance on the others, demonstrating that multiplicative multimodal fusion enables effective cross-subject generalization.
Cross-subject generalization remains a fundamental bottleneck in EEG-based brain-computer interfaces (BCIs). Models trained on multi-subject data often degrade substantially when deployed to new users, requiring lengthy subject-specific calibration that undermines the goal of plug-and-play systems (Wan et al., 2021; Liang et al., 2024b). This stems from inherent differences between individuals, such as skull thickness, cortical folding, and electrode placement, that can produce substantial variation in signal amplitude, timing, and spatial distribution (Lu et al., 2024; Roy et al., 2019).
A growing body of work has addressed this limitation through increasingly expressive temporal modeling, progressing from compact CNN-based decoders (Lawhern et al., 2018) to Transformer architectures that capture global dependencies (Song et al., 2022). However, temporal waveforms are highly sensitive to phase shifts, latency jitter, and amplitude scaling across subjects. The hypothesis we investigate here is that spectral representations provide a more stable basis for cross-subject transfer. Frequency-domain features abstract away precise timing information while preserving the oscillatory signatures, such as µ (8-12 Hz) and β (13-30 Hz) rhythms, that serve as primary biomarkers for BCI paradigms (Ang et al., 2008; Mane et al., 2020).
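To illustrate this abstraction, the sketch below extracts µ- and β-band power from a multichannel epoch with a Welch periodogram, discarding phase and latency while retaining oscillatory content. This is not ASPEN's spectral stream; the sampling rate, band edges, and epoch size are placeholder assumptions chosen only for the example.

```python
# Illustrative band-power extraction (not the paper's pipeline): timing details
# are discarded, oscillatory signatures such as mu/beta power are retained.
import numpy as np
from scipy.signal import welch

def band_power(eeg, fs, band):
    """Mean power of `eeg` (channels x samples) within `band` = (low, high) Hz."""
    freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2)   # per-channel power spectral density
    mask = (freqs >= band[0]) & (freqs < band[1])
    return psd[:, mask].mean(axis=1)                 # one value per channel

fs = 250                                   # assumed sampling rate (Hz)
eeg = np.random.randn(22, fs * 4)          # placeholder 4-s, 22-channel epoch
mu_power = band_power(eeg, fs, (8, 12))    # mu rhythm
beta_power = band_power(eeg, fs, (13, 30)) # beta rhythm
features = np.concatenate([mu_power, beta_power])
```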
To test this hypothesis, we first conduct a systematic correlation analysis comparing temporal and spectral representations across SSVEP, P300, and Motor Imagery paradigms. Our analysis reveals that spectral features exhibit substantially higher cross-subject similarity than temporal signals, suggesting that frequency-domain representations offer a more robust foundation for generalization. Motivated by this finding, we introduce ASPEN (Adaptive Spectral Encoder Network, Figure 1), a hybrid framework that processes EEG signals through parallel temporal and spectral streams and combines them via multiplicative fusion. Unlike prior approaches that concatenate or average multimodal features (Li et al., 2021; 2025), multiplicative fusion computes element-wise products of projected stream representations, requiring both streams to agree for a feature to propagate. This cross-modal gating naturally suppresses artifacts and noise that appear prominently in only one view, while amplifying genuine neural patterns that manifest consistently across both temporal and spectral domains.
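A minimal sketch of this fusion rule is given below, assuming each stream has already been encoded into a fixed-size vector. The module names, dimensions, and tanh squashing are illustrative choices for the example and do not reproduce ASPEN's actual layers.

```python
# Minimal PyTorch sketch of multiplicative cross-modal fusion; names and sizes
# are illustrative, not ASPEN's implementation.
import torch
import torch.nn as nn

class MultiplicativeFusion(nn.Module):
    def __init__(self, temporal_dim, spectral_dim, fused_dim):
        super().__init__()
        self.proj_t = nn.Linear(temporal_dim, fused_dim)  # project temporal stream
        self.proj_s = nn.Linear(spectral_dim, fused_dim)  # project spectral stream

    def forward(self, h_temporal, h_spectral):
        z_t = torch.tanh(self.proj_t(h_temporal))
        z_s = torch.tanh(self.proj_s(h_spectral))
        # Element-wise product: a feature propagates only if both streams activate it,
        # gating out artifacts that appear prominently in a single view.
        return z_t * z_s

fusion = MultiplicativeFusion(temporal_dim=128, spectral_dim=64, fused_dim=96)
fused = fusion(torch.randn(8, 128), torch.randn(8, 64))  # batch of 8 epochs
```

In contrast to concatenation or averaging, where a strong activation in one stream can dominate the fused representation on its own, the product acts as a soft logical AND over the two views.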
We evaluate ASPEN across six benchmark datasets spanning three paradigms. Our experiments reveal that the optimal spectral-temporal balance varies by task: P300 decoding benefits strongly from spectral emphasis, while Motor Imagery requires a greater temporal contribution. ASPEN achieves the best unseen-subject accuracy on three datasets (Lee2019 SSVEP, BNCI2014 P300, and Lee2019 MI), outperforming both specialized temporal models and recent multimodal transformers. These results demonstrate that ASPEN generalizes to unseen subjects across different BCI tasks while remaining robust to diverse neural signatures.
Temporal modeling: Deep learning for EEG signals has evolved from high-capacity architectures like DeepConvNet (Schirrmeister et al., 2017) toward compact, neurophysiologically informed models. EEGNet (Lawhern et al., 2018) introduced depthwise and separable convolutions that mirror traditional spatial filtering, achieving strong performance with minimal parameters. Transformer-based models such as EEG Conformer (Song et al., 2022) and hybrid CNN-Transformer architectures like CTNet (Zhao et al., 2024) capture long-range temporal dependencies. Temporal convolutional networks (TCNs) offer improved sequential modeling with training stability advantages over recurrent approaches (Ingolfsson et al., 2020; Musallam et al., 2021).
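For concreteness, the sketch below shows the depthwise/separable convolution pattern that EEGNet popularized: a temporal convolution, a depthwise spatial filter across electrodes, then a separable temporal convolution. Kernel sizes and channel counts here are illustrative and do not match the published configuration.

```python
# Sketch of the EEGNet-style depthwise/separable convolution pattern;
# hyperparameters are illustrative, not the published EEGNet settings.
import torch
import torch.nn as nn

n_channels, n_samples, F1, D = 22, 500, 8, 2

block = nn.Sequential(
    nn.Conv2d(1, F1, (1, 64), padding=(0, 32), bias=False),         # temporal convolution
    nn.BatchNorm2d(F1),
    nn.Conv2d(F1, F1 * D, (n_channels, 1), groups=F1, bias=False),  # depthwise spatial filter
    nn.BatchNorm2d(F1 * D),
    nn.ELU(),
    nn.AvgPool2d((1, 4)),
    nn.Conv2d(F1 * D, F1 * D, (1, 16), padding=(0, 8),
              groups=F1 * D, bias=False),                           # separable conv: depthwise part
    nn.Conv2d(F1 * D, F1 * D, 1, bias=False),                       # separable conv: pointwise part
    nn.ELU(),
)

out = block(torch.randn(4, 1, n_channels, n_samples))  # input: (batch, 1, channels, samples)
```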
Spectral and filter-bank approaches: Filter-bank methods decompose EEG into frequency sub-bands before learning spatial filters. The foundational FBCSP algorithm (Ang et al., 2008) demonstrated that isolating discriminative frequency bands improves motor imagery classification. Deep learning extensions apply this principle with learnable filters (Mane et al., 2020; Liu et al., 2022), while IFNet (Wang et al., 2023) models cross-frequency interactions. Time-frequency representations via wavelets have also shown promise for capturing non-stationary dynamics (Morales & Bowers, 2022).
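The following sketch shows a filter-bank front end in the spirit of FBCSP: each epoch is band-pass filtered into sub-bands before any spatial filtering or learning. The band edges and filter order are illustrative assumptions rather than those of a specific published pipeline.

```python
# FBCSP-style filter-bank decomposition sketch; band edges and filter order
# are illustrative choices, not a specific published configuration.
import numpy as np
from scipy.signal import butter, filtfilt

def filter_bank(eeg, fs, bands):
    """Return a (n_bands, channels, samples) array of band-pass filtered copies."""
    out = []
    for low, high in bands:
        b, a = butter(4, [low, high], btype="bandpass", fs=fs)
        out.append(filtfilt(b, a, eeg, axis=-1))
    return np.stack(out)

fs = 250
eeg = np.random.randn(22, fs * 4)                                  # placeholder epoch
bands = [(4, 8), (8, 12), (12, 16), (16, 20), (20, 24), (24, 28), (28, 32)]
sub_band_epochs = filter_bank(eeg, fs, bands)                      # fed to CSP or learnable filters
```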
Temporal-spectral fusion: Recent work has begun combining temporal and spectral features. Li et al. (2021) proposed a temporal-spectral squeeze-and-excitation network for motor imagery. TSformer-SA (Li et al., 2025) integrates temporal signals with wavelet spectrograms through cross-view attention for RSVP decoding. Dual-branch architectures have also been explored for emotion recognition (Luo et al., 2023). However, these approaches typically employ additive fusion strategies, such as concatenation, averaging, or learned weighted sums, that allow each stream to contribute independently. Our multiplicative approach instead requires agreement between streams for a feature to propagate, suppressing patterns that appear in only one view.