SpecMoE: Spectral Mixture-of-Experts Foundation Model for Cross-Species EEG Decoding
Davy Darankoum 1,2, Chloé Habermacher 2, Julien Volle 2,* (jvolle@synapcell.fr), and Sergei Grudinin 1,* (sergei.grudinin@univ-grenoble-alpes.fr)

1 Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
2 SynapCell SAS, ZAC ISIPARC, 38330 Saint-Ismier, France
* These authors jointly supervised the work.

Abstract. Decoding the orchestration of neural activity in electroencephalography (EEG) signals is a central challenge in bridging neuroscience with artificial intelligence. Foundation models have made strides in generalized EEG decoding, yet many existing frameworks primarily rely on separate temporal and spectral masking of raw signals during self-supervised pretraining. Such strategies tend to bias learning toward high-frequency oscillations, as low-frequency rhythmic patterns can be easily inferred from the unmasked signal. We introduce a foundation model that utilizes a novel Gaussian-smoothed masking scheme applied to short-time Fourier transform (STFT) maps. By jointly applying time, frequency, and time-frequency Gaussian masks, we make the reconstruction task much more challenging, forcing the model to learn intricate neural patterns across both high- and low-frequency domains. To effectively recover signals under this aggressive masking strategy, we design SpecHi-Net, a U-shaped hierarchical architecture with multiple encoding and decoding stages. To accelerate large-scale pretraining, we partition the data into three subsets, each used to train an independent expert model. We then combine these models through SpecMoE, a mixture-of-experts framework guided by a learned spectral gating mechanism.
SpecMoE achieves state-of-the-art performance across a diverse set of EEG decoding tasks, including sleep staging, emotion recognition, motor imagery classification, abnormal signal detection, and drug effect prediction. Importantly, the model demonstrates strong cross-species and cross-subject generalization, maintaining high accuracy on both human and murine EEG datasets. These findings suggest that spectral-aware Gaussian-smoothed masking combined with hierarchical feature integration provides a powerful inductive bias for next-generation EEG foundation models and brain-computer interface systems.

Keywords: EEG Foundation Models · Mixture of Experts · Self-Supervised Learning · Time-Frequency Masking · Cross-Species Neural Decoding.

1 Introduction

Electroencephalography (EEG) remains the primary non-invasive gateway to understanding human cortical dynamics, offering superior temporal resolution for clinical diagnostics and Brain-Computer Interface (BCI) development. However, the inherent challenges of EEG (high non-stationarity, low signal-to-noise ratio, and significant inter-subject variability) have historically confined decoding models to narrow, task-specific applications.

The emergence of EEG foundation models has recently shifted the decoding paradigm toward large-scale self-supervised pretraining [1, 2, 3]. By leveraging masked signal reconstruction, these models learn generalized representations from massive unlabeled datasets. However, current frameworks predominantly apply rectangular masks to raw temporal or spectral data independently. This approach suffers from two critical flaws. First, sharp masking boundaries introduce high-frequency edge artifacts that bias the model toward "discontinuity recovery" in addition to endogenous neural features. In the context of EEG, this is particularly problematic, as neural oscillations are fundamentally smooth and rhythmic.
Forcing the model to reconstruct artificial transients at every mask boundary acts as spectral interference, diverting a portion of the optimization process away from the underlying neural manifold. Second, standard masking strategies are susceptible to information leakage in the low-frequency domain. Because slow oscillations span long temporal windows, the model can straightforwardly infer the low-frequency patterns from the unmasked EEG segments. Consequently, the pretraining task becomes trivial in the lower bands, preventing foundation models from grasping the critical long-range rhythmic structures that characterize many cognitive and pathological states.

Fig. 1. Left: previous masking strategies, where rectangular masks remove some time frames. Middle: original EEG signals. Right: the proposed Gaussian masks remove certain frequency oscillations (mostly low frequencies) in addition to certain time frames. The horizontal axis represents time in seconds, and the vertical axis voltage in µV.

To address these limitations, we introduce SpecMoE, a spectral-anchored foundation model. Unlike previous methods, SpecMoE operates in the time-frequency domain using a novel Gaussian-smoothed masking strategy on short-time Fourier transform (STFT) spectrograms (see Fig. 1 for a comparison with other masking strategies). By replacing sharp boundaries with smooth transitions, we eliminate the bias toward non-physiological transients, ensuring that the optimization process remains focused on physiological neural rhythms.
Furthermore, our joint time and frequency masking geometry, where specific frequency bands are masked over the entire duration of the EEG segment, prevents any low-frequency leakage between the masked and remaining signals. Such masking geometry forces the model to learn the complex long-range dependencies of the neural manifold rather than relying on simple interpolation. To effectively recover signals under this more challenging masking regime, we design SpecHi-Net, a U-shaped hierarchical architecture. By utilizing multiple encoding and decoding stages, SpecHi-Net captures multi-scale temporal and spectral features, providing the structural depth necessary for high-fidelity reconstruction where shallower, non-hierarchical transformers typically fail. Finally, to improve domain-specific adaptation on downstream applications, we propose a Spectral-guided Mixture of Experts (SpecMoE) framework. Governed by a learned spectral gating mechanism that routes information based on the signal's Power Spectral Density (PSD), our model dynamically weights expert contributions to match the rhythmic content of the task at hand. This approach enables state-of-the-art performance and robust cross-species generalization across nine heterogeneous benchmarks.

In summary, our work comprises the following main contributions:

– A novel Gaussian-smoothed masking strategy for self-supervised pretraining
– An architecture with an emphasis on a multi-level encoder-decoder (SpecHi-Net)
– A novel spectral gating mechanism for a Mixture of Experts framework
– A novel cross-species validation approach for EEG foundation models

The code and pretrained models are available at https://github.com/TeraXj78/SpecMoE.

2 Related Work

Recent Evaluations. EEG-FM-Bench is a recent comprehensive resource that compares foundation EEG models on 14 open-source datasets across 10 common EEG paradigms [4].
According to the benchmark, EEGPT [5] and CBraMod [6] are the top-performing models, being both compact and effective. The authors also emphasize the importance of fine-tuning all parameters initialized from the pretrained weights, and critically evaluate single-task versus multi-task fine-tuning objectives. Liu et al. [3] recently provided another evaluation of 12 open-source foundation models and competitive task-specific baselines across 13 EEG datasets spanning nine BCI paradigms. They also concluded that full-parameter fine-tuning is beneficial and noted that task-specific models can still be competitive. Overall, CBraMod [6] was again the top-performing foundation model in this study, while EEGNet [7] demonstrated the best performance among task-specific models. Brain4FMs [2] and Kuruppu et al. [1] are yet other recent attempts to benchmark foundation models, but they did not include some state-of-the-art approaches, e.g., CSBrain. Interestingly, the event-related potential (ERP) benchmark revealed that EEG foundation models do not outperform supervised task-specific baselines in ERP tasks [8], where EEGConformer [9] achieved the highest accuracy for ERP classification.

Task-Specific and Foundation Models. Despite the emergence of foundation EEG models, smaller, task-specific architectures continue to serve as strong baselines. For instance, EEGNet [7] is a lightweight CNN-based model that utilizes depthwise and separable convolutions to extract robust features. FFCL [10] combines parallel CNN and LSTM branches, allowing CNNs to extract spatial features while LSTMs capture temporal dynamics. The features from these multiple branches are then fused using a fully connected layer.
Another notable task-specific baseline is EEGConformer [9], which employs self-attention mechanisms to capture long-range temporal dependencies and global contextual information. LaBraM [11] is one of the pioneering EEG foundation models, pre-trained on 2,500 hours of data using vector-quantized variational autoencoders for dual frequency-phase mask learning. CBraMod [6] introduces criss-cross attention to separately capture spatial and temporal features within the same transformer layer, enhancing its representational capabilities. CSBrain [12] builds upon CBraMod and incorporates cross-scale spatiotemporal tokenization along with structured sparse attention to create robust EEG representations. A novel decoder-centric paradigm, inspired by developments in large language models (LLMs), was introduced in ECHO [13]. LEAD [14] uses separate temporal and spatial attentions followed by a learnable gating fusion. During pretraining, it employs one of five augmentation strategies, including individual temporal, frequency, or channel masking. REVE [15] introduced 4D spatial-temporal positional encoding with massive-scale pretraining. Similarly, DeeperBrain [16] leverages the spatial and temporal organization of the data by modeling 3D electrode geometry and a learnable spatial decay kernel. It also introduces neurodynamics statistics learning during pretraining. UNI-NTFM [17] is currently the largest EEG foundation model, featuring up to 1.9 billion parameters. It encodes time, frequency, and raw signal representations separately, followed by cross-attention. Additionally, it introduces spatial priors into input features and optimizes fine-tuning using a mixture-of-experts neural transformer.

The most popular pretraining objective comprises randomly masked time-domain signal reconstruction [3].
Yet, the classical BrainBERT [18] and BrainWave [19] approaches are pretrained on masked STFT spectrograms, while BioCodec [20] employs an STFT-based spectral loss computed at multiple scales, EEGFormer [21] reconstructs spectral amplitudes, and ALFEE [22] reconstructs the PSD. Notably, TFM-Tokenizer [23] introduces an explicit dual-path time-frequency rectangular masking objective to disentangle temporal and spectral motifs, encouraging the model to learn frequency-specific patterns across time.

The Mixture of Experts (MoE) framework is gaining traction in EEG foundation models. The MGEC model [24] combines shared and routed experts, and incorporates a MoE architecture for the latter. UNI-NTFM [17] also applies the MoE concept during fine-tuning optimization. Additionally, BrainMoE [25] develops a channel-wise MoE structure.

3 SpecMoE Framework

Pretraining Design. SpecMoE pretraining (Fig. 2) follows a self-supervised generative paradigm designed to reconstruct signals from spectral-temporal corruption. The pretraining pipeline is structured into three phases: EEG signal corruption, latent space learning, and multi-scale signal reconstruction.

Phase I: EEG Signal Corruption. The input x ∈ R^{C×L}, where C denotes the number of recording channels and L the sequence length (total number of time steps), is first projected into the time-frequency domain via an STFT, yielding the complex-valued representation z ∈ C^{C×F×T}. Here, F and T represent the numbers of frequency bins and temporal frames, respectively. We then apply the proposed Gaussian-smoothed masking strategy to z, designed to prevent spectral leakage and preserve low-frequency physiological rhythms. The corrupted temporal signal x̂ ∈ R^{C×L} is subsequently recovered through an inverse STFT (iSTFT), serving as the input to the neural backbone.
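As a rough illustration of Phase I, the corruption round-trip (STFT, masking, iSTFT) can be sketched with off-the-shelf tooling. This is a minimal sketch assuming `scipy`; the function name and the pluggable `mask_fn` placeholder are ours, not the paper's released code:

```python
import numpy as np
from scipy.signal import stft, istft

def corrupt_eeg(x, fs=200, nperseg=64, mask_fn=None):
    """Phase I sketch: project to time-frequency, mask, and invert.

    x: (C, L) raw EEG; mask_fn maps (F, T) -> a mask with values in [0, 1].
    """
    f, t, z = stft(x, fs=fs, nperseg=nperseg)        # z: (C, F, T), complex
    if mask_fn is None:
        mask = np.ones((len(f), len(t)))             # identity mask (no corruption)
    else:
        mask = mask_fn(len(f), len(t))
    _, x_hat = istft(z * mask, fs=fs, nperseg=nperseg)
    return x_hat[..., : x.shape[-1]]                 # trim any iSTFT padding

x = np.random.randn(19, 6000)                        # 19 channels, 30 s at 200 Hz
x_hat = corrupt_eeg(x)
```

With the identity mask, the iSTFT inverts the STFT almost exactly (the default Hann window at 50% overlap satisfies the COLA condition), so any reconstruction difficulty during pretraining comes from the mask itself.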
Fig. 2. SpecMoE overview. A) Gaussian-based masking pipeline. B) Hierarchical encoder, with 'k' standing for 'kernel'. C) Hierarchical decoder. D) Reconstruction objective. E) Fine-tuning pipeline.

Phase II: Hierarchical Latent Space Encoding. The corrupted signal x̂ is processed by SpecHi-Net, a Spectral-Hierarchical Network that adopts a U-Net-like hierarchical structure to enable multi-scale feature extraction and high-fidelity signal recovery. The encoder of SpecHi-Net consists of three successive downsampling stages (Down1, Down2, and Down3), each employing a dual-path convolutional encoder to simultaneously capture transient micro-states and long-range oscillatory patterns. To capture global dependencies across channels, these convolutional features are interleaved with Global Transformer layers utilizing Rotary Positional Encodings (RoPE) [26]. We specifically do not encode absolute channel positions; the architecture processes all channels equally, which ensures invariance to the number of channels.

Phase III: Multi-Scale EEG Signal Reconstruction. The symmetric decoder of SpecHi-Net integrates the latent representations learned by the encoder through a series of upsampling stages (Up1, Up2, and Up3) and skip connections.
To enforce structural and physiological consistency, the decoder generates reconstructions at multiple levels of granularity: one intermediate signal reconstructed after the Up1 layer (interm-x1), a second intermediate signal reconstructed after the Up2 layer (interm-x2), and finally two reconstructions from the Up3 outputs (x̃1, x̃2). The entire system is optimized via a multi-objective loss function that minimizes error in both the temporal and spectral domains.

Gaussian-Smoothed Masking Strategy. To compel SpecHi-Net to learn the intrinsic spectral-temporal dynamics of neural signals, we propose a novel Gaussian-smoothed masking strategy applied in the time-frequency domain. Unlike traditional temporal masking, which creates sharp discontinuities and high-frequency edge artifacts, our approach applies "soft" masks to the STFT coefficients that respect the biological rhythmic structures of the EEG (see Fig. 1).

Given an input EEG signal x ∈ R^{C×L}, we first compute its STFT representation z ∈ C^{C×F×T}, where F and T denote the frequency and time-frame dimensions, respectively. We define a 2D mask map M(f, t) ∈ [0, 1]^{F×T}, initialized as a matrix of ones. We iteratively subtract Gaussian kernels from M until a target mask ratio ρ = 0.5 is reached. A single Gaussian kernel G(f, t) centered at coordinates (f_0, t_0) is defined as

G(f, t) = \exp\left(-\frac{(f - f_0)^2}{2\sigma_f^2}\right) \exp\left(-\frac{(t - t_0)^2}{2\sigma_t^2}\right),   (1)

where we set the parameters σ_f and σ_t to 5% of the total range of the frequency domain F and time domain T, respectively. The mask map is updated via

M(f, t) ← M(f, t) · (1 − G(f, t)).   (2)

The same mask M is applied to all C channels of the STFT representation z. Finally, the masked signal x̂ is generated via the inverse STFT:

x̂ = iSTFT(z ⊙ M).   (3)
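A minimal NumPy sketch of the mask construction in Eqs. (1)–(2) follows; the function name, the stopping rule on masked mass, and the seeding are our own illustrative choices, and the band-informed selection of frequency centers described later in this section is omitted:

```python
import numpy as np

def gaussian_mask(F, T, rho=0.5, sigma_frac=0.05, seed=None):
    """Damp a [0, 1] mask with random Gaussian kernels (Eqs. 1-2)
    until the masked fraction 1 - mean(M) reaches the target ratio rho."""
    rng = np.random.default_rng(seed)
    M = np.ones((F, T))
    sig_f, sig_t = sigma_frac * F, sigma_frac * T        # 5% of each range
    f = np.arange(F)[:, None]
    t = np.arange(T)[None, :]
    while 1.0 - M.mean() < rho:
        f0, t0 = rng.uniform(0, F), rng.uniform(0, T)    # random kernel center
        G = (np.exp(-((f - f0) ** 2) / (2 * sig_f ** 2))
             * np.exp(-((t - t0) ** 2) / (2 * sig_t ** 2)))   # Eq. 1
        M *= 1.0 - G                                     # Eq. 2
    return M

M = gaussian_mask(33, 188, seed=0)   # e.g. 33 frequency bins, 188 STFT frames
```

The smoothed mask is then applied multiplicatively to every channel of z before the iSTFT (Eq. 3), so mask edges never introduce sharp spectral discontinuities.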
Multi-Type Masking Strategy. We simultaneously employ three distinct masking geometries governed by a probability distribution P = [0.6, 0.3, 0.1] for frequency, time, and joint time-frequency masking, respectively.

Frequency Masking (σ_t → ∞): This mode obscures specific spectral bands across the entire temporal window, compelling SpecHi-Net to reconstruct frequency-domain information from broader cross-temporal correlations. By extending the mask across the full duration of the segment, we effectively eliminate spectral information leakage; since a masked frequency bin is unavailable at any time point, the network is forced to learn the underlying physiological dependencies between different neural oscillations.

Time Masking (σ_f → ∞): This mode removes contiguous temporal segments across the entire spectrum. By obscuring all frequency components for a given time frame, the strategy forces the model to capture the long-range temporal dynamics of the signal. SpecHi-Net must therefore learn to interpolate transient neural events by leveraging the temporal continuity and phase consistency of the preceding and succeeding spectral contexts.

Joint Time-Frequency Masking: This mode utilizes localized 2D Gaussian "blobs" to obscure specific regions in the time-frequency plane. Unlike the previous modes, this geometry targets discrete neuro-spectral events. By masking localized clusters of energy, it challenges the model to perform joint spectral-temporal inference, reconstructing the missing information by simultaneously utilizing neighboring frequency bands and adjacent time steps. This simulates the recovery of obscured biological biomarkers while promoting a robust, multi-dimensional latent representation.

Spectral-Band Bias. A critical innovation of SpecMoE is the band-informed bias.
During the selection of frequency centers f_0, we specifically constrain 50% of the masks to correspond to the primary EEG physiological bands: δ (1–4 Hz), θ (4–8 Hz), α (8–12 Hz), and β (12–30 Hz). By prioritizing the masking of these low-frequency rhythmic structures, which are often not targeted by previous masking strategies, we ensure that our model develops a high-fidelity understanding of a broad range of clinically relevant biomarkers.

Hierarchical Dual-Path Convolutional Encoder. To address the multi-scale nature of EEG signals, SpecHi-Net utilizes a hierarchical encoder composed of three primary levels (Down1, Down2, and Down3). Following previous works [27, 28], each stage features a dual-path architecture: one path employs small-kernel convolutions for transient detection, while the parallel path utilizes large-kernel, dilated convolutions to capture long-range rhythmic structures. This design ensures that both localized micro-states and global oscillatory patterns are encoded into the latent space.

Interleaved Global Transformer. We insert three transformer layers, interleaved at hierarchical transitions and within the architectural bottleneck (Bottom1 and Bottom2), utilizing Multi-Head Attention (MHA). To ensure the learned representations remain robust to the varying sequence lengths encountered across diverse downstream tasks, we adopt Rotary Positional Encodings (RoPE) [26]. We strategically selected RoPE over more traditional absolute positional embeddings: it allows the model to maintain temporal consistency across the 1 s to 60 s windows found in our downstream datasets and ensures resolution invariance as the signal is compressed through the hierarchy of SpecHi-Net.

A distinguishing feature of our transformer implementation is the processing of the dual-path encoder outputs.
Before entering the attention blocks, the feature maps from the short-range and long-range convolutional paths are flattened and concatenated along the temporal dimension. This operation enables the transformer to: (1) learn cross-feature relationships: by observing features from both small (k = 4) and wide (k = 65) kernels simultaneously, the model can learn how localized transients (e.g., spikes) correlate with broader rhythmic oscillations (e.g., slow waves); and (2) capture global dependencies: operating on the flattened channel-time representation allows the model to integrate context across the entire electrode manifold. Since the Gaussian mask is applied uniformly across all channels, this architecture compels the model to reconstruct obscured spectral-temporal regions by modeling the complex oscillations inherent in large-scale brain networks rather than relying on local temporal redundancies.

Multi-Objective Loss Design. To ensure that SpecHi-Net captures both the high-fidelity temporal morphology and the essential spectral characteristics of EEG signals, we optimize the framework using a composite loss function L_total. This function integrates multi-scale temporal reconstruction errors with a channel-wise spectral penalty. For any predicted signal p and its corresponding ground truth y, we define a base composite loss L(p, y) that balances time-domain mean squared error (MSE) with frequency-domain consistency:

L(p, y) = \mathrm{MSE}(p, y) + w_{\mathrm{spec}} \, L_{\mathrm{spec}}(p, y),   (4)

where w_spec = 0.02 brings the spectral term to the same order of magnitude as the temporal loss. We designed the spectral loss component L_spec with the specific goal of being sensitive to the reconstructed signal amplitudes |STFT(p_c)| but not the phases:

L_{\mathrm{spec}}(p, y) = \frac{1}{C} \sum_{c=1}^{C} \left( |\mathrm{STFT}(p_c)| - |\mathrm{STFT}(y_c)| \right)^2.   (5)
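Assuming PyTorch tensors shaped (batch, channels, time), the composite loss of Eqs. (4)–(5) can be sketched as follows; the FFT size and window are illustrative choices, and averaging over all STFT cells stands in for the paper's channel-wise sum up to a constant factor:

```python
import torch
import torch.nn.functional as F

def spectral_loss(p, y, n_fft=64):
    """Eq. 5 sketch: penalize STFT amplitude differences, ignoring phase."""
    win = torch.hann_window(n_fft)
    P = torch.stft(p.flatten(0, 1), n_fft, window=win, return_complex=True).abs()
    Y = torch.stft(y.flatten(0, 1), n_fft, window=win, return_complex=True).abs()
    return ((P - Y) ** 2).mean()       # mean over channels, bins, and frames

def composite_loss(p, y, w_spec=0.02):
    """Eq. 4: time-domain MSE plus the weighted spectral amplitude penalty."""
    return F.mse_loss(p, y) + w_spec * spectral_loss(p, y)

p = torch.randn(8, 19, 1200)           # reconstruction (batch, C, L)
y = torch.randn(8, 19, 1200)           # ground truth
loss = composite_loss(p, y)
```

Because only STFT magnitudes enter `spectral_loss`, small temporal misalignments that preserve band power are not penalized by the second term.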
This spectral regularizer ensures that the model preserves the biological power distribution across frequency bands, even when temporal alignments are subtly shifted. To facilitate stable training and enforce structural consistency across the hierarchical depth of the decoder, we utilize multi-level supervision. The total loss L_total combines the primary dual-path reconstruction losses with intermediate auxiliary losses:

L_{\mathrm{total}} = \sum_{i \in \{1,2\}} L(\tilde{x}_i, x) + \sum_{j \in \{1,2\}} \alpha_j \, L(\text{interm-}x_j, x_{\downarrow}),   (6)

where x̃1 and x̃2 are the final outputs from Up3, interm-x1 is the intermediate output from Up1, and interm-x2 is the intermediate output from Up2. For auxiliary supervision, the ground truth x is dynamically downsampled via linear interpolation (x↓) to match the temporal resolution of the corresponding intermediate decoder stage. The auxiliary weights are set to α_1 = 0.2 and α_2 = 0.3 to prioritize the final high-resolution reconstruction while providing sufficient gradient guidance to the lower hierarchical layers.

Fine-Tuning Design: The Spectral Mixture of Experts. Once the SpecHi-Net backbones are pretrained, they are transitioned into a Spectral Mixture of Experts (SpecMoE) framework for downstream fine-tuning. This architecture enables the model to dynamically weight representations from multiple foundation models based on the spectral characteristics of the input signal.

Fine-Tuning Pipeline. The fine-tuning pipeline transforms a raw EEG segment x ∈ R^{C×L} into a task-specific prediction through four primary stages. (1) Parallel expert encoding: the input x is processed by three pretrained SpecHi-Net encoders (referred to as experts E_1, E_2, and E_3) in parallel. Each expert produces a high-level embedding O_i ∈ R^{C×D}, where C is the channel dimension and D the embedding dimension.
(2) Spectral gating: simultaneously, the Power Spectral Density (PSD) of the raw input x is computed to generate a gating signal. This signal determines the relevance of each expert's features based on the input's rhythmic content. (3) Gated fusion: the expert embeddings are modulated by their respective spectral gates and concatenated to form a unified representation out ∈ R^{C×3D}. (4) Hierarchical pooling and prediction: the fused features undergo a pooling process on the spatial dimension before being passed to an MLP-based predictor for the final classification or regression task.

Pretrained Experts. We employ three distinct SpecHi-Net instances as our experts. Each expert develops complementary feature representations due to training on different data partitions with independent random masking seeds. During the initial phase of fine-tuning, these SpecHi-Net encoders can be either "frozen" to preserve the learned foundation features or "unfrozen" for task-specific adaptation. Each expert specializes in different aspects of the signal due to the stochastic nature of the pretraining masking process, providing the ensemble with a diverse set of perspectives on the neural data.

Spectral Gating. The core innovation of the fine-tuning framework is the spectral gating mechanism. Unlike traditional MoE models that use a learned router on latent features, SpecMoE utilizes the frequency information of the signal. We compute the Power Spectral Density PSD_x ∈ R^{C×K} (where K is the frequency dimension) using a differentiable Welch's method. This provides a clear fingerprint of the signal's frequency distribution. For each expert E_i, a gating tensor G_i is generated through a linear transformation of PSD_x followed by a sigmoid activation:

G_i = \sigma(W_i \cdot \mathrm{PSD}_x + b_i).   (7)
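A compact sketch of the gating path in Eq. (7), assuming PyTorch and approximating the differentiable Welch estimate by averaging Hann-windowed periodograms over STFT frames; all names and dimensions here are illustrative rather than taken from the released code:

```python
import torch
import torch.nn as nn

class SpectralGate(nn.Module):
    """Eq. 7 sketch: gate each expert embedding with a sigmoid-transformed
    linear map of the input's power spectral density."""
    def __init__(self, n_freq_bins, emb_dim, n_experts=3):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Linear(n_freq_bins, emb_dim) for _ in range(n_experts)])

    @staticmethod
    def welch_psd(x, nperseg=128):
        # Differentiable Welch-style estimate: average |STFT|^2 over frames.
        win = torch.hann_window(nperseg, device=x.device)
        z = torch.stft(x.flatten(0, 1), nperseg, window=win,
                       return_complex=True)
        psd = (z.abs() ** 2).mean(dim=-1)            # (B*C, K)
        return psd.reshape(*x.shape[:2], -1)         # (B, C, K)

    def forward(self, x, expert_outs):
        psd = self.welch_psd(x)                      # (B, C, K)
        gated = [torch.sigmoid(p(psd)) * o           # G_i applied to O_i
                 for p, o in zip(self.proj, expert_outs)]
        return torch.cat(gated, dim=-1)              # (B, C, 3D)

x = torch.randn(4, 19, 1200)                         # raw EEG (B, C, L)
outs = [torch.randn(4, 19, 64) for _ in range(3)]    # expert embeddings O_i
gate = SpectralGate(n_freq_bins=65, emb_dim=64)      # 65 bins for nperseg=128
fused = gate(x, outs)                                # (4, 19, 192)
```

Note that the sigmoid produces independent per-dimension gates rather than a softmax router, so several experts can contribute simultaneously when the spectral fingerprint matches more than one specialization.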
These gates act as specific masks over the expert embeddings, amplifying or suppressing features based on whether the input signal's spectral profile matches the expert's learned "specialization".

Predictor and Loss Functions. Following gated fusion, the representation is regularized through a spatial pooling layer across the channel dimension C. This yields a compact vector h ∈ R^{3D} representing the global state of the EEG segment. The final predictor is a Multi-Layer Perceptron (MLP) with layer normalization and GELU activations. The framework is optimized based on the task: (1) classification, optimized through a weighted cross-entropy loss to account for class imbalances when necessary; (2) regression, optimized through Root Mean Squared Error (RMSE).

4 Experimental Design

Pretraining Dataset. We pretrained SpecMoE on three subsets of the Temple University Hospital EEG dataset (TUEG) [29], comprising approximately 27,000 hours of clinical recordings from 14,987 subjects. To ensure data quality and consistency, we followed established preprocessing protocols for EEG foundation models [11, 6, 12], resulting in a curated set of 1,109,545 30-second samples (≈ 9,000 hours). We partitioned this corpus into three sequential subsets based on subject identifiers to prevent data leakage and facilitate parallel expert training. Detailed preprocessing steps, including channel selection, filtering criteria, and bad-sample removal thresholds, are provided in the Supplementary Information (SI).

Pretraining Implementation. We trained each SpecHi-Net expert independently on an NVIDIA Tesla V100 4-GPU cluster. We utilized the AdamW optimizer with a cosine annealing learning rate scheduler to handle the large-scale self-supervised objective. Hyperparameter configurations, including learning rates, batch sizes, and hardware specifications, are detailed in the SI.
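This optimizer setup maps directly onto standard PyTorch components; in the following minimal sketch, the learning rate, weight decay, and schedule horizon are placeholders (the actual hyperparameters are in the SI), and a toy module stands in for a SpecHi-Net expert:

```python
import torch

model = torch.nn.Linear(128, 128)    # placeholder for one SpecHi-Net expert
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()                 # cosine-anneal the learning rate each epoch
```

After T_max steps, the cosine schedule has decayed the learning rate to its minimum (zero by default), which is the usual endpoint for a fixed-budget pretraining run without warm restarts.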
Fine-Tuning Setup. To comprehensively evaluate the generalizability of SpecMoE, we selected nine heterogeneous datasets representing distinct BCI paradigms and clinical diagnostics. These benchmarks cover a broad spectrum of neural activity, ranging from cognitive states and motor intent to drug-induced effects. A summary of the downstream tasks and their corresponding dataset characteristics is provided in Table 1.

Table 1. Summary of downstream EEG datasets and task specifications. Spec. is the species (H: Human, M: Murine), S.Freq is the sampling frequency, #Ch. is the number of channels, Len is the sample length, and #Subj. is the number of subjects.

| Task | Dataset | Spec. | S.Freq | #Ch. | Len | #Samples | #Subj. | Label |
|---|---|---|---|---|---|---|---|---|
| Motor Imagery | PhysioNet-MI | H | 160 Hz | 64 | 4 s | 9,837 | 109 | 4-class |
| Emotion Recognition | SEED-V | H | 1000 Hz | 62 | 1 s | 117,744 | 16 | 5-class |
| Sleep Staging | HMC | H | 256 Hz | 4 | 30 s | 137,243 | 151 | 4-class |
| Therapeutic Area | MACO | M | 1024 Hz | 2 | 60 s | 61,900 | 336 | 5-class |
| Drug Effect | DA-Pharmaco | M | 1000 Hz | 5 | 60 s | 2,800 | 10 | 5-class |
| Imagined Speech | BCIC2020-3 | H | 256 Hz | 64 | 3 s | 6,000 | 15 | 5-class |
| Abnormal Detection | TUAB | H | 256 Hz | 16 | 10 s | 409,455 | 2,383 | 2-class |
| Seizure Detection | Siena | H | 512 Hz | 29 | 10 s | 51,307 | 14 | 2-class |
| Vigilance Estimation | SEED-VIG | H | 200 Hz | 17 | 8 s | 20,355 | 21 | continuous |

To maintain consistency with the pretraining phase, we resampled all EEG signals to 200 Hz. While we maintain a standardized backbone, specific artifact removal and segmentation procedures were tailored to the requirements of each task; detailed descriptions of these dataset-specific preprocessing pipelines are provided in the SI.

We benchmark SpecMoE against two categories of models: task-specific architectures and state-of-the-art foundation models. Among task-specific architectures, we selected three widely recognized baselines: EEGNet [7], EEG-Conformer [9], and FFCL [10].
Regarding EEG foundation architectures, we selected three leading models: LaBraM [11], CBraMod [6], and CSBrain [12]. Detailed architectural differences are further discussed in the SI.

We quantified performance on the classification tasks using balanced accuracy, weighted F1-score, weighted AUROC, and weighted AUPRC. For the vigilance estimation regression task, we report the coefficient of determination (R²), root mean square error (RMSE), and the Pearson correlation coefficient (r).

5 Results

Table 2 summarizes the performance of SpecMoE compared to the baselines. For the sake of readability, we only report the primary metric for each task here; detailed results can be found in the SI. SpecMoE achieves the best performance in 7 out of 9 downstream tasks, demonstrating robustness across diverse recording conditions, including transfer to murine EEG data not seen during pretraining. Our model exhibits its most significant gains in tasks characterized by complex spectral signatures and non-stationary dynamics. Notably, on the MACO and SIENA datasets, our model maintains a substantial performance lead over the second-best foundation models (CBraMod and CSBrain) by 7.2% and 9.9%, respectively. In the MACO dataset, the challenge lies in distinguishing drug-induced modifications of murine brain activity across five classes (compounds from four therapeutic areas and solvents). In contrast, the SIENA dataset requires the identification of ictal (seizure) events against a background of interictal human EEG. The consistent efficacy of SpecMoE in these two distinct benchmarks, one pharmacological and one pathological, indicates that our architectural framework is uniquely capable of prioritizing the specific expert features relevant to either the spectral signatures of pharmaceutical interventions or the high-energy discharges characteristic of seizures.
Table 2. Performance comparison on diverse EEG datasets. Balanced accuracy (↑) is reported for classification tasks, and RMSE (↓) is reported for SEED-VIG. Each value corresponds to the average across 5 random seeds, along with the standard deviation. Bold indicates the best performance; underline indicates the second best. EEGNet, EEGConf, and FFCL are task-specific models; LaBraM, CBraMod, CSBrain, and SpecMoE are foundation models.

Dataset        EEGNet           EEGConf          FFCL             LaBraM           CBraMod          CSBrain          SpecMoE
PhysioNet-MI   0.5814 ± 0.0125  0.6049 ± 0.0104  0.5726 ± 0.0092  0.6173 ± 0.0122  0.6174 ± 0.0036  0.6304 ± 0.0090  0.6444 ± 0.0109
SEED-V         0.2961 ± 0.0102  0.3537 ± 0.0112  0.3641 ± 0.0092  0.3976 ± 0.0138  0.4091 ± 0.0097  0.4197 ± 0.0033  0.4033 ± 0.0084
HMC            0.6534 ± 0.0122  0.7149 ± 0.0086  0.4427 ± 0.0702  0.7277 ± 0.0101  0.7269 ± 0.0041  0.7345 ± 0.0047  0.7479 ± 0.0086
MACO           0.6659 ± 0.0166  0.6119 ± 0.0217  0.4325 ± 0.0105  0.6250 ± 0.0272  0.7645 ± 0.0118  0.6926 ± 0.0117  0.8527 ± 0.0033
DA-Pharmaco    0.4816 ± 0.0159  0.5597 ± 0.0345  0.3074 ± 0.0336  0.4601 ± 0.0487  0.5344 ± 0.0210  0.4526 ± 0.0142  0.6229 ± 0.0231
BCIC2020-3     0.4413 ± 0.0096  0.4506 ± 0.0133  0.4678 ± 0.0197  0.5060 ± 0.0155  0.5373 ± 0.0108  0.6004 ± 0.0187  0.6262 ± 0.0166
TUAB           0.7642 ± 0.0036  0.7758 ± 0.0049  0.7848 ± 0.0038  0.8140 ± 0.0019  0.7891 ± 0.0030  0.8172 ± 0.0043  0.7742 ± 0.0069
SIENA          0.7487 ± 0.0521  0.7556 ± 0.0210  0.6616 ± 0.0391  0.7082 ± 0.0329  0.7317 ± 0.0647  0.7662 ± 0.0471  0.8655 ± 0.0038
SEED-VIG (↓)   0.2847 ± 0.0076  0.2829 ± 0.0041  0.2885 ± 0.0093  0.2871 ± 0.0166  0.3057 ± 0.0027  0.2774 ± 0.0094  0.1522 ± 0.0029

This competitive advantage is further highlighted in the DA-Pharmaco task, where SpecMoE achieves a balanced accuracy of 0.6229, surpassing the second-best model, EEGConformer, by 6%. This task requires the classification of five distinct dopaminergic compounds from multi-site murine EEG recordings. The success of SpecMoE here indicates that our architecture effectively decodes the complex neuro-spectral signatures and cross-regional dependencies induced by different dopaminergic modulators. Furthermore, in the vigilance estimation regression task (SEED-VIG), SpecMoE nearly halves the error rate of competing models, achieving an RMSE of 0.1522. This improvement underscores the utility of the SpecMoE architecture in tracking the subtle oscillatory shifts that characterize transitions between alertness and drowsiness.

A critical advantage of SpecMoE over recent transformer-based foundation models lies in its architectural flexibility. Many state-of-the-art models, such as CBraMod and CSBrain, rely on position-specific or channel-dependent embeddings that scale linearly with input dimensions. For datasets like MACO and DA-Pharmaco, which feature long sequence lengths (60 s) or non-standard channel configurations, the parameter counts of these models escalate significantly, often exceeding 20 million parameters. More details on the parameter counts can be found in the SI. In contrast, SpecMoE utilizes RoPE and a hierarchical dual-path structure that are inherently invariant to sequence length and channel count. This allows SpecMoE to maintain a compact, standardized parameter footprint while delivering better accuracy, making it a more practical solution for real-world deployment across heterogeneous clinical environments and varying sensor montages.
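The length-invariance argument rests on how RoPE [26] encodes position: each consecutive query/key feature pair is rotated by an angle proportional to the token's position, so attention scores depend only on relative offsets and no learned, length-specific embedding table is required. A minimal NumPy sketch of the standard RoPE rotation (our illustration, not the authors' implementation):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional embedding: rotate consecutive feature pairs of each
    token by a position-dependent angle. The same function applies to any
    sequence length, which is what makes the encoding length-invariant."""
    n, d = x.shape                              # (sequence length, even feature dim)
    pos = np.arange(n)[:, None]                 # token positions 0..n-1
    freq = base ** (-np.arange(0, d, 2) / d)    # one rotation frequency per pair
    ang = pos * freq                            # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin          # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = rope(np.random.randn(31, 128))              # e.g. a 31-frame token sequence
print(q.shape)                                  # (31, 128)
```

Because the rotation is orthogonal, per-token norms are preserved, and the same `rope` call works unchanged on a 31-frame or a 310-frame input.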
Ablation Studies. To investigate the contribution of each component within the SpecMoE framework, we conducted a series of 6 ablation experiments across three representative datasets: BCIC2020-3 (imagined speech), DA-Pharmaco (pharmacology), and PhysioNet-MI (motor imagery). All ablation studies were performed using 3 random seeds. Figure 3 illustrates the results.

A fundamental premise of our pretraining is that neural oscillations are best captured through a multi-dimensional corruption process. Our ablation results confirm this assumption. Indeed, all single-type masking strategies consistently underperform compared to the mixed-geometry approach. Training foundation models exclusively with time-only, frequency-only, or time-frequency-only masks results in a noticeable performance degradation across all benchmarks (see SI for the specific configuration of these single-type ablation experiments). For example, in the pharmacological task (Fig. 3, middle), the balanced accuracy drops from 0.6229 to 0.5794 (Time-only), 0.6137 (Freq-only), and 0.5777 (TF-only), indicating that a joint spectral-temporal objective is required to internalize the full complexity of EEG dynamics. Furthermore, replacing our Gaussian-smoothed masks with traditional rectangular masks (Non-Gaussian experiment) leads to a significant drop in accuracy. For example, the balanced accuracy drops by more than 20% in the imagined speech task (Fig. 3, left).

Fig. 3. SpecMoE ablation results. We show the absolute value of the balanced accuracy for the SpecMoE model and relative differences for six ablation experiments. TF stands for time-frequency. See SI for other metrics. [Bar chart; values shown: BCIC2020-3: SpecMoE 62.62%, Time-only −6.02%, Freq-only −0.68%, TF-only −1.34%, Non-Gaussian −21.99%, Full mask+CBraMod −38.14%, Gating without PSD −9.17%. DA-Pharmaco: SpecMoE 62.29%, Time-only −4.35%, Freq-only −0.92%, TF-only −4.52%, Non-Gaussian −6.43%, Full mask+CBraMod −9.73%, Gating without PSD −2.55%. PhysioNet-MI: SpecMoE 64.44%, Time-only −2.03%, Freq-only −1.56%, TF-only −2.51%, Non-Gaussian −4.60%, Full mask+CBraMod −23.33%, Gating without PSD −2.04%.]

A critical finding of our study is the performance gap observed when replacing our SpecHi-Net backbone with the CBraMod architecture within our MoE framework (Full mask+CBraMod). We specifically selected CBraMod as a baseline because its backbone represents the current standard for high-performance, non-hierarchical EEG Transformers, an architecture it shares with CSBrain. By selecting CBraMod, we aimed to evaluate whether a competitive, flat transformer architecture could achieve performance parity with our model without the added complexity of the region-specific channel encoding layers found in CSBrain. Our goal was to isolate the impact of the hierarchical feature extraction process itself. On the BCIC2020-3 dataset, this swap results in a catastrophic performance collapse from 0.6262 to 0.2448 (Fig. 3, left).

Finally, we assess the benefit of using the power spectral density (PSD) as a gating signal compared to a standard learned input encoding network (Gating without PSD). The detailed architecture and training parameters for this alternative learned gating network are provided in the SI. Our results show that the raw-data-driven encoding network consistently underperforms compared to our spectral gating strategy. The difference is particularly striking on the imagined speech task (Fig. 3, left).
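For intuition, the contrast probed by the Non-Gaussian ablation can be illustrated on an STFT map of the pretraining shape (F = 201 frequency bins, T = 31 time frames, σ at 5% of the frequency range, as listed in Table S1). This is our own illustrative sketch of a frequency-geometry mask, not the authors' masking code:

```python
import numpy as np

F, T = 201, 31                       # STFT map shape used during pretraining

def freq_masks(center, sigma=0.05 * F):
    """One frequency-geometry mask around a center bin, in two flavors:
    Gaussian-smoothed (soft edges) vs. rectangular (hard 0/1 edges)."""
    f = np.arange(F)[:, None]
    gauss = np.exp(-0.5 * ((f - center) / sigma) ** 2) * np.ones((1, T))
    rect = ((np.abs(f - center) <= 2 * sigma) * np.ones((1, T))).astype(float)
    return gauss, rect

gauss, rect = freq_masks(center=40)
spec = np.random.randn(F, T)
masked = spec * (1.0 - gauss)        # smoothly attenuate the masked band
print(gauss.shape)                   # (201, 31)
```

Unlike `rect`, the Gaussian mask has no sharp spectral boundary, so the reconstruction objective cannot exploit the artificial transients that hard mask edges introduce.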
6 Conclusion

In this work, we introduced SpecMoE, a spectrally-anchored foundation model that utilizes a novel Gaussian-smoothed masking strategy to mitigate the "artificial transient" bias common in rectangular masking. By shifting the pretraining objective toward the reconstruction of smooth, endogenous neural rhythms, our framework develops a more physiologically grounded representation of EEG dynamics. Across nine heterogeneous datasets, SpecMoE achieves competitive performance, establishing a new baseline for several benchmarks. SpecMoE's success on both human and murine recordings highlights that the representations learned from human EEG transfer effectively to murine data after fine-tuning, suggesting a shared spectral-temporal structure across species. Ablation studies indicate that the hierarchical SpecHi-Net architecture is instrumental for high-fidelity signal recovery in complex tasks, while the spectrally-guided mixture-of-experts framework enables structured feature integration based on the rhythmic content of the input. Together, these components provide a robust inductive bias for the next generation of universal EEG foundation models.

Limitations. Despite state-of-the-art results, our work has several limitations that provide avenues for future research. First, the requirement for full-parameter fine-tuning to achieve optimal performance on downstream tasks indicates that, while SpecMoE learns robust representations, a "zero-shot" universal EEG foundation model remains an open challenge. Second, while our mixed Gaussian masking strategy effectively mitigates boundary artifacts, we have not yet explored the impact of varying masking ratios; the current 50% ratio, selected following previous studies on EEG foundation models, may be suboptimal. Finally, the specialization of the experts within the MoE framework results from a stochastic data partitioning.
Future work could investigate whether pretraining different experts on specific datasets from physiological domains such as sleep, pathology, or cognitive motor tasks would further enhance the efficiency of the spectral gating mechanism and therefore lead to better task predictions. The current choice of three experts reflects a practical trade-off: the TUEG corpus naturally partitions into three subsets of sufficient size (∼300k–400k samples each) to train robust individual models, while keeping the total computational cost tractable.

Ethical Considerations. This work utilizes de-identified human and murine EEG data; all experimental protocols were approved by their respective Institutional Review Boards (IRB) or equivalent ethical committees. While SpecMoE is designed to advance clinical diagnostics and drug discovery, we acknowledge the inherent dual-use risks associated with high-capacity neural decoders, such as unauthorized cognitive surveillance. We strongly oppose any application of this technology that infringes upon cognitive liberty or is conducted without explicit informed consent. Furthermore, we emphasize that all automated assessments must undergo professional validation to mitigate risks from algorithmic bias and ensure patient safety across diverse populations.

Acknowledgment. This work was funded by SynapCell SAS through the Cortex project, awarded at the 9th edition of the i-Nov competition organized for French companies. This work was also granted access to the HPC resources of IDRIS under allocation 2025-AD011016062R1 made by GENCI, and to the HPC resources of the GRICAD infrastructure, which is supported by Grenoble research communities.

References

1. Gayal Kuruppu, Neeraj Wagh, and Yogatheesan Varatharajah. EEG foundation models: A critical review of current progress and future directions. In NeurIPS 2025 Workshop on Foundation Models for the Brain and Body.
2.
Fanqi Shen, Enhong Yang, Jiahe Li, Junru Hong, Xiaoran Pan, Zhizhang Yuan, Meng Li, and Yang Yang. Brain4FMs: A benchmark of foundation models for electrical brain signal. arXiv preprint arXiv:2602.11558, 2026.
3. Dingkun Liu, Yuheng Chen, Zhu Chen, Zhenyao Cui, Yaozhi Wen, Jiayu An, Jingwei Luo, and Dongrui Wu. EEG foundation models: Progresses, benchmarking, and open problems. arXiv preprint arXiv:2601.17883, 2026.
4. Wei Xiong, Jiangtong Li, Jie Li, Kun Zhu, and Changjun Jiang. EEG-FM-Bench: A comprehensive benchmark for the systematic evaluation of EEG foundation models. arXiv preprint, 2025.
5. Guangyu Wang, Wenchao Liu, Yuhong He, Cong Xu, Lin Ma, and Haifeng Li. EEGPT: Pretrained transformer for universal and reliable representation of EEG signals. Advances in Neural Information Processing Systems, 37:39249–39280, 2024.
6. Jiquan Wang, Sha Zhao, Zhiling Luo, Yangxuan Zhou, Haiteng Jiang, Shijian Li, Tao Li, and Gang Pan. CBraMod: A criss-cross brain foundation model for EEG decoding. In The Thirteenth International Conference on Learning Representations, 2025.
7. Vernon J Lawhern, Amelia J Solon, Nicholas R Waytowich, Stephen M Gordon, Chou P Hung, and Brent J Lance. EEGNet: A compact convolutional neural network for EEG-based brain–computer interfaces. J. Neural Eng., 15:056013, 2018.
8. Yihe Wang, Zhiqiao Kang, Bohan Chen, Yu Zhang, and Xiang Zhang. Benchmarking ERP analysis: Manual features, deep learning, and foundation models. arXiv preprint arXiv:2601.00573, 2026.
9. N Kasthuri, R Ramyea, VS Arunprasshath, S Abhineeth, and S Bharathraj. EEG Conformer model based epileptic seizure prediction using deep learning. In 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), pages 1–7. IEEE, 2024.
10. Hongli Li, Man Ding, Ronghua Zhang, and Chunbo Xiu. Motor imagery EEG classification algorithm based on CNN-LSTM feature fusion network.
Biomedical Signal Processing and Control, 72:103342, 2022.
11. Weibang Jiang, Liming Zhao, and Bao-liang Lu. Large brain model for learning generic representations with tremendous EEG data in BCI. In The Twelfth International Conference on Learning Representations, 2024.
12. Yuchen Zhou, Jiamin Wu, Zichen Ren, Zhouheng Yao, Weiheng Lu, Kunyu Peng, Qihao Zheng, Chunfeng Song, Wanli Ouyang, and Chao Gou. CSBrain: A cross-scale spatiotemporal brain foundation model for EEG decoding. In The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.
13. Chenyu Liu, Yuqiu Deng, Tianyu Liu, Jinan Zhou, Xinliang Zhou, Ziyu Jia, and Yi Ding. ECHO: Toward contextual seq2seq paradigms in large EEG models. arXiv preprint arXiv:2509.22556, 2025.
14. Yihe Wang, Nan Huang, Nadia Mammone, Marco Cecchi, and Xiang Zhang. LEAD: An EEG foundation model for Alzheimer's disease detection. arXiv preprint arXiv:2502.01678, 2025.
15. Yassine El Ouahidi, Jonathan Lys, Philipp Thölke, Nicolas Farrugia, Bastien Pasdeloup, Vincent Gripon, Karim Jerbi, and Giulia Lioi. REVE: A foundation model for EEG-adapting to any setup with large-scale pretraining on 25,000 subjects. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
16. Jiquan Wang, Sha Zhao, Yangxuan Zhou, Yiming Kang, Shijian Li, and Gang Pan. DeeperBrain: A neuro-grounded EEG foundation model towards universal BCI. arXiv preprint, 2026.
17. Zhisheng Chen, Yingwei Zhang, Qizhen Lan, Tianyu Liu, Huacan Wang, Yi Ding, Ziyu Jia, Ronghao Chen, Kun Wang, and Xinliang Zhou. Uni-NTFM: A unified foundation model for EEG signal representation learning. arXiv preprint arXiv:2509.24222, 2025.
18. Christopher Wang, Vighnesh Subramaniam, Adam Uri Yaari, Gabriel Kreiman, Boris Katz, Ignacio Cases, and Andrei Barbu. BrainBERT: Self-supervised representation learning for intracranial recordings.
In The Eleventh International Conference on Learning Representations, 2023.
19. Zhizhang Yuan, Fanqi Shen, Meng Li, Yuguo Yu, Chenhao Tan, and Yang Yang. BrainWave: A brain signal foundation model for clinical applications. arXiv preprint arXiv:2402.10251, 2024.
20. Kleanthis Avramidis, Tiantian Feng, Woojae Jeong, Jihwan Lee, Wenhui Cui, Richard M Leahy, and Shrikanth Narayanan. Neural codecs as biosignal tokenizers. arXiv preprint arXiv:2510.09095, 2025.
21. Zhijiang Wan, Manyu Li, Shichang Liu, Jiajin Huang, Hai Tan, and Wenfeng Duan. EEGformer: A transformer-based brain activity classification method using EEG signal. Frontiers in Neuroscience, 17:1148855, 2023.
22. Wei Xiong, Junming Lin, Jiangtong Li, Jie Li, and Changjun Jiang. ALFEE: Adaptive large foundation model for EEG representation. arXiv preprint arXiv:2505.06291, 2025.
23. Jathurshan Pradeepkumar, Xihao Piao, Zheng Chen, and Jimeng Sun. Tokenizing single-channel EEG with time-frequency motif learning. arXiv preprint arXiv:2502.16060, 2025.
24. Zhi Zhang, Yan Liu, Zhejing Hu, Gong Chen, Jiannong Cao, Shenghua Zhong, Sean Fontaine, Changhong Jing, and Shuqiang Wang. Mutual-guided expert collaboration for cross-subject EEG classification. arXiv preprint arXiv:2602.01728, 2026.
25. Fan Ma, Mingyang Jiang, Lingfei Qian, Zhiling Gu, and Hua Xu. BrainMoE: Towards universal EEG foundation models with channel-wise mixture-of-experts.
26. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
27. Davy Darankoum, Chloé Habermacher, Julien Volle, and Sergei Grudinin. CoSupFormer: A contrastive supervised learning approach for EEG signal classification. arXiv preprint arXiv:2509.20489, 2025.
28. Emadeldeen Eldele, Zhenghua Chen, Chengyu Liu, Min Wu, Chee-Keong Kwoh, Xiaoli Li, and Cuntai Guan.
An attention-based deep learning approach for sleep stage classification with single-channel EEG. IEEE Trans. Neural Syst. Rehabil. Eng., 29:809–818, 2021.
29. Iyad Obeid and Joseph Picone. The Temple University Hospital EEG data corpus. Frontiers in Neuroscience, 10:196, 2016.
30. Gerwin Schalk, Dennis J McFarland, Thilo Hinterberger, Niels Birbaumer, and Jonathan R Wolpaw. BCI2000: A general-purpose brain-computer interface (BCI) system. IEEE Transactions on Biomedical Engineering, 51(6):1034–1043, 2004.
31. Wei Liu, Jie-Lin Qiu, Wei-Long Zheng, and Bao-Liang Lu. Comparing recognition performance and robustness of multimodal deep learning models for multimodal emotion recognition. IEEE Transactions on Cognitive and Developmental Systems, 2021.
32. Diego Alvarez-Estevez and Roselyne Rijsman. Haaglanden Medisch Centrum sleep staging database. PhysioNet, March 2022. Version 1.1.
33. Sampath KT Kapanaiah, Holger Rosenbrock, Bastian Hengerer, and Dennis Kätzel. Neural effects of dopaminergic compounds revealed by multi-site electrophysiology and interpretable machine-learning. Frontiers in Pharmacology, 15:1412725, 2024.
34. Ji-Hoon Jeong, Jeong-Hyun Cho, Young-Eun Lee, Seo-Hyun Lee, Gi-Hwan Shin, Young-Seok Kweon, José del R Millán, Klaus-Robert Müller, and Seong-Whan Lee. 2020 international brain–computer interface competition: A review. Frontiers in Human Neuroscience, 16:898300, 2022.
35. Paolo Detti. Siena Scalp EEG Database. PhysioNet, August 2020. Version 1.0.0.
36. Wei-Long Zheng and Bao-Liang Lu. A multimodal approach to estimating vigilance using EEG and forehead EOG. Journal of Neural Engineering, 14(2):026017, 2017.

Supplementary Information

A Pretraining Setup

A.1 Pretraining Dataset and Preprocessing

Dataset Description. We pretrained SpecMoE on 3 subsets of the large Temple University Hospital EEG dataset [29] (TUEG).
The TUEG dataset comprises a diverse archive of 69,652 clinical EEG recordings from 14,987 subjects across 26,846 sessions, totaling 27,062 hours in duration. The archive features over 40 different channel configurations and recordings of varying durations. Most of the recordings are sampled at 256 Hz. Unfortunately, the TUEG dataset suffers from significant data contamination, including a substantial amount of unmarked noise, artifacts, and faulty channels. To homogenize the pretraining process and reduce the dataset inconsistencies, we preprocessed it following previous studies on EEG foundation models, including LaBraM [11], CBraMod [6], and CSBrain [12].

Preprocessing. To begin with, we remove recordings that have a total duration of less than 5 minutes. Next, we discard the first and last minute of each recording to eliminate as much low-quality data as possible. We then select 19 common EEG channels (Fp1, Fp2, F7, F3, Fz, F4, F8, T3, C3, Cz, C4, T4, T5, P3, Pz, P4, T6, O1, O2) that comply with a subset of the 10-20 international electrode placement system standards. This ensures that we obtain clean and uniformly formatted pretraining data.

Afterward, we apply a band-pass filter (0.3 Hz–75 Hz) to eliminate low-frequency and high-frequency noise. Additionally, a notch filter (60 Hz) is utilized to remove power line noise. All EEG signals are resampled to 200 Hz and segmented into non-overlapping 30-second EEG samples. However, the preprocessing steps mentioned above may not completely resolve the quality issues present in the EEG data. To further enhance the quality, we implement an automated scheme for the removal of bad EEG samples. Specifically, we identify samples as bad if any data point exceeds an absolute amplitude of 100 µV, and we remove these from the dataset.
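The filtering, resampling, segmentation, and amplitude-rejection steps above, followed by the STFT configuration used for pretraining (hann window, n_fft = 400, hop 200; see Table S1), can be sketched with SciPy as follows. This is our own minimal single-channel illustration under those settings, not the authors' released code:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, iirnotch, filtfilt, resample_poly, stft

def preprocess(sig, fs):
    """Band-pass 0.3-75 Hz, 60 Hz notch, resample to 200 Hz, and cut into
    non-overlapping 30-s segments, dropping any segment exceeding 100 uV."""
    sos = butter(4, [0.3, 75.0], btype="bandpass", fs=fs, output="sos")
    sig = sosfiltfilt(sos, sig)                    # zero-phase band-pass
    bn, an = iirnotch(60.0, Q=30.0, fs=fs)
    sig = filtfilt(bn, an, sig)                    # power-line notch
    sig = resample_poly(sig, 200, fs)              # assumes an integer source rate
    n = 200 * 30                                   # 6000 points per segment
    segs = [sig[i:i + n] for i in range(0, len(sig) - n + 1, n)]
    return [s for s in segs if np.max(np.abs(s)) <= 100.0]

fs = 256
raw = np.random.randn(fs * 95) * 10.0              # ~95 s of synthetic signal, in uV
segs = preprocess(raw, fs)
f, t, Z = stft(segs[0], fs=200, window="hann", nperseg=400, noverlap=200)
print(len(segs), Z.shape)                          # 3 (201, 31)
```

With SciPy's default zero-padded boundary handling, a 6000-point segment yields exactly the (F = 201, T = 31) STFT map reported in Table S1.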
We also normalize the EEG signals by scaling the units to 100 µV, ensuring that the values predominantly range between -1 and 1, in line with previous work on EEG foundation models (LaBraM [11], CBraMod [6], CSBrain [12]). After removing bad samples, we are left with 1,109,545 30-second segments, totaling 9,000 hours of data.

Finally, we divide the preprocessed samples into three subsets. Subset-1 contains samples 1 to 400,000, subset-2 consists of samples 400,001 to 800,000, and subset-3 includes the remaining samples from 800,001 to 1,109,545. This sequential splitting strategy is based on the hierarchical storage structure of the corpus, where patients are organized into folders according to randomized database identifiers. This ensures that each subset represents a diverse and unbiased cross-section of the 14,987 unique subjects. Furthermore, considering the large number of samples per subset and the relatively low average number of sessions per patient (1.79), this approach effectively minimizes the risk of data leakage between the three training partitions.

Computational Details and Hyperparameters. SpecMoE was implemented using the PyTorch framework (v2.6.0) and trained in a high-performance computing environment. To develop the three specialized experts within our foundation ensemble, each model was trained independently on one of the three TUEG subsets described in Section 4.1.2. The hardware configuration consisted of 4 NVIDIA Tesla V100-SXM2-32GB GPUs operating in parallel. Each pretraining session spanned approximately 200 hours to complete 50 epochs. We utilized the AdamW optimizer with an initial learning rate of 1 × 10⁻³ and a weight decay coefficient of 5 × 10⁻². To ensure stable convergence, we employed a CosineAnnealingLR scheduler.
The effective batch size was set to 256 samples, achieved through a distributed strategy of 64 samples across the 4 GPUs with 4 gradient accumulation steps. Signal processing and data management were supported by NumPy (v2.3.3) and SciPy (v1.16.2) within a Python 3.12.8 environment. All computations were executed using CUDA 12.4 to leverage hardware acceleration. Table S1 lists the hyperparameters used for the pretraining phase.

Table S1. Hyperparameters for SpecMoE pretraining.

Input EEG Data
  Channels (C): 19
  Sampling Frequency: 200 Hz
  Sample Duration (L): 30 s (6000 points)
  Amplitude Normalization Unit: 100 µV
Signal Transformation (STFT)
  Window Type: hann_periodic
  Window Length (n_fft): 400
  Hop Length (n_overlap): 200
  Frequency Bins (F): 201
  Time Frames (T): 31
Gaussian Masking
  Total Mask Ratio (ρ): 0.5
  Geometry Probabilities (P): [Freq: 0.6, Time: 0.3, TF: 0.1]
  Kernel Parameters (σ_f, σ_t): 5% of respective range
  Spectral-Band Bias: 50% on δ, θ, α, β bands
SpecHi-Net Backbone
  Hierarchy Stages: 3 downsampling / 3 upsampling
  Dual-Path Kernels (k): small: 4, large: 65
  Transformer Layers: 1
  Attention Heads: 8
  Hidden Dimension (D): 128
  Position Encoding: Rotary Positional Embedding (RoPE)
Optimization
  Loss Function: multi-objective (MSE + spectral loss)
  Spectral Loss Weight (w_spec): 0.02
  Auxiliary Loss Weights (α₁, α₂): 0.2, 0.3
  Optimizer: AdamW
  Learning Rate: 1 × 10⁻³
  Weight Decay: 5 × 10⁻²
  Scheduler: CosineAnnealingLR
  Batch Size: 256 (64 per GPU × 4 steps)
  Epochs: 50
  Gradient Clipping: 5.0

B Details on Downstream Tasks

B.1 Finetuning Settings Details

Baselines
– EEGNet: A compact, specialized CNN for EEG that utilizes depthwise and separable convolutions to extract robust features with a minimal parameter footprint [7]. Code available at: https://github.com/vlawhern/arl-eegmodels.
– EEGConformer: A hybrid architecture that leverages CNNs for local feature extraction and Transformer blocks to capture long-range temporal dependencies [9]. Code available at: https://github.com/eeyhsong/EEG-Conformer.
– FFCL: A framework integrating parallel CNN and LSTM branches to fuse spatial and temporal dynamics through a shared fully connected layer [10]. We implemented the code ourselves, following the architecture design and the detailed hyperparameter description in the paper.
– LaBraM: A unified foundation model that enables cross-dataset learning through neural channel patching and a vector-quantized neural spectrum tokenizer. It utilizes a Transformer backbone and dual-domain (frequency/phase) mask learning for large-scale pretraining [11]. Code available at: https://github.com/935963004/LaBraM/tree/main.
– CBraMod: A foundation model featuring criss-cross attention to capture spatial and temporal features separately, achieving efficient yet powerful representations [6]. Code available at: https://github.com/wjq-learning/CBraMod/tree/main.
– CSBrain: A model employing cross-scale spatiotemporal tokenization and structured sparse attention to optimize robustness in EEG signal decoding [12]. Code available at: https://github.com/yuchen2199/CSBrain/tree/main.

Metrics

Classification Metrics. For a given class $c \in \{1, \ldots, C\}$, let $TP_c$, $TN_c$, $FP_c$, and $FN_c$ represent the True Positives, True Negatives, False Positives, and False Negatives, respectively. We define the fundamental components for each class as:
$$\mathrm{Precision}_c\,(P_c) = \frac{TP_c}{TP_c + FP_c}, \qquad \mathrm{Recall}_c\,(R_c) = \frac{TP_c}{TP_c + FN_c}. \tag{8}$$
Additionally, for threshold-based curves, we define the True Positive Rate ($\mathrm{TPR} = \mathrm{Recall}$) and the False Positive Rate ($\mathrm{FPR} = \frac{FP}{FP + TN}$).
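The per-class quantities in Eq. (8) follow directly from the confusion counts; a small NumPy sketch (ours, not the evaluation code used in the paper), together with balanced accuracy as the arithmetic mean of per-class recall:

```python
import numpy as np

def precision_recall(y_true, y_pred, c):
    """Eq. (8): per-class precision and recall from TP/FP/FN counts."""
    tp = np.sum((y_pred == c) & (y_true == c))
    fp = np.sum((y_pred == c) & (y_true != c))
    fn = np.sum((y_pred != c) & (y_true == c))
    return tp / (tp + fp), tp / (tp + fn)

def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of per-class recall over the observed classes."""
    return float(np.mean([precision_recall(y_true, y_pred, c)[1]
                          for c in np.unique(y_true)]))

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])
p0, r0 = precision_recall(y_true, y_pred, 0)
print(p0, r0, balanced_accuracy(y_true, y_pred))   # 1.0, 2/3, (2/3 + 1 + 1)/3
```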
Based on these, the following metrics are employed:

– Balanced Accuracy: The arithmetic mean of the class-specific recall scores, which prevents inflated performance estimates on imbalanced datasets:
$$\text{Balanced Acc.} = \frac{1}{C}\sum_{c=1}^{C} R_c. \tag{9}$$
– Weighted F1-Score: The harmonic mean of precision and recall, weighted by the number of samples ($N_c$) in each class relative to the total number of samples ($N_{\text{total}}$):
$$\text{Weighted F1} = \sum_{c=1}^{C}\frac{N_c}{N_{\text{total}}}\,\frac{2\,P_c R_c}{P_c + R_c}. \tag{10}$$
– AUROC: The area under the curve obtained by plotting TPR against FPR at various decision thresholds:
$$\mathrm{AUROC} = \int_0^1 \mathrm{TPR}(f)\,df, \qquad f = \mathrm{FPR}. \tag{11}$$
– AUPRC: The area under the curve obtained by plotting precision against recall, providing a stringent assessment for minority-class detection:
$$\mathrm{AUPRC} = \int_0^1 P(r)\,dr, \qquad r = \mathrm{Recall}. \tag{12}$$

Regression Metrics (Vigilance Estimation). For the SEED-VIG dataset, let $y_i$ be the ground truth, $\hat{y}_i$ the predicted value, and $\bar{y}$ the mean of the ground truth over $n$ samples.

– Pearson Correlation Coefficient ($r$): Measures the linear relationship between the predicted and true labels:
$$r = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2 \sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}}. \tag{13}$$
– Coefficient of Determination ($R^2$): Represents the proportion of variance explained by the model:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}. \tag{14}$$
– Root Mean Square Error (RMSE): Quantifies the standard deviation of the prediction residuals:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}. \tag{15}$$

Table S2. Hyperparameters for SpecMoE downstream finetuning.

Architecture Setup
  Number of Experts: 3 (pretrained SpecHi-Net encoders)
  Expert Backbone State: unfrozen (full-parameter finetuning)
  Embedding Dimension (D): 128 (per expert)
  Fused Representation Dim: 384 (3 × D)
Spectral Gating (PSD)
  PSD Method: differentiable Welch's method
  Gating Input Dim: 201 (frequency bins F)
  Gating Activation: sigmoid
Downstream Predictor
  Spatial Pooling: average pooling across channels
  MLP Layers: 2 (Linear → GELU → Linear)
  Dropout Rate: 0.2
Optimization
  Loss Function (Classification): weighted cross-entropy
  Loss Function (Regression): root mean square error (RMSE)
  Optimizer: Adam
  Learning Rate: 1 × 10⁻³
  Weight Decay: 0
  Scheduler: ReduceLROnPlateau
  Batch Size: 64
  Max Epochs: 100

Hyperparameters. In Table S2, we specify the default hyperparameter setup used for the finetuning tasks. Details on each dataset can be found in the code: https://github.com/TeraXj78/SpecMoE.

B.2 Motor Imagery Task

The PhysioNet-MI dataset [30] serves as a large-scale benchmark for decoding motor intent, featuring 109 subjects performing four mental imagery tasks: left/right fist, both fists, and both feet. Data from 64 channels were resampled to 200 Hz and windowed into 4-second trials, totaling 9,837 samples. We evaluate the model using a strict subject-independent protocol, allocating subjects 1–70 for training, 71–89 for validation, and 90–109 for testing.

As detailed in Table S3, foundation models demonstrate a clear advantage over task-specific baselines in generalizing across a diverse population. SpecMoE achieves the highest performance with a balanced accuracy of 0.6444 and a weighted F1-score of 0.6476. Notably, SpecMoE surpasses the strongest foundation model baseline, CSBrain, by 1.40% in balanced accuracy and 1.68% in weighted F1.
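The gating path summarized in Table S2 (a Welch PSD over 201 frequency bins feeding a sigmoid gate that scales the three 128-dim expert embeddings into a 384-dim fused representation) can be sketched as follows. The expert encoders below are random stand-ins operating on the PSD rather than the pretrained SpecHi-Net encoders, and the gate weights are untrained; the sketch only illustrates the data flow:

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(0)
C, L, D, fs = 19, 6000, 128, 200      # channels, 30-s sample at 200 Hz, embed dim

# Random stand-ins for the three pretrained expert encoders (PSD -> 128-dim here;
# the real experts are SpecHi-Net encoders operating on the raw sample).
expert_W = [rng.standard_normal((201, D)) * 0.1 for _ in range(3)]
gate_W = rng.standard_normal((201, 3)) * 0.01

def fuse(x):
    """Welch PSD -> sigmoid gate (one weight per expert) -> scaled concatenation."""
    _, psd = welch(x, fs=fs, nperseg=400)          # (C, 201) power spectral density
    feat = psd.mean(axis=0)                        # channel-averaged PSD, 201 bins
    gate = 1.0 / (1.0 + np.exp(-(feat @ gate_W)))  # sigmoid gate in (0, 1)^3
    embs = [feat @ W for W in expert_W]            # three 128-dim expert embeddings
    return gate, np.concatenate([gate[k] * embs[k] for k in range(3)])

gate, fused = fuse(rng.standard_normal((C, L)))
print(gate.shape, fused.shape)                     # (3,) (384,)
```

Because the gate is computed from the spectrum rather than a learned encoding of the raw signal, samples dominated by different rhythms can up- or down-weight different experts.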
Compared to the best-performing task-specific model (EEGConformer), our framework yields a substantial improvement of 3.95% in accuracy. The results suggest that our spectral mixture-of-experts is effective at capturing the rhythmic signatures necessary for high-fidelity motor imagery classification across unseen subjects.

Table S3. Detailed performance comparison on the PhysioNet-MI motor imagery task. Values represent the average and standard deviation across five random seeds. Bold indicates the best performance; underline indicates the second best.

              EEGNet           EEGConf          FFCL             LaBraM           CBraMod          CSBrain          SpecMoE
# Params      —                —                —                5.8 M            34.2 M           38.2 M           4.3 M
Bal. Acc      0.5814 ± 0.0125  0.6049 ± 0.0104  0.5726 ± 0.0092  0.6173 ± 0.0122  0.6174 ± 0.0036  0.6304 ± 0.0090  0.6444 ± 0.0109
Weighted F1   0.5796 ± 0.0115  0.6062 ± 0.0095  0.5701 ± 0.0079  0.6177 ± 0.0141  0.6179 ± 0.0035  0.6308 ± 0.0095  0.6476 ± 0.0082

B.3 Emotion Recognition Task

SEED-V [31] is a benchmark dataset for EEG-based emotion recognition, featuring five emotional categories: happy, sad, neutral, disgust, and fear. EEG signals were collected from 16 subjects across three sessions using 62 channels at a sampling rate of 1000 Hz. Following standard preprocessing, the data was resampled to 200 Hz and segmented into 1-second windows, resulting in 117,744 samples. We employ a within-session split protocol, dividing the 15 trials of each session into three equal parts (5:5:5) for training, validation, and testing.

As shown in Table S4, decoding high-granularity emotional states remains a significant challenge for all architectures, with performance across the board generally lower than on simpler motor or clinical tasks.
While SpecMoE does not achieve the top rank in this paradigm, it remains highly competitive, with a balanced accuracy of 0.4033 and a weighted F1-score of 0.4142. Specifically, SpecMoE performs within a 1.64% margin of the state-of-the-art CSBrain and continues to outperform earlier foundation models such as LaBraM, as well as all task-specific baselines. These results suggest that, while the spectral gating mechanism is effective, the extremely short 1-second temporal context of SEED-V may limit the full potential of hierarchical spectral modeling compared to datasets with longer rhythmic signatures.

Table S4. Detailed performance comparison on the SEED-V emotion recognition task. Values represent the average and standard deviation across five random seeds. Bold indicates the best performance; underline indicates the second best.

| Metric | Sub-metric | EEGNet | EEGConf | FFCL | LaBraM | CBraMod | CSBrain | SpecMoE |
|---|---|---|---|---|---|---|---|---|
| # Params | | — | — | — | 5.8 M | 14.3 M | 18.3 M | 4.3 M |
| Bal. Acc | Mean | 0.2961 | 0.3537 | 0.3641 | 0.3976 | 0.4091 | 0.4197 | 0.4033 |
| | Std Dev | ±0.0102 | ±0.0112 | ±0.0092 | ±0.0138 | ±0.0097 | ±0.0033 | ±0.0084 |
| Weighted F1 | Mean | 0.2749 | 0.3487 | 0.3645 | 0.3974 | 0.4101 | 0.4280 | 0.4142 |
| | Std Dev | ±0.0098 | ±0.0136 | ±0.0132 | ±0.0111 | ±0.0108 | ±0.0023 | ±0.0091 |

B.4 Sleep Stages Classification Task

The HMC (Haaglanden Medisch Centrum) dataset [32] is a standard benchmark for sleep stage scoring. It contains polysomnography (PSG) recordings from 151 subjects, from which we extract the EEG signals from four channels (F4-M1, C4-M1, O2-M1, and C3-M2). The task involves classifying 30-second epochs into five stages defined by the American Academy of Sleep Medicine manual: Wake (W), Non-REM 1 (N1), Non-REM 2 (N2), Non-REM 3 (N3), and REM (R). The signals were resampled to 200 Hz, consistent with our pretraining configuration. For this task, we followed the subject-independent split used in recent literature [12].
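The resample-then-window preprocessing applied across these datasets can be sketched as follows; the channel count, input rate, and window length below are illustrative placeholders, not the exact pipeline code:

```python
import numpy as np
from scipy.signal import resample_poly

def preprocess(eeg, fs_in, fs_out=200, win_sec=30):
    """Resample a (channels, samples) EEG array to fs_out and cut it into
    non-overlapping fixed-length windows, as described above."""
    # Polyphase resampling from fs_in to fs_out along the time axis
    eeg = resample_poly(eeg, fs_out, fs_in, axis=-1)
    win = fs_out * win_sec
    n = eeg.shape[-1] // win            # number of complete windows
    eeg = eeg[:, : n * win]             # drop the trailing remainder
    # -> (n_windows, channels, win) epochs
    return eeg.reshape(eeg.shape[0], n, win).transpose(1, 0, 2)

# Illustrative: a 4-channel recording at 256 Hz lasting 95 seconds
epochs = preprocess(np.random.randn(4, 256 * 95), fs_in=256)
print(epochs.shape)  # (3, 4, 6000)
```

The same routine covers the other benchmarks by swapping the input rate and window length (e.g. 1024 Hz and 60-second windows for MACO).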
As shown in Table S5, SpecMoE achieves the highest balanced accuracy of 0.7479, surpassing the best foundation baseline, CSBrain (0.7345), by 1.34%. In terms of the weighted F1-score, SpecMoE (0.7503) remains highly competitive, performing on par with CSBrain (0.7506) within a negligible margin. Notably, all foundation models demonstrate a massive performance leap over the FFCL baseline, which struggled with the temporal complexity of sleep stages. The success of SpecMoE here is particularly meaningful: sleep stages are characterized by specific spectral landmarks, such as slow-wave delta activity, which our spectrally-anchored mixture-of-experts is designed to isolate and prioritize during learning.

Table S5. Detailed performance comparison on the HMC sleep stages classification task. Values represent the average and standard deviation across five random seeds. Bold indicates the best performance; underline indicates the second best.

| Metric | Sub-metric | EEGNet | EEGConf | FFCL | LaBraM | CBraMod | CSBrain | SpecMoE |
|---|---|---|---|---|---|---|---|---|
| # Params | | — | — | — | 5.8 M | 20.3 M | 24.3 M | 4.3 M |
| Bal. Acc | Mean | 0.6534 | 0.7149 | 0.4427 | 0.7277 | 0.7269 | 0.7345 | 0.7479 |
| | Std Dev | ±0.0122 | ±0.0086 | ±0.0702 | ±0.0101 | ±0.0041 | ±0.0047 | ±0.0086 |
| Weighted F1 | Mean | 0.6536 | 0.7080 | 0.2902 | 0.7454 | 0.7395 | 0.7506 | 0.7503 |
| | Std Dev | ±0.0168 | ±0.0039 | ±0.0485 | ±0.0027 | ±0.0089 | ±0.0042 | ±0.0024 |

B.5 Drug Therapeutic Area Classification Task

The MACO dataset is a large-scale, private repository designed to investigate the therapeutic potential of various pharmacological compounds through EEG signatures. It comprises ~1,032 hours of recordings from 336 mice, covering four major therapeutic classes: antidepressants, antipsychotics, antiepileptics, and anxiolytics, alongside a fifth control group receiving solvent administration.
The EEG was recorded from two brain regions, the prefrontal and parietal cortices, using 2-channel setups at a sampling rate of 1024 Hz. Following our standard pipeline, signals were resampled to 200 Hz and segmented into 60-second windows, totaling 61,900 samples. We employed a strict subject-independent split, allocating 70% of the mice for training, 10% for validation, and 20% for testing.

As shown in Table S6, SpecMoE achieves its strongest performance on this benchmark, reaching a balanced accuracy of 0.8527 and a weighted F1-score of 0.8499. This represents an absolute improvement of 8.82% in accuracy over the second-best model, CBraMod (0.7645), and a 16.01% lead over CSBrain. Notably, these results were achieved with a significantly more efficient parameter footprint: while models such as CBraMod and CSBrain use more than 20 million parameters, SpecMoE maintains high-fidelity decoding with approximately 4.3 million parameters.

The success on the MACO dataset is significant for two reasons. First, it demonstrates that our spectrally-anchored pretraining, which was primarily conducted on human data, generalizes effectively to murine EEG, validating the cross-species utility of the foundation model. Second, the substantial performance gap suggests that therapeutic drug effects are deeply embedded in frequency-domain bands that our spectral mixture-of-experts model was specifically designed to capture. These results indicate that SpecMoE is a powerful tool for pharmacological research and drug discovery, outperforming much larger models in identifying the neural signatures of therapeutic compounds.

Table S6. Detailed performance comparison on the MACO drug therapeutic area classification task. Values represent the average and standard deviation across five random seeds. Bold indicates the best performance; underline indicates the second best.

| Metric | Sub-metric | EEGNet | EEGConf | FFCL | LaBraM | CBraMod | CSBrain | SpecMoE |
|---|---|---|---|---|---|---|---|---|
| # Params | | — | — | — | 5.8 M | 20.3 M | 24.3 M | 4.3 M |
| Bal. Acc | Mean | 0.6659 | 0.6119 | 0.4325 | 0.6250 | 0.7645 | 0.6926 | 0.8527 |
| | Std Dev | ±0.0166 | ±0.0218 | ±0.0105 | ±0.0272 | ±0.0119 | ±0.0117 | ±0.0033 |
| Weighted F1 | Mean | 0.6828 | 0.6585 | 0.4190 | 0.6736 | 0.8013 | 0.7292 | 0.8499 |
| | Std Dev | ±0.0182 | ±0.0192 | ±0.0935 | ±0.0388 | ±0.0044 | ±0.0189 | ±0.0083 |

B.6 Drug Effects Classification Task

The DA-Pharmaco dataset [33] is a specialized pharmacological benchmark consisting of local field potential (LFP) recordings from depth electrodes in ten mice. Electrodes were implanted across five key regions: the prelimbic cortex (PrL), mediodorsal thalamus (MD), dorsal hippocampal fissure (dCA1), dorsal hippocampal CA3 subfield (dCA3), and ventral hippocampal fissure (vHC). The original study followed a within-subject, randomized Latin-squares design testing seven conditions: Saline, TWEEN80/saline, two doses of Clozapine (1 and 3 mg/kg), Raclopride, SCH23390, and Amphetamine. To evaluate the model's capacity to decode fundamental pharmacological effects, we designed a specific 5-class grouping strategy: (1) Vehicles (Saline and TWEEN80), (2) Amphetamine, (3) Clozapine (grouping both 1 and 3 mg/kg doses), (4) SCH23390, and (5) Raclopride. We extracted a 40-minute window starting 5 minutes post-injection for each session. Signals were resampled to 200 Hz and segmented into 60-second samples (2,800 in total). We utilized a subject-independent split, assigning six subjects for training, two for validation, and two for testing.

As detailed in Table S7, SpecMoE demonstrates a significant performance advantage, achieving a balanced accuracy of 0.6230 and a weighted F1-score of 0.6329. This represents an improvement of 6.33% over the best task-specific model (EEGConformer) and an 8.86% lead over CBraMod.
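A subject-independent split like the one described above partitions on subject identity before any windowing, so no animal contributes samples to more than one set; a minimal sketch with illustrative subject IDs (the per-sample ID array is a random stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative: one subject ID (0..9, ten mice) per 60-second sample
subject_of_sample = rng.integers(0, 10, size=2800)

# Fixed subject-level assignment: 6 train / 2 validation / 2 test
train_subj, val_subj, test_subj = {0, 1, 2, 3, 4, 5}, {6, 7}, {8, 9}

train_idx = np.flatnonzero(np.isin(subject_of_sample, list(train_subj)))
val_idx = np.flatnonzero(np.isin(subject_of_sample, list(val_subj)))
test_idx = np.flatnonzero(np.isin(subject_of_sample, list(test_subj)))

# Sanity check: no subject leaks across splits
assert not (set(subject_of_sample[train_idx]) & set(subject_of_sample[test_idx]))
```

The same pattern applies to the other subject-independent protocols (e.g. PhysioNet-MI, Siena), with the subject sets chosen per dataset.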
Notably, while larger foundation models such as CSBrain (0.4526) and LaBraM (0.4601) struggled to distinguish these custom pharmacological class groupings, SpecMoE's spectrally-anchored gating effectively isolated the regional oscillatory signatures associated with each drug category. These results suggest that SpecMoE is sensitive to the fine-grained spectral shifts that define specific neurotransmitter modulations, even in complex depth-electrode recordings.

Table S7. Detailed performance comparison on the DA-Pharmaco drug effects classification task. Values represent the average and standard deviation across five random seeds. Bold indicates the best performance; underline indicates the second best.

| Metric | Sub-metric | EEGNet | EEGConf | FFCL | LaBraM | CBraMod | CSBrain | SpecMoE |
|---|---|---|---|---|---|---|---|---|
| # Params | | — | — | — | 5.8 M | 38.7 M | 42.7 M | 4.3 M |
| Bal. Acc | Mean | 0.4816 | 0.5597 | 0.3074 | 0.4601 | 0.5344 | 0.4526 | 0.6230 |
| | Std Dev | ±0.0159 | ±0.0345 | ±0.0336 | ±0.0487 | ±0.0210 | ±0.0142 | ±0.0231 |
| Weighted F1 | Mean | 0.4658 | 0.5309 | 0.3080 | 0.4583 | 0.5389 | 0.4575 | 0.6329 |
| | Std Dev | ±0.0127 | ±0.0520 | ±0.0403 | ±0.0710 | ±0.0288 | ±0.0155 | ±0.0206 |

B.7 Imagined Speech Recognition Task

Imagined speech classification aims to decode phonological representations embedded in neural activity without any overt speech. This task is critical for developing augmentative communication technologies for individuals with severe speech impairments resulting from stroke or amyotrophic lateral sclerosis (ALS). We evaluate our model on the BCIC2020-3 dataset [34], released for the 2020 International BCI Competition. In this experiment, 15 subjects imagined five speech-related categories ("hello", "help me", "stop", "thank you", and "yes"). EEG signals were recorded from 64 channels at 256 Hz and resampled to 200 Hz for our experiments.
Following the competition's rigorous split, each subject provided 60 trials per class for training, 10 for validation, and 10 for testing, with each sample consisting of a 3-second recording.

As shown in Table S8, imagined speech recognition represents a high-complexity decoding task on which task-specific models often struggle to exceed 50% accuracy. SpecMoE achieves the highest performance across all models, with a balanced accuracy of 0.6262 and a weighted F1-score of 0.6264. This represents a substantial improvement of 2.58% over the previous best-performing foundation model, CSBrain (0.6004). Furthermore, SpecMoE surpasses the strongest task-specific baseline (FFCL) by 15.84%, demonstrating that the spectral mixture-of-experts architecture is capable of isolating the subtle, high-frequency oscillatory patterns associated with internal phonological processing.

Table S8. Detailed performance comparison on the BCIC2020-3 imagined speech recognition task. Values represent the average and standard deviation across five random seeds. Bold indicates the best performance; underline indicates the second best.

| Metric | Sub-metric | EEGNet | EEGConf | FFCL | LaBraM | CBraMod | CSBrain | SpecMoE |
|---|---|---|---|---|---|---|---|---|
| # Params | | — | — | — | 5.8 M | 27.7 M | 31.7 M | 4.3 M |
| Bal. Acc | Mean | 0.4413 | 0.4506 | 0.4678 | 0.5060 | 0.5373 | 0.6004 | 0.6262 |
| | Std Dev | ±0.0096 | ±0.0133 | ±0.0197 | ±0.0155 | ±0.0108 | ±0.0187 | ±0.0166 |
| Weighted F1 | Mean | 0.4413 | 0.4488 | 0.4689 | 0.5054 | 0.5383 | 0.6003 | 0.6264 |
| | Std Dev | ±0.0102 | ±0.0154 | ±0.0205 | ±0.0205 | ±0.0096 | ±0.0192 | ±0.0158 |

B.8 Abnormal Signal Detection Task

Abnormal signal detection facilitates the identification of pathological neuronal activity, potentially reducing the clinical workload by providing automated alerts during continuous monitoring. We evaluate our model on the TUAB dataset [29], a standard benchmark for pathological signal detection.
The dataset contains EEG recordings from 23 channels at 256 Hz, annotated as normal or abnormal. Following existing literature, we use 16 bipolar-montage channels based on the international 10–20 system. Signals were resampled to 200 Hz and segmented into 10-second windows, totaling 409,455 samples. Using the dataset's predefined splits, we further partitioned the training subjects into training and validation sets at an 8:2 ratio. To accelerate the fine-tuning process and assess model efficiency, we randomly selected a subset of 50,000 samples for training and 50,000 samples for validation.

As shown in Table S9, clinical abnormality detection remains a competitive domain for foundation models. SpecMoE maintains a robust stance, reaching a balanced accuracy of 0.7742 and an AUROC of 0.8554. While certain models such as CSBrain (0.8172) currently lead the benchmark, it is important to note that our results were achieved using a significantly reduced subset of the available training data. This preliminary result suggests that the spectrally-anchored gating mechanism may offer high data efficiency, as the model captured essential pathological signatures despite using only a fraction of the available training data. Extensive tests on the full dataset are planned as future work to further characterize the scaling behavior of SpecMoE on this task.

Table S9. Detailed performance comparison on the TUAB abnormal signal detection task. Values represent the average and standard deviation across five random seeds. Bold indicates the best performance; underline indicates the second best.

| Metric | Sub-metric | EEGNet | EEGConf | FFCL | LaBraM | CBraMod | CSBrain | SpecMoE |
|---|---|---|---|---|---|---|---|---|
| # Params | | — | — | — | 5.8 M | 24.4 M | 28.4 M | 4.3 M |
| Bal. Acc | Mean | 0.7642 | 0.7758 | 0.7848 | 0.8140 | 0.7891 | 0.8172 | 0.7742 |
| | Std Dev | ±0.0036 | ±0.0049 | ±0.0038 | ±0.0019 | ±0.0030 | ±0.0043 | ±0.0070 |
| AUPRC | Mean | 0.8299 | 0.8427 | 0.8448 | 0.8965 | 0.8636 | 0.9005 | 0.8548 |
| | Std Dev | ±0.0043 | ±0.0054 | ±0.0065 | ±0.0016 | ±0.0063 | ±0.0066 | ±0.0092 |
| AUROC | Mean | 0.8412 | 0.8445 | 0.8569 | 0.9022 | 0.8606 | 0.8957 | 0.8554 |
| | Std Dev | ±0.0031 | ±0.0038 | ±0.0051 | ±0.0009 | ±0.0057 | ±0.0046 | ±0.0082 |

B.9 Seizure Detection Task

The Siena dataset [35] is a clinical database comprising video-EEG monitoring from 14 adult patients. EEG signals were recorded at 512 Hz using the international 10–20 system, with seizure events rigorously annotated by clinical experts following International League Against Epilepsy (ILAE) criteria. We utilized the 29 EEG channels consistently available across the cohort and resampled the signals to 200 Hz. The data was segmented into 10-second windows, totaling 51,307 samples. To ensure a robust evaluation of clinical generalization, we employed a subject-independent split: data from subjects PN16 and PN17 were held out for testing, while the remaining 12 subjects were used for training and validation (8:2 ratio).

As demonstrated in Table S10, SpecMoE achieves remarkable performance on the Siena benchmark, significantly outperforming all task-specific and foundation models. Notably, SpecMoE reaches a balanced accuracy of 0.8655 and an AUPRC of 0.9906. This represents a substantial lead of 9.93% in balanced accuracy and a 50.35% improvement in AUPRC over the previous state-of-the-art, CSBrain. While such a high AUPRC suggests nearly perfect identification of seizure events within this specific subject-independent split, these results underscore the power of spectrally-anchored gating in isolating the distinct rhythmic discharges that characterize ictal activity.
The hierarchical U-shaped architecture of SpecHi-Net appears particularly well-suited for capturing these multi-scale temporal dependencies, providing a clear advantage in clinical seizure monitoring.

Table S10. Detailed performance comparison on the Siena seizure detection task. Values represent the average and standard deviation across five random seeds. Bold indicates the best performance; underline indicates the second best.

| Metric | Sub-metric | EEGNet | EEGConf | FFCL | LaBraM | CBraMod | CSBrain | SpecMoE |
|---|---|---|---|---|---|---|---|---|
| # Params | | — | — | — | 5.8 M | 37.7 M | 41.7 M | 4.3 M |
| Bal. Acc | Mean | 0.7487 | 0.7556 | 0.6616 | 0.7082 | 0.7317 | 0.7662 | 0.8655 |
| | Std Dev | ±0.0521 | ±0.0210 | ±0.0391 | ±0.0329 | ±0.0647 | ±0.0471 | ±0.0038 |
| AUPRC | Mean | 0.3753 | 0.2091 | 0.3938 | 0.3122 | 0.4107 | 0.4871 | 0.9906 |
| | Std Dev | ±0.0867 | ±0.0786 | ±0.0903 | ±0.0976 | ±0.0720 | ±0.0343 | ±0.0008 |
| AUROC | Mean | 0.8687 | 0.8159 | 0.8154 | 0.8814 | 0.9038 | 0.9076 | 0.9148 |
| | Std Dev | ±0.0527 | ±0.0261 | ±0.1155 | ±0.0328 | ±0.0218 | ±0.0119 | ±0.0098 |

B.10 Vigilance Estimation Task

SEED-VIG [36] is a specialized dataset designed for continuous estimation of driver vigilance. The data was collected in a virtual driving simulator in which 21 subjects performed a driving task while their vigilance levels were continuously monitored. Ground-truth vigilance labels were derived from eye-tracking data using the PERCLOS (percentage of eyelid closure) indicator. EEG signals were recorded from 17 channels at 200 Hz and segmented into 20,355 8-second windows. We follow a subject-independent split protocol, using subjects 1–13 for training, 14–17 for validation, and 18–21 for testing.

As shown in Table S11, vigilance estimation serves as a rigorous test for regression-based neural decoding. SpecMoE achieves the highest overall performance in terms of error minimization and variance explanation, reaching a state-of-the-art RMSE of 0.1522 and an R² score of 0.2454.
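The regression metrics reported for this task follow their standard definitions; a minimal numpy sketch with illustrative label and prediction arrays (made-up values, not model outputs):

```python
import numpy as np

# Illustrative continuous vigilance labels and predictions
y_true = np.array([0.1, 0.4, 0.35, 0.8, 0.6])
y_pred = np.array([0.15, 0.35, 0.4, 0.7, 0.65])

# Root mean square error: penalizes large predictive deviations
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R² score: fraction of label variance explained by the predictions
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Pearson's correlation coefficient: linear agreement, scale-invariant
pearson = np.corrcoef(y_true, y_pred)[0, 1]
```

Note that RMSE and Pearson's correlation can disagree: a model can track the trend well (high correlation) while still making large absolute errors (high RMSE), which is the pattern discussed below for LaBraM and CSBrain versus SpecMoE.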
This represents a substantial 0.1252 reduction in RMSE compared to the previous best foundation model, CSBrain (0.2774). While LaBraM and CSBrain maintain a narrow lead in Pearson's correlation coefficient, the significantly lower RMSE of SpecMoE indicates that our spectrally-anchored gating mechanism is more effective at minimizing large predictive deviations.

Table S11. Detailed performance comparison on the SEED-VIG vigilance estimation (regression) task. Values represent the average and standard deviation across five random seeds. Bold indicates the best performance; underline indicates the second best. For RMSE, lower values are better.

| Metric | Sub-metric | EEGNet | EEGConf | FFCL | LaBraM | CBraMod | CSBrain | SpecMoE |
|---|---|---|---|---|---|---|---|---|
| # Params | | — | — | — | 5.8 M | 21.9 M | 25.9 M | 4.3 M |
| Pearson's corr | Mean | 0.5127 | 0.5800 | 0.4923 | 0.6347 | 0.5502 | 0.6314 | 0.5168 |
| | Std Dev | ±0.0357 | ±0.0174 | ±0.0313 | ±0.0135 | ±0.0115 | ±0.0356 | ±0.0218 |
| R² Score | Mean | 0.1960 | 0.2065 | 0.1740 | 0.1808 | 0.0737 | 0.2363 | 0.2454 |
| | Std Dev | ±0.0427 | ±0.0230 | ±0.0530 | ±0.0958 | ±0.0167 | ±0.0519 | ±0.0289 |
| RMSE (↓) | Mean | 0.2847 | 0.2829 | 0.2885 | 0.2871 | 0.3057 | 0.2774 | 0.1522 |
| | Std Dev | ±0.0076 | ±0.0041 | ±0.0093 | ±0.0166 | ±0.0027 | ±0.0094 | ±0.0029 |

C Additional Results

C.1 Detailed Ablation Studies

Figures S1, S2, and S3 illustrate the relative impact of our core design choices across the BCIC2020-3, DA-Pharmaco, and PhysioNet-MI datasets. By systematically removing or replacing key components, we demonstrate that the high performance of SpecMoE derives from the synergy between our novel masking strategy and the spectrally-guided mixture-of-experts.

Impact of Gaussian-Smoothed Masking. The most critical component of our framework is the Gaussian-smoothed masking scheme. When it is replaced with standard non-Gaussian (rectangular) masks, we observe a substantial drop in performance across all tasks.
For instance, on BCIC2020-3, the balanced accuracy plummets from 0.6262 to 0.4063 (an absolute drop of approximately 22 percentage points). This experiment confirms our hypothesis that "soft" Gaussian boundaries force the model to learn complex neural oscillations from a large spectral context.

Efficacy of Joint Time-Frequency (TF) Masking. We evaluated the necessity of joint masking by comparing it against time-only and frequency-only variants. On the DA-Pharmaco dataset, using only time masks reduced the F1-score from 0.6329 to 0.5861. The consistent superiority of the full TF-masking approach across all metrics suggests that simultaneous occlusion in both domains is essential for capturing the non-stationary rhythmic signatures of EEG signals, particularly in pharmacological and clinical benchmarks.

Architecture and Gating Strategy. The ablation labeled "Full mask + CBraMod" replaces our hierarchical SpecHi-Net and MoE structure with the CBraMod backbone while keeping the Gaussian masking. The results show a massive decline (e.g., a drop to 0.2448 balanced accuracy on BCIC2020-3), indicating that non-hierarchical architectures lack the capacity to solve the challenging reconstruction tasks posed by our aggressive masking strategy. Furthermore, the "Gating without PSD" experiment removes the power spectral density prior from the routing network. This leads to a consistent performance degradation, most notably on BCIC2020-3, where the AUPRC drops by over 12% (from 0.6874 to 0.5650). This validates that the PSD serves as a vital physical anchor, enabling the gating network to route signals to specialized experts based on their underlying rhythmic content rather than mere temporal patterns.
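A minimal numpy sketch of the kind of Gaussian-smoothed time-frequency occlusion contrasted in this ablation; the blob count, widths, and the soft attenuation rule below are illustrative assumptions, not the exact pretraining configuration:

```python
import numpy as np

def gaussian_tf_mask(n_freq, n_time, n_blobs=3, sigma_f=8.0, sigma_t=12.0, seed=0):
    """Build a soft occlusion mask for an STFT map by summing 2-D Gaussian
    'blobs'; values near 1 are occluded, values near 0 are kept."""
    rng = np.random.default_rng(seed)
    f = np.arange(n_freq)[:, None]   # frequency axis (bins)
    t = np.arange(n_time)[None, :]   # time axis (frames)
    mask = np.zeros((n_freq, n_time))
    for _ in range(n_blobs):
        cf, ct = rng.uniform(0, n_freq), rng.uniform(0, n_time)
        mask += np.exp(-((f - cf) ** 2) / (2 * sigma_f ** 2)
                       - ((t - ct) ** 2) / (2 * sigma_t ** 2))
    return np.clip(mask, 0.0, 1.0)

# Occlude a (201 frequency bins x 240 frames) spectrogram with smooth TF blobs
spec = np.random.randn(201, 240)
mask = gaussian_tf_mask(201, 240)
masked_spec = spec * (1.0 - mask)   # smooth attenuation instead of a hard cut
```

Unlike a rectangular mask, the Gaussian fall-off leaves no sharp edge from which the model could trivially infer the occluded low-frequency content.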
C.2 Masking Effects Visualization on STFT Spectrograms

Figures S4, S5, S6, S7, and S8 illustrate how the raw EEG signals and their corresponding STFT spectrograms are modified under different masking configurations. These visualizations contrast our proposed Gaussian-smoothed masks (Figures S4–S7) against standard rectangular masks (Figure S8), highlighting the different levels of information occlusion across the time and frequency domains.

C.3 Pretraining Importance Analysis

Figure S9 illustrates the critical role of self-supervised pretraining by comparing the performance of SpecMoE initialized with pretrained weights against a version trained from scratch (random initialization). Across all three representative datasets (BCIC2020-3, DA-Pharmaco, and PhysioNet-MI), the pretrained model consistently achieves superior results in both balanced accuracy and AUPRC.

[Figure S1 panels — balanced accuracy: BCIC2020-3 62.64%, with Time-only −6.06%, Freq-only −0.75%, TF-only −1.33%, Non-Gaussian −22.02%, Full mask + CBraMod −38.79%, Gating without PSD −9.17%; DA-Pharmaco 63.29% (−4.68%, −0.61%, −4.75%, −5.25%, −9.31%, −2.91%); PhysioNet-MI 64.75% (−1.65%, −2.84%, −4.79%, −2.12%, −23.77%, −2.07%).]

Fig. S1. SpecMoE ablation results - balanced accuracy. We show the absolute value of the balanced accuracy for the SpecMoE model and relative differences for six ablation experiments. TF stands for time-frequency.
[Figure S2 panels — AUPRC: BCIC2020-3 68.74%, with Time-only −9.97%, Freq-only −2.97%, TF-only −1.93%, Non-Gaussian −27.04%, Full mask + CBraMod −45.58%, Gating without PSD −12.24%; DA-Pharmaco 66.61% (−1.39%, +0.85%, −3.22%, −0.94%, −8.73%, +0.18%); PhysioNet-MI 73.60% (−2.57%, −4.74%, −8.23%, −2.66%, −30.18%, −2.44%).]

Fig. S2. SpecMoE ablation results - AUPRC. We show the absolute value of AUPRC for the SpecMoE model and relative differences for six ablation experiments. TF stands for time-frequency.

[Figure S3 panels — F1-score: BCIC2020-3 62.64% (−6.06%, −0.75%, −1.33%, −22.02%, −38.79%, −9.17%); DA-Pharmaco 63.29% (−4.68%, −0.61%, −4.75%, −5.25%, −9.31%, −2.91%); PhysioNet-MI 64.75% (−1.65%, −2.84%, −4.79%, −2.12%, −23.77%, −2.07%).]

Fig. S3. SpecMoE ablation results - F1-score. We show the absolute value of F1-score for the SpecMoE model and relative differences for six ablation experiments. TF stands for time-frequency.

The most substantial impact is observed on the BCIC2020-3 imagined speech task, where balanced accuracy falls from 0.6262 to 0.4433 without pretraining, an absolute decrease of 18.29 points. Similarly, the AUPRC for this task experiences a sharp decline of 0.2457. This indicates that decoding high-complexity tasks such as imagined speech relies heavily on the generalized neural representations learned during large-scale pretraining.
On the DA-Pharmaco and PhysioNet-MI datasets, we observe absolute accuracy gains of 5.56 and 1.42 points, respectively. The improvement on DA-Pharmaco is particularly noteworthy, as it validates the cross-species utility of our foundation model: despite being pretrained on human EEG, the model learns universal spectral features that significantly accelerate and improve the decoding of rodent LFP signals. Overall, these results confirm that our Gaussian-smoothed masking task successfully forces the model to learn a robust, generalized representation of brain activity that serves as a high-quality initialization for diverse downstream applications.

C.4 Experts Contribution Analysis

Figure S10 visualizes the relative contribution of each expert across three datasets, revealing distinct routing patterns tailored to each task's complexity. On the BCIC2020-3 dataset, we observe that the gating network primarily solicits Experts 1 and 2. This shared contribution suggests that decoding imagined speech, a high-complexity task, requires a collaborative representation in which different experts likely specialize in complementary spectral features of internal phonological processing. In contrast, the DA-Pharmaco dataset engages all three experts across the different classes and brain regions. This broad utilization reflects the high variance of pharmacological signatures in murine LFP data, where different experts are required to isolate the unique oscillatory shifts induced by diverse compounds.

Finally, for PhysioNet-MI, the routing logic is highly sparse, with Expert 2 being almost exclusively solicited. This indicates that for standard motor imagery tasks, the gating network identifies a consistent, dominant spectral profile across the motor cortex, which can be effectively modeled by a single specialized expert without additional parameters from the others.
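The per-expert contributions visualized in Figure S10 can be read off the spectral gate's routing weights; a conceptual numpy sketch in which a random linear map stands in for the trained PSD-gating network, with shapes following Table S2 (201 frequency bins, three experts of dimension 128):

```python
import numpy as np

rng = np.random.default_rng(0)
F, D, n_experts, n_samples = 201, 128, 3, 32   # freq bins, expert dim, experts

# Stand-ins for trained components: per-sample PSD features and a linear gate
psd = rng.random((n_samples, F))               # Welch-style PSD, F frequency bins
gate_w = rng.normal(0, 0.1, (F, n_experts))    # hypothetical gating weights

# Sigmoid gating: one independent weight per expert (not a softmax)
gates = 1.0 / (1.0 + np.exp(-(psd @ gate_w)))  # (n_samples, 3)

# Scale each expert's 128-dim embedding, then concatenate to 3 x D = 384 dims
experts = rng.normal(size=(n_samples, n_experts, D))
fused = (gates[:, :, None] * experts).reshape(n_samples, n_experts * D)

# Relative contribution of each expert = mean gate weight over samples
contrib = gates.mean(axis=0)
print(fused.shape)  # (32, 384)
```

Averaging the gate weights per class, rather than over all samples, yields the class-grouped activation maps shown in Figure S10.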
C.5 Embeddings Visualization W e qualitativ ely assess the mo del’s represen tation space using t-SNE pro jections on the DA- Pharmaco and MA CO test sets. Figure S11 displays the pro jections obtained using random weigh ts from an untrained Sp ecMoE mo del. More specifically , we initialized the mo del with random weigh ts and pro jected the output EEG F oundation Mo del - SpecMoE 25 0 5 10 15 20 0.2 0.0 0.2 R aw EEG signal 0 5 10 15 20 T ime (s) 0.02 0.00 0.02 0.04 Mask ed EEG signal 0 5 10 15 20 0 50 100 150 200 Original STF T 0 5 10 15 20 0 50 100 150 200 Mask ed STF T 12 10 8 6 4 12 10 8 6 4 Fig. S4. Overview of the prop osed joint Gaussian masking strategy . The panels illustrate the simultaneous application of temp oral, sp ectral, and joint time-frequency o cclusions on raw EEG signals (top panel: raw, middle panel: mask ed) and their corresponding STFT sp ectrograms (b ottom panel). 26 D. Darank oum et al. 0 5 10 15 20 0 50 100 150 200 Original STF T 0 5 10 15 20 0 50 100 150 200 Mask ed STF T 12 10 8 6 4 12 10 8 6 4 0 5 10 15 20 0.2 0.0 0.2 R aw EEG signal 0 5 10 15 20 T ime (s) 0.2 0.0 0.2 Mask ed EEG signal Fig. S5. Visualization of temp oral domain masking. The panels illustrate the application of time-based o cclusions on ra w EEG signals (top panel: raw, middle panel: masked) and their corresp onding STFT sp ectrograms (b ottom panel). EEG F oundation Mo del - SpecMoE 27 0 5 10 15 20 0.2 0.0 0.2 R aw EEG signal 0 5 10 15 20 T ime (s) 0.02 0.00 0.02 Mask ed EEG signal 0 5 10 15 20 0 50 100 150 200 Original STF T 0 5 10 15 20 0 50 100 150 200 Mask ed STF T 12 10 8 6 4 12 10 8 6 4 Fig. S6. Visualization of frequency domain masking. The panels illustrate the application of spectral- based o cclusions on ra w EEG signals (top panel: ra w, middle panel: masked) and their corresponding STFT spectrograms (bottom panel). 28 D. Darank oum et al. 
0 5 10 15 20 0 50 100 150 200 Original STF T 0 5 10 15 20 0 50 100 150 200 Mask ed STF T 12 10 8 6 4 12 10 8 6 4 0 5 10 15 20 0.2 0.0 0.2 R aw EEG signal 0 5 10 15 20 T ime (s) 0.05 0.00 0.05 0.10 0.15 Mask ed EEG signal Fig. S7. Visualization of time-frequency domain masking. The panels illustrate the application of temp oral- sp ectral blobs o cclusions on raw EEG signals (top panel: raw, middle panel: mask ed) and their correspond- ing STFT sp ectrograms (b ottom panel). EEG F oundation Mo del - SpecMoE 29 0 5 10 15 20 0.2 0.0 0.2 R aw EEG signal 0 5 10 15 20 T ime (s) 0.1 0.0 0.1 Mask ed EEG signal 0 5 10 15 20 0 50 100 150 200 Original STF T 0 5 10 15 20 0 50 100 150 200 Mask ed STF T 12 10 8 6 4 12 10 8 6 4 Fig. S8. Time, frequency and time-frequency masking with rectangular masks. The panels illustrate the sim ultaneous application of temporal, spectral, and joint time-frequency rectangular o cclusions on raw EEG signals (top panel: raw, middle panel: masked) and their corresponding STFT sp ectrograms (b ottom panel). 30 D. Darank oum et al. SpecMoE No P r etrained weights 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Balanced accuracy 62.62% -18.29% BCIC2020-3 SpecMoE No P r etrained weights 62.29% -5.56% DA-Pharmaco SpecMoE No P r etrained weights 64.44% -1.42% PhysioNet-MI SpecMoE No P r etrained weights 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 AUPR C 68.74% -24.57% BCIC2020-3 SpecMoE No P r etrained weights 66.61% -2.78% DA-Pharmaco SpecMoE No P r etrained weights 73.60% -2.99% PhysioNet-MI Fig. S9. SpecMoE pretraining ablations on six datasets. 
EEG F oundation Mo del - SpecMoE 31 0 50 100 150 200 250 300 350 Expert 1 (Embedding dim : 0 to 127) | Expert 2 (Embedding dim : 128 to 255) | Expert 3 (Embedding dim : 256 to 383) Class 0 Class 0 Class 0 Class 0 Class 0 Class 0 Class 0 Class 1 Class 1 Class 1 Class 1 Class 1 Class 2 Class 2 Class 2 Class 2 Class 2 Class 2 Class 2 Class 2 Class 2 Class 2 Class 3 Class 3 Class 3 Class 3 Class 3 Class 3 Class 3 Class 3 Class 3 Class 3 Samples (Gr ouped by Class) PhysioNet-MI e xamples (Scaled 0 to 1 | L ocal R ange: -0.00 to 0.09) 0.0 0.2 0.4 0.6 0.8 1.0 R elative A ctivation Intensity 0 50 100 150 200 250 300 350 Expert 1 (Embedding dim : 0 to 127) | Expert 2 (Embedding dim : 128 to 255) | Expert 3 (Embedding dim : 256 to 383) Class 0 Class 0 Class 0 Class 0 Class 0 Class 0 Class 0 Class 0 Class 0 Class 1 Class 1 Class 1 Class 1 Class 1 Class 2 Class 2 Class 2 Class 2 Class 2 Class 2 Class 2 Class 2 Class 2 Class 2 Class 2 Class 3 Class 3 Class 4 Class 4 Class 4 Class 4 Class 4 Samples (Gr ouped by Class) D A -phar maco e xamples (Scaled 0 to 1 | L ocal R ange: -0.07 to 0.49) 0.0 0.2 0.4 0.6 0.8 1.0 R elative A ctivation Intensity 0 50 100 150 200 250 300 350 Expert 1 (Embedding dim : 0 to 127) | Expert 2 (Embedding dim : 128 to 255) | Expert 3 (Embedding dim : 256 to 383) Class 0 Class 0 Class 0 Class 0 Class 0 Class 1 Class 1 Class 1 Class 1 Class 1 Class 1 Class 2 Class 2 Class 2 Class 2 Class 2 Class 3 Class 3 Class 3 Class 3 Class 3 Class 3 Class 3 Class 3 Class 3 Class 3 Class 3 Class 4 Class 4 Class 4 Class 4 Class 4 Samples (Gr ouped by Class) B CIC2020-3 e xamples (Scaled 0 to 1 | L ocal R ange: 0.07 to 3.41) 0.0 0.2 0.4 0.6 0.8 1.0 R elative A ctivation Intensity Fig. S10. Visualization of exp ert con tribution across different EEG paradigms: BCIC2020-3 (top), D A- Pharmaco (middle), and PhysioNet-MI (b ottom). Each panel displa ys the routing activ ations for Exp erts 1, 2, and 3 (from left to right), with samples grouped by class lab els. 
The color intensity represents the gating network's routing weights, ranging from low (darker tones) to high (brighter tones), reflecting the relative importance of each expert for a given sample.
results without any training. For both datasets, the classes entirely overlap, confirming that raw signals lack intrinsic linear separability and highlighting the necessity of training. Figures S12 and S13 compare SpecMoE against baseline foundation models. On DA-Pharmaco, SpecMoE produces more compact clusters, particularly for the Saline and Amphetamine groups. On MACO, the distinction is even more pronounced: SpecMoE is the only model capable of clearly isolating the Antiepileptic class. These results visually confirm that SpecMoE effectively structures the latent space, mapping diverse pharmacological effects into distinct neural signatures.
[Fig. S11 panels: t-SNE dimensions 1 and 2; DA-Pharmaco classes Sal & SalT, Amph, CLZ1 & CLZ3, SCH, Raclo; MACO classes Solvent, Antidepressant, Antipsychotic, Antiepileptic, Anxiolytic.]
Fig. S11. t-SNE DA-Pharmaco (left) and MACO (right) projections obtained by an untrained SpecMoE model containing random weights.
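The routing weights visualized in Fig. S10 can be thought of as a softmax gate scaling each expert's slice of the concatenated embedding (Expert 1: dims 0-127, Expert 2: dims 128-255, Expert 3: dims 256-383). The sketch below is an illustrative assumption about how such a gate could combine expert outputs, not the paper's exact SpecMoE gating implementation; the function name `spectral_gate` and all shapes are hypothetical.

```python
import numpy as np

def spectral_gate(expert_embeddings, gate_logits):
    """Combine per-expert embeddings with softmax routing weights.

    expert_embeddings: (n_experts, batch, dim) stacked expert outputs
    gate_logits:       (batch, n_experts) scores from a gating network

    Each expert slice is scaled by its routing weight, then the slices
    are concatenated, mirroring the layout shown in Fig. S10.
    """
    n_experts, batch, dim = expert_embeddings.shape
    # Numerically stable softmax over experts: weights sum to 1 per sample.
    z = gate_logits - gate_logits.max(axis=1, keepdims=True)
    w = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)   # (batch, n_experts)
    # Broadcast weights onto each expert's embedding slice.
    scaled = expert_embeddings * w.T[:, :, None]           # (n_experts, batch, dim)
    return np.concatenate([scaled[e] for e in range(n_experts)], axis=1)
```

With three 128-dimensional experts this yields the 384-dimensional embedding whose activation intensities are plotted in Fig. S10.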
[Fig. S12 panels: t-SNE dimensions 1 and 2 per model; classes Sal & SalT, Amph, CLZ1 & CLZ3, SCH, Raclo.]
Fig. S12. t-SNE DA-Pharmaco test data projection using four foundation models. Top left: CBraMod, top right: CSBrain, bottom left: LaBraM, bottom right: SpecMoE.
34 D. Darankoum et al.
[Fig. S13 panels: t-SNE dimensions 1 and 2 per model; classes Solvent, Antidepressant, Antipsychotic, Antiepileptic, Anxiolytic.]
Fig. S13. t-SNE MACO test data projection using four foundation models. Top left: CBraMod, top right: CSBrain, bottom left: LaBraM, bottom right: SpecMoE.
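Projections like those in Figs. S11-S13 can be reproduced by running t-SNE on the model's pooled embeddings. A minimal sketch using scikit-learn's `TSNE` is shown below; the perplexity, `init="pca"`, and random seed are illustrative choices and are not the settings reported in the paper.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_embeddings(embeddings, seed=0):
    """Project high-dimensional embeddings to 2-D with t-SNE.

    embeddings: (n_samples, dim) array of per-sample model embeddings.
    Returns an (n_samples, 2) array of t-SNE coordinates.
    Hyperparameters here are illustrative defaults.
    """
    tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=seed)
    return tsne.fit_transform(np.asarray(embeddings, dtype=np.float32))
```

Coloring the resulting 2-D points by class label (e.g., the pharmacological groups of DA-Pharmaco or MACO) gives the cluster-separation views used to compare models.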