Developing foundation models for electroencephalography (EEG) remains challenging due to the signal's low signal-to-noise ratio and complex spectro-temporal non-stationarity. Existing approaches often overlook the hierarchical latent structure inherent in neural dynamics, leading to suboptimal reconstruction of fine-grained information. In this work, we propose BrainRVQ, a general-purpose EEG foundation model pre-trained on a large-scale corpus of clinical EEG data. Unlike standard masked modeling, BrainRVQ features a Dual-Domain Residual Vector Quantization (DD-RVQ) tokenizer that disentangles temporal waveforms and spectral patterns into hierarchical discrete codes. We further introduce a hierarchical autoregressive pre-training objective that learns to reconstruct these codes in a coarse-to-fine manner, using an importance-guided curriculum masking strategy to prioritize information-rich neural events over background noise. Extensive experiments across 8 diverse downstream datasets demonstrate that BrainRVQ consistently outperforms state-of-the-art baselines, validating its effectiveness in learning robust and generalizable neural representations. Our code and model weights are available at https://github.com/keqicmz/BrainRVQ.
Electroencephalography (EEG) provides a non-invasive interface for monitoring millisecond-level neural dynamics (Niedermeyer & da Silva, 2005). This high temporal resolution has fueled advances across diverse domains, ranging from seizure detection (Shoeb, 2009) and sleep staging (Aboalayon et al., 2016) to emotion recognition (Zheng & Lu, 2015) and motor imagery classification (Pfurtscheller & Neuper, 2001). However, decoding EEG signals remains a formidable challenge for machine learning due to their inherently low signal-to-noise ratio (SNR), complex non-stationarity, and substantial variability across subjects (Lotte et al., 2007). These characteristics, combined with the scarcity of large-scale labeled data, have historically constrained the development of generalizable EEG decoding models in neuroscience.
To address the labeled-data bottleneck, self-supervised learning (SSL) has emerged as a promising paradigm. Early discriminative approaches, such as BENDR (Kostas et al., 2021), employed contrastive learning to extract transferable features from unlabeled recordings. More recently, generative approaches inspired by Masked Image Modeling (MIM) and Large Language Models (LLMs) have gained prominence. Pioneering works like LaBraM (Jiang et al., 2024) introduced neural tokenizers to convert continuous EEG into discrete tokens, enabling BERT-style pre-training. BrainBERT (Wang et al., 2023) and Brant (Zhang et al., 2023) further explored distinct masking strategies and architectural designs to capture temporal dependencies. Subsequent efforts have focused on scaling and architectural innovations: REVE (Ouahidi et al., 2025) scaled pre-training to over 25,000 subjects, while CBraMod (Wang et al., 2024b) proposed criss-cross attention to model spatial and temporal dependencies separately.

Despite these advances, existing EEG foundation models face critical limitations in representation fidelity. First, most current approaches rely on single-domain tokenization, processing signals strictly in either the time domain or the frequency domain. While time-domain tokenizers excel at capturing transient events like epileptic spikes, they often struggle to represent global spectral patterns. Conversely, frequency-domain methods capture oscillatory rhythms but sacrifice temporal precision. This single-perspective quantization therefore fails to capture the intricate spectro-temporal coupling of neural dynamics, leading to significant information loss and suboptimal reconstruction of complex brain signals.
Second, the discretization capacity of existing models remains limited. Standard neural tokenizers typically employ single-layer vector quantization to project continuous signals into discrete latent codes. This flat quantization lacks the capacity to encode the high-dimensional variability of EEG signals. Residual Vector Quantization (RVQ), originally developed for high-fidelity audio generation, offers a robust alternative: a cascade of codebooks approximates the signal with increasing precision, with each codebook quantizing the residual left by the previous one. However, directly applying RVQ to EEG pre-training is non-trivial. EEG signals exhibit an inherent hierarchy of dominant rhythms and subtle details, yet are heavily contaminated by background noise. Standard masked modeling objectives treat all residual codes independently and equally, which fails to capture the coarse-to-fine semantic dependency and often wastes model capacity on reconstructing background noise rather than meaningful neural events.
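To make the cascade concrete, below is a minimal PyTorch sketch of generic residual vector quantization. The layer count, codebook size, and latent dimension are illustrative assumptions, and training-time details such as the straight-through estimator and commitment loss are omitted; this is not BrainRVQ's actual tokenizer.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Generic residual vector quantization: each codebook quantizes
    the residual left over by all coarser codebooks before it."""
    def __init__(self, num_layers=4, codebook_size=1024, dim=64):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_layers)]
        )

    def forward(self, z):                                   # z: (batch, dim)
        residual, quantized, codes = z, torch.zeros_like(z), []
        for codebook in self.codebooks:
            # Nearest-neighbour lookup against this layer's codebook.
            dists = torch.cdist(residual, codebook.weight)  # (batch, K)
            idx = dists.argmin(dim=-1)                      # discrete code
            chosen = codebook(idx)                          # (batch, dim)
            quantized = quantized + chosen
            residual = residual - chosen    # remainder goes to finer layers
            codes.append(idx)
        # The code stack is coarse-to-fine: layer 0 carries dominant
        # structure, later layers add progressively smaller refinements.
        return quantized, torch.stack(codes)
```

For instance, `ResidualVQ()(torch.randn(8, 64))` returns the summed reconstruction and a `(num_layers, batch)` stack of discrete indices, which is exactly the hierarchical structure that a coarse-to-fine objective can exploit.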
In this work, we propose BrainRVQ, a high-fidelity EEG foundation model designed to resolve these challenges through Dual-Domain Residual Quantization and Hierarchical Autoregression. To overcome the information loss caused by single-domain processing, we introduce a Dual-Domain Residual Vector Quantization (DD-RVQ) tokenizer. This module disentangles EEG signals into hierarchical discrete codes across both time and frequency domains simultaneously, ensuring that both fine-grained temporal transients and global spectral oscillations are preserved. Furthermore, to address the limitations of flat quantization and independent prediction, we propose a Hierarchical Autoregressive Pre-training objective. Instead of predicting tokens independently, our model learns to reconstruct RVQ codes in a coarse-to-fine manner using teacher forcing, explicitly modeling the dependency between layers. This is coupled with an importance-guided curriculum masking strategy, which dynamically prioritizes information-rich regions, enabling the model to learn robust representations from a cleaner neural manifold.
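The coarse-to-fine objective can be pictured with the following hedged sketch under assumed shapes. Here `backbone`, `heads`, and `embeds` are hypothetical stand-ins for the model's sequence encoder, per-layer classification heads, and per-layer code embeddings, and the importance-guided masking machinery is omitted; the point is only the teacher-forced layer-by-layer dependency.

```python
import torch
import torch.nn.functional as F

def hierarchical_ar_loss(backbone, heads, embeds, codes):
    """Teacher-forced coarse-to-fine prediction of RVQ code layers.

    codes: (num_layers, batch, seq) integer indices from the tokenizer.
    backbone(context) -> (batch, seq, d); heads[l](...) -> (batch, seq, K).
    """
    num_layers, batch, seq = codes.shape
    context = torch.zeros(batch, seq, embeds[0].embedding_dim,
                          device=codes.device)
    loss = 0.0
    for level in range(num_layers):
        hidden = backbone(context)
        logits = heads[level](hidden)
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), codes[level].reshape(-1)
        )
        # Teacher forcing: condition the next (finer) level on the
        # ground-truth codes of this level, not on the predictions.
        context = context + embeds[level](codes[level])
    return loss / num_layers
```

In contrast to masked objectives that score every residual code independently, each level here is predicted conditioned on all coarser levels, which is what "explicitly modeling the dependency between layers" means in practice.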
The main contributions of this paper are summarized as follows:
• Dual-Domain Residual Vector Quantization (DD-RVQ): We propose a novel tokenizer that performs hierarchical quantization in both the time and frequency domains. This dual-perspective design mitigates the information loss inherent in single-domain approaches, enabling high-fidelity encoding of both transient temporal events and global spectral patterns (a minimal sketch of the dual-branch idea appears below).
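As a rough illustration of the dual-branch idea referenced above (not the paper's actual architecture), the sketch below encodes the same EEG patch from its raw waveform and from its FFT magnitude spectrum, quantizing each branch with the generic `ResidualVQ` from the earlier sketch; the linear encoders and patch length are assumptions.

```python
import torch
import torch.nn as nn

class DualDomainTokenizer(nn.Module):
    """Illustrative dual-domain tokenizer: one RVQ stack per domain."""
    def __init__(self, patch_len=256, dim=64):
        super().__init__()
        self.time_enc = nn.Linear(patch_len, dim)
        # rfft of a length-N real signal yields N // 2 + 1 frequency bins.
        self.freq_enc = nn.Linear(patch_len // 2 + 1, dim)
        self.time_rvq = ResidualVQ(dim=dim)  # generic RVQ, sketched earlier
        self.freq_rvq = ResidualVQ(dim=dim)

    def forward(self, patch):                  # patch: (batch, patch_len)
        spectrum = torch.fft.rfft(patch, dim=-1).abs()
        _, time_codes = self.time_rvq(self.time_enc(patch))
        _, freq_codes = self.freq_rvq(self.freq_enc(spectrum))
        # Two hierarchical code stacks, one per domain, so transient
        # waveform detail and oscillatory structure are both preserved.
        return time_codes, freq_codes
```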