Recent success in natural language processing has motivated growing interest in large-scale foundation models for neuroimaging data. Such models often require discretization of continuous neural time series data, a process referred to as 'tokenization'. However, the impact of different tokenization strategies for neural data is currently poorly understood. In this work, we present a systematic evaluation of sample-level tokenization strategies for transformer-based large neuroimaging models (LNMs) applied to magnetoencephalography (MEG) data. We compare learnable and non-learnable tokenizers by examining their signal reconstruction fidelity and their impact on subsequent foundation modeling performance (token prediction, biological plausibility of generated data, preservation of subject-specific information, and performance on downstream tasks). For the learnable tokenizer, we introduce a novel approach based on an autoencoder. Experiments were conducted on three publicly available MEG datasets spanning different acquisition sites, scanners, and experimental paradigms. Our results show that both learnable and non-learnable discretization schemes achieve high reconstruction accuracy and broadly comparable performance across most evaluation criteria, suggesting that simple fixed sample-level tokenization strategies can be used in the development of neural foundation models. The code is available at https://github.com/OHBA-analysis/Cho2026_Tokenizer.
Foundation models are trained on large-scale data to learn representations that can be efficiently adapted to a wide range of downstream tasks. Their success in natural language processing and computer vision has motivated analogous efforts in neuroimaging, aiming to develop Large Neuroimaging Models (LNMs) that leverage abundant unlabeled neural recordings while minimizing reliance on scarce task-specific or clinical annotations.
Learning transferable representations of brain activity using LNMs has the potential to support various neuroscientific and clinical applications, including neural decoding [1], [2], biomarker identification [3], [4], and brain-computer interfaces [5]–[7]. This paradigm is particularly well suited to electrophysiological modalities such as electroencephalography (EEG) and magnetoencephalography (MEG), which provide high temporal resolution and yield large-scale multivariate time-series data [8], [9].
In recent years, several foundation models for EEG and MEG have been proposed, predominantly based on the transformer architecture. These models largely fall into three classes: encoder-only transformers trained via masked token prediction (e.g., LaBraM [10], CBraMod [11], Brain-Omni [12]); encoder-decoder masked autoencoders trained via masked token reconstruction (e.g., REVE [13]); and decoder-only autoregressive models trained via next-token prediction (e.g., Neuro-GPT [14], MEG-GPT [15]).
Despite this progress, a central but underexplored design choice in transformer-based neural foundation models is tokenization: the process of converting continuous time-series data into discrete 'tokens' [16]. This choice determines the representational granularity of the data and can introduce an inductive bias. Inappropriate tokenization may obscure biologically meaningful structure or impose assumptions misaligned with the statistical properties of neural data, ultimately limiting representational fidelity and downstream performance. In this sense, tokenization is not simply a preprocessing step but a defining component that can determine the success of a neural foundation model. An effective tokenizer must therefore encode neural dynamics in a form that preserves temporal and spectral structure while remaining computationally tractable.
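To make the notion of sample-level tokenization concrete, the sketch below discretizes each time point of a standardized signal into one of a fixed number of amplitude bins and maps tokens back to bin centres for reconstruction. This is an illustrative, non-learnable scheme only: the vocabulary size, clipping range, and function names are assumptions made for the example, not the tokenizers evaluated in this work.

```python
import numpy as np

def tokenize_uniform(x, n_tokens=256, clip_std=4.0):
    """Map each time sample of a 1-D signal to a discrete token id.

    Illustrative non-learnable sample-level tokenizer: z-score the channel,
    clip extreme amplitudes, and bin uniformly. Vocabulary size and clipping
    range are assumptions for this example.
    """
    x = (x - x.mean()) / x.std()
    x = np.clip(x, -clip_std, clip_std)
    edges = np.linspace(-clip_std, clip_std, n_tokens + 1)
    return np.digitize(x, edges[1:-1])  # token ids in [0, n_tokens - 1]

def detokenize_uniform(tokens, n_tokens=256, clip_std=4.0):
    """Invert the mapping (lossily) by returning each bin's centre."""
    edges = np.linspace(-clip_std, clip_std, n_tokens + 1)
    centres = (edges[:-1] + edges[1:]) / 2
    return centres[tokens]

# Toy MEG-like channel: 2 s at 250 Hz with a 10 Hz oscillation plus noise.
t = np.arange(0, 2, 1 / 250)
signal = np.sin(2 * np.pi * 10 * t) + 0.3 * np.random.randn(t.size)
tokens = tokenize_uniform(signal)
recon = detokenize_uniform(tokens)
z = (signal - signal.mean()) / signal.std()
print(np.corrcoef(z, recon)[0, 1])  # quantization error is small in this toy case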
Existing tokenization strategies for M/EEG LNMs have largely been drawn from general-purpose time-series modeling and can be broadly categorized by the temporal resolution of the resulting tokens. Sample-level tokenizers map each time point to a token, preserving native temporal resolution and spectral content, whereas non-sample-level tokenizers aggregate information across time, compressing temporal and spectral structure into higher-level tokens (see Section II for a review of prior work). While both approaches have been adopted in recent studies (cf. [17], [18]), most tokenization strategies were originally developed for non-biological time series such as retail, finance, or epidemiology, whose statistical properties differ substantially from those of neural signals. This lack of modality-specific design principles raises the question of whether these tokenization strategies faithfully capture the structure of M/EEG data, which exhibit oscillatory dynamics, structured spectral organization, and approximately Gaussian amplitude distributions [19]. Furthermore, current practice lacks consensus across the field, with tokenization choices often inherited from prior work or driven by architectural convenience rather than systematic evaluation.
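The distinction between the two categories can also be seen directly in the token shapes they produce: a sample-level tokenizer emits one token per time point and channel, whereas a patch-based, non-sample-level tokenizer emits one token per window of consecutive samples, reducing temporal resolution by the patch length. The sketch below uses hypothetical dimensions (64 sensors, 1000 samples, patches of 25 samples) purely for illustration.

```python
import numpy as np

C, T, patch_len = 64, 1000, 25        # hypothetical sensors, samples, patch length
x = np.random.randn(C, T)             # stand-in for a preprocessed MEG segment

# Sample-level: one token per sample and channel; the temporal (and hence
# spectral) resolution of the original signal is fully preserved.
sample_token_grid = (C, T)            # 64 x 1000 token ids

# Non-sample-level (patch-based): each token summarizes patch_len consecutive
# samples, e.g. via a learned embedding, compressing temporal structure.
patches = x.reshape(C, T // patch_len, patch_len)
patch_token_grid = patches.shape[:2]  # 64 x 40 token ids

print(sample_token_grid, patch_token_grid)
```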
To date, no study has systematically examined how tokenization strategies affect representational fidelity, generative behavior, and downstream task performance of the subsequent foundation model for neural time series. In this work, we address this gap by evaluating tokenization methods along two complementary axes. First, we assess their ability to represent continuous neural signals in a low-dimensional discrete space without information loss, quantified via reconstruction accuracy. Second, we pretrain a generative pretrained transformer (GPT)-style foundation model [20], [21] and examine how tokenization influences model behavior by evaluating (i) token prediction accuracy, (ii) biological plausibility of the generated neural data, (iii) capacity to capture subject-specific signatures and inter-subject variability, and (iv) performance on downstream decoding tasks under zero-shot and fine-tuning settings.
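As a concrete, generic illustration of the first evaluation axis, reconstruction fidelity can be quantified per channel by comparing the original signal with its de-tokenized reconstruction, for example via Pearson correlation and variance explained. The helper below is a sketch of such a metric, not necessarily the exact formulation used in our experiments.

```python
import numpy as np

def reconstruction_fidelity(original, reconstructed):
    """Per-channel Pearson correlation and variance explained between an
    original signal and its de-tokenized reconstruction.

    Both inputs are arrays of shape (channels, time). This is a generic
    sketch of a reconstruction metric, not this paper's exact definition.
    """
    o = original - original.mean(axis=1, keepdims=True)
    r = reconstructed - reconstructed.mean(axis=1, keepdims=True)
    corr = (o * r).sum(axis=1) / np.sqrt((o**2).sum(axis=1) * (r**2).sum(axis=1))
    resid = ((original - reconstructed) ** 2).sum(axis=1)
    var_explained = 1.0 - resid / (o**2).sum(axis=1)
    return corr, var_explained
```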
We focus exclusively on sample-level tokenization and defer the analysis of non-sample-level approaches to future work. Although non-sample-level tokenizers are widely employed in M/EEG foundation modeling [10], [13], sample-level tokenization offers several conceptual and practical advantages. First, by avoiding temporal compression, it preserves the temporal and spectral resolution of the signal. When applied independently to each sensor or channel, it also retains spatial resolution.