One Brain, Omni Modalities: Towards Unified Non-Invasive Brain Decoding with Large Language Models


Authors: Changli Tang, Shurui Li, Junliang Wang, Qinfan Xiao, Zhonghao Zhai, Lei Bai, Yu Qiao, Bowen Zhou, Wen Wu, Yuanning Li, Chao Zhang

Changli Tang^3, Shurui Li^{1,2}, Junliang Wang^1, Qinfan Xiao^{1,3}, Zhonghao Zhai^3, Lei Bai^1, Yu Qiao^1, Bowen Zhou^{1,3}, Wen Wu^1, Yuanning Li^{1,2}, Chao Zhang^{1,3}

^1 Shanghai AI Laboratory  ^2 ShanghaiTech University  ^3 Tsinghua University

Abstract

Deciphering brain function through non-invasive recordings requires synthesizing complementary high-frequency electromagnetic (EEG/MEG) and low-frequency metabolic (fMRI) signals. However, despite their shared neural origins, extreme discrepancies have traditionally confined these modalities to isolated analysis pipelines, hindering a holistic interpretation of brain activity. To bridge this fragmentation, we introduce NOBEL, a neuro-omni-modal brain-encoding large language model (LLM) that unifies these heterogeneous signals within the LLM's semantic embedding space. Our architecture integrates a unified encoder for EEG and MEG with a novel dual-path strategy for fMRI, aligning non-invasive brain signals and external sensory stimuli into a shared token space, then leverages an LLM as a universal backbone. Extensive evaluations demonstrate that NOBEL serves as a robust generalist across standard single-modal tasks. We also show that the synergistic fusion of electromagnetic and metabolic signals yields higher decoding accuracy than unimodal baselines, validating the complementary nature of multiple neural modalities. Furthermore, NOBEL exhibits strong capabilities in stimulus-aware decoding, effectively interpreting visual semantics from multi-subject fMRI data on the NSD and HAD datasets while uniquely leveraging direct stimulus inputs to verify causal links between sensory signals and neural responses. NOBEL thus takes a step towards unifying non-invasive brain decoding, demonstrating the promising potential of omni-modal brain understanding.
Correspondence to: Chao Zhang <cz277@tsinghua.edu.cn>.

Preprint.

1. Introduction

The human brain operates as an intricate biological processor, orchestrating complex neural activities that encode our perceptions, thoughts, and physiological states. To non-invasively decipher these mechanisms, researchers primarily rely on two distinct categories of signals: high-frequency electromagnetic fluctuations arising from neuronal firing, captured via electroencephalography (EEG) and magnetoencephalography (MEG), and low-frequency metabolic changes reflecting hemodynamic responses, captured via functional magnetic resonance imaging (fMRI). Driven by the success of representation learning, recent years have witnessed a surge in foundation models tailored for these individual modalities. Significant progress has been made in building specialized models for EEG (Xiao et al., 2025; Peh et al., 2022; Song et al., 2021; Jing et al., 2023; Yang et al., 2023; Jiang et al., 2024; Wang et al., 2025b; Ouahidi et al., 2025; Yi et al., 2023), MEG (Xiao et al., 2025; Hirano et al., 2024), and fMRI (Wang et al., 2025a; Scotti et al., 2024; Kawahara et al., 2017; Li et al., 2021; Kan et al., 2022b;a; Wei et al., 2025; Dong et al., 2025; Kim et al., 2023; Ozcelik & VanRullen, 2023; Scotti et al., 2023; Chen et al., 2025; Tang et al., 2023; Horikawa, 2025; Fu et al., 2025; Takagi & Nishimoto, 2023), each demonstrating strong capabilities in decoding specific aspects of brain function.

Despite these advances, the field remains fragmented. While electromagnetic and metabolic signals are merely different manifestations of the same underlying neural processes, they are traditionally treated as isolated domains due to their extreme heterogeneity.
EEG and MEG offer millisecond-level temporal resolution but poor spatial localization, whereas fMRI provides rich spatial details but suffers from significant temporal lag. Theoretically, a unified analysis of these complementary modalities could unlock a holistic view of brain dynamics, yet current frameworks lack the capability to bridge this "frequency gap" effectively. Furthermore, neural activity does not occur in a vacuum; it is deeply grounded in the external sensory environment. However, most recent brain foundation models aim to learn task-agnostic neural representations and therefore do not explicitly incorporate task-specific or stimulus-level contextual information. As a result, the rich structure of the evoking stimuli is often underutilized during representation learning. Integrating these external stimuli as contextual input is crucial, as it provides the ground truth necessary to align complex brain signals with high-level semantic concepts, thereby raising the ceiling of decoding performance.

To address these challenges, we introduce NOBEL, a neuro-omni-modal brain-encoding LLM designed to bridge the chasm between high-frequency electromagnetic activities, low-frequency metabolic states, and external sensory stimuli. We conceptualize brain signal decoding as a multi-modal alignment problem, leveraging strong LLMs to integrate heterogeneous data sources. Specifically, NOBEL employs a modular encoding strategy to project diverse signals into a shared semantic space compatible with the LLM. For high-frequency data, BrainOmni (Xiao et al., 2025) is used to extract temporal features from EEG and MEG. For fMRI signals, we propose a novel dual-path encoding mechanism, employing NeuroSTORM (Wang et al.
, 2025a) to capture global, static subject-dependent meta-information and the MindEye2 structure (Scotti et al., 2024) to extract dynamic, stimulus-evoked perceptual features. By simultaneously ingesting these brain signals alongside the corresponding visual or auditory stimuli, NOBEL translates multi-modal inputs into unified tokens, allowing users to control decoding tasks and generate insights via natural language prompts.

Our contributions are summarized as follows:

• First Omni-Modal Brain Foundation Model: We present NOBEL, a unified framework that supports EEG, MEG, and fMRI processing. By orchestrating these heterogeneous signals within a single LLM, NOBEL achieves competitive performance across a wide range of benchmarks compared to modality-specific specialists.

• Novel Dual-Path fMRI Encoding: We propose a hybrid encoding architecture specifically optimized for metabolic signals to disentangle the extraction of static physiological meta-information from dynamic, stimuli-responsive features, effectively capturing the rich spatiotemporal information embedded in low-frequency fMRI data.

• Stimuli-Aware Task Generalization: We demonstrate that NOBEL not only effectively performs stimuli decoding tasks but is also able to leverage external audio-visual stimuli as contextual anchors to tackle tasks beyond the reach of previous models, such as verifying whether a subject is perceiving a specific target image or video.

The rest of the paper is structured as follows. Section 2 reviews existing research and foundation models for EEG/MEG and fMRI signals. Section 3 details the methodology of NOBEL, describing the unified model architecture and the multi-stage training pipeline. Section 4 provides the experimental setup, including data specifications and implementation details. Section 5 presents comprehensive experimental results. We conclude in Section 6.

2. Related Work

2.1.
Foundation Models for EEG and MEG Signals

The analysis of non-invasive high-frequency electromagnetic brain signals, specifically EEG and MEG, has historically been challenged by low signal-to-noise ratios (SNRs) and the complex spatial-temporal dynamics of neuronal firing. Early deep learning approaches predominantly focused on designing specialized architectures to tackle specific clinical tasks. For instance, CNN-Transformer (Peh et al., 2022) integrated CNNs for local feature extraction with Transformers for global context to automate artifact detection, while ST-Transformer (Song et al., 2021) utilized a similar hybrid structure to decode spatial-temporal features from EEG. In clinical settings, SPaRCNet (Jing et al., 2023) demonstrated that dense convolutional networks could achieve expert-level performance in classifying seizures and rhythmic periodic patterns. Similarly, in the MEG domain, FAMED (Hirano et al., 2024) employed deep learning for the automatic detection and dipole estimation of epileptic discharges, proving the efficacy of deep networks in magnetic signal analysis.

Despite the success of these supervised methods, their reliance on extensive annotated data limited their scalability. This led to a shift towards self-supervised learning (SSL), where models learn robust representations from unlabeled data. MMM (Yi et al., 2023) maps any EEG channel distribution into a unified topology and conducts multi-stage pre-training. ContraWR (Yang et al., 2023) applies contrastive learning to EEG, enabling automated sleep staging with reduced reliance on expert labels. LaBraM (Jiang et al., 2024) further scaled pre-training through vector-quantized neural spectrum prediction. More recently, CBraMod (Wang et al., 2025b) introduced a criss-cross attention mechanism to better capture the orthogonal relationships between spatial and temporal dimensions in EEG data. REVE (Ouahidi et al.
, 2025) utilized a 4D positional encoding scheme to process signals of any length and electrode arrangement.

Notably, neuronal activity underpins human brain function by generating electrical currents in the cortex, which in turn produce secondary electrical and magnetic fields measurable by EEG and MEG. Although these signals stem from the same underlying neural sources, previous foundation models have largely ignored this biophysical unity and treated electric and magnetic modalities as separate domains. The recently proposed BrainOmni (Xiao et al., 2025) bridges this gap by mapping diverse sensor configurations into a shared latent space and establishes a unified EMEG encoder.

2.2. Foundation Models for fMRI Signals

While EEG and MEG capture rapid neural dynamics, fMRI provides spatially resolved metabolic insights. The development of deep learning models for fMRI has largely been dictated by data representation strategies. Traditionally, to mitigate the curse of dimensionality, fMRI data is reduced to region-of-interest (ROI)-level time-series extracted from pre-defined brain atlases. From these time-series, various low-dimensional matrix representations like functional connectivity or functional gradients can be derived, which summarize inter-regional relationships. Early supervised models treated these matrix representations as grid-like structures or graphs; for instance, BrainNetCNN (Kawahara et al., 2017) pioneered the use of specialized convolutional filters to capture topological features, while subsequent works like BrainGNN (Li et al., 2021) and FBNETGEN (Kan et al., 2022a) adopted graph neural networks to improve interpretability and task-aware network generation. With the advent of Transformers, architectures such as the Brain Network Transformer (Kan et al.
, 2022b) were proposed to learn pair-wise connection strengths among ROIs by utilizing connection profiles as node features and incorporating an orthonormal clustering readout. Recent efforts have attempted to scale this paradigm to foundation model settings, with Brain Harmony (Dong et al., 2025) aligning multi-site data via unified ROI embeddings and fMRI-LM (Wei et al., 2025) exploring the intersection of ROI sequences and language models. However, these ROI-based methods suffer from a fundamental bottleneck: the reliance on fixed atlases introduces rigid inductive biases and inconsistent preprocessing pipelines, limiting their ability to generalize across datasets processed with different parcellation schemes.

In contrast, directly modeling raw 4D voxel data offers a promising alternative that avoids manual parcellation and preserves fine-grained neural information. This volume-based paradigm treats fMRI as dense spatiotemporal videos rather than simplified graphs. SwiFT (Kim et al., 2023) successfully applied Swin Transformers to 4D fMRI clips, showing the feasibility of learning hierarchical spatiotemporal representations without manual feature engineering. Pushing this direction further, NeuroSTORM (Wang et al., 2025a) extracts robust, fine-grained voxel features that generalize well to downstream tasks such as phenotype prediction and disease diagnosis, by pre-training on massive multi-subject datasets using masked autoencoding.

Distinct from methods that primarily focus on analyzing physiological states, a parallel stream of research aims to explicitly decode perceptual content, such as decoding visual or auditory stimuli from brain activity. For decoding continuous stimuli (e.g., long natural speech or videos), models (Tang et al., 2023; Fu et al., 2025) typically operate directly on the fMRI time series to translate fMRI signals into perceived continuous stimuli.
In contrast, for event-related stimuli, event-related neural representations are commonly used, such as general linear model (GLM)-derived beta weights (Takagi & Nishimoto, 2023) or trial-averaged activations (Chen et al., 2025; Horikawa, 2025), focusing on the semantic alignment between brain activity and external stimuli. Generative approaches like BrainDiffuser (Ozcelik & VanRullen, 2023) and MindEye (Scotti et al., 2023) are also based on beta weights and have achieved superior results in reconstructing perceived images by aligning fMRI embeddings with the CLIP latent space (Radford et al., 2021) used to condition diffusion models. Similarly, NeuroBind (Yang et al., 2024) leverages CLIP embeddings as a universal anchor to unify diverse neural modalities into a shared semantic space. The recently proposed MindEye2 (Scotti et al., 2024) further enhances neural-visual alignment, enabling high-fidelity reconstruction with less training data.

3. Methodology

3.1. Model Architecture

NOBEL is a unified foundation model designed to decode omni-modal non-invasive brain signals, ranging from low-frequency fMRI activities to high-frequency EEG and MEG signals, within a shared semantic space. As illustrated in Fig. 1, the framework operates as a general-purpose interpreter, which accepts a flexible combination of fMRI, EEG, MEG, and external stimuli, controlled via user-provided textual prompts to generate natural language responses. To maintain signal granularity and task flexibility, fMRI signals are processed as raw 4D voxels and GLM beta weights.

To bridge the substantial domain gap between biological signals and the LLM's textual embedding space, we design specialized encoding streams that converge via a unified alignment mechanism. For metabolic signals, we implement a dual-path strategy to explicitly disentangle physiological state information from task-related semantic information.
The first path is task-agnostic, designed to capture universal physiological features embedded in raw voxels. We employ a universal encoder $\mathcal{E}_{\mathrm{vox}}$ based on NeuroSTORM (Wang et al., 2025a) to process the raw 4D input $X_{\mathrm{vox}}$. The second path is task-specific, aiming to extract high-level semantic content from GLM beta weights. Recognizing that semantic representations vary significantly across different modalities (e.g., viewing an image and watching a video), we instantiate a set of task-specific encoders $\{\mathcal{E}^{(k)}_{\mathrm{beta}}\}_{k=1}^{K}$, where each encoder corresponds to a specific stimulus domain $k$. To map these heterogeneous features into the LLM's input space, each encoding branch is followed by a modality aligner $\mathrm{Aligner}(\cdot)$, implemented as a 3-layer multilayer perceptron (MLP) with GELU activation (Hendrycks & Gimpel, 2016).

[Figure 1. Overview of the NOBEL architecture. The model integrates heterogeneous brain signals and external stimuli into an LLM, features dual-path fMRI encoders for disentangling metabolic features (raw voxels vs. beta weights) and a unified EEG/MEG encoder for electromagnetic dynamics. All brain modalities are projected into the LLM's input space via modality-specific 3-layer MLP aligners. The system supports flexible input combinations and is optimized via LoRA for instruction-following tasks.]

Consequently, the unified metabolic tokens $H_{\mathrm{vox}}$ and the
task-specific semantic tokens $H^{(k)}_{\mathrm{beta}}$ are formally derived as:

$H_{\mathrm{vox}} = \mathrm{Aligner}_{\mathrm{vox}}\big(\mathcal{E}_{\mathrm{vox}}(X_{\mathrm{vox}})\big)$,  (1)

$H^{(k)}_{\mathrm{beta}} = \mathrm{Aligner}^{(k)}_{\mathrm{beta}}\big(\mathcal{E}^{(k)}_{\mathrm{beta}}(X^{(k)}_{\mathrm{beta}})\big)$,  (2)

where $X^{(k)}_{\mathrm{beta}}$ denotes the beta weights associated with the $k$-th task. This design allows NOBEL to simultaneously leverage global physiological context and precise, stimulus-aligned semantic details.

Complementing the metabolic perspective, we model high-frequency electromagnetic signals using a unified EEG/MEG encoder. Although EEG and MEG measure different physical quantities (electrical potentials and magnetic flux, respectively), both arise from the same underlying neuronal current sources. Exploiting this shared biophysical origin, we adopt the BrainOmni backbone $\mathcal{E}_{\mathrm{em}}$ to extract unified temporal-spectral representations from the input sequence $X_{\mathrm{em}}$. The resulting electromagnetic tokens $H_{\mathrm{em}}$ are then mapped into the semantic space of the LLM through a modality-specific three-layer MLP aligner as:

$H_{\mathrm{em}} = \mathrm{Aligner}_{\mathrm{em}}\big(\mathcal{E}_{\mathrm{em}}(X_{\mathrm{em}})\big)$.  (3)

Furthermore, to ground these abstract neural representations in their perceptual context, the model incorporates external stimuli via a stimuli-aware encoding stream. Similar to the task-specific fMRI path, distinct encoders $\mathcal{E}^{(m)}_{\mathrm{stim}}$ are employed for different stimulus modalities $m$ (e.g., visual or audio) to process the raw media inputs $X^{(m)}_{\mathrm{stim}}$. The modality-specific stimulus tokens $H^{(m)}_{\mathrm{stim}}$ are obtained as:

$H^{(m)}_{\mathrm{stim}} = \mathcal{E}^{(m)}_{\mathrm{stim}}(X^{(m)}_{\mathrm{stim}})$.  (4)

The integration framework is inherently flexible, allowing NOBEL to process arbitrary combinations of available inputs. We dynamically construct the LLM input sequence $S$ by concatenating the user prompt $T_{\mathrm{prompt}}$ with the aligned tokens of all present modalities.
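As a concrete illustration, the 3-layer GELU aligner used in Eqs. (1)-(3) can be sketched as follows. This is a minimal NumPy mock-up: the hidden width, initialization, and LLM embedding dimension shown here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (Hendrycks & Gimpel, 2016)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class ModalityAligner:
    """3-layer MLP with GELU activations that projects encoder features
    into the LLM embedding space. Dimensions are illustrative only."""
    def __init__(self, enc_dim, llm_dim, hidden_dim=256, seed=0):
        rng = np.random.default_rng(seed)
        dims = [enc_dim, hidden_dim, hidden_dim, llm_dim]
        self.weights = [rng.normal(0.0, 0.02, (d_in, d_out))
                        for d_in, d_out in zip(dims, dims[1:])]
        self.biases = [np.zeros(d) for d in dims[1:]]

    def __call__(self, feats):
        # feats: (num_tokens, enc_dim) -> (num_tokens, llm_dim)
        h = feats
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            h = h @ w + b
            if i < len(self.weights) - 1:  # GELU between layers, not after the last
                h = gelu(h)
        return h

# Hypothetical sizes: 64-d encoder tokens aligned to a 128-d LLM space.
aligner_vox = ModalityAligner(enc_dim=64, llm_dim=128)
h_vox = aligner_vox(np.random.default_rng(1).normal(size=(16, 64)))
print(h_vox.shape)  # (16, 128)
```

One such aligner is instantiated per branch (voxel, each beta encoder, EEG/MEG), so each modality learns its own projection into the shared token space.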
Let $\mathcal{M}$ and $\mathcal{K}$ denote the sets of available stimulus modalities and task-specific beta inputs, and let $\mathbb{I}(\cdot) \in \{0, 1\}$ denote the indicator function for the presence of the corresponding modality. The final input token sequence to the LLM is formulated as:

$S = T_{\mathrm{prompt}} \oplus \big(\oplus_{m \in \mathcal{M}} H^{(m)}_{\mathrm{stim}}\big) \oplus \big(\mathbb{I}_{\mathrm{em}} \cdot H_{\mathrm{em}}\big) \oplus \big(\oplus_{k \in \mathcal{K}} H^{(k)}_{\mathrm{beta}}\big) \oplus \big(\mathbb{I}_{\mathrm{vox}} \cdot H_{\mathrm{vox}}\big)$,  (5)

where $\oplus$ denotes the sequence concatenation operator, and terms with $\mathbb{I} = 0$ or empty sets are omitted. To efficiently adapt the pre-trained multimodal LLM to the neuroscientific domain, low-rank adaptation (LoRA) (Hu et al., 2022) is used, where we freeze the LLM's original parameters and inject trainable low-rank matrices into the attention layers.

3.2. Training Pipeline

Building a unified model that comprehends heterogeneous brain signals and stimuli requires large-scale multimodal training. NOBEL leverages Qwen2.5-Omni (Xu et al., 2025) as the backbone, whose visual and auditory encoders are directly inherited and thus inherently aligned with the LLM's semantic space. While this ensures robust handling of external stimuli, the task-specific fMRI branch faces a unique challenge: decoding subtle, stimulus-evoked perceptual information via beta weights is significantly more difficult than extracting global physiological meta-information due to the low SNR. Directly mapping these noisy, high-dimensional features to the LLM space often leads to optimization instability and poor semantic grounding. To address this, we implement a targeted pre-alignment strategy for the fMRI beta branch prior to the joint multimodal training. Drawing inspiration from MindEye2 (Scotti et al., 2024), we prioritize aligning the noisy neural representations with the robust sensory features already encoded by Qwen2.5-Omni.
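The dynamic sequence construction of Eq. (5) amounts to conditional concatenation over whichever modalities are present. A minimal sketch (the function name and placeholder string tokens are ours; in the real model each entry would be a sequence of aligned embeddings):

```python
def build_llm_input(prompt_tokens, stim_tokens=None, em_tokens=None,
                    beta_tokens=None, vox_tokens=None):
    """Assemble the LLM input sequence S per Eq. (5): prompt, then stimulus
    tokens (over modalities m), EEG/MEG tokens, task-specific beta tokens
    (over tasks k), and raw-voxel tokens. Absent modalities are omitted."""
    seq = list(prompt_tokens)
    for toks in (stim_tokens or {}).values():   # concat over m in M
        seq += toks
    if em_tokens:                               # I_em * H_em
        seq += em_tokens
    for toks in (beta_tokens or {}).values():   # concat over k in K
        seq += toks
    if vox_tokens:                              # I_vox * H_vox
        seq += vox_tokens
    return seq

# Example: fMRI-only input (image-task beta weights + raw voxels),
# no EEG/MEG and no external stimuli.
s = build_llm_input(
    prompt_tokens=["<prompt>"],
    beta_tokens={"image": ["<beta_img_0>", "<beta_img_1>"]},
    vox_tokens=["<vox_0>"],
)
print(s)  # ['<prompt>', '<beta_img_0>', '<beta_img_1>', '<vox_0>']
```

Because absent branches simply contribute nothing, the same assembly logic serves every input combination, from unimodal EEG decoding to joint fMRI-plus-stimulus verification.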
The output features of the beta encoders and the corresponding frozen stimuli encoders are projected into a shared low-dimensional latent space. A contrastive objective is applied to maximize the similarity between the brain signal embeddings and their corresponding ground-truth stimulus embeddings. To mitigate the limited sample size in fMRI datasets and improve generalization, we additionally adopt MixCo-style neural feature mixing (Li et al., 2025) during pre-alignment, which interpolates neural representations and their corresponding stimulus embeddings as a form of cross-modal regularization. This process effectively utilizes the high-quality visual/audio representations as "semantic anchors" to stabilize the fMRI features. Only after this cross-modal correspondence is established are the pre-aligned features projected into the LLM input space via the MLP aligners.

With the stimulus-related branches pre-aligned, the framework proceeds to unified omni-modal instruction tuning. In this stage, data from all modalities, including EEG, MEG, raw fMRI voxels, fMRI beta weights, and external stimuli, are integrated into the dynamic sequence generation process. All decoding tasks are formulated as brain-to-text generation problems, allowing the model to learn synergistic representations across modalities. The entire system is optimized using the standard autoregressive next-token prediction objective. Given the target text response $Y = \{y_1, y_2, \ldots, y_L\}$, the loss function is formulated as:

$\mathcal{L} = -\sum_{t=1}^{L} \log P(y_t \mid y_{<t}, S)$,  (6)

where $S$ is the multimodal input sequence of Eq. (5).
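A simplified sketch of this MixCo-style contrastive pre-alignment is given below, assuming brain and stimulus features already projected to a shared (N, D) latent space. The mixing coefficient, temperature, and soft-target form follow the general MixCo recipe; none of these values are specified in the excerpt above.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mixco_contrastive_loss(brain, stim, lam=0.7, temp=0.1, seed=0):
    """Interpolate brain embeddings with a shuffled copy (neural feature
    mixing), then apply an InfoNCE-like loss whose soft targets split
    probability mass lam/(1 - lam) between the two source stimuli."""
    n = brain.shape[0]
    perm = np.random.default_rng(seed).permutation(n)
    mixed = lam * brain + (1.0 - lam) * brain[perm]        # MixCo-style mixing
    logits = normalize(mixed) @ normalize(stim).T / temp   # (N, N) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(n)
    # soft targets: lam on the sample's own stimulus, (1 - lam) on its partner's
    return -(lam * log_probs[idx, idx] + (1.0 - lam) * log_probs[idx, perm]).mean()

# Toy aligned pairs: stimulus embeddings are noisy copies of brain embeddings.
brain = np.random.default_rng(1).normal(size=(8, 32))
stim = brain + 0.05 * np.random.default_rng(2).normal(size=(8, 32))
loss = mixco_contrastive_loss(brain, stim)
```

In the paper's pipeline, `stim` would come from the frozen Qwen2.5-Omni stimulus encoders, so the gradient only reshapes the beta-encoder side toward these fixed semantic anchors.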
