NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning

Guoan Wang¹†, Shihao Yang¹†, Jun-en Ding¹, Hao Zhu¹, Feng Liu¹*

¹Department of Systems Engineering, Stevens Institute of Technology, 1 Castle Point Terrace, Hoboken, 07030, New Jersey, USA.

*Corresponding author(s). E-mail(s): fliu22@stevens.edu; Contributing authors: gwang31@stevens.edu; syang57@stevens.edu; jding17@stevens.edu; hzhu38@stevens.edu; †These authors contributed equally to this work.

Abstract

Electroencephalography (EEG) provides a non-invasive window into neural dynamics at high temporal resolution and plays a pivotal role in clinical neuroscience research. Despite this potential, prevailing computational approaches to EEG analysis remain largely confined to task-specific classification objectives or coarse-grained pattern recognition, offering limited support for clinically meaningful interpretation. To address these limitations, we introduce NeuroNarrator, the first generalist EEG-to-text foundation model designed to translate electrophysiological segments into precise clinical narratives. A cornerstone of this framework is the curation of NeuroCorpus-160K, the first harmonized large-scale resource pairing over 160,000 EEG segments with structured, clinically grounded natural-language descriptions. Our architecture first aligns temporal EEG waveforms with spatial topographic maps via a rigorous contrastive objective, establishing spectro-spatially grounded representations. Building on this grounding, we condition a Large Language Model through a state-space-inspired formulation that integrates historical temporal and spectral context to support coherent clinical narrative generation.
This approach establishes a principled bridge between continuous signal dynamics and discrete clinical language, enabling interpretable narrative generation that facilitates expert interpretation and supports clinical reporting workflows. Extensive evaluations across diverse benchmarks and zero-shot transfer tasks highlight NeuroNarrator's capacity to integrate temporal, spectral, and spatial dynamics, positioning it as a foundational framework for time-frequency-aware, open-ended clinical interpretation of electrophysiological data.

Keywords: Multimodal Foundation Model, EEG-to-Text Generation, Spectro-Spatial Grounding, State-Space Reasoning

1 Introduction

Electroencephalography (EEG) provides a non-invasive window into human brain activity, enabling high-temporal-resolution measurement of neural dynamics at the millisecond scale [1, 2]. Compared to other neuroimaging modalities such as functional magnetic resonance imaging (fMRI) [3], magnetoencephalography (MEG) [4], or intracranial EEG (iEEG) [5], EEG offers a favorable trade-off among accessibility, temporal resolution, and cost, making it widely adopted across clinical diagnostics, cognitive neuroscience, and brain-computer interface (BCI) applications [6]. As a result, EEG has been extensively investigated in a wide range of downstream tasks, including seizure and epilepsy classification [7], motor imagery recognition [8], sleep stage classification [9], and cognitive workload assessment [10]. With the rapid advancement of deep learning, neural network-based approaches have been widely adopted for EEG analysis, yielding substantial performance gains across a variety of downstream tasks. Despite these advances, most EEG deep learning models are still optimized for narrowly defined downstream objectives [11, 12].
Although recent foundation model efforts have begun to leverage large-scale pretraining [13], EEG-to-text interpretation remains underexplored beyond rigid formulations constrained by labels or templates, particularly in settings that require open vocabulary and free-form clinical descriptions [14]. In parallel, recent studies have investigated EEG-to-text interpretation in decoding-oriented settings, where textual targets correspond to external stimuli, experimental instructions, or subjects' linguistic outputs [15]. However, clinical EEG interpretation does not aim to decode latent semantic content from neural activity, but to describe and contextualize electrophysiological patterns as interpreted by clinical experts. Moreover, existing studies that link EEG recordings to clinical narratives typically operate at the recording level, overlooking the fine-grained temporal evolution that underlies clinically meaningful electrographic dynamics and thereby limiting their ability to capture and interpret relevant biomarkers [16, 17].

Motivated by the requirements of clinical EEG reporting, where diagnostically relevant EEG features are often transient, sparse, and should be described with explicit time anchors [18, 19], we pose EEG-to-text interpretation at the level of short segments rather than entire recordings. Temporal localization and time-resolved interpretation are essential, because recording-level generation can blur transient electrographic events and disrupt the temporal evidence needed for clinician verification [20].

Fig. 1 Illustrative example of segment-level, clinically grounded EEG interpretation. In contrast to coarse recording-level interpretation, this sample demonstrates the generation of a fine-grained clinical narrative for a 10-second segment.
The generated text systematically synthesizes four dimensions of electrophysiological analysis: (i) Event & Clinical Labels, identifying specific morphological abnormalities (e.g., spike-and-slow-wave complexes); (ii) Spatial Energy Distribution, capturing prominent signal and energy variations in specific anatomical regions (e.g., right fronto-temporal); (iii) Frequency-Domain Features, identifying the dominant spectral bands; and (iv) Temporal Context, characterizing the non-stationary evolution of brain states relative to preceding segments.

Beyond event detection, we formulate EEG-to-text interpretation as the generation of open-vocabulary, segment-level descriptions that explicitly preserve waveform morphology, spectral structure, and spatial topography (core elements of expert clinical reading) and thereby support compositional generalization across diverse electrophysiological patterns. Such time-resolved, signal-grounded narratives are intended to support workflow by directing attention to suspicious epochs for rapid review and by facilitating standardized documentation, thereby reducing reporting burden while leaving final adjudication to experts.

In this work, we introduce NeuroNarrator, the first generalist EEG-to-text foundation model designed to bridge electrophysiological recordings and clinically meaningful natural-language interpretation, as exemplified in Fig. 1. By moving beyond coarse recording-level descriptions, our framework explicitly preserves the fine-grained spectro-spatial structure of transient EEG patterns and models their evolution over time, enabling linguistically grounded narratives that reflect both localized electrophysiological features and temporally evolving brain states.

To achieve this, we implement a unified framework that synergizes large-scale open-vocabulary representation learning with rigorous spectro-spatial grounding.
By integrating these representations within a state-space formulation, we bridge the semantic gap between electrophysiology and clinical narrative through the following core contributions:

1. Construction of NeuroCorpus-160K. We aggregate and standardize 16 heterogeneous datasets into the first large-scale, open-vocabulary EEG-clinical narrative corpus. To ensure robust generalization, we establish a rigorous subject-disjoint training and evaluation split, providing a standardized benchmark for open-vocabulary interpretation beyond closed-set classification.
2. Contrastive Spectro-Spatial Alignment. We introduce a multimodal alignment mechanism that projects temporal EEG waveforms and spatial topographic maps into a shared semantic manifold. This enforces strict correspondence between spectral dynamics and spatial energy distributions, resolving the grounding ambiguity inherent in single-modality encoding.
3. Unified State-Space Generalist Framework. We propose NeuroNarrator, a multimodal large language model (MLLM) architecture that integrates spectro-spatial features with state-space-inspired temporal-spectral priors. By conditioning generation on historical trajectory embeddings, the framework captures evolving brain dynamics (e.g., seizure evolution) within an end-to-end decoding process.

2 Related Work

2.1 Generalist EEG Representation Learning

The landscape of automated EEG analysis has undergone a fundamental transition, shifting from task-specific supervised architectures trained on limited datasets to generalist foundation models pre-trained on massive, heterogeneous corpora via self-supervision. Pioneering efforts, such as BENDR [21], adapted contrastive learning objectives from speech processing (e.g., wav2vec 2.0) to extract transferable representations directly from EEG sequences.
Building on this, recent frameworks have increasingly leveraged Masked Signal Modeling (MSM) to scale pre-training efficacy. Notably, LaBraM [22] introduces vector-quantized neural spectrum prediction, enabling the model to learn generic spectral features by reconstructing masked frequency patches. Concurrently, BIOT [13] addresses the challenge of cross-dataset heterogeneity by tokenizing channels independently, thereby allowing the model to generalize across varying electrode montages and sampling rates.

To further resolve the complex spatiotemporal dynamics of brain activity, specialized attention mechanisms have been proposed. EEGPT [23] employs large-scale transformers to extract hierarchical temporal features from extensive scalp EEG data, while CBraMod [24] utilizes a "criss-cross" attention mechanism to explicitly decouple spatial and temporal dependencies, enhancing structural encoding stability. However, a critical limitation persists: these frameworks encode EEG signals into abstract latent representations that lack explicit semantic grounding. By treating brain activity strictly as numerical time-series rather than clinically meaningful physiological events, they remain disconnected from the open-ended linguistic reasoning required for expert-level interpretation.

2.2 EEG-to-Text Generation and Multimodal Alignment

While vision-language models have established a robust paradigm for grounding visual semantics in natural language, bridging the representational gap between electrophysiology and clinical linguistics remains a nascent frontier. Existing efforts in this domain have predominantly diverged into two streams. The first stream focuses on "brain-to-text" decoding, which aims to reconstruct external stimuli, experimental prompts, or the subject's internal linguistic content from neural activity.
For instance, DeWave [15] introduces discrete codex encoding to translate neural dynamics into open-vocabulary text, while strictly discriminative approaches have explored similar alignment strategies for sentiment classification [25] or cross-lingual decoding [26]. However, clinical interpretation entails a fundamentally different objective: it seeks not to decode latent cognitive content, but to articulate precise descriptions of the electrophysiological phenomena themselves.

The second stream, which attempts to link EEG recordings to clinical narratives, currently faces significant limitations in granularity and generative flexibility. Alignment-centric frameworks, such as EEG-CLIP [16] and recent EEG-language pretraining frameworks [27], primarily leverage cross-modal contrastive learning for retrieval tasks or coarse clinical phenotyping, lacking the capacity for fine-grained descriptive generation. Conversely, generative models like NeuroLex [17] typically operate at the macroscopic recording level. This coarse resolution inherently obscures the fine-grained temporal structure of EEG, failing to ground transient pathological morphologies in precise linguistic descriptions. Similarly, generalist architectures like NeuroLM [14] rely heavily on rigid instruction templates to map signals to task outputs. While effective for closed-set execution, this paradigm restricts the compositional generalization requisite for open-ended interpretation, where a model must synthesize evolving spectro-spatial dynamics into a coherent, non-templated narrative.

2.3 Spectro-Spatial and Temporal Modeling in Neuroscience

Expert clinical EEG interpretation relies on the integration of spectral and topographic evidence, rather than the analysis of isolated time-series.
The semantic validity of an oscillatory feature is determined by its spatial localization; for instance, alpha-band activity (8–12 Hz) constitutes a normative resting rhythm when posteriorly dominant, yet indicates pathology (e.g., "alpha coma") when generalized or anteriorly distributed [28, 29]. This interdependence aligns with EEG microstate theory, which posits that global brain states manifest as quasi-stable topographic potential maps distinct from purely temporal frequency features [30]. Consequently, accurate machine interpretation requires a rigorous coupling of spectral content with spatial topology.

Furthermore, neural activity is governed by non-stationary dynamical systems, where instantaneous observations are conditioned on the trajectory of evolving latent states [31]. Pathological phenomena, such as seizure propagation or shifts in consciousness, are not discrete static events but continuous transitions through a latent state space [32]. However, prevailing deep learning architectures often neglect these neurodynamical principles by flattening spatial topology into generic channel vectors and processing segments as independent snapshots. By failing to model the conditional dependence of the current segment on its historical trajectory, these "black-box" approaches lack the state-space reasoning required to generate narratives that reflect the temporal continuity characteristic of expert clinical reports.

3 Methods

3.1 Overview of the NeuroNarrator Framework

NeuroNarrator is a unified MLLM for segment-level EEG-to-text interpretation, designed to bridge electrophysiological recordings and clinically meaningful natural-language descriptions. The framework integrates complementary temporal and spatial representations of EEG signals and incorporates short-term temporal context to support coherent clinical narration.
An overview of the model architecture and data flow is provided in Fig. 2, with detailed components and learning objectives described in the following sections.

3.2 Construction of a Clinically Grounded EEG-Text Corpus

3.2.1 Multi-Source EEG Data Harmonization

To systematically address the electrophysiological heterogeneity encountered in clinical environments, we constructed NeuroCorpus-160K by harmonizing 16 publicly available datasets across diverse acquisition hardware and protocols (Table 1). As visualized in Fig. 3a, this collection covers the primary clinical application domains of neurological disorder diagnosis, seizure and pathological-event detection, artifact and abnormal-EEG identification, cognitive and affective state assessment, motor-intent decoding, and sleep staging. Such phenotypic and instrumental diversity is critical for mitigating single-source biases and ensuring the model captures a broad spectrum of signal morphologies.

The consolidated NeuroCorpus-160K comprises a total of 483.23 hours of recordings from subjects ranging from early childhood to advanced age. To facilitate rigorous benchmarking, we partitioned the entire corpus at the subject level into a training set (160,249 segments) and a strictly separated held-out evaluation subset (13,714 segments). Detailed split criteria are provided in Section 4.3.

3.2.2 Unified Signal Preprocessing

To mitigate the distributional shifts introduced by varying acquisition hardware and clinical protocols, we implement a rigorous signal standardization pipeline across all constituent datasets of NeuroCorpus-160K, as illustrated in Fig. 3b. We first apply a zero-phase finite impulse response (FIR) band-pass filter (0.1–75 Hz) to all recordings to isolate clinically relevant spectral content while attenuating low-frequency drift and high-frequency noise.
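As a rough illustration, this filtering step, together with the power-line notch filtering and 200 Hz resampling applied next, could be sketched with SciPy. The filter length, notch Q factor, and example sampling rate below are assumptions for illustration, not the paper's exact settings:

```python
import numpy as np
from scipy.signal import firwin, filtfilt, iirnotch, resample_poly

def standardize(eeg, fs, notch_hz=60.0):
    """Band-pass 0.1-75 Hz (zero-phase FIR), notch power-line interference,
    and resample to a fixed 200 Hz. Filter settings here are illustrative."""
    taps = firwin(801, [0.1, 75.0], pass_zero=False, fs=fs)
    eeg = filtfilt(taps, 1.0, eeg, axis=-1)      # forward-backward: zero phase
    b, a = iirnotch(notch_hz, Q=30.0, fs=fs)     # region-specific 50/60 Hz notch
    eeg = filtfilt(b, a, eeg, axis=-1)
    return resample_poly(eeg, 200, int(fs), axis=-1)

x = np.random.randn(19, 2560)                    # 19 channels, 10 s at 256 Hz
y = standardize(x, fs=256.0)
assert y.shape == (19, 2000)                     # 10 s at the target 200 Hz
```

The zero-phase property comes from `filtfilt`'s forward-backward pass, which avoids the phase distortion a single causal pass would introduce.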
This is followed by a region-specific notch filter (50 Hz or 60 Hz) to suppress power-line interference. To ensure temporal uniformity across the heterogeneous corpus, all time-series are strictly resampled to a fixed sampling rate of 200 Hz. Electrode montages were strictly harmonized to the international 10–20 or 10–10 systems to ensure spatial consistency.

Fig. 2 NeuroNarrator architecture for spectro-spatially grounded and temporally coherent EEG-to-text generation. (a) Dual-stream spectro-spatial grounding encodes each EEG segment using a pretrained EEG encoder operating on multichannel waveforms and a frozen vision encoder processing the corresponding scalp topographic map. Modality-specific features are projected into a shared latent space and aligned via a contrastive objective, enforcing correspondence between spectral dynamics and spatial energy distributions. (b) State-space-inspired generative modeling conditions text generation on both the aligned spectro-spatial embedding of the current segment and a short trajectory of preceding EEG segments, serving as a proxy for latent brain-state evolution. These continuous embeddings are injected as soft prompt tokens, replacing designated placeholder positions in the language-model prompt alongside task instructions, enabling the synthesis of clinically grounded narratives that preserve waveform morphology, dominant frequency structure, spatial localization, and temporal dynamics.

3.2.3 Structured Feature Extraction and LLM-Driven Description Refinement

To systematically bridge the representational gap between continuous electrophysiological dynamics and discrete linguistic descriptions, we implemented a unified pipeline.

Fig. 3 Overview of NeuroCorpus-160K construction. (a) Distribution of the aggregated datasets across major clinical domains.
(b) The unified data processing workflow, which transforms raw recordings into clinically grounded narratives via three stages: signal preprocessing, structured feature extraction, and LLM-driven description refinement.

This pipeline transforms preprocessed EEG signals into high-fidelity clinical narratives (Fig. 3b). Following signal preprocessing, the continuous recordings were partitioned into non-overlapping 10-second segments (x ∈ R^{C×T}). For each segment, we constructed a structured quantitative feature template to serve as a factual scaffold for the subsequent generation process. This template explicitly encodes three complementary dimensions: (i) Event & Clinical Labels, capturing subject-level clinical context (e.g., demographics, diagnosis, task/condition, and dataset-provided event annotations when available); (ii) Frequency-Domain Features, derived from power spectral density (PSD) estimates summarized over canonical frequency bands; and (iii) Spatial Energy Distribution, derived by computing channel-wise power and partitioning electrodes into three distinct intensity tiers via K-means clustering (K = 3).

These structured yet rigid feature representations were subsequently refined into fluent, clinically coherent narratives using GPT-4.1 [48]. By employing a domain-specific prompting strategy, the model was directed to synthesize the quantitative template into a natural language description, performing complex inference tasks such as identifying dominant anatomical regions (e.g., right fronto-temporal) and characterizing the primary oscillatory modes. Additionally, we incorporated implicit temporal context from preceding segments to allow the model to infer and describe evolving dynamic trends, such as the gradual buildup of theta rhythm or the attenuation of alpha activity, reflecting the non-stationary nature of brain states.
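A minimal sketch of this structured feature extraction, combining Welch band powers with K-means (K = 3) channel-energy tiers. The band edges, Welch parameters, and tier-relabeling scheme are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np
from scipy.signal import welch
from sklearn.cluster import KMeans

BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 12),
         "beta": (12, 30), "gamma": (30, 50), "high-gamma": (50, 75)}

def segment_template(x, fs=200.0):
    """Band powers and K=3 channel-energy tiers for one (C, T) segment."""
    f, psd = welch(x, fs=fs, nperseg=int(2 * fs))            # psd: (C, F)
    band_power = {name: psd[:, (f >= lo) & (f < hi)].sum(axis=1).mean()
                  for name, (lo, hi) in BANDS.items()}       # averaged over channels
    ch_power = psd.sum(axis=1, keepdims=True)                # per-channel total power
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ch_power)
    # relabel clusters so 0 = low-, 1 = medium-, 2 = high-energy tier
    order = np.argsort(km.cluster_centers_.ravel())
    tiers = np.argsort(order)[km.labels_]
    return band_power, tiers

bp, tiers = segment_template(np.random.randn(19, 2000))      # 19-channel 10-s segment
```

The per-tier channel lists produced this way correspond to the "high/medium/low-energy channels" fields of the template exemplified in Table 2.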
Table 2 exemplifies the transformation from structured feature templates to final refined descriptions.

Table 1 Constituent datasets and split statistics for NeuroCorpus-160K. Summary of the 16 public EEG datasets integrated into NeuroCorpus-160K, including the associated task and the sizes of the training and held-out test splits, reported as the number of non-overlapping 10-s segments (Samples) and total recording duration (Hours).

| Dataset | Task | Train Samples | Train Hours | Test Samples | Test Hours |
| AD-65 [33] | Alzheimer's Disease Detection | 5,460 | 15.17 | 1,382 | 3.84 |
| ADHD-Child [34] | ADHD Detection | 1,147 | 3.19 | 330 | 0.92 |
| Grasp-and-Lift [35] | Hand Manipulation Event Detection | 2,531 | 7.03 | 894 | 2.48 |
| HMC [36] | Sleep Stage Classification | 25,000 | 69.44 | 1,000 | 2.78 |
| MDD-64 [37] | Major Depressive Disorder Detection | 5,738 | 15.94 | 1,466 | 4.07 |
| Motor Imagery [38] | Motor Imagery Recognition | 11,698 | 32.49 | 1,191 | 3.31 |
| PD31 [39] | Parkinson's Disease Analysis | 601 | 1.67 | 217 | 0.60 |
| Schizophrenia-28 [40] | Schizophrenia Detection | 2,157 | 5.99 | 680 | 1.89 |
| SEED [41] | Emotion Recognition | 8,293 | 23.04 | 1,050 | 2.92 |
| TUAB [42] | Abnormal EEG Detection | 20,000 | 55.56 | 1,000 | 2.78 |
| TUAR [43] | Artifact Classification | 19,999 | 55.55 | 865 | 2.40 |
| TUEP [44] | Epilepsy Detection | 20,390 | 56.64 | 1,000 | 2.78 |
| TUEV [42] | Event-Type Classification | 10,540 | 29.28 | 1,000 | 2.78 |
| TUSL [45] | Slowing-Event Classification | 6,063 | 16.84 | 549 | 1.52 |
| TUSZ [46] | Seizure Event Detection | 20,112 | 55.87 | 1,000 | 2.78 |
| Workload [47] | Mental Workload Assessment | 520 | 1.44 | 90 | 0.25 |
| Total | 16 Tasks | 160,249 | 445.14 | 13,714 | 38.09 |

3.2.4 State-Space-Inspired Temporal Context Modeling

Clinically significant electrophysiological phenomena, including the gradual evolution of a seizure, the onset of drowsiness, or shifts in cognitive workload, are inherently non-stationary. These dynamics are best characterized not as a series of independent snapshots, but as continuous trajectories through a latent manifold of brain states.
To capture this temporal continuity within NeuroCorpus-160K, we formulate the clinically grounded corpus generation process using a state-space perspective. Formally, let the sequence of EEG segments be denoted by {x_t}, where each observation x_t derives from a latent neural state z_t. In standard interpretation paradigms, the clinical description y_t is typically generated via a direct mapping x_t → y_t, effectively discarding historical context. However, the underlying physiological state evolves according to transition dynamics z_t ≈ F(z_{t−1}, x_{t−1}), implying that the accurate interpretation of the current segment x_t is conditionally dependent on its trajectory. To operationalize this, we approximate the latent trajectory by explicitly conditioning the target description for x_t on a local temporal neighborhood. Specifically, we construct the data samples to include the preceding N segments {x_{t−k}}_{k=1}^{N} as a proxy for the system's history.

Table 2 Example of the transformation from structured quantitative scaffolds to clinically grounded narratives. This comparison demonstrates the synthesis of rigid feature templates into a fluent, coherent description. Note that while the Frequency-Domain refinement is conditioned on the spectral history of preceding segments ({x_{t−k}}_{k=1}^{N}) alongside the current segment (x_t) to infer temporal trends, only the current segment's features are listed here for conciseness.

Component: Event & Clinical Labels
  Structured Template: This EEG segment from a 19-year-old female subject is recorded during a mental arithmetic task with cognitive load.
  LLM Refined Description: In this EEG segment from a 19-year-old female subject recorded during a mental arithmetic task with cognitive load, the observed patterns reflect task-related dynamics across frequency bands and scalp regions.

Component: Frequency-Domain Features (conditioned on {x_{t−k}}_{k=1}^{N}, x_t)
  Structured Template: Frequency band powers: delta (0–4 Hz): 39.8676 µV²; theta (4–8 Hz): 11.5915 µV²; alpha (8–12 Hz): 5.7254 µV²; beta (12–30 Hz): 5.2747 µV²; gamma (30–50 Hz): 1.1490 µV²; high-gamma (50–75 Hz): 0.0810 µV².
  LLM Refined Description: The delta frequency band (0–4 Hz) exhibits the highest power and shows a moderate decrease. Theta activity (4–8 Hz) shows a noticeable increase, while the alpha band (8–12 Hz) presents a significant decrease. Stable readings are observed in both the beta (12–30 Hz) and gamma (30–50 Hz) bands, while the high-gamma band (50–75 Hz) remains consistently low.

Component: Spatial Energy Distribution
  Structured Template: Highest-energy channel is C4 (103.2591 µV²). High-energy channels: FZ, C4, PZ. Medium-energy channels: F3, F4, F8, C3, CZ, P3, P4, T4, T6. Low-energy channels: F7, O1, O2, T3, T5.
  LLM Refined Description: The EEG power is predominantly concentrated in the fronto-central and parietal regions, with the C4 channel maintaining the highest energy levels.

3.3 Spectro-Spatial Representation Learning

3.3.1 Dual-Stream Spectro-Spatial Encoding

To capture the distinct yet complementary information inherent in electrophysiological data, we employ a dual-stream encoding architecture that processes temporal dynamics and spatial topology in parallel before projecting them into a shared latent space, as illustrated in Fig. 2a.

Temporal EEG Representation. Let x_t ∈ R^{C×T} denote the t-th preprocessed EEG segment, where C represents the channel dimension and T corresponds to the temporal window (e.g., 2000 time points at 200 Hz). To extract rich temporal dependencies, we utilize a domain-specific EEG encoder, f_eeg(θ_eeg), instantiated as the LaBraM-Base [22] architecture.
We initialize f_eeg with weights pre-trained on large-scale EEG corpora via self-supervised learning, ensuring the model possesses a robust prior for canonical oscillatory patterns and artifact rejection. The encoder processes x_t and aggregates the temporal token sequence via mean pooling to yield a compact latent representation h_t^eeg:

$$h_t^{\mathrm{eeg}} = f_{\mathrm{eeg}}(x_t \mid \theta_{\mathrm{eeg}}) \in \mathbb{R}^{d_1}, \quad (1)$$

where d_1 denotes the hidden dimensionality of the encoder.

Spatial Topographic Representation. Complementing the temporal stream, we explicitly model the spatial distribution of scalp potentials. Drawing inspiration from ConvDip [49], which validated the efficacy of learning spatially organized patterns from topographic projections, we generate a corresponding EEG topographic map i_t ∈ R^{H×W×3} for each segment x_t. We employ the CLIP ViT-Large [50] vision encoder, f_vis(θ_vis), to process this map. Crucially, we keep f_vis frozen to preserve its pre-trained vision-language alignment, thereby anchoring spatial features in a semantically rich manifold and compelling the EEG encoder to align with this stable, language-grounded target. We obtain a global spatial representation h_t^vis by averaging the output patch embeddings:

$$h_t^{\mathrm{vis}} = f_{\mathrm{vis}}(i_t \mid \theta_{\mathrm{vis}}) \in \mathbb{R}^{d_2}, \quad (2)$$

where d_2 is the output dimensionality of the vision encoder.

Projection to Shared Manifold. To align these heterogeneous representations, we employ lightweight projection heads g_eeg(θ_proj^eeg) and g_vis(θ_proj^vis), implemented as two-layer MLPs with GELU activation. These projectors map the modality-specific features into a common d-dimensional embedding space:

$$z_t^{\mathrm{eeg}} = g_{\mathrm{eeg}}(h_t^{\mathrm{eeg}} \mid \theta^{\mathrm{eeg}}_{\mathrm{proj}}), \qquad z_t^{\mathrm{vis}} = g_{\mathrm{vis}}(h_t^{\mathrm{vis}} \mid \theta^{\mathrm{vis}}_{\mathrm{proj}}), \quad (3)$$

where z_t^eeg, z_t^vis ∈ R^d.
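The two-layer GELU projection heads of Eq. (3) admit a direct PyTorch sketch; the hidden and output dimensions below are assumptions for illustration, not values taken from the paper:

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Two-layer MLP with GELU mapping encoder features to the shared d-dim space."""
    def __init__(self, in_dim, d, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, d))
    def forward(self, h):
        return self.net(h)

d1, d2, d = 768, 1024, 512          # assumed encoder output / shared-space dims
g_eeg, g_vis = ProjectionHead(d1, d), ProjectionHead(d2, d)
h_eeg = torch.randn(8, d1)          # stand-in pooled EEG-encoder features, batch of 8
h_vis = torch.randn(8, d2)          # stand-in pooled vision-encoder features
z_eeg, z_vis = g_eeg(h_eeg), g_vis(h_vis)
assert z_eeg.shape == z_vis.shape == (8, d)
```

Mapping both streams to the same width d is what makes the pairwise similarity logits of the alignment objective well-defined.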
3.3.2 Contrastive Spectro-Spatial Alignment

To enforce a rigorous semantic correspondence between the temporal evolution of the EEG segment, x_t, and its instantaneous spatial organization, we employ a contrastive learning objective. Specifically, inspired by the optimization landscape of SigLIP [51], we leverage a sigmoid-based strategy to decouple the normalization of positive and negative pairs. Let ẑ_i^eeg and ẑ_j^vis denote the L2-normalized embeddings of the i-th EEG segment and the j-th EEG topographic map within a mini-batch of size B. We define the pairwise similarity logit s_ij between the temporal and spatial representations as:

$$s_{ij} = \tau \cdot (\hat{z}^{\mathrm{eeg}}_i)^{\top} \hat{z}^{\mathrm{vis}}_j + b, \quad (4)$$

where τ is a learnable temperature parameter and b is a learnable bias. Let y_ij ∈ {−1, 1} represent the label for the pair (i, j), where y_ij = 1 if i = j (a matched spectro-spatial pair) and −1 otherwise. The alignment loss is computed as the sum of independent binary cross-entropy losses across all pairs:

$$\mathcal{L}_{\mathrm{align}} = -\frac{1}{B^2} \sum_{i=1}^{B} \sum_{j=1}^{B} \log \sigma(y_{ij} \cdot s_{ij}), \quad (5)$$

where σ(·) is the sigmoid function. This establishes a grounded spectro-spatial basis for the subsequent generative modeling.

3.4 State-Space-Conditioned EEG-to-Text Generation

3.4.1 Multimodal Large Language Model Architecture

To bridge the discrete symbolic space of natural language with the continuous manifold of electrophysiology, NeuroNarrator is architected as a unified MLLM, as illustrated in Fig. 2b. Unlike standard language models that operate exclusively on discrete text tokens, our framework requires a mechanism to ingest dense physiological data directly into the model's input stream.
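Before turning to that injection mechanism, the alignment objective of Eq. (5) above can be sketched as a batch-wise sigmoid loss in PyTorch; this is a minimal version with τ and b passed in as scalar tensors, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def align_loss(z_eeg, z_vis, tau, b):
    """Sigmoid pairwise alignment loss (Eq. 5): matched pairs on the diagonal."""
    z_eeg = F.normalize(z_eeg, dim=-1)              # L2-normalize embeddings
    z_vis = F.normalize(z_vis, dim=-1)
    s = tau * z_eeg @ z_vis.T + b                   # s_ij logits, shape (B, B)
    B = s.size(0)
    y = 2.0 * torch.eye(B, device=s.device) - 1.0   # y_ij: +1 if matched, -1 otherwise
    return -F.logsigmoid(y * s).sum() / (B * B)     # mean binary cross-entropy

loss = align_loss(torch.randn(8, 256), torch.randn(8, 256),
                  torch.tensor(10.0), torch.tensor(-10.0))
```

Unlike a softmax-based InfoNCE loss, each (i, j) pair contributes an independent binary term, so no batch-wide normalization couples the positives and negatives.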
Inspired by the continuous feature injection paradigm recently explored in cardiac modeling [52], we map continuous signals into the MLLM's high-dimensional embedding space, effectively treating them as "soft prompts" that substitute specific placeholders in the instruction sequence.

Our architectural design constructs a composite input sequence that explicitly grounds the generation in both historical temporal dynamics and instantaneous spectro-spatial evidence. Rather than processing segments in isolation, we inject the embeddings of preceding EEG segments to represent the latent trajectory of brain dynamics, followed by the aligned representations of the current segment. Formally, we utilize the dual-stream encoders and projection heads (defined in Sec. 3.3) to extract aligned embeddings z_eeg and z_vis for the EEG segments and topographic maps. Let M(· | θ_proj) denote the unified embedding function that projects physiological signals into the MLLM's input space. We construct the multimodal input sequence, E_in, by fusing the historical state-space trajectory with the current spectro-spatial observations:

$$E_{\mathrm{in}} = \Big[\underbrace{M(x_{t-N} \mid \theta^{\mathrm{eeg}}_{\mathrm{proj}}), \ldots, M(x_{t-1} \mid \theta^{\mathrm{eeg}}_{\mathrm{proj}})}_{\text{Historical Context}},\; \underbrace{M(x_t \mid \theta^{\mathrm{eeg}}_{\mathrm{proj}}),\, M(i_t \mid \theta^{\mathrm{vis}}_{\mathrm{proj}})}_{\text{Current Spectro-Spatial Evidence}},\; E_{\mathrm{instr}}\Big], \quad (6)$$

where the sequence {x_{t−k}}_{k=1}^{N} represents the temporal context encoded solely via the EEG branch to efficiently model state evolution, while the current target is represented by the concatenation of both its temporal waveform embedding, M(x_t), and its spatial topographic embedding, M(i_t). E_instr denotes the sequence of token embeddings corresponding to the textual task instruction.
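Assembling the composite sequence of Eq. (6) amounts to concatenating soft-prompt embeddings in order before the instruction tokens. The sketch below uses stand-in embedding functions and arbitrary dimensions, purely to show the sequence layout:

```python
import torch

d_model = 512                                    # assumed LLM embedding width
m_eeg = lambda x: torch.randn(2, d_model)        # stand-in for EEG encoder + projector
m_vis = lambda i: torch.randn(2, d_model)        # stand-in for vision encoder + projector

def build_E_in(history, x_t, i_t, E_instr):
    """E_in = [history soft prompts | current EEG + topo map | instruction tokens]."""
    soft = [m_eeg(x) for x in history] + [m_eeg(x_t), m_vis(i_t)]
    soft = torch.stack(soft, dim=1)              # (batch, N + 2, d_model)
    return torch.cat([soft, E_instr], dim=1)     # prepend to instruction embeddings

E_instr = torch.randn(2, 16, d_model)            # 16 instruction token embeddings
E_in = build_E_in([None] * 3, None, None, E_instr)   # N = 3 historical segments
assert E_in.shape == (2, 3 + 2 + 16, d_model)
```

Because the soft prompts live in the same embedding space as text tokens, the LLM backbone can attend over history, current evidence, and instruction jointly with no architectural change.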
This fully constructed sequence E_in is subsequently fed into the Large Language Model (LLM) backbone, enabling the model to jointly process the interleaved physiological embeddings and textual instructions to autoregressively synthesize the final clinical description.

3.4.2 Instruction-Guided End-to-End Optimization

The generation process is explicitly conditioned on two complementary instruction components appended to the unified spectro-spatial sequence E_in: a state-space prior instruction that contextualizes the current segment with respect to its preceding EEG trajectory, and a task instruction that specifies the requirement for clinically grounded, structured narrative synthesis. This instruction prompts the MLLM to synthesize the latent physiological trajectory into a coherent clinical narrative. A selective optimization strategy is employed during training: the CLIP vision encoder remains frozen to maintain its pretrained semantic alignment, while the EEG encoder, projection layers, and the LLM backbone are jointly updated. The entire framework is optimized end-to-end by minimizing the negative log-likelihood of the target description sequence y = (y_1, …, y_L) of length L:

$$\mathcal{L}(\Theta) = -\sum_{l=1}^{L} \log p_{\Theta}(y_l \mid y_{<l}, E_{\mathrm{in}})$$