NeuroLex: A Lightweight Domain Language Model for EEG Report Understanding and Generation
📝 Abstract
Clinical electroencephalogram (EEG) reports encode domain-specific linguistic conventions that general-purpose language models (LMs) fail to capture. We introduce NeuroLex, a lightweight domain-adaptive language model trained purely on EEG report text from the Harvard Electroencephalography Database. Unlike existing biomedical LMs, NeuroLex is tailored to the linguistic and diagnostic characteristics of EEG reporting, enabling it to serve as both an independent textual model and a decoder backbone for multimodal EEG-language systems. Using span-corruption pretraining and instruction-style fine-tuning on report polishing, paragraph summarization, and terminology question answering, NeuroLex learns the syntax and reasoning patterns characteristic of EEG interpretation. Comprehensive evaluations show that it achieves lower perplexity, higher extraction and summarization accuracy, better label efficiency, and improved robustness to negation and factual hallucination compared with general models of the same scale. With an EEG-aware linguistic backbone, NeuroLex bridges biomedical text modeling and brain-computer interface applications, offering a foundation for interpretable and language-driven neural decoding.
📄 Content
Electroencephalogram (EEG) has long served as a noninvasive window into brain activity, powering a wide spectrum of brain-computer interface (BCI) applications, from disease detection [1], [2] and cognitive workload monitoring [3] to emotion recognition [4]–[6] and intention decoding [7], [8]. Yet despite significant advances in EEG decoding algorithms, a persistent gap remains in how such neural data are described, summarized, and communicated in natural language. EEG reports, routinely written by neurologists, encapsulate domain knowledge that bridges low-level signal phenomena (e.g., rhythmic slowing, spikes, asymmetries) with high-level clinical interpretations. They are not merely annotations but structured linguistic codifications of human neurophysiological reasoning.
However, most current BCI research overlooks this linguistic layer. While deep neural networks [9]–[11] have achieved notable performance in EEG classification and cross-modal alignment [12]–[14], these systems are typically paired with generic large language models (LLMs) such as GPT or T5 that lack grounding in EEG-specific semantics [15], [16]. Consequently, the generated textual outputs (diagnostic summaries, data explanations, or multimodal captions) often sound fluent but lack clinical precision. For example, an expression like “attenuation of alpha rhythm” follows domain conventions that general models rarely internalize. This mismatch limits interpretability, interoperability, and the practical deployment of BCI systems in clinical and research settings.
In recent years, there has been a growing demand for domain-specialized yet lightweight language models that can operate efficiently in laboratory pipelines [17]–[19] and embedded neurotechnology environments [20]. Unlike massive general LLMs, compact EEG-domain models could be integrated directly into closed-loop systems, on-device analysis, or multimodal training frameworks without prohibitive computational cost [21], [22]. From a scientific standpoint, such models are also crucial for EEG-language alignment: enabling textual supervision for EEG representation learning, automatic report generation for neurodiagnostic datasets, and precise language grounding for multimodal BCI systems.
To address these gaps, we propose NeuroLex, a lightweight domain-adaptive language model pretrained purely on EEG-related texts. NeuroLex is designed as both a standalone linguistic backbone and a decoder foundation for multimodal EEG-text integration. Built upon the encoder-decoder transformer framework of T5 [23], it undergoes (1) domain-adaptive pretraining (DAPT) using span corruption on a large-scale de-identified EEG report corpus, and (2) supervised fine-tuning (SFT) on task-specific objectives such as summarization, polishing, and terminology question answering. Through these two complementary stages, NeuroLex learns not only the vocabulary and syntax of EEG reporting but also the structured reasoning embedded in real clinical narratives.
In addition, this study empirically tests four hypotheses on EEG-domain language modeling for BCI research:
• H1 (Domain Adaptation): DAPT yields lower perplexity and higher terminology coverage than general T5.
• H3 (Data Efficiency): domain adaptation improves learning stability and efficiency under limited labeled data.
• H4 (Robustness): DAPT models generalize better, reducing terminology hallucinations and negation errors.

By providing an EEG-aware linguistic backbone, NeuroLex bridges the gap between biomedical text modeling and practical BCI needs, forming a foundation for interpretable and linguistically grounded brain-language systems.
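H1 is evaluated through corpus perplexity. As a reminder of the metric (our own illustration, not the paper's evaluation code), perplexity is the exponential of the mean per-token negative log-likelihood, so a model that assigns higher probability to in-domain EEG text scores lower:

```python
import math

# Minimal illustration of the perplexity metric behind H1; this is a
# reminder of the definition, not the paper's evaluation pipeline.
def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over a token sequence."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning every token probability 1/2 scores perplexity 2;
# a domain-adapted model that raises those probabilities scores lower.
uniform = perplexity([math.log(0.5)] * 8)   # ≈ 2.0
adapted = perplexity([math.log(0.8)] * 8)   # ≈ 1.25
```

Lower perplexity on held-out EEG reports is thus direct evidence that DAPT has shifted the model's distribution toward the domain.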
NeuroLex follows the encoder-decoder transformer architecture of T5-Base and is trained exclusively on textual data. As shown in Fig. 1, the training process comprises two stages: DAPT to learn EEG-specific linguistic structure, and SFT to adapt the model toward practical text understanding and generation tasks, following Gururangan et al. [24].
Objective: we employ the span-corruption objective but, unlike standard T5 pretraining, mask only EEG-specific terminology: approximately 15% of the tokens in each input sequence are replaced with sentinel tokens (e.g., <extra_id_0>), and the model learns to reconstruct the missing spans. This encourages the model to capture long-range context, structured co-occurrence, and the reporting conventions typical of clinical EEG language.
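The terminology-restricted span corruption can be sketched as follows. This is a minimal illustration: the term set is a toy stand-in for the paper's actual EEG lexicon, and `corrupt_report` is our own helper name.

```python
import re

# Toy stand-in for the paper's EEG lexicon (assumption for illustration).
EEG_TERMS = {"alpha rhythm", "focal slowing", "sharp waves", "spikes"}

def corrupt_report(text, terms=EEG_TERMS):
    """Mask every EEG-term occurrence with a T5 sentinel token and build
    the target sequence that reconstructs the masked spans."""
    # Locate all term occurrences, then mask left-to-right so sentinel
    # indices follow document order, as T5's span corruption expects.
    spans = sorted(
        (m.start(), m.end())
        for term in terms
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE)
    )
    corrupted, target, prev = [], [], 0
    for idx, (start, end) in enumerate(spans):
        corrupted.append(text[prev:start] + f"<extra_id_{idx}>")
        target.append(f"<extra_id_{idx}> {text[start:end]}")
        prev = end
    corrupted.append(text[prev:])
    target.append(f"<extra_id_{len(spans)}>")  # closing sentinel, as in T5
    return "".join(corrupted), " ".join(target)
```

For "Posterior alpha rhythm with intermittent focal slowing." this yields the corrupted input "Posterior <extra_id_0> with intermittent <extra_id_1>." and the target "<extra_id_0> alpha rhythm <extra_id_1> focal slowing <extra_id_2>", matching the standard T5 span-corruption format while masking only domain terminology.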
Corpus: all pretraining data are derived exclusively from the Harvard Electroencephalography Database (HEEDB) [25], a large-scale open-access repository of clinical EEG recordings accompanied by textual reports. We extract the full set of EEG reports and perform extensive cleaning and normalization, including: (a) removal of protected health information and non-EEG sections, and (b) standardization of spacing, punctuation, and segmentation.
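Cleaning steps (a) and (b) can be sketched as below. The regexes, placeholder tags, and function names are our assumptions for illustration; the exact HEEDB de-identification and normalization pipeline is not specified here.

```python
import re

def normalize_report(raw):
    """Rough sketch of steps (a) and (b): crude PHI scrubbing plus
    whitespace and paragraph normalization (patterns are illustrative)."""
    text = raw.replace("\r\n", "\n")
    # (a) scrub PHI-like patterns: dates and record-number-style IDs
    text = re.sub(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", "[DATE]", text)
    text = re.sub(r"\bMRN[:#]?\s*\d+\b", "[ID]", text)
    # (b) standardize spacing and paragraph segmentation
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\s*\n\s*\n\s*", "\n\n", text)
    return text.strip()

def split_paragraphs(text):
    """Segment a normalized report into paragraph units."""
    return [p for p in text.split("\n\n") if p]
```

In practice, a pipeline like this also drops non-EEG sections (referral notes, billing headers) before paragraph segmentation.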
The resulting corpus contains approximately 70K EEG reports and around 800K paragraphs.