EEG-Based Brain-LLM Interface for Human Preference Aligned Generation
Junzi Zhang¹, Jianing Shen¹, Weijie Tu², Yi Zhang¹, Hailin Zhang¹, Tom Gedeon³, Bin Jiang¹, Yue Yao†¹

¹Shandong University, China; ²Australian National University, Australia; ³Curtin University, Australia. Correspondence to: Yue Yao.

Preprint. March 19, 2026.

Abstract

Large language models (LLMs) are becoming an increasingly important component of human–computer interaction, enabling users to coordinate a wide range of intelligent agents through natural language. While language-based interfaces are powerful and flexible, they implicitly assume that users can reliably produce explicit linguistic input, an assumption that may not hold for users with speech or motor impairments, e.g., Amyotrophic Lateral Sclerosis (ALS). In this work, we investigate whether neural signals can be used as an alternative input to LLMs, particularly to support socially marginalized or underserved users. We build a simple brain-LLM interface that uses EEG signals to guide image generation models at test time. Specifically, we first train a classifier to estimate user satisfaction from EEG signals. Its predictions are then incorporated into a test-time scaling (TTS) framework that dynamically adapts model inference using neural feedback collected during user evaluation. The experiments show that EEG can predict user satisfaction, suggesting that neural activity carries information useful for real-time preference inference. These findings provide a first step toward integrating neural feedback into adaptive language-model inference, and we hope they open up new possibilities for future research on adaptive LLM interaction.

1. Introduction

Large language models are becoming an increasingly important component of modern human–computer interaction (HCI) (Radford & Wu, 2019; Brown et al., 2020; OpenAI, 2023), supporting applications in content generation (Ouyang et al., 2022; Raffel et al., 2020), information retrieval (Lewis et al., 2020; Guu et al., 2020), and user engagement (Thoppilan et al., 2022; Zhang et al., 2020b). By producing coherent and context-aware text, LLMs assist with tasks such as writing support (Lee et al., 2022; Ouyang et al., 2022), question answering (Lewis et al., 2020; Wei et al., 2022), and data synthesis (Zhang et al., 2020a; Wei et al., 2021). Their use has also expanded to multimodal systems, where textual prompts guide models to generate visual (Ramesh et al., 2022; Saharia et al., 2022) or auditory (Borsos et al., 2023; Wang et al., 2023) content, linking linguistic input with other representational modalities.

Figure 1. Motivation for an EEG-based interface for LLMs. Existing large language models rely mainly on explicit linguistic feedback such as typed instructions or corrections. However, for users with speech or motor impairments (e.g., ALS), language may not be available. For those users, we investigate whether neural signals can be used as an alternative input to LLMs.

Despite these advances, LLM-based interaction may not serve all users equally, as it fundamentally relies on the availability of explicit linguistic input. This assumption can break down in scenarios where users have limited ability to produce speech or text, or when preferences are difficult to articulate through language alone.
As illustrated in Figure 1, when a user is affected by ALS, such limitations reflect constraints on the interaction channel rather than on the user's underlying cognitive intent: in many cases, individuals retain intact cognitive function and clear preferences, while their ability to express them through language is restricted. This mismatch motivates the exploration of alternative, non-linguistic feedback signals that can complement language-based interfaces.

Building on these insights, we explore the use of brain signals (i.e., EEG) as an alternative real-time interface for LLMs. By decoding neural responses as users view LLM outputs, EEG signals can provide an indication of satisfaction or dissatisfaction (Ye et al., 2022; Kim et al., 2022). We integrate these signals within the test-time scaling framework (Snell), which allows a model to allocate additional inference-time computation when feedback suggests dissatisfaction. When EEG patterns indicate that a user may be unsatisfied, the model performs an extra refinement step; otherwise, it returns its response directly to maintain efficiency. This EEG-guided adaptation does not modify model parameters but adjusts inference at test time based on physiological feedback, offering a small step toward exploring how neural information might inform real-time human–AI interaction.

To evaluate the effectiveness of the proposed EEG-based interface pipeline, we conduct an extensive human study in which each participant completes 130 image-generation trials while their neural activity is recorded using a 64-channel EEG system. The experiments show that EEG signals collected during the feedback stage are predictive of user satisfaction (Ye et al., 2022; Kim et al., 2022), suggesting that neural activity carries information relevant to real-time preference inference (Holroyd & Coles, 2002).
In addition, the pipeline yields improvements in the alignment between model outputs and user preferences, as well as in the perceived quality of generated content. On average, reported user satisfaction increased compared with baseline settings.

We summarize our main contributions as follows:

• Targeting people with speech or motor impairments, we are among the first to build a Brain–LLM interface, together with a new dataset, BLID, an EEG-based human preference dataset specifically designed for aligning with LLMs. In this dataset, users' neural responses are recorded while they evaluate model-generated outputs, providing a foundation for studying implicit feedback signals (Section 3);

• We propose an EEG-guided TTS framework, which leverages EEG signals to adapt LLM outputs in real time, ensuring alignment with user preferences without the need for additional fine-tuning (Section 4.2).

2. Related Work

EEG-based brain–computer interfaces (BCIs) aim to decode neural activity for applications such as motor imagery classification (lot, 2018), emotion recognition (Zheng et al., 2018), and cognitive state monitoring (Roy et al., 2019). Recent advances have expanded EEG decoding beyond traditional single-modality tasks (Wang et al., 2022a) toward more ambitious goals, including cross-modal generation (Liu et al., 2024b) and open-vocabulary understanding (Wang et al., 2022b). In visual perception decoding, early work primarily focused on classifying (Spampinato et al., 2017) or reconstructing (Miyawaki et al., 2008) static images and simple visual paradigms. More recent studies have begun addressing richer and more dynamic forms of visual experience. EEG2Video (Liu et al., 2024b), for example, investigates reconstructing dynamic visual content from EEG signals, while Neuro3D (Guo et al., 2025) extends this direction toward decoding three-dimensional visual information.
At the same time, decoding tasks have broadened from closed visual categories to open semantic systems. EEG2Text (Liu et al., 2024a) explores open-vocabulary EEG-to-text generation, providing a more flexible interface for interpreting user intent. Collectively, these developments move BCIs beyond simple pattern recognition and toward more general content generation and naturalistic forms of human–AI interaction.

EEG-based foundation models. Recent work in EEG decoding has increasingly focused on narrowing the neural–semantic gap to support more natural forms of human–computer interaction. EEGNet (Lawhern et al., 2018) introduced a compact convolutional architecture that captures spatio-temporal structure in EEG signals, while EEG-Transformer (Lee & Lee, 2022) extends this direction by using self-attention to model long-range dependencies. In parallel, self-supervised and large-scale pretraining have emerged as promising strategies for building more generalizable EEG representations. LaBraM (Jiang et al., 2024b) provides one of the first large-scale EEG foundation models, learning transferable features across tasks, and ContraWR (Yang et al., 2021) explores contrastive learning for end-to-end feature extraction from raw signals. Of particular relevance to our work are models targeting brain-to-content translation. NeuroLM (Jiang et al., 2024a) functions as a specialized language model that decodes semantic intent from EEG, and CBraMod (Wang et al., 2024a) aligns such neural representations with textual and generative modalities. Our framework is inspired by these advances and complements them by incorporating EEG feedback into real-time model refinement rather than direct content generation.

Test-time scaling (TTS) enhances model performance by allocating additional computation during inference rather than relying on a single forward pass (Devlin et al., 2019; Houlsby et al., 2019; Sun et al., 2020; Shanmugam et al., 2020). Traditional approaches typically depend on offline training (Devlin et al., 2019) or static fine-tuning (Houlsby et al., 2019), which limits adaptability to new inputs or user preferences. More recent test-time adaptation methods, such as test-time training (Sun et al., 2020), test-time augmentation (Shanmugam et al., 2020), and online update strategies like TENT (Wang et al., 2021), demonstrate that adjusting model behavior at inference can improve robustness and alignment without modifying core model weights. In this work, we do not propose a new TTS algorithm (Snell). Instead, we adopt the TTS perspective to build a feedback-driven inference loop: LLM outputs are refined iteratively based on EEG feedback collected after the user views each result. This allows the model to decide, at inference time, whether additional refinement is needed to better align with the user's preferences.

Figure 2. Meta-information on the semantic themes, challenge categories, and EEG data collection procedure of BLID. (A) Visualization of the six semantic themes and fourteen challenging aspects of text-to-image generation used to construct the stimulus set. (B) Balanced distribution of prompt–image pairs across six semantic themes and fourteen challenge categories, ensuring diverse coverage of both objective constraints and subjective nuances. (C) Illustration of the EEG collection procedure. The experiment consisted of five evaluation blocks (approximately 12 minutes each). Each block contained 26 trials, and each trial lasted approximately 28 seconds, including stimulus presentation, evaluation periods, and inter-trial intervals.

3. Brain-LLM Interface Dataset

We collect a new dataset, the Brain–LLM Interface Dataset (BLID), to investigate whether (i) EEG signals recorded while users view generated outputs reflect their satisfaction with the generations and (ii) such signals can help improve LLM alignment.

3.1. Participants

Ten healthy participants (mean age: 21.5 years; 4 females) with normal or corrected-to-normal vision were recruited. Each participant completed 130 image-generation trials while their neural activity was recorded, so the dataset contains 1,300 trials in total. All participants provided written informed consent and were compensated above the minimum wage, and the study was approved by the institutional ethics committee.

3.2. Stimuli Generation

We prompted GPT-4o to generate 450 textual prompts covering six semantic themes and 14 challenge aspects of text-to-image generation (Figure 2A). To ensure representativeness and mitigate bias, we employed a stratified sampling strategy to select 130 prompts. This selection ensured a uniform distribution across the 14 challenge categories (see Figure 2B) while balancing objective constraints (e.g., numeracy) with subjective nuances (e.g., aesthetics).
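The stratified selection described above can be sketched as follows. The category names, pool sizes, and seed here are illustrative stand-ins, not the actual BLID prompt pool; only the overall counts (130 prompts, 14 challenge categories) come from the paper.

```python
import random

def stratified_select(prompts, n_total):
    """Select n_total prompts with a near-uniform spread across categories.

    `prompts` maps each challenge category to its candidate prompt list.
    When n_total does not divide evenly, the remainder is spread one
    extra prompt per category.
    """
    categories = sorted(prompts)
    per_cat, extra = divmod(n_total, len(categories))
    rng = random.Random(0)  # fixed seed, purely for reproducibility of the sketch
    selected = []
    for i, cat in enumerate(categories):
        k = per_cat + (1 if i < extra else 0)
        selected.extend(rng.sample(prompts[cat], k))
    return selected

# Toy pool: 14 challenge categories with 10 candidate prompts each.
pool = {f"challenge_{i}": [f"c{i}_p{j}" for j in range(10)] for i in range(14)}
chosen = stratified_select(pool, 130)  # 130 prompts, as in BLID
```

With 130 prompts over 14 categories, each category contributes 9 or 10 prompts, which is what "uniform distribution across the 14 challenge categories" amounts to in practice.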
This design captures both gross errors and subtle misalignments typical of real-world interaction. We controlled for prompt complexity to ensure neural responses were driven by satisfaction rather than difficulty.

3.3. EEG Data Collection

Data were recorded in a controlled laboratory environment using a 64-channel Neuracle system (active AgCl electrodes, 10–20 system) at a sampling rate of 1,000 Hz. The session lasted approximately 1.5 hours.

Procedure. As illustrated in Figure 2C, participants evaluated unique prompt–image pairs. Each trial consisted of an initial evaluation and a consistency check. The trial began with a fixation cross (3 s) and the text prompt (5 s), followed by the prompt–image pair (3 s). Participants then rated the image as "Satisfied" or "Unsatisfied". A re-evaluation phase followed immediately; trials with inconsistent ratings between the two phases were excluded to ensure label stability. This exclusion is critical because inconsistent responses imply cognitive ambiguity, which fails to elicit the distinct error-processing signals (e.g., the ERN) necessary for reliable decoding (Chandrasekar et al., 2024). A 1 s interval separated consecutive trials. We additionally ensured coverage of both common generation successes and diverse failure patterns to avoid over-representing any single error type.

The experiment comprised five blocks of 26 trials, with the trial order randomized. Short rest periods of approximately 3 minutes were provided between blocks to reduce fatigue and maintain attention. After each rest period, participants reported their self-perceived attention level on a five-point scale (1 = very sleepy, 5 = highly focused). The average self-reported attention level was 4.07 across all participants and blocks, indicating that participants were generally able to maintain sufficient concentration for high-quality EEG recording.

Figure 3. Overview of the Brain-LLM interface. Left: Participants view image stimuli while wearing a 64-channel EEG cap. Right: the signals are recorded and amplified to train an EEG foundation model that predicts user satisfaction. During inference, a generation model produces an initial answer, and the EEG model monitors the user's neural response, classifying each trial as "Satisfied" or "Not satisfied". If not satisfied, the system triggers a test-time scaling step and loops until the user is satisfied or a maximum number of iterations is reached. The refinement instruction shown in the figure reads: "Wait, please carefully re-read the user's prompt. Treat the previous responses as negative examples. Generate a new answer."

Algorithm 1: Brain-LLM Interface for Test-time Scaling

 1: Input: initial prompt p_0, max iterations T_max
 2: Initialize: image generator G, LMM operator R, EEG decoder D_ϕ
 3: Initialize: iteration t ← 0, current prompt p_t ← p_0
 4: y_t ← G(p_t)                          ▷ Generate initial image
 5: while t < T_max do
 6:     Acquire EEG signal e_t while the user views y_t
 7:     p_score_t ← D_ϕ(e_t)              ▷ Decode satisfaction probability
 8:     if p_score_t ≥ τ_accept then
 9:         return y_t                    ▷ Accept the image and terminate
10:     else
11:         p_{t+1} ← R(p_0, y_t, p_t)    ▷ Refine prompt
12:         y_{t+1} ← G(p_{t+1})
13:         y_t ← y_{t+1}, p_t ← p_{t+1}
14:     end if
15:     t ← t + 1
16: end while
17: return y_t                            ▷ Terminate after reaching max iterations

Data Extraction.
EEG data were extracted from a reaction-time-aligned window spanning [−2.3, −0.3] seconds relative to the button press. This interval isolates preference-related neural activity by excluding motor execution artifacts (Ward & Frackowiak, 2003). In total, 100 unique prompt–image pairs were retained, balanced between the "Satisfied" and "Unsatisfied" conditions.

4. Method

Our goal is to incorporate a user's internal preference signal into the generative loop. Direct EEG-to-text decoding, however, presents clear challenges: decoding arbitrary, open-ended thoughts is not yet reliable for practical use. To mitigate this issue, we adopt a more coarse-grained and robust strategy. Instead of attempting to decode what should be changed in the output, we focus on a binary signal indicating whether the user is satisfied or unsatisfied. This simpler signal offers a more reliable basis for guiding the model on whether to accept the current output or perform an additional refinement step. The resulting "generate–check–refine" loop aligns naturally with the test-time scaling framework (Muennighoff et al., 2025; Snell).

4.1. Recap of Test-Time Scaling

Test-time scaling (TTS) (Snell) improves a frozen model by allocating extra computation at inference rather than relying on a single forward pass. Common strategies include generating multiple candidates (Madaan et al., 2023), iteratively refining outputs, and using verifiers or reward models to guide selection (Locey & Lennon, 2016).

We adopt an iterative revision view (Li et al., 2025). Let the initial prompt be p_0 and the draft sequence be {y_t}, t = 0, ..., T. The model M_θ first produces a draft:

    y_0 = M_θ(p_0),

and then updates it within the same inference session through a refinement operator R:

    y_{t+1} = R(y_t, p_0),   t = 0, 1, ...
The process terminates when a stopping condition is reached, such as exceeding a verification score threshold τ or hitting a predefined step limit T_max:

    stop if g(y_t) ≥ τ or t ≥ T_max,

where g(y_t) is a verification function that outputs a scalar score representing the quality of the draft y_t. This "draft–check–revise" loop can be described as a self-revision method that gradually improves output quality across multiple inference steps. In many TTS systems, the decision to continue revising depends on model-internal signals, such as verifier scores or heuristic confidence estimates (Li et al., 2025). In Section 4.2, we retain this iterative-revision framework but replace the decision signal with a preference score decoded from EEG, allowing extra computation only when neural evidence indicates user dissatisfaction.

4.2. The Brain-LLM Interface for TTS

As shown in Figure 3, we introduce an EEG-gated interface that operationalizes the iterative refinement paradigm of test-time scaling (TTS). It substitutes model-internal heuristics with direct neural feedback, creating a brain-in-the-loop system that dynamically allocates computational resources. The interface triggers a refinement step only upon detection of user dissatisfaction inferred from EEG signals.

The architecture consists of three core modules: a neural preference decoder, an EEG-gated policy, and a large multimodal model (LMM) operator. The entire closed-loop process is formalized in Algorithm 1.

Neural preference decoder. The decoder D_ϕ maps EEG activity to a user's preference state. It takes a segment of preprocessed EEG data e_t ∈ R^{C×T} as input and outputs a scalar satisfaction probability p_score_t ∈ [0, 1]. We formulate this as a binary classification problem and train D_ϕ on the BLID dataset to distinguish neural patterns corresponding to "Satisfied" versus "Unsatisfied" evaluations.
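As a deliberately simplified illustration of the decoder interface, D_ϕ can be viewed as any map from an EEG epoch (channels × samples) to a probability in [0, 1]. The sketch below substitutes per-channel log-variance features and a fixed linear readout for the trained foundation-model backbone; the features and weights are placeholders, not the paper's actual model. Only the thresholding against τ_accept mirrors the system directly.

```python
import math

def decode_satisfaction(epoch, weights, bias=0.0):
    """Stand-in for the decoder D_phi: EEG epoch -> satisfaction probability.

    `epoch` is a list of channels, each a list of samples. Per-channel
    log-variance features (a common EEG baseline, used here only as a
    placeholder) feed a linear readout and a sigmoid.
    """
    feats = []
    for ch in epoch:
        mu = sum(ch) / len(ch)
        var = sum((x - mu) ** 2 for x in ch) / len(ch)
        feats.append(math.log(var + 1e-12))  # epsilon guards log(0)
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # p_score in [0, 1]

def is_satisfied(p_score, tau_accept=0.5):
    """EEG-gated decision: accept the current output when p_score >= tau."""
    return p_score >= tau_accept
```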
During inference, we convert p_score_t into a binary decision using a fixed acceptance threshold τ_accept = 0.5: responses with p_score_t ≥ 0.5 are treated as satisfied and accepted, while those with p_score_t < 0.5 are treated as unsatisfied and trigger a refinement step. No subject-specific threshold tuning is performed.

LLM TTS for generation refinement via critique and rewrite. When refinement is triggered (Line 11 in Algorithm 1), an operator R produces an improved prompt. This operator is implemented using a frozen, pretrained LMM without any additional fine-tuning. The LMM is prompted to perform a critique-and-rewrite procedure: given the original prompt p_0, the generated output y_t, and the intermediate prompt p_t, it generates a revised prompt p_{t+1} intended to elicit a better response. This prompt-engineering strategy keeps the LMM fixed, ensuring that performance improvements arise solely from the EEG-gated refinement process.

5. Experiment

We design three experiments to evaluate the feasibility of using EEG as feedback for LLMs: first, we verify the decodability of user satisfaction; second, we confirm via spatial distribution analysis that the signals originate from cognitive evaluation rather than motor artifacts; and finally, we evaluate the effectiveness of EEG-gated refinement in improving preference alignment within a TTS framework.

5.1. EEG-based Satisfaction Classification

To assess whether user preference (Satisfied vs. Unsatisfied) can be reliably decoded from EEG, we conducted a benchmark experiment on the BLID dataset. We evaluate two categories of models: (1) handcrafted feature-based classifiers, including Linear Discriminant Analysis (LDA) (Subasi & Gursoy, 2010), Logistic Regression (LR) (Tomioka et al., 2006), and Support Vector Machines (SVM) (Sha'Abani et al., 2020); and (2) end-to-end neural models fine-tuned on raw EEG, including EEGNet (Lawhern et al., 2018), EEG-Conformer (Song et al., 2022), CBraMod (Wang et al., 2024a), and LaBraM (Jiang et al., 2024b).

Satisfaction can be reliably decoded. Table 1 confirms the feasibility of EEG-based preference decoding. Among all models, the large-scale pretrained LaBraM achieves the best performance (80.68% accuracy, 87.91% AUROC). Notably, the SVM classifier (77.85%) outperforms non-pretrained deep architectures such as EEGNet (70.20%), suggesting that traditional deep models struggle to learn robust representations with limited data. In contrast, LaBraM achieves the highest performance by transferring representations learned from large-scale EEG corpora, effectively overcoming challenges related to data scarcity, low signal-to-noise ratios, and inter-subject variability.

Table 1. Performance of EEG satisfaction classification. We compare two categories of models: (1) handcrafted feature-based classifiers (LDA, Logistic Regression, SVM) and (2) end-to-end neural models fine-tuned on raw EEG (EEGNet, EEG-Conformer, CBraMod, LaBraM). Metrics include Accuracy, F1, AUROC, Recall, G-Mean, and MCC. Results show that EEG signals carry information relevant to distinguishing "Satisfied" vs. "Unsatisfied" states, with LaBraM achieving the highest overall performance.

Category      Method          Accuracy      F1-Score      AUROC         Recall        G-mean        MCC
Hand-crafted  LDA             70.18 ± 1.65  70.72 ± 2.53  74.26 ± 2.81  71.45 ± 5.40  70.01 ± 1.51  40.50 ± 3.37
              LR              72.32 ± 2.86  72.06 ± 3.68  79.32 ± 3.09  72.05 ± 6.62  72.16 ± 2.93  44.85 ± 5.69
              SVM             77.85 ± 2.06  77.41 ± 2.46  86.58 ± 3.22  76.67 ± 4.59  77.75 ± 2.11  55.83 ± 4.06
End-to-end    EEG-Conformer   68.00 ± 1.77  68.18 ± 1.67  72.85 ± 3.27  68.60 ± 2.88  67.94 ± 1.75  36.06 ± 3.56
              EEGNet          70.20 ± 1.75  69.81 ± 2.10  76.20 ± 1.67  69.00 ± 3.39  70.15 ± 1.77  40.45 ± 3.49
              CBraMod         77.10 ± 2.01  75.88 ± 2.38  84.78 ± 1.79  72.77 ± 5.43  76.80 ± 2.00  54.59 ± 4.11
              LaBraM          80.68 ± 2.75  80.74 ± 2.51  87.91 ± 2.83  80.97 ± 3.11  80.64 ± 2.76  61.46 ± 5.50

Temporal dynamics confirm cognitive origin. We further analyzed the time course of decoding accuracy to distinguish preference evaluation from motor execution. As shown in Figure 4, performance peaks in the [−2.0, −1.0] s interval (LaBraM: 80.40%), significantly preceding the physical response. This suggests that the internal evaluation is finalized well before motor execution. Conversely, as the window approaches the button press ([−1.0, 0.0] s), accuracy declines and variance increases, indicating interference from motor preparation noise. This rise in variance near the response point reflects the transition from purely evaluative neural states to complex, participant-specific motor planning rhythms.

Figure 4. Time-resolved decoding analysis of user satisfaction. We evaluate the decoding accuracy of representative hand-crafted (SVM) and end-to-end (LaBraM) models using a sliding-window approach relative to the button press (0.0 s); the peak is 80.40%. The shaded regions represent the standard deviation across participants.

Figure 5. Topographical maps of single-channel classification accuracy for four participants. Warm colors indicate electrodes with higher decoding accuracy, and cool colors indicate lower accuracy. While individual variability is present, several regions show consistently higher performance across participants.
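The window indexing behind this time-resolved analysis is straightforward to reproduce. The sketch below assumes the 1,000 Hz sampling rate stated in Section 3.3 and a known button-press sample index; the signal itself is a dummy, and the function names are ours, not from any EEG library.

```python
FS = 1000  # sampling rate in Hz (Section 3.3)

def extract_window(signal, press_idx, start_s, end_s, fs=FS):
    """Slice one channel relative to the button press at press_idx.

    start_s/end_s are offsets in seconds, e.g. (-2.3, -0.3) for the
    reaction-time-aligned window used for decoding.
    """
    lo = press_idx + round(start_s * fs)  # round() avoids float truncation
    hi = press_idx + round(end_s * fs)
    if lo < 0:
        raise ValueError("window starts before the recording")
    return signal[lo:hi]

def sliding_windows(press_idx, width_s=1.0, start_s=-3.0, step_s=0.5, n=5, fs=FS):
    """Index ranges for the time-resolved analysis:
    [-3,-2], [-2.5,-1.5], [-2,-1], [-1.5,-0.5], [-1,0] seconds, as in Figure 4.
    """
    out = []
    for k in range(n):
        a = start_s + k * step_s
        out.append((press_idx + round(a * fs), press_idx + round((a + width_s) * fs)))
    return out
```

At 1,000 Hz the [−2.3, −0.3] s window is exactly 2,000 samples per channel, which is the epoch length the decoder would consume.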
This trajectory confirms that the decoded signals reflect internal cogniti ve e v aluation rather than motor arti- facts, aligning with established neuroscientific e vidence that outcome e v aluation is a rapid, pre-reflexi ve process initiated in the medial frontal cortex ( Gehring & W illoughby , 2002 ; Chandrasekar et al. , 2024 ). Crucially , this temporal decou- pling ensures that the preference feedback integrated into our test-time scaling frame work is deri ved from high-le vel cognitiv e appraisal rather than the lo w-level physiological Accuracy(%) Brain Region(ablated) Frontal Central Parietal Temporal Occipital All F igur e 6. Classification ac- curacy ablation. The right- most bar (”All”) shows the base- line performance using all elec- trodes. The frontal region con- tributes most significantly to de- coding performance. noise of physical interaction. 5.2. Analysis of Brain Regions T o identify cortical areas critical for subjecti ve e v aluation, we conducted single-channel classification and systematic brain region ablation. Spatial distribution of evaluati ve signals. Channel-wise analysis (Figure 5 ) shows high decoding accuracies con- centrated in frontal r egions and temporo-occipital ar eas , aligning with prefrontal value-based decision-making ( Rush- worth et al. , 2012 ), visual categorization ( Grill-Spector & W einer , 2014 ), and aesthetic appraisal ( Chatterjee & V ar - tanian , 2014 ). The frontal in volv ement likely reflects the cognitiv e arbitration between model output and user expec- tations, while posterior acti vity highlights visual-semantic matching. Critically , the central sensorimotor strip —the hub for motor execution ( Pfurtscheller & Lopes da Silv a , 1999 ; W olpaw et al. , 2002 )—exhibits low accuracy . This "si- lence" v alidates our pre-response analysis windo w ( − 2 . 3 s to − 0 . 
3 s) and confirms that the framework successfully iso- lates purely e valuati ve neural correlates from motor e xecu- tion noise and preparation artifacts, ensuring that e xtracted features are driv en by internal appraisal rather than anticipa- tory motor activity . Functional necessity via ablation. As shown in Figure 6 , the full-channel model achie ves 80 . 68% accuracy . Remov- ing frontal regions causes the largest drop (to 75 . 88% ), reinforcing the prefrontal cortex’ s dominant role in judg- ment ( Miller & Cohen , 2001 ; Rushworth et al. , 2012 ). Con- versely , removing the central region yields the smallest 6 EEG-Based Brain-LLM Interface for Human Pr eference Aligned Generation T able 2. Effectiveness of the Brain-LLM interface. W e apply the framework to a range of state-of-the-art text-to-image models, including both open-source and proprietary systems. Performance is ev aluated using 6 metrics: CLIP Score, BLIP Score, Aesthetic Score, VQA Score, the average human rating (Human, 1–5), and the final success rate,which represents the percentage of images the user finds satisfactory after e valuation through The Brain-LLM Interface for TTS. Across all models, the Brain-LLM frame work consistently improv es generation quality and preference alignment relativ e to the baseline. 
Category Model Condition CLIP Score BLIP Scor e Aesthetic VQA Score Human Success Rate Open Source Stable Diffusion XL First Generation 27.23 88.90 6.11 37.72 2.56 3.33 Brain-LLM 28.71 97.41 6.51 52.21 3.08 27.62 W anx-v1 First Generation 25.35 84.81 6.31 41.62 2.62 6.67 Brain-LLM 27.70 88.04 6.65 51.06 2.91 17.62 W an2.2-t2i-flash First Generation 26.34 96.57 6.21 35.48 3.34 33.33 Brain-LLM 27.56 97.56 6.43 45.33 3.81 65.24 Qwen-image-plus First Generation 26.14 92.25 6.02 45.27 3.31 39.05 Brain-LLM 28.05 95.59 6.57 50.98 4.01 80.48 Closed Source GPT -4o Image First Generation 28.96 97.37 6.22 55.14 3.67 64.76 Brain-LLM 30.27 98.44 6.37 62.92 4.16 92.38 Doubao-seedream-3-0-t2i First Generation 28.95 99.79 6.07 65.05 3.64 55.24 Brain-LLM 30.26 99.91 6.28 77.37 4.06 92.38 Doubao-seedream-4-0 First Generation 28.97 99.18 6.11 59.74 4.13 58.10 Brain-LLM 29.96 99.81 6.29 72.48 4.23 95.24 Brain-LLM feedback iteration Brain-LLM fee dback iteration Human Score VQA Score(%) Human Score VQA Score F igure 7. The effect of TTS on feedback iterations. This figure shows the change in GPT -4o performance across Human Score and VQA Score as the number of Brain-LLM feedback iterations increases. Both metrics exhibit a steady upw ard trend, suggesting that iterativ e EEG-guided refinement can provide incremental im- prov ements to the generated outputs. decrease, proving that motor acti vity is not a driv er for our decoder . Modest reductions in other areas suggest a dis- tributed network integrating attention and visual-semantic processing ( Corbetta & Shulman , 2002 ; Grill-Spector & W einer , 2014 ). Collectiv ely , this spatial and functional evi- dence demonstrates that our decoder leverages higher -order signals related to evaluati ve judgment and semantic con- sistency rather than low-le vel motor responses ( Miller & Cohen , 2001 ; Rushworth et al. , 2012 ), supporting EEG as a rob ust feedback source for real-time model refinement. 5.3. 
Model Alignment with Human Preference

In this section, we evaluate the effectiveness of our Brain-LLM interface in improving the alignment of text-to-image models with human preferences. We apply the framework to a range of state-of-the-art models, including open-source systems such as Stable Diffusion XL (Podell et al., 2023), Wanx-v1 (Ma et al., 2024), Wan2.2-t2i-flash (Wan et al., 2025), and Qwen-image-plus (Wang et al., 2024b), as well as proprietary models such as GPT-4o Image (Hurst et al., 2024), Doubao-seedream-3-0-t2i (Gao et al., 2025), and Doubao-seedream-4-0 (Seedream et al., 2025). We evaluate performance using six metrics: CLIP Score (Radford et al., 2021), BLIP Score (Li et al., 2022), Aesthetic Score (Murray et al., 2012), VQA Score (Lin et al., 2024), the average human rating (Human) from 1 to 5, and the Success Rate, which represents the percentage of images the user finds satisfactory after evaluation through the Brain-LLM interface for TTS.

Brain-LLM consistently improves alignment and generation quality over the baseline. As shown in Table 2, across all seven generative models and six evaluation metrics, the Brain-LLM framework yields consistent improvements. The most notable gains appear in the two metrics most sensitive to human judgment and constraint satisfaction: Human Score and VQA Score. This suggests that EEG-gated refinement may help the model better capture user preferences and adhere more closely to prompt specifications.

To better understand how the framework achieves these gains, we examined how performance metrics evolve across Brain-LLM refinement iterations. As shown in Figure 7, both Human Score and VQA Score increase steadily with each feedback loop.
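The EEG-gated refinement loop can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `generate`, `refine`, and `decode_satisfaction` are hypothetical stand-ins for the text-to-image backbone, the LMM critique-and-refine operator, and the binary EEG satisfaction classifier, respectively.

```python
# Minimal sketch of EEG-gated test-time refinement (hypothetical names).
# generate: prompt -> image, refine: (prompt, image) -> image, and
# decode_satisfaction: image-viewing trial -> bool stand in for the
# paper's components; the iteration budget mirrors the described T_max.

def brain_llm_loop(prompt, generate, refine, decode_satisfaction, t_max=5):
    """Generate once, then refine until the decoded neural feedback
    reports satisfaction or the iteration budget t_max is exhausted."""
    image = generate(prompt)
    for _ in range(t_max):
        if decode_satisfaction(image):  # binary EEG "satisfied" gate
            break
        image = refine(prompt, image)  # critique-and-refine step
    return image
```

Under this sketch, each iteration consumes one round of neural feedback, and the loop terminates early on the first decoded "satisfied" signal, which is consistent with the steady per-iteration gains reported above.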
This pattern suggests that EEG-guided refinement provides incremental improvements rather than random variation, gradually correcting earlier errors and moving the output toward the user's intended specification.

Figure 8. Examples of improvement through Brain-LLM feedback in image generation. This figure illustrates the gradual improvements achieved through Brain-LLM feedback across several examples. For each prompt, we compare VQA Score and Human Score before and after iterative feedback. Images on the left show the initial generations, while those on the right show the outputs after Brain-LLM refinement. As feedback iterations increase, the generated images show clearer details and better adherence to the prompt, suggesting that EEG-guided refinement can contribute to incremental improvements in model performance. Per-prompt score trajectories: "Computer keyboard. Every single key labeled with the letter 'A'." (VQA Score 71.75, 60.18, 60.19, 79.80; Human 2.5, 3.0, 3.2, 3.5); "Sky scene with 2 suns and 3 moons visible at once." (VQA Score 69.54, 78.87, 74.90, 83.96; Human 2.5, 2.6, 3.9, 3.9); "Cross-section diagram of Earth. Clearly labeled in white text: crust, mantle, outer core, inner core." (VQA Score 80.67, 89.86, 86.88, 94.82; Human 1.0, 2.4, 2.5, 4.0); "There are exactly seven floating ice cubes and three coins sunk to the bottom inside the transparent glass." (VQA Score 76.97, 82.42, 81.45, 83.96; Human 2.5, 2.8, 3.5, 4.2).

Crucially, the steady upward trend across five iterations observed in Figure 7 demonstrates that EEG feedback provides a consistent optimization gradient rather than a simple random "restart".
This sustained improvement suggests that the decoded preference signal remains highly informative even as output quality increases, effectively guiding the model to address subtle misalignments that are typically missed in unguided or zero-shot generation. It is also important to verify that these improvements manifest in challenging cases. Figure 8 presents representative examples from our stress-test prompts. For example, with the "seven ice cubes and three coins" prompt, the initial output fails to meet the counting and spatial requirements; after EEG-guided refinements, the final image satisfies both constraints. For "cross-section of Earth", the clarity of textual labels improves progressively across iterations, yielding a more interpretable and accurate depiction.

6. Discussion and Future Work

Looking ahead, our study represents a foundational step toward neural-feedback-guided language model adaptation. While the current EEG feedback is simplified to binary signals, this robust "binary gate" provides a high-reliability confirmation channel essential for users with severe motor impairments, for whom complex language communication is often challenging. Future research could leverage multi-dimensional neural markers, such as cognitive load, confidence, or affective valence, to provide a more nuanced "voice" for those who cannot provide explicit linguistic corrections, thereby refining model behavior without the need for motor-based input.

Beyond discrete satisfaction signals, our ultimate vision is to establish a unified EEG-text-LLM interaction paradigm that restores interactive agency to individuals with profound physical constraints. We aim to move toward direct semantic synchronization, where EEG acts not merely as a validation tool but as a continuous stream of cognitive context. This would allow the LLM to perceive and proactively adapt to the user's internal state as intent emerges.
Such an advancement would evolve the interface from a corrective loop into a seamless, brain-driven collaborator, effectively bypassing physical limitations to enable autonomous human-AI synergy.

7. Conclusion

This work introduces an EEG-guided test-time scaling framework that adapts LLM outputs using real-time neural feedback. By decoding brain activity, our interface provides an alternative interaction channel that connects user intent directly to model generation, reducing the reliance on explicit motor or linguistic input. Experiments confirm that EEG signals carry decodable satisfaction information, which significantly improves preference alignment when integrated into iterative refinement. These findings demonstrate that neural feedback can inform adaptive inference, paving the way for more inclusive and accessible AI systems. While currently focused on binary signals, future extensions will explore richer neural states to enable more flexible guidance, eventually bridging the gap between human neural activity and generative intelligence, and restoring interactive agency to those previously limited by physical constraints.

Impact Statement

This research introduces a Brain-LLM interface that leverages non-invasive EEG signals to align large language model outputs with implicit human preferences. The potential societal impacts of this work are significant and multifaceted:

• Accessibility and Inclusion: Our framework provides a promising interaction channel for individuals with severe speech or motor impairments (e.g., ALS or motor neuron diseases). By utilizing neural signals as an implicit feedback source to guide real-time model refinement, we take a step toward making generative AI more accessible to populations who may struggle with traditional linguistic or motor-based interfaces.
• Privacy and Neuro-ethics: As brain-computer interfaces (BCIs) become more integrated with LLMs, the protection of "mental privacy" is of paramount importance. Our study focuses on task-specific, coarse-grained satisfaction signals. However, we emphasize that future developments must prioritize rigorous data encryption, informed consent, and safeguards against the unauthorized decoding of private cognitive or affective states.

• Human-AI Collaboration: This work advances human-centered machine learning by providing a direct link between internal cognitive states and AI generation. While this enhances personalization, we encourage the community to consider the implications of automated preference alignment on human agency and to ensure that such systems are used to augment, rather than replace, deliberate human decision-making.

Overall, this paper presents work whose goal is to advance the fields of Machine Learning and BCI. We believe the societal consequences are largely positive for human-AI synergy, provided that neuro-ethical standards are strictly maintained as the technology evolves.

Acknowledgments

This work was supported by the Key Research and Development Program of Shandong Province, China (2025CXGC010901), and the Shandong Province Overseas Young Talents Program.

References

A review of classification algorithms for EEG-based brain-computer interfaces: a 10 year update. Journal of Neural Engineering, 15(3):031005, 2018.

Borsos, Z., Marinier, R., Vincent, D., Kharitonov, E., Pietquin, O., Sharifi, M., Roblek, D., Teboul, O., Grangier, D., Tagliasacchi, M., et al. AudioLM: a language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2523-2533, 2023.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In NeurIPS, pp.
1877-1901, 2020.

Chandrasekar, R., Hasan, M. R., Ghosh, S., Gedeon, T., and Hossain, M. Z. Machine learning to detect anxiety disorders from error-related negativity and EEG signals. arXiv preprint arXiv:2410.00028, 2024.

Chatterjee, A. and Vartanian, O. Neuroaesthetics. Trends in Cognitive Sciences, 18(7):370-375, 2014.

Corbetta, M. and Shulman, G. L. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3(3):201-215, 2002.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

Gao, Y., Gong, L., Guo, Q., Hou, X., Lai, Z., Li, F., Li, L., Lian, X., Liao, C., Liu, L., et al. Seedream 3.0 technical report. arXiv e-prints, arXiv-2504, 2025.

Gehring, W. J. and Willoughby, A. R. The medial frontal cortex and the rapid processing of monetary gains and losses. Science, 295(5563):2279-2282, 2002.

Grill-Spector, K. and Weiner, K. S. The functional architecture of the ventral temporal cortex and its role in categorization. Nature Reviews Neuroscience, 2014.

Guo, S., Zhang, J., Liu, Y., and Gao, M. Neuro3D: Towards 3D visual decoding from EEG signals. arXiv preprint arXiv:2501.03765, 2025.

Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. Retrieval augmented language model pre-training. In ICML, pp. 3929-3938, 2020.

Holroyd, C. B. and Coles, M. G. The neural basis of human error processing: reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109(4):679, 2002.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attal, B., and Gelly, S. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning (ICML), pp. 2790-2799, 2019.

Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.
GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

Jiang, W.-B., Wang, Y., Lu, B.-L., and Li, D. NeuroLM: A universal multi-task foundation model for bridging the gap between language and EEG signals. arXiv e-prints, arXiv-2409, 2024a.

Jiang, W.-B., Zhao, L.-M., and Lu, B.-L. Large brain model for learning generic representations with tremendous EEG data in BCI. arXiv preprint, 2024b.

Kim, H., Miyakoshi, M., Kim, Y., Stapornchaisit, S., Yoshimura, N., and Koike, Y. Electroencephalography reflects user satisfaction in controlling robot hand through electromyographic signals. Sensors, 23(1):277, 2022.

Lawhern, V. J., Solon, A. J., Waytowich, N. R., Gordon, S. M., Hung, C. P., and Lance, B. J. EEGNet: a compact convolutional neural network for EEG-based brain-computer interfaces. Journal of Neural Engineering, 15(5):056013, 2018.

Lee, M., Liang, P., and Yang, Q. CoAuthor: Designing a human-AI collaborative writing dataset for exploring language model capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1-19, 2022.

Lee, Y.-E. and Lee, S.-H. EEG-Transformer: Self-attention from transformer architecture for decoding EEG of imagined speech. In 2022 10th International Winter Conference on Brain-Computer Interface (BCI), pp. 1-4. IEEE, 2022.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS, pp. 9459-9474, 2020.

Li, J., Li, D., Xiong, C., and Hoi, S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp. 12888-12900. PMLR, 2022.

Li, Y., Hu, X., Qu, X., Li, L., and Cheng, Y.
Test-time preference optimization: On-the-fly alignment via iterative textual feedback. In ICML, 2025. URL https://openreview.net/forum?id=ArifAHrEVD.

Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., and Ramanan, D. Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision, pp. 366-384. Springer, 2024.

Liu, Y., Chen, L., Wang, T., and Xu, H. EEG2Text: Decoding open-vocabulary language semantics from EEG signals. arXiv preprint arXiv:2406.08113, 2024a.

Liu, Y., Chen, X., Zhang, Y., and Xu, H. EEG2Video: Towards decoding dynamic visual perception from EEG signals. arXiv preprint, 2024b.

Locey, K. J. and Lennon, J. T. Scaling laws predict global microbial diversity. Proceedings of the National Academy of Sciences, 113(21):5970-5975, 2016.

Ma, S., Luo, W., Wang, Y., Liu, X., Chen, M., Li, B., and Xiao, C. Visual-RolePlay: Universal jailbreak attack on multimodal large language models via role-playing image character. arXiv e-prints, arXiv-2405, 2024.

Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., et al. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534-46594, 2023.

Miller, E. K. and Cohen, J. D. An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 24(1):167-202, 2001.

Miyawaki, Y., Uchida, H., Yamashita, O., Sato, M., Morito, Y., Tanabe, H., Sadato, N., and Kamitani, Y. Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron, 60(5):915-929, 2008. doi: 10.1016/j.neuron.2008.11.004.

Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. B. s1: Simple test-time scaling. pp. 20286-20332, 2025.
Murray, N., Marchesotti, L., and Perronnin, F. AVA: A large-scale database for aesthetic visual analysis. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2408-2415. IEEE, 2012.

OpenAI. GPT-4 technical report. 2023.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.

Pfurtscheller, G. and Lopes da Silva, F. H. Event-related EEG/MEG synchronization and desynchronization: basic principles. Clinical Neurophysiology, 110(11):1842-1857, 1999.

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv e-prints, arXiv-2307, 2023.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67, 2020.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint, 1(2):3, 2022.

Roy, Y., Banville, H., Albuquerque, I., Gramfort, A., Falk, T. H., and Faubert, J.
EEG-based mental fatigue and cognitive state monitoring: A review. Frontiers in Human Neuroscience, 13:134, 2019. doi: 10.3389/fnhum.2019.00134.

Rushworth, M. F., Kolling, N., Sallet, J., and Mars, R. B. Valuation and decision-making in frontal cortex: one or many serial or parallel systems? Current Opinion in Neurobiology, 22:1-10, 2012.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, pp. 36479-36494, 2022.

Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv e-prints, arXiv-2509, 2025.

Shanmugam, K., Ghosh, S., and Chaudhari, P. Better generalization with test-time augmentation for deep learning models. arXiv preprint, 2020.

Sha'Abani, M., Fuad, N., Jamal, N., and Ismail, M. kNN and SVM classification for EEG: a review. In InECCE2019: Proceedings of the 5th International Conference on Electrical, Control & Computer Engineering, Kuantan, Pahang, Malaysia, 29th July 2019, pp. 555-565. Springer, 2020.

Snell, C. Scaling LLM test-time compute optimally can be more effective than scaling model parameters.

Song, Y., Zheng, Q., Liu, B., and Gao, X. EEG Conformer: Convolutional transformer for EEG decoding and visualization. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 31:710-719, 2022.

Spampinato, C., Palazzo, S., Kavasidis, I., Giordano, D., Souly, N., and Shah, M. Decoding brain representations by multimodal learning of neural activity and visual features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5184-5192, 2017.

Subasi, A. and Gursoy, M. I. EEG signal classification using PCA, ICA, LDA and support vector machines.
Expert Systems with Applications, 37(12):8659-8666, 2010.

Sun, Y., Wang, Y., Liu, Z., Miller, J., Schiele, B., Tenenbaum, J. B., Torralba, A., and Zhao, H. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning (ICML), pp. 9229-9248, 2020.

Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. LaMDA: Language models for dialog applications. arXiv preprint, 2022.

Tomioka, R., Aihara, K., and Müller, K.-R. Logistic regression for single trial EEG classification. Advances in Neural Information Processing Systems, 19, 2006.

Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., et al. Wan: Open and advanced large-scale video generative models. arXiv e-prints, arXiv-2503, 2025.

Wang, C., Chen, S., Wu, Y., Zhang, Z., Zhou, L., Liu, S., Chen, Z., Liu, Y., Wang, H., Li, J., et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023.

Wang, D., Shelhamer, E., Liu, S., Olshausen, B., and Darrell, T. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations (ICLR), 2021.

Wang, J., Zhao, S., Luo, Z., Zhou, Y., Jiang, H., Li, S., Li, T., and Pan, G. CBraMod: A criss-cross brain foundation model for EEG decoding. arXiv e-prints, arXiv-2412, 2024a.

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv e-prints, arXiv-2409, 2024b.

Wang, Y., Zhang, Y., and Li, P. Multimodal deep learning for EEG-based emotion recognition: A comprehensive survey. IEEE Transactions on Affective Computing, 2022a. doi: 10.1109/TAFFC.2022.3167324.
Wang, Z., Zhang, X., Wu, J., and Li, B. Open-vocabulary decoding of brain activity via contrastive language-image pre-training. NeuroImage, 263:119617, 2022b. doi: 10.1016/j.neuroimage.2022.119617.

Ward, N. and Frackowiak, R. Age-related changes in the neural correlates of motor performance. Brain, 126(4):873-888, 2003.

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. In ICLR, 2021.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, pp. 24824-24837, 2022.

Wolpaw, J. R., Birbaumer, N., McFarland, D. J., Pfurtscheller, G., and Vaughan, T. M. Brain-computer interfaces for communication and control. Clinical Neurophysiology, 113(6):767-791, 2002.

Yang, C., Xiao, D., Westover, M. B., and Sun, J. Self-supervised EEG representation learning for automatic sleep staging. arXiv e-prints, arXiv-2110, 2021.

Ye, Z., Xie, X., Liu, Y., Wang, Z., Chen, X., Zhang, M., and Ma, S. Brain topography adaptive network for satisfaction modeling in interactive information access system. In Proceedings of the 30th ACM International Conference on Multimedia, pp. 90-100, 2022.

Zhang, J., Zhao, Y., Saleh, M., and Liu, P. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In ICML, pp. 11328-11339, 2020a.

Zhang, Y., Sun, S., Galley, M., Chen, Y.-C., Brockett, C., Gao, X., Gao, J., Liu, J., and Dolan, W. B. DialoGPT: Large-scale generative pre-training for conversational response generation. In ACL, pp. 270-278, 2020b.

Zheng, W.-L., Liu, W., Lu, Y., Lu, B.-L., and Cichocki, A. EmotionMeter: A multimodal framework for recognizing human emotions. IEEE Transactions on Cybernetics, 49(3):1110-1122, 2018.
EEG-Based Brain-LLM Interface for Human Preference Aligned Generation: Supplementary Material

A. EEG Data Preprocessing

In this section, we detail the EEG preprocessing pipelines. To ensure a fair comparison while still allowing each backbone to operate under its most suitable conditions, we followed preprocessing strategies aligned with their respective official implementations or commonly adopted benchmark configurations.

General procedures. EEG was recorded using a 64-channel Neuracle NeuSen W system. We discarded five non-EEG channels (VEOL, VEOU, HEOL, HEOR, ECG), resulting in 59 channels used in all analyses. A 50 Hz notch filter was universally applied to mitigate power-line interference. Data were segmented into epochs ranging from -2.3 s to -0.3 s relative to the response onset markers, i.e., EEG recorded 0.3 s before the user pressed the button. To ensure a balanced evaluation across all models, we selected at most 50 samples per class for each subject using random sampling.

End-to-end models. For LaBraM (Jiang et al., 2024b) and EEGNet (Lawhern et al., 2018), we adopted a continuous-stage strategy with a 0.1-75.0 Hz bandpass filter to capture broad spectral information. LaBraM was subsequently downsampled to 200 Hz with independent channel-wise z-score normalization, following its published configuration. In contrast, EEGNet operated on the native 1000 Hz sampling rate of our acquisition system without additional downsampling. This choice is consistent with the temporal-convolutional design of EEGNet, which is typically configured relative to the sampling rate, and avoids under-tuning its temporal resolution. Class imbalance during training was handled by a WeightedRandomSampler. For EEG-Conformer (Song et al., 2022), we followed a segment-first pipeline with a 1.0-40.0 Hz bandpass filter and downsampling to 256 Hz, and applied fold-wise fitting of the scaler to strictly prevent data leakage.

Hand-crafted feature baselines. For SVM (Sha'Abani et al., 2020), LR (Tomioka et al., 2006), and LDA (Subasi & Gursoy, 2010), preprocessing focused on the 8.0-45.0 Hz frequency band to specifically extract oscillatory features (alpha, beta, and low gamma). The signals were downsampled to 200 Hz and baseline-corrected (using a 0.3 s pre-response interval). Covariance matrices were computed and projected via tangent space mapping to align non-Euclidean geometric features before classification.

B. Experimental Setup and Evaluation Metrics

In this section, we provide detailed configurations for model implementation and define the quantitative metrics used to assess alignment and generation quality.

B.1. Implementation Details

EEG Decoding Models. All EEG classifiers were implemented in PyTorch using the AdamW optimizer. For LaBraM, we fine-tuned the classification head of the pretrained checkpoint (TUH EEG Corpus) with a learning rate of 1e-4, a batch size of 16, and trained for 30 epochs. For EEGNet and EEG-Conformer, we trained the models from scratch with a higher learning rate of 1e-3 and a batch size of 16, running for 30 epochs to ensure convergence.

LMM Refinement Operator. For the closed-loop generation, we utilized state-of-the-art vision-language models (e.g., GPT-4o, Doubao-Vision) as the operator R. To balance creativity and instruction following, the temperature was set to 0.7 for the critique-and-refine step, and 0.0 for the evaluation step to ensure deterministic feedback. The maximum number of refinement iterations T_max was fixed at 5. The system prompt instructed the model to "fix defects inconsistent with the target" and "reinforce critical constraints".

B.2.
Evaluation Metrics Details

We employed a comprehensive suite of automated and human metrics to evaluate the framework.

Automated Metrics.

• CLIP Score (Radford et al., 2021): We follow the standard CLIP-Score implementation and compute this metric using the openai/clip-vit-large-patch14 backbone, which embeds both the generated images and the text prompts into a shared representation space and reports their cosine similarity.

• Aesthetic Score (Murray et al., 2012): Evaluates visual quality using a linear probe trained on the LAION-Aesthetics dataset, built upon the CLIP ViT-L/14 features.

• VQA Score (Lin et al., 2024): To assess fine-grained constraint satisfaction, we adopt the VQA Score metric implemented with the CLIP-FlanT5-XXL model. For each prompt-image pair, we follow the official yes/no question template that asks whether the image matches the full textual description, and we use the model's probability of an affirmative answer as the VQA Score.

• BLIP Score (Li et al., 2022): Measures the semantic consistency between the generated image and the original prompt using the Image-Text Contrastive (ITC) score from the BLIP model.

Figure 9. Radar plot visualizing the decoding performance of ten participants on the binary preference classification task. The green solid line represents individual subject accuracies, while the red and blue dashed lines indicate the group average (80.4%) and the chance level (50%), respectively.

Human Evaluation. We conducted a user study with 5 participants. For each image, participants provided:

• Human Rating (1-5): A Likert scale measuring overall alignment and quality (1 = Poor, 5 = Perfect).
• Success Rate (%): The percentage of trials where the participant explicitly evaluated the final generated image as Satisfied. To ensure the accuracy of this metric, we rely solely on the participant's manual ground-truth feedback, independent of the EEG decoder's real-time predictions.

C. Analysis of Individual Differences

As shown in Fig. 9, we visualize the decoding performance of all ten participants on the binary preference classification task using a radar plot. All subjects achieve accuracies clearly above the random baseline, indicating that preference-related information is reliably encoded in their EEG signals and can be effectively extracted by our classifier. At the same time, there is noticeable between-subject variability: some participants perform slightly above the group average, whereas others are slightly below, reflecting the widely reported individual differences in EEG recordings. Despite this variability, all subjects remain within a practically acceptable performance range, suggesting that the proposed decoder is applicable to heterogeneous user populations. However, the pronounced individual differences also highlight a key challenge for cross-subject EEG-based preference decoding: performance without subject-specific calibration is still limited. Mitigating this cross-subject variability and improving the generalization of the model across users will be an important direction for future work.

D. Limitations of Base Image Generators under Stress-Test Prompts

To better understand whether the observed failure cases are specific to a particular image generator, we additionally evaluated several different backbones on our high-difficulty prompt set (see Table 2), including both closed-source models and an open-source diffusion model such as Stable Diffusion XL (SDXL).
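The success rates discussed in this comparison follow the definition given in Appendix B.2: the percentage of trials the participant marked as Satisfied. As a minimal sketch, assuming satisfaction is recorded as one boolean flag per trial (the helper name is hypothetical, not from the paper's codebase):

```python
# Hypothetical helper: success rate as the percentage of trials the
# participant explicitly marked as Satisfied (manual ground truth,
# independent of the EEG decoder's real-time predictions).
def success_rate(satisfied_flags):
    """Return the percentage of True entries in a list of per-trial flags."""
    if not satisfied_flags:
        return 0.0
    return 100.0 * sum(satisfied_flags) / len(satisfied_flags)
```

For example, a backbone satisfying 3 of 4 stress-test trials would score 75.0 under this definition.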
We observe that stronger backbones generally yield higher automatic scores (e.g., CLIP, BLIP, aesthetic, and VQA scores) and noticeably improve the human-rated success rate. However, even the best-performing backbones still fail on a non-negligible portion of the stress-test prompts: for example, the strongest closed-source models reach success rates of only around 70-75%, with many failures concentrated on fine-grained counting, compositional spatial relations, and precise text rendering. These results suggest that the bottleneck is not solely in the proposed Brain-LLM interface or the EEG decoder, but also in the current generation capabilities of base T2I models when faced with highly structured, constraint-heavy prompts. In other words, simply swapping in a stronger backbone (e.g., GPT-4o Image or Doubao-seedream-4-0) improves the scores but does not fundamentally resolve the hardest cases in our benchmark. We therefore view our framework as complementary to future advances in image generators: better backbones are expected to raise the overall performance ceiling, while EEG-guided test-time scaling can help more reliably steer the generative process toward user-preferred solutions within that ceiling.