To be practical for real-life applications, models for brain-computer interfaces must be easily and quickly deployable on new subjects, effective on affordable scanning hardware, and small enough to run locally on accessible computing resources. To directly address these requirements, we introduce ENIGMA, a multi-subject electroencephalography (EEG)-to-Image decoding model that reconstructs seen images from EEG recordings and achieves state-of-the-art (SOTA) performance on the research-grade THINGS-EEG2 and consumer-grade Alljoined-1.6M benchmarks, while fine-tuning effectively on new subjects with as little as 15 minutes of data. ENIGMA has a simpler architecture than previous approaches and requires less than 1% of their trainable parameters. Our approach integrates a subject-unified spatio-temporal backbone with a set of multi-subject latent alignment layers and an MLP projector to map raw EEG signals to a rich visual latent space. We evaluate our approach using a broad suite of image reconstruction metrics that have been standardized in the adjacent field of fMRI-to-Image research, and we present the first EEG-to-Image study to conduct extensive behavioral evaluations of reconstructions with human raters. Our simple and robust architecture provides a significant performance boost across both research-grade and consumer-grade EEG hardware, along with a substantial improvement in fine-tuning efficiency and inference cost. Finally, we provide extensive ablations to determine the architectural choices most responsible for our performance gains in both single-subject and multi-subject settings across multiple benchmark datasets. Collectively, our work provides a substantial step towards practical brain-computer interface applications.
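To make the architecture described above concrete, the following is a minimal sketch, not the authors' implementation: per-subject latent alignment layers feed a shared spatio-temporal backbone, and an MLP projector maps the result into a visual (CLIP-like) latent space. All module choices, names, and dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EEGToImageLatent(nn.Module):
    """Hypothetical pipeline: subject alignment -> shared backbone -> MLP projector."""

    def __init__(self, n_subjects, n_channels=63, n_timepoints=250,
                 hidden_dim=256, clip_dim=1024):
        super().__init__()
        # Subject-specific alignment: one linear map per subject over the channel axis.
        self.align = nn.ModuleList(
            [nn.Linear(n_channels, n_channels) for _ in range(n_subjects)]
        )
        # Shared spatio-temporal backbone (here a simple convolution-over-time stack).
        self.backbone = nn.Sequential(
            nn.Conv1d(n_channels, hidden_dim, kernel_size=7, padding=3),
            nn.GELU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=7, padding=3),
            nn.GELU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time
        )
        # MLP projector into the visual latent space.
        self.projector = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, clip_dim),
        )

    def forward(self, eeg, subject_id):
        # eeg: (batch, channels, time); subject_id selects the alignment layer.
        x = self.align[subject_id](eeg.transpose(1, 2)).transpose(1, 2)
        x = self.backbone(x).squeeze(-1)   # (batch, hidden_dim)
        return self.projector(x)           # (batch, clip_dim)
```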
Reconstructing visual experiences from brain activity has long been a goal of both neuroscience and machine learning, and it is a foundational step for building decoding algorithms for practical brain-computer interface (BCI) applications targeting mental states such as visual imagery [17]. Decoding visual information encoded in human brain activity could help researchers better understand cognitive processes, and could also be useful in clinical settings [22], where millions of patients are left unable to communicate through conventional means as a result of traumatic brain injury, and where many common afflictions manifest as profound dysregulation of unwanted or confusing visual experiences [12]. While the Natural Scenes Dataset (NSD) [1,17] has yielded striking functional magnetic resonance imaging (fMRI)-based reconstructions of seen images using latent diffusion models [26], electroencephalography (EEG)-based reconstruction remains challenging due to EEG's low signal-to-noise ratio and spatial resolution. Despite these limitations, EEG remains appealing for real-time BCI applications because of its temporal precision and inexpensive, portable form factor.
Existing EEG-to-Image decoding research spans a wide range of architectural approaches: Fei et al. [6] demonstrate a simple linear mapping from EEG to an expressive CLIP (Contrastive Language-Image Pretraining [23]) image embedding space combined with a pre-trained diffusion model, while Li et al. [19] propose a more complex architecture (ATM-S) that uses a transformer-based brain encoder and a two-stage generation process built on the diffusion prior introduced in Scotti et al. [25]. These approaches have laid important groundwork for the field; however, current EEG-to-Image models face three critical barriers that prevent their deployment in practical BCI applications.
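The shared ingredient in these approaches is aligning EEG embeddings with CLIP image embeddings so that a pre-trained generative model can condition on the predicted embedding. The sketch below illustrates that alignment step with a symmetric contrastive (InfoNCE) objective; the function name and temperature value are assumptions, not details taken from the cited papers.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(eeg_emb, image_emb, temperature=0.04):
    # eeg_emb, image_emb: (batch, dim) embeddings for matched EEG/image pairs.
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = eeg_emb @ image_emb.T / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(len(eeg_emb), device=logits.device)
    # Symmetric cross-entropy over EEG->image and image->EEG directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```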
Rapid adaptation to new subjects remains an unsolved challenge. All existing methods for EEG-to-Image decoding [6,19,28] require training specialized models from scratch for each subject using hours of training data, which is impractical for real-world BCI applications that demand rapid functionality on new subjects. Recent work in fMRI-to-Image has made progress on this front: Scotti et al. [26] reduced calibration requirements from 40 hours to 1 hour through efficient fine-tuning. Even this breakthrough remains insufficient for practical deployment, however, as it still requires both an expensive 7-tesla fMRI scanner and a full hour of subject-specific data collection. Critically, no EEG-to-Image model has attempted to enable rapid fine-tuning on new subjects, despite EEG's advantages in accessibility and portability. We hypothesize that this gap has persisted because EEG's noisy spatio-temporal signal characteristics make it difficult to leverage generalizable knowledge learned from other subjects while quickly adapting to the idiosyncratic neural patterns of a new individual.
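One plausible recipe for such rapid adaptation, offered here as an assumption rather than a documented procedure, is to freeze the shared backbone and projector, add a fresh subject-specific alignment layer, and train only that layer on the small calibration set. The sketch reuses the hypothetical EEGToImageLatent module shown earlier.

```python
import torch

def prepare_for_new_subject(model, n_channels=63, lr=1e-3):
    # Freeze all shared parameters (backbone, projector, existing alignment layers).
    for p in model.parameters():
        p.requires_grad = False
    # Append a trainable alignment layer for the new subject.
    new_align = torch.nn.Linear(n_channels, n_channels)
    model.align.append(new_align)
    new_subject_id = len(model.align) - 1
    # Only the new alignment layer receives gradient updates.
    optimizer = torch.optim.AdamW(new_align.parameters(), lr=lr)
    return new_subject_id, optimizer
```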
Current approaches lack robustness when applied to affordable hardware. The recent release of the Alljoined-1.6M dataset, designed for evaluating EEG-to-Image models on consumer-grade hardware, revealed a troubling pattern: many recent models, especially complex architectures such as ATM-S, are not robust to drops in hardware quality and fail at a much higher rate when deployed in noisier recording environments [38]. This brittleness represents a fundamental obstacle to democratizing BCI technology, as the vast majority of potential users and downstream applications cannot access research-grade EEG equipment. The challenge lies in learning representations that capture the underlying neural encoding of visual information in a way that generalizes across varying levels of signal quality, electrode density, and noise characteristics. Without addressing this weakness, EEG-to-Image models will remain confined to laboratory settings, unable to translate to the accessible, consumer-grade hardware that many practical BCI applications require.
Existing architectures are computationally prohibitive for large-scale deployment. Most EEG-to-Image models are quite large relative to the amount of data they are trained on, and they require separate models to be trained for each subject, so an already large total model footprint grows linearly with the number of users [6,28]. While Li et al. [19] demonstrated that a unified model can be trained across multiple subjects, doing so without any mechanism for handling subject-specific differences led to a substantial performance drop; attaining reasonable performance still required training separate models for each subject. These architectural and size limitations constitute a stubborn barrier to widespread deployment: running these large models on edge devices or in settings without access to server-grade GPUs is nearly impossible, and supporting inference for multiple subjects requires scaling computational resources proportionally.