BrainSymphony: A parameter-efficient multimodal foundation model for brain dynamics with limited data
Foundation models are transforming neuroscience but are often prohibitively large, data-hungry, and difficult to deploy. Here, we introduce BrainSymphony, a lightweight and parameter-efficient foundation model with plug-and-play integration of fMRI time series and diffusion-derived structural connectivity, allowing unimodal or multimodal training and deployment without architectural changes while requiring substantially less data compared to the state-of-the-art. The model processes fMRI time series through parallel spatial and temporal transformer streams, distilled into compact embeddings by a Perceiver module, while a novel signed graph transformer encodes anatomical connectivity from diffusion MRI. These complementary representations are then combined through an adaptive fusion mechanism. Despite its compact design, BrainSymphony consistently outperforms larger models on benchmarks spanning prediction, classification, and unsupervised network discovery. Highlighting the model’s generalizability and interpretability, attention maps reveal drug-induced context-dependent reorganization of cortical hierarchies in an independent psilocybin neuroimaging dataset. BrainSymphony delivers accessible, interpretable, and clinically meaningful results and demonstrates that architecturally informed, multimodal models can surpass much larger counterparts and advance applications of AI in neuroscience.
💡 Research Summary
BrainSymphony is a lightweight, parameter‑efficient multimodal foundation model that jointly processes functional MRI (fMRI) time series and diffusion‑derived structural connectivity (SC). The fMRI stream consists of three parallel encoders: a spatial transformer that captures inter‑regional dependencies, a temporal transformer that models dynamics across time, and a 1‑D CNN that extracts local temporal patterns. These three streams generate high‑dimensional token sequences that are fed into a Perceiver module. The Perceiver uses cross‑attention against a set of learned latent vectors to distill the rich spatiotemporal information into a compact, fixed‑size embedding. In parallel, a signed graph transformer encodes the weighted SC matrix, leveraging edge‑aware attention that can represent both excitatory (positive) and inhibitory (negative) connections. The functional and structural embeddings are then combined through an adaptive gating mechanism that learns task‑specific weighting of each modality.
The authors evaluated BrainSymphony on the HCP‑Aging cohort for two downstream tasks: gender classification (discrete) and age prediction (continuous). Across both tasks, the multimodal fusion variant consistently outperformed unimodal baselines (fMRI‑only or SC‑only) and large‑scale foundation models such as BrainLM (111 M parameters) and Brain‑JEPA (85 M parameters). For gender classification, the fine‑tuned fusion model achieved 94.04 % accuracy and an F1‑score of 0.933, whereas the best large model reached only ~67 % accuracy. For age prediction, the fusion model obtained the lowest mean‑squared error (0.363) and the highest Pearson correlation (ρ = 0.841) with chronological age, a reduction in error of more than 35 % relative to the best baseline. Notably, BrainSymphony uses only 5.6 M trainable parameters—over an order of magnitude fewer than competing models—demonstrating that domain‑aware architectural priors can replace brute‑force scaling.
Interpretability was showcased by applying the model to an independent psilocybin dataset. Attention maps from the spatial transformer revealed context‑dependent reorganization of cortical hierarchies, highlighting which ROI‑to‑ROI interactions were amplified or suppressed under the drug condition. Structural embeddings reconstructed the original SC matrix with high fidelity, confirming that the model preserves anatomical information.
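The kind of attention-map contrast described here can be sketched as follows. This is a hedged illustration of the general analysis pattern, not the authors' pipeline: the head count, ROI count, and the random stand-in attention tensors are all assumptions, and the thresholding rule is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: per-head spatial-attention maps over n_roi regions,
# one tensor per condition (e.g., baseline vs. psilocybin). Random values
# stand in for real model outputs.
n_heads, n_roi = 8, 100
attn_baseline = rng.random((n_heads, n_roi, n_roi))
attn_drug = rng.random((n_heads, n_roi, n_roi))

def roi_map(a):
    # Normalize each ROI's outgoing attention to sum to 1, then average heads.
    a = a / a.sum(axis=-1, keepdims=True)
    return a.mean(axis=0)  # (n_roi, n_roi)

# Positive entries: ROI-to-ROI interactions amplified under the drug;
# negative entries: suppressed.
delta = roi_map(attn_drug) - roi_map(attn_baseline)

# Example threshold: flag edges more than 2 SD from the mean difference.
amplified = np.argwhere(delta > delta.mean() + 2 * delta.std())
print(delta.shape, amplified.shape[1])
```

Averaging over heads and contrasting conditions is one common way to turn raw transformer attention into an interpretable ROI-by-ROI difference map; per-head or rollout-style analyses are alternatives.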
Overall, BrainSymphony advances the field in three key ways: (1) it proves that a carefully designed, multimodal architecture can achieve state‑of‑the‑art performance with dramatically fewer parameters; (2) it demonstrates that integrating functional dynamics with structural scaffolding yields consistent gains across both classification and regression neuroimaging tasks; and (3) it provides biologically meaningful attention visualizations that link model decisions to known neurophysiological phenomena. Because of its modest computational footprint, BrainSymphony can be trained and fine‑tuned on standard GPU hardware, making advanced foundation‑model capabilities accessible to a broader research community and enabling deployment in clinical or mobile settings where resources are limited. Future work may extend the framework to additional modalities (e.g., PET, EEG) and explore self‑supervised pretraining to further enhance representation quality.