Learning Brain Representation with Hierarchical Visual Embeddings

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the [Original Paper Viewer] below or the original arXiv source.

Decoding visual representations from brain signals has attracted significant attention in both neuroscience and artificial intelligence. However, the degree to which brain signals truly encode visual information remains unclear. Current visual decoding approaches explore various brain-image alignment strategies, yet most emphasize high-level semantic features while neglecting pixel-level details, thereby limiting our understanding of the human visual system. In this paper, we propose a brain-image alignment strategy that leverages multiple pre-trained visual encoders with distinct inductive biases to capture hierarchical and multi-scale visual representations, while employing a contrastive learning objective to achieve effective alignment between brain signals and visual embeddings. Furthermore, we introduce a Fusion Prior, which learns a stable mapping on large-scale visual data and subsequently matches brain features to this pre-trained prior, thereby enhancing distributional consistency across modalities. Extensive quantitative and qualitative experiments demonstrate that our method achieves a favorable balance between retrieval accuracy and reconstruction fidelity.


💡 Research Summary

The paper tackles the long‑standing problem of decoding visual information from brain recordings (EEG, MEG, fMRI) by explicitly modeling the hierarchical nature of visual processing in the human cortex. The authors propose a two‑stage framework that first builds a multi‑scale visual embedding from several pretrained image encoders and then aligns brain signals to this embedding using contrastive learning. In the visual branch, three encoders are employed: two CLIP‑based models (ViT‑CLIP and ResNet‑CLIP) that provide high‑level semantic tokens, and a Variational Auto‑Encoder (VAE) that yields a dense, low‑resolution latent map preserving pixel‑level color, texture, and layout information. Each encoder’s output is linearly projected to a common 1024‑dimensional space, summed, and passed through a residual two‑layer MLP with GELU activation and LayerNorm, producing the fused visual representation z_f.
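The fusion step described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' code: the per-encoder feature dimensions, the weight initializations, and the function names are all assumptions; only the overall structure (linear projection to a shared 1024-d space, summation, residual two-layer MLP with GELU, LayerNorm) follows the paper's description.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each row to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
D = 1024  # shared embedding dimension (from the paper)
# per-encoder output dims are illustrative assumptions
dims = {"vit_clip": 768, "resnet_clip": 640, "vae": 4096}
proj = {k: rng.standard_normal((d, D)) * 0.02 for k, d in dims.items()}
W1 = rng.standard_normal((D, D)) * 0.02  # residual MLP weights
W2 = rng.standard_normal((D, D)) * 0.02

def fuse(feats):
    # project each encoder's output to the shared space and sum
    z = sum(feats[k] @ proj[k] for k in feats)
    # residual two-layer MLP with GELU, then LayerNorm
    h = gelu(z @ W1) @ W2
    return layer_norm(z + h)

feats = {k: rng.standard_normal((2, d)) for k, d in dims.items()}
z_f = fuse(feats)
print(z_f.shape)  # (2, 1024)
```

The residual connection lets the summed projection pass through largely intact early in training, while the MLP learns the nonlinear mixing between the three encoders' features.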

On the brain side, the raw time‑series are flattened, linearly projected to the same dimensionality, and processed by an identical MLP‑LayerNorm block to obtain the brain embedding z_b. A symmetric InfoNCE loss with a learnable temperature (initialized at 0.07) maximizes cosine similarity for matched brain‑image pairs while minimizing it for mismatched pairs, effectively pulling the two modalities into a shared embedding space.
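The symmetric InfoNCE objective can be written compactly. The sketch below is a NumPy rendering under the stated setup (cosine similarity logits, learnable temperature initialized at 0.07, cross-entropy in both brain-to-image and image-to-brain directions); batch size and dimensions are illustrative.

```python
import numpy as np

def symmetric_info_nce(z_b, z_f, log_temp):
    # L2-normalize so the dot product is cosine similarity
    zb = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    zf = z_f / np.linalg.norm(z_f, axis=1, keepdims=True)
    logits = zb @ zf.T / np.exp(log_temp)

    def ce_diag(l):
        # cross-entropy with matched pairs on the diagonal as targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # symmetric: brain->image and image->brain directions
    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))

rng = np.random.default_rng(1)
z = rng.standard_normal((4, 1024))
log_temp = np.log(0.07)  # temperature initialized at 0.07, as in the paper

matched = symmetric_info_nce(z, z, log_temp)
shuffled = symmetric_info_nce(z, z[::-1], log_temp)
print(matched < shuffled)  # True: matched pairs score lower loss
```

In training, `log_temp` would be a learnable parameter so the model can tune how sharply the similarity distribution is peaked.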

Directly feeding z_f into a diffusion model for image synthesis proved unstable because the diffusion prior expects a specific distribution of conditioning vectors. To bridge this gap, the authors pre‑train a “Fusion Prior” on large‑scale image data. The fused visual vector is further transformed by a high‑capacity MLP (hidden size 4096) into z_c, which is injected into a frozen SDXL UNet via an IP‑Adapter that uses cross‑attention. The UNet’s weights remain frozen; only the visual fuser, the high‑capacity projector, and the IP‑Adapter are trained to predict diffusion noise, yielding a stable mapping from z_f to diffusion conditions.
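The projection and injection described above can be sketched as follows. This is a toy NumPy illustration, not SDXL or the real IP-Adapter: the condition dimension `Dc`, the query-token count, and all weight shapes are assumptions; only the hidden size of 4096 and the cross-attention pattern (frozen UNet queries attending to condition tokens derived from z_f) come from the summary.

```python
import numpy as np

rng = np.random.default_rng(2)
D, H, Dc = 1024, 4096, 1280  # fused dim, projector hidden size, condition dim (Dc assumed)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# high-capacity projector (hidden size 4096): z_f -> z_c
W1 = rng.standard_normal((D, H)) * 0.01
W2 = rng.standard_normal((H, Dc)) * 0.01
def project(z_f):
    return gelu(z_f @ W1) @ W2

def ip_cross_attention(q, z_c, Wk, Wv):
    # IP-Adapter-style injection: UNet spatial queries attend to
    # keys/values computed from the condition vector z_c
    k, v = z_c @ Wk, z_c @ Wv
    att = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return att @ v

z_f = rng.standard_normal((1, D))
z_c = project(z_f)                         # one condition token in this toy setup
Wk = rng.standard_normal((Dc, 320)) * 0.01  # adapter projections (trainable)
Wv = rng.standard_normal((Dc, 320)) * 0.01
q = rng.standard_normal((16, 320))          # 16 toy spatial query tokens from the UNet
out = ip_cross_attention(q, z_c, Wk, Wv)
print(out.shape)  # (16, 320)
```

Only `W1`, `W2`, `Wk`, and `Wv` (plus the visual fuser) would receive gradients from the noise-prediction loss; the UNet producing `q` stays frozen throughout.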

After the Fusion Prior is fixed, only the brain encoder is fine‑tuned with the same contrastive objective, ensuring that brain‑derived embeddings land precisely in the pretrained visual space. At inference time, a brain signal is transformed to z_b, aligned to z_f, converted to z_c, and finally fed to the frozen diffusion model, producing a high‑quality reconstruction without any textual prompt.
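The end-to-end inference chain reduces to three function calls. The stubs below are placeholders standing in for the trained components (names, shapes, and the stub behaviors are hypothetical); they only illustrate the data flow z_b → z_c → image described above.

```python
import numpy as np

# hypothetical stubs for the trained components (placeholder logic only)
brain_encoder = lambda x: x[:1024]                   # raw signal -> z_b (stage 2: fine-tuned)
fusion_prior = lambda z: np.tile(z, 2)               # z_b -> z_c (stage 1: pretrained, frozen)
diffusion_sample = lambda c: np.zeros((64, 64, 3))   # frozen SDXL UNet + IP-Adapter

def reconstruct(brain_signal):
    z_b = brain_encoder(brain_signal)  # brain embedding, aligned to z_f contrastively
    z_c = fusion_prior(z_b)            # condition vector in the diffusion prior's space
    return diffusion_sample(z_c)       # no text prompt is required at inference

img = reconstruct(np.random.default_rng(3).standard_normal(4096))
print(img.shape)  # (64, 64, 3)
```

Because the Fusion Prior and UNet are frozen after stage 1, the only component that adapts to a new subject or recording modality is the brain encoder.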

Extensive experiments on public EEG, MEG, and fMRI datasets evaluate two tasks: zero‑shot image retrieval and image reconstruction. For retrieval, the proposed method outperforms CLIP‑based baselines by more than 12 percentage points in top‑1 accuracy, demonstrating that the hierarchical embedding captures both semantic and fine‑grained visual cues. For reconstruction, quantitative metrics (FID, SSIM, LPIPS) and human perceptual ratings show substantial gains over prior brain‑to‑image pipelines, especially when the VAE latent is included, confirming that low‑level details are faithfully recovered. Ablation studies reveal that removing any of the three encoders degrades performance, and that omitting the Fusion Prior leads to mode collapse and noisy generations.

The authors acknowledge limitations: the approach relies on relatively low‑resolution, noisy brain recordings, and the Fusion Prior is currently trained only on static images, limiting direct extension to video or other modalities. Future work is suggested on multimodal alignment (text, audio, video), real‑time brain‑computer interface applications, and scaling the Fusion Prior to broader visual domains.

In summary, the paper introduces a novel brain‑vision interface that (1) fuses high‑level semantics with low‑level pixel information via multiple pretrained encoders, (2) aligns brain signals to this hierarchical space using contrastive learning, and (3) stabilizes downstream diffusion‑based image synthesis with a pretrained Fusion Prior. The resulting system achieves state‑of‑the‑art retrieval accuracy while delivering markedly higher‑fidelity reconstructions, offering a compelling step toward a deeper understanding of how visual information is represented in the human brain.

