Deep Models, Shallow Alignment: Uncovering the Granularity Mismatch in Neural Decoding
Neural visual decoding is a central problem in brain–computer interface research, aiming to reconstruct human visual perception and to elucidate the structure of neural representations. However, existing approaches overlook a fundamental granularity mismatch between human and machine vision: deep vision models emphasize semantic invariance by suppressing local texture information, whereas neural signals preserve an intricate mixture of low-level visual attributes and high-level semantic content. To address this mismatch, we propose Shallow Alignment, a novel contrastive learning strategy that aligns neural signals with intermediate representations of visual encoders rather than their final outputs, thereby striking a better balance between low-level texture details and high-level semantic features. Extensive experiments across multiple benchmarks demonstrate that Shallow Alignment significantly outperforms standard final-layer alignment, with performance gains ranging from 22% to 58% across diverse vision backbones. Notably, our approach effectively unlocks the scaling law in neural visual decoding, enabling decoding performance to scale predictably with the capacity of pre-trained vision backbones. We further conduct systematic empirical analyses to shed light on the mechanisms underlying the observed performance gains.
💡 Research Summary
This paper tackles a fundamental mismatch between human visual processing and modern deep vision models in the context of neural visual decoding. While state‑of‑the‑art decoding pipelines align non‑invasive neural recordings (EEG/MEG) with the final‑layer embeddings of large pre‑trained vision backbones, those embeddings are deliberately stripped of low‑level texture and color information to maximize semantic invariance. Consequently, the rich, multi‑scale information present in neural signals—ranging from early visual features (edges, spatial frequency) to high‑level object semantics—cannot be faithfully mapped, leading to ambiguous contrastive supervision and sub‑optimal decoding performance.
The authors propose “Shallow Alignment,” a simple yet powerful contrastive learning strategy that shifts the alignment target from the final output to an intermediate layer of the vision encoder. By selecting a depth l* (typically before the network’s representation collapses into a low‑dimensional semantic space), they extract a pooled feature map that retains both structural detail and discriminative semantics. Neural signals are encoded by a lightweight EEG/MEG encoder, and both neural and visual features are projected into a shared latent space using linear mappings, deliberately limiting model capacity so that performance gains stem from the quality of the intermediate visual representation rather than from over‑parameterized decoders.
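The alignment objective described above can be sketched in a few lines. The shapes, mean pooling choice, and symmetric InfoNCE loss below are illustrative assumptions for a minimal sketch, not the paper’s exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: N paired trials, an intermediate feature map of
# T spatial tokens with C channels, neural features of dimension D,
# and a shared latent space of dimension S.
N, T, C, D, S = 8, 49, 768, 256, 128

# Pooled intermediate visual features from a chosen depth l*
# (mean pooling over spatial tokens, as one simple pooling choice).
feat_map = rng.standard_normal((N, T, C))
visual = feat_map.mean(axis=1)                  # (N, C)

# Stand-in output of the lightweight EEG/MEG encoder.
neural = rng.standard_normal((N, D))

# Linear projections into the shared latent space (deliberately low capacity).
W_v = rng.standard_normal((C, S)) / np.sqrt(C)
W_n = rng.standard_normal((D, S)) / np.sqrt(D)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

z_v = l2norm(visual @ W_v)                      # (N, S)
z_n = l2norm(neural @ W_n)                      # (N, S)

# Symmetric InfoNCE: matched (neural, image) pairs lie on the diagonal
# of the cosine-similarity matrix; tau is an assumed temperature.
tau = 0.07
logits = z_n @ z_v.T / tau                      # (N, N)

def cross_entropy_diag(logits):
    # Mean negative log-probability of the diagonal (matched) entries.
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

loss = 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Swapping the final-layer embedding for this pooled intermediate feature map is the only change relative to conventional final-layer alignment; the contrastive machinery itself is unchanged.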
Experiments on the large‑scale THINGS‑EEG and THINGS‑MEG datasets evaluate a wide spectrum of backbones—from ResNet‑50/101 to Vision Transformers of varying sizes (ViT‑B/16, ViT‑H/14, ViT‑bigG/14) and recent foundation models (DINOv2, EVA‑02, InternViT). Across all settings, Shallow Alignment yields 22%–58% absolute improvements in Top‑1/Top‑5 zero‑shot retrieval accuracy compared with conventional final‑layer alignment. Notably, performance scales predictably with model capacity (correlation r = 0.90, p < 0.001), establishing a scaling law for neural decoding that was previously hidden by the granularity mismatch.
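As a concrete illustration of the evaluation metric, Top‑1/Top‑5 zero‑shot retrieval accuracy can be computed from cosine similarities between neural embeddings and a bank of candidate image embeddings. This is a toy example with synthetic data; the bank size and noise level are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: Q test trials, each neural embedding must retrieve
# its matching image among G candidates in a shared S-dim latent space.
Q, G, S = 20, 200, 128
image_bank = rng.standard_normal((G, S))

# Make neural embeddings noisy copies of their target images so retrieval
# is non-trivial but mostly solvable in this synthetic example.
targets = np.arange(Q)
neural_emb = image_bank[targets] + 0.3 * rng.standard_normal((Q, S))

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between each trial and every candidate image.
sims = l2norm(neural_emb) @ l2norm(image_bank).T     # (Q, G)

# Rank candidates per trial; Top-k accuracy is the fraction of trials
# where the true image appears among the k most similar candidates.
order = np.argsort(-sims, axis=1)
rank_of_target = (order == targets[:, None]).argmax(axis=1)
top1 = float(np.mean(rank_of_target < 1))
top5 = float(np.mean(rank_of_target < 5))
```

In the zero-shot setting, the candidate images (and their classes) are held out from training, so accuracy reflects how well the learned alignment generalizes rather than memorization.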
The paper also contrasts its approach with prior methods that implicitly reduce visual granularity through image blurring or augmentation (e.g., UBP, NeuroBridge). While those techniques improve robustness, they sacrifice high‑fidelity texture needed for faithful reconstruction. Shallow Alignment, by explicitly targeting intermediate representations, preserves fine‑grained details while still providing a semantically rich target, thereby achieving superior decoding without compromising image quality.
Limitations include the need to manually select the optimal intermediate layer for each backbone and dataset, and the reliance on linear projections, which may leave performance on the table that more expressive mappings could capture. Future work is suggested in multi‑scale contrastive objectives, extension to higher‑resolution neural modalities such as fMRI, and real‑time BCI deployment with lightweight encoders.
Overall, the study provides a clear diagnostic of the granularity mismatch problem, introduces an elegant solution, and demonstrates that aligning neural signals with appropriately “shallow” visual features unlocks both higher decoding accuracy and a principled scaling relationship between vision model size and brain‑decoding performance.