From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRI

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Reconstructing observed images from fMRI brain recordings is challenging. Unfortunately, acquiring sufficient “labeled” pairs of {Image, fMRI} (i.e., images with their corresponding fMRI responses) to span the huge space of natural images is prohibitive for many reasons. We present a novel approach which, in addition to the scarce labeled data (training pairs), allows fMRI-to-image reconstruction networks to be trained on “unlabeled” data as well (i.e., images without fMRI recordings, and fMRI recordings without images). The proposed model utilizes both an Encoder network (image-to-fMRI) and a Decoder network (fMRI-to-image). Concatenating these two networks back-to-back (Encoder-Decoder & Decoder-Encoder) allows augmenting the training with both types of unlabeled data. Importantly, it allows training on the unlabeled test-fMRI data. This self-supervision adapts the reconstruction network to the new input test data, despite its deviation from the statistics of the scarce training data.


💡 Research Summary

The paper tackles the challenging problem of reconstructing natural images from functional magnetic resonance imaging (fMRI) recordings of the visual cortex. Existing approaches—linear regression on handcrafted or deep features, and end‑to‑end deep decoders—are all limited by the scarcity of paired {image, fMRI} data, which typically amounts to only a few thousand samples due to the time‑consuming nature of MRI scanning. This scarcity hampers the ability of models to capture the vast variability of natural images and to generalize to new test‑time fMRI data, whose signal‑to‑noise ratio (SNR) and statistical properties often differ from the training set.

To overcome this limitation, the authors propose a novel self‑supervised framework that jointly trains an Encoder (E) mapping images to predicted fMRI responses and a Decoder (D) mapping fMRI signals back to images. The two networks are concatenated in both directions: E‑D (image → fMRI → image) and D‑E (fMRI → image → fMRI). This design enables the use of three data sources during training: (i) the scarce labeled image‑fMRI pairs, (ii) unlabeled natural images (no fMRI), and (iii) unlabeled test‑time fMRI recordings (no images). By imposing reconstruction losses on the cyclic paths, the model learns from data that would otherwise be discarded.
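The three training signals can be made concrete with a minimal sketch. The function names and toy "networks" below are illustrative assumptions (the paper's E and D are convolutional networks, and the supervised loss has additional perceptual and regularization terms); the sketch only shows how each data source enters a loss term.

```python
import numpy as np

def supervised_loss(D, img, fmri):
    # L_D (simplified): decode the measured fMRI, compare to the true image.
    return np.mean(np.abs(D(fmri) - img))

def ed_loss(E, D, img):
    # L_ED: image -> predicted fMRI -> image must return to the input image.
    return np.mean(np.abs(D(E(img)) - img))

def de_loss(E, D, fmri):
    # L_DE: fMRI -> decoded image -> predicted fMRI must return to the input fMRI.
    return np.mean((E(D(fmri)) - fmri) ** 2)

# Toy stand-ins: E flattens an image to a "voxel vector", D inverts it exactly.
E = lambda x: x.reshape(-1)
D = lambda r: r.reshape(4, 4)

img = np.random.rand(4, 4)
assert ed_loss(E, D, img) == 0.0   # exact inverses -> zero cyclic loss
```

Note that `ed_loss` needs only images and `de_loss` needs only fMRI vectors, which is exactly why unlabeled data from either domain becomes usable.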

Training proceeds in two stages. In the first stage, only the Encoder is trained in a supervised manner on the labeled pairs. The loss combines mean‑squared error (MSE) and cosine similarity between predicted and true voxel vectors (α = 0.9). Random image shifts are applied to mitigate unknown eye‑fixation variability. In the second stage, the Encoder’s weights are frozen and the Decoder is trained with a composite loss:
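A minimal sketch of the Encoder's supervised objective, assuming a simple α-weighted combination of the two terms (the paper states MSE plus cosine similarity with α = 0.9; the exact weighting form here is our assumption):

```python
import numpy as np

def encoder_loss(pred, target, alpha=0.9):
    # Combined loss on voxel vectors: alpha-weighted MSE plus a
    # cosine-distance term (1 - cosine similarity).
    mse = np.mean((pred - target) ** 2)
    cos = np.dot(pred, target) / (np.linalg.norm(pred) * np.linalg.norm(target))
    return alpha * mse + (1 - alpha) * (1 - cos)

r = np.array([0.2, -0.5, 1.0])
assert abs(encoder_loss(r, r)) < 1e-9   # perfect prediction -> ~zero loss
```

The cosine term rewards getting the *pattern* of voxel activity right even when its overall amplitude is off, which complements the amplitude-sensitive MSE term.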

  • L_D: supervised image reconstruction loss on labeled pairs, consisting of pixel‑wise L1, VGG‑based perceptual loss (using the first two layers of VGG‑19), and total‑variation regularization.
  • L_ED: unsupervised loss on unlabeled images, enforcing that D(E(s)) reconstructs the original image s.
  • L_DE: unsupervised loss on unlabeled test fMRI, enforcing that E(D(r)) reconstructs the original fMRI vector r.
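Of the terms inside L_D, the total-variation regularizer is easy to state exactly; here is a small sketch (the relative weight of this term in the composite loss is not reproduced here):

```python
import numpy as np

def total_variation(img):
    # Total-variation regularizer: sum of absolute differences between
    # neighbouring pixels, encouraging piecewise-smooth reconstructions.
    dh = np.abs(np.diff(img, axis=0)).sum()  # vertical neighbours
    dw = np.abs(np.diff(img, axis=1)).sum()  # horizontal neighbours
    return dh + dw

flat = np.ones((8, 8))                        # constant image: no variation
assert total_variation(flat) == 0.0

edge = np.zeros((8, 8)); edge[:, 4:] = 1.0    # one vertical edge
assert total_variation(edge) == 8.0           # 8 rows x 1 unit jump
```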

Each training batch contains 60 % labeled pairs, 20 % unlabeled images, and 20 % unlabeled test fMRI. The Decoder is optimized with Adam (initial LR = 1e‑3, 150 epochs, learning‑rate drops every 30 epochs). The Encoder uses SGD (LR = 0.1, 80 epochs). The entire two‑stage training finishes in roughly 15 minutes on a single Tesla V100 GPU; inference is near‑instantaneous.
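The stated 60/20/20 batch composition can be sketched as a simple sampler; the batch size of 50 and the function name are our assumptions for illustration:

```python
import random

def make_batch(labeled, images, test_fmri, batch_size=50):
    # Per-batch data mix for Decoder training: 60% labeled {image, fMRI}
    # pairs, 20% unlabeled images, 20% unlabeled test-time fMRI.
    n_lab = int(0.6 * batch_size)
    n_img = int(0.2 * batch_size)
    n_fmri = batch_size - n_lab - n_img
    return (random.sample(labeled, n_lab),
            random.sample(images, n_img),
            random.sample(test_fmri, n_fmri))

lab, img, fmri = make_batch(list(range(1000)), list(range(50000)), list(range(50)))
assert (len(lab), len(img), len(fmri)) == (30, 10, 10)
```

Each sub-batch then feeds its corresponding loss term (L_D, L_ED, L_DE respectively), so every gradient step mixes supervised and self-supervised signals.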

Architecturally, the Encoder starts from pretrained AlexNet conv1 filters, followed by three convolution‑batch‑norm‑ReLU blocks (stride 2) and a final fully‑connected layer to produce voxel‑space predictions. The Decoder reshapes the fMRI vector into a 14 × 14 × 64 feature map, then applies three up‑sampling blocks (conv‑ReLU‑batch‑norm) to reach 112 × 112 resolution, ending with a 3‑channel sigmoid output. Glorot‑normal initialization is used throughout.
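The spatial bookkeeping of these two paths can be checked with the standard convolution output-size formula. The 112 × 112 encoder input size, kernel size 3, and padding 1 are assumptions chosen to mirror the Decoder's output resolution:

```python
def conv_out(size, kernel, stride, pad):
    # Standard convolution output-size formula.
    return (size + 2 * pad - kernel) // stride + 1

# Encoder path: three stride-2 conv blocks halve the resolution each time.
size = 112
for _ in range(3):
    size = conv_out(size, kernel=3, stride=2, pad=1)
assert size == 14      # 112 -> 56 -> 28 -> 14

# Decoder path: fMRI vector reshaped to 14 x 14 x 64, then three x2
# up-sampling blocks restore the 112 x 112 output resolution.
size = 14
for _ in range(3):
    size *= 2
assert size == 112
```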

The method is evaluated on two publicly available fMRI datasets: “fMRI‑on‑ImageNet” (≈1,200 paired samples) and the classic “vim‑1” dataset. In both cases, the authors augment training with 50 k external ImageNet images (unlabeled) and the unlabeled test‑time fMRI recordings. Quantitative metrics (SSIM, PSNR, pixel‑wise MSE) and qualitative visual inspection show that the self‑supervised model outperforms state‑of‑the‑art approaches, including GAN‑based decoders, especially in preserving fine details of the original stimulus. Ablation studies reveal that the D‑E unsupervised loss on test fMRI contributes the largest performance boost, while the E‑D loss on natural images provides a complementary improvement.
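Of the quantitative metrics mentioned above, PSNR follows directly from pixel-wise MSE; a minimal sketch (assuming images normalized to a peak value of 1.0):

```python
import numpy as np

def psnr(ref, recon, peak=1.0):
    # Peak signal-to-noise ratio in dB: higher means a closer reconstruction.
    mse = np.mean((ref - recon) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(peak ** 2 / mse)

ref = np.full((8, 8), 0.5)
noisy = ref + 0.1                      # uniform 0.1 error -> MSE = 0.01
assert abs(psnr(ref, noisy) - 20.0) < 1e-9
```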

The paper’s contributions are threefold: (1) introducing the first self‑supervised training scheme that exploits unlabeled fMRI data, (2) leveraging a bidirectional encoder‑decoder architecture to impose cyclic reconstruction constraints on both image and fMRI domains, and (3) demonstrating that this approach yields competitive reconstruction quality across heterogeneous datasets despite extremely limited labeled data. Limitations include dependence on the quality of the fMRI signal (low spatial/temporal resolution) and the necessity of having a sufficient amount of test‑time fMRI for the unsupervised adaptation to be effective. Future directions suggested include extending the framework to multimodal neuroimaging (MEG/EEG), incorporating meta‑learning for rapid domain adaptation, and exploring higher‑resolution voxel models to further close the gap between decoded and ground‑truth visual experiences.

