Reconstructing Natural Scenes from fMRI Patterns using BigBiGAN
Milad Mozafari (CerCo, CNRS, Toulouse, France; milad.mozafari@cnrs.fr), Leila Reddy* (CerCo, CNRS and ANITI, Université de Toulouse, Toulouse, France; leila.reddy@cnrs.fr), Rufin VanRullen* (CerCo, CNRS and ANITI, Université de Toulouse, Toulouse, France; rufin.vanrullen@cnrs.fr)

*These authors contributed equally to this work. Funded by AI-REPS ANR-18-CE37-0007-01, ANITI ANR-19-PI3A-0004 and an NVIDIA GPU grant.

This manuscript was accepted to the IEEE International Joint Conference on Neural Networks (IJCNN). Please cite it as: M. Mozafari, L. Reddy and R. VanRullen, "Reconstructing Natural Scenes from fMRI Patterns using BigBiGAN," 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, United Kingdom, 2020, pp. 1-8, doi: 10.1109/IJCNN48605.2020.9206960 (https://ieeexplore.ieee.org/document/9206960).

Abstract—Decoding and reconstructing images from brain imaging data is a research area of high interest. Recent progress in deep generative neural networks has introduced new opportunities to tackle this problem. Here, we employ a recently proposed large-scale bi-directional generative adversarial network, called BigBiGAN, to decode and reconstruct natural scenes from fMRI patterns. BigBiGAN converts images into a 120-dimensional latent space which encodes class and attribute information together, and can also reconstruct images based on their latent vectors. We computed a linear mapping between fMRI data, acquired over images from 150 different categories of ImageNet, and their corresponding BigBiGAN latent vectors. Then, we applied this mapping to the fMRI activity patterns obtained from 50 new test images from 50 unseen categories in order to retrieve their latent vectors, and reconstruct the corresponding images. Pairwise image decoding from the predicted latent vectors was highly accurate (84%). Moreover, qualitative and quantitative assessments revealed that the resulting image reconstructions were visually plausible, successfully captured many attributes of the original images, and had high perceptual similarity with the original content. This method establishes a new state-of-the-art for fMRI-based natural image reconstruction, and can be flexibly updated to take into account any future improvements in generative models of natural scene images.

Index Terms—fMRI Decoding, Visual Reconstruction, Natural Scenes, BigBiGAN

I. INTRODUCTION

For many years, scientists have used machine learning (ML) to decode and understand human brain activity in response to visual stimuli. The great progress of deep neural networks (DNNs) in the last decade has provided researchers with powerful tools and a large number of unexplored opportunities to achieve better brain decoding and visual reconstructions from functional magnetic resonance imaging (fMRI) data.

A variety of approaches have been taken to address image reconstruction from brain data. Before the deep learning era, researchers achieved reconstructions of simple binary stimuli directly from fMRI data [1]. Even though the reconstruction of complex natural images was hardly possible in those days, there were attempts to identify the image within a dataset, instead of reconstructing it: for example, quantitative receptive field models were used to identify the presented image [2]; in another work [3], the authors made use of Bayesian methods to find the image with the highest likelihood.

In recent years, deep networks have brought significant improvements in this field, with the reconstruction of handwritten digits using deep belief networks [4], of face stimuli with variational auto-encoders (VAEs) [5], and of natural scenes with feed-forward networks [6], [7], generative adversarial networks (GANs) [8], [9], and dual-VAE/GAN [10].
Most reconstruction methods for natural images, however, tend to emphasize pixel-level similarity with the original images, and rarely produce recognizable objects, or visually plausible or semantically meaningful scenes.

Inspired by [5], we propose a method to reconstruct natural scenes from fMRI data using a recently proposed large-scale bi-directional generative adversarial network, called BigBiGAN [11]. This network is the current state-of-the-art for unconditional image generation on ImageNet in terms of image quality and visual plausibility. In our proposed method, the brain data is mapped to the latent space of the BigBiGAN (pre-trained on ImageNet), whose generator is then used to reconstruct the image.

Fig. 1 gives an overview of the proposed method. Specifically, a training set of natural images that is shown to the human subjects is also fed into BigBiGAN's encoder to get "original" latent vectors. Then, a linear mapping is computed between the brain responses to the training images and their corresponding original latent vectors. Applying this mapping to the brain data for novel test images, a set of "predicted" latent vectors is then generated. Finally, these predicted latent vectors are passed on to BigBiGAN's generator for image reconstruction.

Fig. 1: The proposed method. (a) Training phase. We compute a linear mapping (computing the linear transform matrix $W$) from 120-D latent vectors (derived from the BigBiGAN encoder or from PCA decomposition) to $n_v$-D fMRI patterns, where $n_v$ is the number of voxels inside the brain region of interest. (b) Test phase. The obtained mapping is inverted to transform the fMRI patterns of test images into latent vectors. The image is then reconstructed using BigBiGAN's generator (or a PCA inverse transform).

We demonstrate that the proposed method is able to outperform others by generating high-resolution naturalistic reconstructions thanks to the BigBiGAN generator. We justify our claims by quantitative comparisons of the reconstructions to the original images in the high-level representational space of a state-of-the-art deep neural network.

II. PREVIOUS WORKS

We begin this section by describing our earlier work, from which the present method was adapted. In [5], we took advantage of the latent space of a VAE trained with a GAN procedure on a large set of faces. By learning a linear mapping between fMRI patterns and 1024-dimensional VAE latent vectors, and using the GAN generator to reconstruct input images, we established a new state-of-the-art for fMRI-based face reconstruction. Moreover, the method even allowed for decoding face gender or face mental imagery. Despite these promising results on faces, dealing with natural images remains a hard challenge.

In another study [12], the authors used a VAE for reconstructing naturalistic movie stimuli.
They first trained a VAE, with five layers for encoding and five layers for decoding, on the ImageNet dataset. Then, similar to [5], they converted the fMRI patterns to the VAE's latent space through linear mapping. Although they reported an appreciable level of success, the reconstructions were still blurry and difficult to recognize.

Studies in this field are not limited to the latent space of VAEs. In [6], the feature space of deep convolutional networks (DCNs) was used for fMRI decoding and image reconstruction. To do so, a decoder was first trained to transform fMRI patterns into the DCN's image representations. Then, for each fMRI pattern, an initial image was proposed and passed through iterative optimization steps. In each iteration, the image was given to the DCN, and the difference between its feature representation and that of the actual image was computed as a loss value. Finally, pixel values were optimized to decrease this loss. The authors also examined optimization in the space of deep generative networks instead of in pixel space. According to the obtained reconstructions, their method was able to capture input attributes such as object color, position, and a coarse estimate of shape. However, the images remained blurry and the objects hard to recognize.

Other studies have proposed original network architectures instead of using pre-existing ones. In [7], an encoder/decoder structure was proposed, in which the encoder maps images to fMRI data, while the decoder does the reverse. In the first step, the encoder and decoder were separately trained on (image, fMRI) data pairs. Since the number of data pairs was insufficient for proper generalization, the authors applied a second round of training in an unsupervised fashion.

In yet another study [10], the authors proposed a dual-VAE, trained with a GAN procedure. This method involved three stages of training. In Stage 1, the encoder, generator, and discriminator were trained on original images vs. generated ones. In Stage 2, the generator was fixed, the encoder was trained on fMRI data, and the discriminator was trained with reconstructed images from the fMRI data and reconstructed images from Stage 1. Finally, in Stage 3, the encoder was fixed, and the generator and discriminator networks were fine-tuned using the original images and the reconstructed images from the fMRI data. This three-stage method not only outperformed previous studies in image decoding, but also generated crisper and more visually plausible reconstructions. However, object identity was not always evident in the reconstructed images.

In this paper, we reconstruct images from human brain activity patterns using the state-of-the-art in natural image generation, a large-scale bi-directional GAN coined "BigBiGAN" [11]. Notably, the high-level image attributes captured in the latent space of the BigBiGAN allow us to go beyond pixel-wise similarity between the original and reconstructed images, and to reconstruct realistic and visually plausible scenes that express high-level semantic and category-level information from brain activity patterns.

III. MATERIALS AND METHODS

A. fMRI Data

In this paper, we used open-source fMRI data provided by [13]. Images in the stimulus set were selected from ImageNet, and included 1200 training samples (1 presentation each) from 150 categories (150 × 8), and 50 test samples (35 presentations each) from 50 categories. Training and test categories were independent of each other. Five healthy subjects viewed these training and test images in an fMRI scanner in separate sessions. Each fMRI run consisted of a fixation point (33 s), 50 image presentations (9 s per image, flashing at 2 Hz), and a final fixation point (6 s). Moreover, 5 images were randomly repeated during a run and subjects performed a one-back task on these images (i.e., they pressed a button when the same image was presented on two consecutive trials).

We downloaded the raw data (https://openneuro.org/datasets/ds001246/) and applied a standard preprocessing pipeline: slice-time correction, realignment, and coregistration to the T1w anatomical image using SPM12 software (https://www.fil.ion.ucl.ac.uk/spm/software/spm12/). Details of the parameters used for preprocessing can be found in [5]. The downloaded fMRI dataset also provided pre-defined regions of interest (ROIs) that covered visual cortex. The onset and duration of each image were entered into a general linear model (GLM) as regressors (a separate GLM was used for the training and test sessions).
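The GLMs in this paper were estimated with SPM12. Purely as an illustration of the underlying idea, i.e., obtaining one activation pattern per image condition from onset/duration regressors, here is a minimal sketch using nilearn instead; this is our assumption of a roughly equivalent pipeline, not the authors' code, and the onsets, durations, file name, and TR are placeholder values:

```python
import pandas as pd
from nilearn.glm.first_level import FirstLevelModel

# Placeholder event table for one run: each presented image becomes its
# own condition, so the GLM yields one beta (activation) pattern per image.
events = pd.DataFrame({
    "onset":      [33.0, 42.0, 51.0],             # seconds from run start
    "duration":   [9.0, 9.0, 9.0],                # 9 s per image presentation
    "trial_type": ["img_001", "img_002", "img_003"],
})

# Fit a first-level GLM on the preprocessed run (file name and TR are
# placeholders) and extract the activation map for one image condition.
glm = FirstLevelModel(t_r=3.0, hrf_model="spm")
glm = glm.fit("sub-01_run-01_preproc.nii.gz", events=events)
beta_img_001 = glm.compute_contrast("img_001", output_type="effect_size")
```

Stacking such per-image beta patterns (restricted to the voxels of an ROI) yields the brain activation matrices used in the decoding step below.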
B. BigBiGAN

BigBiGAN is a state-of-the-art large-scale bi-directional generative network for natural images [11]. It is a successor of the bi-directional GAN BiGAN [14], but adopts the generator and discriminator architectures of the more recent BigGAN [15]. As in BiGAN, the encoder and generator are trained indirectly via a joint discriminator that has to discriminate real from fake [latent vector, data] pairs: the encoder maps data into latent vectors (real pairs), while the generator reconstructs data from latent vectors (fake pairs). Unlike BigGAN, a conditional GAN which requires a separate "conditioning" vector for the object category, BigBiGAN's generator has a unified 120-dimensional latent space which captures all properties of objects, including category and pose. In other words, each image can be expressed as a 120-dimensional vector in the network's latent space, and any latent vector can be mapped back to the corresponding image. The low dimensionality of the BigBiGAN latent space makes it particularly appealing for fMRI-based decoding, given the relatively small amount of brain data available for training our system (see III-D).

In this study, we used the largest pre-trained BigBiGAN model, revnet50x4, with 256 × 256 image resolution. The model is publicly available on TensorFlow Hub (https://tfhub.dev/deepmind/bigbigan-revnet50x4/1).

C. PCA Model

As a baseline image decomposition and reconstruction model for our comparisons, we applied principal component analysis (PCA) to a set of 15000 images that were randomly selected from the 150 training categories (100 each); we made sure that the 1200 training images were included. Using the first 120 principal components (PCs), all of the image stimuli were transformed into a set of 120-D vectors. These vectors were then treated similarly to BigBiGAN's latent vectors for brain decoding and reconstruction. This method (known as "eigen-face" or "eigen-image") has previously been applied to fMRI-based face reconstruction [5], [16] and natural image reconstruction [12].
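Both 120-D feature extractors are straightforward to reproduce. The sketch below assumes the "encode" and "generate" signatures that the TF Hub BigBiGAN module exposes in DeepMind's demo notebook (TF1-style execution; the output key names are our assumption from that documentation), together with the eigen-image baseline via scikit-learn; the image arrays are random placeholders:

```python
import numpy as np
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub
from sklearn.decomposition import PCA

tf.disable_v2_behavior()  # the module is a TF1-format hub.Module

# Placeholder stimuli: 128 RGB images, 256x256, scaled to [-1, 1].
images = (np.random.rand(128, 256, 256, 3).astype(np.float32) * 2 - 1)

# --- BigBiGAN latents (signature names per the module's documentation) ---
module = hub.Module("https://tfhub.dev/deepmind/bigbigan-revnet50x4/1")
x_ph = tf.placeholder(tf.float32, [None, 256, 256, 3])
z_ph = tf.placeholder(tf.float32, [None, 120])
# The encoder also exposes 'z_sample'; 'z_mean' is the deterministic latent.
z_mean = module(x_ph, signature="encode", as_dict=True)["z_mean"]
recon = module(z_ph, signature="generate")  # images from latent vectors

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    latents = sess.run(z_mean, {x_ph: images[:8]})  # (8, 120) "original" latents
    rebuilt = sess.run(recon, {z_ph: latents})      # BigBiGAN reconstructions

# --- Eigen-image baseline: first 120 PCs in pixel space ---
flat = images.reshape(len(images), -1)
pca = PCA(n_components=120).fit(flat)       # in the paper: fit on 15000 images
codes = pca.transform(flat)                 # 120-D PCA projections
rebuilt_pca = pca.inverse_transform(codes).reshape(images.shape)
```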
D. Decoding and Reconstruction

Using linear regression, we computed a linear encoder that maps the 120-dimensional BigBiGAN latent representations (or the 120-dimensional PCA projections) associated with the training images onto the corresponding brain representations, recorded while the human subjects viewed the same images in the scanner (see Fig. 1a). For each subject, this mapping is computed by a general linear regression model (GLM) whose design matrix includes the following regressors of interest: fixation (during the fixation point), stimulus (whenever an image was presented), and one-back (when the image was a target for the one-back task). In order to obtain the mapping parameters, the 120-dimensional latent vectors (or PCs) of the training images were added as parametric modulators of the "stimulus" regressor. This step takes into account the covariance matrix of the latent dimensions (across images), and produces a linear transform matrix $W$ which is used for the inverse transformation in the test phase. In other words, for the training set of 1200 images, if there are $n_v$ voxels in the desired ROI, the GLM finds an optimal transformation matrix $W$ between their 121-dimensional latent vectors (including an additional constant bias term) and the corresponding $n_v$-dimensional brain activation vectors:

$$Y_{1200 \times n_v} = X_{1200 \times 121} \cdot W_{121 \times n_v}, \tag{1}$$

where $X$ and $Y$ denote the latent and brain activation vectors, respectively. Please note that all of the GLMs were solved by SPM12 over the entire visual cortex (the union of all pre-defined functional ROIs).

For the test images, brain representations were derived from another GLM in which (in addition to the "fixation" and "one-back" regressors, as previously) the presentation of each test image was treated as a separate regressor. The previously computed mapping $W$ was then inverted (again taking into account the covariance matrix of the latent dimensions, this time across brain voxels), and used to predict the latent vectors (or PCA projections) from the brain representations (see Fig. 1b). This corresponds to the "brain decoding" step. Precisely, we retrieved the latent vectors $X_{50 \times 121}$ from the brain activation vectors $Y_{50 \times n_v}$ of the 50 test images, using the previously computed $W$ and the (pseudo-)inverse of its covariance matrix, $(WW^T)^{-1}$:

$$Y = X \cdot W \;\;\Rightarrow\;\; YW^T = X \cdot WW^T \;\;\Rightarrow\;\; X = YW^T \cdot (WW^T)^{-1}. \tag{2}$$

Before solving equation (2), the brain activation vectors were zero-meaned by subtracting from each the average activation vector across all test images. Finally, we discarded the bias term from the predicted latent vectors (or PCA projections), and fed them into BigBiGAN's generator (or the PCA inverse transform) to generate the image reconstructions. Since BigBiGAN's generator is sensitive to the distribution of the latent variables, we re-scaled the predicted latent variables using the mean and standard deviation of the latent variables of the training set before feeding them to the generator.
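In the paper, $W$ is estimated inside an SPM12 GLM (with the latents as parametric modulators), but the algebra of equations (1) and (2) is easy to illustrate. The following self-contained numpy sketch uses simulated matrices in place of real latents and beta patterns; the shapes follow the paper, while the noise model and ROI size are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_vox = 1200, 50, 4500   # n_vox: arbitrary ROI size here

# Simulated stand-ins: 120-D latents plus a constant bias column (Eq. 1
# uses 121-D X), with brain patterns generated as Y = X.W plus noise.
X_train = np.hstack([rng.standard_normal((n_train, 120)), np.ones((n_train, 1))])
X_test = np.hstack([rng.standard_normal((n_test, 120)), np.ones((n_test, 1))])
W_true = rng.standard_normal((121, n_vox))
Y_train = X_train @ W_true + rng.standard_normal((n_train, n_vox))
Y_test = X_test @ W_true + rng.standard_normal((n_test, n_vox))

# Training phase (Eq. 1): fit W by least squares, latents -> voxels.
W, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

# Test phase (Eq. 2): X = Y W^T (W W^T)^{-1}, after zero-meaning Y
# across the test images.
Yc = Y_test - Y_test.mean(axis=0)
X_pred = Yc @ W.T @ np.linalg.inv(W @ W.T)

# Discard the bias column, then rescale each latent dimension to the
# training set's mean/std before feeding the generator (see above).
z_pred, z_train = X_pred[:, :120], X_train[:, :120]
z_pred = (z_pred - z_pred.mean(0)) / z_pred.std(0)
z_pred = z_pred * z_train.std(0) + z_train.mean(0)
```

Note that inverting through $(WW^T)^{-1}$ is what makes the decoding well-posed: $W$ is not square ($121 \times n_v$), so the $121 \times 121$ covariance of its rows is inverted instead.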
E. Computational Efficiency

The whole computational pipeline from raw fMRI data to image reconstructions consists of the following steps:

1) fMRI pre-processing
2) Extracting brain representations for test images (GLM)
3) Extracting latent representations for training images (using BigBiGAN's encoder)
4) Computing the linear mapping (GLM)
5) Predicting latent vectors for test images (using the inverse mapping)
6) Reconstructing images (using BigBiGAN's generator)

Apart from the first two steps, which are common to almost all fMRI image reconstruction methods, the major computational cost of the proposed method is computing the linear mapping. This is not only considerably less expensive than training large, complex encoder/decoder networks (we use pre-trained networks instead), but also easily adaptable to the latent space of any other pre-trained network. In other words, as soon as a better natural scene generator emerges, we can substitute the new network for the old one and run the pipeline again (from step 2).

For our experiments, we ran this pipeline on a machine running Ubuntu 18.04 with 128 GB of memory, 40 CPU cores (2.20 GHz), and an NVIDIA TITAN V GPU. The Nipype Python package was used to parallelize the pre-processing and GLM steps over the five subjects. It took around 16 hours to compute the linear mapping (GLM on the training data) for all subjects, while the encoding and image reconstructions with BigBiGAN took only a few seconds.

F. Decoding Accuracy

We used a pairwise strategy to evaluate the accuracy of our brain decoder. Assume that there is a set of $n$ (original) vectors $v_1, v_2, \ldots, v_n$ and their respective predictions $p_1, p_2, \ldots, p_n$. The pairwise decoding accuracy is then computed as:

$$\frac{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} K\bigl(c(v_i, p_i) + c(v_j, p_j),\; c(v_i, p_j) + c(v_j, p_i)\bigr)}{\binom{n}{2}}, \tag{3}$$

where $c(\cdot,\cdot)$ is the Pearson correlation and

$$K(a, b) = \begin{cases} 1 & a > b \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$

G. High-Level Similarity Measure

Unlike human judgement, classic similarity metrics such as mean squared error (MSE), pix-comp [10], or the structural similarity index (SSIM) are computed in pixel space and cannot capture high-level perceptual similarities, e.g., in terms of object attributes and identity, or semantic category. One good solution to this problem is to make use of DCN representational spaces, as several pieces of evidence support their correlation with the human brain [13], [17], [18]. In this paper, Inception-V3 [19] was the DCN of our choice, with the outputs of its last inception block (after concatenation of its branches) defining our high-level representational space. In this space, as a measure of high-level perceptual similarity, we computed the average Pearson correlation distance between the representations of the original images and those of their associated fMRI reconstructions. In addition to this high-level measure, we also report pix-comp values [10] as a measure of low-level similarity.
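Equations (3) and (4) translate directly into code. Below is a minimal numpy sketch (our illustration, not the authors' evaluation script): applied to raw pixel vectors it gives a pix-comp-style low-level score, and applied to Inception-V3 last-block features it gives the high-level comparison described in Section III-G.

```python
import numpy as np

def pairwise_decoding_accuracy(V, P):
    """Eqs. (3)-(4): fraction of pairs (i, j) for which matching the
    predictions to their own targets, c(v_i,p_i) + c(v_j,p_j), beats
    the swapped assignment, c(v_i,p_j) + c(v_j,p_i).
    V, P: (n, d) arrays of original and predicted feature vectors."""
    # Pearson correlation between every original and every predicted vector.
    Vz = (V - V.mean(1, keepdims=True)) / V.std(1, keepdims=True)
    Pz = (P - P.mean(1, keepdims=True)) / P.std(1, keepdims=True)
    C = Vz @ Pz.T / V.shape[1]                 # C[i, j] = c(v_i, p_j)
    n, wins = len(V), 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            wins += (C[i, i] + C[j, j]) > (C[i, j] + C[j, i])
    return wins / (n * (n - 1) // 2)           # chance level: 0.5

# Example with random vectors (accuracy should hover around 0.5):
rng = np.random.default_rng(0)
print(pairwise_decoding_accuracy(rng.standard_normal((50, 120)),
                                 rng.standard_normal((50, 120))))
```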
IV. RESULTS

A. Image Reconstructions

Using BigBiGAN's generator (or the PCA inverse transform), we could reconstruct an estimate of the test images from the latent vectors obtained by the brain decoder ($W^T$). Since BigBiGAN's generator is not perfect (see the first and second columns in Fig. 2 and Fig. 3), we cannot expect the fMRI reconstructions to be identical to the input images (even if our decoding procedure were 100% accurate). However, we found that the brain decoder not only captured several high-level attributes of the images, but also produced robust consistencies in image reconstruction across subjects.

Fig. 2 shows a series of reconstructions across all five subjects. For example, when the input image contained an animal (rows 1, 2, 5, 7, 10) or a human (row 9), it was preserved in the reconstructions with comparable location, body shape and pose across subjects. It is worth mentioning that objects or attributes that occur with a higher frequency in the ImageNet dataset are more likely to be preserved in the original BigBiGAN and fMRI reconstructions. For instance, the images in the third and eighth rows are not common in ImageNet, yet their roundness attribute is frequently observed there. Thus, all the reconstructions agreed on a round object, even though they could not exactly reconstruct what the object was. Other examples are the image of the tower (fourth row), for the narrowness and tallness attributes, or the insect (seventh row), whose reconstructions mostly captured the long rope-like object behind it and rendered it with insect-related attributes.

fMRI-based natural image reconstruction has been addressed by a variety of methods recently; however, only a few of them have been evaluated on the dataset we used. Here, we compare our reconstructions to three recent works by Shen et al. [6], Beliy et al. [7], and Ren et al. [10]. Fig. 3 shows the reconstructions of seven images obtained by each method. Note that we could not compare other images, since their reconstructions were not available for all methods.

Fig. 2: fMRI reconstructions by the proposed method across all subjects. The first and second columns show the input image and BigBiGAN's original reconstruction (reconstruction from the original latent vector), respectively. The next five columns illustrate BigBiGAN's fMRI reconstructions (reconstructions from the predicted latent vectors) for each of the five subjects. Although the fMRI reconstructions are not a perfect match to the input images, many attributes are consistently captured across all subjects. These attributes can be semantic, such as being an animal or the body pose, and/or visually driven, such as roundness or tallness, to mention a few.

Fig. 3: Comparison of fMRI reconstructions by different methods. The first and second columns show the input image and BigBiGAN's original reconstruction (reconstruction from the original latent vector), respectively. Columns three to seven illustrate fMRI reconstructions for BigBiGAN (our method, reconstruction from the predicted latent vector), Eigen-Image (PCA, baseline model), Ren et al. [10], Beliy et al. [7], and Shen et al. [6], respectively. Clearly, the reconstructions by the proposed method are the most naturalistic, with the highest resolution, in contrast to the more blurry or semantically ambiguous results of the other methods.
TABLE I: Quantitative comparison of image reconstructions. For each measure, the best value is marked with an asterisk (for Pix-Comp, higher is better; for Inception-V3, lower is better).

Method             | Low-Level (Pix-Comp) ↑ | High-Level (Inception-V3) ↓
-------------------|------------------------|----------------------------
Shen et al. [6]    | 79.7%                  | 0.829
Beliy et al. [7]   | 85.3%                  | 0.865
Ren et al. [10]    | 87.8% *                | 0.847
Eigen-Image (PCA)  | 73.4%                  | 0.884
BigBiGAN (ours)    | 54.3%                  | 0.818 *

Although our reconstructions are not a perfect match to the input images, they show the clearest resolution, details, and naturalness, and display high-level similarity to the input images. Clearly, the PCA (eigen-image) reconstructions rank worst in clarity. The other three methods suffered to varying degrees from ambiguous reconstructions (notably, without any clearly discernible object), although they did much better at estimating low-level attributes of the images, with the best performance obtained by Ren et al. Moreover, unlike the other methods, no image "halo" is present in our reconstructions. Such halos can result from various factors, including the learning capacity of the encoder/decoder networks, the training approach, and, most importantly, pixel-level or low-level similarity optimization.

For a quantitative comparison, we quantified low- and high-level similarities between the reconstructions and the original images. The former was computed as the pairwise decoding performance in pixel space (pix-comp) for all of the test images, while the latter was the correlation distance between representations of the last inception block of Inception-V3 (see subsection III-G) over the common set of seven reconstructions shown in Fig. 3. These results (see Table I) justify our claim that high-level aspects of the input images were better preserved by our method, while the other methods had an advantage for low-level aspects.

B. Decoding Accuracy Across Brain Regions

As mentioned above, the fMRI dataset includes several pre-defined brain regions of interest (ROIs) in visual cortex, including V1 to V4, LOC, FFA, PPA, and HVC (the union of the last three). We also defined the whole visual cortex (VC) as the union of all these ROIs. By limiting the voxels to those inside each ROI, we evaluated the pairwise decoding accuracy across the different regions of visual cortex.

Fig. 4: Pairwise decoding accuracy across different brain regions of interest (ROIs). While voxels in high-level areas of the visual cortex are best decoded using BigBiGAN (our method), PCA performs better in low-level regions (V1, V2). Although the best performance is achieved when all the voxels (the whole visual cortex) are included, PCA could only do marginally better than when only V1d voxels were used.

Fig. 4 illustrates the average decoding accuracy over all subjects in each brain region. PCA outperformed BigBiGAN in the two earliest visual areas (V1 and V2). However, in higher areas, BigBiGAN gradually improved while PCA worsened. Peak performance for our method was reached in V3, V4, and HVC, where PCA performed poorly. We hypothesize that the superiority of PCA in lower areas is due to the fact that the PCs were computed in pixel space, and thus correspond mostly to low-level features. On the other hand, BigBiGAN's latent vectors can better represent high-level features, since they are obtained via a large hierarchy of processing layers.

For both BigBiGAN and PCA, the best accuracy was achieved when we used the brain responses from the whole VC. Peak accuracy was 84.1% and 78.1% for BigBiGAN and PCA, respectively. It is worth mentioning that, while the whole VC improved BigBiGAN's performance significantly compared to each individual region, PCA could only do marginally better than when using the voxels of V1d alone (its best single-region performance).
This again suggests that PCA mostly depends on low-level features, whereas the BigBiGAN brain decoder can benefit from low-level information as well as high-level image attributes.

V. DISCUSSION

In this paper, we have proposed a new method for the realistic reconstruction of natural scenes from fMRI patterns. Thanks to the high-level, low-dimensional latent space of BigBiGAN, we could establish a linear mapping that associates image latent vectors with their corresponding fMRI patterns. This linear mapping was then inverted to transform novel fMRI patterns into BigBiGAN latent vectors. Finally, by feeding the obtained latent vectors into the BigBiGAN generator, the associated images were reconstructed.

Many recent approaches have taken advantage of deep generative neural networks to reconstruct natural scenes [6], [7], [10]. However, due to the complexity of natural images, a huge amount of computational resources and capacity is required to achieve high-resolution, realistic image generation [15]. Here, we used the pre-trained BigBiGAN as a state-of-the-art large-scale bi-directional GAN for natural images. We showed that the proposed method is able to generate the most realistic reconstructions at the highest resolution (256 × 256) compared to other methods. Moreover, comparing results across subjects revealed a robust consistency in capturing the high-level attributes of different objects in the reconstructions.

We acknowledge that our reconstructions are still far from perfect and can often lag behind the others in terms of low-level similarity measures. In contrast, the superiority of the proposed method lies in high-level evaluations of perceptual similarity. While we surpass other methods in this respect, we believe that there is still room for methodological improvements. In particular, failures to retrieve the proper semantic category or visual attribute can of course be caused by imperfect brain decoding of the latent vectors, but also sometimes by inadequate image generation from the BigBiGAN generator (e.g., compare the first two columns in Fig. 2). We believe that one promising avenue of improvement for our work is through the ability of the image generation model. In this regard, whenever new bidirectional GANs (or other bidirectional architectures) improve on the current state-of-the-art, our method can easily be adapted to deploy them and take advantage of their image generation prowess for more accurate brain-based reconstructions.

Another current limitation of the proposed method is our use of pre-defined brain regions of interest (or, potentially, of the entire visual cortex). It is likely that not all voxels are informative or relevant to the target task; including uninformative or irrelevant voxels can only degrade the outcome. Additionally, there might well be informative voxels in other brain areas, such as pre-frontal cortex, signaling high-level perceptual or semantic aspects of the visual stimulus, that we are currently not considering. For these reasons, extending the analysis to the entire brain, while using a proper voxel selection stage to discard irrelevant voxels, is bound to further improve the results.

REFERENCES

[1] Y. Miyawaki, H. Uchida, O. Yamashita, M.-a. Sato, Y. Morito, H. C. Tanabe, N. Sadato, and Y. Kamitani, "Visual image reconstruction from human brain activity using a combination of multiscale local image decoders," Neuron, vol. 60, no. 5, pp. 915–929, 2008.
[2] K. N. Kay, T. Naselaris, R. J. Prenger, and J. L. Gallant, "Identifying natural images from human brain activity," Nature, vol. 452, no. 7185, pp. 352–355, 2008.
[3] T. Naselaris, R. J. Prenger, K. N. Kay, M. Oliver, and J. L. Gallant, "Bayesian reconstruction of natural images from human brain activity," Neuron, vol. 63, no. 6, pp. 902–915, 2009.
[4] M. A. van Gerven, F. P. de Lange, and T. Heskes, "Neural decoding with hierarchical generative models," Neural Computation, vol. 22, no. 12, pp. 3127–3142, 2010.
[5] R. VanRullen and L. Reddy, "Reconstructing faces from fMRI patterns using deep generative neural networks," Communications Biology, vol. 2, no. 1, pp. 1–10, 2019.
[6] G. Shen, T. Horikawa, K. Majima, and Y. Kamitani, "Deep image reconstruction from human brain activity," PLoS Computational Biology, vol. 15, no. 1, p. e1006633, 2019.
[7] R. Beliy, G. Gaziv, A. Hoogi, F. Strappini, T. Golan, and M. Irani, "From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRI," in Advances in Neural Information Processing Systems, 2019, pp. 6514–6524.
[8] G. St-Yves and T. Naselaris, "Generative adversarial networks conditioned on brain activity reconstruct seen images," in 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2018, pp. 1054–1061.
[9] K. Seeliger, U. Güçlü, L. Ambrogioni, Y. Güçlütürk, and M. A. van Gerven, "Generative adversarial networks for reconstructing natural images from brain activity," NeuroImage, vol. 181, pp. 775–785, 2018.
[10] Z. Ren, J. Li, X. Xue, X. Li, F. Yang, Z. Jiao, and X. Gao, "Reconstructing perceived images from brain activity by visually-guided cognitive representation and adversarial learning," arXiv preprint arXiv:1906.12181, 2019.
[11] J. Donahue and K. Simonyan, "Large scale adversarial representation learning," in Advances in Neural Information Processing Systems, 2019, pp. 10541–10551.
[12] K. Han, H. Wen, J. Shi, K.-H. Lu, Y. Zhang, and Z. Liu, "Variational autoencoder: An unsupervised model for modeling and decoding fMRI activity in visual cortex," bioRxiv, p. 214247, 2018.
[13] T. Horikawa and Y. Kamitani, "Generic decoding of seen and imagined objects using hierarchical visual features," Nature Communications, vol. 8, no. 1, pp. 1–15, 2017.
[14] J. Donahue, P. Krähenbühl, and T. Darrell, "Adversarial feature learning," arXiv preprint arXiv:1605.09782, 2016.
[15] A. Brock, J. Donahue, and K. Simonyan, "Large scale GAN training for high fidelity natural image synthesis," arXiv preprint arXiv:1809.11096, 2018.
[16] A. S. Cowen, M. M. Chun, and B. A. Kuhl, "Neural portraits of perception: Reconstructing face images from evoked brain activity," NeuroImage, vol. 94, pp. 12–22, 2014.
[17] S.-M. Khaligh-Razavi and N. Kriegeskorte, "Deep supervised, but not unsupervised, models may explain IT cortical representation," PLoS Computational Biology, vol. 10, no. 11, 2014.
[18] R. M. Cichy, A. Khosla, D. Pantazis, A. Torralba, and A. Oliva, "Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence," Scientific Reports, vol. 6, p. 27755, 2016.
[19] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.