Locality and low-dimensions in the prediction of natural experience from fMRI

Locality and low-dimensions in the pr ediction of natural experience fr om fMRI ∗ Franc ¸ ois G. Meyer Center for the Study of Brain, Mind and Behavior , Program in Applied and Computational Mathematics Princeton Univ ersity fmeyer@colorado.edu Greg J . Stephens Center for the Study of Brain, Mind and Behavior , Department of Physics Princeton Univ ersity gstephen@princeton.edu Both authors contributed equally to this w ork. Abstract Functional Magnetic Resonance Imaging (fMRI) provides dynamical access into the complex functioning of the human brain, detailing the hemodynamic activ- ity of thousands of voxels during hundreds of sequential time points. One ap- proach to wards illuminating the connection between fMRI and cogniti ve function is through decoding; how do the time series of v oxel acti vities combine to provide information about internal and external experience? Here we seek models of fMRI decoding which are balanced between the simplicity of their interpretation and the effecti veness of their prediction. W e use signals from a subject immersed in vir- tual reality to compare global and local methods of prediction applying both linear and nonlinear techniques of dimensionality reduction. W e ﬁnd that the prediction of complex stimuli is remarkably lo w-dimensional, saturating with less than 100 features. In particular, we b uild effecti ve models based on the decorrelated com- ponents of cogniti ve acti vity in the classically-deﬁned Brodmann areas. For some of the stimuli, the top predictiv e areas were surprisingly transparent, including W ernicke’ s area for verbal instructions, visual cortex for f acial and body features, and visual-temporal regions for velocity . Direct sensory experience resulted in the most robust predictions, with the highest correlation ( c ∼ 0 . 8 ) between the predicted and e xperienced time series of verbal instructions. T echniques based on non-linear dimensionality reduction (Laplacian eigenmaps) performed similarly . The interpretability and relativ e simplicity of our approach provides a conceptual basis upon which to b uild more sophisticated techniques for fMRI decoding and offers a windo w into cogniti ve function during dynamic, natural experience. ∗ Advances in Neural Information Processing Systems 20, Sch ¨ olkopf B., Platt J. and Hofmann T . (Editors), c  MIT Press, 2008 1 1 Introduction Functional Magnetic Resonance Imaging (fMRI) is a non-in vasi ve imaging technique that can quan- tify changes in cerebral v enous oxygen concentration. Changes in the fMRI signal that occur during brain activ ation are very small (1-5%) and are often contaminated by noise (created by the imaging system hardw are or physiological processes). Statistical techniques that handle the stochastic nature of the data are commonly used for the detection of activ ated voxels. T raditional methods of analy- sis – which are designed to test the hypothesis that a simple cognitiv e or sensory stimulus creates changes in a speciﬁc brain area – are unable to analyze fMRI datasets collected in “natural stimuli” where the subjects are bombarded with a multitude of uncontrolled stimuli that cannot always be quantiﬁed [1, 2]. The Experience Based Cognition competition (EBC) [3] offers an opportunity to study complex re- sponses to natural environments, and to test new ideas and new methods for the analysis of fMRI collected in natural en vironments. The EBC competition provides fMRI data of three human sub- jects in three 20-minute segments (704 scanned samples in each se gment) in an urban virtual reality en vironment along with quantitati ve time series of natural stimuli or features (25 in total) ranging from objectiv e features such as the presence of faces to self-reported, subjective cognitive states such as the experience of fear . During each session, subjects were audibly instructed to complete three search tasks in the environment: looking for weapons (but not tools) taking pictures of people with piercings (b ut not others), or picking up fruits (b ut not ve getables). The data was collected with a 3T EPI scanner and typically consists of the activity of 35000 volume elements (voxels) within the head. The feature time series was provided for only the ﬁrst two segments (1408 time samples) and competitive entries are judged on their ability to predict the feature on the third segment (704 time samples, see Fig. 1). At a microscopic le vel, a large number of internal v ariables associated ? t j T t t j t i 0 t T l k f(t) t i t 0 Figure 1: W e study the v ariation of the set of features f k ( t ) , k = 1 , · · · , K as a function of the dynamical changes in the fMRI signal X ( t ) = [ x 1 ( t ) , · · · , x N ( t )] during natural experience. The features represent both external stimuli such as the presence of faces and internal emotional states encountered during the exploration of a virtual urban en vironment (left and right images). W e predict the feature functions f k for t = T l +1 , · · · T , from the knowledge of the entire fMRI dataset X , and the partial knowledge of f k ( t ) for t = 1 , · · · , T l . The “toy” acti vation patterns (middle diagram) illustrate the changes in “brain states” occurring as a function of time. with various physical and physiological phenomena contribute to the dynamic changes in the fMRI signal. Because the fMRI signal is a large scale (as compared to the scale of neurons) measurement of neuronal acti vity , we expect that many of these variables will be coupled resulting in a low di- mensional set for all possible conﬁgurations of the activ ated fMRI signal. In this work we seek a low dimensional representation of the entire fMRI dataset that provides a new set of ‘voxel-free” coordinates to study cognitiv e and sensory features. W e denote a three-dimensional v olumes of fMRI composed of a total of N voxels by X ( t ) = [ x 1 ( t ) , · · · , x N ( t )] . W e have access to T such volumes. W e can stack the spatio-temporal fMRI 2 dataset into a N × T matrix, X =    x 1 (1) · · · x 1 ( T ) . . . . . . . . . x N (1) · · · x N ( T )    , (1) where each ro w n represents a time series x n generated from vox el n and each column j represents a scan acquired at time t j . W e call the set of features to be predicted f k , k = 1 , , · · · , K . W e are interested in studying the v ariation of the set of features f k ( t ) , k = 1 , · · · , K describing the subject experience as a function of the dynamical changes of the brain, as measured by X ( t ) . Formally , we need to build predictions of f k ( t ) for t = T l +1 , · · · T , from the knowledge of the entire fMRI dataset X , and the partial kno wledge of f k ( t ) for the training time samples t = 1 , · · · , T l (see Fig. 1). D t t t φ j 0 t φ i Figure 2: Low-dimensional parametrization of the set of “brain states”. The parametrization is constructed from the samples provided by the fMRI data at dif ferent times, and in different states. 2 A voxel-free parametrization of brain states W e use here the global information provided by the dynamical ev olution of X ( t ) ov er time, both during the training times and the test times. W e would like to ef fectively replace each fMRI dataset X ( t ) by a small set of features that facilitates the identiﬁcation of the brain states, and make the prediction of the features easier . Formally , our goal is to construct a map φ from the voxel space to low dimensional space. φ : R N 7→ D ⊂ R L (2) X ( t ) = [ x 1 ( t ) , · · · , x N ( t )] T 7→ ( y 1 ( t ) , · · · , y L ( t )) , (3) where L  N . As t varies over the training and the test sets, we hope that we explore most of the possible brain conﬁgurations that are useful for predicting the features. The map φ provides a parametrization of the brain states. Figure 2 provides a pictorial rendition of the map φ . The range D , represented in Fig. 2 as a smooth surface, is the set of parameters y 1 , · · · , y L that characterize the brain dynamics. Different v alues of the parameters produce dif ferent “brain states”, associated with different patterns of activ ation. Note that time does not play any role on D , and neighboring points on D correspond to similar brain states. Equipped with this re-parametrization of the dataset X , the goal is to learn the ev olution of the feature time series as a function of the new coordinates [ y 1 ( t ) , · · · , y L ( t )] T . Each feature function is an implicit function of the brain state measured by [ y 1 ( t ) , · · · , y L ( t )] . For a giv en feature f k , the training data provide us with samples of f k at cer- tain locations in D . The map φ is build by globally computing a ne w parametrization of the set { X (1) , · · · , X ( T ) } . This parametrization is built into two stages. First, we construct a graph that is a proxy for the entire set of fMRI data { X (1) , · · · , X ( T ) } . Second, we compute some eigenfunc- tions φ k deﬁned on the graph. Each eigenfunctions provides one speciﬁc coordinate for each node of the graph. 3 2.1 The graph of brain states W e represent the fMRI dataset for the training times and test times by a graph. Each v ertex i corresponds to a time sample t i , and we compute the distance between two vertices i and j by measuring a distance between X ( t i ) and X ( t j ) . Global changes in the signal due to residual head motion, or global blood ﬂo w changes were remo ved by computing a a principal components analysis (PCA) of the dataset X and removing a small number components. W e then used the l 2 distance between the fMRI v olumes (unrolled as N × 1 vectors). This distance compares all the voxels (white and gray matter , as well as CSF) inside the brain. 2.2 Embedding of the dataset Once the network of connected brain states is created, we need a distance to distinguish between strongly connected states (the two fMRI data are in the same cogniti ve state) and weakly connected states (the fMRI data are similar , but do not correspond to the same brain states). The Euclidean distance used to construct the graph is only useful locally: we can use it to compare brain states that are very similar , b ut is unfortunately very sensitiv e to short-circuits created by the noise in the data. A standard alternative to the geodesic (shortest distance) is provided by the av erage commute time, κ ( i, j ) , that quantiﬁes the expected path length between i and j for a random walk started at i . Formally , κ ( i, j ) = H ( j, i ) + H ( i, j ) , where H ( i, j ) is the hitting time, H ( i, j ) = E i [ T j ] with T j = min { n ≥ 0; Z n = j } , for a random walk Z n on the graph with transition probability P , deﬁned by P i,j = w i,j /d i , and d i = D i,i = P j w i,j is the degree of the vertex i . The commute time can be con veniently computed from the eigenfunctions φ 1 , · · · , φ N of N = D 1 2 PD − 1 2 , with the eigenv alues − 1 ≤ λ N · · · ≤ λ 2 < λ 1 = 1 . Indeed, we ha ve κ ( i, j ) = N X k =2 1 1 − λ k φ k ( i ) √ d i − φ k ( j ) p d j ! 2 . As proposed in [4, 5, 6], we deﬁne an embedding i 7→ I k ( i ) = 1 1 − λ k φ k ( i ) √ d i , k = 2 , · · · , N (4) Because − 1 ≤ λ N · · · ≤ λ 2 < λ 1 = 1 , we have 1 √ 1 − λ 2 > 1 √ 1 − λ 3 > · · · 1 √ 1 − λ N . W e can therefore neglect φ k ( j ) √ 1 − λ k for large k , and reduce the dimensionality of the embedding by using only the ﬁrst K coordinates in (4). The spectral gap measures the difference between the ﬁrst two eigen values, λ 1 − λ 2 = 1 − λ 2 . A large spectral gap indicates that the low dimensional will provide a good approximation. The algorithm for the construction of the embedding is summarized in Fig. 3. Algorithm 1: Construction of the embedding Input : – X ( t ) , t = 1 , · · · , T , K : number of eigenfunctions. Algorithm: 1. construct the graph deﬁned by the n n nearest neighbors 2. ﬁnd the ﬁrst K eigenfunctions, φ k , of N • Output: For t i = 1 : T – ne w co-ordinates of X ( t i ) : y k ( t i ) = 1 √ π i φ k ( i ) √ 1 − λ k k = 2 , · · · , K + 1 Figure 3: Construction of the embedding 4 A parameter of the embedding (Fig. 3) is K , the number of coordinates. K can be optimized by minimizing the prediction error . W e expect that for small values of K the embedding will not describe the data with enough precision, and the prediction will be inaccurate. If K is too large, some of the new coordinates will be describing the noise in the dataset, and the algorithm will overﬁt the training data. Fig. 4-(a) illustrates the effect of K on the performance of the nonlinear dimension reduction. The quality of the prediction for the features: faces, instruction and velocity is plotted against K . Instructions elicits a strong response in the auditory cortex that can be decoded with as few as 20 coordinates. Faces requires more (about 50) dimensions to become optimal. As expected the performance ev entually drops when additional coordinates are used to describe variability that is not related to the features to be decoded. This conﬁrms our hypothesis that we can replace about 15,000 vox els with 50 appropriately chosen coordinates. 2.3 Semi-supervised learning of the features The problem of predicting a feature f k at an unknown time t u is formulated as kernel ridge regres- sion problem. The training set { f k ( t ) for t = 1 , · · · , T l } is used to estimate the optimal choice of weights in the following model, ˆ f ( t u ) = T l X t =1 ˆ α ( t ) K ( y ( t u ) , y ( t )) , where K is a kernel and t u is a time point where we need to predict. 2.4 Results W e compared the nonlinear embedding approach (referred to as global Laplacian) to dimension reduction obtained with a PCA of X . Here the principal components are principal volumes, and for each time t we can expand X ( t ) onto the principal components. The 1408 training data were di vided into two subsets of 704 time samples. W e use f k ( t ) in a subset to predict f k ( t ) in the other subset. In order to quantify the stability of the prediction we randomly selected 85 % of the training set (ﬁrst subset), and predicted 85 % of the testing set (other subset). The role, training or testing, of each subset of 704 time samples was also chosen randomly . W e generated 20 experiments for each value of K , the number of predictors. The performance was quantiﬁed with the normalized correlation between the model prediction and the real value of f k , r = h δ f est k ( t ) , δ f k ( t ) i / q h δ ( f est k ) 2 ih δ f 2 k i , (5) where δf k = f k ( t ) − h f k i . Finally , r was a veraged o ver the 20 experiments. Fig. 4-(a) and (b) sho w the performance of the nonlinear method and linear method as a function of K . The approach based on the nonlinear embedding yields very stable results, with lo w variance. For both global methods the optimal performance is reached with less than 50 coordinates. Fig. 5 shows the correlation coefﬁcients for 11 features, using K = 33 coordinates. For most features, the nonlinear embedding performed better than global PCA. 3 From global to local While models based on global features lev erage predictiv e components from across the brain, cog- nitiv e function is often localized within speciﬁc regions. Here we explore whether simple models based on classical Brodmann regions pro vide an effecti ve decoder of natural e xperience. The Brod- mann areas were deﬁned almost a century ago (see e.g [7]) and di vide the corte x into approximately 50 regions, based on the structure and arrangement of neurons within each region. While the ar- eas are characterized structurally many also have distinct functional roles and we use these roles to provide useful interpretations of our predictiv e models. Though the partitioning of cortical regions remains an open and challenging problem, the Brodmann areas represent a transparent compromise between dimensionality , function and structure. Using data supplied by the competition, we warp each brain into standard T alairach coordinates and locate the Brodmann area corresponding to each voxel. W ithin each Brodmann region, dif fering in 5 size from tens to thousands of elements, we b uild the cov ariance matrix of voxel time series using all three virtual reality episodes. W e then project the voxel time series onto the eigen vectors of the cov ariance matrix (principal components) and b uild a simple, linear stimulus decoding model using the top n modes ranked by their eigen v alues, f est k ( t ) = n X i =1 w k i m k i ( t ) . (6) where k indexes the different Brodmann areas, { w k i } are the linear weights and { m k i ( t ) } are the mode time series in each re gion. The weights are chosen to minimize the RMS error on the training set and ha ve a particularly simple form here as the modes are decorrelated, w k i = h S ( t ) m k i ( t ) i . Performance is measured as the normalized correlation r (Eq. 5) between the model prediction and 1 60 1 200 loc al eigenmodes glob al e igenmo des 〈 r 〉 0.9 0 Bes t A rea (f aces) (c) (b) 0 0.9 30 100 faces inst ruc tio ns velo cit y Bro dmann37 Bro dmann19 Bro dmann21 (d) 〈 r 〉 loc al eigenmodes 1 60 30 1 48 faces inst ruc tio ns velo cit y 1 200 100 0 〈 r 〉 0.9 (a) glob al Laplac ian faces inst ruc tio ns velo cit y Figure 4: Performance of the prediction of natural experience for three features, faces, instructions and velocity as a function of the model dimension. (a) nonlinear embedding, (b) global principal components, (c) local (Brodmann area) principal components. In all cases we ﬁnd that the predic- tion is remarkably low-dimensional, saturating with less than 100 features. (d) stability and inter- pretability of the optimal Brodmann areas used for decoding the presence of faces. All three areas are functionally associated with visual processing. Brodmann area 22 (W ernicke’ s area) is the best predictor of instructions (not shown). The connections between anatomy , function and prediction add an important measure of interpretability to our decoding models. the real stimulus av eraged o ver the two virtual reality episodes and we use the region with the lowest training error to make the prediction. In principle, we could use a large number of modes to mak e a prediction with n limited only by the number of training samples. In practice the predictiv e power of our linear model saturates for a remarkably low number of modes in each region. In Fig 4(c) we demonstrate the performance of the model on the number of local modes for three stimuli that are predicted rather well (faces, instructions and velocity). 6 For many of the well-predicted stimuli, the best Brodmann areas were also stable across subjects and episodes offering important interpretability . For example, in the prediction of instructions (which the subjects recei ved through headphones), the top region was Brodmann Area 22, W ernicke’ s area, which has long been associated with the processing of human language. For the prediction of the face stimulus the best region was usually visual cortex (Brodmann Areas 17 and 19) and for the prediction of velocity it was Brodmann Area 7, known to be important for the coordination of visual and motor acti vity . Using modes deri ved from Laplacian eigenmaps we were also able to predict an emotional state, the self-reporting of fear and anxiety . Interestingly , in this case the best predictions came from higher cognitiv e areas in frontal cortex, Brodmann Area 11. aro usa l bod y dog int erio r/ ext erior faces fea rfu l/ anxi ous fruits/ veg gie hits ins tru cti ons wea pon s/ too ls vel oci ty 0.9 0 〈 r 〉 global eigenbrain global laplacian local eigenbrain Figure 5: Performance of the prediction of natural experience for ele ven features, using three dif fer- ent methods. Local decoders do well on stimuli related to objects while nonlinear global methods better capture stimuli related to emotion. While the abov e discussion highlights the usefulness of classical anatomical location, many aspects of cogniti ve experience are not lik ely to be so simple. Gi ven the reasonable results abo ve it’ s natural to look for ways of combining the intuition derived from single classical location with more global methods that are likely to do better in prediction. As a step in this direction, we modify our model to include multiple Brodmann areas f est k ( t ) = X l ∈ A n X i =1 w l i m l i ( t ) , (7) where A represents a collection of areas. T o make a prediction using the modiﬁed model we ﬁnd the top three Brodmann areas as before (ranked by their training correlation with the stimulus) and then incorporate all of the modes in these areas (nA in total) in the linear model of Eq 7. The weights { w l i } are chosen to minimize RMS error on the training data. The combined model leverages both the interpretive power of single areas and also some of the interactions between them. The results of this combined predictor are shown in Fig. 5 (black) and are generally signiﬁcantly better than the single region predictions. For ease of comparison, we also show the best global results (both nonlinear Laplacian and global principal components). For many (but not all) of the stimuli, the local, lo w-dimensional linear model is signiﬁcantly better than both linear and nonlinear global methods. 7 4 Discussion Incorporating the knowledge of functional, cortical regions, we used fMRI to build low-dimensional models of natural experience that performed surprisingly well at predicting many of the complex stimuli in the EBC competition. In addition, the regional basis of our models allows for transparent cognitiv e interpretation, such as the emergence of W ernicke’ s area for the prediction of auditory instructions in the virtual en vironment. Other well-predicted experiences include the presence of body parts and faces, both of which were decoded by areas in visual cortex. In future work, it will be interesting to examine whether there is a well-deﬁned cognitiv e dif ference between stimuli that can be decoded with local brain function and those that appear to require more global techniques. W e also learned in this work that nonlinear methods for embedding datasets, inspired by manifold learning methods [4, 5, 6], outperform linear techniques in their ability to capture the complex dynamics of fMRI. Finally , our particular use of Brodmann areas and linear methods represent only a ﬁrst step to wards combining prior knowledge of broad regional brain function with the construction of models for the decoding of natural experience. Despite the relati ve simplicity , an entry based on this approach scored within the top 5 of the EBC2007 competition [3]. Acknowledgments GJS w as supported in part by National Institutes of Health Grant T32 MH065214 and by the Sw artz Foundation. FGM was partially supported by the Center for the Study of Brain, Mind and Behavior , Princeton Uni versity . The authors are very grateful to all the members of the center for their support and insightful discussions. References [1] Y . Golland, S. Bentin, H. Gelbard, Y . Benjamini, R. Heller, and Y . Nir et al. Extrinsic and intrinsic systems in the posterior cortex of the human brain rev ealed during natural sensory stimulation. Cer ebral Cortex , 17:766–777, 2007. [2] S. Malinen, Y . Hlushchuk, and R. Hari. T owards natural stimulation in fMRI–issues of data analysis. Neur oImage , 35:131–139, 2007. [3] http://www .ebc.pitt.edu. [4] M. Belkin and P . Niyogi. Laplacian eigenmaps for dimensionality reduction and data represen- tation. Neural Computations , 15:1373–1396, 2003. [5] P . B ´ erard, G. Besson, and S. Gallot. Embeddings Riemannian manifolds by their heat kernel. Geometric and Functional Analysis , 4(4):373–398, 1994. [6] R.R. Coifman and S. Lafon. Dif fusion maps. Applied and Computational Harmonic Analysis , 21:5–30, 2006. [7] E.R. Kandel, J.H. Schwartz, and T .M. Jessell. Principles of Neural Science . McGraw-Hill, Ne w Y ork, 2000. 8

Locality and low-dimensions in the prediction of natural experience from fMRI

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment