Graph Neural Network for Interpreting Task-fMRI Biomarkers



Xiaoxiao Li^1, Nicha C. Dvornek^5, Yuan Zhou^5, Juntang Zhuang^1, Pamela Ventola^4 and James S. Duncan^1,2,3,5

^1 Biomedical Engineering, Yale University, New Haven, CT, USA
^2 Electrical Engineering, Yale University, New Haven, CT, USA
^3 Statistics & Data Science, Yale University, New Haven, CT, USA
^4 Child Study Center, Yale School of Medicine, New Haven, CT, USA
^5 Radiology & Biomedical Imaging, Yale School of Medicine, New Haven, CT, USA

Abstract. Finding the biomarkers associated with ASD is helpful for understanding the underlying roots of the disorder and can lead to earlier diagnosis and more targeted treatment. A promising approach to identify biomarkers is using Graph Neural Networks (GNNs), which can be used to analyze graph-structured data, i.e. brain networks constructed by fMRI. One way to interpret important features is through looking at how the classification probability changes if the features are occluded or replaced. The major limitation of this approach is that replacing values may change the distribution of the data and lead to serious errors. Therefore, we develop a 2-stage pipeline to eliminate the need to replace features for reliable biomarker interpretation. Specifically, we propose an inductive GNN to embed the graphs containing different properties of task-fMRI for identifying ASD, and then discover the brain regions/sub-graphs used as evidence for the GNN classifier. We first show that the GNN can achieve high accuracy in identifying ASD. Next, we calculate the feature importance scores using the GNN and compare the interpretation ability with Random Forest. Finally, we run with different atlases and parameters, demonstrating the robustness of the proposed method. The detected biomarkers reveal their association with social behaviors and are consistent with those reported in the literature.
We also show the potential of discovering new informative biomarkers. Our pipeline can be generalized to other graph feature importance interpretation problems.

Keywords: Graph Neural Network · Task-fMRI · ASD biomarker

1 Introduction

Autism spectrum disorders (ASD) affect the structure and function of the brain. To better target the underlying roots of ASD for diagnosis and treatment, efforts to identify reliable biomarkers are growing [8]. Significant progress has been made using functional magnetic resonance imaging (fMRI) to characterize the brain remodeling in ASD [9]. Recently, emerging research on Graph Neural Networks (GNNs) has combined deep learning with graph representation and applied an integrated approach to fMRI analysis in different neuro-disorders [11]. Most existing approaches (based on the Graph Convolutional Network (GCN) [10]) require all nodes in the graph to be present during training and thus lack natural generalization to unseen nodes. Also, it is necessary to interpret the important features in the data used as evidence for the model, but currently no tool exists that can interpret and explain GNNs, while recent CNN explanation algorithms cannot directly work on graph input.

Fig. 1: Pipeline for interpreting important features from a GNN

Our main contributions include the following three points: 1) We develop a method to integrate all the available connectivity, geometric, and anatomic information and task-fMRI (tfMRI) related parameters into graphs for deep learning. Our approach alleviates the problem of predetermining the best features and measures of functional connectivity, which is often ambiguous due to the intrinsic complex structure of task-fMRI. 2) We propose a generalizable GNN inductive learning model to more accurately classify ASD vs. healthy controls (HC).
Different from the spectral GCN [10], our GNN classifier is based on graph isomorphism, can be applied to multigraphs with different nodes/edges (e.g. sub-graphs), and learns local graph information without binding to the whole graph structure. 3) The GNN architecture enables us to train the model on the whole graph and validate it on sub-graphs. We directly evaluate the importance scores on sub-graphs and nodes (i.e. regions of interest (ROIs)) by examining model responses, without resampling values for the occluded features. The 2-stage pipeline to interpret important sub-graphs/ROIs, which are defined as biomarkers in our setting, is shown in Fig. 1.

2 Methodology

2.1 Graph Definition

We first parcellate the brain into N ROIs based on its T1 structural MRI and define the ROIs as graph nodes. We define an undirected multigraph G = (V, E), where V = (v_1, v_2, ..., v_N)^T ∈ R^{N×D} and E = [e_{ij}] ∈ R^{N×N×F}; D and F are the attribute dimensions of nodes and edges, respectively. For node attributes, we concatenate handcrafted features: degree of connectivity, General Linear Model (GLM) coefficients, mean and standard deviation of task-fMRI, and ROI center coordinates. We applied the Box-Cox transformation [13] to make each feature follow a normal distribution (parameters are learned from the training set and applied to the training and testing sets). The edge attribute e_{ij} of nodes i and j includes the Pearson correlation, the partial correlation calculated using residual fMRI, and exp(−r_{ij}/10), where r_{ij} is the geometric distance between the centers of the two ROIs. We thresholded the edges under the 95th percentile of partial correlation values to ensure sparsity for efficient computation and to avoid oversmoothing.

2.2 Graph Neural Network (GNN) Classifier

The architecture of our proposed GNN is shown in Fig. 2 (node and edge attribute definitions and kernel sizes are denoted). The model inductively learns node representations by recursively aggregating and transforming the feature vectors of neighboring nodes. Below, we define the layers in the proposed GNN classifier.

Fig. 2: The architecture of the GNN classifier

Convolutional Layer. Following Message Passing Neural Networks (NNConv) [7], which is invariant to graph symmetries, we leverage node degree in the embedding. The embedded representation of the l-th convolutional layer, v_i^{(l)} ∈ R^{d^{(l)}}, is

    v_i^{(l)} = \frac{1}{|\mathcal{N}(i)|+1}\,\sigma\Big(\Theta v_i^{(l-1)} + \sum_{j \in \mathcal{N}(i)} h_\phi(e_{ij})\, v_j^{(l-1)}\Big),    (1)

where σ(·) is a nonlinear activation function (we use ReLU here), N(i) is node i's 1-hop neighborhood, Θ ∈ R^{d^{(l)} × d^{(l−1)}} is a learnable propagation matrix, h_φ denotes a multi-layer perceptron (MLP) that maps the edge attributes e_{ij} to a d^{(l)} × d^{(l−1)} matrix, and we initialize v_i^{(0)} = v_i.

Pooling Aggregation Layer. To make sure that down-sampling layers behave idiomatically with respect to different graph sizes and structures, we adopt the approach in [2] for reducing graph nodes. The choice of which nodes to drop is based on projecting the node attributes onto a learnable vector w^{(l−1)} ∈ R^{d^{(l−1)}}. The nodes receiving lower scores will experience less feature retention. Fully written out, the operation of this pooling layer (computing a pooled graph (V^{(l)}, E^{(l)}) from an input graph (V^{(l−1)}, E^{(l−1)})) is expressed as follows:

    y = V^{(l-1)} w^{(l-1)} / \|w^{(l-1)}\|,
    \mathbf{i} = \mathrm{top}_k(y, k),
    V^{(l)} = (V^{(l-1)} \odot \tanh(y))_{\mathbf{i},:},
    E^{(l)} = E^{(l-1)}_{\mathbf{i},\mathbf{i}}.    (2)

Here ‖·‖ is the L2 norm, top_k finds the indices corresponding to the largest k elements in vector y, ⊙ is (broadcast) element-wise multiplication, and (·)_{i,j} is an indexing operation which takes elements at row indices specified by i and column indices specified by j (a colon denotes all indices). The pooling operation trivially retains sparsity by requiring only a projection, a point-wise multiplication, and a slicing into the original feature and adjacency matrices. Different from [2], we impose the constraint ‖w^{(l)}‖_2 = 1, implemented by adding the regularization loss λ \sum_{l=1}^{L} (‖w^{(l)}‖_2 − 1)^2, to avoid identifiability issues.

Readout Layer. Lastly, we seek a flattening operation to preserve information about the input graph in a fixed-size representation. Concretely, to summarise the output graph of the l-th conv-pool block, (V^{(l)}, E^{(l)}), we use

    s^{(l)} = \Big(\frac{1}{N^{(l)}} \sum_{i=1}^{N^{(l)}} v_i^{(l)}\Big) \,\Vert\, \max\big(\{v_i^{(l)} : i = 1, \dots, N^{(l)}\}\big),    (3)

where N^{(l)} is the number of graph nodes, v_i^{(l)} is the i-th node's feature vector, max operates element-wise, and ‖ denotes concatenation. The final summary vector is obtained as the concatenation of all those summaries (i.e. s = s^{(1)} ‖ s^{(2)} ‖ ... ‖ s^{(L)}) and submitted to an MLP to obtain the final predictions.

2.3 Explain Input Data Sensitivity

To explain input data sensitivity, we first cluster the whole brain graph into sub-graphs. Then we investigate the predictive power of each sub-graph and further assign an importance score to each ROI.

Network Community Clustering. From now on we add a subscript to the graph, writing G_s = (V_s, E_s) for the s-th instance, s = 1, ..., S, where S is the number of graphs. Concatenating the sparsified non-negative partial correlation matrices (E_s)_{:,:,2} for all the graphs, we can create a 3rd-order tensor τ of dimension N × N × S.
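This construction step, sparsifying each subject's partial correlation matrix at the 95th percentile (Section 2.1) and stacking the S matrices into the N × N × S tensor τ, can be sketched in NumPy. This is a minimal illustration on synthetic matrices; the function and variable names are ours, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparsify(P, pct=95):
    """Keep only edges at or above the given percentile of partial
    correlation values; clip negatives so the tensor stays non-negative."""
    P = np.clip(P, 0.0, None)
    thr = np.percentile(P, pct)
    return np.where(P >= thr, P, 0.0)

# Toy data: S subjects, N ROIs, symmetric "partial correlation" matrices
S, N = 5, 20
mats = []
for _ in range(S):
    A = rng.uniform(-1, 1, size=(N, N))
    P = (A + A.T) / 2            # symmetrize
    np.fill_diagonal(P, 0.0)     # no self-connections
    mats.append(sparsify(P))

tau = np.stack(mats, axis=-1)    # 3rd-order tensor, N x N x S
print(tau.shape)                 # (20, 20, 5)
```

Only roughly the strongest 5% of entries per matrix survive the threshold, which is what keeps the downstream tensor decomposition and graph convolutions cheap.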
Non-negative PARAFAC [3] tensor decomposition is applied to the tensor τ to discover overlapping functional brain networks. Given decomposition rank R,

    \tau \approx \sum_{j=1}^{R} a_j \otimes b_j \otimes c_j,

where the loading vectors a_j ∈ R^N, b_j ∈ R^N, c_j ∈ R^S and ⊗ denotes the vector outer product. Here a_j = b_j, since the connectivity matrix is symmetric. The i-th element of a_j, a_{ji}, provides the membership of region i in community j. We consider region i to belong to community j if a_{ji} > mean(a_j) + std(a_j) [12]. This gives us a collection of community indices indicating region membership, {i_j ⊂ {1, ..., N} : j = 1, ..., R}.

Graph Salience Mapping. After decomposing all the brain networks into community sub-graphs {G_{sj} = ((V_s)_{i_j,:}, (E_s)_{i_j,i_j}) : s = 1, ..., S, j = 1, ..., R}, we use a salience mapping method to assign each sub-graph an importance score. In our classification setting, the probability of class c ∈ {0, 1} (0: HC, 1: ASD) given the original network G is estimated from the predictive score of the model: p(c | G). To calculate p(c | G_{sj}), different from a CNN or GCN, we can directly input the sub-graph into the pre-trained classifier. We denote c_s as the class label for instance s and define Evidence for Correct Class (ECC) for each community:

    ECC_j = \frac{1}{S} \sum_s \tanh\Big(\log_2\big(p(c = c_s \mid G_{sj}) \,/\, (1 - p(c = c_s \mid G_{sj}))\big)\Big),    (4)

where a Laplace correction (p ← (pS + 1)/(S + 2)) is used to avoid zero denominators. Note that the log odds-ratio is commonly used in logistic regression to make p more separable. The nonlinear tanh function bounds ECC. ECC can be positive or negative: a positive value provides evidence for the classifier, whereas a negative value provides evidence against it. The final importance score for node k is calculated as \sum_{j: k \in i_j} ECC_j / |i_j|.
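Eq. (4) and the node-level aggregation can be sketched directly in NumPy. The probabilities below are dummy classifier outputs and the names are illustrative; the point is the Laplace correction, the log2 odds-ratio, and the tanh bounding.

```python
import numpy as np

def ecc(p_correct, S):
    """Eq. (4): Evidence for Correct Class for one community.

    p_correct: array of p(c = c_s | G_sj) over the S instances.
    """
    p = (p_correct * S + 1.0) / (S + 2.0)   # Laplace correction
    log_odds = np.log2(p / (1.0 - p))       # separate p around 0.5
    return np.tanh(log_odds).mean()         # bound, then average over instances

def node_scores(ecc_per_community, communities, N):
    """Node k's importance: sum over communities containing k of ECC_j / |i_j|."""
    scores = np.zeros(N)
    for e, members in zip(ecc_per_community, communities):
        scores[members] += e / len(members)
    return scores

# Toy example: S=4 graphs, N=5 ROIs, R=2 overlapping communities
S, N = 4, 5
p1 = np.array([0.9, 0.8, 0.7, 0.95])   # community 1 supports the correct class
p2 = np.array([0.3, 0.4, 0.45, 0.2])   # community 2 argues against it
communities = [np.array([0, 1, 2]), np.array([2, 3, 4])]
eccs = [ecc(p1, S), ecc(p2, S)]
print(eccs[0] > 0, eccs[1] < 0)        # True True
print(node_scores(eccs, communities, N))
```

Note how ROI 2, which belongs to both communities, receives a mixture of positive and negative evidence, which is exactly the overlapping-community behavior the PARAFAC membership rule allows.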
The larger the score, the more likely the node can serve as a distinguishing marker.

3 Experiments and Results

3.1 Data Acquisition and Preprocessing

We tested our method on a group of 75 ASD children and 43 age- and IQ-matched healthy controls collected at the Yale Child Study Center. Each subject underwent a task-fMRI scan (BOLD, TR = 2000 ms, TE = 25 ms, flip angle = 60°, voxel size 3.44 × 3.44 × 4 mm³) acquired on a Siemens MAGNETOM Trio TIM 3T scanner. For the fMRI scans, subjects performed the "biopoint" task, viewing point-light animations of coherent and scrambled biological motion in a block design [9] (24 s per block). The fMRI data were preprocessed following the pipeline in [14]. The mean time series for each node were extracted from a random 1/3 of voxels in the ROI (given an atlas) of the preprocessed images by bootstrapping. We augmented the ASD data 10 times and the HC data 20 times, resulting in 750 ASD graphs and 860 HC graphs, respectively. We split the data into 5 folds based on subjects; four folds were used as training data and the left-out fold was used for testing. Based on the definition in Section 2.1, each node attribute v_i ∈ R^10 and each edge attribute e_{ij} ∈ R^3. Specifically, the GLM parameters of the "biopoint" task are: β_1, the coefficient of the biological motion matrix; β_3, the coefficient of the scrambled motion matrix; β_2 and β_4, the coefficients of the previous two matrices' derivatives.

3.2 Step 1: Train ASD/HC Classification Model

Firstly, we tested classifier performance on the Destrieux atlas [5] (148 ROIs) using the proposed GNN. Since our pipeline integrates interpretation and classification, we apply a random forest (RF) with 1000 trees as an additional "reality check", while the other existing graph classification models either cannot achieve the performance of the GNN [2,7] or do not have straightforward and reliable interpretation ability [1]. We flattened the features to V ∈ R^1480 and E ∈ R^65712 (65712 = 148 × 148 × 3) and used this as input to the RF. In our GNN, d^(0) = D = 10, d^(1) = 16, d^(2) = 8, resulting in 2746 trainable parameters, and we tried different pooling ratios r (k = r × N) in Fig. 2, implemented based on [6]. We applied softmax after the network output and combined the cross-entropy loss and the regularization loss with λ = 0.001 as the objective function. We used the Adam optimizer with initial learning rate 0.001, decreased by a factor of 10 every 50 epochs. We trained the network for 300 epochs for all of the splits and measured instance classification by accuracy, F-score, precision and recall (see Table 1).

Table 1: Performance of different models (mean ± std)

| Metric    | RF(V)       | RF(E)       | RF(V+E)     | GNN(r=0.3)  | GNN(r=0.5)  | GNN(r=0.8)  |
|-----------|-------------|-------------|-------------|-------------|-------------|-------------|
| Accuracy  | 0.71 ± 0.05 | 0.66 ± 0.06 | 0.68 ± 0.06 | 0.67 ± 0.14 | 0.76 ± 0.06 | 0.73 ± 0.07 |
| F-score   | 0.69 ± 0.06 | 0.68 ± 0.06 | 0.63 ± 0.12 | 0.68 ± 0.09 | 0.79 ± 0.08 | 0.71 ± 0.10 |
| Precision | 0.68 ± 0.06 | 0.61 ± 0.06 | 0.69 ± 0.12 | 0.65 ± 0.19 | 0.76 ± 0.12 | 0.68 ± 0.08 |
| Recall    | 0.73 ± 0.12 | 0.76 ± 0.10 | 0.77 ± 0.09 | 0.74 ± 0.07 | 0.82 ± 0.06 | 0.75 ± 0.08 |

Our proposed model significantly outperformed the alternative method, due to its ability to embed high-dimensional features based on the structural relationship. We selected the best GNN model, with r = 0.5, for the next step: interpreting biomarkers.

3.3 Step 2: Interpret and Explain Biomarkers

We put forth the hypothesis that the more accurate the classifier, the more reliable the discovered biomarkers. We used the best RF model with V as input (77.4% accuracy on the testing set) and used the RF-based feature importance (mean Gini impurity decrease) as a standard method for comparison.
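The RF "reality check" with flattened inputs can be sketched with scikit-learn. The flattening mirrors the paper's V ∈ R^1480 and E ∈ R^65712 construction (148 ROIs, 10 node and 3 edge attributes), but the data here is synthetic and the tree count is reduced for speed; this is not the authors' exact setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: N=148 ROIs, D=10 node features, F=3 edge features
S, N, D, F = 40, 148, 10, 3
V = rng.standard_normal((S, N, D))
E = rng.standard_normal((S, N, N, F))
labels = rng.integers(0, 2, size=S)              # 0: HC, 1: ASD

# Flatten node and edge attributes per subject, as input to the RF.
X_v = V.reshape(S, N * D)                        # (S, 1480), the RF(V) input
X_e = E.reshape(S, N * N * F)                    # (S, 65712), the RF(E) input
X = np.concatenate([X_v, X_e], axis=1)           # the RF(V+E) variant

rf = RandomForestClassifier(n_estimators=50, random_state=0)  # paper uses 1000 trees
rf.fit(X_v, labels)                              # RF(V) variant

# Mean-Gini-impurity-decrease importances, reshaped back to (ROI, attribute)
imp = rf.feature_importances_.reshape(N, D)
roi_scores = imp.mean(axis=1)                    # per-ROI importance
top30 = np.argsort(roi_scores)[::-1][:30]        # top 30 ROIs, as in Fig. 3
print(X_v.shape, X_e.shape, top30.shape)
```

Reshaping the flat importance vector back to (ROI, attribute) is what allows the per-ROI ranking and the per-attribute comparison reported in Fig. 3.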
For GNN interpretation, we also chose the best model (83.6% accuracy on the testing set). Further, to be comparable with the RF, all of the interpretation experiments were performed on the training set only. The interpretation results are shown in Fig. 3, where the top 30 important ROIs (averaged over node features and instances) selected by the RF are shown in yellow and the top 30 important ROIs selected by our proposed GNN in red. Nine important ROIs were selected by both methods.

Fig. 3: (a) Top 30 important ROIs (colored in yellow) selected by RF; (b) top 30 important ROIs selected by GNN (R = 20) (colored in red) laid over (a); (c) node attributes' relative importance scores for the two methods.

In addition, for node attribute importance, we averaged the importance score over ROIs and instances for the RF. For the GNN, we averaged the gradient explanation over all the nodes and instances, i.e. E(\frac{1}{N}\sum_i |\partial y / \partial v_{ij}|), where y = p(c = 1 | G), which quantifies the sensitivity of the j-th node attribute. Fig. 3(c) shows importance relative to the most important node attribute: our proposed method assigned more uniform importance across the node attributes, among which the biological motion parameter β_1 was the most important. In addition, similar features, the mean/std of task-fMRI (tf_mean/tf_std) and the coordinates (x, y, z), have similar scores, which makes more sense for human interpretation. Notice that our proposed pipeline is also able to identify sub-graph importance from Eq. (4), which is helpful for understanding the interaction between different brain regions.

Fig. 4: (a)(c) Top-scoring sub-graph and corresponding functional decoding keywords and coefficients; (b)(d) the 2nd-highest-scoring sub-graph and corresponding functional decoding keywords and coefficients.
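The gradient-based attribute sensitivity E(\frac{1}{N}\sum_i |\partial y / \partial v_{ij}|) described above can be approximated against any black-box scalar model by central finite differences (in practice one would use an autograd framework). The toy "model" and all names below are illustrative, not the paper's network.

```python
import numpy as np

def attribute_sensitivity(model, V, eps=1e-5):
    """Approximate (1/N) * sum_i |dy/dv_ij| for each node attribute j.

    model: callable mapping an (N, D) node-feature matrix to a scalar
    probability y = p(c=1 | G); gradients via central finite differences.
    """
    N, D = V.shape
    sens = np.zeros(D)
    for i in range(N):
        for j in range(D):
            Vp, Vm = V.copy(), V.copy()
            Vp[i, j] += eps
            Vm[i, j] -= eps
            sens[j] += abs(model(Vp) - model(Vm)) / (2 * eps)
    return sens / N                               # average over nodes

# Toy "model": logistic score on a weighted sum of node features.
rng = np.random.default_rng(0)
N, D = 8, 10
W = rng.standard_normal(D)
W[0] = 5.0                                        # make attribute 0 clearly dominant
model = lambda V: 1.0 / (1.0 + np.exp(-(V @ W).mean()))

V = rng.standard_normal((N, D))
sens = attribute_sensitivity(model, V)
rel = sens / sens.max()                           # relative importance, as in Fig. 3(c)
print(int(np.argmax(rel)))                        # -> 0: the planted dominant attribute
```

Normalizing by the maximum reproduces the "relative to the most important attribute" scale used in Fig. 3(c).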
We selected the top 2 sub-graphs (R = 20) and used Neurosynth [15] to decode the functional keywords associated with the sub-graphs (shown in Fig. 4). These networks are both associated with high-level social behaviors. To illustrate the predictive power of the 2 sub-graphs, we retrained the network using the graph slicing on the 19 ROIs of the 2 sub-graphs as input. Accuracy on the testing set (in the split of the best model) was 78.9%, comparable to using the whole graph.

3.4 Evaluation: Robustness Discussion

To examine the potential influence of different graph-building strategies on the reliability of network estimates, the functional and anatomical data were registered and parcellated with the Destrieux atlas (A1) and the Desikan-Killiany atlas (A2) [4]. We also showed the robustness of the results with respect to the number of clusters, R = 10, 20, 30. The results are shown in Fig. 5. We ranked the ECCs for each node and indicated the top 30 ROIs in A1 and the top 15 ROIs in A2. The atlas and number of clusters are indicated on the left of each sub-figure. The orbitofrontal cortex and ventromedial prefrontal cortex are selected in all the cases; these regions are related to social motivation and have previously been shown to be associated with ASD [9]. We also validated the results by decoding the neurological functions of the important ROIs overlapping with Neurosynth.

Fig. 5: (a) The biomarkers (red) interpreted on A1 with 20 clusters; (b)-(d) the biomarkers interpreted with different R and atlases laid over (a) in different colors; (e) the correlation between overlapping ROIs and functional keywords.

4 Conclusion and Future Work

In this paper, we proposed a framework to discover ASD brain biomarkers from task-fMRI using a GNN. It achieved improved accuracy and more interpretable features than the baseline method.
We also showed that our method performs robustly across different atlases and hyper-parameters. Future work will include investigating more hyper-parameters (e.g. a suitable size of the sub-graph communities), testing the results on functional atlases, and exploring different graph definition methods. The pipeline can be generalized to other feature importance analysis problems, such as resting-fMRI biomarker discovery and vessel cancer detection.

Acknowledgments. This work was supported by NIH Grant R01 NS035193.

References

1. Adebayo, J., et al.: Sanity checks for saliency maps. In: Advances in Neural Information Processing Systems, pp. 9505–9515 (2018)
2. Cangea, C., et al.: Towards sparse hierarchical graph classifiers. arXiv preprint arXiv:1811.01287 (2018)
3. Carroll, J.D., Chang, J.J.: Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart-Young decomposition. Psychometrika 35(3), 283–319 (1970)
4. Desikan, R.S., et al.: An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage 31(3), 968–980 (2006)
5. Destrieux, C., et al.: Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage 53(1), 1–15 (2010)
6. Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric. CoRR abs/1903.02428 (2019)
7. Gilmer, J., et al.: Neural message passing for quantum chemistry. In: ICML 2017, pp. 1263–1272. JMLR.org (2017)
8. Goldani, A.A., et al.: Biomarkers in autism. Frontiers in Psychiatry 5 (2014)
9. Kaiser, M.D., et al.: Neural signatures of autism. PNAS (2010)
10. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
11. Ktena, S.I., et al.: Distance metric learning using graph convolutional networks: Application to functional brain networks. In: MICCAI (2017)
12. Loe, C.W., Jensen, H.J.: Comparison of communities detection algorithms for multiplex. Physica A: Statistical Mechanics and its Applications 431, 29–45 (2015)
13. Nishii, R.: Box-Cox transformation. Encyclopedia of Mathematics (2001)
14. Yang, D., et al.: Brain responses to biological motion predict treatment outcome in young children with autism. Translational Psychiatry 6(11), e948 (2016)
15. Yarkoni, T., et al.: Large-scale automated synthesis of human functional neuroimaging data. Nature Methods 8(8), 665 (2011)
