Text-mining the NeuroSynth corpus using Deep Boltzmann Machines
Large-scale automated meta-analysis of neuroimaging data has recently established itself as an important tool in advancing our understanding of human brain function. This research has been pioneered by NeuroSynth, a database collecting both brain act…
Authors: Ricardo Pio Monti, Romy Lorenz, Robert Leech
T e xt-mining the Neur oSynth corpus using Deep Boltzmann Machines Ricardo Monti ∗ , Romy Lorenz † , Robert Leech † , Christoforos Anagnostopoulos ∗ and Giov anni Montana ∗ ‡ ∗ Department of Mathematics, Imperial College London † Computational, Cognitiv e and Clinical Neuroimaging Laboratory , Imperial College London ‡ Department of Biomedical Engineering, Kings College London Abstract —Large-scale automated meta-analysis of neuroimag- ing data has recently established itself as an important tool in advancing our understanding of human brain function. This resear ch has been pioneered by NeuroSynth , a database collecting both brain activation coordinates and associated text across a large cohort of neur oimaging r esearch papers. One of the fundamental aspects of such meta-analysis is text-mining. T o date, word counts and mor e sophisticated methods such as Latent Dirichlet Allocation have been proposed. In this work we present an unsupervised study of the NeuroSynth text corpus using Deep Boltzmann Machines (DBMs). The use of DBMs yields several advantages ov er the af orementioned methods, principal among which is the fact that it yields both word and document embed- dings in a high-dimensional vector space. Such embeddings serve to facilitate the use of traditional machine lear ning techniques on the text corpus. The proposed DBM model is shown to learn embeddings with a clear semantic structur e. Keyw ords — Deep Boltzmann machines; text-mining; topic mod- els; meta-analysis; I . I N T RO D U C T I O N The study of the human brain using functional magnetic resonance imaging (fMRI) has advanced rapidly in the last decades. This has provided significant insights into the re- lationship between architecture and function of the human brain. This is reflected in the number of published studies, which has grown exponentially during this time. Consequently , a major challenge for the scientific community in v olves the efficient inte gration and analysis of kno wledge across this wide corpus of studies [1]. This challenge has inspired attempts to automatically aggregate and analyze knowledge across the field of fMRI. In particular , NeuroSynth [2] is a meta-analysis database collecting both brain acti v ation coordinates and the corresponding text across a range of over ten thousand studies. This has important applications in the analysis and interpre- tation of fMRI data such as facilitating quantitativ e rev erse inference [3]. The automated extraction of information from a collection of published neuroimaging studies is based on two funda- mental pillars; the first of which in volv es generating detailed statistical maps. In this work we focus on the second pillar; the extraction and analysis of semantic topics from text [1]. Such methods look to emplo y te xt-mining methodologies to disco ver latent topics in the brain imaging literature. Such approaches can subsequently be combined with acti v ation coordinates to examine the underlying mapping between cognitive and neural states. Recent attempts to model the semantic structure of the neuroimaging literature hav e focused on the use of Latent Dirichlet Allocation (LDA) models [1]. Such an approach is able to learn a pre-specified number of latent “topics” which generated observed text. In this work we present a related approach based on the use of Deep Boltzmann ma- chines (DBMs). The motiv ation behind the use of DBMs ov er alternati ve text-mining approaches such as LD A is two- fold. First, the use of restricted Boltzmann machines (RBMs), which are a special case of DBMs, has recently been sho wn to outperform LD A in terms of generalization performance [4]. This is hypothesized to be the result of RBMs learning useful internal representations of the text corpus [5]. The presence of additional hidden layers in DBMs would serve to further facilitate the learning of internal representations. The second advantage of using DBMs is that such models yield an embedding of words or documents in a high-dimensional vector space. Such embeddings are a crucial component of modern natural language processing systems [5] as they can be easily incorporated into traditional machine learning pipelines. Furthermore, the use of word embeddings can be employed to learn joint models across both text and the associated acti v ation coordinates which is the ultimate objective of meta-analysis studies [2]. In this work we demonstrate that DBMs can be effecti v ely employed to learn the distribution of the NeuroSynth text cor- pus. Further , the proposed model is able to learn embeddings of both individual words as well as entire documents. As motiv ation, T able I shows some of the clusters obtained when k -means clustering is applied to word embeddings obtained from the DBM model. The clusters display clear semantic context. I I . M A T E R I A L S A N D M E T H O D S A. Deep topic models In this section we outline the models employed in this work. W e be gin by introducing Restricted Boltzmann machines (RBMs), which serve as the building blocks of the deeper architectures considered in this work. Extensions of RBMs to directly model word counts are discussed before considering Deep Boltzmann machines (DBMs). 1) Restricted Boltzmann machines: RBMs are a class of undirected graphical models which specify a probability distri- bution over observed binary variables v ∈ { 0 , 1 } D and binary hidden v ariables h ∈ { 0 , 1 } F . Formally , RBMs are energy based models which hav e a bipartite graph structure across visible and hidden variables. This structure is imposed in order to facilitate the learning of the models parameters which we discuss below . The follo wing energy function is defined on any configu- ration of visible and hidden units: E ( v , h ; θ ) = − v T W h − a T v − b T h (1) T ABLE I. E X AM P L E S O F S E M A NT I C C L U S TE R S L E AR N T B Y D E E P T OP I C M O D E L Associated vocabulary memory , retriev al, encoding, hippocampus, hippocampal, episodic, items, recall, memories, recollection, item, familiarity , autobiographical language, semantic, words, speech, word, reading, verbal, phonological, lexical, linguistic, naming, fluency , verbs, english adults, age, children, years, older, young, development, adolescents, developmental, aging, sleep, adult, late, younger, blind, childhood, hearing, adolescence emotional, amygdala, social, negative, faces, face, emotion, neutral, affective, facial, anxiety , fear , expressions, regulation, emotions, ofc, valence, personality, arousal, fearful, trait, threat, sad, happy , mood, empathy , moral, person, traits, communication patients, controls, schizophrenia, disorder , deficits, disease, abnormalities, symptoms, impaired, impairment, adhd, alterations, dysfunction, mdd, abnormal, atrophy , pa- tient, ptsd, sev erity , mci, damage, bipolar, lesions, impairments, deficit, depressive, ocd, mild, syndrome, symptom, elderly , dementia, epilepsy , poor , pathophysiology where θ = { W , a, b } are the parameters of the RBM which we wish to estimate. The probability of any gi v en configuration ( v , h ) is subsequently defined as P ( v , h ; θ ) = 1 Z ( θ ) e − E ( v,h ; θ ) , where Z ( θ ) = P v ,h e − E ( v,h ; θ ) is normalizing constant. Fur - thermore, the lik elihood for any observ ation, v , can be obtained by summing ov er binary hidden units: P ( v ; θ ) = 1 Z ( θ ) X h e − E ( v,h ; θ ) . (2) Parameter learning in RBMs is typically achiev ed via per- forming gradient descent on the log-likelihood ov er observed data. From equation (2), the training data log-likelihood is composed of a positive term, φ + = log P h e − E ( v,h ; θ ) , and a ne gative term, φ − = log Z ( θ ) [6]. The deriv ate with respect to the positiv e term corresponds to an expectation ov er the data dependent distribution of hidden variables, which can be easily computed due to the bipartite structure of RBMs. Howe ver , the deriv ati ve of the negati v e term in v olves an expectation ov er the distribution of both visible and hidden units under the proposed model which is intractable. This expectation is typically approximated by looking to sample from this distribution using MCMC. Starting with visible units, Gibbs sampling is applied k times in order to obtain an unbiased sample of the gradient in a procedure known as Contrastive Div er gence [7]. Letting k → ∞ recovers maximum likelihood, howe v er in practice it has been shown empirically that setting k = 1 performs well. 2) Replicated Softmax model: The aforementioned RBM model can be employed when the objecti ve is to learn the probability over binary visible variables. In the context of modeling documents it is possible to treat the occurrence of words at specific locations in the text as binary variables. In this case the observations correspond to a binary incidence matrix V ∈ { 0 , 1 } N × D where V n,d =1 if the n th word in the document takes the d th value. While such an approach is able to model the order of words, there is an explosion in the number of parameters. The replicated softmax RBM takes a more parsimonious alternativ e, directly modeling the word counts, ˆ v d = P n V n,d [4]. In such a setting visible units ˆ v ∈ N D correspond to a vector of words counts for each document. Note that D corresponds to the size of the vocab ulary . The energy of a state ( ˆ v, h ) is defined as: E ( ˆ v, h ; θ ) = − ˆ v T W h − a T ˆ v − M · b T h, (3) where M = P d ˆ v d is the total number of words in a document. As with a standard RBM, learning proceeds via Contrastive Div er gence. Such models can be interpreted as learning a distribution over word histograms of documents. 3) Deep Boltzmann machines: DBMs are extensions of RBMs to allo w for multiple layers of hidden variables. Such models have the capability of learning internal representations of the data which are increasing complex [8]. Throughout this work we consider a two-layer DBM with multinomial visible variables and binary hidden variables. Such a model is associated with the following energy function 1 : E ( ˆ v, h 1 , h 2 ) = − ˆ v T W 1 h 1 − h 1 T W 2 h 2 (4) where we write h 1 and h 2 to denote the first and second layer of binary hidden variables respecti vely . Similarly , parameters θ = { W 1 , W 2 } represent the symmetric interaction terms be- tween visible-to-hidden and hidden-to-hidden variables. Anal- ogous to equation (2), the probability assigned to a visible vector , ˆ v is defined as: P ( ˆ v ; θ ) = 1 Z ( θ ) X h 1 ,h 2 e − E (ˆ v ,h 1 ,h 2 ; θ ) (5) Furthermore, due to the bipartite across layers the conditional distributions of each of the layers can be computed in closed form. This allows for the use of persistent Markov Chains [6] to estimate the intractable model expectations. Naiv e mean- field variational inference is then used to approximate the data- dependent expectations. For further details we refer readers to [8]. In practice, appropriate initialization of parameters is cru- cial to the success of deep models. [8] propose a greedy , layer-by-layer pretraining algorithm for DBMs. This in volv es iterativ ely stacking RBMs, with the small cav eat that bottom- up (likewise top-do wn) contributions from the bottom (top) layer should be doubled during pretraining. 4) Model selection: Selecting the number of hidden units within each layer of a DBM is a non-trivial task. The difficulty of such an approach arises from the need to estimate the (typically intractable) partition function Z ( θ ) for the entire model. As Z ( θ ) depends on both the parameters as well as number of hidden units, it must be calculated in order to perform model comparison. Importance sampling is often employed to estimate prop- erties of distributions known only up to a normalizing con- stant using samples from a kno wn distrib ution. Howe v er , for importance sampling to yield a reliable estimate the known proposal distribution must resemble the target distribution. In the context of high-dimensional RBMs finding such a proposal distribution is challenging. In order to address this challenge, [9] propose the use of annealed importance sampling (AIS). Here a sequence of auxiliary proposal distributions are defined which iterativ ely approximate the target distribution. Due to the bipartite structure of RBMs, it is easy to transition across the intermediate distributions (in practice we apply one iteration of Gibbs sampling). In this fashion it is possible to begin with a sample from a uniform RBM (with 1 we have excluded bias terms for clarity partition Z 0 = 2 F ), which we propagate through auxiliary distributions [4]. In this work a greedy , layer -by-layer approach was taken to select the model architecture. As a result, the bottom layer RBM was trained using a range of hidden units. The architecture which yielded the maximum likelihood across a held-out validation set was selected. The hidden activ ation from this RBM was subsequently provided as input for the top layer RBM and the process was repeated. B. Dataset The NeuroSynth text corpus was employed in this work. While the original corpus contains word frequencies ov er the entire text for each publication, in this work only the publication abstracts were employed. This served to reduce the range of vocab ulary employed and was motiv ated by our belief that much of the semantic structure present in a publication would also be present in the corresponding abstract. Abstracts were collected for 10574 publications using the PUBMED API resulting in a mean document length of 80 words ( ± 25 words). Standard preprocessing was applied to the text corpus. Stop words were removed, as well as words which did not occur with sufficient frequency (fe wer than 50 occurrences throughout the corpus). This resulted in a vocab ulary of approximately two thousand words, of which the 1000 words which occurred most frequently were retained (corresponding to over 80% of terms). The dataset was split into a training set consisting of 9516 documents and a test set with the remaining 1058 documents. I I I . R E S U LT S A. Model arc hitectur e and implementation details A two-layer DBM was employed consisting of a visible layer of multinomial visible units followed by two binary hidden layers with 50 units each. During pretraining and model selection RBMs where trained using CD − 1 . In addition, dropout was employed as a form of regularization with hidden units retained with probability 0 . 9 . The architecture was selected by minimizing the negativ e log-likelihood over a held out v alidation dataset in a greedy manner as described previously . Briefly , AIS was employed to estimate the partition function for each RBM. Five thousand auxiliary distributions were employed (specified by uniformly spaced in verse temperatures) and estimates were averaged ov er fiv e hundred runs. Finally , the DBM was initialized to weights learnt during pretraining and trained as described in [8]. The proposed DBM model can be used to obtain both word as well as document embeddings in a high-dimensional vector space 2 . In the remainder of this section we study both the word and document embeddings obtained from the proposed DBM model. B. W ord embeddings The proposed DBM model can be employed to obtain a high-dimensional embeddings for each word in our vocab ulary . 2 in our case the embedding will be in R 50 as the top layer has 50 hidden units T ABLE II. E X AM P L E S O F O N E - ST E P R E C O NS T RU C T IO N Input One-step reconstruction memory memory , working, recall, performance, retriev al, verbal, load, seman- tic, recognition, task emotion social, emotion, emotional, regions, ofc, brain, affecti ve, gray , traits, amygdala face social, facial, faces, face, emotional, processing, regions, functional, brain, cortex disorder patients, mdd, disorder, adhd, abnormalities, controls, brain, matter, alterations, structural mode network, default, connectivity , brain, regions, cognitive, functional, mode, activity , cortex The word clusters sho wn in T able I can then be obtained by by applying k -means clustering to the word embeddings. The number of clusters was selected based on silhouette scores. Further , Figure [1i] shows a 2D visualization of word em- beddings using t-SNE [10]. Three sections of the embedding hav e been highlighted. Regions A and C sho wcase embeddings for terms relating to emotion and memory respectively . It is important to note that the relev ant brain regions are contained in this sections (i.e., the amygdala and orbitofrontal cortex in region A and the hippocampus in region C). Meanwhile region B contains terms relating to age and dev elopment. Finally , an alternative manner of demonstrating the DBM model has obtained a good estimate of the probability distri- bution is to consider one-step reconstructions. Some examples are provided in T able II. The input words where employed to obtain a distrib ution over hidden units at the top level. This distribution w as then employed to obtain a distribution o ver words. The words with highest probability mass are shown in the right column. C. Document embeddings Document embeddings are obtained in analogous fashion by providing the entire document word vector as input to the DBM. By clustering document embeddings and leveraging the activ ation maps within the NeuroSynth database, we are able to study the activ ations associated with each cluster . Figure [1ii] shows a subset of the 2D embeddings obtained using t-SNE over all document embeddings. As before, k - means clustering was employed to cluster documents accord- ing to their associated high-dimensional representations (again silhouette scores used to select number of clusters). It is then possible to study the reported functional activ ations of all publications within a giv en cluster . Follo wing [1], this was achiev ed by con volving all activ ations within a cluster (i.e., all activ ations reported by documents in a cluster) with a 10mm Gaussian kernel. This resulted in a mean activ ation map for all documents within a given cluster . The peak activ ations, together with the most frequently occurring words are shown for six clusters in Figure [1iii] 3 . The activ ation maps highlight key functional networks and regions, for example clusters four and six identify the pain and motor regions respecti vely . The clusters also appear to identify pathologies. For example cluster one appears to be related to cogniti ve impairment and atrophy . Furthermore, it is interesting to note that spatially adjacent clusters share some similarities. For example, clusters two and five both show frontal activ ation. 3 figure produced using nilearn module [11]. Fig. 1. i) The top left panel shows the result of applying t-SNE on word embeddings obtained from the DBM model. Three regions have been highlighted are are shown in greater detail in the remaining three panels. It can be seen that regions A and C correspond to emotion and memory related terms respectiv ely while region C contains terms associated with aging and dev elopment. ii) A subset of the two dimensional embedding obtained from applying t-SNE on document embedding. iii) Activ ation maps (left column) are shown for several of the highlighted clusters shown in ii) together the with the most frequently occurring terms (right column). I V . D I S C U S S I O N A N D F U T U R E W O R K In this paper we have demonstrated the use of DBMs in modeling a te xt corpus composed of abstracts from neuroscien- tific publications. The proposed DBM model is able to yield a vector representation of both individual words as well as entire documents. Such representations are advantageous for many reasons, for example the y can be employed to cluster the words or documents. Further , by combining the abstracts with the NeuroSynth corpus, we are able to study whether the activ ation maps associated with each cluster . While only exploratory results are presented in this work, future work will look simultaneously model both text and activ ations, thereby facilitating formal inference. A further exciting application would be to lev er- age document embeddings to inform novel machine learning applications in neuroscience, such as the recently proposed Automated Neur oscientist [12]. R E F E R E N C E S [1] R. Poldrack, J. Mumford, T . Schonberg, D. Kalar, B. Barman, and T . Y arkoni. Discov ering relations between mind, brain, and mental disorders using topic mapping. PLoS Comput Biol , 8 (10):e1002707, 2012. [2] T . Y arkoni, R. Poldrack, T . Nichols, D. V an Essen, and T . W ager. Large-scale automated synthesis of human functional neuroimaging data. Nature methods , 8(8):665–670, 2011. [3] R. Poldrack. Can cognitiv e processes be inferred from neuroimaging data? T rends in cognitive sciences , 10(2):59–63, 2006. [4] G. Hinton and R. Salakhutdinov. Replicated softmax: an undirected topic model. In Advances in neural information pr ocessing systems , pages 1607–1614, 2009. [5] Y . Bengio, A. Courville, and P . V incent. Representation learning: A review and new perspectives. P attern Analysis and Machine Intelligence, IEEE T ransactions on , 35(8):1798–1828, 2013. [6] T . Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In International conference on Mac hine learning , pages 1064–1071. ACM, 2008. [7] G. Hinton. Training products of experts by minimizing contrasti ve di vergence. Neural computation , 14(8):1771–1800, 2002. [8] R. Salakhutdinov and G. Hinton. Deep boltzmann machines. In International conference on artificial intelligence and statistics , pages 448–455, 2009. [9] R. Salakhutdinov and I. Murray . On the quantitati ve analysis of deep belief networks. In Proceedings of the 25th international conference on Mac hine learning , pages 872–879. A CM, 2008. [10] L. V an der Maaten and G. Hinton. V isualizing data using t-SNE. Journal of Machine Learning Researc h , 9(2579-2605):85, 2008. [11] A. Abraham, F . Pedregosa, M. Eickenberg, P . Gervais, A. Muller, J. Kossaifi, A. Gramfort, B. Thirion, and G. V aroquaux. Machine learning for neuroimaging with scikit-learn. arXiv preprint arXiv:1412.3919 , 2014. [12] R. Lorenz, R. Monti, I. Violante, C. Anagnostopoulos, A. Faisal, G. Montana, and R. Leech. The automatic neuroscientist: A framework for optimizing experimental design with closed-loop real- time fMRI. Neur oImage , 2016.
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment