Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge



Ryan J. Gallagher (1,2), Kyle Reing (1), David Kale (1), and Greg Ver Steeg (1)

(1) Information Sciences Institute, University of Southern California
(2) Vermont Complex Systems Center, Computational Story Lab, University of Vermont

ryan.gallagher@uvm.edu, {reing,kale,gregv}@isi.edu

Abstract

While generative models such as Latent Dirichlet Allocation (LDA) have proven fruitful in topic modeling, they often require detailed assumptions and careful specification of hyperparameters. Such model complexity issues only compound when trying to generalize generative models to incorporate human input. We introduce Correlation Explanation (CorEx), an alternative approach to topic modeling that does not assume an underlying generative model, and instead learns maximally informative topics through an information-theoretic framework. This framework naturally generalizes to hierarchical and semi-supervised extensions with no additional modeling assumptions. In particular, word-level domain knowledge can be flexibly incorporated within CorEx through anchor words, allowing topic separability and representation to be promoted with minimal human intervention. Across a variety of datasets, metrics, and experiments, we demonstrate that CorEx produces topics that are comparable in quality to those produced by unsupervised and semi-supervised variants of LDA.

1 Introduction

The majority of topic modeling approaches utilize probabilistic generative models, which specify mechanisms for how documents are written in order to infer latent topics. These mechanisms may be explicitly stated, as in Latent Dirichlet Allocation (LDA) (Blei et al., 2003), or implicitly stated, as with matrix factorization techniques (Hofmann, 1999; Ding et al., 2008; Buntine and Jakulin, 2006).
The core generative mechanisms of LDA, in particular, have inspired numerous generalizations that account for additional information, such as authorship (Rosen-Zvi et al., 2004), document labels (McAuliffe and Blei, 2008), or hierarchical structure (Griffiths et al., 2004).

However, these generalizations come at the cost of increasingly elaborate and unwieldy generative assumptions. While these assumptions allow topic inference to remain tractable in the face of additional metadata, they progressively constrain topics to a narrower view of what a topic can be. Such assumptions are undesirable in contexts where one wishes to minimize model complexity and learn topics without preexisting notions of how those topics originated.

For these reasons, we propose topic modeling by way of Correlation Explanation (CorEx),[1] an information-theoretic approach to learning latent topics over documents. Unlike LDA, CorEx does not assume a particular data-generating model, and instead searches for topics that are "maximally informative" about a set of documents. By learning informative topics rather than generated topics, we avoid specifying the structure and nature of topics ahead of time.

In addition, the lightweight framework underlying CorEx is versatile and naturally extends to hierarchical and semi-supervised variants with no additional modeling assumptions. More specifically, we may flexibly incorporate word-level domain knowledge within the CorEx topic model. Topic models are often susceptible to portraying only the dominant themes of documents. Injecting a topic model, such as CorEx, with domain knowledge can help guide it towards otherwise underrepresented topics that are of importance to the user.

[1] Open source, documented code for the CorEx topic model is available at https://github.com/gregversteeg/corex_topic.
By incorporating relevant domain words, we might encourage our topic model to recognize a rare disease that would otherwise be missed in clinical health notes, focus more attention on topics from news articles that can guide relief workers in distributing aid more effectively, or disambiguate aspects of a complex social issue.

Our contributions are as follows: first, we frame CorEx as a topic model and derive an efficient alteration to the CorEx algorithm that exploits sparse data, such as word counts in documents, for dramatic speedups. Second, we show how domain knowledge can be naturally integrated into CorEx through "anchor words" and the information bottleneck. Third, we demonstrate that CorEx and anchored CorEx produce topics of comparable quality to unsupervised and semi-supervised variants of LDA over several datasets and metrics. Finally, we carefully detail several anchoring strategies that highlight the versatility of anchored CorEx on a variety of tasks.

2 Methods

2.1 CorEx: Correlation Explanation

Here we review the fundamentals of Correlation Explanation (CorEx), adopting the notation used by Ver Steeg and Galstyan in their original presentation of the model (2014). Let $X$ be a discrete random variable that takes on a finite number of values, indicated with lowercase $x$. Furthermore, if we have $n$ such random variables, let $X_G$ denote a sub-collection of them, where $G \subseteq \{1, \ldots, n\}$. The probability of observing $X_G = x_G$ is written as $p(X_G = x_G)$, which is typically abbreviated to $p(x_G)$. The entropy of $X$ is written as $H(X)$, and the mutual information of two random variables $X_1$ and $X_2$ is given by $I(X_1 : X_2) = H(X_1) + H(X_2) - H(X_1, X_2)$.

The total correlation, or multivariate mutual information, of a group of random variables $X_G$ is expressed as

$$TC(X_G) = \sum_{i \in G} H(X_i) - H(X_G) \quad (1)$$
$$= D_{KL}\!\left( p(x_G) \,\Big\|\, \prod_{i \in G} p(x_i) \right). \quad (2)$$

We see that Eq. 1 does not quantify "correlation" in the modern sense of the word, so it can be helpful to conceptualize total correlation as a measure of total dependence. Indeed, Eq. 2 shows that total correlation can be expressed using the Kullback-Leibler divergence and, therefore, it is zero if and only if the joint distribution of $X_G$ factorizes; in other words, there is no dependence between the random variables.

The total correlation can also be written conditioned on another random variable $Y$: $TC(X_G \mid Y) = \sum_{i \in G} H(X_i \mid Y) - H(X_G \mid Y)$. So, we can consider the reduction in total correlation when conditioning on $Y$:

$$TC(X_G ; Y) = TC(X_G) - TC(X_G \mid Y) \quad (3)$$
$$= \sum_{i \in G} I(X_i : Y) - I(X_G : Y). \quad (4)$$

The quantity expressed in Eq. 3 acts as a lower bound of $TC(X_G)$ (Ver Steeg and Galstyan, 2015), as readily verified by noting that $TC(X_G)$ and $TC(X_G \mid Y)$ are always non-negative. Also note, the joint distribution of $X_G$ factorizes conditional on $Y$ if and only if $TC(X_G \mid Y) = 0$. If this is the case, then $TC(X_G ; Y)$ is maximized, and $Y$ explains all of the dependencies in $X_G$.

In the context of topic modeling, $X_G$ represents a group of word types and $Y$ represents a topic to be learned. Since we are always interested in grouping multiple sets of words into multiple topics, we denote the binary latent topics as $Y_1, \ldots, Y_m$ and their corresponding groups of word types as $X_{G_j}$ for $j = 1, \ldots, m$, respectively. The CorEx topic model seeks to maximally explain the dependencies of words in documents through latent topics by maximizing $TC(X ; Y_1, \ldots, Y_m)$. To do this, we maximize the following lower bound on this expression:

$$\max_{G_j,\, p(y_j \mid x_{G_j})} \sum_{j=1}^{m} TC(X_{G_j} ; Y_j). \quad (5)$$

As we describe in the following section, this objective can be efficiently approximated, despite the search occurring over an exponentially large probability space (Ver Steeg and Galstyan, 2014).

Since each topic explains a certain portion of the overall total correlation, we may choose the number of topics by observing diminishing returns to the objective. Furthermore, since the CorEx implementation depends on a random initialization (as described shortly), one may restart the CorEx topic model several times and choose the run that explains the most total correlation.

The latent factors, $Y_j$, are optimized to be informative about dependencies in the data and do not require generative modeling assumptions. Note that the discovered factors, $Y$, can be used as inputs to construct new latent factors, $Z$, and so on, leading to a hierarchy of topics. Although this extension is quite natural, we focus our analysis on the first level of topic representations for easier interpretation and evaluation.

2.2 CorEx Implementation

We summarize the implementation of CorEx as presented by Ver Steeg and Galstyan (2014) in preparation for the innovations introduced in the subsequent sections. The numerical optimization for CorEx begins with a random initialization of parameters and then proceeds via an iterative update scheme similar to EM. For computational tractability, we subject the optimization in Eq. 5 to the constraint that the groups, $G_j$, do not overlap, i.e., we enforce single-membership of words within topics. The optimization entails a combinatorial search over groups, so instead we look for a form that is more amenable to smooth optimization. We rewrite the objective using the alternate form in Eq. 4 while introducing indicator variables $\alpha_{i,j}$, which are equal to 1 if and only if word $X_i$ appears in topic $Y_j$ (i.e., $i \in G_j$):

$$\max_{\alpha_{i,j},\, p(y_j \mid x)} \sum_{j=1}^{m} \left( \sum_{i=1}^{n} \alpha_{i,j} I(X_i : Y_j) - I(X : Y_j) \right)$$
$$\text{s.t.} \quad \alpha_{i,j} = \mathbb{I}\!\left[ j = \arg\max_{\bar{j}} I(X_i : Y_{\bar{j}}) \right]. \quad (6)$$

Note that the constraint on non-overlapping groups now becomes a constraint on $\alpha$. To make the optimization smooth, we relax the constraint so that $\alpha_{i,j} \in [0, 1]$. To do so, we replace the second line with a softmax function. The update for $\alpha$ at iteration $t$ becomes

$$\alpha^{t}_{i,j} = \exp\!\left( \lambda^{t} \left( I(X_i : Y_j) - \max_{\bar{j}} I(X_i : Y_{\bar{j}}) \right) \right).$$

Now $\alpha \in [0, 1]$, and the parameter $\lambda$ controls the sharpness of the softmax function. Early in the optimization we use a small value of $\lambda$, then increase it later in the optimization to enforce a hard constraint. The objective in Eq. 6 only lower bounds total correlation in the hard max limit. The constraint on $\alpha$ forces competition among latent factors to explain certain words, while setting $\lambda = 0$ results in all factors learning the same thing.

Holding $\alpha$ fixed, taking the derivative of the objective with respect to the variables $p(y_j \mid x)$, and setting it equal to zero leads to a fixed point equation. We use this fixed point to define update equations at iteration $t$:

$$p^{t}(y_j) = \sum_{\bar{x}} p^{t}(y_j \mid \bar{x})\, p(\bar{x}) \quad (7)$$
$$p^{t}(x_i \mid y_j) = \sum_{\bar{x}} p^{t}(y_j \mid \bar{x})\, p(\bar{x})\, \mathbb{I}[\bar{x}_i = x_i] \,/\, p^{t}(y_j)$$
$$\log p^{t+1}(y_j \mid x^{\ell}) = \log p^{t}(y_j) + \sum_{i=1}^{n} \alpha^{t}_{i,j} \log \frac{p^{t}(x^{\ell}_i \mid y_j)}{p(x^{\ell}_i)} - \log Z_j(x^{\ell}). \quad (8)$$

The first two lines define the marginals in terms of the optimization parameter, $p^{t}(y_j \mid x)$. We take $p(x)$ to be the empirical distribution defined by some observed samples, $x^{\ell},\ \ell = 1, \ldots, N$. The third line updates $p^{t}(y_j \mid x^{\ell})$, the probabilistic labels for each latent factor, $Y_j$, for a given sample, $x^{\ell}$. Note that an easily calculated constant, $Z_j(x^{\ell})$, appears to ensure the normalization of $p^{t}(y_j \mid x^{\ell})$ for each sample. We iterate through these updates until convergence.
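As a concrete illustration of Eq. 1, total correlation can be estimated directly from an empirical joint distribution of binary variables. The sketch below is plain NumPy, not the authors' implementation; the function names are our own:

```python
import numpy as np
from collections import Counter

def entropy(counts):
    """Shannon entropy (in nats) of an empirical distribution given by counts."""
    p = np.asarray(list(counts), dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def total_correlation(samples):
    """TC(X_G) = sum_i H(X_i) - H(X_G), estimated from rows of samples (Eq. 1)."""
    samples = np.asarray(samples)
    marginal_sum = sum(entropy(Counter(samples[:, i]).values())
                       for i in range(samples.shape[1]))
    joint = entropy(Counter(map(tuple, samples)).values())
    return marginal_sum - joint

# Two perfectly dependent binary words: TC = H(X1) + H(X2) - H(X1, X2) = log 2.
print(total_correlation([(0, 0), (1, 1), (0, 0), (1, 1)]))  # ≈ 0.693 (= log 2)
# Two independent binary words: the joint factorizes, so TC = 0.
print(total_correlation([(0, 0), (0, 1), (1, 0), (1, 1)]))  # ≈ 0.0
```

Equivalently, by Eq. 2 these values are the KL divergence between the empirical joint and the product of its marginals.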
After convergence, we use the mutual information terms $I(X_i : Y_j)$ to rank which words are most informative for each factor. The objective is a sum of terms for each latent factor, which allows us to rank the contribution of each factor toward our lower bound on the total correlation. The expected log of the normalization constant, often called the free energy, $\mathbb{E}[\log Z_j(x)]$, plays an important role since its expectation provides a free estimate of the $j$-th term in the objective (Ver Steeg and Galstyan, 2015), as can be seen by taking the expectation of Eq. 8 at convergence and comparing it to Eq. 6. Because our sample estimate of the objective is just the mean of contributions from individual sample points, $x^{\ell}$, we refer to $\log Z_j(x^{\ell})$ as the pointwise total correlation explained by factor $j$ for sample $\ell$. Pointwise TC can be used to localize which samples are particularly informative about specific latent factors.

2.3 Sparsity Optimization

2.3.1 Derivation

To alter the CorEx optimization procedure to exploit sparsity in the data, we now assume that all variables, $x_i, y_j$, are binary, and $x$ is a binary vector where $x^{\ell}_i = 1$ if word $i$ occurs in document $\ell$ and $x^{\ell}_i = 0$ otherwise. Since all variables are binary, the marginal distribution, $p(x_i \mid y_j)$, is just a two-by-two table of probabilities and can be estimated efficiently. The time-consuming part of training is the subsequent update of the document labels in Eq. 8 for each document $\ell$. Computing the log likelihood ratio for all $n$ words over all documents is inefficient, as most words do not appear in a given document. We therefore rewrite the logarithm in the interior of the sum:

$$\log \frac{p^{t}(x^{\ell}_i \mid y_j)}{p(x^{\ell}_i)} = \log \frac{p^{t}(X_i = 0 \mid y_j)}{p(X_i = 0)} + x^{\ell}_i \log\!\left( \frac{p^{t}(X_i = 1 \mid y_j)\, p(X_i = 0)}{p^{t}(X_i = 0 \mid y_j)\, p(X_i = 1)} \right). \quad (9)$$

Note, when the word does not appear in the document, only the leading term of Eq. 9 is nonzero.
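The decomposition in Eq. 9 can be checked numerically: the log-ratio for every word equals a word-absent default plus a correction applied only where $x^{\ell}_i = 1$. A small NumPy sketch with toy probabilities (the values here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words = 6

# Toy marginals for one latent factor value y_j: p(X_i = 1 | y_j) and p(X_i = 1).
p1_given_y = rng.uniform(0.05, 0.5, n_words)
p1 = rng.uniform(0.05, 0.5, n_words)
x = rng.integers(0, 2, n_words)  # binary document vector

# Dense evaluation of log p(x_i | y_j) / p(x_i) at the observed x.
p_obs_given_y = np.where(x == 1, p1_given_y, 1 - p1_given_y)
p_obs = np.where(x == 1, p1, 1 - p1)
dense = np.log(p_obs_given_y / p_obs)

# Sparse evaluation (Eq. 9): the word-absent leading term plus a correction
# that contributes only at the nonzero entries of x.
absent = np.log((1 - p1_given_y) / (1 - p1))
correction = np.log(p1_given_y * (1 - p1) / ((1 - p1_given_y) * p1))
sparse = absent + x * correction

assert np.allclose(dense, sparse)
```

The leading term is the same for every document, so it can be precomputed once; only the correction depends on which words actually occur.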
However, when the word does appear, everything but $\log p^{t}(X_i = 1 \mid y_j) / p(X_i = 1)$ cancels out. So, we have taken advantage of the fact that the CorEx topic model binarizes documents to assume by default that a word does not appear in the document, and then correct the contribution to the update if the word does appear.

Thus, when substituting back into Eq. 8, the sum becomes a matrix multiplication between a matrix with dimensions of the number of variables by the number of documents, with entries $x^{\ell}_i$, that is assumed to be sparse, and a dense matrix with dimensions of the number of variables by the number of latent factors. Given $n$ variables, $N$ samples, and $\rho$ nonzero entries in the data matrix, the asymptotic scaling for CorEx goes from $O(Nn)$ to $O(n) + O(N) + O(\rho)$ by exploiting sparsity. Latent tree modeling approaches are quadratic in $n$ or worse, so we expect CorEx's computational advantage to increase for larger datasets.

[Figure 1: Speed comparisons to a fixed number of iterations as the number of documents and words vary, for CorEx, Sparse CorEx, and LDA on disaster relief articles, New York Times articles, and PubMed abstracts. New York Times articles and PubMed abstracts were collected from the UCI Machine Learning Repository (Lichman, 2013). The disaster relief articles are described in Section 4.1, and are represented simply as bags of words, not phrases.]

2.3.2 Optimization Evaluation

We perform experiments comparing the running time of CorEx before and after implementing the improvements that exploit sparsity. We also compare with Scikit-Learn's simple batch implementation of LDA using the variational Bayes algorithm (Hoffman et al., 2013).
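The resulting sparse-dense product can be sketched with SciPy. The matrices below are random stand-ins for the document-word data and the per-word correction terms, not actual topic-model quantities:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(1)
n_docs, n_words, n_topics = 1000, 500, 10

# Sparse binary document-word matrix: most entries are zero.
X = sparse.random(n_docs, n_words, density=0.02, format="csr",
                  data_rvs=lambda k: np.ones(k))

# Dense word-by-topic matrix standing in for the bracketed log-ratio
# correction terms of Eq. 9.
W = rng.normal(size=(n_words, n_topics))

# The per-document sums in the label update reduce to one sparse-dense
# product, whose cost is O(rho) in the number of nonzero entries rather
# than O(N n) for the full dense computation.
update = X @ W
assert update.shape == (n_docs, n_topics)
assert np.allclose(update, X.toarray() @ W)
```

Keeping the document matrix in CSR format means each document contributes work only proportional to the words it actually contains, which is the source of the speedups in Figure 1.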
Experiments were performed on a four-core Intel i5 chip running at 4 GHz with 32 GB RAM. We show run time when varying the data size in terms of the number of word types and the number of documents. We used 50 topics for all runs, and set the number of iterations to 10 for LDA and 50 for CorEx. Results are shown in Figure 1. We see that CorEx exploiting sparsity is orders of magnitude faster than the naive version and is generally comparable to LDA as the number of documents scales. The slope on the log-log plot suggests a linear dependence of running time on the dataset size, as expected.

2.4 Anchor Words via the Bottleneck

The information bottleneck formulates a trade-off between compressing data $X$ into a representation $Y$, and preserving the information in $X$ that is relevant to $Z$ (typically labels in a supervised learning task) (Tishby et al., 1999; Friedman et al., 2001). More formally, the information bottleneck is expressed as

$$\max_{p(y \mid x)} \beta I(Z : Y) - I(X : Y), \quad (10)$$

where $\beta$ is a parameter controlling the trade-off between compressing $X$ and preserving information about the relevance variable, $Z$.

To see the connection with CorEx, we compare the CorEx objective as written in Eq. 6 with the bottleneck in Eq. 10. We see that we have exactly the same compression term for each latent factor, $I(X : Y_j)$, but the relevance variables now correspond to $Z \equiv X_i$. If we want to learn representations that are more relevant to specific keywords, we can simply anchor a word type $X_i$ to topic $Y_j$ by constraining our optimization so that $\alpha_{i,j} = \beta_{i,j}$, where $\beta_{i,j} \geq 1$ controls the anchor strength. Otherwise, the updates on $\alpha$ remain the same.
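A minimal NumPy sketch of the relaxed $\alpha$ update from Section 2.2 and its anchored override; the function name and indexing scheme are ours, not the reference implementation's:

```python
import numpy as np

def alpha_update(mi, lam, anchors=None):
    """Soft competition among topics for words, with optional anchors.

    mi: (n_words, n_topics) array of current estimates of I(X_i : Y_j).
    lam: softmax sharpness; larger values approach the hard arg-max constraint.
    anchors: optional dict mapping (word_index, topic_index) -> beta >= 1,
             which overrides the competition for that pair (Section 2.4).
    """
    # Each word's best topic gets alpha = 1; others decay exponentially.
    alpha = np.exp(lam * (mi - mi.max(axis=1, keepdims=True)))
    if anchors:
        for (i, j), beta in anchors.items():
            alpha[i, j] = beta
    return alpha

mi = np.array([[0.5, 0.1],
               [0.2, 0.3]])
# Unanchored: word 0 competes toward topic 0, word 1 toward topic 1.
print(alpha_update(mi, lam=10.0))
# Anchoring word 0 to topic 1 with beta = 2 overrides the softmax there.
print(alpha_update(mi, lam=10.0, anchors={(0, 1): 2.0}))
```

Setting $\beta_{i,j} > 1$ lets an anchored word count for more than any single word could under the unanchored constraint, which is how anchoring pulls a topic toward the given keywords.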
This schema is a natural extension of the CorEx optimization, and it is flexible, allowing for multiple word types to be anchored to one topic, for one word type to be anchored to multiple topics, or for any combination of these semi-supervised anchoring strategies.

3 Related Work

With respect to integrating domain knowledge into topic models, we draw inspiration from Arora et al. (2012), who used anchor words in the context of non-negative matrix factorization. Under an assumption of separability, these anchor words act as high-precision markers of particular topics and, thus, help discern the topics from one another. Although the original algorithm proposed by Arora et al. (2012), and subsequent improvements to their approach, find these anchor words automatically (Arora et al., 2013; Lee and Mimno, 2014), recent adaptations allow manual insertion of anchor words and other metadata (Nguyen et al., 2014; Nguyen et al., 2015). Our work is similar to the latter, where we treat anchor words as fuzzy logic markers and embed them into the topic model in a semi-supervised fashion. In this sense, our work is closest to Halpern et al. (2014; 2015), who have also made use of domain expertise and semi-supervised anchor words in devising topic models.

There is an adjacent line of work that has focused on incorporating word-level information into LDA-based models. Jagarlamudi et al. (2012) proposed SeededLDA, a model that seeds words into given topics and guides, but does not force, these topics towards the seeded words. Andrzejewski and Zhu (2009) presented a model that makes use of "z-labels," words that are known to pertain to specific topics and that are restricted to appearing in some subset of all the possible topics. Although z-labels can be leveraged to place different senses of a word into different topics, it requires additional effort to determine when these different senses occur.
Our anchoring approach allows a user to more easily anchor one word to multiple topics, allowing CorEx to naturally find topics that revolve around different senses of a word.

Andrzejewski et al. (2009) presented a second model that allows specification of Must-Link and Cannot-Link relationships between words, which help partition otherwise muddled topics. These logical constraints help enforce topic separability, though these mechanisms less directly address how to anchor a single word or set of words to help a topic emerge. More generally, the Must/Cannot-Link and z-label topic models have been expressed in a powerful first-order-logic framework that allows the specification of arbitrary domain knowledge through logical rules (Andrzejewski et al., 2011). Others have built off this first-order-logic approach to automatically learn rule weights (Mei et al., 2014) and incorporate additional latent variable information (Foulds et al., 2015).

Mathematically, CorEx topic models most closely resemble topic models based on latent tree reconstruction (Chen et al., 2016). In Chen et al.'s (2016) analysis, their own latent tree approach and CorEx both report significantly better perplexity than hierarchical topic models based on the hierarchical Dirichlet process and the Chinese restaurant process. CorEx has also been investigated as a way to find "surprising" documents (Hodas et al., 2015).

4 Data and Evaluation Methods

4.1 Data

We use two challenging datasets with corresponding domain knowledge lexicons to evaluate anchored CorEx. Our first dataset consists of 504,000 humanitarian assistance and disaster relief (HA/DR) articles covering 21 disaster types, collected from ReliefWeb, an HA/DR news article aggregator sponsored by the United Nations (Littell et al., 2018).
To mitigate overwhelming label imbalances during anchoring, we restrict ourselves to documents in English with one label, and randomly subsample 2,000 articles from each of the largest disaster type labels. This leaves us with a corpus of 18,943 articles.[2]

We accompany these articles with an HA/DR lexicon of approximately 34,000 words and phrases (Littell et al., 2018). The lexicon was curated by first gathering 40–60 seed terms per disaster type from HA/DR domain experts and CrisisLex. This term list was then expanded by creating word embeddings for each disaster type and taking terms within a specified cosine similarity of the seed words. These lists were then filtered by removing names, places, non-ASCII characters, and terms with fewer than three characters. Finally, the extracted terms were audited using CrowdFlower, where users rated the relevance of the terms on a Likert scale. Low-relevance terms were dropped from the lexicon. Of these terms, 11,891 types appear in the HA/DR articles.

Our second dataset consists of 1,237 deidentified clinical discharge summaries from the Informatics for Integrating Biology and the Bedside (i2b2) 2008 Obesity Challenge.[3] These summaries are labeled by clinical experts with 15 conditions frequently associated with obesity. For these documents, we leverage a text pipeline that extracts common medical terms and phrases (Dai et al., 2008; Chapman et al., 2001), which yields 3,231 such term types. For both sets of documents, we use their respective lexicons to break the documents down into bags of words and phrases.

We also make use of the 20 Newsgroups dataset, as provided and preprocessed in the Scikit-Learn library (Pedregosa et al., 2011).

[2] HA/DR articles and accompanying lexicon available at http://dx.doi.org/10.7910/DVN/TGOPRU
[3] Data available upon data use agreement at https://www.i2b2.org/NLP/Obesity/
4.2 Evaluation

CorEx does not explicitly attempt to learn a generative model and, thus, traditional measures such as perplexity are not appropriate for model comparison against LDA. Furthermore, it is well-known that perplexity and held-out log-likelihood do not necessarily correlate with human evaluation of semantic topic quality (Chang et al., 2009). Therefore, we measure semantic topic quality using Mimno et al.'s (2011) UMass automatic topic coherence score, which correlates with human judgments.

We also evaluate the models in terms of multiclass logistic regression document classification (Pedregosa et al., 2011), where the feature set of each document is its topic distribution. We perform all document classification tasks using a 60/40 training-test split.

Finally, we measure how well each topic model does at clustering documents. We obtain a clustering by assigning each document to the topic that occurs with the highest probability. We then measure the quality within clusters (homogeneity) and across clusters (adjusted mutual information). The highest possible value for both measures is one. We do not report clustering metrics on the clinical health notes because the documents are multi-label and, in that case, the metrics are not well-defined.

4.3 Choosing Anchor Words

We follow the approach used by Jagarlamudi et al. (2012) to automatically generate anchor words: for each label in a dataset, we find the words that have the highest mutual information with the label. For word $w$ and label $L$, this is computed as

$$I(L : w) = H(L) - H(L \mid w), \quad (11)$$

where for each document of label $L$ we consider whether the word $w$ appears or not.
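Eq. 11 can be computed directly from document labels and a binary word-occurrence indicator. A small sketch (our own code, not the authors'):

```python
import numpy as np

def label_word_mi(labels, word_present):
    """I(L : w) = H(L) - H(L | w) from labels and binary word occurrence (Eq. 11)."""
    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    labels = np.asarray(labels)
    word_present = np.asarray(word_present)
    classes = np.unique(labels)
    h_l = H(np.array([(labels == c).mean() for c in classes]))
    # H(L | w): average label entropy within documents where w appears / is absent.
    h_l_given_w = 0.0
    for w in (0, 1):
        mask = word_present == w
        if mask.any():
            p_l_w = np.array([(labels[mask] == c).mean() for c in classes])
            h_l_given_w += mask.mean() * H(p_l_w)
    return h_l - h_l_given_w

labels = [0, 0, 1, 1]
# A word appearing in exactly the documents of one label is maximally
# informative: I(L : w) = H(L) = log 2.
print(label_word_mi(labels, [1, 1, 0, 0]))  # ≈ 0.693
# A word independent of the label carries no information about it.
print(label_word_mi(labels, [1, 0, 1, 0]))  # ≈ 0.0
```

Ranking all words by this score and keeping the top few per label yields the candidate anchor sets used in the experiments.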
[Figure 2: Baseline comparison of CorEx to LDA with respect to topic coherence and document classification (macro and micro F1) and clustering (homogeneity) on three different datasets as the number of topics varies. Points are the average of 30 runs of a topic model. Confidence intervals are plotted but are so small that they are not distinguishable. CorEx is trained using binary data, while LDA is trained on count data. Homogeneity is not well-defined on the multi-label clinical health notes, so it is omitted.]

5 Results

5.1 LDA Baseline Comparison

We compare CorEx to LDA in terms of topic coherence, document classification, and document clustering across three datasets. CorEx is trained on binary data, while LDA is trained on count data. While not reported here, CorEx consistently outperformed LDA trained on binary data. In these comparisons, we use the Gensim implementation of LDA (Řehůřek and Sojka, 2010). The results of comparing CorEx to LDA as a function of the number of topics are presented in Figure 2.

Across all three datasets, we find that the topics produced by CorEx yield document classification results that are on par with or better than those produced by LDA topics.
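The document-classification evaluation from Section 4.2 can be sketched with scikit-learn. The topic distributions below are randomly generated stand-ins with an injected label signal, not real model output:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(2)

# Stand-in per-document topic proportions; in the paper these come from
# CorEx or LDA, and serve as the feature set for each document.
n_docs, n_topics = 400, 20
y = rng.integers(0, 4, n_docs)
topic_features = rng.dirichlet(np.ones(n_topics), size=n_docs)
topic_features[np.arange(n_docs), y] += 0.5  # inject a label signal
topic_features /= topic_features.sum(axis=1, keepdims=True)

# 60/40 training-test split, as in the paper's evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    topic_features, y, test_size=0.4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(f1_score(y_te, pred, average="macro"))
print(f1_score(y_te, pred, average="micro"))
```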
In terms of clustering, CorEx consistently produces document clusters of higher homogeneity than LDA. On the disaster relief articles, the CorEx clusters are nearly twice as homogeneous as the LDA clusters. CorEx outperforms LDA in terms of topic coherence on two out of three of the datasets.

Table 1: Examples of topics learned by the CorEx topic model. Words are ranked according to mutual information with the topic, and topics are ranked according to the amount of total correlation they explain. Topic models were run with 50 topics on the ReliefWeb and 20 Newsgroups datasets, and 30 topics on the clinical health notes.

Disaster Relief
  Rank 1: drought, farmers, harvest, crop, livestock, planting, grain, maize, rainfall, irrigation
  Rank 3: eruption, volcanic, lava, crater, eruptions, volcanos, slopes, volcanic activity, evacuated, lava flows
  Rank 8: winter, snow, snowfall, temperatures, heavy snow, heating, freezing, warm clothing, severe winter, avalanches
  Rank 23: military, armed, civilians, soldiers, aircraft, weapons, rebel, planes, bombs, military personnel

20 Newsgroups
  Rank 3: team, game, season, player, league, hockey, play, teams, nhl
  Rank 14: car, bike, cars, engine, miles, road, ride, riding, bikes, ground
  Rank 26: nasa, launch, orbit, shuttle, mission, satellite, gov, jpl, orbital, solar
  Rank 39: medical, disease, doctor, patients, treatment, medicine, health, hospital, doctors, pain

Clinical Health Notes
  Rank 12: vomiting, nausea, abdominal pain, diarrhea, fever, dehydration, chill, clostridium difficile, intravenous fluid, compazine
  Rank 19: anxiety state, insomnia, ativan, neurontin, depression, lorazepam, gabapentin, trazodone, fluoxetine, headache
  Rank 27: pain, oxycodone, tylenol, percocet, ibuprofen, morphine, osteoarthritis, hernia, motrin, bleeding
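The clustering comparison follows the procedure from Section 4.2: each document is assigned to its highest-probability topic, and the resulting partition is scored against the document labels. A minimal sketch with synthetic stand-in data:

```python
import numpy as np
from sklearn.metrics import homogeneity_score, adjusted_mutual_info_score

rng = np.random.default_rng(3)

# Stand-in per-document topic probabilities and document labels.
labels = rng.integers(0, 3, 200)
topic_probs = rng.dirichlet(np.ones(3), size=200)
topic_probs[np.arange(200), labels] += 1.0  # make clusters informative

# Cluster documents by their highest-probability topic.
clusters = topic_probs.argmax(axis=1)

print(homogeneity_score(labels, clusters))           # 1.0 is perfect
print(adjusted_mutual_info_score(labels, clusters))  # 1.0 is perfect
```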
[Figure 3: Comparison of anchored CorEx to other semi-supervised topic models (z-labels LDA and linked LDA) in terms of document clustering (homogeneity and adjusted mutual information) and topic coherence, for the disaster relief articles (21 topics) and 20 Newsgroups (20 topics). For each dataset, the number of topics is fixed to the number of document labels. Each dot is the average of 30 runs. Confidence intervals are plotted but are so small that they are not distinguishable.]

While LDA produces more coherent topics for the clinical health notes, it is particularly striking that CorEx is able to produce high-quality topics while only leveraging binary count data. Examples of these topics are shown in Table 1. Despite the binary counts limitation, CorEx still finds meaningfully coherent and competitive structure in the data.

5.2 Anchored CorEx Analysis

We now examine the effects and benefits of guiding CorEx through anchor words. In doing so, we also compare anchored CorEx to other semi-supervised topic models.

5.2.1 Anchoring for Topic Separability

We are first interested in how anchoring can be used to encourage topic separability so that documents cluster well. We focus on the HA/DR articles and 20 Newsgroups datasets, since traditional clustering metrics are not well-defined on the multi-label clinical health notes.
For both datasets, we fix the number of topics to be equal to the number of document labels. It is in this context that we compare anchored CorEx to two other semi-supervised topic models: z-labels LDA and must/cannot-link LDA.

Using the method described in Section 4.3, we automatically retrieve the top five anchors for each disaster type and newsgroup. We then filter from these lists any words that are ambiguous, i.e., words that are anchor words for more than one document label. For anchored CorEx and z-labels LDA, we simultaneously assign each set of anchor words to exactly one topic each.

Table 2: Examples of topics learned by CorEx when simultaneously anchoring many topics with anchoring parameter β = 2. Anchor words are marked with asterisks (bold in the original figure). Words are ranked according to mutual information with the topic, and topics are ranked according to the amount of total correlation they explain. Topic models were run with 21 topics on the ReliefWeb articles and 20 topics on the 20 Newsgroups dataset.

Anchored Disaster Relief
  Rank 1: *harvest*, *locus*, drought, *food crisis*, farmers, crops, crop, malnutrition, food aid, livestock
  Rank 4: *tents*, *quake*, international federation, red crescent, red cross, blankets, earthquake, *richter scale*, societies, aftershocks
  Rank 12: *climate*, *impacts*, *warming*, climate change, irrigation, consumption, household, droughts, livelihoods, interventions
  Rank 19: *storms*, *weather*, winds, coastal, *tornado*, meteorological, *tornadoes*, strong winds, tropical, roofs

Anchored 20 Newsgroups
  Rank 5: government, *congress*, *clinton*, state, national, economic, general, states, united, order
  Rank 6: *bible*, *christian*, god, jesus, christians, believe, life, *faith*, world, man
  Rank 15: use, used, high, *circuit*, power, work, voltage, need, low, end
  Rank 20: *baseball*, *pitching*, *braves*, *mets*, hitter, pitcher, cubs, dl, sox, jays
For must/cannot-link LDA, we create must-links within the words of the same anchor group, and create cannot-links between words of different anchor groups. Since we are simultaneously anchoring to many topics, we use a weak anchoring parameter β = 2 for anchored CorEx. Using the notation from their original papers, we use η = 1 for z-labels LDA, and η = 1000 for must/cannot-link LDA. For both LDA variants, we use α = 0.5, β = 0.1, take 2,000 samples, and estimate the models using code implemented by the original authors.

The results of this comparison are shown in Figure 3, and examples of anchored CorEx topics are shown in Table 2. Across all measures, CorEx and anchored CorEx outperform LDA. We find that anchored CorEx always improves cluster quality over CorEx in terms of homogeneity and adjusted mutual information. Compared to CorEx, multiple simultaneous anchoring neither harms nor benefits the topic coherence of anchored CorEx. Together these metrics suggest that anchored CorEx finds topics that are of equivalent coherence to CorEx, but more relevant to the document labels, since gains are seen in terms of document clustering.

Against the other semi-supervised topic models, anchored CorEx compares favorably. The document clustering of anchored CorEx is similar to, or better than, that of z-labels LDA and must/cannot-link LDA. Across the disaster relief articles, anchored CorEx finds less coherent topics than the two LDA variants, while it finds similarly coherent topics to must/cannot-link LDA on the 20 Newsgroups dataset.

5.2.2 Anchoring for Topic Representation

We now turn to studying how domain knowledge can be anchored to a single topic to help an otherwise dominated topic emerge, and how the anchoring parameter β affects that emergence.
To discern this effect, we focus just on anchored CorEx along with the HA/DR articles and clinical health notes, the datasets for which we have a domain expert lexicon. We devise the following experiment: first, we determine the top five anchor words for each document label using the methodology described in Section 4.3. Unlike in the previous section, we do not filter these lists of ambiguous anchor words. Second, for each document label, we run an anchored CorEx topic model with that label's anchor words anchored to exactly one topic. We compare this anchored topic model to an unsupervised CorEx topic model using the same random seeds, thus creating a matched pair where the only difference is the treatment of anchor words. Finally, this matched pairs process is repeated 30 times, yielding a distribution for each metric over each label.

Figure 4: Effect of anchoring words to a single topic for one document label at a time as a function of the anchoring parameter β. Light gray lines indicate the trajectory of the metric for a given disaster or disease label. Thick red lines indicate the pointwise average across all labels for a fixed value of β.

We use 50 topics when modeling the ReliefWeb articles and 30 topics when modeling the i2b2 clinical health notes. These values were chosen by observing diminishing returns to the total correlation explained by additional topics.
In Figure 4 we show how the results of this experiment vary as a function of the anchoring parameter β for each disaster and disease type in the two datasets. Since there is heavy variance across document labels for each metric, we also examine a more detailed cross section of these results in Figure 5, where we set β = 5 for the clinical health notes and β = 10 for the disaster relief articles. As we show momentarily, disaster and disease types that benefit the most from anchoring were underrepresented pre-anchoring. Document labels that were well-represented prior to anchoring achieve only marginal gains. This results in the variance seen in Figure 4.

Figure 5: Cross-section results of the anchoring metrics from fixing β = 5 for the clinical health notes, and β = 10 for the disaster relief articles. Disaster and disease types are sorted by frequency, with the most frequent document labels appearing at the top. Error bars indicate 95% confidence intervals. The color bars provide context for each metric: topic overlap pre-anchoring, proportion of topic model runs where the anchored topic was the most predictive topic, and F1 score pre-anchoring.

A priori, we do not know that anchoring will cause the anchor words to appear at the top of topics. So, we first measure how the topic overlap, the proportion of the top ten mutual information words that appear within the top ten words of the topics, changes before and after anchoring. From Figure 4 (row 1) we see that as β increases, more of these relevant words consistently appear within the topics.
For the disaster relief articles, many disaster types see about two more words introduced, while in the clinical health notes the overlap increases by up to four words. Analyzing the cross section in Figure 5 (column 1), we see many of these gains come from disaster and disease types that appeared less often in the topics pre-anchoring. Thus, we can sway the topic model towards less dominant themes through anchoring. Document labels that occur the most frequently are those for which the topic overlap changes the least.

Next, we examine whether these anchored topics are more coherent. To do so, we compare the coherence of the anchored topic with that of the most predictive topic pre-anchoring, i.e., the topic with the largest corresponding logistic regression coefficient in magnitude, when the anchored topic itself is most predictive. From Figure 4 (row 2), we see these results have more variance, but largely the anchored topics are more coherent. In some cases, the coherence is 1.5 to 2 times that of pre-anchoring. Furthermore, by the colors of the central panel of Figure 5, we find that the anchored topics are, indeed, often the most predictive topics for each document label. Similar to topic overlap, the labels that see the least improvement are those that appear the most and are already well-represented in the topic model.

Finally, we find that the anchored, more coherent topics can lead to modest gains in document classification. For the disaster relief articles, Figure 4 (row 3) shows mixed results in terms of F1 score improvement, with some disaster types performing consistently better and others performing consistently worse. The results are more consistent for the clinical health notes, where there is an average increase of about 0.1 in the F1 score, and some disease types see an increase of up to 0.3 in F1.
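The topic overlap measure described above fits in a few lines; `label_top_words` is assumed to hold a label's top-k mutual information words and `topics` the ranked word lists of each learned topic (both names are illustrative):

```python
def topic_overlap(label_top_words, topics, k=10):
    """Proportion of a label's top-k words (ranked by mutual information
    with the label) that appear among the top-k words of any topic."""
    in_topics = set()
    for topic in topics:
        in_topics.update(topic[:k])  # pool the top-k words of every topic
    top = label_top_words[:k]
    return sum(1 for w in top if w in in_topics) / len(top)
```

Computing this once before and once after anchoring, and differencing, gives the quantity plotted in row 1 of Figure 4.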
Given that we are only anchoring five words to the topic model, these are significant gains in predictive power.

Unlike the gains in topic overlap and coherence, the F1 score increases do not simply correlate with which document labels appeared most frequently. For example, we see in Figure 5 (column 3) that Tropical Cyclone exhibits the largest increase in predictive performance, even though it is also one of the most frequently appearing document labels. Similarly, some of the major gains in F1 for the disease types, and major losses in F1 for the disaster types, do not come from the most or least frequent document labels. Thus, if anchoring single topics within CorEx for document classification, it is important to examine how the anchoring affects prediction for individual document labels.

5.2.3 Anchoring for Topic Aspects

Finding topics that revolve around a word, such as a name or location, or a group of words can aid in understanding how a particular subject or event has been framed. We finish with a qualitative experiment where we disambiguate aspects of a topic by anchoring a set of words to multiple topics within the CorEx topic model. Note, must/cannot link LDA cannot be used in this manner, and z-labels LDA would require us to know these aspects beforehand.

We consider tweets containing #Ferguson (case-insensitive), which detail reactions to the shooting of Black teenager Michael Brown by White police officer Darren Wilson on August 9th, 2014 in Ferguson, Missouri. These tweets were collected from the Twitter Gardenhose, a 10% random sample of all tweets, over the period August 9th, 2014 to November 30th, 2014. Since CorEx will seek maximally informative topics by exploiting redundancies, we remove duplicates of retweets, leaving us with 869,091 tweets. We filter these tweets of punctuation, stop words, hyperlinks, usernames, and the 'RT' retweet symbol, and use the top 20,000 word types.
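A minimal sketch of this filtering pipeline, with an abbreviated stop word list and illustrative helper names rather than the paper's exact preprocessing:

```python
import re
from collections import Counter

# A small illustrative stop word list; a real pipeline would use a full one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "rt"}

def preprocess(tweets, n_types=20000):
    """Strip hyperlinks, @usernames, the 'RT' marker, punctuation, and stop
    words, then restrict the corpus to the n_types most frequent word types."""
    cleaned = []
    for tweet in tweets:
        text = re.sub(r"https?://\S+", " ", tweet.lower())  # hyperlinks
        text = re.sub(r"@\w+", " ", text)                   # usernames
        tokens = [t for t in re.findall(r"#?\w+", text)     # keep hashtags
                  if t not in STOPWORDS]
        cleaned.append(tokens)
    counts = Counter(t for doc in cleaned for t in doc)
    vocab = {w for w, _ in counts.most_common(n_types)}
    return [[t for t in doc if t in vocab] for doc in cleaned]
```

The `#?\w+` pattern deliberately keeps hashtags such as #Ferguson as single tokens while discarding other punctuation.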
In the wake of both the shooting and the eventual non-indictment of Darren Wilson, several protests occurred. Some onlookers supported and encouraged such protests, while others characterized the protests as violent "riots." To disambiguate these different depictions, we train a CorEx topic model with 55 topics, anchoring "protest" and "protests" together to five topics, and "riot" and "riots" together to five topics with β = 2. These anchored topics are presented in Table 3. The anchored topics reflect different aspects of the framing of the "protests" and "riots," and are generally interpretable, despite the typical difficulty of extracting coherent topics from short documents using LDA (Tang et al., 2014).

Topic Aspects of "protest"
1   protest, protests, peaceful, violent, continue, night, island, photos, staten, nights
2   protest, protests, #hiphopmoves, #cole, hiphop, nationwide, moves, fo, anheuser, boeing
3   protest, protests, st, louis, guard, national, county, patrol, highway, city
4   protest, protests, paddy, covering, beverly, walmart, wagon, hills, passionately, including
5   protest, protests, solidarity, march, square, rally, #oakland, downtown, nyc, #nyc

Topic Aspects of "riot"
6   riot, riots, unheard, language, inciting, accidentally, jokingly, watts, waving, dies
7   riot, black, riots, white, #tcot, blacks, men, whites, race, #pjnet
8   riot, riots, looks, like, sounds, acting, act, animals, looked, treated
9   riot, riots, store, looting, businesses, burning, fire, looted, stores, business
10  gas, riot, tear, riots, gear, rubber, bullets, military, molotov, armored

Table 3: Topic aspects around "protest" and "riot" from running a CorEx topic model with 55 topics and anchoring "protest" and "protests" together to five topics and "riot" and "riots" together to five topics with β = 2. Anchor words are shown in bold. Note, topics are not ordered by total correlation.
The "protest" topic aspects describe protests in St. Louis, Oakland, Beverly Hills, and parts of New York City (topics 1, 3, 4, 5), resistance by law enforcement (topics 3 and 4), and discussion of whether the protests were peaceful (topic 1). Topic 2 revolves around hip-hop artists who marched in solidarity with protesters. The "riot" topic aspects discuss racial dynamics of the protests (topic 7) and suggest the demonstrations are dangerous (topics 8 and 9). Topic 10 describes the "riot" gear used in the militarized response to the Ferguson protesters, and topic 7 also hints at aspects of conservatism through the hashtags #tcot (Top Conservatives on Twitter) and #pjnet (Patriot Journalist Network).

As we see, anchored CorEx finds several interesting, non-trivial aspects around "protest" and "riot" that could spark additional qualitative investigation. Retrieving topic aspects through anchor words in this manner allows the user to explore different frames of complex issues, events, or discussions within documents. As with the other anchoring strategies, this has the potential to supplement qualitative research done by researchers within the social sciences and digital humanities.

6 Discussion

We have introduced an information-theoretic topic model, CorEx, that does not rely on any of the generative assumptions of LDA-based topic models. This topic model seeks maximally informative topics as encoded by their total correlation. We also derived a flexible method for anchoring word-level domain knowledge in the CorEx topic model through the information bottleneck. Anchored CorEx guides the topic model towards themes that do not naturally emerge, and often produces more coherent and predictive topics. Both CorEx and anchored CorEx consistently produce topics that are of comparable quality to LDA-based methods, despite only making use of binarized word counts.
Anchored CorEx is more flexible than previous attempts at integrating word-level information into topic models. Topic separability can be enforced by lightly anchoring disjoint groups of words to separate topics, topic representation can be promoted by assertively anchoring a group of words to a single topic, and topic aspects can be unveiled by anchoring a single group of words to multiple topics. The flexibility of anchoring through the information bottleneck lends itself to many other possible creative anchoring strategies that could guide the topic model in different ways. Different goals may call for different anchoring strategies, and domain experts can shape these strategies to their needs.

While we have demonstrated several advantages of the CorEx topic model over LDA, it does have some technical shortcomings. Most notably, CorEx relies on binary count data in its sparsity optimization, rather than the standard count data that is used as input to LDA and other topic models. While we have demonstrated that CorEx performs at the level of LDA despite this limitation, its effect would be more noticeable on longer documents. This can be partly overcome if one chunks such longer documents into shorter subdocuments prior to running the topic model. Our implementation also requires that each word appear in only one topic. These limitations are not fundamental limitations of the theory, but a matter of computational efficiency. In future work, we hope to remove these restrictions while preserving the speed of the sparse CorEx topic modeling algorithm.

As we have demonstrated, the information-theoretic approach provided via CorEx has rich potential for finding meaningful structure in documents, particularly in a way that can help domain experts guide topic models with minimal intervention to capture otherwise eclipsed themes.
The lightweight and versatile framework of anchored CorEx leaves open possibilities for theoretical extensions and novel applications within the realm of topic modeling.

Acknowledgments

We would like to thank the Machine Intelligence and Data Science (MINDS) research group at the Information Sciences Institute for their help and insight during the course of this research. We also thank the Vermont Advanced Computing Core (VACC) for its computational resources. We acknowledge the construction of the HA/DR corpus and lexicon by Leidos Corp. under funding from the Defense Advanced Research Projects Agency (DARPA) Information Innovation Office (I2O), program: Low Resource Languages for Emergent Incidents (LORELEI), Contract No. HR0011-15-C-0114. Finally, we thank the anonymous reviewers and the TACL action editors Diane McCarthy and Kristina Toutanova for their time and effort in helping us improve our work. Ryan J. Gallagher was a visiting research assistant at the Information Sciences Institute while performing this research. Ryan J. Gallagher and Greg Ver Steeg were supported by DARPA award HR0011-15-C-0115, and David Kale was supported by the Alfred E. Mann Innovation in Engineering Doctoral Fellowship.

References

David Andrzejewski and Xiaojin Zhu. 2009. Latent Dirichlet Allocation with topic-in-set knowledge. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing, pages 43–48. Association for Computational Linguistics.

David Andrzejewski, Xiaojin Zhu, and Mark Craven. 2009. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 25–32.

David Andrzejewski, Xiaojin Zhu, Mark Craven, and Benjamin Recht. 2011. A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic.
In Proceedings of the International Joint Conference on Artificial Intelligence, volume 22, page 1171.

Sanjeev Arora, Rong Ge, and Ankur Moitra. 2012. Learning topic models – going beyond SVD. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS), pages 1–10. IEEE.

Sanjeev Arora, Rong Ge, Yonatan Halpern, David M. Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the International Conference on Machine Learning, pages 280–288.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Wray Buntine and Aleks Jakulin. 2006. Discrete component analysis. In Subspace, Latent Structure and Feature Selection, pages 1–33. Springer.

Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems, pages 288–296.

Wendy W. Chapman, Will Bridewell, Paul Hanbury, Gregory F. Cooper, and Bruce G. Buchanan. 2001. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34(5):301–310.

Peixian Chen, Nevin L. Zhang, Leonard K. M. Poon, and Zhourong Chen. 2016. Progressive EM for latent tree models and hierarchical topic detection. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 1498–1504.

Manhong Dai, Nigam H. Shah, Wei Xuan, Mark A. Musen, Stanley J. Watson, Brian D. Athey, Fan Meng, et al. 2008. An efficient solution for mapping free text to ontology terms. AMIA Summit on Translational Bioinformatics, 21.

Chris Ding, Tao Li, and Wei Peng. 2008. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing.
Computational Statistics &amp; Data Analysis, 52(8):3913–3927.

James Foulds, Shachi Kumar, and Lise Getoor. 2015. Latent topic networks: A versatile probabilistic programming framework for topic models. In Proceedings of the International Conference on Machine Learning, pages 777–786.

Nir Friedman, Ori Mosenzon, Noam Slonim, and Naftali Tishby. 2001. Multivariate information bottleneck. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 152–161.

Thomas L. Griffiths, Michael I. Jordan, Joshua B. Tenenbaum, and David M. Blei. 2004. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems, pages 17–24.

Yoni Halpern, Youngduck Choi, Steven Horng, and David Sontag. 2014. Using anchors to estimate clinical state without labeled data. In AMIA Annual Symposium Proceedings. American Medical Informatics Association.

Yoni Halpern, Steven Horng, and David Sontag. 2015. Anchored discrete factor analysis. arXiv preprint arXiv:1511.03299.

Nathan Hodas, Greg Ver Steeg, Joshua Harrison, Satish Chikkagoudar, Eric Bell, and Courtney Corley. 2015. Disentangling the lexicons of disaster response in Twitter. In The 3rd International Workshop on Social Web for Disaster Management (SWDM'15).

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347.

Thomas Hofmann. 1999. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 289–296.

Jagadeesh Jagarlamudi, Hal Daumé III, and Raghavendra Udupa. 2012. Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 204–213. Association for Computational Linguistics.

Moontae Lee and David Mimno. 2014.
Low-dimensional embeddings for interpretable anchor-based topic inference. In Proceedings of Empirical Methods in Natural Language Processing, pages 1319–1328.

Moshe Lichman. 2013. UC Irvine Machine Learning Repository.

Patrick Littell, Tian Tian, Ruochen Xu, Zaid Sheikh, David Mortensen, Lori Levin, Francis Tyers, Hiroaki Hayashi, Graham Horwood, Steve Sloto, Emily Tagtow, Alan Black, Yiming Yang, Teruko Mitamura, and Eduard Hovy. 2018. The ARIEL-CMU situation frame detection pipeline for LoReHLT16: a model translation approach. Machine Translation, 32(1):105–126, Jun.

Jon D. McAuliffe and David M. Blei. 2008. Supervised topic models. In Advances in Neural Information Processing Systems, pages 121–128.

Shike Mei, Jun Zhu, and Jerry Zhu. 2014. Robust RegBayes: Selectively incorporating first-order logic domain knowledge into Bayesian models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 253–261.

David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 262–272. Association for Computational Linguistics.

Thang Nguyen, Yuening Hu, and Jordan L. Boyd-Graber. 2014. Anchors regularized: Adding robustness and extensibility to scalable topic-modeling algorithms. In Proceedings of the Association for Computational Linguistics, pages 359–369.

Thang Nguyen, Jordan Boyd-Graber, Jeffrey Lund, Kevin Seppi, and Eric Ringger. 2015. Is your anchor going up or down? Fast and accurate supervised topic models. In Proceedings of the North American Chapter of the Association for Computational Linguistics.
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modeling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50.

Kyle Reing, David C. Kale, Greg Ver Steeg, and Aram Galstyan. 2016. Toward interpretable topic discovery via anchored correlation explanation. ICML Workshop on Human Interpretability in Machine Learning.

Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494.

Jian Tang, Zhaoshi Meng, Xuanlong Nguyen, Qiaozhu Mei, and Ming Zhang. 2014. Understanding the limiting factors of topic modeling via posterior contraction analysis. In Proceedings of the International Conference on Machine Learning, pages 190–198.

Naftali Tishby, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377.

Greg Ver Steeg and Aram Galstyan. 2014. Discovering structure in high-dimensional data through correlation explanation. In Advances in Neural Information Processing Systems, pages 577–585.

Greg Ver Steeg and Aram Galstyan. 2015. Maximally informative hierarchical representations of high-dimensional data. In Artificial Intelligence and Statistics, pages 1004–1012.
A Supplemental Material: Anchor Words and Topic Examples

Disease Type: Anchor Words
Asthma: asthma, albuterol, wheeze, advair, fluticasone
Coronary Artery Disease: coronary artery disease, aspirin, myocardial inarction, plavix
Congestive Heart Failure: congestive heart failure, lasix, diuresis, heart failure, cardiomyopathy
Depression: depression, prozac, celexa, seroquel, remeron
Diabetes: diabetes mellitus, diabetes, nph insulin, insulin, metformin
Gastroesophageal Reflux Disease: gastroesophageal refulx, no known drug allergy, protonix, not:, reflux
Gallstones: gallstone, cholecystitis, cholelithiasis, abdominal pain, vomiting
Gout: gout, allopurinol, colchicine, renal insufficiency, torsemide
Hypercholesterolemia: hypercholesterolemia, hyperlipidemia, aspirin, lipitor, dyslipidemia
Hypertension: hypertension, lisinopril, aspirin, diabetes mellitus, atorvastatin
Hypertriglyceridemia: hypertriglyceridemia, gemfibrozil, citrate, orphenadrine, hydroxymethylglutaryl coa reductase inhibitors
Osteoarthritis: osteoarthritis, degenerative joint disease, arthritis, naproxen, fibromyalgia
Obstructive Sleep Apnea: sleep apnea, obstructive sleep apnea, morbid obese, obesity, ipratropium
Obesity: obesity, morbid obesity, obese, sleep apnea, coronary artery disease
Peripheral Vascular Disease: cellulitis, erythema, ulcer, swelling, word finding difficulty

Table A1: Words that have the highest mutual information with each disease type.
Disaster Type: Anchor Words
Cold Wave: winter, snow, cold, temperatures, heavy snow
Drought: drought, taliban, wheat, refugees, severe drought
Earthquake: earthquake, quake, richter scale, tents, injured
Epidemic: virus, ebola outbreak, transmission, ebola virus, disaster
Extratropical Cyclone: typhoon, storm, farmland, houses, storm coincided
Fire: fire, hospitals, blaze, water crisis, firefighters
Flash Flood: flood, floods, flash floods, monitoring stations, muhuri
Flood: floods, flood, flooding, flood victims, rains
Heat Wave: heat, temperatures, heat wave, heatstroke, sunstroke
Insect Infestation: locust, food crisis, infestations, millet, harvest
Land Slide: landslides, houses, hunza river, search, village
Mud Slide: mudslides, rains, mudslide, torrential rains, houses
Other: climate, ocean, drought, impacts, warming
Severe Local Storm: tornado, storm, tornadoes, houses, storms
Snow Avalanche: avalanches, avalanche, snow, snowfall, an avalanche
Storm Surge: king tides, tropical storm, ocean, cyclone season, flooded
Technological Disaster: environmental, toxic waste, pollution, tanker, sludge
Tropical Cyclone: hurricane, cyclone, storm, tropical storm, national hurricane
Tsunami: earthquake, disaster, tsunamis, wave, rains
Volcano: eruption, lava, volcanic, crater, eruptions
Wild Fire: fires, fire, forest fires, firefighters, burning

Table A2: Words that have the highest mutual information with each disaster type.
[Figure: a tree of topics, each node labeled by its top three words.]

Figure A1: Hierarchical CorEx topic model of the disaster relief articles. Edge widths are proportional to the mutual information with the latent representation.
Rank  Topic
1   drought, farmers, harvest, crop, livestock, planting, grain, maize, rainfall, irrigation
2   floods, flooding, flood, rains, flooded, landslides, inundated, rivers, submerged, flash floods
3   eruption, volcanic, lava, crater, eruptions, volcanos, slopes, volcanic activity, evacuated, lava flows
4   storm, winds, coast, hurricane, weather, tropical storm, national hurricane, coastal, storms, meteorological
5   virus, ebola outbreak, transmission, health workers, vaccination, ebola virus, suspected cases, fluids, ebola virus disease, ebola patients
6   malnutrition, refugees, food aid, nutrition, feeding, refugees in, hunger, nutritional, refugee, food crisis
7   international federation, red cross, red crescent, societies, volunteers, disaster relief emergency, national societies, disaster preparedness, information bulletin, relief operation
8   winter, snow, snowfall, temperatures, heavy snow, heating, freezing, warm clothing, severe winter, avalanches
9   support, assistance, appeal, funds, assist, contributions, fund, cash, contribution, organizations
10  taliban, repatriation, elections, militia, convoy, ruling, talibans, islamic, convoys, vote
11  ngos, donors, humanitarian, un agencies, mission, funding, unicef, conduct, humanitarian assistance, inter-agency
12  fires, fire, forest fires, burning, firefighters, wildfires, blaze, flames, fire fighting, forests
13  earthquake, quake, richter scale, aftershocks, earthquakes, magnitude earthquake, magnitude, devastating earthquake, an earthquake, earthquake struck
14  blankets, tents, families, clothing, utensils, plastic sheeting, clothes, tarpaulins, schools, shelters
15  rescue, search, injured, helicopters, death toll, rescue operations, rescue teams, police, rescuers, stranded
16  crops, cereal, cereals, millet, food shortages, sorghum, harvests, shortage, ration, rainy
17  medical, patients, hospital, hospitals, nurses, clinics, clinic, doctor, medical team, beds
18  water, water supply, drinking water, pumps, drinking, water supplies, potable water, water distribution, installed, constructed
19  locust, attacks, fighting, infestations, pesticides, opposition, attack, reform, dialogue, governance
20  environmental, pollution, contamination, fish, impacts, water quality, polluted, pollutants, chemicals, tanker
21  malaria, diarrhoea, diseases, oral, rehydration, salts, contaminated, epidemics, borne diseases, respiratory infections, clean
22  emergency, emergencies, ocha, disaster response, coordinating, emergency response, coordinated, coordinators, transportation, rapid assessment
23  military, armed, civilians, soldiers, aircraft, weapons, rebel, planes, bombs, military personnel
24  united nations, humanitarian affairs, agencies, agency, governmental, united nations childrens fund, relief coordinator, general assembly, international cooperation, donor community
25  transport, flights, trucks, airport, transported, flight, truck, airlift, cargo, route

Table A3: Topics 1–25 resulting from the best of 10 CorEx topic models run on the disaster relief articles. Topics are ranked by total correlation explained.
Rank T opic 26 basin, monitoring stations, basins, muhuri, flood forecasting, significant rainfall, moderate rainfall, upstream, light, sludge 27 criminal, detained, parliament, protest, crime, protests, protesters, suspects, firing, incident 28 public health, org anization, ministry of, ef forts, outbreaks, building, leaders, ci vil society , minister of, facility 29 housing, reconstruction, construction, repair , rebuilding, repairs, temporary housing, corrugated, permanent housing, debris remov al 30 houses, killed, village, were killed, buildings, swept, debris, roofs, roof, collapse 31 training, partners, protection, interventions, deli very establishment, violence, benefit, unfpa, pilt 32 sanitation, provision, safe, drinking w ater , latrine, hygiene education, implementing partners, diarrhoeal diseases, rehabilitated, dispaced persons, sanitation services 33 flour , wheat, sugar , vegetable, beans, rations, food rations, bread, lentils, needy 34 camps, li ving, army , troops, resettlement, relocated, relocation, relocate, flee, settlement 35 disaster , disasters, disaster relief, cyclone, coordinating council, cyclones, aftermath, de v astation, de vastated, natural disaster 36 relief, relief supplies, relief ef forts, relief operations, relief assistance, relief goods, relief materials, relief agencies, donate, providing relief 37 household, procurement, vulnerable groups, beneficiary , pipeline, rehabilitate, local ngos, iodised salt, rainfed areas, water harv esting 38 staf f, supplies, personnel, deployed, staf f members, airlifted specialists, flown, logistical support, airlifting 39 facilities, soap, medical supplies, clean water , sanitation facilities, emergenc y medical, international org anization, psychosocial, tent, migration iom 40 fuel, supply , diesel energy , nitrate, diesel fuel, orphanages, grid, hydroelectric, storage, facilities 41 cold, cold weather , wav e, warm clothes, extreme temperatures, fire wood, sev ere cold weather , sev ere 
cold wav e, average temperature 42 cholera outbreak, cholera epidemic, poor sanitation, cholera outbreaks, wash, poor hygiene, dirty water , disinfect, hygiene aw areness, good hygiene practices 43 gov ernment, governments, prime minister , administration, national disaster management, corporation, dollars, bilateral donors, disburse, telecom 44 famine, se vere drought, crises, prolonged drought, de v astating, mortality rate, degradation, catastrophic, famine relief, agricultural practices 45 ve getation, ecological, threat, mosquitoes, insect, insecticides, lakes, prolonged, habitation, adverse weather 46 latrines, water tanks, water containers, af fected communities, chlorine tablets, household kits, solid waste, reception centre, local or ganisations, piped water 47 survi vors, relief ef fort, relief workers, survi vor , clean drinking water , outlying areas, de v astating cyclone, c yclone struck, cyclone survi vors, medic 48 perished, water storage, caused e xtensi ve damage, soil erosion, total loss, se wage systems, salt water , soup, water purifying tablets, electric po wer 49 canal, disruption, rehabilitating, infrastructures, vulnerable areas, uninterrupted, po wer plants, stagnant, inaccessible areas, distress 50 voluntary , basic needs, rehabilitation phase, blankets mattresses, raised, freight, humanitarian org anizations, gov ernment agency , delta region, persons displaced T able A4: T opics 26–50 resulting from the best of 10 CorEx topic models run on the disaster relief articles. T opics are ranked by total correlation explained. 
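The table captions state that topics are ranked by the total correlation (TC) they explain. As an illustrative sketch only (not the paper's implementation, which optimizes a lower bound on TC over latent factors), the total correlation among a topic's binary word-occurrence variables can be estimated with a simple plug-in entropy estimate, TC(X) = Σᵢ H(Xᵢ) − H(X); strongly co-occurring words yield high TC, independent words yield TC near zero:

```python
import numpy as np
from collections import Counter

def entropy(samples):
    """Plug-in entropy (in nats) of a discrete sample."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def total_correlation(X):
    """TC(X) = sum_i H(X_i) - H(X) for a binary document-word matrix X
    restricted to one candidate topic's columns (plug-in estimate)."""
    marginal = sum(entropy(tuple(col)) for col in X.T)
    joint = entropy([tuple(row) for row in X])
    return marginal - joint

# Toy data: 6 documents, 3 words. Words 0 and 1 always co-occur,
# while word 2 is independent of word 0.
X = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 0, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]])
tc_01 = total_correlation(X[:, [0, 1]])  # perfectly correlated pair
tc_02 = total_correlation(X[:, [0, 2]])  # independent pair, TC ~ 0
```

Here `tc_01` recovers H of the shared word pattern while `tc_02` is essentially zero, which is why grouping correlated words into a topic "explains" correlation in the CorEx sense.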
[Figure graphic not reproduced.] Figure A2: Hierarchical CorEx topic model of the clinical health notes. Edge widths are proportional to the mutual information with the latent representation.
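Figure A2's edge widths are proportional to the mutual information between each word (or lower-level topic) and the latent representation above it. As a hedged, self-contained sketch of that quantity (a plug-in estimate on toy data, not the paper's code), mutual information between a binary word indicator and a binary topic label can be computed directly from joint frequencies:

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)
        for yv in np.unique(y):
            py = np.mean(y == yv)
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

# Toy check: a word that fires exactly with the topic label carries
# H(word) = ln 2 nats; a word independent of the label carries none.
topic          = [1, 1, 1, 0, 0, 0]
word_on_topic  = [1, 1, 1, 0, 0, 0]  # fires exactly with the topic
word_off_topic = [1, 0, 1, 1, 0, 1]  # independent of the topic
mi_on = mutual_information(topic, word_on_topic)
mi_off = mutual_information(topic, word_off_topic)
```

In a rendering of Figure A2, the edge for `word_on_topic` would be drawn thick and the edge for `word_off_topic` would vanish.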
1. use, drug, complication, allergy, sodium, infection, furosemide, docusate, shortness of breath, potassium chloride
2. vancomycin, communicable disease, flagyl, levofloxacin, diabetes, renal failure, sepsis, ceftazidime, nutrition, gentamicin
3. aspirin, plavix, lipitor, toprol xl, lantus, hydroxymethylglutaryl coa reductase inhibitors, atorvastatin, nexium, novolog, disease
4. diuresis, congestive heart failure, lasix, edema, orthopnea, crackle, heart failure, dyspnea on exertion, oxygen, torsemide
5. albuterol, wheeze, atrovent, chronic obstructive pulmonary disease, asthma, flovent, ipratropium, fluticasone, advair, combivent
6. end stage renal disease, nephrocaps, phoslo, calcitriol, cellcept, kidney transplant, arteriovenous fistula, acetate, cyclosporine, neoral
7. nitroglycerin, chest pain, coronary artery disease, hypokinesia, st depression, lesion, unstable angina, akinesia, st elevation, diaphoresis
8. respiratory failure, prednisone, imuran, immunosuppression, necrosis, cyclosporin, sick, magnesium oxide, tachypnea, arteriovenous malformation
9. elixir, roxicet, schizophrenia, risperdal, zofran, crushed, valproic acid, promethazine, phenergan, prochlorperazine
10. leukocyte esterase, yeast, fluconazole, urosepsis, dysphagia, oxycontin, lidoderm, chemotherapy, adriamycin, medical problems
11. colace, constipation, senna, lactulose, dulcolax, milk of magnesia, sennoside, dilaudid, protonix, reglan
12. vomiting, nausea, abdominal pain, diarrhea, fever, dehydration, chill, clostridium difficile, intravenous fluid, compazine
13. coumadin, atrial fibrillation, anticoagulant, warfarin, k vitamin, amiodarone, atrial flutter, flutter, deep venous thrombosis, allopurinol
14. digoxin, cardiomyopathy, aldactone, spironolactone, carvedilol, dobutamine, alcohol, idiopathic cardiomyopathy, ventricular rate, addiction
15. clindamycin, imodium, pulmonary disease, erythromycin, defervesced, sweating, carafate, quinidine, cytomegalovirus, cepacol
16. lopressor, stenosis, hypertension, heparin, hypercholesterolemia, aortic valve insufficiency, mitral valve insufficiency, aortic valve stenosis, sinus rhythm, peripheral vascular disease
17. antibiotic, miconazole, wound, nitrate, morbid obese, fentanyl, sleep apnea, obesity, abscess, ampicillin
18. erythema, cellulitis, linezolid, swelling, erythematous, osteomyelitis, ancef, keflex, dicloxacillin, bacitracin
19. anxiety state, insomnia, ativan, neurontin, depression, lorazepam, gabapentin, trazodone, fluoxetine, headache
20. multivitamin, folate, magnesium, folic acid, mvi, maalox, thiamine, vitamin c, gluconate, dyspepsia

Table A5: Topics 1–20 resulting from the best of 10 CorEx topic models run on the clinical health notes. Topics are ranked by total correlation explained.

21. decreased breath sound, stroke, tachycardia, seizure disorder, lymphocyte, atelectasis, polymorphonuclear leukocytes, ecchymosis, seizure, cefotaxime
22. not:, pulmonary edema, captopril, pleural effusion, rales, beta blocker, fatigue, dead, q wave, dysfunction
23. hypothyroidism, synthroid, levothyroxine, levoxyl, diovan, valsartan, angioedema, bestrophinopathy, atherosclerosis, ursodiol
24. nph insulin, insulin, insulin dependent diabetes mellitus, anemia, humulin insulin, retinopathy, hyperglycemia, humulin, gastrointestinal bleeding, nephropathy
25. tricuspid valve regurgitation, mitral valve regurgitation, mitral regurgitation, left atrial enlargement, zaroxolyn, ectopy, right atrial enlargement, metolazone, deficit, regurgitant
26. prilosec, omeprazole, lovenox, pulmonary embolism, enoxaparin, xalatan, oxybutynin, helicobacter pylori, flonase, ramipril
27. pain, oxycodone, tylenol, percocet, ibuprofen, morphine, osteoarthritis, hernia, motrin, bleeding
28. left ventricular hypertrophy, dyspnea, living alone, smokes, syndrome, hives, palpitation, elderly, left axis deviation, usual state of health
29. myocardial infarction, angina, chest pressure, patent ductus arteriosus, atenolol, micronase, adenosine, non-insulin dependent diabetes mellitus, ecotrin, caltrate
30. no known drug allergy, axid, procardia xl, vasotec, obese, mevacor, tissue plasminogen activator, middle-aged, nifedipine, procardia

Table A6: Topics 21–30 resulting from the best of 10 CorEx topic models run on the clinical health notes. Topics are ranked by total correlation explained.
