Latent Tree Models for Hierarchical Topic Detection

Peixian Chen (a), Nevin L. Zhang (a,*), Tengfei Liu (b), Leonard K. M. Poon (c), Zhourong Chen (a), Farhan Khawar (a)

(a) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong
(b) Ant Financial Services Group, Shanghai
(c) Department of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong

Abstract

We present a novel method for hierarchical topic detection where topics are obtained by clustering documents in multiple ways. Specifically, we model document collections using a class of graphical models called hierarchical latent tree models (HLTMs). The variables at the bottom level of an HLTM are observed binary variables that represent the presence/absence of words in a document. The variables at other levels are binary latent variables, with those at the lowest latent level representing word co-occurrence patterns and those at higher levels representing co-occurrence of patterns at the level below. Each latent variable gives a soft partition of the documents, and document clusters in the partitions are interpreted as topics. Latent variables at high levels of the hierarchy capture long-range word co-occurrence patterns and hence give thematically more general topics, while those at low levels of the hierarchy capture short-range word co-occurrence patterns and give thematically more specific topics. Unlike LDA-based topic models, HLTMs do not refer to a document generation process and use word variables instead of token variables. They use a tree structure to model the relationships between topics and words, which is conducive to the discovery of meaningful topics and topic hierarchies.

* Corresponding author. Email address: lzhang@cse.ust.hk (Nevin L. Zhang)

Preprint submitted to Elsevier, December 22, 2016
1. Introduction

The objective of hierarchical topic detection (HTD) is, given a corpus of documents, to obtain a tree of topics with more general topics at high levels of the tree and more specific topics at low levels. It has a wide range of potential applications. For example, a topic hierarchy for posts at an online forum can provide an overview of the variety of the posts and guide readers quickly to the posts of interest. A topic hierarchy for the reviews and feedback on a business or product can help a company gauge customer sentiment and identify areas for improvement. A topic hierarchy for recent papers published at a conference or journal can give readers a global picture of recent trends in the field. A topic hierarchy for all the articles retrieved from PubMed on an area of medical research can help researchers get an overview of past studies in the area. In applications such as these, the problem is not about search, because the user does not know what to search for. Rather, the problem is about summarization of thematic content and topic-guided browsing.

Several HTD methods have been proposed previously, including the nested Chinese restaurant process (nCRP) [1, 2], the Pachinko allocation model (PAM) [3, 4], and the nested hierarchical Dirichlet process (nHDP) [5]. Those methods are extensions of latent Dirichlet allocation (LDA) [6]. Hence we refer to them collectively as LDA-based methods.

In this paper, we present a novel HTD method called hierarchical latent tree analysis (HLTA). Like the LDA-based methods, HLTA is a probabilistic method and it involves latent variables. However, there are fundamental differences.

The first difference lies in what is being modeled and the semantics of the latent variables. The LDA-based methods model the process by which documents are generated.
The latent variables in the models are constructs in the hypothetical generation process, including a list of topics (usually denoted as β), a topic distribution vector for each document (usually denoted as θ_d), and a topic assignment for each token in each document (usually denoted as Z_{d,n}). In contrast, HLTA models a collection of documents without referring to a document generation process. The latent variables in the model are considered unobserved attributes of the documents. If we compare whether words occur in particular documents to whether students do well in various subjects, then the latent variables correspond to latent traits such as analytical skill, literacy skill and general intelligence.

The second difference lies in the types of observed variables used in the models. Observed variables in the LDA-based methods are token variables (usually denoted as W_{d,n}). Each token variable stands for a location in a document, and its possible values are the words in a vocabulary. Here one cannot talk about conditional independence between words because the probabilities of all words must sum to 1. In contrast, each observed variable in HLTA stands for a word. It is a binary variable and represents the presence/absence of the word in a document. The output of HLTA is a tree-structured graphical model, where the word variables are at the leaves and the latent variables are at the internal nodes. Two word variables are conditionally independent given any latent variable on the path between them. Words that frequently co-occur in documents tend to be located in the same "region" of the tree. This fact is conducive to the discovery of meaningful topics and topic hierarchies. A drawback of using binary word variables is that word counts cannot be taken into consideration.

The third difference lies in the definition and characterization of topics. Topics in the LDA-based methods are probability distributions over a vocabulary.
When presented to users, a topic is characterized using a few words with the highest probabilities. In contrast, topics in HLTA are clusters of documents. More specifically, all latent variables in HLTA are assumed to be binary. Just as the concept "analytical skill" partitions a student population into two soft clusters, those with high analytical skill in one cluster and those with low analytical skill in another, a latent variable in HLTA partitions a document collection into two soft clusters of documents. The document clusters are interpreted as topics. For presentation to users, a topic is characterized using the words that not only occur with high probabilities in the topic but also occur with low probabilities outside the topic. The consideration of occurrence probabilities outside the topic is important because a word that occurs with high probability in the topic might also occur with high probability outside the topic. When that happens, it is not a good choice for the characterization of the topic.

There are other differences that are more technical in nature, and their explanations are hence postponed to Section 4.

The rest of the paper is organized as follows. We discuss related work in Section 2 and review the basics of latent tree models in Section 3. In Section 4, we introduce hierarchical latent tree models (HLTMs) and explain how they can be used for hierarchical topic detection. The HLTA algorithm for learning HLTMs is described in Sections 5-7. In Section 8, we present the results HLTA obtains on a real-world dataset and discuss some practical issues. In Section 9, we empirically compare HLTA with the LDA-based methods. Finally, we end the paper in Section 10 with some concluding remarks and discussions of future work.

2. Related Work

Topic detection has been one of the most active research areas in machine learning in the past decade. The most commonly used method is latent Dirichlet allocation (LDA) [6].
LDA assumes that documents are generated as follows: First, a list {β_1, ..., β_K} of topics is drawn from a Dirichlet distribution. Then, for each document d, a topic distribution θ_d is drawn from another Dirichlet distribution. Each word W_{d,n} in the document is generated by first picking a topic Z_{d,n} according to the topic distribution θ_d, and then selecting a word according to the word distribution β_{Z_{d,n}} of the topic. Given a document collection, the generation process is reversed via statistical inference (sampling or variational inference) to determine the topics and the topic compositions of the documents.

LDA has been extended in various ways for additional modeling capabilities. Topic correlations are considered in [7, 3]; topic evolution is modeled in [8, 9, 10]; topic structures are built in [11, 3, 1, 4]; side information is exploited in [12, 13]; supervised topic models are proposed in [14, 15]; and so on. In the following, we discuss in more detail three of the extensions that are more closely related to this paper than others.

The Pachinko allocation model (PAM) [3, 4] is proposed as a method for modeling correlations among topics. It introduces multiple levels of supertopics on top of the basic topics. Each supertopic is a distribution over the topics at the next level below. Hence PAM can also be viewed as an HTD method, although the hierarchical structure needs to be predetermined. To pick a topic for a token, it first draws a top-level topic from a multinomial distribution (which in turn is drawn from a Dirichlet distribution), and then draws a topic for the next level below from the multinomial distribution associated with the top-level topic, and so on. The rest of the generation process is the same as in LDA.

The nested Chinese restaurant process (nCRP) [2] and the nested hierarchical Dirichlet process (nHDP) [5] are proposed as HTD methods. They assume that there is a true topic tree behind the data.
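Before turning to the tree-structured methods, the LDA generative process recalled at the start of this section can be sketched in pure Python. The vocabulary size, number of topics K, document length and Dirichlet parameters below are illustrative assumptions, not values from the paper.

```python
import random

rng = random.Random(0)
VOCAB, K, N_DOCS, DOC_LEN = 50, 3, 4, 20

def draw_dirichlet(alpha, dim):
    # Sample from a symmetric Dirichlet via normalized Gamma draws.
    xs = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(xs)
    return [x / total for x in xs]

def draw_categorical(probs):
    u, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u <= acc:
            return i
    return len(probs) - 1

# First, the list of topics beta_1, ..., beta_K (distributions over words).
beta = [draw_dirichlet(0.1, VOCAB) for _ in range(K)]

docs = []
for d in range(N_DOCS):
    theta_d = draw_dirichlet(1.0, K)          # topic distribution of document d
    doc = []
    for n in range(DOC_LEN):
        z_dn = draw_categorical(theta_d)      # topic assignment Z_{d,n}
        w_dn = draw_categorical(beta[z_dn])   # word W_{d,n} drawn from topic z_dn
        doc.append(w_dn)
    docs.append(doc)
```

Topic detection then amounts to reversing this process by inference, as described above.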
A prior distribution is placed over all possible trees using nCRP and nHDP respectively. An assumption is made as to how documents are generated from the true topic tree, which, together with the data, gives a likelihood function over all possible trees. In nCRP, the topics in a document are assumed to be from one path down the tree, while in nHDP, the topics in a document can be from multiple paths, i.e., a subtree within the entire topic tree. The true topic tree is estimated by combining the prior and the likelihood in posterior inference. During inference, one in theory deals with a tree with infinitely many levels and each node having infinitely many children. In practice, the tree is truncated so that it has a predetermined number of levels. In nHDP, each node also has a predetermined number of children, and nCRP uses hyperparameters to control the number. As such, the two methods in effect require the user to provide the structure of a hierarchy as input.

As mentioned in the introduction, HLTA models document collections without referring to a document generation process. Instead, it uses hierarchical latent tree models (HLTMs), and the latent variables in the models are regarded as unobserved attributes of the documents. The concept of latent tree models was introduced in [16, 17], where they were referred to as hierarchical latent class models. The term "latent tree models" first appeared in [18, 19]. Latent tree models generalize two classes of models from the previous literature. The first class is latent class models [20, 21], which are used for categorical data clustering in the social sciences and medicine. The second class is probabilistic phylogenetic trees [22], which are a tool for determining the evolution history of a set of species.

The reader is referred to [23] for a survey of research activities on latent tree models. The activities take place in three settings.
In the first setting, data are assumed to be generated from an unknown LTM,[1] and the task is to recover the generative model [24]. Here one tries to discover relationships between the latent structure and observed marginals that hold in LTMs, and then uses those relationships to reconstruct the true latent structure from data. One can prove theoretical results on consistency and sample complexity. In the second setting, no assumption is made about how data are generated, and the task is to fit an LTM to the data [25]. Here it does not make sense to talk about theoretical guarantees on consistency and sample complexity. Instead, algorithms are evaluated empirically using held-out likelihood. It has been shown that, on real-world datasets, better models can be obtained using methods developed in this setting than using those developed in the first setting [26]. The reason is that, although the assumption of the first setting is reasonable for data from domains such as phylogeny, it is not reasonable for other types of data such as text data and survey data. The third setting is similar to the second setting, except that model fit is no longer the only concern. In addition, one needs to consider how useful the results are to users, and might want to, for example, obtain a hierarchy of latent variables.

Liu et al. [27] were the first to use latent tree models for hierarchical topic detection. They propose an algorithm, namely HLTA, for learning HLTMs from text data and give a method for extracting topic hierarchies from the models. A method for scaling up the algorithm is proposed by Chen et al. [28]. This paper is based on [27, 28].[2]
There are substantial extensions: the novelty of HLTA w.r.t. the LDA-based methods is now systematically discussed; the theory and algorithm are described in more detail and two practical issues are discussed; a new parameter estimation method is used for large datasets; and the empirical evaluations are more extensive.

[1] Here data generated from a model are vectors of values for observed variables, not documents.
[2] Note to reviewer (to be removed in the final version): Those are conference papers by the authors themselves. It is stated in the AIJ review form that "a paper is novel if the results it describes were not previously published by other authors, and were not previously published by the same authors in any archival journal".

Figure 1: The undirected latent tree model in (c) represents an equivalence class of directed latent tree models, which includes (a) and (b) as members.

Another method to learn a hierarchy of latent variables from data is proposed by Ver Steeg and Galstyan [29]. The method is named correlation explanation (CorEx). Unlike HLTA, CorEx is proposed as a model-free method, and it hence does not intend to provide a representation for the joint distribution of the observed variables.

HLTA produces a hierarchy with word variables at the bottom and multiple levels of latent variables on top. It is hence related to hierarchical variable clustering. One difference is that HLTA also partitions documents while variable clustering does not. There is a vast literature on document clustering [30]. In particular, co-clustering [31] can identify document clusters where each cluster is associated with a potentially different set of words. However, document clustering and topic detection are generally considered two different fields with little overlap. This paper bridges the two fields by developing a full-fledged HTD method that partitions documents in multiple ways.
3. Latent Tree Models

A latent tree model (LTM) is a tree-structured Bayesian network [32], where the leaf nodes represent observed variables and the internal nodes represent latent variables. An example is shown in Figure 1 (a). In this paper, all variables are assumed to be binary. The model parameters include a marginal distribution for the root Z_1, and a conditional distribution for each of the other nodes given its parent. The product of the distributions defines a joint distribution over all the variables.

In general, an LTM has n observed variables X = {X_1, ..., X_n} and m latent variables Z = {Z_1, ..., Z_m}. Denote the parent of a variable Y as pa(Y), and let pa(Y) be the empty set when Y is the root. Then the LTM defines a joint distribution over all observed and latent variables as follows:

    P(X_1, ..., X_n, Z_1, ..., Z_m) = ∏_{Y ∈ X ∪ Z} P(Y | pa(Y)).    (1)

By changing the root from Z_1 to Z_2 in Figure 1 (a), we get another model, shown in (b). The two models are equivalent in the sense that they represent the same set of distributions over the observed variables X_1, ..., X_5 [17]. It is not possible to distinguish between equivalent models based on data. This implies that the root of an LTM, and hence the orientations of the edges, are unidentifiable. It therefore makes more sense to talk about undirected LTMs, which is what we do in this paper. One example is shown in Figure 1 (c). It represents an equivalence class of directed models. A member of the class can be obtained by picking a latent node as the root and directing the edges away from the root. For example, (a) and (b) are obtained from (c) by choosing Z_1 and Z_2 to be the root respectively. In implementation, an undirected model is represented using an arbitrary directed model in the equivalence class it represents.
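The factorization in Eq. (1) can be made concrete for a tiny binary LTM with one latent root Z_1 and two observed children X_1 and X_2. The CPT values below are illustrative assumptions, chosen only to show that the product of the local distributions is a proper joint distribution.

```python
from itertools import product

p_z1 = {0: 0.6, 1: 0.4}                                      # marginal of the root Z1
p_x1_given_z1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(X1 | Z1)
p_x2_given_z1 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # P(X2 | Z1)

def joint(x1, x2, z1):
    # Eq. (1): P(X1, X2, Z1) = P(Z1) * P(X1 | Z1) * P(X2 | Z1)
    return p_z1[z1] * p_x1_given_z1[z1][x1] * p_x2_given_z1[z1][x2]

# The product of the local distributions sums to 1 over all configurations.
total = sum(joint(x1, x2, z1) for x1, x2, z1 in product([0, 1], repeat=3))
```

Marginals over the observed variables, such as P(X_1 = 1), are obtained by summing the latent variable out of this product.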
In the literature, there are variations of LTMs where some internal nodes are observed [24] and/or the variables are continuous [33, 34, 35]. In this paper, we focus on basic LTMs as defined in the previous two paragraphs.

We use |Z| to denote the number of possible states of a variable Z. An LTM is regular if, for any latent node Z, we have that

    |Z| ≤ (∏_{i=1}^{k} |Z_i|) / max_{i=1}^{k} |Z_i|,    (2)

where Z_1, ..., Z_k are the neighbors of Z, and the inequality holds strictly when k = 2. When all variables are binary, the condition reduces to the requirement that each latent node must have at least three neighbors. For any irregular LTM, there is a regular model that has fewer parameters and represents the same set of distributions over the observed variables [17]. Consequently, we focus only on regular models.

4. Hierarchical Latent Tree Models and Topic Detection

We will later present an algorithm, called HLTA, for learning from text data models such as the one shown in Figure 2. There is a layer of observed variables at the bottom and multiple layers of latent variables on top. The model is hence called a hierarchical latent tree model (HLTM). In this section, we discuss how to interpret HLTMs and how to extract topics and topic hierarchies from them.

4.1. HLTMs for Text Data

We use the toy model in Figure 2 as a running example. It is learned from a subset of the 20 Newsgroups data.[1] The variables at the bottom level, level 0, are observed binary variables that represent the presence/absence of words in a document. The latent variables at level 1 are introduced during data analysis to model word co-occurrence patterns. For example, Z_11 captures the probabilistic co-occurrence of the words nasa, space, shuttle and mission; Z_12 captures the probabilistic co-occurrence of the words orbit, earth, solar and satellite; and Z_13 captures the probabilistic co-occurrence of the words lunar and moon.
Latent variables at level 2 are introduced during data analysis to model the co-occurrence of the patterns at level 1. For example, Z_21 represents the probabilistic co-occurrence of the patterns Z_11, Z_12 and Z_13.

Because the latent variables are introduced layer by layer, and each latent variable is introduced to explain the correlations among a group of variables at the level below, we regard, for the purpose of model interpretation, the edges between two layers as directed, and they are directed downwards. (The edges between top-level latent variables are not directed.) This allows us to talk about the subtree rooted at a latent node. For example, the subtree rooted at Z_21 consists of the observed variables orbit, earth, ..., mission.

4.2. Topics from HLTMs

There are 14 latent variables in total in the toy example. Each latent variable has two states and hence partitions the document collection into two soft clusters. To figure out what the partition and the two clusters are about, we need to consider the relationship between the latent variable and the observed variables in its subtree. Take Z_21 as an example. Denote the two document clusters it gives as Z_21 = s0 and Z_21 = s1. The occurrence probabilities in the two clusters of the words in the Z_21 subtree are given in Table 1, along with the sizes of the two clusters. We see that the cluster Z_21 = s1 consists of 5% of the documents.

[1] http://qwone.com/~jason/20Newsgroups/

Figure 2: Hierarchical latent tree model obtained from a toy text dataset. The latent variables right above the word variables represent word co-occurrence patterns and those at higher levels represent co-occurrence of patterns at the level below.

Table 1: Document partition given by latent variable Z_21.

    Z_21       s0 (0.95)   s1 (0.05)
    space      0.04        0.58
    nasa       0.03        0.43
    orbit      0.01        0.33
    earth      0.01        0.33
    shuttle    0.01        0.24
    moon       0.02        0.26
    mission    0.01        0.21
In this cluster, words such as space, nasa and orbit occur with relatively high probabilities. It is clearly meaningful and is interpreted as a topic. One might label the topic "NASA". The other cluster, Z_21 = s0, consists of 95% of the documents. In this cluster, the words occur with low probabilities. We interpret it as a background topic.

There are three subtle issues concerning Table 1. The first issue is how the word variables are ordered. To answer the question, we need the mutual information (MI) I(X; Y) [36] between two discrete variables X and Y, which is defined as follows:

    I(X; Y) = Σ_{X,Y} P(X, Y) log [ P(X, Y) / (P(X) P(Y)) ].    (3)

In Table 1, the word variables are ordered according to their mutual information with Z_21. The words placed at the top of the table have the highest MI with Z_21. They are the best ones to characterize the difference between the two clusters because their occurrence probabilities in the two clusters differ the most. They occur with high probabilities in the cluster Z_21 = s1 and with low probabilities in Z_21 = s0. If one is to choose only the top, say 5, words to characterize the topic Z_21 = s1, then the best words to pick are space, nasa, orbit, earth and shuttle.

The second issue is how the background topic is determined. The answer is that, among the two document clusters given by Z_21, the one where the words occur with lower probabilities is regarded as the background topic. In general, we consider the sum of the probabilities of the top 3 words. The cluster where the sum is lower is designated as the background topic and labeled s0, and the other one is considered a genuine topic and labeled s1.

Finally, the creation of Table 1 requires the joint distribution of Z_21 with each of the word variables in its subtree (e.g., P(space, Z_21)). The distributions can be computed using belief propagation [32].
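The mutual information of Eq. (3), used above to order the rows of Table 1, is a direct computation from a joint probability table. The joint distributions in the checks below are illustrative assumptions.

```python
import math

def mutual_information(p_xy):
    # p_xy[x][y] = P(X = x, Y = y); Eq. (3) with natural logarithm.
    p_x = [sum(row) for row in p_xy]
    p_y = [sum(p_xy[x][y] for x in range(len(p_xy))) for y in range(len(p_xy[0]))]
    mi = 0.0
    for x, row in enumerate(p_xy):
        for y, p in enumerate(row):
            if p > 0:  # terms with P(X, Y) = 0 contribute nothing
                mi += p * math.log(p / (p_x[x] * p_y[y]))
    return mi
```

As a sanity check, independent binary variables have MI 0, and a perfectly correlated binary pair has MI log 2.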
The computation takes linear time because the model is tree-structured.

4.3. Topic Hierarchies from HLTMs

If the background topics are ignored, each latent variable gives us exactly one topic. As such, the model in Figure 2 gives us 14 topics, which are shown in Table 2. Latent variables at high levels of the hierarchy capture long-range word co-occurrence patterns and hence give thematically more general topics, while those at low levels of the hierarchy capture short-range word co-occurrence patterns and give thematically more specific topics. For example, the topic given by Z_22 (windows, card, graphics, video, dos) consists of a mixture of words about several aspects of computers. We can say that the topic is about computers. The subtopics are each concerned with only one aspect of computers: Z_14 (card, video, driver), Z_15 (dos, windows), and Z_16 (graphics, display, image).

Table 2: Topic hierarchy given by the model in Figure 2.

    [0.05] space nasa orbit earth shuttle
        [0.06] orbit earth solar satellite
        [0.05] space nasa shuttle mission
        [0.03] moon lunar
    [0.14] team games players season hockey
        [0.14] team season
        [0.11] players baseball league
        [0.09] games win won
        [0.08] hockey nhl
    [0.24] windows card graphics video dos
        [0.12] card video driver
        [0.15] windows dos
        [0.10] graphics display image
        [0.09] computer science

4.4. More on Novelty

In the introduction, we have discussed three differences between HLTA and the LDA-based methods. There are three other important differences.

The fourth difference lies in the relationship between topics and documents. In the LDA-based methods, a document is a mixture of topics, and the probabilities of the topics within a document sum to 1. Because of this, the LDA models are sometimes called mixed-membership models. In HLTA, a topic is a soft cluster of documents, and a document might belong to multiple topics, each with probability 1.
In this sense, HLTMs can be said to be multi-membership models.

The fifth difference between HLTA and the LDA-based methods is about the semantics of the hierarchies they produce. In the context of document analysis, a common concept of hierarchy is a rooted tree where each node represents a cluster of documents, and the cluster of documents at a node is the union of the document clusters at its children. Neither HLTA nor the LDA-based methods yield such hierarchies. nCRP and nHDP produce a tree of topics. The topics at higher levels appear more often than those at lower levels, but they are not necessarily related thematically. PAM yields a collection of topics that are organized into a directed acyclic graph. The topics at the lowest level are distributions over words, and topics at higher levels are distributions over topics at the level below and hence are called super-topics. In contrast, the output of HLTA is a tree of latent variables. Latent variables at high levels of the tree capture long-range word co-occurrence patterns and hence give thematically more general topics, while latent variables at low levels of the tree capture short-range word co-occurrence patterns and hence give thematically more specific topics.

Finally, the LDA-based methods require the user to provide the structure of a hierarchy, including the number of latent levels and the number of nodes at each level. The number of latent levels is usually set at 3 out of efficiency considerations. The contents of the nodes (distributions over the vocabulary) are learned from data. In contrast, HLTA learns both model structures and model parameters from data. The number of latent levels is not limited to 3.

5. Model Structure Construction

We present the HLTA algorithm in this and the next two sections. The inputs to HLTA include a collection of documents and several algorithmic parameters.
The outputs include an HLTM and a topic hierarchy extracted from the HLTM. Topic hierarchy extraction has already been explained in Section 4, and we will hence focus on how to learn the HLTM. In this section we describe the procedures for constructing the model structure. In Section 6 we discuss parameter estimation issues, and in Section 7 we discuss techniques employed to accelerate the algorithm.

5.1. Top-Level Control

The top-level control of HLTA is given in Algorithm 1 and the subroutines are given in Algorithms 2-6. In this subsection, we illustrate the top-level control using the toy dataset mentioned in Section 4, which involves 30 word variables. There are 5 steps.

Algorithm 1 HLTA(D, τ, μ, δ, κ)
Inputs: D — a collection of documents; τ — upper bound on the number of top-level topics; μ — upper bound on island size; δ — threshold used in the UD-test; κ — number of EM steps on the final model.
Outputs: An HLTM and a topic hierarchy.
 1: D_1 ← D, m ← null.
 2: repeat
 3:     m_1 ← LEARNFLATMODEL(D_1, δ, μ);
 4:     if m = null then
 5:         m ← m_1;
 6:     else
 7:         m ← STACKMODELS(m_1, m);
 8:     end if
 9:     D_1 ← HARDASSIGNMENT(m, D);
10: until the number of top-level nodes in m ≤ τ.
11: Run EM on m for κ steps.
12: return m and the topic hierarchy extracted from m.

The first step (line 3) yields the model shown in Figure 3 (a). It is said to be a flat LTM because each latent variable is connected to at least one observed variable. In hierarchical models such as the one shown in Figure 2, on the other hand, only the latent variables at the lowest latent layer are connected to observed variables, and the other latent variables are not. The learning of a flat model is the key step of HLTA. We will discuss it in detail later. We refer to the latent variables in the flat model from the first step as level-1 latent variables.
The second step (line 9) is to turn the level-1 latent variables into observed variables through data completion. To do so, the subroutine HARDASSIGNMENT carries out inference to compute the posterior distribution of each latent variable for each document. The document is assigned to the state with the highest posterior probability, resulting in a dataset D_1 over the level-1 latent variables. In the third step, line 3 is executed again to learn a flat LTM for the level-1 latent variables, resulting in the model shown in Figure 3 (b).

Figure 3: Illustration of the top-level control of HLTA: (a) A flat model over the word variables is first learned; (b) The latent variables in (a) are converted into observed variables through data completion and another flat model is learned for them. Finally, the flat model in (b) is stacked on top of the flat model in (a) to obtain the hierarchical model in Figure 2.

In the fourth step (line 7), the flat model for the level-1 latent variables is stacked on top of the flat model for the observed variables, resulting in the hierarchical model in Figure 2. While doing so, the subroutine STACKMODELS cuts off the links among the level-1 latent variables. The parameter values for the new model are copied from the two source models.

In general, the first four steps are repeated until the number of top-level latent variables falls below a user-specified upper bound τ (lines 2 to 10). In our running example, we set τ = 5. The number of nodes at the top level in our current model is 3, which is below the threshold τ. Hence the loop is exited. In the fifth step (line 11), the EM algorithm [37] is run on the final hierarchical model for κ steps to improve its parameters, where κ is another user-specified input parameter.

The five steps can be grouped into two phases conceptually. The model construction phase consists of the first four steps.
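The HARDASSIGNMENT step (line 9 of Algorithm 1) can be sketched as follows. The posterior probabilities below are illustrative assumptions; in HLTA they would come from inference in the learned flat model.

```python
def hard_assignment(posteriors):
    # posteriors: one dict per document mapping a latent variable name to
    # its posterior (P(s0 | doc), P(s1 | doc)). Each document is assigned
    # the state with the highest posterior probability, yielding a
    # completed dataset over the latent variables.
    dataset = []
    for post in posteriors:
        row = {z: (0 if p[0] >= p[1] else 1) for z, p in post.items()}
        dataset.append(row)
    return dataset

# Hypothetical posteriors for two documents over two level-1 variables.
doc_posteriors = [{"Z11": (0.9, 0.1), "Z12": (0.3, 0.7)},
                  {"Z11": (0.2, 0.8), "Z12": (0.6, 0.4)}]
```

The resulting rows of 0/1 values play the role of the dataset D_1 on which the next flat model is learned.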
The objective is to build a hierarchical model structure. The parameter estimation phase consists of the fifth step. The objective is to optimize the parameters of the hierarchical structure from the first phase.

Figure 4: The subroutine BUILDISLANDS partitions the word variables into uni-dimensional clusters and introduces a latent variable for each cluster to form an island (an LCM).

Algorithm 2 LEARNFLATMODEL(D, δ, μ)
1: L ← BUILDISLANDS(D, δ, μ);
2: m ← BRIDGEISLANDS(L, D);
3: return m.

5.2. Learning Flat Models

The objective of the flat model learning step is to find, among all flat models, the one that has the highest BIC score. The BIC score [38] of a model m given a dataset D is defined as follows:

    BIC(m | D) = log P(D | m, θ*) − (d/2) log |D|,    (4)

where θ* is the maximum likelihood estimate of the model parameters, d is the number of free model parameters, and |D| is the sample size. Maximizing the BIC score intuitively means finding a model that fits the data well and that is not overly complex.

One way to solve the problem is through search. The state-of-the-art in this direction is an algorithm named EAST [25]. It has been shown [26] to find better models than alternative algorithms such as BIN [39] and CLRG [24]. However, it does not scale up. It is capable of handling data with only dozens of observed variables and is hence not suitable for text analysis.

In the following, we present an algorithm that, when combined with the parameter estimation technique to be described in the next section, is efficient enough to deal with large text data. The pseudocode is given in Algorithm 2. It calls two subroutines.

The first subroutine is BUILDISLANDS. It partitions all word variables into clusters, such that the words in each cluster tend to co-occur and the co-occurrences can be properly modeled using a single latent variable.
It then introduces a latent variable for each cluster to model the co-occurrence of the words inside it. In this way we obtain, for each cluster, an LTM with a single latent variable, which is called a latent class model (LCM). In our running example, the results are shown in Figure 4. We metaphorically refer to the LCMs as islands. The second subroutine is BRIDGEISLANDS. It links up the islands by first estimating the mutual information between every pair of latent variables, and then finding the maximum spanning tree [40]. The result is the model in Figure 3 (a). We now set out to describe the two subroutines in detail.

5.2.1. Uni-Dimensionality Test

Conceptually, a set of variables is said to be uni-dimensional if the correlations among them can be properly modeled using a single latent variable. Operationally, we rely on the uni-dimensionality test (UD-test) to determine whether a set of variables is uni-dimensional. To perform the UD-test on a set S of observed variables, we first learn two latent tree models m1 and m2 for S and then compare their BIC scores. The model m1 is the model with the highest BIC score among all LTMs with a single latent variable, and the model m2 is the model with the highest BIC score among all LTMs with two latent variables. Figure 5 (b) shows what the two models might look like when S consists of the four word variables nasa, space, shuttle and mission. We conclude that S is uni-dimensional if the following inequality holds:

BIC(m2 | D) − BIC(m1 | D) < δ,   (5)

where δ is a user-specified threshold. In other words, S is considered uni-dimensional if the best two-latent-variable model is not significantly better than the best one-latent-variable model.

Algorithm 3 BUILDISLANDS(D, δ, µ)
1: V ← variables in D, L ← ∅.
2: while |V| > 0 do
3:   m ← ONEISLAND(D, V, δ, µ);
4:   L ← L ∪ {m};
5:   V ← variables in D but not in any m ∈ L;
6: end while
7: return L.

Note that the UD-test is related to the Bayes factor for comparing the two models [41]:

K = P(D | m2) / P(D | m1).   (6)

The strength of evidence in favor of m2 depends on the value of K. The following guidelines are suggested in [41]: if the quantity 2 log K is between 0 and 2, the evidence is negligible; if it is between 2 and 6, there is positive evidence in favor of m2; if it is between 6 and 10, there is strong evidence in favor of m2; and if it is larger than 10, there is very strong evidence in favor of m2. Here, "log" stands for the natural logarithm.

It is well known that the BIC score BIC(m|D) is a large sample approximation of the marginal loglikelihood log P(D|m) [38]. Consequently, the difference BIC(m2|D) − BIC(m1|D) is a large sample approximation of the logarithm of the Bayes factor, log K. According to the cut-off values for the Bayes factor, we conclude that there is positive, strong, and very strong evidence favoring m2 when the difference is larger than 1, 3 and 5 respectively. In our experiments, we always set δ = 3.

5.2.2. Building Islands

The subroutine BUILDISLANDS (Algorithm 3) builds islands one by one. It builds the first island by calling another subroutine, ONEISLAND (Algorithm 4). Then it removes the variables in the island from the dataset, and repeats the process to build other islands. It continues until all variables are grouped into islands.

The subroutine ONEISLAND (Algorithm 4) requires a measure of how closely

Algorithm 4 ONEISLAND(D, V, δ, µ)
1: if |V| ≤ 3, m ← LEARNLCM(D, V), return m.
2: S ← two variables in V with highest MI;
3: X ← arg max_{A ∈ V\S} MI(A, S);
4: S ← S ∪ X;
5: V1 ← V \ S;
6: D1 ← PROJECTDATA(D, S);
7: m ← LEARNLCM(D1, S).
8: loop
9:   X ← arg max_{A ∈ V1} MI(A, S);
10:  W ← arg max_{A ∈ S} MI(A, X);
11:  D1 ← PROJECTDATA(D, S ∪ {X}), V1 ← V1 \ {X};
12:  m1 ← PEM-LCM(m, S, X, D1);
13:  if |V1| = 0, return m1.
14:  m2 ← PEM-LTM-2L(m, S \ {W}, {W, X}, D1);
15:  if BIC(m2|D1) − BIC(m1|D1) > δ then
16:    return m2 with W, X and their parent removed.
17:  end if
18:  if |S| ≥ µ, return m1.
19:  m ← m1, S ← S ∪ {X}.
20: end loop

correlated each pair of variables is. In this paper, mutual information is used for this purpose. The mutual information I(X; Y) between two variables X and Y is given by (3). We will also need the mutual information (MI) between a variable X and a set of variables S. We estimate it as follows:

MI(X, S) = max_{A ∈ S} MI(X, A).   (7)

The subroutine ONEISLAND maintains a working set S of observed variables. Initially, S consists of the pair of variables with the highest MI (line 2), which will be referred to as the seed variables for the island. Then the variable that has the highest MI with those two variables is added to S as the third variable (lines 3 and 4). After that, other variables are added to S one by one. At each step, we pick the variable X that has the highest MI with the current set S (line 9), and perform the UD-test on the set S ∪ {X} (lines 12, 14, 15). If the UD-test passes, X is added to S (line 19) and the process continues. If the UD-test fails, one island is created and the subroutine returns (line 16). The subroutine also returns when the size of the island reaches a user-specified upper bound µ (line 18). In our experiments, we always set µ = 15.

Figure 5: Illustration of the ONEISLAND subroutine: (a) initial island; (b) UD-test passes after adding mission; (c) UD-test passes after adding moon; (d) UD-test fails after adding lunar; (e) final island.

The UD-test requires two models m1 and m2.
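Given their BIC scores, the decision rule of Equation (5) is a one-liner; a hypothetical sketch (the scores shown are made up):

```python
def ud_test(bic_m1, bic_m2, delta=3.0):
    """UD-test (Equation 5): the set is judged uni-dimensional unless the
    best two-latent-variable model m2 beats the best one-latent-variable
    model m1 by more than delta in BIC score (delta = 3 in the paper's
    experiments)."""
    return bic_m2 - bic_m1 < delta

passes = ud_test(bic_m1=-5000.0, bic_m2=-4999.0)   # difference 1 < 3: passes
fails = ud_test(bic_m1=-5000.0, bic_m2=-4990.0)    # difference 10 > 3: fails
```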
In principle, they should be the best models with one and two latent variables respectively. For the sake of computational efficiency, we construct them heuristically in this paper. For m1, we choose the LCM where the latent variable is binary and the parameters are optimized by a fast subroutine PEM-LCM that will be described in the next section. Let W be the variable in S that has the highest MI with the variable X to be added to the island. For m2, we choose the model where one latent variable is connected to the variables in S \ {W} and the second latent variable is connected to W and X. Both latent variables are binary and the model parameters are optimized by a fast subroutine PEM-LTM-2L that will be described in the next section.

Let us illustrate the ONEISLAND subroutine using the example in Figure 5. The pair of variables nasa and space have the highest MI among all variables, and they are hence the seed variables. The variable shuttle has the highest MI with the pair among all other variables, and hence it is chosen as the third variable to start the island (Figure 5 (a)). Among all the other variables, mission has the highest MI with the three variables in the model. To decide whether mission should be added to the group, the two models m1 and m2 in Figure 5 (b) are created. In m2, shuttle is grouped with the new variable because it has the highest MI with the new variable among the three variables in Figure 5 (a). It turns out that m1 has a higher BIC score than m2. Hence the UD-test passes and the variable mission is added to the group. The next variable to be considered for addition is moon, and it is added to the group because the UD-test passes again (Figure 5 (c)). After that, the variable lunar is considered. In this case, the BIC score of m2 is significantly higher than that of m1 and hence the UD-test fails (Figure 5 (d)). The subroutine ONEISLAND hence terminates.
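The MI quantities that drive this island growth, pairwise MI and the variable-to-set MI of Equation (7), can be sketched as follows (empirical plug-in estimates on binary columns; illustrative only, not the paper's implementation):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) between two binary variables,
    estimated from paired observations."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        mi += p_joint * math.log(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

def mi_with_set(x_col, set_cols):
    """Equation (7): MI between a variable and a set is the maximum of
    the pairwise MI values with the set's members."""
    return max(mutual_information(x_col, s) for s in set_cols)

# x perfectly tracks a and is only weakly related to b:
a = [0, 0, 1, 1, 0, 1, 0, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
x = a[:]
```

With these columns, `mi_with_set(x, [a, b])` equals I(x; a) = log 2, the entropy of a fair binary variable.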
It returns an island, which is the part of the model m2 that does not contain the last variable lunar (Figure 5 (e)). The island consists of the four words nasa, space, shuttle and mission. Intuitively, they are grouped together because they tend to co-occur in the dataset.

5.2.3. Bridging Islands

After the islands are created, the next step is to link them up so as to obtain a model over all the word variables. This is carried out by the BRIDGEISLANDS subroutine, and the idea is borrowed from [42]. The subroutine first estimates the MI between each pair of latent variables in the islands, then constructs a complete undirected graph with the MI values as edge weights, and finally finds the maximum spanning tree of the graph. The parameters of the newly added edges are estimated using a fast method that will be described at the end of Section 6.3.

Let m and m′ be two islands with latent variables Y and Y′ respectively. The MI I(Y; Y′) between Y and Y′ is calculated using Equation (3) from the following joint distribution:

P(Y, Y′ | D, m, m′) = C Σ_{d ∈ D} P(Y | m, d) P(Y′ | m′, d),   (8)

where P(Y | m, d) is the posterior distribution of Y in m given data case d, P(Y′ | m′, d) is that of Y′ in m′, and C is the normalization constant. In our running example, applying BRIDGEISLANDS to the islands in Figure 4 results in the flat model shown in Figure 3 (a).

6. Parameter Estimation during Model Construction

In the model construction phase, a large number of intermediate models are generated. Whether HLTA can scale up depends on whether the parameters of those intermediate models and the final model can be estimated efficiently. In this section, we present a fast method called progressive EM for estimating the parameters of the intermediate models. In the next section, we will discuss how to estimate the parameters of the final model efficiently when the sample size is very large.
6.1. The EM Algorithm

We start by briefly reviewing the EM algorithm. Let X and H be respectively the sets of observed and latent variables in an LTM m, and let V = X ∪ H. Assume one latent variable is picked as the root and all edges are directed away from the root. For any V in V that is not the root, the parent pa(V) of V is a latent variable and can take the values '0' or '1'. For technical convenience, let pa(V) be a dummy variable with only one possible value when V is the root. Enumerate all the variables as V1, V2, ..., Vn. We denote the parameters of m as

θ_ijk = P(V_i = k | pa(V_i) = j),   (9)

where i ∈ {1, ..., n}, k is a value of V_i, and j is a value of pa(V_i). Let θ be the vector of all the parameters.

Given a dataset D, the loglikelihood function of θ is given by

l(θ | D) = Σ_{d ∈ D} log Σ_H P(d, H | θ).   (10)

The maximum likelihood estimate (MLE) of θ is the value that maximizes the loglikelihood function. Due to the presence of latent variables, it is intractable to maximize the loglikelihood function directly. An iterative method, the Expectation-Maximization (EM) algorithm [37], is usually used in practice. EM starts with an initial guess θ^(0) of the parameter values, and then produces a sequence of estimates θ^(1), θ^(2), .... Given the current estimate θ^(t), the next estimate θ^(t+1) is obtained through an E-step and an M-step. In the context of latent tree models, the two steps are as follows:

• The E-step:

n^(t)_ijk = Σ_{d ∈ D} P(V_i = k, pa(V_i) = j | d, m, θ^(t)).   (11)

• The M-step:

θ^(t+1)_ijk = n^(t)_ijk / Σ_k n^(t)_ijk.   (12)

Note that the E-step requires the calculation of P(V_i, pa(V_i) | d, m, θ^(t)) for each data case d ∈ D and each variable V_i. For a given data case d, we can calculate P(V_i, pa(V_i) | d, m, θ^(t)) for all variables V_i in linear time using message propagation [43].
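As a concrete toy instance of the E- and M-steps above, here is EM for the simplest latent tree model, an LCM with one binary latent variable over binary word variables (a Python sketch; the paper's implementation is in Java and handles general tree structures via message propagation):

```python
import numpy as np

def em_lcm(data, n_iter=50, seed=0):
    """EM for a latent class model: one binary latent variable Y and
    binary observed word variables X_1..X_n.

    data: 0/1 array of shape (num_docs, num_words)
    returns: (p_y, p_x) with p_y[j] = P(Y = j), p_x[j, i] = P(X_i = 1 | Y = j)
    """
    rng = np.random.default_rng(seed)
    N, n = data.shape
    p_y = np.array([0.5, 0.5])
    p_x = rng.uniform(0.3, 0.7, size=(2, n))   # random asymmetric start

    for _ in range(n_iter):
        # E-step (cf. Equation 11): posterior P(Y = j | d) for every case
        log_post = (np.log(p_y)
                    + data @ np.log(p_x.T)
                    + (1 - data) @ np.log(1 - p_x.T))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

        # M-step (cf. Equation 12): normalise the expected counts
        p_y = post.mean(axis=0)
        p_x = np.clip((post.T @ data) / post.sum(axis=0)[:, None],
                      1e-6, 1 - 1e-6)
    return p_y, p_x

# Two word clusters that never co-occur; EM recovers the two components:
data = np.array([[1, 1, 0, 0]] * 30 + [[0, 0, 1, 1]] * 30)
p_y, p_x = em_lcm(data)
```

After convergence, one component puts high probability on the first word pair and the other on the second, and the mixing weights are roughly equal.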
EM terminates when the improvement in loglikelihood, l(θ^(t+1)|D) − l(θ^(t)|D), falls below a predetermined threshold or when the number of iterations reaches a predetermined limit. To avoid local maxima, multiple restarts are usually used.

6.2. Progressive EM

Being an iterative algorithm, EM can be trapped in local maxima. It is also time-consuming and does not scale up well. Progressive EM is proposed as a fast alternative to EM for the model construction phase. It estimates all the parameters in multiple steps and, in each step, it considers a small part of the model and runs EM in the submodel to maximize the local likelihood function.

Figure 6: Progressive EM: EM is first run in the submodel shaded in (a) to estimate the distributions P(Y), P(A|Y), P(B|Y) and P(D|Y); then EM is run in the submodel shaded in (b), with P(Y), P(B|Y) and P(D|Y) fixed, to estimate the distributions P(Z|Y), P(C|Z) and P(E|Z).

The idea is illustrated in Figure 6. Assume Y is selected to be the root. To estimate all the parameters of the model, we first run EM in the part of the model shaded in Figure 6a to estimate P(Y), P(A|Y), P(B|Y) and P(D|Y), and then run EM in the part of the model shaded in Figure 6b, with P(Y), P(B|Y) and P(D|Y) fixed, to estimate P(Z|Y), P(C|Z) and P(E|Z).

6.3. Progressive EM and HLTA

We use progressive EM to estimate the parameters of the intermediate models generated by HLTA, specifically those generated by the subroutine ONEISLAND (Algorithm 4). It is carried out by the two subroutines PEM-LCM and PEM-LTM-2L. At lines 1 and 7, ONEISLAND needs to estimate the parameters of an LCM with three observed variables. This is done using EM. Next, it enters a loop. At the beginning, we have an LCM m for a set S of variables.
The parameters of the LCM have been estimated earlier (at line 7 initially, or at line 12 of the previous pass through the loop). At lines 9 and 10, ONEISLAND finds the variable X outside S that has maximum MI with S, and the variable W inside S that has maximum MI with X. At line 12, ONEISLAND adds X to m to create a new LCM m1. The parameters of m1 are estimated using the subroutine PEM-LCM (Algorithm 5), which is an application of progressive EM.

Algorithm 5 PEM-LCM(m, S, X, D)
1: Y ← the latent variable of m;
2: S1 ← {X} ∪ two seed variables in S;
3: While keeping the other parameters fixed, run EM in the part of m that involves S1 ∪ Y to estimate P(X|Y).
4: return m

Algorithm 6 PEM-LTM-2L(m, S \ {W}, {W, X}, D)
1: Y ← the latent variable of m;
2: m2 ← model obtained from m by adding X and a new latent variable Z, connecting Z to Y, connecting X to Z, and re-connecting W (connected to Y before) to Z;
3: S1 ← {W, X} ∪ two seed variables³ in S;
4: While keeping the other parameters fixed, run EM in the part of m2 that involves S1 ∪ Y ∪ Z to estimate only P(W|Z), P(X|Z) and P(Z|Y).
5: return m2

Let us explain PEM-LCM using the intermediate models shown in Figure 5. Let m be the model shown on the left of Figure 5c and S = {nasa, space, shuttle, mission, moon}. The variable X to be added to m is lunar, and the model m1 after adding lunar to m is shown on the left of Figure 5d. The only distribution to be estimated is P(lunar|Y), as the other distributions have already been estimated. PEM-LCM estimates the distribution by running EM on the part of the model m1 shown in Figure 7 (left), where the variables involved are in rectangles. The variables nasa and space are included in the submodel, instead of other observed variables, because they were the seed variables picked at line 2 of Algorithm 4.
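The key computational trick in PEM-LCM, running EM locally while freezing previously estimated parameters, can be sketched as follows. This is a simplified illustration, not the paper's implementation: here the "submodel" keeps all previously fitted variables rather than only the seed variables, and only the new variable's conditional is updated.

```python
import numpy as np

def pem_add_variable(data_fixed, data_new, p_y, p_x_fixed, n_iter=20):
    """Sketch of the idea behind PEM-LCM: an LCM over some variables has
    already been fitted (p_y, p_x_fixed); a new observed variable is
    attached to the same latent variable Y, and ONLY its conditional
    P(X_new = 1 | Y) is estimated, with everything else held fixed.

    data_fixed: (N, k) 0/1 columns of variables already in the model
    data_new:   (N,)   0/1 column of the variable being added
    p_y:        (2,)   fixed P(Y)
    p_x_fixed:  (2, k) fixed P(X_i = 1 | Y = j)
    """
    p_new = np.array([0.5, 0.5])        # initial guess for P(X_new = 1 | Y)
    for _ in range(n_iter):
        # E-step over Y, combining the fixed and the new parts of the model
        log_post = (np.log(p_y)
                    + data_fixed @ np.log(p_x_fixed.T)
                    + (1 - data_fixed) @ np.log(1 - p_x_fixed.T)
                    + np.outer(data_new, np.log(p_new))
                    + np.outer(1 - data_new, np.log(1 - p_new)))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step touches only the new variable's parameters
        p_new = np.clip((post * data_new[:, None]).sum(axis=0)
                        / post.sum(axis=0), 1e-6, 1 - 1e-6)
    return p_new

# The new word co-occurs with the words of component 0:
fixed = np.array([[1, 1]] * 20 + [[0, 0]] * 20)
new = np.array([1] * 20 + [0] * 20)
p_new = pem_add_variable(fixed, new,
                         p_y=np.array([0.5, 0.5]),
                         p_x_fixed=np.array([[0.9, 0.9], [0.1, 0.1]]))
```

Because the other parameters never change and the submodel is tiny, each such local run is very cheap, which is what makes the island-building loop fast.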
At line 14, ONEISLAND adds X to m to create a new LTM m2 with two latent variables. The parameters of m2 are estimated using the subroutine PEM-LTM-2L (Algorithm 6), which is also an application of progressive EM. In our running example, let moon be the variable W that has the highest MI with lunar among all variables in S. Then the model m2 is as shown on the right-hand side of Figure 5d. The distributions to be estimated are P(Z|Y), P(moon|Z) and P(lunar|Z). PEM-LTM-2L estimates the distributions by running EM on the part of the model m2 shown in Figure 7 (right), where the variables involved are in rectangles. The variables nasa and space are included in the submodel, instead of shuttle and mission, because they were the seed variables picked at line 2 of Algorithm 4.

³ When one of the seed variables is W, use the other seed variable and the variable picked at line 3 of Algorithm 4.

Figure 7: Parameter estimation during island building: To determine whether the variable lunar should be added to the island m1 in Figure 5c, two models are created. We need to estimate only P(lunar|Y) for the model on the left, and P(Z|Y), P(moon|Z) and P(lunar|Z) for the model on the right. The estimation is done by running EM in the parts of the models where the variable names are in rectangles.

There is also a parameter estimation problem inside the subroutine BRIDGEISLANDS. After linking up the islands, the parameters for the edges between latent variables must be estimated. We use progressive EM for this task as well. Consider the model in Figure 3 (a). To estimate P(Z11|Z12), we form a submodel by picking two children of Z11, for instance nasa and space, and two children of Z12, for instance orbit and earth. Then we estimate the distribution P(Z11|Z12) by running EM in the submodel with all other parameters fixed.
6.4. Complexity Analysis

Let n be the number of observed variables and N the sample size. HLTA requires the computation of empirical MI between each pair of observed variables. This takes O(n²N) time.

When building islands for the observed variables, HLTA generates roughly 2n intermediate models. Progressive EM is used to estimate the parameters of the intermediate models. It is run on submodels with 3 or 4 observed variables. The projection of a dataset onto 3 or 4 binary variables consists of only 8 or 16 distinct cases, no matter how large the original sample size is. Hence progressive EM takes constant time, which we denote by c1, on each submodel. This is the key reason why HLTA can scale up. The data projection takes O(N) time for each submodel. Hence the total time for island building is O(2n(N + c1)).

To bridge the islands, HLTA needs to estimate the MI between every pair of latent variables and runs progressive EM to estimate the parameters for the edges between the islands. A loose upper bound on the running time of this step is n²N + n(N + c1).

The total number of variables (observed and latent) in the resulting flat model is upper bounded by 2n. Inference on the model takes no more than 2n propagation steps for each data case. Let c2 be the time for each propagation step. Then the hard assignment step takes O(4nc2N) time. So, the total time for the first pass through the loop in HLTA is O(2n²N + 3n(N + c1) + 4nc2N) = O(2n²N + 3nc1 + 4nc2N), where the term 3nN is dropped because it is dominated by the term 4nc2N. As we move up one level, the number of "observed" variables is decreased by at least half. Hence, the total time for the model construction phase is upper bounded by O(4n²N + 6nc1 + 8nc2N).

The total number of variables (observed and latent) in the final model is upper bounded by 2n.
Hence, one EM iteration takes O(4nc2N) time and the final parameter optimization step takes O(4nc2Nκ) time. The total running time of HLTA is O(4n²N + 6nc1 + 8nc2N) + O(4nc2Nκ), where the two terms are the times for the model construction phase and the parameter estimation phase respectively.

7. Dealing with Large Datasets

We employ two techniques to further accelerate HLTA so that it can handle large datasets with millions of documents.

The first technique is downsampling, and we use it to reduce the complexity of the model construction phase. Specifically, we use a subset of N′ randomly sampled data cases instead of the entire dataset and thereby reduce the complexity to O(4n²N′ + 6nc1 + 8nc2N′). When N is very large, we can set N′ to be a small fraction of N and hence achieve substantial computational savings. In the meantime, we can still expect to obtain a good structure if N′ is not too small. The reason is that model construction relies on salient regularities of the data, and those regularities should be preserved in the subset when N′ is not too small.

The second technique is stepwise EM [44, 45]. We use it to accelerate the convergence of the parameter estimation process in the second phase, where the task is to improve the values of the parameters θ = {θ_ijk} (Equation 9) obtained in the model construction phase. While standard EM, a.k.a. batch EM, updates the parameters once in each iteration, stepwise EM updates the parameters multiple times in each iteration. Suppose the dataset D is randomly divided into equal-sized minibatches D1, ..., DB. Stepwise EM updates the parameters after processing each minibatch. It maintains a collection of auxiliary variables n_ijk, which are initialized to 0 in our experiments. Suppose the parameters have been updated u − 1 times before and the current values are θ = {θ_ijk}. Let Db be the next minibatch to process.
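In code, one such minibatch update, using the stepsize schedule η_u = (u + 2)^(−α) discussed below, might look like the following sketch for a single conditional distribution (the counts are made up; illustrative only):

```python
def stepsize(u, alpha=0.75):
    """Stepsize schedule eta_u = (u + 2) ** (-alpha); alpha = 0.75 in the
    paper's experiments."""
    return (u + 2) ** (-alpha)

def stepwise_update(n_ijk, minibatch_counts, u, alpha=0.75):
    """One stepwise-EM update for a single conditional distribution
    P(V_i | pa(V_i) = j): blend the running expected counts with the
    expected counts computed on the current minibatch, then renormalise.

    n_ijk:            running counts, one entry per value k of V_i
    minibatch_counts: expected counts from the current minibatch
    u:                number of updates performed so far
    """
    eta = stepsize(u, alpha)
    n_new = [(1 - eta) * n + eta * m for n, m in zip(n_ijk, minibatch_counts)]
    total = sum(n_new)
    return n_new, [v / total for v in n_new]

# The running counts favour value 0; the minibatch favours value 1.
# The blended estimate moves toward the minibatch, but only partially:
counts, theta = stepwise_update(n_ijk=[8.0, 2.0],
                                minibatch_counts=[2.0, 8.0], u=1)
```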
Stepwise EM carries out the updating as follows:

n′_ijk = Σ_{d ∈ Db} P(V_i = k, pa(V_i) = j | d, m, θ),   (13)

n_ijk = (1 − η_u) n_ijk + η_u n′_ijk,   (14)

θ_ijk = n_ijk / Σ_k n_ijk.   (15)

Note that Equation (13) is similar to (11), except that the statistics are calculated on the minibatch Db rather than on the entire dataset D. The parameter η_u is known as the stepsize and is given by η_u = (u + 2)^(−α), where the parameter α is to be chosen in the range 0.5 ≤ α ≤ 1 [46]. In all our experiments, we set α = 0.75.

Stepwise EM is similar to stochastic gradient descent [47] in that it updates the parameters after processing each minibatch. It has been shown to yield estimates of the same or even better quality than batch EM, and it converges much faster than the latter [46]. As such, we can run it for many fewer iterations than batch EM and thereby substantially reduce the running time.

8. Illustration of Results and Practical Issues

HLTA is a novel method for hierarchical topic detection and, as discussed in the introduction, it is fundamentally different from the LDA-based methods. We will empirically compare HLTA with the LDA-based methods in the next section. In this section, we present the results HLTA obtains on a real-world dataset so that the reader can gain a clear understanding of what it has to offer. We also discuss two practical issues.

8.1. Results on the NYT Dataset

HLTA is implemented in Java. The source code is available online⁴, along with the datasets used in this paper and the full details of the results obtained on them. HLTA has been tested on several datasets. One of them is the NYT dataset, which consists of 300,000 articles published in the New York Times between 1987 and 2007⁵. A vocabulary of 10,000 words was selected using average TF-IDF [48] for the analysis.
The average TF-IDF of a term t in a collection of documents D is defined as follows:

tf-idf(t, D) = ( Σ_{d ∈ D} tf(t, d) · idf(t, D) ) / |D|,   (16)

where |·| stands for the cardinality of a set, tf(t, d) is the term frequency of t in document d, and idf(t, D) = log(|D| / |{d ∈ D : t ∈ d}|) is the inverse document frequency of t in the document collection D.

A subset of 10,000 randomly sampled data cases was used in the model construction phase. Stepwise EM was used in the parameter estimation phase, and the size of the minibatches was set at 1,000. Other parameter settings are given in the next section. The analysis took around 420 minutes on a desktop machine.

The result is an HLTM with 5 levels of latent variables and 21 latent variables at the top level. Figure 8 shows a part of the model structure. Four top-level latent variables are included in the figure. The level-4 and level-2 latent variables in the subtrees rooted at the four top-level latent variables are also included.
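The vocabulary selection by Equation (16) can be sketched as follows, on tokenised toy documents (illustrative only; the helper name is hypothetical):

```python
import math
from collections import Counter

def average_tf_idf(docs):
    """Average TF-IDF of every term (Equation 16) over a collection of
    tokenised documents; terms with the highest scores form the vocabulary.

    docs: list of token lists
    returns: dict mapping term -> average TF-IDF
    """
    D = len(docs)
    df = Counter()           # document frequency of each term
    tf_total = Counter()     # total term frequency over all documents
    for doc in docs:
        counts = Counter(doc)
        for t, c in counts.items():
            df[t] += 1
            tf_total[t] += c
    # idf is constant per term, so the average factorises:
    return {t: tf_total[t] * math.log(D / df[t]) / D for t in df}

docs = [["space", "nasa", "nasa"],
        ["space", "shuttle"],
        ["recipe", "flour"]]
scores = average_tf_idf(docs)
```

Here "nasa" (frequent but concentrated in one document) scores higher than "space" (spread over two documents), which is the behaviour the criterion is designed to reward.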
⁴ http://www.cse.ust.hk/~lzhang/topic/index.htm
⁵ http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
Figure 8: Part of the hierarchical latent tree model obtained by HLTA on the NYT dataset: The nodes Z501 to Z504 are top-level latent variables. The subtrees rooted at different top-level latent variables are coded using different colors. Edge width represents mutual information.

Table 3: Part of the topic hierarchy obtained by HLTA on the NYT dataset

1. [0.20] economy stock economic market dollar
  1.1. [0.20] economy economic economist rising recession
  1.2. [0.20] currency minimum expand expansion wage
    1.2.1. [0.20] currency expand expansion expanding euro
    1.2.2. [0.23] labor union demand industries dependent
    1.2.3. [0.21] minimum wage employment retirement
  ----------------------
  1.3. [0.20] stock market investor analyst investment
  1.4. [0.20] price prices merger shareholder acquisition
  1.5. [0.20] dollar billion level maintain million lower
  1.6. [0.20] profit revenue growth troubles share revenues
  1.7. [0.20] percent average shortage soaring reduce
2. [0.22] companies company firm industry incentive
  2.1. [0.23] firm consulting distribution partner
  2.2. [0.23] companies company industry
  ----------------------
  2.3. [0.14] insurance coverage insurer pay premium
  2.4. [0.25] store stores consumer product retailer
  2.5. [0.21] proposal proposed health welfare
  2.6. [0.20] drug pharmaceutical prescription
  2.7. [0.07] federal reserve fed nasdaq composite
  2.8. [0.09] enron internal accounting collapsed
  2.9. [0.09] gas drilling exploration oil natural

Each level-2 latent variable is connected to four word variables in its subtree. Those are the word variables that have the highest MI with the latent variable among all word variables in the subtree. The structure is interesting. We see that most words in the subtree rooted at Z501 are about economy and the stock market; most words in the subtree rooted at Z502 are about companies and various industries; most words in the subtree rooted at Z503 are about movies and music; and most words in the subtree rooted at Z504 are about cooking.

Table 3 shows a part of the topic hierarchy extracted from the part of the model shown in Figure 8. The topics and the relationships among them are meaningful. For example, topic 1 is about economy and the stock market. It splits into two groups of subtopics, one on the economy and another on the stock market. Each subtopic further splits into sub-subtopics. For example, subtopic 1.2 under economy splits into three subtopics: currency expansion, labor unions and minimum wages. Topic 2 is about company-firm-industry. Its subtopics include several types of companies, such as insurance, retail stores/consumer products, natural gas/oil, drugs, and so on.

8.2. Two Practical Issues

Next we discuss two practical issues.

8.2.1. Broadly vs Narrowly Defined Topics

In HLTA, each latent variable is introduced to model a pattern of probabilistic word co-occurrence. It also gives us a topic, which is a soft cluster of documents. The size of the topic is determined by considering not only the words in the pattern, but all the words in the vocabulary.
As such, it conceptually includes two types of documents: (1) documents that contain, in a probabilistic sense, the pattern, and (2) documents that do not contain the pattern but are otherwise similar to those that do. Because of the inclusion of the second type of documents, the topic is said to be broadly defined. All the topics reported above are broadly defined.

The size of a broadly defined topic might appear unrealistically large at first glance. For example, one topic detected from the NYT dataset consists of the words affair, widely, scandal, viewed, intern, monica lewinsky, and its size is 0.18. Although this seems too large, it is actually reasonable. Obviously, the fraction of documents that contain the seven words in the topic should be much smaller than 18%. However, those documents also contain many other words, such as bill and clinton, about American politics. Other documents that contain many of those other words are also included in the topic, and hence it is not surprising for the topic to cover 18% of the corpus. As a matter of fact, there are several other topics about American politics that are of similar sizes. One of them is: corruption campaign political democratic presidency.

In some applications, it might be desirable to identify narrowly defined topics, i.e., topics made up of only the documents containing particular patterns. Such topics can be obtained as follows: First, pick a list of words to characterize a topic using the method described in Section 4; then, form a latent class model using those words as observed

Table 4: Topics detected from AAAI/IJCAI papers (2000-15) that contain the word "network".
[0.05] neural-network neural hidden-layer layer activation
[0.08] bayesian-network probabilistic-inference variable-elimination
[0.04] dynamic-bayesian-network dynamic-bayesian slice time-slice
[0.03] domingos markov-logic-network richardson-domingos
[0.06] dechter constraint-network freuder consistency-algorithm
[0.09] social-network twitter social-media social tweet
[0.03] complex-network community-structure community-detection
[0.01] semantic-network conceptnet partof
[0.09] wireless sensor-network remote radar radio beacon
[0.08] traffic transportation road driver drive road-network

variables; and finally, use the model to partition all documents into two clusters. The cluster where the words occur with relatively higher probabilities is designated as the narrow topic. The size of a narrowly defined topic is typically much smaller than that of the broadly defined version. For example, the sizes of the narrowly defined versions of the two topics from the previous paragraph are 0.008 and 0.169 respectively.

Learning a latent class model for each latent variable from scratch is time-consuming. To accelerate the process, one can calculate the parameters for the latent class model from the global HLTM, fix the conditional distributions, and update only the marginal distribution of the latent variable a number of times, say 10 times.

8.2.2. Use of N-grams as Observed Variables

In HLTMs, each observed variable is connected to only one latent variable. If individual words are used as observed variables, then each word would appear in only one branch of the resulting topic hierarchy. This is not reasonable. Take the word "network" as an example. It can appear in different terms such as "neural network", "Bayesian network", "constraint network", "social network", "sensor network", and so on. Clearly, those terms should appear in different branches in a good topic hierarchy. A method to mitigate the problem is proposed in [49].
It first treats individual words as tokens and finds the top n tokens with the highest average TF-IDF. Let PICK-TOP-TOKENS(D, n) be the subroutine that does that. The method then calculates the TF-IDF values of all 2-grams, and includes the top n 1-grams and 2-grams with the highest TF-IDF values as tokens. After that, the selected 2-grams (e.g., "social network") are replaced with single tokens (e.g., "social-network") in all the documents and the subroutine PICK-TOP-TOKENS(D, n) is run again to pick a new set of n tokens. The process can be repeated if one wants to consider n-grams with n > 2 as tokens.

The method has been applied in an analysis of papers published at AAAI and IJCAI between 2000 and 2015. Table 4 shows some of the topics from the resulting topic hierarchy. They all contain the word "network" and are from different branches of the hierarchy.

9. Empirical Comparisons

We now present empirical results to compare HLTA with LDA-based methods for hierarchical topic detection, including the nested Chinese restaurant process (nCRP) [2], the nested hierarchical Dirichlet process (nHDP) [5] and the hierarchical Pachinko allocation model (hPAM) [4]. Also included in the comparisons is CorEx [29]. CorEx produces a hierarchy of latent variables, but not a probability model over all the variables. For comparability, we convert its results into a hierarchical latent tree model.^6

9.1. Datasets

Three datasets were used in our experiments. The first one is the NYT dataset mentioned before. The second one is the 20 Newsgroups dataset (referred to as Newsgroup)^7. It consists of 19,940 newsgroup posts. The third one is the NIPS dataset, which consists of 1,955 articles published at the NIPS conference between 1988 and 1999^8. Symbols, stop words and rarely occurring words were removed from all the

^6 Let Y be a latent variable and Z_1, ..., Z_k be its children. Let Z_i^(d) be the value of Z_i in a data case d.
It is obtained via hard assignment if Z_i is a latent variable. CorEx gives the distribution P(Y | Z_1^(d), ..., Z_k^(d)) for data case d. Let 1^(d)(Z_1, ..., Z_k) be a function that takes value 1 when Z_i = Z_i^(d) for all i, and 0 otherwise. Then the expression (1/|D|) Σ_{d∈D} P(Y | Z_1^(d), ..., Z_k^(d)) 1^(d)(Z_1, ..., Z_k) defines a joint distribution over Y and Z_1, ..., Z_k. From the joint distribution, we obtain P(Z_i | Y) for each i, and also P(Y) if Y is the root.
^7 http://qwone.com/~jason/20Newsgroups/
^8 http://www.cs.nyu.edu/~roweis/data.html

Table 5: Information about the datasets used in empirical comparisons.

                   NIPS-1k   NIPS-5k   NIPS-10k   News-1k   News-5k   NYT
  Vocabulary Size  1,000     5,000     10,000     1,000     5,000     10,000
  Sample Size      1,955     1,955     1,955      19,940    19,940    300,000

datasets. Three versions of the NIPS dataset and two versions of the Newsgroup dataset were created by choosing vocabularies of different sizes using average TF-IDF. For the NYT dataset, only one version was created. So, the experiments were performed on six datasets. Information about them is given in Table 5. Each dataset has two versions: a binary version where word frequencies are discarded, and a bag-of-words version where word frequencies are kept. HLTA and CorEx were run only on the binary version because they can only process binary data. The methods nCRP, nHDP and hPAM were run on both versions, and the results are denoted as nCRP-bin, nHDP-bin, hPAM-bin, nCRP-bow, nHDP-bow and hPAM-bow respectively.

9.2. Settings

HLTA was run in two modes. In the first mode, denoted as HLTA-batch, the entire dataset was used in the model construction phase and batch EM was used in the parameter estimation phase. In the second mode, denoted as HLTA-step, a subset of N_0 randomly sampled data cases was used in the model construction phase and stepwise EM was used in the parameter estimation phase (see Section 7).
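The stepwise EM scheme just mentioned can be sketched generically. This is a hedged sketch, not HLTA's actual code: the function names, the dictionary representation of sufficient statistics, and the step-size schedule eta_t = (t + 2)^(-alpha) (a common choice in the online EM literature [45, 46]) are our assumptions.

```python
def stepwise_em(minibatches, e_step, init_stats, m_step, alpha=0.75):
    """Generic stepwise EM sketch (in the spirit of [45]).

    After each minibatch, the running sufficient statistics s are
    interpolated toward the minibatch statistics with step size
    eta_t = (t + 2) ** (-alpha); the M-step then re-estimates the
    parameters from s.  All arguments are assumptions of this sketch,
    not HLTA's actual interface.
    """
    s = dict(init_stats)
    theta = m_step(s)
    for t, batch in enumerate(minibatches):
        eta = (t + 2) ** (-alpha)
        s_hat = e_step(theta, batch)   # expected statistics on batch
        s = {k: (1 - eta) * s[k] + eta * s_hat[k] for k in s}
        theta = m_step(s)
    return theta
```

With alpha = 0.75, as used in the experiments below, the step sizes decay slowly enough that the running statistics eventually track the data while damping minibatch noise.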
In all experiments, N_0 was set at 10,000, the minibatch size at 1,000, and the parameter α in stepwise EM at 0.75. HLTA-batch was run on all the datasets, while HLTA-step was run only on the NYT and Newsgroup datasets. HLTA-step was not run on the NIPS datasets because their sample size is too small. For HLTA-batch, the number κ of iterations for batch EM was set at 50. For HLTA-step, stepwise EM was terminated after 100 updates. The other parameters of HLTA (see Algorithm 1) were set as follows in both modes: the threshold δ used in UD-tests was set at 3; the upper bound μ on island size was set at 15; and the upper bound τ on the number of top-level topics was set at 30 for NYT and 20 for all other datasets.

When extracting topics from an HLTM (see Section 4), we ignored the level-1 latent variables because the topics they give are too fine-grained and often consist of different forms of the same word (e.g., "image, images").

The LDA-based methods nCRP, nHDP and hPAM do not learn model structures. A hierarchy needs to be supplied as input. In our experiments, the height of the hierarchy was set at 3, as is usually done in the literature. The number of nodes at each level was set in such a way that nCRP, nHDP and hPAM would yield roughly the same total number of topics as HLTA. CorEx was configured similarly. We used the program packages and default parameter values provided by the authors of these algorithms. All experiments were conducted on the same desktop computer. Each experiment was repeated 3 times so that variances could be estimated.

9.3. Model Quality and Running Times

For topic models, the standard way to assess model quality is to measure log-likelihood on a held-out test set [2, 5]. In our experiments, each dataset was randomly partitioned into a training set with 80% of the data and a test set with 20% of the data.
Models were learned from the training set, and per-document loglikelihood was calculated on the held-out test set. The statistics are shown in Table 6. For comparability, only the results on binary data are included. We see that the held-out likelihood values for HLTA are drastically higher than those for all the alternative methods. This implies that the models obtained by HLTA can predict unseen data much better than those obtained by the other methods. In addition, the variances are significantly smaller for HLTA than for the other methods in most cases.

Table 7 shows the running times. We see that HLTA-step is significantly more efficient than HLTA-batch on large datasets, while there is virtually no decrease in model quality. Note that all the algorithms have parameters that control computational complexity. Thus, running time comparison is only meaningful when it is done with reference to

Table 6: Per-document held-out loglikelihood on binary data. Best scores are marked in bold. The sign "—" indicates non-termination after 72 hours, and "NR" stands for "not run".

              NIPS-1k      NIPS-5k       NIPS-10k     News-1k    News-5k    NYT
  HLTA-batch  -393 ± 0     -1,121 ± 1    -1,428 ± 1   -114 ± 0   -242 ± 0   -754 ± 1
  HLTA-step   NR           NR            NR           -114 ± 0   -243 ± 0   -755 ± 1
  nCRP-bin    -671 ± 16    -3,034 ± 135  —            —          —          —
  nHDP-bin    -1,188 ± 1   -3,272 ± 3    -4,001 ± 11  -183 ± 1   -407 ± 2   -1,530 ± 5
  hPAM-bin    -1,183 ± 3   —             —            -183 ± 2   —          —
  CorEx       -445 ± 2     -1,243 ± 2    -1,610 ± 4   -149 ± 1   -323 ± 4   —

Table 7: Running times. The sign "—" indicates non-termination after 72 hours, and "NR" stands for "not run".
  Time (min)  NIPS-1k     NIPS-5k      NIPS-10k    News-1k     News-5k     NYT
  HLTA-batch  5.6 ± 0.1   86.1 ± 3.7   318 ± 12    66.1 ± 2.9  432 ± 39    787 ± 42
  HLTA-step   NR          NR           NR          9.6 ± 0.1   133 ± 5     421 ± 17
  nCRP-bin    782 ± 39    3,608 ± 163  —           —           —           —
  nHDP-bin    152 ± 1     288 ± 16     299 ± 9     162 ± 13    263 ± 9     430 ± 42
  hPAM-bin    261 ± 17    —            —           328 ± 9     —           —
  nCRP-bow    853 ± 150   3,939 ± 301  —           —           —           —
  nHDP-bow    379 ± 14    416 ± 49     413 ± 16    250 ± 81    332 ± 16    564 ± 59
  hPAM-bow    850 ± 27    —            —           604 ± 19    —           —
  CorEx       53 ± 0.2    371 ± 23     1,190 ± 9   779 ± 34    4,287 ± 52  —

model quality. It is clear from Tables 6 and 7 that HLTA achieved much better model quality than the alternative algorithms using comparable or less time.

9.4. Quality of Topics

It has been argued that, in general, better model fit does not necessarily imply better topic quality [50]. It might therefore be more meaningful to compare the alternative methods in terms of topic quality directly. We measure topic quality using two metrics. The first one is the topic coherence score proposed by [51]. Suppose a topic t is characterized using a list W^(t) = {w_1^(t), w_2^(t), ..., w_M^(t)} of M words. The coherence score of t is given by:

    Coherence(W^(t)) = Σ_{i=2}^{M} Σ_{j=1}^{i-1} log [ (D(w_i^(t), w_j^(t)) + 1) / D(w_j^(t)) ],    (17)

where D(w_i) is the number of documents containing the word w_i, and D(w_i, w_j) is the number of documents containing both w_i and w_j. Clearly the score depends on the choice of M, and it generally decreases with M. In our experiments, we set M = 4 because some of the topics produced by HLTA have only 4 words, and hence a larger value of M would put the other methods at a disadvantage. With M fixed, a higher coherence score indicates better topic quality. The second metric we use is the topic compactness score proposed by [52].
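Eq. (17) translates directly into code. The sketch below is ours, not the evaluation code used in the experiments; it assumes documents are given as sets of words, and that every word in the topic occurs in at least one document (so D(w_j) > 0).

```python
import math

def coherence(topic_words, docs):
    """Topic coherence of Eq. (17) (after [51]).

    topic_words: the M characterizing words, most probable first.
    docs: list of sets, one set of words per document.
    """
    def d1(w):              # number of documents containing w
        return sum(w in d for d in docs)
    def d2(wi, wj):         # number of documents containing both
        return sum(wi in d and wj in d for d in docs)
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += math.log((d2(wi, wj) + 1) / d1(wj))
    return score
```

Because of the +1 smoothing in the numerator, each added word contributes non-positive terms only when it rarely co-occurs with the earlier words, which is why the score generally decreases with M.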
The compactness score of a topic t is given by:

    compactness(W^(t)) = [2 / (M(M-1))] Σ_{i=2}^{M} Σ_{j=1}^{i-1} S(w_i^(t), w_j^(t)),    (18)

where S(w_i, w_j) is the similarity between the words w_i and w_j as determined by a word2vec model [53, 54]. The word2vec model was trained on a part of the Google News dataset^9, which contains about 100 billion words; each word is mapped to a high-dimensional vector. The similarity between two words is defined as the cosine similarity of the two corresponding vectors. When calculating compactness(W^(t)), words that do not occur in the word2vec model were simply skipped.

Note that the coherence score is calculated on the corpus being analyzed. In this sense, it is an intrinsic metric. The intuition is that words in a good topic should tend to co-occur in the documents. On the other hand, the compactness score is calculated on a general and very large corpus unrelated to the corpus being analyzed. Hence it is an extrinsic metric. The intuition here is that words in a good topic should be closely related semantically.

^9 https://code.google.com/archive/p/word2vec/

Table 8: Average topic coherence scores.

              NIPS-1k       NIPS-5k       NIPS-10k      News-1k        News-5k        NYT
  HLTA-batch  -5.95 ± 0.04  -7.74 ± 0.07  -8.15 ± 0.05  -12.00 ± 0.09  -12.67 ± 0.15  -12.09 ± 0.07
  HLTA-step   NR            NR            NR            -11.66 ± 0.19  -12.08 ± 0.11  -11.97 ± 0.05
  nCRP-bow    -7.46 ± 0.31  -9.03 ± 0.16  —             —              —              —
  nHDP-bow    -7.66 ± 0.23  -9.70 ± 0.19  -10.89 ± 0.38 -13.51 ± 0.08  -13.93 ± 0.21  -12.90 ± 0.16
  hPAM-bow    -6.86 ± 0.08  —             —             -11.74 ± 0.14  —              —
  nCRP-bin    -7.01 ± 0.37  -9.83 ± 0.08  —             —              —              —
  nHDP-bin    -8.95 ± 0.11  -11.59 ± 0.12 -12.34 ± 0.11 -13.45 ± 0.05  -14.17 ± 0.08  -14.55 ± 0.07
  hPAM-bin    -6.83 ± 0.11  —             —             -12.63 ± 0.06  —              —
  CorEx       -7.20 ± 0.23  -9.76 ± 0.48  -11.96 ± 0.52 -13.49 ± 1.48  -14.71 ± 0.45  —

Table 9: Average compactness scores.
              NIPS-1k        NIPS-5k        NIPS-10k       News-1k        News-5k        NYT
  HLTA-batch  0.253 ± 0.003  0.279 ± 0.008  0.265 ± 0.001  0.239 ± 0.010  0.239 ± 0.006  0.337 ± 0.003
  HLTA-step   NR             NR             NR             0.250 ± 0.003  0.243 ± 0.002  0.338 ± 0.002
  nCRP-bow    0.163 ± 0.003  0.153 ± 0.001  —              —              —              —
  nHDP-bow    0.164 ± 0.005  0.147 ± 0.006  0.138 ± 0.002  0.150 ± 0.003  0.148 ± 0.004  0.250 ± 0.003
  hPAM-bow    0.215 ± 0.013  —              —              0.210 ± 0.006  —              —
  nCRP-bin    0.176 ± 0.005  0.137 ± 0.005  —              —              —              —
  nHDP-bin    0.119 ± 0.005  0.107 ± 0.003  0.102 ± 0.003  0.138 ± 0.003  0.134 ± 0.003  0.166 ± 0.001
  hPAM-bin    0.145 ± 0.008  —              —              0.169 ± 0.010  —              —
  CorEx       0.243 ± 0.018  0.162 ± 0.013  0.167 ± 0.003  0.185 ± 0.012  0.156 ± 0.009  —

Tables 8 and 9 show the average topic coherence and topic compactness scores of the topics produced by the various methods. For the LDA-based methods, we report scores on both the binary and bag-of-words versions of the datasets; neither version shows a distinct advantage. We see that the scores for the topics produced by HLTA are significantly higher than those obtained by the other methods in all cases.

9.5. Quality of Topic Hierarchies

Table 10: A part of the topic hierarchy produced by nHDP.
1. company business million companies money
1.1. economy economic percent government
1.2. percent stock market analyst quarter
1.2.1. stock fund market investor investment
1.2.2. economy rate rates fed economist
1.2.3. company quarter million sales analyst
1.2.4. travel ticket airline flight traveler
1.2.5. car ford sales vehicles chrysler
1.3. computer technology system software
1.4. company deal million billion stock
1.5. worker job union employees contract
1.6. project million plan official area

To the best of our knowledge, there is no metric for measuring the quality of topic hierarchies, and it is difficult to come up with one. Hence, we resort to manual comparisons.
The entire topic hierarchies produced by HLTA and nHDP on the NIPS and NYT datasets can be found at the URL mentioned at the beginning of the previous section. Table 10 shows the part of the hierarchy by nHDP that corresponds to the part of the hierarchy by HLTA shown in Table 3. In the HLTA hierarchy, the topics are nicely divided into three groups: economy, stock market, and companies. In Table 10, there is no such clear division. The topics are all mixed up, and the hierarchy does not match the semantic relationships among the topics. Overall, the topics and topic hierarchy obtained by HLTA are more meaningful than those obtained by nHDP.

10. Conclusions and Future Directions

We propose a novel method called HLTA for hierarchical topic detection. The idea is to model patterns of word co-occurrence, and co-occurrences of those patterns, using a hierarchical latent tree model. Each latent variable in an HLTM represents a soft partition of the documents, and the document clusters in the partitions are interpreted as topics. Each topic is characterized using the words that occur with high probability in documents belonging to the topic and with low probability in documents not belonging to it. Progressive EM is used to accelerate parameter learning for the intermediate models created during model construction, and stepwise EM is used to accelerate parameter learning for the final model.

Empirical results show that HLTA outperforms nHDP, the state-of-the-art LDA-based method for hierarchical topic detection, in terms of overall model fit and the quality of topics and topic hierarchies, while taking no more time than the latter.

HLTA treats words as binary variables, and each word is allowed to appear in only one branch of a hierarchy. For future work, it would be interesting to extend HLTA so that it can handle count data and so that a word is allowed to appear in multiple branches of the hierarchy.
Another direction is to further scale up HLTA via distributed computing and by other means.

Acknowledgments

We thank John Paisley for sharing the nHDP implementation with us, and we thank Jun Zhu for valuable discussions. Research on this article was supported by the Hong Kong Research Grants Council under grants 16202515 and 16212516, and the Hong Kong Institute of Education under project RG90/2014-2015R.

References

[1] D. M. Blei, T. Griffiths, M. Jordan, J. Tenenbaum, Hierarchical topic models and the nested Chinese restaurant process, Advances in Neural Information Processing Systems 16 (2004) 106–114.
[2] D. M. Blei, T. Griffiths, M. Jordan, The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies, Journal of the ACM 57 (2) (2010) 7:1–7:30.
[3] W. Li, A. McCallum, Pachinko allocation: DAG-structured mixture models of topic correlations, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 577–584.
[4] D. Mimno, W. Li, A. McCallum, Mixtures of hierarchical topics with pachinko allocation, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 633–640.
[5] J. Paisley, C. Wang, D. M. Blei, M. Jordan, et al., Nested hierarchical Dirichlet processes, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (2) (2015) 256–270.
[6] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
[7] J. Lafferty, D. M. Blei, Correlated topic models, in: Advances in Neural Information Processing Systems, 2006, pp. 147–155.
[8] D. M. Blei, J. D. Lafferty, Dynamic topic models, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 113–120.
[9] X. Wang, A.
McCallum, Topics over time: a non-Markov continuous-time model of topical trends, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2006, pp. 424–433.
[10] A. Ahmed, E. P. Xing, Dynamic non-parametric mixture models and the recurrent Chinese restaurant process, Carnegie Mellon University, School of Computer Science, Machine Learning Department, 2007.
[11] Y. W. Teh, M. I. Jordan, M. J. Beal, D. M. Blei, Hierarchical Dirichlet processes, Journal of the American Statistical Association 101 (476).
[12] D. Andrzejewski, X. Zhu, M. Craven, Incorporating domain knowledge into topic modeling via Dirichlet forest priors, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 25–32.
[13] J. Jagarlamudi, H. Daumé III, R. Udupa, Incorporating lexical priors into topic models, in: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012, pp. 204–213.
[14] J. D. Mcauliffe, D. M. Blei, Supervised topic models, in: Advances in Neural Information Processing Systems, 2008, pp. 121–128.
[15] A. J. Perotte, F. Wood, N. Elhadad, N. Bartlett, Hierarchically supervised latent Dirichlet allocation, in: Advances in Neural Information Processing Systems, 2011, pp. 2609–2617.
[16] N. L. Zhang, Hierarchical latent class models for cluster analysis, in: Proceedings of the 18th National Conference on Artificial Intelligence, 2002.
[17] N. L. Zhang, Hierarchical latent class models for cluster analysis, Journal of Machine Learning Research 5 (2004) 697–723.
[18] N. L. Zhang, S. Yuan, T. Chen, Y. Wang, Latent tree models and diagnosis in traditional Chinese medicine, Artificial Intelligence in Medicine 42 (3) (2008) 229–245.
[19] Y. Wang, N. L. Zhang, T. Chen, Latent tree models and approximate inference in Bayesian networks, in: AAAI, 2008.
[20] P. F. Lazarsfeld, N.
W. Henry, Latent Structure Analysis, Houghton Mifflin, Boston, 1968.
[21] M. Knott, D. J. Bartholomew, Latent Variable Models and Factor Analysis, no. 7, Edward Arnold, 1999.
[22] R. Durbin, S. R. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.
[23] R. Mourad, C. Sinoquet, N. L. Zhang, T. Liu, P. Leray, et al., A survey on latent tree models and applications, Journal of Artificial Intelligence Research 47 (2013) 157–203.
[24] M. J. Choi, V. Y. F. Tan, A. Anandkumar, A. S. Willsky, Learning latent tree graphical models, Journal of Machine Learning Research 12 (2011) 1771–1812.
[25] T. Chen, N. L. Zhang, T. Liu, K. M. Poon, Y. Wang, Model-based multidimensional clustering of categorical data, Artificial Intelligence 176 (2012) 2246–2269.
[26] T. Liu, N. L. Zhang, P. Chen, A. H. Liu, K. M. Poon, Y. Wang, Greedy learning of latent tree models for multidimensional clustering, Machine Learning 98 (1-2) (2013) 301–330.
[27] T. Liu, N. L. Zhang, P. Chen, Hierarchical latent tree analysis for topic detection, in: ECML/PKDD, Springer, 2014, pp. 256–272.
[28] P. Chen, N. L. Zhang, L. K. Poon, Z. Chen, Progressive EM for latent tree models and hierarchical topic detection, in: AAAI, 2016.
[29] G. Ver Steeg, A. Galstyan, Discovering structure in high-dimensional data through correlation explanation, in: Advances in Neural Information Processing Systems 27, 2014, pp. 577–585.
[30] M. Steinbach, G. Karypis, V. Kumar, et al., A comparison of document clustering techniques, in: KDD Workshop on Text Mining, Vol. 400, Boston, 2000, pp. 525–526.
[31] I. S. Dhillon, Co-clustering documents and words using bipartite spectral graph partitioning, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2001, pp. 269–274.
[32] J.
Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
[33] L. Poon, N. L. Zhang, T. Chen, Y. Wang, Variable selection in model-based clustering: To do or to facilitate, in: ICML-10, 2010, pp. 887–894.
[34] S. Kirshner, Latent tree copulas, in: Sixth European Workshop on Probabilistic Graphical Models, Granada, 2012.
[35] L. Song, H. Liu, A. Parikh, E. Xing, Nonparametric latent tree graphical models: Inference, estimation, and structure learning, arXiv preprint arXiv:1401.3940.
[36] T. M. Cover, J. A. Thomas, Elements of Information Theory, 2nd Edition, Wiley, 2006.
[37] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B 39 (1) (1977) 1–38.
[38] G. Schwarz, Estimating the dimension of a model, The Annals of Statistics 6 (1978) 461–464.
[39] S. Harmeling, C. K. I. Williams, Greedy learning of binary latent trees, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (6) (2011) 1087–1097.
[40] C. K. Chow, C. N. Liu, Approximating discrete probability distributions with dependence trees, IEEE Transactions on Information Theory 14 (3) (1968) 462–467.
[41] A. E. Raftery, Bayesian model selection in social research, Sociological Methodology 25 (1995) 111–163.
[42] C. K. Chow, C. N. Liu, Approximating discrete probability distributions with dependence trees, IEEE Transactions on Information Theory 14 (3) (1968) 462–467.
[43] K. P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.
[44] M.-A. Sato, S. Ishii, On-line EM algorithm for the normalized Gaussian network, Neural Computation 12 (2) (2000) 407–432.
[45] O. Cappé, E. Moulines, On-line expectation-maximization algorithm for latent data models, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (3) (2009) 593–613.
[46] P. Liang, D.
Klein, Online EM for unsupervised models, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 611–619.
[47] O. Bousquet, L. Bottou, The tradeoffs of large scale learning, in: Advances in Neural Information Processing Systems, 2008, pp. 161–168.
[48] J. D. Ullman, J. Leskovec, A. Rajaraman, Mining of Massive Datasets (2011).
[49] L. K. Poon, N. L. Zhang, Topic browsing for research papers with hierarchical latent tree analysis, arXiv preprint arXiv:1609.09188.
[50] J. Chang, S. Gerrish, C. Wang, J. L. Boyd-graber, D. M. Blei, Reading tea leaves: How humans interpret topic models, in: Advances in Neural Information Processing Systems, 2009, pp. 288–296.
[51] D. Mimno, H. M. Wallach, E. Talley, M. Leenders, A. McCallum, Optimizing semantic coherence in topic models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 262–272.
[52] Z. Chen, N. L. Zhang, D.-Y. Yeung, P. Chen, Sparse Boltzmann machines with structure learning as applied to text analysis, arXiv preprint.
[53] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: International Conference on Learning Representations Workshops, 2013.
[54] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems 26, 2013, pp. 3111–3119.
