Generative Topic Embedding: a Continuous Representation of Documents (Extended Version with Proofs)
Authors: Shaohua Li, Tat-Seng Chua, Jun Zhu, Chunyan Miao
Shaohua Li^{1,2}, Tat-Seng Chua^1, Jun Zhu^3, Chunyan Miao^2
shaohua@gmail.com, dcscts@nus.edu.sg, dcszj@tsinghua.edu.cn, ascymiao@ntu.edu.sg
1. School of Computing, National University of Singapore
2. Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY)
3. Department of Computer Science and Technology, Tsinghua University

Abstract

Word embedding maps words into a low-dimensional continuous embedding space by exploiting the local word collocation patterns in a small context window. On the other hand, topic modeling maps documents onto a low-dimensional topic space, by utilizing the global word collocation patterns in the same document. These two types of patterns are complementary. In this paper, we propose a generative topic embedding model to combine the two types of patterns. In our model, topics are represented by embedding vectors, and are shared across documents. The probability of each word is influenced by both its local context and its topic. A variational inference method yields the topic embeddings as well as the topic mixing proportions for each document. Jointly they represent the document in a low-dimensional continuous space. In two document classification tasks, our method performs better than eight existing methods, with fewer features. In addition, we illustrate with an example that our method can generate coherent topics even based on only one document.

1 Introduction

Representing documents as fixed-length feature vectors is important for many document processing algorithms. Traditionally documents are represented as bag-of-words (BOW) vectors. However, this simple representation suffers from being high-dimensional and highly sparse, and loses semantic relatedness across the vector dimensions.
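To make the dimensionality and sparsity concern concrete, here is a minimal sketch of an unweighted BOW count-vector representation on a hypothetical three-document corpus; note that the vocabulary size, not the document length, determines the dimensionality:

```python
from collections import Counter

# Hypothetical toy corpus; a real corpus has a vocabulary of tens of thousands of words.
docs = [
    "the drug price increase raises protests",
    "the company defends the price increase",
    "investors applaud the acquisition",
]

# Build the vocabulary from the corpus.
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc, vocab):
    """Unweighted count vector: one dimension per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [bow_vector(d, vocab) for d in docs]
# Each vector has len(vocab) dimensions; most entries are zero.
nonzero = sum(v > 0 for v in vectors[0])
```

Even on this toy corpus, each document uses only a small fraction of the dimensions; with a realistic vocabulary the fraction is far smaller.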
Word embedding methods have been demonstrated to be an effective way to represent words as continuous vectors in a low-dimensional embedding space (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014; Levy et al., 2015). The learned embedding for a word encodes its semantic/syntactic relatedness with other words, by utilizing local word collocation patterns. In each method, one core component is the embedding link function, which predicts a word's distribution given its context words, parameterized by their embeddings.

When it comes to documents, we wish to find a method to encode their overall semantics. Given the embeddings of each word in a document, we can imagine the document as a "bag-of-vectors". Related words in the document point in similar directions, forming semantic clusters. The centroid of a semantic cluster corresponds to the most representative embedding of this cluster of words, referred to as the semantic centroid. We could use these semantic centroids and the number of words around them to represent a document.

In addition, for a set of documents in a particular domain, some semantic clusters may appear in many documents. By learning collocation patterns across the documents, the derived semantic centroids could be more topical and less noisy.

Topic models, represented by Latent Dirichlet Allocation (LDA) (Blei et al., 2003), are able to group words into topics according to their collocation patterns across documents. When the corpus is large enough, such patterns reflect their semantic relatedness, hence topic models can discover coherent topics. The probability of a word is governed by its latent topic, which is modeled as a categorical distribution in LDA. Typically, only a small number of topics are present in each document, and only a small number of words have high probability in each topic. This intuition motivated Blei et al.
(2003) to regularize the topic distributions with Dirichlet priors.

Semantic centroids have the same nature as topics in LDA, except that the former exist in the embedding space. This similarity drives us to seek the common semantic centroids with a model similar to LDA. We extend a generative word embedding model, PSDVec (Li et al., 2015), by incorporating topics into it. The new model is named TopicVec. In TopicVec, an embedding link function models the word distribution in a topic, in place of the categorical distribution in LDA. The advantage of the link function is that the semantic relatedness is already encoded as the cosine distance in the embedding space. Similar to LDA, we regularize the topic distributions with Dirichlet priors. A variational inference algorithm is derived. The learning process derives topic embeddings in the same embedding space as words. These topic embeddings aim to approximate the underlying semantic centroids.

To evaluate how well TopicVec represents documents, we performed two document classification tasks against eight existing topic modeling or document representation methods. Two setups of TopicVec outperformed all other methods on the two tasks, respectively, with fewer features. In addition, we demonstrate that TopicVec can derive coherent topics based on only one document, which is not possible for topic models. The source code of our implementation is available at https://github.com/askerlee/topicvec.

2 Related Work

Li et al. (2015) proposed a generative word embedding method, PSDVec, which is the precursor of TopicVec. PSDVec assumes that the conditional distribution of a word given its context words can be factorized approximately into independent log-bilinear terms. In addition, the word embeddings and regression residuals are regularized by Gaussian priors, reducing their chance of overfitting.
The model inference is approached by an efficient eigendecomposition and blockwise-regression method (Li et al., 2016). TopicVec differs from PSDVec in the conditional distribution of a word: the word is influenced not only by its context words, but also by a topic, which is an embedding vector indexed by a latent variable drawn from a Dirichlet-Multinomial distribution.

Hinton and Salakhutdinov (2009) proposed to model topics as a certain number of binary hidden variables, which interact with all words in the document through weighted connections. Larochelle and Lauly (2012) assigned each word a unique topic vector, which is a summarization of the context of the current word.

Huang et al. (2012) proposed to incorporate global (document-level) semantic information to help the learning of word embeddings. The global embedding is simply a weighted average of the embeddings of words in the document.

Le and Mikolov (2014) proposed Paragraph Vector. It assumes each piece of text has a latent paragraph vector, which influences the distributions of all words in this text, in the same way as a latent word. It can be viewed as a special case of TopicVec, with the topic number set to 1. Typically, however, a document consists of multiple semantic centroids, and the limitation of only one topic may lead to underfitting.

Nguyen et al. (2015) proposed Latent Feature Topic Modeling (LFTM), which extends LDA to incorporate word embeddings as latent features. The topic is modeled as a mixture of the conventional categorical distribution and an embedding link function. The coupling between these two components makes the inference difficult. They designed a Gibbs sampler for model inference. Their implementation^1 is slow and infeasible when applied to a large corpus.

Liu et al. (2015) proposed Topical Word Embedding (TWE), which combines word embedding with LDA in a simple and effective way.
They train word embeddings and a topic model separately on the same corpus, and then average the embeddings of words in the same topic to get the embedding of this topic. The topic embedding is concatenated with the word embedding to form the topical word embedding of a word. In the end, the topical word embeddings of all words in a document are averaged to be the embedding of the document. This method performs well on our two classification tasks. Weaknesses of TWE include: 1) the way to combine the results of word embedding and LDA lacks statistical foundations; 2) the LDA module requires a large corpus to derive semantically coherent topics.

Das et al. (2015) proposed Gaussian LDA. It uses pre-trained word embeddings. It assumes that words in a topic are random samples from a multivariate Gaussian distribution with the topic embedding as the mean. Hence the probability that a word belongs to a topic is determined by the Euclidean distance between the word embedding and the topic embedding. This assumption might be improper, as the Euclidean distance is not an optimal measure of semantic relatedness between two embeddings^2.

^1 https://github.com/datquocnguyen/LFTM/

Table 1: Table of notations

Name             Description
S                Vocabulary {s_1, ..., s_W}
V                Embedding matrix (v_{s_1}, ..., v_{s_W})
D                Document set {d_1, ..., d_M}
v_{s_i}          Embedding of word s_i
a_{s_i s_j}, A   Bigram residuals
t_{ik}, T_i      Topic embeddings in doc d_i
r_{ik}, r_i      Topic residuals in doc d_i
z_{ij}           Topic assignment of the j-th word in doc d_i
φ_i              Mixing proportions of topics in doc d_i

3 Notations and Definitions

Throughout this paper, we use uppercase bold letters such as S, V to denote a matrix or set, lowercase bold letters such as v_{w_i} to denote a vector, a normal uppercase letter such as N, W to denote a scalar constant, and a normal lowercase letter such as s_i, w_i to denote a scalar variable. Table 1 lists the notations used in this paper.
In a document, a sequence of words is referred to as a text window, denoted by w_i, ..., w_{i+l}, or w_i : w_{i+l}. A text window of chosen size c before a word w_i defines the context of w_i as w_{i-c}, ..., w_{i-1}. Here w_i is referred to as the focus word. Each context word w_{i-j} and the focus word w_i comprise a bigram w_{i-j}, w_i.

We assume each word in a document is semantically similar to a topic embedding. Topic embeddings reside in the same N-dimensional space as word embeddings. When it is clear from context, topic embeddings are often referred to as topics. Each document has K candidate topics, arranged in the matrix form T_i = (t_{i1} ... t_{iK}), referred to as the topic matrix. Specifically, we fix t_{i1} = 0, referring to it as the null topic.

In a document d_i, each word w_{ij} is assigned to a topic indexed by z_{ij} ∈ {1, ..., K}. Geometrically this means the embedding v_{w_{ij}} tends to align with the direction of t_{i,z_{ij}}. Each topic t_{ik} has a document-specific prior probability of being assigned to a word, denoted as φ_{ik} = P(k | d_i). The vector φ_i = (φ_{i1}, ..., φ_{iK}) is referred to as the mixing proportions of these topics in document d_i.

^2 Almost all modern word embedding methods adopt the exponentiated cosine similarity as the link function, hence the cosine similarity may be assumed to be a better estimate of the semantic relatedness between embeddings derived from these methods.

4 Link Function of Topic Embedding

In this section, we formulate the distribution of a word given its context words and topic, in the form of a link function.

The core of most word embedding methods is a link function that connects the embeddings of a focus word and its context words, to define the distribution of the focus word. Li et al. (2015) proposed the following link function:

$$P(w_c \mid w_0{:}w_{c-1}) \approx P(w_c)\exp\Big\{ v_{w_c}^\top \sum_{l=0}^{c-1} v_{w_l} + \sum_{l=0}^{c-1} a_{w_l w_c} \Big\}.$$
(1)

Here a_{w_l w_c} is referred to as the bigram residual, indicating the non-linear part not captured by $v_{w_c}^\top v_{w_l}$. It is essentially the logarithm of the normalizing constant of a softmax term. Some literature, e.g. (Pennington et al., 2014), refers to such a term as a bias term.

(1) is based on the assumption that the conditional distribution P(w_c | w_0 : w_{c-1}) can be factorized approximately into independent log-bilinear terms, each corresponding to a context word. This approximation leads to an efficient and effective word embedding algorithm, PSDVec (Li et al., 2015). We follow this assumption, and propose to incorporate the topic of w_c in a way like a latent word. In particular, in addition to the context words, the corresponding topic embedding t_{ik} is included as a new log-bilinear term that influences the distribution of w_c. Hence we obtain the following extended link function:

$$P(w_c \mid w_0{:}w_{c-1}, z_c, d_i) \approx P(w_c)\cdot\exp\Big\{ v_{w_c}^\top\Big(\sum_{l=0}^{c-1} v_{w_l} + t_{z_c}\Big) + \sum_{l=0}^{c-1} a_{w_l w_c} + r_{z_c} \Big\}, \quad (2)$$

where d_i is the current document, and r_{z_c} is the logarithm of the normalizing constant, named the topic residual. Note that the topic embedding t_{z_c} may be specific to d_i. For simplicity of notation, we drop the document index in t_{z_c}. To restrict the impact of topics and avoid overfitting, we constrain the magnitudes of all topic embeddings, so that they are always within a hyperball of radius γ.

[Figure 1: Graphical representation of TopicVec in plate notation. Word embeddings V and bigram residuals A are drawn from Gaussian priors; for each document d ∈ D, the mixing proportions θ_d are drawn from a Dirichlet prior with hyperparameter α, each topic assignment z_c from a multinomial, and each word w_c ∈ d is generated given its context, its topic embedding t, and the residuals.]

It is infeasible to compute the exact value of the topic residual r_k. We approximate it by setting the context size c = 0. Then (2) becomes:

$$P(w_c \mid k, d_i) = P(w_c)\exp\big\{ v_{w_c}^\top t_k + r_k \big\}. \quad (3)$$

It is required that $\sum_{w_c \in S} P(w_c \mid k) = 1$ to make (3) a distribution.
It follows that

$$r_k = -\log \sum_{s_j \in S} P(s_j) \exp\{ v_{s_j}^\top t_k \}. \quad (4)$$

(4) can be expressed in matrix form:

$$r = -\log\big( u \exp\{ V^\top T \} \big), \quad (5)$$

where u is the row vector of unigram probabilities.

5 Generative Process and Likelihood

The generative process of words in documents can be regarded as a hybrid of LDA and PSDVec. Analogous to PSDVec, the word embedding v_{s_i} and residual a_{s_i s_j} are drawn from respective Gaussians. For the sake of clarity, we omit their generation steps and focus on the topic embeddings. The remaining generative process is as follows:

1. For the k-th topic, draw a topic embedding uniformly from a hyperball of radius γ, i.e. t_k ~ Unif(B_γ);
2. For each document d_i:
   (a) Draw the mixing proportions φ_i from the Dirichlet prior Dir(α);
   (b) For the j-th word:
       i. Draw the topic assignment z_{ij} from the categorical distribution Cat(φ_i);
       ii. Draw the word w_{ij} from S according to P(w_{ij} | w_{i,j-c} : w_{i,j-1}, z_{ij}, d_i).

The above generative process is presented in plate notation in Figure 1.

5.1 Likelihood Function

Given the embeddings V, the bigram residuals A, the topics T_i and the hyperparameter α, the complete-data likelihood of a single document d_i is:

$$\begin{aligned} p(d_i, Z_i, \phi_i \mid \alpha, V, A, T_i) &= p(\phi_i \mid \alpha)\, p(Z_i \mid \phi_i)\, p(d_i \mid V, A, T_i, Z_i) \\ &= \frac{\Gamma(\sum_{k=1}^K \alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{j=1}^{K} \phi_{ij}^{\alpha_j - 1} \cdot \prod_{j=1}^{L_i} \Big( \phi_{i,z_{ij}} P(w_{ij}) \\ &\quad \cdot \exp\Big\{ v_{w_{ij}}^\top \Big( \sum_{l=j-c}^{j-1} v_{w_{il}} + t_{z_{ij}} \Big) + \sum_{l=j-c}^{j-1} a_{w_{il} w_{ij}} + r_{i,z_{ij}} \Big\} \Big), \quad (6) \end{aligned}$$

where Z_i = (z_{i1}, ..., z_{iL_i}), and Γ(·) is the Gamma function. Let Z, T, φ denote the collection of all the document-specific {Z_i}_{i=1}^M, {T_i}_{i=1}^M, {φ_i}_{i=1}^M, respectively.
Then the complete-data likelihood of the whole corpus is:

$$\begin{aligned} p(D, A, V, Z, T, \phi \mid \alpha, \gamma, \mu) &= \prod_{i=1}^{W} P(v_{s_i}; \mu_i) \prod_{i,j=1}^{W,W} P(a_{s_i s_j}; f(h_{ij})) \prod_{k=1}^{K} \mathrm{Unif}(B_\gamma) \\ &\quad \cdot \prod_{i=1}^{M} \big\{ p(\phi_i \mid \alpha)\, p(Z_i \mid \phi_i)\, p(d_i \mid V, A, T_i, Z_i) \big\} \\ &= \frac{1}{Z(H, \mu)\, U_\gamma^K} \exp\Big\{ -\sum_{i,j=1}^{W,W} f(h_{ij})\, a_{s_i s_j}^2 - \sum_{i=1}^{W} \mu_i \| v_{s_i} \|^2 \Big\} \\ &\quad \cdot \prod_{i=1}^{M} \frac{\Gamma(\sum_{k=1}^K \alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{j=1}^{K} \phi_{ij}^{\alpha_j - 1} \cdot \prod_{j=1}^{L_i} \phi_{i,z_{ij}} P(w_{ij}) \\ &\quad \cdot \exp\Big\{ v_{w_{ij}}^\top \Big( \sum_{l=j-c}^{j-1} v_{w_{il}} + t_{z_{ij}} \Big) + \sum_{l=j-c}^{j-1} a_{w_{il} w_{ij}} + r_{i,z_{ij}} \Big\}, \quad (7) \end{aligned}$$

where P(v_{s_i}; μ_i) and P(a_{s_i s_j}; f(h_{ij})) are the two Gaussian priors as defined in (Li et al., 2015). Following the convention in (Li et al., 2015), h_{ij}, H are empirical bigram probabilities, μ are the embedding magnitude penalty coefficients, and Z(H, μ) is the normalizing constant for word embeddings. U_γ is the volume of the hyperball of radius γ.

Taking the logarithm of both sides, we obtain

$$\begin{aligned} \log p(D, A, V, Z, T, \phi \mid \alpha, \gamma, \mu) = C_0 &- \log Z(H, \mu) - \| A \|_{f(H)}^2 - \sum_{i=1}^{W} \mu_i \| v_{s_i} \|^2 \\ &+ \sum_{i=1}^{M} \Big\{ \sum_{k=1}^{K} \log \phi_{ik} \, (m_{ik} + \alpha_k - 1) + \sum_{j=1}^{L_i} \Big( r_{i,z_{ij}} \\ &\quad + v_{w_{ij}}^\top \Big( \sum_{l=j-c}^{j-1} v_{w_{il}} + t_{z_{ij}} \Big) + \sum_{l=j-c}^{j-1} a_{w_{il} w_{ij}} \Big) \Big\}, \quad (8) \end{aligned}$$

where $m_{ik} = \sum_{j=1}^{L_i} \delta(z_{ij} = k)$ counts the number of words assigned to the k-th topic in d_i, and $C_0 = M \log \frac{\Gamma(\sum_{k=1}^K \alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} + \sum_{i,j=1}^{M,L_i} \log P(w_{ij}) - K \log U_\gamma$ is constant given the hyperparameters.

6 Variational Inference Algorithm

6.1 Learning Objective and Process

Given the hyperparameters α, γ, μ, the learning objective is to find the embeddings V, the topics T, and the word-topic and document-topic distributions p(Z_i, φ_i | d_i, A, V, T). Here the hyperparameters α, γ, μ are kept constant, and we make them implicit in the distribution notations.

However, the coupling between A, V and T, Z, φ makes it inefficient to optimize them simultaneously.
To get around this difficulty, we learn word embeddings and topic embeddings separately. Specifically, the learning process is divided into two stages:

1. In the first stage, considering that the topics have a relatively small impact on word distributions, and that this impact might be "averaged out" across different documents, we simplify the model by ignoring topics temporarily. The model then falls back to the original PSDVec, and the optimal solution V*, A* is obtained accordingly;
2. In the second stage, we treat V*, A* as constant, plug them into the likelihood function, and find the corresponding optimal T*, p(Z, φ | D, A*, V*, T*) of the full model. As in LDA, this posterior is analytically intractable, and we use a simpler variational distribution q(Z, φ) to approximate it.

6.2 Mean-Field Approximation and Variational GEM Algorithm

In this stage, we fix V = V*, A = A*, and seek the optimal T*, p(Z, φ | D, A*, V*, T*). As V*, A* are constant, we also make them implicit in the following expressions.

For an arbitrary variational distribution q(Z, φ), the following equalities hold:

$$\mathbb{E}_q\Big[\log \frac{p(D, Z, \phi \mid T)}{q(Z, \phi)}\Big] = \mathbb{E}_q[\log p(D, Z, \phi \mid T)] + \mathcal{H}(q) = \log p(D \mid T) - \mathrm{KL}(q \,\|\, p), \quad (9)$$

where p = p(Z, φ | D, T) and H(q) is the entropy of q. This implies

$$\mathrm{KL}(q \,\|\, p) = \log p(D \mid T) - \big( \mathbb{E}_q[\log p(D, Z, \phi \mid T)] + \mathcal{H}(q) \big) = \log p(D \mid T) - \mathcal{L}(q, T). \quad (10)$$

In (10), $\mathbb{E}_q[\log p(D, Z, \phi \mid T)] + \mathcal{H}(q)$ is usually referred to as the variational free energy L(q, T), which is a lower bound of log p(D | T). Directly maximizing log p(D | T) w.r.t. T is intractable due to the hidden variables Z, φ, so we maximize its lower bound L(q, T) instead. We adopt a mean-field approximation of the true posterior as the variational distribution, and use a variational algorithm to find q*, T* maximizing L(q, T).
The following variational distribution is used:

$$q(Z, \phi; \pi, \theta) = q(\phi; \theta)\, q(Z; \pi) = \prod_{i=1}^{M} \Big( \mathrm{Dir}(\phi_i; \theta_i) \prod_{j=1}^{L_i} \mathrm{Cat}(z_{ij}; \pi_{ij}) \Big). \quad (11)$$

We can obtain (Appendix A):

$$\mathcal{L}(q, T) = \sum_{i=1}^{M} \Big\{ \sum_{k=1}^{K} \Big( \sum_{j=1}^{L_i} \pi_{ij}^k + \alpha_k - 1 \Big) \big( \psi(\theta_{ik}) - \psi(\theta_{i0}) \big) + \mathrm{Tr}\Big( T_i^\top \sum_{j=1}^{L_i} v_{w_{ij}} \pi_{ij}^\top \Big) + r_i^\top \sum_{j=1}^{L_i} \pi_{ij} \Big\} + \mathcal{H}(q) + C_1, \quad (12)$$

where T_i is the topic matrix of the i-th document, and r_i is the vector constructed by concatenating all the topic residuals r_{ik}. $C_1 = C_0 - \log Z(H, \mu) - \|A\|^2_{f(H)} - \sum_{i=1}^{W} \mu_i \|v_{s_i}\|^2 + \sum_{i,j=1}^{M,L_i} \big( v_{w_{ij}}^\top \sum_{k=j-c}^{j-1} v_{w_{ik}} + \sum_{k=j-c}^{j-1} a_{w_{ik} w_{ij}} \big)$ is constant.

We proceed to optimize (12) with a Generalized Expectation-Maximization (GEM) algorithm w.r.t. q and T as follows:

1. Initialize all the topics T_i = 0, and correspondingly their residuals r_i = 0;
2. Iterate over the following two steps until convergence. In the l-th step:
   (a) Let the topics and residuals be T = T^{(l-1)}, r = r^{(l-1)}, and find q^{(l)}(Z, φ) that maximizes L(q, T^{(l-1)}). This is the Expectation step (E-step). In this step, log p(D | T) is constant. Then the q that maximizes L(q, T^{(l-1)}) will minimize KL(q || p), i.e. such a q is the closest variational distribution to p as measured by KL-divergence;
   (b) Given the variational distribution q^{(l)}(Z, φ), find T^{(l)}, r^{(l)} that improve L(q^{(l)}, T), using a gradient descent method. This is the generalized Maximization step (M-step). In this step, π, θ, H(q) are constant.

6.2.1 Update Equations of π, θ in the E-Step

In the E-step, T = T^{(l-1)}, r = r^{(l-1)} are constant. Taking the derivative of L(q, T^{(l-1)}) w.r.t. π_{ij}^k and θ_{ik}, respectively, we can obtain the optimal solutions (Appendix B):

$$\pi_{ij}^k \propto \exp\{ \psi(\theta_{ik}) + v_{w_{ij}}^\top t_{ik} + r_{ik} \}, \quad (13)$$

$$\theta_{ik} = \sum_{j=1}^{L_i} \pi_{ij}^k + \alpha_k.$$
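The E-step updates (13) and (14) are simple enough to sketch directly. The following is a minimal NumPy/SciPy illustration for a single document, with made-up toy dimensions and random embeddings (`V_doc` holds the embeddings of the document's words, one per row); it is a sketch of the update equations, not the authors' implementation:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
N, K, L = 50, 4, 30               # embedding dim, topic count, document length (toy)
V_doc = rng.normal(size=(L, N))   # embeddings v_{w_ij} of the document's words
T = rng.normal(size=(N, K))       # topic embeddings t_k (columns)
r = rng.normal(size=K)            # topic residuals r_k
alpha = np.full(K, 0.1)           # Dirichlet hyperparameter
theta = np.full(K, 1.0)           # variational Dirichlet parameters

for _ in range(10):
    # (13): pi_ij^k ∝ exp{ psi(theta_ik) + v_{w_ij}^T t_ik + r_ik }
    log_pi = digamma(theta) + V_doc @ T + r        # shape (L, K)
    log_pi -= log_pi.max(axis=1, keepdims=True)    # stabilize before exponentiating
    pi = np.exp(log_pi)
    pi /= pi.sum(axis=1, keepdims=True)            # normalize each row
    # (14): theta_ik = sum_j pi_ij^k + alpha_k
    theta = pi.sum(axis=0) + alpha
```

Each row of `pi` is a categorical posterior over topics for one word, and `theta` accumulates the expected topic counts plus the Dirichlet prior.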
(14)

6.2.2 Update Equation of T_i in the M-Step

In the generalized M-step, π = π^{(l)}, θ = θ^{(l)} are constant. For notational simplicity, we drop their superscripts (l).

To update T_i, we first take the derivative of (12) w.r.t. T_i, and then apply gradient descent. The derivative is obtained as (Appendix C):

$$\frac{\partial \mathcal{L}(q^{(l)}, T)}{\partial T_i} = \sum_{j=1}^{L_i} v_{w_{ij}} \pi_{ij}^\top + \sum_{k=1}^{K} \bar{m}_{ik} \frac{\partial r_{ik}}{\partial T_i}, \quad (15)$$

where $\bar{m}_{ik} = \sum_{j=1}^{L_i} \pi_{ij}^k = \mathbb{E}[m_{ik}]$ is the sum of the variational probabilities of each word being assigned to the k-th topic in the i-th document, and $\frac{\partial r_{ik}}{\partial T_i}$ is a gradient matrix whose j-th column is $\frac{\partial r_{ik}}{\partial t_{ij}}$.

Recall that $r_{ik} = -\log \mathbb{E}_{P(s)}[\exp\{ v_s^\top t_{ik} \}]$. When j ≠ k, it is easy to verify that $\frac{\partial r_{ik}}{\partial t_{ij}} = 0$. When j = k, we have

$$\frac{\partial r_{ik}}{\partial t_{ik}} = -e^{r_{ik}} \cdot \mathbb{E}_{P(s)}\big[\exp\{ v_s^\top t_{ik} \}\, v_s\big] = -e^{r_{ik}} \sum_{s \in S} P(s) \exp\{ v_s^\top t_{ik} \}\, v_s = -e^{r_{ik}} \cdot (u \circ V)\exp\{ V^\top t_{ik} \}, \quad (16)$$

where u ∘ V multiplies each column of V with the corresponding entry of u.

Therefore $\frac{\partial r_{ik}}{\partial T_i} = \big( 0, \cdots, \frac{\partial r_{ik}}{\partial t_{ik}}, \cdots, 0 \big)$. Plugging it into (15), we obtain

$$\frac{\partial \mathcal{L}(q^{(l)}, T)}{\partial T_i} = \sum_{j=1}^{L_i} v_{w_{ij}} \pi_{ij}^\top + \Big( \bar{m}_{i1} \frac{\partial r_{i1}}{\partial t_{i1}}, \cdots, \bar{m}_{iK} \frac{\partial r_{iK}}{\partial t_{iK}} \Big).$$

We proceed to optimize T_i with a gradient descent method:

$$T_i^{(l)} = T_i^{(l-1)} + \lambda(l, L_i) \frac{\partial \mathcal{L}(q^{(l)}, T)}{\partial T_i},$$

where $\lambda(l, L_i) = \frac{L_0 \lambda_0}{l \cdot \max\{L_i, L_0\}}$ is the learning rate function, L_0 is a pre-specified document length threshold, and λ_0 is the initial learning rate. As the magnitude of $\frac{\partial \mathcal{L}(q^{(l)}, T)}{\partial T_i}$ is approximately proportional to the document length L_i, to avoid the step size becoming too big on a long document, if L_i > L_0, we normalize it by L_i.

To satisfy the constraint that $\| t_{ik}^{(l)} \| \le \gamma$, when $\| t_{ik}^{(l)} \| > \gamma$, we normalize it by $\gamma / \| t_{ik}^{(l)} \|$. After we obtain the new T, we update $r_i^{(l)}$ using (5).
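The M-step for one document can be sketched end-to-end in NumPy: compute the residuals via (5), the gradient via (15)–(16), take a gradient step, and project back into the hyperball. This is an illustrative toy (random embeddings, fixed learning rate, fixed variational posteriors `pi` standing in for the E-step output), not the released code:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, W, L = 50, 4, 1000, 30        # toy sizes: embed dim, topics, vocab, doc length
V = rng.normal(size=(N, W)) * 0.1   # word embedding matrix (columns are v_s)
u = np.full(W, 1.0 / W)             # unigram probabilities P(s)
V_doc = rng.normal(size=(L, N))     # embeddings of the document's words
pi = rng.dirichlet(np.ones(K), size=L)  # variational topic posteriors (E-step output)
gamma_, lr = 7.0, 0.1               # hyperball radius and a fixed toy learning rate

def residuals(T_i):
    # (5): r = -log(u exp{V^T T})
    return -np.log(u @ np.exp(V.T @ T_i))

T_i = np.zeros((N, K))              # topics initialized to the null topic
for _ in range(20):
    r = residuals(T_i)                           # shape (K,)
    m_bar = pi.sum(axis=0)                       # expected topic counts E[m_ik]
    # (16): dr_k/dt_k = -e^{r_k} * sum_s P(s) exp{v_s^T t_k} v_s
    weights = np.exp(V.T @ T_i) * u[:, None]     # (W, K)
    dr = -np.exp(r)[None, :] * (V @ weights)     # (N, K)
    # (15): gradient of L w.r.t. T_i
    grad = V_doc.T @ pi + dr * m_bar[None, :]
    T_i = T_i + lr * grad
    # Project each topic column back into the hyperball of radius gamma
    norms = np.linalg.norm(T_i, axis=0)
    T_i = np.where(norms > gamma_, T_i * (gamma_ / np.maximum(norms, 1e-12)), T_i)

r = residuals(T_i)                               # final residual update via (5)
```

The projection step implements the magnitude constraint on topic embeddings; the residual recomputation at the end corresponds to the update of r after each new T.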
Sometimes, especially in the initial several iterations, the step size of the gradient descent may be excessively big, and L(q, T) may decrease after the update of T. Nonetheless, the general direction of L(q, T) is increasing.

6.3 Sharing of Topics across Documents

In principle we could use one set of topics across the whole corpus, or choose different topics for different subsets of documents. One could choose whichever way best utilizes cross-document information.

For instance, when the document category information is available, we could make the documents in each category share their respective set of topics, so that M categories correspond to M sets of topics. In the learning algorithm, only the update of π_{ij}^k needs to be changed to cater for this situation: when the k-th topic is relevant to document i, we update π_{ij}^k using (13); otherwise π_{ij}^k = 0.

An identifiability problem may arise when we split topic embeddings according to document subsets. In different topic groups, some highly similar, redundant topics may be learned. If we project documents into the topic space, documents sharing the same underlying topic but belonging to different topic groups may be projected onto different dimensions of the topic space, and similar documents may eventually be projected into very different topic proportion vectors. In this situation, directly using the projected topic proportion vectors could cause problems in unsupervised tasks such as clustering. A simple solution to this problem would be to compute the pairwise similarities between topic embeddings, and consider these similarities when computing the similarity between two projected topic proportion vectors. Two similar documents will then still receive a high similarity score.
7 Experimental Results

To investigate the quality of the document representation of our TopicVec model, we compared its performance against eight topic modeling or document representation methods in two document classification tasks. Moreover, to show the topic coherence of TopicVec on a single document, we present the top words in the top topics learned on a news article.

7.1 Document Classification Evaluation

7.1.1 Experimental Setup

Compared Methods. Two setups of TopicVec were evaluated:

• TopicVec: the topic proportions learned by TopicVec;
• TV+MeanWV: the topic proportions, concatenated with the mean word embedding of the document (same as MeanWV below).

We compare the performance of our methods against eight methods, including three topic modeling methods, three continuous document representation methods, and the conventional bag-of-words (BOW) method. The count vector of BOW is unweighted.

The topic modeling methods include:

• LDA: the vanilla LDA (Blei et al., 2003) in the gensim library^3;
• sLDA: Supervised Topic Model^4 (McAuliffe and Blei, 2008), which improves the predictive performance of LDA by modeling class labels;
• LFTM: Latent Feature Topic Modeling^5 (Nguyen et al., 2015).

The document-topic proportions of the topic modeling methods were used as their document representations. The document representation methods are:

• Doc2Vec: Paragraph Vector (Le and Mikolov, 2014) in the gensim library^6;
• TWE: Topical Word Embedding^7 (Liu et al., 2015), which represents a document by concatenating the average topic embedding and the average word embedding, similar to our TV+MeanWV;
• GaussianLDA: Gaussian LDA^8 (Das et al., 2015), which assumes that words in a topic are random samples from a multivariate Gaussian distribution with the mean as the topic embedding.
Similar to TopicVec, we derived the posterior topic proportions as the features of each document;

• MeanWV: the mean word embedding of the document.

Datasets. We used two standard document classification corpora: the 20 Newsgroups^9 and the ApteMod version of the Reuters-21578 corpus^10, referred to as 20News and Reuters in the following.

20News contains about 20,000 newsgroup documents evenly partitioned into 20 different categories. Reuters contains 10,788 documents, where each document is assigned to one or more categories. For the evaluation of document classification, documents appearing in two or more categories were removed. The numbers of documents in the categories of Reuters are highly imbalanced, and we only selected the largest 10 categories, leaving us with 8,025 documents in total.

^3 https://radimrehurek.com/gensim/models/ldamodel.html
^4 http://www.cs.cmu.edu/~chongw/slda/
^5 https://github.com/datquocnguyen/LFTM/
^6 https://radimrehurek.com/gensim/models/doc2vec.html
^7 https://github.com/largelymfs/topical_word_embeddings/
^8 https://github.com/rajarshd/Gaussian_LDA
^9 http://qwone.com/~jason/20Newsgroups/
^10 http://www.nltk.org/book/ch02.html

The same preprocessing steps were applied to all methods: words were lowercased; stop words and words out of the word embedding vocabulary (which means that they are extremely rare) were removed.

Experimental Settings. TopicVec used the word embeddings trained using PSDVec on a March 2015 Wikipedia snapshot. It contains the most frequent 180,000 words. The dimensionality of both word embeddings and topic embeddings was 500. The hyperparameters were α = (0.1, ..., 0.1), γ = 7. For 20News and Reuters, we specified 15 and 12 topics in each category on the training set, respectively. The first topic in each category was always set to null.
The learned topic embeddings were combined to form the whole topic set, where redundant null topics in different categories were removed, leaving us with 281 topics for 20News and 111 topics for Reuters. The initial learning rate was set to 0.1. After 100 GEM iterations on each dataset, the topic embeddings were obtained. Then the posterior document-topic distributions of the test sets were derived by performing one E-step given the topic embeddings trained on the training set.

LFTM includes two models: LF-LDA and LF-DMM. We chose the better-performing LF-LDA to evaluate. TWE includes three models, and we chose the best-performing TWE-1 to compare.

LDA, sLDA, LFTM and TWE used the specified 50 topics on Reuters, as this is the optimal topic number according to (Lu et al., 2011). On the larger 20News dataset, they used the specified 100 topics. Other hyperparameters of all compared methods were left at their default values.

GaussianLDA was specified 100 topics on 20News and 70 topics on Reuters. As each sampling iteration took over 2 hours, we only had time for 100 sampling iterations.

For each method, after obtaining the document representations of the training and test sets, we trained an ℓ1-regularized linear SVM one-vs-all classifier on the training set using the scikit-learn library^11. We then evaluated its predictive performance on the test set.
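The classification protocol can be sketched with scikit-learn as follows. This is a schematic with random placeholder features and labels; in the paper, the feature matrices are the document representations produced by each compared method:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(3)
# Placeholder document features, e.g. 281-dim topic proportions for 20News.
n_train, n_test, n_feat, n_class = 200, 100, 281, 4
X_train = rng.normal(size=(n_train, n_feat))
y_train = rng.integers(0, n_class, size=n_train)
X_test = rng.normal(size=(n_test, n_feat))
y_test = rng.integers(0, n_class, size=n_test)

# l1-regularized linear SVM; LinearSVC trains one-vs-rest classifiers by default.
clf = LinearSVC(penalty="l1", dual=False)  # the l1 penalty requires the primal problem
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Macro-averaged precision/recall/F1, matching the paper's evaluation metrics.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0)
```

With random features the scores are near chance; with real document representations, the same pipeline produces the numbers reported in Table 2.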
Evaluation Metrics. Considering that the largest few categories dominate Reuters, we adopted macro-averaged precision, recall and F1 measures as the evaluation metrics, to avoid the average results being dominated by the performance of the top categories.

^11 http://scikit-learn.org/stable/modules/svm.html

Table 2: Performance on multi-class text classification. Best score is in boldface.

                 20News                 Reuters
             Prec   Rec    F1       Prec   Rec    F1
BOW          69.1   68.5   68.6     92.5   90.3   91.1
LDA          61.9   61.4   60.3     76.1   74.3   74.8
sLDA         61.4   60.9   60.9     88.3   83.3   85.1
LFTM         63.5   64.8   63.7     84.6   86.3   84.9
MeanWV       70.4   70.3   70.1     92.0   89.6   90.5
Doc2Vec      56.3   56.6   55.4     84.4   50.0   58.5
TWE          69.5   69.3   68.8     91.0   89.1   89.9
GaussianLDA  30.9   26.5   22.7     46.2   31.5   35.3
TopicVec     71.3   71.3   71.2     92.5   92.1   92.2
TV+MeanWV    71.8   71.5   71.6     92.2   91.6   91.6

Table 3: Number of features of the five best performing methods.

Avg. Features   BOW     MeanWV   TWE   TopicVec   TV+MeanWV
20News          50381   500      800   281        781
Reuters         17989   500      800   111        611

Evaluation Results. Table 2 presents the performance of the different methods on the two classification tasks, with the highest scores highlighted in boldface. It can be seen that TV+MeanWV and TopicVec obtained the best performance on the two tasks, respectively. With only topic proportions as features, TopicVec performed slightly better than BOW, MeanWV and TWE, and significantly outperformed four other methods. The number of features it used was much smaller than that of BOW, MeanWV and TWE (Table 3).

GaussianLDA performed considerably worse than all other methods. After checking the generated topic embeddings manually, we found that the embeddings for different topics are highly similar to each other. Hence the posterior topic proportions were almost uniform and non-discriminative. In addition, on the two datasets, even the fastest Alias sampling in (Das et al., 2015) took over 2 hours for one iteration and 10 days for the whole 100 iterations.
In contrast, our method finished the 100 EM iterations in 2 hours.

7.2 Qualitative Assessment of Topics Derived from a Single Document

Topic models need a large set of documents to extract coherent topics. Hence, methods depending on topic models, such as TWE, are subject to this limitation. In contrast, TopicVec can extract coherent topics and obtain document representations even when only one document is provided as input.

To illustrate this feature, we ran TopicVec on a New York Times news article about a pharmaceutical company acquisition [12], and obtained 20 topics.

[Figure 2: Topic cloud of the pharmaceutical company acquisition news.]

Figure 2 presents the most relevant words in the top-6 topics as a topic cloud. We first calculated the relevance between a word and a topic as the frequency-weighted cosine similarity of their embeddings. Then the most relevant words were selected to represent each topic. The sizes of the topic slices are proportional to the topic proportions, and the font sizes of individual words are proportional to their relevance to the topics. Among these top-6 topics, the largest and smallest topic proportions are 26.7% and 9.9%, respectively.

As shown in Figure 2, the words in the obtained topics were generally coherent, although the topics were derived from only a single document. The reason is that TopicVec takes advantage of the rich semantic information encoded in word embeddings, which were pretrained on a large corpus.

[12] http://www.nytimes.com/2015/09/21/business/a-huge-overnight-increase-in-a-drugs-price-raises-protests.html

The topic coherence suggests that the derived topic embeddings were approximately the semantic centroids of the document. This capacity may aid applications such as document retrieval, where a "compressed representation" of the query document is helpful.
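The frequency-weighted cosine relevance used to build the topic cloud can be sketched as follows; the embeddings and counts below are random stand-ins, and the helper name `relevance` is ours, not from the paper:

```python
import numpy as np

def relevance(word_vecs, word_freqs, topic_vec):
    """Frequency-weighted cosine similarity between each word and a topic."""
    wv = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    tv = topic_vec / np.linalg.norm(topic_vec)
    return word_freqs * (wv @ tv)

rng = np.random.RandomState(0)
word_vecs = rng.randn(5, 50)                       # stand-in pretrained embeddings
word_freqs = np.array([3.0, 1.0, 2.0, 1.0, 1.0])   # word counts in the document
topic_vec = rng.randn(50)                          # stand-in topic embedding
scores = relevance(word_vecs, word_freqs, topic_vec)
top_words = np.argsort(-scores)[:3]                # 3 most relevant words
```

Sorting by this score and keeping the top few words per topic yields the word lists rendered in the cloud.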
8 Conclusions and Future Work

In this paper, we proposed TopicVec, a generative model combining word embedding and LDA, with the aim of exploiting the word collocation patterns both at the level of the local context and of the global document. Experiments show that TopicVec can learn high-quality document representations, even given only one document.

In our classification tasks we only explored the use of the topic proportions of a document as its representation. However, jointly representing a document by topic proportions and topic embeddings would be more accurate. Efficient algorithms for this task have been proposed (Kusner et al., 2015).

Our method has potential applications in various scenarios, such as document retrieval, classification, clustering and summarization.

Acknowledgements

We thank Xiuchao Sui and Linmei Hu for their help and support. We thank the anonymous mentor for the careful proofreading. This research is funded by the National Research Foundation, Prime Minister's Office, Singapore under its IDM Futures Funding Initiative and IRC@SG Funding Initiative administered by IDMPO. Part of the work was conceived when Shaohua Li was visiting Tsinghua. Jun Zhu is supported by the National NSF of China (No. 61322308) and the Youth Top-notch Talent Support Program.

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, pages 1137–1155.

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Rajarshi Das, Manzil Zaheer, and Chris Dyer. 2015. Gaussian LDA for topic models with word embeddings.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 795–804, Beijing, China, July. Association for Computational Linguistics.

Geoffrey E Hinton and Ruslan R Salakhutdinov. 2009. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems, pages 1607–1614.

Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 873–882. Association for Computational Linguistics.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Q. Weinberger. 2015. From word embeddings to document distances. In David Blei and Francis Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 957–966. JMLR Workshop and Conference Proceedings.

Hugo Larochelle and Stanislas Lauly. 2012. A neural autoregressive topic model. In Advances in Neural Information Processing Systems, pages 2708–2716.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188–1196.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Shaohua Li, Jun Zhu, and Chunyan Miao. 2015. A generative word embedding model and its low rank positive semidefinite solution. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1599–1609, Lisbon, Portugal, September. Association for Computational Linguistics.
Shaohua Li, Jun Zhu, and Chunyan Miao. 2016. PSDVec: a toolbox for incremental and scalable word embedding. To appear in Neurocomputing.

Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015. Topical word embeddings. In AAAI, pages 2418–2424.

Yue Lu, Qiaozhu Mei, and ChengXiang Zhai. 2011. Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Information Retrieval, 14(2):178–203.

Jon D McAuliffe and David M Blei. 2008. Supervised topic models. In Advances in Neural Information Processing Systems, pages 121–128.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS 2013, pages 3111–3119.

Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics, 3:299–313.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12.

Appendix A: Derivation of L(q, T)

The variational distribution is defined as:

$$q(\boldsymbol{Z}, \boldsymbol{\phi}; \boldsymbol{\pi}, \boldsymbol{\theta}) = q(\boldsymbol{\phi}; \boldsymbol{\theta}) \, q(\boldsymbol{Z}; \boldsymbol{\pi}) = \prod_{i=1}^{M} \mathrm{Dir}(\boldsymbol{\phi}_i; \boldsymbol{\theta}_i) \prod_{j=1}^{L_i} \mathrm{Cat}(z_{ij}; \boldsymbol{\pi}_{ij}). \tag{17}$$

Taking the logarithm of both sides of (17), we obtain

$$\log q(\boldsymbol{Z}, \boldsymbol{\phi}; \boldsymbol{\pi}, \boldsymbol{\theta}) = \sum_{i=1}^{M} \Big\{ \log \Gamma(\theta_{i0}) - \sum_{k=1}^{K} \log \Gamma(\theta_{ik}) + \sum_{k=1}^{K} (\theta_{ik} - 1) \log \phi_{ik} + \sum_{j,k=1}^{L_i, K} \delta(z_{ij} = k) \log \pi_{ij}^{k} \Big\},$$

where $\theta_{i0} = \sum_{k=1}^{K} \theta_{ik}$, and $\pi_{ij}^{k}$ is the $k$-th component of $\boldsymbol{\pi}_{ij}$. Let $\psi(\cdot)$ denote the digamma function:

$$\psi(x) = \frac{d}{dx} \ln \Gamma(x) = \frac{\Gamma'(x)}{\Gamma(x)}.$$
It follows that

$$\begin{aligned} H(q) &= -\mathbb{E}_q[\log q(\boldsymbol{Z}, \boldsymbol{\phi}; \boldsymbol{\pi}, \boldsymbol{\theta})] \\ &= \sum_{i=1}^{M} \Big\{ \sum_{k=1}^{K} \log \Gamma(\theta_{ik}) - \log \Gamma(\theta_{i0}) - \sum_{k=1}^{K} (\theta_{ik} - 1) \psi(\theta_{ik}) + (\theta_{i0} - K) \psi(\theta_{i0}) - \sum_{j,k=1}^{L_i, K} \pi_{ij}^{k} \log \pi_{ij}^{k} \Big\}. \end{aligned} \tag{18}$$

Plugging $q$ into $\mathcal{L}(q, \boldsymbol{T})$, we have

$$\begin{aligned} \mathcal{L}(q, \boldsymbol{T}) &= H(q) + \mathbb{E}_q[\log p(\boldsymbol{Z}, \boldsymbol{\phi} \mid \boldsymbol{T})] \\ &= H(q) + C_0 - \log Z(\boldsymbol{H}, \boldsymbol{\mu}) - \|\boldsymbol{A}\|_{f(\boldsymbol{H})}^{2} - \sum_{i=1}^{W} \mu_i \|\boldsymbol{v}_{s_i}\|^{2} \\ &\quad + \sum_{i=1}^{M} \Big\{ \sum_{k=1}^{K} \big( \mathbb{E}_{q(\boldsymbol{Z}_i \mid \boldsymbol{\pi}_i)}[m_{ik}] + \alpha_k - 1 \big) \, \mathbb{E}_{q(\phi_{ik} \mid \boldsymbol{\theta}_i)}[\log \phi_{ik}] \\ &\qquad + \sum_{j=1}^{L_i} \Big( \boldsymbol{v}_{w_{ij}}^{\top} \Big( \sum_{k=j-c}^{j-1} \boldsymbol{v}_{w_{ik}} + \mathbb{E}_{q(z_{ij} \mid \boldsymbol{\pi}_{ij})}[\boldsymbol{t}_{z_{ij}}] \Big) + \sum_{k=j-c}^{j-1} a_{w_{ik} w_{ij}} + \mathbb{E}_{q(z_{ij} \mid \boldsymbol{\pi}_{ij})}[r_{i, z_{ij}}] \Big) \Big\} \\ &= C_1 + H(q) + \sum_{i=1}^{M} \Big\{ \sum_{k=1}^{K} \Big( \sum_{j=1}^{L_i} \pi_{ij}^{k} + \alpha_k - 1 \Big) \big( \psi(\theta_{ik}) - \psi(\theta_{i0}) \big) + \sum_{j=1}^{L_i} \big( \boldsymbol{v}_{w_{ij}}^{\top} \boldsymbol{T}_i \boldsymbol{\pi}_{ij} + \boldsymbol{r}_i^{\top} \boldsymbol{\pi}_{ij} \big) \Big\} \\ &= C_1 + H(q) + \sum_{i=1}^{M} \Big\{ \sum_{k=1}^{K} \Big( \sum_{j=1}^{L_i} \pi_{ij}^{k} + \alpha_k - 1 \Big) \big( \psi(\theta_{ik}) - \psi(\theta_{i0}) \big) + \mathrm{Tr}\Big( \boldsymbol{T}_i^{\top} \sum_{j=1}^{L_i} \boldsymbol{v}_{w_{ij}} \boldsymbol{\pi}_{ij}^{\top} \Big) + \boldsymbol{r}_i^{\top} \sum_{j=1}^{L_i} \boldsymbol{\pi}_{ij} \Big\}. \end{aligned} \tag{19}$$

Appendix B: Derivation of the E-Step

The learning objective is:

$$\mathcal{L}(q, \boldsymbol{T}) = \sum_{i=1}^{M} \Big\{ \sum_{k=1}^{K} \Big( \sum_{j=1}^{L_i} \pi_{ij}^{k} + \alpha_k - 1 \Big) \big( \psi(\theta_{ik}) - \psi(\theta_{i0}) \big) + \mathrm{Tr}\Big( \boldsymbol{T}_i^{\top} \sum_{j=1}^{L_i} \boldsymbol{v}_{w_{ij}} \boldsymbol{\pi}_{ij}^{\top} \Big) + \boldsymbol{r}_i^{\top} \sum_{j=1}^{L_i} \boldsymbol{\pi}_{ij} \Big\} + H(q) + C_1. \tag{20}$$

(20) can be expressed as

$$\begin{aligned} \mathcal{L}(q, \boldsymbol{T}^{(l-1)}) &= \sum_{i=1}^{M} \Big\{ \sum_{k=1}^{K} \log \Gamma(\theta_{ik}) - \log \Gamma(\theta_{i0}) - \sum_{k=1}^{K} (\theta_{ik} - 1) \psi(\theta_{ik}) + (\theta_{i0} - K) \psi(\theta_{i0}) - \sum_{j,k=1}^{L_i, K} \pi_{ij}^{k} \log \pi_{ij}^{k} \\ &\quad + \sum_{k=1}^{K} \Big( \sum_{j=1}^{L_i} \pi_{ij}^{k} + \alpha_k - 1 \Big) \big( \psi(\theta_{ik}) - \psi(\theta_{i0}) \big) + \sum_{j=1}^{L_i} \big( \boldsymbol{v}_{w_{ij}}^{\top} \boldsymbol{T}_i \boldsymbol{\pi}_{ij} + \boldsymbol{r}_i^{\top} \boldsymbol{\pi}_{ij} \big) \Big\} + C_1. \end{aligned} \tag{21}$$

We first maximize (21) w.r.t. $\pi_{ij}^{k}$, the probability that the $j$-th word in the $i$-th document takes the $k$-th latent topic. Note that this optimization is subject to the normalization constraint $\sum_{k=1}^{K} \pi_{ij}^{k} = 1$. We isolate the terms containing $\boldsymbol{\pi}_{ij}$, and form a Lagrange function by incorporating the normalization constraint:

$$\Lambda(\boldsymbol{\pi}_{ij}) = -\sum_{k=1}^{K} \pi_{ij}^{k} \log \pi_{ij}^{k} + \sum_{k=1}^{K} \big( \psi(\theta_{ik}) - \psi(\theta_{i0}) \big) \pi_{ij}^{k} + \boldsymbol{v}_{w_{ij}}^{\top} \boldsymbol{T}_i \boldsymbol{\pi}_{ij} + \boldsymbol{r}_i^{\top} \boldsymbol{\pi}_{ij} + \lambda_{ij} \Big( \sum_{k=1}^{K} \pi_{ij}^{k} - 1 \Big). \tag{22}$$
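As a numerical sanity check (the θ values below are arbitrary), the Dirichlet terms of H(q) in Eq. (18) agree with SciPy's closed-form Dirichlet entropy:

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.stats import dirichlet

theta = np.array([1.3, 0.7, 2.5])   # arbitrary variational Dirichlet parameters
theta0 = theta.sum()                # theta_{i0} = sum_k theta_{ik}
K = len(theta)

# Dirichlet part of H(q) as written in Eq. (18):
#   sum_k log Gamma(theta_ik) - log Gamma(theta_i0)
#   - sum_k (theta_ik - 1) psi(theta_ik) + (theta_i0 - K) psi(theta_i0)
H_dir = (gammaln(theta).sum() - gammaln(theta0)
         - ((theta - 1.0) * digamma(theta)).sum()
         + (theta0 - K) * digamma(theta0))

# SciPy's Dirichlet differential entropy uses the same closed form.
assert np.isclose(H_dir, dirichlet(theta).entropy())
```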
Taking the derivative w.r.t. $\pi_{ij}^{k}$, we obtain

$$\frac{\partial \Lambda(\boldsymbol{\pi}_{ij})}{\partial \pi_{ij}^{k}} = -1 - \log \pi_{ij}^{k} + \psi(\theta_{ik}) - \psi(\theta_{i0}) + \boldsymbol{v}_{w_{ij}}^{\top} \boldsymbol{t}_{ik} + r_{ik} + \lambda_{ij}. \tag{23}$$

Setting this derivative to 0 yields the maximizing value of $\pi_{ij}^{k}$:

$$\pi_{ij}^{k} \propto \exp\big\{ \psi(\theta_{ik}) + \boldsymbol{v}_{w_{ij}}^{\top} \boldsymbol{t}_{ik} + r_{ik} \big\}. \tag{24}$$

Next, we maximize (21) w.r.t. $\theta_{ik}$, the $k$-th component of the posterior Dirichlet parameter:

$$\begin{aligned} \frac{\partial \mathcal{L}(q, \boldsymbol{T}^{(l-1)})}{\partial \theta_{ik}} &= \frac{\partial}{\partial \theta_{ik}} \Big[ \log \Gamma(\theta_{ik}) - \log \Gamma(\theta_{i0}) + \Big( \sum_{j=1}^{L_i} \pi_{ij}^{k} + \alpha_k - \theta_{ik} \Big) \psi(\theta_{ik}) - \Big( L_i + \sum_{k} \alpha_k - \theta_{i0} \Big) \psi(\theta_{i0}) \Big] \\ &= \Big( \sum_{j=1}^{L_i} \pi_{ij}^{k} + \alpha_k - \theta_{ik} \Big) \psi'(\theta_{ik}) - \Big( L_i + \sum_{k} \alpha_k - \theta_{i0} \Big) \psi'(\theta_{i0}), \end{aligned} \tag{25}$$

where $\psi'(\cdot)$ is the derivative of the digamma function $\psi(\cdot)$, commonly referred to as the trigamma function.

Setting (25) to 0 yields a maximum at

$$\theta_{ik} = \sum_{j=1}^{L_i} \pi_{ij}^{k} + \alpha_k. \tag{26}$$

Note that this solution depends on the values of $\pi_{ij}^{k}$, which in turn depend on $\theta_{ik}$ through (24). We therefore alternate between (24) and (26) until convergence.

Appendix C: Derivative of L(q^(l), T) w.r.t. T_i

$$\begin{aligned} \frac{\partial \mathcal{L}(q^{(l)}, \boldsymbol{T})}{\partial \boldsymbol{T}_i} &= \frac{\partial \sum_{j=1}^{L_i} \big( \boldsymbol{v}_{w_{ij}}^{\top} \boldsymbol{T}_i \boldsymbol{\pi}_{ij} + \boldsymbol{\pi}_{ij}^{\top} \boldsymbol{r}_i \big)}{\partial \boldsymbol{T}_i} \\ &= \frac{\partial}{\partial \boldsymbol{T}_i} \mathrm{Tr}\Big( \boldsymbol{T}_i \sum_{j=1}^{L_i} \boldsymbol{\pi}_{ij} \boldsymbol{v}_{w_{ij}}^{\top} \Big) + \Big( \sum_{j=1}^{L_i} \boldsymbol{\pi}_{ij} \Big)^{\top} \frac{\partial \boldsymbol{r}_i}{\partial \boldsymbol{T}_i} \\ &= \sum_{j=1}^{L_i} \boldsymbol{v}_{w_{ij}} \boldsymbol{\pi}_{ij}^{\top} + \sum_{k=1}^{K} \bar{m}_{ik} \frac{\partial r_{ik}}{\partial \boldsymbol{T}_i}, \end{aligned} \tag{27}$$

where $\bar{m}_{ik} = \sum_{j=1}^{L_i} \pi_{ij}^{k} = \mathbb{E}[m_{ik}]$ is the sum of the variational probabilities of each word being assigned to the $k$-th topic in the $i$-th document.
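The E-step alternation between (24) and (26) can be sketched as a small fixed-point loop; the topic embeddings T_i, residuals r_i, word vectors, and dimensions below are random stand-ins rather than trained values:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.RandomState(0)
K, L, D = 5, 30, 20                   # topics, words in document i, embedding dim
V = rng.randn(L, D) * 0.1             # v_{w_ij}: embeddings of the document's words
T = rng.randn(D, K) * 0.1             # T_i: topic embeddings (columns t_ik)
r = rng.randn(K) * 0.1                # r_i: topic residuals
alpha = np.full(K, 0.5)               # Dirichlet prior alpha

theta = alpha + L / K                 # initialize theta_i
for _ in range(100):
    # Eq. (24): pi^k_ij proportional to exp{psi(theta_ik) + v^T_{w_ij} t_ik + r_ik}
    log_pi = digamma(theta) + V @ T + r
    log_pi -= log_pi.max(axis=1, keepdims=True)   # numerical stability
    pi = np.exp(log_pi)
    pi /= pi.sum(axis=1, keepdims=True)
    # Eq. (26): theta_ik = sum_j pi^k_ij + alpha_k
    theta = pi.sum(axis=0) + alpha
```

Each pass normalizes the per-word topic probabilities over topics and then refreshes the Dirichlet parameters, matching the coupled updates (24) and (26).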