Probit Normal Correlated Topic Models
Authors: Xingchen Yu, Ernest Fokoue
Xingchen Yu, Rochester Institute of Technology, 98 Lomb Memorial Drive, Rochester, NY 14623, USA. xvy5021@gmail.com
Ernest Fokoué, Rochester Institute of Technology, 98 Lomb Memorial Drive, Rochester, NY 14623, USA. ernest.fokoue@rit.edu

Abstract

The logistic normal distribution has recently been adapted, via a transformation of multivariate Gaussian variables, to model the topical distribution of documents in the presence of correlations among topics. In this paper, we propose a probit normal alternative approach to modelling correlated topical structures. Our use of the probit model in the context of topic discovery is novel, as many authors have so far concentrated solely on the logistic model, partly due to the formidable inefficiency of the multinomial probit model even for very small topical spaces. We circumvent the inefficiency of multinomial probit estimation by adapting the diagonal orthant multinomial probit to the topic-model context, which enables our topic modelling scheme to handle corpora with a large number of latent topics. An additional and very important benefit of our method is that, unlike the logistic normal model, whose non-conjugacy requires sophisticated sampling schemes, our approach exploits the natural conjugacy inherent in the auxiliary-variable formulation of the probit model to achieve greater simplicity. Applied to the well-known Associated Press corpus, our proposed scheme not only discovers a large number of meaningful topics but also captures compellingly intuitive correlations among certain topics. Moreover, our approach lends itself to even further scalability thanks to various existing high-performance algorithms and architectures capable of handling millions of documents.
Keywords: Bayesian, Gibbs Sampler, Cumulative Distribution Function, Probit, Logit, Orthant, Efficient Sampling, Auxiliary Variable, Correlation Structure, Topic, Vocabulary, Conjugate, Dirichlet, Gaussian.

I. Introduction

The task of recovering the latent topics underlying a given corpus of D documents has been at the forefront of active research in statistical machine learning for more than a decade, and continues to receive dedicated contributions from researchers around the world. Since the introduction of Latent Dirichlet Allocation (LDA) by Blei et al. (2003) and its extension to correlated topic models (CTM) by Blei and Lafferty (2006), a series of excellent contributions have been made to this exciting field, ranging from slight extensions of the modelling structure to the development of scalable topic-modelling algorithms capable of handling extremely large collections of documents, as well as methods for selecting an optimal model among a collection of competing models or for using the output of topic modelling as input to other machine learning or data mining tasks such as image analysis and sentiment extraction, to name just a few. As far as correlated topic models are concerned, virtually all contributors to the field have so far concentrated solely on the logistic normal topic model. The seminal paper on correlated topic models (Blei and Lafferty, 2006) adopts a variational approximation approach to model fitting, while subsequent authors such as Mimno et al. (2008) propose a Gibbs sampling scheme with data augmentation via uniform random variables. More recently, Chen et al. (2013) presented an exact and scalable Gibbs sampling algorithm with Pólya-Gamma distributed auxiliary variables, a recent development in the efficient sampling of logistic models.
Despite the inseparable relationship between the logistic and probit models in statistical modelling, the probit model has not yet been proposed for topic modelling, probably because of its computational inefficiency for multiclass classification problems and the high posterior dependence between its auxiliary variables and parameters. In practical applications where topic models are commonly employed, fitting many topics is extremely prevalent; in some cases, more than 1000 topics are fitted to large datasets such as the Wikipedia and PubMed corpora. Using an MCMC multinomial probit model in topic modelling applications would therefore be impractical and inconceivable owing to its computational inefficiency. Nonetheless, recent work on the diagonal orthant probit model (Johndrow et al., 2013) substantially improved sampling efficiency while maintaining predictive performance, which motivated us to build an alternative correlated topic model with a probit normal topic distribution. Moreover, probit models inherently capture a richer dependency structure between topics and the co-occurrence of words within a topic, since they do not impose the independence of irrelevant alternatives (IIA) restriction of logistic models. The rest of this paper is organized as follows: in Section 2, we present a conventional formulation of topic modelling along with our general notation and the correlated topic models extension. Section 3 introduces our adaptation of the diagonal orthant probit model to topic discovery in the presence of correlations among topics, along with the corresponding auxiliary-variable sampling scheme for updating the probit model parameters and the remaining posterior distributions of the model parameters.
Unlike the logistic normal formulation, whose non-conjugacy calls for sophisticated sampling schemes, in this section we clearly reveal the simplicity of our proposed method, which results from the natural conjugacy inherent in the auxiliary-variable formulation of the parameter updates. We also present compelling computational demonstrations of the efficiency of the diagonal orthant approach compared to the traditional multinomial probit, both for auxiliary-variable sampling and for the estimation of the topic distribution. Section 4 presents the performance of our proposed approach on the Associated Press data set, featuring the intuitively appealing topics discovered, along with the correlation structure among topics and the log-likelihood as a function of the dimension of the topical space. Section 5 presents our conclusions, a discussion, and elements of our future work.

II. General aspects of topic models

In a given corpus, one could imagine that each document deals with one or more topics. For instance, one of the collections considered in this paper is provided by the Associated Press and covers topics as varied as aviation, education, weather, broadcasting, the air force, the navy, national security, international treaties, investing, international trade, war, the courts, the entertainment industry, and politics. From a statistical perspective, a topic is often modeled as a probability distribution over words, and as a result a given document is treated as a mixture of probabilistic topics (Blei et al., 2003). We consider a setting with a total of $V$ unique words in the reference vocabulary and $K$ topics underlying the $D$ documents provided. Let $w_{dn}$ denote the $n$-th word in the $d$-th document, and let $z_{dn}$ refer to the label of the topic assigned to the $n$-th word of that $d$-th document.
Then the probability of $w_{dn}$ is given by
$$\Pr(w_{dn}) = \sum_{k=1}^{K} \Pr(w_{dn} \mid z_{dn} = k)\, \Pr(z_{dn} = k), \qquad (1)$$
where $\Pr(z_{dn} = k)$ is the probability that the $n$-th word in the $d$-th document is assigned to topic $k$. This quantity plays an important role in the analysis of correlated topic models. In the seminal article on correlated topic models (Blei and Lafferty, 2006), $\Pr(z_{dn} = k)$ is modeled for each document $d$ as a function of a $K$-dimensional vector $\eta_d$ of parameters. Specifically, the logistic normal defines $\eta_d = (\eta_{1d}, \eta_{2d}, \cdots, \eta_{Kd})$, where the last element $\eta_{Kd}$ is typically set to zero for identifiability, assumes $\eta_d \sim \mathrm{MVN}(\mu, \Sigma)$, and sets
$$\theta_{kd} = \Pr[z_{kdn} = 1 \mid \eta_d] = \frac{e^{\eta_{kd}}}{\sum_{j=1}^{K} e^{\eta_{jd}}}, \quad k = 1, 2, \cdots, K-1, \qquad \theta_{Kd} = \frac{1}{\sum_{j=1}^{K} e^{\eta_{jd}}}.$$
Also, for all $n \in \{1, 2, \cdots, N_d\}$, $z_{dn} \sim \mathrm{Mult}(\theta_d)$ and $w_{dn} \sim \mathrm{Mult}(\beta)$. With all these model components defined, the estimation task in correlated topic modelling from a Bayesian perspective can be summarized in the following posterior:
$$p(\eta_d, Z \mid W, \mu, \Sigma) \propto p(W \mid Z) \prod_{d=1}^{D} \left[ \left( \prod_{n=1}^{N_d} p(z_{dn}) \right) p(\eta_d \mid \mu, \Sigma) \right] \qquad (2)$$
$$= \prod_{k=1}^{K} \frac{\delta(C_k + \beta)}{\delta(\beta)} \prod_{d=1}^{D} \left[ \left( \prod_{n=1}^{N_d} \theta_{z_{dn} d} \right) \mathcal{N}(\eta_d \mid \mu, \Sigma) \right],$$
where $\delta(\cdot)$ is defined via the Gamma function $\Gamma(\cdot)$ so that, for a $K$-dimensional vector $u$,
$$\delta(u) = \frac{\prod_{k=1}^{K} \Gamma(u_k)}{\Gamma\!\left( \sum_{k=1}^{K} u_k \right)}.$$
The joint posterior in (2) provides the ingredients for estimating the parameter vectors $\eta_d$, which capture the correlations among topics, and the matrix $Z$ that contains the topical assignments. Under the logistic normal model, sampling from the full posterior of $\eta_d$ derived from the joint posterior in (2) requires sophisticated sampling schemes such as the one used in Chen et al. (2013).
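For concreteness, the logistic-normal construction of $\theta_d$ described above can be sketched in a few lines. This is a minimal illustration with our own variable names, not the authors' code; it fixes $\eta_{Kd} = 0$ and applies the softmax map, which reproduces the two formulas for $\theta_{kd}$ and $\theta_{Kd}$ simultaneously.

```python
import numpy as np

def logistic_normal_theta(eta):
    """Map a (K-1)-dimensional Gaussian draw eta_d to a K-dimensional topic
    distribution, fixing the last component to zero for identifiability."""
    eta_full = np.append(eta, 0.0)         # eta_Kd = 0, so e^{eta_Kd} = 1
    e = np.exp(eta_full - eta_full.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
K = 5
mu, Sigma = np.zeros(K - 1), np.eye(K - 1)
eta_d = rng.multivariate_normal(mu, Sigma)  # eta_d ~ MVN(mu, Sigma)
theta_d = logistic_normal_theta(eta_d)      # a valid probability vector
```

Subtracting the maximum before exponentiating leaves the softmax unchanged while avoiding overflow for large $|\eta_{kd}|$.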
Although these authors achieved great performance on large corpora of documents, we thought it useful to contribute to correlated topic modelling by way of the multinomial probit. Clearly, as indicated earlier, most authors have concentrated on the logistic normal despite its non-conjugacy, and the lack of probit topic modelling can easily be attributed to the inefficiency of the corresponding sampling scheme. In the rawest formulation of the multinomial probit, which intends to capture the full extent of the correlations among the topics, the topic assignment probability is defined by
$$\Pr(z_{dn} = k) = \theta_{kd} = \int \!\! \int \cdots \!\! \int \phi_K(u; \eta_d, R)\, du. \qquad (3)$$
The practical evaluation of (3) involves a complicated high-dimensional integral which is typically computationally intractable when the number of categories exceeds 4. A relaxed version of (3), one that still captures more correlation than the logit and is also very commonly used in practice, defines $\theta_{kd}$ as
$$\theta_{kd} = \int_{-\infty}^{+\infty} \left( \prod_{j=1, j \neq k}^{K} \Phi(v + \eta_{kd} - \eta_{jd}) \right) \phi(v)\, dv = \mathbb{E}_{\phi(v)}\!\left[ \prod_{j=1, j \neq k}^{K} \Phi(v + \eta_{kd} - \eta_{jd}) \right], \qquad (4)$$
where $\phi(v) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}v^2}$ is the standard normal density and $\Phi(v) = \int_{-\infty}^{v} \phi(u)\, du$ is the standard normal distribution function. Despite this relaxation, the multinomial probit in this formulation still has major drawbacks, namely: (a) even when the vector $\eta_d$ is given, the calculation of $\theta_{kd}$ remains computationally prohibitive even for moderate values of $K$. In practice, one may consider a Monte Carlo approximation to the integral in (4); however, in the context of a large corpus with many underlying latent topics, such an approach renders the probit formulation almost unusable.
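The Monte Carlo approximation to (4) mentioned above can be sketched as follows; this is our own illustrative implementation, not the authors' code, and it renormalizes the estimate to absorb Monte Carlo error. Note the cost: every estimate of the full vector $\theta_d$ requires `n_draws` $\times K^2$ evaluations of $\Phi$, which is exactly what becomes prohibitive inside a sampler over many documents and topics.

```python
import numpy as np
from scipy.stats import norm

def probit_theta_mc(eta, n_draws=5000, seed=0):
    """Monte Carlo estimate of the relaxed multinomial-probit probabilities
    theta_k = E_v[ prod_{j != k} Phi(v + eta_k - eta_j) ], v ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    K = eta.size
    v = rng.standard_normal(n_draws)                 # draws of v
    diff = eta[:, None] - eta[None, :]               # (K, K): eta_k - eta_j
    probs = norm.cdf(v[:, None, None] + diff[None])  # (n_draws, K, K)
    idx = np.arange(K)
    probs[:, idx, idx] = 1.0                         # exclude the j == k factor
    theta = probs.prod(axis=2).mean(axis=0)          # average over draws of v
    return theta / theta.sum()                       # absorb Monte Carlo error

theta = probit_theta_mc(np.array([1.0, 0.0, -1.0]))
```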
(b) As far as the estimation of $\eta_d$ is concerned, a natural approach to sampling from its posterior in this context would be a Metropolis-Hastings updating scheme, since the full posterior is not available in closed form. Unfortunately, the Metropolis sampler in this case is excruciatingly slow, with poor mixing rates and high sensitivity to the proposal distribution. An apparently appealing solution could come from the auxiliary-variable formulation described in Albert and Chib (1993). Unfortunately, even this promising formulation fails catastrophically for moderate values of $K$, as we demonstrate in the subsequent section, owing to the high dependency between auxiliary variables and parameters. Essentially, the need for Metropolis updates is avoided by defining an auxiliary vector $Y_d$ of dimension $K$. For $n = 1, \cdots, N_d$, we consider the vector $z_{dn}$ containing the current topic allocation and repeatedly sample $Y_{dn}$ from a $K$-dimensional multivariate Gaussian until the component of $Y_{dn}$ that corresponds to the non-zero index of $z_{dn}$ is the largest of all the components of $Y_{dn}$, i.e.
$$Y_{dn}^{z_{dn}} = \max_{k=1,\cdots,K} \left\{ Y_{dn}^{k} \right\}. \qquad (5)$$
The condition in (5) typically fails to be fulfilled even when $K$ is only moderately large. In fact, we demonstrate later that in some cases it becomes impossible to find a vector $Y_{dn}$ satisfying the condition. Moreover, the dependence of $Y_{dn}$ on the current value of $\eta_d$ further complicates the sampling scheme, especially in the case of a large topical space. In the next section, we remedy these inefficiencies by proposing and developing our adaptation of the diagonal orthant multinomial probit.

III. Diagonal Orthant Probit for Correlated Topic Models

In a recent work, Johndrow et al. (2013) developed the diagonal orthant probit approach to multicategorical classification.
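Before describing their approach, it is easy to illustrate numerically how the rejection step implied by condition (5) degrades as the topical space grows. The following is a toy simulation of ours (identity covariance, one low-scoring target component), not the paper's experiment: it counts how many multivariate-normal draws are needed before the component at the current topic label is the maximum.

```python
import numpy as np

def draws_until_condition(eta, k, rng, max_tries=20000):
    """Count multivariate-normal draws until component k is the maximum,
    i.e. until condition (5) holds; identity covariance for illustration."""
    for t in range(1, max_tries + 1):
        y = eta + rng.standard_normal(eta.size)
        if y.argmax() == k:
            return t
    return max_tries  # the sampler is effectively stuck

rng = np.random.default_rng(1)
results = {}
for K in (3, 10, 30):
    eta = np.zeros(K)
    eta[0] = -1.0  # the currently assigned topic has a low score
    results[K] = np.mean([draws_until_condition(eta, 0, rng)
                          for _ in range(20)])
print(results)  # average number of draws grows rapidly with K
```

Even in this mild setting, the average number of rejected draws grows by orders of magnitude with $K$, and it compounds over every word of every document per Gibbs iteration.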
Their approach circumvents the bottlenecks mentioned earlier and substantially improves sampling efficiency while maintaining predictive performance. Essentially, the diagonal orthant probit approach successfully harnesses the benefits of binary classification, thereby substantially reducing the high dependency that made condition (5) computationally unattainable. Indeed, with the diagonal orthant multinomial model, we achieve three main benefits:

• A more tractable and easily computable definition of the topic distribution $\theta_{kd} = \Pr(z_{dn} = k \mid \eta_d)$;
• A clear, straightforward, and adaptable auxiliary-variable sampling scheme;
• The capacity to handle a very large number of topics, thanks to the efficiency and low dependency.

Under the diagonal orthant probit model, we have
$$\theta_{kd} = \frac{\left(1 - \Phi(-\eta_{kd})\right) \prod_{j \neq k} \Phi(-\eta_{jd})}{\sum_{\ell=1}^{K} \left(1 - \Phi(-\eta_{\ell d})\right) \prod_{j \neq \ell} \Phi(-\eta_{jd})}. \qquad (6)$$
The generative process of our probit normal topic model is essentially identical to that of logistic topic models, except that the topic distribution for each document is now obtained by the probit transformation (6) of a multivariate Gaussian variable. As such, the generative process for a document of length $N_d$ is as follows:

1. Draw $\eta_d \sim \mathrm{MVN}(\mu, \Sigma)$ and transform $\eta_d$ into the topic distribution $\theta_d$, where each element of $\theta_d$ is computed as
$$\theta_{kd} = \frac{\left(1 - \Phi(-\eta_{kd})\right) \prod_{j \neq k} \Phi(-\eta_{jd})}{\sum_{\ell=1}^{K} \left(1 - \Phi(-\eta_{\ell d})\right) \prod_{j \neq \ell} \Phi(-\eta_{jd})}. \qquad (7)$$
2. For each word position $n \in (1, \cdots, N_d)$:
   (a) Draw a topic assignment $Z_n \sim \mathrm{Mult}(\theta_d)$;
   (b) Draw a word $W_n \sim \mathrm{Mult}(\varphi_{z_n})$,

where $\Phi(\cdot)$ represents the cumulative distribution function of the standard normal. We specify a Gaussian prior for $\eta_d$, namely $(\eta_d \mid \cdots) \sim \mathcal{N}_K(\mu, \Sigma)$.
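Unlike (3) and (4), the diagonal orthant probabilities in (6) are a closed-form $O(K)$ computation per document. A minimal sketch of ours (not the authors' code), working in log space to keep the leave-one-out products stable for large $K$:

```python
import numpy as np
from scipy.stats import norm

def diagonal_orthant_theta(eta):
    """Equation (6): theta_k proportional to
    (1 - Phi(-eta_k)) * prod_{j != k} Phi(-eta_j), computed in log space."""
    log_cdf_neg = norm.logcdf(-eta)   # log Phi(-eta_j) for every j
    log_sf = norm.logsf(-eta)         # log (1 - Phi(-eta_k))
    total = log_cdf_neg.sum()
    # leave-one-out product: subtract each term from the full log-sum
    log_num = log_sf + (total - log_cdf_neg)
    log_num -= log_num.max()          # stabilize before exponentiating
    w = np.exp(log_num)
    return w / w.sum()

theta = diagonal_orthant_theta(np.array([2.0, 0.0, -1.0, -1.0]))
```

No integral is evaluated at all; this is the first of the three benefits listed above.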
Throughout this paper, we use $\phi_K(\cdot)$ to denote the $K$-dimensional multivariate Gaussian density function,
$$\phi_K(\eta_d; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^K |\Sigma|}} \exp\!\left( -\frac{1}{2} (\eta_d - \mu)^{\top} \Sigma^{-1} (\eta_d - \mu) \right).$$
To complete the Bayesian analysis of our probit normal topic model, we need to sample from the joint posterior
$$p(\eta_d, Z_d \mid W, \mu, \Sigma) \propto p(\eta_d \mid \mu, \Sigma)\, p(Z_d \mid \eta_d)\, p(W \mid Z_d). \qquad (8)$$
As noted earlier, the second benefit of the diagonal orthant probit model lies in its clear, simple, straightforward yet powerful auxiliary-variable sampling scheme. We take advantage of the diagonal orthant property when dealing with the full posterior for $\eta_d$, given by
$$p(\eta_d \mid W, Z_d, \mu, \Sigma) \propto p(\eta_d \mid \mu, \Sigma)\, p(Z_d \mid \eta_d). \qquad (9)$$
While sampling directly from (9) is impractical, defining a collection of auxiliary variables $Y_d$ allows a scheme that samples from the joint posterior $p(\eta_d, Z_d, Y_d \mid W, \mu, \Sigma)$ as follows. For each document $d$, the matrix $Y_d \in \mathbb{R}^{N_d \times K}$ contains all the values of the auxiliary variables,
$$Y_d = \begin{pmatrix} Y^1_{d1} & Y^2_{d1} & \cdots & Y^k_{d1} & \cdots & Y^K_{d1} \\ Y^1_{d2} & Y^2_{d2} & \cdots & Y^k_{d2} & \cdots & Y^K_{d2} \\ \vdots & \vdots & & \vdots & & \vdots \\ Y^1_{d,N_d} & Y^2_{d,N_d} & \cdots & Y^k_{d,N_d} & \cdots & Y^K_{d,N_d} \end{pmatrix}.$$
Each row $Y_{dn} = (Y^1_{dn}, \cdots, Y^k_{dn}, \cdots, Y^K_{dn})^{\top}$ of $Y_d$ has $K$ components, and the diagonal orthant model updates them readily using the following straightforward sampling scheme. Let $k$ be the current topic allocation of the $n$-th word.
• For the component of $Y_{dn}$ whose index corresponds to the label of the current topic assignment of word $n$, sample from a normal distribution with variance 1 truncated to positive outcomes:
$$(Y^k_{dn} \mid \eta_{kd}) \sim \mathcal{N}^{+}(\eta_{kd}, 1), \qquad z_{kdn} = 1.$$
• For all components of $Y_{dn}$ whose indices do not correspond to the label of the current topic assignment of word $n$, sample from a normal distribution with variance 1 truncated to negative outcomes:
$$(Y^j_{dn} \mid \eta_{jd}) \sim \mathcal{N}^{-}(\eta_{jd}, 1), \qquad z_{jdn} \neq 1.$$

Once the matrix $Y_d$ is obtained, the sampling scheme updates the parameter vector $\eta_d$ by conveniently drawing
$$(\eta_d \mid Y_d, A, \mu, \Sigma) \sim \mathrm{MVN}(\mu_{\eta_d}, \Sigma_{\eta_d}),$$
where
$$\mu_{\eta_d} = \Sigma_{\eta_d} \left( \Sigma^{-1} \mu + X_d^{\top} A^{-1} \mathrm{vec}(Y_d) \right) \quad \text{and} \quad \Sigma_{\eta_d} = \left( \Sigma^{-1} + X_d^{\top} A^{-1} X_d \right)^{-1},$$
with $X_d = \mathbf{1}_{N_d} \otimes I_K$ and $\mathrm{vec}(Y_d)$ representing the row-wise vectorization of the matrix $Y_d$. Adopting a fully Bayesian treatment of our probit normal correlated topic model, we add an extra layer to the hierarchy in order to capture the variation in the mean vector and the variance-covariance matrix of the parameter vector $\eta_d$. Taking advantage of conjugacy, we specify a normal-inverse-Wishart prior for $(\mu, \Sigma)$, namely $p(\mu, \Sigma) = \mathcal{NIW}(\mu_0, \kappa_0, \Psi_0, \nu_0)$, meaning that $\Sigma \mid \nu_0, \Psi_0 \sim \mathcal{IW}(\Psi_0, \nu_0)$ and $(\mu \mid \mu_0, \Sigma, \kappa_0) \sim \mathrm{MVN}(\mu_0, \Sigma/\kappa_0)$. The corresponding posterior is also normal-inverse-Wishart, so that we can write $p(\mu, \Sigma \mid W, Z, \eta) = \mathcal{NIW}(\mu', \kappa', \Psi', \nu')$, where
$$\kappa' = \kappa_0 + D, \qquad \nu' = \nu_0 + D, \qquad \mu' = \frac{D}{D + \kappa_0}\, \bar{\eta} + \frac{\kappa_0}{D + \kappa_0}\, \mu_0,$$
$$\Psi' = \Psi_0 + Q + \frac{\kappa_0 D}{\kappa_0 + D} (\bar{\eta} - \mu_0)(\bar{\eta} - \mu_0)^{\top}, \qquad \text{where} \qquad Q = \sum_{d=1}^{D} (\eta_d - \bar{\eta})(\eta_d - \bar{\eta})^{\top}.$$
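The two truncated-normal draws and the conjugate Gaussian update above can be sketched as follows. This is an illustrative implementation of ours with unit variances ($A = I$), under which $X_d^{\top} X_d = N_d I_K$ and $X_d^{\top} \mathrm{vec}(Y_d)$ reduces to the column sums of $Y_d$; function and variable names are our own.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_auxiliary(eta, z, rng):
    """One diagonal-orthant auxiliary draw for a single word: the component
    matching the word's current topic z is N(eta_k, 1) truncated to (0, inf);
    all other components are truncated to (-inf, 0)."""
    K = eta.size
    y = np.empty(K)
    for j in range(K):
        if j == z:   # positive half-line for the assigned topic
            y[j] = truncnorm.rvs(-eta[j], np.inf, loc=eta[j], scale=1.0,
                                 random_state=rng)
        else:        # negative half-line for every other topic
            y[j] = truncnorm.rvs(-np.inf, -eta[j], loc=eta[j], scale=1.0,
                                 random_state=rng)
    return y

def update_eta(Y, mu, Sigma, rng):
    """Conjugate Gaussian update for eta_d given the N_d x K auxiliary
    matrix Y, with A = I: Sigma_eta = (Sigma^{-1} + N_d I)^{-1} and the
    mean combines the prior with the column sums of Y."""
    N_d, K = Y.shape
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_eta = np.linalg.inv(Sigma_inv + N_d * np.eye(K))
    mu_eta = Sigma_eta @ (Sigma_inv @ mu + Y.sum(axis=0))
    return rng.multivariate_normal(mu_eta, Sigma_eta)

rng = np.random.default_rng(0)
K = 4
eta_d = np.zeros(K)
# auxiliary rows for 6 words with current topic labels 0, 1, 1, 2, 3, 0
Y = np.vstack([sample_auxiliary(eta_d, z, rng) for z in [0, 1, 1, 2, 3, 0]])
eta_new = update_eta(Y, np.zeros(K), np.eye(K), rng)
```

Note that, by construction, every row of $Y$ satisfies the orthant constraints exactly, so there is no rejection step at all, in contrast with condition (5).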
As far as sampling from the full posterior distribution of $z_{dn}$ is concerned, we use the expression
$$\Pr[z_{kdn} = 1 \mid Z_{\neg n}, w_{dn}, W_{\neg dn}] \propto p(w_{dn} \mid z_{kdn} = 1, W_{\neg dn}, Z_{\neg n})\, \theta_{kd} \propto \frac{C^{w_{dn}}_{k, \neg n} + \beta_{w_{dn}}}{\sum_{j=1}^{V} C^{j}_{k, \neg n} + \sum_{j=1}^{V} \beta_j}\, \theta_{kd},$$
where the notation $C_{\cdot, \neg n}$ indicates that the $n$-th word is not included in the topic or document counts under consideration.

IV. Computational results on the Associated Press data

In this section, we used the well-known Associated Press data set of Grün and Hornik, available in R, to uncover the word-topic distributions and the correlation structure between the various topics, as well as to select optimal models. The Associated Press corpus consists of 2244 documents and 10473 words. After preprocessing the corpus by retaining frequent and common terms, we reduced the vocabulary from 10473 to 2643 words for efficient sampling. In our first experiment, we built a correlated topic modelling structure based on the traditional multinomial probit and tested the computational speed of its key sampling tasks. The high posterior dependency between auxiliary variables and parameters makes the multinomial probit essentially unscalable in situations where it is impossible for the sampler to yield a random variate of the auxiliary variable whose component at the current topic allocation label is also the maximum, as required by (5). Under a random initialization of the topic assignments, the auxiliary-variable sampler cannot even complete a single iteration.
Even with a good initialization of the topical prior $\eta_d$ that permits smooth sampling of the auxiliary variables, computational efficiency remains poor: for larger topical spaces such as K = 40, we observed that the auxiliary-variable sampler stalled again after a number of iterations, indicating that even a good initialization does not ease the troublesome dependency between the auxiliary variables and the parameters in larger topical spaces. Unlike the traditional probit model, for which the computation of $\theta_{kd}$ is virtually impractical for large $K$, the diagonal orthant approach makes this computation substantially faster even for large $K$. A comparison of the computational speed of the two essential sampling tasks under the multinomial probit model and the diagonal orthant probit model is shown in Table 1. In addition to the drastic improvement in overall sampling efficiency, we noticed that the computational complexity of sampling the auxiliary variables and the topic distribution is close to O(1) and O(K) respectively, suggesting that the probit normal topic model has now become an attainable and feasible alternative to the traditional correlated topic model. Central to topic modelling is the need to determine, for a given corpus, the optimal number of latent topics. As with most latent variable models, this task can be formidable at times, and there is no consensus among machine learning researchers as to which of the existing methods is best. Figure 1 shows the log-likelihood as a function of the number of topics in the model. Apart from the log-likelihood, many other techniques are commonly used, such as perplexity and the harmonic mean method. As we see, the optimal number of topics in this case is 30. In Table 2, we show a subset of the 30 topics uncovered, where each topic is represented by its 10 most frequent words.
It can be seen that our probit normal topic model successfully captures the co-occurrence of words within topics.

Sampling task (processing time, seconds)    MNP           DO Probit
K=10: topic distribution θ                  18.3          0.06
K=10: auxiliary variable Y_d                108 to NA     3.09
K=20: topic distribution θ                  63            0.13
K=20: auxiliary variable Y_d                334 to NA     3.39
K=30: topic distribution θ                  123           0.21
K=30: auxiliary variable Y_d                528 to NA     3.49
K=40: topic distribution θ                  211.49        0.33
K=40: auxiliary variable Y_d                1785 to NA    3.79

Table 1: All the numbers in this table represent processing time (in seconds), computed in R on a PC using a parallel algorithm acting on 4 CPU cores. NA represents situations where it is impossible for the sampler to yield a random variate of the auxiliary variable whose component at the current topic allocation label is also the maximum.

[Figure 1: Log-likelihood as a function of the number of topics, for the probit and logistic models.]

In Figure 2, we also show the correlation structure between the various topics, which is the essential purpose of employing the correlated topic model. Evidently, the correlations captured intuitively reflect the natural relationships between similar topics.
Topic 25: court, trial, judge, prison, convicted, jury, drug, guilty, fbi, sentence
Topic 18: company, billion, inc, corp, percent, stock, workers, contract, companies, offer
Topic 23: bush, senate, vote, dukakis, percent, bill, kennedy, sales, bentsen, ticket
Topic 11: students, school, meese, student, schools, teachers, board, education, teacher, tax
Topic 1: tax, budget, billion, bill, percent, senate, income, legislation, taxes, bush
Topic 24: fire, water, rain, northern, southern, inches, fair, degrees, snow, temperatures
Topic 27: air, plane, flight, airlines, pilots, aircraft, planes, airline, eastern, airport
Topic 6: percent, stock, index, billion, prices, rose, stocks, average, points, shares
Topic 12: space, shuttle, soviet, nasa, launch, mission, earth, north, korean, south
Topic 20: military, china, chinese, soldiers, troops, saudi, trade, rebels, hong, army
Topic 2: soviet, gorbachev, bush, reagan, moscow, summit, soviets, treaty, europe, germany
Topic 22: aid, rebels, contras, nicaragua, contra, sandinista, military, ortega, sandinistas, rebel
Topic 16: police, arrested, shot, shooting, injured, car, officers, bus, killing, arrest
Topic 15: dollar, yen, rates, bid, prices, price, london, gold, percent, trading
Topic 19: iraq, kuwait, iraqi, german, gulf, germany, saudi, iran, bush, military
Topic 14: trade, percent, farmers, farm, billion, japan, agriculture, japanese, tons, drought
Topic 7: israel, israeli, jewish, palestinian, arab, palestinians, army, occupied, students, gaza
Topic 4: navy, ship, coast, island, boat, ships, earthquake, sea, scale, guard
Topic 30: percent, oil, prices, price, cents, gasoline, average, offers, gold, crude
Topic 8: south, africa, african, black, church, pope, mandela, blacks, apartheid, catholic
Topic 17: film, movie, music, theater, actor, actress, award, band, book, films

Table 2: Representation of topics discovered by our method, each shown by its 10 most frequent words.

[Figure 2: Graphical representation of the correlation among topics.]

V. Conclusion and Discussion

By adapting the diagonal orthant probit model, we have proposed a probit alternative to the logit approach to topic modelling, in a context where many other researchers seem to have avoided the probit. Compared to the multinomial probit model we constructed, our topic discovery scheme based on the diagonal orthant probit model enjoys several desirable properties. First, we gained efficiency in computing the topic distribution $\theta_{kd}$. Second, we achieved a clear, straightforward, and adaptable auxiliary-variable sampling scheme that substantially reduces the strength of the dependence between auxiliary variables and model parameters, which is responsible for absorbing states in the Markov chain. Third, as a consequence of good mixing, our approach makes the probit model a viable and competitive alternative to its logistic counterpart. In addition to all these benefits, our proposed method offers straightforward and inherent conjugacy, which avoids the complicated sampling schemes employed in the logistic normal topic model. In the Associated Press example explored in the previous section, our method not only produces a better likelihood than the logistic normal topic model fitted with variational EM, but also discovers meaningful topics along with the underlying correlation structure between topics. Overall, the method we developed in this paper offers another feasible alternative in the context of correlated topic models, one that we hope will be further explored and extended by other researchers. Based on the promising results presented in this paper, the probit normal topic model opens the door to various future works. For instance, Salomatin et al.
(2009) proposed a multi-field correlated topic model by relaxing the assumption of a common set of topics shared globally among all documents; this idea can also be applied to the probit model to enrich the comprehensiveness of the structural relationships between topics. Another potential direction would be to enhance the scalability of the model. Currently we use a simple distributed algorithm proposed by Yao et al. (2009) and Newman et al. (2009) for efficient Gibbs sampling. The architecture for topic models presented by Smola and Narayanamurthy (2010) could be further utilized to reduce the computational complexity substantially while delivering comparable performance. Furthermore, a novel sampling method involving Gibbs max-margin topic models (Zhu et al., 2013) could further improve the computational efficiency.

References

Albert, J. H. and S. Chib (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88(422), 669–679.

Blei, D. M. and J. D. Lafferty (2006). Correlated topic models. In Proceedings of the 23rd International Conference on Machine Learning, pp. 113–120. MIT Press.

Blei, D. M., A. Y. Ng, and M. I. Jordan (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.

Chen, J., J. Zhu, Z. Wang, X. Zheng, and B. Zhang (2013). Scalable inference for logistic-normal topic models. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger (Eds.), Advances in Neural Information Processing Systems 26, pp. 2445–2453. Curran Associates, Inc.

Grün, B. and K. Hornik. topicmodels: An R package for fitting topic models. Journal of Statistical Software 40(13), 1–30.

Johndrow, J., K. Lum, and D. B. Dunson (2013). Diagonal orthant multinomial probit models. In JMLR Proceedings, Volume 31 of AISTATS, pp. 29–38. JMLR.

Mimno, D., H. M. Wallach, and A. McCallum (2008).
Gibbs sampling for logistic normal topic models with graph-based priors.

Newman, D., A. Asuncion, P. Smyth, and M. Welling (2009, December). Distributed algorithms for topic models. Journal of Machine Learning Research 10, 1801–1828.

Salomatin, K., Y. Yang, and A. Lad (2009). Multi-field correlated topic modeling. In Proceedings of the SIAM International Conference on Data Mining, SDM 2009, April 30 - May 2, 2009, Sparks, Nevada, USA, pp. 628–637.

Smola, A. and S. Narayanamurthy (2010). An architecture for parallel topic models. In VLDB.

Yao, L., D. Mimno, and A. McCallum (2009). Efficient methods for topic model inference on streaming document collections. In KDD.

Zhu, J., N. Chen, H. Perkins, and B. Zhang (2013). Gibbs max-margin topic models with data augmentation. CoRR abs/1310.2816.