MedLDA: A General Framework of Maximum Margin Supervised Topic Models


Authors: Jun Zhu, Amr Ahmed, Eric P. Xing

Journal of Machine Learning Research 1 (2008) 1-48. Submitted 4/00; Published 10/00

Jun Zhu (junzhu@cs.cmu.edu), School of Computer Science, Carnegie Mellon University
Amr Ahmed (amahmed@cs.cmu.edu), School of Computer Science, Carnegie Mellon University
Eric P. Xing (epxing@cs.cmu.edu), School of Computer Science, Carnegie Mellon University

Abstract

Supervised topic models utilize a document's side information for discovering predictive low-dimensional representations of documents. Existing models apply likelihood-based estimation. In this paper, we present a general framework of max-margin supervised topic models for both continuous and categorical response variables. Our approach, maximum entropy discrimination latent Dirichlet allocation (MedLDA), utilizes the max-margin principle to train supervised topic models and to estimate predictive topic representations that are arguably more suitable for prediction tasks. The general principle of MedLDA can be applied to perform joint max-margin learning and maximum likelihood estimation for arbitrary topic models, directed or undirected, supervised or unsupervised, whenever supervised side information is available. We develop efficient variational methods for posterior inference and parameter estimation, and demonstrate qualitatively and quantitatively the advantages of MedLDA over likelihood-based topic models on movie review and 20 Newsgroups data sets.

Keywords: topic models, maximum entropy discrimination latent Dirichlet allocation, max-margin learning

1. Introduction

Latent topic models such as latent Dirichlet allocation (LDA) (Blei et al., 2003) have recently gained much popularity for managing large collections of documents by discovering a low-dimensional representation that captures the latent semantics of the collection.
LDA posits that each document is an admixture of latent topics, where each topic is represented as a unigram distribution over a given vocabulary. The document-specific admixture proportion is distributed as a latent Dirichlet random variable and constitutes a low-dimensional representation of the document. This low-dimensional representation can be used for tasks like classification and clustering, or merely as a tool for structurally browsing an otherwise unstructured collection. The traditional LDA (Blei et al., 2003) is an unsupervised model and thus is incapable of incorporating the useful side information associated with corpora, which is common in practice. For example, online users usually post their reviews of products or restaurants together with a rating score or pros/cons rating; webpages can have category labels; and the images in the LabelMe (Russell et al., 2008) dataset are organized into different categories, with each image associated with a set of annotation tags. Incorporating such supervised side information may guide the topic models towards discovering secondary or non-dominant statistical patterns (Chechik and Tishby, 2002), which may be more interesting or relevant to the users' goals (e.g., predicting on unlabeled data). In contrast, the unsupervised LDA ignores such supervised information and may yield more prominent but perhaps orthogonal (to the users' goals) latent semantic structures. This problem is serious when dealing with complex data, which usually have multiple, alternative, and conflicting underlying structures. Therefore, in order to better extract the relevant or interesting underlying structures of corpora, the supervised side information should be incorporated.

Recently, learning latent topic models with side information has gained increasing attention.
Major instances include the supervised topic model (sLDA) (Blei and McAuliffe, 2007) for regression,[1] multi-class LDA (an sLDA classification model) (Wang et al., 2009), and the discriminative LDA (DiscLDA) (Lacoste-Julien et al., 2008) classification model. All these models focus on document-level supervised information, such as document categories or review rating scores. Other variants of supervised topic models have been designed for different application problems, such as the aspect rating model (Titov and McDonald, 2008) and the credit attribution model (Ramage et al., 2009); the former predicts ratings for each aspect, while the latter associates each word with a label. In this paper, without loss of generality, we focus on incorporating document-level supervision; our learning principle generalizes to arbitrary topic models. Among the document-level models, although sLDA and DiscLDA share the same goal (uncovering the latent structure in a document collection while retaining predictive power for supervised tasks), they differ in their training procedures: sLDA is trained by maximizing the joint likelihood of data and response variables, while DiscLDA is trained to maximize the conditional likelihood of the response variables. Furthermore, to the best of our knowledge, almost all existing supervised topic models are trained by maximizing the data likelihood. In this paper, we propose a general principle for learning max-margin discriminative supervised latent topic models for both regression and classification.
In contrast to the two-stage procedure of using topic models for prediction tasks (i.e., first discovering latent topics and then feeding them to downstream prediction models), the proposed maximum entropy discrimination latent Dirichlet allocation (MedLDA) is an integration of max-margin prediction models (e.g., support vector machines for classification) and hierarchical Bayesian topic models, obtained by optimizing a single objective function with a set of expected margin constraints. MedLDA is a special instance of PoMEN (partially observed maximum entropy discrimination Markov network) (Zhu et al., 2008b), a framework proposed to combine max-margin learning and structured hidden variables in undirected Markov networks for discovering latent topic representations of documents. In MedLDA, the parameters of the regression or classification model are learned in a max-margin sense, and the discovery of latent topics is coupled with the max-margin estimation of the model parameters. This interplay yields latent topic representations that are more discriminative and more suitable for supervised prediction tasks.

[1] Although integrating sLDA with a generalized linear model was discussed in (Blei and McAuliffe, 2007), no result was reported about the performance of sLDA when used for classification tasks. The classification model was reported in a later paper (Wang et al., 2009).
The MedLDA principle of joint max-margin learning and maximum likelihood estimation is extremely general and can be applied to arbitrary topic models, including directed topic models (e.g., LDA and sLDA), undirected Markov networks (e.g., the Harmonium (Welling et al., 2004)), unsupervised models (e.g., LDA and the Harmonium), supervised models (e.g., sLDA and the hierarchical Harmonium (Yang et al., 2007)), and other variants of topic models with different priors, such as correlated topic models (CTMs) (Blei and Lafferty, 2005). In this paper, we present several examples of applying the max-margin principle to learn MedLDA models that use the unsupervised and supervised LDA as the underlying topic models to discover latent topic representations of documents for both regression and classification. We develop efficient and easy-to-implement variational methods for MedLDA; in fact, its running time for classification is comparable to that of an unsupervised LDA. This property stems from the fact that the MedLDA classification model directly optimizes the margin and does not suffer from a normalization factor, which generally makes learning hard in fully generative models such as sLDA.

The paper is structured as follows. Section 2 introduces the basic concepts of latent topic models. Sections 3 and 4 present the MedLDA models for regression and classification, respectively, with efficient variational EM algorithms. Section 5 discusses the generalization of MedLDA to other latent variable topic models. Section 6 presents an empirical comparison between MedLDA and likelihood-based topic models for both regression and classification. Section 7 discusses related work. Finally, Section 8 concludes the paper with future research directions.

2. Unsupervised and Supervised Topic Models

In this section, we review the basic concepts of unsupervised and supervised topic models and two variational upper bounds that will be used later.

The unsupervised LDA (latent Dirichlet allocation) (Blei et al., 2003) is a hierarchical Bayesian model in which topic proportions for a document are drawn from a Dirichlet distribution, and words in the document are repeatedly sampled from a topic, which is itself drawn from those topic proportions. Supervised topic models (sLDA) (Blei and McAuliffe, 2007) introduce a response variable into LDA for each document, as illustrated in Figure 1.

Let $K$ be the number of topics and $M$ be the number of terms in the vocabulary. $\beta$ denotes a $K \times M$ matrix, each row $\beta_k$ of which is a distribution over the $M$ terms. For the regression problem, where the response variable $y \in \mathbb{R}$, the generative process of sLDA is as follows:

1. Draw topic proportions $\theta \mid \alpha \sim \mathrm{Dir}(\alpha)$.
2. For each word:
   (a) Draw a topic assignment $z_n \mid \theta \sim \mathrm{Mult}(\theta)$.
   (b) Draw a word $w_n \mid z_n, \beta \sim \mathrm{Mult}(\beta_{z_n})$.
3. Draw a response variable $y \mid z_{1:N}, \eta, \delta^2 \sim \mathcal{N}(\eta^\top \bar{z}, \delta^2)$, where $\bar{z} = \frac{1}{N}\sum_{n=1}^{N} z_n$ is the average topic proportion of a document.

Figure 1: Supervised topic model (Blei and McAuliffe, 2007).

The model defines a joint distribution:
$$p(\theta, \mathbf{z}, \mathbf{y}, W \mid \alpha, \beta, \eta, \delta^2) = \prod_{d=1}^{D} p(\theta_d \mid \alpha) \Big( \prod_{n=1}^{N} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \Big) p(y_d \mid \eta^\top \bar{z}_d, \delta^2),$$
where $\mathbf{y}$ is the vector of response variables in a corpus $\mathcal{D}$ and $W$ are all the words. The joint likelihood on $\mathcal{D}$ is $p(\mathbf{y}, W \mid \alpha, \beta, \eta, \delta^2)$. To estimate the unknown parameters $(\alpha, \beta, \eta, \delta^2)$, sLDA maximizes the log-likelihood $\log p(\mathbf{y}, W \mid \alpha, \beta, \eta, \delta^2)$.
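As a concrete illustration, the generative process above can be simulated directly. The following numpy sketch samples a single document and its response; the corpus sizes, seed, and parameter values are arbitrary toy choices, not values from the paper.

```python
import numpy as np

def sample_slda_document(alpha, beta, eta, delta2, N, rng):
    """Sample one document (words and response) from the sLDA generative process."""
    K, M = beta.shape
    theta = rng.dirichlet(alpha)                  # step 1: topic proportions
    z = rng.choice(K, size=N, p=theta)            # step 2(a): topic assignment per word
    words = np.array([rng.choice(M, p=beta[z[n]]) for n in range(N)])  # step 2(b)
    z_bar = np.bincount(z, minlength=K) / N       # empirical average topic assignment
    y = rng.normal(eta @ z_bar, np.sqrt(delta2))  # step 3: Gaussian response
    return words, y, z_bar

rng = np.random.default_rng(0)
K, M, N = 3, 10, 50
alpha = np.ones(K)
beta = rng.dirichlet(np.ones(M), size=K)          # each row: distribution over M terms
eta = np.array([1.0, -1.0, 0.0])
words, y, z_bar = sample_slda_document(alpha, beta, eta, 0.1, N, rng)
```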
Given a new document, the expected response value is the prediction:
$$\hat{y} \triangleq E[Y \mid w_{1:N}, \alpha, \beta, \eta, \delta^2] = \eta^\top E[\bar{Z} \mid w_{1:N}, \alpha, \beta, \delta^2], \qquad (1)$$
where $E[X]$ denotes an expectation with respect to the posterior distribution of the random variable $X$. Since exact inference of the posterior distribution of the hidden variables and of the likelihood is intractable, variational methods (Jordan et al., 1999) are applied to obtain approximate solutions. Let $q(\theta, \mathbf{z} \mid \gamma, \phi)$ be a variational distribution that approximates the posterior $p(\theta, \mathbf{z} \mid \alpha, \beta, \eta, \delta^2, \mathbf{y}, W)$. By using Jensen's inequality, we can obtain a variational upper bound of the negative log-likelihood:
$$\mathcal{L}^s(q) = -E_q[\log p(\theta, \mathbf{z}, \mathbf{y}, W \mid \alpha, \beta, \eta, \delta^2)] - \mathcal{H}(q(\mathbf{z}, \theta)) \ge -\log p(\mathbf{y}, W \mid \alpha, \beta, \eta, \delta^2),$$
where $\mathcal{H}(q) \triangleq -E_q[\log q]$ is the entropy of $q$. By introducing some independence assumptions (like mean field) about the $q$ distribution, this upper bound can be efficiently optimized, and we can estimate the parameters $(\alpha, \beta, \eta, \delta^2)$ and obtain the best approximation $q$. See (Blei and McAuliffe, 2007) for more details.

For the unsupervised LDA, the generative procedure is similar but without the third step. The joint distribution is $p(\theta, \mathbf{z}, W \mid \alpha, \beta) = \prod_{d=1}^{D} p(\theta_d \mid \alpha) \big( \prod_{n=1}^{N} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \big)$ and the likelihood is $p(W \mid \alpha, \beta)$. Similarly, a variational upper bound can be derived for approximate inference:
$$\mathcal{L}^u(q) = -E_q[\log p(\theta, \mathbf{z}, W \mid \alpha, \beta)] - \mathcal{H}(q(\mathbf{z}, \theta)) \ge -\log p(W \mid \alpha, \beta),$$
where $q(\theta, \mathbf{z})$ is a variational distribution that approximates the posterior $p(\theta, \mathbf{z} \mid \alpha, \beta, W)$. Again, by making some independence assumptions, parameter estimation and posterior inference can be done efficiently by optimizing $\mathcal{L}^u(q)$. See (Blei et al., 2003) for more details.
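Under a fully factorized (mean-field) variational posterior, where $E[Z_n] = \phi_n$, the prediction rule (1) reduces to a dot product between $\eta$ and the average of the per-word variational distributions. A minimal sketch with toy values for $\phi$ and $\eta$ (not values from the paper):

```python
import numpy as np

def slda_predict(phi, eta):
    """Prediction rule (1): y_hat = eta^T E[Z_bar], where E[Z_bar] = (1/N) sum_n phi_n
    under a mean-field variational posterior with E[Z_n] = phi_n."""
    z_bar_expect = phi.mean(axis=0)
    return eta @ z_bar_expect

# phi: N x K matrix of per-word topic posteriors (each row sums to one); toy values
phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.8, 0.1],
                [0.5, 0.25, 0.25]])
eta = np.array([2.0, -1.0, 0.0])
y_hat = slda_predict(phi, eta)
```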
In sLDA, by changing the distribution model for generating response variables, other types of responses can be modeled, such as the discrete labels of a classification problem (Blei and McAuliffe, 2007; Wang et al., 2009). However, posterior inference and parameter estimation in the supervised LDA classification model are much more difficult than in the sLDA regression model because of the normalization factor of the non-Gaussian distribution model for the response variables. Variational methods or multi-delta methods were used to approximate this normalization factor (Wang et al., 2009; Blei and McAuliffe, 2007). DiscLDA (Lacoste-Julien et al., 2008) is a discriminative variant of supervised topic models for classification, where the unknown parameters (i.e., a linear transformation matrix) are learned by maximizing the conditional likelihood of the response variables.

Although both maximum likelihood estimation (MLE) and maximum conditional likelihood estimation (MCLE) have shown great success in many cases, max-margin learning is arguably more discriminative and closer to our final prediction task in supervised topic models. Empirically, max-margin methods like support vector machines (SVMs) for classification have demonstrated impressive success in a wide range of tasks, including image classification and character recognition. In addition to this empirical success, max-margin methods enjoy strong generalization guarantees and are able to use kernels, allowing the classifier to deal with very high-dimensional feature spaces. To integrate the advantages of max-margin methods into the procedure of discovering latent topics, we present below a max-margin variant of the supervised topic models, which can discover predictive topic representations that are more suitable for supervised prediction tasks, e.g., regression and classification.

3. Maximum Entropy Discrimination LDA for Regression

In this section, we consider the supervised prediction task where the response variables take continuous real values, known in machine learning as a regression problem. We present two MedLDA regression models that perform max-margin learning for the supervised LDA and unsupervised LDA models. Before diving into the full exposition of our methods, we first review the basic support vector regression method, upon which MedLDA is built.

3.1 Support Vector Regression

Support vector machines have been developed for both classification and regression. In this section, we consider support vector regression (SVR), on which a comprehensive tutorial has been published by Smola and Schölkopf (2003). Here, we provide a brief recap of the basic concepts.

Suppose we are given a training set $\mathcal{D} = \{(\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_D, y_D)\}$, where $\mathbf{x} \in \mathcal{X}$ are inputs and $y \in \mathbb{R}$ are real response values. In $\epsilon$-support vector regression (Vapnik, 1995), our goal is to find a function $h(\mathbf{x}) \in \mathcal{F}$ that has at most $\epsilon$ deviation from the true response values $y$ for all the training data and is at the same time as flat as possible. One common choice of the function family $\mathcal{F}$ is the set of linear functions, that is, $h(\mathbf{x}) = \eta^\top f(\mathbf{x})$, where $f = \{f_1, \cdots, f_K\}$ is a vector of feature functions, each $f_k: \mathcal{X} \to \mathbb{R}$, and $\eta$ is the corresponding weight vector. Formally, the linear SVR finds an optimal linear function by solving the following constrained convex optimization problem:
$$\mathrm{P0(SVR)}: \quad \min_{\eta, \xi, \xi^\star} \ \frac{1}{2}\|\eta\|_2^2 + C \sum_{d=1}^{D} (\xi_d + \xi_d^\star)$$
$$\mathrm{s.t.} \ \forall d: \quad y_d - \eta^\top f(\mathbf{x}_d) \le \epsilon + \xi_d; \quad -y_d + \eta^\top f(\mathbf{x}_d) \le \epsilon + \xi_d^\star; \quad \xi_d, \xi_d^\star \ge 0,$$
where $\|\eta\|_2^2 = \eta^\top \eta$ is the $\ell_2$-norm; $\xi$ and $\xi^\star$ are slack variables that tolerate some errors in the training data; and $\epsilon$ is the precision parameter.
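Problem P0 is equivalent to regularized empirical risk minimization under the $\epsilon$-insensitive loss, which is zero inside the $\epsilon$-tube and linear outside. A small numpy sketch of that loss-based form (toy inputs only; this is not a solver):

```python
import numpy as np

def eps_insensitive_loss(y, y_pred, eps):
    """epsilon-insensitive loss: zero inside the eps-tube, linear outside.
    Equals xi_d + xi*_d at the optimum of P0's slack variables."""
    return np.maximum(0.0, np.abs(y - y_pred) - eps)

def svr_objective(eta, X, y, C, eps):
    """Primal objective of P0 in regularized-loss form: (1/2)||eta||^2 + C * sum(loss)."""
    y_pred = X @ eta
    return 0.5 * eta @ eta + C * eps_insensitive_loss(y, y_pred, eps).sum()

# toy check: one point outside the tube (loss 0.9), one inside (loss 0)
loss = eps_insensitive_loss(np.array([2.0, 1.0]), np.array([1.0, 1.05]), 0.1)
# toy objective where predictions are exact, so only the norm term remains
obj = svr_objective(np.array([0.5, -0.5]), np.eye(2), np.array([0.5, -0.5]), 1.0, 0.1)
```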
The positive regularization constant $C$ determines the trade-off between the flatness of $h$ (represented by the $\ell_2$-norm) and the amount up to which deviations larger than $\epsilon$ are tolerated. The problem P0 can be equivalently formulated as regularized empirical loss minimization, where the loss is the so-called $\epsilon$-insensitive loss (Smola and Schölkopf, 2003).

The standard SVR optimization problem P0 is a QP problem and can be easily solved in its dual formulation. In the Lagrangian method, samples with non-zero Lagrange multipliers are called support vectors, as in the SVM classification model. There are also freely available packages for solving a standard SVR problem, such as SVM-light (Joachims, 1999). We will use these methods as a sub-routine in solving our proposed approach.

3.2 Learning MedLDA for Regression

Instead of learning a point estimate of $\eta$ as in sLDA, we take a more general[2] Bayesian-style (i.e., averaging-model) approach and learn a distribution[3] $q(\eta)$ in a max-margin manner. For prediction, we average over all possible models (represented by $\eta$):
$$\hat{y} \triangleq E[Y \mid w_{1:N}, \alpha, \beta, \delta^2] = E[\eta^\top \bar{Z} \mid w_{1:N}, \alpha, \beta, \delta^2]. \qquad (2)$$
Now, the question underlying the averaging prediction rule (2) is how we can devise an appropriate loss function and constraints to integrate the max-margin concepts of SVR into latent topic discovery. In the sequel, we present the maximum entropy discrimination latent Dirichlet allocation (MedLDA), which is an extension of the PoMEN (partially observed maximum entropy discrimination Markov networks) framework (Zhu et al., 2008b). PoMEN is an elegant combination of max-margin learning with structured hidden variables in Markov networks.
MedLDA extends PoMEN to learn directed Bayesian networks with latent variables, in particular latent topic models, which discover the latent semantic structures of document collections.

[2] In the special case of linear models, the posterior mean of an averaging model can be solved for directly, in the same manner as a point estimate.
[3] In principle, we can perform Bayesian-style estimation for the other parameters as well, such as $\delta^2$. For simplicity, we only consider $\eta$ as a random variable in this paper.

There are two principled choice points in MedLDA according to the prediction rule (2): (1) the distribution of the model parameter $\eta$; and (2) the distribution of the latent topic assignment $Z$. Below, we present two MedLDA regression models, using either supervised LDA or unsupervised LDA to discover the latent topic assignment $Z$. Accordingly, we denote these two models as MedLDA$^r_{full}$ and MedLDA$^r_{partial}$.

3.2.1 Max-Margin Training of sLDA

For regression, MedLDA is defined as an integration of a Bayesian sLDA, where the parameter $\eta$ is sampled from a prior $p_0(\eta)$, and the $\epsilon$-insensitive support vector regression (SVR) (Smola and Schölkopf, 2003). Thus, MedLDA defines the joint distribution $p(\theta, \mathbf{z}, \eta, \mathbf{y}, W \mid \alpha, \beta, \delta^2) = p_0(\eta)\, p(\theta, \mathbf{z}, \mathbf{y}, W \mid \alpha, \beta, \eta, \delta^2)$, where the second term is the same as in sLDA. Since directly optimizing the log-likelihood is intractable, as in sLDA we optimize its upper bound. Different from sLDA, $\eta$ is now a random variable, so we define the variational distribution $q(\theta, \mathbf{z}, \eta \mid \gamma, \phi)$ to approximate the true posterior $p(\theta, \mathbf{z}, \eta \mid \alpha, \beta, \delta^2, \mathbf{y}, W)$.
Then, the upper bound of the negative log-likelihood $-\log p(\mathbf{y}, W \mid \alpha, \beta, \delta^2)$ is
$$\mathcal{L}^{bs}(q) \triangleq -E_q[\log p(\theta, \mathbf{z}, \eta, \mathbf{y}, W \mid \alpha, \beta, \delta^2)] - \mathcal{H}(q(\theta, \mathbf{z}, \eta)) = KL(q(\eta) \,\|\, p_0(\eta)) + E_{q(\eta)}[\mathcal{L}^s], \qquad (3)$$
where $KL(p \| q) = E_p[\log(p/q)]$ is the Kullback-Leibler (KL) divergence. Thus, the integrated learning problem is defined as:
$$\mathrm{P1(MedLDA^r_{full})}: \quad \min_{q, \alpha, \beta, \delta^2, \xi, \xi^\star} \ \mathcal{L}^{bs}(q) + C \sum_{d=1}^{D} (\xi_d + \xi_d^\star)$$
$$\mathrm{s.t.} \ \forall d: \quad y_d - E[\eta^\top \bar{Z}_d] \le \epsilon + \xi_d \ (\mu_d); \quad -y_d + E[\eta^\top \bar{Z}_d] \le \epsilon + \xi_d^\star \ (\mu_d^\star); \quad \xi_d \ge 0 \ (v_d); \quad \xi_d^\star \ge 0 \ (v_d^\star),$$
where $\mu, \mu^\star, v, v^\star$ (noted in parentheses after the corresponding constraints) are Lagrange multipliers; $\xi, \xi^\star$ are slack variables absorbing errors in the training data; and $\epsilon$ is the precision parameter. The constraints in P1 have the same form as those of P0, but in an expected version, because both the latent topic assignments $Z$ and the model parameters $\eta$ are random variables in MedLDA. As in SVR, the expected constraints correspond to an $\epsilon$-insensitive loss: if the current prediction $\hat{y}$ as in Eq. (2) does not deviate from the target value too much (i.e., by less than $\epsilon$), there is no loss; otherwise, a linear loss is incurred.

The rationale underlying MedLDA$^r_{full}$ is as follows: letting the current model be $p(\theta, \mathbf{z}, \eta, \mathbf{y}, W \mid \alpha, \beta, \delta^2)$, we want to find a latent topic representation and a model distribution (as represented by the distribution $q$) which on one hand tend to predict correctly on the data with a sufficiently large margin, and on the other hand tend to explain the data well (i.e., to minimize a variational upper bound of the negative log-likelihood). The max-margin estimation and the topic discovery procedure are coupled via the constraints, which are defined on the expectations of the model parameters $\eta$ and the latent topic representations $Z$. This interplay yields a topic representation that is more suitable for max-margin learning, as explained below.
Variational EM-Algorithm: Solving the constrained problem P1 exactly is intractable. We therefore use mean-field variational methods (Jordan et al., 1999) to efficiently obtain an approximate $q$. The basic principle of mean-field variational methods is to posit a factorized distribution over the latent variables, parameterized by free variables called variational parameters. These parameters are fit so that the KL divergence between the approximate $q$ and the true posterior is small. Variational methods have been used successfully in many topic models, as presented in Section 2. As in standard topic models, we assume $q(\theta, \mathbf{z}, \eta \mid \gamma, \phi) = q(\eta) \prod_{d=1}^{D} q(\theta_d \mid \gamma_d) \prod_{n=1}^{N} q(z_{dn} \mid \phi_{dn})$, where $\gamma_d$ is a $K$-dimensional vector of Dirichlet parameters and each $\phi_{dn}$ is a categorical distribution over $K$ topics. Then, $E[Z_{dn}] = \phi_{dn}$ and $E[\eta^\top \bar{Z}_d] = E[\eta]^\top \frac{1}{N} \sum_{n=1}^{N} \phi_{dn}$.

We can develop an EM algorithm that iteratively solves the following two steps. E-step: infer the posterior distribution of the hidden variables $\theta$, $Z$, and $\eta$. M-step: estimate the unknown model parameters $\alpha$, $\beta$, and $\delta^2$. The essential difference between MedLDA and sLDA lies in the E-step inference of the posterior distribution of $\mathbf{z}$ and $\eta$, because of the margin constraints in P1. As we shall see in Eq. (5), these constraints bias the expected topic proportions towards ones that are more suitable for the supervised prediction task. Since the constraints in P1 are not on the model parameters ($\alpha$, $\beta$, and $\delta^2$), the M-step is similar to that of sLDA. We outline the algorithm in Alg. 1 and explain it in detail below. Specifically, we formulate a Lagrangian $L$ for P1:
$$L = \mathcal{L}^{bs}(q) + C \sum_{d=1}^{D} (\xi_d + \xi_d^\star) - \sum_{d=1}^{D} \mu_d (\epsilon + \xi_d - y_d + E[\eta^\top \bar{Z}_d]) - \sum_{d=1}^{D} \big( \mu_d^\star (\epsilon + \xi_d^\star + y_d - E[\eta^\top \bar{Z}_d]) + v_d \xi_d + v_d^\star \xi_d^\star \big) - \sum_{d=1}^{D} \sum_{i=1}^{N} c_{di} \Big( \sum_{j=1}^{K} \phi_{dij} - 1 \Big),$$
where the last term is due to the normalization condition $\sum_{j=1}^{K} \phi_{dij} = 1, \ \forall i, d$. The EM procedure then alternately optimizes the Lagrangian with respect to each argument.

1. E-step: We infer the posterior distribution of the latent variables $\theta$, $Z$, and $\eta$. For $\theta$ and $Z$, inferring the posterior distribution amounts to fitting the variational parameters $\gamma$ and $\phi$, because of the mean-field assumption about $q$; for $\eta$, the optimization is over $q(\eta)$ itself. Specifically, we have the following update rules for the different latent variables. Since the constraints in P1 do not involve $\gamma$, optimizing $L$ with respect to $\gamma_d$ gives the same update formula as in sLDA:
$$\gamma_d \leftarrow \alpha + \sum_{n=1}^{N} \phi_{dn}. \qquad (4)$$
Due to the fully factorized assumption on $q$, for each document $d$ and each word $i$, setting $\partial L / \partial \phi_{di} = 0$ yields:
$$\phi_{di} \propto \exp\Big( E[\log \theta \mid \gamma] + E[\log p(w_{di} \mid \beta)] + \frac{y_d}{N \delta^2} E[\eta] - \frac{2 E[(\eta^\top \phi_{d,-i}) \eta] + E[\eta \circ \eta]}{2 N^2 \delta^2} + \frac{E[\eta]}{N} (\mu_d - \mu_d^\star) \Big), \qquad (5)$$
where $\phi_{d,-i} = \sum_{n \ne i} \phi_{dn}$; $\eta \circ \eta$ is the element-wise product; and exponentiating a vector yields the vector of exponentials of its components. The first two terms in the exponential are the same as in unsupervised LDA.

Algorithm 1: Variational MedLDA$^r$
Input: corpus $\mathcal{D} = \{(\mathbf{y}, W)\}$, constants $C$ and $\epsilon$, and topic number $K$.
Output: Dirichlet parameters $\gamma$, posterior distribution $q(\eta)$, parameters $\alpha$, $\beta$, and $\delta^2$.
repeat
  /**** E-Step ****/
  for $d = 1$ to $D$:
    Update $\gamma_d$ as in Eq. (4).
    for $i = 1$ to $N$: update $\phi_{di}$ as in Eq. (5).
  Solve the dual problem D1 to get $q(\eta)$, $\mu$, and $\mu^\star$.
  /**** M-Step ****/
  Update $\beta$ using Eq. (7) and $\delta^2$ using Eq. (8); $\alpha$ is fixed at $1/K$ times the ones vector.
until convergence
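To make the E-step concrete, here is a numpy sketch of the updates (4) and (5) for a single document. The finite-difference `digamma` helper keeps the sketch dependency-free, and all inputs are toy values; in particular, `mu_d` and `mu_star_d` are assumed to have been obtained by solving the dual problem D1, and `E_eta`, `E_eta_etaT` are the first and second moments of $q(\eta)$.

```python
import numpy as np
from math import lgamma

def digamma(x, h=1e-5):
    """Numerical digamma via a central difference of log-gamma (adequate for a sketch)."""
    return (lgamma(x + h) - lgamma(x - h)) / (2.0 * h)

def update_gamma(alpha, phi):
    """Eq. (4): gamma_d = alpha + sum_n phi_dn."""
    return alpha + phi.sum(axis=0)

def update_phi_di(i, phi, gamma, log_beta_wi, y_d, E_eta, E_eta_etaT, delta2, mu_d, mu_star_d):
    """Eq. (5) for one word i of document d."""
    N, K = phi.shape
    E_log_theta = np.array([digamma(g) for g in gamma]) - digamma(gamma.sum())
    phi_minus_i = phi.sum(axis=0) - phi[i]          # phi_{d,-i} = sum_{n != i} phi_dn
    # E[(eta^T phi_{d,-i}) eta] = E[eta eta^T] phi_{d,-i};  E[eta o eta] = diag(E[eta eta^T])
    second_order = 2.0 * (E_eta_etaT @ phi_minus_i) + np.diag(E_eta_etaT)
    log_phi = (E_log_theta + log_beta_wi
               + y_d / (N * delta2) * E_eta
               - second_order / (2.0 * N**2 * delta2)
               + E_eta / N * (mu_d - mu_star_d))
    p = np.exp(log_phi - log_phi.max())             # normalize in a numerically stable way
    return p / p.sum()

# toy demo with K = 2 topics and N = 2 words
alpha0 = np.ones(2)
phi0 = np.array([[0.5, 0.5], [0.25, 0.75]])
gamma0 = update_gamma(alpha0, phi0)
new_phi = update_phi_di(0, phi0, gamma0, np.log(np.array([0.3, 0.7])), 1.0,
                        np.array([0.5, -0.5]), np.array([[1.0, 0.2], [0.2, 1.0]]),
                        1.0, 0.1, 0.0)
```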
The essential differences between MedLDA$^r$ and sLDA lie in the last three terms in the exponential of $\phi_{di}$. First, the third and fourth terms are similar to those of sLDA, but in an expected version, since we are learning the distribution $q(\eta)$. The second-order expectations $E[(\eta^\top \phi_{d,-i})\eta]$ and $E[\eta \circ \eta]$ mean that the co-variances of $\eta$ affect the distribution over topics. This makes our approach significantly different from a point-estimation method like sLDA, where no expectations or co-variances are involved in updating $\phi_{di}$. Second, the last term comes from the max-margin regression formulation. For a document $d$ that lies around the decision boundary, i.e., a support vector, either $\mu_d$ or $\mu_d^\star$ is non-zero, and the last term biases $\phi_{di}$ towards a distribution that favors a more accurate prediction on the document. Moreover, the last term is the same for all words in the document and thus directly affects the latent representation of the document, i.e., $\gamma_d$. Therefore, the latent representation learned by MedLDA$^r$ is more suitable for max-margin learning.

Let $A$ be the $D \times K$ matrix whose rows are the vectors $\bar{Z}_d^\top$. Then we have the following theorem.

Theorem 1 For MedLDA, the optimum solution of $q(\eta)$ has the form:
$$q(\eta) = \frac{p_0(\eta)}{Z} \exp\Big( \eta^\top \sum_{d=1}^{D} \Big( \mu_d - \mu_d^\star + \frac{y_d}{\delta^2} \Big) E[\bar{Z}_d] - \frac{\eta^\top E[A^\top A] \eta}{2 \delta^2} \Big),$$
where $E[A^\top A] = \sum_{d=1}^{D} E[\bar{Z}_d \bar{Z}_d^\top]$, and $E[\bar{Z}_d \bar{Z}_d^\top] = \frac{1}{N^2} \big( \sum_{n=1}^{N} \sum_{m \ne n} \phi_{dn} \phi_{dm}^\top + \sum_{n=1}^{N} \mathrm{diag}\{\phi_{dn}\} \big)$. The Lagrange multipliers are the solution of the dual problem of P1:
$$\mathrm{D1}: \quad \max_{\mu, \mu^\star} \ -\log Z - \epsilon \sum_{d=1}^{D} (\mu_d + \mu_d^\star) + \sum_{d=1}^{D} y_d (\mu_d - \mu_d^\star) \quad \mathrm{s.t.} \ \forall d: \ \mu_d, \mu_d^\star \in [0, C].$$

Proof (sketch) Setting the partial derivative $\partial L / \partial q(\eta)$ to zero gives the solution for $q(\eta)$. Plugging $q(\eta)$ into $L$, we obtain the dual problem.
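The expectation $E[A^\top A]$ in Theorem 1 is computable in closed form from the variational parameters $\phi$. A numpy sketch of the two formulas above (the cross-term is computed as the full outer product of the summed $\phi$ minus the $m = n$ contributions):

```python
import numpy as np

def E_zbar_outer(phi):
    """E[Zbar_d Zbar_d^T] = (1/N^2)(sum_n sum_{m != n} phi_n phi_m^T + sum_n diag(phi_n))."""
    N, K = phi.shape
    s = phi.sum(axis=0)
    cross = np.outer(s, s) - phi.T @ phi     # removes the m == n terms from the outer product
    return (cross + np.diag(s)) / N**2

def E_AtA(phis):
    """E[A^T A] = sum_d E[Zbar_d Zbar_d^T], one phi matrix per document."""
    return sum(E_zbar_outer(phi) for phi in phis)

# toy demo: two deterministic words on different topics give Zbar = (0.5, 0.5) exactly
phi_one = np.array([[1.0, 0.0], [0.0, 1.0]])
m = E_zbar_outer(phi_one)
```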
In MedLDA$^r$, we can choose different priors to introduce different regularization effects. For the standard normal prior $p_0(\eta) = \mathcal{N}(0, I)$, we have the following corollary:

Corollary 2 Assume the prior $p_0(\eta) = \mathcal{N}(0, I)$. Then the optimum solution of $q(\eta)$ is
$$q(\eta) = \mathcal{N}(\lambda, \Sigma), \qquad (6)$$
where $\lambda = \Sigma \big( \sum_{d=1}^{D} (\mu_d - \mu_d^\star + \frac{y_d}{\delta^2}) E[\bar{Z}_d] \big)$ is the mean and $\Sigma = (I + \frac{1}{\delta^2} E[A^\top A])^{-1}$ is a $K \times K$ co-variance matrix. The dual problem of P1 is:
$$\max_{\mu, \mu^\star} \ -\frac{1}{2} a^\top \Sigma a - \epsilon \sum_{d=1}^{D} (\mu_d + \mu_d^\star) + \sum_{d=1}^{D} y_d (\mu_d - \mu_d^\star) \quad \mathrm{s.t.} \ \forall d: \ \mu_d, \mu_d^\star \in [0, C],$$
where $a = \sum_{d=1}^{D} (\mu_d - \mu_d^\star + \frac{y_d}{\delta^2}) E[\bar{Z}_d]$.

In the above corollary, the computation of $\Sigma$ can be carried out robustly through a Cholesky decomposition of $\delta^2 I + E[A^\top A]$, an $O(K^3)$ procedure. Another example is the Laplace prior, which can lead to a shrinkage effect (Zhu et al., 2008a) that is useful in sparse problems. In this paper, we focus on the normal prior; the extension to the Laplace prior can be done as in (Zhu et al., 2008a).

For the standard normal prior, the dual optimization problem is a QP problem and can be solved with any standard QP solver, although such solvers may not be very efficient. To leverage recent developments in support vector regression, we first prove the following corollary:

Corollary 3 Assume the prior $p_0(\eta) = \mathcal{N}(0, I)$. Then the mean $\lambda$ of $q(\eta)$ is the optimum solution of the following problem:
$$\min_{\lambda, \xi, \xi^\star} \ \frac{1}{2} \lambda^\top \Sigma^{-1} \lambda - \lambda^\top \Big( \sum_{d=1}^{D} \frac{y_d}{\delta^2} E[\bar{Z}_d] \Big) + C \sum_{d=1}^{D} (\xi_d + \xi_d^\star)$$
$$\mathrm{s.t.} \ \forall d: \quad y_d - \lambda^\top E[\bar{Z}_d] \le \epsilon + \xi_d; \quad -y_d + \lambda^\top E[\bar{Z}_d] \le \epsilon + \xi_d^\star; \quad \xi_d, \xi_d^\star \ge 0.$$

Proof See Appendix A for details.

The above primal form can be reformulated as a standard SVR problem and solved using existing algorithms like SVM-light (Joachims, 1999) to get $\lambda$ and the dual parameters $\mu$ and $\mu^\star$.
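Corollary 2's posterior, with $\Sigma$ obtained robustly from the Cholesky factorization of $\delta^2 I + E[A^\top A]$ as suggested above, can be sketched as follows. All inputs are toy values; in the full algorithm, $\mu$ and $\mu^\star$ come from solving the dual, and $E[A^\top A]$ comes from the variational parameters.

```python
import numpy as np

def posterior_eta_normal_prior(EZbar, EAtA, mu, mu_star, y, delta2):
    """Corollary 2: under p0(eta) = N(0, I), q(eta) = N(lambda, Sigma) with
    Sigma = (I + E[A^T A]/delta2)^{-1} = delta2 * (delta2*I + E[A^T A])^{-1},
    computed via Cholesky of delta2*I + E[A^T A]."""
    K = EAtA.shape[0]
    L = np.linalg.cholesky(delta2 * np.eye(K) + EAtA)  # robust O(K^3) factorization
    Linv = np.linalg.inv(L)
    Sigma = delta2 * Linv.T @ Linv                     # (L L^T)^{-1} = L^{-T} L^{-1}
    lam = Sigma @ ((mu - mu_star + y / delta2) @ EZbar)  # mean of q(eta)
    return lam, Sigma

# toy demo: D = 2 documents, K = 2 topics
EZbar = np.array([[0.6, 0.4], [0.3, 0.7]])   # rows: E[Zbar_d]
EAtA = EZbar.T @ EZbar                        # toy stand-in for E[A^T A]
lam, Sigma = posterior_eta_normal_prior(EZbar, EAtA, np.array([0.1, 0.0]),
                                        np.zeros(2), np.array([1.0, -1.0]), 1.0)
```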
Specifically, we perform the Cholesky decomposition $\Sigma^{-1} = U^\top U$, where $U$ is an upper triangular matrix with strictly positive diagonal entries. Let $\nu = \sum_{d=1}^{D} \frac{y_d}{\delta^2} E[\bar{Z}_d]$, and define $\lambda' = U(\lambda - \Sigma \nu)$, $y'_d = y_d - \nu^\top \Sigma E[\bar{Z}_d]$, and $\mathbf{x}_d = (U^{-1})^\top E[\bar{Z}_d]$. Then the primal problem in Corollary 3 can be reformulated in the following standard form:
$$\min_{\lambda', \xi, \xi^\star} \ \frac{1}{2} \|\lambda'\|_2^2 + C \sum_{d=1}^{D} (\xi_d + \xi_d^\star)$$
$$\mathrm{s.t.} \ \forall d: \quad y'_d - (\lambda')^\top \mathbf{x}_d \le \epsilon + \xi_d; \quad -y'_d + (\lambda')^\top \mathbf{x}_d \le \epsilon + \xi_d^\star; \quad \xi_d, \xi_d^\star \ge 0.$$

2. M-step: We now estimate the unknown parameters $\alpha$, $\beta$, and $\delta^2$. Here, we assume $\alpha$ is fixed. For $\beta$, the update equations are the same as for sLDA:
$$\beta_{k,w} \propto \sum_{d=1}^{D} \sum_{n=1}^{N} \mathbb{1}(w_{dn} = w)\, \phi_{dnk}. \qquad (7)$$
For $\delta^2$, this step is similar to that of sLDA but in an expected version. The update rule is:
$$\delta^2 \leftarrow \frac{1}{D} \big( \mathbf{y}^\top \mathbf{y} - 2 \mathbf{y}^\top E[A] E[\eta] + E[\eta^\top E[A^\top A] \eta] \big), \qquad (8)$$
where $E[\eta^\top E[A^\top A] \eta] = \mathrm{tr}(E[A^\top A] E[\eta \eta^\top])$.

3.2.2 Max-Margin Learning of LDA for Regression

In the previous section, we presented the MedLDA regression model that uses the supervised sLDA to discover the latent topic representations $Z$. The same principle can be applied to perform joint maximum likelihood estimation and max-margin training for the unsupervised LDA (Blei et al., 2003). In this section, we present this MedLDA model, which will be referred to as MedLDA$^r_{partial}$.

A naive approach to using the unsupervised LDA for supervised prediction tasks, e.g., regression, is a two-step procedure: (1) use the unsupervised LDA to discover the latent topic representations of documents; and (2) feed the low-dimensional topic representations into a regression model (e.g., SVR) for training and testing.
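The change of variables above preserves the residuals $y_d - \lambda^\top E[\bar{Z}_d]$, which is what makes the reformulation a standard SVR. A numpy sketch that applies the transformation and checks this invariant (toy values; a real run would obtain $\Sigma^{-1}$, $\nu$, and $\lambda$ from the E-step):

```python
import numpy as np

def to_standard_svr(Sigma_inv, nu, EZbar, y, lam):
    """Map Corollary 3's problem to standard SVR form:
    Sigma^{-1} = U^T U (U upper triangular), lambda' = U(lambda - Sigma nu),
    y'_d = y_d - nu^T Sigma E[Zbar_d], x_d = U^{-T} E[Zbar_d]."""
    U = np.linalg.cholesky(Sigma_inv).T       # numpy returns lower L with L L^T; U = L^T
    Sigma = np.linalg.inv(Sigma_inv)
    lam_p = U @ (lam - Sigma @ nu)            # transformed weight vector
    y_p = y - EZbar @ (Sigma @ nu)            # shifted targets
    X = EZbar @ np.linalg.inv(U)              # row d is x_d^T = E[Zbar_d]^T U^{-1}
    return lam_p, y_p, X

# toy demo: D = 3 documents, K = 2 topics
Sigma_inv = np.array([[2.0, 0.3], [0.3, 1.5]])
nu = np.array([0.5, -0.2])
EZbar = np.array([[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]])
y = np.array([1.0, 0.0, -1.0])
lam = np.array([0.2, 0.1])
lam_p, y_p, X = to_standard_svr(Sigma_inv, nu, EZbar, y, lam)
```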
This decoupled approach is rather suboptimal, because the side information of documents (e.g., rating scores of movie reviews) is not used in discovering the low-dimensional representations, which can therefore be suboptimal for prediction tasks. Below, we present MedLDA^r_partial, which integrates an unsupervised LDA for discovering topics with an SVR for regression. The interplay between topic discovery and supervised prediction results in more discriminative latent topic representations, as in MedLDA^r_full.

When the underlying topic model is the unsupervised LDA, the likelihood is p(W|α, β), as we have stated. For regression, we apply the ε-insensitive support vector regression (SVR) approach (Smola and Schölkopf, 2003) as before. Again, we learn a distribution q(η). The prediction rule is the same as in Eq. (2). The integrated learning problem is defined as:

  P2 (MedLDA^r_partial):  min_{q,q(η),α,β,ξ,ξ*}  L^u(q) + KL(q(η)‖p₀(η)) + C ∑_{d=1}^D (ξ_d + ξ*_d)
  s.t. ∀d: y_d − E[ηᵀZ̄_d] ≤ ε + ξ_d;  −y_d + E[ηᵀZ̄_d] ≤ ε + ξ*_d;  ξ_d, ξ*_d ≥ 0,

where the KL-divergence is a regularizer that biases the estimate of q(η) towards the prior. In MedLDA^r_full, this KL-regularizer is implicitly contained in the variational bound L^{bs}, as shown in Eq. (3).

Variational EM-Algorithm: For MedLDA^r_partial, the constrained optimization problem P2 can be solved with a similar EM procedure. Specifically, we make the same independence assumptions about q as in LDA (Blei et al., 2003), that is, we assume q(θ, z|γ, φ) = ∏_{d=1}^D q(θ_d|γ_d) ∏_{n=1}^N q(z_dn|φ_dn), where the variational parameters γ and φ are the same as in MedLDA^r_full.
By forming the Lagrangian L of P2 and iteratively optimizing L over each variable, we obtain a variational EM-algorithm similar to that of MedLDA^r_full.

1. E-step: The update rule for γ is the same as in MedLDA^r_full. For φ, setting ∂L/∂φ_di = 0 gives:

  φ_di ∝ exp( E[log θ|γ] + E[log p(w_di|β)] + ((μ_d − μ*_d)/N) E[η] ).  (9)

Compared to Eq. (5), Eq. (9) is simpler and does not have the complex third and fourth terms of Eq. (5). This simplicity suggests that the latent topic representation is less affected by the max-margin estimation (i.e., the parameters of the prediction model). Setting ∂L/∂q(η) = 0, we get:

  q(η) = (p₀(η)/Z) exp( ηᵀ ∑_{d=1}^D (μ_d − μ*_d) E[Z̄_d] ).

Plugging q(η) into L, the dual problem D2 is the same as D1. Again, we can choose different priors to introduce different regularization effects. For the standard normal prior p₀(η) = N(0, I), the posterior is also normal: q(η) = N(λ, I), where λ = ∑_{d=1}^D (μ_d − μ*_d) E[Z̄_d] is the mean. This identity covariance matrix is much simpler than the covariance matrix Σ in MedLDA^r_full, which depends on the latent topic representation Z. Since I is independent of Z, the prediction model in MedLDA^r_partial is less affected by the latent topic representations. Together with the simpler update rule (9), we can conclude that the coupling between the max-margin estimation and the discovery of latent topic representations is looser in MedLDA^r_partial than in MedLDA^r_full. This looser coupling leads to inferior empirical performance, as we shall see.

For the standard normal prior, the dual problem D2 is a QP:

  max_{μ,μ*}  −(1/2)‖λ‖²₂ − ε ∑_{d=1}^D (μ_d + μ*_d) + ∑_{d=1}^D y_d (μ_d − μ*_d)
  s.t. ∀d: μ_d, μ*_d ∈ [0, C].

Similarly, we can derive its primal form, which is a standard SVR problem:

  min_{λ,ξ,ξ*}  (1/2)‖λ‖²₂ + C ∑_{d=1}^D (ξ_d + ξ*_d)
  s.t. ∀d: y_d − λᵀE[Z̄_d] ≤ ε + ξ_d;  −y_d + λᵀE[Z̄_d] ≤ ε + ξ*_d;  ξ_d, ξ*_d ≥ 0.

Now we can leverage recent developments in support vector regression to solve either the dual or the primal problem.

2. M-step: The same as in MedLDA^r_full.

4. Maximum Entropy Discrimination LDA for Classification

In this section, we consider discrete response variables and present the MedLDA classification model.

4.1 Learning MedLDA for Classification

For classification, the response variables y are discrete. For brevity, we consider only multi-class classification, where y ∈ {1, ..., M}. The binary case can be defined based on a binary SVM, and the optimization problem can be solved similarly.

For classification, we assume the discriminant function F is linear, that is, F(y, z_{1:N}, η) = η_yᵀ z̄, where z̄ = (1/N) ∑_n z_n as in the regression model, η_y is a class-specific K-dimensional parameter vector associated with class y, and η is the MK-dimensional vector obtained by stacking the η_y. Equivalently, F can be written as F(y, z_{1:N}, η) = ηᵀ f(y, z̄), where f(y, z̄) is a feature vector whose components from (y−1)K+1 to yK are those of the vector z̄ and all other components are zero. From each single F, a prediction rule can be derived as in SVM. Here, we consider the general case of learning a distribution q(η); for prediction, we average over all possible models and the latent topics:

  y* = arg max_y E[ηᵀ f(y, Z̄) | α, β].  (10)

Now the problem is to learn an optimal set of parameters α, β and the distribution q(η). Below, we present the MedLDA classification model.
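The block feature map f(y, z̄) and the averaged prediction rule of Eq. (10) can be sketched as follows (a toy rendering with our own function names; the values of E[η] and z̄ are hypothetical):

```python
import numpy as np

def feature_map(y, zbar, M):
    """Place z_bar in the block of an (M*K)-vector belonging to class y (classes 1..M)."""
    K = zbar.shape[0]
    f = np.zeros(M * K)
    f[(y - 1) * K : y * K] = zbar
    return f

def predict(eta_mean, zbar, M):
    """Prediction rule of Eq. (10): argmax over classes of E[eta]^T f(y, z_bar)."""
    scores = [eta_mean @ feature_map(y, zbar, M) for y in range(1, M + 1)]
    return int(np.argmax(scores)) + 1

M, K = 3, 4
zbar = np.array([0.1, 0.2, 0.3, 0.4])      # hypothetical mean topic proportions
# Hypothetical E[eta]: class 2's block has the largest inner product with zbar.
eta_mean = np.concatenate([np.zeros(K), np.ones(K), -np.ones(K)])
assert predict(eta_mean, zbar, M) == 2
```

Because f(y, z̄) is nonzero only in the y-th block, the score of class y reduces to the inner product of η_y's block of E[η] with z̄, as in the first form of F.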
In principle, we could develop two variants of the MedLDA classification model, using either the supervised sLDA (Wang et al., 2009) or the unsupervised LDA to discover latent topics, as in the regression case. However, when the supervised sLDA is used for classification, it is impossible to derive a dual formulation of the optimization problem because of the normalized non-Gaussian prediction model (Blei and McAuliffe, 2007; Wang et al., 2009). Here, we consider the case where the unsupervised LDA is the underlying topic model for discovering the latent topic representation Z. As we shall see, the MedLDA classification model can be easily learned by using existing SVM solvers to optimize its dual problem.

4.1.1 Max-Margin Learning of LDA for Classification

As we have stated, the supervised sLDA model has a normalization factor that makes learning generally intractable, except in special cases such as the normal distribution used in the regression case. In (Blei and McAuliffe, 2007; Wang et al., 2009), variational methods or higher-order Taylor expansions are applied to approximate the normalization factor in the classification model. In our max-margin formulation, since our target is to directly minimize a hinge loss, we do not need a normalized distribution model for the response variables Y. Instead, we define a partially generative model on (θ, z, W) only, as in the unsupervised LDA, and for the classification step (i.e., from Z to Y) we apply the max-margin principle, which does not require a normalized distribution. Thus, in this case, the likelihood of the corpus D is p(W|α, β). Similar to the MedLDA^r_partial regression model, we define the integrated latent topic discovery and multi-class classification model as follows:

  P3 (MedLDA^c):  min_{q,q(η),α,β,ξ}  L^u(q) + KL(q(η)‖p₀(η)) + C ∑_{d=1}^D ξ_d
  s.t. ∀d, ∀y ≠ y_d: E[ηᵀΔf_d(y)] ≥ 1 − ξ_d;  ξ_d ≥ 0,

where q(θ, z|γ, φ) is a variational distribution; L^u(q) is a variational upper bound of −log p(W|α, β); Δf_d(y) = f(y_d, Z̄_d) − f(y, Z̄_d); and ξ are slack variables. E[ηᵀΔf_d(y)] is the "expected margin" by which the true label y_d is favored over a prediction y. These margin constraints make MedLDA^c fundamentally different from the mixture of conditional max-entropy models (Pavlov et al., 2003), where the constraints are based on moment matching, i.e., empirical expectations of features equal their model expectations.

The rationale underlying MedLDA^c is similar to that of MedLDA^r: we want to find a latent topic representation q(θ, z|γ, φ) and a parameter distribution q(η) that, on one hand, tend to predict as accurately as possible on the training data, while on the other hand tend to explain the data well. The KL-divergence term in P3 is a regularizer of the distribution q(η).

4.2 Variational EM-Algorithm

As in MedLDA^r, we can develop a similar variational EM algorithm. Specifically, we assume that q is fully factorized, as in the standard unsupervised LDA. Then E[ηᵀf(y, Z̄_d)] = E[η]ᵀ f(y, (1/N) ∑_{n=1}^N φ_dn). We form the Lagrangian L of P3:

  L = L^u(q) + KL(q(η)‖p₀(η)) + C ∑_{d=1}^D ξ_d − ∑_{d=1}^D v_d ξ_d
      − ∑_{d=1}^D ∑_{y≠y_d} μ_d(y)( E[ηᵀΔf_d(y)] + ξ_d − 1 ) − ∑_{d=1}^D ∑_{i=1}^N c_di( ∑_{j=1}^K φ_dij − 1 ),

where the last term comes from the normalization condition ∑_{j=1}^K φ_dij = 1, ∀i, d. The EM-algorithm iteratively optimizes L with respect to γ, φ, q(η), and β. Since the constraints in P3 do not involve γ or β, their update rules are the same as in MedLDA^r_full and we omit the details here.
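At the optimum of P3, each slack ξ_d equals the multi-class hinge loss implied by the margin constraints, i.e., ξ_d = max(0, max_{y≠y_d} 1 − E[ηᵀΔf_d(y)]). The following sketch (our own notation and toy values) evaluates this quantity for one document:

```python
import numpy as np

def feature_map(y, zbar, M):
    """Block feature vector f(y, z_bar) with classes indexed 1..M."""
    K = zbar.shape[0]
    f = np.zeros(M * K)
    f[(y - 1) * K : y * K] = zbar
    return f

def min_slack(eta_mean, zbar, y_true, M):
    """Smallest xi_d satisfying E[eta^T df_d(y)] >= 1 - xi_d for all y != y_true."""
    f_true = feature_map(y_true, zbar, M)
    margins = [eta_mean @ (f_true - feature_map(y, zbar, M))
               for y in range(1, M + 1) if y != y_true]
    return max(0.0, max(1.0 - m for m in margins))

M = 3
zbar = np.array([0.5, 0.5])
eta_mean = np.array([2.0, 2.0, 0.0, 0.0, 0.0, 0.0])  # hypothetical E[eta]
# True class 1 scores 2.0 and the others 0.0, so every expected margin is
# 2.0 >= 1 and the constraints hold with zero slack.
assert min_slack(eta_mean, zbar, 1, M) == 0.0
```

This makes explicit that P3 trades the variational bound L^u(q) against a hinge loss on the expected margins rather than against a normalized likelihood of Y.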
We now explain the optimization of L over φ and q(η), which reveals insights into the max-margin topic model:

1. Optimize L over φ: Again, since q is fully factorized, we can perform the optimization for each document separately. Setting ∂L/∂φ_di = 0, we have:

  φ_di ∝ exp( E[log θ|γ] + E[log p(w_di|β)] + (1/N) ∑_{y≠y_d} μ_d(y) E[η_{y_d} − η_y] ).  (11)

The first two terms in Eq. (11) are the same as in the unsupervised LDA; the last term is due to the max-margin formulation of P3 and reflects our intuition that the discovered latent topic representation is influenced by the max-margin estimation. For examples around the decision boundary, i.e., support vectors, some of the Lagrange multipliers are non-zero, and the last term acts as a regularizer that biases the model towards discovering a latent representation that tends to make more accurate predictions on these difficult examples. Moreover, this term is fixed for all words in a document and thus directly affects the document's latent representation (i.e., γ_d), yielding a discriminative latent representation that is more suitable for the classification task, as we shall see in Section 6.

2. Optimize L over q(η): Similar to the regression model, we have the following optimum solution.

Corollary 4 The optimum solution q(η) of MedLDA^c has the form:

  q(η) = (1/Z) p₀(η) exp( ηᵀ ∑_{d=1}^D ∑_{y≠y_d} μ_d(y) E[Δf_d(y)] ).  (12)

The Lagrange multipliers μ are the optimum solution of the dual problem:

  D3:  max_μ  −log Z + ∑_{d=1}^D ∑_{y≠y_d} μ_d(y)
  s.t. ∀d: ∑_{y≠y_d} μ_d(y) ∈ [0, C].

Again, we can choose different priors in MedLDA^c for different regularization effects. We consider the normal prior in this paper.
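The update in Eq. (11) is a softmax over topics of the two unsupervised LDA terms plus a per-document max-margin correction. The toy sketch below (all values hypothetical; only documents that are support vectors have nonzero μ_d(y)) makes the computation concrete:

```python
import numpy as np

K, N = 3, 10                                      # topics, words in document d
Elog_theta = np.log(np.array([0.5, 0.3, 0.2]))    # E[log theta | gamma]
Elog_p_w = np.log(np.array([0.2, 0.5, 0.3]))      # E[log p(w_di | beta)]
Eeta = np.array([[1.0, 0.0, 0.0],                 # E[eta_y], one row per class
                 [0.0, 1.0, 0.0]])
y_d = 1                                           # true label of document d
mu_d = {2: 0.5}                                   # mu_d(y): nonzero only for y=2

# Max-margin correction (1/N) sum_{y != y_d} mu_d(y) E[eta_{y_d} - eta_y];
# it is the same for every word i in document d.
correction = sum(m * (Eeta[y_d - 1] - Eeta[y - 1])
                 for y, m in mu_d.items()) / N

logits = Elog_theta + Elog_p_w + correction
phi = np.exp(logits - logits.max())               # stable softmax over K topics
phi /= phi.sum()
assert np.isclose(phi.sum(), 1.0) and phi.shape == (K,)
```

Because the correction is shared by all words of the document, it shifts the whole topic posterior of the document towards topics that separate y_d from the competing labels.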
For the standard normal prior p₀(η) = N(0, I), q(η) is a normal with a shifted mean: q(η) = N(λ, I), where λ = ∑_{d=1}^D ∑_{y≠y_d} μ_d(y) E[Δf_d(y)], and the dual problem D3 becomes the dual of a standard multi-class SVM, which can be solved using existing SVM methods (Crammer and Singer, 2001):

  max_μ  −(1/2)‖ ∑_{d=1}^D ∑_{y≠y_d} μ_d(y) E[Δf_d(y)] ‖²₂ + ∑_{d=1}^D ∑_{y≠y_d} μ_d(y)
  s.t. ∀d: ∑_{y≠y_d} μ_d(y) ∈ [0, C].

5. MedTM: A General Framework

We have presented MedLDA, which integrates the max-margin principle with an underlying LDA model, supervised or unsupervised, for discovering predictive latent topic representations of documents. The same principle can be applied to other generative topic models, such as correlated topic models (CTMs) (Blei and Lafferty, 2005), as well as undirected random fields, such as exponential family harmoniums (EFH) (Welling et al., 2004).

Formally, max-entropy discrimination topic models (MedTM) can be generally defined as:

  P (MedTM):  min_{q(H),q(Υ),Ψ,ξ}  L(q(H)) + KL(q(Υ)‖p₀(Υ)) + U(ξ)
  s.t. expected margin constraints,

where H are hidden variables (e.g., (θ, z) in LDA); Υ are the parameters of the model pertaining to the prediction task (e.g., η in sLDA); Ψ are the parameters of the underlying topic model (e.g., the Dirichlet parameter α); and L is a variational upper bound of the negative log-likelihood associated with the underlying topic model. U is a convex function over the slack variables. For the general MedTM model, we can develop a variational EM-algorithm similar to that for MedLDA. Note that Υ can be part of H. For example, the underlying topic model of MedLDA^r is a Bayesian sLDA; in this case, H = (θ, z, η), Υ = ∅, and the term KL(q(η)‖p₀(η)) is contained in its L.
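Given a candidate set of multipliers, assembling λ and evaluating the dual objective of D3 under the normal prior is mechanical. The sketch below (toy data, our own variable names) also checks the per-document box constraints:

```python
import numpy as np

def feature_map(y, zbar, M):
    """Block feature vector f(y, z_bar), classes 1..M."""
    K = zbar.shape[0]
    f = np.zeros(M * K)
    f[(y - 1) * K : y * K] = zbar
    return f

M, C = 2, 1.0
EZbar = [np.array([0.6, 0.4]), np.array([0.3, 0.7])]   # E[Z_bar_d] per document
y_true = [1, 2]                                        # true labels
mu = [{2: 0.4}, {1: 0.2}]                              # hypothetical mu_d(y), y != y_d

# lambda = sum_d sum_{y != y_d} mu_d(y) E[Delta f_d(y)]
K = EZbar[0].shape[0]
lam = np.zeros(M * K)
for d in range(len(EZbar)):
    for y, m in mu[d].items():
        df = feature_map(y_true[d], EZbar[d], M) - feature_map(y, EZbar[d], M)
        lam += m * df

# Dual objective of D3 under p0 = N(0, I): -(1/2)||lambda||^2 + sum of multipliers.
dual_obj = -0.5 * lam @ lam + sum(sum(m.values()) for m in mu)

# Box constraints: sum_{y != y_d} mu_d(y) in [0, C] for every document d.
assert all(0.0 <= sum(m.values()) <= C for m in mu)
assert lam.shape == (M * K,)
```

In practice this inner problem is handed to an existing multi-class SVM solver rather than evaluated by hand; the sketch only illustrates how the solver's variables map onto the quantities in D3.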
Finally, based on the recent extension of maximum entropy discrimination (MED) (Jaakkola et al., 1999) to the structured prediction setting (Zhu et al., 2008b), the basic principle of MedLDA can be similarly extended to perform structured prediction, where multiple response variables are predicted simultaneously and their mutual dependencies can be exploited to achieve globally consistent and optimal predictions. Likelihood-based structured prediction latent topic models have been developed for different scenarios, such as image annotation (He and Zemel, 2008) and statistical machine translation (Zhao and Xing, 2006). The extension of MedLDA to the structured prediction setting could provide a promising alternative for such problems.

6. Experiments

In this section, we provide qualitative as well as quantitative evaluations of MedLDA on text modeling, classification, and regression.

6.1 Text Modeling

We study text modeling with MedLDA on the 20 Newsgroups data set, with a standard list of stop words⁴ removed. The data set contains postings in 20 related categories. We compare with the standard unsupervised LDA. We fit the data set to a 110-topic MedLDA^c model, which exploits the supervised category information, and a 110-topic unsupervised LDA.

Figure 2 shows the 2D embedding of the expected topic proportions of MedLDA^c and LDA obtained with the t-SNE stochastic neighborhood embedding (van der Maaten and Hinton, 2008), where each dot represents a document and color-shape pairs represent class labels. The max-margin based MedLDA^c clearly produces a better grouping and separation of the documents in different categories. In contrast, the unsupervised LDA does not produce a well-separated embedding, and documents in different categories tend to mix together.
A similar embedding was presented in (Lacoste-Julien et al., 2008), where the transformation matrix in their model is pre-designed; the results of MedLDA^c in Figure 2 are automatically learned.

It is also interesting to examine the discovered topics and their association with class labels. In Figure 3 we show the top topics in four classes as discovered by both MedLDA and LDA. Moreover, we depict the per-class distribution over topics for each model, computed by averaging the expected latent representation of the documents in each class. MedLDA yields sharper, sparser, and faster-decaying per-class distributions over topics, which have better discrimination power. This behavior is due to the regularization effect enforced over φ, as shown in Eq. (11). In contrast, LDA seems to discover topics that model the fine details of documents with no regard to their discrimination power (i.e., it discovers different variations of the same topic, which results in a flat per-class distribution over topics). For instance, in the class comp.graphics, MedLDA mainly models documents using two salient, discriminative topics (T69 and T11), whereas LDA results in a much flatter distribution. Moreover, in cases where LDA and MedLDA discover comparably the same set of topics in a given class (such as politics.mideast and misc.forsale), MedLDA results in a sharper low-dimensional representation.

[Figure 2: t-SNE 2D embedding of the topic representation by MedLDA^c (above) and the unsupervised LDA (below).]

[Figure 3: Top topics under each of four classes (comp.graphics, sci.electronics, politics.mideast, misc.forsale) as discovered by the MedLDA and LDA models, together with the average θ per class.]

6.2 Prediction Accuracy

In this subsection, we provide a quantitative evaluation of MedLDA's prediction performance.

6.2.1 Classification

We perform binary and multi-class classification on the 20 Newsgroups data set. To obtain a baseline, we first fit all the data to an LDA model, and then use the latent representation of the training⁵ documents as features to build a binary/multi-class SVM classifier. We denote this baseline by LDA+SVM. For a model M, we evaluate its performance using the relative improvement ratio, i.e.,

  ( precision(M) − precision(LDA+SVM) ) / precision(LDA+SVM).

Note that since DiscLDA (Lacoste-Julien et al., 2008) uses Gibbs sampling for inference, which differs slightly from the variational methods used in MedLDA and sLDA (Blei and McAuliffe, 2007; Wang et al., 2009), we build the LDA+SVM baseline with both variational inference and Gibbs sampling. The relative improvement ratio of each model is computed against the baseline with the same inference method.

[Figure 4: Relative improvement ratio against LDA+SVM for (a) binary and (b) multi-class classification.]

Binary Classification: As in (Lacoste-Julien et al., 2008), the binary classification task is to distinguish postings of the newsgroup alt.atheism from postings of the group talk.religion.misc. We compare MedLDA^c with sLDA, DiscLDA, and LDA+SVM. For sLDA, the extension to perform multi-class classification was presented by Wang et al.

4. http://mallet.cs.umass.edu/
5. We use the training/testing split in: http://people.csail.mit.edu/jrennie/20Newsgroups/
(2009); we will compare with it in the multi-class classification setting. Here, for the binary case, we fit an sLDA regression model using the binary (0/1) representation of the classes and use a threshold of 0.5 to make predictions. For MedLDA^c, to see whether a second-stage max-margin classifier can improve the performance, we also build a method MedLDA+SVM, similar to LDA+SVM. All of the above methods that utilize the class label information are fit ONLY on the training data. We use SVM-light (Joachims, 1999) to build the SVM classifiers and to estimate q(η) in MedLDA^c. The parameter C is chosen via 5-fold cross-validation during training from {k²: k = 1, ..., 8}. For each model, we run the experiments 5 times and take the average as the final result.

The relative improvement ratios of the different models with respect to the number of topics are shown in Figure 4(a). For DiscLDA (Lacoste-Julien et al., 2008), the number of topics is set by the equation 2K₀ + K₁, where K₀ is the number of topics per class and K₁ is the number of topics shared by all categories. As in (Lacoste-Julien et al., 2008), K₁ = 2K₀. Here, we set K₀ = 1, ..., 8, 10 and align the results with those of MedLDA and sLDA that have the closest topic numbers.

We can see that the max-margin based MedLDA^c works better than sLDA, DiscLDA, and the two-step method LDA+SVM. Since MedLDA^c integrates the max-margin principle into its training, the combination of MedLDA and SVM does not yield additional benefits on this task. We believe that the slight differences between MedLDA and MedLDA+SVM are due to the tuning of the regularization parameters. For efficiency, we do not change the regularization constant C during the training of MedLDA^c. The performance could be improved by selecting a good C in different iterations, because the data representation changes across iterations.
Multi-class Classification: We perform multi-class classification on 20 Newsgroups with all the categories. We compare MedLDA^c with MedLDA+SVM, LDA+SVM, multi-class sLDA (multi-sLDA) (Wang et al., 2009), and DiscLDA. We use the SVM^struct package⁶ with a 0/1 loss to solve the sub-step of learning q(η) and to build the SVM classifiers for LDA+SVM and MedLDA+SVM. The results are shown in Figure 4(b). For DiscLDA, we use the same equation as in (Lacoste-Julien et al., 2008) to set the number of topics, with K₀ = 1, ..., 5. Again, we align the results with those of MedLDA by the closest-topic-number criterion. We can see that all the supervised topic models discover more predictive topics for classification, and the max-margin based MedLDA^c achieves significant improvements with an appropriate number of topics (e.g., ≥ 80). Again, we believe the slight difference between MedLDA^c and MedLDA+SVM is due to parameter tuning.

6.2.2 Regression

We evaluate the MedLDA^r model on the movie review data set. As in (Blei and McAuliffe, 2007), we take logs of the response values to make them approximately normal. We compare MedLDA^r with the unsupervised LDA and sLDA. As we have stated, the underlying topic model in MedLDA^r can be an LDA or an sLDA. We have implemented both, denoted by MedLDA (partial) and MedLDA (full), respectively.

[Figure 5: Predictive R² (left) and per-word likelihood (right) of different models on the movie review data set.]
For LDA, we use its low-dimensional representation of documents as input features to a linear SVR and denote this method by LDA+SVR. The evaluation criterion is predictive R² (pR²), as defined in (Blei and McAuliffe, 2007).

Figure 5 shows the results together with the per-word likelihood. The supervised MedLDA and sLDA obtain much better results than the unsupervised LDA, which ignores the supervised responses. By using max-margin learning, MedLDA (full) obtains slightly better results than the likelihood-based sLDA, especially when the number of topics is small (e.g., ≤ 15). Indeed, when the number of topics is small, the latent representation of sLDA alone does not result in a highly separable problem; thus the integration of max-margin training helps in discovering a more discriminative latent representation with the same number of topics. In fact, the number of support vectors (i.e., documents with at least one non-zero Lagrange multiplier) decreases dramatically at T = 15 and stays nearly the same for T > 15, which, with reference to Eq. (5), explains why the relative improvement over sLDA decreases as T increases. This behavior suggests that MedLDA can discover more predictive latent structures for difficult, non-separable problems.

For the two variants of MedLDA^r, we see an obvious improvement of MedLDA (full) over MedLDA (partial). This is because for MedLDA (partial) the update rule for φ does not have the third and fourth terms of Eq. (5); those terms couple the max-margin estimation and the latent topic discovery more tightly.

[Figure 6: Training time of different models with respect to the number of topics for binary classification.]

6. http://svmlight.joachims.org/svm_multiclass.html
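For concreteness, predictive R² is one minus the ratio of the residual to the total sum of squares on held-out responses; the definition follows Blei and McAuliffe (2007), and the numbers below are toy values:

```python
# Predictive R^2 (pR^2): 1 - SS_res / SS_tot on held-out responses.
def predictive_r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
assert predictive_r2(y, y) == 1.0            # perfect predictions
assert predictive_r2(y, [2.5] * 4) == 0.0    # always predicting the mean gives 0
```

A model can score below zero on pR² if its predictions are worse than simply predicting the mean response, which makes the metric a convenient sanity check for regression baselines.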
Finally, a linear SVR on the empirical word frequencies achieves a pR² of 0.458, worse than those of sLDA and MedLDA.

6.2.3 Time Efficiency

For binary classification, MedLDA^c is much more efficient than sLDA and is comparable with LDA+SVM, as shown in Figure 6. The slowness of sLDA may be due to the mismatch between its normal assumption and the non-Gaussian binary response variables, which prolongs the E-step. For multi-class classification, the training time of MedLDA^c is mainly spent on solving a multi-class SVM problem, and is thus comparable to that of LDA. For regression, the training time of MedLDA (full) is comparable to that of sLDA, while MedLDA (partial) is more efficient.

7. Related Work

Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is a hierarchical Bayesian model for discovering latent topics in a document collection. LDA has found wide application in information retrieval, data mining, computer vision, etc. LDA itself is an unsupervised model.

Supervised LDA (sLDA) (Blei and McAuliffe, 2007) was proposed for regression problems. Although sLDA was generalized to classification via a generalized linear model (GLM), no results have been reported on the classification performance of sLDA. One important issue that hinders sLDA from being effectively applied to classification is its normalization factor, which arises because sLDA defines a fully generative model. The normalization factor makes learning very difficult: variational methods or higher-order statistics must be applied to deal with the normalizer, as shown in (Blei and McAuliffe, 2007). Instead, MedLDA applies the concept of margin and directly concentrates on maximizing the margin.
Thus, MedLDA does not need to define a fully generative model, and the MedLDA classification problem can be easily handled by solving a dual QP problem, in the same spirit as SVM.

DiscLDA (Lacoste-Julien et al., 2008) is another supervised LDA model, proposed specifically for classification. DiscLDA also defines a fully generative model, but instead of maximizing the evidence, it maximizes the conditional likelihood, in the same spirit as conditional random fields (Lafferty et al., 2001). Our MedLDA differs significantly from DiscLDA, and the implementation of MedLDA is extremely simple.

Other variants of topic models that leverage supervised information have been developed for different application scenarios, including models for online reviews (Titov and McDonald, 2008; Branavan et al., 2008), image annotation (He and Zemel, 2008), and the credit-attribution Labeled LDA model (Ramage et al., 2009).

The maximum entropy discrimination (MED) principle (Jaakkola et al., 1999) provides an excellent combination of max-margin learning and Bayesian-style estimation. Recent work (Zhu et al., 2008b) extends the MED framework to the structured learning setting and generalizes it to incorporate structured hidden variables in a Markov network. MedLDA is an application of the MED principle to learning a latent Dirichlet allocation model. Unlike (Westerdijk and Wiegerinck, 2000), where a generative model is degenerated to a deterministic version for classification, our model is generative and can thus discover the latent topics in document collections.

The basic principle of MedLDA can be generalized to the structured prediction setting, in which multivariate response variables are predicted simultaneously and their mutual dependencies can be exploited to achieve globally consistent and optimal predictions.
At least two scenarios within our horizon can be directly addressed by MedLDA: image annotation (He and Zemel, 2008), where neighboring annotations tend to be smooth, and statistical machine translation (Zhao and Xing, 2006), where tokens are naturally aligned across sentences.

8. Conclusions and Discussions

We have presented the maximum entropy discrimination LDA (MedLDA), which uses the max-margin principle to train supervised topic models. MedLDA integrates the max-margin principle into the latent topic discovery process by optimizing a single objective function with a set of expected margin constraints. This integration yields a predictive topic representation that is more suitable for regression or classification. We develop efficient variational methods for MedLDA. The empirical results on the movie review and 20 Newsgroups data sets show the promise of MedLDA for text modeling and prediction accuracy.

MedLDA represents a first step towards integrating the max-margin principle into supervised topic models, and under the general MedTM framework presented in Section 5, several improvements and extensions are on the horizon. Specifically, due to the nature of MedTM's joint optimization formulation, advances in either max-margin training or better variational bounds for inference can be easily incorporated. For instance, the mean-field variational upper bound in MedLDA could be improved by using the tighter collapsed variational bound (Teh et al., 2006), which achieves results comparable to collapsed Gibbs sampling (Griffiths and Steyvers, 2004). Moreover, as the experimental results suggest, incorporating a more expressive underlying topic model enhances overall performance. Therefore, we plan to integrate and utilize other underlying topic models, such as the fully generative sLDA model, in the classification case.
Finally, advances in max-margin training would also result in more efficient training.

Acknowledgements

This work was done while J.Z. was visiting CMU with support from NSF DBI-0546594 and DBI-0640543 awarded to E.X.; J.Z. is also supported by Chinese NSF Grants 60621062 and 60605003; National Key Foundation R&D Projects 2003CB317007, 2004CB318108, and 2007CB311003; and the Basic Research Foundation of the Tsinghua National TNList Lab.

Proof of Corollary 3

In this section, we prove Corollary 3.

Proof Since the variational parameters $(\gamma, \phi)$ are fixed when solving for $q(\eta)$, we can ignore the terms in $L^{bs}$ that do not depend on $q(\eta)$ and get the function

$$L^{bs}[q(\eta)] \triangleq KL(q(\eta)\,\|\,p_0(\eta)) - \sum_d E_q[\log p(y_d \mid \bar{Z}_d, \eta, \delta^2)]$$
$$= KL(q(\eta)\,\|\,p_0(\eta)) + \frac{1}{2\delta^2}\, E_{q(\eta)}\Big[ \eta^\top E[A^\top A]\,\eta - 2\eta^\top \sum_{d=1}^D y_d E[\bar{Z}_d] \Big] + c,$$

where $c$ is a constant that does not depend on $q(\eta)$. Let $U(\xi, \xi^\star) = C \sum_{d=1}^D (\xi_d + \xi^\star_d)$. Suppose $(q_0(\eta), \xi_0, \xi^\star_0)$ is the optimal solution of P1; then for any feasible $(q(\eta), \xi, \xi^\star)$ we have

$$L^{bs}[q_0(\eta)] + U(\xi_0, \xi^\star_0) \le L^{bs}[q(\eta)] + U(\xi, \xi^\star).$$

From Corollary 2, we conclude that the optimum predictive parameter distribution is $q_0(\eta) = \mathcal{N}(\lambda_0, \Sigma)$, where $\Sigma = (I + \frac{1}{\delta^2} E[A^\top A])^{-1}$ does not depend on $q(\eta)$. Since $q_0(\eta)$ is also normal, for any distribution$^7$ $q(\eta) = \mathcal{N}(\lambda, \Sigma)$, with several steps of algebra it is easy to

7. Although the feasible set of $q(\eta)$ in P1 is much richer than the set of normal distributions with covariance matrix $\Sigma$, Corollary 2 shows that the solution is a restricted normal distribution. Thus, it suffices to consider only these normal distributions in order to learn the mean of the optimum distribution.
show that

$$L^{bs}[q(\eta)] = \frac{1}{2}\lambda^\top \Big(I + \frac{1}{\delta^2} E[A^\top A]\Big)\lambda - \lambda^\top \Big(\sum_{d=1}^D \frac{y_d}{\delta^2} E[\bar{Z}_d]\Big) + c' = \frac{1}{2}\lambda^\top \Sigma^{-1}\lambda - \lambda^\top \Big(\sum_{d=1}^D \frac{y_d}{\delta^2} E[\bar{Z}_d]\Big) + c',$$

where $c'$ is another constant that does not depend on $\lambda$. Thus, for any $(\lambda, \xi, \xi^\star)$ in the feasible set

$$\Big\{(\lambda, \xi, \xi^\star) : y_d - \lambda^\top E[\bar{Z}_d] \le \epsilon + \xi_d;\ -y_d + \lambda^\top E[\bar{Z}_d] \le \epsilon + \xi^\star_d;\ \xi_d, \xi^\star_d \ge 0\ \ \forall d\Big\},$$

we have

$$\frac{1}{2}\lambda_0^\top \Sigma^{-1}\lambda_0 - \lambda_0^\top \Big(\sum_{d=1}^D \frac{y_d}{\delta^2} E[\bar{Z}_d]\Big) + U(\xi_0, \xi^\star_0) \le \frac{1}{2}\lambda^\top \Sigma^{-1}\lambda - \lambda^\top \Big(\sum_{d=1}^D \frac{y_d}{\delta^2} E[\bar{Z}_d]\Big) + U(\xi, \xi^\star),$$

which means the mean of the optimum posterior distribution under a Gaussian MedLDA is achieved by solving the primal problem stated in the Corollary.

References

David Blei and John Lafferty. Correlated topic models. In Advances in Neural Information Processing Systems (NIPS), 2005.

David Blei and Jon D. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems (NIPS), 2007.

David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

S.R.K. Branavan, Harr Chen, Jacob Eisenstein, and Regina Barzilay. Learning document-level semantic properties from free-text annotations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2008.

Gal Chechik and Naftali Tishby. Extracting relevant structures with side information. In Advances in Neural Information Processing Systems (NIPS), 2002.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.

Xuming He and Richard S. Zemel.
Learning hybrid models for image annotation with partially labeled data. In Advances in Neural Information Processing Systems (NIPS), 2008.

Tommi Jaakkola, Marina Meila, and Tony Jebara. Maximum entropy discrimination. In Advances in Neural Information Processing Systems (NIPS), 1999.

Thorsten Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.

Michael I. Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. In M. I. Jordan (Ed.), Learning in Graphical Models. MIT Press, Cambridge, MA, 1999.

Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems (NIPS), 2008.

John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

Dmitry Pavlov, Alexandrin Popescul, David M. Pennock, and Lyle H. Ungar. Mixtures of conditional maximum entropy models. In ICML, 2003.

Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2009.

Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157–173, 2008.

Alex J. Smola and Bernhard Schölkopf. A tutorial on support vector regression. Statistics and Computing, 2003.

Yee Whye Teh, David Newman, and Max Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation.
In Advances in Neural Information Processing Systems (NIPS), 2006.

Ivan Titov and Ryan McDonald. A joint model of text and aspect ratings for sentiment summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2008.

L.J.P. van der Maaten and G.E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

Chong Wang, David Blei, and Fei-Fei Li. Simultaneous image classification and annotation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

Max Welling, Michal Rosen-Zvi, and Geoffrey Hinton. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems (NIPS), 2004.

Michael Westerdijk and Wim Wiegerinck. Classification with multiple latent variable models using maximum entropy discrimination. In ICML, 2000.

Jun Yang, Yan Liu, Eric P. Xing, and Alexander G. Hauptmann. Harmonium models for semantic video representation and classification. In SIAM Conference on Data Mining, 2007.

Bin Zhao and Eric P. Xing. HM-BiTAM: Bilingual topic exploration, word alignment, and translation. In Advances in Neural Information Processing Systems (NIPS), 2006.

Jun Zhu, Eric P. Xing, and Bo Zhang. Laplace maximum margin Markov networks. In International Conference on Machine Learning (ICML), 2008a.

Jun Zhu, Eric P. Xing, and Bo Zhang. Partially observed maximum entropy discrimination Markov networks. In Advances in Neural Information Processing Systems (NIPS), 2008b.
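As a closing aside, the closed-form mean derived in the proof of Corollary 3 above can be sanity-checked numerically: the $\lambda$-dependent part of $L^{bs}$ is the quadratic $\frac{1}{2}\lambda^\top \Sigma^{-1}\lambda - \lambda^\top b$ with $b = \sum_d \frac{y_d}{\delta^2} E[\bar{Z}_d]$ and $\Sigma = (I + \frac{1}{\delta^2}E[A^\top A])^{-1}$, whose unconstrained minimizer is $\lambda = \Sigma b$. The following sketch uses synthetic values; all variable names are illustrative and not taken from any released implementation of MedLDA.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 5, 20      # number of topics, number of documents (arbitrary)
delta2 = 2.0      # noise variance delta^2

# Synthetic expected topic proportions E[Zbar_d] (rows of A) and responses y_d.
A = rng.random((D, K))
y = rng.standard_normal(D)

# Sigma = (I + E[A^T A]/delta^2)^{-1} and b = sum_d (y_d/delta^2) E[Zbar_d].
Sigma = np.linalg.inv(np.eye(K) + A.T @ A / delta2)
b = A.T @ y / delta2

def objective(lam):
    """The lambda-dependent part of L^bs: 0.5 lam^T Sigma^{-1} lam - lam^T b."""
    return 0.5 * lam @ np.linalg.inv(Sigma) @ lam - lam @ b

lam_star = Sigma @ b  # closed-form minimizer from the proof

# The gradient Sigma^{-1} lam - b vanishes at lam_star ...
grad = np.linalg.inv(Sigma) @ lam_star - b
assert np.allclose(grad, 0.0)

# ... and random perturbations can only increase the convex objective.
for _ in range(100):
    assert objective(lam_star + 0.1 * rng.standard_normal(K)) >= objective(lam_star)
```

Since $\Sigma^{-1}$ is positive definite, the objective is strictly convex, so the zero-gradient point is the unique minimizer; the $\epsilon$-insensitive constraints of the primal problem only shift this minimizer through the slack terms $U(\xi, \xi^\star)$.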
