Supervised Topic Models
Submitted to Statistical Science

David M. Blei, Princeton University
Jon D. McAuliffe, University of California, Berkeley

Abstract. We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive an approximate maximum-likelihood procedure for parameter estimation, which relies on variational methods to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and the political tone of amendments in the U.S. Senate based on the amendment text. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.

1. Introduction

There is a growing need to analyze collections of electronic text. We have unprecedented access to large corpora, such as government documents and news archives, but we cannot take advantage of these collections without being able to organize, understand, and summarize their contents. We need new statistical models to analyze such data, and fast algorithms to compute with those models.

The complexity of document corpora has led to considerable interest in applying hierarchical statistical models based on what are called topics (Blei and Lafferty, 2009; Blei et al., 2003; Griffiths et al., 2005; Erosheva et al., 2004). Formally, a topic is a probability distribution over terms in a vocabulary. Informally, a topic represents an underlying semantic theme; a document consisting of a large number of words might be concisely modelled as deriving from a smaller number of topics.
Such topic models provide useful descriptive statistics for a collection, which facilitates tasks like browsing, searching, and assessing document similarity. Most topic models, such as latent Dirichlet allocation (LDA) (Blei et al., 2003), are unsupervised. Only the words in the documents are modelled, and the goal is to infer topics that maximize the likelihood (or the posterior probability) of the collection. This is compelling: with only the documents as input, one can find patterns of words that reflect the broad themes that run through them; and unsupervised topic modeling has many applications. Researchers have used the topic-based decomposition of the collection to examine interdisciplinarity (Ramage et al., 2009), organize large collections of digital books (Mimno and McCallum, 2007), recommend purchased items (Marlin, 2003), and retrieve documents relevant to a query (Wei and Croft, 2006). Researchers have also evaluated the interpretability of the topics directly, such as by correlation to a thesaurus (Steyvers and Griffiths, 2006) or by human study (Chang et al., 2009).

In this work, we focus on document collections where each document is endowed with a response variable, external to its words, that we are interested in predicting. There are many examples: in a collection of movie reviews, each document is summarized by a numerical rating; in a collection of news articles, each document is assigned to a section of the newspaper; in a collection of on-line scientific articles, each document is downloaded a certain number of times. To analyze such collections, we develop supervised topic models, where the goal is to infer latent topics that are predictive of the response. With a fitted model in hand, we can infer the topic structure of an unlabeled document and then form a prediction of its response.
Unsupervised LDA has previously been used to construct features for classification. The hope was that LDA topics would turn out to be useful for categorization, since they act to reduce data dimension (Blei et al., 2003; Fei-Fei and Perona, 2005; Quelhas et al., 2005). However, when the goal is prediction, fitting unsupervised topics may not be a good choice. Consider predicting a movie rating from the words in its review. Intuitively, good predictive topics will differentiate words like "excellent", "terrible", and "average", without regard to genre. But topics estimated from an unsupervised model may correspond to genres, if that is the dominant structure in the corpus.

The distinction between unsupervised and supervised topic models is mirrored in existing dimension-reduction techniques. For example, consider regression on unsupervised principal components versus partial least squares and projection pursuit (Hastie et al., 2001); both of the latter search for covariate linear combinations most predictive of a response variable. Linear supervised methods have nonparametric analogs, such as an approach based on kernel ICA (Fukumizu et al., 2004). In text analysis problems, McCallum et al. (2006) developed a joint topic model for words and categories, and Blei and Jordan (2003) developed an LDA model to predict caption words from images. In chemogenomic profiling, Flaherty et al. (2005) proposed "labelled LDA," which is also a joint topic model, but for genes and protein function categories. These models differ fundamentally from the model proposed here.

This paper is organized as follows. We develop the supervised latent Dirichlet allocation model (sLDA) for document-response pairs. We derive parameter estimation and prediction algorithms for general exponential family response distributions.
We show specific algorithms for a Gaussian response and a Poisson response, and suggest a general approach to any other exponential family. Finally, we demonstrate sLDA on two real-world problems. First, we predict movie ratings based on the text of the reviews. Second, we predict the political tone of a Senate amendment, based on an ideal-point analysis of the roll call data (Clinton et al., 2004). In both settings, we find that sLDA provides more predictive power than regression on unsupervised LDA features. The sLDA approach also improves on the lasso (Tibshirani, 1996), a modern regularized regression technique.

2. Supervised latent Dirichlet allocation

Topic models are distributions over document collections where each document is represented as a collection of discrete random variables W_{1:N}, which are its words. In topic models, we treat the words of a document as arising from a set of latent topics, that is, a set of unknown distributions over the vocabulary. Documents in a corpus share the same set of K topics, but each document uses a mix of topics (the topic proportions) unique to itself. Topic models are a relaxation of classical document mixture models, which associate each document with a single unknown topic. Thus they are mixed-membership models (Erosheva et al., 2004). See Steyvers and Griffiths (2006) and Blei and Lafferty (2009) for recent reviews.

Here we build on latent Dirichlet allocation (LDA) (Blei et al., 2003), a topic model that serves as the basis for many others. In LDA, we treat the topic proportions for a document as a draw from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic assignment from those proportions, then drawing a word from the corresponding topic.
In supervised latent Dirichlet allocation (sLDA), we add to LDA a response variable connected to each document. As mentioned, examples of this variable include the number of stars given to a movie, the number of times an on-line article was downloaded, or the category of a document. We jointly model the documents and the responses, in order to find latent topics that will best predict the response variables for future unlabeled documents. sLDA uses the same probabilistic machinery as a generalized linear model to accommodate various types of response: unconstrained real values, real values constrained to be positive (e.g., failure times), ordered or unordered class labels, nonnegative integers (e.g., count data), and other types.

Fix for a moment the model parameters: K topics β_{1:K} (each β_k a vector of term probabilities), a Dirichlet parameter α, and response parameters η and δ. (These response parameters are described in detail below.) Under the sLDA model, each document and response arises from the following generative process:

1. Draw topic proportions θ | α ~ Dir(α).
2. For each word n:
   (a) Draw topic assignment z_n | θ ~ Mult(θ).
   (b) Draw word w_n | z_n, β_{1:K} ~ Mult(β_{z_n}).
3. Draw response variable y | z_{1:N}, η, δ ~ GLM(z̄, η, δ), where we define

(1)   z̄ := (1/N) Σ_{n=1}^N z_n.

Figure 1 illustrates the family of probability distributions corresponding to this generative process as a graphical model. The distribution of the response is a generalized linear model (McCullagh and Nelder, 1989),

(2)   p(y | z_{1:N}, η, δ) = h(y, δ) exp{ [ (η^T z̄) y − A(η^T z̄) ] / δ }.

There are two main ingredients in a generalized linear model (GLM): the "random component" and the "systematic component." For the random component, we take the distribution of the response to be an exponential dispersion family with natural parameter η^T z̄ and dispersion parameter δ.
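The generative process above can be sketched in a few lines of code. The helper below is ours (not from the paper) and assumes a Gaussian response GLM, for which GLM(z̄, η, δ) is a normal distribution with mean η^T z̄ and variance δ:

```python
import numpy as np

def sample_document(N, alpha, beta, eta, delta, rng):
    """Sample one (document, response) pair from the sLDA generative process.

    beta: (K, V) topic-word probabilities; eta: (K,) GLM coefficients;
    delta: dispersion (here, the Gaussian variance). Hypothetical helper,
    illustrating the process with a Gaussian response.
    """
    K, V = beta.shape
    theta = rng.dirichlet(alpha)                  # 1. topic proportions
    z = rng.choice(K, size=N, p=theta)            # 2a. topic assignments
    words = np.array([rng.choice(V, p=beta[k]) for k in z])  # 2b. words
    z_bar = np.bincount(z, minlength=K) / N       # empirical frequencies (Eq. 1)
    y = rng.normal(loc=eta @ z_bar, scale=np.sqrt(delta))    # 3. GLM response
    return words, y

rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(50), size=4)   # K=4 topics over a 50-term vocabulary
words, y = sample_document(100, np.full(4, 0.25), beta,
                           np.array([2.0, 1.0, -1.0, -2.0]), 0.1, rng)
```

Note that the response depends on the document only through z̄, the empirical topic frequencies, exactly as in step 3 of the generative process.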
For each fixed δ, Equation 2 is an exponential family, with base measure h(y, δ), sufficient statistic y, and log-normalizer A(η^T z̄). The dispersion parameter provides additional flexibility in modeling the variance of y. Note that Equation 2 need not be an exponential family jointly in (η^T z̄, δ).

In the systematic component of the GLM, we relate the exponential-family parameter of the random component to a linear combination of covariates, the so-called linear predictor. For sLDA, we have already introduced the linear predictor: it is η^T z̄. The reader familiar with GLMs will recognize that our choice of systematic component means sLDA uses only canonical link functions. In future work, we will relax this constraint.

The GLM framework gives us the flexibility to model any type of response variable whose distribution can be written in the exponential dispersion form of Equation 2. This includes many commonly used distributions: the normal (for real response); the binomial (for binary response); the multinomial (for categorical response); the Poisson and negative binomial (for count data); the gamma, Weibull, and inverse Gaussian (for failure time data); and others. Each of these distributions corresponds to a particular choice of h(y, δ) and A(η^T z̄). For example, the normal distribution corresponds to h(y, δ) = (1/√(2πδ)) exp{−y²/(2δ)} and A(η^T z̄) = (η^T z̄)²/2. In this case, the usual Gaussian parameters, mean μ and variance σ², are equal to η^T z̄ and δ, respectively.

What distinguishes sLDA from the usual GLM is that the covariates are the unobserved empirical frequencies of the topics in the document. In the generative process, these latent variables are responsible for producing the words of the document, and thus the response and the words are tied. The regression coefficients on those frequencies constitute η.
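As a quick numeric sanity check (ours, not the paper's), the normal density can be verified against the exponential dispersion form of Equation 2 with the h and A given above:

```python
import math

def gaussian_pdf(y, mu, var):
    """The ordinary N(mu, var) density."""
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def dispersion_family_pdf(y, eta_z, delta):
    """Eq. 2 with natural parameter eta_z = η^T z̄, base measure
    h(y, δ) = (1/√(2πδ)) exp{−y²/(2δ)}, and A(η^T z̄) = (η^T z̄)²/2."""
    h = math.exp(-y ** 2 / (2 * delta)) / math.sqrt(2 * math.pi * delta)
    A = eta_z ** 2 / 2
    return h * math.exp((eta_z * y - A) / delta)

y, eta_z, delta = 1.3, 0.4, 2.0
assert abs(gaussian_pdf(y, eta_z, delta) - dispersion_family_pdf(y, eta_z, delta)) < 1e-12
```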
Note that a GLM usually includes an intercept term, which amounts to adding a covariate that always equals one. Here, such a term is redundant, because the components of z̄ always sum to one (see Equation 1).

By regressing the response on the empirical topic frequencies, we treat the response as non-exchangeable with the words. The document (i.e., words and their topic assignments) is generated first, under full word exchangeability; then, based on the document, the response variable is generated. In contrast, one could formulate a model in which y is regressed on the topic proportions θ. This treats the response and all the words as jointly exchangeable. But as a practical matter, our chosen formulation seems more sensible: the response depends on the topic frequencies which actually occurred in the document, rather than on the mean of the distribution generating the topics. Estimating the fully exchangeable model with enough topics allows some topics to be used entirely to explain the response variables, and others to be used to explain the word occurrences. This degrades predictive performance, as demonstrated in Blei and Jordan (2003). Put a different way, here the latent variables that govern the response are the same latent variables that governed the words. The model does not have the flexibility to infer topic mass that explains the response without also using it to explain some of the observed words.

3. Computation with supervised LDA

We need to address three computational problems to analyze data with sLDA. First is posterior inference: computing the conditional distribution of the latent variables at the document level given its words w_{1:N} and the corpus-wide model parameters. The posterior is thus a conditional distribution of topic proportions θ and topic assignments z_{1:N}. This distribution is not possible to compute exactly. We approximate it with variational inference.
Second is parameter estimation: estimating the Dirichlet parameter α, GLM parameters η and δ, and topic multinomials β_{1:K} from a data set of observed document-response pairs {w_{d,1:N}, y_d}_{d=1}^D. We use maximum likelihood estimation based on variational expectation-maximization.

[Fig 1. The graphical model representation of supervised latent Dirichlet allocation (sLDA). (Nodes are random variables; edges indicate possible dependence; a shaded node is an observed variable; an unshaded node is a hidden variable.)]

The final problem is prediction: predicting a response y from a newly observed document w_{1:N} and fixed values of the model parameters. This amounts to approximating the posterior expectation E[y | w_{1:N}, α, β_{1:K}, η, δ].

We treat these problems in turn for the general GLM setting of sLDA, noting where GLM-specific quantities need to be computed or approximated. We then consider the special cases of a Gaussian response and a Poisson response, for which the algorithms have an exact form. Finally, we suggest a general-purpose approximation procedure for other response distributions.

3.1 Posterior inference

Both parameter estimation and prediction hinge on posterior inference. Given a document and response, the posterior distribution of the latent variables is

(3)   p(θ, z_{1:N} | w_{1:N}, y, α, β_{1:K}, η, δ)
      = [ p(θ | α) Π_{n=1}^N p(z_n | θ) p(w_n | z_n, β_{1:K}) p(y | z_{1:N}, η, δ) ]
        / [ ∫ dθ p(θ | α) Σ_{z_{1:N}} Π_{n=1}^N p(z_n | θ) p(w_n | z_n, β_{1:K}) p(y | z_{1:N}, η, δ) ].

The normalizing value is the marginal probability of the observed data, i.e., the document w_{1:N} and response y. This normalizer is also known as the likelihood, or the evidence. As with LDA, it is not efficiently computable (Blei et al., 2003).
Thus, we appeal to variational methods to approximate the posterior. Variational methods encompass a number of types of approximation for the normalizing value of the posterior. For reviews, see Wainwright and Jordan (2008) and Jordan et al. (1999). We use mean-field variational inference, where Jensen's inequality is used to lower bound the normalizing value.

We let π denote the set of model parameters, π = {α, β_{1:K}, η, δ}, and q(θ, z_{1:N}) denote a variational distribution of the latent variables. The lower bound is

      log p(w_{1:N}, y | π) = log ∫_θ Σ_{z_{1:N}} p(θ, z_{1:N}, w_{1:N}, y | π)
                            = log ∫_θ Σ_{z_{1:N}} p(θ, z_{1:N}, w_{1:N}, y | π) q(θ, z_{1:N}) / q(θ, z_{1:N})
                            ≥ E[log p(θ, z_{1:N}, w_{1:N}, y | π)] − E[log q(θ, z_{1:N})],

where all expectations are taken with respect to q(θ, z_{1:N}). This bound is called the evidence lower bound (ELBO), which we denote by L(·). The first term is the expectation of the log of the joint probability of hidden and observed variables; the second term is the entropy of the variational distribution, H(q) = −E[log q(θ, z_{1:N})]. In its expanded form, the sLDA ELBO is

(4)   L(w_{1:N}, y | π) = E[log p(θ | α)] + Σ_{n=1}^N E[log p(Z_n | θ)]
      + Σ_{n=1}^N E[log p(w_n | Z_n, β_{1:K})] + E[log p(y | Z_{1:N}, η, δ)] + H(q).

This is a function of the observations {w_{1:N}, y} and the variational distribution. In variational inference, we first construct a parameterized family for the variational distribution, and then fit its parameters to tighten Equation 4 as much as possible for the given observations. The parameterization of the variational distribution governs the expense of optimization. Equation 4 is tight when q(θ, z_{1:N}) is the posterior, but specifying a family that contains the posterior leads to an intractable optimization problem. We thus specify a simpler family.
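Before specializing q, a toy numeric illustration (ours, not from the paper) of the Jensen bound above: for any variational distribution q, the ELBO never exceeds the log evidence, with equality at the exact posterior. We use a single discrete latent variable so the evidence is a finite sum.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])          # p(z)
lik = np.array([0.1, 0.7, 0.4])            # p(x | z) for one observed x
evidence = np.log(prior @ lik)             # log p(x), computable exactly here

rng = np.random.default_rng(0)
for _ in range(100):
    q = rng.dirichlet(np.ones(3))          # an arbitrary variational distribution
    # ELBO = E_q[log p(z, x)] - E_q[log q(z)]
    elbo = q @ (np.log(prior) + np.log(lik)) - q @ np.log(q)
    assert elbo <= evidence + 1e-12

# The bound is tight at the exact posterior q(z) ∝ p(z) p(x | z).
post = prior * lik / (prior @ lik)
elbo_star = post @ (np.log(prior) + np.log(lik)) - post @ np.log(post)
assert abs(elbo_star - evidence) < 1e-12
```

The gap between the two sides is the KL divergence from q to the posterior, which is why tightening the bound (below) recovers the closest member of the variational family.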
Here, we choose the fully factorized distribution,

(5)   q(θ, z_{1:N} | γ, φ_{1:N}) = q(θ | γ) Π_{n=1}^N q(z_n | φ_n),

where γ is a K-dimensional Dirichlet parameter vector and each φ_n parametrizes a categorical distribution over K elements. The latent topic assignment Z_n is represented as a K-dimensional indicator vector; thus E[Z_n] = q(z_n) = φ_n. Tightening the bound with respect to this family is equivalent to finding its member that is closest in KL divergence to the posterior (Wainwright and Jordan, 2008; Jordan et al., 1999). Thus, given a document-response pair, we maximize Equation 4 with respect to φ_{1:N} and γ to obtain an estimate of the posterior.

Before turning to optimization, we describe how to compute the ELBO in Equation 4. The first three terms and the entropy of the variational distribution are identical to the corresponding terms in the ELBO for unsupervised LDA (Blei et al., 2003). The first three terms are

(6)   E[log p(θ | α)] = log Γ(Σ_{i=1}^K α_i) − Σ_{i=1}^K log Γ(α_i) + Σ_{i=1}^K (α_i − 1) E[log θ_i]

(7)   E[log p(Z_n | θ)] = Σ_{i=1}^K φ_{n,i} E[log θ_i]

(8)   E[log p(w_n | Z_n, β_{1:K})] = Σ_{i=1}^K φ_{n,i} log β_{i,w_n}.

The entropy of the variational distribution is

(9)   H(q) = − Σ_{n=1}^N Σ_{i=1}^K φ_{n,i} log φ_{n,i} − log Γ(Σ_{i=1}^K γ_i) + Σ_{i=1}^K log Γ(γ_i) − Σ_{i=1}^K (γ_i − 1) E[log θ_i].

We note that the expectation of the log of the Dirichlet random variable is

(10)  E[log θ_i] = Ψ(γ_i) − Ψ(Σ_{j=1}^K γ_j),

where Ψ(x) denotes the digamma function, i.e., the first derivative of the log of the Gamma function. The variational objective function for sLDA differs from that of LDA in the fourth term, the expected log probability of the response variable given the latent topic assignments. This term is

(11)  E[log p(y | Z_{1:N}, η, δ)] = log h(y, δ) + (1/δ) [ η^T E[Z̄] y − E[A(η^T Z̄)] ].
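Equation 10 can be checked numerically. The snippet below (ours) approximates the digamma function with a central difference of `math.lgamma`, so it needs only the standard library and NumPy, and compares the analytic expectation to a Monte Carlo average over Dirichlet draws:

```python
import math
import numpy as np

def digamma(x, h=1e-5):
    """Digamma via a central difference of lgamma (standard library only)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

gamma = np.array([2.0, 5.0, 1.5])
# Eq. 10: E[log θ_i] = Ψ(γ_i) − Ψ(Σ_j γ_j) under θ ~ Dir(γ).
analytic = np.array([digamma(g) for g in gamma]) - digamma(gamma.sum())

rng = np.random.default_rng(0)
samples = rng.dirichlet(gamma, size=400_000)
monte_carlo = np.log(samples).mean(axis=0)

assert np.allclose(analytic, monte_carlo, atol=1e-2)
```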
We see that computing the sLDA ELBO hinges on two expectations. The first expectation is

(12)  E[Z̄] = φ̄ = (1/N) Σ_{n=1}^N φ_n,

which follows from Z_n being an indicator vector. The second expectation is E[A(η^T Z̄)]. This is computable in some models, such as when the response is Gaussian or Poisson distributed, but in the general case it must be approximated. We will address these settings in detail in Section 3.4. For now, we assume that this issue is resolvable and continue with the procedure for approximate posterior inference.

We use block coordinate-ascent variational inference, maximizing Equation 4 with respect to each variational parameter vector in turn. The terms that involve the variational Dirichlet γ are identical to those in unsupervised LDA, i.e., they do not involve the response variable y. Thus, the coordinate ascent update is as in Blei et al. (2003),

(13)  γ^new ← α + Σ_{n=1}^N φ_n.

The central difference between sLDA and LDA lies in the update for the variational multinomial φ_n. Given n ∈ {1, ..., N}, the partial derivative of the ELBO with respect to φ_n is

(14)  ∂L/∂φ_n = E[log θ] + E[log p(w_n | β_{1:K})] − log φ_n + 1 + (y/(Nδ)) η − (1/δ) ∂/∂φ_n { E[A(η^T Z̄)] }.

Optimizing with respect to the variational multinomial depends on the form of ∂/∂φ_n E[A(η^T Z̄)]. In some cases, such as a Gaussian or Poisson response, we obtain exact coordinate updates. In other cases, we require gradient-based optimization methods. Again, we postpone these details to Section 3.4.

Variational inference proceeds by iteratively updating the variational parameters {γ, φ_{1:N}} according to Equation 13 and the derivative in Equation 14. This finds a local optimum of the ELBO in Equation 4. The resulting variational distribution q is used as a proxy for the posterior.
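The closed-form pieces of the coordinate ascent, the expectation in Equation 12 and the Dirichlet update in Equation 13, can be written directly; the helper names below are ours:

```python
import numpy as np

def expected_z_bar(phi):
    """Eq. 12: E[Z̄] = (1/N) Σ_n φ_n. phi: (N, K), rows sum to one."""
    return phi.mean(axis=0)

def update_gamma(alpha, phi):
    """Eq. 13: γ_new = α + Σ_n φ_n. alpha: (K,) Dirichlet parameter."""
    return alpha + phi.sum(axis=0)

phi = np.full((4, 2), 0.5)   # N=4 words, K=2 topics, uniform responsibilities
assert np.allclose(expected_z_bar(phi), [0.5, 0.5])
assert np.allclose(update_gamma(np.array([0.5, 0.5]), phi), [2.5, 2.5])
```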
3.2 Parameter estimation

The parameters of sLDA are the K topics β_{1:K}, the Dirichlet hyperparameters α, the GLM coefficients η, and the GLM dispersion parameter δ. We fit these parameters with variational expectation maximization (EM), an approximate form of expectation maximization where the expectation is taken with respect to a variational distribution. As in the usual EM algorithm, maximization proceeds by maximum likelihood estimation under expected sufficient statistics (Dempster et al., 1977).

Our data are a corpus of document-response pairs D = {w_{d,1:N}, y_d}_{d=1}^D. Variational EM is an optimization of the corpus-level lower bound on the log likelihood of the data, which is the sum of per-document ELBOs. As we are now considering a collection of document-response pairs, in this section we add document indexes to the previous section's quantities, so response variable y becomes y_d, empirical topic assignment frequencies Z̄ becomes Z̄_d, and so on. Notice each document is endowed with its own variational distribution. Expectations are taken with respect to that document-specific variational distribution q_d(z_{1:N}, θ),

(15)  L(α, β_{1:K}, η, δ; D) = Σ_{d=1}^D ( E_d[log p(θ_d, z_{d,1:N}, w_{d,1:N}, y_d)] + H(q_d) ).

In the expectation step (E-step), we estimate the approximate posterior distribution for each document-response pair using the variational inference algorithm described above. In the maximization step (M-step), we maximize the corpus-level ELBO with respect to the model parameters. Variational EM finds a local optimum of Equation 15 by iterating between these steps. The M-step updates are described below.

Estimating the topics.
The M-step updates of the topics β_{1:K} are the same as for unsupervised LDA, where the probability of a word under a topic is proportional to the expected number of times that it was assigned to that topic (Blei et al., 2003),

(16)  β̂^new_{k,w} ∝ Σ_{d=1}^D Σ_{n=1}^N 1(w_{d,n} = w) φ^k_{d,n}.

Proportionality means that each β̂^new_k is normalized to sum to one. We note that in a fully Bayesian sLDA model, one can place a symmetric Dirichlet prior on the topics and use variational Bayes to approximate their posterior. This adds no additional complexity to the algorithm (Bishop, 2006).

Estimating the GLM parameters. The GLM parameters are the coefficients η and dispersion parameter δ. For the corpus-level ELBO, the gradient with respect to the GLM coefficients η is

(17)  ∂L/∂η = ∂/∂η (1/δ) Σ_{d=1}^D { η^T E[Z̄_d] y_d − E[A(η^T Z̄_d)] }

(18)        = (1/δ) { Σ_{d=1}^D φ̄_d y_d − Σ_{d=1}^D E_d[ μ(η^T Z̄_d) Z̄_d ] },

where μ(·) = E_GLM[Y | ·]. The appearance of this expectation follows from properties of the cumulant generating function A(η^T z̄) (Brown, 1986). The GLM mean response is a known function of η^T z̄_d in all standard cases. However, E[μ(η^T Z̄_d) Z̄_d] has an exact solution only in some cases, such as a Gaussian or Poisson response. In other cases, we approximate the expectation. (See Section 3.4.)

The derivative with respect to δ, evaluated at η̂^new, is

(19)  Σ_{d=1}^D [ (∂h(y_d, δ)/∂δ) / h(y_d, δ) ] − (1/δ²) Σ_{d=1}^D [ η̂^T_new E[Z̄_d] y_d − E[A(η̂^T_new Z̄_d)] ].

Equation 19 can be computed given that the rightmost summation has been evaluated, exactly or approximately, while optimizing the coefficients η. Depending on h(y, δ) and its partial derivative with respect to δ, we obtain δ̂^new either in closed form or via one-dimensional numerical optimization.

Estimating the Dirichlet parameter.
While we fix the Dirichlet parameter α to 1/K in Section 4, other applications of topic modeling fit this parameter (Blei et al., 2003; Wallach et al., 2009). Estimating the Dirichlet parameter follows the standard procedure for estimating a Dirichlet distribution (Ronning, 1989). In a fully observed setting, the sufficient statistics of the Dirichlet are the logs of the observed simplex vectors. Here, they are the expected logs of the topic proportions; see Equation 10.

3.3 Prediction

Our focus in applying sLDA is prediction. Given a new document w_{1:N} and a fitted model {α, β_{1:K}, η, δ}, we want to compute the expected response value,

(20)  E[Y | w_{1:N}, α, β_{1:K}, η, δ] = E[ μ(η^T Z̄) | w_{1:N}, α, β_{1:K} ].

To perform prediction, we approximate the posterior mean of Z̄ using variational inference. This is the same procedure as in Section 3.1, but here the terms depending on the response y are removed from the ELBO in Equation 4. Thus, with the variational distribution taking the same form as Equation 5, we implement the following coordinate ascent updates for the variational parameters,

(21)  γ^new = α + Σ_{n=1}^N φ_n

(22)  φ^new_n ∝ exp{ E_q[log θ] + log β_{w_n} },

where the log of a vector is the vector of the logs of its components, β_{w_n} is the vector of p(w_n | β_k) for each k ∈ {1, ..., K}, and again proportionality means that the vector is normalized to sum to one. Note that in this section we distinguish between expectations taken with respect to the model and those taken with respect to the variational distribution. (In other sections, all expectations are taken with respect to the variational distribution.)
This coordinate ascent algorithm is identical to variational inference for unsupervised LDA: since we averaged the response variable out of the right-hand side of Equation 20, what remains is the standard unsupervised LDA model for Z_{1:N} and θ (Blei et al., 2003). Notice this algorithm does not depend on the particular response type. Thus, given a new document, we first compute q(θ, z_{1:N}), the variational posterior distribution of the latent variables θ and Z_n. We then estimate the response with

(23)  E[Y | w_{1:N}, α, β_{1:K}, η, δ] ≈ E_q[ μ(η^T Z̄) ].

As with parameter estimation, this depends on being able to compute or approximate E_q[μ(η^T Z̄)].

3.4 Examples

The previous three sections have outlined the general computational strategy for sLDA: a procedure for approximating the posterior distribution of the topic proportions θ and topic assignments Z_{1:N} on a per-document/response basis, a maximum likelihood procedure for fitting the topics β_{1:K} and GLM parameters {η, δ} using variational EM, and a procedure for predicting new responses from observed documents using an approximate expectation based on a variational posterior.

Our derivation has remained free of any specific assumptions about the form of the response distribution. We now focus on sLDA for specific response distributions, the Gaussian and the Poisson, and suggest a general approximation technique for handling other responses within the GLM framework. The response distribution blocked our derivation at three points: the computation of E[A(η^T Z̄)] in the per-document ELBO of Equation 4, the computation of ∂/∂φ_n E[A(η^T Z̄)] in Equation 14 and the corresponding update for the variational multinomial φ_n, and the computation of E_q[μ(η^T Z̄_d) Z̄_d] for fitting the GLM parameters in the variational EM algorithm.
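The prediction procedure can be sketched as below (helper names ours), assuming a Gaussian response so that μ is the identity and Equation 23 reduces to η^T E_q[Z̄]; the digamma function is approximated with a central difference of `math.lgamma`:

```python
import math
import numpy as np

def digamma(x, h=1e-5):
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def predict_response(words, alpha, beta, eta, iters=50):
    """Unsupervised variational E-step for a new document (Eqs. 21-22), then
    η^T E[Z̄] (Eq. 23, Gaussian response so μ is the identity).
    words: (N,) word indices; beta: (K, V) topics; alpha, eta: (K,)."""
    K = beta.shape[0]
    phi = np.full((len(words), K), 1.0 / K)
    for _ in range(iters):
        gamma = alpha + phi.sum(axis=0)                    # Eq. 21
        e_log_theta = np.array([digamma(g) for g in gamma]) - digamma(gamma.sum())
        log_phi = e_log_theta + np.log(beta[:, words]).T   # Eq. 22, unnormalized
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
    return eta @ phi.mean(axis=0)                          # Eq. 12, then Eq. 23

rng = np.random.default_rng(1)
beta = rng.dirichlet(np.ones(20), size=3)     # K=3 topics, V=20 terms
y_hat = predict_response(rng.integers(0, 20, size=30),
                         np.full(3, 1 / 3), beta, np.array([1.0, 0.0, -1.0]))
assert -1.0 <= y_hat <= 1.0   # a convex combination of η's entries
```

Because E[Z̄] lies in the simplex, the Gaussian prediction is always a convex combination of the coefficients η, which is what the final assertion checks.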
We note that other aspects of working with sLDA are general to any type of exponential family response.

Gaussian response. When the response is Gaussian, the dispersed exponential family form can be seen as follows:

(24)  p(y | z̄, η, δ) = (1/√(2πδ)) exp{ −(y − η^T z̄)² / (2δ) }

(25)                 = (1/√(2πδ)) exp{ [ −y²/2 + y η^T z̄ − (η^T z̄ z̄^T η)/2 ] / δ }.

Here the natural parameter η^T z̄ and the mean parameter are identical; thus, μ(η^T z̄) = η^T z̄. The usual variance parameter σ² is equal to δ. Moreover, notice that

(26)  h(y, δ) = (1/√(2πδ)) exp{ −y²/(2δ) }

(27)  A(η^T z̄) = (η^T z̄ z̄^T η)/2.

We first specify the variational inference algorithm. This requires computing the expectation of the log normalizer in Equation 27, which hinges on E[Z̄ Z̄^T],

(28)  E[Z̄ Z̄^T] = (1/N²) Σ_{n=1}^N Σ_{m=1}^N E[Z_n Z_m^T]

(29)            = (1/N²) ( Σ_{n=1}^N Σ_{m≠n} φ_n φ_m^T + Σ_{n=1}^N diag{φ_n} ).

To see Equation 29, notice that for m ≠ n, E[Z_n Z_m^T] = E[Z_n] E[Z_m]^T = φ_n φ_m^T, because the variational distribution is fully factorized. On the other hand, E[Z_n Z_n^T] = diag(E[Z_n]) = diag(φ_n), because Z_n is an indicator vector.

To take the derivative of the expected log normalizer, we consider it as a function of a single variational parameter φ_j. Define φ_{−j} := Σ_{n≠j} φ_n. Expressed as a function of φ_j, E[A(η^T Z̄)] is

(30)  f(φ_j) = (1/(2N²)) η^T [ φ_j φ_{−j}^T + φ_{−j} φ_j^T + diag{φ_j} ] η + const

(31)         = (1/(2N²)) [ 2 (η^T φ_{−j}) (η^T φ_j) + (η ∘ η)^T φ_j ] + const,

where ∘ denotes the elementwise product. Thus the gradient is

(32)  ∂/∂φ_j E[A(η^T Z̄)] = (1/(2N²)) [ 2 (η^T φ_{−j}) η + (η ∘ η) ].

Substituting this gradient into Equation 14 yields an exact coordinate update for the variational multinomial φ_j,

(33)  φ_j^new ∝ exp{ E[log θ] + E[log p(w_j | β_{1:K})] + (y/(Nδ)) η − (1/(2N²δ)) [ 2 (η^T φ_{−j}) η + (η ∘ η) ] }.

Exponentiating a vector means forming the vector of exponentials.
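Equation 29 is easy to check numerically. The function below (ours, not the paper's code) computes E[Z̄ Z̄^T] from the variational parameters and compares it to a Monte Carlo average over sampled indicator vectors:

```python
import numpy as np

def expected_zbar_outer(phi):
    """Eq. 29: E[Z̄ Z̄^T] = (1/N²)(Σ_n Σ_{m≠n} φ_n φ_m^T + Σ_n diag{φ_n})."""
    N, _ = phi.shape
    total = phi.sum(axis=0)
    cross = np.outer(total, total) - phi.T @ phi   # the m ≠ n cross terms
    return (cross + np.diag(total)) / N ** 2

rng = np.random.default_rng(2)
phi = rng.dirichlet(np.ones(3), size=5)            # N=5 words, K=3 topics

# Monte Carlo: draw indicator vectors Z_n ~ Mult(φ_n) and average Z̄ Z̄^T.
idx = np.stack([rng.choice(3, size=50_000, p=p) for p in phi])   # (N, S)
zbar = np.eye(3)[idx].mean(axis=0)                               # (S, K)
mc = zbar.T @ zbar / zbar.shape[0]
assert np.allclose(expected_zbar_outer(phi), mc, atol=1e-2)
```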
The proportionality symbol means the components of φ_j^new are computed according to Equation 33, then normalized to sum to one. Note that E[log θ_i] is given in Equation 10.

Examining this update in the Gaussian setting uncovers further intuitions about the difference between sLDA and LDA. As in LDA, the jth word's variational distribution over topics depends on the word's topic probabilities under the actual model (determined by β_{1:K}). But w_j's variational distribution, and those of all other words, affect the probability of the response. To see this, consider the expectation of log p(y | z̄, η, δ) from Equation 24, which is a term in the document-level ELBO of Equation 4. Notice that the variational distribution q(z_n) plays a role in the expected residual sum of squares E[(y − η^T Z̄)²]. The end result is that the update in Equation 33 also encourages the corresponding variational parameter φ_j to decrease this expected residual sum of squares. Further, the update in Equation 33 depends on the variational parameters φ_{−j} of all other words. Unlike LDA, the φ_j cannot be updated in parallel. Distinct occurrences of the same term must be treated separately.

We now turn to parameter estimation for the Gaussian response sLDA model, i.e., the M-step updates for the parameters η and δ. Define y := y_{1:D} as the vector of response values across documents, and let X be the D × K design matrix whose rows are the vectors Z̄_d. Setting to zero the η gradient of the corpus-level ELBO from Equation 18, we arrive at an expected-value version of the normal equations:

(34)  E[X^T X] η = E[X]^T y   ⇒   η̂^new ← (E[X^T X])^{-1} E[X]^T y.

Here the expectation is over the matrix X, using the variational distribution parameters chosen in the previous E-step. Note that the dth row of E[X] is just E[Z̄_d] from Equation 12.
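The M-step can be sketched as follows (helper names ours): the topic update of Equation 16, and, for a Gaussian response, the GLM updates of Equations 34 and 37:

```python
import numpy as np

def update_topics(docs, phis, K, V):
    """Eq. 16: a topic's word probabilities are proportional to the expected
    number of times each word was assigned to it. docs: list of (N_d,) word-index
    arrays; phis: list of (N_d, K) variational multinomials."""
    beta = np.zeros((K, V))
    for words, phi in zip(docs, phis):
        for n, w in enumerate(words):
            beta[:, w] += phi[n]          # expected counts 1(w_{d,n} = w) φ_{d,n}
    return beta / beta.sum(axis=1, keepdims=True)

def glm_mstep_gaussian(EX, EXtX, y):
    """Eqs. 34 and 37. EX: D×K matrix of E[Z̄_d]; EXtX: Σ_d E[Z̄_d Z̄_d^T]."""
    eta = np.linalg.solve(EXtX, EX.T @ y)        # Eq. 34, the normal equations
    delta = (y @ y - y @ EX @ eta) / len(y)      # Eq. 37
    return eta, delta

docs = [np.array([0, 1, 1]), np.array([2, 0])]
phis = [np.full((3, 2), 0.5), np.array([[1.0, 0.0], [0.0, 1.0]])]
beta = update_topics(docs, phis, K=2, V=3)
assert np.allclose(beta.sum(axis=1), 1.0)

EX = np.array([[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])
EXtX = EX.T @ EX + 0.05 * np.eye(2)   # stand-in; in sLDA this comes from Eq. 29
y = np.array([2.0, 1.5, 0.5, 0.0])
eta, delta = glm_mstep_gaussian(EX, EXtX, y)
assert delta >= 0.0
```

Note that `EXtX` is E[X^T X], not E[X]^T E[X]; as discussed below, this is exactly why the η update is not ordinary least squares on E[X].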
Also,

E[X^\top X] = \sum_d E[\bar{Z}_d \bar{Z}_d^\top],   (35)

with each term having a fixed value from the previous E-step as well, given by Equation 29.

We now apply the first-order condition for \delta of Equation 19. The partial derivative needed is

\frac{\partial h(y_d, \delta)/\partial \delta}{h(y_d, \delta)} = -\frac{1}{2\delta} + \frac{y_d^2}{2\delta^2},   (36)

which can be seen from Equation 26. Using this derivative and the definition of \hat{\eta}^{\text{new}} in Equation 19, we obtain

\hat{\delta}^{\text{new}} \leftarrow \frac{1}{D} \left\{ y^\top y - y^\top E[X] \left( E[X^\top X] \right)^{-1} E[X]^\top y \right\}.   (37)

It is tempting to try a further simplification of Equation 37: in an ordinary least squares (OLS) regression of y on the columns of E[X], the analog of Equation 37 would just equal 1/D times the sum of the squared residuals (y - E[X]\hat{\eta}^{\text{new}}). However, that identity does not hold here, because the inverted matrix in Equation 37 is E[X^\top X], rather than E[X]^\top E[X]. This illustrates that the \eta update of Equation 34 is not just an OLS regression of y on E[X].

Finally, we form predictions just as in Section 3.3. Since the mapping from the natural parameter to the mean parameter is the identity function,

E[Y \mid w_{1:N}, \alpha, \beta_{1:K}, \eta, \delta] \approx \eta^\top E[\bar{Z}].   (38)

Again, the expectation is taken with respect to variational inference in the unsupervised setting (Equation 21 and Equation 22).

Poisson response. An overdispersed Poisson response provides a natural generalized linear model formulation for count data. Given mean parameter \lambda, the overdispersed Poisson has density

p(y \mid \lambda, \delta) = \frac{1}{y!}\, \lambda^{y/\delta} \exp\{-\lambda/\delta\}.   (39)

This can be put in the overdispersed exponential family form

p(y \mid \lambda, \delta) = \frac{1}{y!} \exp\left\{ \frac{y \log\lambda - \lambda}{\delta} \right\}.   (40)

In the GLM, the natural parameter is \log\lambda = \eta^\top \bar{z}, h(y, \delta) = 1/y!, and

A(\eta^\top \bar{z}) = \mu(\eta^\top \bar{z}) = \exp\{\eta^\top \bar{z}\}.   (41)
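As a quick numerical check of this setup, the density of Equation 39 and its GLM form in Equations 40-41 agree once \log\lambda = \eta^\top \bar{z}. The sketch below uses illustrative parameter values.

```python
import numpy as np
from math import factorial

def overdispersed_poisson_pdf(y, lam, delta):
    """Density of Equation 39: (1/y!) lambda^{y/delta} exp(-lambda/delta)."""
    return lam ** (y / delta) * np.exp(-lam / delta) / factorial(y)

def glm_form_pdf(y, zbar, eta, delta):
    """The same density in the GLM form of Equations 40-41: natural
    parameter log(lambda) = eta^T zbar, h(y, delta) = 1/y!, and
    A(eta^T zbar) = mu(eta^T zbar) = exp(eta^T zbar)."""
    nat = float(eta @ zbar)                              # eta^T zbar
    return np.exp((y * nat - np.exp(nat)) / delta) / factorial(y)
```

Both functions evaluate the same probability because y \log\lambda - \lambda = y\,\eta^\top\bar{z} - \exp\{\eta^\top\bar{z}\} under the substitution.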
We first compute the expectation of the log normalizer:

E[A(\eta^\top \bar{Z})] = E\left[ \exp\left\{ (1/N) \sum_{n=1}^{N} \eta^\top Z_n \right\} \right]   (42)

 = \prod_{n=1}^{N} E\left[ \exp\{ (1/N)\, \eta^\top Z_n \} \right],   (43)

where

E[\exp\{(1/N)\, \eta^\top Z_n\}] = \sum_{i=1}^{K} \left( \phi_{n,i} \exp\{\eta_i / N\} + (1 - \phi_{n,i}) \right)   (44)

 = K - 1 + \sum_{i=1}^{K} \phi_{n,i} \exp\{\eta_i / N\}.   (45)

The product in Equation 43 also provides E[\mu(\eta^\top \bar{Z})], which is needed for prediction in Equation 20. Denote this product by C, and let C_{-n} denote the same product taken over all indices except the nth. The derivative of the expected log normalizer needed to update \phi_n is

\frac{\partial}{\partial \phi_n}\, E[A(\eta^\top \bar{Z})] = C_{-n} \exp\{\eta / N\}.   (46)

As for the Gaussian response, this permits an exact update of the variational multinomial.

In the derivative of the corpus-level ELBO with respect to the coefficients \eta (Equation 18), we need to compute the expected mean parameter times \bar{Z}:

E[\mu(\eta^\top \bar{Z})\, \bar{Z}] = \frac{1}{N} \sum_{n=1}^{N} E[\mu(\eta^\top \bar{Z})\, Z_n]   (47)

 = \frac{\exp\{\eta / N\}}{N} \sum_{n=1}^{N} C_{-n}\, \phi_n.   (48)

We turn to the M-step. We cannot find an exact M-step update for the GLM coefficients, but we can compute the gradient for use in a convex optimization algorithm. The gradient of the corpus-level ELBO is

\frac{\partial \mathcal{L}}{\partial \eta} = \frac{1}{\delta} \left( \sum_d E[\bar{Z}_d]\, y_d - \sum_d \frac{\exp\{\eta / N_d\}}{N_d} \sum_n C_{d,-n}\, \phi_{d,n} \right).   (49)

The derivative of h(y, \delta) with respect to \delta is zero. Thus the dispersion parameter M-step is exact:

\hat{\delta}^{\text{new}} \leftarrow \frac{\sum_d \hat{\eta}_{\text{new}}^\top E[\bar{Z}_d]\, y_d}{\sum_d E[A(\hat{\eta}_{\text{new}}^\top \bar{Z}_d)]}.   (50)

As in the general case, Poisson sLDA prediction requires computing E[\mu(\eta^\top \bar{Z})]. Since \mu(\cdot) and A(\cdot) are identical, this is given by Equation 43.

Exponential family response via the delta method. With a general exponential family response, one can use the multivariate delta method for moments to approximate difficult expectations (Bickel and Doksum, 2007), a method which is effective in variational approximations (Braun and McAuliffe, 2010).
In other work, Chang and Blei (2010) use this method with logistic regression to adapt sLDA to network analysis; Wang et al. (2009) use this method with multinomial regression for image classification. With the multivariate delta method, one can embed any generalized linear model into sLDA.

4. Empirical study

We studied sLDA on two prediction problems. First, we consider the "sentiment analysis" of newspaper movie reviews. We use the publicly available data introduced in Pang and Lee (2005), which contains movie reviews paired with the number of stars given. While Pang and Lee (2005) treat this as a classification problem, we treat it as a regression problem.

Analyzing document data requires choosing an appropriate vocabulary on which to estimate the topics. In topic modeling research, one typically removes very common words, which cannot help discriminate between documents, and very rare words, which are unlikely to be of predictive importance. We select the vocabulary by removing words that occur in more than 25% of the documents and words that occur in fewer than 5 documents. This yielded a vocabulary of 2180 words, a corpus of 5006 documents, and 908,000 observations. For more on selecting a vocabulary in topic models, see Blei and Lafferty (2009).
Fig 2. A 12-topic sLDA model fit to the movie review data. [The figure lists each topic's most likely words alongside its estimated coefficient, ranging from about -2 for topics with words such as "waste" and "unbearable" to about 2 for topics with words such as "world", "powerful", "life", and "human".]

Second, we studied the texts of amendments from the 109th and 110th Senates. Here the response variables are the discrimination parameters from an ideal point analysis of the votes (Clinton et al., 2004). Ideal point analysis of voting data is used in quantitative political science to map senators to a real-valued point on a political spectrum. Ideal point models posit that the votes of each senator j are summarized by a real-valued latent variable x_j. The senator's vote on an issue i, the binary variable y_{ij}, is determined by a probit model:

y_{ij} \sim \text{Probit}(x_j \beta_i + \alpha_i).

Thus, each issue is connected to two parameters. For our analysis, we are most interested in the "issue discrimination" \beta_i. When \beta_i and x_j have the same sign, senator j is more likely to vote in favor of issue i. Just as x_j can be interpreted as a senator's point on the political spectrum, \beta_i can be interpreted as an issue's place on that spectrum. (The intercept term \alpha_i is called the "difficulty" and allows for bias in the votes, regardless of the ideal points of the senators.)

We connected the results of an ideal point analysis to texts about the votes. In particular, we study the amendments considered by the 109th and 110th U.S. Senates.[1] As a preprocessing step, we estimated the discrimination parameters \beta_i for each amendment, based on the voting record.
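The probit model above is easy to simulate through its latent-variable representation, which also makes the sign interaction between \beta_i and x_j concrete. This is a minimal sketch with illustrative parameter values, not estimates from the Senate data.

```python
import numpy as np

def simulate_votes(x, beta, alpha, rng):
    """Simulate roll-call votes from the probit ideal point model,
    y_ij ~ Probit(x_j beta_i + alpha_i), via its latent-variable form:
    senator j votes 'yea' on issue i when x_j beta_i + alpha_i + eps > 0,
    with eps ~ N(0, 1).  Rows index issues, columns index senators."""
    utility = np.outer(beta, x) + alpha[:, None]   # issues x senators
    eps = rng.standard_normal(utility.shape)       # probit noise
    return (utility + eps > 0).astype(int)         # y_ij in {0, 1}
```

When \beta_i and x_j share a sign, the latent utility x_j \beta_i + \alpha_i is pushed positive and a "yea" vote becomes more likely, matching the interpretation of the discrimination parameter.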
Each discrimination \beta_i is used as the response variable for its amendment text. We study the question: how well can we predict the political tone of an amendment based only on its text?

[1] We collected our data, the amendment texts and the voting records, from the open government website www.govtrack.com.

Fig 3. A 5-topic sLDA model fit to the 109th U.S. Senate data. [The figure lists each topic's most likely words alongside its estimated coefficient, ranging from about -1.0 to 0.5; topics include local-government grants, immigration, energy and transportation, defense procurement, court procedure, and debt and taxes.]

We pre-processed the Senate text in the same way as the reviews data. To obtain the response variables, we used the implementation of ideal point analysis in Simon Jackman's Political Science Computational Laboratory R package. For the 109th Senate, this yielded 288 amendments, a vocabulary of 2084 words, and 62,000 observations. For the 110th Senate, this yielded 213 amendments, a vocabulary of 1653 words, and 63,000 observations.

For both the reviews and the Senate amendments, we transformed the response to approximate normality by taking logs. This makes the data amenable to the continuous-response model of Section 2; for these two problems, generalized linear modeling turned out to be unnecessary. We initialized \beta_{1:K} to randomly perturbed uniform topics, \sigma^2 to the sample variance of the response, and \eta to a grid on [-1, 1] in increments of 2/K.
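That initialization can be sketched as follows. The text does not specify the perturbation scale or whether the grid includes the endpoint 1, so the jitter value and the grid convention below are assumptions.

```python
import numpy as np

def init_slda(K, V, y, rng, jitter=0.01):
    """Initialization described in the text: topics beta_{1:K} as randomly
    perturbed uniform distributions over V words, sigma^2 as the sample
    variance of the response y, and eta as a grid on [-1, 1] with step 2/K.
    The jitter scale and starting the grid at -1 are assumptions."""
    beta = np.full((K, V), 1.0 / V) + jitter * rng.uniform(size=(K, V)) / V
    beta /= beta.sum(axis=1, keepdims=True)    # renormalize each topic
    sigma2 = float(np.var(y))                  # sample variance of the response
    eta = -1.0 + (2.0 / K) * np.arange(K)      # grid on [-1, 1], increments of 2/K
    return beta, sigma2, eta
```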
We ran EM until the relative change in the corpus-level likelihood bound was less than 0.01%. In the E-step, we ran coordinate-ascent variational inference for each document until the relative change in the per-document ELBO was less than 0.01%.

We can examine the patterns of words found by models fit to these data. In Figure 2 we illustrate a 12-topic sLDA model fit to the movie review data. Each topic is plotted as a list of its most likely words, and each is attached to its estimated coefficient in the linear model. Words like "powerful" and "complex" are in a topic with a high positive coefficient; words like "waste" and "unbearable" are in a topic with a high negative coefficient. In Figure 3 and Figure 4 we illustrate sLDA models fit to the Senate amendment data. These models are harder to interpret than the models fit to the movie review data, as the response variable is likely less governed by the text. (Much more goes into a senator's decision of how to vote.) Nonetheless, some patterns are worth noting. The health care amendments in the 110th Senate were a distinctly right-wing issue; grants and immigration in the 109th Senate were a left-wing issue.

We assessed the quality of the predictions using five-fold cross-validation. We measured error in two ways. First, we measured the correlation between the out-of-fold predictions and the out-of-fold response variables. Second, we measured "predictive R^2," which we define as the fraction of variability in the out-of-fold response values that is captured by the out-of-fold predictions:

pR^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}.

We compared sLDA to linear regression on the \bar{\phi}_d from unsupervised LDA. This is the regression equivalent of using LDA topics as classification features (Blei et al., 2003; Fei-Fei and Perona, 2005).
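The two error measures follow directly from their definitions. The fold-splitting helper below is an illustrative stand-in for however the folds were actually constructed.

```python
import numpy as np

def predictive_r2(y, y_hat):
    """pR^2 = 1 - sum((y - y_hat)^2) / sum((y - ybar)^2), computed on
    out-of-fold responses y and out-of-fold predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def five_fold_splits(D, rng):
    """Shuffle the D document indices and partition them into five folds."""
    return np.array_split(rng.permutation(D), 5)
```

Predicting every response exactly gives pR^2 = 1, while predicting the out-of-fold mean gives pR^2 = 0; a model can score below zero if it predicts worse than the mean.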
We studied the quality of these predictions across different numbers of topics. Figures 5, 6 and 7 illustrate that sLDA provides improved predictions on all data. The movie review rating is easier to predict than the U.S. Senates' discrimination parameters. However, our study shows that there is predictive power in the texts of the amendments.

Finally, we compared sLDA to the lasso, which is L_1-regularized least-squares regression. The lasso is a widely used prediction method for high-dimensional problems (Tibshirani, 1996). We used each document's empirical distribution over words as its lasso covariates. We report the highest pR^2 the lasso achieved across different settings of the complexity parameter, and compare it to the highest pR^2 attained by sLDA across different numbers of topics. For the review data, the best lasso pR^2 was 0.426, versus 0.432 for sLDA, a modest 2% improvement. On the U.S. Senate data, sLDA provided definitively better predictions. For the 109th U.S. Senate data, the best lasso pR^2 was 0.15, versus 0.27 for sLDA, an 80% improvement. For the 110th U.S. Senate data, the best lasso pR^2 was 0.16, versus 0.23 for sLDA, a 43% improvement. Note, moreover, that the lasso provides only a prediction rule, whereas sLDA models latent structure useful for other purposes.

5. Discussion

We have developed sLDA, a statistical model of labelled documents. The model accommodates the different types of response variable commonly encountered in practice. We presented a variational procedure for approximate posterior inference, which we then incorporated into an EM algorithm for maximum-likelihood parameter estimation. We studied the model's predictive performance, our main focus, on two real-world problems.
In both cases, we found that sLDA improved on two natural competitors: unsupervised LDA analysis followed by linear regression, and the lasso. These results illustrate the benefits of supervised dimension reduction when prediction is the ultimate goal.

We close with remarks on future directions. First, a "semi-supervised" version of sLDA, in which some documents have a response and others do not, is straightforward: simply omit the last two terms in Equation 14 for unlabelled documents, and include only labelled documents in Equation 18 and Equation 19. Since partially labelled corpora are the rule rather than the exception, this is a valuable avenue. (Note, though, that in this setting care must be taken that the response data exert sufficient influence on the fit.) Second, if we observe an additional fixed-dimensional covariate vector with each document, we can include it in the analysis simply by adding it to the linear predictor. This change will generally require us to add an intercept term as well. Third, the technique we have used to incorporate a response can be applied in existing variants of LDA, such as author-topic models (Rosen-Zvi et al., 2004), population-genetics models (Pritchard et al., 2000), and survey-data models (Erosheva, 2002). We have already mentioned that sLDA has been adapted to network models (Chang and Blei, 2010) and image models (Wang et al., 2009). Finally, we are now studying the maximization of conditional likelihood rather than joint likelihood, as well as Bayesian nonparametric methods to explicitly incorporate uncertainty about the number of topics.

References

Bickel, P. and Doksum, K. (2007). Mathematical Statistics: Basic Ideas and Selected Topics, volume 1. Pearson Prentice Hall, Upper Saddle River, NJ, 2nd edition.

Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer, New York.
Blei, D. and Jordan, M. (2003). Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134. ACM Press.

Blei, D. and Lafferty, J. (2009). Topic models. In Srivastava, A. and Sahami, M., editors, Text Mining: Theory and Applications. Taylor and Francis.

Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Braun, M. and McAuliffe, J. (2010). Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association.

Brown, L. (1986). Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, Hayward, CA.

Chang, J. and Blei, D. (2010). Hierarchical relational models for document networks. Annals of Applied Statistics.

Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., and Blei, D. (2009). Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems.

Clinton, J., Jackman, S., and Rivers, D. (2004). The statistical analysis of roll call data. American Political Science Review, 98(2):355–370.

Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38.

Erosheva, E. (2002). Grade of membership and latent structure models with application to disability survey data. PhD thesis, Carnegie Mellon University, Department of Statistics.

Erosheva, E., Fienberg, S., and Lafferty, J. (2004). Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences, 97(22):11885–11892.

Fei-Fei, L. and Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In IEEE Computer Vision and Pattern Recognition, pages 524–531.
Flaherty, P., Giaever, G., Kumm, J., Jordan, M., and Arkin, A. (2005). A latent variable model for chemogenomic profiling. Bioinformatics, 21(15):3286–3293.

Fukumizu, K., Bach, F., and Jordan, M. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99.

Griffiths, T., Steyvers, M., Blei, D., and Tenenbaum, J. (2005). Integrating topics and syntax. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17, pages 537–544, Cambridge, MA. MIT Press.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer.

Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. (1999). Introduction to variational methods for graphical models. Machine Learning, 37:183–233.

Marlin, B. (2003). Modeling user rating profiles for collaborative filtering. In Neural Information Processing Systems.

McCallum, A., Pal, C., Druck, G., and Wang, X. (2006). Multi-conditional learning: Generative/discriminative training for clustering and classification. In AAAI.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman and Hall, London.

Mimno, D. and McCallum, A. (2007). Organizing the OCA: Learning faceted subjects from a library of digital books. In Joint Conference on Digital Libraries.

Pang, B. and Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the Association for Computational Linguistics.

Pritchard, J., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155:945–959.

Quelhas, P., Monay, F., Odobez, J., Gatica-Perez, D., Tuytelaars, T., and Van Gool, L. (2005). Modeling scenes with local descriptors and latent aspects. In ICCV.
Ramage, D., Hall, D., Nallapati, R., and Manning, C. (2009). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Empirical Methods in Natural Language Processing.

Ronning, G. (1989). Maximum likelihood estimation of Dirichlet distributions. Journal of Statistical Computation and Simulation, 34(4):215–221.

Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. (2004). The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494. AUAI Press.

Steyvers, M. and Griffiths, T. (2006). Probabilistic topic models. In Landauer, T., McNamara, D., Dennis, S., and Kintsch, W., editors, Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288.

Wainwright, M. and Jordan, M. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.

Wallach, H., Mimno, D., and McCallum, A. (2009). Rethinking LDA: Why priors matter. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I., and Culotta, A., editors, Advances in Neural Information Processing Systems 22, pages 1973–1981.

Wang, C., Blei, D., and Li, F. (2009). Simultaneous image classification and annotation. In Computer Vision and Pattern Recognition.

Wei, X. and Croft, B. (2006). LDA-based document models for ad-hoc retrieval. In SIGIR.
Fig 4. A 15-topic sLDA model fit to the 110th U.S. Senate data. [The figure lists each topic's most likely words alongside its estimated coefficient, ranging from about -2 to 3; topics include procurement and budgeting, foreign intelligence surveillance, immigration, appropriations, veterans' affairs, Iraq, taxation, child health insurance, environment and land, and health care.]

Fig 5. Error between out-of-fold predictions and observed response on the movie review data. [The figure plots correlation and predictive R^2 against the number of topics (10 to 50) for LDA and sLDA.]
Fig 6. Error between out-of-fold predictions and observed response on the 109th U.S. Senate data. [The figure plots correlation and predictive R^2 against the number of topics (5 to 30) for LDA and sLDA.]

Fig 7. Error between out-of-fold predictions and observed response on the 110th U.S. Senate data. [The figure plots correlation and predictive R^2 against the number of topics (5 to 40) for LDA and sLDA.]