Improved Bayesian Logistic Supervised Topic Models with Data Augmentation


Authors: Jun Zhu, Xun Zheng, Bo Zhang

Department of Computer Science and Technology, TNLIST Lab and State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing, China
{dcszj,dcszb}@tsinghua.edu.cn; vforveri.zheng@gmail.com

Abstract

Supervised topic models with a logistic likelihood have two issues that potentially limit their practical use: 1) response variables are usually over-weighted by document word counts; and 2) existing variational inference methods make strict mean-field assumptions. We address these issues by: 1) introducing a regularization constant to better balance the two parts based on an optimization formulation of Bayesian inference; and 2) developing a simple Gibbs sampling algorithm by introducing auxiliary Polya-Gamma variables and collapsing out Dirichlet variables. Our augment-and-collapse sampling algorithm has analytical forms of each conditional distribution without making any restricting assumptions and can be easily parallelized. Empirical results demonstrate significant improvements on prediction performance and time efficiency.

1 Introduction

As widely adopted in supervised latent Dirichlet allocation (sLDA) models (Blei and McAuliffe, 2010; Wang et al., 2009), one way to improve the predictive power of LDA is to define a likelihood model for the widely available document-level response variables, in addition to the likelihood model for document words. For example, the logistic likelihood model is commonly used for binary or multinomial responses. By imposing some priors, posterior inference is done with the Bayes' rule. Though powerful, one issue that could limit the use of existing logistic supervised LDA models is that they treat the document-level response variable as one additional word via a normalized likelihood model.
Although some special treatment is carried out on defining the likelihood of the single response variable, it is normally of a much smaller scale than the likelihood of the usually tens or hundreds of words in each document. As noted by Halpern et al. (2012) and observed in our experiments, this model imbalance could result in a weak influence of the response variables on the topic representations and thus unsatisfactory prediction performance. Another difficulty that arises when dealing with categorical response variables is that the commonly used normal priors are no longer conjugate to the logistic likelihood and thus lead to hard inference problems. Existing approaches rely on variational approximation techniques, which normally make strict mean-field assumptions.

To address the above issues, we present two improvements. First, we present a general framework of Bayesian logistic supervised topic models with a regularization parameter to better balance response variables and words. Technically, instead of doing standard Bayesian inference via Bayes' rule, which requires a normalized likelihood model, we propose to do regularized Bayesian inference (Zhu et al., 2011; Zhu et al., 2013b) by solving an optimization problem, where the posterior regularization is defined as an expectation of a logistic loss, a surrogate loss of the expected misclassification error, and a regularization parameter is introduced to balance the surrogate classification loss (i.e., the response log-likelihood) and the word likelihood. The general formulation subsumes standard sLDA as a special case.
Second, to solve the intractable posterior inference problem of the generalized Bayesian logistic supervised topic models, we present a simple Gibbs sampling algorithm by exploring the ideas of data augmentation (Tanner and Wong, 1987; van Dyk and Meng, 2001; Holmes and Held, 2006). More specifically, we extend Polson's method for Bayesian logistic regression (Polson et al., 2012) to the generalized logistic supervised topic models, which are much more challenging due to the presence of non-trivial latent variables. Technically, we introduce a set of Polya-Gamma variables, one per document, to reformulate the generalized logistic pseudo-likelihood model (with the regularization parameter) as a scale mixture, where the mixture component is conditionally normal for the classifier parameters. Then, we develop a simple and efficient Gibbs sampling algorithm with analytic conditional distributions and no Metropolis-Hastings accept/reject steps. For Bayesian LDA models, we can also explore the conjugacy of the Dirichlet-multinomial prior-likelihood pairs to collapse out the Dirichlet variables (i.e., topics and mixing proportions) and do collapsed Gibbs sampling, which can have better mixing rates (Griffiths and Steyvers, 2004). Finally, our empirical results on real data sets demonstrate significant improvements on time efficiency. The classification performance is also significantly improved by using appropriate regularization parameters. We also provide a parallel implementation with GraphLab (Gonzalez et al., 2012), which shows great promise in our preliminary studies.

The paper is structured as follows. Sec. 2 introduces logistic supervised topic models as a general optimization problem. Sec. 3 presents Gibbs sampling algorithms with data augmentation. Sec. 4 presents experiments. Sec. 5 concludes.
2 Logistic Supervised Topic Models

We now present the generalized Bayesian logistic supervised topic models.

2.1 The Generalized Models

We consider binary classification with a training set D = {(w_d, y_d)}_{d=1}^D, where the response variable Y takes values from the output space Y = {0, 1}. A logistic supervised topic model consists of two parts: an LDA model (Blei et al., 2003) for describing the words W = {w_d}_{d=1}^D, where w_d = {w_dn}_{n=1}^{N_d} denotes the words within document d, and a logistic classifier for considering the supervising signal y = {y_d}_{d=1}^D. Below, we introduce each of them in turn.

LDA: LDA is a hierarchical Bayesian model that posits each document as an admixture of K topics, where each topic Φ_k is a multinomial distribution over a V-word vocabulary. For document d, the generating process is:

1. draw a topic proportion θ_d ~ Dir(α)
2. for each word n = 1, 2, ..., N_d:
   (a) draw a topic assignment z_dn ~ Mult(θ_d), where z_dn is a K-dimensional binary vector with only one entry equal to 1
   (b) draw the word w_dn ~ Mult(Φ_{z_dn})

where Dir(·) is a Dirichlet distribution, Mult(·) is a multinomial distribution, and Φ_{z_dn} denotes the topic selected by the non-zero entry of z_dn. For fully-Bayesian LDA, the topics are random samples drawn from a Dirichlet prior, Φ_k ~ Dir(β).

Let z_d = {z_dn}_{n=1}^{N_d} denote the set of topic assignments for document d. Let Z = {z_d}_{d=1}^D and Θ = {θ_d}_{d=1}^D denote all the topic assignments and mixing proportions for the entire corpus. LDA infers the posterior distribution p(Θ, Z, Φ | W) ∝ p_0(Θ, Z, Φ) p(W | Z, Φ), where p_0(Θ, Z, Φ) = [∏_d p(θ_d | α) ∏_n p(z_dn | θ_d)] ∏_k p(Φ_k | β) is the joint distribution defined by the model.
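As a concrete illustration, the generating process above can be sketched in a few lines of NumPy; the corpus sizes and hyperparameter values below are made up for the example and are not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D = 4, 50, 3        # number of topics, vocabulary size, documents (toy sizes)
alpha, beta = 0.5, 0.1    # symmetric Dirichlet hyperparameters (illustrative values)

# Fully-Bayesian LDA: each topic Phi_k ~ Dir(beta) over the V-word vocabulary.
Phi = rng.dirichlet(np.full(V, beta), size=K)               # K x V topic-word matrix

docs = []
for d in range(D):
    N_d = int(rng.integers(20, 40))                         # document length
    theta_d = rng.dirichlet(np.full(K, alpha))              # 1. topic proportion ~ Dir(alpha)
    z_d = rng.choice(K, size=N_d, p=theta_d)                # 2a. topic assignments ~ Mult(theta_d)
    w_d = np.array([rng.choice(V, p=Phi[k]) for k in z_d])  # 2b. words ~ Mult(Phi_{z_dn})
    docs.append((w_d, z_d))
```

Posterior inference then reverses this process: given only the words w_d, it recovers the latent z_d, θ_d and Φ.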
As noticed in (Jiang et al., 2012), the posterior distribution by Bayes' rule is equivalent to the solution of an information-theoretical optimization problem

  min_{q(Θ, Z, Φ)} KL(q(Θ, Z, Φ) || p_0(Θ, Z, Φ)) - E_q[log p(W | Z, Φ)]
  s.t.: q(Θ, Z, Φ) ∈ P,   (1)

where KL(q || p) is the Kullback-Leibler divergence from q to p, and P is the space of probability distributions.

Logistic classifier: To consider binary supervising information, a logistic supervised topic model (e.g., sLDA) builds a logistic classifier using the topic representations as input features

  p(y = 1 | η, z) = exp(η⊤z̄) / (1 + exp(η⊤z̄)),   (2)

where z̄ is a K-vector with z̄_k = (1/N) ∑_{n=1}^N I(z_n^k = 1), and I(·) is an indicator function that equals 1 if the predicate holds and 0 otherwise. If the classifier weights η and topic assignments z are given, the prediction rule is

  ŷ | η, z = I(p(y = 1 | η, z) > 0.5) = I(η⊤z̄ > 0).   (3)

Since both η and Z are hidden variables, we propose to infer a posterior distribution q(η, Z) that minimizes the expected log-logistic loss

  R(q(η, Z)) = - ∑_d E_q[log p(y_d | η, z_d)],   (4)

which is a good surrogate loss for the expected misclassification loss, ∑_d E_q[I(ŷ | η, z_d ≠ y_d)], of a Gibbs classifier that randomly draws a model η from the posterior distribution and makes predictions (McAllester, 2003; Germain et al., 2009). In fact, this choice is motivated by the observation that the logistic loss has been widely used as a convex surrogate of the misclassification loss (Rosasco et al., 2004) in the task of fully observed binary classification. Also, note that the logistic classifier and the LDA likelihood are coupled by sharing the latent topic assignments z.
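A minimal sketch of the classifier side, Eqs. (2) and (3): compute the empirical topic frequencies z̄ and apply the prediction rule. The weights and topic assignments below are hypothetical toy values:

```python
import numpy as np

def zbar(z, K):
    """Empirical topic frequencies: zbar_k = (1/N) * sum_n I(z_n = k)."""
    return np.bincount(z, minlength=K) / len(z)

def predict(eta, z, K):
    """Logistic likelihood (2) and prediction rule (3): y_hat = I(eta^T zbar > 0)."""
    score = eta @ zbar(z, K)             # discriminant value eta^T zbar
    p = 1.0 / (1.0 + np.exp(-score))     # p(y = 1 | eta, z), Eq. (2)
    return int(score > 0), p

eta = np.array([1.5, -0.5, 0.0, -2.0])   # hypothetical classifier weights (K = 4)
z = np.array([0, 0, 1, 3, 0])            # hypothetical topic assignments (N = 5)
y_hat, p = predict(eta, z, K=4)          # zbar = [0.6, 0.2, 0.0, 0.2], score = 0.4
```

Thresholding p at 0.5 and thresholding the linear score at 0 are equivalent, which is why (3) has the two equal forms.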
The strong coupling makes it possible to learn a posterior distribution that can describe the observed words well and make accurate predictions.

Regularized Bayesian inference: To integrate the above two components for hybrid learning, a logistic supervised topic model solves the joint Bayesian inference problem

  min_{q(η, Θ, Z, Φ)} L(q(η, Θ, Z, Φ)) + c R(q(η, Z))   (5)
  s.t.: q(η, Θ, Z, Φ) ∈ P,

where L(q) = KL(q || p_0(η, Θ, Z, Φ)) - E_q[log p(W | Z, Φ)] is the objective for doing standard Bayesian inference with the classifier weights η; p_0(η, Θ, Z, Φ) = p_0(η) p_0(Θ, Z, Φ); and c is a regularization parameter balancing the influence from the response variables and the words. In general, we define the pseudo-likelihood for the supervision information as

  ψ(y_d | z_d, η) = p^c(y_d | η, z_d) = {exp(η⊤z̄_d)}^{c y_d} / (1 + exp(η⊤z̄_d))^c,   (6)

which is unnormalized if c ≠ 1. But, as we shall see, this lack of normalization does not affect our subsequent inference. Then, the generalized inference problem (5) of logistic supervised topic models can be written in the "standard" Bayesian inference form (1):

  min_{q(η, Θ, Z, Φ)} L(q(η, Θ, Z, Φ)) - E_q[log ψ(y | Z, η)]   (7)
  s.t.: q(η, Θ, Z, Φ) ∈ P,

where ψ(y | Z, η) = ∏_d ψ(y_d | z_d, η). It is easy to show that the optimum solution of problem (5), or the equivalent problem (7), is the posterior distribution with supervising information, i.e.,

  q(η, Θ, Z, Φ) = p_0(η, Θ, Z, Φ) p(W | Z, Φ) ψ(y | η, Z) / φ(y, W),

where φ(y, W) is the normalization constant that makes q a distribution. We can see that when c = 1, the model reduces to the standard sLDA, which in practice has the imbalance issue that the response variable (which can be viewed as one additional word) is usually dominated by the words. This imbalance was noticed in (Halpern et al., 2012).
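In log form, the pseudo-likelihood (6) is simply the sLDA response log-likelihood scaled by c: writing ω_d = η⊤z̄_d, log ψ(y_d | z_d, η) = c [y_d ω_d - log(1 + e^{ω_d})]. A small sketch (the values of ω and c are arbitrary):

```python
import math

def log_pseudo_likelihood(y, omega, c):
    """log psi(y_d | z_d, eta) from Eq. (6): c * (y * omega - log(1 + e^omega)),
    where omega = eta^T zbar_d. Unnormalized as a likelihood unless c = 1."""
    return c * (y * omega - math.log1p(math.exp(omega)))

omega, y = 0.4, 1
base = log_pseudo_likelihood(y, omega, c=1.0)   # standard sLDA log-likelihood
for c in (1.0, 9.0, 25.0):
    # the constant c raises the response likelihood to the power c,
    # re-weighting it against the word likelihood in the joint objective:
    assert math.isclose(log_pseudo_likelihood(y, omega, c), c * base)
```

This makes the role of c concrete: in the joint objective, one response term now weighs as much as c "words", counteracting the imbalance noted above.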
We will see that c can make a big difference later.

Comparison with MedLDA: The above formulation of logistic supervised topic models as an instance of regularized Bayesian inference provides a direct comparison with the max-margin supervised topic model (MedLDA) (Jiang et al., 2012), which has the same form of optimization problem. The difference lies in the posterior regularization, for which MedLDA uses a hinge loss of an expected classifier while the logistic supervised topic model uses an expected log-logistic loss. Gibbs MedLDA (Zhu et al., 2013a) is another max-margin model that adopts the expected hinge loss as posterior regularization. As we shall see in the experiments, by using appropriate regularization constants, logistic supervised topic models achieve performance comparable to max-margin methods. We note that the relationship between the logistic loss and the hinge loss has been discussed extensively in various settings (Rosasco et al., 2004; Globerson et al., 2007). But the presence of latent variables poses additional challenges in carrying out a formal theoretical analysis of these surrogate losses (Lin, 2001) in the topic model setting.

2.2 Variational Approximation Algorithms

The commonly used normal prior for η is non-conjugate to the logistic likelihood, which makes posterior inference hard. Moreover, the latent variables Z make the inference problem harder than that of Bayesian logistic regression models (Chen et al., 1999; Meyer and Laud, 2002; Polson et al., 2012). Previous algorithms to solve problem (5) rely on variational approximation techniques.
It is easy to show that the variational method of (Wang et al., 2009) is a coordinate descent algorithm to solve problem (5) with the additional fully-factorized constraint q(η, Θ, Z, Φ) = q(η) (∏_d q(θ_d) ∏_n q(z_dn)) ∏_k q(Φ_k) and a variational approximation to the expectation of the log-logistic likelihood, which is intractable to compute directly. Note that the non-Bayesian treatment of η as unknown parameters in (Wang et al., 2009) results in an EM algorithm, which still needs to make strict mean-field assumptions, together with a variational bound of the expectation of the log-logistic likelihood. In this paper, we consider the full Bayesian treatment, which can in principle consider prior distributions and infer the posterior covariance.

3 A Gibbs Sampling Algorithm

We now present a simple and efficient Gibbs sampling algorithm for the generalized Bayesian logistic supervised topic models.

3.1 Formulation with Data Augmentation

Since the logistic pseudo-likelihood ψ(y | Z, η) is not conjugate to normal priors, it is not easy to derive the sampling algorithms directly. Instead, we develop our algorithms by introducing auxiliary variables, which lead to a scale mixture of Gaussian components and analytic conditional distributions for automatic Bayesian inference without an accept/reject ratio. Our algorithm represents a first attempt to extend Polson's approach (Polson et al., 2012) to deal with highly non-trivial Bayesian latent variable models. Let us first introduce the Polya-Gamma variables.

Definition 1 (Polson et al., 2012): A random variable X has a Polya-Gamma distribution, denoted by X ~ PG(a, b), if

  X = (1 / (2π²)) ∑_{i=1}^∞ g_i / ((i - 1/2)² + b² / (4π²)),

where a, b > 0 and each g_i ~ Gamma(a, 1) is an independent Gamma random variable.

Let ω_d = η⊤z̄_d.
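Definition 1 suggests a naive way to draw approximate PG samples by truncating the infinite sum; the sketch below does exactly that, and checks by Monte Carlo the known mean E[PG(a, 0)] = a/4 (which follows from E[g_i] = a and ∑_i (i - 1/2)^{-2} = π²/2). This is illustrative only; the exact sampler of Polson et al. (2012), used in this paper, is far more efficient:

```python
import numpy as np

def sample_pg_truncated(a, b, n_terms=200, rng=None):
    """Approximate draw from PG(a, b) by truncating the infinite sum in
    Definition 1: X = (1 / (2 pi^2)) * sum_i g_i / ((i - 1/2)^2 + b^2 / (4 pi^2)),
    with g_i ~ Gamma(a, 1) i.i.d."""
    rng = rng or np.random.default_rng()
    i = np.arange(1, n_terms + 1)
    g = rng.gamma(shape=a, scale=1.0, size=n_terms)
    denom = (i - 0.5) ** 2 + b ** 2 / (4.0 * np.pi ** 2)
    return float(np.sum(g / denom) / (2.0 * np.pi ** 2))

# Crude Monte Carlo check of the a/4 mean for PG(1, 0):
rng = np.random.default_rng(0)
draws = [sample_pg_truncated(1.0, 0.0, rng=rng) for _ in range(2000)]
```

The same sum with b > 0 covers the tilted PG(a, b) class needed later in Eq. (10), since Definition 1 is stated for general b.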
Then, using the ideas of data augmentation (Tanner and Wong, 1987; Polson et al., 2012), we can show that the generalized pseudo-likelihood can be expressed as

  ψ(y_d | z_d, η) = (1 / 2^c) e^{κ_d ω_d} ∫_0^∞ exp(-λ_d ω_d² / 2) p(λ_d | c, 0) dλ_d,

where κ_d = c(y_d - 1/2) and λ_d is a Polya-Gamma variable with parameters a = c and b = 0. This result indicates that the posterior distribution of the generalized Bayesian logistic supervised topic models, i.e., q(η, Θ, Z, Φ), can be expressed as the marginal of a higher-dimensional distribution that includes the augmented variables λ. The complete posterior distribution is

  q(η, λ, Θ, Z, Φ) = p_0(η, Θ, Z, Φ) p(W | Z, Φ) φ(y, λ | Z, η) / φ(y, W),

where the pseudo-joint distribution of y and λ is

  φ(y, λ | Z, η) = ∏_d exp(κ_d ω_d - λ_d ω_d² / 2) p(λ_d | c, 0).

3.2 Inference with Collapsed Gibbs Sampling

Although we can do Gibbs sampling to infer the complete posterior distribution q(η, λ, Θ, Z, Φ), and thus q(η, Θ, Z, Φ) by ignoring λ, the mixing rate would be slow due to the large sample space. One way to effectively improve the mixing rate is to integrate out the intermediate variables (Θ, Φ) and build a Markov chain whose equilibrium distribution is the marginal distribution q(η, λ, Z). We propose to use collapsed Gibbs sampling, which has been successfully used in LDA (Griffiths and Steyvers, 2004).
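The augmentation identity above can be checked numerically: for λ_d ~ PG(c, 0), the integral has the closed form E[exp(-λ_d t)] = cosh^{-c}(√(t/2)) (Polson et al., 2012), so with t = ω_d²/2 the right-hand side becomes 2^{-c} e^{κ_d ω_d} cosh^{-c}(ω_d/2). A sketch comparing the two sides of the identity, with arbitrary test values of ω and c:

```python
import math

def log_psi(y, omega, c):
    """Left-hand side: generalized pseudo-likelihood (6),
    psi = exp(c * y * omega) / (1 + exp(omega))^c, in log form."""
    return c * (y * omega - math.log1p(math.exp(omega)))

def log_psi_augmented(y, omega, c):
    """Right-hand side with the PG(c, 0) integral in closed form:
    E[exp(-lambda * t)] = cosh(sqrt(t / 2))^(-c), and t = omega^2 / 2 gives
    (1 / 2^c) * exp(kappa * omega) * cosh(omega / 2)^(-c), kappa = c (y - 1/2)."""
    kappa = c * (y - 0.5)
    return -c * math.log(2.0) + kappa * omega - c * math.log(math.cosh(omega / 2.0))

# The two sides agree for any response y, any omega = eta^T zbar_d, and any c:
for y in (0, 1):
    for omega in (-2.0, 0.0, 0.7, 3.0):
        assert math.isclose(log_psi(y, omega, 4.0), log_psi_augmented(y, omega, 4.0))
```

The point of the augmentation is that, conditioned on λ_d, the exponent is quadratic in ω_d = η⊤z̄_d, so η becomes conditionally Gaussian.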
For our model, the collapsed posterior distribution is

  q(η, λ, Z) ∝ p_0(η) p(W, Z | α, β) φ(y, λ | Z, η)
             = p_0(η) ∏_{k=1}^K [δ(C_k + β) / δ(β)] ∏_{d=1}^D [δ(C_d + α) / δ(α)] exp(κ_d ω_d - λ_d ω_d² / 2) p(λ_d | c, 0),

where δ(x) = ∏_{i=1}^{dim(x)} Γ(x_i) / Γ(∑_{i=1}^{dim(x)} x_i); C_k^t is the number of times term t is assigned to topic k over the whole corpus, with C_k = {C_k^t}_{t=1}^V; and C_d^k is the number of times that terms are associated with topic k within the d-th document, with C_d = {C_d^k}_{k=1}^K. Then, the conditional distributions used in collapsed Gibbs sampling are as follows.

For η: For the commonly used isotropic Gaussian prior p_0(η) = ∏_k N(η_k; 0, ν²), we have

  q(η | Z, λ) ∝ p_0(η) ∏_d exp(κ_d ω_d - λ_d ω_d² / 2) = N(η; μ, Σ),   (8)

where the posterior mean is μ = Σ (∑_d κ_d z̄_d) and the covariance is Σ = ((1/ν²) I + ∑_d λ_d z̄_d z̄_d⊤)^{-1}. We can easily draw a sample from this K-dimensional multivariate Gaussian distribution. The inverse can be robustly computed using a Cholesky decomposition, an O(K³) procedure. Since K is normally not large, the inversion can be done efficiently.

For Z: The conditional distribution of Z is

  q(Z | η, λ) ∝ ∏_{k=1}^K [δ(C_k + β) / δ(β)] ∏_{d=1}^D [δ(C_d + α) / δ(α)] exp(κ_d ω_d - λ_d ω_d² / 2).

By canceling common factors, we can derive the local conditional of one variable z_dn as

  q(z_dn^k = 1 | Z_¬, η, λ, w_dn = t) ∝ [(C_{k,¬n}^t + β_t)(C_{d,¬n}^k + α_k) / (∑_t C_{k,¬n}^t + ∑_{t=1}^V β_t)] exp(γ κ_d η_k - λ_d [γ² η_k² + 2γ(1-γ) η_k Λ_dn^k] / 2),   (9)

where C_{·,¬n} indicates that term n is excluded from the corresponding document or topic; γ = 1/N_d; and Λ_dn^k = (1/(N_d - 1)) ∑_{k′} η_{k′} C_{d,¬n}^{k′} is the discriminant function value without word n. We can see that the first term comes from the LDA model for the observed word counts and the second term comes from the supervising signal y.

For λ: Finally, the conditional distribution of the augmented variables λ is

  q(λ_d | Z, η) ∝ exp(-λ_d ω_d² / 2) p(λ_d | c, 0) = PG(λ_d; c, ω_d),   (10)

which is a Polya-Gamma distribution. The equality is achieved by using the construction of the general PG(a, b) class through an exponential tilting of the PG(a, 0) density (Polson et al., 2012). To draw samples from the Polya-Gamma distribution, we adopt the efficient method proposed in (Polson et al., 2012), whose basic sampler is implemented in the R package BayesLogit; it draws the samples by sampling from the closely related exponentially tilted Jacobi distribution.

With the above conditional distributions, we can construct a Markov chain that iteratively draws samples of η using Eq. (8), Z using Eq. (9) and λ using Eq. (10), with an initial condition. In our experiments, we initially set λ = 1 and randomly draw Z from a uniform distribution. In training, we run the Markov chain for M iterations (i.e., the burn-in stage), as outlined in Algorithm 1. Then, we draw a sample η̂ as the final classifier to make predictions on testing data. As we shall see, the Markov chain converges to stable prediction performance within a few burn-in iterations.

Algorithm 1: Collapsed Gibbs sampling
 1: Initialization: set λ = 1 and randomly draw z_dn from a uniform distribution.
 2: for m = 1 to M do
 3:   draw a classifier η from the distribution (8)
 4:   for d = 1 to D do
 5:     for each word n in document d do
 6:       draw the topic using distribution (9)
 7:     end for
 8:     draw λ_d from distribution (10)
 9:   end for
10: end for
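As an illustrative sketch (not the paper's C++ implementation), the two non-trivial conditionals (8) and (9) of Algorithm 1 can be written as follows; the count arrays, κ_d, and the Gamma stand-ins for the Polya-Gamma draws are made-up toy inputs:

```python
import numpy as np

def draw_eta(zbar, kappa, lam, nu2=1.0, rng=None):
    """Sample eta ~ N(mu, Sigma) from Eq. (8):
    Sigma = (I/nu^2 + sum_d lam_d zbar_d zbar_d^T)^{-1}, mu = Sigma sum_d kappa_d zbar_d.
    The K x K precision matrix is handled by Cholesky, an O(K^3) step."""
    rng = rng or np.random.default_rng()
    K = zbar.shape[1]
    precision = np.eye(K) / nu2 + (lam[:, None] * zbar).T @ zbar
    mu = np.linalg.solve(precision, zbar.T @ kappa)
    L = np.linalg.cholesky(precision)    # precision = L L^T, so Sigma = L^{-T} L^{-1}
    # solving L^T x = eps with eps ~ N(0, I) gives x ~ N(0, Sigma):
    return mu + np.linalg.solve(L.T, rng.standard_normal(K))

def draw_z_dn(Ct_k, C_k, Cd_k, eta, kappa_d, lam_d, N_d, alpha, beta_t, beta_sum, rng):
    """Sample one topic assignment from the local conditional (9). The count
    arrays (term-topic Ct_k, topic totals C_k, document-topic Cd_k) are assumed
    to already exclude word n; beta_t is the prior weight of the observed term."""
    gamma = 1.0 / N_d
    Lam = (eta @ Cd_k) / (N_d - 1)       # discriminant value without word n
    lda_part = (Ct_k + beta_t) * (Cd_k + alpha) / (C_k + beta_sum)
    sup_part = np.exp(gamma * kappa_d * eta
                      - lam_d * (gamma**2 * eta**2 + 2*gamma*(1-gamma)*eta*Lam) / 2)
    p = lda_part * sup_part
    return int(rng.choice(len(eta), p=p / p.sum()))

# Toy usage with hypothetical counts (K = 3 topics, D = 10 documents):
rng = np.random.default_rng(0)
zbar = rng.dirichlet(np.ones(3), size=10)   # per-document topic frequencies
kappa = rng.choice([-0.5, 0.5], size=10)    # kappa_d = c (y_d - 1/2) with c = 1
lam = rng.gamma(1.0, 0.25, size=10)         # stand-ins for Polya-Gamma draws
eta = draw_eta(zbar, kappa, lam, rng=rng)
k_new = draw_z_dn(Ct_k=np.array([2.0, 0.0, 1.0]), C_k=np.array([20.0, 15.0, 10.0]),
                  Cd_k=np.array([4.0, 3.0, 2.0]), eta=eta, kappa_d=0.5, lam_d=0.3,
                  N_d=10, alpha=np.full(3, 0.5), beta_t=0.01, beta_sum=0.07, rng=rng)
```

In the real sampler, λ_d would come from the exact Polya-Gamma draw of Eq. (10), and the count arrays would be maintained incrementally as topics are reassigned.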
We implemented the sampling algorithm in C++, together with our topic model sampler.

3.3 Prediction

To apply the classifier η̂ to testing data, we need to infer the topic assignments of the test documents. We take the approach of (Zhu et al., 2012; Jiang et al., 2012), which uses a point estimate of the topics Φ from the training data and makes predictions based on it. Specifically, we use the MAP estimate Φ̂ to replace the probability distribution p(Φ). For the Gibbs sampler, an estimate of Φ̂ using the samples is φ̂_kt ∝ C_k^t + β_t. Then, given a testing document w, we infer its latent components z using Φ̂ as p(z_n = k | z_¬n) ∝ φ̂_{k w_n} (C_¬n^k + α_k), where C_¬n^k is the number of times that the terms in this document w are assigned to topic k, with the n-th term excluded.

4 Experiments

We present empirical results and a sensitivity analysis to demonstrate the efficiency and prediction performance [3] of the generalized logistic supervised topic models on the 20Newsgroups (20NG) data set, which contains about 20,000 postings within 20 news groups. We follow the same setting as in (Zhu et al., 2012) and remove a standard list of stop words for both binary and multi-class classification. For all the experiments, we use the standard normal prior p_0(η) (i.e., ν² = 1) and the symmetric Dirichlet priors α = (α/K) 1 and β = 0.01 × 1, where 1 is a vector with all entries being 1. For each setting, we report the average performance and the standard deviation over five randomly initialized runs.

4.1 Binary classification

Following the same setting as in (Lacoste-Julien et al., 2009; Zhu et al., 2012), the task is to distinguish postings of the newsgroup alt.atheism from postings of the group talk.religion.misc. The training set contains 856 documents and the test set contains 569 documents.
We compare the generalized logistic supervised LDA using Gibbs sampling (denoted by gSLDA) with various competitors, including the standard sLDA using variational mean-field methods (denoted by vSLDA) (Wang et al., 2009), the MedLDA model using variational mean-field methods (denoted by vMedLDA) (Zhu et al., 2012), and the MedLDA model using collapsed Gibbs sampling (denoted by gMedLDA) (Jiang et al., 2012). We also include the unsupervised LDA using collapsed Gibbs sampling as a baseline, denoted by gLDA. For gLDA, we learn a binary linear SVM on its topic representations using SVMLight (Joachims, 1999). The results of DiscLDA (Lacoste-Julien et al., 2009) and of a linear SVM on raw bag-of-words features were reported in (Zhu et al., 2012). For gSLDA, we compare two versions: the standard sLDA with c = 1 and the sLDA with a well-tuned c value; to distinguish them, we denote the latter by gSLDA+. We set c = 25 for gSLDA+, and set α = 1 and M = 100 for both gSLDA and gSLDA+. As we shall see, gSLDA is insensitive to α, c and M in a wide range.

[3] Due to the space limit, the topic visualization (similar to that of MedLDA) is deferred to a longer version.

[Figure 1: Accuracy, training time (in log-scale) and testing time on the 20NG binary data set; panels (a)-(c) compare gSLDA, gSLDA+, vSLDA, vMedLDA, gMedLDA and gLDA+SVM for 5 to 30 topics.]

Fig. 1 shows the performance of the different methods with various numbers of topics.
For accuracy, we can draw two conclusions: 1) without making restricting assumptions on the posterior distributions, gSLDA achieves higher accuracy than vSLDA, which uses a strict variational mean-field approximation; and 2) by using the regularization constant c to improve the influence of the supervision information, gSLDA+ achieves much better classification results, in fact comparable with those of the MedLDA models, since they have a similar mechanism for improving the influence of supervision by tuning a regularization constant. The fact that gLDA+SVM performs better than the standard gSLDA is due to the same reason: the SVM part of gLDA+SVM can well capture the supervision information to learn a classifier for good prediction, while the standard sLDA cannot well balance the influence of supervision. In contrast, the well-balanced gSLDA+ model successfully outperforms the two-stage approach, gLDA+SVM, by performing topic discovery and prediction jointly. (The variational sLDA with a well-tuned c is significantly better than the standard sLDA, but a bit inferior to gSLDA+.)

For training time, both gSLDA and gSLDA+ are very efficient, e.g., about 2 orders of magnitude faster than vSLDA and about 1 order of magnitude faster than vMedLDA. For testing time, gSLDA and gSLDA+ are comparable with gMedLDA and the unsupervised gLDA, but faster than the variational vMedLDA and vSLDA, especially when K is large.

4.2 Multi-class classification

We perform multi-class classification on the 20NG data set with all the 20 categories. For multi-class classification, one possible extension is to use a multinomial logistic regression model for categorical variables Y, using the topic representations z̄ as input features.
However, it is non-trivial to develop a Gibbs sampling algorithm using a similar data augmentation idea, due to the presence of latent variables and the nonlinearity of the soft-max function. In fact, this is harder than multinomial Bayesian logistic regression, which can be done via a coordinate strategy (Polson et al., 2012). Here, we apply the binary gSLDA to do multi-class classification, following the "one-vs-all" strategy, which has been shown to be effective (Rifkin and Klautau, 2004), to provide some preliminary analysis. Namely, we learn 20 binary gSLDA models and aggregate their predictions by taking the most likely ones as the final predictions. We again evaluate two versions of gSLDA: the standard gSLDA with c = 1 and the improved gSLDA+ with a well-tuned c value. Since gSLDA is also insensitive to α and c for the multi-class task, we set α = 5.6 for both gSLDA and gSLDA+, and set c = 256 for gSLDA+. The number of burn-in iterations is set to M = 40, which is sufficiently large to give stable results, as we shall see.

[Figure 2: Multi-class classification; panels (a) accuracy and (b) training time (in log-scale) compare gSLDA, gSLDA+, vSLDA, vMedLDA, gMedLDA, gLDA+SVM and, for training time, parallel-gSLDA and parallel-gSLDA+ for 20 to 110 topics.]

Fig. 2 shows the accuracy and training time. We can see that: 1) by using Gibbs sampling without restricting assumptions, gSLDA performs better than the variational vSLDA that uses a strict mean-field approximation; 2) due to the imbalance between the single supervision and a large set of word counts, gSLDA does not outperform the decoupled approach, gLDA+SVM; and 3) if we increase the value of the regularization constant c, the supervision information can be better captured to infer predictive topic representations, and gSLDA+ performs much better than gSLDA. In fact, gSLDA+ is even better than the MedLDA that uses mean-field approximation, and is comparable with the MedLDA using collapsed Gibbs sampling. Finally, we should note that the improvement in accuracy might be partly due to the different strategies for building the multi-class classifiers. But given the performance gain in the binary task, we believe that the Gibbs sampling algorithm without factorization assumptions is the main factor for the improved performance.

For training time, the gSLDA models are about 10 times faster than the variational vSLDA. Table 1 shows in detail the percentages of the training time (the numbers in parentheses) spent at each sampling step for gSLDA+. We can see that: 1) sampling the global variables η is very efficient, while sampling the local variables (λ, Z) is much more expensive; and 2) the time for sampling λ is relatively stable as K increases, while sampling Z takes more time as K becomes larger. But the good news is that our Gibbs sampling algorithm can be easily parallelized to speed up the sampling of the local variables, following architectures similar to those used for LDA.

A Parallel Implementation: GraphLab is a graph-based programming framework for parallel computing (Gonzalez et al., 2012). It provides a high-level abstraction of parallel tasks by expressing data dependencies with a distributed graph.
Table 1: Split of training time over various steps (percentages of the total in parentheses).

           Sample λ             Sample η          Sample Z
  K = 20   2841.67 (65.80%)     7.70  (0.18%)     1455.25 (34.02%)
  K = 30   2417.95 (56.10%)     10.34 (0.24%)     1888.78 (43.66%)
  K = 40   2393.77 (49.00%)     14.66 (0.30%)     2476.82 (50.70%)
  K = 50   2161.09 (43.67%)     16.33 (0.33%)     2771.26 (56.00%)

GraphLab implements a GAS (gather, apply, scatter) model, where the data required to compute a vertex (edge) are gathered along its neighboring components, and modification of a vertex (edge) will trigger its adjacent components to recompute their values. Since GAS has been successfully applied to several machine learning algorithms [5], including Gibbs sampling of LDA, we choose it as a preliminary attempt to parallelize our Gibbs sampling algorithm. A systematic investigation of parallel computation with various architectures is interesting, but beyond the scope of this paper. For our task, since there is no coupling among the 20 binary gSLDA classifiers, we can learn them in parallel. This suggests an efficient hybrid multi-core/multi-machine implementation, which can avoid the time consumption of IPC (i.e., inter-process communication). Namely, we run our experiments on a cluster with 20 nodes, where each node is equipped with two 6-core CPUs (2.93 GHz). Each node is responsible for learning one binary gSLDA classifier with a parallel implementation on its 12 cores. For each binary gSLDA model, we construct a bipartite graph connecting the training documents with their corresponding terms.
The graph works as follows: 1) the edges contain the token counts and topic assignments; 2) the vertices contain individual topic counts and the augmented variables λ; 3) the global topic counts and η are aggregated from the vertices periodically, and the topic assignments and λ are sampled asynchronously during the GAS phases. Once started, sampling and signaling propagate over the graph. One thing to note is that since we cannot directly measure the number of iterations of an asynchronous model, we estimate it as the total number of topic samplings, which is again aggregated periodically, divided by the number of tokens. We denote the parallel models by parallel-gSLDA (c = 1) and parallel-gSLDA+ (c = 256). From Fig. 2(b), we can see that the parallel gSLDA models are about two orders of magnitude faster than their sequential counterpart models, which is very promising. Also, the prediction performance is not sacrificed, as we shall see in Fig. 4.

⁵ http://docs.graphlab.org/toolkits.html

4.3 Sensitivity analysis

Burn-in: Fig. 3 shows the performance of gSLDA+ with different burn-in steps for binary classification. When M = 0 (see the leftmost points), the models are built on random topic assignments. We can see that the classification performance increases quickly and converges to a stable optimum within about 20 burn-in steps. The training time increases roughly linearly when using more burn-in steps. Moreover, the training time increases linearly as K increases. In the previous experiments, we set M = 100.

[Figure 3: Performance of gSLDA+ with different burn-in steps for binary classification: (a) accuracy (train and test, K = 5, 10, 20); (b) training time in seconds. The leftmost points are for the settings with no burn-in.]

Fig. 4 shows the performance of gSLDA+ and its parallel implementation (i.e., parallel-gSLDA+) for multi-class classification with different burn-in steps. We can see that when the number of burn-in steps is larger than 20, the performance of gSLDA+ is quite stable. Again, since the slopes of the lines in Fig. 4(b) are close to the constant 1 in log-log scale, the training time grows roughly linearly as the number of burn-in steps increases. Even when we use 40 or 60 burn-in steps, the training time is still competitive with that of the variational vSLDA. For parallel-gSLDA+ using GraphLab, the training is consistently about two orders of magnitude faster. Meanwhile, the classification performance is also comparable with that of gSLDA+ when the number of burn-in steps is larger than 40. In the previous experiments, we set M = 40 for both gSLDA+ and parallel-gSLDA+.

[Figure 4: Performance of gSLDA+ and parallel-gSLDA+ with different burn-in steps for multi-class classification: (a) accuracy (K = 20, 30, 40, 50); (b) training time in seconds. The leftmost points are for the settings with no burn-in.]

Regularization constant c: Fig. 5 shows the performance of gSLDA in the binary classification task with different c values. We can see that the performance is quite stable over a wide range, e.g., for c from 9 to 100, for all three K values. But for the standard sLDA model, i.e., c = 1, both the training accuracy and test accuracy are low, which indicates that sLDA does not fit the supervision data well. When c becomes larger, the training accuracy gets higher, but the model does not seem to over-fit and the generalization performance is stable. In the above experiments, we set c = 25. For multi-class classification, we have similar observations and set c = 256 in the previous experiments.

[Figure 5: Performance of gSLDA for binary classification with different c values (x-axis: √c; K = 5, 10, 20): (a) accuracy (train and test); (b) training time in seconds.]

Dirichlet prior α: Fig. 6 shows the performance of gSLDA on the binary task with different α values. We report two cases, with c = 1 and c = 9. We can see that the performance is quite stable over a wide range of α values, e.g., from 0.1 to 10. We also note that changing α does not affect the training time much.

[Figure 6: Accuracy of gSLDA for binary classification with different α values (K = 5, 10, 15, 20) in two settings: (a) c = 1; (b) c = 9.]
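To make concrete what the constant c controls, the sketch below evaluates a c-weighted binary logistic log-likelihood: raising the per-document response likelihood to the power c multiplies its log-likelihood contribution by c, which is how the regularization constant rebalances the supervision term against the many word-count terms. The function name and parameterization are our own illustration, not the paper's code.

```python
import math

# Illustrative sketch: a c-weighted binary logistic log-likelihood.
# log p(y | omega)^c = c * (y * omega - log(1 + exp(omega)))
# for a binary label y in {0, 1} and a discriminant value omega.
def weighted_log_likelihood(y: int, omega: float, c: float) -> float:
    # log1p(exp(omega)) is the standard log(1 + e^omega) term of the
    # logistic likelihood; c simply scales the whole contribution.
    return c * (y * omega - math.log1p(math.exp(omega)))
```

With c = 1 this reduces to the standard sLDA response term; settings such as c = 25 (binary) or c = 256 (multi-class) scale the supervision signal by that factor, consistent with the stable accuracy observed for larger c above.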
5 Conclusions and Discussions

We present two improvements to Bayesian logistic supervised topic models, namely, a general formulation that introduces a regularization parameter to avoid model imbalance, and a highly efficient Gibbs sampling algorithm, free of restricting assumptions on the posterior distributions, obtained by exploring the idea of data augmentation. The algorithm can also be parallelized. Empirical results for both binary and multi-class classification demonstrate significant improvements over the existing logistic supervised topic models. Our preliminary results with GraphLab have shown promise for parallelizing the Gibbs sampling algorithm. For future work, we plan to carry out more careful investigations, e.g., using various distributed architectures (Ahmed et al., 2012; Newman et al., 2009; Smola and Narayanamurthy, 2010), to make the sampling algorithm highly scalable to massive data corpora. Moreover, the data augmentation technique can be applied to deal with other types of response variables, such as count data with a negative-binomial likelihood (Polson et al., 2012).

Acknowledgments

This work is supported by National Key Foundation R&D Projects (Nos. 2013CB329403, 2012CB316301), Tsinghua Initiative Scientific Research Program No. 20121088071, Tsinghua National Laboratory for Information Science and Technology, and the 221 Basic Research Plan for Young Faculties at Tsinghua University.

References

[Ahmed et al.2012] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. Smola. 2012. Scalable inference in latent variable models. In International Conference on Web Search and Data Mining (WSDM).

[Blei and McAuliffe2010] D.M. Blei and J.D. McAuliffe. 2010. Supervised topic models. arXiv:1003.0783v1.

[Blei et al.2003] D.M. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet allocation. JMLR, 3:993–1022.
[Chen et al.1999] M. Chen, J. Ibrahim, and C. Yiannoutsos. 1999. Prior elicitation, variable selection and Bayesian computation for logistic regression models. Journal of Royal Statistical Society, Ser. B, (61):223–242.

[Germain et al.2009] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. 2009. PAC-Bayesian learning of linear classifiers. In International Conference on Machine Learning (ICML), pages 353–360.

[Globerson et al.2007] A. Globerson, T. Koo, X. Carreras, and M. Collins. 2007. Exponentiated gradient algorithms for log-linear structured prediction. In ICML, pages 305–312.

[Gonzalez et al.2012] J.E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. 2012. PowerGraph: Distributed graph-parallel computation on natural graphs. In the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI).

[Griffiths and Steyvers2004] T.L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences (PNAS), pages 5228–5235.

[Halpern et al.2012] Y. Halpern, S. Horng, L. Nathanson, N. Shapiro, and D. Sontag. 2012. A comparison of dimensionality reduction techniques for unstructured clinical text. In ICML 2012 Workshop on Clinical Data Analysis.

[Holmes and Held2006] C. Holmes and L. Held. 2006. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145–168.

[Jiang et al.2012] Q. Jiang, J. Zhu, M. Sun, and E.P. Xing. 2012. Monte Carlo methods for maximum margin supervised topic models. In Advances in Neural Information Processing Systems (NIPS).

[Joachims1999] T. Joachims. 1999. Making large-scale SVM learning practical. MIT Press.

[Lacoste-Julien et al.2009] S. Lacoste-Julien, F. Sha, and M.I. Jordan. 2009. DiscLDA: Discriminative learning for dimensionality reduction and classification.
Advances in Neural Information Processing Systems (NIPS), pages 897–904.

[Lin2001] Y. Lin. 2001. A note on margin-based loss functions in classification. Technical Report No. 1044, University of Wisconsin.

[McAllester2003] D. McAllester. 2003. PAC-Bayesian stochastic model selection. Machine Learning, 51:5–21.

[Meyer and Laud2002] M. Meyer and P. Laud. 2002. Predictive variable selection in generalized linear models. Journal of the American Statistical Association, 97(459):859–871.

[Newman et al.2009] D. Newman, A. Asuncion, P. Smyth, and M. Welling. 2009. Distributed algorithms for topic models. Journal of Machine Learning Research (JMLR), (10):1801–1828.

[Polson et al.2012] N.G. Polson, J.G. Scott, and J. Windle. 2012. Bayesian inference for logistic models using Polya-Gamma latent variables. arXiv:1205.0310v1.

[Rifkin and Klautau2004] R. Rifkin and A. Klautau. 2004. In defense of one-vs-all classification. Journal of Machine Learning Research (JMLR), (5):101–141.

[Rosasco et al.2004] L. Rosasco, E. De Vito, A. Caponnetto, M. Piana, and A. Verri. 2004. Are loss functions all the same? Neural Computation, (16):1063–1076.

[Smola and Narayanamurthy2010] A. Smola and S. Narayanamurthy. 2010. An architecture for parallel topic models. Very Large Data Base (VLDB), 3(1-2):703–710.

[Tanner and Wong1987] M.A. Tanner and W.-H. Wong. 1987. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association (JASA), 82(398):528–540.

[van Dyk and Meng2001] D. van Dyk and X. Meng. 2001. The art of data augmentation. Journal of Computational and Graphical Statistics (JCGS), 10(1):1–50.

[Wang et al.2009] C. Wang, D.M. Blei, and F.F. Li. 2009. Simultaneous image classification and annotation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Zhu et al.2011] J. Zhu, N. Chen, and E.P. Xing.
2011. Infinite latent SVM for classification and multi-task learning. In Advances in Neural Information Processing Systems (NIPS), pages 1620–1628.

[Zhu et al.2012] J. Zhu, A. Ahmed, and E.P. Xing. 2012. MedLDA: Maximum margin supervised topic models. Journal of Machine Learning Research (JMLR), (13):2237–2278.

[Zhu et al.2013a] J. Zhu, N. Chen, H. Perkins, and B. Zhang. 2013a. Gibbs max-margin topic models with fast sampling algorithms. In International Conference on Machine Learning (ICML).

[Zhu et al.2013b] J. Zhu, N. Chen, and E.P. Xing. 2013b. Bayesian inference with posterior regularization and applications to infinite latent SVMs. arXiv:1210.1766v2.
