Discriminative Relational Topic Models
SUBMITTED TO IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, AUGUST 2013

Ning Chen, Jun Zhu, Member, IEEE, Fei Xia, and Bo Zhang

Abstract—Many scientific and engineering fields involve analyzing network data. For document networks, relational topic models (RTMs) provide a probabilistic generative process to describe both the link structure and document contents, and they have shown promise on predicting network structures and discovering latent topic representations. However, existing RTMs are limited by both restricted model expressiveness and an inability to deal with imbalanced network data. To expand the scope and improve the inference accuracy of RTMs, this paper presents three extensions: 1) unlike the common link likelihood with a diagonal weight matrix that allows same-topic interactions only, we generalize it to use a full weight matrix that captures all pairwise topic interactions and is applicable to asymmetric networks; 2) instead of doing standard Bayesian inference, we perform regularized Bayesian inference (RegBayes) with a regularization parameter to deal with the imbalanced link structure in common real networks and to improve the discriminative ability of the learned latent representations; and 3) instead of doing variational approximation with strict mean-field assumptions, we present collapsed Gibbs sampling algorithms for the generalized relational topic models by exploring data augmentation, without making restricting assumptions. Under the generic RegBayes framework, we carefully investigate two popular discriminative loss functions, namely, the logistic log-loss and the max-margin hinge loss.
Experimental results on several real network datasets demonstrate the significance of these extensions for improving prediction performance, and the time efficiency can be dramatically improved with a simple fast approximation method.

Index Terms—statistical network analysis, relational topic models, data augmentation, regularized Bayesian inference

• N. Chen†, J. Zhu†, F. Xia‡ and B. Zhang† are with the Department of Computer Science and Technology, National Lab of Information Science and Technology, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing, 100084 China. E-mail: †{ningchen, dcszj, dcszb}@mail.tsinghua.edu.cn, ‡xia.fei09@gmail.com.

1 INTRODUCTION

Many scientific and engineering fields involve analyzing large collections of data that can be well described by networks, where vertices represent entities and edges represent relationships or interactions between entities; such data include online social networks, communication networks, protein interaction networks, academic paper citation and coauthorship networks, etc. As the availability and scope of network data increase, statistical network analysis (SNA) has attracted a considerable amount of attention (see [17] for a comprehensive survey). Among the many tasks studied in SNA, link prediction [25], [4] is one of the most fundamental; it attempts to estimate the link structure of networks based on partially observed links and/or entity attributes (if they exist). Link prediction can provide useful predictive models for suggesting friends to social network users or citations to scientific articles.

Many link prediction methods have been proposed, including early work on designing good similarity measures [25] that are used to rank unobserved links, and work on learning supervised classifiers with well-conceived features [19], [26]. Though specific domain knowledge can be used to design effective feature representations, feature engineering is generally a labor-intensive process. In order to expand the scope and ease of applicability of machine learning methods, fast-growing interest has been devoted to learning feature representations from data [6]. Along this line, recent research on link prediction has focused on learning latent variable models, including both parametric [20], [21], [2] and nonparametric Bayesian methods [31], [41]. Though these methods can model the network structure well, little attention has been paid to accounting for observed attributes of the entities, such as the text contents of papers in a citation network or the contents of web pages in a hyperlinked network. One work that accounts for both text contents and network structure is the relational topic model (RTM) [8], an extension of latent Dirichlet allocation (LDA) [7] that predicts link structures among documents as well as discovering their latent topic structures.

Though powerful, existing RTMs have some assumptions that could limit their applicability and inference accuracy. First, RTMs define a symmetric link likelihood model with a diagonal weight matrix that allows same-topic interactions only, and the symmetric nature also makes RTMs unsuitable for asymmetric networks. Second, by performing standard Bayesian inference under a generative modeling process, RTMs do not explicitly deal with the common imbalance issue in real networks, which normally have only a few observed links while most entity pairs do not have links; the learned topic representations can thus be weak at predicting link structures.
Finally, RTMs and other variants [27] apply variational methods to estimate model parameters with mean-field assumptions [24], which are normally too restrictive to be realistic in practice.

To address the above limitations, this paper presents discriminative relational topic models, which consist of three extensions to improve RTMs:
1) we relax the symmetry assumption and define the generalized relational topic models (gRTMs) with a full weight matrix that allows all pairwise topic interactions and is more suitable for asymmetric networks;
2) we perform regularized Bayesian inference (RegBayes) [43], which introduces a regularization parameter to deal with the imbalance problem in common real networks;
3) we present a collapsed Gibbs sampling algorithm for gRTMs by exploring the classical ideas of data augmentation [11], [40], [14].

Our methods are quite generic, in the sense that we can use various loss functions to learn discriminative latent representations. In this paper, we particularly focus on two popular types of loss functions, namely, the logistic log-loss and the max-margin hinge loss. For the max-margin loss, the resulting max-margin RTMs are themselves new contributions to the field of statistical network analysis. For posterior inference, we present efficient Markov chain Monte Carlo (MCMC) methods for both types of loss functions by introducing auxiliary variables. Specifically, for the logistic log-loss, we introduce a set of Polya-Gamma random variables [34], one per training link, to derive an exact mixture representation of the logistic link likelihood; while for the max-margin hinge loss, we introduce a set of generalized inverse Gaussian variables [12] to derive a mixture representation of the corresponding unnormalized pseudo-likelihood.
Then, we integrate out the intermediate Dirichlet variables and derive the local conditional distributions for collapsed Gibbs sampling analytically. These "augment-and-collapse" algorithms are simple and efficient. More importantly, they do not make any restricting assumptions on the desired posterior distribution. Experimental results on several real networks demonstrate that these extensions are important and can significantly improve performance.

The rest of the paper is structured as follows. Section 2 summarizes related work. Section 3 presents the generalized RTMs with both the log-loss and the hinge loss. Section 4 presents the "augment-and-collapse" Gibbs sampling algorithms for both types of loss functions. Section 5 presents experimental results. Finally, Section 6 concludes and discusses future directions.

2 RELATED WORK

Probabilistic latent variable models, e.g., latent Dirichlet allocation (LDA) [7], have been widely developed for modeling link relationships between documents, as they have nice properties for dealing with missing attributes as well as discovering representative latent structures. For instance, RTMs [8] capture both text contents and network relations for document link prediction; Topic-Link LDA [27] performs topic modeling and author community discovery in one unified framework; Link-PLSA-LDA [32] combines probabilistic latent semantic analysis (PLSA) [23] and LDA into a single framework to explicitly model the topical relationships between documents. Others include Pairwise-Link-LDA [33], the Copycat and Citation Influence models [13], Latent Topic Hypertext Models (LTHM) [1], Block-LDA models [5], etc. One shared goal of the aforementioned models is link prediction. For static networks, our focus in this paper, this problem is usually formulated as inferring the missing links given the other observed ones.
However, very few works explicitly impose discriminative training, and many models suffer from the common imbalance issue in sparse networks (e.g., the number of unobserved links is much larger than that of the observed ones). In this paper, we build our approaches on the flexible framework of regularized Bayesian inference (RegBayes) [44], under which one can easily introduce posterior regularization and do discriminative training in a cost-sensitive manner.

Another under-addressed problem in most probabilistic topic models for link prediction [8], [27] is the intractability of posterior inference due to the non-conjugacy between the prior and the link likelihood (e.g., a logistic likelihood). Existing approaches using variational inference with mean-field assumptions are often too restrictive in practice. Recently, [34] and [35] showed that by making use of data augmentation, the intractable likelihood (either a logistic likelihood or the one induced from a hinge loss) can be expressed as the marginal of a higher-dimensional distribution with augmented variables, which leads to a scale mixture of Gaussian components. These strategies have been successfully explored to develop efficient Gibbs samplers for supervised topic models [45], [42]. This paper further explores data augmentation techniques to do collapsed Gibbs sampling for our discriminative relational topic models. Note that our methods can also be applied to many of the aforementioned relational latent variable models. Finally, this paper is a systematic generalization of the conference paper [9].

3 GENERALIZED RTMS

We consider document networks with binary link structures. Let $\mathcal{D} = \{(\mathbf{w}_i, \mathbf{w}_j, y_{ij})\}_{(i,j) \in \mathcal{I}}$ be a labeled training set, where $\mathbf{w}_i = \{w_{in}\}_{n=1}^{N_i}$ denotes the words within document $i$ and the response variable $y_{ij}$ takes values from the binary output space $\mathcal{Y} = \{0, 1\}$.
A relational topic model (RTM) consists of two parts: an LDA model [7] for describing the words $\mathbf{W} = \{\mathbf{w}_i\}_{i=1}^{D}$, and a classifier for considering the link structure $\mathbf{y} = \{y_{ij}\}_{(i,j) \in \mathcal{I}}$. Let $K$ be the number of topics; each topic $\Phi_k$ is a multinomial distribution over a $V$-word vocabulary. For Bayesian RTMs, the topics are samples drawn from a prior, e.g., $\Phi_k \sim \mathrm{Dir}(\beta)$, a Dirichlet distribution. The generating process can be described as:

1) For each document $i = 1, 2, \dots, D$:
   a) draw a topic mixing proportion $\theta_i \sim \mathrm{Dir}(\alpha)$;
   b) for each word $n = 1, 2, \dots, N_i$:
      i) draw a topic assignment $z_{in} \sim \mathrm{Mult}(\theta_i)$;
      ii) draw the observed word $w_{in} \sim \mathrm{Mult}(\Phi_{z_{in}})$.
2) For each pair of documents $(i, j) \in \mathcal{I}$:
   a) draw a link indicator $y_{ij} \sim p(\cdot \mid \mathbf{z}_i, \mathbf{z}_j, \eta)$, where $\mathbf{z}_i = \{z_{in}\}_{n=1}^{N_i}$.

We have used $\mathrm{Mult}(\cdot)$ to denote a multinomial distribution, and $\Phi_{z_{in}}$ to denote the topic selected by the non-zero entry of $z_{in}$, a $K$-dimensional binary vector with only one entry equal to 1.

TABLE 1: Learned diagonal weight matrix of the 10-topic RTM (all off-diagonal entries are 0.0) and representative words for each topic.

Topic | $\eta_k$ | Representative words
1 | 36.9 | learning, bound, PAC, hypothesis, algorithm
2 | −74.0 | numerical, solutions, extensions, approach, remark
3 | 34.8 | mixtures, experts, EM, Bayesian, probabilistic
4 | 44.1 | features, selection, case-based, networks, model
5 | 42.5 | planning, learning, acting, reinforcement, dynamic
6 | 41.1 | genetic, algorithm, evolving, evolutionary, learning
7 | −61.1 | plateau, feature, performance, sparse, networks
8 | 29.3 | modulo, schedule, parallelism, control, processor
9 | 41.3 | neural, cortical, networks, learning, feedforward
10 | 38.6 | markov, models, monte, carlo, Gibbs, sampler

TABLE 2: Learned weight matrix of the 10-topic gRTM (row $k$ lists the interactions $U_{k,1:10}$ of topic $k$ with topics 1 through 10) and representative words for each topic.

Topic | $U_{k,1:10}$ | Representative words
1 | 30.0, −5.6, −6.9, 2.9, −8.1, −8.9, −8.5, −10.5, −7.7, −11.2 | genetic, evolving, algorithm, coding, programming
2 | −4.5, 21.6, −4.8, 3.0, −5.0, −1.9, −9.7, −11.7, −5.7, −7.0 | logic, grammars, FOIL, EBG, knowledge, clauses
3 | −6.2, −3.4, 31.0, 5.3, −6.6, −8.4, −9.8, −9.4, −7.7, −8.7 | reinforcement, learning, planning, act, exploration
4 | 3.0, 1.8, 5.9, 16.7, 2.4, 7.4, 14.1, 5.8, −0.3, 2.7 | mixtures, EM, Bayesian, networks, learning, genetic
5 | −8.1, −3.6, −7.4, 3.6, 23.1, −8.5, −4.4, −6.9, −6.9, −9.1 | images, visual, scenes, mixtures, networks, learning
6 | −9.9, −0.5, −8.1, 6.9, −8.3, 29.6, −10.5, −10.3, −5.1, −7.9 | decision-tree, rules, induction, learning, features
7 | −8.6, −10.5, −9.4, 13.0, −3.4, −11.3, 22.3, −7.1, −11.5, −12.1 | wake-sleep, learning, networks, cortical, inhibition
8 | −10.8, −10.6, −9.9, 5.3, −7.6, −10.0, −6.9, 24.7, −7.1, −8.8 | monte, carlo, hastings, markov, chain, sampler
9 | −7.4, −5.3, −8.8, 1.3, −7.2, −6.4, −10.6, −8.0, 28.3, −8.3 | case-based, reasoning, CBR, event-based, cases
10 | −10.6, −9.2, −7.9, 3.7, −8.8, −8.2, −11.5, −7.9, −7.6, 30.0 | markov, learning, bayesian, networks, distributions
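For concreteness, the generative process above can be simulated end to end. The following is a minimal NumPy sketch, not the authors' code: the sizes, hyperparameters, and the randomly drawn link weights are illustrative assumptions, and the link step uses the logistic likelihood of Eq. (1) below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (hypothetical) sizes and hyperparameters.
D, K, V, N = 4, 3, 20, 50
alpha, beta = 0.1, 0.01

Phi = rng.dirichlet(np.full(V, beta), size=K)     # topics Phi_k ~ Dir(beta)
theta = rng.dirichlet(np.full(K, alpha), size=D)  # mixing proportions theta_i ~ Dir(alpha)

# z_in ~ Mult(theta_i), then w_in ~ Mult(Phi_{z_in})
z = np.array([rng.choice(K, size=N, p=theta[i]) for i in range(D)])
w = np.array([[rng.choice(V, p=Phi[k]) for k in zi] for zi in z])

# Average topic assignments zbar_i, then links via the logistic likelihood of Eq. (1):
zbar = np.array([np.bincount(zi, minlength=K) / N for zi in z])
eta = rng.normal(scale=5.0, size=K)               # diagonal link weights (random here)
logits = (zbar * eta) @ zbar.T                    # eta^T (zbar_i o zbar_j) for all pairs
y = (rng.random((D, D)) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
```

Swapping the diagonal `eta` for a full $K \times K$ matrix gives the gRTM link step of Section 3.1.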
Previous work has defined the link likelihood as
$$p(y_{ij} = 1 \mid \mathbf{z}_i, \mathbf{z}_j, \eta) = \sigma\big(\eta^\top (\bar{\mathbf{z}}_i \circ \bar{\mathbf{z}}_j)\big), \quad (1)$$
where $\bar{\mathbf{z}}_i = \frac{1}{N_i} \sum_{n=1}^{N_i} z_{in}$ is the average topic assignment of document $i$; $\sigma$ is the sigmoid function; and $\circ$ denotes the element-wise product. In [8], other choices of $\sigma$, such as the exponential function and the cumulative distribution function of the normal distribution, were also used; any choice works as long as it is a monotonically increasing function of the weighted inner product between $\bar{\mathbf{z}}_i$ and $\bar{\mathbf{z}}_j$. Here, we focus on the commonly used logistic likelihood model [31], [27], as none of the choices has shown consistently superior performance over the others.

3.1 The Full RTM Model

Since $\eta^\top (\bar{\mathbf{z}}_i \circ \bar{\mathbf{z}}_j) = \bar{\mathbf{z}}_i^\top \mathrm{diag}(\eta)\, \bar{\mathbf{z}}_j$, the standard RTM learns a diagonal weight matrix which captures same-topic interactions only (i.e., there is a non-zero contribution to the link likelihood only when documents $i$ and $j$ have the same topic). One example of the fitted diagonal matrix on the Cora citation network [8] is shown in Table 1, where each row corresponds to a topic and the representative words for the topic are shown on the right. Due to the positiveness of the latent features (i.e., $\bar{\mathbf{z}}_i$) and the competition between the diagonal entries, some of the $\eta_k$ have positive values while others are negative. The negative interactions conflict with our intuition about a citation network, where we would expect papers with the same topics to have citation links. Furthermore, by using a diagonal weight matrix, the model is symmetric, i.e., the probability of a link from document $i$ to $j$ is the same as the probability of a link from $j$ to $i$. The symmetry property does not hold for many networks, e.g., citation networks.
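The identity $\eta^\top(\bar{\mathbf{z}}_i \circ \bar{\mathbf{z}}_j) = \bar{\mathbf{z}}_i^\top \mathrm{diag}(\eta)\, \bar{\mathbf{z}}_j$ that motivates the full-matrix extension is easy to check numerically. A small sketch with hypothetical random values (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5
eta = rng.normal(size=K)
zi = rng.dirichlet(np.ones(K))  # zbar_i: average topic assignments lie on the simplex
zj = rng.dirichlet(np.ones(K))

# eta^T (zbar_i o zbar_j) from Eq. (1) equals zbar_i^T diag(eta) zbar_j:
lhs = eta @ (zi * zj)
rhs = zi @ np.diag(eta) @ zj
assert np.isclose(lhs, rhs)

# A full matrix U (Section 3.1) additionally captures cross-topic interactions and
# can score the two link directions differently when U is asymmetric:
U = rng.normal(size=(K, K))
print(zi @ U @ zj, zj @ U @ zi)  # generally different values for asymmetric U
```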
To make RTMs more expressive and applicable to asymmetric networks, the first simple extension is to define the link likelihood as
$$p(y_{ij} = 1 \mid \mathbf{z}_i, \mathbf{z}_j, U) = \sigma\big(\bar{\mathbf{z}}_i^\top U \bar{\mathbf{z}}_j\big), \quad (2)$$
using a full $K \times K$ weight matrix $U$. Using the algorithm to be presented, an example of the learned $U$ matrix on the same Cora citation network is shown in Table 2. We can see that by allowing all pairwise topic interactions, all the diagonal entries are positive, while most off-diagonal entries are negative. This is consistent with our intuition that documents with the same topics tend to have citation links, while documents with different topics are less likely to have citation links. We also note that there are some documents with generic topics (e.g., topic 4) that have positive link interactions with almost all others.

3.2 Regularized Bayesian Inference

Given $\mathcal{D}$, we let $\mathbf{Z} = \{\mathbf{z}_i\}_{i=1}^D$ and $\Theta = \{\theta_i\}_{i=1}^D$ denote all the topic assignments and mixing proportions, respectively. To fit RTM models, maximum likelihood estimation (MLE) has been used with an EM algorithm [8]. We consider Bayesian inference [21], [31] to get the posterior distribution
$$p(\Theta, \mathbf{Z}, \Phi, U \mid \mathcal{D}) \propto p_0(\Theta, \mathbf{Z}, \Phi, U)\, p(\mathcal{D} \mid \mathbf{Z}, \Phi, U),$$
where $p(\mathcal{D} \mid \mathbf{Z}, \Phi, U) = p(\mathbf{W} \mid \mathbf{Z}, \Phi)\, p(\mathbf{y} \mid \mathbf{Z}, U)$ is the likelihood of the observed data and $p_0(\Theta, \mathbf{Z}, \Phi, U) = p_0(U) \big[\prod_i p(\theta_i \mid \alpha) \prod_n p(z_{in} \mid \theta_i)\big] \prod_k p(\Phi_k \mid \beta)$ is the prior distribution defined by the model. One common issue with this estimation is that real networks are highly imbalanced: the number of positive links is much smaller than the number of negative links. For example, less than 0.1% of the document pairs in the Cora network have positive links.
To deal with this imbalance issue, we propose to do regularized Bayesian inference (RegBayes) [43], which offers extra freedom to handle the imbalance in a cost-sensitive manner. Specifically, we define a Gibbs classifier for binary links as follows.

1) A Latent Predictor: If the weight matrix $U$ and topic assignments $\mathbf{Z}$ are given, we build a classifier using the likelihood (2), with the latent prediction rule
$$\hat{y}_{ij} \mid \mathbf{z}_i, \mathbf{z}_j, U = \mathbb{I}\big(\bar{\mathbf{z}}_i^\top U \bar{\mathbf{z}}_j > 0\big), \quad (3)$$
where $\mathbb{I}(\cdot)$ is an indicator function that equals 1 if the predicate holds and 0 otherwise. The training error of this latent prediction rule is
$$\mathrm{Err}(U, \mathbf{Z}) = \sum_{(i,j) \in \mathcal{I}} \mathbb{I}\big(y_{ij} \neq \hat{y}_{ij} \mid \mathbf{z}_i, \mathbf{z}_j, U\big).$$
Since directly optimizing the training error is hard, a convex surrogate loss is commonly used in machine learning. Here, we consider two popular examples, namely, the logistic log-loss and the hinge loss:
$$\mathcal{R}_1(U, \mathbf{Z}) = -\sum_{(i,j) \in \mathcal{I}} \log p(y_{ij} \mid \mathbf{z}_i, \mathbf{z}_j, U),$$
$$\mathcal{R}_2(U, \mathbf{Z}) = \sum_{(i,j) \in \mathcal{I}} \max\big(0,\, \ell - \tilde{y}_{ij}\, \bar{\mathbf{z}}_i^\top U \bar{\mathbf{z}}_j\big),$$
where $\ell\ (\geq 1)$ is a cost parameter that penalizes a wrong prediction, and $\tilde{y}_{ij} = 2 y_{ij} - 1$ is a transformation of the $0/1$ binary links to $-1/+1$ for notational convenience.

2) Expected Loss: Since both $U$ and $\mathbf{Z}$ are hidden variables, we infer a posterior distribution $q(U, \mathbf{Z})$ that has the minimal expected loss
$$\mathcal{R}_1(q(U, \mathbf{Z})) = \mathbb{E}_q[\mathcal{R}_1(U, \mathbf{Z})], \quad (4)$$
$$\mathcal{R}_2(q(U, \mathbf{Z})) = \mathbb{E}_q[\mathcal{R}_2(U, \mathbf{Z})]. \quad (5)$$

Remark 1: Note that both loss functions $\mathcal{R}_1(U, \mathbf{Z})$ and $\mathcal{R}_2(U, \mathbf{Z})$ are convex in the parameters $U$ when the latent topics $\mathbf{Z}$ are fixed. The hinge loss is an upper bound of the training error, while the log-loss is not. Many comparisons have been done in the context of classification [36]. Our results will provide a careful comparison of these two loss functions in the context of relational topic models.
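The two surrogate losses and the prediction rule of Eq. (3) can be written down directly for a single link. A minimal sketch (function names are ours, not from the paper) that also checks the claim in Remark 1 that the hinge loss upper-bounds the 0/1 training error while the log-loss need not:

```python
import numpy as np

def log_loss(omega, y):
    # -log p(y | omega) for the logistic likelihood, where omega = zbar_i^T U zbar_j
    return np.log1p(np.exp(omega)) - y * omega

def hinge_loss(omega, y, ell=1.0):
    # max(0, ell - ytilde * omega) with ytilde = 2y - 1
    return max(0.0, ell - (2 * y - 1) * omega)

def zero_one_error(omega, y):
    # the latent prediction rule of Eq. (3): predict a link iff omega > 0
    return int((omega > 0) != (y == 1))

# With ell >= 1 the hinge loss dominates the 0/1 error on every link:
for omega in (-2.0, -0.1, 0.3, 1.5):
    for y in (0, 1):
        assert hinge_loss(omega, y) >= zero_one_error(omega, y)

# The log-loss can drop below 1 on a misclassified link near the boundary,
# so it is not an upper bound of the training error:
print(log_loss(-0.1, 1), zero_one_error(-0.1, 1))
```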
Remark 2: Both $\mathcal{R}_1(q(U, \mathbf{Z}))$ and $\mathcal{R}_2(q(U, \mathbf{Z}))$ are good surrogates for the expected link prediction error
$$\mathrm{Err}(q(U, \mathbf{Z})) = \mathbb{E}_q[\mathrm{Err}(U, \mathbf{Z})]$$
of a Gibbs classifier that randomly draws a model $U$ from the posterior distribution $q$ and makes predictions [28], [16]. The expected hinge loss $\mathcal{R}_2(q(U, \mathbf{Z}))$ is also an upper bound of $\mathrm{Err}(q(U, \mathbf{Z}))$.

With the above Gibbs classifiers, we define the generalized relational topic models (gRTMs) as solving the regularized Bayesian inference problem
$$\min_{q(U, \Theta, \mathbf{Z}, \Phi) \in \mathcal{P}} \mathcal{L}(q(U, \Theta, \mathbf{Z}, \Phi)) + c\, \mathcal{R}(q(U, \mathbf{Z})), \quad (6)$$
where $\mathcal{L}(q) = \mathrm{KL}\big(q(U, \Theta, \mathbf{Z}, \Phi)\, \|\, p_0(U, \Theta, \mathbf{Z}, \Phi)\big) - \mathbb{E}_q[\log p(\mathbf{W} \mid \mathbf{Z}, \Phi)]$ is an information-theoretic objective; $c$ is a positive regularization parameter controlling the influence of the link structure; and $\mathcal{P}$ is the space of normalized distributions. In fact, minimizing the single term $\mathcal{L}(q)$ results in the posterior distribution of vanilla LDA, without considering link information. For the second term, we have used $\mathcal{R}$ to denote a generic loss function, which can be either the log-loss $\mathcal{R}_1$ or the hinge loss $\mathcal{R}_2$ in this paper. Note that the Gibbs classifiers and the LDA likelihood are coupled by sharing the latent topic assignments $\mathbf{Z}$, and the strong coupling makes it possible to learn a posterior distribution that can describe the observed words well and make accurate predictions.

To better understand the above formulation, we define the unnormalized pseudo-likelihood^1 for links:
$$\psi_1(y_{ij} \mid \mathbf{z}_i, \mathbf{z}_j, U) = p^c(y_{ij} \mid \mathbf{z}_i, \mathbf{z}_j, U) = \frac{e^{c y_{ij} \omega_{ij}}}{(1 + e^{\omega_{ij}})^c}, \quad (7)$$
$$\psi_2(y_{ij} \mid \mathbf{z}_i, \mathbf{z}_j, U) = \exp\big(-2c \max(0,\, \ell - \tilde{y}_{ij} \omega_{ij})\big), \quad (8)$$

^1 Pseudo-likelihood has been used as an approximate maximum likelihood estimation procedure [39]. Here, we use it to denote an unnormalized likelihood of empirical data.
where $\omega_{ij} = \bar{\mathbf{z}}_i^\top U \bar{\mathbf{z}}_j$ is the discriminant function value. The pseudo-likelihood $\psi_1$ is unnormalized if $c \neq 1$. Then, the inference problem (6) can be written as
$$\min_{q(U, \Theta, \mathbf{Z}, \Phi) \in \mathcal{P}} \mathcal{L}(q(U, \Theta, \mathbf{Z}, \Phi)) - \mathbb{E}_q[\log \psi(\mathbf{y} \mid \mathbf{Z}, U)], \quad (9)$$
where $\psi(\mathbf{y} \mid \mathbf{Z}, U) = \prod_{(i,j) \in \mathcal{I}} \psi_1(y_{ij} \mid \mathbf{z}_i, \mathbf{z}_j, U)$ if using the log-loss and $\psi(\mathbf{y} \mid \mathbf{Z}, U) = \prod_{(i,j) \in \mathcal{I}} \psi_2(y_{ij} \mid \mathbf{z}_i, \mathbf{z}_j, U)$ if using the hinge loss. We can show that the optimal solution of problem (6), or the equivalent problem (9), is the posterior distribution with link information,
$$q(U, \Theta, \mathbf{Z}, \Phi) = \frac{p_0(U, \Theta, \mathbf{Z}, \Phi)\, p(\mathbf{W} \mid \mathbf{Z}, \Phi)\, \psi(\mathbf{y} \mid \mathbf{Z}, U)}{\phi(\mathbf{y}, \mathbf{W})},$$
where $\phi(\mathbf{y}, \mathbf{W})$ is the normalization constant that makes $q$ a normalized distribution. Therefore, by solving problem (6) or (9) we are in fact doing Bayesian inference with a generalized pseudo-likelihood, which is a powered version of the likelihood (2) in the case of the log-loss. The flexibility of choosing the regularization parameter can play a significant role in dealing with imbalanced network data, as we shall see in the experiments. For example, we can use a larger $c$ value for the sparse positive links, while using a smaller $c$ for the dense negative links. This simple strategy has been shown to be effective in learning classifiers [3] and link prediction models [41] with highly imbalanced data. Finally, for the logistic log-loss, an ad hoc generative story can be described as in RTMs, where $c$ can be understood as the pseudo-count of a link.

4 AUGMENT AND COLLAPSE SAMPLING

For gRTMs with either the log-loss or the hinge loss, exact posterior inference is intractable due to the non-conjugacy between the prior and the pseudo-likelihood. Previous inference methods for the standard RTMs use variational techniques with mean-field assumptions.
For example, a variational EM algorithm was developed in [8] with the factorization assumption that $q(U, \Theta, \mathbf{Z}, \Phi) = q(U) \prod_i q(\theta_i) \prod_n q(z_{in}) \prod_k q(\Phi_k)$, which can be too restrictive to be realistic in practice. In this section, we present simple and efficient Gibbs sampling algorithms without any restricting assumptions on $q$. Our "augment-and-collapse" sampling algorithms rely on a data augmentation reformulation of the RegBayes problem (9).

Before a full exposition of the algorithms, we summarize the high-level ideas. For the pseudo-likelihood $\psi(\mathbf{y} \mid \mathbf{Z}, U)$, it is not easy to derive a sampling algorithm directly. Instead, we develop our algorithms by introducing auxiliary variables, which lead to a scale mixture of Gaussian components and analytic conditional distributions for Bayesian inference without an accept/reject ratio. Below, we present the algorithms for the log-loss and the hinge loss in turn.

4.1 Sampling Algorithm for the Log-Loss

For the case with the log-loss, our algorithm represents an extension of Polson et al.'s approach [34] to the highly non-trivial Bayesian latent variable models for relational data analysis.

4.1.1 Formulation with Data Augmentation

Let us first introduce the Polya-Gamma variables [34].

Definition 3: A random variable $X$ has a Polya-Gamma distribution, denoted by $X \sim \mathcal{PG}(a, b)$, if
$$X = \frac{1}{2\pi^2} \sum_{m=1}^{\infty} \frac{g_m}{(m - 1/2)^2 + b^2/(4\pi^2)},$$
where $a > 0$ and $b \in \mathbb{R}$ are parameters and each $g_m \sim \mathcal{G}(a, 1)$ is an independent Gamma random variable.

Then, using the ideas of data augmentation [34], we have the following result.

Lemma 4: The pseudo-likelihood can be expressed as
$$\psi_1(y_{ij} \mid \mathbf{z}_i, \mathbf{z}_j, U) = \frac{1}{2^c}\, e^{\kappa_{ij} \omega_{ij}} \int_0^\infty e^{-\lambda_{ij} \omega_{ij}^2 / 2}\, p(\lambda_{ij} \mid c, 0)\, d\lambda_{ij},$$
where $\kappa_{ij} = c\,(y_{ij} - 1/2)$ and $\lambda_{ij}$ is a Polya-Gamma variable with parameters $a = c$ and $b = 0$.
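Definition 3 suggests an obvious, if naive, sampler: truncate the infinite sum of Gamma variables. The sketch below is our own illustration for sanity-checking against the known mean $\mathbb{E}[\mathcal{PG}(a, 0)] = a/4$; as noted later in Section 4.1.2, a practical implementation should instead use the exact method of [34].

```python
import numpy as np

def sample_pg_truncated(a, b, n_terms=2000, rng=None):
    """Approximate draw from PG(a, b) by truncating the sum-of-Gammas construction
    of Definition 3. Naive and biased by truncation; for illustration only."""
    rng = rng or np.random.default_rng()
    m = np.arange(1, n_terms + 1)
    g = rng.gamma(shape=a, scale=1.0, size=n_terms)  # independent g_m ~ Gamma(a, 1)
    return np.sum(g / ((m - 0.5) ** 2 + b ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

rng = np.random.default_rng(0)
draws = np.array([sample_pg_truncated(1.0, 0.0, rng=rng) for _ in range(3000)])
print(draws.mean())  # should be close to E[PG(1, 0)] = 1/4
```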
Lemma 4 indicates that the posterior distribution of the generalized Bayesian logistic relational topic models, i.e., $q(U, \Theta, \mathbf{Z}, \Phi)$, can be expressed as the marginal of a higher-dimensional distribution that includes the augmented variables $\boldsymbol{\lambda}$. The complete posterior distribution is
$$q(U, \boldsymbol{\lambda}, \Theta, \mathbf{Z}, \Phi) = \frac{p_0(U, \Theta, \mathbf{Z}, \Phi)\, p(\mathbf{W} \mid \mathbf{Z}, \Phi)\, \psi(\mathbf{y}, \boldsymbol{\lambda} \mid \mathbf{Z}, U)}{\phi(\mathbf{y}, \mathbf{W})},$$
where $\psi(\mathbf{y}, \boldsymbol{\lambda} \mid \mathbf{Z}, U) = \prod_{(i,j) \in \mathcal{I}} \exp\big(\kappa_{ij} \omega_{ij} - \frac{\lambda_{ij} \omega_{ij}^2}{2}\big)\, p(\lambda_{ij} \mid c, 0)$ is the joint pseudo-distribution^2 of $\mathbf{y}$ and $\boldsymbol{\lambda}$.

4.1.2 Inference with Collapsed Gibbs Sampling

Although we can do Gibbs sampling to infer the complete posterior $q(U, \boldsymbol{\lambda}, \Theta, \mathbf{Z}, \Phi)$ and thus $q(U, \Theta, \mathbf{Z}, \Phi)$ by ignoring $\boldsymbol{\lambda}$, the mixing rate would be slow due to the large sample space. An effective way to reduce the sample space and improve mixing is to integrate out the intermediate Dirichlet variables $(\Theta, \Phi)$ and build a Markov chain whose equilibrium distribution is the collapsed distribution $q(U, \boldsymbol{\lambda}, \mathbf{Z})$. Such a collapsed Gibbs sampling procedure has been successfully used in LDA [18]. For gRTMs, the collapsed posterior distribution is
$$q(U, \boldsymbol{\lambda}, \mathbf{Z}) \propto p_0(U)\, p(\mathbf{W}, \mathbf{Z} \mid \alpha, \beta)\, \psi(\mathbf{y}, \boldsymbol{\lambda} \mid \mathbf{Z}, U) = p_0(U) \prod_{k=1}^{K} \frac{\delta(\mathbf{C}_k + \beta)}{\delta(\beta)} \prod_{i=1}^{D} \frac{\delta(\mathbf{C}_i + \alpha)}{\delta(\alpha)} \times \prod_{(i,j) \in \mathcal{I}} \exp\Big(\kappa_{ij} \omega_{ij} - \frac{\lambda_{ij} \omega_{ij}^2}{2}\Big)\, p(\lambda_{ij} \mid c, 0),$$
where $\delta(\mathbf{x}) = \frac{\prod_{i=1}^{\dim(\mathbf{x})} \Gamma(x_i)}{\Gamma\big(\sum_{i=1}^{\dim(\mathbf{x})} x_i\big)}$; $C_k^t$ is the number of times the term $t$ is assigned to topic $k$ over the whole corpus, and $\mathbf{C}_k = \{C_k^t\}_{t=1}^V$; $C_i^k$ is the number of times that terms are associated with topic $k$ within the $i$-th document, and $\mathbf{C}_i = \{C_i^k\}_{k=1}^K$. Then, the conditional distributions used in collapsed Gibbs sampling are as follows.

^2 Not normalized appropriately.
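Before listing the conditionals, note that ratios such as $\delta(\mathbf{C}_k + \beta)/\delta(\beta)$ are best evaluated in log space with log-Gamma functions to avoid overflow. A minimal helper (the function name is ours):

```python
import math

def log_delta(x):
    # log delta(x) = sum_i log Gamma(x_i) - log Gamma(sum_i x_i),
    # i.e., the log-normalizer of a Dirichlet with parameter vector x
    return sum(math.lgamma(v) for v in x) - math.lgamma(sum(x))

# delta(C_k + beta)/delta(beta) is the factor left after integrating Phi_k out; e.g.
# for word counts C_k = [3, 1, 0] in topic k and a symmetric prior beta = 0.01:
log_ratio = log_delta([3 + 0.01, 1 + 0.01, 0 + 0.01]) - log_delta([0.01, 0.01, 0.01])
print(log_ratio)
```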
For $U$: for notational clarity, we define $\bar{\mathbf{z}}_{ij} = \mathrm{vec}(\bar{\mathbf{z}}_i \bar{\mathbf{z}}_j^\top)$ and $\eta = \mathrm{vec}(U)$, where $\mathrm{vec}(A)$ is a vector concatenating the row vectors of matrix $A$. Then, the discriminant function value is $\omega_{ij} = \eta^\top \bar{\mathbf{z}}_{ij}$. For the commonly used isotropic Gaussian prior $p_0(U) = \prod_{k,k'} \mathcal{N}(U_{kk'}; 0, \nu^2)$, i.e., $p_0(\eta) = \prod_{m=1}^{K^2} \mathcal{N}(\eta_m; 0, \nu^2)$, we have
$$q(\eta \mid \mathbf{Z}, \boldsymbol{\lambda}) \propto p_0(\eta) \prod_{(i,j) \in \mathcal{I}} \exp\Big(\kappa_{ij}\, \eta^\top \bar{\mathbf{z}}_{ij} - \frac{\lambda_{ij} (\eta^\top \bar{\mathbf{z}}_{ij})^2}{2}\Big) = \mathcal{N}(\eta; \boldsymbol{\mu}, \Sigma), \quad (10)$$
where $\Sigma = \big(\frac{1}{\nu^2} I + \sum_{(i,j) \in \mathcal{I}} \lambda_{ij}\, \bar{\mathbf{z}}_{ij} \bar{\mathbf{z}}_{ij}^\top\big)^{-1}$ and $\boldsymbol{\mu} = \Sigma\, \big(\sum_{(i,j) \in \mathcal{I}} \kappa_{ij}\, \bar{\mathbf{z}}_{ij}\big)$. Therefore, we can easily draw a sample from a $K^2$-dimensional multivariate Gaussian distribution. The inversion can be done robustly using Cholesky decomposition. Since $K$ is normally not large, the inversion is relatively efficient, especially when the number of documents is large. We will provide empirical analysis in the experiment section. Note that for large $K$ this step can be a practical limitation, but fortunately there are good parallel algorithms for Cholesky decomposition [15], which can be used for applications with large $K$ values.

For $\mathbf{Z}$: the conditional distribution of $\mathbf{Z}$ is
$$q(\mathbf{Z} \mid U, \boldsymbol{\lambda}) \propto \prod_{k=1}^{K} \frac{\delta(\mathbf{C}_k + \beta)}{\delta(\beta)} \prod_{i=1}^{D} \frac{\delta(\mathbf{C}_i + \alpha)}{\delta(\alpha)} \prod_{(i,j) \in \mathcal{I}} \psi_1(y_{ij} \mid \boldsymbol{\lambda}, \mathbf{Z}),$$
where $\psi_1(y_{ij} \mid \boldsymbol{\lambda}, \mathbf{Z}) = \exp\big(\kappa_{ij} \omega_{ij} - \frac{\lambda_{ij} \omega_{ij}^2}{2}\big)$. By canceling common factors, we can derive the local conditional of one variable $z_{in}$ given the others $\mathbf{Z}_\neg$ as
$$q(z_{in}^k = 1 \mid \mathbf{Z}_\neg, U, \boldsymbol{\lambda}, w_{in} = t) \propto \frac{(C_{k, \neg n}^t + \beta_t)(C_{i, \neg n}^k + \alpha_k)}{\sum_t C_{k, \neg n}^t + \sum_{t=1}^V \beta_t} \times \prod_{j \in \mathcal{N}_i^+} \psi_1(y_{ij} \mid \boldsymbol{\lambda}, \mathbf{Z}_\neg, z_{in}^k = 1) \times \prod_{j \in \mathcal{N}_i^-} \psi_1(y_{ji} \mid \boldsymbol{\lambda}, \mathbf{Z}_\neg, z_{in}^k = 1), \quad (11)$$
where $C_{\cdot, \neg n}^{\cdot}$ indicates that term $n$ is excluded from the corresponding document or topic, and $\mathcal{N}_i^+ = \{j : (i,j) \in \mathcal{I}\}$ and $\mathcal{N}_i^- = \{j : (j,i) \in \mathcal{I}\}$ denote the neighbors of document $i$ in the training network. For symmetric networks, $\mathcal{N}_i^+ = \mathcal{N}_i^-$, and only one part is sufficient.
We can see that the first term is from the LDA model for the observed word counts and the second term is from the link structure $\mathbf{y}$.

Algorithm 1: Collapsed Gibbs Sampling for Generalized RTMs with the Logistic Log-Loss
1: Initialization: set $\boldsymbol{\lambda} = 1$ and randomly draw $z_{dn}$ from a uniform distribution.
2: for $m = 1$ to $M$ do
3:   draw the classifier $\eta$ from the normal distribution (10)
4:   for $i = 1$ to $D$ do
5:     for each word $n$ in document $i$ do
6:       draw the topic using distribution (11)
7:     end for
8:   end for
9:   for $(i, j) \in \mathcal{I}$ do
10:    draw $\lambda_{ij}$ from distribution (12)
11:  end for
12: end for

For $\boldsymbol{\lambda}$: the conditional distribution of the augmented variables $\boldsymbol{\lambda}$ is a Polya-Gamma distribution,
$$q(\lambda_{ij} \mid \mathbf{Z}, U) \propto \exp\Big(-\frac{\lambda_{ij} \omega_{ij}^2}{2}\Big)\, p(\lambda_{ij} \mid c, 0) = \mathcal{PG}(\lambda_{ij}; c, \omega_{ij}). \quad (12)$$
The equality is achieved by using the construction of the general $\mathcal{PG}(a, b)$ class through an exponential tilting of the $\mathcal{PG}(a, 0)$ density [34]. To draw samples from the Polya-Gamma distribution, a naive implementation using the infinite sum-of-Gammas representation is not efficient, and it also involves a potentially inaccurate step of truncating the infinite sum. Here we adopt the method proposed in [34], which draws the samples from the closely related exponentially tilted Jacobi distribution.

With the above conditional distributions, we can construct a Markov chain which iteratively draws samples of $\eta$ (i.e., $U$) using Eq. (10), $\mathbf{Z}$ using Eq. (11), and $\boldsymbol{\lambda}$ using Eq. (12), as shown in Alg. 1, with an initial condition. In our experiments, we initially set $\boldsymbol{\lambda} = 1$ and randomly draw $\mathbf{Z}$ from a uniform distribution. In training, we run the Markov chain for $M$ iterations (i.e., the so-called burn-in stage). Then, we draw a sample $\hat{U}$ as the final classifier to make predictions on testing data. As we shall see, in practice the Markov chain converges to stable prediction performance within a few burn-in iterations.
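Step 3 of Algorithm 1, the draw from the Gaussian conditional of Eq. (10), can be sketched as follows. This is an illustrative NumPy implementation under our own naming (not the authors' code), using a Cholesky factor of the precision matrix rather than forming $\Sigma$ explicitly:

```python
import numpy as np

def sample_eta(zbar_pairs, kappa, lam, nu2, rng):
    """One draw of eta = vec(U) from the Gaussian conditional of Eq. (10).

    zbar_pairs: (L, K^2) array whose rows are zbar_ij = vec(zbar_i zbar_j^T), one per pair;
    kappa:      values kappa_ij = c (y_ij - 1/2);  lam: Polya-Gamma draws lambda_ij;
    nu2:        variance of the isotropic Gaussian prior on eta."""
    K2 = zbar_pairs.shape[1]
    prec = np.eye(K2) / nu2 + (zbar_pairs * lam[:, None]).T @ zbar_pairs  # Sigma^{-1}
    mu = np.linalg.solve(prec, zbar_pairs.T @ kappa)  # mu = Sigma * sum_ij kappa_ij zbar_ij
    L = np.linalg.cholesky(prec)                      # robust inversion via Cholesky
    # eta = mu + L^{-T} eps has covariance (L L^T)^{-1} = Sigma.
    return mu + np.linalg.solve(L.T, rng.standard_normal(K2))

# Toy usage with K = 2 topics and 3 observed pairs:
rng = np.random.default_rng(0)
zb = rng.dirichlet(np.ones(2), size=4)
zbar_pairs = np.array([np.outer(zb[i], zb[j]).ravel() for (i, j) in [(0, 1), (1, 2), (2, 3)]])
eta = sample_eta(zbar_pairs, kappa=np.array([0.5, -0.5, 0.5]),
                 lam=np.ones(3), nu2=1.0, rng=rng)
print(eta.shape)  # (4,)
```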
4.2 Sampling Algorithm for the Hinge Loss

Now we present an "augment-and-collapse" Gibbs sampling algorithm for the gRTMs with the hinge loss. The algorithm extends the recent techniques of [42] to relational data analysis.

4.2.1 Formulation with Data Augmentation

Since we do not have a closed form of the expected margin loss, it is hard to deal with the expected hinge loss in Eq. (5). Here, we develop a collapsed Gibbs sampling method based on a data augmentation formulation of the expected margin loss to infer the posterior distribution

$$q(U, \Theta, Z, \Phi) = \frac{p_0(U, \Theta, Z, \Phi)\, p(W \mid Z, \Phi)\, \psi(y \mid Z, U)}{\phi(y, W)},$$

where $\phi(y, W)$ is the normalization constant and $\psi(y \mid Z, U) = \prod_{(i,j)\in\mathcal{I}} \psi_2(y_{ij} \mid z_i, z_j, U)$ in this case. Specifically, we have the following data augmentation representation of the pseudo-likelihood:

$$\psi_2(y_{ij} \mid z_i, z_j, U) = \int_0^\infty \frac{1}{\sqrt{2\pi\lambda_{ij}}} \exp\Big\{ -\frac{(\lambda_{ij} + c\zeta_{ij})^2}{2\lambda_{ij}} \Big\}\, d\lambda_{ij}, \qquad (13)$$

where $\zeta_{ij} = \ell - y_{ij}\omega_{ij}$. Eq. (13) can be derived following [35], and it indicates that the posterior distribution $q(U, \Theta, Z, \Phi)$ can be expressed as the marginal of a higher-dimensional posterior distribution that includes the augmented variables $\lambda$:

$$q(U, \lambda, \Theta, Z, \Phi) = \frac{p_0(U, \Theta, Z, \Phi)\, p(W \mid Z, \Phi)\, \psi(y, \lambda \mid Z, U)}{\phi(y, W)},$$

where the unnormalized distribution of $y$ and $\lambda$ is

$$\psi(y, \lambda \mid Z, U) = \prod_{(i,j)\in\mathcal{I}} \frac{1}{\sqrt{2\pi\lambda_{ij}}} \exp\Big( -\frac{(\lambda_{ij} + c\zeta_{ij})^2}{2\lambda_{ij}} \Big).$$

4.2.2 Inference with Collapsed Gibbs Sampling

As in the log-loss case, although we can sample from the complete distribution $q(U, \lambda, \Theta, Z, \Phi)$, the mixing rate would be slow due to the high-dimensional sample space.
Thus, we reduce the sample space and improve the mixing rate by integrating out the intermediate Dirichlet variables $(\Theta, \Phi)$ and building a Markov chain whose equilibrium distribution is the resulting marginal distribution $q(U, \lambda, Z)$. Specifically, the collapsed posterior distribution is

$$q(U, \lambda, Z) \propto p_0(U)\, p(W, Z \mid \alpha, \beta) \prod_{(i,j)\in\mathcal{I}} \phi(y_{ij}, \lambda_{ij} \mid z_i, z_j, U) = p_0(U) \prod_{i=1}^{D} \frac{\delta(C_i + \alpha)}{\delta(\alpha)} \times \prod_{k=1}^{K} \frac{\delta(C_k + \beta)}{\delta(\beta)} \times \prod_{(i,j)\in\mathcal{I}} \frac{1}{\sqrt{2\pi\lambda_{ij}}} \exp\Big\{ -\frac{(\lambda_{ij} + c\zeta_{ij})^2}{2\lambda_{ij}} \Big\}.$$

Then we can derive the conditional distributions used in collapsed Gibbs sampling as follows.

For U: we use the same notation, $\eta = \mathrm{vec}(U)$ and $\bar{z}_{ij} = \mathrm{vec}(\bar{z}_i\bar{z}_j^\top)$. For the commonly used isotropic Gaussian prior $p_0(U) = \prod_{k,k'} \mathcal{N}(U_{k,k'}; 0, \nu^2)$, the posterior distribution $q(U \mid Z, \lambda)$, or $q(\eta \mid Z, \lambda)$, is still a Gaussian distribution:

$$q(\eta \mid Z, \lambda) \propto p_0(\eta) \prod_{(i,j)\in\mathcal{I}} \exp\Big( -\frac{(\lambda_{ij} + c\zeta_{ij})^2}{2\lambda_{ij}} \Big) = \mathcal{N}(\eta; \mu, \Sigma), \qquad (14)$$

where $\Sigma = \big( \frac{1}{\nu^2}I + c^2 \sum_{(i,j)\in\mathcal{I}} \frac{\bar{z}_{ij}\bar{z}_{ij}^\top}{\lambda_{ij}} \big)^{-1}$ and $\mu = \Sigma \big( c \sum_{(i,j)\in\mathcal{I}} y_{ij}\, \frac{\lambda_{ij} + c\ell}{\lambda_{ij}}\, \bar{z}_{ij} \big)$.

For Z: the conditional posterior distribution of $Z$ is

$$q(Z \mid U, \lambda) \propto \prod_{i=1}^{D} \frac{\delta(C_i + \alpha)}{\delta(\alpha)} \prod_{k=1}^{K} \frac{\delta(C_k + \beta)}{\delta(\beta)} \prod_{(i,j)\in\mathcal{I}} \psi_2(y_{ij} \mid \lambda, Z),$$

where $\psi_2(y_{ij} \mid \lambda, Z) = \exp\big( -\frac{(\lambda_{ij} + c\zeta_{ij})^2}{2\lambda_{ij}} \big)$. By canceling common factors, we can derive the local conditional of one variable $z_{in}$ given the others $Z_\neg$ as:

$$q(z_{in}^k = 1 \mid Z_\neg, U, \lambda, w_{in}=t) \propto \frac{(C_{k,\neg n}^t + \beta_t)(C_{i,\neg n}^k + \alpha_k)}{\sum_t C_{k,\neg n}^t + \sum_{t=1}^{V} \beta_t} \times \prod_{j\in\mathcal{N}_i^+} \psi_2(y_{ij} \mid \lambda, Z_\neg, z_{in}^k = 1) \times \prod_{j\in\mathcal{N}_i^-} \psi_2(y_{ji} \mid \lambda, Z_\neg, z_{in}^k = 1). \qquad (15)$$

Again, we can see that the first term is from the LDA model for the observed word counts and the second term is from the link structure $y$.
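The natural parameters of the Gaussian in Eq. (14) can be assembled directly from the per-pair quantities; a minimal, illustrative sketch (our own naming, not the authors' code) is:

```python
def hinge_gaussian_params(zbar, lam, y, c, ell, nu2):
    """Return (P, b) for Eq. (14), where
    P = Sigma^{-1} = (1/nu^2) I + c^2 sum_ij zbar zbar^T / lambda_ij
    b with mu = Sigma b,  b = c sum_ij y_ij (lambda_ij + c ell) / lambda_ij * zbar_ij.
    zbar: list of K^2-dim features; lam: augmented variables; y: labels (+1/-1)."""
    d = len(zbar[0])
    P = [[(1.0 / nu2 if r == s else 0.0) for s in range(d)] for r in range(d)]
    b = [0.0] * d
    for z, l, yy in zip(zbar, lam, y):
        coef = c * yy * (l + c * ell) / l
        for r in range(d):
            b[r] += coef * z[r]
            for s in range(d):
                P[r][s] += (c * c) * z[r] * z[s] / l
    return P, b
```

Given $(P, b)$, drawing $\eta$ proceeds exactly as in the log-loss case: factor $P$ by Cholesky, solve $P\mu = b$, and perturb $\mu$ with correlated Gaussian noise.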
For λ: due to the conditional independence among the augmented variables given $Z$ and $U$, we can derive the conditional posterior distribution of each augmented variable $\lambda_{ij}$ as:

$$q(\lambda_{ij} \mid Z, U) \propto \frac{1}{\sqrt{2\pi\lambda_{ij}}} \exp\Big( -\frac{(\lambda_{ij} + c\zeta_{ij})^2}{2\lambda_{ij}} \Big) = \mathcal{GIG}\Big( \lambda_{ij}; \frac{1}{2}, 1, c^2\zeta_{ij}^2 \Big), \qquad (16)$$

where $\mathcal{GIG}(x; p, a, b) = C(p, a, b)\, x^{p-1} \exp\big( -\frac{1}{2}(\frac{b}{x} + ax) \big)$ is a generalized inverse Gaussian distribution [12] and $C(p, a, b)$ is a normalization constant. Therefore, $\lambda_{ij}^{-1}$ follows an inverse Gaussian distribution

$$p(\lambda_{ij}^{-1} \mid Z, U) = \mathcal{IG}\Big( \lambda_{ij}^{-1}; \frac{1}{c|\zeta_{ij}|}, 1 \Big),$$

where $\mathcal{IG}(x; a, b) = \sqrt{\frac{b}{2\pi x^3}} \exp\big( -\frac{b(x-a)^2}{2a^2 x} \big)$ for $a, b > 0$.

With the above conditional distributions, we can construct a Markov chain that iteratively draws samples of the weights $\eta$ (i.e., $U$) using Eq. (14), the topic assignments $Z$ using Eq. (15), and the augmented variables $\lambda$ using Eq. (16), with the same initial condition as in the logistic log-loss case. To sample from an inverse Gaussian distribution, we apply the efficient transformation method with multiple roots [30].

Remark 5: The Gibbs sampling algorithms for the hinge loss and the logistic loss have a similar structure, but they use different distributions for the augmented variables. As we shall see in the experiments, drawing samples from these different distributions for $\lambda$ has different efficiency.

4.3 Prediction

Since gRTMs account for both text contents and network structure, we can make predictions for each of them conditioned on the other [8]. For link prediction, given a test document $w$, we need to infer its topic assignments $z$ in order to apply the classifier (3).
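The constant-time inverse Gaussian draw for $\lambda_{ij}^{-1}$ can be sketched with the classic transformation method with multiple roots attributed above to [30]; this is an illustrative pure-Python version under the $\mathcal{IG}(a, b)$ parameterization used in the text (mean $a$, shape $b$):

```python
import math
import random

def sample_invgauss(mean, shape, rng=random):
    """Draw from IG(mean, shape) with density
    sqrt(shape / (2 pi x^3)) exp(-shape (x - mean)^2 / (2 mean^2 x)),
    via the transformation-with-multiple-roots method: square a standard
    normal, pick the smaller root of the resulting quadratic, and accept
    it with probability mean / (mean + x), else return mean^2 / x."""
    y = rng.gauss(0.0, 1.0) ** 2
    x = (mean + mean * mean * y / (2.0 * shape)
         - mean / (2.0 * shape) * math.sqrt(4.0 * mean * shape * y + (mean * y) ** 2))
    if rng.random() <= mean / (mean + x):
        return x
    return mean * mean / x

def sample_lambda_hinge(zeta, c, rng=random):
    """Draw lambda_ij for Eq. (16): lambda^{-1} ~ IG(1 / (c |zeta|), 1)."""
    return 1.0 / sample_invgauss(1.0 / (c * abs(zeta)), 1.0, rng=rng)
```

Each draw costs one normal variate, one uniform variate, and a square root, which is the source of the efficiency advantage over Polya-Gamma sampling noted later in the experiments.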
This can be done with a collapsed Gibbs sampling method, where the conditional distribution is $p(z_n^k = 1 \mid z_{\neg n}) \propto \hat{\phi}_{k w_n}(C_{\neg n}^k + \alpha_k)$; here $C_{\neg n}^k$ is the number of times that the terms in document $w$ are assigned to topic $k$, with the $n$-th term excluded, and $\hat{\Phi}$ is a point estimate of the topics, with $\hat{\phi}_{kt} \propto C_k^t + \beta_t$. To initialize, we randomly assign each word to a topic, and then run the Gibbs sampler until some stopping criterion is met, e.g., the relative change of likelihood falls below a threshold (1e-4 in our experiments).

For word prediction, we need to infer the distribution

$$p(w_n \mid y, \mathcal{D}, \hat{\Phi}, \hat{U}) = \sum_k \hat{\phi}_{k w_n}\, p(z_n^k = 1 \mid y, \mathcal{D}, \hat{U}).$$

This can be done by drawing a few samples of $z_n$ and computing the empirical mean of $\hat{\phi}_{k w_n}$ over the sampled $z_n$. The number of samples is determined by running a Gibbs sampler until some stopping criterion is met, e.g., the relative change of likelihood is less than 1e-4 in our experiments.

5 EXPERIMENTS

Now we present both quantitative and qualitative results on several real network datasets to demonstrate the efficacy of the generalized discriminative relational topic models. We also present an extensive sensitivity analysis with respect to various parameters.

5.1 Datasets and Setups

We present experiments on three public datasets of document networks (footnote 3):

1) The Cora data [29] consists of abstracts of 2,708 computer science research papers, with links between documents that cite each other. In total, the Cora citation network has 5,429 positive links, and the dictionary consists of 1,433 words.
2) The WebKB data [10] contains 877 webpages from the computer science departments of different universities, with links between webpages that are hyperlinked. In total, the WebKB network has 1,608 positive links and the dictionary has 1,703 words.
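The test-document sampler described above is simple enough to sketch directly; the following illustrative code (our own, not the authors') implements the conditional $p(z_n^k = 1 \mid z_{\neg n}) \propto \hat{\phi}_{k w_n}(C_{\neg n}^k + \alpha_k)$ for a held-out document:

```python
import random

def sample_test_topics(doc, phi_hat, alpha, n_iters=50, rng=random):
    """Collapsed Gibbs for a held-out document:
    p(z_n = k | z_neg) propto phi_hat[k][w_n] * (C_k + alpha[k]),
    where C_k counts the document's other words assigned to topic k.
    doc: list of word ids; phi_hat: K x V point estimate of the topics."""
    K = len(phi_hat)
    z = [rng.randrange(K) for _ in doc]        # random initialization
    counts = [0] * K
    for t in z:
        counts[t] += 1
    for _ in range(n_iters):
        for n, w in enumerate(doc):
            counts[z[n]] -= 1                  # exclude the n-th term
            weights = [phi_hat[k][w] * (counts[k] + alpha[k]) for k in range(K)]
            r, acc, k = rng.random() * sum(weights), 0.0, 0
            for k, wt in enumerate(weights):   # categorical draw
                acc += wt
                if acc >= r:
                    break
            z[n] = k
            counts[k] += 1
    return z, counts
```

A fixed iteration budget stands in here for the likelihood-based stopping criterion used in the paper.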
3) The Citeseer data [37] consists of 3,312 scientific publications with 4,732 positive links, and the dictionary contains 3,703 unique words.

Footnote 3: http://www.cs.umd.edu/projects/linqs/projects/lbc/index.html

Since many baseline methods have already been outperformed by RTMs on the same datasets [8], we focus on evaluating the effects of the various extensions in the discriminative gRTMs with the log-loss (denoted by Gibbs-gRTM) and the hinge loss (denoted by Gibbs-gMMRTM) by comparing with various special cases:

1) Var-RTM: the standard RTMs (i.e., c = 1) with a diagonal logistic likelihood and a variational EM algorithm with mean-field assumptions [8];
2) Gibbs-RTM: the Gibbs-RTM model with a diagonal weight matrix and a Gibbs sampling algorithm for the logistic link likelihood;
3) Gibbs-gRTM: the Gibbs-gRTM model with a full weight matrix and a Gibbs sampling algorithm for the logistic link likelihood;
4) Approx-gRTM: Gibbs-gRTM with fast approximation on sampling Z, computing the link likelihood term in Eq. (10) once and caching it for sampling all the word topics in each document;
5) Gibbs-MMRTM: the Gibbs-MMRTM model with a diagonal weight matrix and a Gibbs sampling algorithm for the hinge loss;
6) Gibbs-gMMRTM: the Gibbs-gMMRTM model with a full weight matrix and a Gibbs sampling algorithm for the hinge loss;
7) Approx-gMMRTM: Gibbs-gMMRTM with fast approximation on sampling Z, similar to Approx-gRTM.

For Var-RTM, we follow the setup of [8] and use positive links only as training data; to deal with the resulting one-class problem, a regularization penalty is used, which in effect injects some number of pseudo-observations (each with a fixed uniform topic distribution). For the other proposed models (Gibbs-gRTM, Gibbs-RTM, Approx-gRTM, Gibbs-gMMRTM, Gibbs-MMRTM, and Approx-gMMRTM), we instead draw some unobserved links as negative examples.
Though subsampling normally results in imbalanced datasets, the regularization parameter c in our discriminative gRTMs can effectively address this, as we shall see. Here, we fix c at 1 for negative examples, while we tune it for positive examples. All training and testing times are measured on a desktop computer with four 3.10GHz processors and 4GB RAM.

5.2 Quantitative Results

We first report the overall results of link rank, word rank, and AUC (area under the ROC curve) to measure the prediction performance, following the setups in [8]. Link rank is defined as the average rank of the observed links from the held-out test documents to the training documents, and word rank is defined as the average rank of the words in test documents given their links to the training documents. Therefore, lower link rank and word rank are better, and a higher AUC value is better. The test documents are completely new and not observed during training; in the training phase, all the words along with the links of the test documents are removed.

Fig. 1. Results of various models with different numbers of topics on the Cora citation dataset (panels: link rank, word rank, AUC score, train time, test time).
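The evaluation metrics just defined can be sketched as follows (a minimal, hypothetical implementation; AUC is computed as the rank statistic, with ties counted as one half):

```python
def auc_score(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive outscores a randomly chosen negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def link_rank(true_links, scores):
    """Average rank of a test document's observed links among all
    training documents, ranked by predicted link score (1 = best)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rank_of = {doc: r + 1 for r, doc in enumerate(order)}
    return sum(rank_of[d] for d in true_links) / len(true_links)
```

Word rank is computed analogously to link rank, with words ranked by their predicted probabilities under the model.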
Fig. 2. Results of various models with different numbers of topics on the WebKB dataset (panels: link rank, word rank, AUC score, train time, test time).

Fig. 3. Results of various models with different numbers of topics on the Citeseer dataset (panels: link rank, word rank, AUC score, train time, test time).

5.2.1 Results with the Log-loss

Fig. 1, Fig. 2, and Fig. 3 show the 5-fold average results and standard deviations of various models on all three datasets with varying numbers of topics. For the RTM models using collapsed Gibbs sampling, we randomly draw 1% of the unobserved links as negative training examples, which leads to imbalanced training sets. We can see that the generalized Gibbs-gRTM can effectively deal with the imbalance and achieves significantly better results on link rank and AUC scores than all the other competitors. For word rank, all the RTM models using Gibbs sampling perform better than the RTMs using variational EM methods when the number of topics is larger than 5. The outstanding performance of Gibbs-gRTM is due to several possible factors.
For example, the superior performance of Gibbs-gRTM over the diagonal Gibbs-RTM demonstrates that it is important to consider all pairwise topic interactions to fit real network data; and the superior performance of Gibbs-RTM over Var-RTM shows the benefits of using the regularization parameter c in the regularized Bayesian framework and a collapsed Gibbs sampling algorithm without restrictive mean-field assumptions (footnote 4).

TABLE 3: Split of training time on the Cora dataset.

        Sample Z          Sample λ         Sample U
K=10    331.2 (73.55%)    55.3 (12.29%)    67.8 (14.16%)
K=15    746.8 (76.54%)    55.0 (5.64%)     173.9 (17.82%)
K=20    1300.3 (74.16%)   55.4 (3.16%)     397.7 (22.68%)

To single out the influence of the proposed Gibbs sampling algorithm, we also present the results of Var-RTM and Gibbs-RTM with c = 1, both of which randomly sample 0.2% of the unobserved links (footnote 5) as negative examples on the Cora dataset. We can see that by using Gibbs sampling without restrictive mean-field assumptions, Gibbs-RTM (neg 0.2%) outperforms Var-RTM (neg 0.2%), which makes mean-field assumptions, when the number of topics is larger than 10. We defer a more careful analysis of other factors, including c and the subsampling ratio, to the next section.

We also note that the cost we pay for the outstanding performance of Gibbs-gRTM is training time, which is much longer than that of Var-RTM because Gibbs-gRTM has $K^2$ latent features in the logistic likelihood and more training link pairs, while Var-RTM has $K$ latent features and only uses the sparse positive links as training examples. Fortunately, we can apply a simple approximate method in sampling Z, as in Approx-gRTM, to significantly improve the training efficiency without sacrificing much prediction performance. In fact, Approx-gRTM is still significantly better than Var-RTM in all cases, and it has comparable link prediction performance with Gibbs-gRTM on the WebKB dataset when K is large. Table 3 further shows the training time spent on each sub-step of the Gibbs sampling algorithm of Gibbs-gRTM.

Footnote 4: Gibbs-RTM doesn't outperform Var-RTM on Citeseer because they use different strategies for drawing negative samples. If we use the same strategy (e.g., randomly drawing 1% negative samples), Gibbs-RTM significantly outperforms Var-RTM.
Footnote 5: Var-RTM performs much worse if using 1% negative links, while Gibbs-RTM obtains similar performance (see Fig. 13) due to its effectiveness in dealing with imbalance.

Fig. 4. Results of various models with different numbers of topics on the Cora dataset (panels: link rank, word rank, AUC score, train time, test time).

Fig. 5. Results of various models with different numbers of topics on the Citeseer dataset (panels: link rank, word rank, AUC score, train time, test time).
We can see that the step of sampling Z takes most of the time (>70%); the steps of sampling Z and η take more time as K increases, while the step of sampling λ takes almost constant time as K changes.

5.2.2 Results with the Hinge Loss

Fig. 4 and Fig. 5 show the 5-fold average results with standard deviations of the discriminative RTMs with the hinge loss, compared with the RTMs with the log-loss, on the Cora and Citeseer datasets (footnote 6). We can see that the discriminative RTM models with the hinge loss (i.e., Gibbs-gMMRTM and Gibbs-MMRTM) obtain comparable predictive results (e.g., link rank and AUC scores) with the RTMs using the log-loss (i.e., Gibbs-gRTM and Gibbs-RTM). Owing to the use of a full weight matrix, Gibbs-gMMRTM obtains superior performance over the diagonal Gibbs-MMRTM. These results verify that the max-margin RTMs are a competitive alternative approach for statistical network link prediction. For word rank, all the RTM models using Gibbs sampling perform similarly.

As shown in Fig. 6, one advantage of the max-margin Gibbs-gMMRTM is that drawing λ is cheaper than in Gibbs-gRTM with the log-loss. Specifically, drawing λ in Gibbs-gRTM takes about 10 times longer than in Gibbs-gMMRTM (Fig. 6(a)). This is because sampling from a Polya-Gamma distribution in Gibbs-gRTM needs a few iterations to converge, which takes more time than the constant-time sampler of an inverse Gaussian distribution [30] in Gibbs-gMMRTM. We also observe that the time costs for drawing η (Fig. 6(b)) in Gibbs-gRTM and Gibbs-gMMRTM are comparable (footnote 7). As most of the time is spent on drawing Z and η, the total training times of the RTMs with the two types of losses are similar (gMMRTM is slightly faster on Citeseer). Fortunately, we can also develop Approx-gMMRTM by using a simple approximate method in sampling Z to greatly improve the time efficiency (Fig. 4 and Fig. 5), and the prediction performance is still very compelling, especially on the Citeseer dataset.

Footnote 6: The results on the WebKB dataset are similar but omitted to save space; please refer to Fig. 15 in the Appendix.

Fig. 6. Time complexity of drawing λ and η on the Citeseer dataset (panels: time for drawing λ, time for drawing η).

5.3 Sensitivity Analysis

To provide more insight into the behaviors of our discriminative RTMs, we present a careful analysis of various factors.

5.3.1 Hyper-parameter c

Fig. 7 and 9 show the prediction performance of the diagonal Gibbs-RTM and Gibbs-MMRTM with different c values on the Cora and Citeseer datasets (footnote 8), and Fig. 8 and 10 show the results of the generalized Gibbs-gRTM and Gibbs-gMMRTM.

Fig. 7. Performance of Gibbs-RTM with different c values on the Cora dataset (panels: link rank, AUC score, word rank).
Fig. 8. Performance of Gibbs-gRTM with different c values on the Cora dataset (panels: link rank, AUC score, word rank).
For Gibbs-RTM and Gibbs-MMRTM, we can see that the link rank decreases and the AUC scores increase as c becomes larger, and the prediction performance is stable over a wide range (e.g., $2 \le \sqrt{c} \le 6$). But the RTM model (i.e., c = 1) using Gibbs sampling does not perform well, due to its ineffectiveness in dealing with imbalanced network data. In Fig. 8 and 10, we can observe that when $2 \le c \le 10$, the link rank and AUC scores of Gibbs-gRTM achieve a local optimum, which is much better than the performance of Gibbs-gRTM when c = 1. In general, both Gibbs-gRTM and Gibbs-gMMRTM need a smaller c to reach their best performance. This is because, by allowing all pairwise topic interactions, Gibbs-gRTM and Gibbs-gMMRTM are much more expressive than Gibbs-RTM and Gibbs-MMRTM with a diagonal weight matrix, and thus easier to over-fit when c gets large. For all the proposed models, the word rank increases slowly with the growth of c. This is because a larger c value makes the model concentrate more on fitting the link structure, so the fit of the observed words is sacrificed a bit.

Footnote 7: Sampling Z also takes comparable time; omitted to save space.
Footnote 8: We have similar observations on the WebKB dataset, again omitted to save space.

Fig. 9. Performance of Gibbs-MMRTM with different c values on the Citeseer dataset (panels: link rank, AUC score, word rank).
Fig. 10. Performance of Gibbs-gMMRTM with different c values on the Citeseer dataset (panels: link rank, AUC score, word rank).
But if we compare with the variational RTM (i.e., Var-RTM), as shown in Fig. 1 and Fig. 3, the word ranks of all four proposed RTMs using Gibbs sampling are much lower for all the c values we have tested. This suggests the advantages of the collapsed Gibbs sampling algorithms.

5.3.2 Burn-in Steps

Fig. 11 and Fig. 12 show the sensitivity of Gibbs-gRTM and Gibbs-gMMRTM with respect to the number of burn-in iterations, respectively. We can see that the link rank and AUC scores converge quickly to stable optima within about 300 iterations. The training time grows almost linearly with the number of burn-in iterations. We have similar observations for the diagonal Gibbs-RTM, Gibbs-MMRTM, and the approximate RTMs with fast approximation. In the previous experiments, we set the number of burn-in steps to 400 for Cora and Citeseer, which is sufficiently large.

Fig. 11. Performance of Gibbs-gRTM with different burn-in steps on the Cora dataset (panels: link rank, AUC score, train time).
Fig. 12. Performance of Gibbs-gMMRTM with different burn-in steps on the Citeseer dataset (panels: link rank, AUC score, train time).

5.3.3 Subsample Ratio

Fig. 13 shows the influence of the subsample ratio on the performance of Gibbs-gRTM on the Cora data. In total, less than 0.1% of the links are positive in the Cora network. We can see that by introducing the regularization parameter c, Gibbs-gRTM can effectively fit various imbalanced network data, and the different subsample ratios have a weak influence on the performance of Gibbs-gRTM. Since a larger subsample ratio leads to a bigger training set, the training time increases as expected. We have similar observations for Gibbs-gMMRTM and the other models.

5.3.4 Dirichlet Prior α

Fig. 14 shows the sensitivity of the generalized Gibbs-gRTM model and the diagonal Gibbs-RTM on the Cora dataset with different α values. We can see that the results are quite stable over a wide range of α (i.e., $1 \le \alpha \le 10$) for three different topic numbers. We have similar observations for Gibbs-gMMRTM. In the previous experiments, we set α = 5 for both Gibbs-gRTM and Gibbs-gMMRTM.

Fig. 13. Performance of Gibbs-gRTM with different numbers of negative training links on the Cora dataset (panels: link rank, AUC score, train time).
Fig. 14. Performance of Gibbs-gRTM (c = 1) with different α values on the Cora dataset (panels: link rank, AUC score, word rank).

5.4 Link Suggestion

As in [8], Gibbs-gRTM can perform the task of suggesting links for a new document (i.e., test data) based on its text contents. Table 4 shows example suggested citations for two query documents, 1) "Competitive environments evolve better solutions for complex tasks" and 2) "Planning by Incremental Dynamic Programming", in the Cora data using Gibbs-gRTM and Var-RTM. The query documents are not observed during training, and the suggestion results are ranked by the link prediction likelihood between the training documents and the given query. We can see that Gibbs-gRTM outperforms Var-RTM in terms of identifying more ground-truth links. For query 1, Gibbs-gRTM finds 4 truly linked documents (5 in total) in the top-8 suggested results, while Var-RTM finds 3. For query 2, Gibbs-gRTM finds 2, while Var-RTM does not find any. In general, Gibbs-gRTM outperforms Var-RTM on the link suggestion task across the whole corpus. We also observe that the suggested documents that are not truly linked to the query document are still semantically very related to it.

6 CONCLUSIONS AND DISCUSSION

We have presented discriminative relational topic models (gRTMs and gMMRTMs) that consider all pairwise topic interactions and are suitable for asymmetric networks. We perform regularized Bayesian inference, which introduces a regularization parameter to control the imbalance issue in common real networks and gives the freedom to incorporate two popular loss functions (i.e., the logistic log-loss and the hinge loss). We also presented a simple "augment-and-collapse" sampling algorithm for the proposed discriminative RTMs

TABLE 4: Top 8 link predictions made by Gibbs-gRTM and Var-RTM on the Cora dataset. (Papers with titles in bold have ground-truth links with the query document.)
Query 1: Competitive environments evolve better solutions for complex tasks

Gibbs-gRTM:
  Coevolving High Level Representations
  Strongly typed genetic programming in evolving cooperation strategies
  Genetic Algorithms in Search, Optimization and Machine Learning
  Improving tactical plans with genetic algorithms
  Some studies in machine learning using the game of Checkers
  Issues in evolutionary robotics: From Animals to Animats
  Strongly Typed Genetic Programming
  Evaluating and improving steady state evolutionary algorithms on constraint satisfaction problems

Var-RTM:
  Coevolving High Level Representations
  A survey of Evolutionary Strategies
  Genetic Algorithms in Search, Optimization and Machine Learning
  Strongly typed genetic programming in evolving cooperation strategies
  Solving combinatorial problems using evolutionary algorithms
  A promising genetic algorithm approach to job-shop scheduling, rescheduling, and open-shop scheduling problems
  Evolutionary Module Acquisition
  An Empirical Investigation of Multi-Parent Recombination Operators in Evolution Strategies

Query 2: Planning by Incremental Dynamic Programming

Gibbs-gRTM:
  Learning to predict by the methods of temporal differences
  Neuronlike adaptive elements that can solve difficult learning control problems
  Learning to Act using Real-Time Dynamic Programming
  A new learning algorithm for blind signal separation
  Planning with closed-loop macro actions
  Some studies in machine learning using the game of Checkers
  Transfer of Learning by Composing Solutions of Elemental Sequential Tasks
  Introduction to the Theory of Neural Computation

Var-RTM:
  Causation, action, and counterfactuals
  Learning Policies for Partially Observable Environments
  Asynchronous modified policy iteration with single-sided updates
  Hidden Markov models in computational biology: Applications to protein modeling
  Exploiting structure in policy construction
  Planning and acting in partially observable stochastic domains
  A qualitative Markov assumption and its implications for belief change
  Dynamic Programming and Markov Processes

without restricting assumptions on the posterior distribution. Experiments on real network data demonstrate significant improvements on prediction tasks. The time efficiency can be significantly improved with a simple approximation method.

For future work, we are interested in making the sampling algorithm scalable to large networks by using distributed architectures [38] or online inference [22]. Moreover, developing nonparametric RTMs to avoid model selection problems (i.e., automatically resolving the number of latent topics in RTMs) is an interesting direction. Finally, our current focus is on static networks; it is interesting to extend the models to deal with dynamic networks, where incorporating time-varying dependencies is a challenging problem to address.

ACKNOWLEDGMENTS

This work is supported by the National Key Project for Basic Research of China (Grant Nos. 2013CB329403, 2012CB316301), the Tsinghua Self-innovation Project (Grant Nos. 20121088071, 20111081111), and the China Postdoctoral Science Foundation (Grant Nos. 2013T60117, 2012M520281).

REFERENCES

[1] M. Rosen-Zvi, A. Gruber, and Y. Weiss. Latent Topic Models for Hypertext. In Proceedings of Uncertainty in Artificial Intelligence, 2008.
[2] E. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed Membership Stochastic Blockmodels. In Advances in Neural Information Processing Systems, 2008.
[3] R. Akbani, S. Kwek, and N. Japkowicz. Applying Support Vector Machines to Imbalanced Datasets. In European Conference on Machine Learning, 2004.
[4] L. Backstrom and J. Leskovec. Supervised Random Walks: Predicting and Recommending Links in Social Networks. In International Conference on Web Search and Data Mining, 2011.
[5] R. Balasubramanyan and W. Cohen.
Block-LDA: Jointly Modeling Entity-Annotated Text and Entity-entity Links. In Proceedings of the SIAM International Conference on Data Mining, 2011.
[6] Y. Bengio, A. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives, 2012.
[7] D. Blei, A. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, (3):993–1022, 2003.
[8] J. Chang and D. Blei. Relational Topic Models for Document Networks. In International Conference on Artificial Intelligence and Statistics, 2009.
[9] N. Chen, J. Zhu, F. Xia, and B. Zhang. Generalized Relational Topic Models with Data Augmentation. In International Joint Conference on Artificial Intelligence, 2013.
[10] M. Craven, D. Dipasquo, D. Freitag, and A. McCallum. Learning to Extract Symbolic Knowledge from the World Wide Web. In AAAI Conference on Artificial Intelligence, 1998.
[11] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Ser. B, (39):1–38, 1977.
[12] L. Devroye. Non-uniform Random Variate Generation. Springer-Verlag, 1986.
[13] L. Dietz, S. Bickel, and T. Scheffer. Unsupervised Prediction of Citation Influences. In Proceedings of the 24th Annual International Conference on Machine Learning, 2007.
[14] D. Van Dyk and X. Meng. The Art of Data Augmentation. Journal of Computational and Graphical Statistics, 10(1):1–50, 2001.
[15] A. George, M. Heath, and J. Liu. Parallel Cholesky Factorization on a Shared-memory Multiprocessor. Linear Algebra and Its Applications, 77:165–187, 1986.
[16] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian Learning of Linear Classifiers. In International Conference on Machine Learning, pages 353–360, 2009.
[17] A. Goldenberg, A. X. Zheng, S. E. Fienberg, and E. M. Airoldi. A Survey of Statistical Network Models. Foundations and Trends in Machine Learning, 2(2):129–233, 2010.
[18] T. L. Griffiths and M. Steyvers. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 2004.
[19] M. A. Hasan, V. Chaoji, S. Salem, and M. Zaki. Link Prediction Using Supervised Learning. In SIAM Workshop on Link Analysis, Counterterrorism and Security, 2006.
[20] P. D. Hoff, A. E. Raftery, and M. S. Handcock. Latent Space Approaches to Social Network Analysis. Journal of American Statistical Association, 97(460), 2002.
[21] P. D. Hoff. Modeling Homophily and Stochastic Equivalence in Symmetric Relational Data. In Advances in Neural Information Processing Systems, 2007.
[22] M. Hoffman, D. Blei, and F. Bach. Online Learning for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, 2010.
[23] T. Hofmann. Probabilistic Latent Semantic Analysis. In Proceedings of Uncertainty in Artificial Intelligence, 1999.
[24] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An Introduction to Variational Methods for Graphical Models. MIT Press, Cambridge, MA, 1999.
[25] D. Liben-Nowell and J. M. Kleinberg. The Link Prediction Problem for Social Networks. In ACM Conference of Information and Knowledge Management, 2003.
[26] R. N. Lichtenwalter, J. T. Lussier, and N. V. Chawla. New Perspectives and Methods in Link Prediction. In Proceedings of the 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2010.
[27] Y. Liu, A. Niculescu-Mizil, and W. Gryc. Topic-Link LDA: Joint Models of Topic and Author Community. In International Conference on Machine Learning, 2009.
[28] D. McAllester. PAC-Bayesian Stochastic Model Selection. Machine Learning, 51:5–21, 2003.
[29] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the Construction of Internet Portals with Machine Learning. Information Retrieval, 2000.
[30] J. R.
Michael, W. R. Schucany, and R. W. Haas. Generating Random Variates Using Transformations with Multiple Roots. The American Statistician, 30(2):88–90, 1976.
[31] K. Miller, T. Griffiths, and M. Jordan. Nonparametric Latent Feature Models for Link Prediction. In Advances in Neural Information Processing Systems, 2009.
[32] R. Nallapati and W. Cohen. Link-PLSA-LDA: A New Unsupervised Model for Topics and Influence in Blogs. In Proceedings of International Conference on Weblogs and Social Media, 2008.
[33] R. M. Nallapati, A. Ahmed, E. P. Xing, and W. Cohen. Joint Latent Topic Models for Text and Citations. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.
[34] N. G. Polson, J. G. Scott, and J. Windle. Bayesian Inference for Logistic Models Using Polya-Gamma Latent Variables. arXiv:1205.0310v1, 2012.
[35] N. G. Polson and S. L. Scott. Data Augmentation for Support Vector Machines. Bayesian Analysis, 6(1):1–24, 2011.
[36] L. Rosasco, E. De Vito, A. Caponnetto, M. Piana, and A. Verri. Are Loss Functions All the Same? Neural Computation, (16):1063–1076, 2004.
[37] P. Sen, G. Namata, M. Bilgic, and L. Getoor. Collective Classification in Network Data. AI Magazine, 29(3):93–106, 2008.
[38] A. Smola and S. Narayanamurthy. An Architecture for Parallel Topic Models. International Conference on Very Large Data Bases, 2010.
[39] D. Strauss and M. Ikeda. Pseudolikelihood Estimation for Social Networks. Journal of American Statistical Association, 85(409):204–212, 1990.
[40] M. A. Tanner and W. H. Wong. The Calculation of Posterior Distributions by Data Augmentation. Journal of the American Statistical Association, 82(398):528–540, 1987.
[41] J. Zhu. Max-Margin Nonparametric Latent Feature Models for Link Prediction. In International Conference on Machine Learning, 2012.
[42] J. Zhu, N. Chen, H. Perkins, and B. Zhang. Gibbs Max-margin Topic Models with Fast Sampling Algorithms. In International Conference on Machine Learning, 2013.
[43] J. Zhu, N. Chen, and E. P. Xing. Infinite Latent SVM for Classification and Multi-task Learning. In Advances in Neural Information Processing Systems, 2011.
[44] J. Zhu, N. Chen, and E. P. Xing. Bayesian Inference with Posterior Regularization and Applications to Infinite Latent SVMs, 2013.
[45] J. Zhu, X. Zheng, and B. Zhang. Bayesian Logistic Supervised Topic Models with Data Augmentation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, 2013.

Ning Chen received her BS from Northwestern Polytechnical University, China, and her PhD degree from the Department of Computer Science and Technology at Tsinghua University, China, where she is currently a post-doctoral fellow. She was a visiting researcher in the Machine Learning Department of Carnegie Mellon University. Her research interests are primarily in machine learning, especially probabilistic graphical models and Bayesian nonparametrics, with applications to data mining and computer vision.

Jun Zhu received his BS, MS and PhD degrees, all from the Department of Computer Science and Technology at Tsinghua University, China, where he is currently an associate professor. He was a project scientist and postdoctoral fellow in the Machine Learning Department, Carnegie Mellon University. His research interests are primarily in developing statistical machine learning methods to understand scientific and engineering data arising from various fields. He is a member of the IEEE.

Fei Xia received his BS from the School of Software, Tsinghua University, China. He is currently working toward his MS degree in the Language Technologies Institute, School of Computer Science, Carnegie Mellon University, USA.
His research interests are primarily in machine learning, especially probabilistic graphical models, Bayesian nonparametrics, and data mining problems such as social networks.

Bo Zhang graduated from the Department of Automatic Control, Tsinghua University, Beijing, China, in 1958. Currently, he is a Professor in the Department of Computer Science and Technology, Tsinghua University, and a Fellow of the Chinese Academy of Sciences, Beijing, China. His main interests are artificial intelligence, pattern recognition, neural networks, and intelligent control. He has published over 150 papers and four monographs in these fields.

APPENDIX

In this section, we present additional experimental results.

A.1 Prediction Performance on WebKB Dataset

Fig. 15 shows the 5-fold average results with standard deviations of the discriminative RTMs (with both the log-loss and hinge loss) on the WebKB dataset. We have observations similar to those in Section 5.2.2 on the other two datasets. Discriminative RTMs with the hinge loss (i.e., Gibbs-gMMRTM and Gibbs-MMRTM) obtain predictive results comparable to the RTMs using the log-loss (i.e., Gibbs-gRTM and Gibbs-RTM), and the generalized gRTMs achieve superior performance over the diagonal RTMs, especially when the topic numbers are less than 25. We also develop Approx-gMMRTM and Approx-gRTM by using a simple approximation method in sampling Z (see Section 5.1) to greatly improve the time efficiency without sacrificing much prediction performance.

A.2 Topic Discovery

Table 5 shows 7 example topics discovered by the 10-topic Gibbs-gRTM on the Cora dataset. For each topic, we show the 6 top-ranked document titles, i.e., those that yield the highest values of Θ. In order to qualitatively illustrate the semantic meaning of each topic among the documents from 7 categories (9), in the left part of Table 5 we show the average probability of each category distributed on the particular topic. Note that category labels are not used by any of the models in this paper; we use them here only to visualize the discovered semantic meanings of the proposed Gibbs-gRTM. We can observe that most of the discovered topics are representative of documents from one or several categories. For example, topics T1 and T2 tend to represent documents about "Genetic Algorithms" and "Rule Learning", respectively. Similarly, topics T3 and T6 are good at representing documents about "Reinforcement Learning" and "Theory", respectively.

9. The seven categories are Case Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning and Theory.

Fig. 15. Results of various models with different numbers of topics on the WebKB dataset. Panels: (a) link rank, (b) word rank, (c) AUC score, (d) train time (CPU seconds), (e) test time (CPU seconds); curves compare Gibbs-gMMRTM, Gibbs-gRTM, Approx-gMMRTM, Approx-gRTM, Gibbs-MMRTM and Gibbs-RTM.

TABLE 5
Example topics discovered by a 10-topic Gibbs-gRTM on the Cora dataset. For each topic, we show 6 top-ranked documents as well as the average probabilities of that topic on representing documents from 7 categories.

T1: Genetic Algorithms
1. Stage scheduling: A tech. to reduce the register requirements of a modulo schedule.
[per-topic category-probability bar charts not reproduced]
2. Optimum modulo schedules for minimum register requirements.
3. Duplication of coding segments in genetic programming.
4. Genetic programming and redundancy.
5. A cooperative coevolutionary approach to function optimization.
6. Evolving graphs and networks with edge encoding: Preliminary report.

T2: Rule Learning
1. Inductive Constraint Logic.
2. The difficulties of learning logic programs with cut.
3. Learning semantic grammars with constructive inductive logic programming.
4. Learning Singly Recursive Relations from Small Datasets.
5. Least generalizations and greatest specializations of sets of clauses.
6. Learning logical definitions from relations.

T3: Reinforcement Learning
1. Integ. Architect. for Learning, Planning & Reacting by Approx. Dynamic Program.
2. Multiagent reinforcement learning: Theoretical framework and an algorithm.
3. Learning to Act using Real-Time Dynamic Programming.
4. Learning to predict by the methods of temporal differences.
5. Robot shaping: Developing autonomous agents through learning.
6. Planning and acting in partially observable stochastic domains.

T6: Theory
1. Learning with Many Irrelevant Features.
2. Learning decision lists using homogeneous rules.
3. An empirical comparison of selection measures for decision-tree induction.
4. Learning active classifiers.
5. Using Decision Trees to Improve Case-based Learning.
6. Utilizing prior concepts for learning.

T7: Neural Networks
1. Learning factorial codes by predictability minimization.
2. The wake-sleep algorithm for unsupervised neural networks.
3. Learning to control fast-weight memories: An alternative to recurrent nets.
4. An improvement over LBG inspired from neural networks.
5. A distributed feature map model of the lexicon.
6. Self-organizing process based on lateral inhibition and synaptic resource redistribution.

T8: Probabilistic Methods
1. Density estimation by wavelet thresholding.
2. On Bayesian analysis of mixtures with an unknown number of components.
3. Markov chain Monte Carlo methods based on "slicing" the density function.
4. Markov chain Monte Carlo convergence diagnostics: A comparative review.
5. Bootstrap C-Interv. for Smooth Splines & Comparison to Bayesian C-Interv.
6. Rates of convergence of the Hastings and Metropolis algorithms.

T9: Case Based
1. Case Retrieval Nets: Basic ideas and extensions.
2. Case-based reasoning: Foundat. issues, methodological variat., & sys. approaches.
3. Adapter: an integrated diagnostic system combining case-based and abduct. reasoning.
4. An event-based abductive model of update.
5. Applying Case Retrieval Nets to diagnostic tasks in technical domains.
6. Introspective Reasoning using Meta-Explanations for Multistrategy Learning.
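As a concrete reference for the two ranking metrics plotted in Fig. 15, link rank and AUC, the following minimal sketch computes both from a matrix of predicted link scores. This is our own illustration under standard definitions of the metrics; the function names and toy data are invented for exposition and are not taken from the paper's implementation.

```python
import numpy as np

def link_rank(scores, pos_mask):
    """Mean rank of the held-out (positive) links when, for each test
    document, all candidates are sorted by predicted link score
    (rank 1 = highest score). Lower is better."""
    ranks = []
    for s, m in zip(scores, pos_mask):
        order = np.argsort(-s)                 # candidate indices, best first
        rank_of = np.empty(len(s), dtype=int)
        rank_of[order] = np.arange(1, len(s) + 1)
        ranks.extend(rank_of[m])               # ranks of the true links
    return float(np.mean(ranks))

def auc(scores, pos_mask):
    """Probability that a random positive link outscores a random
    negative one (ties count 1/2). Higher is better."""
    s, y = scores.ravel(), pos_mask.ravel()
    pos, neg = s[y], s[~y]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

# Toy check: two query documents, four candidates each; the true link
# always receives the highest score, so both metrics are perfect.
scores = np.array([[0.9, 0.2, 0.1, 0.3],
                   [0.8, 0.7, 0.1, 0.2]])
pos = np.array([[True, False, False, False],
                [True, False, False, False]])
print(link_rank(scores, pos))  # -> 1.0
print(auc(scores, pos))        # -> 1.0
```

Both metrics depend only on the ordering of the scores, so any monotone transform of the model's link likelihood (e.g., the logistic log-odds or the max-margin discriminant value) yields the same numbers.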