Bibliographic Analysis with the Citation Network Topic Model
Kar Wai Lim (karwai.lim@anu.edu.au), Australian National University, Canberra, Australia; NICTA, Canberra, Australia
Wray Buntine (wray.buntine@monash.edu), Monash University, Clayton, Australia

JMLR: Workshop and Conference Proceedings 39:142-158, 2014. ACML 2014.
Editors: Dinh Phung and Hang Li.
© 2014 K.W. Lim & W. Buntine.

Abstract

Bibliographic analysis considers the author's research areas, the citation network and the paper content, among other things. In this paper, we combine these three in a topic model that produces a bibliographic model of authors, topics and documents, using a non-parametric extension of a combination of the Poisson mixed-topic link model and the author-topic model. We propose a novel and efficient inference algorithm for the model to explore subsets of research publications from CiteSeerX. Our model demonstrates improved performance in both model fitting and a clustering task compared to several baselines.

Keywords: author-citation network, topic model, Bayesian non-parametric

1. Introduction

Models of bibliographic data need to consider many kinds of information. Articles are usually accompanied by metadata, for example, authors, publication data, categories and time. Cited papers can also be available. When authors' topic preferences are modelled, we need to associate the document topic information somehow with the authors'. Jointly modelling text data with citation network information can be challenging for topic models, and the problem is confounded when also modelling author-topic relationships.

In this paper, we propose a topic model that jointly models authors' topic preferences, text content and the citation network. The model is a non-parametric extension of previous models discussed in Section 2. We derive a novel algorithm that allows the probability vectors in the model to be integrated out, using simple assumptions and approximations, which gives Markov chain Monte Carlo (MCMC) inference via discrete sampling. Sections 3, 4 and 5 detail our model and its inference algorithm. Applying our model to research publication data, we demonstrate the model's improved performance, on both model fitting and a clustering task, compared to baselines. We describe the datasets used in Section 6 and report on experiments in Section 7. Additionally, we qualitatively analyse the inference results produced by our model. We find that the topics returned have high comprehensibility.

2. Related Work

Latent Dirichlet Allocation (LDA) is the simplest Bayesian topic model used in modelling text, and it also allows easy learning of the model. Teh and Jordan (2010) proposed the Hierarchical Dirichlet Process (HDP) LDA, which utilises the Dirichlet process (DP) as a non-parametric prior and allows a non-symmetric, arbitrary-dimensional topic prior to be used. Furthermore, one can replace the Dirichlet prior on the word vectors with the Pitman-Yor process (PYP) (Teh, 2006b), which models the power-law of word frequency distributions in natural language (Sato and Nakagawa, 2010).

Variants of LDA allow incorporating more aspects of a particular task; here we consider authorship and citation information.
The author-topic model (ATM) (Rosen-Zvi et al., 2004) uses the authorship information to restrict topic options based on the author. Some recent work jointly models the document citation network and the text content. This includes the relational topic model (Chang and Blei, 2010), the Poisson mixed-topic link model (PMTLM) (Zhu et al., 2013) and Link-PLSA-LDA (Nallapati et al., 2008). An extensive review of these models can be found in Zhu et al. (2013). The Citation Author Topic (CAT) model (Tu et al., 2010) models the author-author network on publications based on citations, using an extension of the ATM. Our work differs from CAT in that we model the author-document-citation network rather than the author-author network. The Topic-Link LDA (Liu et al., 2009) jointly models author and text by using the distance between the document and author topic vectors. Similarly, the Twitter-Network topic model (Lim et al., 2013) models the author ("follower") network based on author topic vectors, but uses a Gaussian process to model the network. Our work considers the author-document-citation setting of Liu et al. (2009) using the techniques developed in Lim et al. (2013), but uses the PMTLM of Zhu et al. (2013) to model the network, which lets one integrate PYP hierarchies with the PMTLM using efficient MCMC sampling.

There is also existing work on analysing the degree of authors' influence. On publication data, Kataria et al. (2011) and Mimno and McCallum (2007) analyse influential authors with topic models, while Weng et al. (2010), Tang et al. (2009) and Liu et al. (2010) use topic models to analyse users' influence on social media.

3. Citation Network Topic Model

In this section, we propose a topic model that jointly models the text, the authors, and the citation network of research publications (documents). We name it the Citation Network Topic Model (CNTM). We first discuss the topic model part of the CNTM, in which the citations are not considered; this will be used for comparison later in Section 7. The full graphical model for the CNTM is displayed in Figure 1.

To clarify the notation used in this paper, variables without a subscript represent a collection of variables of the same notation. For example, $w_d$ represents all the words in document $d$, that is, $w_d = \{w_{d1}, \dots, w_{dN_d}\}$, where $N_d$ is the number of words in document $d$; and $w$ represents all words in a corpus, $w = \{w_1, \dots, w_D\}$, where $D$ is the number of documents.

[Figure 1: Graphical model for the CNTM. The box on the top left with $D^2$ entries is the citation network on documents, represented as a Boolean matrix. The remainder is a non-parametric author-topic model where the $A$ authors on the left have topic vectors that influence the $D$ document topic vectors. The $K$ topics, shown on the top right, have bursty modelling following Buntine and Mishra (2014).]

3.1. Hierarchical Pitman-Yor Topic Model

The CNTM uses the Griffiths-Engen-McCloskey (GEM) distribution (Pitman, 1996) to generate probability vectors, and the Pitman-Yor process (PYP) (Teh, 2006b) to generate probability vectors given another probability vector (called the mean or base distribution). Both the GEM and the PYP are parameterised by a discount parameter $\alpha$ and a concentration parameter $\beta$.
The PYP is additionally parameterised by a base distribution $H$, which is also the mean of the PYP. Note that the GEM distribution is equivalent to a PYP with a base distribution that generates an ordered integer label.

In modelling authors, the CNTM modifies the approach of the author-topic model (Rosen-Zvi et al., 2004), which assumes that the words in a publication are attributed equally to the different authors. This is not reflected in practice, since publications are often written mostly by the first author, except when the author order is alphabetical. The approximation we make in this work is that the first author is dominant. We could model the influence of each author on a publication, say, using a Dirichlet distribution, but we found that considering only the first author gives a simpler learning algorithm and cleaner results.

In the CNTM, we first sample a root topic distribution $\mu$ from a GEM distribution, to act as a base distribution for the author-topic distributions $\nu_a$ for each author $a$:

$$\mu \sim \mathrm{GEM}(\alpha^{\mu}, \beta^{\mu}), \qquad \nu_a \mid \mu \sim \mathrm{PYP}(\alpha^{\nu_a}, \beta^{\nu_a}, \mu).$$

Given the first author $a_d$ of each publication $d$, we sample the document-topic prior $\theta'_d$ and the document-topic distribution $\theta_d$:

$$\theta'_d \mid a_d, \nu \sim \mathrm{PYP}(\alpha^{\theta'_d}, \beta^{\theta'_d}, \nu_{a_d}), \qquad \theta_d \mid \theta'_d \sim \mathrm{PYP}(\alpha^{\theta_d}, \beta^{\theta_d}, \theta'_d).$$

Note that instead of modelling a single document-topic distribution, we model a document-topic hierarchy with $\theta'$ and $\theta$. The primed $\theta'$ represents the topics of the document in the context of the citation network. The unprimed $\theta$ represents the topics of the text, naturally related to $\theta'$ but not the same. Such modelling gives the citation information a higher impact, to counter the relatively low amount of citations compared to the text. More details on the motivation for this modelling are presented in the supplementary materials.

For the vocabulary side, we generate a background word distribution $\gamma$, where $H^{\gamma}$ is a discrete uniform vector of length $|\mathcal{V}|$ and $\mathcal{V}$ is the set of distinct word tokens observed. Then, we sample a topic-word distribution $\phi_k$ for each topic $k$, with $\gamma$ as the base distribution:

$$\gamma \sim \mathrm{PYP}(\alpha^{\gamma}, \beta^{\gamma}, H^{\gamma}), \qquad \phi_k \mid \gamma \sim \mathrm{PYP}(\alpha^{\phi_k}, \beta^{\phi_k}, \gamma).$$

Modelling word burstiness (Buntine and Mishra, 2014) is important since, as shown in Section 6, words in a document are likely to repeat within that document. This is addressed by making topics bursty, so each document focuses on only a subset of the words in each topic. We generate $\phi'_{dk}$ for each topic $k$ in document $d$:

$$\phi'_{dk} \mid \phi_k \sim \mathrm{PYP}(\alpha^{\phi'_{dk}}, \beta^{\phi'_{dk}}, \phi_k).$$

Finally, for each word $w_{dn}$ in document $d$, we sample the corresponding topic assignment $z_{dn}$ from the document-topic distribution $\theta_d$, while the word $w_{dn}$ is sampled from the topic-word distribution $\phi'_d$ given $z_{dn}$:

$$z_{dn} \mid \theta_d \sim \mathrm{Discrete}(\theta_d), \qquad w_{dn} \mid z_{dn}, \phi'_d \sim \mathrm{Discrete}(\phi'_{d z_{dn}}).$$

Note that $w$ includes words from the title and abstract, but not the full article of a publication. This is because the title and abstract provide a good summary of a publication's topics, while the full article contains too much detail.
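To make the generative hierarchy concrete, the following is a minimal sketch of the topic side in Python. It approximates the GEM and PYP draws by truncated stick-breaking; the truncation level K, the function names and the toy corpus sizes are our own illustrative choices, not part of the paper (which never instantiates these vectors, instead marginalising them out as described in Section 4).

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, beta, K):
    """Truncated Pitman-Yor stick-breaking: v_k ~ Beta(1 - alpha, beta + k * alpha)."""
    v = rng.beta(1.0 - alpha, beta + alpha * np.arange(1, K + 1))
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return w / w.sum()  # renormalise to absorb the truncated tail

def sample_pyp(alpha, beta, base, K):
    """Approximate PYP(alpha, beta, base) draw: random weights on atoms from the mean."""
    weights = stick_breaking(alpha, beta, K)
    atoms = rng.choice(len(base), size=K, p=base)  # atoms sampled from the base distribution
    out = np.zeros(len(base))
    np.add.at(out, atoms, weights)
    return out

K, A, D = 20, 5, 10
mu = stick_breaking(0.01, 0.1, K)                            # root topic distribution, mu ~ GEM
nu = [sample_pyp(0.01, 0.1, mu, K) for _ in range(A)]        # author-topic distributions nu_a
first_author = rng.integers(A, size=D)
theta_p = [sample_pyp(0.01, 0.1, nu[a], K) for a in first_author]  # document-topic priors theta'
theta = [sample_pyp(0.01, 0.1, tp, K) for tp in theta_p]           # document-topic distributions
```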
3.2. Citation Network Poisson Model

In the CNTM, we assume that the citations are generated based on the topics relevant to the publications, using the degree-corrected version of the PMTLM (Zhu et al., 2013). Denoting by $x_{ij}$ the number of times document $i$ cites document $j$, we model $x_{ij}$ with a Poisson distribution with mean parameter $\lambda_{ij}$:

$$x_{ij} \mid \lambda_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad \lambda_{ij} = \lambda^+_i \lambda^-_j \sum_k \lambda^T_k \theta'_{ik} \theta'_{jk}. \tag{1}$$

Here, $\lambda^+_i$ is the propensity of document $i$ to cite, $\lambda^-_j$ represents the popularity of cited document $j$, and $\lambda^T_k$ scales the $k$-th topic. Hence, a citation from document $i$ to document $j$ is more likely when the two documents share relevant topics. The Poisson distribution is used instead of a Bernoulli because it leads to dramatically reduced complexity in the analysis. (Note that the Poisson distribution is similar to the Bernoulli distribution when the mean parameter is small.)
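As a quick illustration of Equation (1), the following sketch assembles the Poisson rates $\lambda_{ij}$ and samples a toy citation matrix; the variable names, toy sizes and Gamma/Dirichlet placeholder draws are ours, standing in for the model's learned values.

```python
import numpy as np

rng = np.random.default_rng(0)

def citation_rate(i, j, lam_out, lam_in, lam_topic, theta_p):
    """Eq. (1): lambda_ij = lam_out[i] * lam_in[j] * sum_k lam_topic[k] * theta'_ik * theta'_jk."""
    return lam_out[i] * lam_in[j] * np.dot(lam_topic, theta_p[i] * theta_p[j])

# Toy setup: D documents, K topics; rows of theta_p are the document-topic priors theta'.
D, K = 4, 3
theta_p = rng.dirichlet(np.ones(K), size=D)
lam_out = rng.gamma(1.0, 1.0, size=D)    # propensity of each document to cite
lam_in = rng.gamma(1.0, 1.0, size=D)     # popularity of each document as a citation target
lam_topic = rng.gamma(1.0, 1.0, size=K)  # per-topic scaling

rates = np.array([[citation_rate(i, j, lam_out, lam_in, lam_topic, theta_p)
                   for j in range(D)] for i in range(D)])
x = rng.poisson(rates)   # x_ij: number of times document i cites document j
```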
4. Model Representation and Posterior

Before presenting the posterior used to develop the MCMC sampler, we briefly review the handling of hierarchical PYP models in Section 4.1. We cannot provide an adequately detailed review in this paper, so we present only the main ideas.

4.1. Modelling with Hierarchical PYPs

The key to efficient Gibbs sampling with PYPs is to marginalise out the probability vectors (e.g. topic distributions) in the model and record various associated counts instead, thus yielding a collapsed sampler. While a common approach here is to use the hierarchical Chinese Restaurant Process (CRP) of Teh and Jordan (2010), we use another representation that requires no dynamic memory and has better inference efficiency (Chen et al., 2011).

We denote by $f(\mathcal{N})$ the marginalised likelihood associated with the probability vector $\mathcal{N}$. Since the vector is marginalised out, the marginalised likelihood is expressed (in CRP terminology) in terms of the customer counts $c^{\mathcal{N}} = (\dots, c^{\mathcal{N}}_k, \dots)$ and the table counts $t^{\mathcal{N}} = (\dots, t^{\mathcal{N}}_k, \dots)$. The customer count $c^{\mathcal{N}}_k$ corresponds to the number of data points (e.g. words) assigned to group $k$ (e.g. topic) for variable $\mathcal{N}$. The table counts $t^{\mathcal{N}}$ represent the subset of $c^{\mathcal{N}}$ that gets passed up the hierarchy (as customers for the parent probability vector of $\mathcal{N}$). We also denote by $C^{\mathcal{N}} = \sum_k c^{\mathcal{N}}_k$ the total customer count for node $\mathcal{N}$, and similarly $T^{\mathcal{N}} = \sum_k t^{\mathcal{N}}_k$ is the total table count. The marginalised likelihood is

$$f(\mathcal{N}) = \frac{(\beta_{\mathcal{N}} | \alpha_{\mathcal{N}})_{T^{\mathcal{N}}}}{(\beta_{\mathcal{N}})_{C^{\mathcal{N}}}} \prod_k S^{c^{\mathcal{N}}_k}_{t^{\mathcal{N}}_k, \alpha_{\mathcal{N}}}, \qquad \text{for } \mathcal{N} \sim \mathrm{PYP}(\alpha_{\mathcal{N}}, \beta_{\mathcal{N}}, \mathcal{P}). \tag{2}$$

Here $S^x_{y,\alpha}$ is the generalised Stirling number; both $(x)_C$ and $(x|y)_C$ denote the Pochhammer symbol (rising factorial); see Buntine and Hutter (2012) for details. Note that the GEM distribution behaves like a PYP in which the table count $t^{\mathcal{N}}_k$ is always 1 for non-zero $c^{\mathcal{N}}_k$.

The innovation of Chen et al. (2011) was to notice that sampling with Equation (2) directly led to poor performance due to inadequate mixing. They introduce a new Bernoulli indicator variable $u^{\mathcal{N}}_k$ for each customer who has contributed a "+1" to $c^{\mathcal{N}}_k$. A value $u^{\mathcal{N}}_k = 1$ indicates that the customer has opened a new table, which means the customer has also contributed a "+1" to $t^{\mathcal{N}}_k$ and has thus been passed up the hierarchy to the parent variable $\mathcal{P}$. The process repeats at the parent node because the "+1" to $t^{\mathcal{N}}_k$ is inherited as a "+1" to $c^{\mathcal{P}}_k$, and we then need to consider the value of $u^{\mathcal{P}}_k$. If $u^{\mathcal{N}}_k = 0$, then a "+1" was not inherited and a corresponding $u^{\mathcal{P}}_k$ does not exist. The use of indicator variables has been empirically shown to lead to better mixing of the samplers.

Note that even though the probability vectors are integrated out and not explicitly stored, they can easily be estimated from the associated counts. The probability vector $\mathcal{N}$ is estimated from its counts and the parent probability vector $\mathcal{P}$ using standard CRP estimation:

$$\mathcal{N}_k = \frac{(\alpha_{\mathcal{N}} T^{\mathcal{N}} + \beta_{\mathcal{N}})\, \mathcal{P}_k + c^{\mathcal{N}}_k - \alpha_{\mathcal{N}} t^{\mathcal{N}}_k}{\beta_{\mathcal{N}} + C^{\mathcal{N}}}. \tag{3}$$
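Equation (3) is straightforward to implement once the counts are available; here is a minimal sketch, where the argument names are our own. The result is a proper probability vector whenever the parent estimate is.

```python
import numpy as np

def estimate_probs(c, t, alpha, beta, parent):
    """Eq. (3): estimate a marginalised PYP-distributed vector N from its
    customer counts c, table counts t and the parent's estimated vector."""
    c, t = np.asarray(c, float), np.asarray(t, float)
    T, C = t.sum(), c.sum()
    return ((alpha * T + beta) * parent + c - alpha * t) / (beta + C)

# Toy example: a 3-topic node under a uniform parent distribution.
parent = np.full(3, 1.0 / 3)
print(estimate_probs([5, 2, 0], [2, 1, 0], alpha=0.01, beta=0.1, parent=parent))
```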
4.2. Likelihood for the Hierarchical PYP Topic Model

We use boldface capital letters to denote the set of all relevant lower-case variables; for example, $\mathbf{Z} = \{z_{11}, \dots, z_{DN_D}\}$ denotes the set of all topic assignments. The variables $\mathbf{W}$, $\mathbf{T}$ and $\mathbf{C}$ are defined similarly, that is, they denote the set of all words, table counts and customer counts respectively. Additionally, we denote by $\zeta$ the set of all hyperparameters (such as the $\alpha$'s). With the probability vectors replaced by the counts, the likelihood of the topic model can be written, in terms of $f(\cdot)$, as

$$p(\mathbf{Z}, \mathbf{W}, \mathbf{T}, \mathbf{C} \mid \zeta) \propto f(\mu) \left( \prod_{a=1}^{A} f(\nu_a) \right) \left( \prod_{d=1}^{D} f(\theta'_d)\, f(\theta_d) \prod_{k=1}^{K} f(\phi'_{dk}) \right) \left( \prod_{k=1}^{K} f(\phi_k) \right) f(\gamma) \prod_v \left( \frac{1}{|\mathcal{V}|} \right)^{t^{\gamma}_v}. \tag{4}$$

Note that the last term in Equation (4) corresponds to the parent probability vector of $\gamma$ (see Section 3.1), and $v$ indexes the unique word tokens in the vocabulary set $\mathcal{V}$.

4.3. Likelihood for the Citation Network Poisson Model

For the citation network, the Poisson likelihood for each $x_{ij}$ uses the definition of $\lambda_{ij}$ in Equation (1). Note that the term $x_{ij}!$ is dropped because the data are limited to $x_{ij} \in \{0, 1\}$, so $x_{ij}!$ evaluates to 1. With conditional independence of the $x_{ij}$, the joint likelihood for the whole citation network $\mathbf{X} = \{x_{11}, \dots, x_{DD}\}$ can be written as

$$p(\mathbf{X} \mid \lambda, \theta') = \left( \prod_i (\lambda^+_i)^{g^+_i} (\lambda^-_i)^{g^-_i} \right) \left( \prod_{ij} \Big( \sum_k \lambda^T_k \theta'_{ik} \theta'_{jk} \Big)^{x_{ij}} \right) \exp\left( - \sum_{ijk} \lambda^+_i \lambda^-_j \lambda^T_k \theta'_{ik} \theta'_{jk} \right),$$

where $g^+_i$ is the number of citations made by publication $i$, $g^+_i = \sum_j x_{ij}$, and $g^-_i$ is the number of times publication $i$ is cited, $g^-_i = \sum_j x_{ji}$. We also make the simplifying assumption that $x_{ii} = 1$ for all documents $i$, that is, all publications are treated as self-cited. (Technically, defining $x_{ii}$ allows us to rewrite the joint likelihood into another form for efficient caching.)

In the next section, we demonstrate that our model representation gives rise to an intuitive sampling algorithm for learning the model. We also show how the Poisson model integrates into the topic modelling framework.

5. Inference Techniques

Here, we derive the Markov chain Monte Carlo (MCMC) algorithms for learning the Citation Network Topic Model. We first detail the Gibbs sampler for the topic model and then discuss the Metropolis-Hastings (MH) algorithm for the citation network. The full inference procedure is performed by alternating between the Gibbs sampler and the MH algorithm.

5.1. Collapsed Gibbs Sampler for the Hierarchical PYP Topic Model

To jointly sample the words' topics and the associated counts in the CNTM, we use a collapsed Gibbs sampler designed for the PYP (Chen et al., 2011). The concept of the sampler is analogous to that for LDA: decrement the counts associated with a word, sample a new topic assignment for the word, and increment the associated counts. Our collapsed Gibbs sampler is more complicated than LDA's: in particular, we have to consider the indicators $u^{\mathcal{N}}_k$ described in Section 4.1 operating on the hierarchy of PYPs.

The sampler proceeds by considering the latent variables associated with a given word $w_{dn}$. First, we decrement the effects of the latent variables: the topic $z_{dn} = k$ and the chain of indicator variables $u^{\theta_d}_k, u^{\theta'_d}_k, u^{\nu_{a_d}}_k, u^{\mu}_k$ (where they exist). After decrementing, we jointly sample a new topic $z_{dn}$ and the associated indicators (which contribute "+1" to counts) for word $w_{dn}$ from their joint conditional posterior distribution:

$$p(z_{dn}, \mathbf{T}, \mathbf{C} \mid \mathbf{Z}^{-dn}, \mathbf{W}, \mathbf{T}^{-dn}, \mathbf{C}^{-dn}, \zeta) = \frac{p(\mathbf{Z}, \mathbf{W}, \mathbf{T}, \mathbf{C} \mid \zeta)}{p(\mathbf{Z}^{-dn}, \mathbf{W}, \mathbf{T}^{-dn}, \mathbf{C}^{-dn} \mid \zeta)}, \tag{5}$$

where the superscript $-dn$ indicates that the topic $z_{dn}$, the indicators and the associated counts for word $w_{dn}$ are not observed in the respective sets, i.e. the state after decrementing. The modularised likelihood of Equation (4) allows the conditional posterior (Equation (5)) to be computed easily, since it simplifies to ratios of the likelihoods $f(\cdot)$, which simplify further because the counts differ by at most 1 during sampling. For instance, the ratio of Pochhammer symbols $(x|y)_{C+1} / (x|y)_C$ simplifies to $x + Cy$, while a ratio of Stirling numbers such as $S^{y+1}_{x+1,\alpha} / S^{y}_{x,\alpha}$ can be computed quickly via caching (Buntine and Hutter, 2012).

Sampling a new $z_{dn} = k$ corresponds to incrementing the counts for variable $\theta_d$, that is, "+1" to $c^{\theta_d}_k$ and possibly also "+1" to $t^{\theta_d}_k$. If $t^{\theta_d}_k$ is incremented, then $c^{\theta'_d}_k$ will be incremented too, but $t^{\theta'_d}_k$ may or may not be, as dictated by the sampled indicators $u_k$. The process is repeated up to the root $\mu$; since $\mu$ is GEM distributed, incrementing $t^{\mu}_k$ is equivalent to sampling a new topic, i.e. the number of topics increases by 1. The procedure on the vocabulary side ($\phi$, etc.) is similar.
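The following sketch shows the decrement / sample / increment pattern in its simplest, flat (LDA-style) collapsed form. The actual CNTM sampler additionally resamples the table counts and Bernoulli indicators up the PYP hierarchy and uses the cached Stirling-number ratios described above; we omit that machinery here for brevity, so this is an illustration of the sampling pattern, not the paper's sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(docs, z, c_dk, c_kw, c_k, K, V, alpha=0.1, beta=0.01):
    """One sweep of the decrement / sample / increment pattern (flat LDA-style)."""
    for d, words in enumerate(docs):
        for n, w in enumerate(words):
            k = z[d][n]
            c_dk[d, k] -= 1; c_kw[k, w] -= 1; c_k[k] -= 1       # decrement
            p = (c_dk[d] + alpha) * (c_kw[:, w] + beta) / (c_k + V * beta)
            k = rng.choice(K, p=p / p.sum())                    # sample a new topic
            z[d][n] = k
            c_dk[d, k] += 1; c_kw[k, w] += 1; c_k[k] += 1       # increment

# Toy corpus: 2 documents over a vocabulary of 5 word types.
docs = [[0, 1, 2, 2], [3, 4, 4, 1]]
K, V = 3, 5
z = [[int(rng.integers(K)) for _ in d] for d in docs]
c_dk = np.zeros((len(docs), K)); c_kw = np.zeros((K, V)); c_k = np.zeros(K)
for d, words in enumerate(docs):
    for n, w in enumerate(words):
        c_dk[d, z[d][n]] += 1; c_kw[z[d][n], w] += 1; c_k[z[d][n]] += 1
for _ in range(10):
    gibbs_sweep(docs, z, c_dk, c_kw, c_k, K, V)
```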
5.2. Metropolis-Hastings Algorithm for the Citation Network

We propose a novel MH algorithm that allows the probability vectors to remain integrated out, thus retaining the fast discrete sampling procedure for the PYP and GEM hierarchy, rather than, for instance, resorting to an expectation-maximisation (EM) algorithm or a variational approach. We introduce an auxiliary variable $y_{ij}$, named the citing topic, to denote the topic that prompts publication $i$ to cite publication $j$. To illustrate: for a biology publication that cites a machine learning publication for its learning technique, the citing topic would be 'machine learning' rather than 'biology'. From Equation (1), a citing topic $y_{ij}$ is jointly Poisson with $x_{ij}$:

$$x_{ij}, y_{ij} = k \mid \lambda, \theta' \sim \mathrm{Poisson}\big(\lambda^+_i \lambda^-_j \lambda^T_k \theta'_{ik} \theta'_{jk}\big). \tag{6}$$

Incorporating $\mathbf{Y}$, the set of all $y_{ij}$, we rewrite the citation network likelihood as

$$p(\mathbf{X}, \mathbf{Y} \mid \lambda, \theta') \propto \prod_i (\lambda^+_i)^{g^+_i} (\lambda^-_i)^{g^-_i} \prod_k (\lambda^T_k)^{\frac{1}{2} \sum_i h_{ik}} \prod_{ik} (\theta'_{ik})^{h_{ik}} \exp\left( - \sum_{ij} \lambda^+_i \lambda^-_j \lambda^T_{y_{ij}} \theta'_{i y_{ij}} \theta'_{j y_{ij}} \right),$$

where $h_{ik} = \sum_j x_{ij} I(y_{ij} = k) + \sum_j x_{ji} I(y_{ji} = k)$ is the number of connections publication $i$ makes due to topic $k$. To integrate out $\theta'$, we note that the terms $(\theta'_{ik})^{h_{ik}}$ look like a multinomial likelihood, so we absorb them into the likelihood $p(\mathbf{Z}, \mathbf{W}, \mathbf{T}, \mathbf{C} \mid \zeta)$, where they correspond to additional counts for $c^{\theta'_i}$, with $h_{ik}$ added to $c^{\theta'_i}_k$. To disambiguate the source of the counts, we refer to the customer counts contributed by the $x_{ij}$ as network counts, and denote the augmented counts ($\mathbf{C}$ plus network counts) by $\mathbf{C}^+$. For the exponential term, we use a Delta method approximation, $\int f(\theta) \exp(-g(\theta))\, d\theta \approx \exp(-g(\hat{\theta})) \int f(\theta)\, d\theta$, where $\hat{\theta}$ is the expected value under a distribution proportional to $f(\theta)$. This approximation is reasonable as long as the terms in the exponential are small (see the supplementary material). The approximate full posterior of the CNTM can then be written as

$$p(\mathbf{Z}, \mathbf{W}, \mathbf{T}, \mathbf{C}, \mathbf{X}, \mathbf{Y} \mid \lambda, \zeta) \approx p(\mathbf{Z}, \mathbf{W}, \mathbf{T}, \mathbf{C}^+ \mid \zeta) \prod_i (\lambda^+_i)^{g^+_i} (\lambda^-_i)^{g^-_i} \prod_k (\lambda^T_k)^{\frac{1}{2} \sum_i h_{ik}} \exp\left( - \sum_{ij} \lambda^+_i \lambda^-_j \lambda^T_{y_{ij}} \hat{\theta}'_{i y_{ij}} \hat{\theta}'_{j y_{ij}} \right). \tag{7}$$

The MH algorithm can be summarised in three steps: estimate the document-topic prior $\theta'$, propose a new citing topic $y_{ij}$ from Equation (6), and accept or reject the proposed $y_{ij}$ following an MH scheme with Equation (7). We present the details of the MH sampler in the supplementary material. Note that the MH algorithm is similar to the collapsed Gibbs sampler, in that we decrement the counts, sample a new state and update the counts. Since all probability vectors are represented as counts, we never need to deal with their vector form in the collapsed Gibbs sampler. Additionally, our MH algorithm is intuitive and simple to implement: like the words in a document, each citation is assigned a topic, so the words and citations can be thought of as voting to determine a document's topics.

5.3. Hyperparameter Sampling

Hyperparameter sampling for the priors is important (Wallach et al., 2009). In our inference algorithm, we sample the concentration parameters $\beta$ of all PYPs with an auxiliary variable sampler (Teh, 2006a), outlined in the supplementary material, but leave the discount parameters $\alpha$ fixed. We do not sample the $\alpha$ because of the coupling of this parameter with the Stirling numbers cache.

In addition to the PYP hyperparameters, we also sample $\lambda^+$, $\lambda^-$ and $\lambda^T$ with a Gibbs sampler. We let the hyperpriors for $\lambda^+$, $\lambda^-$ and $\lambda^T$ be Gamma distributed with shape $\epsilon_0$ and rate $\epsilon_1$. With the conjugate Gamma prior, the posteriors for $\lambda^+_i$, $\lambda^-_i$ and $\lambda^T_k$ are also Gamma distributed, so they can be sampled directly:

$$\lambda^+_i \mid \mathbf{X}, \lambda^-, \lambda^T, \theta' \sim \mathrm{Gamma}\Big(\epsilon_0 + g^+_i,\; \epsilon_1 + \sum_k \lambda^T_k \theta'_{ik} \sum_j \lambda^-_j \theta'_{jk}\Big),$$

$$\lambda^-_i \mid \mathbf{X}, \lambda^+, \lambda^T, \theta' \sim \mathrm{Gamma}\Big(\epsilon_0 + g^-_i,\; \epsilon_1 + \sum_k \lambda^T_k \theta'_{ik} \sum_j \lambda^+_j \theta'_{jk}\Big),$$

$$\lambda^T_k \mid \mathbf{X}, \mathbf{Y}, \lambda^+, \lambda^-, \theta' \sim \mathrm{Gamma}\Big(\epsilon_0 + \tfrac{1}{2} \sum_i h_{ik},\; \epsilon_1 + \Big(\sum_j \lambda^+_j \theta'_{jk}\Big)\Big(\sum_j \lambda^-_j \theta'_{jk}\Big)\Big).$$

In this paper, we apply vague priors to the hyperpriors by setting $\epsilon_0 = \epsilon_1 = 1$.
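A minimal sketch of these Gamma updates follows, assuming the citation matrix x, the estimated priors theta' and the network counts h are available. Note that numpy parameterises the Gamma by a scale (the inverse of the rate), and that for clarity the topic-side sums are computed once per call rather than refreshed after every individual update, so this is an approximation of the exact sequential Gibbs updates.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_lambdas(x, theta_p, lam_out, lam_in, lam_topic, h, eps0=1.0, eps1=1.0):
    """Gibbs updates for lam^+, lam^- and lam^T from their conjugate Gamma posteriors.
    x is the D x D citation matrix, theta_p the D x K document-topic priors theta',
    and h[i, k] the number of connections of document i attributed to topic k."""
    g_out, g_in = x.sum(axis=1), x.sum(axis=0)   # g^+_i and g^-_i
    s_out = theta_p.T @ lam_out                  # s_out[k] = sum_j lam_out[j] * theta'_jk
    s_in = theta_p.T @ lam_in                    # s_in[k]  = sum_j lam_in[j]  * theta'_jk
    for i in range(len(lam_out)):
        lam_out[i] = rng.gamma(eps0 + g_out[i],
                               1.0 / (eps1 + lam_topic @ (theta_p[i] * s_in)))
        lam_in[i] = rng.gamma(eps0 + g_in[i],
                              1.0 / (eps1 + lam_topic @ (theta_p[i] * s_out)))
    for k in range(len(lam_topic)):
        lam_topic[k] = rng.gamma(eps0 + 0.5 * h[:, k].sum(),
                                 1.0 / (eps1 + s_out[k] * s_in[k]))
    return lam_out, lam_in, lam_topic
```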
We summarise the full inference algorithm for the CNTM in Algorithm 1.

Algorithm 1: Inference Algorithm for the Citation Network Topic Model
1. Initialise the model by assigning a random topic assignment $z_{dn}$ to each word $w_{dn}$ and constructing the relevant customer counts $c^{\mathcal{N}}_k$ and table counts $t^{\mathcal{N}}_k$ for all variables $\mathcal{N}$.
2. For each word $w_{dn}$ in each document $d$:
   i. Decrement the counts associated with $z_{dn}$ and $w_{dn}$.
   ii. Block-sample a new topic $z_{dn}$ and the associated $\mathbf{T}$ and $\mathbf{C}$ from Equation (5).
3. For each citation $x_{ij}$:
   i. Decrement the network counts associated with $x_{ij}$ and $y_{ij}$.
   ii. Sample a new citing topic $y_{ij}$ from the joint posterior of Equation (6).
   iii. Accept or reject the sampled $y_{ij}$ with an MH scheme using Equation (7).
4. Update the hyperparameters $\beta$, $\lambda^+$, $\lambda^-$ and $\lambda^T$.
5. Repeat steps 2-4 until the model converges or a fixed number of iterations is reached.

6. Data

We perform our experiments on subsets of CiteSeerX data (http://citeseerx.ist.psu.edu/), which consists of scientific publications. Each publication from CiteSeerX is accompanied by title, abstract, keywords, authors, citations and other metadata. We prepare three publication datasets from CiteSeerX for evaluation. The first dataset corresponds to Machine Learning (ML) publications, which are queried from CiteSeerX using keywords from Microsoft Academic Search (http://academic.research.microsoft.com/). The ML dataset contains 139,227 publications.

Our second dataset corresponds to publications from 10 distinct research areas: agriculture, archaeology, biology, computer science, financial economics, industrial engineering, material science, petroleum chemistry, physics and social science. The query words for these 10 disciplines are chosen such that the publications form distinct clusters. We name this dataset M10 (Multidisciplinary 10 classes); it is made up of 10,310 publications.

For the third dataset, we query publications from both arts and science disciplines. The arts publications are drawn from history and religion, while the science publications cover physics, chemistry and biology. This dataset consists of 18,720 publications and is named AvS (Arts versus Science) in this paper.

The keywords used to create the datasets are obtained from Microsoft Academic Search and are listed in the supplementary material. For the clustering evaluation in Section 7.1.2, we treat the query categories as the ground truth. However, publications that span multiple disciplines can be problematic for clustering evaluation, so we simply remove the publications that satisfy the queries of more than one discipline. Nonetheless, the labels are inherently noisy. The metadata for the publications can also be noisy: for instance, the authors field may sometimes display the publication's keywords instead of the authors, the publication title is sometimes a URL, and a table of contents can be mistakenly parsed as the abstract. We discuss our treatment of these issues in Section 6.1. We also note that non-English publications are discarded using langid.py (Lui and Baldwin, 2012).

In addition to the manually queried datasets, we also make use of existing datasets from LINQS (Sen et al., 2008; http://linqs.cs.umd.edu/projects/projects/lbc/) to facilitate comparison with existing work. In particular, we use their CiteSeer, Cora and PubMed datasets. Their CiteSeer data consists of Artificial Intelligence publications, and hence we name that dataset AI in this paper. Although these datasets are small, they are fully labelled and thus useful for clustering evaluation.
However, they do not come with additional metadata such as the authors. Note that the AI and Cora datasets are presented as Boolean matrices, i.e. the word count information is lost and all words in a document are assumed to occur only once. Although this representation is less useful for topic modelling, we still use these datasets for the sake of comparison. Also note that the word counts were converted to TF-IDF in the PubMed dataset, so we recover the word counts under a reasonable assumption; see the supplementary material for the recovery process. In Table 1, we present a summary of the datasets used in this paper.

Table 1: Summary of the datasets used in the paper, showing the number of publications, citations, authors and unique word tokens, the average number of words per document, and (last column) the average percentage of unique words repeated in a document. Author information is not available for the last three datasets.

    Dataset     Publications   Citations   Authors   Vocabulary   Words/Doc   %Repeat
    1. ML          139,227     1,105,462    43,643      8,322        59.4       23.3
    2. M10          10,310        77,222     6,423      2,956        57.8       24.3
    3. AvS          18,720        54,601    11,898      4,770        58.9       17.0
    4. AI            3,312         4,608         -      3,703        31.8          -
    5. Cora          2,708         5,429         -      1,433        18.2          -
    6. PubMed       19,717        44,335         -      4,209        67.6       40.1

6.1. Data Noise Removal

Here, we briefly discuss the steps taken to cleanse the noise from the CiteSeerX datasets (ML, M10 and AvS). The keywords field of the publications is often empty and is sometimes noisy, that is, it contains irrelevant information such as section headings and titles, which makes the keywords an unreliable source of category information. Instead, we simply treat the keywords as part of the abstract. We also remove URLs from the data, since they do not provide any additional useful information.

Moreover, the author information is not consistently presented in CiteSeerX. Some authors are shown with their full name, some with the first name initialised, while others are prefixed with a title (Prof, Dr, etc.). We therefore standardise the author information by removing all titles from the authors, initialising all first names and discarding the middle names. Although standardisation allows us to match up the authors, it does not solve the problem that different authors who have the same initial and last name are treated as a single author. For example, both Bruce Lee and Brett Lee are standardised to B Lee. This is a whole research problem in itself (Han et al., 2004, 2005) and hence is not addressed in this paper. Occasionally, institutions are mistakenly treated as authors in the CiteSeerX data; examples include American Mathematical Society and Technische Universität München. In this case, we simply remove the incorrect authors using a list of exclusion words for authors (the list is presented in the supplementary material).
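A minimal sketch of this standardisation follows; the title list is an illustrative (and deliberately incomplete) stand-in of our own, not the paper's actual exclusion list.

```python
import re

TITLES = {"prof", "dr", "mr", "mrs", "ms"}  # illustrative only

def standardise_author(name):
    """Strip titles, initialise the first name and drop middle names,
    so 'Prof. Bruce Lee' and 'B. Lee' both map to 'B Lee'."""
    parts = [p for p in re.split(r"[\s.]+", name.strip()) if p]
    parts = [p for p in parts if p.lower() not in TITLES]
    if len(parts) < 2:
        return " ".join(parts)
    return f"{parts[0][0].upper()} {parts[-1]}"

assert standardise_author("Prof. Bruce Lee") == "B Lee"
assert standardise_author("Brett Lee") == "B Lee"   # the name collision noted above
```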
6.2. Text Preprocessing

Here, we discuss the preprocessing pipeline adopted for the queried datasets (the LINQS data were already processed). First, since publication text contains many technical terms made up of multiple words, we tokenise the text using phrases (or collocations) instead of unigram words. Thus, a phrase like decision tree is treated as a single token rather than two distinct words. The phrases are extracted from the respective datasets using LingPipe (http://alias-i.com/lingpipe/). In this paper, we use the word "words" to mean both unigram words and phrases.

We then lowercase all the words and filter out certain words. The words removed are stop words, common words and rare words. More specifically, we use the stop word list from MALLET (http://mallet.cs.umass.edu/), we define common words as words that appear in more than 18% of the publications, and rare words as words that occur fewer than 50 times in each dataset. These thresholds were determined by inspecting the removed words. Finally, the tokenised words are stored as arrays of integers. We also split each dataset into a 90% training set for training the topic models and a 10% test set for the evaluations detailed in Section 7.
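A sketch of the filtering step, assuming documents arrive as lists of lowercased tokens with phrases already merged (the LingPipe phrase extraction itself is not reproduced here); the thresholds mirror those above, and the function name is our own.

```python
from collections import Counter

def filter_vocabulary(docs, stopwords, common_frac=0.18, rare_count=50):
    """Remove stop words, words appearing in more than `common_frac` of documents,
    and words occurring fewer than `rare_count` times in the corpus; return the
    documents as arrays of integer word ids plus the vocabulary mapping."""
    doc_freq = Counter(w for d in docs for w in set(d))
    term_freq = Counter(w for d in docs for w in d)
    keep = {w for w in term_freq
            if w not in stopwords
            and doc_freq[w] <= common_frac * len(docs)
            and term_freq[w] >= rare_count}
    vocab = {w: i for i, w in enumerate(sorted(keep))}
    return [[vocab[w] for w in d if w in vocab] for d in docs], vocab
```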
7. Experiments

In this section, we describe experiments that compare the CNTM against several baseline topic models. The baselines are HDP-LDA with burstiness (Buntine and Mishra, 2014), a non-parametric extension of the ATM, the Poisson mixed-topic link model (PMTLM) (Zhu et al., 2013) and the CNTM without the citation network. We evaluate these models quantitatively with goodness-of-fit and clustering measures. We qualitatively analyse the topics produced and perform topic analysis on the authors. Additionally, we experiment with merging authors who have a low number of publications and grouping them based on categories. This gives us semi-supervised topic modelling in which some labels are known, namely for authors who do not publish much. Finally, a discussion of the algorithm's running time and a convergence analysis are presented in the supplementary material.

In the following experiments, we initialise the concentration parameters $\beta$ of all PYPs to 0.1, noting that these hyperparameters are updated automatically. We set the discount parameters $\alpha$ to 0.7 for all PYPs corresponding to the "word" side of the CNTM (i.e. $\gamma$, $\phi$, $\phi'$). This induces power-law behaviour in the word distributions. We simply fix $\alpha$ to 0.01 for all other PYPs. Note that the number of topics grows with the data in non-parametric topic modelling. To prevent the learned topics from becoming too fine-grained, we set a limit on the maximum number of topics that can be learned: 20 topics for the ML dataset, 50 for M10 and 30 for AvS. For all the topic models, our experiments find that the number of topics always converges to the cap. For the AI, Cora and PubMed datasets, we fix the number of topics to 6, 7 and 3 respectively, simply for comparison against the PMTLM.

When training the topic models, we run the inference algorithm for 2,000 iterations. For the CNTM, the MH algorithm for the citation network is performed only after 1,000 iterations, so that the topics can be learned first. This gives a faster learning algorithm and also allows us to assess the "value added" by the citation network to topic modelling (elaborated further in the supplementary material with a likelihood comparison). We repeat each experiment five times to reduce the estimation error of the evaluation measures.

7.1. Quantitative Results

7.1.1. Goodness-of-fit and Perplexity

Perplexity is a popular metric used to evaluate the goodness-of-fit of a topic model. Perplexity is negatively related to the likelihood of the observed words given the model, and lower is better. Perplexity, estimated using document completion, is given by

$$\mathrm{perplexity}(\mathbf{W}) = \exp\left( - \frac{\sum_{d=1}^{D} \sum_{n=1}^{N_d} \log p(w_{dn} \mid \theta_d, \phi)}{\sum_{d=1}^{D} N_d} \right),$$

where $p(w_{dn} \mid \theta_d, \phi)$ is obtained by summing over all possible topics:

$$p(w_{dn} \mid \theta_d, \phi) = \sum_k p(w_{dn} \mid z_{dn} = k, \phi_k)\, p(z_{dn} = k \mid \theta_d) = \sum_k \phi_{k w_{dn}} \theta_{dk}.$$

The topic distribution $\theta$ is unknown for the test documents. Instead of using half of the text in the test set to estimate $\theta$, which is standard practice, we use only the words from the title to estimate $\theta$. One reason for this is that, although a title is usually short, it is a good indicator of topic. Moreover, using only the title allows more words to be used in calculating the perplexity. The technical details of estimating $\theta$ are presented in the supplementary material. Note that the perplexity estimate is unbiased, since the data used in estimating $\theta$ are not used for evaluation.

We present the perplexity results in Table 2, which clearly show the significantly better performance of the CNTM against the baselines (in this paper, significance is quantified at the 5% significance level). The inclusion of citation information also provides a significant improvement in model fitting, as shown by the comparison of the CNTM with and without the network component.

Table 2: Perplexity for the train and test documents on ML and M10; lower is better.

                                    ML                                M10
                            Train             Test             Train            Test
    Bursty HDP-LDA      4904.24 ± 71.34   4992.94 ± 65.57   1959.36 ± 32.77   2265.18 ± 68.19
    Non-parametric ATM  2238.19 ± 12.22   2460.28 ± 11.34   1562.85 ± 18.11   1814.03 ± 23.18
    CNTM w/o network    1918.21 ±  4.31   2057.61 ±  3.56    912.69 ± 10.94   1186.11 ±  8.32
    CNTM w/ network     1851.82 ±  8.50   1990.78 ± 11.36    824.04 ± 11.96   1048.33 ± 21.39
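A sketch of the perplexity computation above, assuming theta has already been estimated from the title words of each test document (that estimation is in the supplementary material) and phi is the K x |V| topic-word matrix; the function name is ours.

```python
import numpy as np

def perplexity(test_docs, theta, phi):
    """Document-completion perplexity over the held-out (non-title) words;
    test_docs holds integer word ids, theta[d] the estimated topic distribution."""
    log_lik, n_words = 0.0, 0
    for d, words in enumerate(test_docs):
        for w in words:
            p_w = np.dot(theta[d], phi[:, w])   # sum_k theta_dk * phi_kw
            log_lik += np.log(p_w)
            n_words += 1
    return np.exp(-log_lik / n_words)
```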
7.1.2. Document Clustering

Next, we evaluate the clustering ability of the topic models. Recall that topic models assign a topic to each word in a document, essentially performing a soft clustering in which the membership is given by the document-topic distribution $\theta$. For the following evaluation, we convert the soft clustering into a hard clustering by choosing the topic that best represents each document, hereafter called the dominant topic. The dominant topic corresponds to the topic with the highest probability in the topic distribution.

As mentioned in Section 6, we assume the ground truth classes correspond to the query categories used in creating the datasets. We evaluate the clustering performance with purity and normalised mutual information (NMI) (Manning et al., 2008). Purity is a simple clustering measure which can be interpreted as the proportion of documents correctly clustered. For ground truth classes $S = \{s_1, \dots, s_J\}$ and obtained clusters $R = \{r_1, \dots, r_K\}$, the purity and NMI are computed as

$$\mathrm{purity}(S, R) = \frac{1}{D} \sum_k \max_j |r_k \cap s_j|, \qquad \mathrm{NMI}(S, R) = \frac{2\, I(S; R)}{H(S) + H(R)},$$

where $I(S; R)$ denotes the mutual information and $H(\cdot)$ denotes the entropy:

$$I(S; R) = \sum_{k,j} \frac{|r_k \cap s_j|}{D} \log_2 \frac{D\, |r_k \cap s_j|}{|r_k|\, |s_j|}, \qquad H(R) = - \sum_k \frac{|r_k|}{D} \log_2 \frac{|r_k|}{D}.$$

The clustering results are presented in Table 3 and Table 4. We can see that the CNTM greatly outperforms the PMTLM in the NMI evaluation. Note that, for a fair comparison against the PMTLM, the experiments on the AI, Cora and PubMed datasets are evaluated with 10-fold cross-validation. Additionally, we point out that since no author information is provided in these three datasets, the CNTM becomes a variant of HDP-LDA, but with the PYP instead of the DP. We find in Table 4 that the clustering performance of the CNTM with or without the network is similar. This is likely because the publications in each of these datasets are highly related to one another (see the list of category labels in the supplementary material), so the citation information is not discriminating enough for clustering. (Note that the NMI in Zhu et al. (2013) is slightly different from ours: we use the definition in Manning et al. (2008). This penalises our NMI results when compared against those in Zhu et al. (2013), since our normalising term is always equal to or greater than theirs.)

Table 3: Comparison of clustering performance on the M10 and AvS datasets.

                               M10                       AvS
                        Purity       NMI          Purity       NMI
    Bursty HDP-LDA    0.66 ± 0.02  0.67 ± 0.01  0.75 ± 0.03  0.66 ± 0.01
    Non-param. ATM    0.58 ± 0.01  0.63 ± 0.00  0.69 ± 0.02  0.64 ± 0.01
    CNTM w/o network  0.61 ± 0.04  0.67 ± 0.01  0.72 ± 0.03  0.66 ± 0.01
    CNTM w/ network   0.67 ± 0.03  0.69 ± 0.02  0.72 ± 0.01  0.66 ± 0.00

Table 4: Comparison of the clustering performance of the CNTM against the PMTLM. The best PMTLM results are chosen for comparison, from Table 2 in Zhu et al. (2013).

                               AI                       Cora                     PubMed
                        Purity       NMI          Purity       NMI          Purity       NMI
    PMTLM               N/A          0.51         N/A          0.41         N/A          0.27
    CNTM w/o network  0.51 ± 0.07  0.67 ± 0.02  0.37 ± 0.03  0.63 ± 0.01  0.47 ± 0.04  0.69 ± 0.01
    CNTM w/ network   0.51 ± 0.08  0.66 ± 0.02  0.39 ± 0.03  0.63 ± 0.02  0.46 ± 0.02  0.69 ± 0.01
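Both purity and NMI above are simple to compute from the dominant-topic assignments; here is a minimal sketch, with function names of our own.

```python
import numpy as np
from collections import Counter

def purity(labels, clusters):
    """Fraction of documents assigned to the majority class of their cluster."""
    total = sum(max(Counter(l for l, c in zip(labels, clusters) if c == k).values())
                for k in set(clusters))
    return total / len(labels)

def nmi(labels, clusters):
    """Normalised mutual information: 2 I(S;R) / (H(S) + H(R))."""
    n = len(labels)
    joint = Counter(zip(labels, clusters))
    pl, pc = Counter(labels), Counter(clusters)
    mi = sum((v / n) * np.log2(n * v / (pl[l] * pc[c])) for (l, c), v in joint.items())
    h = lambda counts: -sum((v / n) * np.log2(v / n) for v in counts.values())
    return 2 * mi / (h(pl) + h(pc))
```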
7.2. Author-merging for Semi-supervised Learning

Author modelling allows topic sharing between multiple documents written by the same author. However, many authors have authored only a few publications, so their treatment can be problematic. In this section, we experiment with merging these authors into groups to improve document clustering. We merge authors who have authored fewer than $\eta$ publications; to clarify, $\eta = 2$ means authors who have only a single publication are merged, while $\eta = 1$ corresponds to no merging. Additionally, we use the category labels for semi-supervised learning. This is achieved by assigning the documents to dummy authors represented by the category labels, i.e. the authors are merged into groups based on the category labels of their publications. These groups are then treated as the "authors" of the documents.

We present the clustering results for $\eta = \{2, 3, 4, 5\}$ in Figure 2 (results in table format are shown in the supplementary material). We find that increasing $\eta$ generally improves the clustering performance, although the effect between successive values of $\eta$ is not significant. Note that if $\eta$ is set too large, most of the author information will be replaced by the category labels, which defeats the purpose of author modelling.

[Figure 2: Plot of the purity and NMI results for η = {2, 3, 4, 5} on the M10 dataset; the vertical axis (purity / NMI) spans roughly 0.70 to 0.82, and the intervals represent one standard error of estimation.]

7.3. Qualitative Analysis

We can obtain a summary of a text corpus from a trained CNTM by analysing the topic-word distributions $\phi$. In Table 5, we display some major topics extracted from the ML dataset (results for M10 and AvS are in the supplementary material). The topics are represented by their top words, ordered by $\phi_{kw}$. The labels of the topics are manually assigned.

Table 5: Topic summary for the ML dataset.

    Topic                   Top Words
    Reinforcement Learning  reinforcement, agents, control, state, task
    Object Recognition      face, video, object, motion, tracking
    Data Mining             mining, data mining, research, patterns, knowledge
    SVM                     kernel, support vector, training, clustering, space
    Speech Recognition      recognition, speech, speech recognition, audio, hidden markov

Additionally, we analyse the author-topic distributions $\nu$ to find out about authors' interests. We focus on the M10 dataset, since it covers a wider range of research topics. For each author $a$, we determine their dominant topic from their author-topic distribution $\nu_a$. We display the interests of some authors in Table 6. Again, the topic labels are manually picked given the dominant topics and the corresponding top words.

Table 6: Examples of authors and their topic preferences learned by the CNTM.

    Author        Topic             Top Words
    D. Aerts      Quantum Theory    quantum, theory, quantum mechanics, classical, quantum field
    Y. Bengio     Neural Network    networks, learning, recurrent neural, neural networks, models
    C. Boutilier  Decision Making   decision making, agents, decision, theory, agent
    S. Thrun      Robot Learning    robot, robots, control, autonomous, learning
    M. Baker      Financial Market  market, risk, firms, returns, financial

Furthermore, we can graphically visualise the author-topic network extracted by the CNTM with Graphviz (http://www.graphviz.org/); this is detailed in the supplementary material due to space constraints.

8. Conclusions

In this paper, we have proposed the Citation Network Topic Model (CNTM) to jointly model research publications and their citation network. The CNTM performs text modelling with a hierarchical PYP topic model and models the citations with the Poisson distribution. We also proposed a novel learning algorithm for the CNTM, which exploits the conjugacy of the Dirichlet and multinomial distributions, allowing the sampling of the citation network to take a form similar to the collapsed Gibbs sampler of a topic model. As discussed, our learning algorithm is intuitive and easy to implement.

The CNTM offers substantial performance improvements over previous work (Zhu et al., 2013). On three CiteSeerX datasets and three existing datasets, we demonstrate the improvement from joint topic and network modelling in terms of both model fitting and clustering evaluation. Additionally, we experiment with merging authors who do not have many publications into groups of similar authors based on the query categories, giving us semi-supervised learning. We find that clustering performance improves with the level of merging. Future work includes learning the influence of co-authors, utilising it for author merging, and further speeding up non-parametric modelling with the techniques of Li et al. (2014).

Acknowledgments

NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. The authors wish to thank CiteSeerX for providing the data.

References

W. Buntine and M. Hutter. A Bayesian view of the Poisson-Dirichlet process. Technical Report arXiv:1007.0296v2, 2012.

W. Buntine and S. Mishra. Experiments with non-parametric topic models. In KDD, pages 881-890. ACM, 2014.
J. Chang and D. Blei. Hierarchical relational models for document networks. The Annals of Applied Statistics, 4(1):124-150, 2010.

C. Chen, L. Du, and W. Buntine. Sampling table configurations for the hierarchical Poisson-Dirichlet process. In ECML, pages 296-311. Springer-Verlag, 2011.

H. Han, C. L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In JCDL, pages 296-305. ACM, 2004.

H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a K-way spectral clustering method. In JCDL, pages 334-343. ACM, 2005.

S. Kataria, P. Mitra, C. Caragea, and C. L. Giles. Context sensitive topic models for author influence in document networks. In IJCAI, pages 2274-2280. AAAI Press, 2011.

A. Li, A. Ahmed, S. Ravi, and A. Smola. Reducing the sampling complexity of topic models. In KDD, pages 891-900. ACM, 2014.

K. W. Lim, C. Chen, and W. Buntine. Twitter-network topic model: A full Bayesian treatment for social network and text modeling. In NIPS Topic Model Workshop, 2013.

L. Liu, J. Tang, J. Han, M. Jiang, and S. Yang. Mining topic-level influence in heterogeneous networks. In CIKM, pages 199-208. ACM, 2010.

Y. Liu, A. Niculescu-Mizil, and W. Gryc. Topic-link LDA: Joint models of topic and author community. In ICML, pages 665-672. ACM, 2009.

M. Lui and T. Baldwin. langid.py: An off-the-shelf language identification tool. In ACL, pages 25-30. ACL, 2012.

C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. ISBN 0521865719, 9780521865715.

D. Mimno and A. McCallum. Mining a digital library for influential authors. In JCDL, pages 105-106. ACM, 2007.

R. Nallapati, A. Ahmed, E. Xing, and W. Cohen. Joint latent topic models for text and citations. In KDD, pages 542-550. ACM, 2008.

J. Pitman. Some developments of the Blackwell-MacQueen urn scheme. Lecture Notes-Monograph Series, pages 245-267, 1996.

M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In UAI, pages 487-494. AUAI Press, 2004.

I. Sato and H. Nakagawa. Topic models with power-law using Pitman-Yor process. In KDD, pages 673-682. ACM, 2010.

P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93-106, 2008.

J. Tang, J. Sun, C. Wang, and Z. Yang. Social influence analysis in large-scale networks. In KDD, pages 807-816. ACM, 2009.

Y. W. Teh. A Bayesian interpretation of interpolated Kneser-Ney. Technical report, School of Computing, National University of Singapore, 2006a.

Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In ACL, pages 985-992. ACL, 2006b.

Y. W. Teh and M. Jordan. Hierarchical Bayesian nonparametric models with applications. In Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, 2010.

Y. Tu, N. Johri, D. Roth, and J. Hockenmaier. Citation author topic model in expert search. In COLING, pages 1265-1273. ACL, 2010.

H. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In NIPS, pages 1973-1981. 2009.

J. Weng, E.-P. Lim, J. Jiang, and Q. He. TwitterRank: Finding topic-sensitive influential Twitterers. In WSDM, pages 261-270. ACM, 2010.

Y. Zhu, X. Yan, L. Getoor, and C. Moore.
Scalable text and link analysis with mixed-topic link models. In KDD, pages 473-481. ACM, 2013.