Nonparametric Bayesian Topic Modelling with the Hierarchical Pitman-Yor Processes
Preprint for Int. J. Approximate Reasoning 78 (2016) 172–191. Submitted 31 Jan 2016; Published 18 Jul 2016.
© 2016 Lim, Buntine, Chen and Du. http://dx.doi.org/10.1016/j.ijar.2016.07.007

Kar Wai Lim (karwai.lim@anu.edu.au), The Australian National University, Data61/NICTA, Australia
Wray Buntine (wray.buntine@monash.edu), Monash University, Australia
Changyou Chen (cchangyou@gmail.com), Duke University, United States
Lan Du (lan.du@monash.edu), Monash University, Australia

Editors: Antonio Lijoi, Antonietta Mira, and Alessio Benavoli

Abstract

The Dirichlet process and its extension, the Pitman-Yor process, are stochastic processes that take probability distributions as a parameter. These processes can be stacked up to form a hierarchical nonparametric Bayesian model. In this article, we present efficient methods for the use of these processes in this hierarchical context, and apply them to latent variable models for text analytics. In particular, we propose a general framework for designing these Bayesian models, which are called topic models in the computer science community. We then propose a specific nonparametric Bayesian topic model for modelling text from social media. We focus on tweets (posts on Twitter) in this article due to their ease of access. We find that our nonparametric model performs better than existing parametric models in both goodness of fit and real world applications.

Keywords: Bayesian nonparametric methods, Markov chain Monte Carlo, topic models, hierarchical Pitman-Yor processes, Twitter network modelling

1. Introduction

We live in the information age. With the Internet, information can be obtained easily and almost instantly. This has changed the dynamic of information acquisition; for example, we can now (1) attain knowledge by visiting digital libraries, (2) be aware of the world by reading news online, (3) seek opinions from social media, and (4) engage in political debates via web forums. As technology advances, more information is created, to a point where it is infeasible for a person to digest all the available content. To illustrate, in the context of a healthcare database (PubMed), the number of entries has seen a growth rate of approximately 3,000 new entries per day in the ten-year period from 2003 to 2013 (Suominen et al., 2014). This motivates the use of machines to automatically organise, filter, summarise, and analyse the available data for the users. To this end, researchers have developed various methods, which can be broadly categorised into computer vision (Low, 1991; Mai, 2010), speech recognition (Rabiner and Juang, 1993; Jelinek, 1997), and natural language processing (NLP, Manning and Schütze, 1999; Jurafsky and Martin, 2000). This article focuses on text analysis within NLP.

In text analytics, researchers seek to accomplish various goals, including sentiment analysis or opinion mining (Pang and Lee, 2008; Liu, 2012), information retrieval (Manning et al., 2008), text summarisation (Lloret and Palomar, 2012), and topic modelling (Blei, 2012). To illustrate, sentiment analysis can be used to extract digestible summaries or reviews on products and services, which can be valuable to consumers.
On the other hand, topic models attempt to discover abstract topics that are present in a collection of text documents. Topic models were inspired by latent semantic indexing (LSI, Landauer et al., 2007) and its probabilistic variant, probabilistic latent semantic indexing (pLSI), also known as probabilistic latent semantic analysis (pLSA, Hofmann, 1999). Pioneered by Blei et al. (2003), latent Dirichlet allocation (LDA) is a fully Bayesian extension of pLSI, and can be considered the simplest Bayesian topic model. LDA has since been extended to many different types of topic models. Some of them are designed for specific applications (Wei and Croft, 2006; Mei et al., 2007), some of them model the structure in the text (Blei and Lafferty, 2006; Du, 2012), while some incorporate extra information in their modelling (Ramage et al., 2009; Jin et al., 2011).

On the other hand, due to the well known correspondence between the Gamma-Poisson family of distributions and the Dirichlet-multinomial family, Gamma-Poisson factor models (Canny, 2004) and their nonparametric extensions, together with other Poisson-based variants of non-negative matrix factorisation (NMF), form a methodological continuum with topic models. These NMF methods are often applied to text; however, we do not consider them here.

This article will concentrate on topic models that take into account additional information. This information can be auxiliary data (or metadata) that accompany the text, such as keywords (or tags), dates, authors, and sources; or external resources like word lexicons. For example, on Twitter, a popular social media platform, the messages, known as tweets, are often associated with several metadata like location, time published, and the user who has written the tweet. This information is often utilised; for instance, Kinsella et al. (2011) model tweets with location data, while Wang et al. (2011b) use hashtags for sentiment classification on tweets. On the other hand, many topic models have been designed to perform bibliographic analysis by using auxiliary information. Most notable of these is the author-topic model (ATM, Rosen-Zvi et al., 2004), which, as its name suggests, incorporates authorship information. In addition to authorship, the Citation Author Topic model (Tu et al., 2010) and the Author Cite Topic Model (Kataria et al., 2011) make use of citations to model research publications. There are also topic models that employ external resources to improve modelling. For instance, He (2012) and Lim and Buntine (2014) incorporate a sentiment lexicon as prior information for weakly supervised sentiment analysis.

Independent of the use of auxiliary data, recent advances in nonparametric Bayesian methods have produced topic models that utilise nonparametric Bayesian priors. The simplest examples replace Dirichlet distributions with the Dirichlet process (DP, Ferguson, 1973); of these, the simplest is the hierarchical Dirichlet process LDA (HDP-LDA) proposed by Teh et al. (2006), which replaces just the document by topic matrix in LDA. One can further extend topic models using the Pitman-Yor process (PYP, Ishwaran and James, 2001), which generalises the DP, by replacing the second Dirichlet distribution, the one that generates the topic by word matrix in LDA.
This includes the work of Sato and Nakagawa (2010), Du et al. (2012b), and Lindsey et al. (2012), among others. Like the HDP, the PYPs can be stacked to form hierarchical Pitman-Yor processes (HPYP), which are used in more complex models. Another fully nonparametric extension to topic modelling uses the Indian buffet process (Archambeau et al., 2015) to sparsify both the document by topic matrix and the topic by word matrix in LDA.

The advantages of employing nonparametric Bayesian methods with topic models are the ability to estimate the topic and word priors and to infer the number of clusters¹ from the data. Using the PYP also allows the modelling of the power-law property exhibited by natural languages (Goldwater et al., 2005). These touted advantages have been shown to yield significant improvements in performance (Buntine and Mishra, 2014). However, we note that the best known approach for learning with hierarchical Dirichlet (or Pitman-Yor) processes is to use the Chinese restaurant franchise (Teh and Jordan, 2010). Because this requires dynamic memory allocation to implement the hierarchy, there has been extensive research in attempting to efficiently implement just the HDP-LDA extension to LDA, mostly based around variational methods (Teh et al., 2008; Wang et al., 2011a; Bryant and Sudderth, 2012; Sato et al., 2012; Hoffman et al., 2013). Variational methods have rarely been applied to more complex topic models, as we consider here, and unfortunately Bayesian nonparametric methods are gaining a reputation of being difficult to use. A newer collapsed and blocked Gibbs sampler (Chen et al., 2011) has been shown to generally outperform the variational methods as well as the original Chinese restaurant franchise in both computational time and space and in some standard performance metrics (Buntine and Mishra, 2014). Moreover, the technique does appear suitable for more complex topic models, as we consider here.

¹ This is known as the number of topics in topic modelling.

This article,² extending the algorithm of Chen et al. (2011), shows how to develop fully nonparametric and relatively efficient Bayesian topic models that incorporate auxiliary information, with a goal to produce more accurate models that work well in tackling several applications. As a by-product, we wish to encourage the use of state-of-the-art Bayesian techniques, and also the incorporation of auxiliary information, in modelling.

² We note that this article adapts and extends our previous work (Lim et al., 2013).

The remainder of this article is as follows. We first provide a brief background on the Pitman-Yor process in Section 2. Then, in Section 3, we detail our modelling framework by illustrating it on a simple topic model. We continue through to the inference procedure on the topic model in Section 4. Finally, in Section 5, we present an application on modelling social network data, utilising the proposed framework. Section 6 concludes.

2. Background on Pitman-Yor Process

We provide a brief, informal review of the Pitman-Yor process (PYP, Ishwaran and James, 2001) in this section. We assume the readers are familiar with basic probability distributions (see Walck, 2007) and the Dirichlet process (DP, Ferguson, 1973). In addition, we refer the readers to Hjort et al. (2010) for a tutorial on Bayesian nonparametric modelling.
2.1 Pitman-Yor Process

The Pitman-Yor process (PYP, Ishwaran and James, 2001) is also known as the two-parameter Poisson-Dirichlet process. The PYP is a two-parameter generalisation of the DP, now with an extra parameter α named the discount parameter in addition to the concentration parameter β. Similar to the DP, a sample from a PYP corresponds to a discrete distribution (known as the output distribution) with the same support as its base distribution H.

The underlying distribution of the PYP is the Poisson-Dirichlet distribution (PDD), which was introduced by Pitman and Yor (1997). The PDD is defined by its construction process. For 0 ≤ α < 1 and β > −α, let V_k be distributed independently as follows:

    (V_k | α, β) ~ Beta(1 − α, β + kα),   for k = 1, 2, 3, ...,    (1)

and define (p_1, p_2, p_3, ...) as

    p_1 = V_1,    (2)
    p_k = V_k ∏_{i=1}^{k−1} (1 − V_i),   for k ≥ 2.    (3)

If we let p̃ = (p̃_1, p̃_2, p̃_3, ...) be a sorted version of (p_1, p_2, p_3, ...) in descending order, then p̃ is Poisson-Dirichlet distributed with parameters α and β:

    p̃ ~ PDD(α, β).    (4)

Note that the unsorted version (p_1, p_2, p_3, ...) follows a GEM(α, β) distribution, which is named after Griffiths, Engen and McCloskey (Pitman, 2006).

With the PDD defined, we can then define the PYP formally. Let H be a distribution over a measurable space (X, B). For 0 ≤ α < 1 and β > −α, suppose that p = (p_1, p_2, p_3, ...) follows a PDD (or GEM) with parameters α and β; then the PYP is given by the formula

    p(x | α, β, H) = ∑_{k=1}^{∞} p_k δ_{X_k}(x),    (5)

where the X_k are independent samples drawn from the base measure H and δ_{X_k}(x) represents a probability point mass concentrated at X_k (i.e., it is an indicator function that is equal to 1 when x = X_k and 0 otherwise):

    δ_x(y) = 1 if x = y, and 0 otherwise.    (6)

This construction, Equation (1), is named the stick-breaking process. The PYP can also be constructed using an analogue of the Chinese restaurant process (which explicitly draws a sequence of samples from the base distribution). A more extensive review of the PYP is given by Buntine and Hutter (2012).

A PYP is often more suitable than a DP in modelling since it exhibits a power-law behaviour (when α ≠ 0), which is observed in natural languages (Goldwater et al., 2005; Teh and Jordan, 2010). The PYP has also been employed in genomics (Favaro et al., 2009) and economics (Aoki, 2008). Note that when the discount parameter α is 0, the PYP simply reduces to a DP.
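To make the stick-breaking construction concrete, below is a minimal sketch in Python with NumPy, not part of the original implementation; the truncation level k_max is our own device for illustration, since the true sequence in Equations (1)-(3) is infinite.

    import numpy as np

    def stick_breaking_gem(alpha, beta, k_max, rng=np.random.default_rng()):
        """Truncated draw (p_1, ..., p_{k_max}) from GEM(alpha, beta),
        following the stick-breaking construction of Equations (1)-(3)."""
        # V_k ~ Beta(1 - alpha, beta + k * alpha), for k = 1, 2, 3, ...
        ks = np.arange(1, k_max + 1)
        v = rng.beta(1.0 - alpha, beta + ks * alpha)
        # p_k = V_k * prod_{i<k} (1 - V_i): length of the k-th broken stick
        remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
        return v * remaining

    # A draw from a (truncated) PYP places these weights on atoms X_k ~ H.
    weights = stick_breaking_gem(alpha=0.5, beta=1.0, k_max=1000)
    print(weights[:5], weights.sum())   # the sum approaches 1 as k_max grows

With α > 0 the sorted weights decay as a power law, whereas with α = 0 (the DP case) they decay geometrically, which is the behaviour referred to above.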
2.2 Pitman-Yor Process with a Mixture Base

Note that the base measure H of a PYP is not necessarily restricted to a single probability distribution. H can also be a mixture distribution such as

    H = ρ_1 H_1 + ρ_2 H_2 + ··· + ρ_n H_n,    (7)

where ∑_{i=1}^{n} ρ_i = 1 and {H_1, ..., H_n} is a set of distributions over the same measurable space (X, B) as H. With this specification of H, the PYP is also named the compound Poisson-Dirichlet process in Du (2012), or the doubly hierarchical Pitman-Yor process in Wood and Teh (2009). A special case of this is the DP equivalent, which is also known as the DP with mixed random measures in Kim et al. (2012). Note that we have assumed constant values for the ρ_i, though of course we can go fully Bayesian and assign a prior distribution to each of them; a natural prior would be the Dirichlet distribution.

2.3 Remark on Bayesian Inference

Performing exact Bayesian inference on nonparametric models is often intractable due to the difficulty in deriving the closed-form posterior distributions. This motivates the use of Markov chain Monte Carlo (MCMC) methods (see Gelman et al., 2013) for approximate inference. Most notable of the MCMC methods are the Metropolis-Hastings (MH) algorithms (Metropolis et al., 1953; Hastings, 1970) and Gibbs samplers (Geman and Geman, 1984). These algorithms serve as building blocks for more advanced samplers, such as the MH algorithms with delayed rejection (Mira, 2001). Generalisations of the MCMC method, including the reversible jump MCMC (Green, 1995) and its delayed rejection variant (Green and Mira, 2001), can also be employed for Bayesian inference; however, they are outside the scope of this article.

Instead of sampling one parameter at a time, one can develop an algorithm that updates more parameters in each iteration, a so-called blocked Gibbs sampler (Liu, 1994). Also, in practice we are usually only interested in a certain subset of the parameters; in such cases we can sometimes derive more efficient collapsed Gibbs samplers (Liu, 1994) by integrating out the nuisance parameters. In the remainder of this article, we will employ a combination of the blocked and collapsed Gibbs samplers for Bayesian inference.

3. Modelling Framework with Hierarchical Pitman-Yor Process

In this section, we discuss the basic design of our nonparametric Bayesian topic models using hierarchical Pitman-Yor processes (HPYP). In particular, we will introduce a simple topic model that will be extended later. We discuss the general inference algorithm for the topic model and hyperparameter optimisation.

Development of topic models is fundamentally motivated by their applications. Depending on the application, a specific topic model that is most suitable for the task should be designed and used. However, despite the ease of designing the model, the majority of time is spent on implementing, assessing, and redesigning it. This calls for a better design cycle/routine that is more efficient, that is, one that spends less time on implementation and more time on model design and development.

Figure 1: Graphical model of the HPYP topic model. It is an extension of LDA obtained by allowing the probability vectors to be modelled by PYPs instead of Dirichlet distributions. The area on the left of the graphical model (consisting of μ, ν and θ) is usually referred to as the topic side, while the right hand side (with γ and φ) is called the vocabulary side. The word node denoted by w_dn is observed. The notations are defined in Table 1.

We can achieve this by a higher level implementation of the algorithms for topic modelling. This has been made possible in other statistical domains by BUGS (Bayesian inference using Gibbs sampling, Lunn et al., 2000) or JAGS (just another Gibbs sampler, Plummer, 2003), albeit with standard probability distributions. Theoretically, BUGS and JAGS will work on LDA; however, in practice, running Gibbs sampling for LDA with BUGS and JAGS is very slow.
This is because their Gibbs samplers are uncollapsed and not optimised. Furthermore, they cannot be used in a model with stochastic processes, like the Gaussian process (GP) and the DP.

Below, we present a framework that allows us to implement HPYP topic models efficiently. This framework allows us to test variants of our proposed topic models without significant reimplementation.

3.1 Hierarchical Pitman-Yor Process Topic Model

The HPYP topic model is a simple network of PYP nodes since all distributions on the probability vectors are modelled by the PYP. For simplicity, we assume a topic model with three PYP layers, although in practice there is no limit to the number of PYP layers. We present the graphical model of our generic topic model in Figure 1. This model is a variant of those presented in Buntine and Mishra (2014), and is presented here as a starting model for illustrating our methods and for subsequent extensions.

At the root level, we have μ and γ distributed as PYPs:

    μ ~ PYP(α^μ, β^μ, H^μ),    (8)
    γ ~ PYP(α^γ, β^γ, H^γ).    (9)

The variable μ is the root node for the topics in a topic model, while γ is the root node for the words. To allow an arbitrary number of topics to be learned, we let the base distribution for μ, H^μ, be a continuous distribution or a discrete distribution with infinite samples. We usually choose a discrete uniform distribution for γ based on the word vocabulary size of the text corpus. This decision is technical in nature, as we are able to assign a tiny probability to words not observed in the training set, which eases the evaluation process. Thus H^γ = (···, 1/|V|, ···), where V is the word vocabulary of the text corpus.

We now consider the topic side of the HPYP topic model. Here we have ν, which is the child node of μ. It follows a PYP with μ as its base distribution:

    ν ~ PYP(α^ν, β^ν, μ).    (10)

For each document d in a text corpus of size D, we have a document–topic distribution θ_d, which is a topic distribution specific to a document. Each of them tells us about the topic composition of a document:

    θ_d ~ PYP(α^{θ_d}, β^{θ_d}, ν),   for d = 1, ..., D.    (11)

On the vocabulary side, for each topic k learned by the model, we have a topic–word distribution φ_k, which tells us about the words associated with each topic. The topic–word distribution φ_k is PYP distributed given the parent node γ, as follows:

    φ_k ~ PYP(α^{φ_k}, β^{φ_k}, γ),   for k = 1, ..., K.    (12)

Here, K is the number of topics in the topic model. For every word w_dn in a document d, indexed by n (from 1 to N_d, the number of words in document d), we have a latent topic z_dn (also known as a topic assignment), which indicates the topic the word represents. z_dn and w_dn are categorical variables generated from θ_d and φ_{z_dn} respectively:

    z_dn | θ_d ~ Discrete(θ_d),    (13)
    w_dn | z_dn, φ ~ Discrete(φ_{z_dn}),   for n = 1, ..., N_d.    (14)

The above α and β are the discount and concentration parameters of the PYPs (see Section 2.1); note that they are called the hyperparameters of the model. We present a list of variables used in this section in Table 1.
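As an illustrative sketch of this generative process, the following forward-simulates a finite truncation of the model, reusing a stick-breaking draw as in Section 2.1. The truncation, the helper names and all sizes and hyperparameter values below are assumptions for illustration, not part of the model specification.

    import numpy as np

    rng = np.random.default_rng(0)

    def gem(alpha, beta, k_max):
        # Truncated stick-breaking weights from GEM(alpha, beta), eqs (1)-(3)
        v = rng.beta(1.0 - alpha, beta + np.arange(1, k_max + 1) * alpha)
        return v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

    def pyp_draw(alpha, beta, base_probs, k_max=500):
        # Truncated draw from PYP(alpha, beta, H) with a discrete base H:
        # stick weights are attached to atoms drawn i.i.d. from H, cf. eq (5)
        sticks = gem(alpha, beta, k_max)
        atoms = rng.choice(len(base_probs), size=k_max, p=base_probs)
        out = np.zeros(len(base_probs))
        np.add.at(out, atoms, sticks)
        return out / out.sum()          # renormalise away the truncated tail

    K, V, D, n_words = 20, 1000, 5, 10                 # all sizes are made up
    mu = gem(0.5, 1.0, K)                              # root topic node, eq (8)
    mu /= mu.sum()                                     # truncated to K topics
    H_gamma = np.full(V, 1.0 / V)                      # uniform base over words
    gamma = pyp_draw(0.5, 1.0, H_gamma)                # root word node, eq (9)
    nu = pyp_draw(0.5, 5.0, mu)                        # eq (10)
    theta = [pyp_draw(0.5, 1.0, nu) for _ in range(D)]       # eq (11)
    phi = [pyp_draw(0.7, 10.0, gamma) for _ in range(K)]     # eq (12)

    for d in range(D):                                 # eqs (13)-(14)
        z = rng.choice(K, size=n_words, p=theta[d])
        words = [rng.choice(V, p=phi[k]) for k in z]

The simulation makes the sharing visible: every θ_d places most of its mass on the topics that ν favours, and every φ_k reuses the word atoms of γ.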
3.2 Model Representation and Posterior Likelihood

In a Bayesian setting, posterior inference requires us to analyse the posterior distribution of the model variables given the observed data. For instance, the joint posterior distribution for the HPYP topic model is

    p(μ, ν, γ, θ, φ, Z | W, Ξ).    (15)

Here, we use bold face capital letters to represent the set of all relevant variables. For instance, W captures all words in the corpus. Additionally, we denote by Ξ the set of all hyperparameters and constants in the model. Note that deriving the posterior distribution analytically is almost impossible due to its complex nature. This leaves us with the approximate Bayesian inference techniques mentioned in Section 2.3. However, even with these techniques, performing posterior inference with the posterior distribution is difficult due to the coupling of the probability vectors from the PYPs.

Table 1: List of variables for the HPYP topic model used in this section.

    Variable   Name                          Description
    z_dn       Topic                         Topical label for word w_dn.
    w_dn       Word                          Observed word or phrase at position n in document d.
    φ_k        Topic–word distribution       Probability distribution in generating words for topic k.
    θ_d        Document–topic distribution   Probability distribution in generating topics for document d.
    γ          Global word distribution      Word prior for φ_k.
    ν          Global topic distribution     Topic prior for θ_d.
    μ          Global topic distribution     Topic prior for ν.
    α^N        Discount                      Discount parameter for PYP N.
    β^N        Concentration                 Concentration parameter for PYP N.
    H^N        Base distribution             Base distribution for PYP N.
    c^N_k      Customer count                Number of customers having dish k in restaurant N.
    t^N_k      Table count                   Number of tables serving dish k in restaurant N.
    Z          All topics                    Collection of all topics z_dn.
    W          All words                     Collection of all words w_dn.
    Ξ          All hyperparameters           Collection of all hyperparameters and constants.
    C          All customer counts           Collection of all customer counts c^N_k.
    T          All table counts              Collection of all table counts t^N_k.

The key to an efficient inference procedure with the PYPs is to marginalise out the PYPs in the model and record various associated counts instead, which yields a collapsed sampler. To achieve this, we adopt a Chinese restaurant process (CRP) metaphor (Teh and Jordan, 2010; Blei et al., 2010) to represent the variables in the topic model. With this metaphor, all data in the model (e.g., topics and words) are the customers, while the PYP nodes are the restaurants the customers visit. In each restaurant, each customer is seated at only one table, though each table can have any number of customers. Each table in a restaurant serves a dish; the dish corresponds to the categorical label a data point may have (e.g., the topic label or word). Note that there can be more than one table serving the same dish.
In a HPYP topic model, the tables in a restaurant N are treated as the customers of the parent restaurant P (in the graphical model, P points to N), and they share the same dish. This means that the data is passed up recursively until the root node. For illustration, we present a simple example in Figure 2, showing the seating arrangement of the customers from two restaurants.

Figure 2: An illustration of the Chinese restaurant process representation. The customers are represented by the circles while the tables are represented by the rectangles. The dishes are the symbols in the middle of the rectangles; here they are denoted by the sunny symbol and the cloudy symbol. In this illustration, we know the number of customers corresponding to each table; for example, the green table is occupied by three customers. Also, since Restaurant 1 is the parent of Restaurant 2, the tables in Restaurant 2 are treated as the customers of Restaurant 1.

Naïvely recording the seating arrangement (table and dish) of each customer brings about computational inefficiency during inference. Instead, we adopt the table multiplicity (or table counts) representation of Chen et al. (2011), which requires no dynamic memory, thus consuming only a fraction of the memory at no loss of inference efficiency. Under this representation, we store only the customer counts and table counts associated with each restaurant. The customer count c^N_k denotes the number of customers who are having dish k in restaurant N. The corresponding symbol without subscript, c^N, denotes the collection of customer counts in restaurant N, that is, c^N = (···, c^N_k, ···). The total number of customers in a restaurant N is denoted by the capitalised symbol instead, C^N = ∑_k c^N_k. Similar to the customer count, the table count t^N_k denotes the number of non-empty tables serving dish k in restaurant N. The corresponding t^N and T^N are defined similarly. For instance, from the example in Figure 2, we have c²_sun = 9 and t²_sun = 3; the corresponding illustration of the table multiplicity representation is presented in Figure 3.

Figure 3: An illustration of the Chinese restaurant with the table counts representation. Here the setting is the same as in Figure 2, but the seating arrangement of the customers is "forgotten" and only the table and customer counts are recorded. Thus, we only know that there are three sunny tables in Restaurant 2, with a total of nine customers.

We refer the readers to Chen et al. (2011) for a detailed derivation of the posterior likelihood of a restaurant. For the posterior likelihood of the HPYP topic model, we marginalise out the probability vectors associated with the PYPs and represent them with the customer counts and table counts, following Chen et al. (2011, Theorem 1). We present the modularised version of the full posterior of the HPYP topic model, which allows the posterior to be computed very quickly. The full posterior consists of the modularised likelihood associated with each PYP in the model, defined as

    f(N) = [ (β^N | α^N)_{T^N} / (β^N)_{C^N} ] ∏_{k=1}^{K} S^{c^N_k}_{t^N_k, α^N},   for N ~ PYP(α^N, β^N, P).    (16)

Here, the S^x_{y,α} are generalised Stirling numbers (Buntine and Hutter, 2012, Theorem 17). Both (x)_T and (x|y)_T denote Pochhammer symbols with rising factorials (Oldham et al., 2009, Section 18):

    (x)_T = x · (x + 1) ··· (x + (T − 1)),    (17)
    (x|y)_T = x · (x + y) ··· (x + (T − 1)y).    (18)
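For concreteness, below is one way the quantities in Equations (16)-(18) can be evaluated in log space; this is a sketch, not the authors' code. It assumes the usual recurrence for the generalised Stirling numbers, S^{n+1}_{m,α} = S^n_{m−1,α} + (n − mα) S^n_{m,α} with S⁰₀ = 1 (cf. Buntine and Hutter, 2012), and computes Pochhammer symbols with gammaln.

    import numpy as np
    from scipy.special import gammaln

    def log_poch(x, n):
        # log (x)_n = log Gamma(x + n) - log Gamma(x), eq (17)
        return gammaln(x + n) - gammaln(x)

    def log_poch_step(x, y, n):
        # log (x|y)_n = sum_{i=0}^{n-1} log(x + i*y), eq (18)
        return float(np.sum(np.log(x + y * np.arange(n))))

    def log_stirling_table(n_max, alpha):
        # S[n, m] = log of the generalised Stirling number S^n_{m, alpha},
        # filled via the assumed recurrence (valid for 0 <= alpha < 1)
        S = np.full((n_max + 1, n_max + 1), -np.inf)
        S[0, 0] = 0.0
        for n in range(n_max):
            for m in range(1, n + 2):
                grow = S[n, m] + np.log(n - m * alpha) if m <= n else -np.inf
                S[n + 1, m] = np.logaddexp(S[n, m - 1], grow)
        return S

    def log_f(c, t, alpha, beta, S):
        # log f(N) of Equation (16) from the counts of one restaurant
        C, T = int(np.sum(c)), int(np.sum(t))
        val = log_poch_step(beta, alpha, T) - log_poch(beta, C)
        for ck, tk in zip(c, t):
            val += S[ck, tk]
        return val

Ratios of these quantities, as used later in the sampler, then become simple differences of table lookups in log space.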
With the CRP representation, the full posterior of the HPYP topic model can now be written, in terms of f(·) given in Equation (16), as

    p(Z, T, C | W, Ξ) ∝ p(Z, W, T, C | Ξ)
                      ∝ f(μ) f(ν) ( ∏_{d=1}^{D} f(θ_d) ) ( ∏_{k=1}^{K} f(φ_k) ) f(γ) ∏_{v=1}^{|V|} (1/|V|)^{t^γ_v}.    (19)

This result is a generalisation of Chen et al. (2011, Theorem 1) that accounts for a discrete base distribution; the last term in Equation (19) corresponds to the base distribution of γ, and v indexes each unique word in the vocabulary set V. The bold face T and C denote the collections of all table counts and customer counts, respectively. Note that the topic assignments Z are implicitly captured by the customer counts:

    c^{θ_d}_k = ∑_{n=1}^{N_d} I(z_dn = k),    (20)

where I(·) is the indicator function, which evaluates to 1 when the statement inside the function is true, and 0 otherwise. We would like to point out that even though the probability vectors of the PYPs are integrated out and not explicitly stored, they can easily be reconstructed; this is discussed in Section 4.4. We move on to Bayesian inference in the next section.

4. Posterior Inference for the HPYP Topic Model

We focus on the MCMC method for Bayesian inference on the HPYP topic model. The MCMC method on topic models follows these simple procedures: decrement the counts contributed by a word, sample a new topic for the word, and update the model by accepting or rejecting the proposed sample. Here, we describe the collapsed blocked Gibbs sampler for the HPYP topic model. Note that the PYPs are marginalised out, so we only deal with the counts.

4.1 Decrementing the Counts Associated with a Word

The first step in the Gibbs sampler is to remove a word and its corresponding latent topic, then decrement the associated customer counts and table counts. To give an example from Figure 2, if we remove the red customer from Restaurant 2, we would decrement the customer count c²_sun by 1. Additionally, we also decrement the table count t²_sun by 1 because the red customer is the only customer at its table. This in turn decrements the customer count c¹_sun by 1. However, this requires us to keep track of the customers' seating arrangement, which leads to increased memory requirements and poorer performance due to inadequate mixing (Chen et al., 2011).

To overcome the above issue, we follow the concept of the table indicator (Chen et al., 2011) and introduce a new auxiliary Bernoulli indicator variable u^N_k, which indicates whether removing the customer also removes the table at which the customer is seated. Note that our Bernoulli indicator is different to that of Chen et al. (2011), which indicates the restaurant a customer contributes to. The Bernoulli indicator is sampled as needed in the decrementing procedure and it is not stored; this means that we simply "forget" the seating arrangements and re-sample them later when needed, so we do not need to store the seating arrangement. The Bernoulli indicator of a restaurant N depends solely on the customer counts and the table counts:

    p(u^N_k) = t^N_k / c^N_k         if u^N_k = 1,
             = 1 − t^N_k / c^N_k     if u^N_k = 0.    (21)

In the context of the HPYP topic model described in Section 3.1, we now formally present how we decrement the counts associated with the word w_dn and latent topic z_dn from document d and position n.
First, on the vocabulary side (see Figure 1), we decrement the customer count c^{φ_{z_dn}}_{w_dn} associated with φ_{z_dn} by 1. Then we sample a Bernoulli indicator u^{φ_{z_dn}}_{w_dn} according to Equation (21). If u^{φ_{z_dn}}_{w_dn} = 1, we decrement the table count t^{φ_{z_dn}}_{w_dn} and also the customer count c^γ_{w_dn} by one. In this case, we would sample a Bernoulli indicator u^γ_{w_dn} for γ, and decrement t^γ_{w_dn} if u^γ_{w_dn} = 1. We do not decrement the respective customer count if the Bernoulli indicator is 0. Second, we decrement the counts associated with the latent topic z_dn. The procedure is similar: we decrement c^{θ_d}_{z_dn} by 1 and sample the Bernoulli indicator u^{θ_d}_{z_dn}. Note that whenever we decrement a customer count, we sample the corresponding Bernoulli indicator. We repeat this procedure recursively until the Bernoulli indicator is 0 or until the procedure hits the root node.
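The recursion just described can be sketched as follows, with each restaurant holding dictionaries of customer and table counts; the Restaurant class and its parent link are our own illustrative scaffolding, not the authors' implementation.

    import random

    class Restaurant:
        """A PYP node holding only customer and table counts (Section 3.2)."""
        def __init__(self, parent=None):
            self.parent = parent        # parent restaurant, None at the root
            self.c = {}                 # customer counts c^N_k per dish k
            self.t = {}                 # table counts t^N_k per dish k

        def decrement(self, k):
            """Remove one customer with dish k, recursing via Equation (21)."""
            self.c[k] -= 1
            # Bernoulli table indicator: with probability t_k / c_k (counts
            # taken before the decrement), the customer's table also empties.
            if random.random() < self.t[k] / (self.c[k] + 1):
                self.t[k] -= 1
                if self.parent is not None:
                    # the removed table was a customer in the parent restaurant
                    self.parent.decrement(k)

A single call on a leaf node then handles both chains described above: φ_{z_dn} → γ on the vocabulary side and θ_d → ν → μ on the topic side.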
4.2 Sampling a New Topic for a Word

After decrementing the variables associated with a word w_dn, we use a blocked Gibbs sampler to sample a new topic z_dn for the word along with the corresponding customer counts and table counts. The conditional posterior used in sampling can be computed quickly when the full posterior is represented in a modularised form. To illustrate, the conditional posterior for z_dn and its associated customer counts and table counts is

    p(z_dn, T, C | Z^{−dn}, W, T^{−dn}, C^{−dn}, Ξ) = p(Z, T, C | W, Ξ) / p(Z^{−dn}, T^{−dn}, C^{−dn} | W, Ξ),    (22)

which is further broken down by substituting the posterior likelihood defined in Equation (19), giving the following ratios of the modularised likelihoods:

    [f(μ)/f(μ^{−dn})] [f(ν)/f(ν^{−dn})] [f(θ_d)/f(θ_d^{−dn})] [f(φ_{z_dn})/f(φ_{z_dn}^{−dn})] [f(γ)/f(γ^{−dn})] (1/|V|)^{t^γ_{w_dn} − (t^γ_{w_dn})^{−dn}}.    (23)

The superscript −dn indicates that the variables associated with the word w_dn are removed from the respective sets, that is, the customer counts and table counts are those after the decrementing procedure. Since we only sample the topic assignment z_dn associated with one word, the customer counts and table counts can only increment by at most 1; see Table 2 for a list of all possible proposals.

Table 2: All possible proposals of the blocked Gibbs sampler for the variables associated with w_dn. To illustrate, one sample would be z_dn = 1, t^N_{z_dn} does not increment (stays the same), and c^N_{z_dn} increments by 1, for all N in {μ, ν, θ_d, φ_{z_dn}, γ}. We note that the proposals can include states that are invalid, but this is not an issue since those states have zero posterior probability and thus will not be sampled.

    Variable      Possibilities
    z_dn          {1, ..., K}
    t^N_{z_dn}    {t^N_{z_dn}, t^N_{z_dn} + 1}
    c^N_{z_dn}    {c^N_{z_dn}, c^N_{z_dn} + 1}

This allows the ratios of the modularised likelihoods, which consist of ratios of Pochhammer symbols and ratios of Stirling numbers,

    f(N) / f(N^{−dn}) = [ (β^N)_{(C^N)^{−dn}} / (β^N)_{C^N} ] × [ (β^N | α^N)_{T^N} / (β^N | α^N)_{(T^N)^{−dn}} ] × ∏_{k=1}^{K} [ S^{c^N_k}_{t^N_k, α^N} / S^{(c^N_k)^{−dn}}_{(t^N_k)^{−dn}, α^N} ],    (24)

to simplify further. For instance, the ratios of Pochhammer symbols can be reduced to constants, as follows:

    (x)_{T+1} / (x)_T = x + T,
    (x|y)_{T+1} / (x|y)_T = x + yT.    (25)

The ratio of Stirling numbers, such as S^{y+1}_{x+1, α} / S^y_{x, α}, can be computed quickly via caching (Buntine and Hutter, 2012). Technical details on implementing the Stirling numbers cache can be found in Lim (2016).

With the conditional posterior defined, we proceed to the sampling process. Our first step involves finding all possible changes to the topic z_dn, the customer counts, and the table counts (hereafter known as the 'state') associated with adding the removed word w_dn back into the topic model. Since only one word is added into the model, the customer counts and the table counts can only increase by at most 1, constraining the possible states to a reasonably small number. Furthermore, the customer count of a parent node will only be incremented when the table count of its child node increases. Note that it is possible for the added customer to generate a new dish (topic) for the model. This requires the customer to increment the table count of a new dish in the root node μ by 1 (from 0).

Next, we compute the conditional posterior (Equation (22)) for all possible states. The conditional posterior (up to a proportionality constant) can be computed quickly by breaking down the posterior and calculating the relevant parts. We then normalise these values and sample one of the states to be the proposed next state. Note that the proposed state will always be accepted, which is an artifact of the Gibbs sampler. Finally, given the proposal, we update the HPYP model by incrementing the relevant customer counts and table counts.

4.3 Optimising the Hyperparameters

Choosing the right hyperparameters for the priors is important for topic models. Wallach et al. (2009a) show that optimised hyperparameters increase the robustness of topic models and improve their model fitting. The hyperparameters of the HPYP topic model are the discount parameters and concentration parameters of the PYPs. Here, we propose a procedure to optimise the concentration parameters, but leave the discount parameters fixed due to their coupling with the Stirling numbers cache.

The concentration parameters β of all the PYPs are optimised using an auxiliary variable sampler similar to that of Teh (2006). Being Bayesian, we assume the concentration parameter β^N of a PYP node N has the following hyperprior:

    β^N ~ Gamma(τ_0, τ_1),   for N ~ PYP(α^N, β^N, P),    (26)

where τ_0 is the shape parameter and τ_1 is the rate parameter. The gamma prior is chosen due to its conjugacy, which gives a gamma posterior for β^N. To optimise β^N, we first sample the auxiliary variables ω and ζ_i given the current values of α^N and β^N, as follows:

    ω | β^N ~ Beta(C^N, β^N),    (27)
    ζ_i | α^N, β^N ~ Bernoulli(β^N / (β^N + iα^N)),   for i = 0, 1, ..., T^N − 1.    (28)

With these, we can then sample a new β^N from its conditional posterior:

    β^N | ω, ζ ~ Gamma(τ_0 + ∑_{i=0}^{T^N − 1} ζ_i, τ_1 − log(1 − ω)).    (29)
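A sketch of this auxiliary-variable update for a single PYP node follows, assuming NumPy and the current totals C^N ≥ 1 and T^N from the sampler state; note that NumPy's gamma sampler is parameterised by shape and scale, so we pass the reciprocal of the rate.

    import numpy as np

    def resample_concentration(beta, alpha, C, T, tau0, tau1,
                               rng=np.random.default_rng()):
        """One auxiliary-variable update of beta^N, Equations (27)-(29)."""
        omega = rng.beta(C, beta)                          # eq (27)
        i = np.arange(T)                                   # i = 0, ..., T^N - 1
        zeta = rng.random(T) < beta / (beta + i * alpha)   # eq (28)
        shape = tau0 + zeta.sum()                          # eq (29), gamma shape
        rate = tau1 - np.log1p(-omega)                     # eq (29), gamma rate
        return rng.gamma(shape, 1.0 / rate)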
The collapsed Gibbs sampler is summarised by Algorithm 1.

Algorithm 1: Collapsed Gibbs Sampler for the HPYP Topic Model
1. Initialise the HPYP topic model by assigning a random topic to the latent topic z_dn associated with each word w_dn. Then update all the relevant customer counts C and table counts T by using Equation (20) and setting the table counts to be about half of the customer counts.
2. For each word w_dn in each document d, do the following:
   (a) Decrement the counts associated with w_dn (see Section 4.1).
   (b) Block sample a new topic for z_dn and the corresponding customer counts C and table counts T (see Section 4.2).
   (c) Update (increment counts) the topic model based on the sample.
3. Update the hyperparameter β^N for each PYP node N (see Section 4.3).
4. Repeat Steps 2–3 until the model converges or a fixed number of iterations is reached.

4.4 Estimating the Probability Vectors of the PYPs

Recall that the aim of topic modelling is to analyse the posterior of the model parameters, such as the one in Equation (15). Although we have marginalised out the PYPs in the above Gibbs sampler, the PYPs can be reconstructed from the associated customer counts and table counts. Recovering the full posterior distribution of the PYPs is a complicated task. So, instead, we will analyse the PYPs via the expected value of their conditional marginal posterior distribution, or simply, their posterior mean,

    E[N | Z, W, T, C, Ξ],   for N ∈ {μ, ν, γ, θ_d, φ_k}.    (30)

The posterior mean of a PYP corresponds to the probability of sampling a new customer for the PYP. To illustrate, we consider the posterior of the topic distribution θ_d. We let z̃_dn be an unknown future latent topic in addition to the known Z. With this, we can write the posterior mean of θ_dk as

    E[θ_dk | Z, W, T, C, Ξ] = E[p(z̃_dn = k | θ_d, Z, W, T, C, Ξ) | Z, W, T, C, Ξ]
                             = E[p(z̃_dn = k | Z, T, C) | Z, W, T, C, Ξ],    (31)

by replacing θ_dk with the posterior predictive distribution of z̃_dn; note that z̃_dn can be sampled using the CRP, as follows:

    p(z̃_dn = k | Z, T, C) = [ (α^{θ_d} T^{θ_d} + β^{θ_d}) ν_k + c^{θ_d}_k − α^{θ_d} t^{θ_d}_k ] / (β^{θ_d} + C^{θ_d}).    (32)

Thus, the posterior mean of θ_d is given as

    E[θ_dk | Z, W, T, C, Ξ] = [ (α^{θ_d} T^{θ_d} + β^{θ_d}) E[ν_k | Z, W, T, C, Ξ] + c^{θ_d}_k − α^{θ_d} t^{θ_d}_k ] / (β^{θ_d} + C^{θ_d}),    (33)

which is written in terms of the posterior mean of its parent PYP, ν. The posterior means of the other PYPs such as ν can be derived by taking a similar approach. Generally, the posterior mean corresponding to a PYP N (with parent PYP P) is as follows:

    E[N_k | Z, W, T, C, Ξ] = [ (α^N T^N + β^N) E[P_k | Z, W, T, C, Ξ] + c^N_k − α^N t^N_k ] / (β^N + C^N).    (34)

By applying Equation (34) recursively, we obtain the posterior means of all the PYPs in the model.

We note that the dimension of the topic distributions (μ, ν, θ) is K + 1, where K is the number of observed topics. This accounts for the generation of a new topic associated with a new customer, though the probability of generating a new topic is usually much smaller. In practice, we may instead ignore the extra dimension during the evaluation of a topic model since it does not provide a useful interpretation. One way to do this is to simply discard the extra dimension of all the probability vectors after computing the posterior mean. Another approach would be to normalise the posterior mean of the root node μ after discarding the extra dimension, before computing the posterior means of the other PYPs. Note that for a considerably large corpus, the difference between the above approaches would be too small to notice.
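Equation (34) translates directly into a short recursion. In the sketch below, the node objects are assumed to carry their counts, parameters, a parent link, and, at the root, the base distribution H as a probability array; this mirrors the illustrative Restaurant structure above.

    import numpy as np

    def posterior_mean(node):
        """Posterior mean of a PYP node via Equation (34), recursing to the root.

        Each node is assumed to carry arrays c and t (counts per topic),
        scalars alpha and beta, a parent link (None at the root), and at
        the root the base distribution H as a probability array."""
        if node.parent is None:
            parent_mean = node.H
        else:
            parent_mean = posterior_mean(node.parent)
        C, T = node.c.sum(), node.t.sum()
        return ((node.alpha * T + node.beta) * parent_mean
                + node.c - node.alpha * node.t) / (node.beta + C)

When sweeping over many documents, the means of shared ancestors (ν and μ) should of course be computed once and cached rather than recursed per document.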
4.5 Evaluations on Topic Models

Generally, there are two ways to evaluate a topic model. The first is to evaluate the topic model based on the task it performs, for instance, its ability to make predictions. The second approach is the statistical evaluation of the topic model on modelling the data, which is also known as the goodness-of-fit test. In this section, we will present some commonly used evaluation metrics that are applicable to all topic models, but we first discuss the procedure for estimating the variables associated with the test set.

4.5.1 Predictive Inference on the Test Documents

Test documents, which are used for evaluations, are set aside from the learning documents. As such, the document–topic distributions θ associated with the test documents are unknown and hence need to be estimated. One estimate for θ_d is its posterior mean given the variables learned from the Gibbs sampler:

    θ̂_d = E[θ_d | Z, W, T, C, Ξ],    (35)

obtainable by applying Equation (34). Note that since the latent topics Z̃ corresponding to the test set are not sampled, the customer counts and table counts associated with θ_d are 0; thus θ̂_d is equal to ν̂, the posterior mean of ν. However, this is not a good estimate for the topic distribution of the test documents since it would be identical for all of them. To overcome this issue, we will instead use some of the words in the test documents to obtain a better estimate for θ. This method is known as document completion (Wallach et al., 2009b), as we use part of the text to estimate θ, and use the rest for evaluation.

Getting a better estimate for θ requires us to first sample some of the latent topics z̃_dn in the test documents. The proper way to do this is by running an algorithm akin to the collapsed Gibbs sampler, but this would be excruciatingly slow due to the need to re-sample the customer counts and table counts for all the parent PYPs. Instead, we assume that the variables learned from the Gibbs sampler are fixed and sample the z̃_dn from their conditional posterior sequentially, given the previous latent topics:

    p(z̃_dn = k | w̃_dn, θ_d, φ, z̃_d1, ..., z̃_{d,n−1}) ∝ θ_dk φ_{k w_dn}.    (36)

Whenever a latent topic z̃_dn is sampled, we increment the customer count c^{θ_d}_{z̃_dn} for the test document. For simplicity, we set the table count t^{θ_d}_{z̃_dn} to be half the corresponding customer count c^{θ_d}_{z̃_dn}; this avoids the expensive operation of sampling the table counts. Additionally, θ_d is re-estimated using Equation (35) before sampling the next latent topic. We note that the estimated variables are unbiased. The final θ_d becomes an estimate for the topic distribution of the test document d. The above procedure is repeated R times to give R samples θ_d^{(r)}, which are used to compute the following Monte Carlo estimate of θ_d:

    θ̂_d = (1/R) ∑_{r=1}^{R} θ_d^{(r)}.    (37)

This Monte Carlo estimate can then be used for computing the evaluation metrics. Note that when estimating θ, we have ignored the possibility of generating a new topic, that is, the latent topics z̃ are constrained to the existing topics, as previously discussed in Section 4.4.
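A sketch of this document-completion estimate (Equations (35)-(37)) for a single test document follows; phi is assumed to be a K × V array of estimated topic–word probabilities, nu_hat the posterior mean of ν, and alpha, beta the PYP parameters of θ_d.

    import numpy as np

    def estimate_theta(words, nu_hat, phi, alpha, beta, R=50,
                       rng=np.random.default_rng()):
        """Monte Carlo estimate of theta_d for a test document, eqs (35)-(37)."""
        K = len(nu_hat)

        def theta_hat(c):
            t = c / 2.0               # table counts fixed at half the customers
            return ((alpha * t.sum() + beta) * nu_hat
                    + c - alpha * t) / (beta + c.sum())   # eq (34), parent nu

        samples = []
        for _ in range(R):
            c = np.zeros(K)           # customer counts for this document
            for w in words:           # sequential sampling, eq (36)
                p = theta_hat(c) * phi[:, w]
                z = rng.choice(K, p=p / p.sum())
                c[z] += 1
            samples.append(theta_hat(c))
        return np.mean(samples, axis=0)   # eq (37)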
4.5.2 Goodness-of-fit Test

Measures of goodness-of-fit usually involve computing the discrepancy between the observed values and the predicted values under the model. However, the observed variables in a topic model are the words in the corpus, which are not quantifiable since they are discrete labels. Thus evaluations of topic models are usually based on the model likelihoods instead. A popular metric commonly used to evaluate the goodness-of-fit of a topic model is perplexity, which is negatively related to the likelihood of the observed words W given the model; this is defined as

    perplexity(W | θ, φ) = exp( − [ ∑_{d=1}^{D} ∑_{n=1}^{N_d} log p(w_dn | θ_d, φ) ] / [ ∑_{d=1}^{D} N_d ] ),    (38)

where p(w_dn | θ_d, φ) is the likelihood of sampling the word w_dn given the document–topic distribution θ_d and the topic–word distributions φ. Computing p(w_dn | θ_d, φ) requires us to marginalise out z_dn from their joint distribution, as follows:

    p(w_dn | θ_d, φ) = ∑_k p(w_dn, z_dn = k | θ_d, φ)
                     = ∑_k p(w_dn | z_dn = k, φ_k) p(z_dn = k | θ_d)
                     = ∑_k φ_{k w_dn} θ_dk.    (39)

Although perplexity can be computed on the whole corpus, in practice we compute the perplexity on test documents. This measures whether the topic model generalises well to unseen data. A good topic model is able to predict the words in the test set better, thereby assigning a higher probability p(w_dn | θ_d, φ) to generating the words. Since perplexity is negatively related to the likelihood, a lower perplexity is better.

4.5.3 Document Clustering

We can also evaluate the clustering ability of topic models. Note that topic models assign a topic to each word in a document, essentially performing a soft clustering (Erosheva and Fienberg, 2005) of the documents, in which the membership is given by the document–topic distribution θ. To evaluate the clustering of the documents, we convert the soft clustering to a hard clustering by choosing the topic that best represents each document, hereafter called the dominant topic. The dominant topic of a document d corresponds to the topic that has the highest proportion in the topic distribution, that is,

    DominantTopic(θ_d) = argmax_k θ_dk.    (40)

Two commonly used evaluation measures for clustering are purity and normalised mutual information (NMI, Manning et al., 2008). Purity is a simple clustering measure which can be interpreted as the proportion of documents correctly clustered, while NMI is an information theoretic measure used for clustering comparison. If we denote the ground truth classes as S = {s_1, ..., s_J} and the obtained clusters as R = {r_1, ..., r_K}, where each s_j and r_k represents a collection (set) of documents, then the purity and NMI can be computed as

    purity(S, R) = (1/D) ∑_{k=1}^{K} max_j |r_k ∩ s_j|,    NMI(S, R) = 2 MI(S; R) / (E(S) + E(R)),    (41)

where MI(S; R) denotes the mutual information between the two sets and E(·) denotes the entropy. They are defined as follows:

    MI(S; R) = ∑_{k=1}^{K} ∑_{j=1}^{J} (|r_k ∩ s_j| / D) log_2 [ D |r_k ∩ s_j| / (|r_k| |s_j|) ],
    E(R) = − ∑_{k=1}^{K} (|r_k| / D) log_2 (|r_k| / D).    (42)

Note that the higher the purity or NMI, the better the clustering.
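Sketches of the three measures above (Equations (38), (39), (41) and (42)) are given below; theta is assumed to be a D × K array, phi a K × V array, and the clustering inputs integer label arrays with one entry per document.

    import numpy as np

    def perplexity(docs, theta, phi):
        """Equation (38); docs[d] is the list of word ids of test document d."""
        log_lik, n_words = 0.0, 0
        for d, words in enumerate(docs):
            for w in words:
                log_lik += np.log(theta[d] @ phi[:, w])   # eq (39)
                n_words += 1
        return np.exp(-log_lik / n_words)

    def purity(classes, clusters):
        """Equation (41), left; proportion of documents correctly clustered."""
        total = 0
        for r in np.unique(clusters):
            members = classes[clusters == r]
            total += np.bincount(members).max()   # majority class in cluster r
        return total / len(classes)

    def nmi(classes, clusters):
        """Equation (41), right, with MI and entropy as in Equation (42)."""
        mi, ent_s, ent_r = 0.0, 0.0, 0.0
        for s in np.unique(classes):
            p_s = np.mean(classes == s)
            ent_s -= p_s * np.log2(p_s)
            for r in np.unique(clusters):
                p_sr = np.mean((classes == s) & (clusters == r))
                if p_sr > 0:
                    mi += p_sr * np.log2(p_sr / (p_s * np.mean(clusters == r)))
        for r in np.unique(clusters):
            p_r = np.mean(clusters == r)
            ent_r -= p_r * np.log2(p_r)
        return 2 * mi / (ent_s + ent_r)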
5. Application: Modelling Social Network on Twitter

This section looks at how we can employ the framework discussed above for an application of tweet modelling, using auxiliary information that is available on Twitter. We propose the Twitter-Network topic model (TNTM) to jointly model the text and the social network in a fully Bayesian nonparametric way, in particular, by incorporating the authors, hashtags, the "follower" network, and the text content in modelling. The TNTM employs a HPYP for text modelling and a Gaussian process (GP) random function model for social network modelling. We show that the TNTM significantly outperforms several existing nonparametric models owing to its flexibility.

5.1 Motivation

The emergence of web services such as blogs, microblogs and social networking websites allows people to contribute information freely and publicly. This user-generated information is generally more personal, informal, and often contains personal opinions. In aggregate, it can be useful for reputation analysis of entities and products (Aula, 2010), natural disaster detection (Karimi et al., 2013), obtaining first-hand news (Broersma and Graham, 2012), or even demographic analysis (Correa et al., 2010). We focus on Twitter, an accessible source of information that allows users to freely voice their opinions and thoughts in short text known as tweets.

Although LDA (Blei et al., 2003) is a popular model for text modelling, a direct application to tweets often yields poor results, as tweets are short and often noisy (Zhao et al., 2011; Baldwin et al., 2013); that is, tweets are unstructured and often contain grammatical and spelling errors, as well as informal words such as user-defined abbreviations, due to the 140-character limit. LDA fails on short tweets since it is heavily dependent on word co-occurrence. Also notable is that the text in tweets may contain special tokens known as hashtags; they are used as keywords and allow users to link their tweets with other tweets tagged with the same hashtag. Nevertheless, hashtags are informal since they have no standards. Hashtags can be used as both inline words and categorical labels. When used as labels, hashtags are often noisy, since users can create new hashtags easily and use any existing hashtags in any way they like.³ Hence, instead of being treated as hard labels, hashtags are best treated as special words which can be the themes of the tweets. These properties of tweets make them challenging for topic models, and ad hoc alternatives are used instead. For instance, Maynard et al. (2012) advocate the use of shallow methods for tweets, and Mehrotra et al. (2013) utilise a tweet-pooling approach to group short tweets into larger documents. In other text analysis applications, tweets are often 'cleansed' by NLP methods such as lexical normalisation (Baldwin et al., 2013). However, the use of normalisation is also criticised (Eisenstein, 2013), as normalisation can change the meaning of text.

³ For example, hashtag hijacking, where a well defined hashtag is used in an "inappropriate" way. The most notable example is the hashtag #McDStories: though it was initially created to promote happy stories about McDonald's, the hashtag was hijacked with negative stories about McDonald's.

In the following, we propose a novel method for better modelling of microblogs by leveraging the auxiliary information that accompanies tweets. This information, complementing word co-occurrence, also opens the door to more applications, such as user recommendation and hashtag suggestion.
Our major contributions include (1) a fully Bayesian nonparametric model named the Twitter-Network topic model (TNTM) that models tweets well, and (2) a combination of both the HPYP and the GP to jointly model text, hashtags, authors and the followers network. Despite the seeming complexity of the TNTM, its implementation is made relatively straightforward using the flexible framework developed in Section 3. Indeed, a number of other variants were rapidly implemented with this framework as well.

5.2 The Twitter-Network Topic Model

The TNTM makes use of the accompanying hashtags, authors, and followers network to model tweets better. The TNTM is composed of two main components: a HPYP topic model for the text and hashtags, and a GP based random function network model for the followers network. The authorship information serves to connect the two together. The HPYP topic model is illustrated by region b in Figure 4, while the network model is captured by region a.

Figure 4: Graphical model for the Twitter-Network Topic Model (TNTM), composed of a HPYP topic model (region b) and a GP based random function network model (region a). The author–topic distributions ν serve to link the two together. Each tweet is modelled with a hierarchy of document–topic distributions denoted by η, θ′, and θ, where each is attuned to the whole tweet, the hashtags, and the words, in that order. With their own topic assignments z′ and z, the hashtags y and the words w are separately modelled. They are generated from the topic–hashtag distributions ψ′ and the topic–word distributions ψ respectively. The variables μ₀, μ₁ and γ are priors for the respective PYPs. The connections between the authors are denoted by x, modelled by the random function F.

5.2.1 HPYP Topic Model

The HPYP topic model described in Section 3 is extended as follows. For the word distributions, we first generate a parent word distribution prior γ for all topics:

    γ ~ PYP(α^γ, β^γ, H^γ),    (43)

where H^γ is a discrete uniform distribution over the complete word vocabulary V.⁴ Then, we sample the hashtag distribution ψ′_k and word distribution ψ_k for each topic k, with γ as the base distribution:

    ψ′_k | γ ~ PYP(α^{ψ′_k}, β^{ψ′_k}, γ),    (44)
    ψ_k | γ ~ PYP(α^{ψ_k}, β^{ψ_k}, γ),   for k = 1, ..., K.    (45)

⁴ The complete word vocabulary contains the words and hashtags seen in the corpus.

Note that the tokens of the hashtags are shared with the words, that is, the hashtag #happy shares the same token as the word happy, and they are thus treated as the same word. This treatment is important since some hashtags are used as words instead of labels.⁵ Additionally, this also allows any word to be a hashtag, which will be useful for hashtag recommendation.

⁵ For instance, as illustrated by the following tweet: "i want to get into #photography. can someone recommend a good beginner #camera please? i dont know where to start".

For the topic distributions, we generate a global topic distribution μ₀, which serves as a prior, from a GEM distribution.
Then we generate the author–topic distribution ν_i for each author i, and a miscellaneous topic distribution μ₁ to capture topics that deviate from the authors' usual topics:

    μ₀ ~ GEM(α^{μ₀}, β^{μ₀}),    (46)
    μ₁ | μ₀ ~ PYP(α^{μ₁}, β^{μ₁}, μ₀),    (47)
    ν_i | μ₀ ~ PYP(α^{ν_i}, β^{ν_i}, μ₀),   for i = 1, ..., A.    (48)

For each tweet d, given the author–topic distribution ν and the observed author a_d, we sample the document–topic distribution η_d, as follows:

    η_d | a_d, ν ~ PYP(α^{η_d}, β^{η_d}, ν_{a_d}),   for d = 1, ..., D.    (49)

Next, we generate the topic distributions for the observed hashtags (θ′_d) and the observed words (θ_d), following the technique used in the adaptive topic model (Du et al., 2012a). We explicitly model the influence of hashtags on words by generating the words conditioned on the hashtags. The intuition comes from hashtags being the themes of a tweet, which drive the content of the tweet. Specifically, we sample the mixing proportion ρ^{θ′_d}, which controls the contributions of η_d and μ₁ to the base distribution of θ′_d, and then generate θ′_d given ρ^{θ′_d}:

    ρ^{θ′_d} ~ Beta(λ^{θ′_d}_0, λ^{θ′_d}_1),    (50)
    θ′_d | μ₁, η_d ~ PYP(α^{θ′_d}, β^{θ′_d}, ρ^{θ′_d} μ₁ + (1 − ρ^{θ′_d}) η_d).    (51)

We set θ′_d and η_d as the parent distributions of θ_d. This flexible configuration allows us to investigate the relationship between θ_d, θ′_d and η_d, that is, we can examine whether θ_d is directly determined by η_d, or through θ′_d. The mixing proportion ρ^{θ_d} and the topic distribution θ_d are generated similarly:

    ρ^{θ_d} ~ Beta(λ^{θ_d}_0, λ^{θ_d}_1),    (52)
    θ_d | η_d, θ′_d ~ PYP(α^{θ_d}, β^{θ_d}, ρ^{θ_d} η_d + (1 − ρ^{θ_d}) θ′_d).    (53)

The hashtags and words are then generated in a similar fashion to LDA. For the m-th hashtag in tweet d, we sample a topic z′_dm and the hashtag y_dm by

    z′_dm | θ′_d ~ Discrete(θ′_d),    (54)
    y_dm | z′_dm, ψ′ ~ Discrete(ψ′_{z′_dm}),   for m = 1, ..., M_d,    (55)

where M_d is the number of observed hashtags in tweet d. For the n-th word in tweet d, we sample a topic z_dn and the word w_dn by

    z_dn | θ_d ~ Discrete(θ_d),    (56)
    w_dn | z_dn, ψ ~ Discrete(ψ_{z_dn}),   for n = 1, ..., N_d,    (57)

where N_d is the number of observed words in tweet d. We note that all of the above α, β and λ are the hyperparameters of the model. We show the importance of the above modelling choices with ablation studies in Section 5.6. Although the HPYP topic model may seem complex, it is a simple network of PYP nodes, since all distributions on the probability vectors are modelled by the PYP.

5.2.2 Random Function Network Model

The network modelling is connected to the HPYP topic model via the author–topic distributions ν, where we treat ν as inputs to the GP in the network model. The GP, represented by F, determines the link between two authors (x_ij), which indicates the existence of a social link between author i and author j. For each pair of authors, we sample their connection with the following random function network model:

    Q_ij | ν ~ F(ν_i, ν_j),    (58)
    x_ij | Q_ij ~ Bernoulli(s(Q_ij)),   for i = 1, ..., A; j = 1, ..., A,    (59)

where s(·) is the sigmoid function:

    s(t) = 1 / (1 + e^{−t}).    (60)
By marginalising out F, we can write Q ∼ GP(ς, κ), where Q is a vectorised collection of the Q_ij.⁶ Here ς denotes the mean vector and κ the covariance matrix of the GP:

  ς_ij = Sim(ν_i, ν_j),    (61)
  κ_{ij,i′j′} = (s²/2) exp(−(Sim(ν_i, ν_j) − Sim(ν_{i′}, ν_{j′}))² / (2l²)) + σ² I(ij = i′j′),    (62)

where s, l and σ are the hyperparameters associated with the kernel. Sim(·, ·) is a similarity function with range between 0 and 1, here chosen to be cosine similarity due to its ease of computation and popularity.

6. Q = (Q_11, Q_12, …, Q_AA)^T; note that ς and κ follow the same indexing.

5.2.3 Relationships with Other Models

The TNTM is related to many existing models obtained by removing certain components of the model. When the hashtags and the network component are removed, the TNTM reduces to a nonparametric variant of the author-topic model (ATM). Conversely, if the authorship information is discarded, the TNTM resembles the correspondence LDA (Blei and Jordan, 2003), although it differs in that it allows hashtags and words to be generated from a common vocabulary. In contrast to existing parametric models, the network model in the TNTM provides possibly the most flexible way of modelling the network, via a nonparametric Bayesian prior (the GP), following Lloyd et al. (2012). Differently from Lloyd et al. (2012), we propose a new kernel function that fits our purpose better and achieves a significant improvement over the original kernel.

5.3 Representation and Model Likelihood

As in previous sections, we represent the TNTM using the CRP representation discussed in Section 3.2. However, since the PYP variables in the TNTM can have multiple parents, we extend the representation following Du et al. (2012a). The distinction is that we store multiple table counts for each PYP: to illustrate, t^{N→P}_k represents the number of tables in PYP N serving dish k that contribute to the customer count c^P_k in PYP P. Similarly, the total table count contributed to P is denoted T^{N→P} = Σ_k t^{N→P}_k. Note that the number of tables in PYP N is t^N_k = Σ_P t^{N→P}_k, while the total number of tables is T^N = Σ_P T^{N→P}. We refer the readers to Lim et al. (2013, Appendix B) for a detailed discussion.

We use boldface capital letters to denote the sets of all relevant lower-case variables: we denote W° = {W, Y} as the set of all words and hashtags; Z° = {Z, Z′} as the set of all topic assignments for the words and the hashtags; T as the set of all table counts and C as the set of all customer counts; and we introduce Ξ as the set of all hyperparameters. By marginalising out the latent variables, we write down the model likelihood corresponding to the HPYP topic model in terms of the counts:

  p(Z°, T, C | W°, Ξ) ∝ p(Z°, W°, T, C | Ξ)
    ∝ f(µ_0) f(µ_1) [∏_{i=1}^{A} f(ν_i)] [∏_{k=1}^{K} f(ψ′_k) f(ψ_k)] f(γ)
      × [∏_{d=1}^{D} f(η_d) f(θ′_d) f(θ_d) g(ρ^{θ′_d}) g(ρ^{θ_d})] ∏_{v=1}^{|V|} (1/|V|)^{t^γ_v},    (63)

where f(N) is the modularised likelihood corresponding to node N, as defined by Equation (16), and g(ρ) is the likelihood corresponding to the probability ρ that controls which parent node a customer is sent to, defined as

  g(ρ^N) = B(λ^N_0 + T^{N→P_0}, λ^N_1 + T^{N→P_1})    (64)

for N ∼ PYP(α^N, β^N, ρ^N P_0 + (1 − ρ^N) P_1). Note that B(a, b) denotes the Beta function that normalises a Dirichlet distribution:

  B(a, b) = Γ(a) Γ(b) / Γ(a + b).    (65)
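In an implementation, g(ρ^N) is typically evaluated in log space; below is a minimal sketch (our own, with T_to_P0 and T_to_P1 standing for the table counts T^{N→P_0} and T^{N→P_1}) using scipy.special.betaln.

```python
from scipy.special import betaln

def log_g(lam0, lam1, T_to_P0, T_to_P1):
    """Log of Equation (64): log B(lambda_0 + T^{N->P_0}, lambda_1 + T^{N->P_1})."""
    return betaln(lam0 + T_to_P0, lam1 + T_to_P1)
```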
For the random function network model, the conditional posterior can be derived as

  p(Q | X, ν, Ξ) ∝ p(X, Q | ν, Ξ)
    ∝ [∏_{i=1}^{A} ∏_{j=1}^{A} s(Q_ij)^{x_ij} (1 − s(Q_ij))^{1−x_ij}]
      × |κ|^{−1/2} exp(−(1/2) (Q − ς)^T κ^{−1} (Q − ς)).    (66)

The full posterior likelihood is thus the product of the topic model posterior (Equation (63)) and the network posterior (Equation (66)):

  p(Q, Z°, T, C | X, W°, Ξ) = p(Z°, T, C | W°, Ξ) p(Q | X, ν, Ξ).    (67)

5.4 Performing Posterior Inference on the TNTM

In the TNTM, combining a GP with a HPYP makes posterior inference non-trivial. Hence, we employ approximate inference by alternately performing MCMC sampling on the HPYP topic model and on the network model, conditioned on each other. For the HPYP topic model, we employ the flexible framework discussed in Section 3 to perform collapsed blocked Gibbs sampling. For the network model, we derive a Metropolis-Hastings (MH) algorithm based on the elliptical slice sampler (Murray et al., 2010). In addition, the author–topic distributions ν connecting the HPYP and the GP are sampled with an MH scheme, since their posteriors do not follow a standard form. We note that the PYPs in this section can have multiple parents, so we extend the framework in Section 3 to allow for this.

The collapsed Gibbs sampling for the HPYP topic model in the TNTM is similar to the procedure in Section 4, with two main differences. The first is that we need to sample the topics for both the words and the hashtags, each with a different conditional posterior compared to that of Section 4. The second is that the PYPs in the TNTM can have multiple parents, so an alternative way of decrementing the counts is required. A detailed discussion of posterior inference and hyperparameter sampling is presented in the appendix.

5.5 Twitter Data

For the evaluation of the TNTM, we construct a tweet corpus from the Twitter7 dataset (Yang and Leskovec, 2011).⁷ This corpus is queried using the hashtags #sport, #music, #finance, #politics, #science and #tech, chosen for diversity. We remove the non-English tweets with langid.py (Lui and Baldwin, 2012). We obtain the data on the followers network from Kwak et al. (2010).⁸ However, this followers network data is not complete and does not contain information for all authors, so we filter out of the tweet corpus the authors that are not part of the followers network data. Additionally, we also remove authors who have written fewer than fifty tweets. We name this corpus T6 since it is queried with six hashtags; it consists of 240,517 tweets from 150 authors after filtering.

Besides the T6 corpus, we also use the tweet datasets described in Mehrotra et al. (2013). This collection contains three corpora, each queried with exactly ten query terms. The first corpus, named the Generic Dataset, is queried with generic terms. The second, named the Specific Dataset, is composed of tweets on specific named entities. Lastly, the Events Dataset is associated with certain events. These datasets are mainly used for comparing the performance of the TNTM against the tweet pooling techniques in Mehrotra et al. (2013). We present a summary of the tweet corpora in Table 3.

7. http://snap.stanford.edu/data/twitter7.html
8. http://an.kaist.ac.kr/traces/WWW2010.html
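For reference, the language filtering above amounts to one classification call per tweet; a sketch of this preprocessing step (the tweet strings are placeholders):

```python
import langid  # the off-the-shelf tool of Lui and Baldwin (2012)

tweets = ["i want to get into #photography",
          "ceci n'est pas un tweet anglais"]

# langid.classify returns a (language, score) pair; keep English tweets only
english_tweets = [t for t in tweets if langid.classify(t)[0] == "en"]
```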
Table 3: Summary of the datasets used in this section, showing the number of tweets (D), authors (A), unique word tokens (|V|), and the average numbers of words and hashtags per tweet. The T6 dataset is queried with six different hashtags and thus has a higher number of hashtags per tweet. We note that there is a typo in the number of tweets for the Events Dataset in Mehrotra et al. (2013); the correct number is 107,128.

  Dataset     Tweets     Authors    Vocabulary    Words/Tweet    Hashtags/Tweet
  T6          240,517    150        5,343         6.35           1.34
  Generic     359,478    213,488    14,581        6.84           0.10
  Specific    214,580    116,685    15,751        6.31           0.25
  Events      107,128    67,388     12,765        5.84           0.17

5.6 Experiments and Results

We consider several tasks to evaluate the TNTM. The first task compares the TNTM with existing baselines on topic modelling of tweets. We also compare the TNTM with the random function network model on modelling the followers network. Next, we evaluate the TNTM with ablation studies, in which we compare the TNTM against itself with each component taken away. Additionally, we evaluate the clustering performance of the TNTM, comparing it against the state-of-the-art tweet-pooling LDA method of Mehrotra et al. (2013).

5.6.1 Experiment Settings

In all the following experiments, we vary the discount parameters α for the topic distributions µ_0, µ_1, ν_i, η_d, θ′_d and θ_d, and we set α to 0.7 for the word distributions ψ, ψ′ and γ to induce power-law behaviour (Goldwater et al., 2011). We initialise the concentration parameters β to 0.5, noting that they are learned automatically during inference; we set their hyperprior to Gamma(0.1, 0.1) for a vague prior. We fix the hyperparameters λ, s, l and σ to 1, as we find that their values have no significant impact on model performance.⁹ In the following evaluations, we run the full inference algorithm for 2,000 iterations for the models to converge; the MH algorithm only starts after 1,000 iterations. We repeat each experiment five times to reduce the estimation error of the evaluations.

9. We varied these hyperparameters over the range 0.01 to 10 during testing.

5.6.2 Goodness-of-fit Test

We compare the TNTM with the HDP-LDA and a nonparametric author-topic model (ATM) on fitting the text data (words and hashtags). Performance is measured by perplexity on the test set (see Section 4.5.2). The perplexity for the TNTM, accounting for both words and hashtags, is

  Perplexity(W°) = exp(−log p(W° | ν, µ_1, ψ, ψ′) / Σ_{d=1}^{D} (N_d + M_d)),    (68)

where the likelihood p(W° | ν, µ_1, ψ, ψ′) is broken into

  p(W° | ν, µ_1, ψ, ψ′) = ∏_{d=1}^{D} [∏_{m=1}^{M_d} p(y_dm | ν, µ_1, ψ′)] [∏_{n=1}^{N_d} p(w_dn | y_d, ν, µ_1, ψ)].    (69)
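Given the total log likelihood of Equation (69), Equation (68) reduces to a two-line computation; a minimal sketch with hypothetical inputs:

```python
import numpy as np

def perplexity(total_log_lik, words_per_tweet, hashtags_per_tweet):
    """Equation (68): exponentiated negative log likelihood per token,
    where both words and hashtags count as tokens."""
    n_tokens = np.sum(words_per_tweet) + np.sum(hashtags_per_tweet)
    return np.exp(-total_log_lik / n_tokens)
```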
Table 4: Test perplexity and network log likelihood comparisons between the HDP-LDA, the nonparametric ATM, the random function network model and the TNTM. Lower perplexity indicates a better model fit. The TNTM significantly outperforms the other models in terms of model fitting.

  Model                 Test Perplexity    Network Log Likelihood
  HDP-LDA               840.03 ± 15.7      N/A
  Nonparametric ATM     664.25 ± 17.76     N/A
  Random Function       N/A                −557.86 ± 11.2
  TNTM                  505.01 ± 7.8       −500.63 ± 13.6

We also compare the TNTM against the original random function network model in terms of the log likelihood of the network data, given by log p(X | ν). We present the comparison of the perplexity and the network log likelihood in Table 4; note that for the network log likelihood, the less negative the better. From the results, we can see that the TNTM achieves a much lower perplexity than the HDP-LDA and the nonparametric ATM, and that the nonparametric ATM is in turn significantly better than the HDP-LDA. This clearly shows that using more auxiliary information gives a better model fit. Additionally, we can also see that jointly modelling the text and the network data leads to better modelling of the followers network.

5.6.3 Ablation Test

Next, we perform an extensive ablation study on the TNTM. The components tested in this study are (1) authorship, (2) hashtags, (3) the PYP µ_1, (4) the connection between the PYPs θ′_d and θ_d, and (5) the power-law behaviour of the PYPs. We compare the full TNTM against variants in which each component is ablated. Table 5 presents the test set perplexity and the network log likelihood of these models; it shows significant improvements of the TNTM over the ablated models.

Table 5: Ablation test on the TNTM. The test perplexity and the network log likelihood are evaluated for the TNTM against several ablated variants. The results show that each component of the TNTM is important.

  TNTM Model             Test Perplexity    Network Log Likelihood
  No author              669.12 ± 9.3       N/A
  No hashtag             1017.23 ± 27.5     −522.83 ± 17.7
  No µ_1 node            607.70 ± 10.7      −508.59 ± 9.8
  No θ′–θ connection     551.78 ± 16.0      −509.21 ± 18.7
  No power-law           508.64 ± 7.1       −560.28 ± 30.7
  Full model             505.01 ± 7.8       −500.63 ± 13.6

From this, we see that the greatest improvement in perplexity comes from modelling the hashtags, which suggests that the hashtag information is the most important for modelling tweets. Second to the hashtags, the authorship information is very important as well. Even though modelling the power-law behaviour is not that important for perplexity, the improvement in the network log likelihood is best achieved by modelling the power-law. This is because the added flexibility enables us to learn the author–topic distributions better, thus allowing the TNTM to fit the network data better. It also suggests that the authors in the corpus tend to focus on a specific topic rather than having wide interests.
5.6.4 Document Clustering and Topic Coherence

Mehrotra et al. (2013) show that running LDA on pooled rather than unpooled tweets gives a significant improvement in clustering. In particular, they find that grouping tweets based on their hashtags provides the most improvement. Here, we show that instead of resorting to such an ad hoc method, the TNTM achieves a significantly better clustering result. The clustering evaluations are measured with purity and normalised mutual information (NMI, see Manning et al., 2008), described in Section 4.5.3. Since ground truth labels are unknown, we use the respective query terms as the ground truth for the evaluations; tweets that satisfy multiple labels are removed. Given the learned model, we assign a tweet to a cluster based on its dominant topic. We perform the evaluations on the Generic, Specific and Events datasets for comparison purposes. We note the lack of network information in these datasets, and thus we employ only the HPYP part of the TNTM. Additionally, since purity can trivially be improved by increasing the number of clusters, we limit the maximum number of topics to twenty for a fair comparison.

Table 6: Clustering evaluations of the TNTM against LDA with different pooling schemes. Note that higher purity and NMI indicate better performance. The results for the different pooling methods are obtained from Table 4 in Mehrotra et al. (2013). The TNTM achieves better performance on purity and NMI for all datasets except the Specific dataset, where it obtains the same purity score as the best pooling method.

                    Purity                         NMI
  Method/Model      Generic  Specific  Events     Generic  Specific  Events
  No pooling        0.49     0.64      0.69       0.28     0.22      0.39
  Author            0.54     0.62      0.60       0.24     0.17      0.41
  Hourly            0.45     0.61      0.61       0.07     0.09      0.32
  Burstwise         0.42     0.60      0.64       0.18     0.16      0.33
  Hashtag           0.54     0.68      0.71       0.28     0.23      0.42
  TNTM              0.66     0.68      0.79       0.43     0.31      0.52

We present the results in Table 6. We can see that the TNTM outperforms the pooling methods in all aspects except on the Specific dataset, where it achieves the same purity as the best pooling scheme.
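For reference, purity and NMI can be computed as in the sketch below (our own illustration using scikit-learn; the toy label arrays are placeholders for the query-term labels and the tweets' dominant topics).

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    """Fraction of documents assigned to the majority true label of their cluster."""
    cm = contingency_matrix(labels_true, labels_pred)
    return cm.max(axis=0).sum() / cm.sum()

labels_true = np.array([0, 0, 1, 1, 2, 2])   # query-term labels
labels_pred = np.array([0, 0, 1, 2, 2, 2])   # dominant topic of each tweet
print(purity(labels_true, labels_pred))
print(normalized_mutual_info_score(labels_true, labels_pred))
```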
5.6.5 Automatic Topic Labelling

Traditionally, researchers assign a label to each topic–word distribution manually by inspection. More recently, there have been attempts to label topics automatically in topic modelling. For instance, Lau et al. (2011) use Wikipedia to extract labels for topics, and Mehdad et al. (2013) use entailment relations to select relevant phrases for topics. Here, we show that we can use hashtags to obtain good topic labels. In Table 7, we display the top words from the topic–word distribution ψ_k for each topic k. Instead of manually assigning topic labels, we display the top three hashtags from the topic–hashtag distribution ψ′_k.

Table 7: Topical analysis on the T6 dataset with the TNTM, displaying the top three hashtags and the top words for six topics. Instead of manually assigning a label to each topic, we find that the top hashtags can serve as the topic labels.

  Topic     Top Hashtags                    Top Words
  Topic 1   finance, money, economy         finance, money, bank, marketwatch, stocks, china, group, shares, sales
  Topic 2   politics, iranelection, tcot    politics, iran, iranelection, tcot, tlot, topprog, obama, musiceanewsfeed
  Topic 3   music, folk, pop                music, folk, monster, head, pop, free, indie, album, gratuit, dernier
  Topic 4   sports, women, asheville        sports, women, football, win, game, top, world, asheville, vols, team
  Topic 5   tech, news, jobs                tech, news, jquery, jobs, hiring, gizmos, google, reuters
  Topic 6   science, news, biology          science, news, source, study, scientists, cancer, researchers, brain, biology, health

As we can see from Table 7, the hashtags appear suitable as topic labels. In fact, empirically evaluating the suitability of the hashtags in representing the topics, we consistently find that over 90% of the hashtags are good candidates for topic labels. Moreover, inspecting the topics shows that the major hashtags coincide with the query terms used in constructing the T6 dataset, which is to be expected. This verifies that the TNTM is working properly.

6. Conclusion

In this article, we proposed a topic modelling framework utilising PYPs, whose realisations are probability distributions or further stochastic processes of the same type. In particular, for the purpose of performing inference, we described the CRP representation of the PYPs. This allowed us to propose a single framework, discussed in Section 3, for implementing these topic models, in which we modularise the PYPs (and other variables) into blocks that can be combined to form different models. Doing so saves significant time in implementing the topic models.

We presented a general HPYP topic model that can be seen as a generalisation of the HDP-LDA (Teh and Jordan, 2010). The HPYP topic model is represented using the Chinese Restaurant Process (CRP) metaphor (Teh and Jordan, 2010; Blei et al., 2010; Chen et al., 2011), and we discussed how the posterior likelihood of the HPYP topic model can be modularised. We then detailed the learning algorithm for the topic model in this modularised form.

We applied our HPYP topic model framework to Twitter data and proposed the Twitter-Network Topic Model (TNTM). The TNTM models the authors, text, hashtags, and the author–follower network in an integrated manner. In addition to the HPYP, the TNTM employs a Gaussian process (GP) for the network modelling. The main suggested use of the TNTM is content discovery on social networks. Through experiments, we showed that jointly modelling the text content and the network leads to a better model fit than modelling them separately. The qualitative analysis shows that the learned topics and the authors' topics are sound. Our experiments suggest that incorporating more auxiliary information leads to better fitting models.

6.1 Future Research

For future work on the TNTM, it would be interesting to apply it to other types of data, such as blogs and news feeds. We could also use the TNTM for other applications, such as hashtag recommendation and content suggestion for new Twitter users. Moreover, we could extend the TNTM to incorporate more auxiliary information: for instance, we could model the location of tweets and the embedded multimedia content such as URLs, images and videos. Another interesting source of information would be the path of retweeted content.

Another interesting area of research is the combination of different kinds of topic models for a better analysis, which would allow us to transfer learned knowledge from one topic model to another. Combining LDAs has already been looked at by Schnober and Gurevych (2015); however, combining other kinds of topic models, such as nonparametric ones, remains unexplored.

Acknowledgments

The authors would like to thank Shamin Kinathil, the editors, and the anonymous reviewers for their valuable feedback and comments.
NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.

Appendix A. Posterior Inference for TNTM

A.1 Decrementing the Counts Associated with a Word or Hashtag

When we remove a word or a hashtag during inference, we decrement by one the customer count of the PYP associated with the word or the hashtag, that is, c^{θ_d}_k for word w_dn (with z_dn = k) and c^{θ′_d}_k for hashtag y_dm (with z′_dm = k). Decrementing the customer count may or may not decrement the respective table count; however, if the table count is decremented, then we also decrement the customer count of the parent PYP. This is relatively straightforward in Section 4.1, since there the PYPs have only one parent. Here, when a PYP N has multiple parents, we sample one of its parent PYPs and decrement the table count corresponding to that parent. Although not identical, the rationale of this procedure follows Section 4.1. We explain in more detail below.

When the customer count c^N_k is decremented, we introduce an auxiliary variable u^N_k that indicates which parent of N to remove a table from, or none at all. The sample space of u^N_k is the P parent nodes P_1, …, P_P of N, plus ∅. When u^N_k equals P_i, we decrement the table count t^{N→P_i}_k and subsequently decrement the customer count c^{P_i}_k in node P_i. If u^N_k equals ∅, we do not decrement any table count. The process is repeated recursively as long as a customer count is decremented, that is, we stop when u^N_k = ∅. The value of u^N_k is sampled as follows:

  p(u^N_k = P_i) = t^{N→P_i}_k / c^N_k ,    p(u^N_k = ∅) = 1 − Σ_{i=1}^{P} p(u^N_k = P_i).    (70)

To illustrate, when a word w_dn (with topic z_dn) is removed, we decrement c^{θ_d}_{z_dn}, that is, c^{θ_d}_{z_dn} becomes c^{θ_d}_{z_dn} − 1. We then determine whether this word contributes to any table in node θ_d by sampling u^{θ_d}_{z_dn} from Equation (70). If u^{θ_d}_{z_dn} = ∅, we do not decrement any table count and proceed with the next step of the Gibbs sampler; otherwise, u^{θ_d}_{z_dn} can be either θ′_d or η_d, in which case we decrement t^{θ_d→u}_{z_dn} and c^{u}_{z_dn} for the sampled parent u, and continue the process recursively. We present the decrementing process in Algorithm 2.

To remove a word w_dn during inference, we thus decrement the counts contributed by w_dn (and z_dn). For the topic side, we decrement the counts associated with node N = θ_d and group k = z_dn using Algorithm 2, while for the vocabulary side, we decrement the counts associated with node N = ψ_{z_dn} and group k = w_dn. The effect of the word on the other PYP variables is implicitly accounted for through the recursion. The procedure for decrementing a hashtag y_dm is similar: we decrement the counts for N = θ′_d with k = z′_dm (topic side), then the counts for N = ψ′_{z′_dm} with k = y_dm (vocabulary side).

Algorithm 2 Decrementing counts associated with a PYP N and group k.
1. Decrement the customer count c^N_k by one.
2. Sample an auxiliary variable u^N_k with Equation (70).
3. For the sampled u^N_k, perform the following:
   (a) If u^N_k = ∅, exit the algorithm.
   (b) Otherwise, decrement the table count t^{N→u^N_k}_k by one and repeat from Step 1 with N replaced by u^N_k.
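A compact sketch of Algorithm 2 (our own rendering; node.parents, node.c and node.t_to are hypothetical structures holding the parent list, the per-dish customer counts, and the per-parent table counts):

```python
import numpy as np

rng = np.random.default_rng()

def decrement(node, k):
    """Algorithm 2: remove one customer of dish k from PYP node `node`,
    recursing into a parent whenever a table is removed (Equation (70))."""
    # Equation (70), evaluated with the customer count before removal
    probs = [node.t_to[p][k] / node.c[k] for p in node.parents]
    probs.append(1.0 - sum(probs))          # the "no table removed" outcome
    node.c[k] -= 1
    choice = rng.choice(len(probs), p=probs)
    if choice < len(node.parents):          # a table was removed
        parent = node.parents[choice]
        node.t_to[parent][k] -= 1
        decrement(parent, k)                # the parent loses a customer too
```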
A.2 Sampling a New Topic for a Word or a Hashtag

After decrementing, we sample a new topic for the word or the hashtag. The sampling process follows the procedure discussed in Section 4.2, but with different conditional posteriors (for both the word and the hashtag). The full conditional posterior probability for the collapsed blocked Gibbs sampler can be derived easily. For instance, the conditional posterior for sampling the topic z_dn of word w_dn is

  p(z_dn, T, C | Z°^{−dn}, W°, T^{−dn}, C^{−dn}, Ξ) = p(Z°, T, C | W°, Ξ) / p(Z°^{−dn}, T^{−dn}, C^{−dn} | W°, Ξ),    (71)

which can then be easily decomposed into a simpler form (see the discussion in Section 4.2) using Equation (63). Here, the superscript −dn indicates that the word w_dn and the topic z_dn are removed from the respective sets. Similarly, the conditional posterior for sampling the topic z′_dm of hashtag y_dm can be derived as

  p(z′_dm, T, C | Z°^{−dm}, W°, T^{−dm}, C^{−dm}, Ξ) = p(Z°, T, C | W°, Ξ) / p(Z°^{−dm}, T^{−dm}, C^{−dm} | W°, Ξ),    (72)

where the superscript −dm signals the removal of the hashtag y_dm and the topic z′_dm. As in Section 4.2, we compute the posterior for all possible changes to T and C corresponding to the new topic (for z_dn or z′_dm), and we then sample the next state using a Gibbs sampler.

A.3 Estimating the Probability Vectors of the PYPs with Multiple Parents

Following Section 4.4, we estimate the various probability distributions of the PYPs by their posterior means. For a PYP N with a single PYP parent P_1, as discussed in Section 4.4, we can estimate its probability vector N̂ = (N̂_1, …, N̂_K) as

  N̂_k = E[N_k | Z°, W°, T, C, Ξ]
       = ((α^N T^N + β^N) / (β^N + C^N)) E[P_{1k} | Z°, W°, T, C, Ξ] + (c^N_k − α^N t^N_k) / (β^N + C^N),    (73)

which lets one analyse the probability vectors in a topic model using recursion. The posterior mean is slightly more complicated for a PYP N that has multiple PYP parents P_1, …, P_P. Formally, we define the PYP N as

  N | P_1, …, P_P ∼ PYP(α^N, β^N, ρ^N_1 P_1 + ⋯ + ρ^N_P P_P),    (74)

where the mixing proportion ρ^N = (ρ^N_1, …, ρ^N_P) follows a Dirichlet distribution with parameter λ^N = (λ^N_1, …, λ^N_P):

  ρ^N ∼ Dirichlet(λ^N).    (75)

Before we can estimate the probability vector, we first estimate the mixing proportion by its posterior mean given the customer counts and table counts:

  ρ̂^N_i = E[ρ^N_i | Z°, W°, T, C, Ξ] = (T^{N→P_i} + λ^N_i) / (T^N + Σ_i λ^N_i).    (76)

Then, we can estimate the probability vector N̂ = (N̂_1, …, N̂_K) by

  N̂_k = ((α^N T^N + β^N) / (β^N + C^N)) Ĥ^N_k + (c^N_k − α^N t^N_k) / (β^N + C^N),    (77)

where Ĥ^N = (Ĥ^N_1, …, Ĥ^N_K) is the expected base distribution:

  Ĥ^N_k = Σ_{i=1}^{P} ρ̂^N_i E[P_{ik} | Z°, W°, T, C, Ξ].    (78)

With these formulations, all the topic distributions and word distributions in the TNTM can be reconstructed from the customer counts and table counts. For instance, the author–topic distribution ν_i of each author i can be determined recursively by first estimating the topic distribution µ_0. The word distributions for each topic are estimated similarly.
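Equations (76)–(78) translate directly into a few lines of array code; a minimal sketch (our own, with hypothetical array arguments):

```python
import numpy as np

def estimate_pyp_mean(alpha, beta, c, t, T_to, lam, parent_means):
    """Posterior mean of a multi-parent PYP node, Equations (76)-(78).

    c, t         -- per-dish customer and table counts, shape (K,)
    T_to         -- total table counts toward each parent, shape (P,)
    lam          -- Dirichlet parameters of the mixing proportions, shape (P,)
    parent_means -- estimated probability vectors of the parents, shape (P, K)
    """
    C, T = c.sum(), T_to.sum()
    rho_hat = (T_to + lam) / (T + lam.sum())    # Equation (76)
    H_hat = rho_hat @ parent_means              # Equation (78)
    return ((alpha * T + beta) * H_hat + c - alpha * t) / (beta + C)  # Equation (77)
```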
A.4 MH Algorithm for the Random Function Network Model

Here, we discuss how we learn the topic distributions µ_0 and ν from the random function network model. We configure the MH algorithm to start after one thousand iterations of the collapsed blocked Gibbs sampler, so that we can quickly initialise the TNTM with the HPYP topic model before running the full algorithm. In addition, this allows us to demonstrate the improvement to the TNTM due to the random function network model. To facilitate the MH algorithm, we have to represent the topic distributions µ_0 and ν explicitly as probability vectors; that is, we do not store the customer counts and table counts for µ_0 and ν after the MH algorithm starts. In the MH algorithm, we propose new samples for µ_0 and ν, and then accept or reject them. The details of the MH algorithm are as follows.

In each iteration of the MH algorithm, we use Dirichlet distributions as proposal distributions for µ_0 and ν:

  q(µ_0^new | µ_0) = Dirichlet(β^{µ_0} µ_0),    (79)
  q(ν_i^new | ν_i) = Dirichlet(β^{ν_i} ν_i).    (80)

The proposed µ_0 and ν are sampled given their previous values; we note that the first µ_0 and ν are computed using the technique discussed in A.3. These proposed samples are subsequently used to sample Q^new. We first compute the quantities ς^new and κ^new from the proposed µ_0^new and ν^new with Equations (61) and (62). Then we sample Q^new given ς^new and κ^new using the elliptical slice sampler (see Murray et al., 2010):

  Q^new ∼ GP(ς^new, κ^new).    (81)

Finally, we compute the acceptance probability A′ = min(A, 1), where

  A = [p(Q^new | X, ν^new, Ξ) / p(Q^old | X, ν^old, Ξ)]
      × [f*(µ_0^new | ν^new, T) ∏_{i=1}^{A} f*(ν_i^new | T)] / [f*(µ_0^old | ν^old, T) ∏_{i=1}^{A} f*(ν_i^old | T)]
      × [q(µ_0^old | µ_0^new) ∏_{i=1}^{A} q(ν_i^old | ν_i^new)] / [q(µ_0^new | µ_0^old) ∏_{i=1}^{A} q(ν_i^new | ν_i^old)],    (82)

and we define f*(µ_0 | ν, T) and f*(ν_i | T) as

  f*(µ_0 | ν, T) = ∏_{k=1}^{K} (µ_{0k})^{t^{µ_1}_k + Σ_{i=1}^{A} t^{ν_i}_k},    (83)
  f*(ν_i | T) = ∏_{k=1}^{K} (ν_{ik})^{Σ_{d=1}^{D} t^{η_d}_k I(a_d = i)}.    (84)

The f*(·) correspond to the topic model posteriors of the variables µ_0 and ν after we represent them explicitly as probability vectors. Note that we treat the acceptance probability A as 1 when the expression in Equation (82) evaluates to more than 1. We then accept the proposed samples with probability A; if the samples are not accepted, we keep the respective old values. This completes one iteration of the MH scheme. We summarise the MH algorithm in Algorithm 3.

Algorithm 3 Performing the MH algorithm for one iteration.
1. Propose a new µ_0^new with Equation (79).
2. For each author i, propose a new ν_i^new with Equation (80).
3. Compute the mean function ς^new and the covariance matrix κ^new with Equations (61) and (62).
4. Sample Q^new from Equation (81) using the elliptical slice sampler of Murray et al. (2010).
5. Accept or reject the samples with the acceptance probability from Equation (82).
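The proposal and accept/reject steps of Algorithm 3 can be sketched as follows (our own simplification; log_accept_ratio stands for the log of Equation (82), which we do not spell out here):

```python
import numpy as np

rng = np.random.default_rng()

def propose(current, concentration):
    """Equations (79)-(80): a Dirichlet proposal centred on the current
    probability vector (entries must stay strictly positive)."""
    return rng.dirichlet(concentration * current)

def mh_step(current, concentration, log_accept_ratio):
    proposal = propose(current, concentration)
    # accept with probability min(1, A), working in log space
    if np.log(rng.uniform()) < log_accept_ratio(proposal, current):
        return proposal
    return current
```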
A.5 Hyperparameter Sampling

We sample the hyperparameters β using an auxiliary variable sampler, while leaving α fixed. We note that the auxiliary variable sampler for PYPs with multiple parents is identical to that for PYPs with a single parent, since the sampler uses only the total customer count C^N and the total table count T^N of a PYP N. We refer the readers to Section 4.3 for details.

We would like to point out that hyperparameter sampling is performed for all PYPs in the TNTM for the first one thousand iterations. After that, as µ_0 and ν are represented explicitly as probability vectors, we only sample the hyperparameters for the other PYPs (except µ_0 and ν). We note that sampling the concentration parameters allows the topic distributions of the authors to vary; that is, some authors have a few very specific topics while other authors have a wider range of topics. For simplicity, we fix the kernel hyperparameters s, l and σ to 1. Additionally, we make the priors for the mixing proportions uninformative by setting the λ to 1. We summarise the full inference algorithm for the TNTM in Algorithm 4.

Algorithm 4 Full inference algorithm for the TNTM.
1. Initialise the HPYP topic model by assigning a random topic to the latent topic z_dn associated with each word w_dn, and to the latent topic z′_dm associated with each hashtag y_dm. Then update all the relevant customer counts C and table counts T.
2. For each word w_dn in each document d, perform the following:
   (a) Decrement the counts associated with w_dn (see A.1).
   (b) Block sample a new topic for z_dn and the corresponding customer counts C and table counts T (with Equation (71)).
   (c) Update (increment counts) the topic model based on the sample.
3. For each hashtag y_dm in each document d, perform the following:
   (a) Decrement the counts associated with y_dm (see A.1).
   (b) Block sample a new topic for z′_dm and the corresponding customer counts C and table counts T (with Equation (72)).
   (c) Update (increment counts) the topic model based on the sample.
4. Sample the hyperparameter β^N for each PYP N (see A.5).
5. Repeat Steps 2–4 for 1,000 iterations.
6. Alternately perform the MH algorithm (Algorithm 3) and the collapsed blocked Gibbs sampler conditioned on µ_0 and ν.
7. Sample the hyperparameter β^N for each PYP N except for µ_0 and ν.
8. Repeat Steps 6–7 until the model converges or a fixed number of iterations is reached.

References

Aoki, M. (2008). Thermodynamic limits of macroeconomic or financial models: One- and two-parameter Poisson-Dirichlet models. Journal of Economic Dynamics and Control, 32(1):66–84.

Archambeau, C., Lakshminarayanan, B., and Bouchard, G. (2015). Latent IBP compound Dirichlet allocation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):321–333.

Aula, P. (2010). Social media, reputation risk and ambient publicity management. Strategy & Leadership, 38(6):43–49.

Baldwin, T., Cook, P., Lui, M., MacKinlay, A., and Wang, L. (2013). How noisy social media text, how diffrnt [sic] social media sources? In Proceedings of the Sixth International Joint Conference on Natural Language Processing, IJCNLP 2013, pages 356–364, Nagoya, Japan. Asian Federation of Natural Language Processing.

Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4):77–84.

Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested Chinese Restaurant Process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):7:1–7:30.

Blei, D. M. and Jordan, M. I. (2003). Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2003, pages 127–134, New York, NY, USA. ACM.
Blei, D. M. and Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, pages 113–120, New York, NY, USA. ACM.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.

Broersma, M. and Graham, T. (2012). Social media as beat. Journalism Practice, 6(3):403–419.

Bryant, M. and Sudderth, E. B. (2012). Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems 25, pages 2699–2707. Curran Associates, Rostrevor, Northern Ireland.

Buntine, W. L. and Hutter, M. (2012). A Bayesian view of the Poisson-Dirichlet process. ArXiv e-prints arXiv:1007.0296v2.

Buntine, W. L. and Mishra, S. (2014). Experiments with non-parametric topic models. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014, pages 881–890, New York, NY, USA. ACM.

Canny, J. (2004). GaP: a factor model for discrete data. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2004, pages 122–129, New York, NY, USA. ACM.

Chen, C., Du, L., and Buntine, W. L. (2011). Sampling table configurations for the hierarchical Poisson-Dirichlet process. In Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ECML 2011, pages 296–311, Berlin, Heidelberg. Springer-Verlag.

Correa, T., Hinsley, A. W., and de Zúñiga, H. G. (2010). Who interacts on the Web?: The intersection of users' personality and social media use. Computers in Human Behavior, 26(2):247–253.

Du, L. (2012). Non-parametric Bayesian methods for structured topic models. PhD thesis, The Australian National University, Canberra, Australia.

Du, L., Buntine, W. L., and Jin, H. (2012a). Modelling sequential text with an adaptive topic model. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, pages 535–545, Stroudsburg, PA, USA. ACL.

Du, L., Buntine, W. L., Jin, H., and Chen, C. (2012b). Sequential latent Dirichlet allocation. Knowledge and Information Systems, 31(3):475–503.

Eisenstein, J. (2013). What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the ACL: Human Language Technologies, NAACL-HLT 2013, pages 359–369, Stroudsburg, PA, USA. ACL.

Erosheva, E. A. and Fienberg, S. E. (2005). Bayesian Mixed Membership Models for Soft Clustering and Classification, pages 11–26. Springer Berlin Heidelberg, Berlin, Heidelberg.

Favaro, S., Lijoi, A., Mena, R. H., and Prünster, I. (2009). Bayesian non-parametric inference for species variety with a two-parameter Poisson-Dirichlet process prior. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):993–1008.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. CRC Press, Boca Raton, FL, USA.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):721–741.

Goldwater, S., Griffiths, T. L., and Johnson, M. (2005). Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems 18, NIPS 2005, pages 459–466. MIT Press, Cambridge, MA, USA.

Goldwater, S., Griffiths, T. L., and Johnson, M. (2011). Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, 12:2335–2382.

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732.

Green, P. J. and Mira, A. (2001). Delayed rejection in reversible jump Metropolis-Hastings. Biometrika, 88(4):1035–1053.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.

He, Y. (2012). Incorporating sentiment prior knowledge for weakly supervised sentiment analysis. ACM Transactions on Asian Language Information Processing (TALIP), 11(2):4:1–4:19.

Hjort, N. L., Holmes, C., Müller, P., and Walker, S. G. (2010). Bayesian Nonparametrics, volume 28. Cambridge University Press, Cambridge, England.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347.

Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1999, pages 50–57, New York, NY, USA. ACM.

Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173.

Jelinek, F. (1997). Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, USA.

Jin, O., Liu, N. N., Zhao, K., Yu, Y., and Yang, Q. (2011). Transferring topical knowledge from auxiliary long texts for short text clustering. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM 2011, pages 775–784, New York, NY, USA. ACM.

Jurafsky, D. and Martin, J. H. (2000). Speech & Language Processing. Prentice-Hall, Upper Saddle River, NJ, USA.

Karimi, S., Yin, J., and Paris, C. (2013). Classifying microblogs for disasters. In Proceedings of the 18th Australasian Document Computing Symposium, ADCS 2013, pages 26–33, New York, NY, USA. ACM.

Kataria, S., Mitra, P., Caragea, C., and Giles, C. L. (2011). Context sensitive topic models for author influence in document networks. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI 2011, pages 2274–2280, Palo Alto, CA, USA. AAAI Press.

Kim, D., Kim, S., and Oh, A. (2012). Dirichlet process with mixed random measures: A nonparametric topic model for labeled data. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, pages 727–734, New York, NY, USA. Omnipress.
Kinsella, S., Murdock, V., and O'Hare, N. (2011). "I'm eating a sandwich in Glasgow": Modeling locations with tweets. In Proceedings of the 3rd International Workshop on Search and Mining User-generated Contents, SMUC 2011, pages 61–68, New York, NY, USA. ACM.

Kwak, H., Lee, C., Park, H., and Moon, S. (2010). What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, WWW 2010, pages 591–600, New York, NY, USA. ACM.

Landauer, T. K., McNamara, D. S., Dennis, S., and Kintsch, W. (2007). Handbook of Latent Semantic Analysis. Lawrence Erlbaum, Mahwah, NJ, USA.

Lau, J. H., Grieser, K., Newman, D., and Baldwin, T. (2011). Automatic labelling of topic models. In Proceedings of the 49th Annual Meeting of the ACL: Human Language Technologies - Volume 1, ACL-HLT 2011, pages 1536–1545, Stroudsburg, PA, USA. ACL.

Lim, K. W. (2016). Nonparametric Bayesian Topic Modelling with Auxiliary Data. PhD thesis, submitted, The Australian National University, Canberra, Australia.

Lim, K. W. and Buntine, W. L. (2014). Twitter Opinion Topic Model: Extracting product opinions from tweets by leveraging hashtags and sentiment lexicon. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, pages 1319–1328, New York, NY, USA. ACM.

Lim, K. W., Chen, C., and Buntine, W. L. (2013). Twitter-Network Topic Model: A full Bayesian treatment for social network and text modeling. In Advances in Neural Information Processing Systems: Topic Models Workshop, NIPS Workshop 2013, pages 1–5, Lake Tahoe, Nevada, USA.

Lindsey, R. V., Headden III, W. P., and Stipicevic, M. J. (2012). A phrase-discovering topic model using hierarchical Pitman-Yor processes. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, pages 214–222, Stroudsburg, PA, USA. ACL.

Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167.

Liu, J. S. (1994). The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association, 89(427):958–966.

Lloret, E. and Palomar, M. (2012). Text summarisation in progress: A literature review. Artificial Intelligence Review, 37(1):1–41.

Lloyd, J., Orbanz, P., Ghahramani, Z., and Roy, D. M. (2012). Random function priors for exchangeable arrays with applications to graphs and relational data. In Advances in Neural Information Processing Systems 25, NIPS 2012, pages 998–1006. Curran Associates, Rostrevor, Northern Ireland.

Low, A. A. (1991). Introductory Computer Vision and Image Processing. McGraw-Hill, New York, NY, USA.

Lui, M. and Baldwin, T. (2012). langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, ACL 2012, pages 25–30, Stroudsburg, PA, USA. ACL.

Lunn, D. J., Thomas, A., Best, N., and Spiegelhalter, D. (2000). WinBUGS - a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10(4):325–337.
Mai, L. C. (2010). Introduction to image processing and computer vision. Technical report, Institute of Information Technology, Hanoi, Vietnam.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA.

Maynard, D., Bontcheva, K., and Rout, D. (2012). Challenges in developing opinion mining tools for social media. In Proceedings of @NLP can u tag #usergeneratedcontent, LREC Workshop 2012, pages 15–22, Istanbul, Turkey.

Mehdad, Y., Carenini, G., Ng, R. T., and Joty, S. R. (2013). Towards topic labeling with phrase entailment and aggregation. In Proceedings of the 2013 Conference of the North American Chapter of the ACL: Human Language Technologies, NAACL-HLT 2013, pages 179–189, Stroudsburg, PA, USA. ACL.

Mehrotra, R., Sanner, S., Buntine, W. L., and Xie, L. (2013). Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2013, pages 889–892, New York, NY, USA. ACM.

Mei, Q., Ling, X., Wondra, M., Su, H., and Zhai, C. (2007). Topic sentiment mixture: Modeling facets and opinions in weblogs. In Proceedings of the 16th International Conference on World Wide Web, WWW 2007, pages 171–180, New York, NY, USA. ACM.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.

Mira, A. (2001). On Metropolis-Hastings algorithms with delayed rejection. Metron - International Journal of Statistics, LIX(3–4):231–241.

Murray, I., Adams, R. P., and MacKay, D. J. C. (2010). Elliptical slice sampling. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, pages 541–548, Brookline, MA, USA. Microtome Publishing.

Oldham, K. B., Myland, J., and Spanier, J. (2009). An Atlas of Functions: With Equator, the Atlas Function Calculator. Springer Science and Business Media, New York, NY, USA.

Pang, B. and Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Pitman, J. (2006). Combinatorial Stochastic Processes. Springer-Verlag, Berlin Heidelberg.

Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900.

Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, DSC 2003, Vienna, Austria.

Rabiner, L. and Juang, B.-H. (1993). Fundamentals of Speech Recognition. Prentice-Hall, Upper Saddle River, NJ, USA.

Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. (2009). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP 2009, pages 248–256, Stroudsburg, PA, USA. ACL.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. (2004). The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI 2004, pages 487–494, Arlington, Virginia, USA. AUAI Press.

Sato, I., Kurihara, K., and Nakagawa, H. (2012). Practical collapsed variational Bayes inference for hierarchical Dirichlet process. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2012, pages 105–113, New York, NY, USA. ACM.

Sato, I. and Nakagawa, H. (2010). Topic models with power-law using Pitman-Yor process. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2010, pages 673–682, New York, NY, USA. ACM.

Schnober, C. and Gurevych, I. (2015). Combining topic models for corpus exploration: Applying LDA for complex corpus research tasks in a digital humanities project. In Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications, TM 2015, pages 11–20, New York, NY, USA. ACM.

Suominen, H., Hanlen, L., and Paris, C. (2014). Twitter for health – building a social media search engine to better understand and curate laypersons' personal experiences. In Text Mining of Web-based Medical Content, chapter 6, pages 133–174. De Gruyter, Berlin, Germany.

Teh, Y. W. (2006). A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, National University of Singapore.

Teh, Y. W. and Jordan, M. I. (2010). Hierarchical Bayesian nonparametric models with applications. In Bayesian Nonparametrics, chapter 5. Cambridge University Press.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476):1566–1581.

Teh, Y. W., Kurihara, K., and Welling, M. (2008). Collapsed variational inference for HDP. In Advances in Neural Information Processing Systems 20, pages 1481–1488. Curran Associates, Rostrevor, Northern Ireland.

Tu, Y., Johri, N., Roth, D., and Hockenmaier, J. (2010). Citation author topic model in expert search. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING 2010, pages 1265–1273, Stroudsburg, PA, USA. ACL.

Walck, C. (2007). Handbook on statistical distributions for experimentalists. Technical Report SUF-PFY/96-01, University of Stockholm, Sweden.

Wallach, H. M., Mimno, D. M., and McCallum, A. (2009a). Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems, NIPS 2009, pages 1973–1981. Curran Associates, Rostrevor, Northern Ireland.

Wallach, H. M., Murray, I., Salakhutdinov, R., and Mimno, D. (2009b). Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, pages 1105–1112, New York, NY, USA. ACM.

Wang, C., Paisley, J., and Blei, D. M. (2011a). Online variational inference for the hierarchical Dirichlet process. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, pages 752–760, Brookline, MA, USA. Microtome Publishing.
Wang, X., Wei, F., Liu, X., Zhou, M., and Zhang, M. (2011b). Topic sentiment analysis in Twitter: A graph-based hashtag sentiment classification approach. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM 2011, pages 1031–1040, New York, NY, USA. ACM.

Wei, X. and Croft, W. B. (2006). LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2006, pages 178–185, New York, NY, USA. ACM.

Wood, F. and Teh, Y. W. (2009). A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, pages 607–614, Brookline, MA, USA. Microtome Publishing.

Yang, J. and Leskovec, J. (2011). Patterns of temporal variation in online media. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM 2011, pages 177–186, New York, NY, USA. ACM.

Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., and Li, X. (2011). Comparing Twitter and traditional media using topic models. In Proceedings of the 33rd European Conference on Advances in Information Retrieval, ECIR 2011, pages 338–349, Berlin, Heidelberg. Springer-Verlag.