Modeling community structure and topics in dynamic text networks

Teague R. Henry (University of North Carolina at Chapel Hill), David Banks (Duke University), Derek Owens-Oas (Duke University), Christine Chai (Duke University)

Submitted to Journal of Classification

Abstract

The last decade has seen great progress in both dynamic network modeling and topic modeling. This paper draws upon both areas to create a bespoke Bayesian model applied to a dataset consisting of the top 467 US political blogs in 2012, their posts over the year, and their links to one another. Our model allows dynamic topic discovery to inform the latent network model and the network structure to facilitate topic identification. Our results find complex community structure within this set of blogs, where community membership depends strongly upon the set of topics in which the blogger is interested. We examine the time-varying nature of the Sensational Crime topic, as well as the network properties of the Election News topic, as notable and easily interpretable empirical examples.

1 Introduction

Dynamic text networks have been widely studied in recent years, primarily because the Internet stores textual data in a way that allows links between different documents. Articles on Wikipedia (Hoffman et al., 2010), citation networks in journal articles (Moody, 2004), and linked blog posts (Latouche et al., 2011) are examples of dynamic text networks, or networks of documents that are generated over time. But each application has idiosyncratic features, such as the structure of the links and the nature of the time-varying documents, so analysis typically requires bespoke models that directly address those aspects. This article studies the dynamic topic structure and the network properties of the top 467 US political blogs in 2012.
Some key features of this data set are (1) topics, such as presidential election news, that evolve over time, and (2) community structure among bloggers with similar interests. We develop a bespoke Bayesian model for the dynamic interaction between text and network structure, and examine the dynamics of both the discourse and the community structure among the bloggers. Our approach combines a topic model and a network model. A topic model infers the unobserved topic assignments of a set of documents (in this case, blog posts) from the text. And a network model infers communities among the nodes (in this case, blogs that tend to link to one another). In combination, we find blocks of blogs that tend to post on the same topics and to link with one another. These blocks, which we call topic interest blocks, allow one to examine sets of similar blogs, such as those that post only on the 2012 election or those interested only in the Middle East and foreign policy. Topic interest blocks allow text content to guide community discovery and link patterns to guide topic learning.

We begin with a review of terminology in topic modeling. A corpus is a collection of documents. A document, in our case a post, is a collection of tokens, which consist of words and n-grams, which are sets of words that commonly appear together (“President of the United States” is a common 5-gram). In our application, a blog produces posts. A topic is a distribution over the tokens in the corpus. Typically, a post concerns a single topic. One such topic might be described as “the 2012 election”, but this labeling is usually done subjectively, after estimation of the topic distributions, based on the high-probability tokens. For example, the 2012 election topic might put high probability on “Gingrich”, “Santorum”, “Cain” and “primaries”.[1]
An early and influential topic model is Latent Dirichlet Allocation (LDA), proposed in Blei et al. (2003). It is a bag-of-words model, since the order of the tokens is ignored. LDA assumes that the tokens in a document are drawn at random from a topic. If a document is about more than one topic, then the tokens are drawn from multiple topics with topic proportions that must be estimated. The LDA generative model can produce a document that is 70% about the 2012 election topic and 30% about a Supreme Court topic by repeatedly tossing a coin with probability 0.7 of coming up heads. When it is heads, LDA draws a word from the 2012 election distribution; otherwise, it draws from the Supreme Court distribution. Markov chain Monte Carlo allows one to reverse the generative model, so that given a corpus of documents, one can estimate the distribution corresponding to each topic and, for each document, the proportion of that document that is drawn from each topic.

[1] Gingrich, Santorum and Cain all refer to candidates in the 2012 Republican presidential primary.

In our application, topic-specific word probabilities evolve over time: the token “Gingrich” is more probable early in 2012 than later, after he dropped out. Blei and Lafferty (2006) develops a method that allows for topic drift, so the probability of a token in a topic can change (slowly) through an auto-regressive process. But blog data requires the possibility of rapid change; “Benghazi” did not occur in the corpus before September 11, but thereafter was a high-probability token. We develop a dynamic version of a topic model described in Yin and Wang (2014). The way we infer topics allows for both slow drift and the sudden appearance of new words, and even new topics, over the course of the year. There is a second source of information in the blog data that previous dynamic topic models cannot utilize.
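Before turning to that second source, the two-topic LDA generative step described above can be sketched in a few lines. This is a minimal illustration, not the paper's model: the token distributions and the 70/30 split are invented.

```python
import random

# Minimal sketch of the two-topic LDA generative step described above.
# The vocabularies and the 70/30 topic split are invented for illustration.
election = {"gingrich": 0.4, "santorum": 0.3, "primaries": 0.3}
court = {"scotus": 0.5, "ruling": 0.5}

def draw(dist, rng):
    """Draw one token from a topic's token distribution."""
    tokens, probs = zip(*dist.items())
    return rng.choices(tokens, weights=probs, k=1)[0]

def generate_doc(n_tokens, p_election=0.7, seed=0):
    """Generate a document that is ~70% election topic, ~30% court topic."""
    rng = random.Random(seed)
    doc = []
    for _ in range(n_tokens):
        # The coin toss: heads (prob. 0.7) -> election topic, else court topic.
        topic = election if rng.random() < p_election else court
        doc.append(draw(topic, rng))
    return doc

doc = generate_doc(1000)
share = sum(w in election for w in doc) / len(doc)
print(round(share, 2))  # close to 0.7
```

Inference reverses this process: given only `doc`, one estimates the two token distributions and the mixing proportion.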
It is the links between blogs, which prompt a network model. Here a blog is a node, and a hyperlink between blogs is an edge. We use an exponential random graph model (Holland and Leinhardt, 1981; Wasserman and Pattison, 1996) to estimate the probability of an edge through a logistic regression on predictors that include node characteristics and other explanatory variables. This framework can be combined with clustering methods to perform community detection, where a community is a set of nodes that are more likely to create edges among themselves than with other nodes.

There are a number of recent methods for community detection. One is a family of algorithms that use modularity optimization (Newman and Girvan, 2004). But the corresponding models are not parametric and do not support Bayesian inference. A popular alternative is the latent space model of Hoff et al. (2002). It estimates node locations in an unobserved space which then defines the community structure; but it is too computationally demanding for the large blog post data set. We prefer the stochastic block model of Snijders and Nowicki (1997). Stochastic block models place nodes into latent communities based on the observed pattern of links between nodes, which are modeled using independent Bernoulli random variables. The approach has been extended as the mixed membership block model (Airoldi et al., 2008), which allows nodes to be members of more than one community. In that spirit, the model developed in this paper keeps the stochastic block modeling framework, but permits nodes to have idiosyncrasies in their connection patterns that are not solely due to community membership, but also reflect node covariates (in this application, the degree of the blogs' interests in specific topics).
Shared community membership increases edge formation probability, and nodes in different communities that have shared topic interests also have elevated probabilities of linking. A stochastic block model can be easily expressed within an exponential random graph modeling framework. Combining topic information and linkage information through the topic interest blocks is our key methodological contribution in this article. A secondary contribution is extending the topic model of Yin and Wang (2014) into a dynamic topic model.

Researchers have started to develop models that combine network analysis and topic analysis, mostly in the context of static networks. Chang and Blei (2009) describes a relational topic model in which the probability of links between documents depends upon their topics, and applies it to two datasets of abstracts and a set of webpages from computer science departments. Ho et al. (2012) applies such methods to linked hypertext and citation networks. Wang et al. (2011) develops a model for the case in which there are noisy links between nodes, in the sense that there are links between documents whose topics are not related. Yin and Wang (2014) does related work on clustering documents through use of a topic model. However, none of these methods allows for the simultaneous modeling of dynamic topics with a community structure on the nodes.

Our model uses text and covariate information on each node to group blogs into blocks more likely to post on the same topics and link to one another. This approach expands upon community detection, but also fundamentally alters how communities are defined. We assume that if two blogs are interested in the same topics, then they are more likely to link to each other and form a community.
Estimating the extent to which blogs post about the same topics helps explain community structure, above and beyond the community structure described by the linkage pattern. Furthermore, integrating community detection into topic models allows the linkages to inform the allocation of topics, connecting network structure to topic structure. So inference on topic distributions is supplemented by non-text information. This results in communities that are defined both by the pattern of links (traditional community detection) and by the textual data. One consequence of this approach is that communities are more grounded in the substantive reason for any community structure: shared interest in various topics.

In particular, for the 2012 blog application, we wanted a bespoke Bayesian model that (1) allows topic distributions to change over time, both slowly and quickly, (2) classifies blogs into blocks that share topic interests and have elevated internal linkage probabilities, and (3) theoretically enables use of covariate information on blogs, including prestige, sociability, whether recently linked, and more. Here we extend covariates from static stochastic block models, as in Faust and Wasserman (1992), to dynamic networks. Some covariates are fixed (e.g., topic interests) whereas others are time-varying (e.g., whether recently linked).

Section 2 describes our dataset and its preparation. Section 3 gives a generative dynamic model for both the text and the network. Section 4 specifies the Bayesian prior and the posterior inference algorithm used to estimate model parameters. Finally, in Section 5, we present several findings from the political blog data, and Section 6 finishes with a discussion of possible generalizations.

2 Political Blogs of 2012

Our data consist of the blog posts from the top 467 US political blogs for the year 2012, as ranked by Technorati (2002).
This dataset has a dynamic network structure, since blog posts often link to each other, responding to each other's content. Additionally, the topic structure of the blog posts reflects different interests, such as the presidential campaign or sensational crime. The token usage in each topic changes over time, sometimes quite suddenly, as with the appearance of the tokens “Trayvon” and “Zimmerman”[2] in March 2012, and sometimes more gradually, as with the slow fade of the token “Gingrich”[3] during the spring. Over the 366 days in 2012, a leap year, the political blogs accumulated 109,055 posts.

[2] George Zimmerman shot and killed Trayvon Martin in March of 2012.

2.1 Data Preparation

Our data were obtained through a collaboration with MaxPoint Interactive, now Valassis Digital, a company headquartered in the Research Triangle that specializes in computational advertising. Using the list of 467 U.S. political blog sites curated by Technorati, computer scientists at MaxPoint scraped all the text and links at those sites (after declaring robot status and following all robot protocols). The scraped text was stemmed, using a modified version of Snowball (McNamee and Mayfield, 2003) developed in-house at MaxPoint Interactive. The initial application removed all three-letter words, which was undesirable, since such acronyms as DOT, EPA and NSA are important. That problem was fixed and the data were restemmed.

The second step was filtering, based on the variance of the unweighted term-frequency, inverse document frequency (TF-IDF) scores (Ramos, 2003). The TF-IDF score for token w in blog post d is

    \mathrm{TFIDF}_{wd} = f_{wd} / n_w,    (1)

where f_{wd} is the number of times that token w occurs in blog post d, and n_w is the number of posts in the corpus that use token w. Words that have low-variance TF-IDF scores are words such as
“therefore” and “because,” which are common in all posts. High-variance scores belong to informative words that are used often in a small number of posts, but rarely in other posts, such as “homosexual” or “Zimmerman”. Interestingly, “Obama” is a low-variance TF-IDF token, since it arises in nearly all political blog posts.

[3] Newt Gingrich gradually faded to political irrelevance after a failed presidential primary run.

Next, we removed tokens that were mentioned in fewer than 0.02% of the posts. This reduced the number of unique tokens that appeared in the corpus, as these were unlikely to be helpful in determining the topic token distribution across all posts. Many of these were misspellings; e.g., “Merkle” for “Merkel”, the Chancellor of Germany. Overall, these misspellings were either rare (as in the case of “Merkle-Merkel”) or incomprehensible.

After all tokens were filtered, we computed the n-grams, starting with bigrams. A bigram is a pair of words that appear together more often than chance, and thus correspond to a meaningful phrase. For example, the words “white” and “house” appear in the blog corpus often, in many different contexts (e.g., race relations and the House of Representatives). But the phrase “White House” refers to the official residence of the president, and appears more often than one would predict under an independence model, for which the expected number of phrase occurrences is N p_white p_house, where N is the total amount of text in the corpus and p_white and p_house are the proportions of the text that are stemmed to “white” and “house”. Bigrams were rejected if their significance probability was greater than 0.05. The bigram set generated from this procedure appeared to be too liberal; English usage includes many phrases, and about 70% of tested bigrams were retained. Therefore we excluded all bigrams occurring fewer than 500 times corpus-wide.
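The independence screen can be illustrated numerically. The sketch below uses invented counts, and since the paper does not specify its exact test statistic, a simple Poisson approximation to the phrase count stands in for it.

```python
import math

def expected_bigram_count(n_tokens, count_w1, count_w2):
    """Expected occurrences of the phrase 'w1 w2' under independence:
    N * p_w1 * p_w2, as in the 'White House' example above."""
    return n_tokens * (count_w1 / n_tokens) * (count_w2 / n_tokens)

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected): a stand-in
    significance probability for the bigram screen (an assumption;
    the paper does not name its test)."""
    cdf = sum(math.exp(-expected) * expected**k / math.factorial(k)
              for k in range(observed))
    return 1.0 - cdf

# Invented counts: a corpus of 1e6 tokens, "white" appearing 2000 times,
# "house" 3000 times, and the phrase "white house" observed 40 times.
N = 1_000_000
exp_count = expected_bigram_count(N, 2000, 3000)   # 6.0 expected by chance
p = poisson_upper_tail(40, exp_count)
print(exp_count, p < 0.05)  # 6.0 True -> retained as a bigram
```

Forty observed occurrences against six expected gives a tiny significance probability, so “white house” would survive the 0.05 cut; the subsequent 500-occurrence frequency floor is a separate, stricter filter.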
This significantly reduced the set of bigrams. After the bigrams were computed and the text reformatted to combine them, the bigramming procedure was repeated. This produced a set of candidate trigrams (consisting of a previously identified bigram and a unigram), as well as a set of candidate quadrigrams (made up of two previously accepted bigrams). These candidates were retained only if they had a frequency greater than 100. This cut-off removed the majority of the candidate trigrams and quadrigrams. The final vocabulary consisted of 7987 tokens.

It is possible to go further, finding longer n-grams, but we did not. However, we identified and removed some long n-gram pathologies, such as the one created by a blogger who finished every post by quoting the Second Amendment. There is a large literature on various n-gramming strategies (Brown et al., 1992). Our work did not employ sophisticated methods, such as those that use information about parts of speech.

After this preprocessing was complete, we had the following kinds of information:

• Stemmed, tokenized, reduced text for each post; the date on which the post was published; the blog on which the post was published; and links to other blogs in the network.

• Blog information, including the web domain, an estimate of its prestige from Technorati, and sometimes information on political affiliation.

From this information, we want to estimate the following:

• Time-evolving distributions over the tokens, where the time evolution on a token may be abrupt or gradual.

• The topic of each post. Our model assumes that a post is about a single topic, which is usually but not always the case (based upon preliminary work with a more complicated model).

• The topic interest blocks, which are sets of blogs that tend to link among themselves and which tend to discuss the same topic(s).

• The specific topics of interest to each of the topic interest blocks.
• The linking probabilities for each pair of blogs, as a function of topic interest block membership and other covariates.

• Posting rates, as a function of blog covariates and external news events that drive discussion.

We now describe the generative model that connects dynamic topic models with network models in a way that accounts for the unique features of this data set.

3 Model

The generative model can be described in two main phases: initialization of static quantities, such as blogs' topic interest block memberships, and generation of dynamic quantities, specifically posts and links. First, the model creates K topic distributions that are allowed to change over time. Next, it generates time-stamped news events for each topic. Each blog is randomly assigned to a topic interest block. With these elements in place, post and link generation proceeds. For each blog, on each day, the number of posts on each topic is generated, in accordance with the topic interest block of that blog. The content of each post is generated from the day-specific topic distribution, and links are generated so as to take into account the blog's topic interest block. We now describe each step in the generative model in more detail.

3.1 Topic and Token Generation

We begin with the topic distributions, which must allow dynamic change. For the k-th topic, on a specified day t, we assume the token probabilities V_{kt} are drawn from a Dirichlet prior. This set of topic-specific token probabilities is the topic distribution on day t. To encourage continuity across days, we calculate the average of topic k's distributions V_{k(t-1):(t-\ell)} from the previous \ell days and use it as the concentration parameter for the Dirichlet distribution from which the present day's topic V_{kt} is drawn.
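This day-to-day smoothing can be sketched with standard-library sampling. The vocabulary size, horizon, and lag are invented, and two details are assumptions flagged in the comments: the paper uses the raw lagged average as the concentration parameter, whereas the sketch scales it up (and adds a small floor) so successive days stay visibly close and every concentration entry remains positive.

```python
import random

rng = random.Random(0)

# Sketch of the day-linked topic evolution described above. V_SIZE, T,
# and LAG are invented; SCALE and FLOOR are assumptions not in the paper
# (it uses the raw lagged average as the Dirichlet concentration).
V_SIZE, T, LAG, SCALE, FLOOR = 50, 30, 7, 100.0, 0.1

def dirichlet(alpha):
    """Sample from Dirichlet(alpha) via normalized Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

# Day 0: flat prior over the vocabulary.
topics = [dirichlet([1.0] * V_SIZE)]

for t in range(1, T):
    window = topics[max(0, t - LAG):t]
    # Concentration: the average of the previous LAG days' distributions,
    # scaled and floored so the next day's draw concentrates near it.
    a_kt = [SCALE * sum(day[w] for day in window) / len(window) + FLOOR
            for w in range(V_SIZE)]
    topics.append(dirichlet(a_kt))

print(len(topics), round(sum(topics[-1]), 6))  # 30 1.0
```

Because each day's concentration is built only from recent days, a token that stops appearing can fade quickly, which is the rapid-change behavior the model needs for tokens like “Benghazi”.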
The sampling proceeds in sequence: each day, we first calculate each topic's concentration parameter as in (2) and then sample each topic as in (3), before moving to the next day. This procedure repeats for times t = 1 : T. The concentration parameters and topic distributions are

    a_{kt} = \frac{1}{\ell} \sum_{t'=1}^{\ell} V_{k(t-t')},    (2)

    V_{kt} \sim \mathrm{Dir}_{|W|}(a_{kt}).    (3)

3.1.1 Topic Event Generation

To capture the event-driven aspect of blog posting, we generate events which then boost the post rate on the corresponding topic. For each topic k, at each time t, there is some probability \eta_k of an event occurring. One can choose \eta_k = .01 for all k, which gives each topic on average one event every 100 days. Alternatively, different topics can be given different daily event probabilities, or one can put a prior on \eta_k. Given \eta_k, the daily, topic-specific event indicators are sampled as

    E_{kt} \sim \mathrm{Bern}(\eta_k).    (4)

When an event happens on topic k, blogs with interest in topic k have their posting rates increased by a factor determined by \psi_k. Speculating that some topics have events which are much more influential than others, we let this multiplier be topic-specific:

    \psi_k \sim \mathrm{Gam}(a_\psi, b_\psi).    (5)

3.1.2 Block- and blog-specific topic interest specification

With our topic distributions and topic-specific events generated, we can now assign blogs to topic interest blocks. We begin by defining the block-specific topic-interest matrix I, where each column b indicates which of the K topics are of interest to block b. The first \binom{K}{1} columns correspond to the singleton blocks, which are interested only in topic 1, topic 2, up through topic K, respectively. The next \binom{K}{2} columns define doublet blocks, which have interest in all of the possible topic pairs.
The next  K 3  columns corresp ond to blo c ks whic h hav e interest in exactly 3 topics, and the final column is for the blo c k which has interest in all K topics: I kb =                    1 0 0 . . . 0 1 1 1 . . . 0 1 1 1 . . . 0 1 0 1 0 . . . 0 1 0 0 . . . 0 1 1 1 . . . 0 1 0 0 1 . . . 0 0 1 0 . . . 0 1 0 0 . . . 0 1 0 0 0 . . . 0 0 0 1 . . . 0 0 1 0 . . . 0 1 0 0 0 . . . 0 0 0 0 . . . 0 0 0 1 . . . 0 1 0 0 0 . . . 0 0 0 0 . . . 0 0 0 0 . . . 0 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0 0 0 . . . 0 0 0 0 . . . 0 0 0 0 . . . 0 1 0 0 0 . . . 0 0 0 0 . . . 0 0 0 0 . . . 1 1 0 0 0 . . . 0 0 0 0 . . . 1 0 0 0 . . . 1 1 0 0 0 . . . 1 0 0 0 . . . 1 0 0 0 . . . 1 1                    (6) T o assign blogs to blo c ks, w e sample their mem b ership with a single draw from a m ultinomial distribution. This means eac h blog is a mem b er of only a single blo c k, characterized by the topic in terests in the ab o ve matrix. Eac h blo c k assignmen t is then drawn from a m ultinomial distribution: b i ∼ M ult (1 , p B ) . (7) One can c ho ose the probabilities of b elonging to each blo c k uniformly , by setting each element of p b = 1 /B where B =  K 1  +  K 2  +  K 3  + 1 gives the total num b er of blo c ks. Another approac h is to partition the probabilities v ector into the singlets, doublets, triplets, and all-topics blo c ks, and allo cate probability uniformly to eac h of these categories, and then uniformly divide up the probabilit y among blo cks within each category: p B =     p 1 p 2 p 3 p K     , with p 1 =      p 1 , 1 p 1 , 2 . . . p 1 , ( K 1 )      , p 2 =      p 2 , 1 p 2 , 2 . . . p 2 , ( K 2 )      , p 3 =      p 3 , 1 p 3 , 2 . . . p 3 , ( K 3 )      , p K = p K, ( K K ) . 
For notational convenience throughout the rest of the paper, we define B_i to be the set of topics which are of interest to blog i:

    B_i = \{ k : I_{k b_i} = 1 \}.    (9)

With each blog's topic interest indicators known, we can generate blog-specific topic-interest proportions. For example, two blogs may be in the block with interest in topics 1 and 2, but one may have interest proportions (.9, .1) while the other has (.5, .5). As is conventional in topic modeling, topic interest proportions are drawn from a Dirichlet distribution, though we make the distinction that each blog has a specific set of hyperparameters \alpha_i. An individual topic interest vector \pi_i is then a draw from a Dirichlet distribution:

    \pi_i \sim \mathrm{Dir}_K(\alpha_i).    (10)

The hyperparameters are chosen such that a blog with interest in topics 1 and 2 is likely to have most of its interest in those topics, though interest in other topics can occur with small probability:

    \alpha_i = (\alpha_{i1}, \ldots, \alpha_{iK})^T, \quad \text{with } \alpha_{ik} = P \cdot 1(k \in B_i) + 1(k \notin B_i).    (11)

3.1.3 Post Generation

Given the blogs' topic interests and block memberships, along with the event distribution, we can now generate the number of posts a blog produces on a particular topic. Each blog may post on multiple topics, but each post is associated with a single topic. Every blog has a baseline posting rate which characterizes how active it generally is on days without events. For blog i, the baseline post rate \rho_i is sampled from

    \rho_i \sim \mathrm{Gam}(a_\rho, b_\rho).    (12)

With the blog-specific baseline post rate \rho_i, the blog-specific topic interest proportions \pi_{ik}, the topic-specific daily event indicators E_{kt}, and the topic-specific post rate multipliers \psi_k accounted for, we construct the expected post rate for each topic, on each blog, on each day:

    \lambda_{tki} = \rho_i \pi_{ik} + \rho_i E_{tk} \psi_k.    (13)
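The rate construction just given can be checked numerically. In the sketch below, every parameter value is invented for illustration, and a small inversion sampler stands in for a Poisson draw since the standard library lacks one.

```python
import math
import random

# Sketch of the daily post-rate construction for one blog and one topic.
# All numerical values are invented for illustration.
rho_i = 3.0    # blog i's baseline posts per day
pi_ik = 0.6    # share of blog i's interest devoted to topic k
psi_k = 2.5    # event-driven rate multiplier for topic k

def post_rate(event_today):
    """lambda_tki = rho_i * pi_ik + rho_i * E_tk * psi_k."""
    return rho_i * pi_ik + rho_i * (1 if event_today else 0) * psi_k

def draw_post_count(lam, rng):
    """Poisson draw via inversion of the CDF (the stdlib has no
    Poisson sampler)."""
    u, p, k = rng.random(), math.exp(-lam), 0
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

rng = random.Random(1)
quiet = post_rate(event_today=False)   # 3.0 * 0.6 = 1.8 expected posts
busy = post_rate(event_today=True)     # 1.8 + 3.0 * 2.5 = 9.3 expected posts
n_posts = draw_post_count(busy, rng)   # D_tki ~ Pois(lambda_tki)
print(quiet, busy)  # 1.8 9.3
```

An event day multiplies this blog's expected output on the topic roughly fivefold here, which is the mechanism that lets the model absorb bursts like the Trayvon Martin coverage.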
Given this post rate, the count D_{tki} of posts about topic k, on blog i, on day t is generated:

    D_{tki} \sim \mathrm{Pois}(\lambda_{tki}).    (14)

In the observed data, we do not know the post counts D_{tki} on each topic, but instead we know the marginal counts D_{ti}. These are referenced throughout the inference procedure described in Section 4 and are calculated as

    D_{ti} = \sum_{k=1}^{K} D_{tki}.    (15)

With daily topic-specific post counts and token probabilities available, the posts can be populated with tokens. We first sample a total number of tokens for each post. In particular, on day t, the token count W_{tkid} for post d about topic k on blog i is sampled as

    W_{tkid} \sim \mathrm{Pois}(\lambda_D),    (16)

where \lambda_D is the average number of tokens over all posts. The W_{tkid} tokens can then be sampled from the appropriate day- and topic-specific multinomial distribution with probability vector V_{kt}. This is done for all of the posts in the corpus:

    N^w_{tkid} \sim \mathrm{Mult}(W_{tkid}, V_{kt}).    (17)

3.1.4 Network Generation

Finally, we generate the network of links between blogs. Rather than modeling link generation at the post level, we model it at a daily blog-to-blog level. Specifically, we model a directed adjacency matrix A_t of links, with entry a_{ii't} indicating whether any posts from blog i have links to blog i' on day t. A binary logistic regression is suitable for this scenario. We assume the link probability p_{ii't} = p(A_{ii't} = 1) depends on the following factors:

• B(i, i') = 1(b_i = b_{i'}) + \pi_i^T \pi_{i'} 1(b_i \neq b_{i'}) is the similarity (in topic interests) of nodes i and i'. It lies in the interval [0,1], taking value 1 if and only if blogs i and i' are in the same block.

• L_{i'it} = 1\left( \sum_{t'=t-7}^{t-1} a_{i'it'} > 0 \right) indicates whether blog i' has linked to blog i within the last week (previous to the current time t).
• I_{i't} = \frac{1}{t-1} \sum_{t'=1}^{t-1} \sum_i a_{ii't'} is the average indegree (through time t-1) of the receiving node i'.

• O_{it} = \frac{1}{t-1} \sum_{t'=1}^{t-1} \sum_{i'} a_{ii't'} is the average outdegree (through time t-1) of the sending node i.

The first covariate is sampled and constructed in equations (6)-(11), and the other three covariates are defined and calculated as statistics of the past data \{A_{t'}\}_{t'=1}^{t-1}. Together with an intercept, they comprise the regressors in a logistic regression for links, which can be written as in Equation 18:

    \log\left( \frac{p_{ii't}}{1 - p_{ii't}} \right) = \theta_0 + \theta_1 B(i, i') + \theta_2 L_{i'it} + \theta_3 I_{i't} + \theta_4 O_{it}.    (18)

We specify a normal prior for the intercept and the regression coefficients:

    \theta_p \sim \mathrm{Norm}(\mu_\theta, \sigma^2_\theta).    (19)

We can use the logistic function to write the probability of a link as

    p_{ii't} = p(A_{ii't} = 1) = \frac{\exp(\theta^T S_{ii't})}{1 + \exp(\theta^T S_{ii't})},    (20)

with coefficients and covariates written as

    \theta = (\theta_0, \theta_1, \theta_2, \theta_3, \theta_4)^T \quad \text{and} \quad S_{ii't} = (1, B(i, i'), L_{i'it}, I_{i't}, O_{it})^T.    (21)

This form makes clearer how the model can be cast within the ERGM framework (Holland and Leinhardt, 1981). Covariates are time-dependent, as in the TERGM literature (Krivitsky and Handcock, 2014). An important note is that the covariates depend only on past linking data, which makes this a predictive model of links. Finally, we sample each link as a single Bernoulli trial with the appropriate probability as defined in Equation 20:

    A_{ii't} \sim \mathrm{Bern}(p_{ii't}).    (22)

This generative model for the links can be thought of as a variant of stochastic block modeling (Snijders and Nowicki, 1997), where block membership is “fuzzy”.
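The link-probability computation in equations (18)-(22) can be sketched directly. The coefficient and covariate values below are invented for illustration; only the functional form follows the equations.

```python
import math
import random

# Sketch of the link probability in equations (18)-(22).
# Coefficient and covariate values are invented for illustration.
theta = [-4.0, 3.0, 1.5, 0.2, 0.1]   # intercept, B, L, I, O

def block_similarity(same_block, pi_i, pi_j):
    """B(i,i'): 1 if the blogs share a block, otherwise the inner
    product of their topic interest proportions."""
    if same_block:
        return 1.0
    return sum(a * b for a, b in zip(pi_i, pi_j))

def link_probability(S):
    """Logistic transform of theta^T S, with S = (1, B, L, I, O)."""
    eta = sum(t * s for t, s in zip(theta, S))
    return 1.0 / (1.0 + math.exp(-eta))

# Two blogs in different blocks with overlapping topic interests,
# where i' linked to i within the past week (L = 1).
B = block_similarity(False, [0.7, 0.3, 0.0], [0.6, 0.2, 0.2])   # 0.48
S = (1.0, B, 1.0, 2.0, 1.5)
p = link_probability(S)

rng = random.Random(0)
link = 1 if rng.random() < p else 0   # the Bernoulli trial of eq. (22)
print(round(B, 2), round(p, 3))
```

Note how the similarity term rewards shared interests even across blocks: setting `same_block=True` would push B to 1 and raise the link probability, which is exactly the "fuzzy" block behavior described above.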
In our model, while members of the same block have the highest probability of linking with other members of that block, individuals who share similar topic interests but are not in the same block are more likely to link than individuals who share no topic interests. This allows for a pattern of linkages that more accurately reflects the empirical phenomenon of topic-based blog hyperlinks.

At the end of data generation we have \{B_i\}_{i=1}^{I} and \pi_i giving the topic interest set and topic interest proportions, respectively, for blog i; the K \times T matrix E with daily topic-specific event indicators; the K \times I \times T array D with entry D_{kit} giving the number of posts about topic k on blog i at time t; the |W| \times K \times T array V of daily topic-specific token probabilities; and the multidimensional object N containing the count N^w_{tkid} for each token w in the d-th post about topic k on blog i at time t. With a theoretically justified data-generating mechanism in place, we proceed to Section 4 to “invert the generative model” and derive posterior inference for the parameters of interest.

4 Estimation

As our dataset of blog posts consists of posts, time stamps, which blog posted each post, and each post's links to other blogs, our inferential model needs to estimate a number of quantities. This section gives the details of how we estimate quantities of interest from the data and how we specified our priors. The notation is dense, so the following guide is helpful.

• Each post's topic assignment. We observe the content of each post, but do not know its topic assignment. This must be inferred. We denote this estimate as z_d for post d.

• Topic distributions. We do not know what the content of each topic is, or how each topic changes over time. We use V_t for the topic token-distributions matrix and, for specific topics, V_{kt}.

• Events and Post Rate Boosts.
Events are not observed and must be inferred. This T \times K matrix is E. The event-caused, topic-specific boosts in post rate \psi_k are also inferred.

• Blog specific parameters. A blog's average post rate and topic interests must be inferred. The blog average post rate is denoted \rho_i, and the topic interest proportions form a vector of length K, denoted \pi_i.

• Blogs' block membership. A blog's block membership is inferred using the linkage pattern and the topic assignments of each of the blog's posts. The i-th blog's block membership is denoted as b_i and its corresponding topic interest indicator vector is B_i.

• Network parameters. The network parameters govern the probability of linkage. These depend upon block membership, lagged reciprocity, indegree and outdegree through a logistic regression whose coefficients must be estimated. These five network parameters (including the intercept) are denoted \theta_0, \theta_1, \theta_2, \theta_3 and \theta_4, respectively.

4.0.1 Hyperparameter Specification

The model requires that several parameters be specified a priori. In this subsection we describe these hyperparameters in general terms, while in Section 5.1 we show which specific values we used to analyze the political blog data.

The first hyperparameter is K, the total number of topics. In principle, one could place an informative prior on the number of topics and use the posterior mean determined by the data. This is, however, computationally cumbersome, and so we specify the number of topics in advance. This approach is used in Blei and Lafferty (2006) and Blei et al. (2003) for the Latent Dirichlet Allocation models. One can use penalized likelihood as a selection criterion, as described in Yin and Wang (2014), or an entropy-based criterion, such as the one described in Arun et al. (2010).
We chose the number of topics by running models with different values of K and selecting among them with the entropy based criterion of Arun et al. (2010).

The time lag ℓ for topic dependency must also be specified. This lag determines the time scale of the topics, measured in days: it governs how long tokens remain in a topic, and can be conceptualized as a smoother over the time-changing vocabulary. Smaller values of ℓ produce more varied topic distributions over time, while larger values reflect slower shifts in topic content.

For the node specific parameters, only P, the Dirichlet concentration parameter on the topics of interest to a block, is needed. This parameter governs how often blogs post outside the topics they are interested in, with lower values allowing more out-of-interest posting and higher values corresponding to more restricted topic interests.

Finally, for any reasonable number of topics, a restriction on the block structure is required to ensure computational feasibility. For an unrestricted block structure with K topics, the total number of possible blocks that must be evaluated is ∑_{i=1}^{K} C(K, i), which is intractable for moderate K. In this paper, we restrict blocks to have 1, 2, or 3 topic interests, and allow one block to have interest in all topics. We also specify the expected number of non-empty topic interest blocks through the prior λ_B.

4.1 A Simple Data Augmentation

While the generative model assumes a Poisson distribution on the post counts D_{kit}, we rely on a data augmentation for the inference procedure. Because the counts D_{it} of posts on each blog each day are known, we augment the generative model with latent variables {z_{d_{it}}}_{d_{it}=1}^{D_{it}} that give the latent topic assignment of each post d_{it}.
We can then rewrite the Poisson likelihood ∏_{k=1}^{K} Pois(D_{kit} | λ_{kit}) as a multinomial likelihood ∏_{d_{it}=1}^{D_{it}} Mult(z_{d_{it}} | 1, ξ_{it}), with ξ_{kit} = λ_{kit} / ∑_{k′=1}^{K} λ_{k′it}. This reformulation enables use of the topic assignment inference algorithm from GSDMM.

4.2 Metropolis Within Gibbs Sampling

We use a Metropolis within Gibbs sampling algorithm (Gilks et al., 1995) to obtain posterior distributions for the parameters defined in the generative model. This approach consists of four stages:

1. Each day t, for each blog i, sample a topic assignment z_{d_{it}} for each post d_{it} and update the matrix of daily topic specific token-distributions V_t.

2. For blogs, update the topic interest proportions (π_{ik}) and base rates for posting (ρ_i). For events, update the event matrix E and the activation level parameters (ψ_k).

3. Update the network parameters, i.e., θ_0, θ_1, θ_2, θ_3, and θ_4.

4. Update each blog's block assignment b_i and corresponding topic interest indicators B_i.

4.3 Topic Modeling and Post Assignment

Both the posts' topic assignments and the topic distributions themselves are unobserved and must be inferred. A suitable algorithm for inferring them should, first, assign each post a single topic and, second, have some flexibility in collapsing topic distributions together. To those ends, we adapt the Gibbs Sampler for the Dirichlet Mixture Model (GSDMM) of Yin and Wang (2014). As originally proposed, the GSDMM classifies a set of documents into specific topics. The tokens of a post are assumed to be generated from the topic specific multinomial to which that post was assigned, and a given token may be instantiated in multiple topics (e.g., common words such as "therefore").
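Before detailing our modifications to GSDMM, it is worth verifying the identity behind the data augmentation of Section 4.1: a product of independent Poisson likelihoods factors exactly into a Poisson likelihood for the daily total times a multinomial over topic assignments with ξ_k = λ_k / ∑ λ_{k′}. A quick numerical check (the rates and counts are toy values, not quantities from the model):

```python
# Verify: prod_k Pois(d_k | lam_k) = Pois(sum d | sum lam) * Mult(d | xi),
# the identity behind the data augmentation of Section 4.1.
from math import exp, factorial, prod

def pois_pmf(d, lam):
    return lam ** d * exp(-lam) / factorial(d)

def multinom_pmf(counts, probs):
    n = sum(counts)
    coef = factorial(n) / prod(factorial(c) for c in counts)
    return coef * prod(p ** c for p, c in zip(probs, counts))

lam = [0.5, 1.2, 2.3]        # toy topic rates lambda_k for one blog-day
counts = [1, 0, 3]           # toy post counts D_k for that blog-day
xi = [l / sum(lam) for l in lam]

lhs = prod(pois_pmf(d, l) for d, l in zip(counts, lam))
rhs = pois_pmf(sum(counts), sum(lam)) * multinom_pmf(counts, xi)
# lhs equals rhs up to floating point rounding
```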
The assignment of each post to a single topic differs from the Latent Dirichlet Allocation model, which represents each document as a mixture across topics. GSDMM estimates the probability that a document d is about topic k, given the current topic vocabulary distribution, as

P(z_d = k | V_k, d) = [(m_{k,−d} + α) / (|D| − 1 + Kα)] · [∏_{w ∈ d} ∏_{j=1}^{N_w^d} (N_{wk,−d} + β + j − 1)] / [∏_{i=1}^{N_d} (n_{k,−d} + |W|β + i − 1)],   (23)

where m_{k,−d} is the number of posts currently assigned to topic k (not including the topic assignment of post d), N_w^d is the number of occurrences of token w in post d, N_{wk,−d} is the number of occurrences of token w in topic k (not including the content of post d), and n_{k,−d} is the total number of tokens in topic k (again excluding post d). The α controls the prior probability that a post is assigned to a topic; increasing α makes all topics more equally likely. The β relates to the prior probability that a token is relevant to any specific topic; increasing β results in fewer topics being found by the sampler. Finally, |D| is the total number of posts and |W| is the size of the vocabulary.

As originally proposed by Yin and Wang (2014), GSDMM is a static model. We modify it by allowing V to vary over time. For readability, we suppress the subscripts and denote the specific post d_{it} by d. We define

m*_{k,t,−d} = (∑_{t′=t−ℓ}^{t} D_{t′k}) − 1,  with D_{tk} = ∑_i D_{tki},   (24)

the number of posts assigned to topic k in the interval from t − ℓ to t, not including post d by blog i at time t. We also let

N*_{wk,t,−d} = ∑_{t′=t−ℓ}^{t} N_{wk,t′},   (25)

be the number of times that token w occurs in topic k in the interval from t − ℓ to t, not including post d. This defines a sliding window that lets the sampler use information from the recent past to infer the topic to which a post belongs, while allowing new tokens to influence the assignment of the post at the current time point.
The probability of assigning post d to topic k is then

P[z_d = k | V_{kt}, d] = [(m*_{k,t,−d} + α) / (|D_{t−ℓ:t}| − 1 + Kα)] · [∏_{w ∈ d} ∏_{s=1}^{N_w^d} (N*_{wk,t,−d} + β + s − 1)] / [∏_{i=1}^{N_d} (N*_{k,t,−d} + |W|β + i − 1)],   (26)

where |D_{t−ℓ:t}| is the number of posts within the lag window. Note that (26) does not use information about the blog that generated the post. The final step is therefore to incorporate the tendency of blog i to post on topic k at time t, using the Poisson rate parameter in (13). Using the normalized pointwise product of conditional probabilities, the final expression for the probability that post d (i.e., d_{it}) belongs to topic k is

P[Z_d = k | V_{k,t}, d, λ_{ikt}] = P[Z_d = k | V_{k,t}, d] P[Z_d = k | λ_{ikt}] / ∑_{q=1}^{K} P[Z_d = q | V_{q,t}, d] P[Z_d = q | λ_{iqt}].   (27)

To reduce computation, we approximate P[Z_d = k | λ_{ikt}] by λ_{ikt} / ∑_{j=1}^{K} λ_{ijt}, as clarified in the data augmentation above. The topic assignment of a post can now be Gibbs sampled using equation (27). The sampler assigns the first post to a single topic, updates the topic-token distributions with the content of that post, then continues to the next post, and repeats. At each time point, the sampler sweeps through the set of posts several times so that the topic assignments can stabilize. The preferred number of sweeps depends on the complexity of the posts on that day, but it need not be large. After some exploration, this study used 10 sweeps at each time point per iteration.

In summary, after topic assignment has been completed for a given day t, all posts within that day have a single topic assignment. Moving to day t + 1, the topic assignments for all posts on that day utilize the information from day t through the topic-specific token-distribution estimator in Equation (25).
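A hedged sketch of this assignment step for a single post, with toy counts (all counts, rates, and names below are illustrative; the actual sampler maintains these quantities incrementally over the sliding window):

```python
# Sketch of the assignment probability in (26)-(27) for one post.
from math import prod

ALPHA, BETA = 0.1, 0.1
K = 3                          # number of topics
VOCAB_SIZE = 50                # |W|
NUM_POSTS_WINDOW = 100         # |D_{t-l:t}|

# Window counts with the current post d removed (the "-d" quantities):
m_star = [30, 40, 29]                  # posts per topic in the window
topic_token_totals = [200, 350, 180]   # N*_{k,t,-d}
token_counts = {                       # N*_{wk,t,-d} for tokens in post d
    "vote": [5, 40, 1],
    "poll": [2, 25, 0],
}
post = ["vote", "poll", "vote"]        # tokens of post d (with repeats)
lam = [0.2, 1.5, 0.4]                  # lambda_{ikt}: blog i's topic rates

def gsdmm_weight(k):
    # First factor of (26): popularity of topic k in the window.
    w = (m_star[k] + ALPHA) / (NUM_POSTS_WINDOW - 1 + K * ALPHA)
    # One numerator term per occurrence of each token in post d.
    occ = {t: post.count(t) for t in set(post)}
    num = prod(token_counts[t][k] + BETA + j
               for t, n in occ.items() for j in range(n))
    # One denominator term per token position in post d.
    den = prod(topic_token_totals[k] + VOCAB_SIZE * BETA + i
               for i in range(len(post)))
    return w * num / den

# Combine with the blog's normalized posting rates and renormalize, as in (27).
raw = [gsdmm_weight(k) * lam[k] / sum(lam) for k in range(K)]
probs = [r / sum(raw) for r in raw]
```

Each token occurrence contributes one factor to the numerator and each token position one factor to the denominator, exactly as in (26); multiplying by the normalized rates λ_{ikt} implements the approximation to (27).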
Once the topic assignment estimator reaches the final day T, all posts have assigned topics and all topics have a time varying token-distribution. The post specific topic assignments are then used in the next step, estimating the blogs' topic interest vectors.

4.4 Node Specific Parameters and Event Parameters

Once posts are assigned to topics, the next step is to update the node specific parameters, specifically the blog topic interest vector π_i and the blog posting rate ρ_i.

The topic interest vector is updated in a Metropolis-Hastings step. As is standard in M-H, we specify a proposal distribution, a likelihood, and a prior. The proposal π*_i is a draw from a Dirichlet distribution with α_i = π_i D_i, where D_i is the total number of posts generated by node i. The likelihood is

∏_{t=1}^{T} ∏_{k=1}^{K} P(D_{kit} | λ_{kit}),   (28)

where P(D_{kit} | λ_{kit}) is the Poisson likelihood from equation (14) for the day specific number of posts on blog i assigned to each topic. Note that the dependence of λ_{kit} on π_i comes through equation (13). A hierarchical prior is used, namely Dirichlet(α_{B_i}), where the parameters are defined by the current block assignment of node i, as in equation (11). This step requires estimates of each blog's block assignment, which are described later.

The ith blog's posting rate ρ_i is also updated using a Metropolis-Hastings step, where the proposal distribution is a normal truncated at 0, with mean ρ_i and standard deviation σ_ρ. The likelihood is the same as in equation (28). The prior is a univariate normal truncated at 0 with mean ρ and variance σ²_ρ. Truncated normal distributions are used to uncouple the mean and the variance.

Next to be updated are the event matrix E and the activation boost parameters ψ_k. The event matrix is updated with a series of Metropolis steps, one for each time point and topic.
The proposal flips the current state: it is 1 if E_{k,t} = 0 and 0 if E_{k,t} = 1. The likelihood is the same as equation (28), except that, for each topic k at each time t, the product of Poisson densities runs over the blogs i = 1, …, I. The prior is a Bernoulli with parameter E_π.

Each activation parameter ψ_k is updated with a Metropolis-Hastings step, where the proposal ψ*_k is a normal truncated at 0 with mean ψ_k and standard deviation σ_ψ. Again, the likelihood is similar to (28), in that it is the product over time of the Poisson densities for every blog's number of posts in topic k; unlike the original likelihood, the second product runs over the blogs i = 1, …, I. The prior distribution on ψ_k is a normal truncated at 0 with mean ψ and standard deviation σ*_ψ.

4.5 Network Parameters

The network parameter set is the vector θ = (θ_0, …, θ_4) defined in equation (18). Each network parameter can be sampled using a Metropolis within Gibbs step. Specifically, the Bernoulli likelihood portion (equivalently, a logistic regression) can be expressed as

∏_{i ≠ j} ∏_{t=1}^{T} exp(θ_0 + θ_1 B(i, j) + θ_2 L_{jt} + θ_3 I_j + θ_4 O_i)^{A_{ijt}} / [1 + exp(θ_0 + θ_1 B(i, j) + θ_2 L_{jt} + θ_3 I_j + θ_4 O_i)].   (29)

To update each parameter, one conditions on all other pieces of information in the model. Proposals are normal with mean equal to the current value of the parameter and a parameter specific standard deviation, while the priors are normal with a given mean and standard deviation. Note that this sampling relies on estimates of the block membership of each blog.

4.6 Block Assignment

We have now described the sampling routines for post-topic assignments, topic token-distributions, node specific parameters, the event matrix and boosts, and network parameters. The blog specific block memberships remain to be estimated.
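The updates for ρ_i and ψ_k above share one pattern: a random-walk proposal from a normal truncated at 0, accepted with the usual Metropolis-Hastings ratio. The sketch below shows that pattern for a generic positive parameter; the target density, tuning constants, and function names are stand-ins, not the model's. Because the truncated-normal proposal is not symmetric, the acceptance ratio includes the ratio of truncation normalizers.

```python
# Sketch of a Metropolis-Hastings update with a truncated-normal
# random-walk proposal. The target (Exponential(1)) is a stand-in.
import math
import random

def phi_tail(x):
    """P(Z > x) for a standard normal Z: the truncation normalizer."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def log_target(x):
    return -x  # stand-in log density on (0, infinity)

def mh_truncnorm(n_iter, sigma=0.5, x0=1.0, seed=7):
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n_iter):
        # Draw from a normal centered at x, truncated at 0.
        prop = rng.gauss(x, sigma)
        while prop <= 0:
            prop = rng.gauss(x, sigma)
        # Hastings correction: the proposal density q(y | x) carries the
        # normalizer P(Z > -x / sigma), which differs between x and prop.
        log_q_ratio = (math.log(phi_tail(-x / sigma))
                       - math.log(phi_tail(-prop / sigma)))
        if math.log(rng.random()) < log_target(prop) - log_target(x) + log_q_ratio:
            x = prop
        chain.append(x)
    return chain

chain = mh_truncnorm(5000)
```

With a symmetric (untruncated) proposal the correction term would vanish; here it prevents the chain from being biased away from 0.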
Due to the complexity of their dependence on many other components of the model, we describe block estimation last. Recall that a blog's block describes two things: the set of topics on which the blog is most likely to post, and the tendency of the blog to link to other blogs within its block. Block assignment for a given node i is therefore informed by several pieces of data: its topic interests, the network parameters, the other blogs' block memberships, and the observed links. One assumption of our model is that a node's network position and topic interests are conditionally independent given its block assignment, which makes the sampling of a block assignment considerably simpler. After simplification, a node's potential block assignment is informed by the number of nodes already assigned to each block. Ultimately, the probability that node i is assigned to the bth block is proportional to

P[b_i = b | A, θ, π_i, B_{−i}] ∝ [(N_{b,−i} + α_B) / (α_B |B| + N − 1)] · P(A | θ, B_i = b) P(π_i | B_i = b) P(|B| | λ_B),   (30)

where N_{b,−i} is the number of nodes assigned to block b, not including node i; α_B is related to the prior probability of being assigned to any block (analogous to α in the topic model); θ is the complete set of network parameters; |B| is the number of blocks with non-zero membership while node i is being considered for potential assignment to block b; λ_B is the prior expected number of blocks; and B_{−i} is the set of block assignments with the ith blog's assignment removed.
The first term thus acts as a penalty on the number of blogs in any given block; the second term is the Bernoulli likelihood (logistic regression) of the observed links given the other block assignments and the network covariates; and the third term is the probability of blog i's topic interest vector given the considered block assignment. Finally, P(|B| | λ_B) is the Poisson probability of |B| given λ_B and acts as a penalty on the number of blocks with non-zero membership.

To elaborate briefly, the first and final terms in (30) together act as tunable priors on the distribution of block sizes and on the number of non-empty blocks. In the generative model, we specify individual priors on the probability of membership in each block. However, for any reasonable K, the total number of blocks B = 1 + ∑_{j=1}^{3} C(K, j) exceeds the number of blogs to be assigned, so the block assignment sampler must account for blocks that have no members. This is a desirable feature, since the analyst need not specify exactly how many blocks are instantiated in the model. Because membership in any particular block is rare relative to the number of possible blocks, the Poisson distribution is a reasonable (and computationally feasible) approximation for the number of non-empty blocks.

Due to the computational intensity of computing the block specific probabilities, we restrict the number of topics in which a block can have interest. In our work, blocks may be interested in at most three topics, except for one block that is interested in all topics (to account for blogs such as The Huffington Post or The New York Times's blog). Furthermore, during sampling, we restrict the blocks considered for a given node i to those whose topic interests include at least one topic on which node i has posted.
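The computational saving from this restriction is easy to quantify; a small check for the K = 22 topics used later in the paper:

```python
# Count candidate blocks with and without the paper's restriction that
# a block has 1, 2, or 3 topic interests (plus one all-topics block).
from math import comb

K = 22
unrestricted = sum(comb(K, i) for i in range(1, K + 1))  # all non-empty subsets
restricted = 1 + sum(comb(K, j) for j in range(1, 4))    # 1 + 22 + 231 + 1540

# unrestricted is 2**22 - 1 = 4,194,303 possible blocks; restricted is 1,794.
```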
These restrictions change the normalizing constant, though the relative probabilities of the blocks under consideration remain the same.

4.7 Summary of Estimation

The primary goals of the estimation routine are to obtain the post-topic assignments, the topic token-distributions, and the blog specific block assignments. These are the main parameters of interest in our application: they describe, respectively, the dynamic nature of each content topic and the range of interests and communications of each blog. Alongside these, we also estimate several other parameters, such as those governing linkage formation, which topics are active when (via the events), the topic post-rate boost parameters, and the blogs' topic interests. While each of these is informative in its own right, in the application below we focus on the topic distributions and the block assignments of each blog.

5 Results

5.1 Prior Choices

Changes in the topics for this dataset are expected to be slow, aside from sudden events that abruptly add new tokens (e.g., "Benghazi" or "Sandy Hook"). We therefore used a lag ℓ of 62 days to capture gradual drift in the topics over time. Specifically, the distribution over tokens for each topic was estimated from a sliding window over the preceding 62 days; within that window, all posts had equal weight.

To determine the number of topics, we used the criterion developed by Arun et al. (2010), selecting the number of topics that minimizes the criterion. It is important to note that this criterion is based only on the topic distribution over the posts and does not take the network structure into account. This is, of course, a limitation of the criterion and a suggestion for further research. Figure 1 shows the criterion curve for fits of between 1 and 30 topics.
The curve attains its minimum at 22, and thus our study fixes the number of topics at 22.

Figure 1: The criterion curve, as in Arun et al. (2010), for determining the number of topics. [The plot shows the KL-criterion (roughly 0.50 to 0.65) against the maximum number of topics (10 to 30).]

Once the number of topics is established, the restrictions on the blocks and the parameter P, introduced in equation (1), can be set. Recall that each block may be interested in 1, 2, 3, or all topics. The out-of-block interest parameter P, which governs a blog's ability to post on out-of-interest topics, was set to 50; this allows blogs some freedom to post on topics outside their block's interests while still focusing mostly on those interests.

The network model specified an edge parameter, mean in-degree and out-degree parameters, a 7-day lag parameter, and a block membership parameter. The edge parameter acts as the intercept for the network model. Mean in-degree and out-degree are nodal covariates, the average daily in-degree and out-degree of each node; these allow the model to accommodate differentially popular blogs. Finally, to add temporal dependency, the 7-day lag is an indicator that takes the value 1 if and only if the pair of blogs has been linked within the previous 7 days, and is 0 otherwise (this captures the fact that bloggers sometimes have debates, which produce a series of links over a relatively short period).

Vague priors were set for each of the network model parameters: all were normal with mean 0 and standard deviation 1000. The proposal standard deviation was set to 1 for the edge parameter and to 0.25 for each of the other network parameters. For the topic model, the α and β parameters were both set to 0.1. The prior for the average post rates ρ_i in equation (2) was a normal truncated at 0, with mean 4 and standard deviation 1000.
The prior for the topic activation parameters ψ_k in equation (2) was a normal truncated at 0, with mean 0 and standard deviation 1000; the proposal standard deviation was 0.5. Additionally, the prior mean number of blocks (λ_B) was set to 25, the prior tendency for block membership α_B was set to 1, and the prior probability of topic activation was set to 0.2.

The sampler ran for 1000 iterations. To ensure mixing of the network parameters, at each iteration the network parameters were updated 10 times. During each iteration, there were also 10 sub-iterations for the topic model and 10 sub-iterations for the block assignments. The first 100 overall iterations were discarded as burn-in, and the remaining 900 were thinned at intervals of 10.

5.2 Findings

The sampler converged to stationarity quickly in every parameter. To assess the mixing of the post-to-topic assignments and of the blogger-to-block assignments, we calculated Adjusted Rand Indices (Hubert and Arabie, 1985; Steinley, 2004) comparing each iteration i to iteration i − 1. The post-to-topic assignment was very stable, with a mean Adjusted Rand Index of 0.806 and standard deviation 0.047. The block assignment was less stable, with a mean Adjusted Rand Index of 0.471 and standard deviation 0.031. We believe this variability arises because many bloggers tended to post on whatever news event captured their attention, making it difficult to assign them to a block with interest in no more than three topics; yet their interests were not so wide that they were reasonably assigned to the block interested in all topics.

All domain level parameters converged successfully. The domain rate parameter ρ_i was estimated for each domain; the posterior means of the domain rates had a mean of 0.632 and a standard deviation of 1.67. The largest domain rate was 22.69.
The distribution of domain post rates was highly skewed, with a few blogs having a very high average post rate and most blogs having a much lower one.

The topic specific activation parameters ψ_k converged successfully. Their posterior means and standard deviations, calculated after the topics had been defined, are given in Table 3. The topics Election and Republican Primary have the greatest posterior means, which suggests that these topics were more event driven than the others.

5.2.1 Topic Results

The topic model found 22 topics, each with distinct subject matter. Table 1 contains the topic titles and the total number of posts in each topic, as well as the three tokens with the highest predictive probability for that topic over all days. Predictive probability was calculated using Bayes' rule:

P(Z_d = k | w ∈ d) = P(w ∈ d | Z_d = k) P(Z_d = k) / P(w ∈ d).   (31)

Table 2 contains the five most frequent tokens in each topic over all days. Topics were named by the authors on the basis of the most predictive tokens as well as the most frequent tokens over all days. Some of these tokens may seem obscure, but they are generally quite pertinent to the identified topics.

Table 1: Topic names and their most specific tokens.
Topic Name            # of posts   Highest Specificity Tokens (1; 2; 3)
Feminism              3971         russel.saunder.juli; circumcis; femin
Keystone Pipeline     4422         loan.guarante.program; product.tax.credit; tar.sand.pipelin
Birth Control         2703         contracept.coverag; birth.control.coverag; religi.organ
Election              14713        soptic; cheroke; eastwood
Mortgages             2130         estat; probat; fiduciari
Entertainment         10555        email.read.add; olivia; free.van
Middle East           6068         mursi; morsi; fatah
LGBT Rights           5425         anti.gay.right; support.equal.marriag; equal.marriag
Sensational Crime     6423         zimmerman; lanza; mass.shoot
Technology            3230         mail.feel.free; pipa; ret
Supreme Court         1767         commerc.claus; bork; chief.justic.robert
Bank Regulation       5222         volcker; dimon; libor
National Defense      1977         iaea; iranian.nuclear.weapon; warhead
Republican Primary    9351         poll.mitt.romney; nation.popular.vote; romney.lead
Voting Laws           7865         ohio.secretari.state; voter.registr.form; hust
Political Theory      1448         bylaw; rawl; sweatshop
Eurozone              1832         standalon; troika; ecb
Taxation              8435         tax.polici.center; health.care.spend; top.tax.rate
Diet and Nutrition    3057         spielberg; harlan; calori
Education             2909         chicago.teacher.union; chicago.public.school; charter.school
Global Warming        2205         arctic.sea.ice; sea.ice; sea.level.rise
Terrorism             3347         kimberlin; broadwel; assang

Table 2: The most frequent words in each topic.
Topic Name            Most Frequent Words (1; 2; 3; 4; 5)
Feminism              women; peopl; don t; person; life
Keystone Pipeline     energi; oil; price; compani; industri
Birth Control         right; state; law; marriag; women
Election              obama; romney; peopl; presid; polit
Mortgages             case; court; bank; judg; attorney
Entertainment         peopl; don t; good; work; game
Middle East           israel; islam; american; peopl; countri
LGBT Rights           gay; peopl; marriag; homosexu; support
Sensational Crime     gun; polic; report; peopl; zimmerman
Technology            compani; googl; facebook; appl; user
Supreme Court         law; court; state; case; constitut
Bank Regulation       bank; market; money; price; compani
National Defense      iran; militari; israel; nuclear; obama
Republican Primary    romney; republican; obama; poll; vote
Voting Laws           state; vote; elect; voter; counti
Political Theory      libertarian; peopl; right; state; govern
Eurozone              bank; debt; economi; rate; govern
Taxation              tax; state; govern; cut; obama
Diet and Nutrition    peopl; don t; govern; polit; work
Education             school; student; teacher; educ; state
Global Warming        climat; climat.chang; temperatur; scienc; scientist
Terrorism             report; govern; attack; inform; case

Table 3: Topic Specific Activation Parameters ψ_k.

Topic                 Posterior Mean   Standard Deviation
Feminism              0.0034           0.0029
Keystone Pipeline     0.0037           0.0017
Birth Control         0.00001          0.00003
Election              0.3917           0.0249
Mortgages             0.00001          0.00005
Entertainment         0.0018           0.0014
Middle East           0.0378           0.0129
LGBT Rights           0.0028           0.0026
Sensational Crime     0.0208           0.0085
Technology            0.0012           0.0010
Supreme Court         0.0022           0.0032
Bank Regulation       0.0021           0.0021
National Defense      0.0014           0.0013
Republican Primary    0.2267           0.0188
Voting Laws           0.0375           0.0095
Political Theory      0.0012           0.00001
Eurozone              0.0001           0.00002
Taxation              0.0135           0.0084
Diet and Nutrition    0.0804           0.0139
Education             0.0011           0.0012
Global Warming        0.0019           0.0011
Terrorism             0.0022           0.0019

It is beyond our scope to detail the dynamics of all 22 topics.
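The token rankings in Table 1 come from the Bayes' rule computation in (31); a minimal sketch with made-up values (every probability below is invented for illustration):

```python
# Hypothetical sketch of the predictive probability in (31).
def predictive_prob(w, k, p_w_given_k, p_k):
    """P(Z = k | w in d): Bayes' rule with P(w) as the normalizer."""
    p_w = sum(p_w_given_k[w][q] * p_k[q] for q in p_k)
    return p_w_given_k[w][k] * p_k[k] / p_w

p_k = {"Election": 0.3, "Taxation": 0.7}                  # P(Z = k)
p_w_given_k = {"romney": {"Election": 0.02, "Taxation": 0.001}}

post_prob = predictive_prob("romney", "Election", p_w_given_k, p_k)
# A token that is much likelier under one topic identifies that topic
# even when the topic's overall share of posts is smaller.
```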
However, a close look at one topic, Sensational Crime, shows the kind of information this analysis obtains. The posts about Sensational Crime largely concerned four events: the shooting of Trayvon Martin in February, the Aurora movie theater shooting in July, the Sikh Temple shooting in August, and the Sandy Hook shooting in December [4].

To illustrate how the salience of a token changes over time, we use a weighted frequency proportion, equal to

WF_{w ∈ k} = P(Z_{d_t} = k_t | w ∈ d) F(w ∈ k_t) / ∑_{w* ∈ V} P(Z_{d_t} = k_t | w* ∈ d) F(w* ∈ k_t),   (32)

where F(w ∈ k_t) is the frequency of token w in the kth topic's distribution at time t. This weighted frequency can be interpreted as the proportion of topic specific tokens at time t taken up by token w. It is useful here because many tokens (such as "people") are shared at high frequency between topics and are therefore uninformative; the weighted frequency instead tracks the topic specific information carried by a token over time. Recall that the topic-specific token-distributions are computed over the past 62 days, which accounts for the smoothness of the curves. The gray shading around each curve represents the 95% Bayesian credible interval.

[4] Trayvon Martin was a young African American man shot by George Zimmerman, in what Zimmerman claimed was an act of self defense, while Martin was walking in Zimmerman's neighborhood. The Aurora theater massacre was a mass shooting at a movie theater in Aurora, Colorado. The Sikh Temple shooting was a mass shooting at a Sikh temple in Wisconsin. The Sandy Hook massacre was a mass shooting at an elementary school in Connecticut.

Figure 2: Weighted proportion timeline of the Trayvon Martin shooting and the subsequent legal case. [Curves for the tokens georg.zimmerman, trayvon, trayvon.martin, and zimmerman.] The date 2/26/2012 is when Trayvon Martin was shot, 3/23/2012 is when President Obama said that Trayvon could have been his son, 6/1/2012 is when Zimmerman's bond was revoked, and 12/3/2012 is when photos were released showing Zimmerman's injuries on the night of the shooting. (95% credible intervals shown.)

Figure 2 presents the weighted frequency curves for tokens specifically related to the shooting of Trayvon Martin. The plot shows three interesting features. First, the prevalence of the tokens spikes not at the time of the shooting (2/26/2012) but at the time of Obama's press statement regarding it. Second, the term "zimmerman" dominates, and in fact is the most prevalent token in the whole Sensational Crime topic from March 22 to July 16. Third, the gap in prevalence between "zimmerman" and "trayvon" or "trayvon.martin" suggests that, in this case, media attention was on the perpetrator rather than the victim.

Figure 3: Weighted proportion timeline of major events in the Sensational Crime topic. [Curves for the tokens holm, lanza, sikh, and zimmerman.] The date 2/26/2012 is when Trayvon Martin was shot, 3/23/2012 is when President Obama said that Trayvon could have been his son, 6/1/2012 is when Zimmerman's bond was revoked, 7/20/2012 is the Aurora mass shooting by James Holmes, 8/5/2012 is the Sikh Temple shooting by Michael Page, 12/3/2012 is when photos showing Zimmerman's physical injuries were released, and 12/14/2012 is when the Sandy Hook massacre occurred. (95% credible intervals shown.)

Figure 3 tracks the major events in the Sensational Crime topic for the entire year. Notably, media focus is never as strong on the tokens related to these events as it is on "zimmerman" specifically.
Rather, the most prevalent tokens over the course of the year are "police" and "gun". Also notable is the lack of events in the later part of the year, after the media attention on the Sikh Temple shooting receded.

5.3 The Network Results

Table 4 shows the posterior means of the network parameters with 95% credible intervals. The edge parameter's posterior mean indicates that the network is quite sparse at most time points. Interestingly, the 7-day lag parameter was negative, suggesting that blogs which were recently linked were less likely to link again in the near future. There are two plausible explanations for this finding. First, the linkage dynamics may not be driven by recent links; rather, links are a consequence of events taking place, so an upsurge in linking when an event occurs is followed by a decrease in the number of links as the event fades out of the news cycle. Second, if linking is done as part of a debate, then once a point has been made, the bloggers may not feel a need for back-and-forth argument.

The block parameter is strongly positive (mean 1.058, standard deviation 0.240), suggesting that blogs which share common interests are more likely to link to each other. This is particularly important, as the block statistic was not formed only from exact block matching: the statistic is proportional to the shared topic interests, so it also reflects linkage between blogs whose interests overlap only partially. This result directly links the network model to the topic model, and allows the analyst to make claims about the block structure as inferred from the topics. Finally, and predictably, both the in-degree and out-degree covariates increase the probability that a blog will receive links. These parameters were included in the analysis to control for the influence of highly popular blogs such as The Blaze and The Huffington Post.
Table 4: Posterior means and 95% credible intervals for network parameters.

Parameter               Posterior Mean   95% CI
Edges                   -8.524           [-8.539, -8.513]
7 day lag               -0.163           [-0.198, -0.131]
Block                    1.058           [0.638, 1.485]
Outdegree of Sender      0.330           [0.329, 0.332]
Indegree of Receiver     0.497           [0.496, 0.499]

We can also examine the link dynamics within a topic block. There were 21 blogs whose maximum posterior probability of block assignment placed them in the block interested only in the Sensational Crime topic. Only 2 of these 21 blogs received any links over the course of the year, and only 1 received links from within the block (legalinsurrection.com). While this runs counter to the idea that they form one block, recall that blogs are also more likely to link to blogs that share some of the same topic interests. There are a total of 62 blogs to which members of the Sensational Crime block link, and 15 of these blogs receive approximately 90% of the links. The Sensational Crime block thus appears to be a set of "commenter" blogs that react to posts appearing on larger blogs.

Our model allows the analyst to isolate the blogs that post on a particular topic, to get a better idea of the linkage dynamics around important events. As an example, we describe how the linkage pattern changed around President Obama's speech regarding the shooting of Trayvon Martin, and again following the Aurora shooting. Figures 4 and 5 show the link structure from the blogs in the Sensational Crime block to other blogs, with the data aggregated over fifteen days. Figure 4 pertains to the days before President Obama's press conference regarding Trayvon Martin on 3/23/2012, and Fig. 5 to the days following his remarks. Figure 6 pertains to the period immediately before the Aurora shooting on 7/20/2012, and Fig. 7 to the period immediately after.
To improve interpretability, only a subset of blogs and links are plotted. Specifically, blogs that were assigned to the block interested only in Sensational Crime, and which posted during the specified time frame, are plotted. Additionally, blogs that are part of the 15-blog subset receiving 90% of the links from the Sensational Crime block, and which received links within the time frame, are plotted. Links generated from blogs in the Sensational Crime block to other members of the same block, or to other blogs, are plotted; links emanating from the 15-node subset are not. These plotting constraints help us discern and interpret the community structure that formed in the discussion of these events.

Figure 4: Fifteen-day aggregate linkage from 3/8/2012 to 3/22/2012, immediately before President Obama’s comment. The number of links, represented by line thickness, is root transformed for clarity. Circular nodes are blogs in the Sensational Crime block. Square nodes are blogs to which the Sensational Crime blogs link; these blogs are generally in multi-topic blocks where one of the topics is Sensational Crime.
Figure 5: This figure is constructed in the same way as Fig. 4, but for the time period from 3/23/2012 to 4/6/2012, immediately after President Obama’s comment.

Figure 6: This figure is constructed in the same way as Fig. 4, but for the time period from 7/5/2012 to 7/19/2012, immediately before the Aurora shooting.

Figure 7: This figure is constructed in the same way as Fig. 4, but for the time period from 7/20/2012 to 8/2/2012, immediately after the Aurora shooting.

Comparing Figs. 4 and 5 shows that the community structure seen in the linkage pattern did not change much as a result of the press conference in which President Obama remarked that if he had a son, he would resemble Trayvon Martin.
Remarkably, there was also no net increase in posting rates. It is known that there was a flurry of posts at this time, and it turns out that the uptick was allocated to the Election block, as people speculated on how his remarks would affect the 2012 presidential election.

The patterns surrounding the Aurora shooting (Figs. 6 and 7) are clearer. The community structure in the discussion is essentially the same, but the amount of traffic increases conspicuously. Specifically, the number of links in the 15 days before the shooting was 197, but afterwards it was 427. Linkage rates especially increase from gunwatch.blogspot.com. In general, this agrees with the conclusion that the methodology is able to detect stable communities whose linkage rates are driven by news events.

To further illustrate the findings of the network model for a different block, we now present examples from the Election block. There are 52 blogs that the model assigned to the block whose only interest was the presidential election. Of these 52 blogs, 33 linked to or received links from other blogs within this same block, and of these 33 blogs, 12 were the recipients of all links. We use random walk community detection (Pons and Latapy, 2006) on the Election block to show that the model can extract meaningful subnetworks for use in secondary analyses.

Figure 8 shows the community substructure for the Election block aggregated over the entire year. Random walk community detection indicates that seven communities optimized modularity, but two communities contained the majority of the blogs. As such, for interpretability, only these two communities are shown. The modularity of this partition is 0.49, and a 10,000-sample permutation test of the community labels indicated that this value of modularity was in the 99th percentile (the greatest modularity found in the permutation test was 0.317).
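A permutation test of this kind can be sketched in a few lines. The snippet below is a minimal stand-in, not the paper’s code: it builds a synthetic two-community graph (the blog network is not reproduced here), computes Newman–Girvan modularity for a fixed partition, and compares it against label permutations that preserve community sizes, using 1,000 rather than 10,000 permutations.

```python
import random

random.seed(0)

# Synthetic two-community graph: 20 nodes, dense within groups, sparse across.
n = 20
groups = {v: 0 if v < 10 else 1 for v in range(n)}
edges = []
for i in range(n):
    for j in range(i + 1, n):
        p = 0.6 if groups[i] == groups[j] else 0.05
        if random.random() < p:
            edges.append((i, j))

def modularity(edges, labels):
    """Newman-Girvan modularity: within-community edge fraction minus
    the fraction expected under the configuration model."""
    m = len(edges)
    deg = {}
    for i, j in edges:
        deg[i] = deg.get(i, 0) + 1
        deg[j] = deg.get(j, 0) + 1
    e_in = sum(1 for i, j in edges if labels[i] == labels[j]) / m
    exp_in = 0.0
    for c in set(labels.values()):
        d_c = sum(d for v, d in deg.items() if labels[v] == c)
        exp_in += (d_c / (2 * m)) ** 2
    return e_in - exp_in

q_obs = modularity(edges, groups)

# Permutation test: shuffle community labels, keeping community sizes fixed.
label_list = list(groups.values())
n_exceed, n_perm = 0, 1000  # the paper uses 10,000 permutations
for _ in range(n_perm):
    random.shuffle(label_list)
    perm = dict(zip(range(n), label_list))
    if modularity(edges, perm) >= q_obs:
        n_exceed += 1
p_value = (n_exceed + 1) / (n_perm + 1)
```

Here the observed partition’s modularity far exceeds anything achievable by relabeling, mirroring the 99th-percentile finding for the Election block.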
This result indicates that the model found meaningful community structure, rather than sampling variability.

Figure 8: Community substructure of the Election block. Circles and squares denote the two large communities; the absence of a shape denotes membership in a small community. The thickness of an edge corresponds to a log transformation of the number of links sent over the entire year. Nodes are labeled 1 through 33.

Table 5 contains the blog labels for the Election block network. Examination of the block structure shows that the majority of the blogs partitioned into one of two communities. Based on Technorati ratings, the community plotted as circles in Figure 8 is politically conservative, while the community plotted as squares is liberal. This separation of the two ends of the political spectrum has been found before in blogs (Lawrence et al., 2010). There is little communication between the two communities, but a lot of communication within them. Interestingly, both communities sent many links to blog 31, which was allocated to a distinct community that it shared with blogs 4, 15, and 28. Blog 31 is mediaite.com, a non-partisan general news and media blog, and the pattern of links from both partisan communities suggests that mediaite.com acts as a common source of information.

Table 5: Blog names and their community membership.
    Label  Blog                                 Community
    1      afeatheradrift.wordpress.com         1
    2      atlasshrugs2000.typepad.com          3
    3      bleedingheartlibertarians.com        1
    4      brainsandeggs.blogspot.com           2
    5      citizentom.com                       3
    6      crethiplethi.com                     3
    7      crookedtimber.org                    1
    8      davedubya.com                        4
    9      dogwalkmusings.blogspot.com          5
    10     driftglass.blogspot.com              1
    11     greatsatansgirlfriend.blogspot.com   3
    12     hennessysview.com                    6
    13     joshuapundit.blogspot.com            3
    14     marezilla.com                        3
    15     mediabistro.com                      2
    16     michellesmirror.com                  3
    17     nomoremister.blogspot.com            1
    18     ochairball.blogspot.com              7
    19     patriotboy.blogspot.com              1
    20     righttruth.typepad.com               3
    21     rightwingnews.com                    3
    22     rogerailes.blogspot.com              1
    23     rwcg.wordpress.com                   1
    24     sultanknish.blogspot.com             3
    25     tbogg.firedoglake.com                1
    26     thecitysquare.blogspot.com           3
    27     therightplanet.com                   3
    28     thoughtsandrantings.com              2
    29     varight.com                          3
    30     blogs.suntimes.com                   1
    31     mediaite.com                         2
    32     rightwingwatch.org                   1
    33     patdollard.com                       3

6 Conclusion

In this manuscript we present a Bayesian model for analyzing a large dataset of political blog posts. This model links the network dynamics to the topic dynamics through a block structure that informs both the topic assignment of a post and the linkage pattern of the network.

A major feature of our model is that the block structure enables interpretable associations among topics. For example, there is a two-topic block whose members are interested in both the Election topic and the Republican Primary topic, but there is no block whose members are interested in just the Supreme Court and Global Warming. That pattern of shared interest conforms to what one would expect.

Another feature of our model is the flexibility of the network model. This analysis uses a limited set of predictors, but the ERGM modeling framework can easily incorporate additional covariates (Robins et al., 2001) and structural features.
Additionally, if one uses a maximum pseudo-likelihood approach (Frank and Strauss, 1986) to approximate the likelihood, then higher-order subgraph terms, such as the number of triangles or geometrically weighted edgewise shared partners (Hunter et al., 2008), can account for transitivity effects. And while the block structure modeled in this paper was based upon similarity in topic interest, more nuanced models are possible; these could use information on, say, political ideology, which the analysis of the Election block found to be important in predicting linkage patterns.

Finally, one major advantage of our approach to modeling these data is the nonparametric nature of the topic dynamics. By avoiding an autoregressive specification of topic dynamics, as in Blei and Lafferty (2006), topics are able to change more freely; in particular, it is possible for new tokens with high probability to emerge overnight. This is ideal for the blog data, since the blogs are often responding to news events.

Our analysis of the political blog dataset had an interpretable topic and block set, and analysis of the Sensational Crime block and the Election block reached reasonable conclusions. Specifically, the dominance of the token “zimmerman” across the year agrees with our sense of the tone and primacy of that discussion, and the spike following the Aurora shooting is commensurate with its news coverage. The Election block neatly split into subcommunities along partisan lines, which accords with previous research (Lawrence et al., 2010).

While our focus was on the specific application of the political blog data, the model developed here has features that can generalize to other dynamic text networks, such as the Wikipedia and scientific citation networks.
Specifically, the connection of topic and link structure through a block structure allows document content to inform the community structure of the overall network. However, each application requires some hand fitting that captures specific aspects of the data. For example, the block structure might need to be dynamic; this would make sense in scientific citation networks, since disciplines sometimes bifurcate (e.g., the computer science of 1970 has now split into machine learning, quantum computing, algorithms, and many other subfields). Also, scientific citation is strictly directional in time: one cannot cite future articles. But the Wikipedia is not directional in time; an article posted a year ago can send a link to one posted yesterday. So specific applications will require tinkering with the model described here.

The work presented here suggests several avenues of future research. On the methodological side, the model can be generalized and extended in several ways. Specifically, the block membership could be considered a dynamic property, allowing blogs to change interest in topics over time. Additionally, building a dynamic model for link patterns would allow researchers to examine specific properties of links over time, rather than assuming the same link-generating distribution at all time points. Finally, this model can be adapted to other dynamic text networks, and its performance should be compared to more traditional topic analysis and community detection procedures.

References

AIROLDI, E. M., BLEI, D. M., FIENBERG, S. E., and XING, E. P. (2008). Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014.

ARUN, R., SURESH, V., MADHAVAN, C. V., and MURTHY, M. N. (2010). On finding the natural number of topics with latent Dirichlet allocation: Some observations.
In Advances in Knowledge Discovery and Data Mining, pages 391–402. Springer.

BLEI, D., NG, A., and JORDAN, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

BLEI, D. M. and LAFFERTY, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, pages 113–120. ACM.

BROWN, P. F., DESOUZA, P. V., MERCER, R. L., PIETRA, V. J. D., and LAI, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

CHANG, J. and BLEI, D. M. (2009). Relational topic models for document networks. In International Conference on Artificial Intelligence and Statistics, pages 81–88.

FAUST, K. and WASSERMAN, S. (1992). Blockmodels: Interpretation and evaluation. Social Networks, 14(1):5–61.

FRANK, O. and STRAUSS, D. (1986). Markov graphs. Journal of the American Statistical Association, 81(395):832–842.

GILKS, W. R., BEST, N., and TAN, K. (1995). Adaptive rejection Metropolis sampling within Gibbs sampling. Applied Statistics, pages 455–472.

HO, Q., EISENSTEIN, J., and XING, E. P. (2012). Document hierarchies from text and links. In Proceedings of the 21st International Conference on World Wide Web, pages 739–748. ACM.

HOFF, P. D., RAFTERY, A. E., and HANDCOCK, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098.

HOFFMAN, M., BACH, F. R., and BLEI, D. M. (2010). Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 856–864.

HOLLAND, P. W. and LEINHARDT, S. (1981). An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association, 76(373):33–50.

HUBERT, L. and ARABIE, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.

HUNTER, D. R., GOODREAU, S.
M., and HANDCOCK, M. S. (2008). Goodness of fit of social network models. Journal of the American Statistical Association, 103(481):248–258.

KRIVITSKY, P. N. and HANDCOCK, M. S. (2014). A separable model for dynamic networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):29–46.

LATOUCHE, P., BIRMELÉ, E., and AMBROISE, C. (2011). Overlapping stochastic block models with application to the French political blogosphere. Annals of Applied Statistics, 5(1):309–336.

LAWRENCE, E., SIDES, J., and FARRELL, H. (2010). Self-segregation or deliberation? Blog readership, participation, and polarization in American politics. Perspectives on Politics, 8(1):141.

MCNAMEE, P. and MAYFIELD, J. (2003). JHU/APL experiments in tokenization and non-word translation. In Comparative Evaluation of Multilingual Information Access Systems, pages 85–97. Springer.

MOODY, J. (2004). The structure of a social science collaboration network: Disciplinary cohesion from 1963 to 1999. American Sociological Review, 69(2):213–238.

NEWMAN, M. E. J. and GIRVAN, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2):026113.

PONS, P. and LATAPY, M. (2006). Computing communities in large networks using random walks. Journal of Graph Algorithms and Applications, 10(2):191.

RAMOS, J. (2003). Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning.

ROBINS, G., ELLIOTT, P., and PATTISON, P. (2001). Network models for social selection processes. Social Networks, 23(1):1–30.

SNIJDERS, T. A. and NOWICKI, K. (1997). Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75–100.

STEINLEY, D. (2004).
Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9(3):386.

TECHNORATI (2002). https://web.archive.org/web/20140420052710/http://technorati.com/.

WANG, E., SILVA, J., WILLETT, R., and CARIN, L. (2011). Dynamic relational topic model for social network analysis with noisy links. In Statistical Signal Processing Workshop (SSP), 2011 IEEE, pages 497–500. IEEE.

WASSERMAN, S. and PATTISON, P. (1996). Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*. Psychometrika, 61(3):401–425.

YIN, J. and WANG, J. (2014). A Dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–242. ACM.
