Inferring Networks of Diffusion and Influence

Information diffusion and virus propagation are fundamental processes taking place in networks. While it is often possible to directly observe when nodes become infected with a virus or adopt the information, observing individual transmissions (i.e.,…

Authors: Manuel Gomez-Rodriguez, Jure Leskovec, Andreas Krause

MANUEL GOMEZ-RODRIGUEZ, Stanford University and MPI for Intelligent Systems
JURE LESKOVEC, Stanford University
ANDREAS KRAUSE, California Institute of Technology

Information diffusion and virus propagation are fundamental processes taking place in networks. While it is often possible to directly observe when nodes become infected with a virus or adopt the information, observing individual transmissions (i.e., who infects whom, or who influences whom) is typically very difficult. Furthermore, in many applications, the underlying network over which the diffusions and propagations spread is actually unobserved. We tackle these challenges by developing a method for tracing paths of diffusion and influence through networks and inferring the networks over which contagions propagate. Given the times when nodes adopt pieces of information or become infected, we identify the optimal network that best explains the observed infection times. Since the optimization problem is NP-hard to solve exactly, we develop an efficient approximation algorithm that scales to large datasets and finds provably near-optimal networks.

We demonstrate the effectiveness of our approach by tracing information diffusion in a set of 170 million blogs and news articles over a one year period to infer how information flows through the online media space. We find that the diffusion network of news for the top 1,000 media sites and blogs tends to have a core-periphery structure with a small set of core media sites that diffuse information to the rest of the Web. These sites tend to have stable circles of influence with more general news media sites acting as connectors between them.
Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications - Data mining

General Terms: Algorithms, Experimentation

Additional Key Words and Phrases: Networks of diffusion, Information cascades, Blogs, News media, Meme-tracking, Social networks

A preliminary version of this work appeared in the proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), 2010. The algorithm implementation and the data are available at http://snap.stanford.edu/netinf/

ACM Transactions on Knowledge Discovery from Data, Vol. V, No. N, Month 20YY, Pages 1–0??.

1. INTRODUCTION

The dissemination of information, cascading behavior, diffusion and spreading of ideas, innovation, information, influence, viruses and diseases are ubiquitous in social and information networks. Such processes play a fundamental role in settings that include the spread of technological innovations [Rogers 1995; Strang and Soule 1998], word of mouth effects in marketing [Domingos and Richardson 2001; Kempe et al. 2003; Leskovec et al. 2006], the spread of news and opinions [Adar et al. 2004; Gruhl et al. 2004; Leskovec et al. 2007; Leskovec et al. 2009; Liben-Nowell and Kleinberg 2008], collective problem-solving [Kearns et al. 2006], the spread of infectious diseases [Anderson and May 2002; Bailey 1975; Hethcote 2000] and sampling methods for hidden populations [Goodman 1961; Heckathorn 1997].

In order to study network diffusion there are two fundamental challenges one has to address. First, to be able to track cascading processes taking place in a network, one needs to identify the contagion (i.e., the idea, information, virus, disease) that is actually spreading and propagating over the edges of the network.
Moreover, one then has to identify a way to successfully trace the contagion as it is diffusing through the network. For example, when tracing information diffusion, it is a non-trivial task to automatically and on a large scale identify the phrases or “memes” that are spreading through the Web [Leskovec et al. 2009].

Second, we usually think of diffusion as a process that takes place on a network, where the contagion propagates over the edges of the underlying network from node to node like an epidemic. However, the network over which propagations take place is usually unknown and unobserved. Commonly, we only observe the times when particular nodes get “infected” but we do not observe who infected them. In the case of information propagation, as bloggers discover new information, they write about it without explicitly citing the source. Thus, we only observe the time when a blog gets “infected” with information but not where it got infected from. Similarly, in virus propagation, we observe people getting sick without usually knowing who infected them. And, in a viral marketing setting, we observe people purchasing products or adopting particular behaviors without explicitly knowing who was the influencer that caused the adoption or the purchase.

These challenges are especially pronounced in information diffusion on the Web, where there have been relatively few large scale studies of information propagation in large networks [Adar and Adamic 2005; Leskovec et al. 2006; Leskovec et al. 2007; Liben-Nowell and Kleinberg 2008]. In order to study paths of diffusion over networks, one essentially needs complete information about who influences whom, as a single missing link in a sequence of propagations can lead to wrong inferences [Sadikov et al. 2011].
Even if one collects near-complete large scale diffusion data, it is a non-trivial task to identify textual fragments that propagate relatively intact through the Web without human supervision. And even then the question of how information diffuses through the network still remains. Thus, the questions are: What is the network over which the information propagates on the Web? What is the global structure of such a network? How do news media sites and blogs interact? Which roles do different sites play in the diffusion process and how influential are they?

Our approach to inferring networks of diffusion and influence. We address the above questions by positing that there is some underlying unknown network over which information, viruses or influence propagate. We assume that the underlying network is static and does not change over time. We then observe the times when nodes get infected by or decide to adopt a particular contagion (a particular piece of information, product or a virus) but we do not observe where they got infected from. Thus, for each contagion, we only observe times when nodes got infected, and we are then interested in determining the paths the diffusion took through the unobserved network. Our goal is to reconstruct the network over which contagions propagate. Figure 1 gives an example.

Edges in such networks of influence and diffusion have various interpretations. In virus or disease propagation, edges can be interpreted as who-infects-whom. In information propagation, edges are who-adopts-information-from-whom or who-listens-to-whom. In a viral marketing setting, edges can be understood as who-influences-whom.

The main premise of our work is that by observing many different contagions spreading among the nodes, we can infer the edges of the underlying propagation network.
If node $v$ tends to get infected soon after node $u$ for many different contagions, then we can expect an edge $(u, v)$ to be present in the network. By exploring correlations in node infection times, we aim to recover the unobserved diffusion network.

[Figure 1: (a) True network $G^*$; (b) Inferred network $\hat{G}$ using heuristic baseline method; (c) Inferred network $\hat{G}$ using NETINF algorithm.]
Fig. 1. Diffusion network inference problem. There is an unknown network (a) over which contagions propagate. We are given a collection of node infection times and aim to recover the network in figure (a). Using a baseline heuristic (see Section 4) we recover network (b) and using the proposed NETINF algorithm we recover network (c). Red edges denote mistakes. The baseline makes many mistakes but NETINF almost perfectly recovers the network.

[Figure 2]
Fig. 2. The underlying true network over which contagions spread is illustrated on the top. Each subsequent layer depicts a cascade created by the diffusion of a particular contagion. For each cascade, nodes in gray are the “infected” nodes and the edges denote the direction in which the contagion propagated. Now, given only the node infection times in each cascade we aim to infer the connectivity of the underlying network $G^*$.

The concept of a set of contagions over a network is illustrated in Figure 2. As a contagion spreads over the underlying network it creates a trace, called a cascade. Nodes of the cascade are the nodes of the network that got infected by the contagion and edges of the cascade represent edges of the network over which the contagion actually spread.
On the top of Figure 2, the underlying true network over which contagions spread is illustrated. Each subsequent layer depicts a cascade created by a particular contagion. A priori, we do not know the connectivity of the underlying true network and our aim is to infer this connectivity using the infection times of nodes in many cascades.

We develop NETINF, a scalable algorithm for inferring networks of diffusion and influence. We first formulate a generative probabilistic model of how, on a fixed hypothetical network, contagions spread as directed trees (i.e., a node infects many other nodes) through the network. Since we only observe times when nodes get infected, there are many possible ways the contagion could have propagated through the network that are consistent with the observed data. In order to infer the network we have to consider all possible ways of the contagion spreading through the network. Thus, naive computation of the model takes exponential time since there is a combinatorially large number of propagation trees. We show that, perhaps surprisingly, computations over this super-exponential set of trees can be performed in polynomial (cubic) time. However, under such a model, the network inference problem is still intractable. Thus, we introduce a tractable approximation, and show that the objective function can be both efficiently computed and efficiently optimized. By exploiting a diminishing returns property of the problem, we prove that NETINF infers near-optimal networks. We also speed up NETINF by exploiting the local structure of the objective function and by using lazy evaluations [Leskovec et al. 2007].

In a broader context, our work here is related to the network structure learning of probabilistic directed graphical models [Friedman et al. 1999; Getoor et al. 2003], where heuristic greedy hill-climbing or stochastic search, both of which offer no performance guarantees, are usually used in practice. In contrast, our work here provides a novel formulation and a tractable polynomial time algorithm for inferring directed networks together with an approximation guarantee that ensures the inferred networks will be of near-optimal quality.

Our results on synthetic datasets show that we can reliably infer an underlying propagation and influence network, regardless of the overall network structure. Validation on real and synthetic datasets shows that NETINF outperforms a baseline heuristic by an order of magnitude and correctly discovers more than 90% of the edges. We apply our algorithm to a real Web information propagation dataset of 170 million blog and news articles over a one year period. Our results show that online news propagation networks tend to have a core-periphery structure with a small set of core blog and news media websites that diffuse information to the rest of the Web, that news media websites tend to diffuse the news faster than blogs, and that blogs keep discussing news for longer than media websites.

Inferring how information or viruses propagate over networks is crucial for a better understanding of diffusion in networks. By modeling the structure of the propagation network, we can gain insight into positions and roles various nodes play in the diffusion process and assess the range of influence of nodes in the network.

The rest of the paper is organized as follows. Section 2 is devoted to the statement of the problem, the formulation of the model and the optimization problem. In Section 3, an efficient reformulation of the optimization problem is proposed and its solution is presented.
Experimental evaluations using synthetic and MemeTracker data are shown in Section 4. We conclude with related work in Section 5 and a discussion of our results in Section 6.

2. DIFFUSION NETWORK INFERENCE PROBLEM

We next formally describe the problem where contagions propagate over an unknown static directed network and create cascades. For each cascade we observe times when nodes got infected but not who infected them. The goal then is to infer the unknown network over which contagions originally propagated. In an information diffusion setting, each contagion corresponds to a different piece of information that spreads over the network and all we observe are the times when particular nodes adopted or mentioned particular information. The task then is to infer the network where a directed edge $(u, v)$ carries the semantics that node $v$ tends to get influenced by node $u$ (i.e., mentions or adopts the information after node $u$ does so as well).

2.1 Problem statement

Given a hidden directed network $G^*$, we observe multiple contagions spreading over it. As the contagion $c$ propagates over the network, it leaves a trace, a cascade, in the form of a set of triples $(u, v, t_v)_c$, which means that contagion $c$ reached node $v$ at time $t_v$ by spreading from node $u$ (i.e., by propagating over the edge $(u, v)$). We denote the fact that the cascade initially starts from some active node $v$ at time $t_v$ as $(\emptyset, v, t_v)_c$.

Now, we only get to observe the time $t_v$ when contagion $c$ reached node $v$ but not how it reached the node $v$, i.e., we only know that $v$ got infected by one of its neighbors in the network but do not know who $v$'s neighbors are and which of them infected $v$.
Thus, instead of observing the triples $(u, v, t_v)_c$ that fully specify the trace of the contagion $c$ through the network, we only get to observe pairs $(v, t_v)_c$ that describe the time $t_v$ when node $v$ got infected by the contagion $c$. Now, given such data about node infection times for many different contagions, we aim to recover the unobserved directed network $G^*$, i.e., the network over which the contagions originally spread.

We use the term hit time $t_u$ to refer to the time when a cascade created by a contagion hits (infects, causes the adoption by) a particular node $u$. In practice, many contagions do not hit all the nodes of the network. Simply, if a contagion hits all the nodes this means it will infect every node of the network. In real life most cascades created by contagions are relatively small. Thus, if a node $u$ is not hit by a cascade, then we set $t_u = \infty$. Then, the observed data about a cascade $c$ is specified by the vector $t_c = [t_1, \ldots, t_n]$ of hit times, where $n$ is the number of nodes in $G^*$, and $t_i$ is the time when node $i$ got infected by the contagion $c$ ($t_i = \infty$ if $i$ did not get infected by $c$).

Our goal now is to infer the network $G^*$. In order to solve this problem we first define the probabilistic model of how contagions spread over the edges of the network. We first specify the contagion transmission model $P_c(u, v)$ that describes how likely it is that node $u$ spreads the contagion $c$ to node $v$. Based on the model we then describe the probability $P(c|T)$ that the contagion $c$ propagated in a particular cascade tree pattern $T = (V_T, E_T)$, where tree $T$ simply specifies which nodes infected which other nodes (e.g., see Figure 2).
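As an illustration only, the hit-time vector $t_c$ described above can be sketched in Python as follows. The function name `hit_time_vector`, the integer node ids and the example cascade are assumptions made for this sketch and are not part of the released NETINF code.

```python
import math

def hit_time_vector(observations, n):
    """Build t_c = [t_1, ..., t_n] from observed (node, time) pairs.

    Nodes that the cascade never reached keep a hit time of infinity.
    """
    t = [math.inf] * n
    for node, time in observations:
        # Assumption: if a node is reported twice, keep its earliest time.
        t[node] = min(t[node], time)
    return t

# Hypothetical cascade on a 5-node network: only nodes 0, 2 and 3 are hit.
t_c = hit_time_vector([(0, 1.0), (2, 2.5), (3, 4.0)], n=5)

# The infected set is exactly the set of nodes with a finite hit time.
infected = [i for i, t_i in enumerate(t_c) if t_i < math.inf]
```

Storing $\infty$ for nodes that were never hit keeps every cascade the same length, so cascades of very different sizes can be handled uniformly.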
Last, we define $P(c|G)$, which is the probability that cascade $c$ occurs in a network $G$. And then, under this model, we show how to estimate a (near-)maximum likelihood network $\hat{G}$, i.e., the network $\hat{G}$ that (approximately) maximizes the probability of cascades $c$ occurring in it.

2.2 Cascade Transmission Model

We start by formulating the probabilistic model of how contagions diffuse over the network. We build on the Independent Cascade Model [Kempe et al. 2003] which posits that an infected node infects each of its neighbors in the network $G$ independently at random with some small chosen probability. This model implicitly assumes that every node $v$ in the cascade $c$ is infected by at most one node $u$. That is, it only matters when the first neighbor of $v$ infects it and all infections that come afterwards have no impact. Note that $v$ can have several of its neighbors infected but only one neighbor actually activates $v$. Thus, the structure of a cascade created by the diffusion of contagion $c$ is fully described by a directed tree $T$ that is contained in the directed graph $G$, i.e., since the contagion can only spread over the edges of $G$ and each node can only be infected by at most one other node, the pattern in which the contagion propagated is a tree and a subgraph of $G$. Refer to Figure 2 for an illustration of a network and a set of cascades created by contagions diffusing over it.

Probability of an individual transmission. The Independent Cascade Model only implicitly models time through the epochs of the propagation. We thus formulate a variant of the model that preserves the tree structure of cascades and also incorporates the notion of time.

We think of our model of how a contagion transmits from $u$ to $v$ in two steps.
When a new node $u$ gets infected it gets a chance to transmit the contagion to each of its currently uninfected neighbors $w$ independently with some small probability $\beta$. If the contagion is transmitted we then sample the incubation time, i.e., how long after $w$ got infected, $w$ will get a chance to infect its (at that time uninfected) neighbors. Note that cascades in this model are necessarily trees since node $u$ only gets to infect neighbors $w$ that have not yet been infected.

Table I. Table of symbols.

Symbol                    Description
$G(V, E)$                 Directed graph with nodes $V$ and edges $E$ over which contagions spread
$\beta$                   Probability that contagion propagates over an edge of $G$
$\alpha$                  Incubation time model parameter (refer to Eq. 1)
$E_\varepsilon$           Set of $\varepsilon$-edges, $E \cap E_\varepsilon = \emptyset$ and $E \cup E_\varepsilon = V \times V$
$c$                       Contagion that spreads over $G$
$t_u$                     Time when node $u$ got hit (infected) by a particular cascade
$t_c$                     Set of node hit times in cascade $c$; $t_c[i] = \infty$ if node $i$ did not participate in $c$
$\Delta_{u,v}$            Time difference between the node hit times, $t_v - t_u$, in a particular cascade
$C = \{(c, t_c)\}$        Set of contagions $c$ and corresponding hit times, i.e., the observed data
$\mathcal{T}_c(G)$        Set of all possible propagation trees of cascade $c$ on graph $G$
$T(V_T, E_T)$             Cascade propagation tree, $T \in \mathcal{T}_c(G)$
$V_T$                     Node set of $T$, $V_T = \{i \mid i \in V \text{ and } t_c[i] < \infty\}$
$E_T$                     Edge set of $T$, $E_T \subseteq E \cup E_\varepsilon$

First, we define the probability $P_c(u, v)$ that a node $u$ spreads the cascade to a node $v$, i.e., a node $u$ influences/infects/transmits contagion $c$ to a node $v$. Formally, $P_c(u, v)$ specifies the conditional probability of observing cascade $c$ spreading from $u$ to $v$.

Consider a pair of nodes $u$ and $v$, connected by a directed edge $(u, v)$, and the corresponding hit times $(u, t_u)_c$ and $(v, t_v)_c$.
Since the contagion can only propagate forward in time, if node $u$ got infected after node $v$ ($t_u > t_v$) then $P_c(u, v) = 0$, i.e., nodes cannot influence nodes from the past. On the other hand (if $t_u < t_v$) we make no assumptions about the properties and shape of $P_c(u, v)$. To build some intuition, we can think that the probability of propagation $P_c(u, v)$ between a pair of nodes $u$ and $v$ is decreasing in the difference of their infection times, i.e., the farther apart in time the two nodes get infected the less likely they are to infect one another.

However, we note that our approach allows for the contagion transmission model $P_c(u, v)$ to be arbitrarily complicated as it can also depend on the properties of the contagion $c$ as well as the properties of the nodes and edges. For example, in a disease propagation scenario, node attributes could include information about the individual's socio-economic status, commute patterns, disease history and so on, and the contagion properties would include the strength and the type of the virus. This allows for great flexibility in the cascade transmission models as the probability of infection depends on the parameters of the disease and properties of the nodes.

Purely for simplicity, in the rest of the paper we assume the simplest and most intuitive model where the probability of transmission depends only on the time difference between the node hit times $\Delta_{u,v} = t_v - t_u$. We consider two different models for the incubation time distribution $\Delta_{u,v}$, an exponential and a power-law model, each with parameter $\alpha$:

$$P_c(u, v) = P_c(\Delta_{u,v}) \propto e^{-\Delta_{u,v}/\alpha} \quad \text{and} \quad P_c(u, v) = P_c(\Delta_{u,v}) \propto \frac{1}{\Delta_{u,v}^{\alpha}}. \tag{1}$$

Both the power-law and exponential waiting time models have been argued for in the literature [Barabási 2005; Leskovec et al. 2007; Malmgren et al. 2008].
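The two incubation time models of Eq. 1 can be sketched as a small Python function. This is a minimal illustration, not the authors' implementation: the name `transmission_weight` and the `model` flag are hypothetical, the returned values are unnormalized (Eq. 1 is stated only up to proportionality), and giving weight 0 to non-positive delays is a choice made here so that contagions only move forward in time.

```python
import math

def transmission_weight(t_u, t_v, alpha, model="exp"):
    """Unnormalized P_c(u, v) as a function of delta = t_v - t_u (Eq. 1)."""
    # A node that was never hit (infinite hit time) takes part in no transmission.
    if math.isinf(t_u) or math.isinf(t_v):
        return 0.0
    delta = t_v - t_u
    # Contagions cannot travel backwards in time; non-positive delays get weight 0.
    if delta <= 0:
        return 0.0
    if model == "exp":
        return math.exp(-delta / alpha)   # exponential model
    if model == "pow":
        return delta ** (-alpha)          # power-law model
    raise ValueError("model must be 'exp' or 'pow'")
```

For instance, `transmission_weight(1.0, 3.0, alpha=2.0, model="pow")` evaluates $1/\Delta^{\alpha}$ with $\Delta = 2$ and $\alpha = 2$.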
In the end, our algorithm does not depend on the particular choice of the incubation time distribution and more complicated non-monotonic and multimodal functions can easily be chosen [Crane and Sornette 2008; Wallinga and Teunis 2004; Gomez-Rodriguez et al. 2011]. Also, we interpret $\infty + \Delta_{u,v} = \infty$, i.e., if $t_u = \infty$, then $t_v = \infty$ with probability 1. Note that the parameter $\alpha$ can potentially be different for each edge in the network.

Considering the above model in a generative sense, we can think that the cascade $c$ reaches node $u$ at time $t_u$, and we now need to generate the time $t_v$ when $u$ spreads the cascade to node $v$. As cascades generally do not infect all the nodes of the network, we need to explicitly model the probability that the cascade stops. With probability $(1 - \beta)$, the cascade stops and never reaches $v$, thus $t_v = \infty$. With probability $\beta$, the cascade transmits over the edge $(u, v)$, and the hit time $t_v$ is set to $t_u + \Delta_{u,v}$, where $\Delta_{u,v}$ is the incubation time that passed between the hit times $t_u$ and $t_v$.

Likelihood of a cascade spreading in a given tree pattern $T$. Next we calculate the likelihood $P(c|T)$ that contagion $c$ in a graph $G$ propagated in a particular tree pattern $T(V_T, E_T)$ under some assumptions. This means we want to assess the probability that a cascade $c$ with hit times $t_c$ propagated in a particular tree pattern $T$. Due to our modeling assumption that cascades are trees the likelihood is simply:

$$P(c|T) = \prod_{(u,v) \in E_T} \beta P_c(u, v) \prod_{u \in V_T,\, (u,x) \in E \setminus E_T} (1 - \beta), \tag{2}$$

where $E_T$ is the edge set and $V_T$ is the vertex set of tree $T$. Note that $V_T$ is the set of nodes that got infected by $c$, i.e., $V_T \subset V$ contains exactly the nodes $i$ with $t_c[i] < \infty$. The above expression has an intuitive explanation.
Since the cascade spread in tree pattern $T$, the contagion successfully propagated along those edges. And, along the edges where the contagion did not spread, the cascade had to stop. Here, we assume independence between edges to simplify the problem. Despite this simplification, we later show empirically that NETINF works well in practice.

Moreover, $P(c|T)$ can be rewritten as:

$$P(c|T) = \beta^{q} (1 - \beta)^{r} \prod_{(u,v) \in E_T} P_c(u, v), \tag{3}$$

where $q = |E_T| = |V_T| - 1$ is the number of edges in $T$ and counts the edges over which the contagion successfully propagated. Similarly, $r$ counts the number of edges that did not activate and failed to transmit the contagion: $r = \sum_{u \in V_T} d_{out}(u) - q$, where $d_{out}(u)$ is the out-degree of node $u$ in graph $G$.

We conclude with an observation that will come in very handy later. Examining Eq. 3 we notice that the first part of the equation, before the product sign, does not depend on the edge set $E_T$ but only on the vertex set $V_T$ of the tree $T$. This means that the first part is constant for all trees $T$ with the same vertex set $V_T$ but possibly different edge sets $E_T$. For example, this means that for a fixed $G$ and $c$, maximizing $P(c|T)$ with respect to $T$ (i.e., finding the most probable tree) does not depend on the second product of Eq. 2. This means that when optimizing, one only needs to focus on the first product, where the edges of the tree $T$ simply specify how the cascade spreads, i.e., every node in the cascade gets influenced by exactly one node, that is, its parent.

Cascade likelihood. We just defined the likelihood $P(c|T)$ that a single contagion $c$ propagates in a particular tree pattern $T$. Now, our aim is to compute $P(c|G)$, the probability that a cascade $c$ occurs in a graph $G$. Note that we observe only the node infection times while the exact propagation tree $T$ (who-infected-whom) is unknown.
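To make Eq. 3 concrete, the following Python sketch evaluates $\log P(c|T)$ for a given tree. It is an illustration under stated assumptions: the exponential model of Eq. 1 is used without its normalization constant (so the value is correct only up to an additive constant), the computation is done in log space to avoid underflow, and the function name, argument layout and the toy graph are hypothetical.

```python
import math

def log_likelihood_of_tree(tree_edges, out_degree, hit_times, beta, alpha=1.0):
    """log P(c | T) following Eq. 3, up to the normalization of P_c."""
    # V_T: the nodes with a finite hit time.
    infected = [v for v, t in hit_times.items() if t < math.inf]
    q = len(tree_edges)                               # edges that transmitted
    r = sum(out_degree[u] for u in infected) - q      # edges that failed to transmit
    log_p = q * math.log(beta) + r * math.log(1.0 - beta)
    for u, v in tree_edges:
        delta = hit_times[v] - hit_times[u]
        if delta <= 0:
            return -math.inf                          # tree inconsistent with the hit times
        log_p += -delta / alpha                       # log of exp(-delta / alpha)
    return log_p

# Hypothetical graph with edges a->b, a->c, b->c, c->d; node d is never hit.
hit_times = {"a": 0.0, "b": 1.0, "c": 2.0, "d": math.inf}
out_degree = {"a": 2, "b": 1, "c": 1, "d": 0}
log_p = log_likelihood_of_tree([("a", "b"), ("b", "c")], out_degree, hit_times, beta=0.5)
```

In this example $q = 2$ and $r = (2 + 1 + 1) - 2 = 2$, so the $\beta$-dependent part is the same for every tree over the nodes a, b and c, as the observation following Eq. 3 states.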
In general, over a given graph $G$ there may be multiple different propagation trees $T$ that are consistent with the observed data. For example, Figure 3 shows three different cascade propagation paths (trees $T$) that are all consistent with the observed data $t_c = (t_a = 1, t_c = 2, t_b = 3, t_e = 4)$.

[Figure 3]
Fig. 3. Different propagation trees $T$ of cascade $c$ that are all consistent with observed node hit times $t_c = (t_a = 1, t_c = 2, t_b = 3, t_e = 4)$. In each case, wider edges compose the tree, while thinner edges are the rest of the edges of the network $G$.

So, we need to combine the probabilities of individual propagation trees into a probability of a cascade $c$. We achieve this by considering all possible propagation trees $T$ that are supported by network $G$, i.e., all possible ways in which cascade $c$ could have spread over $G$:

$$P(c|G) = \sum_{T \in \mathcal{T}_c(G)} P(c|T) P(T|G), \tag{4}$$

where $c$ is a cascade and $\mathcal{T}_c(G)$ is the set of all the directed connected spanning trees on a subgraph of $G$ induced by the nodes that got hit by cascade $c$. Note that even though the sum ranges over all possible spanning trees $T \in \mathcal{T}_c(G)$, in case $T$ is inconsistent with the observed data, then $P(c|T) = 0$.

Assuming that all trees are a priori equally likely (i.e., $P(T|G) = 1/|\mathcal{T}_c(G)|$) and using the observation from Equation 3 we obtain:

$$P(c|G) \propto \sum_{T \in \mathcal{T}_c(G)} \prod_{(u,v) \in E_T} P_c(u, v). \tag{5}$$

Basically, the graph $G$ defines the skeleton over which the cascades can propagate and $T$ defines a particular possible propagation tree. There may be many possible trees that explain a single cascade (see Fig. 3), and since we do not know in which particular tree pattern the cascade really propagated, we need to consider all possible propagation trees $T$ in $\mathcal{T}_c(G)$.
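The sum in Eq. 5 can be illustrated by brute-force enumeration on a toy cascade. The sketch below is only meant to build intuition and scales exponentially: it assumes distinct hit times and at least one infected node, uses the unnormalized exponential model of Eq. 1, and works on a hypothetical five-edge graph that is not the graph drawn in Figure 3. Because every edge with nonzero weight points forward in time, a consistent tree is obtained by letting each infected node other than the earliest one choose a single earlier-infected in-neighbor as its parent.

```python
import itertools
import math

def consistent_trees(edges, hit_times):
    """Yield every propagation tree on the infected nodes that respects time order."""
    infected = sorted((v for v, t in hit_times.items() if t < math.inf),
                      key=hit_times.get)
    rest = infected[1:]  # infected[0], the earliest node, is the root and has no parent
    # Candidate parents of v: infected in-neighbours of v that were hit before v.
    candidates = [[u for u in infected
                   if (u, v) in edges and hit_times[u] < hit_times[v]]
                  for v in rest]
    for parents in itertools.product(*candidates):
        yield list(zip(parents, rest))

def cascade_tree_sum(edges, hit_times, alpha=1.0):
    """Right-hand side of Eq. 5 with unnormalized exponential weights."""
    return sum(
        math.prod(math.exp(-(hit_times[v] - hit_times[u]) / alpha) for u, v in tree)
        for tree in consistent_trees(edges, hit_times)
    )

# Hypothetical graph; the hit times are those quoted for Figure 3.
edges = {("a", "b"), ("a", "c"), ("c", "b"), ("b", "e"), ("c", "e")}
hit_times = {"a": 1.0, "c": 2.0, "b": 3.0, "e": 4.0}
trees = list(consistent_trees(edges, hit_times))
total = cascade_tree_sum(edges, hit_times)
```

On this toy graph node c has one possible parent while b and e have two each, so four trees are enumerated.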
Thus, the sum over $T$ is a sum over all directed spanning trees of the graph induced by the vertices that got hit by the cascade $c$.

We just computed the probability of a single cascade $c$ occurring in $G$, and we now define the probability of a set of cascades $C$ occurring in $G$ simply as:

$$P(C|G) = \prod_{c \in C} P(c|G), \tag{6}$$

where we again assume conditional independence between cascades given the graph $G$.

2.3 Estimating the network that maximizes the cascade likelihood

Now that we have formulated the cascade transmission model, we state the diffusion network inference problem, where the goal is to find $\hat{G}$ that solves the following optimization problem:

PROBLEM 1. Given a set of node infection times $t_c$ for a set of cascades $c \in C$, a propagation probability parameter $\beta$ and an incubation time distribution $P_c(u, v)$, find the network $\hat{G}$ such that:

$$\hat{G} = \operatorname*{argmax}_{|G| \le k} P(C|G), \tag{7}$$

where the maximization is over all directed graphs $G$ of at most $k$ edges, and $P(C|G)$ is defined by Equations 6, 4 and 2.

We include the constraint on the number of edges in $\hat{G}$ simply because we seek a sparse solution, since real graphs are sparse. We discuss how to choose $k$ in further sections of the paper.

The above optimization problem seems wildly intractable. To evaluate Eq. 6, we need to compute Eq. 4 for each cascade $c$, i.e., the sum over all possible spanning trees $T$. The number of trees can be super-exponential in the size of $G$ but, perhaps surprisingly, this super-exponential sum can be performed in time polynomial in the number $n$ of nodes in the graph $G$, by applying Kirchhoff's matrix tree theorem [Knuth 1968]:

THEOREM 1 [Tutte 1948].
If we construct a matrix $A$ such that $a_{i,j} = \sum_k w_{k,j}$ if $i = j$ and $a_{i,j} = -w_{i,j}$ if $i \neq j$, and if $A_{x,y}$ is the matrix created by removing any row $x$ and column $y$ from $A$, then

$$(-1)^{x+y} \det(A_{x,y}) = \sum_{T \in A} \prod_{(i,j) \in T} w_{i,j}, \tag{8}$$

where $T$ is each directed spanning tree in $A$.

In our case, we set $w_{i,j}$ to be simply $P_c(i, j)$ and we compute the product of the determinants of $|C|$ matrices, one for each cascade: each determinant evaluates the sum over trees in Eq. 5 for one cascade, and their product gives the likelihood of Eq. 6 up to the proportionality constants of Eq. 5. Note that since edges $(i, j)$ where $t_i \ge t_j$ have weight 0 (i.e., they are not present), given a fixed cascade $c$, the collection of edges with positive weight forms a directed acyclic graph (DAG). A DAG with a time-ordered labeling of its nodes has an upper triangular connectivity matrix. Thus, the matrix $A_{x,y}$ of Theorem 1 is, by construction, upper triangular. Fortunately, the determinant of an upper triangular matrix is simply the product of the elements of its diagonal. This means that instead of using super-exponential time, we are now able to evaluate Eq. 6 in time $O(|C| \cdot |V|^2)$ (the time required to build $A_{x,y}$ and compute the determinant for each of the $|C|$ cascades).

However, this does not completely solve our problem for two reasons. First, while quadratic time is a drastic improvement over a super-exponential computation, it is still too expensive for the large graphs that we want to consider. Second, we can use the above result only to evaluate the quality of a particular graph $G$, while our goal is to find the best graph $\hat{G}$. To do this, we would need to search over all graphs $G$ to find the best one. Again, as there is a super-exponential number of graphs, this is not practical. To circumvent this one could propose some ad hoc search heuristics, like hill-climbing. However, due to the combinatorial nature of the likelihood function, such a procedure would likely be prone to local maxima.
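The determinant shortcut can be checked numerically on a toy cascade. The sketch below is an illustration and not the authors' code: it assumes the infected nodes are indexed in order of hit time with node 0 as the root, removes the row and column of the root (one valid choice of $x = y$ in Theorem 1), and uses a small pure-Python Gaussian elimination so that no external library is needed. All function names and the weight matrix are hypothetical.

```python
import math

def determinant(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    n, det = len(M), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det                       # a row swap flips the sign
        det *= M[col][col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n):
                M[r][k] -= factor * M[col][k]
    return det

def tree_sum_determinant(W):
    """Sum over spanning trees rooted at node 0 of the product of edge weights."""
    n = len(W)
    # a_jj = sum_k w_kj on the diagonal, a_ij = -w_ij off the diagonal (Theorem 1).
    A = [[sum(W[k][j] for k in range(n)) if i == j else -W[i][j]
          for j in range(n)] for i in range(n)]
    minor = [row[1:] for row in A[1:]]       # drop the root's row and column
    return determinant(minor)

def tree_sum_diagonal(W):
    """Shortcut valid when W is strictly upper triangular (time-ordered DAG)."""
    n = len(W)
    return math.prod(sum(W[k][j] for k in range(n)) for j in range(1, n))

# Hypothetical 4-node cascade with exponential weights; nodes sorted by hit time.
e1, e2 = math.exp(-1.0), math.exp(-2.0)
W = [[0.0, e1, e2, 0.0],
     [0.0, 0.0, e1, e2],
     [0.0, 0.0, 0.0, e1],
     [0.0, 0.0, 0.0, 0.0]]
via_det, via_diag = tree_sum_determinant(W), tree_sum_diagonal(W)
```

Both routes return the same value here, and the diagonal product needs only one pass over the weight matrix, which is where the quadratic cost per cascade comes from.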
We leave the question of efficient maximization of Eq. 4, where $P(c \mid G)$ considers all possible propagation trees, as an interesting open problem.

3. ALTERNATIVE FORMULATION AND THE NETINF ALGORITHM

The diffusion network inference problem defined in the previous section does not seem to allow for an efficient solution. We now propose an alternative formulation of the problem that is tractable both to compute and to optimize.

3.1 An alternative formulation

We use the same tree cascade formation model as in the previous section. However, we compute an approximation of the likelihood of a single cascade by considering only the most likely tree instead of all possible propagation trees. We show that this approximate likelihood is tractable to compute. Moreover, we also devise an algorithm that provably finds networks with near-optimal approximate likelihood. In the remainder of this section, we informally write likelihood and log-likelihood even though they are approximations. However, all approximations are clearly indicated.

First we introduce the concept of ε-edges to account for the fact that nodes may get infected for reasons other than the network influence. For example, in online media, not all the information propagates via the network, as some is also pushed onto the network by the mass media [Katz and Lazarsfeld 1955; Watts and Dodds 2007], and thus a disconnected cascade can be created. Similarly, in viral marketing, a person may purchase a product due to the influence of peers (i.e., network effect) or for some other reason (e.g., seeing a commercial on TV) [Leskovec et al. 2006].

Modeling external influence via ε-edges.
To account for such phenomena, where a cascade "jumps" across the network, we can think of creating an additional node $x$ that represents an external influence and can infect any other node $u$ with small probability. We then connect the external influence node $x$ to every other node $u$ with an ε-edge, so that every node $u$ can get infected by the external source $x$ with a very small probability $\varepsilon$. For example, in the case of information diffusion in the blogosphere, such a node $x$ could model the effect of blogs getting infected by the mainstream media.

However, if we were to adopt this approach and insert an additional external influence node $x$ into our data, we would also need to infer the edges pointing out of $x$, which would make our problem even harder. Thus, in order to capture the effect of external influence, we introduce a concept of ε-edge. If there is no network edge between a node $i$ and a node $j$ in the network, we add an ε-edge, and then node $i$ can infect node $j$ with a small probability $\varepsilon$. Even though adding ε-edges makes our graph $G$ a clique (i.e., the union of network edges and ε-edges creates a clique), the ε-edges play the role of the external influence node.

Thus, we now think of graph $G$ as a fully connected graph with two disjoint sets of edges, the network edge set $E$ and the ε-edge set $E_\varepsilon$, i.e., $E \cap E_\varepsilon = \emptyset$ and $E \cup E_\varepsilon = V \times V$.

Now, any cascade propagation tree $T$ is a combination of network edges and ε-edges. As we model the external influence via the ε-edges, the probability of a cascade $c$ occurring in tree $T$ (i.e., the analog of Eq. 2) can now be computed as:

\[ P(c \mid T) = \prod_{u \in V_T} \prod_{v \in V} P'_c(u, v), \tag{9} \]

where we compute the transmission probability $P'_c(u, v)$ as follows:
\[
P'_c(u, v) =
\begin{cases}
\beta P_c(t_v - t_u) & \text{if } t_u < t_v \text{ and } (u, v) \in E_T \cap E \quad \text{($(u, v)$ is a network edge)} \\
\varepsilon P_c(t_v - t_u) & \text{if } t_u < t_v \text{ and } (u, v) \in E_T \cap E_\varepsilon \quad \text{($(u, v)$ is an $\varepsilon$-edge)} \\
1 - \beta & \text{if } t_v = \infty \text{ and } (u, v) \in E \setminus E_T \quad \text{($v$ is not infected, network edge)} \\
1 - \varepsilon & \text{if } t_v = \infty \text{ and } (u, v) \in E_\varepsilon \setminus E_T \quad \text{($v$ is not infected, $\varepsilon$-edge)} \\
0 & \text{else (i.e., } t_u \ge t_v\text{).}
\end{cases}
\]

Fig. 4. (a) Graph $G$ on five vertices and four network edges: network edges $E$ are shown as solid lines, and ε-edges are shown as dashed lines. (b) Cascade propagation tree $T = \{(a, b), (b, c), (b, d)\}$. Four types of edges are labeled: (i) network edges that transmitted the contagion (solid bold), (ii) ε-edges that transmitted the contagion (dashed bold), (iii) network edges that failed to transmit the contagion (solid), and (iv) ε-edges that failed to transmit the contagion (dashed).

Note that above we distinguish four types of edges: network edges and ε-edges that participated in the diffusion of the contagion, and network edges and ε-edges that did not participate in the diffusion.

Figure 4 further illustrates this concept. First, Figure 4(a) shows an example of a graph $G$ on five nodes and four network edges $E$ (solid lines); every other possible edge is an ε-edge (dashed lines). Then, Figure 4(b) shows an example of a propagation tree $T = \{(a, b), (b, c), (b, d)\}$ in graph $G$. We only show the edges that play a role in Eq. 9 and label them with four different types: (a) network edges that transmitted the contagion, (b) ε-edges that transmitted the contagion, (c) network edges that failed to transmit the contagion, and (d) ε-edges that failed to transmit the contagion.
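The case analysis for $P'_c(u, v)$ can be written as a small function. This is a minimal sketch, not the authors' implementation: it assumes an unnormalized exponential incubation model $P_c(\Delta) \propto e^{-\Delta/\alpha}$, and the names `incubation`, `transmission_prob`, the boolean flags and the default parameter values are all hypothetical.

```python
import math


def incubation(dt, alpha=1.0):
    """Assumed exponential incubation model, proportional to exp(-dt / alpha)."""
    return math.exp(-dt / alpha)


def transmission_prob(t_u, t_v, is_network_edge, in_tree,
                      beta=0.5, eps=1e-3, alpha=1.0):
    """Piecewise transmission probability P'_c(u, v), following the cases above.

    t_u, t_v        -- infection times (math.inf means "never infected")
    is_network_edge -- True if (u, v) is in E, False if it is an epsilon-edge
    in_tree         -- True if (u, v) belongs to the propagation tree E_T
    """
    if in_tree and t_u < t_v < math.inf:
        # Edge transmitted the contagion: beta or eps times the incubation term.
        scale = beta if is_network_edge else eps
        return scale * incubation(t_v - t_u, alpha)
    if not in_tree and t_v == math.inf:
        # Edge failed to transmit because v never got infected.
        return (1.0 - beta) if is_network_edge else (1.0 - eps)
    # All remaining cases, as in the final "else" branch of the definition.
    return 0.0
```

For instance, a network edge in the tree with $t_v - t_u = 1$ contributes $\beta e^{-1}$ under these assumptions, while an ε-edge pointing at a never-infected node contributes $1 - \varepsilon$.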
We can now rewrite the cascade likelihood $P(c \mid T)$ as a combination of products over edge types and the product over the edge incubation times:

\[ P(c \mid T) = \beta^{q} \varepsilon^{q'} (1 - \beta)^{s} (1 - \varepsilon)^{s'} \prod_{(u,v) \in E_T} P_c(v, u) \tag{10} \]

\[ \approx \beta^{q} \varepsilon^{q'} (1 - \varepsilon)^{s + s'} \prod_{(u,v) \in E_T} P_c(v, u), \tag{11} \]

where $q$ is the number of network edges in $T$ (type (a) edges in Fig. 4(b)), $q'$ is the number of ε-edges in $T$, $s$ is the number of network edges that did not transmit and $s'$ is the number of ε-edges that did not transmit. Note that the above approximation is valid since real networks are sparse and cascades are generally small, and hence $s' \gg s$. Thus, even though $\beta \gg \varepsilon$, we expect $(1 - \beta)^{s}$ to be of about the same order of magnitude as $(1 - \varepsilon)^{s'}$.

The formulation in Equation 11 has several benefits. Due to the introduction of ε-edges, the likelihood $P(c \mid T)$ is always positive. For example, even if we consider a graph $G$ with no edges, $P(c \mid T)$ is still well defined, as we can explain the tree $T$ via the diffusion over the ε-edges. A second benefit that will become very useful later is that the likelihood now becomes monotonic in the network edges of $G$. This means that adding an edge to $G$ (i.e., converting an ε-edge into a network edge) only increases the likelihood.

Considering only the most likely propagation tree. So far we introduced the concept of ε-edges to model the external influence or diffusion that is exogenous to the network, and introduced an approximation to treat all edges that did not participate in the diffusion as ε-edges.
Now we consider the last approximation, where instead of considering all possible cascade propagation trees $T$, we only consider the most likely cascade propagation tree $T$:

\[ P(C \mid G) = \prod_{c \in C} \sum_{T \in \mathcal{T}_c(G)} P(c \mid T) \approx \prod_{c \in C} \max_{T \in \mathcal{T}_c(G)} P(c \mid T). \tag{12} \]

Thus we now aim to solve the network inference problem by finding a graph $G$ that maximizes Equation 12, where $P(c \mid T)$ is defined in Equation 11.

This formulation simplifies the original network inference problem by considering the most likely (best) propagation tree $T$ per cascade $c$ instead of considering all possible propagation trees $T$ for each cascade $c$. Although in some cases we expect the likelihood of $c$ with respect to the true tree $T'$ to be much higher than that with respect to any competing tree $T''$, so that the probability mass is concentrated at $T'$, there might be some cases in which the probability mass does not concentrate on one particular $T$. However, we ran extensive experiments on small networks with different structures in which both the original network inference problem and the alternative formulation can be solved using exhaustive search. The results of the two formulations were indistinguishable. Therefore, we consider our approximation to work well in practice.

For convenience, we work with the log-likelihood $\log P(c \mid T)$ rather than the likelihood $P(c \mid T)$. Moreover, instead of directly maximizing the log-likelihood, we equivalently maximize the following objective function, which defines the improvement in log-likelihood for cascade $c$ occurring in graph $G$ over $c$ occurring in an empty graph $\bar{K}$ (i.e., a graph with only ε-edges and no network edges):

\[ F_c(G) = \max_{T \in \mathcal{T}_c(G)} \log P(c \mid T) - \max_{T \in \mathcal{T}_c(\bar{K})} \log P(c \mid T). \tag{13} \]

Maximizing Equation (12) is equivalent to maximizing the following log-likelihood function:

\[ F_C(G) = \sum_{c \in C} F_c(G). \tag{14} \]

We now expand Eq. (13) and obtain an instance of a simplified diffusion network inference problem:

\[ \hat{G} = \operatorname*{argmax}_{G} F_C(G), \qquad F_C(G) = \sum_{c \in C} \max_{T \in \mathcal{T}_c(G)} \sum_{(i,j) \in E_T} w_c(i, j), \tag{15} \]

where $w_c(i, j) = \log P'_c(i, j) - \log \varepsilon$ is a non-negative weight which can be interpreted as the improvement in log-likelihood of edge $(i, j)$ under the most likely propagation tree $T$ in $G$. Note that by the approximation in Equation 11 one can ignore the contribution of edges that did not participate in a particular cascade $c$. The contribution of these edges is constant, i.e., independent of the particular shape that the propagation tree $T$ takes. This is due to the fact that each spanning tree $T$ of $G$ with node set $V_T$ has $|V_T| - 1$ (network and ε-) edges that participated in the cascade, and all remaining edges stopped the cascade from spreading. The number of non-spreading edges depends only on the node set $V_T$ but not on the edge set $E_T$. Thus, the tree $T$ that maximizes $P(c \mid T)$ also maximizes $\sum_{(i,j) \in E_T} w_c(i, j)$.

Since $T$ is a tree that maximizes the sum of the edge weights, the most likely propagation tree $T$ is simply the maximum weight directed spanning tree of nodes $V_T$, where each edge $(i, j)$ has weight $w_c(i, j)$, and $F_c(G)$ is simply the sum of the weights of edges in $T$.

We also observe that since edges $(i, j)$ where $t_i \ge t_j$ have weight 0 (i.e., such edges are not present), the outgoing edges of any node $u$ only point forward in time, i.e., a node cannot infect already infected nodes.
Thus for a fixed cascade $c$, the collection of edges with positive weight forms a directed acyclic graph (DAG).

Now we use the fact that the collection of edges with positive weights forms a directed acyclic graph by observing that the maximum weight directed spanning tree of a DAG can be computed efficiently:

PROPOSITION 1. In a DAG $D(V, E, w)$ with vertex set $V$ and nonnegative edge weights $w$, the maximum weight directed spanning tree can be found by choosing, for each node $v$, an incoming edge $(u, v)$ with maximum weight $w(u, v)$.

PROOF. The score

\[ S(T) = \sum_{(i,j) \in T} w(i, j) = \sum_{i \in V} w(\mathrm{Par}_T(i), i) \]

of a tree $T$ is the sum of the incoming edge weights $w(\mathrm{Par}_T(i), i)$ for each node $i$, where $\mathrm{Par}_T(i)$ is the parent of node $i$ in $T$ (and the root is handled appropriately). Now,

\[ \max_T S(T) = \max_T \sum_{(i,j) \in T} w(i, j) = \sum_{i \in V} \max_{\mathrm{Par}_T(i)} w(\mathrm{Par}_T(i), i). \]

The latter equality follows from the fact that since $G$ is a DAG, the maximization can be done independently for each node without creating any cycles.

This proposition is a special case of the more general maximum spanning tree (MST) problem in directed graphs [Edmonds 1967]. The important fact now is that we can find the best propagation tree $T$ in time $O(|V_T| D_{in})$, i.e., linear in the number of edges and the maximum in-degree $D_{in} = \max_{u \in V_T} d_{in}(u)$, by simply selecting an incoming edge of highest weight for each node $u \in V_T$. Algorithm 1 provides the pseudocode to efficiently compute the maximum weight directed spanning tree of a DAG.
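The per-node selection in Proposition 1 can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `dag_max_spanning_tree`, the dictionary-based edge representation and the toy weights are assumptions made for this example and do not come from the paper.

```python
def dag_max_spanning_tree(weights):
    """Maximum weight directed spanning tree (forest) of a weighted DAG.

    weights maps each edge (u, v) to a nonnegative weight. Because the
    graph is acyclic, every node can independently keep its heaviest
    incoming edge without creating a cycle. Nodes with no incoming edge
    become roots.
    """
    best = {}  # node v -> (parent u, weight of (u, v))
    for (u, v), w in weights.items():
        if v not in best or w > best[v][1]:
            best[v] = (u, w)
    tree = {(u, v) for v, (u, _) in best.items()}
    score = sum(w for _, w in best.values())
    return tree, score


# Hypothetical cascade over nodes a, b, c, d, already restricted to
# edges that point forward in time.
toy_weights = {
    ("a", "b"): 1.0,
    ("a", "c"): 0.2,
    ("b", "c"): 0.7,
    ("b", "d"): 0.4,
    ("c", "d"): 0.9,
}
tree, score = dag_max_spanning_tree(toy_weights)
```

Each edge is inspected once, so the work is proportional to the number of edges in the cascade's DAG; the returned `score` corresponds to the sum of edge weights that defines $F_c(G)$ when the weights are the $w_c(i, j)$.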
Algorithm 1 Maximum weight directed spanning tree of a DAG
Require: Weighted directed acyclic graph $D(V, E, w)$
  $T \leftarrow \{\}$
  for all $i \in V$ do
    $\mathrm{Par}_T(i) \leftarrow \arg\max_j w(j, i)$
    $T \leftarrow T \cup \{(\mathrm{Par}_T(i), i)\}$
  return $T$

Putting it all together, we have shown how to efficiently evaluate the log-likelihood $F_C(G)$ of a graph $G$. Finding the most likely tree $T$ for a single cascade takes $O(|V_T| D_{in})$, and this has to be done for a total of $|C|$ cascades. Interestingly, this is independent of the size of graph $G$ and only depends on the amount of observed data (i.e., the size and the number of cascades).

3.2 The NETINF algorithm for efficient maximization of $F_C(G)$

Now we aim to find the graph $G$ that maximizes the log-likelihood $F_C(G)$. First we notice that by construction $F_C(\bar{K}) = 0$, i.e., the empty graph has score 0. Moreover, we observe that the objective function $F_C$ is non-negative and monotonic. This means that $F_C(G) \le F_C(G')$ for graphs $G(V, E)$ and $G'(V, E')$, where $E \subseteq E'$. Hence adding more edges to $G$ does not decrease the solution quality, and thus the complete graph maximizes $F_C$. Monotonicity can be shown by observing that, as edges are added to $G$, ε-edges are converted to network edges, and therefore the weight of any tree (and therefore the value of the maximum spanning tree) can only increase. However, since real-world social and information networks are usually sparse, we are interested in inferring a sparse graph $G$ that only contains some small number $k$ of edges. Thus we aim to solve:

PROBLEM 2.
Given the infection times of a set of cascades $C$, the probability of propagation $\beta$ and the incubation time distribution $P_c(i, j)$, find the network $\hat{G}$ such that:

\[ \hat{G} = \operatorname*{argmax}_{|G| \le k} F_C(G), \tag{16} \]

where the maximization is over all graphs $G$ of at most $k$ edges, and $F_C(G)$ is defined by Eqs. 14 and 15.

Naively searching over all $k$-edge graphs would take time exponential in $k$, which is intractable. Moreover, finding the optimal solution to Eq. (16) is NP-hard, so we cannot expect to find the optimal solution:

THEOREM 2. The network inference problem defined by Equation (16) is NP-hard.

PROOF. By reduction from the MAX-$k$-COVER problem [Khuller et al. 1999]. In MAX-$k$-COVER, we are given a finite set $W$, $|W| = n$, and a collection of subsets $S_1, \ldots, S_m \subseteq W$. The function

\[ F_{MC}(A) = \Big| \bigcup_{i \in A} S_i \Big| \]

counts the number of elements of $W$ covered by sets indexed by $A$. Our goal is to pick a collection of $k$ subsets $A$ maximizing $F_{MC}$. We will produce a collection of $n$ cascades $C$ over a graph $G$ such that $\max_{|G| \le k} F_C(G) = \max_{|A| \le k} F_{MC}(A)$. Graph $G$ will be defined over the set of vertices $V = \{1, \ldots, m\} \cup \{r\}$, i.e., there is one vertex for each set $S_i$ and one extra vertex $r$. For each element $s \in W$ we define a cascade which has time stamp 0 associated with all nodes $i \in V$ such that $s \in S_i$, time stamp 1 for node $r$ and $\infty$ for all remaining nodes.

Furthermore, we can choose the transmission model such that $w_c(i, r) = 1$ whenever $s \in S_i$ and $w_c(i', j') = 0$ for all remaining edges $(i', j')$, by choosing the parameters $\varepsilon$, $\alpha$ and $\beta$ appropriately.
Since a directed spanning tree over a graph $G$ can contain at most one edge incoming to node $r$, its weight will be 1 if $G$ contains any edge from a node $i$ to $r$ where $s \in S_i$, and 0 otherwise. Thus, a graph $G$ of at most $k$ edges corresponds to a feasible solution $A_G$ to MAX-$k$-COVER where we pick sets $S_i$ whenever edge $(i, r) \in G$, and each solution $A$ to MAX-$k$-COVER corresponds to a feasible solution $G_A$ of (16). Furthermore, by construction, $F_{MC}(A_G) = F_C(G)$. Thus, if we had an efficient algorithm for deciding whether there exists a graph $G$, $|G| \le k$, such that $F_C(G) > c$, we could use the algorithm to decide whether there exists a solution $A$ to MAX-$k$-COVER with value at least $c$.

While finding the optimal solution is hard, we now show that $F_C$ satisfies submodularity, a natural diminishing returns property. The submodularity property allows us to efficiently find a provably near-optimal solution to this otherwise NP-hard optimization problem.

A set function $F : 2^W \to \mathbb{R}$ that maps subsets of a finite set $W$ to the real numbers is submodular if for $A \subseteq B \subseteq W$ and $s \in W \setminus B$, it holds that $F(A \cup \{s\}) - F(A) \ge F(B \cup \{s\}) - F(B)$. This simply says that adding $s$ to the set $A$ increases the score at least as much as adding $s$ to the set $B$ ($A \subseteq B$).

Now we are ready to show the following result that enables us to find a near-optimal network $G$:

THEOREM 3. Let $V$ be a set of nodes, and $C$ be a collection of cascades hitting the nodes $V$. Then $F_C(G)$ is a submodular function $F_C : 2^W \to \mathbb{R}$ defined over subsets $W \subseteq V \times V$ of directed edges.

PROOF. Fix a cascade $c$, graphs $G \subseteq G'$ and an edge $e = (r, s)$ not contained in $G'$. We will show that $F_c(G \cup \{e\}) - F_c(G) \ge F_c(G' \cup \{e\}) - F_c(G')$. Since nonnegative linear combinations of submodular functions are submodular, the function $F_C(G) = \sum_{c \in C} F_c(G)$ is submodular as well.
Let $w_{i,j}$ be the weight of edge $(i, j)$ in $G \cup \{e\}$, and $w'_{i,j}$ be the weight in $G' \cup \{e\}$. As argued before, the maximum weight directed spanning tree for DAGs is obtained by assigning to each node the incoming edge with maximum weight. Let $(i, s)$ be the edge incoming at $s$ of maximum weight in $G$, and $(i', s)$ the maximum weight incoming edge in $G'$. Since $G \subseteq G'$ it holds that $w_{i,s} \le w'_{i',s}$. Furthermore, $w_{r,s} = w'_{r,s}$. Hence,

\[
\begin{aligned}
F_c(G \cup \{(r, s)\}) - F_c(G) &= \max(w_{i,s}, w_{r,s}) - w_{i,s} \\
&\ge \max(w'_{i',s}, w'_{r,s}) - w'_{i',s} \\
&= F_c(G' \cup \{(r, s)\}) - F_c(G'),
\end{aligned}
\]

proving submodularity of $F_c$.

Maximizing submodular functions in general is NP-hard [Khuller et al. 1999]. A commonly used heuristic is the greedy algorithm, which starts with an empty graph $\bar{K}$ and iteratively, in step $i$, adds the edge $e_i$ which maximizes the marginal gain:

\[ e_i = \operatorname*{argmax}_{e \in G \setminus G_{i-1}} F_C(G_{i-1} \cup \{e\}) - F_C(G_{i-1}). \tag{17} \]

The algorithm stops once it has selected $k$ edges, and returns the solution $\hat{G} = \{e_1, \ldots, e_k\}$. The stopping criterion, i.e., the value of $k$, can be based on some threshold of the marginal gain, on the number of estimated edges, or on another more sophisticated heuristic.

In our context we can think about the greedy algorithm as starting on an empty graph $\bar{K}$ with no network edges. In each iteration $i$, the algorithm adds to $G$ the edge $e_i$ that currently most improves the value of the log-likelihood. Another way to view the greedy algorithm is that it starts on a fully connected graph $\bar{K}$ where all the edges are ε-edges. Then adding an edge to graph $G$ corresponds to that edge changing its type from ε-edge to network edge.
Thus our algorithm iteratively swaps ε-edges to network edges until $k$ network edges have been swapped (i.e., inserted into the network $G$).

Guarantees on the solution quality. Considering the NP-hardness of the problem, we might expect the greedy algorithm to perform arbitrarily badly. However, we will see that this is not the case. A fundamental result of Nemhauser et al. [Nemhauser et al. 1978] proves that for monotonic submodular functions, the set $\hat{G}$ returned by the greedy algorithm obtains at least a constant fraction of $(1 - 1/e) \approx 63\%$ of the optimal value achievable using $k$ edges.

Moreover, we can acquire a tight online data-dependent bound on the solution quality:

THEOREM 4 [LESKOVEC ET AL. 2007]. For a graph $\hat{G}$, and each edge $e \notin \hat{G}$, let $\delta_e = F_C(\hat{G} \cup \{e\}) - F_C(\hat{G})$. Let $e_1, \ldots, e_B$ be the sequence with $\delta_e$ in decreasing order, where $B$ is the total number of edges with marginal gain greater than 0. Then,

\[ \max_{|G| \le k} F_C(G) \le F_C(\hat{G}) + \sum_{i=1}^{k} \delta_{e_i}. \]

Theorem 4 computes how far a given $\hat{G}$ (obtained by any algorithm) is from the unknown, NP-hard to find optimum.

Speeding up the NETINF algorithm. To make the algorithm scale to networks with thousands of nodes, we speed up the algorithm by several orders of magnitude by considering the following two improvements:

Localized update: Let $C_i$ be the subset of cascades that go through the node $i$ (i.e., cascades in which node $i$ is infected). Then consider that in some step $n$ the greedy algorithm selects the network edge $(j, i)$ with marginal gain $\delta_{j,i}$, and now we have to update the optimal tree of each cascade. We make a simple observation that adding the network edge $(j, i)$ may only change the optimal trees of the cascades in the set $C_i$, and thus we only need to revisit (and potentially update) the trees of cascades in $C_i$.
Since cascades are local (i.e., each cascade hits only a relatively small subset of the network), this localized updating procedure speeds up the algorithm considerably.

Lazy evaluation: It can be used to drastically reduce the number of evaluations of marginal gains $F_C(G \cup \{e\}) - F_C(G)$ [Leskovec et al. 2007]. This procedure relies on the submodularity of $F_C$. The key idea behind lazy evaluations is the following. Suppose $G_1, \ldots, G_k$ is the sequence of graphs produced during iterations of the greedy algorithm. Now let us consider the marginal gain

\[ \Delta_e(G_i) = F_C(G_i \cup \{e\}) - F_C(G_i) \]

of adding some edge $e$ to any of these graphs. Due to the submodularity of $F_C$ it holds that $\Delta_e(G_i) \ge \Delta_e(G_j)$ whenever $i \le j$. Thus, the marginal gains of $e$ can only monotonically decrease over the course of the greedy algorithm. This means that elements which achieve very little marginal gain at iteration $i$ cannot suddenly produce large marginal gain at subsequent iterations. This insight can be exploited by maintaining a priority queue data structure over the edges and their respective marginal gains. At each iteration, the greedy algorithm retrieves the highest weight (priority) edge. Since its value may have decreased from previous iterations, it recomputes its marginal benefit. If the marginal gain remains the same after recomputation, it has to be the edge with highest marginal gain, and the greedy algorithm will pick it. If it decreases, one reinserts the edge with its new weight into the priority queue and continues. Formal details and pseudocode can be found in [Leskovec et al. 2007].

As we will show later, these two improvements decrease the run time by several orders of magnitude with no loss in the solution quality.
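The lazy evaluation idea can be illustrated with a generic lazy greedy loop over a priority queue. The sketch below is not the NETINF implementation: it uses a toy set-cover gain function in place of $F_C$, tracks freshness with an iteration stamp instead of comparing the recomputed gain with the stored one, and the names `lazy_greedy`, `coverage_gain` and `sets` are hypothetical.

```python
import heapq


def lazy_greedy(candidates, gain, k):
    """Pick up to k elements greedily, re-evaluating marginal gains lazily.

    gain(selected, e) must return the marginal gain of adding e to the
    list `selected`, and is assumed to come from a submodular function,
    so stale gains stored in the queue are upper bounds on current gains.
    """
    selected = []
    # heapq is a min-heap, so gains are negated. The index i breaks ties
    # and the last field records at which iteration the gain was computed.
    heap = [(-gain(selected, e), i, e, 0) for i, e in enumerate(candidates)]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, i, e, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # Gain is up to date and tops every (upper-bound) entry: take it.
            selected.append(e)
        else:
            # Stale entry: recompute its gain and put it back in the queue.
            heapq.heappush(heap, (-gain(selected, e), i, e, len(selected)))
    return selected


# Toy submodular objective: number of elements covered by the chosen sets.
sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}, "D": {1}}


def coverage_gain(selected, e):
    covered = set().union(*(sets[s] for s in selected))
    return len(sets[e] - covered)


picked = lazy_greedy(list(sets), coverage_gain, k=2)
```

In this toy run the second iteration recomputes the gain of a single candidate before selecting it, whereas a plain greedy loop would re-evaluate all three remaining candidates.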
We call the algorithm that implements the greedy algorithm on this alternative formulation with the above speedups the NETINF algorithm (Algorithm 2). In addition, NETINF lends itself nicely to parallelization, as likelihoods of individual cascades and likelihood improvements of individual new edges can simply be computed independently. This allows us to tackle even bigger networks in shorter amounts of time.

A space and runtime complexity analysis of NETINF depends heavily on the structure of the network, and therefore it is necessary to make strong assumptions on the structure. Due to this, it is out of the scope of the paper to include a formal complexity analysis. Instead, we include an empirical runtime analysis in the following section.

4. EXPERIMENTAL EVALUATION

In this section we proceed with the experimental evaluation of our proposed NETINF algorithm for inferring networks of diffusion. We analyze the performance of NETINF on synthetic and real networks. We show that our algorithm performs surprisingly well, outperforms a heuristic baseline and correctly discovers more than 90% of the edges of a typical diffusion network.

4.1 Experiments on synthetic data

The goal of the experiments on synthetic data is to understand how the underlying network structure and the propagation model (exponential and power-law) affect the performance of our algorithm. The second goal is to evaluate the effect of the simplifications we had to make in order to arrive at an efficient network inference algorithm. Namely, we assume the contagion propagates in a tree pattern $T$ (i.e., exactly $E_T$ edges caused the propagation), consider only the most likely tree $T$ (Eq. 12), and treat non-propagating network edges as ε-edges (Eq. 11).
In general, in all our experiments we proceed as follows: We are given a true diffusion network $G^*$, and then we simulate the propagation of a set of contagions $c$ over the network $G^*$. Diffusion of each contagion creates a cascade, and for each cascade we record the node hit times $t_u$. Then, given these node hit times, we aim to recover the network $G^*$ using the NETINF algorithm.

Algorithm 2 The NETINF Algorithm
Require: Cascades and hit times $C = \{(c, \mathbf{t}_c)\}$, number of edges $k$
  $G \leftarrow \bar{K}$
  for all $c \in C$ do
    $T_c \leftarrow \mathrm{dag\_tree}(c)$ {Find most likely tree (Algorithm 1)}
  while $|G| < k$ do
    for all $(j, i) \notin G$ do
      $\delta_{j,i} = 0$ {Marginal improvement of adding edge $(j, i)$ to $G$}
      $M_{j,i} \leftarrow \emptyset$
      for all $c : t_j < t_i$ in $c$ do
        Let $w_c(m, n)$ be the weight of $(m, n)$ in $G \cup \{(j, i)\}$
        if $w_c(j, i) \ge w_c(\mathrm{Par}_{T_c}(i), i)$ then
          $\delta_{j,i} = \delta_{j,i} + w_c(j, i) - w_c(\mathrm{Par}_{T_c}(i), i)$
          $M_{j,i} \leftarrow M_{j,i} \cup \{c\}$
    $(j^*, i^*) \leftarrow \arg\max_{(j,i) \notin G} \delta_{j,i}$
    $G \leftarrow G \cup \{(j^*, i^*)\}$
    for all $c \in M_{j^*,i^*}$ do
      $\mathrm{Par}_{T_c}(i^*) \leftarrow j^*$
  return $G$

Fig. 5. Number of cascades per edge and cascade sizes for a Forest Fire network (1,024 nodes, 1,477 edges) with forward burning probability 0.20, backward burning probability 0.17 and exponential incubation time model with parameter $\alpha = 1$ and propagation probability $\beta = 0.5$. Panels: (a) FF: Cascades per edge (axes: number of cascades per edge, number of edges); (b) FF: Cascade size (axes: number of edges per cascade, number of cascades; fitted curve labeled MLE $= x^{-2.564}$). The cascade size distribution follows a power law. We found the power-law coefficient using maximum likelihood estimation (MLE).

For example, Figure 1(a) shows a graph $G^*$ of 20 nodes and 23 directed edges.
Using the exponential incubation time model and $\beta = 0.2$ we generated 24 cascades. Now given the node infection times, we aim to recover $G^*$. A baseline method (b) (described below) performed poorly while NETINF (c) recovered $G^*$ almost perfectly by making only two errors (red edges).

Experimental setup. Our experimental methodology is composed of the following steps:

(1) Ground truth graph $G^*$.
(2) Cascade generation: probability of propagation $\beta$, and the incubation time model with parameter $\alpha$.
(3) Number of cascades.

(1) Ground truth graph $G^*$: We consider two models of directed real-world networks to generate $G^*$, namely, the Forest Fire model [Leskovec et al. 2005] and the Kronecker Graphs model [Leskovec and Faloutsos 2007]. For Kronecker graphs, we consider three sets of parameters that produce networks with a very different global network structure: a random graph [Erdős and Rényi 1960] (Kronecker parameter matrix $[0.5, 0.5; 0.5, 0.5]$), a core-periphery network [Leskovec et al. 2008] ($[0.962, 0.535; 0.535, 0.107]$) and a network with hierarchical community structure [Clauset et al. 2008] ($[0.962, 0.107; 0.107, 0.962]$). The Forest Fire model generates networks with power-law degree distributions that follow the densification power law [Barabási and Albert 1999; Leskovec et al. 2007].

(2) Cascade propagation: We then simulate cascades on $G^*$ using the generative model defined in Section 2.1. For the simulation we need to choose the incubation time model (i.e., power-law or exponential, and parameter $\alpha$). We also need to fix the parameter $\beta$, which controls the probability of a cascade propagating over an edge. Intuitively, $\alpha$ controls how fast the cascade spreads (i.e., how long the incubation times are), while $\beta$ controls the size of the cascades. Large $\beta$ means cascades will likely be large, while small $\beta$ makes most of the edges fail to transmit the contagion, which results in small infections.

(3) Number of cascades: Intuitively, the more data our algorithm gets, the more accurately it should infer $G^*$. To quantify the amount of data (number of different cascades) we define $E_l$ to be the set of edges that participate in at least $l$ cascades. This means $E_l$ is a set of edges that transmitted at least $l$ contagions. It is important to note that if an edge of $G^*$ did not participate in any cascade (i.e., it never transmitted a contagion) then there is no trace of it in our data and thus we have no chance to infer it. In our experiments we choose the minimal amount of data (i.e., $l = 1$) so that we at least in principle could infer the true network $G^*$. Thus, we generate as many cascades as needed to have a set $E_l$ that contains a fraction $f$ of all the edges of the true network $G^*$. In all our experiments we pick cascade starting nodes uniformly at random and generate enough cascades so that 99% of the edges in $G^*$ participate in at least one cascade, i.e., 99% of the edges are included in $E_1$.

Table II shows experimental values of the number of cascades that let $E_1$ cover different percentages of the edges. To have a closer look at the cascade size distribution, for a Forest Fire network on 1,024 nodes and 1,477 edges, we generated 4,038 cascades. The majority of edges took part in 4 to 12 cascades and the cascade size distribution follows a power law (Figure 5(b)). The average and median number of cascades per edge are 9.1 and 8, respectively (Figure 5(a)).

Baseline method. To infer a diffusion network $\hat{G}$, we consider a simple baseline heuristic where we compute the score of each edge and then pick the $k$ edges with highest score.
More precisely, for each possible edge (u, v) of G, we compute $w(u, v) = \sum_{c \in C} P_c(u, v)$, i.e., overall how likely the cascades c ∈ C were to propagate over the edge (u, v). Then we simply pick the k edges (u, v) with the highest score w(u, v) to obtain Ĝ. For example, Figure 1(b) shows the results of the baseline method on a small graph.

Table II. Performance on synthetic data. Break-even point (BEP) and area under the Receiver Operating Characteristic curve (AUC) when we generated the minimum number |C| of cascades such that an f-fraction of edges participated in at least one cascade, |E_l| ≥ f|E|. These |C| cascades generated a total of r edge transmissions, i.e., the average cascade size is r/|C|. All networks have 1,024 nodes and 1,446 edges. We use the exponential incubation time model with parameter α = 1, and in each case we set the probability β such that r/|C| is neither too small nor too large (i.e., β ∈ (0.1, 0.6)).

Type of network            f      |C|     r        BEP     AUC
Forest Fire                0.5    388     2,898    0.393   0.29
                           0.9    2,017   14,027   0.75    0.67
                           0.95   2,717   19,418   0.82    0.74
                           0.99   4,038   28,663   0.92    0.86
Hierarchical Kronecker     0.5    289     1,341    0.37    0.30
                           0.9    1,209   5,502    0.81    0.80
                           0.95   1,972   9,391    0.90    0.90
                           0.99   5,078   25,643   0.98    0.98
Core-periphery Kronecker   0.5    140     1,392    0.31    0.23
                           0.9    884     9,498    0.84    0.80
                           0.95   1,506   14,125   0.93    0.91
                           0.99   3,110   30,453   0.98    0.96
Flat Kronecker             0.5    200     1,324    0.34    0.26
                           0.9    1,303   7,707    0.84    0.83
                           0.95   1,704   9,749    0.89    0.88
                           0.99   3,652   21,153   0.97    0.97

Solution quality. We evaluate the performance of the NetInf algorithm in two different ways. First, we are interested in how successful NetInf is at optimizing the objective function F_C(G), which is NP-hard to optimize exactly. Using the online bound in Theorem 4,
we can assess at most how far the NetInf solution is from the unknown optimum in terms of the log-likelihood score. Second, we also evaluate NetInf based on accuracy, i.e., what fraction of the edges of G* NetInf managed to infer correctly.

Figure 6(a) plots the value of the log-likelihood improvement F_C(G) as a function of the number of edges in G. In red we plot the value achieved by NetInf and in green the upper bound from Theorem 4. The value of the unknown optimal solution (which is NP-hard to compute exactly) lies somewhere between the red and the green curve. Notice that the band between the two curves, the upper bound and the NetInf curve, is narrow. For example, at 2,000 edges in Ĝ, NetInf finds a solution that is at least 97% of the optimal graph. Moreover, notice the strong diminishing-returns effect: the value of the objective function flattens out after about 1,000 edges. This means that, in practice, very sparse solutions (almost tree-like diffusion graphs) already achieve values of the objective function close to the optimum.

Accuracy of NetInf. We also evaluate our approach by studying how many edges inferred by NetInf are actually present in the true network G*. We measure the precision and recall of our method. For every value of k (1 ≤ k ≤ n(n − 1)) we generate Ĝ_k on k edges by using NetInf or the baseline method. We then compute precision (the fraction of edges in Ĝ_k that are also present in G*) and recall (the fraction of edges of G* that appear in Ĝ_k). For small k, we expect low recall and high precision, as we select the few edges that we are most confident in. As k increases, precision will generally start to drop while recall increases.
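As a minimal sketch of this evaluation (not the authors' code), the snippet below sweeps k over a ranked edge list and records precision and recall against a known edge set. The function name `pr_curve`, and the assumption that the inference method returns distinct edges in the order it selects them, are ours.

```python
def pr_curve(ranked_edges, true_edges):
    """Return [(k, precision, recall)] for the top-k edges, k = 1..len(ranked_edges).

    ranked_edges: list of distinct (u, v) pairs, most confident first (assumed).
    true_edges:   iterable of (u, v) pairs forming the ground truth G*.
    """
    truth = set(true_edges)
    if not truth:
        raise ValueError("ground-truth edge set must be non-empty")
    curve = []
    correct = 0
    for k, edge in enumerate(ranked_edges, start=1):
        if edge in truth:
            correct += 1
        precision = correct / k        # fraction of inferred edges present in G*
        recall = correct / len(truth)  # fraction of G* edges that were inferred
        curve.append((k, precision, recall))
    return curve


if __name__ == "__main__":
    # Toy data, purely illustrative.
    ranked = [(1, 2), (2, 3), (3, 1), (1, 3)]
    truth = {(1, 2), (3, 1), (4, 1)}
    for k, p, r in pr_curve(ranked, truth):
        print(k, round(p, 3), round(r, 3))
```

Each inferred graph Ĝ_k is simply the first k entries of the ranking, so one pass over the list yields the whole curve.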
Figure 7 shows the precision-recall curves of NetInf and the baseline method on three different Kronecker graphs (random, hierarchical community structure and core-periphery structure) with 1,024 nodes and two incubation time models. The cascades were generated with an exponential incubation time model with α = 1, or a power-law incubation time model with α = 2, and a value of β low enough to avoid generating cascades that are too large (in all cases, we pick a value of β ∈ (0.1, 0.6)). For each network we generated between 2,000 and 4,000 cascades so that 99% of the edges of G* participated in at least one cascade. We chose cascade starting points uniformly at random.

Fig. 6. Score achieved by NetInf in comparison with the online upper bound from Theorem 4. In practice NetInf finds networks that are at 97% of the NP-hard-to-compute optimum. Panels: (a) Kronecker network; (b) real MemeTracker data. Axes: number of edges versus value of the objective function; curves: NetInf and upper bound (Th. 4).

Fig. 7. Precision and recall for three 1,024-node Kronecker networks and Forest Fire networks with exponential (Exp) and power-law (PL) incubation time models. The plots are generated by sweeping over values of k, which controls the sparsity of the solution. Panels: (a) Hier. Kronecker (Exp); (b) Core-Periph. Kronecker (Exp); (c) Flat Kronecker (Exp); (d) Hier. Kronecker (PL); (e) Core-Periph. Kronecker (PL); (f) Flat Kronecker (PL); (g) Forest Fire (PL, α = 1.1); (h) Forest Fire (PL, α = 3). Axes: recall versus precision; curves: NetInf and Baseline.

Fig. 8. Performance of NetInf as a function of the amount of cascade data. The units on the x-axis are normalized: x = 1 means that the total number of transmission events used for the experiment was equal to the number of edges in G*. On average NetInf requires about two propagation events per edge of the original network in order to reliably recover the true network structure. Axes: total number of transmissions versus break-even point; curves: β = 0.1, β = 0.2, β = 0.5.

First, we focus on Figures 7(a), 7(b) and 7(c), where we use the exponential incubation time model on different Kronecker graphs. Notice that the baseline method achieves a break-even point¹ between 0.4 and 0.5 on all three networks. NetInf, on the other hand, performs much better, with a break-even point of 0.99 on all three datasets. We view this as a particularly strong result, as we were especially careful not to generate too many cascades, since more cascades mean more evidence, which makes the problem easier. Thus, using a very small number of cascades, where every edge of G* participates in only a few cascades, we can almost perfectly recover the underlying diffusion network G*.
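A short worked observation may help when reading the break-even numbers. For a k-edge graph with c correct edges, precision is c/k and recall is c/|E*|, so the two are equal exactly when k = |E*| (assuming c > 0). Under that observation, the break-even point is the precision of the top-|E*| edges. The sketch below implements this; the function name `break_even_point` and the ranked-list input format are our assumptions, not part of the paper.

```python
def break_even_point(ranked_edges, true_edges):
    """Precision (= recall) of the top-|E*| ranked edges.

    ranked_edges: list of distinct (u, v) pairs, most confident first (assumed).
    true_edges:   iterable of (u, v) pairs forming the ground truth G*.
    """
    truth = set(true_edges)
    m = len(truth)
    if m == 0 or len(ranked_edges) < m:
        raise ValueError("need a non-empty E* and at least |E*| ranked edges")
    # With exactly m inferred edges, precision = correct/m = recall.
    correct = sum(1 for edge in ranked_edges[:m] if edge in truth)
    return correct / m


if __name__ == "__main__":
    # Toy data, purely illustrative.
    ranked = [(1, 2), (2, 3), (3, 1), (1, 3)]
    truth = {(1, 2), (3, 1), (4, 1)}
    print(break_even_point(ranked, truth))
```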
The second important point to notice is that the performance of NetInf seems to be strong regardless of the structure of the network G*. This means that NetInf works reliably regardless of the particular structure of the network over which contagions propagated (refer to Table II).

Similarly, Figures 7(d), 7(e) and 7(f) show the performance on the same three networks but using the power-law incubation time model. The performance of the baseline now drops dramatically. This is likely because the variance of a power-law distribution (and of heavy-tailed distributions in general) is much larger than the variance of an exponential distribution, so the diffusion network inference problem is much harder in this case. While the baseline pays a high price for the increase in variance, with its break-even point dropping below 0.1, the performance of NetInf remains stable, with the break-even point in the high 0.90s.

We also examine the results on the Forest Fire network (Figures 7(g) and 7(h)). Again, the performance of the baseline is very low, while NetInf achieves a break-even point of around 0.90.

¹ The point at which recall is equal to precision.

Generally, the performance on the Forest Fire network is a bit lower than on the Kronecker graphs. However, it is important to note that while these networks have very different global network structure (hierarchical, random, scale-free, core-periphery), the performance of NetInf is remarkably stable and does not seem to depend on the structure of the network we are trying to infer or on the particular type of cascade incubation time model.

Finally, in all the experiments, we observe a sharp drop in precision for high values of recall (near 1).
This happens because the greedy algorithm starts to choose edges with low marginal gains that may be false edges, which increases the probability of making mistakes.

Performance vs. cascade coverage. Intuitively, the larger the number of cascades that spread over a particular edge, the easier it is to identify it. On the one hand, if an edge never transmitted then we cannot identify it; on the other hand, the more times it participated in the transmission of a contagion, the easier the edge should be to identify. In our experiments so far, we generated a relatively small number of cascades. Next, we examine how the performance of NetInf depends on the amount of available cascade data. This is important because in many real-world situations data on only a few different cascades is available.

Figure 8 plots the break-even point of NetInf as a function of the available cascade data, measured as the number of contagion transmission events over all cascades. The total number of contagion transmission events is simply the sum of the cascade sizes. Thus, x = 1 means that the total number of transmission events used for the experiment was equal to the number of edges in G*. Notice that as the amount of cascade data increases, the performance of NetInf also increases. Overall, NetInf requires the total number of transmission events to be about 2 times the number of edges in G* to successfully recover most of the edges of G*.

Moreover, the plot shows the performance for different values of the edge transmission probability β. As noted before, large values of β produce larger cascades. Interestingly, when cascades are small (small β) NetInf needs less data to infer the network than when cascades are larger.
This occurs because the larger a cascade, the more difficult it is to infer the parent of each node, since there are more potential parents for each node to choose from. For example, when β = 0.1 NetInf needs about 2|E| transmission events, while when β = 0.5 it needs twice as much data (about 4|E| transmissions) to obtain a break-even point of 0.9.

Stopping criterion. In practice one does not know how long to run the algorithm and how many edges to insert into the network Ĝ. Given the results from Figure 6, we found the following heuristic to give good results. We run the NetInf algorithm for k steps, where k is chosen such that the objective function is "close" to the upper bound, i.e., F_C(Ĝ) > x · OPT, where OPT is obtained using the online bound. In practice we use values of x in the range 0.8–0.9. That is, in each iteration k, OPT is computed by evaluating the right-hand side of the equation in Theorem 4, where k is simply the iteration number. Therefore, OPT is computed online, and thus the stopping condition is also updated online.

Fig. 9. Average time per edge added by our algorithm implemented with lazy evaluation (LE) and localized update (LU). Axes: problem size (number of edges) versus time per edge added (ms); curves: LE and LU, LU, no speed-ups.

Scalability. Figure 9 shows the average computation time per edge added for the NetInf algorithm implemented with lazy evaluation and localized update. We use a hierarchical Kronecker network and an exponential incubation time model with α = 1 and β = 0.5. Localized update speeds up the algorithm by an order of magnitude (45×) and lazy evaluation gives a further factor of 6 improvement.
Thus, overall, we achieve a speed-up of two orders of magnitude (280×), without any loss in solution quality. In practice the NetInf algorithm can easily be used to infer networks of 10,000 nodes in a matter of hours.

Performance vs. incubation time noise. In our experiments so far, we have assumed that the incubation time values between infections are not noisy and that we have access to the true distribution from which the incubation times are drawn. However, real data may violate either of these two assumptions.

We study the performance of NetInf (break-even point) as a function of the noise in the waiting time between infections. To this end, we add Gaussian noise to the waiting times between infections in the cascade generation process.

Figure 10 plots the performance of NetInf (break-even point) as a function of the amount of Gaussian noise added to the incubation times between infections, for both an exponential incubation time model with α = 1 and a power-law incubation time model with α = 2. The break-even point degrades with noise, but once a high level of noise is reached, a further increase in the amount of noise does not degrade the performance of NetInf any further. Interestingly, the break-even point for high levels of noise is very similar to the break-even point achieved later on a real dataset (Figures 13(a) and 13(b)).

Performance vs. infections by the external source. In all our experiments so far, we have assumed that we have access to complete cascade data, i.e., we are able to observe all the nodes taking part in each cascade. Thereby, except for the first node of a cascade, we do not have any "jumps" or missing nodes in the cascade as it spreads across the network. Even though techniques for coping with missing data in information cascades have recently
been investigated [Sadikov et al. 2011], we evaluate NetInf against both scenarios.

Fig. 10. Break-even point of NetInf as a function of the amount of additive Gaussian noise in the incubation time. Axes: incubation time noise (std) versus break-even point; curves: exponential model and power-law model.

First, we consider the case where a random fraction of each cascade is missing. This means that we first generate a set of cascades, but then record infection times for only a subset of the nodes. We first generate enough cascades so that, without counting the missing nodes in the cascades, 99% of the edges in G* still participate in at least one cascade. Then we randomly delete (i.e., set the infection times to infinity) an f-fraction of the nodes in each cascade.

Figure 11(a) plots the performance of NetInf (break-even point) as a function of the percentage of missing nodes in each cascade. Naturally, the performance drops with the amount of missing data. However, we also note that the effect of missing nodes can be mitigated by an appropriate choice of the parameter ε. Basically, a higher ε makes propagation via ε-edges more likely, and thus, by giving a cascade a greater chance to propagate over the ε-edges, NetInf can implicitly account for the missing data.

Second, we also consider the case where the contagion does not spread through the network via diffusion but rather due to the influence of an external source. Thus, the contagion does not really spread over the edges of the network but rather appears almost at random at various nodes of the network.

Figure 11(b) plots the performance of NetInf (break-even point) as a function of the percentage of nodes that are infected by an external source, for different values of ε.
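The missing-node procedure described above (deleting an f-fraction of the nodes of each cascade by setting their infection times to infinity) can be sketched as follows. This is an illustration under our own assumptions: a cascade is represented as a dictionary from node to infection time, the helper name `hide_nodes` is hypothetical, and the number of hidden nodes is rounded down.

```python
import math
import random


def hide_nodes(cascade, missing_fraction, rng):
    """Return a copy of `cascade` with a random fraction of infected nodes hidden.

    cascade:          dict mapping node -> infection time (math.inf = never observed).
    missing_fraction: fraction f of the infected nodes to delete, 0 <= f <= 1.
    rng:              a random.Random instance, for reproducibility.
    """
    infected = [n for n, t in cascade.items() if math.isfinite(t)]
    n_hide = int(missing_fraction * len(infected))  # rounded down (our choice)
    hidden = set(rng.sample(infected, n_hide))
    # Hidden nodes get infection time infinity; all other times are unchanged.
    return {n: (math.inf if n in hidden else t) for n, t in cascade.items()}


if __name__ == "__main__":
    cascade = {"a": 0.0, "b": 1.2, "c": 2.5, "d": 4.0}  # toy cascade
    print(hide_nodes(cascade, 0.5, random.Random(0)))
```

Note that the sketch may also hide the first node of a cascade; whether the original experiments allowed this is not stated in the text.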
In our framework, we model the influence due to the external source with the ε-edges. Note that setting ε appropriately can account for the exogenous infections that are not the result of network diffusion but are due to external influence. The higher the value of ε, the stronger the assumed influence of the external source, i.e., we assume a greater number of missing nodes or a greater number of nodes that are infected by an external source. Thus, the break-even point is more robust for higher values of ε.

Fig. 11. Break-even point of NetInf as (a) a function of the fraction of missing nodes per cascade, and (b) a function of the fraction of nodes that are infected by an external source per cascade. Panels: (a) missing node infection data; (b) node infections due to external source. Curves: ε = 1e-32, 1e-16, 1e-08, 1e-04, 1e-03.

4.2 Experiments on real data

Dataset description. We use more than 172 million news articles and blog posts from 1 million online sources over a period of one year, from September 1, 2008 to August 31, 2009². Based on this raw data, we use two different methodologies to trace information on the Web and then create two different datasets:

² Data available at http://memetracker.org and http://snap.stanford.edu/netinf

Fig. 12. Hyperlink-based cascades versus meme-based cascades. In hyperlink cascades, if post j linked to post k, we consider this a contagion transmission event with the post creation time as the corresponding infection time. In MemeTracker cascades, we follow the spread of a short textual phrase and use post creation times as infection times.

(1) Blog hyperlink cascades dataset: We use hyperlinks between blog posts to trace the flow of information [Leskovec et al. 2007]. When a blog publishes a piece of information and uses hyperlinks to refer to other posts published by other blogs, we consider these as events of information transmission. A cascade c starts when a blog publishes a post P, and the information propagates recursively to other blogs as they link to the original post or to one of the other posts from which we can trace a chain of hyperlinks all the way to the original post P. By following the chains of hyperlinks in the reverse direction we identify hyperlink cascades [Leskovec et al. 2007]. A cascade is thus composed of the time-stamps of the hyperlink/post creation times.

(2) MemeTracker dataset: We use the MemeTracker [Leskovec et al. 2009] methodology to extract more than 343 million short textual phrases (like "Joe, the plumber" or "lipstick on a pig"). Out of these, 8 million distinct phrases appeared more than 10 times, with a cumulative number of mentions of over 150 million. We cluster the phrases to aggregate different textual variants of the same phrase [Leskovec et al. 2009]. We then consider each phrase cluster as a separate cascade c. Since all documents are time-stamped, a cascade c is simply the set of time-stamps at which blogs first mentioned phrase c. So, we observe the times when blogs mention particular phrases but not where they copied or obtained the phrases from.
We consider the largest 5,000 cascades (phrase clusters), and for each website we record the time when it first mentions a phrase in the particular phrase cluster. Note that cascades in general do not spread over all the sites, which our methodology can successfully handle.

Figure 12 further illustrates the concept of hyperlink and MemeTracker cascades.

Fig. 13. Precision and recall for a 500-node hyperlink network using (a) the blog hyperlink cascades dataset (i.e., hyperlink cascades) and (b) the MemeTracker dataset (i.e., MemeTracker cascades). We used β = 0.5, ε = 10^-9 and the exponential model with α = 1.0. The time units were hours. Axes: recall versus precision; curves: NetInf and Baseline.

Accuracy on real data. As there is no ground truth network for either dataset, we create the ground truth network G* as follows. We create a network where there is a directed edge (u, v) between a pair of nodes u and v if a post on site u linked to a post on
site v. To construct the network we take the top 500 sites in terms of the number of hyperlinks they create/receive. We represent each site as a node in G* and connect a pair of nodes if a post on the first site linked to a post on the second site. This process produces a ground truth network G* with 500 nodes and 4,000 edges.

Fig. 14. Small part of a news media (red) and blog (blue) diffusion network. We use the blog hyperlink cascades dataset, i.e., hyperlinks between blog and news media posts, to trace the flow of information. (Node labels in the figure are site domain names, e.g., huffingtonpost.com, washingtonpost.com, gizmodo.com, engadget.com, gawker.com, salon.com.)

First, we use the blog hyperlink cascades dataset to infer the network Ĝ and evaluate how many edges NetInf got right. Figure 13(a) shows the performance of NetInf and the baseline. Notice that the baseline method achieves a break-even point of 0.34, while our method performs better with a break-even point of 0.44, almost a 30% improvement.

NetInf is basically performing a link-prediction task based only on temporal linking information. The assumption in this experiment is that sites prefer to create links to sites
that recently mentioned the information, while completely ignoring the authority of the site. Given that such an assumption is not satisfied in real life, we consider the break-even point of 0.44 a good result.

Fig. 15. Small part of a news media (red) and blog (blue) diffusion network. We use the MemeTracker dataset, i.e., textual phrases from MemeTracker, to trace the flow of information. (Node labels in the figure are site domain names, mostly news media sites.)
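The ground-truth construction used in this evaluation (top sites by number of hyperlinks created or received, with a directed edge from the linking site to the linked site) can be sketched as below. This is only an illustration: the input format, the function name `build_ground_truth`, and the exclusion of self-links are our assumptions and are not specified in the text.

```python
from collections import Counter


def build_ground_truth(site_links, top_n=500):
    """Build a directed site-level graph from post-level hyperlinks.

    site_links: list of (u, v) pairs, one per hyperlink from a post on
                site u to a post on site v (assumed input format).
    top_n:      number of sites to keep, ranked by links created + received.
    Returns the set of directed edges (u, v) among the top sites.
    """
    degree = Counter()
    for u, v in site_links:
        degree[u] += 1  # hyperlink created by u
        degree[v] += 1  # hyperlink received by v
    top_sites = {site for site, _ in degree.most_common(top_n)}
    # Self-links are dropped here; this is our assumption.
    return {(u, v) for u, v in site_links
            if u in top_sites and v in top_sites and u != v}


if __name__ == "__main__":
    # Toy data with placeholder site names.
    links = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
    print(sorted(build_ground_truth(links, top_n=3)))
```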
Now, we consider an even harder problem, where we use the MemeTracker dataset to infer G*. In this experiment, we only observe the times when sites mention particular textual phrases, and the task is to infer the hyperlink structure of the underlying web graph. Figure 13(b) shows the performance of NetInf and the baseline. The baseline method has a break-even point of 0.17 and NetInf achieves a break-even point of 0.28, more than a 50% improvement.

For a fair comparison with the synthetic cases, notice that the exponential incubation time model is a simplistic assumption for our real dataset, and NetInf can potentially gain additional accuracy by choosing a more realistic incubation time model.

Solution quality. Similarly to the synthetic data, in Figure 6(b) we investigate the value of the objective function and compare it to the online bound. Notice that the bound is almost as tight as in the case of synthetic networks, with NetInf finding a solution that is at least 84% of optimal, and both curves are similar in shape to the synthetic case. Again, as in the synthetic case, the value of the objective function quickly flattens out, which means that one needs relatively few edges to capture most of the information flow on the Web.

In the remainder of the section, we use the top 1,000 media sites and blogs with the largest number of documents.

Visualization of diffusion networks. We examine the structure of the inferred diffusion networks using both datasets: the blog hyperlink cascades dataset and the MemeTracker dataset.

Figure 14 shows the largest connected component of the diffusion network after 100 edges have been chosen using the first dataset, i.e., using hyperlinks to track the flow of information.
The size of the nodes is proportional to the number of articles on the site and the width of an edge is proportional to the probability of influence, i.e., stronger edges are wider. The strength of an edge across all cascades is simply defined as the marginal gain given by adding the edge in the greedy algorithm (which is proportional to the probability of influence). Since news media articles rarely use hyperlinks to refer to one another, the network is somewhat biased towards web blogs (blue nodes). There are several interesting patterns to observe.

First, notice that three main clusters emerge: on the left side of the network we see blogs and news media sites related to politics; at the top right, we have blogs devoted to gossip, celebrity news or entertainment; and at the bottom right, we can distinguish blogs and news media sites that deal with technology news. While Huffington Post and Political Carnival play the central role on the political side of the network, mainstream media sites like Washington Post and Guardian and the professional blog Salon.com play the role of connectors between the different parts of the network. The celebrity gossip part of the network is dominated by the blog Gawker, and technology news gathers around the blogs Gizmodo and Engadget, with CNet and TechChuck establishing the connection to the rest of the network.

Figure 15 shows the largest connected component of the diffusion network after 300 edges have been chosen using the second methodology, i.e., using short textual phrases to track the flow of information. In this case, the network is biased towards news media sites due to their higher volume of information.

Insights into the diffusion on the web. The inferred diffusion networks also allow for analysis of the global structure of information propagation on the Web.
For this analysis, we use the MemeTracker dataset and analyze the structure of the inferred information diffusion network.

First, Figure 16(a) shows the distribution of the influence index. The influence index is defined as the number of reachable nodes from $w$ by traversing edges of the inferred diffusion network (while respecting edge directions). Nevertheless, we are also interested in the distance from $w$ to its reachable nodes, i.e., nodes at shorter distances are more likely to be infected by $w$. Thus, we slightly modify the definition of influence index to be $\sum_{u} 1/d_{wu}$, where we sum over all the reachable nodes from $w$ and $d_{wu}$ is the distance between $w$ and $u$. Notice that we have two types of nodes. There is a small set of nodes that can reach many other nodes, which means they either directly or indirectly propagate information to them. On the other side, we have a large number of sites that only get influenced but do not influence many other sites. This hints at a core-periphery structure of the diffusion network with a small set of sites directly or indirectly spreading the information in the rest of the network.

Figure 16(b) investigates the number of links in the inferred network that point between different types of sites. Here we split the sites into mainstream media and blogs. Notice that most of the links point from news media to blogs, which says that most of the time information propagates from the mainstream media to blogs. Then notice how at first many media-to-media links are chosen but in later iterations the increase of these links starts to slow down. This means that media-to-media links tend to be the strongest and NETINF picks them early. The opposite seems to occur in the case of blog-to-blog links, where relatively few are chosen first but later the algorithm picks more of them. Lastly, links capturing the influence of blogs on mainstream media are the rarest and weakest. This suggests that most information travels from mass media to blogs.

[Figure 16, three panels. (a) Influence index: number of nodes vs. influence index. (b) Number of edges as iterations proceed: links vs. iteration for Media -> Media, Media -> Blog, Blog -> Media and Blog -> Blog. (c) Median edge time lag: fraction vs. median time difference (days) for the same four link types.]

Fig. 16. (a) Distribution of node influence index. Most nodes have very low influence (they act as sinks). (b) Number and strength of edges between different media types. Edges of news media influencing blogs are the strongest. (c) Median time lag on edges of different type.

Last, Figure 16(c) shows the median time difference between mentions of different types of sites. For every edge of the inferred diffusion network, we compute the median time needed for the information to spread from the source to the destination node. Again, we distinguish the mainstream media sites and blogs. Notice that media sites are quick to infect one another or even to get infected from blogs. However, blogs tend to be much slower in propagating information. It takes a relatively long time for them to get “infected” with information regardless of whether the information comes from the mainstream media or the blogosphere.

Finally, we have observed that the insights into diffusion on the web using the inferred network are very similar to insights obtained by simply taking the hyperlink network.
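As a concrete illustration of the distance-weighted influence index defined earlier in this section, $\sum_{u} 1/d_{wu}$ over the nodes $u$ reachable from $w$, the following minimal sketch computes it with a breadth-first search. It assumes that $d_{wu}$ is the hop count along directed edges, which the text does not state explicitly; the function name `influence_index`, the adjacency-dictionary representation and the toy graph are hypothetical and are not taken from the paper.

```python
from collections import deque


def influence_index(graph, w):
    """Sum of 1/d(w, u) over every node u != w reachable from w.

    `graph` maps a node to an iterable of its out-neighbours (directed edges).
    Distances are hop counts found by breadth-first search (an assumption).
    """
    dist = {w: 0}
    queue = deque([w])
    total = 0.0
    while queue:
        v = queue.popleft()
        for u in graph.get(v, ()):
            if u not in dist:  # first visit gives the shortest hop distance
                dist[u] = dist[v] + 1
                total += 1.0 / dist[u]
                queue.append(u)
    return total


# Hypothetical toy network: a influences b and c, which both influence d.
toy = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(influence_index(toy, "a"))  # 1/1 + 1/1 + 1/2 = 2.5
print(influence_index(toy, "d"))  # a sink reaches nobody: 0.0
```

In this toy graph, node `a` plays the role of a core site, while `d` is a sink of the kind that dominates the distribution in Figure 16(a).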
However, our aim here is to show that (i) although the quantitative results are modest in terms of precision and recall, the qualitative insights make sense, and that (ii) it is surprising that using simply timestamps of links, we are able to draw the same qualitative insights as using the hyperlink network.

5. FURTHER RELATED WORK

There are several lines of work we build upon. Although information diffusion in on-line settings has received considerable attention [Gruhl et al. 2004; Kumar et al. 2004; Adar and Adamic 2005; Leskovec et al. 2006; Leskovec et al. 2006; Leskovec et al. 2007; Liben-Nowell and Kleinberg 2008], only a few studies were able to study the actual shapes of cascades [Leskovec et al. 2007; Liben-Nowell and Kleinberg 2008; Ghosh and Lerman 2011; Romero et al. 2011; Ver Steeg et al. 2011]. The problem of inferring links of diffusion was first studied by Adar and Adamic [Adar and Adamic 2005], who formulated it as a supervised classification problem and used Support Vector Machines combined with rich textual features to predict the occurrence of individual links. Although rich textual features are used, links are predicted independently and thus their approach is similar to our baseline method in the sense that it picks a threshold (i.e., hyperplane in case of SVMs) and predicts individually the most probable links.

The work most closely related to our approach, CONNIE [Myers and Leskovec 2010] and NETRATE [Gomez-Rodriguez et al. 2011], also uses a generative probabilistic model for the problem of inferring a latent social network from diffusion (cascades) data. However, CONNIE and NETRATE use convex programming to solve the network inference problem.
CONNIE includes an $\ell_1$-like penalty term that controls sparsity while NETRATE provides a unique sparse solution by allowing different transmission rates across edges. For each edge $(i, j)$, CONNIE infers a prior probability $\beta_{i,j}$ and NETRATE infers a transmission rate $\alpha_{i,j}$. Both algorithms are computationally more expensive than NETINF. In our work, we assume that all edges of the network have the same prior probability ($\beta$) and transmission rate ($\alpha$). From this point of view, we think the comparison between the algorithms is unfair since NETRATE and CONNIE have more degrees of freedom.

Network structure learning has been considered for estimating the dependency structure of probabilistic graphical models [Friedman and Koller 2003; Friedman et al. 1999]. However, there are fundamental differences between our approach and graphical models structure learning. First, our work makes no assumption about the network structure (we allow cycles and reciprocal edges) and is thus able to learn general directed networks. In directed graphical models, reciprocal edges and cycles are not allowed, and the inferred network is a directed acyclic graph (DAG). In undirected graphical models, there are typically no assumptions about the network structure, but the inferred network is undirected. Second, Bayesian network structure inference methods are generally heuristic approaches without any approximation guarantees.
Network structure learning has also been used for estimating epidemiological networks [Wallinga and Teunis 2004] and for estimating probabilistic relational models [Getoor et al. 2003]. In both cases, the problem is formulated in a probabilistic framework. However, since the problem is intractable, heuristic greedy hill-climbing or stochastic search methods that offer no performance guarantee were usually used in practice. In contrast, our work provides a novel formulation and a tractable solution together with an approximation guarantee.

Our work relates to static sparse graph estimation using graphical Lasso methods [Wainwright et al. 2006; Schmidt et al. 2007; Friedman et al. 2008; Meinshausen and Buehlmann 2006], unsupervised structured network inference using kernel methods [Lippert et al. 2009], mutual information relevance network inference [Butte and Kohane 2000], inference of influence probabilities [Goyal et al. 2010], and extensions to time-evolving graphical models [Ahmed and Xing 2009; Ghahramani 1998; Song et al. 2009]. Our work is also related to the link prediction problem [Jansen et al. 2003; Taskar et al. 2003; Liben-Nowell and Kleinberg 2003; Backstrom and Leskovec 2011; Vert and Yamanishi 2005], but differs in the sense that this line of work assumes that part of the network is already visible to us.

Last, although submodular function maximization has been previously considered for sensor placement [Leskovec et al. 2007] and finding influencers in viral marketing [Kempe et al. 2003], to the best of our knowledge, the present work is the first that considers submodular function maximization in the context of network structure learning.

6. CONCLUSIONS

We have investigated the problem of tracing paths of diffusion and influence. We formalized the problem and developed a scalable algorithm, NETINF, to infer networks of influence and diffusion.
First, we defined a generative model of cascades and showed that choosing the best set of $k$ edges maximizing the likelihood of the data is NP-hard. By exploiting the submodularity of our objective function, we developed NETINF, an efficient algorithm for inferring a near-optimal set of $k$ directed edges. By exploiting localized updates and lazy evaluation, our algorithm is able to scale to very large real data sets.

We evaluated our algorithm on synthetic cascades sampled from our generative model, and showed that NETINF is able to accurately recover the underlying network from a relatively small number of samples. In our experiments, NETINF drastically outperformed a naive maximum weight baseline heuristic.

Most importantly, our algorithm allows us to study properties of real networks. We evaluated NETINF on a large real data set of memes propagating across news websites and blogs. We found that the inferred network exhibits a core-periphery structure with mass media influencing most of the blogosphere. Clusters of sites related to similar topics emerge (politics, gossip, technology, etc.), and a few sites with social capital interconnect these clusters, allowing a potential diffusion of information among sites in different clusters.

There are several interesting directions for future work. Here we only used time difference to infer edges and thus it would be interesting to utilize more informative features (e.g., textual content of postings) to more accurately estimate the influence probabilities. Moreover, our work considers static propagation networks; however, real influence networks are dynamic and thus it would be interesting to relax this assumption.
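The lazy evaluation mentioned above for greedy submodular maximization can be sketched as follows. This is not the NETINF implementation: it is a generic lazy greedy loop under the assumption that marginal gains never increase as the solution grows, and it uses a toy set-coverage objective as a stand-in for the cascade likelihood. The names `lazy_greedy`, `coverage_gain` and `sets` are hypothetical.

```python
import heapq


def lazy_greedy(candidates, gain, k):
    """Greedily pick up to k items, recomputing marginal gains only when needed.

    gain(item, chosen) returns the marginal gain of adding `item` to the list
    `chosen`. The lazy shortcut is valid only if gains are non-increasing as
    `chosen` grows (submodularity), so a stale gain is an upper bound.
    """
    chosen = []
    # heapq is a min-heap, so gains are negated. The index i breaks ties and
    # the last field records the solution size at which the gain was computed.
    heap = [(-gain(c, chosen), i, c, 0) for i, c in enumerate(candidates)]
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        neg_gain, i, item, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            # The top entry is up to date, so no other item can beat it.
            if -neg_gain <= 0:
                break  # nothing left that improves the objective
            chosen.append(item)
        else:
            # Stale entry: recompute its gain and push it back.
            heapq.heappush(heap, (-gain(item, chosen), i, item, len(chosen)))
    return chosen


# Toy stand-in objective: each candidate "edge" covers a set of observations.
sets = {"e1": {1, 2, 3}, "e2": {3, 4}, "e3": {4, 5, 6, 7}, "e4": {1, 2}}


def coverage_gain(name, chosen):
    covered = set().union(*(sets[c] for c in chosen))
    return len(sets[name] - covered)


print(lazy_greedy(list(sets), coverage_gain, 2))  # ['e3', 'e1']
```

In this toy run only `e1` has its gain recomputed before the second pick; `e2` and `e4` are never re-evaluated, which is where the savings come from on large candidate sets.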
Last, there are many other domains where our methodology could be useful: inferring interaction networks in systems biology (protein-protein and gene interaction networks), neuroscience (inferring physical connections between neurons) and epidemiology. We believe that our results provide a promising step towards understanding complex processes on networks based on partial observations.

Acknowledgments

We thank Spinn3r for resources that facilitated the research. The research was supported in part by Albert Yu & Mary Bechmann Foundation, IBM, Lightspeed, Microsoft, Yahoo, grants ONR N00014-09-1-1044, NSF CNS0932392, NSF CNS1010921, NSF IIS1016909, NSF IIS0953413, AFRL FA8650-10-C-7058 and Okawa Foundation Research Grant. Manuel Gomez Rodriguez has been supported in part by a Fundacion Caja Madrid Graduate Fellowship, a Fundacion Barrie de la Maza Graduate Fellowship and by the Max Planck Society.

REFERENCES

ADAR, E. AND ADAMIC, L. A. 2005. Tracking information epidemics in blogspace. In Web Intelligence. 207–214.
ADAR, E., ZHANG, L., ADAMIC, L. A., AND LUKOSE, R. M. 2004. Implicit structure and the dynamics of blogspace. In Workshop on the Weblogging Ecosystem.
AHMED, A. AND XING, E. 2009. Recovering time-varying networks of dependencies in social and biological studies. In PNAS ’09: Proceedings of the National Academy of Sciences. Vol. 106.
ANDERSON, R. M. AND MAY, R. M. 2002. Infectious diseases of humans: Dynamics and control. Oxford Press.
BACKSTROM, L. AND LESKOVEC, J. 2011. Supervised random walks: Predicting and recommending links in social networks. In WSDM ’11: Proceedings of the ACM International Conference on Web Search and Data Mining.
BAILEY, N. T. J. 1975. The Mathematical Theory of Infectious Diseases and its Applications, 2nd ed. Hafner Press.
BARABÁSI, A.-L. 2005. The origin of bursts and heavy tails in human dynamics. Nature 435, 207.
BARABÁSI, A.-L. AND ALBERT, R. 1999. Emergence of scaling in random networks. Science 286, 509–512.
BUTTE, A. AND KOHANE, I. 2000. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In Pac Symp Biocomput. Vol. 5. 418–429.
CLAUSET, A., MOORE, C., AND NEWMAN, M. E. J. 2008. Hierarchical structure and the prediction of missing links in networks. Nature 453, 7191, 98–101.
CRANE, R. AND SORNETTE, D. 2008. Robust dynamic classes revealed by measuring the response function of a social system. PNAS ’08: Proceedings of the National Academy of Sciences 105, 41 (October), 15649–15653.
DOMINGOS, P. AND RICHARDSON, M. 2001. Mining the network value of customers. In KDD ’01: Proceedings of the 7th ACM SIGKDD international conference on Knowledge discovery and data mining.
EDMONDS, J. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards 71B, 233–240.
ERDŐS, P. AND RÉNYI, A. 1960. On the evolution of random graphs. Publication of the Mathematical Institute of the Hungarian Academy of Science 5, 17–67.
FRIEDMAN, J., HASTIE, T., AND TIBSHIRANI, R. 2008. Sparse inverse covariance estimation with the graphical lasso. Biostat 9, 3, 432–441.
FRIEDMAN, N. AND KOLLER, D. 2003. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning 50, 1, 95–125.
FRIEDMAN, N., NACHMAN, I., AND PE’ER, D. 1999. Learning Bayesian network structure from massive datasets: The “Sparse Candidate” algorithm. In UAI ’99: Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence.
GETOOR, L., FRIEDMAN, N., KOLLER, D., AND TASKAR, B. 2003. Learning probabilistic models of link structure. The Journal of Machine Learning Research 3, 707.
GHAHRAMANI, Z. 1998. Learning dynamic Bayesian networks. In Adaptive Processing of Sequences and Data Structures.
GHOSH, R. AND LERMAN, K. 2011. A framework for quantitative analysis of cascades on networks. In WSDM ’11: Proceedings of the fourth ACM international conference on Web search and data mining. 665–674.
GOMEZ-RODRIGUEZ, M., BALDUZZI, D., AND SCHÖLKOPF, B. 2011. Uncovering the Temporal Dynamics of Diffusion Networks. In ICML ’11: Proceedings of the 28th International Conference on Machine Learning. 561–568.
GOODMAN, L. A. 1961. Snowball sampling. Annals of Mathematical Statistics 32, 1, 148–170.
GOYAL, A., BONCHI, F., AND LAKSHMANAN, L. 2010. Learning influence probabilities in social networks. In WSDM ’10: Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 241–250.
GRUHL, D., GUHA, R., LIBEN-NOWELL, D., AND TOMKINS, A. 2004. Information diffusion through blogspace. In WWW ’04: Proceedings of the 13th international conference on World Wide Web. 491–501.
HECKATHORN, D. 1997. Respondent-driven sampling: A new approach to the study of hidden populations. Social Problems 44, 2, 174–199.
HETHCOTE, H. W. 2000. The mathematics of infectious diseases. SIAM Review 42, 4, 599–653.
JANSEN, R., YU, H., GREENBAUM, D., KLUGER, Y., KROGAN, N., CHUNG, S., EMILI, A., SNYDER, M., GREENBLATT, J., AND GERSTEIN, M. 2003. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302, 5644, 449–453.
KATZ, E. AND LAZARSFELD, P. 1955. Personal influence: The part played by people in the flow of mass communications. Free Press.
KEARNS, M., SURI, S., AND MONTFORT, N. 2006. An experimental study of the coloring problem on human subject networks. Science 313, 5788, 824.
KEMPE, D., KLEINBERG, J. M., AND TARDOS, E. 2003. Maximizing the spread of influence through a social network. In KDD ’03: Proceedings of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining. 137–146.
KHULLER, S., MOSS, A., AND NAOR, J. 1999. The budgeted maximum coverage problem. Information Processing Letters 70, 1, 39–45.
KNUTH, D. 1968. The art of computer programming. Addison-Wesley.
KUMAR, R., NOVAK, J., RAGHAVAN, P., AND TOMKINS, A. 2004. Structure and evolution of blogspace. CACM 47, 12, 35–39.
LESKOVEC, J., ADAMIC, L. A., AND HUBERMAN, B. A. 2006. The dynamics of viral marketing. In EC ’06: Proceedings of the 7th ACM conference on Electronic commerce. 228–237.
LESKOVEC, J., BACKSTROM, L., AND KLEINBERG, J. 2009. Meme-tracking and the dynamics of the news cycle. In KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, NY, USA, 497–506.
LESKOVEC, J. AND FALOUTSOS, C. 2007. Scalable modeling of real graphs using kronecker multiplication. In ICML ’07: Proceedings of the 24th International Conference on Machine Learning. 504.
LESKOVEC, J., KLEINBERG, J., AND FALOUTSOS, C. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD ’05: Proceedings of the 11th ACM SIGKDD international conference on Knowledge discovery in data mining. 187.
LESKOVEC, J., KLEINBERG, J. M., AND FALOUTSOS, C. 2007. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD) 1, 1, 2.
LESKOVEC, J., KRAUSE, A., GUESTRIN, C., FALOUTSOS, C., VANBRIESEN, J., AND GLANCE, N. 2007. Cost-effective outbreak detection in networks. In KDD ’07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. 420–429.
LESKOVEC, J., LANG, K. J., DASGUPTA, A., AND MAHONEY, M. W. 2008. Statistical properties of community structure in large social and information networks. In WWW ’08: Proceedings of the 17th International Conference on World Wide Web.
LESKOVEC, J., MCGLOHON, M., FALOUTSOS, C., GLANCE, N., AND HURST, M. 2007. Cascading behavior in large blog graphs. In SDM ’07: Proceedings of the SIAM Conference on Data Mining.
LESKOVEC, J., SINGH, A., AND KLEINBERG, J. M. 2006. Patterns of influence in a recommendation network. In PAKDD ’06: Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining. 380–389.
LIBEN-NOWELL, D. AND KLEINBERG, J. 2003. The link prediction problem for social networks. In CIKM ’03: Proceedings of the International Conference on Information and Knowledge Management. 556–559.
LIBEN-NOWELL, D. AND KLEINBERG, J. 2008. Tracing the flow of information on a global scale using Internet chain-letter data. PNAS ’08: Proceedings of the National Academy of Sciences 105, 12 (25 Mar.), 4633–4638.
LIPPERT, C., STEGLE, O., GHAHRAMANI, Z., AND BORGWARDT, K. 2009. A kernel method for unsupervised structured network inference. In AISTATS ’09: Proceedings of the Artificial Intelligence and Statistics.
MALMGREN, R. D., STOUFFER, D. B., MOTTER, A. E., AND AMARAL, L. A. A. N. 2008. A poissonian explanation for heavy tails in e-mail communication. Proceedings of the National Academy of Sciences 105, 47 (November), 18153–18158.
MEINSHAUSEN, N. AND BUEHLMANN, P. 2006. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 1436–1462.
MYERS, S. AND LESKOVEC, J. 2010. On the Convexity of Latent Social Network Inference. In NIPS ’10: Advances in Neural Information Processing Systems.
NEMHAUSER, G., WOLSEY, L., AND FISHER, M. 1978. An analysis of approximations for maximizing submodular set functions. Mathematical Programming 14, 1, 265–294.
ROGERS, E. M. 1995. Diffusion of Innovations, Fourth ed. Free Press, New York.
ROMERO, D., MEEDER, B., AND KLEINBERG, J. 2011. Differences in the mechanics of information diffusion across topics: Idioms, political hashtags, and complex contagion on twitter. In WWW ’11: Proceedings of the 20th international conference on World wide web. ACM, 695–704.
SADIKOV, S., MEDINA, M., LESKOVEC, J., AND GARCIA-MOLINA, H. 2011. Correcting for missing data in information cascades. In WSDM ’11: Proceedings of the ACM International Conference on Web Search and Data Mining.
SCHMIDT, M., NICULESCU-MIZIL, A., AND MURPHY, K. 2007. Learning graphical model structure using l1-regularization paths. In AAAI ’07: Proceedings of the 21th Conference on Artificial Intelligence. Vol. 22.
SONG, L., KOLAR, M., AND XING, E. 2009. Time-varying dynamic bayesian networks. In NIPS ’09: Advances in Neural Information Processing Systems.
STRANG, D. AND SOULE, S. A. 1998. Diffusion in organizations and social movements: From hybrid corn to poison pills. Annual Review of Sociology 24, 265–290.
TASKAR, B., WONG, M. F., ABBEEL, P., AND KOLLER, D. 2003. Link prediction in relational data. In NIPS ’03: Advances in Neural Information Processing Systems.
TUTTE, W. 1948. The dissection of equilateral triangles into equilateral triangles. Proceedings Cambridge Philos. Soc. 44, 463–482.
VER STEEG, G., GHOSH, R., AND LERMAN, K. 2011. What stops social epidemics? In ICWSM ’11: Proceedings of the 5th Int. Conf. on Weblogs and Social Media.
VERT, J. AND YAMANISHI, Y. 2005. Supervised graph inference. In NIPS ’05: Advances in Neural Information Processing Systems.
WAINWRIGHT, M. J., RAVIKUMAR, P., AND LAFFERTY, J. D. 2006. High-dimensional graphical model selection using $\ell_1$-regularized logistic regression. In PNAS ’06: Proceedings of the National Academy of Sciences.
WALLINGA, J. AND TEUNIS, P. 2004. Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures. American Journal of Epidemiology 160, 6, 509–516.
WATTS, D. J. AND DODDS, P. S. 2007. Influentials, networks, and public opinion formation. Journal of Consumer Research 34, 4 (December), 441–458.
