Estimating Diffusion Network Structures: Recovery Conditions, Sample Complexity & Soft-thresholding Algorithm
Hadi Daneshmand (1), HADI.DANESHMAND@TUE.MPG.DE
Manuel Gomez-Rodriguez (1), MANUELGR@TUE.MPG.DE
Le Song (2), LSONG@CC.GATECH.EDU
Bernhard Schölkopf (1), BS@TUE.MPG.DE
(1) MPI for Intelligent Systems and (2) Georgia Institute of Technology

Abstract

Information spreads across social and technological networks, but often the network structures are hidden from us and we only observe the traces left by the diffusion processes, called cascades. Can we recover the hidden network structures from these observed cascades? What kind of cascades and how many cascades do we need? Are there some network structures which are more difficult than others to recover? Can we design efficient inference algorithms with provable guarantees?

Despite the increasing availability of cascade data and methods for inferring networks from these data, a thorough theoretical understanding of the above questions remains largely unexplored in the literature. In this paper, we investigate the network structure inference problem for a general family of continuous-time diffusion models using an ℓ1-regularized likelihood maximization framework. We show that, as long as the cascade sampling process satisfies a natural incoherence condition, our framework can recover the correct network structure with high probability if we observe O(d^3 log N) cascades, where d is the maximum number of parents of a node and N is the total number of nodes. Moreover, we develop a simple and efficient soft-thresholding inference algorithm, which we use to illustrate the consequences of our theoretical results, and show that our framework outperforms other alternatives in practice.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32.
Copyright 2014 by the author(s).

1. Introduction

Diffusion of information, behaviors, diseases, or, more generally, contagions can be naturally modeled as a stochastic process that occurs over the edges of an underlying network (Rogers, 1995). In this scenario, we often observe the temporal traces that the diffusion generates, called cascades, but the edges of the network that gave rise to the diffusion remain unobservable (Adar & Adamic, 2005). For example, blogs or media sites often publish a new piece of information without explicitly citing their sources. Marketers may note when a social media user decides to adopt a new behavior but cannot tell which neighbor in the social network influenced them to do so. Epidemiologists observe when a person gets sick but usually cannot tell who infected her. In all these cases, given a set of cascades and a diffusion model, the network inference problem consists of inferring the edges (and model parameters) of the unobserved underlying network (Gomez-Rodriguez, 2013).

The network inference problem has attracted significant attention in recent years (Saito et al., 2009; Gomez-Rodriguez et al., 2010; 2011; Snowsill et al., 2011; Du et al., 2012a), since it is essential to reconstruct and predict the paths over which information can spread, and to maximize sales of a product or stop infections. Most previous work has focused on developing network inference algorithms and evaluating their performance experimentally on different synthetic and real networks, and a rigorous theoretical analysis of the problem has been missing. However, such analysis is of outstanding interest since it would enable us to answer many fundamental open questions. For example, which conditions are sufficient to guarantee that we can recover a network given a large number of cascades? If these conditions are satisfied, how many cascades are sufficient to infer the network with high probability?
Until recently, there has been a paucity of work along this direction (Netrapalli & Sanghavi, 2012; Abrahao et al., 2013), and it provides only partial views of the problem. None of it is able to identify the recovery condition relating to the interaction between the network structure and the cascade sampling process, which we will make precise in our paper.

Overview of results. We consider the network inference problem under the continuous-time diffusion model recently introduced by Gomez-Rodriguez et al. (2011). We identify a natural incoherence condition for such a model, which depends on the network structure, the diffusion parameters and the sampling process of the cascades. This condition captures the intuition that we can recover the network structure if the co-occurrence of a node and its non-parent nodes is small in the cascades. Furthermore, we show that, if this condition holds for the population case, we can recover the network structure using an ℓ1-regularized maximum likelihood estimator and O(d^3 log N) cascades, and the probability of success approaches 1 at a rate exponential in the number of cascades. Importantly, if this condition also holds for the finite sample case, then the guarantee can be improved to O(d^2 log N) cascades. Beyond theoretical results, we also propose a new, efficient and simple proximal gradient algorithm to solve the ℓ1-regularized maximum likelihood estimation. The algorithm is especially well-suited for our problem since it is highly scalable and naturally finds sparse estimators, as desired, by using soft-thresholding. Using this algorithm, we perform various experiments illustrating the consequences of our theoretical results and demonstrating that it typically outperforms other state-of-the-art algorithms.

Related work.
Netrapalli & Sanghavi (2012) propose a maximum likelihood network inference method for a variation of the discrete-time independent cascade model (Kempe et al., 2003) and show that, for general networks satisfying a correlation decay condition, the estimator recovers the network structure given O(d^2 log N) cascades, and the probability of success approaches 1 at a rate exponential in the number of cascades. The rate they obtain is on a par with our results. However, their discrete diffusion model is less realistic in practice, and the correlation decay condition is rather restrictive: essentially, on average each node can only infect one single node per cascade. Instead, we use a general continuous-time diffusion model (Gomez-Rodriguez et al., 2011), which has been extensively validated on real diffusion data and extended in various ways by different authors (Wang et al., 2012; Du et al., 2012a;b). Abrahao et al. (2013) propose a simple network inference method, First-Edge, for a slightly different continuous-time independent cascade model (Gomez-Rodriguez et al., 2010), and show that, for general networks, if the cascade sources are chosen uniformly at random, the algorithm needs O(N d log N) cascades to recover the network structure, and the probability of success approaches 1 only at a rate polynomial in the number of cascades. Additionally, they study trees and bounded-degree networks and show that, if the cascade sources are chosen uniformly at random, the error decreases polynomially as long as O(log N) and Ω(d^9 log^2 d log N) cascades are recorded, respectively. In our work, we show that, for general networks satisfying a natural incoherence condition, our method outperforms the First-Edge algorithm and the algorithm for bounded-degree networks in terms of rate and sample complexity. Gripon & Rabbat (2013) propose a network inference method for unordered cascades, in which nodes that are infected together in the same cascade are connected by a path containing exactly the nodes in the trace, and give necessary and sufficient conditions for network inference. However, they consider a restrictive, unrealistic scenario in which cascades are all three nodes long.

Figure 1. The diffusion network structure (left) is unknown and we only observe cascades, which are N-dimensional vectors recording the times when nodes get infected by contagions that spread (right). Cascade 1 is (t_a, t_b, t_c, ∞, ∞, ∞), where t_a < t_c < t_b, and cascade 2 is (∞, t_b, ∞, t_d, t_e, t_f), where t_b < t_d < t_e < t_f. Each cascade contains a source node (dark red), drawn from a source distribution P(s), as well as infected (light red) and uninfected (white) nodes, and it provides information on the black and dark gray edges but not on the light gray edges.

2. Continuous-Time Diffusion Model

In this section, we revisit the continuous-time generative model for cascade data introduced by Gomez-Rodriguez et al. (2011). The model associates each edge j → i with a transmission function, f(t_i | t_j; α_ji) = f(t_i − t_j; α_ji), a density over time parameterized by α_ji. This is in contrast to previous discrete-time models, which associate each edge with a fixed infection probability (Kempe et al., 2003). Moreover, it also differs from discrete-time models in the sense that events in a cascade are not generated iteratively in rounds; instead, event timings are sampled directly from the transmission functions in the continuous-time model.

2.1.
Cascade Generative Process

Given a directed contact network, G = (V, E) with N nodes, the process begins with an infected source node, s, initially adopting a certain contagion at time zero, which we draw from a source distribution P(s). The contagion is transmitted from the source along her outgoing edges to her direct neighbors. Each transmission through an edge entails a random transmission time, τ = t_i − t_j, drawn from an associated transmission function f(τ; α_ji). We assume transmission times are independent, possibly distributed differently across edges.

Table 1. Functions g_i(t; α) and D(t; α)_jj, for infected nodes (t_i < T) and uninfected nodes (t_i > T), expressed in terms of the hazard h(·; α) and survival S(·|·; α) functions.

3. Network Inference Problem

We denote the set of parents of node i as N^-(i) = {j : α*_ji > 0}, with cardinality d_i = |N^-(i)|, and the minimum positive transmission rate as α*_min,i = min_{j : α*_ji > 0} α*_ji. Let C^n be a set of n cascades sampled from the model, where the source s ∈ V of each cascade is drawn from a source distribution P(s). Then, the network inference problem consists of finding the directed edges and the associated parameters using only the temporal information from the set of cascades C^n. This problem has been cast as a maximum likelihood estimation problem (Gomez-Rodriguez et al., 2011):

  minimize_A  −(1/n) Σ_{c ∈ C^n} log f(t^c; A)
  subject to  α_ji ≥ 0, i, j = 1, ..., N, i ≠ j,     (2)

where the inferred edges in the network correspond to those pairs of nodes with non-zero parameters, i.e., α̂_ji > 0. In fact, the problem in Eq. 2 decouples into a set of independent smaller subproblems, one per node, where we infer the parents of each node and the parameters associated with these incoming edges. Without loss of generality, for a particular node i, we solve the problem

  minimize_{α_i}  ℓ^n(α_i)
  subject to  α_ji ≥ 0, j = 1, ..., N, i ≠ j,     (3)

where α_i := {α_ji | j = 1, ..., N, i ≠ j} are the relevant variables, and ℓ^n(α_i) = −(1/n) Σ_{c ∈ C^n} g_i(t^c; α_i) corresponds to the terms in Eq. 2 involving α_i (see also Table 1 for the definition of g_i(·; α_i)). In this subproblem, we only need to consider a super-neighborhood V_i = R_i ∪ U_i of i, with cardinality p_i = |V_i| ≤ N, where R_i is the set of upstream nodes from which i is reachable and U_i is the set of nodes which are reachable from at least one node j ∈ R_i. Here, we consider a node i to be reachable from a node j if and only if there is a directed path from j to i. We can skip all nodes in V \ V_i in our analysis because they will never be infected in a cascade before i, and thus the maximum likelihood estimates of the associated transmission rates will always be zero (and correct).

Below, we show that, as n → ∞, the solution, α̂_i, of the problem in Eq. 3 is a consistent estimator of the true parameter α*_i. However, it is not clear whether it is possible to recover the true network structure with this approach given a finite amount of cascades and, if so, how many cascades are needed. We will show that, by adding an ℓ1-regularizer to the objective function and solving instead the following optimization problem

  minimize_{α_i}  ℓ^n(α_i) + λ_n ||α_i||_1
  subject to  α_ji ≥ 0, j = 1, ..., N, i ≠ j,     (4)

we can provide finite sample guarantees for recovering the network structure (and parameters). Our analysis also shows that, by selecting an appropriate value for the regularization parameter λ_n, the solution of Eq. 4 successfully recovers the network structure with probability approaching 1 exponentially fast in n.

In the remainder of the paper, we will focus on estimating the parent nodes of a particular node i. For simplicity, we will use α = α_i, α_j = α_ji, N^- = N^-(i), R = R_i, U = U_i, d = d_i, p = p_i and α*_min = α*_min,i.
4. Consistency

Can we recover the hidden network structures from the observed cascades? The answer is yes. We will show this by proving that the estimator provided by Eq. 3 is consistent, meaning that, as the number of cascades goes to infinity, we can always recover the true network structure.

More specifically, Gomez-Rodriguez et al. (2011) showed that the network inference problem defined in Eq. 3 is convex in α if the survival functions are log-concave and the hazard functions are concave in α. Under these conditions, the Hessian matrix, Q^n = ∇^2 ℓ^n(α), can be expressed as the sum of a nonnegative diagonal matrix D^n and the outer product of a matrix X^n(α) with itself, i.e.,

  Q^n = D^n(α) + (1/n) X^n(α) [X^n(α)]^T.     (5)

Here, the diagonal matrix D^n(α) = (1/n) Σ_c D(t^c; α) is a sum over a set of diagonal matrices D(t^c; α), one for each cascade c (see Table 1 for the definition of its entries), and X^n(α) is the Hazard matrix

  X^n(α) = [X(t^1; α) | X(t^2; α) | ... | X(t^n; α)],     (6)

with each column X(t^c; α) := h(t^c; α)^{-1} ∇_α h(t^c; α). Intuitively, the Hessian matrix captures the co-occurrence information of nodes in cascades. Then, we can prove:

Theorem 1. If the source probability P(s) is strictly positive for all s ∈ R, then the maximum likelihood estimator α̂ given by the solution of Eq. 3 is consistent.

Proof. We check the three criteria for consistency: continuity, compactness and identification of the objective function (Newey & McFadden, 1994). Continuity is obvious. For compactness, since the objective diverges both as α_ji → 0 and as α_ji → ∞ for all i, j, we lose nothing by imposing upper and lower bounds, thus restricting to a compact subset. For the identification condition, α ≠ α* ⇒ ℓ^n(α) ≠ ℓ^n(α*), we use Lemmas 9 and 10 (refer to Appendices A and B), which establish that X^n(α) has full row rank as n → ∞, and hence Q^n is positive definite.
5. Recovery Conditions

In this section, we will find a set of sufficient conditions on the diffusion model and the cascade sampling process under which we can recover the network structure from finite samples. These results allow us to address two questions:

• Are there some network structures which are more difficult than others to recover?
• What kind of cascades are needed for the network structure recovery?

The answers to these questions are intertwined. The difficulty of finite-sample recovery depends crucially on an incoherence condition which is a function of the network structure, the parameters of the diffusion model and the cascade sampling process. Intuitively, the sources of the cascades in a diffusion network have to be chosen in such a way that nodes without a parent-child relation co-occur less often than nodes with such a relation. Many commonly used diffusion models and network structures can be naturally made to satisfy this condition.

More specifically, we first place two conditions on the Hessian of the population log-likelihood, E_c[ℓ^n(α)] = E_c[log g(t^c; α)], where the expectation is taken over the distribution P(s) of the source nodes and the density f(t^c | s) of the cascades t^c given a source node s. We denote the Hessian of E_c[log g(t^c; α)] evaluated at the true model parameter α* as Q*. Then, we place two conditions on the Lipschitz continuity of X(t^c; α), and on the boundedness of X(t^c; α*) and ∇g(t^c; α*) at the true model parameter α*. For simplicity, we will denote the subset of indices associated with node i's true parents as S, and its complement as S^c. Then, we use Q*_{SS} to denote the sub-matrix of Q* indexed by S, and α*_S the set of parameters indexed by S.
Condition 1 (Dependency condition): There exist constants C_min > 0 and C_max > 0 such that Λ_min(Q*_{SS}) ≥ C_min and Λ_max(Q*_{SS}) ≤ C_max, where Λ_min(·) and Λ_max(·) return the minimum and maximum eigenvalues of their argument, respectively. This assumption ensures that two connected nodes co-occur reasonably frequently in the cascades but are not deterministically related.

Condition 2 (Incoherence condition): There exists ε ∈ (0, 1] such that |||Q*_{S^c S} (Q*_{SS})^{-1}|||_∞ ≤ 1 − ε, where |||A|||_∞ = max_j Σ_k |A_jk|. This assumption captures the intuition that node i and any of its neighbors should get infected together in a cascade more often than node i and any of its non-neighbors.

Condition 3 (Lipschitz continuity): For any feasible cascade t^c, the Hazard vector X(t^c; α) is Lipschitz continuous in the domain {α : α_S ≥ α*_min / 2},

  ||X(t^c; β) − X(t^c; α)||_2 ≤ k_1 ||β − α||_2,

where k_1 is some positive constant. As a consequence, the spectral norm of the difference, n^{-1/2} (X^n(β) − X^n(α)), is also bounded (refer to Appendix C), i.e.,

  |||n^{-1/2} (X^n(β) − X^n(α))|||_2 ≤ k_1 ||β − α||_2.     (7)

Furthermore, for any feasible cascade t^c, D(t^c; α)_jj is Lipschitz continuous for all j ∈ V,

  |D(t^c; β)_jj − D(t^c; α)_jj| ≤ k_2 ||β − α||_2,

where k_2 is some positive constant.

Condition 4 (Boundedness): For any feasible cascade t^c, the absolute value of each entry in the gradient of its log-likelihood and in the Hazard vector, as evaluated at the true model parameter α*, is bounded,

  ||∇g(t^c; α*)||_∞ ≤ k_3,    ||X(t^c; α*)||_∞ ≤ k_4,

where k_3 and k_4 are positive constants. Then, the absolute value of each entry in the Hessian matrix Q* is also bounded, |||Q*|||_∞ ≤ k_5.
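Conditions 1 and 2 can be checked numerically for any candidate population Hessian. The sketch below computes the eigenvalue bounds of Q*_{SS} and the incoherence norm |||Q*_{S^c S} (Q*_{SS})^{-1}|||_∞; the matrix Q and parent set S are illustrative placeholders, not quantities from the paper:

```python
import numpy as np

def check_conditions(Q, S, eps=0.1):
    """Check Conditions 1-2 for a candidate Hessian Q and parent index set S."""
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(Q.shape[0]), S)   # complement S^c
    Q_SS = Q[np.ix_(S, S)]
    Q_ScS = Q[np.ix_(Sc, S)]
    eigvals = np.linalg.eigvalsh(Q_SS)
    C_min = eigvals.min()                         # Condition 1: need C_min > 0
    # Condition 2: |||Q_{S^c S} (Q_SS)^{-1}|||_inf is the max absolute row sum
    incoherence = np.abs(Q_ScS @ np.linalg.inv(Q_SS)).sum(axis=1).max()
    return C_min > 0, incoherence <= 1 - eps

# Toy 4-node example (hypothetical Hessian): node i has parents S = {0, 1}
Q = np.array([[2.0, 0.3, 0.2, 0.1],
              [0.3, 2.0, 0.1, 0.2],
              [0.2, 0.1, 1.0, 0.0],
              [0.1, 0.2, 0.0, 1.0]])
dep_ok, inc_ok = check_conditions(Q, S=[0, 1], eps=0.1)
```

For this toy matrix both conditions hold comfortably; increasing the off-diagonal blocks of Q (stronger co-occurrence between node i and its non-parents) eventually violates Condition 2, matching the intuition above.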
Remarks for Condition 1. As stated in Theorem 1, as long as the source probability P(s) is strictly positive for all s ∈ R, the maximum likelihood formulation is strictly convex, and thus there exists C_min > 0 such that Λ_min(Q*) ≥ C_min. Moreover, Condition 4 implies that there exists C_max > 0 such that Λ_max(Q*) ≤ C_max.

Remarks for Condition 2. The incoherence condition depends, in a non-trivial way, on the network structure, diffusion parameters, observation window and source node distribution. Here, we give some intuition by studying three small canonical examples.

First, consider the chain graph in Fig. 2(a) and assume that we would like to find the incoming edges to node 3 when T → ∞. Then, it is easy to show that the incoherence condition is satisfied if (P_0 + P_1)/(P_0 + P_1 + P_2) < 1 − ε and P_0/(P_0 + P_1 + P_2) < 1 − ε, where P_i denotes the probability that node i is the source of a cascade. Thus, for example, if the source of each cascade is chosen uniformly at random, the inequality is satisfied. Here, the incoherence condition depends on the source node distribution.

Second, consider the directed tree in Fig. 2(b) and assume that we would like to find the incoming edges to node 0 when T → ∞. Then, it can be shown that the incoherence condition is satisfied as long as (1) P_1 > 0, (2) P_2 > 0, or P_5 > 0 and P_6 > 0, and (3) P_3 > 0. As in the chain, the condition depends on the source node distribution.

Finally, consider the star graph in Fig. 2(c), with exponential edge transmission functions, and assume that we would like to find the incoming edges to a leaf node i when T < ∞. Then, as long as the root node has a nonzero probability P_0 > 0 of being the source of a cascade, it can be shown that the incoherence condition reduces to the inequalities

  (1 − α_{0j}/(α_{0i} + α_{0j})) e^{−(α_{0i} + α_{0j}) T} + α_{0j}/(α_{0i} + α_{0j}) < 1 − ε (1 + e^{−α_{0i} T}),  j = 1, ..., p, j ≠ i,

which always hold for some ε > 0. If T → ∞, then the condition holds whenever ε < α_{0i}/(α_{0i} + max_{j: j≠i} α_{0j}). Here, the larger the ratio max_{j: j≠i} α_{0j}/α_{0i} is, the smaller the maximum value of ε for which the incoherence condition holds. To summarize, as long as P_0 > 0, there is always some ε > 0 for which the condition holds, and this ε value depends on the time window and the parameters α_{0j}.

Figure 2. Example networks: (a) chain, (b) tree, (c) star.

Remarks for Conditions 3 and 4. Well-known pairwise transmission likelihoods such as exponential, Rayleigh or power-law, used in previous work (Gomez-Rodriguez et al., 2011), satisfy Conditions 3 and 4.

6. Sample Complexity

How many cascades do we need to recover the network structure? We will answer this question by providing a sample complexity analysis of the optimization in Eq. 4. Given the conditions spelled out in Section 5, we can show that the number of cascades needs to grow polynomially in the number of true parents of a node, and depends only logarithmically on the size of the network. This is a positive result, since the network size can be very large (millions or billions), but the number of parents of a node is usually small compared to the network size. More specifically, for each individual node, we have the following result:

Theorem 2. Consider an instance of the continuous-time diffusion model with parameters α*_ji and associated edges E* such that the model satisfies Conditions 1-4, and let C^n be a set of n cascades drawn from the model. Suppose that the regularization parameter λ_n is selected to satisfy

  λ_n ≥ (8 k_3 (2 − ε)/ε) √(log p / n).     (8)

Then, there exist positive constants L and K, independent of (n, p, d), such that if

  n > L d^3 log p,     (9)

then the following properties hold with probability at least 1 − 2 exp(−K λ_n^2 n):
1. For each node i ∈ V, the ℓ1-regularized network inference problem defined in Eq. 4 has a unique solution, and thus uniquely specifies a set of incoming edges of node i.

2. For each node i ∈ V, the estimated set of incoming edges does not include any false edges and includes all true edges.

Furthermore, suppose that the finite sample Hessian matrix Q^n satisfies Conditions 1 and 2. Then, there exist positive constants L and K, independent of (n, p, d), such that the sample complexity can be improved to n > L d^2 log p, with the other statements remaining the same.

Remarks. The above sample complexity is proved for each node separately, for recovering its parents. Using a union bound, we can provide the sample complexity for recovering the entire network structure by joining these parent-child relations together. The resulting sample complexity and the choice of regularization parameters will remain largely the same, except that the dependency on d will change from d to d_max (the largest number of parents of a node), and the dependency on p will change from log p to 2 log N (N being the number of nodes in the network).

6.1. Outline of Analysis

The proof of Theorem 2 uses a technique called the primal-dual witness method, previously used in the proofs of sparsistency of the Lasso (Wainwright, 2009) and high-dimensional Ising model selection (Ravikumar et al., 2010). To the best of our knowledge, the present work is the first that uses this technique in the context of diffusion network inference. First, we show that the optimal solutions to Eq. 4 have a shared sparsity pattern and, under a further condition, the solution is unique (proven in Appendix D):

Lemma 3. Suppose that there exists an optimal primal-dual solution (α̂, μ̂) to Eq. 4 with an associated subgradient vector ẑ such that ||ẑ_{S^c}||_∞ < 1. Then, any optimal primal solution α̃ must have α̃_{S^c} = 0.
Moreover, if the Hessian sub-matrix Q^n_{SS} is strictly positive definite, then α̂ is the unique optimal solution.

Next, we will construct a primal-dual vector (α̂, μ̂) along with an associated subgradient vector ẑ. Furthermore, we will show that, under the assumptions on (n, p, d) stated in Theorem 2, our constructed solution satisfies the KKT optimality conditions of Eq. 4, and the primal vector has the same sparsity pattern as the true parameter α*, i.e.,

  α̂_j > 0, ∀ j : α*_j > 0,     (10)
  α̂_j = 0, ∀ j : α*_j = 0.     (11)

Then, based on Lemma 3, we can deduce that the optimal solution to Eq. 4 correctly recovers the sparsity pattern of α*, and thus the incoming edges to node i.

More specifically, we start by noting that a primal-dual optimal solution (α̃, μ̃) to Eq. 4 must satisfy the generalized Karush-Kuhn-Tucker (KKT) conditions (Boyd & Vandenberghe, 2004):

  0 ∈ ∇ℓ^n(α̃) + λ_n z̃ − μ̃,     (12)
  μ̃_j α̃_j = 0,     (13)
  μ̃_j ≥ 0,     (14)
  z̃_j = 1, ∀ α̃_j > 0,     (15)
  |z̃_j| ≤ 1, ∀ α̃_j = 0,     (16)

where ℓ^n(α̃) = −(1/n) Σ_{c ∈ C^n} log g(t^c; α̃) and z̃ denotes the subgradient of the ℓ1-norm. Suppose the true set of parents of node i is S. We construct the primal-dual vector (α̂, μ̂) and the associated subgradient vector ẑ in the following way:

1. We set α̂_S as the solution to the partial regularized maximum likelihood problem

  α̂_S = argmin_{(α_S, 0), α_S ≥ 0} {ℓ^n(α) + λ_n ||α_S||_1}.     (17)

Then, we set μ̂_S ≥ 0 as the dual solution associated with the primal solution α̂_S.

2. We set α̂_{S^c} = 0, so that condition (11) holds, and μ̂_{S^c} = μ*_{S^c} ≥ 0, where μ* is the optimal dual solution to the following problem:

  minimize_α  E_c[ℓ^n(α)]
  subject to  α_j ≥ 0, j = 1, ..., N, i ≠ j.     (18)

Thus, our construction satisfies condition (14).

3. We obtain ẑ_{S^c} from (12) by substituting in the constructed α̂, μ̂ and ẑ_S.
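The witness construction above can be mimicked numerically on a toy problem. The sketch below replaces the cascade likelihood ℓ^n with a quadratic surrogate 0.5 αᵀQα − bᵀα (an illustrative assumption; Q, b, S and λ are placeholders), solves the restricted problem on S in closed form, and checks the strict dual feasibility ||ẑ_{S^c}||_∞ < 1 required by Lemma 3:

```python
import numpy as np

# Toy quadratic surrogate: loss(a) = 0.5 a^T Q a - b^T a  (stands in for ell^n)
Q = np.array([[2.0, 0.3, 0.2, 0.1],
              [0.3, 2.0, 0.1, 0.2],
              [0.2, 0.1, 1.0, 0.0],
              [0.1, 0.2, 0.0, 1.0]])
b = np.array([2.0, 2.0, 0.3, 0.3])
S, Sc = np.array([0, 1]), np.array([2, 3])
lam = 0.2

# Step 1: restricted problem on S; with active subgradient z_S = 1 the
# stationarity condition Q_SS a_S - b_S + lam = 0 gives a closed form.
a_S = np.linalg.solve(Q[np.ix_(S, S)], b[S] - lam)
assert (a_S > 0).all()                      # KKT (10): correct signs on S

# Steps 2-3: set a_{S^c} = 0 and recover z_{S^c} from stationarity (12)
# (taking the nonnegativity duals mu = 0 in this interior toy example).
grad_Sc = Q[np.ix_(Sc, S)] @ a_S - b[Sc]    # gradient of the loss at (a_S, 0)
z_Sc = -grad_Sc / lam
strict_dual_feasibility = bool(np.abs(z_Sc).max() < 1)   # KKT (16)
```

When `strict_dual_feasibility` holds, Lemma 3 guarantees (for the real likelihood) that every optimal solution is supported on S, which is the crux of the proof outline above.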
Then, we only need to prove that, under the stated scalings of (n, p, d), with high probability, the remaining KKT conditions (10), (13), (15) and (16) hold.

For simplicity of exposition, we first assume that the dependency and incoherence conditions hold for the finite sample Hessian matrix Q^n. Later, we will lift this restriction and only place these conditions on the population Hessian matrix Q*. The following lemma shows that our constructed solution satisfies condition (10):

Lemma 4. Under Condition 3, if the regularization parameter is selected to satisfy

  √d λ_n ≤ C_min^2 / (6 (k_2 + 2 k_1 √C_max))  and  ||∇_S ℓ^n(α*)||_∞ ≤ λ_n / 4,

then ||α̂_S − α*_S||_2 ≤ 3 √d λ_n / C_min ≤ α*_min / 2, as long as α*_min ≥ 6 √d λ_n / C_min.

Based on this lemma, we can further show that the KKT conditions (13) and (15) also hold for the constructed solution. This can be trivially deduced from conditions (10) and (11) and our construction steps 1 and 2. Note that it also implies that μ̂_S = μ*_S = 0, and hence μ̂ = μ*.

Proving condition (16) is more challenging. We first provide more details on how to construct ẑ_{S^c}, mentioned in step 3. We start with a Taylor expansion of Eq. 12,

  Q^n (α̂ − α*) = −∇ℓ^n(α*) − λ_n ẑ + μ̂ − R^n,     (19)

where R^n is a remainder term with j-th entry

  R^n_j = [∇^2 ℓ^n(ᾱ^j) − ∇^2 ℓ^n(α*)]_j^T (α̂ − α*),

and ᾱ^j = θ_j α̂ + (1 − θ_j) α* with θ_j ∈ [0, 1], according to the mean value theorem. Rewriting Eq. 19 using block matrices (rows separated by semicolons),

  [Q^n_{SS}, Q^n_{S S^c}; Q^n_{S^c S}, Q^n_{S^c S^c}] [α̂_S − α*_S; α̂_{S^c} − α*_{S^c}] = −[∇_S ℓ^n(α*); ∇_{S^c} ℓ^n(α*)] − λ_n [ẑ_S; ẑ_{S^c}] + [μ̂_S; μ̂_{S^c}] − [R^n_S; R^n_{S^c}],

and, after some algebraic manipulation, we have

  λ_n ẑ_{S^c} = −∇_{S^c} ℓ^n(α*) + μ̂_{S^c} − R^n_{S^c} − Q^n_{S^c S} (Q^n_{SS})^{-1} [−∇_S ℓ^n(α*) − λ_n ẑ_S + μ̂_S − R^n_S].
Next, we upper bound ||ẑ_{S^c}||_∞ using the triangle inequality,

  ||ẑ_{S^c}||_∞ ≤ λ_n^{-1} ||μ*_{S^c} − ∇_{S^c} ℓ^n(α*)||_∞ + λ_n^{-1} ||R^n_{S^c}||_∞ + |||Q^n_{S^c S} (Q^n_{SS})^{-1}|||_∞ [1 + λ_n^{-1} ||R^n_S||_∞ + λ_n^{-1} ||μ*_S − ∇_S ℓ^n(α*)||_∞],

and we want to prove that this upper bound is smaller than 1. This can be done with the help of the following two lemmas (proven in Appendices F and G):

Lemma 5. Given ε ∈ (0, 1] from the incoherence condition, we have

  P[ ((2 − ε)/λ_n) ||∇ℓ^n(α*) − μ*||_∞ ≥ ε/4 ] ≤ 2 p exp(−n λ_n^2 ε^2 / (32 k_3^2 (2 − ε)^2)),

which converges to zero at rate exp(−c λ_n^2 n) as long as λ_n ≥ (8 k_3 (2 − ε)/ε) √(log p / n).

Lemma 6. Given ε ∈ (0, 1] from the incoherence condition, if Conditions 3 and 4 hold, λ_n is selected to satisfy

  λ_n d ≤ C_min^2 ε / (36 K (2 − ε)),  where K = k_1 + k_4 k_1 + k_1^2 + k_1 √C_max,

and ||∇_S ℓ^n(α*)||_∞ ≤ λ_n / 4, then

  ||R^n||_∞ / λ_n ≤ ε / (4 (2 − ε)),

as long as α*_min ≥ 6 √d λ_n / C_min.

Now, applying both lemmas and the incoherence condition on the finite sample Hessian matrix Q^n, we have

  ||ẑ_{S^c}||_∞ ≤ (1 − ε) + λ_n^{-1} (2 − ε) ||R^n||_∞ + λ_n^{-1} (2 − ε) ||μ* − ∇ℓ^n(α*)||_∞ ≤ (1 − ε) + 0.25 ε + 0.25 ε = 1 − 0.5 ε,

and thus condition (16) holds. A possible choice of the regularization parameter λ_n and cascade set size n such that the conditions of Lemmas 4-6 are satisfied is λ_n = 8 k_3 (2 − ε) ε^{-1} √(n^{-1} log p) and

  n > 288^2 k_3^2 (2 − ε)^4 C_min^{-4} ε^{-4} d^2 log p + (48 k_3 (2 − ε) C_min^{-1} (α*_min)^{-1} ε^{-1})^2 d log p.

Last, we lift the dependency and incoherence conditions imposed on the finite sample Hessian matrix Q^n. We show that, if we only impose these conditions on the corresponding population matrix Q*, then they will also hold for Q^n with high probability (proven in Appendices H and I).
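For concreteness, the regularization choice and the two-term cascade budget above reduce to a small calculation. The constants below (k_3, C_min, α*_min, ε) are illustrative placeholders, not values estimated from any data:

```python
import math

def lambda_n(n, p, eps, k3):
    """Regularization choice lambda_n = 8 k3 (2 - eps) / eps * sqrt(log(p)/n)."""
    return 8 * k3 * (2 - eps) / eps * math.sqrt(math.log(p) / n)

def cascade_budget(p, d, eps, k3, C_min, alpha_min):
    """Sum of the two lower-bound terms on n accompanying Lemmas 4-6."""
    t1 = 288**2 * k3**2 * (2 - eps)**4 * C_min**-4 * eps**-4 * d**2 * math.log(p)
    t2 = (48 * k3 * (2 - eps) / (C_min * alpha_min * eps))**2 * d * math.log(p)
    return t1 + t2

# Illustrative constants only: p = 128 nodes, d = 3 parents
n_req = cascade_budget(p=128, d=3, eps=0.5, k3=1.0, C_min=1.0, alpha_min=0.5)
```

The calculation makes the scaling behavior tangible: the budget grows quadratically in d and only logarithmically in p, while λ_n shrinks as 1/√n.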
Lemma 7. If Condition 1 holds for Q*, then, for any δ > 0,

  P(Λ_min(Q^n_{SS}) ≤ C_min − δ) ≤ 2 d B_1 exp(−A_1 δ^2 n / d^2),
  P(Λ_max(Q^n_{SS}) ≥ C_max + δ) ≤ 2 d B_2 exp(−A_2 δ^2 n / d^2),

where A_1, A_2, B_1 and B_2 are constants independent of (n, p, d).

Lemma 8. If |||Q*_{S^c S} (Q*_{SS})^{-1}|||_∞ ≤ 1 − ε, then

  P( |||Q^n_{S^c S} (Q^n_{SS})^{-1}|||_∞ ≥ 1 − ε/2 ) ≤ p exp(−K n / d^3),

where K is a constant independent of (n, p, d).

Note that, in this case, the cascade set size needs to increase to n > L d^3 log p, where L is a sufficiently large positive constant independent of (n, p, d), for the error probabilities in these last two lemmas to converge to zero.

7. Efficient Soft-Thresholding Algorithm

Can we design efficient algorithms to solve Eq. 4 for network recovery? Here, we will design a proximal gradient algorithm, which is well suited for solving non-smooth, constrained, large-scale or high-dimensional convex optimization problems (Parikh & Boyd, 2013). Moreover, such algorithms are easy to understand, derive, and implement. We first rewrite Eq. 4 as an unconstrained optimization problem:

  minimize_α  ℓ^n(α) + g(α),

where the non-smooth convex function g(α) = λ_n ||α||_1 if α ≥ 0 and +∞ otherwise. Here, the general recipe from Parikh & Boyd (2013) for designing proximal gradient algorithms can be applied directly.

Algorithm 1: ℓ1-regularized network inference
  Require: C^n, λ_n, K, L
  for all i ∈ V do
    k = 0
    while k < K do
      α_i^{k+1} = (α_i^k − L ∇_{α_i} ℓ^n(α_i^k) − λ_n L)_+
      k = k + 1
    end while
    α̂_i = α_i^{K−1}
  end for
  return {α̂_i}_{i ∈ V}

Algorithm 1 summarizes the resulting algorithm. In each iteration, we need to compute ∇ℓ^n (Table 1) and the proximal operator prox_{L_k g}(v), where L_k is a step size that we can set to a constant value L or find using a simple line search (Beck & Teboulle, 2009).
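The loop of Algorithm 1 can be sketched directly with the nonnegative soft-thresholding update. The sketch below stands in a toy least-squares loss for the cascade log-likelihood ℓ^n (an illustrative assumption; the real gradient comes from Table 1), with fixed step size L and K iterations:

```python
import numpy as np

def soft_threshold_nonneg(v, t):
    """Prox of the nonnegative l1 penalty with threshold t: (v - t)_+."""
    return np.maximum(v - t, 0.0)

def infer_incoming_edges(grad, alpha0, lam, L, K):
    """Proximal-gradient loop in the spirit of Algorithm 1, for one node i.
    grad: gradient oracle of the smooth loss (stand-in for grad ell^n)."""
    alpha = alpha0.copy()
    for _ in range(K):
        alpha = soft_threshold_nonneg(alpha - L * grad(alpha), lam * L)
    return alpha

# Toy smooth loss 0.5/n ||X a - y||^2 standing in for the cascade likelihood.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
a_true = np.zeros(10)
a_true[[1, 4]] = [1.0, 0.5]                      # two "parent" edges
y = X @ a_true + 0.01 * rng.normal(size=200)
grad = lambda a: X.T @ (X @ a - y) / len(y)
a_hat = infer_incoming_edges(grad, np.zeros(10), lam=0.05, L=0.5, K=500)
# a_hat is nonnegative and sparse: the prox zeroes the non-parent entries
```

The update itself is the whole algorithm: a gradient step followed by a shrink-and-clip, which is why the method scales so easily and returns sparse estimates without any post-processing.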
Using Moreau's decomposition and the conjugate function $g^*$, it is easy to show that the proximal operator for our particular function $g(\cdot)$ is a soft-thresholding operator, $(v - \lambda_n L_k)_+$, which leads to a sparse optimal solution $\hat{\alpha}$, as desired.

8. Experiments

In this section, we first illustrate some consequences of Th. 2 by applying our algorithm to several types of networks, parameters $(n, p, d)$, and regularization parameters $\lambda_n$. Then, we compare our algorithm to two state-of-the-art algorithms: NETRATE (Gomez-Rodriguez et al., 2011) and First-Edge (Abrahao et al., 2013).

[Figure 3: Success probability vs. number of cascades. (a) Different super-neighborhood sizes $p_i$; (b) different scalings $K$ of the regularization parameter $\lambda_n$.]

Experimental setup. We focus on synthetic networks that mimic the structure of real-world diffusion networks, in particular, social networks. We consider two models of directed real-world social networks: the Forest Fire model (Barabási & Albert, 1999) and the Kronecker Graph model (Leskovec et al., 2010), and use simple pairwise transmission models such as exponential, power-law or Rayleigh. We use networks with 128 nodes and, for each edge, we draw its associated transmission rate from a uniform distribution $U(0.5, 1.5)$. We proceed as follows: we generate a network $G^*$ and transmission rates $A^*$, simulate a set of cascades and, for each cascade, record the node infection times. Then, given the infection times, we infer a network $\hat{G}$. Finally, when we illustrate the consequences of Th. 2, we evaluate the accuracy of the inferred neighborhood of a node, $\hat{\mathcal{N}}^-(i)$, using the probability of success $P(\hat{E} = E^*)$, estimated by running our method on 100 independent cascade sets.
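The evaluation protocol just described can be sketched as a simple Monte Carlo loop. Here `infer_edges` and `sample_cascades` are hypothetical stand-ins for the inference method and the cascade simulator; the mock inference below merely exercises the loop:

```python
import random

def success_probability(true_edges, infer_edges, sample_cascades, trials=100):
    """Estimate P(E_hat = E*) over independently drawn cascade sets."""
    hits = 0
    for _ in range(trials):
        cascades = sample_cascades()             # draw a fresh cascade set
        if infer_edges(cascades) == true_edges:  # exact recovery of E*?
            hits += 1
    return hits / trials

random.seed(0)
E_star = {(0, 1), (1, 2)}
# mock inference that recovers E* in roughly 80% of the trials
mock_infer = lambda cascades: E_star if random.random() < 0.8 else {(0, 1)}
p_hat = success_probability(E_star, mock_infer, lambda: None, trials=2000)
```

Note that success here means *exact* recovery of the edge set, which is why the metric is stricter than edge-wise precision or recall.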
When we compare our algorithm to NETRATE and First-Edge, we use the $F_1$ score, defined as $2PR/(P+R)$, where precision ($P$) is the fraction of edges in the inferred network $\hat{G}$ that are present in the true network $G^*$, and recall ($R$) is the fraction of edges of the true network $G^*$ that are present in the inferred network $\hat{G}$.

Parameters $(n, p, d)$. According to Th. 2, the number of cascades necessary to successfully infer the incoming edges of a node grows polynomially in the node's neighborhood size $d_i$ and logarithmically in the super-neighborhood size $p_i$. Here, we infer the incoming links of nodes of a hierarchical Kronecker network with the same in-degree ($d_i = 3$) but different super-neighborhood set sizes $p_i$ under different scalings $\beta$ of the number of cascades $n = 10\beta\, d\log p$, and choose the regularization parameter $\lambda_n$ as a constant factor of $\sqrt{\log(p)/n}$, as suggested by Th. 2. We used an exponential transmission model and $T = 5$. Fig. 3(a) summarizes the results, where, for each node, we used cascades that contained at least one node in the super-neighborhood of the node under study. As predicted by Th. 2, very different $p$ values lead to curves that line up with each other quite well.

Regularization parameter $\lambda_n$. Our main result indicates that the regularization parameter $\lambda_n$ should be a constant factor of $\sqrt{\log(p)/n}$.

[Figure 4: $F_1$-score vs. number of cascades. (a) Kronecker hierarchical, POW; (b) Forest Fire, EXP.]

Fig. 3(b) shows the success probability of our algorithm against different scalings $K$ of the regularization parameter $\lambda_n = K\sqrt{\log(p)/n}$ for different types of networks using 150 cascades and $T = 5$. We find that for sufficiently large $\lambda_n$, the success probability flattens, as expected from Th. 2.
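The $F_1$ computation defined above is straightforward over edge sets; a sketch (edges represented as `(source, target)` pairs, the example sets hypothetical):

```python
def f1_score(true_edges, inferred_edges):
    """F1 = 2PR/(P+R) over edge sets, as used in the comparison."""
    if not inferred_edges or not true_edges:
        return 0.0
    tp = len(true_edges & inferred_edges)     # correctly inferred edges
    precision = tp / len(inferred_edges)      # fraction of G_hat edges in G*
    recall = tp / len(true_edges)             # fraction of G* edges in G_hat
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

G_star = {(0, 1), (1, 2), (2, 3), (3, 0)}
G_hat = {(0, 1), (1, 2), (1, 3)}
# precision = 2/3, recall = 2/4, so F1 = 4/7
score = f1_score(G_star, G_hat)
```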
The success probability flattens at values smaller than one because we used a fixed number of cascades $n$, which may not satisfy the conditions of Th. 2.

Comparison with NETRATE and First-Edge. Fig. 4 compares the accuracy of our algorithm, NETRATE and First-Edge against the number of cascades for a hierarchical Kronecker network with a power-law transmission model and a Forest Fire network with an exponential transmission model, with an observation window $T = 10$. Our method outperforms both competing methods; the advantage with respect to First-Edge is especially striking.

9. Conclusions

Our work contributes towards establishing a theoretical foundation for the network inference problem. Specifically, we proposed an $\ell_1$-regularized maximum likelihood inference method for a well-known continuous-time diffusion model, together with an efficient proximal gradient implementation, and then showed that, for general networks satisfying a natural incoherence condition, our method achieves an exponentially decreasing error with respect to the number of cascades as long as $O(d^3\log N)$ cascades are recorded.

Our work also opens many interesting avenues for future work. For example, given a fixed number of cascades, it would be useful to provide confidence intervals on the inferred edges. Further, given a network with arbitrary pairwise likelihoods, it is an open question whether there always exists at least one source distribution and time-window value such that the incoherence condition is satisfied and, if so, whether there is an efficient way of finding this distribution. Finally, our work assumes all activations occur due to network diffusion and are recorded. It would be interesting to allow for missing observations, as well as activations due to exogenous factors.

Acknowledgements. This research was supported in part by NSF/NIH BIGDATA 1R01GM108341-01, NSF IIS1116886, and a Raytheon faculty fellowship to L. Song.
References

Abrahao, B., Chierichetti, F., Kleinberg, R., and Panconesi, A. Trace complexity of network inference. In KDD, 2013.

Adar, E. and Adamic, L. A. Tracking information epidemics in blogspace. In Web Intelligence, pp. 207-214, 2005.

Barabási, A.-L. and Albert, R. Emergence of scaling in random networks. Science, 286:509-512, 1999.

Beck, A. and Teboulle, M. Gradient-based algorithms with applications to signal recovery. In Convex Optimization in Signal Processing and Communications, 2009.

Boyd, S. P. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Du, N., Song, L., Smola, A., and Yuan, M. Learning networks of heterogeneous influence. In NIPS, 2012a.

Du, N., Song, L., Woo, H., and Zha, H. Uncover topic-sensitive information diffusion networks. In AISTATS, 2012b.

Gomez-Rodriguez, M., Leskovec, J., and Krause, A. Inferring networks of diffusion and influence. In KDD, 2010.

Gomez-Rodriguez, M., Balduzzi, D., and Schölkopf, B. Uncovering the temporal dynamics of diffusion networks. In ICML, 2011.

Gomez-Rodriguez, M. Ph.D. Thesis. Stanford University & MPI for Intelligent Systems, 2013.

Gripon, V. and Rabbat, M. Reconstructing a graph from path traces. 2013.

Kempe, D., Kleinberg, J. M., and Tardos, É. Maximizing the spread of influence through a social network. In KDD, 2003.

Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., and Ghahramani, Z. Kronecker graphs: An approach to modeling networks. JMLR, 2010.

Mangasarian, O. L. A simple characterization of solution sets of convex programs. Operations Research Letters, 7(1):21-26, 1988.

Netrapalli, P. and Sanghavi, S. Finding the graph of epidemic cascades. In ACM SIGMETRICS, 2012.

Newey, W. K. and McFadden, D. L. Large sample estimation and hypothesis testing. In Handbook of Econometrics, volume 4, pp. 2111-2245, 1994.

Parikh, N. and Boyd, S.
Proximal algorithms. Foundations and Trends in Optimization, 2013.

Ravikumar, P., Wainwright, M. J., and Lafferty, J. D. High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. The Annals of Statistics, 38(3):1287-1319, 2010.

Rogers, E. M. Diffusion of Innovations. Free Press, New York, fourth edition, 1995.

Saito, K., Kimura, M., Ohara, K., and Motoda, H. Learning continuous-time information diffusion model for social behavioral data analysis. Advances in Machine Learning, pp. 322-337, 2009.

Snowsill, T., Fyson, N., De Bie, T., and Cristianini, N. Refining causality: Who copied from whom? In KDD, 2011.

Wainwright, M. J. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183-2202, 2009.

Wang, L., Ermon, S., and Hopcroft, J. Feature-enhanced probabilistic models for diffusion network inference. In ECML PKDD, 2012.

A. Proof of Lemma 9

Lemma 9. Given log-concave survival functions and concave hazard functions in the parameter(s) of the pairwise transmission likelihoods, a sufficient condition for the Hessian matrix $Q^n$ to be positive definite is that the hazard matrix $X^n(\alpha)$ is non-singular.

Proof. Using Eq. 5, the Hessian matrix can be expressed as the sum of two matrices, $D^n(\alpha)$ and $X^n(\alpha)X^n(\alpha)^\top$. The matrix $D^n(\alpha)$ is trivially positive semidefinite by log-concavity of the survival functions and concavity of the hazard functions. The matrix $X^n(\alpha)X^n(\alpha)^\top$ is positive definite, since $X^n(\alpha)$ is full rank by assumption. Then, the Hessian matrix is positive definite, since it is the sum of a positive semidefinite matrix and a positive definite matrix.

B.
Proof of Lemma 10

Lemma 10. If the source probability $P(s)$ is strictly positive for all $s \in \mathcal{R}$, then, for an arbitrarily large number of cascades $n \to \infty$, there exists an ordering of the nodes and cascades within the cascade set such that the hazard matrix $X^n(\alpha)$ is non-singular.

Proof. In this proof, we find a labeling of the nodes (row indices in $X^n(\alpha)$) and an ordering of the cascades (column indices in $X^n(\alpha)$) such that, for an arbitrarily large number of cascades, we can express the matrix $X^n(\alpha)$ as $[T\;B]$, where $T \in \mathbb{R}^{p\times p}$ is upper triangular with nonzero diagonal elements and $B \in \mathbb{R}^{p\times(n-p)}$. Therefore, $X^n(\alpha)$ has full rank (rank $p$). We proceed first by sorting nodes in $\mathcal{R}$ and then continue by sorting nodes in $\mathcal{U}$:

• Nodes in $\mathcal{R}$: For each node $u \in \mathcal{R}$, consider the set of cascades $C_u$ in which $u$ was a source and $i$ got infected. Then, rank each node $u$ according to the earliest position in which node $i$ got infected across all cascades in $C_u$, in decreasing order, breaking ties at random. For example, if a node $u$ was, at least once, the source of a cascade in which node $i$ got infected just after the source, but, in contrast, node $v$ was never the source of a cascade in which node $i$ got infected second, then node $u$ will have a lower index than node $v$. Then, assign row $k$ in the matrix $X^n(\alpha)$ to the node in position $k$, and assign the first $d$ columns to the corresponding cascades in which node $i$ got infected earliest. In this ordering, $X^n(\alpha)_{mk} = 0$ for all $m < k$ and $X^n(\alpha)_{kk} \neq 0$.

• Nodes in $\mathcal{U}$: Proceed similarly as in the first step, and assign these nodes rows $d+1$ to $p$. Moreover, we assign columns $d+1$ to $p$ to the corresponding cascades in which node $i$ got infected earliest. Again, this ordering satisfies $X^n(\alpha)_{mk} = 0$ for all $m < k$ and $X^n(\alpha)_{kk} \neq 0$.

Finally, the remaining $n - p$ columns can be assigned to the remaining cascades at random.
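The structural claim used here, namely that any matrix of the form $[T\;B]$ with $T$ upper triangular and nonzero on the diagonal has full row rank $p$ whatever $B$ contains, is easy to check numerically. The dimensions and random entries below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 12

# T: upper triangular with nonzero diagonal, as produced by the node/cascade ordering
T = np.triu(rng.uniform(0.5, 1.5, size=(p, p)))
# B: the remaining n - p columns, filled arbitrarily
B = rng.uniform(0.0, 1.0, size=(p, n - p))

X = np.hstack([T, B])  # plays the role of the hazard matrix X^n(alpha)
assert np.linalg.matrix_rank(X) == p
# full row rank implies X X^T is positive definite, as used in Lemma 9
assert np.all(np.linalg.eigvalsh(X @ X.T) > 0)
```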
This ordering leads to the desired structure $[T\;B]$, and thus $X^n(\alpha)$ is non-singular.

C. Proof of Eq. 7

If the hazard vector $X(t^c;\alpha)$ is Lipschitz continuous in the domain $\{\alpha : \alpha_S \ge \frac{\alpha^*_{\min}}{2}\}$,
$$\|X(t^c;\beta) - X(t^c;\alpha)\|_2 \le k_1\|\beta - \alpha\|_2,$$
where $k_1$ is some positive constant, then we can bound the spectral norm of the difference, $\frac{1}{\sqrt{n}}(X^n(\beta) - X^n(\alpha))$, in the same domain as follows:
$$\left\|\frac{1}{\sqrt{n}}\left(X^n(\beta) - X^n(\alpha)\right)\right\|_2 = \max_{\|u\|_2=1}\frac{1}{\sqrt{n}}\left\|u^\top\left(X^n(\beta) - X^n(\alpha)\right)\right\|_2 = \max_{\|u\|_2=1}\frac{1}{\sqrt{n}}\sqrt{\sum_{c=1}^n\left\langle u,\, X(t^c;\beta) - X(t^c;\alpha)\right\rangle^2} \le \frac{1}{\sqrt{n}}\sqrt{k_1^2\,n\,\|u\|_2^2\,\|\beta - \alpha\|_2^2} \le k_1\|\beta - \alpha\|_2.$$

D. Proof of Lemma 3

By Lagrangian duality, the regularized network inference problem defined in Eq. 4 is equivalent to the following constrained optimization problem:
$$\text{minimize}_{\alpha_i}\;\; \ell^n(\alpha_i) \quad \text{subject to}\;\; \alpha_{ji} \ge 0,\; j = 1,\dots,N,\; j \neq i, \quad \|\alpha_i\|_1 \le C(\lambda_n), \qquad (20)$$
where $C(\lambda_n) < \infty$ is a positive constant. In this alternative formulation, $\lambda_n$ is the Lagrange multiplier for the second constraint. Since $\lambda_n$ is strictly positive, the constraint is active at any optimal solution, and thus $\|\alpha_i\|_1$ is constant across all optimal solutions. Using that $\ell^n(\alpha_i)$ is a differentiable convex function by assumption and that $\{\alpha : \alpha_{ji} \ge 0,\; \|\alpha_i\|_1 \le C(\lambda_n)\}$ is a convex set, we have that $\nabla\ell^n(\alpha_i)$ is constant across optimal primal solutions (Mangasarian, 1988). Moreover, any optimal primal-dual pair of the original problem must satisfy the KKT conditions of the alternative formulation defined by Eq. 20; in particular,
$$\nabla\ell^n(\alpha_i) = -\lambda_n z + \mu,$$
where $\mu \ge 0$ are the Lagrange multipliers associated with the non-negativity constraints and $z$ denotes the subgradient of the $\ell_1$-norm. Consider the solution $\hat{\alpha}$ such that $\|\hat{z}_{S^c}\|_\infty < 1$ and thus $\nabla_{\alpha_{S^c}}\ell^n(\hat{\alpha}_i) = -\lambda_n\hat{z}_{S^c} + \hat{\mu}_{S^c}$.
Now, assume there is an optimal primal solution $\tilde{\alpha}$ such that $\tilde{\alpha}_{ji} > 0$ for some $j \in S^c$. Then, using that the gradient must be constant across optimal solutions, it should hold that $-\lambda_n\hat{z}_j + \hat{\mu}_j = -\lambda_n$, where $\tilde{\mu}_j = 0$ by complementary slackness, which implies $\hat{\mu}_j = -\lambda_n(1 - \hat{z}_j) < 0$. Since $\hat{\mu}_j \ge 0$ by assumption, this leads to a contradiction. Hence, any primal solution $\tilde{\alpha}$ must satisfy $\tilde{\alpha}_{S^c} = 0$ for the gradient to be constant across optimal solutions. Finally, since $\alpha_{S^c} = 0$ for all optimal solutions, we can consider the restricted optimization problem defined in Eq. 17. If the Hessian sub-matrix $[\nabla^2\ell^n(\hat{\alpha})]_{SS}$ is strictly positive definite, then this restricted optimization problem is strictly convex and the optimal solution must be unique.

E. Proof of Lemma 4

To prove this lemma, we first construct a function
$$G(u_S) := \ell^n(\alpha^*_S + u_S) - \ell^n(\alpha^*_S) + \lambda_n\left(\|\alpha^*_S + u_S\|_1 - \|\alpha^*_S\|_1\right),$$
whose domain is restricted to the convex set $\mathcal{U} = \{u_S : \alpha^*_S + u_S \ge 0\}$. By construction, $G(u_S)$ has the following properties:

1. It is convex with respect to $u_S$.
2. Its minimum is attained at $\hat{u}_S := \hat{\alpha}_S - \alpha^*_S$; that is, $G(\hat{u}_S) \le G(u_S)$ for all $u_S \neq \hat{u}_S$.
3. $G(\hat{u}_S) \le G(0) = 0$.

Based on properties 1 and 3, we deduce that any point in the segment $\mathcal{L} := \{\tilde{u}_S : \tilde{u}_S = t\hat{u}_S + (1-t)0,\; t \in [0,1]\}$ connecting $\hat{u}_S$ and $0$ satisfies $G(\tilde{u}_S) \le 0$; that is,
$$G(\tilde{u}_S) = G(t\hat{u}_S + (1-t)0) \le tG(\hat{u}_S) + (1-t)G(0) \le 0.$$
Next, we find a sphere centered at $0$ with strictly positive radius $B$, $\mathcal{S}(B) := \{u_S : \|u_S\|_2 = B\}$, such that $G(u_S) > 0$ (strictly positive) on $\mathcal{S}(B)$. We note that this sphere $\mathcal{S}(B)$ cannot intersect the segment $\mathcal{L}$, since the two sets have strictly different function values.
Furthermore, the only possible configuration is that the segment is contained entirely inside the sphere, leading us to conclude that the end point $\hat{u}_S := \hat{\alpha}_S - \alpha^*_S$ is also within the sphere; that is, $\|\hat{\alpha}_S - \alpha^*_S\|_2 \le B$. In the following, we provide details on finding a suitable $B$, which will be a function of the regularization parameter $\lambda_n$ and the neighborhood size $d$. More specifically, we start by applying a Taylor series expansion and the mean value theorem:
$$G(u_S) = \nabla_S\ell^n(\alpha^*_S)^\top u_S + u_S^\top\nabla^2_{SS}\ell^n(\alpha^*_S + bu_S)u_S + \lambda_n\left(\|\alpha^*_S + u_S\|_1 - \|\alpha^*_S\|_1\right), \qquad (21)$$
where $b \in [0,1]$. We will show that $G(u_S) > 0$ by bounding each term of the above equation separately. We bound the absolute value of the first term using the assumption on the gradient $\nabla_S\ell^n(\cdot)$:
$$\left|\nabla_S\ell^n(\alpha^*_S)^\top u_S\right| \le \|\nabla_S\ell^n\|_\infty\|u_S\|_1 \le \|\nabla_S\ell^n\|_\infty\sqrt{d}\,\|u_S\|_2 \le \tfrac{1}{4}\lambda_n B\sqrt{d}. \qquad (22)$$
We bound the absolute value of the last term using the reverse triangle inequality:
$$\lambda_n\left|\|\alpha^*_S + u_S\|_1 - \|\alpha^*_S\|_1\right| \le \lambda_n\|u_S\|_1 \le \lambda_n\sqrt{d}\,\|u_S\|_2. \qquad (23)$$
Bounding the remaining middle term is more challenging. We start by rewriting the Hessian as a sum of two matrices, using Eq. 5:
$$q = \min_{u_S}\; u_S^\top D^n_{SS}(\alpha^*_S + bu_S)u_S + n^{-1}u_S^\top X^n_S(\alpha^*_S + bu_S)X^n_S(\alpha^*_S + bu_S)^\top u_S = \min_{u_S}\; u_S^\top D^n_{SS}(\alpha^*_S + bu_S)u_S + n^{-1}\left\|u_S^\top X^n_S(\alpha^*_S + bu_S)\right\|_2^2.$$
Now, we introduce two additional quantities,
$$\Delta D^n_{SS} = D^n_{SS}(\alpha^*_S + bu_S) - D^n_{SS}(\alpha^*_S), \qquad \Delta X^n_S = X^n_S(\alpha^*_S + bu_S) - X^n_S(\alpha^*_S),$$
and rewrite $q$ as
$$q = \min_{u_S}\; u_S^\top D^n_{SS}(\alpha^*_S)u_S + n^{-1}\left\|u_S^\top X^n_S(\alpha^*_S)\right\|_2^2 + n^{-1}\left\|u_S^\top\Delta X^n_S\right\|_2^2 + u_S^\top\Delta D^n_{SS}u_S + 2n^{-1}\left\langle u_S^\top X^n_S(\alpha^*_S),\, u_S^\top\Delta X^n_S\right\rangle.$$
Next, we use the dependency condition:
$$q \ge C_{\min}B^2 - \underbrace{\max_{u_S}\left|u_S^\top\Delta D^n_{SS}u_S\right|}_{T_1} - 2\underbrace{\max_{u_S}\left|n^{-1}\left\langle u_S^\top X^n_S(\alpha^*_S),\, u_S^\top\Delta X^n_S\right\rangle\right|}_{T_2},$$
and proceed to bound $T_1$ and $T_2$ separately. First, we bound $T_1$ using the Lipschitz condition:
$$|T_1| = \Big|\sum_{k\in S}u_k^2\left[D^n_k(\alpha^*_S + bu_S) - D^n_k(\alpha^*_S)\right]\Big| \le \sum_{k\in S}u_k^2\,k_2\|bu_S\|_2 \le k_2B^3.$$
Then, we use the dependency condition, the Lipschitz condition and the Cauchy-Schwarz inequality to bound $T_2$:
$$T_2 \le \frac{1}{\sqrt{n}}\left\|u_S^\top X^n_S(\alpha^*_S)\right\|_2\frac{1}{\sqrt{n}}\left\|u_S^\top\Delta X^n_S\right\|_2 \le \sqrt{C_{\max}}\,B\,\frac{1}{\sqrt{n}}\left\|u_S^\top\Delta X^n_S\right\|_2 \le \sqrt{C_{\max}}\,B\,\|u_S\|_2\,\frac{1}{\sqrt{n}}\left\|\Delta X^n_S\right\|_2 \le \sqrt{C_{\max}}\,B^2\,k_1\|bu_S\|_2 \le k_1\sqrt{C_{\max}}\,B^3,$$
where we note that applying the Lipschitz condition implies assuming $B < \frac{\alpha^*_{\min}}{2}$. Next, we incorporate the bounds on $T_1$ and $T_2$ to lower bound $q$:
$$q \ge C_{\min}B^2 - \left(k_2 + 2k_1\sqrt{C_{\max}}\right)B^3. \qquad (24)$$
Now, we set $B = K\lambda_n\sqrt{d}$, where $K$ is a constant that we will set later in the proof, and select the regularization parameter $\lambda_n$ to satisfy $\lambda_n\sqrt{d} \le 0.5\,C_{\min}/\big(K(k_2 + 2k_1\sqrt{C_{\max}})\big)$. Then,
$$G(u_S) \ge -\tfrac{1}{4}\lambda_n\sqrt{d}\,B + 0.5\,C_{\min}B^2 - \lambda_n\sqrt{d}\,B \ge B\left(0.5\,C_{\min}B - 1.25\lambda_n\sqrt{d}\right) \ge B\left(0.5\,C_{\min}K\lambda_n\sqrt{d} - 1.25\lambda_n\sqrt{d}\right).$$
In the last step, we set the constant $K = 3C_{\min}^{-1}$, and we have $G(u_S) \ge 0.25\,B\lambda_n\sqrt{d} > 0$, as long as
$$\sqrt{d}\,\lambda_n \le \frac{C_{\min}^2}{6\left(k_2 + 2k_1\sqrt{C_{\max}}\right)}, \qquad \alpha^*_{\min} \ge \frac{6\lambda_n\sqrt{d}}{C_{\min}}.$$
Finally, convexity of $G(u_S)$ yields $\|\hat{\alpha}_S - \alpha^*_S\|_2 \le 3\lambda_n\sqrt{d}/C_{\min} \le \frac{\alpha^*_{\min}}{2}$.

F. Proof of Lemma 5

Define $z^c_j = [\nabla g(t^c;\alpha^*)]_j$ and $z_j = \frac{1}{n}\sum_c z^c_j$. Now, using the KKT conditions and condition 4 (boundedness), we have that $\mu^*_j = \mathbb{E}_c\{z^c_j\}$ and $|z^c_j| \le k_3$, respectively. Thus, Hoeffding's inequality yields
$$P\left(|z_j - \mu^*_j| > \frac{\lambda_n\varepsilon}{4(2-\varepsilon)}\right) \le 2\exp\left(-\frac{n\lambda_n^2\varepsilon^2}{32k_3^2(2-\varepsilon)^2}\right)$$
, and then,
$$P\left(\|z - \mu^*\|_\infty > \frac{\lambda_n\varepsilon}{4(2-\varepsilon)}\right) \le 2\exp\left(-\frac{n\lambda_n^2\varepsilon^2}{32k_3^2(2-\varepsilon)^2} + \log p\right).$$

G. Proof of Lemma 6

We start by factorizing the Hessian matrix, using Eq. 5:
$$R^n_j = \left[\nabla^2\ell^n(\bar{\alpha}^j) - \nabla^2\ell^n(\alpha^*)\right]_j^\top(\hat{\alpha} - \alpha^*) = \omega^n_j + \delta^n_j,$$
where
$$\omega^n_j = \left[D^n(\bar{\alpha}^j) - D^n(\alpha^*)\right]_j^\top(\hat{\alpha} - \alpha^*), \qquad \delta^n_j = \frac{1}{n}V^n_j(\hat{\alpha} - \alpha^*), \qquad V^n_j = [X^n(\bar{\alpha}^j)]_j X^n(\bar{\alpha}^j)^\top - [X^n(\alpha^*)]_j X^n(\alpha^*)^\top.$$
Next, we proceed to bound each term separately. Since $[\bar{\alpha}^j]_S = \theta_j\hat{\alpha}_S + (1-\theta_j)\alpha^*_S$, where $\theta_j \in [0,1]$, and $\|\hat{\alpha}_S - \alpha^*_S\|_\infty \le \frac{\alpha^*_{\min}}{2}$ (Lemma 4), it holds that $[\bar{\alpha}^j]_S \ge \frac{\alpha^*_{\min}}{2}$. Then, we can use condition 3 (Lipschitz continuity) to bound $\omega^n_j$:
$$|\omega^n_j| \le k_1\|\bar{\alpha}^j - \alpha^*\|_2\|\hat{\alpha} - \alpha^*\|_2 \le k_1\theta_j\|\hat{\alpha} - \alpha^*\|_2^2 \le k_1\|\hat{\alpha} - \alpha^*\|_2^2. \qquad (25)$$
However, bounding the term $\delta^n_j$ is more difficult. Let us start by rewriting $\delta^n_j$ as
$$\delta^n_j = \frac{1}{n}\left(\Lambda_1 + \Lambda_2 + \Lambda_3\right)(\hat{\alpha} - \alpha^*),$$
where
$$\Lambda_1 = [X^n(\alpha^*)]_j\left(X^n(\bar{\alpha}^j)^\top - X^n(\alpha^*)^\top\right), \quad \Lambda_2 = \left\{[X^n(\bar{\alpha}^j)]_j - [X^n(\alpha^*)]_j\right\}\left(X^n(\bar{\alpha}^j)^\top - X^n(\alpha^*)^\top\right), \quad \Lambda_3 = \left([X^n(\bar{\alpha}^j)]_j - [X^n(\alpha^*)]_j\right)X^n(\alpha^*)^\top.$$
Next, we bound each term separately. For the first term, we first apply the Cauchy-Schwarz inequality,
$$\left|\Lambda_1(\hat{\alpha} - \alpha^*)\right| \le \left\|[X^n(\alpha^*)]_j\right\|_2\left\|X^n(\bar{\alpha}^j)^\top - X^n(\alpha^*)^\top\right\|_2\|\hat{\alpha} - \alpha^*\|_2,$$
and then use condition 3 (Lipschitz continuity) and condition 4 (boundedness):
$$\left|\Lambda_1(\hat{\alpha} - \alpha^*)\right| \le nk_4k_1\|\bar{\alpha}^j - \alpha^*\|_2\|\hat{\alpha} - \alpha^*\|_2 \le nk_4k_1\|\hat{\alpha} - \alpha^*\|_2^2.$$
For the second term, we also start by applying the Cauchy-Schwarz inequality,

[Figure 5: Success probability vs. number of cascades for different super-neighborhood sizes $p_i$: (a) chain ($d_i = 1$); (b) stars with different numbers of leaves ($d_i = 1$); (c) tree ($d_i = 3$).]

$$\left|\Lambda_2(\hat{\alpha} - \alpha^*)\right| \le \left\|[X^n(\bar{\alpha}^j)]_j - [X^n(\alpha^*)]_j\right\|_2\left\|X^n(\bar{\alpha}^j)^\top - X^n(\alpha^*)^\top\right\|_2\|\hat{\alpha} - \alpha^*\|_2,$$
and then use condition 3 (Lipschitz continuity):
$$\left|\Lambda_2(\hat{\alpha} - \alpha^*)\right| \le nk_1^2\|\hat{\alpha} - \alpha^*\|_2^2.$$
Last, for the third term, once more we start by applying the Cauchy-Schwarz inequality,
$$\left|\Lambda_3(\hat{\alpha} - \alpha^*)\right| \le \left\|[X^n(\bar{\alpha}^j)]_j - [X^n(\alpha^*)]_j\right\|_2\left\|X^n(\alpha^*)^\top\right\|_2\|\hat{\alpha} - \alpha^*\|_2,$$
and then apply condition 1 (dependency condition) and condition 3 (Lipschitz continuity):
$$\left|\Lambda_3(\hat{\alpha} - \alpha^*)\right| \le nk_1\sqrt{C_{\max}}\|\hat{\alpha} - \alpha^*\|_2^2.$$
Now, we combine the bounds:
$$\|R^n\|_\infty \le K\|\hat{\alpha} - \alpha^*\|_2^2, \qquad K = k_1 + k_4k_1 + k_1^2 + k_1\sqrt{C_{\max}}.$$
Finally, using Lemma 4 and selecting the regularization parameter $\lambda_n$ to satisfy $\lambda_n d \le \frac{C_{\min}^2\varepsilon}{36K(2-\varepsilon)}$ yields
$$\frac{\|R^n\|_\infty}{\lambda_n} \le \frac{9K\lambda_n d}{C_{\min}^2} \le \frac{\varepsilon}{4(2-\varepsilon)}.$$

H. Proof of Lemma 7

We first bound the difference, in spectral norm, between the population Fisher information matrix $Q^*_{SS}$ and the sample mean cascade log-likelihood Hessian $Q^n_{SS}$. Define $z^c_{jk} = [\nabla^2 g(t^c;\alpha^*) - \nabla^2\ell^n(\alpha^*)]_{jk}$ and $z_{jk} = \frac{1}{n}\sum_{c=1}^n z^c_{jk}$. Then, we can express the difference as
$$\|Q^n_{SS}(\alpha^*) - Q^*_{SS}(\alpha^*)\|_2 \le \|Q^n_{SS}(\alpha^*) - Q^*_{SS}(\alpha^*)\|_F = \sqrt{\sum_{j=1}^d\sum_{k=1}^d z_{jk}^2}.$$
Since $|z^c_{jk}| \le 2k_5$ by condition 4, we can apply Hoeffding's inequality to each $z_{jk}$,
$$P(|z_{jk}| \ge \beta) \le 2\exp\left(-\frac{\beta^2 n}{8k_5^2}\right), \qquad (26)$$
and further,
$$P\left(\|Q^n_{SS}(\alpha^*) - Q^*_{SS}(\alpha^*)\|_2 \ge \delta\right) \le 2\exp\left(-K\frac{\delta^2 n}{d^2} + 2\log d\right), \qquad (27)$$
where $\beta^2 = \delta^2/d^2$.
Now, we bound the maximum eigenvalue of $Q^n_{SS}$ as follows:
$$\Lambda_{\max}(Q^n_{SS}) = \max_{\|x\|_2=1} x^\top Q^n_{SS}x = y^\top Q^*_{SS}y + y^\top\left(Q^n_{SS} - Q^*_{SS}\right)y,$$
where $y$ is the unit-norm maximal eigenvector of $Q^n_{SS}$. Therefore,
$$\Lambda_{\max}(Q^n_{SS}) \le \Lambda_{\max}(Q^*_{SS}) + \|Q^n_{SS} - Q^*_{SS}\|_2,$$
and thus,
$$P\left(\Lambda_{\max}(Q^n_{SS}) \ge C_{\max} + \delta\right) \le \exp\left(-K\frac{\delta^2 n}{d^2} + 2\log d\right).$$

[Figure 6: Success probability vs. number of cascades for different in-degrees $d_i$.]

Reasoning in a similar way, we bound the minimum eigenvalue of $Q^n_{SS}$:
$$P\left(\Lambda_{\min}(Q^n_{SS}) \le C_{\min} - \delta\right) \le \exp\left(-K\frac{\delta^2 n}{d^2} + 2\log d\right).$$

I. Proof of Lemma 8

We start by decomposing $Q^n_{S^cS}(\alpha^*)(Q^n_{SS}(\alpha^*))^{-1}$ as follows:
$$Q^n_{S^cS}(\alpha^*)(Q^n_{SS}(\alpha^*))^{-1} = A_1 + A_2 + A_3 + A_4,$$
where
$$A_1 = Q^*_{S^cS}\left[(Q^n_{SS})^{-1} - (Q^*_{SS})^{-1}\right], \qquad A_2 = \left[Q^n_{S^cS} - Q^*_{S^cS}\right]\left[(Q^n_{SS})^{-1} - (Q^*_{SS})^{-1}\right],$$
$$A_3 = \left[Q^n_{S^cS} - Q^*_{S^cS}\right](Q^*_{SS})^{-1}, \qquad A_4 = Q^*_{S^cS}(Q^*_{SS})^{-1},$$
with $Q^* = Q^*(\alpha^*)$ and $Q^n = Q^n(\alpha^*)$. Now, we bound each term separately. The fourth term, $A_4$, is the easiest to bound, using simply the incoherence condition:
$$\|A_4\|_\infty \le 1 - \varepsilon.$$
To bound the other terms, we need the following lemma:

Lemma 11. For any $\delta \ge 0$ and constants $K$ and $K'$, the following bounds hold:
$$P\left[\|Q^n_{S^cS} - Q^*_{S^cS}\|_\infty \ge \delta\right] \le 2\exp\left(-K\frac{n\delta^2}{d^2} + \log d + \log(p-d)\right) \qquad (28)$$
$$P\left[\|Q^n_{SS} - Q^*_{SS}\|_\infty \ge \delta\right] \le 2\exp\left(-K\frac{n\delta^2}{d^2} + 2\log d\right) \qquad (29)$$
$$P\left[\|(Q^n_{SS})^{-1} - (Q^*_{SS})^{-1}\|_\infty \ge \delta\right] \le 4\exp\left(-K\frac{n\delta}{d^3} - K'\log d\right) \qquad (30)$$

Proof. We start by proving the first confidence bound.
By definition of the infinity norm of a matrix, we have
$$P\left[\|Q^n_{S^cS} - Q^*_{S^cS}\|_\infty \ge \delta\right] = P\left[\max_{j\in S^c}\sum_{k\in S}|z_{jk}| \ge \delta\right] \le (p-d)\,P\left[\sum_{k\in S}|z_{jk}| \ge \delta\right],$$
where $z_{jk} = [Q^n - Q^*]_{jk}$ and, for the last inequality, we used the union bound and the fact that $|S^c| \le p - d$. Furthermore,
$$P\left[\sum_{k\in S}|z_{jk}| \ge \delta\right] \le P\left[\exists k \in S : |z_{jk}| \ge \delta/d\right] \le d\,P\left[|z_{jk}| \ge \delta/d\right].$$
Thus,
$$P\left[\|Q^n_{S^cS} - Q^*_{S^cS}\|_\infty \ge \delta\right] \le (p-d)\,d\,P\left[|z_{jk}| \ge \delta/d\right].$$
At this point, we can obtain the first confidence bound by using Eq. 26 with $\beta = \delta/d$ in the above equation. The proof of the second confidence bound is very similar, and we omit it for brevity. To prove the last confidence bound, we proceed as follows:
$$\|(Q^n_{SS})^{-1} - (Q^*_{SS})^{-1}\|_\infty = \|(Q^n_{SS})^{-1}[Q^n_{SS} - Q^*_{SS}](Q^*_{SS})^{-1}\|_\infty \le \sqrt{d}\,\|(Q^n_{SS})^{-1}[Q^n_{SS} - Q^*_{SS}](Q^*_{SS})^{-1}\|_2 \le \sqrt{d}\,\|(Q^n_{SS})^{-1}\|_2\,\|Q^n_{SS} - Q^*_{SS}\|_2\,\|(Q^*_{SS})^{-1}\|_2 \le \frac{\sqrt{d}}{C_{\min}}\|Q^n_{SS} - Q^*_{SS}\|_2\,\|(Q^n_{SS})^{-1}\|_2.$$
Next, we bound each factor in the final expression separately. The first factor can be bounded using Eq. 27:
$$P\left[\|Q^n_{SS} - Q^*_{SS}\|_2 \ge \frac{C_{\min}^2\delta}{2\sqrt{d}}\right] \le 2\exp\left(-K\frac{n\delta^2}{d^3} + 2\log d\right).$$
The second factor can be bounded using Lemma 7:
$$P\left[\|(Q^n_{SS})^{-1}\|_2 \ge \frac{2}{C_{\min}}\right] = P\left[\Lambda_{\min}(Q^n_{SS}) \le \frac{C_{\min}}{2}\right] \le \exp\left(-K\frac{n}{d^2} + B\log d\right).$$
Then, the third confidence bound follows.

Control of $A_1$.
We start by rewriting the term $A_1$ as
$$A_1 = Q^*_{S^cS}(Q^*_{SS})^{-1}\left[(Q^*_{SS}) - (Q^n_{SS})\right](Q^n_{SS})^{-1},$$

[Figure 7: $F_1$-score vs. number of cascades. (a) Kronecker hierarchical, EXP; (b) Kronecker hierarchical, RAY; (c) Forest Fire, POW; (d) Forest Fire, RAY.]

and further,
$$\|A_1\|_\infty \le \|Q^*_{S^cS}(Q^*_{SS})^{-1}\|_\infty\,\|(Q^*_{SS}) - (Q^n_{SS})\|_\infty\,\|(Q^n_{SS})^{-1}\|_\infty.$$
Next, using the incoherence condition easily yields:
$$\|A_1\|_\infty \le (1-\varepsilon)\,\|(Q^*_{SS}) - (Q^n_{SS})\|_\infty\,\sqrt{d}\,\|(Q^n_{SS})^{-1}\|_2.$$
Now, we apply Lemma 7 with $\delta = C_{\min}/2$ to obtain that $\|(Q^n_{SS})^{-1}\|_2 \le \frac{2}{C_{\min}}$ with probability greater than $1 - \exp(-Kn/d^2 + K'\log d)$, and then use Eq. 30 with $\delta = \frac{\varepsilon C_{\min}}{12\sqrt{d}}$ to conclude that
$$P\left[\|A_1\|_\infty \ge \frac{\varepsilon}{6}\right] \le 2\exp\left(-K\frac{n}{d^3} + K'\log d\right).$$

Control of $A_2$. We bound the term $A_2$ as
$$\|A_2\|_\infty \le \|Q^n_{S^cS} - Q^*_{S^cS}\|_\infty\,\|(Q^n_{SS})^{-1} - (Q^*_{SS})^{-1}\|_\infty,$$
and then use Eqs. 28 and 30 with $\delta = \sqrt{\varepsilon/6}$ to conclude that
$$P\left[\|A_2\|_\infty \ge \frac{\varepsilon}{6}\right] \le 4\exp\left(-K\frac{n}{d^3} + \log(p-d) + K'\log p\right).$$

Control of $A_3$. We rewrite the term $A_3$ as
$$\|A_3\|_\infty \le \sqrt{d}\,\|(Q^*_{SS})^{-1}\|_2\,\|Q^n_{S^cS} - Q^*_{S^cS}\|_\infty \le \frac{\sqrt{d}}{C_{\min}}\|Q^n_{S^cS} - Q^*_{S^cS}\|_\infty.$$
We then apply Eq. 28 with $\delta = \frac{\varepsilon C_{\min}}{6\sqrt{d}}$ to conclude that
$$P\left[\|A_3\|_\infty \ge \frac{\varepsilon}{6}\right] \le \exp\left(-K\frac{n}{d^3} + \log(p-d)\right),$$
and thus,
$$P\left[\|Q^n_{S^cS}(Q^n_{SS})^{-1}\|_\infty \ge 1 - \frac{\varepsilon}{2}\right] = O\left(\exp\left(-K\frac{n}{d^3} + \log p\right)\right).$$

J. Additional experiments

Parameters $(n, p, d)$.
Figure 5 shows the success probability at inferring the incoming links of nodes on the same types of canonical networks as depicted in Fig. 2. We choose nodes with the same in-degree but different super-neighborhood set sizes $p_i$, and experiment with different scalings $\beta$ of the number of cascades $n = 10\beta\, d\log p$. We set the regularization parameter $\lambda_n$ as a constant factor of $\sqrt{\log(p)/n}$, as suggested by Theorem 2, and, for each node, we used cascades that contained at least one node in the super-neighborhood of the node under study. We used an exponential transmission model and time window $T = 10$. As predicted by Theorem 2, very different $p$ values lead to curves that line up with each other quite well.

Figure 6 shows the success probability at inferring the incoming links of nodes of a hierarchical Kronecker network with equal super-neighborhood size ($p_i = 70$) but different in-degrees $d_i$ under different scalings $\beta$ of the number of cascades $n = 10\beta\, d\log p$, choosing the regularization parameter $\lambda_n$ as a constant factor of $\sqrt{\log(p)/n}$, as suggested by Theorem 2. We used an exponential transmission model and time window $T = 5$. As predicted by Theorem 2, in this case, different $d$ values lead to noticeably different curves.

Comparison with NETRATE and First-Edge. Figure 7 compares the accuracy of our algorithm, NETRATE and First-Edge against the number of cascades for different types of networks and transmission models. Our method typically outperforms both competing methods.
We find the competitive advantage with respect to First-Edge especially striking; it may be explained by comparing the sample complexity results for both methods: First-Edge needs $O(Nd\log N)$ cascades to achieve a probability of success approaching 1 at a rate polynomial in the number of cascades, while our method needs only $O(d^3\log N)$ cascades to achieve a probability of success approaching 1 at a rate exponential in the number of cascades.