Learning Influence Functions from Incomplete Observations

Learning Inﬂuence F unctions from Incomplete Observ ations Xinran He Ke Xu Da vid Kemp e Y an Liu Univ ersit y of Southern California, Los Angeles, CA 90089 {xinranhe, xuk, dkempe, yanliu.cs}@usc.edu No v em b er 9, 2016 Abstract W e study the problem of learning inﬂuence functions under incomplete observ ations of no de activ ations. Incomplete observ ations are a ma jor concern as most (online and real-w orld) so cial net works are not fully observ able. W e establish both proper and improper P A C learnabilit y of inﬂuence functions under randomly missing observ ations. Proper P AC learnabilit y under the Discrete-Time Linear Threshold (DL T) and Discrete-Time Indep endent Cascade (DIC) mo dels is established b y reducing incomplete observ ations to complete observ ations in a mo diﬁed graph. Our improp er P AC learnability result applies for the DL T and DIC mo dels as well as the Con tinuous-Time Indep endent Cascade (CIC) mo del. It is based on a parametrization in terms of reachabilit y features, and also gives rise to an eﬃcient and practical heuristic. Exp erimen ts on synthetic and real-world datasets demonstrate the ability of our metho d to comp ensate even for a fairly large fraction of missing observ ations. 1 In tro duction Man y so cial phenomena, suc h as the spread of diseases, b ehaviors, tec hnologies, or products, can naturally b e modeled as the diﬀusion of a contagion across a netw ork. Owing to the p otentially high so cial or economic v alue of accelerating or inhibiting suc h diﬀusions, the goal of understanding the ﬂow of information and predicting information cascades has b een an active area of research [ 10 , 7 , 9 , 14 , 1 , 20 ]. In this context, a k ey task is learning inﬂuenc e functions : the functions mapping sets of initial adopters to the individuals who will b e inﬂuenced (also called active ) b y the end of the diﬀusion pro cess [10]. Man y metho ds hav e b een developed to solve the inﬂuence function learning problem [ 9 , 7 , 5 , 8 , 3 , 16 , 18 , 24 , 25 ]. Most approaches are based on ﬁtting the parameters of a diﬀusion model based on observ ations, e.g., [ 8 , 7 , 18 , 9 , 16 ]. Recen tly , Du et al. [ 3 ] prop osed a mo del-fr e e approac h to learn inﬂuence functions as co verage functions; Narasimhan et al. [ 16 ] establish prop er P AC learnabilit y of inﬂuence functions under several widely-used diﬀusion models. All existing approac hes rely on the assumption that the observ ations in the training dataset are complete, complete, in the sense that all activ e nodes are observed as b eing activ e. How ever, this assumption fails to hold in virtually all practical applications [ 15 , 6 , 2 , 21 ]. F or example, so cial media data are usually collected through cra wlers or acquired with public APIs provided by so cial media platforms, such as T witter or F aceb o ok. Due to non-tec hnical reasons and established restrictions on the APIs, it is often impossible to obtain a complete set of observ ations ev en for a short p erio d 1 of time. In turn, the existence of unobserv ed nodes, links, or activ ations ma y lead to a signiﬁcant misestimation of the diﬀusion model’s parameters [19, 15]. In this pap er, w e take a step tow ards addressing the problem of learning inﬂuence functions from incomplete observ ations (Here sp eciﬁcally w e mean missing activ ation in the observ ed cascades). Missing data are a complicated phenomenon, but to address it meaningfully and rigorously , one must mak e at least some assumptions about the pro cess resulting in the loss of data. As a ﬁrst step, we fo cus on r andom loss of observ ations: for eac h activ ated node independently , the node’s activ ation is observed only with probabilit y r , the r etention r ate , and fails to b e observed with probabilit y 1 − r . Random observ ation loss naturally occurs when crawling data from so cial media, where rate restrictions are lik ely to aﬀect all observ ations equally . W e establish both proper and improp er P A C learnabilit y of inﬂuence functions under incomplete observ ations for t wo p opular diﬀusion mo dels: the Discrete-Time Indep enden t Cascade (DIC) and Discrete-Time Linear Threshold (DL T) mo dels. The result is pro v ed by interpreting the incomplete observ ations as complete observ ations in a transformed graph, In fact, randomly missing observ ations do not signiﬁcan tly increase the required sample complexit y . The P AC learnability result implies go o d sample complexity b ounds for the DIC and DL T mo dels. Ho w ev er, ev en without missing observ ations, proper P A C learnabilit y of the CIC and other models app ears to b e more challenging. F urthermore, the P AC learnability result do es not lead to an eﬃcient algorithm, as it inv olves marginalizing a large n um b er of hidden v ariables (one for eac h no de not observ ed to b e activ e). T o w ards designing more practical algorithms and obtaining learnabilit y under a broader class of diﬀusion mo dels, we pursue improper learning approaches. Concretely , w e use the parameterization of Du et al. [ 3 ] in terms of reachabilit y basis functions, and optimize a mo diﬁed loss function suggested b y Natara jan et al. [ 17 ] to address incomplete observ ations. W e pro ve that the algorithm ensures improp er P AC learning for the DIC, DL T and Con tin uous-Time Indep endent Cascade (CIC) models. Exp erimen tal results on synthetic cascades generated from these diﬀusion models and real-world cascades in the MemeT rack er dataset demonstrate the eﬀectiv eness of our approac h. Our algorithm ac hiev es nearly a 20% reduction in estimation error compared to the best baseline metho ds on the MemeT rac k er dataset, b y comp ensating for incomplete observ ations. Sev eral recent works also aim to address the issue of missing observ ations in so cial netw ork analysis, but with diﬀeren t emphases. F or example, Chieric hetti et al. [ 2 ] and Sadiko v et al. [ 21 ] mainly fo cus on recov ering the size of a diﬀusion process, while our task is to learn the inﬂuence functions from several incomplete cascades. Myers et al. [ 15 ] mainly aim to mo del unobserv ed external inﬂuence in diﬀusion. Duong et al. [ 6 ] examine learning diﬀusion mo dels with missing links from c omplete observ ations, while w e learn inﬂuence functions from incomplete cascades with missing activ ations. Most related to our work are pap ers by W u et al. [ 23 ] and simultaneous work by Lokho v [ 13 ]. Both study the problem of net w ork inference under incomplete observ ations. Lokhov prop oses a dynamic message passing approach to marginalize all the missing activ ations, in order to infer diﬀusion mo del parameters using maximum likelihoo d estimation, while W u et al. dev elop an EM algorithm. Notice that the goal of learning the mo del parameters diﬀers from our goal of learning the inﬂuence functions directly . Both [ 13 ] and [ 23 ] pro vide empirical ev aluation, but do not pro vide theoretical guaran tees. 2 2 Preliminaries 2.1 Mo dels of Diﬀusion and Incomplete Observ ations Diﬀusion Mo del. W e mo del propagation of opinions, pro ducts, or b eha viors as a diﬀusion pro cess o v er a so cial netw ork. The so cial netw ork is represented as a directed graph G = ( V , E ) , where n = | V | is the num b er of no des, and m = | E | is the num b er of edges. Each edge e = ( u, v ) is asso ciated with a parameter w uv represen ting the strength of inﬂuence user u has on v . W e assume that the graph structure (the edge set E ) is kno wn, while the parameters w uv are to b e learned. Dep ending on the diﬀusion mo del, there are diﬀerent wa ys to represent the strength of inﬂuence b et w een individuals. No des can b e in one of t w o states, inactive or active . W e sa y that a node gets activ ated if it adopts the opinion/pro duct/b eha vior under the diﬀusion pro cess. In this w ork, we fo cus on pr o gr essive diﬀusion mo dels, where a node remains activ e once it gets activ ated. The diﬀusion pro cess b egins with a set of seed no des (initial adopters) S , who start active. The pro cess then proceeds in discrete or con tin uous time: according to a probabilistic process, additional no des may b ecome activ e based on the inﬂuence from their neigh b ors. Let N ( v ) b e the in-neighbors of no de v and A t the set of no des activ ated b y time t . W e consider the following three widely used diﬀusion mo dels: • Discrete-time Linear Threshold (DL T) mo del [10]: Eac h no de v has a threshold θ v dra wn indep enden tly and uniformly from the interv al [0 , 1] . The diﬀusion under the DL T mo del unfolds in discrete time steps: a no de v b ecomes active at step t if the total incoming w eigh t from its neighbors exceeds its threshold: P u ∈ N ( v ) ∩ A t − 1 w uv ≥ θ v . • Discrete-time Indep enden t Cascade (DIC) model [10]: The DIC mo del is also a discrete-time mo del. Under the DIC mo del, the weigh t w uv ∈ [0 , 1] captures an activ ation probabilit y . When a no de u b ecomes active in step t , it attempts to activ ate all currently inactiv e neighbors in step t + 1 . F or eac h neigh b or v , it succeeds with probabilit y w uv . If it succeeds, v b ecomes activ e; otherwise, v remains inactiv e. Once u has made all these attempts, it do es not get to mak e further activ ation attempts at later times. • Con tin uous-time Indep enden t Cascade (CIC) mo del [8]: The CIC model unfolds in con tin uous time. Each edge e = ( u, v ) is associated with a delay distribution with w uv as its parameter. When a node u b ecomes newly activ e at time t , for ev ery neigh b or v that is still inactiv e, a dela y time d uv is drawn from the delay distribution. d uv is the duration it takes u to activ ate v , whic h could b e inﬁnite (if u do es not succeed in activ ating v ). No des are considered activ ated by the process if they are activ ated within a sp eciﬁed observ ation windo w [0 , τ ] . Fix one of the diﬀusion mo dels deﬁned ab ov e and its parameters. F or eac h seed set S , let ∆ S b e the distribution of ﬁnal activ e sets when the seed set is S . (In the case of the DIC and DL T mo del, this is the set of active nodes when no new activ ations o ccur; for the CIC model, it is the set of no des active at time τ .) F or any no de v , let F v ( S ) = Prob A ∼ ∆ S [ v ∈ A ] b e the (marginal) probability that v is activ ated according to the dynamics of the diﬀusion model with initial seeds S . Then, deﬁne the inﬂuenc e function F : 2 V → [0 , 1] n mapping seed sets to the vector of marginal activ ation probabilities: F ( S ) = [ F 1 ( S ) , . . . , F n ( S )] . Notice that the marginal probabilities do not capture the full information about the diﬀusion pro cess contained in ∆ S (since they do not observ e co-activ ation patterns), but they are suﬃcient for man y applications, suc h as inﬂuence maximization [ 10 ] and inﬂuence estimation [4]. 3 Cascades and Incomplete Observ ations. W e fo cus on the problem of learning inﬂuence func- tions from cascades. A cascade C = ( S, A ) is a realization of the random diﬀusion pro cess; S is the set of seeds and A ∼ ∆ S , A ⊇ S is the set of activ ated nodes at the end of the random pro cess. Similar to Narasimhan et al. [ 16 ], w e focus on activation-only observ ations, namely , w e only observe which no des were activ ated, but not when these activ ations occurred. 1 T o capture the fact that some of the no de activ ations ma y ha v e b een unobserv ed, w e use the follo wing mo del of indep enden tly randomly missing data: for eac h (activ ated) node v ∈ A \ S , the activ ation of v is actually observe d indep enden tly with probabilit y r . With probability 1 − r , the no de’s activ ation is unobserv able. F or seed no des v ∈ S , the activ ation is never lost. F ormally , deﬁne ˜ A as follows: each v ∈ S is deterministically in ˜ A , and eac h v ∈ A \ S is in ˜ A indep enden tly with probabilit y r . Then, the incomplete cascade is denoted b y ˜ C = ( S, ˜ A ) . 2.2 Ob jectiv e F unctions and Learning Goals T o measure estimation error, w e primarily use a quadratic loss function, as in [ 16 , 3 ]. F or t wo n -dimensional v ectors x , y , the quadratic loss is deﬁned as ` sq ( x , y ) = 1 n · || x − y || 2 2 . W e also use this notation when one or both argumen ts are sets: when an argument is a set S , w e formally mean to use the indic ator function χ S as a v ector, where χ S ( v ) = 1 if v ∈ S , and χ S ( v ) = 0 otherwise. In particular, for an activ ated set A , w e write ` sq ( A, F ( S )) = 1 n || χ A − F ( S ) || 2 2 . W e no w formally deﬁne the problem of learning inﬂuence functions from incomplete observ ations. Let P b e a distribution ov er seed sets (i.e., a distribution ov er 2 V ), and ﬁx a diﬀusion model M and parameters, together giving rise to a distribution ∆ S for eac h seed set. The algorithm is giv en a set of M incomplete cascades ˜ C = { ( S 1 , ˜ A 1 ) , . . . , ( S M , ˜ A M ) } , where eac h S i is drawn indep endently from P , and ˜ A i is obtained by the incomplete observ ation pro cess describ ed ab o v e from the (random) activ ated set A i ∼ ∆ S i . The goal is to learn an inﬂuence function F that accurately captures the diﬀusion pro cess. Accuracy of the learned inﬂuence function is measured in terms of the squared error with resp ect to the true mo del: err sq [ F ] = E S ∼P ,A ∼ ∆ S [ ` sq ( A, F ( S ))] . That is, the exp ectation is ov er the seed set and the randomness in the diﬀusion process, but not the data loss process. P A C Learnabilit y of Inﬂuence F unctions. W e characterize the learnabilit y of inﬂuence func- tions under incomplete observ ations using the Probably Approximately Correct (P AC) learning framew ork [ 22 ]. Let F M b e the class of inﬂuence functions under the diﬀusion mo del M , and F L the class of inﬂuence functions the learning algorithm is allo w ed to c ho ose from. W e say that F M is P A C learnable if there exists an algorithm A with the follo wing prop ert y: for all ε, δ ∈ (0 , 1) , all parametrizations of the diﬀusion model, and all distributions P o v er seed sets S : when given activation-only and inc omplete training cascades ˜ C = { ( S 1 , ˜ A 1 ) , . . . , ( S M , ˜ A M ) } with M ≥ pol y ( n, m, 1 /ε, 1 /δ ) , A outputs an inﬂuence function F ∈ F L satisfying: Prob ˜ C [ err sq [ F ] − err sq [ F ∗ ] ≥ ε ] ≤ δ. Here, F ∗ ∈ F M is the ground truth inﬂuence function. The probabilit y is o v er the training cascades, including the seed set generation, the stochastic diﬀusion pro cess, and the missing data pro cess. W e say that an inﬂuence function learning algorithm A is pr op er if F L ⊆ F M ; that is, the learned inﬂuence function is guaran teed to b e an instance of the true diﬀusion mo del. Otherwise, w e sa y that A is an impr op er learning algorithm. 1 Narasimhan et al. [ 16 ] refer to this mo del as p artial observations ; we change the terminology to av oid confusion with “incomplete observ ations.” 4 𝐺 𝑢 𝑣 𝑤 𝑥 𝑢 𝑣 𝑤 𝑥 𝑢 ’ 𝑣′ 𝑤′ 𝑥′ 𝐺 ' Figure 1: Illustration of the graph transformation. The light green node is the seed, the dark green no des are the activ ated and observed no des while the yello w no de is activ ated but lost due to incomplete observ ations. 3 Prop er P A C Learning under Incomplete Observ ations In this section, w e establish prop er P A C learnabilit y of inﬂuence functions under the DIC and DL T mo dels. F or b oth diﬀusion mo dels, F M can b e parameterized by an edge parameter vector w , whose en tries w e are the activ ation probabilities (DIC model) or edge w eights (DL T model). Our goal is to ﬁnd an inﬂuence function F w ∈ F M that outputs accurate marginal activ ation probabilities. While our goal is pr op er learning — meaning that the function must b e from F M — we do not require that the inferred parameters matc h the true edge parameters w . Our main theoretical results are summarized in Theorem 1 and Theorem 2. Theorem 1. L et λ ∈ (0 , 0 . 5) . The class of inﬂuenc e functions under the DIC mo del in which al l e dge activation pr ob abilities satisfy w e ∈ [ λ, 1 − λ ] is P A C le arnable under inc omplete observations with r etention r ate r . The sample c omplexity is ˜ O ( n 3 m ε 2 r 4 ) . Theorem 2. L et λ ∈ (0 , 0 . 5) , and c onsider the class of inﬂuenc e functions under the DL T mo del such that the e dge weight for every e dge satisﬁes w e ∈ [ λ, 1 − λ ] , and for every no de v , 1 − P u ∈ N ( v ) w uv ∈ [ λ, 1 − λ ] . This class is P AC le arnable under inc omplete observations with r etention r ate r . The sample c omplexity is ˜ O ( n 3 m ε 2 r 4 ) . In this section, we present the intuition and a pro of sk etch for the t w o theorems. Details of the pro of are pro vided in App endix A. The ke y idea of the pro of of b oth theorems is that a set of incomplete cascades ˜ C on G under the t w o models can b e considered as essentially complete cascades on a transformed graph ˆ G = ( ˆ V , ˆ E ) . The inﬂuence functions of nodes in ˆ G can b e learned using a mo diﬁcation of the result of Narasimhan et al. [ 16 ]. Subsequen tly , the inﬂuence functions for G are directly obtained from the inﬂuence functions for ˆ G , by exploiting that inﬂuence functions only fo cus on the marginal activ ation probabilities. The transformed graph ˆ G is built b y adding a la yer of n no des to the original graph G . F or each no de v ∈ V of the original graph, w e add a new node v 0 ∈ V 0 and a directed edge ( v , v 0 ) with kno wn and ﬁxed edge parameter ˆ w v v 0 = r . (Con v enien tly , the same parameter v alue serv es as activ ation probabilit y under the DIC model and as edge weigh t under the DL T mo del.) The new no des V 0 ha v e no other inciden t edges, and w e retain all edges e = ( u, v ) ∈ E . Inferring their parameters is the learning task. An example of the transformation on a simple graph consisting of four nodes is sho wn in Figure 1. 5 F or each observ ed (incomplete) cascade ( S i , ˜ A i ) on G (with S i ⊆ ˜ A i ), we pro duce an observed activ ation set A 0 i as follows: (1) for each v ∈ ˜ A i \ S i , we let v 0 ∈ A 0 i deterministically; (2) for each v ∈ S i indep enden tly , we include v 0 ∈ A 0 i with probability r . This deﬁnes the training cascades ˆ C = { ( S i , A 0 i ) } . No w consider any edge parameters w , applied to b oth G and the ﬁrst la y er of ˆ G . Let F ( S ) denote the inﬂuence function on G , and ˆ F ( S ) = [ ˆ F 1 0 ( S ) , . . . , ˆ F n 0 ( S )] the inﬂuence function of the no des in the added la yer V 0 of ˆ G . Then, by the transformation, w e get that ˆ F v 0 ( S ) = r · F v ( S ) (1) for all nodes v ∈ V . And b y the deﬁnition of the observ ation loss pro cess, we also ha ve that for all non-seed no des v / ∈ S i , Prob[ v ∈ ˜ A i ] = r · F v ( S ) = ˆ F v 0 ( S ) . While the cascades ˆ C are not complete on all of ˆ G , in a precise sense, they pro vide complete information on the activ ation of no des in V 0 . In App endix A, we show that Theorem 2 of Narasimhan et al. [ 16 ] can b e extended to pro vide iden tical guaran tees for learning ˆ F ( S ) from the mo diﬁed observ ed cascades ˆ C . F or the DIC model, this is a straightforw ard mo diﬁcation of the pro of from [ 16 ]. F or the DL T model, [ 16 ] had not established P A C learnabilit y 2 , so w e pro vide a separate proof. Because the results of [ 16 ] and our generalizations ensure pr op er learning, they provide edge w eigh ts w b et w een the no des of V . W e use these exact same edge weigh ts to deﬁne the learned inﬂuence functions in G . Equation (1) then implies that the learned inﬂuence functions on V satisfy F v ( S ) = 1 r · ˆ F v 0 ( S ) . The detailed analysis in App endix A sho ws that the learning error only scales b y a m ultiplicativ e factor 1 r 2 . The P AC learnability result shows that there is no information-theoretical obstacle to inﬂuence function learning under incomplete observ ations. Ho w ev er, it do es not imply an eﬃcient algorithm. The reason is that a hidden v ariable w ould b e asso ciated with each no de not observ ed to be activ e, and computing the ob jectiv e function for empirical risk minimization would require marginalizing o v er all of the hidden v ariables. The prop er P AC learnability result also do es not readily generalize to the CIC mo del and other diﬀusion mo dels, ev en under complete observ ations. This is due to the lac k of a succinct c haracterization of inﬂuence functions as under the DIC and DL T mo dels. Therefore, in the next section, w e explore improper learning approaches with the goal of designing practical algorithms and establishing learnabilit y under a broader class of diﬀusion mo dels. 4 Eﬃcien t Improp er Learning Algorithm In this section, w e develop improper learning algorithms for eﬃcient inﬂuence function learning. Instead of parameterizing the inﬂuence functions using the edge parameters, w e adopt the mo del-free inﬂuence function learning framework, InﬂuLearner, proposed by Du et al. [ 3 ] to represent the inﬂuence function as a sum of w eigh ted basis functions. F rom no w on, we fo cus on the inﬂuence function F v ( S ) of a single ﬁxed node v . 2 [ 16 ] shows that the DL T mo del with ﬁxed thresholds is P AC learnable under complete cascades. W e study the DL T mo del when the thresholds are uniformly distributed random v ariables. 6 Inﬂuence F unction P arameterization. F or all three diﬀusion mo dels (CIC, DIC and DL T), the diﬀusion pro cess can be characterized equiv alently using liv e-edge graphs. Concretely , the results of [ 10 , 4 ] state that for each instance of the CIC, DIC, and DL T mo dels, there exists a distribution Γ o ver liv e-edge graphs H assigning probabilit y γ H to eac h liv e-edge graph H suc h that F ∗ v ( S ) = P H : at le ast one no de in S has a path to v in H γ H . T o reduce the represen tation complexity , notice that from the p ersp ective of activ ating v , tw o diﬀeren t liv e-edge graphs H , H 0 are “equiv alent” if v is reac hable from exactly the same no des in H and H 0 . Therefore, for an y node set T , let β ∗ T := P H : exactly the no des in T hav e paths to v in H γ H . W e then use c haracteristic vectors as feature v ectors r T = χ T , where we will in terpret the entry for no de u as u ha ving a path to v in a liv e-edge graph. More precisely , let φ ( x ) = min { x, 1 } , and χ S the characteristic v ector of the seed set S . Then, φ ( χ > S · r T ) = 1 if and only if v is reachable from S , and we can write F ∗ v ( S ) = X T β ∗ T · φ ( χ > S · r T ) . This represen tation still has exponentially man y features (one for each T ). In order to make the learning problem tractable, we sample a smaller set T of K features from a suitably chosen distribution, implicitly setting the weigh ts β T of all other features to 0. Th us, we will parametrize the learned inﬂuence function as F β v ( S ) = X T ∈T β T · φ ( χ > S · r T ) . The goal is then to learn w eigh ts β T for the sampled features. (They will form a distribution, i.e., || β || 1 = 1 and β ≥ 0 .) The crux of the analysis is to show that a suﬃcien tly small n um b er K of features (i.e., sampled sets) suﬃces for a go o d appro ximation, and that the weigh ts can b e learned eﬃcien tly from a limited n umber of observed incomplete cascades. Sp eciﬁcally , w e consider the log lik eliho o d function ` ( t, y ) = y log t + (1 − y ) log (1 − t ) , and learn the parameter vector β via the follo wing maximum likelihoo d estimation problem: Maximize P M i =1 ` ( F β v ( S i ) , χ A i ( v )) sub ject to || β || 1 = 1 , β ≥ 0 . Handling Incomplete Observ ations. The maximum likelihoo d estimation cannot b e directly applied to incomplete cascades, as w e do not ha v e access to A i (only the incomplete v ersion ˜ A i ). T o address this issue, notice that the MLE problem is actually a binary classiﬁcation problem with log loss and y i = χ A i ( v ) as the label. F rom this persp ective, incompleteness is simply class-conditional noise on the labels. Let ˜ y i = χ ˜ A i ( v ) b e our observation of whether v w as activ ated or not under the incomplete cascade i . Then, Prob[ ˜ y i = 1 | y i = 1] = r and Prob[ ˜ y i = 1 | y i = 0] = 0 . In words, the incomplete observ ation ˜ y i suﬀers from one-sided error compared to the complete observ ation y i . Known tec hniques can b e used to address this issue. By results of Natara jan et al. [ 17 ], we can construct an unbiased estimator of ` ( t, y ) using only the incomplete observ ations ˜ y , as in the following lemma. 7 Lemma 3 (Corollary of Lemma 1 [ 17 ]) . L et y b e the true activation of no de v and ˜ y the inc omplete observation. Then, deﬁning ˜ ` ( t, y ) := 1 r y log t + 2 r − 1 r (1 − y ) log (1 − t ) , for any t , we have E ˜ y h ˜ ` ( t, ˜ y ) i = ` ( t, y ) . Based on this lemma, we solve the maxim um lik eliho o d estimation problem with the adjusted lik eliho o d function ˜ ` ( t, y ) : Maximize P M i =1 ˜ ` ( F β v ( S i ) , χ ˜ A i ( v )) (2) sub ject to || β || 1 = 1 , β ≥ 0 . W e analyze conditions under which the solution to (2) pro vides improp er P AC learnabilit y under incomplete observ ations; these conditions will apply for all three diﬀusion mo dels. These conditions are similar to those of Lemma 1 in the w ork of Du et al. [ 3 ], and concern the appro ximabilit y of the reac habilit y distribution β ∗ T . Speciﬁcally , let q b e a distribution o ver no de sets T suc h that q ( T ) ≤ C β ∗ T for all no de sets T . Here, C is a (p ossibly v ery large) num b er that we will wan t to b ound b elow. Let T 1 , . . . , T K b e K i.i.d. samples dra wn from the distribution q . The features are then r k = χ T k . W e use the truncated v ersion of the function F β ,λ v ( S ) with parameter 3 λ as in [3]: F β ,λ v ( S ) = (1 − 2 λ ) F β v ( S ) + λ. Let M λ b e the class of all such truncated inﬂuence functions, and F ˜ β ,λ v ∈ M λ the inﬂuence functions obtained from the optimization problem (2) . The following theorem (pro v ed in App endix B) establishes the accuracy of the learned functions. Theorem 4. Assume that the le arning algorithm uses K = ˜ Ω ( C 2 ε 2 ) fe atur es in the inﬂuenc e function it c onstructs, and observes 4 M = ˜ Ω ( log C ε 4 r 2 ) inc omplete c asc ades with r etention r ate r . Then, with pr ob ability at le ast 1 − δ , the le arne d inﬂuenc e functions F ˜ β ,λ v for e ach no de v and se e d distribution P satisfy E S ∼P h ( F ˜ β ,λ v ( S ) − F ∗ v ( S )) 2 i ≤ ε. The theorem implies that with enough incomplete cascades, an algorithm can approximate the ground truth inﬂuence function to arbitrary accuracy . Therefore, all three diﬀusion mo dels are improp erly P A C learnable under incomplete observ ations. The ﬁnal sample complexit y does not con tain the graph size, but is implicitly captured by C , whic h will dep end on the graph and how well the distribution β ∗ T can b e appro ximated. Notice that with r = 1 , our b ound on M has logarithmic dep endency on C instead of p olynomial, as in [ 3 ]. The reason for this impro vemen t is discussed further in Appendix B. 3 The pro of of Theorem 4 in App endix B will show how to choose λ . 4 The ˜ Ω notation suppresses all logarithmic terms except log C , as C could b e exp onential or worse in the num b er of no des. 8 Eﬃcien t Implemen tation. As mentioned ab ov e, the features T cannot b e sampled from the exact reachabilit y distribution β ∗ T , b ecause it is inaccessible (and complex). In order to obtain useful guaran tees from Theorem 4, w e follo w the approach of Du et al. [ 3 ], and approximate the distribution β ∗ T with the product of the marginal distributions, estimated from observed cascades. The optimization problem (2) is conv ex and can therefore be solved in time polynomial in the n um b er of features K . How ever, the guarantees in Theorem 4 require a p ossibly large num b er of features. In order to obtain an eﬃcient algorithm for practical use and our exp eriments, we sacriﬁce the guarantee and use a ﬁxed n um b er of features. Notice that the optimization problem (2) can b e solved independently for eac h no de v ; the learned functions F i ( S ) can then be com bined in to F ( S ) = [ F 1 ( S ) , . . . , F n ( S )] . As the optimization problem factorizes o v er nodes, the metho d is obviously parallelizable, th us scaling to large net works. A further p oin t regarding the implementation: in our theoretical analysis, we assumed that the retention rate r is known to the learning algorithm. In practice, it can b e estimated via cross- v alidation. As w e show in the next section, the algorithm is not v ery sensitive to the missp eciﬁcation of the reten tion rate. 5 Exp erimen ts In this section, we exp erimen tally ev aluate the algorithm from Section 4. Since no other metho ds explicitly accoun t for incomplete observ ations, w e compare it to several state-of-the-art metho ds for inﬂuence function learning with full information. Hence, the main goal of the comparison is to examine to what exten t the impact of missing data can b e mitigated b y b eing aw are of it. W e compare our algorithm to the following approac hes: (1) CIC is an approac h ﬁtting the parameters of a CIC mo del, using the NetRate algorithm [ 7 ] with exponential delay distribution. (2) DIC ﬁts the activ ation probabilities of a DIC mo del using the metho d in [ 18 ]. (3) InﬂuLearner is the mo del-free approac h proposed by Du et al. in [ 3 ] and discussed in Section 4. (4) Logistic uses logistic regression to learn the inﬂuence functions F u ( S ) = f ( χ > S · c u + b ) for eac h u indep enden tly , where c u is a learnable weigh t vector and f ( x ) = 1 1+ e − x is the logistic function. (5) Linear uses linear regression to learn the total inﬂuence σ ( S ) = c > · χ S + b of the set S . Notice that the CIC and DIC metho ds hav e access to the activation time of eac h node in addition to the ﬁnal activ ation status, giving them an inheren t adv antage. 5.1 Syn thetic cascades Data generation. W e generate syn thetic net works with core-peripheral structure following the Kronec k er graph mo del [ 12 ] with parameter matrix [0 . 9 , 0 . 5; 0 . 5 , 0 . 3] . 5 Eac h generated net w ork has 512 no des and 1024 edges. W e then generate synthetic cascades follo wing the CIC, DIC and DL T mo dels. F or the CIC mo del, w e use an exp onen tial delay distribution on eac h edge whose parameters are drawn indep enden tly and uniformly from [0 , 1] . The observ ation window length is τ = 1 . 0 . F or the DIC mo del, the activ ation probabilit y for each edge is chosen indep endently and uniformly from [0 , 0 . 4] . F or the DL T mo del, w e follo w [ 10 ] and set the edge weigh t w uv as 1 /d v where d v is the in-degree of node v . F or each mo del, w e generate 8192 cascades as training data. The seed sets are sampled uniformly at 5 W e also exp erimented on Kronec ker graphs with hierarchical communit y structure ( [0 . 9 , 0 . 1; 0 . 1 , 0 . 9] ) and random structure ( [0 . 5 , 0 . 5; 0 . 5 , 0 . 5] ). The results are similar and omitted due to space constraints. 9 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 Retention Rate 2 3 4 5 6 7 8 9 10 11 MAE CIC DIC InfluLearner Linear Logistic Our Method (a) CIC 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 Retention Rate 2 3 4 5 6 7 8 9 10 11 MAE CIC DIC InfluLearner Linear Logistic Our Method (b) DIC 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 Retention Rate 2 3 4 5 6 7 8 9 10 11 MAE CIC DIC InfluLearner Linear Logistic Our Method (c) DL T Figure 2: MAE of estimated inﬂuence as a function of the reten tion rate on synthetic datasets for (a) CIC model, (b) DIC model, (c) DL T model. The error bars shows one standard deviation. random with sizes drawn from a p ow er la w distribution with parameter 2 . 5 . The generated cascades ha v e av erage sizes of 10 . 8 , 12 . 8 and 13 . 0 in the CIC, DIC and DL T models, respectively . W e then create incomplete cascades by v arying the retention rate betw een 0 . 1 and 0 . 9 . The test set con tains 200 indep endently sampled seed sets generated in the same w ay as the training data. T o sidestep the computational cost of running Monte Carlo sim ulations, we estimate the ground truth inﬂuence of the test seed sets using the method prop osed in [3], with the true mo del parameters. Algorithm settings. W e apply all algorithms to cascades generated from all three mo dels; that is, we also consider the results under mo del missp eciﬁcation. Whenever applicable, w e set the h yp erparameters of the ﬁve comparison algorithms to the ground truth v alues. When applying the NetRate algorithm to discrete-time cascades, w e set the observ ation windo w to 10 . 0 . When applying the metho d in [ 18 ] to contin uous-time cascades, we round activ ation times up to the nearest m ultiple of 0.1, resulting in 10 discrete time steps. F or the mo del-free approaches (InﬂuLearner and our algorithm), we use K = 200 features. Results. Figure 2 sho ws the Mean Absolute Error (MAE) b etw een the estimated total inﬂuence σ ( S ) and the true inﬂuence v alue, a veraged ov er all testing seed sets. F or each setting (diﬀusion mo del and reten tion rate), the reported MAE is a veraged o v er ﬁve indep endent runs. The main insight is that accounting for missing observ ations indeed strongly mitigates their eﬀect: notice that for retention rates as small as 0 . 5 , our algorithm can almost completely compensate for the data loss, whereas b oth the mo del-free and parameter ﬁtting approac hes deteriorate signiﬁcantly ev en for reten tion rates close to 1. F or the parameter ﬁtting approac hes, even such large reten tion rates can lead to missing and spurious edges in the inferred netw orks, and thus signiﬁcant estimation errors. Additional observ ations include that ﬁtting inﬂuence using (linear or logistic) regression do es not perform well at all, and that the CIC inference approac h appears more robust to model missp eciﬁcation than the DIC approac h. Sensitivit y of reten tion rate. W e presen ted the algorithms as kno wing r . Since r itself is inferred from noisy data, it ma y be somewhat misestimated. Figure 4 shows the impact of misestimating r . 10 96% 98% 100% 102% 104% 106% 0.02 0.05 0.1 0.2 CIC DIC DL T Figure 3: Relative error in MAE when the true reten tion rates are drawn from the truncated Gaussian distribution. The x -axis shows standard deviation σ of the reten tion rates from the mean, and the y -axis is the relativ e diﬀerence of the MAE compared to the case where all retention rates are the same and kno wn. W e generate syn thetic cascades from all three diﬀusion mo dels with a true retention rate of 0 . 8 , and then apply our algorithm with (incorrect) reten tion rate r ∈ { 0 . 6 , 0 . 65 , . . . , 0 . 95 , 1 } . The results are a v eraged o ver ﬁve indep endent runs. While the performance decreases as the misestimation gets w orse (after all, with r = 1 , the algorithm is basically the same as InﬂuLearner), the degradation is graceful. Non-uniform retention rate In practice, diﬀerent no des may hav e diﬀerent retention rates, while w e may b e able only to estimate the mean retention rate r . F or our exp eriments, w e dra w eac h no de v ’s retention rate r v indep enden tly from a distribution with mean r . Sp eciﬁcally , in our exp erimen ts, we use uniform and Gaussian distributions; for the uniform distribution, we draw r v ∼ Unif [ r − σ, r + σ ] ; for the Gaussian distribution, we draw r v ∼ N ( r , σ 2 ) , truncating dra ws at 0 and 1. 6 In b oth cases, σ measures the lev el of noise in the estimated reten tion rate. W e set the mean reten tion rate r to 0 . 8 and v ary σ in { 0 , 0 . 02 , 0 . 05 , 0 . 1 , 0 . 2 } . Figure 3 shows the results for the Gaussian distribution; the results of uniform distribution are similar and omitted. The results show that our mo del is very robust to random and independent p erturbations of individual retention rates for each no de. 5.2 Inﬂuence Estimation on real cascades W e further ev aluate the p erformance of our method on the real-world MemeT rac k er 7 dataset [ 11 ]. The dataset consists of the propagation of short textual phrases, referred to as Memes , via the publication of blog p osts and main-stream media news articles b et w een March 2011 and F ebruary 2012. Sp eciﬁcally , the dataset con tains seven groups of cascades corresponding to the propagation of Memes with certain keyw ords, namely “apple and jobs”, “tsunami earthquake”, “william k ate marriage” ’, “o ccupy wall-street”, “airstrikes”, “egypt” and “elections.” Eac h cascade group consists of 6 The truncation could lead to a bias on the mean of r . How ever, empirical simulations sho w that the bias is negligible (only 0 . 01 when σ = 0 . 2 ). 7 W e use the prepro cessed version of the dataset released by Du et al. [ 3 ] and av ailable at http://www.cc.gatech. edu/~ndu8/InfluLearner.html . Notice that the dataset is semi-real, as multi-node seed cascades are artiﬁcially created by merging single-no de seed cascades. 11 ! 10 !5 0 5 10 15 20 25 30 35 40 1 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 CI C DIC DL T Figure 4: Relative error in MAE under reten- tion rate misspeciﬁcation. x -axis: reten tion rate r used b y the algorithm. y -axis: relativ e diﬀerence of MAE compared to using the true reten tion rate 0.8. ! " #! #" $! $" %! %" # $ % & " ' ( )*+ ,-./012 .324 54 51 Lin ear Logi stic DIC CIC In flu Lear ne r Our3 M ethod Figure 5: MAE of inﬂuence estimation on seven sets of real-w orld cascades with 20% of activ a- tions missing. 1000 nodes, with a n um b er of cascades v arying from 1000 to 44000 . W e follow exactly the same ev aluation metho d as Du et al. [3] with a training/test set split of 60%/40%. T o test the p erformance of inﬂuence function learning under incomplete observ ations, w e randomly delete 20% of the occurrences, setting r = 0 . 8 . The results for other reten tion rates are similar and omitted. Figure 5 shows the MAE of our metho ds and the ﬁv e baselines, av eraged o v er 100 random dra ws of test seed sets, for all groups of memes. While some baselines p erform v ery po orly , ev en compared to the best baseline (InﬂuLearner), our algorithm provides an 18% reduction in MAE (a v eraged ov er the seven groups), showing the p oten tial of data loss aw areness to mitigate its eﬀects. 6 Mo del Extensions So far, w e hav e assumed that the retention rate is the same for all the no des; how ever, our approac h can b e easily generalized to the case in whic h eac h individual node has a diﬀerent (but known) reten tion rate. The follo wing theorem generalizes Theorems 1 and 2. (The pro ofs of all theorems from this section are giv en in App endix C.) Theorem 5. L et λ ∈ (0 , 0 . 5) and for e ach no de v , let r v b e v ’s r etention r ate. W rite ¯ r = 1 n P n i =1 1 r 2 v . • The class of inﬂuenc e functions under the DIC mo del in which al l e dge activation pr ob abilities satisfy w e ∈ [ λ, 1 − λ ] is P A C le arnable with sample c omplexity ˜ O ( ¯ r 2 n 3 m ε 2 ) . • The class of inﬂuenc e functions under the DL T mo del such that the e dge weight for every e dge satisﬁes w e ∈ [ λ, 1 − λ ] , and for every no de v , 1 − P u ∈ N ( v ) w uv ∈ [ λ, 1 − λ ] , is P AC le arnable with sample c omplexity ˜ O ( ¯ r 2 n 3 m ε 2 ) . The empirical ev aluation in the previous section also shows that the p erformance do es not change signiﬁcan tly if the true retention rate of eac h node is independently p erturb ed around the estimated mean loss rate. A second limitation of our approac h is that w e assume the retention rate r to b e kno wn to the algorithm. Estimating r in a real-w orld setting present s a “chic ken and egg” problem; how ev er, we 12 b eliev e that a somewhat accurate estimate of r (p erhaps based on past data for whic h ground truth can be obtained at m uch higher cost) will still be a signiﬁcan t impro vemen t ov er the status quo, namely , pretending that no data are missing. Moreo v er, even approximate information about r leads to p ositiv e results on prop er P AC learnabilit y . W e show that the P A C learnabilit y result can be extended to the case where w e only kno w that the true reten tion rate lies in a given in terv al. Instead of kno wing the exact v alue r , we only kno w that r lies in an interv al measured by the relative error η , namely r ∈ I = [ ¯ r · (1 − η ) , ¯ r · (1 + η )] . Within that in terv al, the reten tion rate is adversarially chosen. W e can then generalize Theorems 1 and 2 as follows. Theorem 6. L et λ ∈ (0 , 0 . 5) , and assume that the gr ound truth r etention r ate r is adversarial ly chosen in I = [ ¯ r · (1 − η ) , ¯ r · (1 + η )] . F or al l ε, δ ∈ (0 , 1) , al l p ar ametrizations of the diﬀusion mo del, and al l distributions P over se e d sets S : when given activ ation-only and incomplete tr aining c asc ades ˜ C , ther e exists a pr op er le arning algorithm A which outputs an inﬂuenc e function F ∈ F M satisfying: Prob ˜ C [ err sq [ F ] − err sq [ F ∗ ] ≥ ε + 4 η 2 (1 − η ) 2 ] ≤ δ. • F or the DIC mo del, when al l e dge activation pr ob abilities satisfy w e ∈ [ λ, 1 − λ ] , the r e quir e d numb er of observations is M = ˜ O ( n 3 m ε 4 r 4 (1 − η ) 4 ) . • F or the DL T mo del, when al l e dges weights satisfy w e ∈ [ λ, 1 − λ ] , and for every no de v , 1 − P u ∈ N ( v ) w uv ∈ [ λ, 1 − λ ] , the r e quir e d numb er of observations is M = ˜ O ( n 3 m ε 4 r 4 (1 − η ) 4 ) . Notice that the result of Theorem 6 is not technically a P AC learnability result, due to the additiv e error that dep ends on the in terv al size. How ever, the theorem provides useful approximation guaran tees when the interv al size is small. A dep endence of the guaran tee on the in terv al size is inevitable. F or when nothing is known ab out the retention rate (for η large enough), all information ab out the marginal activ ation probabilities is lost in the incomplete data: for instance, if no no des are ever ob esrv ed active, we cannot distinguish the case r = 0 from the case in which no no des b ecome activ ated. The exp eriments from the previous section conﬁrm that for moderate uncertain ty ab out the reten tion rate, the performance of our approac h is not v ery sensitiv e to the misestimation of r . 7 Conclusion and F uture W ork W e studied the problem of learning inﬂuence functions under incomplete observ ations, whic h are common in real-w orld applications. W e established prop er P AC learnabilit y of inﬂuence functions under tw o p opular diﬀusion mo dels, the DIC and DL T mo del. The incompleteness only has mo derate impact on the sample complexity b ound, but computational eﬃciency would require an oracle for eﬃcien t empirical risk minimization. W e next designed an eﬃcient improp er learning algorithm with learning guarantees for the DIC, DL T, and CIC models. Our framework can be easily generalized to handle non-uniform (but indep enden t) loss of no de activ ations. W e also ha ve partial results theoretically establishing robustness to misestimations of r (whic h we observ ed exp erimentally in Section 5). A m uch more signiﬁcan t departure for future w ork would b e non-random loss of activ ations, e.g., losing all activ ations of some randomly c hosen 13 no des. As another direction, it w ould be w orthwhile to generalize the P AC learnabilit y results to other diﬀusion models, and to design an eﬃcient algorithm with P AC learning guaran tees. A ckno wledgments W e w ould lik e to thank anon ymous reviewers for useful feedback. The research was sp onsored in part b y NSF research grant I IS-1254206 and b y the U.S. Defense Adv anced Researc h Pro jects Agency (DARP A) under the So cial Media in Strategic Communication (SMISC) program, Agreemen t Num b er W911NF-12-1-0034. The views and conclusions are those of the authors and should not be in terpreted as represen ting the oﬃcial p olicies of the funding agency , or the U.S. Gov ernment. 14 References [1] K. Amin, H. Heidari, and M. Kearns. Learning from con tagion (without timestamps). In Pr o c. 31st Intl. Conf. on Machine L e arning , pages 1845–1853, 2014. [2] F. Chieric hetti, D. Liben-Now ell, and J. M. Klein b erg. Reconstructing Patterns of Informa- tion Diﬀusion from Incomplete Observ ations. In Pr o c. 23r d A dvanc es in Neur al Information Pr o c essing Systems , pages 792–800, 2011. [3] N. Du, Y. Liang, M.-F. Balcan, and L. Song. Inﬂuence F unction Learning in Information Diﬀusion Netw orks. In Pr o c. 31st Intl. Conf. on Machine L e arning , 2014. [4] N. Du, L. Song, M. Gomez-Ro driguez, and H. Zha. Scalable Inﬂuence Estimation in Contin uous- Time Diﬀusion Net works. In Pr o c. 25th A dvanc es in Neur al Information Pr o c essing Systems , pages 3147–3155, 2013. [5] N. Du, L. Song, S. Y uan, and A. J. Smola. Learning Net works of Heterogeneous Inﬂuence. In Pr o c. 24th A dvanc es in Neur al Information Pr o c essing Systems , pages 2780–2788. 2012. [6] Q. Duong, M. P . W ellman, and S. P . Singh. Mo deling information diﬀusion in netw orks with unobserv ed links. In So cialCom/P ASSA T , pages 362–369, 2011. [7] M. Gomez-Ro driguez, D. Balduzzi, and B. Sc hölkopf. Uncov ering the temp oral dynamics of diﬀusion netw orks. In Pr o c. 28th Intl. Conf. on Machine L e arning , pages 561–568, 2011. [8] M. Gomez-Ro driguez, J. Lesko vec, and A. Krause. Inferring net works of diﬀusion and inﬂuence. A CM T r ansactions on Know le dge Disc overy fr om Data , 5(4), 2012. [9] A. Go y al, F. Bonchi, and L. V. S. Lakshmanan. Learning inﬂuence probabilities in so cial net w orks. In Pr o c. 3r d ACM Intl. Conf. on W eb Se ar ch and Data Mining , pages 241–250, 2010. [10] D. Kemp e, J. Kleinberg, and E. T ardos. Maximizing the Spread of Inﬂuence in a So cial Netw ork. In Pr o c. 9th Intl. Conf. on Know le dge Disc overy and Data Mining , pages 137–146, 2003. [11] J. Lesk ov ec, L. Bac kstrom, and J. Klein b erg. Meme-tracking and the dynamics of the news cycle. In Pr o c. 15th Intl. Conf. on Know le dge Disc overy and Data Mining , pages 497–506, 2009. [12] J. Lesko vec, D. Chakrabarti, J. Kleinberg, C. F aloutsos, and Z. Ghahramani. Kronec k er graphs: An approach to mo deling net works. The Journal of Machine L e arning R ese ar ch , 11:985–1042, 2010. [13] A. Y. Lokho v. Reconstructing spreading couplings from partial observ ations of several cascades. In Pr o c. 28th A dvanc es in Neur al Information Pr o c essing Systems , 2016. [14] S. A. Myers and J. Lesk ov ec. On the conv exity of latent so cial net work inference. In Pr o c. 22nd A dvanc es in Neur al Information Pr o c essing Systems , pages 1741–1749, 2010. [15] S. A. My ers, C. Zhu, and J. Lesko vec. Information Diﬀusion and External Inﬂuence in Netw orks. In Pr o c. 18th Intl. Conf. on Know le dge Disc overy and Data Mining , pages 33–41. [16] H. Narasimhan, D. C. P arkes, and Y. Singer. Learnability of Inﬂuence in Netw orks. In Pr o c. 27th A dvanc es in Neur al Information Pr o c essing Systems , pages 3168–3176, 2015. 15 [17] N. Natara jan, I. S. Dhillon, P . K. Ra vikumar, and A. T ewari. Learning with noisy labels. In Pr o c. 25th A dvanc es in Neur al Information Pr o c essing Systems , pages 1196–1204, 2013. [18] N. Praneeth and S. Sujay . Learning the Graph of Epidemic Cascades. In Pr o c. 12th ACM Sigmetrics Conf. on Me asur ement and Mo deling of Computer Systems , pages 211–222, 2012. [19] D. Quang, W. M. P , and S. S. P . Mo deling Information Diﬀusion in Netw orks with Unobserved Links. In So cialCom , pages 362–369, 2011. [20] N. Rosenfeld, M. Nitzan, and A. Glob erson. Discriminativ e learning of infection mo dels. In Pr o c. 9th ACM Intl. Conf. on W eb Se ar ch and Data Mining , pages 563–572, 2016. [21] E. Sadiko v, M. Medina, J. Lesko v ec, and H. Garcia-Molina. Correcting for missing data in information cascades. In Pr o c. 4th A CM Intl. Conf. on W eb Se ar ch and Data Mining , pages 55–64, 2011. [22] L. G. V alian t. A theory of the learnable. Communic ations of the A CM , 27(11):1134–1142, 1984. [23] X. W u, A. Kumar, D. Sheldon, and S. Zilb erstein. P arameter learning for latent netw ork diﬀusion. In Pr o c. 29th Intl. Joint Conf. on A rtiﬁcial Intel ligenc e , pages 2923–2930, 2013. [24] S.-H. Y ang and H. Zha. Mixture of Mutually Exciting Pro cesses for Viral Diﬀusion. In Pr o c. 30th Intl. Conf. on Machine L e arning , 2013. [25] K. Zhou, H. Zha, and L. Song. Learning Social Infectivit y in Sparse Lo w-rank Net work Using Multi-dimensional Hawk es Processes. In Pr o c. 30th Intl. Conf. on Machine L e arning , 2013. 16 A Pro ofs for Section 3 A.1 Pro of of Theorem 1 Here, we ﬂesh out the pro of sk etch from Section 3 for the DIC mo del. F or the transformed graph ˆ G , w e consider only the inﬂuence functions of the n no des in the added lay er V 0 . Recall that w e write ˆ F ( S ) = [ ˆ F 1 0 ( S ) , . . . , ˆ F n 0 ( S )] for the inﬂuence function of those nodes. Let ˆ F ∗ b e the ground truth inﬂuence function for the same no des, and F ∗ the ground truth inﬂuence function for G . Let M ( G ) and M ( ˆ G ) be the class of inﬂuence functions of G and ˆ G . F or functions ˆ F , w e write c err sq [ ˆ F ] = E S,A h 1 n P v 0 ∈ V 0 ( χ A ( v 0 ) − ˆ F v 0 ( S )) 2 i . Notice that the ground truth functions minimize the exp ected squared error, i.e., ˆ F ∗ ∈ argmin ˆ F ∈M ( ˆ G ) c err sq [ ˆ F ] and F ∗ ∈ argmin F ∈M ( G ) err sq [ F ] . W e will sho w that err sq [ F ] − err sq [ F ∗ ] can be made arbitrary small. W e ﬁrst pro v e a v ariation of Theorem 2 from [ 16 ] for learning ˆ F , by verifying that all the supp orting lemmas still apply . The modiﬁed Theorem 2 from [16] is the following: Theorem 7. Assume that the le arning algorithm observes M = ˜ Ω ( ˆ  − 2 n 3 m ) tr aining c asc ades ˆ C = { ( S i , A 0 i ) } under the DIC mo del. Then, with pr ob ability at le ast 1 − δ , we have c err sq [ ˆ F ] − c err sq [ ˆ F ∗ ] ≤ ˆ . (3) Pr o of. While the cascades in ˆ C are incomplete on V , they are c omplete on V 0 . W e use this completeness of the cascades as follows. Consider the restricted class of the DIC mo del on the transformed graph ˆ G in whic h only the m activ ation probabilities w b et w een no des in V are learnable, while the edges ( v , v 0 ) hav e a ﬁxed w eight of r . Deﬁne the log-lik eliho o d for a cascade ( S, A 0 ) as L ( S, A 0 | w ) = X v 0 ∈ V 0 χ A 0 i ( v 0 ) log ( ˆ F w v 0 ( S )) + (1 − χ A 0 i ( v 0 )) log (1 − ˆ F w v 0 ( S )) . The algorithm outputs an inﬂuence function ˆ F based on the solution of the follo wing optimization problem: w ∗ ∈ argmax w ∈ [ λ, 1 − λ ] m M X i =1 L ( S i , A 0 i | w ) . As the function ˆ F is learned from the DIC model, Lemma 3 in [ 16 ] carries thorough to establish the Lipschitz contin uity of DIC inﬂuence functions. Lemma 8 (Lipsc hitz con tinuit y of DIC inﬂuence) . Fix S ⊆ V and v 0 ∈ V 0 . F or any w , w 0 ∈ R m with || w − w 0 || 1 ≤  , we have | ˆ F w v 0 ( S ) − ˆ F w 0 v 0 ( S ) | ≤  . Moreo v er, such instances (on 2 n no des) still only hav e m parameters, and the L ∞ co v ering n um b er b ound in Lemma 8 from [16] applies without an y c hanges. Lemma 9 (Co v ering n umber of DIC inﬂuence functions) . The L ∞ c overing numb er of the r estricte d class of the DIC inﬂuenc e functions on the tr ansforme d gr aph for r adius  is O (( m/ ) m ) . Establishing the sample complexit y b ound on the log-likelihoo d ob jectiv e (Lemma 4 in [ 16 ]) requires that all function v alues be bounded a w a y from 0 and 1 (Lemma 9 in [ 16 ]). W e assume that r < 1 , as Lemma 4 in [ 16 ] directly holds when there are no missing data at all. Let λ > 0 b e the b ound on the edge activ ation probabilities in G from our Theorem 1; that is, λ ≤ w uv ≤ 1 − λ for all 17 u, v ∈ V . Due to the la yered structure of ˆ G , w e ha ve that r · λ n ≤ ˆ F v 0 ( S ) ≤ r · (1 − λ n ) . 8 Therefore, Lemma 4 in [16] carries thorough with the same sample complexity of ˜ O (ˆ  − 2 n 3 m ) : Lemma 10 (Sample complexit y guarantee on the log-lik eliho o d ob jectiv e) . Fix , δ ∈ (0 , 1) and M = ˜ Ω(ˆ  − 2 n 3 m ) . With pr ob ability at le ast 1 − δ (over the dr aws of the tr aining c asc ades), max w ∈ [ λ, 1 − λ ] m E S,A 0  1 n L ( S, A 0 | w )  − E S,A 0  1 n L ( S, A 0 | w ∗ )  ≤ . As all the lemmas used in the pro of of Theorem 2 from [ 16 ] remain true, w e ha ve prov ed our Theorem 7, with the guaran tee that err sq [ ˆ F ] − err sq [ ˆ F ∗ ] ≤ ˆ  . Finally , we recall that according to Equation (1) , F v ( S ) = 1 r · ˆ F v 0 ( S ) and F ∗ v ( S ) = 1 r · ˆ F ∗ v 0 ( S ) , giving us that err sq [ F ] − err sq [ F ∗ ] ( ∗ ) = 1 n X v ∈ V E S  ( F v ( S ) − F ∗ v ( S )) 2  Equation (1) = 1 n X v 0 ∈ ˆ V E S  ( 1 r ˆ F v 0 ( S ) − 1 r ˆ F ∗ v 0 ( S )) 2  ( ∗ ) = c err sq [ ˆ F ] − c err sq [ ˆ F ∗ ] r 2 Equation (3) ≤ ˆ  r 2 (The steps labeled (*) are applications of Equation (4) from [ 16 ].) No w, by taking ˆ  = ε · r 2 , with ˜ O ( n 3 m ε 2 r 4 ) incomplete cascades, we obtain that err sq [ F ] − err sq [ F ∗ ] ≤ ε . A.2 Pro of of Theorem 2 W e will sho w that the analogue of Theorem 7 for the DL T mo del also holds. W e do so b y follo wing the same sequence of steps as in App endix A.1 and verifying that all the steps in the pro of of Theorem 2 in [ 16 ] still hold. The main diﬀerence is that a new pro of is needed for establishing Lipsc hitz con tinuit y of the DL T inﬂuence function with resp ect to the L 1 norm (the analogue of Lemma 3 in [16]). W e begin b y establishing this lemma. Lemma 11 (Lipsc hitz con tin uit y) . Fix S ⊆ V and u ∈ V . F or any w , w 0 ∈ R m with || w − w 0 || 1 ≤ ε , we have that | F w u ( S ) − F w 0 u ( S ) | ≤ ε . Pr o of. As sho wn in [ 10 ], the inﬂuence functions under the DL T model can b e also characterized via the reac hability under a distribution o ver live-edge graphs. Sp eciﬁcally , the distribution is as follows [ 10 , Theorem 2.5]: for eac h no de v , pick at most one of its incoming edges at random, selecting the edge from z ∈ N ( v ) with probability w z v and selecting no incoming edge with probabilit y 1 − P z ∈ N ( v ) w z v . F or eac h no de v , let the random v ariable X v b e the incoming neigh b or chosen for v , with X v = ⊥ if v has no incoming edge. F or simplicity of notation, we deﬁne w ⊥ v = 1 − P z ∈ N ( v ) w z v . 8 As in the pro of of Lemma 4 in [ 16 ], we assume that there exists a path in the graph ˆ G from a no de in S to no de v 0 ; the cases where this assumption fails can b e handled easily . 18 Deﬁne X = ( X v ) v ∈ V , and write X for the set of all such v ectors X . F or any no de v , we write X − v for the set of all v ectors with edges (or ⊥ ) for all nodes except v . And for a vector X ∈ X − v , w e write X [ v 7→ z ] for the vector in whic h all en tries agree with those in X , except for the en try for v whic h is no w z . Let R X ( v , S ) b e the indicator function of whether node v is reachable from the seed set S in the graph ( V , X ) , where w e in terpret X as the set of all edges ( X v , v ) with X v 6 = ⊥ . Claim 2.6 of [ 10 ] implies that F w u ( S ) = X X Y v ∈ V w X v v R X ( u, S ) . W e ﬁx an edge ( y, y 0 ) and tak e the partial deriv ative of F w u ( S ) with resp ect to w y y 0 :     ∂ F w u ( S ) ∂ w y y 0     =       ∂ ∂ w y y 0   X z ∈ N ( y ) ∪{⊥} w z y X X ∈X − y Y v ∈ V \{ y } w X v v · R X [ y 7→ z ] ( u, S )         =       X X ∈X − y Y v ∈ V \{ y } w X v v · R X [ y 7→ y 0 ] ( u, S ) − X X ∈X − y Y v ∈ V \{ y } w X v v · R X [ y 7→⊥ ] ( u, S )       ≤       X X ∈X − y Y v ∈ V \{ y } w X v v       = 1 . Therefore, ||∇ w F w u ( S ) || ∞ ≤ 1 , implying Lipschitz contin uity . Next, we b ound the v alues of the inﬂuence functions a w a y from 0 and 1. Because each edge weigh t w z v ∈ [ λ, 1 − λ ] by assumption, and we further assumed that w ⊥ v = 1 − P z ∈ N ( v ) w z v ∈ [ λ, 1 − λ ] , it follows directly (as in the pro of for the DIC mo del) that r · λ n ≤ ˆ F v 0 ( S ) ≤ r · (1 − λ n ) . This establishes the analogue of Lemma 9 in [ 16 ], and w e can therefore apply Lemma 4 in [ 16 ], obtaining a sample complexit y of ˜ O ( ˆ  − 2 n 3 m ) under the DL T mo del. As all the used lemmas remain true, the results of Theorem 7 also hold for the DL T mo del. Finally , exploiting the same relation b etw een F ( S ) and ˆ F ( S ) as in the proof of Theorem 1 leads to the conclusion of Theorem 2. B Pro of of Theorem 4 Let M = ˜ Ω ( log C  4 r 2 ) , and let F ˜ β ,λ v ( S ) b e the inﬂuence functions obtained in Theorem 4. W e will sho w that for an y single no de v , with probability at least 1 − δ /n , E S h ( F ˜ β ,λ v ( S ) − F ∗ v ( S )) 2 i ≤ ε. The theorem then follows b y taking a union bound o ver all n no des. Recall that M λ is function class of all truncated inﬂuence functions. W e write R M ( M λ ) := E S i ∼P , (  i ) i ∼ Uniform ( {− 1 , 1 } M ) " sup F ∈M λ 1 M M X i =1  i · F v ( S i ) # 19 for its Rademac her complexit y , where the  i ’s are i.i.d. Rademacher (symmetric Bernoulli) random v ariables. By Lemma 12 in [ 3 ], there exists a truncated inﬂuence function F ˆ β ,λ v ∈ M λ with K = O ( C 2 ε 2 log C n εδ ) features suc h that E S ∼P h ( F ˆ β ,λ v ( S ) − F ∗ v ( S )) 2 i ≤ 2 ε 2 + 2 λ 2 with probability at least 1 − δ 2 n . Using the log likelihoo d function ` ( t, y ) = y log t + (1 − y ) log (1 − t ) as deﬁned in Section 4, we write the log loss of inﬂuence function F v as err log [ F v ] = E S,A [ − ` ( F v ( S ) , A )] . By Lemma 2 in [ 17 ], with probability at least 1 − δ 2 n , err log [ F ˜ β ,λ v ] ≤ min f ∈M λ err log [ f ] + 4 λ · r R M ( M λ ) + r log(2 n/δ ) 2 M . Because F ˆ β ,λ v ∈ M λ , we can b ound that min f ∈M λ err log [ f ] ≤ err log [ F ˆ β ,λ v ( S )] on the right-hand side. Subtracting err log [ F ∗ v ] from both sides, we obtain err log [ F ˜ β ,λ v ] − err log [ F ∗ v ] ≤ err log [ F ˆ β ,λ v ] − err log [ F ∗ v ] + 4 λ · r R M ( M λ ) + r log(2 n/δ ) 2 M . (4) The square and log errors can b e related to eac h other as in the proof of Theorem 2 in [ 16 ], as follo ws: E S h ( F ˜ β ,λ v ( S ) − F ∗ v ( S )) 2 i ≤ 1 2 ( err log [ F ˜ β ,λ v ] − err log [ F ∗ v ]) . Hence, in order to obtain a b ound on E S h ( F ˜ β ,λ v ( S ) − F ∗ v ( S )) 2 i , it suﬃces to upp er-b ound the righ t-hand side of (4) . The term err log [ F ˆ β ,λ v ] − err log [ F ∗ v ] can b e b ounded as in the pro of of Lemma 2 in [ 3 ], using Lemma 11 and Lemma 16 from [ 3 ]: Assume that F ˆ β ,λ v uses K = Ω( C 2 ˆ  2 log C n ˆ  ˆ δ ) features. Then, with probabilit y at least 1 − ˆ δ , w e ha ve err log [ F ˆ β ,λ v ] − err log [ F ∗ v ] ≤ ˆ  2 + λ 2 λ (1 + log 1 λ ) . (5) Next, we b ound the Rademac her complexity of the function class M λ : Lemma 12. The R ademacher c omplexity R M ( M λ ) for the function class M λ with at most K fe atur es is at most q 2 log (1+ K ) M . Pr o of. Recall that w e use basis functions φ i ( S ) := min { 1 , χ > S r T i } . Let W = { φ i | i = 1 , . . . , K } ∪ { 1 } , where 1 is the constan t function with v alue 1 . By deﬁnition, we ha ve M λ ⊆ conv ( W ) , where con v ( W ) denotes the conv ex h ull. Therefore, R M ( M λ ) ≤ R M ( con v ( W )) = R M ( W ) . Since | φ i ( S ) | ≤ 1 , b y Massart’s ﬁnite lemma 9 , we hav e R M ( W ) ≤ q 2 log (1+ K ) M , completing the pro of. T o ﬁnish the pro of of Theorem 4, let  b e the desired accuracy . Deﬁne ˆ δ = δ 2 n and ˆ  = λ =  c 0 log 1  , where c 0 is a suﬃciently large constant. Then, the righ t-hand side of (5) is upp er-b ounded by ˆ  · (1 + log 1 ˆ  ) ≤  2 . 9 Massart’s ﬁnite lemma states the follo wing: Let F b e a ﬁnite set of functions, such that sup f ∈F 1 n P n i =1 f ( X i ) 2 ≤ C 2 for any v ariables v alues X 1 , . . . , X n . Then, the Rademacher complexit y of F is upp er b ounded b y R n ( F ) ≤ q 2 C 2 log |F | n . 20 With M = Ω( log K  4 r 2 ) , we ha ve 4 λ · r R M ( M λ ) ≤  4 . Whenev er M = Ω( log n δ  2 ) , we also get that q log( n/δ ) 2 M ≤  4 . T aking M as the maxim um of the ab ov e three, whic h is satisﬁed when M = ˜ Ω ( log C  4 r 2 ) , we can substitute all of the b ounds in to the right-hand side of (4) and obtain that E S h ( F ˜ β ,λ v ( S ) − F ∗ v ( S )) 2 i ≤  with probability at least 1 − δ n . No w, taking a union b ound ov er all no des v concludes the proof. Discussion. Notice that when the reten tion rate is 1, our Theorem 4 signiﬁcantly improv es the sample complexity b ound compared to Du et al. [ 3 ]. The sample complexit y in [ 3 ] is ˜ O ( C 2  3 ) , while our theorem implies a sample complexity of ˜ O ( log C  4 ) under complete observ ations. The impro v emen t is derived from bounding the Rademacher complexit y of the function class M λ instead of the L 2 , ∞ dimension. The Rademac her bound leads to a logarithmic dependence of the sample complexity on the num b er of features K , whereas the L 2 , ∞ b ound results in a polynomial dependence. C Pro ofs for Section 6 C.1 Pro of of Theorem 5 Let r i b e the retention rate of node i and ˆ ε i the desired error guarantee for learning the inﬂuence function F v i . It is immediate from the proofs of Theorems 1 and 2 that M = max i { ˜ O ( n 3 m ˆ ε 2 i r 4 i ) } incomplete cascades are suﬃcient to guarantee that with probabilit y at least 1 − δ , for eac h i , we obtain E S  ( F v i ( S ) − F ∗ v i ( S )) 2  ≤ ˆ ε i . The estimation error for the ov erall inﬂuence is the av erage ε = 1 n P i ˆ ε i . Given non-uniform reten tion rates r i , we can c ho ose non-uniform ˆ ε i yielding the desired ε , so as to minimize the sample complexit y . The corresp onding optimization problem is the following: Minimize max i 1 ˆ ε 2 i r 4 i sub ject to 1 n X i ˆ ε i = ε, ˆ ε i > 0 for all i. The minim um is achiev ed when all 1 ˆ ε 2 i r 4 i are equal to some constant C , meaning that ˆ ε i = 1 √ C · r 2 i . The constant C can b e obtained from the constrain t 1 n P i ˆ ε i = ε , yielding that C = ¯ r 2 ε 2 , where ¯ r = 1 n P i 1 r 2 i . This completes the pro of of the theorem. C.2 Pro of of Theorem 6 The pro of of Theorem 6 is similar to that of Theorems 1 and 2. W e again treat the incomplete cascades as complete cascades in a transformed graph ˆ G . Because w e no longer know the true reten tion rate r , we cannot set the probabilit y on the egde ( v , v 0 ) to ˆ w v v 0 = r . Instead, w e treat the ˆ w v v 0 as parameters to infer, under the constraint that ˆ w v v 0 ∈ I = [ ¯ r · (1 − η ) , ¯ r · (1 + η )] . W e sp ell out the details of the pro of for the DIC mo del; the pro of for the DL T mo del is practically identical. 21 As in Theorem 1, w e consider only the inﬂuence functions of the n no des in the added la yer V 0 . F ollo wing the pro of of Theorem 1, with probability as least 1 − δ , using M = ˜ O ( n 3 m ε 2 ) cascades, for all v 0 ∈ V 0 , E S h ( ˆ F v 0 ( S ) − ˆ F ∗ v 0 ( S )) 2 i ≤ ˆ ε. (6) In the pro of of Theorem 1, the fact that ˆ w v v 0 = r allo w ed us to obtain the inﬂuence function at v via F v ( S ) = 1 r · ˆ F v 0 ( S ) . Since the edge probabilities ˆ w v v 0 are now inferred, we instead use the inferred probabilities for obtaining the activ ation functions for nodes v . On the other hand, the ground truth inﬂuence functions for v and v 0 are still related via the correct v alue r . W riting ˆ r v = ˆ w v v 0 , this giv es us the follo wing: F v ( S ) = ˆ F v 0 ( S ) ˆ r v F ∗ v ( S ) = ˆ F ∗ v 0 ( S ) r . (7) Consider E S  ( F v ( S ) − F ∗ v ( S )) 2  for any no de v . The exp ected squared estimation error for no de v can now b e written as follows: E S  ( F v ( S ) − F ∗ v ( S )) 2  Equation (7) = E S   ˆ F v 0 ( S ) ˆ r v − ˆ F ∗ v 0 ( S ) r ! 2   = E S   ˆ F v 0 ( S ) − ˆ F ∗ v 0 ( S ) r + ˆ F v 0 ( S ) r − ˆ r v r · ˆ r v ! 2   ≤ 1 r 2 · E S h ( ˆ F v i ( S ) − ˆ F ∗ v i ( S )) 2 i (8) + 2 r · E S " ˆ F v 0 ( S ) ˆ r v · | r − ˆ r v | r · | ˆ F v 0 ( S ) − ˆ F ∗ v 0 ( S ) | # (9) + E S " ˆ F 2 v 0 ( S ) ˆ r 2 v · ( r − ˆ r v ) 2 r 2 # . (10) W e will b ound the term (8) using Inequality (6) . In order to bound the terms (9) and (10) , observe the following: • F or all seed sets S and no des v 0 ∈ V 0 , we hav e ˆ F v 0 ( S ) ≤ ˆ r v b y the structure of the transformed graph. • | r − ˆ r v | r ≤ 2 η 1 − η follo ws from the assumption that ˆ r v , r ∈ [ ¯ r · (1 − η ) , ¯ r · (1 + η )] . • By Jensen’s inequalit y and Inequalit y (6), E S h | ˆ F v 0 ( S ) − ˆ F ∗ v 0 ( S ) | i ≤ √ ˆ ε . Using the preceding three inequalities, w e can b ound the term (9) b y 4 η √ ˆ ε r (1 − η ) and the term (10) b y 4 η 2 (1 − η ) 2 . When ˆ ε ≤ εr 2 2 , using the inequalit y (6) , the term (8) is upp er-b ounded b y 1 r 2 E S  ( F v i ( S ) − F ∗ v i ( S )) 2  ≤ ε 2 . Similarly , when ˆ ε ≤ ε 2 r 2 (1 − η ) 2 64 η 2 , the additive term (9) is upp er-b ounded b y 2 η √ ˆ ε r (1 − η ) ≤ ε 2 . Thus, 22 taking ˆ ε = min { εr 2 2 , ε 2 r 2 (1 − η ) 2 64 η 2 } , the ﬁrst tw o terms combined are bounded b y ε . Thus, using M = ˜ O ( n 3 m ε 4 r 4 (1 − η ) 4 ) cascades, with probability at least 1 − δ , for eac h no de v , E S  ( F v ( S ) − F ∗ v ( S )) 2  ≤ ε + 4 η 2 (1 − η ) 2 . No w, taking an av erage on b oth sides of the ab ov e equation o ver all the no des v ∈ V concludes the pro of of Theorem 6. 23

Learning Influence Functions from Incomplete Observations

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment