Influential Node Detection in Implicit Social Networks using Multi-task Gaussian Copula Models

Qunwei Li (qli33@syr.edu), Syracuse University
Bhavya Kailkhura (kailkhura1@llnl.gov) *, Lawrence Livermore National Labs
Jayaraman J. Thiagarajan (jjayaram@llnl.gov), Lawrence Livermore National Labs
Zhenliang Zhang (zhenliang.zhang@intel.com), Intel Labs
Pramod K. Varshney (varshney@syr.edu), Syracuse University

Editor: Oren Anava, Marco Cuturi, Azadeh Khaleghi, Vitaly Kuznetsov, Alexander Rakhlin

Abstract

Influential node detection is a central research topic in social network analysis. Many existing methods rely on the assumption that the network structure is completely known a priori. However, in many applications, the network structure is unavailable to explain the underlying information diffusion phenomenon. To address the challenge of information diffusion analysis with incomplete knowledge of network structure, we develop a multi-task low rank linear influence model. By exploiting the relationships between contagions, our approach can simultaneously predict the volume (i.e. time series prediction) for each contagion (or topic) and automatically identify the most influential nodes for each contagion. The proposed model is validated using synthetic data and an ISIS Twitter dataset. In addition to improving the volume prediction performance significantly, we show that the proposed approach can reliably infer the most influential users for specific contagions.

1. Introduction

Information emerges dynamically and diffuses quickly via agent interactions in complex networks (e.g. social networks) (López-Pintado, 2008). Consequently, understanding and predicting information diffusion mechanisms is challenging.
There is a rapidly growing interest in exploiting knowledge of the information dynamics to better characterize the factors influencing the spread of diseases, planned terrorist attacks, effective social marketing campaigns, etc. (Guille and Hacid, 2012). The broad applicability of this problem in social network analysis has led to focused research on the following questions: (I) Which contagions are the most popular and can diffuse the most? (II) Which members of the network are influential and play important roles in the diffusion process? (III) What is the range over which the contagions can diffuse (Guille et al., 2013)? While attempting to answer these questions, one is confronted with two crucial challenges. First, a descriptive diffusion model, which can mimic the behavior observed in real world data, is required. Second, efficient learning algorithms are required for inferring the influence structure based on the assumed diffusion model.

* This work was supported in part by ARO under Grant W911NF-14-1-0339. This work was performed under the auspices of the U.S. Dept. of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

A variety of information diffusion prediction frameworks have been developed in the literature (Yang and Leskovec, 2010; Wang et al., 2013; Guille et al., 2013; Du et al., 2013; Zhang et al., 2016). A typical assumption in many of these approaches is that a connected network graph and knowledge of the corresponding structure are available a priori. However, in practice, the structure of the network can be implicit or difficult to model, e.g., modeling the structure of the spread of infectious disease is almost impossible. As a result, network structure unaware diffusion prediction models have gained interest. For example, Yang and Leskovec (2010)
proposed a linear influence model (LIM), which can effectively predict the information volume by assuming that each of the contagions spreads with the same influence in an implicit network. Subsequently, in (Wang et al., 2013), the authors extended LIM by exploiting the sparse structure in the influence function to identify the influential nodes. Though the relationships between multiple contagions can be used for more accurate modeling, most of the existing approaches ignore that information.

In this paper, we address the above issues by augmenting linear influence models with complex task dependency information. More specifically, we consider the dependency of different contagions in the network, and characterize their relationships using copula theory. Furthermore, by imposing a low-rank regularizer, we are able to characterize the clustering structure of the contagions and the nodes in the network. Through this novel formulation, we attempt to both improve the accuracy of the prediction system and better regularize the influence structure learning problem. Finally, we develop an efficient algorithm based on proximal mappings to solve this optimization problem. Experiments with synthetic data reveal that the proposed approach fares significantly better than a state-of-the-art multi-task variant of LIM both in terms of volume prediction and influence structure estimation performance. In addition, we demonstrate the superiority of the proposed method in predicting the time-varying volume of tweets using the ISIS Twitter dataset (footnote 1).

2. Background

In this section, we present the formulation of the linear influence model (LIM) (Yang and Leskovec, 2010) and discuss its limitations. Consider a set of $N$ nodes that participate in an information diffusion process of $K$ different contagions over time. Node $u \in \{1, \ldots, N\}$ can be infected by contagion $k \in \{1, \ldots, K\}$ at time $t \in \{0, 1, \ldots, T\}$.
The volume $V_k(t)$ is defined as the total number of nodes that get infected by contagion $k$ at time $t$. Let the indicator function $M_{u,k}(t) = 1$ represent the event that node $u$ got infected by contagion $k$ at time $t$, and $0$ otherwise. LIM models the volume $V_k(t)$ as a sum of influences of nodes $u$ that got infected before time $t$:

$$V_k(t+1) = \sum_{u=1}^{N} \sum_{l=0}^{L-1} M_{u,k}(t-l)\, I_u(l+1), \tag{1}$$

where each node $u$ has a particular non-negative influence function $I_u(l)$. One can simply think of $I_u(l)$ as the number of follow-up infections $l$ time units after $u$ got infected. The value of $L$ is set to indicate that the influence of a node drops to $0$ after $L$ time units. Thus, the influence of node $u$ is denoted by the vector $I_u = (I_u(1), \ldots, I_u(L))^T \in \mathbb{R}^{L \times 1}$. Next, using the notation $V_k = (V_k(1), \ldots, V_k(T))^T \in \mathbb{R}^{T \times 1}$ and $I = (I_1^T, \cdots, I_N^T)^T \in \mathbb{R}^{LN \times 1}$, the inference procedure of LIM can be formulated as follows:

$$\text{minimize} \quad \sum_{k=1}^{K} \| V_k - M_k \cdot I \|_2^2 + \mathbb{1}(I), \tag{2}$$

where $M_k$ is obtained via concatenation of $M_{u,k}$, $\|\cdot\|_2$ denotes the Euclidean norm, and $\mathbb{1}(I)$ is an indicator function that is zero when every entry of the influence is non-negative and $+\infty$ otherwise.

1. ISIS dataset from Kaggle is available at https://www.kaggle.com/kzaman/how-isis-uses-twitter.

Though LIM has been effective in predicting the future volume for each contagion, it assumes that each node has the same influence across all the contagions. Consequently, to achieve contagion-sensitive node selection in an implicit network, the LIM model was extended and the multitask sparse linear influential model (MSLIM) was proposed in (Wang et al., 2013). The influence function is defined by extending $I_u$ in LIM into the contagion-sensitive $I_{u,k} \in \mathbb{R}^{L \times 1}$, which is an $L$-length vector representing the influence of node $u$ for contagion $k$. For each contagion $k$, let $I_k \in \mathbb{R}^{LN \times 1}$ be the vector obtained by concatenating $I_{1k}, \ldots, I_{Nk}$. For each node $u$, the influence matrix for node $u$ is defined as $I_u = (I_{u1}, \ldots, I_{uK}) \in \mathbb{R}^{L \times K}$. Using this notation, the inference procedure to estimate $I_{u,k}$ was formulated as follows:

$$\text{minimize} \quad \frac{1}{2} \sum_{k=1}^{K} \| V_k - M_k \cdot I_k \|_2^2 + \lambda \sum_{u=1}^{N} \| I_u \|_F + \gamma \sum_{u=1}^{N} \sum_{k=1}^{K} \| I_{uk} \|_2 + \mathbb{1}(I), \tag{3}$$

where $\|\cdot\|_F$ denotes the Frobenius norm. The penalty term $\| I_u \|_F$ was used to encourage the entire matrix $I_u$ to be zero altogether, which means that node $u$ is non-influential for all the contagions. If the estimated $\| I_u \|_F > 0$ (i.e., the matrix $I_u$ is non-zero), a fine-grained selection is performed by the penalty $\sum_{u=1}^{N} \sum_{k=1}^{K} \| I_{uk} \|_2$, which is essentially a group-Lasso penalty and can encourage the sparsity of the vectors $\{I_{uk}\}$. For a specific contagion $k$, one can identify the most influential nodes by finding the optimal solution $\{\hat{I}_{uk}\}$ of (3).

However, the penalty terms used in MSLIM encourage certain nodes to have no influence over any of the contagions, which may not be true in practice. Furthermore, in most real world applications, there exist complex dependencies among the contagions. In order to alleviate these shortcomings, we propose a novel probabilistic multi-task learning framework and develop efficient optimization strategies.

3. Proposed Approach

Probabilistic Multi-Contagion Modeling of Diffusion: We assume a linear regression model for each task: $V_k = M_k I_k + n_k$, where $V_k$, $M_k$ and $I_k$ are defined as before, and $n_k \in \mathbb{R}^{T \times 1}$ is an i.i.d. zero-mean Gaussian noise vector with covariance matrix $\Sigma_k$. The distribution for $V_k$ given $M_k$, $I_k$ and $\Sigma_k$ can be expressed as

$$V_k \mid M_k, I_k, \Sigma_k \sim \mathcal{N}(M_k I_k, \Sigma_k) = \frac{\exp\left( -\frac{1}{2} (V_k - M_k I_k)^T \Sigma_k^{-1} (V_k - M_k I_k) \right)}{(2\pi)^{T/2}\, |\Sigma_k|^{1/2}}. \tag{4}$$

Assuming that the influence for a single contagion is also Gaussian distributed, we can express the marginal distributions as $I_k \mid m_k, \Theta_k \sim \mathcal{N}(m_k, \Theta_k)$, where $m_k \in \mathbb{R}^{LN \times 1}$ is the mean vector and can be expressed as $m_k = [\mathbf{m}_{1,k}^T, \ldots, \mathbf{m}_{N,k}^T]^T$, and $\Theta_k \in \mathbb{R}^{LN \times LN}$ is the covariance matrix of $I_k$. For a node $u$ and contagion $k$, we assume that the variables in the influence $I_{uk}$ have the same mean, i.e., $\mathbf{m}_{u,k} = m_{u,k} \mathbf{1}_{L \times 1}$, where $m_{u,k}$ is a scalar and $\mathbf{1}_{L \times 1}$ is a vector of all ones with dimension $L \times 1$. Let $m_0 \in \mathbb{R}^{N \times K}$ represent the mean matrix with entries $m_{u,k}$; it is connected to the full mean matrix through $m = (m_1, \ldots, m_K) = Q m_0$, where $Q = I_{N \times N} \otimes \mathbf{1}_{L \times 1} \in \mathbb{R}^{LN \times N}$, $I_{N \times N}$ is the identity matrix with dimension $N \times N$, and $\otimes$ is the Kronecker product operator.

3.1 Dependence Structure Modeling Using Copulas

Consider a general case where the contagions are correlated. We construct a new influence matrix $I = (I_1, \ldots, I_K) \in \mathbb{R}^{LN \times K}$. In our formulation, the $I_k$'s are assumed to be correlated, and the joint distribution of $I$ is not a simple product of the marginal distributions of the $I_k$, as is assumed by most multi-task learning formulations. Here, we propose to use a multi-task copula that is obtained by tailoring the copula model for the multi-task learning problem.

Theorem 1 (Sklar's Theorem). Consider an $N$-dimensional distribution function $F$ with marginal distribution functions $F_1, \ldots, F_N$. Then there exists a copula $C$, such that for all $x_1, \ldots, x_N$ in $[-\infty, \infty]$, $F(x_1, \ldots, x_N) = C(F_1(x_1), \ldots, F_N(x_N))$. If $F_n$ is continuous for $1 \le n \le N$, then $C$ is unique; otherwise it is determined uniquely on $\operatorname{Ran} F_1 \times \ldots \times \operatorname{Ran} F_N$, where $\operatorname{Ran} F_n$ is the range of $F_n$. Conversely, given a copula $C$ and univariate CDFs $F_1, \ldots, F_N$, $F$ is a valid multivariate CDF with marginals $F_1, \ldots, F_N$.
As a direct consequence of Sklar's Theorem, for continuous distributions, the joint probability density function (PDF) $f(x_1, \ldots, x_N)$ is obtained by

$$f(x_1, \ldots, x_N) = \left( \prod_{n=1}^{N} f_n(x_n) \right) c(F_1(x_1), \ldots, F_N(x_N)), \tag{5}$$

where $f_n(\cdot)$ is the marginal PDF and $c$ is termed the copula density, given by

$$c(v) = \frac{\partial^N C(v_1, \ldots, v_N)}{\partial v_1 \cdots \partial v_N}, \tag{6}$$

where $v_n = F_n(x_n)$. We extend the copula theory to multi-task learning and express the joint distribution of $I$ as follows:

$$p(I_1, I_2, \ldots, I_K) = \left( \prod_{k=1}^{K} \mathcal{N}(m_k, \Theta_k) \right) c(F_1(I_1), F_2(I_2), \ldots, F_K(I_K)), \tag{7}$$

where $F_k(I_k)$ is the CDF of the influence for the $k$th contagion. The copula density function $c(\cdot)$ takes all marginal CDFs $\{F_k(I_k)\}_{k=1}^{K}$ as its arguments, and maintains the output correlations in a parametric form.

Gaussian copula: There are a finite number of well defined copula families that can characterize several dependence structures. Though the choice of an appropriate copula could be investigated, we consider the Gaussian copula for its favorable analytical properties. A Gaussian copula can be constructed from the multivariate Gaussian CDF, and the resulting prior on $I$ is given by a matrix-variate Gaussian distribution:

$$I \sim \mathcal{MN}_{LN \times K}(m, U, \Omega) = \frac{\exp\left( -\frac{1}{2} \operatorname{tr}\left[ U^{-1} (I - m)\, \Omega^{-1} (I - m)^T \right] \right)}{(2\pi)^{LNK/2}\, |\Omega|^{LN/2}\, |U|^{K/2}}, \tag{8}$$

where $U \in \mathbb{R}^{LN \times LN}$ is the row covariance matrix modeling the correlation between the influence of different nodes, $\Omega \in \mathbb{R}^{K \times K}$ is the column covariance matrix modeling the correlation between the influence for different contagions, and $m \in \mathbb{R}^{LN \times K}$ is the mean matrix of $I$. The two covariances can be computed as $E[(I - m)(I - m)^T] = U \operatorname{tr}(\Omega)$ and $E[(I - m)^T (I - m)] = \Omega \operatorname{tr}(U)$, respectively.
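As a small numerical illustration of Eq. (5) with a Gaussian copula, the sketch below checks the two-dimensional case: the product of two Gaussian marginal densities times the bivariate Gaussian copula density should reproduce the closed-form bivariate Gaussian density. This is only a sanity check under stated assumptions, not part of the proposed method; the function names, the marginal parameters, and the correlation value 0.6 are all illustrative choices.

```python
import math
from statistics import NormalDist


def gaussian_copula_density(v1, v2, rho):
    """Density c(v1, v2) of the bivariate Gaussian copula with correlation rho."""
    std = NormalDist()  # standard normal, used to map uniforms back to z-scores
    z1, z2 = std.inv_cdf(v1), std.inv_cdf(v2)
    quad = (rho**2 * (z1**2 + z2**2) - 2.0 * rho * z1 * z2) / (2.0 * (1.0 - rho**2))
    return math.exp(-quad) / math.sqrt(1.0 - rho**2)


def joint_pdf_via_copula(x1, x2, marg1, marg2, rho):
    """Right-hand side of Eq. (5) for N = 2: marginal PDFs times copula density."""
    c = gaussian_copula_density(marg1.cdf(x1), marg2.cdf(x2), rho)
    return marg1.pdf(x1) * marg2.pdf(x2) * c


def bivariate_normal_pdf(x1, x2, marg1, marg2, rho):
    """Closed-form bivariate Gaussian density, used only as a reference value."""
    z1 = (x1 - marg1.mean) / marg1.stdev
    z2 = (x2 - marg2.mean) / marg2.stdev
    quad = (z1**2 - 2.0 * rho * z1 * z2 + z2**2) / (2.0 * (1.0 - rho**2))
    norm = 2.0 * math.pi * marg1.stdev * marg2.stdev * math.sqrt(1.0 - rho**2)
    return math.exp(-quad) / norm


# Hypothetical marginals and correlation, chosen only for the check.
marg1 = NormalDist(mu=1.0, sigma=2.0)
marg2 = NormalDist(mu=-0.5, sigma=0.7)
lhs = bivariate_normal_pdf(0.3, -1.2, marg1, marg2, rho=0.6)
rhs = joint_pdf_via_copula(0.3, -1.2, marg1, marg2, rho=0.6)
print(abs(lhs - rhs) < 1e-9)
```

With zero correlation the copula density is identically one, so the joint density reduces to the plain product of marginals used by formulations that treat tasks as independent.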
We assume that the $N$ individual nodes are spreading the contagions and influencing others independently, and thus the row covariance matrix is diagonal and can be expressed as $U = \operatorname{diag}(e_1^2, e_2^2, \ldots, e_N^2) \otimes I_{L \times L}$, where $e_n^2$, $n \in \{1, \ldots, N\}$, are scalars. The posterior distribution for $I$, which is proportional to the product of the likelihood function in Eq. 4 and the prior in Eq. 8, is given as

$$p(I \mid M, V, \Sigma, U, \Omega) \propto p(V \mid M, I, \Sigma)\, p(I \mid m, U, \Omega) = \left( \prod_{k=1}^{K} \mathcal{N}(M_k I_k, \Sigma_k) \right) \mathcal{MN}_{LN \times K}(I \mid m, U, \Omega), \tag{9}$$

where $M = (M_1, \ldots, M_K) \in \mathbb{R}^{T \times LNK}$, $V = (V_1, \ldots, V_K) \in \mathbb{R}^{T \times K}$, and $\Sigma$ is the corresponding covariance matrix of $n = (n_1, \ldots, n_K) \in \mathbb{R}^{T \times K}$. We assume $\Sigma_k \triangleq \sigma^2 I_{T \times T}$ and also an identical value $e_n^2 = e^2$, $\forall k = 1, \ldots, K$, $\forall n = 1, \ldots, N$. We employ maximum a posteriori (MAP) and maximum likelihood estimation (MLE), and obtain $I$, $m$, and $\Omega$ by

$$\min_{I, m, \Omega} \; \frac{1}{\sigma^2} \sum_{k=1}^{K} \| V_k - M_k I_k \|_2^2 + \frac{1}{e^2} \operatorname{tr}\left( (I - m)\, \Omega^{-1} (I - m)^T \right) + LN \ln |\Omega| + \mathbb{1}(I).$$

However, if we assume $\Omega^{-1}$ to be non-sparse, the solution for $\Omega^{-1}$ will not be defined (when $K > LN$) or will overfit (when $K$ is of the same order as $LN$) (Rai et al., 2012). In fact, some contagions in the network can be uncorrelated, which makes the corresponding entries in $\Omega^{-1}$ zero. Hence, we add an $\ell_1$ penalty to promote sparsity of the matrix $\Omega^{-1}$ to obtain

$$\min_{I, m, \Omega} \; \sum_{k=1}^{K} \| V_k - M_k I_k \|_2^2 + \lambda_1 \operatorname{tr}\left( (I - m)\, \Omega^{-1} (I - m)^T \right) - \lambda_2 \ln |\Omega^{-1}| + \lambda_3 \| \Omega \|_1 + \mathbb{1}(I).$$

3.2 Modeling Structure of Influence Matrix $I$

In order to better characterize the influence matrix, we propose to impose a low rank structure on the influence matrix $I$. The nodes or the contagions in the influence network are known to form communities (or clustering structures), which may be captured using the low-rank property of the influence matrix.
Note that the sparse structure in the influence matrix implies that most individuals only influence a small fraction of contagions in the network, while there can be a few nodes with wide-spread influence. We incorporate this into our formulation by using a sparsity promoting regularizer over $I_{u,k}$:

$$\min_{I, m, \Omega} \; \sum_{k=1}^{K} \| V_k - M_k I_k \|_2^2 + \lambda_1 \operatorname{tr}\left( (I - m)\, \Omega^{-1} (I - m)^T \right) - \lambda_2 \ln |\Omega^{-1}| + \lambda_3 \| \Omega \|_1 + \lambda_4 \| I \|_* + \lambda_5 \sum_{u=1}^{N} \sum_{k=1}^{K} \| I_{uk} \|_2 + \mathbb{1}(I), \tag{10}$$

where $\|\cdot\|_*$ denotes the nuclear norm, and $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ and $\lambda_5$ are the regularization parameters. With the estimated $\{\hat{I}_{uk}\}$, one can predict the total volume of contagion $k$ at $T + 1$ by $\hat{V}_k(T+1) = \sum_{u=1}^{N} \sum_{l=0}^{L-1} M_{uk}(T - l)\, \hat{I}_{uk}(l + 1)$.

4. Algorithm

We adopt an alternating optimization approach to solve the problem in Eq. 10.

Optimization w.r.t. $m$: Given $I$ and $\Omega^{-1}$, the mean matrix $m$ can be obtained by solving the following problem:

$$\min_{m} \; \operatorname{tr}\left( (I - m)\, \Omega^{-1} (I - m)^T \right).$$

The estimate $\hat{m}$ can be obtained analytically as $\hat{m} = \frac{1}{L} Q Q^T I$.

Optimization w.r.t. $\Omega$: Given $I$ and $m$, the contagion inverse covariance matrix $\Omega^{-1}$ can be estimated by solving the following optimization problem:

$$\min_{\Omega} \; \lambda_1 \operatorname{tr}\left( (I - m)\, \Omega^{-1} (I - m)^T \right) - \lambda_2 \ln |\Omega^{-1}| + \lambda_3 \| \Omega \|_1.$$

The above is an instance of the standard inverse covariance estimation problem with sample covariance $\frac{\lambda_1}{\lambda_2} (I - m)^T (I - m)$, which can be solved using standard tools. In particular, we use the graphical Lasso procedure in (Friedman et al., 2008):

$$\hat{\Omega}^{-1} = \operatorname{gLasso}\left( \frac{\lambda_1}{\lambda_2} (I - m)^T (I - m),\; \lambda_3 \right). \tag{11}$$

Optimization w.r.t. $I$: The corresponding optimization problem becomes

$$\min_{I} \; \sum_{k=1}^{K} \| V_k - M_k I_k \|_2^2 + \lambda_1 \operatorname{tr}\left( (I - m)\, \Omega^{-1} (I - m)^T \right) + \lambda_4 \| I \|_* + \lambda_5 \sum_{u=1}^{N} \sum_{k=1}^{K} \| I_{uk} \|_2 + \mathbb{1}(I).$$

We rewrite the problem as

$$\min_{I} \; \ell(I) + \lambda_4 \| I \|_* + \mathbb{1}(I), \tag{12}$$

where $\ell(I) = \sum_{k=1}^{K} \| V_k - M_k I_k \|_2^2 + \lambda_1 \operatorname{tr}\left( (I - m)\, \Omega^{-1} (I - m)^T \right) + \lambda_5 \sum_{u=1}^{N} \sum_{k=1}^{K} \| I_{uk} \|_2$. This formulation involves a sum of a convex differentiable loss and convex non-differentiable regularizers, which renders the problem non-trivial. A string of algorithms have been developed for the case where the optimal solution is easy to compute when each regularizer is considered in isolation. This corresponds to the case where the proximal operator, defined for a convex regularizer $R : \mathbb{R}^{LN \times K} \to \mathbb{R}$ at a point $Z$ by

$$\operatorname{prox}_R(Z) = \arg\min_{I} \; \frac{1}{2} \| I - Z \|_F^2 + R(I),$$

is easy to compute for each regularizer taken separately. See (Combettes and Pesquet, 2011) for a broad overview of proximal methods. The proximal operator for the nuclear norm is given by the shrinkage operation as follows (Beck and Teboulle, 2009). If $U \operatorname{diag}(\sigma_1, \ldots, \sigma_n) V^T$ is the singular value decomposition of $Z$, then

$$\operatorname{prox}_{\lambda_4 \|\cdot\|_*}(Z) = U \operatorname{diag}\left( (\sigma_i - \lambda_4)_+ \right)_i V^T.$$

The proximal operator of the indicator function $\mathbb{1}(I)$ is simply the projection onto $I_{u,k}(l) \ge 0$, which is denoted by $P_{\mathbb{1}}(I)$. Next, we mention a matching serial algorithm introduced in (Bertsekas, 2011). We present here a version where updates are performed according to a cyclic order (Richard et al., 2012). Note that one can also randomly select the order of the updates. We use Algorithm 1 to solve the optimization problem in Eq. 12.

Algorithm 1 Incremental Proximal Descent
  1: Initialize $I = A$
  2: repeat
  3:   Set $I = I - \theta \nabla_I \ell(I)$
  4:   Set $I = \operatorname{prox}_{\theta \lambda_4 \|\cdot\|_*}(I)$
  5:   Set $I = P_{\mathbb{1}}(I)$
  6: until convergence
  7: return $I$

5. Experiments

We compare the performance of the proposed approach to MSLIM by applying it to both synthetic and real datasets.
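Before turning to the evaluation protocol, the two proximal operations used in Algorithm 1 (Section 4) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes NumPy is available, the function names `prox_nuclear`, `project_nonneg` and `proximal_descent_step` are hypothetical, and the (sub)gradient of $\ell(I)$ is assumed to be supplied by the caller.

```python
import numpy as np


def prox_nuclear(Z, tau):
    """Singular value soft-thresholding: prox of tau * (nuclear norm) at Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    # Shrink each singular value by tau and clip at zero, then rebuild the matrix.
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def project_nonneg(Z):
    """Projection onto the non-negative orthant (prox of the indicator function)."""
    return np.maximum(Z, 0.0)


def proximal_descent_step(I, grad, theta, lam4):
    """One pass through the loop body of Algorithm 1.

    `grad` is assumed to be a (sub)gradient of the loss l(I) evaluated at I.
    """
    I = I - theta * grad                 # gradient step on the loss
    I = prox_nuclear(I, theta * lam4)    # low-rank shrinkage
    return project_nonneg(I)             # enforce non-negative influence


# Toy check: singular values 3 and 1 shrink to 1 and 0 when tau = 2.
Z = np.diag([3.0, 1.0])
shrunk = prox_nuclear(Z, 2.0)
```

Singular values at or below the threshold are set exactly to zero, which is what lets the nuclear norm term reduce the rank of the estimate rather than merely shrink its entries.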
Since the volume of a contagion over time, $V_k(t)$, can be viewed as a time series, we set up this problem as a time series prediction task and evaluate the performance using the prediction mean-squared error (MSE). Furthermore, for the synthetic dataset, where we have access to the true influence matrix $I$, we also evaluate the performance of the influence matrix prediction task using the metric $\| \hat{I} - I \|_F$. We determined the regularization parameters for the proposed model using cross validation. In particular, we used the first 60% of the time instances as the training set and the rest for validation. Following (Wang et al., 2013), we combine the training and validation sets to re-train the model with the best selected regularization parameters and estimate the influence matrix.

5.1 Synthetic Data

We created a synthetic dataset with the number of nodes fixed at $N = 100$ and the number of contagions at $K = 20$. In addition, we assumed that $L = 10$ and $T = 20$. A rank 5 (low-rank) influence matrix $I$ was generated randomly with uniformly distributed entries. The matrix $M$ was generated with uniformly distributed random integers in $\{0, 1\}$. Following our model assumption, the volume for each $V_k$ was calculated as $V_k = M_k \times I_k + \mathcal{N}(0, \Delta)$, where $\mathcal{N}(0, \Delta)$ is a multivariate normal distribution with covariance matrix $\Delta$. In Table 1, we present the results obtained using the proposed approach and its comparison to MSLIM. As can be observed, the proposed approach attains lower error than MSLIM on both the volume prediction and the influence matrix estimation tasks.

Table 1: Prediction performance for different information diffusion models on synthetic data.

  Approach                             MSLIM     Proposed
  Volume Prediction MSE                0.834     0.007
  Influence Matrix Estimation Error    0.7681    0.62

5.2 ISIS Twitter Data

In this section, we demonstrate the application of the proposed approach to a real-world analysis task.
We begin by describing the Twitter dataset used for analysis and the procedure adopted to extract the set of contagions. Following this, we discuss the problem setup and present comparisons to MSLIM on predicting the time-varying tweet volume. Finally, we present a qualitative analysis of the inferred influence structure for different contagions.

The ISIS dataset from Kaggle (footnote 2) comprises over 17,000 tweets from 112 users posted between January 2015 and May 2016. In addition to the actual tweets, meta-information such as the user name and the timestamp for each tweet is included. We performed standard pre-processing by removing a variety of stop words, e.g. URLs and symbols. After preprocessing, we converted each tweet into a bag-of-words representation and extracted the term frequency-inverse document frequency (tf-idf) feature.

Topic Modeling: When applying our approach, the first step is to define semantically meaningful contagions. A simple way of defining topics is to directly use words as topics (e.g., ISIS). However, a single word may not be rich enough to represent a broad topic (e.g., social network sites). Hence, we propose to perform topic modeling on the tweets based on the tf-idf features. In our experiment, we obtained the topics using Non-negative Matrix Factorization (NMF), which is a popular scheme for topic discovery, with the number of topics $K$ set at 10. Table 2 lists the top 10 words for each of the topics learned using NMF.

Volume Time Series Prediction: In our experiment, we set one day as the discrete time step for aggregating the tweet volume. The parameter $L$ denotes the number of time steps it takes for the influence of a user to decay to zero. We set $L$ equal to 5 since we observed that beyond $L = 5$ there is hardly any improvement in performance. The MSE on the predicted volume is computed over the entire period of observation.
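To make the evaluation concrete, the following minimal sketch computes one-step-ahead volumes with Eq. (1) for a single contagion and scores them with the MSE. The toy indicator matrix, the influence values, the choice $L = 2$, and the function names `predict_volume` and `mean_squared_error` are illustrative assumptions, not data or code from the experiments. Infections before time 0 are assumed not to exist, so terms with a negative time index are skipped.

```python
def predict_volume(M, I, t, L):
    """Predicted volume at time t + 1 following Eq. (1) for one contagion.

    M[u][t] is 1 if node u was infected at time t, else 0.
    I[u][l] stores the influence I_u(l + 1) of node u, for l = 0, ..., L - 1.
    """
    total = 0.0
    for u in range(len(M)):
        for l in range(L):
            if t - l >= 0:  # assumption: nothing happened before time 0
                total += M[u][t - l] * I[u][l]
    return total


def mean_squared_error(predicted, observed):
    """Average squared difference between two equally long sequences."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical toy data: 2 nodes, 4 time steps, influence window L = 2.
M = [[1, 0, 0, 1],
     [0, 1, 0, 0]]
I = [[2.0, 1.0],
     [0.5, 0.25]]
predicted = [predict_volume(M, I, t, L=2) for t in range(4)]
observed = [2.0, 1.0, 0.25, 3.0]  # made-up volumes at times 1, ..., 4
error = mean_squared_error(predicted, observed)
```

In this toy case the predictions are 2.0, 1.5, 0.25 and 2.0, so only the second and fourth time steps contribute to the error.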
The comparison of the prediction MSE is presented in Table 3. It can be seen that the proposed approach significantly outperforms MSLIM in predicting the time-varying volume.

2. ISIS dataset from Kaggle is available at https://www.kaggle.com/kzaman/how-isis-uses-twitter.

Table 2: Top words for each topic learned using NMF with the ISIS Twitter dataset.

  Topic 1    isis, ramiallolah, iraq, attack, libya, warreporter1, saa, aamaq, usa, abu
  Topic 2    killed, soldiers, today, airstrikes, injured, wounded, civilians, militants, iraqi, attack
  Topic 3    syria, russia, ramiallolah, turkey, ypg, breakingnews, usa, group, saa, terror
  Topic 4    state, islamic, fighters, fighting, group, saudi, new, http, wilaya, control
  Topic 5    aleppo, nid, gazaui, rebels, north, today, northern, syrian, ypg, turkish
  Topic 6    assad, regime, myra, forces, rebels, fsa, pro, islam, syrian, jaysh
  Topic 7    al, qaeda, nusra, abu, sham, ahrar, islam, jabhat, http, warreporter1
  Topic 8    army, iraq, near, ramiallolah, iraqi, lujah, turkey, ramadi, west, sinai
  Topic 9    allah, people, muslims, abu, accept, muslim, make, know, don, islam
  Topic 10   breaking, islamicstate, forces, amaqagency, city, fighters, iraqi, near, area, syrian

Figure 1: (a) Average Influence. (b) Maximum Influence. Comparing statistics from the estimated influence matrix with the volume of tweets corresponding to each of the users to identify influential users. In both cases, the users with a large influence score are marked in red.

Influential Node Detection: For a contagion $k$, we identify the most influential nodes with respect to this contagion as the nodes having high $\| I_{u,k} \|_2$ values. First, in Figure 2(a), we plot the correlation among the 10 topics learned by NMF. More specifically, we plot the pairwise correlation structure learned by our approach. It can be seen that a strong positive correlation structure exists, which enabled the improved prediction in Table 3.
Figure 2: (a) Correlation structure among the topics (non-black color represents positive correlation). (b) Top 9 influential users and their tweet distributions.

Table 3: Volume prediction performance on the ISIS Twitter dataset.

  Approach                 MSLIM    Proposed
  Volume Prediction MSE    2.7      0.329

Following this, we use the predicted influence matrix to select a set of highly influential nodes from the dataset. A simple approach would be to select the users with a large number of tweets. However, we argue that the influence predicted by an information diffusion model can be vastly different. Consequently, we consider a user to be influential if she has a high influence score for at least one of the topics, or if she is influential for multiple topics. For example, in Figure 1(a), we plot the average influence scores of the users (averaged over all the topics) against the total number of tweets. Similarly, in Figure 1(b), we plot the influence scores of the users (maximum over all the topics) against the total number of tweets. The first striking observation is that the users with high influence scores are not necessarily the ones with the largest number of tweets. Instead, their impact on the information diffusion relies heavily on the complex dynamics of the implicit network. Finally, in Figure 2(b) we plot the percentage of tweets regarding each of the topics for the top 9 influential nodes. Influential nodes are obtained as a union of the nodes identified based on the average and the maximum influence scores. More specifically, we select the union of the users with average influence score greater than 1.3 and the users with maximum influence score greater than 1.8. In addition to displaying the distribution across topics, for each influential user, we show the total number of tweets posted by that user.
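The selection rule described above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes that a user's influence score for topic $k$ is the norm $\| I_{u,k} \|_2$, as in the definition of influential nodes given earlier, and the function names and the toy influence values are hypothetical. Only the thresholds 1.3 and 1.8 come from the text.

```python
import math


def influence_scores(I):
    """Per-user, per-topic scores: the Euclidean norm of each vector I[u][k]."""
    return [[math.sqrt(sum(x * x for x in vec)) for vec in user] for user in I]


def select_influential(scores, avg_threshold=1.3, max_threshold=1.8):
    """Union of users whose average or maximum score exceeds its threshold."""
    selected = []
    for u, row in enumerate(scores):
        average = sum(row) / len(row)
        if average > avg_threshold or max(row) > max_threshold:
            selected.append(u)
    return selected


# Hypothetical estimate for 3 users, 2 topics and an influence window of length 2.
I_hat = [
    [[3.0, 4.0], [0.0, 0.0]],  # scores 5 and 0: average 2.5 exceeds 1.3
    [[0.6, 0.8], [0.0, 0.0]],  # scores about 1 and 0: below both thresholds
    [[0.0, 2.0], [0.0, 0.0]],  # scores 2 and 0: maximum 2.0 exceeds 1.8
]
influential_users = select_influential(influence_scores(I_hat))
```

Taking the union rather than the intersection keeps users who dominate a single topic as well as users with moderate influence spread over many topics.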
It can be seen that the total number of tweets of these users varies a lot and, therefore, is not a good indication of their influence.

6. Conclusion

In this paper, we considered the problem of influential node detection and volume time series prediction. We proposed a descriptive diffusion model to take dependencies among the topics into account. We also proposed an efficient algorithm based on alternating methods to perform inference and learning on the model. It was shown that the proposed technique outperforms existing influential node detection techniques. Furthermore, the proposed model was validated on both a synthetic and a real (ISIS) dataset. We showed that the proposed approach can efficiently select the most influential users for specific contagions. We also presented several interesting patterns of the selected influential users for the ISIS dataset.

References

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010:1–38, 2011.

Patrick L Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-point algorithms for inverse problems in science and engineering, pages 185–212. Springer, 2011.

Nan Du, Le Song, Manuel Gomez-Rodriguez, and Hongyuan Zha. Scalable influence estimation in continuous-time diffusion networks. In Advances in Neural Information Processing Systems, pages 3147–3155, 2013.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

Adrien Guille and Hakim Hacid. A predictive model for the temporal dynamics of information diffusion in online social networks. In Proceedings of the 21st International Conference on World Wide Web, pages 1145–1152. ACM, 2012.

Adrien Guille, Hakim Hacid, Cecile Favre, and Djamel A Zighed. Information diffusion in online social networks: A survey. ACM SIGMOD Record, 42(2):17–28, 2013.

Dunia López-Pintado. Diffusion in complex social networks. Games and Economic Behavior, 62(2):573–590, 2008.

Piyush Rai, Abhishek Kumar, and Hal Daume. Simultaneously leveraging output and task structures for multiple-output regression. In Advances in Neural Information Processing Systems (NIPS), pages 3185–3193, 2012.

Emile Richard, Pierre-andre Savalle, and Nicolas Vayatis. Estimation of simultaneously sparse and low rank matrices. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1351–1358, 2012.

Yingze Wang, Guang Xiang, and Shi-Kuo Chang. Sparse multi-task learning for detecting influential nodes in an implicit diffusion network. In AAAI, 2013.

Jaewon Yang and Jure Leskovec. Modeling information diffusion in implicit networks. In 2010 IEEE International Conference on Data Mining, pages 599–608. IEEE, 2010.

Peng Zhang, Jing He, Guodong Long, Guangyan Huang, and Chengqi Zhang. Towards anomalous diffusion sources detection in a large network. ACM Transactions on Internet Technology (TOIT), 16(1):2, 2016.
