Identifying Rumor Sources Using Dominant Eigenvalue of Nonbacktracking Matrix

Jiachun Pan, Wenyi Zhang
University of Science and Technology of China, Hefei, China
Emails: pjc1993@mail.ustc.edu.cn, wenyizha@ustc.edu.cn

ABSTRACT

We consider the problem of identifying rumor sources in a network, in which rumor spreading obeys a time-slotted susceptible-infected model. Unlike existing approaches, our proposed algorithm identifies as sources those nodes which, when set as sources, result in the smallest dominant eigenvalue of the corresponding reduced nonbacktracking matrix deduced from message passing equations. We also propose a reduced-complexity algorithm derived from the previous algorithm through a perturbation approximation. Numerical experiments on synthesized and real-world networks suggest that these proposed algorithms generally have higher accuracy compared with representative existing algorithms.

Index Terms: Dominant eigenvalue, message passing equations, multiple rumor sources, nonbacktracking matrix, susceptible-infected model

1. INTRODUCTION

Nowadays, facilitated by the development of the Internet and smart devices, online social networks such as Twitter, Facebook, and Weibo have become important enablers and primary conduits of rumor-like information. It is thus of considerable interest and importance to accurately identify rumor sources in networks.

Several works have investigated the problem of identifying rumor sources given a snapshot observation of the infected nodes in a network. For the basic case of a single rumor source, a network centrality metric called rumor center was initially developed in [1], which turns out to be the maximum likelihood estimate of the rumor source for regular tree networks under the susceptible-infected (SI) rumor spreading model. This has inspired a large number of works.
For example, in [2], [3] another network centrality metric called Jordan center was used to detect a single rumor source under the susceptible-infected-recovered (SIR) rumor spreading model; in [4] the problem of rumor source identification was solved using Monte Carlo estimators, for an arbitrary network structure; in [5] the case where multiple snapshot observations are available was studied and a joint estimator was developed.

An important extension of the basic case is the case where multiple rumor sources exist. In [6] a method based on rumor centrality was developed, with computational complexity $O(N^{|S|})$, where $N$ is the number of infected nodes and $|S|$ is the number of rumor sources. In [7] a method based on Jordan center was developed under the SIR model, and it was proved that for regular trees the identified sources are within a constant distance from the actual sources with high probability. In [8] a K-center method was developed, which first transforms the network into a distance network using an effective distance metric, then adaptively partitions the distance network, and finally performs source identification. The method has a complexity of $O(MN \log \alpha)$, where $\alpha$ is a slowly growing inverse-Ackermann function related to the number of nodes $N$ and the number of edges $M$.

This work was supported in part by the National Natural Science Foundation of China under Grant 61722114.

We note two facts: first, all the existing methods for both the single source and the multiple source cases are not optimal when networks contain loops; second, all the existing methods for the multiple source case need to partition the infected network and then identify a source in each part, regardless of the reality that the infected nodes from different sources may substantially overlap.

In this paper, we investigate the problem of identifying multiple rumor sources in a general network which may be highly loopy, and our main contribution is the proposal of a novel heuristic method based on the dominant eigenvalue of a matrix obtained from the Hashimoto or nonbacktracking matrix [9] of the infected network. We develop the method via an approximate analysis of message passing equations of the rumor spreading process, combined with some empirical observations. Unlike existing methods, our method neither needs to convert a loopy network into a tree, nor needs to partition the infected network into non-overlapping parts. Numerical experiments on synthesized and real-world networks suggest that our method generally has higher accuracy compared with representative existing algorithms, especially for highly loopy networks.

The rest of the paper is organized as follows. Section 2 describes the problem formulation, message passing equations, and the proposed approach. Section 3 presents the algorithms. Section 4 shows the numerical results. Section 5 concludes the paper.

2. PROBLEM FORMULATION, MESSAGE PASSING EQUATIONS, AND PROPOSED APPROACH

In this section, we describe the problem setup, develop the message passing equations, and motivate our approach.

2.1. Rumor spreading model

We assume that rumors spread in an undirected graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges. We adopt a time-slotted susceptible-infected (SI) model. In this model, each node $v \in V$ has two possible states: susceptible and infected. A source node is infected from the beginning; a non-source node is said to be susceptible if it has not received the rumor, and infected if it has received the rumor from one of its neighbors. We assume that there are $|S|$ rumor sources, and that they start spreading their rumors at the same time, termed time zero.
At each time step, an infected node infects a neighboring susceptible node with probability $p \ll 1$, and such infections of different neighboring susceptible nodes are mutually independent. As time grows without bound, eventually all connected nodes will be infected for any non-zero value of $p$. Given a snapshot observation of the infected nodes, we want to identify the rumor sources.

2.2. Message passing equations

Denote by $P_i^{(t)}$ the probability that node $i$ has not been infected by time $t$, and by $v_{i \to j}^{(t)}$ the probability that node $i$ has not passed the rumor to a neighboring node $j$ by time $t$. Note that by assumption the rumor spreading starts at time $t = 0$. Assign to each node an indicator $n_i$ to indicate whether node $i$ is a source: $n_i = 0$ if node $i$ is a source, and $n_i = 1$ otherwise. Denote by $\partial i$ the set of neighbors of node $i$, and by $\partial i \setminus j$ the set of neighbors of node $i$ excluding node $j$. For a general network with loops, the probabilities $\{ v_{k \to i}^{(t)} : k \in \partial i \setminus j \}$ are to some extent correlated, and this fact makes a rigorous analysis intractable. Therefore, in the subsequent development we make a key approximation that these probabilities are mutually independent; see, e.g., [10] for a similar treatment. Hence we can deduce the following message passing equations:

$$v_{i \to j}^{(t)} = 1 - \sum_{\tau=1}^{t} (1-p)^{\tau-1} p \left[ 1 - n_i \prod_{k \in \partial i \setminus j} v_{k \to i}^{(t-\tau)} \right]. \tag{1}$$

To see the validity of (1), note that if node $i$ is a source, i.e., $n_i = 0$, then (1) is simply the probability that node $i$ has not infected node $j$ throughout $t$ time steps, with the infection attempt at each time step being a Bernoulli trial; this probability equals $(1-p)^t$. Otherwise, if $n_i = 1$, the term in the square brackets of (1) is the probability that node $i$ has been infected by time $t - \tau$ (under the approximation of mutual independence), the term $(1-p)^{\tau-1} p$ is the probability that node $i$ then takes $\tau$ time steps to infect node $j$, and (1) then follows from the law of total probability.
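
The recursion in (1) can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the function name `message_passing`, the edge-list input, and the dictionary layout keyed by directed edges are assumptions made here for readability.

```python
def message_passing(edges, sources, p, t_max):
    """Iterate equation (1); return a list v with v[t][(i, j)] = v_{i->j}^{(t)}."""
    # Build neighbor sets of the undirected graph (hypothetical input format).
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    directed = [(i, j) for i in nbrs for j in nbrs[i]]
    # Indicator n_i: 0 for a source, 1 otherwise.
    n = {i: 0 if i in sources else 1 for i in nbrs}
    # Initial condition: no message has been passed at t = 0.
    v = [{e: 1.0 for e in directed}]
    for t in range(1, t_max + 1):
        vt = {}
        for (i, j) in directed:
            total = 0.0
            for tau in range(1, t + 1):
                # Product over neighbors k of i other than j (empty product = 1).
                prod = 1.0
                for k in nbrs[i] - {j}:
                    prod *= v[t - tau][(k, i)]
                total += (1 - p) ** (tau - 1) * p * (1 - n[i] * prod)
            vt[(i, j)] = 1.0 - total
        v.append(vt)
    return v
```

On the path 0 - 1 - 2 with node 0 as the source and $p = 0.5$, the sketch gives $v_{0 \to 1}^{(2)} = 0.25 = (1-p)^2$ and $v_{1 \to 2}^{(2)} = 0.75$, which agrees with a direct calculation because the independence approximation is exact on a tree.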

We can also deduce the probabilities $P_i^{(t)}$ as

$$P_i^{(t)} = n_i \prod_{j \in \partial i} v_{j \to i}^{(t)}. \tag{2}$$

Taking the limit $t \to \infty$ in (1), and noting that $(1-p)^{\tau-1} p \to 0$ as $\tau \to \infty$, we get

$$v_{i \to j}^{(\infty)} = n_i \prod_{k \in \partial i \setminus j} v_{k \to i}^{(\infty)}. \tag{3}$$

This confirms that once a node is infected, its neighbors will eventually be infected as time grows without bound.

2.3. Linear approximation

A snapshot observation of the infected nodes is a graph with $N$ nodes and $M$ edges. For a given snapshot observation, the approximate message passing equations in (1) constitute a collection of nonlinear equations in $\mathbf{v}_\to^{(\tau)} = (\cdots, v_{i \to j}^{(\tau)}, \cdots)$ for all the $2M$ directed links $i \to j$ and all time steps $\tau \in \{0, 1, \ldots, t\}$, which can be collectively written as a discrete-time dynamical system with initial condition $\mathbf{v}_\to^{(0)} = \mathbf{e}$:

$$\mathbf{v}_\to^{(t)} = \mathbf{e} - \sum_{\tau=1}^{t} (1-p)^{\tau-1} p \left[ \mathbf{e} - \mathbf{f}\left( \mathbf{v}_\to^{(t-\tau)}, \mathbf{n}_\to \right) \right], \tag{4}$$

where $\mathbf{e}$ is a $1 \times 2M$ all-one vector, $\mathbf{n}_\to = (\cdots, n_{i \to j}, \cdots)$ in which $n_{i \to j} = n_i$, and $\mathbf{f} = (\cdots, f_{i \to j}, \cdots)$ in which $f_{i \to j}$ is the nonlinear function of $\mathbf{v}_\to$ for $i \to j$ given in (1). To obtain some insights, we linearly approximate $f_{i \to j}(\mathbf{v}_\to, \mathbf{n}_\to)$ at $\mathbf{v}_\to = \mathbf{e}$:

$$f_{i \to j}(\mathbf{v}_\to, \mathbf{n}_\to) = f_{i \to j}(\mathbf{e}, \mathbf{n}_\to) + f'_{i \to j}(\mathbf{e}, \mathbf{n}_\to)(\mathbf{v}_\to - \mathbf{e}), \tag{5}$$

where $f'_{i \to j}(\mathbf{v}_\to, \mathbf{n}_\to) = \left( \cdots, \frac{\partial f_{i \to j}}{\partial v_{k \to l}}, \cdots \right)$. By (1), we have that for $l \neq i$,

$$\left. \frac{\partial f_{i \to j}}{\partial v_{k \to l}} \right|_{\mathbf{v}_\to = \mathbf{e}} = 0,$$

and for $l = i$ and $k \neq j$,

$$\left. \frac{\partial f_{i \to j}}{\partial v_{k \to l}} \right|_{\mathbf{v}_\to = \mathbf{e}} = n_i.$$

So we get the linear approximation of $v_{i \to j}^{(t)}$ as

$$v_{i \to j}^{(t)} = 1 - \sum_{\tau=1}^{t} (1-p)^{\tau-1} p \left[ 1 - n_i - n_i \sum_{k \in \partial i \setminus j} \left( v_{k \to i}^{(t-\tau)} - 1 \right) \right], \tag{6}$$

which can be collectively written in matrix form as

$$\mathbf{v}_\to^{(t)} = \mathbf{e} - \sum_{\tau=1}^{t} (1-p)^{\tau-1} p \left[ \mathbf{e} - \mathbf{n}_\to + \left( \mathbf{e} - \mathbf{v}_\to^{(t-\tau)} \right) \mathbf{R} \right], \tag{7}$$

where $\mathbf{R}$ is a $2M \times 2M$ matrix:

$$R_{k \to l, i \to j} = n_i B_{k \to l, i \to j}, \qquad B_{k \to l, i \to j} = \begin{cases} 1 & \text{if } l = i \text{ and } j \neq k, \\ 0 & \text{otherwise.} \end{cases} \tag{8}$$

The matrix $\mathbf{B}$ is known as the Hashimoto or nonbacktracking matrix of a graph [9], which is also closely associated with the spreading capability [11]. We thus call $\mathbf{R}$ a reduced nonbacktracking matrix, since it is obtained from $\mathbf{B}$ by setting to zero the entries corresponding to $n_i = 0$. We can further rewrite (7) in a recursive form so that $\mathbf{v}_\to^{(t+1)}$ is only related to $\mathbf{v}_\to^{(t)}$:

$$\mathbf{u}_\to^{(t+1)} = p(\mathbf{e} - \mathbf{n}_\to) + \mathbf{u}_\to^{(t)} \left[ (1-p)\mathbf{I} + p\mathbf{R} \right], \tag{9}$$

where $\mathbf{I}$ is the identity matrix and $\mathbf{u}_\to^{(t)} = \mathbf{e} - \mathbf{v}_\to^{(t)}$, in which $u_{i \to j}^{(t)} = 1 - v_{i \to j}^{(t)}$ is the probability that node $i$ has passed the rumor to a neighboring node $j$ by time $t$.

2.4. Proposed approach

The task of identifying rumor sources is to determine a $\{0, 1\}$-vector $\mathbf{n}$ subject to $\sum_{i=1}^{N} n_i = N - |S|$. Different choices of $\mathbf{n}$ lead to different evolution processes of $\mathbf{v}_\to^{(t)}$ (or, equivalently, $\mathbf{u}_\to^{(t)}$). In our study, we have numerically computed such evolution processes, according to both (4) and its linear approximation (7), for different types of networks. Fig. 1 displays a representative scenario, where for a snapshot observation of the infected graph in a small-world network containing a single rumor source (i.e., $|S| = 1$),¹ we compute and draw $\|\mathbf{u}_\to^{(t)}\|$ under different choices of $\mathbf{n}$. The dark star curve is the evolution process of $\|\mathbf{u}_\to^{(t)}\|$ when $\mathbf{n}$ coincides with the actual rumor source, the grey solid curves are the evolution processes of $\|\mathbf{u}_\to^{(t)}\|$ when $\mathbf{n}$ randomly indicates a rumor source, and the dashed curves are the linear approximations of $\mathbf{u}_\to^{(t)}$.

¹ Here the number of infected nodes is 400 and $p$ is 0.1.

[Figure: $\|\mathbf{u}_\to\|$ versus $t$. Legend: infection begins at the actual source; infection begins at other nodes; evolution process (4); linear approximation (7).]
Fig. 1. Evolution processes of $\|\mathbf{u}_\to\|$ in a small-world network.

From Fig. 1, we have the following empirical observations:
• The evolution processes of $\|\mathbf{u}_\to^{(t)}\|$ computed according to (4) eventually approach the stable state where every connected node is infected with probability one, and when $\mathbf{n}$ coincides with the actual rumor source, the evolution process approaches this stable state the most quickly.
• The linear approximation of $\mathbf{u}_\to^{(t)}$ computed according to (7) is accurate for small values of $t$, so that when $\mathbf{n}$ coincides with the actual rumor source, the linearly approximated evolution process of $\|\mathbf{u}_\to^{(t)}\|$ grows the most quickly.

Iteratively applying (9) leads to:

$$\mathbf{u}_\to^{(t)} = p(\mathbf{e} - \mathbf{n}_\to) \left\{ \mathbf{I} + [(1-p)\mathbf{I} + p\mathbf{R}] + [(1-p)\mathbf{I} + p\mathbf{R}]^2 + \cdots + [(1-p)\mathbf{I} + p\mathbf{R}]^{t-1} \right\}. \tag{10}$$

So in the linear approximation, the growth rate of $\|\mathbf{u}_\to^{(t)}\|$ is determined by $(\mathbf{e} - \mathbf{n}_\to)[(1-p)\mathbf{I} + p\mathbf{R}]^t \approx (\mathbf{e} - \mathbf{n}_\to)[(1-p)\mathbf{I} + p\mathbf{B}]^t$, since $p \ll 1$ and $\mathbf{R}$ differs from $\mathbf{B}$ only at the few entries indicated by $\mathbf{n}_\to$. In light of the two empirical observations, the task is now to choose $\mathbf{n}_\to$ such that the non-zero entries of $(\mathbf{e} - \mathbf{n}_\to)$ select the rows of $[(1-p)\mathbf{I} + p\mathbf{B}]^t$ whose sum vector yields the largest norm. With a binomial expansion, this norm is determined by the numbers of nonbacktracking paths of lengths up to $t$ starting from edges like $s \to i$, where $s$ is a source node indicated by $\mathbf{n}$ and $i$ is one of its neighbors. On the other hand, note that choosing $\mathbf{n}_\to$ also determines the reduced nonbacktracking matrix $\mathbf{R}$, and that when the resulting $\mathbf{R}$ has the smallest dominant eigenvalue, the corresponding sources have the largest number of nonbacktracking paths passing through them [11]. This observation provides a reasonable heuristic for the choice of $\mathbf{n}_\to$,² and motivates us to propose the following minimax criterion:

$$\min_{\mathbf{n}} \max \lambda(\mathbf{R}). \tag{11}$$

² This heuristic has been verified with extensive numerical experiments in our study.

3. ALGORITHMS

In this section, we describe algorithms to identify rumor sources based on the reasoning in the previous section.

3.1. Multiple source identification (MSI) algorithm

According to (11), we need to find the nodes which, when set as sources, result in the minimum dominant eigenvalue of the corresponding reduced nonbacktracking matrix. We can use the power iteration to compute the dominant eigenvalue in a time that scales as $O(M)$,³ and repeat this for each of the $\binom{N}{|S|}$ configurations of sources. Thus, the complexity of the MSI algorithm is $O(MN^{|S|})$. The procedure of the MSI algorithm is given in Algorithm 1.

Algorithm 1 Multiple Source Identification (MSI)
Input: Nonbacktracking matrix $\mathbf{B}$ of an infected graph of $N$ nodes and $M$ edges; number of rumor sources $|S|$
Output: Identified rumor sources $\hat{S}$
for $i = 1, \ldots, \binom{N}{|S|}$ do
    Enumerate a set of potential source nodes as $\hat{S}_i = \{i_1, i_2, \ldots, i_{|S|}\}$;
    Form $\mathbf{R}_i$ by setting the entries in $\mathbf{B}$ to zero corresponding to $n_{i_j} = 0$, $i_j \in \hat{S}_i$;
    Use the power iteration to compute the dominant eigenvalue $\lambda_{i,\max}$ of $\mathbf{R}_i$;
end for
Declare the set $\hat{S}_i$ with the minimal $\lambda_{i,\max}$ as rumor sources $\hat{S}$.

3.2. Perturbation-based multiple source identification (PMSI) algorithm

Now we derive an approximation of $\Delta\lambda = \lambda_{\max}(\mathbf{B}) - \lambda_{\max}(\mathbf{R})$ by applying a method similar to that in [12]. Denote $\mathbf{R}$ by $\mathbf{B} - \Delta\mathbf{B}$, the dominant eigenvalue of $\mathbf{R}$ by $\lambda_{\max} - \Delta\lambda$, and its corresponding right eigenvector by $\mathbf{u} - \Delta\mathbf{u}$. We have

$$(\mathbf{B} - \Delta\mathbf{B})(\mathbf{u} - \Delta\mathbf{u}) = (\lambda_{\max} - \Delta\lambda)(\mathbf{u} - \Delta\mathbf{u}), \tag{12}$$

where $\lambda_{\max}$ is the dominant eigenvalue of $\mathbf{B}$. Then we multiply both sides of (12) by the left eigenvector $\mathbf{v}^T$ of $\mathbf{B}$ and get

$$\Delta\lambda = \frac{\mathbf{v}^T \Delta\mathbf{B}\, \mathbf{u} - \mathbf{v}^T \Delta\mathbf{B}\, \Delta\mathbf{u}}{\mathbf{v}^T \mathbf{u} - \mathbf{v}^T \Delta\mathbf{u}}. \tag{13}$$

We apply a perturbation analysis on $\Delta\mathbf{u}$. When we set entries in $\mathbf{B}$ to zero according to the source set $S$, the entries $i \to s$ ($s \in S$, $i \in \partial s$) of the perturbed eigenvector $\mathbf{u} - \Delta\mathbf{u}$ will be zero, and other entries will be perturbed slightly.
So we write $\Delta\mathbf{u} = \mathbf{u}_{\to s} - \delta\mathbf{u}$, where $\mathbf{u}_{\to s}$ is a vector in which we only keep the entries $i \to s$ ($s \in S$, $i \in \partial s$) of $\mathbf{u}$ and set the others to zero, and $\delta\mathbf{u}$ is small. So by neglecting the second-order terms $\mathbf{v}^T \Delta\mathbf{B}\, \delta\mathbf{u}$ and $\Delta\lambda\, \mathbf{v}^T \delta\mathbf{u}$, we obtain

$$\Delta\lambda \approx \frac{\mathbf{v}^T \Delta\mathbf{B}\, \mathbf{u} - \mathbf{v}^T \Delta\mathbf{B}\, \mathbf{u}_{\to s}}{\mathbf{v}^T \mathbf{u} - \mathbf{v}^T \mathbf{u}_{\to s}}. \tag{14}$$

According to the definition of $\Delta\mathbf{B}$, we obtain

$$\Delta\lambda \approx \frac{\sum_{i \in \partial s} \sum_{k \in \partial s \setminus i} v_{i \to s}\, u_{s \to k}}{\mathbf{v}^T \mathbf{u} - \mathbf{v}^T \mathbf{u}_{\to s}}. \tag{15}$$

With this approximation, we only need to compute the dominant eigenvalue and associated eigenvector of the nonbacktracking matrix $\mathbf{B}$, rather than the dominant eigenvalues of all the reduced nonbacktracking matrices, as in the MSI algorithm. So the complexity is reduced from $O(MN^{|S|})$ to $O(N^{|S|})$. The procedure of the reduced-complexity PMSI algorithm is given in Algorithm 2.

³ In our simulations in Section 4 the number of iterations is fixed as 20.

Algorithm 2 Perturbation-based Multiple Source Identification (PMSI)
Input: Nonbacktracking matrix $\mathbf{B}$ of an infected graph of $N$ nodes and $M$ edges; number of rumor sources $|S|$
Output: Identified rumor sources $\hat{S}$
for $i = 1, \ldots, \binom{N}{|S|}$ do
    Enumerate a set of potential source nodes as $\hat{S}_i = \{i_1, i_2, \ldots, i_{|S|}\}$;
    Get $\mathbf{u}_{\to i_j}$ ($i_j \in \hat{S}_i$) from $\mathbf{u}$ and calculate $\Delta\lambda_i$ according to (15);
end for
Declare the set $\hat{S}_i$ with the maximal $\Delta\lambda_i$ as rumor sources $\hat{S}$.

4. SIMULATIONS

In this section, we evaluate the performance of our proposed algorithms on different synthesized and real-world networks, including small-world networks, power grids, Facebook networks, and regular lattices.

4.1. Single source case

In the single source case, we compare our algorithms with two representative algorithms, the Jordan center (JC) [2] and the rumor center (RC) combined with a breadth-first-search (BFS) tree heuristic [1]. Note that for loopy networks all these algorithms are heuristic in nature.
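
For concreteness, the MSI criterion (11) and the PMSI approximation (15) can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the authors' code: the names `build`, `apply_R`, `dominant`, `msi`, and `pmsi` are hypothetical; the power iteration is run on $\mathbf{I} + \mathbf{R}$ (a shift chosen here to avoid oscillation between eigenvalues of equal modulus) for 200 iterations rather than the 20 used in the paper's simulations; the left eigenvector of $\mathbf{B}$ is obtained from the right one by reversing edge directions, which relies on the identity $\mathbf{B}^T = \mathbf{P}\mathbf{B}\mathbf{P}$ with $\mathbf{P}$ the edge-reversal permutation; and for $|S| > 1$ the sum in (15) is taken over all $s \in S$.

```python
from itertools import combinations
from math import sqrt

def build(edges):
    """Neighbor sets and the 2M directed edges of an undirected graph."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    return nbrs, [(i, j) for i in nbrs for j in nbrs[i]]

def apply_R(x, nbrs, sources):
    """y = R x with (R x)[k->l] = n_l * sum over j in nbrs(l), j != k, of x[l->j]."""
    return {(k, l): 0.0 if l in sources
            else sum(x[(l, j)] for j in nbrs[l] if j != k)
            for (k, l) in x}

def dominant(nbrs, darts, sources=(), iters=200):
    """Power iteration on I + R (assumed shift); returns (eigenvalue estimate, vector)."""
    x = {e: 1.0 / sqrt(len(darts)) for e in darts}
    lam = 0.0
    for _ in range(iters):
        rx = apply_R(x, nbrs, sources)
        y = {e: x[e] + rx[e] for e in darts}
        norm = sqrt(sum(val * val for val in y.values()))
        lam = norm - 1.0  # ||(I + R) x|| approaches 1 + lambda_max for unit-norm x
        x = {e: val / norm for e, val in y.items()}
    return lam, x

def msi(edges, num_sources):
    """Criterion (11): the source set minimizing the dominant eigenvalue of R."""
    nbrs, darts = build(edges)
    return min(combinations(sorted(nbrs), num_sources),
               key=lambda s: dominant(nbrs, darts, s)[0])

def pmsi(edges, num_sources):
    """Approximation (15): the source set maximizing the estimated eigenvalue drop."""
    nbrs, darts = build(edges)
    _, u = dominant(nbrs, darts)                 # right eigenvector of B
    v = {(i, j): u[(j, i)] for (i, j) in darts}  # left eigenvector via edge reversal
    vu = sum(v[e] * u[e] for e in darts)

    def delta(cand):
        num = sum(v[(i, s)] * u[(s, k)]
                  for s in cand for i in nbrs[s] for k in nbrs[s] if k != i)
        den = vu - sum(v[(i, s)] * u[(i, s)] for s in cand for i in nbrs[s])
        return num / den

    return max(combinations(sorted(nbrs), num_sources), key=delta)
```

As a small check, on two triangles sharing node 0 (edges 0-1, 1-2, 2-0, 0-3, 3-4, 4-0) with a single source, both sketches select node 0: declaring node 0 a source removes every nonbacktracking cycle, so the reduced matrix is nilpotent, whereas any other choice leaves one triangle intact.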

We evaluate the performance using three metrics: (1) Accuracy: the probability that the identified source node is the actual source. (2) One-hop accuracy: the probability that the distance between the identified source node and the actual source is no more than one hop. (3) Average error distance: the average number of hops between the identified source node and the actual source.

In simulating the rumor spreading process we choose $p < 0.1$ so that the infected nodes are sufficiently spread. We consider four kinds of networks: synthesized small-world networks, the western states power grid network of the United States, a fraction of the Facebook network with 4039 nodes,⁴ and regular lattices. Note that these networks are all loopy, especially the latter three kinds. We generate 500 instances of 400-node infected graphs for each network. The average diameters of infected graphs for these networks are 15.5 (small-world networks), 19.5 (power grids), 10.9 (Facebook networks) and 36.8 hops (regular lattices), respectively. Table 1 shows the simulation results. We see that the MSI algorithm generally outperforms both JC and RC-BFS, and the PMSI algorithm also performs quite well; sometimes it even outperforms the original MSI algorithm. The performance advantage is the most evident for highly loopy networks, e.g., Facebook networks and regular lattices.

⁴ Data source: http://snap.stanford.edu/data/index.html

Table 1. Simulation results in single source case

(a) Accuracy
Network            JC       RC-BFS   MSI      PMSI
Small-world        18.2%    11.6%    19.8%    19.6%
Power grids        2.6%     2.6%     2.6%     1.4%
Facebook           1.8%     1.8%     3.2%     1.4%
Regular lattices   6.8%     0.6%     10.4%    14.4%

(b) One-hop accuracy
Network            JC       RC-BFS   MSI      PMSI
Small-world        77.8%    58.8%    78.6%    78.0%
Power grids        17.2%    9.6%     18.2%    13.8%
Facebook           17.6%    19.0%    35.6%    28.4%
Regular lattices   28.2%    7.0%     31.6%    47.4%

(c) Average error distance (hops)
Network            JC       RC-BFS   MSI      PMSI
Small-world        1.06     1.40     1.05     1.06
Power grids        3.17     3.45     3.43     3.77
Facebook           2.37     2.35     1.96     2.13
Regular lattices   2.36     4.36     2.39     1.73

4.2. Multiple source case

In the multiple source case, we generate 500 instances of 100-node infected graphs for small-world and Facebook networks. The sources are randomly picked in each network. The average diameters of infected graphs are 16 (small-world networks) and 8.9 hops (Facebook networks), respectively.

With multiple sources, we modify the performance metrics in the following way: we associate the identified sources $\hat{S}$ with the actual sources $S$ so that the normalized total error distance between $\hat{S}$ and $S$, i.e., $\Delta = \frac{1}{|S|} \sum_{i=1}^{|S|} d(\hat{s}_i, s_i)$, where $d(\hat{s}_i, s_i)$ is the number of hops between the actual source $s_i$ and its associated identified source $\hat{s}_i$, is minimized. We then define the accuracy as the probability that $\hat{S} = S$, the one-hop accuracy as the probability that $d(\hat{s}_i, s_i) \leq 1$, $\forall i = 1, \ldots, |S|$, and the average error distance as the average of the minimum $\Delta$.

Table 2. Simulation results in multiple source case
Network       |S|   Accuracy   One-hop accuracy   $\Delta$
Small-world   2     1.8%       26.4%              2.244
Small-world   3     0%         22.4%              1.45
Facebook      2     4.0%       22.2%              1.653
Facebook      3     1.0%       12.2%              1.60
$\Delta$: average error distance

Table 2 shows the simulation results when $|S|$ is two or three using the MSI algorithm. Although the accuracy and the one-hop accuracy drastically degrade compared with the single source case, the average error distance is usually less than two hops. Furthermore, it is interesting to notice that the average error distance decreases as the number of sources increases.
This may be because, with more sources, even though it is challenging to accurately identify all of them, it is likely that some of them can be accurately identified, so that the average error distance decreases.

5. CONCLUSION

We proposed a novel heuristic source identification method for general loopy networks with multiple rumor sources, motivated by deducing and analyzing the behavior of message passing equations of the rumor spreading process, combined with some empirical observations. Numerical experiments show that for several representative kinds of general networks, the proposed method is competitive with existing methods. In future research, it is desirable to deepen our understanding of the proposed heuristic method, and to provide a solid theoretical foundation that explains its effectiveness.

6. REFERENCES

[1] D. Shah and T. Zaman, "Rumors in a network: Who's the culprit?," IEEE Trans. Inf. Theor., vol. 57, no. 8, pp. 5163–5181, Aug. 2011.
[2] K. Zhu and L. Ying, "Information source detection in the SIR model: A sample-path-based approach," IEEE/ACM Transactions on Networking, vol. 24, no. 1, pp. 408–421, Feb. 2016.
[3] W. Luo, W. P. Tay, and M. Leng, "How to identify an infection source with limited observations," IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 4, pp. 586–597, Aug. 2014.
[4] Nino Antulov-Fantulin, Alen Lančić, Tomislav Šmuc, Hrvoje Štefančić, and Mile Šikić, "Identification of patient zero in static and temporal networks: Robustness and limitations," Phys. Rev. Lett., vol. 114, pp. 248701, Jun. 2015.
[5] Zhaoxu Wang, Wenxiang Dong, Wenyi Zhang, and Chee Wei Tan, "Rumor source detection with multiple observations: fundamental limits and algorithms," in ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '14, Austin, TX, USA - June 16 - 20, 2014, 2014, pp. 1–13.
[6] W. Luo, W. P. Tay, and M. Leng, "Identifying infection sources and regions in large networks," IEEE Transactions on Signal Processing, vol. 61, no. 11, pp. 2850–2865, June 2013.
[7] Z. Chen, K. Zhu, and L. Ying, "Detecting multiple information sources in networks under the SIR model," IEEE Transactions on Network Science and Engineering, vol. 3, no. 1, pp. 17–31, Jan. 2016.
[8] J. Jiang, S. Wen, S. Yu, Y. Xiang, and W. Zhou, "K-center: An approach on the multi-source identification of information diffusion," IEEE Transactions on Information Forensics and Security, vol. 10, no. 12, pp. 2616–2626, Dec. 2015.
[9] Ki-ichiro Hashimoto, "On Brandt matrices associated with the positive definite quaternion Hermitian forms," J. Fac. Sci. Univ. Tokyo Sect. IA Math, vol. 27, no. 1, pp. 227–245, 1980.
[10] Brian Karrer and M. E. J. Newman, "Message passing approach for general epidemic models," Phys. Rev. E, vol. 82, pp. 016101, Jul. 2010.
[11] Flaviano Morone and Hernán A. Makse, "Influence maximization in complex networks through optimal percolation," Nature, vol. 524, July 2015.
[12] Juan G. Restrepo, Edward Ott, and Brian R. Hunt, "Characterizing the dynamical importance of network nodes and links," Phys. Rev. Lett., vol. 97, pp. 094102, Sep. 2006.
