Multi-layer graph analysis for dynamic social networks

Brandon Oselio, Student Member, IEEE, Alex Kulesza, Alfred O. Hero, III, Fellow, IEEE

Abstract: Modern social networks frequently encompass multiple distinct types of connectivity information; for instance, explicitly acknowledged friend relationships might complement behavioral measures that link users according to their actions or interests. One way to represent these networks is as multi-layer graphs, where each layer contains a unique set of edges over the same underlying vertices (users). Edges in different layers typically have related but distinct semantics; depending on the application, multiple layers might be used to reduce noise through averaging, to perform multifaceted analyses, or a combination of the two. However, it is not obvious how to extend standard graph analysis techniques to the multi-layer setting in a flexible way. In this paper we develop latent variable models and methods for mining multi-layer networks for connectivity patterns based on noisy data.

Index Terms: Hypergraphs, multigraphs, mixture graphical models, Pareto optimality

Multi-layer networks arise naturally when there exists more than one source of connectivity information for a group of users. For instance, in a social networking context there is often knowledge of direct communication links, i.e., relational information. Examples of relational information include the frequency with which users communicate over social media, or whether a user has sent or received emails from another user in a given time period. However, it is also possible to derive behavioral relationships based on user actions or interests. These behavioral relationships are inferred from information that does not directly connect users, such as individual preferences or usage statistics. In this paper we show how to deal with multiple layers of a social network when performing tasks like inference, clustering, and anomaly detection.
We propose a generative hierarchical latent-variable model for multi-layer networks, and show how to perform inference on its parameters. Using techniques from Bayesian Model Averaging [1], the layers of the network are conditionally decoupled using a latent selection variable; this makes it possible to write the posterior probability of the latent variables given the multi-layer network. The resulting mixture can be viewed as a scalarization of a multi-objective optimization problem [2], [3], [4]. When the posterior probability functions are convex, the scalarization is both optimal and consistent with the Bayesian principle of model-averaged inference [2], [5]. We then step back from the Bayesian setting and discuss how multi-objective optimization can be used to perform MAP estimation of the desired latent variables. Using the concept of Pareto optimality [4], an entire front of solutions is defined; this allows a user to define a preference over optimization functions and tune the algorithm accordingly. The result is a level of supervised optimization and inference that utilizes the structure of multi-layer networks without scalarization.

The authors are with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA. Tel: 1-734-763-0564. Fax: 1-734-763-8041. Emails: {boselio, kulesza, hero}@umich.edu. This work was partially supported by ARO grant number W911NF-12-1-0443. Parts of this paper were presented in the Proceedings of the IEEE Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), St. Martin, Dec. 2013.

[Figure: adjacency matrices $A_1$, $A_2$ and observed matrices $W_1$, $W_2$]
Fig. 1. Adjacency and Observation Matrices. This graphical model depicts how the latent adjacency matrices can affect the observation matrices. Note that the observation matrices are dependent on all adjacency matrices in general.
Experiments on a simulated example show that our method yields improved clustering performance in noisy conditions. The developed framework is then combined with the dynamic stochastic block model (DSBM) [6], which captures a variety of complex temporal network phenomena. Finally, the multi-layer DSBM is applied to a real-world data set drawn from the ENRON email corpus. This example illustrates how we can combine two layers of a network to explore complex connections through both time and layer mixing parameters.

I. MULTI-LAYER NETWORKS

A multi-layer graph $G = (V, E)$ comprises vertices $V = \{v_1, \ldots, v_p\}$, common to all layers, and edges $E = (E_1, \ldots, E_L)$ on $L$ layers, where $E_i$ is the edge set for layer $i$. In the real-world network setting, we will assume that the observed data are noisy reflections of a true underlying multi-layer graph. For convenience we will work with adjacency representations, letting $A_i \in \mathbb{R}^{p \times p}$ be the true adjacency matrix of layer $i$, and $W_i \in \mathbb{R}^{p \times p}$ the corresponding observed adjacency matrix. Figure 1 depicts the model graphically. In some cases $W_i$ might be binary, reflecting merely the presence or absence of a connection, for instance whether two users were seen to communicate. In other settings, such as measuring temporal or content correlation scores between users, the entries of $W_i$ could be real-valued.

The goal is to estimate $A_1, \ldots, A_L$ given the observations $W_1, \ldots, W_L$. Using standard parametric methods this will require computing the posterior distribution of $A_1, \ldots, A_L$, which can be difficult given the number of parameters. Specifically, the influence of $A_1, \ldots, A_L$ on a single $W_i$ is difficult to measure, as the dependencies are unspecified.

[Figure: latent variables $Y$; adjacency matrices $A_1$, $A_2$; observed matrices $W_1$, $W_2$]
Fig. 2. General Latent Variable Model. This model represents a latent variable model, in which a set of variables $Y$ control the distributions of the adjacency matrices and through them the observation matrices.

II. HIERARCHICAL MODEL DESCRIPTION

A hierarchical model is proposed that simplifies this inference procedure by conditionally decoupling $W_1, \ldots, W_L$. For simplicity, we specialize to the case where $L = 2$. This also allows us to view the networks in the setting described in the introduction: one layer of the network represents the observed extrinsic relationships between users, and the other layer represents their correlated intrinsic behaviors. We introduce a latent variable denoted $Y$ (see Figure 2) that conditionally decouples the posterior distributions of the two layers:

$$ P(W_1, W_2 \mid A_1, A_2, Y) = P(W_1 \mid A_1, Y)\, P(W_2 \mid A_2, Y) \tag{1} $$

$$ P(W_1, W_2 \mid A_1, A_2) = \int P(W_1, W_2 \mid A_1, A_2, Y)\, P(Y \mid A_1, A_2)\, dY. \tag{2} $$

We now shift the focus from the adjacency matrices $A_1, A_2$ to the latent variable $Y$, using $Y$ as a compact description of how these adjacencies combine to form the multi-layer network structure. It is possible to write down the posterior distribution for $Y$ as

$$ P(Y \mid W_1, W_2) = \sum_{A_1, A_2} P(Y \mid A_1, A_2)\, P(A_1, A_2 \mid W_1, W_2). $$

III. POSTERIOR MIXTURE MODELING

Consider the graphical model shown in Figure 3. We have collapsed the $A_1, A_2$ variables with the observed data $W_1, W_2$, because we are mainly interested in inferring $W$, and $W_i$ can be considered a representation of the real connectivity. Following the previous model, we have decomposed $Y = (W, Z)$, where $W \in \mathbb{R}^{p \times p}$ is a latent adjacency or similarity matrix describing the underlying connections between vertices, and $Z \in \{1, 2\}$ is a model selection variable with $P(Z = 1) = \alpha$ and $P(Z = 2) = 1 - \alpha$. Here there is the implicit assumption that a common connectivity structure $W$ informs all layers of the network.
In a sense, the model produces observed matrices that correspond to multiple views of the latent variable $W$. The model selection variable $Z$ will decouple the posterior distribution of $W$ given both layers into a weighted sum of marginalized posteriors given each individual layer.

The prior for $W$ is $P(W)$, left unspecified for now. The distributions $P(W_1 \mid W, Z)$ and $P(W_2 \mid W, Z)$ are in general task-dependent (e.g., they could be Gaussian, Wishart, Bernoulli, etc.), but we will make the simplifying assumption that $Z$ acts as a selector variable, so that $W$ and $W_1$ are conditionally independent given $Z = 2$, and likewise $W$ and $W_2$ are conditionally independent when $Z = 1$. Formally, using the notation $P_z$ to denote conditioning on $Z = z$, we have

$$ P_2(W_1 \mid W) = P_2(W_1) \tag{3} $$

$$ P_1(W_2 \mid W) = P_1(W_2). \tag{4} $$

We are interested in the posterior distribution of the latent variable $W$ given the observed variables $W_1, W_2$:

\begin{align}
& P(W \mid W_1, W_2) \tag{5} \\
&= P(W, Z = 1 \mid W_1, W_2) + P(W, Z = 2 \mid W_1, W_2) \tag{6} \\
&= P(W \mid W_1, W_2, Z = 1)\, P(Z = 1 \mid W_1, W_2) + P(W \mid W_1, W_2, Z = 2)\, P(Z = 2 \mid W_1, W_2) \tag{7} \\
&= \xi\, P(W \mid W_1, W_2, Z = 1) + (1 - \xi)\, P(W \mid W_1, W_2, Z = 2), \tag{8}
\end{align}

where $\xi = P(Z = 1 \mid W_1, W_2)$. Consider the first term. We have

\begin{align}
P(W \mid W_1, W_2, Z = 1) &= \frac{P(W, W_1, W_2, Z = 1)}{\sum_{\hat{W}} P(\hat{W}, W_1, W_2, Z = 1)} \tag{9} \\
&= \frac{P(W)\, P_1(W_1 \mid W)\, P_1(W_2)}{\sum_{\hat{W}} P(\hat{W})\, P_1(W_1 \mid \hat{W})\, P_1(W_2)}. \tag{10}
\end{align}

Since $P_1(W_2)$ does not depend on $W$, it factors out of the sum in the denominator and cancels; thus (10) becomes

$$ P(W \mid W_1, W_2, Z = 1) = \frac{P(W)\, P_1(W_1 \mid W)}{P_1(W_1)}. \tag{11} $$

Performing the same computation on the other side and combining, we have

\begin{align}
& P(W \mid W_1, W_2) \tag{12} \\
&= \xi\, \frac{P(W)\, P_1(W_1 \mid W)}{P_1(W_1)} + (1 - \xi)\, \frac{P(W)\, P_2(W_2 \mid W)}{P_2(W_2)} \tag{13} \\
&= P(W) \left[ \gamma_1 P_1(W_1 \mid W) + \gamma_2 P_2(W_2 \mid W) \right], \tag{14}
\end{align}

where $\gamma_1 = \xi / P_1(W_1)$ and $\gamma_2 = (1 - \xi) / P_2(W_2)$ are constants with respect to $W$. If we assume the prior on $W$ is uniform, then the MAP estimate of $W$ is also the maximum likelihood estimate, which can be written as

$$ \operatorname*{argmax}_{W} \left[ \gamma_1 P_1(W_1 \mid W) + \gamma_2 P_2(W_2 \mid W) \right]. \tag{15} $$

The above solutions describe not just one MAP estimate of $W$, but rather a family of MAP estimates, based on the priors that we implicitly assign to each model by choosing a specific value of $\alpha$ (which affects $\xi$ and $\gamma$ in turn). Qualitatively, this can be viewed as determining a relative confidence parameter between the networks; if $W_1$ is more trusted than $W_2$, then $\alpha$ would be greater than 0.5.

As an example, assume that both $P(W_1 \mid W)$ and $P(W_2 \mid W)$ are isotropic Gaussians, i.e.,

$$ P(W_1 \mid W) = \mathcal{N}(W, \sigma_1^2 I_p) \tag{16} $$

$$ P(W_2 \mid W) = \mathcal{N}(W, \sigma_2^2 I_p). \tag{17} $$

[Figure: latent variables $W$, $Z$; observed matrices $W_1$, $W_2$]
Fig. 3. Model with Similarity Matrix and Selection Variable. We introduce the similarity matrix $W$ and the selection variable $Z$ to describe our latent variable model. Conditioning on $W$ and $Z$, we assume that the two layers are independent from each other.

Then the solution for $\hat{W}$ has the form

$$ \hat{W} = \beta W_1 + (1 - \beta) W_2, \tag{18} $$

for some choice of $0 \le \beta \le 1$. A proof of this is given in Appendix A. In the non-isotropic, non-Gaussian case the solution will not have such a simple form. However, numerical methods can be used to compute the solution (15).

IV. SIMULATION EXAMPLE

We use simulations to show that clustering of nodes in a weighted graph can be improved using the MAP estimate of $W$.
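As a rough illustration of the simulation described in this section, the sketch below generates a planted-cluster weight matrix, corrupts it into two noisy layers, forms the convex combination (18) over a grid of $\beta$, and computes a spectral embedding from the Laplacian $L = D - A$. It is a minimal sketch under stated assumptions, not the authors' code: the function names, the smaller problem size (100 nodes), the reading of the second argument of $\mathcal{N}(\cdot, \cdot)$ as a standard deviation, and the use of Frobenius error against the clean matrix in place of the ARI are all choices made here for illustration. The final k-means step and the ARI computation are omitted.

```python
import numpy as np

def planted_weights(n_nodes, n_clusters, rng, w_in=5.0, w_out=4.7, sd=0.5):
    """Symmetric weight matrix with equal-size planted clusters (zero diagonal)."""
    labels = np.repeat(np.arange(n_clusters), n_nodes // n_clusters)
    same = labels[:, None] == labels[None, :]
    w = np.where(same, w_in, w_out) + sd * rng.standard_normal((n_nodes, n_nodes))
    w = np.triu(w, 1)  # keep one draw per undirected edge
    return w + w.T, labels

def noisy_layer(w, sigma, rng):
    """Add symmetric i.i.d. zero-mean Gaussian noise to every edge weight."""
    noise = np.triu(sigma * rng.standard_normal(w.shape), 1)
    return w + noise + noise.T

def combine(w1, w2, beta):
    """Convex combination of the two observed layers, as in (18)."""
    return beta * w1 + (1.0 - beta) * w2

def spectral_embedding(w_hat, k):
    """Eigenvectors of L = D - A belonging to the k smallest eigenvalues."""
    a = w_hat.copy()
    np.fill_diagonal(a, 0.0)
    lap = np.diag(a.sum(axis=1)) - a
    _, vecs = np.linalg.eigh(lap)  # eigh returns eigenvalues in ascending order
    return vecs[:, :k]

rng = np.random.default_rng(0)
w_true, labels = planted_weights(100, 10, rng)
w1 = noisy_layer(w_true, 1.0, rng)  # less noisy layer (assumed sigma_1 = 1)
w2 = noisy_layer(w_true, 2.0, rng)  # noisier layer (assumed sigma_2 = 2)

betas = np.linspace(0.0, 1.0, 11)
errors = [np.linalg.norm(combine(w1, w2, b) - w_true) for b in betas]
best_beta = betas[int(np.argmin(errors))]

# Rows of `embedding` would be passed to a clustering routine such as k-means.
embedding = spectral_embedding(combine(w1, w2, best_beta), 10)
```

In this toy setting the error-minimizing $\beta$ falls strictly between 0 and 1 and leans toward the less noisy layer, which is the qualitative behavior the section describes.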
This simulation example uses the Bayesian posterior representation, where $P(W_1 \mid W)$ and $P(W_2 \mid W)$ are isotropic multivariate Gaussian distributions with (posterior) mean $W$. Two weighted random graphs with 500 nodes are constructed with 10 known clusters of equal size. The weights between nodes in the same cluster are independently generated from the normal distribution $\mathcal{N}(5, 0.5)$, and the edge weights between nodes that are not in the same cluster are independently generated from the normal distribution $\mathcal{N}(4.7, 0.5)$. The gap between these two edge-weight distributions simulates an underlying community structure with variability. The networks are then corrupted with i.i.d. Gaussian noise on each edge weight with zero mean and different variances. Specifically, the first network layer is corrupted with additive noise distributed as $\mathcal{N}(0, \sigma_1)$ and the second layer is corrupted with additive noise distributed as $\mathcal{N}(0, \sigma_2)$. This setup corresponds to the form of $\hat{W}$ that is derived in (18).

For various choices of the mixing parameter $\beta$, the combined network $\hat{W}$ is calculated and then clustered using a spectral clustering algorithm [7]. The spectral clustering algorithm finds the eigenvectors of the graph Laplacian $L = D - A$, where $D$ and $A$ are the degree and adjacency matrices obtained from $\hat{W}$. The Adjusted Rand Indices (ARI) [8] are computed in comparison to the true clustering structure; this gives us a measure of the quality of the clustering. For each of several different levels of noise variance, this experiment is run 50 times, and the results are averaged. Figure 4 reports the results for the solution (15), and shows that using the mixture (14) to combine the networks improves clustering when compared to using only one layer of the network, as expected.

V. PARETO SUMMARIZATIONS

Of course, in practice it may be difficult to effectively set the prior parameter $\alpha$.
In such cases we can generate a family of MAP estimates and apply multiple-objective ranking techniques. In particular, one can view the maximization (15) of the combined posterior distributions as a particular scalarization of a multi-objective optimization problem. However, there are other solutions to multiple objective optimization that do not use linear scalarization, such as Pareto front analysis [9], [10], [11].

[Surface plot; axes: $\sigma_2$, $\beta$, ARI]
Fig. 4. Clustering Simulation. This surface plot shows the ARI for different values of $\sigma_2$ and $\beta$. Note that for all levels of $\sigma_2$, a $\beta$ that is around 0.5 tends to produce the best clustering.

TABLE I
VARIANCES AND ARI SCORES

$\sigma_1$   $\sigma_2$   Max ARI   $\beta$
1            1            0.6843    0.4747
1            1.5          0.6561    0.5859
1            2            0.5564    0.6364
1            2.5          0.5649    0.6970
1            3            0.4918    0.7879
1            3.5          0.5209    0.7475
1            4            0.4809    0.7374
1            4.5          0.4653    0.7879

Consider the multi-objective optimization problem

$$ \hat{W} = \operatorname*{argmin}_{W} \left[ f_1(W), f_2(W) \right], \tag{19} $$

where the minimization in (19) is in the sense of multi-objective minimization, to be made clear below. For the model derived in Section III, we have $f_1(W) = -P_1(W \mid W_1)$ and $f_2(W) = -P_2(W \mid W_2)$, with (19) being interpreted in terms of linear scalarization using weighting coefficients $\gamma$ and $1 - \gamma$:

$$ \hat{W} = \operatorname*{argmin}_{W} \left[ \gamma f_1(W) + (1 - \gamma) f_2(W) \right]. \tag{20} $$

An alternative to the scalarization approach is a ranking approach that seeks to find a family of solutions $W$ that would be highly ranked by any scalarization, linear or non-linear. This leads to the idea of Pareto optimization. A solution to a multi-objective optimization problem is said to be weakly Pareto optimal (or weakly non-dominated) if it is not possible to improve any single objective function without worsening some other objective function [2], [3].
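To make the ranking idea concrete, the sketch below filters a finite set of candidate solutions down to its non-dominated subset, using the usual dominance relation for minimization: one objective vector dominates another if it is no larger in every objective and strictly smaller in at least one. The function names and the five example objective pairs are hypothetical and chosen only so that the result can be checked by hand; this is an illustrative brute-force filter, not the Pareto front method of [9], [10], [11].

```python
import numpy as np

def dominates(fa, fb):
    """True if objective vector fa dominates fb (all objectives minimized)."""
    fa, fb = np.asarray(fa, dtype=float), np.asarray(fb, dtype=float)
    return bool(np.all(fa <= fb) and np.any(fa < fb))

def first_pareto_front(points):
    """Indices of the points that no other point dominates (O(n^2) scan)."""
    pts = np.asarray(points, dtype=float)
    front = []
    for i, p in enumerate(pts):
        dominated = any(dominates(q, p) for j, q in enumerate(pts) if j != i)
        if not dominated:
            front.append(i)
    return front

# Hypothetical (f1, f2) values for five candidate solutions.
objs = [(-0.08, -0.01),
        (-0.05, -0.05),
        (-0.01, -0.08),
        (-0.03, -0.03),   # dominated by (-0.05, -0.05)
        (-0.05, -0.02)]   # dominated by (-0.05, -0.05)
front = first_pareto_front(objs)
```

Here `front` contains the indices of the first three candidates; the last two are each dominated by the second candidate. The quadratic scan is adequate for small candidate sets, and faster sorting-based methods exist for larger ones.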
More formally, we say that a solution $W_1$ dominates a solution $W_2$ (here $W_1$ and $W_2$ denote two candidate solutions rather than the observed layers) if $f_i(W_1) \le f_i(W_2)$ for every objective function $f_i$ and there exists some $j$ such that $f_j(W_1) < f_j(W_2)$. The first Pareto front is the set of weakly non-dominated points.

[Plot; axes: $f_1(W)$, $f_2(W)$]
Fig. 5. Pareto front for two Gaussians. A convex Pareto front would bulge toward the lower left corner, but this plot demonstrates that even relatively simple objective equations can have extremely non-convex Pareto fronts.

In terms of finding Pareto optimal points, the linear scalarization technique discussed above can identify the complete Pareto front when the solution space is a convex set and the individual objective functions are convex functions on the solution space [5]. However, if these convexity conditions are not met, the scalarization technique will not find the entire Pareto front. Often, the posterior distributions in (19) are not convex. Figure 5 shows an example of the Pareto front of a multi-objective optimization, where $f_1$ and $f_2$ are the negatives of two-dimensional normal pdfs (consistent with $f_i = -P_i$ above and with the negative axis range of Figure 5), as shown below:

$$ f_i(W) = -(2\pi)^{-n/2}\, |\Sigma_i|^{-\frac{1}{2}}\, e^{-\frac{1}{2} (W - W_i)^T \Sigma_i^{-1} (W - W_i)} \tag{21} $$

$$ W_1 = \begin{bmatrix} 10 \\ 8 \end{bmatrix}, \quad W_2 = \begin{bmatrix} 8 \\ 10 \end{bmatrix}, \quad \Sigma_1 = \Sigma_2 = 2 I_2. \tag{22} $$

Even this relatively simple distribution has a non-convex Pareto front; note that minimizing a linear combination of $f_1$ and $f_2$ can only find optima at the extremes of the curve, and does not explore the interior, which may be more useful for some applications. This example motivates further research into generating MAP estimates in this manner, as finding the Pareto front could give us an advantage when attempting to infer parameters of the model as we do above, or when performing some other common task; see for instance [12].

VI. STOCHASTIC BLOCK MODELS AND THE DSBM

Consider a single layer network.
Often we are interested in networks that are expected to have some community structure. A community is defined as a subset of nodes that behave similarly to each other, where similarity is determined according to some fixed criterion. This allows for a more interesting community structure than just using the density of connections in a group, i.e., creating communities based on high intra-connectivity between nodes. For instance, one group may exhibit strong interconnection with another group, but only moderate connectivity within themselves.

A Stochastic Block Model (SBM) is one way to model such community structure [13], [14]. Consider a network with $N$ nodes that we expect to fall in $K$ classes, where $c \in \mathbb{R}^N$ is a known class membership vector. In this setup we are considering binary relationships between nodes, and so a connectivity matrix $A = \{a_{xy}\} \in \mathbb{R}^{N \times N}$ is observed. The parameters for a standard SBM are prior probabilities of edges occurring between nodes within and across classes. Specifically, let $\Theta$ be a matrix of class probabilities called the Bernoulli parameter matrix, where $\theta_{ij}$ is the probability of a link forming between a node in class $i$ and a node in class $j$. While the graph adjacency matrix will be $N \times N$, $\Theta \in \mathbb{R}^{K \times K}$ and is symmetric. Letting $S_i = \{x \mid c(x) = i\}$, it can be shown [6] that the MLE of $\theta_{ij}$ is

$$ \hat{\theta}_{ij} = \frac{m_{ij}}{n_{ij}} \tag{23} $$

$$ m_{ij} = \sum_{x \in S_i} \sum_{y \in S_j} a_{xy} \tag{24} $$

$$ n_{ij} = \begin{cases} |S_i|\,|S_j|, & i \ne j \\ |S_i|\,(|S_i| - 1), & i = j. \end{cases} \tag{25} $$

This estimate of $\Theta$ (which we call $Y$) can be used to explore the structure of the network. When the class membership vector $c$ is unknown, the SBM can be modified to simultaneously estimate $c$ and the Bernoulli matrix $\Theta$ [15].

The SBM accounts for community structure, but does not account for temporal changes in the network. One solution to this problem would be to fit an SBM to every time step in the sequence.
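The estimator (23)-(25) is simple enough to write out directly. The sketch below is one possible NumPy rendering for a single snapshot, assuming a symmetric binary adjacency matrix with a zero diagonal and integer class labels in $0, \ldots, K-1$; the function name `sbm_mle` and the four-node example are hypothetical.

```python
import numpy as np

def sbm_mle(adj, classes, n_classes):
    """MLE of the Bernoulli parameter matrix for known class memberships.

    adj       : (N, N) symmetric 0/1 adjacency matrix, zero diagonal (assumed)
    classes   : length-N integer labels in {0, ..., n_classes - 1}
    """
    adj = np.asarray(adj, dtype=float)
    classes = np.asarray(classes)
    theta = np.full((n_classes, n_classes), np.nan)
    for i in range(n_classes):
        in_i = classes == i
        for j in range(n_classes):
            in_j = classes == j
            m_ij = adj[np.ix_(in_i, in_j)].sum()           # eq. (24)
            if i == j:
                n_ij = in_i.sum() * (in_i.sum() - 1)       # eq. (25), i = j
            else:
                n_ij = in_i.sum() * in_j.sum()             # eq. (25), i != j
            if n_ij > 0:
                theta[i, j] = m_ij / n_ij                  # eq. (23)
    return theta

# Four nodes, two classes; edges 0-1 (within class 0) and 0-2 (across classes).
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 0]])
classes = np.array([0, 0, 1, 1])
theta = sbm_mle(adj, classes, 2)
```

For this example the within-class-0 estimate is 1.0 (the only possible pair is connected), the cross-class estimate is 0.25 (one of four possible pairs), and the within-class-1 estimate is 0.0. Entries with no possible pairs are left as NaN rather than guessed.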
This approach, however, fails to take advantage of information from previous time steps, and it does not encourage the class membership to evolve smoothly over time. Recently, the Dynamic SBM (DSBM) has been introduced to account for some of these effects [15], [16], [6]. The DSBM of [6] employs an extended Kalman filter (EKF) to track temporal changes in the network. Two types of DSBM were introduced in [6]: one that is given the class membership a priori, and another that estimates the class memberships along with the other SBM parameters. For the benefit of the reader, the a priori DSBM is briefly reviewed below.

The DSBM is based on the following simple linear observation model:

$$ Y_t = \Theta_t + z_t, \tag{26} $$

where $z_t$ is i.i.d. zero-mean Gaussian noise and $\Theta_t$ is an unknown matrix of Bernoulli parameters at time $t$. Because the elements of $\Theta_t$ must be between 0 and 1, the DSBM uses a logistic transform to map them onto the real line:

$$ \psi_{ij} = \log(\theta_{ij}) - \log(1 - \theta_{ij}) \in (-\infty, +\infty). \tag{27} $$

Since the logistic transform is invertible, (26) can be written as

$$ Y_t = h(\psi_t) + z_t, \tag{28} $$

where $h$ denotes the inverse of the transform (27). A linear state space model for the time evolution of the logistically transformed parameters $\psi_t$ is assumed. With this state space model for $\psi_t$ and the observation model (28), an extended Kalman filter estimator can be implemented to produce state estimates $\hat{\psi}_{t|t-1}$ from which the SBM parameters can be tracked over time:

$$ \hat{\psi}_{t|t-1} = F_t \hat{\psi}_{t-1|t-2} + K_{t|t-1} \eta_t, \tag{29} $$

where $\eta_t = Y_t - H_t \hat{\psi}_{t|t-1}$ is the Kalman innovation process, $K_{t|t-1}$ is the ($\hat{\psi}_{t|t-1}$-dependent) Kalman gain, and $H_t$ is the Jacobian of $h(\hat{\psi}_{t|t-1})$. Once the inference is complete, the Kalman estimate is then mapped back into Bernoulli parameters. When the class memberships are unknown, the DSBM can be modified to estimate these memberships and the probability parameters simultaneously [7].
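The transform (27), its inverse $h$, and the role of the Jacobian can be illustrated with a small tracking loop. The sketch below is a generic textbook extended Kalman filter applied independently to each block probability under a random walk state model ($F_t = I$); it is not the filter of [6], whose exact recursion and covariance structure are not reproduced here. The process and observation noise variances `q` and `r`, the function names, and the constant observations are assumptions made for illustration, and the innovation is computed in the common form $y - h(\hat{\psi})$.

```python
import numpy as np

def logit(theta):
    """Transform (27): map probabilities in (0, 1) onto the real line."""
    return np.log(theta) - np.log1p(-theta)

def logistic(psi):
    """Inverse of (27): map real values back to probabilities."""
    return 1.0 / (1.0 + np.exp(-psi))

def ekf_step(psi, var, y, q=0.01, r=0.01):
    """One predict/update cycle, each entry treated as an independent scalar state.

    q, r are assumed process and observation noise variances (hypothetical values).
    """
    psi_pred = psi                  # random walk prediction, F_t = I
    var_pred = var + q
    h = logistic(psi_pred)
    jac = h * (1.0 - h)             # derivative of the logistic function
    gain = var_pred * jac / (jac ** 2 * var_pred + r)
    psi_new = psi_pred + gain * (y - h)
    var_new = (1.0 - gain * jac) * var_pred
    return psi_new, var_new

# Track two block probabilities from repeated (here noise-free) observations.
psi = logit(np.array([0.5, 0.5]))
var = np.ones(2)
observed = np.array([0.3, 0.8])
for _ in range(100):
    psi, var = ekf_step(psi, var, observed)
theta_hat = logistic(psi)           # map the state estimate back to probabilities
```

Working in the transformed space keeps every estimate strictly inside (0, 1) after mapping back, which is the motivation the text gives for (27).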
For the ENRON data experiment described below, we implemented a multi-layer extension of the a priori DSBM in [7] using a simple random walk state space model ($F_t = I$, the identity matrix).

VII. ENRON EXAMPLE

The proposed dynamic SBM multi-layer community detection approach of Section VI is illustrated on the real-world ENRON email data set (see footnote 1). This data set consists of approximately half a million email messages sent or received by 150 senior employees of the ENRON Corporation. These emails were made publicly available as a result of the SEC investigation of the company in 2002, and constitute one of the largest publicly available email repositories. This dataset represents a unique opportunity to examine private email messages in a corporate setting. This is rare due to privacy concerns and proprietary information, but the ENRON dataset is for the most part untouched, except for a few emails that were specifically requested to be removed. In addition to the raw emails, the dataset also contains the job titles of the employees that are included. This is useful to separate the employees into classes, so that we may examine their behavior using the DSBM and its related techniques.

To explore the multi-level structure, two layers are extracted from the ENRON dataset. As discussed previously, one layer represents the extrinsic, "relational" information between users, and the other represents intrinsic, "behavioral" information between users. The network layers are extracted from the data as follows. First, a relational network is recovered from the headers of emails by identifying the sender and receiver(s) of each message, including Cc and Bcc recipients. For each week in the dataset, a separate network of employees is constructed from the emails sent during that week. A second set of behavioral networks is recovered using the contents of email messages.
On the same weekly basis, the contents of all emails originating from each user are combined to form long "documents". Only emails that are sent by the user are considered, which is different from the relational case. This is to obtain a better representation of each user's individual writing habits, as opposed to the writing habits of the user and their peers. These emails combine to produce a dictionary of words from which term frequency-inverse document frequency (TF-IDF) scores are calculated [17]. TF-IDF scores are commonly used for identifying important words in text analysis, and are computed using

$$ \mathrm{tf}(t, d) = \frac{f(t, d)}{\max_{\hat{t}} f(\hat{t}, d)} \tag{30} $$

$$ \mathrm{idf}(t) = \log\left( \frac{|D|}{N(t, D)} \right) \tag{31} $$

$$ \mathrm{score}(t, d) = \mathrm{tf}(t, d)\, \mathrm{idf}(t), \tag{32} $$

where $f(t, d)$ is the frequency of term $t$ in document $d$, $N(t, D)$ is the number of documents in which the term $t$ appears, and $|D|$ is the size of the document corpus, which in this case is the number of active network nodes.

Footnote 1: http://www.cs.cmu.edu/~enron

For each active user (document), a TF-IDF score is computed for each word in the dictionary. Using the vector of TF-IDF scores for each user, we measure the cosine similarity between each pair of users by taking dot products of their score vectors in order to obtain a similarity matrix $W$. Again, this is done for every week in the relevant time period, creating a second dynamic network with weighted edges. However, since we started in the SBM framework, it is necessary to transform the weighted edge network into a binary network. To do this, the similarity scores are thresholded. To be roughly consistent with the density of the relational network, we keep the largest 15% of the correlations between users at each time step, setting all other connections to 0. This allows us to create networks of similar sparsity level.

The above procedure yields a two-layer binary dynamic network that we can use to obtain insight into the structural dynamics of the ENRON data.
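The behavioral-layer construction for a single week can be sketched as follows, starting from a matrix of term counts with one row per user "document". This is an illustrative sketch rather than the authors' pipeline: tokenization and dictionary construction are skipped, the function names and the four-user count matrix are hypothetical, and the code assumes every document is non-empty and every term appears in at least one document. Cosine similarity is computed as the dot product of length-normalized score vectors.

```python
import numpy as np

def tfidf_scores(counts):
    """counts[d, t] = frequency of term t in document d; scores per (30)-(32)."""
    counts = np.asarray(counts, dtype=float)
    tf = counts / counts.max(axis=1, keepdims=True)   # eq. (30)
    doc_freq = (counts > 0).sum(axis=0)               # N(t, D)
    idf = np.log(counts.shape[0] / doc_freq)          # eq. (31)
    return tf * idf                                   # eq. (32)

def cosine_similarity(scores):
    """Pairwise cosine similarity between the rows of `scores`."""
    norms = np.linalg.norm(scores, axis=1, keepdims=True)
    unit = scores / np.where(norms > 0, norms, 1.0)
    return unit @ unit.T

def threshold_top_fraction(sim, frac=0.15):
    """Binary symmetric adjacency keeping the largest `frac` of pairwise scores."""
    n = sim.shape[0]
    rows, cols = np.triu_indices(n, k=1)
    vals = sim[rows, cols]
    n_keep = int(round(frac * vals.size))
    adj = np.zeros((n, n), dtype=int)
    if n_keep > 0:
        keep = np.argsort(vals)[::-1][:n_keep]        # indices of largest scores
        adj[rows[keep], cols[keep]] = 1
    return adj + adj.T

# Hypothetical term counts: 4 users (rows) over a 3-word dictionary (columns).
counts = np.array([[2, 0, 1],
                   [2, 0, 0],
                   [0, 3, 1],
                   [0, 3, 0]])
scores = tfidf_scores(counts)
sim = cosine_similarity(scores)
adj = threshold_top_fraction(sim, frac=0.15)
```

With four users there are six pairs, so a 15% threshold keeps a single edge, here the one between the last two users, whose score vectors are the most closely aligned.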
To do so, we extended the dynamic stochastic block model (DSBM) [6], [18] to the multi-layer setting. We group employees by their role in the company (CEO, President, Director, etc.). Thus, the DSBM class memberships are known a priori, and the a priori DSBM described in Section VI can be implemented to estimate the Bernoulli parameters, which predict the likelihood of an edge between users from any pair of groups. Figure 6(a) and Figure 6(b) show some of the estimated Bernoulli parameters for different classes when the DSBM is run on the two layers separately. Figure 6(a) represents the evolution of the relational layer, while Figure 6(b) represents the behavioral layer. The DSBM was run over a 120 week period, from December 6th, 1999 to March 27th, 2002. The vertical lines represent important events in the ENRON time line. Line 1 corresponds to ENRON releasing a code of ethics policy. It is also the first time that the company's stock rose above $90. Line 2 corresponds to their stock closing below $60. This was a critical point in the timeline, because the company began losing many partnerships, including one to create a video-on-demand system. In this same month, a few of the employees had begun to communicate their uneasiness with ENRON's accounting practices. Line 3 is the week of Jeffrey Skilling's resignation. A mere month after his resignation as CEO, the SEC began their official inquiry into ENRON. These events are chosen as a baseline to compare the two layers of the network.

For the relational DSBM parameters, the most interesting results come from the CEOs' activity. Note that the CEO group combines all past and present CEOs.
[Figure panels: Directors, CEOs, Presidents, Vice Presidents. Axes: time step (0 to 120) versus estimate of edge probabilities (0 to 1), with vertical event lines 1, 2, 3. Legend: to directors, to CEOs, to presidents, to VPs, to managers, to traders, to others. (a) Relational DSBM Parameters. (b) Behavioral DSBM Parameters.]
Fig. 6. DSBM Simulation Results. These graphs show the estimated DSBM parameters for different classes, and how they evolve over time. (a) is the evolution of the DSBM parameters from the relational layer, while (b) is the evolution of parameters from the behavioral layer.

This evolution of parameters seems to indicate that during some of the important milestones in ENRON's demise, the CEOs were talking to each other more often, as well as sending out emails to the other employees in the network. This suggests that they were at least somewhat aware of what was happening with the company during these events, and had maybe discussed matters among themselves. From the relational layer, it also appears that the CEOs were the most active in communicating with other groups, whereas the Directors showed very little connectivity.
One explanation for this is that because the subset of employees that were studied were higher up in the company, the Director group did not communicate with them as much, instead managing the lower level employees. Another interesting result is that the President group had much more activity towards the end of the time period, suggesting that as the legal situation worsened, their activity increased.

[Figure panels: Directors, CEOs, Presidents, Vice Presidents. Axes: time step (0 to 120) versus EKF estimate of edge probabilities (0 to 1), with vertical event lines 1, 2, 3. Legend: to directors, to CEOs, to presidents, to VPs, to managers, to traders, to others.]
Fig. 7. Combined DSBM Results. These graphs show the results of combining the two layers of the network with a parameter $\alpha = 0.5$. Therefore, we should see attributes from both the behavioral and relational DSBM, and maybe some new, interesting results that result from combining the two layers.

The behavioral DSBM parameters appear to be noisier than their relational counterparts. In addition, they show very different behavior than the relational layer. The Vice Presidents appear much more active during the entire period when compared with the relational layer. Because of the nature of the TF-IDF and thresholding process, there could be a number of reasons for this. One possible reason is that in the weeks in which the Vice Presidents were active, they may have been sending many forwarded emails, acting as a conduit of information between parties. This would cause the TF-IDF scores for the Vice President group to rise. Another interesting phenomenon in the behavioral layer is that of the CEOs.
Specifically, it is interesting how their activity drops off significantly, and in fact one event that is very much apparent in the relational parameters completely disappears. This can only happen if the document content for the CEOs during those weeks is completely orthogonal to that of the other groups. Because we consider only text that the sender has written, and we only consider sent emails, one explanation could be that the CEOs forwarded many emails without adding any additional text. This would cause the list of words for the CEOs to become very small. However, examination of the dataset suggests a more likely explanation: there is a large amount of activity in the relational dataset because many of the employees were emailing the CEO in a petition-like fashion, whereas the CEO group actually sent very few emails during that time.

Combining the two networks as in Section III, we run the DSBM for different levels of the mixing parameter $\alpha$. This is the probability of the selection variable choosing $W_1$ over $W_2$. Because of the use of binary networks in this example, the $\alpha$ parameter is used as the probability that the combined data will choose to use the relational network when the two layers disagree with each other. The objective in this particular example is to show that using this method we can not only reduce noise, but also discover interesting multifaceted behavior that is not obvious from one layer alone. We expect that this form of combination will emphasize traits or attributes that occur in both networks; however, attributes that exist mostly in one network but are strong enough will also be retained. We can study these effects through various network measures; in this case we look at betweenness and degree centrality.

Figure 7 shows the DSBM parameters for the mixing parameter $\alpha = 0.5$. Smaller values of $\alpha$ should be chosen because the relational network seems to be less noisy and more stable.
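One plausible reading of this binary combination rule is sketched below: wherever the two layers agree, the common value is kept, and wherever they disagree, the relational entry is chosen with probability $\alpha$ and the behavioral entry otherwise, with one draw per undirected pair so the result stays symmetric. The text does not spell out these details, so the per-edge independent draws, the function name `mix_binary_layers`, and the small example matrices are all assumptions.

```python
import numpy as np

def mix_binary_layers(relational, behavioral, alpha, rng):
    """Combine two symmetric 0/1 layers; on disagreement pick relational w.p. alpha."""
    rel = np.asarray(relational, dtype=int)
    beh = np.asarray(behavioral, dtype=int)
    n = rel.shape[0]
    # One Bernoulli(alpha) draw per undirected pair, mirrored to stay symmetric.
    pick_rel = np.triu(rng.random((n, n)) < alpha, 1)
    pick_rel = pick_rel | pick_rel.T
    # Where the layers agree, both branches give the same value.
    combined = np.where(pick_rel, rel, beh)
    np.fill_diagonal(combined, 0)
    return combined

# Hypothetical 4-node layers that agree on edge 0-1 and disagree on three pairs.
rel = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 0],
                [0, 0, 0, 0]])
beh = np.array([[0, 1, 1, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]])
rng = np.random.default_rng(0)
mixed = mix_binary_layers(rel, beh, alpha=0.5, rng=rng)
```

Setting `alpha=1.0` reproduces the relational layer and `alpha=0.0` the behavioral layer, while intermediate values retain every edge present in both layers and a random subset of the edges present in only one.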
The relational network's stability makes sense, as the extrinsic relational interactions are directly measured. One interesting phenomenon is that much of the behavior that we saw in the relational layer is present, including the high level of CEO activity. However, the period of inactivity in the behavioral layer for the CEO group has a dampening effect on some of the strong peaks that we saw toward the end of the time period.

Figure 8 shows the betweenness centrality of the Directors group over time as the mixing parameter is varied. In general, the betweenness rises roughly monotonically as α is varied; however, from week 95 to week 115, betweenness centrality is significantly increased when using a combined dynamic network, that is, an intermediate value of α. This time corresponds to the beginning of the company's upheaval and public disclosure of troubles. It may be concluded that by examining both network layers simultaneously we have removed some of the edges between other classes, and thus the centrality score of this particular group increased. It is true that during this time, when overall email usage increased, the betweenness centrality measure went down, as there were more shortest paths through users from other groups. Using the combination of layers, however, there appears to be an increase in the number of shortest paths through the Directors group.

On the other hand, we can also see well-behaved monotonic behavioral correlations in some cases. Figure 9 shows a transition of degree centrality for the class of CEOs (of which there were four during this time period). The behavioral network shows more connectivity for the CEO class. This phenomenon makes sense, as the behavioral data takes into account all written documents, which could be correlated with those of other users, while the relational network only takes into account direct communication between the CEOs and others.
In reality, much of the CEOs' communication is performed through third parties (such as assistants), and thus CEOs probably do not send as much email as the average employee. Increasingly anomalous behavior occurs toward the end of the time period. We hypothesize that this is due to a larger volume of unusual emails sent directly to the CEO during this tumultuous period.

[Figure 8: betweenness centrality (0 to 30) plotted against α (0 to 1) and time in weeks (20 to 120).]
Fig. 8. Betweenness Centrality for Directors. This centrality is a measure of how connected a node is to the rest of the network. Larger centrality scores often occur for intermediate values of α, particularly between time 95 and 115.

[Figure 9: degree centrality (0 to 4) plotted against α (0 to 1) and time in weeks (20 to 120).]
Fig. 9. Degree Centrality for CEOs. Higher degree centrality for α near one signifies greater activity in the behavioral network. Anomalous behavior can be seen in the later time steps as activity patterns shift.

VIII. RELATED WORK

The literature on single-layer networks is large, with contributions coming from many different fields. There are many results on structural and spectral properties of a single-layer network, including community detection [19], random walk return times [20], and percolation theory results [21]. Diffusion or infection models have also been studied in the context of complex networks (see [22], for instance). Estimation of community structure in a network of agents is an active area of research in its own right. Specifically, the stochastic block model (SBM) [18], [13] is used to model community structure within a network by assuming identical statistical behavior for disjoint subsets of nodes. These communities are more flexible than simple cliques because it is not required that they be heavily interconnected, but only that they interact with nodes in other subcommunities uniformly.
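To make the SBM assumption concrete, the following minimal sketch draws a directed binary network in which the probability of an edge depends only on the classes of its two endpoints. It is an illustration under stated assumptions rather than the model fitted in the paper: the function name `sample_sbm`, the class labels, and the block probabilities are hypothetical, and self-loops are excluded by choice.

```python
import numpy as np


def sample_sbm(classes, block_probs, rng=None):
    """Draw a directed binary adjacency matrix from a stochastic block model.

    classes     : length-n sequence of integer class labels in {0, ..., k-1}
    block_probs : k x k matrix; entry (a, b) is the probability of an edge
                  from a node in class a to a node in class b
    (Hypothetical helper; names are illustrative only.)
    """
    rng = np.random.default_rng() if rng is None else rng
    classes = np.asarray(classes, dtype=int)
    block_probs = np.asarray(block_probs, dtype=float)

    # Every node pair inherits the edge probability of its pair of classes,
    # so nodes in the same class are statistically interchangeable.
    edge_probs = block_probs[classes[:, None], classes[None, :]]

    adjacency = (rng.random(edge_probs.shape) < edge_probs).astype(int)
    np.fill_diagonal(adjacency, 0)  # no self-loops in this sketch
    return adjacency
```

A class with a small within-class probability but consistent probabilities toward other classes is still a valid block here, which is the sense in which blocks are more flexible than cliques.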
More recently, the SBM has been extended to track temporal changes in the network; the resulting model is appropriately called a dynamic SBM, or DSBM. We follow the development in [6], but there have been other extensions of the classic SBM. In particular, [15] uses Gibbs sampling and probabilistic simulated annealing to estimate the Bernoulli parameters and class memberships over time. [16] also fits a DSBM, but with a mixed membership model for the agents. The DSBM in [6] uses an extended Kalman filter to track temporal changes between nodes, which results in a smoothed and potentially insightful evolution of the estimated parameters.

Recently, there has been growing interest in the multi-level network problem. Some basic network properties have been extended to the multi-level structure [23], [24], as have some single-layer concepts, such as multi-level network growth [25] and the spreading of epidemics [26]. The metrics that have been proposed attempt to incorporate the dependence of the layers into the statistical framework, which allows for a much richer view of the network. In the same vein, the approach described in this paper performs parameter inference on a multi-level network, incorporating some of the dependence information that the multi-level structure allows. Bayesian model averaging (BMA) is also related to this work; ideas from BMA are used to create conditional independence between the layers of a network [1]. This framework folds the interdependent relationships between the multiple layers into latent variables, which can then be estimated.

IX. CONCLUSION

We introduced a novel method for inference on multilayer networks. A hierarchical model was used to jointly describe the noisy observation matrices and MAP estimation was performed on the relevant latent variable.
A simulation example using clustering demonstrated that mixing layers under the correct circumstances can lead to better results, and possibly a better understanding of the underlying structure between users. A real-life example was also discussed using the ENRON email dataset. The approach developed here can be extended to non-linear multi-objective optimization techniques to explore other ways of inferring multi-layer networks, such as Pareto ranking [12] or posterior Pareto ranking [27].

X. ACKNOWLEDGEMENTS

We would like to thank Kevin Xu for providing the code for the DSBM model and his suggestions for utilizing it, as well as his general comments on the content of the paper.

XI. APPENDIX A: SOLUTION OF TWO GAUSSIAN DISTRIBUTIONS

Theorem 1: Let $W \in \mathbb{R}^n$. The solution to the maximization problem

$$\hat{W} = \arg\max_{W} f(W) = \arg\max_{W} \left[ \gamma_1 P_1(W_1 \mid W) + \gamma_2 P_2(W_2 \mid W) \right], \tag{33}$$

with $P_i(W_i \mid W)$ the multivariate normal distributions

$$P_1(W_1 \mid W) = \mathcal{N}(W, \sigma_1^2 I_n), \tag{34}$$

$$P_2(W_2 \mid W) = \mathcal{N}(W, \sigma_2^2 I_n), \tag{35}$$

is of the form

$$\hat{W} = \beta W_1 + (1 - \beta) W_2, \qquad \beta \in [0, 1]. \tag{36}$$

Proof: The proof is separated into two steps. First, we show that for any arbitrary point $W \in \mathbb{R}^n$, the point $W_\parallel$, which is the projection of $W$ onto the line $g(W) = W_1 + \beta (W_2 - W_1)$ through $W_1$ and $W_2$, does not decrease the value of $f$, that is,

$$f(W) \le f(W_\parallel). \tag{37}$$

Then we show that over all points on the line $g(W)$, $f$ is maximized at some point on the line segment between $W_1$ and $W_2$, corresponding to $\beta \in [0, 1]$.

Let $W \in \mathbb{R}^n$. There exists a unique decomposition of $W$ into its projection $W_\parallel$ onto $g(W)$ and a vector $W_\perp$ perpendicular to $g(W)$:

$$W = W_\parallel + W_\perp. \tag{38}$$

Plugging $W$ into $f(W)$, we have

$$f(W) = \gamma_1 (2\pi)^{-n/2} \left| \sigma_1^2 I_n \right|^{-\frac{1}{2}} e^{-\frac{1}{2} (W - W_1)^T (\sigma_1^2 I_n)^{-1} (W - W_1)} + \gamma_2 (2\pi)^{-n/2} \left| \sigma_2^2 I_n \right|^{-\frac{1}{2}} e^{-\frac{1}{2} (W - W_2)^T (\sigma_2^2 I_n)^{-1} (W - W_2)} \tag{39}$$

$$= \gamma_1 \left( 2\pi\sigma_1^2 \right)^{-n/2} e^{-\frac{1}{2\sigma_1^2} \left\| W_\parallel + W_\perp - W_1 \right\|^2} + \gamma_2 \left( 2\pi\sigma_2^2 \right)^{-n/2} e^{-\frac{1}{2\sigma_2^2} \left\| W_\parallel + W_\perp - W_2 \right\|^2}. \tag{40}$$

The squared norm in the first exponent can be decomposed as follows:

$$(W_\parallel + W_\perp - W_1)^T (W_\parallel + W_\perp - W_1) \tag{41}$$

$$= (W_\parallel - W_1)^T (W_\parallel - W_1) + 2 W_\perp^T (W_\parallel - W_1) + W_\perp^T W_\perp \tag{42}$$

$$= (W_\parallel - W_1)^T (W_\parallel - W_1) + W_\perp^T W_\perp \tag{43}$$

$$\ge (W_\parallel - W_1)^T (W_\parallel - W_1). \tag{44}$$

Note that since $W_1$ and $W_\parallel$ both lie on the line $g(W)$, and $W_\perp$ is orthogonal to the direction of $g(W)$, we have $W_\perp^T (W_\parallel - W_1) = 0$, and so the cross term vanishes. The same can be shown for the other exponential term with $W_2$. Since the squared distance for $W$ is at least as large as that for $W_\parallel$ in both terms, and each term of $f$ decreases as that distance grows,

$$f(W_\parallel) \ge f(W). \tag{45}$$

Finally, let us show that the maximum of $f$ must lie between $W_1$ and $W_2$. This can be seen from the fact that both summation terms in $f$ decrease as the distance between $W$ and the respective means $W_1$ and $W_2$ increases. For a point on the line $g$ but outside the line segment between $W_1$ and $W_2$, moving toward the segment brings the point closer to both means and therefore increases both terms. Therefore, the maximum of $f$ must be on the line $g$, with $\beta$ restricted between 0 and 1.

REFERENCES

[1] A. Raftery, “Bayesian model selection in social research,” Sociological Methodology, vol. 25, pp. 111–164, 1995.
[2] M. Ehrgott, “Multiobjective optimization,” AI Magazine, vol. 29, no. 4, pp. 47–57, Winter 2008. [Online]. Available: http://search.proquest.com.proxy.lib.umich.edu/docview/208128027?accountid=14667
[3] X.-S. Yang, Multiobjective Optimization. John Wiley and Sons, Inc., 2010, pp. 231–246. [Online]. Available: http://dx.doi.org/10.1002/9780470640425.ch18
[4] P. Ngatchou, A. Zarei, and M. El-Sharkawi, “Pareto multi objective optimization,” in Intelligent Systems Application to Power Systems, 2005.
Proceedings of the 13th International Conference on, 2005, pp. 84–91.
[5] A. M. Geoffrion, “Proper efficiency and the theory of vector maximization,” Journal of Mathematical Analysis and Applications, vol. 22, no. 3, pp. 618–630, 1968. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0022247X68902011
[6] K. S. Xu and A. O. Hero III, “Dynamic stochastic blockmodels: Statistical models for time-evolving networks,” CoRR, vol. abs/1304.5974, 2013.
[7] U. von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007. [Online]. Available: http://dx.doi.org/10.1007/s11222-007-9033-z
[8] L. Hubert and P. Arabie, “Comparing partitions,” Journal of Classification, vol. 2, no. 1, pp. 193–218, 1985. [Online]. Available: http://dx.doi.org/10.1007/BF01908075
[9] Y. Jin, Multi-objective machine learning. Springer, 2006, vol. 16.
[10] A. O. Hero, G. Fleury, A. J. Mears, and A. Swaroop, “Multicriteria gene screening for analysis of differential expression with DNA microarrays,” EURASIP Journal on Applied Signal Processing, vol. 2004, pp. 43–52, 2004.
[11] K.-J. Hsiao, K. Xu, J. Calder, and A. Hero, “Multi-criteria anomaly detection using Pareto depth analysis,” in Advances in Neural Information Processing Systems 25, P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds., 2012, pp. 854–862. [Online]. Available: http://books.nips.cc/papers/files/nips25/NIPS2012_0395.pdf
[12] B. Oselio, A. Kulesza, and A. Hero, “Multi-objective optimization for multi-level networks,” in Social Computing, Behavioral-Cultural Modeling and Prediction, ser. Lecture Notes in Computer Science, W. Kennedy, N. Agarwal, and S. Yang, Eds. Springer International Publishing, 2014, vol. 8393, pp. 129–136. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-05579-4_16
[13] Y. J. Wang and G. Y. Wong, “Stochastic blockmodels for directed graphs,” Journal of the American Statistical Association, vol. 82, no. 397, pp. 8–19, 1987.
[14] K. S. Xu, M. Kliger, and A. O. Hero III, “Adaptive evolutionary clustering,” CoRR, vol. abs/1104.1990, 2011. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1104.html#abs-1104-1990
[15] T. Yang, Y. Chi, S. Zhu, Y. Gong, and R. Jin, “Detecting communities and their evolutions in dynamic social networks - a Bayesian approach,” Machine Learning, vol. 82, no. 2, pp. 157–189, 2011. [Online]. Available: http://dx.doi.org/10.1007/s10994-010-5214-7
[16] Q. Ho, L. Song, and E. P. Xing, “Evolving cluster mixed-membership blockmodel for time-varying networks,” in Proc. 14th Int. Conf. Artif. Intell. Stat., 2011, pp. 342–350.
[17] M. Baena-Garcia, J. Carmona-Cejudo, G. Castillo, and R. Morales-Bueno, “Tf-sidf: Term frequency, sketched inverse document frequency,” in Intelligent Systems Design and Applications (ISDA), 2011 11th International Conference on, 2011, pp. 1044–1049.
[18] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109–137, 1983. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0378873383900217
[19] M. E. J. Newman, “Fast algorithm for detecting community structure in networks,” Phys. Rev. E, vol. 69, p. 066133, Jun 2004. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevE.69.066133
[20] J. D. Noh and H. Rieger, “Random walks on complex networks,” Phys. Rev. Lett., vol. 92, p. 118701, Mar 2004. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevLett.92.118701
[21] R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Rev. Mod. Phys., vol. 74, pp. 47–97, Jan 2002. [Online]. Available: http://link.aps.org/doi/10.1103/RevModPhys.74.47
[22] A. Guille, H. Hacid, C. Favre, and D. A. Zighed, “Information diffusion in online social networks: a survey,” SIGMOD Rec., vol. 42, no. 1, pp. 17–28, Jul. 2013. [Online]. Available: http://doi.acm.org.proxy.lib.umich.edu/10.1145/2503792.2503797
[23] F. Battiston, V. Nicosia, and V. Latora, “Metrics for the analysis of multiplex networks.”
[24] G. Bianconi, “Statistical mechanics of multiplex networks: Entropy and overlap,” Phys. Rev. E, vol. 87, p. 062806, Jun 2013. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevE.87.062806
[25] V. Nicosia, G. Bianconi, V. Latora, and M. Barthelemy, “Growing multiplex networks,” Phys. Rev. Lett., vol. 111, p. 058701, Jul 2013. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevLett.111.058701
[26] A. Saumell-Mendiola, M. A. Serrano, and M. Boguñá, “Epidemic spreading on interconnected networks,” Phys. Rev. E, vol. 86, p. 026106, Aug 2012. [Online]. Available: http://link.aps.org/doi/10.1103/PhysRevE.86.026106
[27] A. O. Hero and G. Fleury, “Pareto-optimal methods for gene ranking,” VLSI Signal Processing, vol. 38, no. 3, pp. 259–275, 2004. [Online]. Available: http://dblp.uni-trier.de/db/journals/vlsisp/vlsisp38.html#HeroF04
