Detecting change points in the large-scale structure of evolving networks

Detecting c hange p oin ts in the large-scale structure of ev olving net w orks Leto P eel 1 , ∗ and Aaron Clauset 1, 2, 3 , † 1 Dep artment of Computer Scienc e, University of Color ado, Boulder, CO 80309 2 BioF r ontiers Institute, University of Color ado, Boulder, CO 80303 3 Santa F e Institute, 1399 Hyde Park R d., Santa F e, NM 87501 In teractions among p eople or ob jects are often dynamic in nature and can b e represen ted as a sequence of netw orks, eac h providing a snapshot of the in teractions ov er a brief p eriod of time. An imp ortan t task in analyzing such evolving net works is change-p oint dete ction , in which w e both iden tify the times at which the large-scale pattern of interactions changes fundamen tally and quan tify ho w large and what kind of change o ccurred. Here, we formalize for the ﬁrst time the net work c hange-p oin t detection problem within an online probabilistic learning framew ork and in troduce a metho d that can reliably solv e it. This method com bines a generalized hierarchical random graph mo del with a Ba yesian hypothesis test to quantitat ively determine if, when, and precisely how a c hange p oin t has o ccurred. W e analyze the detectability of our metho d using synthetic data with kno wn change points of diﬀerent types and magnitudes, and show that this metho d is more accurate than sev eral previously used alternativ es. Applied to tw o high-resolution evolving so cial net w orks, this metho d iden tiﬁes a sequence of change points that align with kno wn external “sho c ks” to these net works. I. INTR ODUCTION Net works are frequen tly used as a general framework to quantify and analyze the in teractions b etw een ob jects or people. Netw ork mo dels can b e used to b etter under- stand the large-scale structure of interactions by identi- fying clusters of highly in teracting communities or func- tional groups of structurally equiv alent no des. How ev er, these in teractions are often dynamic in nature, and tradi- tional approaches can o v erlo ok the non-stationary struc- ture of real net works. In these dynamic and temp orally ev olving systems w e are not only interested in under- standing the large-scale structure but also iden tifying if, when and how it changes in time. F or instance, in so cial netw orks, change p oin ts may b e the result of normal p erio dic b ehavior, as in the weekly transition from w eekda ys to w eek ends. In other cases, c hange p oints ma y result from the collective an ticipa- tion of or response to external ev ents or system “sho c ks”. Detecting suc h changes in social netw orks could pro vide a better understanding of patterns of so cial life and an early detection of so cial stress caused b y , e.g, natural or man-made disasters. Here w e deﬁne the network change-p oint dete ction problem and introduce an online probabilistic learning al- gorithm for solving it. Identifying a c hange p oint requires inferring a structural “norm” for interactions across a se- quence of graphs and accurately detecting if, when and ho w this norm has shifted at some p oin t in time. T o c haracterize what kind and how large a change o ccurred, w e prefer interpretable mo dels of netw ork structure, so that c hanges in parameter v alues ha ve direct meaning with resp ect to the net w ork’s large-scale structure. Here, ∗ leto.peel@colorado.edu † aaron.clauset@colorado.edu 0.50 0.25 0.00 w t c t c ^ t d Struc tur al Inde x T ime s FIG. 1: Schematic of a netw ork change p oin t. A sequence of net works in which v ertices divide into tw o groups at time t c , represen ted by a change in an abstract structural index ∆ µ . T o detect this c hange point, w e estimate the time of change b t c within a sliding windo w of the last w net works, and call t d the time of detection in which b t c is found to be statistically signiﬁcan t. w e take the no vel approach of c haracterizing netw ork norms via probabilistic distributions ov er graphs, whic h w e learn in an online fashion. Iden tifying the timing and shap e of suc h c hange points divides a netw ork’s ev olu- tion in to con tiguous p erio ds of relative structural stabil- it y , allowing us to subsequently analyze each p erio d in- dep enden tly , while also facilitating hypotheses ab out the underlying processes shaping the data. Bey ond sp ecialized solutions for change-point detec- tion in cybersecurity domains [18], many “change-point detection” metho ds for netw orks are in fact solving the distinct anomaly detection problem of identifying net- w ork snapshots that deviate signiﬁcan tly from a station- ary netw ork norm [4, 14]. In contrast, traditional c hange- p oin t detection fo cuses on the more diﬃcult problem of iden tifying signiﬁcant shifts in the norm itself. Solving this problem requires distinguishing a statistically signif- ican t c hange from mere noise, and th us qualitative ap- 2 proac hes are lik ely to b e insuﬃcien t [6, 24]. Muc h lik e traditional online c hange-p oint detection metho ds for scalar- or vector-v alued time series [5], our approac h for the netw ork change-point detection problem has three comp onents: 1. select a parametric family of probability distribu- tions appropriate for the data, and a sliding window size w ; 2. infer t w o v ersions of the mo del, one represen ting a c hange of parameters at a particular p oin t in time within the windo w, and the one representing the n ull h yp othesis of no c hange o v er the entire win- do w; and, 3. conduct a statistical h ypothesis test to choose whic h mo del, change or no-change, is the b etter ﬁt. P ast w ork on the net work change-point detection prob- lem has typically con verted the sequence of netw orks in to a time series of scalar v alues and then applied traditional tec hniques [19, 23]. Here, we introduce a nov el solution based on generative mo dels of netw orks, whic h deﬁne a parametric probability distribution ov er graphs. Our particular c hoice of mo del is the generalized hierarc hical random graph (GHRG), which compactly models nested comm unity structure at all scales in a netw ork and pro- vides an in terpretable output for later analysis. Our ap- proac h, ho w ev er, is entirely general, and the GHRG could b e replaced with another generative net work model, e.g., sto c hastic blo ck mo dels [15, 21], hierarc hical graph mo d- els [7, 9], or Kronec ker pro duct graph model [17]. Finally , to choose b et ween the change versus no-change mo dels, w e use a Ba yesian h ypothesis test, with a user-deﬁned parameter specifying a target false-positive rate. W e then sho w that this approac h quantitativ ely and accurately determines if the netw ork norm has c hanged, when precisely the norm change o ccurred, and how the norm has changed. Sp eciﬁcally , w e present a taxonomy of diﬀerent types and sizes of netw ork change p oints and a quantitativ e c haracterization of the diﬃculty of de- tecting them using synthetic netw ork data with known c hange p oints. W e then test the metho d on tw o real, high-resolution evolving so cial netw orks of physical and digital interactions, showing that it more accurately re- co vers the timing of known signiﬁcan t external ev ents than comparable techniques. I I. DEFINING A PROBABILITY DISTRIBUTION OVER NETW ORKS Under a probabilistic approach to change-point de- tection, we must choose a parametric distribution ov er net works. Here, we in tro duce the generalized hierar- c hical random graph (GHRG) model. This mo del has sev eral features that make it attractiv e for change-point FIG. 2: A snapshot of the Enron email net w ork from Octob er 2001 and its corresponding GHRG dendrogram. In the dendrogram, leav es are vertices in the email net work and the tree gives their nested group structure. detection and generalizes the p opular hierarc hical ran- dom graph (HRG) mo del [9]. First, the GHR G natu- rally captures b oth assortativ e and disassortative com- m unity structure patterns, mo dels communit y structure at all scales in the net w ork, and provides accurate and in terpretable ﬁts to so cial, biological and ecological net- w orks. Second, our generalization relaxes the require- men t that the dendrogram is a full binary tree, thereby eliminating the HRG’s non-identiﬁabilit y and improving the mo del’s interpretabilit y for quan tifying how a net- w ork’s structure v aries across a change p oin t. Third, w e use a Bay esian mo del of connection probabilities that quan tiﬁes our uncertain ty ab out the net work’s underly- ing generativ e model. The GHRG mo dels a net work G = { V , E } comp osed of v ertices V and edges E ⊆ { V × V } . The mo del decom- p oses the N v ertices into a series of nested groups, whose relationships are represented by a dendrogram T . V er- tices in G are lea ves of T , and the probabilit y that tw o v ertices u and v connect in G is given by a parameter p r lo cated at their low est common ancestor in T . In the clas- sic HRG model [9], each tree no de in T has exactly tw o subtrees, and p r giv es the densit y of connections betw een the vertices in the left and righ t subtrees. As a result, distinct combinations of dendrograms and probabilities pro duce identical distributions ov er net works, pro ducing a non-identiﬁable mo del. In the GHRG, we eliminate this p ossibilit y by allowing tree nodes to hav e any num b er of c hildren and preferring more compact trees. Figure 2 illustrates the GHRG applied to a netw ork of email com- m unications. Giv en tree T and set of connection probabilities { p r } , the GHR G deﬁnes a distribution o ver netw orks and a lik eliho o d function p ( G | T , { p r } ) = Y r p E r r (1 − p r ) N r − E r , (1) where E r is the num b er of edges b et ween vertices with common ancestor r and N r is the total n umber of p ossible edges betw een v ertices with common ancestor r : N r = X c i 0 or E r < 1, and the likelihoo d giv en by Eq. (1) drops to 0, an unhelpful outcome. W e mitigate this behavior by assuming Ba yesian priors on the p r v alues. No w, instead of setting p r to a p oin t v alue, we mo del eac h p r as a distribution, which quanti- ﬁes our uncertain ty in its v alue and preven ts its expected v alue from becoming 0 or 1. F or conv enience, w e employ a Beta distribution with h yp erparameters α = β = 1, whic h corresp onds to a uniform distribution ov er the pa- rameters p r . Because the Beta distribution is conjugate with the Binomial distribution, we ma y in tegrate out eac h of the p r parameters analytically p ( G | T , α, β ) = Y r Γ( α + β ) Γ( α )Γ( β ) Γ( E r + α )Γ( N r − E r + β ) Γ( N r + α + β ) . (3) By relaxing the binary-tree requirement, the GHRG pro duces a spectrum of hierarchical structure (Fig. 3). On one end of this sp ectrum, T con tains only a single in ternal tree no de—the ro ot—and every pair of vertices connects with the same probability p r asso ciated with it, equiv alent to the p opular Erd˝ os-R ´ en yi random graph mo del. As more tree no des and their parameters are added to T , the num ber of levels of hierarch y increases, allo wing the mo del to capture more v aried large-scale patterns. In the limit, T is a full binary tree, and w e reco ver the classic HR G model [9]. I II. LEARNING THE MODEL Fitting the GHRG mo del to a netw ork requires a searc h ov er all trees on N lea ves and the correspond- ing link probability sets { p r } , which w e accomplish using Ba yesian posterior inference and techniques from ph ylo- genetic tree reconstruction. T ree structures are not amenable to classic conv ex op- timization tec hniques, and instead must be searched ex- plicitly . Ho wev er, searc hing o ver all non-binary trees is costly . Phylogenetic tree reconstruction faces a similar problem, which is commonly solved b y taking a ma jorit y “consensus” of a set of sampled binary trees [8]. This consensus pro cedure selects the set of bipartitions on the lea ves that o ccur in a ma jority of the sampled binary trees, and eac h such set denotes a unique non-binary tree con taining exactly those divisions. F or instance, if every sampled tree is identical, then each is identical to the consensus tree; if every sampled tree is a distinct set of bipartitions, the consensus tree has a single internal no de to whic h all of the leaf nodes are connected. Th us, we es- timate T in the GHRG by using a Mark o v Chain Monte Carlo (MCMC) procedure to ﬁrst sample the p osterior distribution of bipartitions. F rom this set of sampled bi- partitions, w e deriv e their non-binary ma jority consensus tree (an approach previously outlined in Ref. [9], but not used to pro duce a probabilistic mo del) and assign link probabilities { p r } to the remaining tree nodes. Giv en this form ulation, w e may up date the p osterior distribution o v er the parameter p r giv en a sequence of observ ed netw orks { G t } b y up dating the hyperparame- ters as ˜ α r = α + X { G t } E G t r (4) ˜ β r = β + X { G t } N r − E G t r . (5) Th us, w e obtain the p osterior hyperparameters from the sum of the prior pseudo counts of edges and the empir- ically observed edge counts (num b er of present and ab- sen t connections). This Bay esian approac h pro duces an implicit regularization. As the n umber of observ ations N r increases, the posterior distribution becomes increas- ingly p eaked, reﬂecting a decrease in parameter uncer- tain ty . In the GHR G mo del, parameters closer to the ro ot of T represen t larger-scale structures in G and gov- ern the likelihoo d of more edges. These parameters are th us estimated with greater certain ty , while the distri- bution o v er parameters far from the root, representing small-scale structures, hav e greater v ariance. This im- plicit regularization preven ts ov er-ﬁtting to small-scale structural v ariations and improv es the inferred norm’s robustness to noise. IV. DETECTING CHANGE POINTS IN NETW ORKS The ﬁnal piece of our online net work change-point de- tection metho d is to determine whether and when the parameters of our curren t model of “normal” connectiv- it y ha ve changed. T o accomplish this, w e use the p os- terior Ba yes factor ov er a sliding window of ﬁxed length w to detect if any c hanges ha ve o ccurred with resp ect to a GHRG mo del ﬁtted ov er the window. A change is detected if the factor exceeds a threshold determined b y a desired false p ositive rate. A. P osterior Bay es factor T o determine whether or not we b eliev e a change has o ccured within a particular window we use the p osterior 4 E r dös- R én yi G r aph Bi n ar y HRG inc r eas i n g l evels o f h ier ar c h y dec r ea sin g level s of h i er a r c h y FIG. 3: A spectrum of large-scale structure, corresp onding to diﬀerent amoun ts of hierarc h y in the GHR G, ranging from a simple random graph to a complete hierarchical organization. Bayes factor [2]. Similar to a lik eliho od ratio test [3], but consisten t with our Bay esian framework, the p osterior Ba yes factor is a ratio of the observ ed data’s likelihoo d under tw o diﬀeren t mo dels: a null hypothesis model H 0 , in whic h no change occurs, and an alternative h ypothesis mo del H 1 , in which a c hange o ccurs at some particular time t c . Ho wev er, rather than ev aluate the likelihoo ds under maximium lik eliho od parameters, we use the p os- terior marginal lik elihoo d by w eighting the a v erage likeli- ho od b y the posterior distribution. F or the GHR G, this is calculated by up dating the prior h yperparameters ( α, β ) in Eq. (3) with the p osterior h yp erparameters ( b α, b β ) in Eq. (5). W e restrict our consideration of change points to those within a sliding window of w net w orks, the last of which is at the “current” time τ . W e assume the c hange p oin t t c o ccurs b etw een some pair of snapshots, whic h we indicate using a 0.5 oﬀset. F or the no-c hange mo del, w e sa y that all net w orks within the windo w w ere drawn from a model with parameters ψ ( ∅ ) . F or the change m odel, w e let ψ (0) denote the model parameters for net w orks up to t c within our window and ψ (1) the parameters for net works after t c , but still within the windo w. Rewriting the change and no-change hypotheses in terms of such a shift in a parametric distribution ov er graphs at t c , w e ha v e H 0 : ψ ( ∅ ) = ψ ( ∅ ) (no c hange) H 1 : ψ (0) 6 = ψ (1) (c hange) . Using e ψ = { ˜ α r , ˜ β r } to denote the set of p osterior hy- p erparameters, the GHRG’s posterior Ba yes factor for a sequence of graphs { G τ − w +1 , ..., G τ } is Λ b t c = b t c − 0 . 5 X t = τ − w +1 log p ( G t | T τ , e ψ (0) b t c ) + τ X t = b t c +0 . 5 log p ( G t | T τ , e ψ (1) b t c ) − τ X t = τ − w +1 log p ( G t | T τ , e ψ ( ∅ ) ) , where e ψ ( ∅ ) is the set of p osterior hyperparameters p er- taining to the no-c hange hypothesis (no change p oint an ywhere in the window of w net works), while e ψ (0) b t c and e ψ (1) b t c are the hyperparameters for the net works up to and then follo wing the p oint b t c , respectively . Finally , the time at which the change occurs t c is itself an unknown v alue, and must be estimated. W e make the conserv ativ e choice, choosing b t c as the time p oin t b et w een a pair of consecutive net works that maximizes our test statistic Λ across the window. Letting g τ for a giv en window of w netw orks ending at τ b e that largest v alue g τ = max τ − w +1 < b t c < τ Λ b t c , (6) w e then say that the time of detection t d is the ﬁrst time p oin t τ at which g τ exceeds a threshold h (called a “stop- ping rule” in the c hange-p oin t detection literature): t d = min { τ : g τ > h } . (7) B. P arametric b o otstrapping The choice of threshold h , whic h g τ m ust exceed for a detection to occur, sets the metho d’s resulting false 5 p=0. 21 4 p=0.0 11 10 13 12 14 1 0 3 2 5 4 7 6 9 8 p=0. 37 24 25 23 27 20 21 17 16 19 18 22 28 26 29 15 F r agm en t F o r m F o r m F r agm en t p=0. 04 1 p=0. 37 11 10 13 12 14 1 0 3 2 5 4 7 6 9 8 p=0. 37 24 25 23 27 20 21 17 16 19 18 22 28 26 29 15 μ=0 .1 μ=1 .0 p=0.2 24 25 26 27 20 21 22 23 28 29 1 0 3 2 5 4 7 6 9 8 11 10 13 12 15 14 17 16 19 18 μ=0 .1 μ=0 .5 p=0. 04 1 p=0. 37 11 10 13 12 14 1 0 3 2 5 4 7 6 9 8 p=0. 37 24 25 23 27 20 21 17 16 19 18 22 28 26 29 15 Split Mer ge Mer ge Split FIG. 4: T axonomy of change points: formation versus fragmen tation and merging versus splitting. In each case, w e sho w both the blo c k structure of the adjacency matrix for t wo groups and the corresp onding GHRG mo del. F or our experiments, we switc h from one structure to the other b y changing the structural index µ . See text for more detail. p ositiv e rate and the distribution of g τ under the null mo del. Recen t results on mo del comparison for statis- tical mo dels of net works, and sp eciﬁcally the sto c hastic blo c k mo del, of which the GHRG is a particularly useful sp ecial case, suggest that for technical reasons the null distribution can deviate substan tially from the χ 2 distri- bution [25]. T o av oid a missp eciﬁed test, we estimate the n ull distribution numerically , via Monte Carlo samples from a parametric bo otstrap distribution [13] deﬁned by the GHR G for the no-change mo del. In this w ay , w e estimate the n ull distribution exactly , rather than via a p ossibly misspeciﬁed appro ximation. F or each netw ork we sample from the no-change GHRG mo del and calculate g τ from Eq. (6) to obtain its distri- bution under the hypothesis of no change (see Algorithm 1). Using the sampled distribution, the threshold h may then be c hosen so that p ( g τ > h ) = p fp is the desired false p ositiv e rate. In practice we do this by calculating a p -v alue for the test case by coun ting the proportion of lik eliho o d ratios in our null distribution that are higher than our test statistic g τ : p -v alue = |{ g τ } null > g τ | |{ g τ } null | . (8) Th us, if we ﬁnd a p -v alue b elow the c hosen threshold, we sa y a change is detected, and when the no-change mo del is correct, w e are incorrect no more than p fp of the time. ALGORITHM 1: Parametric b ootstrap sampling g τ from the null model distribution Input G = { G } τ τ − w +1 { g τ } null = ∅ . GHR G( T τ , e ψ τ ) = ﬁtGHRG( G ). for i = 1 to 1000 do sample w graphs, { G t } w t =1 ∼ GHRG( T τ , e ψ τ ). for ˜ t c = 1 to w do calculate e ψ ∅ , e ψ 0 , e ψ 1 according to Eq. (4) and Eq. (5). Λ ˜ t c = P ˜ t c − 1 t =1 log p  G t | T τ , e ψ (0) ˜ t c  + P w t = ˜ t c log p  G t | T τ , e ψ (1) ˜ t c  − P w t =1 log p  G t | T τ , e ψ ∅  end for g ( i ) τ = max ˜ t c Λ ˜ t c . { g τ } null = { g τ } null + g ( i ) τ . end for V. DETECT ABILITY OF CHANGE POINTS Before applying our metho d to empirical data with un- kno wn structure and unkno wn change p oin ts, we ﬁrst systematically characterize the detectability of diﬀerent t yp es of netw ork c hange points under con trolled circum- stances on synthetic data, generated using our GHR G mo del, with known structure and c hanges. The following change-point types constitute diﬃcult but realistic tests that co v er a broad v ariet y of empiri- 6 estimated change-point 0.0 0.2 0.4 0.6 0.8 1.0 S t r u c t u r a l I n d e x ( µ ) P(Change-point) M e r g e : ∆ µ = 0 . 4 0 1 2 3 4 5 6 7 time of detection 0.0 0.2 0.4 0.6 0.8 1.0 S t r u c t u r a l I n d e x ( µ ) / P ( D e t e c t i o n ) No Detection = 0.00 False Detection 0.03 S t r u c t u r a l I n d e x ( µ ) P(Detection) estimated change-point S p l i t : ∆ µ = - 0 . 4 0 1 2 3 4 5 6 7 time of detection No Detection = 0.00 False Detection 0.06 estimated change-point F r a g : ∆ µ = 0 . 9 0 1 2 3 4 5 6 7 time of detection No Detection = 0.01 False Detection 0.05 estimated change-point F o r m : ∆ µ = - 0 . 9 0 1 2 3 4 5 6 7 time of detection No Detection = 0.02 False Detection 0.03 FIG. 5: Results for merge, split, fragment and form change p oin ts, where the size of the change is parameterized by the structural index µ (orange line) and a discon tinuit y o ccurs at the true p oin t of change. The resulting distributions of ( top r ow ) estimated c hange p oin ts times b t c (green bars), and ( b ottom r ow ) detection times t d (blue bars), for eac h change t yp e. The 0.5 time oﬀset indicates that the estimated c hange p oin t time b t c o ccurs betw een a pair of consecutive net work snapshots, while t d giv es the time of the last net w ork in the windo w when a detection ev ent occurs. F alse detections (false p ositiv es) o ccur when t d < t c ; no detection (false negativ es) o ccur when τ − w + 1 > t c without a detection. cally observed large-scale changes to netw ork structure. F or our n umerical tests, we c ho ose net work with N = 30 v ertices and a sparse and constan t expected num ber of connections (marginal link probability of 0 . 2). These small net works provide a more diﬃcult test case than larger net w orks because the limited amount of data mak e c hange p oints harder to detect since it becomes harder to characterize the norms. F urthermore, we deﬁne four general t yp es of change points: splitting , when one large comm unity divides in tw o; mer ging , when t w o communi- ties com bine (the time-reversal of splitting); formation , when one of tw o groups of vertices add edges to make a comm unity; and fr agmentation , when one of tw o groups loses all its edges (the time-rev ersal of formation). Deﬁning the structural index µ = p out /( p in + p out ) pro vides a single parameter that controls the switc hing b et w een these distinct states. F or the merge/split c hange p oin ts, w e c ho ose the merged state to b e µ = 0 . 5, whic h pro duces a single communit y in which every edge occurs with the same probabilit y p in = p out . In the split state, the net w ork is comprised of tw o distinct comm unities. F or formation/fragmen tation c hange points, we use the same tw o-communit y model, but now ﬁx the link proba- bilit y within one communit y and use µ to describ e rela- tionship b et w een the p in and p out of the second comm u- nit y . W e no w summarize these c hange p oin ts with resp ect to µ : merge µ 6 = 0 . 5 → µ = 0 . 5 fragmen t µ < 1 → µ = 1 split µ = 0 . 5 → µ 6 = 0 . 5 form µ = 1 → µ < 1 . All tests used a w = 4 window size and a 0.05 false- p ositiv e rate. F or comparison, we used three simpler v er- sions of our probabilistic framework in which we replace the generative mo del with a univ ariate Gaussian and con- v ert the netw ork sequence into a time series of scalar v al- ues (mean degree, mean geo desic distance, mean lo cal clustering coeﬃcient). F or eac h of the change t yp es, Figure 5 shows tw o diﬀer- en t distributions (ov er 100 runs): the estimated change p oin t b t c and the time of detection t d (see Figure 1). W e ﬁnd that the estimated change p oin ts tend to either b e correct or sligh tly early . The time of detection (the end of the sliding window when a c hange is detected) quan- tiﬁes how many netw orks after the change we must see b efore w e identify the change p oint. W e ﬁnd that the merge and fragment c hanges are detected quickly , while their change points are often estimated early . In con- trast, the split and formation changes are detected later, while the estimation of the c hange p oin ts themselv es is more accurate. In Figure 6 we compare the false p ositive and false 7 FIG. 6: F alse negative ( b ottom ) error rates for our method and the simple methods on the four diﬀerent c hange types for diﬀerent magnitudes of change(∆ µ ). negativ e error rates among all four methods. On false p ositiv es, all metho ds are close to 0.05, which matc hes the desired false alarm rate. Ho wev er, the false negativ e rates diﬀer widely , with the simple metho ds p erforming terribly in nearly every case, ev en when the size of the c hange is large. In contrast, our metho d p erforms well across all four tests, except when the size of the change is very small, e.g., when ∆ µ ≈ 0, which represents the hardest cases, where small-sample ﬂuctuations obscure m uch of the actual change. In the “split” and “merge” exp erimen ts we notice that there seems to b e a thresh- old around whic h the detectability rapidly c hanges. In these exp erimen ts, for changes of small magnitude, the net works are close to Erd˝ os-R´ en yi random graph b oth b efore and after the c hange p oin t. It is lik ely then, that this is related to the detectability phase transition kno wn to occur in the comm unit y detection problem [10]. VI. CHANGE POINTS IN REAL NETWORKS W e now apply these approaches 1 to detect changes in t wo high-resolution evolving netw orks, the MIT Reality Mining proximit y net work 2 [12] and the Enron email net work 3 [16], for which a set of external “sho c ks” exists that serv e as targets for c hange-p oin t detection. Both data sets are ev olving net works of human social interac- tions, but represent diﬀeren t in teraction t yp es. W e quantify the performance in terms of precision and recall as a function of the detection dela y s b etw een es- 1 Code for our metho d is av ailable at http://tinyurl.com/ letopeel/code.html 2 realitycommons.media.mit.edu/realitymining.html 3 www.cs.cmu.edu/ ~ enron/ 0 1 2 3 4 5 Delay (+/-) 0.0 0.2 0.4 0.6 0.8 1.0 Recall (MIT) GHRG+GLR Clust Degree Geodesic 0 1 2 3 4 5 Delay (+/-) 0.0 0.2 0.4 0.6 0.8 1.0 Precision (MIT) GHRG+GLR Clust Degree Geodesic 0 1 2 3 4 5 Delay (+/-) 0.0 0.2 0.4 0.6 0.8 1.0 Recall (Enron) GHRG+GLR Clust Degree Geodesic 0 1 2 3 4 5 Delay (+/-) 0.0 0.2 0.4 0.6 0.8 1.0 Precision (Enron) GHRG+GLR Clust Degree Geodesic FIG. 7: Precision and recall of our metho d and the three baseline metho ds at identifying known external ev ents as a function of delay ( s ). timated change p oin ts { b t c } and sc heduled even ts { t c } , i.e., Precision(s) = 1 n c X i δ  inf j    b t ( i ) c − t ( j ) c    ≤ s  (9) Recall(s) = 1 n a X j δ  inf i    b t ( i ) c − t ( j ) c    ≤ s  , (10) where δ ( x ) is a delta function that equals 1 if x is true and 0 otherwise, and n c and n a are the num b er of estimated c hange p oints and actual ev en ts resp ectively . The pre- cision is then the prop ortion of estimated change p oin ts that o ccur within a given delay s of a known ev ent. Sim- ilarly , recall is the prop ortion of known ev ents that o ccur within a delay s of an estimated c hange point. A. So cial pro ximity net w ork The MIT netw ork is comprised of proximit y data for 97 facult y and graduate studen ts, recorded con tinuously via Blueto oth scans from their mobile phone o ver 35 w eeks [12]. F rom the raw scan data, w e extracted a sequence of weekly netw orks, in which an edge denotes ph ysical pro ximity to one of the 97 sub jects at some p oint that week. Asso ciated with the dataset are 16 known ex- ternal ev ents including public holida ys, spring and winter breaks, exam p erio ds, etc [11]. F or each detection metho d, the GHR G and the three simple methods, w e used a window size w = 4, the same as in the synthetic exp erimen ts. The results for each metho d are sho wn in Figure 7 (top). W e see that cluster- ing co eﬃcient and mean degree slightly outp erform our approac h in terms of precision, but our approac h is muc h b etter than all the baseline approac hes at recall. Closer examination of the change-points detected with each of 8 the simple methods rev eals that they exhibit lo w sensitiv- it y relative to the known external ev ents [11]. In partic- ular, they do pic k out sev eral real change p oin ts, includ- ing Win ter break (mean geo desic distance), the b egin- ning (mean degree) and end (clustering co eﬃcien t) of the indep enden t activities perio d. How ever, they also miss the ma jorit y of other even ts. In particular mean degree and clustering coeﬃcient only detect a total of 2 change p oin ts, whic h explains the high precision scores. F ur- thermore there is little consistency across these metho ds (with the exception of the beginning of Sp onsor week). Th us, these techniques seem b oth unreliable and incon- sisten t. In con trast, the GHR G metho d iden tiﬁes nearly all of the kno wn external ev ents, along with a few additional c hange points, e.g., one week b efore and one week af- ter Sp onsor w eek. This fact agrees well with the so- cial dynamics of Sp onsor w eek, an even t inv olving 75 of the sub jects and whic h typically shifts w ork schedules dramatically as they seek to meet deadlines and pro ject goals [11]. Additionally , the GHR G method ﬁnds more change p oin ts in the F all semester than in the Spring. Examin- ing the dendrograms themselv es, we ﬁnd that the changes in the inferred structures in the F all are muc h more dra- matic than in the Spring; see Figure 8 (top). This agrees with the fact that 35 of the sub jects were new students in the F all semester and thus still establishing their so- cial patterns [11]. By the Spring semester, these pat- terns had largely stabilized, and the large p erturbation of Sponsor w eek w as absen t. Overall, the GHR G both re- co vers known even ts, highligh ts additional c hanges, and pro vides an interpretable basis for discov ering new pat- terns within this evolving netw ork. B. Enron email netw ork The Enron net w ork is comprised of emails among 151 users, mostly senior managemen t of the Enron energy compan y . W e iden tiﬁed a total of 25 even ts during the 3 year timeline. These data were made public by the F ederal Energy Regulatory Commission during its inv es- tigation after the compan y’s collapse. Using a cleaned v ersion of these data [16], we applied b oth the GHR G and simple methods to weekly snapshots from Ma y 1999 to June 2002. Because this netw ork sequence is very long, w e ex- amined the impact of v arying windo w size, c ho osing w = { 4 , 8 , 16 } . Results across windo w sizes were highly consisten t, although larger v alues produced additional c hange p oints. This suggests that window size may op- erate like a temp oral resolution parameter, with longer windo ws giving more resolution. The results for each metho d are shown in Figure 7 (b ottom) using a windo w size w = 16. The results for eac h metho d are sho wn in Figure 8 (b ottom). As with the MIT netw ork, we again ﬁnd that the simple metho ds p erform p o orly; the GHRG method p erforms muc h better than all of them in both precision and recall. Examining the GHRG change p oin ts and the list of external ev en ts, w e ﬁnd that the identiﬁed change p oin ts correlate well with key meetings and even ts such share price ﬂuctuations. Examining the inferred dendro- grams, w e ﬁnd that a particularly large structural change o ccurred around the launc h of Enron online (No v 1999). Before its launch, the net work’s structure is very sparse, while after launc h, the n umber of levels in the GHRG mo del increase dramatically reﬂecting the formation of man y communities. W e als o see that the structure and densit y of the net works increase o ver the perio d starting immediately after the Californian blac k outs through un- til Stephen Coop er tak es ov er as CEO and the collapse of the company is imminent. VI I. DISCUSSION When analyzing a sequence of time evolving netw orks, a cen tral goal is to understand how the netw ork’s struc- ture has c hanged ov er time, and ho w it might change in the future. Change-point detection provides a principled approac h to this problem, b y decomp osing a p otentially non-stationary sequence of net w orks in to subsequences of distinct but probabilistically stationary structural pat- terns. Here, we hav e presented the ﬁrst c hange-p oin t detection method for evolving net w orks that utilizes gen- erativ e netw ork mo dels and statistical hypothesis tests. By formalizing this problem within a probabilistic frame- w ork, w e dev elop ed a statistically principled metho d that can detect, in an online fashion, if, when and how such c hange p oin ts occur in the large-scale patterns of in terac- tions. Under our framework, c hange p oints o ccur when the shap e of an estimated probability distribution ov er net works changes signiﬁcantly . Not all such c hange p oints are equally easy to detect. Using syn thetic data with kno wn structure and known c hange p oin ts, w e observed that only changes of a large enough magnitude could be detected reliably . F urther- more, w e found that changes asso ciated with t wo commu- nities merging or with one of sev eral comm unities losing its in ternal connections (“fragmen tation”) were more dif- ﬁcult to accurately detect than those associated with one comm unity splitting in tw o or with many singletons con- necting to form a new communit y (“formation”). This asymmetry in the detectabilit y of diﬀeren t types of net- w ork c hanges b egs the question of whether more sophis- ticated techniques can eliminate these diﬀerences, and whether adding auxiliary information lik e edge weigh ts [1] or vertex attributes [22] makes this problem easier or harder. That b eing said, c hange-p oin t metho ds based on net- w ork measures like the mean degree, clustering co eﬃ- cien t, or mean geo desic path length p erformed p oorly , yielding high false negative rates even for large struc- tural changes (Fig. 6). This p o or p erformance is likely 9 FIG. 8: Change points detected in the MIT proximit y ( top ) and Enron email ( b ottom ) net works. Each ﬁgure showsthe c hange p oin ts detected (green bars) using our metho d follow ed by the three simple methods (along with the v alues of the relev ant summary statistic). Known even ts are indicated by the colored vertical bars spanning all methods. 10 1 0 1 1 0 2 1 0 3 Nodes 1 0 1 1 0 2 1 0 3 1 0 4 Time (s) FIG. 9: Time tak en to infer GHRG for a windo w of 4 net works of v arying size. Assumes sparse graphs, i.e., O(N) edges. the result of netw ork measures discarding m uch of the sp eciﬁc information that generative mo dels utilize. Ap- plied to t wo high-resolution ev olving so cial net w orks, our metho d provided very go od results, recov ering the tim- ing, from netw ork data alone, of many more known ex- ternal “sho ck” ev ents than the net w ork-measure metho ds (Fig. 7). The computational burden of our current implemen ta- tion lies in the MCMC pro cedure used to infer the hier- arc hical structure (see Figure 9 for inference times). The recen t work of Ref. [7] prop oses a greedy approac h to in- ferring a hierarch y that could naturally be used within our change-point detection as a more scalable alternative. Ho wev er, it is imp ortan t to understand what the trade- oﬀ is betw een accuracy and scalabilit y , and w e b eliev e that this w ould be an interesting and useful direction for future w ork. Although the GHRG model yielded go od results, in principle, an y generative mo del could b e used in its place, e.g., the stochastic blo c k model [1, 15, 21] or the Kro- nec ker pro duct graph model [17]. Similarly , the recen t w ork in graph h yp othesis testing [20] could p otentially b e adapted to the c hange-p oin t detection problem. Tw o k ey features of the GHRG mo del for change-point detec- tion, how ever, are its interpretabilit y and the wa y it nat- urally adapts its dendrogram structure to ﬁt the net work, adding or remo ving levels in the hierarch y , as the netw ork ev olves. Combined with the strong results on syn thetic and real-world data, this approach to c hange-p oint detec- tion promises to hav e broad application, p erhaps particu- larly in so cial net w orks, where interpretabilit y pro vides a crucial bridge to testing h ypotheses ab out the underlying so cial dynamics driving netw ork evolution. VI II. ACKNO WLEDGEMENTS W e thank Dan Larremore for helpful con versations, and ackno wledge supp ort from Grant #F A9550-12-1- 0432 from the U.S. Air F orce Oﬃce of Scien tiﬁc Researc h (AF OSR) and the Defense Adv anced Research Pro jects Agency (D ARP A). [1] C. Aicher, A. Z. Jacobs, and A. Clauset. Learning latent blo c k structure in weigh ted netw orks. Journal of Complex Networks , 2014. [2] M. Aitkin. Posterior ba yes factors (with discussion). Journal of the R oyal Statistic al So ciety , 53:111–142, 1991. [3] M. Aitkin. The calibration of p-v alues, posterior Bay es factors and the AIC from the p osterior distribution of the likelihoo d. Statistics and Computing , 7(4):253–261, 1997. [4] L. Ak oglu and C. F aloutsos. Ev ent detection in time series of mobile comm unication graphs. In 27th Army Scienc e Confer enc e , Orlando, FL, USA, 2010. [5] M. Basseville and I. V. Nikiforov. Dete ction of A brupt Changes: The ory and Applic ation . Pren tice-Hall, Inc., Upp er Saddle River, NJ, USA, 1993. [6] M. Berlingerio, M. Coscia, F. Giannotti, A. Monreale, and D. Pedresc hi. Evolving net w orks: Eras and turning p oin ts. Intel ligent Data A nalysis , 17(1):27–48, 2013. [7] C. Blundell and Y. W. T eh. Bay esian hierarc hical com- m unity disco very . In A dvanc es in Neur al Information Pr oc essing Systems 26 , pages 1601–1609. 2013. [8] D. Bry ant. A classiﬁcation of consensus methods for phy- logenetics. In Bio c onsensus , volume 61 of DIMACS Ser. Discr ete Math. The or et. Comput. Sci. , pages 163–183. Amer. Math. So c., Providence, RI, 2003. [9] A. Clauset, C. Mo ore, and M. E. J. Newman. Structural inference of hierarchies in netw orks. In E. Airoldi, D. M. Blei, S. E. Fien b erg, A. Goldenberg, E. P . Xing, and A. X. Zheng, editors, Statistic al Network A nalysis: Mo d- els, Issues, and New Dir e ctions , volume 4503 of L e ctur e Notes in Computer Scienc e , pages 1–13. Springer Berlin Heidelb erg, 2007. [10] A. Decelle, F. Krzak ala, C. Moore, and L. Zdeborov´ a. In- ference and phase transitions in the detection of mo dules in sparse netw orks. Phys. R ev. L ett. , 107:065701, Aug 2011. [11] N. Eagle. Machine p er c eption and le arning of c omplex so cial systems . Departmen t of Media Arts and Sciences, Massac husetts Institute of T echnology , 2005. [12] N. Eagle and A. Pen tland. Reality mining: Sensing complex so cial systems. Personal Ubiquitous Comput. , 10(4):255–268, Mar. 2006. [13] B. Efron and R. J. Tibshirani. An intr oduction to the b ootstr ap . Chapman and Hall, 1993. [14] S. Hirose, K. Y amanishi, T. Nak ata, and R. F ujimaki. Net work anomaly detection based on eigen equation com- pression. In Pr o c ee dings of the 15th ACM SIGKDD Inter- national Confer ence on Know le dge Disc overy and Data Mining , KDD ’09, pages 1185–1194, New Y ork, NY, USA, 2009. ACM. 11 [15] P . W. Holland, K. B. Laskey , and S. Leinhardt. Sto chas- tic blockmodels: First steps. So cial Networks , 5(2):109– 137, 1983. [16] B. Klimt and Y. Y ang. Introducing the Enron corpus. In First Confer enc e on Email and Anti-Sp am (CEAS) , 2004. [17] J. Lesko vec, D. Chakrabarti, J. Klein berg, and C. F alout- sos. Realistic, mathematically tractable graph genera- tion and evolution, using Kroneck er multiplication. In Pr oc e edings of the 9th Eur op ean Confer enc e on Princi- ples and Pr actic e of Know le dge Disc overy in Datab ases , PKDD’05, pages 133–145, Berlin, Heidelb erg, 2005. Springer-V erlag. [18] C. L´ evy-Leduc and F. Roueﬀ. Detection and lo calization of change-points in high-dimensional netw ork traﬃc data. The A nnals of Applie d Statistics , pages 637–662, 2009. [19] I. McCulloh and K. M. Carley . Detecting change in longi- tudinal social net w orks. Journal of So cial Structur e , 12, 2011. [20] S. Moreno and J. Neville. Net w ork h ypothesis testing using mixed Kroneck er pro duct graph mo dels. In 13th IEEE International Confer enc e on Data Mining , 2013. [21] K. No wic ki and T. A. B. Snijders. Estimation and Pre- diction for Sto chastic Blockstructures. Journal of the Americ an Statistic al Asso ciation , 96(455), 2001. [22] L. P eel. T op ological feature based classiﬁcation. In Pr o- c ee dings of the 14th International Confer enc e on Infor- mation F usion (FUSION) , pages 1–8. IEEE, 2011. [23] C. E. Priebe, J. M. Conroy , D. J. Marc hette, and Y. Park. Scan statistics on Enron graphs. Comput. Math. Or gan. The ory , 11(3):229–247, Oct. 2005. [24] J. Sun, C. F aloutsos, S. Papadimitriou, and P . S. Y u. Graphscope: P arameter-free mining of large time- ev olving graphs. In Pro c ee dings of the 13th A CM SIGKDD International Confer enc e on Know le dge Dis- c overy and Data Mining , KDD ’07, pages 687–696, New Y ork, NY, USA, 2007. A CM. [25] X. Y an, J. E. Jensen, F. Krzak ala, C. Mo ore, C. R. Shal- izi, L. Zdeb oro v´ a, P . Zhang, and Y. Zhu. Mo del selec- tion for degree-corrected blo c k models. arXiv , 1207.3994, 2012.

Detecting change points in the large-scale structure of evolving networks

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment