Learning Latent Block Structure in Weighted Networks
Community detection is an important task in network analysis, in which we aim to learn a network partition that groups together vertices with similar community-level connectivity patterns. By finding such groups of vertices with similar structural roles, we extract a compact representation of the network's large-scale structure, which can facilitate its scientific interpretation and the prediction of unknown or future interactions. Popular approaches, including the stochastic block model, assume edges are unweighted, which limits their utility by discarding potentially useful information. We introduce the weighted stochastic block model (WSBM), which generalizes the stochastic block model to networks with edge weights drawn from any exponential family distribution. This model learns from both the presence and weight of edges, allowing it to discover structure that would otherwise be hidden when weights are discarded or thresholded. We describe a Bayesian variational algorithm for efficiently approximating this model's posterior distribution over latent block structures. We then evaluate the WSBM's performance on both edge-existence and edge-weight prediction tasks for a set of real-world weighted networks. In all cases, the WSBM performs as well or better than the best alternatives on these tasks.
Authors: Christopher Aicher, Abigail Z. Jacobs, Aaron Clauset
Christopher Aicher, Department of Applied Mathematics, University of Colorado, Boulder, CO 80309 (christopher.aicher@colorado.edu)
Abigail Z. Jacobs, Department of Computer Science, University of Colorado, Boulder, CO 80309
Aaron Clauset, Department of Computer Science, University of Colorado, Boulder, CO 80309; BioFrontiers Institute, University of Colorado, Boulder, CO 80303; Santa Fe Institute, Santa Fe, NM 87501

Keywords: community detection, weighted relational data, block models, exponential family, variational Bayes.
1 Introduction

Networks are an increasingly important form of structured data consisting of interactions between pairs of individuals in large social and biological data sets. Unlike attribute data, where each observation is associated with an individual, network data are represented by graphs, where individuals are vertices and interactions are edges. Because vertices are pairwise related, network data violate traditional assumptions of attribute data, such as independence. This intrinsic difference in structure prompts the development of new tools for handling network data.

In social and biological networks, vertices often play distinct structural roles in generating the network's large-scale structure. To identify such latent structural roles, we aim to identify a network partition that groups together vertices with similar group-level connectivity patterns. We call these groups "communities," and their inference produces a compact description of the large-scale structure of a network. (We note that this definition of a "community" is more general than the assortative-only definition that is commonly used.)

Figure 1: Examples of structure that can be learned using the SBM. The first row shows the abstract connections between four groups (blue, red, green, and purple). The second row shows the 'block' structure found in the adjacency matrix after sorting by group membership; black corresponds to edges and white corresponds to non-edges. (a) Assortative structure: edges mainly exist within groups. (b) Disassortative structure: edges mainly exist between distinct groups. (c) Core-periphery structure: the 'core' (blue) connects mainly with itself and the 'periphery' (red, green, and purple), while the 'periphery' mainly connects with the 'core'. (d) Ordered structure: blue connects to red, red connects to green, and green connects to purple.
This compact large-scale description itself has many potential uses, including dividing a large heterogeneous system into several smaller and more homogeneous parts that may be studied semi-independently, and predicting unknown or future patterns of interactions. By grouping vertices by these roles, community detection in networks is similar to clustering in vector spaces, and many approaches have been proposed [13].

The stochastic block model (SBM) [17, 34] is a popular generative model for learning community structure in unweighted networks. In its classic form, the SBM is a probabilistic model of pairwise interactions among $n$ vertices. Each vertex $i$ belongs to one of $K$ latent groups or "blocks" denoted by $z_i$, and each edge $A_{ij}$ exists with a probability $\theta_{z_i z_j}$ that depends only on the group memberships of the connecting vertices. Vertices in the same block are stochastically equivalent, indicating their equivalent roles in generating the network's structure. The SBM is fully specified by a vector $z$ denoting the group membership of each vertex and a $K \times K$ matrix $\theta$ of edge-bundle probabilities, where $\theta_{kk'}$ gives the probability that a vertex in group $k$ connects to a vertex in group $k'$.

The SBM is popular in part because it can generate a wide variety of large-scale patterns of network connectivity depending on the choice of $\theta$ (Fig. 1(a-d)). For example, if the diagonal elements of $\theta$ are greater than its off-diagonal elements, the block structure is assortative, with communities exhibiting greater edge densities within than between them (Fig. 1(a)), a common pattern in social networks [21]. Reversing the pattern in $\theta$ generates disassortative structure (Fig. 1(b)), which is often found in language and ecological networks [22]. Other choices of $\theta$ can generate hierarchical, multi-partite, or core-periphery patterns [9, 26].
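To make the generative description above concrete, here is a minimal sketch (not part of the original paper) of sampling a directed, unweighted network from the classic SBM; it assumes NumPy, and the helper name `sample_sbm` is our own:

```python
import numpy as np

def sample_sbm(z, theta, rng=None):
    """Sample a directed, unweighted adjacency matrix from the classic SBM.

    z     : length-n array of group labels in {0, ..., K-1}
    theta : K x K matrix; theta[k, l] is the probability that a vertex
            in group k links to a vertex in group l
    """
    rng = np.random.default_rng(rng)
    n = len(z)
    # Edge (i, j) exists independently with probability theta[z_i, z_j].
    probs = theta[np.ix_(z, z)]            # n x n matrix of edge probabilities
    A = (rng.random((n, n)) < probs).astype(int)
    np.fill_diagonal(A, 0)                 # no self-loops
    return A

# Assortative example (Fig. 1(a)): dense within groups, sparse between.
z = np.repeat([0, 1], 20)
theta = np.array([[0.80, 0.05],
                  [0.05, 0.80]])
A = sample_sbm(z, theta, rng=0)
```

Sorting the rows and columns of `A` by `z` would reveal the dense diagonal blocks of Fig. 1(a).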
The SBM has also been generalized for count-valued data, degree correction [18], bipartite structure [19], and categorical values [15]. In addition to this flexibility, the SBM's probabilistic structure provides a principled approach to quantifying uncertainty in group memberships, an attractive feature in unsupervised network analysis. This structure has led to theoretical guarantees, including consistency of the SBM estimators [7] and the identifiability and consistency of latent block models [3, 4].

However, each of these models assumes an unweighted network, where edge presence or absence is represented as a binary variable (or perhaps a count-valued variable), while most real-world networks have weights, e.g., interaction frequency, volume, or character. Such information is typically discarded via thresholding before analysis, which can obscure or distort latent structure [33]. To illustrate this loss of information from thresholding, consider a toy network of four equally sized groups labeled 1-4 (see Fig. 2), where each edge $(i, j)$ is assigned a weight equal to the smaller of the endpoints' group labels, plus a small amount of noise.

Figure 2: (a) An example of a weighted network where thresholding will never succeed. (b) A plot of the normalized mutual information (NMI) between the true community structure and the inferred SBM community structure after thresholding at various threshold values (averaged over 100 trials). Examples of community structure found by thresholding are shown above the graph (different colors represent different communities). As the NMI is less than 1 for all threshold values, the SBM after thresholding never infers the true community structure shown in (a).
Edges between groups are thus assigned weights near 1, 2, or 3, while those within a group are assigned weights near 1-4. This model is obviously unrealistic, but it serves to illustrate the common consequences of applying a global threshold to edge-weighted networks.

To apply the SBM to this simple network, we must convert it into an unweighted network by discarding edges with weights less than some threshold. To illustrate the results of this action, we consider all possible thresholds and compute the average normalized mutual information (NMI) between the best community structure found using the SBM and the true structure (Fig. 2). No matter what threshold we choose, edges are divided into at most three groups: those with weight above, at, or below the threshold. The SBM can thus recover a maximum of three groups, rather than the four planted in this network, and the threshold determines which three groups it finds. No threshold yields the correct inference here, because thresholding discards edge-weight information.

Instead of thresholding, we could use more complex methods, such as multiple thresholds or a binning scheme, to convert a weighted network into an unweighted or count-valued network of some sort. These methods would perform better than applying a single threshold, at the cost of additional complexity in specifying multiple threshold or bin values. Regardless of the method, these approaches still discard potentially useful edge-weight information. To exploit the maximal amount of information in the original data in recovering the true hidden structure, we should prefer to model the edge weights directly.

In this paper, we introduce the weighted stochastic block model (WSBM), a generalization of the SBM that can learn from both the presence and weight of edges.
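The thresholding failure can be verified directly. The following sketch (ours, not the paper's; noise omitted for clarity) builds the toy weight rule and counts how many distinct vertex connectivity patterns, and hence recoverable groups, survive each threshold; no threshold separates all four planted groups:

```python
import numpy as np

labels = np.repeat([1, 2, 3, 4], 5)        # four planted groups of 5 vertices
# Edge weight = smaller of the two endpoint group labels (noise omitted);
# the diagonal keeps each vertex's own label so that stochastically
# equivalent vertices have identical rows.
W = np.minimum.outer(labels, labels)

def recoverable_groups(W, t):
    """Number of distinct row patterns after thresholding at t."""
    A = (W >= t).astype(int)
    return len({tuple(row) for row in A})

for t in (1, 2, 3, 4):
    print(t, recoverable_groups(W, t))     # never reaches the 4 planted groups
```

Without noise the binarized network distinguishes at most two classes of vertices here; with noise, at most three, as discussed above, but never the planted four.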
The weighted stochastic block model provides a natural solution to this problem by generalizing the SBM to learn from both types of edge information. Specifically, the WSBM models each weighted edge $A_{ij}$ as a draw from a parametric exponential family distribution, whose parameters depend only on the group memberships of the connecting vertices $i$ and $j$. It includes as special cases most standard distributional forms, e.g., the normal, the exponential, and their generalizations, and it enables the direct use of weighted edges in recovering latent group or block structure.

This paper generalizes and extends our previous work [1]. We first describe the form of the WSBM, which combines edge-existence and edge-weight information. We then derive a variational Bayes algorithm for efficiently learning WSBM parameters from data. Applying this algorithm to a small real-world weighted network, we show that the SBM and WSBM can learn distinct latent structures as a result of observing or ignoring edge weights. Finally, we compare the performance of the WSBM to alternative methods on two edge prediction tasks, using a set of real-world networks. In all cases, the WSBM performs as well as the alternatives on edge-existence prediction, and it outperforms all alternatives on edge-weight prediction. This model thus enables the discovery of latent group structures in a wider range of networks than was previously possible.

2 Weighted Stochastic Block Model

We begin by reviewing the SBM and exponential families, and then describe a natural generalization of the SBM to weighted networks. In what follows, we consider the general case of directed graphs; undirected graphs are a special case of this model.
In the SBM, the network's adjacency matrix $A$ contains binary values representing edge existences, i.e., $A_{ij} \in \{0, 1\}$; the integer $K$ denotes a fixed number of latent groups; and the vector $z$ contains the group label of each vertex, $z_i \in \{1, \dots, K\}$. The number of latent groups $K$ controls the model's complexity and may be chosen in a variety of ways; we defer a discussion of this matter until Section 3.3. Each possible group assignment vector $z$ represents a different partition of the vertices into $K$ groups, and each pair of groups $(k, k')$ defines a "bundle" of edges that run between them. The SBM assigns an edge-existence parameter $\theta_{kk'}$ to each edge bundle, which we represent collectively by the $K \times K$ matrix $\theta$. The existence probability of an edge $A_{ij}$ is given by the parameter $\theta_{z_i z_j}$, which depends only on the group memberships of vertices $i$ and $j$. Assuming that each edge existence $A_{ij}$ is conditionally independent given $z$ and $\theta$, the SBM's likelihood function is

  \Pr(A \mid z, \theta) = \prod_{ij} \theta_{z_i z_j}^{A_{ij}} \left( 1 - \theta_{z_i z_j} \right)^{1 - A_{ij}},   (1)

which we may rewrite as

  \Pr(A \mid z, \theta) = \prod_{ij} \exp\left( A_{ij} \cdot \log \frac{\theta_{z_i z_j}}{1 - \theta_{z_i z_j}} + \log\left( 1 - \theta_{z_i z_j} \right) \right).

Figure 3: Graphical model for the WSBM. Each weighted edge $A_{ij}$ (plate) is distributed according to the appropriate edge parameter $\theta_{z_i z_j}$ for each observed interaction $(i, j)$. In our variational Bayes inference scheme, the WSBM's latent parameters $z, \theta$ are themselves modeled as random variables distributed according to $\mu, \tau$, respectively. We highlight that the arrow from $z$ to $\theta_{z_i z_j}$ hides the complex relational structure between each $z_i$.
Thus, the likelihood has the form of an exponential family,

  \Pr(A \mid z, \theta) \propto \exp\left( \sum_{ij} T(A_{ij}) \cdot \eta(\theta_{z_i z_j}) \right),   (2)

where $T(x) = (x, 1)$ is the vector-valued function of sufficient statistics of the Bernoulli random variable and $\eta(\theta) = (\log[\theta/(1-\theta)], \log(1-\theta))$ is the vector-valued function of natural parameters. Appendix B provides further details about exponential families.

This choice of functions $(T, \eta)$ produces binary-valued edge weights. By choosing an appropriate but different pair of functions $(T, \eta)$, defined on some domain $\mathcal{X}$ and $\Theta$ respectively, we may specify a stochastic block model whose weights are drawn from an exponential family distribution over $\mathcal{X}$. As in the SBM, this weighted stochastic block model (WSBM) is defined by a vector $z$ and a matrix $\theta$, but now each $\theta_{z_i z_j}$ specifies the parameters governing the weight distribution of the $(z_i, z_j)$ edge bundle. Figure 3 visualizes the dependencies in the WSBM's likelihood function as a graphical model. The generative process of creating a weighted network from the WSBM consists of the following steps.

• For each vertex $i$, assign a group membership $z_i$.
• For each pair of groups $(k, k')$, assign an edge-bundle parameter $\theta_{kk'} \in \Theta$.
• For each edge $(i, j)$, draw $A_{ij} \in \mathcal{X}$ from the exponential family $(T, \eta)$ parametrized by $\theta_{z_i z_j}$.

The community structure of the WSBM retains the stochastic equivalence principle of the classic SBM, in which all vertices in a group maintain the same probabilistic connectivity to the rest of the network. For example, if the edge weights are real-valued, $\mathcal{X} = \mathbb{R}$, then we may choose to model the edge weights with the normal distribution, which has sufficient statistics $T = (x, x^2, 1)$ and natural parameters $\eta = (\mu/\sigma^2, \, -1/(2\sigma^2), \, -\mu^2/(2\sigma^2))$.
Instead of edge-existence probabilities, each edge bundle $(z_i, z_j)$ is now parameterized by a mean and variance, $\theta_{z_i z_j} = (\mu_{z_i z_j}, \sigma^2_{z_i z_j})$. In this case, the likelihood function is

  \Pr(A \mid z, \mu, \sigma^2) = \prod_{ij} \mathcal{N}\left( A_{ij} \mid \mu_{z_i z_j}, \sigma^2_{z_i z_j} \right)
                              = \prod_{ij} \exp\left( A_{ij} \cdot \frac{\mu_{z_i z_j}}{\sigma^2_{z_i z_j}} - A_{ij}^2 \cdot \frac{1}{2 \sigma^2_{z_i z_j}} - \frac{\mu^2_{z_i z_j}}{2 \sigma^2_{z_i z_j}} \right).   (3)

That is, this particular WSBM uses a normal distribution instead of a Bernoulli distribution to model the values observed in an edge bundle. We emphasize that the choice of the normal distribution is merely illustrative.

This construction produces complete graphs, in which every pair of vertices is connected by an edge with some real-valued weight. For a complete network, this formulation may be entirely sufficient. However, most real-world networks are sparse, with only $O(n)$ pairs having a connection that may have a weight, and a dense model like this one cannot be applied directly. We now describe how sparsity can be naturally incorporated within our model, which also produces more scalable inference algorithms.

2.1 Sparse Weighted Graphs

A key insight for modeling edge-weighted sparse networks lies in clarifying the meaning of zeros in a weighted adjacency matrix. Typically, a value $A_{ij} = 0$ may represent one of three things: (i) the absence of an edge, (ii) an edge that exists but has weight zero, or (iii) missing data, i.e., an unobserved interaction. In the former two cases, we do in fact observe the interaction, while in the latter, we do not. For observed interactions, we call an observed non-interaction a "non-edge," and we let $A_{ij} = 0$ denote the presence of an edge with weight zero. In many empirical networks, these distinct types of interactions may have been confounded, e.g., non-edges, edges with zero weight, and unobserved interactions may all be assigned a value $A_{ij} = 0$.
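The normal-weight WSBM of Eq. (3) can be sketched in a few lines (our illustration, assuming NumPy; `sample_normal_wsbm` and `log_likelihood` are hypothetical helper names, and self-pairs are included for simplicity, matching the complete-graph setting):

```python
import numpy as np

def sample_normal_wsbm(z, mu, sigma2, rng=None):
    """Draw a complete weighted network whose edge weight A_ij is normal
    with mean mu[z_i, z_j] and variance sigma2[z_i, z_j]."""
    rng = np.random.default_rng(rng)
    means = mu[np.ix_(z, z)]
    sds = np.sqrt(sigma2[np.ix_(z, z)])
    return rng.normal(means, sds)

def log_likelihood(A, z, mu, sigma2):
    """Normal WSBM log-likelihood of Eq. (3): sum over all pairs of
    log N(A_ij | mu_{z_i z_j}, sigma2_{z_i z_j})."""
    m = mu[np.ix_(z, z)]
    s2 = sigma2[np.ix_(z, z)]
    return np.sum(-0.5 * np.log(2 * np.pi * s2) - (A - m) ** 2 / (2 * s2))

z = np.repeat([0, 1], 10)
mu = np.array([[2.0, -1.0], [-1.0, 2.0]])
sigma2 = np.full((2, 2), 0.25)
A = sample_normal_wsbm(z, mu, sigma2, rng=0)
ll_true = log_likelihood(A, z, mu, sigma2)
```

Evaluating `log_likelihood` under a shuffled partition gives a much lower value than under the true `z`, which is exactly the signal the WSBM exploits.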
However, for accurate inference, this distinction can be important. For example, a non-edge may indicate an interaction that is impossible to measure, which is distinct from choosing not to measure the interaction (an unobserved interaction) or from an interaction with weight zero. Here, we assume that these three types of interactions are distinguished in our input data. This creates two types of information: information from edge existence (non-edges vs. weighted edges) and information from edge weight (the weighted values).

To handle these two types of information, the WSBM models an edge's existence as a Bernoulli or binary random variable, as in the SBM, and models an edge's weight using an exponential family distribution. Terms corresponding to unobserved interactions contribute no information to inference and are dropped from the likelihood function. If the pair $(T_e, \eta_e)$ denotes the family of edge-existence distributions and the pair $(T_w, \eta_w)$ denotes the family of edge-weight distributions, then we may combine their contributions in the likelihood function via a simple tuning parameter $\alpha \in [0, 1]$ that determines their relative importance in inference:

  \log \Pr(A \mid z, \theta) = \alpha \sum_{ij \in E} T_e(A_{ij}) \cdot \eta_e\left( \theta^{(e)}_{z_i z_j} \right) + (1 - \alpha) \sum_{ij \in W} T_w(A_{ij}) \cdot \eta_w\left( \theta^{(w)}_{z_i z_j} \right),   (4)

where $E$ is the set of observed interactions (including non-edges) and $W$ is the set of weighted edges ($W \subset E$). This generalization can be reduced to the compact form of Eq. (2) by combining the vectors $\alpha T_e$ with $(1 - \alpha) T_w$ and $\eta_e$ with $\eta_w$.

By tuning $\alpha$, we can learn different latent structures. When $\alpha = 1$, the model ignores edge-weight information and reduces to the SBM. When $\alpha = 0$, the model treats edge absence as if it were unobserved, and fits only to the weight information. When $0 < \alpha < 1$, the likelihood combines information from both edge existence and weights.
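A sketch of the combined likelihood of Eq. (4) follows (our illustration, not the paper's code; the normal is used for the weight family, NaN marks non-edges, and `wsbm_log_likelihood` is a hypothetical helper name):

```python
import numpy as np

def wsbm_log_likelihood(A, observed, z, theta_e, mu, sigma2, alpha=0.5):
    """Eq. (4) for Bernoulli existence plus normal weights: alpha weights
    the existence term over observed pairs E; (1 - alpha) weights the
    weight term over the weighted edges W (a subset of E).

    A        : weights, with NaN at observed non-edges
    observed : boolean mask of measured pairs (E); unobserved pairs drop out
    """
    te = theta_e[np.ix_(z, z)]
    exists = observed & ~np.isnan(A)       # the weighted edges W
    # Existence term over E: log Bernoulli(exists | theta_e).
    ll_e = np.sum(np.where(exists, np.log(te), np.log(1 - te))[observed])
    # Weight term over W: log Normal(A | mu, sigma2) where an edge exists.
    m, s2 = mu[np.ix_(z, z)], sigma2[np.ix_(z, z)]
    lw = -0.5 * np.log(2 * np.pi * s2) - (A - m) ** 2 / (2 * s2)
    ll_w = np.sum(lw[exists])
    return alpha * ll_e + (1 - alpha) * ll_w

rng = np.random.default_rng(1)
z = np.array([0, 0, 1, 1])
theta_e = np.array([[0.9, 0.2], [0.2, 0.9]])
mu, sigma2 = np.zeros((2, 2)), np.ones((2, 2))
A = rng.normal(size=(4, 4))
A[0, 2] = np.nan                           # an observed non-edge
observed = np.ones((4, 4), dtype=bool)
ll = wsbm_log_likelihood(A, observed, z, theta_e, mu, sigma2, alpha=0.5)
```

Setting `alpha=0` makes the value independent of `theta_e` (weights only), and `alpha=1` makes it independent of `mu, sigma2` (the SBM limit), mirroring the discussion above.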
In principle, the best choice of $\alpha$ could also be learned, but we leave this subtle problem for future work. In practice, we often find that $\alpha = 1/2$, giving equal weight to both types of information, works well.

2.2 Degree Correction

The last piece of the WSBM is a generalization to naturally handle heavy-tailed degree distributions, which are ubiquitous in real-world networks and are known to cause the SBM to produce undesirable results, e.g., placing all high-degree vertices in a group together, regardless of their natural community memberships [18].

Karrer and Newman introduced an elegant extension of the SBM that circumvents this behavior. In their "degree-corrected" SBM (here DCBM), they add vertex degree information into the generative model by attaching an "edge-propensity" parameter $\phi_i$ to each vertex [18]. As a result, the number of edges that exist between a pair of vertices $i$ and $j$ is a Poisson random variable with mean $\phi_i \phi_j \theta_{z_i z_j}$. Because vertices with high propensity are more likely to connect than vertices with low propensity, the propensity parameters $\phi$ allow for heterogeneous degree distributions within groups. In the DCBM, vertices in the same block are no longer stochastically equivalent, but they have similar group-level connectivity patterns conditioned on their propensity parameters $\phi$. The likelihood function for this model is

  \Pr(A \mid z, \theta, \phi) \propto \prod_{ij} \left( \phi_i \phi_j \theta_{z_i z_j} \right)^{A_{ij}} \exp\left( -\phi_i \phi_j \theta_{z_i z_j} \right),

where the maximum likelihood estimate of each propensity parameter $\phi_i$ is simply the vertex degree $d_i$ [18]. By fixing $\phi_i = d_i$, we can rewrite the DCBM in the exponential family form

  \Pr(A \mid z, \theta, \phi) \propto \prod_{ij} \exp\left( A_{ij} \cdot \log \theta_{z_i z_j} - d_i d_j \cdot \theta_{z_i z_j} \right),   (5)

where the sufficient statistics are $T = (A_{ij}, -d_i d_j)$ and the natural parameters are $\eta = (\log \theta_{z_i z_j}, \theta_{z_i z_j})$.
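The role of the propensity parameters can be seen by sampling from the DCBM directly (our sketch, assuming NumPy; `sample_dcbm` is a hypothetical helper name):

```python
import numpy as np

def sample_dcbm(z, theta, phi, rng=None):
    """Degree-corrected SBM: the count on (i, j) is Poisson with mean
    phi_i * phi_j * theta_{z_i z_j}, so high-propensity vertices gain
    edges without leaving their community."""
    rng = np.random.default_rng(rng)
    mean = np.outer(phi, phi) * theta[np.ix_(z, z)]
    counts = rng.poisson(mean)
    np.fill_diagonal(counts, 0)            # no self-loops
    return counts

z = np.repeat([0, 1], 10)
theta = np.array([[0.50, 0.05], [0.05, 0.50]])
phi = np.where(np.arange(20) % 5 == 0, 3.0, 1.0)   # a few hub vertices
C = sample_dcbm(z, theta, phi, rng=0)
```

Vertices with `phi = 3` end up with roughly three times the expected degree of their group-mates, producing heterogeneous degrees within a block, as the text describes.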
Thus, to derive a degree-corrected weighted stochastic block model, we simply replace the SBM contribution in Eq. (4) with that of the DCBM in Eq. (5). We note that this model can easily be extended to include in- and out-propensity parameters for directed networks.

This degree-corrected weighted stochastic block model allows for heterogeneous degree distributions within groups by modeling vertex degree, or rather the sum of edge existences. This is distinct from what one might call a 'strength'-corrected SBM that produces heterogeneous weight distributions within edge bundles by modeling vertex strength (the sum of a vertex's edge weights). This 'strength'-corrected model is not considered here and is an area for future work.

3 Learning Latent Block Structure

Given some sparse weighted graph $A$, we recover the underlying communities by learning the parameters $z, \theta$. Any of a large number of standard approaches can be used to optimize the likelihood function for the WSBM. Here, we describe an efficient variational Bayes approach [5, 16], which effectively handles one technical difficulty in fitting the model to real data.

Specifically, learning the parameters $z, \theta$ by directly maximizing the likelihood in Eq. (2) can suffer from degenerate solutions under continuous-valued weights. For instance, consider the WSBM with normally distributed edge weights, where some bundle of edges has all-equal weights. In this case, the maximum likelihood estimate is a variance parameter equal to zero, which creates a degeneracy in the likelihood calculation. This case is not pathological, as a poor choice of partition $z$, chosen, perhaps, inadvertently over the course of maximizing the likelihood, can easily create two small groups with only a few edges, each with the same weight, between them.
This problem has not previously been identified in the block-modeling literature because the SBM is a model whose edge "weights" are discrete Bernoulli random variables, whose parameters are never degenerate.

We solve this problem using Bayesian regularization. In the Bayesian framework, we treat the parameters as random variables and assign an appropriate prior distribution $\pi$ to our parameters $z, \theta$. If we treat the prior distribution as the probability of the parameters, $\pi(z, \theta) = \Pr(z, \theta)$, then we may calculate the posterior distribution as the probability of the parameters conditioned on the data, $\pi^*(z, \theta) = \Pr(z, \theta \mid A)$, through Bayes' law:

  \pi^*(z, \theta) \propto \Pr(A \mid z, \theta) \, \pi(z, \theta).

After calculating the posterior distribution, we may either report our posterior beliefs $\pi^*$ about the parameters $z, \theta$ or further calculate a point estimate that minimizes a posterior expected loss with respect to a given loss function [23, 31]. In both cases, it suffices to calculate the posterior $\pi^*$. The maximum likelihood estimate corresponds to only maximizing the likelihood $\Pr(A \mid z, \theta)$. The inclusion of the prior distribution $\pi$ prevents the posterior distribution $\pi^*$ from over-fitting to the degenerate maximum likelihood solution, and therefore estimation can proceed smoothly.

However, the posterior distribution is generally difficult to calculate analytically. Instead, we approximate $\pi^*(z, \theta)$ by a factorizable distribution $q(z, \theta) = q_z(z) \, q_\theta(\theta)$, a common approach in both machine learning and statistical physics. We select our approximation $q$ by minimizing its Kullback-Leibler (KL) divergence from the posterior,

  D_{KL}(q \| \pi^*) = - \int q \log \frac{\pi^*}{q}.

The Kullback-Leibler divergence is a non-symmetric, non-negative, information-theoretic measure of the difference between two distributions.
Thus, our approximation $q$ can be thought of as the closest approximation to the posterior $\pi^*$, subject to factorization and distribution constraints. Using the fact that the log-likelihood $\log \Pr(A)$ is constant, we observe that minimizing the KL divergence is equivalent to maximizing the functional $G(q)$, defined as follows:

  \log \Pr(A) = \int_\Theta \sum_{z \in Z} q(z, \theta) \, \log \Pr(A) \, d\theta
             = \int_\Theta \sum_{z \in Z} q(z, \theta) \, \log \frac{\Pr(A, z, \theta)}{\Pr(z, \theta \mid A)} \, d\theta
             = \int_\Theta \sum_{z \in Z} q(z, \theta) \, \log \frac{\Pr(A, z, \theta)}{q(z, \theta)} \, d\theta - \int_\Theta \sum_{z \in Z} q(z, \theta) \, \log \frac{\Pr(z, \theta \mid A)}{q(z, \theta)} \, d\theta
             = G(q) + D_{KL}\left( q(z, \theta) \,\|\, \pi^*(z, \theta) \right),

where

  G(q) = \int_\Theta \sum_{z \in Z} q(z, \theta) \, \log \frac{\Pr(A, z, \theta)}{q(z, \theta)} \, d\theta
       = \mathbb{E}_q\left[ \log \Pr(A \mid z, \theta) \right] + \mathbb{E}_q\left[ \log \frac{\pi(z, \theta)}{q(z, \theta)} \right].   (6)

The first term of Eq. (6) is the expected log-likelihood under the approximation $q$, and the second term is the negative KL divergence of the approximation $q$ from the prior $\pi$. Therefore, we aim to maximize the expected log-likelihood of the data while weakly constraining the approximation to be close to the prior. The second term serves as a regularizer that prevents over-fitting and eliminates the aforementioned maximum likelihood degeneracies. In practice, the first term dominates the second term given sufficient data, and the result approximates the maximum likelihood estimate.

Because the KL divergence is non-negative, we can think of $G(q)$ as a functional lower bound on the log-evidence or marginal log-likelihood, that is,

  \log \Pr(A) = G(q) + D_{KL}(q \,\|\, \pi^*) \geq G(q).   (7)

Maximizing $G(q)$ is equivalent to minimizing the KL divergence $D_{KL}(q \,\|\, \pi^*)$ because the log-evidence $\log \Pr(A)$ is constant. Therefore, as we maximize $G(q)$, our approximation $q$ gets closer to the true posterior $\pi^*$. For more details on variational Bayesian inference in graphical models, we refer the interested reader to Ref. [5].
3.1 Conjugate Distributions

To calculate $G$ in practice, we must assign prior distributions $\pi$ to our parameters and place constraints on the distributions of our approximation $q$. For mathematical convenience, we choose $\pi$ and restrict $q$ to be products of parameterized conjugate distributions. Because $q$ takes a parameterized form, maximizing the functional $G(q)$ over all factorized distributions $q$ simplifies to maximizing $G(q)$ over the parameters of $q$.

For the edge-bundle parameters $\theta$, the standard conjugate prior for the parameter of an exponential family $(T, \eta)$ is

  \pi(\theta) = \frac{1}{Z(\tau)} \exp\left( \tau \cdot \eta(\theta) \right),   (8)

where $\tau$ parameterizes the prior and $Z(\tau)$ is a normalizing constant for fixed $\tau$. For notational convenience, we let $r$ index the $K \times K$ edge bundles between groups; hence $\theta = (\theta_1, \dots, \theta_r)$. When we update the prior based on the observed weights in a given edge bundle $r$, the posterior's parameter becomes $\tau^* = \tau + T_r$, where $T_r$ is the sufficient statistic of the observed edges. Thus $\tau$ can be viewed as a set of pseudo-observations that push the likelihood function away from the degenerate cases, so that every edge bundle, no matter how small or uniform, produces a valid parameter estimate.

For the vertex labels $z$, the natural conjugate prior is a categorical distribution with parameter $\mu \in \mathbb{R}^{n \times K}$. The parameter $\mu_i(k)$ represents the probability that vertex $i$ belongs to group $k$ in all of its interactions. If the probability in parameter $\mu_i$ is spread among multiple groups, this indicates uncertainty in the membership of vertex $i$, not mixed membership. We fit $\mu_i$ directly, with flat prior $\mu_0(k) = 1/K$. The form of our prior is thus

  \pi(z, \theta \mid \mu_0, \tau_0) = \prod_i \mu_0(z_i) \times \prod_r \frac{1}{Z(\tau_0)} \exp\left( \tau_0 \cdot \eta(\theta_r) \right),   (9)

where $\mu_0, \tau_0$ are the parameters for the priors $\pi_i, \pi_r$, picked to be a "non-informative" reference prior [6] or flat.
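The pseudo-observation effect of $\tau$ can be illustrated for a normal edge bundle with $T(x) = (x, x^2, 1)$. The sketch below (ours; a moment-matched estimate for illustration, not the paper's full conjugate posterior) shows that an all-equal bundle, whose maximum likelihood variance is zero and hence degenerate, still yields a strictly positive regularized variance once $\tau_0$ is added:

```python
import numpy as np

def regularized_normal_estimate(weights, tau0):
    """Moment-matched estimate after the update tau* = tau0 + T_r, with
    T(x) = (x, x^2, 1).  tau0 = (t1, t2, t3) acts like t3 prior
    pseudo-edges carrying first/second moments t1/t3 and t2/t3."""
    x = np.asarray(weights, dtype=float)
    t1, t2, t3 = tau0 + np.array([x.sum(), (x ** 2).sum(), x.size])
    mu_hat = t1 / t3
    var_hat = t2 / t3 - mu_hat ** 2
    return mu_hat, var_hat

# An all-equal bundle: the ML variance is zero (degenerate), but the
# pseudo-observations keep the regularized variance strictly positive.
bundle = np.full(6, 2.0)
mu_hat, var_hat = regularized_normal_estimate(bundle, tau0=np.array([0.0, 1.0, 1.0]))
```

The single pseudo-edge in `tau0` is enough to keep `var_hat` away from zero, which is exactly the regularization role attributed to $\tau$ above.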
Similarly, our approximation $q$ takes the form

  q(z, \theta \mid \mu, \tau) = \prod_i \mu_i(z_i) \times \prod_r \frac{1}{Z(\tau_r)} \exp\left( \tau_r \cdot \eta(\theta_r) \right).   (10)

3.2 An Efficient Algorithm for Optimizing G

Now we consider maximizing $G$ over $q$'s parameters $\mu_i, \tau_r$. To simplify notation, let $\langle T \rangle_r, \langle \eta \rangle_r$ be the expected values of the sufficient statistics $T_r$ and natural parameters $\eta_r$ under the approximation $q$; that is, we set

  \langle T \rangle_r = \sum_{ij} \sum_{(z_i, z_j) = r} \mu_i(z_i) \, \mu_j(z_j) \, T(A_{ij}),   (11)

  \langle \eta \rangle_r = \left. \frac{\partial}{\partial \tau} \log Z(\tau) \right|_{\tau = \tau_r}.   (12)

Substituting the conjugate prior forms of $\pi, q$ into $G$ thus yields

  G \propto \sum_r \left( \langle T \rangle_r + \tau_0 - \tau_r \right) \cdot \langle \eta \rangle_r + \sum_r \log \frac{Z(\tau_r)}{Z(\tau_0)} + \sum_i \sum_{z_i} \mu_i(z_i) \log \frac{\mu_0(z_i)}{\mu_i(z_i)}.   (13)

To optimize $G$, we take derivatives with respect to $q$'s parameters $\mu, \tau$ and set them to zero. We iteratively solve for the maximum by updating $\mu$ and $\tau$ independently.

For the edge-bundle parameters $\tau$, the derivative of $G$ is

  \frac{\partial G}{\partial \tau_r} = \left( \langle T \rangle_r + \tau_0 - \tau_r \right) \frac{\partial \langle \eta \rangle_r}{\partial \tau_r},   (14)

and setting this equal to zero yields a compact update equation,

  \tau_r = \tau_0 + \langle T \rangle_r,   (15)

for each edge bundle $r$.

For the vertex-label parameters $\mu$, we include Lagrange multipliers $\lambda_i$ to enforce the constraint $\sum_z \mu_i(z) = 1$. Setting the derivative of $G$ with respect to $\mu_i$ equal to $\lambda_i$ yields

  \frac{\partial G}{\partial \mu_i(z)} = \sum_r \frac{\partial \langle T \rangle_r}{\partial \mu_i(z)} \cdot \langle \eta \rangle_r - \log \mu_i(z) = \lambda_i,
  where \frac{\partial \langle T \rangle_r}{\partial \mu_i(z)} := \sum_{z' : (z, z') = r} \sum_{j \neq i} T(A_{ij}) \, \mu_j(z').

Solving for $\mu_i(z)$ yields a compact update equation,

  \mu_i(z) \propto \exp\left( \sum_r \frac{\partial \langle T \rangle_r}{\partial \mu_i(z)} \cdot \langle \eta \rangle_r \right),   (16)

where each $\mu_i$ is normalized to a probability distribution. To calculate the $\mu_i$ values, we iteratively update each $\mu_i$ from some initial guess until convergence to within some numerical tolerance.
Algorithm 1 gives pseudocode for the full variational Bayes algorithm, which alternates between updating the edge-bundle parameters and the vertex-label parameters using the update equations, Eqs. (15, 16).

Algorithm 1: Variational Bayes for the WSBM
  Input: edge-weighted network A and model (K, α, T, η)
  Initialize μ
  repeat
    for all r = 1, ..., K² do
      Set ⟨T⟩_r := Σ_ij Σ_{(z_i, z_j) = r} μ_i(z_i) μ_j(z_j) T(A_ij)
      Set τ_r := τ_0 + ⟨T⟩_r
      Set ⟨η⟩_r := ∂/∂τ log Z(τ) |_{τ = τ_r}
    end for
    repeat
      for all i = 1, ..., n do
        Set ∂⟨T⟩_r/∂μ_i(z) := Σ_{z' : (z, z') = r} Σ_{j ≠ i} T(A_ij) μ_j(z')
        Set μ_i(z) ∝ exp( Σ_r ∂⟨T⟩_r/∂μ_i(z) · ⟨η⟩_r )
      end for
    until μ converges
  until μ, τ converge
  return μ, τ

Updating $\theta$ is relatively fast. First, we calculate $\langle T \rangle_r$ and $\tau_r$ for each edge bundle $r$ and then update each $\langle \eta \rangle_r$, which takes $O(nK^2)$ time. Updating $\mu$ is the limiting step of the calculation, as we iteratively update $\mu$ until convergence while holding $\theta$ fixed. To calculate $\partial \langle T \rangle_r / \partial \mu_i(z)$, each vertex must sum over its connected edges for each edge bundle, which takes $O(d_i K^2)$ time. If $m$ is the total number of edges in the network, then updating $\mu$ takes $O(mK^2)$ time. In particular, if the network is sparse, with $m = O(n)$, then updating $\mu$ takes $O(nK^2)$ time. In practice, we run the algorithm to convergence from a number of randomly chosen initial conditions and then select the best $\mu, \tau$.

In addition to the variational Bayes algorithm above, we derive in Appendix C an efficient loopy belief propagation algorithm [12, 35, 36] for the WSBM on sparse graphs. The loopy belief propagation algorithm creates a more flexible approximation to the posterior distribution than the variational Bayes algorithm, but with a slightly higher computational cost. Small modifications for dealing with sparse weighted networks are described in Appendix D.
Finally, Appendix A describes how to obtain our implementation of these methods.

3.3 Selecting K with Bayes factors

As with most stochastic block models, the number of groups K is a free parameter that must be chosen before the model can be applied to data. For the WSBM, we must also choose the tuning parameter α and the exponential family distributions (T, η). In principle, any of several model selection techniques could be used, including minimum description length [28], integrated likelihood [11], or Bayes factors [16]. Classic complexity-control techniques like the AIC or BIC are known to misestimate K in certain situations [35]. Here, we describe an approach based on Bayes factors, which selects the value of K with the largest marginal log-likelihood.

Let M_1 = (K_1, α_1, T, η) and M_2 = (K_2, α_2, T, η) be two competing models, one with K_1 groups and one with K_2 groups. The Bayes factor between these models is
\[
\log B(M_1, M_2) = \log \frac{\Pr(A \mid M_1)}{\Pr(A \mid M_2)} \approx G_1 - G_2, \tag{17}
\]
where we approximate the marginal log-likelihood of each model Pr(A | M_i) with our lower bound G_i, Eq. (7).

Figure 4: (a) Example network with K = 8 groups and variance σ² = 0.15. (b) Approximate marginal log-likelihood G for each model as a function of the inferred K. (c) NMI between the fitted model and the true planted structure as a function of the inferred K. Each data series in (b, c) corresponds to a different choice of the edge-weight variance σ² ∈ {0.15, 0.25, 0.35, 0.45, 0.55, 0.65}. Results are averaged over 20 trials, and error bars show standard errors.
Although Bayes factors assign a uniform prior over a set of nested models, this approach has a built-in penalty for complex models through the prior distribution. In our experiments below, we treat K, α, T, η as fixed. This method has produced good results on synthetic data with known planted structure [1].

We now demonstrate the use and efficacy of Bayes factors in selecting the number of groups K in the WSBM. For our demonstration, we choose K = 8 groups of 10 vertices each, and consider the method's performance for a variety of edge-weight structures. Specifically, edge weights within each group are drawn from a normal distribution with mean −1 and variance σ², while edge weights between groups are drawn from a normal distribution with mean 1 and variance σ². By varying the variance parameter σ², we vary the difficulty of recovering the true group structure: a larger σ² makes inference more difficult by causing the edge-weight distributions within and between groups to increasingly overlap. Figure 4(a) shows an example network drawn from this model with σ² = 0.15.

For each choice of σ², and for a large number of networks drawn from this model, we fit the WSBM using the normal distribution for the edge weights and vary the number of inferred groups K from 1 to 14. Figure 4(b) shows the approximate marginal log-likelihood G of each fitted model as K varies, which represents our proportional belief that each choice of K is correct. Similarly, Figure 4(c) shows the NMI between each fitted model and the true planted structure, which represents the performance of each choice of K. Reassuringly, both quantities are maximized at or close to the true value of K.
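The selection rule reduces to comparing lower bounds: compute G for each candidate K and keep the largest. A minimal sketch, using illustrative G values of our own rather than numbers from the experiment above:

```python
# Approximate marginal log-likelihoods G from fits at several K
# (illustrative numbers only, not taken from the paper's experiment).
G = {4: -3200.0, 6: -2450.0, 8: -2105.0, 10: -2118.0, 12: -2140.0}

# Eq. (17): the log Bayes factor between two fitted models is approximated
# by the difference of their variational lower bounds, G_1 - G_2.
def log_bayes_factor(K1, K2):
    return G[K1] - G[K2]

best_K = max(G, key=G.get)   # select the K with the largest lower bound
```

A positive `log_bayes_factor(K1, K2)` reads as evidence favoring the model with K1 groups.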
When the within- and between-group edge-weight distributions are relatively well separated, both the marginal log-likelihood G and the NMI are consistently maximized at K = 8, indicating that Bayes factors provide a reasonably reliable method for selecting the correct number of groups, thereby recovering the true planted structure in most cases. As the distributions overlap (greater σ² here), it becomes more difficult to distinguish groups, and accuracy degrades to some degree, as would be expected.

4 Experimental Evaluation

In this section, we evaluate the performance of the WSBM on several real-world networks, in two different ways. First, we consider the question of whether adding edge-weight information necessarily reinforces the latent group structure contained in the edge existences. That is, can the WSBM find structure distinct from what the SBM would find? Second, we evaluate the WSBM's performance on two prediction tasks. The first focuses on predicting missing edges (also called "link prediction"), while the second focuses on predicting missing edge weights. We compare its performance with other block models through cross-validation.

4.1 Edge-weight versus edge-existence latent group structure

To probe the question of whether edge weights can contain latent group structures that are distinct from those contained in the edge existences, we consider a simple network derived from the competitions among a set of professional sports teams. In this network, called "NFL-2009" hereafter, each vertex represents one of the 32 professional American football teams in the National Football League (NFL). An edge exists whenever a pair of teams played each other in the 2009 season, and each of these edges is assigned a weight equal to the average score difference across games played by that pair [32].
(This definition of edge weight implies the network is skew-symmetric, A_ij = −A_ji.) The teams are divided equally between two "conferences" (called the AFC and NFC), and within each conference, teams are assigned to one of 4 divisions, each containing 4 teams. Play among teams, i.e., the existence of an edge, is determined by division memberships, and many teams never play each other during the regular season.

To analyze this network, we choose K = 4 and fit both the SBM (α = 1) and the "pure" WSBM (α = 0), using the normal distribution as a model of edge weights. This choice is reasonable for these data because score differences can be positive or negative and score totals are close to a binomial distribution [20]. The α = 1 (SBM) case ignores the weights of edges, while the α = 0 (pure WSBM) case ignores the presence or absence of edges, focusing only on the observed score differences.

Examining the results of both models, we see that the block structure learned by the SBM (α = 1, Figs. 5(a-b)) exactly recovers the major divisions within each conference, along with the division between conferences, illustrating that division membership fully explains which teams played each other in this season. That is, the empty off-diagonal blocks (Fig. 5(b)) reflect the fact that two pairs of two divisions never play each other. In contrast, the block structure learned by the pure WSBM (α = 0, Figs. 5(c-d)) recovers a global ordering of teams (as in Fig. 1(d)) that reflects each team's general skill, so that teams within each block have roughly equal skill. This pattern mixes teams across conference and division lines, and thus disagrees with the block structure recovered by the SBM. For instance, consider the upper-left group in Fig. 5(c), which generally has positive score differences (wins) in games against teams in either lower group, with a mean lead of 11 points.
Similarly, the lower-left group has positive score differences (wins) against teams in the lower-right group. The small upper-right group performs equally well against teams of every other group. Within each group, however, score differences tend toward zero, indicating roughly equal skill.

The fact that the SBM and pure WSBM recover entirely distinct block structures illustrates that adding edge-weight information to the inference step can dramatically alter our conclusions about the latent block structure of a network. That is, adding edge weights does not necessarily reinforce the inferences produced from binary edges alone. The extremal settings of the parameter α in our model allow a practitioner to choose which of these types of latent structure to find, while if a mixed-type conclusion is preferred, an intermediate value of α may be chosen. In the following section, we demonstrate that such a model, which we call the "balanced" WSBM, can learn simultaneously from edge-existence and edge-weight information.

Figure 5: The NFL-2009 network: black nodes (•) are teams in conference 1 (NFC) and white nodes (◦) are teams in conference 2 (AFC). Edges are colored by score differential (red positive, green approximately zero, blue negative). (a) Network showing SBM communities. (b) Adjacency matrix, sorted by SBM communities. (c) Network showing WSBM communities. (d) Adjacency matrix, sorted by WSBM communities. The SBM (α = 1) groups correspond to the NFL conference structure, whereas the WSBM (α = 0) groups correspond to relative skill levels.

4.2 Predicting edge existence and weight

To illustrate a more rigorous evaluation of the WSBM, in this section we consider the problem of predicting missing information when the model is fitted to a partially observed network.
In particular, we consider predicting the existence or the weight of some unobserved interaction, a task similar to missing and spurious link prediction [8, 14]. Here, we compare the WSBM to other block models on five real-world networks from various domains. Most of these models are defined only for unweighted networks, and thus some care is required to make them perform the edge-weight prediction task, as we describe below. We evaluate performance numerically across multiple trials of cross-validation, training each model on 80% of the n² possible edges and testing on the remaining 20%.

The weighted graphs we consider are the following.

• Airport. Vertices represent the n = 500 busiest airports in the United States, and each of the m = 5960 directed edges is weighted by the number of passengers traveling from one airport to another [10].
• Collaboration. Vertices represent n = 226 nations on Earth, and each of the m = 20616 edges is weighted by a normalized count of academic papers whose author lists include that pair of nations [25].
• Congress. Vertices represent the n = 163 committees in the 102nd United States Congress, and each of the m = 26569 edges is weighted by the pairwise normalized "interlock" value of shared members [30].
• Forum. Vertices represent n = 1899 users of a student social network at UC Irvine, and each of the m = 20291 directed edges is weighted by the number of messages sent between users [24].
• College FB. Vertices represent the n = 1411 NCAA college football teams, and each of the m = 22168 edges is weighted by the average point difference across games between a pair of teams [32].

For each of the two prediction tasks and for each network, we evaluate the following models: the "pure" WSBM (pWSBM), using only weight information (α = 0); a "balanced" WSBM (bWSBM), using both edge and weight information (α = 0.5); the "classic" SBM, using only edge information (α = 1); a degree-corrected weighted block model (DCWBM), with α = 0.5; and the degree-corrected block model (DCBM). For the weighted block models, we select the normal distribution to model the edge weights.

In both prediction tasks, we first choose a uniformly random 20% of the n² interactions, which we treat as missing when we fit the model to the network. We then fit each model to the observed edges and infer group membership labels for each vertex in the network. Finally, we use the posterior mean obtained from variational inference as the predictor for the edge existence and edge weight of unobserved interactions between those groups. For the models that do not naturally model edge weights (SBM, DCBM), we take their partitions, compute the sample mean weight for each of the induced edge bundles in the weighted network, and use this value to predict the weight of any missing edge in that bundle. These estimators correctly correspond to the underlying generative model for edge prediction in the SBM and DCBM, and are a natural extension for predicting edge weights given a block membership. Under this scheme, each model is made to predict the unobserved interactions for a given network, and we score the accuracy of these predictions using the mean-squared error (MSE). Evaluating edge-existence prediction using alternative criteria such as the AUC [9] gives similar results.

Each of these models has a free parameter K that determines the number of parameters to be estimated, and which thus controls its overall flexibility. We control this variable model complexity and ensure a fair comparison by fixing all models to have K = 4 latent groups, and we treat all networks as directed. Finding the true number of latent groups K for each network is a separate, worthwhile problem not considered here.
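The block-pair sample-mean predictor described above for the unweighted models can be sketched as follows; the toy data, partition, and hold-out fraction are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy weighted network with K = 2 planted groups and normal edge weights.
n, K = 40, 2
z = np.repeat([0, 1], n // 2)                  # a fixed (here: true) partition
block_mean = np.array([[1.0, -1.0], [-1.0, 1.0]])
A = rng.normal(block_mean[z][:, z], 0.3)

# Hold out a uniformly random ~20% of the n^2 interactions.
mask = rng.random((n, n)) < 0.2                # True = treated as missing
obs = ~mask

# Block-pair sample means over observed entries: the edge-weight predictor
# used for models (SBM, DCBM) that do not model weights directly.
pred_mean = np.zeros((K, K))
for r in range(K):
    for s in range(K):
        cells = obs & (z[:, None] == r) & (z[None, :] == s)
        pred_mean[r, s] = A[cells].mean()

pred = pred_mean[z][:, z]                      # predicted weight for every pair
mse = np.mean((A[mask] - pred[mask]) ** 2)     # scored on held-out entries only
```

With the partition fixed, the MSE on held-out weights approaches the within-bundle weight variance (0.09 here), which is the best a constant-per-bundle predictor can do.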
To compare results across different data sets, all edge weights were normalized to fall on the interval [−1, 1]. Non-negative weights were normalized after applying a logarithmic transform (cases marked with a star * in Tables 1 and 2).

For each model and each network, we ran 25 independent trials with our 80/20 cross-validation split, as described above, and then computed the average MSE on each prediction task. The results for predicting edge existence are summarized in Table 1, and the results for predicting edge weights are summarized in Table 2. Bolded values denote the best MSE across all models, and parentheses indicate the uncertainty (standard error) in the last digit.

Table 1: Average mean-squared error (MSE) on edge prediction over 25 trials.

Network        pWSBM       bWSBM       SBM         DCWBM       DCBM
Airport        0.0202(1)   0.0156(1)   0.0158(1)   0.0238(1)   0.0238(1)
Collaboration  0.1446(3)   0.1167(3)   0.1138(3)   0.2289(5)   0.2454(5)
Congress       0.1765(4)   0.1648(4)   0.1640(5)   0.2298(9)   0.2402(9)
Forum          0.00560(1)  0.00535(1)  0.00535(1)  0.00565(1)  0.00565(1)
College FB     0.0369(2)   0.0344(1)   0.0346(1)   0.0387(2)   0.0389(2)

Table 2: Average mean-squared error (MSE) on normalized weight prediction over 25 trials.

Network         pWSBM      bWSBM      SBM        DCWBM      DCBM
Airport*        0.0486(6)  0.0543(5)  0.0632(8)  0.0746(9)  0.0918(8)
Collaboration*  0.0407(1)  0.0462(1)  0.0497(3)  0.0500(2)  0.0849(3)
Congress*       0.0571(4)  0.0594(4)  0.0634(6)  0.0653(4)  0.1050(6)
Forum*          0.0726(3)  0.0845(3)  0.0851(4)  0.0882(4)  0.0882(4)
College FB      0.0124(1)  0.0140(1)  0.0145(1)  0.0149(1)  0.0160(2)

Notably, in the edge-existence prediction task, the SBM and the balanced WSBM are the most accurate among all models, often by a large margin. The fact that the SBM performs well is perhaps unsurprising, as it is, by design, only sensitive to edge existences in the first place.
However, the balanced WSBM learns from both existence and weight information, and its strong performance indicates that, for these networks, learning from edge weights does not necessarily confuse predictions of edge existence. In the edge-weight prediction task, however, the pure WSBM (α = 0) is the most accurate, often by a large margin, as we might expect for a model designed to learn only from edge-weight information.

In this experimental framework, none of the degree-corrected models performs well. This is likely caused by the DCBM's and DCWBM's correction for edge propensity in the group memberships: by focusing on finding community structure after accounting for edge propensity, the DCBM and DCWBM make less accurate predictions of both edge existence and edge weight. It is worth pointing out, however, that prediction is not the only measure of utility for community detection techniques, and degree-corrected models often perform better than non-corrected models at recovering meaningful latent group structures in practical situations. We thus expect the degree-corrected WSBM to be most useful in situations where the goal is the recovery of scientifically meaningful group structures, rather than edge-existence or edge-weight prediction.

In general, the SBM performs well on edge prediction but poorly on weight prediction, while the pure WSBM performs poorly on edge prediction but well on weight prediction. This pattern is precisely what we might expect, as the SBM considers only existence information, while the pure WSBM considers only weights. What is surprising, however, is the good performance on both tasks by the balanced WSBM (α = 0.5), which is as good or nearly as good as the SBM in edge prediction, but substantially better than the SBM in weight prediction.
This demonstrates that the balanced WSBM is a more powerful model than the SBM: it performs as well as the SBM on SBM-like tasks and better on edge-weight tasks. In these examples, incorporating edge-weight information into the SBM framework does not detract from the WSBM's performance in edge prediction. In fact, this good general performance is possible because the balanced WSBM learns from both edge-existence and edge-weight information.

5 Discussion

In the analysis of networks, the inference of latent community structure is a common task that facilitates subsequent analysis, e.g., by dividing a large heterogeneous network into a set of smaller, more homogeneous subgraphs, and it can reveal important insights into a network's basic organizational patterns. When edges are annotated with weights, this extra information is often discarded, e.g., by applying a single universal threshold to all weights. The weighted stochastic block model (WSBM) we described here is a natural generalization of the popular stochastic block model (SBM) to edge-weighted sparse networks. Crucially, the WSBM provides a statistically principled solution to the community detection problem in edge-weighted networks, and it removes the need to apply any thresholds before analysis. The model thus preserves the maximal amount of information in such networks for characterizing their large-scale structure.

The WSBM's general form, given in Eq. (4), is parametrized by a mixing parameter α, which allows it to learn simultaneously from both the existence (presence or absence) of edges and their associated weights. In our tests with real-world networks, the WSBM yields excellent results on both edge-existence and edge-weight prediction tasks. Additionally, the balanced model (α = 0.5) performed as well or nearly as well as the best alternative block model, suggesting that it may work well as a general model for novel applications where it is not known whether edge existences or edge weights are more informative.

In many applications, the inferred group structure will be of primary interest. For these cases, it is important to note that the groups identified by the WSBM can be distinct from those identified by examining only an unweighted version of the same network. Both forms of latent structure may be interesting and are likely to shed different light on the underlying organization of the network. It remains an open question to determine the types of networks for which the weight information contains partition structure distinct from that of the edge existences, although we have shown at least one example of such a network in Section 4.1.

The variational algorithm described here provides an efficient method for fitting the WSBM to an empirical network. Its scalability is relatively good by modern standards, and thus it should be applicable to networks of millions of vertices or more. Alternative algorithms, such as those based on Markov chain Monte Carlo for unweighted networks, are possible [8, 29]; however, each must contend with several technical problems presented by edge-weight distributions, e.g., the degeneracies in the likelihood function produced by edge bundles whose weights have zero variance.

Finally, there are several natural extensions of the WSBM, including mixed memberships [2], bipartite forms [19], dynamic networks [27], different distributions for different edge bundles, and the handling of more complex forms of auxiliary information, e.g., on the vertices or edges.
An important and open theoretical question presented by this model is whether utilizing weight information modifies the fundamental detectability of latent group structure, which exhibits a phase transition in the classic SBM [12]. We look forward to these and other extensions.

Funding

This work was supported by the U.S. Air Force Office of Scientific Research and the Defense Advanced Research Projects Agency [grant number FA9550-12-1-0432].

Acknowledgments

We thank Dan Larremore, Leto Peel, and Nora Connor for helpful conversations and suggestions. Certain data included herein are derived from the Science Citation Index Expanded, Social Science Citation Index and Arts & Humanities Citation Index, prepared by Thomson Reuters, Philadelphia, Pennsylvania, USA, Copyright Thomson Reuters, 2011.

A Code Availability

A working implementation of the WSBM inference code, written by the authors, may be found at http://tuvalu.santafe.edu/%7Eaaronc/wsbm/ . This code implements the efficient algorithms discussed in Appendix D.

B Exponential Families

Let X be a fixed domain and Θ be a set of parameters. An exponential family is a collection of parametric distributions F that can be written in the form
\[
F = \left\{ f(x \mid \theta) = h(x) \exp\big( T(x) \cdot \eta(\theta) \big) \text{ for } x \in X \;\middle|\; \theta \in \Theta \right\},
\]
where h, T, η are fixed functions. The map T is the sufficient statistic function, and the map η(θ) gives the natural parameters. Note that T and η can be vectors. The function h(x) distinguishes different probability distributions, but it appears as an additive constant in the log-likelihood function and can thus be ignored; only the pair (T, η) directly impacts the likelihood function. Examples of exponential families include the normal, exponential, gamma, log-normal, Pareto, binomial, multinomial, Poisson, and beta distributions.
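For a concrete instance, the normal distribution can be written in this (T, η) form with the log-partition absorbed into the last component, one standard choice consistent with the "absorb −A into T · η" convention used below; the numeric check against the standard density is our own:

```python
import numpy as np
from scipy.stats import norm

# The normal distribution in exponential-family form, with the log-partition
# absorbed into the last component of (T, eta):
#   T(x)       = (x, x^2, 1)
#   eta(theta) = (mu/s2, -1/(2 s2), -mu^2/(2 s2) - log(2 pi s2)/2),  s2 = sigma^2
def T(x):
    return np.array([x, x * x, 1.0])

def eta(mu, s2):
    return np.array([mu / s2,
                     -1.0 / (2.0 * s2),
                     -mu * mu / (2.0 * s2) - 0.5 * np.log(2.0 * np.pi * s2)])

mu, s2 = 0.7, 2.3
for x in (-1.0, 0.0, 2.5):
    ef = np.exp(T(x) @ eta(mu, s2))      # h(x) = 1 in this parametrization
    assert np.isclose(ef, norm.pdf(x, loc=mu, scale=np.sqrt(s2)))
```

The inner product T(x) · η(θ) collapses to −(x − µ)²/(2σ²) − ½ log(2πσ²), so the exponential-family form reproduces the familiar density exactly.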
Examples of distributions that are not exponential families include the uniform distribution and certain mixture distributions. A common representation of an exponential family sometimes includes the log-partition function A(θ), written as
\[
f(x \mid \theta) = h(x) \exp\big( \tilde{T}(x) \cdot \tilde{\eta}(\theta) - A(\theta) \big).
\]
To keep notation compact, we absorb −1 · A into T · η.

A convenient property of exponential families is that they have easily written conjugate priors. For our exponential family, the standard class of conjugate priors π is
\[
\pi(\theta) = \frac{1}{Z(\tau)} \exp\big( \tau \cdot \eta(\theta) \big),
\]
where τ are the (hyper-)parameters of the prior and can be thought of as pseudo-observations of T. The function Z is the normalizing constant, defined as
\[
Z(\tau) = \int_{\Theta} \exp\big( \tau \cdot \eta(\theta) \big)\, d\theta.
\]
Finally, it can be shown that the expected value of η(θ) under π(· | τ) is
\[
\langle \eta(\theta) \rangle = \frac{\partial \log Z(\tau)}{\partial \tau}.
\]
Further details on exponential families can be found in Refs. [23, 31], and appropriate prior distributions are discussed in Ref. [6].

C Belief Propagation Derivation

The main difference between a loopy belief propagation (hereafter simply BP) algorithm and the variational Bayes algorithm described in the main text lies in how we update the group membership parameters µ [36]. The BP approach gives a more accurate approximation of the true posterior of z, and it has been shown to produce good results in the classic SBM case [12].

In variational Bayes, we used a mean-field approximation to the posterior distribution π*:
\[
\pi^*(z) \approx q(z) = \prod_i \mu_i(z_i),
\]
where each vertex label is assumed to be independently distributed according to q_i (a categorical, or multinomial, random variable). In BP, we use pairwise approximations to the posterior distribution. Ideally, this approximation would have the form
\[
\pi^*(z) \approx q(z) \propto \prod_{ij} \mu_{ij}(z_i, z_j),
\]
where the µ_ij(·, ·) are joint probabilities.
However, this typically is not achievable, because normalizing the product of distributions over all edge pairs is non-trivial (each vertex of degree k_i appears k_i times). Luckily, in the case of trees, it is possible to normalize q to a probability distribution by accounting for this repetition; that is,
\[
q(z) = \frac{\prod_{ij \in E} \mu_{ij}(z_i, z_j)}{\prod_i \mu_i(z_i)^{k_i - 1}},
\]
where µ_i is the marginal of µ_ij, k_i is the degree of vertex i, and E is the set of observed edges. But the factor graph of the WSBM is not a tree, so this form is not necessarily exact. Here, we take a loopy BP approach: we assume the structure of pairwise terms ij ∈ E is in fact locally tree-like, and then apply the BP update equations. The assumption of locally tree-like structure makes this algorithm a poor choice on dense networks (when we observe O(n²) interactions), but it is both acceptable and effective for sparse networks.

Under this formulation, our goal is to maximize the variational approximation to the likelihood of the data, G, so that the KL divergence between q and π* is minimized. Recall from Eq. (6) that the objective function G consists of two parts,
\[
G = \mathbb{E}_q \log \Pr(A \mid z, \theta) + \mathbb{E}_q \log(\pi / q),
\]
a likelihood term and a prior regularizer term. The likelihood term is
\[
\mathbb{E}_q \log \Pr(A \mid z, \theta) \propto \sum_r \Big( \sum_{ij} T(A_{ij})\, \mathbb{E}_q(z_i z_j) + \tau_r^{(0)} \Big) \cdot \langle \eta \rangle_r \approx \sum_r \big( \langle T \rangle_r + \tau_r^{(0)} \big) \cdot \langle \eta \rangle_r,
\]
where
\[
\langle T \rangle_r = \sum_{ij} \sum_{(z, z') = r} \mu_{ij}(z, z')\, T(A_{ij}), \qquad
\langle \eta \rangle_r = \left. \frac{\partial \log Z(\tau)}{\partial \tau} \right|_{\tau = \tau_r},
\]
and where we approximate E_q(z_i z_j) ≈ q_ij(z_i, z_j), and τ_r^(0), µ^(0) are the parameters of the prior. The regularizer term consists of two parts:
\[
\mathbb{E}_q \log(\pi / q) = \mathbb{E}_q(\log \pi) - \mathbb{E}_q(\log q).
\]
The second term requires us to sum over q(z), which is combinatorially difficult to calculate, so we use the Bethe approximation
\[
-\mathbb{E}_q(\log q) \approx -\sum_{ij \in E} \sum_{z, z'} \mu_{ij}(z, z') \log \mu_{ij}(z, z') + \sum_{i, z} (k_i - 1)\, \mu_i(z) \log \mu_i(z) + \sum_r \big( -\tau_r \cdot \langle \eta \rangle_r + \log Z(\tau_r) \big).
\]
Combining these parts, the objective function may be written as
\[
G = \sum_r \big( \langle T \rangle_r + \tau_r^{(0)} - \tau_r \big) \cdot \langle \eta \rangle_r + \sum_r \log \frac{Z(\tau_r)}{Z(\tau_r^{(0)})} + \sum_{i, z} (k_i - 1)\, \mu_i(z) \log \frac{\mu_i(z)}{\mu_i^{(0)}(z)} - \sum_{ij \in E} \sum_{z, z'} \mu_{ij}(z, z') \log \frac{\mu_{ij}(z, z')}{\mu_i^{(0)}(z)\, \mu_j^{(0)}(z')}.
\]
To enforce the marginalization and normalization restrictions on q(z), we introduce Lagrange multipliers, yielding
\[
G' = G + \sum_i \lambda_i \Big( \sum_z \mu_i(z) - 1 \Big) + \sum_{ij \in E} \bigg[ \sum_z \lambda_{ij,z} \Big( \mu_i(z) - \sum_{z'} \mu_{ij}(z, z') \Big) + \sum_{z'} \lambda'_{ij,z'} \Big( \mu_j(z') - \sum_z \mu_{ij}(z, z') \Big) \bigg].
\]
Note that λ_i enforces normalization of µ_i, while λ_{ij,z} and λ'_{ij,z'} enforce that µ_ij marginalizes to µ_i and µ_j, respectively.

We maximize G' by setting its derivatives with respect to the parameters of q equal to zero. For the edge parameters θ, we differentiate with respect to τ_r:
\[
\frac{\partial G'}{\partial \tau_r} = \big( \langle T \rangle_r + \tau_r^{(0)} - \tau_r \big) \frac{\partial \langle \eta \rangle_r}{\partial \tau_r} - \langle \eta \rangle_r + \left. \frac{\partial \log Z(\tau)}{\partial \tau} \right|_{\tau = \tau_r} \propto \langle T \rangle_r + \tau_r^{(0)} - \tau_r.
\]
This is the same expression as in the variational Bayes solution, since we have only modified q(z). The update equations for τ remain
\[
\tau_r = \tau_r^{(0)} + \langle T \rangle_r.
\]
For the vertex labels z, we differentiate with respect to µ_i(z) and µ_ij(z, z') and solve the resulting system of equations using a message-passing method, which is standard in BP. The derivatives are
\[
\frac{\partial G'}{\partial \mu_i(z)} = (k_i - 1) \big( \log \mu_i(z) - \log \mu_i^{(0)}(z) + 1 \big) + \lambda_i + \sum_{j : ij \in E} \lambda_{ij,z} = 0,
\]
and
\[
\frac{\partial G'}{\partial \mu_{ij}(z, z')} = T(A_{ij}) \cdot \langle \eta \rangle_{z,z'} - \log \mu_{ij}(z, z') + \log \mu_i^{(0)}(z) + \log \mu_j^{(0)}(z') - 1 - \lambda_{ij,z} - \lambda'_{ij,z'} = 0.
\]
Solving for µ_i(z) and µ_ij(z, z'), we obtain
\[
\mu_i(z) \propto \mu_i^{(0)}(z) \prod_{j : ij \in E} e^{-\lambda_{ij,z} / (k_i - 1)},
\qquad
\mu_{ij}(z, z') \propto \mu_i^{(0)}(z)\, \mu_j^{(0)}(z') \exp\big( T(A_{ij}) \cdot \langle \eta \rangle_{z,z'} \big)\, e^{-\lambda_{ij,z}}\, e^{-\lambda'_{ij,z'}}.
\]
For notational convenience, let
\[
M_{ij}(z, z') = \exp\big( T(A_{ij}) \cdot \langle \eta \rangle_{z,z'} \big).
\]
Since Σ_{z'} µ_ij(z, z') = µ_i(z), we have
\[
\mu_i(z) \propto \mu_i^{(0)}(z) \sum_{z'} \mu_j^{(0)}(z')\, M_{ij}(z, z')\, e^{-\lambda_{ij,z}}\, e^{-\lambda'_{ij,z'}}.
\]
Setting our two expressions for µ_i(z) equal, we obtain
\[
\mu_i^{(0)}(z) \prod_{j' : ij' \in E} e^{-\lambda_{ij',z} / (k_i - 1)} \propto \mu_i^{(0)}(z) \sum_{z'} \mu_j^{(0)}(z')\, M_{ij}(z, z')\, e^{-\lambda_{ij,z}}\, e^{-\lambda'_{ij,z'}},
\]
\[
\prod_{j' : ij' \in E} e^{-\lambda_{ij',z} / (k_i - 1)} \propto \sum_{z'} \mu_j^{(0)}(z')\, M_{ij}(z, z')\, e^{-\lambda_{ij,z}}\, e^{-\lambda'_{ij,z'}}. \tag{*}
\]
Let ψ_{i→j}(z_j) denote the message from vertex i to vertex j, and set
\[
e^{-\lambda_{ij,z}} = \prod_{k : ik \in E,\, k \neq j} \psi_{k \to i}(z),
\qquad
e^{-\lambda'_{ij,z'}} = \prod_{k : jk \in E,\, k \neq i} \psi_{k \to j}(z').
\]
Plugging in this definition of ψ, we obtain
\[
\prod_{j : ij \in E} e^{-\lambda_{ij,z} / (k_i - 1)} = \prod_{j : ij \in E} \prod_{k : ik \in E,\, k \neq j} \psi_{k \to i}(z)^{1 / (k_i - 1)} = \prod_{j : ij \in E} \psi_{j \to i}(z).
\]
And, using Eq. (*), we obtain the following recursive definition for ψ:
\[
\psi_{j \to i}(z) \propto \sum_{z'} \mu_j^{(0)}(z')\, M_{ij}(z, z') \prod_{k : kj \in E,\, k \neq i} \psi_{k \to j}(z').
\]
Finally, our update equations for µ become
\[
\mu_i(z) \propto \mu_i^{(0)}(z) \prod_{j : ij \in E} \psi_{j \to i}(z),
\qquad
\mu_{ij}(z, z') \propto \mu_i^{(0)}(z)\, \mu_j^{(0)}(z')\, M_{ij}(z, z') \prod_{k : ik \in E,\, k \neq j} \psi_{k \to i}(z) \prod_{l : lj \in E,\, l \neq i} \psi_{l \to j}(z').
\]
If m = |E| is the number of observed edges/interactions, then the BP algorithm requires O(m) messages to be passed, and therefore each iteration has an O((m + n)K²) running time (updating the messages ψ and then the group membership parameters µ).
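The ψ-message recursion just derived can be sketched on a toy graph. Here the edge factor M_ij is replaced by a fixed assortative stand-in matrix, and the seeded priors are our own device for breaking symmetry; only the message-passing mechanics are illustrated.

```python
import numpy as np

# Loopy BP sketch for the psi-message recursion. The edge factor
# M_ij(z, z') = exp(T(A_ij) . <eta>_{z,z'}) is replaced by an assortative
# stand-in, since only the message-passing mechanics are illustrated.
K, n = 2, 5
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 4)]
M = np.array([[2.0, 0.5], [0.5, 2.0]])        # same factor on every edge here

mu0 = np.full((n, K), 0.5)                    # per-vertex label priors mu^(0)
mu0[0] = [0.9, 0.1]                           # a seed vertex breaks symmetry

neighbors = {i: set() for i in range(n)}
for i, j in edges:
    neighbors[i].add(j)
    neighbors[j].add(i)

# psi[(j, i)] is the message from j to i, a distribution over z_i.
psi = {(j, i): np.full(K, 1.0 / K) for i in range(n) for j in neighbors[i]}

for _ in range(30):
    new = {}
    for (j, i) in psi:
        # psi_{j->i}(z) ∝ sum_{z'} mu0_j(z') M(z,z') prod_{k in N(j)\{i}} psi_{k->j}(z')
        belief = mu0[j].copy()
        for k in neighbors[j] - {i}:
            belief *= psi[(k, j)]
        msg = M @ belief
        new[(j, i)] = msg / msg.sum()
    psi = new

# Marginals: mu_i(z) ∝ mu0_i(z) prod_{j in N(i)} psi_{j->i}(z)
mu = np.zeros((n, K))
for i in range(n):
    b = mu0[i].copy()
    for j in neighbors[i]:
        b *= psi[(j, i)]
    mu[i] = b / b.sum()
```

With an assortative factor and a single tilted prior, every vertex's marginal leans toward the seed's label, illustrating how O(m) messages propagate local evidence through the graph.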
It will be convenient to use the following equivalent messages ϕ, as used by [12, 35, 37], in our BP algorithm:
\[
\varphi_{i \to j}(z') = \mu_j^{(0)}(z') \prod_{k : kj \in E,\, k \neq i} \psi_{k \to j}(z').
\]
Note that from our old ψ update equations, we obtain
\[
\psi_{i \to j}(z') = \sum_z M_{ij}(z, z')\, \varphi_{j \to i}(z).
\]
Putting these two equations together, our new update equations using the ϕ messages become
\[
\varphi_{i \to j}(z') = \mu_j^{(0)}(z') \prod_{k : kj \in E,\, k \neq i} \sum_z M_{kj}(z, z')\, \varphi_{j \to k}(z),
\]
\[
\mu_i(z) \propto \mu_i^{(0)}(z) \prod_{j : ij \in E} \sum_{z'} M_{ij}(z, z')\, \varphi_{i \to j}(z'),
\qquad
\mu_{ij}(z, z') \propto M_{ij}(z, z')\, \varphi_{j \to i}(z)\, \varphi_{i \to j}(z').
\]
Algorithm 2 gives pseudocode for the full loopy BP algorithm.

Algorithm 2 Loopy BP for sparse networks
  Input: data E and model (K, α, T, η)
  Initialize µ
  repeat
    for all r = 1, ..., K² do
      Set ⟨T⟩_r := Σ_ij Σ_{(z_i, z_j) = r} µ_i(z_i) µ_j(z_j) T(A_ij)
      Set τ_r := τ_0 + ⟨T⟩_r
      Set ⟨η⟩_r := ∂/∂τ log Z(τ) |_{τ = τ_r}
    end for
    Calculate M_ij for all (ij) ∈ E:
      Set M_ij(k, k') := exp( T(A_ij) · ⟨η⟩_{k,k'} + T(A_ji) · ⟨η⟩_{k',k} ) for all k, k'
    repeat
      for all (ij) ∈ E do
        Set ϕ_{j→i}(z_i) ∝ µ_0(z_i) ∏_{k : kj ∈ E, k ≠ i} Σ_{z_k} ϕ_{i→k}(z_k) M_ik(z_i, z_k)
      end for
    until ϕ converges
    for all i = 1, ..., n do
      Set µ_i(z_i) ∝ µ_0(z_i) ∏_{ij ∈ E} Σ_{z_j} ϕ_{i→j}(z_j) M_ij(z_i, z_j)
    end for
  until µ, τ converge
  return µ, τ

D Modifications for Sparse Weighted Graphs

We now consider modifications to our variational Bayes algorithm (Algorithm 1) and our BP algorithm (Algorithm 2) for the case of sparse weighted graphs discussed in Section 2.1. Recall that for a network of n nodes, we can partition the n² interactions into 3 disjoint edge lists W, N, M, where W is a list of weighted edges, N is a list of non-edges, and M is a list of missing (unobserved) edges. We define the union E = W ∪ N as the list of observed edges.
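Building the three disjoint lists W, N, M from an adjacency matrix can be sketched as follows. The NaN-for-missing encoding and the zero-weight-means-non-edge convention are illustrative choices of ours, not the authors'; for some weight distributions an observed edge could legitimately carry weight zero.

```python
import numpy as np

# Sketch: partition the n^2 interactions of a weighted adjacency matrix into
# the three disjoint lists used below: W (weighted edges), N (non-edges), and
# M (missing/unobserved pairs, encoded here as NaN -- our own convention).
def partition_interactions(A):
    n = A.shape[0]
    W, N, M = [], [], []
    for i in range(n):
        for j in range(n):
            if np.isnan(A[i, j]):
                M.append((i, j))
            elif A[i, j] != 0:         # illustrative: nonzero weight = edge
                W.append((i, j))
            else:
                N.append((i, j))
    return W, N, M

A = np.array([[0.0, 1.5, np.nan],
              [0.0, 0.0, 2.0],
              [np.nan, 0.0, 0.0]])
W, N, M = partition_interactions(A)
```

The three lists always cover all n² ordered pairs exactly once, which is the invariant the sparse updates below rely on.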
Let $m_W = |W|$ be the number of weighted edges, $m_E = |E|$ be the number of observed edges, and $m_M = |M|$ be the number of missing edges. Note that $m_E + m_M = |E| + |M| = |A| = n^2$.

Both algorithms we presented require $O(|E| K^2)$ time when updating $\mu$. If the number of observed edges is sparse ($|E| = O(n)$), then no changes are required. However, it may be the case that the number of weighted edges is sparse ($|W| = O(n)$) while the number of non-edges is dense ($|N| = O(n^2)$). In this case, if we assume the number of missing edges is also sparse ($|M| = O(n)$), then we can modify Algorithms 1 and 2 so that the running time is once again $O(nK^2)$. The key idea is to exploit the structure of our edge-existence distribution. First we introduce some notation, then we consider the edge-bundle $\tau$ updates, and finally we introduce modifications to the group membership $\mu$ updates.

Notation. There are two types of degrees: degree with respect to weighted edges and degree with respect to observed edges. Let $d^-_W(i)$ be the in-degree of vertex $i$ with respect to weighted edges, and $d^+_W(i)$ its out-degree. Likewise, let $d^-_E(i)$ and $d^+_E(i)$ be the in- and out-degree of vertex $i$ with respect to observed edges.

Let our exponential family edge-weight distribution $f_w$ under parameter $\theta_w$ take the form

$$f_w(x \mid \theta_w) = h_w(x) \exp\!\left( T_w(x) \cdot \eta_w(\theta_w) \right),$$

where $h_w, T_w, \eta_w$ are fixed functions. Let our exponential family edge-existence distribution $f_e$ under parameter $\theta_e$ take the form

$$f_e(x \mid \theta_e) = h_e(x) \exp\!\left( T_e(x) \cdot \eta_e(\theta_e) \right),$$

where $h_e, T_e, \eta_e$ are fixed functions. Let $R: K \times K \to r$ be the mapping between pairs of groups and edge bundles.
D.1  Update for τ (edge distribution)

The edge-bundle updates consist of two steps: (i) calculating the expected sufficient statistic $\langle T \rangle$ for each edge bundle, and (ii) updating $\tau$ for each edge bundle.

Weighted $\tau_w$. For the weighted distribution, the expected sufficient statistic $\langle T_w \rangle_r$ for all edge bundles $r$ can be calculated using Eq. (18) for all pairs of groups $(z, z')$:

$$\langle T_w \rangle_{R(z,z')} \mathrel{+}= \sum_{ij \in W} T_w(A_{ij})\, \mu_i(z)\, \mu_j(z') . \qquad (18)$$

Since the running time for each pair is dominated by the summation over the set $W$, iterating Eq. (18) over all $K^2$ pairs of groups takes $O(K^2 (n + m_W))$ time.

Edge existence $\tau_e$. To update $\tau_e$, we note that the sufficient statistic of a non-edge is typically zero except in the last dimension, which takes the value 1 for observed edges. Knowing that this value is 1 for all observed edges lets us calculate $\langle T_e \rangle$ without needing to sum over $W \cup N$. We therefore update $\langle T_e \rangle$ using Eq. (18) for all but the last dimension of $T_e$. For the last dimension, we update $\langle T_e \rangle$ with

$$\langle T_e \rangle_{R(z,z')} \mathrel{+}= \sum_{ij} \mu_i(z)\, \mu_j(z') - \sum_{ij \in M} \mu_i(z)\, \mu_j(z') = \left( \sum_i \mu_i(z) \right) \left( \sum_j \mu_j(z') \right) - \sum_{ij \in M} \mu_i(z)\, \mu_j(z') , \qquad (19)$$

which takes $O(K^2 (n + m_M))$ time.

Degree-corrected edge existence $\tau_e$. For the degree-corrected block model, recall that the edge-existence distribution is modified slightly by replacing the $T_e(A_{ij}) = 1$ in the last dimension of $T_e$ with the product of the out- and in-degrees of $i$ and $j$: $T_e(A_{ij}) = d^+_W(i)\, d^-_W(j)$. This changes Eq. (19) by replacing $\mu_i(z)$ with $d^+_W(i)\, \mu_i(z)$ and $\mu_j(z')$ with $d^-_W(j)\, \mu_j(z')$, giving

$$\langle T_e \rangle_{R(z,z')} \mathrel{+}= \left( \sum_i d^+_W(i)\, \mu_i(z) \right) \left( \sum_j d^-_W(j)\, \mu_j(z') \right) - \sum_{ij \in M} d^+_W(i)\, \mu_i(z)\, d^-_W(j)\, \mu_j(z') . \qquad (20)$$

The running time remains the same as in the edge-existence case.

D.2  Update for µ (vertex labels)

Variational Bayes Algorithm.
The update for the vertex labels under the variational Bayes algorithm is to (i) calculate $\frac{\partial \langle T \rangle_r}{\partial \mu_i(z)}$ and (ii) update $\mu_i$ using

$$\mu_i(z) \propto \exp\!\left( \sum_r \frac{\partial \langle T \rangle_r}{\partial \mu_i(z)} \cdot \langle \eta \rangle_r \right).$$

The rate-limiting step is calculating $\frac{\partial \langle T \rangle_r}{\partial \mu_i(z)}$. For the weighted sufficient statistics $T_w$, we calculate, for all pairs $(z, z')$ and for each vertex $i$,

$$\frac{\partial \langle T_w \rangle_{R(z,z')}}{\partial \mu_i(z)} \mathrel{+}= \sum_{j \in \partial i^+_W} T_w(A_{ij})\, \mu_j(z') , \qquad \frac{\partial \langle T_w \rangle_{R(z',z)}}{\partial \mu_i(z)} \mathrel{+}= \sum_{j \in \partial i^-_W} T_w(A_{ji})\, \mu_j(z') , \qquad (21)$$

where $\partial i^+_W$ is the neighborhood formed by the outgoing weighted edges of vertex $i$ (and $\partial i^-_W$ its incoming counterpart). Since the sum in Eq. (21) is over $d^+_W(i)$ terms, the running time is $O(K^2 \sum_i d^+_W(i)) = O(K^2 (n + m_W))$.

Similar to how we updated $\tau_e$, in the edge-existence case we update $\frac{\partial \langle T \rangle_r}{\partial \mu_i(z)}$ by calculating the entire sum and subtracting away the missing edges. Again, we exploit the fact that the last dimension of $T_e$ is 1 for observed edges:

$$\frac{\partial \langle T_e \rangle_{R(z,z')}}{\partial \mu_i(z)} \mathrel{+}= \sum_j \mu_j(z') - \sum_{j \in \partial i^+_M} T_e(A_{ij})\, \mu_j(z') . \qquad (22)$$

Calculating Eq. (22) for all vertices takes $O(K^2 (n + m_M))$ total running time if we pre-calculate $\sum_j \mu_j(z')$. For the degree-corrected block model, we replace $\mu_j(z')$ with $d^-_W(j)\, \mu_j(z')$ and use Eq. (22).

Loopy BP Algorithm. The update for the vertex labels under the BP algorithm requires us to (i) calculate the marginal evidence $M_{ij}(z,z')$ from each edge, (ii) update messages $\varphi_{j \to i}(z_i)$ between weighted edges, (iii) approximate messages $\varphi_{\cdot \to i}(z_i) = \mu_i(z_i)$ between non-edges, and (iv) calculate the vertex label probabilities $\mu_i$.

We calculate the marginal evidence $M$,

$$M_{ij}(z,z') = \exp\!\left( T(A_{ij}) \cdot \langle \eta \rangle_{R(z,z')} + T(A_{ji}) \cdot \langle \eta \rangle_{R(z',z)} \right), \qquad (23)$$

for each weighted edge $ij \in W$ and all $z, z'$. This takes $O(K^2 m_W)$ time. Note that $M_{ij} = M_{ji}$.
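The precompute-and-subtract trick behind Eqs. (19) and (22) — take one global sum, then correct for the sparse list of missing edges — can be checked numerically. The marginals and missing-edge list below are synthetic stand-ins, and only the last (existence) dimension of the statistic is modeled:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 30, 3
mu = rng.dirichlet(np.ones(K), size=n)   # mu[i, z]: synthetic soft memberships
missing = [(0, 5), (7, 2), (9, 9)]       # sparse list M of missing edges
miss_set = set(missing)

# Naive O(n^2 K^2): last dimension of <T_e>, summed over every observed pair.
naive = np.zeros((K, K))
for i in range(n):
    for j in range(n):
        if (i, j) not in miss_set:
            naive += np.outer(mu[i], mu[j])

# Eq. (19): outer product of global sums minus missing-edge terms, O(K^2 (n + m_M)).
s = mu.sum(axis=0)
fast = np.outer(s, s) - sum(np.outer(mu[i], mu[j]) for (i, j) in missing)
assert np.allclose(naive, fast)

# Eq. (22): per-vertex gradient via the same subtraction, reusing the precomputed s.
i = 0
miss_out = [j for (a, j) in missing if a == i]        # the set ∂i+_M for vertex 0
grad = s - sum((mu[j] for j in miss_out), np.zeros(K))
naive_grad = sum(mu[j] for j in range(n) if (i, j) not in miss_set)
assert np.allclose(grad, naive_grad)
```

Both checks agree with the quadratic-time sums, while the fast forms touch only $O(n + m_M)$ terms per group pair.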
For the non-edges, we again exploit the fact that the last dimension of $T_e$ is 1 for observed edges, and we only need to calculate $M_{ij} = M_N$ once using Eq. (23) for all non-edges $ij \in N$.

The messages between weighted edges are

$$\varphi_{j \to i}(z_i) \propto \mu_i^{(0)}(z_i) \prod_{k \in \partial i_W,\, k \neq j}\ \sum_{z_k} \varphi_{i \to k}(z_k)\, M_{ik}(z_i, z_k) . \qquad (24)$$

Each step requires $O(|\partial i_W| K^2)$ calculations. In the case of a sparse graph, $|\partial i_W| = O(1)$, and since we repeat this step for each pair $i,j$ in $W$, the overall running time is $O(K^2 m_W)$.

Since there are $O(n^2)$ non-edges, the messages between non-edges must be approximated for our algorithm to be efficient. The idea behind this approximation is to exploit the sparsity of the weighted edges. To be concrete, suppose we select the Bernoulli distribution for our edge-existence distribution $f_e(x \mid p)$. Then our marginal evidence takes the form

$$\tilde{M}_{ij}(z,z') = \begin{cases} \exp\!\left( \langle \log p \rangle_{z,z'} \right) \cdot M_{ij}(z,z') & \text{if } ij \in W \\ \exp\!\left( \langle \log(1-p) \rangle_{z,z'} \right) & \text{otherwise,} \end{cases} \qquad (25)$$

where $p$ is the edge-existence parameter $\theta_e$. If the graph is sparse, then $\langle \log(1-p) \rangle_r = O(1/n)$. Thus for non-edges $ij \in N$ we have $\tilde{M}_{ij} \approx 1$, and therefore messages between non-edges can be approximated as

$$\varphi_{i \to j}(z') = \mu_j^{(0)}(z') \prod_{k \neq i}\ \sum_z \tilde{M}_{kj}(z,z')\, \varphi_{j \to k}(z) \approx \mu_j^{(0)}(z') \prod_{k}\ \sum_z \tilde{M}_{kj}(z,z')\, \varphi_{j \to k}(z) = \mu_j(z') . \qquad (26)$$

Thus we can approximate all messages between non-edges $\varphi_{i \to j}(z')$ by their marginal distribution $\mu_j(z')$, taking $O(nK)$ space and time. The Poisson and degree-corrected cases are more complicated and should follow along the lines of [35]; this extension is left for future work.
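A quick numerical check of the claim that the non-edge factor tends to 1: assuming a Bernoulli model with a fixed expected degree $c$ (so $p = c/n$ in a sparse graph, an assumption for this sketch), the non-edge branch of Eq. (25) is $\exp(\log(1 - c/n))$, whose deviation from 1 is $O(1/n)$:

```python
import math

c = 5.0  # assumed constant expected degree, giving p = c/n in a sparse graph
sizes = [10**2, 10**4, 10**6]
# Non-edge branch of Eq. (25) for a single block pair: exp(log(1 - p)).
factors = [math.exp(math.log(1.0 - c / n)) for n in sizes]

# The deviation from 1 is p = c/n, so it vanishes as the network grows, which
# justifies approximating non-edge messages phi_{i->j} by the marginals mu_j.
deviations = [abs(1.0 - f) for f in factors]
assert all(d <= c / n + 1e-12 for d, n in zip(deviations, sizes))
assert deviations[0] > deviations[1] > deviations[2]
```

Each individual non-edge thus carries vanishing evidence, so dropping any single factor from the product in Eq. (26) perturbs the message only at $O(1/n)$.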
Given the messages and marginal evidence, we calculate the vertex label probabilities with

$$\mu_i(z) \propto \mu_i^{(0)}(z) \prod_{ij \in W}\ \sum_{z'} M_{ij}(z,z')\, \varphi_{i \to j}(z') \cdot \prod_{ij \in N}\ \sum_{z'} M_N(z,z')\, \mu_j(z')$$

$$\approx \mu_i^{(0)}(z) \prod_{j \in \partial i_W} \frac{\sum_{z'} M_{ij}(z,z')\, \varphi_{i \to j}(z')}{\sum_{z'} M_N(z,z')\, \mu_j(z')} \cdot \left( \sum_{z'} M_N(z,z') \sum_j \mu_j(z') \right)^{|\partial i_E|} , \qquad (27)$$

where $|\partial i_E|$ is the total (in- and out-) degree of vertex $i$ with respect to observed edges and $M_N$ is the marginal evidence of a non-edge. Let $\partial i_W$ be the in- and out-neighborhood of $i$ with respect to weighted edges. Each of these updates takes $O(|\partial i_W| K^2)$ calculations. In the case of a sparse graph, $|\partial i_W|$ is $O(1)$, and since we repeat this step for each pair $i,j$ in $W$, the overall running time is $O(K^2 m_W)$.

In conclusion, all three steps take $O(nK^2)$ time when the numbers of weighted edges and missing edges are sparse ($|W| = O(n)$ and $|M| = O(n)$). Although both the variational Bayes algorithm and the loopy BP algorithm have the same asymptotic running time, the constant in front of $O(nK^2)$ for the loopy BP algorithm depends on the average weighted degree of the network.

References

1. C. Aicher, A. Z. Jacobs, and A. Clauset. Adapting the stochastic block model to edge-weighted networks. ICML Workshop on Structured Learning (SLG 2013), May 2013.
2. E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. J. Mach. Learn. Res., 9:1981–2014, 2008.
3. E. S. Allman, C. Matias, and J. A. Rhodes. Parameter identifiability in a class of random graph mixture models. J. Statist. Plann. Inference, 141(5):1719–1736, May 2011.
4. C. Ambroise and C. Matias. New consistent and asymptotically normal parameter estimates for random-graph mixture models. J. R. Stat. Soc. Ser. B, 74(1):3–35, Jan. 2012.
5. H. Attias. A variational bayesian framework for graphical models. In Adv. in Neural Info. Proc. Sys. 12, pages 209–215. MIT Press, 2000.
6. J. O.
Berger and J. M. Bernardo. On the development of reference priors. Bayesian Statistics, 4(4):35–60, 1992.
7. A. Celisse, J.-J. Daudin, and L. Pierre. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat., 6:1847–1899, 2012.
8. A. Clauset, C. Moore, and M. E. J. Newman. Structural inference of hierarchies in networks. In Lecture Notes in Computer Science, volume 4503, pages 1–13. Springer, 2007.
9. A. Clauset, C. Moore, and M. E. J. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98–101, 2008.
10. V. Colizza, R. Pastor-Satorras, and A. Vespignani. Reaction-diffusion processes and metapopulation models in heterogeneous networks. Nat. Phys., 3(4):276–282, 2007.
11. E. Côme and P. Latouche. Model selection and clustering in stochastic block models with the exact integrated complete data likelihood. Pre-print, Mar. 2013.
12. A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett., 107(6):65701, 2011.
13. S. Fortunato. Community detection in graphs. Phys. Rep., 486(3):75–174, 2010.
14. R. Guimerà and M. Sales-Pardo. Missing and spurious interactions and the reconstruction of complex networks. Proc. Natl. Acad. Sci. USA, 106(52):22073–22078, 2009.
15. R. Guimerà and M. Sales-Pardo. A network inference method for large-scale unsupervised identification of novel drug-drug interactions. PLOS Comput. Biol., 9(12):e1003374, Dec. 2013.
16. J. Hofman and C. Wiggins. Bayesian approach to network modularity. Phys. Rev. Lett., 100(25):258701, 2008.
17. P. Holland, K. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
18. B. Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Phys. Rev.
E, 83(1):016107, 2011.
19. D. B. Larremore, A. Clauset, and A. Z. Jacobs. Efficiently inferring community structure in bipartite networks. Pre-print, Mar. 2014.
20. S. Merritt and A. Clauset. Scoring dynamics across professional team sports: tempo, balance and predictability. EPJ Data Science, 3:4, 2014.
21. M. E. J. Newman. Mixing patterns in networks. Phys. Rev. E, 67(2):026126, Feb. 2003.
22. M. E. J. Newman. Networks: An Introduction. Oxford University Press, Oxford; New York, 2010.
23. A. O'Hagan. Kendall's Advanced Theory of Statistics: Bayesian Inference, 2B. Wiley, 1st edition, 2004.
24. T. Opsahl and P. Panzarasa. Clustering in weighted networks. Social Networks, 31(2):155–163, 2009.
25. R. K. Pan, K. Kaski, and S. Fortunato. World citation and collaboration networks: uncovering the role of geography in science. Sci. Rep., 2:902, 2012.
26. Y. Park, C. Moore, and J. Bader. Dynamic networks from hierarchical bayesian graph clustering. PLOS ONE, 5(1):e8118, 2010.
27. L. Peel and A. Clauset. Detecting change points in the large-scale structure of evolving networks. Pre-print, 2014.
28. T. Peixoto. Parsimonious module inference in large networks. Phys. Rev. Lett., 110(14):148701, 2013.
29. T. Peixoto. Hierarchical block structures and high-resolution model selection in large networks. Phys. Rev. X, 4:011047, 2014.
30. M. A. Porter, P. J. Mucha, M. Newman, and C. M. Warmbrand. A network analysis of committees in the United States House of Representatives. Proc. Natl. Acad. Sci. USA, 102(20):7057–7062, 2005.
31. C. Robert. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer, New York, 2007.
32. STATS LLC. Copyright 2014, 2014.
33. A. C. Thomas and J. K. Blitzstein. Valued ties tell fewer lies: Why not to dichotomize network edges with thresholds. Pre-print, 2011.
34. Y. J. Wang and G. Y. Wong.
Stochastic blockmodels for directed graphs. J. Am. Stat. Assoc., 82(397):8–19, 1987.
35. X. Yan, J. E. Jensen, F. Krzakala, C. Moore, C. R. Shalizi, L. Zdeborová, P. Zhang, and Y. Zhu. Model selection for degree-corrected block models. Pre-print, 2012.
36. J. Yedidia, W. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. Explor. Artif. Intell. Millenn., 8:236–239, 2003.
37. P. Zhang, F. Krzakala, J. Reichardt, and L. Zdeborová. Comparative study for inference of hidden classes in stochastic block models. J. Stat. Mech., 2012(12):P12021, 2012.