Bayesian stochastic blockmodeling
Authors: Tiago P. Peixoto
Department of Mathematical Sciences and Centre for Networks and Collective Behaviour, University of Bath, United Kingdom, and ISI Foundation, Turin, Italy. Contact: t.peixoto@bath.ac.uk

To appear in "Advances in Network Clustering and Blockmodeling," edited by P. Doreian, V. Batagelj, A. Ferligoj (Wiley, New York, 2019 [forthcoming]).

This chapter provides a self-contained introduction to the use of Bayesian inference to extract large-scale modular structures from network data, based on the stochastic blockmodel (SBM), as well as its degree-corrected and overlapping generalizations. We focus on nonparametric formulations that allow their inference in a manner that prevents overfitting, and enables model selection. We discuss aspects of the choice of priors, in particular how to avoid underfitting via increased Bayesian hierarchies, and we contrast the task of sampling network partitions from the posterior distribution with finding the single point estimate that maximizes it, while describing efficient algorithms to perform either one. We also show how inferring the SBM can be used to predict missing and spurious links, and shed light on the fundamental limitations of the detectability of modular structures in networks.

CONTENTS

I. Introduction
II. Structure versus randomness in networks
III. The stochastic blockmodel (SBM)
IV. Bayesian inference: the posterior probability of partitions
V. Microcanonical models and the minimum description length principle (MDL)
VI. The "resolution limit" underfitting problem, and the nested SBM
VII. Model variations
  A. Model selection
  B. Degree correction
  C. Group overlaps
  D. Further model extensions
VIII. Efficient inference using Markov chain Monte Carlo (MCMC)
IX. To sample or to optimize?
X. Generalization and prediction
XI. Fundamental limits of inference: the detectability-indetectability phase transition
XII. Conclusion
References

I. INTRODUCTION

Over the past decade and a half there has been an ever-increasing demand to analyze network data, in particular those stemming from social, biological and technological systems. Often these systems are very large, comprising millions or even billions of nodes and edges, such as the World Wide Web and the global-level social interactions among humans. A particular challenge that arises is how to describe the large-scale structures of these systems in a way that abstracts away from low-level details, allowing us to focus instead on "the big picture." Differently from systems that are naturally embedded in some low-dimensional space — such as the population density of cities or the physiology of organisms — we are unable just to "look" at a network and readily extract its most salient features. This has prompted a flurry of activity in developing algorithmic approaches to extract such global information in a well-defined manner, many of which are described in the remaining chapters of this book. Most of them operate on a rather simple ansatz, where we try to divide the network into "building blocks," which can then be described at an aggregate level in a simplified manner. The majority of such methods go under the name "community detection," "network clustering" or "blockmodeling." In this chapter we consider the situation where the ultimate objective when analyzing network data in this way is to model it, i.e.
we want to make statements about possible generative mechanisms that are responsible for the network formation. This overall aim sets us on a well-defined path, where we get to formulate probabilistic models for network structure, and use principled and robust methods of statistical inference to fit our models to data. Central to this approach is the ability to distinguish structure from randomness, so that we do not fool ourselves into believing that there are elaborate structures in our data which are in fact just the outcome of stochastic fluctuations — a pitfall that tends to be the Achilles' heel of alternative nonstatistical approaches. In addition to providing a description of the data, the models we infer can also be used to generalize from observations, and make statements about what has not yet been observed, yielding something more tangible than mere interpretations. In what follows we will give an introduction to this inference approach, which includes recent developments that allow us to perform it in a consistent, versatile and efficient manner.

II. STRUCTURE VERSUS RANDOMNESS IN NETWORKS

If we observe a random string of characters we will eventually encounter every possible substring, provided the string is long enough. This leads to the famous thought experiment of a large number of monkeys with typewriters: assuming that they type randomly, for a sufficiently large number of monkeys any output can be observed, including, for example, the very text you are reading. Therefore, if we are ever faced with this situation, we should not be surprised if such a text is in fact produced, and most importantly, we should not offer its simian author a place in a university department, as this occurrence is unlikely to be repeated. However, this example is of little practical relevance, as the number of monkeys necessary to type the text "blockmodeling" by chance is already of the order of $10^{18}$, and there are simply not that many monkeys.

Networks, however, are different from random strings. The network analogue of a random string is an Erdős–Rényi random graph [1], where each possible edge can occur with the same probability. But differently from a random string, a random graph can contain a wealth of structure before it becomes astronomically large — especially if we search for it. An example of this is shown in Fig. 1 for a modest network of 5,000 nodes, where its adjacency matrix is visualized using three different node orderings. Two of the orderings seem to reveal patterns of large-scale connections that are tantalizingly clear, and indeed would be eagerly captured by many network clustering methods [2].

[Figure 1: the same adjacency matrix shown in three panels under three different node orderings; both axes index the nodes from 0 to 5,000.] Figure 1. The three panels show the same adjacency matrix, with the only difference between them being the ordering of the nodes. The different orderings show seemingly clear, albeit very distinct, patterns of modular structure. However, the adjacency matrix in question corresponds to an instance of a fully random Erdős–Rényi model, where each edge has the same probability $p = \langle k\rangle/(N-1)$ of occurring, with $\langle k\rangle = 3$. Although the patterns seen in the second and third panels are not mere fabrications — they are really there in the network — they are also not meaningful descriptions of this network, since they arise purely out of random fluctuations. Therefore, the node groups that are identified via these patterns bear no relation to the generative process that produced the data. In other words, the second and third panels each correspond to an overfit of the data, where stochastic fluctuations are misrepresented as underlying structure. This pitfall can lead to misleading interpretations of results from clustering methods that do not account for statistical significance.
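The effect illustrated in Fig. 1 is easy to reproduce. The sketch below — a minimal illustration of ours, not code from the chapter — samples an Erdős–Rényi graph and then "searches" for structure with a crude spectral ordering; plotting the reordered matrix will typically display seemingly modular blocks, even though every edge was placed with identical probability.

```python
import numpy as np

rng = np.random.default_rng(42)
N, k_avg = 1000, 3
p = k_avg / (N - 1)

# Sample a fully random (Erdos-Renyi) adjacency matrix.
A = np.triu((rng.random((N, N)) < p).astype(int), k=1)
A += A.T

# "Search" for structure: group nodes by the signs of two leading
# eigenvectors of A -- a crude spectral clustering into four groups.
vals, vecs = np.linalg.eigh(A)
labels = 2 * (vecs[:, -2] > 0) + (vecs[:, -3] > 0)
order = np.argsort(labels)

# A[order][:, order] is the matrix a clustering method would "reveal":
# any apparent blocks are artifacts of random fluctuations, not of
# genuine generative structure.
A_reordered = A[np.ix_(order, order)]
```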
In particular, the second and third orderings seem to show groupings of nodes that have distinct probabilities of connections to each other — in direct contradiction to the actual process that generated the network, where all connections had the same probability of occurring. What makes matters even worse is that Fig. 1 shows only a very small subset of all orderings that display similar patterns, but are otherwise very distinct from each other. Naturally, in the same way we should not confuse a monkey with a proper scientist in our previous example, we should not use any of these node groupings to explain why the network has its structure. Doing so would amount to overfitting it, i.e. mistaking random fluctuations for generative structure, yielding an overly complicated and ultimately wrong explanation for the data.

The remedy to this problem is to think probabilistically. We need to ascribe to each possible explanation of the data a probability that it is correct, which takes into account modeling assumptions, the statistical evidence available in the data, as well as any source of prior information we may have. Imbued in the whole procedure must be the principle of parsimony — or Occam's razor — where a simpler model is preferred if the evidence is not sufficient to justify a more complicated one. In order to follow this path, before we look at any network data, we must first look in the "forward" direction, and decide on which mechanisms generate networks in the first place. Based on this, we will finally be able to look "backwards," and tell which particular mechanism generated a given observed network.

III. THE STOCHASTIC BLOCKMODEL (SBM)

As mentioned in the introduction, we wish to decompose networks into "building blocks," by grouping together nodes that have a similar role in the network. From a generative point of view, we wish to work with models that are based on a partition of $N$ nodes into $B$ such building blocks, given by the vector $\boldsymbol{b}$ with entries $b_i \in \{1,\dots,B\}$, specifying the group membership of node $i$. We wish to construct a generative model that takes this division of the nodes as parameters, and generates networks with a probability $P(\boldsymbol{A}|\boldsymbol{b})$, where $\boldsymbol{A} = \{A_{ij}\}$ is the adjacency matrix. But what shape should $P(\boldsymbol{A}|\boldsymbol{b})$ have? If we wish to impose that nodes that belong to the same group are statistically indistinguishable, our ensemble of networks should be fully characterized by the number of edges that connect nodes of two groups $r$ and $s$,

$e_{rs} = \sum_{ij} A_{ij}\,\delta_{b_i,r}\,\delta_{b_j,s},$   (1)

or twice that number if $r = s$. If we take these as conserved quantities, the ensemble that reflects our maximal indifference towards any other aspect is the one that maximizes the entropy [3]

$S = -\sum_{\boldsymbol{A}} P(\boldsymbol{A}|\boldsymbol{b}) \ln P(\boldsymbol{A}|\boldsymbol{b})$   (2)

subject to the constraint of Eq. 1.
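As a concrete reading of Eq. 1, the following short sketch (an illustration of ours assuming a dense numpy adjacency matrix, not code from the chapter) tallies the edge-count matrix $\boldsymbol{e} = \{e_{rs}\}$ for a given partition:

```python
import numpy as np

def edge_counts(A, b):
    """e_rs of Eq. (1): number of edge endpoints running between groups r
    and s; the diagonal e_rr counts twice the edges internal to group r."""
    A, b = np.asarray(A), np.asarray(b)
    onehot = np.eye(b.max() + 1, dtype=A.dtype)[b]  # N x B indicator matrix
    return onehot.T @ A @ onehot
```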
If we relax our requirements somewhat, such that Eq. 1 is obeyed only on expectation, then we can obtain our model using the method of Lagrange multipliers, via the Lagrangian function

$F = S - \sum_{r\le s}\mu_{rs}\left(\sum_{\boldsymbol{A}} P(\boldsymbol{A}|\boldsymbol{b})\sum_{i<j} A_{ij}\delta_{b_i,r}\delta_{b_j,s} - \langle e_{rs}\rangle\right) - \lambda\left(\sum_{\boldsymbol{A}} P(\boldsymbol{A}|\boldsymbol{b}) - 1\right)$   (3)

where $\langle e_{rs}\rangle$ are constants independent of $P(\boldsymbol{A}|\boldsymbol{b})$, and $\boldsymbol{\mu}$ and $\lambda$ are multipliers that enforce our desired constraints and normalization, respectively. Obtaining the saddle point $\partial F/\partial P(\boldsymbol{A}|\boldsymbol{b}) = 0$, $\partial F/\partial\mu_{rs} = 0$ and $\partial F/\partial\lambda = 0$ gives us the maximum entropy ensemble with the desired properties. If we constrain ourselves to simple graphs, i.e. $A_{ij}\in\{0,1\}$, without self-loops, we have as our maximum entropy model

$P(\boldsymbol{A}|\boldsymbol{p},\boldsymbol{b}) = \prod_{i<j} p_{b_i,b_j}^{A_{ij}}\,(1-p_{b_i,b_j})^{1-A_{ij}},$   (4)

with $p_{rs} = e^{-\mu_{rs}}/(1+e^{-\mu_{rs}})$ being the probability of an edge existing between any two nodes belonging to groups $r$ and $s$. This model is called the stochastic blockmodel (SBM), and has its roots in the social sciences and statistics [4–7], but has appeared repeatedly in the literature under a variety of different names [8–13]. By selecting the probabilities $\boldsymbol{p} = \{p_{rs}\}$ appropriately, we can achieve arbitrary mixing patterns between the groups of nodes, as illustrated in Fig. 2. We stress that while the SBM can perfectly accommodate the usual "community structure" pattern [14], i.e. when the diagonal entries of $\boldsymbol{p}$ are dominant, it can equally well describe a large variety of other patterns, such as bipartiteness, core-periphery, and many others.

[Figure 2: (a) a $6\times 6$ matrix of group-to-group probabilities; (b) a sampled network.] Figure 2. The stochastic blockmodel (SBM): (a) The matrix of probabilities between groups $p_{rs}$ defines the large-scale structure of generated networks; (b) a sampled network corresponding to (a), where the node colors indicate the group membership.

Instead of simple graphs, we may consider multigraphs by allowing multiple edges between nodes, i.e. $A_{ij}\in\mathbb{N}$. Repeating the same procedure, we obtain in this case

$P(\boldsymbol{A}|\boldsymbol{\lambda},\boldsymbol{b}) = \prod_{i<j} \frac{\lambda_{b_i,b_j}^{A_{ij}}}{(\lambda_{b_i,b_j}+1)^{A_{ij}+1}},$   (5)

with $\lambda_{rs} = e^{-\mu_{rs}}/(1-e^{-\mu_{rs}})$ being the average number of edges existing between any two nodes belonging to groups $r$ and $s$. Whereas the placement of edges in Eq. 4 is given by a Bernoulli distribution, in Eq. 5 it is given by a geometric distribution, reflecting the different nature of both kinds of networks. Although these models are not the same, there is in fact little difference between the networks they generate in the sparse limit given by $p_{rs} = \lambda_{rs} = O(1/N)$ with $N \gg 1$. We see this by noticing how their log-probabilities become asymptotically identical in this limit, i.e.

$\ln P(\boldsymbol{A}|\boldsymbol{p},\boldsymbol{b}) \approx \frac{1}{2}\sum_{rs}\left(e_{rs}\ln p_{rs} - n_r n_s p_{rs}\right) + O(1),$   (6)

$\ln P(\boldsymbol{A}|\boldsymbol{\lambda},\boldsymbol{b}) \approx \frac{1}{2}\sum_{rs}\left(e_{rs}\ln\lambda_{rs} - n_r n_s\lambda_{rs}\right) + O(1).$   (7)

Therefore, since most networks that we are likely to encounter are sparse [15], it does not matter which model we use, and we may prefer whatever is more convenient for our calculations. With this in mind, we may consider yet another variant, which uses instead a Poisson distribution to sample edges [16],

$P(\boldsymbol{A}|\boldsymbol{\lambda},\boldsymbol{b}) = \prod_{i<j} \frac{e^{-\lambda_{b_i,b_j}}\lambda_{b_i,b_j}^{A_{ij}}}{A_{ij}!} \times \prod_i \frac{e^{-\lambda_{b_i,b_i}/2}(\lambda_{b_i,b_i}/2)^{A_{ii}/2}}{(A_{ii}/2)!},$   (8)

where now we also allow for self-loops.
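To make the generative direction of Eq. 8 concrete, here is a minimal sampler — an illustrative sketch of ours assuming numpy, with hypothetical function names, not code from the chapter:

```python
import numpy as np

def sample_poisson_sbm(b, lam, rng=None):
    """Sample a multigraph adjacency matrix from the Poisson SBM of Eq. (8).

    b   : group labels, b[i] in {0, ..., B-1}
    lam : symmetric (B, B) matrix of rates lambda_rs
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(b)
    rates = lam[np.ix_(b, b)]              # per-pair rates lambda_{b_i, b_j}
    A = np.zeros((N, N), dtype=int)
    iu = np.triu_indices(N, k=1)
    A[iu] = rng.poisson(rates[iu])         # off-diagonal: Poisson(lambda)
    A += A.T
    # A_ii / 2 ~ Poisson(lambda_rr / 2), so diagonal entries are even
    A[np.diag_indices(N)] = 2 * rng.poisson(np.diag(rates) / 2)
    return A

# Example: two assortative groups of 50 nodes each
b = np.repeat([0, 1], 50)
lam = np.array([[0.10, 0.01],
                [0.01, 0.10]])
A = sample_poisson_sbm(b, lam)
```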
Like the geometric model, the Poisson model generates multigraphs, and it is easy to verify that it also leads to Eq. 7 in the sparse limit. This model is easier to use in some of the calculations that we are going to make, in particular when we consider important extensions of the SBM, and therefore we will focus on it (see footnote 1).

The model above generates undirected networks. It can be very easily modified to generate directed networks instead, by making $\lambda_{rs}$ an asymmetric matrix, and adjusting the model likelihood accordingly. The same is true for all model variations that are going to be used in the following sections. However, for the sake of conciseness we will focus only on the undirected case. We point out that the corresponding expressions for the directed case are readily available in the literature (e.g. Refs. [17–19]).

Now that we have defined how networks with prescribed modular structure are generated, we need to develop the reverse procedure, i.e. how to infer the modular structure from data.

IV. BAYESIAN INFERENCE: THE POSTERIOR PROBABILITY OF PARTITIONS

Instead of generating networks, our nominal task is to determine which partition $\boldsymbol{b}$ generated an observed network $\boldsymbol{A}$, assuming this was done via the SBM. In other words, we want to obtain the probability $P(\boldsymbol{b}|\boldsymbol{A})$ that a node partition $\boldsymbol{b}$ was responsible for a network $\boldsymbol{A}$. By evoking elementary properties of conditional probabilities, we can write this probability as

$P(\boldsymbol{b}|\boldsymbol{A}) = \frac{P(\boldsymbol{A}|\boldsymbol{b})P(\boldsymbol{b})}{P(\boldsymbol{A})}$   (9)

with

$P(\boldsymbol{A}|\boldsymbol{b}) = \int P(\boldsymbol{A}|\boldsymbol{\lambda},\boldsymbol{b})P(\boldsymbol{\lambda}|\boldsymbol{b})\,\mathrm{d}\boldsymbol{\lambda}$   (10)

being the marginal likelihood integrated over the remaining model parameters, and

$P(\boldsymbol{A}) = \sum_{\boldsymbol{b}} P(\boldsymbol{A}|\boldsymbol{b})P(\boldsymbol{b})$   (11)

is called the evidence, i.e. the total probability of the data under the model, which serves as a normalization constant in Eq. 9. Eq. 9 is known as Bayes' rule, and far from being only a simple mathematical step, it encodes how our prior beliefs about the model, i.e. before we observe any data — represented above by the prior distributions $P(\boldsymbol{b})$ and $P(\boldsymbol{\lambda}|\boldsymbol{b})$ — are affected by the observation, yielding the so-called posterior distribution $P(\boldsymbol{b}|\boldsymbol{A})$. The overall approach outlined above has been proposed for the problem of network inference by several authors [18–33], with different implementations that vary in some superficial details of the model specification, the approximations used, and in particular in the choice of priors.

Footnote 1: Although the Poisson model is not strictly a maximum entropy ensemble, the generative process behind it is easy to justify. We can imagine it as the random placement of exactly $E$ edges into the $N(N-1)/2$ entries of the matrix $\boldsymbol{A}$, each with a probability $q_{ij}$ of attracting an edge, with $\sum_{i<j} q_{ij} = 1$, yielding a multinomial distribution $P(\boldsymbol{A}|\boldsymbol{q},E) = E!\prod_{i<j} q_{ij}^{A_{ij}}/A_{ij}!$ — where, differently from Eq. 8, the edge placements are not conditionally independent. But if we now sample the total number of edges $E$ from a Poisson distribution $P(E|\bar{E})$ with average $\bar{E}$, by exploiting the relationship between the multinomial and Poisson distributions, we have $P(\boldsymbol{A}|\boldsymbol{q}) = \sum_E P(\boldsymbol{A}|\boldsymbol{q},E)P(E|\bar{E}) = \prod_{i<j} e^{-\omega_{ij}}\omega_{ij}^{A_{ij}}/A_{ij}!$, where $\omega_{ij} = q_{ij}\bar{E}$, which does amount to conditionally independent edge placements. Making $q_{ij} = \lambda_{b_i,b_j}/\bar{E}$, and allowing self-loops, we arrive at Eq. 8.
Here we will not review or compare all approaches in detail, but rather focus on the most important aspects, while choosing a particular path that makes exact calculations possible.

The prior probabilities are a crucial element of the inference procedure, as they will affect the shape of the posterior distribution, and ultimately, our inference results. In more traditional scenarios, the choice of priors would be guided by previous observations of data that are believed to come from the same model. However, this is not an applicable scenario when considering networks, which are typically singletons, i.e. they are unique objects, instead of coming from a population (e.g. there is only one internet, one network of trade between countries, etc.) (see footnote 2). In the absence of such empirical prior information, we should try as much as possible to be guided by well-defined principles and reasonable assumptions about our data, rather than ad hoc choices. A central proposition we will be using is the principle of maximum indifference about the model before we observe any data. This will lead us to so-called uninformative priors (see footnote 3), i.e. maximum entropy distributions that ascribe the same probability to each possible parameter combination [3]. These priors have the property that they do not bias the posterior distribution in any particular way, and thus let the data "speak for itself." But as we will see in the following, the naive application of this principle will lead to adverse effects in many cases, and upon closer inspection we will often be able to identify aspects of the model that we should not be agnostic about. Instead, a more meaningful approach will be to describe higher-order aspects of the model with their own models. This can be done in a manner that preserves the unbiased nature of our results, while providing a more faithful representation of the data.

We begin by choosing the prior for the partition, $\boldsymbol{b}$. The most direct uninformative prior is the "flat" distribution where all partitions into at most $B = N$ groups are equally likely, namely

$P(\boldsymbol{b}) = \frac{1}{\sum_{\boldsymbol{b}'} 1} = \frac{1}{a_N}$   (12)

where $a_N$ are the ordered Bell numbers [42], given by

$a_N = \sum_{B=1}^{N} \left\{{N\atop B}\right\} B!$   (13)

where $\left\{{N\atop B}\right\}$ are the Stirling numbers of the second kind [43], which count the number of ways to partition a set of size $N$ into $B$ indistinguishable and nonempty groups (the $B!$ in the above equation recovers the distinguishability of the groups, which we require). However, upon closer inspection we often find that such flat distributions are not a good choice. In this particular case, since there are many more partitions into $B+1$ groups than there are into $B$ groups (if $B$ is sufficiently smaller than $N$), Eq. 12 will typically prefer partitions with a number of groups that is comparable to the number of nodes. Therefore, this uniform assumption seems to betray the principle of parsimony that we stated in the introduction, since it favors large models with many groups, before we even observe the data (see footnote 4). Instead, we may wish to be agnostic about the number of groups itself, by first
sampling it from its own uninformative distribution $P(B) = 1/N$, and then sampling the partition conditioned on it,

$P(\boldsymbol{b}|B) = \frac{1}{\left\{{N\atop B}\right\} B!},$   (14)

since $\left\{{N\atop B}\right\} B!$ is the number of ways to partition $N$ nodes into $B$ labelled groups (see footnote 5). Since $\boldsymbol{b}$ is a parameter of our model, the number of groups $B$ is called a hyperparameter, and its distribution $P(B)$ is called a hyperprior. But once more, upon closer inspection we can identify further problems: if we sample from Eq. 14, most partitions of the nodes will occupy all the groups approximately equally, i.e. all group sizes will be approximately the same. Is this something we want to assume before observing any data? Instead, we may wish to be agnostic about this aspect as well, and choose to sample first the distribution of group sizes $\boldsymbol{n} = \{n_r\}$, where $n_r$ is the number of nodes in group $r$, forbidding empty groups,

$P(\boldsymbol{n}|B) = \binom{N-1}{B-1}^{-1},$   (15)

since $\binom{N-1}{B-1}$ is the number of ways to divide $N$ nonzero counts into $B$ nonempty bins. Given these randomly sampled sizes as a constraint, we sample the partition with a uniform probability

$P(\boldsymbol{b}|\boldsymbol{n}) = \frac{\prod_r n_r!}{N!}.$   (16)

This gives us finally

$P(\boldsymbol{b}) = P(\boldsymbol{b}|\boldsymbol{n})P(\boldsymbol{n}|B)P(B) = \frac{\prod_r n_r!}{N!}\binom{N-1}{B-1}^{-1}\frac{1}{N}.$   (17)

At this point the reader may wonder if there is any particular reason to stop here. Certainly we can find some higher-order aspect of the group sizes $\boldsymbol{n}$ that we may wish to be agnostic about, and introduce a "hyperhyperprior", and so on, indefinitely. The reason why we should not keep recursively being more and more agnostic about higher-order aspects of our model is that it brings increasingly diminishing returns. In this particular case, if we assume that the individual group sizes are sufficiently large, we obtain asymptotically

$\ln P(\boldsymbol{b}) \approx -N H(\boldsymbol{n}) + O(\ln N)$   (18)

where $H(\boldsymbol{n}) = -\sum_r (n_r/N)\ln(n_r/N)$ is the entropy of the group size distribution. The value $\ln P(\boldsymbol{b}) \to -N H(\boldsymbol{n})$ is an information-theoretical limit that cannot be surpassed, regardless of how we choose $P(\boldsymbol{n}|B)$. Therefore, the most we can optimize by being more refined is a marginal factor $O(\ln N)$ in the log-probability, which would amount to little practical difference in most cases.

In the above, we went from a purely flat uninformative prior distribution for $\boldsymbol{b}$ to a Bayesian hierarchy with three levels, where we sample first the number of groups, then the group sizes, and finally the partition. In each of the levels we used maximum entropy distributions that are constrained by parameters that are themselves sampled from their own distributions at a higher level.

Footnote 2: One could argue that most networks change in time, and hence belong to a time series, thus possibly allowing priors to be selected from earlier observations of the same network. This is a potentially useful way to proceed, but it also opens a Pandora's box of dynamical network models, where simplistic notions of statistical stationarity are likely to be contradicted by data. Some recent progress has been made on the inference of dynamic networks [34–41], but this field is still in relative infancy.

Footnote 3: The name "uninformative" is something of a misnomer, as it is not really possible for priors to truly carry "no information" to the posterior distribution. In our context, the term is used simply to refer to maximum entropy priors, conditioned on specific constraints.

Footnote 4: Using constant priors such as Eq. 12 makes the posterior distribution proportional to the likelihood. Maximizing such a posterior distribution is therefore entirely equivalent to a "non-Bayesian" maximum likelihood approach, and nullifies our attempt to prevent overfitting.

Footnote 5: We could have used simply $P(\boldsymbol{b}|B) = 1/B^N$, since $B^N$ is the number of partitions of $N$ nodes into $B$ groups, which are allowed to be empty. However, this would force us to distinguish between the nominal and the actual number of groups (discounting empty ones) during inference [33], which becomes unnecessary if we simply forbid empty groups in our prior.
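To make Eq. 17 concrete, the following sketch (an illustration of ours assuming numpy/scipy, not code from the chapter) evaluates $\ln P(\boldsymbol{b})$ for a given partition:

```python
import numpy as np
from scipy.special import gammaln

def log_prob_partition(b):
    """ln P(b) of Eq. (17): P(b|n) P(n|B) P(B), with empty groups forbidden."""
    b = np.asarray(b)
    N = len(b)
    n = np.bincount(b)
    n = n[n > 0]                  # sizes of the B nonempty groups
    B = len(n)
    # ln P(b|n) = sum_r ln n_r! - ln N!
    lp = gammaln(n + 1).sum() - gammaln(N + 1)
    # ln P(n|B) = -ln binom(N-1, B-1)
    lp -= gammaln(N) - gammaln(B) - gammaln(N - B + 1)
    # ln P(B) = -ln N
    lp -= np.log(N)
    return lp

# Example: N = 1000 nodes split evenly into B = 4 groups, compared with
# the asymptotic value -N H(n) = -N ln B of Eq. (18)
b = np.repeat(np.arange(4), 250)
print(log_prob_partition(b), -1000 * np.log(4))
```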
In constructing this hierarchy, we removed some intrinsic assumptions about the model (in this case, the number and sizes of groups), thereby postponing any decision on them until we observe the data. This will be a general strategy we will use for the remaining model parameters.

Having dealt with $P(\boldsymbol{b})$, this leaves us with the prior for the group-to-group connections, $\boldsymbol{\lambda}$. A good starting point is an uninformative prior conditioned on a global average, $\bar\lambda$, which will determine the expected density of the network. For a continuous variable $x$, the maximum entropy distribution with a constrained average $\bar{x}$ is the exponential, $P(x) = e^{-x/\bar{x}}/\bar{x}$. Therefore, for $\boldsymbol{\lambda}$ we have

$P(\boldsymbol{\lambda}|\boldsymbol{b}) = \prod_{r\le s} \frac{n_r n_s}{(1+\delta_{rs})\bar\lambda}\, e^{-n_r n_s \lambda_{rs}/(1+\delta_{rs})\bar\lambda},$   (19)

with $\bar\lambda = 2E/B(B+1)$ determining the expected total number of edges (see footnote 6), where we have assumed the local average $\langle\lambda_{rs}\rangle = (1+\delta_{rs})\bar\lambda/n_r n_s$, such that the expected number of edges $\langle e_{rs}\rangle = \langle\lambda_{rs}\rangle n_r n_s/(1+\delta_{rs})$ will be equal to $\bar\lambda$, irrespective of the group sizes $n_r$ and $n_s$ [19]. Combining this with Eq. 8, we can compute the integrated marginal likelihood of Eq. 10 as

$P(\boldsymbol{A}|\boldsymbol{b}) = \frac{\bar\lambda^E}{(\bar\lambda+1)^{E+B(B+1)/2}} \times \frac{\prod_{r<s} e_{rs}!\,\prod_r e_{rr}!!}{\prod_r n_r^{e_r}\,\prod_{i<j} A_{ij}!\,\prod_i A_{ii}!!}.$   (20)

Just as with the node partition, the uninformative assumption of Eq. 19 also leads to its own problems, but we postpone dealing with them to Sec. VI. For now, we have everything we need to write the posterior distribution, with the exception of the model evidence $P(\boldsymbol{A})$ given by Eq. 11. Unfortunately, since it involves a sum over all possible partitions, it is not tractable to compute the evidence exactly. However, since it is just a normalization constant, we will not need to determine it when optimizing or sampling from the posterior, as we will see in Sec. VIII. The numerator of Eq. 9, which is comprised of the terms that we can compute exactly, already contains all the information we need to proceed with the inference, and also has a special interpretation, as we will see in the next section.

The posterior of Eq. 9 will put low probabilities on partitions that are not backed by sufficient statistical evidence in the network structure, and it will not lead us to spurious partitions such as those depicted in Fig. 1. Inferring partitions from this posterior amounts to a so-called nonparametric approach; not because it lacks the estimation of parameters, but because the number of parameters itself, a.k.a. the order or dimension of the model, will be inferred as well. More specifically, the number of groups $B$ itself will be an outcome of the inference procedure, chosen in order to accommodate the structure in the data, without overfitting. The precise reasons why the latter is guaranteed might not be immediately obvious for those unfamiliar with Bayesian inference.
In the following section we will provide an explanation by making a straightforward connection with information theory. The connection is based on a different interpretation of our model, which allows us to introduce some important improvements.

Footnote 6: More strictly, we should treat $\bar\lambda$ just as another hyperparameter and integrate over its own distribution. But since this is just a global parameter, not affected by the dimension of the model, we can get away with setting its value directly from the data. It means we are pretending we know precisely the density of the network we are observing, which is not a very strong assumption. Nevertheless, readers that are uneasy with this procedure can rest assured that this can be completely amended once we move to microcanonical models in Sec. V (see footnote 15).

V. MICROCANONICAL MODELS AND THE MINIMUM DESCRIPTION LENGTH PRINCIPLE (MDL)

We can re-interpret the integrated marginal likelihood of Eq. 20 as the joint likelihood of a microcanonical model given by (see footnote 7)

$P(\boldsymbol{A}|\boldsymbol{b}) = P(\boldsymbol{A}|\boldsymbol{e},\boldsymbol{b})P(\boldsymbol{e}|\boldsymbol{b}),$   (21)

where

$P(\boldsymbol{A}|\boldsymbol{e},\boldsymbol{b}) = \frac{\prod_{r<s} e_{rs}!\,\prod_r e_{rr}!!}{\prod_r n_r^{e_r}\,\prod_{i<j} A_{ij}!\,\prod_i A_{ii}!!},$   (22)

$P(\boldsymbol{e}|\boldsymbol{b}) = \prod_{r<s} \frac{\bar\lambda^{e_{rs}}}{(\bar\lambda+1)^{e_{rs}+1}} \prod_r \frac{\bar\lambda^{e_{rr}/2}}{(\bar\lambda+1)^{e_{rr}/2+1}} = \frac{\bar\lambda^E}{(\bar\lambda+1)^{E+B(B+1)/2}},$   (23)

and $\boldsymbol{e} = \{e_{rs}\}$ is the matrix of edge counts between groups. The term "microcanonical" — borrowed from statistical physics — means that model parameters correspond to "hard" constraints that are strictly imposed on the ensemble, as opposed to "soft" constraints that are obeyed only on average. In the particular case above, $P(\boldsymbol{A}|\boldsymbol{e},\boldsymbol{b})$ is the probability of generating a multigraph $\boldsymbol{A}$ where Eq. 1 is always fulfilled, i.e. the total number of edges between groups $r$ and $s$ is always exactly $e_{rs}$, without any fluctuation allowed between samples (see Ref. [19] for a combinatorial derivation). This contrasts with the parameter $\lambda_{rs}$ in Eq. 8, which determines only the average number of edges between groups, which fluctuates between samples. Conversely, the prior for the edge counts $P(\boldsymbol{e}|\boldsymbol{b})$ is a mixture of geometric distributions with average $\bar\lambda$, which does allow the edge counts to fluctuate, guaranteeing the overall equivalence. The fact that Eq. 21 holds is rather remarkable, since it means that — at least for the basic priors we used — these two kinds of model ("canonical" and microcanonical) cannot be distinguished from data, since their marginal likelihoods (and hence the posterior probability) are identical (see footnote 8).

With this microcanonical interpretation in mind, we may frame the posterior probability in an information-theoretical manner as follows. If a discrete variable $x$ occurs with a probability mass $P(x)$, the asymptotic amount of information necessary to describe it is $-\log_2 P(x)$ (if we choose bits as the unit of measurement), by using an optimal lossless coding scheme such as Huffman's algorithm [44]. With this in mind, we may write the numerator of the posterior distribution in Eq. 9 as

$P(\boldsymbol{A}|\boldsymbol{b})P(\boldsymbol{b}) = P(\boldsymbol{A}|\boldsymbol{e},\boldsymbol{b})P(\boldsymbol{e},\boldsymbol{b}) = 2^{-\Sigma},$   (24)

where the quantity

$\Sigma = -\log_2 P(\boldsymbol{A},\boldsymbol{e},\boldsymbol{b})$   (25)
$\;\;\;\, = -\log_2 P(\boldsymbol{A}|\boldsymbol{e},\boldsymbol{b}) - \log_2 P(\boldsymbol{e},\boldsymbol{b})$   (26)

is called the description length of the data [45, 46]. It corresponds to the asymptotic amount of information necessary to encode the data $\boldsymbol{A}$ together with the model parameters $\boldsymbol{e}$ and $\boldsymbol{b}$.
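As a concrete illustration of Eqs. 17 and 22–26, the sketch below (ours, assuming a dense numpy adjacency matrix; an illustrative computation, not the optimized implementation used in practice) evaluates the description length $\Sigma$ of a multigraph and a partition:

```python
import numpy as np
from scipy.special import gammaln

LOG2E = 1 / np.log(2)

def log2fact(x):
    """log2(x!) via the gamma function."""
    return gammaln(np.asarray(x, dtype=float) + 1) * LOG2E

def log2dfact_even(x):
    """log2(x!!) for even x, using x!! = 2**(x/2) * (x/2)!."""
    x = np.asarray(x, dtype=float)
    return x / 2 + log2fact(x / 2)

def description_length(A, b):
    """Sigma of Eq. (26) = -log2 P(A|e,b) P(e|b) P(b), via Eqs. (17), (22), (23).

    A : symmetric multigraph adjacency matrix, A[i,i] = twice the self-loops
    b : group labels in {0, ..., B-1}, with no empty group
    """
    A, b = np.asarray(A), np.asarray(b)
    N = len(b)
    n = np.bincount(b)                    # group sizes
    B = len(n)
    E = int(A.sum()) // 2                 # total number of edges
    onehot = np.eye(B, dtype=int)[b]
    e = onehot.T @ A @ onehot             # e_rs (diagonal doubled)
    er = e.sum(axis=1)                    # e_r, total degree of group r

    # -log2 P(A|e,b), Eq. (22)
    iu_B, iu_N = np.triu_indices(B, k=1), np.triu_indices(N, k=1)
    S = -log2fact(e[iu_B]).sum() - log2dfact_even(np.diag(e)).sum()
    S += (er * np.log2(n)).sum()
    S += log2fact(A[iu_N]).sum() + log2dfact_even(np.diag(A)).sum()

    # -log2 P(e|b), Eq. (23), with the global average lambda-bar
    lam = 2 * E / (B * (B + 1))
    S += -E * np.log2(lam) + (E + B * (B + 1) / 2) * np.log2(lam + 1)

    # -log2 P(b), Eq. (17)
    S += log2fact(N) - log2fact(n).sum() + np.log2(N)
    S += log2fact(N - 1) - log2fact(B - 1) - log2fact(N - B)
    return S
```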
Footnote 7: Some readers may wonder why Eq. 21 should not contain a sum, i.e. $P(\boldsymbol{A}|\boldsymbol{b}) = \sum_{\boldsymbol{e}} P(\boldsymbol{A}|\boldsymbol{e},\boldsymbol{b})P(\boldsymbol{e}|\boldsymbol{b})$. Indeed, that is the proper way to write a marginal likelihood. However, for the microcanonical model there is only one element of the sum that fulfills the constraint of Eq. 1, and thus yields a nonzero probability, making the marginal likelihood identical to the joint, as expressed in Eq. 21. The same is true for the partition prior of Eq. 17. We will use this fact in our notation throughout, and omit sums when they are unnecessary.

Footnote 8: This equivalence occurs for a variety of Bayesian models. For instance, if we flip a coin with a probability $p$ of coming up heads, the integrated likelihood under a uniform prior after $N$ trials in which $m$ heads were observed is $P(\boldsymbol{x}) = \int_0^1 p^m(1-p)^{N-m}\,\mathrm{d}p = (N-m)!\,m!/(N+1)!$. This is the same as the "microcanonical" model $P(\boldsymbol{x}) = P(\boldsymbol{x}|m)P(m)$ with $P(\boldsymbol{x}|m) = \binom{N}{m}^{-1}$ and $P(m) = 1/(N+1)$, i.e. the number of heads is sampled from a uniform distribution, and the coin flips are sampled randomly among those that have that exact number of heads.

[Figure 3: (a) the inferred partition drawn on the network; (b) a plot of the description length $\Sigma = -\log_2 P(\boldsymbol{A},\boldsymbol{b})$ (in bits) against the number of groups $B$, for the original and randomized data.] Figure 3. Bayesian inference of the SBM for a network of American college football teams [53]: (a) The partition that maximizes the posterior probability of Eq. 9, or equivalently, minimizes the description length of Eq. 24. Nodes marked in red are not classified according to the known division into "conferences." (b) Description length as a function of the number of groups of the corresponding optimal partition, both for the original and randomized data.

Therefore, if we find a network partition that maximizes the posterior distribution of Eq. 9, we are also automatically finding one which minimizes the description length (see footnote 9). With this, we can see how the Bayesian approach outlined above prevents overfitting: as the size of the model increases (via a larger number of occupied groups), it will constrain itself better to the data, and the amount of information necessary to describe the data when the model is known, $-\log_2 P(\boldsymbol{A}|\boldsymbol{e},\boldsymbol{b})$, will decrease. At the same time, the amount of information necessary to describe the model itself, $-\log_2 P(\boldsymbol{e},\boldsymbol{b})$, will increase as it becomes more complex. Therefore, the latter will function as a penalty (see footnote 10) that prevents the model from becoming overly complex, and the optimal choice will amount to a proper balance between both terms (see footnote 11). Among other things, this approach will allow us to properly estimate the dimension of the model — represented by the number of groups $B$ — in a parsimonious way.

We now illustrate this approach with a real-world dataset of American college football teams [53], where a node is a team and an edge exists if two teams played against each other in a season. If we find the partition that maximizes the posterior distribution, we uncover $B = 10$ groups, as can be seen in Fig. 3a. If we compare this partition with the known division of the teams into "conferences" [54, 55], we find that they match with a high degree of precision, with the exception of only a few nodes (see footnote 12).
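An analysis of this kind can be reproduced with off-the-shelf software; for instance, the graph-tool Python library maintained by the chapter's author ships with this dataset. The following sketch assumes a recent version of that library — exact function signatures and defaults may vary between releases:

```python
import numpy as np
import graph_tool.all as gt

g = gt.collection.data["football"]        # the network of Ref. [53]

# Find the partition minimizing the description length; here we ask for
# the plain (non-degree-corrected) SBM discussed so far.
state = gt.minimize_blockmodel_dl(g, state_args=dict(deg_corr=False))

print(state.get_B())                  # number of inferred groups
print(state.entropy() / np.log(2))    # description length, converted to bits
state.draw(output="football-sbm.pdf")
```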
In Fig. 3b we show the description length of the optimal partitions when we constrain them to have a pre-specified number of groups, which allows us to see how the approach penalizes both too simple and too complex models, with a global minimum at $B = 10$ — corresponding to the most compressive partition. Importantly, if we now randomize the network, by placing all its edges in a completely random fashion, we obtain instead a trivial partition into $B = 1$ group — indicating that the best model for this data is indeed a fully random graph. Hence, we see that this approach completely avoids the pitfall discussed in Sec. II and does not identify groups in fully random networks, and that the division shown in Fig. 3a points to a statistically significant structure in the data, that cannot be explained simply by random fluctuations.

Footnote 9: Sometimes the minimum description length principle (MDL) is considered as an alternative method to Bayesian inference. Although it is possible to apply MDL in a manner that makes the connection with Bayesian inference difficult, as for example with the normalized maximum likelihood scheme [47, 48], in its more direct and tractable form it is fully equivalent to the Bayesian approach [46]. Note also that we do not in fact require the connection with microcanonical models made here, as the description length can be defined directly as $\Sigma = -\log_2 P(\boldsymbol{A},\boldsymbol{b})$, without referring explicitly to internal model parameters.

Footnote 10: Some readers may notice the similarity between Eq. 26 and other penalty-based criteria, such as BIC [49] and AIC [50]. Although all these criteria share the same overall interpretation, BIC and AIC rely on specific assumptions about the asymptotic shape of the model likelihood, which are known to be invalid for the SBM [51], unlike Eq. 26 which is exact.

Footnote 11: An important result in information theory states that compressing random data is asymptotically impossible [52]. This lies at the heart of the effectiveness of the MDL approach in preventing overfitting, as incorporating randomness into the model description cannot be used to reduce the description length.

Footnote 12: Care should be taken when comparing with "known" divisions in this manner, as there is no guarantee that the available metadata is in fact relevant for the network structure. See Refs. [56–58] for more detailed discussions.

[Figure 4: (a) the inferred partition of the clique network; (b) a plot of the minimum description length $\Sigma = -\log_2 P(\boldsymbol{A},\boldsymbol{b})$ (in bits) against the number of groups $B$, for the SBM and the nested SBM.] Figure 4. Inference of the SBM on a simple artificial network composed of 64 cliques of size 10, illustrating the underfitting problem: (a) The partition that maximizes the posterior probability of Eq. 9, or equivalently, minimizes the description length of Eq. 24. The 64 cliques are grouped into 32 groups composed of two cliques each. (b) Minimum description length as a function of the number of groups of the corresponding partition, both for the SBM and its nested variant, which is less susceptible to underfitting, and puts each of the 64 cliques in its own group.

VI. THE "RESOLUTION LIMIT" UNDERFITTING PROBLEM, AND THE NESTED SBM

Although the Bayesian approach outlined above is in general protected against overfitting, it is still susceptible to underfitting, i.e. mistaking statistically significant structure for randomness, resulting in the inference of an overly simplistic model. This happens whenever there is a large discrepancy between our prior assumptions and what is observed in the data.
We illustrate this problem with a simple example: consider a network formed of 64 isolated cliques of size 10, as shown in Fig. 4a. If we employ the approach described in the previous section, and maximize the posterior of Eq. 9, we obtain a partition into $B = 32$ groups, where each group is composed of two cliques. This is a fairly unsatisfying characterization of this network, and also somewhat perplexing, since the probability that the inferred SBM will generate the observed network — i.e. each of the 32 groups will simultaneously and spontaneously split into two disjoint cliques — is vanishingly small. Indeed, intuitively it seems we should do significantly better with this rather obvious example, and that the best fit would be to put each of the cliques in its own group.

In order to see what went wrong, we need to revisit our prior assumptions, in particular our choice for $P(\boldsymbol{\lambda}|\boldsymbol{b})$ in Eq. 19, or equivalently, our choice of $P(\boldsymbol{e}|\boldsymbol{b})$ in Eq. 23 for the microcanonical formulation. In both cases, they correspond to uninformative priors, which put approximately equal weight on all allowed types of large-scale structures. As argued before, this seems reasonable at first, since we should not bias our model before we observe the data. However, the implication of this choice is that we expect a priori the structure of the network at the aggregate group level, i.e. considering only the groups and the edges between them (not the individual nodes), to be fully random. This is indeed not the case in the simple example of Fig. 4, and in fact it is unlikely to be the case for most networks that we encounter, which will probably be structured at a higher level as well.

The unfavorable outcome of the uninformative assumption can also be seen by inspecting its effect on the description length of Eq. 24. If we revisit our simple model with $C$ cliques of size $m$, grouped uniformly into $B$ groups of size $C/B$, and we assume that these values are sufficiently large so that Stirling's factorial approximation $\ln x! \approx x\ln x - x$ can be used, the description length becomes

$\Sigma \approx -(E-N)\log_2 B + \frac{B(B+1)}{2}\log_2 E,$   (27)

where $N = Cm$ is the total number of nodes and $E = C\binom{m}{2}$ is the total number of edges, and we have omitted terms that do not depend on $B$. From this, we see that if we increase the number of groups $B$, this incurs a quadratic penalty in the description length given by the second term of Eq. 27, which originates precisely from our expression for $P(\boldsymbol{e}|\boldsymbol{b})$: it corresponds to the amount of information necessary to describe all entries of a symmetric $B\times B$ matrix that takes independent values between 0 and $E$. Indeed, a slightly more careful analysis of the scaling of the description length [19, 29] reveals that this approach is unable to uncover a number of groups that is larger than $B_{\max} \propto \sqrt{N}$, even if their existence is obvious, as in our example of Fig. 4 (see footnote 13).

Trying to avoid this limitation might seem like a conundrum, since replacing the uninformative prior for $P(\boldsymbol{e}|\boldsymbol{b})$ amounts to making a more definite statement on the most likely large-scale structures that we expect to find, which we might hesitate to stipulate, as this is precisely what we want to discover from the data in the first place, and we want to remain unbiased.

Footnote 13: This same problem occurs for slight variations of the SBM and corresponding priors, provided they are uninformative, such as those in Refs. [30, 31, 33], and also with other penalty-based approaches that rely on a functional form similar to Eq. 27 [59]. Furthermore, this limitation is conspicuously similar to the "resolution limit" present in the popular heuristic of modularity maximization [60], although it is not yet clear if a deeper connection exists between both phenomena.
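The penalty of Eq. 27 is easy to evaluate numerically. The sketch below (ours, for illustration; recall that Eq. 27 omits $B$-independent terms, so only differences between values of $B$ are meaningful) shows how the quadratic term makes the "correct" answer $B = 64$ far more expensive than coarser partitions:

```python
import numpy as np

C, m = 64, 10                    # 64 cliques of size 10
N = C * m                        # total number of nodes
E = C * m * (m - 1) // 2         # total number of edges

def sigma_eq27(B):
    """Description length of Eq. (27), up to B-independent terms."""
    return -(E - N) * np.log2(B) + B * (B + 1) / 2 * np.log2(E)

for B in (8, 16, 32, 64):
    print(B, round(sigma_eq27(B)))
# The B**2 penalty coming from P(e|b) makes B = 64 (one group per clique)
# much more costly than merged solutions -- the "resolution limit" at work.
```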
Luckily, there is in fact a general approach available to us to deal with this problem: we postpone our decision about the higher-order aspects of the model until we observe the data. In fact, we already saw this approach in action when we decided on the prior for the partitions. We do so by replacing the uninformative prior with a parametric distribution, whose parameters are in turn modelled by another distribution, i.e. a hyperprior. The parameters of the prior then become latent variables that are learned from data, allowing us to uncover further structures, while remaining unbiased.

The microcanonical formulation allows us to proceed in this direction in a straightforward manner, as we can interpret the matrix of edge counts $\boldsymbol{e}$ as the adjacency matrix of a multigraph where each of the groups is represented as a single node. Within this interpretation, an elegant solution presents itself, where we describe the matrix $\boldsymbol{e}$ with another SBM, i.e. we partition each of the groups into meta-groups, and the edges between groups are placed according to the edge counts between meta-groups. For this second SBM, we can proceed in the same manner, and model it by a third SBM, and so on, forming a nested hierarchy, as illustrated in Fig. 5 [61]. More precisely, if we denote by $B_l$, $\boldsymbol{b}_l$ and $\boldsymbol{e}_l$ the number of groups, the partition and the matrix of edge counts at level $l\in\{0,\dots,L\}$, we have

$P(\boldsymbol{e}_l|\boldsymbol{b}_{l-1},\boldsymbol{e}_{l+1},\boldsymbol{b}_l) = \prod_{r<s}\left(\!\!\binom{n^l_r n^l_s}{e^{l+1}_{rs}}\!\!\right)^{-1} \prod_r \left(\!\!\binom{n^l_r(n^l_r+1)/2}{e^{l+1}_{rr}/2}\!\!\right)^{-1},$   (28)

with $\left(\!\!\binom{n}{m}\!\!\right) = \binom{n+m-1}{m}$ counting the number of $m$-combinations with repetitions from a set of size $n$. Eq. 28 is the likelihood of a maximum-entropy multigraph SBM, i.e. every multigraph occurs with the same probability, provided it fulfills the imposed constraints (see footnote 14) [17].

[Figure 5: (a) a diagram of the nested model, with the observed network ($N$ nodes, $E$ edges) at level $l = 0$, and successively coarser multigraphs with $B_0$, $B_1$ and $B_2$ nodes at levels $l = 1, 2, 3$; (b) and (c) scatter plots of the average group size $N/B$ against the number of edges $E$ for a variety of empirical networks.] Figure 5. (a) Diagrammatic representation of the nested SBM described in the text, with $L = 3$ levels, adapted from Ref. [61]. (b) Average group sizes $N/B$ obtained with the SBM using uninformative priors, for a variety of empirical networks, listed in Ref. [61]. The dashed line shows a slope $\sqrt{E}$, highlighting the systematic underfitting problem. (c) The same as in (b), but using the nested SBM, where the underfitting has virtually disappeared, with datasets randomly scattered in the allowed range.

The prior for the partitions is again given by Eq. 17, now applied at each level of the hierarchy,
$P(\boldsymbol{b}_l) = \frac{\prod_r n^l_r!}{B_{l-1}!}\binom{B_{l-1}-1}{B_l-1}^{-1}\frac{1}{B_{l-1}},$   (29)

with $B_{-1} = N$, so that the joint probability of the data, the edge counts and the hierarchical partition $\{\boldsymbol{b}_l\}$ becomes

$P(\boldsymbol{A},\{\boldsymbol{e}_l\},\{\boldsymbol{b}_l\}|L) = P(\boldsymbol{A}|\boldsymbol{e}_1,\boldsymbol{b}_0)P(\boldsymbol{b}_0)\prod_{l=1}^{L} P(\boldsymbol{e}_l|\boldsymbol{b}_{l-1},\boldsymbol{e}_{l+1},\boldsymbol{b}_l)P(\boldsymbol{b}_l),$   (30)

where we impose the boundary conditions $B_L = 1$ and $P(\boldsymbol{b}_L) = 1$. We can treat the hierarchy depth $L$ as a latent variable as well, by placing a prior on it, $P(L) = 1/L_{\max}$, where $L_{\max}$ is the maximum value allowed. But since this only contributes an overall multiplicative constant, it has no effect on the posterior distribution, and thus can be omitted. If we impose $L = 1$, we recover the uninformative prior for $\boldsymbol{e} = \boldsymbol{e}_1$,

$P(\boldsymbol{e}|\boldsymbol{b}_0) = \left(\!\!\binom{B(B+1)/2}{E}\!\!\right)^{-1},$   (31)

which differs from Eq. 23 only in that the number of edges $E$ is not allowed to fluctuate (see footnote 15). The inference of this model is done in the same manner as the uninformative one, by obtaining the posterior distribution of the hierarchical partition

$P(\{\boldsymbol{b}_l\}|\boldsymbol{A}) = \frac{P(\boldsymbol{A},\{\boldsymbol{b}_l\})}{P(\boldsymbol{A})} = \frac{P(\boldsymbol{A},\{\boldsymbol{e}_l\},\{\boldsymbol{b}_l\})}{P(\boldsymbol{A})},$   (32)

and the description length is given analogously by

$\Sigma = -\log_2 P(\boldsymbol{A}|\{\boldsymbol{e}_l\},\{\boldsymbol{b}_l\}) - \log_2 P(\{\boldsymbol{e}_l\},\{\boldsymbol{b}_l\}).$   (33)

This approach has a series of advantages; in particular, we remain a priori agnostic with respect to what kind of large-scale structure is present in the network, having constrained ourselves simply in that it can be represented as an SBM at a higher level, with the uninformative prior as a special case. Despite this, we are able to overcome the underfitting problem encountered with the uninformative approach: if we apply this model to the example of Fig. 4, we can successfully distinguish all 64 cliques, and provide a lower overall description length for the data, as can be seen in Fig. 4b. More generally, by investigating the properties of the model likelihood, it is possible to show that the maximum number of groups that can be uncovered with this model scales as $B_{\max} \propto N/\log N$, which is significantly larger than the limit with uninformative priors [19, 61]. The difference between both approaches manifests itself very often in practice, as shown in Fig. 5b, where systematic underfitting is observed for a wide variety of network datasets, which disappears with the nested model, as seen in Fig. 5c. Crucially, we achieve this decreased tendency to underfit without sacrificing our protection against overfitting: despite the more elaborate model specification, the inference of the nested SBM is completely nonparametric, and the same Bayesian and information-theoretical principles still hold. Furthermore, as we already mentioned, the uninformative case is a special case of the nested SBM, i.e. when $L = 1$, and hence the nested model can only improve the inference (e.g. by reducing the description length), with no drawbacks.

Footnote 14: Note that we cannot use in the upper levels exactly the same model we use in the bottom level, given by Eq. 22, as most terms in the subsequent levels will cancel out. This happens because the model in Eq. 22 is based on the uniform generation of configurations, not multigraphs [19]. However, we are free to use Eq. 28 in the bottom level as well.

Footnote 15: The prior of Eq. 31 and the hierarchy in Eq. 30 are conditioned on the total number of edges $E$, which is typically unknown before we observe the data. Similarly to the parameter $\bar\lambda$ in the canonical model formulation, the strictly correct approach would be to consider this quantity as an additional model parameter, with its own prior distribution $P(E)$. However, in the microcanonical model there is no integration involved, and $P(E)$ — regardless of how we specify it — would contribute an overall multiplicative constant that disappears from the posterior distribution after normalization. Therefore we can simply omit it.
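In practice the nested model can be fitted with the same library mentioned above; again a sketch assuming a recent graph-tool version, with the dataset choice merely illustrative:

```python
import numpy as np
import graph_tool.all as gt

g = gt.collection.data["football"]

# Fit the nested SBM; the hierarchy depth L and the number of groups at
# each level are inferred from the data, not fixed beforehand.
state = gt.minimize_nested_blockmodel_dl(g)

state.print_summary()                 # group counts B_l at each level
print(state.entropy() / np.log(2))    # description length of Eq. (33), bits

# Inspect a single level of the hierarchy as a flat partition:
b0 = state.project_level(0).get_blocks()
state.draw(output="football-nested.pdf")
```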
We stress that the number of hierarchy levels, as with any other dimension of the model, such as the number of groups in each level, is inferred from the data, and does not need to be determined a priori.

In addition to the above, the nested model also gives us the capacity to describe the data at multiple scales, which could potentially exhibit different mixing patterns. This is particularly useful for large networks, where the SBM might still give us a very complex description, which becomes easier to interpret if we concentrate first on the upper levels of the hierarchy. A good example is the result obtained for the internet topology at the autonomous systems level, shown in Fig. 6. The lowest level of the hierarchy shows a division into a large number of groups, with a fairly complicated structure, whereas the higher levels show an increasingly simplified picture, culminating in a core-periphery organization as the dominating pattern.

[Figure 6: the inferred hierarchical partition drawn on the internet topology.] Figure 6. Fit of the (degree-corrected) nested SBM for the internet topology at the autonomous systems level, adapted from Ref. [61]. The hierarchical division reveals a core-periphery organization at the higher levels, where most routes go through a relatively small number of nodes (shown in the inset and in the map). The lower levels reveal a more detailed picture, where a large number of groups of nodes are identified according to their routing patterns (amounting largely to distinct geographical regions). The layout is obtained with an edge bundling algorithm by Holten [62], which uses the hierarchical partition to route the edges.

VII. MODEL VARIATIONS

Varying the number of groups and building hierarchies are not the only ways we have of adapting the complexity of the model to the data. We may also change the internal structure of the model, and how the division into groups affects the placement of edges. In fact, the basic ansatz of the SBM is very versatile, and many variations have been proposed in the literature. In this section we review two important ones — SBMs with degree correction and group overlap — and survey other model flavors in a summarized manner.

Before we go further into the model variations, we point out that the multiplicity of models is a strength of the inference approach. This is different from the broader field of network clustering, where a large number of available algorithms often yield conflicting results for the same data, leaving practitioners lost in how to select between them [63, 64]. Instead, within the inference framework we can in fact compare different models in a principled manner and select the best one according to the statistical evidence available. We proceed with a general outline of the model selection procedure, before following with specific model variations.
A. Model selection

Suppose we define two versions of the SBM, labeled $\mathcal{C}_1$ and $\mathcal{C}_2$, each with its own posterior distribution of partitions, $P(\boldsymbol{b}|\boldsymbol{A},\mathcal{C}_1)$ and $P(\boldsymbol{b}|\boldsymbol{A},\mathcal{C}_2)$. Suppose we find the most likely partitions $\boldsymbol{b}_1$ and $\boldsymbol{b}_2$, according to $\mathcal{C}_1$ and $\mathcal{C}_2$, respectively. How do we decide which partition is more representative of the data? The consistent approach is to obtain the so-called posterior odds ratio [3, 65]

$\Lambda = \frac{P(\boldsymbol{b}_1,\mathcal{C}_1|\boldsymbol{A})}{P(\boldsymbol{b}_2,\mathcal{C}_2|\boldsymbol{A})} = \frac{P(\boldsymbol{A}|\boldsymbol{b}_1,\mathcal{C}_1)P(\boldsymbol{b}_1)P(\mathcal{C}_1)}{P(\boldsymbol{A}|\boldsymbol{b}_2,\mathcal{C}_2)P(\boldsymbol{b}_2)P(\mathcal{C}_2)},$   (34)

where $P(\mathcal{C})$ is our prior belief that variant $\mathcal{C}$ is valid. A value of $\Lambda > 1$ indicates that the choice $(\boldsymbol{b}_1,\mathcal{C}_1)$ is $\Lambda$ times more plausible as an explanation for the data than the alternative, $(\boldsymbol{b}_2,\mathcal{C}_2)$. If we are a priori agnostic with respect to which model flavor is best, i.e. $P(\mathcal{C}_1) = P(\mathcal{C}_2)$, we have then

$\Lambda = \frac{P(\boldsymbol{A}|\boldsymbol{b}_1,\mathcal{C}_1)P(\boldsymbol{b}_1)}{P(\boldsymbol{A}|\boldsymbol{b}_2,\mathcal{C}_2)P(\boldsymbol{b}_2)} = 2^{-\Delta\Sigma},$   (35)

where $\Delta\Sigma = \Sigma_1 - \Sigma_2$ is the description length difference between both choices. Hence, we should generally prefer the model choice that is most compressive, i.e. with the smallest description length. However, if the value of $\Lambda$ is close to 1, we should refrain from forcefully rejecting the alternative, as the evidence in the data would not be strongly decisive either way. That is, the actual value of $\Lambda$ gives us the confidence with which we can choose the preferred model. The final decision, however, is subjective, since it depends on what we might consider plausible. A value of $\Lambda = 2$, for example, typically cannot be used to forcefully reject the alternative hypothesis, whereas a value of $\Lambda = 10^{100}$ might.

An alternative test we can make is to decide which model class is most representative of the data, when averaged over all possible partitions. In this case, we proceed in an analogous way by computing the posterior odds ratio

$\Lambda' = \frac{P(\mathcal{C}_1|\boldsymbol{A})}{P(\mathcal{C}_2|\boldsymbol{A})} = \frac{P(\boldsymbol{A}|\mathcal{C}_1)P(\mathcal{C}_1)}{P(\boldsymbol{A}|\mathcal{C}_2)P(\mathcal{C}_2)},$   (36)

where

$P(\boldsymbol{A}|\mathcal{C}) = \sum_{\boldsymbol{b}} P(\boldsymbol{A}|\boldsymbol{b},\mathcal{C})P(\boldsymbol{b})$   (37)

is the model evidence. When $P(\mathcal{C}_1) = P(\mathcal{C}_2)$, $\Lambda'$ is called the Bayes factor, with an interpretation analogous to $\Lambda$ above, but where the statement is made with respect to all possible partitions, not only the most likely one. Unfortunately, as mentioned previously, the evidence $P(\boldsymbol{A}|\mathcal{C})$ cannot be computed exactly for the models we are interested in, making this criterion more difficult to employ in practice (although approximations have been proposed, see e.g. Ref. [19]). We return to the issue of when we should optimize or sample from the posterior distribution in Sec. IX, and hence which of the two criteria should be used.
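Since Eq. 35 reduces model selection to a difference of description lengths, it is convenient to work in log-space; a minimal sketch of ours, using as illustration the description lengths reported later in Fig. 9:

```python
def log2_posterior_odds(sigma1, sigma2):
    """log2(Lambda) of Eq. (35) for equal model priors P(C1) = P(C2):
    Lambda = 2**-(sigma1 - sigma2). Log-space avoids numerical underflow,
    since description lengths of real networks differ by thousands of bits."""
    return -(sigma1 - sigma2)

# Description lengths (in bits) of two fits of the political blog network
# reported in Fig. 9, panels (a) and (c)
print(log2_posterior_odds(89938, 84890))   # -5048: decisively favors fit (c)
```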
B. Degree correction

The underlying assumption of all variants of the SBM considered so far is that nodes that belong to the same group are statistically equivalent. As it turns out, this fundamental aspect results in a very unrealistic property: namely, this generative process implies that all nodes that belong to the same group receive on average the same number of edges. However, a common property of many empirical networks is that they have very heterogeneous degrees, often broadly distributed over several orders of magnitude [15]. Therefore, in order for this property to be reproduced by the SBM, it is necessary to group nodes according to their degree, which may lead to some seemingly odd results. An example of this was given in Ref. [16] and is shown in Fig. 7a. It corresponds to a fit of the SBM to a network of political blogs recorded during the 2004 American presidential election campaign [66], where an edge exists between two blogs if one links to the other. If we guide ourselves by the layout of the figure, we identify two assortative groups, which happen to be those aligned with the Republican and Democratic parties. However, inside each group there is a significant variation in degree, with a few nodes with many connections and many with very few. Because of what has just been explained, if we perform a fit of the SBM using only $B = 2$ groups, it prefers to cluster the nodes into high-degree and low-degree groups, completely ignoring the party alliance (see footnote 16). Arguably, this is a bad fit of this network, since — similarly to the underfitting example of Fig. 4 — the probability of the fitted SBM generating a network with such a party structure is vanishingly small.

In order to resolve this undesired behavior, Karrer and Newman [16] proposed a modified model, which they dubbed the degree-corrected SBM (DC-SBM). In this variation, each node $i$ is attributed a parameter $\theta_i$ that controls its expected degree, independently of its group membership. Given this extra set of parameters, a network is generated with probability

$P(\boldsymbol{A}|\boldsymbol{\lambda},\boldsymbol{\theta},\boldsymbol{b}) = \prod_{i<j} \frac{e^{-\theta_i\theta_j\lambda_{b_i,b_j}}(\theta_i\theta_j\lambda_{b_i,b_j})^{A_{ij}}}{A_{ij}!} \times \prod_i \frac{e^{-\theta_i^2\lambda_{b_i,b_i}/2}(\theta_i^2\lambda_{b_i,b_i}/2)^{A_{ii}/2}}{(A_{ii}/2)!},$   (38)

where $\lambda_{rs}$ again controls the expected number of edges between groups $r$ and $s$. Note that since the parameters $\lambda_{rs}$ and $\theta_i$ always appear multiplying each other in the likelihood, their individual values may be arbitrarily scaled, provided their products remain the same. If we choose the parametrization $\sum_i \theta_i\delta_{b_i,r} = 1$ for every group $r$, then they acquire a simple interpretation: $\lambda_{rs}$ is the expected number of edges between groups $r$ and $s$, $\lambda_{rs} = \langle e_{rs}\rangle$, and $\theta_i$ is proportional to the expected degree of node $i$, $\theta_i = \langle k_i\rangle/\sum_s \lambda_{b_i,s}$. When inferring this model from the political blogs data — again forcing $B = 2$ — we obtain a much more satisfying result, where the two political factions are neatly identified, as seen in Fig. 7b. As this model is capable of fully decoupling the community structure from the degrees, which are captured separately by the parameters $\boldsymbol{\lambda}$ and $\boldsymbol{\theta}$, respectively, the degree heterogeneity of the network does not interfere with the identification of the political factions.

[Figure 7: (a) and (b), the same network drawn with two inferred partitions.] Figure 7. Inferred partition for a network of political blogs [66] using (a) the SBM and (b) the DC-SBM, in both cases forcing $B = 2$ groups. The node sizes are proportional to the node degrees. The SBM divides the network into low- and high-degree groups, whereas the DC-SBM prefers the division into political factions.

Footnote 16: It is possible that unexpected results of this kind inhibited the initial adoption of SBM methods in the network science community, which focused instead on more heuristic community detection methods, save for a few exceptions (e.g. [20, 22, 23, 25, 67, 68]).
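The generative direction of Eq. 38 is again straightforward to implement; a minimal sketch of ours extending the earlier Poisson sampler (illustrative, assuming numpy):

```python
import numpy as np

def sample_dc_sbm(b, lam, theta, rng=None):
    """Sample a multigraph from the degree-corrected SBM of Eq. (38).

    b     : group labels in {0, ..., B-1}
    lam   : symmetric (B, B) matrix of expected edge counts between groups,
            under the normalization sum_i theta_i delta_{b_i, r} = 1
    theta : per-node degree propensities
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(b)
    rates = np.outer(theta, theta) * lam[np.ix_(b, b)]
    A = np.zeros((N, N), dtype=int)
    iu = np.triu_indices(N, k=1)
    A[iu] = rng.poisson(rates[iu])
    A += A.T
    A[np.diag_indices(N)] = 2 * rng.poisson(theta**2 * lam[b, b] / 2)
    return A

# Two groups with heavy-tailed degree propensities, normalized per group
b = np.repeat([0, 1], 100)
theta = np.random.default_rng(1).pareto(2.5, size=200) + 1
for r in (0, 1):
    theta[b == r] /= theta[b == r].sum()
lam = np.array([[400.0, 40.0],
                [40.0, 400.0]])
A = sample_dc_sbm(b, lam, theta)
```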
Based on the above example, and on the knowledge that most networks possess heterogeneous degrees, we could expect the DC-SBM to provide a better fit for most of them. However, before we jump to this conclusion, we must first acknowledge that the seemingly increased quality of fit obtained with the DC-SBM came at the expense of adding an extra set of parameters, $\bm{\theta}$ [51]. However intuitive we might judge the improvement brought on by degree correction, simply adding more parameters to a model is an almost sure recipe for overfitting. Therefore, a more prudent approach is once more to frame the inference problem in a Bayesian way, by focusing on the posterior distribution $P(\bm{b}\mid\bm{A})$, and on the description length. For this, we must include a prior for the node propensities $\bm{\theta}$. The uninformative choice is the one which ascribes the same probability to all possible choices,

$$P(\bm{\theta}\mid\bm{b}) = \prod_r (n_r-1)!\,\delta\!\Big(\sum_i\theta_i\delta_{b_i,r} - 1\Big). \qquad (39)$$

Using again an uninformative prior for $\bm{\lambda}$,

$$P(\bm{\lambda}\mid\bm{b}) = \prod_{r\le s}\frac{e^{-\lambda_{rs}/[(1+\delta_{rs})\bar\lambda]}}{(1+\delta_{rs})\bar\lambda}, \qquad (40)$$

with $\bar\lambda = 2E/B(B+1)$, the marginal likelihood now becomes

$$P(\bm{A}\mid\bm{b}) = \int P(\bm{A}\mid\bm{\lambda},\bm{\theta},\bm{b})\,P(\bm{\lambda}\mid\bm{b})\,P(\bm{\theta}\mid\bm{b})\,\mathrm{d}\bm{\lambda}\,\mathrm{d}\bm{\theta} = \frac{\bar\lambda^E}{(\bar\lambda+1)^{E+B(B+1)/2}}\times\frac{\prod_{r<s}e_{rs}!\,\prod_r e_{rr}!!}{\prod_{i<j}A_{ij}!\,\prod_i A_{ii}!!}\times\prod_r\frac{(n_r-1)!}{(e_r+n_r-1)!}\times\prod_i k_i!, \qquad (41)$$

where $k_i = \sum_j A_{ij}$ is the degree of node $i$, which can be used in the same way to obtain a posterior for $\bm{b}$, via Eq. 9. Once more, the model above is equivalent to a microcanonical formulation [19], given by

$$P(\bm{A}\mid\bm{b}) = P(\bm{A}\mid\bm{k},\bm{e},\bm{b})\,P(\bm{k}\mid\bm{e},\bm{b})\,P(\bm{e}\mid\bm{b}), \qquad (42)$$

with

$$P(\bm{A}\mid\bm{k},\bm{e},\bm{b}) = \frac{\prod_{r<s}e_{rs}!\,\prod_r e_{rr}!!\,\prod_i k_i!}{\prod_{i<j}A_{ij}!\,\prod_i A_{ii}!!\,\prod_r e_r!}, \qquad (43)$$

$$P(\bm{k}\mid\bm{e},\bm{b}) = \prod_r\left(\!\!\binom{n_r}{e_r}\!\!\right)^{-1}, \qquad (44)$$

and $P(\bm{e}\mid\bm{b})$ given by Eq. 23, where $\left(\!\!\binom{n}{m}\!\!\right) = \binom{n+m-1}{m}$ counts the sequences of $n$ non-negative integers that sum to $m$. In the model above, $P(\bm{A}\mid\bm{k},\bm{e},\bm{b})$ is the probability of generating a multigraph where the edge counts between groups as well as the degrees $\bm{k}$ are fixed to specific values (see footnote 17). The prior $P(\bm{k}\mid\bm{e},\bm{b})$ is the uniform probability of generating a degree sequence, where all possibilities that satisfy the constraints imposed by the edge counts $\bm{e}$, namely $\sum_i k_i\delta_{b_i,r} = e_r$, occur with the same probability.

Footnote 17: The ensemble equivalence of Eq. 42 is in some ways more remarkable than for the traditional SBM. This is because a direct equivalence between the ensembles of Eqs. 38 and 43 is not satisfied even in the asymptotic limit of large networks [17, 69], which does happen for Eqs. 8 and 22. Equivalence is observed only if the individual degrees $k_i$ also become asymptotically large. However, when the parameters $\bm{\lambda}$ and $\bm{\theta}$ are integrated out, the equivalence becomes exact for networks of any size.

Figure 8. Illustration of the generative process of the microcanonical DC-SBM. Given a partition of the nodes, the edge counts between groups are sampled (a), followed by the degrees of the nodes (b), and finally the network itself (c). Adapted from Ref. [19].

The description length of this model is then given by

$$\Sigma = -\log_2 P(\bm{A},\bm{b}) = -\log_2 P(\bm{A}\mid\bm{k},\bm{e},\bm{b}) - \log_2 P(\bm{k},\bm{e},\bm{b}). \qquad (45)$$

Because uninformative priors were used to derive the above equations, we are once more subject to the same underfitting problem described previously. Luckily, from the microcanonical model we can again derive a nested DC-SBM, by replacing $P(\bm{e})$ by a nested sequence of SBMs, in exactly the same way as was done before [19, 61]. We also have the opportunity of replacing the uninformative prior for the degrees in Eq. 44 with a more realistic option.
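Before doing so, it may help to make the counting behind Eq. 44 explicit. The following one-function sketch (ours, under the multiset-coefficient reading of Eq. 44 given above) evaluates the description length cost of a uniform degree sequence for one group:

```python
# Sketch: -log2 P(k | e, b) contribution of one group under Eq. 44, i.e.
# the log of the number of degree sequences of n_r non-negative integers
# summing to e_r, which is the multiset coefficient C(n_r + e_r - 1, e_r).

from math import comb, log2

def degree_prior_bits(n_r, e_r):
    return log2(comb(n_r + e_r - 1, e_r))
```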
As was argued in Ref. [19], degree sequences generated by Eq. 44 result in exponential degree distributions, which are not quite as heterogeneous as what is often encountered in practice. A more refined approach, which is already familiar to us at this point, is to increase the Bayesian hierarchy, and choose a prior that is conditioned on a higher-order aspect of the data, in this case the frequency of degrees, i.e.

$$P(\bm{k}\mid\bm{e},\bm{b}) = P(\bm{k}\mid\bm{e},\bm{b},\bm{\eta})\,P(\bm{\eta}\mid\bm{e},\bm{b}), \qquad (46)$$

where $\bm{\eta} = \{\eta_{rk}\}$, with $\eta_{rk}$ being the number of nodes of degree $k$ in group $r$. In the above, $P(\bm{\eta}\mid\bm{e},\bm{b})$ is a uniform distribution over frequencies, and $P(\bm{k}\mid\bm{e},\bm{b},\bm{\eta})$ generates the degrees according to the sampled frequencies (we omit the respective expressions for brevity, and refer to Ref. [19] instead). Thus, this model is capable of using regularities in the degree distribution to inform the division into groups, and is generally capable of better fits than the uniform model of Eq. 44.

If we apply this nonparametric approach to the same political blogs network [66], we find a much more detailed picture of its structure, revealing many more than two groups, as shown in Fig. 9 for three model variants: the nested SBM, the nested DC-SBM, and the nested DC-SBM with the degree prior of Eq. 46. All three model variants are in fact capable of identifying the same Republican/Democrat division at the topmost hierarchical level — showing that the non-degree-corrected SBM is not as inept at capturing this aspect of the data as the result obtained by forcing $B = 2$ might suggest. However, the internal divisions of both factions that they uncover are very distinct from each other. If we inspect the values of the description length obtained with each model, we see that the DC-SBM (in particular when using Eq. 46) results in a smaller value, indicating that it better captures the structure of the data, despite the increased number of parameters. Indeed, a systematic analysis carried out in Ref. [19] showed that the DC-SBM does in fact yield shorter description lengths for the majority of empirical datasets, thus ultimately confirming the original intuition behind the model formulation.

Figure 9. Most likely hierarchical partitions of a network of political blogs [66], according to the three model variants considered, as well as the inferred number of groups $B_1$ at the bottom of the hierarchy, and the description length $\Sigma$: (a) NDC-SBM, $B_1 = 42$, $\Sigma \approx 89{,}938$ bits; (b) DC-SBM, $B_1 = 23$, $\Sigma \approx 87{,}162$ bits; (c) DC-SBM with the degree prior of Eq. 46, $B_1 = 20$, $\Sigma \approx 84{,}890$ bits. The nodes circled in blue were classified as "liberal" and the remaining ones as "conservative" in Ref. [66], based on the blog contents. Adapted from Ref. [19].
C. Group overlaps

Another way we can change the internal structure of the model is to allow the groups to overlap, i.e. we allow a node to belong to more than one group at the same time. The connection patterns of the nodes are then assumed to be a mixture of the "pure" groups, which results in a richer type of model [24]. Following Ball et al. [70], we can adapt the Poisson formulation to overlapping SBMs in a straightforward manner,

$$P(\bm{A}\mid\bm{\kappa},\bm{\lambda}) = \prod_{i<j}\frac{e^{-\lambda_{ij}}\lambda_{ij}^{A_{ij}}}{A_{ij}!}\prod_i\frac{e^{-\lambda_{ii}/2}(\lambda_{ii}/2)^{A_{ii}/2}}{(A_{ii}/2)!}, \qquad (47)$$

with

$$\lambda_{ij} = \sum_{rs}\kappa_{ir}\lambda_{rs}\kappa_{js}, \qquad (48)$$

where $\kappa_{ir}$ is the probability with which node $i$ is chosen from group $r$, so that $\sum_i\kappa_{ir} = 1$, and $\lambda_{rs}$ is once more the expected number of edges between groups $r$ and $s$. The parameters $\bm{\kappa}$ replace the disjoint partition $\bm{b}$ we have been using so far by a "soft" clustering into overlapping categories (see footnote 18). Note, however, that this model is a direct generalization of the non-overlapping DC-SBM of Eq. 38, which is recovered simply by choosing $\kappa_{ir} = \theta_i\delta_{r,b_i}$.

Footnote 18: Note that, differently from the non-overlapping case, here it is possible for a node not to belong to any group, in which case it will never receive an incident edge.

The Bayesian formulation can also be performed by using an uninformative prior for $\bm{\kappa}$,

$$P(\bm{\kappa}) = \prod_r (N-1)!\,\delta\!\Big(\sum_i\kappa_{ir} - 1\Big), \qquad (49)$$

in addition to the same prior for $\bm{\lambda}$ of Eq. 40. Unfortunately, computing the marginal likelihood using Eq. 47 directly,

$$P(\bm{A}\mid\bm{\kappa}) = \int P(\bm{A}\mid\bm{\kappa},\bm{\lambda})\,P(\bm{\lambda})\,\mathrm{d}\bm{\lambda}, \qquad (50)$$

is not tractable, which prevents us from obtaining the posterior $P(\bm{\kappa}\mid\bm{A})$. Instead, it is more useful to consider the auxiliary labelled matrix, or tensor, $\bm{G} = \{G^{rs}_{ij}\}$, where $G^{rs}_{ij}$ is a particular decomposition of $A_{ij}$ in which the two edge endpoints — or "half-edges" — of an edge $(i,j)$ are labelled with groups $(r,s)$, such that

$$A_{ij} = \sum_{rs}G^{rs}_{ij}. \qquad (51)$$

Since a sum of Poisson variables is also distributed according to a Poisson, we can write Eq. 47 as

$$P(\bm{A}\mid\bm{\kappa},\bm{\lambda}) = \sum_{\bm{G}}P(\bm{G}\mid\bm{\kappa},\bm{\lambda})\prod_{i\le j}\delta_{\sum_{rs}G^{rs}_{ij},\,A_{ij}}, \qquad (52)$$

with each half-edge labelling being generated by

$$P(\bm{G}\mid\bm{\kappa},\bm{\lambda}) = \prod_{i<j}\prod_{rs}\frac{e^{-\kappa_{ir}\lambda_{rs}\kappa_{js}}(\kappa_{ir}\lambda_{rs}\kappa_{js})^{G^{rs}_{ij}}}{G^{rs}_{ij}!}\times\prod_i\prod_{rs}\frac{e^{-\kappa_{ir}\lambda_{rs}\kappa_{is}/2}(\kappa_{ir}\lambda_{rs}\kappa_{is}/2)^{G^{rs}_{ii}/2}}{(G^{rs}_{ii}/2)!}. \qquad (53)$$

We can now compute the marginal likelihood as

$$P(\bm{G}) = \int P(\bm{G}\mid\bm{\kappa},\bm{\lambda})\,P(\bm{\kappa})\,P(\bm{\lambda}\mid\bar\lambda)\,\mathrm{d}\bm{\kappa}\,\mathrm{d}\bm{\lambda} = \frac{\bar\lambda^E}{(\bar\lambda+1)^{E+B(B+1)/2}}\,\frac{\prod_{r<s}e_{rs}!\,\prod_r e_{rr}!!}{\prod_{rs}\prod_{i<j}G^{rs}_{ij}!\,\prod_i G^{rs}_{ii}!!}\times\prod_r\frac{(N-1)!}{(e_r+N-1)!}\times\prod_{ir}k^r_i!, \qquad (54)$$

which is very similar to Eq. 41 for the DC-SBM. With the above, and knowing from Eq. 51 that there is only one choice of $\bm{A}$ that is compatible with any given $\bm{G}$, i.e.

$$P(\bm{A}\mid\bm{G}) = \prod_{i\le j}\delta_{\sum_{rs}G^{rs}_{ij},\,A_{ij}}, \qquad (55)$$

we can sample from (or maximize) the posterior distribution of the half-edge labels $\bm{G}$, just like we did for the node partition $\bm{b}$ in the nonoverlapping models,

$$P(\bm{G}\mid\bm{A}) = \frac{P(\bm{A}\mid\bm{G})\,P(\bm{G})}{P(\bm{A})} \propto P(\bm{G})\times\prod_{i\le j}\delta_{\sum_{rs}G^{rs}_{ij},\,A_{ij}}, \qquad (56)$$

where the product in the last term only accounts for choices of $\bm{G}$ which are compatible with $\bm{A}$, i.e. fulfill Eq. 51.
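The generative side of Eqs. 47 and 48 is straightforward to sketch: the mixed rates $\lambda_{ij}$ amount to a single matrix product. The following minimal sketch (ours; the self-loop term of Eq. 47 is omitted for brevity) samples a network accordingly:

```python
# Sketch of the overlapping SBM generative process: lambda_ij of Eq. 48 is
# the matrix product kappa * lambda * kappa^T, after which the (off-diagonal)
# edge counts are sampled as independent Poisson variables (Eq. 47).

import numpy as np

def sample_overlapping_sbm(kappa, lam, rng=np.random.default_rng()):
    """kappa: N x B mixture matrix whose columns sum to 1; lam: B x B rates."""
    rates = kappa @ lam @ kappa.T      # lambda_ij of Eq. 48
    A = rng.poisson(rates)
    A = np.triu(A, k=1)                # keep i < j only; self-loops ignored
    return A + A.T
```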
Once more, the model of Eq. 54 is equivalent to its microcanonical analogue [18],

$$P(\bm{G}) = P(\bm{G}\mid\bm{k},\bm{e})\,P(\bm{k}\mid\bm{e})\,P(\bm{e}), \qquad (57)$$

where

$$P(\bm{G}\mid\bm{k},\bm{e}) = \frac{\prod_{r<s}e_{rs}!\,\prod_r e_{rr}!!\,\prod_{ir}k^r_i!}{\prod_{rs}\prod_{i<j}G^{rs}_{ij}!\,\prod_i G^{rs}_{ii}!!\,\prod_r e_r!}, \qquad (58)$$

$$P(\bm{k}\mid\bm{e}) = \prod_r\left(\!\!\binom{N}{e_r}\!\!\right)^{-1}, \qquad (59)$$

and $P(\bm{e})$ is given by Eq. 23. The variables $\bm{k} = \{k^r_i\}$ are the labelled degrees of the labelled network $\bm{G}$, where $k^r_i$ is the number of incident edges of type $r$ that node $i$ has. The description length becomes likewise

$$\Sigma = -\log_2 P(\bm{G}) = -\log_2 P(\bm{G}\mid\bm{k},\bm{e}) - \log_2 P(\bm{k}\mid\bm{e}) - \log_2 P(\bm{e}). \qquad (60)$$

The nested variant can once more be obtained by replacing $P(\bm{e})$ in the same manner as before, and $P(\bm{k}\mid\bm{e})$ in a manner that is conditioned on the labelled degree frequencies and the degree of overlap, as described in detail in Ref. [18].

Equipped with this more general model, we may ask ourselves again if it provides a better fit for most networks, as we did for the DC-SBM in the previous section. Indeed, since the model is more general, we might conclude that this is an inevitability. However, this could be a fallacy, since more general models also include more parameters and hence are more likely to overfit. Indeed, previous claims about the existence of "pervasive overlap" in networks, based on nonstatistical methods [72], seem to have been based to some extent on this problematic logic. Claims about community overlaps are very different from, for example, the statement that networks possess heterogeneous degrees, since community overlap is not something that can be observed directly; instead it is something that must be inferred, which is precisely what our Bayesian approach is designed to do in a methodologically correct manner.

An example of such a comparison is shown in Fig. 10, for a small network of political books. This network, when analyzed using the nonoverlapping SBM, seems to be composed of three groups, easily interpreted as "left wing," "right wing" and "center," as the available metadata corroborates. If we fit the overlapping SBM, we observe a mixed division into the same kinds of group. If we force the inference of only two groups, we see that some of the "center" nodes are split between "right wing" and "left wing." The latter might seem like a more pleasing interpretation, but looking at the description length reveals that it does not improve the description of the data. The best model in this case does seem to be the overlapping SBM with $B = 3$ groups.

Figure 10. Network of co-purchases of books about US politics [71], with groups inferred using (a) the nonoverlapping DC-SBM, with description length $\Sigma \approx 1{,}938$ bits, (b) the overlapping SBM, with description length $\Sigma \approx 1{,}931$ bits, and (c) the overlapping SBM forcing only $B = 2$ groups, with description length $\Sigma \approx 1{,}946$ bits.
However, the difference in the description length between all model variants is not very large, making it difficult to fully reject any of the three. A more systematic analysis done in Ref. [18] revealed that for most empirical networks, in particular larger ones, the overlapping models do not provide the best fit in the majority of cases, and yield larger description lengths than the nonoverlapping variants. Hence it seems that the idea of overlapping groups is less pervasive than that of degree heterogeneity — at least according to our modeling ansatz. It should be emphasized that we can always represent a network generated by an overlapping SBM by one generated by a nonoverlapping SBM with a larger number of groups representing the individual types of mixtures. Although model selection gives us the most parsimonious choice between the two, it does not remove the equivalence. In Fig. 11 we show how networks generated by the overlapping SBM can be better represented by the nonoverlapping SBM (i.e. with a smaller description length) as long as the overlapping regions are sufficiently large.

Figure 11. (a) Artificial network sampled from an assortative overlapping SBM with $B = 4$ groups and expected mixture sizes given by $n_{\vec b}\propto\mu^{|\vec b|}$, with $\mu\in[0,1]$ controlling the degree of overlap (see Ref. [73] for details). (b) The same network as in (a), but generated according to an equivalent nonoverlapping SBM with $B = 15$ groups. (c) Description length per edge $\Sigma/E$ for the same models as in (a) and (b), as a function of the degree of overlap $\mu$, showing a crossover where the nonoverlapping model is preferred. Adapted from Ref. [73].

D. Further model extensions

The simple and versatile nature of the SBM has spawned a large family of extensions and generalizations incorporating various types of more realistic features. This includes, for example, versions of the SBM that are designed for networks with continuous edge covariates (a.k.a. edge weights) [74, 75], multilayer networks that are composed of different types of edges [73, 76–79], networks that evolve in time [34–41], networks that possess node attributes [80] or are annotated with metadata [57, 58], networks with uncertain structure [81], as well as networks that do not possess a discrete modular structure at all, and are instead embedded in generalized continuous spaces [82]. These model variations are too numerous to be described here in any detail. But it suffices to say that the general Bayesian approach outlined here, including model selection, is also applicable to these variations, without any conceptual difficulty.

VIII. EFFICIENT INFERENCE USING MARKOV CHAIN MONTE CARLO (MCMC)

Although we can write exact expressions for the posterior probability of Eq. 9 (up to a normalization constant) for a variety of model variants, the resulting distributions are not simple enough to allow us to sample from them — much less find their maximum — in a direct manner. In fact, fully characterizing the posterior distribution or finding its maximum is, for most models like the SBM, typically an NP-hard problem. What we can do, however, is to employ Markov chain Monte Carlo (MCMC) [83], which can be done efficiently, and in an asymptotically exact manner, as we now show. The central idea is to sample from $P(\bm{b}\mid\bm{A})$ by starting from some initial configuration $\bm{b}_0$ (in principle arbitrary), and making move proposals $\bm{b}\to\bm{b}'$ with a probability $P(\bm{b}'\mid\bm{b})$, such that, after a sufficiently long time, the equilibrium distribution is given exactly by $P(\bm{b}\mid\bm{A})$.
In particular, given any arbitrary move proposals $P(\bm{b}'\mid\bm{b})$ — with the only condition that they fulfill ergodicity, i.e. that they allow every state to be visited eventually — we can guarantee that the desired posterior distribution is eventually reached by employing the Metropolis–Hastings criterion [84, 85], which dictates that we should accept a given move proposal $\bm{b}\to\bm{b}'$ with probability

$$a = \min\left(1,\;\frac{P(\bm{b}'\mid\bm{A})}{P(\bm{b}\mid\bm{A})}\,\frac{P(\bm{b}\mid\bm{b}')}{P(\bm{b}'\mid\bm{b})}\right), \qquad (61)$$

otherwise the proposal is rejected. The ratio $P(\bm{b}\mid\bm{b}')/P(\bm{b}'\mid\bm{b})$ in Eq. 61 enforces a property known as detailed balance or reversibility, i.e.

$$T(\bm{b}'\mid\bm{b})\,P(\bm{b}\mid\bm{A}) = T(\bm{b}\mid\bm{b}')\,P(\bm{b}'\mid\bm{A}), \qquad (62)$$

where $T(\bm{b}'\mid\bm{b})$ are the final transition probabilities after incorporating the acceptance criterion of Eq. 61. The detailed balance condition of Eq. 62, together with the ergodicity property, guarantees that the Markov chain will converge to the desired equilibrium distribution $P(\bm{b}\mid\bm{A})$. Importantly, we note that when computing the ratio $P(\bm{b}'\mid\bm{A})/P(\bm{b}\mid\bm{A})$ in Eq. 61, we do not need to determine the intractable normalization constant of Eq. 9, since it cancels out, and thus the computation can be performed exactly. The above gives a generic protocol that we can use to sample from the posterior whenever we can compute the numerator of Eq. 9. If instead we are interested in maximizing the posterior, we can introduce an "inverse temperature" parameter $\beta$, by changing $P(\bm{b}\mid\bm{A})\to P(\bm{b}\mid\bm{A})^\beta$ in the above equations, and making $\beta\to\infty$ in slow increments; this is known as simulated annealing [86].

The simplest implementation of this protocol for the inference of SBMs is to start from a random partition $\bm{b}_0$, and use move proposals where a node $i$ is randomly selected, and its new group membership $b_i'$ is then chosen uniformly at random among all $B+1$ choices (where the remaining choice means we populate a new group),

$$P(b_i'\mid\bm{b}) = \frac{1}{B+1}. \qquad (63)$$

By inspecting Eqs. 20, 41, 54 and 17 for all SBM variants considered, we notice that the ratio $P(\bm{b}'\mid\bm{A})/P(\bm{b}\mid\bm{A})$ can be computed in time $O(k_i)$, where $k_i$ is the degree of node $i$, independently of other properties of the model such as the number of groups $B$. Note that this is not true for all alternative formulations of the SBM; e.g. for the models in Refs. [30, 31, 33, 87, 88] computing such an update requires $O(k_i + B)$ time [the heat-bath move proposals of Ref. [33] increase this even further, to $O(B(k_i + B))$], thus making them very inefficient for large networks, where the number of groups can reach the order of thousands or more. Hence, when using the move proposals of Eq. 63, a full sweep of all $N$ nodes in the network can be done in time $O(E)$, independent of $B$.
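A minimal sketch of one sweep of this naive scheme is shown below (ours; `log_joint` is an assumed user-supplied function returning $\ln P(\bm{A},\bm{b})$ up to a constant, i.e. the numerator of Eq. 9). For clarity it recomputes the full log-probability at each step, whereas an efficient implementation would compute the ratio incrementally in $O(k_i)$ time:

```python
# Sketch of the naive scheme: Eq. 63 proposals with Eq. 61 acceptance.

import math
import random

def mcmc_sweep(b, B, log_joint):
    N = len(b)
    for _ in range(N):
        i = random.randrange(N)
        r_old = b[i]
        lp_old = log_joint(b)           # ln P(A, b): numerator of Eq. 9
        b[i] = random.randrange(B + 1)  # uniform proposal; B+1 allows a new group
        lp_new = log_joint(b)
        # The proposal is symmetric, so the Hastings factor in Eq. 61 is 1,
        # and we accept with probability min(1, P(b'|A) / P(b|A)).
        if math.log(random.random()) >= lp_new - lp_old:
            b[i] = r_old                # reject the move
```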
Although fairly simple, the above algorithm suffers from some shortcomings that can seriously degrade its performance in practice. In fact, it is typical for naive implementations of the Metropolis–Hastings algorithm to perform very badly, despite its theoretical guarantees. This is because the asymptotic properties of the Markov chain may take a very long time to be realized, and the equilibrium distribution is never observed in practical time. Generally, we should expect good convergence times only when: 1. the initial state $\bm{b}_0$ is close enough to the most likely states of the posterior, and 2. the move proposals $P(\bm{b}'\mid\bm{b})$ resemble the shape of the posterior. Indeed, it is a trivial (and not very useful) fact that if the starting state $\bm{b}_0$ is sampled directly from the posterior, and the move proposals match the posterior exactly, $P(\bm{b}'\mid\bm{b}) = P(\bm{b}'\mid\bm{A})$, the Markov chain would be instantaneously equilibrated. Hence if we can approach this ideal scenario, we should be able to improve the inference speed. Here we describe two simple strategies for achieving such an improvement, which have been shown to yield a significant performance impact [89].

The first one is to replace the fully random move proposals of Eq. 63 by a more informative choice. Namely, we use the current information about the model being inferred to guide our next move. We do so by selecting the membership of a node $i$ being moved according to

$$P(b_i = r\mid\bm{b}) = \sum_s P(s\mid i)\,\frac{e_{sr}+\varepsilon}{e_s+\varepsilon(B+1)}, \qquad (64)$$

where $P(s\mid i) = \sum_j A_{ij}\delta_{b_j,s}/k_i$ is the fraction of neighbors of node $i$ that belong to group $s$, and $\varepsilon > 0$ is an arbitrary parameter that enforces ergodicity, but has no other significant impact on the algorithm, provided it is sufficiently small (however, if $\varepsilon\to\infty$ we recover the fully random moves of Eq. 63). What this move proposal means is that we inspect the local neighborhood of node $i$, see which groups $s$ are connected to it, and use the typical neighborhoods $r$ of the groups $s$ to guide our placement of node $i$ (see Fig. 12a). The purpose of these move proposals is not to waste time with attempted moves that will almost surely be rejected, as will typically happen with the fully random version. We emphasize that the move proposals of Eq. 64 do not bias the partitions toward any specific kind of mixing pattern; in particular, they do not prefer assortative over non-assortative partitions. Furthermore, these proposals can be generated efficiently, simply by following three steps: 1. sample a random neighbor $j$ of node $i$, and inspect its group membership $s = b_j$; then 2. with probability $\varepsilon(B+1)/(e_s+\varepsilon(B+1))$, sample a fully random group $r$ (which can be a new group); 3. otherwise, sample a group label $r$ with probability proportional to the number of edges leading to it from group $s$, $e_{sr}$. These steps can be performed in time $O(k_i)$, again independently of $B$, as long as continuous book-keeping is made of the edges which are incident to each group, and therefore they do not affect the overall $O(E)$ time complexity.

Figure 12. Efficient MCMC strategies: (a) Move proposals are made by inspecting the neighborhood of node $i$ and selecting a random neighbor $j$. Based on its group membership $t = b_j$, the edge counts between groups are inspected (right), and the move proposal $b_i = s$ is made with probability proportional to $e_{ts}$. (b) The initial state of the MCMC is obtained with an agglomerative heuristic, where groups are merged together using the same proposals described in (a).
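The three steps above translate almost literally into code. The following sketch (ours; `e` is assumed to be the matrix of edge counts between groups, kept up to date by the book-keeping just mentioned) generates one proposal of Eq. 64:

```python
# Sketch of one informed proposal of Eq. 64; e[s][r] holds the edge counts
# between groups s and r, and adj[i] is the list of neighbors of node i.

import random

def propose_move(i, b, adj, e, B, eps=0.1):
    j = random.choice(adj[i])                 # 1. random neighbor of i ...
    s = b[j]                                  #    ... and its group membership
    e_s = sum(e[s])                           # total edges incident to group s
    if random.random() < eps * (B + 1) / (e_s + eps * (B + 1)):
        return random.randrange(B + 1)        # 2. fully random group (maybe new)
    return random.choices(range(B), weights=e[s])[0]  # 3. proportional to e_sr
```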
The second strategy is to choose a starting state that lies close to the mode of the posterior. We do so by performing a Fibonacci search [90] on the number of groups $B$, where for each value we obtain the best partition from a larger partition with $B' > B$ using an agglomerative heuristic, composed of the following two steps, taken alternately: 1. we attempt the moves of Eq. 64 until no improvement to the posterior is observed; 2. we merge groups together, achieving a smaller number of groups $B''\in[B,B']$, stopping when $B'' = B$. We perform the last step by treating each group as a single node and using Eq. 64 as a merge proposal, selecting the merges that least decrease the posterior (see Fig. 12b). As shown in Ref. [89], the overall complexity of this initialization algorithm is $O(E\log^2 N)$, and thus it can be employed for very large networks.

The approach above can be adapted to the overlapping model of Sec. VII C, where instead of the partition $\bm{b}$, the move proposals are made with respect to the individual half-edge labels [18]. For the nested model, we have instead a hierarchical partition $\{\bm{b}_l\}$, and we proceed in each step of the Markov chain by randomly choosing a level $l$ and performing the proposals of Eq. 64 on that level, as described in Ref. [19].

The combination of the two strategies described above makes the inference procedure quite scalable, and it has been successfully employed on networks with on the order of $10^7$ to $10^8$ edges, and up to $B = N$ groups. The MCMC algorithm described in this section, for all model variants described, is implemented in the graph-tool library [91], freely available under the GPL license at http://graph-tool.skewed.de.
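For instance, a typical fit with graph-tool looks like the following sketch (the names follow its documented inference interface, which may differ slightly across versions):

```python
# Hedged usage sketch with the graph-tool library [91].
import graph_tool.all as gt

g = gt.collection.data["polblogs"]            # political blogs network [66]
state = gt.minimize_nested_blockmodel_dl(g)   # MAP estimate of the nested DC-SBM
print(state.entropy())                        # description length (in nats)
state.draw()                                  # draw the hierarchical partition
```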
IX. TO SAMPLE OR TO OPTIMIZE?

In the examples so far, we have focused on obtaining the most likely partition from the posterior distribution, which is the one that minimizes the description length of the data. But is this in fact the best approach? In order to answer this, we need first to quantify how well our inference is doing, by comparing our estimate $\hat{\bm{b}}$ of the partition to the true partition $\bm{b}^*$ that generated the data, by defining a so-called loss function. For example, if we choose to be very strict, we may reject any partition that is strictly different from $\bm{b}^*$ on equal measure, using the indicator function

$$\Delta(\hat{\bm{b}},\bm{b}^*) = \prod_i\delta_{\hat b_i,b^*_i}, \qquad (65)$$

so that $\Delta(\hat{\bm{b}},\bm{b}^*) = 1$ only if $\hat{\bm{b}} = \bm{b}^*$, otherwise $\Delta(\hat{\bm{b}},\bm{b}^*) = 0$. If the observed data $\bm{A}$ and parameters $\bm{b}$ are truly sampled from the model and priors, respectively, the best assessment we can make of $\bm{b}^*$ is given by the posterior distribution $P(\bm{b}\mid\bm{A})$. Therefore, the average of the indicator over the posterior is given by

$$\bar\Delta(\hat{\bm{b}}) = \sum_{\bm{b}}\Delta(\hat{\bm{b}},\bm{b})\,P(\bm{b}\mid\bm{A}). \qquad (66)$$

If we maximize $\bar\Delta(\hat{\bm{b}})$ with respect to $\hat{\bm{b}}$, we obtain the so-called maximum a posteriori (MAP) estimator

$$\hat{\bm{b}} = \operatorname*{argmax}_{\bm{b}}\,P(\bm{b}\mid\bm{A}), \qquad (67)$$

which is precisely what we have been using so far, and is equivalent to employing the MDL principle. However, using this estimator is arguably overly optimistic, as we are unlikely to find the true partition with perfect accuracy in any but the most ideal cases. Instead, we may relax our expectations and consider the overlap function

$$d(\hat{\bm{b}},\bm{b}^*) = \frac{1}{N}\sum_i\delta_{\hat b_i,b^*_i}, \qquad (68)$$

which measures the fraction of nodes that are correctly classified. If we now maximize the average of the overlap over the posterior distribution,

$$\bar d(\hat{\bm{b}}) = \sum_{\bm{b}}d(\hat{\bm{b}},\bm{b})\,P(\bm{b}\mid\bm{A}), \qquad (69)$$

we obtain the marginal estimator

$$\hat b_i = \operatorname*{argmax}_r\,\pi_i(r), \qquad (70)$$

where

$$\pi_i(r) = \sum_{\bm{b}\setminus b_i}P(b_i = r,\bm{b}\setminus b_i\mid\bm{A}) \qquad (71)$$

is the marginal distribution of the group membership of node $i$, summed over all remaining nodes (see footnote 19). The marginal estimator is notably different from the MAP estimator in that it leverages information from the entire posterior distribution to inform the classification of any single node. If the posterior is tightly concentrated around its maximum, both estimators will yield compatible answers. In this situation the structure in the data is very clear, and both estimators agree. Otherwise, the estimators will reflect different aspects of the data, in particular if the posterior possesses many local maxima. For example, if the data has indeed been sampled from the model we are using, the multiplicity of local maxima can be just a reflection of the randomness in the data, and the marginal estimator will be able to average over them and provide better accuracy [92, 93].

In view of the above, one could argue that the marginal estimator should be generally preferred over MAP. However, the situation is more complicated for data which are not sampled from the model being used for inference (i.e. the model is misspecified). In this situation, multiple peaks of the distribution can point to very different partitions that are all statistically significant. These different peaks function as alternative explanations for the data that must be accepted on equal footing, according to their posterior probability. The marginal estimator will in general mix the properties of all peaks into a consensus classification that is not representative of any single hypothesis, whereas the MAP estimator will concentrate only on the most likely one (or an arbitrary choice, if they are all equally likely).

Footnote 19: The careful reader will notice that we must have in fact a trivial constant marginal $\pi_i(r) = 1/B$ for every node $i$, since there is a symmetry of the posterior distribution with respect to the relabelling of the groups, in principle rendering this estimator useless. In practice, however, our samples from the posterior distribution (e.g. using MCMC) will not span the whole space of label permutations in any reasonable amount of time, and instead will concentrate on a mode around one of the possible permutations. Since the modes around the label permutations are entirely symmetric, the node marginals obtained in this manner can be meaningfully used. However, for networks where some of the groups are not very large, local permutations of individual group labels are statistically possible during MCMC inference, leading to degeneracies in the marginal $\pi_i(r)$ of the affected nodes, and resulting in artefacts when using the marginal estimator. This problem is exacerbated when the number of groups changes during MCMC sampling.
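In practice, the marginal estimator of Eqs. 70 and 71 is approximated by histogramming MCMC samples. A minimal sketch (ours; it assumes the chain remains in a single labelling mode, as per the footnote above):

```python
# Sketch of the marginal estimator: accumulate the group-membership
# histogram of each node over sampled partitions and pick the mode.

import numpy as np

def marginal_estimate(samples, B):
    """samples: list of partitions, as integer arrays of equal length N."""
    N = len(samples[0])
    counts = np.zeros((N, B))
    for b in samples:
        counts[np.arange(N), b] += 1   # histogram of pi_i(r), Eq. 71
    return counts.argmax(axis=1)       # b_i = argmax_r pi_i(r), Eq. 70
```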
An illustration of this is given by the well-known Zachary's karate club network [95], which captures the social interactions between the members of a karate club amidst a conflict between the club's administrator and an instructor, which eventually led to a split of the club into two disjoint groups. The measurement of the network was done before the final split actually happened, and it is very often used as an example of a network exhibiting community structure. If we analyze this network with the DC-SBM, we obtain three partitions that occur with very high probability in the posterior distribution: a trivial $B = 1$ partition, corresponding to the configuration model without communities (Fig. 13a); a "leader-follower" division into $B = 2$ groups, separating the administrator and the instructor, together with two close allies, from the rest of the network (Fig. 13b); and finally a $B = 2$ division into the aforementioned factions that anticipated the split (Fig. 13c). If we were to guide ourselves strictly by the MDL principle (i.e. using the MAP estimator), the preferred partition would be the trivial $B = 1$ one, indicating that the most likely explanation of this network is a fully random graph with a pre-specified degree sequence, and that the observed community structure emerged spontaneously. However, if we inspect the posterior distribution more closely, we see that the other divisions into $B > 1$ groups amount to around 50% of the posterior probability (see Fig. 13e). Therefore, if we consider all $B > 1$ partitions collectively, they give us little reason to completely discard the possibility that the network does in fact possess some group structure. Inspecting the posterior distribution even more closely, as shown in Fig. 13d, reveals a multimodal structure clustered around the three aforementioned partitions, giving us three very different explanations for the data, none of which can be decisively discarded in favor of the others — at least not according to the evidence available in the network structure alone.

Figure 13. Posterior distribution of partitions of Zachary's karate club network using the DC-SBM. Panels (a) to (c) show three modes of the distribution, $\bm{b}_0$, $\bm{b}_1$ and $\bm{b}_2$, with description lengths $\Sigma \approx 321.3$, $327.5$ and $329.3$ bits, respectively; (d) 2D projection of the posterior obtained using multidimensional scaling [94]; (e) marginal posterior distribution of the number of groups $B$.

The situation encountered for the karate club network is a good example of the so-called bias-variance trade-off that we are often forced to face: if we choose to single out a single partition as a unique representation of the data, we must invariably bias our result toward one of the three most likely scenarios, discarding the remaining ones at some loss of useful information. Otherwise, if we choose to eliminate the bias by incorporating the entire posterior distribution into our representation, by the same token it will incorporate a larger variance, i.e. it will simultaneously encompass diverging explanations of the data, leaving us without an unambiguous and clear interpretation. The only situation where this trade-off is not required is when the model is a perfect fit to the data, such that the posterior is tightly peaked around a single partition. Therefore, the variance of the posterior serves as a good indication of the quality of fit of the model, providing another reason to include it in the analysis.

It should also be remarked that when using a nonparametric approach, where the dimension of the model is also inferred from the posterior distribution, the potential bias incurred when obtaining only the most likely partition usually amounts to an underfit of the data, since the uncertainty in the posterior typically translates into the existence of a more conservative partition with fewer groups (see footnote 20). Instead, if we sample from the posterior distribution, we will average over many alternative fits, including those that model the data more closely with a larger number of groups.

Footnote 20: This is different from parametric posteriors, where the dimension of the model is externally imposed in the prior, and the MAP estimator tends to overfit [92, 93].
However, each individual sample of the posterior will tend to incorporate more randomness from the data, which will disappear only if we average over all samples. This means that single samples will tend to overfit the data, and hence we must resist looking at them individually. It is only in the aforementioned limit of a perfect fit that we are guaranteed not to be misled one way or another. An additional example of this is shown in Fig. 14, for a network of collaborations among scientists. If we infer the best nested SBM, we find a specific hierarchical division of the network. However, if we sample hierarchical divisions from the posterior distribution, we typically encounter larger models — with a larger number of groups and a deeper hierarchy. Each individual sample from the posterior is likely to be an overfit, but collectively they give a more accurate picture of the network in comparison with the most likely partition, which probably over-simplifies it. As already mentioned, this discrepancy, observed for all three SBM versions, tells us that none of them is an ideal fit for this network.

Figure 14. Hierarchical partitions of a network of collaborations between scientists [96]. (a) Most likely hierarchical partition according to the DC-SBM with a uniform hyperprior. (b) Uncorrelated samples from the posterior distribution. (c) Marginal posterior distribution of the number of groups at the first three hierarchical levels, according to the model variants described in the legend. The vertical lines mark the values obtained for the most likely partition. Adapted from Ref. [19].

The final decision on which approach to take depends on the actual objective and the resources available. In general, sampling from the posterior will be more suitable when the objective is to generalize from observations and make predictions (see the next section and Ref. [97]), and when computational resources are ample. Conversely, if the objective is to make a precise statement about the data, e.g. in order to summarize and interpret it, and the computational resources are scarce, maximizing the posterior tends to be more adequate.

X. GENERALIZATION AND PREDICTION

When we fit a model like the SBM to a network, we are doing more than simply dividing the nodes into statistically equivalent groups; we are also making a statement about a possible mechanism that generated the network. This means that, to the extent that the model is a good representation of the data, we can use it to generalize and make predictions about what has not been observed. This has been most explored for the prediction of missing and spurious links [25, 68]. This represents the situation where we know or stipulate that the observed data is noisy, and may contain edges that in fact do not exist, or fail to contain edges that do exist.
With a generative model like the SBM, we are able to ascribe probabilities to existing and non-existing edges of being spurious or missing, respectively, as we now describe. Following Ref. [97], the scenario we will consider is the situation where there exists a complete network $\bm{G}$ which is decomposed into two parts,

$$\bm{G} = \bm{A}^O + \delta\bm{A}, \qquad (72)$$

where $\bm{A}^O$ is the network that we observe, and $\delta\bm{A}$ is the set of missing and spurious edges that we want to predict, with an entry $\delta A_{ij} > 0$ representing a missing edge, and $\delta A_{ij} < 0$ a spurious one. Hence, our task is to obtain the posterior distribution

$$P(\delta\bm{A}\mid\bm{A}^O). \qquad (73)$$

The central assumption we will make is that the complete network $\bm{G}$ has been generated using some arbitrary version of the SBM, with a marginal distribution

$$P_G(\bm{G}\mid\bm{b}). \qquad (74)$$

Given a generated network $\bm{G}$, we then select $\delta\bm{A}$ from some arbitrary distribution that models our source of errors,

$$P_{\delta A}(\delta\bm{A}\mid\bm{G}). \qquad (75)$$

With the above model for the generation of the complete network and its missing and spurious edges, we can proceed to compute the posterior of Eq. 73. We start from the joint distribution

$$P(\bm{A}^O,\delta\bm{A}\mid\bm{G}) = P(\bm{A}^O\mid\delta\bm{A},\bm{G})\,P_{\delta A}(\delta\bm{A}\mid\bm{G}) \qquad (76)$$
$$= \delta\big(\bm{G} - (\bm{A}^O+\delta\bm{A})\big)\,P_{\delta A}(\delta\bm{A}\mid\bm{G}), \qquad (77)$$

where we have used the fact that $P(\bm{A}^O\mid\delta\bm{A},\bm{G}) = \delta(\bm{G} - (\bm{A}^O+\delta\bm{A}))$, originating from Eq. 72. For the joint distribution conditioned on the partition, we sum the above over all possible graphs $\bm{G}$, sampled from our original model,

$$P(\bm{A}^O,\delta\bm{A}\mid\bm{b}) = \sum_{\bm{G}}P(\bm{A}^O,\delta\bm{A}\mid\bm{G})\,P_G(\bm{G}\mid\bm{b}) \qquad (78)$$
$$= P_{\delta A}(\delta\bm{A}\mid\bm{A}^O+\delta\bm{A})\,P_G(\bm{A}^O+\delta\bm{A}\mid\bm{b}). \qquad (79)$$

The final posterior distribution of Eq. 73 is therefore

$$P(\delta\bm{A}\mid\bm{A}^O) = \frac{\sum_{\bm{b}}P(\bm{A}^O,\delta\bm{A}\mid\bm{b})\,P(\bm{b})}{P(\bm{A}^O)} \qquad (80)$$
$$= \frac{P_{\delta A}(\delta\bm{A}\mid\bm{A}^O+\delta\bm{A})\sum_{\bm{b}}P_G(\bm{A}^O+\delta\bm{A}\mid\bm{b})\,P(\bm{b})}{P(\bm{A}^O)}, \qquad (81)$$

with $P(\bm{A}^O)$ being a normalization constant, independent of $\delta\bm{A}$. This expression gives a general recipe to compute the posterior, where one averages the marginal likelihood $P_G(\bm{A}^O+\delta\bm{A}\mid\bm{b})$ obtained by sampling partitions from the prior $P(\bm{b})$. However, this procedure will typically take an astronomical time to converge to the correct asymptotic value, since the largest values of $P_G(\bm{A}^O+\delta\bm{A}\mid\bm{b})$ will be far away from most values of $\bm{b}$ sampled from $P(\bm{b})$. A much better approach is to perform importance sampling, by rewriting the posterior as

$$P(\delta\bm{A}\mid\bm{A}^O) \propto P_{\delta A}(\delta\bm{A}\mid\bm{A}^O+\delta\bm{A})\sum_{\bm{b}}\frac{P_G(\bm{A}^O+\delta\bm{A}\mid\bm{b})}{P_G(\bm{A}^O\mid\bm{b})}\,P_G(\bm{A}^O\mid\bm{b})\,P(\bm{b}) \qquad (82)$$
$$\propto P_{\delta A}(\delta\bm{A}\mid\bm{A}^O+\delta\bm{A})\sum_{\bm{b}}\frac{P_G(\bm{A}^O+\delta\bm{A}\mid\bm{b})}{P_G(\bm{A}^O\mid\bm{b})}\,P_G(\bm{b}\mid\bm{A}^O), \qquad (83)$$

where $P_G(\bm{b}\mid\bm{A}^O)$ is the posterior distribution of partitions obtained by pretending that the observed network came directly from the SBM. We can sample from this posterior using MCMC, as described in Sec. VIII. As the number of entries in $\delta\bm{A}$ is typically much smaller than the number of observed edges, this importance sampling approach will tend to converge much faster.
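Schematically, the importance-sampling estimate of Eq. 83 amounts to averaging marginal-likelihood ratios over partitions sampled from $P_G(\bm{b}\mid\bm{A}^O)$. A sketch (ours; `log_marginal` is an assumed function returning $\ln P_G(\bm{A}\mid\bm{b})$, and a uniform noise model $P_{\delta A} = \text{const.}$ is used):

```python
# Sketch of Eq. 83: the (unnormalized) posterior weight of a candidate
# perturbation delta_A, averaged over partitions sampled from P_G(b | A_obs).

import numpy as np

def delta_weight(A_obs, delta_A, samples, log_marginal):
    ratios = [np.exp(log_marginal(A_obs + delta_A, b) - log_marginal(A_obs, b))
              for b in samples]   # P_G(A_obs + dA | b) / P_G(A_obs | b)
    return np.mean(ratios)

# The relative probabilities lambda_i of Eq. 84 then follow by normalizing
# these weights over the candidate set {delta_A_i}.
```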
This allows us to compute $P(\delta\bm{A}\mid\bm{A}^O)$ in a practical manner — up to a normalization constant. However, if we want to compare the relative probability between specific sets of missing/spurious edges, $\{\delta\bm{A}_i\}$, via the ratio

$$\lambda_i = \frac{P(\delta\bm{A}_i\mid\bm{A}^O)}{\sum_j P(\delta\bm{A}_j\mid\bm{A}^O)}, \qquad (84)$$

this normalization constant plays no role. The above still depends on our chosen model for the production of missing and spurious edges, given by Eq. 75. In the absence of domain-specific information about the source of noise, we must consider all alternative choices $\{\delta\bm{A}_i\}$ to be equally likely a priori, so that we can simply replace $P_{\delta A}(\delta\bm{A}\mid\bm{A}^O+\delta\bm{A})\propto 1$ in Eq. 83 — although more realistic choices can also be included. In Fig. 15 we show the relative probabilities of two hypothetical missing edges for the American college football network, obtained with the approach above. We see that a particular missing edge between teams of the same conference is almost a hundred times more likely than one between teams of different conferences.

Figure 15. Two hypothetical missing edges in the network of American college football teams. Edge (a) connects teams of different conferences, whereas (b) connects teams of the same conference. According to the nested DC-SBM, their posterior probability ratios are $\lambda_a \approx 0.013(1)$ and $\lambda_b \approx 0.987(1)$.

The use of the SBM to predict missing and spurious edges has been employed in a variety of applications, such as the prediction of novel interactions between drugs [98] and of conflicts in social networks [99], as well as to provide user recommendations [100, 101], and in many cases it has outperformed a variety of competing methods.

XI. FUNDAMENTAL LIMITS OF INFERENCE: THE DETECTABILITY-INDETECTABILITY PHASE TRANSITION

Besides defining useful models and investigating their behavior in data, there is another line of questioning which deals with how far it is possible to go when we try to infer the structure of networks. Naturally, the quality of the inference depends on the statistical evidence available in the data, and we may therefore ask if it is possible at all to uncover planted structures — i.e. structures that we impose ourselves — with our inference methods, and if so, what is the best performance we can expect. Research in this area has exploded in recent years [92, 93], after it was shown by Decelle et al. [102, 103] that not only may it be impossible to uncover planted structures with the SBM, but the inference undergoes a "phase transition," such that recovery becomes possible only if the structure is strong enough to cross a non-trivial threshold. This result was obtained using methods from statistical physics, which we now describe.

The situation we will consider is a "best case scenario," where all parameters of the model are known, with the exception of the partition $\bm{b}$ — in contrast to our overall approach so far, where we considered all parameters to be unknown random variables. In particular, we will consider only the prior

$$P(\bm{b}\mid\bm{\gamma}) = \prod_i\gamma_{b_i}, \qquad (85)$$

where $\gamma_r$ is the probability of a node belonging to group $r$. Given this, we wish to obtain the posterior distribution of the node partition, using the SBM of Eq. 8,

$$P(\bm{b}\mid\bm{A},\bm{\lambda},\bm{\gamma}) = \frac{P(\bm{A}\mid\bm{b},\bm{\lambda})\,P(\bm{b}\mid\bm{\gamma})}{P(\bm{A}\mid\bm{\lambda},\bm{\gamma})} = \frac{e^{-\mathcal{H}(\bm{b})}}{Z}, \qquad (86)$$
which was written above in terms of the "Hamiltonian"

$$\mathcal{H}(\bm{b}) = -\sum_{i<j}\left(A_{ij}\ln\lambda_{b_i,b_j} - \lambda_{b_i,b_j}\right) - \sum_i\ln\gamma_{b_i}, \qquad (87)$$

drawing an analogy with Potts-like models in statistical physics [104]. The normalization constant, called the "partition function," is given by

$$Z = \sum_{\bm{b}}e^{-\mathcal{H}(\bm{b})}. \qquad (88)$$

Far from being an unimportant detail, the partition function can be used to determine all statistical properties of our inference procedure. For example, if we wish to obtain the marginal posterior distribution of node $i$, we can do so by introducing the perturbation $\mathcal{H}'(\bm{b}) = \mathcal{H}(\bm{b}) - \mu\,\delta_{b_i,r}$ and computing the derivative

$$P(b_i = r\mid\bm{A},\bm{\lambda},\bm{\gamma}) = \left.\frac{\partial\ln Z}{\partial\mu}\right|_{\mu=0} = \sum_{\bm{b}}\delta_{b_i,r}\frac{e^{-\mathcal{H}(\bm{b})}}{Z}. \qquad (89)$$

Unfortunately, it does not seem possible to compute the partition function $Z$ in closed form for an arbitrary graph $\bm{A}$. However, there is a special case for which we can compute it, namely when $\bm{A}$ is a tree. This is useful for us, because graphs sampled from the SBM will be "locally tree-like" if they are sparse (i.e. the degrees are small compared to the size of the network, $k_i \ll N$), and the group sizes scale with the size of the system, i.e. $n_r = O(N)$ (which implies $B \ll N$). Locally tree-like means that typical loops will have length $O(\ln N)$, and hence at the immediate neighborhood of any given node the graph will look like a tree. Although being locally tree-like is not quite the same as being a tree, the graph will become increasingly closer to being a tree in the "thermodynamic limit" $N\to\infty$. Because of this, many properties of locally tree-like graphs become asymptotically identical to those of trees in this limit. If we assume that this limit holds, we can compute the partition function by pretending that the graph is close enough to being a tree, in which case we can write the so-called Bethe free energy (we refer to Refs. [103, 105] for a detailed derivation)

$$F = -\ln Z = -\sum_i\ln Z_i + \sum_{i<j}A_{ij}\ln Z_{ij} - E, \qquad (90)$$

with the auxiliary quantities given by

$$Z_{ij} = N\sum_{r<s}\lambda_{rs}\left(\psi^{i\to j}_r\psi^{j\to i}_s + \psi^{i\to j}_s\psi^{j\to i}_r\right) + N\sum_r\lambda_{rr}\,\psi^{i\to j}_r\psi^{j\to i}_r, \qquad (91)$$

$$Z_i = \sum_r\gamma_r\,e^{-h_r}\prod_{j\in\partial i}\sum_s N\lambda_{rs}\,\psi^{j\to i}_s, \qquad (92)$$

where $\partial i$ means the neighbors of node $i$. In the above equations, the values $\psi^{i\to j}_r$ are called "messages," and they must fulfill the self-consistency equations

$$\psi^{i\to j}_r = \frac{1}{Z^{i\to j}}\,\gamma_r\,e^{-h_r}\prod_{k\in\partial i\setminus j}\sum_s N\lambda_{rs}\,\psi^{k\to i}_s, \qquad (93)$$

where $k\in\partial i\setminus j$ means all neighbors $k$ of $i$ excluding $j$, the value $Z^{i\to j}$ is a normalization constant enforcing $\sum_r\psi^{i\to j}_r = 1$, and $h_r = \sum_i\sum_s\lambda_{rs}\psi^i_s$ is a local auxiliary field. Eqs. 93 are called the belief propagation (BP) equations [105], and the entire approach is also known under the name "cavity method" [106]. The values of the messages are typically obtained by iteration, where we start from some initial configuration (e.g. a random one), and compute new values from the right-hand side of Eq. 93, until they converge asymptotically. Note that the messages are only defined on the edges of the network, and an update involves inspecting the values at the neighborhood of each node, where the messages can be interpreted as carrying information about the marginal distribution of a given node if the same is removed from the network (hence the names "belief propagation" and "cavity method").
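A minimal sketch of one asynchronous sweep of Eq. 93 is given below (ours; for brevity it omits the auxiliary field $h_r$, which matters for unbalanced group sizes, and stores one message per directed edge):

```python
# Sketch of one asynchronous BP sweep (Eq. 93, field h_r omitted).
# adj: neighbor lists; psi[(i, j)]: length-B message on the directed edge
# i -> j; c: B x B array with entries c_rs = N * lambda_rs; gamma: priors.

import numpy as np

def bp_sweep(adj, psi, gamma, c):
    for i, nbrs in enumerate(adj):
        for j in nbrs:
            m = np.array(gamma, dtype=float)  # gamma_r prefactor
            for k in nbrs:
                if k != j:
                    m *= c @ psi[(k, i)]      # prod over k of sum_s c_rs psi_s
            psi[(i, j)] = m / m.sum()         # normalization Z^{i->j}
```

Starting from random normalized messages, repeated sweeps are run until the messages stop changing appreciably.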
Each iteration of the BP equations can be done in time $O(EB^2)$, and convergence is often obtained after only a few iterations, rendering the whole computation fairly efficient, provided $B$ is reasonably small. After the messages have been obtained, they can be used to compute the node marginals,

$$P(b_i = r\mid\bm{A},\bm{\lambda},\bm{\gamma}) = \psi^i_r = \frac{1}{Z_i}\,\gamma_r\prod_{j\in\partial i}\sum_s\left(N\lambda_{rs}\right)^{A_{ij}}e^{-\lambda_{rs}}\,\psi^{j\to i}_s, \qquad (94)$$

where $Z_i$ is a normalization constant. This whole procedure gives a way of computing the marginal distribution $P(b_i = r\mid\bm{A},\bm{\lambda},\bm{\gamma})$ in a manner that is asymptotically exact — if $\bm{A}$ is sufficiently large and locally tree-like. Since networks that are sampled from the SBM fulfill this property (see footnote 21), we may proceed with our original question, and test if we can recover the true value of $\bm{b}$ we used to generate a network. For the test, we use a simple parametrization named the planted partition model (PP) [10, 107], where $\gamma_r = 1/B$ and

$$\lambda_{rs} = \lambda_{\text{in}}\,\delta_{rs} + \lambda_{\text{out}}(1-\delta_{rs}), \qquad (95)$$

with $\lambda_{\text{in}}$ and $\lambda_{\text{out}}$ specifying the expected number of edges between nodes of the same group and of different groups, respectively. If we generate networks from this ensemble, use the BP equations to compute the posterior marginal distribution of Eq. 94, and compare its maximum values with the planted partition, we observe, as shown in Fig. 16, that the latter is recoverable only above a certain value of $\varepsilon = N(\lambda_{\text{in}} - \lambda_{\text{out}})$, below which the posterior distribution is fully uniform. By inspecting the stability of the fully uniform solution of the BP equations, the exact threshold can be determined [103],

$$\varepsilon^* = B\sqrt{\langle k\rangle}, \qquad (96)$$

where $\langle k\rangle = N\sum_{rs}\lambda_{rs}/B^2$ is the average degree of the network. The existence of this threshold is remarkable, because the ensemble is only equivalent to a completely random one if $\varepsilon = 0$; yet there is a non-negligible range of values $\varepsilon\in[0,\varepsilon^*]$ for which the planted structure cannot be recovered even though the model is not random.

Footnote 21: Real networks, however, should not be expected to be locally tree-like. This does not invalidate the results of this section, which pertain strictly to data sampled from the SBM. However, despite not being exact, the BP algorithm yields surprisingly accurate results for real networks, even when the tree-like property is violated [103].

Figure 16. Normalized mutual information (NMI) between the planted and inferred partitions of a PP model with $N = 10^5$, $B = 3$, $\langle k\rangle = 3$ and $\varepsilon = N(\lambda_{\text{in}} - \lambda_{\text{out}})$. The vertical line marks the detectability threshold $\varepsilon^* = B\sqrt{\langle k\rangle}$.
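The threshold of Eq. 96 is trivial to evaluate; e.g. for $B = 3$ groups and average degree $\langle k\rangle = 3$ it gives $\varepsilon^* = 3\sqrt{3}\approx 5.2$. A one-function sketch (ours):

```python
# Sketch: the detectability threshold of Eq. 96 for the planted partition
# model; structure with epsilon = N*(lambda_in - lambda_out) below this
# value cannot be recovered by any algorithm.

from math import sqrt

def detectability_threshold(B, k_mean):
    """epsilon* = B * sqrt(<k>)."""
    return B * sqrt(k_mean)

print(detectability_threshold(3, 3))  # B = 3, <k> = 3  ->  ~5.196
```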
This might seem counter-intuitive, if we argue that making $N$ sufficiently large should at some point give us enough data to infer the model with arbitrary precision. The hole in this logic lies in the fact that the number of parameters — the node partition $\bm{b}$ — also grows with $N$, and that we would need the effective sample size, i.e. the number of edges $E$, to grow faster than $N$ to guarantee that the data are sufficient. Since for sparse graphs we have $E = O(N)$, we are never able to reach the limit of sufficient data. Thus, we should be able to achieve asymptotically perfect inference only for dense graphs (e.g. with $E = O(N^2)$) or by inferring simultaneously from many graphs independently sampled from the same model. Neither situation, however, is representative of what we typically encounter when we study networks.

The above result carries important implications for the overall field of network clustering. The existence of the "detectable" phase for $\varepsilon > \varepsilon^*$ means that, in this regime, it is possible for algorithms to discover the planted partition in polynomial time, with the BP algorithm doing so optimally. Furthermore, for $B > 4$ (or $B > 3$ for the dissortative case with $\lambda_{\text{in}} < \lambda_{\text{out}}$) there is another regime in a range $\varepsilon^* < \varepsilon < \varepsilon^\dagger$, where BP converges to the planted partition only if the messages are initialized close enough to the corresponding fixed point. In this regime, the posterior landscape exhibits a "glassy" structure, with exponentially many maxima that are almost as likely as the planted partition, but are completely uncorrelated with it. The problem of finding the planted partition in this case is possible, but conjectured to be NP-hard. Many systematic comparisons of different community detection algorithms were done in a manner that was oblivious to these fundamental facts regarding detectability and hardness [108, 109], even though their existence had been conjectured before [110, 111], and hence should be re-framed with them in mind. Furthermore, we point out that although the analysis based on the BP equations is mature and widely accepted in statistical physics, it is not completely rigorous from a mathematical point of view. Because of this, the result of Decelle et al. [103] leading to the threshold of Eq. 96 has initiated intense activity among mathematicians in search of rigorous proofs, which have subsequently been found for a variety of relaxations of the original statement (see Ref. [112] for a review), and this remains an active area of research.

XII. CONCLUSION

In this chapter we gave a description of the basic variants of the stochastic blockmodel (SBM), and a consistent Bayesian formulation that allows us to infer them from data. The focus has been on developing a framework to extract the large-scale structure of networks while avoiding both overfitting (mistaking randomness for structure) and underfitting (mistaking structure for randomness), and doing so in a manner that is analytically tractable and computationally efficient.

The Bayesian inference approach provides a methodologically correct answer to the very central question in network analysis of whether patterns of large-scale structure can in fact be supported by statistical evidence. Besides this practical aspect, it also opens a window into the fundamental limits of network analysis itself, giving us a theoretical underpinning we can use to understand more about the nature of network systems.

Although the methods described here go a long way toward allowing us to understand the structure of networks, some important open problems remain. From a modeling perspective, we know that for most systems the SBM is quite simplistic, and falls very short of giving us a mechanistic explanation for them. We can interpret the SBM as being to network data what a histogram is to spatial data [113], and thus while it fulfills the formal requirements of being a generative model, it will never deplete the modeling requirements of any particular real system. Although it is naive to expect to achieve such a level of success with a general model like the SBM, it is yet still unclear how far we can go.
For example, it remains to be seen how tractable it is to incorporate local structures — like densities of subgraphs — together with the large-scale structure that the SBM prescribes. From a methodological perspective, although we can select between the various SBM flavors given the statistical evidence available, we still lack good methods to assess the quality of fit of the SBM at an absolute level. In particular, we do not yet have a systematic understanding of how well the SBM is able to reproduce properties of empirical systems, what the most important sources of deficiencies would be, and how these could be overcome.

In addition to these outstanding challenges, there are areas of development that are more likely to undergo continuous progress. Generalizations and extensions of the SBM to cover specific cases are essentially open ended, such as the case of dynamic networks, and we can perhaps expect more realistic models to appear. Furthermore, since the inference of the SBM is in general an NP-hard problem, and thus most probably lacks a general solution, the search for more efficient algorithmic strategies that work in particular cases is also a long-term goal that is likely to attract further attention.

References

[1] Paul Erdős and Alfréd Rényi, "On random graphs, I," Publicationes Mathematicae (Debrecen) 6, 290–297 (1959).
[2] Roger Guimerà, Marta Sales-Pardo, and Luís A. Nunes Amaral, "Modularity from fluctuations in random graphs and complex networks," Physical Review E 70, 025101 (2004).
[3] E. T. Jaynes, Probability Theory: The Logic of Science, edited by G. Larry Bretthorst (Cambridge University Press, Cambridge, UK; New York, NY, 2003).
[4] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt, "Stochastic blockmodels: First steps," Social Networks 5, 109–137 (1983).
[5] Yuchung J. Wang and George Y. Wong, "Stochastic Blockmodels for Directed Graphs," Journal of the American Statistical Association 82, 8–19 (1987).
[6] Tom A. B. Snijders and Krzysztof Nowicki, "Estimation and Prediction for Stochastic Blockmodels for Graphs with Latent Block Structure," Journal of Classification 14, 75–100 (1997).
[7] Krzysztof Nowicki and Tom A. B. Snijders, "Estimation and Prediction for Stochastic Blockstructures," Journal of the American Statistical Association 96, 1077–1087 (2001).
[8] Bo Söderberg, "General formalism for inhomogeneous random graphs," Physical Review E 66, 066121 (2002).
[9] Béla Bollobás, Svante Janson, and Oliver Riordan, "The phase transition in inhomogeneous random graphs," Random Structures & Algorithms 31, 3–122 (2007).
[10] Anne Condon and Richard M. Karp, "Algorithms for graph partitioning on the planted partition model," Random Structures & Algorithms 18, 116–140 (2001).
[11] Marián Boguñá and Romualdo Pastor-Satorras, "Class of correlated random networks with hidden variables," Physical Review E 68, 036112 (2003).
[12] J.-J. Daudin, F. Picard, and S. Robin, "A mixture model for random graphs," Statistics and Computing 18, 173–183 (2008).
[13] Ginestra Bianconi, Paolo Pin, and Matteo Marsili, "Assessing the relevance of node features for network structure," Proceedings of the National Academy of Sciences 106, 11433–11438 (2009).
[14] Santo Fortunato, "Community detection in graphs," Physics Reports 486, 75–174 (2010).
[15] Mark Newman, Networks: An Introduction (Oxford University Press, 2010).
[17] Tiago P. Peixoto, "Entropy of stochastic blockmodel ensembles," Physical Review E 85, 056122 (2012).
[18] Tiago P. Peixoto, "Model Selection and Hypothesis Testing for Large-Scale Network Models with Overlapping Groups," Physical Review X 5, 011033 (2015).
[19] Tiago P. Peixoto, "Nonparametric Bayesian inference of the microcanonical stochastic block model," Physical Review E 95, 012317 (2017).
[20] M. B. Hastings, "Community detection as an inference problem," Physical Review E 74, 035102 (2006).
[21] Charles Kemp and Joshua B. Tenenbaum, "Learning systems of concepts with an infinite relational model," in Proceedings of the 21st National Conference on Artificial Intelligence (2006).
[22] Martin Rosvall and Carl T. Bergstrom, "An information-theoretic framework for resolving community structure in complex networks," Proceedings of the National Academy of Sciences 104, 7327-7331 (2007).
[23] Jake M. Hofman and Chris H. Wiggins, "Bayesian Approach to Network Modularity," Physical Review Letters 100, 258701 (2008).
[24] Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing, "Mixed Membership Stochastic Blockmodels," J. Mach. Learn. Res. 9, 1981-2014 (2008).
[25] Roger Guimerà and Marta Sales-Pardo, "Missing and spurious interactions and the reconstruction of complex networks," Proceedings of the National Academy of Sciences 106, 22073-22078 (2009).
[26] M. Mørup, M. N. Schmidt, and Lars Kai Hansen, "Infinite multiple membership relational modeling for complex networks," in 2011 IEEE International Workshop on Machine Learning for Signal Processing (2011) pp. 1-6.
[27] Jörg Reichardt, Roberto Alamino, and David Saad, "The Interplay between Microscopic and Mesoscopic Structures in Complex Networks," PLoS ONE 6, e21282 (2011).
[28] Morten Mørup and Mikkel N. Schmidt, "Bayesian Community Detection," Neural Computation 24, 2434-2456 (2012).
[29] Tiago P. Peixoto, "Parsimonious Module Inference in Large Networks," Physical Review Letters 110, 148701 (2013).
[30] M. N. Schmidt and M. Mørup, "Nonparametric Bayesian Modeling of Complex Networks: An Introduction," IEEE Signal Processing Magazine 30, 110-128 (2013).
[31] Etienne Côme and Pierre Latouche, "Model selection and clustering in stochastic block models based on the exact integrated complete data likelihood," Statistical Modelling 15, 564-589 (2015).
[32] Xiaoran Yan, "Bayesian Model Selection of Stochastic Block Models," arXiv:1605.07057 (2016).
[33] M. E. J. Newman and Gesine Reinert, "Estimating the Number of Communities in a Network," Physical Review Letters 117, 078301 (2016).
[34] Wenjie Fu, Le Song, and Eric P. Xing, "Dynamic Mixed Membership Blockmodel for Evolving Networks," in Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09 (ACM, New York, NY, USA, 2009) pp. 329-336.
[35] K. S. Xu and A. O. Hero, "Dynamic Stochastic Blockmodels for Time-Evolving Social Networks," IEEE Journal of Selected Topics in Signal Processing 8, 552-562 (2014).
[36] Tiago P. Peixoto and Martin Rosvall, "Modelling sequences and temporal networks with dynamic community structures," Nature Communications 8, 582 (2017).
[37] Leto Peel and Aaron Clauset, "Detecting Change Points in the Large-Scale Structure of Evolving Networks," in Twenty-Ninth AAAI Conference on Artificial Intelligence (2015).
[38] Amir Ghasemian, Pan Zhang, Aaron Clauset, Cristopher Moore, and Leto Peel, "Detectability Thresholds and Optimal Algorithms for Community Structure in Dynamic Networks," Physical Review X 6, 031005 (2016).
[39] Xiao Zhang, Cristopher Moore, and Mark E. J. Newman, "Random graph models for dynamic networks," The European Physical Journal B 90, 200 (2017).
[40] Marco Corneli, Pierre Latouche, and Fabrice Rossi, "Exact ICL maximization in a non-stationary temporal extension of the stochastic block model for dynamic networks," Neurocomputing 192, 81-91 (2016).
[41] Catherine Matias and Vincent Miele, "Statistical clustering of temporal networks through a dynamic stochastic block model," Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 1119-1141 (2016).
[42] N. J. A. Sloane, The on-line encyclopedia of integer sequences: A000670 (2003).
[43] N. J. A. Sloane, The on-line encyclopedia of integer sequences: A008277 (2003).
[44] David J. C. MacKay, Information Theory, Inference and Learning Algorithms, first ed. (Cambridge University Press, 2003).
[45] J. Rissanen, "Modeling by shortest data description," Automatica 14, 465-471 (1978).
[46] Peter D. Grünwald, The Minimum Description Length Principle (The MIT Press, 2007).
[47] Yurii Mikhailovich Shtar'kov, "Universal sequential coding of single messages," Problemy Peredachi Informatsii 23, 3-17 (1987).
[48] Peter Grünwald, "A tutorial introduction to the minimum description length principle," arXiv:math/0406077 (2004).
[49] Gideon Schwarz, "Estimating the Dimension of a Model," The Annals of Statistics 6, 461-464 (1978).
[50] H. Akaike, "A new look at the statistical model identification," IEEE Transactions on Automatic Control 19, 716-723 (1974).
[51] Xiaoran Yan, Cosma Shalizi, Jacob E. Jensen, Florent Krzakala, Cristopher Moore, Lenka Zdeborová, Pan Zhang, and Yaojia Zhu, "Model selection for degree-corrected block models," Journal of Statistical Mechanics: Theory and Experiment 2014, P05007 (2014).
[52] Thomas M. Cover and Joy A. Thomas, Elements of Information Theory (Wiley-Interscience, 1991).
[53] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences 99, 7821-7826 (2002).
[54] T. S. Evans, "Clique graphs and overlapping communities," Journal of Statistical Mechanics: Theory and Experiment 2010, P12037 (2010).
[55] T. S. Evans, "American College Football Network Files," FigShare (2012), 10.6084/m9.figshare.93179.
[56] Leto Peel, Daniel B. Larremore, and Aaron Clauset, "The ground truth about metadata and community detection in networks," Science Advances 3, e1602548 (2017).
[57] M. E. J. Newman and Aaron Clauset, "Structure and inference in annotated networks," Nature Communications 7, 11863 (2016).
[58] Darko Hric, Tiago P. Peixoto, and Santo Fortunato, "Network Structure, Metadata, and the Prediction of Missing Nodes and Annotations," Physical Review X 6, 031038 (2016).
[59] Y. X. Rachel Wang and Peter J. Bickel, "Likelihood-based model selection for stochastic block models," The Annals of Statistics 45, 500-528 (2017).
[60] Santo Fortunato and Marc Barthélemy, "Resolution limit in community detection," Proceedings of the National Academy of Sciences 104, 36-41 (2007).
[61] Tiago P. Peixoto, "Hierarchical Block Structures and High-Resolution Model Selection in Large Networks," Physical Review X 4, 011047 (2014).
[62] D. Holten, "Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data," IEEE Transactions on Visualization and Computer Graphics 12, 741-748 (2006).
[63] Benjamin H. Good, Yves-Alexandre de Montjoye, and Aaron Clauset, "Performance of modularity maximization in practical contexts," Physical Review E 81, 046106 (2010).
[64] Darko Hric, Richard K. Darst, and Santo Fortunato, "Community detection in networks: Structural communities versus ground truth," Physical Review E 90, 062805 (2014).
[65] Harold Jeffreys, Theory of Probability, third ed. (Oxford University Press, Oxford; New York, 2000).
[66] Lada A. Adamic and Natalie Glance, "The political blogosphere and the 2004 U.S. election: divided they blog," in Proceedings of the 3rd International Workshop on Link Discovery, LinkKDD '05 (ACM, New York, NY, USA, 2005) pp. 36-43.
[67] Roger Guimerà and Luís A. Nunes Amaral, "Functional cartography of complex metabolic networks," Nature 433, 895-900 (2005).
[68] Aaron Clauset, Cristopher Moore, and M. E. J. Newman, "Hierarchical structure and the prediction of missing links in networks," Nature 453, 98-101 (2008).
[69] Diego Garlaschelli, Frank den Hollander, and Andrea Roccaverde, "Ensemble nonequivalence in random graphs with modular structure," Journal of Physics A: Mathematical and Theoretical 50, 015001 (2017).
[70] Brian Ball, Brian Karrer, and M. E. J. Newman, "Efficient and principled method for detecting communities in networks," Physical Review E 84, 036103 (2011).
[71] V. Krebs, "Political Books Network," unpublished, retrieved from Mark Newman's website: http://www-personal.umich.edu/~mejn/netdata/.
[72] Yong-Yeol Ahn, James P. Bagrow, and Sune Lehmann, "Link communities reveal multiscale complexity in networks," Nature 466, 761-764 (2010).
[73] Tiago P. Peixoto, "Inferring the mesoscale structure of layered, edge-valued, and time-varying networks," Physical Review E 92, 042807 (2015).
[74] Christopher Aicher, Abigail Z. Jacobs, and Aaron Clauset, "Learning latent block structure in weighted networks," Journal of Complex Networks, cnu026 (2014).
[75] Tiago P. Peixoto, "Nonparametric weighted stochastic block models," Physical Review E 97, 012306 (2018).
[76] N. Stanley, S. Shai, D. Taylor, and P. J. Mucha, "Clustering Network Layers with the Strata Multilayer Stochastic Block Model," IEEE Transactions on Network Science and Engineering 3, 95-105 (2016).
[77] Subhadeep Paul and Yuguo Chen, "Consistent community detection in multi-relational data through restricted multi-layer stochastic blockmodel," Electronic Journal of Statistics 10, 3807-3870 (2016).
[78] Toni Vallès-Català, Francesco A. Massucci, Roger Guimerà, and Marta Sales-Pardo, "Multilayer Stochastic Block Models Reveal the Multilayer Structure of Complex Networks," Physical Review X 6, 011036 (2016).
[79] Caterina De Bacco, Eleanor A. Power, Daniel B. Larremore, and Cristopher Moore, "Community detection, link prediction, and layer interdependence in multilayer networks," Physical Review E 95, 042317 (2017).
[80] Leto Peel, "Active discovery of network roles for predicting the classes of network nodes," Journal of Complex Networks 3, 431-449 (2015).
[81] Travis Martin, Brian Ball, and M. E. J. Newman, "Structural inference for uncertain networks," Physical Review E 93, 012306 (2016).
[82] M. E. J. Newman and Tiago P. Peixoto, "Generalized Communities in Networks," Physical Review Letters 115, 088701 (2015).
[83] M. E. J. Newman and G. T. Barkema, Monte Carlo Methods in Statistical Physics (Oxford University Press, Oxford; New York, 1999).
[84] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller, "Equation of State Calculations by Fast Computing Machines," The Journal of Chemical Physics 21, 1087 (1953).
[85] W. K. Hastings, "Monte Carlo sampling methods using Markov chains and their applications," Biometrika 57, 97-109 (1970).
[86] S. Kirkpatrick, C. D. Gelatt Jr, and M. P. Vecchi, "Optimization by simulated annealing," Science 220, 671 (1983).
[87] Prem K. Gopalan and David M. Blei, "Efficient discovery of overlapping communities in massive networks," Proceedings of the National Academy of Sciences 110, 14534-14539 (2013).
[88] Maria A. Riolo, George T. Cantwell, Gesine Reinert, and M. E. J. Newman, "Efficient method for estimating the number of communities in a network," Physical Review E 96, 032310 (2017).
[89] Tiago P. Peixoto, "Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models," Physical Review E 89, 012804 (2014).
[90] J. Kiefer, "Sequential Minimax Search for a Maximum," Proceedings of the American Mathematical Society 4, 502 (1953).
[91] Tiago P. Peixoto, "The graph-tool python library," figshare (2014), 10.6084/m9.figshare.1164194, available at https://graph-tool.skewed.de.
[92] Cristopher Moore, "The Computer Science and Physics of Community Detection: Landscapes, Phase Transitions, and Hardness," arXiv:1702.00467 (2017).
[93] Lenka Zdeborová and Florent Krzakala, "Statistical physics of inference: thresholds and algorithms," Advances in Physics 65, 453-552 (2016).
[94] Trevor F. Cox and M. A. A. Cox, Multidimensional Scaling, 2nd ed. (Chapman and Hall/CRC, Boca Raton, 2000).
[95] Wayne W. Zachary, "An Information Flow Model for Conflict and Fission in Small Groups," Journal of Anthropological Research 33, 452-473 (1977).
[96] M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices," Physical Review E 74, 036104 (2006).
[97] Toni Vallès-Català, Tiago P. Peixoto, Roger Guimerà, and Marta Sales-Pardo, "On the consistency between model selection and link prediction in networks," arXiv:1705.07967 (2017).
[98] Roger Guimerà and Marta Sales-Pardo, "A Network Inference Method for Large-Scale Unsupervised Identification of Novel Drug-Drug Interactions," PLoS Comput Biol 9, e1003374 (2013).
[99] Núria Rovira-Asenjo, Tània Gumí, Marta Sales-Pardo, and Roger Guimerà, "Predicting future conflict between team-members with parameter-free models of social networks," Scientific Reports 3 (2013), 10.1038/srep01999.
[100] Roger Guimerà, Alejandro Llorente, Esteban Moro, and Marta Sales-Pardo, "Predicting Human Preferences Using the Block Structure of Complex Social Networks," PLoS ONE 7, e44620 (2012).
[101] Antonia Godoy-Lorite, Roger Guimerà, Cristopher Moore, and Marta Sales-Pardo, "Accurate and scalable social recommendation using mixed-membership stochastic block models," Proceedings of the National Academy of Sciences 113, 14207-14212 (2016).
[102] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová, "Inference and Phase Transitions in the Detection of Modules in Sparse Networks," Physical Review Letters 107, 065701 (2011).
[103] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová, "Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications," Physical Review E 84, 066106 (2011).
[104] F. Y. Wu, "The Potts model," Reviews of Modern Physics 54, 235-268 (1982).
[105] Marc Mezard and Andrea Montanari, Information, Physics, and Computation (Oxford University Press, 2009).
[106] M. Mezard, Spin Glass Theory And Beyond: An Introduction To The Replica Method And Its Applications (WSPC, Singapore; New Jersey, 1986).
[107] M. E. Dyer and A. M. Frieze, "The solution of some random NP-hard problems in polynomial expected time," Journal of Algorithms 10, 451-489 (1989).
[108] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi, "Benchmark graphs for testing community detection algorithms," Physical Review E 78, 046110 (2008).
[109] Andrea Lancichinetti and Santo Fortunato, "Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities," Physical Review E 80, 016118 (2009).
[110] Jörg Reichardt and Michele Leone, "(Un)detectable Cluster Structure in Sparse Networks," Physical Review Letters 101, 078701 (2008).
[111] Peter Ronhovde and Zohar Nussinov, "Multiresolution community detection for megascale networks by information-based replica correlations," Physical Review E 80, 016109 (2009).
[112] Emmanuel Abbe, "Community detection and stochastic block models: recent developments," arXiv:1703.10146 (2017).
[113] Sofia C. Olhede and Patrick J. Wolfe, "Network histograms and universality of blockmodel approximation," Proceedings of the National Academy of Sciences 111, 14722-14727 (2014).