Nonparametric Bayesian inference of the microcanonical stochastic block model

Nonparametric Ba y esian inference of the micro canonical sto c hastic blo c k mo del Tiago P . P eixoto ∗ Dep artment of Mathematic al Scienc es and Centr e for Networks and Col le ctive Behaviour, University of Bath, Claverton Down, Bath BA2 7A Y, Unite d Kingdom and ISI F oundation, Via Alassio 11/c, 10126 T orino, Italy A principled approac h to characterize the hidden structure of netw orks is to formulate genera- tiv e mo dels, and then infer their parameters from data. When the desired structure is comp osed of modules or "comm unities", a suitable choice for this task is the sto c hastic blo ck model (SBM), where no des are divided in to groups, and the placement of edges is conditioned on the group mem- b erships. Here, we present a nonparametric Bay esian metho d to infer the mo dular structure of empirical netw orks, including the num ber of modules and their hierarchical organization. W e fo cus on a micro canonical v ariant of the SBM, where the structure is imp osed via hard constraints, i.e. the generated net works are not allo w ed to violate the patterns imp osed b y the mo del. W e show ho w this simple model v ariation allows sim ultaneously for tw o imp ortan t impro vemen ts ov er more tra- ditional inference approaches: 1. Deep er Bay esian hierarc hies, with noninformative priors replaced b y sequences of priors and h yperpriors, that not only remov e limitations that seriously degrade the inference on large netw orks, but also reveal structures at multiple scales; 2. A very eﬃcient inference algorithm that scales well not only for netw orks with a large num ber of nodes and edges, but also with an unlimited num b er of mo dules. W e show also how this approach can b e used to sample mo dular hierarchies from the posterior distribution, as well as to p erform mo del selection. W e discuss and analyze the diﬀerences b et ween sampling from the p osterior and simply ﬁnding the single parameter estimate that maximizes it. F urthermore, w e exp ose a direct equiv alence b etw een our micro canonical approach and alternative deriv ations based on the canonical SBM. I. INTR ODUCTION One of the most basic goals in the study of so cial, bi- ological and technological net works is the characteriza- tion of their structural patterns. As these systems b e- come large, this quickly b ecomes a non trivial problem, as naive metho ds of inspection are no longer useful, and simple statistics often hide crucial information. A popu- lar approach to this problem is the developmen t of meth- o ds that divide the netw ork b y grouping together no des that share similar features, thereby reducing it to a more manageable size, and in the pro cess revealing an y latent mo dular organization. This is the core idea b ehind a v ery large n umber of heuristic metho ds prop osed in the last decade and a half [1, 2], which despite sharing the same motiv ation diﬀer substan tially from eac h other, due mainly to the v arious w ays this intuitiv e idea can b e im- plemen ted concretely . Ov er time it has b ecome clear that most of these metho ds are marred b y serious limitations, suc h as the incapacity of distinguishing structure from noise [3] and to ﬁnd small structures in large systems [4], as well as the fact that the same metho d often yields m ultiple diverging results for the same netw ork [5], and that the outcomes of most metho ds agree neither with eac h other [2] nor with known no de annotations [6]. Lik e some more recent w orks in this area, here we fol- lo w a diﬀerent and arguably more principled path, de- signed to ov ercome some of the se limitations. Namely , instead of form ulating heuristics, we construct probabilis- tic generative mo dels of net works, that include the afore- ∗ t.peixoto@bath.ac.uk men tioned idea of mo dular struc ture as parameters to the mo del. The mo dular organization is then determined b y inferring these parameters from data, using well-founded metho ds from Bay esian inference and statistical physics. In this context, the problem of separating structure from noise is dealt with by employing nonparametric inference, where generative pro cesses for the model parameters are also formulated via prior probabilities . Additionally , the comparison of diﬀeren t mo dular partitions — obtained either from the same or from diﬀerent mo dels incorp o- rating p otentially diﬀerent ideas ab out mo dular organi- zation — can b e p erformed probabilistically , and amoun t to a comparison of alternative generative hypotheses ac- cording to statistical evidence. In this work, we fo cus on a sp eciﬁc family of generative mo dels based on the stochastic blo c k mo del (SBM) [7], where no des are divided into groups, and the edges are placed randomly b etw een no des, with probabilities that dep end on their group memberships. In particular, we consider a micr o c anonic al v ariation of this family , where the structural constraints are imp osed strictly across the ensem ble, as opp osed to only on av erage, as is more typ- ically done. W e show how this approach makes it easier to incorp orate more elab orate generative mo dels, where parameters are sampled from conditioned prior proba- bilities, whic h themselves are sampled from hyperprior distributions. This yields a more p ow erful metho d that rev eals the hierarc hical organization of net works in m ul- tiple scales, and has a muc h increased capacit y of ﬁnding statistically signiﬁcant structures in large data. F urther- more, we sho w ho w this particular form ulation allows for a very eﬃcient inference algorithm that scales well not only for net works with a large num b er of no des and 2 edges, but also with an unlimited n umber of mo dules — in con trast to the ma jority of other similar inference al- gorithms that b ecome increasingly slow er as the num ber of groups b ecomes large. The approach taken here builds up on ideas from pre- vious w ork [8 – 10], but here we fo cus on obtaining hi- erarc hical netw ork partitions that are sample d from the p osterior distribution, instead of ﬁnding only the most lik ely partition, which requires a diﬀeren t ansatz. W e also show ho w mo del selection can be used to choose b e- t ween diﬀerent mo del v arian ts according to the statistical evidence av ailable in the data, and how the metho d fares for a v ariet y of empirical netw orks. F urthermore we show that the micro canonical formulation used here is — in its most basic form — equiv alen t to a sp eciﬁc Bay esian for- m ulation of the “canonical” SBM, and thus we establish a bridge b et ween b oth approaches. The pap er is divided as follows. W e b egin in Sec. II with the micro canonical SBM, and follow in Sec. I I I with the outline of the nonparametric inference approach, by describing in turn the priors and hyperpriors of the dif- feren t set of parameters. In Sec. IV we show how the micro canonical formulation is related to the more usual canonical approach, and in Sec. V we analyze the limi- tations of the inference pro cedure, and w e show how the hierarc hical approac h is capable of ﬁnding a muc h larger n umber of groups in large net works. In Sec. VI we presen t an eﬃcient MCMC algorithm to sample hierarchical par- titions from the p osterior distribution. In Sec. VI I we sho w how diﬀerent model v ariations can b e compared, and in Sec. VI I I we show ho w the same v ariations b e- ha ve for empirical net works. W e ﬁnalize in Sec. IX with a discussion. I I. THE MICROCANONICAL DEGREE-CORRECTED SBM W e b egin with a “degree-corrected” v ersion of the SBM [11] (DC-SBM), where in addition to the mo dular structure, the netw orks generated p ossess a prescrib ed degree sequence. How ev er, diﬀerently from its original deﬁnition, here we assume that the degree sequence is ﬁxed exactly , instead of only in exp ectation. W e will see later that the non-degree-corrected version of the mo del (NDC-SBM) can b e obtained from this more general for- m ulation as a sp ecial case. The parameters of the mo del are the partition b = { b i } of N nodes into B groups, where b i ∈ [1 , B ] is the group mem b ership of node i , the degree sequence k = { k i } , and the matrix of edge counts b et ween groups e = { e rs } , where e rs is the n umber of edges b etw een groups r and s (for conv enience of notation, e rr is twic e the n umber of edges inside group r ). Given these parameters, netw orks are generated like in the conﬁguration mo del [12, 13]: T o each vertex i is attributed k i half-edges (or “stubs”), whic h are paired randomly to each other — allo wing for m ultiple pairings b etw een the same t wo no des as well as self-lo ops — respecting the constraint that betw een groups r and s there are exactly e rs pairings. Assuming momen tarily that the half-edges are distinguishable, the n umber of p ossible pairings that satisfy this constrain t is giv en by Ω( e ) = Q r e r ! Q r 0 , and q ( m, n ) = 0 for m ≤ 0 or n ≤ 0 . With this, the full table of v alues for m ≤ M and n ≤ m can b e computed in time O ( M 2 ) . Hence, if the num b er of edges and no des is not to o large, w e can pre-compute these v alues as a setup to the inference pro cedure. How ev er, this can still b e- come computationally exp ensive for very large systems. Unfortunately , no closed-form expression for q ( m, n ) is kno wn whic h w ould allow us to compute it in constant time. F ortunately , how ev er, accurate asymptotic expres- sions are known, which p ermit eﬃcient computation for large arguments. Namely , for large v alues of m the num- b er of partitions approac hes asymptotically the following v alue [27–29] q ( m, n ) ≈ f ( u ) m exp( √ mg ( u )) , (31) where u = n/ √ m and the functions f ( u ) and g ( u ) are giv en by f ( u ) = v ( u ) 2 3 / 2 π u h 1 − (1 + u 2 / 2) e − v ( u ) i − 1 / 2 , (32) g ( u ) = 2 v ( u ) u − u ln(1 − e − v ( u ) ) , (33) and v ( u ) is given implicitly b y solving v = u p − v 2 / 2 − Li 2 (1 − e v ) , (34) where Li 2 ( z ) = − R z 0 [ln(1 − t ) /t ]d t is the dilogarithm function. (Eq. 34 can b e easily solv ed numerically via Newton’s metho d, or simply via rep eated iteration, whic h con verges within machine precision usually after only v ery few steps). This approximation holds for v alues of n ≥ m 1 / 6 . F or smaller v alues n  m 1 / 3 w e hav e in- stead [30] q ( m, n ) ≈  m − 1 n − 1  m ! . (35) With Eqs. 31 to 35 w e hav e an approximation for q ( m, n ) for the entire range of parameters m and n that is remark- ably accurate, as sho wn in Fig. 2: F or arguments of the 7 0 2000 4000 6000 8000 10000 m 0 50 100 150 200 250 ln q ( m , n ) n = 10000 n = 100 n = 30 n = 10 n = 2 Exact Approximation 0 2000 4000 6000 8000 10000 m − 0 . 2 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 1 . 2 1 . 4 1 . 6 ln q approx. ( m , n ) − ln q exact ( m , n ) n = 10000 n = 100 n = 30 n = 10 n = 2 1000 4000 7000 10000 0 . 00 0 . 02 0 . 04 0 . 06 0 . 08 0 . 10 Figure 2. Comparisons b etw een the exact and approximated v alues of the num b er of restricted partitions q ( m, n ) , using Eqs. 30 and 31 to 35. The top panel shows b oth v alues com- puted for diﬀerent v alues of m and n , and the b ottom panel sho ws the absolute diﬀerence of their logarithms, with the inset displaying a zo om into the large m region. order 10 3 , the largest log ratio b etw een the approximate and exact v alues is only around 0 . 1 , whic h has a neg- ligible eﬀect on th e outcome of hypothesis testing, and is b elow the accuracy usually required for MCMC sam- pling. In our implementation, w e pre-compute q ( m, n ) using the exact Eq. 30 for m < 10 4 , and resort to Eqs. 31 to 35 only for larger arguments, thus guaranteeing a com- putation of q ( m, n ) in time O (1) , and hence incurring a negligible impact in the o verall algorithmic complexity of the inference pro cedure. As seen in Fig. 3, the exp ected degree distribution sam- pled from Eq. 29 is typically signiﬁcantly broader than the exponential distribution obtained with Eq. 26. As sho wn in Appendix A, this will approac h a Bose-Einstein distribution, with a v ariance σ 2 k ∝ √ N that will div erge for a large system size. In particular, the distribution will asymptotically approach a scale-free form p k ∼ 1 /k 0 500 1000 1500 2000 k 10 − 3 10 − 2 10 − 1 10 0 10 1 10 2 10 3 10 4 h η k i Non-de gree-corrected Uniform Conditioned on uniform hyperprior 10 0 10 1 10 2 10 3 10 4 10 − 5 10 − 4 10 − 3 10 − 2 10 − 1 10 0 10 1 10 2 10 3 10 4 Figure 3. Exp ected degree distributions for the three diﬀeren t priors considered in the text for the degree sequence inside eac h group — the NDC-SBM, the uniform prior of Eq. 26 and the prior of Eq. 29 conditioned on a degree distribution sampled randomly — for N = 10 4 no des and av erage degree h k i = 10 . In all cases, the distributions were sampled from their resp ective micro canonical distributions using rejection sampling. The dashed line shows the Bose-Einstein distribu- tion of Eq. A13. for k  √ E , follow ed b y an exp onential decay for larger argumen ts. Although this prior assumption clearly fav ors broader degree distributions, it could b e argued that it still does not prop erly capture the structure of real netw orks, most of whic h also do not p ossess a Bose-Einstein degree dis- tribution. Indeed, it may seem that b y changing b et w een the priors cons idered ab o v e, w e hav e simply switched b et w een Poisson, geometric and Bose-Einstein distribu- tions, which are just three of an inﬁnite range of p ossibil- ities. How ev er, in reality , the conditioned prior of Eq. 27 will not concentrate as strongly on the exp ected distri- bution as the other tw o, and th us will not signiﬁcan tly p enalize distributions that deviate from it, even if the deviation is very large, as will now b e shown. In order to assess the improv emen t brough t on b y the conditioned prior, it is instructive to obtain the asymp- totic b eha vior of q ( m, n ) in the limit of “suﬃcient data” with m  1 and n  1 , which is given b y [30] q ( m, n ) ≈ p ( m ) exp − √ 6 m π e − π n/ √ 6 m ! , (36) as long as n  √ m and where p ( m ) = q ( m, m ) is the n umber of unconstrained partitions of m , whic h itself is giv en exactly by the recursion p ( m ) = X k> 0 ( − 1) k − 1 p ( m − k (3 k − 1) / 2) (37) and for large v alues of m by the Hardy-Ramanujan for- 8 m ula [31, 32] p ( m ) ≈ 1 4 √ 3 m exp( π p 2 m/ 3) . (38) With these results, we see immediately that for “sparse” groups with e r ∝ n r and n r  1 we hav e ln q ( e r , n r ) ∼ O ( √ n r ) , and hence ln P ( k | e , b ) ≈ − X r n r H ( η r ) + O ( √ n r ) , (39) where H ( η r ) = − P k ( η r k /n r ) ln( η r k /n r ) is the entrop y of the empirical degree distribution in group r . Therefore, for suﬃcien tly many no des in each group, the h yp erprior of Eq. 29 will “wash out” and the probability of Eq. 27 will approac h that of the the actual degree sequence, what- ev er its form may b e, even if it deviates from the t ypical form of Fig. 3. This is not the case of the uniform prior of Eq. 26, whic h is not able to “learn” the underlying dis- tribution in the same manner. Eq. 39 also means that an exact prior knowledge of the true degree distribution in eac h group would impro ve the log-probability (and the description length) only b y a factor O ( √ n r ) , which will b e dwarfed asymptotically by the remaining terms that scale linearly as O ( n r ) . Therefore, any further impro ve- men t in the c hoice of prior for the degree sequence is conﬁned to a relatively narrow range, similarly to what happ ens with the prior for the partition of the no des into groups. D. Prior for the edge counts and nested SBM hierarc hies The remaining piece is the prior for the edge counts b et w een groups, e . W e can start again with a uniform prior P ( e ) =       B 2   E     − 1 , (40) where   ( ( B 2 ) ) E   coun ts the num ber of symmetric e rs ma- trices with a constrained sum P rs e rs = 2 E . P erhaps unsurprisingly at this p oint, this is also not a go o d choice. This time, how ev er, the negativ e eﬀects are somewhat more dramatic than the previous choices of uniform priors. Namely , this assumption will limit our capacit y to detect small groups in v ery large netw orks: It in tro duces a “resolution limit”, where the largest num b er of groups that can b e inferred scales as B max ∼ √ N [8], similar to what is observ ed with the mo dularit y maxi- mization heuristic [4]. W e revisit this issue in more detail in Sec. V. As was shown in Ref. [10], this problem can b e solved again by deep ening the Bay esian hierarch y . It is useful no w to notice that the matrix e can b e interpreted as the adjacency matrix of a multigraph with B no des and E edges. Hence, an appropriate choice seems to b e to use the SBM again to generate it, where each group r b elongs to one of another set of groups, and so on recursively , a L num ber of times, P ( { e l }|{ b l } ) = L Y l =1 P ( e l | e l +1 , b l ) , (41) where b l is the partition of the groups in level l , e l is the (weigh ted) adjacency matrix at level l , and we en- force alw ays that B L = 1 . Note that since the num b er of edges is the same in all lev els while the num b er of no des decreases, the m ultigraphs b ecome increasingly denser at the upp er lev els, and the o ccurrence of parallel edges b e- comes predominant, even if the graph at the lo west lev el is sparse and simple. Although the lik eliho o d of Eq. 3 that was used at the b ottom lev el also admits arbitrarily dense m ultigraphs, it will not generate them uniformly within the SBM constrain ts, since it is based on an uni- form generation of c onﬁgur ations . Because of this, it is not a go od idea to use the exact same mo del as the priors in the upp er la yers, which will introduce a signiﬁcan t bias as the multigraphs b ecome dense. Indeed, simply insert- ing Eq. 3 into Eq. 41 makes all successive levels cancel out in the likelihoo d, yielding a trivial mo del where only the ﬁrst and last levels hav e an y contribution. A m uch b etter approac h, whic h is unbiased and maximally non- informativ e within the imp osed constraints, is to use a uniform NDC-SBM for m ultigraphs directly , where all allo wed multigraphs (not their corresp onding conﬁgura- tions) o ccur with the same probability . The lik eliho o d can b e obtained via basic enumeration [33], and is given b y P ( e l | e l +1 , b l ) = Y r 0 if c > (2 ln 2) /B ≈ 1 . 4 /B . The ensemble is equiv alen t to a fully random netw ork at a slightly smaller v alue c = 1 /B [but is already unde- tectable at c = 1 /B ± ( B − 1) / ( B p h k i ) [39]]. Hence, as the num ber of groups increases, for the v ast ma- jorit y of parameter choices c ∈ [(2 ln 2) /B , 1] w e ha v e that option 2 is fa vored with a conﬁdence that grows as ln Λ = O ( B 2 ) . Beside these argumen ts, there are other more impor- tan t reasons to prefer option 2. If w e adopt its micro- canonical interpretation, we can address the issues with the noninformative priors discussed in the previous sec- tions, and replace b oth P ( k | e , b ) and P ( e ) by distribu- tions conditioned on hyperparameters. F urthermore, as already mentioned, changes to the likelihoo d of Eq. 55 can b e computed more eﬃciently than Eq. 51: If w e mov e a no de i to a new group, w e need to up date O ( B ) terms in Eq. 51, whereas in Eq. 55 at most only O ( k i ) terms need to b e recomputed (indep enden t of B ). This leads to a substan tial impro vemen t in the p erformance of inference algorithms, as discussed further in Sec. VI. V. HO W MANY GROUPS CAN BE INFERRED? One of the main strengths of the nonparametric ap- proac h presented here is that it can b e used to deter- mine the n umber of groups B , in addition to the other mo del parameters. One natural question that arises is whether there are in trinsic limitations associated with the inference of this parameter. In particular, here w e are interested in the situation where the inferred num- b er of groups B ∗ is smaller than the true v alue B used to generated the netw ork, such that parts of the mo du- lar structure are not resolved b y inference. As sho wn in Ref. [8] with a simpliﬁed version of the mo del presented here, if the size and density of the netw ork are kept ﬁxed, and the planted v alue exceeds a threshold B > B max , we ha ve that B ∗ = B max and the plan ted mo dular structure cannot b e fully resolv ed. In particular, the choice of a noninformativ e prior for the edge counts P ( e ) leads to a limitation where at most only B max = O ( √ N ) groups can b e identiﬁed. Replacing this noninformative prior by 12 Figure 4. Planted partition of B = 6 equal sized groups (no de colors), b eing wrongly ﬁtted as a B ∗ = 3 mo del (shaded re- gion). In the ﬁtted mo del, the tw o groups inside each shaded region are not prop erly identiﬁed. This problem happ ens whenev er B > B max for B max = O ( √ N ) using noninformativ e priors for the edge counts, but only for B max = O ( N / ln N ) when the hierarchical priors are used instead. a series of nested SBMs was sho wn in Ref. [10] to signiﬁ- can tly alleviate this limitation, increasing the maximum n umber of groups to B max = O ( N/ ln N ) . Here we re- visit this issue, considering the more elaborate mo dels presen ted in this work. W e p erform our analysis on a degree-corrected planted partition mo del, with B plan ted groups of equal size, each con taining exactly E /B edges connecting their no des randomly , and no connections at all b et w een nodes of diﬀeren t groups, i.e. e rs = 2 E δ rs /B . The likelihoo d of an y particular netw ork sampled from this mo del is P ( A | k , e , b ) = (2 E /B )!! B (2 E /B )! B × Q i k i ! Q i B max . VI. INFERENCE ALGORITHM The inference task we ha ve is to sample from (or max- imize) the p osterior distribution of the hierarc hical par- tition, P ( { b l }| A ) = P ( A , { b l } ) P ( A ) . (74) The approach we will take is based on a Mark ov chain Mon te Carlo imp ortance sampling for the partitions at all hierarch y lev els. The algorithm will revolv e around mo ving the membership of no des in diﬀeren t hierarc hical lev els at random, and accepting or rejecting those mov es, so that after a suﬃciently long equilibration time, the hierarc hical partitions are sampled according to Eq. 74. W e note that this p osterior can b e factorized as P ( { b l }| A ) = Q l P ( e l − 1 , b l | e l ) P ( A ) = Y l P ( b l | e l − 1 , e l ) (75) with p er-lev el p osteriors P ( b l | e l , e l +1 ) = P ( e l | e l +1 , b l ) P ( b l ) P ( e l | e l +1 ) , (76) where we assume e 0 = A , and P ( e l | e l +1 ) is a normaliza- tion constant. Therefore, a w ork able approach is to separately sam- ple partitions at eac h lev el according to its individual p osterior, conditioned on the remaining levels, which are k ept unchanged for the time being. If we sample from eac h level in this manner w e can guarantee ergo dicit y , and if the mo ves at the individual levels are reversible, the ov erall distribution will corresp ond to the desired full p osterior of Eq. 74. Since the hierarchical lev els are cou- pled, when moving a no de at level l , we m ust ensure that this do es not inv alidate the partition at lev el l + 1 . Hence, w e must forbid node mov es betw een groups that are themselves at diﬀerent groups in the next level. (This constrain t do es not break ergo dicity , since all partitions in the upp er lev els will be allo wed to c hange at some p oin t). In more detail, w e pro ceed as follows. At each individ- ual level l , we p erform a mo ve prop osal of no de i from its curren t group r to a new group s , according to a probabilit y P ( b ( l ) i = r → s ) that we will sp ecify shortly . W e compute the diﬀerence in the log-lik eliho o d ∆ ln P l at that level, and w e accept the mo ve according to the Metrop olis-Hastings criterion [40, 41], i.e. with a proba- bilit y a = min ( 1 , e ∆ ln P l P ( b ( l ) i = s → r ) P ( b ( l ) i = r → s ) ) , (77) where P ( b ( l ) i = s → r ) is the probability of the rev erse mo ve being prop osed. The log-likelihoo d diﬀerence is computed as ∆ ln P l = ln P ( b ( l ) i = s, b l r b ( l ) i | e l , e l +1 ) P ( b ( l ) i = r , b l r b ( l ) i | e l , e l +1 ) , (78) where b l r b ( l ) i means the partition of the remaining no des excluding no de i . Note that in computing Eq. 78, we do not need to determine the normalization constant in Eq. 76, and the remaining relev an t terms corresp ond only to a subset of the full joint distribution of Eq. 45. Typi- cally , the num b er of groups in the upp er lev els decreases exp onen tially , and hence the algorithmic complexity is dominated by the b ottom level l = 0 . As mentioned pre- viously , the num ber of terms of the joint distribution that are necessary to compute ∆ ln P 0 is prop ortional only to the degree k i of no de i , and is indep enden t of B 1 , and hence can b e computed quic kly . Therefore, if we attempt one mov e for each no de in the netw ork, such a “sweep” can b e completed in time O ( E ) , indep enden t on the total n umber of groups. An imp ortan t elemen t of this algorithm is the mo ve prop osal probability P ( b ( l ) i = r → s ) . An y choice with 14 nonzero probability for all v alues of s will preserve er- go dicit y , and — coupled with the Metropolis-Hastings criterion — also detailed balance. These tw o ingredients are suﬃcient to guarantee that hierarchical partitions are ev entually sampled from the correct p osterior distribu- tion. How ev er, in practice, the equilibration time will dep end strongly on the mo ve prop osals, and will b ecome shorter if they are close to the actual p osterior. The sim- plest choice we could mak e is to select from all groups with equal probability P ( b ( l ) i = r → s ) = 1 B l + 1 , (79) where we also accoun t for the o ccupation of a new group, whic h if the mo ve is accepted, will increase B l b y one (pro vided the no de i is not the last one in its current group). Since this probability is alwa ys nonzero, it fulﬁlls our requiremen ts. How ev er, it will lead to v ery large equilibration times, in particular for large v alues of B l . This is b ecause the actual p osterior distribution for no de i is lik ely to b e concentrated only in a small subset of all possible groups, and hence most suc h fully random prop osals will simply b e rejected. A b etter approac h was dev elop ed in Ref. [9], and it consists in insp ecting the curren t parameters of the mo del to provide a b etter guess of the p osterior. It amounts to making mov e prop osals according to P ( b ( l ) i = r → s ) = X t P ( t | i, l ) e l ts +  e l t +  ( B l + 1) , (80) where P ( t | i, l ) = P j A ( l ) ij δ ( b ( l ) j , t ) /k ( l ) i is the fraction of neigh b ors of no de i in level l that belong to group t , and  > 0 is an arbitrary parameter that enforces ergo dicity , but with no other signiﬁcant impact in the algorithm, pro vided it is suﬃcien tly small. It is worth while to em- phasize that these mo ve prop osals do not bias the parti- tions tow ard any particular mixing pattern. F or exam- ple, they do not prefer assortativ e versus non-assortative partitions, since they insp ect the neighbors of a no de only to access with other groups the ir kinds are typically connected — which can b e diﬀeren t from the the group assignmen t of the original no de. F urthermore, these pro- p osals can b e generated eﬃciently , simply by 1. sampling a random neigh b or j of no de i , and in- sp ecting its group membership t = b j , and then 2. with probabilit y  ( B l + 1) / ( e t +  ( B l + 1)) sampling a fully random group s (which can b e a new group), 3. or otherwise, sampling a group lab el s with a prob- abilit y prop ortional to the n umber of edges leading to it from group t , e ts . The ab ov e can be done in time O ( k i ) , again indep en- den tly of B l , as long as a contin uous bo ok-keeping is made of the edges whic h are inciden t to each group, and therefore it do es not aﬀect the o verall O ( E ) time com- plexit y . As rep orted in Ref. [9], these mov e prop osals tend to signiﬁcantly improv e the mixing times, and re- mo ve an explicit dep endency on the num ber of groups, that would otherwise b e presen t with the fully random mo ves. This approac h is also more eﬃcient than the rejection- free “heat bath” algorithm used in Ref. [17], since the latter requires all p ossible mov es to b e prob ed, incurring an additional time complexity that gro ws linearly with the num ber of groups. In addition to the mov e prop osals, another crucial as- p ect of the algorithm’s eﬃciency is the choice of the start- ing state. A simple approach s uc h as starting from a ran- dom partition can lead to metastable states, from which it takes a long time to escap e. Instead, here w e adopt the agglomerativ e initialization approach presented in Ref. [9], which amounts to putting eac h no de in their o wn group, and then progressively merging groups, while alternatingly allowing for individual no de mov es. This can b e done for each hierarchical lev el iteratively , as de- scrib ed in detail in Ref. [10]. As rep orted in Ref. [9], this approac h greatly reduces the tendency to get trapp ed in a metastable state, and serv es as an initialization proto- col that further reduces the o verall mixing time of the MCMC. While the ab ov e algorithm serves to sample from the p osterior distribution of Eq. 74, it can be easily mod- iﬁed to ﬁnd its maximum b y introducing an “inv erse- temp erature” parameter β in Eq. 77 via the replacemen t ∆ ln P l → β ∆ ln P l . By making β → ∞ the algorithm is turned into a greedy heuristic that, if rep eated many times, yields a reliable estimate of the maximum. The lack of an explicit dependence on the num ber of groups of the algorithm ab ov e is atypical, since most other prop osed Bay esian (or semi-Bay esian) algorithms ha ve either quadratic O ( E B 2 ) [15 – 17, 34] or linear O ( E B ) [14, 42] dependencies, whic h means that those can b e applied to large netw orks only if the n umber of groups is kept small. F urthermore, the increased eﬃ- ciency obtained here do es not rely on any approximations made to the likelihoo d. A reference implemen tation of the algorithm is freely a v ailable as part of the graph-tool library [43] 2 . VI I. MODEL COMP ARISON With the three diﬀerent mo del ﬂav ors av ailable (NDC- SBM, DC-SBM with uniform degree prior or uniform hy- p erprior) we are left with the problem of deciding whic h oﬀers the b est description of a giv en net work. This prob- lem can b e formulated in at least tw o wa ys, dep ending on whether w e w ant to compare individual partitions or en tire mo del classes, which w e describ e now detail. If we wish to compare tw o individual partitions, ob- tained from the p osterior distribution of t w o diﬀerent 2 A v ailable at https://graph- tool.skewed.de . 15 mo dels, w ee need to consider the joint p osterior probabil- it y P ( { b l } , H| A ) , where H is the mo del class b eing used. F or example, when comparing results from the DC-SBM and NDC-SBM, we can comp ute the ratio, Λ 1 = P ( { b l } , H NDC | A ) P ( { b l } 0 , H DC | A ) = P ( A , { b l }|H NDC ) P ( A , { b l } 0 |H DC ) × P ( H NDC ) P ( H DC ) = 2 − ∆Σ (81) where in the last equation ∆Σ = Σ NDC − Σ DC is the diﬀerence in the description length, and we hav e as- sumed that b oth mo del classes are equally likely a priori, P ( H NDC ) = P ( H DC ) . If Λ 1 < 1 , we ha ve that the data fa vors the particular hierarchical partition { b l } 0 together with the degree-corrected mo del v arian t, or if Λ 1 > 1 we ha ve the opp osite case. Cho osing a mo del according to Λ 1 is identical to employing the MDL criterion, but its v alue can b e used to quan tify the degree of conﬁdence. E.g. a v alue Λ 1 = 1 / 2 indicates a very mo dest evidence supp orting the DC-SBM that cannot b e reliably distin- guished from pure c hance, whereas a v alue of Λ 1 = 1 / 10 5 w ould clearly indicate that it is a m uch b etter mo del than the NDC-SBM. The criterion ab ov e should not b e confused with the “frequentist” approach of computing the p ar amet- ric lik eliho o d ratio b et ween b oth mo dels, as w as done in Ref. [44]. In the latter case, whic h do es not in volv e an y prior probabilities, the ratio needs to b e compared to the distribution obtained with the null mo del, which is more cum b ersome to obtain. How ever, as is understo od in general (and can also b e shown for the particular case of the SBM [22]), this frequen tist criterion should co- incide asymptotically with the Bay esian criterion ab ov e as long as uniform priors are used. On the other hand, since here we use deeper Ba yesian hierarchies, and hence non uniform priors, these amount to diﬀeren t tests, with Λ 1 b eing more sensitive to regularities in the data, since it uses prop erties of the parameters themselves in the decision. The comparison ab ov e using Λ 1 is easy to perform, since it requires one to simply insp ect the result of the inference pro cedure. Ho wev er, it may b e p ossible that the same netw ork admits many alternative ﬁts with very sim- ilar p osterior probabilities. A more strict Bay esian stance w ould require us to treat those on an equal fo oting, and an y statemen t about the generativ e mo del behind the data should b e av eraged ov er all p ossible ﬁts, weigh ted according to the resp ectiv e p osterior probability . Hence, in this scenario we may b e interested instead in compar- ing the entire mo del classes to each other, which inv olv es ev aluating the so-called mo del evidenc e by summing ov er all hierarchical partitions, P ( A |H ) = X { b l } P ( A , { b l } ) . (82) With this, we can again compute the p osterior o dds ratio, e.g. Λ 2 = P ( H NDC | A ) P ( H DC | A ) = P ( A |H NDC ) P ( A |H DC ) × P ( H NDC ) P ( H DC ) . (83) If we hav e no prior preference to wards either mo del, P ( H NDC ) = P ( H DC ) , the v alue of Λ 2 is known as the Ba yes factor [45], and lik e Λ 1 can b e used to establish a degree of conﬁdence in the outcome. Unfortunately , the exact computation of the sum in Eq. 82 is intractable. W e therefore resort to a v ariational approac h, ﬁrstly by writing ln P ( A |H ) = ln X { b l } P ( A , { b l } ) (84) = X { b l } q ( { b l } ) ln P ( A , { b l } ) (85) − X { b l } q ( { b l } ) ln q ( { b l } ) , (86) with q ( { b l } ) = P ( A , { b l } ) P ( A ) (87) b eing precisely the p osterior distribution of for the hier- arc hical partition that we obtain from with the MCMC algorithm used ab o ve. (Note that so far we hav e not made an y approximations, with the identities ab o ve hold- ing exactly .) The ﬁrst term in Eq. 85 is easy to compute, as it amounts to the av erage log-likelihoo d (or min us the description length) of the partitions we obtain with the MCMC ab o ve, h ln P ( A , { b l } ) i = X { b l } q ( { b l } ) ln P ( A , { b l } ) . (88) On the other hand, the second term in Eq. 86 amounts to the entrop y of the p osterior distribution, H ( { b l } ) = − X { b l } q ( { b l } ) ln q ( { b l } ) , (89) and measures how strongly it is concentrated. F or exam- ple, in the extreme (and unrealistic) case where for each mo del being compared only one partition o ccurs with probabilit y q ( { b l } ) = 1 , the entrop y will b e zero, and we ha ve that Λ 1 = Λ 2 . Otherwise the entrop y H ( { b l } ) will eﬀectiv ely measure how many partitions contribute to the a verage log-likelihoo d, so that a model class with a larger en tropy will b e preferred ov er another with less v ari- ance, even if their p osterior probabilities are on av erage the same. Unfortunately , the entrop y H ( { b l } ) is noto- riously diﬃcult to compute exactly , even asymptotically via MCMC algorithms, and encapsulates the diﬃcult y of computing Eq. 82 directly . A brute force approach sim- ply do es not w ork, since it would require keeping trac k of all visited hierarchical partitions, whic h grow com binato- rially in num b er with system size. Other approaches suc h 16 as thermo dynamic integration [46], annealed imp ortance sampling [47] and ﬂat-histogram metho ds [48] are also p ossible, but te nd to b e signiﬁcantly ineﬃcient in com- parison. Instead, here w e make a so-called “mean ﬁeld” assumption on the shap e of q ( { b l } ) which assumes that it factorizes ov er all lev els q ( { b l } ) ≈ q 1 i ( b 1 ) Y l> 1 Y i q l i ( b l i ) . (90) F or the ﬁrst level w e use the so-called “Bethe appro x- imation” [49], whic h tak es in to account the correlation b et w een adjacent nodes in the netw ork, q 1 ( b 1 ) ≈ Y i 1 we cannot use the same approxi- mation since the adjacency matrices will b e in general m ultigraphs that will keep changing throughout the algo- rithm. Therefore w e used ab ov e a mean-ﬁeld appro xima- tion where the p osterior factorizes o ver all no des. With this we can ﬁnally write Eq. 84 as ln P ( A ) ≈ h ln P ( A , { b l } ) i + X l H l (94) where H 1 = − X i 1 . Th us, Eq. 94 can b e computed simply by equilibrating the MCMC, obtaining the a verage log-likelihoo d and the no de and edge p osterior marginal distribution, q l i ( r ) and q 1 ij ( r , s ) . Dataset N h k i B 1 h B 1 i σ B 1 Southern women interactions [53] 32 5 . 6 2 2 . 4 0 . 9 Zachary’s k arate club [54] 34 4 . 6 2 2 . 2 0 . 5 Dolphin so cial network [55] 62 5 . 1 2 2 . 9 0 . 5 Characters in Les Misérables [56] 77 6 . 6 8 8 . 6 0 . 7 American college fo otball [57] 115 10 . 7 10 10 . 1 0 . 3 Florida fo od web (wet) [58] 128 32 . 9 14 14 . 2 0 . 4 Residence hall friendships [59] 217 24 . 6 20 20 0 C. ele gans neural netw ork [60] 297 15 . 9 20 13 . 5 0 . 5 Scientiﬁc coauthorships [61] 379 4 . 8 28 29 . 6 1 . 6 Country-language netw ork [62] 868 2 . 9 4 10 . 1 1 . 9 Malaria gene similarity [63] 1 , 104 5 . 4 56 55 . 8 1 . 9 E-mail [64] 1 , 133 9 . 6 28 26 . 9 0 . 3 Political blogs [50] 1 , 222 31 . 2 15 15 0 Scientiﬁc coauthorships [61] 1 , 589 3 . 5 48 67 . 3 3 . 4 Protein iteractions (I) [65] 1 , 706 7 . 3 26 40 . 2 0 . 6 Bible names co-o currence [62] 1 , 773 10 . 3 63 79 . 1 5 . 3 Global airp ort network [10] 3 , 286 41 . 6 268 264 . 6 6 . 1 W estern states power grid [66] 4 , 941 2 . 7 38 37 . 3 1 Protein iteractions (I I) [67] 6 , 327 46 . 6 419 406 . 4 18 . 6 Internet AS [68] 6 , 474 4 . 3 40 50 7 . 2 Adv ogato user trust [69] 6 , 541 15 . 6 174 80 . 7 0 . 6 Chess games [62] 7 , 301 17 . 8 79 79 0 Dictionary entries [70] 13 , 356 18 1 , 378 1 , 378 . 9 2 . 3 Cora citations [71] 23 , 166 7 . 9 575 575 0 . 2 Google+ social network [72] 23 , 628 3 . 3 46 41 . 3 2 . 4 arXiv hep-th citations [68] 27 , 770 25 . 4 1 , 211 1 , 207 . 1 4 Linux source dep endency [62] 30 , 837 13 . 9 448 384 . 7 3 . 1 PGP web of trust [73] 39 , 796 15 . 2 1 , 350 1 , 323 . 2 26 . 4 F aceb o ok wall p osts [74] 46 , 952 37 . 4 6 , 930 6 , 794 . 9 129 . 9 Brightkite so cial netw ork [75] 58 , 228 7 . 4 171 177 . 4 3 . 8 Gnutella hosts [76] 62 , 586 4 . 7 24 24 0 Y outub e group memberships [77] 124 , 325 4 . 7 273 266 . 7 4 . 7 T able I. Empirical netw orks used in this work, with their num- b er of no des N , av erage degree h k i = 2 E / N , num ber of groups at the low est hierarchical level B 1 according to the MDL cri- terion, and the same v alue av eraged from the p osterior distri- bution h B 1 i , as well as standard deviation of the distribution, σ B 1 . VI II. RESUL TS FOR EMPIRICAL NETWORKS W e demonstrate the use of our approac h on empir- ical net works (summarized in T able I), which w e also use to compare diﬀerent model v ariations. W e b egin with a netw ork of p olitical blogs compiled b y A damic and Glance [50] during the 2004 general election in the USA. In this netw ork no des are blogs, and an edge ex- ists b et w een tw o no des if one blog cites the other (hence, the netw ork is directed, and therefore the directed ver- sions of the SBM were used, see App endix B). This net- w ork w as used in Ref. [11] as an example where the DC- SBM yielded more meaningful results, since it preferred a partition of the no des that w as largely compatible with the original categorization done in Ref. [50], based on the conten t of the blogs, into “liberal” and “conserv ative” sites. The NDC-SBM, on the other hand, preferred to di- vide the no des only according to degree. How ev er, in that analysis the num ber of groups was ﬁxed at B = 2 . Using the nonparametric approach describ ed here, where the n umber of groups is determined from data itself, the re- sults show a less extreme amoun t of discrepancy , as seen in Fig. 5, which sho ws the most likely partition accord- ing to each mo del ﬂav or. In all cases, the division of the no des is largely compatible with the accepted one: The hierarc hy branches at the top into the t wo p olitical fac- 17 (a) (b) (c) Figure 5. Most lik ely hierarchical partitions of a netw ork of p olitical blogs [50], according to the three mo del v arian ts considered, as well as the num ber of groups B 1 at the b ottom of the hierarch y , and the description length Σ : (a) NDC-SBM, B 1 = 42 , Σ ≈ 89938 bits, (b) DC-SBM, uniform prior, B 1 = 23 , Σ ≈ 87162 bits, (c) DC-SBM, uniform h yp erprior, B 1 = 20 , Σ ≈ 84890 bits. The no des circled in blue were classiﬁed as “lib erals” and the remaining ones as “conserv ativ es” in Ref. [50] based on the blog conten ts. Note that in all cases this division in tw o groups is correctly identiﬁed at the topmost level of the hierarch y . Ho wev er, the low er levels yield signiﬁcantly diﬀerent sub divisions depending on which mo del type is used. The la y out is obtained with an algorithm by Holten [51]. (a) (b) 40 45 50 55 60 65 70 75 80 85 90 B 1 0 . 00 0 . 02 0 . 04 0 . 06 0 . 08 0 . 10 0 . 12 0 . 14 0 . 16 0 . 18 P ( B 1 | A A A ) DC-SBM, uniform hyperprior DC-SBM, uniform prior NDC-SBM 12 15 18 21 24 27 30 33 36 B 2 0 . 00 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 0 . 30 P ( B 2 | A A A ) 4 8 12 16 20 B 3 0 . 00 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 0 . 30 0 . 35 0 . 40 P ( B 3 | A A A ) Figure 6. Hierarchical partitions of a netw ork of collab oration b etw een scientists [52]. (a) Most likely hierarchical partition according to the DC-SBM with a uniform hyperprior. (b) Uncorrelated samples from the p osterior distribution. (c) Marginal p osterior distribution of the n umber of groups at the ﬁrst three hierarc hical levels, according to the mo del v ariants describ ed in the legend. The v ertical lines mark the v alue obtained for the most likely partition. tions, and then pro ceeds into further sub-divisions inside eac h group. Ho wev er, when insp ecting the low er levels of the hierarch y , we see that the diﬀeren t v arian ts yield distinct sub divisions inside the t wo main groups. The non-degree-corrected v ersion yields the largest num ber of groups, follow ed b y the degree corrected one with uni- form degree priors, and ﬁnally the version with uniform degree h yp erpriors with the smallest num ber of groups. In this particular case, the mo dels with smaller n umber of groups hav e also the smallest description length, which 18 (a) 1 2 3 4 5 6 7 8 B 1 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 0 . 9 P ( B 1 | A A A ) 0 1 2 3 4 5 B 2 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 P ( B 2 | A A A ) 0 1 2 3 4 5 B 3 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 P ( B 3 | A A A ) (b) 4 5 6 7 8 9 10 11 12 13 14 B 1 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 P ( B 1 | A A A ) 0 1 2 3 4 5 6 7 8 9 B 2 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 P ( B 2 | A A A ) 0 2 4 6 B 3 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 P ( B 3 | A A A ) (c) 9 10 11 12 13 14 B 1 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 0 . 9 P ( B 1 | A A A ) 1 2 3 4 5 6 7 8 9 B 2 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 P ( B 2 | A A A ) 0 1 2 3 4 5 B 3 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 P ( B 3 | A A A ) (d) 1 2 3 4 5 6 7 8 9 B 1 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 P ( B 1 | A A A ) 0 1 2 3 4 5 B 2 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 P ( B 2 | A A A ) 0 1 2 3 4 B 3 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 P ( B 3 | A A A ) (e) 30 33 36 39 42 45 48 51 54 57 60 B 1 0 . 00 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 0 . 30 0 . 35 0 . 40 P ( B 1 | A A A ) DC-SBM, uniform hyperprior DC-SBM, uniform prior NDC-SBM 8 10 12 14 16 18 20 22 24 B 2 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 P ( B 2 | A A A ) 2 4 6 8 10 B 3 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0 P ( B 3 | A A A ) (f ) 280 300 320 340 360 380 400 420 440 460 B 1 0 . 00 0 . 01 0 . 02 0 . 03 0 . 04 0 . 05 0 . 06 0 . 07 0 . 08 0 . 09 P ( B 1 | A A A ) 120 140 160 180 200 220 B 2 0 . 00 0 . 02 0 . 04 0 . 06 0 . 08 0 . 10 0 . 12 0 . 14 P ( B 2 | A A A ) 60 80 100 B 3 0 . 00 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 P ( B 3 | A A A ) (g) 180 210 240 270 300 330 360 390 420 450 480 B 1 0 . 00 0 . 05 0 . 10 0 . 15 0 . 20 P ( B 1 | A A A ) 60 80 100 120 140 160 180 B 2 0 . 00 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 P ( B 2 | A A A ) 30 40 50 60 70 B 3 0 . 00 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 0 . 30 0 . 35 P ( B 3 | A A A ) (h) 1250 1300 1350 1400 1450 1500 1550 1600 1650 1700 1750 B 1 0 . 00 0 . 02 0 . 04 0 . 06 0 . 08 0 . 10 0 . 12 0 . 14 P ( B 1 | A A A ) 420 440 460 480 500 520 540 560 580 B 2 0 . 00 0 . 05 0 . 10 0 . 15 0 . 20 0 . 25 0 . 30 0 . 35 0 . 40 P ( B 2 | A A A ) 160 180 200 220 B 3 0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 P ( B 3 | A A A ) Figure 7. Marginal p osterior distribution of the num ber of groups at the ﬁrst three hierarchical lev els, according to the model v arian ts describ ed in the legend, for some of the empirical netw orks listed in table I: (a) Dolphin so cial netw ork, (b) Characters in Les Misérables, (c) American college fo otball, (d) Southern women interactions, (e) Malaria gene similarity , (f ) Protein in teractions (II), (g) Global airp ort netw ork, (h) Dictionary entries. The v ertical lines mark the v alue obtained for the most lik ely partition (the MDL criterion). 19 Southern women interactions Zachary’ s karate club Dolphin social network Characters in Les Mis ´ erables American colle ge football Florida food web (wet) Residence hall friendships C. ele gans neural network Scientiﬁc coauthorships Country-language network Malaria gene similarity E-mail Political blogs Scientiﬁc coauthorships Protein interactions (I) Bible names co-ocurrence Global airport network W estern states po wer grid Protein interactions (II) Internet AS Adv ogato user trust Chess games Dictionary entries Cora citations Google+ social network arXi v hep-th citations Linux source dependenc y PGP web of trust F acebook wall posts Brightkite social network Gnutella hosts Y outube group memberships − 10 5 − 10 4 − 10 3 − 10 2 − 10 1 − 10 0 0 log 10 Λ 1 DC-SBM, uniform hyperprior DC-SBM, uniform prior NDC-SBM (a) Southern women interactions Zachary’ s karate club Dolphin social network Characters in Les Mis ´ erables American colle ge football Florida food web (wet) Residence hall friendships C. ele gans neural network Scientiﬁc coauthorships Country-language network Malaria gene similarity E-mail Political blogs Scientiﬁc coauthorships Protein interactions (I) Bible names co-ocurrence Global airport network W estern states po wer grid Protein interactions (II) Internet AS Adv ogato user trust Chess games Dictionary entries Cora citations Google+ social network arXi v hep-th citations Linux source dependenc y PGP web of trust F acebook wall posts Brightkite social network Gnutella hosts Y outube group memberships − 10 5 − 10 4 − 10 3 − 10 2 − 10 1 − 10 0 0 log 10 Λ 2 (b) Figure 8. Posterior o dds ratio relative to the b est mo del, according to (a) the MDL criterion, Λ 1 (Eq. 81) and (b) full p osterior probabilit y , Λ 2 (Eq. 83) for the empirical netw orks listed in T able I. The ratio is computed so that the preferred mo del has Λ 1 / 2 = 1 and thus app ears on the top of the ﬁgures. The remaining p oin ts for eac h dataset correspond to the o dds ratio of the remaining mo dels relative to the winning one. The solid lines mark a Λ = 10 − 2 conﬁdence threshold. The netw orks are ordered by increasing num ber of no des (see table I). 20 10 0 10 1 10 2 k 10 0 10 1 10 2 N k 10 0 10 1 10 2 10 3 k 10 0 10 1 10 2 10 3 N k Figure 9. Degree histograms for the Email (left) and arXiv hep-th citations (right) netw orks. In b oth cases the solid lines sho w a geometric distribution N k = N p (1 − p ) k − 1 , with p = 1 / h k i . seems to indicate that the division into a larger n umber of groups are necessary for the models that are unable to otherwise prop erly explain the heterogeneit y in the degree sequence. Thus, despite their uniform agreement with the accepted division, the MDL criterion still con- ﬁrms the DC-SBM as a b etter mo del for this net work. W e now mo ve to a so cial netw ork b etw een scientists, where an edge exists if tw o scientists collab orated on a pap er [52]. Here, w e compare the results obtained b y emplo ying MDL (i.e. ﬁnding the most lik ely partition) and sampling many partitions from the p osterior distri- bution, as shown in Fig. 6. W e observe that while the sampled partitions share close similarities to the MDL result, there is a noticeable v ariance among the individ- ual samples. Fig. 6 also shows the marginal distribution for the n umber of groups at the ﬁrst three hierarc hical lev els. F or all three mo del v arian ts, the typical num b er of groups is signiﬁcantly higher that what is obtained for the optimal partition (due to the low degree v ariabilit y in this particular netw ork, it is one of the few that are bet- ter mo delled b y the NDC-SBM, as seen in Fig. 8). This can be understo o d as an entropic eﬀect, where the exis- tence of a muc h larger num ber of more complex mo dels with smaller y et comparable lik eliho o d pushes the p oste- rior distribution tow ards them. This is a go od example of the bias-v ariance trade-oﬀ men tioned in Sec. II I A, where w e see that the MDL results in a more conserv ativ e par- tition, whereas the full p osterior dep osits more collective w eight on larger mo dels that are also more n umerous. This seems to indicate that no single partition (and its asso ciated mo del) serv es as a ov erwhelmingly b etter ex- planation among those considered — a symptom that no speciﬁc mo del v arian t can p erfectly accommo date the net work structure, and thus that the SBM is p ossibly not a suitable generative mo del for this data. This disagreement betw een MDL and p osterior sam- pling is not universal, and dep ends strongly on the net- w ork structure. In Fig. 7 we show further results for other netw orks, that show a fair amount of diversit y in this resp ect. In many cases the MDL estimate lies close to the mo de of the p osterior, indicating a fair amount of agreement (at least as far as the n umber of groups is concerned). If w e compare the diﬀerent mo del ﬂav ors as outlined in Sec. VI I, w e obtain that most typically the DC-SBM with uniform degree h yp erpriors provides the smallest descrip- tion length for a large v ariet y of netw orks, as shown in Fig. 8a. As exp ected, the margin b y which the b est mo del is selected increases with the size of the net work, as larger net works typically contain more data. If w e compare in- stead the whole mo del class, b y summing o ver all parti- tions, we obtain largely consistent (though not identical) outcomes, as seen in Fig. 8b. Exceptions to this include net works where there is no signiﬁcan t statistical evidence to support the most complex mo dels — either due to their small size or narrow degree distributions (e.g. Sci- en tiﬁc coauthorships, Malaria gene similarit y and W est- ern states p o wer grid) — and often the simpler NDC- SBM is preferred, as well as some netw orks for which the DC-SBM with uniform degree priors is preferred instead (E-mail, arXiv hep-th citations). A closer insp ection of these netw orks reveal that their global degree distribu- tion is fairly narro w, well approximated by an exp onen- tial distribution, as shown in Fig. 9. Since this is what is precisely assumed by the uniform degree prior, this mo del v ariation has the adv an tage in this case. It is worth while to observ e that according to b oth criteria, the preference to wards the DC-SBM ov er the NDC-SBM is sometimes only attained with the uniform degree hyperprior. In man y cases the NDC-SBM yields a smaller description length or larger evidence than the degree-corrected v ari- an t with a uniform prior. This means that correcting for arbitrary degree frequencies — as opp osed to sim- ply the degrees but assuming uniform frequencies — can rev eal important information on the structure of the net- w ork that would otherwise remain obscured. Neverthe- less, our results seem to v alidate the in tuition b ehind the DC-SBM as argued in Ref. [11], that most netw orks are b etter mo deled as mixtures of groups with heterogeneous degrees, as opp osed to groups with the homogeneous de- grees that are generated by the NDC-SBM. Imp ortantly , w e reac h this conclusion aw are that the NDC-SBM is a larger model class with more parameters, since this fact is fully incorp orated in our comparison. IX. DISCUSSION The micro canonical approac h to the inference of large- scale net work structures oﬀers an opp ortunit y to enco de deep er Bay esian hierarc hies into the generativ e mo dels, whic h alleviates the underﬁtting problems present other- wise, while at the same time enabling the implementation of eﬃcient inference algorithms with a complexity that is not explicitly dep endent on the num ber of groups b eing inferred. W e sho wed how the degree-corrected SBM can b e for- m ulated in a Bay esian wa y , via the incorp oration of pri- ors for the degree sequence that depend on the degree distribution, and hence are more capable of decoupling mo dular organization from degree regularities. W e hav e 21 again visited the issue of the maximum num b er of groups that can be inferred, and determined that the hierarchi- cal v ersion of the mo del is signiﬁcantly less susceptible to underﬁtting, by b eing able to unco ver small groups in v ery large netw orks. W e also show ed that the micro canonical mo del is iden- tical to a Bay esian version of the typical canonical formu- lation, if we consider only its shallow er v ersion with uni- form priors. Hence, the main strength of the approac h presen ted here lies not in details of the model speciﬁ- cation, but rather in the ease with which higher order Ba yesian considerations can b e incorporated. Throughout the work we ha v e contrasted tw o ap- proac hes to Bay esian inference, one where we search for the single best netw ork parametrization (the MDL crite- rion), and the other where parametrizations are sampled according to their p osterior probability . W e sho wed that the bias-v ariance trade-oﬀ that these tw o options repre- sen t can manifest itself in practice, where a lac k of qual- it y of ﬁt yields a disagreement b et ween b oth approac hes. By p erforming a systematic analysis of v arious empirical net works, w e observed that the degree of discrepancy is v aried, and itself serves as an indication of the suitability of the SBM in capturing the net work structure. W e argue that the metho ds prop osed here can b e useful in the principled detection of large-scale netw ork struc- tures and in their interpretation. In particular we b eliev e it can b e used as a basis for a further understanding of the quality of the SBM family of models in capturing the prop erties of real netw orks. [1] Santo F ortunato, “ Communit y detection in graphs,” Ph ysics Rep orts 486 , 75–174 (2010). [2] Santo F ortunato and Darko Hric, “ Comm unity detection in netw orks: A user guide,” Physics Reports (2016), 10.1016/j.ph ysrep.2016.09.002. [3] Roger Guimerà, Marta Sales-P ardo, and Luís A. Nunes Amaral, “ Mo dularity from ﬂuctuations in random graphs and complex netw orks,” Phys. Rev. E 70 , 025101 (2004). [4] Santo F ortunato and Marc Barthélemy , “ Resolution limit in communit y detection,” PNAS 104 , 36–41 (2007). [5] Benjamin H. Go o d, Y ves-Alexandre de Montjo y e, and Aaron Clauset, “ Performance of modularity maximiza- tion in practical contexts,” Phys. Rev. E 81 , 046106 (2010). [6] Darko Hric, Richard K. Darst, and San to F ortunato, “ Communit y detection in netw orks: structural clusters v ersus ground truth,” arXiv:1406.0146 [ph ysics, q-bio] (2014), arXiv: 1406.0146. [7] Paul W. Holland, Kathryn Blackmond Laskey , and Sam uel Leinhardt, “ Stochastic blo ckmodels: First steps,” So cial Netw orks 5 , 109–137 (1983). [8] Tiago P . Peixoto, “ Parsimonious Mo dule Inference in Large Netw orks,” Phys. Rev. Lett. 110 , 148701 (2013). [9] Tiago P . Peixoto, “ Eﬃcien t Monte Carlo and greedy heuristic for the inference of sto c hastic block mo dels,” Ph ys. Rev. E 89 , 012804 (2014). [10] Tiago P . Peixoto, “ Hierarc hical Blo ck Structures and High-Resolution Mo del Selection in Large Netw orks,” Ph ys. Rev. X 4 , 011047 (2014). [11] Brian Karrer and M. E. J. Newman, “ Stochastic blo ck- mo dels and communit y structure in netw orks,” Phys. Rev. E 83 , 016107 (2011). [12] B. Bollobás, “ A probabilistic proof of an asymptotic for- m ula for the n umber of lab eled regular graphs.” Eur. J. Com b. 1 , 311–316 (1980). [13] Bailey K. F osdick, Daniel B. Larremore, Jo el Nishimura, and Johan Ugander, “ Conﬁguring Random Graph Mo d- els with Fixed Degree Sequences,” [ph ysics, q-bio, stat] (2016), arXiv: 1608.00607. [14] Roger Guimerà and Marta Sales-Pardo, “ Missing and spurious interactions and the reconstruction of complex net works,” Pro ceedings of the National Academ y of Sci- ences 106 , 22073 –22078 (2009). [15] Xiaoran Y an, Y ao jia Zhu, Jean-Baptiste Rouquier, and Cristopher Mo ore, “ Activ e Learning for Hidden A t- tributes in Netw orks,” arXiv:1005.0794 (2010). [16] Etienne Côme and Pierre Latouche, “ Mo del selection and clustering in stochastic blo ck mo dels based on the ex- act integrated complete data likelihoo d,” Statistical Mo d- elling 15 , 564–589 (2015). [17] M. E. J. Newman and Gesine Reinert, “ Estimating the Num b er of Communities in a Netw ork,” Phys. Rev. Lett. 117 , 078301 (2016). [18] J. Rissanen, “ Mo deling by shortest data description,” Au- tomatica 14 , 465–471 (1978). [19] Peter D. Grünw ald, The Minimum Description L ength Principle (The MIT Press, 2007). [20] Pierre Latouche, Etienne Birmelé, and Christophe Am- broise, “ Ba y esian Methods for Graph Clustering,” in A d- vanc es in Data Analysis, Data Hand ling and Business Intel ligenc e , Studies in Classiﬁcation, Data Analysis, and Kno wledge Organization, edited by Andreas Fink, Berthold Lausen, Wilfried Seidel, and Alfred Ultsch (Springer Berlin Heidelb erg, 2009) pp. 229–239. [21] Jake M. Hofman and Chris H. Wiggins, “ Bay esian Ap- proac h to Net work Mo dularit y ,” Phys. Rev. Lett. 100 , 258701 (2008). [22] Xiaoran Y an, “ Ba y esian Mo del Selection of Sto c hastic Blo c k Mo dels,” arXiv:1605.07057 [cs, stat] (2016), arXiv: 1605.07057. [23] Martin Rosv all and Carl T. Bergstrom, “ An information- theoretic framework for resolving comm unity structure in complex netw orks,” PNAS 104 , 7327–7331 (2007). [24] Tiago P . Peixoto, “ Mo del Selection and Hyp othesis T est- ing for Large-Scale Net work Models with Overlapping Groups,” Ph ys. Rev. X 5 , 011033 (2015). [25] Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman, “ Po wer-La w Distributions in Empirical Data,” SIAM Rev. 51 , 661–703 (2009). [26] George E. Andrews, The The ory of Partitions (Cam- bridge Universit y Press, Cambridge, 1984). [27] G. Szekeres, “ An Asymptotic F ormula in the Theory of P artitions,” Q J Math 2 , 85–108 (1951). [28] G. Szekeres, “ Some Asymptotic F ormulae in the Theory of Partitions (ii),” Q J Math 4 , 96–111 (1953). [29] E. Ro dney Canﬁeld, “ F rom Recursions to Asymptotics: 22 On Szekeres’ F ormula for the Num b er of Partitions,” The Electronic Journal of Combinatorics 4 , R6 (1996). [30] Paul Erdős and Joseph Lehner, “ The distribution of the n umber of summands in the partitions of a p ositive inte- ger,” Duk e Math. J 8 , 335–345 (1941). [31] G. H. Hardy and S. Raman ujan, “ Une formule asymp- totique p our le nombres des partitions de n,” Comptes Rendus Acad. Sci. Paris, Sér. A 2 (1917). [32] G. H. Hardy and S. Ramanujan, “ Asymptotic F ormulaae in Com binatory Analysis,” Pro ceedings of the London Mathematical So ciety s2-17 , 75–115 (1918). [33] Tiago P . Peixoto, “ Entrop y of sto chastic blockmodel en- sem bles,” Phys. Rev. E 85 , 056122 (2012). [34] Aurelien Decelle, Florent Krzak ala, Cristopher Mo ore, and Lenk a Zdeborov á, “ Inference and Phase T ransitions in the Detection of Mo dules in Sparse Netw orks,” Phys. Rev. Lett. 107 , 065701 (2011). [35] Ginestra Bianconi, “ En tropy of net work ensembles,” Ph ys. Rev. E 79 , 036114 (2009). [36] Tiziano Squartini, Jo ey de Mol, F rank den Hollander, and Diego Garlaschelli, “ Breaking of Ensemble Equiv a- lence in Netw orks,” Phys. Rev. Lett. 115 , 268701 (2015). [37] Diego Garlasc helli, F rank den Hollander, and Andrea Ro cca verde, “ Ensemble nonequiv alence in random graphs with mo dular structure,” arXiv:1603.08759 (2016). [38] Anne Condon and Richard M. Karp, “ Algorithms for graph partitioning on the planted partition mo del,” Ran- dom Structures & Algorithms 18 , 116–140 (2001). [39] Aurelien Decelle, Florent Krzak ala, Cristopher Mo ore, and Lenk a Zdeb oro v á, “ Asymptotic analysis of the sto c hastic blo c k mo del for mo dular netw orks and its al- gorithmic applications,” Phys. Rev. E 84 , 066106 (2011). [40] Nicholas Metropolis, Arianna W. Rosenbluth, Mar- shall N. Rosenbluth, Augusta H. T eller, and Edw ard T eller, “ Equation of State Calculations by F ast Comput- ing Machines,” J. Chem. Phys. 21 , 1087 (1953). [41] W. K. Hastings, “ Monte Carlo sampling metho ds using Mark ov chains and their applications,” Biometrik a 57 , 97 –109 (1970). [42] Prem K. Gopalan and David M. Blei, “ Eﬃcient discov ery of ov erlapping communities in massive netw orks,” PNAS 110 , 14534–14539 (2013). [43] Tiago P . Peixoto, “ The graph-to ol python library ,” ﬁgshare (2014), 10.6084/m9.ﬁgshare.1164194. [44] Xiaoran Y an, Cosma Shalizi, Jacob E. Jensen, Flo- ren t Krzak ala, Cristopher Moore, Lenk a Zdeb oro v á, P an Zhang, and Y ao jia Zh u, “ Mo del selection for degree- corrected blo ck mo dels,” J. Stat. Mech. 2014 , P05007 (2014). [45] Sir Harold Jeﬀreys, The The ory of Pr ob ability (Oxford Univ ersity Press, 1998). [46] Daan F renkel and Berend Smit Professor, Understanding Mole cular Simulation: F r om Algorithms to Applic ations , 2nd ed. (Academic Press, San Diego, 2001). [47] Radford M. Neal, “ Annealed importance sampling,” Statistics and Computing 11 , 125–139 (2001). [48] F ugao W ang and D. P . Landau, “ Eﬃcient, Multiple- Range Random W alk Algorithm to Calculate the Density of States,” Phys. Rev. Lett. 86 , 2050–2053 (2001). [49] Marc Mezard and Andrea Montanari, Information, Physics, and Computation (Oxford Universit y Press, 2009). [50] Lada A. Adamic and Natalie Glance, “ The p olitical blo- gosphere and the 2004 U.S. election: divided they blog,” in Pr o c e e dings of the 3r d international workshop on Link disc overy , LinkKDD ’05 (A CM, New Y ork, NY, USA, 2005) pp. 36–43. [51] D. Holten, “ Hierarc hical Edge Bundles: Visualization of A djacency Relations in Hierarchical Data,” IEEE T rans- actions on Visualization and Computer Graphics 12 , 741–748 (2006). [52] M. E. J. Newman, “ Finding communit y structure in net- w orks using the eigenv ectors of matrices,” Phys. Rev. E 74 , 036104 (2006). [53] Allison Davis and Burleigh B. Gardner, De ep South: A So cial A nthr op olo gic al Study of Caste and Class , re- vised ed. edition ed. (Universit y of South Carolina Press, Colum bia, S.C, 2009). [54] W ayne W. Zachary , “ An Information Flow Mo del for Conﬂict and Fission in Small Groups,” Journal of An- throp ological Research 33 , 452–473 (1977). [55] David Lusseau, Karsten Sc hneider, Oliver J. Boisseau, P atti Haase, Elisab eth Slooten, and Stev e M. Da wson, “ The bottlenose dolphin communit y of Doubtful Sound features a large prop ortion of long-lasting asso ciations,” Beha v Ecol So ciobiol 54 , 396–405 (2003). [56] Donald E. Knuth, The Stanfor d Gr aphBase: A Platform for Combinatorial Computing , 1st ed. (A ddison-W esley Professional, New Y ork, N.Y. : Reading, Mass, 1993). [57] M. Girv an and M. E. J. Newman, “ Communit y structure in so cial and biological netw orks,” Pro ceedings of the Na- tional Academ y of Sciences 99 , 7821 –7826 (2002). [58] Rob ert E. Ulanowicz and Donald L. DeAngelis, “ Netw ork analysis of trophic dynamics in south ﬂorida ecosystems,” US Geological Survey Program on the South Florida Ecosystem 114 (2005). [59] Linton C F reeman, Cynthia M W ebster, and Deirdre M Kirk e, “ Exploring so cial structure using dynamic three- dimensional color images,” So cial Net works 20 , 109–118 (1998). [60] J. G. White, E. Southgate, J. N. Thomson, and S. Bren- ner, “ The structure of the nervous system of the ne- mato de Caenorhabditis elegans,” Philos. T rans. R. So c. Lond., B, Biol. Sci. 314 , 1–340 (1986). [61] M. E. J. Newman, “ Modularity and communit y structure in netw orks,” PNAS 103 , 8577–8582 (2006). [62] Jérôme Kunegis, “ KONECT: The Koblenz Netw ork Col- lection,” in Pr o c e e dings of the 22Nd International Confer- enc e on W orld Wide W eb , WWW ’13 Companion (A CM, New Y ork, NY, USA, 2013) pp. 1343–1350. [63] Daniel B. Larremore, Aaron Clauset, and Caroline O. Buc kee, “ A Net work Approach to Analyzing Highly Re- com binant Malaria Parasite Genes,” PLOS Comput Biol 9 , e1003268 (2013). [64] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, “ Self-similar communit y structure in a net- w ork of human interactions,” Phys. Rev. E 68 , 065103 (2003). [65] Ulrich Stelzl, Uwe W orm, Maciej Lalowski, Christian Haenig, F elix H. Brembeck, Heike Go ehler, Martin Stro edic ke, Martina Zenkner, Anke Schoenherr, Susanne K o epp en, Jan Timm, Sascha Mintzlaﬀ, Claudia Abra- ham, Nicole Bo c k, Silvia Kietzmann, Astrid Go edde, En- gin T oksöz, Anja Dro ege, Sylvia Krobitsch, Bernhard K orn, W alter Birc hmeier, Hans Lehrach, and Erich E. W anker, “ A Human Protein-Protein In teraction Net- w ork: A Resource for Annotating the Proteome,” Cell 122 , 957–968 (2005). 23 [66] D. J. W atts and S. H. Strogatz, “ Collectiv e dynamics of ’small-w orld’ netw orks,” Nature 393 , 409–10 (1998). [67] G. Joshi-T ope, M. Gillespie, I. V astrik, P . D’Eustachio, E. Sc hmidt, B. de Bono, B. Jassal, G. R. Gopinath, G. R. W u, L. Matthews, S. Lewis, E. Birney , and L. Stein, “ Re- actome: a knowledgebase of biological pathw a ys,” Nucl. A cids Res. 33 , D428–D432 (2005). [68] Jure Lesko v ec, Jon Kleinberg, and Christos F alout- sos, “ Graph ev olution: Densiﬁcation and shrinking di- ameters,” ACM T rans. Knowl. Discov. Data 1 (2007), 10.1145/1217299.1217301. [69] P . Massa, M. Salvetti, and D. T omasoni, “ Bowling Alone and T rust Decline in Social Netw ork Sites,” in Eighth IEEE International Confer enc e on Dep endable, A uto- nomic and Se cur e Computing, 2009. DASC ’09 (2009) pp. 658–663. [70] Vladimir Batagelj, Andrej Mrv ar, and Matjaz Zav ersnik, “ Netw ork Analysis of texts,” (2002). [71] Lovro Šub elj and Marko Ba jec, “ Mo del of Complex Net- w orks Based on Citation Dynamics,” in Pr o c e edings of the 22Nd International Confer enc e on W orld Wide W eb , WWW ’13 Companion (ACM, New Y ork, NY, USA, 2013) pp. 527–530. [72] Jure Lesko v ec and Julian J. Mcauley , “ Learning to Dis- co ver So cial Circles in Ego Netw orks,” in A dvanc es in Neur al Information Pr o c essing Systems 25 , edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. W ein- b erger (Curran Asso ciates, Inc., 2012) pp. 539–547. [73] Oliver Rich ters and Tiago P . Peixoto, “ T rust T ransitivity in So cial Netw orks,” PLoS ONE 6 , e18384 (2011). [74] Bimal Viswanath, Alan Mislov e, Meey oung Cha, and Kr- ishna P . Gummadi, “ On the Evolution of User Interaction in F aceb ook,” in Pr o c e edings of the 2Nd ACM W orkshop on Online So cial Networks , WOSN ’09 (ACM, New Y ork, NY, USA, 2009) pp. 37–42. [75] Eunjo on Cho, Seth A. Myers, and Jure Lesko v ec, “ F riendship and mobility: user mo v ement in lo cation- based so cial netw orks,” in Pr o c e e dings of the 17th ACM SIGKDD international c onfer enc e on Know le dge disc ov- ery and data mining , KDD ’11 (ACM, New Y ork, NY, USA, 2011) pp. 1082–1090. [76] Matei Ripeanu and Ian F oster, “ Mapping the Gnutella Net work: Macroscopic Prop erties of Large-Scale Peer-to- P eer Systems,” in Pe er-to-Pe er Systems , Lecture Notes in Computer Science No. 2429, edited by Peter Druschel, F rans Kaasho ek, and Anton y Rowstron (Springer Berlin Heidelb erg, 2002) pp. 85–93. [77] Alan E. Mislov e, Online so cial networks: me asur ement, analysis, and applic ations to distributed information sys- tems (ProQuest, 2009). App endix A: Asymptotic degree distributions sampled from uniform priors and hyperpriors W e can easily obtain the exp ected degree distribution when using the uniform prior for the degree sequence in Eq. 26, if w e relax the ensemble to allow the total num- b er of edges to ﬂuctuate, with the global constraint b eing enforced only on av erage. If we fo cus on only one group with N no des and E half edges on a verage, a degree sequence k will b e sampled with a probabilit y that max- imizes the ensem ble entrop y constrained by the av erage n umber of edges, obtained via the Lagrangian F = − X k P ( k ) ln P ( k ) − λ X k P ( k ) X i k i − E ! , (A1) where λ is a Lagrange multiplier that enforces the con- strain t. Obtaining the saddle p oin t { ∂ F /∂ P ( k ) = 0 , ∂ F /∂ λ = 0 } yields the usual canonical ensemble P ( k ) = e − λ P i k i Z . (A2) The normalization constant is called the partition func- tion, and is given by Z = X k e − λ P i k i =  1 − e − λ  − N , (A3) with λ = ln(1 + N /E ) obtained b y enforcing the con- strain t E = P i k i = − ∂ ln Z/∂ λ . F rom the ab o v e, we obtain immediately that the probability of a giv en no de i having a degree k is P ( k i = k ) = e − λk e − λ P j 6 = i k j Z = (1 − e − λ ) e − λk . (A4) This is a geometric distribution, more commonly parametrized as P ( k ) = (1 − p ) p k , (A5) with an a verage h k i = (1 − p ) /p = E / N . This canonical ensem ble is not identical to the micro canonical one used in the main text, but will approac h it asymptotically in the the thermo dynamic limit, i.e. when the num ber of no des and edges b ecome suﬃciently large. W e can use the same approac h to obtain the exp ected degree distribution generated from the uniform h yper- prior of Eq. 29, whic h is somewhat more inv olved, but it is still quite feasible. W e wan t to consider the ensemble of non-negative integer coun ts { n k } , sub ject to a nor- malization constraint P ∞ k =0 n k = N and a ﬁxed a verage P ∞ k =0 k n k = E . F ollowing the same maximum-en trop y ansatz as abov e yields a partition function for this en- sem ble given by Z = X { n k } e − λ P k n k − µ P k kn k = Y k Z k , (A6) where λ and µ are Lagrange multipliers that k eep the constrain ts in place, and with Z k = 1 1 − exp( − λ − µk ) . (A7) The exp ected degree counts are given b y h n k i = − ∂ ln Z k ∂ λ = 1 exp( λ + µk ) − 1 , (A8) 24 whic h is the Bose-Einstein distribution. The parameters λ and µ are determined via the imp osed constrain ts, ∞ X k =0 1 exp( λ + µk ) − 1 = N , (A9) ∞ X k =0 k exp( λ + µk ) − 1 = E . (A10) F or suﬃciently large E and N , the sums may b e appro xi- mated by integrals, and using the p olylogarithm function, Li s ( z ) = Γ( s ) − 1 R ∞ 0 [ t s − 1 / ( e t /z − 1)]d t , we ha ve Z ∞ 0 d k exp( λ + µk ) − 1 = Li 1 ( e − λ ) µ = N , (A11) Z ∞ 0 k d k exp( λ + µk ) − 1 = Li 2 ( e − λ ) µ 2 = E . (A12) Eq. A11 can b e solved for λ as e − λ = 1 − exp( − N/µ ) , but the same cannot b e done for Eq. A12 in closed form. Ho wev er, for N  µ , we hav e λ → 0 , and hence µ ≈ p Li 2 (1) /E = p ζ (2) /E , with ζ ( s ) b eing the Riemann zeta function. This yields the asymptotic distribution, h n k i ≈ 1 exp  k p ζ (2) /E  − 1 . (A13) Its v ariance can b e obtained from the second moment, N  k 2  = Z ∞ 0 k 2 d k exp( λ + µk ) − 1 = Li 3 ( e − λ ) 2 µ 3 , (A14) whic h leads to  k 2  = ζ (3) 2  h k i ζ (2)  3 / 2 √ N , (A15) whic h div erges in the limit N  1 . F or degrees k  √ E , w e hav e exp( k p ζ (2) /E ) ≈ 1 + k p ζ (2) /E , and hence the exp ected distribution of Eq. A13 will follow a p ow er law 1 /k for small arguments, with an exp onential cut-oﬀ for larger arguments, h n k i ≈ ( p E /ζ (2) /k for k  √ E , exp( − k p ζ (2) /E ) for k  √ E . (A16) Distributions of the form 1 /k are often attributed to non-equilibrium pro cesses or critical b eha vior, but as the ab o v e sho ws, they can also come from maximum-en trop y ensem bles with simple constraints. This is tantamoun t to saying that most discrete distributions with a ﬁxed a verage tend to ha v e the abov e asymptotic form, and therefore no mechanism other than randomly c hoosing b et w een them is necessary to explain this prop erty . App endix B: Directed netw orks Although in the main text we fo cused on undirected net works, directed mo del v arian ts are easy to obtain, as w e summarize here. F or the directed DC-SBM we hav e the mo del lik eliho o d P ( A | k , e , b ) = Q i k + i ! k − i ! Q rs e rs ! Q r e + r ! e − r ! Q ij A ij ! , (B1) with k + i = P j A j i , k − i = P j A ij , e + r = P s e sr , e − r = P s e rs . F or the hierarc hical prior of edge coun ts, we ha ve to treat the multigraphs as directed, P ( e l | e l +1 , b l ) = Y rs   n l r n l s e l +1 rs   − 1 . (B2) The uniform degree prior is the pro duct of tw o priors, for the in- and out-degree sequences, P ( k | e , b ) = Y r   n r e + r   − 1   n r e − r   − 1 . (B3) Analogously for the conditioned degree prior w e need to accoun t for the joint (in, out)-degree distribution, P ( k | η ) = Y r Q k + ,k − η r k + ,k − ! n r ! (B4) and an uniform hyperprior P ( η | e , b ) = Y r q ( e + r , n r ) − 1 q ( e − r , n r ) − 1 . (B5) The NDC-SBM is also entirely analogous, corresp onding to a degree probability P ( k | e , b ) = Y r e + r ! n e r + r Q i ∈ r k + i ! Y r e − r ! n e r − r Q i ∈ r k − i ! , (B6) whic h yields the mo del likelihoo d P ( A | e , b ) = Q rs e rs ! Q r n e + r r n e − r r × 1 Q ij A ij ! . (B7)

Nonparametric Bayesian inference of the microcanonical stochastic block model

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment