Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models

We present an efficient algorithm for the inference of stochastic block models in large networks. The algorithm can be used as an optimized Markov chain Monte Carlo (MCMC) method, with a fast mixing time and a much reduced susceptibility to getting trapped in metastable states, or as a greedy agglomerative heuristic, with an almost linear O(N ln² N) complexity, where N is the number of nodes in the network, independent of the number of blocks being inferred. We show that the heuristic is capable of delivering results which are indistinguishable from the more exact and numerically expensive MCMC method in many artificial and empirical networks, despite being much faster. The method is entirely unbiased towards any specific mixing pattern, and in particular it does not favor assortative community structures.

Authors: Tiago P. Peixoto

Tiago P. Peixoto∗
Institut für Theoretische Physik, Universität Bremen, Hochschulring 18, D-28359 Bremen, Germany

I. INTRODUCTION

The use of generative models to infer modular structure in networks has been gaining increased attention in recent years [1-12], due to its more general character, and because it allows the use of more principled methodology when compared to more common methods, such as modularity maximization [13]. The most popular generative model used for this purpose is the so-called stochastic block model [14-17], where the nodes in the network are divided into B blocks, and a B × B matrix specifies the probabilities of edges existing between nodes of each block. This simple model generalizes the notion of "community structure" [18] in that it accommodates not only assortative connections, but also arbitrary mixing patterns, including, for example, bipartite and core-periphery structures.
In this context, the task of detecting modules in networks is converted into a process of statistical inference of the parameters of the generative model given the observed data [1-12], which allows one to make use of the robust framework of statistical analysis. Among the many advantages which this approach brings are the capacity of separating noise from structure, such that no spurious communities are found [19-25], increased resolution in the detection of very small blocks based on refined model selection methods [26], and the identification of fundamental limits in the detection of modular structure [27-31]. However, one existing drawback in the application of statistical inference is the lack of very efficient algorithms, in particular for networks with a very large number of blocks, with a performance comparable to some popular heuristics available for modularity-based methods [32, 33]. Here we present efficient techniques for performing statistical inference on large networks, which are partially inspired by the modularity-based heuristics, but where special care is taken not to restrict the procedure to purely assortative block structures, and to control the total number of blocks B, such that detailed model selection criteria can be used. Furthermore, the method presented functions either as a greedy heuristic, with a fast O(N ln² N) algorithmic complexity, or as a full-fledged Monte Carlo method, which saturates the detectability range of arbitrary modular structure, at the expense of larger running times.

∗ tiago@itp.uni-bremen.de

This paper is divided as follows. In Sec. II the stochastic block model is defined, together with the maximum likelihood inference procedure. Sec. III presents an optimized Markov chain Monte Carlo (MCMC) method which is capable of reaching equilibrium configurations more efficiently than more unsophisticated approaches. In Sec.
IV the MCMC techniques are complemented with an agglomerative heuristic which successfully avoids metastable states resulting from starting from random partitions, and which can be used on its own as an efficient and high-quality inference method. In this section we also compare the heuristic to the full MCMC method on synthetic networks. In Sec. V we compare both methods on several empirical networks. We finally conclude in Sec. VI with a discussion.

II. THE STOCHASTIC BLOCK MODEL

The stochastic block model ensemble [14-17] is composed of N nodes, divided into B blocks, with e_rs edges between nodes of blocks r and s (or, for convenience of notation, twice that number if r = s). For many empirical networks, much better results are obtained if degree variability is included inside each block, as in the so-called degree-corrected block model [8], in which one additionally specifies the degree sequence {k_i} of the graph as an additional set of parameters.

The detection of modules consists in inferring the most likely model parameters which generated the observed network. One does this by finding the best partition {b_i} of the nodes, where b_i ∈ [1, B] is the block membership of node i in the observed network G, which maximizes the posterior likelihood P(G|{b_i}). Because all graphs with the same edge counts e_rs are equally likely, the posterior likelihood is P(G|{b_i}) = 1/Ω({e_rs}, {n_r}), where e_rs and n_r are the edge and node counts associated with the block partition {b_i}, and Ω({e_rs}, {n_r}) is the number of different network realizations. Hence, maximizing the likelihood is identical to minimizing the microcanonical entropy [34] S({e_rs}, {n_r}) = ln Ω({e_rs}, {n_r}), which can be computed [35] as

    S_t = (1/2) ∑_rs n_r n_s H_b(e_rs / (n_r n_s)),    (1)

for the traditional model, and

    S_c ≃ −E − ∑_k N_k ln k! − (1/2) ∑_rs e_rs ln(e_rs / (e_r e_s)),    (2)

for the degree-corrected variant, where E = ∑_rs e_rs / 2 is the total number of edges, N_k is the total number of nodes with degree k, e_r = ∑_s e_rs is the number of half-edges incident on block r, H_b(x) = −x ln x − (1 − x) ln(1 − x) is the binary entropy function, and it was assumed that n_r ≫ 1. These models can be generalized for directed networks, for which corresponding expressions for the entropies are easily obtained [19, 35]. The methods described in this paper are directly applicable to directed networks as well.

Although minimizing S_t/c allows one to find the most likely partition into B blocks, it cannot be used to find the best value of B itself. This is because the minimum of S_t/c is a strictly decreasing function of B, since larger models can always incorporate more details of the observed data, providing a better fit. Indeed, if one minimizes S_t/c over all B values one will always obtain the trivial B = N partition where each node is in its own block, which is not a useful result. The task of identifying the best value of B in a principled fashion is known as model selection, which attempts to separate actual structure from noise and avoid overfitting. In the current context this can be done in a variety of ways, such as using the minimum description length (MDL) criterion [19, 20] or performing Bayesian model selection (BMS) [7, 21-25]. In Ref. [26] a high-resolution model selection method is presented, which is based on MDL and a hierarchy of nested stochastic block models describing the network topology at multiple scales, and is capable of discriminating blocks with sizes significantly below the so-called "resolution limit" present in other model selection procedures and in other community detection heuristics, such as modularity optimization [36]. In Ref.
[26] it is also shown that BMS and MDL deliver identical results if the same model constraints are imposed. However, in order to perform model selection, one first needs to find optimal partitions of the network for given values of B, which is the subproblem that we consider in detail in this work. Therefore, in the remainder of this paper we will assume that the value of B is a fixed parameter, unless otherwise stated, but the reader should be aware that this value itself can be determined at a later step via model selection, as described e.g. in Refs. [19, 26].

Given a value of B, directly obtaining the partition {b_i} which minimizes S_t/c is in general not tractable, since it requires testing all possible partitions, which is only feasible for very small networks. Instead one must rely on approximate or stochastic procedures which are guaranteed to sample partitions with a probability given as a function of S_t/c, as described in the following section.

III. MARKOV CHAIN MONTE CARLO

The MCMC approach consists in modifying the block membership of each node in a random fashion, and accepting or rejecting each move with a probability given as a function of the entropy difference ΔS_t/c. If the acceptance probabilities are chosen appropriately, and the process is ergodic, i.e., all possible network partitions are accessible, and detailed balance is preserved, i.e., the moves are reversible, then after a sufficiently long equilibration time each observed partition must occur with the desired probability, proportional to P(G|{b_i}) = e^(−S_t/c). In this sense, this process is exact, since it is guaranteed to eventually produce the partitions with the desired probabilities, after a sufficiently long equilibration (or mixing) time.
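As a concrete illustration, the entropies of Eqs. (1) and (2), which drive the acceptance probabilities below, can be evaluated directly from the block matrix e_rs, the block sizes n_r and the degree sequence. The following is a minimal NumPy sketch; the function names and the dense-matrix representation are my own choices, not the paper's implementation, and the convention that the diagonal of e_rs stores twice the within-block edge count follows Sec. II:

```python
import math
import numpy as np

def binary_entropy(x):
    """H_b(x) = -x ln x - (1 - x) ln(1 - x), with H_b(0) = H_b(1) = 0."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    mask = (x > 0) & (x < 1)
    xm = x[mask]
    out[mask] = -xm * np.log(xm) - (1 - xm) * np.log(1 - xm)
    return out

def entropy_traditional(ers, nr):
    """S_t of Eq. (1): (1/2) sum_rs n_r n_s H_b(e_rs / (n_r n_s))."""
    ers = np.asarray(ers, dtype=float)
    nn = np.outer(nr, nr).astype(float)
    return 0.5 * np.sum(nn * binary_entropy(ers / nn))

def entropy_degree_corrected(ers, degrees):
    """Sparse approximation S_c of Eq. (2):
    -E - sum_k N_k ln k! - (1/2) sum_rs e_rs ln(e_rs / (e_r e_s))."""
    ers = np.asarray(ers, dtype=float)
    E = ers.sum() / 2
    er = ers.sum(axis=1)                  # half-edges incident on each block
    mask = ers > 0                        # terms with e_rs = 0 contribute zero
    S = -E - sum(math.lgamma(k + 1) for k in degrees)
    S -= 0.5 * np.sum(ers[mask] * np.log(ers[mask] / np.outer(er, er)[mask]))
    return S
```

In practice one would update these quantities incrementally move by move, as discussed in the footnote on the sparse form of S_t below; the full recomputation here is only for clarity.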
In practice, the situation is more nuanced, since equilibration times may be very long, one may not be able to sample from a good approximation of the desired distribution, and different ways of implementing the Markov chain lead to different mixing times. The simplest approach one can take is to attempt to move each vertex into one of the B blocks with equal probability. This easily satisfies the requirements of ergodicity and detailed balance, but can be very inefficient. This is particularly so in the case where the value of B is large and the block structure of the network is well defined, such that each vertex will belong to very few of the B blocks with a non-vanishing probability, which means that most random moves will simply be rejected. A better approach has been proposed in Ref. [19], which we present here in a slightly generalized fashion, and which consists in attempting to move a vertex from block r to s with a probability given by

    p(r → s | t) = (e_ts + ε) / (e_t + εB),    (3)

where t is the block label of a randomly chosen neighbor, and ε > 0 is a free parameter (note that by making ε → ∞ we recover the fully random moves described above). Eq. 3 means that we attempt to guess the block membership of a given node by inspecting the block memberships of its neighbors, and by using the currently inferred model parameters to choose the most likely blocks to which the original node belongs (see Fig. 1). It should be observed that this move imposes no inherent bias; in particular, it does not attempt to find assortative structures in preference to any other, since it depends fully on the matrix e_rs currently inferred. For any choice of ε > 0, this move proposal fulfills the ergodicity condition, but not detailed balance. However, this can be enforced in the usual Metropolis-Hastings fashion [37, 38], by accepting each move with a probability a given by

    a = min{ e^(−β ΔS_t/c) [∑_t p_t^i p(s → r | t)] / [∑_t p_t^i p(r → s | t)], 1 },    (4)

where p_t^i is the fraction of neighbors of node i which belong to block t, and p(s → r | t) is computed after the proposed r → s move (i.e., with the new values of e_rt), whereas p(r → s | t) is computed before. The parameter β in Eq. 4 is an inverse temperature, which can be used to escape local minima, or to turn the algorithm into a greedy heuristic, as discussed below.

FIG. 1. Left: Local neighborhood of node i belonging to block r, and a randomly chosen neighbor j belonging to block t. Right: Block multigraph, indicating the number of edges between blocks, represented as the edge thickness. In this example, the attempted move b_i → s is made with a larger probability than either b_i → u or b_i → r (no movement), since e_ts > e_tu and e_ts > e_tr.

The moves with probabilities given by Eq. 3 can be implemented efficiently. We simply write p(r → s | t) = (1 − R_t) e_ts/e_t + R_t/B, with R_t = εB/(e_t + εB). Hence, in order to sample s we proceed as follows:

1. A random neighbor j of the node i being moved is selected, and its block membership t = b_j is obtained;
2. The value s is randomly selected from all B choices with equal probability;
3. With probability R_t it is accepted;
4. If it is rejected, a randomly chosen edge adjacent to block t is selected, and the block label s is taken from its opposite endpoint.

This simple procedure selects the value of s with a probability given by ∑_t p_t^i p(r → s | t), and requires only a small number of operations, independent of both B and the number of neighbors of node i. The only requirement is that we keep a list of the edges which are adjacent to each block, which incurs an additional memory complexity of O(E).
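The four-step sampling procedure above can be sketched in a few lines. This is a schematic Python illustration under my own data-structure assumptions (adjacency lists, a dense e_rs matrix, and a per-block edge list), not the paper's implementation:

```python
import random

def propose_block(i, b, adj, ers, block_edges, B, eps):
    """Sample a candidate block s for node i with probability
    sum_t p_t^i p(r -> s | t), where p(r -> s | t) = (e_ts + eps)/(e_t + eps*B).
    block_edges[t] lists the edges (u, v) adjacent to block t."""
    # 1. pick a random neighbor j of i and read off its block t = b_j
    j = random.choice(adj[i])
    t = b[j]
    e_t = sum(ers[t])  # number of half-edges incident on block t
    # 2.-3. with probability R_t = eps*B / (e_t + eps*B), keep a uniformly
    # random block (this contributes the eps/(e_t + eps*B) term)
    if random.random() < eps * B / (e_t + eps * B):
        return random.randrange(B)
    # 4. otherwise pick a random edge adjacent to block t, and take the block
    # label of the endpoint opposite to t (contributes e_ts/(e_t + eps*B))
    u, v = random.choice(block_edges[t])
    return b[v] if b[u] == t else b[u]
```

In the full method this proposal is then accepted or rejected via Eq. (4), which restores detailed balance.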
To decide whether to accept the move, we need to compute the value of a, which can be done in O(k_i) time, the same number of operations required to compute ΔS_t/c.¹ Therefore, an entire MCMC sweep of all nodes in the network requires O(E) operations, independent of B.

¹ For sparse networks with e_rs ≪ n_r n_s, we may write S_t ≅ E − (1/2) ∑_rs e_rs ln e_rs + ∑_r e_r ln n_r, and note that to compute the change in entropy we need to modify at most 4k terms in the first sum and 2 terms in the second, if we change the membership of a node with degree k. The same argument holds for S_c.

FIG. 2. Left: Autocorrelation function R(τ), for a PP model with c = 0.8 and B = 100, for a network of size N = 10⁴ and ⟨k⟩ = 10, and two values of the parameter ε, where for ε → ∞ we have fully random moves. The curves were averaged over 100 independent network realizations. Right: PDF of the values of S_t/E obtained for T = 2 × 10⁴ consecutive sweeps for 100 independent network realizations, for different ε values, showing the same distribution.

To test the behavior of this approach, we examine a simple example known as the planted partition (PP) model [39]. It corresponds to an assortative block structure given by e_rs = 2E[δ_rs c/B + (1 − δ_rs)(1 − c)/(B(B − 1))] and n_r = N/B, where c ∈ [0, 1] is a free parameter which controls the assortativity strength. In this example, the algorithm above leads to much faster mixing times, as can be seen in Fig.
2(left), which shows the autocorrelation function

    R(τ) = ∑_{t=1}^{T−τ} [S_t/c(t) − ⟨S_t/c⟩] [S_t/c(t + τ) − ⟨S_t/c⟩] / [(T − τ) σ²_{S_t/c}],    (5)

where S_t/c(t) is the entropy value after t MCMC sweeps, and T is the total number of sweeps, computed after a sufficiently long transient has been discarded. For the particular choice of parameters in Fig. 2, the autocorrelation time is of the order of 10 sweeps with the optimized moves, and of the order of 100 sweeps with the fully random variant. Despite the difference in the mixing time, both methods sample from the same distribution, as shown in Fig. 2(right).

The improvement for smaller ε values becomes more prominent as the block structure becomes more well-defined, as can be seen in Fig. 3, which shows the autocorrelation time τ*, defined here as

    τ* = ∑_{τ=0}^{T₀} R(τ),    (6)

where T₀ is the largest value of τ for which R(τ) ≥ 0. In Fig. 3(left) are shown the values of τ* depending on c, from which one can see that the relative improvement in the mixing time can be up to two orders of magnitude, for the chosen value of B = 100. As the value of c approaches the detectability threshold (see below), the autocorrelation time diverges, as is typical of second-order phase transitions, and the relative advantage of the optimized moves diminishes.

FIG. 3. Left: Correlation time τ* as a function of the model parameter c, for different values of ε, with N = 10⁴, ⟨k⟩ = 10, B = 100, averaged over 40 independent network realizations. Right: Correlation time τ* as a function of the number of blocks B, for different values of ε, with N = 100 × B, ⟨k⟩ = 10, c = 0.8, averaged over 40 independent network realizations.

However, for most of the parameter
range where the blocks are detectable, the mixing time with the optimized moves seems independent of the actual number of blocks, as shown in Fig. 3, where a fixed block size N/B = 100 was used and B was varied. One can see that for the optimized moves the mixing time remains constant, whereas for the fully random moves it increases steadily with B.

Although the optimized moves above provide a considerable improvement over the fully random alternative whenever the number of blocks B becomes large, there remains an important problem when applying them. Namely, the mixing time can be heavily dependent on how close one starts from the typical partitions which are obtained after equilibration. Since one does not know this beforehand, one often starts with a random partition. However, this is very far from the equilibrium states, and if the block structure is sufficiently strong, it can lead to metastable configurations, where the block structure is only partially discovered, as shown in Fig. 4 for a network with B = 3.² The main problem is that not only does it take a long time to escape such metastable states, but also, by observing the values of S_t/c alone, one may arrive at the wrong conclusion that the Markov chain has equilibrated. For example, in the simulation shown in Fig. 4, it took many hundreds of sweeps for the final drop in S_t to occur, and before this, the time series is difficult to distinguish from an equilibrated chain. This problem is exacerbated if the average block size N/B increases, which can be frustrating, since one would like to consider these scenarios to be easier than those with smaller block sizes. In order to avoid this problem, we propose the agglomerative heuristic described in the next section, which can be used as a privileged starting point for the Markov chain, or as an approximate inference tool on its own.
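The mixing diagnostics of Eqs. (5) and (6) are straightforward to compute from a recorded entropy time series. A minimal sketch (function names are mine), assuming the initial transient has already been discarded:

```python
import numpy as np

def autocorr(series):
    """R(tau) of Eq. (5): normalized autocovariance of an entropy series."""
    s = np.asarray(series, dtype=float)
    s = s - s.mean()
    T = len(s)
    var = s.var()
    return np.array([np.dot(s[:T - tau], s[tau:]) / ((T - tau) * var)
                     for tau in range(T)])

def autocorr_time(series):
    """tau* of Eq. (6): sum of R(tau) up to the cutoff T0, taken here as the
    last tau before R(tau) first goes negative."""
    R = autocorr(series)
    neg = np.where(R < 0)[0]
    T0 = (neg[0] - 1) if len(neg) else len(R) - 1
    return R[:T0 + 1].sum()
```

For an uncorrelated series this yields tau* close to 1 (a single effective sweep), while slowly mixing chains give much larger values.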
² The occurrence of these metastable states is independent of the optimized moves, and happens also for the fully random ε → ∞ moves.

FIG. 4. Evolution of the MCMC for a network sampled from the PP model with N = 10⁴, ⟨k⟩ = 10, B = 3 and c = 0.99, starting from a fully random partition of the nodes. The networks show a representative snapshot of the state of the system before and after the last drop in S_t.

IV. AGGLOMERATIVE HEURISTIC

In order to avoid the metastable states described previously, we exploit the fact that they are more likely to occur if the block sizes are large, since otherwise the quenched topological fluctuations present in the network will offer a smaller free-energy barrier which needs to be overcome. Therefore, a more promising approach is to attempt to find the best configuration for some B′ > B, and then use that configuration to obtain a better estimate for one with B blocks.³ This can be done by merging blocks together progressively, as shown in Fig. 5. We implement this by constructing a block (multi)graph, where the blocks themselves are the nodes (weighted by the block sizes) and the edge counts e_rs are the edge multiplicities between each pair of block nodes. In this representation, a block merge is simply a block membership move of a block node, where initially each node is in its own block. The choice of moves is done with the same probability as before, i.e., via Eq. 3. In order to select the best merges, we attempt n_m moves for each block node, and collectively rank the best moves for all nodes according to ΔS_t/c. From this global ranking, we select the best B′ − B merges to obtain the desired partition into B blocks. However, if the value of N/B′ itself is too large, we face again the same problem as before.
Therefore, we proceed iteratively by starting with B₁ = N, and selecting B_{i+1} = B_i/σ, until we reach the desired B value, where σ > 1 controls how greedily the merges are performed. To diminish the effect of bad merges done in the earlier steps, we also allow individual node moves between each merge step, by applying the MCMC steps above to the original network, with β → ∞. The complexity of each agglomerative step is O[n_m E + N ln(B_{i−1} − B_i) + τE], which incorporates the search for the merge candidates, the ranking of the B_{i−1} − B_i best merges, and the movement of the individual nodes, where τ is the necessary number of sweeps to reach a local minimum.

³ Note that we cannot simply set B′ > B and perform the same MCMC sweeps, expecting to obtain a partition into B blocks, since the values of S_t/c obtained for larger B values are always smaller. Differently from other community detection approaches such as modularity optimization, here we are forced to control the value of B explicitly, which we can determine at a later step via a model selection procedure, as discussed previously.

FIG. 5. Representation of the block merges used in the agglomerative heuristic. Each square node is a block in the original graph, and the merges (represented as red dashed lines) correspond simply to block membership moves.

FIG. 6. Left: An example of a typical partition obtained by starting with a random B = 3 configuration and applying only greedy moves, until no further improvement is possible, for a PP network with N = 300, ⟨k⟩ = 10, and c = 0.9. Right: A typical outcome for the same network, with the greedy agglomerative algorithm described in the text.
Since we have in total ln(N/B)/ln σ merge steps, with the slowest one being the first, with B₁ = N, we have an overall complexity of O{[(n_m + τ)E + N ln N] × ln N/ln σ} ∼ O(N ln² N), if we assume that B ≪ N and that the graph is sparse with E ∼ O(N).⁴

⁴ This is a worst-case scenario. If B ∼ N, then the complexity reduces to O(N ln N).

Despite its greedy nature, we found that this approach is capable of almost always avoiding the metastable configurations described previously, and often comes very close or even exactly to the planted partition (see Fig. 6). The parameters n_m, σ and ε allow one to choose an appropriate trade-off between quality and speed. The best results are obtained for large n_m and small σ; however, these need not be chosen fully independently. We found that setting n_m to a "reasonable" value such as 10 or 100, and selecting σ to be 2, 1.1 or 1.01, allows one to probe the full quality range of the algorithm (see below). The choice of the value of ε is interesting, since making ε = 0 allows one to preserve certain graph invariants throughout the whole procedure. Since at the first merging step, when B_i = N, the e_rs matrix is simply the adjacency matrix, the membership moves with ε = 0 cannot merge nodes which belong to different components, or to different partitions in bipartite networks. It is easy to see that this property is preserved for later merging steps as well, so these invariants are fully reflected in the final block structure. We find that very often this is a desired property, and leads to better block partitions. In situations where it is not desired, it can be disabled by setting ε > 0.

The algorithm above can be turned into a more robust MCMC method by making β = 1 in the intermediary phase between each merge step, and waiting sufficiently long for the Markov chain to equilibrate.
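The agglomeration schedule and the global ranking of merge candidates can be sketched as follows. This is a simplified illustration: the names, the exact rounding of B_{i+1} = B_i/σ, and the handling of candidates that touch an already-merged block are my own choices; the ΔS values are assumed to come from the caller's entropy computation:

```python
import math

def merge_schedule(N, B, sigma):
    """Block counts B_1 = N, B_{i+1} = ceil(B_i / sigma), clipped at the
    target B; the number of steps is roughly ln(N/B) / ln(sigma)."""
    Bs = [N]
    while Bs[-1] > B:
        Bs.append(max(B, math.ceil(Bs[-1] / sigma)))
    return Bs

def select_merges(candidates, n_keep):
    """Globally rank candidate merges (delta_S, r, s) by entropy change and
    keep the n_keep best, skipping blocks already involved in a chosen merge."""
    chosen, used = [], set()
    for dS, r, s in sorted(candidates):
        if r not in used and s not in used:
            chosen.append((r, s))
            used.update((r, s))
            if len(chosen) == n_keep:
                break
    return chosen
```

Between consecutive schedule entries one would gather n_m candidate moves per block node via the proposal of Eq. (3), apply the selected merges, and then run β → ∞ (or β = 1) sweeps on the original network.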
This is a slower, but more exact counterpart to the greedy heuristic variant, which is less susceptible to getting trapped in the metastable states discussed previously. If one wishes to find the minimum of S_t/c, one can make β → ∞ after the chain has equilibrated, either abruptly (as we do in the results presented in this paper), or slowly via simulated annealing [40].

We can assess the quality of the heuristic method by comparing with known bounds on the detectability of the PP model. If we have N/B ≫ 1, it can be shown that for ⟨k⟩ < [(B − 1)/(cB − 1)]² [27-29] it is not possible to detect the planted partition with any method. To emphasize the applicability of the method for dissortative (or arbitrary) topologies, we also analyze a circular multipartite block model, with e_rs = 2E[(δ_{r,s−1} + δ_{r,s+1}) c/(2B) + (1 − c)/B²], where c controls the strength of the modular structure, and periodic boundaries are assumed. In both cases we compare the agglomerative heuristic with MCMC results starting from the true partition, which represents the best possible case. As can be seen in Fig. 7, the results from the optimal MCMC and the heuristic are identical up to some values of c which are larger than the actual detectability threshold. Thus the greedy method falls short of saturating the detectable parameter region, but behaves badly only for a relatively small range of c, below which it becomes much harder (but not impossible) to distinguish the observed network from a random graph. To give a more precise idea of the extent to which the graphs in this region deviate from a random topology, we compare with a model selection threshold based on the minimum description length (MDL) principle [19],

    ⟨k⟩ > 2 ln B / I_t/c,    (7)

with I_t/c = (S^r_t/c − S_t/c)/E, where S^r_t/c is the entropy for a fully random graph, with e_rs = 2E n_r n_s/N² (or e_rs = e_r e_s/2E for the degree-corrected case), and E ≫ B² was assumed. This criterion is useful when we do not know the correct value of B, and hence cannot rely on minimizing S_t/c alone, since it would always result in a B = N partition. If this condition is not fulfilled, the inferred partition (even if exact) is discarded in favor of a fully random graph, since the model parameters in this case cannot be used to provide a more compact description of the network. From Fig. 7 we see that this threshold lies very close to the region where the agglomerative algorithm is incapable of discovering the optimal partition. Hence, in situations where model selection needs to be performed, any significant improvement to the quality of the algorithm would be ultimately discarded, at least in these specific examples. In other situations, where an increased precision close to the detectability transition is desired, the heuristic should be used only as a component of the full-fledged MCMC procedure with β = 1, as described above, which should be able to eventually reach the optimal configurations, but requires longer running times.

FIG. 7. Normalized mutual information (NMI) (see footnote 6) between the planted and the inferred partitions for (top) the PP model and (bottom) the circular multipartite model described in the text, as a function of the modular strength c, for N = 10⁴ and B = 10. The "Escape" curves correspond to MCMC equilibrations starting from the planted partition, and the remaining curves to the greedy agglomerative heuristic with the ratio σ shown in the legend, and n_m = 10. All curves are averaged over 20 independent network realizations. The grey vertical dashed line corresponds to the detectability threshold c* for the PP model, and the red dashed line to the MDL model selection threshold of Eq. 7.

FIG. 8. Description length Σ for different empirical networks, collected for 100 independent runs of the MCMC algorithm (MC) and the agglomerative heuristic (Agg), for different agglomeration ratios σ.

FIG. 9. Normalized mutual information (NMI) between the best overall partition and each one collected for 100 independent runs of the MCMC algorithm (MC) and the agglomerative heuristic (Agg), for different agglomeration ratios σ.

FIG. 10.
Description length Σ for different empirical networks, as well as the normalized mutual information (NMI) between the best overall partition and each one, collected for 100 independent runs of the agglomerative heuristic, for different agglomeration ratios σ.

V. PERFORMANCE ON EMPIRICAL NETWORKS

We have analyzed a few empirical networks to assess the behavior of the algorithm in realistic situations. We have chosen the following networks: the largest component of coauthorships in network science [41] (N = 379, E = 914, undirected), the human disease gene network [42] (N = 903, E = 6,760, undirected), the political blog network [43] (N = 1,222, E = 19,021, directed), the Wikipedia vote network [44] (N = 8,298, E = 103,689, directed), the Enron email network [45, 46] (N = 36,692, E = 367,662, undirected), the largest strong component of the PGP network [47] (N = 39,796, E = 301,498, directed), the IMDB film-actor network [19] (N = 372,547, E = 1,812,312, undirected), and the Berkeley/Stanford web graph [46] (N = 654,782, E = 7,499,425, directed). In all cases we used the degree-corrected model. Since for these networks the most appropriate value of B is unknown, we performed model selection using the MDL criterion as described in Ref. [19], where we find the partition which minimizes the description length Σ = L_t/c + S_t/c, where L_t/c is the amount of information necessary to describe the model parameters, which increases with B.⁵ For the networks of moderate size we were capable of comparing the results of the agglomerative heuristic to those of the more time-consuming MCMC method. Figs. 8 and 10 show the values of Σ after several runs of each algorithm.
It can be observed that the results obtained with both methods seem largely indistinguishable for some networks (human diseases, network scientists, and Wikipedia votes), whereas the MCMC algorithm leads to better results for others (Enron email, political blogs), and interestingly to worse results for the PGP network. The better results for MCMC are expected, but the worse result for the PGP network is not. We can explain this by pointing out that for that network the average value of S_c obtained with MCMC for β = 1 noticeably differs from the minimum possible value. Since we used an abrupt cooling to β → ∞, the MCMC is more likely to get trapped in a local minimum than the agglomerative heuristic, which is never allowed to heat up to the β = 1 configurations. MCMC would probably match, or even improve upon, the heuristic results if, e.g., simulated annealing were used to reach the β → ∞ region. However, this serves as an example of at least one scenario where the agglomerative heuristic can lead to even better results, despite being much faster than MCMC.

Perhaps a more meaningful comparison among the different results is to determine how the obtained partitions differ from each other. This is shown in Figs. 9 and 10, where the normalized mutual information (NMI) [footnote 6] between the best partition across all runs of all algorithms and every other partition found is compared for the two algorithms.

[Footnote 5] As mentioned previously, a more refined MDL method presented in Ref. [26] computes L_{t/c} via a hierarchical sequence of stochastic block models, which provides better resolution at the expense of some additional complexity. But since our objective here is to compare methods of finding partitions, not model selection, we opt for the simpler criterion.
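The NMI used for this comparison (defined precisely in footnote 6) can be computed directly from two block-label vectors. A minimal sketch, using natural logarithms as in the footnote's definition:

```python
from collections import Counter
from math import log

def nmi(b, c):
    """Normalized mutual information 2*I(b,c) / [H(b) + H(c)] between two
    partitions b and c, given as equal-length sequences of block labels.

    I(b,c) = sum_{rs} p_bc(r,s) ln[p_bc(r,s) / (p_b(r) p_c(s))],
    H(x)   = -sum_r p_x(r) ln p_x(r)."""
    n = len(b)
    pb = Counter(b)            # block sizes in partition b
    pc = Counter(c)            # block sizes in partition c
    pbc = Counter(zip(b, c))   # joint block-overlap counts
    mi = sum((nrs / n) * log((nrs / n) / ((pb[r] / n) * (pc[s] / n)))
             for (r, s), nrs in pbc.items())
    entropy = lambda p: -sum((nr / n) * log(nr / n) for nr in p.values())
    hb, hc = entropy(pb), entropy(pc)
    # Two trivial single-block partitions are identical by convention.
    return 2 * mi / (hb + hc) if hb + hc > 0 else 1.0
```

The NMI is 1 for identical partitions (up to relabeling of the blocks) and 0 for statistically independent ones.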
Despite leading to different Σ values, the typical partitions found by each algorithm seem equally far from the (approximate) global optimum, so the difference in Σ can be attributed to minor differences in the partitions. From this we can conclude that the agglomerative heuristic delivers results comparable to MCMC for many empirical networks, while being significantly faster.

Note that the NMI values in Fig. 9 are overall reasonably high, indicating that the partitions are much more similar than different; however, they are almost never 1, or very close to it, except for the smallest networks. This seems to point to a certain degree of degeneracy of optimal partitions, similar to that reported in Ref. [48] for methods based on modularity maximization. A more detailed analysis of this is needed, but we leave it to future work.

VI. CONCLUSION

We have presented an optimized MCMC method [footnote 7] for inferring stochastic block models in large networks, which possesses an improved mixing time due to optimized proposed node membership moves, and an agglomerative procedure which strongly reduces the likelihood of getting trapped in undesired metastable states. By increasing the inverse temperature to β → ∞ this method is turned into an agglomerative heuristic, with a fast algorithmic complexity of O(N ln² N) in sparse networks.

[Footnote 6] The NMI is defined as 2 I({b_i}, {c_i}) / [H({b_i}) + H({c_i})], where I({b_i}, {c_i}) = Σ_{rs} p_bc(r, s) ln [p_bc(r, s) / (p_b(r) p_c(s))] and H({x_i}) = −Σ_r p_x(r) ln p_x(r), with {b_i} and {c_i} being two partitions of the network.

[Footnote 7] An efficient C++ implementation of the algorithm described here is freely available as part of the graph-tool Python library at http://graph-tool.skewed.de.
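As a rough illustration of the β → ∞ limit mentioned above: at finite β, a proposed move that changes the description length by ΔΣ can be accepted with the usual Metropolis probability min(1, e^{−β ΔΣ}), which at β → ∞ degenerates into the greedy rule of accepting only non-worsening moves. The sketch below shows only this acceptance step, not the paper's optimized move-proposal scheme:

```python
import math
import random

def accept_move(delta_sigma, beta, rng=random):
    """Metropolis acceptance for a proposed node move that would change
    the description length by delta_sigma, at inverse temperature beta.

    At beta -> infinity this reduces to the greedy rule used by the
    agglomerative heuristic: accept only non-worsening moves."""
    if beta == math.inf:
        return delta_sigma <= 0           # greedy limit
    if delta_sigma <= 0:
        return True                       # improvements always accepted
    return rng.random() < math.exp(-beta * delta_sigma)
```

At finite β the occasional acceptance of worsening moves is what lets the chain escape metastable states; the abrupt cooling discussed in the text corresponds to switching beta directly from 1 to math.inf.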
We have shown that although the heuristic does not fully saturate the detectability range of the MCMC method, it tends to find indistinguishable partitions for a very large range of parameters of the generative model, as well as for many empirical networks. The method also allows for detailed control of the number of blocks B being inferred, which makes it suitable to be used in conjunction with model selection techniques [19–26].

The heuristic method is comparable to the agglomerative algorithm of Clauset et al. [32] (and variants thereof, e.g. Refs. [49–51]), which has the same overall complexity, but is restricted to finding purely assortative block structures, is based on modularity optimization, and is strictly agglomerative, whereas the algorithm presented here permits individual node moves between the blocks at every stage, which allows for the correction of bad merges done in the earlier stages. It can also be compared to the popular method of Blondel et al. [33], which is not strictly agglomerative, but is also restricted to assortative structures and based on modularity, although it is typically faster than either the method of Clauset et al. or the method presented here.

Both the MCMC method and the greedy heuristic compare favorably to many statistical inference methods which depend on obtaining the full marginal probability π_{ir} that node i belongs to block r [27, 28, 52]. Although this gives more detailed information on the network structure, it does so at the expense of much increased algorithmic complexity. For instance, the belief propagation approach of Refs. [27, 28, 52], although it possesses strong optimality properties, requires O(N B²) operations per update sweep, in addition to an O(E B) memory complexity.
Since in realistic situations the desired value of B is likely to scale with some power of N, this approach quickly becomes impractical for very large networks, in contrast to the log-linear complexity in N (independent of B) of the method proposed in this paper.

[1] M. B. Hastings, Physical Review E 74, 035102 (2006).
[2] D. Garlaschelli and M. I. Loffredo, Physical Review E 78, 015101 (2008).
[3] M. E. J. Newman and E. A. Leicht, Proceedings of the National Academy of Sciences 104, 9564 (2007).
[4] J. Reichardt and D. R. White, The European Physical Journal B 60, 217 (2007).
[5] J. M. Hofman and C. H. Wiggins, Physical Review Letters 100, 258701 (2008).
[6] P. J. Bickel and A. Chen, Proceedings of the National Academy of Sciences 106, 21068 (2009).
[7] R. Guimerà and M. Sales-Pardo, Proceedings of the National Academy of Sciences 106, 22073 (2009).
[8] B. Karrer and M. E. J. Newman, Physical Review E 83, 016107 (2011).
[9] B. Ball, B. Karrer, and M. E. J. Newman, Physical Review E 84, 036103 (2011).
[10] J. Reichardt, R. Alamino, and D. Saad, PLoS ONE 6, e21282 (2011).
[11] Y. Zhu, X. Yan, and C. Moore, arXiv:1205.7009 (2012).
[12] E. B. Baskerville, A. P. Dobson, T. Bedford, S. Allesina, T. M. Anderson, and M. Pascual, PLoS Comput Biol 7, e1002321 (2011).
[13] M. E. J. Newman and M. Girvan, Physical Review E 69, 026113 (2004).
[14] P. W. Holland, K. B. Laskey, and S. Leinhardt, Social Networks 5, 109 (1983).
[15] S. E. Fienberg, M. M. Meyer, and S. S. Wasserman, Journal of the American Statistical Association 80, 51 (1985).
[16] K. Faust and S. Wasserman, Social Networks 14, 5 (1992).
[17] C. J. Anderson, S. Wasserman, and K. Faust, Social Networks 14, 137 (1992).
[18] S. Fortunato, Physics Reports 486, 75 (2010).
[19] T. P. Peixoto, Physical Review Letters 110, 148701 (2013).
[20] M. Rosvall and C. T. Bergstrom, Proceedings of the National Academy of Sciences 104, 7327 (2007).
[21] J.-J. Daudin, F. Picard, and S. Robin, Statistics and Computing 18, 173 (2008).
[22] M. Mariadassou, S. Robin, and C. Vacher, The Annals of Applied Statistics 4, 715 (2010).
[23] C. Moore, X. Yan, Y. Zhu, J.-B. Rouquier, and T. Lane, in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11 (ACM, New York, NY, USA, 2011), pp. 841–849.
[24] P. Latouche, E. Birmele, and C. Ambroise, Statistical Modelling 12, 93–115 (2012).
[25] E. Côme and P. Latouche, Model selection and clustering in stochastic block models with the exact integrated complete data likelihood, arXiv:1303.2962 (2013).
[26] T. P. Peixoto, Hierarchical block structures and high-resolution model selection in large networks, arXiv:1310.4377 (2013).
[27] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Physical Review Letters 107, 065701 (2011).
[28] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Physical Review E 84, 066106 (2011).
[29] E. Mossel, J. Neeman, and A. Sly, (2012).
[30] J. Reichardt and M. Leone, Physical Review Letters 101, 078701 (2008).
[31] D. Hu, P. Ronhovde, and Z. Nussinov, Philosophical Magazine 92, 406 (2012).
[32] A. Clauset, M. E. J. Newman, and C. Moore, Physical Review E 70, 066111 (2004).
[33] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, Journal of Statistical Mechanics: Theory and Experiment 2008, P10008 (2008).
[34] G. Bianconi, Physical Review E 79, 036114 (2009).
[35] T. P. Peixoto, Physical Review E 85, 056122 (2012).
[36] S. Fortunato and M. Barthélemy, Proceedings of the National Academy of Sciences 104, 36 (2007).
[37] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, The Journal of Chemical Physics 21, 1087 (1953).
[38] W. K. Hastings, Biometrika 57, 97 (1970).
[39] A. Condon and R. M. Karp, Random Structures & Algorithms 18, 116–140 (2001).
[40] S. Kirkpatrick, C. D. Gelatt Jr, and M. P. Vecchi, Science 220, 671 (1983).
[41] M. E. J. Newman, Physical Review E 74, 036104 (2006).
[42] K. I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A. L. Barabási, Proceedings of the National Academy of Sciences 104, 8685 (2007).
[43] L. A. Adamic and N. Glance, in Proceedings of the 3rd International Workshop on Link Discovery, LinkKDD '05 (ACM, New York, NY, USA, 2005), pp. 36–43.
[44] J. Leskovec, D. Huttenlocher, and J. Kleinberg, in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '10 (ACM, New York, NY, USA, 2010), pp. 1361–1370.
[45] B. Klimt and Y. Yang, in CEAS (2004).
[46] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, arXiv:0810.1355 (2008).
[47] O. Richters and T. P. Peixoto, PLoS ONE 6, e18384 (2011).
[48] B. H. Good, Y.-A. de Montjoye, and A. Clauset, Physical Review E 81, 046106 (2010).
[49] K. Wakita and T. Tsurumi, in Proceedings of the 16th International Conference on World Wide Web, WWW '07 (ACM, New York, NY, USA, 2007), pp. 1275–1276.
[50] P. Schuetz and A. Caflisch, Physical Review E 77, 046112 (2008).
[51] P. Schuetz and A. Caflisch, Physical Review E 78, 026112 (2008).
[52] X. Yan, J. E. Jensen, F. Krzakala, C. Moore, C. R. Shalizi, L. Zdeborová, P. Zhang, and Y. Zhu, (2012).
