Achieving Optimal Misclassification Proportion in Stochastic Block Model



Chao Gao (Yale University), Zongming Ma (University of Pennsylvania), Anderson Y. Zhang (Yale University) and Harrison H. Zhou (Yale University)

October 6, 2015

Abstract

Community detection is a fundamental statistical problem in network data analysis. Many algorithms have been proposed to tackle this problem. Most of these algorithms are not guaranteed to achieve the statistical optimality of the problem, while procedures that achieve information theoretic limits for general parameter spaces are not computationally tractable. In this paper, we present a computationally feasible two-stage method that achieves optimal statistical performance in misclassification proportion for stochastic block model under weak regularity conditions. Our two-stage procedure consists of a refinement stage motivated by penalized local maximum likelihood estimation. This stage can take a wide range of weakly consistent community detection procedures as initializer, to which it applies and outputs a community assignment that achieves optimal misclassification proportion with high probability. The practical effectiveness of the new algorithm is demonstrated by competitive numerical results.

Keywords. Clustering, Community detection, Minimax rates, Network analysis, Spectral clustering.

1 Introduction

Network data analysis [71, 29] has become one of the leading topics in statistics. In fields such as physics, computer science, social science and biology, one observes a network among a large number of subjects of interest such as particles, computers, people, etc. The observed network can be modeled as an instance of a random graph, and the goal is to infer structures of the underlying generating process.
A structure of particular interest is community: the nodes of the graph can be partitioned in some suitable sense so that each node belongs to a community. Starting with the proposal of a series of methodologies [28, 55, 33, 40], we have seen a tremendous literature devoted to algorithmic solutions for uncovering community structure, and great advances have also been made in recent years on the theoretical understanding of the problem in terms of statistical consistency and thresholds for detection and exact recovery. See, for instance, [11, 23, 77, 51, 53, 49, 2, 54, 31], among others. In spite of the great efforts exerted on this "community detection" problem, its state-of-the-art solution has not yet reached a level of maturity comparable to what statisticians have achieved in other high dimensional problems such as nonparametric estimation [63, 38], high dimensional regression [12] and covariance matrix estimation [14]. In these more well-established problems, not only do we know the fundamental statistical limits, we also have computationally feasible algorithms to achieve them. The major goal of the present paper is to serve as a step towards such maturity in network data analysis by proposing a computationally feasible algorithm for community detection in the stochastic block model with provable statistical optimality.

To describe network data with community structure, we focus on the stochastic block model (SBM) proposed by [34]. Let A ∈ {0, 1}^{n×n} be the symmetric adjacency matrix of an undirected random graph generated according to an SBM with k communities. Then the diagonal entries of A are all zeros, and each A_{uv} = A_{vu} for u > v is an independent Bernoulli random variable with mean P_{uv} = B_{σ(u)σ(v)} for some symmetric connectivity matrix B ∈ [0, 1]^{k×k} and some label function σ: [n] → [k], where for any positive integer m, [m] = {1, ..., m}. In other words, if the u-th node and the v-th node belong to the i-th and the j-th community respectively, then σ(u) = i, σ(v) = j, and there is an edge connecting u and v with probability B_{ij}. Community detection then refers to the problem of estimating the label function σ subject to a permutation of the community labels {1, ..., k}. A natural loss function for such an estimation problem is the proportion of wrong labels (subject to a permutation of the label set [k]), which we shall refer to as the misclassification proportion from here on.

In groundbreaking works by Mossel et al. [51, 53] and Massoulié [49], the authors established the sharp threshold for the regimes in which it is possible and impossible to achieve a misclassification proportion strictly less than 1/2 when k = 2 and both communities are of the same size (so that the estimator is better than random guessing), which settled the conjecture in [23] that had previously been justified only at the level of physics rigor. For some recent progress on the general case of fixed k and possibly unequal sized communities, see [1]. On the other hand, Abbe et al. [2], Mossel et al. [54] and Hajek et al. [31] established the necessary and sufficient condition for ensuring zero misclassification proportion (usually referred to as "strong consistency" in the literature) with high probability when k = 2 and community sizes are equal; this was later generalized to a larger set of fixed k by [32]. Arguably, what is of more interest to statisticians is the intermediate regime between the above two cases, namely when the misclassification proportion is vanishing as the number of nodes grows but not exactly zero. This is usually called the regime of "weak consistency" in the network literature. To achieve weak (and strong) consistency, statisticians have proposed various methods.
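The SBM sampling mechanism just described is mechanical to implement. The following is a minimal sketch (not the paper's code; the function name and the particular a, b, n values are our own choices for illustration) that draws a symmetric adjacency matrix with zero diagonal from a given connectivity matrix B and label function σ:

```python
import numpy as np

def sample_sbm(B, sigma, rng):
    """Draw a symmetric {0,1} adjacency matrix A with zero diagonal from an SBM.

    B     : (k, k) symmetric connectivity matrix with entries in [0, 1]
    sigma : length-n array of community labels in {0, ..., k-1}
    """
    n = len(sigma)
    P = B[np.ix_(sigma, sigma)]            # P_uv = B_{sigma(u), sigma(v)}
    upper = rng.random((n, n)) < P         # independent Bernoulli(P_uv) draws
    A = np.triu(upper, k=1).astype(int)    # keep only u < v; diagonal stays 0
    return A + A.T                         # symmetrize: A_uv = A_vu

# Example (hypothetical parameters): two equal communities,
# within-probability a/n and between-probability b/n.
rng = np.random.default_rng(0)
n, a, b = 200, 40.0, 10.0
B = np.array([[a / n, b / n], [b / n, a / n]])
sigma = np.repeat([0, 1], n // 2)
A = sample_sbm(B, sigma, rng)
```

The within/between parametrization a/n, b/n matches the parameter space Θ_0 introduced in Section 2.1 below.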
One popular approach is spectral clustering [65], which is motivated by the observation that the rank of the n × n matrix P = (P_{uv}) = (B_{σ(u)σ(v)}) is at most k and its leading eigenvectors contain information about the community structure. The application of spectral clustering to network data goes back to [30, 50], and its performance under the stochastic block model has been investigated by [21, 59, 62, 25, 57, 39, 45, 67, 19, 37, 42], among others. To further improve the performance, various ways of refining spectral clustering have been proposed, such as those in [7, 54, 46, 72, 19], which lead to strong consistency or convergence rates that are exponential in the signal-to-noise ratio, while [52] studied the problem of minimizing a non-vanishing misclassification proportion. However, in the regime of weak consistency, these refinement methods are not guaranteed to attain the optimal misclassification proportion to be introduced below. Another important line of research is devoted to the investigation of likelihood-based methods, which was initiated by [11] and later extended to more general settings by [77, 20]. To tackle the intractability of optimizing the likelihood function, an EM algorithm using pseudo-likelihood was proposed by [7]. Another way to overcome the intractability of the maximum likelihood estimator (MLE) is by convex relaxation. Various semi-definite relaxations were studied by [13, 18, 6], and the aforementioned sharp threshold for strong consistency in [31, 32] was indeed achieved by semi-definite programming.
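The low-rank observation motivating spectral clustering can be checked directly on the population matrix P (a noise-free illustration of the rank and eigenvector structure, not the paper's Algorithm 2; the small values of n, a, b are our own):

```python
import numpy as np

# Population matrix P = (B_{sigma(u) sigma(v)}) for two equal communities.
n, a, b = 8, 6.0, 2.0
sigma = np.repeat([0, 1], n // 2)
B = np.array([[a / n, b / n], [b / n, a / n]])
P = B[np.ix_(sigma, sigma)]

# P has rank at most k = 2, and its leading eigenvectors are blockwise
# constant, so a sign split of the second eigenvector recovers the partition.
eigvals, eigvecs = np.linalg.eigh(P)
order = np.argsort(-np.abs(eigvals))
assert np.sum(np.abs(eigvals) > 1e-10) <= 2   # rank(P) <= k
u2 = eigvecs[:, order[1]]                     # second leading eigenvector
labels = (u2 > 0).astype(int)
# labels (or its relabelling) matches sigma exactly on this noise-free P
```

With the observed adjacency matrix A in place of P, the same eigenvectors are perturbed by noise, which is why the clustering step of Algorithm 2 and the subsequent refinement stage are needed.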
Recently, Zhang and Zhou [74] established the minimax risk for misclassification proportion in SBM under weak conditions, which is of the form

exp( −(1 + o(1)) nI*/k )   (1)

if all k communities are of equal sizes, where I* is the minimum Rényi divergence of order 1/2 [58] between the within-community and the between-community edge distributions. See Theorem 1 below for a more general and precise statement of the minimax risk. Unfortunately, Zhang and Zhou [74] used the MLE for achieving the risk in (1), which is computationally intractable. Moreover, none of the spectral clustering based methods or tractable variants of the likelihood based methods has a known error bound that matches (1) with the sharp constant 1 + o(1) on the exponent.

The main contribution of the current paper lies in the proposal of a computationally feasible algorithm that provably achieves the optimal misclassification proportion established in [74] adaptively under weak regularity conditions. It covers the cases of both finite and diverging numbers of communities and both equal and unequal community sizes, and achieves both weak and strong consistency in the respective regimes. In addition, the algorithm is guaranteed to run in polynomial time even when the number of communities diverges with the number of nodes. Since the error bound of the algorithm matches the optimal misclassification proportion in [74] under weak conditions, it achieves various existing detection boundaries in the literature. For instance, for any fixed number of communities, the procedure is weakly consistent under the necessary and sufficient condition of [51, 53], and strongly consistent under the necessary and sufficient condition of [2, 54, 31, 32]. Moreover, it matches the optimal misclassification proportion in [74] even when k diverges.
To the best of our limited knowledge, this is the first polynomial-time algorithm that achieves minimax optimal performance. In other words, the proposed procedure enjoys both statistical and computational efficiency.

The core of the algorithm is a refinement scheme for community detection motivated by penalized maximum likelihood estimation. As long as there exists an initial estimator that satisfies a certain weak consistency criterion, the refinement scheme is able to obtain an improved estimator that achieves the optimal misclassification proportion in (1) with high probability. The key to achieving the bound in (1) is to optimize the local penalized likelihood function for each node separately. This local optimization step is completely data-driven and has a closed form solution, and hence can be computed very efficiently. The additional penalty term is indispensable, as it plays a key role in ensuring the optimal performance when the community sizes are unequal and when the within-community and/or between-community edge probabilities are unequal. To obtain a qualified initial estimator, we show that both spectral clustering and its normalized variant satisfy the desired condition needed for the subsequent refinement, though the refinement scheme works for any other method satisfying a certain weak consistency condition. Note that spectral clustering can be considered a global method, and hence our two-stage algorithm runs in a "from global to local" fashion. In essence, with high probability, the global stage pinpoints a local neighborhood in which we shall search for the solution to each local penalized maximum likelihood problem, and the subsequent local stage finds the desired solution. From this viewpoint, one can also regard our approach as an "optimization after localization" procedure.
Historically, this idea played a key role in the development of the renowned one-step efficient estimator [9, 43, 10]. It has also led to recent progress in non-convex optimization and localized gradient descent techniques for finding optimal solutions to high dimensional statistical problems. Examples include but are not limited to high-dimensional linear regression [76], sparse PCA [56, 47, 15, 70], sparse CCA [27], phase retrieval [16] and the high dimensional EM algorithm [8, 69]. A closely related idea has also found success in the development of confidence intervals for regression coefficients in high dimensional linear regression. See, for instance, [75, 64, 36] and the references therein. Last but not least, even when viewed as a "spectral clustering plus refinement" procedure, our method distinguishes itself from other such methods in the literature by provably achieving the minimax optimal performance over a wide range of parameter configurations.

The rest of the paper is organized as follows. Section 2 formally sets up the community detection problem and presents the two-stage algorithm. The theoretical guarantees for the proposed method are given in Section 3, followed by numerical results demonstrating its competitive performance on both simulated and real datasets in Sections 4 and 5. A discussion on the results in the current paper and possible directions for future investigation is included in Section 6. Section 7 presents the proofs of the main results, with some technical details deferred to the appendix.

We close this section by introducing some notation. For a matrix M = (M_{ij}), we denote its Frobenius norm by ||M||_F = (∑_{ij} M_{ij}^2)^{1/2} and its operator norm by ||M||_op = max_l λ_l(M), where λ_l(M) is its l-th singular value. We use M_{i*} to denote its i-th row. The norm ||·|| is the usual Euclidean norm for vectors. For a set S, |S| denotes its cardinality.
The notations P and E are generic probability and expectation operators whose distribution is determined from the context. For two positive sequences {x_n} and {y_n}, x_n ≍ y_n means x_n/C ≤ y_n ≤ C x_n for some constant C ≥ 1 independent of n. Throughout the paper, unless otherwise noted, we use C, c and their variants to denote absolute constants, whose values may change from line to line.

2 Problem formulation and methodology

In this section, we give a precise formulation of the community detection problem and present a new method for it. The method consists of two stages: initialization and refinement. We shall first introduce the second stage, which is the main algorithm of the paper. It clusters the network data by performing a node-wise penalized neighbor voting based on some initial community assignment. Then, we will discuss several candidates for the initialization step, including a new greedy algorithm for clustering the leading eigenvectors of the adjacency matrix or of the graph Laplacian that is tailored specifically for the stochastic block model. Theoretical guarantees for the algorithms introduced in the current section will be presented in Section 3.

2.1 Community detection in the stochastic block model

Recall that a stochastic block model is completely characterized by a symmetric connectivity matrix B ∈ [0, 1]^{k×k} and a label vector σ ∈ [k]^n. One widely studied parameter space of SBM is

Θ_0(n, k, a, b, β) = { (B, σ) : σ: [n] → [k], |{u ∈ [n] : σ(u) = i}| ∈ [ n/(βk) − 1, βn/k + 1 ] for all i ∈ [k], B = (B_{ij}) ∈ [0, 1]^{k×k}, B_{ii} = a/n for all i and B_{ij} = b/n for all i ≠ j },   (2)

where β ≥ 1 is an absolute constant. This parameter space Θ_0(n, k, a, b, β) contains all SBMs in which the within-community connection probabilities are all equal to a/n and the between-community connection probabilities are all equal to b/n.
In the special case of β = 1, all communities are of nearly equal sizes. Assuming equal within and equal between connection probabilities can be restrictive. Thus, we also introduce the following larger parameter space

Θ(n, k, a, b, λ, β; α) = { (B, σ) : σ: [n] → [k], |{u ∈ [n] : σ(u) = i}| ∈ [ n/(βk) − 1, βn/k + 1 ] for all i ∈ [k], B = B^T = (B_{ij}) ∈ [0, 1]^{k×k}, b/(αn) ≤ (1/(k(k−1))) ∑_{i≠j} B_{ij} ≤ max_{i≠j} B_{ij} = b/n, a/n = min_i B_{ii} ≤ max_i B_{ii} ≤ αa/n, λ_k(P) ≥ λ with P = (P_{uv}) = (B_{σ(u),σ(v)}) }.   (3)

Throughout the paper, we treat β ≥ 1 and α ≥ 1 as absolute constants, while k, a, b and λ should be viewed as functions of the number of nodes n which can vary as n grows. Moreover, we assume 0 < b/n < a/n ≤ 1 − ε throughout the paper for some numeric constant ε ∈ (0, 1). Thus, the parameter space Θ(n, k, a, b, λ, β; α) requires that the within-community connection probabilities are bounded from below by a/n and the connection probabilities between any two communities are bounded from above by b/n. In addition, it requires that the sizes of different communities are comparable. In order to guarantee that Θ(n, k, a, b, λ, β; α) is a larger parameter space than Θ_0(n, k, a, b, β), we always require λ to be positive and sufficiently small such that

Θ_0(n, k, a, b, β) ⊂ Θ(n, k, a, b, λ, β; α).   (4)

According to Proposition 1 in the appendix, a sufficient condition for (4) is λ ≤ (a − b)/(2βk). We assume (4) throughout the rest of the paper.

The labels on the n nodes induce a community structure [n] = ∪_{i=1}^k C_i, where C_i = {u ∈ [n] : σ(u) = i} is the i-th community with size n_i = |C_i|. Our goal is to reconstruct this partition, or equivalently, to estimate the label of each node modulo any permutation of the label symbols.
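The proportion of wrongly labeled nodes, minimized over relabellings of the k symbols, can be computed directly. A small sketch (our own helper, not from the paper; brute force over all k! permutations, which is fine for small k — for large k one would use an assignment solver instead):

```python
from itertools import permutations

def misclassification_proportion(sigma_hat, sigma, k):
    """Fraction of nodes where sigma_hat disagrees with sigma, minimized over
    all permutations of the k label symbols. Labels are in {0, ..., k-1}."""
    n = len(sigma)
    best = n
    for pi in permutations(range(k)):
        errors = sum(1 for u in range(n) if sigma_hat[u] != pi[sigma[u]])
        best = min(best, errors)
    return best / n

# A pure relabelling of the truth incurs zero loss.
sigma     = [0, 0, 0, 1, 1, 1]
sigma_hat = [1, 1, 1, 0, 0, 0]   # same partition, swapped label symbols
print(misclassification_proportion(sigma_hat, sigma, k=2))  # 0.0
```

This is exactly the loss used as the error measure throughout the paper; its formal definition appears in display (5).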
Therefore, a natural error measure is the misclassification proportion defined as

ℓ(σ̂, σ) = min_{π ∈ S_k} (1/n) ∑_{u ∈ [n]} 1{ σ̂(u) ≠ π(σ(u)) },   (5)

where S_k stands for the symmetric group on [k] consisting of all permutations of [k].

2.2 Main algorithm

We are now ready to present the main method of the paper: a refinement algorithm for community detection in the stochastic block model motivated by penalized local maximum likelihood estimation. Indeed, for any SBM in the parameter space Θ_0(n, k, a, b, 1) with equal community sizes, the MLE for σ [13, 18, 74] is

σ̂ = argmax_{σ: [n] → [k]} ∑_u ...

Output: Community assignment σ̂.
1  Set S = [n];
2  for i = 1 to k do
3    Let t_i = argmax_{u ∈ S} |{ v ∈ S : ||Û_{v*} − Û_{u*}|| < r }|;
4    Set Ĉ_i = { v ∈ S : ||Û_{v*} − Û_{t_i*}|| < r };
5    Label σ̂(u) = i for all u ∈ Ĉ_i;
6    Update S ← S \ Ĉ_i.
   end
7  If S ≠ ∅, then for any u ∈ S, set σ̂(u) = argmin_{i ∈ [k]} (1/|Ĉ_i|) ∑_{v ∈ Ĉ_i} ||Û_{u*} − Û_{v*}||.

Last but not least, we would like to emphasize that one need not limit the initialization algorithm to the spectral methods introduced in this section. As Theorem 2 below shows, Algorithm 1 works for any initialization method that satisfies a weak consistency condition.

3 Theoretical properties

Before stating the theoretical properties of the proposed method, we first review the minimax rate in [74], which will be used as the optimality benchmark. The minimax risk is governed by the following critical quantity,

I* = −2 log( √(a/n) · √(b/n) + √(1 − a/n) · √(1 − b/n) ),   (14)

which is the Rényi divergence of order 1/2 between Bern(a/n) and Bern(b/n), i.e., Bernoulli distributions with success probabilities a/n and b/n respectively. Recall that 0 < b/n < a/n ≤ 1 − ε is assumed throughout the paper. It can be shown that I* ≍ (a − b)²/(na).
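The greedy clustering steps above translate almost line by line into code. A sketch (the function name is ours; the row matrix Û of leading eigenvectors and the critical radius r are produced elsewhere in the algorithm and are passed in here as arguments):

```python
import numpy as np

def greedy_cluster(U, k, r):
    """Greedy clustering of the rows of U (e.g. the leading eigenvectors).

    Repeatedly pick the unassigned row whose radius-r ball contains the most
    unassigned rows, declare that ball a community, remove it, and continue;
    any leftover row joins the community at the smallest average distance.
    """
    n = U.shape[0]
    labels = -np.ones(n, dtype=int)
    S = set(range(n))                 # step 1: unassigned nodes
    clusters = []
    for i in range(k):                # step 2
        if not S:
            break
        # step 3: node t_i whose ball {v in S : ||U_v - U_t|| < r} is largest
        t = max(S, key=lambda u: sum(np.linalg.norm(U[v] - U[u]) < r for v in S))
        # steps 4-6: carve out the ball around t_i as community i
        C = {v for v in S if np.linalg.norm(U[v] - U[t]) < r}
        for v in C:
            labels[v] = i
        clusters.append(sorted(C))
        S -= C
    # step 7: leftover nodes join the closest community on average
    for u in S:
        labels[u] = min(range(len(clusters)),
                        key=lambda i: np.mean([np.linalg.norm(U[u] - U[v])
                                               for v in clusters[i]]))
    return labels
```

Picking the densest remaining ball first makes the procedure robust to a few badly placed rows, which end up handled by the leftover rule in step 7.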
Moreover, when a/n = o(1),

I* = (1 + o(1)) (√a − √b)²/n
   = (1 + o(1)) [ (√(a/n) − √(b/n))² + (√(1 − a/n) − √(1 − b/n))² ]
   = (2 + o(1)) H²( Bern(a/n), Bern(b/n) ),

where H²(P, Q) = (1/2) ∫ (√dP − √dQ)² is the squared Hellinger distance between two distributions P and Q. The minimax rate for the parameter spaces (2) and (3) under the loss function (5) is given in the following theorem.

Theorem 1 ([74]). When (a − b)²/(ak log k) → ∞, we have

inf_{σ̂} sup_{(B,σ) ∈ Θ} E_{B,σ} ℓ(σ̂, σ) = exp( −(1 + η) nI*/2 ) if k = 2, and exp( −(1 + η) nI*/(βk) ) if k ≥ 3,

for both Θ = Θ_0(n, k, a, b, β) and Θ = Θ(n, k, a, b, λ, β; α) with any λ ≤ (a − b)/(2βk) and any β ∈ [1, √(5/3)), where η = η_n → 0 is some sequence tending to 0 as n → ∞.

Remark 1. The assumption β ∈ [1, √(5/3)) is needed in [74] for some technical reason. Here, the parameter β enters the minimax rates when k ≥ 3 since the worst case is essentially when one has two communities of size n/(βk), while for k = 2, the worst case is essentially two communities of size n/2. For all other results in this paper, we allow β to be an arbitrary constant no less than 1.

To this end, let us show that the two-stage algorithm proposed in Section 2 achieves the optimal misclassification proportion. The essence of the two-stage algorithm lies in the refinement scheme described in Algorithm 1. As long as the initialization step satisfies a certain weak consistency criterion, the refinement step directly leads to a solution with optimal misclassification proportion. To be specific, the initialization step needs to satisfy the following condition.

Condition 1. There exist constants C_0, δ > 0 and a positive sequence γ = γ_n such that

inf_{(B,σ) ∈ Θ} min_{u ∈ [n]} P_{B,σ}( ℓ(σ, σ⁰_u) ≤ γ ) ≥ 1 − C_0 n^{−(1+δ)},   (15)

for some parameter space Θ.
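The critical quantity I* in (14) and its Hellinger approximation are easy to compute numerically; the following sketch (our own helper functions, with illustrative n, a, b) checks that I* ≈ 2H² in the sparse regime a/n = o(1), as in the display above:

```python
from math import log, sqrt

def renyi_half(p, q):
    """Renyi divergence of order 1/2 between Bern(p) and Bern(q), as in (14):
    I* = -2 log( sqrt(p*q) + sqrt((1-p)*(1-q)) )."""
    return -2.0 * log(sqrt(p * q) + sqrt((1 - p) * (1 - q)))

def hellinger_sq(p, q):
    """Squared Hellinger distance between Bern(p) and Bern(q)."""
    return 0.5 * ((sqrt(p) - sqrt(q)) ** 2 + (sqrt(1 - p) - sqrt(1 - q)) ** 2)

# For small success probabilities a/n, b/n the two quantities agree
# up to a factor 1 + o(1): I* ~ 2 H^2.
n, a, b = 10_000, 50.0, 10.0
print(renyi_half(a / n, b / n), 2 * hellinger_sq(a / n, b / n))
```

Identical distributions give I* = 0, and I* grows as the within- and between-community probabilities separate, which is why it plays the role of a signal-to-noise ratio in the minimax exponent.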
Under Condition 1, we have the following upper bounds on the performance of the proposed refinement scheme.

Theorem 2. Suppose as n → ∞, (a − b)²/(ak log k) → ∞, a ≍ b, and Condition 1 is satisfied for

γ = o( 1/(k log k) )   (16)

and Θ = Θ_0(n, k, a, b, β). Then there is a sequence η → 0 such that

sup_{(B,σ) ∈ Θ} P_{B,σ}( ℓ(σ, σ̂) ≥ exp( −(1 − η) nI*/2 ) ) → 0, if k = 2,
sup_{(B,σ) ∈ Θ} P_{B,σ}( ℓ(σ, σ̂) ≥ exp( −(1 − η) nI*/(βk) ) ) → 0, if k ≥ 3,   (17)

where I* is defined as in (14). If in addition Condition 1 is satisfied for γ satisfying both (16) and

γ = o( (a − b)/(ak) )   (18)

and Θ = Θ(n, k, a, b, λ, β; α), then the conclusion in (17) continues to hold for Θ = Θ(n, k, a, b, λ, β; α).

Theorem 2 assumes a ≍ b. The case where a ≍ b may not hold is considered in Section 6. Compared with Theorem 1, the upper bounds (17) achieved by Algorithm 1 are minimax optimal. The condition (16) for the parameter space Θ_0(n, k, a, b, β) is very mild. When k = O(1), it reduces to γ = o(1) and simply means that the initialization should be weakly consistent at any rate. For k → ∞, it implies that the misclassification proportion within each community converges to zero. Note that if the initialization step gives wrong labels to all nodes in one particular community, then the misclassification proportion is at least 1/k. The condition (16) rules out this situation. For the parameter space Θ(n, k, a, b, λ, β; α), the extra condition (18) is required. This is because estimating the connectivity matrix B in Θ(n, k, a, b, λ, β; α) is harder than in Θ_0(n, k, a, b, β). In other words, if we do not pursue adaptive estimation, (18) is not needed.

Remark 2. Theorem 2 is an adaptive result that does not assume knowledge of a and b. When these two parameters are known, we can directly use a and b in (11) of Algorithm 1.
By scrutinizing the proof of Theorem 2, the conditions (16) and (18) can be weakened to γ = o(k^{−1}) in this case.

Given the results of Theorem 2, it remains to check that the initialization step via spectral clustering satisfies Condition 1. For the matrix P = (P_{uv}) = (B_{σ(u)σ(v)}) with (B, σ) belonging to either Θ_0(n, k, a, b, β) or Θ(n, k, a, b, λ, β; α), we use λ_k to denote λ_k(P). Define the average degree by

d̄ = (1/n) ∑_{u ∈ [n]} d_u.   (19)

Theorem 3. Assume e ≤ a ≤ C_1 b for some constant C_1 > 0 and

k a / λ_k² ≤ c,   (20)

for some sufficiently small c ∈ (0, 1). Consider USC(τ) with a sufficiently small constant µ > 0 in Algorithm 2 and τ = C_2 d̄ for some sufficiently large constant C_2 > 0. For any constant C_0 > 0, there exists some C > 0 depending only on C_0, C_1, C_2 and µ such that

ℓ(σ̂, σ) ≤ C a / λ_k²,

with probability at least 1 − n^{−C_0}. If k is fixed, the same conclusion holds without assuming a ≤ C_1 b.

Remark 3. Theorem 3 improves the error bound for spectral clustering in [45]. While [45] requires the assumption a > C log n, our result also holds for a = o(log n). A result close to ours is that of [19], but their clustering step is different from Algorithm 2. Moreover, the conclusion of Theorem 3 holds with probability 1 − n^{−C_0} for an arbitrarily large C_0, which is critical because the initialization step needs to satisfy Condition 1 for the subsequent refinement step to work. On the other hand, the bound in [19] is stated with probability 1 − o(1).

When k = O(1), Theorem 2 and Theorem 3 jointly imply the following result.

Corollary 3.1. Consider Algorithm 1 initialized by σ⁰ obtained with USC(τ) for τ = C d̄, where C is a sufficiently large constant. Suppose as n → ∞, k = O(1), (a − b)²/a → ∞ and a ≍ b.
Then, there exists a sequence η → 0 such that

sup_{(B,σ) ∈ Θ} P_{B,σ}( ℓ(σ, σ̂) ≥ exp( −(1 − η) nI*/2 ) ) → 0, if k = 2,
sup_{(B,σ) ∈ Θ} P_{B,σ}( ℓ(σ, σ̂) ≥ exp( −(1 − η) nI*/(βk) ) ) → 0, if k ≥ 3,

where the parameter space is Θ = Θ_0(n, k, a, b, β).

Compared with Theorem 1, the proposed procedure achieves the minimax rate under the conditions (a − b)²/a → ∞ and a ≍ b. When k = O(1), the condition (a − b)²/a → ∞ is necessary and sufficient for weak consistency in view of Theorem 1. More general results, including the case of k → ∞, are stated and discussed in Section 6.

The following theorem characterizes the misclassification rate of normalized spectral clustering.

Theorem 4. Assume e ≤ a ≤ C_1 b for some constant C_1 > 0 and

k a log a / λ_k² ≤ c,   (21)

for some sufficiently small c ∈ (0, 1). Consider NSC(τ) with a sufficiently small constant µ > 0 in Algorithm 2 and τ = C_2 d̄ for some sufficiently large constant C_2 > 0. Then, for any constant C_0 > 0, there exists some C > 0 depending only on C_0, C_1, C_2 and µ such that

ℓ(σ̂, σ) ≤ C a log a / λ_k²,

with probability at least 1 − n^{−C_0}. If k is fixed, the same conclusion holds without assuming a ≤ C_1 b.

Remark 4. A slightly different regularization of normalized spectral clustering is studied by [57] only for the dense regime, while Theorem 4 holds under both dense and sparse regimes. Moreover, our result also improves that of [42] due to our tighter bound on ||L(A_τ) − L(P_τ)||_op in Lemma 7 below. We conjecture that the log a factor in both the assumption and the bound of Theorem 4 can be removed.

Note that Theorem 3 and Theorem 4 are stated in terms of the quantity λ_k. We may specialize the results to the parameter spaces defined in (2) and (3). By Proposition 1, λ_k ≥ (a − b)/(2βk) for Θ_0(n, k, a, b, β) and λ_k ≥ λ for Θ(n, k, a, b, λ, β; α).
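Both Theorems 3 and 4 use the regularization level τ = C_2 d̄ built from the average degree d̄ in (19). A minimal sketch of the degree-based regularization step (the helper is ours; the text does not spell out USC(τ)'s trimming rule here, so treating regularization as dropping nodes whose degree exceeds τ is an assumed reading, consistent with the "removal of the most connected nodes" reported for τ = 2d̄ in Section 5):

```python
import numpy as np

def trim_by_degree(A, C2=2.0):
    """Return the indices kept after degree trimming, plus the average degree.

    Assumed reading of regularized spectral clustering: nodes whose degree
    exceeds tau = C2 * dbar (dbar the average degree, eq. (19)) are removed
    before the eigenvector computation.
    """
    d = A.sum(axis=1)            # degree d_u of each node
    dbar = d.mean()              # average degree, eq. (19)
    keep = np.flatnonzero(d <= C2 * dbar)
    return keep, dbar
```

Spectral clustering would then be run on the submatrix A[np.ix_(keep, keep)], which tames the spectral influence of atypically high-degree nodes in the sparse regime.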
The implications of Theorem 3 and Theorem 4 and their use as the initialization step for Algorithm 1 are discussed in full detail in Section 6.

4 Numerical results

In this section we present the performance of the proposed algorithm on simulated datasets. The experiments cover three different scenarios: (1) dense network with communities of equal sizes; (2) dense network with communities of unequal sizes; and (3) sparse network. Recall the definition of d̄ in (19). For each setting, we report results of Algorithm 1 initialized with four different approaches: USC(∞), USC(2d̄), NSC(0) and NSC(d̄), the descriptions of which can all be found in Section 2.3. For all these spectral clustering methods, Algorithm 2 was used to cluster the leading eigenvectors. The constant µ in the critical radius definition was set to 0.5 in all the results reported here. For each setting, the results are based on 100 independent draws from the underlying stochastic block model.

To achieve faster running time, we also ran a simplified version of Algorithm 1. Instead of obtaining n different initializers {σ_u}_{u ∈ [n]} to refine each node separately, the simplified algorithm refines all the nodes with a single initialization on the whole network. Thus, the running time can be reduced roughly by a factor of n. Simulation results below suggest that the simplified version achieves performance similar to that of Algorithm 1 in all the settings we have considered. For the precise description of the simplified algorithm, we refer readers to Algorithm 3 in the appendix.

Balanced case. In this setting, we generate networks with 2500 nodes and 10 communities, each of which consists of 250 nodes, and we set B_ii = 0.48 for all i and B_ij = 0.32 for all i ≠ j. Figure 1 shows the boxplots of the number of misclassified nodes.
The first four boxplots correspond to the four different spectral clustering methods, in the order USC(∞), USC(2d̄), NSC(0) and NSC(d̄). The middle four correspond to the results achieved by applying the simplified refinement scheme with these four initialization methods, and the last four show the results of Algorithm 1 with these four initialization methods. Regardless of the initialization method, Algorithm 1 or its simplified version reduces the number of misclassified nodes from around 30 to around 5.

Figure 1: Boxplots of the number of misclassified nodes: balanced case. "Simple" indicates that the simplified version of Algorithm 1 is used instead.

Imbalanced case. In this setting, we generate networks with 2000 nodes and 4 communities, the sizes of which are 200, 400, 600 and 800, respectively. The connectivity matrix is

B = [ 0.50  0.29  0.35  0.25
      0.29  0.45  0.25  0.30
      0.35  0.25  0.50  0.35
      0.25  0.30  0.35  0.45 ].

Hence, the within-community edge probability is no smaller than 0.45 while the between-community edge probability is no greater than 0.35, and the underlying SBM is inhomogeneous. Figure 2 shows the boxplots of the number of misclassified nodes obtained by the different initialization methods and their refinements; the boxplots are presented in the same order as those in Figure 1. Similarly, we can see that refinement significantly reduces the error.

Figure 2: Boxplots of the number of misclassified nodes: imbalanced case. "Simple" indicates that the simplified version of Algorithm 1 is used instead.

Sparse case. In this setting we consider a much sparser stochastic block model than in the previous two cases. In particular, each simulated network has 4000 nodes, divided into 10 communities all of size 400. We set all B_ii = 0.032 and all B_ij = 0.005 for i ≠ j. The average degree of each node in the network is around 30. Figure 3 shows the boxplots of the number of misclassified nodes obtained by the different initialization methods and their refinements; the boxplots are presented in the same order as those in Figure 1. Compared with either USC or NSC initialization, refinement reduces the number of misclassified nodes by 50%.

Figure 3: Boxplots of the number of misclassified nodes: sparse case. "Simple" indicates that the simplified version of Algorithm 1 is used instead.

Summary. In all three simulation settings, for all four initialization approaches considered, the refinement scheme in Algorithm 1 (and its simplified version) was able to significantly reduce the number of misclassified nodes, which is in agreement with the theoretical properties presented in Section 3.

5 Real data example

We now compare the results of our algorithm and some existing methods on a political blog dataset [3]. Each node in this network represents a blog about US politics, and a pair of nodes is connected if one blog contains a link to the other.
There were 1490 nodes to start with, each labeled liberal or conservative. In what follows, we consider only the 1222 nodes in the largest connected component of the network. This pre-processing step is the same as in [40]. After pre-processing, the network has 586 liberal blogs and 636 conservative ones, which naturally form two communities. As shown in the right panel of Figure 4, nodes are more likely to be connected if they share the same political ideology. Table 1 summarizes the results of Algorithm 1 and its simplified version on this network with the four different initialization methods, as well as the performance of directly applying the four methods to the dataset. The average degree of the network is $\bar d = 27$, which is used as the tuning parameter for regularized NSC. For regularized USC, we set $\tau$ equal to twice the average degree, leading to the removal of the 196 most connected nodes.

Figure 4: Connectivity of political blogs. Left panel: plot of the adjacency matrix when the nodes are not grouped. Right panel: plot of the adjacency matrix when the nodes are grouped according to political ideology.

No. of nodes misclassified:

Initialization | NA  | Algo 1 | Simple
USC($\infty$)  | 383 | 116    | 115
USC($2\bar d$) | 583 | 307    | 294
NSC($0$)       | 579 | 585    | 581
NSC($\bar d$)  | 308 | 86     | 87

Table 1: Performance on the political blog dataset. "NA" stands for direct application of the initialization method on the whole dataset; "Algo 1" stands for the application of Algorithm 1 with $\sigma^0$ given by the labeled initialization method; "Simple" stands for the application of the simplified version of Algorithm 1 with $\sigma^0$ given by the labeled initialization method.

The result of directly applying any of the four spectral clustering based initializations was unsatisfactory, with at least 30% of the nodes misclassified.
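Error counts like those in Table 1 are defined only up to a relabeling of the communities. A minimal sketch of this computation (hypothetical function name, not the authors' code), minimizing the disagreement count over all label permutations:

```python
from itertools import permutations

def misclassified(sigma_hat, sigma, k):
    """Smallest number of disagreements between two community assignments
    (labels in 0..k-1), minimized over all relabelings of sigma_hat."""
    best = len(sigma)
    for perm in permutations(range(k)):
        errors = sum(perm[s_hat] != s for s_hat, s in zip(sigma_hat, sigma))
        best = min(best, errors)
    return best
```

For $k = 2$ this reduces to taking the smaller of the disagreement count and its complement; for large $k$ one would match labels greedily or via the Hungarian algorithm rather than enumerating all $k!$ permutations.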
Despite the unsatisfactory performance of the initializers, Algorithm 1 and its simplified version are able to significantly reduce the number of misclassified nodes except in the case of NSC(0), and the performances of the two are close to each other regardless of the initialization method.

An interesting observation is that if we apply the refinement scheme multiple times, the number of misclassified nodes keeps decreasing until convergence, and the further reduction of misclassification proportion compared to a single refinement can be sizable. Figure 5 plots the numbers of misclassified nodes over multiple iterations of refinement via the simplified version of Algorithm 1. We are able to achieve 61, 58 or 63 misclassified nodes out of 1222 depending on which initialization method is used. For the three initialization methods included in the figure, the number of misclassified nodes converges within several iterations. NSC with $\tau = 0$ is not included in Figure 5 due to its relatively inferior initialization, but its error also converges to around 60/1222 after 20 iterations. For comparison, the state-of-the-art method SCORE [37] was reported to achieve a comparable error of 58/1222. It is worth noting that SCORE was designed under the degree-corrected stochastic block model, which fits the current dataset better than SBM due to the presence of hubs and low-degree nodes. The regularized spectral clustering implemented by [57], which was also designed under the degree-corrected stochastic block model, was reported to have an error of $(80 \pm 2)/1222$. The semi-definite programming method of [13] achieved 63/1222.

Figure 5: Number of misclassified nodes vs. number of refinement iterations applied.

To summarize, our algorithm leads to significant performance improvement over several popular spectral clustering based methods on the political blog dataset. With repeated refinements, it demonstrates competitive performance even when compared with methods designed for models that better fit the current dataset.

6 Discussion

In this section, we discuss a few important issues related to the methodology and theory presented in the previous sections.

6.1 Error bounds when $a \asymp b$ may not hold

In Section 3, we established upper bounds on misclassification proportion under the assumption $a \asymp b$. The following theorem shows that slightly weaker upper bounds can be obtained even when $a \asymp b$ does not hold. To state the result, recall that we assume throughout the paper that $\frac{a}{n} \le 1 - \epsilon$ for some numeric constant $\epsilon \in (0, 1)$.

Theorem 5. Suppose as $n \to \infty$, $\frac{(a-b)^2}{ak\log k} \to \infty$ and Condition 1 is satisfied for $\gamma$ satisfying (16) and $\Theta = \Theta_0(n, k, a, b, \beta)$. Then there are positive constants $c_\epsilon$ and $C_\epsilon$ depending only on $\epsilon$ such that, for any sufficiently small constant $\epsilon_0 \in (0, c_\epsilon)$, if we replace the definition of the $t_u$'s in (11) with
$$t_u = \frac{1}{2}\log\left(\frac{\widehat a_u(1 - \widehat b_u/n)}{\widehat b_u(1 - \widehat a_u/n)}\right) \wedge \log\frac{1}{\sqrt{\epsilon_0}}, \qquad (22)$$
then we have
$$\sup_{(B,\sigma)\in\Theta} \mathbb{P}_{B,\sigma}\left\{ \ell(\sigma, \widehat\sigma) \ge \exp\left(-(1 - C_\epsilon\epsilon_0)\frac{nI^*}{2}\right) \right\} \to 0, \quad \text{if } k = 2,$$
$$\sup_{(B,\sigma)\in\Theta} \mathbb{P}_{B,\sigma}\left\{ \ell(\sigma, \widehat\sigma) \ge \exp\left(-(1 - C_\epsilon\epsilon_0)\frac{nI^*}{\beta k}\right) \right\} \to 0, \quad \text{if } k \ge 3, \qquad (23)$$
where $I^*$ is defined as in (14). In particular, we can set $C_\epsilon = \frac{10^3(2-\epsilon)}{\epsilon^2\log 2}$ and $c_\epsilon = \min\left(\frac{1}{10C_\epsilon}, \frac{\epsilon}{2-\epsilon}\right)$. If in addition Condition 1 is satisfied for $\gamma$ satisfying both (16) and (18) and $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$, then the same conclusion holds for $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$.
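The truncated penalty (22) is straightforward to compute from the estimates $\widehat a_u$ and $\widehat b_u$; a minimal sketch (hypothetical function name, not the authors' code):

```python
import math

def penalty_t(a_hat, b_hat, n, eps0):
    """Truncated penalty of Eq. (22): a log-likelihood-ratio tilt,
    capped at log(1/sqrt(eps0)) so it stays bounded when a_hat >> b_hat."""
    t = 0.5 * math.log((a_hat * (1 - b_hat / n)) / (b_hat * (1 - a_hat / n)))
    return min(t, math.log(1 / math.sqrt(eps0)))
```

The cap is what allows dropping the assumption $a \asymp b$: without it, $t_u$ could diverge when $\widehat a_u / \widehat b_u$ is large.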
Compared with the conclusion (17) of Theorem 2, the vanishing sequence $\eta$ in the exponent of the upper bound is replaced by $C_\epsilon\epsilon_0$, which is guaranteed to be smaller than 0.1 and can be driven arbitrarily small by decreasing $\epsilon_0$. To achieve this, the $t_u$'s used in defining the penalty parameters in the penalized neighbor voting step need to be truncated at the value $\log\frac{1}{\sqrt{\epsilon_0}}$.

6.2 Implications of the results

We now discuss some implications of the results in Theorems 2–5. When using USC as initialization for Algorithm 1, we obtain the following result by combining Theorem 2, Theorem 3 and Theorem 5. Recall that $\bar d$ is the average degree of the nodes in $A$ defined in (19).

Theorem 6. Consider Algorithm 1 initialized by $\sigma^0$ obtained from USC($\tau$) with $\tau = C\bar d$ for some sufficiently large constant $C > 0$. If as $n \to \infty$,
$$a \asymp b \quad\text{and}\quad \frac{(a-b)^2}{ak^3\log k} \to \infty, \qquad (24)$$
then there is a sequence $\eta \to 0$ such that (17) holds with $\Theta = \Theta_0(n, k, a, b, \beta)$. If as $n \to \infty$,
$$a \asymp b \quad\text{and}\quad \frac{\lambda^2}{ak(\log k + a/(a-b))} \to \infty, \qquad (25)$$
then (17) holds for $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$. If for either parameter space $a \asymp b$ may not hold but $k$ is fixed and (24) or (25) holds respectively, then (23) holds as long as $t_u$ is replaced by (22) in Algorithm 1.

Compared with Theorem 1, the minimax optimal performance is achieved under mild conditions. Take $\Theta = \Theta_0(n, k, a, b, \beta)$ for example. For any fixed $k$, the minimax optimal misclassification proportion is achieved with high probability only under the additional condition $a \asymp b$. In addition, weak consistency is achieved for fixed $k$ as long as $\frac{(a-b)^2}{a} \to \infty$, regardless of the behavior of $\frac{a}{b}$. This condition is indeed necessary and sufficient for weak consistency. See, for instance, [51, 53, 73, 74].
To achieve strong consistency for fixed $k$, it suffices to ensure $\ell(\sigma, \widehat\sigma) < \frac{1}{n}$, and Theorem 6 implies that it is sufficient to have
$$\liminf_{n\to\infty} \frac{nI^*}{2\log n} > 1 \ \text{ when } k = 2; \qquad \liminf_{n\to\infty} \frac{nI^*}{\beta k\log n} > 1 \ \text{ when } k \ge 3, \qquad (26)$$
regardless of the behavior of $\frac{a}{b}$. On the other hand, Theorem 1 shows that it is impossible to achieve strong consistency if
$$\limsup_{n\to\infty} \frac{nI^*}{2\log n} < 1 \ \text{ when } k = 2; \qquad \limsup_{n\to\infty} \frac{nI^*}{\beta k\log n} < 1 \ \text{ when } k \ge 3. \qquad (27)$$
When $\frac{a}{n} = o(1)$, $nI^* = (1 + o(1))(\sqrt a - \sqrt b)^2$, and so one can replace $nI^*$ in (26)–(27) with $(\sqrt a - \sqrt b)^2$. In the literature, Abbe et al. [2], Mossel et al. [54] and Hajek et al. [31] obtained comparable strong consistency results via efficient algorithms for the special case of two communities of equal sizes, i.e., $k = 2$ and $\beta = 1$. Under the additional assumption $a \asymp b \asymp \log n$, Hajek et al. [32] later achieved the result via an efficient algorithm for the case of fixed $k$ and $\beta = 1$, and Abbe and Sandon [1] investigated the case of fixed $k$ and $\beta \ge 1$. In comparison, our result holds for any fixed $k$ and any $\beta \ge 1$ without assuming $a \asymp b \asymp \log n$. In the weak consistency regime, in terms of misclassification proportion, for the special case of $k = 2$ and $\beta = 1$, Yun and Proutiere [72] achieved the optimal rate for $\Theta_0(n, 2, a, b, 1)$ when $a \asymp b \asymp a - b$, while the error bounds in other papers are typically off by a constant multiplier on the exponent. In comparison, Theorem 6 provides optimal results (17) and near optimal results (23) for a much broader class of models under much weaker conditions. Last but not least, our algorithm can provably achieve strong consistency and minimax optimal performance even for growing $k$, which, to our limited knowledge, is the first such result in the literature.

The performance of Algorithm 1 initialized by NSC is summarized in the following theorem, obtained by combining Theorem 2, Theorem 4 and Theorem 5.
In this case, the sufficient condition for achieving minimax optimal performance is slightly stronger than when USC is used for initialization.

Theorem 7. Consider Algorithm 1 initialized by $\sigma^0$ obtained from NSC($\tau$) with $\tau = C\bar d$ for some sufficiently large constant $C > 0$. If as $n \to \infty$,
$$a \asymp b \quad\text{and}\quad \frac{(a-b)^2}{ak^3\log k\log a} \to \infty, \qquad (28)$$
then there is a sequence $\eta \to 0$ such that (17) holds with $\Theta = \Theta_0(n, k, a, b, \beta)$. If as $n \to \infty$,
$$a \asymp b \quad\text{and}\quad \frac{\lambda^2}{ak\log a\,(\log k + a/(a-b))} \to \infty, \qquad (29)$$
then (17) holds for $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$. If for either parameter space $a \asymp b$ may not hold but $k$ is fixed and (28) or (29) holds respectively, then (23) holds as long as $t_u$ is replaced by (22) in Algorithm 1.

Last but not least, we would like to point out that when the key parameters $a$ and $b$ are known, we can obtain the desired performance guarantee under weaker conditions, as summarized in the following theorem.

Theorem 8 (The case of known $a, b$). Suppose $a, b$ are known. Consider Algorithm 1 initialized by $\sigma^0$ obtained from USC($\tau$) with $\tau = Ca$ for some sufficiently large constant $C > 0$, and with $\widehat a_u = a$, $\widehat b_u = b$ in (9) for all $u \in [n]$. If as $n \to \infty$,
$$a \asymp b \quad\text{and}\quad \frac{(a-b)^2}{ak^3} \to \infty, \qquad (30)$$
then there is a sequence $\eta \to 0$ such that (17) holds with $\Theta = \Theta_0(n, k, a, b, \beta)$. If as $n \to \infty$,
$$a \asymp b \quad\text{and}\quad \frac{\lambda^2}{ak} \to \infty, \qquad (31)$$
then (17) holds with $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$. If for either parameter space, without assuming $a \asymp b$, (30) or (31) holds respectively, then (23) holds if in addition $t_u$ is replaced by (22). If instead NSC($\tau$) is used for initialization with $\tau = Ca$ for some sufficiently large constant $C > 0$, then the above conclusions hold if we replace (30) with $\frac{(a-b)^2}{ak^3\log a} \to \infty$ and (31) with $\frac{\lambda^2}{ak\log a} \to \infty$, respectively.
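Since $nI^* = (1 + o(1))(\sqrt a - \sqrt b)^2$ when $a/n = o(1)$, the strong consistency boundary (26)–(27) can be examined numerically. A rough sketch, ignoring the $o(1)$ terms (hypothetical function name, not the authors' code):

```python
import math

def strong_consistency_margin(a, b, n, k, beta=1.0):
    """Approximate ratio of nI* to the threshold in (26)-(27),
    using nI* ~ (sqrt(a) - sqrt(b))**2, valid when a/n = o(1).
    A ratio above 1 suggests strong consistency is achievable;
    below 1, that it is impossible (up to the ignored o(1) terms)."""
    nI_star = (math.sqrt(a) - math.sqrt(b)) ** 2
    threshold = 2.0 if k == 2 else beta * k
    return nI_star / (threshold * math.log(n))
```

This is only a heuristic check of the asymptotic conditions at finite $n$, not a guarantee for any particular network.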
6.3 Potential future research problems

Simplified version of Algorithm 1 and iterative refinement. In the simulation studies, we experimented with a simplified version of Algorithm 1 (described precisely as Algorithm 3 in the appendix) and showed that it provides performance similar to Algorithm 1 on simulated datasets. Moreover, for the political blog data, we showed that iterative application of this simplified refinement scheme kept driving down the number of misclassified nodes until convergence. It is of great interest to see whether theoretical results comparable to Theorem 2 can be established for the simplified and/or the iterative version, and whether the iterative version converges to a local optimum of some objective function for community detection. Though answering these intriguing questions is beyond the scope of the current paper, we think they can serve as interesting future research problems.

Data-driven choice of k. The knowledge of $k$ is assumed and used in both the methodology and the theory of the present paper. Data-driven choice of $k$ is of both practical importance and contemporary research interest, and researchers have proposed various ways to achieve this goal for the stochastic block model, including cross-validation [17], a Tracy–Widom test [44], an information criterion [60], a likelihood ratio test [68], etc. Whether these methods are optimal and whether it is possible to select $k$ in a statistically optimal way remain important open problems.

More general models. The results in this paper cover a large range of parameter spaces for stochastic block models, and we have shown the competitive performance of the proposed algorithm both in theory and on numerical examples. Despite its popularity, the stochastic block model has its own limits for modeling network data.
Therefore, an important future research direction is to design computationally feasible algorithms that achieve statistically optimal performance for more general network models, such as degree-corrected stochastic block models.

7 Proofs of main results

The main result of the paper, Theorem 2, is proved in Section 7.1. Theorem 3 and Theorem 4 are proved in Section 7.2 and Section 7.3, respectively. The proofs of the remaining results, together with some auxiliary lemmas, are given in the appendix.

7.1 Proof of Theorem 2

We first state a lemma that guarantees the accuracy of parameter estimation in Algorithm 1.

Lemma 1. Let $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$. Suppose as $n \to \infty$, $\frac{(a-b)^2}{ak} \to \infty$ and Condition 1 holds with $\gamma$ satisfying (16) and (18). Then there is a sequence $\eta \to 0$ as $n \to \infty$ and a constant $C > 0$ such that
$$\min_{u\in[n]} \inf_{(B,\sigma)\in\Theta} \mathbb{P}\left\{ \min_{\pi\in S_k} \max_{i,j\in[k]} \big|\widehat B^u_{ij} - B_{\pi(i)\pi(j)}\big| \le \eta\,\frac{a-b}{n} \right\} \ge 1 - Cn^{-(1+\delta)}. \qquad (32)$$
For $\Theta = \Theta_0(n, k, a, b, \beta)$, the conclusion (32) continues to hold even when assumption (18) is dropped.

Proof. 1° Let $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$. For any community assignments $\sigma_1$ and $\sigma_2$, define
$$\ell_0(\sigma_1, \sigma_2) = \frac{1}{n}\sum_{u=1}^n \mathbb{1}\{\sigma_1(u) \neq \sigma_2(u)\}. \qquad (33)$$
Fix any $(B, \sigma) \in \Theta$ and $u \in [n]$. Define the event
$$E_u = \left\{ \ell_0(\pi_u(\sigma), \sigma^0_u) \le \gamma \right\}. \qquad (34)$$
To simplify notation, assume that $\pi_u = \mathrm{Id}$ is the identity permutation. Fix any $i \in [k]$. On $E_u$,
$$n_i \ge |\widetilde C^u_i \cap C_i| \ge n_i - \gamma_1 n, \quad |\widetilde C^u_i \cap C_i^c| \le \gamma_2 n, \quad \text{where } \gamma_1, \gamma_2 \ge 0 \text{ and } \gamma_1 + \gamma_2 \le \gamma. \qquad (35)$$
Let $C'_i$ be any deterministic subset of $[n]$ such that (35) holds with $\widetilde C^u_i$ replaced by $C'_i$. By definition, there are at most
$$\sum_{l=0}^{\gamma n} \binom{n_i}{l} \sum_{m=0}^{\gamma n} \binom{n-n_i}{m} \le (\gamma n + 1)^2\left(\frac{en_i}{\gamma n}\right)^{\gamma n}\left(\frac{en}{\gamma n}\right)^{\gamma n} \le \exp\left(2\log(\gamma n + 1) + 2\gamma n\log\frac{e}{\gamma}\right) \le \exp\left(C_1\gamma n\log\frac{1}{\gamma}\right)$$
different subsets with this property, where $C_1 > 0$ is an absolute constant.
Let $E'_i$ be the set of edges within $C'_i$. Then $|E'_i|$ is a sum of independent Bernoulli random variables, of which at least a $(1-\beta\gamma k)^2$ proportion follow the $\mathrm{Bern}(B_{ii})$ distribution, at most a $(\beta\gamma k)^2$ proportion are stochastically smaller than $\mathrm{Bern}(\frac{\alpha a}{n})$ and stochastically larger than $\mathrm{Bern}(\frac{a}{n})$, and at most a $2\beta\gamma k$ proportion are stochastically smaller than $\mathrm{Bern}(\frac{b}{n})$. Therefore, we obtain that
$$(1-\beta\gamma k)^2 B_{ii} + (\beta\gamma k)^2\frac{a}{n} \le \mathbb{E}\left[\frac{|E'_i|}{\frac12|C'_i|(|C'_i|-1)}\right] \le \max_{t\in[0,\beta\gamma k]}\left\{ (1-t)^2 B_{ii} + t^2\frac{\alpha a}{n} + 2t\frac{b}{n} \right\}. \qquad (36)$$
Note that the LHS is $(1-(2+o(1))\beta\gamma k)B_{ii}$. On the other hand, under condition (18), the RHS is attained at $t = 0$ and equals $B_{ii}$ exactly. Thus, we conclude that
$$\left| \mathbb{E}\left[\frac{|E'_i|}{\frac12|C'_i|(|C'_i|-1)}\right] - B_{ii} \right| \le C\beta\gamma k\,\frac{\alpha a}{n} = \eta'\,\frac{a-b}{n} \qquad (37)$$
for some $\eta' \to 0$ that depends only on $a, k, \alpha, \beta$ and $\gamma$, where the last step is due to (18). On the other hand, by Bernstein's inequality, for any $t > 0$,
$$\mathbb{P}\left\{ \big||E'_i| - \mathbb{E}|E'_i|\big| > t \right\} \le 2\exp\left( -\frac{t^2}{2\left(\frac12(n_i+\gamma n)^2\frac{\alpha a}{n} + \frac23 t\right)} \right).$$
Let
$$t^2 = (n_i+\gamma n)^2\frac{\alpha a}{n}\left(C_1\gamma n\log\gamma^{-1} + (3+\delta)\log n\right) \vee \left(2C_1\gamma n\log\gamma^{-1} + 2(3+\delta)\log n\right)^2 \lesssim \left(\frac{n}{k}\sqrt{a\gamma\log\gamma^{-1}} + \gamma n\log\gamma^{-1}\right)^2,$$
where the last inequality holds since $\frac{\log x}{x}$ is monotonically decreasing as $x$ increases, so that $\gamma\log\gamma^{-1} \ge \frac1n\log n$ for any $\gamma \ge \frac1n$; this is the case of most interest, since $\gamma < \frac1n$ forces $\gamma = 0$, in which case the initialization is already perfect. Even when $\gamma = 0$, the following arguments continue to hold if every $\gamma$ is replaced with $\frac1n$. Thus, we obtain that for a positive constant $C_{\alpha,\beta,\delta}$ depending only on $\alpha, \beta$ and $\delta$,
$$\mathbb{P}\left\{ \big||E'_i| - \mathbb{E}|E'_i|\big| > C_{\alpha,\beta,\delta}\left(\frac{n}{k}\sqrt{a\gamma\log\gamma^{-1}} + \gamma n\log\gamma^{-1}\right) \right\} \le \exp\left(-C_1\gamma n\log\gamma^{-1}\right) n^{-(3+\delta)}. \qquad (38)$$
Thus, with probability at least $1 - \exp\left(-C_1\gamma n\log\gamma^{-1}\right)n^{-(3+\delta)}$,
$$\left| \frac{|E'_i|}{\frac12|C'_i|(|C'_i|-1)} - \frac{\mathbb{E}|E'_i|}{\frac12|C'_i|(|C'_i|-1)} \right| \le C_{\alpha,\beta,\delta}\left(\frac{k}{n}\sqrt{a\gamma\log\gamma^{-1}} + \frac{k^2\gamma\log\gamma^{-1}}{n}\right) = \eta'\,\frac{a-b}{n}, \qquad (39)$$
where $\eta' \to 0$ depends only on $a, k, \alpha, \beta, \gamma$ and $\delta$. Here, the last equality holds since $k\sqrt{a\gamma\log\gamma^{-1}} = \sqrt{ak}\sqrt{k\gamma\log\gamma^{-1}}$, where $\sqrt{ak} \ll a-b$ because $\frac{(a-b)^2}{ak} \to \infty$ and $k\gamma\log\gamma^{-1} = O(1)$, and $k^2\gamma\log\gamma^{-1} = k\gamma\log\gamma^{-1}\cdot k \lesssim k \ll \frac{(a-b)^2}{a} \lesssim a-b$. We combine (37) and (39) and apply the union bound to obtain that for a sequence $\eta \to 0$ depending only on $a, k, \alpha, \beta, \gamma$ and $\delta$, with probability at least $1 - n^{-(3+\delta)}$,
$$\left| \frac{|\widetilde E^u_i|}{\frac12|\widetilde C^u_i|(|\widetilde C^u_i|-1)} - B_{ii} \right| \le \eta\,\frac{a-b}{n}. \qquad (40)$$
The proof for the estimation of $B_{ij}$ is analogous and hence omitted. A final union bound over $i, j \in [k]$ leads to the desired claim, since all the constants and vanishing sequences in the above analysis depend only on $a, b, k, \alpha, \beta, \gamma$ and $\delta$, but not on $u$, $B$ or $\sigma$.

2° If $\Theta = \Theta_0(n, k, a, b, \beta)$, then condition (18) on $\gamma$ is no longer needed. This is because (36) can be replaced by
$$\min_{t\in[0,\beta\gamma k]}\left\{ (1-t)^2\frac{a}{n} + 2t(1-t)\frac{b}{n} + t^2\frac{b}{n} \right\} \le \mathbb{E}\left[\frac{|E'_i|}{\frac12|C'_i|(|C'_i|-1)}\right] \le \max_{t\in[0,\beta\gamma k]}\left\{ (1-t)^2\frac{a}{n} + t^2\frac{a}{n} + 2t(1-t)\frac{b}{n} \right\},$$
where the LHS equals $\frac{a}{n} - (1+o(1))\,2\beta\gamma k\,\frac{a-b}{n} = \frac{a}{n} + o\left(\frac{a-b}{n}\right)$ and the RHS equals $\frac{a}{n}$. Thus, no additional condition is needed to guarantee (37) in the foregoing arguments. This completes the proof.

The next two lemmas establish the desired error bound for the node-wise refinement.

Lemma 2. Let $\Theta_0$ be defined as in (2) and $k \ge 2$. Suppose as $n \to \infty$, $\frac{(a-b)^2}{ak} \to \infty$ and $a \asymp b$.
Suppose further that there exist two sequences $\gamma = o(1/k)$ and $\eta = o(1)$, constants $C, \delta > 0$ and permutations $\{\pi_u\}_{u=1}^n \subset S_k$ such that
$$\inf_{(B,\sigma)\in\Theta_0} \min_{u\in[n]} \mathbb{P}\left\{ \ell_0(\pi_u(\sigma), \sigma^0_u) \le \gamma,\ |\widehat a_u - a| \le \eta(a-b),\ |\widehat b_u - b| \le \eta(a-b) \right\} \ge 1 - Cn^{-(1+\delta)}. \qquad (41)$$
Then for $\widehat\sigma_u(u)$ defined as in (10) with $\rho = \rho_u$ in (12), there is a sequence $\eta' = o(1)$ such that for $k = 2$,
$$\sup_{(B,\sigma)\in\Theta_0} \max_{u\in[n]} \mathbb{P}\{\widehat\sigma_u(u) \neq \pi_u(\sigma(u))\} \le (k-1)\exp\left(-(1-\eta')\frac{nI^*}{2}\right) + Cn^{-(1+\delta)},$$
and for $k \ge 3$,
$$\sup_{(B,\sigma)\in\Theta_0} \max_{u\in[n]} \mathbb{P}\{\widehat\sigma_u(u) \neq \pi_u(\sigma(u))\} \le (k-1)\exp\left(-(1-\eta')\frac{nI^*}{\beta k}\right) + Cn^{-(1+\delta)}.$$

Proof. In what follows, let $E_u$ denote the event in (41). For the sake of brevity, we let $p = a/n$, $q = b/n$, $\widehat p_u = \widehat a_u/n$ and $\widehat q_u = \widehat b_u/n$. Moreover, let $\sigma_u = \pi_u(\sigma)$, $n_i = |\{v : \sigma_u(v) = i\}|$, $m_i = |\{v : \sigma^0_u(v) = i\}|$ and $m'_i = |\{v : \sigma^0_u(v) = \sigma_u(v) = i\}|$. Without loss of generality, let $\sigma_u(u) = 1$. Then we have
$$\mathbb{P}\{\widehat\sigma_u(u) \neq 1 \text{ and } E_u\} \le \sum_{l\neq 1} \mathbb{P}\left\{ E_u \text{ and } \sum_{\sigma_u(v)=l} A_{uv} - \sum_{\sigma_u(v)=1} A_{uv} \ge \rho_u(m_l - m_1) \right\} = \sum_{l\neq 1} p_l. \qquad (42)$$
Now we bound each $p_l$. By the independence structure and the Chernoff bound, we have
$$p_l \le \mathbb{E}\left\{ e^{-t_u\rho_u(m_l-m_1)}\,(qe^{t_u}+1-q)^{m'_l}(pe^{t_u}+1-p)^{m_l-m'_l}(pe^{-t_u}+1-p)^{m'_1}(qe^{-t_u}+1-q)^{m_1-m'_1}\,\mathbb{1}\{E_u\} \right\} \qquad (43)$$
$$\le \mathbb{E}\left\{ e^{-t_u\rho_u(m_l-m_1)}\,(qe^{t_u}+1-q)^{m_l}(pe^{-t_u}+1-p)^{m_1}\,\mathbb{1}\{E_u\} \right\} \qquad (44)$$
$$\times\ \mathbb{E}\left\{ \left(\frac{pe^{t_u}+1-p}{qe^{t_u}+1-q}\right)^{m_l-m'_l}\left(\frac{qe^{-t_u}+1-q}{pe^{-t_u}+1-p}\right)^{m_1-m'_1}\,\mathbb{1}\{E_u\} \right\}. \qquad (45)$$
We are going to bound the terms in (44) and (45) respectively. Before doing that, we need some preparatory inequalities. Define $t^*$ through the equation
$$e^{t^*} = \sqrt{\frac{p(1-q)}{q(1-p)}}.$$
Then, on the event $E_u$,
$$e^{t_u-t^*} + e^{t^*-t_u} \le \exp(C_1\eta), \qquad (46)$$
for some constant $C_1 > 0$.
Moreover,
$$|e^{t_u} - 1| \vee |e^{-t_u} - 1| \le C_2\,\frac{p-q}{p} = C_2\,\frac{a-b}{a}, \qquad (47)$$
for some constant $C_2 > 0$. Therefore, for the term in (44), on the event $E_u$,
$$e^{-t_u\rho_u(m_l-m_1)}(qe^{t_u}+1-q)^{m_l}(pe^{-t_u}+1-p)^{m_1} = e^{-t_u\rho_u(m_l-m_1)}(qe^{t_u}+1-q)^{(m_l-m_1)/2}(pe^{-t_u}+1-p)^{(m_1-m_l)/2} \qquad (48)$$
$$\times\ (qe^{t_u}+1-q)^{(m_1+m_l)/2}(pe^{-t_u}+1-p)^{(m_1+m_l)/2}. \qquad (49)$$
By (46), the term in (49) is upper bounded by
$$\left( pq + (1-p)(1-q) + \sqrt{pq}\sqrt{(1-p)(1-q)}\,\big(e^{t_u-t^*} + e^{t^*-t_u}\big) \right)^{\frac{m_1+m_l}{2}} \le \exp\left(-(1+o(1))\frac{m_1+m_l}{2}I^*\right) \le \exp\left(-(1+o(1))\frac{n_1+n_l}{2}I^*\right).$$
By (47), the term in (48) is upper bounded by
$$\exp\left( \frac{m_1-m_l}{2}\left( \log\frac{pe^{-t_u}+1-p}{qe^{t_u}+1-q} - \log\frac{\widehat p_u e^{-t_u}+1-\widehat p_u}{\widehat q_u e^{t_u}+1-\widehat q_u} \right) \right) \le \exp\left( \frac{|m_1-m_l|}{2}\left( |e^{-t_u}-1||\widehat p_u - p| + |e^{t_u}-1||\widehat q_u - q| \right) \right) \le \exp\left( o\left(\frac{n}{k}\frac{(p-q)^2}{p}\right) \right) = \exp\left( o(1)\frac{n_1+n_l}{2}I^* \right).$$
Therefore, we can upper bound (44) as
$$\mathbb{E}\left\{ e^{-t_u\rho_u(m_l-m_1)}(qe^{t_u}+1-q)^{m_l}(pe^{-t_u}+1-p)^{m_1}\mathbb{1}\{E_u\} \right\} \le \exp\left(-(1+o(1))\frac{n_1+n_l}{2}I^*\right). \qquad (50)$$
Now we provide an upper bound for (45). By (47), on $E_u$,
$$\frac{pe^{t_u}+1-p}{qe^{t_u}+1-q} = 1 + \frac{(p-q)(e^{t_u}-1)}{qe^{t_u}+1-q} \le 1 + O\left(\frac{(p-q)^2}{p}\right) \le \exp\left(O\left(\frac{(p-q)^2}{p}\right)\right),$$
and
$$\frac{qe^{-t_u}+1-q}{pe^{-t_u}+1-p} = 1 + \frac{(p-q)(1-e^{-t_u})}{pe^{-t_u}+1-p} \le 1 + O\left(\frac{(p-q)^2}{p}\right) \le \exp\left(O\left(\frac{(p-q)^2}{p}\right)\right).$$
Therefore,
$$\mathbb{E}\left\{ \left(\frac{pe^{t_u}+1-p}{qe^{t_u}+1-q}\right)^{m_l-m'_l}\left(\frac{qe^{-t_u}+1-q}{pe^{-t_u}+1-p}\right)^{m_1-m'_1}\mathbb{1}\{E_u\} \right\} \le \exp\left(o(1)\frac{n_1+n_l}{2}I^*\right). \qquad (51)$$
Combining (50) and (51), we have
$$p_l \le \exp\left(-(1+o(1))\frac{n_1+n_l}{2}I^*\right). \qquad (52)$$
Using (42), this implies
$$\mathbb{P}\{\widehat\sigma_u(u) \neq 1 \text{ and } E_u\} \le (k-1)\exp\left(-(1+o(1))\min_{l\neq 1}\left(\frac{n_1+n_l}{2}\right)I^*\right),$$
and so
$$\mathbb{P}\{\widehat\sigma_u(u) \neq 1\} \le (k-1)\exp\left(-(1+o(1))\min_{l\neq 1}\left(\frac{n_1+n_l}{2}\right)I^*\right) + Cn^{-(1+\delta)}.$$
When $k = 2$, $\min_{l\neq 1}\left(\frac{n_1+n_l}{2}\right) = \frac{n}{2}$, and when $k \ge 3$, $\min_{l\neq 1}\left(\frac{n_1+n_l}{2}\right) \ge \frac{n}{\beta k}$. Thus, the proof is complete.

Lemma 3. Let $\Theta$ be defined as in (3) and $k \ge 2$. Suppose as $n \to \infty$, $\frac{(a-b)^2}{ak} \to \infty$ and $a \asymp b$, and that there exist two sequences $\gamma = o\left(\frac{a-b}{ak}\right)$ and $\eta = o(1)$, constants $C, \delta > 0$ and permutations $\{\pi_u\}_{u=1}^n \subset S_k$ such that (41) holds. Then for $\widehat\sigma_u(u)$ defined as in (10) with $\rho = \rho_u$ in (12), the conclusions of Lemma 2 continue to hold.

Proof. The proof is similar to that of Lemma 2, and we use the same notation as there. First, we bound $p_l$ defined in (42). Let $X_j \sim \mathrm{Bern}(q)$, $Y_j \sim \mathrm{Bern}(p)$ and $Z_j \sim \mathrm{Bern}(\alpha p)$, $j \ge 1$, be mutually independent. Then a stochastic ordering argument gives
$$p_l \le \mathbb{E}\left[ \mathbb{P}\left\{ \sum_{j=1}^{m'_l} X_j + \sum_{j=1}^{m_l-m'_l} Z_j - \sum_{j=1}^{m'_1} Y_j \ge \rho(m_l-m_1) \text{ and } E_u \,\Big|\, A_{-u} \right\} \right]$$
$$\le \mathbb{E}\left\{ e^{-t_u\rho_u(m_l-m_1)}(qe^{t_u}+1-q)^{m_l}(pe^{-t_u}+1-p)^{m_1}\mathbb{1}\{E_u\} \right\} \qquad (53)$$
$$\times\ \mathbb{E}\left\{ \left(\frac{1}{qe^{t_u}+1-q}\right)^{m_l-m'_l}\left(\frac{1}{pe^{-t_u}+1-p}\right)^{m_1-m'_1}(\alpha p e^{t_u}+1-\alpha p)^{m_l-m'_l}\mathbb{1}\{E_u\} \right\}. \qquad (54)$$
Note that the term in (53) is the same as that in (44), and thus it can be upper bounded by (50) as before. To bound (54), observe that by (47),
$$\frac{1}{qe^{t_u}+1-q} \le \exp\left(q|e^{t_u}-1|\right) \le \exp(O(p-q)), \qquad \frac{1}{pe^{-t_u}+1-p} \le \exp\left(Cp|e^{-t_u}-1|\right) \le \exp(O(p-q)),$$
and
$$\alpha p e^{t_u}+1-\alpha p \le \exp\left(\alpha p|e^{t_u}-1|\right) \le \exp(O(p-q)).$$
Thus, under the assumption $\gamma = o\left(\frac{p-q}{kp}\right)$, the term in (54) is bounded by $\exp\left(o(1)\frac{n_1+n_l}{2}I^*\right)$. The remaining proof is the same as that of Lemma 2.
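Before stating the lemma that justifies the consensus step, note that the maximal-overlap alignment map $\xi$ in (55) below is simple to compute; a minimal sketch (hypothetical function name, not the authors' code):

```python
from collections import Counter

def consensus_map(sigma, sigma_prime, k):
    """Maximal-overlap map xi of Eq. (55): xi(i) is the label under sigma
    that is most common among the nodes sigma_prime assigns to i.
    Assumes every label in 0..k-1 appears in sigma_prime."""
    overlap = [Counter() for _ in range(k)]
    for s, sp in zip(sigma, sigma_prime):
        overlap[sp][s] += 1
    return [overlap[i].most_common(1)[0][0] for i in range(k)]
```

When the two assignments agree on most nodes and every community is large, Lemma 4 below guarantees that $\xi$ is a permutation achieving the minimal disagreement over all relabelings.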
Finally, we need a lemma to justify the consensus step in Algorithm 1.

Lemma 4. Let $\sigma, \sigma' : [n] \to [k]$ be any community assignments such that for some constant $C \ge 1$,
$$\min_{l\in[k]} |\{u : \sigma(u) = l\}|,\ \min_{l\in[k]} |\{u : \sigma'(u) = l\}| \ge \frac{n}{Ck}, \quad\text{and}\quad \min_{\pi\in S_k} \ell_0(\sigma, \pi(\sigma')) < \frac{1}{Ck}.$$
Define the map $\xi : [k] \to [k]$ as
$$\xi(i) = \operatorname*{argmax}_{l} \big|\{u : \sigma(u) = l\} \cap \{u : \sigma'(u) = i\}\big|, \quad \forall i \in [k]. \qquad (55)$$
Then $\xi \in S_k$ and $\ell_0(\sigma, \xi(\sigma')) = \min_{\pi\in S_k} \ell_0(\sigma, \pi(\sigma'))$.

Proof. By the definition in (55), we obtain
$$\xi = \operatorname*{argmin}_{\xi':[k]\to[k]} \ell_0(\sigma, \xi'(\sigma')), \quad\text{and}\quad \ell_0(\sigma, \xi(\sigma')) \le \min_{\pi\in S_k} \ell_0(\sigma, \pi(\sigma')) < \frac{1}{Ck}.$$
Thus, what remains to be shown is that $\xi \in S_k$, i.e., $\xi(l_1) \neq \xi(l_2)$ for any $l_1 \neq l_2$. To this end, note that if $\xi(l_1) = \xi(l_2)$ for some $l_1 \neq l_2$, then there would exist some $l_0 \in [k]$ such that $\xi(l) \neq l_0$ for all $l \in [k]$, and so
$$\ell_0(\sigma, \xi(\sigma')) \ge \frac{1}{n}\sum_{u : \sigma(u)=l_0} \mathbb{1}\{\sigma(u) \neq \xi(\sigma'(u))\} = \frac{|\{u : \sigma(u) = l_0\}|}{n} \ge \frac{1}{Ck}.$$
This contradicts the second to last display, and hence $\xi \in S_k$. This completes the proof.

Proof of Theorem 2. Let $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$, and fix any $(B, \sigma) \in \Theta$. For any $u \in [n]$, by Condition 1 and the fact that $\sigma^0_u$ and $\widehat\sigma_u$ differ only in the community assignment of $u$, for $\gamma' = \gamma + 1/n$ there exists some $\pi_u \in S_k$ such that
$$\mathbb{P}\left\{ \ell_0(\sigma, \pi_u^{-1}(\widehat\sigma_u)) \le \gamma' \right\} \ge 1 - C_0 n^{-(1+\delta)}. \qquad (56)$$
Without loss of generality, we assume $\pi_1 = \mathrm{Id}$ is the identity map. Now for any fixed $u \in \{2, \ldots, n\}$, define the map $\xi_u : [k] \to [k]$ as in (55) with $\sigma$ and $\sigma'$ replaced by $\widehat\sigma_1$ and $\widehat\sigma_u$. Then by definition
$$\widehat\sigma(u) = \xi_u(\widehat\sigma_u(u)). \qquad (57)$$
In addition, (56) implies that with probability at least $1 - Cn^{-(1+\delta)}$, we have $\ell_0(\sigma, \widehat\sigma_1) \le \gamma'$ and $\ell_0(\sigma, \pi_u^{-1}(\widehat\sigma_u)) \le \gamma'$.
So the triangle inequality implies $\ell_0(\widehat\sigma_1, \pi_u^{-1}(\widehat\sigma_u)) \le 2\gamma'$, and hence the condition of Lemma 4 is satisfied. Thus, Lemma 4 implies
$$\mathbb{P}\left\{ \xi_u = \pi_u^{-1} \right\} \ge 1 - Cn^{-(1+\delta)}. \qquad (58)$$
When $k \ge 3$, Lemma 1, (16) and (18) imply that the condition of Lemma 3 is satisfied, which in turn implies that for a sequence $\eta' = o(1)$,
$$\mathbb{P}\{\widehat\sigma(u) \neq \sigma(u)\} = \mathbb{P}\{\xi_u(\widehat\sigma_u(u)) \neq \sigma(u)\} \le \mathbb{P}\left\{ \xi_u(\widehat\sigma_u(u)) \neq \sigma(u),\ \xi_u = \pi_u^{-1} \right\} + \mathbb{P}\left\{ \xi_u \neq \pi_u^{-1} \right\}$$
$$\le \mathbb{P}\{\widehat\sigma_u(u) \neq \pi_u(\sigma(u))\} + \mathbb{P}\left\{ \xi_u \neq \pi_u^{-1} \right\} \le Cn^{-(1+\delta)} + (k-1)\exp\left(-(1-\eta')\frac{nI^*}{\beta k}\right).$$
Set
$$\eta = \eta' + \beta\sqrt{\frac{k}{nI^*}} = o(1), \qquad (59)$$
where the last equality holds since $\frac{nI^*}{k} \gtrsim \frac{(a-b)^2}{ak} \to \infty$. Thus, Markov's inequality leads to
$$\mathbb{P}\left\{ \ell_0(\sigma, \widehat\sigma) > (k-1)\exp\left(-(1-\eta)\frac{nI^*}{\beta k}\right) \right\} \le \frac{1}{(k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\}}\cdot\frac{1}{n}\sum_{u=1}^n \mathbb{P}\{\widehat\sigma(u) \neq \sigma(u)\}$$
$$\le \exp\left(-(\eta-\eta')\frac{nI^*}{\beta k}\right) + \frac{Cn^{-(1+\delta)}}{(k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\}} \le \exp\left(-\sqrt{\frac{nI^*}{k}}\right) + \frac{Cn^{-(1+\delta)}}{(k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\}}.$$
If $(k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\} \ge n^{-(1+\delta/2)}$, then
$$\mathbb{P}\left\{ \ell_0(\sigma, \widehat\sigma) > (k-1)\exp\left(-(1-\eta)\frac{nI^*}{\beta k}\right) \right\} \le \exp\left(-\sqrt{\frac{nI^*}{k}}\right) + Cn^{-\delta/2} = o(1).$$
If $(k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\} < n^{-(1+\delta/2)}$, then
$$\mathbb{P}\left\{ \ell_0(\sigma, \widehat\sigma) > (k-1)\exp\left(-(1-\eta)\frac{nI^*}{\beta k}\right) \right\} = \mathbb{P}\{\ell_0(\sigma, \widehat\sigma) > 0\} \le \sum_{u=1}^n \mathbb{P}\{\widehat\sigma(u) \neq \sigma(u)\} \le n(k-1)\exp\left(-(1-\eta)\frac{nI^*}{\beta k}\right) + Cn^{-\delta} \le Cn^{-\delta/2} = o(1).$$
Here, the second to last inequality holds since $\eta > \eta'$, and so $(k-1)\exp\{-(1-\eta')nI^*/(\beta k)\} < (k-1)\exp\{-(1-\eta)nI^*/(\beta k)\} < n^{-(1+\delta/2)}$. We complete the proof for the case of $\Theta(n, k, a, b, \lambda, \beta; \alpha)$ and $k \ge 3$ by noting that $(k-1)\exp\left\{-(1-\eta)\frac{nI^*}{\beta k}\right\} = \exp\left\{-(1-\eta'')\frac{nI^*}{\beta k}\right\}$ for another sequence $\eta'' = o(1)$ under the assumption $\frac{(a-b)^2}{ak\log k} \to \infty$, and that no constant or sequence in the foregoing arguments involves $B$, $\sigma$ or $u$.
When $\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)$ and $k = 2$, the foregoing arguments continue to hold with $\beta$ and $k$ replaced by 1 and 2, respectively. When $\Theta = \Theta_0(n, k, a, b, \beta)$, we can run the foregoing arguments with Lemma 3 replaced by Lemma 2 to reach the conclusion in (17), which does not require condition (18). This completes the proof.

7.2 Proof of Theorem 3

The following lemma is critical for establishing the result of Theorem 3. Its proof is given in the appendix. Let us introduce the notation $O(k_1, k_2) = \{V \in \mathbb{R}^{k_1\times k_2} : V^T V = I_{k_2}\}$ for $k_1 \ge k_2$.

Lemma 5. Consider a symmetric adjacency matrix $A \in \{0,1\}^{n\times n}$ and a symmetric matrix $P \in [0,1]^{n\times n}$ satisfying $A_{uu} = 0$ for all $u \in [n]$ and $A_{uv} \sim \mathrm{Bernoulli}(P_{uv})$ independently for all $u > v$. For any $C_0 > 0$, there exists some $C > 0$ such that
$$\|T_\tau(A) - P\|_{\mathrm{op}} \le C\sqrt{np_{\max} + 1},$$
with probability at least $1 - n^{-C_0}$, uniformly over $\tau \in [C_1(np_{\max}+1), C_2(np_{\max}+1)]$ for some sufficiently large constants $C_1, C_2$, where $p_{\max} = \max_{u\ge v} P_{uv}$.

Lemma 6. For $P = (P_{uv}) = (B_{\sigma(u)\sigma(v)})$, we have the decomposition $P = U\Lambda U^T$, where $U = Z\Delta^{-1}W$, with $\Delta = \mathrm{diag}(\sqrt{n_1}, \ldots, \sqrt{n_k})$, $Z \in \{0,1\}^{n\times k}$ the matrix whose only nonzero entry in row $i$ is a 1 in position $(i, \sigma(i))$, and $W \in O(k, k)$.

Proof. Note that $P = ZBZ^T = Z\Delta^{-1}\,\Delta B\Delta\,(Z\Delta^{-1})^T$, and observe that $Z\Delta^{-1} \in O(n, k)$. Apply the spectral decomposition to the matrix $\Delta B\Delta = W\Lambda W^T$ for some $W \in O(k, k)$, and then we have $P = U\Lambda U^T$ with $U = Z\Delta^{-1}W \in O(n, k)$.

Proof of Theorem 3. Under the current assumption, $\mathbb{E}\tau \in [C'_1 a, C'_2 a]$ for some large $C'_1$ and $C'_2$. Using Bernstein's inequality, we have $\tau \in [C_1 a, C_2 a]$ for some large $C_1$ and $C_2$ with probability at least $1 - e^{-C_0 n}$.
When (20) holds, by Lemma 5 we deduce that the $k$-th eigenvalue of $T_\tau(A)$ is lower bounded by $c_1\lambda_k$ with probability at least $1 - n^{-C_0}$ for some small constant $c_1 \in (0,1)$.

Figure 6: Schematic plot for the proof of Theorem 3. The balls $\{T_i\}_{i\in[k]}$ are centered at $\{Q_i\}_{i\in[k]}$, and the centers are at least $\sqrt{2k/(\beta n)}$ away from each other. The balls $\{\widehat C_i\}_{i\in[k]}$ intersect with large proportions of $\{T_i\}_{i\in[k]}$, though their subscripts need not match, due to a permutation.

By the Davis–Kahan sin-theta theorem [22], we have $\|\widehat U - UW_1\|_F \le \frac{C\sqrt k}{\lambda_k}\|T_\tau(A) - P\|_{\mathrm{op}}$ for some $W_1 \in O(k, k)$ and some constant $C > 0$. Applying Lemma 6, we have
$$\|\widehat U - V\|_F \le \frac{C\sqrt k}{\lambda_k}\|T_\tau(A) - P\|_{\mathrm{op}}, \qquad (60)$$
where $V = Z\Delta^{-1}W_2 \in O(n, k)$ for some $W_2 \in O(k, k)$. Combining (60), Lemma 5 and the conclusion $\tau \in [C_1 a, C_2 a]$, we have
$$\|\widehat U - V\|_F \le \frac{C\sqrt{k}\sqrt{a}}{\lambda_k}, \qquad (61)$$
with probability at least $1 - n^{-C_0}$. The definition of $V$ implies that
$$\|V_{u*} - V_{v*}\| = \sqrt{\frac{1}{n_{\sigma(u)}} + \frac{1}{n_{\sigma(v)}}}\,\mathbb{1}\{\sigma(u) \neq \sigma(v)\}. \qquad (62)$$
In other words, defining $Q = \Delta^{-1}W_2 \in \mathbb{R}^{k\times k}$, we have $V_{u*} = Q_{\sigma(u)*}$ for each $u \in [n]$. Hence, for $\sigma(u) \neq \sigma(v)$,
$$\|Q_{\sigma(u)*} - Q_{\sigma(v)*}\| = \|V_{u*} - V_{v*}\| \ge \sqrt{\frac{2k}{\beta n}}.$$
Recall the definition $r = \mu\sqrt{\frac{k}{n}}$ in Algorithm 2. Define the sets
$$T_i = \left\{ u \in \sigma^{-1}(i) : \|\widehat U_{u*} - Q_{i*}\| < \frac{r}{2} \right\}, \quad i \in [k].$$
By definition, $T_i \cap T_j = \emptyset$ when $i \neq j$, and we also have
$$\cup_{i\in[k]} T_i = \left\{ u \in [n] : \|\widehat U_{u*} - V_{u*}\| < \frac{r}{2} \right\}. \qquad (63)$$
Therefore,
$$\left|\left(\cup_{i\in[k]} T_i\right)^c\right|\frac{r^2}{4} \le \sum_{u\in[n]} \|\widehat U_{u*} - V_{u*}\|^2 \le \frac{C^2 ka}{\lambda_k^2},$$
where the last inequality is by (61). After rearrangement, we have
$$\left|\left(\cup_{i\in[k]} T_i\right)^c\right| \le \frac{4C^2 na}{\mu^2\lambda_k^2}. \qquad (64)$$
In other words, most nodes are close to the centers and thus belong to the set in (63). Note that the sets $\{T_i\}_{i\in[k]}$ are disjoint.
Suppose there were some $i \in [k]$ such that $|T_i| < |\sigma^{-1}(i)| - |(\cup_{i\in[k]} T_i)^c|$. Then we would have
$$\big|\cup_{i\in[k]} T_i\big| = \sum_{i\in[k]} |T_i| < n - \big|(\cup_{i\in[k]} T_i)^c\big| = \big|\cup_{i\in[k]} T_i\big|,$$
which is impossible. Thus, the cardinality of each $T_i$, $i \in [k]$, is lower bounded as
$$|T_i| \ge |\sigma^{-1}(i)| - \big|(\cup_{i\in[k]} T_i)^c\big| \ge \frac{n}{\beta k} - \frac{4C^2 n a}{\mu^2 \lambda_k^2} > \frac{n}{2\beta k}, \qquad (65)$$
where the last inequality above is by the assumption (20). Intuitively speaking, except for a negligible proportion, most of the points $\{\hat U_{u*}\}_{u\in[n]}$ are very close to the population centers $\{Q_{i*}\}_{i\in[k]}$. Since the centers are at least $\sqrt{2k/(\beta n)}$ away from each other, and $\{T_i\}_{i\in[k]}$ and $\{\hat C_i\}_{i\in[k]}$ are both defined through the critical radius $r = \mu\sqrt{k/n}$ for a small $\mu$, each $\hat C_i$ should intersect only one $T_i$ (see Figure 6). We claim that there exists some permutation $\pi$ of the set $[k]$ such that, for $\hat C_i$ defined in Algorithm 2,
$$\hat C_i \cap T_{\pi(i)} \ne \varnothing \quad\text{and}\quad |\hat C_i| \ge |T_{\pi(i)}| \quad\text{for each } i \in [k]. \qquad (66)$$
In what follows, we first establish the result of Theorem 3 assuming (66); the proof of (66) is given at the end. Note that for any $i \ne j$, $T_{\pi(i)} \cap \hat C_j = \varnothing$, which follows from the fact that $\hat C_j \cap T_{\pi(j)} \ne \varnothing$ and the definition of $\hat C_j$. Therefore, $T_{\pi(i)} \subset \hat C_j^c$ for all $j \ne i$. Combining this with the fact that $T_{\pi(i)} \cap \hat C_i^c \subset \hat C_i^c$, we get $T_{\pi(i)} \cap \hat C_i^c \subset (\cup_{i\in[k]} \hat C_i)^c$. Therefore,
$$\cup_{i\in[k]} \big(T_{\pi(i)} \cap \hat C_i^c\big) \subset \big(\cup_{i\in[k]} \hat C_i\big)^c. \qquad (67)$$
Since $T_i \cap T_j = \varnothing$ for $i \ne j$, we deduce from (67) that
$$\sum_{i\in[k]} \big|T_{\pi(i)} \cap \hat C_i^c\big| \le \big|(\cup_{i\in[k]} \hat C_i)^c\big|. \qquad (68)$$
Since, by definition, $\hat C_i \cap \hat C_j = \varnothing$ for $i \ne j$, we deduce from (66) that
$$\big|(\cup_{i\in[k]} \hat C_i)^c\big| = n - \sum_{i\in[k]} |\hat C_i| \le n - \sum_{i\in[k]} |T_i| = \big|(\cup_{i\in[k]} T_i)^c\big|. \qquad (69)$$
Combining (68), (69) and (64), we have
$$\sum_{i\in[k]} \big|T_{\pi(i)} \cap \hat C_i^c\big| \le \frac{4C^2 n a}{\mu^2 \lambda_k^2}. \qquad (70)$$
Since for any $u \in \cup_{i\in[k]} (\hat C_i \cap T_{\pi(i)})$ we have $\hat\sigma(u) = i$ when $\sigma(u) = \pi(i)$, the misclassification proportion is bounded as
$$\ell_0(\hat\sigma, \pi^{-1}(\sigma)) \le \frac{1}{n}\big|(\cup_{i\in[k]} (\hat C_i \cap T_{\pi(i)}))^c\big| \le \frac{1}{n}\Big(\big|(\cup_{i\in[k]} (\hat C_i \cap T_{\pi(i)}))^c \cap (\cup_{i\in[k]} T_i)\big| + \big|(\cup_{i\in[k]} T_i)^c\big|\Big) \le \frac{1}{n}\Big(\sum_{i\in[k]} \big|T_{\pi(i)} \cap \hat C_i^c\big| + \big|(\cup_{i\in[k]} T_i)^c\big|\Big) \le \frac{8C^2 a}{\mu^2 \lambda_k^2},$$
where the last inequality is from (70) and (64). This proves the desired conclusion.

Finally, we establish the claim (66) to close the proof. We use mathematical induction. For $i = 1$, it is clear that $|\hat C_1| \ge \max_{i\in[k]} |T_i|$ holds by the definition of $\hat C_1$. Suppose $\hat C_1 \cap T_i = \varnothing$ for all $i \in [k]$; then we must have
$$\big|(\cup_{i\in[k]} T_i)^c\big| \ge |\hat C_1| \ge \max_{i\in[k]} |T_i| \ge \frac{n}{2\beta k},$$
where the last inequality is by (65). This contradicts (64) under the assumption (20). Therefore, there must be a $\pi(1)$ such that $\hat C_1 \cap T_{\pi(1)} \ne \varnothing$ and $|\hat C_1| \ge |T_{\pi(1)}|$. Moreover,
$$|\hat C_1^c \cap T_{\pi(1)}| = |T_{\pi(1)}| - |T_{\pi(1)} \cap \hat C_1| \le |\hat C_1| - |T_{\pi(1)} \cap \hat C_1| = |\hat C_1 \cap T_{\pi(1)}^c| \le \big|(\cup_{i\in[k]} T_i)^c\big|,$$
where the last inequality holds because $T_{\pi(1)}$ is, by the definitions, the only set in $\{T_i\}_{i\in[k]}$ that intersects $\hat C_1$. By (64), we get
$$|\hat C_i^c \cap T_{\pi(i)}| \le \frac{4C^2 n a}{\mu^2 \lambda_k^2} \qquad (71)$$
for $i = 1$. Now suppose (66) and (71) are true for $i = 1, \dots, l-1$. Because of the sizes of $\{\hat C_i\}_{i\in[l-1]}$ and the fact that $\{T_i\}_{i\in[k]}$ are mutually exclusive, we have
$$\big(\cup_{i=1}^{l-1} \hat C_i\big) \cap \big(\cup_{i\in[k]\setminus\cup_{i=1}^{l-1}\{\pi(i)\}} T_i\big) = \varnothing.$$
Therefore, for the set $S$ in the current step, $\cup_{i\in[k]\setminus\cup_{i=1}^{l-1}\{\pi(i)\}} T_i \subset S$.
By the definition of $\hat C_l$, we have
$$|\hat C_l| \ge \max_{i\in[k]\setminus\cup_{i=1}^{l-1}\{\pi(i)\}} |T_i| \ge \frac{n}{2\beta k}.$$
Suppose $\hat C_l \cap T_{\pi(i)} \ne \varnothing$ for some $i = 1, \dots, l-1$. Then this $T_{\pi(i)}$ is, by their definitions, the only set in $\{T_i\}_{i\in[k]}$ that intersects $\hat C_l$. This implies that
$$|\hat C_l| \le |\hat C_l \cap T_{\pi(i)}| + \big|(\cup_{i\in[k]} T_i)^c\big|.$$
Since $\hat C_l \cap \hat C_i = \varnothing$, $|\hat C_l \cap T_{\pi(i)}| \le |\hat C_i^c \cap T_{\pi(i)}|$ is bounded by (71). Together with (64), we have
$$|\hat C_l| \le \frac{8C^2 n a}{\mu^2 \lambda_k^2},$$
which contradicts $|\hat C_l| \ge \frac{n}{2\beta k}$ under the assumption (20). Therefore, we must have $\hat C_l \cap T_{\pi(i)} = \varnothing$ for all $i = 1, \dots, l-1$. Now suppose $\hat C_l \cap T_{\pi(i)} = \varnothing$ for all $i \in [k]$; then we must have
$$\big|(\cup_{i\in[k]} T_i)^c\big| \ge |\hat C_l| \ge \frac{n}{2\beta k},$$
which contradicts (64). Hence, $\hat C_l \cap T_{\pi(l)} \ne \varnothing$ for some $\pi(l) \in [k]\setminus\cup_{i=1}^{l-1}\{\pi(i)\}$, and (66) is established for $i = l$. Moreover, (71) can also be established for $i = l$ by the same argument used to prove it for $i = 1$. The proof is complete.

7.3 Proof of Theorem 4

Define $P_\tau = P + \frac{\tau}{n}\mathbf{1}\mathbf{1}^T$. The proof of the following lemma is given in the appendix.

Lemma 7. Consider a symmetric adjacency matrix $A \in \{0,1\}^{n \times n}$ and a symmetric matrix $P \in [0,1]^{n \times n}$ satisfying $A_{uu} = 0$ for all $u \in [n]$ and $A_{uv} \sim \mathrm{Bernoulli}(P_{uv})$ independently for all $u > v$. For any $C_0 > 0$, there exists some $C > 0$ such that
$$\|L(A_\tau) - L(P_\tau)\|_{\mathrm{op}} \le C\sqrt{\frac{\log(e(np_{\max}+1))}{np_{\max}+1}},$$
with probability at least $1 - n^{-C_0}$, uniformly over $\tau \in [C_1(np_{\max}+1), C_2(np_{\max}+1)]$ for some sufficiently large constants $C_1, C_2$, where $p_{\max} = \max_{u \ge v} P_{uv}$.

Lemma 8. Consider $P = (P_{uv}) = (B_{\sigma(u)\sigma(v)})$. Let the SVD of the matrix $L(P_\tau)$ be $L(P_\tau) = U\Sigma U^T$, with $U \in O(n,k)$ and $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_k)$.
For $V = UW$ with any $W \in O(k,k)$, we have $\|V_{u*} - V_{v*}\| = \sqrt{\frac{1}{n_u} + \frac{1}{n_v}}$ when $\sigma(u) \ne \sigma(v)$ and $V_{u*} = V_{v*}$ when $\sigma(u) = \sigma(v)$. Moreover, $\sigma_k \ge \frac{\lambda_k}{2\tau}$ as long as $\tau \ge np_{\max}$.

Proof. The first part is Lemma 1 in [39]. Define $\bar d_v = \sum_{u\in[n]} P_{uv}$ and $\bar D_\tau = \mathrm{diag}(\bar d_1 + \tau, \dots, \bar d_n + \tau)$. Then we have $L(P_\tau) = \bar D_\tau^{-1/2} P_\tau \bar D_\tau^{-1/2}$. Note that $P_\tau$ has an SBM structure, so it has rank at most $k$, and the $k$th eigenvalue of $P_\tau$ is lower bounded by $\lambda_k$. Thus, we have
$$\sigma_k \ge \frac{\lambda_k}{\max_{u\in[n]} \bar d_u + \tau}.$$
Observe that $\max_{u\in[n]} \bar d_u \le np_{\max} \le \tau$, and the proof is complete.

Proof of Theorem 4. As shown in the proof of Theorem 3, $\tau \in [C_1 a, C_2 a]$ for some large $C_1, C_2$ with probability at least $1 - e^{-C' n}$. By the Davis–Kahan sin-theta theorem [22], we have $\|\hat U - UW\|_F \le \frac{C_1\sqrt{k}}{\sigma_k}\|L(A_\tau) - L(P_\tau)\|_{\mathrm{op}}$ for some $W \in O(k,k)$ and some constant $C_1 > 0$. Let $V = UW$ and apply Lemma 7 and Lemma 8; we have
$$\|\hat U - V\|_F \le \frac{C\sqrt{k}\sqrt{a\log a}}{\lambda_k}, \qquad (72)$$
with probability at least $1 - n^{-C_0}$. Note that by Lemma 8, $V$ satisfies (62). Replacing (61) by (72) and following the remaining proof of Theorem 3, the proof is complete.

References

[1] Emmanuel Abbe and Colin Sandon. Community detection in general stochastic block models: fundamental limits and efficient recovery algorithms. arXiv preprint arXiv:1503.00609, 2015.

[2] Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. arXiv preprint arXiv:1405.3267, 2014.

[3] Lada A Adamic and Natalie Glance. The political blogosphere and the 2004 US election: Divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, pages 36-43. ACM, 2005.

[4] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245-248, 2009.
[5] Noga Alon and Joel H Spencer. The Probabilistic Method. John Wiley & Sons, 2004.

[6] Arash A Amini and Elizaveta Levina. On semidefinite relaxations for the block model. arXiv preprint arXiv:1406.5647, 2014.

[7] Arash A Amini, Aiyou Chen, Peter J Bickel, and Elizaveta Levina. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097-2122, 2013.

[8] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156, 2014.

[9] VD Barnett. Evaluation of the maximum-likelihood estimator where the likelihood equation has multiple roots. Biometrika, pages 151-165, 1966.

[10] Peter J Bickel. One-step Huber estimates in the linear model. Journal of the American Statistical Association, 70(350):428-434, 1975.

[11] Peter J Bickel and Aiyou Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068-21073, 2009.

[12] Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, pages 1705-1732, 2009.

[13] T Tony Cai and Xiaodong Li. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. arXiv preprint arXiv:1404.6000, 2014.

[14] T Tony Cai, Cun-Hui Zhang, and Harrison H Zhou. Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38(4):2118-2144, 2010.

[15] T Tony Cai, Zongming Ma, and Yihong Wu. Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074-3110, 2013.

[16] Emmanuel Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. arXiv preprint arXiv:1407.1065, 2014.

[17] Kehui Chen and Jing Lei.
Network cross-validation for determining the number of communities in network data. arXiv preprint arXiv:1411.1715, 2014.

[18] Yudong Chen and Jiaming Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. arXiv preprint arXiv:1402.1267, 2014.

[19] Peter Chin, Anup Rao, and Van Vu. Stochastic block model and community detection in the sparse graphs: A spectral algorithm with optimal rate of recovery. arXiv preprint arXiv:1501.05021, 2015.

[20] David S Choi, Patrick J Wolfe, and Edoardo M Airoldi. Stochastic blockmodels with a growing number of classes. Biometrika, 99(2):273-284, 2012.

[21] Amin Coja-Oghlan. Graph partitioning via adaptive spectral techniques. Combinatorics, Probability and Computing, 19(02):227-284, 2010.

[22] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1-46, 1970.

[23] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.

[24] Uriel Feige and Eran Ofek. Spectral techniques applied to sparse random graphs. Random Structures & Algorithms, 27(2):251-275, 2005.

[25] Donniell E Fishkind, Daniel L Sussman, Minh Tang, Joshua T Vogelstein, and Carey E Priebe. Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. SIAM Journal on Matrix Analysis and Applications, 34(1):23-39, 2013.

[26] Joel Friedman, Jeff Kahn, and Endre Szemeredi. On the second eigenvalue of random regular graphs. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pages 587-598. ACM, 1989.

[27] Chao Gao, Zongming Ma, and Harrison H Zhou.
Sparse CCA: Adaptive estimation and computational barriers. arXiv preprint arXiv:1409.8565, 2014.

[28] Michelle Girvan and Mark EJ Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821-7826, 2002.

[29] Anna Goldenberg, Alice X Zheng, Stephen E Fienberg, and Edoardo M Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, 2(2):129-233, 2010.

[30] Lars Hagen and Andrew B Kahng. New spectral methods for ratio cut partitioning and clustering. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 11(9):1074-1085, 1992.

[31] Bruce Hajek, Yihong Wu, and Jiaming Xu. Achieving exact cluster recovery threshold via semidefinite programming. arXiv preprint arXiv:1412.6156, 2014.

[32] Bruce Hajek, Yihong Wu, and Jiaming Xu. Achieving exact cluster recovery threshold via semidefinite programming: Extensions. arXiv preprint arXiv:1502.07738, 2015.

[33] Mark S Handcock, Adrian E Raftery, and Jeremy M Tantrum. Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2):301-354, 2007.

[34] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109-137, 1983.

[35] Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press, 2012.

[36] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869-2909, 2014.

[37] Jiashun Jin. Fast community detection by SCORE. The Annals of Statistics, 43(1):57-89, 2015.

[38] Iain M Johnstone. Gaussian Estimation: Sequence and Wavelet Models. Book draft, 2011.

[39] Antony Joseph and Bin Yu. Impact of regularization on spectral clustering.
arXiv preprint arXiv:1312.1733, 2013.

[40] Brian Karrer and Mark EJ Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.

[41] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+ε)-approximation algorithm for geometric k-means clustering in any dimensions. In Proceedings-Annual Symposium on Foundations of Computer Science, pages 454-462. IEEE, 2004.

[42] Can M Le, Elizaveta Levina, and Roman Vershynin. Sparse random graphs: Regularization and concentration of the Laplacian. arXiv preprint arXiv:1502.03049, 2015.

[43] Lucien Marie Le Cam. Théorie asymptotique de la décision statistique, volume 33. Presses de l'Université de Montréal, 1969.

[44] Jing Lei. A goodness-of-fit test for stochastic block models. arXiv preprint arXiv:1412.4857, 2014.

[45] Jing Lei and Alessandro Rinaldo. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215-237, 2014.

[46] Jing Lei and Lingxue Zhu. A generic sample splitting approach for refined community recovery in stochastic block models. arXiv preprint arXiv:1411.1469, 2014.

[47] Zongming Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772-801, 2013.

[48] Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. The planar k-means problem is NP-hard. In WALCOM: Algorithms and Computation, pages 274-285. Springer, 2009.

[49] Laurent Massoulié. Community detection thresholds and the weak Ramanujan property. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 694-703. ACM, 2014.

[50] Frank McSherry. Spectral partitioning of random graphs. In Foundations of Computer Science, 2001. Proceedings. 42nd IEEE Symposium on, pages 529-537. IEEE, 2001.

[51] Elchanan Mossel, Joe Neeman, and Allan Sly.
Stochastic block models and reconstruction. arXiv preprint arXiv:1202.1499, 2012.

[52] Elchanan Mossel, Joe Neeman, and Allan Sly. Belief propagation, robust reconstruction, and optimal recovery of block models. arXiv preprint arXiv:1309.1380, 2013.

[53] Elchanan Mossel, Joe Neeman, and Allan Sly. A proof of the block model threshold conjecture. arXiv preprint arXiv:1311.4115, 2013.

[54] Elchanan Mossel, Joe Neeman, and Allan Sly. Consistency thresholds for binary symmetric block models. arXiv preprint arXiv:1407.1591, 2014.

[55] Mark EJ Newman and Elizabeth A Leicht. Mixture models and exploratory analysis in networks. Proceedings of the National Academy of Sciences, 104(23):9564-9569, 2007.

[56] Debashis Paul and Iain M Johnstone. Augmented sparse principal component analysis for high dimensional data. arXiv preprint arXiv:1202.1242, 2012.

[57] Tai Qin and Karl Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems, pages 3120-3128, 2013.

[58] Alfred Rényi. On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 547-561, 1961.

[59] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878-1915, 2011.

[60] Diego Franco Saldana, Yi Yu, and Yang Feng. How many communities are there? arXiv preprint arXiv:1412.1684, 2014.

[61] Purnamrita Sarkar and Peter J Bickel. Role of normalization in spectral clustering for stochastic blockmodels. arXiv preprint arXiv:1310.1495, 2013.

[62] Daniel L Sussman, Minh Tang, Donniell E Fishkind, and Carey E Priebe. A consistent adjacency spectral embedding for stochastic blockmodel graphs. Journal of the American Statistical Association, 107(499):1119-1128, 2012.
[63] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

[64] Sara van de Geer, Peter Bühlmann, Ya'acov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166-1202, 2014.

[65] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

[66] Ulrike von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The Annals of Statistics, pages 555-586, 2008.

[67] Van Vu. A simple SVD algorithm for finding hidden partitions. arXiv preprint arXiv:1404.3918, 2014.

[68] YX Wang and Peter J Bickel. Likelihood-based model selection for stochastic block models. arXiv preprint arXiv:1502.02069, 2015.

[69] Zhaoran Wang, Quanquan Gu, Yang Ning, and Han Liu. High dimensional expectation-maximization algorithm: Statistical optimization and asymptotic normality. arXiv preprint arXiv:1412.8729, 2014.

[70] Zhaoran Wang, Huanran Lu, and Han Liu. Nonconvex statistical optimization: Minimax-optimal sparse PCA in polynomial time. arXiv preprint arXiv:1408.5352, 2014.

[71] Stanley Wasserman. Social Network Analysis: Methods and Applications, volume 8. Cambridge University Press, 1994.

[72] Se-Young Yun and Alexandre Proutiere. Accurate community detection in the stochastic block model via spectral algorithms. arXiv preprint arXiv:1412.7335, 2014.

[73] Se-Young Yun and Alexandre Proutiere. Community detection via random and adaptive sampling. arXiv preprint arXiv:1402.3072, 2014.

[74] Anderson Y Zhang and Harrison H Zhou. Minimax rates of community detection in stochastic block model. 2015.

[75] Cun-Hui Zhang and Stephanie S Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217-242, 2014.
[76] Cun-Hui Zhang and Tong Zhang. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27(4):576-593, 2012.

[77] Yunpeng Zhao, Elizaveta Levina, and Ji Zhu. Consistency of community detection in networks under degree-corrected stochastic block models. The Annals of Statistics, 40(4):2266-2292, 2012.

Supplement to "Achieving Optimal Misclassification Proportion in Stochastic Block Model"

By Chao Gao 1, Zongming Ma 2, Anderson Y. Zhang 1 and Harrison H. Zhou 1
1 Yale University and 2 University of Pennsylvania

A A simplified version of Algorithm 1

Algorithm 3: A simplified refinement scheme for community detection

Input: Adjacency matrix $A \in \{0,1\}^{n \times n}$, number of communities $k$, initial community detection method $\sigma^0$.
Output: Community assignment $\hat\sigma$.

Initialization:
1 Apply $\sigma^0$ to $A$ to obtain $\sigma^0(u)$ for all $u \in [n]$;
2 Define $\tilde C_i = \{v : \sigma^0(v) = i\}$ for all $i \in [k]$; let $\tilde E_i$ be the set of edges within $\tilde C_i$, and $\tilde E_{ij}$ the set of edges between $\tilde C_i$ and $\tilde C_j$ when $i \ne j$;
3 Define
$$\hat B_{ii} = \frac{|\tilde E_i|}{\frac12 |\tilde C_i|(|\tilde C_i|-1)}, \qquad \hat B_{ij} = \frac{|\tilde E_{ij}|}{|\tilde C_i||\tilde C_j|}, \quad \forall i \ne j \in [k],$$
and let $\hat a = n\min_{i\in[k]} \hat B_{ii}$ and $\hat b = n\max_{i\ne j\in[k]} \hat B_{ij}$.

Penalized neighbor voting:
4 For $t = \frac12\log\frac{\hat a(1-\hat b/n)}{\hat b(1-\hat a/n)}$, define
$$\rho = -\frac{1}{2t}\log\Bigg(\frac{\frac{\hat a}{n}e^{-t} + 1 - \frac{\hat a}{n}}{\frac{\hat b}{n}e^{t} + 1 - \frac{\hat b}{n}}\Bigg);$$
5 For each $u \in [n]$, set
$$\hat\sigma(u) = \operatorname*{argmax}_{l\in[k]} \Bigg\{\sum_{v:\,\sigma^0(v)=l} A_{uv} - \rho \sum_{v\in[n]} \mathbb{1}\{\sigma^0(v)=l\}\Bigg\}.$$

B Proof of Theorem 5

Proof of Theorem 5. Let us consider $\Theta = \Theta_0(n,k,a,b,\beta)$; the case of $\Theta(n,k,a,b,\lambda,\beta;\alpha)$ is similar except that condition (18) is needed to establish the counterpart of Lemma 3. The proof essentially follows the same steps as the proof of Theorem 2.
First, we note that Lemma 1 continues to hold since it does not need the assumption that $a/b$ is bounded. Thus, the first job is to establish the counterpart of Lemma 2 with $\eta_0$ replaced by $C_\epsilon\frac{2\epsilon_0}{3}$. As before, let $p = a/n$ and $q = b/n$. To this end, we first proceed in the same way to obtain (42)-(52). Without loss of generality, let us consider the case where $t^* > \log\frac{2}{\epsilon_0}$ and $t_u = \log\frac{2}{\epsilon_0}$, since otherwise we can essentially repeat the proof of Theorem 2. Note that this implies $\frac{a}{b} > \big(\frac{2}{\epsilon_0}\big)^2$. In this case, with the new $t_u$ in (22), we have on the event $E_u$,
$$(qe^{t_u} + 1 - q)(pe^{-t_u} + 1 - p) = e^{-I'},$$
where
$$I' = -\log\Big[(1-p)(1-q) + pq + \big(e^{t_u-t^*} + e^{t^*-t_u}\big)\sqrt{p(1-p)(1-q)q}\Big] \ge \Big(1 - C_\epsilon\frac{3\epsilon_0}{5}\Big)I^*. \qquad (73)$$
To see this, we first note that for any $x, y \in (0,1)$ and sufficiently small constant $c_0 > 0$, if $y \ge x \ge (1-c_0)y$ and $\frac{y-x}{1-y} \le 1$, then
$$-\log(1-x) = -\log(1-y) - \log\Big(1 + \frac{y-x}{1-y}\Big) \ge -\log(1-y) - 2\,\frac{y-x}{1-y} \ge -\big(1 - C_y' c_0\big)\log(1-y),$$
where $C_y' = \frac{2y}{-(1-y)\log(1-y)}$. When $\frac{a}{b} > \big(\frac{2}{\epsilon_0}\big)^2$ and $t_u = \log\frac{2}{\epsilon_0}$, we have $I' = -\log(1-x)$ for
$$x = p + q - 2pq - \big(e^{t_u-t^*} + e^{t^*-t_u}\big)\sqrt{p(1-p)(1-q)q} \ge p - 2pq - qe^{t_u} - pe^{-t_u} \ge p\Big(1 - \epsilon_0 - \frac{\epsilon_0^2}{2}\Big),$$
while $I^* = -\log(1-y)$ for
$$y = p + q - 2pq - 2\sqrt{p(1-p)(1-q)q} \le p + q \le p\Big(1 + \big(\tfrac{\epsilon_0}{2}\big)^2\Big).$$
Thus, for any $\epsilon_0 \in (0, c_\epsilon)$, $1 - \frac{\epsilon_0}{2} \ge y \ge x \ge (1-2\epsilon_0)y$ and $\frac{y-x}{1-y} \le 1$, and we apply the inequality in the third-to-last display to obtain (73). Thus, the term in (49) is upper bounded by
$$\exp\Big(-\Big(1 - C_\epsilon\frac{3\epsilon_0}{5}\Big)\frac{n_1+n_l}{2}I^*\Big).$$
On the other hand, since $|e^{-t_u}-1| \le 1$, $|e^{t_u}-1|$ is bounded and $\frac{p-q}{p} \lesssim 1$, the term in (48) continues to be bounded by
$$\exp\Big(-o(1)\frac{n_1+n_l}{2}I^*\Big).$$
Moreover, by the same argument as in Lemma 2, (51) continues to hold.
Thus, we can replace (52) by
$$p_l \le \exp\Big(-\Big(1 - C_\epsilon\frac{2\epsilon_0}{3}\Big)\frac{n_1+n_l}{2}I^*\Big),$$
and so when $k \ge 3$,
$$\mathbb{P}\{\hat\sigma_u(u) \ne \pi_u(\sigma(u))\} \le (k-1)\exp\Big(-\Big(1 - C_\epsilon\frac{2\epsilon_0}{3}\Big)\frac{nI^*}{\beta k}\Big) + Cn^{-(1+\delta)}, \qquad (74)$$
and when $k = 2$ we can replace $\beta$ by 1 in the last display. When $k \ge 3$, given the last display and (58), we have
$$\mathbb{P}\{\hat\sigma(u) \ne \sigma(u)\} = \mathbb{P}\{\xi_u(\hat\sigma_u(u)) \ne \sigma(u)\} \le \mathbb{P}\big\{\xi_u(\hat\sigma_u(u)) \ne \sigma(u),\, \xi_u = \pi_u^{-1}\big\} + \mathbb{P}\big\{\xi_u \ne \pi_u^{-1}\big\} \le \mathbb{P}\{\hat\sigma_u(u) \ne \pi_u(\sigma(u))\} + \mathbb{P}\big\{\xi_u \ne \pi_u^{-1}\big\} \le Cn^{-(1+\delta)} + (k-1)\exp\Big(-\Big(1 - C_\epsilon\frac{2\epsilon_0}{3}\Big)\frac{nI^*}{\beta k}\Big). \qquad (75)$$
Thus, the assumption that $\frac{(a-b)^2}{ak\log k} \to \infty$ and Markov's inequality lead to
$$\mathbb{P}\Big\{\ell_0(\sigma,\hat\sigma) > \exp\Big(-(1 - C_\epsilon\epsilon_0)\frac{nI^*}{\beta k}\Big)\Big\} \le \mathbb{P}\Big\{\ell_0(\sigma,\hat\sigma) > (k-1)\exp\Big(-\Big(1 - C_\epsilon\frac{5\epsilon_0}{6}\Big)\frac{nI^*}{\beta k}\Big)\Big\} \le \frac{1}{(k-1)\exp\big\{-(1 - C_\epsilon\frac{5\epsilon_0}{6})\frac{nI^*}{\beta k}\big\}}\cdot\frac{1}{n}\sum_{u=1}^n \mathbb{P}\{\hat\sigma(u) \ne \sigma(u)\} \le \exp\Big(-C_\epsilon\frac{\epsilon_0}{6}\cdot\frac{nI^*}{\beta k}\Big) + \frac{Cn^{-(1+\delta)}}{(k-1)\exp\big\{-(1 - C_\epsilon\frac{5\epsilon_0}{6})\frac{nI^*}{\beta k}\big\}}. \qquad (76)$$
If $(k-1)\exp\big\{-(1 - C_\epsilon\frac{5\epsilon_0}{6})\frac{nI^*}{\beta k}\big\} \ge n^{-(1+\delta/2)}$, then
$$\mathbb{P}\Big\{\ell_0(\sigma,\hat\sigma) > \exp\Big(-(1 - C_\epsilon\epsilon_0)\frac{nI^*}{\beta k}\Big)\Big\} \le \exp\Big(-C_\epsilon\frac{\epsilon_0}{6}\cdot\frac{nI^*}{\beta k}\Big) + Cn^{-\delta/2} = o(1). \qquad (77)$$
If $(k-1)\exp\big\{-(1 - C_\epsilon\frac{5\epsilon_0}{6})\frac{nI^*}{\beta k}\big\} < n^{-(1+\delta/2)}$, then
$$\mathbb{P}\Big\{\ell_0(\sigma,\hat\sigma) > \exp\Big(-(1 - C_\epsilon\epsilon_0)\frac{nI^*}{\beta k}\Big)\Big\} \le \mathbb{P}\{\ell_0(\sigma,\hat\sigma) > 0\} \le \sum_{u=1}^n \mathbb{P}\{\hat\sigma(u) \ne \sigma(u)\} \le n(k-1)\exp\Big(-\Big(1 - C_\epsilon\frac{2\epsilon_0}{3}\Big)\frac{nI^*}{\beta k}\Big) + Cn^{-\delta} \le Cn^{-\delta/2} = o(1). \qquad (78)$$
Here, the second-to-last inequality holds since
$$(k-1)\exp\Big\{-\Big(1 - C_\epsilon\frac{2\epsilon_0}{3}\Big)\frac{nI^*}{\beta k}\Big\} < \exp\Big\{-\Big(1 - C_\epsilon\frac{5\epsilon_0}{6}\Big)\frac{nI^*}{\beta k}\Big\} < n^{-(1+\delta/2)}.$$
We complete the proof for the case $k \ge 3$ by noting that no constant or sequence in the foregoing arguments involves $B$, $\sigma$ or $u$. When $k = 2$, we run the foregoing arguments with $\beta$ replaced by 1 to obtain the desired claim.

C Proofs of Theorems 6, 7 and 8

Proposition 1.
For an SBM in the space $\Theta_0(n,k,a,b,\beta)$ satisfying $n \ge 2\beta k$, we have $\lambda_k \ge \frac{a-b}{2\beta k}$.

Proof. Since the eigenvalues of $P$ are invariant under permutation of the community labels, we may without loss of generality consider the case where $\sigma(u) = i$ for $u \in \big\{\sum_{j=1}^{i-1} n_j + 1, \dots, \sum_{j=1}^{i} n_j\big\}$, where $\sum_{j=1}^{0} n_j = 0$. Let us use the notation $\mathbf{1}_d \in \mathbb{R}^d$ and $\mathbf{0}_d \in \mathbb{R}^d$ to denote the vectors with all entries equal to 1 and 0, respectively. Then it is easy to check that
$$P - \frac{b}{n}\mathbf{1}_n\mathbf{1}_n^T = \frac{a-b}{n}\sum_{i=1}^k v_i v_i^T,$$
where $v_1 = (\mathbf{1}_{n_1}^T, \mathbf{0}_{n_2}^T, \dots, \mathbf{0}_{n_k}^T)^T$, $v_2 = (\mathbf{0}_{n_1}^T, \mathbf{1}_{n_2}^T, \mathbf{0}_{n_3}^T, \dots, \mathbf{0}_{n_k}^T)^T$, ..., $v_k = (\mathbf{0}_{n_1}^T, \dots, \mathbf{0}_{n_{k-1}}^T, \mathbf{1}_{n_k}^T)^T$. Note that $\{v_i\}_{i=1}^k$ are orthogonal to each other, and therefore
$$\lambda_k\Big(\sum_{i=1}^k v_i v_i^T\Big) \ge \min_{i\in[k]} n_i \ge \frac{n}{\beta k} - 1 \ge \frac{n}{2\beta k}.$$
By Weyl's inequality (Theorem 4.3.1 of [35]),
$$\lambda_k(P) \ge \frac{a-b}{n}\,\lambda_k\Big(\sum_{i=1}^k v_i v_i^T\Big) + \lambda_n\Big(\frac{b}{n}\mathbf{1}_n\mathbf{1}_n^T\Big) \ge \frac{a-b}{2\beta k}.$$
This completes the proof.

Proof of Theorem 6. Let us first consider $\Theta_0(n,k,a,b,\beta)$. By Theorem 3 and Proposition 1, the misclassification proportion is bounded by $\frac{Ck^2 a}{(a-b)^2}$ under the condition $\frac{k^3 a}{(a-b)^2} \le c$ for some small $c$. Thus, Condition 1 holds when $\frac{k^3 a}{(a-b)^2} = o(1)$, which leads to the desired conclusion in view of Theorem 2 and Theorem 5. The proof for the space $\Theta(n,k,a,b,\lambda,\beta;\alpha)$ follows the same argument.

Proof of Theorem 7. The proof is the same as that of Theorem 6.

Proof of Theorem 8. When the parameters $a$ and $b$ are known, we can use $\tau = Ca$ for some sufficiently large $C > 0$ for both USC($\tau$) and NSC($\tau$). Then the results of Theorem 3 and Theorem 4 hold without assuming $a \le C_1 b$ or fixed $k$. Moreover, $\hat a_u$ and $\hat b_u$ in (11) and (22) can be replaced by $a$ and $b$.
Then, the conditions (16) and (18) in Theorem 2 and Theorem 5 can be weakened to $\gamma = o(k^{-1})$, because we no longer need to establish Lemma 1. Combining Theorem 2, Theorem 3, Theorem 4 and Theorem 5, we obtain the desired results.

D Proofs of Lemma 5 and Lemma 7

The following lemma is Corollary A.1.10 in [5].

Lemma 9. For independent Bernoulli random variables $X_u \sim \mathrm{Bern}(p_u)$ and $p = \frac{1}{n}\sum_{u\in[n]} p_u$, we have
$$\mathbb{P}\Big(\sum_{u\in[n]}(X_u - p_u) \ge t\Big) \le \exp\Big(t - (pn+t)\log\Big(1 + \frac{t}{pn}\Big)\Big),$$
for any $t \ge 0$.

The following result is Lemma 3.5 in [19].

Lemma 10. Consider any adjacency matrix $A \in \{0,1\}^{n \times n}$ of an undirected graph. Suppose $\max_{u\in[n]}\sum_{v\in[n]} A_{uv} \le \gamma$, and that for any $S, T \subset [n]$ one of the following statements holds with some constant $C > 0$:

1. $\frac{e(S,T)}{|S||T|\gamma/n} \le C$,
2. $e(S,T)\log\Big(\frac{e(S,T)}{|S||T|\gamma/n}\Big) \le C|T|\log\frac{n}{|T|}$,

where $e(S,T)$ is the number of edges connecting $S$ and $T$. Then $\sum_{(u,v)\in H} x_u A_{uv} y_v \le C'\sqrt{\gamma}$ uniformly over all unit vectors $x, y$, where $H = \{(u,v) : |x_u y_v| \ge \sqrt{\gamma}/n\}$ and $C' > 0$ is some constant.

The following lemma is critical for proving both theorems.

Lemma 11. For any $\tau > C(1 + np_{\max})$ with some sufficiently large $C > 0$, we have
$$|\{u \in [n] : d_u \ge \tau\}| \le \frac{n}{\tau}$$
with probability at least $1 - e^{-C'n}$ for some constant $C' > 0$.

Proof. Let us consider any fixed subset of nodes $S \subset [n]$ such that every node in $S$ has degree at least $\tau$ and $|S| = l$ for some $l \in [n]$. Let $e(S)$ be the number of edges in the subgraph induced by $S$ and $e(S,S^c)$ the number of edges connecting $S$ and $S^c$. By the requirement on $S$, either $e(S) \ge C_1 l\tau$ or $e(S,S^c) \ge C_1 l\tau$ for some universal constant $C_1 > 0$. We are going to show that both $\mathbb{P}(e(S) \ge C_1 l\tau)$ and $\mathbb{P}(e(S,S^c) \ge C_1 l\tau)$ are small. Note that $\mathbb{E}e(S) \le C_2 l^2 p_{\max}$ and $\mathbb{E}e(S,S^c) \le C_2 ln p_{\max}$ for some universal $C_2 > 0$.
Then, when $\tau > C(np_{\max}+1)$ for some sufficiently large $C > 0$, Lemma 9 implies
$$\mathbb{P}(e(S) \ge C_1 l\tau) \le \exp\Big(-\frac14 C_1 l\tau\log\Big(1 + \frac{C_1\tau}{2C_2 lp_{\max}}\Big)\Big)$$
and
$$\mathbb{P}(e(S,S^c) \ge C_1 l\tau) \le \exp\Big(-\frac14 C_1 l\tau\log\Big(1 + \frac{C_1\tau}{2C_2 np_{\max}}\Big)\Big).$$
Applying a union bound, the probability that the number of nodes with degree at least $\tau$ is greater than $\xi n$ satisfies
$$\mathbb{P}\big(|\{u\in[n]: d_u \ge \tau\}| > \xi n\big) \le \sum_{l > \xi n} \mathbb{P}\big(|\{u\in[n]: d_u \ge \tau\}| = l\big) \le \sum_{l > \xi n}\sum_{|S|=l}\big(\mathbb{P}(e(S) \ge C_1 l\tau) + \mathbb{P}(e(S,S^c) \ge C_1 l\tau)\big) \le \sum_{l > \xi n}\exp\Big(l\log\frac{en}{l}\Big)\Big[\exp\Big(-\frac14 C_1 l\tau\log\Big(1 + \frac{C_1\tau}{2C_2 lp_{\max}}\Big)\Big) + \exp\Big(-\frac14 C_1 l\tau\log\Big(1 + \frac{C_1\tau}{2C_2 np_{\max}}\Big)\Big)\Big] \le \sum_{l > \xi n} 2\exp\Big(l\log\frac{en}{l} - \frac14 C_1 l\tau\log\Big(1 + \frac{C_1\tau}{2C_2 np_{\max}}\Big)\Big) \le \exp(-C'n),$$
where the last inequality is obtained by choosing $\xi = \tau^{-1}$. Therefore, with probability at least $1 - e^{-C'n}$, the number of nodes with degree at least $\tau$ is bounded by $\tau^{-1}n$.

Lemma 12. Given $\tau > 0$, define the subset $J = \{u \in [n] : d_u \le \tau\}$. Then for any $C_0 > 0$, there is some $C > 0$ such that
$$\|A_{JJ} - P_{JJ}\|_{\mathrm{op}} \le C\Big(\sqrt{np_{\max}} + \sqrt{\tau} + \frac{np_{\max}}{\sqrt{\tau} + \sqrt{np_{\max}}}\Big),$$
with probability at least $1 - n^{-C_0}$.

Proof. The idea of the proof follows the arguments in [26, 24]. By definition,
$$\|A_{JJ} - P_{JJ}\|_{\mathrm{op}} = \sup_{x,y\in S^{n-1}}\sum_{(u,v)\in J\times J} x_u(A_{uv} - P_{uv})y_v.$$
Define $L = \{(u,v) : |x_u y_v| \le (\sqrt{\tau} + \sqrt{p_{\max}n})/n\}$ and $H = \{(u,v) : |x_u y_v| \ge (\sqrt{\tau} + \sqrt{p_{\max}n})/n\}$; then we have
$$\|A_{JJ} - P_{JJ}\|_{\mathrm{op}} \le \sup_{x,y\in S^{n-1}}\sum_{(u,v)\in L\cap J\times J} x_u(A_{uv} - P_{uv})y_v + \sup_{x,y\in S^{n-1}}\sum_{(u,v)\in H\cap J\times J} x_u(A_{uv} - P_{uv})y_v.$$
A discretization argument in [19] implies that
$$\sup_{x,y\in S^{n-1}}\sum_{(u,v)\in L\cap J\times J} x_u(A_{uv} - P_{uv})y_v \lesssim \max_{x,y\in\mathcal{N}}\max_{S\subset[n]}\sum_{(u,v)\in L\cap S\times S} x_u(A_{uv} - \mathbb{E}A_{uv})y_v + \max_{x,y\in\mathcal{N}}\max_{S\subset[n]}\sum_{(u,v)\in L\cap S\times S} x_u(\mathbb{E}A_{uv} - P_{uv})y_v,$$
where $\mathcal{N} \subset S^{n-1}$ and $|\mathcal{N}| \le 5^n$.
Then, Bernstein's inequality and a union bound imply that
$$\max_{x,y\in\mathcal{N}}\max_{S\subset[n]}\sum_{(u,v)\in L\cap S\times S} x_u(A_{uv} - \mathbb{E}A_{uv})y_v \le C(\sqrt{\tau} + \sqrt{np_{\max}})$$
with probability at least $1 - e^{-C'n}$. We also have
$$\max_{x,y\in\mathcal{N}}\max_{S\subset[n]}\sum_{(u,v)\in L\cap S\times S} x_u(\mathbb{E}A_{uv} - P_{uv})y_v \le \|\mathbb{E}A - P\|_{\mathrm{op}} \le 1.$$
This completes the first part. To bound the second part $\sup_{x,y\in S^{n-1}}\sum_{(u,v)\in H\cap J\times J} x_u(A_{uv} - P_{uv})y_v$, we bound $\sup_{x,y\in S^{n-1}}\sum_{(u,v)\in H\cap J\times J} x_u A_{uv} y_v$ and $\sup_{x,y\in S^{n-1}}\sum_{(u,v)\in H\cap J\times J} x_u P_{uv} y_v$ separately. By the definition of $H$,
$$\sup_{x,y\in S^{n-1}}\sum_{(u,v)\in H\cap J\times J} x_u P_{uv} y_v = \sup_{x,y\in S^{n-1}}\sum_{(u,v)\in H\cap J\times J}\frac{x_u^2 y_v^2}{|x_u y_v|}P_{uv} \le \frac{np_{\max}}{\sqrt{\tau} + \sqrt{p_{\max}n}}.$$
To bound $\sup_{x,y\in S^{n-1}}\sum_{(u,v)\in H\cap J\times J} x_u A_{uv} y_v$, it suffices to check the conditions of Lemma 10 for the graph $A_{JJ}$. By definition, its degrees are bounded by $\tau$. Following the argument of [45], the two conditions of Lemma 10 hold with $\gamma = \tau + np_{\max}$ with probability at least $1 - n^{-C_0}$. Thus, $\sup_{x,y\in S^{n-1}}\sum_{(u,v)\in H\cap J\times J} x_u A_{uv} y_v \le C(\sqrt{\tau} + \sqrt{np_{\max}})$ with probability at least $1 - n^{-C_0}$. Hence, the proof is complete.

Proof of Lemma 5. By the triangle inequality,
$$\|T_\tau(A) - P\|_{\mathrm{op}} \le \|T_\tau(A) - T_\tau(P)\|_{\mathrm{op}} + \|T_\tau(P) - P\|_{\mathrm{op}},$$
where $T_\tau(P)$ is the matrix obtained by zeroing out the $u$th row and column of $P$ for every $u$ with $d_u \ge \tau$. Let $J = \{u \in [n] : d_u \le \tau\}$; then $\|T_\tau(A) - T_\tau(P)\|_{\mathrm{op}} = \|A_{JJ} - P_{JJ}\|_{\mathrm{op}}$, whose bound has been established in Lemma 12. By Lemma 11, $|J^c| \le n/\tau$ with high probability. This implies
$$\|T_\tau(P) - P\|_{\mathrm{op}} \le \|T_\tau(P) - P\|_F \le \sqrt{2n|J^c|p_{\max}^2} \le \frac{\sqrt{2}\,np_{\max}}{\sqrt{\tau}}.$$
Taking $\tau \in [C_1(1 + np_{\max}), C_2(1 + np_{\max})]$, the proof is complete.

Now let us prove Lemma 7. The following lemma, which controls the degrees, is Lemma 7.1 in [42].

Lemma 13.
For any $C_0>0$, there exists some $C>0$ such that with probability at least $1-n^{-C_0}$, there exists a subset $J\subset[n]$ satisfying
$$n-|J| \leq \frac{n}{2e(np_{\max}+1)}$$
and
$$|d_v - \mathbb{E}d_v| \leq C\sqrt{(np_{\max}+1)\log\left(e(np_{\max}+1)\right)}, \quad \text{for all } v\in J,$$
where $d_v = \sum_{u\in[n]} A_{uv}$.

Using this lemma, together with Lemma 11 and Lemma 12, we are able to prove the following result, which improves the bound in Theorem 7.2 of [42].

Lemma 14. For any $C_0>0$, there exists some $C>0$ such that with probability at least $1-n^{-C_0}$, there exists a subset $J\subset[n]$ satisfying $n-|J|\leq n/d$ and
$$\|(L(A_\tau)-L(P_\tau))_{J\times J}\|_{\mathrm{op}} \leq C\left(\frac{\sqrt{d\log d}\,(d+\tau)}{\tau^2} + \frac{\sqrt{d}}{\tau}\right),$$
where $d = e(np_{\max}+1)$.

Proof. We use the notation $d_v = \sum_{u\in[n]} A_{uv}$ throughout the proof. Define the set $J_1 = \{v\in[n]: d_v\leq C_1 d\}$ for some sufficiently large constant $C_1>0$. Using Lemma 11 and Lemma 12, with probability at least $1-n^{-C_0}$, we have
$$n-|J_1| \leq \frac{n}{2d}, \quad (79)$$
and
$$\|(A-P)_{J_1 J_1}\|_{\mathrm{op}} \leq C\sqrt{d}. \quad (80)$$
Let $J_2$ be the subset in Lemma 13; then with probability at least $1-n^{-C_0}$, $J_2$ satisfies
$$n-|J_2| \leq \frac{n}{2d}, \quad (81)$$
and
$$|d_v-\mathbb{E}d_v| \leq C\sqrt{d\log d}, \quad \text{for all } v\in J_2. \quad (82)$$
Define $J = J_1\cap J_2$. By (79) and (81), we have
$$n-|J| = |(J_1\cap J_2)^c| \leq |J_1^c| + |J_2^c| = (n-|J_1|) + (n-|J_2|) \leq \frac{n}{d}, \quad (83)$$
and
$$\|(A-P)_{JJ}\|_{\mathrm{op}} \leq \|(A-P)_{J_1 J_1}\|_{\mathrm{op}} \leq C\sqrt{d}. \quad (84)$$
Moreover, (82) implies $\max_{v\in J}|d_v-\mathbb{E}d_v| \leq C\sqrt{d\log d}$. Define $\bar{d}_v = \sum_{u\in[n]} P_{uv}$. Then,
$$\max_{v\in J}|d_v-\bar{d}_v| \leq \max_{v\in J}|d_v-\mathbb{E}d_v| + 1 \leq C\sqrt{d\log d}. \quad (85)$$
Define $D_\tau = \mathrm{diag}(d_1+\tau,\ldots,d_n+\tau)$ and $\bar{D}_\tau = \mathrm{diag}(\bar{d}_1+\tau,\ldots,\bar{d}_n+\tau)$. We introduce the notation
$$R = (A_\tau)_{JJ}, \quad B = \left((D_\tau)^{-1/2}\right)_{JJ}, \quad \bar{R} = (P_\tau)_{JJ}, \quad \bar{B} = \left((\bar{D}_\tau)^{-1/2}\right)_{JJ}.$$
Using (85), we have
$$\|B-\bar{B}\|_{\mathrm{op}} \leq \max_{v\in J}\left|\frac{1}{\sqrt{d_v+\tau}} - \frac{1}{\sqrt{\bar{d}_v+\tau}}\right| \leq \frac{C\sqrt{d\log d}}{\tau^{3/2}},$$
for some constant $C>0$.
The definitions of $B$ and $\bar{B}$ imply $\|B\|_{\mathrm{op}} \vee \|\bar{B}\|_{\mathrm{op}} \leq 1/\sqrt{\tau}$. We rewrite the bound (84) as $\|R-\bar{R}\|_{\mathrm{op}} \leq C\sqrt{d}$. Since all entries of $\mathbb{E}A_\tau$ are bounded by $(\tau+d)/n$, we have $\|\bar{R}\|_{\mathrm{op}} \leq \|\mathbb{E}A_\tau\|_{\mathrm{op}} \leq d+\tau$. Therefore, $\|R\|_{\mathrm{op}} \leq \|\bar{R}\|_{\mathrm{op}} + \|R-\bar{R}\|_{\mathrm{op}} \leq C(d+\tau)$. Finally,
$$\begin{aligned}
\|(L(A_\tau)-L(P_\tau))_{J\times J}\|_{\mathrm{op}} &\leq \|B\|_{\mathrm{op}}\|R\|_{\mathrm{op}}\|B-\bar{B}\|_{\mathrm{op}} + \|B\|_{\mathrm{op}}\|R-\bar{R}\|_{\mathrm{op}}\|\bar{B}\|_{\mathrm{op}} + \|B-\bar{B}\|_{\mathrm{op}}\|\bar{R}\|_{\mathrm{op}}\|\bar{B}\|_{\mathrm{op}} \\
&\leq C\left(\frac{\sqrt{d\log d}\,(d+\tau)}{\tau^2} + \frac{\sqrt{d}}{\tau}\right).
\end{aligned}$$
The proof is complete.

Proof of Lemma 7. Recall that $d = np_{\max}+1$. Following the proof of Theorem 8.4 in [42], it can be shown that with probability at least $1-n^{-C_0}$, for any $J\subset[n]$ such that $n-|J|\leq n/d$,
$$\|L(A_\tau)-L(P_\tau)\|_{\mathrm{op}} \leq \|(L(A_\tau)-L(P_\tau))_{JJ}\|_{\mathrm{op}} + C\left(\frac{1}{\sqrt{d}} + \sqrt{\frac{\log d}{\tau}}\right),$$
where the first term on the right-hand side is bounded in Lemma 14 by choosing an appropriate $J$. Hence, with probability at least $1-2n^{-C_0}$,
$$\|L(A_\tau)-L(P_\tau)\|_{\mathrm{op}} \leq C\left(\frac{\sqrt{d\log d}\,(d+\tau)}{\tau^2} + \frac{\sqrt{d}}{\tau}\right) + C\left(\frac{1}{\sqrt{d}} + \sqrt{\frac{\log d}{\tau}}\right).$$
Choosing $\tau\in[C_1(1+np_{\max}), C_2(1+np_{\max})]$, the proof is complete.
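As a concrete illustration (ours, not from the paper), the two preprocessing operators analyzed in this section can be sketched in NumPy: the trimming operator $T_\tau$ of Lemma 5, which zeroes out rows and columns of high-degree nodes, and the regularized Laplacian $L(A_\tau) = D_\tau^{-1/2} A_\tau D_\tau^{-1/2}$ of Lemmas 7 and 14. The function names are hypothetical; we take $A_\tau$ to be $A$ with $\tau/n$ added to every entry, consistent with the bound on the entries of $\mathbb{E}A_\tau$ used in the proof of Lemma 14.

```python
import numpy as np

def trim(A, tau):
    """T_tau(A): zero out the row and column of every node whose degree
    exceeds tau, keeping only the block indexed by J = {u : d_u <= tau}.
    Sketch of the trimming operator from Lemma 5."""
    A = np.asarray(A, dtype=float)
    high = A.sum(axis=1) > tau      # complement of J
    T = A.copy()
    T[high, :] = 0.0
    T[:, high] = 0.0
    return T

def regularized_laplacian(A, tau):
    """L(A_tau) = D_tau^{-1/2} A_tau D_tau^{-1/2}, where A_tau adds tau/n
    to every entry of A and D_tau = diag(d_1 + tau, ..., d_n + tau).
    Sketch of the regularized Laplacian from Lemmas 7 and 14."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    A_tau = A + tau / n
    # inv_sqrt_d plays the role of the diagonal matrix B = D_tau^{-1/2};
    # its entries are at most 1/sqrt(tau), matching ||B||_op <= 1/sqrt(tau).
    inv_sqrt_d = 1.0 / np.sqrt(A.sum(axis=1) + tau)
    return inv_sqrt_d[:, None] * A_tau * inv_sqrt_d[None, :]
```

For example, on an Erdős–Rényi graph with edge probability $p$, a natural choice in view of the lemmas is $\tau \asymp np + 1$:

```python
rng = np.random.default_rng(0)
n, p = 60, 0.2
A = (rng.random((n, n)) < p).astype(float)
A = np.triu(A, 1)
A = A + A.T                      # symmetric adjacency, zero diagonal
tau = n * p + 1                  # tau of order np_max + 1
T = trim(A, tau)                 # high-degree rows/columns zeroed
L = regularized_laplacian(A, tau)
```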
