Spectral Clustering and Block Models: A Review And A New Algorithm

Sharmodeep Bhattacharyya and Peter J. Bickel

Sharmodeep Bhattacharyya: Oregon State University, Department of Statistics, 44 Kidder Hall, Corvallis, OR, e-mail: bhattash@science.oregonstate.edu
Peter J. Bickel: University of California at Berkeley, Department of Statistics, 367 Evans Hall, Berkeley, CA, e-mail: bickel@stat.berkeley.edu

Abstract. We focus on spectral clustering of unlabeled graphs and review some results on clustering methods which achieve weak or strong consistent identification in data generated by such models. We also present a new algorithm which appears to perform optimally both theoretically, using asymptotic theory, and empirically.

1 Introduction

Since its introduction in [15], spectral analysis of various matrices associated to graphs has become one of the most widely used clustering techniques in statistics and machine learning. In the context of unlabeled graphs, a number of methods, all of which come under the broad heading of spectral clustering, have been proposed. These methods, based on spectral analysis of adjacency matrices or some derived matrix such as one of the Laplacians ([31], [28], [23], [29], [32]), have been studied in connection with their effectiveness in identifying members of blocks in exchangeable graph block models. In this paper, after introducing the methods and models, we intend to review some of the literature. We relate it to the results of Mossel, Neeman and Sly (2012) [26] and Massoulié (2014) [24], where it is shown that for very sparse models, there exists a phase transition below which members cannot be identified better than chance, and that above the phase transition one can do better using rather subtle methods.
In [6] we develop a spectral clustering method based on the matrix of geodesic distances between nodes which can achieve the goals of the work we cited and in fact behaves well for all unlabeled networks: sparse, semi-sparse and dense. We give a statement and sketch the proof of these claims in [] but give a full argument for the sparse case considered by the above authors only in this paper. We give the necessary preliminaries in Section 2, more history in Section 3, and show the theoretical properties of the method in Section 4.

2 Preliminaries

There are many standard methods of clustering based on numerical similarity matrices, which are discussed in a number of monographs (e.g., Hartigan [19], Leroy and Rousseeuw [30]). We shall not discuss these further. Our focus is on unlabeled graphs of n vertices characterized by adjacency matrices A = ||a_ij|| for n data points, with a_ij = 1 if there is an edge between i and j and a_ij = 0 otherwise. The natural assumption then is A = A^T. Our basic goal is to divide the points into K sets such that, on some average criterion, the points in a given subset are more similar to each other than to those of other subsets. Our focus is on methods of clustering based on the spectrum (eigenvalues and eigenvectors) of A or related matrices.

2.1 Notation and Formal Definition of the Stochastic Block Model

Definition 1. A graph G_K(B, (P, π)) generated from the stochastic block model (SBM) with K blocks and parameters P ∈ (0, 1)^{K×K} and π ∈ (0, 1)^K can be defined in the following way: each vertex of the graph G_n is assigned to a community c ∈ {1, ..., K}. The (c_1, ..., c_n) are independent outcomes of multinomial draws with parameter π = (π_1, ..., π_K), where π_i > 0 for all i. Conditional on the label vector c ≡ (c_1, ...
, c_n), the edge variables A_ij for i < j are independent Bernoulli variables with

E[A_ij | c] = P_{c_i c_j} = min{ρ_n B_{c_i c_j}, 1},   (1)

where P = [P_ab] and B = [B_ab] are K × K symmetric matrices. We call P the connection probability matrix and B the kernel matrix for the connection. So we have P_ab ≤ 1 for all a, b = 1, ..., K, and P1 ≤ 1 and 1^T P ≤ 1 element-wise. By definition A_ji = A_ij and A_ii = 0 (no self-loops).

This formulation is a reparametrization, due to Bickel and Chen (2009) [8], of the definition of Holland and Leinhardt [20]. It permits separate consideration asymptotically of the density of the graph and its structure as follows:

P(Vertex 1 belongs to block a, vertex 2 belongs to block b, and they are connected) = π_a π_b P_ab,

with P_ab depending on n through P_ab = ρ_n min(B_ab, 1/ρ_n). We can interpret ρ_n as the unconditional probability of an edge and B_ab essentially as P(Vertex 1 belongs to a and vertex 2 belongs to b | an edge between 1 and 2). Set Π = diag(π_1, ..., π_K).

1. Define the matrices M = ΠB and S = Π^{1/2} B Π^{1/2}.
2. Note that the eigenvalues of M are the same as those of the symmetric matrix S and in particular are real-valued.
3. The eigenvalues of the expected adjacency matrix Ā ≡ E(A) are also the same as those of S, but with multiplicities.

We denote the eigenvalues by their absolute order, λ_1 ≥ |λ_2| ≥ ... ≥ |λ_K|. Let us denote by (ϕ_1, ..., ϕ_K), ϕ_i ∈ R^K, the eigenvectors of S corresponding to the eigenvalues λ_1, ..., λ_K. If a set of the λ_j's are equal to λ, we choose eigenvectors from the eigenspace corresponding to λ as appropriate. Then we have φ_i = Π^{-1/2} ϕ_i and ψ_i = Π^{1/2} ϕ_i as the left and right eigenvectors of M. Also, ⟨φ_i, φ_j⟩_π = ∑_{k=1}^K π_k φ_ik φ_jk = δ_ij.
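As a concrete illustration of Definition 1, here is a minimal sampler (our sketch, not code from the paper; the function name and the parameter values in the usage are hypothetical) that draws labels from π and then edges as independent Bernoullis with probabilities min{ρ_n B_{c_i c_j}, 1}:

```python
import numpy as np

def sample_sbm(n, B, pi, rho_n, rng=None):
    """Draw (A, c) from the SBM of Definition 1: labels c are i.i.d. from pi and,
    given c, the edges A_ij (i < j) are independent Bern(min{rho_n*B_{c_i c_j}, 1})."""
    rng = np.random.default_rng(rng)
    c = rng.choice(len(pi), size=n, p=pi)          # community labels c_1, ..., c_n
    P = np.minimum(rho_n * B[np.ix_(c, c)], 1.0)   # connection probabilities P_{c_i c_j}
    upper = np.triu(rng.random((n, n)) < P, k=1)   # independent edges for i < j
    A = (upper | upper.T).astype(int)              # symmetrize: A_ji = A_ij
    np.fill_diagonal(A, 0)                         # no self-loops: A_ii = 0
    return A, c
```

With ρ_n = 1/n this produces a sparse graph whose average degree stays bounded, matching the sparse regime discussed later.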
The spectral decompositions of B, S and M are

B = ∑_{k=1}^K λ_k φ_k φ_k^T,   S = ∑_{k=1}^K λ_k ϕ_k ϕ_k^T,   M = ∑_{k=1}^K λ_k ψ_k φ_k^T.

2.2 Spectral Clustering

The basic goal of community detection is to infer the node labels c from the data. Although we do not explicitly consider parameter estimation, the parameters can be recovered from ĉ, an estimate of (c_1, ..., c_n), by

P̂_ab ≡ (1/O_ab) ∑_{i=1}^n ∑_{j=1}^n A_ij 1(ĉ_i = a, ĉ_j = b),   1 ≤ a, b ≤ K,   (2)

where

O_ab ≡ n_a n_b for 1 ≤ a, b ≤ K, a ≠ b;   O_aa ≡ n_a(n_a − 1) for 1 ≤ a ≤ K;   n_a ≡ ∑_{i=1}^n 1(ĉ_i = a), 1 ≤ a ≤ K.

There are a number of approaches to community detection based on modularities ([18], [8]), maximum likelihood and variational likelihood ([11], [7]), and approximations such as semidefinite programming approaches [3] and pseudolikelihood [2], but these all tend to be computationally intensive and/or require good initial assignments of blocks. The methods which have proved both computationally effective and asymptotically correct, in a sense we shall discuss, are related to spectral analysis of the adjacency or related matrices. They differ in important details. Given an n × n symmetric matrix M based on A, the algorithms are of the form:

1. Compute the spectral decomposition of M or solve a related generalized eigenproblem.
2. Obtain an n × K matrix of K n × 1 vectors.
3. Apply K-means clustering to the n K-dimensional row vectors of the matrix of Step 2.
4. Identify the indices of the rows belonging to cluster j, j = 1, ..., K, with the vertices belonging to block j.

In addition to A, three graph Laplacian matrices discussed by von Luxburg (2007) [33] have been considered extensively, as well as some others we shall mention briefly below and the matrix which we shall show has optimal asymptotic properties and discuss in greater detail.
The matrices popularly considered are:

• L = D − A: the graph Laplacian.
• L_rw = D^{-1} A: the random walk Laplacian.
• L_sym = D^{-1/2} A D^{-1/2}: the symmetric Laplacian.

Here D = diag(A1), the diagonal matrix whose diagonal is the vector of row sums of A. Von Luxburg considers optimization problems which are relaxed versions of combinatorial problems, ones which implicitly define clusters as sets of nodes with more internal than external edges. L and L_sym appear in two of these relaxations. The form of Step 2 differs for L and L_sym: the K vectors of the L problem correspond to the top K eigenvalues of the generalized eigenvalue problem Lv = λDv, while the n K-dimensional vectors of the L_sym problem are obtained by normalizing the rows of the matrix of K eigenvectors corresponding to the top K eigenvalues of L_sym. Their relation to the K-block model is through asymptotics.

Why is spectral clustering expected to work? Given A generated by a K-block model, let c ↔ (n_1, ..., n_K), where n_a is the number of vertices assigned to type a. Then we can write E(A | c) = PQP^T, where P is a permutation matrix and Q_{n×n} has successive blocks of n_1 rows, n_2 rows and so on, with all the vectors in each row the same. Thus rank(E(A | c)) = K. The same is true of the asymptotic limit of L given c. If asymptotics as n → ∞ justify concentration of A or L around their expectations, then we expect all eigenvalues other than the largest K in absolute value to be small. It follows that the n rows of the K eigenvectors associated with the top K eigenvalues should be resolvable into K clusters in R^K, with cluster members identified with rows of A_{n×n}; see [29], [32] for proofs.

2.3 Asymptotics

Now we can consider several asymptotic regimes as n → ∞. Let λ_n = nρ_n be the average degree of the graph.
(I) The dense regime: λ_n = Ω(n).
(II) The semidense regime: λ_n / log(n) → ∞.
(III) The semisparse regime: not semidense, but λ_n → ∞.
(IV) The sparse regime: λ_n = O(1).

Here are some results in the different regimes. We define a method of vertex assignment to communities as a random map δ : {1, ..., n} → {1, ..., K}, where the randomness comes through the dependence of δ on A as a function. Thus spectral clustering using the various matrices which depend on A is such a δ.

Definition 2. δ is said to be strongly consistent if P(i belongs to a and δ(i) = a for all i, a) → 1 as n → ∞. Note that the blocks are only determined up to permutation.

Bickel and Chen (2009) [8] show that in the (semi)dense regime a method called profile likelihood is strongly consistent under minimal identifiability conditions, and this result was later extended [7] to fitting by maximum likelihood or variational likelihood. In fact, in the (semi)dense regime, the block model likelihood asymptotically agrees with the joint likelihood of A and the vertex block identities, so that efficient estimation of all parameters is possible. It is easy to see that the result cannot hold in the (semi)sparse regime, since isolated points then exist with probability 1. Unfortunately, all of these methods are computationally intensive. Although spectral clustering is not strongly consistent, a slight variant, reassigning the vertices in any cluster a which are maximally connected to another cluster b rather than a, is strongly consistent.

Definition 3. δ is said to be weakly consistent if and only if

W ≡ n^{-1} ∑_{i=1}^n P(i ∈ a, δ(i) ≠ a) = o(1) for all i, a.

Spectral clustering applied to A [32] or to the Laplacians [29], in the manner we have described, has been shown to be weakly consistent in the semidense to dense regimes.
Even weak consistency fails for parts of the sparse regime [1]. The best that can be hoped for is W < 1/2. A sharp problem has been posed and eventually resolved in a series of papers: Decelle et al. [14], Mossel et al. [27]. These writers considered the case K = 2, π_1 = π_2, B_11 = B_22. First, Decelle et al. [14] argued on physical grounds that if

F = (B_11 − B_12)² / (2(B_11 + B_12)) ≤ 1,

then W ≥ 1/2 for any method, and the parameters are unestimable from the data even if they satisfy the minimal identifiability conditions given below. On the other hand, Mossel et al. [27] and independently Massoulié [24] devised admittedly slow methods such that if F > 1 then W < 1/2 and the parameters can be estimated consistently.

We now present a fast spectral clustering method, given in greater detail in [6], which yields weak consistency from the semisparse regime onwards and also has the properties of the Mossel et al. and Massoulié methods. In fact, it reaches the phase transition threshold for all K, not just K = 2, but still restricted to π_j = 1/K for all j and B_aa + 2∑{B_ab : b ≠ a} independent of a, for all a. We note that Zhao et al. (2015) [17] exhibit a two-stage algorithm with the same behavior, but its properties in the sparse case are unknown. The algorithm given in the next section involves spectral clustering of a new matrix: that of all geodesic distances between i and j.

3 Algorithm

As usual, let G_n, an undirected graph on n vertices, be the data. Denote the vertex set by V(G_n) ≡ {v_1, ..., v_n} and the edge set by E(G_n) ≡ {e_1, ..., e_m}, with cardinalities |V(G_n)| = n and |E(G_n)| = m. As usual, a path between vertices u and v is a set of edges {(u, v_1), (v_1, v_2), ..., (v_{ℓ-1}, v)}, and the length of such a path is ℓ.
The algorithm we propose depends on the graph distance or geodesic distance between vertices in a graph.

Definition 4. The graph or geodesic distance between two vertices i and j of a graph G is given by the length of the shortest path between the vertices i and j, if they are connected; otherwise, the distance is infinite. So, for any two vertices u, v ∈ V(G), the graph distance d_g is defined by

d_g(u, v) = min{ℓ : ∃ a path of length ℓ between u and v} if u and v are connected, and d_g(u, v) = ∞ otherwise.

For implementation, we can replace ∞ by n + 1 when u and v are not connected, since any path with loops cannot be a geodesic.

The main steps of the algorithm are as follows:

1. Find the graph distance matrix D = [d_g(v_i, v_j)]_{i,j=1}^n for the given network, but with the distance upper bounded by k log n. Assign non-connected vertices an arbitrary high value.
2. Perform hierarchical clustering to identify the giant component G_C of the graph G. Let n_C = |V(G_C)|.
3. Normalize the graph distance matrix D_C on G_C by

D̄_C = −(I − (1/n_C) 11^T)(D_C)²(I − (1/n_C) 11^T).

4. Perform an eigenvalue decomposition of D̄_C.
5. Consider the top K eigenvectors of the normalized distance matrix D̄_C and let W̃ be the n_C × K matrix formed by arranging the K eigenvectors as columns. Perform K-means clustering on the rows of W̃; that is, find an n_C × K matrix C which has K distinct rows and minimizes ||C − W̃||_F.
6. (Alternative to 5.) Perform Gaussian mixture model based clustering on the rows of W̃ when there is an indication of highly varying average degree between the communities.
7. Let ĉ : V ↦ [K] be the block assignment function according to the clustering of the rows of W̃ performed in either Step 5 or Step 6.
Here are some important observations about the implementation of the algorithm:

(a) There are standard algorithms for computing graph distances in the algorithmic graph theory literature, where the problem is known as the all-pairs shortest path problem. The two most popular algorithms are Floyd-Warshall ([16], [34]) and Johnson's algorithm [21].

(b) Step 3 of the algorithm is nothing but classical multidimensional scaling (MDS) of the graph distance matrix.

(c) In Step 5 of the algorithm, K-means clustering is appropriate if the expected degrees of the blocks are equal. However, if the expected degrees of the blocks are different, this leads to multiscale behavior in the eigenvectors of the normalized distance matrix and bad behavior in practice. So we perform Gaussian mixture model (GMM) based clustering instead of K-means to take this into account.

General theoretical results on the algorithm will be given in [6]. In this paper we restrict ourselves to the sparse regime. We do so because the arguments in the sparse regime are essentially different from the others. Curiously, it is only in the sparse regime and part of the semisparse regime that the matrix D̄_C concentrates around an n × n matrix with K distinct types of row vectors, as for the other methods of spectral clustering. It does not concentrate in the dense regime, while the opposite is true of A and L: they do not concentrate outside the semidense regime. That the geodesic matrix does not concentrate in the dense regime can easily be seen, since asymptotically all geodesic paths are of constant length. But the distributions of the path lengths differ from block to block, ensuring that spectral clustering works. We do not touch on this further here.
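Observation (b) can be checked directly: double-centering a squared Euclidean distance matrix recovers the centered Gram matrix, whose top eigenvectors give the classical MDS embedding. The following is a standard MDS identity on synthetic data (with the conventional factor 1/2, which rescales eigenvalues but leaves the eigenvectors used in Step 5 unchanged):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))                         # synthetic planar points
Xc = X - X.mean(axis=0)                              # centered coordinates
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # Euclidean distance matrix
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
G = -0.5 * J @ (D ** 2) @ J                          # double centering, as in Step 3
vals, vecs = np.linalg.eigh(G)                       # eigenvalues in ascending order
Y = vecs[:, -2:] * np.sqrt(vals[-2:])                # 2-D classical MDS embedding
```

Here G equals the Gram matrix of the centered points, so Y reproduces the configuration up to an orthogonal transformation and, in particular, preserves all pairwise distances.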
4 Theoretical Results

Throughout this section we take ρ_n = 1/n and specialize to the case

B = (p − q) I_{K×K} + q 11^T,

where I is the identity and 1 = (1, ..., 1)^T. That is, all K blocks have the same probability p of connecting two block members and probability q of connecting members of two different blocks, with p > q. We also assume that π_a = 1/K, a = 1, ..., K: all blocks are asymptotically of the same size. We restrict ourselves to this model here because it is the one treated by Mossel, Neeman and Sly (2013) [27], and already subtle technical details are not obscured. Here is the result we prove.

Theorem 1. For the given model, if

(p − q)² > K(p + (K − 1)q),   (3)

our algorithm is applied, ĉ results, and c is the true assignment function, then

P[ (1/n) ∑_{i=1}^n 1(c(v_i) ≠ ĉ(v_i)) < 1/2 ] → 1.   (4)

Notes:
1. (3) marks the phase transition conjectured by [14].
2. A close reading of our proof shows that as (p − q)²/(K(p + (K − 1)q)) → ∞, (1/n) ∑_{i=1}^n 1(c(v_i) ≠ ĉ(v_i)) →^P 0.

We conjecture that our conclusion in fact holds under the following conditions:

(A1) We consider λ_1 > 1, λ_1 > max_{2 ≤ j ≤ K} λ_j, and λ_K > 0. For M, there exists a k such that (M^k)_{ab} > 0 for all a, b = 1, ..., K. Also, π_j > 0 for j = 1, ..., K.

(A2) Each vertex has the same asymptotic average degree α > 1, that is,

α = ∑_{k=1}^K π_k B_{ak} = ∑_{k=1}^K M_{ak}, for all a ∈ {1, ..., K}.

(A3) We assume that λ_K² > λ_1 or, alternatively, that there exists a real positive t such that

∑_{k=1}^K φ_k(a) λ_k^t φ_k(b) ≤ n, for all a, b = 1, ..., K.

Note that (A1)-(A3) all hold for the case we consider. In fact, under our model,

λ_1 = (p + (K − 1)q)/K,   λ_2 = (p − q)/K,   λ_2 = λ_3 = ... = λ_K,

with (A3) being the condition of the Theorem.
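The eigenvalue formulas above can be verified numerically (our check, with hypothetical values p = 9, q = 1, K = 3); note that condition (3) is exactly the statement λ_2² > λ_1:

```python
import numpy as np

K, p, q = 3, 9.0, 1.0                          # hypothetical parameters with p > q
B = (p - q) * np.eye(K) + q * np.ones((K, K))
M = B / K                                      # M = Pi B with Pi = (1/K) I
lam = np.sort(np.linalg.eigvalsh(M))[::-1]     # eigenvalues in decreasing order
lam1, lam2 = lam[0], lam[1]
cond_3 = (p - q) ** 2 > K * (p + (K - 1) * q)  # condition (3)
cond_eig = lam2 ** 2 > lam1                    # lambda_2^2 > lambda_1, i.e. (A3)
```

The check confirms λ_1 = (p + (K − 1)q)/K, λ_2 = ... = λ_K = (p − q)/K, and that the two forms of the threshold agree: λ_2² > λ_1 multiplied through by K² is (p − q)² > K(p + (K − 1)q).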
Our argument will be stated in a form that is generalizable, and we will indicate revisions of intermediate statements as needed, pointing in particular to a lemma whose conclusion only holds if an implication of (A3) that we conjecture is valid. The theoretical analysis of the algorithm has two main parts:

I. Finding the limiting distribution of the graph distance between two typical vertices of type a and type b (where a, b = 1, ..., K). This part of the analysis is highly dependent on results from multi-type branching processes and their relation to stochastic block models. The proof techniques and results are borrowed from [9], [5] and [4].

II. Finding the behavior of the top K eigenvectors of the graph distance matrix D using the limiting distribution of the typical graph distances. This part of the analysis is highly dependent on perturbation theory for linear operators. The proof techniques and results are borrowed from [22], [12] and [32].

We will state two theorems corresponding to I and II above.

Theorem 2. Under our model, the graph distance d_G(u, v) between two uniformly chosen vertices of type a and b respectively, conditioned on being connected, satisfies the following asymptotic relations:

(i) If a = b, for any ε > 0, as n → ∞,

P[(1 − ε)τ_1 ≤ d_G(u, v) ≤ (1 + ε)τ_1] = 1 − o(1),   (5)

where τ_1 is the minimum real positive t which satisfies the relation

λ_2^t + (λ_1^t − λ_2^t)/K = n.   (6)

(ii) If a ≠ b, for any ε > 0, as n → ∞,

P[(1 − ε)τ_2 ≤ d_G(u, v) ≤ (1 + ε)τ_2] = 1 − o(1),   (7)

where τ_2 is the minimum real positive t which satisfies the relation

(λ_1^t − λ_2^t)/K = n.   (8)

In Theorem 2 we have a point-wise result. To use matrix perturbation theory for part II we need the following.

Theorem 3.
Let D_B be the restriction of the geodesic matrix to the vertices in the big component of G_n. Then, under our model,

P( || D_B / log n − D̄ ||_F ≤ o(n) ) = 1 − o(1),

where D̄_ij ≡ σ_1 = τ_1/log n if v_i and v_j have the same type, and D̄_ij ≡ σ_2 = τ_2/log n otherwise, with τ_1 and τ_2 the solutions t of Eqs. (6) and (8) respectively.

To generalize Theorem 1, we need appropriate generalizations of Theorems 2 and 3. Heuristically, it may be argued that the generalizations (τ_ab), a, b = 1, ..., K, should satisfy the equations

∑_{k=1}^K φ_k(a) λ_k^t φ_k(b) = (S^t)_{ab} = n, for a ≤ b ∈ [K].   (9)

Our conjecture is that (A1)-(A3) imply that these equations have asymptotic solutions and that the statements of Theorems 2 and 3 hold with obvious modifications. Note that in Theorem 2, since λ_j = λ_2 for 2 ≤ j ≤ K, there are effectively only two equations; modifications are also needed for other degeneracies in the parameters. We next turn to a branching process result in [10] which we will use heavily.

4.1 A Key Branching Process Result

As others have done, we link the network formed by the SBM with the tree network generated by a multi-type Galton-Watson branching process. In our case, the multi-type branching process (MTBP) has type space S = {1, ..., K}, where a particle of type a ∈ S is replaced in the next generation by a set of particles distributed as a Poisson process on S with intensity (B_ab π_b)_{b=1}^K = (M_ab)_{b=1}^K. Recall the definitions of B, M and S from Section 2.1. We denote this branching process, started with a single particle of type a, by B_{B,π}(a). We write B_{B,π} for the same process with the type of the initial particle random, distributed according to π.
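Equations (6) and (8) define τ_1 and τ_2 only implicitly, but when λ_1 > λ_2 > 1 both left-hand sides are strictly increasing in t, so the minimal positive root is the unique root and a standard root finder applies (our sketch under that assumption; the function name and the parameter values are hypothetical):

```python
import numpy as np
from scipy.optimize import brentq

def taus(lam1, lam2, K, n):
    """Minimal positive roots of Eqs. (6) and (8), assuming lam1 > lam2 > 1 so
    that both left-hand sides are strictly increasing in t."""
    f1 = lambda t: lam2 ** t + (lam1 ** t - lam2 ** t) / K - n  # Eq. (6)
    f2 = lambda t: (lam1 ** t - lam2 ** t) / K - n              # Eq. (8)
    hi = 10 * np.log(n) / np.log(lam2)   # generous bracket; tau_1, tau_2 = O(log n)
    return brentq(f1, 1e-9, hi), brentq(f2, 1e-9, hi)

# Hypothetical values: p = 9, q = 1, K = 3 give lam1 = 11/3, lam2 = 8/3.
tau1, tau2 = taus(lam1=11 / 3, lam2=8 / 3, K=3, n=10_000)
```

Since the left-hand side of (6) exceeds that of (8) by λ_2^t > 0 pointwise, the computed roots satisfy τ_1 < τ_2, matching the ordering of same-type and different-type distances in Theorem 2.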
According to Theorem 8.1 of Chapter 1 of [25], the branching process has a positive survival probability if λ_1 > 1, where λ_1 is the Perron-Frobenius eigenvalue of M, a positive regular matrix. Recall that for our special M, λ_1 = (p − q)/K + q.

Definition 5. (a) Define ρ(B, π; a) as the probability that the branching process B_{B,π}(a) survives for eternity.
(b) Define

ρ ≡ ρ(B, π) ≡ ∑_{a=1}^K ρ(B, π; a) π_a   (10)

as the survival probability of the branching process B_{B,π} given that its initial distribution is π.

We denote by Z_t = (Z_t(a))_{a=1}^K the population of particles of the K different types, with Z_t(a) denoting the particles of type a, at generation t of the Poisson multi-type branching process B_{B,π}, with B and π as defined in Section 4. From Theorem 24 of [10] we get:

Theorem 4 ([10]). Let β > 0 and Z_0 = x ∈ N^K be fixed. There exists C = C(x, β) > 0 such that with probability at least 1 − n^{−β}, for all k ∈ [K] and all s, t ≥ 0 with 0 ≤ s < t,

|⟨φ_k, Z_s⟩ − λ_k^{s−t} ⟨φ_k, Z_t⟩| ≤ C(t + 1)² λ_1^{s/2} (log n)^{3/2}.   (11)

Remark: The theorem stated above is a special case of the general theorem stated in [10]. The general theorem is required for generalizing Theorem 1. The general version of the theorem is:

Theorem 5 ([10]). Let β > 0 and Z_0 = x ∈ N^K be fixed. There exists C = C(x, β) > 0 such that with probability at least 1 − n^{−β}, for all k ∈ [K_0] (where K_0 is the largest integer such that λ_k² > λ_1 for all k ≤ K_0) and all s, t ≥ 0 with 0 ≤ s < t,

|⟨φ_k, Z_s⟩ − λ_k^{s−t} ⟨φ_k, Z_t⟩| ≤ C(t + 1)² λ_1^{s/2} (log n)^{3/2},   (12)

and for all k ∈ [K] \ [K_0], for all t ≥ 0,

|⟨φ_k, Z_t⟩| ≤ C(t + 1)² λ_1^{t/2} (log n)^{3/2}.   (13)

Finally, for all k ∈ [K] \ [K_0] and all t ≥ 0, E|⟨φ_k, Z_t⟩|² ≤ C(t + 1)³ λ_1^t.
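The process of Section 4.1 is easy to simulate, since the total offspring of type b in one generation is Poisson with mean (Z_t M)_b, the superposition of the parents' independent Poisson offspring counts. The growth rate λ_1^t appearing in Theorem 4 can then be observed empirically (our sketch; the parameters are hypothetical, with K = 2, p = 9, q = 1, so λ_1 = 5):

```python
import numpy as np

def simulate_mtbp(M, z0, T, rng):
    """Simulate Z_0, ..., Z_T for the Poisson MTBP: a particle of type a has
    Poisson(M_ab) children of type b, so Z_{t+1} | Z_t ~ Poisson(Z_t M)."""
    Z = [np.asarray(z0, dtype=np.int64)]
    for _ in range(T):
        Z.append(rng.poisson(Z[-1] @ M))
    return np.array(Z)

rng = np.random.default_rng(2)
K, p, q = 2, 9.0, 1.0                                # hypothetical parameters
M = ((p - q) * np.eye(K) + q * np.ones((K, K))) / K  # M = Pi B with Pi = (1/2) I
lam1 = (p + (K - 1) * q) / K                         # Perron-Frobenius root, here 5
Z = simulate_mtbp(M, z0=[5, 5], T=10, rng=rng)
ratio = Z.sum(axis=1) / lam1 ** np.arange(11)        # ~ <phi_1, Z_t> / lam_1^t
```

Here φ_1 ∝ (1, 1), so Z_t.sum() tracks ⟨φ_1, Z_t⟩, and on survival the normalized sequence `ratio` stabilizes, in line with the martingale-type bound (11).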
4.2 The Neighborhood Exploration Process

The neighborhood exploration process of a vertex v in a graph G generated from an SBM gives us a handle on the link between the local structure of a graph from an SBM and the multi-type branching process. Recall the definitions of the SBM parameters from Section 2.1 and the definition of the Poisson multi-type branching process from Section 4.1. We assume that every vertex v_i ∈ V(G_n) of a graph G_n generated from a stochastic block model has been assigned a community or type ξ_i (say).

The neighborhood exploration process (G, v)_L of a vertex v in the graph G_n generates a spanning tree of the induced subgraph of G_n consisting of the vertices at distance at most L from v. The spanning tree is formed from the exploration process which starts from the vertex v as the root in the random graph G_n generated from the stochastic block model. The set of vertices of type a of the random graph G_n that are neighbors of v and have not been previously explored is called Γ_{1,a}(v), with N_{1,a}(v) = |Γ_{1,a}(v)| for a = 1, ..., K and N_1(v) = (N_{1,1}(v), ..., N_{1,K}(v)). So Γ_1(v) = {Γ_{1,1}(v), ..., Γ_{1,K}(v)} are the children of the root v at step ℓ = 1 in the spanning tree of the neighborhood exploration process. The neighborhood exploration process is repeated at the second step by looking at the neighbors of type a of the vertices in Γ_1(v) that have not been previously explored; this set is called Γ_{2,a}(v), with N_{2,a}(v) = |Γ_{2,a}(v)| for a = 1, ..., K. Similarly, Γ_2(v) = {Γ_{2,1}(v), ..., Γ_{2,K}(v)} are the children of the vertices of Γ_1(v) at step ℓ = 2 in the spanning tree of the neighborhood exploration process. The exploration process is continued until step ℓ = L.
Note that the process stops when all the vertices in G_n have been explored. So, if G_n is connected, then L is at most the diameter of the graph G_n. Since we either consider G_n connected or consider only the giant component of G_n, the neighborhood exploration process will end in a finite number of steps, but the number of steps may depend on n and is equal to the diameter L of the connected component of the graph containing the root v. It follows from Theorem 14.11 of [9] that

L / log_{λ_1}(n) →^P 1.   (14)

Now we find a coupling relation between the neighborhood exploration process of a vertex of type a in the stochastic block model and a multi-type Galton-Watson process B(a) starting from a vertex of type a. The Lemma is based on Proposition 31 of [10].

Lemma 1. Let w(n) be a sequence such that w(n) → ∞ and w(n)/n → 0. Let (T, v) be the random rooted tree associated with the Poisson multi-type Galton-Watson branching process defined in Section 2.1 started from Z_0 = δ_{c_v}, and let (G, v) be the spanning tree associated with the neighborhood exploration process of the random SBM graph G_n starting from v. For ℓ ≤ τ, where τ is the number of steps required to explore w(n) vertices in (G, v), the total variation distance d_TV between the law of (G, v)_ℓ and the law of (T, v)_ℓ at step ℓ goes to zero as O(n^{−1/2} ∨ w(n)/n) = o(1).

Proof. Let us start the neighborhood exploration process at a vertex v of a graph generated from an SBM model with parameters (P, π) = (B/n, π). Correspondingly, the multi-type branching process starts from a single particle of type c_v, where c_v is the type or class of the vertex v in the SBM. Let t be such that 0 ≤ t < τ, where τ is defined in the Lemma statement. Now, for such a t ≥ 0, let (x_{t+1}(1), ...
, x_{t+1}(K)) be the leaves of (T, v) at time t starting from a vertex v_t generated at step t of class c_{v_t} = a. Let (y_{t+1}(1), ..., y_{t+1}(K)) be the vertices exposed at step t of the exploration process starting from a vertex of class a, where a ∈ [K]. Now, if c_{v_t} is of type a, then y_{t+1}(b) follows Bin(n_t(b), B_ab/n) and x_{t+1}(b) follows Poi(π_b B_ab) for b = 1, ..., K, where n_t(b) is the number of unused vertices of type b remaining at time t, for b = 1, ..., K. Also, the x_{t+1}(b) for different b are independent. Note that n_b ≥ n_t(b) ≥ n_b − w(n) for b = 1, ..., K. So, since |n_b/n − π_b| = O(n^{−1/2}) for b = 1, ..., K, we get that

|n_t(b)/n − π_b| < O(n^{−1/2} + w(n)/n) for b = 1, ..., K.

Now we know that

d_TV(Bin(m′, λ/m), Poi(m′λ/m)) ≤ λ/m,   d_TV(Poi(λ), Poi(λ′)) ≤ |λ − λ′|.

So now we have

d_TV(P_{t+1}, Q_{t+1}) ≤ O(n^{−1/2} ∨ w(n)/n) = o(1),

where P_{t+1} is the distribution of y_{t+1} under the neighborhood exploration process and Q_{t+1} is the distribution of x_{t+1} under the branching process, and hence Lemma 1 follows.

Now we restrict ourselves to the giant component of G_n. The size of the giant component of G_n, C_1(G_n), for a random graph generated from SBM(B, π), is related to the multi-type branching process through its survival probability, as given in Definition 5. According to Theorem 3.1 of [9], we have

(1/n) C_1(G_n) →^P ρ(B, π).   (15)

Under this additional condition of restricting to the giant component, the branching process can be coupled with another branching process with a different kernel. The kernel of that branching process is given in the following lemma.

Lemma 2.
If v is in the giant component of G_n, the new branching process has kernel

[ B_ab (2ρ(B, π)/K − ρ²(B, π)/K²) ]_{a,b=1}^K.

Proof. The proof is given in Section 10 of [9].

Since we will be restricting ourselves to the giant component of G_n, we shall use the matrix B′ ≡ [B_ab(2ρ(B, π)/K − ρ²(B, π)/K²)]_{a,b=1}^K as the connectivity matrix instead of B. We abuse notation by referring to the matrix B′ as B too.

We proceed to prove the limiting behavior of the typical distance between vertices v and w of G_n, where v, w ∈ V(G_n). We first try to find a lower bound for the distance between two vertices. We shall separately give upper and lower bounds for the distance between two vertices of the same type and of different types.

Lemma 3. Under our model, for vertices v, w ∈ V(G), if

(a) type of v = type of w = a (say), then

|{{v, w} : d_G(v, w) ≤ (1 − ε)τ_1}| ≤ O(n^{2−ε}) with high probability,

where τ_1 is the minimum real positive t which satisfies Eq. (6);

(b) type of v = a ≠ b = type of w (say), then

|{{v, w} : d_G(v, w) ≤ (1 − ε)τ_2}| ≤ O(n^{2−ε}) with high probability,

where τ_2 is the minimum real positive t which satisfies Eq. (8).

Proof. Let Γ_d(v) ≡ Γ_d(v, G_n) denote the d-distance set of v in G_n, i.e., the set of vertices of G_n at graph distance exactly d from v, and let Γ_{≤d}(v) ≡ Γ_{≤d}(v, G_n) denote the d-neighborhood ∪_{d′≤d} Γ_{d′}(v) of v. Let Γ_{d,a}(v) ≡ Γ_{d,a}(v, G_n) denote the set of vertices of type a at distance d in G_n, and let Γ_{≤d,a}(v) ≡ Γ_{≤d,a}(v, G_n) denote the d-neighborhood ∪_{d′≤d} Γ_{d′,a}(v) of v consisting of the vertices of type a. Let N_d^a be the number of particles at generation d of the branching process B_B(δ_a), and let N_{d,c}^a be the number of particles of type c at generation d of the branching process B_B(δ_a).
So $N^a_d = \sum_{c=1}^K N^a_{d,c}$ and $Z_t(k) = \sum_{d=0}^t N^a_{d,k}$.

Lemma 1 involved first showing that, for $n$ large enough, the neighborhood exploration process starting at a given vertex $v$ of $G_n$ with type $a$ can be coupled with the branching process $\mathcal{B}_{B'}(\delta_a)$, where $B'$ is defined by Lemma 2. As noted, we identify $B'$ with $B$. The neighborhood exploration process and the multi-type branching process can be coupled so that, for every $d$, $|\Gamma_d(v)|$ is at most $N_d + O(n^{-1/2} \vee w(n)/n)$, where $N_d$ is the number of particles in generation $d$ of $\mathcal{B}_B(\delta_a)$ and at most $w(n)$ vertices of $G_n$ have been explored in $d$ generations.

From Theorem 4, we get that with high probability
\[ \left| \frac{\langle \phi_k, Z_t \rangle}{\lambda_k^t} - \langle \phi_k, Z_0 \rangle \right| \le C (t+1)^2 (\log n)^{3/2}. \]
For any $x \in \mathbb{R}^K$ and any orthonormal basis $\{\phi_k\}_{k=1}^K$ of $\mathbb{R}^K$, we have the unique representation $x = \sum_{k=1}^K \langle x, \phi_k \rangle \phi_k$. Taking $x = e_b$, where $e_b$ is the unit vector with 1 in the $b$-th coordinate and 0 elsewhere, $b = 1, \ldots, K$, we get
\[ Z_t(b) \le \sum_{k=1}^K \phi_k(b)\, \lambda_k^t\, \phi_k(a) \left[ Z_0(a) + C(t+1)^2 (\log n)^{3/2} \right]. \]
Under our model, one representation of the eigenvectors is
\[ \phi_1 = \tfrac{1}{\sqrt{K}}(1, \ldots, 1), \quad \phi_2 = \tfrac{1}{\sqrt{2}}(-1, 1, 0, \ldots, 0), \quad \phi_3 = \tfrac{1}{\sqrt{6}}(-1, -1, 2, 0, \ldots, 0), \quad \ldots, \quad \phi_K = \tfrac{1}{\sqrt{K(K-1)}}(-1, \ldots, -1, K-1). \]
Now, using this representation of the eigenvectors for the branching process starting from a vertex of type $a$, $a \in [K]$, we get with high probability
\[ \sum_{k=1}^K Z_t(k) \le \lambda_1^t \left[ Z_0(a) + C(t+1)^2 (\log n)^{3/2} \right], \]
\[ Z_t(a) - Z_t(b) \ge \lambda_2^t \left[ -Z_0(a) - C(t+1)^2 (\log n)^{3/2} \right], \quad b = 1, \ldots, K \text{ and } b \ne a. \]
So, for each $a \in [K]$ with $Z_0(a) = 1$, we can simplify: with high probability,
\[ Z_t(a) \le \frac{1}{K}\big( \lambda_1^t + (K-1)\lambda_2^t \big) \left[ 1 + C(t+1)^2 (\log n)^{3/2} \right], \]
\[ Z_t(b) \le \frac{\lambda_1^t - \lambda_2^t}{K} \left[ 1 + C(t+1)^2 (\log n)^{3/2} \right], \quad b \in [K] \text{ and } b \ne a. \]
Set $D_1 = (1-\varepsilon)\tau_1$, where $\tau_1$ is the solution to the equation
\[ \lambda_2^t + \frac{\lambda_1^t - \lambda_2^t}{K} = n, \]
and set $D_2 = (1-\varepsilon)\tau_2$, where $\tau_2$ is the solution to the equation
\[ \frac{\lambda_1^t - \lambda_2^t}{K} = n, \]
where $\varepsilon > 0$ is fixed and small. Note that both $\tau_1$ and $\tau_2$ are of order $O(\log n)$. Thus, with high probability, for $v$ of type $a$ and $w(n) = O(n^{1-\varepsilon})$,
\[ |\Gamma_{\le D_1, a}(v)| = \sum_{d=0}^{D_1} N^a_{d,a} \le Z_{D_1}(a) + O\big(D_1 (n^{-1/2} \vee w(n)/n)\big) = O(n^{1-\varepsilon}), \]
\[ |\Gamma_{\le D_2, b}(v)| = \sum_{d=0}^{D_2} N^a_{d,b} \le Z_{D_2}(b) + O\big(D_2 (n^{-1/2} \vee w(n)/n)\big) = O(n^{1-\varepsilon}). \]
So, summing over $v \in C_a$ and $v \in C_b$, where $C_a = \{i \in V(G) \mid c_i = a\}$ and $C_b = \{i \in V(G) \mid c_i = b\}$, we have
\[ \sum_{v \in C_a} |\Gamma_{\le D_1, a}(v)| = |\{\{v, w\} : d_G(v, w) \le (1-\varepsilon)\tau_1,\ v, w \in C_a\}|, \]
\[ \sum_{v \in C_a} |\Gamma_{\le D_2, b}(v)| = |\{\{v, w\} : d_G(v, w) \le (1-\varepsilon)\tau_2,\ v \in C_a,\ w \in C_b\}|, \]
and so, with high probability,
\[ |\{\{v, w\} : d_G(v, w) \le (1-\varepsilon)\tau_1,\ v, w \in C_a\}| = \sum_{v \in V(G_n)} |\Gamma_{\le D_1, a}(v)| = O(n^{2-\varepsilon}), \]
\[ |\{\{v, w\} : d_G(v, w) \le (1-\varepsilon)\tau_2,\ v \in C_a,\ w \in C_b\}| = \sum_{v \in V(G_n)} |\Gamma_{\le D_2, b}(v)| = O(n^{2-\varepsilon}). \]
The above statement is equivalent to
\[ P\big( |\{\{v, w\} : d_G(v, w) \le (1-\varepsilon)\tau_1,\ v, w \in C_a\}| \le O(n^{2-\varepsilon}) \big) = 1 - o(1), \]
\[ P\big( |\{\{v, w\} : d_G(v, w) \le (1-\varepsilon)\tau_2,\ v \in C_a,\ w \in C_b\}| \le O(n^{2-\varepsilon}) \big) = 1 - o(1) \]
for any fixed $\varepsilon > 0$. Now, we upper bound the typical distance between two vertices of the SBM graph $G_n$.

Lemma 4.
Under our model, for vertices $v, w \in V(G)$, conditioned on the event that the exploration process starts from a vertex in the giant component of $G$:
(a) if type of $v$ = type of $w$ = $a$ (say), then $P\big(d_G(v, w) < (1+\varepsilon)\tau_1\big) = 1 - \exp(-\Omega(n^{2\eta}))$, where $\tau_1$ is the minimum real positive $t$ which satisfies Eq. (6);
(b) if type of $v = a \ne b =$ type of $w$ (say), then $P\big(d_G(v, w) < (1+\varepsilon)\tau_2\big) = 1 - \exp(-\Omega(n^{2\eta}))$, where $\tau_2$ is the minimum real positive $t$ which satisfies Eq. (8).

Proof. We consider the multi-type branching process with probability kernel $P_{ab} = B_{ab}/n$ for all $a, b = 1, \ldots, K$; the corresponding random graph $G_n$ generated from the stochastic block model has $n$ nodes in total. We condition on the event that the branching process $\mathcal{B}_K$ survives. Note that the upper bound 1 is obvious, since we are bounding a probability, so it suffices to prove a corresponding lower bound. We may and shall assume that $B_{ab} > 0$ for some $a, b$.

Again, let $\Gamma_d(v)$, $\Gamma_{\le d}(v)$, $\Gamma_{d,a}(v)$, and $\Gamma_{\le d,a}(v)$ be as defined in the proof of Lemma 3, and let $N^a_d$ and $N^a_{d,c}$ be the generation sizes of the branching process $\mathcal{B}_B(\delta_a)$ as before, so that $N^a_d = \sum_{c=1}^K N^a_{d,c}$ and $Z_t(k) = \sum_{d=0}^t N^a_{d,k}$. By Lemma 1, for $w(n) = o(n)$,
\[ |\Gamma_{d,c}(v)| \ge N_{d,c} - O\big(n^{-1/2} \vee w(n)/n\big), \quad c = 1, \ldots, K, \tag{16} \]
for all $d$ such that $|\Gamma_{\le d}(v)| < \omega(n)$.
This relation between $N_{d,c}$, the number of vertices of type $c$ at generation $d$ of the branching process $\mathcal{B}_B(\delta_a)$, and $|\Gamma_{d,c}(v)|$, the number of vertices of type $c$ at distance $d$ from $v$ in the neighborhood exploration process of $G_n$, becomes highly important later in this proof, where $c = 1, \ldots, K$. Note that the relation only holds when $|\Gamma_{\le d}(v)| < \omega(n)$ for some $\omega(n)$ such that $\omega(n)/n \to 0$ as $n \to \infty$.

From Theorem 4 for the branching process, we get that with high probability
\[ \left| \frac{\langle \phi_k, Z_t \rangle}{\lambda_k^t} - \langle \phi_k, Z_0 \rangle \right| \le C (\log n)^{3/2}. \]
Now, following the same line of argument as in the proof of Lemma 3, for each $a \in [K]$ with $Z_0(a) = 1$, with high probability we get
\[ Z_t(a) \le \frac{1}{K}\big( \lambda_1^t + (K-1)\lambda_2^t \big) \left[ 1 + C(t+1)^2 (\log n)^{3/2} \right], \]
\[ Z_t(b) \le \frac{\lambda_1^t - \lambda_2^t}{K} \left[ 1 + C(t+1)^2 (\log n)^{3/2} \right], \quad b \in [K] \text{ and } b \ne a. \]
Let $D_1$ be the integer part of $(1 + 2\eta)\tau'_1$, where $\tau'_1$ is the solution to the equation
\[ \lambda_2^t + \frac{\lambda_1^t - \lambda_2^t}{K} = n^{1/2 - \eta}. \tag{17} \]
Thus, conditioned on survival of the branching process $\mathcal{B}_B(\delta_a)$, $N^a_{D_1, a} \ge n^{1/2+\eta}/2$. Set $D_2 = (1+\eta)\tau'_2$, where $\tau'_2$ is the solution to the equation
\[ \lambda_1^t = n^{1/2+\eta}. \tag{18} \]
Thus, conditioned on survival of the branching process $\mathcal{B}_B(\delta_a)$, $N^a_{D_2, b} \ge n^{1/2+\eta}/2$ for $b = 1, \ldots, K$. Furthermore, $\lim_{d \to \infty} P(N^a_d \ne 0) = \rho(B, a)$, and we have conditioned on the branching process with kernel $B$ surviving; the right-hand side tends to $\rho(B, a) = 1$ as $\eta \to 0$. Hence, given any fixed $\gamma > 0$, if we choose $\eta > 0$ small enough, then for large enough $n$ we have
\[ P\big( \forall b : N^a_{D_2, b} \ge n^{1/2+\eta}/2 \big) = 1, \qquad P\big( N^a_{D_1, a} \ge n^{1/2+\eta}/2 \big) = 1. \]
Now, the neighborhood exploration process and the branching process can be coupled so that, for every $d$, $|\Gamma_d(v)|$ is at most the number $N_d$ of particles in generation $d$ of $\mathcal{B}_B(\delta_a)$, by Lemma 1 and Eq. (16). So, for $v$ of type $a$, with high probability,
\[ |\Gamma_{\le D_1, a}(v)| \le E\left[ \sum_{d=0}^{D_1} N_d \right] = o(n^{2/3}), \qquad |\Gamma_{\le D_2, b}(v)| \le E\left[ \sum_{d=0}^{D_2} N_d \right] = o(n^{2/3}) \]
if $\eta$ is small enough, since $D_1$ is the integer part of $(1+2\eta)\tau'_1$ and $D_2$ is the integer part of $(1+2\eta)\tau'_2$, where $\tau'_1$ and $\tau'_2$ are solutions to Eqs. (17) and (18). Note that the power $2/3$ here is arbitrary; we could have any power in the range $(1/2, 1)$. So now we are in a position to apply Eq. (16), as we have $|\Gamma_{\le D}(v)| \le O(n^{2/3}) < \omega(n)$, with $\omega(n)/n \to 0$.

Now let $v$ and $w$ be two fixed vertices of $G(n, P)$, of types $a$ and $b$ respectively. We explore both their neighborhoods at the same time, stopping either when we reach distance $D$ in both neighborhoods, or when we find an edge from one to the other, in which case $v$ and $w$ are within graph distance $2D + 1$. We consider two independent branching processes $\mathcal{B}_B(a)$ and $\mathcal{B}'_B(b)$, with $N^a_{d,c}$ and $N^b_{d,c}$ vertices of type $c$ in generation $d$, respectively. By the previous argument, with high probability we encounter $o(n)$ vertices in the exploration, so by the argument leading to (16), whp either the explorations meet, or
\[ |\Gamma^a_{d,c}(v)| \ge Z^{(a)}_d(c) - O\big(n^{-1/2} \vee n^{-1/3}\big), \quad c = 1, \ldots, K,\ c \ne a, \]
\[ |\Gamma^b_{d,c}(w)| \ge Z^{(b)}_d(c) - O\big(n^{-1/2} \vee n^{-1/3}\big), \quad c = 1, \ldots, K,\ c \ne b, \]
with the explorations not meeting, where $Z^{(a)}$ is the branching process starting from $Z_0 = \delta_a$, for $a = 1, \ldots, K$.
Using the bounds on $N^a_{d,c}$ and the independence of the branching processes, it follows that for $a = b$,
\[ P\Big( d(v, w) \le 2D_1 + 1 \ \text{ or } \ |\Gamma^a_{D_1,c}(v)|, |\Gamma^a_{D_1,c}(w)| \ge n^{1/2+\eta} \Big) \ge 1 - o(1), \]
and for $a \ne b$,
\[ P\Big( d(v, w) \le 2D_2 + 1 \ \text{ or } \ \forall c : |\Gamma^a_{D_2,c}(v)|, |\Gamma^b_{D_2,c}(w)| \ge n^{1/2+\eta} \Big) \ge 1 - o(1). \]
Write these probabilities as $P(A_j \cup B_j)$, $j = 1, 2$. We now show that $P(A_j^c \cap B_j) \to 0$; since $P(A_j \cup B_j) \to 1$, this gives $P(A_j) \to 1$. We have not examined any edges from $\Gamma_D(v)$ to $\Gamma_D(w)$, so these edges are present independently with their original unconditioned probabilities. For any end vertex types $c_1, c_2$, the expected number of these edges is at least $|\Gamma^a_{D,c_1}(v)|\,|\Gamma^a_{D,c_2}(w)|\, B_{c_1 c_2}/n$ for the first probability and $|\Gamma^a_{D,c_1}(v)|\,|\Gamma^b_{D,c_2}(w)|\, B_{c_1 c_2}/n$ for the second. Choosing $c_1, c_2$ such that $B_{c_1 c_2} > 0$, this expectation is $\Omega\big((n^{1/2+\eta}/2)^2/n\big) = \Omega(n^{2\eta})$. It follows that at least one such edge is present with probability $1 - \exp(-\Omega(n^{2\eta})) = 1 - o(1)$. If such an edge is present, then $d(v, w) \le 2D_1 + 1$ for the first probability and $d(v, w) \le 2D_2 + 1$ for the second. So, the probability that the second event in the above equation holds but not the first is $o(1)$. Thus, the last equation implies that
\[ P(d(v, w) \le 2D_1 + 1) \ge (1-\gamma)^2 - o(1) \ge 1 - 2\gamma - o(1), \]
\[ P(d(v, w) \le 2D_2 + 1) \ge (1-\gamma)^2 - o(1) \ge 1 - 2\gamma - o(1), \]
where $\gamma > 0$ is arbitrary. Choosing $\eta$ small enough, we have $2D + 1 \le (1+\varepsilon)\log(n)/\log \lambda$. As $\gamma$ is arbitrary, we have
\[ P\big(d(v, w) \le (1+\varepsilon)\tau_1\big) \ge 1 - \exp(-\Omega(n^{2\eta})), \qquad P\big(d(v, w) \le (1+\varepsilon)\tau_2\big) \ge 1 - \exp(-\Omega(n^{2\eta})), \]
and the lemma follows.
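The $O(\log n)$ typical-distance behavior that Lemmas 3 and 4 together establish is easy to see numerically. The following Python sketch is our own illustration, not part of the paper: the parameters $K = 2$, the kernel $B$, and $n$ are made-up values. It simulates a sparse SBM and compares the average BFS distance within components to $\log n / \log \lambda_1$, where $\lambda_1$ is the top eigenvalue of the mean offspring matrix $(B_{ab}\pi_b)$.

```python
# Illustrative check (not from the paper): in a sparse SBM, typical graph
# distances in the giant component grow like log(n)/log(lambda_1).
# K=2, B, and n below are made-up parameters for illustration only.
import math
import random
from collections import deque

random.seed(7)

n, K = 1000, 2
B = [[8.0, 2.0], [2.0, 8.0]]       # edge prob between types a, b is B[a][b]/n
types = [i % K for i in range(n)]  # balanced blocks, pi = (1/2, 1/2)

# Sample the SBM adjacency lists.
adj = [[] for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        if random.random() < B[types[i]][types[j]] / n:
            adj[i].append(j)
            adj[j].append(i)

def bfs_dists(s):
    """Graph distances from source s to every reachable vertex."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

# Average distance over pairs reachable from a few source vertices.
tot, cnt = 0, 0
for s in range(0, n, 100):         # 10 sources
    d = bfs_dists(s)
    tot += sum(d.values())
    cnt += len(d) - 1
avg = tot / cnt

# Mean offspring matrix M_ab = B_ab * pi_b = [[4,1],[1,4]], so lambda_1 = 5.
lam1 = 5.0
print(avg, math.log(n) / math.log(lam1))
```

With these parameters the prediction is $\log(1000)/\log 5 \approx 4.3$, and the simulated average should land close to that value.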
The equations (6) and (8) control the asymptotic bounds for the graph distance $d_G(v, w)$ between two vertices $v$ and $w$ in $V(G_n)$. Under condition (A3), it follows that $\lambda_2^2 > \lambda_1$. If we write $\lambda_2^2 = c\lambda_1$, where $c$ is a constant, then equations (6) and (8) can be written in the form of quadratic equations. So the solutions $\tau_1$ and $\tau_2$ exist, under the condition that $c^{\tau_1}$ and $c^{\tau_2}$ are of order $O(n)$, and the resulting solutions $\tau_1$ and $\tau_2$ are both of order $O(\log n)$. Also, from the expressions for the solutions $\tau_1$ and $\tau_2$, the limits of $\frac{\tau_1}{\log n}$ and $\frac{\tau_2}{\log n}$ exist; we define these limits as $\sigma_1$ and $\sigma_2$ respectively.

4.3 Proof of Theorem 2 and Theorem 3

4.3.1 Proof of Theorem 2

We prove the limiting behavior of the typical graph distance in the giant component as $n \to \infty$. The theorem essentially follows from Lemmas 3 and 4. Under the conditions stated in the theorem, part (a) follows from Lemmas 3(a) and 4(a), and part (b) follows from Lemmas 3(b) and 4(b).

4.3.2 Proof of Theorem 3

From Definition 4, $D_{ij}$ is the graph distance between vertices $v_i$ and $v_j$, where $v_i, v_j \in V(G_n)$. From Lemma 3, we get for any vertices $v$ and $w$, with high probability,
\[ |\{\{v, w\} : d_G(v, w) \le (1-\varepsilon)\tau_1\}| \le O(n^{2-\varepsilon}), \quad \text{if type of } v = \text{type of } w, \]
\[ |\{\{v, w\} : d_G(v, w) \le (1-\varepsilon)\tau_2\}| \le O(n^{2-\varepsilon}), \quad \text{if type of } v \ne \text{type of } w. \]
Also, from Lemma 4, we get
\[ P\big(d_G(v, w) < (1+\varepsilon)\tau_1\big) = 1 - \exp(-\Omega(n^{2\eta})), \quad \text{if type of } v = \text{type of } w, \]
\[ P\big(d_G(v, w) < (1+\varepsilon)\tau_2\big) = 1 - \exp(-\Omega(n^{2\eta})), \quad \text{if type of } v \ne \text{type of } w. \]
Now, $\sigma_1 = \tau_1/\log n$ and $\sigma_2 = \tau_2/\log n$ are asymptotically constant, as both $\tau_1$ and $\tau_2$ are of order $\log n$, as follows from equations (6) and (8).
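The claim that $\tau_1$ and $\tau_2$ are of order $\log n$, so that $\tau_1/\log n$ and $\tau_2/\log n$ stabilize to the constants $\sigma_1$ and $\sigma_2$, can be checked numerically. The following minimal Python sketch is our own illustration; the values $\lambda_1 = 5$, $\lambda_2 = 3$, $K = 2$ are made-up. It solves the two defining equations by bisection:

```python
# Numerical illustration (ours, not from the paper): the equations defining
# tau_1 and tau_2,
#   lambda_2^t + (lambda_1^t - lambda_2^t)/K = n      (tau_1)
#   (lambda_1^t - lambda_2^t)/K = n                   (tau_2)
# have solutions of order log n, so tau/log n stabilizes (the limits called
# sigma_1 and sigma_2 in the text). lambda_1=5, lambda_2=3, K=2 are made-up.
import math

lam1, lam2, K = 5.0, 3.0, 2

def solve(f, lo=1e-9, hi=200.0, iters=200):
    """Bisection for the root of an increasing function f on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def tau1(n):
    return solve(lambda t: lam2**t + (lam1**t - lam2**t) / K - n)

def tau2(n):
    return solve(lambda t: (lam1**t - lam2**t) / K - n)

for n in [1e4, 1e6, 1e8]:
    print(n, tau1(n) / math.log(n), tau2(n) / math.log(n))
```

Both ratios approach $1/\log \lambda_1 \approx 0.621$ from above as $n$ grows, and $\tau_2 > \tau_1$, matching the fact that same-type pairs are typically closer than different-type pairs.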
So, putting the two statements together, we get that with high probability,
\[ \sum_{\substack{i,j=1 \\ \mathrm{type}(v_i) \ne \mathrm{type}(v_j)}}^{n} \left( \frac{D_{ij}}{\log n} - \mathcal{D}_{ij} \right)^2 = O(n^{2-\varepsilon}) + O(n^2)\,\varepsilon^2, \]
since, by Lemma 1, $\varepsilon = o(1)$ and $(1 - \exp(-\Omega(n^{2\eta})))^{n^2} \to 1$ as $n \to \infty$. So, putting the two cases together, we get that with high probability, for some $\varepsilon > 0$,
\[ \sum_{i,j=1}^{n} \left( \frac{D_{ij}}{\log n} - \mathcal{D}_{ij} \right)^2 = O(n^{2-\varepsilon}) + O(n^2)\,\varepsilon^2 = o(n^2). \]
Hence, for some $\varepsilon > 0$,
\[ \left\| \frac{D}{\log n} - \mathcal{D} \right\|_F \le o(n). \]
This completes the proofs of Theorems 2 and 3.

4.4 Perturbation Theory of Linear Operators

We now establish part II of our program. $D$ can be considered as a perturbation of the limiting operator $\mathcal{D}$. The Davis-Kahan theorem [13] gives a bound on the perturbation of an eigenspace instead of an eigenvector, as discussed previously.

Theorem 6 (Davis-Kahan (1970) [13]). Let $H, H' \in \mathbb{R}^{n \times n}$ be symmetric, suppose $\mathcal{V} \subset \mathbb{R}$ is an interval, and suppose for some positive integer $d$ that $W, W' \in \mathbb{R}^{n \times d}$ are such that the columns of $W$ form an orthonormal basis for the sum of the eigenspaces of $H$ associated with the eigenvalues of $H$ in $\mathcal{V}$, and that the columns of $W'$ form an orthonormal basis for the sum of the eigenspaces of $H'$ associated with the eigenvalues of $H'$ in $\mathcal{V}$. Let $\delta$ be the minimum distance between any eigenvalue of $H$ in $\mathcal{V}$ and any eigenvalue of $H$ not in $\mathcal{V}$. Then there exists an orthogonal matrix $R \in \mathbb{R}^{d \times d}$ such that
\[ \| WR - W' \|_F \le \frac{\sqrt{2}\, \| H - H' \|_F}{\delta}. \]

4.5 Proof of Theorem 1

The behavior of the eigenvalues of the limiting operator $\mathcal{D}$ can be stated as follows.

Lemma 5.
Under our model, the eigenvalues of $\mathcal{D}$, $|\mu_1(\mathcal{D})| \ge |\mu_2(\mathcal{D})| \ge \cdots \ge |\mu_n(\mathcal{D})|$, can be bounded as follows:
\[ \mu_1(\mathcal{D}) = O(n\sigma_1), \quad |\mu_K(\mathcal{D})| = O\big(n(\sigma_1 - \sigma_2)\big), \quad \mu_{K+1}(\mathcal{D}) = \cdots = \mu_n(\mathcal{D}) = -\sigma_1. \tag{19} \]
Also, with high probability, it holds that $|\mu_K(D/\log n)| = O(n(\sigma_1 - \sigma_2))$ and $\mu_{K+1}(D/\log n) \le o(n)$.

Proof. The matrix $\mathcal{D} + \sigma_1 I_{n \times n}$ is a block matrix with blocks of sizes $\{n_a\}_{a=1}^K$, with $\sum_{a=1}^K n_a = n$. The elements of the $(a,b)$-th block are all the same: equal to $\sigma_1$ if $a = b$, and equal to $\sigma_2$ if $a \ne b$. Note that the diagonal of $\mathcal{D}$ is zero, as the diagonal of $D$ is also zero. Now, the eigenvalues of the $K \times K$ matrix of the values in $\mathcal{D}$ are $(\sigma_1 + (K-1)\sigma_2, \sigma_1 - \sigma_2, \ldots, \sigma_1 - \sigma_2)$. If we write $\lambda_2^2 = c\lambda_1$, then for $c > 1$ we have $\sigma_1 > \sigma_2$; so, under our model, $\sigma_1 > \sigma_2$. Because of the repetitions in the block matrix, $\mu_1(\mathcal{D}) = O(n\sigma_1) = O(n)$ and $\mu_K(\mathcal{D}) = O(n(\sigma_1 - \sigma_2)) = O(n)$, since by assumption (A3), $n_a = O(n)$ for all $a = 1, \ldots, K$. The rest of the eigenvalues of $\mathcal{D} + \sigma_1 I_{n \times n}$ are zero, so the rest of the eigenvalues of $\mathcal{D}$ are $-\sigma_1$.

For the second part of the lemma: by Weyl's inequality, for all $i = 1, \ldots, n$,
\[ \big|\, |\mu_i(D/\log n)| - |\mu_i(\mathcal{D})| \,\big| \le \| D/\log n - \mathcal{D} \|_F \le o(n). \]
Since from (A1)-(A3) it follows that $\sigma_1 - \sigma_2 > c > 0$ for some constant $c$, we get $|\mu_K(D/\log n)| = O(n(\sigma_1 - \sigma_2)) - o(n) = O(n(\sigma_1 - \sigma_2))$ for large $n$, and $|\mu_{K+1}(D/\log n)| \le \sigma_1 + o(n) = o(n)$.

Now, let $W$ be the matrix of eigenvectors corresponding to the top $K$ absolute eigenvalues of $\mathcal{D}$ and $\tilde{W}$ the matrix of eigenvectors corresponding to the top $K$ absolute eigenvalues of $D/\log n$. Using the Davis-Kahan theorem, we get

Lemma 6. With high probability, there exists an orthogonal matrix $R \in \mathbb{R}^{K \times K}$ such that
\[ \| WR - \tilde{W} \|_F \le o\big((\sigma_1 - \sigma_2)^{-1}\big). \]

Proof.
The top $K$ absolute eigenvalues of both $\mathcal{D}$ and $D/\log n$ lie in $(Cn, \infty)$ for some $C > 0$. Also, the gap between the $K$-th and $(K+1)$-th eigenvalues of the matrix $\mathcal{D}$ is $\delta = O(n(\sigma_1 - \sigma_2))$. So we can apply the Davis-Kahan theorem (Theorem 6) together with Theorem 3 to get
\[ \| WR - \tilde{W} \|_F \le \frac{\sqrt{2}\, \| D/\log n - \mathcal{D} \|_F}{\delta} \le \frac{o(n)}{O(n(\sigma_1 - \sigma_2))} = o\big((\sigma_1 - \sigma_2)^{-1}\big). \]

Now, the relationship between the rows of $W$ can be specified as follows.

Lemma 7. For any two rows $u_i, u_j$ of the $n \times K$ matrix $W$, $\| u_i - u_j \|_2 \ge O(1/\sqrt{n})$ if type of $v_i \ne$ type of $v_j$.

Proof. The matrix $\mathcal{D} + \sigma_1 I_{n \times n}$ is a block matrix with blocks of sizes $\{n_a\}_{a=1}^K$, with $\sum_{a=1}^K n_a = n$; the elements of the $(a,b)$-th block are all equal to $\sigma_1$ if $a = b$ and to $\sigma_2$ if $a \ne b$, and the diagonal of $\mathcal{D}$ is zero. The rows of the eigenvectors of the $K \times K$ matrix of the values in $\mathcal{D}$ differ by a constant. Under our model, $\sigma_1 > \sigma_2$. So, because of the repetitions in the block matrix, the rows of the projection of $\mathcal{D}$ onto its top-$K$ eigenspace that correspond to different types differ by order $O(n^{-1/2})$.

Now, consider the $K$-means criterion as the clustering criterion on $\tilde{W}$; for the $K$-means minimizer, the centroid matrix $C$ is an $n \times K$ matrix with $K$ distinct rows corresponding to the $K$ centroids of the $K$-means algorithm. By the property of the $K$-means objective function and Lemma 6, with high probability,
\[ \| C - \tilde{W} \|_F \le \| WR - \tilde{W} \|_F, \]
\[ \| C - WR \|_F \le \| C - \tilde{W} \|_F + \| WR - \tilde{W} \|_F, \]
\[ \| C - WR \|_F^2 \le 4\, \| WR - \tilde{W} \|_F^2 \le o\big((\sigma_1 - \sigma_2)^{-2}\big). \]
By Lemma 7, for large $n$ we can find a constant $C$ such that the $K$ balls $B_1, \ldots, B_K$ of radius $r = C n^{-1/2}$ around the $K$ distinct rows of $W$ are disjoint.
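The block-constant structure underlying Lemma 7 can be illustrated numerically: rows of the top-$K$ eigenvector matrix of the idealized operator coincide within a type and are separated by $\Theta(n^{-1/2})$ across types. The NumPy sketch below is our own illustration; the values $\sigma_1 = 2$, $\sigma_2 = 1$, $n = 300$, $K = 3$ are made-up, and we work with the idealized block matrix directly rather than an estimated geodesic matrix.

```python
# Numerical sketch (ours) of the row-separation property: for the idealized
# block matrix (sigma_1 within blocks, sigma_2 across, zero diagonal), rows
# of the top-K eigenvector matrix W coincide within a type and are separated
# by Theta(1/sqrt(n)) across types. All parameter values are made-up.
import numpy as np

n, K = 300, 3
sigma1, sigma2 = 2.0, 1.0
types = np.repeat(np.arange(K), n // K)          # equal blocks of size 100

# Block-constant matrix, then subtract sigma1*I so the diagonal is zero.
M = np.where(types[:, None] == types[None, :], sigma1, sigma2).astype(float)
D_bar = M - sigma1 * np.eye(n)

vals, vecs = np.linalg.eigh(D_bar)
top = np.argsort(-np.abs(vals))[:K]              # top-K by absolute eigenvalue
W = vecs[:, top]                                 # n x K spectral embedding

# Same-type rows agree; different-type rows are ~ c/sqrt(n) apart.
d_same = np.linalg.norm(W[0] - W[1])             # vertices 0, 1: same type
d_diff = np.linalg.norm(W[0] - W[-1])            # first and last: different types
print(d_same, np.sqrt(n) * d_diff)
```

Because the top-$K$ eigenvectors are constant on blocks, `d_same` is zero up to floating-point error, while $\sqrt{n}\,\cdot$`d_diff` equals $\sqrt{2K} \approx 2.449$ in this setup, so balls of radius $Cn^{-1/2}$ around the $K$ distinct rows are indeed disjoint.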
Now note that, with high probability, the number of rows $i$ such that $\| C_i - (WR)_i \| > r$ is at most $cn(\sigma_1 - \sigma_2)^{-2}$, for an arbitrarily small constant $c > 0$. If the statement does not hold, then
\[ \| C - WR \|_F^2 > r^2 \cdot cn(\sigma_1 - \sigma_2)^{-2} \ge C n^{-1} \cdot cn(\sigma_1 - \sigma_2)^{-2} = \Omega\big((\sigma_1 - \sigma_2)^{-2}\big), \]
and we get a contradiction, since $\| C - WR \|_F^2 \le o\big((\sigma_1 - \sigma_2)^{-2}\big)$. Thus, the number of mistakes is at most $cn(\sigma_1 - \sigma_2)^{-2}$, with arbitrarily small constant $c > 0$.

So, for each $v_i \in V(G_n)$, if $c(v_i)$ is the type of $v_i$ and $\hat{c}(v_i)$ is the type of $v_i$ as estimated by applying $K$-means to the top-$K$ eigenspace of the geodesic matrix $D$, we get that, for an arbitrarily small constant $c > 0$,
\[ P\left[ \frac{1}{n} \sum_{i=1}^n \mathbb{1}\big(c(v_i) \ne \hat{c}(v_i)\big) < c(\sigma_1 - \sigma_2)^{-2} \right] \to 1. \]
So, for constant $\sigma_1$ and $\sigma_2$, we can find $c > 0$ such that
\[ P\left[ \frac{1}{n} \sum_{i=1}^n \mathbb{1}\big(c(v_i) \ne \hat{c}(v_i)\big) < \frac{1}{2} \right] \to 1. \]

5 Conclusion

We have given an overview of spectral clustering in the context of community detection in networks. We have also introduced a new method of community detection and shown bounds on its theoretical performance.

References

1. Abbe, E., Bandeira, A.S., Hall, G.: Exact recovery in the stochastic block model. arXiv preprint arXiv:1405.3267 (2014)
2. Amini, A.A., Chen, A., Bickel, P.J., Levina, E.: Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist. 41(4), 2097–2122 (2013). DOI 10.1214/13-AOS1138
3. Amini, A.A., Levina, E.: On semidefinite relaxations for the block model. arXiv preprint arXiv:1406.5647 (2014)
4. Athreya, K.B., Ney, P.E.: Branching Processes, vol. 28. Springer-Verlag, Berlin (1972)
5. Bhamidi, S., van der Hofstad, R., Hooghiemstra, G.: First passage percolation on the Erdős–Rényi random graph.
Combinatorics, Probability & Computing 20(5), 683–707 (2011)
6. Bhattacharyya, S., Bickel, P.J.: Community detection in networks using graph distance. arXiv preprint arXiv:1401.3915 (2014)
7. Bickel, P., Choi, D., Chang, X., Zhang, H.: Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Ann. Statist. 41(4), 1922–1943 (2013). DOI 10.1214/13-AOS1124
8. Bickel, P.J., Chen, A.: A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences 106(50), 21068–21073 (2009)
9. Bollobás, B., Janson, S., Riordan, O.: The phase transition in inhomogeneous random graphs. Random Structures & Algorithms 31(1), 3–122 (2007)
10. Bordenave, C., Lelarge, M., Massoulié, L.: Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs. arXiv preprint arXiv:1501.06087 (2015)
11. Celisse, A., Daudin, J.J., Pierre, L.: Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat. 6, 1847–1899 (2012). DOI 10.1214/12-EJS729
12. Chatelin, F.: Spectral Approximation of Linear Operators. SIAM (1983)
13. Davis, C., Kahan, W.M.: The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis 7(1), 1–46 (1970)
14. Decelle, A., Krzakala, F., Moore, C., Zdeborová, L.: Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E 84(6), 066106 (2011)
15. Fiedler, M.: Algebraic connectivity of graphs. Czechoslovak Math. J. 23(98), 298–305 (1973)
16. Floyd, R.W.: Algorithm 97: Shortest path. Communications of the ACM 5(6), 345 (1962)
17. Gao, C., Ma, Z., Zhang, A.Y., Zhou, H.H.
: Achieving optimal misclassification proportion in stochastic block model. arXiv preprint arXiv:1505.03772 (2015)
18. Girvan, M., Newman, M.E.: Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99(12), 7821–7826 (2002)
19. Hartigan, J.A.: Clustering Algorithms. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York-London-Sydney (1975)
20. Holland, P.W., Laskey, K.B., Leinhardt, S.: Stochastic blockmodels: First steps. Social Networks 5(2), 109–137 (1983)
21. Johnson, D.B.: Efficient algorithms for shortest paths in sparse networks. Journal of the ACM 24(1), 1–13 (1977)
22. Kato, T.: Perturbation Theory for Linear Operators, vol. 132. Springer (1995)
23. von Luxburg, U., Belkin, M., Bousquet, O.: Consistency of spectral clustering. Ann. Statist. 36(2), 555–586 (2008). DOI 10.1214/009053607000000640
24. Massoulié, L.: Community detection thresholds and the weak Ramanujan property. In: Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pp. 694–703. ACM (2014)
25. Mode, C.J.: Multitype Branching Processes: Theory and Applications, vol. 34. American Elsevier Pub. Co. (1971)
26. Mossel, E., Neeman, J., Sly, A.: Stochastic block models and reconstruction. arXiv preprint arXiv:1202.1499 (2012)
27. Mossel, E., Neeman, J., Sly, A.: A proof of the block model threshold conjecture. arXiv preprint arXiv:1311.4115 (2013)
28. Ng, A.Y., Jordan, M.I., Weiss, Y., et al.: On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 2, 849–856 (2002)
29. Rohe, K., Chatterjee, S., Yu, B.: Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist. 39(4), 1878–1915 (2011). DOI 10.1214/11-AOS887
30. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons, Inc., New York (1987). DOI 10.1002/0471725382
31. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
32. Sussman, D.L., Tang, M., Fishkind, D.E., Priebe, C.E.: A consistent adjacency spectral embedding for stochastic blockmodel graphs. J. Amer. Statist. Assoc. 107(499), 1119–1128 (2012). DOI 10.1080/01621459.2012.699795
33. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4), 395–416 (2007)
34. Warshall, S.: A theorem on Boolean matrices. Journal of the ACM 9(1), 11–12 (1962)