Concentration and regularization of random graphs


Authors: Can M. Le, Elizaveta Levina, Roman Vershynin

Abstract. This paper studies how close random graphs are typically to their expectations. We interpret this question through the concentration of the adjacency and Laplacian matrices in the spectral norm. We study inhomogeneous Erdős–Rényi random graphs on $n$ vertices, where edges form independently and possibly with different probabilities $p_{ij}$. Sparse random graphs whose expected degrees are $o(\log n)$ fail to concentrate; the obstruction is caused by vertices with abnormally high and low degrees. We show that concentration can be restored if we regularize the degrees of such vertices, and one can do this in various ways. As an example, let us reweight or remove enough edges to make all degrees bounded above by $O(d)$, where $d = \max np_{ij}$. Then we show that the resulting adjacency matrix $A'$ concentrates with the optimal rate: $\|A' - \mathbb{E}A\| = O(\sqrt{d})$. Similarly, if we make all degrees bounded below by $d$ by adding weight $d/n$ to all edges, then the resulting Laplacian concentrates with the optimal rate: $\|\mathcal{L}(A') - \mathcal{L}(\mathbb{E}A')\| = O(1/\sqrt{d})$. Our approach is based on Grothendieck–Pietsch factorization, using which we construct a new decomposition of random graphs. We illustrate the concentration results with an application to the community detection problem in the analysis of networks.

1. Introduction

Many classical and modern results in probability theory, starting from the Law of Large Numbers, can be expressed as concentration of random objects about their expectations. The objects studied most are sums of independent random variables, martingales, and nice functions on product probability spaces and metric measure spaces. For a panoramic exposition of concentration phenomena in modern probability theory and related fields, the reader is referred to the books [25, 9].
This paper studies concentration properties of random graphs. The first step of such a study should be to decide how to interpret the statement that a random graph $G$ concentrates near its expectation. To do this, it will be useful to look at the graph $G$ through the lens of the matrices classically associated with $G$, namely the adjacency and Laplacian matrices. Let us first build the theory for the adjacency matrix $A$; the Laplacian will be discussed in Section 1.5. We may say that $G$ concentrates about its expectation if $A$ is close to its expectation $\mathbb{E}A$ in some natural matrix norm; we interpret the expectation of $G$ as the weighted graph with adjacency matrix $\mathbb{E}A$. Various matrix norms could be of interest here. In this paper, we study concentration in the spectral norm. This automatically gives us tight control of all eigenvalues and eigenvectors, according to Weyl's and Davis–Kahan perturbation inequalities (see [5, Sections III.2 and VII.3]). Concentration of random graphs interpreted this way, and also of general random matrices, has been studied in several communities, in particular in random matrix theory, combinatorics and network science.

Date: August 10, 2016. E. L. is partially supported by NSF grants DMS-1159005 and DMS-1521551. R. V. is partially supported by NSF grant 1265782 and U.S. Air Force grant FA9550-14-1-0009. This work was done while C. L. was a Ph.D. student at the University of Michigan.

We will study random graphs generated from an inhomogeneous Erdős–Rényi model $G(n, (p_{ij}))$, where edges are formed independently with given probabilities $p_{ij}$, see [7]. This is a generalization of the classical Erdős–Rényi model $G(n, p)$ where all edge probabilities $p_{ij}$ equal $p$.
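As an aside (our illustration, not part of the paper), sampling from the inhomogeneous model $G(n, (p_{ij}))$ can be sketched in a few lines; the probability matrix `P` below is an arbitrary toy choice.

```python
import numpy as np

def sample_inhomogeneous_er(P, rng):
    """Sample a symmetric adjacency matrix from G(n, (p_ij)).

    Each edge {i, j}, i < j, appears independently with probability P[i, j].
    """
    n = P.shape[0]
    coins = rng.random((n, n)) < P      # independent coin flips
    A = np.triu(coins, k=1)             # keep i < j only; no self-loops
    return (A | A.T).astype(int)        # symmetrize

rng = np.random.default_rng(0)
n = 8
P = np.full((n, n), 0.3)                # classical G(n, p) with p = 0.3
A = sample_inhomogeneous_er(P, rng)
```

The classical $G(n,p)$ is recovered by taking all entries of `P` equal, and a stochastic block model by making `P` piecewise constant on blocks.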
Many popular graph models arise as special cases of $G(n, (p_{ij}))$, such as the stochastic block model, a benchmark model in the analysis of networks [22] discussed in Section 1.7, and random subgraphs of given graphs. Often, the question of interest is estimating some features of the probability matrix $(p_{ij})$ from random graphs drawn from $G(n, (p_{ij}))$. Concentration of the adjacency and Laplacian matrices around their expectations, when it holds, guarantees that such features can be recovered. As an example of this use of our concentration results, we will show that if $(p_{ij})$ has a block structure, the blocks can be accurately estimated from a single realization of $G(n, (p_{ij}))$ even when the average vertex degree is bounded.

1.1. Dense graphs concentrate. The cleanest concentration results are available for the classical Erdős–Rényi model $G(n,p)$ in the dense regime. In terms of the expected degree $d = pn$, we have with high probability that

$$\|A - \mathbb{E}A\| = 2\sqrt{d}\,(1+o(1)) \quad \text{if } d \gg \log^4 n, \tag{1.1}$$

see [16, 44, 28]. Since $\|\mathbb{E}A\| = d$, we see that the typical deviation here behaves like the square root of the magnitude of the expectation, just like in many other classical results of probability theory. In other words, dense random graphs concentrate well. The lower bound on density in (1.1) can be essentially relaxed all the way down to $d = \Omega(\log n)$. Thus, with high probability we have

$$\|A - \mathbb{E}A\| = O(\sqrt{d}) \quad \text{if } d = \Omega(\log n). \tag{1.2}$$

This result was proved in [15] based on the method developed by J. Kahn and E. Szemerédi [17]. More generally, (1.2) holds for any inhomogeneous Erdős–Rényi model $G(n, (p_{ij}))$ with maximal expected degree $d = \max_i \sum_j p_{ij}$. This generalization can be deduced from a recent result of A. S. Bandeira and R. van Handel [4, Corollary 3.6], while a weaker bound $O(\sqrt{d \log n})$ follows from concentration inequalities for sums of independent random matrices [35]. Alternatively, an argument in [15] can be used to prove (1.2) for a somewhat larger but still useful value

$$d = \max_{ij} np_{ij}, \tag{1.3}$$

see [27, 12]. The same can be obtained by using Seginer's bound on random matrices [20]. As we will see shortly, our paper provides an alternative and completely different approach to general concentration results like (1.2).

1.2. Sparse graphs do not concentrate. In the sparse regime, where the expected degree $d$ is bounded, concentration breaks down. According to [24], a random graph from $G(n,p)$ satisfies with high probability

$$\|A\| = (1+o(1))\sqrt{d(A)} = (1+o(1))\sqrt{\frac{\log n}{\log\log n}} \quad \text{if } d = O(1), \tag{1.4}$$

where $d(A)$ denotes the maximal degree of the graph (a random quantity). So in this regime we have $\|A\| \gg \|\mathbb{E}A\| = d$, which shows that sparse random graphs do not concentrate.

What exactly makes the norm of $A$ abnormally large in the sparse regime? The answer is the vertices with too high degrees. In the dense case where $d \gtrsim \log n$, all vertices typically have approximately the same degrees $(1+o(1))d$. This no longer happens in the sparser regime $d \ll \log n$; the degrees do not cluster tightly about the same value anymore. There are vertices with too high degrees; they are captured by the second equality in (1.4). Even a single high-degree vertex can blow up the norm of the adjacency matrix. Indeed, since the norm of $A$ is bounded below by the Euclidean norm of each of its rows, we have $\|A\| \ge \sqrt{d(A)}$.

1.3. Regularization enforces concentration. If high-degree vertices destroy concentration, can we "tame" these vertices? One proposal would be to remove these vertices from the graph altogether. U. Feige and E. Ofek [15] showed that this works for $G(n,p)$: the removal of the high-degree vertices enforces concentration. Indeed, if we drop all vertices with degrees, say, larger than $2d$, then the remaining part of the graph satisfies

$$\|A' - \mathbb{E}A'\| = O(\sqrt{d}) \tag{1.5}$$

with high probability, where $A'$ denotes the adjacency matrix of the new graph. The argument in [15] is based on the method developed by J. Kahn and E. Szemerédi [17]. It extends to the inhomogeneous Erdős–Rényi model $G(n, (p_{ij}))$ with $d$ defined in (1.3), see [27, 12]. As we will see, our paper provides an alternative and completely different approach to such results.

Although the removal of high-degree vertices solves the concentration problem, such a solution is not ideal, since those vertices are in some sense the most important ones. In real-world networks, the vertices with highest degrees are "hubs" that hold the network together. Their removal would cause the network to break down into disconnected components, which leads to a considerable loss of structural information. Would it be possible to regularize the graph in a more gentle way: instead of removing the high-degree vertices, reduce the weights of their edges just enough to keep the degrees bounded by $O(d)$? The main result of our paper states that this is true. Let us first state this result informally; Theorem 2.1 provides a more general and formal statement.

Theorem 1.1 (Concentration of regularized adjacency matrices). Consider a random graph from the inhomogeneous Erdős–Rényi model $G(n, (p_{ij}))$, and let $d = \max_{ij} np_{ij}$. For all high-degree vertices of the graph (say, those with degrees larger than $2d$), reduce the weights of the edges incident to them in an arbitrary way, but so that all degrees of the new (weighted) graph become bounded by $2d$. Then, with high probability, the adjacency matrix $A'$ of the new graph concentrates:

$$\|A' - \mathbb{E}A\| = O(\sqrt{d}).$$

Moreover, instead of requiring that the degrees become bounded by $2d$, we can require that the $\ell_2$ norms of the rows of the new adjacency matrix become bounded by $\sqrt{2d}$.

1.4. Examples of graph regularization. The regularization procedure in Theorem 1.1 is very flexible. Depending on how one chooses the weights, one can obtain as special cases several results we summarized earlier, as well as some new ones.

1. Do not do anything to the graph. In the dense regime where $d = \Omega(\log n)$, all degrees are already bounded by $2d$ with high probability. This means that the original graph satisfies $\|A - \mathbb{E}A\| = O(\sqrt{d})$. Thus we recover the result of U. Feige and E. Ofek (1.2), which states that dense random graphs concentrate well.

2. Remove all high-degree vertices. If we remove all vertices with degrees larger than $2d$, we recover another result of U. Feige and E. Ofek (1.5), which states that the removal of the high-degree vertices enforces concentration.

3. Remove just enough edges from high-degree vertices. Instead of removing the high-degree vertices with all of their edges, we can remove just enough edges to make all degrees bounded by $2d$. This milder regularization still produces the concentration bound (1.5).

4. Reduce the weight of edges proportionally to the excess of degrees. Instead of removing edges, we can reduce the weight of the existing edges, a procedure which better preserves the structure of the graph. For instance, we can assign weight $\sqrt{\lambda_i \lambda_j}$ to the edge between vertices $i$ and $j$, choosing $\lambda_i := \min(2d/d_i, 1)$ where $d_i$ is the degree of vertex $i$. One can check that this makes the $\ell_2$ norms of all rows of the adjacency matrix bounded by $\sqrt{2d}$.
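The reweighting construction in item 4 is easy to carry out numerically. The following is a minimal sketch of ours (not the authors' code), applied to an arbitrary toy random graph; it checks the claimed bound on the squared row $\ell_2$ norms.

```python
import numpy as np

def reweight_excess_degrees(A, d):
    """Item 4: give edge {i, j} weight sqrt(lam_i * lam_j), where
    lam_i = min(2d / d_i, 1), so only vertices with degree > 2d are dampened."""
    degrees = A.sum(axis=1)
    lam = np.minimum(2 * d / np.maximum(degrees, 1), 1.0)  # guard degree 0
    return A * np.sqrt(np.outer(lam, lam))

# toy input: a G(n, p) graph
rng = np.random.default_rng(1)
n, p = 200, 0.05
A = rng.random((n, n)) < p
A = np.triu(A, 1)
A = (A | A.T).astype(float)
d = n * p

A_reg = reweight_excess_degrees(A, d)
# row i: sum_j lam_i * lam_j * A_ij^2 <= lam_i * d_i <= 2d
assert ((A_reg ** 2).sum(axis=1) <= 2 * d + 1e-9).all()
```

The assertion mirrors the one-line calculation in the text: since $\lambda_j \le 1$ and $A_{ij} \in \{0,1\}$, the squared $\ell_2$ norm of row $i$ is at most $\lambda_i d_i \le 2d$.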
By Theorem 1.1, such a regularization procedure leads to the same concentration bound (1.5).

1.5. Concentration of the Laplacian. So far, we have looked at random graphs through the lens of their adjacency matrices. A different matrix that captures the geometry of a graph is the (symmetric, normalized) Laplacian matrix, defined as

$$\mathcal{L}(A) = D^{-1/2}(D - A)D^{-1/2} = I - D^{-1/2}AD^{-1/2}. \tag{1.6}$$

Here $I$ is the identity matrix and $D = \mathrm{diag}(d_i)$ is the diagonal matrix with the degrees $d_i = \sum_{j=1}^n A_{ij}$ on the diagonal. The reader is referred to [13] for an introduction to graph Laplacians and their role in spectral graph theory. Here we mention just two basic facts: the spectrum of $\mathcal{L}(A)$ is a subset of $[0, 2]$, and the smallest eigenvalue is always zero.

Concentration of Laplacians of random graphs has been studied in [35, 11, 39, 23, 18]. Just like the adjacency matrix, the Laplacian is known to concentrate in the dense regime where $d = \Omega(\log n)$, and it fails to concentrate in the sparse regime. However, the obstructions to concentration are opposite. For the adjacency matrix, as we mentioned, the trouble is caused by high-degree vertices. For the Laplacian, the problem lies with low-degree vertices. In particular, for $d = o(\log n)$ the graph is likely to have isolated vertices; they produce multiple zero eigenvalues of $\mathcal{L}(A)$, which are easily seen to destroy concentration.

In analogy with our discussion of adjacency matrices, we can try to regularize the graph to "tame" the low-degree vertices in various ways, for example remove the low-degree vertices, connect them to some other vertices, artificially increase the degrees $d_i$ in the definition (1.6) of the Laplacian, and so on. Here we will focus on the following simple way of regularization, proposed in [3] and analyzed in [23, 18].
Choose $\tau > 0$ and add the same number $\tau/n$ to all entries of the adjacency matrix $A$, thereby replacing it with $A_\tau := A + (\tau/n)\mathbf{1}\mathbf{1}^\mathsf{T}$ in the definition (1.6) of the Laplacian. This regularization raises all degrees $d_i$ to $d_i + \tau$. If we choose $\tau \sim d$, the regularized graph does not have low-degree vertices anymore. The following consequence of Theorem 1.1 shows that such regularization indeed forces the Laplacian to concentrate. Here we state this result informally; Theorem 4.1 provides a more formal statement.

Theorem 1.2 (Concentration of the regularized Laplacian). Consider a random graph from the inhomogeneous Erdős–Rényi model $G(n, (p_{ij}))$, and let $d = \max_{ij} np_{ij}$. Choose a number $\tau \sim d$. Then, with high probability, the regularized Laplacian $\mathcal{L}(A_\tau)$ concentrates:

$$\|\mathcal{L}(A_\tau) - \mathcal{L}(\mathbb{E}A_\tau)\| = O\Big(\frac{1}{\sqrt{d}}\Big).$$

We will deduce this result from Theorem 1.1 in Section 4. Theorem 1.2 is an improvement upon a bound in [18] that had an extra $\log d$ factor; it was conjectured there that the logarithmic factor is not needed, and Theorem 1.2 confirms this conjecture.

1.6. A numerical experiment. To conclude our discussion of various ways to regularize sparse graphs, let us illustrate the effect of regularization by a numerical experiment. Consider an inhomogeneous Erdős–Rényi graph with $n = 1000$ vertices, 90% of which have expected degree 7 and 10% of which have expected degree 35. We then regularize the graph by reducing the weights of edges proportionally to the excess of degrees, just as described in Section 1.4, item 4, except that we use the overall average degree (approximately 10) instead of $d$ (which results in a more severe weight reduction, suitable for our illustration purpose). Figure 1 shows the histogram of the spectrum of $A$ (left) and $A'$ (right).
As we can see, the high-degree vertices lead to the long tails in the histogram of the eigenvalues, and regularization shrinks these tails toward the bulk.

Figure 1. Histogram of the spectrum of the adjacency matrix $A$ (left) and the regularized adjacency matrix $A'$ (right) for a sparse random graph generated from the inhomogeneous Erdős–Rényi model with $n = 1000$ vertices, 90% of which have expected degree 7 and 10% of which have expected degree 35.

1.7. Application: community detection in networks. Concentration of random graphs has an important application to the statistical analysis of networks, in particular to the problem of community detection. A common way of modeling communities in networks is the stochastic block model [22], which is a special case of the inhomogeneous Erdős–Rényi model considered in this paper. For the purpose of this example, we focus on the simplest version of the stochastic block model $G(n, a/n, b/n)$, also known as the balanced planted partition model, defined as follows. The set of vertices is divided into two subsets (communities) of size $n/2$ each. Edges between vertices are drawn independently with probability $a/n$ if they are in the same community and with probability $b/n$ otherwise.

The community detection problem is to recover the community labels of vertices from a single realization of the random graph model. A large literature exists on both the recovery algorithms and the theory establishing when recovery is possible [14, 33, 34, 32, 1, 29, 8]. There are methods that perform better than a random guess (i.e. the fraction of misclassified vertices is bounded away from $0.5$ as $n \to \infty$ with high probability) under the condition

$$(a - b)^2 > 2(a + b),$$

and no method can perform better than a random guess if this condition is violated.
Moreover, strong consistency, or exact recovery (labeling all vertices correctly with high probability), is possible when the expected degree $(a+b)/2$ is of order $\log n$ or larger and $a$ and $b$ are sufficiently separated, see [32, 30, 6, 20, 10]. Weak consistency (the fraction of mislabeled vertices going to 0 with high probability) is achievable if and only if

$$(a - b)^2 > C_n (a + b) \quad \text{with } C_n \to \infty,$$

see [32]. Many of these results hold in the non-asymptotic regime, for graphs of fixed size $n$. Thus, for any $\varepsilon > 0$ there exists $C_\varepsilon$ (which depends only on $\varepsilon$) such that one can recover communities up to $\varepsilon n$ mislabeled vertices as long as

$$(a - b)^2 > C_\varepsilon (a + b).$$

In particular, recovery of communities is possible even for very sparse graphs, those with bounded expected degrees. Several types of algorithms are known to succeed in this regime, including non-backtracking walks [33, 29, 8], spectral methods [12] and methods based on semidefinite programming [19, 31].

As an application of the new concentration results, we show that regularized spectral clustering [3, 23], one of the simplest and most popular algorithms for community detection, can recover communities in the sparse regime. In general, spectral clustering works by computing the leading eigenvectors of either the adjacency matrix or the Laplacian, or their regularized versions, and running the $k$-means clustering algorithm on these eigenvectors to recover the node labels. In the simple case of the model $G(n, a/n, b/n)$, one can simply assign nodes to communities based on the sign (positive or negative) of the corresponding entries of the eigenvector $v_2(\mathcal{L}(A_\tau))$ corresponding to the second smallest eigenvalue of the regularized Laplacian matrix $\mathcal{L}(A_\tau)$ (or the regularized adjacency matrix $A'$).
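A minimal sketch of this sign-based regularized spectral clustering (our illustration, not the authors' implementation; the parameters $a = 15$, $b = 3$ are arbitrary values chosen to sit comfortably above the detection threshold):

```python
import numpy as np

def regularized_spectral_clustering(A, tau):
    """Two-community case: form the regularized Laplacian L(A_tau) and split
    vertices by the sign of the eigenvector of its second smallest eigenvalue."""
    n = A.shape[0]
    A_tau = A + tau / n                     # A + (tau/n) * 11^T
    d = A_tau.sum(axis=1)                   # regularized degrees, all positive
    s = 1.0 / np.sqrt(d)
    L = np.eye(n) - s[:, None] * A_tau * s[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)    # eigenvalues in ascending order
    return np.sign(eigvecs[:, 1])           # second smallest

# planted partition G(n, a/n, b/n)
rng = np.random.default_rng(2)
n, a, b = 1000, 15.0, 3.0
labels = np.repeat([1, -1], n // 2)
P = np.where(np.equal.outer(labels, labels), a / n, b / n)
A = rng.random((n, n)) < P
A = np.triu(A, 1)
A = (A | A.T).astype(float)

tau = A.sum() / n                           # average degree, as in Corollary 1.4
guess = regularized_spectral_clustering(A, tau)
accuracy = max(np.mean(guess == labels), np.mean(guess == -labels))
```

Since the sign of an eigenvector is arbitrary, accuracy is measured up to a global sign flip, matching the $\min_{\beta = \pm 1}$ in Corollary 1.4.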
Let us briefly explain how our concentration results validate regularized spectral clustering. If the concentration of random graphs holds and $\mathcal{L}(A_\tau)$ is close to $\mathcal{L}(\mathbb{E}A_\tau)$, then standard perturbation theory (the Davis–Kahan theorem below) shows that $v_2(\mathcal{L}(A_\tau))$ is close to $v_2(\mathcal{L}(\mathbb{E}A_\tau))$, and in particular, the signs of these two eigenvectors must agree on most vertices. An easy calculation shows that the signs of $v_2(\mathcal{L}(\mathbb{E}A_\tau))$ recover the communities exactly: this vector is a positive constant on one community and a negative constant on the other. Therefore, the signs of $v_2(\mathcal{L}(A_\tau))$ must recover the communities up to a small fraction of misclassified vertices.

Before stating our result, let us quote a simple version of the Davis–Kahan perturbation theorem (see e.g. [5, Theorem VII.3.2]).

Theorem 1.3 (Davis–Kahan theorem). Let $X$, $Y$ be symmetric matrices such that the second smallest eigenvalues of $X$ and $Y$ have multiplicity one and are at distance at least $\delta > 0$ from the remaining eigenvalues of $X$ and $Y$. Denote by $x$ and $y$ the eigenvectors of $X$ and $Y$ corresponding to the second smallest eigenvalues of $X$ and $Y$, respectively. Then

$$\min_{\beta = \pm 1} \|x + \beta y\| \le \frac{2\|X - Y\|}{\delta}.$$

Corollary 1.4 (Community detection in sparse graphs). Let $\varepsilon > 0$ and $r \ge 1$. Let $A$ be the adjacency matrix drawn from the stochastic block model $G(n, a/n, b/n)$. Assume that

$$(a - b)^2 > C_\varepsilon (a + b) \tag{1.7}$$

where $C_\varepsilon = C r^4 \varepsilon^{-2}$ and $C$ is an appropriately large absolute constant. Choose $\tau$ to be the average degree of the graph, i.e. $\tau = (d_1 + \cdots + d_n)/n$ where $d_i$ are the vertex degrees. Then with probability at least $1 - e^{-r}$, we have

$$\min_{\beta = \pm 1} \|v_2(\mathcal{L}(A_\tau)) + \beta\, v_2(\mathcal{L}(\mathbb{E}A_\tau))\| \le \varepsilon.$$

In particular, the signs of the entries of $v_2(\mathcal{L}(A_\tau))$ correctly estimate the partition into the two communities, up to at most $\varepsilon n$ misclassified vertices.

Proof. We apply Theorem 1.3 with $X = \mathcal{L}(A_\tau)$ and $Y = \mathcal{L}(\mathbb{E}A_\tau)$. A simple calculation shows that the spectral gap $\delta$ defined in Theorem 1.3 is of the order $(a-b)/(a+b)$. The claim of Corollary 1.4 then follows from the Davis–Kahan Theorem 1.3, Concentration Theorem 4.1 (which is a formal version of Theorem 1.2), and condition (1.7). □

1.8. Organization of the paper. In Section 2, we state a formal version of Theorem 1.1. We show there how to deduce this result from a new decomposition of random graphs, which we state as Theorem 2.6. We prove this decomposition theorem in Section 3. In Section 4, we state and prove a formal version of Theorem 1.2 about the concentration of the Laplacian. We conclude the paper with Section 5, where we propose some questions for further investigation.

Acknowledgement. The authors are grateful to Ramon van Handel for several insightful comments on the preliminary version of this paper.

2. Full version of Theorem 1.1, and reduction to a graph decomposition

In this section we state a more general and quantitative version of Theorem 1.1, and we reduce it to a new form of graph decomposition, which can be of interest on its own.

Theorem 2.1 (Concentration of regularized adjacency matrices). Consider a random graph from the inhomogeneous Erdős–Rényi model $G(n, (p_{ij}))$, and let $d = \max_{ij} np_{ij}$. For any $r \ge 1$, the following holds with probability at least $1 - n^{-r}$. Consider any subset consisting of at most $10n/d$ vertices, and reduce the weights of the edges incident to those vertices in an arbitrary way. Let $d'$ be the maximal degree of the resulting graph. Then the adjacency matrix $A'$ of the new (weighted) graph satisfies

$$\|A' - \mathbb{E}A\| \le C r^{3/2} \big(\sqrt{d} + \sqrt{d'}\big).$$

Moreover, the same bound holds for $d'$ being the maximal $\ell_2$ norm of the rows of $A'$.

In this result and in the rest of the paper, $C, C_1, C_2, \dots$ denote absolute constants whose values may be different from line to line.

Remark 2.2 (Theorem 2.1 implies Theorem 1.1). The subset of $10n/d$ vertices in Theorem 2.1 can be completely arbitrary. So let us choose the high-degree vertices, say those with degrees larger than $2d$. There are at most $10n/d$ such vertices with high probability; this follows by an easy calculation, and also from Lemma 3.5. Thus we immediately deduce Theorem 1.1.

Remark 2.3 (Tight upper bound). If we do not reduce the weights of any edges and $d$ is bounded, then the upper bound in Theorem 2.1 is tight (up to a constant depending on $r$). This is because of (1.4), which states that the adjacency matrix does not concentrate in the sparse regime without regularization.

Remark 2.4 (Method to prove Theorem 2.1). One may wonder if Theorem 2.1 can be proved by developing an $\varepsilon$-net argument similar to the method of J. Kahn and E. Szemerédi [17] and its versions [2, 15, 27, 12]. Although we cannot rule out such a possibility, we do not see how this method could handle a general regularization. The reader familiar with the method can easily notice an obstacle: the contribution of the so-called light couples becomes hard to control when one changes, and even reduces, the individual entries of $A$ (the weights of edges). We will develop an alternative and somewhat simpler approach, which is able to handle a general regularization of random graphs. It sheds light on the specific structure of graphs that enables concentration. We are going to identify this structure through a graph decomposition in the next section. But let us pause briefly to mention the following useful reduction.

Remark 2.5 (Reduction to directed graphs). Our arguments will be more convenient to carry out if the adjacency matrix $A$ has all independent entries.
To be able to make this assumption, we can decompose $A$ into its upper-triangular and lower-triangular parts, both of which have independent entries. If we can show that each of these parts concentrates about its expectation, it follows by the triangle inequality that $A$ concentrates about $\mathbb{E}A$. In other words, we may prove Theorem 2.1 for directed inhomogeneous Erdős–Rényi graphs, where edges between any vertices and in any direction appear independently with probabilities $p_{ij}$. In the rest of the argument, we will only work with such random directed graphs.

2.1. Graph decomposition. In this section, we reduce Theorem 2.1 to the following decomposition of inhomogeneous Erdős–Rényi directed random graphs. This decomposition may have an independent interest. Throughout the paper, we denote by $B_N$ the matrix which coincides with a matrix $B$ on a subset of edges $N \subset [n] \times [n]$ and has zero entries elsewhere.

Theorem 2.6 (Graph decomposition). Consider a random directed graph from the inhomogeneous Erdős–Rényi model, and let $d$ be as in (1.3). For any $r \ge 1$, the following holds with probability at least $1 - 3n^{-r}$. One can decompose the set of edges $[n] \times [n]$ into three classes $N$, $R$ and $C$ so that the following properties are satisfied for the adjacency matrix $A$.

• The graph concentrates on $N$, namely $\|(A - \mathbb{E}A)_N\| \le C r^{3/2} \sqrt{d}$.
• Each row of $A_R$ and each column of $A_C$ has at most $32r$ ones. Moreover, $R$ intersects at most $n/d$ columns, and $C$ intersects at most $n/d$ rows of $[n] \times [n]$.

Figure 2 illustrates a possible decomposition Theorem 2.6 can provide. The edges in $N$ form a big "core" where the graph concentrates well even without regularization. The edges in $R$ and $C$ can be thought of (at least heuristically) as those attached to high-degree vertices.

Figure 2. An example of the graph decomposition in Theorem 2.6.
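The restriction notation $B_N$ used throughout the paper is simple to realize in code; here is a small sketch of ours (purely illustrative), which also checks the partition identity $B = B_N + B_R + B_C$ that underlies the deduction in the next section.

```python
import numpy as np

def restrict(B, N):
    """B_N: the matrix coinciding with B on the edge subset N of [n] x [n]
    and having zero entries elsewhere."""
    out = np.zeros_like(B)
    for i, j in N:
        out[i, j] = B[i, j]
    return out

B = np.arange(9.0).reshape(3, 3)
all_edges = {(i, j) for i in range(3) for j in range(3)}
N = {(0, 0), (1, 2), (2, 1)}
R = {(0, 1), (0, 2)}
C = all_edges - N - R                   # N, R, C partition the edge set

B_N = restrict(B, N)
# a partition of the edges splits the matrix additively
assert np.allclose(B_N + restrict(B, R) + restrict(B, C), B)
```

This additivity is exactly what lets the proof of Theorem 2.1 bound $\|A' - \mathbb{E}A\|$ by the three restricted deviations separately.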
We will prove Theorem 2.6 in Section 3; let us pause to deduce Theorem 2.1 from it.

2.2. Deduction of Theorem 2.1. First, let us explain informally how the graph decomposition could lead to Theorem 2.1. The regularization of the graph does not destroy the properties of $N$, $R$ and $C$ in Theorem 2.6. Moreover, regularization creates a new property for us, allowing for good control of the columns of $R$ and rows of $C$. Let us focus on $A_R$ to be specific. The $\ell_1$ norms of all columns of this matrix are at most $d'$, and the $\ell_1$ norms of all rows are $O(r)$ by Theorem 2.6. By a simple calculation which we will do in Lemma 2.7, this implies that $\|A_R\| = O(\sqrt{r d'})$. A similar bound can be proved for $C$. Combining $N$, $R$ and $C$ together will lead to the error bound $O(r^{3/2}(\sqrt{d} + \sqrt{d'}))$ in Theorem 2.1.

To make this argument rigorous, let us start with the simple calculation we just mentioned.

Lemma 2.7. Consider a matrix $B$ in which each row has $\ell_1$ norm at most $a$, and each column has $\ell_1$ norm at most $b$. Then $\|B\| \le \sqrt{ab}$.

Proof. Let $x$ be a vector with $\|x\|_2 = 1$. Using the Cauchy–Schwarz inequality and the assumptions, we have

$$\|Bx\|_2^2 = \sum_i \Big(\sum_j B_{ij} x_j\Big)^2 \le \sum_i \Big(\sum_j |B_{ij}|\Big)\Big(\sum_j |B_{ij}| x_j^2\Big) \le \sum_i a \sum_j |B_{ij}| x_j^2 = a \sum_j x_j^2 \sum_i |B_{ij}| \le a \sum_j x_j^2\, b = ab.$$

Since $x$ is arbitrary, this completes the proof. □

Remark 2.8 (The Riesz–Thorin interpolation theorem implies Lemma 2.7). Lemma 2.7 can also be deduced from the Riesz–Thorin interpolation theorem (see e.g. [40, Theorem 2.1]), since the maximal $\ell_1$ norm of the columns is the $\ell_1 \to \ell_1$ operator norm, and the maximal $\ell_1$ norm of the rows is the $\ell_\infty \to \ell_\infty$ operator norm.

We are ready to formally deduce the main part of Theorem 2.1 from Theorem 2.6; we defer the "moreover" part to Section 3.6.

Proof of Theorem 2.1 (main part).
Fix a realization of the random graph that satisfies the conclusion of Theorem 2.6, and decompose the deviation $A' - \mathbb{E}A$ as follows:

$$A' - \mathbb{E}A = (A' - \mathbb{E}A)_N + (A' - \mathbb{E}A)_R + (A' - \mathbb{E}A)_C. \tag{2.1}$$

We will bound the spectral norm of each of the three terms separately.

Step 1. Deviation on $N$. Let us further decompose

$$(A' - \mathbb{E}A)_N = (A - \mathbb{E}A)_N - (A - A')_N. \tag{2.2}$$

By Theorem 2.6, $\|(A - \mathbb{E}A)_N\| \le C r^{3/2} \sqrt{d}$. To control the second term in (2.2), denote by $\mathcal{E} \subset [n] \times [n]$ the subset of edges that are reweighted in the regularization process. Since $A$ and $A'$ are equal on $\mathcal{E}^c$, we have

$$\|(A - A')_N\| = \|(A - A')_{N \cap \mathcal{E}}\| \le \|A_{N \cap \mathcal{E}}\| \le \|(A - \mathbb{E}A)_{N \cap \mathcal{E}}\| + \|\mathbb{E}A_{N \cap \mathcal{E}}\|, \tag{2.3}$$

where the first inequality holds since $0 \le A - A' \le A$ entrywise, and the second by the triangle inequality. Further, a simple restriction property implies that

$$\|(A - \mathbb{E}A)_{N \cap \mathcal{E}}\| \le 2\|(A - \mathbb{E}A)_N\|. \tag{2.4}$$

Indeed, restricting a matrix onto a product subset of $[n] \times [n]$ can only reduce its norm. Although the set of reweighted edges $\mathcal{E}$ is not a product subset, it can be decomposed into two product subsets:

$$\mathcal{E} = \big(I \times [n]\big) \cup \big(I^c \times I\big) \tag{2.5}$$

where $I$ is the subset of vertices incident to the edges in $\mathcal{E}$. Then (2.4) holds; the right-hand side of that inequality is bounded by $C r^{3/2} \sqrt{d}$ by Theorem 2.6. Thus we have handled the first term in (2.3).

To bound the second term in (2.3), we can use another restriction property: the norm of a matrix with non-negative entries can only decrease under restriction onto any subset of $[n] \times [n]$ (whether a product subset or not). This yields

$$\|\mathbb{E}A_{N \cap \mathcal{E}}\| \le \|\mathbb{E}A_{\mathcal{E}}\| \le \|\mathbb{E}A_{I \times [n]}\| + \|\mathbb{E}A_{I^c \times I}\|, \tag{2.6}$$

where the second inequality follows from (2.5). By assumption, the matrix $\mathbb{E}A_{I \times [n]}$ has $|I| \le 10n/d$ rows and each of its entries is bounded by $d/n$.
Hence the $\ell_1$ norm of each row is bounded by $d$, and the $\ell_1$ norm of each column is bounded by 10. Lemma 2.7 implies that $\|\mathbb{E}A_{I \times [n]}\| \le \sqrt{10d}$. A similar bound holds for the second term of (2.6). This yields

$$\|\mathbb{E}A_{N \cap \mathcal{E}}\| \le 5\sqrt{d},$$

so we have handled the second term in (2.3). Recalling that the first term there is bounded by $C r^{3/2} \sqrt{d}$, we conclude that $\|(A - A')_N\| \le 2C r^{3/2} \sqrt{d}$. Returning to (2.2), we recall that the first term on the right-hand side is bounded by $C r^{3/2} \sqrt{d}$, and we just bounded the second term by $2C r^{3/2} \sqrt{d}$. Hence

$$\|(A' - \mathbb{E}A)_N\| \le 4C r^{3/2} \sqrt{d}.$$

Step 2. Deviation on $R$ and $C$. By the triangle inequality, we have

$$\|(A' - \mathbb{E}A)_R\| \le \|A'_R\| + \|\mathbb{E}A_R\|.$$

Recall that $0 \le A'_R \le A_R$ entrywise. By Theorem 2.6, each of the rows of $A_R$, and thus also of $A'_R$, has $\ell_1$ norm at most $32r$. Moreover, by the definition of $d'$, each of the columns of $A'$, and thus also of $A'_R$, has $\ell_1$ norm at most $d'$. Lemma 2.7 implies that $\|A'_R\| \le \sqrt{32 r d'}$.

The matrix $\mathbb{E}A_R$ can be handled similarly. By Theorem 2.6, it has at most $n/d$ entries in each row, and all entries are bounded by $d/n$. Thus each row of $\mathbb{E}A_R$ has $\ell_1$ norm at most 1, and each column has $\ell_1$ norm at most $d$. Lemma 2.7 implies that $\|\mathbb{E}A_R\| \le \sqrt{d}$. We have shown that

$$\|(A' - \mathbb{E}A)_R\| \le \sqrt{32 r d'} + \sqrt{d}.$$

A similar bound holds for $\|(A' - \mathbb{E}A)_C\|$. Combining the bounds on the deviations of $A' - \mathbb{E}A$ on $N$, $R$ and $C$ and putting them into (2.1), we conclude that

$$\|A' - \mathbb{E}A\| \le 4C r^{3/2} \sqrt{d} + 2\big(\sqrt{32 r d'} + \sqrt{d}\big).$$

Simplifying this inequality, we complete the proof of the main part of Theorem 2.1. □

3. Proof of Decomposition Theorem 2.6

3.1. Outline of the argument. We will construct the decomposition in Theorem 2.6 by an iterative procedure. The first and crucial step is to find a big block $N' \subset [n] \times [n]$ of size at least $(n - n/d) \times n/2$ on which $A$ concentrates, i.e.

$$\|(A - \mathbb{E}A)_{N'}\| = O(\sqrt{d}).$$
To find such a block, we first establish concentration in the $\ell_\infty \to \ell_2$ norm; this can be done by standard probabilistic techniques. Next, we can automatically upgrade this to concentration in the spectral norm ($\ell_2 \to \ell_2$) once we pass to an appropriate block $\mathcal{N}'$. This can be done using a general result from functional analysis, which we call Grothendieck-Pietsch factorization. Repeating this argument for the transpose, we obtain another block $\mathcal{N}''$ of size at least $n/2 \times (n - n/d)$ on which the graph concentrates as well. So the graph concentrates on $\mathcal{N}_0 := \mathcal{N}' \cup \mathcal{N}''$. The "core" $\mathcal{N}_0$ will form the first part of the class $\mathcal{N}$ we are constructing.

It remains to control the graph on the complement of $\mathcal{N}_0$. That set of edges is quite small; it can be described as a union of a block $\mathcal{C}_0$ with $n/d$ rows, a block $\mathcal{R}_0$ with $n/d$ columns, and an exceptional $n/2 \times n/2$ block; see Figure 3b for an illustration. We may consider $\mathcal{C}_0$ and $\mathcal{R}_0$ as the first parts of the future classes $\mathcal{C}$ and $\mathcal{R}$ we are constructing. Indeed, since $\mathcal{C}_0$ has so few rows, the expected number of ones in each column of $\mathcal{C}_0$ is bounded by $1$. For simplicity, let us think of all columns of $\mathcal{C}_0$ as having $O(1)$ ones, as desired. (In the formal argument, we will add the bad columns to the exceptional block.) Of course, the block $\mathcal{R}_0$ can be handled similarly.

At this point, we have decomposed $[n] \times [n]$ into $\mathcal{N}_0$, $\mathcal{R}_0$, $\mathcal{C}_0$ and an exceptional $n/2 \times n/2$ block. Now we repeat the process for the exceptional block, constructing $\mathcal{N}_1$, $\mathcal{R}_1$ and $\mathcal{C}_1$ there, and so on. Figure 3c illustrates this process. At the end, we choose $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ to be the unions of the blocks $\mathcal{N}_i$, $\mathcal{R}_i$ and $\mathcal{C}_i$, respectively.

¹ In this paper, by a block we mean a product set $I \times J$ with arbitrary index subsets $I, J \subset [n]$. These subsets are not required to be intervals of successive integers.

Figure 3: (a) The core. (b) After the first step. (c) Final decomposition.
Constructing the decomposition iteratively in the proof of Theorem 2.6.

Two precautions have to be taken in this argument. First, we need to make the concentration on the core blocks $\mathcal{N}_i$ better at each step, so that the sum of those error bounds does not depend on the total number of steps. This can be done with little effort, with the help of the exponential decrease of the sizes of the blocks $\mathcal{N}_i$. Second, we have control of the sizes, but not the locations, of the exceptional blocks. Thus, to be able to carry out the decomposition argument inside an exceptional block, we need to make the argument valid uniformly over all blocks of that size. This will require us to be delicate with the probabilistic arguments, so that we can take a union bound over such blocks.

3.2. Grothendieck-Pietsch factorization. As we mentioned in the previous section, our proof of Theorem 2.6 is based on Grothendieck-Pietsch factorization. This general and well-known result in functional analysis [36, 37] has already been used in a similar probabilistic context, see [26, Proposition 15.11]. Grothendieck-Pietsch factorization compares two matrix norms, the $\ell_2 \to \ell_2$ norm (which we call the spectral norm) and the $\ell_\infty \to \ell_2$ norm. For a $k \times m$ matrix $B$, these norms are defined as
$$\|B\| = \max_{\|x\|_2 = 1} \|Bx\|_2, \qquad \|B\|_{\infty \to 2} = \max_{\|x\|_\infty = 1} \|Bx\|_2 = \max_{x \in \{-1,1\}^m} \|Bx\|_2.$$
The $\ell_\infty \to \ell_2$ norm is usually easier to control, since the supremum is taken over the discrete set $\{-1,1\}^m$, and any vector there has all coordinates of the same magnitude. To compare the two norms, one can start with the obvious inequality
$$\frac{\|B\|_{\infty \to 2}}{\sqrt{m}} \le \|B\| \le \|B\|_{\infty \to 2}.$$
Both parts of this inequality are optimal, so there is an unavoidable slack between the upper and lower bounds.
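This two-sided comparison is easy to check numerically on a tiny matrix (an illustrative sketch, not from the paper; the brute force over all $2^m$ sign vectors is only feasible for very small $m$):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
k, m = 5, 6
B = rng.standard_normal((k, m))

# ||B||_{inf->2}: maximize ||Bx||_2 over the 2^m sign vectors x in {-1,1}^m.
inf_to_2 = max(np.linalg.norm(B @ np.array(x))
               for x in product((-1.0, 1.0), repeat=m))

# Spectral (l2 -> l2) norm: the largest singular value of B.
spectral = np.linalg.norm(B, 2)

# The two-sided inequality from the text.
assert inf_to_2 / np.sqrt(m) <= spectral + 1e-9
assert spectral <= inf_to_2 + 1e-9
```

The upper inequality holds because the Euclidean unit ball is contained in the $\ell_\infty$ unit ball; the lower one because every sign vector has $\ell_2$ norm $\sqrt{m}$.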
However, Grothendieck-Pietsch factorization allows us to tighten the inequality by changing $B$ slightly. The next two results offer two ways to change $B$: introduce weights, or pass to a sub-matrix.

Theorem 3.1 (Grothendieck-Pietsch factorization, weighted version). Let $B$ be a $k \times m$ real matrix. Then there exist positive weights $\mu_j$ with $\sum_{j=1}^m \mu_j = 1$ such that
$$\|B\|_{\infty \to 2} \le \|B D_\mu^{-1/2}\| \le \sqrt{\pi/2}\; \|B\|_{\infty \to 2}, \tag{3.1}$$
where $D_\mu = \mathrm{diag}(\mu_j)$ denotes the $m \times m$ diagonal matrix with the weights $\mu_j$ on the diagonal.

This result is a known combination of the Little Grothendieck Theorem (see [41, Corollary 10.10] and [38]) and Pietsch Factorization (see [41, Theorem 9.2]). In an explicit form, a version of this result can be found e.g. in [26, Proposition 15.11]. The weights $\mu_j$ can be computed algorithmically, see [42].

The following related version of Grothendieck-Pietsch factorization can be especially useful in probabilistic contexts, see [26, Proposition 15.11]. Here and in the rest of the paper, we denote by $B_{I \times J}$ the sub-matrix of a matrix $B$ with rows indexed by a subset $I$ and columns indexed by a subset $J$.

Theorem 3.2 (Grothendieck-Pietsch factorization, sub-matrix version). Let $B$ be a $k \times m$ real matrix and $\delta > 0$. Then there exists $J \subseteq [m]$ with $|J| \ge (1 - \delta)m$ such that
$$\|B_{[k] \times J}\| \le \frac{2 \|B\|_{\infty \to 2}}{\sqrt{\delta m}}.$$

Proof. Consider the weights $\mu_j$ given by Theorem 3.1, and choose $J$ to consist of the indices $j$ satisfying $\mu_j \le 1/(\delta m)$. Since $\sum_j \mu_j = 1$, the set $J$ must contain at least $(1 - \delta)m$ indices, as claimed. Furthermore, the diagonal entries of $(D_\mu^{-1/2})_{J \times J}$ are all bounded below by $\sqrt{\delta m}$, which yields
$$\|(B D_\mu^{-1/2})_{[k] \times J}\| \ge \sqrt{\delta m}\; \|B_{[k] \times J}\|.$$
On the other hand, by (3.1) the left-hand side of this inequality is bounded by $2 \|B\|_{\infty \to 2}$. Rearranging the terms, we complete the proof. □

3.3. Concentration on a big block.
We now start working toward constructing the core part $\mathcal{N}$ in Theorem 2.6. In this section we show how to find a big block on which the adjacency matrix $A$ concentrates. First we establish concentration in the $\ell_\infty \to \ell_2$ norm, and then, using Grothendieck-Pietsch factorization, in the spectral norm.

The lemmas of this and the next section are best understood for $m = n$, $I = J = [n]$ and $\alpha = 1$. In this case, we are working with the entire adjacency matrix and trying to make the first step of the iterative procedure. The further steps will require us to handle smaller blocks $I \times J$; the parameter $\alpha$ will then become smaller, in order to achieve better concentration on smaller blocks.

Lemma 3.3 (Concentration in $\ell_\infty \to \ell_2$ norm). Let $1 \le m \le n$ and $\alpha \ge m/n$. Then for $r \ge 1$ the following holds with probability at least $1 - n^{-r}$. Consider a block $I \times J$ of size $m \times m$. Let $I_0$ be the set of indices of the rows of $A_{I \times J}$ that contain at most $\alpha d$ ones. Then
$$\|(A - \mathbb{E}A)_{I_0 \times J}\|_{\infty \to 2} \le C \sqrt{\alpha d m r \log(en/m)}. \tag{3.2}$$

Proof. By definition,
$$\|(A - \mathbb{E}A)_{I_0 \times J}\|_{\infty \to 2}^2 = \max_{x \in \{-1,1\}^m} \sum_{i \in I_0} \Big( \sum_{j \in J} (A_{ij} - \mathbb{E}A_{ij}) x_j \Big)^2 = \max_{x \in \{-1,1\}^m} \sum_{i \in I} (X_i \xi_i)^2 \tag{3.3}$$
where we denoted
$$X_i := \sum_{j \in J} (A_{ij} - \mathbb{E}A_{ij}) x_j, \qquad \xi_i := \mathbf{1}_{\{\sum_{j \in J} A_{ij} \le \alpha d\}}.$$
Let us first fix a block $I \times J$ and a vector $x \in \{-1,1\}^m$, and analyze the independent random variables $X_i \xi_i$. Since $|X_i| \le \sum_{j \in J} |A_{ij} - \mathbb{E}A_{ij}| \le \sum_{j \in J} A_{ij}$, it follows by the definition of $\xi_i$ that
$$|X_i \xi_i| \le \alpha d. \tag{3.4}$$
A more useful bound will follow from Bernstein's inequality. Indeed, $X_i$ is a sum of $m$ independent random variables with zero means and variances at most $d/n$. By Bernstein's inequality, we have
$$\mathbb{P}\{|X_i \xi_i| > tm\} \le \mathbb{P}\{|X_i| > tm\} \le 2 \exp\Big( -\frac{mt^2/2}{d/n + t/3} \Big), \quad t \ge 0. \tag{3.5}$$
For $tm \le \alpha d$, this can be further bounded by $2\exp(-m^2 t^2 / 4\alpha d)$, once we use the assumption $\alpha \ge m/n$. For $tm > \alpha d$, the left-hand side of (3.5) is automatically zero by (3.4). Therefore
$$\mathbb{P}\{|X_i \xi_i| > tm\} \le 2 \exp\Big( -\frac{m^2 t^2}{4 \alpha d} \Big), \quad t \ge 0. \tag{3.6}$$
We are now ready to bound the right-hand side of (3.3). By (3.6), the random variable $X_i \xi_i$ is sub-gaussian² with sub-gaussian norm at most $\sqrt{\alpha d}$. It follows that $(X_i \xi_i)^2$ is sub-exponential with sub-exponential norm at most $2\alpha d$. Using Bernstein's inequality for sub-exponential random variables (see Corollary 5.17 in [43]), we have
$$\mathbb{P}\Big\{ \sum_{i \in I} (X_i \xi_i)^2 > \varepsilon m \alpha d \Big\} \le 2 \exp\big( -c \min(\varepsilon^2, \varepsilon)\, m \big), \quad \varepsilon \ge 0. \tag{3.7}$$
Choosing $\varepsilon := (10/c)\, r \log(en/m)$, we bound this probability by $(en/m)^{-5rm}$. Summarizing, we have proved that for fixed $I, J \subseteq [n]$ and $x \in \{-1,1\}^m$, with probability at least $1 - (en/m)^{-5rm}$, the following holds:
$$\sum_{i \in I} (X_i \xi_i)^2 \le (10/c)\, r \log(en/m) \cdot m \alpha d. \tag{3.8}$$
Taking a union bound over all possibilities of $m, I, J, x$ and using (3.3) and (3.8), we see that the conclusion of the lemma holds with probability at least
$$1 - \sum_{m=1}^{n} 2^m \binom{n}{m}^2 \Big( \frac{en}{m} \Big)^{-5rm} \ge 1 - n^{-r}.$$
The proof is complete. □

² For definitions and basic facts about sub-gaussian random variables, see e.g. [43].

Applying Lemma 3.3 followed by Grothendieck-Pietsch factorization (Theorem 3.2), we obtain the following.

Lemma 3.4 (Concentration in spectral norm). Let $1 \le m \le n$ and $\alpha \ge m/n$. Then for $r \ge 1$ the following holds with probability at least $1 - n^{-r}$. Consider a block $I \times J$ of size $m \times m$. Let $I_0$ be the set of indices of the rows of $A_{I \times J}$ that contain at most $\alpha d$ ones. Then one can find a subset $J_0 \subseteq J$ of at least $3m/4$ columns such that
$$\|(A - \mathbb{E}A)_{I_0 \times J_0}\| \le C \sqrt{\alpha d r \log(en/m)}. \tag{3.9}$$

3.4. Restricted degrees.
The two simple lemmas of this section will help us handle the part of the adjacency matrix outside the core block constructed in Lemma 3.4. First, we show that almost all rows have at most $O(\alpha d)$ ones, and thus are included in the core block.

Lemma 3.5 (Degrees of subgraphs). Let $1 \le m \le n$ and $\alpha \ge \sqrt{m/n}$. Then for $r \ge 1$ the following holds with probability at least $1 - n^{-r}$. Consider a block $I \times J$ of size $m \times m$. Then all but $m/\alpha d$ rows of $A_{I \times J}$ have at most $8r\alpha d$ ones.

Proof. Fix a block $I \times J$, and denote by $d_i$ the number of ones in the $i$-th row of $A_{I \times J}$. Then $\mathbb{E} d_i \le md/n$ by the assumption. Using Chernoff's inequality, we obtain
$$\mathbb{P}\{d_i > 8r\alpha d\} \le \Big( \frac{8r\alpha d}{e\, md/n} \Big)^{-8r\alpha d} \le \Big( \frac{2\alpha n}{m} \Big)^{-8r\alpha d} =: p.$$
Let $S$ be the number of rows $i$ with $d_i > 8r\alpha d$. Then $S$ is a sum of $m$ independent Bernoulli random variables with expectations at most $p$. Again, Chernoff's inequality implies
$$\mathbb{P}\{S > m/\alpha d\} \le (ep\alpha d)^{m/\alpha d} \le p^{m/2\alpha d} = \Big( \frac{2\alpha n}{m} \Big)^{-4rm}.$$
The second inequality here holds since $e\alpha d \le p^{-1/2}$. (To see this, note that the definition of $p$ and the assumption on $\alpha$ imply that $p^{-1/2} = (2\alpha n/m)^{4r\alpha d} \ge 2^{4r\alpha d}$.)

It remains to take a union bound over all blocks $I \times J$. We obtain that the conclusion of the lemma holds with probability at least
$$1 - \sum_{m=1}^{n} \binom{n}{m}^2 \Big( \frac{2\alpha n}{m} \Big)^{-4rm} \ge 1 - n^{-r}.$$
In the last inequality we used the assumption that $\alpha \ge \sqrt{m/n}$. The proof is complete. □

Next, we handle the block of rows that do have too many ones. We show that most columns of this block have $O(1)$ ones.

Lemma 3.6 (More on degrees of subgraphs). Let $1 \le m \le n$ and $\alpha \ge \sqrt{m/n}$. Then for $r \ge 1$ the following holds with probability at least $1 - n^{-r}$. Consider a block $I \times J$ of size $k \times m$ with some $k \le m/\alpha d$. Then all but $m/4$ columns of $A_{I \times J}$ have at most $32r$ ones.

Proof. Fix $I$ and $J$, and denote by $d_j$ the number of ones in the $j$-th column of $A_{I \times J}$.
Then $\mathbb{E} d_j \le kd/n \le m/\alpha n$ by assumption. Using Chernoff's inequality, we have
$$\mathbb{P}\{d_j > 32r\} \le \Big( \frac{32r}{e\, m/\alpha n} \Big)^{-32r} \le \Big( \frac{10\alpha n}{m} \Big)^{-32r} =: p.$$
Let $S$ be the number of columns $j$ with $d_j > 32r$. Then $S$ is a sum of $m$ independent Bernoulli random variables with expectations at most $p$. Again, Chernoff's inequality implies
$$\mathbb{P}\{S > m/4\} \le (4ep)^{m/4} \le p^{m/6} \le \Big( \frac{10\alpha n}{m} \Big)^{-5rm}.$$
The second inequality here holds since $4e < p^{-1/3}$, which in turn follows from the assumption on $\alpha$.

It remains to take a union bound over all blocks $I \times J$. It is enough to consider the blocks with the largest possible number of rows, thus with $k = \lceil m/\alpha d \rceil$. We obtain that the conclusion of the lemma holds with probability at least
$$1 - \sum_{m=1}^{n} \binom{n}{m} \binom{n}{\lceil m/\alpha d \rceil} \Big( \frac{10\alpha n}{m} \Big)^{-5rm} \ge 1 - n^{-r}.$$
In the last inequality we used the assumption that $\alpha \ge \sqrt{m/n}$. The proof is complete. □

3.5. Iterative decomposition: proof of Theorem 2.6. Finally, we combine the tools we have developed so far and construct an iterative decomposition of the adjacency matrix the way we outlined in Section 3.1. Let us set up one step of this procedure, where we consider an $m \times m$ block and decompose almost all of it (everything except an $m/2 \times m/2$ block) into classes $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ satisfying the conclusion of Theorem 2.6. Once we can do this, we repeat the procedure for the $m/2 \times m/2$ block, and so on.

Lemma 3.7 (Decomposition of a block). Let $1 \le m \le n$ and $\alpha \ge \sqrt{m/n}$. Then for $r \ge 1$ the following holds with probability at least $1 - 3n^{-r}$. Consider a block $I \times J$ of size $m \times m$. Then there exists an exceptional sub-block $I_1 \times J_1$ with dimensions at most $m/2 \times m/2$ such that the remaining part of the block, that is $(I \times J) \setminus (I_1 \times J_1)$, can be decomposed into three classes $\mathcal{N}, \mathcal{R} \subset (I \setminus I_1) \times J$ and $\mathcal{C} \subset I \times (J \setminus J_1)$ so that the following holds.
• The graph concentrates on $\mathcal{N}$, namely
$$\|(A - \mathbb{E}A)_{\mathcal{N}}\| \le C r^{3/2} \sqrt{\alpha d \log(en/m)}.$$
• Each row of $A_{\mathcal{R}}$ and each column of $A_{\mathcal{C}}$ has at most $32r$ ones. Moreover, $\mathcal{R}$ intersects at most $m/\alpha d$ columns and $\mathcal{C}$ intersects at most $m/\alpha d$ rows of $I \times J$.

After a permutation of rows and columns, the decomposition of a block stated in Lemma 3.7 can be visualized as in Figure 4c.

Figure 4: Construction of a block decomposition in Lemma 3.7. (a) Initial step. (b) Repeat for transpose. (c) Final decomposition.

Proof. Since we are going to use Lemmas 3.4, 3.5 and 3.6, let us fix a realization of the random graph that satisfies the conclusions of those three lemmas. By Lemma 3.5, all but $m/\alpha d$ rows of $A_{I \times J}$ have at most $8r\alpha d$ ones; let us denote by $I_0$ the set of indices of the rows with at most $8r\alpha d$ ones. Then we can use Lemma 3.4 for the block $I_0 \times J$ and with $\alpha$ replaced by $8r\alpha$; the choice of $I_0$ ensures that all rows have small numbers of ones, as required in that lemma. To control the rows outside $I_0$, we may use Lemma 3.6 for $(I \setminus I_0) \times J$; as we already noted, this block has at most $m/\alpha d$ rows, as required in that lemma. Intersecting the good sets of columns produced by those two lemmas, we obtain a set of at most $m/2$ exceptional columns $J_1 \subset J$ such that the following holds.
• On the block $\mathcal{N}_1 := I_0 \times (J \setminus J_1)$, we have $\|(A - \mathbb{E}A)_{\mathcal{N}_1}\| \le C r^{3/2} \sqrt{\alpha d \log(en/m)}$.
• For the block $\mathcal{C} := (I \setminus I_0) \times (J \setminus J_1)$, all columns of $A_{\mathcal{C}}$ have at most $32r$ ones.
Figure 4a illustrates the decomposition of the block $I \times J$ into the set of exceptional columns indexed by $J_1$ and the good sets $\mathcal{N}_1$ and $\mathcal{C}$. To finish the proof, we apply the above argument to the transpose $A^{\mathsf{T}}$ on the block $J \times I$.
To be precise, we start with the set $J_0 \subset J$ of all but $m/\alpha d$ small columns of $A_{I \times J}$ (those with at most $8r\alpha d$ ones); then we obtain an exceptional set $I_1 \subset I$ of at most $m/2$ rows; and finally we conclude that $A$ concentrates on the block $\mathcal{N}_2 := (I \setminus I_1) \times J_0$ and has small rows on the block $\mathcal{R} := (I \setminus I_1) \times (J \setminus J_0)$. Figure 4b illustrates this decomposition. It only remains to combine the decompositions for $A$ and $A^{\mathsf{T}}$; Figure 4c illustrates the result of the combination. Once we define $\mathcal{N} := \mathcal{N}_1 \cup \mathcal{N}_2$, it becomes clear that $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ have the required properties.³ □

³ It may happen that an entry ends up in more than one of the classes $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$. In such cases, we split the tie arbitrarily.

Proof of Theorem 2.6. Let us fix a realization of the random graph that satisfies the conclusion of Lemma 3.7. Applying that lemma for $m = n$ and with $\alpha = 1$, we decompose the set of edges $[n] \times [n]$ into three classes $\mathcal{N}_0$, $\mathcal{C}_0$ and $\mathcal{R}_0$, plus an $n/2 \times n/2$ exceptional block $I_1 \times J_1$. Apply Lemma 3.7 again, this time for the block $I_1 \times J_1$, for $m = n/2$ and with $\alpha = \sqrt{1/2}$. We decompose $I_1 \times J_1$ into $\mathcal{N}_1$, $\mathcal{C}_1$ and $\mathcal{R}_1$, plus an $n/4 \times n/4$ exceptional block $I_2 \times J_2$. Repeat this process with $\alpha = \sqrt{m/n}$, where $m$ is the running size of the block; we halve this size at each step, and so we have $\alpha_i \le 2^{-i/2}$. Figure 3c illustrates a decomposition that we may obtain this way. In a finite number of steps (actually, in $O(\log n)$ steps) the exceptional block becomes empty, and the process terminates. At that point we have decomposed the set of edges $[n] \times [n]$ into $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$, defined as the unions of the blocks $\mathcal{N}_i$, $\mathcal{R}_i$ and $\mathcal{C}_i$, respectively, which we obtained at each step. It is clear that $\mathcal{R}$ and $\mathcal{C}$ satisfy the required properties. It remains to bound the deviation of $A$ on $\mathcal{N}$. By construction, each $\mathcal{N}_i$ satisfies
$$\|(A - \mathbb{E}A)_{\mathcal{N}_i}\| \le C r^{3/2} \sqrt{\alpha_i d \log(e/\alpha_i^2)}.$$
Thus, by triangle inequality we have
$$\|(A - \mathbb{E}A)_{\mathcal{N}}\| \le \sum_{i \ge 0} C r^{3/2} \sqrt{\alpha_i d \log(e/\alpha_i^2)} \le C' r^{3/2} \sqrt{d}.$$
In the second inequality we used that $\alpha_i \le 2^{-i/2}$, which forces the series to converge. The proof of Theorem 2.6 is complete. □

3.6. Replacing the degrees by the $\ell_2$ norms in Theorem 2.1. Let us now prove the "moreover" part of Theorem 2.1, where $d'$ is the maximal squared $\ell_2$ norm of the rows and columns of the regularized adjacency matrix $A'$. This is clearly a stronger statement than the main part of the theorem. Indeed, since all entries of $A'$ are bounded in absolute value by $1$, each degree, being the $\ell_1$ norm of a row, is bounded below by the squared $\ell_2$ norm.

This strengthening is in fact easy to check. To do so, note that the definition of $d'$ was used only once in the proof of Theorem 2.1, namely in Step 2, where we bounded the norms of $A'_{\mathcal{R}}$ and $A'_{\mathcal{C}}$. Thus, to obtain the strengthening, it is enough to replace the application of Lemma 2.7 there by the following lemma.

Lemma 3.8. Consider a matrix $B$ with entries in $[0,1]$. Suppose each row of $B$ has at most $a$ non-zero entries, and each column has $\ell_2$ norm at most $\sqrt{b}$. Then $\|B\| \le \sqrt{ab}$.

Proof. To prove the claim, let $x$ be a vector with $\|x\|_2 = 1$. Using the Cauchy-Schwarz inequality and the assumptions, we have
$$\|Bx\|_2^2 = \sum_j \Big( \sum_i B_{ij} x_i \Big)^2 \le \sum_j \Big( \sum_{i:\, B_{ij} \ne 0} B_{ij}^2 \Big) \Big( \sum_{i:\, B_{ij} \ne 0} x_i^2 \Big) \le \sum_j b \sum_{i:\, B_{ij} \ne 0} x_i^2 = b \sum_i x_i^2 \sum_{j:\, B_{ij} \ne 0} 1 \le b \sum_i x_i^2 \cdot a = ab.$$
Since $x$ is arbitrary, this completes the proof. □

4. Concentration of the regularized Laplacian

In this section, we state the following formal version of Theorem 1.2, and we deduce it from the concentration of adjacency matrices (Theorem 2.1).

Theorem 4.1 (Concentration of regularized Laplacians). Consider a random graph from the inhomogeneous Erdös-Rényi model, and let $d$ be as in (1.3). Choose a number $\tau > 0$.
Then, for any $r \ge 1$, with probability at least $1 - e^{-r}$ one has
$$\|\mathcal{L}(A_\tau) - \mathcal{L}(\mathbb{E}A_\tau)\| \le \frac{C r^2}{\sqrt{\tau}} \Big( 1 + \frac{d}{\tau} \Big)^{5/2}.$$

Proof. Two sources contribute to the deviation of the Laplacian: the deviation of the adjacency matrix and the deviation of the degrees. Let us separate them and bound them individually.

Step 1. Decomposing the deviation. Let us write $\bar{A} := \mathbb{E}A$ for simplicity; then
$$E := \mathcal{L}(A_\tau) - \mathcal{L}(\bar{A}_\tau) = D_\tau^{-1/2} A_\tau D_\tau^{-1/2} - \bar{D}_\tau^{-1/2} \bar{A}_\tau \bar{D}_\tau^{-1/2}.$$
Here $D_\tau = \mathrm{diag}(d_i + \tau)$ and $\bar{D}_\tau = \mathrm{diag}(\bar{d}_i + \tau)$ are the diagonal matrices with the degrees of $A_\tau$ and $\bar{A}_\tau$ on the diagonal, respectively. Using the fact that $A_\tau - \bar{A}_\tau = A - \bar{A}$, we can represent the deviation as $E = S + T$, where
$$S = D_\tau^{-1/2} (A - \bar{A}) D_\tau^{-1/2}, \qquad T = D_\tau^{-1/2} \bar{A}_\tau D_\tau^{-1/2} - \bar{D}_\tau^{-1/2} \bar{A}_\tau \bar{D}_\tau^{-1/2}.$$
Let us bound $S$ and $T$ separately.

Step 2. Bounding $S$. Let us introduce a diagonal matrix $\Delta$ that is easier to work with than $D_\tau$. Set $\Delta_{ii} = 1$ if $d_i \le 8rd$, and $\Delta_{ii} = d_i/\tau + 1$ otherwise. Then the entries of $\tau\Delta$ are bounded above by the corresponding entries of $D_\tau$, and so
$$\tau \|S\| \le \|\Delta^{-1/2} (A - \bar{A}) \Delta^{-1/2}\|.$$
Next, by triangle inequality,
$$\tau \|S\| \le \|\Delta^{-1/2} A \Delta^{-1/2} - \bar{A}\| + \|\bar{A} - \Delta^{-1/2} \bar{A} \Delta^{-1/2}\| =: R_1 + R_2. \tag{4.1}$$
In order to bound $R_1$, we use Theorem 2.1 to show that $A' := \Delta^{-1/2} A \Delta^{-1/2}$ concentrates around $\bar{A}$. This should be possible because $A'$ is obtained from $A$ by reducing the degrees that are bigger than $8rd$. To apply the "moreover" part of Theorem 2.1, let us check the magnitude of the $\ell_2$ norms of the rows $A'_i$ of $A'$:
$$\|A'_i\|_2^2 = \sum_{j=1}^{n} \frac{A_{ij}}{\Delta_{ii} \Delta_{jj}} \le \frac{d_i}{\Delta_{ii}} \le \max(8rd, \tau).$$
Here in the first inequality we used that $\Delta_{jj} \ge 1$ and $\sum_j A_{ij} = d_i$; the second inequality follows by the definition of $\Delta_{ii}$.
Applying Theorem 2.1, we obtain that with probability at least $1 - n^{-r}$,
$$R_1 = \|A' - \bar{A}\| \le C_1 r^2 (\sqrt{d} + \sqrt{\tau}).$$
To bound $R_2$, we note that by construction of $\Delta$, the matrices $\bar{A}$ and $\Delta^{-1/2} \bar{A} \Delta^{-1/2}$ coincide on the block $I \times I$, where $I$ is the set of vertices satisfying $d_i \le 8rd$. This block is very large: indeed, Lemma 3.5 implies that $|I^c| \le n/d$ with probability at least $1 - n^{-r}$. Outside this block, i.e. on the small blocks $I^c \times [n]$ and $[n] \times I^c$, the entries of $\bar{A} - \Delta^{-1/2} \bar{A} \Delta^{-1/2}$ are bounded by the corresponding entries of $\bar{A}$, which are all bounded by $d/n$. Thus, using Lemma 2.7, we have
$$R_2 \le \|\bar{A}_{I^c \times [n]}\| + \|\bar{A}_{[n] \times I^c}\| \le 2\sqrt{d}.$$
Substituting the bounds for $R_1$ and $R_2$ into (4.1), we conclude that
$$\|S\| \le \frac{C_2 r^2}{\tau} (\sqrt{d} + \sqrt{\tau})$$
with probability at least $1 - 2n^{-r}$.

Step 3. Bounding $T$. Bounding the spectral norm by the Hilbert-Schmidt norm, we get
$$\|T\| \le \|T\|_{HS} = \Big( \sum_{i,j=1}^{n} T_{ij}^2 \Big)^{1/2}, \quad \text{where} \quad T_{ij} = (\bar{A}_{ij} + \tau/n) \Big[ \frac{1}{\sqrt{\delta_{ij}}} - \frac{1}{\sqrt{\bar{\delta}_{ij}}} \Big]$$
and $\delta_{ij} = (d_i + \tau)(d_j + \tau)$, $\bar{\delta}_{ij} = (\bar{d}_i + \tau)(\bar{d}_j + \tau)$. To bound $T_{ij}$, we note that
$$0 \le \bar{A}_{ij} + \tau/n \le \frac{d + \tau}{n} \quad \text{and} \quad \Big| \frac{1}{\sqrt{\delta_{ij}}} - \frac{1}{\sqrt{\bar{\delta}_{ij}}} \Big| = \Bigg| \frac{\delta_{ij} - \bar{\delta}_{ij}}{\delta_{ij} \sqrt{\bar{\delta}_{ij}} + \bar{\delta}_{ij} \sqrt{\delta_{ij}}} \Bigg| \le \frac{|\delta_{ij} - \bar{\delta}_{ij}|}{2\tau^3}.$$
Recalling the definitions of $\delta_{ij}$ and $\bar{\delta}_{ij}$, and adding and subtracting $(d_i + \tau)(\bar{d}_j + \tau)$, we have
$$\delta_{ij} - \bar{\delta}_{ij} = (d_i + \tau)(d_j - \bar{d}_j) + (\bar{d}_j + \tau)(d_i - \bar{d}_i).$$
So, using the inequality $(a+b)^2 \le 2(a^2 + b^2)$ and bounding $\bar{d}_j + \tau$ by $d + \tau$, we obtain
$$\|T\|^2 \le \frac{(d+\tau)^2}{n^2 \tau^6} \Big[ \sum_{i=1}^{n} (d_i + \tau)^2 \sum_{j=1}^{n} (d_j - \bar{d}_j)^2 + n(d+\tau)^2 \sum_{i=1}^{n} (d_i - \bar{d}_i)^2 \Big]. \tag{4.2}$$
We claim that
$$\sum_{j=1}^{n} (d_j - \bar{d}_j)^2 \le C_3 r^2 nd \quad \text{with probability at least } 1 - e^{-2r}. \tag{4.3}$$
Indeed, since the variance of each $d_j$ is bounded by $d$, the expectation of the sum in (4.3) is bounded by $nd$.
To upgrade the variance bound to an exponential deviation bound, one can use one of several standard methods. For example, Bernstein's inequality implies that $X_j = d_j - \bar{d}_j$ satisfies $\mathbb{P}\{|X_j| > C_4 t \sqrt{d}\} \le e^{-t}$ for all $t \ge 1$. This means that the random variable $X_j^2$ belongs to the Orlicz space $L_{\psi_{1/2}}$ and has norm $\|X_j^2\|_{\psi_{1/2}} \le C_3 d$, see [26]. By triangle inequality, we conclude that $\|\sum_{j=1}^{n} X_j^2\|_{\psi_{1/2}} \le C_3 nd$, which in turn implies (4.3). Furthermore, (4.3) implies
$$\sum_{i=1}^{n} (d_i + \tau)^2 \le 2 \sum_{i=1}^{n} (d_i - \bar{d}_i)^2 + 2 \sum_{i=1}^{n} (\bar{d}_i + \tau)^2 \le 2C_3 r^2 nd + 2n(d+\tau)^2 \le C_5 r^2 n (d+\tau)^2.$$
Substituting this bound and (4.3) into (4.2), we conclude that
$$\|T\|^2 \le \frac{(d+\tau)^2}{n^2 \tau^6} \cdot C_3 r^2 nd \Big[ C_5 r^2 n (d+\tau)^2 + n(d+\tau)^2 \Big] \le \frac{C_6 r^4}{\tau} \Big( 1 + \frac{d}{\tau} \Big)^5.$$
It remains to substitute the bounds for $S$ and $T$ into the inequality $\|E\| \le \|S\| + \|T\|$ and simplify the expression. The resulting bound holds with probability at least $1 - n^{-r} - n^{-r} - e^{-2r} \ge 1 - e^{-r}$, as claimed. □

5. Further questions

5.1. Optimal regularization? The main point of our paper is that regularization helps sparse graphs concentrate. We have discussed several kinds of regularization, and mentioned some more, in Section 1.4. We found that any meaningful regularization works, as long as it reduces the degrees that are too high and increases the degrees that are too low. Is there an optimal way to regularize a graph? Designing the best "preprocessing" of sparse graphs for spectral algorithms is especially interesting from the applied perspective, i.e. for real-world networks. On the theoretical level, can regularization of sparse graphs produce the same optimal bound $2\sqrt{d}(1 + o(1))$ that we saw for dense graphs in (1.1)? Thus, an ideal regularization should bring all parasitic outliers of the spectrum into the bulk.
If so, this would lead to a potentially simple spectral clustering algorithm for community detection in networks which matches the theoretical lower bounds. Algorithms with optimal rates exist for this problem [33, 29], but their analysis is very technical.

5.2. How exactly does concentration depend on regularization? It would be interesting to determine how exactly the concentration of the Laplacian depends on the regularization parameter $\tau$. The dependence in Theorem 4.1 is not optimal, and we have not made an effort to improve it. Although it is natural to choose $\tau \sim d$ as in Theorem 1.2, choosing $\tau \gg d$ could also be useful [23]. Choosing $\tau \ll d$ may be interesting as well, for then $\mathcal{L}(\mathbb{E}A_\tau) \approx \mathcal{L}(\mathbb{E}A)$ and we obtain concentration of $\mathcal{L}(A_\tau)$ around the Laplacian of the expectation of the original (rather than regularized) matrix $\mathbb{E}A$.

5.3. Average expected degree? Both concentration results of this paper, Theorems 1.1 and 1.2, depend on $d = \max_{ij} np_{ij}$. Would it be possible to reduce $d$ to the maximal expected degree $d_{\mathrm{ave}} = \max_i \sum_j p_{ij}$?

5.4. From random graphs to random matrices? Adjacency matrices of random graphs are particular examples of random matrices. Does the phenomenon we described, namely that regularization leads to concentration, apply to general random matrices? Guided by Theorem 1.1, we might expect the following for a broader class of random matrices $B$ with mean-zero independent entries. First, the only reason the spectral norm of $B$ is too large (and is determined by outliers of the spectrum) could be the existence of a large row or column. Furthermore, it might be possible to reduce the norm of $B$ (and thus bring the outliers into the bulk of the spectrum) by regularizing in some way the rows and columns that are too large. For related questions in random matrix theory, see the recent work [4, 21].

References

[1] E. Abbe, A. S. Bandeira, and G. Hall.
Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, 62(1):471–487, 2016.
[2] N. Alon and N. Kahale. A spectral technique for coloring random 3-colorable graphs. SIAM J. Comput., 26:1733–1748, 1997.
[3] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097–2122, 2013.
[4] A. Bandeira and R. van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. Annals of Probability, to appear, 2014.
[5] R. Bhatia. Matrix Analysis. Springer-Verlag, New York, 1996.
[6] P. J. Bickel and A. Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proc. Natl. Acad. Sci. USA, 106:21068–21073, 2009.
[7] B. Bollobás, S. Janson, and O. Riordan. The phase transition in inhomogeneous random graphs. Random Structures and Algorithms, 31:3–122, 2007.
[8] C. Bordenave, M. Lelarge, and L. Massoulié. Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs. Preprint, 2015.
[9] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, 2013.
[10] T. Cai and X. Li. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. Ann. Statist., 43(3):1027–1059, 2015.
[11] K. Chaudhuri, F. Chung, and A. Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. Journal of Machine Learning Research Workshop and Conference Proceedings, 23:35.1–35.23, 2012.
[12] P. Chin, A. Rao, and V. Vu. Stochastic block model and community detection in the sparse graphs: A spectral algorithm with optimal rate of recovery. Preprint, 2015.
[13] F. R. K. Chung. Spectral Graph Theory.
CBMS Regional Conference Series in Mathematics. AMS, 1997.
[14] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84:066106, 2011.
[15] U. Feige and E. Ofek. Spectral techniques applied to sparse random graphs. Random Structures and Algorithms, 27(2):251–275, 2005.
[16] Z. Füredi and J. Komlós. The eigenvalues of random symmetric matrices. Combinatorica, 1(3):233–241, 1981.
[17] J. Friedman, J. Kahn, and E. Szemerédi. On the second eigenvalue in random regular graphs. In Proc. Twenty-First Annual ACM Symposium on Theory of Computing, pages 587–598, 1989.
[18] C. Gao, Z. Ma, A. Y. Zhang, and H. H. Zhou. Achieving optimal misclassification proportion in stochastic block model. Preprint, 2015.
[19] O. Guédon and R. Vershynin. Community detection in sparse networks via Grothendieck's inequality. Probability Theory and Related Fields, to appear, 2014.
[20] B. Hajek, Y. Wu, and J. Xu. Achieving exact cluster recovery threshold via semidefinite programming. arXiv:1412.6156, 2014.
[21] R. van Handel. On the spectral norm of inhomogeneous random matrices. Preprint, 2015.
[22] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: first steps. Social Networks, 5(2):109–137, 1983.
[23] A. Joseph and B. Yu. Impact of regularization on spectral clustering. Ann. Statist., 44(4):1765–1791, 2016.
[24] M. Krivelevich and B. Sudakov. The largest eigenvalue of sparse random graphs. Combin. Probab. Comput., 12:61–72, 2003.
[25] M. Ledoux. The Concentration of Measure Phenomenon, volume 89 of Mathematical Surveys and Monographs. Amer. Math. Society, 2001.
[26] M. Ledoux and M. Talagrand. Probability in Banach spaces: Isoperimetry and processes. Springer-Verlag, Berlin, 1991.
[27] J. Lei and A. Rinaldo. Consistency of spectral clustering in stochastic block models. Ann. Statist., 43(1):215–237, 2015.
[28] L. Lu and X. Peng.
Spectra of edge-independent random graphs. The Electronic Journal of Combinatorics, 20(4), 2013.
[29] L. Massoulié. Community detection thresholds and the weak Ramanujan property. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC '14, pages 694–703, 2014.
[30] F. McSherry. Spectral partitioning of random graphs. In Proc. 42nd FOCS, pages 529–537, 2001.
[31] A. Montanari and S. Sen. Semidefinite programs on sparse random graphs and their application to community detection. Preprint, 2015.
[32] E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for binary symmetric block models. arXiv:1407.1591, 2014.
[33] E. Mossel, J. Neeman, and A. Sly. A proof of the block model threshold conjecture. Preprint, 2014.
[34] E. Mossel, J. Neeman, and A. Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 2014.
[35] R. Oliveira. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Preprint, 2010.
[36] A. Pietsch. Operator Ideals. North-Holland, Amsterdam, 1978.
[37] G. Pisier. Factorization of linear operators and geometry of Banach spaces. Number 60 in CBMS Regional Conference Series in Mathematics. AMS, Providence, 1986.
[38] G. Pisier. Grothendieck's theorem, past and present. Bulletin (New Series) of the American Mathematical Society, 49(2):237–323, 2012.
[39] T. Qin and K. Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems, pages 3120–3128, 2013.
[40] E. M. Stein and R. Shakarchi. Functional Analysis: Introduction to Further Topics in Analysis. Princeton University Press, 2011.
[41] N. Tomczak-Jaegermann. Banach-Mazur distances and finite-dimensional operator ideals. John Wiley & Sons, Inc., New York, 1989.
[42] J. A. Tropp. Column subset selection, matrix factorization, and eigenvalue optimization. In
Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 978–986, 2009.
[43] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed sensing: theory and applications. Cambridge University Press. Submitted.
[44] V. Vu. Spectral norm of random matrices. Combinatorica, 27(6):721–736, 2007.

Department of Statistics, University of California, Davis, One Shields Ave, Davis, CA 95616, U.S.A.
E-mail address: canle@ucdavis.edu

Department of Statistics, University of Michigan, 1085 S. University Ave, Ann Arbor, MI 48109, U.S.A.
E-mail address: elevina@umich.edu

Department of Mathematics, University of Michigan, 530 Church St, Ann Arbor, MI 48109, U.S.A.
E-mail address: romanv@umich.edu
