Concentration and regularization of random graphs


Authors: Can M. Le, Elizaveta Levina, Roman Vershynin

Abstract. This paper studies how close random graphs are typically to their expectations. We interpret this question through the concentration of the adjacency and Laplacian matrices in the spectral norm. We study inhomogeneous Erdős–Rényi random graphs on $n$ vertices, where edges form independently and possibly with different probabilities $p_{ij}$. Sparse random graphs whose expected degrees are $o(\log n)$ fail to concentrate; the obstruction is caused by vertices with abnormally high and low degrees. We show that concentration can be restored if we regularize the degrees of such vertices, and one can do this in various ways. As an example, let us reweight or remove enough edges to make all degrees bounded above by $O(d)$, where $d = \max np_{ij}$. Then we show that the resulting adjacency matrix $A'$ concentrates with the optimal rate: $\|A' - \mathbb{E}A\| = O(\sqrt{d})$. Similarly, if we make all degrees bounded below by $d$ by adding weight $d/n$ to all edges, then the resulting Laplacian concentrates with the optimal rate: $\|\mathcal{L}(A') - \mathcal{L}(\mathbb{E}A')\| = O(1/\sqrt{d})$. Our approach is based on Grothendieck–Pietsch factorization, using which we construct a new decomposition of random graphs. We illustrate the concentration results with an application to the community detection problem in the analysis of networks.

1. Introduction

Many classical and modern results in probability theory, starting from the Law of Large Numbers, can be expressed as concentration of random objects about their expectations. The objects studied most are sums of independent random variables, martingales, and nice functions on product probability spaces and metric measure spaces. For a panoramic exposition of concentration phenomena in modern probability theory and related fields, the reader is referred to the books [25, 9].
This paper studies concentration properties of random graphs. The first step of such a study should be to decide how to interpret the statement that a random graph $G$ concentrates near its expectation. To do this, it will be useful to look at the graph $G$ through the lens of the matrices classically associated with $G$, namely the adjacency and Laplacian matrices. Let us first build the theory for the adjacency matrix $A$; the Laplacian will be discussed in Section 1.5. We may say that $G$ concentrates about its expectation if $A$ is close to its expectation $\mathbb{E}A$ in some natural matrix norm; we interpret the expectation of $G$ as the weighted graph with adjacency matrix $\mathbb{E}A$. Various matrix norms could be of interest here. In this paper, we study concentration in the spectral norm. This automatically gives us tight control of all eigenvalues and eigenvectors, according to Weyl's and Davis–Kahan perturbation inequalities (see [5, Sections III.2 and VII.3]). Concentration of random graphs interpreted this way, and also of general random matrices, has been studied in several communities, in particular in random matrix theory, combinatorics and network science.

Date: August 10, 2016. E. L. is partially supported by NSF grants DMS-1159005 and DMS-1521551. R. V. is partially supported by NSF grant 1265782 and U.S. Air Force grant FA9550-14-1-0009. This work was done while C. L. was a Ph.D. student at the University of Michigan.

We will study random graphs generated from an inhomogeneous Erdős–Rényi model $G(n, (p_{ij}))$, where edges are formed independently with given probabilities $p_{ij}$, see [7]. This is a generalization of the classical Erdős–Rényi model $G(n, p)$ where all edge probabilities $p_{ij}$ equal $p$.
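As an aside (our illustration, not part of the paper), sampling from the inhomogeneous model $G(n, (p_{ij}))$ can be sketched in a few lines; the probability matrix `P` below is an arbitrary toy choice.

```python
import numpy as np

def sample_inhomogeneous_er(P, rng):
    """Sample a symmetric adjacency matrix from G(n, (p_ij)).

    Each edge {i, j}, i < j, appears independently with probability P[i, j].
    """
    n = P.shape[0]
    coins = rng.random((n, n)) < P      # independent coin flips
    A = np.triu(coins, k=1)             # keep i < j only; no self-loops
    return (A | A.T).astype(int)        # symmetrize

rng = np.random.default_rng(0)
n = 8
P = np.full((n, n), 0.3)                # classical G(n, p) with p = 0.3
A = sample_inhomogeneous_er(P, rng)
```

The classical $G(n,p)$ is recovered by taking all entries of `P` equal, and a stochastic block model by making `P` piecewise constant on blocks.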
Many popular graph models arise as special cases of $G(n, (p_{ij}))$, such as the stochastic block model, a benchmark model in the analysis of networks [22] discussed in Section 1.7, and random subgraphs of given graphs. Often, the question of interest is estimating some features of the probability matrix $(p_{ij})$ from random graphs drawn from $G(n, (p_{ij}))$. Concentration of the adjacency and Laplacian matrices around their expectations, when it holds, guarantees that such features can be recovered. As an example of this use of our concentration results, we will show that if $(p_{ij})$ has a block structure, the blocks can be accurately estimated from a single realization of $G(n, (p_{ij}))$ even when the average vertex degree is bounded.

1.1. Dense graphs concentrate. The cleanest concentration results are available for the classical Erdős–Rényi model $G(n,p)$ in the dense regime. In terms of the expected degree $d = pn$, we have with high probability that

$$\|A - \mathbb{E}A\| = 2\sqrt{d}\,(1+o(1)) \quad \text{if } d \gg \log^4 n, \tag{1.1}$$

see [16, 44, 28]. Since $\|\mathbb{E}A\| = d$, we see that the typical deviation here behaves like the square root of the magnitude of the expectation, just like in many other classical results of probability theory. In other words, dense random graphs concentrate well. The lower bound on density in (1.1) can be essentially relaxed all the way down to $d = \Omega(\log n)$. Thus, with high probability we have

$$\|A - \mathbb{E}A\| = O(\sqrt{d}) \quad \text{if } d = \Omega(\log n). \tag{1.2}$$

This result was proved in [15] based on the method developed by J. Kahn and E. Szemerédi [17]. More generally, (1.2) holds for any inhomogeneous Erdős–Rényi model $G(n, (p_{ij}))$ with maximal expected degree $d = \max_i \sum_j p_{ij}$. This generalization can be deduced from a recent result of A. S. Bandeira and R. van Handel [4, Corollary 3.6], while a weaker bound $O(\sqrt{d \log n})$ follows from concentration inequalities for sums of independent random matrices [35]. Alternatively, an argument in [15] can be used to prove (1.2) for a somewhat larger but still useful value

$$d = \max_{ij} np_{ij}, \tag{1.3}$$

see [27, 12]. The same can be obtained by using Seginer's bound on random matrices [20]. As we will see shortly, our paper provides an alternative and completely different approach to general concentration results like (1.2).

1.2. Sparse graphs do not concentrate. In the sparse regime, where the expected degree $d$ is bounded, concentration breaks down. According to [24], a random graph from $G(n,p)$ satisfies with high probability

$$\|A\| = (1+o(1))\sqrt{d(A)} = (1+o(1))\sqrt{\frac{\log n}{\log\log n}} \quad \text{if } d = O(1), \tag{1.4}$$

where $d(A)$ denotes the maximal degree of the graph (a random quantity). So in this regime we have $\|A\| \gg \|\mathbb{E}A\| = d$, which shows that sparse random graphs do not concentrate.

What exactly makes the norm of $A$ abnormally large in the sparse regime? The answer is the vertices with too high degrees. In the dense case where $d \gtrsim \log n$, all vertices typically have approximately the same degrees $(1+o(1))d$. This no longer happens in the sparser regime $d \ll \log n$; the degrees do not cluster tightly about the same value anymore. There are vertices with too high degrees; they are captured by the second equality in (1.4). Even a single high-degree vertex can blow up the norm of the adjacency matrix. Indeed, since the norm of $A$ is bounded below by the Euclidean norm of each of its rows, we have $\|A\| \ge \sqrt{d(A)}$.

1.3. Regularization enforces concentration. If high-degree vertices destroy concentration, can we "tame" these vertices? One proposal would be to remove these vertices from the graph altogether. U. Feige and E. Ofek [15] showed that this works for $G(n,p)$: the removal of the high-degree vertices enforces concentration. Indeed, if we drop all vertices with degrees, say, larger than $2d$, then the remaining part of the graph satisfies

$$\|A' - \mathbb{E}A'\| = O(\sqrt{d}) \tag{1.5}$$

with high probability, where $A'$ denotes the adjacency matrix of the new graph. The argument in [15] is based on the method developed by J. Kahn and E. Szemerédi [17]. It extends to the inhomogeneous Erdős–Rényi model $G(n, (p_{ij}))$ with $d$ defined in (1.3), see [27, 12]. As we will see, our paper provides an alternative and completely different approach to such results.

Although the removal of high-degree vertices solves the concentration problem, such a solution is not ideal, since those vertices are in some sense the most important ones. In real-world networks, the vertices with highest degrees are "hubs" that hold the network together. Their removal would cause the network to break down into disconnected components, which leads to a considerable loss of structural information. Would it be possible to regularize the graph in a more gentle way: instead of removing the high-degree vertices, reduce the weights of their edges just enough to keep the degrees bounded by $O(d)$? The main result of our paper states that this is true. Let us first state this result informally; Theorem 2.1 provides a more general and formal statement.

Theorem 1.1 (Concentration of regularized adjacency matrices). Consider a random graph from the inhomogeneous Erdős–Rényi model $G(n, (p_{ij}))$, and let $d = \max_{ij} np_{ij}$. For all high-degree vertices of the graph (say, those with degrees larger than $2d$), reduce the weights of the edges incident to them in an arbitrary way, but so that all degrees of the new (weighted) graph become bounded by $2d$. Then, with high probability, the adjacency matrix $A'$ of the new graph concentrates:

$$\|A' - \mathbb{E}A\| = O(\sqrt{d}).$$

Moreover, instead of requiring that the degrees become bounded by $2d$, we can require that the $\ell_2$ norms of the rows of the new adjacency matrix become bounded by $\sqrt{2d}$.

1.4. Examples of graph regularization. The regularization procedure in Theorem 1.1 is very flexible. Depending on how one chooses the weights, one can obtain as special cases several results we summarized earlier, as well as some new ones.

1. Do not do anything to the graph. In the dense regime where $d = \Omega(\log n)$, all degrees are already bounded by $2d$ with high probability. This means that the original graph satisfies $\|A - \mathbb{E}A\| = O(\sqrt{d})$. Thus we recover the result of U. Feige and E. Ofek (1.2), which states that dense random graphs concentrate well.

2. Remove all high-degree vertices. If we remove all vertices with degrees larger than $2d$, we recover another result of U. Feige and E. Ofek (1.5), which states that the removal of the high-degree vertices enforces concentration.

3. Remove just enough edges from high-degree vertices. Instead of removing the high-degree vertices with all of their edges, we can remove just enough edges to make all degrees bounded by $2d$. This milder regularization still produces the concentration bound (1.5).

4. Reduce the weight of edges proportionally to the excess of degrees. Instead of removing edges, we can reduce the weight of the existing edges, a procedure which better preserves the structure of the graph. For instance, we can assign weight $\sqrt{\lambda_i \lambda_j}$ to the edge between vertices $i$ and $j$, choosing $\lambda_i := \min(2d/d_i, 1)$ where $d_i$ is the degree of vertex $i$. One can check that this makes the $\ell_2$ norms of all rows of the adjacency matrix bounded by $\sqrt{2d}$.
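The reweighting construction in item 4 is easy to carry out numerically. The following is a minimal sketch of ours (not the authors' code), applied to an arbitrary toy random graph; it checks the claimed bound on the squared row $\ell_2$ norms.

```python
import numpy as np

def reweight_excess_degrees(A, d):
    """Item 4: give edge {i, j} weight sqrt(lam_i * lam_j), where
    lam_i = min(2d / d_i, 1), so only vertices with degree > 2d are dampened."""
    degrees = A.sum(axis=1)
    lam = np.minimum(2 * d / np.maximum(degrees, 1), 1.0)  # guard degree 0
    return A * np.sqrt(np.outer(lam, lam))

# toy input: a G(n, p) graph
rng = np.random.default_rng(1)
n, p = 200, 0.05
A = rng.random((n, n)) < p
A = np.triu(A, 1)
A = (A | A.T).astype(float)
d = n * p

A_reg = reweight_excess_degrees(A, d)
# row i: sum_j lam_i * lam_j * A_ij^2 <= lam_i * d_i <= 2d
assert ((A_reg ** 2).sum(axis=1) <= 2 * d + 1e-9).all()
```

The assertion mirrors the one-line calculation in the text: since $\lambda_j \le 1$ and $A_{ij} \in \{0,1\}$, the squared $\ell_2$ norm of row $i$ is at most $\lambda_i d_i \le 2d$.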
By Theorem 1.1, such a regularization procedure leads to the same concentration bound (1.5).

1.5. Concentration of the Laplacian. So far, we have looked at random graphs through the lens of their adjacency matrices. A different matrix that captures the geometry of a graph is the (symmetric, normalized) Laplacian matrix, defined as

$$\mathcal{L}(A) = D^{-1/2}(D - A)D^{-1/2} = I - D^{-1/2}AD^{-1/2}. \tag{1.6}$$

Here $I$ is the identity matrix and $D = \mathrm{diag}(d_i)$ is the diagonal matrix with the degrees $d_i = \sum_{j=1}^n A_{ij}$ on the diagonal. The reader is referred to [13] for an introduction to graph Laplacians and their role in spectral graph theory. Here we mention just two basic facts: the spectrum of $\mathcal{L}(A)$ is a subset of $[0, 2]$, and the smallest eigenvalue is always zero.

Concentration of Laplacians of random graphs has been studied in [35, 11, 39, 23, 18]. Just like the adjacency matrix, the Laplacian is known to concentrate in the dense regime where $d = \Omega(\log n)$, and it fails to concentrate in the sparse regime. However, the obstructions to concentration are opposite. For the adjacency matrix, as we mentioned, the trouble is caused by high-degree vertices. For the Laplacian, the problem lies with low-degree vertices. In particular, for $d = o(\log n)$ the graph is likely to have isolated vertices; they produce multiple zero eigenvalues of $\mathcal{L}(A)$, which are easily seen to destroy concentration.

In analogy with our discussion of adjacency matrices, we can try to regularize the graph to "tame" the low-degree vertices in various ways, for example remove the low-degree vertices, connect them to some other vertices, artificially increase the degrees $d_i$ in the definition (1.6) of the Laplacian, and so on. Here we will focus on the following simple way of regularization, proposed in [3] and analyzed in [23, 18].
Choose $\tau > 0$ and add the same number $\tau/n$ to all entries of the adjacency matrix $A$, thereby replacing it with $A_\tau := A + (\tau/n)\mathbf{1}\mathbf{1}^\mathsf{T}$ in the definition (1.6) of the Laplacian. This regularization raises all degrees $d_i$ to $d_i + \tau$. If we choose $\tau \sim d$, the regularized graph does not have low-degree vertices anymore. The following consequence of Theorem 1.1 shows that such regularization indeed forces the Laplacian to concentrate. Here we state this result informally; Theorem 4.1 provides a more formal statement.

Theorem 1.2 (Concentration of the regularized Laplacian). Consider a random graph from the inhomogeneous Erdős–Rényi model $G(n, (p_{ij}))$, and let $d = \max_{ij} np_{ij}$. Choose a number $\tau \sim d$. Then, with high probability, the regularized Laplacian $\mathcal{L}(A_\tau)$ concentrates:

$$\|\mathcal{L}(A_\tau) - \mathcal{L}(\mathbb{E}A_\tau)\| = O\Big(\frac{1}{\sqrt{d}}\Big).$$

We will deduce this result from Theorem 1.1 in Section 4. Theorem 1.2 is an improvement upon a bound in [18] that had an extra $\log d$ factor; it was conjectured there that the logarithmic factor is not needed, and Theorem 1.2 confirms this conjecture.

1.6. A numerical experiment. To conclude our discussion of various ways to regularize sparse graphs, let us illustrate the effect of regularization by a numerical experiment. Consider an inhomogeneous Erdős–Rényi graph with $n = 1000$ vertices, 90% of which have expected degree 7 and 10% of which have expected degree 35. We then regularize the graph by reducing the weights of edges proportionally to the excess of degrees, just as described in Section 1.4, item 4, except that we use the overall average degree (approximately 10) instead of $d$ (which results in a more severe weight reduction, suitable for our illustration purpose). Figure 1 shows the histogram of the spectrum of $A$ (left) and $A'$ (right).
As we can see, the high-degree vertices lead to the long tails in the histogram of the eigenvalues, and regularization shrinks these tails toward the bulk.

Figure 1. Histogram of the spectrum of the adjacency matrix $A$ (left) and the regularized adjacency matrix $A'$ (right) for a sparse random graph generated from the inhomogeneous Erdős–Rényi model with $n = 1000$ vertices, 90% of which have expected degree 7 and 10% of which have expected degree 35.

1.7. Application: community detection in networks. Concentration of random graphs has an important application to the statistical analysis of networks, in particular to the problem of community detection. A common way of modeling communities in networks is the stochastic block model [22], which is a special case of the inhomogeneous Erdős–Rényi model considered in this paper. For the purpose of this example, we focus on the simplest version of the stochastic block model $G(n, a/n, b/n)$, also known as the balanced planted partition model, defined as follows. The set of vertices is divided into two subsets (communities) of size $n/2$ each. Edges between vertices are drawn independently with probability $a/n$ if they are in the same community and with probability $b/n$ otherwise.

The community detection problem is to recover the community labels of vertices from a single realization of the random graph model. A large literature exists on both the recovery algorithms and the theory establishing when recovery is possible [14, 33, 34, 32, 1, 29, 8]. There are methods that perform better than a random guess (i.e. the fraction of misclassified vertices is bounded away from $0.5$ as $n \to \infty$ with high probability) under the condition

$$(a - b)^2 > 2(a + b),$$

and no method can perform better than a random guess if this condition is violated.
Moreover, strong consistency, or exact recovery (labeling all vertices correctly with high probability), is possible when the expected degree $(a+b)/2$ is of order $\log n$ or larger and $a$ and $b$ are sufficiently separated, see [32, 30, 6, 20, 10]. Weak consistency (the fraction of mislabeled vertices going to 0 with high probability) is achievable if and only if

$$(a - b)^2 > C_n (a + b) \quad \text{with } C_n \to \infty,$$

see [32]. Many of these results hold in the non-asymptotic regime, for graphs of fixed size $n$. Thus, for any $\varepsilon > 0$ there exists $C_\varepsilon$ (which depends only on $\varepsilon$) such that one can recover communities up to $\varepsilon n$ mislabeled vertices as long as

$$(a - b)^2 > C_\varepsilon (a + b).$$

In particular, recovery of communities is possible even for very sparse graphs, those with bounded expected degrees. Several types of algorithms are known to succeed in this regime, including non-backtracking walks [33, 29, 8], spectral methods [12] and methods based on semidefinite programming [19, 31].

As an application of the new concentration results, we show that regularized spectral clustering [3, 23], one of the simplest and most popular algorithms for community detection, can recover communities in the sparse regime. In general, spectral clustering works by computing the leading eigenvectors of either the adjacency matrix or the Laplacian, or their regularized versions, and running the $k$-means clustering algorithm on these eigenvectors to recover the node labels. In the simple case of the model $G(n, a/n, b/n)$, one can simply assign nodes to communities based on the sign (positive or negative) of the corresponding entries of the eigenvector $v_2(\mathcal{L}(A_\tau))$ corresponding to the second smallest eigenvalue of the regularized Laplacian matrix $\mathcal{L}(A_\tau)$ (or the regularized adjacency matrix $A'$).
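A minimal sketch of this sign-based regularized spectral clustering (our illustration, not the authors' implementation; the parameters $a = 15$, $b = 3$ are arbitrary values chosen to sit comfortably above the detection threshold):

```python
import numpy as np

def regularized_spectral_clustering(A, tau):
    """Two-community case: form the regularized Laplacian L(A_tau) and split
    vertices by the sign of the eigenvector of its second smallest eigenvalue."""
    n = A.shape[0]
    A_tau = A + tau / n                     # A + (tau/n) * 11^T
    d = A_tau.sum(axis=1)                   # regularized degrees, all positive
    s = 1.0 / np.sqrt(d)
    L = np.eye(n) - s[:, None] * A_tau * s[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)    # eigenvalues in ascending order
    return np.sign(eigvecs[:, 1])           # second smallest

# planted partition G(n, a/n, b/n)
rng = np.random.default_rng(2)
n, a, b = 1000, 15.0, 3.0
labels = np.repeat([1, -1], n // 2)
P = np.where(np.equal.outer(labels, labels), a / n, b / n)
A = rng.random((n, n)) < P
A = np.triu(A, 1)
A = (A | A.T).astype(float)

tau = A.sum() / n                           # average degree, as in Corollary 1.4
guess = regularized_spectral_clustering(A, tau)
accuracy = max(np.mean(guess == labels), np.mean(guess == -labels))
```

Since the sign of an eigenvector is arbitrary, accuracy is measured up to a global sign flip, matching the $\min_{\beta = \pm 1}$ in Corollary 1.4.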
Let us briefly explain how our concentration results validate regularized spectral clustering. If the concentration of random graphs holds and $\mathcal{L}(A_\tau)$ is close to $\mathcal{L}(\mathbb{E}A_\tau)$, then standard perturbation theory (the Davis–Kahan theorem below) shows that $v_2(\mathcal{L}(A_\tau))$ is close to $v_2(\mathcal{L}(\mathbb{E}A_\tau))$, and in particular, the signs of these two eigenvectors must agree on most vertices. An easy calculation shows that the signs of $v_2(\mathcal{L}(\mathbb{E}A_\tau))$ recover the communities exactly: this vector is a positive constant on one community and a negative constant on the other. Therefore, the signs of $v_2(\mathcal{L}(A_\tau))$ must recover the communities up to a small fraction of misclassified vertices.

Before stating our result, let us quote a simple version of the Davis–Kahan perturbation theorem (see e.g. [5, Theorem VII.3.2]).

Theorem 1.3 (Davis–Kahan theorem). Let $X$, $Y$ be symmetric matrices such that the second smallest eigenvalues of $X$ and $Y$ have multiplicity one and are at distance at least $\delta > 0$ from the remaining eigenvalues of $X$ and $Y$. Denote by $x$ and $y$ the eigenvectors of $X$ and $Y$ corresponding to the second smallest eigenvalues of $X$ and $Y$, respectively. Then

$$\min_{\beta = \pm 1} \|x + \beta y\| \le \frac{2\|X - Y\|}{\delta}.$$

Corollary 1.4 (Community detection in sparse graphs). Let $\varepsilon > 0$ and $r \ge 1$. Let $A$ be the adjacency matrix drawn from the stochastic block model $G(n, a/n, b/n)$. Assume that

$$(a - b)^2 > C_\varepsilon (a + b) \tag{1.7}$$

where $C_\varepsilon = C r^4 \varepsilon^{-2}$ and $C$ is an appropriately large absolute constant. Choose $\tau$ to be the average degree of the graph, i.e. $\tau = (d_1 + \cdots + d_n)/n$ where $d_i$ are the vertex degrees. Then with probability at least $1 - e^{-r}$, we have

$$\min_{\beta = \pm 1} \|v_2(\mathcal{L}(A_\tau)) + \beta\, v_2(\mathcal{L}(\mathbb{E}A_\tau))\| \le \varepsilon.$$

In particular, the signs of the entries of $v_2(\mathcal{L}(A_\tau))$ correctly estimate the partition into the two communities, up to at most $\varepsilon n$ misclassified vertices.

Proof. We apply Theorem 1.3 with $X = \mathcal{L}(A_\tau)$ and $Y = \mathcal{L}(\mathbb{E}A_\tau)$. A simple calculation shows that the spectral gap $\delta$ defined in Theorem 1.3 is of the order $(a-b)/(a+b)$. The claim of Corollary 1.4 then follows from the Davis–Kahan Theorem 1.3, Concentration Theorem 4.1 (which is a formal version of Theorem 1.2), and condition (1.7). □

1.8. Organization of the paper. In Section 2, we state a formal version of Theorem 1.1. We show there how to deduce this result from a new decomposition of random graphs, which we state as Theorem 2.6. We prove this decomposition theorem in Section 3. In Section 4, we state and prove a formal version of Theorem 1.2 about the concentration of the Laplacian. We conclude the paper with Section 5, where we propose some questions for further investigation.

Acknowledgement. The authors are grateful to Ramon van Handel for several insightful comments on the preliminary version of this paper.

2. Full version of Theorem 1.1, and reduction to a graph decomposition

In this section we state a more general and quantitative version of Theorem 1.1, and we reduce it to a new form of graph decomposition, which can be of interest on its own.

Theorem 2.1 (Concentration of regularized adjacency matrices). Consider a random graph from the inhomogeneous Erdős–Rényi model $G(n, (p_{ij}))$, and let $d = \max_{ij} np_{ij}$. For any $r \ge 1$, the following holds with probability at least $1 - n^{-r}$. Consider any subset consisting of at most $10n/d$ vertices, and reduce the weights of the edges incident to those vertices in an arbitrary way. Let $d'$ be the maximal degree of the resulting graph. Then the adjacency matrix $A'$ of the new (weighted) graph satisfies

$$\|A' - \mathbb{E}A\| \le C r^{3/2} \big(\sqrt{d} + \sqrt{d'}\big).$$

Moreover, the same bound holds for $d'$ being the maximal $\ell_2$ norm of the rows of $A'$.

In this result and in the rest of the paper, $C, C_1, C_2, \dots$ denote absolute constants whose values may be different from line to line.

Remark 2.2 (Theorem 2.1 implies Theorem 1.1). The subset of $10n/d$ vertices in Theorem 2.1 can be completely arbitrary. So let us choose the high-degree vertices, say those with degrees larger than $2d$. There are at most $10n/d$ such vertices with high probability; this follows by an easy calculation, and also from Lemma 3.5. Thus we immediately deduce Theorem 1.1.

Remark 2.3 (Tight upper bound). If we do not reduce the weights of any edges and $d$ is bounded, then the upper bound in Theorem 2.1 is tight (up to a constant depending on $r$). This is because of (1.4), which states that the adjacency matrix does not concentrate in the sparse regime without regularization.

Remark 2.4 (Method to prove Theorem 2.1). One may wonder if Theorem 2.1 can be proved by developing an $\varepsilon$-net argument similar to the method of J. Kahn and E. Szemerédi [17] and its versions [2, 15, 27, 12]. Although we cannot rule out such a possibility, we do not see how this method could handle a general regularization. The reader familiar with the method can easily notice an obstacle: the contribution of the so-called light couples becomes hard to control when one changes, and even reduces, the individual entries of $A$ (the weights of edges). We will develop an alternative and somewhat simpler approach, which is able to handle a general regularization of random graphs. It sheds light on the specific structure of graphs that enables concentration. We are going to identify this structure through a graph decomposition in the next section. But let us pause briefly to mention the following useful reduction.

Remark 2.5 (Reduction to directed graphs). Our arguments will be more convenient to carry out if the adjacency matrix $A$ has all independent entries.
To be able to make this assumption, we can decompose $A$ into its upper-triangular and lower-triangular parts, both of which have independent entries. If we can show that each of these parts concentrates about its expectation, it follows by the triangle inequality that $A$ concentrates about $\mathbb{E}A$. In other words, we may prove Theorem 2.1 for directed inhomogeneous Erdős–Rényi graphs, where edges between any vertices and in any direction appear independently with probabilities $p_{ij}$. In the rest of the argument, we will only work with such random directed graphs.

2.1. Graph decomposition. In this section, we reduce Theorem 2.1 to the following decomposition of inhomogeneous Erdős–Rényi directed random graphs. This decomposition may have an independent interest. Throughout the paper, we denote by $B_N$ the matrix which coincides with a matrix $B$ on a subset of edges $N \subset [n] \times [n]$ and has zero entries elsewhere.

Theorem 2.6 (Graph decomposition). Consider a random directed graph from the inhomogeneous Erdős–Rényi model, and let $d$ be as in (1.3). For any $r \ge 1$, the following holds with probability at least $1 - 3n^{-r}$. One can decompose the set of edges $[n] \times [n]$ into three classes $N$, $R$ and $C$ so that the following properties are satisfied for the adjacency matrix $A$.

• The graph concentrates on $N$, namely $\|(A - \mathbb{E}A)_N\| \le C r^{3/2} \sqrt{d}$.
• Each row of $A_R$ and each column of $A_C$ has at most $32r$ ones. Moreover, $R$ intersects at most $n/d$ columns, and $C$ intersects at most $n/d$ rows of $[n] \times [n]$.

Figure 2 illustrates a possible decomposition Theorem 2.6 can provide. The edges in $N$ form a big "core" where the graph concentrates well even without regularization. The edges in $R$ and $C$ can be thought of (at least heuristically) as those attached to high-degree vertices.

Figure 2. An example of the graph decomposition in Theorem 2.6.
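The restriction notation $B_N$ used throughout the paper is simple to realize in code; here is a small sketch of ours (purely illustrative), which also checks the partition identity $B = B_N + B_R + B_C$ that underlies the deduction in the next section.

```python
import numpy as np

def restrict(B, N):
    """B_N: the matrix coinciding with B on the edge subset N of [n] x [n]
    and having zero entries elsewhere."""
    out = np.zeros_like(B)
    for i, j in N:
        out[i, j] = B[i, j]
    return out

B = np.arange(9.0).reshape(3, 3)
all_edges = {(i, j) for i in range(3) for j in range(3)}
N = {(0, 0), (1, 2), (2, 1)}
R = {(0, 1), (0, 2)}
C = all_edges - N - R                   # N, R, C partition the edge set

B_N = restrict(B, N)
# a partition of the edges splits the matrix additively
assert np.allclose(B_N + restrict(B, R) + restrict(B, C), B)
```

This additivity is exactly what lets the proof of Theorem 2.1 bound $\|A' - \mathbb{E}A\|$ by the three restricted deviations separately.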
We will prove Theorem 2.6 in Section 3; let us pause to deduce Theorem 2.1 from it.

2.2. Deduction of Theorem 2.1. First, let us explain informally how the graph decomposition could lead to Theorem 2.1. The regularization of the graph does not destroy the properties of $N$, $R$ and $C$ in Theorem 2.6. Moreover, regularization creates a new property for us, allowing for good control of the columns of $R$ and rows of $C$. Let us focus on $A_R$ to be specific. The $\ell_1$ norms of all columns of this matrix are at most $d'$, and the $\ell_1$ norms of all rows are $O(r)$ by Theorem 2.6. By a simple calculation which we will do in Lemma 2.7, this implies that $\|A_R\| = O(\sqrt{r d'})$. A similar bound can be proved for $C$. Combining $N$, $R$ and $C$ together will lead to the error bound $O(r^{3/2}(\sqrt{d} + \sqrt{d'}))$ in Theorem 2.1.

To make this argument rigorous, let us start with the simple calculation we just mentioned.

Lemma 2.7. Consider a matrix $B$ in which each row has $\ell_1$ norm at most $a$, and each column has $\ell_1$ norm at most $b$. Then $\|B\| \le \sqrt{ab}$.

Proof. Let $x$ be a vector with $\|x\|_2 = 1$. Using the Cauchy–Schwarz inequality and the assumptions, we have

$$\|Bx\|_2^2 = \sum_i \Big(\sum_j B_{ij} x_j\Big)^2 \le \sum_i \Big(\sum_j |B_{ij}|\Big)\Big(\sum_j |B_{ij}| x_j^2\Big) \le \sum_i a \sum_j |B_{ij}| x_j^2 = a \sum_j x_j^2 \sum_i |B_{ij}| \le a \sum_j x_j^2\, b = ab.$$

Since $x$ is arbitrary, this completes the proof. □

Remark 2.8 (The Riesz–Thorin interpolation theorem implies Lemma 2.7). Lemma 2.7 can also be deduced from the Riesz–Thorin interpolation theorem (see e.g. [40, Theorem 2.1]), since the maximal $\ell_1$ norm of the columns is the $\ell_1 \to \ell_1$ operator norm, and the maximal $\ell_1$ norm of the rows is the $\ell_\infty \to \ell_\infty$ operator norm.

We are ready to formally deduce the main part of Theorem 2.1 from Theorem 2.6; we defer the "moreover" part to Section 3.6.

Proof of Theorem 2.1 (main part).
Fix a realization of the random graph that satisfies the conclusion of Theorem 2.6, and decompose the deviation $A' - \mathbb{E}A$ as follows:

$$A' - \mathbb{E}A = (A' - \mathbb{E}A)_N + (A' - \mathbb{E}A)_R + (A' - \mathbb{E}A)_C. \tag{2.1}$$

We will bound the spectral norm of each of the three terms separately.

Step 1. Deviation on $N$. Let us further decompose

$$(A' - \mathbb{E}A)_N = (A - \mathbb{E}A)_N - (A - A')_N. \tag{2.2}$$

By Theorem 2.6, $\|(A - \mathbb{E}A)_N\| \le C r^{3/2} \sqrt{d}$. To control the second term in (2.2), denote by $\mathcal{E} \subset [n] \times [n]$ the subset of edges that are reweighted in the regularization process. Since $A$ and $A'$ are equal on $\mathcal{E}^c$, we have

$$\|(A - A')_N\| = \|(A - A')_{N \cap \mathcal{E}}\| \le \|A_{N \cap \mathcal{E}}\| \le \|(A - \mathbb{E}A)_{N \cap \mathcal{E}}\| + \|\mathbb{E}A_{N \cap \mathcal{E}}\|, \tag{2.3}$$

where the first inequality holds since $0 \le A - A' \le A$ entrywise, and the second by the triangle inequality. Further, a simple restriction property implies that

$$\|(A - \mathbb{E}A)_{N \cap \mathcal{E}}\| \le 2\|(A - \mathbb{E}A)_N\|. \tag{2.4}$$

Indeed, restricting a matrix onto a product subset of $[n] \times [n]$ can only reduce its norm. Although the set of reweighted edges $\mathcal{E}$ is not a product subset, it can be decomposed into two product subsets:

$$\mathcal{E} = \big(I \times [n]\big) \cup \big(I^c \times I\big) \tag{2.5}$$

where $I$ is the subset of vertices incident to the edges in $\mathcal{E}$. Then (2.4) holds; the right-hand side of that inequality is bounded by $C r^{3/2} \sqrt{d}$ by Theorem 2.6. Thus we have handled the first term in (2.3).

To bound the second term in (2.3), we can use another restriction property: the norm of a matrix with non-negative entries can only decrease under restriction onto any subset of $[n] \times [n]$ (whether a product subset or not). This yields

$$\|\mathbb{E}A_{N \cap \mathcal{E}}\| \le \|\mathbb{E}A_{\mathcal{E}}\| \le \|\mathbb{E}A_{I \times [n]}\| + \|\mathbb{E}A_{I^c \times I}\|, \tag{2.6}$$

where the second inequality follows from (2.5). By assumption, the matrix $\mathbb{E}A_{I \times [n]}$ has $|I| \le 10n/d$ rows and each of its entries is bounded by $d/n$.
Hence the $\ell_1$ norm of each row is bounded by $d$, and the $\ell_1$ norm of each column is bounded by 10. Lemma 2.7 implies that $\|\mathbb{E}A_{I \times [n]}\| \le \sqrt{10d}$. A similar bound holds for the second term of (2.6). This yields

$$\|\mathbb{E}A_{N \cap \mathcal{E}}\| \le 5\sqrt{d},$$

so we have handled the second term in (2.3). Recalling that the first term there is bounded by $C r^{3/2} \sqrt{d}$, we conclude that $\|(A - A')_N\| \le 2C r^{3/2} \sqrt{d}$. Returning to (2.2), we recall that the first term on the right-hand side is bounded by $C r^{3/2} \sqrt{d}$, and we just bounded the second term by $2C r^{3/2} \sqrt{d}$. Hence

$$\|(A' - \mathbb{E}A)_N\| \le 4C r^{3/2} \sqrt{d}.$$

Step 2. Deviation on $R$ and $C$. By the triangle inequality, we have

$$\|(A' - \mathbb{E}A)_R\| \le \|A'_R\| + \|\mathbb{E}A_R\|.$$

Recall that $0 \le A'_R \le A_R$ entrywise. By Theorem 2.6, each of the rows of $A_R$, and thus also of $A'_R$, has $\ell_1$ norm at most $32r$. Moreover, by the definition of $d'$, each of the columns of $A'$, and thus also of $A'_R$, has $\ell_1$ norm at most $d'$. Lemma 2.7 implies that $\|A'_R\| \le \sqrt{32 r d'}$.

The matrix $\mathbb{E}A_R$ can be handled similarly. By Theorem 2.6, it has at most $n/d$ entries in each row, and all entries are bounded by $d/n$. Thus each row of $\mathbb{E}A_R$ has $\ell_1$ norm at most 1, and each column has $\ell_1$ norm at most $d$. Lemma 2.7 implies that $\|\mathbb{E}A_R\| \le \sqrt{d}$. We have shown that

$$\|(A' - \mathbb{E}A)_R\| \le \sqrt{32 r d'} + \sqrt{d}.$$

A similar bound holds for $\|(A' - \mathbb{E}A)_C\|$. Combining the bounds on the deviations of $A' - \mathbb{E}A$ on $N$, $R$ and $C$ and putting them into (2.1), we conclude that

$$\|A' - \mathbb{E}A\| \le 4C r^{3/2} \sqrt{d} + 2\big(\sqrt{32 r d'} + \sqrt{d}\big).$$

Simplifying this inequality, we complete the proof of the main part of Theorem 2.1. □

3. Proof of Decomposition Theorem 2.6

3.1. Outline of the argument. We will construct the decomposition in Theorem 2.6 by an iterative procedure. The first and crucial step is to find a big block $N' \subset [n] \times [n]$ of size at least $(n - n/d) \times n/2$ on which $A$ concentrates, i.e.

$$\|(A - \mathbb{E}A)_{N'}\| = O(\sqrt{d}).$$
To find such a block, we first establish concentration in the $\ell_\infty \to \ell_2$ norm; this can be done by standard probabilistic techniques. Next, we can automatically upgrade this to concentration in the spectral norm ($\ell_2 \to \ell_2$) once we pass to an appropriate block $\mathcal{N}'$. This can be done using a general result from functional analysis, which we call Grothendieck-Pietsch factorization. Repeating this argument for the transpose, we obtain another block $\mathcal{N}''$ of size at least $n/2 \times (n - n/d)$ on which the graph concentrates as well. So the graph concentrates on $\mathcal{N}_0 := \mathcal{N}' \cup \mathcal{N}''$. The "core" $\mathcal{N}_0$ will form the first part of the class $\mathcal{N}$ we are constructing.

It remains to control the graph on the complement of $\mathcal{N}_0$. That set of edges is quite small; it can be described as a union of a block $\mathcal{C}_0$ with $n/d$ rows, a block $\mathcal{R}_0$ with $n/d$ columns, and an exceptional $n/2 \times n/2$ block; see Figure 3b for an illustration. We may consider $\mathcal{C}_0$ and $\mathcal{R}_0$ as the first parts of the future classes $\mathcal{C}$ and $\mathcal{R}$ we are constructing. Indeed, since $\mathcal{C}_0$ has so few rows, the expected number of ones in each column of $\mathcal{C}_0$ is bounded by $1$. For simplicity, let us think of all columns of $\mathcal{C}_0$ as having $O(1)$ ones, as desired. (In the formal argument, we will add the bad columns to the exceptional block.) Of course, the block $\mathcal{R}_0$ can be handled similarly.

At this point, we have decomposed $[n] \times [n]$ into $\mathcal{N}_0$, $\mathcal{R}_0$, $\mathcal{C}_0$ and an exceptional $n/2 \times n/2$ block. Now we repeat the process for the exceptional block, constructing $\mathcal{N}_1$, $\mathcal{R}_1$ and $\mathcal{C}_1$ there, and so on. Figure 3c illustrates this process. At the end, we choose $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ to be the unions of the blocks $\mathcal{N}_i$, $\mathcal{R}_i$ and $\mathcal{C}_i$, respectively.

¹ In this paper, by a block we mean a product set $I \times J$ with arbitrary index subsets $I, J \subset [n]$. These subsets are not required to be intervals of successive integers.

Figure 3: (a) The core. (b) After the first step. (c) Final decomposition.
Constructing the decomposition iteratively in the proof of Theorem 2.6.

Two precautions have to be taken in this argument. First, we need to make the concentration on the core blocks $\mathcal{N}_i$ better at each step, so that the sum of those error bounds does not depend on the total number of steps. This can be done with little effort, with the help of the exponential decrease of the sizes of the blocks $\mathcal{N}_i$. Second, we have control of the sizes, but not the locations, of the exceptional blocks. Thus, to be able to carry out the decomposition argument inside an exceptional block, we need to make the argument valid uniformly over all blocks of that size. This will require us to be delicate with the probabilistic arguments, so that we can take a union bound over such blocks.

3.2. Grothendieck-Pietsch factorization. As we mentioned in the previous section, our proof of Theorem 2.6 is based on Grothendieck-Pietsch factorization. This general and well-known result in functional analysis [36, 37] has already been used in a similar probabilistic context, see [26, Proposition 15.11]. Grothendieck-Pietsch factorization compares two matrix norms, the $\ell_2 \to \ell_2$ norm (which we call the spectral norm) and the $\ell_\infty \to \ell_2$ norm. For a $k \times m$ matrix $B$, these norms are defined as
$$\|B\| = \max_{\|x\|_2 = 1} \|Bx\|_2, \qquad \|B\|_{\infty \to 2} = \max_{\|x\|_\infty = 1} \|Bx\|_2 = \max_{x \in \{-1,1\}^m} \|Bx\|_2.$$
The $\ell_\infty \to \ell_2$ norm is usually easier to control, since the supremum is taken over the discrete set $\{-1,1\}^m$, and any vector there has all coordinates of the same magnitude. To compare the two norms, one can start with the obvious inequality
$$\frac{\|B\|_{\infty \to 2}}{\sqrt{m}} \le \|B\| \le \|B\|_{\infty \to 2}.$$
Both parts of this inequality are optimal, so there is an unavoidable slack between the upper and lower bounds.
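This two-sided comparison is easy to check numerically on a tiny matrix (an illustrative sketch, not from the paper; the brute force over all $2^m$ sign vectors is only feasible for very small $m$):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
k, m = 5, 6
B = rng.standard_normal((k, m))

# ||B||_{inf->2}: maximize ||Bx||_2 over the 2^m sign vectors x in {-1,1}^m.
inf_to_2 = max(np.linalg.norm(B @ np.array(x))
               for x in product((-1.0, 1.0), repeat=m))

# Spectral (l2 -> l2) norm: the largest singular value of B.
spectral = np.linalg.norm(B, 2)

# The two-sided inequality from the text.
assert inf_to_2 / np.sqrt(m) <= spectral + 1e-9
assert spectral <= inf_to_2 + 1e-9
```

The upper inequality holds because the Euclidean unit ball is contained in the $\ell_\infty$ unit ball; the lower one because every sign vector has $\ell_2$ norm $\sqrt{m}$.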
However, Grothendieck-Pietsch factorization allows us to tighten the inequality by changing $B$ slightly. The next two results offer two ways to change $B$: introduce weights, or pass to a sub-matrix.

Theorem 3.1 (Grothendieck-Pietsch factorization, weighted version). Let $B$ be a $k \times m$ real matrix. Then there exist positive weights $\mu_j$ with $\sum_{j=1}^m \mu_j = 1$ such that
$$\|B\|_{\infty \to 2} \le \|B D_\mu^{-1/2}\| \le \sqrt{\pi/2}\; \|B\|_{\infty \to 2}, \tag{3.1}$$
where $D_\mu = \mathrm{diag}(\mu_j)$ denotes the $m \times m$ diagonal matrix with the weights $\mu_j$ on the diagonal.

This result is a known combination of the Little Grothendieck Theorem (see [41, Corollary 10.10] and [38]) and Pietsch Factorization (see [41, Theorem 9.2]). In an explicit form, a version of this result can be found e.g. in [26, Proposition 15.11]. The weights $\mu_j$ can be computed algorithmically, see [42].

The following related version of Grothendieck-Pietsch factorization can be especially useful in probabilistic contexts, see [26, Proposition 15.11]. Here and in the rest of the paper, we denote by $B_{I \times J}$ the sub-matrix of a matrix $B$ with rows indexed by a subset $I$ and columns indexed by a subset $J$.

Theorem 3.2 (Grothendieck-Pietsch factorization, sub-matrix version). Let $B$ be a $k \times m$ real matrix and $\delta > 0$. Then there exists $J \subseteq [m]$ with $|J| \ge (1 - \delta)m$ such that
$$\|B_{[k] \times J}\| \le \frac{2 \|B\|_{\infty \to 2}}{\sqrt{\delta m}}.$$

Proof. Consider the weights $\mu_j$ given by Theorem 3.1, and choose $J$ to consist of the indices $j$ satisfying $\mu_j \le 1/(\delta m)$. Since $\sum_j \mu_j = 1$, the set $J$ must contain at least $(1 - \delta)m$ indices, as claimed. Furthermore, the diagonal entries of $(D_\mu^{-1/2})_{J \times J}$ are all bounded below by $\sqrt{\delta m}$, which yields
$$\|(B D_\mu^{-1/2})_{[k] \times J}\| \ge \sqrt{\delta m}\; \|B_{[k] \times J}\|.$$
On the other hand, by (3.1) the left-hand side of this inequality is bounded by $2 \|B\|_{\infty \to 2}$. Rearranging the terms, we complete the proof. □

3.3. Concentration on a big block.
We now start working toward constructing the core part $\mathcal{N}$ in Theorem 2.6. In this section we show how to find a big block on which the adjacency matrix $A$ concentrates. First we establish concentration in the $\ell_\infty \to \ell_2$ norm, and then, using Grothendieck-Pietsch factorization, in the spectral norm.

The lemmas of this and the next section are best understood for $m = n$, $I = J = [n]$ and $\alpha = 1$. In this case, we are working with the entire adjacency matrix and trying to make the first step of the iterative procedure. The further steps will require us to handle smaller blocks $I \times J$; the parameter $\alpha$ will then become smaller, in order to achieve better concentration on smaller blocks.

Lemma 3.3 (Concentration in $\ell_\infty \to \ell_2$ norm). Let $1 \le m \le n$ and $\alpha \ge m/n$. Then for $r \ge 1$ the following holds with probability at least $1 - n^{-r}$. Consider a block $I \times J$ of size $m \times m$. Let $I_0$ be the set of indices of the rows of $A_{I \times J}$ that contain at most $\alpha d$ ones. Then
$$\|(A - \mathbb{E}A)_{I_0 \times J}\|_{\infty \to 2} \le C \sqrt{\alpha d m r \log(en/m)}. \tag{3.2}$$

Proof. By definition,
$$\|(A - \mathbb{E}A)_{I_0 \times J}\|_{\infty \to 2}^2 = \max_{x \in \{-1,1\}^m} \sum_{i \in I_0} \Big( \sum_{j \in J} (A_{ij} - \mathbb{E}A_{ij}) x_j \Big)^2 = \max_{x \in \{-1,1\}^m} \sum_{i \in I} (X_i \xi_i)^2 \tag{3.3}$$
where we denoted
$$X_i := \sum_{j \in J} (A_{ij} - \mathbb{E}A_{ij}) x_j, \qquad \xi_i := \mathbf{1}_{\{\sum_{j \in J} A_{ij} \le \alpha d\}}.$$
Let us first fix a block $I \times J$ and a vector $x \in \{-1,1\}^m$, and analyze the independent random variables $X_i \xi_i$. Since $|X_i| \le \sum_{j \in J} |A_{ij} - \mathbb{E}A_{ij}| \le \sum_{j \in J} A_{ij}$, it follows by the definition of $\xi_i$ that
$$|X_i \xi_i| \le \alpha d. \tag{3.4}$$
A more useful bound will follow from Bernstein's inequality. Indeed, $X_i$ is a sum of $m$ independent random variables with zero means and variances at most $d/n$. By Bernstein's inequality, we have
$$\mathbb{P}\{|X_i \xi_i| > tm\} \le \mathbb{P}\{|X_i| > tm\} \le 2 \exp\Big( -\frac{mt^2/2}{d/n + t/3} \Big), \quad t \ge 0. \tag{3.5}$$
For $tm \le \alpha d$, this can be further bounded by $2\exp(-m^2 t^2 / 4\alpha d)$, once we use the assumption $\alpha \ge m/n$. For $tm > \alpha d$, the left-hand side of (3.5) is automatically zero by (3.4). Therefore
$$\mathbb{P}\{|X_i \xi_i| > tm\} \le 2 \exp\Big( -\frac{m^2 t^2}{4 \alpha d} \Big), \quad t \ge 0. \tag{3.6}$$
We are now ready to bound the right-hand side of (3.3). By (3.6), the random variable $X_i \xi_i$ is sub-gaussian² with sub-gaussian norm at most $\sqrt{\alpha d}$. It follows that $(X_i \xi_i)^2$ is sub-exponential with sub-exponential norm at most $2\alpha d$. Using Bernstein's inequality for sub-exponential random variables (see Corollary 5.17 in [43]), we have
$$\mathbb{P}\Big\{ \sum_{i \in I} (X_i \xi_i)^2 > \varepsilon m \alpha d \Big\} \le 2 \exp\big( -c \min(\varepsilon^2, \varepsilon)\, m \big), \quad \varepsilon \ge 0. \tag{3.7}$$
Choosing $\varepsilon := (10/c)\, r \log(en/m)$, we bound this probability by $(en/m)^{-5rm}$. Summarizing, we have proved that for fixed $I, J \subseteq [n]$ and $x \in \{-1,1\}^m$, with probability at least $1 - (en/m)^{-5rm}$, the following holds:
$$\sum_{i \in I} (X_i \xi_i)^2 \le (10/c)\, r \log(en/m) \cdot m \alpha d. \tag{3.8}$$
Taking a union bound over all possibilities of $m, I, J, x$ and using (3.3) and (3.8), we see that the conclusion of the lemma holds with probability at least
$$1 - \sum_{m=1}^{n} 2^m \binom{n}{m}^2 \Big( \frac{en}{m} \Big)^{-5rm} \ge 1 - n^{-r}.$$
The proof is complete. □

² For definitions and basic facts about sub-gaussian random variables, see e.g. [43].

Applying Lemma 3.3 followed by Grothendieck-Pietsch factorization (Theorem 3.2), we obtain the following.

Lemma 3.4 (Concentration in spectral norm). Let $1 \le m \le n$ and $\alpha \ge m/n$. Then for $r \ge 1$ the following holds with probability at least $1 - n^{-r}$. Consider a block $I \times J$ of size $m \times m$. Let $I_0$ be the set of indices of the rows of $A_{I \times J}$ that contain at most $\alpha d$ ones. Then one can find a subset $J_0 \subseteq J$ of at least $3m/4$ columns such that
$$\|(A - \mathbb{E}A)_{I_0 \times J_0}\| \le C \sqrt{\alpha d r \log(en/m)}. \tag{3.9}$$

3.4. Restricted degrees.
The two simple lemmas of this section will help us handle the part of the adjacency matrix outside the core block constructed in Lemma 3.4. First, we show that almost all rows have at most $O(\alpha d)$ ones, and thus are included in the core block.

Lemma 3.5 (Degrees of subgraphs). Let $1 \le m \le n$ and $\alpha \ge \sqrt{m/n}$. Then for $r \ge 1$ the following holds with probability at least $1 - n^{-r}$. Consider a block $I \times J$ of size $m \times m$. Then all but $m/\alpha d$ rows of $A_{I \times J}$ have at most $8r\alpha d$ ones.

Proof. Fix a block $I \times J$, and denote by $d_i$ the number of ones in the $i$-th row of $A_{I \times J}$. Then $\mathbb{E} d_i \le md/n$ by the assumption. Using Chernoff's inequality, we obtain
$$\mathbb{P}\{d_i > 8r\alpha d\} \le \Big( \frac{8r\alpha d}{e\, md/n} \Big)^{-8r\alpha d} \le \Big( \frac{2\alpha n}{m} \Big)^{-8r\alpha d} =: p.$$
Let $S$ be the number of rows $i$ with $d_i > 8r\alpha d$. Then $S$ is a sum of $m$ independent Bernoulli random variables with expectations at most $p$. Again, Chernoff's inequality implies
$$\mathbb{P}\{S > m/\alpha d\} \le (ep\alpha d)^{m/\alpha d} \le p^{m/2\alpha d} = \Big( \frac{2\alpha n}{m} \Big)^{-4rm}.$$
The second inequality here holds since $e\alpha d \le p^{-1/2}$. (To see this, note that the definition of $p$ and the assumption on $\alpha$ imply that $p^{-1/2} = (2\alpha n/m)^{4r\alpha d} \ge 2^{4r\alpha d}$.)

It remains to take a union bound over all blocks $I \times J$. We obtain that the conclusion of the lemma holds with probability at least
$$1 - \sum_{m=1}^{n} \binom{n}{m}^2 \Big( \frac{2\alpha n}{m} \Big)^{-4rm} \ge 1 - n^{-r}.$$
In the last inequality we used the assumption that $\alpha \ge \sqrt{m/n}$. The proof is complete. □

Next, we handle the block of rows that do have too many ones. We show that most columns of this block have $O(1)$ ones.

Lemma 3.6 (More on degrees of subgraphs). Let $1 \le m \le n$ and $\alpha \ge \sqrt{m/n}$. Then for $r \ge 1$ the following holds with probability at least $1 - n^{-r}$. Consider a block $I \times J$ of size $k \times m$ with some $k \le m/\alpha d$. Then all but $m/4$ columns of $A_{I \times J}$ have at most $32r$ ones.

Proof. Fix $I$ and $J$, and denote by $d_j$ the number of ones in the $j$-th column of $A_{I \times J}$.
Then $\mathbb{E} d_j \le kd/n \le m/\alpha n$ by assumption. Using Chernoff's inequality, we have
$$\mathbb{P}\{d_j > 32r\} \le \Big( \frac{32r}{e\, m/\alpha n} \Big)^{-32r} \le \Big( \frac{10\alpha n}{m} \Big)^{-32r} =: p.$$
Let $S$ be the number of columns $j$ with $d_j > 32r$. Then $S$ is a sum of $m$ independent Bernoulli random variables with expectations at most $p$. Again, Chernoff's inequality implies
$$\mathbb{P}\{S > m/4\} \le (4ep)^{m/4} \le p^{m/6} \le \Big( \frac{10\alpha n}{m} \Big)^{-5rm}.$$
The second inequality here holds since $4e < p^{-1/3}$, which in turn follows from the assumption on $\alpha$.

It remains to take a union bound over all blocks $I \times J$. It is enough to consider the blocks with the largest possible number of rows, thus with $k = \lceil m/\alpha d \rceil$. We obtain that the conclusion of the lemma holds with probability at least
$$1 - \sum_{m=1}^{n} \binom{n}{m} \binom{n}{\lceil m/\alpha d \rceil} \Big( \frac{10\alpha n}{m} \Big)^{-5rm} \ge 1 - n^{-r}.$$
In the last inequality we used the assumption that $\alpha \ge \sqrt{m/n}$. The proof is complete. □

3.5. Iterative decomposition: proof of Theorem 2.6. Finally, we combine the tools we have developed so far and construct an iterative decomposition of the adjacency matrix the way we outlined in Section 3.1. Let us set up one step of this procedure, where we consider an $m \times m$ block and decompose almost all of it (everything except an $m/2 \times m/2$ block) into classes $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ satisfying the conclusion of Theorem 2.6. Once we can do this, we repeat the procedure for the $m/2 \times m/2$ block, and so on.

Lemma 3.7 (Decomposition of a block). Let $1 \le m \le n$ and $\alpha \ge \sqrt{m/n}$. Then for $r \ge 1$ the following holds with probability at least $1 - 3n^{-r}$. Consider a block $I \times J$ of size $m \times m$. Then there exists an exceptional sub-block $I_1 \times J_1$ with dimensions at most $m/2 \times m/2$ such that the remaining part of the block, that is $(I \times J) \setminus (I_1 \times J_1)$, can be decomposed into three classes $\mathcal{N}, \mathcal{R} \subset (I \setminus I_1) \times J$ and $\mathcal{C} \subset I \times (J \setminus J_1)$ so that the following holds.
• The graph concentrates on $\mathcal{N}$, namely
$$\|(A - \mathbb{E}A)_{\mathcal{N}}\| \le C r^{3/2} \sqrt{\alpha d \log(en/m)}.$$
• Each row of $A_{\mathcal{R}}$ and each column of $A_{\mathcal{C}}$ has at most $32r$ ones. Moreover, $\mathcal{R}$ intersects at most $m/\alpha d$ columns and $\mathcal{C}$ intersects at most $m/\alpha d$ rows of $I \times J$.

After a permutation of rows and columns, the decomposition of a block stated in Lemma 3.7 can be visualized as in Figure 4c.

Figure 4: Construction of a block decomposition in Lemma 3.7. (a) Initial step. (b) Repeat for transpose. (c) Final decomposition.

Proof. Since we are going to use Lemmas 3.4, 3.5 and 3.6, let us fix a realization of the random graph that satisfies the conclusions of those three lemmas. By Lemma 3.5, all but $m/\alpha d$ rows of $A_{I \times J}$ have at most $8r\alpha d$ ones; let us denote by $I_0$ the set of indices of the rows with at most $8r\alpha d$ ones. Then we can use Lemma 3.4 for the block $I_0 \times J$ and with $\alpha$ replaced by $8r\alpha$; the choice of $I_0$ ensures that all rows have small numbers of ones, as required in that lemma. To control the rows outside $I_0$, we may use Lemma 3.6 for $(I \setminus I_0) \times J$; as we already noted, this block has at most $m/\alpha d$ rows, as required in that lemma. Intersecting the good sets of columns produced by those two lemmas, we obtain a set of at most $m/2$ exceptional columns $J_1 \subset J$ such that the following holds.
• On the block $\mathcal{N}_1 := I_0 \times (J \setminus J_1)$, we have $\|(A - \mathbb{E}A)_{\mathcal{N}_1}\| \le C r^{3/2} \sqrt{\alpha d \log(en/m)}$.
• For the block $\mathcal{C} := (I \setminus I_0) \times (J \setminus J_1)$, all columns of $A_{\mathcal{C}}$ have at most $32r$ ones.
Figure 4a illustrates the decomposition of the block $I \times J$ into the set of exceptional columns indexed by $J_1$ and the good sets $\mathcal{N}_1$ and $\mathcal{C}$. To finish the proof, we apply the above argument to the transpose $A^{\mathsf{T}}$ on the block $J \times I$.
To be precise, we start with the set $J_0 \subset J$ of all but $m/\alpha d$ small columns of $A_{I \times J}$ (those with at most $8r\alpha d$ ones); then we obtain an exceptional set $I_1 \subset I$ of at most $m/2$ rows; and finally we conclude that $A$ concentrates on the block $\mathcal{N}_2 := (I \setminus I_1) \times J_0$ and has small rows on the block $\mathcal{R} := (I \setminus I_1) \times (J \setminus J_0)$. Figure 4b illustrates this decomposition. It only remains to combine the decompositions for $A$ and $A^{\mathsf{T}}$; Figure 4c illustrates the result of the combination. Once we define $\mathcal{N} := \mathcal{N}_1 \cup \mathcal{N}_2$, it becomes clear that $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$ have the required properties.³ □

³ It may happen that an entry ends up in more than one of the classes $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$. In such cases, we split the tie arbitrarily.

Proof of Theorem 2.6. Let us fix a realization of the random graph that satisfies the conclusion of Lemma 3.7. Applying that lemma for $m = n$ and with $\alpha = 1$, we decompose the set of edges $[n] \times [n]$ into three classes $\mathcal{N}_0$, $\mathcal{C}_0$ and $\mathcal{R}_0$, plus an $n/2 \times n/2$ exceptional block $I_1 \times J_1$. Apply Lemma 3.7 again, this time for the block $I_1 \times J_1$, for $m = n/2$ and with $\alpha = \sqrt{1/2}$. We decompose $I_1 \times J_1$ into $\mathcal{N}_1$, $\mathcal{C}_1$ and $\mathcal{R}_1$, plus an $n/4 \times n/4$ exceptional block $I_2 \times J_2$. Repeat this process with $\alpha = \sqrt{m/n}$, where $m$ is the running size of the block; we halve this size at each step, and so we have $\alpha_i \le 2^{-i/2}$. Figure 3c illustrates a decomposition that we may obtain this way. In a finite number of steps (actually, in $O(\log n)$ steps) the exceptional block becomes empty, and the process terminates. At that point we have decomposed the set of edges $[n] \times [n]$ into $\mathcal{N}$, $\mathcal{R}$ and $\mathcal{C}$, defined as the unions of the blocks $\mathcal{N}_i$, $\mathcal{R}_i$ and $\mathcal{C}_i$, respectively, which we obtained at each step. It is clear that $\mathcal{R}$ and $\mathcal{C}$ satisfy the required properties. It remains to bound the deviation of $A$ on $\mathcal{N}$. By construction, each $\mathcal{N}_i$ satisfies
$$\|(A - \mathbb{E}A)_{\mathcal{N}_i}\| \le C r^{3/2} \sqrt{\alpha_i d \log(e/\alpha_i^2)}.$$
Thus, by triangle inequality we have
$$\|(A - \mathbb{E}A)_{\mathcal{N}}\| \le \sum_{i \ge 0} C r^{3/2} \sqrt{\alpha_i d \log(e/\alpha_i^2)} \le C' r^{3/2} \sqrt{d}.$$
In the second inequality we used that $\alpha_i \le 2^{-i/2}$, which forces the series to converge. The proof of Theorem 2.6 is complete. □

3.6. Replacing the degrees by the $\ell_2$ norms in Theorem 2.1. Let us now prove the "moreover" part of Theorem 2.1, where $d'$ is the maximal squared $\ell_2$ norm of the rows and columns of the regularized adjacency matrix $A'$. This is clearly a stronger statement than the main part of the theorem. Indeed, since all entries of $A'$ are bounded in absolute value by $1$, each degree, being the $\ell_1$ norm of a row, is bounded below by the squared $\ell_2$ norm.

This strengthening is in fact easy to check. To do so, note that the definition of $d'$ was used only once in the proof of Theorem 2.1, namely in Step 2, where we bounded the norms of $A'_{\mathcal{R}}$ and $A'_{\mathcal{C}}$. Thus, to obtain the strengthening, it is enough to replace the application of Lemma 2.7 there by the following lemma.

Lemma 3.8. Consider a matrix $B$ with entries in $[0,1]$. Suppose each row of $B$ has at most $a$ non-zero entries, and each column has $\ell_2$ norm at most $\sqrt{b}$. Then $\|B\| \le \sqrt{ab}$.

Proof. To prove the claim, let $x$ be a vector with $\|x\|_2 = 1$. Using the Cauchy-Schwarz inequality and the assumptions, we have
$$\|Bx\|_2^2 = \sum_j \Big( \sum_i B_{ij} x_i \Big)^2 \le \sum_j \Big( \sum_{i:\, B_{ij} \ne 0} B_{ij}^2 \Big) \Big( \sum_{i:\, B_{ij} \ne 0} x_i^2 \Big) \le \sum_j b \sum_{i:\, B_{ij} \ne 0} x_i^2 = b \sum_i x_i^2 \sum_{j:\, B_{ij} \ne 0} 1 \le b \sum_i x_i^2 \cdot a = ab.$$
Since $x$ is arbitrary, this completes the proof. □

4. Concentration of the regularized Laplacian

In this section, we state the following formal version of Theorem 1.2, and we deduce it from the concentration of adjacency matrices (Theorem 2.1).

Theorem 4.1 (Concentration of regularized Laplacians). Consider a random graph from the inhomogeneous Erdös-Rényi model, and let $d$ be as in (1.3). Choose a number $\tau > 0$.
Then, for any $r \ge 1$, with probability at least $1 - e^{-r}$ one has
$$\|\mathcal{L}(A_\tau) - \mathcal{L}(\mathbb{E}A_\tau)\| \le \frac{C r^2}{\sqrt{\tau}} \Big( 1 + \frac{d}{\tau} \Big)^{5/2}.$$

Proof. Two sources contribute to the deviation of the Laplacian: the deviation of the adjacency matrix and the deviation of the degrees. Let us separate them and bound them individually.

Step 1. Decomposing the deviation. Let us write $\bar{A} := \mathbb{E}A$ for simplicity; then
$$E := \mathcal{L}(A_\tau) - \mathcal{L}(\bar{A}_\tau) = D_\tau^{-1/2} A_\tau D_\tau^{-1/2} - \bar{D}_\tau^{-1/2} \bar{A}_\tau \bar{D}_\tau^{-1/2}.$$
Here $D_\tau = \mathrm{diag}(d_i + \tau)$ and $\bar{D}_\tau = \mathrm{diag}(\bar{d}_i + \tau)$ are the diagonal matrices with the degrees of $A_\tau$ and $\bar{A}_\tau$ on the diagonal, respectively. Using the fact that $A_\tau - \bar{A}_\tau = A - \bar{A}$, we can represent the deviation as $E = S + T$, where
$$S = D_\tau^{-1/2} (A - \bar{A}) D_\tau^{-1/2}, \qquad T = D_\tau^{-1/2} \bar{A}_\tau D_\tau^{-1/2} - \bar{D}_\tau^{-1/2} \bar{A}_\tau \bar{D}_\tau^{-1/2}.$$
Let us bound $S$ and $T$ separately.

Step 2. Bounding $S$. Let us introduce a diagonal matrix $\Delta$ that is easier to work with than $D_\tau$. Set $\Delta_{ii} = 1$ if $d_i \le 8rd$, and $\Delta_{ii} = d_i/\tau + 1$ otherwise. Then the entries of $\tau\Delta$ are bounded above by the corresponding entries of $D_\tau$, and so
$$\tau \|S\| \le \|\Delta^{-1/2} (A - \bar{A}) \Delta^{-1/2}\|.$$
Next, by triangle inequality,
$$\tau \|S\| \le \|\Delta^{-1/2} A \Delta^{-1/2} - \bar{A}\| + \|\bar{A} - \Delta^{-1/2} \bar{A} \Delta^{-1/2}\| =: R_1 + R_2. \tag{4.1}$$
In order to bound $R_1$, we use Theorem 2.1 to show that $A' := \Delta^{-1/2} A \Delta^{-1/2}$ concentrates around $\bar{A}$. This should be possible because $A'$ is obtained from $A$ by reducing the degrees that are bigger than $8rd$. To apply the "moreover" part of Theorem 2.1, let us check the magnitude of the $\ell_2$ norms of the rows $A'_i$ of $A'$:
$$\|A'_i\|_2^2 = \sum_{j=1}^{n} \frac{A_{ij}}{\Delta_{ii} \Delta_{jj}} \le \frac{d_i}{\Delta_{ii}} \le \max(8rd, \tau).$$
Here in the first inequality we used that $\Delta_{jj} \ge 1$ and $\sum_j A_{ij} = d_i$; the second inequality follows by the definition of $\Delta_{ii}$.
Applying Theorem 2.1, we obtain that with probability at least $1 - n^{-r}$,
$$R_1 = \|A' - \bar{A}\| \le C_1 r^2 (\sqrt{d} + \sqrt{\tau}).$$
To bound $R_2$, we note that by construction of $\Delta$, the matrices $\bar{A}$ and $\Delta^{-1/2} \bar{A} \Delta^{-1/2}$ coincide on the block $I \times I$, where $I$ is the set of vertices satisfying $d_i \le 8rd$. This block is very large: indeed, Lemma 3.5 implies that $|I^c| \le n/d$ with probability at least $1 - n^{-r}$. Outside this block, i.e. on the small blocks $I^c \times [n]$ and $[n] \times I^c$, the entries of $\bar{A} - \Delta^{-1/2} \bar{A} \Delta^{-1/2}$ are bounded by the corresponding entries of $\bar{A}$, which are all bounded by $d/n$. Thus, using Lemma 2.7, we have
$$R_2 \le \|\bar{A}_{I^c \times [n]}\| + \|\bar{A}_{[n] \times I^c}\| \le 2\sqrt{d}.$$
Substituting the bounds for $R_1$ and $R_2$ into (4.1), we conclude that
$$\|S\| \le \frac{C_2 r^2}{\tau} (\sqrt{d} + \sqrt{\tau})$$
with probability at least $1 - 2n^{-r}$.

Step 3. Bounding $T$. Bounding the spectral norm by the Hilbert-Schmidt norm, we get
$$\|T\| \le \|T\|_{HS} = \Big( \sum_{i,j=1}^{n} T_{ij}^2 \Big)^{1/2}, \quad \text{where} \quad T_{ij} = (\bar{A}_{ij} + \tau/n) \Big[ \frac{1}{\sqrt{\delta_{ij}}} - \frac{1}{\sqrt{\bar{\delta}_{ij}}} \Big]$$
and $\delta_{ij} = (d_i + \tau)(d_j + \tau)$, $\bar{\delta}_{ij} = (\bar{d}_i + \tau)(\bar{d}_j + \tau)$. To bound $T_{ij}$, we note that
$$0 \le \bar{A}_{ij} + \tau/n \le \frac{d + \tau}{n} \quad \text{and} \quad \Big| \frac{1}{\sqrt{\delta_{ij}}} - \frac{1}{\sqrt{\bar{\delta}_{ij}}} \Big| = \Bigg| \frac{\delta_{ij} - \bar{\delta}_{ij}}{\delta_{ij} \sqrt{\bar{\delta}_{ij}} + \bar{\delta}_{ij} \sqrt{\delta_{ij}}} \Bigg| \le \frac{|\delta_{ij} - \bar{\delta}_{ij}|}{2\tau^3}.$$
Recalling the definitions of $\delta_{ij}$ and $\bar{\delta}_{ij}$, and adding and subtracting $(d_i + \tau)(\bar{d}_j + \tau)$, we have
$$\delta_{ij} - \bar{\delta}_{ij} = (d_i + \tau)(d_j - \bar{d}_j) + (\bar{d}_j + \tau)(d_i - \bar{d}_i).$$
So, using the inequality $(a+b)^2 \le 2(a^2 + b^2)$ and bounding $\bar{d}_j + \tau$ by $d + \tau$, we obtain
$$\|T\|^2 \le \frac{(d+\tau)^2}{n^2 \tau^6} \Big[ \sum_{i=1}^{n} (d_i + \tau)^2 \sum_{j=1}^{n} (d_j - \bar{d}_j)^2 + n(d+\tau)^2 \sum_{i=1}^{n} (d_i - \bar{d}_i)^2 \Big]. \tag{4.2}$$
We claim that
$$\sum_{j=1}^{n} (d_j - \bar{d}_j)^2 \le C_3 r^2 nd \quad \text{with probability at least } 1 - e^{-2r}. \tag{4.3}$$
Indeed, since the variance of each $d_j$ is bounded by $d$, the expectation of the sum in (4.3) is bounded by $nd$.
To upgrade the variance bound to an exponential deviation bound, one can use one of several standard methods. For example, Bernstein's inequality implies that $X_j = d_j - \bar{d}_j$ satisfies $\mathbb{P}\{|X_j| > C_4 t \sqrt{d}\} \le e^{-t}$ for all $t \ge 1$. This means that the random variable $X_j^2$ belongs to the Orlicz space $L_{\psi_{1/2}}$ and has norm $\|X_j^2\|_{\psi_{1/2}} \le C_3 d$, see [26]. By triangle inequality, we conclude that $\|\sum_{j=1}^{n} X_j^2\|_{\psi_{1/2}} \le C_3 nd$, which in turn implies (4.3). Furthermore, (4.3) implies
$$\sum_{i=1}^{n} (d_i + \tau)^2 \le 2 \sum_{i=1}^{n} (d_i - \bar{d}_i)^2 + 2 \sum_{i=1}^{n} (\bar{d}_i + \tau)^2 \le 2C_3 r^2 nd + 2n(d+\tau)^2 \le C_5 r^2 n (d+\tau)^2.$$
Substituting this bound and (4.3) into (4.2), we conclude that
$$\|T\|^2 \le \frac{(d+\tau)^2}{n^2 \tau^6} \cdot C_3 r^2 nd \Big[ C_5 r^2 n (d+\tau)^2 + n(d+\tau)^2 \Big] \le \frac{C_6 r^4}{\tau} \Big( 1 + \frac{d}{\tau} \Big)^5.$$
It remains to substitute the bounds for $S$ and $T$ into the inequality $\|E\| \le \|S\| + \|T\|$ and simplify the expression. The resulting bound holds with probability at least $1 - n^{-r} - n^{-r} - e^{-2r} \ge 1 - e^{-r}$, as claimed. □

5. Further questions

5.1. Optimal regularization? The main point of our paper is that regularization helps sparse graphs concentrate. We have discussed several kinds of regularization, and mentioned some more, in Section 1.4. We found that any meaningful regularization works, as long as it reduces the degrees that are too high and increases the degrees that are too low. Is there an optimal way to regularize a graph? Designing the best "preprocessing" of sparse graphs for spectral algorithms is especially interesting from the applied perspective, i.e. for real-world networks. On the theoretical level, can regularization of sparse graphs produce the same optimal bound $2\sqrt{d}(1 + o(1))$ that we saw for dense graphs in (1.1)? Thus, an ideal regularization should bring all parasitic outliers of the spectrum into the bulk.
If so, this would lead to a potentially simple spectral clustering algorithm for community detection in networks which matches the theoretical lower bounds. Algorithms with optimal rates exist for this problem [33, 29], but their analysis is very technical.

5.2. How exactly does concentration depend on regularization? It would be interesting to determine how exactly the concentration of the Laplacian depends on the regularization parameter $\tau$. The dependence in Theorem 4.1 is not optimal, and we have not made an effort to improve it. Although it is natural to choose $\tau \sim d$ as in Theorem 1.2, choosing $\tau \gg d$ could also be useful [23]. Choosing $\tau \ll d$ may be interesting as well, for then $\mathcal{L}(\mathbb{E}A_\tau) \approx \mathcal{L}(\mathbb{E}A)$ and we obtain concentration of $\mathcal{L}(A_\tau)$ around the Laplacian of the expectation of the original (rather than regularized) matrix $\mathbb{E}A$.

5.3. Average expected degree? Both concentration results of this paper, Theorems 1.1 and 1.2, depend on $d = \max_{ij} np_{ij}$. Would it be possible to reduce $d$ to the maximal expected degree $d_{\mathrm{ave}} = \max_i \sum_j p_{ij}$?

5.4. From random graphs to random matrices? Adjacency matrices of random graphs are particular examples of random matrices. Does the phenomenon we described, namely that regularization leads to concentration, apply to general random matrices? Guided by Theorem 1.1, we might expect the following for a broader class of random matrices $B$ with mean-zero independent entries. First, the only reason the spectral norm of $B$ is too large (and is determined by outliers of the spectrum) could be the existence of a large row or column. Furthermore, it might be possible to reduce the norm of $B$ (and thus bring the outliers into the bulk of the spectrum) by regularizing in some way the rows and columns that are too large. For related questions in random matrix theory, see the recent work [4, 21].

References

[1] E. Abbe, A. S. Bandeira, and G. Hall.
Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, 62(1):471–487, 2016.
[2] N. Alon and N. Kahale. A spectral technique for coloring random 3-colorable graphs. SIAM J. Comput., 26:1733–1748, 1997.
[3] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097–2122, 2013.
[4] A. Bandeira and R. van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. Annals of Probability, to appear, 2014.
[5] R. Bhatia. Matrix Analysis. Springer-Verlag, New York, 1996.
[6] P. J. Bickel and A. Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proc. Natl. Acad. Sci. USA, 106:21068–21073, 2009.
[7] B. Bollobás, S. Janson, and O. Riordan. The phase transition in inhomogeneous random graphs. Random Structures and Algorithms, 31:3–122, 2007.
[8] C. Bordenave, M. Lelarge, and L. Massoulié. Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs. Preprint, 2015.
[9] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press, 2013.
[10] T. Cai and X. Li. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. Ann. Statist., 43(3):1027–1059, 2015.
[11] K. Chaudhuri, F. Chung, and A. Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. Journal of Machine Learning Research Workshop and Conference Proceedings, 23:35.1–35.23, 2012.
[12] P. Chin, A. Rao, and V. Vu. Stochastic block model and community detection in the sparse graphs: A spectral algorithm with optimal rate of recovery. Preprint, 2015.
[13] F. R. K. Chung. Spectral Graph Theory.
CBMS Regional Conference Series in Mathematics. AMS, 1997.
[14] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84:066106, 2011.
[15] U. Feige and E. Ofek. Spectral techniques applied to sparse random graphs. Random Structures and Algorithms, 27(2):251–275, 2005.
[16] Z. Füredi and J. Komlós. The eigenvalues of random symmetric matrices. Combinatorica, 1(3):233–241, 1981.
[17] J. Friedman, J. Kahn, and E. Szemerédi. On the second eigenvalue in random regular graphs. In Proc. Twenty-First Annual ACM Symposium on Theory of Computing, pages 587–598, 1989.
[18] C. Gao, Z. Ma, A. Y. Zhang, and H. H. Zhou. Achieving optimal misclassification proportion in stochastic block model. Preprint, 2015.
[19] O. Guédon and R. Vershynin. Community detection in sparse networks via Grothendieck's inequality. Probability Theory and Related Fields, to appear, 2014.
[20] B. Hajek, Y. Wu, and J. Xu. Achieving exact cluster recovery threshold via semidefinite programming. arXiv:1412.6156, 2014.
[21] R. van Handel. On the spectral norm of inhomogeneous random matrices. Preprint, 2015.
[22] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: first steps. Social Networks, 5(2):109–137, 1983.
[23] A. Joseph and B. Yu. Impact of regularization on spectral clustering. Ann. Statist., 44(4):1765–1791, 2016.
[24] M. Krivelevich and B. Sudakov. The largest eigenvalue of sparse random graphs. Combin. Probab. Comput., 12:61–72, 2003.
[25] M. Ledoux. The Concentration of Measure Phenomenon, volume 89 of Mathematical Surveys and Monographs. Amer. Math. Society, 2001.
[26] M. Ledoux and M. Talagrand. Probability in Banach spaces: Isoperimetry and processes. Springer-Verlag, Berlin, 1991.
[27] J. Lei and A. Rinaldo. Consistency of spectral clustering in stochastic block models. Ann. Statist., 43(1):215–237, 2015.
[28] L. Lu and X. Peng.
Spectra of edge-independent random graphs. The Electronic Journal of Combinatorics, 20(4), 2013.
[29] L. Massoulié. Community detection thresholds and the weak Ramanujan property. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC '14, pages 694–703, 2014.
[30] F. McSherry. Spectral partitioning of random graphs. In Proc. 42nd FOCS, pages 529–537, 2001.
[31] A. Montanari and S. Sen. Semidefinite programs on sparse random graphs and their application to community detection. Preprint, 2015.
[32] E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for binary symmetric block models. arXiv:1407.1591, 2014.
[33] E. Mossel, J. Neeman, and A. Sly. A proof of the block model threshold conjecture. Preprint, 2014.
[34] E. Mossel, J. Neeman, and A. Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 2014.
[35] R. Oliveira. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. Preprint, 2010.
[36] A. Pietsch. Operator Ideals. North-Holland, Amsterdam, 1978.
[37] G. Pisier. Factorization of linear operators and geometry of Banach spaces. Number 60 in CBMS Regional Conference Series in Mathematics. AMS, Providence, 1986.
[38] G. Pisier. Grothendieck's theorem, past and present. Bulletin (New Series) of the American Mathematical Society, 49(2):237–323, 2012.
[39] T. Qin and K. Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems, pages 3120–3128, 2013.
[40] E. M. Stein and R. Shakarchi. Functional Analysis: Introduction to Further Topics in Analysis. Princeton University Press, 2011.
[41] N. Tomczak-Jaegermann. Banach-Mazur distances and finite-dimensional operator ideals. John Wiley & Sons, Inc., New York, 1989.
[42] J. A. Tropp. Column subset selection, matrix factorization, and eigenvalue optimization. In
Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 978–986, 2009.
[43] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed sensing: theory and applications. Cambridge University Press. Submitted.
[44] V. Vu. Spectral norm of random matrices. Combinatorica, 27(6):721–736, 2007.

Department of Statistics, University of California, Davis, One Shields Ave, Davis, CA 95616, U.S.A.
E-mail address: canle@ucdavis.edu

Department of Statistics, University of Michigan, 1085 S. University Ave, Ann Arbor, MI 48109, U.S.A.
E-mail address: elevina@umich.edu

Department of Mathematics, University of Michigan, 530 Church St, Ann Arbor, MI 48109, U.S.A.
E-mail address: romanv@umich.edu
