Graph Sparsification via Refinement Sampling

Ashish Goel∗   Michael Kapralov†   Sanjeev Khanna‡

November 8, 2021

Abstract

A graph G′(V, E′) is an ε-sparsification of G for some ε > 0 if every (weighted) cut in G′ is within (1 ± ε) of the corresponding cut in G. A celebrated result of Benczúr and Karger shows that for every undirected graph G, an ε-sparsification with O(n log n/ε²) edges can be constructed in O(m log² n) time. The notion of cut-preserving graph sparsification has played an important role in speeding up algorithms for several fundamental network design and routing problems. Applications to modern massive data sets often constrain algorithms to use computation models that restrict random access to the input. The semi-streaming model, in which the algorithm is constrained to use Õ(n) space, has been shown to be a good abstraction for analyzing graph algorithms in applications to large data sets. Recently, a semi-streaming algorithm for graph sparsification was presented by Ahn and Guha; the total running time of their implementation is Ω(mn), too large for applications where both space and time are important. In this paper, we introduce a new technique for graph sparsification, namely refinement sampling, that gives an Õ(m) time semi-streaming algorithm for graph sparsification. Specifically, we show that refinement sampling can be used to design a one-pass streaming algorithm for sparsification that takes O(log log n) time per edge, uses O(log² n) space per node, and outputs an ε-sparsifier with O(n log³ n/ε²) edges. At a slightly increased space and time complexity, we can reduce the sparsifier size to O(n log n/ε²) edges, matching the Benczúr-Karger result, while improving upon the Benczúr-Karger runtime for m = ω(n log³ n).
Finally, we show that an ε-sparsifier with O(n log n/ε²) edges can be constructed in two passes over the data and O(m) time whenever m = Ω(n^{1+δ}) for some constant δ > 0. As a by-product of our approach, we also obtain an O(m log log n + n log n) time streaming algorithm to compute a sparse k-connectivity certificate of a graph.

∗Departments of Management Science and Engineering and (by courtesy) Computer Science, Stanford University. Email: ashishg@stanford.edu. Research supported in part by NSF award IIS-0904325.
†Institute for Computational and Mathematical Engineering, Stanford University. Email: kapralov@stanford.edu. Research supported by a Stanford Graduate Fellowship.
‡Department of Computer and Information Science, University of Pennsylvania, Philadelphia PA. Email: sanjeev@cis.upenn.edu. Supported in part by NSF Awards CCF-0635084 and IIS-0904314.

1 Introduction

The notion of graph sparsification was introduced in [BK96], where the authors gave a near-linear time procedure that takes as input an undirected graph G on n vertices and constructs a weighted subgraph H of G with O(n log n/ε²) edges such that the value of every cut in H is within a 1 ± ε factor of the value of the corresponding cut in G. This algorithm has subsequently been used to speed up algorithms for finding approximately minimum or sparsest cuts in graphs ([BK96, KRV06]), as well as in a host of other applications (e.g. [KL02]). A more general class of spectral sparsifiers was recently introduced by Spielman and Srivastava in [SS08]. The algorithms developed in [BK96] and [SS08] take near-linear time in the size of the graph and produce very high quality sparsifiers, but require random access to the edges of the input graph G, which is often prohibitively expensive in applications to modern massive data sets.
The streaming model of computation, which restricts algorithms to use a small number of passes over the input and space polylogarithmic in the size of the input, has been studied extensively in various application domains (e.g. [Mut06]), but has proven too restrictive for even the simplest graph algorithms (even testing s-t connectivity requires Ω(n) space). The less restrictive semi-streaming model, in which the algorithm is restricted to use Õ(n) space, is more suited for graph algorithms [FKM+05]. The problem of constructing graph sparsifiers in the semi-streaming model was recently posed by Ahn and Guha [AG09], who gave a one-pass algorithm for finding Benczúr-Karger type sparsifiers with a slightly larger number of edges than the original Benczúr-Karger algorithm, i.e. O(n log n log(m/n)/ε²) as opposed to O(n log n/ε²). Their algorithm requires only one pass over the data, and their analysis is quite non-trivial. However, its time complexity is Ω(mn polylog(n)), making it impractical for applications where both time and space are important constraints.¹ Apart from the issue of random access vs. disk, the semi-streaming model is also important for scenarios where edges of the graph are revealed one at a time by an external process. For example, this setting maps well to online social networks, where edges arrive one by one but efficient network computations may be required at any time, making it particularly useful to have a dynamically maintained sparsifier.

Our results: We introduce the concept of refinement sampling. At a high level, the basic idea is to sample edges at geometrically decreasing rates, using the sampled edges at each rate to refine the connected components from the previous rate.
The sampling rate at which the two endpoints of an edge get separated into different connected components is used as an approximate measure of the "strength" of that edge. We use refinement sampling to obtain two algorithms for efficiently computing Benczúr-Karger type sparsifiers of undirected graphs in the semi-streaming model. The first algorithm requires O(log n) passes, O(log n) space per node, and O(log n log log n) work per edge, and produces sparsifiers with O(n log² n/ε²) edges. The second algorithm requires one pass over the edges of the graph, O(log² n) space per node, and O(log log n) work per edge, and produces sparsifiers with O(n log³ n/ε²) edges. Several properties of these results are worth noting:

1. In the incremental model, the amortized running time per edge arrival is O(log log n), which is quite practical and much better than the previously best known running time of Ω(n).

2. The sample size can be improved for both algorithms by running the original Benczúr-Karger algorithm on the sampled graph without violating the restrictions of the semi-streaming model, yielding O(log n log log n + (n/m) log⁴ n) and O(log log n + (n/m) log⁵ n) amortized work per edge, respectively.

3. Somewhat surprisingly, this two-stage (but still semi-streaming) algorithm improves upon the runtime of the original sparsification scheme when m = ω(n log² n) for the O(log n)-pass version and m = ω(n log³ n) for the one-pass version.

¹As is often the case for semi-streaming algorithms, Ahn and Guha do not explicitly compute the running time of their algorithm; Ω(mn polylog(n)) is the best running time we can come up with for their algorithm.

4.
As a by-product of our analysis, we show that refinement sampling can be regarded as a one-pass algorithm for producing a sparse connectivity certificate of a weighted undirected graph (see Corollary 4.7). We thus obtain a streaming analog of the Nagamochi-Ibaraki result [NI92] for producing sparse certificates, which is in turn used in Benczúr-Karger sampling.

Finally, in Section 5 we give an algorithm for constructing O(n log n/ε²)-size sparsifiers in O(m) time using two passes over the input when m = Ω(n^{1+δ}).

Related Work: In [AG09] the authors give an algorithm for sparsification in the semi-streaming model based on the observation that one can use the constructed sparsification of the currently received part of the graph to estimate the strong connectivity of a newly received edge. A brief outline of the algorithm is as follows. Denote the edges of G in their order in the stream by e_1, …, e_m. Set H_0 = (V, ∅). For every t > 0, compute the strength s_t of e_t in H_{t−1}; with probability p_{e_t} = min{ρ/s_t, 1} set H_t = (V, E(H_{t−1}) ∪ {e_t}), giving e_t weight 1/p_{e_t} in H_t, and set H_t = H_{t−1} otherwise. For every t, the graph H_t is an ε-sparsification of the subgraph received by time t. The authors show that this algorithm yields an ε-sparsifier with O(n log n log(m/n)/ε²) edges. However, it is unclear how one can calculate the strengths s_t efficiently. A naive implementation would take Ω(n) time for each t, resulting in Ω(mn) time overall. One could conceivably use the fact that H_{t−1} is always a subgraph of H_t, but to the best of our knowledge there are no results on efficiently calculating or approximating strong connectivities in the incremental model.
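The outline above can be sketched in a few lines of code (our own illustration, not the implementation of [AG09]; the `strength` argument is a placeholder for the strong-connectivity estimate, whose efficient computation is exactly the difficulty just discussed):

```python
import random

def ag09_sketch(edges, strength, rho=16.0):
    """One-pass sparsification in the style of [AG09] (simplified sketch).

    `strength(H, e)` must return (an estimate of) the strong connectivity
    of edge e relative to the current sparsifier H -- a placeholder here,
    since computing it efficiently is the open difficulty noted in the text.
    """
    H = []  # current sparsifier: list of (u, v, weight)
    for e in edges:
        s = strength([(u, v) for (u, v, _) in H], e)
        p = min(rho / s, 1.0)
        if random.random() < p:
            H.append((e[0], e[1], 1.0 / p))  # keep e with weight 1/p
    return H
```

Note the dependency structure: whether e_t is kept depends on H_{t−1}, i.e. on the coin tosses for all earlier edges.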
It is important to emphasize that our techniques for obtaining an efficient one-pass sparsification algorithm are very different from the approach of [AG09]. In particular, the structure of dependencies in the sampling process is quite different. In the algorithm of [AG09], edges are not sampled independently, since the probability with which an edge is sampled depends on the coin tosses for edges that came earlier in the stream. Our approach, on the other hand, decouples the process of estimating edge strengths from the process of producing the output sample, thus simplifying the analysis and making a direct invocation of the Benczúr-Karger sampling theorem possible.

Organization: Section 2 introduces some notation and reviews the Benczúr-Karger sampling algorithm. We then introduce our refinement sampling scheme in Section 3, and show how it can be used to obtain a sparsification algorithm requiring O(log n) passes and O(log n log log n) work per edge. The size of the sampled graph is O(n log² n/ε²), i.e. at most O(log n) times larger than that produced by Benczúr-Karger sampling. Finally, in Section 4 we build on the ideas of Section 3 to obtain a one-pass algorithm with O(log log n) work per edge, at the expense of increasing the size of the sample to O(n log³ n/ε²).

2 Preliminaries

We will denote by G(V, E) the input undirected graph with vertex set V and edge set E, where |V| = n and |E| = m. For any ε > 0, we say that a weighted graph G′(V, E′) is an ε-sparsification of G if every (weighted) cut in G′ is within (1 ± ε) of the corresponding cut in G. Given any two collections of sets that partition V, say S_1 and S_2, we say that S_2 is a refinement of S_1 if for any X ∈ S_1 and Y ∈ S_2, either X ∩ Y = ∅ or Y ⊂ X. In other words, S_1 ∪ S_2 forms a laminar set system.
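In code, the refinement relation is simply blockwise containment; a small helper (our own illustration) checks whether a partition S2 refines a partition S1, with partitions given as lists of sets:

```python
def is_refinement(S1, S2):
    """Return True iff partition S2 refines partition S1, i.e. every block
    Y of S2 is contained in some block X of S1 (equivalently, for all X, Y
    either X and Y are disjoint or Y is a subset of X)."""
    return all(any(Y <= X for X in S1) for Y in S2)
```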
2.1 The Benczúr-Karger Sampling Scheme

We say that a graph G is k-connected if the value of each cut in G is at least k. The Benczúr-Karger sampling scheme uses a stricter notion of connectivity, referred to as strong connectivity, defined as follows:

Definition 2.1 [BK96] A k-strong component is a maximal k-connected vertex-induced subgraph. The strong connectivity of an edge e, denoted by s_e, is the largest k such that a k-strong component contains e.

Note that the set of k-strong components forms a partition of the vertex set of G, and the set of (k+1)-strong components forms a refinement of this partition. We say e is k-strong if its strong connectivity is k or more, and k-weak otherwise. The following simple lemma will be useful in our analysis.

Lemma 2.2 [BK96] The number of k-weak edges in a graph on n vertices is bounded by k(n − 1).

The sampling algorithm relies on the following result:

Theorem 2.3 [BK96] Let G′ be obtained by sampling edges of G with probability p_e = min{ρ/(ε² s_e), 1}, where ρ = 16(d + 2) ln n, and giving each sampled edge weight 1/p_e. Then G′ is an ε-sparsification of G with probability at least 1 − n^{−d}. Moreover, the expected number of edges in G′ is O(n log n).

It follows easily from the proof of Theorem 2.3 in [BK96] that if we sample using an underestimate of edge strengths, the resulting graph is still an ε-sparsification.

Corollary 2.4 Let G′ be obtained by sampling each edge of G with probability p̃_e ≥ p_e, giving every sampled edge e weight 1/p̃_e. Then G′ is an ε-sparsification of G with probability at least 1 − n^{−d}.

In [BK96] the authors give an O(m log² n) time algorithm for calculating estimates of strong connectivities that are sufficient for sampling.
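The sampling step of Theorem 2.3 itself is simple once strengths are available; a minimal sketch (our illustration, assuming the strengths s_e, or underestimates of them as permitted by Corollary 2.4, have already been computed):

```python
import math
import random

def bk_sample(edges, strengths, n, eps, d=2):
    """Sample edge e with p_e = min(rho/(eps^2 * s_e), 1), rho = 16(d+2) ln n,
    and give each sampled edge weight 1/p_e (Theorem 2.3).  Underestimating
    s_e only increases p_e, which is safe by Corollary 2.4."""
    rho = 16 * (d + 2) * math.log(n)
    sample = []
    for e in edges:
        p = min(rho / (eps ** 2 * strengths[e]), 1.0)
        if random.random() < p:
            sample.append((e, 1.0 / p))  # (edge, weight)
    return sample
```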
The [BK96] algorithm, however, requires random access to the edges of the graph, which is disallowed in the semi-streaming model. More precisely, the procedure for estimating edge strengths given in [BK96] relies on the Nagamochi-Ibaraki algorithm for obtaining sparse certificates for edge-connectivity in O(m) time ([NI92]). The algorithm of [NI92] relies on random access to the edges of the graph, and to the best of our knowledge no streaming implementation is known. In fact, we show in Corollary 4.7 that refinement sampling yields a streaming algorithm for producing sparse certificates for edge-connectivity in one pass over the data.

In what follows we will consider unweighted graphs to simplify notation. The results obtained can be easily extended to the polynomially weighted case, as outlined in Remark 4.8 at the end of Section 4.

3 Refinement Sampling

We start by introducing the idea of refinement sampling, which gives a simple algorithm for efficiently computing a BK-sample and serves as a building block for our streaming algorithms. To motivate refinement sampling, let us consider the simpler problem of identifying all edges of strength at least k in the input graph G(V, E). A natural approach is as follows: (a) generate a graph G′ by sampling edges of G with probability Õ(1/k), (b) find the connected components of G′, and (c) output all edges (u, v) ∈ E such that u and v are in the same connected component in G′. The sampling rate of Õ(1/k) suggests that if an edge (u, v) has strong connectivity below k, the vertices u and v would end up in different components in G′, and conversely, if the strong connectivity of (u, v) is above k, they are likely to stay connected and hence be output in step (c).
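The three steps (a)-(c) can be sketched with a union-find structure (our own illustration; the Õ(1/k) rate is instantiated as min(c·ln n/k, 1) for an arbitrary constant c):

```python
import math
import random

class DSU:
    """Union-find with path halving, maintaining connected components."""
    def __init__(self, n):
        self.p = list(range(n))
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, x, y):
        self.p[self.find(x)] = self.find(y)

def candidate_strong_edges(n, edges, k, c=8):
    """Return the edges whose endpoints stay connected after sampling at
    rate ~O~(1/k): a candidate set for the edges of strength >= k."""
    # (a) sample each edge with probability ~ c*ln(n)/k
    p = min(c * math.log(n + 1) / k, 1.0)
    sampled = [e for e in edges if random.random() < p]
    # (b) connected components of the sampled graph
    dsu = DSU(n)
    for (u, v) in sampled:
        dsu.union(u, v)
    # (c) keep edges whose endpoints lie in the same component
    return [(u, v) for (u, v) in edges if dsu.find(u) == dsu.find(v)]
```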
While this process indeed filters out most k-weak edges, it is easy to construct examples where the output will contain many edges of strength 1 even though k is polynomially large (a star graph, for instance). The idea of refinement sampling is to get around this by successively refining the sample obtained in the final step (c) above. In designing our algorithm, we will repeatedly invoke the subroutine REFINE(S, p), which essentially implements the simple idea described above.

Function: REFINE(S, p)
Input: Partition S of V, sampling probability p.
Output: Partition S′ of V, a refinement of S.
1. Take a uniform sample E′ of the edges of E, each included with probability p.
2. For each U ∈ S, U ⊆ V, let C(U) be the set of connected components of U induced by E′.
3. Return S′ := ∪_{U∈S} C(U).

It is easy to see that REFINE can be implemented using O(n) space, a total of n UNION operations with O(n log n) overall cost, and m FIND operations with O(1) cost per operation, for an overall running time of O(n log n + m) (see, e.g., [CLRS01]). Also, REFINE can be implemented using a single pass over the set of edges. A scheme of the refinement relations between the partitions S_{l,k} is given in Fig. 1.

The refinement sampling algorithm computes partitions S_{l,j} for l = 1, …, L and j = 0, 1, …, K. Here L = log(2n) is the number of strength levels (the factor of 2 is chosen for convenience, to ensure that S_{L,K} consists of isolated vertices whp), and K is a parameter which we call the strengthening parameter. Also, we choose a parameter φ > 0, which we will refer to as the oversampling parameter. For a partition S, let X(S) denote the set of all edges in E whose endpoints lie in two different sets of S. The partitions are computed as follows:

Algorithm 1 (Refinement Sampling)
Initialization: S_{l,0} = {V} for l = 1, …, L.
1. Set k := 1.
2.
For each l, 1 ≤ l ≤ L, set S_{l,k} := REFINE(S_{l,k−1}, 2^{−l}).
3. Set k := k + 1. If k ≤ K, go to step 2.
4. For each e ∈ E define L(e) = min{l : e ∈ X(S_{l,K})}. Sample edge e with probability z(e) = min{1, φ/(ε² 2^{L(e)})} and assign it weight 1/z(e).

Let R(φ, K) denote the set of edges sampled during this step; we call this the refinement sample of G. The following two lemmas relate the probabilities z(e) to the sampling probabilities used in the Benczúr-Karger sampling scheme.

Lemma 3.1 For any K > 0, with probability at least 1 − Kn^{−d}, every edge e satisfies z(e) ≤ 4φρ/(ε² s_e).

Proof: Consider an edge e with strong connectivity s_e, and let C denote the s_e-strongly connected component containing e. By Theorem 2.3, sampling with probability min{4ρ/s_e, 1} preserves all cuts up to 1 ± 1/2 in C with probability at least 1 − n^{−d}. Hence, all s_e-strongly connected components stay connected after K passes of REFINE for all l > 0 such that 2^{−l} ≥ 4ρ/s_e, yielding the lemma.

Lemma 3.2 If K > log_{4/3} n, then 2^{−L(e)+1} ≥ 1/(2s_e) for every e ∈ E(G) with probability at least 1 − Ke^{−(n−1)/100}.

Proof: Consider a level l such that p = 2^{−l} < 1/(2s_e). Let H be the graph obtained by contracting all (s_e + 1)-strong components of G into supernodes. Since H contains only (s_e + 1)-weak edges, its number of edges is at most s_e(n − 1) by Lemma 2.2. As the expected number of (s_e + 1)-weak edges in the sample is at most (n − 1)/2, by Chernoff bounds the probability that the number of (s_e + 1)-weak edges in the sample exceeds 3(n − 1)/4 is at most (e^{1/4}(5/4)^{−5/4})^{(n−1)/2} < e^{−(n−1)/100}. Thus at least one quarter of the supernodes get isolated in each iteration.
Hence, no (s_e + 1)-weak edge survives after K = log_{4/3} n rounds of refinement sampling, with probability at least 1 − Ke^{−(n−1)/100}. Since L(e) was defined as the least l such that e ∈ X(S_{l,K}), the endpoints of e were connected in S_{L(e)−1,K}, so 2^{−L(e)+1} ≥ 1/(2s_e).

Figure 1: Scheme of refinement relations between partitions for Algorithm 1 (each row l forms a chain S_{l,1} → S_{l,2} → ⋯ → S_{l,K}).

Theorem 3.3 Let G′ be the graph obtained by running Algorithm 1 with φ := 4ρ. Then G′ has O(n log² n/ε²) edges in expectation, and is an ε-sparsification of G with probability at least 1 − n^{−d+1}.

Proof: We have from Lemma 3.2 and the choice of φ that the sampling probabilities dominate those used in Benczúr-Karger sampling with probability at least 1 − Ke^{−(n−1)/100}. Hence, by Corollary 2.4, every cut in G′ is within 1 ± ε of its value in G with probability at least 1 − Ke^{−(n−1)/100} − n^{−d}. The expected size of the sample is O(n log² n/ε²) by Lemma 3.1, together with the fact that ρ = O(log n). The probability of failure of the estimate in Lemma 3.1 is at most Kn^{−d}, so all bounds hold with probability at least 1 − Kn^{−d} − Ke^{−(n−1)/100} − n^{−d} > 1 − n^{−d+1} for sufficiently large n. The high probability bound on the number of edges follows by an application of the Chernoff bound.

The next lemma follows from the discussion above:

Lemma 3.4 For any ε > 0, an ε-sparsification of G with O(n log² n/ε²) edges can be constructed in O(log n) passes of REFINE using O(log n) space per node and O(log² n) time per edge.
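A compact sketch of REFINE and Algorithm 1 together (our own illustration with simplified parameter choices: a partition is stored as an array mapping each node to a block id, and φ is set to an illustrative value of order ρ):

```python
import math
import random

class DSU:
    """Union-find with path halving."""
    def __init__(self, n):
        self.p = list(range(n))
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, x, y):
        self.p[self.find(x)] = self.find(y)

def refine(part, edges, p, n):
    """REFINE(S, p): sample edges with probability p and split each block of
    `part` (node -> block id) into the connected components induced by the
    sampled edges that stay inside the block; the result refines `part`."""
    dsu = DSU(n)
    for (u, v) in edges:
        if part[u] == part[v] and random.random() < p:
            dsu.union(u, v)
    ids, new = {}, [0] * n
    for x in range(n):  # new block id = (old block, sampled component)
        new[x] = ids.setdefault((part[x], dsu.find(x)), len(ids))
    return new

def refinement_sample(n, edges, eps=0.5):
    """Algorithm 1 (sketch): K rounds of REFINE at rate 2^-l per level l,
    then sample edge e with z(e) = min(1, phi / (eps^2 * 2^L(e)))."""
    L = max(1, math.ceil(math.log2(2 * n)))
    K = math.ceil(math.log(n, 4 / 3))
    phi = 64 * math.log(n + 1)  # illustrative choice of order rho
    parts = {l: [0] * n for l in range(1, L + 1)}  # S_{l,0} = {V}
    for _ in range(K):
        for l in range(1, L + 1):
            parts[l] = refine(parts[l], edges, 2.0 ** (-l), n)
    sample = []
    for (u, v) in edges:
        Le = next((l for l in range(1, L + 1) if parts[l][u] != parts[l][v]), L)
        z = min(1.0, phi / (eps ** 2 * 2 ** Le))
        if random.random() < z:
            sample.append((u, v, 1.0 / z))  # keep with weight 1/z(e)
    return sample
```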
We now note that one log n factor in the running time comes from the fact that during each pass k, Algorithm 1 flips a coin at every level l to decide whether or not to include e into S_{l,k} when e ∈ S_{l,k−1}. If we could guarantee that S_{l,k} is a refinement of S_{l′,k} for all l′ < l and for all k, we would be able to use binary search to find the largest l such that e ∈ S_{l,k} in O(log log n) time. Algorithm 2, given below, uses iterative sampling to ensure the scheme of refinement relations given in Fig. 2. For each edge e, 1 ≤ k ≤ K, and 1 ≤ l ≤ L, we define for convenience independent Bernoulli random variables A_{l,k,e} with Pr[A_{l,k,e} = 1] = 1/2, even though the algorithm will not always need to flip all these O(log² n) coins. Also define U_{l,k,e} = ∏_{j≤l} A_{j,k,e}. The algorithm uses connectivity data structures D_{l,k}, 1 ≤ l ≤ L, 1 ≤ k ≤ K. Adding an edge e to D_{l,k} merges the components that the endpoints of e belong to in D_{l,k}.

Algorithm 2 (An O(log n)-Pass Sparsifier)
Input: Edges of G streamed in adversarial order: (e_1, …, e_m).
Output: A sparsification G′ of G.
Initialization: Set E′ := ∅.
1. For all k = 1, …, K:
2. Set t = 1.
3. For all l = 1, …, L:
4. Add e_t = (u_t, v_t) to D_{l,k} if U_{l,k,e_t} = 1 and u_t and v_t are connected in D_{l,k−1}.
5. Set t := t + 1. Go to step 3 if t ≤ m.
6. For each e_t, define L′(e_t) as the minimum l such that u_t and v_t are not connected in D_{l,K}. Set z′(e_t) := min{1, 4ρ/(ε² 2^{L′(e_t)})}. Output e_t with probability z′(e_t), giving it weight 1/z′(e_t).

Theorem 3.5 For any ε > 0, there exists an O(log n)-pass streaming algorithm that produces an ε-sparsification G′ of a graph G with at most O(n log² n/ε²) edges using O((n/m) log n + log n log log n) time per edge.
Proof: The correctness of Algorithm 2 follows in the same way as for Algorithm 1 above, so it remains to determine its runtime. The O((n/m) log n + 1) term per edge comes from the amortized O(n log n + m) complexity of UNION-FIND operations. The log n factor in the runtime comes from the log n passes, and we now show that step 3 can be implemented in O(log log n) time. First note that since S_{l′,k′} is a refinement of S_{l,k} whenever l′ ≥ l and k′ ≥ k, one can use binary search to determine the largest l_0 such that u_t and v_t are connected in D_{l_0−1,k−1}. One then keeps flipping a fair coin and adding e to the connectivity data structures D_{l,k} for successive l ≥ l_0 as long as the coin keeps coming up heads. Since 2 such steps are performed on average, this takes O(K) = O(log n) amortized time per edge by the Chernoff bound. Putting these estimates together, we obtain the claimed time complexity. The scheme of refinement relations between the S_{l,k} is depicted in Fig. 2.

Corollary 3.6 For any ε > 0, there is an O(log n)-pass algorithm that produces an ε-sparsification G′ of an input graph G with at most O(n log n/ε²) edges, using O(log² n) space per node and performing O(log n log log n + (n/m) log⁴ n) amortized work per edge.

Proof: One can obtain a sparsification G′ with O(n log² n/ε²) edges by running Algorithm 2 on the input graph G, and then run the Benczúr-Karger algorithm on G′ without violating the restrictions of the semi-streaming model. Note that even though G′ is a weighted graph, this incurs an overhead of only O(log² n) per edge of G′, since the weights are polynomial. Since G′ has O(n log² n) edges, the amortized work per edge of G is O(log n log log n + (n/m) log⁴ n). The Benczúr-Karger algorithm can be implemented using space proportional to the size of the graph, which yields O(log² n) space per node.
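The binary search used in the proof of Theorem 3.5 is just a search for the boundary of a monotone predicate; a self-contained sketch (our own), where `connected[l]` stands in for "u_t and v_t are connected in D_{l,k}" and monotonicity in l is guaranteed by the refinement relations:

```python
def largest_connected_level(connected):
    """Given connected[1..L], a monotone boolean sequence (True at low
    levels, False at high levels), return the largest l with
    connected[l] True, or 0 if none, using O(log L) probes.
    With L = O(log n), this is the O(log log n) step per edge."""
    lo, hi = 1, len(connected) - 1  # index 0 is an unused placeholder
    ans = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if connected[mid]:
            ans, lo = mid, mid + 1  # still connected: search higher levels
        else:
            hi = mid - 1            # already separated: search lower levels
    return ans
```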
Remark 3.7 The algorithm improves upon the runtime of the Benczúr-Karger sparsification scheme when m = ω(n log² n).

4 A One-Pass Õ(n + m)-Time Algorithm for Graph Sparsification

In this section we convert Algorithm 2 from the previous section into a one-pass algorithm. We will design a one-pass algorithm that produces an ε-sparsifier with O(n log³ n/ε²) edges using only O(log log n) amortized work per edge. A simple post-processing step at the end of the algorithm will allow us to reduce the size to O(n log n/ε²) edges at a slightly increased space and time complexity. The main difficulty is that in going from O(log n) passes to a one-pass algorithm, we need to introduce and analyze new dependencies in the sampling process.

As before, the algorithm maintains connectivity data structures D_{l,k}, where 1 ≤ l ≤ L and 1 ≤ k ≤ K. In addition to indexing D_{l,k} by pairs (l, k), we shall also write D_J for D_{l,k}, where J = K(l − 1) + k, so that 1 ≤ J ≤ LK. This induces a natural ordering on the D_{l,k}, illustrated in Fig. 3, that corresponds to the structure of refinement relations. We will assume for simplicity of presentation that D_0 = D_{1,0} is a connectivity data structure in which all vertices are connected. For each edge e, 1 ≤ l ≤ L, and 1 ≤ k ≤ K, we define an independent Bernoulli random variable A′_{l,k,e} with Pr[A′_{l,k,e} = 1] = 2^{−l}. The algorithm is as follows:

Algorithm 3 (A One-Pass Sparsifier)
Input: Edges of G streamed in adversarial order: (e_1, …, e_m).
Output: A sparsification G′ of G.
Initialization: Set E′ := ∅.
1. Set t = 1.
2. For all J = 1, …, LK (J = (l, k)):
3. Add e_t = (u_t, v_t) to D_J if A′_{l,k,e_t} = 1 and u_t and v_t are connected in D_{J−1}.
4. Define L′(e_t) as the minimum l such that u_t and v_t are not connected in D_{l,K}. Set z′(e_t) := min{1, 4ρ/(ε² 2^{L′(e_t)})}.
Output e_t with probability z′(e_t), giving it weight 1/z′(e_t).
5. Set t := t + 1. Go to step 2 if t ≤ m.

Informally, Algorithm 3 underestimates the strength of some edges until the data structures D_{l,k} become properly connected, but proceeds similarly to Algorithms 1 and 2 after that. Our main goal in the rest of the section is to show that this underestimation of strengths does not lead to a large increase in the size of the sample. Note that not all LK = Θ(log² n) coin tosses A′_{l,k,e} per edge are necessary for an implementation of Algorithm 3 (in particular, we will show that Algorithm 3 can be implemented with O(log log n) = o(LK) work per edge). However, the random variables A′_{l,k,e} are useful for analysis purposes. We now show that Algorithm 3 outputs a sparsification G′ of G with O(n log³ n/ε²) edges whp.

Lemma 4.1 For any ε > 0, w.h.p. the graph G′ is an ε-sparsification of G.

Proof: We can couple the behaviors of Algorithms 1 and 3 using the coin tosses A′_{l,k,e} to show that L(e) ≥ L′(e) for every edge e, i.e. z′(e) ≥ z(e). Hence G′ is a sparsification of G by Corollary 2.4.

It remains to upper bound the size of the sample. The following lemma is crucial to our analysis; its proof is deferred to Appendix A due to space limitations.

Lemma 4.2 Let G(V, E) be an undirected graph. Consider the execution of Algorithm 3, and for 1 ≤ J ≤ LK, where J = (l, k), let X_J denote the set of edges e = (u, v) such that u and v are connected in D_{J−1} when e arrives. Then |E \ X_J| = O(K2^l n) with high probability.

Lemma 4.3 The number of edges in G′ is O(n log³ n/ε²) with high probability.
Proof: Recall that Algorithm 3 samples an edge e_t = (u_t, v_t) with probability z′(e_t) = min{1, 4ρ/(ε² 2^{L′(e_t)})}, where L′(e_t) is the minimum l such that u_t and v_t are not connected in D_{l,K}. As before, for J = (l, k), we denote by X_J the set of edges e = (u, v) such that u and v are connected in D_{J−1} when e arrives. Note that X_{(L,1)} = ∅ w.h.p. by our choice of L = log(2n). For each 1 ≤ l ≤ L, let Y_l = X_{(l,1)} \ X_{(l+1,1)}. We have by Lemma 4.2 that ∑_{1≤j≤l} |Y_j| = O(K2^l n) w.h.p. Also note that edges in Y_l are sampled with probability at most 4ρ/(ε² 2^{l−1}). Hence, the expected number of edges in the sample is at most

∑_{l=1}^{L} |Y_l| · 4ρ/(ε² 2^{l−1}) = O( ∑_{l=1}^{L} K2^l n · 4ρ/(ε² 2^{l−1}) ) = O(n log³ n/ε²).

The high probability bound now follows by standard concentration inequalities.

Finally, we have the following theorem.

Theorem 4.4 For any ε > 0 and d > 0, there exists a one-pass algorithm that, given the edges of an undirected graph G streamed in adversarial order, produces an ε-sparsifier G′ with O(n log³ n/ε²) edges with probability at least 1 − n^{−d}. The algorithm takes O(log log n) amortized time per edge and uses O(log² n) space per node.

Proof: Lemma 4.1 and Lemma 4.3 together establish that G′ is an ε-sparsifier with O(n log³ n/ε²) edges. It remains to prove the stated runtime bounds. Note that when an edge e_t = (u_t, v_t) is processed in step 3 of Algorithm 3, it is not necessary to add e_t to any data structure D_J in which u_t and v_t are already connected. Also, since D_J is a refinement of D_{J′} whenever J′ ≤ J, for every edge e_t there exists J∗ such that u_t and v_t are connected in D_J for every J ≤ J∗ and not connected for any J > J∗. The value of J∗ can be found in O(log log n) time by binary search.
Now we need to keep adding e_t to D_J for each J ≥ J∗ such that U′_{l,k,e_t} = 1. However, we have that E[∑_{J≥J∗} U′_{l,k,e_t}] = O(1). Amortizing over all edges, we get O(1) per edge using standard concentration inequalities.

Corollary 4.5 For any ε > 0 and d > 0, there exists a one-pass algorithm that, given the edges of an undirected graph G streamed in adversarial order, produces an ε-sparsifier G′ with O(n log n/ε²) edges with probability at least 1 − n^{−d}. The algorithm takes amortized O(log log n + (n/m) log⁵ n) time per edge and uses O(log³ n) space per node.

Proof: One can obtain a sparsification G′ of G with O(n log³ n/ε²) edges by running Algorithm 3 on the input graph G, and then run the Benczúr-Karger algorithm on G′ without violating the restrictions of the semi-streaming model. Note that even though G′ is a weighted graph, this incurs an overhead of only O(log² n) per edge of G′, since the weights are polynomial. Since G′ has O(n log³ n) edges, the amortized work per edge of G is O(log log n + (n/m) log⁵ n). The Benczúr-Karger algorithm can be implemented using space proportional to the size of the graph, which yields O(log³ n) space per node.

Remark 4.6 The algorithm above improves upon the runtime of the Benczúr-Karger sparsification scheme when m = ω(n log³ n).

Sparse k-connectivity certificates: Our analysis of the performance of refinement sampling follows broadly similar lines to the analysis of the strength estimation routine in [BK96]. To make this analogy more precise, we note that refinement sampling as used in Algorithm 3 in fact produces a sparse connectivity certificate of G, similarly to the algorithm of Nagamochi and Ibaraki [NI92], although with slightly weaker guarantees on size.
A k-connectivity certificate, or simply a k-certificate, for an n-vertex graph G is a subgraph H of G that contains all edges crossing cuts of size k or less in G. Such a certificate always exists with O(kn) edges, and moreover, there are graphs where Ω(kn) edges are necessary. The algorithm of [NI92] depends on random access to the edges of G to produce a k-certificate with O(kn) edges in O(m) time. We now show that refinement sampling gives a one-pass algorithm to produce a k-certificate with O(kn log^2 n) edges in time O(m log log n + n log n). The result is summarized in the following corollary:

Corollary 4.7 W.h.p., for each l ≥ 1 the set X(D_{l,K}) is a 2^l-certificate of G with O(log^2 n) 2^l n edges.

Proof: W.h.p. X(D_{l,K}) contains all 2^l-weak edges, in particular those that cross cuts of size at most 2^l. The bound on the size follows by Lemma 4.2.

Remark 4.8 Algorithms 1–3 can be easily extended to graphs with polynomially bounded integer weights on edges. If we denote by W the largest edge weight, then it suffices to set the number of levels L to log(2nW) instead of log(2n) and the number of passes to log_{4/3}(nW) instead of log_{4/3} n. A weighted edge is then viewed as several parallel edges, and sampling can be performed efficiently for such edges by sampling directly from the corresponding binomial distribution.

5 A Linear-time Algorithm for O(n log n / ǫ^2)-size Sparsifiers

We now present an algorithm for computing an ǫ-sparsification with O(n log n / ǫ^2) edges in O(m log(1/δ) + n^{1+δ}) expected time for any δ > 0. Thus, the algorithm runs in linear time whenever m = Ω(n^{1+Ω(1)}). We note that no (randomized) algorithm can output an ǫ-sparsification in sub-linear time even if there is no restriction on the size of the sparsifier.
This is easily seen by considering the family of graphs formed by the disjoint union of two n-vertex graphs G_1 and G_2 with m edges each, and a single edge e connecting the two graphs. The cut that separates G_1 from G_2 contains only the edge e, and hence any ǫ-sparsifier must include e. On the other hand, it is easy to see that Ω(m) probes are needed in expectation to discover the edge e.

Our algorithm can in fact be viewed as a two-pass streaming algorithm, and we present it as such below. As before, let G = (V, E) be an undirected unweighted graph. We will use Algorithm 3 as a building block of our construction. We now describe each of the passes.

First pass: Sample every edge of G uniformly at random with probability p = 4/log n. Denote the resulting graph by G' = (V, E'). Give the stream of sampled edges to Algorithm 3 as the input stream, and save the state of the connectivity data structures D_{l,K} for all 1 ≤ l ≤ L at the end of execution. For 1 ≤ l ≤ L, let D*_l denote these connectivity data structures (we will also refer to D*_l as partitions in what follows). Note that the first pass takes O(m) expected time since Algorithm 3 has an overhead of O(log log n) time per edge and the expected size of E' is O(|E|/log n).

Recall that the partitions D*_l are used in Algorithm 3 to estimate the strength of edges e ∈ E'. We now show that these partitions can also be used to estimate the strength of edges in E. The following lemma establishes a relationship between the edge strengths in G' and G. For every edge e ∈ E, let s'_e denote the strength of edge e in the graph G'_e(V, E' ∪ {e}).

Lemma 5.1 W.h.p. s'_e ≤ s_e ≤ 2 s'_e log n + ρ log n for all e ∈ E, where ρ = 16(d + 2) ln n is the oversampling parameter in Karger sampling.

Proof: The first inequality is trivially true since G'_e is a subgraph of G.
For the second inequality, let us first consider any edge e ∈ E with s_e > ρ log n. Let C be the s_e-strong component in G that contains the edge e. By Karger's theorem, w.h.p. the capacity of any cut defined by a partition of the vertices in C decreases by a factor of at most 2 log n after sampling edges of G with probability p = 4/log n = ρ/((1/2)^2 ρ log n), i.e. in going from G to G'. So any cut in C, restricted to edges in E', has size at least s_e/(2 log n), implying that s'_e ≥ s_e/(2 log n). Finally, for any edge e with s_e ≤ ρ log n, s'_e is at least 1, and the inequality thus follows.

We now discuss the second pass over the data. Recall that in order to estimate the strength s'_e of an edge e ∈ E', Algorithm 3 finds the minimum L(e) such that the endpoints of e are not connected in D*_l by doing a binary search over the range [1..L]. For an edge e ∈ G we estimate its strength in G'_e by doing binary search as before, but stopping the binary search as soon as the size of the interval is smaller than δL, thus taking O(log(1/δ)) time per edge and obtaining an estimate that is away from the true value by a factor of at most n^δ. Let s''_e denote this estimate, that is, s'_e n^{−δ} ≤ s''_e ≤ s'_e n^δ. Now sampling every edge with probability p_e = min{ρ n^δ/(ǫ^2 s''_e), 1} and giving each sampled edge weight 1/p_e yields an ǫ-sparsification G'' = (V, E'') of G w.h.p. Moreover, we have that w.h.p. |E''| = Õ(n^{1+δ}). Finally, we provide the graph G'' as input to Algorithm 3 followed by Benczúr–Karger sampling as outlined in Corollary 4.5, obtaining a sparsifier of size O(n log n / ǫ^2). We now summarize the second pass.

Second pass: For each edge e of the input graph G:

• Perform O(log(1/δ)) steps of binary search to calculate s''_e.

• Sample edge e with probability p_e = min{ρ n^δ/(ǫ^2 s''_e), 1}.
• If e is sampled, assign it a weight of 1/p_e, and pass it as input to a fresh invocation of Algorithm 3, followed by Benczúr–Karger sampling as outlined in Corollary 4.5, giving the final sparsification.

Note that the total time taken in the second pass is O(m log(1/δ)) + Õ(n^{1+δ}). We have thus proved the following theorem.

Theorem 5.2 For any ǫ > 0 and δ > 0, there exists a two-pass algorithm that produces an ǫ-sparsifier in time O(m log(1/δ)) + Õ(n^{1+δ}). Thus the algorithm runs in linear time when m = Ω(n^{1+δ}) and δ is constant.

Figure 2: Scheme of refinement relations for Algorithm 2. [Diagram of the L × K grid of stages S_{l,k}, with horizontal arrows within each level and vertical arrows between levels; not reproduced.]

References

[AG09] K. Ahn and S. Guha. On graph problems in a semi-streaming model. Automata, Languages and Programming: Algorithms and Complexity, pages 207–216, 2009.

[AS08] N. Alon and J. Spencer. The Probabilistic Method. Wiley, 2008.

[BK96] András A. Benczúr and David R. Karger. Approximating s-t minimum cuts in Õ(n^2) time. Proceedings of the 28th Annual ACM Symposium on Theory of Computing, pages 47–55, 1996.

[CLRS01] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, 2001.

[FKM+05] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. On graph problems in a semi-streaming model. Theor. Comput. Sci., 348:207–216, 2005.

[KL02] D. Karger and M. Levine. Random sampling in residual graphs. STOC, 2002.

[KRV06] R. Khandekar, S. Rao, and V. Vazirani. Graph partitioning using single commodity flows. STOC, pages 385–390, 2006.

[Mut06] S. Muthukrishnan. Data Streams: Algorithms and Applications. Now Publishers, 2006.

[NI92] H.
Nagamochi and T. Ibaraki. Computing edge-connectivity in multigraphs and capacitated graphs. SIAM Journal on Discrete Mathematics, 5(1):54–66, 1992.

[SS08] D.A. Spielman and N. Srivastava. Graph sparsification by effective resistances. STOC, pages 563–568, 2008.

Figure 3: Scheme of refinement for Algorithm 3. [Diagram of the L × K grid of stages S_{l,k}, analogous to Figure 2; not reproduced.]

A Proof of Lemma 4.2

We denote the edges of G in their order in the stream by E = (e_1, ..., e_m). In what follows we shall treat edge sets as ordered sets, and for any E_1 ⊆ E write E \ E_1 to denote the result of removing the edges of E_1 from E while preserving the order of the remaining edges. For a stream of edges E we shall write E_t to denote the set of the first t edges in the stream. For a κ-connected component C of a graph G we will write |C| to denote the number of vertices in C. Also, we will denote the result of sampling the edges of C uniformly at random with probability p by C'. The following simple lemma will be useful in our analysis:

Lemma A.1 Let C be a κ-connected component of G for some positive integer κ. Denote the graph obtained by sampling edges of C with probability p ≥ λ/κ by C'. Then the number of connected components in C' is at most γ|C| with probability at least 1 − e^{−η|C|}, where γ = 7/8 + e^{−λ/2}/8 and η = 1 − e^{−λ/2}.

Proof: Choose A, B ⊂ V(C) so that A ∪ B = V(C), A ∩ B = ∅, |A| ≥ |V(C)|/2, and for every v ∈ A at least half of its edges that go to vertices in C go to B.
Note that such a partition always exists: starting from any arbitrary partition of the vertices of C, we can repeatedly move a vertex from one side to the other if this increases the number of edges going across the partition, and upon termination, the larger side corresponds to the set A. Denote by Y the number of vertices of A that belong to components of size at least 2. Note that Y can be expressed as a sum of |A| independent 0/1 Bernoulli random variables. Let μ := E[Y]; we have that μ ≥ |A|(1 − (1 − λ/κ)^{κ/2}) ≥ |A|(1 − e^{−λ/2}). We get by the Chernoff bound that

Pr[Y ≤ |A|(1 − e^{−λ/2})/2] ≤ e^{−2μ} ≤ e^{−|C|(1 − e^{−λ/2})} = e^{−η|C|}.

Hence, at least a (1 − e^{−λ/2})/4 fraction of the vertices of C are in components of size at least 2. Hence the number of connected components is at most a 1 − (1 − e^{−λ/2})/8 = 7/8 + e^{−λ/2}/8 = γ fraction of the number of vertices of C.

Proof of Lemma 4.2: The proof is by induction on J. We prove that w.h.p. for every J = (l, k) one has |E \ X_J| ≤ Σ_{1 ≤ J' = (l',k') ≤ J−1} c_1 2^{l'} n for a constant c_1 > 0.

Base case: J = 1. Since everything is connected in D_0 by definition, the claim holds.

Inductive step: J → J + 1. The outline of the proof is as follows. For every J = (l, k) we consider the edges of the stream that the algorithm tries to add to D_J, identify a sequence of 2^l-strongly connected components C_0, C_1, ... in the partially received graph, and use Lemma A.1 to show that the number of connected components decreases fast because only a small fraction of the vertices in the sampled 2^l-strongly connected components are isolated. We thus show that, informally, it will take O(2^l n) edges to make the connectivity data structure D_J in Algorithm 3 connected. The connected components C_s are defined by induction on s.
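As an aside, the existence argument in the proof of Lemma A.1 is constructive. The following is a small illustrative sketch of that local search (our own code, not from the paper): keep switching any vertex whose move strictly increases the cut; at a local optimum every vertex sends at least half of its edges across, so the larger side can serve as A.

```python
def half_out_partition(adj):
    """Partition vertices 0..n-1 into (A, B) so that |A| >= n/2 and every
    vertex in A sends at least half of its edges to B.

    Local search: moving a vertex whose own side holds a strict majority
    of its edges strictly increases the cut, so the loop terminates
    (the cut size is bounded by the number of edges).
    """
    n = len(adj)
    side = [0] * n  # 0 = left, 1 = right
    improved = True
    while improved:
        improved = False
        for v in range(n):
            same = sum(1 for u in adj[v] if side[u] == side[v])
            other = len(adj[v]) - same
            if same > other:  # switching v strictly increases the cut
                side[v] ^= 1
                improved = True
    left = [v for v in range(n) if side[v] == 0]
    right = [v for v in range(n) if side[v] == 1]
    # At a local optimum EVERY vertex has at least half its edges
    # crossing, so the larger side satisfies both requirements on A.
    return (left, right) if len(left) >= len(right) else (right, left)
```

The proof only needs existence of such a partition, but the sketch makes the "repeatedly move a vertex" step concrete.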
The vertices of C_s are elements of a partition P_s of the vertex set V of the graph G. We shall use an auxiliary sequence of graphs which we denote by H^t_s. Let P_0 be the partition consisting of the isolated vertices of V. We treat the base case s = 0 separately to simplify exposition. We use the definitions of γ and η from Lemma A.1 with λ = 1, since we are considering 2^l-connected components when J = (l, k).

Base case: s = 0. Set H^t_0 = (P_0, {e_1, ..., e_t}), i.e. H^t_0 is the partially received graph up to time t. Let t*_0 be the first value of t such that s_{H^t_0}(e_t) ≥ 2^l. This means that e_{t*_0} belongs to a 2^l-strongly connected component in H^{t*_0}_0. Note that this component does not contain any (2^l + 1)-strongly connected components. Denote this component by C_0 (note that the number of edges in C_0 is at most 2^l |C_0| by Lemma 2.2). Denote the random variables that correspond to sampling edges of C_0 by R_0. Let X_0 be an indicator variable that equals 1 if the number of connected components in C'_0 is at most γ|C_0|, and 0 otherwise. By Lemma A.1 we have that Pr[X_0 = 1] ≥ 1 − e^{−η|C_0|}. For a partition P denote diag(P) = {(u, u) : u ∈ P}. Define P_1 by merging the classes of P_0 that belong to connected components of C'_0 if X_0 = 1, and set P_1 equal to P_0 otherwise. Let E_1 = E \ (E(C_0) ∪ diag(P_1)), i.e. we remove the edges of C_0 and also the edges that connect vertices belonging to the same class of P_1. Note that we can safely remove these edges since their endpoints are connected in D_J when they arrive. Define H^t_1 = (P_1, E_1^t), i.e. H^t_1 is the partially received graph on the modified stream of edges.

Inductive step: s → s + 1. As in the base case, let t*_s be the first value of t such that s_{H^t_s}(e_t) ≥ 2^l. This means that e_{t*_s} belongs to a 2^l-connected component in H^{t*_s}_s.
Denote this component by C_s (note that the number of edges in C_s is at most 2^l |C_s| by Lemma 2.2). Denote the random variables that correspond to sampling edges of C_s by R_s. Let X_s be an indicator variable that equals 1 if the number of connected components in C'_s is at most γ|C_s|, and 0 otherwise. By Lemma A.1 we have that Pr[X_s = 1] ≥ 1 − e^{−η|C_s|}. Define P_{s+1} by merging together the vertices that belong to connected components of C'_s. Let E_{s+1} = E_s \ (E(C_s) ∪ diag(P_s)). Denote H^t_s = (P_s, E_s^t). It is important to note that at each step s we only flip the coins R_s that correspond to edges in E(C_s), and delete only those edges from E_s. While there may be edges going across the partitions P_s for which we do not perform a coin flip, their number is bounded by O(2^l n) since these edges do not contain a 2^l-connected component.

Note that for any s > 0 the number of connected components in P_s is at most

n − Σ_{j=1}^{s} (1 − γ) |C_j| X_j.

We now show that it is very unlikely that Σ_{j=1}^{s} |C_j| X_j is more than a constant factor smaller than Σ_{j=1}^{s} |C_j|, thus showing that the number of connected components cannot be more than 1 when Σ_{j=1}^{s} |C_j| ≥ cn/(1 − γ) for an appropriate constant c > 0. For any constant d > 0 define I_+ = {i ≥ 0 : |C_i| > ((d + 2)/η) log n} and I_− = {i ≥ 0 : |C_i| ≤ ((d + 2)/η) log n}. Also define

Z^+_i = Σ_{0 ≤ j ≤ i, j ∈ I_+} X_j |C_j|,   Z^−_i = Σ_{0 ≤ j ≤ i, j ∈ I_−} ( X_j |C_j| − |C_j|(1 − e^{−η|C_j|}) ).

First note that one has Pr[X_j = 1] ≥ 1 − n^{−d−2} for any j ∈ I_+ by Lemma A.1. Hence, it follows by taking the union bound that for i ≤ n^2 one has Pr[Z^+_i = Σ_{j ∈ I_+, j ≤ i} |C_j|] ≥ 1 − n^{−d}. We now consider Z^−_i. Note that the Z^−_i define a martingale sequence with respect to R_{i−1}, ..., R_0: E[Z^−_i | R_{i−1}, ..., R_0] = Z^−_{i−1}.
Also, |Z^−_i − Z^−_{i−1}| ≤ ((d + 2)/η) log n for all i. Hence, by Azuma's inequality (see, e.g., [AS08]) one has, for any t > 0,

Pr[Z^−_i < −t] < exp( −t^2 / (2i (((d + 2)/η) log n)^2) ).

Now consider the smallest value τ such that

Σ_{j ≤ τ} |C_j| = Σ_{j ≤ τ, j ∈ I_+} |C_j| + Σ_{j ≤ τ, j ∈ I_−} |C_j| = S_+ + S_− ≥ 4n / ((1 − e^{−2η})(1 − γ)).

Note that τ < n/(2(1 − e^{−2η})(1 − γ)) since |C_i| ≥ 2. If S_+ ≥ 2n/((1 − e^{−2η})(1 − γ)) ≥ 2n/(1 − γ), then we have that Z^+_τ = S_+ > 2n/(1 − γ) with probability at least 1 − n^{−d}. Thus,

n − Σ_{j=1}^{τ} (1 − γ)|C_j| X_j ≤ n − (1 − γ) Z^+_τ ≤ 0.

Otherwise S_− ≥ 2n/((1 − e^{−2η})(1 − γ)), and by Azuma's inequality we have

Pr[Z^−_τ < −n] < exp( −n^2 / (2τ (((d + 2)/η) log n)^2) ) ≤ exp( −n / (((d + 2)/η) log n)^2 ) < n^{−d}.

Since |C_i| ≥ 2, we have |C_i|(1 − e^{−η|C_i|}) ≥ |C_i|(1 − e^{−2η}), and thus we get

n − Σ_{j=1}^{τ} (1 − γ)|C_j| X_j < n − (1 − γ)( Σ_{1 ≤ j ≤ τ, j ∈ I_−} |C_j|(1 − e^{−η|C_j|}) + Z^−_τ ) < n − (1 − γ)( (1 − e^{−2η}) Σ_{1 ≤ j ≤ τ, j ∈ I_−} |C_j| + Z^−_τ ) ≤ n − (1 − γ)·(2n/(1 − γ)) + n = 0.

We have shown that there exists a constant c' > 0 such that with probability at least 1 − n^{−d}, after c' 2^l n edges are sampled by the algorithm at level J, all subsequent edges will have their endpoints connected in D_J. Note that we never flipped coins for those edges that did not contain a 2^l-connected component. Setting c_1 = c' + 1, we have that w.h.p. |E \ X_J| ≤ c_1 2^l n + |E \ X_{J−1}|. By the inductive hypothesis we have that |E \ X_{J−1}| ≤ Σ_{1 ≤ J' = (l',k') ≤ J−2} c_1 2^{l'} n, which together with the previous estimate gives the desired result. It now follows that |E \ X_J| ≤ Σ_{1 ≤ J' = (l',k') ≤ J−1} c_1 2^{l'} n = O(K 2^l n) w.h.p., finishing the proof of the lemma.
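For reference, the proof above invokes Azuma's inequality in the following standard one-sided form (see, e.g., [AS08]); the constant c below is instantiated with the difference bound ((d + 2)/η) log n:

```latex
% Azuma's inequality for a martingale Z_0 = 0, Z_1, Z_2, \dots with
% bounded differences |Z_i - Z_{i-1}| \le c, for any t > 0:
\Pr\left[ Z_i < -t \right]
  \le \exp\!\left( -\frac{t^2}{2\, i\, c^2} \right),
\qquad \text{here with } c = \frac{(d+2)\log n}{\eta}.
```

Setting t = n and i = τ recovers the bound Pr[Z^−_τ < −n] < n^{−d} used in the second case of the argument.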