Scalable Parallel Factorizations of SDD Matrices and Efficient Sampling for Gaussian Graphical Models
Scalable Parallel Factorizations of SDD Matrices and Efficient Sampling for Gaussian Graphical Models

Dehua Cheng* (USC), Yu Cheng† (USC), Yan Liu* (USC), Richard Peng (MIT), Shang-Hua Teng† (USC)

October 21, 2014

Abstract

Motivated by a sampling problem basic to computational statistical inference, we develop a nearly optimal algorithm for a fundamental problem in spectral graph theory and numerical analysis. Given an $n \times n$ SDDM matrix $M$, and a constant $-1 \le p \le 1$, our algorithm gives efficient access to a sparse $n \times n$ linear operator $\tilde{C}$ such that $M^p \approx \tilde{C} \tilde{C}^\top$. The solution is based on factoring $M$ into a product of simple and sparse matrices using squaring and spectral sparsification. For $M$ with $m$ non-zero entries, our algorithm takes work nearly linear in $m$, and polylogarithmic depth on a parallel machine with $m$ processors. This gives the first sampling algorithm that requires only nearly linear work and $n$ i.i.d. random univariate Gaussian samples to generate i.i.d. random samples for $n$-dimensional Gaussian random fields with SDDM precision matrices. For sampling this natural subclass of Gaussian random fields, it is optimal in the randomness and nearly optimal in the work and parallel complexity. In addition, our sampling algorithm can be directly extended to Gaussian random fields with SDD precision matrices.

1 Introduction

Sampling from a multivariate probability distribution is one of the most fundamental problems in statistical machine learning and statistical inference. In the sequential computation framework, the Markov Chain Monte Carlo (MCMC) based Gibbs sampling algorithm, and its theoretical underpinning built on the celebrated Hammersley-Clifford Theorem, have provided the algorithmic and mathematical foundation for analyzing graphical models [Jor98, KF09].
However, unless there are significant independences among the variables, the Gibbs sampling process tends to be highly sequential [Jor98]. Moreover, many distributions require the Gibbs sampling process to take a significant number of iterations to achieve a reliable estimate. Despite active efforts by various research groups [GLGG11, IIS+03, JSW13, ASW14], the design of a scalable parallel Gibbs sampling algorithm remains a challenging problem. The "holy grail" question in parallel sampling can be characterized as: Is it possible to obtain a characterization of the family of Markov random fields from which one can draw a sample in polylogarithmic parallel time with nearly-linear total work? Formally, the performance of a parallel sampling algorithm can be characterized using the following three parameters:

* Supported in part by NSF research grants IIS-1134990, IIS-1254206 and U.S. Defense Advanced Research Projects Agency (DARPA) under Social Media in Strategic Communication (SMISC) program, Agreement Number W911NF-12-1-0034.
† Supported in part by NSF grants CCF-1111270 and CCF-096448 and by the Simons Investigator Award from the Simons Foundation.

• Time complexity (work): the total number of operations needed by the sampling algorithm to generate the first and subsequent random samples.

• Parallel complexity (depth): the length of the longest critical path of the algorithm in generating the first and subsequent random samples.

• Randomness complexity: the total number of random scalars needed by the sampling algorithm to generate a random sample.

One way to obtain an (approximately) random sample is through Gibbs sampling. The basic Gibbs sampling method follows a randomized process that iteratively resamples each variable conditioned on the values of the other variables.
This method is analogous to the Gauss-Seidel method for solving linear systems, and this connection is most apparent for an important class of multivariate probability distributions: multivariate normal distributions, or multivariate Gaussian distributions [LW12]. In the framework of graphical models, such a distribution is commonly specified by its precision matrix $\Lambda$ and potential vector $h$, i.e., $\Pr(x \mid \Lambda, h) \propto \exp(-\frac{1}{2} x^\top \Lambda x + h^\top x)$, which defines a Gaussian random field. The analogy between the Gibbs sampling method and the Gauss-Seidel iterative method for solving linear systems implies that it has similar worst-case behaviors. For example, if a Gaussian random field has $n$ variables with a chain structure, it may take $\Omega(n)$ iterations for the Gibbs sampling algorithm to converge reasonably [LKW13]. Given an $n \times n$ precision matrix with $m$ non-zero entries, the Gibbs sampling process may take $\Omega(mn)$ total work and require $\Omega(n \chi(G))$ iterations, where $\chi(G)$ is the chromatic number of the graph underlying the precision matrix. It may also use $\Omega(n^2)$ random numbers ($n$ per iteration) to generate its first sample. However, in practice it usually performs significantly better than these worst-case bounds. Our study is partly motivated by the recent work of Johnson, Saunderson, and Willsky [JSW13], which provides a mathematical characterization of the Hogwild Gaussian Gibbs sampling heuristics (see also Niu, Recht, Ré, and Wright [NRRW11]). Johnson et al. proved that if the precision matrix $\Lambda$ is symmetric generalized diagonally dominant, a.k.a. an H-matrix¹, then Hogwild Gibbs sampling converges with the correct mean for any variable partition (among parallel processors) and for any number of locally sequential Gibbs sampling steps with old statistics from other processors.
This connection to H-matrices leads us to draw on developments in nearly-linear-work solvers for linear systems in symmetric diagonally dominant (SDD) matrices [ST04, KMP10, KOSZ13, PS14], which are algorithmically inter-reducible to H-matrices [DS08]. Most relevant to the parallel sampling question is a result with a worst-case polylogarithmic upper bound on parallel complexity [PS14]. In this paper we develop a scalable parallel sampling algorithm for Gaussian random fields with SDD precision matrices by extending the algorithmic framework of [PS14] from solving linear systems to sampling graphical models.

1.1 Our Contributions

We focus on the problem of parallel sampling for a Gaussian random field whose precision matrix $\Lambda$ is SDDM, i.e., positive definite with non-positive off-diagonal entries. This is a natural subclass of Gaussian random fields which arises in applications such as image denoising [Bis06]. It can also be used in the rejection/importance sampling framework [ADFDJ03] as a proposal distribution, likely giving gains over the commonly used diagonal precision matrices in this framework. Although faster sampling is possible for some subsets of SDDM precision matrices [LKW13, PY10], $\Omega(mn)$ remains a worst-case complexity barrier for generating unbiased samples in this subclass of Gaussian random fields. While Newton's method leads to sampling algorithms with polylogarithmic parallel steps, these algorithms are expensive due to calls to dense matrix multiplication, even if the original graphical model is sparse. Our algorithm can generate i.i.d. samples from the Gaussian random field with an approximate covariance with nearly-linear total work. In contrast, samples generated by Gibbs sampling are correlated, but are drawn from the exact covariance in the limit.
¹ A symmetric real $n \times n$ matrix $M = (m_{i,j})$ is generalized diagonally dominant if it can be scaled into a diagonally dominant matrix, that is, there exists a positive diagonal scaling matrix $D = \mathrm{diag}([d_1, \ldots, d_n])$ such that $DMD$ is strictly diagonally dominant.

Our algorithm is based on the following three-step numerical framework for sampling a Gaussian random field with parameters $(\Lambda, h)$: $\Pr(x \mid \Lambda, h) \propto \exp(-\frac{1}{2} x^\top \Lambda x + h^\top x)$.

1. Compute the mean $\mu$ of the distribution from $(\Lambda, h)$ by solving the linear system $\Lambda \mu = h$.

2. Compute a factor of the covariance matrix by finding an inverse square-root factor of the precision matrix $\Lambda$, i.e.,
$\Lambda^{-1} = C C^\top$.  (1)

3. If $C$ is an $n \times n'$ matrix, then we can obtain a sample of $N(\mu, \Lambda^{-1})$ by first drawing a random vector $z = (z_1, z_2, \ldots, z_{n'})^\top$, where the $z_i$, $i = 1, 2, \ldots, n'$, are i.i.d. standard Gaussian random variables, and returning $x = C z + \mu$.

If $C$ has an efficient sparse representation, then Step 3 has a scalable parallel implementation with parallel complexity polylogarithmic in $n$. Polylogarithmic-depth, nearly-linear-work solvers for SDDM linear systems [PS14] also imply a scalable parallel solution to Step 1. So this framework allows us to concentrate on Step 2: factoring the covariance matrix $\Lambda^{-1}$. We give two approaches that take nearly-linear work and polylogarithmic parallel depth to construct a linear operator $\tilde{C}$ such that $\Lambda^{-1} \approx \tilde{C} \tilde{C}^\top$. The first approach factorizes $\Lambda$ based on an analog of the signed edge-vertex incidence matrix. It differs from previous approaches, such as the one by Chow and Saad [CS14], in that it generates a non-square $\tilde{C}$. This leads to a sampling algorithm with randomness complexity $m$, the number of nonzeros in $\Lambda$: it generates a sample of $(\Lambda, h)$ by first taking random univariate Gaussian samples, one per non-zero entry in $\Lambda$.
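The three-step framework above can be sketched in a few lines of dense linear algebra. The sketch below is a minimal illustration, not the paper's algorithm: the hypothetical helper `sample_gaussian_field` uses numpy's dense Cholesky factorization of $\Lambda^{-1}$ as a stand-in for the sparse operator $\tilde{C}$ that the paper constructs.

```python
import numpy as np

def sample_gaussian_field(Lambda, h, rng):
    """Sample x ~ N(mu, Lambda^{-1}) via the three-step framework:
    Step 1: the mean mu solves Lambda mu = h.
    Step 2: factor Lambda^{-1} = C C^T (dense Cholesky here; the
            paper builds a sparse operator instead).
    Step 3: return x = C z + mu with z i.i.d. standard Gaussian."""
    mu = np.linalg.solve(Lambda, h)                 # Step 1
    C = np.linalg.cholesky(np.linalg.inv(Lambda))   # Step 2 (dense stand-in)
    z = rng.standard_normal(C.shape[1])             # Step 3
    return C @ z + mu

# SDDM precision matrix of a 3-variable chain-structured field
Lambda = np.array([[ 2.0, -1.0,  0.0],
                   [-1.0,  2.0, -1.0],
                   [ 0.0, -1.0,  2.0]])
h = np.array([1.0, 0.0, 1.0])

rng = np.random.default_rng(0)
xs = np.array([sample_gaussian_field(Lambda, h, rng) for _ in range(20000)])
print(xs.mean(axis=0))   # close to the mean Lambda^{-1} h
print(np.cov(xs.T))      # close to the covariance Lambda^{-1}
```

The empirical mean and covariance of the i.i.d. samples converge to $\Lambda^{-1} h$ and $\Lambda^{-1}$ respectively, which is exactly what the factorization $\Lambda^{-1} = C C^\top$ guarantees.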
This sampling approach can be effective if $m = O(n)$. The second approach, our main result, produces a square $n \times n$ inverse factorization $\tilde{C}$. It produces, after efficient preprocessing, a scalable parallel i.i.d. random vector generator for $(\Lambda, h)$ with optimal randomness complexity: each random sample of $(\Lambda, h)$ is generated from $n$ i.i.d. standard Gaussian random variables. This factorization problem motivates us to develop a highly efficient algorithm for a fundamental problem in spectral graph theory that could be significant in its own right: Given an $n \times n$ SDDM matrix $M$, a constant $-1 \le p \le 1$, and an approximation parameter $\epsilon$, compute a sparse representation of an $n \times n$ linear operator $\tilde{C}$ such that
$M^p \approx_\epsilon \tilde{C} \tilde{C}^\top$,  (2)
where $\approx_\epsilon$ denotes spectral similarity between linear operators, which we will define at the start of Section 2. Our algorithm factors $M^p$ into a series of well-conditioned, easy-to-evaluate, sparse matrices, which we call a sparse factor chain. For any constant $-1 \le p \le 1$, our algorithm constructs a sparse factor chain with $O(m \cdot \log^{c_1} n \cdot \log^{c_2}(\kappa/\epsilon) \cdot \epsilon^{-4})$ work for modest² constants $c_1$ and $c_2$, where $\kappa$ denotes the condition number³ of $M$. Moreover, for the case $p = -1$ (i.e., the factorization problem needed for Gaussian sampling), we can remove the polynomial dependency on $\epsilon$, so the complexity in this case is of the form $O(m \cdot \log^{c_1} n \cdot \log^{c_2}(\kappa/\epsilon))$. Of equal importance, our algorithm can be implemented to run in polylogarithmic depth on a parallel machine with $m$ processors. Prior to our work, no algorithm could provide such a factorization in nearly-linear time. Because the factorization of an SDD⁴ matrix can be reduced to the factorization of an SDDM matrix of twice its dimensions (see Appendix A), our sampling algorithm can be directly extended to Gaussian random fields with SDD precision matrices.
² Currently, both constants are at most 5; these constants and the exponent 4 for $(1/\epsilon)$ can be further reduced with improvements in spectral sparsification algorithms [ST11, SS11, Kou14, MPX13] and tighter numerical analysis of our approaches.
³ As shown in [ST14] (Lemma 6.1), for an SDDM matrix $M$, $\kappa(M)$ is essentially upper bounded by the ratio of the largest off-diagonal entry to the smallest non-zero off-diagonal entry of $-M$; thus $\kappa(M)$ is always bounded by $\mathrm{poly}(n)/\epsilon_M$, where $\epsilon_M$ denotes the machine precision. Note also that the running time of any numerical algorithm for factoring $M^{-1}$ is $\Omega(m + \log(\kappa(M)/\epsilon))$, where $\Omega(\log(\kappa(M)/\epsilon))$ is the number of bits of precision required in the worst case.
⁴ A matrix $M = (m_{i,j})$ is symmetric diagonally dominant (SDD) if for all $i$, $m_{i,i} > \sum_{j \neq i} |m_{i,j}|$.

Our result is a step toward understanding the holy grail question in parallel sampling mentioned earlier. It provides a natural and non-trivial sufficient condition for Gaussian random fields that leads to parallel sampling algorithms running in polylogarithmic parallel time and nearly-linear total work. While we are mindful of the many classes of precision matrices that are not reducible to SDD matrices, we believe our approach is applicable to a wider range of matrices. The key structural property that we use is that matrices related to the square of the off-diagonal entries of SDDM matrices have sparse representations. We believe incorporating such structural properties is crucial for overcoming the $\Omega(mn)$ barrier for sampling from Gaussian random fields, and that our construction of the sparse factor chain may lead to efficient parallel sampling algorithms for other popular graphical models.
Given the way it was developed, we hope our algorithmic advance will also help to strengthen the connection between machine learning and spectral graph theory, two of the active fields in understanding large data and networks.

2 Background and Notation

In this section, we introduce the definitions and notation that we will use in this paper. We use $\rho(M)$ to denote the spectral radius of the matrix $M$, which is the maximum absolute value of the eigenvalues of $M$. We use $\kappa(M)$ to denote its condition number corresponding to the spectral norm, which is the ratio between the largest and smallest singular values of $M$. In our analysis, we make extensive use of spectral approximation. We say $X \approx_\epsilon Y$ when
$\exp(\epsilon) X \succeq Y \succeq \exp(-\epsilon) X$,  (3)
where the Loewner ordering $Y \succeq X$ means $Y - X$ is positive semi-definite. We use the following standard facts about the Loewner positive definite order and approximation.

Fact 2.1. For positive semi-definite matrices $X$, $Y$, $W$ and $Z$:
a. if $Y \approx_\epsilon Z$, then $X + Y \approx_\epsilon X + Z$;
b. if $X \approx_\epsilon Y$ and $W \approx_\epsilon Z$, then $X + W \approx_\epsilon Y + Z$;
c. if $X \approx_{\epsilon_1} Y$ and $Y \approx_{\epsilon_2} Z$, then $X \approx_{\epsilon_1 + \epsilon_2} Z$;
d. if $X$ and $Y$ are positive definite matrices such that $X \approx_\epsilon Y$, then $X^{-1} \approx_\epsilon Y^{-1}$;
e. if $X \approx_\epsilon Y$ and $V$ is a matrix, then $V^\top X V \approx_\epsilon V^\top Y V$.

We say that $C$ is an inverse square-root factor of $M$ if $C C^\top = M^{-1}$. When the context is clear, we sometimes refer to $C$ as an inverse factor of $M$. Note that the inverse factor of a matrix is not unique. We work with SDDM matrices in our algorithms, and denote them by $M$. SDDM matrices are positive-definite SDD matrices. The equivalence between solving linear systems in such matrices and in graph Laplacians, weakly-SDD matrices, and M-matrices is well known [ST04, DS08, KOSZ13, PS14]. In Appendix A, we rigorously prove that this connection carries over to inverse factors.
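The properties in Fact 2.1 can be spot-checked numerically by measuring the smallest $\epsilon$ for which $X \approx_\epsilon Y$ holds. The sketch below is our own illustration; the helper `approx_eps` (a hypothetical name, not from the paper) computes that $\epsilon$ from the eigenvalues of $X^{-1/2} Y X^{-1/2}$ and checks facts c, d, and e on random positive definite matrices.

```python
import numpy as np

def approx_eps(X, Y):
    """Smallest eps with exp(eps) X >= Y >= exp(-eps) X in the Loewner
    order, for positive definite X, Y: the largest |log| of the
    eigenvalues of X^{-1/2} Y X^{-1/2}."""
    lam, U = np.linalg.eigh(X)
    R = (U / np.sqrt(lam)) @ U.T                     # X^{-1/2}
    return np.abs(np.log(np.linalg.eigvalsh(R @ Y @ R))).max()

rng = np.random.default_rng(0)
n = 5
def rand_pd():
    Q = rng.standard_normal((n, n))
    return Q @ Q.T + n * np.eye(n)                   # positive definite

X, Z = rand_pd(), rand_pd()
Y = X + 0.02 * Z                                     # PD perturbation of X

# Fact 2.1.c: the epsilons add up under chaining
lhs_c, rhs_c = approx_eps(X, Z), approx_eps(X, Y) + approx_eps(Y, Z)
# Fact 2.1.d: inversion preserves eps exactly
d_gap = abs(approx_eps(np.linalg.inv(X), np.linalg.inv(Y)) - approx_eps(X, Y))
# Fact 2.1.e: congruence V^T (.) V does not increase eps
V = rng.standard_normal((n, n))
lhs_e = approx_eps(V.T @ X @ V, V.T @ Y @ V)
print(lhs_c <= rhs_c, d_gap, lhs_e <= approx_eps(X, Y) + 1e-8)
```

For an invertible $V$, fact e actually holds with equality under this exact-$\epsilon$ measure; the fact as stated only needs the inequality.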
SDDM matrices can be split into their diagonal and off-diagonal entries. We will show in Appendix B that by allowing diagonal entries in both matrices of this splitting, we can ensure that the diagonal matrix is $I$ without loss of generality.

Lemma 2.2. Let $M = D - A$ be an SDDM matrix, where $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ is its diagonal, and let $\kappa = \max\{2, \kappa(D - A)\}$. We can pick $c = (1 - 1/\kappa)/(\max_i d_i)$, so that all eigenvalues of $c(D - A)$ lie between $\frac{1}{2\kappa}$ and $2 - \frac{1}{2\kappa}$. Moreover, we can rewrite $c(D - A)$ as an SDDM matrix $I - X$, where all entries of $X = I - c(D - A)$ are non-negative, and $\rho(X) \le 1 - \frac{1}{2\kappa}$.

Since all of our approximation guarantees are multiplicative, we will ignore this constant-factor rescaling, and use the splitting $M = I - X$ for the rest of our analysis. For a symmetric, positive definite matrix $M$ with spectral decomposition
$M = \sum_{i=1}^{n} \lambda_i u_i u_i^\top$,  (4)
its $p$-th power for any $p \in \mathbb{R}$ is
$M^p = \sum_{i=1}^{n} \lambda_i^p u_i u_i^\top$.  (5)

3 Overview

We start by briefly outlining our approach. Our first construction of $\tilde{C}$ follows from the fact that SDDM matrices can be factorized. The proof of the following lemma can be found in Appendix C.

Lemma 3.1 (Combinatorial Factorization). Let $M$ be an SDDM matrix with $m$ non-zero entries and condition number $\kappa$. Then (1) $M$ can be factored into
$M = B B^\top$,  (6)
where $B$ is an $n \times m$ matrix with at most 2 non-zeros per column; and (2) for any approximation parameter $\epsilon > 0$, if $Z$ is a linear operator satisfying $Z \approx_\epsilon M^{-1}$, then the matrix
$\tilde{C} = Z B$  (7)
satisfies $\tilde{C} \tilde{C}^\top \approx_{2\epsilon} M^{-1}$.  (8)
Moreover, a representation of $\tilde{C}$ can be computed in work $O(m \cdot \log^{c_1} n \cdot \log^{c_2} \kappa)$ and depth $O(\log^{c_3} n \cdot \log \kappa)$ for some modest constants $c_1$, $c_2$, and $c_3$, so that for any $m$-dimensional vector $x$, the product $\tilde{C} x$ can be computed in work $O((m + n \cdot \log^{c} n \cdot \log^{3} \kappa) \cdot \log(1/\epsilon))$ and depth $O(\log n \cdot \log \kappa \cdot \log(1/\epsilon))$ for a modest constant $c$.

In particular, the parallel solver algorithm from [PS14] computes $Z \approx_\epsilon M^{-1}$ in total work nearly linear in $m$, and depth polylogarithmic in $n$, $\kappa$ and $\epsilon^{-1}$. We also adapt parts of [PS14] to obtain our main algorithm with optimal randomness complexity, which constructs an $n \times n$ matrix $\tilde{C}$. The high-level idea is to break a difficult linear operator $M$ into a series of products of operators that are individually easy to evaluate. One candidate expression is the identity
$(I - X)^{-1} = (I + X)(I - X^2)^{-1}$.  (9)
It reduces computing $(I - X)^{-1}$ to a matrix-vector multiplication and computing $(I - X^2)^{-1}$. As $\|X^2\|_2 < \|X\|_2$ when $\|X\|_2 < 1$, $I - X^2$ is closer to $I$ than $I - X$ is. Formally, it can be shown that after $\log \kappa$ steps, where $\kappa = \kappa(I - X)$, the problem becomes well-conditioned. This low iteration count, coupled with the low cost of matrix-vector multiplications, gives a low-depth algorithm. A major issue with extending this intuition to matrices is that squares of matrices can be dense. The density can be reduced using sparsification, which leads to approximation errors. The factorization in Equation 9 is problematic for handling such errors: Fact 2.1.e states that such spectral approximations can only be propagated when composed with other operators symmetrically. This issue was addressed in [PS14] using an alternate decomposition that introduces an additional term of $I + X$ on the other side.
However, this introduces an additional problem for computing a good factorization: the overall expression is no longer a product. Our result is based on directly symmetrizing the identity in Equation 9. We use the fact that polynomials of $I$ and $X$ commute to move a half power of the first term onto the other side:
$(I - X)^{-1} = (I + X)^{1/2} (I - X^2)^{-1} (I + X)^{1/2}$.  (10)
This has the additional advantage that the terms can be naturally incorporated into the factorization $\tilde{C} \tilde{C}^\top$. Of course, this leads to the issue of evaluating half powers, or more generally $p$-th powers. Here our key idea, which we will discuss in depth in Section 4 and Appendix D, is to use Maclaurin expansions. These are low-degree polynomials which give high-quality approximations when the matrices are well-conditioned. We show that for any well-conditioned matrix $M$ with $M \approx_{O(1)} I$, and for any constant power $|p| \le 1$, there exists a (polylogarithmic-degree) polynomial $T_{p,t}(\cdot)$ such that
$T_{p,t}(M) \approx_\epsilon M^p$.  (11)
The condition $\rho(X) < 1$ means the expression $I + X$ is almost well-conditioned: the only problematic case is when $X$ has an eigenvalue close to $-1$. We resolve this issue by halving the coefficient in front of $X$, leading to the following formula for (symmetric) factorization:
$(I - X)^{-1} = (I + \frac{1}{2} X)^{1/2} (I - \frac{1}{2} X - \frac{1}{2} X^2)^{-1} (I + \frac{1}{2} X)^{1/2}$,  (12)
which upon introduction of $T_{p,t}(\cdot)$ becomes
$(I - X)^{-1} \approx T_{\frac{1}{2},t}(I + \frac{1}{2} X) (I - \frac{1}{2} X - \frac{1}{2} X^2)^{-1} T_{\frac{1}{2},t}(I + \frac{1}{2} X)$.  (13)
Note that the two terms $T_{\frac{1}{2},t}(I + \frac{1}{2} X)$ are symmetric. As our goal is to approximate $(I - X)^{-1}$ with $\tilde{C} \tilde{C}^\top$, we can place one term into $\tilde{C}$ and the other into $\tilde{C}^\top$. This allows us to reduce the problem to one involving $(I - \frac{1}{2} X - \frac{1}{2} X^2)^{-1}$. Also, as $X^2$ may be a dense matrix, we need to sparsify it during intermediate steps. This matrix is the average of $I - X$ and $I - X^2$.
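Both Equation 9 and Equation 12 are exact matrix identities, so they can be verified directly on a small random example. The sketch below is our own check; the helper `sym_pow` (a hypothetical name) computes symmetric matrix powers via the spectral decomposition of Equations 4 and 5.

```python
import numpy as np

def sym_pow(M, p):
    """M^p for symmetric positive definite M via its spectral
    decomposition (Equations 4-5)."""
    lam, U = np.linalg.eigh(M)
    return (U * lam**p) @ U.T

rng = np.random.default_rng(1)
n = 5
X = rng.random((n, n)); X = (X + X.T) / 2
X /= 1.1 * np.abs(np.linalg.eigvalsh(X)).max()      # enforce rho(X) < 1
I = np.eye(n)

lhs = np.linalg.inv(I - X)

# Equation 9: (I - X)^{-1} = (I + X)(I - X^2)^{-1}
eq9 = (I + X) @ np.linalg.inv(I - X @ X)

# Equation 12: symmetric factorization with the halved coefficient
W = sym_pow(I + X / 2, 0.5)
eq12 = W @ np.linalg.inv(I - X / 2 - X @ X / 2) @ W

print(np.allclose(lhs, eq9), np.allclose(lhs, eq12))
```

The scalar calculation behind Equation 12 is $(1 - x)(1 + \frac{x}{2}) = 1 - \frac{x}{2} - \frac{x^2}{2}$, which is why halving the coefficient keeps the middle factor an average of $I - X$ and $I - X^2$.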
A key observation in [PS14] is that the second matrix, $I - X^2$, corresponds to two-step random walks on $I - X$, and is therefore SDDM and can be efficiently sparsified, which implies that we can obtain a sparsifier for the average as well. The above numerical reduction from the factorization of $(I - X)^{-1}$ to the factorization of $(I - \frac{1}{2} X - \frac{1}{2} X^2)^{-1}$ leads to a chain of matrices akin to the sparse inverse chain defined in [PS14]. We call it a sparse inverse factor chain, and prove its existence in Lemma 5.4.

Definition 3.2 (Sparse Inverse Factor Chain). For a sequence of approximation parameters $\epsilon = (\epsilon_0, \ldots, \epsilon_d)$, an $\epsilon$-sparse factor chain $(X_1, \ldots, X_d)$ for $M = I - X_0$ satisfies the following conditions:
1. for $i = 0, 1, 2, \ldots, d-1$, $I - X_{i+1} \approx_{\epsilon_i} I - \frac{1}{2} X_i - \frac{1}{2} X_i^2$;
2. $I \approx_{\epsilon_d} I - X_d$.

We summarize our first technical lemma below. In this lemma, and the theorems that follow, we will assume without further specifying that our SDDM matrix $M$ is $n \times n$ and has $m$ nonzero entries. We also use $\kappa$ to denote $\kappa(M)$, and assume that $M$ is expressed in the form $M = I - X_0$ for a non-negative matrix $X_0$.

Lemma 3.3 (Efficient Chain Construction). Given an SDDM matrix $M = I - X_0$ and approximation parameter $\epsilon$, there exists an $\epsilon$-sparse factor chain $(X_1, X_2, \ldots, X_d)$ for $M$ with $d = O(\log(\kappa/\epsilon))$ such that (1) $\sum_{i=0}^{d} \epsilon_i \le \epsilon$, and (2) the total number of nonzero entries in $(X_1, X_2, \ldots, X_d)$ is upper bounded by $O(n \log^{c} n \cdot \log^{3}(\kappa/\epsilon)/\epsilon^{2})$, for a constant $c$. Moreover, we can construct such an $\epsilon$-sparse factor chain in work $O(m \cdot \log^{c_1} n \cdot \log^{c_2}(\kappa/\epsilon)/\epsilon^{4})$ and parallel depth $O(\log^{c_3} n \cdot \log(\kappa/\epsilon))$, for some constants $c_1$, $c_2$ and $c_3$.
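Without sparsification, the chain of Definition 3.2 can be simulated densely: taking $X_{i+1} = \frac{1}{2}(X_i + X_i^2)$ exactly (all $\epsilon_i = 0$) and accumulating the factors $(I + \frac{1}{2} X_i)^{1/2}$ yields a dense $\tilde{C}$ with $\tilde{C} \tilde{C}^\top \approx M^{-1}$. The sketch below is an illustration under our own assumptions (a path-graph SDDM matrix, dense eigendecompositions in place of Maclaurin polynomials, and no refinement step).

```python
import numpy as np

def half_pow(M):
    """Symmetric square root via eigendecomposition."""
    lam, U = np.linalg.eigh(M)
    return (U * np.sqrt(lam)) @ U.T

# Path-graph SDDM matrix, rescaled so that M = I - X0 with rho(X0) < 1
n = 40
M = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
M /= 4.0
X = np.eye(n) - M

# Exact (unsparsified) chain: X_{i+1} = (X_i + X_i^2)/2, so that
# I - X_{i+1} = I - X_i/2 - X_i^2/2; C accumulates (I + X_i/2)^{1/2}.
C = np.eye(n)
steps = 0
while np.abs(np.linalg.eigvalsh(X)).max() > 1e-6:
    C = C @ half_pow(np.eye(n) + X / 2)
    X = (X + X @ X) / 2
    steps += 1

# Once I - X_d ~ I, the product satisfies C C^T ~ (I - X_0)^{-1} = M^{-1}
Minv = np.linalg.inv(M)
err = np.linalg.norm(C @ C.T - Minv) / np.linalg.norm(Minv)
print(steps, err)
```

The number of iterations grows with $\log \kappa$, matching the chain-length bound, while the relative error is driven entirely by the final truncation $I - X_d \approx I$.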
While we can use any nearly linear-time spectral sparsification algorithm [ST11, SS11] in the algorithm of Lemma 3.3, the precise exponents of the $\log n$ and $\log \kappa$ terms can be improved using recent work on combinatorial spectral sparsification [Kou14, MPX13]. Currently, the $c_i$'s in Lemma 3.3 and in the subsequent Theorems 3.4 and 3.5 are at most 5. Because of the likely possibility of further improvements in the near future, we only state them here as constants. The length of the chain is logarithmic in both the condition number $\kappa$ and the inverse of $\sum_i \epsilon_i$. We will analyze the error propagation along this chain in Section 5. We now state our main results for computing the inverse square root and factoring the $p$-th power in the following two theorems.

Theorem 3.4 (Scalable Parallel Inverse Factorization). Given an SDDM matrix $M = I - X_0$ and a precision parameter $\epsilon$, we can construct a representation of an $n \times n$ matrix $\tilde{C}$, such that $\tilde{C} \tilde{C}^\top \approx_\epsilon (I - X_0)^{-1}$, in work $O(m \cdot \log^{c_1} n \cdot \log^{c_2} \kappa)$ and parallel depth $O(\log^{c_3} n \cdot \log \kappa)$ for some constants $c_1$, $c_2$, and $c_3$. Moreover, for any $x \in \mathbb{R}^n$, we can compute $\tilde{C} x$ in work $O((m + n \cdot \log^{c} n \cdot \log^{3} \kappa) \cdot \log\log \kappa \cdot \log(1/\epsilon))$ and depth $O(\log n \cdot \log \kappa \cdot \log\log \kappa \cdot \log(1/\epsilon))$ for some other constant $c$.

Proof. For computing the inverse square-root factor ($p = -1$), we first use Lemma 3.3 with $\sum \epsilon_i = 1$ to construct a crude sparse inverse factor chain with length $d = O(\log \kappa)$. Then for each $i$ between 0 and $d-1$, we use Maclaurin expansions from Lemma 4.2 to obtain $O(\log\log \kappa)$-degree polynomials that approximate $(I + \frac{1}{2} X_i)^{1/2}$ to precision $\epsilon_i = \Theta(1/d)$. Finally, we apply the refinement procedure from Theorem 5.7 to reduce the error to the final target precision $\epsilon$, using $O(\log(1/\epsilon))$ multiplications with the crude approximator.
The number of operations needed to compute $\tilde{C} x$ follows from the complexity of Horner's method for polynomial evaluation.

The factorization formula from Equation 13 can be extended to any $p \in [-1, 1]$. This allows us to construct an $\epsilon$-approximate factor of $M^p$ for general $p \in [-1, 1]$, with work nearly linear in $m/\epsilon^4$. It remains open whether this dependency on $\epsilon$ can be reduced to $\log(1/\epsilon)$ in the general case.

Theorem 3.5 (Factoring $M^p$). For any $p \in [-1, 1]$, and an SDDM matrix $M = I - X_0$, we can construct a representation of an $n \times n$ matrix $\tilde{C}$ such that $\tilde{C} \tilde{C}^\top \approx_\epsilon M^p$, in work $O(m \cdot \log^{c_1} n \cdot \log^{c_2}(\kappa/\epsilon)/\epsilon^{4})$ and depth $O(\log^{c_3} n \cdot \log(\kappa/\epsilon))$ for some constants $c_1$, $c_2$, and $c_3$. Moreover, for any $x \in \mathbb{R}^n$, we can compute $\tilde{C} x$ in work $O((m + n \cdot \log^{c} n \cdot \log^{3}(\kappa/\epsilon)/\epsilon^{2}) \cdot \log(\log(\kappa)/\epsilon))$ and depth $O(\log n \cdot \log(\kappa/\epsilon) \cdot \log(\log(\kappa)/\epsilon))$ for some other constant $c$.

Proof. Because the refinement procedure in Theorem 5.7 does not apply to the general $p$-th power case, we need a longer chain and a higher-order Maclaurin expansion here. We first use Lemma 3.3 with $\sum \epsilon_i = \epsilon/2$ to construct a matrix chain with length $d = O(\log(\kappa/\epsilon))$. Then for $i \le d-1$, we use Lemma 4.2 to obtain $O(\log(\log(\kappa)/\epsilon))$-degree polynomials to approximate $(I + \frac{1}{2} X_i)^{-p/2}$ to precision $\epsilon_i = \Theta(\epsilon/d)$. The number of operations needed to compute $\tilde{C} x$ follows from the length of the chain and the order of the Maclaurin expansion.

4 Maclaurin Series Expansion

In this section, we show how to approximate a matrix of the form $(I + \frac{1}{2} X)^p$. Because $\rho(X) < 1$, $I + \frac{1}{2} X$ is well-conditioned. Thus, we can approximate its $p$-th power to any approximation parameter $\epsilon > 0$ using an $O(\log(1/\epsilon))$-order Maclaurin expansion, i.e., a low-degree polynomial of $X$. Moreover, since the approximation is a polynomial in $X$, the eigenbasis is preserved.
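A concrete way to see the $O(\log(1/\epsilon))$ degree claim is to approximate $M^p$ for $M = I + \frac{1}{2} X$ by the truncated binomial series of $\lambda^p$ about $\lambda = 1$, one valid choice of Maclaurin-type polynomial. The helper name `maclaurin_pow` and the test matrix below are our own illustrative choices, not the paper's construction.

```python
import numpy as np

def maclaurin_pow(M, p, t):
    """Degree-t polynomial approximation of M^p from the binomial
    series of lambda^p about lambda = 1:
        M^p ~ sum_{k=0..t} binom(p, k) (M - I)^k."""
    n = M.shape[0]
    acc = np.eye(n)
    term = np.eye(n)
    coeff = 1.0
    D = M - np.eye(n)
    for k in range(1, t + 1):
        coeff *= (p - k + 1) / k       # binom(p, k) by recurrence
        term = term @ D
        acc = acc + coeff * term
    return acc

rng = np.random.default_rng(2)
n = 6
X = rng.random((n, n)); X = (X + X.T) / 2
X /= 1.05 * np.abs(np.linalg.eigvalsh(X)).max()     # rho(X) < 1
M = np.eye(n) + X / 2                               # eigenvalues in (1/2, 3/2)

lam, U = np.linalg.eigh(M)
exact = (U * lam**0.5) @ U.T                        # M^{1/2}

errs = [np.linalg.norm(maclaurin_pow(M, 0.5, t) - exact, 2)
        for t in (2, 5, 10, 20)]
print(errs)
```

Since the eigenvalues of $M - I$ have absolute value at most $\frac{1}{2}\rho(X) < \frac{1}{2}$, the truncation error decays geometrically in the degree $t$, so degree $O(\log(1/\epsilon))$ suffices for precision $\epsilon$.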
We start with the following lemma on the residue of the Maclaurin expansion, which we prove in Appendix D.

Lemma 4.1 (Maclaurin Series). Fix $p \in [-1, 1]$ and $\delta \in (0, 1)$. For any error tolerance $\epsilon > 0$, there exists a $t$-degree polynomial $T_{p,t}(\cdot)$ with $t \le \frac{\log(1/(\epsilon(1-\delta)^2))}{1-\delta}$, such that for all $\lambda \in [1-\delta, 1+\delta]$,
$\exp(-\epsilon) \lambda^p \le T_{p,t}(\lambda) \le \exp(\epsilon) \lambda^p$.  (14)

Now we present one of our key lemmas, which provides the backbone for bounding the overall error of our sparse factor chain. We claim that substituting $(I + \frac{1}{2} X)^p$ with its Maclaurin expansion in Equation 13 preserves the multiplicative approximation. We prove the lemma for $p \in [-1, 1]$, which we later use for computing the $p$-th power of SDDM matrices. For computing the inverse square-root factor, we only need the special case $p = -1$.

Lemma 4.2 (Factorization with Matrix Polynomials). Let $X$ be a symmetric matrix with $\rho(X) < 1$ and fix $p \in [-1, 1]$. For any approximation parameter $\epsilon > 0$, there exists a $t$-degree polynomial $T_{p,t}(\cdot)$ with $t = O(\log(1/\epsilon))$, such that
$(I - X)^p \approx_\epsilon T_{-\frac{p}{2},t}(I + \frac{1}{2} X) (I - \frac{1}{2} X - \frac{1}{2} X^2)^p T_{-\frac{p}{2},t}(I + \frac{1}{2} X)$.  (15)

Proof. Take the spectral decomposition of $X$:
$X = \sum_i \lambda_i u_i u_i^\top$,  (16)
where the $\lambda_i$ are the eigenvalues of $X$. Then we can represent the left-hand side of Equation 15 as
$(I - X)^p = (I + \frac{1}{2} X)^{-p/2} (I - \frac{1}{2} X - \frac{1}{2} X^2)^p (I + \frac{1}{2} X)^{-p/2} = \sum_i (1 - \frac{1}{2}\lambda_i - \frac{1}{2}\lambda_i^2)^p (1 + \frac{1}{2}\lambda_i)^{-p} u_i u_i^\top$,  (17)
and the right-hand side of Equation 15 as
$T_{-\frac{p}{2},t}(I + \frac{1}{2} X) (I - \frac{1}{2} X - \frac{1}{2} X^2)^p T_{-\frac{p}{2},t}(I + \frac{1}{2} X) = \sum_i (1 - \frac{1}{2}\lambda_i - \frac{1}{2}\lambda_i^2)^p T_{-\frac{p}{2},t}(1 + \frac{1}{2}\lambda_i)^2 u_i u_i^\top$.  (18)
Because $\rho(X) < 1$, for all $i$ we have $1 - \frac{1}{2}\lambda_i - \frac{1}{2}\lambda_i^2 > 0$.
By invoking Lemma 4.1 with $\delta = 1/2$, error tolerance $\epsilon/2$ and power $-p/2$, we can obtain the polynomial $T_{-\frac{p}{2},t}(\cdot)$ with the following property:
$\exp(-\epsilon)(1 + \frac{1}{2}\lambda_i)^{-p} \le T_{-\frac{p}{2},t}(1 + \frac{1}{2}\lambda_i)^2 \le \exp(\epsilon)(1 + \frac{1}{2}\lambda_i)^{-p}$.  (19)
To conclude the proof, we use this inequality inside Equation 18 and compare Equations 17 and 18.

5 Sparse Factor Chains: Construction and Analysis

Our basic algorithm constructs an approximate factor of $(I - X)^p$, for any $p \in [-1, 1]$. Using the factorization with Maclaurin matrix polynomials given by Lemma 4.2, we extend the framework of Peng-Spielman [PS14] for parallel approximation of the matrix inverse to SDDM factorization. Our algorithm uses the identity
$(I - X)^p = (I + \frac{1}{2} X)^{-p/2} (I - \frac{1}{2} X - \frac{1}{2} X^2)^p (I + \frac{1}{2} X)^{-p/2}$,  (20)
and the fact that we can approximate $(I + \frac{1}{2} X)^{-p/2}$ with its Maclaurin series expansion (Lemma 4.2). It thus numerically reduces the factorization problem of $(I - X)^p$ to that of $(I - \frac{1}{2} X - \frac{1}{2} X^2)^p$.

The key to efficient applications of this iterative reduction is the spectral sparsification of $I - \frac{1}{2} X - \frac{1}{2} X^2$, so that our matrix polynomials only use sparse matrices. In particular, we start with $I - X_0$, and at step $i$ we sparsify $I - \frac{1}{2} X_i - \frac{1}{2} X_i^2$ to obtain its spectral approximation $I - X_{i+1}$, and then proceed to the next iteration until $\rho(X_d)$ is small enough, at which point we approximate $I - X_d$ with $I$.

Lemma 5.1 (Sparsification of $I - \frac{1}{2} X - \frac{1}{2} X^2$). Given an SDDM matrix $M = I - X$ and a precision parameter $\epsilon$, we can construct an $n \times n$ non-negative matrix $\bar{X}$ in work $O(m \cdot \log^{c_1} n/\epsilon^{4})$ and parallel depth $O(\log^{c_3} n)$, such that (1) $I - \bar{X} \approx_\epsilon I - \frac{1}{2} X - \frac{1}{2} X^2$, and (2) the number of non-zero entries in $\bar{X}$ is upper bounded by $O(n \cdot \log^{c} n/\epsilon^{2})$, for some modest constants $c_1$, $c_3$ and $c$.

Proof.
We split $I - \frac{1}{2} X - \frac{1}{2} X^2$ into $\frac{1}{2}(I - X) + \frac{1}{2}(I - X^2)$ and first apply the Peng-Spielman sparsification method (Section 6 of [PS14]) to sparsify $I - X^2$. This method involves $n$ independent sparsification tasks, one per row of $X$, by sampling two-step random walks on $X$. In other words, it sparsifies $I - X^2$ based on $X$, without actually calculating $X^2$. It constructs a sparse matrix $X'$ such that $I - X' \approx_{\epsilon/2} I - X^2$ with work $O(m \cdot \log^{2} n/\epsilon^{2})$ and parallel depth $O(\log n)$ [PS14]. Moreover, the number of nonzero entries in $X'$ is upper bounded by $O(m \cdot \log n/\epsilon^{2})$. Next, we sparsify $\frac{1}{2}(I - X) + \frac{1}{2}(I - X')$ using a spectral sparsification algorithm [Kou14, MPX13, SS11, ST11] and compute a matrix $\bar{X}$ that satisfies
$I - \bar{X} \approx_{\epsilon/2} \frac{1}{2}(I - X) + \frac{1}{2}(I - X') \approx_{\epsilon/2} I - \frac{1}{2} X - \frac{1}{2} X^2$  (21)
in work $O(m \cdot \log^{c_1} n/\epsilon^{4})$ and parallel depth $O(\log^{c_3} n)$, with the number of non-zero entries in $\bar{X}$ upper bounded by $O(n \cdot \log^{c} n/\epsilon^{2})$. A technical subtlety, unlike in [PS14] where one can simply use approximate diagonals, is that we need to maintain $I$, and the non-negativity of $\bar{X}$. This can be achieved using the splitting technique presented in Section 3.3 of Peng's thesis [Pen13].

The following basic lemma then guarantees that spectral approximation is preserved under the $p$-th power, which enables the continuation of the reduction.

Lemma 5.2. Let $A$, $B$ be positive definite matrices with $A \approx_\epsilon B$. Then for all $p \in [-1, 1]$, we have that
$A^p \approx_{|p|\epsilon} B^p$.  (22)

Proof. By Theorem 4.2.1 of [Bha07], if $A \succeq B$ then $A^p \succeq B^p$ for all $p \in [0, 1]$. Thus, if $A \approx_\epsilon B$, then for all $p \in [0, 1]$,
$\exp(p\epsilon) A^p \succeq B^p \succeq \exp(-p\epsilon) A^p$,  (23)
which implies $A^p \approx_{|p|\epsilon} B^p$. By Fact 2.1.d, the claim holds for all $p \in [-1, 0]$ as well.
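Lemma 5.2 can be checked numerically by measuring the best $\epsilon$ in the $\approx_\epsilon$ relation directly from the eigenvalues of $A^{-1/2} B A^{-1/2}$. In the sketch below (with hypothetical helpers `mpow` and `approx_eps`, our own names), the ratio $\epsilon(A^p, B^p)/(|p| \cdot \epsilon(A, B))$ should never exceed 1, which is exactly the lemma's claim.

```python
import numpy as np

def mpow(M, p):
    """M^p for symmetric positive definite M via eigendecomposition."""
    lam, U = np.linalg.eigh(M)
    return (U * lam**p) @ U.T

def approx_eps(A, B):
    """Smallest eps with exp(eps) A >= B >= exp(-eps) A in the Loewner
    order: max |log| of the eigenvalues of A^{-1/2} B A^{-1/2}."""
    R = mpow(A, -0.5)
    return np.abs(np.log(np.linalg.eigvalsh(R @ B @ R))).max()

rng = np.random.default_rng(3)
n = 6
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)            # positive definite
E = rng.standard_normal((n, n))
B = A + 0.05 * (E + E.T) / 2           # nearby positive definite matrix

eps = approx_eps(A, B)
ratios = [approx_eps(mpow(A, p), mpow(B, p)) / (abs(p) * eps)
          for p in (0.25, 0.5, 1.0, -0.5, -1.0)]
print(round(eps, 4), [round(r, 3) for r in ratios])
```

For $p = \pm 1$ the ratio is 1 up to rounding (Fact 2.1.d makes inversion exact), while fractional powers contract the approximation error, consistent with operator monotonicity of $t \mapsto t^p$ for $p \in [0, 1]$.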
In the rest of the section, we bound the length of our sparse factor chain (Section 5.1) and show that the chain indeed provides a good approximation to $(I-X_0)^p$ (Section 5.2). In Section 5.3, we present a refinement technique that further reduces the dependency on $\epsilon$ to $\log(1/\epsilon)$ for the case $p = \pm 1$.

5.1 Convergence of the Matrix Chain

To bound the length of the sparse factor chain, we need to study the convergence behavior of $\rho(X_i)$. We show that a properly constructed matrix chain of length $O(\log(\kappa/\epsilon))$ is sufficient to achieve $\epsilon$ precision. Our analysis is analogous to the one in [PS14]. However, we have to deal with the first-order term in $I-\frac{1}{2}X-\frac{1}{2}X^2$, which results in a longer chain when $\epsilon$ is small.

We start with the following lemma, which analyzes one iteration of the chain construction. We also take into account the approximation error introduced by sparsification.

Lemma 5.3. Let $X$ and $\bar{X}$ be nonnegative symmetric matrices such that $X \preceq I$, $\bar{X} \preceq I$, and
\[ I-\bar{X} \approx_{\epsilon} I-\tfrac{1}{2}X-\tfrac{1}{2}X^2. \tag{24} \]
If $\rho(X) \le 1-\lambda$, then the eigenvalues of $\bar{X}$ all lie between $1-\frac{1}{2}(3-\lambda)\exp(\epsilon)$ and $1-\frac{1}{2}(3\lambda-\lambda^2)\exp(-\epsilon)$.

Proof. If the eigenvalues of $I-X$ are between $\lambda$ and $2-\lambda$, the eigenvalues of $I-\frac{1}{2}X-\frac{1}{2}X^2$ belong to $[(3\lambda-\lambda^2)/2,\, (3-\lambda)/2]$. The bound stated in this lemma then follows from the fact that the spectral sparsification step preserves the eigenvalues of $I-\frac{1}{2}X-\frac{1}{2}X^2$ to within a factor of $\exp(\pm\epsilon)$.

We next prove that our sparse matrix chain achieves overall precision $\epsilon$.

Lemma 5.4 (Short Factorization Chain). Let $M = I-X_0$ be an SDDM matrix with condition number $\kappa$. For any approximation parameter $\epsilon$, there exists an $(\epsilon_0, \ldots$
, $\epsilon_d)$-sparse factor chain of $M$ with $d = \log_{9/8}(\kappa/\epsilon)$ and $\sum_{j=0}^{d}\epsilon_j \le \epsilon$, satisfying
\[ I-X_{i+1} \approx_{\epsilon_i} I-\tfrac{1}{2}X_i-\tfrac{1}{2}X_i^2 \quad \forall\, 0 \le i \le d-1, \qquad I-X_d \approx_{\epsilon_d} I. \tag{25} \]

Proof. Without loss of generality, we assume $\epsilon \le \frac{1}{10}$. For $0 \le i \le d-1$, we set $\epsilon_i = \frac{\epsilon}{8d}$. We will show that our chain satisfies the condition given by Equation (25) with $\epsilon_d \le \frac{7\epsilon}{8}$, and thus $\sum_{i=0}^{d}\epsilon_i \le \epsilon$.

Let $\lambda_i = 1-\rho(X_i)$. We split the error analysis into two parts: one for $\lambda_i \le \frac{1}{2}$, and one for $\lambda_i \ge \frac{1}{2}$. In the first part, because $\epsilon_i \le 1/10$,
\[ 1-\tfrac{1}{2}\left(3\lambda_i-\lambda_i^2\right)\exp(-\epsilon_i) < 1-\tfrac{9}{8}\lambda_i \quad\text{and}\quad 1-\tfrac{1}{2}(3-\lambda_i)\exp(\epsilon_i) > -1+\tfrac{9}{8}\lambda_i, \tag{26} \]
which, by Lemma 5.3, implies that $\lambda_{i+1} \ge \frac{9}{8}\lambda_i$. Because initially $\lambda_0 \ge 1/\kappa$, after $d_1 = \log_{9/8}\kappa$ iterations, it must be the case that $\lambda_{d_1} \ge \frac{1}{2}$.

Now we enter the second part. Note that $\epsilon_i = \frac{\epsilon}{8d} \le \frac{\epsilon}{6}$. Thus, when $\frac{1}{2} \le \lambda_i \le 1-\frac{5}{6}\epsilon$, we have $\epsilon_i \le \frac{1-\lambda_i}{5}$, and therefore
\[ 1-\tfrac{1}{2}\left(3\lambda_i-\lambda_i^2\right)\exp(-\epsilon_i) < \tfrac{8}{9}(1-\lambda_i) \quad\text{and}\quad 1-\tfrac{1}{2}(3-\lambda_i)\exp(\epsilon_i) > \tfrac{8}{9}(-1+\lambda_i), \tag{27} \]
which, by Lemma 5.3, implies that $1-\lambda_{i+1} \le \frac{8}{9}(1-\lambda_i)$. Because $1-\lambda_{d_1} \le \frac{1}{2}$, after another $d_2 = \log_{9/8}(1/\epsilon)$ iterations, we get $\rho(X_{d_1+d_2}) \le \frac{5}{6}\epsilon$. Because $\exp(-\frac{7}{8}\epsilon) < 1-\frac{5}{6}\epsilon$ when $\epsilon \le \frac{1}{10}$, we conclude that $I-X_{d_1+d_2} \approx_{\epsilon_d} I$ with $\epsilon_d \le \frac{7}{8}\epsilon$.

5.2 Precision of the Factor Chain

We now bound the total error of the matrix factor that our algorithm constructs. We start with an error analysis of a single reduction step of the algorithm.

Lemma 5.5. Fix $p \in [-1,1]$. Given an $\epsilon$-sparse factor chain, for all $0 \le i \le d-1$, there exist degree-$t_i$ polynomials, with $t_i = O(\log(1/\epsilon_i))$, such that
\[ (I-X_i)^p \approx_{2\epsilon_i} T_{-\frac{p}{2},t_i}\!\left(I+\tfrac{1}{2}X_i\right)\,(I-X_{i+1})^p\,T_{-\frac{p}{2},t_i}\!\left(I+\tfrac{1}{2}X_i\right). \tag{28} \]

Proof.
\[ (I-X_i)^p \approx_{\epsilon_i} T_{-\frac{p}{2},t_i}\!\left(I+\tfrac{1}{2}X_i\right)\left(I-\tfrac{1}{2}X_i-\tfrac{1}{2}X_i^2\right)^p T_{-\frac{p}{2},t_i}\!\left(I+\tfrac{1}{2}X_i\right) \approx_{\epsilon_i} T_{-\frac{p}{2},t_i}\!\left(I+\tfrac{1}{2}X_i\right)(I-X_{i+1})^p\, T_{-\frac{p}{2},t_i}\!\left(I+\tfrac{1}{2}X_i\right). \tag{29} \]
The first approximation follows from Lemma 4.2. The second approximation follows from Lemma 5.2, Fact 2.1.e, and $I-X_{i+1} \approx_{\epsilon_i} I-\frac{1}{2}X_i-\frac{1}{2}X_i^2$.

The next lemma bounds the total error of our construction.

Lemma 5.6. Define $Z_{p,i}$ as
\[ Z_{p,i} = \begin{cases} I & \text{if } i = d, \\ T_{-\frac{p}{2},t_i}\!\left(I+\tfrac{1}{2}X_i\right) Z_{p,i+1} & \text{if } 0 \le i \le d-1. \end{cases} \tag{30} \]
Then the following statement is true for all $0 \le i \le d$:
\[ Z_{p,i} Z_{p,i}^\top \approx_{2\sum_{j=i}^{d}\epsilon_j} (I-X_i)^p. \tag{31} \]

Proof. We prove this lemma by reverse induction. Combining $I \approx_{\epsilon_d} I-X_d$ and Lemma 5.2, we know that $Z_{p,d} Z_{p,d}^\top = I^p \approx_{\epsilon_d} (I-X_d)^p$, so the claim is true for $i = d$. Now assume the claim is true for $i = k+1$. Then
\[ Z_{p,k}Z_{p,k}^\top = T_{-\frac{p}{2},t_k}\!\left(I+\tfrac{1}{2}X_k\right) Z_{p,k+1}Z_{p,k+1}^\top\, T_{-\frac{p}{2},t_k}\!\left(I+\tfrac{1}{2}X_k\right) \approx_{2\sum_{j=k+1}^{d}\epsilon_j} T_{-\frac{p}{2},t_k}\!\left(I+\tfrac{1}{2}X_k\right)(I-X_{k+1})^p\, T_{-\frac{p}{2},t_k}\!\left(I+\tfrac{1}{2}X_k\right) \approx_{2\epsilon_k} (I-X_k)^p. \tag{32} \]
The last approximation follows from Lemma 5.5.

If we define $\tilde{C}$ to be $Z_{-\frac{1}{2},0}$, this lemma implies $\tilde{C}\tilde{C}^\top \approx_{2\sum_{j=0}^{d}\epsilon_j} (I-X_0)^{-1}$.

5.3 Iterative Refinement for the Inverse Square-Root Factor

In this section, we provide a highly accurate factor for the case $p = -1$. We use a refinement approach inspired by the preconditioning technique for square factors described in Section 2.2 of [CS14].

Theorem 5.7. For any symmetric positive definite matrix $M$, any matrix $Z$ such that $M^{-1} \approx_{O(1)} ZZ^\top$, and any error tolerance $\epsilon > 0$, there exists a linear operator $\tilde{C}$ which is an $O(\log(1/\epsilon))$-degree polynomial of $M$ and $Z$, such that $\tilde{C}\tilde{C}^\top \approx_{\epsilon} M^{-1}$.

Proof. Starting from $M^{-1} \approx_{O(1)} ZZ^\top$, we know that $Z^\top M Z \approx_{O(1)} I$. This implies that $\kappa(Z^\top M Z) = O(1)$, so we can scale $Z^\top M Z$ so that its eigenvalues lie in $[1-\delta, 1+\delta]$ for some constant $0 < \delta < 1$.
By applying Lemma 4.1 to its eigenvalues, there is an $O(\log(1/\epsilon))$-degree polynomial $T_{-\frac{1}{2},t}(\cdot)$ that approximates the inverse square root of $Z^\top M Z$, i.e.,
\[ \left(T_{-\frac{1}{2},O(\log(1/\epsilon))}\!\left(Z^\top M Z\right)\right)^2 \approx_{\epsilon} \left(Z^\top M Z\right)^{-1}. \tag{33} \]
Here, we can rescale $Z^\top M Z$ back inside $T_{-\frac{1}{2},t}(\cdot)$, which does not affect the multiplicative error. Then by Fact 2.1.e, we have
\[ Z\left(T_{-\frac{1}{2},O(\log(1/\epsilon))}\!\left(Z^\top M Z\right)\right)^2 Z^\top \approx_{\epsilon} Z\left(Z^\top M Z\right)^{-1} Z^\top = M^{-1}. \tag{34} \]
So if we define the linear operator $\tilde{C} = Z\,T_{-\frac{1}{2},O(\log(1/\epsilon))}(Z^\top M Z)$, then $\tilde{C}$ satisfies the claimed properties.

Note that this refinement procedure is specific to $p = \pm 1$. It remains open whether it can be extended to general $p \in [-1,1]$.

References

[ADFDJ03] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5-43, 2003.

[ASW14] Sungjin Ahn, Babak Shahbaba, and Max Welling. Distributed stochastic gradient MCMC. In ICML, 2014.

[Bha07] Rajendra Bhatia. Positive Definite Matrices. Princeton University Press, 2007.

[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

[CS14] Edmond Chow and Yousef Saad. Preconditioned Krylov subspace methods for sampling multivariate Gaussian distributions. SIAM Journal on Scientific Computing, 36(2):A588-A608, 2014.

[DS08] Samuel I. Daitch and Daniel A. Spielman. Faster approximate lossy generalized flow via interior point algorithms. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 451-460, New York, NY, USA, 2008. ACM.

[GLGG11] Joseph Gonzalez, Yucheng Low, Arthur Gretton, and Carlos Guestrin. Parallel Gibbs sampling: From colored fields to thin junction trees. In Proc. of AISTATS'11, 2011.

[Gre96] Keith D. Gremban. Combinatorial Preconditioners for Sparse, Symmetric, Diagonally Dominant Linear Systems.
PhD thesis, Carnegie Mellon University, 1996.

[IIS+03] Alexander Ihler, Erik Sudderth, William Freeman, and Alan Willsky. Efficient multiscale sampling from products of Gaussian mixtures. In NIPS 17. MIT Press, 2003.

[Jor98] M. I. Jordan. Learning in Graphical Models. The MIT Press, 1998.

[JSW13] Matthew Johnson, James Saunderson, and Alan Willsky. Analyzing hogwild parallel Gaussian Gibbs sampling. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2715-2723. Curran Associates, Inc., 2013.

[KF09] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009.

[KMP10] Ioannis Koutis, Gary L. Miller, and Richard Peng. Approaching optimality for solving SDD linear systems. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, pages 235-244, 2010.

[KOSZ13] Jonathan A. Kelner, Lorenzo Orecchia, Aaron Sidford, and Zeyuan Allen Zhu. A simple, combinatorial algorithm for solving SDD systems in nearly-linear time. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC '13, 2013.

[Kou14] Ioannis Koutis. A simple parallel algorithm for spectral sparsification. CoRR, abs/1402.3851, 2014.

[LKW13] Ying Liu, Oliver Kosut, and Alan S. Willsky. Sampling from Gaussian graphical models using subgraph perturbations. In Proceedings of the 2013 IEEE International Symposium on Information Theory, Istanbul, Turkey, July 7-12, 2013, pages 2498-2502, 2013.

[LW12] Po-Ling Loh and Martin J. Wainwright. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. In NIPS, pages 2096-2104, 2012.

[MPX13] Gary L.
Miller, Richard Peng, and Shen Chen Xu. Improved parallel algorithms for spanners and hopsets. CoRR, abs/1309.3545, 2013.

[NRRW11] Feng Niu, Benjamin Recht, Christopher Re, and Stephen J. Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693-701, 2011.

[Pen13] Richard Peng. Algorithm Design Using Spectral Graph Theory. PhD thesis, Carnegie Mellon University, Pittsburgh, August 2013. CMU CS Tech Report CMU-CS-13-121.

[PS14] Richard Peng and Daniel A. Spielman. An efficient parallel solver for SDD linear systems. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC '14, pages 333-342, New York, NY, USA, 2014. ACM.

[PY10] George Papandreou and Alan L. Yuille. Gaussian sampling by local perturbations. In NIPS, pages 1858-1866. MIT Press, 2010.

[SS11] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM J. Comput., 40(6):1913-1926, 2011.

[ST04] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, STOC '04, pages 81-90, 2004.

[ST11] Daniel A. Spielman and Shang-Hua Teng. Spectral sparsification of graphs. SIAM J. Comput., 40(4):981-1025, July 2011.

[ST14] Daniel A. Spielman and Shang-Hua Teng. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM J. Matrix Analysis and Applications, 35(3):835-885, 2014.

A Gremban's Reduction for the Inverse Square-Root Factor

We first note that Gremban's reduction [Gre96] can be extended from solving linear systems to computing the inverse square-root factor.
The problem of finding an inverse factor of an SDD matrix can be reduced to finding an inverse factor of an SDDM matrix that is twice as large.

Lemma A.1. Let $\Lambda = D + A_n + A_p$ be an $n \times n$ SDD matrix, where $D$ is the diagonal of $\Lambda$, $A_p$ contains all the positive off-diagonal entries of $\Lambda$, and $A_n$ contains all the negative off-diagonal entries. Consider the following SDDM matrix $S$, and an inverse factor $\tilde{C}$ of $S$, such that
\[ S = \begin{bmatrix} D+A_n & -A_p \\ -A_p & D+A_n \end{bmatrix} \quad\text{and}\quad \tilde{C}\tilde{C}^\top = S^{-1}. \tag{35} \]
Let $I_n$ be the $n \times n$ identity matrix. The matrix
\[ C = \frac{1}{\sqrt{2}} \begin{bmatrix} I_n & -I_n \end{bmatrix} \tilde{C} \tag{36} \]
is an inverse square-root factor of $\Lambda$.

Proof. One can verify that
\[ \Lambda^{-1} = \frac{1}{2} \begin{bmatrix} I_n & -I_n \end{bmatrix} S^{-1} \begin{bmatrix} I_n \\ -I_n \end{bmatrix}, \tag{37} \]
and thus $CC^\top = \Lambda^{-1}$.

Note that if $\tilde{C}$ is $2n \times 2n$, then $C$ will be an $n \times 2n$ matrix. Given the non-square inverse factor $C$, we can draw a sample from the multivariate Gaussian distribution with covariance $\Lambda^{-1}$ as follows. First draw $2n$ i.i.d. standard Gaussian random variables $x = (x_1, x_2, \ldots, x_{2n})^\top$; then $y = Cx$ is a multivariate Gaussian random vector with covariance $\Lambda^{-1}$.

B Linear Normalization of SDDM Matrices

For an SDDM matrix $D-A$, one way to normalize it is to multiply by $D^{-1/2}$ on both sides:
\[ D-A = D^{1/2}\left(I - D^{-1/2}AD^{-1/2}\right)D^{1/2}. \tag{38} \]
Because $\rho(D^{-1/2}AD^{-1/2}) < 1$, this enables us to approximate $(I - D^{-1/2}AD^{-1/2})^p$. However, it is not clear how to undo the normalization and obtain $(D-A)^p$ from it, because $D$ and $A$ do not commute. Instead, we consider another normalization, as in Lemma 2.2, and provide its proof.

Lemma 2.2. Let $M = D-A$ be an SDDM matrix, where $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ is its diagonal, and let $\kappa = \max\{2, \kappa(D-A)\}$. We can pick $c = (1-1/\kappa)/(\max_i d_i)$, so that all eigenvalues of $c(D-A)$ lie between $\frac{1}{2\kappa}$ and $2-\frac{1}{2\kappa}$.
Moreover, we can rewrite $c(D-A)$ as an SDDM matrix $I-X$, where all entries of $X = I - c(D-A)$ are nonnegative, and $\rho(X) \le 1-\frac{1}{2\kappa}$.

Proof. Let $j = \arg\max_i d_i$; by definition $c = (1-1/\kappa)/d_j$. Let $[\lambda_{\min}, \lambda_{\max}]$ denote the range of the eigenvalues of $D-A$. By the Gershgorin circle theorem and the fact that $D-A$ is diagonally dominant, we have $\lambda_{\max} \le 2d_j$. Also $d_j = e_j^\top (D-A) e_j$, so $\lambda_{\max} \ge d_j$, and therefore $\lambda_{\min} \ge \lambda_{\max}/\kappa \ge d_j/\kappa$. The eigenvalues of $c(D-A)$ lie in the interval $[c\lambda_{\min}, c\lambda_{\max}]$, and can be bounded as
\[ \frac{1}{2\kappa} \le \left(1-\frac{1}{\kappa}\right)\frac{1}{d_j}\cdot\frac{d_j}{\kappa} \le c\lambda_{\min} \le c\lambda_{\max} \le \left(1-\frac{1}{\kappa}\right)\frac{1}{d_j}\cdot 2d_j \le 2-\frac{1}{2\kappa}. \tag{39} \]
To see that $X = (I-cD) + cA$ is a nonnegative matrix, note that $A$ is nonnegative and $I$ is entrywise larger than $cD$.

C Inverse Factor by SDDM Factorization

We restate and prove Lemma 3.1.

Lemma 3.1 (Combinatorial Factorization). Let $M$ be an SDDM matrix with $m$ non-zero entries and condition number $\kappa$. Then (1) $M$ can be factored into
\[ M = BB^\top, \tag{6} \]
where $B$ is an $n \times m$ matrix with at most 2 non-zeros per column; and (2) for any approximation parameter $\epsilon > 0$, if $Z$ is a linear operator satisfying $Z \approx_{\epsilon} M^{-1}$, then the matrix
\[ \tilde{C} = ZB \tag{7} \]
satisfies
\[ \tilde{C}\tilde{C}^\top \approx_{2\epsilon} M^{-1}. \tag{8} \]
Moreover, a representation of $\tilde{C}$ can be computed in work $O(m \cdot \log^{c_1} n \cdot \log^{c_2} \kappa)$ and depth $O(\log^{c_3} n \cdot \log \kappa)$ for some modest constants $c_1$, $c_2$, and $c_3$, so that for any $m$-dimensional vector $x$, the product $\tilde{C}x$ can be computed in work $O((m + n \cdot \log^{c} n \cdot \log^3 \kappa) \cdot \log(1/\epsilon))$ and depth $O(\log n \cdot \log \kappa \cdot \log(1/\epsilon))$ for a modest constant $c$.

Proof. Let $e_i$ be the $i$th standard basis vector; then $M$ can be represented as $M = \frac{1}{2}\sum_{1 \le i \ldots}$ …

Lemma 4.1 (restated). For any $p \in [-1,1]$, $\delta \in (0,1)$, and $\epsilon > 0$, there exists a $t$-degree polynomial $T_{p,t}(\cdot)$ with $t \le \frac{\log(1/(\epsilon(1-\delta)^2))}{1-\delta}$, such that for all $\lambda \in [1-\delta, 1+\delta]$,
\[ \exp(-\epsilon)\lambda^p \le T_{p,t}(\lambda) \le \exp(\epsilon)\lambda^p. \tag{14} \]
Proof. Let $g_{p,t}(x)$ be the $t$th order Maclaurin series expansion of $f(x) = (1-x)^p$. Because $|1-\lambda| \le \delta$, by Proposition D.1,
\[ \left| g_{p,t}(1-\lambda) - \lambda^p \right| \le \frac{\delta^{t+1}}{1-\delta}. \tag{45} \]
We define $T_{p,t}(x) = g_{p,t}(1-x)$. Also $\lambda^p \ge 1-\delta$, because $p \in [-1,1]$ and $\lambda \in [1-\delta, 1+\delta]$. So we have
\[ \left| T_{p,t}(\lambda) - \lambda^p \right| \le \frac{\delta^{t+1}}{(1-\delta)^2}\,\lambda^p. \tag{46} \]
Therefore, when $t = (1-\delta)^{-1}\log(1/(\epsilon(1-\delta)^2))$, we get $\frac{\delta^{t+1}}{(1-\delta)^2} < \epsilon$ and
\[ \left| T_{p,t}(\lambda) - \lambda^p \right| \le \epsilon\lambda^p, \tag{47} \]
which implies
\[ \exp(-\epsilon)\lambda^p \le T_{p,t}(\lambda) \le \exp(\epsilon)\lambda^p. \tag{48} \]
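The Maclaurin polynomial of Lemma 4.1 is easy to exercise numerically. The following sketch (our own illustration; the helper name is an assumption, not from the paper) builds the degree-$t$ truncation $g_{p,t}$ of $(1-x)^p$ from the generalized binomial coefficients and evaluates $T_{p,t}(\lambda) = g_{p,t}(1-\lambda)$:

```python
def maclaurin_power(p, t):
    """Return T_{p,t}(lam) = g_{p,t}(1 - lam), where g_{p,t} is the
    degree-t Maclaurin truncation of (1 - x)^p:
        g_{p,t}(x) = sum_{k=0}^{t} C(p, k) * (-x)^k,
    with generalized binomial coefficients C(p, k)."""
    # C(p, 0) = 1;  C(p, k) = C(p, k-1) * (p - k + 1) / k
    coeffs = [1.0]
    for k in range(1, t + 1):
        coeffs.append(coeffs[-1] * (p - k + 1) / k)

    def T(lam):
        x = 1.0 - lam
        return sum(c * (-x) ** k for k, c in enumerate(coeffs))

    return T
```

For instance, with $p = -1/2$, $\delta = 1/2$, and $\epsilon = 0.01$, the lemma's bound gives $t \le 2\log 400 \approx 12$ (natural log), so a degree-20 truncation comfortably approximates $\lambda^{-1/2}$ to within the claimed $\exp(\pm\epsilon)$ factors on $[1/2, 3/2]$.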