Exact covariance thresholding into connected components for large-scale Graphical Lasso



Rahul Mazumder* and Trevor Hastie†
Department of Statistics, Stanford University, Stanford, CA 94305.
Second Draft. Submitted for publication on 8-26-2011. First Draft: 8-17-2011.

Abstract

We consider the sparse inverse covariance regularization problem or graphical lasso with regularization parameter $\lambda$. Suppose the sample covariance graph formed by thresholding the entries of the sample covariance matrix at $\lambda$ is decomposed into connected components. We show that the vertex-partition induced by the connected components of the thresholded sample covariance graph is exactly equal to that induced by the connected components of the estimated concentration graph, obtained by solving the graphical lasso problem. This characterizes a very interesting property of a path of graphical lasso solutions. Furthermore, this simple rule, when used as a wrapper around existing algorithms for the graphical lasso, leads to enormous performance gains. For a range of values of $\lambda$, our proposal splits a large graphical lasso problem into smaller tractable problems, making it possible to solve an otherwise infeasible large-scale problem. We illustrate the graceful scalability of our proposal via synthetic and real-life microarray examples.¹

* email: rahulm@stanford.edu
† email: hastie@stanford.edu
¹ The first draft of this paper is available at http://arxiv.org/PS_cache/arxiv/pdf/1108/1108.3829v1.pdf.

1 Introduction

Consider a data matrix $X_{n \times p}$ comprising $n$ sample realizations from a $p$-dimensional Gaussian distribution with zero mean and positive definite covariance matrix $\Sigma$ (unknown), i.e. $x_i \overset{i.i.d.}{\sim} \mathrm{MVN}(0, \Sigma)$. The task is to estimate the unknown $\Sigma$ based on the $n$ samples. $\ell_1$-regularized Sparse Inverse Covariance Selection, also known as graphical
lasso [Friedman et al., 2007, Banerjee et al., 2008, Yuan and Lin, 2007], estimates the covariance matrix $\Sigma$ under the assumption that the inverse covariance matrix $\Sigma^{-1}$ is sparse. This is achieved by minimizing the regularized negative log-likelihood function:

$$\underset{\Theta \succ 0}{\mathrm{minimize}} \;\; -\log\det(\Theta) + \mathrm{tr}(S\Theta) + \lambda \sum_{i,j} |\Theta_{ij}|, \tag{1}$$

where $S$ is the sample covariance matrix. Problem (1) is a convex optimization problem in the variable $\Theta$ [Boyd and Vandenberghe, 2004]. Let $\hat{\Theta}^{(\lambda)}$ denote the solution to (1). We note that (1) can also be used in a more non-parametric fashion for any positive semidefinite input matrix $S$, not necessarily the sample covariance matrix of a MVN sample as described above. A related criterion to (1) is one where the diagonals are not penalized; we will study (1) in this paper.

Developing efficient large-scale algorithms for (1) is an active area of research across the fields of Convex Optimization, Machine Learning and Statistics. Many algorithms have been proposed for this task [Friedman et al., 2007, Banerjee et al., 2008, Lu, 2009, 2010, Scheinberg et al., 2010, Sra and Kim, 2011, Yuan, 2009, Li and Toh, 2010, for example]. However, it appears that certain special properties of the solution to (1) have been largely ignored. This paper is about one such (surprising) property: namely, establishing an equivalence between the vertex-partition induced by the connected components of the non-zero pattern of $\hat{\Theta}^{(\lambda)}$ and that induced by the thresholded sample covariance matrix $S$. This paper is not about a specific algorithm for problem (1); it focuses on the aforementioned observation, which leads to a novel thresholding/screening procedure based on $S$. This provides interesting insight into the path of solutions $\{\hat{\Theta}^{(\lambda)}\}_{\lambda \ge 0}$ obtained by solving (1) over a path of $\lambda$ values.
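Criterion (1) is straightforward to state in code. The following is a minimal Python sketch (the function name and toy inputs are ours, for illustration only) that evaluates the penalized negative log-likelihood of a candidate $\Theta$:

```python
import numpy as np

def graphical_lasso_objective(Theta, S, lam):
    """Regularized negative log-likelihood of problem (1):
    -log det(Theta) + tr(S Theta) + lam * sum_ij |Theta_ij|."""
    sign, logdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must be positive definite"
    return -logdet + np.trace(S @ Theta) + lam * np.abs(Theta).sum()

# Tiny sanity check: with S = I and Theta = I, the objective is
# -log det(I) + tr(I) + lam * p = 0 + p + lam * p.
p = 3
val = graphical_lasso_objective(np.eye(p), np.eye(p), lam=0.1)
print(val)  # 3 + 0.1 * 3 = 3.3
```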
The behavior of the connected components obtained from the non-zero patterns of $\{\hat{\Theta}^{(\lambda)}\}_{\lambda \ge 0}$ can be completely understood by simple screening rules on $S$. This can be done without even attempting to solve (1), arguably a very challenging convex optimization problem. Furthermore, this thresholding rule can be used as a wrapper to enormously boost the performance of existing algorithms, as seen in our experiments. This strategy becomes extremely effective in solving large problems over a range of values of $\lambda$ sufficiently restricted to ensure sparsity and the separation into connected components.

At this point we introduce some notation and terminology, which we will use throughout the paper.

1.1 Notations and preliminaries

For a matrix $Z$, its $(i,j)$th entry is denoted by $Z_{ij}$.

We also introduce some graph theory notations and definitions [Bollobas, 1998, see for example]. A finite undirected graph $G$ on $p$ vertices is given by the ordered tuple $G = (V, E)$, where $V$ is the set of nodes and $E$ the collection of (undirected) edges. The edge-set is equivalently represented via a (symmetric) 0-1 matrix² (also known as the adjacency matrix) with $p$ rows/columns. We use the convention that a node is not connected to itself, so the diagonals of the adjacency matrix are all zeros. Let $|V|$ and $|E|$ denote the number of nodes and edges respectively.

We say two nodes $u, v \in V$ are connected if there is a path between them. A maximal connected subgraph³ is a connected component of the graph $G$. Connectedness is an equivalence relation that decomposes a graph $G$ into its connected components $\{(V_\ell, E_\ell)\}_{1 \le \ell \le K}$, with $G = \cup_{\ell=1}^{K} (V_\ell, E_\ell)$, where $K$ denotes the number of connected components. This decomposition partitions the vertices $V$ of $G$ into $\{V_\ell\}_{1 \le \ell \le K}$.
Note that the labeling of the components is unique up to permutations on $\{1, \ldots, K\}$. Throughout this paper we will often refer to this partition as the vertex-partition induced by the components of the graph $G$. If the size of a component is one, i.e. $|V_\ell| = 1$, we say that the node is isolated. Suppose a graph $\hat{G}$ defined on the set of vertices $V$ admits the following decomposition into connected components: $\hat{G} = \cup_{\ell=1}^{\hat{K}} (\hat{V}_\ell, \hat{E}_\ell)$. We say the vertex-partitions induced by the connected components of $G$ and $\hat{G}$ are equal if $\hat{K} = K$ and there is a permutation $\pi$ on $\{1, \ldots, K\}$ such that $\hat{V}_{\pi(\ell)} = V_\ell$ for all $\ell \in \{1, \ldots, K\}$.

The paper is organized as follows. Section 2 describes the covariance graph thresholding idea, along with theoretical justifications and related work, followed by a complexity analysis of the algorithmic framework in Section 3. Numerical experiments appear in Section 4, concluding remarks in Section 5, and the proofs are gathered in Appendix A.

² 0 denotes absence of an edge and 1 denotes its presence.
³ $G' = (V', E')$ is a subgraph of $G$ if $V' \subset V$ and $E' \subset E$.

2 Methodology: Exact Thresholding of the Covariance Graph

The sparsity pattern of the solution $\hat{\Theta}^{(\lambda)}$ to (1) gives rise to the symmetric edge matrix/skeleton $\mathcal{E}^{(\lambda)} \in \{0,1\}^{p \times p}$ defined by:

$$\mathcal{E}^{(\lambda)}_{ij} = \begin{cases} 1 & \text{if } \hat{\Theta}^{(\lambda)}_{ij} \neq 0, \; i \neq j; \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

The above defines a symmetric graph $\mathcal{G}^{(\lambda)} = (V, \mathcal{E}^{(\lambda)})$, namely the estimated concentration graph [Cox and Wermuth, 1996, Lauritzen, 1996], defined on the nodes $V = \{1, \ldots, p\}$ with edges $\mathcal{E}^{(\lambda)}$. Suppose the graph $\mathcal{G}^{(\lambda)}$ admits a decomposition into $\kappa(\lambda)$ connected components:

$$\mathcal{G}^{(\lambda)} = \cup_{\ell=1}^{\kappa(\lambda)} \mathcal{G}^{(\lambda)}_\ell, \tag{3}$$

where $\mathcal{G}^{(\lambda)}_\ell = (\hat{V}^{(\lambda)}_\ell, \mathcal{E}^{(\lambda)}_\ell)$ are the components of the graph $\mathcal{G}^{(\lambda)}$. Note that
$\kappa(\lambda) \in \{1, \ldots, p\}$, with $\kappa(\lambda) = p$ (large $\lambda$) implying that all nodes are isolated, and for small enough values of $\lambda$ there is only one component, i.e. $\kappa(\lambda) = 1$.

We now describe the simple screening/thresholding rule. Given $\lambda$, we perform a thresholding on the entries of the sample covariance matrix $S$ and obtain a graph edge skeleton $E^{(\lambda)} \in \{0,1\}^{p \times p}$ defined by:

$$E^{(\lambda)}_{ij} = \begin{cases} 1 & \text{if } |S_{ij}| > \lambda, \; i \neq j; \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$

The symmetric matrix $E^{(\lambda)}$ defines a symmetric graph on the nodes $V = \{1, \ldots, p\}$ given by $G^{(\lambda)} = (V, E^{(\lambda)})$. We refer to this as the thresholded sample covariance graph. Similar to the decomposition in (3), the graph $G^{(\lambda)}$ also admits a decomposition into connected components:

$$G^{(\lambda)} = \cup_{\ell=1}^{k(\lambda)} G^{(\lambda)}_\ell, \tag{5}$$

where $G^{(\lambda)}_\ell = (V^{(\lambda)}_\ell, E^{(\lambda)}_\ell)$ are the components of the graph $G^{(\lambda)}$.

Note that the components of $\mathcal{G}^{(\lambda)}$ require knowledge of $\hat{\Theta}^{(\lambda)}$, the solution to (1). Construction of $G^{(\lambda)}$ and its components requires operating on $S$ alone, an operation that can be performed completely independently of the optimization problem (1), which is arguably more expensive (see Section 3). The surprising message we describe in this paper is that the vertex-partition of the connected components of (5) is exactly equal to that of (3).

This observation has the following consequences:

1. We obtain a very interesting property of the path of solutions $\{\hat{\Theta}^{(\lambda)}\}_{\lambda \ge 0}$: the behavior of the connected components of the estimated concentration graph can be completely understood by simple screening rules on $S$!

2. The cost of computing the connected components of the thresholded sample covariance graph (5) is orders of magnitude smaller than the cost of fitting graphical models (1). Furthermore, the computations pertaining to the covariance graph can be done off-line and are amenable to parallel computation (see Section 3).

3.
The optimization problem (1) completely separates into $k(\lambda)$ separate optimization sub-problems of the form (1). The sub-problems have sizes equal to the number of nodes in each component: $p_i := |V_i|$, $i = 1, \ldots, k(\lambda)$. Hence for certain values of $\lambda$, solving problem (1) becomes feasible although it may be impossible to operate on the $p \times p$ dimensional (global) variable $\Theta$ on a single machine.

4. Suppose that for $\lambda_0$ there are $k(\lambda_0)$ components and the graphical model computations are distributed⁴. Since the vertex-partitions induced via (3) and (5) are nested with increasing $\lambda$ (see Theorem 2), it suffices to operate independently on these separate machines to obtain the path of solutions $\{\hat{\Theta}^{(\lambda)}\}_\lambda$ for all $\lambda \ge \lambda_0$.

5. Consider a distributed computing architecture, where every machine allows operating on a graphical lasso problem (1) of maximal size $p_{\max}$. Then with relatively small effort we can find the smallest value of $\lambda = \lambda_{p_{\max}}$ such that there are no connected components of size larger than $p_{\max}$. Problem (1) thus 'splits up' independently into manageable problems across the different machines. When this structure is not exploited, the global problem (1) remains intractable.

The following theorem establishes the main technical contribution of this paper: the equivalence of the vertex-partitions induced by the connected components of the thresholded sample covariance graph and the estimated concentration graph.

Theorem 1. For any $\lambda > 0$, the components of the estimated concentration graph $\mathcal{G}^{(\lambda)}$, as defined in (2) and (3), induce exactly the same vertex-partition as that of the

⁴ Distributing these operations depends upon the number of processors available, their capacities, communication lag, the number of components, and the maximal size of the blocks across all machines.
These, of course, depend upon the computing environment. In the context of the present problem, it is often desirable to club smaller components into a single machine.

thresholded sample covariance graph $G^{(\lambda)}$, defined in (4) and (5). That is, $\kappa(\lambda) = k(\lambda)$ and there exists a permutation $\pi$ on $\{1, \ldots, k(\lambda)\}$ such that:

$$\hat{V}^{(\lambda)}_i = V^{(\lambda)}_{\pi(i)}, \quad \forall\, i = 1, \ldots, k(\lambda). \tag{6}$$

Proof. The proof of the theorem appears in Appendix A.1.

Since the decomposition of a symmetric graph into its connected components depends upon the ordering/labeling of the components, the permutation $\pi$ appears in Theorem 1.

Remark 1. Note that the edge-structures within each block need not be preserved. Under a matching reordering of the labels of the components of $\mathcal{G}^{(\lambda)}$ and $G^{(\lambda)}$: for every fixed $\ell$ such that $\hat{V}^{(\lambda)}_\ell = V^{(\lambda)}_\ell$, the edge-sets $\mathcal{E}^{(\lambda)}_\ell$ and $E^{(\lambda)}_\ell$ are not necessarily equal.

Theorem 1 leads to a special property of the path of solutions to (1): the vertex-partitions induced by the connected components of $\mathcal{G}^{(\lambda)}$ are nested with increasing $\lambda$. This is the content of the following theorem.

Theorem 2. Consider two values of the regularization parameter such that $\lambda > \lambda' > 0$, with corresponding concentration graphs $\mathcal{G}^{(\lambda)}$ and $\mathcal{G}^{(\lambda')}$ as in (2) and connected components as in (3). Then the vertex-partition induced by the components of $\mathcal{G}^{(\lambda)}$ is nested within the partition induced by the components of $\mathcal{G}^{(\lambda')}$. Formally, $\kappa(\lambda) \ge \kappa(\lambda')$ and the vertex-partition $\{\hat{V}^{(\lambda)}_\ell\}_{1 \le \ell \le \kappa(\lambda)}$ forms a finer resolution of $\{\hat{V}^{(\lambda')}_\ell\}_{1 \le \ell \le \kappa(\lambda')}$.

Proof. The proof of this theorem appears in Appendix A.2.

Remark 2. It is worth noting that Theorem 2 addresses the nesting of the edges across connected components and not within a component.
In general, the edge-set $\mathcal{E}^{(\lambda)}$ of the estimated concentration graph need not be nested as a function of $\lambda$: for $\lambda > \lambda'$, in general, $\mathcal{E}^{(\lambda)} \not\subset \mathcal{E}^{(\lambda')}$. See Friedman et al. [2007, Figure 3] for numerical examples demonstrating the non-monotonicity of the edge-set across $\lambda$, as described in Remark 2.

2.1 Related Work

Witten and Friedman [2011] fairly recently proposed a scheme to detect isolated nodes for problem (1) via a simple screening of the entries of $S$. Using the notation in Witten and Friedman [2011, Algorithm 1], the authors propose operating criterion (1) on the set of non-isolated nodes (obtained from the sample covariance matrix), i.e. $\{1, \ldots, p\} \setminus \mathcal{C}$, where the set of isolated nodes $\mathcal{C}$ is given by:

$$\mathcal{C} = \{ i : |S_{ij}| \le \lambda, \;\; \forall\, j \neq i \}. \tag{7}$$

The authors showed that $\mathcal{C}$ is exactly equivalent to the set of isolated nodes of the estimated precision matrix obtained by solving (1) on the entire $p \times p$ dimensional problem. Earlier, Banerjee et al. [2008, Theorem 4] made the same observation. This is, of course, related to a very special case of the proposal in this paper: suppose in Theorem 1 the estimated concentration graph admits a decomposition where some of the connected components have size one; Witten and Friedman [2011] only screen the isolated nodes and treat the remaining nodes as a single 'connected unit'. Although the original version of their paper [Witten and Friedman, 2011] deals with single nodes, we have learned [Witten, 8-12-2011] that with N. Simon they have also discovered a form of block screening.

This node-screening strategy (7) was used by them as a wrapper around the (graphical lasso) glasso algorithm of Friedman et al. [2007], leading to substantial improvements over the existing glasso solver of Friedman et al. [2007] (CRAN glasso package version 1.4).
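The node-screening set $\mathcal{C}$ of (7) is a one-line computation on $S$. Below is a minimal Python sketch (function name and toy matrix are ours, for illustration): node $i$ belongs to $\mathcal{C}$ exactly when every off-diagonal entry in its row is at most $\lambda$ in absolute value:

```python
import numpy as np

def isolated_nodes(S, lam):
    """Node-screening set C of (7): i is isolated iff |S_ij| <= lam
    for every j != i.  This is the glasso check (10),
    ||s_12||_inf <= lam, applied to row/column i."""
    p = S.shape[0]
    off = np.abs(S) * (1 - np.eye(p))   # zero out the diagonal
    return np.where(off.max(axis=1) <= lam)[0]

# Node 2's off-diagonal entries (0.05 and 0.02) are below lam = 0.1,
# so it is screened out; nodes 0 and 1 share the entry 0.3 and are not.
S = np.array([[2.0,  0.3,  0.05],
              [0.3,  1.0,  0.02],
              [0.05, 0.02, 1.5]])
iso = isolated_nodes(S, lam=0.1)
print(iso)  # only node 2 passes the isolation check
```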
However, we show below that the node-screening idea is actually an immediate consequence of the block coordinate-wise updates used by the glasso algorithm, an observation that was not exploited by the solver.

Recall that the glasso algorithm [Friedman et al., 2007] operates in a block-coordinate-wise, i.e. row/column, fashion on the variable $W = \Theta^{-1}$. The method partitions the problem variables as follows:

$$\Theta = \begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{21} & \theta_{22} \end{pmatrix}, \quad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{21} & s_{22} \end{pmatrix}, \quad W = \begin{pmatrix} W_{11} & w_{12} \\ w_{21} & w_{22} \end{pmatrix} \tag{8}$$

where the last row/column represents the optimization variable, the others being fixed. The partial optimization problem w.r.t. the last row/column (leaving apart the diagonal entry) is given by:

$$\hat{\theta}_{12} := \arg\min_{\theta_{12}} \left\{ \tfrac{1}{2}\, \theta_{12}' W_{11}\, \theta_{12} + \theta_{12}'\, \theta_{22}\, s_{12} + \lambda\, \theta_{22}\, \|\theta_{12}\|_1 \right\}. \tag{9}$$

Clearly the solution $\hat{\theta}_{12}$ of the above (9) is zero iff

$$\|s_{12}\|_\infty \le \lambda, \tag{10}$$

a condition depending only on that row/column of $S$. As we pointed out before, the above condition (10) is exactly the condition for node-screening (7) described in Witten and Friedman [2011]. The notable improvement in timings observed in Witten and Friedman [2011] with node screening goes on to suggest that the glasso solver of Friedman et al. [2007] (as implemented in CRAN glasso package version 1.4) does not make the check (10) before going on to solve problem (9). The existing implementation goes on to optimize (9), an $\ell_1$ regularized quadratic program, via cyclical coordinate-descent. Note that (9), in its own right, is fairly challenging to solve for large problems.

3 Computational Complexity

The overall complexity of our proposal depends upon (a) the graph partition stage and (b) solving (sub)problems of the form (1). In addition to these, there is an unavoidable complexity associated with handling and/or forming $S$.
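Stage (a), thresholding $S$ and extracting the connected components of (4)-(5), can be sketched in a few lines. The paper's experiments use MATLAB's graphconncomp; the following is our analogous Python sketch (function name and toy matrix are illustrative assumptions) built on scipy:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def thresholded_components(S, lam):
    """Build the thresholded sample covariance graph of (4)
    (edge i-j iff |S_ij| > lam, i != j) and return the number of
    connected components and a component label per node, as in (5)."""
    A = np.abs(S) > lam
    np.fill_diagonal(A, False)          # convention: no self-loops
    n_comp, labels = connected_components(csr_matrix(A), directed=False)
    return n_comp, labels

# Two obvious blocks {0,1} and {2,3} at lam = 0.5: only the entries
# 0.9 and 0.8 survive the threshold.
S = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.9, 1.0, 0.0, 0.1],
              [0.1, 0.0, 1.0, 0.8],
              [0.0, 0.1, 0.8, 1.0]])
n_comp, labels = thresholded_components(S, lam=0.5)
print(n_comp, labels)   # 2 components: {0,1} and {2,3}
```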
The cost of computing the connected components of the thresholded covariance graph is fairly negligible when compared to solving a similar-sized graphical lasso problem (1); see also our simulation studies in Section 4. In case we observe samples $x_i \in \Re^p$, $i = 1, \ldots, n$, the cost of creating the sample covariance matrix $S$ is $O(n \cdot p^2)$. Thresholding the sample covariance matrix costs $O(p^2)$. Obtaining the connected components of the thresholded covariance graph costs $O(|E^{(\lambda)}| + p)$ [Tarjan, 1972]. Since we are interested in a region where the thresholded covariance graph is sparse enough to be broken into smaller connected components, $|E^{(\lambda)}| \ll p^2$. Note that all computations pertaining to the construction of the connected components and the task of computing $S$ can be performed off-line. Furthermore the computations are parallelizable. Gazit [1991, for example] describes parallel algorithms for computing the connected components of a graph; they have a time complexity $O(\log p)$ and require $O((|E^{(\lambda)}| + p)/\log(p))$ processors with space $O(p + |E^{(\lambda)}|)$.

There are a wide variety of algorithms for the task of solving (1). While an exhaustive review of the computational complexities of the different algorithms is beyond the scope of this paper, we provide a brief summary for a few algorithms below. Banerjee et al. [2008] proposed a smooth accelerated gradient based method [Nesterov, 2005] with complexity $O(p^{4.5}/\epsilon)$ to obtain an $\epsilon$-accurate solution, the per-iteration cost being $O(p^3)$. They also proposed a block coordinate method which has a complexity of $O(p^4)$. The complexity of the glasso algorithm [Friedman et al., 2007], which uses a row-by-row block coordinate method, is roughly $O(p^3)$ for reasonably sparse problems with $p$ nodes. For denser problems the cost can be as large as $O(p^4)$.
The algorithm smacs proposed in Lu [2010] has a per-iteration complexity of $O(p^3)$ and an overall complexity of $O(p^4/\sqrt{\epsilon})$ to obtain an $\epsilon > 0$ accurate solution.

It appears that most existing algorithms for (1) have a complexity of at least $O(p^3)$ to $O(p^4)$, or possibly larger, depending upon the algorithm used and the desired accuracy of the solution, making computations for (1) almost impractical for values of $p$ much larger than 2000. It is quite clear that the role played by covariance thresholding is indeed crucial in this context. Assume that we choose to use a solver of complexity $O(p^J)$, with $J \in \{3, 4\}$, along with our screening procedure. Suppose for a given $\lambda$ the thresholded sample covariance graph has $k(\lambda)$ components; the total cost of solving these smaller problems is then $\sum_{i=1}^{k(\lambda)} O(|V^{(\lambda)}_i|^J) \ll O(p^J)$, with $J \in \{3, 4\}$. This difference in practice can be enormous; see Section 4 for numerical examples. This is what makes large scale graphical lasso problems solvable!

4 Numerical examples

In this section we show via numerical experiments that the screening property helps in obtaining many-fold speed-ups when compared to an algorithm that does not exploit it. Section 4.1 considers synthetic examples and Section 4.2 discusses real-life microarray data examples.

4.1 Synthetic examples

Experiments are performed with two publicly available algorithm implementations for the problem (1):

glasso: The algorithm of Friedman et al. [2007]. We used the MATLAB wrapper available at http://www-stat.stanford.edu/~tibs/glasso/index.html to the Fortran code. The specific criterion for convergence (lack of progress of the diagonal entries) was set to $10^{-5}$ and the maximal number of iterations was set to 1000.

smacs: denotes the algorithm of Lu [2010].
We used the MATLAB implementation smooth_covsel available at http://people.math.sfu.ca/~zhaosong/Codes/SMOOTH_COVSEL/. The criterion for convergence (based on duality gap) was set to $10^{-5}$ and the maximal number of iterations was set to 1000.

We would like to note that the convergence criteria of the two algorithms glasso and smacs are not the same. For obtaining the connected components of a symmetric adjacency matrix we used the MATLAB function graphconncomp. All of our computations are done in MATLAB 7.11.0 on a 3.3 GHz Intel Xeon processor.

The simulation examples are created as follows. We generated a block diagonal matrix given by $\tilde{S} = \mathrm{blkdiag}(\tilde{S}_1, \ldots, \tilde{S}_K)$, where each block $\tilde{S}_\ell = 1_{p_\ell \times p_\ell}$, a matrix of all ones, and $\sum_\ell p_\ell = p$. In the examples we took all $p_\ell$'s to be equal to $p_1$ (say). Noise of the form $\sigma \cdot UU'$ ($U$ is a $p \times p$ matrix with i.i.d. standard Gaussian entries) is added to $\tilde{S}$ such that 1.25 times the largest (in absolute value) off-block-diagonal (as in the block structure of $\tilde{S}$) entry of $\sigma \cdot UU'$ equals the smallest absolute non-zero entry in $\tilde{S}$, i.e. one. The sample covariance matrix is $S = \tilde{S} + \sigma \cdot UU'$.

We consider a number of examples for varying $K$ and $p_1$ values, as shown in Table 1. Sizes were chosen such that it is at least 'conceivable' to solve (1) on the full dimensional problem, without screening. In all the examples shown in Table 1, we set $\lambda_I := (\lambda_{\max} + \lambda_{\min})/2$, where for all values of $\lambda$ in the interval $[\lambda_{\min}, \lambda_{\max}]$ the thresholded version of the sample covariance matrix has exactly $K$ connected components. We also took a larger value of $\lambda$, i.e. $\lambda_{II} := \lambda_{\max}$, which gave sparser estimates of the precision matrix, but the number of connected components was the same. The computations across different connected blocks could be distributed into as many machines.
This would lead to almost a $K$-fold improvement in timings; however, in Table 1 we report the timings obtained by operating serially across the blocks. The serial 'loop' across the different blocks is implemented in MATLAB.

Table 1 shows the rather remarkable improvements obtained by using our proposed covariance thresholding strategy as compared to operating on the whole matrix. Timing comparisons between glasso and smacs are not fair, since glasso is written in Fortran and smacs in MATLAB. However, we note that our experiments are meant to demonstrate how the thresholding helps in improving the overall computational time over the baseline method of not exploiting screening. It is interesting to observe that there is almost a role-reversal in the performances of glasso and smacs with changing $\lambda$ values, for the cases with screening. $\lambda_I$ corresponds to a denser solution of the precision matrix; here glasso converges more slowly than smacs. For larger values of the tuning parameter, i.e. $\lambda = \lambda_{II}$, the solutions are sparser, and glasso converges much faster than smacs. For problems without screening we observe that glasso converges much faster than smacs, for both values of the tuning parameter. This is probably because of the intensive matrix computations associated with the smacs algorithm.
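The simulation design of this subsection, a block-diagonal matrix of all-ones blocks plus scaled Gaussian noise, can be sketched as follows. This is our illustrative Python rendition (the paper's experiments are in MATLAB), with small hypothetical values of $K$ and $p_1$:

```python
import numpy as np

def simulate_cov(K, p1, seed=0):
    """Block-diagonal S_tilde of K all-ones blocks (p1 x p1) plus noise
    sigma * U U', with sigma chosen so that 1.25 times the largest
    absolute off-block-diagonal entry of sigma * U U' equals 1, the
    smallest non-zero entry of S_tilde, as in Section 4.1."""
    p = K * p1
    S_tilde = np.kron(np.eye(K), np.ones((p1, p1)))   # block of ones
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((p, p))
    UUt = U @ U.T
    off_block = np.abs(UUt)[S_tilde == 0]             # off-block entries
    sigma = 1.0 / (1.25 * off_block.max())
    return S_tilde + sigma * UUt

S = simulate_cov(K=2, p1=5)
print(S.shape)   # (10, 10)
```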
Clearly our proposed strategy makes solving larger problems (1) not only feasible but achievable in quite attractive computational time. The time taken by the graph-partitioning step in splitting the thresholded covariance graph into its connected components is negligible as compared to the timings for the optimization problem.

K   p1/p       λ     Algorithm   Time (sec)    Time (sec)       Speedup   Time (sec)
                                 with screen   without screen   factor    graph partition
2   200/400    λI    glasso         11.1          25.97           2.33      0.04
                     smacs          12.31        137.45          11.16
               λII   glasso          1.687         4.783          2.83      0.066
                     smacs          10.01         42.08           4.20
2   500/1000   λI    glasso        305.24        735.39           2.40      0.247
                     smacs         175          2138*            12.21
               λII   glasso         29.8         121.8            4.08      0.35
                     smacs         272.6        1247.1            4.57
5   300/1500   λI    glasso        210.86       1439              6.82      0.18
                     smacs          63.22       6062*            95.88
               λII   glasso         10.47        293.63          28.04      0.123
                     smacs         219.72       6061.6           27.58
5   500/2500   λI    glasso       1386.9           -               -        0.71
                     smacs         493             -               -
               λII   glasso         17.79        963.92          54.18      0.018
                     smacs         354.81          -               -
8   300/2400   λI    glasso        692.25          -               -        0.713
                     smacs         185.75          -               -
               λII   glasso          9.07        842.7           92.91      0.023
                     smacs         153.55          -               -

Table 1: Table showing (a) the times in seconds with screening, (b) without screening, i.e. on the whole matrix, and (c) the ratio (b)/(a), the 'Speedup factor', for algorithms glasso and smacs. Algorithms with screening are operated serially; the times reflect the total time summed across all blocks. The column 'graph partition' lists the time for computing the connected components of the thresholded sample covariance graph. Since $\lambda_{II} > \lambda_I$, the former gives sparser models. '*' denotes that the algorithm did not converge within 1000 iterations. '-' refers to cases where the respective algorithm failed to converge within 2 hours.

4.2 Microarray Data Examples

The graphical lasso is often used in learning connectivity networks in gene-microarray data [Friedman et al., 2007, see for example].
Since in most real examples the number of genes $p$ is around tens of thousands, obtaining an inverse covariance matrix by solving (1) is computationally impractical. The covariance thresholding method we propose easily applies to these problems and, as we see, gracefully delivers solutions over a large range of the parameter $\lambda$. We study three different microarray examples and observe that as one varies $\lambda$ from large to small values, the thresholded covariance graph splits into a number of non-trivial connected components of varying sizes. We continue till a small/moderate value of $\lambda$ at which the maximal size of a connected component gets larger than a predefined machine-capacity or the 'computational budget' for a single graphical lasso problem. Note that in relevant microarray applications, since $p \gg n$ ($n$, the number of samples, is at most a few hundred), heavy regularization is required to control the variance of the covariance estimates, so it does seem reasonable to restrict to solutions of (1) for large values of $\lambda$.

Following are the data-sets we used for our experiments:

(A) This data-set appears in Alon et al. [1999] and has been analyzed by Rothman et al. [2008, for example]. In this experiment, tissue samples were analyzed using an Affymetrix oligonucleotide array. The data were processed, filtered and reduced to a subset of $p = 2000$ gene expression values. The number of colon adenocarcinoma tissue samples is $n = 62$.

(B) This is an early example of an expression array, obtained from the Patrick Brown lab at Stanford University. There are $n = 385$ patient samples of tissue from various regions of the body (some from tumors, some not), with gene-expression measurements for $p = 4718$ genes.

(C) The third example is the by-now famous NKI data set that produced the 70-gene prognostic signature for breast cancer [van de Vijver et al., 2002].
Here there are $n = 295$ samples and $p = 24481$ genes.

Among the above, both (B) and (C) have a few missing values, which we imputed by the respective global means of the observed expression values. For each of the three data-sets, we took $S$ to be the corresponding sample correlation matrix.

Figure 1 shows how the component sizes of the thresholded covariance graph change across $\lambda$. We describe the strategy we used to arrive at the figure. Note that the connected components change only at the absolute values of the entries of $S$. From the sorted absolute values of the off-diagonal entries of $S$, we obtained the smallest value of $\lambda$, say $\lambda'_{\min}$, for which the size of the maximal connected component was 1500. For a grid of values of $\lambda$ till $\lambda'_{\min}$, we computed the connected components of the thresholded sample-covariance matrix and obtained the size-distribution of the various connected components. Figure 1 shows how these components change over a range of values of $\lambda$ for the three examples (A), (B) and (C). The number of connected components of a particular size is denoted by a color-scheme, described by the color-bar in the figures. With increasing $\lambda$, the larger connected components gradually disappear as they decompose into smaller components; the sizes of the connected components decrease and the frequency of the smaller components increases. Since these are all correlation matrices, for $\lambda \ge 1$ all the nodes in the graph become isolated. The range of $\lambda$ values for which the maximal size of the components is smaller than 1500 differs across the three examples. For (C) there is a greater variety in the sizes of the components as compared to (A) and (B). Note that by Theorem 1, the pattern of the components appearing in Figure 1 is exactly the same as that of the components appearing in the solution of (1) for that $\lambda$.
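The construction behind Figure 1, the component size distribution of the thresholded covariance graph over a grid of $\lambda$, can be sketched as follows. This is our Python sketch with a small synthetic correlation matrix standing in for the microarray data (all names are illustrative):

```python
import numpy as np
from collections import Counter
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def size_distribution(S, lam):
    """Map each component size to the number of components of that size
    in the thresholded covariance graph at lam, as tabulated in Figure 1."""
    A = np.abs(S) > lam
    np.fill_diagonal(A, False)
    _, labels = connected_components(csr_matrix(A), directed=False)
    return Counter(Counter(labels).values())   # size -> count

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 30))
S = np.corrcoef(X, rowvar=False)   # correlation matrix, as in Section 4.2
for lam in (0.3, 0.5, 0.9):
    print(lam, dict(size_distribution(S, lam)))
# Since |correlations| <= 1, every node is isolated once lam >= 1:
# 30 components of size 1.
```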
[Figure 1 here: three panels, (A) p = 2000, (B) p = 4718, (C) p = 24481. Each panel plots λ (vertical axis) against log10(size of components) (horizontal axis); a color-bar gives the number of components of each size.]

Figure 1: Figure showing the size distribution (in the log-scale) of connected components arising from the thresholded sample covariance graph for examples (A)-(C). For every value of $\lambda$ (vertical axis), the horizontal slice denotes the sizes of the different components appearing in the thresholded covariance graph. The colors represent the number of components in the graph having that specific size. For every figure, the range of $\lambda$ values is chosen such that the maximal size of the connected components does not exceed 1500.

Continuing from Table 1, we proceed to show that the screening rules lead to

Average size of     Algorithm   Time (sec)    Time (sec)       Speedup     Time (sec)
maximal component               with screen   without screen   factor      graph partition
5                   glasso        0.02          3866           1.9 × 10^5    0.009
                    smacs         0.87          1.16 × 10^5    1.33 × 10^5
727                 glasso      413            13214             32          0.14
                    smacs      4285             2.7 × 10^5        63

Table 2: Timings for example (A): table showing times with/without screening, 'Speedup factor' and time for 'graph partition' as in Table 1, for two different ranges of $\lambda$-values. Here $p = 2000$ and the times for each of the two columns are summed over 10 different $\lambda$ values. The left-most column is the size of the maximal connected component, averaged across the $\lambda$ values.
We see that the times increase with decreasing sparsity, and the speed-up factor is impressively large when there are a large number of small connected components. The cost of computing the components of the thresholded covariance graph is relatively negligible.

We consider example (A) and apply to it glasso and smacs, with and without screening, on a grid of λ values. For smaller values of λ in the range, glasso and smacs take a very long time to converge. Comparative timings appear in Table 2. In these experiments we took the convergence criterion for both glasso and smacs as 10^{-4}, and they were run for a maximum of 500 iterations. The algorithms were run independently across the chosen grid of λ values. The times displayed for the algorithms with screening indicate the total time required to solve the smaller sub-problems serially; the results shown are meant to emphasize the speed-ups obtained within each of the algorithms via screening.

For examples (B) and (C) the full problem sizes are beyond the scope of glasso and smacs; the screening rule is apparently the only way to obtain solutions for a reasonable range of λ values, as shown in Figure 1. We report in Table 3 the averaged time taken by each of glasso and smacs over a grid of 100 λ values, for examples (B) and (C). The 100 λ values correspond to the top 2% sorted absolute values of the off-diagonal entries in S below λ_500, where λ_500 is the smallest value of λ such that the maximal component in the thresholded covariance graph has size 500.

5 Conclusions

In this paper we present a novel property characterizing the family of solutions to the graphical lasso problem (1), as a function of the regularization parameter λ.
Example / p    Average size of      Algorithm Timings (sec)    Time (sec)
               maximal component    glasso      smacs          graph-partition
(B) / 4718     330                  4.93        31.09          0.0082
(C) / 24481    461                  15.4        141            0.0186

Table 3: Averaged timings (in secs) for different algorithms for examples (B) and (C) over a grid of 100 λ values, as described in the text. Algorithms are applied with the screening rule.

The property is fairly surprising: the vertex partitions induced by the connected components of the non-zero patterns of the estimated concentration matrix and of the thresholded sample covariance matrix S are exactly equal. This property seems to have gone unobserved in the literature. Our observation not only provides interesting insights into the properties of the graphical lasso solution-path but also opens the door to solving large-scale graphical lasso problems, which are otherwise intractable. This simple rule, when used as a wrapper around existing algorithms, leads to enormous performance boosts, on occasion by a factor of thousands!

A Proofs

A.1 Proof of Theorem 1

Proof. Suppose $\widehat{\Theta}$ (we suppress the superscript λ for notational convenience) solves problem (1); then standard KKT conditions of optimality [Boyd and Vandenberghe, 2004] give
\[
|S_{ij} - \widehat{W}_{ij}| \le \lambda \quad \forall\, \widehat{\Theta}_{ij} = 0; \tag{11}
\]
\[
\widehat{W}_{ij} = S_{ij} + \lambda \quad \forall\, \widehat{\Theta}_{ij} > 0; \qquad \widehat{W}_{ij} = S_{ij} - \lambda \quad \forall\, \widehat{\Theta}_{ij} < 0; \tag{12}
\]
where $\widehat{W} = (\widehat{\Theta})^{-1}$. The diagonal entries satisfy $\widehat{W}_{ii} = S_{ii} + \lambda$, for $i = 1, \ldots, p$.

Using (4) and (5), there exists an ordering of the vertices $\{1, \ldots, p\}$ of the graph such that $E^{(\lambda)}$ is block-diagonal. For notational convenience, we will assume that the matrix is already in that order. Under this ordering of the vertices, the edge-matrix of the thresholded covariance graph is of the form
\[
E^{(\lambda)} = \begin{pmatrix}
E^{(\lambda)}_1 & 0 & \cdots & 0 \\
0 & E^{(\lambda)}_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & E^{(\lambda)}_{k(\lambda)}
\end{pmatrix} \tag{13}
\]
where the different components represent blocks of indices given by $V^{(\lambda)}_\ell$, $\ell = 1, \ldots, k(\lambda)$. We will construct a matrix $\widehat{W}$ having the same structure as (13) which is a solution to (1). Note that if $\widehat{W}$ is block-diagonal then so is its inverse. Let $\widehat{W}$ and its inverse $\widehat{\Theta}$ be given by
\[
\widehat{W} = \begin{pmatrix}
\widehat{W}_1 & 0 & \cdots & 0 \\
0 & \widehat{W}_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \widehat{W}_{k(\lambda)}
\end{pmatrix}, \qquad
\widehat{\Theta} = \begin{pmatrix}
\widehat{\Theta}_1 & 0 & \cdots & 0 \\
0 & \widehat{\Theta}_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \widehat{\Theta}_{k(\lambda)}
\end{pmatrix} \tag{14}
\]
Define the block-diagonal matrices $\widehat{W}_\ell$, or equivalently $\widehat{\Theta}_\ell$, via the following sub-problems
\[
\widehat{\Theta}_\ell = \arg\min_{\Theta_\ell} \Big\{ -\log\det(\Theta_\ell) + \mathrm{tr}(S_\ell \Theta_\ell) + \lambda \sum_{ij} |(\Theta_\ell)_{ij}| \Big\} \tag{15}
\]
for $\ell = 1, \ldots, k(\lambda)$, where $S_\ell$ is the sub-block of S with row/column indices from $V^{(\lambda)}_\ell \times V^{(\lambda)}_\ell$. The same notation is used for $\Theta_\ell$. Denote the inverses of the block-precision matrices by $\{\widehat{\Theta}_\ell\}^{-1} = \widehat{W}_\ell$.

We will show that the above $\widehat{\Theta}$ satisfies the KKT conditions (11) and (12). Note that by construction of the thresholded sample covariance graph, if $i \in V^{(\lambda)}_\ell$ and $j \in V^{(\lambda)}_{\ell'}$ with $\ell \ne \ell'$, then $|S_{ij}| \le \lambda$. Hence, for $i \in V^{(\lambda)}_\ell$ and $j \in V^{(\lambda)}_{\ell'}$ with $\ell \ne \ell'$, the choice $\widehat{\Theta}_{ij} = \widehat{W}_{ij} = 0$ satisfies the KKT condition (11), $|S_{ij} - \widehat{W}_{ij}| \le \lambda$, for all the off-diagonal entries in the block-matrix (13). By construction (15), it is easy to see that for every $\ell$ the matrix $\widehat{\Theta}_\ell$ satisfies the KKT conditions (11) and (12) corresponding to the $\ell$-th block of the p × p dimensional problem. Hence $\widehat{\Theta}$ solves problem (1).

The above argument shows that the connected components obtained from the estimated precision graph $\mathcal{G}^{(\lambda)}$ lead to a partition of the vertices $\{\widehat{V}^{(\lambda)}_\ell\}_{1 \le \ell \le \kappa(\lambda)}$ such that for every $\ell \in \{1, \ldots, k(\lambda)\}$ there is an $\ell' \in \{1, \ldots, \kappa(\lambda)\}$ such that $\widehat{V}^{(\lambda)}_{\ell'} \subset V^{(\lambda)}_\ell$.
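The block-wise construction above can be sanity-checked numerically. The sketch below is an illustration, not part of the paper: it uses scikit-learn's `graphical_lasso` (whose penalty convention omits the diagonal, which does not affect the component structure) and verifies that the support of the full solution decomposes along the components of the thresholded graph.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.covariance import graphical_lasso

def partition(adjacency):
    """Connected-component labels of a symmetric boolean adjacency matrix."""
    return connected_components(csr_matrix(adjacency), directed=False)[1]

# A small covariance matrix with two well-separated groups of variables.
S = np.eye(4)
S[0, 1] = S[1, 0] = 0.6
S[2, 3] = S[3, 2] = 0.5
S[0, 2] = S[2, 0] = 0.05
lam = 0.2

# Components of the thresholded sample covariance graph E(lambda).
A_thresh = np.abs(S) > lam
np.fill_diagonal(A_thresh, False)
labels_thresh = partition(A_thresh)

# Components of the estimated concentration graph from the *full* problem.
_, Theta = graphical_lasso(S, alpha=lam)
A_prec = np.abs(Theta) > 1e-8
np.fill_diagonal(A_prec, False)
labels_prec = partition(A_prec)

# Theorem 1: the two vertex partitions coincide (up to label permutation).
same = all(
    np.array_equal(labels_thresh == labels_thresh[i], labels_prec == labels_prec[i])
    for i in range(len(S))
)
print("partitions equal:", same)
```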
In particular, k(λ) ≤ κ(λ).

Conversely, if $\widehat{\Theta}$ admits the decomposition in the statement of the theorem, then it follows from (11) that for $i \in \widehat{V}^{(\lambda)}_\ell$ and $j \in \widehat{V}^{(\lambda)}_{\ell'}$ with $\ell \ne \ell'$, $|S_{ij} - \widehat{W}_{ij}| \le \lambda$. Since $\widehat{W}_{ij} = 0$, we have $|S_{ij}| \le \lambda$. This proves that the connected components of $\mathcal{G}^{(\lambda)}$ lead to a partition of the vertices which is finer than the vertex-partition induced by the components of $G^{(\lambda)}$. In particular, this implies k(λ) ≥ κ(λ).

Combining the above two observations, we conclude k(λ) = κ(λ), and also the equality (6). The permutation π in the theorem appears because the labeling of the connected components is not unique.

A.2 Proof of Theorem 2

Proof. This proof is a direct consequence of Theorem 1, which establishes that the vertex-partitions induced by the connected components of the estimated precision graph and of the thresholded sample covariance graph are equal.

Observe that, by construction, the connected components of the thresholded sample covariance graph, i.e. $G^{(\lambda)}$, are nested within the connected components of $G^{(\lambda')}$. In particular, the vertex-partition induced by the components of the thresholded sample covariance graph at λ is contained inside the vertex-partition induced by the components of the thresholded sample covariance graph at λ′. Now, using Theorem 1, we conclude that the vertex-partition induced by the components of the estimated precision graph at λ, given by $\{\widehat{V}^{(\lambda)}_\ell\}_{1 \le \ell \le \kappa(\lambda)}$, is contained inside the vertex-partition induced by the components of the estimated precision graph at λ′, given by $\{\widehat{V}^{(\lambda')}_\ell\}_{1 \le \ell \le \kappa(\lambda')}$. The proof is thus complete.

References

U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96(12):6745–6750, June 1999. URL http://dx.doi.org/10.1073/pnas.96.12.6745.

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.

Bela Bollobas. Modern graph theory. Springer, New York, 1998.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

D. R. Cox and N. Wermuth. Multivariate Dependencies. Chapman and Hall, London, 1996.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432–441, 2007.

Hillel Gazit. An optimal randomized parallel algorithm for finding connected components in a graph. SIAM J. on Computing, 20(6):1046–1067, 1991.

Steffen Lauritzen. Graphical Models. Oxford University Press, 1996.

Lu Li and Kim-Chuan Toh. An inexact interior point method for l1-regularized sparse covariance selection. Mathematical Programming Computation, May 2010. URL http://dx.doi.org/10.1007/s12532-010-0020-6.

Zhaosong Lu. Smooth optimization approach for sparse covariance selection. SIAM J. on Optimization, 19:1807–1827, February 2009. URL http://portal.acm.org/citation.cfm?id=1654243.1654257.

Zhaosong Lu. Adaptive first-order methods for general sparse inverse covariance selection. SIAM J. Matrix Anal. Appl., 31:2000–2016, May 2010. URL http://dx.doi.org/10.1137/080742531.

Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., Serie A, 103:127–152, 2005.

A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.

Katya Scheinberg, Shiqian Ma, and Donald Goldfarb. Sparse inverse covariance selection via alternating linearization methods. NIPS, pages 1–9, 2010. URL http://arxiv.org/abs/1011.0097.

Suvrit Sra and Dongmin Kim. Sparse inverse covariance estimation via an adaptive gradient-based method, 2011. URL http://arxiv.org/PS-cache/arxiv/pdf/1106/1106.5175v1.pdf.

R. E. Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160, 1972.

M. J. van de Vijver, Y. D. He, L. J. van't Veer, H. Dai, A. A. Hart, D. W. Voskuil, G. J. Schreiber, J. L. Peterse, C. Roberts, M. J. Marton, M. Parrish, D. Atsma, A. Witteveen, A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S. Rodenhuis, E. T. Rutgers, S. H. Friend, and R. Bernards. A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347:1999–2009, Dec 2002.

D. M. Witten. Personal communication, 8-12-2011.

D. M. Witten and J. H. Friedman. A fast screening rule for the graphical lasso. Journal of Computational and Graphical Statistics, in press, 2011. Report dated 3-12-2011.

M. Yuan and Y. Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19–35, 2007.

Xiaoming Yuan. Alternating direction methods for sparse covariance selection. Methods, (August):1–12, 2009. URL http://www.optimization-online.org/DB-FILE/2009/09/2390.pdf.
