High-dimensional Sparse Inverse Covariance Estimation using Greedy Methods


Authors: Christopher C. Johnson, Ali Jalali, Pradeep Ravikumar

Christopher C. Johnson (CS, UT Austin, cjohnson@cs.utexas.edu), Ali Jalali (ECE, UT Austin, alij@mail.utexas.edu), Pradeep Ravikumar (CS, UT Austin, pradeepr@cs.utexas.edu)

Abstract

In this paper we consider the task of estimating the non-zero pattern of the sparse inverse covariance matrix of a zero-mean Gaussian random vector from a set of iid samples. Note that this is also equivalent to recovering the underlying graph structure of a sparse Gaussian Markov Random Field (GMRF). We present two novel greedy approaches to solving this problem. The first estimates the non-zero covariates of the overall inverse covariance matrix using a series of global forward and backward greedy steps. The second estimates the neighborhood of each node in the graph separately, again using greedy forward and backward steps, and combines the intermediate neighborhoods to form an overall estimate. The principal contribution of this paper is a rigorous analysis of the sparsistency, or consistency in recovering the sparsity pattern of the inverse covariance matrix. Surprisingly, we show that both the local and global greedy methods learn the full structure of the model with high probability given just O(d log(p)) samples, which is a significant improvement over the state-of-the-art ℓ1-regularized Gaussian MLE (Graphical Lasso), which requires O(d² log(p)) samples. Moreover, the restricted eigenvalue and smoothness conditions imposed by our greedy methods are much weaker than the strong irrepresentable conditions required by the ℓ1-regularization based methods.
We corroborate our results with extensive simulations and examples, comparing our local and global greedy methods to the ℓ1-regularized Gaussian MLE, as well as the Neighborhood Greedy method to nodewise ℓ1-regularized linear regression (Neighborhood Lasso).

[Appearing in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, La Palma, Canary Islands. Volume XX of JMLR: W&CP XX. Copyright 2012 by the authors.]

1 Introduction

High-dimensional Covariance Estimation. Increasingly, modern statistical problems across varied fields of science and engineering involve a large number of variables. Estimation of such high-dimensional models has been the focus of considerable recent research, and it is now well understood that consistent estimation is possible when some low-dimensional structure is imposed on the model space. In this paper, we consider the specific high-dimensional problem of recovering the covariance matrix of a zero-mean Gaussian random vector, under the low-dimensional structural constraint of sparsity of the inverse covariance, or concentration matrix. When the random vector is multivariate Gaussian, the set of non-zero entries in the concentration matrix corresponds to the set of edges in an associated Gaussian Markov random field (GMRF). In this setting, imposing sparsity on the entries of the concentration matrix can be interpreted as requiring that the graph underlying the GMRF have relatively few edges.

State of the art: ℓ1-regularized Gaussian MLE. For this task of sparse GMRF estimation, a line of recent papers [3, 5, 15] has proposed an estimator that minimizes the Gaussian negative log-likelihood regularized by the ℓ1 norm of the entries (or the off-diagonal entries) of the concentration matrix.
The resulting optimization problem is a log-determinant program, which can be solved in polynomial time with interior point methods [1], or by coordinate descent algorithms [3, 5]. Rothman et al. [11] and Ravikumar et al. [10] have also shown strong statistical guarantees for this estimator: both ℓ2 operator-norm error bounds, and recovery of the underlying graph structure.

Recent resurgence of greedy methods. A related line of recent work on learning sparse models has focused on "stagewise" greedy algorithms. These perform simple forward steps (adding parameters greedily), and possibly also backward steps (removing parameters greedily), and yet provide strong statistical guarantees for the estimate after a finite number of greedy steps. Indeed, such greedy algorithms have appeared in various guises in multiple communities: in machine learning as boosting [4], in function approximation [13], and in signal processing as basis pursuit [2]. In the context of statistical model estimation, Zhang [17] analyzed the forward greedy algorithm for the case of sparse linear regression, and showed that it is sparsistent (consistent for model selection recovery) under the same "irrepresentable" condition as that required for sparsistency of the Lasso. Zhang [16] analyzed a more general greedy algorithm for sparse linear regression that performs forward and backward steps, and showed that it is sparsistent under a weaker restricted eigenvalue condition. Jalali et al. [7] extended the sparsistency analysis of [16] to general non-linear models, and again showed that strong sparsistency guarantees hold for these algorithms.

Our Approaches.
Motivated by these recent results, we apply the forward-backward greedy algorithm studied in [16, 7] to the task of learning the graph structure of a Gaussian Markov random field given iid samples. We propose two algorithms: one that applies the greedy algorithm to the overall Gaussian log-likelihood loss, and the other that is based on greedy neighborhood estimation. For this second method, we follow [8, 9], and estimate the neighborhood of each node by applying the greedy algorithm to the local node-conditional log-likelihood loss (which reduces to the least squares loss); we then show that each neighborhood is recovered with very high probability, so that by an elementary union bound, the entire graph structure is recovered with high probability. A principal contribution of this paper is a rigorous analysis of these algorithms, where we report sufficient conditions for recovery of the underlying graph structure. We also corroborate our analysis with extensive simulations. Our analysis shows that for a Gaussian random vector X = (X_1, X_2, ..., X_p) with p variables, both the global and local greedy algorithms only require n = O(d log(p)) samples for sparsistent graph recovery. Note that this is a significant improvement over the ℓ1-regularized Gaussian MLE [15], which has been shown to require O(d² log(p)) samples [10]. Moreover, we show that the local and global greedy algorithms require only very weak restricted eigenvalue and restricted smoothness conditions on the true inverse covariance matrix (with the local greedy imposing a marginally weaker condition than the global greedy algorithm). This is in contrast to the ℓ1-regularized Gaussian MLE, which imposes a very stringent edge-based irrepresentable condition [10].
In Section 5, we explicitly compare the conditions imposed by the various methods for some simple GMRFs, and quantitatively show that the local and global greedy methods impose much weaker conditions on the covariance entries. Thus, both theoretically and via simulations, we show that the methods proposed in this paper are the state of the art in recovering the graph structure of a GMRF from iid samples: both in the number of samples required, and in the weakness of the sufficient conditions imposed on the model.

2 Problem Setup

2.1 Gaussian graphical models

Let X = (X_1, X_2, ..., X_p) be a zero-mean Gaussian random vector. Its density is parameterized by its inverse covariance or concentration matrix Θ* = (Σ*)^{−1} ≻ 0, and can be written as

    f(x_1, ..., x_p; Θ*) = exp(−(1/2) x^T Θ* x) / √((2π)^p det(Θ*)^{−1}).    (1)

We can associate an undirected graph structure G = (V, E) with this distribution, with vertex set V = {1, 2, ..., p} corresponding to the variables (X_1, ..., X_p), and edge set such that (i, j) ∉ E if Θ*_ij = 0. We are interested in the problem of recovering this underlying graph structure, which corresponds to determining which off-diagonal entries of Θ* are non-zero, that is, the set

    E(Θ*) := { (i, j) : i, j ∈ V, i ≠ j, Θ*_ij ≠ 0 }.    (2)

Given n samples, we define the sample covariance matrix

    Σ̂^n := (1/n) Σ_{k=1}^n X^(k) (X^(k))^T.    (3)

In the sequel, we occasionally drop the superscript n and simply write Σ̂ for the sample covariance. With a slight abuse of notation, we define the sparsity index s := |E(Θ*)| as the total number of non-zero elements in off-diagonal positions of Θ*; equivalently, this corresponds to twice the number of edges in the case of a Gaussian graphical model.
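For concreteness, the sample covariance in (3) and the edge set in (2) can be computed directly from data; the following is a minimal sketch (assuming numpy; the zero tolerance `tol` is an implementation choice, not part of the paper):

```python
import numpy as np

def sample_covariance(X):
    # X: (n, p) matrix of n iid zero-mean samples; returns Sigma_hat^n as in (3)
    n = X.shape[0]
    return X.T @ X / n

def edge_set(Theta, tol=1e-8):
    # off-diagonal non-zero pattern of a concentration matrix, as in (2)
    p = Theta.shape[0]
    return {(i, j) for i in range(p) for j in range(p)
            if i != j and abs(Theta[i, j]) > tol}
```

The sparsity index s is then simply `len(edge_set(Theta))`, which counts each edge twice, matching the convention above.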
We also define the maximum degree or row cardinality

    d := max_{i=1,...,p} |{ j ∈ V : Θ*_ij ≠ 0 }|,    (4)

corresponding to the maximum number of non-zeros in any row of Θ*; this corresponds to the maximum degree in the graph of the underlying Gaussian graphical model. Note that we have included the diagonal entry Θ*_ii in the degree count, corresponding to a self-loop at each vertex.

2.2 State of the art: ℓ1 regularization

Define the off-diagonal ℓ1 regularizer

    ‖Θ‖_{1,off} := Σ_{i≠j} |Θ_ij|,    (5)

where the sum ranges over all i, j = 1, ..., p with i ≠ j. Given some regularization constant λ_n > 0, we consider estimating Θ* by solving the following ℓ1-regularized log-determinant program:

    Θ̂ := argmin_{Θ ∈ S^p_{++}} { ⟨⟨Θ, Σ̂^n⟩⟩ − log det(Θ) + λ_n ‖Θ‖_{1,off} },    (6)

which returns a symmetric positive definite matrix Θ̂. Note that this corresponds to the ℓ1-regularized Gaussian MLE when the underlying distribution is Gaussian.

2.3 Forward-Backward Greedy

[16, 7] consider a simple forward-backward greedy algorithm for model estimation that begins with an empty set of active variables and gradually adds (and removes) variables to the active set. This algorithm has two basic steps: the forward step and the backward step. In the forward step, the algorithm finds the best next candidate and adds it to the active set as long as it improves the loss function by at least ε_S; otherwise the stopping criterion is met and the algorithm terminates. Then, in the backward step, the algorithm checks the influence of all variables in the presence of the newly added variable. If one or more of the previously added variables do not contribute at least ν ε_S to the loss function, the algorithm removes them from the active set.
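The forward and backward steps just described can be sketched on the simplest instance, a least-squares loss with full refits over the active set. This is an illustrative reimplementation in the spirit of [16], not the paper's exact procedure; `eps_s` and `nu` play the roles of ε_S and ν in the text:

```python
import numpy as np

def ls_loss(X, y, S):
    """Least-squares loss, refit over the active set S (returns RSS / 2n)."""
    n = X.shape[0]
    if not S:
        return float(y @ y) / (2 * n)
    beta, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    r = y - X[:, S] @ beta
    return float(r @ r) / (2 * n)

def forward_backward_greedy(X, y, eps_s, nu=0.5, max_rounds=100):
    """Schematic forward-backward greedy on the least-squares loss."""
    S = []
    for _ in range(max_rounds):
        # Forward step: find the single best coordinate to add.
        cands = [j for j in range(X.shape[1]) if j not in S]
        if not cands:
            break
        cur = ls_loss(X, y, S)
        delta_f, j_star = max((cur - ls_loss(X, y, S + [j]), j) for j in cands)
        if delta_f <= eps_s:          # stopping criterion
            break
        S.append(j_star)
        # Backward steps: drop any variable contributing less than nu * delta_f.
        while len(S) > 1:
            cur = ls_loss(X, y, S)
            cost, j_drop = min((ls_loss(X, y, [t for t in S if t != j]) - cur, j)
                               for j in S)
            if cost > nu * delta_f:
                break
            S.remove(j_drop)
    return sorted(S)
```

Because every forward step gains more than ε_S while every backward removal gives back at most a ν fraction of that gain, each round makes net progress of at least (1 − ν) ε_S, which is exactly the termination argument above.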
This procedure ensures that at each round, the loss function improves by at least (1 − ν) ε_S, and hence the algorithm terminates within a finite number of steps.

In the sequel, we apply this greedy methodology to Gaussian graphical models, to obtain two methods: (a) Greedy Gaussian MLE, which applies the greedy algorithm to the Gaussian negative log-likelihood loss, and (b) Greedy Neighborhood Estimation, which applies the greedy algorithm to the local node-conditional negative log-likelihood loss.

Algorithm 1: Global greedy forward-backward algorithm for Gaussian covariance estimation
  Input: Σ̂^n, stopping threshold ε_S, backward step factor ν ∈ (0, 1)
  Output: inverse covariance estimate Θ̂
  Initialize Θ̂^(0) ← I, Ŝ^(0) ← ∅, and k ← 1
  while true do  {Forward Step}
    ((i*, j*), α*) ← argmin_{(i,j) ∈ (Ŝ^(k−1))^c; α} L(Θ̂^(k−1) + α(e_ij + e_ji))
    Ŝ^(k) ← Ŝ^(k−1) ∪ {(i*, j*)}
    δ_f^(k) ← L(Θ̂^(k−1)) − L(Θ̂^(k−1) + α*(e_{i*j*} + e_{j*i*}))
    if δ_f^(k) ≤ ε_S then break end if
    Θ̂^(k) ← argmin_Θ L(Θ_{Ŝ^(k)});  k ← k + 1
    while true do  {Backward Step}
      (i*, j*) ← argmin_{(i,j) ∈ Ŝ^(k−1)} L(Θ̂^(k−1) − Θ̂^(k−1)_ij (e_ij + e_ji))
      if L(Θ̂^(k−1) − Θ̂^(k−1)_{i*j*} (e_{i*j*} + e_{j*i*})) − L(Θ̂^(k−1)) > ν δ_f^(k) then break end if
      Ŝ^(k−1) ← Ŝ^(k) − {(i*, j*)};  Θ̂^(k−1) ← argmin_Θ L(Θ_{Ŝ^(k−1)});  k ← k − 1
    end while
  end while

3 Greedy Gaussian MLE

In Algorithm 1, we describe the greedy algorithm of [16, 7] as applied to the Gaussian log-likelihood loss

    L(Θ) := ⟨⟨Θ, Σ̂^n⟩⟩ − log det(Θ).

Assumption: Let ρ ≥ 1 be a constant and ∆ ∈ R^{p×p} be a symmetric matrix that is sparse with at most ηd non-zero entries per row (and column), for some η ≥ 2 + 4ρ² (√((ρ² − ρ)/d) + √2)².
We require that the population covariance matrix Σ* = E[X X^T] satisfy the restricted eigenvalue property, i.e., for some positive constant C_min, we have

    C_min ‖∆‖_F ≤ ⟨⟨Σ*, ∆⟩⟩ ≤ ρ C_min ‖∆‖_F,

where ‖·‖_F denotes the Frobenius norm.

Lemma 1. Suppose Σ* satisfies the assumption above. Then, for an arbitrarily small constant α > 0, with probability at least 1 − c₁ exp(−c₂ n) we have that for any symmetric matrix ∆ with ηd non-zero entries per row (and column),

    (1 − α) C_min ‖∆‖_F ≤ ⟨⟨Σ̂^n, ∆⟩⟩ ≤ (1 + α) ρ C_min ‖∆‖_F,

provided that n ≥ K d log(p) for some positive constants K, c₁ and c₂.

Proof. The proof follows from Lemma 9 (Appendix K) in [14].

Using a Taylor series expansion, we can write

    L(Θ + ∆) = L(Θ) + ⟨⟨∆, Σ̂^n⟩⟩ − ⟨⟨Θ^{−1}, ∆⟩⟩ + Σ_{i=2}^∞ ((−1)^i / i) ⟨⟨(Θ^{−1} ∆)^{i−1} Θ^{−1}, ∆⟩⟩,

where we denote the final sum by R_∆. In order to establish the restricted strong convexity/smoothness required by [7], we need to lower/upper bound R_∆. Notice that in the proof of [7], the required ∆ is the difference between the target variable Θ* and the k-th step estimate Θ̂^(k). Since the algorithm is guaranteed to converge, ∆ = Θ* − Θ̂^(k) is always bounded. Thus, without loss of generality, we assume that ‖∆‖_F ≤ 1; we can rescale ∆ and a similar result holds. The next lemma provides the required upper/lower bounds.

Lemma 2. Suppose Σ* satisfies the assumption above. Then with probability at least 1 − c₁ exp(−c₂ n), we have that for any symmetric matrix ∆ with ηd non-zero entries per row (and column), and with ‖∆‖_F ≤ 1,

    (1/4) C_min² ‖∆‖_F² ≤ R_∆ ≤ (1/2) ρ² C_min² ‖∆‖_F².

Proof. Denote γ = ⟨⟨Θ^{−1}, ∆⟩⟩. We have

    R_∆ = Σ_{i=2}^∞ ((−1)^i / i) γ^i = γ − log(1 + γ).
Under our assumption, C_min ‖∆‖_F ≤ γ ≤ ρ C_min ‖∆‖_F, and the function γ − log(1 + γ) is increasing in γ. Moreover, for this range of γ we have γ − log(1 + γ) ≥ (1/4) γ², because both sides vanish at zero and the derivative of the left-hand side dominates the derivative of the right-hand side over the relevant range. Hence, we have

    (1/4) C_min² ‖∆‖_F² ≤ C_min ‖∆‖_F − log(1 + C_min ‖∆‖_F) ≤ γ − log(1 + γ) = R_∆
        ≤ ρ C_min ‖∆‖_F − log(1 + ρ C_min ‖∆‖_F) ≤ (1/2) ρ² C_min² ‖∆‖_F².

The last inequality follows from γ − log(1 + γ) ≤ (1/2) γ² (the two sides are equal at zero and the derivative of the right-hand side is always larger for γ > 0). Hence, the result follows.

Let ∇^(n) := ‖Σ̂^n − (Θ*)^{−1}‖_∞. By the first-order optimality condition for Θ*, it is clear that lim_{n→∞} ∇^(n) = 0. The following lemma provides an upper bound on ∇^(n).

Lemma 3. Given the sample complexity n ≥ K log(p) for some constant K, we have

    ∇^(n) ≤ c √(log(p)/n),

with probability at least 1 − c₁ exp(−c₂ n) for some positive constants c, c₁ and c₂.

Proof. The proof follows from Lemma 1 in [10].

This entails that restricted strong convexity and smoothness (i.e., the required assumptions of the general result in [7]) are satisfied. We can now specialize the results in [7] to obtain the following theorem:

Theorem 1 (Global Greedy Sparsistency). Under the assumption above, suppose we run Algorithm 1 with stopping threshold ε_S ≥ (2cη/ρ²) d log(p)/n, where d is the maximum node degree in the graphical model, the true parameters Θ* satisfy min_{(i,j) ∈ S*} |Θ*_ij| ≥ √(8 ε_S / ρ²), and the number of samples scales as n > K d log(p) for some constant K. Then, with probability at least 1 − c₁ exp(−c₂ n), we have:

(a) No False Exclusions: E* − Ê = ∅.
(b) No False Inclusions: Ê − E* = ∅.
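The scalar sandwich used in the proof of Lemma 2, (1/4)γ² ≤ γ − log(1 + γ) ≤ (1/2)γ², is easy to spot-check numerically. The sketch below checks γ ∈ (0, 1], consistent with the restriction ‖∆‖_F ≤ 1; the lower bound does fail for large γ, which is why the bounded range matters:

```python
import math

def remainder(g):
    # R_Delta collapses to the scalar function g - log(1 + g) in the proof
    return g - math.log1p(g)

# Check (1/4) g^2 <= g - log(1 + g) <= (1/2) g^2 on a grid over (0, 1].
for k in range(1, 101):
    g = k / 100.0
    assert 0.25 * g * g <= remainder(g) <= 0.5 * g * g
```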
4 Greedy Neighborhood Estimation

Denote by N*(r) the set of neighbors of a vertex r ∈ V, so that N*(r) = { t : (r, t) ∈ E* }. The graphical model selection problem is then equivalent to estimating neighborhoods N̂_n(r) ⊆ V such that P[N̂_n(r) = N*(r), ∀ r ∈ V] → 1 as n → ∞.

For any pair of random variables X_r and X_t, the parameter Θ_rt fully characterizes whether there is an edge between them, and can be estimated via its conditional likelihood. In particular, defining Θ_r := {Θ_rt}_{t≠r}, our goal is to use the conditional likelihood of X_r conditioned on X_{V∖r} to estimate the support of Θ_r, and hence its neighborhood N(r). The conditional distribution of X_r given X_{V∖r} generated by (1) is (writing Σ = Θ^{−1})

    X_r | X_{V∖r} ∼ N( Σ_{r,∖r} (Σ_{∖r,∖r})^{−1} X_{V∖r},  Σ_{rr} − Σ_{r,∖r} (Σ_{∖r,∖r})^{−1} Σ_{∖r,r} ).

However, note that we do not need to estimate the variance of this conditional distribution in order to obtain the support of Θ_r = Θ_{r,∖r}. In particular, the solution of the population least squares problem

    Γ*_r = argmin_{Γ_r} E[ ( X_r − Σ_{t≠r} Γ_rt X_t )² ]

satisfies supp(Γ*_r) = supp(Θ*_r). Given the n samples X^(1), ..., X^(n), we thus use the sample-based least squares loss

    L(Γ_r) = (1/2n) Σ_{i=1}^n ( X_r^(i) − Σ_{t≠r} Γ_rt X_t^(i) )².    (7)

Adapting the greedy algorithm from the previous section to this loss at each node yields Algorithm 2.

Assumption: Let ρ ≥ 1 be a constant and ∆ ∈ R^{p−1} be an arbitrary ηd-sparse vector, where η ≥ 2 + 4ρ² (√((ρ² − ρ)/d) + √2)². We require that the marginal population Fisher information matrix Σ*_{∖r} = E[X_{∖r} X_{∖r}^T] satisfy the restricted eigenvalue property, i.e., for some positive constant C_min,

    C_min ‖∆‖_F ≤ ‖Σ*_{∖r} ∆‖_F ≤ ρ C_min ‖∆‖_F.

Lemma 4.
Under the assumption above, for an arbitrarily small constant α > 0, the marginal sample Fisher information matrix Σ̂^n_{∖r} = (1/n) Σ_{i=1}^n X_{∖r}^(i) (X_{∖r}^(i))^T satisfies, with probability at least 1 − c₁ exp(−c₂ n), the condition that for any ηd-sparse vector ∆,

    (1 − α) C_min ‖∆‖_F ≤ ‖Σ̂^n_{∖r} ∆‖_F ≤ (1 + α) ρ C_min ‖∆‖_F,

provided that n ≥ K d log(p) for some positive constants K, c₁ and c₂.

Proof. The proof follows from Lemma 9 (Appendix K) in [14].

Let ∇_r^(n) := max_t | (1/n) Σ_{i=1}^n X_t^(i) ( X_r^(i) − Σ_{u≠r} Γ*_ru X_u^(i) ) |. By the first-order optimality condition for Γ*_r, it is clear that lim_{n→∞} ∇_r^(n) = 0. The following lemma provides an upper bound on ∇_r^(n).

Algorithm 2: Greedy forward-backward algorithm for marginal Gaussian covariance estimation
  Input: data vectors X^(1), ..., X^(n), stopping threshold ε_S, backward step factor ν ∈ (0, 1)
  Output: marginal vector Γ̂_r
  Initialize Γ̂_r^(0) ← 0, Ŝ^(0) ← ∅, and k ← 1
  while true do  {Forward Step}
    (t*, α*) ← argmin_{t ∈ (Ŝ^(k−1))^c; α} L(Γ̂_r^(k−1) + α e_t)
    Ŝ^(k) ← Ŝ^(k−1) ∪ {t*}
    δ_f^(k) ← L(Γ̂_r^(k−1)) − L(Γ̂_r^(k−1) + α* e_{t*})
    if δ_f^(k) ≤ ε_S then break end if
    Γ̂_r^(k) ← argmin_{Γ_r} L((Γ_r)_{Ŝ^(k)});  k ← k + 1
    while true do  {Backward Step}
      t* ← argmin_{t ∈ Ŝ^(k−1)} L(Γ̂_r^(k−1) − Γ̂_rt^(k−1) e_t)
      if L(Γ̂_r^(k−1) − Γ̂_{rt*}^(k−1) e_{t*}) − L(Γ̂_r^(k−1)) > ν δ_f^(k) then break end if
      Ŝ^(k−1) ← Ŝ^(k) − {t*};  Γ̂_r^(k−1) ← argmin_{Γ_r} L((Γ_r)_{Ŝ^(k−1)});  k ← k − 1
    end while
  end while

Lemma 5. Given the sample complexity n ≥ K log(p) for some constant K, we have

    ∇_r^(n) ≤ c √(log(p)/n),

with probability at least 1 − c₁ exp(−c₂ n) for some positive constants c, c₁ and c₂.

Proof.
The proof follows from Lemma 5 in [14].

This entails that restricted strong convexity and smoothness (i.e., the required assumptions of the general result in [7]) are satisfied with constants C_min and ρ C_min, respectively, because the third- and higher-order derivatives are zero. We can now specialize the results in [7] to obtain the following theorem:

Theorem 2 (Neighborhood Greedy Sparsistency). Under the assumption above, suppose we run Algorithm 2 with stopping threshold ε_S ≥ (8cρη/C_min) d log(p)/n, where d is the maximum node degree in the graphical model, the true parameters Γ*_r satisfy min_{t ∈ N*(r)} |Γ*_rt| ≥ √(32 ρ ε_S / C_min), and the number of samples scales as n > K d log(p) for some constant K. Then, with probability at least 1 − c₁ exp(−c₂ n), we have:

(a) No False Exclusions: E*_r − Ê_r = ∅.
(b) No False Inclusions: Ê_r − E*_r = ∅.

5 Comparisons to Related Methods

In this section, we compare our global and local greedy methods to the ℓ1-regularized Gaussian MLE, analyzed in [10], and to ℓ1-regularization (Lasso) based neighborhood selection, analyzed in [8, 14].

5.1 Sample Complexity

Our greedy algorithms require O(d log(p)) samples to recover the exact structure of the graph, for both the global and the local neighborhood-based methods. In contrast, the ℓ1-regularized Gaussian MLE [10] requires O(d² log(p)) samples to guarantee structure recovery with high probability. Linear neighborhood selection with ℓ1-regularization [8] requires O(d log(p)) samples to guarantee sparsistency, similar to our greedy algorithms.

5.2 Minimum Non-Zero Values

The ℓ1-regularized Gaussian MLE imposes the model condition that the minimum non-zero entry of (Σ*)^{−1} satisfy (Σ*)^{−1}_min = O(1/d).
Our greedy algorithms allow for a broader range of minimum non-zero values, (Σ*)^{−1}_min = O(1/√d). Linear neighborhood selection with ℓ1-regularization again matches our greedy algorithms, and only requires that (Σ*)^{−1}_min = O(1/√d).

5.3 Parameter Restrictions

We now compare the irrepresentable and the restricted eigenvalue and smoothness conditions imposed on the model parameters by the different methods.

5.3.1 Star Graphs

Consider a star graph G(V, E) with p nodes as in Fig 1(a), where the center node is labeled 1 and the other nodes are labeled 2 to p. Following [10], consider the following covariance matrix Σ* parameterized by a correlation parameter τ ∈ [−1, 1]: the diagonal entries are set to Σ*_ii = 1 for all i ∈ V; the entries corresponding to edges are set to Σ*_ij = τ for (i, j) ∈ E; and the non-edge entries are set to Σ*_ij = τ² for (i, j) ∉ E.

[Figure 1: Generic graph schematics: (a) star, (b) chain, (c) grid, (d) diamond.]

It is easy to check that Σ* induces the desired star graph. With this setup, the irrepresentable condition imposed by the ℓ1-regularized Gaussian MLE [10] entails that |τ|(|τ| + 2) < 1, or equivalently τ ∈ (−0.4142, 0.4142), to guarantee sparsistency. Our greedy algorithms, however, allow for τ ∈ (−1, 1) (since C_min = 1 − τ²). Under the same setup, linear neighborhood selection with ℓ1-regularization [8] requires τ ∈ (−1, 1) to guarantee success.

5.3.2 Chain Graphs

Consider a chain (line) graph G(V, E) on p nodes as shown in Fig 1(b). Again, consider a population covariance matrix Σ* parameterized by a correlation parameter τ ∈ [−1, 1]: set Σ*_ij = τ^{|i−j|}. This matrix thus assumes a correlation factor of τ^k between two nodes that are k hops away from each other. It is easy to check that Σ* induces the desired chain graph.
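Both covariance constructions can be spot-checked numerically: inverting the star Σ* should give off-diagonal non-zeros only on hub-leaf pairs, and inverting the chain Σ* should give a tridiagonal matrix. A sketch assuming numpy; the sizes p and values of τ are arbitrary test choices, and nodes are 0-indexed here with node 0 as the hub:

```python
import numpy as np

def star_cov(p, tau):
    # Sigma* for the star graph with hub node 0: Sigma_ii = 1,
    # Sigma_0j = tau on edges, Sigma_jk = tau^2 between leaves
    S = np.full((p, p), tau ** 2)
    S[0, :] = S[:, 0] = tau
    np.fill_diagonal(S, 1.0)
    return S

def chain_cov(p, tau):
    # Sigma*_ij = tau^{|i-j|} for the chain graph (an AR(1) covariance)
    idx = np.arange(p)
    return tau ** np.abs(idx[:, None] - idx[None, :])

def support(M, tol=1e-8):
    # off-diagonal non-zero pattern
    p = M.shape[0]
    return {(i, j) for i in range(p) for j in range(p)
            if i != j and abs(M[i, j]) > tol}
```

Inverting `star_cov(5, 0.3)` yields non-zeros only in row/column 0, and inverting `chain_cov(6, 0.4)` yields non-zeros only on the first off-diagonals, exactly the claimed edge patterns.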
With this setup, the ℓ1-regularized Gaussian MLE [10] requires |τ|^{p−2} ((p − 2)|τ| + p − 1) < 1. It is hard to evaluate bounds on τ in general, but for p = 4 we have τ ∈ (−0.6, 0.6); for p = 10 we have τ ∈ (−0.75, 0.75); and for p = 100 we have τ ∈ (−0.95, 0.95). Our greedy algorithms, on the other hand, allow for τ ∈ (−1, 1) (since C_min = (1 − τ²) f_p(τ) for some function f_p(τ) that depends on p and satisfies f_p(τ) > C_p for all τ, for some constant C_p depending only on p). Under the same setup, linear neighborhood selection with ℓ1-regularization [8] only imposes τ ∈ (−1, 1), similar to our greedy methods.

5.3.3 Diamond Graph

Consider the diamond graph G(V, E) on 4 nodes, with the nodes labeled as in Fig 1(d). Given a correlation parameter τ ≠ 0, let Σ* be the population covariance matrix with Σ*_ii = 1 and Σ*_ij = τ, except Σ*_23 = 0 and Σ*_14 = 2τ². It is easy to check that (Σ*)^{−1} induces the desired graph. With this setup, the ℓ1-regularized Gaussian MLE [10] requires 4|τ|(|τ| + 1) < 1, or equivalently τ ∈ (−0.2017, 0.2017). Our greedy algorithms allow for τ ∈ (−0.7071, 0.7071) (since C_min = 1 − 2τ²). Under the same setup, linear neighborhood selection with ℓ1-regularization [8] requires 2|τ| < 1, or equivalently τ ∈ (−0.5, 0.5), to guarantee success. Unlike the previous two examples, this is a strictly stronger condition than that imposed by our greedy methods.

6 Experimental Analysis

In this section we outline our experimental results testing the effectiveness of both Algorithm 1 and Algorithm 2 in a simulated environment.

6.1 Optimization Method

Our greedy algorithm consists of a single-variable optimization step in which we try to pick the best coordinate.
This step can be run in parallel over all single variables to achieve maximum speedup. For greedy neighborhood selection, the single-variable optimization is a relatively simple operation; however, for the global model selection algorithm (a log-det optimization), we would like a fast single-variable optimization method that avoids continually recomputing the log-determinant. Following the result in [12], we have

    det(Θ̂^(k−1) + α(e_ij + e_ji)) = det(Θ̂^(k−1)) [ (1 + α (Θ̂^(k−1))^{−1}_ij)² − α² (Θ̂^(k−1))^{−1}_ii (Θ̂^(k−1))^{−1}_jj ].

This entails that

    α* = argmin_α ⟨⟨Θ̂^(k−1) + α(e_ij + e_ji), Σ̂^n⟩⟩ − log det(Θ̂^(k−1) + α(e_ij + e_ji))
       = ( Σ̂^n_ij − (Θ̂^(k−1))^{−1}_ij ) / ( (Θ̂^(k−1))^{−1}_ii (Θ̂^(k−1))^{−1}_jj − ((Θ̂^(k−1))^{−1}_ij)² ).

This closed-form solution simplifies the single-variable optimization step in our algorithm and avoids continual calculation of log det(Θ̂).

[Figure 2: Plots of success probability P[Ŝ = S*] versus the control parameter β(n, p, d) = n/[70 d log(p)] for (a) chain (d = 2) and (b) 4-nearest-neighbor grid (d = 4) graphs with p ∈ {36, 64, 100}, using both Algorithm 1 and the ℓ1-regularized Gaussian MLE (Graphical Lasso). As our theorem suggests and these figures show, the Global Greedy algorithm requires fewer samples to recover the exact structure of the graphical model.]

6.2 Experiments

To present a formal experimental analysis of both Algorithm 1 and Algorithm 2, we simulated zero-mean Gaussian inverse covariance estimation, or GMRF structure learning, for various graph types and scalings of (n, p, d).
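The rank-two determinant identity from Section 6.1 (an instance of the matrix determinant lemma) can be verified numerically; a sketch assuming numpy, with an arbitrary positive definite matrix standing in for Θ̂^(k−1) and arbitrary choices of (i, j, α):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Theta = A @ A.T + 5 * np.eye(5)        # a generic positive definite matrix
W = np.linalg.inv(Theta)               # W plays the role of (Theta_hat)^{-1}
i, j, alpha = 1, 3, 0.2

E = np.zeros((5, 5))
E[i, j] = E[j, i] = 1.0                # the symmetric perturbation e_ij + e_ji

lhs = np.linalg.det(Theta + alpha * E)
rhs = np.linalg.det(Theta) * ((1 + alpha * W[i, j]) ** 2
                              - alpha ** 2 * W[i, i] * W[j, j])
assert np.isclose(lhs, rhs)
```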
For the Global Greedy method we experimented using chain (d = 2) and grid (d = 4) graph types with sizes p ∈ {36, 64, 100}. For the Neighborhood Greedy method we experimented using chain (d = 2) and star (d = 0.1p) graph types with sizes p ∈ {36, 64, 100}. Figure 1 outlines the schematic structure of each graph type.

For each algorithm, we measured performance by whether the true support set S*, pertaining to the non-zero inverse covariates (graph edges), was completely learned. If S* was completely learned, we called the trial a success; otherwise we called it a failure. Using a batch of 50 trials for each scaling of (n, p, d), we measured the probability of success as the average success rate. For both algorithms we used a stopping threshold ε_S = c d log(p)/n, where d is the maximum degree of the graph, p is the number of nodes in the graph, n is the number of samples used, and c is a constant tuning parameter, as well as a backward step threshold of ν = 0.5.

[Figure 3: Plots of success probability P[N̂(r) = N*(r), ∀ r ∈ V] versus the control parameter β(n, p, d) = n/[70 d log(p)] for (a) chain (d = 2) and β(n, p, d) = n/[200 log(dp)] for (b) star (d = 0.1p) graphs, using both Algorithm 2 and nodewise ℓ1-regularized linear regression (Neighborhood Lasso). As our theorem suggests and these figures show, the Neighborhood Greedy algorithm requires fewer samples to recover the exact structure of the graphical model.]
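The success-rate protocol just described (batches of trials, with exact support recovery counted as a success) can be sketched generically; `recover` and `sample` below are hypothetical stand-ins for an estimator such as Algorithm 1 or 2 and for the synthetic GMRF sampler:

```python
def success_probability(recover, sample, true_support, n_trials=50):
    """Average exact-recovery rate over a batch of independent trials.

    recover: callable mapping one data set to an estimated support set
    sample: callable producing one synthetic data set
    (both are hypothetical stand-ins, not part of the paper's code)
    """
    wins = sum(recover(sample()) == true_support for _ in range(n_trials))
    return wins / n_trials
```

A run is counted as a success only on exact set equality, so a single false inclusion or exclusion makes the whole trial a failure, matching the sparsistency criterion of Theorems 1 and 2.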
We compared Algorithm 1 to the ℓ1-regularized Gaussian MLE (Graphical Lasso) as discussed in [5] and [10], using the glasso implementation from Friedman et al. [5]. We compared Algorithm 2 to neighborhood-based ℓ1-regularized linear regression (Neighborhood Lasso), using the glmnet generalized Lasso implementation, also from Friedman et al. [6]. Both glasso and glmnet use a regularization parameter λ = c √(log(p)/n), which was set optimally using k-fold cross-validation.

Figure 2 plots the probability of successfully learning S* versus the control parameter β(n, p, d) = n/[70 d log(p)] for a varying number of samples n, for both Algorithm 1 and the Graphical Lasso. Figure 3 plots the probability of successfully learning S* versus the control parameter β(n, p, d) = n/[70 d log(p)] for the chain graph type and β(n, p, d) = n/[200 log(dp)] for the star graph type, for both Algorithm 2 and neighborhood-based ℓ1-regularized linear regression. Both figures illustrate our theoretical result that the greedy algorithms require fewer samples (O(d log p)) than the state-of-the-art Lasso methods (O(d² log p)) for complete structure learning.

References

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
[2] S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Computing, 20(1):33–61, 1998.
[3] A. d'Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance selection. SIAM J. Matrix Anal. Appl., 30(1):56–66, 2008.
[4] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28:337–374, 2000.
[5] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 9(3):432–441, 2007.
[6] J. Friedman, T. Hastie, and R. Tibshirani.
Regularized paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 2010.
[7] A. Jalali, C. Johnson, and P. Ravikumar. On learning discrete graphical models using greedy methods. In NIPS, 2011.
[8] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Ann. Statist., 34(3):1436–1462, 2006.
[9] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 38(3):1287–1319, 2010.
[10] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Statist., 5:935–980, 2011.
[11] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electron. J. Statist., 2:494–515, 2008.
[12] K. Scheinberg and I. Rish. Learning sparse Gaussian Markov networks using a greedy coordinate ascent approach. In: Balcázar, J., Bonchi, F., Gionis, A., Sebag, M. (eds.) Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science 6323, pages 196–212.
[13] V. N. Temlyakov. Greedy approximation. Acta Numerica, 17:235–409, 2008.
[14] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, 55:2183–2202, May 2009.
[15] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
[16] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Neural Information Processing Systems (NIPS) 21, 2008.
[17] T. Zhang. On the consistency of feature selection using greedy least squares regression.
Journal of Machine Learning Research, 10:555–568, 2009.
