Adaptive First-Order Methods for General Sparse Inverse Covariance Selection
Zhaosong Lu*
December 2, 2008

*Department of Mathematics, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada (email: zhaosong@sfu.ca). This author was supported in part by an SFU President's Research Grant and an NSERC Discovery Grant.

Abstract

In this paper, we consider estimating the sparse inverse covariance of a Gaussian graphical model whose conditional independence is assumed to be partially known. Similarly as in [5], we formulate it as an $l_1$-norm penalized maximum likelihood estimation problem. Further, we propose an algorithm framework and develop two first-order methods, that is, the adaptive spectral projected gradient (ASPG) method and the adaptive Nesterov's smooth (ANS) method, for solving this estimation problem. Finally, we compare the performance of these two methods on a set of randomly generated instances. Our computational results demonstrate that both methods are able to solve problems of size at least a thousand with nearly a half million constraints within a reasonable amount of time, and that the ASPG method generally outperforms the ANS method.

Key words: sparse inverse covariance selection, adaptive spectral projected gradient method, adaptive Nesterov's smooth method

AMS 2000 subject classification: 90C22, 90C25, 90C47, 65K05, 62J10

1 Introduction

It is well known that sparse undirected graphical models are capable of describing and explaining the relationships among a set of variables. Given a set of random variables with a Gaussian distribution, the estimation of such models involves finding the pattern of zeros in the inverse covariance matrix, since these zeros correspond to conditional independencies among the variables. In recent years, a variety of approaches have been proposed for estimating a sparse inverse covariance matrix. (All notation used below is defined in Subsection 1.1.) Given a sample covariance matrix $\Sigma \in S^n_+$, d'Aspremont et al. [5] formulated sparse inverse covariance selection as the following $l_1$-norm penalized maximum likelihood estimation problem:

$$\max_X \{\log\det X - \langle \Sigma, X\rangle - \rho\, e^T |X| e : X \succ 0\}, \qquad (1)$$

where $\rho > 0$ is a parameter controlling the trade-off between likelihood and sparsity of the solution. They also studied Nesterov's smooth approximation scheme [10] and a block-coordinate descent (BCD) method for solving (1). Independently, Yuan and Lin [13] proposed an estimation problem similar to (1):

$$\max_X \Big\{\log\det X - \langle \Sigma, X\rangle - \rho \sum_{i\ne j} |X_{ij}| : X \succ 0\Big\}. \qquad (2)$$

They showed that problem (2) can be suitably solved by the interior point algorithm developed in Vandenberghe et al. [12]. As demonstrated in [5, 13], the estimation problems (1) and (2) are capable of effectively discovering the sparse structure, or equivalently, the conditional independence, in the underlying graphical model. Recently, Lu [8] proposed a variant of Nesterov's smooth method [10] for problems (1) and (2) that substantially outperforms the existing methods in the literature.
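To make the estimator concrete, the objective of (1) is cheap to evaluate once positive definiteness of $X$ is checked; below is a minimal NumPy sketch (the function name is ours, and positive definiteness is detected via the sign of the log-determinant):

```python
import numpy as np

def penalized_loglik(X, Sigma, rho):
    """Objective of (1): log det X - <Sigma, X> - rho * e^T |X| e.

    Returns -inf when X is not positive definite, so infeasible
    points are rejected automatically inside a maximization loop."""
    sign, logdet = np.linalg.slogdet(X)
    if sign <= 0:
        return -np.inf
    # <Sigma, X> = Tr(Sigma X^T) reduces to an entrywise sum for symmetric X
    return logdet - np.sum(Sigma * X) - rho * np.abs(X).sum()
```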
In addition, Dahl et al. [4] studied the maximum likelihood estimation of a Gaussian graphical model whose conditional independence is known, which can be formulated as

$$\max_X \{\log\det X - \langle \Sigma, X\rangle : X \succ 0,\ X_{ij} = 0,\ \forall (i,j)\in\bar{E}\}, \qquad (3)$$

where $\bar{E}$ is the collection of all pairs of conditionally independent nodes. They showed that when the underlying graph is nearly chordal, Newton's method and the preconditioned conjugate gradient method can be efficiently applied to solve (3).

In practice, the sparsity structure of a Gaussian graphical model is often partially known from some knowledge of its random variables. In this paper we consider estimating the sparse inverse covariance of a Gaussian graphical model whose conditional independence is assumed to be partially known in advance (though it can also be completely unknown). Given a sample covariance matrix $\Sigma \in S^n_+$, we can naturally formulate it as the following constrained $l_1$-norm penalized maximum likelihood estimation problem:

$$\max_X\ \log\det X - \langle\Sigma, X\rangle - \sum_{(i,j)\notin\Omega} \rho_{ij} |X_{ij}|,\quad \text{s.t.}\ X \succ 0,\ X_{ij} = 0,\ \forall (i,j)\in\Omega, \qquad (4)$$

where $\Omega$ consists of a set of pairs of conditionally independent nodes, and $\{\rho_{ij}\}_{(i,j)\notin\Omega}$ is a set of nonnegative parameters controlling the trade-off between likelihood and sparsity of the solution. It is worth mentioning that, unlike in [4], we do not assume any specific structure on the sparsity of the underlying graph for problem (4). We can clearly observe that (i) $(i,i)\notin\Omega$ for $1\le i\le n$, and $(i,j)\in\Omega$ if and only if $(j,i)\in\Omega$; (ii) $\rho_{ij} = \rho_{ji}$ for any $(i,j)\notin\Omega$; and (iii) problems (1)-(3) can be viewed as special cases of problem (4) by choosing appropriate $\Omega$ and $\{\rho_{ij}\}_{(i,j)\notin\Omega}$. For example, setting $\Omega = \emptyset$ and $\rho_{ij} = \rho$ for all $(i,j)$ reduces problem (4) to (1).

It is easy to observe that problem (4) can be reformulated as a constrained smooth convex problem that has an explicit $O(n^2)$-logarithmically homogeneous self-concordant barrier function. Thus, it can be suitably solved by interior point (IP) methods (see Nesterov and Nemirovski [11] and Vandenberghe et al. [12]). The worst-case iteration complexity of IP methods for finding an $\epsilon$-optimal solution to (4) is $O(n\log(\epsilon_0/\epsilon))$, where $\epsilon_0$ is an initial gap. Each iteration of IP methods requires $O(n^6)$ arithmetic operations for assembling and solving a typically dense Newton system with $O(n^2)$ variables. Thus, the total worst-case arithmetic cost of IP methods for finding an $\epsilon$-optimal solution to (4) is $O(n^7\log(\epsilon_0/\epsilon))$, which is prohibitive when $n$ is relatively large.

Recently, Friedman et al. [6] proposed a gradient-type method for solving problem (4). They first converted (4) into the following penalization problem

$$\max_{X\succ 0}\ \log\det X - \langle\Sigma, X\rangle - \sum_{i,j} \rho_{ij} |X_{ij}| \qquad (5)$$

by setting $\rho_{ij}$ to an extraordinarily large number (say, $10^9$) for all $(i,j)\in\Omega$. Then they applied a slight variant of the BCD method [5] to the dual problem of (5), in which each iteration is solved by a coordinate descent approach applied to a lasso ($l_1$-regularized) least-squares problem. Given that their method is a gradient-type method and the dual problem of (5) is highly ill-conditioned for the above choice of $\rho$, it is not surprising that their method converges extremely slowly. Moreover, since the associated lasso least-squares problems can only be solved inexactly, their method often fails to converge even for a small problem.
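The conversion used in [6] amounts to nothing more than filling in a full penalty matrix; here is a sketch of that construction, with the value $10^9$ taken from the text above (the helper name is ours, and for simplicity a common off-$\Omega$ penalty is used in place of general $\{\rho_{ij}\}$):

```python
import numpy as np

def huge_penalty_matrix(n, Omega, rho_off, big=1e9):
    """Penalty matrix {rho_ij} for problem (5) as in [6]: an
    extraordinarily large penalty on Omega (to force X_ij ~ 0 there)
    and a moderate penalty elsewhere.  Omega is assumed to contain
    both (i, j) and (j, i)."""
    rho = np.full((n, n), rho_off, dtype=float)
    for (i, j) in Omega:
        rho[i, j] = big
    return rho
```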
In this paper, we propose adaptive first-order methods for problem (4). Instead of solving (5) once with a set of huge penalty parameters $\{\rho_{ij}\}_{(i,j)\in\Omega}$, our methods consist of solving a sequence of problems (5) with a set of moderate penalty parameters $\{\rho_{ij}\}_{(i,j)\in\Omega}$ that are adaptively adjusted until a desired approximate solution is found. For a given $\rho$, problem (5) is solved by the adaptive spectral projected gradient (ASPG) method and the adaptive Nesterov's smooth (ANS) method that are proposed in this paper.

The rest of the paper is organized as follows. In Subsection 1.1, we introduce the notation used in this paper. In Section 2, we propose an algorithm framework and develop two first-order methods, that is, the ASPG and ANS methods, for solving problem (4). The performance of these two methods is compared on a set of randomly generated instances in Section 3. Finally, we present some concluding remarks in Section 4.

1.1 Notation

In this paper, all vector spaces are assumed to be finite dimensional. The symbols $\Re^n$, $\Re^n_+$ and $\Re^n_{++}$ denote the $n$-dimensional Euclidean space, the nonnegative orthant of $\Re^n$ and the positive orthant of $\Re^n$, respectively. The set of all $m\times n$ matrices with real entries is denoted by $\Re^{m\times n}$. The space of symmetric $n\times n$ matrices is denoted by $S^n$. If $X\in S^n$ is positive semidefinite, we write $X \succeq 0$. Also, we write $X \preceq Y$ to mean $Y - X \succeq 0$. The cone of positive semidefinite (resp., definite) matrices is denoted by $S^n_+$ (resp., $S^n_{++}$). Given matrices $X$ and $Y$ in $\Re^{m\times n}$, the standard inner product is defined by $\langle X, Y\rangle := \mathrm{Tr}(XY^T)$, where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. $\|\cdot\|$ denotes the Euclidean norm and its associated operator norm unless explicitly stated otherwise. The Frobenius norm of a real matrix $X$ is defined as $\|X\|_F := \sqrt{\mathrm{Tr}(XX^T)}$. We denote by $e$ the vector of all ones, and by $I$ the identity matrix; their dimensions should be clear from the context. For a real matrix $X$, we denote by $|X|$ the absolute value of $X$, that is, $|X|_{ij} = |X_{ij}|$ for all $i,j$. The determinant and the minimal (resp., maximal) eigenvalue of a real symmetric matrix $X$ are denoted by $\det X$ and $\lambda_{\min}(X)$ (resp., $\lambda_{\max}(X)$), respectively, and $\lambda_i(X)$ denotes its $i$th largest eigenvalue. Given an $n\times n$ (partial) matrix $\rho$, $\mathrm{Diag}(\rho)$ denotes the diagonal matrix whose $i$th diagonal element is $\rho_{ii}$ for $i = 1,\ldots,n$. Given matrices $X$ and $Y$ in $\Re^{m\times n}$, $X * Y$ denotes the pointwise product of $X$ and $Y$, namely, the matrix in $\Re^{m\times n}$ whose $ij$th entry is $X_{ij}Y_{ij}$ for all $i,j$. We denote by $Z_+$ the set of all nonnegative integers.

2 Adaptive first-order methods

In this section, we discuss some suitable first-order methods for the general sparse inverse covariance selection problem (4). In particular, we first provide an algorithm framework for it in Subsection 2.1. Then we specialize this framework by considering two first-order methods, namely, the adaptive spectral projected gradient method and the adaptive Nesterov's smooth method, in Subsection 2.2.

2.1 Algorithm framework

In this subsection, we provide an algorithm framework for the general sparse inverse covariance selection problem (4).
Throughout this paper, we assume that $\rho_{ij} \ge 0$ is given and fixed for all $(i,j)\notin\Omega$, and that the following condition holds.

Assumption 1 $\Sigma + \mathrm{Diag}(\rho) \succ 0$.

Note that $\Sigma$ is a sample covariance matrix, and hence $\Sigma \succeq 0$. In addition, $\mathrm{Diag}(\rho) \succeq 0$. Thus, $\Sigma + \mathrm{Diag}(\rho) \succeq 0$. It may not, however, be positive definite in general. But we can always perturb $\rho_{ii}$ by adding a small positive number (say, $10^{-8}$) whenever needed to ensure that the above assumption holds.

We first establish the existence of an optimal solution for problem (4).

Proposition 2.1 Problem (4) has a unique optimal solution $X^* \in S^n_{++}$.

Proof. Since $(i,i)\notin\Omega$ for $i = 1,\ldots,n$, we see that $X = I$ is a feasible solution of problem (4). For convenience, let $f(X)$ denote the objective function of (4). We now show that the sup-level set $S_f(I) = \{X \succeq 0 : f(X) \ge f(I),\ X_{ij} = 0,\ \forall (i,j)\in\Omega\}$ is compact. Indeed, using the definition of $f(\cdot)$, we observe that for any $X\in S_f(I)$,

$$f(I) \le f(X) \le \log\det X - \langle\Sigma+\mathrm{Diag}(\rho), X\rangle \le \sum_{i=1}^n \big[\log\lambda_i(X) - \lambda_{\min}(\Sigma+\mathrm{Diag}(\rho))\lambda_i(X)\big] \le (n-1)\big[-1-\log\lambda_{\min}(\Sigma+\mathrm{Diag}(\rho))\big] + \log\lambda_{\max}(X) - \lambda_{\min}(\Sigma+\mathrm{Diag}(\rho))\lambda_{\max}(X),$$

where the last inequality follows from the fact that for any $a > 0$,

$$\max_t\{\log t - at : t \ge 0\} = -1 - \log a. \qquad (6)$$

Hence, we obtain that for any $X\in S_f(I)$,

$$\log\lambda_{\max}(X) - \lambda_{\min}(\Sigma+\mathrm{Diag}(\rho))\lambda_{\max}(X) \ge f(I) - (n-1)\big[-1-\log\lambda_{\min}(\Sigma+\mathrm{Diag}(\rho))\big], \qquad (7)$$

which implies that there exists some $\beta(\rho) > 0$ such that $\lambda_{\max}(X) \le \beta(\rho)$ for all $X\in S_f(I)$. Thus, $S_f(I) \subseteq \{X\in S^n : 0 \preceq X \preceq \beta(\rho)I\}$. Further, using this result along with the definition of $f(\cdot)$, we easily observe that for any $X\in S_f(I)$,

$$\log\lambda_{\min}(X) = f(X) - \sum_{i=1}^{n-1}\log\lambda_i(X) + \langle\Sigma, X\rangle + \sum_{(i,j)\notin\Omega}\rho_{ij}|X_{ij}| \ge f(I) - (n-1)\log\beta(\rho) + \min_{0\preceq X\preceq\beta(\rho)I}\Big\{\langle\Sigma, X\rangle + \sum_{(i,j)\notin\Omega}\rho_{ij}|X_{ij}|\Big\}.$$

It follows that there exists some $\alpha(\rho) > 0$ such that $\lambda_{\min}(X) \ge \alpha(\rho)$ for all $X\in S_f(I)$. Hence, $S_f(I) \subseteq \{X\in S^n : \alpha(\rho)I \preceq X \preceq \beta(\rho)I\}$ is bounded, which, together with the fact that $f(\cdot)$ is continuous on the latter set, implies that $S_f(I)$ is closed. Therefore, problem (4) has at least one optimal solution. Further, observing that $f(\cdot)$ is strictly concave, we conclude that problem (4) has a unique optimal solution.

Similarly, we can show that the following result holds.

Proposition 2.2 Given any $\rho_{ij} \ge 0$ for $(i,j)\in\Omega$, problem (5) has a unique optimal solution $X^* \in S^n_{++}$.

Before presenting an algorithm framework for problem (4), we introduce some terminology for (4).

Definition 1 Let $\epsilon_o \ge 0$ and $\epsilon_c \ge 0$ be given. Let $f(\cdot)$ and $f^*$ denote the objective function and the optimal value of (4), respectively. $X\in S^n_+$ is an $(\epsilon_o,\epsilon_c)$-optimal solution of problem (4) if $f(X) \ge f^* - \epsilon_o$ and $\max_{(i,j)\in\Omega}|X_{ij}| \le \epsilon_c$.

Analogously, we can define an $\epsilon_o$-optimal solution for problem (5).

Given that our ultimate aim is to estimate a sparse inverse covariance matrix $X^* \succ 0$ that satisfies at least $X^*_{ij} = 0$, $\forall (i,j)\in\Omega$, and approximately maximizes the log-likelihood, we now briefly discuss how to obtain such an approximate solution $X^*$ from an $(\epsilon_o,\epsilon_c)$-optimal solution $\bar{X}^*$ of (4). Let us define $\tilde{X}^*\in S^n$ by letting $\tilde{X}^*_{ij} = \bar{X}^*_{ij}$, $\forall (i,j)\notin\Omega$, and $\tilde{X}^*_{ij} = 0$, $\forall (i,j)\in\Omega$. We then set $X^* := \tilde{X}^* + t^* I$, where

$$t^* = \arg\max\{\log\det(\tilde{X}^* + tI) - \langle\Sigma, \tilde{X}^* + tI\rangle : t \ge -\lambda_{\min}(\tilde{X}^*)\}.$$

It is not hard to see that $t^*$ can be easily found. We also observe that such an $X^*$ belongs to $S^n_{++}$, satisfies $X^*_{ij} = 0$, $\forall (i,j)\in\Omega$, and retains the same sparsity as $\tilde{X}^*$. In addition, by setting the log-likelihood value at $\tilde{X}^*$ to $-\infty$ if $\lambda_{\min}(\tilde{X}^*) \le 0$, we can easily see that the log-likelihood value at $X^*$ is at least as good as that at $\tilde{X}^*$. Thus, $X^*$ is a desirable estimate of the sparse inverse covariance, provided $\bar{X}^*$ is a good approximate solution to problem (4).
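Finding $t^*$ is indeed easy: after an eigendecomposition of $\tilde{X}^*$ the objective is a strictly concave scalar function of $t$, whose stationarity condition can be solved by a one-dimensional root find. A minimal sketch, assuming $\mathrm{Tr}(\Sigma) > 0$ (the function name is ours):

```python
import numpy as np
from scipy.optimize import brentq

def diagonal_shift(X_tilde, Sigma):
    """t* = argmax_t  log det(X_tilde + t I) - <Sigma, X_tilde + t I>.

    With mu = eigenvalues of X_tilde, the derivative in t is
    sum_i 1/(mu_i + t) - Tr(Sigma), which is strictly decreasing on the
    domain t > -min(mu), so a scalar root find suffices."""
    mu = np.linalg.eigvalsh(X_tilde)
    tr = np.trace(Sigma)
    g = lambda t: np.sum(1.0 / (mu + t)) - tr
    lo = -mu.min() + 1e-12      # just inside the domain; g(lo) > 0
    step = 1.0
    while g(lo + step) > 0:     # expand the bracket until the sign changes
        step *= 2.0
    return brentq(g, lo, lo + step)
```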
In the remainder of this paper, we concentrate on finding an $(\epsilon_o,\epsilon_c)$-optimal solution of problem (4) for any pair of positive $(\epsilon_o,\epsilon_c)$. We next present an algorithm framework for (4) based on an adaptive $l_1$ penalty approach.

Algorithm framework for general sparse inverse covariance selection (GSICS):

Let $\epsilon_o > 0$, $\epsilon_c > 0$ and $r_\rho > 1$ be given. Let $\rho^0_{ij} > 0$, $\forall (i,j)\in\Omega$, be given such that $\rho^0_{ij} = \rho^0_{ji}$, $\forall (i,j)\in\Omega$. Set $\rho_{ij} = \rho^0_{ij}$ for all $(i,j)\in\Omega$.

1) Find an $\epsilon_o$-optimal solution $X^{\epsilon_o}$ of problem (5).

2) If $\max_{(i,j)\in\Omega}|X^{\epsilon_o}_{ij}| \le \epsilon_c$, terminate. Otherwise, set $\rho_{ij} \leftarrow \rho_{ij} r_\rho$ for all $(i,j)\in\Omega$, and go to step 1).

end

Remark. To make the above framework complete, we need to choose suitable methods for solving problem (5) in step 1); we propose first-order methods for it in Subsection 2.2. In step 2) of the framework GSICS, other strategies are possible for updating the penalty parameters $\{\rho_{ij}\}_{(i,j)\in\Omega}$. For example, for any $(i,j)\in\Omega$, one can update $\rho_{ij}$ only if $|X^{\epsilon_o}_{ij}| > \epsilon_c$. But we observed in our experimentation that this strategy performs worse than the one described above. In addition, instead of using a common ratio $r_\rho$ for all $(i,j)\in\Omega$, one can associate with each $\rho_{ij}$ an individual ratio $r_{ij}$. Also, the ratio $r_\rho$ need not be fixed for all iterations; it can vary from iteration to iteration depending on the amount of violation incurred in $\max_{(i,j)\in\Omega}|X^{\epsilon_o}_{ij}| \le \epsilon_c$.
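A sketch of the GSICS outer loop, with the inner solver of step 1) left abstract (the function names are ours; the defaults $\epsilon_o = 0.1$, $\epsilon_c = 10^{-4}$ and $r_\rho = 2$ follow the experiments in Section 3):

```python
def gsics(Sigma, Omega, rho0, solve_penalized, eps_o=0.1, eps_c=1e-4, r_rho=2.0):
    """GSICS framework: repeatedly solve the penalized problem (5) and
    inflate the penalties on Omega until max_{(i,j) in Omega} |X_ij| <= eps_c.

    `rho0` is the full initial penalty matrix (moderate values on Omega,
    the fixed penalties elsewhere); Omega contains both (i, j) and (j, i).
    `solve_penalized(Sigma, rho, eps_o)` is a placeholder for any method
    returning an eps_o-optimal solution of (5), e.g. ASPG or ANS below."""
    rho = rho0.copy()
    while True:
        X = solve_penalized(Sigma, rho, eps_o)          # step 1)
        viol = max(abs(X[i, j]) for (i, j) in Omega) if Omega else 0.0
        if viol <= eps_c:                               # step 2): done
            return X
        for (i, j) in Omega:                            # inflate penalties
            rho[i, j] *= r_rho
```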
Before discussing the convergence of the framework GSICS, we first study the convergence of the $l_1$ penalty method for a general nonlinear programming (NLP) problem. Given a set $\emptyset \ne \mathcal{X} \subseteq \Re^n$ and functions $f : \mathcal{X}\to\Re$, $g : \mathcal{X}\to\Re^k$ and $h : \mathcal{X}\to\Re^l$, consider the NLP problem:

$$f^* = \sup_{x\in\mathcal{X}} f(x)\quad \text{s.t.}\ g(x) = 0,\ h(x) \le 0. \qquad (8)$$

We associate with the NLP problem (8) the following $l_1$ penalty function:

$$P(x;\lambda,\mu) := f(x) - \lambda^T|g(x)| - \mu^T h^+(x), \qquad (9)$$

where $\lambda\in\Re^k_+$, $\mu\in\Re^l_+$ and $(h^+(x))_i = \max\{0, h_i(x)\}$ for $i = 1,\ldots,l$. We now establish a convergence result for the $l_1$ penalty method for the NLP problem (8) under an assumption on $f(x)$.

Proposition 2.3 Let $\epsilon_o > 0$ and $\epsilon_c > 0$ be given. Assume that there exists some $\bar{f}\in\Re$ such that $f(x) \le \bar{f}$ for all $x\in\mathcal{X}$. Let $x^{\epsilon_o}_{\lambda,\mu}\in\mathcal{X}$ be an $\epsilon_o$-optimal solution of the problem

$$\sup\{P(x;\lambda,\mu) : x\in\mathcal{X}\} \qquad (10)$$

for $\lambda\in\Re^k_+$ and $\mu\in\Re^l_+$, and let $v_{\lambda,\mu} := \min\{\min_i\lambda_i, \min_i\mu_i\}$. Then $f(x^{\epsilon_o}_{\lambda,\mu}) \ge f^* - \epsilon_o$, and, moreover, $\|(g(x^{\epsilon_o}_{\lambda,\mu}); h^+(x^{\epsilon_o}_{\lambda,\mu}))\|_\infty \le \epsilon_c$ holds whenever $v_{\lambda,\mu} \ge (\bar{f} - f^* + \epsilon_o)/\epsilon_c$, where $f^*$ is the optimal value of the NLP problem (8).

Proof. In view of the assumption that $f(x)$ is bounded above on $\mathcal{X}$, we clearly see that $f^*$ is finite. Let $f^*_{\lambda,\mu}$ denote the optimal value of problem (10). We easily observe that $f^*_{\lambda,\mu} \ge f^*$. Using this relation, (9) and the fact that $x^{\epsilon_o}_{\lambda,\mu}$ is an $\epsilon_o$-optimal solution of (10), we have

$$f(x^{\epsilon_o}_{\lambda,\mu}) \ge P(x^{\epsilon_o}_{\lambda,\mu};\lambda,\mu) \ge f^*_{\lambda,\mu} - \epsilon_o \ge f^* - \epsilon_o, \qquad (11)$$

and hence the first statement holds. We now prove the second statement. Using (9), (11) and the definition of $v_{\lambda,\mu}$, we have

$$f(x^{\epsilon_o}_{\lambda,\mu}) - v_{\lambda,\mu}\,\big\|\big(g(x^{\epsilon_o}_{\lambda,\mu}); h^+(x^{\epsilon_o}_{\lambda,\mu})\big)\big\|_\infty \;\ge\; f(x^{\epsilon_o}_{\lambda,\mu}) - v_{\lambda,\mu}\,\big\|\big(g(x^{\epsilon_o}_{\lambda,\mu}); h^+(x^{\epsilon_o}_{\lambda,\mu})\big)\big\|_1 \;\ge\; P(x^{\epsilon_o}_{\lambda,\mu};\lambda,\mu) \;\ge\; f^* - \epsilon_o. \qquad (12)$$

Further, from the assumption, we know $f(x^{\epsilon_o}_{\lambda,\mu}) \le \bar{f}$ since $x^{\epsilon_o}_{\lambda,\mu}\in\mathcal{X}$. This together with (12) immediately implies that the second statement holds.

We are now ready to establish a convergence result for the framework GSICS.

Theorem 2.4 Let $\epsilon_o > 0$ and $\epsilon_c > 0$ be given. Suppose that in step 1) of the framework GSICS, an $\epsilon_o$-optimal solution $X^{\epsilon_o}$ of problem (5) is obtained by some method. Then the framework GSICS generates an $(\epsilon_o,\epsilon_c)$-optimal solution to problem (4) in a finite number of outer iterations, or equivalently, a finite number of updates of the penalty parameters $\{\rho_{ij}\}_{(i,j)\in\Omega}$.

Proof. Invoking that $\Sigma + \mathrm{Diag}(\rho) \succ 0$ (see Assumption 1), we see that for any $X\in S^n_+$,

$$\log\det X - \langle\Sigma, X\rangle - \sum_{(i,j)\notin\Omega}\rho_{ij}|X_{ij}| \le \log\det X - \langle\Sigma+\mathrm{Diag}(\rho), X\rangle \le \sup\{\log\det Y - \langle\Sigma+\mathrm{Diag}(\rho), Y\rangle : Y \succ 0\} < \infty,$$

where the last inequality follows from the fact that the above maximization problem achieves its optimal value at $Y = (\Sigma+\mathrm{Diag}(\rho))^{-1} \succ 0$. This observation together with Proposition 2.3 immediately yields the conclusion.
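For reference, the penalty function (9) and the penalty level required by Proposition 2.3 are straightforward to express in code; a minimal sketch (names are ours, and $f$, $g$, $h$ are arbitrary callables):

```python
import numpy as np

def l1_penalty(x, f, g, h, lam, mu):
    """The l1 penalty function (9):
    P(x; lam, mu) = f(x) - lam^T |g(x)| - mu^T max(h(x), 0)."""
    return f(x) - lam @ np.abs(g(x)) - mu @ np.maximum(h(x), 0.0)

def penalty_threshold(f_bar, f_star, eps_o, eps_c):
    """Proposition 2.3: any penalty level v >= (f_bar - f* + eps_o)/eps_c
    guarantees an eps_o-maximizer of P violates the constraints by <= eps_c."""
    return (f_bar - f_star + eps_o) / eps_c
```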
2.2 Adaptive first-order methods for problem (5)

In this subsection, we discuss some suitable first-order methods for solving problem (5), which appears in step 1) of the algorithm framework GSICS. As seen from Proposition 2.2, problem (5) has a unique optimal solution. We next provide some bounds on it.

Proposition 2.5 Let $f_\rho(\cdot)$ and $X^*_\rho$ denote the objective function and the unique optimal solution of problem (5), respectively. Let $\vartheta$ be defined as

$$\vartheta := \max\big\{f_\rho((\Sigma+\mathrm{Diag}(\rho))^{-1}),\ \theta\big\} - (n-1)\big[-1-\log\lambda_{\min}(\Sigma+\mathrm{Diag}(\rho))\big], \qquad (13)$$

where $\theta := n(-1 - \log\mathrm{Tr}(\Sigma+\rho) + \log n)$. Then $\alpha_\rho I \preceq X^*_\rho \preceq \beta_\rho I$, where $\alpha_\rho := 1/(\|\Sigma\| + \|\rho\|)$ and $\beta_\rho$ is the largest positive root of the equation

$$\log t - \lambda_{\min}(\Sigma+\mathrm{Diag}(\rho))\,t - \vartheta = 0.$$

Proof. Let

$$\mathcal{U} := \{U\in S^n : |U_{ij}| \le 1,\ \forall i,j\}, \qquad (14)$$

and

$$\phi(X,U) := \log\det X - \langle\Sigma + \rho * U, X\rangle,\quad \forall (X,U)\in S^n_{++}\times\mathcal{U}. \qquad (15)$$

Since $X^*_\rho\in S^n_{++}$ is the optimal solution of problem (5), it can easily be shown that there exists some $U^*\in\mathcal{U}$ such that $(X^*_\rho, U^*)$ is a saddle point of $\phi(\cdot,\cdot)$ on $S^n_{++}\times\mathcal{U}$, and hence $X^*_\rho = \arg\max_{X\in S^n_{++}}\phi(X, U^*)$. This relation along with (15) immediately yields $X^*_\rho(\Sigma + \rho * U^*) = I$. Hence, we have

$$X^*_\rho = (\Sigma + \rho * U^*)^{-1} \succeq \frac{1}{\|\Sigma\| + \|\rho * U^*\|}\,I,$$

which, together with (14) and the fact that $U^*\in\mathcal{U}$, implies that $X^*_\rho \succeq \frac{1}{\|\Sigma\|+\|\rho\|}\,I$. Thus, $X^*_\rho \succeq \alpha_\rho I$ as desired. We next bound $X^*_\rho$ from above. Let $f^*_\rho$ denote the optimal value of problem (5). In view of the definition of $f_\rho(\cdot)$ and (6), we have

$$f^*_\rho \ge \max_{t>0} f_\rho(tI) = \max_{t>0}\ n\log t - t\,\mathrm{Tr}(\Sigma+\rho) = n(-1 - \log\mathrm{Tr}(\Sigma+\rho) + \log n) =: \theta.$$

Thus, $f^*_\rho \ge \max\{f_\rho((\Sigma+\mathrm{Diag}(\rho))^{-1}),\ \theta\}$. Using this result and following a procedure similar to the derivation of (7), we can show that

$$\log\lambda_{\max}(X^*_\rho) - \lambda_{\min}(\Sigma+\mathrm{Diag}(\rho))\lambda_{\max}(X^*_\rho) \ge \vartheta,$$

where $\vartheta$ is given in (13), and hence the statement $X^*_\rho \preceq \beta_\rho I$ immediately follows.

In view of Proposition 2.5, we see that problem (5) is equivalent to the following problem

$$\max_{\alpha_\rho I \preceq X \preceq \beta_\rho I}\ \log\det X - \langle\Sigma, X\rangle - \sum_{i,j}\rho_{ij}|X_{ij}|, \qquad (16)$$

where $\alpha_\rho$ and $\beta_\rho$ are defined in Proposition 2.5. We further observe that problem (16) can be rewritten as

$$\max_{X\in\mathcal{X}_\rho}\ \Big\{f_\rho(X) := \min_{U\in\mathcal{U}}\phi(X,U)\Big\}, \qquad (17)$$

where $\mathcal{U}$ and $\phi(\cdot,\cdot)$ are given in (14) and (15), respectively, and $\mathcal{X}_\rho$ is defined as

$$\mathcal{X}_\rho := \{X\in S^n : \alpha_\rho I \preceq X \preceq \beta_\rho I\}. \qquad (18)$$

Observing that $\phi(X,U) : \mathcal{X}_\rho\times\mathcal{U}\to\Re$ is a smooth function which is strictly concave in $X\in\mathcal{X}_\rho$ for every fixed $U\in\mathcal{U}$, and convex in $U\in\mathcal{U}$ for every fixed $X\in\mathcal{X}_\rho$, we conclude that (i) problem (17) and its dual, that is,

$$\min_{U\in\mathcal{U}}\ \Big\{g_\rho(U) := \max_{X\in\mathcal{X}_\rho}\phi(X,U)\Big\}, \qquad (19)$$

are both solvable and have the same optimal value; and (ii) the function $g_\rho(\cdot)$ is convex and differentiable, and its gradient is given by $\nabla g_\rho(U) = \nabla_U\phi(X(U), U)$ for all $U\in\mathcal{U}$, where

$$X(U) := \arg\max_{X\in\mathcal{X}_\rho}\phi(X,U). \qquad (20)$$

The following result shows that an approximate solution of problem (17) (or equivalently, (5)) can be obtained by solving the smooth convex problem (19).

Proposition 2.6 Let $X^*_\rho$ be the unique optimal solution of problem (17), and let $f^*_\rho$ be the optimal value of problems (17) and (19). Suppose that the sequence $\{U^k\}_{k=0}^\infty\subseteq\mathcal{U}$ is such that $g_\rho(U^k)\to f^*_\rho$ as $k\to\infty$. Then $X(U^k)\to X^*_\rho$ and $g_\rho(U^k) - f_\rho(X(U^k))\to 0$ as $k\to\infty$, where $X(\cdot)$ is defined in (20).

Proof. The proof is similar to that of Theorem 2.4 of Lu [8].

From Proposition 2.6, we see that problem (5) can be solved simultaneously while solving problem (19). Indeed, suppose that $\{U^k\}_{k=0}^\infty\subseteq\mathcal{U}$ is a sequence of approximate solutions generated by some method for solving (19). It follows from Proposition 2.6 that given any $\epsilon_o > 0$, there exists some iterate $U^k$ such that $g_\rho(U^k) - f_\rho(X(U^k)) \le \epsilon_o$. Then it is clear that $X(U^k)$ is an $\epsilon_o$-optimal solution of (17) and hence of (5). We next discuss two first-order methods, namely, the adaptive spectral projected gradient method and the adaptive Nesterov's smooth method, for problems (19) and (17) (or equivalently, (5)).

2.2.1 Adaptive spectral projected gradient method

In this subsection, we propose an adaptive spectral projected gradient (ASPG) method for solving problems (19) and (17) (or equivalently, (5)). Spectral projected gradient (SPG) methods were developed by Birgin et al. [3] for minimizing a smooth function over a closed convex set; they integrate the nonmonotone line search technique proposed by Grippo et al. [7] and the Barzilai-Borwein gradient method [1] into classical projected gradient methods (see [2]). We next discuss one of them (namely, the SPG2 method of [3]) for solving the problem

$$\min\{g_{\rho,\beta}(U) : U\in\mathcal{U}\} \qquad (21)$$

and its dual

$$\max\{f_\rho(X) : \alpha_\rho I \preceq X \preceq \beta I\} \qquad (22)$$

for some $\beta \ge \alpha_\rho$, where

$$g_{\rho,\beta}(U) := \max_{\alpha_\rho I\preceq X\preceq\beta I}\phi(X,U), \qquad (23)$$

and $\mathcal{U}$, $\phi(\cdot,\cdot)$, $f_\rho(\cdot)$ and $\alpha_\rho$ are defined in (14), (15), (17) and Proposition 2.5, respectively. We denote by $X_\beta(U)$ the unique optimal solution of problem (23). In view of (15), it is not hard to observe that $g_{\rho,\beta}(U)$ is differentiable and, moreover, that $X_\beta(U)$ and $\nabla g_{\rho,\beta}(U)$ have closed-form expressions for any $U\in\mathcal{U}$ (see (30) of [8]). In addition, since $\mathcal{U}$ is a simple set, the projection of a point onto $\mathcal{U}$ can be carried out cheaply. Thus, the SPG method [3] is suitable for solving problem (21).
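The closed form alluded to above can be reconstructed from (15): writing $\Sigma + \rho * U = Q\,\mathrm{diag}(d)\,Q^T$, the maximization in (23) separates across eigenvalues into $\max_{\alpha_\rho\le t\le\beta}\,\log t - d_i t$, whose solution is $1/d_i$ clipped to $[\alpha_\rho, \beta]$ (with $t = \beta$ when $d_i \le 0$). A sketch of this computation and the resulting gradient; this is our reconstruction, not a verbatim transcription of (30) in [8]:

```python
import numpy as np

def X_beta(U, Sigma, rho, alpha, beta):
    """Maximizer of phi(X, U) = log det X - <Sigma + rho*U, X> over
    alpha*I <= X <= beta*I, via eigenvalue-wise clipping."""
    d, Q = np.linalg.eigh(Sigma + rho * U)
    t = np.where(d > 0,
                 np.clip(1.0 / np.where(d > 0, d, 1.0), alpha, beta),
                 beta)                       # nonpositive d_i => t = beta
    return (Q * t) @ Q.T                     # Q diag(t) Q^T

def grad_g(U, Sigma, rho, alpha, beta):
    """Gradient of g_{rho,beta}(U) = max_X phi(X, U): phi is linear in U
    with coefficient -rho * X, so the gradient is -rho * X_beta(U)."""
    return -rho * X_beta(U, Sigma, rho, alpha, beta)
```

With $\beta = \beta_\rho$, the same routine evaluates $X(U)$ in (20).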
For ease of subsequent presentation, we now describe the SPG method [3] for (21) in detail. The following notation will be used throughout this subsection. Given a sequence $\{U^k\}_{k=0}^\infty\subseteq\mathcal{U}$ and an integer $M \ge 1$, we define

$$g^M_k := \max\{g_{\rho,\beta}(U^{k-j}) : 0 \le j \le \min\{k, M-1\}\}.$$

Also, let $P_{\mathcal{U}} : \Re^{n\times n}\to\mathcal{U}$ be defined as

$$P_{\mathcal{U}}(U) := \arg\min\{\|\hat{U} - U\|_F : \hat{U}\in\mathcal{U}\},\quad \forall U\in\Re^{n\times n}.$$

The SPG method for problems (21) and (22):

Let $\epsilon_o > 0$, $\gamma\in(0,1)$, $0 < \sigma_1 < \sigma_2 < 1$ and $0 < \alpha_{\min} < \alpha_{\max} < \infty$ be given. Let $M \ge 1$ be an integer. Choose $U^0\in\mathcal{U}$, $\alpha_0\in[\alpha_{\min}, \alpha_{\max}]$ and set $k = 0$.

1) If $g_{\rho,\beta}(U^k) - f_\rho(X_\beta(U^k)) \le \epsilon_o$, terminate.

2) Compute $d^k = P_{\mathcal{U}}(U^k - \alpha_k\nabla g_{\rho,\beta}(U^k)) - U^k$. Set $\lambda \leftarrow 1$.

2a) Set $U^+ = U^k + \lambda d^k$.

2b) If $g_{\rho,\beta}(U^+) \le g^M_k + \gamma\lambda\langle d^k, \nabla g_{\rho,\beta}(U^k)\rangle$, set $U^{k+1} = U^+$, $s^k = U^{k+1} - U^k$, $y^k = \nabla g_{\rho,\beta}(U^{k+1}) - \nabla g_{\rho,\beta}(U^k)$. Otherwise, choose $\lambda^+\in[\sigma_1\lambda, \sigma_2\lambda]$, set $\lambda\leftarrow\lambda^+$ and go to step 2a).

2c) Compute $b_k = \langle s^k, y^k\rangle$. If $b_k \le 0$, set $\alpha_{k+1} = \alpha_{\max}$. Otherwise, compute $a_k = \langle s^k, s^k\rangle$ and set $\alpha_{k+1} = \min\{\alpha_{\max}, \max\{\alpha_{\min}, a_k/b_k\}\}$.

3) Set $k\leftarrow k+1$, and go to step 1).

end

We next establish a convergence result for the SPG method applied to problems (21) and (22).

Theorem 2.7 Let $\epsilon_o > 0$ be given. The SPG method generates a pair of $\epsilon_o$-optimal solutions $(U^k, X_\beta(U^k))$ to problems (21) and (22) in a finite number of iterations.

Proof. Suppose by contradiction that the SPG method does not terminate. Then it generates a sequence $\{U^k\}_{k=0}^\infty\subseteq\mathcal{U}$ satisfying $g_{\rho,\beta}(U^k) - f_\rho(X_\beta(U^k)) > \epsilon_o$. Note that $g_{\rho,\beta}(\cdot)$ is convex, which together with Theorem 2.4 of [3] implies that any accumulation point of $\{U^k\}_{k=0}^\infty$ is an optimal solution of problem (21). By the continuity of $g_{\rho,\beta}(\cdot)$, it further implies that any accumulation point of $\{g_{\rho,\beta}(U^k)\}_{k=0}^\infty$ is the optimal value $f^*_\rho$ of (21). Using this observation and the fact that $\{g_{\rho,\beta}(U^k)\}_{k=0}^\infty$ is bounded, we conclude that $g_{\rho,\beta}(U^k)\to f^*_\rho$ as $k\to\infty$. Further, in view of Proposition 2.6 with $\beta_\rho$ replaced by $\beta$ and $g_\rho(\cdot)$ by $g_{\rho,\beta}(\cdot)$, we have $g_{\rho,\beta}(U^k) - f_\rho(X_\beta(U^k))\to 0$ as $k\to\infty$, and we arrive at a contradiction. Therefore, the conclusion of this theorem holds.
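A compact sketch of the SPG loop above (for brevity, $\lambda^+$ is fixed to $\sigma\lambda$ with a single $\sigma\in(\sigma_1,\sigma_2)$; the callables g, grad_g and f_dual, which evaluate $g_{\rho,\beta}$, $\nabla g_{\rho,\beta}$ and $f_\rho(X_\beta(\cdot))$, are assumed given, e.g. built from the closed forms sketched earlier):

```python
import numpy as np

def proj_U(U):
    """Projection onto U = {U : |U_ij| <= 1}: entrywise clipping."""
    return np.clip(U, -1.0, 1.0)

def spg(U0, g, grad_g, f_dual, eps_o=0.1, gamma=1e-4, sigma=0.5,
        a_min=1e-15, a_max=1e15, M=8, max_iter=5000):
    """Nonmonotone SPG method (SPG2 of [3]) for problem (21), sketched."""
    U, alpha = U0, 1.0
    hist = [g(U0)]                               # recent values for g_k^M
    for _ in range(max_iter):
        if hist[-1] - f_dual(U) <= eps_o:        # step 1): duality gap test
            return U
        gU = grad_g(U)
        d = proj_U(U - alpha * gU) - U           # step 2): projected direction
        lam, ref = 1.0, max(hist[-M:])           # nonmonotone reference g_k^M
        while g(U + lam * d) > ref + gamma * lam * np.sum(d * gU):
            lam *= sigma                         # step 2b): backtrack
        U_new = U + lam * d
        s, y = U_new - U, grad_g(U_new) - gU
        b = np.sum(s * y)                        # step 2c): BB step length
        alpha = a_max if b <= 0 else min(a_max, max(a_min, np.sum(s * s) / b))
        U = U_new
        hist.append(g(U))
    return U
```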
Based on the above discussion, we see that the SPG method can be directly applied to find a pair of $\epsilon_o$-optimal solutions to problems (19) and (17) (or equivalently, (5)) by setting $\beta = \beta_\rho$, where $\beta_\rho$ is given in Proposition 2.5. It may, however, converge very slowly when $\beta_\rho$ is large. Indeed, similarly as in [8], one can show that $\nabla g_{\rho,\beta}(U)$ is Lipschitz continuous on $\mathcal{U}$ with constant $L = \beta^2(\max_{i,j}\rho_{ij})^2$ with respect to the Frobenius norm. Let $\alpha_k$, $b_k$ and $d^k$ be defined as above. Since $g_{\rho,\beta}(\cdot)$ is convex, we have $b_k \ge 0$; in fact, we observed that it is almost always positive. In addition, $\alpha_{\min}$ and $\alpha_{\max}$ are usually set to $10^{-30}$ and $10^{30}$, respectively. Thus, for the SPG method we typically have

$$\alpha_{k+1} = \frac{\|U^{k+1}-U^k\|_F^2}{\langle U^{k+1}-U^k,\ \nabla g_{\rho,\beta}(U^{k+1}) - \nabla g_{\rho,\beta}(U^k)\rangle} \ \ge\ \frac{1}{L} = \frac{1}{\beta^2(\max_{i,j}\rho_{ij})^2}.$$

Recall that $\beta_\rho$ is an upper bound on $\lambda_{\max}(X^*_\rho)$, where $X^*_\rho$ is the optimal solution of (5), and that it is typically overly large. When $\beta = \beta_\rho$, we see from the above that $\alpha_k$ can be very small, and so is $U^{k+1} - U^k$, since

$$\|U^{k+1}-U^k\|_F \le \|d^k\|_F = \|P_{\mathcal{U}}(U^k - \alpha_k\nabla g_{\rho,\beta}(U^k)) - U^k\|_F \le \alpha_k\|\nabla g_{\rho,\beta}(U^k)\|_F.$$

Therefore, the SPG method may converge very slowly when applied to problem (19) directly. To alleviate this computational difficulty, we next propose an adaptive SPG (ASPG) method for problems (19) and (17) (or equivalently, (5)), which solves a sequence of problems (21) with $\beta = \beta_0, \beta_1, \ldots, \beta_m$ for some $\{\beta_k\}_{k=0}^m$ approaching $\lambda_{\max}(X^*_\rho)$ monotonically from below.

The adaptive SPG (ASPG) method for problems (17) and (19):

Let $\epsilon_o > 0$, $\beta_0 \ll \beta_\rho$ and $r_\beta > 1$ be given. Choose $U^0\in\mathcal{U}$ and set $k = 0$.

1) Set $\beta\leftarrow\beta_k$. Apply the SPG method to find a pair of $\epsilon_o$-optimal solutions $(\hat{U}^k, X_\beta(\hat{U}^k))$ to problems (21) and (22), starting from $U^0$.

2) If $\beta = \beta_\rho$ or $\lambda_{\max}(X_\beta(\hat{U}^k)) < \beta$, terminate.

3) Set $U^0\leftarrow\hat{U}^k$, $\beta_{k+1} = \min\{\beta r_\beta, \beta_\rho\}$, $k\leftarrow k+1$, and go to step 1).

end

We now establish a convergence result for the ASPG method for solving problems (19) and (17) (or equivalently, (5)).

Theorem 2.8 Let $\epsilon_o > 0$ be given. The ASPG method generates a pair of $\epsilon_o$-optimal solutions to problems (19) and (17) (or equivalently, (5)) in a finite number of total (inner) iterations.

Proof. First, we clearly see that $\beta$ is updated only a finite number of times. Using this observation and Theorem 2.7, we conclude that the ASPG method terminates in a finite number of total (inner) iterations. Now, suppose that it terminates at $\beta = \beta_k$ for some $k$. We claim that $(\hat{U}^k, X_\beta(\hat{U}^k))$ is a pair of $\epsilon_o$-optimal solutions to problems (19) and (17) (or equivalently, (5)). Indeed, we clearly have $\beta = \beta_\rho$ or $\lambda_{\max}(X_\beta(\hat{U}^k)) < \beta$, which together with the definitions of $g_\rho(\cdot)$ and $g_{\rho,\beta}(\cdot)$ (see (19) and (21)) implies that $g_\rho(\hat{U}^k) = g_{\rho,\beta}(\hat{U}^k)$. Thus, we obtain that

$$g_\rho(\hat{U}^k) - f_\rho(X_\beta(\hat{U}^k)) = g_{\rho,\beta}(\hat{U}^k) - f_\rho(X_\beta(\hat{U}^k)) \le \epsilon_o,$$

which, along with the fact that $X_\beta(\hat{U}^k)\in\mathcal{X}_\rho$, implies that $(\hat{U}^k, X_\beta(\hat{U}^k))$ is a pair of $\epsilon_o$-optimal solutions to problems (19) and (17).
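The ASPG outer loop simply wraps the SPG method; a sketch (spg_solver stands for the SPG method above, returning the pair $(\hat{U}, X_\beta(\hat{U}))$; the defaults $\beta_0 = 1$ and $r_\beta = 10$ follow Section 3):

```python
import numpy as np

def aspg(U0, beta_rho, spg_solver, beta0=1.0, r_beta=10.0):
    """ASPG method: solve (21) for bounds beta_0 < beta_1 < ... that
    approach lambda_max(X*_rho) from below, warm-starting each solve."""
    U, beta = U0, min(beta0, beta_rho)
    while True:
        U, X = spg_solver(U, beta)               # step 1): SPG from warm start
        if beta == beta_rho or np.linalg.eigvalsh(X)[-1] < beta:
            return U, X                          # step 2): bound is inactive
        beta = min(beta * r_beta, beta_rho)      # step 3): enlarge beta
```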
As discussed above, the ASPG method is able to find a pair of $\epsilon_o$-optimal solutions to problems (5) and (19). We now show how this method can be extended to find an $(\epsilon_o,\epsilon_c)$-optimal solution to problem (4). Recall from the framework GSICS (see Subsection 2.1) that in order to obtain an $(\epsilon_o,\epsilon_c)$-optimal solution to problem (4), we need to find an $\epsilon_o$-optimal solution of problem (5) for a sequence of penalty parameters $\{\rho^k\}_{k=1}^m$ which satisfy, for $k = 1,\ldots,m$,

$$\rho^k_{ij} = \rho_{ij},\ \forall (i,j)\notin\Omega \quad\text{and}\quad \rho^k_{ij} = \rho^0_{ij}\,r_\rho^{k-1},\ \forall (i,j)\in\Omega,$$

for some $r_\rho > 1$ and $\rho^0_{ij} > 0$, $\forall (i,j)\in\Omega$. Suppose that a pair of $\epsilon_o$-optimal solutions $(X_{\beta_k}(\hat{U}^k), \hat{U}^k)$ of problems (5) and (19) with $\rho = \rho^k$ has already been found by the ASPG method for some $\beta_k\in[\alpha_{\rho^k}, \beta_{\rho^k}]$. Then we choose the initial $U^0$ and $\beta_0$ for the ASPG method, when applied to solve problems (5) and (19) with $\rho = \rho^{k+1}$, as follows:

$$(U^0)_{ij} = \begin{cases}(\hat{U}^k)_{ij}/r_\rho, & \text{if } (i,j)\in\Omega;\\ (\hat{U}^k)_{ij}, & \text{otherwise},\end{cases}\qquad \beta_0 = \max\big\{\alpha_{\rho^{k+1}},\ \lambda_{\max}(X_{\beta_k}(\hat{U}^k))\big\}. \qquad (24)$$

We next provide some interpretation of this choice of $U^0$ and $\beta_0$. Since $\hat{U}^k\in\mathcal{U}$ and $r_\rho > 1$, we easily see that $U^0\in\mathcal{U}$. In addition, using the definition of $\beta_\rho$ (see Proposition 2.5) and the fact that $\mathrm{Diag}(\rho^{k+1}) = \mathrm{Diag}(\rho^k)$, we observe that $\beta_{\rho^{k+1}} = \beta_{\rho^k}$, and hence $\beta_0\in[\alpha_{\rho^{k+1}}, \beta_{\rho^{k+1}}]$. Let $f^*_\rho$ denote the optimal value of problem (5) for any given $\rho$. Clearly, we can observe from the ASPG method that either $\lambda_{\max}(X_{\beta_k}(\hat{U}^k)) < \beta_k < \beta_{\rho^k}$ or $\lambda_{\max}(X_{\beta_k}(\hat{U}^k)) \le \beta_k = \beta_{\rho^k}$ holds, which together with (19) and (23) implies that

$$g_{\rho^k,\beta_k}(\hat{U}^k) = g_{\rho^k}(\hat{U}^k) \in [f^*_{\rho^k},\ f^*_{\rho^k} + \epsilon_o]. \qquad (25)$$

Typically, $\lambda_{\max}(X_{\beta_k}(\hat{U}^k)) \gg \alpha_{\rho^{k+1}}$, and hence $\beta_0 = \lambda_{\max}(X_{\beta_k}(\hat{U}^k)) \le \beta_k$ generally holds. Also, usually $\alpha_{\rho^{k+1}} \approx \alpha_{\rho^k} \approx 0$. Using these relations along with (25), (19) and (23), we further observe that

$$f^*_{\rho^{k+1}} \le g_{\rho^{k+1}}(U^0) \approx g_{\rho^k}(\hat{U}^k) \le f^*_{\rho^k} + \epsilon_o,$$
$$f^*_{\rho^{k+1}} \le g_{\rho^{k+1},\beta_0}(U^0) \approx g_{\rho^k,\beta_0}(U^0) \le g_{\rho^k,\beta_k}(\hat{U}^k) \le f^*_{\rho^k} + \epsilon_o.$$

It follows that when $f^*_{\rho^{k+1}}$ is close to $f^*_{\rho^k}$, $U^0$ is nearly an $\epsilon_o$-optimal solution of problems (19) and (23) with $\rho = \rho^{k+1}$ and $\beta = \beta_0$. Therefore, we expect that with the above choice of $U^0$ and $\beta_0$, the ASPG method can solve problems (5) and (19) with $\rho = \rho^{k+1}$ rapidly when $\rho^{k+1}$ is close to $\rho^k$.
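In code, the warm start (24) is a one-liner per entry; a sketch (names are ours):

```python
import numpy as np

def warm_start(U_hat, X_prev, Omega, r_rho, alpha_next):
    """Warm start (24) when moving from rho^k to rho^{k+1} = r_rho * rho^k
    on Omega: shrink the dual multipliers on Omega so that rho * U (and
    hence the dual objective) is essentially unchanged, and start beta at
    the last spectral estimate."""
    U0 = U_hat.copy()
    for (i, j) in Omega:
        U0[i, j] = U_hat[i, j] / r_rho
    beta0 = max(alpha_next, np.linalg.eigvalsh(X_prev)[-1])
    return U0, beta0
```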
2.2.2 Adaptive Nesterov's smooth method

In this subsection, we propose an adaptive Nesterov's smooth (ANS) method for solving problems (19) and (17) (or equivalently, (5)). Recently, Lu [8] studied Nesterov's smooth method [9, 10] for solving a special class of problems (19) and (17) (or equivalently, (5)) in which $\rho$ is a positive multiple of $ee^T$. He showed that an $\epsilon_o$-optimal solution to problems (19) and (17) can be found in at most $\sqrt{2}\,\beta_\rho(\max_{i,j}\rho_{ij})\max_{U\in\mathcal{U}}\|U - U^0\|_F/\sqrt{\epsilon_o}$ iterations of Nesterov's smooth method for some initial point $U^0\in\mathcal{U}$ (see p. 12 of [8] for details). Given that $\beta_\rho$ is an estimate, and typically an overestimate, of $\lambda_{\max}(X^*_\rho)$, where $X^*_\rho$ is the unique optimal solution of problem (5), this iteration complexity can be exceedingly large, and Nesterov's smooth method generally converges extremely slowly. Lu [8] further proposed an adaptive Nesterov's smooth (ANS) method for solving problems (19) and (17) (see p. 15 of [8]). In his method, $\lambda_{\max}(X^*_\rho)$ is estimated by $\lambda_{\max}(X(U^k))$ and adaptively adjusted based on the change of $\lambda_{\max}(X(U^k))$ as the algorithm progresses, where $U^k$ is an approximate solution of problem (19). As a result, his method provides an asymptotically tight estimate of $\lambda_{\max}(X^*_\rho)$, and it has an asymptotically optimal iteration complexity.

We now extend the ANS method [8] to problems (19) and (17) (or equivalently, (5)) with a general $\rho$. Recall from Subsection 2.2.1 that $\nabla g_\rho(U)$ is Lipschitz continuous on $\mathcal{U}$ with constant $L = \beta_\rho^2(\max_{i,j}\rho_{ij})^2$ with respect to the Frobenius norm. It is then straightforward to extend the ANS method [8] to problems (19) and (5) for a general $\rho$ by replacing the corresponding Lipschitz constants with the ones computed according to this formula. For ease of reference, we provide the details of the ANS method for problems (19) and (17) (or equivalently, (5)) below. Throughout the remainder of this section, we assume that $\alpha_\rho$, $\beta_\rho$, $g_{\rho,\beta}(\cdot)$ and $X_\beta(\cdot)$ are given in Proposition 2.5 and Subsection 2.2.1, respectively. We now introduce a definition that will be used subsequently.

Definition 2 Given any $U\in\mathcal{U}$ and $\beta\in[\alpha_\rho, \beta_\rho]$, $X_\beta(U)$ is called "active" if $\lambda_{\max}(X_\beta(U)) = \beta$ and $\beta < \beta_\rho$; otherwise it is called "inactive".

We are now ready to present the ANS method [8] for problems (19) and (17).

The ANS method for problems (17) and (19):

Let $\epsilon > 0$, $\varsigma_1, \varsigma_2 > 1$ and $\varsigma_3\in(0,1)$ be given. Let $\rho_{\max} = \max_{i,j}\rho_{ij}$. Choose $U^0\in\mathcal{U}$ and $\beta\in[\alpha_\rho, \beta_\rho]$. Set $L = \beta^2\rho_{\max}^2$, $\sigma = 1$, and $k = 0$.

1) Compute $X_\beta(U^k)$.

1a) If $X_\beta(U^k)$ is active, find the smallest $s\in Z_+$ such that $X_{\bar\beta}(U^k)$ is inactive, where $\bar\beta = \min\{\varsigma_1^s\beta, \beta_\rho\}$. Set $k = 0$, $U^0 = U^k$, $\beta = \bar\beta$, $L = \beta^2\rho_{\max}^2$, and go to step 2).

1b) If $X_\beta(U^k)$ is inactive and $\lambda_{\max}(X_\beta(U^k)) \le \varsigma_3\beta$, set $k = 0$, $U^0 = U^k$, $\beta = \max\{\min\{\varsigma_2\lambda_{\max}(X_\beta(U^k)), \beta_\rho\}, \alpha_\rho\}$, and $L = \beta^2\rho_{\max}^2$.

2) If $g_{\rho,\beta}(U^k) - f_\rho(X_\beta(U^k)) \le \epsilon$, terminate. Otherwise, compute $\nabla g_{\rho,\beta}(U^k)$.

3) Find $U^{sd}_k = \arg\min\big\{\langle\nabla g_{\rho,\beta}(U^k), U - U^k\rangle + \tfrac{L}{2}\|U - U^k\|_F^2 : U\in\mathcal{U}\big\}$.

4) Find $U^{ag}_k = \arg\min\big\{\tfrac{L}{2\sigma}\|U - U^0\|_F^2 + \sum_{i=0}^k \tfrac{i+1}{2}\big[g_{\rho,\beta}(U^i) + \langle\nabla g_{\rho,\beta}(U^i), U - U^i\rangle\big] : U\in\mathcal{U}\big\}$.

5) Set $U^{k+1} = \tfrac{2}{k+3}U^{ag}_k + \tfrac{k+1}{k+3}U^{sd}_k$.

6) Set $k\leftarrow k+1$, and go to step 1).

end

Similarly as for the ASPG method, we can easily extend the ANS method to find an $(\epsilon_o,\epsilon_c)$-optimal solution to problem (4) by applying the same strategy for updating the initial $U^0$ and $\beta_0$ detailed at the end of Subsection 2.2.1. For convenience of presentation, the resulting method is also referred to as the adaptive Nesterov's smooth (ANS) method.
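Steps 3)-5) above are the core of Nesterov's smooth method; over the box $\mathcal{U}$ both subproblems reduce to entrywise clipping. A sketch of these steps with the adaptive updates 1a)-1b) and the termination test omitted for brevity (names are ours):

```python
import numpy as np

def proj_U(U):
    """Projection onto U = {U : |U_ij| <= 1}."""
    return np.clip(U, -1.0, 1.0)

def ans_smooth_steps(U0, grad_g, L, num_iters, sigma=1.0):
    """Steps 3)-5) of the ANS method for a fixed beta (hence fixed
    L = beta^2 * rho_max^2), following Nesterov's smooth method [10]."""
    U = U0.copy()
    weighted_grads = np.zeros_like(U0)           # sum_i (i+1)/2 * grad g(U^i)
    for k in range(num_iters):
        gk = grad_g(U)
        U_sd = proj_U(U - gk / L)                # step 3): prox-gradient step
        weighted_grads += 0.5 * (k + 1) * gk
        U_ag = proj_U(U0 - (sigma / L) * weighted_grads)        # step 4)
        U = 2.0 / (k + 3) * U_ag + (k + 1.0) / (k + 3) * U_sd   # step 5)
    return U
```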
3 Computational results

In this section, we test the sparse recovery ability of the model (4) and compare the performance of the adaptive spectral projected gradient (ASPG) method and the adaptive Nesterov's smooth (ANS) method, proposed in Section 2, for solving problem (4) on a set of randomly generated instances.

All instances used in this section were randomly generated in a manner similar to that described in d'Aspremont et al. [5] and Lu [8]. Indeed, we first generate a sparse matrix $A\in S^n_{++}$, and then we generate a matrix $B\in S^n$ by $B = A^{-1} + \tau V$, where $V\in S^n$ contains pseudo-random values drawn from a uniform distribution on the interval $[-1, 1]$, and $\tau$ is a small positive number. Finally, we obtain the following randomly generated sample covariance matrix:

$$\Sigma = B - \min\{\lambda_{\min}(B) - \vartheta,\ 0\}\,I,$$

where $\vartheta$ is a small positive number. In particular, we set $\tau = 0.15$ and $\vartheta = 1.0\times 10^{-4}$ for generating all instances.

In the first experiment, we compare the performance of the ASPG and ANS methods for problem (4). For this purpose, we first randomly generate the above matrix $A\in S^n_{++}$ with a prescribed density, and set $\Omega = \{(i,j) : A_{ij} = 0,\ |i-j| \ge 2\}$ and $\rho_{ij} = 0.5$ for all $(i,j)\notin\Omega$; $\Sigma$ is then generated by the above approach. The codes for both methods are written in MATLAB. In particular, we set $\gamma = 10^{-4}$, $M = 8$, $\sigma_1 = 0.1$, $\sigma_2 = 0.9$, $\alpha_{\min} = 10^{-15}$, $\alpha_{\max} = 10^{15}$ for the ASPG method, and $\varsigma_1 = \varsigma_2 = 1.05$ and $\varsigma_3 = 0.95$ for the ANS method. In addition, for both methods we set $\beta_0 = 1$, $r_\beta = 10$, $r_\rho = 2$, and $\rho^0_{ij} = 0.5$ for all $(i,j)\in\Omega$. Also, the ASPG and ANS methods start from the initial point $U^0 = 0$ and terminate once an $(\epsilon_o,\epsilon_c)$-optimal solution of problem (4) is found, where $\epsilon_o = 0.1$ and $\epsilon_c = 10^{-4}$. All computations were performed on an Intel Xeon 2.66 GHz machine with Red Hat Linux version 8.
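A sketch of this instance generator; the paper specifies only that $A\in S^n_{++}$ is sparse with a prescribed density, so the diagonally dominant construction of $A$ below is our own choice:

```python
import numpy as np

def random_instance(n, density, tau=0.15, vartheta=1.0e-4, seed=0):
    """Instance generation of Section 3: sparse A in S^n_++,
    B = A^{-1} + tau * V with symmetric V uniform on [-1, 1], and
    Sigma = B - min(lambda_min(B) - vartheta, 0) * I."""
    rng = np.random.default_rng(seed)
    # sparse symmetric off-diagonal pattern with the prescribed density
    A = np.triu(rng.uniform(-1, 1, (n, n)) * (rng.random((n, n)) < density), 1)
    A = A + A.T
    A += np.diag(np.abs(A).sum(axis=1) + 1.0)  # diagonal dominance => A in S^n_++
    V = rng.uniform(-1.0, 1.0, (n, n))
    V = (V + V.T) / 2.0                        # keep B symmetric
    B = np.linalg.inv(A) + tau * V
    Sigma = B - min(np.linalg.eigvalsh(B)[0] - vartheta, 0.0) * np.eye(n)
    Omega = [(i, j) for i in range(n) for j in range(n)
             if A[i, j] == 0.0 and abs(i - j) >= 2]
    return A, Sigma, Omega
```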
The performance of the ASPG and ANS methods on the randomly generated instances with density 0.1, 0.5 and 0.9 is presented in Tables 1-3, respectively. The row size $n$ of each sample covariance matrix $\Sigma$ is given in column one. The size of the set $\Omega$ is given in column two. The numbers of (inner) iterations of the ANS and ASPG methods are given in columns three and four, the numbers of function evaluations in columns five and six, and the CPU times (in seconds) in the last two columns.

Table 1: Comparison of ASPG and ANS for density 0.1

   n   size(Ω)   Iter(ANS)  Iter(ASPG)   Nf(ANS)  Nf(ASPG)   Time(ANS)  Time(ASPG)
  100     8792      1298       1736        1298      2626        17.9       33.9
  200    35646       593        489         593       654        52.1       56.1
  300    80604      1411        683        1411       974       431.8      291.7
  400   143636      1400        702        1400       978      1053.8      730.4
  500   224788      1012        615        1012       863      1469.4     1244.8
  600   324072      1410        661        1410       908      3501.2     2220.5
  700   441380      1189        738        1189      1050      4656.0     4070.5
  800   576896      1175        811        1175      1169      6601.2     6500.5
  900   730500      1660        808        1660      1154     12975.7     8964.5
 1000   902124      2600       1285        2600      1903     27523.2    20059.9

Table 2: Comparison of ASPG and ANS for density 0.5

   n   size(Ω)   Iter(ANS)  Iter(ASPG)   Nf(ANS)  Nf(ASPG)   Time(ANS)  Time(ASPG)
  100     4776       256        112         256       146         3.9        2.3
  200    19438       453        178         453       229        40.3       20.1
  300    44136       412        229         412       296       128.4       91.4
  400    78738       433        250         433       339       335.1      260.2
  500   123300       499        313         499       417       727.2      605.8
  600   177614       535        354         535       494      1361.0     1247.1
  700   241944       569        327         569       467      2204.8     1793.9
  800   317184       536        349         536       498      3011.7     2763.5
  900   400952       581        420         581       600      4619.5     4752.2
 1000   494610       697        561         697       775      7425.6     8240.1

Table 3: Comparison of ASPG and ANS for density 0.9

   n   size(Ω)   Iter(ANS)  Iter(ASPG)   Nf(ANS)  Nf(ASPG)   Time(ANS)  Time(ASPG)
  100      960       207         85         207       164         3.3        2.5
  200     3738       275        139         275       180        24.5       16.0
  300     8750       567        178         567       220       173.4       69.7
  400    15764       408        180         408       235       318.6      182.7
  500    25072       416        272         416       367       616.8      535.2
  600    35846       441        275         441       371      1107.0      920.7
  700    48718      1219        421        1219       597      4646.2     2300.0
  800    63814       461        348         461       460      2693.9     2650.0
  900    80798       469        363         469       507      4124.1     4171.8
 1000    98870       495        363         495       514      5656.1     5718.9

From Tables 1-3, we see that both methods are able to solve all instances within a reasonable amount of time. In addition, the ASPG method, namely, the adaptive spectral projected gradient method, generally outperforms the ANS method, that is, the adaptive Nesterov's smooth method.

Our second experiment is similar to the one carried out in d'Aspremont et al. [5]. We intend to test the sparse recovery ability of the model (4). To this end, we specialize $n = 30$ and take the matrix $A\in S^n_{++}$ with diagonal entries around one and a few randomly chosen nonzero off-diagonal entries equal to $+1$ or $-1$; the sample covariance matrix $\Sigma$ is then generated by the aforementioned approach. Also, we set $\Omega = \{(i,j) : A_{ij} = 0,\ |i-j| \ge 5\}$ and $\rho_{ij} = 0.1$ for all $(i,j)\notin\Omega$. The model (4) with such an instance is finally solved by the ASPG method, whose parameters, initial point and termination criterion are exactly the same as above. In Figure 1, we plot the sparsity patterns of the original inverse covariance matrix $A$, the approximate solution to problem (4), and the noisy inverse covariance matrix $B^{-1}$ for such a randomly generated instance. We observe that the model (4) is capable of recovering the sparsity pattern of the original inverse covariance matrix.

Figure 1: Sparsity recovery. (Panels: original inverse $A$; noisy inverse $B^{-1}$; approximate solution of (4).)

4 Concluding remarks

In this paper, we considered estimating the sparse inverse covariance of a Gaussian graphical model whose conditional independence is assumed to be partially known. We naturally formulated it as a constrained $l_1$-norm penalized maximum likelihood estimation problem. Further, we proposed an algorithm framework and developed two first-order methods, that is, the adaptive spectral projected gradient (ASPG) method and the adaptive Nesterov's smooth (ANS) method, for solving it. Our computational results demonstrate that both methods are able to solve problems of size at least a thousand with nearly a half million constraints within a reasonable amount of time, and that the ASPG method generally outperforms the ANS method. The source codes for the ASPG and ANS methods (written in MATLAB) are available online at www.math.sfu.ca/~zhaosong. They can also be applied to problem (4) with $\Omega = \emptyset$, namely, the case where the underlying sparsity structure is completely unknown. It shall be mentioned that these codes can be extended straightforwardly to more general problems of the form

$$\max_X\ \log\det X - \langle\Sigma, X\rangle - \sum_{(i,j)\notin\Omega}\rho_{ij}|X_{ij}|\quad \text{s.t.}\ \alpha I \preceq X \preceq \beta I,\ X_{ij} = 0,\ \forall (i,j)\in\Omega,$$

where $0 \le \alpha < \beta \le \infty$ are fixed bounds on the eigenvalues of the solution.

References

[1] J. Barzilai and J. M. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal., 8 (1988), pp. 141-148.

[2] D. P. Bertsekas, Nonlinear Programming, 2nd edition, Athena Scientific, Belmont, Massachusetts, 1999.

[3] E. G. Birgin, J. M. Martínez, and M. Raydan, Nonmonotone spectral projected gradient methods on convex sets, SIAM J. Optim., 10 (2000), pp. 1196-1211.

[4] J. Dahl, L. Vandenberghe, and V. Roychowdhury, Covariance selection for non-chordal graphs via chordal embedding, Optim. Methods Softw., 23 (2008), pp. 501-520.

[5] A. d'Aspremont, O. Banerjee, and L. El Ghaoui, First-order methods for sparse covariance selection, SIAM J. Matrix Anal. Appl., 30 (2008), pp. 56-66.

[6] J. Friedman, T. Hastie, and R. Tibshirani, Sparse inverse covariance estimation with the graphical lasso, Biostatistics, 9 (2008), pp. 432-441.
[7] L. Grippo, F. Lampariello, and S. Lucidi, A nonmonotone line search technique for Newton's method, SIAM J. Numer. Anal., 23 (1986), pp. 707-716.

[8] Z. Lu, Smooth optimization approach for sparse covariance selection, SIAM J. Optim., to appear.

[9] Y. E. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$, Doklady AN SSSR, 269 (1983), pp. 543-547, translated as Soviet Math. Docl.

[10] Y. E. Nesterov, Smooth minimization of nonsmooth functions, Math. Programming, 103 (2005), pp. 127-152.

[11] Y. E. Nesterov and A. S. Nemirovski, Interior Point Polynomial Algorithms in Convex Programming: Theory and Applications, SIAM, Philadelphia, 1994.

[12] L. Vandenberghe, S. Boyd, and S. Wu, Determinant maximization with linear matrix inequality constraints, SIAM J. Matrix Anal. Appl., 19 (1998), pp. 499-533.

[13] M. Yuan and Y. Lin, Model selection and estimation in the Gaussian graphical model, Biometrika, 94 (2007), pp. 19-35.