An Augmented Lagrangian Approach for Sparse Principal Component Analysis

Zhaosong Lu∗   Yong Zhang†

July 10, 2009

Abstract

Principal component analysis (PCA) is a widely used technique for data analysis and dimension reduction with numerous applications in science and engineering. However, the standard PCA suffers from the fact that the principal components (PCs) are usually linear combinations of all the original variables, and it is thus often difficult to interpret the PCs. To alleviate this drawback, various sparse PCA approaches were proposed in the literature [15, 6, 17, 28, 8, 25, 18, 7, 16]. Despite success in achieving sparsity, some important properties enjoyed by the standard PCA are lost in these methods, such as uncorrelation of PCs and orthogonality of loading vectors. Also, the total explained variance that they attempt to maximize can be too optimistic. In this paper we propose a new formulation for sparse PCA, aiming at finding sparse and nearly uncorrelated PCs with orthogonal loading vectors while explaining as much of the total variance as possible. We also develop a novel augmented Lagrangian method for solving a class of nonsmooth constrained optimization problems, which is well suited for our formulation of sparse PCA. We show that it converges to a feasible point, and moreover, under some regularity assumptions, it converges to a stationary point. Additionally, we propose two nonmonotone gradient methods for solving the augmented Lagrangian subproblems, and establish their global and local convergence. Finally, we compare our sparse PCA approach with several existing methods on synthetic, random, and real data, respectively.
The computational results demonstrate that the sparse PCs produced by our approach substantially outperform those by other methods in terms of total explained variance, correlation of PCs, and orthogonality of loading vectors.

Key words: sparse PCA, augmented Lagrangian method, nonmonotone gradient methods, nonsmooth minimization

AMS 2000 subject classification: 62H20, 62H25, 62H30, 90C30, 65K05

∗ Department of Mathematics, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada (email: zhaosong@sfu.ca). This author was supported in part by an NSERC Discovery Grant.
† Department of Mathematics, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada (email: yza30@sfu.ca).

1 Introduction

Principal component analysis (PCA) is a popular tool for data processing and dimension reduction. It has been widely used in numerous applications in science and engineering, such as biology, chemistry, image processing, machine learning, and so on. For example, PCA has recently been applied to human face recognition, handwritten zip code classification, and gene expression data analysis (see [10, 12, 1, 11]). In essence, PCA aims at finding a few linear combinations of the original variables, called principal components (PCs), which point in orthogonal directions capturing as much of the variance of the variables as possible. It is well known that the PCs can be found via the eigenvalue decomposition of the covariance matrix Σ. However, Σ is typically unknown in practice. Instead, the PCs can be approximately computed via the singular value decomposition (SVD) of the data matrix or the eigenvalue decomposition of the sample covariance matrix. In detail, let ξ = (ξ^{(1)}, …, ξ^{(p)}) be a p-dimensional random vector, and let X be an n × p data matrix, which records the n observations of ξ.
Without loss of generality, assume X is centered, that is, the column means of X are all 0. Then the commonly used sample covariance matrix is Σ̂ = XᵀX/(n − 1). Suppose the eigenvalue decomposition of Σ̂ is Σ̂ = V D Vᵀ. Then η = ξV gives the PCs, and the columns of V are the corresponding loading vectors. It is worth noting that V can also be obtained by performing the SVD of X (see, for example, [28]). Clearly, the columns of V are orthonormal vectors, and moreover Vᵀ Σ̂ V is diagonal. We thus immediately see that if Σ̂ = Σ, the corresponding PCs are uncorrelated; otherwise, they can be correlated with each other (see Section 2 for details). We now describe several important properties of the PCs obtained by the standard PCA when Σ is well estimated by Σ̂ (see also [28]):

1. The PCs sequentially capture the maximum variance of the variables approximately, thus encouraging minimal information loss;

2. The PCs are nearly uncorrelated, so the explained variance by different PCs has small overlap;

3. The PCs point in orthogonal directions, that is, their loading vectors are orthogonal to each other.

In practice, typically the first few PCs are enough to represent the data, and thus a great dimensionality reduction is achieved. In spite of the popularity and success of PCA due to these nice features, PCA has an obvious drawback: the PCs are usually linear combinations of all p variables, and the loadings are typically nonzero. This often makes it difficult to interpret the PCs, especially when p is large. Indeed, in many applications the original variables have concrete physical meaning. For example, in biology, each variable might represent the expression level of a gene.
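The eigenvalue-decomposition and SVD routes to the PCs described above can be sketched numerically. The following is an illustrative example on synthetic data (not from the paper); variable names mirror the notation used here.

```python
import numpy as np

# Illustrative sketch (synthetic data) of standard PCA as described above.
rng = np.random.default_rng(0)
n, p = 100, 5
# Columns with clearly distinct variances so the eigenvalues are separated.
X = rng.standard_normal((n, p)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0])
X = X - X.mean(axis=0)                    # center: column means are all 0

Sigma_hat = X.T @ X / (n - 1)             # sample covariance matrix
evals, V = np.linalg.eigh(Sigma_hat)      # eigenvalue decomposition
evals, V = evals[::-1], V[:, ::-1]        # reorder eigenvalues descending

# V^T Sigma_hat V is diagonal, so the PCs eta = X V are uncorrelated
# with respect to Sigma_hat:
D = V.T @ Sigma_hat @ V
assert np.abs(D - np.diag(np.diag(D))).max() < 1e-8

# The same loading vectors (up to column signs) come from the SVD of X:
_, s, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(np.abs(Vt.T), np.abs(V), atol=1e-6)
```

The r largest eigenvalues and the corresponding first r columns of V then give the variances and loading vectors of the first r PCs.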
In these cases, the interpretation of the PCs would be facilitated if they were composed only from a small number of the original variables, namely, if each PC involved only a small number of nonzero loadings. It is thus imperative to develop sparse PCA techniques for finding PCs with sparse loadings while enjoying the above three nice properties as much as possible.

Sparse PCA has been an active research topic for more than a decade. The first class of approaches is based on ad-hoc methods that post-process the PCs obtained from the standard PCA mentioned above. For example, Jolliffe [15] applied various rotation techniques to the standard PCs to obtain sparse loading vectors. Cadima and Jolliffe [6] proposed a simple thresholding approach that artificially sets to zero the standard PCs' loadings whose absolute values are smaller than a threshold. In recent years, optimization approaches have been proposed for finding sparse PCs. They usually formulate sparse PCA as an optimization problem, aiming at achieving sparsity of the loadings while maximizing the explained variance as much as possible. For instance, Jolliffe et al. [17] proposed an interesting algorithm, called SCoTLASS, for finding sparse orthogonal loading vectors by sequentially maximizing the approximate variance explained by each PC under an l1-norm penalty on the loading vectors. Zou et al. [28] formulated sparse PCA as a regression-type optimization problem and imposed a combination of l1- and l2-norm penalties on the regression coefficients. d'Aspremont et al. [8] proposed a method, called DSPCA, for finding sparse PCs by solving a sequence of semidefinite programming relaxations of sparse PCA. Shen and Huang [25] recently developed an approach for computing sparse PCs by solving a sequence of rank-one matrix approximation problems under several sparsity-inducing penalties.
Very recently, Journée et al. [16] formulated sparse PCA as nonconcave maximization problems with l0- or l1-norm sparsity-inducing penalties. They showed that these problems can be reduced to the maximization of a convex function on a compact set, and they also proposed a simple but computationally efficient gradient method for finding a stationary point of the latter problems. Additionally, greedy methods were investigated for sparse PCA by Moghaddam et al. [18] and d'Aspremont et al. [7].

The PCs obtained by the above methods [15, 6, 17, 28, 8, 25, 18, 7, 16] are usually sparse. However, the aforementioned nice properties of the standard PCs are lost to some extent in these sparse PCs. Indeed, the likely correlation among the sparse PCs is not considered in these methods; therefore, their sparse PCs can be quite correlated with each other. Also, the total explained variance that these methods attempt to maximize can be too optimistic, as there may be some overlap among the individual variances of the sparse PCs. Finally, the loading vectors of the sparse PCs given by these methods lack orthogonality, except for SCoTLASS [17].

In this paper we propose a new formulation for sparse PCA by taking into account the three nice properties of the standard PCA, that is, maximal total explained variance, uncorrelation of PCs, and orthogonality of loading vectors. We also explore the connection of this formulation with the standard PCA and show that it can be viewed as a certain perturbation of the standard PCA. We further propose a novel augmented Lagrangian method for solving a class of nonsmooth constrained optimization problems, which is well suited for our formulation of sparse PCA.
This method differs from the classical augmented Lagrangian method in that: i) the values of the augmented Lagrangian functions at their approximate minimizers given by the method are bounded from above; and ii) the magnitude of the penalty parameters outgrows that of the Lagrange multipliers (see Section 3.2 for details). We show that this method converges to a feasible point, and moreover it converges to a first-order stationary point under some regularity assumptions. We also propose two nonmonotone gradient methods for minimizing a class of nonsmooth functions over a closed convex set, which can be suitably applied to the subproblems arising in our augmented Lagrangian method. We further establish global convergence and, under a local Lipschitzian error bounds assumption [26], a local linear rate of convergence for these gradient methods. Finally, we compare the sparse PCA approach proposed in this paper with several existing methods [28, 8, 25, 16] on synthetic, random, and real data, respectively. The computational results demonstrate that the sparse PCs obtained by our approach substantially outperform those by the other methods in terms of total explained variance, correlation of PCs, and orthogonality of loading vectors.

The rest of the paper is organized as follows. In Section 2, we propose a new formulation for sparse PCA and explore the connection of this formulation with the standard PCA. In Section 3, we then develop a novel augmented Lagrangian method for a class of nonsmooth constrained problems, and propose two nonmonotone gradient methods for minimizing a class of nonsmooth functions over a closed convex set. In Section 4, we discuss the applicability and implementation details of our augmented Lagrangian method for sparse PCA.
The sparse PCA approach proposed in this paper is then compared with several existing methods on synthetic, random, and real data in Section 5. Finally, we present some concluding remarks in Section 6.

1.1 Notation

In this paper, all vector spaces are assumed to be finite dimensional. The symbols ℜⁿ and ℜⁿ₊ (resp., ℜⁿ₋) denote the n-dimensional Euclidean space and the nonnegative (resp., nonpositive) orthant of ℜⁿ, respectively, and ℜ₊₊ denotes the set of positive real numbers. The space of all m × n matrices with real entries is denoted by ℜ^{m×n}. The space of symmetric n × n matrices is denoted by Sⁿ. Additionally, Dⁿ denotes the space of n × n diagonal matrices. For a real matrix X, we denote by |X| the absolute value of X, that is, |X|_{ij} = |X_{ij}| for all ij, and by sign(X) the sign of X, whose ij-th entry equals the sign of X_{ij} for all ij. Also, the nonnegative part of X is denoted by [X]₊, whose ij-th entry is given by max{0, X_{ij}} for all ij. The rank of X is denoted by rank(X). Further, the identity matrix and the all-ones matrix are denoted by I and E, respectively, whose dimensions should be clear from the context. If X ∈ Sⁿ is positive semidefinite, we write X ⪰ 0. For any X, Y ∈ Sⁿ, we write X ⪯ Y to mean Y − X ⪰ 0. Given matrices X and Y in ℜ^{m×n}, the standard inner product is defined by X • Y := Tr(XYᵀ), where Tr(·) denotes the trace of a matrix, and the component-wise product is denoted by X ⊙ Y, whose ij-th entry is X_{ij}Y_{ij} for all ij. ‖·‖ denotes the Euclidean norm and its associated operator norm unless explicitly stated otherwise. The minimal (resp., maximal) eigenvalue of an n × n symmetric matrix X is denoted by λ_min(X) (resp., λ_max(X)), and λ_i(X) denotes its i-th largest eigenvalue for i = 1, …, n. Given a vector v ∈ ℜⁿ, Diag(v) or Diag(v₁, …, vₙ) denotes the diagonal matrix whose i-th diagonal element is vᵢ for i = 1, …, n. Given an n × n matrix X, Diag(X) denotes the diagonal matrix whose i-th diagonal element is X_{ii} for i = 1, …, n. Let U be a real vector space. Given a closed convex set C ⊆ U, let dist(·, C) : U → ℜ₊ denote the distance function to C measured in terms of ‖·‖, that is,

\[ \mathrm{dist}(u, C) := \inf_{\tilde u \in C} \|u - \tilde u\| \quad \forall u \in U. \tag{1} \]

2 Formulation for sparse PCA

In this section we propose a new formulation for sparse PCA by taking into account sparsity and orthogonality of the loading vectors, and uncorrelation of the PCs. We also address the connection of our formulation with the standard PCA.

Let ξ = (ξ^{(1)}, …, ξ^{(p)}) be a p-dimensional random vector with covariance matrix Σ. Suppose X is an n × p data matrix, which records the n observations of ξ. Without loss of generality, assume the column means of X are 0. Then the commonly used sample covariance matrix of ξ is Σ̂ = XᵀX/(n − 1). For any r loading vectors represented as V = [V₁, …, V_r] ∈ ℜ^{p×r}, where 1 ≤ r ≤ p, the corresponding components are given by η = (η^{(1)}, …, η^{(r)}) = ξV, which are linear combinations of ξ^{(1)}, …, ξ^{(p)}. Clearly, the covariance matrix of η is VᵀΣV, and thus the components η^{(i)} and η^{(j)} are uncorrelated if and only if the ij-th entry of VᵀΣV is zero. Also, if the components are uncorrelated, the total variance they explain equals the sum of their individual variances, that is,

\[ \sum_{i=1}^{r} V_i^T \Sigma V_i = \mathrm{Tr}(V^T \Sigma V). \]

Recall that our aim is to find a set of sparse and orthogonal loading vectors V so that the corresponding components η^{(1)}, …, η^{(r)} are uncorrelated and explain as much variance of the original variables ξ^{(1)}, …, ξ^{(p)} as possible.
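As a quick numerical check of the discussion above (our illustration, not the paper's), the components corresponding to a loading matrix V are uncorrelated exactly when VᵀΣ̂V is diagonal: a random orthonormal V generally fails this, while eigenvector loadings satisfy it.

```python
import numpy as np

# Hypothetical numerical check: eta = xi * V has uncorrelated components
# (w.r.t. Sigma_hat) exactly when V^T Sigma_hat V is diagonal.
rng = np.random.default_rng(1)
n, p, r = 200, 6, 3
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X -= X.mean(axis=0)
Sigma_hat = X.T @ X / (n - 1)

def offdiag_norm(M):
    """Size of the off-diagonal part, i.e., the component correlations."""
    return np.linalg.norm(M - np.diag(np.diag(M)))

# A random orthonormal V (a QR factor) generally gives correlated components:
Q, _ = np.linalg.qr(rng.standard_normal((p, r)))
corr_random = offdiag_norm(Q.T @ Sigma_hat @ Q)

# Eigenvector loadings give (numerically) uncorrelated components:
_, V = np.linalg.eigh(Sigma_hat)
V = V[:, ::-1][:, :r]                     # top-r eigenvectors
corr_eig = offdiag_norm(V.T @ Sigma_hat @ V)
```

Both choices of V have orthonormal columns, yet only the second keeps the components uncorrelated, which is exactly the property the formulation below tries to retain while also enforcing sparsity.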
It appears that our goal can be achieved by solving the following problem:

\[
\begin{array}{ll}
\max\limits_{V \in \Re^{p \times r}} & \mathrm{Tr}(V^T \Sigma V) - \rho \bullet |V| \\
\text{s.t.} & V^T \Sigma V \ \text{is diagonal}, \quad V^T V = I,
\end{array} \tag{2}
\]

where ρ ∈ ℜ^{p×r}₊ is a tuning parameter for controlling the sparsity of V. However, the covariance matrix Σ is typically unknown and can only be approximated by the sample covariance matrix Σ̂. At a glance, it looks plausible to modify (2) by simply replacing Σ with Σ̂. Nevertheless, such a modification would eliminate all optimal solutions V* of (2) from consideration, since (V*)ᵀ Σ̂ V* is generally non-diagonal. For this reason, given a sample covariance Σ̂, we consider the following formulation for sparse PCA, which can be viewed as a modification of problem (2):

\[
\begin{array}{ll}
\max\limits_{V \in \Re^{p \times r}} & \mathrm{Tr}(V^T \hat\Sigma V) - \rho \bullet |V| \\
\text{s.t.} & |V_i^T \hat\Sigma V_j| \le \Delta_{ij} \quad \forall i \ne j, \quad V^T V = I,
\end{array} \tag{3}
\]

where Δ_{ij} ≥ 0 (i ≠ j) are parameters for controlling the correlation of the components corresponding to V. Clearly, Δ_{ij} = Δ_{ji} for all i ≠ j.

We next explore the connection of formulation (3) with the standard PCA. Before proceeding, we state a technical lemma that will be used subsequently. Its proof can be found in [20].

Lemma 2.1 Given any Σ̂ ∈ Sⁿ and an integer 1 ≤ r ≤ n, define

\[ \underline{i}_r = \max\{1 \le i \le n : \lambda_i(\hat\Sigma) > \lambda_r(\hat\Sigma)\}, \qquad \bar i_r = \max\{1 \le i \le n : \lambda_i(\hat\Sigma) = \lambda_r(\hat\Sigma)\}, \tag{4} \]

and let f* be the optimal value of

\[ \max\{\mathrm{Tr}(\hat\Sigma Y) : 0 \preceq Y \preceq I, \ \mathrm{Tr}(Y) = r\}. \tag{5} \]

Then f* = Σ_{i=1}^r λ_i(Σ̂), and Y* is an optimal solution of (5) if and only if Y* = U₁*U₁*ᵀ + U₂*P*U₂*ᵀ, where P* ∈ S^{\bar i_r − \underline{i}_r} satisfies 0 ⪯ P* ⪯ I and Tr(P*) = r − \underline{i}_r, and U₁* ∈ ℜ^{n×\underline{i}_r} and U₂* ∈ ℜ^{n×(\bar i_r − \underline{i}_r)} are the matrices whose columns consist of the orthonormal eigenvectors of Σ̂ corresponding to the eigenvalues (λ₁(Σ̂), …, λ_{\underline{i}_r}(Σ̂)) and (λ_{\underline{i}_r+1}(Σ̂), …, λ_{\bar i_r}(Σ̂)), respectively.
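To make formulation (3) concrete, the following sketch (our own, with hypothetical helper names) evaluates its objective and measures how far a candidate loading matrix V is from satisfying its constraints.

```python
import numpy as np

def sparse_pca_objective(V, Sigma_hat, rho):
    """Objective of formulation (3): Tr(V^T Sigma_hat V) - rho . |V|."""
    return np.trace(V.T @ Sigma_hat @ V) - np.sum(rho * np.abs(V))

def constraint_violation(V, Sigma_hat, Delta):
    """Violation of |V_i^T Sigma_hat V_j| <= Delta_ij (i != j) and of the
    orthonormality constraint V^T V = I, as a pair of nonneg. numbers."""
    C = V.T @ Sigma_hat @ V
    r = V.shape[1]
    corr = max(
        (abs(C[i, j]) - Delta[i, j] for i in range(r) for j in range(r) if i != j),
        default=0.0,
    )
    orth = np.abs(V.T @ V - np.eye(r)).max()
    return max(corr, 0.0), orth
```

For ρ = 0 and Δ = 0, any matrix of r orthonormal eigenvectors of Σ̂ has zero constraint violation, and its objective is the sum of the corresponding eigenvalues, consistent with Proposition 2.2 below.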
Then it follows from Lemma 2.1 that Y* = U₁*U₁*ᵀ + U₂*P*U₂*ᵀ, where P* ∈ S^{\bar i_r − \underline{i}_r} satisfies 0 ⪯ P* ⪯ I and Tr(P*) = r − \underline{i}_r, and U₁* ∈ ℜ^{p×\underline{i}_r} and U₂* ∈ ℜ^{p×(\bar i_r − \underline{i}_r)} are the matrices whose columns consist of the orthonormal eigenvectors of Σ̂ corresponding to the eigenvalues (λ₁(Σ̂), …, λ_{\underline{i}_r}(Σ̂)) and (λ_{\underline{i}_r+1}(Σ̂), …, λ_{\bar i_r}(Σ̂)), respectively. Thus we have

\[ \hat\Sigma U_1^* = U_1^* \Lambda, \qquad \hat\Sigma U_2^* = \lambda_r(\hat\Sigma)\, U_2^*, \tag{6} \]

where Λ = Diag(λ₁(Σ̂), …, λ_{\underline{i}_r}(Σ̂)). In addition, it is easy to show that rank(Y*) = \underline{i}_r + rank(P*). Since Y* = V*V*ᵀ and V*ᵀV* = I, we can observe that rank(Y*) = r. Hence rank(P*) = r − \underline{i}_r, which implies that P* has only r − \underline{i}_r nonzero eigenvalues. Using this fact and the relations 0 ⪯ P* ⪯ I and Tr(P*) = r − \underline{i}_r, we can further conclude that r − \underline{i}_r eigenvalues of P* are 1 and the rest are 0. Therefore, there exists W ∈ ℜ^{(\bar i_r − \underline{i}_r)×(r − \underline{i}_r)} such that

\[ W^T W = I, \qquad P^* = W W^T. \tag{7} \]

Together with Y* = U₁*U₁*ᵀ + U₂*P*U₂*ᵀ, this implies that Y* = U*U*ᵀ, where U* = [U₁*, U₂*W]. In view of (7) and the identities U₁*ᵀU₁* = I, U₂*ᵀU₂* = I and U₁*ᵀU₂* = 0, we see that U*ᵀU* = I. Using this result and the relations V*ᵀV* = I and Y* = U*U*ᵀ = V*V*ᵀ, it is not hard to see that the columns of U* and V*, respectively, form orthonormal bases of the range space of Y*. Thus V* = U*Q for some Q ∈ ℜ^{r×r} satisfying QᵀQ = I. Now let D = V*ᵀ Σ̂ V*. By the definition of V*, we know that D is an r × r diagonal matrix.
Moreover, in view of (6), (7), the definition of U*, and the relations V* = U*Q, U₁*ᵀU₁* = I, U₂*ᵀU₂* = I and U₁*ᵀU₂* = 0, we have

\[
D = V^{*T}\hat\Sigma V^{*} = Q^T U^{*T}\hat\Sigma U^{*} Q
  = Q^T \begin{bmatrix} U_1^{*T} \\ W^T U_2^{*T} \end{bmatrix}
    \hat\Sigma\, [\,U_1^{*} \;\; U_2^{*}W\,]\, Q
  = Q^T \begin{bmatrix} \Lambda & 0 \\ 0 & \lambda_r(\hat\Sigma)\, I \end{bmatrix} Q, \tag{8}
\]

which together with QᵀQ = I implies that D is similar to the diagonal matrix appearing on the right-hand side of (8). Hence, the diagonal elements of D consist of r largest eigenvalues of Σ̂. In addition, let Q₁ ∈ ℜ^{\underline{i}_r×r} and Q₂ ∈ ℜ^{(r−\underline{i}_r)×r} be the submatrices corresponding to the first \underline{i}_r and the last r − \underline{i}_r rows of Q, respectively. Then, in view of the definition of U* and V* = U*Q, we have

\[ [\,U_1^{*} \;\; U_2^{*}W\,] = U^{*} = V^{*} Q^T = [\,V^{*} Q_1^T \;\; V^{*} Q_2^T\,]. \]

Thus we obtain that U₁* = V*Q₁ᵀ and U₂*W = V*Q₂ᵀ. Using these identities, (6), (8), and the relation V* = U*Q, we have

\[
\hat\Sigma V^{*} = \hat\Sigma U^{*} Q = \hat\Sigma\, [\,U_1^{*} \;\; U_2^{*}W\,]\, Q
 = [\,U_1^{*}\Lambda \;\; \lambda_r(\hat\Sigma)\, U_2^{*}W\,]\, Q
 = [\,V^{*}Q_1^T \Lambda \;\; \lambda_r(\hat\Sigma)\, V^{*}Q_2^T\,]\, Q
 = V^{*} Q^T \begin{bmatrix} \Lambda & 0 \\ 0 & \lambda_r(\hat\Sigma)\, I \end{bmatrix} Q = V^{*} D.
\]

It follows that the columns of V* consist of orthonormal eigenvectors of Σ̂ corresponding to r largest eigenvalues of Σ̂, and thus the "only if" part of this proposition holds. ∎

From the above proposition, we see that when ρ = 0 and Δ_{ij} = 0 for all i ≠ j, each solution of (3) consists of orthonormal eigenvectors of Σ̂ corresponding to r largest eigenvalues of Σ̂, which can be computed from the eigenvalue decomposition of Σ̂. Therefore, the loading vectors obtained from (3) are the same as those given by the standard PCA when applied to Σ̂. On the other hand, when ρ and the Δ_{ij} (i ≠ j) are small, the loading vectors found by (3) can be viewed as an approximation of those provided by the standard PCA. We will propose suitable methods for solving (3) in Sections 3 and 4.
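Proposition 2.2 is easy to verify numerically. The sketch below (our illustration on synthetic data, not part of the paper) checks that with ρ = 0 and Δ_{ij} = 0 the top-r eigenvectors of Σ̂ attain the value Σ_{i=1}^r λ_i(Σ̂), while another feasible choice of eigenvectors explains less variance.

```python
import numpy as np

# Numerical illustration of Proposition 2.2 (synthetic data): with
# rho = 0 and Delta = 0, any r orthonormal eigenvectors of Sigma_hat are
# feasible for (3); the top-r ones attain f* = sum of r largest eigenvalues.
rng = np.random.default_rng(2)
p, r = 8, 3
A = rng.standard_normal((p, p))
Sigma_hat = A @ A.T / p
evals, U = np.linalg.eigh(Sigma_hat)
evals, U = evals[::-1], U[:, ::-1]         # descending eigenvalue order

V_top = U[:, :r]                           # top-r eigenvectors: optimal
V_bottom = U[:, -r:]                       # another feasible choice

obj = lambda V: np.trace(V.T @ Sigma_hat @ V)
f_star = evals[:r].sum()
```

Here `obj(V_top)` matches `f_star` to machine precision, and `obj(V_bottom)` is strictly smaller, as the proposition predicts.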
3 Augmented Lagrangian method for nonsmooth constrained nonlinear programming

In this section we propose a novel augmented Lagrangian method for a class of nonsmooth constrained nonlinear programming problems, which is well suited for formulation (3) of sparse PCA. In particular, we study first-order optimality conditions in Subsection 3.1. In Subsection 3.2, we develop an augmented Lagrangian method and establish its global convergence. In Subsection 3.3, we propose two nonmonotone gradient methods for minimizing a class of nonsmooth functions over a closed convex set, which can be suitably applied to the subproblems arising in our augmented Lagrangian method. We also establish global and local convergence for these gradient methods.

3.1 First-order optimality conditions

In this subsection we introduce a class of nonsmooth constrained nonlinear programming problems and study first-order optimality conditions for them. Consider the nonlinear programming problem

\[
\begin{array}{ll}
\min & f(x) + P(x) \\
\text{s.t.} & g_i(x) \le 0, \quad i = 1, \ldots, m, \\
 & h_i(x) = 0, \quad i = 1, \ldots, p, \\
 & x \in X.
\end{array} \tag{9}
\]

We assume that the functions f : ℜⁿ → ℜ, g_i : ℜⁿ → ℜ, i = 1, …, m, and h_i : ℜⁿ → ℜ, i = 1, …, p, are continuously differentiable, that the function P : ℜⁿ → ℜ is convex but not necessarily smooth, and that the set X ⊆ ℜⁿ is closed and convex. For convenience of the subsequent presentation, we denote by Ω the feasible region of problem (9). Before establishing first-order optimality conditions for problem (9), we describe a general constraint qualification for (9), namely Robinson's condition, which was proposed in [21]. Let x ∈ ℜⁿ be a feasible point of problem (9). We denote the set of active inequality constraints at x by

\[ \mathcal{A}(x) = \{\, 1 \le i \le m : g_i(x) = 0 \,\}. \]
In addition, x is said to satisfy Robinson's condition if

\[
\left\{ \begin{pmatrix} g'(x)d - v \\ h'(x)d \end{pmatrix} : d \in T_X(x),\ v \in \Re^m,\ v_i \le 0,\ i \in \mathcal{A}(x) \right\} = \Re^m \times \Re^p, \tag{10}
\]

where g′(x) and h′(x) denote the Jacobians of the functions g = (g₁, …, g_m) and h = (h₁, …, h_p) at x, respectively. Other equivalent expressions of Robinson's condition can be found, for example, in [21, 22, 24]. The following proposition demonstrates that Robinson's condition is indeed a constraint qualification for problem (9). For the sake of completeness, we include a brief proof of it.

Proposition 3.1 Given a feasible point x ∈ ℜⁿ of problem (9), let T_Ω(x) be the tangent cone to Ω at x, and (T_Ω(x))° its polar cone. If Robinson's condition (10) holds at x, then

\[
\begin{aligned}
T_\Omega(x) &= \{\, d \in T_X(x) : d^T \nabla g_i(x) \le 0,\ i \in \mathcal{A}(x);\ d^T \nabla h_i(x) = 0,\ i = 1, \ldots, p \,\}, \\
(T_\Omega(x))^\circ &= \Big\{ \textstyle\sum_{i \in \mathcal{A}(x)} \lambda_i \nabla g_i(x) + \sum_{i=1}^{p} \mu_i \nabla h_i(x) + N_X(x) : \lambda \in \Re^m_+,\ \mu \in \Re^p \Big\},
\end{aligned} \tag{11}
\]

where T_X(x) and N_X(x) are the tangent and normal cones to X at x, respectively.

Proof. By Theorem A.10 of [24], Robinson's condition (10) implies that the assumption of Theorem 3.15 of [24] is satisfied with

x₀ = x, X₀ = X, Y₀ = ℜᵐ₋ × ℜᵖ, g(·) = (g₁(·); …; g_m(·); h₁(·); …; h_p(·)).

The first statement then follows from Theorem 3.15 of [24] with the above x₀, X₀, Y₀ and g(·). Further, let A(x) denote the matrix whose rows are the gradients of all active constraints at x, in the same order as they appear in (9). Then Robinson's condition (10) implies that the assumptions of Theorem 2.36 of [24] are satisfied with

A = A(x), K₁ = T_X(x), K₂ = ℜ^{|𝒜(x)|}₋ × ℜᵖ.

Let K = {d ∈ K₁ : Ad ∈ K₂}. Then it follows from Theorem 2.36 of [24] that

\[ (T_\Omega(x))^\circ = K^\circ = K_1^\circ + \{A^T \xi : \xi \in K_2^\circ\}, \]

which together with the identity (T_X(x))° = N_X(x) and the definitions of A, K₁ and K₂ implies that the second statement holds. ∎

We are now ready to establish first-order optimality conditions for problem (9).

Theorem 3.2 Let x* ∈ ℜⁿ be a local minimizer of problem (9). Assume that Robinson's condition (10) is satisfied at x*. Then there exist Lagrange multipliers λ ∈ ℜᵐ₊ and μ ∈ ℜᵖ such that

\[ 0 \in \nabla f(x^*) + \partial P(x^*) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x^*) + \sum_{i=1}^{p} \mu_i \nabla h_i(x^*) + N_X(x^*), \tag{12} \]

and

\[ \lambda_i\, g_i(x^*) = 0, \quad i = 1, \ldots, m. \tag{13} \]

Moreover, the set of Lagrange multipliers (λ, μ) ∈ ℜᵐ₊ × ℜᵖ satisfying the above conditions, denoted by Λ(x*), is convex and compact.

Proof. We first show that

\[ d^T \nabla f(x^*) + P'(x^*; d) \ge 0 \quad \forall d \in T_\Omega(x^*). \tag{14} \]

Let d ∈ T_Ω(x*) be arbitrarily chosen. Then there exist sequences {x^k}_{k=1}^∞ ⊆ Ω and {t_k}_{k=1}^∞ ⊆ ℜ₊₊ such that t_k ↓ 0 and

\[ d = \lim_{k \to \infty} \frac{x^k - x^*}{t_k}. \]

Thus we have x^k = x* + t_k d + o(t_k). Using this relation along with the fact that f is differentiable and P is convex in ℜⁿ, we have

\[ f(x^* + t_k d) - f(x^k) = o(t_k), \qquad P(x^* + t_k d) - P(x^k) = o(t_k), \tag{15} \]

where the first equality follows from the Mean Value Theorem, while the second one comes from Theorem 10.4 of [23]. Clearly, x^k → x*. This, together with the assumption that x* is a local minimizer of (9), implies that

\[ f(x^k) + P(x^k) \ge f(x^*) + P(x^*) \tag{16} \]

when k is sufficiently large. In view of (15) and (16), we obtain that

\[
\begin{aligned}
d^T \nabla f(x^*) + P'(x^*; d)
&= \lim_{k\to\infty} \frac{f(x^* + t_k d) - f(x^*)}{t_k} + \lim_{k\to\infty} \frac{P(x^* + t_k d) - P(x^*)}{t_k} \\
&= \lim_{k\to\infty} \left( \frac{f(x^k) + P(x^k) - f(x^*) - P(x^*)}{t_k} + \frac{o(t_k)}{t_k} \right) \\
&= \lim_{k\to\infty} \frac{f(x^k) + P(x^k) - f(x^*) - P(x^*)}{t_k} \ \ge\ 0,
\end{aligned}
\]

and hence (14) holds.
For simplicity of notation, let T_Ω° = (T_Ω(x*))° and S = −∇f(x*) − ∂P(x*). We next show that S ∩ T_Ω° ≠ ∅. Suppose for contradiction that S ∩ T_Ω° = ∅. This, together with the fact that S and T_Ω° are nonempty closed convex sets and S is bounded, implies that there exists some d ∈ ℜⁿ such that dᵀy ≤ 0 for any y ∈ T_Ω°, and dᵀy ≥ 1 for any y ∈ S. Clearly, we see that d ∈ (T_Ω°)° = T_Ω(x*), and

\[
1 \le \inf_{y \in S} d^T y = \inf_{z \in \partial P(x^*)} d^T(-\nabla f(x^*) - z)
 = -d^T \nabla f(x^*) - \sup_{z \in \partial P(x^*)} d^T z
 = -d^T \nabla f(x^*) - P'(x^*; d),
\]

which contradicts (14). Hence, S ∩ T_Ω° ≠ ∅. Using this relation, (11), the definitions of S and 𝒜(x*), and letting λ_i = 0 for i ∉ 𝒜(x*), we easily see that (12) and (13) hold.

In view of the fact that ∂P(x*) and N_X(x*) are closed and convex, and moreover ∂P(x*) is bounded, we know that ∂P(x*) + N_X(x*) is closed and convex. Using this result, it is straightforward to see that Λ(x*) is closed and convex. We next show that Λ(x*) is bounded. Suppose for contradiction that Λ(x*) is unbounded. Then there exists a sequence {(λ^k, μ^k)}_{k=1}^∞ ⊆ Λ(x*) such that ‖(λ^k, μ^k)‖ → ∞ and

\[ 0 = \nabla f(x^*) + z^k + \sum_{i=1}^{m} \lambda_i^k \nabla g_i(x^*) + \sum_{i=1}^{p} \mu_i^k \nabla h_i(x^*) + v^k \tag{17} \]

for some {z^k}_{k=1}^∞ ⊆ ∂P(x*) and {v^k}_{k=1}^∞ ⊆ N_X(x*). Let (λ̄^k, μ̄^k) = (λ^k, μ^k)/‖(λ^k, μ^k)‖. By passing to a subsequence if necessary, we can assume that (λ̄^k, μ̄^k) → (λ̄, μ̄). We clearly see that ‖(λ̄, μ̄)‖ = 1, λ̄ ∈ ℜᵐ₊, and λ̄_i = 0 for i ∉ 𝒜(x*). Note that ∂P(x*) is bounded and N_X(x*) is a closed cone.
In view of this fact, and upon dividing both sides of (17) by ‖(λ^k, μ^k)‖ and taking limits along a subsequence if necessary, we obtain that

\[ 0 = \sum_{i=1}^{m} \bar\lambda_i \nabla g_i(x^*) + \sum_{i=1}^{p} \bar\mu_i \nabla h_i(x^*) + \bar v \tag{18} \]

for some v̄ ∈ N_X(x*). Since Robinson's condition (10) is satisfied at x*, there exist d ∈ T_X(x*) and v ∈ ℜᵐ such that v_i ≤ 0 for i ∈ 𝒜(x*), and

\[ d^T \nabla g_i(x^*) - v_i = -\bar\lambda_i \quad \forall i \in \mathcal{A}(x^*), \qquad d^T \nabla h_i(x^*) = -\bar\mu_i, \quad i = 1, \ldots, p. \]

Using these relations, (18), and the facts that d ∈ T_X(x*), v̄ ∈ N_X(x*), λ̄ ∈ ℜᵐ₊, and λ̄_i = 0 for i ∉ 𝒜(x*), we have

\[
\sum_{i=1}^{m} \bar\lambda_i^2 + \sum_{i=1}^{p} \bar\mu_i^2
\ \le\ -\sum_{i=1}^{m} \bar\lambda_i\, d^T \nabla g_i(x^*) - \sum_{i=1}^{p} \bar\mu_i\, d^T \nabla h_i(x^*)
\ =\ -d^T \Big( \sum_{i=1}^{m} \bar\lambda_i \nabla g_i(x^*) + \sum_{i=1}^{p} \bar\mu_i \nabla h_i(x^*) \Big)
\ =\ d^T \bar v \ \le\ 0.
\]

It yields (λ̄, μ̄) = (0, 0), which contradicts the identity ‖(λ̄, μ̄)‖ = 1. Thus, Λ(x*) is bounded. ∎

3.2 Augmented Lagrangian method for (9)

For a convex program, it is known that under some mild assumptions, any accumulation point of the sequence generated by the classical augmented Lagrangian method is an optimal solution (e.g., see [ ]). Nevertheless, when problem (9) is a nonconvex program, especially when the function h_i is not affine or g_i is nonconvex, the classical augmented Lagrangian method may not even converge to a feasible point. To alleviate this drawback, we propose a novel augmented Lagrangian method for problem (9) and establish its global convergence in this subsection. Throughout this subsection, we make the following assumption for problem (9).

Assumption 1 Problem (9) is feasible, and moreover at least one feasible solution, denoted by x_feas, is known.
It is well known that for problem (9) the associated augmented Lagrangian function $\mathcal{L}_\varrho(x,\lambda,\mu): \Re^n \times \Re^m \times \Re^p \to \Re$ is given by
$$\mathcal{L}_\varrho(x,\lambda,\mu) := w(x) + P(x), \qquad (19)$$
where
$$w(x) := f(x) + \frac{1}{2\varrho}\big(\|[\lambda + \varrho g(x)]^+\|^2 - \|\lambda\|^2\big) + \mu^T h(x) + \frac{\varrho}{2}\|h(x)\|^2, \qquad (20)$$
and $\varrho > 0$ is a penalty parameter (e.g., see [3, 24]). Roughly speaking, an augmented Lagrangian method, when applied to problem (9), solves a sequence of subproblems of the form $\min_{x \in X} \mathcal{L}_\varrho(x,\lambda,\mu)$ while updating the Lagrangian multipliers $(\lambda,\mu)$ and the penalty parameter $\varrho$. Let $x_{\rm feas}$ be a known feasible point of (9) (see Assumption 1). We now describe the algorithm framework of a novel augmented Lagrangian method as follows.

Algorithm framework of augmented Lagrangian method:

Let $\{\epsilon_k\}$ be a positive decreasing sequence. Let $\lambda^0 \in \Re^m_+$, $\mu^0 \in \Re^p$, $\varrho_0 > 0$, $\tau > 0$, $\sigma > 1$ be given. Choose an arbitrary initial point $x^0_{\rm init} \in X$ and a constant $\Upsilon \ge \max\{f(x_{\rm feas}),\, \mathcal{L}_{\varrho_0}(x^0_{\rm init},\lambda^0,\mu^0)\}$. Set $k = 0$.

1) Find an approximate solution $x^k \in X$ for the subproblem
$$\min_{x \in X} \mathcal{L}_{\varrho_k}(x,\lambda^k,\mu^k) \qquad (21)$$
such that
$$\mathrm{dist}\big({-\nabla w(x^k)},\, \partial P(x^k) + N_X(x^k)\big) \le \epsilon_k, \qquad \mathcal{L}_{\varrho_k}(x^k,\lambda^k,\mu^k) \le \Upsilon. \qquad (22)$$

2) Update the Lagrange multipliers according to
$$\lambda^{k+1} := [\lambda^k + \varrho_k g(x^k)]^+, \qquad \mu^{k+1} := \mu^k + \varrho_k h(x^k). \qquad (23)$$

3) Set $\varrho_{k+1} := \max\big\{\sigma\varrho_k,\; \|\lambda^{k+1}\|^{1+\tau},\; \|\mu^{k+1}\|^{1+\tau}\big\}$.

4) Set $k \leftarrow k+1$ and go to step 1).

end

The above augmented Lagrangian method differs from the classical augmented Lagrangian method in that: i) the values of the augmented Lagrangian functions at the approximate minimizers produced by the method are bounded from above (see step 1)); and ii) the magnitude of the penalty parameters outgrows that of the Lagrangian multipliers (see step 3)).
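To make the framework above concrete, the following is a minimal numerical sketch (our own illustration, not the implementation used in the paper's experiments) on a toy instance of (9) with $P \equiv 0$ and $X = \Re^2$: minimize $\|x\|^2$ subject to $g(x) = -x_1 \le 0$ and $h(x) = x_1 + x_2 - 1 = 0$. A fixed-budget gradient loop stands in for the inner subproblem solvers of Subsection 3.3, and the toy data, step sizes, and iteration budgets are ad hoc assumptions:

```python
import numpy as np

# Toy instance of (9): f(x) = ||x||^2, g(x) = -x1 <= 0, h(x) = x1 + x2 - 1 = 0.
f_grad = lambda x: 2.0 * x
g      = lambda x: np.array([-x[0]])
g_jac  = lambda x: np.array([[-1.0, 0.0]])
h      = lambda x: np.array([x[0] + x[1] - 1.0])
h_jac  = lambda x: np.array([[1.0, 1.0]])

def augmented_lagrangian(x, lam, mu, rho=1.0, sigma=2.0, tau=0.1,
                         outer=8, inner=5000):
    for _ in range(outer):
        # Step 1: approximately minimize L_rho(., lam, mu); the gradient of
        # w(x) in (20) is f'(x) + Jg^T [lam + rho g(x)]_+ + Jh^T (mu + rho h(x)).
        step = 1.0 / (3.0 + 2.0 * rho)      # ad hoc bound on the local Lipschitz const.
        for _ in range(inner):
            grad_w = (f_grad(x)
                      + g_jac(x).T @ np.maximum(lam + rho * g(x), 0.0)
                      + h_jac(x).T @ (mu + rho * h(x)))
            x = x - step * grad_w
        # Step 2: multiplier update (23).
        lam = np.maximum(lam + rho * g(x), 0.0)
        mu  = mu + rho * h(x)
        # Step 3: the penalty parameter outgrows the multipliers.
        rho = max(sigma * rho,
                  np.linalg.norm(lam) ** (1 + tau),
                  np.linalg.norm(mu) ** (1 + tau))
    return x, lam, mu

x, lam, mu = augmented_lagrangian(np.zeros(2), np.zeros(1), np.zeros(1))
```

On this instance the iterates approach the solution $(1/2, 1/2)$ with multiplier $\mu^* = -1$, while $\varrho_k$ grows geometrically as in step 3); the fixed inner budget plays the role of the tolerance $\epsilon_k$ in (22).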
In addition, to make the above augmented Lagrangian method complete, we need to address how to find an approximate solution $x^k \in X$ for subproblem (21) satisfying (22), as required in step 1). We will leave this discussion to the end of this subsection. For the time being, we establish the main convergence result regarding this method for solving problem (9).

Theorem 3.3 Assume that $\epsilon_k \to 0$. Let $\{x^k\}$ be the sequence generated by the above augmented Lagrangian method satisfying (22). Suppose that a subsequence $\{x^k\}_{k \in K}$ converges to $x^*$. Then the following statements hold:

(a) $x^*$ is a feasible point of problem (9);

(b) Further, if Robinson's condition (10) is satisfied at $x^*$, then the subsequence $\{(\lambda^{k+1},\mu^{k+1})\}_{k \in K}$ is bounded, and each accumulation point $(\lambda^*,\mu^*)$ of $\{(\lambda^{k+1},\mu^{k+1})\}_{k \in K}$ is a vector of Lagrange multipliers satisfying the first-order optimality conditions (12)-(13) at $x^*$.

Proof. In view of (19), (20) and the second relation in (22), we have
$$f(x^k) + P(x^k) + \frac{1}{2\varrho_k}\big(\|[\lambda^k + \varrho_k g(x^k)]^+\|^2 - \|\lambda^k\|^2\big) + (\mu^k)^T h(x^k) + \frac{\varrho_k}{2}\|h(x^k)\|^2 \le \Upsilon \quad \forall k.$$
It follows that
$$\|[\lambda^k/\varrho_k + g(x^k)]^+\|^2 + \|h(x^k)\|^2 \le 2\big[\Upsilon - f(x^k) - P(x^k) - (\mu^k)^T h(x^k)\big]/\varrho_k + (\|\lambda^k\|/\varrho_k)^2.$$
Noticing that $\varrho_0 > 0$, $\tau > 0$, and $\varrho_{k+1} = \max\{\sigma\varrho_k, \|\lambda^{k+1}\|^{1+\tau}, \|\mu^{k+1}\|^{1+\tau}\}$ for $k \ge 0$, we observe that $\varrho_k \to \infty$ and $\|(\lambda^k,\mu^k)\|/\varrho_k \to 0$. We also know that $\{x^k\}_{k \in K} \to x^*$, $\{g(x^k)\}_{k \in K} \to g(x^*)$ and $\{h(x^k)\}_{k \in K} \to h(x^*)$. Using these results, and upon taking limits as $k \in K \to \infty$ on both sides of the above inequality, we obtain that
$$\|[g(x^*)]^+\|^2 + \|h(x^*)\|^2 \le 0,$$
which implies that $g(x^*) \le 0$ and $h(x^*) = 0$. We also know that $x^* \in X$. It thus follows that statement (a) holds.

We next show that statement (b) also holds.
Using (21), (19), (20), (23), and the first relation in (22), we have
$$\big\|\nabla f(x^k) + (\lambda^{k+1})^T \nabla g(x^k) + (\mu^{k+1})^T \nabla h(x^k) + z^k + v^k\big\| \le \epsilon_k \qquad (24)$$
for some $z^k \in \partial P(x^k)$ and $v^k \in N_X(x^k)$. Suppose for contradiction that the subsequence $\{(\lambda^{k+1},\mu^{k+1})\}_{k \in K}$ is unbounded. By passing to a subsequence if necessary, we can assume that $\|(\lambda^{k+1},\mu^{k+1})\| \to \infty$ as $k \in K \to \infty$. Let $(\bar\lambda^{k+1},\bar\mu^{k+1}) = (\lambda^{k+1},\mu^{k+1})/\|(\lambda^{k+1},\mu^{k+1})\|$ and $\bar v^k = v^k/\|(\lambda^{k+1},\mu^{k+1})\|$. Recall that $\{x^k\}_{k \in K} \to x^*$. This together with Theorem 6.2.7 of [13] implies that $\cup_{k \in K}\, \partial P(x^k)$ is bounded, and so is $\{z^k\}_{k \in K}$. In addition, $\{g(x^k)\}_{k \in K} \to g(x^*)$ and $\{h(x^k)\}_{k \in K} \to h(x^*)$. Then we can observe from (24) that $\{\bar v^k\}_{k \in K}$ is bounded. Without loss of generality, assume that $\{(\bar\lambda^{k+1},\bar\mu^{k+1})\}_{k \in K} \to (\bar\lambda,\bar\mu)$ and $\{\bar v^k\}_{k \in K} \to \bar v$ (otherwise, one can consider their convergent subsequences). Clearly, $\|(\bar\lambda,\bar\mu)\| = 1$. Dividing both sides of (24) by $\|(\lambda^{k+1},\mu^{k+1})\|$ and taking limits as $k \in K \to \infty$, we obtain that
$$\bar\lambda^T \nabla g(x^*) + \bar\mu^T \nabla h(x^*) + \bar v = 0. \qquad (25)$$
Further, using the identity $\lambda^{k+1} = [\lambda^k + \varrho_k g(x^k)]^+$ and the fact that $\varrho_k \to \infty$ and $\|\lambda^k\|/\varrho_k \to 0$, we observe that $\lambda^{k+1} \in \Re^m_+$ and $\lambda^{k+1}_i = 0$ for $i \notin A(x^*)$ when $k \in K$ is sufficiently large, which imply that $\bar\lambda \in \Re^m_+$ and $\bar\lambda_i = 0$ for $i \notin A(x^*)$. Moreover, we have $\bar v \in N_X(x^*)$ since $N_X(x^*)$ is a closed cone. Using these results, (25), Robinson's condition (10) at $x^*$, and a similar argument to that in the proof of Theorem 3.2, we can obtain that $(\bar\lambda,\bar\mu) = (0,0)$, which contradicts the identity $\|(\bar\lambda,\bar\mu)\| = 1$. Therefore, the subsequence $\{(\lambda^{k+1},\mu^{k+1})\}_{k \in K}$ is bounded. Using this result together with (24) and the fact that $\{z^k\}_{k \in K}$ is bounded, we immediately see that $\{v^k\}_{k \in K}$ is bounded.
Using semicontinuity of $\partial P(\cdot)$ and $N_X(\cdot)$ (see Theorem 24.4 of [23] and Lemma 2.42 of [24]) and the fact that $\{x^k\}_{k \in K} \to x^*$, we conclude that every accumulation point of $\{z^k\}_{k \in K}$ and $\{v^k\}_{k \in K}$ belongs to $\partial P(x^*)$ and $N_X(x^*)$, respectively. Using these results and (24), we further see that for every accumulation point $(\lambda^*,\mu^*)$ of $\{(\lambda^{k+1},\mu^{k+1})\}_{k \in K}$, there exist some $z^* \in \partial P(x^*)$ and $v^* \in N_X(x^*)$ such that
$$\nabla f(x^*) + (\lambda^*)^T \nabla g(x^*) + (\mu^*)^T \nabla h(x^*) + z^* + v^* = 0.$$
Moreover, using the identity $\lambda^{k+1} = [\lambda^k + \varrho_k g(x^k)]^+$ and the fact that $\varrho_k \to \infty$ and $\|\lambda^k\|/\varrho_k \to 0$, we easily see that $\lambda^* \in \Re^m_+$ and $\lambda^*_i = 0$ for $i \notin A(x^*)$. Thus, $(\lambda^*,\mu^*)$ satisfies the first-order optimality conditions (12)-(13) at $x^*$.

Before ending this subsection, we briefly discuss how to find an approximate solution $x^k \in X$ for subproblem (21) satisfying (22), as required in step 1) of the above augmented Lagrangian method. In particular, we are interested in applying the nonmonotone gradient methods proposed in Subsection 3.3 to (21). As shown in Subsection 3.3 (see Theorems 3.9 and 3.13), these methods are able to find an approximate solution $x^k \in X$ satisfying the first relation of (22). Moreover, if an initial point for these methods is properly chosen, the obtained approximate solution $x^k$ also satisfies the second relation of (22). For example, given $k \ge 0$, let $x^k_{\rm init} \in X$ denote the initial point for solving the $k$th subproblem (21), and define $x^k_{\rm init}$ for $k \ge 1$ as follows:
$$x^k_{\rm init} = \begin{cases} x_{\rm feas}, & \text{if } \mathcal{L}_{\varrho_k}(x^{k-1},\lambda^k,\mu^k) > \Upsilon; \\ x^{k-1}, & \text{otherwise}, \end{cases}$$
where $x^{k-1}$ is the approximate solution to the $(k-1)$st subproblem (21) satisfying (22) (with $k$ replaced by $k-1$). Recall from Assumption 1 that $x_{\rm feas}$ is a feasible solution of (9).
Thus, $g(x_{\rm feas}) \le 0$ and $h(x_{\rm feas}) = 0$, which together with (19), (20) and the definition of $\Upsilon$ implies that
$$\mathcal{L}_{\varrho_k}(x_{\rm feas},\lambda^k,\mu^k) \le f(x_{\rm feas}) \le \Upsilon.$$
It follows from this inequality and the above choice of $x^k_{\rm init}$ that $\mathcal{L}_{\varrho_k}(x^k_{\rm init},\lambda^k,\mu^k) \le \Upsilon$. Additionally, the nonmonotone gradient methods proposed in Subsection 3.3 possess the natural property that the objective function values at all subsequent iterates are bounded above by the one at the initial point. Therefore, we have
$$\mathcal{L}_{\varrho_k}(x^k,\lambda^k,\mu^k) \le \mathcal{L}_{\varrho_k}(x^k_{\rm init},\lambda^k,\mu^k) \le \Upsilon,$$
and so the second relation of (22) is satisfied at $x^k$.

3.3 Nonmonotone gradient methods for nonsmooth minimization

In this subsection we propose two nonmonotone gradient methods for minimizing a class of nonsmooth functions over a closed convex set, which can be suitably applied to the subproblems arising in our augmented Lagrangian method detailed in Subsection 3.2. We also establish global convergence and a local linear rate of convergence for these methods. It should be mentioned that these two methods are closely related to the ones proposed in [27] and [26], but they are not the same (see the remarks below for details). In addition, these methods can be viewed as an extension of the well-known projected gradient methods studied in [4] for smooth problems, but the methods proposed in [27] and [26] cannot.

Throughout this subsection, we consider the following problem:
$$\min_{x \in X} \{F(x) := f(x) + P(x)\}, \qquad (26)$$
where $f: \Re^n \to \Re$ is continuously differentiable, $P: \Re^n \to \Re$ is convex but not necessarily smooth, and $X \subseteq \Re^n$ is closed and convex. We say that $x \in \Re^n$ is a stationary point of problem (26) if $x \in X$ and
$$0 \in \nabla f(x) + \partial P(x) + N_X(x). \qquad (27)$$
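For intuition, in the common special case $P(x) = c\|x\|_1$ with $X = \Re^n$, the stationarity condition above can be tested computationally through the proximal residual $x - \mathrm{prox}_{c\|\cdot\|_1}(x - \nabla f(x))$, which vanishes exactly at the stationary points of this composite problem. The following sketch (the toy data are our own assumption) illustrates this check:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1 (componentwise soft thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def stationarity_residual(x, grad_fx, c):
    """Residual x - prox_{c||.||_1}(x - grad f(x)); it is zero exactly at
    stationary points of f + c||.||_1 over X = R^n."""
    return x - soft_threshold(x - grad_fx, c)

# Toy check on f(x) = 0.5*||x - b||^2: the minimizer of f + c||.||_1 is
# the soft-thresholded vector soft_threshold(b, c).
b = np.array([3.0, 0.2, -1.5])
c = 1.0
x_star = soft_threshold(b, c)
r_star = stationarity_residual(x_star, x_star - b, c)   # vanishes at x_star
r_zero = stationarity_residual(np.zeros(3), -b, c)      # nonzero at the origin
```

This residual is also the negative of the search direction introduced next, specialized to $H = I$.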
Given a point $x \in \Re^n$ and $H \succ 0$, we denote by $d_H(x)$ the solution of the following problem:
$$d_H(x) := \arg\min_d \left\{\nabla f(x)^T d + \tfrac{1}{2} d^T H d + P(x+d) : x + d \in X\right\}. \qquad (28)$$
The following lemma provides an alternative characterization of stationarity that will be used in our subsequent analysis.

Lemma 3.4 For any $H \succ 0$, $x \in X$ is a stationary point of problem (26) if and only if $d_H(x) = 0$.

Proof. We first observe that (28) is a convex problem, and moreover its objective function is strictly convex. The conclusion of this lemma immediately follows from this observation and the first-order optimality condition of (28).

The next lemma shows that $\|d_H(x)\|$ does not change too rapidly with $H$. It will be used to prove Theorems 3.10 and 3.14.

Lemma 3.5 For any $x \in \Re^n$, $H \succ 0$, and $\tilde H \succ 0$, let $d = d_H(x)$ and $\tilde d = d_{\tilde H}(x)$. Then
$$\|\tilde d\| \;\le\; \frac{1 + \lambda_{\max}(Q) + \sqrt{1 - 2\lambda_{\min}(Q) + \lambda_{\max}(Q)^2}}{2\,\lambda_{\min}(\tilde H)}\;\lambda_{\max}(H)\,\|d\|, \qquad (29)$$
where $Q = H^{-1/2}\tilde H H^{-1/2}$.

Proof. The conclusion immediately follows from Lemma 3.2 of [26] with $J = \{1,\ldots,n\}$, $c = 1$, and $P(x) := P(x) + I_X(x)$, where $I_X$ is the indicator function of $X$.

The following lemma will be used to prove Theorems 3.10 and 3.14.

Lemma 3.6 Given $x \in \Re^n$ and $H \succ 0$, let $g = \nabla f(x)$ and $\Delta_d = g^T d + P(x+d) - P(x)$ for all $d \in \Re^n$. Let $\sigma \in (0,1)$ be given. The following statements hold:

(a) If $d = d_H(x)$, then $-\Delta_d \ge d^T H d \ge \lambda_{\min}(H)\|d\|^2$.

(b) For any $\bar x \in \Re^n$, $\alpha \in (0,1]$, $d = d_H(x)$, and $x' = x + \alpha d$,
$$(g + Hd)^T(x' - \bar x) + P(x') - P(\bar x) \le (\alpha - 1)(d^T H d + \Delta_d).$$

(c) If $f$ satisfies
$$\|\nabla f(y) - \nabla f(z)\| \le L\|y - z\| \quad \forall y, z \in \Re^n \qquad (30)$$
for some $L > 0$, then the descent condition $F(x + \alpha d) \le F(x) + \sigma\alpha\Delta_d$ is satisfied for $d = d_H(x)$, provided $0 \le \alpha \le \min\{1, 2(1-\sigma)\lambda_{\min}(H)/L\}$.
(d) If $f$ satisfies (30), then the descent condition $F(x+d) \le F(x) + \sigma\Delta_d$ is satisfied for $d = d_{H(\alpha)}(x)$, where $H(\alpha) = \alpha H$, provided $\alpha \ge L/[2(1-\sigma)\lambda_{\min}(H)]$.

Proof. Statements (a)-(c) follow from Theorem 4.1(a) and Lemma 3.4 of [26] with $J = \{1,\ldots,n\}$, $\gamma = 0$, and $\lambda = \lambda_{\min}(H)$. We now prove statement (d). Let $d = d_{H(\alpha)}(x)$, where $H(\alpha) = \alpha H$ for $\alpha > 0$. It follows from statement (a) that $\|d\|^2 \le -\Delta_d/(\alpha\lambda_{\min}(H))$. Using this relation, (30), and the definitions of $F$ and $\Delta_d$, we have
$$F(x+d) - F(x) = f(x+d) - f(x) + P(x+d) - P(x) = \nabla f(x)^T d + P(x+d) - P(x) + \int_0^1 d^T\big(\nabla f(x+td) - \nabla f(x)\big)\,dt$$
$$\le \Delta_d + \int_0^1 \|\nabla f(x+td) - \nabla f(x)\|\,\|d\|\,dt \;\le\; \Delta_d + L\|d\|^2/2 \;\le\; \big[1 - L/(2\alpha\lambda_{\min}(H))\big]\,\Delta_d,$$
which together with $\alpha \ge L/[2(1-\sigma)\lambda_{\min}(H)]$ immediately implies that statement (d) holds.

We now present the first nonmonotone gradient method for (26) as follows.

Nonmonotone gradient method I:

Choose parameters $\eta > 1$, $0 < \sigma < 1$, $0 < \underline{\alpha} < \bar\alpha$, $0 < \underline{\lambda} \le \bar\lambda$, and integer $M \ge 0$. Set $k = 0$ and choose $x^0 \in X$.

1) Choose $\alpha^0_k \in [\underline{\alpha}, \bar\alpha]$ and $\underline{\lambda} I \preceq H_k \preceq \bar\lambda I$.

2) For $j = 0, 1, \ldots$

2a) Let $\alpha_k = \alpha^0_k \eta^j$. Solve (28) with $x = x^k$ and $H = \alpha_k H_k$ to obtain $d^k = d_H(x)$.

2b) If $d^k$ satisfies
$$F(x^k + d^k) \le \max_{[k-M]^+ \le i \le k} F(x^i) + \sigma\Delta_k, \qquad (31)$$
go to step 3), where
$$\Delta_k := \nabla f(x^k)^T d^k + P(x^k + d^k) - P(x^k). \qquad (32)$$

3) Set $x^{k+1} = x^k + d^k$ and $k \leftarrow k+1$.

end

Remark. The above method is closely related to the one proposed in [27]. They differ from each other only in that distinct $\Delta_k$'s are used in (31). Indeed, the latter method uses $\Delta_k = -\alpha_k\|d^k\|^2/2$. Nevertheless, for global convergence, the method of [27] requires the strong assumption that the function $f$ is Lipschitz continuously differentiable, which is not required for our method (see Theorem 3.9).
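To make the method concrete, the following is a minimal sketch of nonmonotone gradient method I in the special case $H_k = I$, $X = \Re^n$, $P = c\|\cdot\|_1$, where subproblem (28) reduces to a soft-thresholding step; the lasso-type test instance and all parameter values are our own choices for illustration:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def nmg_method_I(x, f, grad_f, c, eta=2.0, sigma=0.1, alpha0=1.0, M=2, iters=100):
    """Sketch of nonmonotone gradient method I for F = f + c*||.||_1 with
    H_k = I and X = R^n, so step 2a) is soft thresholding with H = alpha_k I."""
    F = lambda z: f(z) + c * np.abs(z).sum()
    hist = [F(x)]                                    # recent objective values
    for _ in range(iters):
        alpha = alpha0                               # alpha_k = alpha0 * eta^j
        while True:
            u = soft_threshold(x - grad_f(x) / alpha, c / alpha)  # solves (28)
            d = u - x
            delta = grad_f(x) @ d + c * (np.abs(u).sum() - np.abs(x).sum())  # (32)
            if F(x + d) <= max(hist[-(M + 1):]) + sigma * delta:  # test (31)
                break
            alpha *= eta                             # step 2a): increase j
        x = x + d                                    # step 3): full step
        hist.append(F(x))
    return x

# Toy instance (our own): a small lasso-type problem.
A = np.array([[1.0, 0.3], [0.2, 2.0], [0.5, 0.1]])
b = np.array([1.0, -2.0, 0.3])
f = lambda z: 0.5 * np.sum((A @ z - b) ** 2)
grad_f = lambda z: A.T @ (A @ z - b)
x = nmg_method_I(np.zeros(2), f, grad_f, c=0.1)
```

The inner loop increases $\alpha_k$ by the factor $\eta$ until the nonmonotone acceptance test (31) holds; by Lemma 3.6(c)-(d) it terminates once $\alpha_k$ exceeds a multiple of the Lipschitz constant of $\nabla f$.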
In addition, our method can be viewed as an extension of the first projected gradient method (namely, SPG1) studied in [4] for smooth problems, but their method cannot. Finally, local convergence is established for our method (see Theorem 3.10), but was not studied for their method.

We next prove global convergence of the nonmonotone gradient method I. Before proceeding, we establish two technical lemmas below. The first lemma shows that if $x^k \in X$ is a nonstationary point, there exists an $\alpha_k > 0$ in step 2a) such that (31) is satisfied, and hence the above method is well defined.

Lemma 3.7 Suppose that $H_k \succ 0$ and $x^k \in X$ is a nonstationary point of problem (26). Then there exists $\tilde\alpha > 0$ such that $d^k = d_{H_k(\alpha_k)}(x^k)$, where $H_k(\alpha_k) = \alpha_k H_k$, satisfies (31) whenever $\alpha_k \ge \tilde\alpha$.

Proof. For simplicity of notation, let $d(\alpha) = d_{H_k(\alpha)}(x^k)$, where $H_k(\alpha) = \alpha H_k$ for any $\alpha > 0$. Then it follows from (28) that for all $\alpha > 0$,
$$\alpha\|d(\alpha)\| \;\le\; \frac{-2\big[\nabla f(x^k)^T d(\alpha) + P(x^k + d(\alpha)) - P(x^k)\big]}{\lambda_{\min}(H_k)\,\|d(\alpha)\|} \;\le\; \frac{-2F'(x^k; d(\alpha)/\|d(\alpha)\|)}{\lambda_{\min}(H_k)}. \qquad (33)$$
Thus, we easily see that the set $\tilde S := \{\alpha\|d(\alpha)\| : \alpha > 0\}$ is bounded. It implies that $\|d(\alpha)\| \to 0$ as $\alpha \to \infty$. We claim that
$$\liminf_{\alpha \to \infty}\, \alpha\|d(\alpha)\| > 0. \qquad (34)$$
Suppose not. Then there exists a sequence $\{\bar\alpha_l\} \uparrow \infty$ such that $\bar\alpha_l\|d(\bar\alpha_l)\| \to 0$ as $l \to \infty$. Invoking that $d(\bar\alpha_l)$ is the optimal solution of (28) with $x = x^k$ and $H = \bar\alpha_l H_k$, we have
$$0 \in \nabla f(x^k) + \bar\alpha_l H_k d(\bar\alpha_l) + \partial P(x^k + d(\bar\alpha_l)) + N_X(x^k + d(\bar\alpha_l)).$$
Upon taking limits on both sides as $l \to \infty$, and using semicontinuity of $\partial P(\cdot)$ and $N_X(\cdot)$ (see Theorem 24.4 of [23] and Lemma 2.42 of [24]) and the relations $\|d(\bar\alpha_l)\| \to 0$ and $\bar\alpha_l\|d(\bar\alpha_l)\| \to 0$, we see that (27) holds at $x^k$, which contradicts the nonstationarity of $x^k$.
Hence, (34) holds. We observe that $\alpha d(\alpha)^T H_k d(\alpha) \ge \lambda_{\min}(H_k)\,\alpha\|d(\alpha)\|^2$, which together with (34) and $H_k \succ 0$ implies that $\|d(\alpha)\| = O\big(\alpha d(\alpha)^T H_k d(\alpha)\big)$ as $\alpha \to \infty$. In addition, using the boundedness of $\tilde S$ and $H_k \succ 0$, we have $\alpha d(\alpha)^T H_k d(\alpha) = O(\|d(\alpha)\|)$ as $\alpha \to \infty$. Thus, we obtain that
$$\alpha d(\alpha)^T H_k d(\alpha) = \Theta(\|d(\alpha)\|) \quad \text{as } \alpha \to \infty. \qquad (35)$$
This relation together with Lemma 3.6(a) implies that
$$P(x^k) - \nabla f(x^k)^T d(\alpha) - P(x^k + d(\alpha)) \;\ge\; \alpha d(\alpha)^T H_k d(\alpha) = \Theta(\|d(\alpha)\|). \qquad (36)$$
Using this result and the relation $\|d(\alpha)\| \to 0$ as $\alpha \to \infty$, we further have
$$F(x^k + d(\alpha)) - \max_{[k-M]^+ \le i \le k} F(x^i) \;\le\; F(x^k + d(\alpha)) - F(x^k) = f(x^k + d(\alpha)) - f(x^k) + P(x^k + d(\alpha)) - P(x^k)$$
$$= \nabla f(x^k)^T d(\alpha) + P(x^k + d(\alpha)) - P(x^k) + o(\|d(\alpha)\|) \;\le\; \sigma\big[\nabla f(x^k)^T d(\alpha) + P(x^k + d(\alpha)) - P(x^k)\big], \qquad (37)$$
provided $\alpha$ is sufficiently large. It implies that the conclusion holds.

The following lemma shows that the search directions $\{d^k\}$ approach zero, and that the sequence of objective function values $\{F(x^k)\}$ converges.

Lemma 3.8 Suppose that $F$ is bounded below in $X$ and uniformly continuous in the level set $L = \{x \in X : F(x) \le F(x^0)\}$. Then the sequence $\{x^k\}$ generated by the nonmonotone gradient method I satisfies $\lim_{k \to \infty} d^k = 0$. Moreover, the sequence $\{F(x^k)\}$ converges.

Proof. We first observe that $\{x^k\} \subseteq L$. Let $l(k)$ be an integer such that $[k-M]^+ \le l(k) \le k$ and
$$F(x^{l(k)}) = \max\{F(x^i) : [k-M]^+ \le i \le k\}$$
for all $k \ge 0$. We clearly observe that $F(x^{k+1}) \le F(x^{l(k)})$ for all $k \ge 0$, which together with the definition of $l(k)$ implies that the sequence $\{F(x^{l(k)})\}$ is monotonically nonincreasing. Further, since $F$ is bounded below in $X$, we have
$$\lim_{k \to \infty} F(x^{l(k)}) = F^* \qquad (38)$$
for some $F^* \in \Re$.
We next prove by induction that the following limits hold for all $j \ge 1$:
$$\lim_{k \to \infty} d^{l(k)-j} = 0, \qquad \lim_{k \to \infty} F(x^{l(k)-j}) = F^*. \qquad (39)$$
Using (31) and (32) with $k$ replaced by $l(k)-1$, we obtain that
$$F(x^{l(k)}) \le F(x^{l(l(k)-1)}) + \sigma\Delta_{l(k)-1}. \qquad (40)$$
Replacing $k$ and $\alpha$ by $l(k)-1$ and $\alpha_{l(k)-1}$ in (36), respectively, and using $H_{l(k)-1} \succeq \underline{\lambda} I$ and the definition of $\Delta_{l(k)-1}$ (see (32)), we have $\Delta_{l(k)-1} \le -\underline{\lambda}\,\alpha_{l(k)-1}\|d^{l(k)-1}\|^2$. The above two inequalities yield
$$F(x^{l(k)}) \le F(x^{l(l(k)-1)}) - \sigma\underline{\lambda}\,\alpha_{l(k)-1}\|d^{l(k)-1}\|^2, \qquad (41)$$
which together with (38) implies that $\lim_{k \to \infty} \alpha_{l(k)-1}\|d^{l(k)-1}\|^2 = 0$. Further, noticing that $\alpha_k \ge \underline{\alpha}$ for all $k$, we obtain that $\lim_{k \to \infty} d^{l(k)-1} = 0$. Using this result and (38), we have
$$\lim_{k \to \infty} F(x^{l(k)-1}) = \lim_{k \to \infty} F(x^{l(k)} - d^{l(k)-1}) = \lim_{k \to \infty} F(x^{l(k)}) = F^*, \qquad (42)$$
where the second equality follows from the uniform continuity of $F$ in $L$. Therefore, (39) holds for $j = 1$. We now need to show that if (39) holds for $j$, then it also holds for $j+1$. Using a similar argument to that leading to (41), we have
$$F(x^{l(k)-j}) \le F(x^{l(l(k)-j-1)}) - \sigma\underline{\lambda}\,\alpha_{l(k)-j-1}\|d^{l(k)-j-1}\|^2,$$
which together with (38), the induction assumption $\lim_{k \to \infty} F(x^{l(k)-j}) = F^*$, and the fact that $\alpha_{l(k)-j-1} \ge \underline{\alpha}$ for all $k$, yields $\lim_{k \to \infty} d^{l(k)-j-1} = 0$. Using this result, the induction assumption $\lim_{k \to \infty} F(x^{l(k)-j}) = F^*$, and a similar argument to that leading to (42), we can show that $\lim_{k \to \infty} F(x^{l(k)-j-1}) = F^*$. Hence, (39) holds for $j+1$. Finally, we will prove that $\lim_{k \to \infty} d^k = 0$ and $\lim_{k \to \infty} F(x^k) = F^*$.
By the definition of $l(k)$, we see that for $k \ge M+1$, $k - M - 1 = l(k) - j$ for some $1 \le j \le M+1$, which together with the first limit in (39) implies that $\lim_{k \to \infty} d^k = \lim_{k \to \infty} d^{k-M-1} = 0$. Additionally, we observe that
$$x^{l(k)} = x^{k-M-1} + \sum_{j=1}^{\bar l_k} d^{l(k)-j} \quad \forall k \ge M+1,$$
where $\bar l_k = l(k) - (k - M - 1) \le M+1$. Using the above identity, (39), and the uniform continuity of $F$ in $L$, we see that $\lim_{k \to \infty} F(x^k) = \lim_{k \to \infty} F(x^{k-M-1}) = F^*$. Thus, the conclusion of this lemma holds.

We are now ready to show that the nonmonotone gradient method I is globally convergent.

Theorem 3.9 Suppose that $F$ is bounded below in $X$ and uniformly continuous in the level set $L = \{x \in X : F(x) \le F(x^0)\}$. Then any accumulation point of the sequence $\{x^k\}$ generated by the nonmonotone gradient method I is a stationary point of (26).

Proof. Suppose for contradiction that $x^*$ is an accumulation point of $\{x^k\}$ that is a nonstationary point of (26). Let $K$ be a subsequence such that $\{x^k\}_{k \in K} \to x^*$. We first claim that $\{\alpha_k\}_{k \in K}$ is bounded. Suppose not. Then there exists a subsequence of $\{\alpha_k\}_{k \in K}$ that goes to $\infty$. Without loss of generality, we assume that $\{\alpha_k\}_{k \in K} \to \infty$. For simplicity of notation, let $\bar\alpha_k = \alpha_k/\eta$ and $d^k(\alpha) = d_{H_k(\alpha)}(x^k)$ for $k \in K$ and $\alpha > 0$, where $H_k(\alpha) = \alpha H_k$. Since $\{\alpha_k\}_{k \in K} \to \infty$ and $\alpha^0_k \le \bar\alpha$, there exists some index $\bar k \ge 0$ such that $\alpha_k > \alpha^0_k$ for all $k \in K$ with $k \ge \bar k$.
By the particular choice of $\alpha_k$ specified in steps 2a) and 2b), we have
$$F(x^k + d^k(\bar\alpha_k)) > \max_{[k-M]^+ \le i \le k} F(x^i) + \sigma\big[\nabla f(x^k)^T d^k(\bar\alpha_k) + P(x^k + d^k(\bar\alpha_k)) - P(x^k)\big]. \qquad (43)$$
Using a similar argument to that leading to (33), we have
$$\bar\alpha_k\|d^k(\bar\alpha_k)\| \le \frac{-2F'(x^k; d^k(\bar\alpha_k)/\|d^k(\bar\alpha_k)\|)}{\lambda_{\min}(H_k)} \quad \forall k \in K,$$
which along with the relations $H_k \succeq \underline{\lambda} I$ and $\{x^k\}_{k \in K} \to x^*$ implies that $\{\bar\alpha_k\|d^k(\bar\alpha_k)\|\}_{k \in K}$ is bounded. Since $\{\bar\alpha_k\}_{k \in K} \to \infty$, we further have $\{\|d^k(\bar\alpha_k)\|\}_{k \in K} \to 0$. We now claim that
$$\liminf_{k \in K,\, k \to \infty} \bar\alpha_k\|d^k(\bar\alpha_k)\| > 0. \qquad (44)$$
Suppose not. By passing to a subsequence if necessary, we can assume that $\{\bar\alpha_k\|d^k(\bar\alpha_k)\|\}_{k \in K} \to 0$. Invoking that $d^k(\bar\alpha_k)$ is the optimal solution of (28) with $x = x^k$ and $H = \bar\alpha_k H_k$, we have
$$0 \in \nabla f(x^k) + \bar\alpha_k H_k d^k(\bar\alpha_k) + \partial P(x^k + d^k(\bar\alpha_k)) + N_X(x^k + d^k(\bar\alpha_k)) \quad \forall k \in K.$$
Upon taking limits on both sides as $k \in K \to \infty$, and using semicontinuity of $\partial P(\cdot)$ and $N_X(\cdot)$ (see Theorem 24.4 of [23] and Lemma 2.42 of [24]) and the relations $\underline{\lambda} I \preceq H_k \preceq \bar\lambda I$, $\{\|d^k(\bar\alpha_k)\|\}_{k \in K} \to 0$, $\{\bar\alpha_k\|d^k(\bar\alpha_k)\|\}_{k \in K} \to 0$ and $\{x^k\}_{k \in K} \to x^*$, we see that (27) holds at $x^*$, which contradicts the nonstationarity of $x^*$. Thus, (44) holds.

Now, using (44), the relation $H_k \succeq \underline{\lambda} I$, the boundedness of $\{\bar\alpha_k\|d^k(\bar\alpha_k)\|\}_{k \in K}$, and a similar argument to that leading to (35), we observe that $\bar\alpha_k d^k(\bar\alpha_k)^T H_k d^k(\bar\alpha_k) = \Theta(\|d^k(\bar\alpha_k)\|)$ as $k \in K \to \infty$. Using this result and a similar argument to that leading to (37), we have
$$F(x^k + d^k(\bar\alpha_k)) \le \max_{[k-M]^+ \le i \le k} F(x^i) + \sigma\big[\nabla f(x^k)^T d^k(\bar\alpha_k) + P(x^k + d^k(\bar\alpha_k)) - P(x^k)\big],$$
provided that $k \in K$ is sufficiently large. The above inequality evidently contradicts (43). Thus, $\{\alpha_k\}_{k \in K}$ is bounded.
Finally, invoking that $d^k = d^k(\alpha_k)$ is the optimal solution of (28) with $x = x^k$ and $H = \alpha_k H_k$, we have
$$0 \in \nabla f(x^k) + \alpha_k H_k d^k + \partial P(x^k + d^k) + N_X(x^k + d^k) \quad \forall k \in K. \qquad (45)$$
By Lemma 3.8, we have $\{d^k\}_{k \in K} \to 0$. Upon taking limits on both sides of (45) as $k \in K \to \infty$, and using semicontinuity of $\partial P(\cdot)$ and $N_X(\cdot)$ (see Theorem 24.4 of [23] and Lemma 2.42 of [24]) and the relations $\underline{\lambda} I \preceq H_k \preceq \bar\lambda I$, $\{d^k\}_{k \in K} \to 0$ and $\{x^k\}_{k \in K} \to x^*$, we see that (27) holds at $x^*$, which contradicts the nonstationarity of $x^*$ assumed at the beginning of this proof. Therefore, the conclusion of this theorem holds.

We next analyze the asymptotic convergence rate of the nonmonotone gradient method I under the following assumption, which is the same as that made in [26]. In what follows, we denote by $\bar X$ the set of stationary points of problem (26).

Assumption 2 (a) $\bar X \neq \emptyset$ and, for any $\zeta \ge \min_{x \in X} F(x)$, there exist $\tau > 0$ and $\epsilon > 0$ such that
$$\mathrm{dist}(x, \bar X) \le \tau\|d_I(x)\| \quad \text{whenever } F(x) \le \zeta,\ \|d_I(x)\| \le \epsilon.$$
(b) There exists $\delta > 0$ such that $\|x - y\| \ge \delta$ whenever $x \in \bar X$, $y \in \bar X$, $F(x) \neq F(y)$.

We are now ready to establish a local linear rate of convergence for the nonmonotone gradient method I described above. The proof of the following theorem is inspired by the work of Tseng and Yun [26], who analyzed a similar local convergence result for a coordinate gradient descent method for a class of nonsmooth minimization problems.

Theorem 3.10 Suppose that $f$ satisfies (30), and that $F$ is bounded below in $X$ and uniformly continuous in the level set $L = \{x \in X : F(x) \le F(x^0)\}$. Then the sequence $\{x^k\}$ generated by the nonmonotone gradient method I satisfies
$$F(x^{l(k)}) - F^* \le c\,\big(F(x^{l(l(k)-1)}) - F^*\big),$$
provided $k$ is sufficiently large, where $F^* = \lim_{k \to \infty} F(x^k)$ (see Lemma 3.8) and $c$ is some constant in $(0,1)$.
Proof. Invoking $\alpha^0_k \le \bar\alpha$ and the specific choice of $\alpha_k$, we see from Lemma 3.6(d) that $\hat\alpha := \sup_k \alpha_k < \infty$. Let $H_k(\alpha) = \alpha H_k$. Then it follows from $\underline{\lambda} I \preceq H_k \preceq \bar\lambda I$ and $\alpha_k \ge \underline{\alpha}$ that $(\underline{\alpha}\,\underline{\lambda}) I \preceq H_k(\alpha_k) \preceq \hat\alpha\bar\lambda I$. Using this relation, Lemma 3.5, $H_k \succeq \underline{\lambda} I$, and $d^k = d_{H_k(\alpha_k)}(x^k)$, we obtain that
$$\|d_I(x^k)\| = \Theta(\|d^k\|), \qquad (46)$$
which together with Lemma 3.8 implies $\{d_I(x^k)\} \to 0$. Thus, for any $\epsilon > 0$, there exists some index $\bar k$ such that $\|d_I(x^{l(k)-1})\| \le \epsilon$ for all $k \ge \bar k$. In addition, we clearly observe that $F(x^{l(k)-1}) \le F(x^0)$. Then, by Assumption 2(a) and (46), there exists some index $k'$ such that
$$\|x^{l(k)-1} - \bar x^{l(k)-1}\| \le c_1\|d^{l(k)-1}\| \quad \forall k \ge k' \qquad (47)$$
for some $c_1 > 0$ and $\bar x^{l(k)-1} \in \bar X$. Note that
$$\|x^{l(k+1)-1} - x^{l(k)-1}\| \le \sum_{i=l(k)-1}^{l(k+1)-2} \|d^i\| \le \sum_{i=[k-M-1]^+}^{[k-1]^+} \|d^i\|,$$
which together with $\{d^k\} \to 0$ implies that $\|x^{l(k+1)-1} - x^{l(k)-1}\| \to 0$. Using this result, (47), and Lemma 3.8, we obtain
$$\|\bar x^{l(k+1)-1} - \bar x^{l(k)-1}\| \le \|x^{l(k+1)-1} - \bar x^{l(k+1)-1}\| + \|x^{l(k)-1} - \bar x^{l(k)-1}\| + \|x^{l(k+1)-1} - x^{l(k)-1}\|$$
$$\le c_1\|d^{l(k+1)-1}\| + c_1\|d^{l(k)-1}\| + \|x^{l(k+1)-1} - x^{l(k)-1}\| \to 0.$$
It follows from this relation and Assumption 2(b) that there exist an index $\hat k \ge k'$ and $v \in \Re$ such that
$$F(\bar x^{l(k)-1}) = v \quad \forall k \ge \hat k. \qquad (48)$$
Then, by Lemma 5.1 of [26], we see that
$$F^* = \lim_{k \to \infty} F(x^k) = \liminf_{k \to \infty} F(x^{l(k)-1}) \ge v. \qquad (49)$$
Further, using the definition of $F$, (30), (48), Lemma 3.6(b), and $H_k(\alpha_k) \preceq \hat\alpha\bar\lambda I$, we have, for $k \ge \hat k$,
$$F(x^{l(k)}) - v = f(x^{l(k)}) + P(x^{l(k)}) - f(\bar x^{l(k)-1}) - P(\bar x^{l(k)-1}) = \nabla f(\tilde x^k)^T(x^{l(k)} - \bar x^{l(k)-1}) + P(x^{l(k)}) - P(\bar x^{l(k)-1})$$
$$= \big(\nabla f(\tilde x^k) - \nabla f(x^{l(k)-1})\big)^T(x^{l(k)} - \bar x^{l(k)-1}) - \big(H_{l(k)-1}(\alpha_{l(k)-1})\, d^{l(k)-1}\big)^T(x^{l(k)} - \bar x^{l(k)-1})$$
$$\quad + \big(\nabla f(x^{l(k)-1}) + H_{l(k)-1}(\alpha_{l(k)-1})\, d^{l(k)-1}\big)^T(x^{l(k)} - \bar x^{l(k)-1}) + P(x^{l(k)}) - P(\bar x^{l(k)-1})$$
$$\le L\|\tilde x^k - x^{l(k)-1}\|\,\|x^{l(k)} - \bar x^{l(k)-1}\| + \hat\alpha\bar\lambda\|d^{l(k)-1}\|\,\|x^{l(k)} - \bar x^{l(k)-1}\|, \qquad (50)$$
where $\tilde x^k$ is some point lying on the segment joining $x^{l(k)}$ and $\bar x^{l(k)-1}$. It follows from (47) that, for $k \ge \hat k$,
$$\|\tilde x^k - x^{l(k)-1}\| \le \|x^{l(k)} - x^{l(k)-1}\| + \|x^{l(k)-1} - \bar x^{l(k)-1}\| \le (1 + c_1)\|d^{l(k)-1}\|.$$
Similarly, $\|x^{l(k)} - \bar x^{l(k)-1}\| \le (1 + c_1)\|d^{l(k)-1}\|$ for $k \ge \hat k$. Using these inequalities, Lemma 3.6(a), $H_k(\alpha_k) \succeq (\underline{\alpha}\,\underline{\lambda}) I$, and (50), we see that, for $k \ge \hat k$,
$$F(x^{l(k)}) - v \le -c_2\Delta_{l(k)-1} \quad \text{for some constant } c_2 > 0.$$
This inequality together with (40) gives
$$F(x^{l(k)}) - v \le c_3\big(F(x^{l(l(k)-1)}) - F(x^{l(k)})\big) \quad \forall k \ge \hat k, \qquad (51)$$
where $c_3 = c_2/\sigma$. Using $\lim_{k \to \infty} F(x^{l(k)}) = F^*$, and upon taking limits on both sides of (51), we see that $F^* \le v$, which together with (49) implies that $v = F^*$. Using this result and upon rearranging the terms of (51), we have
$$F(x^{l(k)}) - F^* \le c\,\big(F(x^{l(l(k)-1)}) - F^*\big) \quad \forall k \ge \hat k,$$
where $c = c_3/(1 + c_3)$.

We next present the second nonmonotone gradient method for (26) as follows.

Nonmonotone gradient method II:

Choose parameters $0 < \eta < 1$, $0 < \sigma < 1$, $0 < \underline{\alpha} < \bar\alpha$, $0 < \underline{\lambda} \le \bar\lambda$, and integer $M \ge 0$.
Set $k = 0$ and choose $x^0 \in X$.

1) Choose $\underline{\lambda} I \preceq H_k \preceq \bar\lambda I$.

2) Solve (28) with $x = x^k$ and $H = H_k$ to obtain $d^k = d_H(x)$, and compute $\Delta_k$ according to (32).

3) Choose $\alpha^0_k \in [\underline{\alpha}, \bar\alpha]$. Find the smallest integer $j \ge 0$ such that $\alpha_k = \alpha^0_k\eta^j$ satisfies
$$F(x^k + \alpha_k d^k) \le \max_{[k-M]^+ \le i \le k} F(x^i) + \sigma\alpha_k\Delta_k, \qquad (52)$$
where $\Delta_k$ is defined in (32).

4) Set $x^{k+1} = x^k + \alpha_k d^k$ and $k \leftarrow k+1$.

end

Remark. The above method is closely related to the one proposed in [26]. When the entire coordinate block, that is, $J = \{1,\ldots,n\}$, is chosen for the latter method, it becomes a special case of our method with $M = 0$, which is a gradient descent method. Given that our method is generally a nonmonotone method, especially when $M \ge 1$, most proofs of global and local convergence for the method of [26] do not carry over to our method directly. In addition, our method can be viewed as an extension of the second projected gradient method (namely, SPG2) studied in [4] for smooth problems, but the method of [26] generally cannot.

We next prove global convergence of the nonmonotone gradient method II. Before proceeding, we establish two technical lemmas below. The first lemma shows that if $x^k \in X$ is a nonstationary point, there exists an $\alpha_k > 0$ in step 3) such that (52) is satisfied, and hence the above method is well defined.

Lemma 3.11 Suppose that $H_k \succ 0$ and $x^k \in X$ is a nonstationary point of problem (26). Then there exists $\tilde\alpha > 0$ such that $d^k = d_{H_k}(x^k)$ satisfies (52) whenever $0 < \alpha_k \le \tilde\alpha$.

Proof. In view of Lemma 2.1 of [26] with $J = \{1,\ldots,n\}$, $c = 1$, $x = x^k$, and $H = H_k$, we have
$$F(x^k + \alpha d^k) \le F(x^k) + \alpha\Delta_k + o(\alpha) \le \max_{[k-M]^+ \le i \le k} F(x^i) + \alpha\Delta_k + o(\alpha) \quad \forall \alpha \in (0,1],$$
where $\Delta_k$ is defined in (32). Using the assumption of this lemma, we see from Lemma 3.4 that $d^k \neq 0$, which together with $H_k \succ 0$ and Lemma 3.6(a) implies $\Delta_k < 0$.
The conclusion of this lemma immediately follows from this relation and the above inequality.

The following lemma shows that the scaled search directions $\{\alpha_k d^k\}$ approach zero, and that the sequence of objective function values $\{F(x^k)\}$ converges.

Lemma 3.12 Suppose that $F$ is bounded below in $X$ and uniformly continuous in the level set $L = \{x \in X : F(x) \le F(x^0)\}$. Then the sequence $\{x^k\}$ generated by the nonmonotone gradient method II satisfies $\lim_{k \to \infty} \alpha_k d^k = 0$. Moreover, the sequence $\{F(x^k)\}$ converges.

Proof. Let $l(k)$ be defined as in the proof of Lemma 3.8. We first observe that $\{x^k\} \subseteq L$. Using (32), the definition of $d^k$, and $H_k \succeq \underline{\lambda} I$, we have
$$\Delta_k = \nabla f(x^k)^T d^k + P(x^k + d^k) - P(x^k) \le -\tfrac{1}{2}(d^k)^T H_k d^k \le -\tfrac{1}{2}\underline{\lambda}\|d^k\|^2, \qquad (53)$$
which together with the relation $\alpha_k \le \alpha^0_k \le \bar\alpha$ implies that
$$\alpha_k^2\|d^k\|^2 \le -2\bar\alpha\,\alpha_k\Delta_k/\underline{\lambda}. \qquad (54)$$
By a similar argument to that leading to (38), we see that $\{x^k\}$ satisfies (38) for some $F^*$. We next show by induction that the following limits hold for all $j \ge 1$:
$$\lim_{k \to \infty} \alpha_{l(k)-j}\, d^{l(k)-j} = 0, \qquad \lim_{k \to \infty} F(x^{l(k)-j}) = F^*. \qquad (55)$$
Indeed, using (52) with $k$ replaced by $l(k)-1$, we obtain that
$$F(x^{l(k)}) \le F(x^{l(l(k)-1)}) + \sigma\alpha_{l(k)-1}\Delta_{l(k)-1}.$$
This together with (38) immediately yields $\lim_{k \to \infty} \alpha_{l(k)-1}\Delta_{l(k)-1} = 0$. Using this result and (54), we see that the first identity of (55) holds for $j = 1$. Further, in view of this identity, (38), and the uniform continuity of $F$ in $L$, we can easily see that the second identity of (55) also holds for $j = 1$. We now need to show that if (55) holds for $j$, then it also holds for $j+1$.
First, it follows from (52) that
F(x^{l(k)−j}) ≤ F(x^{l(l(k)−j−1)}) + σ α_{l(k)−j−1} Δ_{l(k)−j−1},
which together with (38) and the induction assumption lim_{k→∞} F(x^{l(k)−j}) = F* yields lim_{k→∞} α_{l(k)−j−1} Δ_{l(k)−j−1} = 0. Using this result and (54), we have lim_{k→∞} α_{l(k)−j−1} d^{l(k)−j−1} = 0. In view of this identity, the uniform continuity of F in L, and the induction assumption lim_{k→∞} F(x^{l(k)−j}) = F*, we can easily show that lim_{k→∞} F(x^{l(k)−j−1}) = F*. Hence (55) holds for j + 1. The conclusion of this lemma then follows from (55) and a similar argument to that in the proof of Lemma 3.12.

We are now ready to show that the nonmonotone gradient method II is globally convergent.

Theorem 3.13. Suppose that F is bounded below on X and uniformly continuous in the level set L = {x ∈ X : F(x) ≤ F(x^0)}. Then any accumulation point of the sequence {x^k} generated by the nonmonotone gradient method II is a stationary point of (26).

Proof. Suppose for contradiction that x* is an accumulation point of {x^k} that is a nonstationary point of (26). Let K be a subsequence such that {x^k}_{k∈K} → x*. We first claim that lim inf_{k∈K, k→∞} ‖d^k‖ > 0. Suppose not. By passing to a subsequence if necessary, we can assume that {‖d^k‖}_{k∈K} → 0. Invoking that d^k is the optimal solution of (28) with x = x^k and H = H_k, we have
0 ∈ ∇f(x^k) + H_k d^k + ∂P(x^k + d^k) + N_X(x^k + d^k)   ∀k ∈ K.
Upon taking limits on both sides as k ∈ K → ∞, and using the semicontinuity of ∂P(·) and N_X(·) (see Theorem 24.4 of [23] and Lemma 2.42 of [24]), the relations λI ⪯ H_k ⪯ λ̄I, {‖d^k‖}_{k∈K} → 0, and {x^k}_{k∈K} → x*, we see that (27) holds at x*, which contradicts the nonstationarity of x*. Thus lim inf_{k∈K, k→∞} ‖d^k‖ > 0 holds.
Further, using a similar argument to the one leading to (33), we have
‖d^k‖ ≤ −2 F′(x^k; d^k/‖d^k‖) / λ_min(H_k)   ∀k ∈ K,
which together with {x^k}_{k∈K} → x*, H_k ⪰ λI, and lim inf_{k∈K, k→∞} ‖d^k‖ > 0 implies that ‖d^k‖ = Θ(1) for k ∈ K. Further, using (53), we see that lim sup_{k∈K, k→∞} Δ_k < 0. Now it follows from Lemma 3.12 and the relation lim inf_{k∈K, k→∞} ‖d^k‖ > 0 that {α_k}_{k∈K} → 0. Since α_k^0 ≥ α > 0, there exists some index k̄ ≥ 0 such that α_k < α_k^0 and α_k < η for all k ∈ K with k ≥ k̄. Let ᾱ_k = α_k/η. Then {ᾱ_k}_{k∈K} → 0 and 0 < ᾱ_k ≤ 1 for all k ∈ K. By the stepsize rule used in step 3), we have, for all k ∈ K with k ≥ k̄,
F(x^k + ᾱ_k d^k) > max_{[k−M]^+ ≤ i ≤ k} F(x^i) + σ ᾱ_k Δ_k.   (56)
On the other hand, in view of the definition of F, (32), the relations ‖d^k‖ = Θ(1) and lim sup_{k∈K, k→∞} Δ_k < 0, and the monotonicity of (P(x^k + α d^k) − P(x^k))/α, we obtain that, for sufficiently large k ∈ K,
F(x^k + ᾱ_k d^k) = f(x^k + ᾱ_k d^k) + P(x^k + ᾱ_k d^k)
  = f(x^k + ᾱ_k d^k) − f(x^k) + P(x^k + ᾱ_k d^k) − P(x^k) + F(x^k)
  = ᾱ_k ∇f(x^k)^T d^k + o(ᾱ_k ‖d^k‖) + P(x^k + ᾱ_k d^k) − P(x^k) + F(x^k)
  ≤ ᾱ_k ∇f(x^k)^T d^k + o(ᾱ_k) + ᾱ_k [P(x^k + d^k) − P(x^k)] + max_{[k−M]^+ ≤ i ≤ k} F(x^i)
  = max_{[k−M]^+ ≤ i ≤ k} F(x^i) + ᾱ_k Δ_k + o(ᾱ_k)
  < max_{[k−M]^+ ≤ i ≤ k} F(x^i) + σ ᾱ_k Δ_k,
which clearly contradicts (56). Therefore the conclusion of this theorem holds.

We next establish the local linear rate of convergence of the nonmonotone gradient method II described above. The proof of the following theorem is inspired by the work of Tseng and Yun [26].

Theorem 3.14. Suppose that ᾱ ≤ 1, f satisfies (30), and F is bounded below on X and uniformly continuous in the level set L = {x ∈ X : F(x) ≤ F(x^0)}.
Then the sequence {x^k} generated by the nonmonotone gradient method II satisfies
F(x^{l(k)}) − F* ≤ c (F(x^{l(l(k)−1)}) − F*)
provided k is sufficiently large, where F* = lim_{k→∞} F(x^k) (see Lemma 3.12) and c is some constant in (0, 1).

Proof. Since α_k is chosen by the Armijo rule with α_k^0 ≥ α > 0, we see from Lemma 3.6(c) that inf_k α_k > 0. Together with Lemma 3.12, this implies that {d^k} → 0. Further, using Lemma 3.5 and the facts that d^k = d_{H_k}(x^k) and λI ⪯ H_k ⪯ λ̄I, we obtain ‖d_I(x^k)‖ = Θ(‖d^k‖), and hence {d_I(x^k)} → 0. Then, by a similar argument to that in the proof of Theorem 3.10, there exist c₁ > 0, v ∈ ℝ, and x̄^{l(k)−1} ∈ X̄ such that
‖x^{l(k)−1} − x̄^{l(k)−1}‖ ≤ c₁ ‖d^{l(k)−1}‖,   F(x̄^{l(k)−1}) = v   ∀k ≥ k̂,
where k̂ is some index. Then, by Lemma 5.1 of [26], we see that (49) holds for {x^k} and the above F* and v. Further, using the definition of F, (30), Lemma 3.6(b), and λI ⪯ H_k ⪯ λ̄I, we have, for k ≥ k̂,
F(x^{l(k)}) − v = f(x^{l(k)}) + P(x^{l(k)}) − f(x̄^{l(k)−1}) − P(x̄^{l(k)−1})
  = ∇f(x̃^k)^T (x^{l(k)} − x̄^{l(k)−1}) + P(x^{l(k)}) − P(x̄^{l(k)−1})
  = (∇f(x̃^k) − ∇f(x^{l(k)−1}))^T (x^{l(k)} − x̄^{l(k)−1}) − (H_{l(k)−1} d^{l(k)−1})^T (x^{l(k)} − x̄^{l(k)−1})
    + (∇f(x^{l(k)−1}) + H_{l(k)−1} d^{l(k)−1})^T (x^{l(k)} − x̄^{l(k)−1}) + P(x^{l(k)}) − P(x̄^{l(k)−1})
  ≤ L ‖x̃^k − x^{l(k)−1}‖ ‖x^{l(k)} − x̄^{l(k)−1}‖ + λ̄ ‖d^{l(k)−1}‖ ‖x^{l(k)} − x̄^{l(k)−1}‖
    + (α_{l(k)−1} − 1)(d^{l(k)−1})^T H_{l(k)−1} d^{l(k)−1} + Δ_{l(k)−1},   (57)
where x̃^k is some point lying on the segment joining x^{l(k)} and x̄^{l(k)−1}.
It follows from (47) and α_k ≤ 1 that, for k ≥ k̂,
‖x̃^k − x^{l(k)−1}‖ ≤ ‖x^{l(k)} − x^{l(k)−1}‖ + ‖x^{l(k)−1} − x̄^{l(k)−1}‖ ≤ (1 + c₁) ‖d^{l(k)−1}‖.
Similarly, ‖x^{l(k)} − x̄^{l(k)−1}‖ ≤ (1 + c₁) ‖d^{l(k)−1}‖ for k ≥ k̂. Using these inequalities, Lemma 3.6(a), H_k ⪰ λI, α_k ≤ 1, and (57), we see that, for k ≥ k̂,
F(x^{l(k)}) − v ≤ −c₂ Δ_{l(k)−1}
for some constant c₂ > 0. The remaining proof follows similarly to that of Theorem 3.10.

4 Augmented Lagrangian method for sparse PCA

In this section we discuss the applicability and implementation details of the augmented Lagrangian method proposed in Section 3 for solving the sparse PCA problem (3).

4.1 Applicability of the augmented Lagrangian method for (3)

We first observe that problem (3) can be reformulated as

min_{V ∈ ℝ^{n×r}}  −Tr(V^T Σ̂ V) + ρ • |V|
s.t.  V_i^T Σ̂ V_j ≤ Δ_{ij}   ∀i ≠ j,
      −V_i^T Σ̂ V_j ≤ Δ_{ij}   ∀i ≠ j,
      V^T V = I.   (58)

Clearly, problem (58) has the same form as (9). From Subsection 3.2 we know that the sufficient conditions for convergence of our augmented Lagrangian method include: i) a feasible point is explicitly given; and ii) Robinson's condition (10) holds at an accumulation point. It is easy to observe that any V ∈ ℝ^{n×r} consisting of r orthonormal eigenvectors of Σ̂ is a feasible point of (58), so the first condition is trivially satisfied. Since the accumulation points are not known beforehand, it is hard to check the second condition directly. Instead, we may check Robinson's condition at all feasible points of (58). However, due to the complexity of the constraints, we are only able to verify Robinson's condition at a set of feasible points below. Before proceeding, we establish a technical lemma that will be used subsequently.

Lemma 4.1. Let V ∈ ℝ^{n×r} be a feasible solution of (58).
Given any W₁, W₂ ∈ S^r, the system
δV^T Σ̂ V + V^T Σ̂ δV + δD = W₁,   (59)
δV^T V + V^T δV = W₂   (60)
has at least one solution (δV, δD) ∈ ℝ^{n×r} × D^r if one of the following conditions holds:
a) V^T Σ̂ V is diagonal and V_i^T Σ̂ V_i ≠ V_j^T Σ̂ V_j for all i ≠ j;
b) V^T Σ̂ (I − V V^T) Σ̂ V is nonsingular.

Proof. Note that the columns of V are r orthonormal vectors. Therefore there exists V̄ ∈ ℝ^{n×(n−r)} such that [V V̄] ∈ ℝ^{n×n} is an orthogonal matrix. It follows that for any δV ∈ ℝ^{n×r} there exist δP ∈ ℝ^{r×r} and δP̄ ∈ ℝ^{(n−r)×r} such that δV = V δP + V̄ δP̄. Performing this change of variable for δV, and using the fact that [V V̄] is orthogonal, we can show that the system (59)–(60) is equivalent to
δP^T G + G δP + δP̄^T Ḡ + Ḡ^T δP̄ + δD = W₁,   (61)
δP^T + δP = W₂,   (62)
where G = V^T Σ̂ V and Ḡ = V̄^T Σ̂ V. The remaining proof of this lemma reduces to showing that the system (61)–(62) has at least one solution (δP, δP̄, δD) ∈ ℝ^{r×r} × ℝ^{(n−r)×r} × D^r if condition (a) or (b) holds.

First, assume that condition (a) holds. Then G is a diagonal matrix and G_{ii} ≠ G_{jj} for all i ≠ j. It follows that there exists a unique δP* ∈ ℝ^{r×r} satisfying δP*_{ii} = (W₂)_{ii}/2 for all i and
δP*_{ji} G_{jj} + G_{ii} δP*_{ij} = (W₁)_{ij}   ∀i ≠ j,   δP*_{ij} + δP*_{ji} = (W₂)_{ij}   ∀i ≠ j.
Now let δP̄* = 0 and δD* = Diag(W₁ − G W₂). It is easy to verify that (δP*, δP̄*, δD*) is a solution of the system (61)–(62).

Next assume that condition (b) holds. Given any δP̄ ∈ ℝ^{(n−r)×r}, there exist δY ∈ ℝ^{(n−r)×r} and δZ ∈ ℝ^{r×r} such that Ḡ^T δY = 0 and δP̄ = δY + Ḡ δZ. Performing this change of variable for δP̄, we see that (61) can be rewritten as
δP^T G + G δP + δZ^T Ḡ^T Ḡ + Ḡ^T Ḡ δZ + δD = W₁.   (63)
Thus it suffices to show that the system (62)–(63) has at least one solution (δP, δZ, δD) ∈ ℝ^{r×r} × ℝ^{r×r} × D^r. Using the definition of Ḡ and the fact that [V V̄] is orthogonal, we see that
Ḡ^T Ḡ = V^T Σ̂ V̄ V̄^T Σ̂ V = V^T Σ̂ (I − V V^T) Σ̂ V,
which together with condition (b) implies that Ḡ^T Ḡ is nonsingular. Now let
δP* = W₂/2,   δZ* = (Ḡ^T Ḡ)^{−1} (2W₁ − W₂ G − G W₂)/4,   δD* = 0.
It is easy to verify that (δP*, δZ*, δD*) is a solution of the system (62)–(63). Therefore the conclusion holds.

We are now ready to show that Robinson's condition (10) holds at a set of feasible points of (58).

Proposition 4.2. Let V ∈ ℝ^{n×r} be a feasible solution of (58). Then Robinson's condition (10) holds at V if one of the following conditions holds:
a) Δ_{ij} = 0 and V_i^T Σ̂ V_i ≠ V_j^T Σ̂ V_j for all i ≠ j;
b) there is at least one active and one inactive inequality constraint of (58) at V, and V^T Σ̂ (I − V V^T) Σ̂ V is nonsingular;
c) all inequality constraints of (58) are inactive at V.

Proof. We first suppose that condition (a) holds. It immediately implies that V^T Σ̂ V is diagonal, and hence condition (a) of Lemma 4.1 holds. In addition, we observe that all constraints of (58) become equality constraints. Using these facts and Lemma 4.1, we see that Robinson's condition (10) holds at V. Next, assume that condition (b) holds. It implies that condition (b) of Lemma 4.1 holds. The conclusion then follows directly from Lemma 4.1. Finally, suppose that condition (c) holds. Then Robinson's condition (10) holds at V if and only if (60) has at least one solution δV ∈ ℝ^{n×r} for any W₂ ∈ S^r. Noting that V^T V = I, we easily see that δV = V W₂/2 is a solution of (60), and thus Robinson's condition (10) holds at V.
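As observed in Subsection 4.1, any V whose columns are r orthonormal eigenvectors of Σ̂ is feasible for (58): V^T V = I holds by orthonormality, and V_i^T Σ̂ V_j = 0 ≤ Δ_{ij} for i ≠ j because eigenvectors diagonalize Σ̂. A minimal NumPy sketch of such a feasible point (the function name and the numerical check are ours, not from the paper):

```python
import numpy as np

def eigvec_feasible_point(Sigma, r):
    """Return the r leading orthonormal eigenvectors of Sigma.

    Such a V satisfies V^T V = I and V_i^T Sigma V_j = 0 for i != j,
    hence it is feasible for (58) for any Delta_ij >= 0.
    """
    w, U = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
    return U[:, np.argsort(w)[::-1][:r]]  # columns: leading eigenvectors
```

This is also the feasible point V_feas used to initialize the outer iterations in Subsection 4.2.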
From Proposition 4.2 we see that Robinson's condition (10) indeed holds at a set of feasible points of (58). Although we are not able to show that it holds at all feasible points of (58), we observe in our implementation that the accumulation points of our augmented Lagrangian method generally satisfy one of the conditions described in Proposition 4.2, and so Robinson's condition usually holds at the accumulation points. Moreover, we have never seen our augmented Lagrangian method fail to converge on any instance in our implementation so far.

4.2 Implementation details of the augmented Lagrangian method for (58)

In this subsection we show how the augmented Lagrangian method proposed in Subsection 3.2 can be applied to solve problem (58) (or, equivalently, (3)). In particular, we discuss the implementation details of the outer and inner iterations of this method.

We first discuss how to efficiently evaluate the function and gradient involved in our augmented Lagrangian method for problem (58). Suppose that ϱ > 0 is the penalty parameter, {λ⁺_{ij}}_{i≠j} and {λ⁻_{ij}}_{i≠j} are the Lagrange multipliers for the inequality constraints of (58), and µ ∈ S^r is the Lagrange multiplier for the equality constraints of (58). For convenience of presentation, let Δ ∈ S^r be the matrix whose (i,j)th entry equals the parameter Δ_{ij} of (58) for all i ≠ j and whose diagonal entries are 0. Similarly, let λ⁺ (resp., λ⁻) be the r × r symmetric matrix whose (i,j)th entry is λ⁺_{ij} (resp., λ⁻_{ij}) for all i ≠ j and whose diagonal entries are 0. We now define λ ∈ ℝ^{2r×r} by stacking λ⁺ over λ⁻.
Using this notation, we observe that the associated augmented Lagrangian function for problem (58) can be rewritten as
L(V, λ, µ) = w(V) + ρ • |V|,   (64)
where
w(V) = −Tr(V^T Σ̂ V) + (1/(2ϱ)) ( ‖[λ⁺ + ϱ(S − Δ)]⁺‖²_F + ‖[λ⁻ − ϱ(S + Δ)]⁺‖²_F − ‖λ⁺‖²_F − ‖λ⁻‖²_F ) + µ • R + (ϱ/2) ‖R‖²_F,
and
S = V^T Σ̂ V − Diag(V^T Σ̂ V),   R = V^T V − I.   (65)
It is not hard to verify that the gradient of w(V) can be computed according to
∇w(V) = 2 ( −Σ̂ V ( I − [λ⁺ + ϱ(S − Δ)]⁺ + [λ⁻ − ϱ(S + Δ)]⁺ ) + V (µ + ϱ R) ).
Clearly, the main effort in the above function and gradient evaluations lies in computing V^T Σ̂ V and Σ̂ V. When Σ̂ ∈ S^p is explicitly given, the computational complexity of evaluating these two quantities is O(p²r). In practice, however, we are typically given the data matrix X ∈ ℝ^{n×p}. Assuming the column means of X are 0, the sample covariance matrix is Σ̂ = X^T X/(n − 1). Nevertheless, when p ≫ n, it is not efficient to compute and store Σ̂, and it is much cheaper to compute V^T Σ̂ V and Σ̂ V by using Σ̂ implicitly rather than explicitly. Indeed, we can first evaluate XV, and then compute V^T Σ̂ V and Σ̂ V according to
V^T Σ̂ V = (XV)^T (XV)/(n − 1),   Σ̂ V = X^T (XV)/(n − 1).
The resulting overall computational complexity is O(npr), which is clearly much better than the O(p²r) cost of using Σ̂ explicitly.

We now address the initialization and termination criteria for our augmented Lagrangian method. In particular, we choose the initial point V⁰_init and the feasible point V_feas to be the loading vectors of the r standard PCs, that is, the orthonormal eigenvectors corresponding to the r largest eigenvalues of Σ̂. In addition, we set the initial penalty parameter and Lagrange multipliers to 1, and set the parameters τ = 0.2 and σ = 10.
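The implicit evaluation of V^T Σ̂ V and Σ̂ V described above can be sketched as follows (a NumPy illustration under the stated assumption that the columns of X are centered; the function name is ours):

```python
import numpy as np

def cov_products(X, V):
    """Compute V^T Sigma V and Sigma V with Sigma = X^T X/(n-1) kept implicit.

    Forming X V first gives an overall cost of O(n p r), versus O(p^2 r)
    if the p x p matrix Sigma were built explicitly -- much cheaper when
    p >> n.
    """
    n = X.shape[0]
    XV = X @ V                   # n x r
    VSV = XV.T @ XV / (n - 1)    # r x r, equals V^T Sigma V
    SV = X.T @ XV / (n - 1)      # p x r, equals Sigma V
    return VSV, SV
```

Both quantities agree with the explicit computation, so the implicit form can be dropped in wherever the function and gradient of (64) are evaluated.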
We terminate our method once the constraint violation and the relative difference between the augmented Lagrangian function and the regular objective function are sufficiently small, that is,
max_{i≠j} [ |V_i^T Σ̂ V_j| − Δ_{ij} ]⁺ ≤ ε_I,   max_{i,j} |R_{ij}| ≤ ε_E,   |L(V, λ, µ) − f(V)| / max(|f(V)|, 1) ≤ ε_O,   (66)
where f(V) = −Tr(V^T Σ̂ V) + ρ • |V|, R is defined in (65), and ε_I, ε_E, ε_O are prescribed accuracy parameters corresponding to the inequality constraints, the equality constraints, and the objective function, respectively.

We next discuss how to apply the nonmonotone gradient methods proposed in Section 3.3 to the augmented Lagrangian subproblems, which have the form
min_V L(V, λ, µ),   (67)
where the function L(·, λ, µ) is defined in (64). Since the implementation details of these nonmonotone gradient methods are similar, we focus only on the second one, that is, the nonmonotone gradient method II. First, the initial point for this method can be chosen according to the scheme described at the end of Subsection 3.2. In addition, given an iterate V^k, the search direction d^k is computed by solving subproblem (28) with H = α_k^{−1} I, which becomes, in the context of (21) and (64),
d^k := arg min_d { ∇w(V^k) • d + (1/(2α_k)) ‖d‖²_F + ρ • |V^k + d| }.   (68)
Here α_k > 0 is chosen according to the scheme proposed by Barzilai and Borwein [2], which is also used by Birgin et al. [4] in studying a class of projected gradient methods. Let 0 < α_min < α_max be given. Initially, choose an arbitrary α_0 ∈ [α_min, α_max]. Then α_k is updated as follows:
α_{k+1} = α_max if b_k ≤ 0;   α_{k+1} = max{α_min, min{α_max, a_k/b_k}} otherwise,
where a_k = ‖V^k − V^{k−1}‖²_F and b_k = (V^k − V^{k−1}) • (∇w(V^k) − ∇w(V^{k−1})).
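The safeguarded Barzilai–Borwein update above can be sketched as follows (a NumPy illustration; the function name and default bounds are ours):

```python
import numpy as np

def bb_stepsize(V, V_prev, g, g_prev, a_min=1e-15, a_max=1.0):
    """Safeguarded Barzilai-Borwein step for updating alpha_k.

    a_k = ||V - V_prev||_F^2 and b_k = <V - V_prev, g - g_prev>; the new
    step is a_k/b_k projected onto [a_min, a_max], or a_max when b_k <= 0.
    """
    s = V - V_prev            # iterate difference
    y = g - g_prev            # gradient difference
    b = np.sum(s * y)
    if b <= 0:
        return a_max
    return max(a_min, min(a_max, np.sum(s * s) / b))
```

For a quadratic w, this step approximates the inverse curvature along the last move, which is why it often performs far better than a fixed stepsize in projected/proximal gradient schemes.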
It is not hard to verify that the optimal solution of problem (68) has a closed-form expression, given by
d^k = sign(C) ⊙ [ |C| − α_k ρ ]⁺ − V^k,   where C = V^k − α_k ∇w(V^k).
In addition, we see from Lemma 3.4 that the following termination criterion is suitable for this method when applied to (67):
max_{ij} |d_I(V^k)|_{ij} / max(|L(V^k, λ, µ)|, 1) ≤ ε,
where d_I(V^k) is the solution of (68) with α_k = 1, and ε is a prescribed accuracy parameter. In our numerical implementation we set α_0 = 1/max_{ij} |d_I(V^0)|_{ij}, α_max = 1, α_min = 10^{−15}, and ε = 10^{−4}.

Table 1: Sparse PCA methods used for our comparison

GPower_l1      Single-unit sparse PCA via l1-penalty
GPower_l0      Single-unit sparse PCA via l0-penalty
GPower_l1,m    Block sparse PCA via l1-penalty
GPower_l0,m    Block sparse PCA via l0-penalty
DSPCA          DSPCA algorithm
SPCA           SPCA algorithm
rSVD           sPCA-rSVD algorithm with soft thresholding
ALSPCA         Augmented Lagrangian algorithm

Finally, it should be mentioned that, for the sake of practical performance, the numerical implementation of our augmented Lagrangian method differs slightly from the one described in Subsection 3.2. In particular, we follow a scheme similar to the one discussed on pp. 405 of [3] to adjust the penalty parameter and Lagrange multipliers: they are updated separately rather than simultaneously. Roughly speaking, given γ ∈ (0, 1), we adjust the penalty parameter only when the constraint violation is not decreased by a factor γ over the previous minimization. Similarly, we update the Lagrange multipliers only when the constraint violation is decreased by a factor γ over the previous minimization. We choose γ = 0.25 in our implementation, as recommended in [3].
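The closed-form solution of subproblem (68) given above is an entrywise soft-thresholding step; a minimal NumPy sketch (the function name is ours; ρ may be a scalar or, via broadcasting, an array of entrywise weights):

```python
import numpy as np

def direction(V, grad_w, rho, alpha):
    """Closed-form minimizer of (68): soft-threshold C = V - alpha*grad_w
    at level alpha*rho, then subtract V to obtain the search direction.
    """
    C = V - alpha * grad_w
    return np.sign(C) * np.maximum(np.abs(C) - alpha * rho, 0.0) - V
```

With α_k = 1 the same routine yields d_I(V^k), which is exactly the quantity entering the inner termination criterion above.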
5 Numerical results

In this section we conduct numerical experiments with the augmented Lagrangian method detailed in Subsections 3.2 and 4.2 for formulation (58) (or, equivalently, (3)) of sparse PCA on synthetic, random, and real data. In particular, we compare the results of our approach with several existing sparse PCA methods in terms of total explained variance, correlation of PCs, and orthogonality of loading vectors: the generalized power methods (Journée et al. [16]), the DSPCA algorithm (d'Aspremont et al. [8]), the SPCA algorithm (Zou et al. [28]), and the sPCA-rSVD algorithm (Shen and Huang [25]). All the methods used in this section are listed in Table 1. Specifically, the methods with the prefix "GPower" are the generalized power methods studied in [16], and ALSPCA is the augmented Lagrangian method proposed in this paper.

As discussed in Section 2, the PCs obtained from standard PCA based on the sample covariance matrix Σ̂ are nearly uncorrelated when the sample size is sufficiently large, and the total variance explained by the first r PCs approximately equals the sum of the individual variances of the PCs, that is, Tr(V^T Σ̂ V), where V ∈ ℝ^{p×r} consists of the loading vectors of these PCs. However, the PCs found by sparse PCA methods may be correlated with each other, and thus the quantity Tr(V^T Σ̂ V) can substantially overestimate the total variance explained by the PCs, due to the overlap among their individual variances. In an attempt to deal with this overlap, two adjusted total explained variances were proposed in [28, 25]. It is not hard to observe that they can be viewed as the total explained variance of a set of variables transformed from the estimated sparse PCs.
Since these transformed variables can be dramatically different from the sparse PCs themselves, their total explained variances may also differ much from each other. To alleviate this drawback while taking into account the possible correlations among PCs, we introduce the following adjusted total explained variance for sparse PCs:
AdjVar(V) = Tr(V^T Σ̂ V) − √( Σ_{i≠j} (V_i^T Σ̂ V_j)² ).
Clearly, when the PCs are uncorrelated, this reduces to the usual total explained variance Tr(V^T Σ̂ V). We also define the cumulative percentage of adjusted variance (CPAV) for the first r sparse PCs as the quotient of the adjusted total explained variance of these PCs and the total variance explained by all standard PCs, that is, AdjVar(V)/Tr(Σ̂).

Finally, we stress that the sole purpose of this section is to compare the performance of the methods listed in Table 1 in finding sparse PCs that nearly enjoy the three important properties possessed by standard PCA (see Section 1). Therefore we do not compare the speed of these methods. Nevertheless, it should be mentioned that our method, ALSPCA, is a first-order method and is capable of solving large-scale problems within a reasonable amount of time, as observed in our experiments.

5.1 Synthetic data

In this subsection we use the synthetic data introduced by Zou et al. [28] to test the effectiveness of our approach ALSPCA in finding sparse PCs. The synthetic example [28] considers three hidden factors:
V₁ ∼ N(0, 290),   V₂ ∼ N(0, 300),   V₃ = −0.3 V₁ + 0.925 V₂ + ε,   ε ∼ N(0, 1),
where V₁, V₂ and ε are independent.
The 10 observable variables are then generated as follows:
X_i = V₁ + ε¹_i,   ε¹_i ∼ N(0, 1),   i = 1, 2, 3, 4,
X_i = V₂ + ε²_i,   ε²_i ∼ N(0, 1),   i = 5, 6, 7, 8,
X_i = V₃ + ε³_i,   ε³_i ∼ N(0, 1),   i = 9, 10,
where the ε^j_i are independent for j = 1, 2, 3 and i = 1, ..., 10. We use the actual covariance matrix of (X₁, ..., X₁₀) to find the standard and sparse PCs, respectively.

We first note that V₁ and V₂ are independent, while V₃ is a linear combination of V₁ and V₂. Moreover, the variances of the three underlying factors V₁, V₂ and V₃ are 290, 300, and 283.8, respectively, so V₂ is slightly more important than V₁, and both are much more important than V₃. In addition, the first two standard PCs together explain 99.72% of the total variance (see Table 2). These facts suggest that the first two sparse PCs should be sufficient to explain most of the variance. Ideally, the first sparse PC recovers the factor V₂ using only (X₅, X₆, X₇, X₈), and the second sparse PC recovers the factor V₁ using only (X₁, X₂, X₃, X₄). Moreover, since (X₅, X₆, X₇, X₈) and (X₁, X₂, X₃, X₄) are independent, these sparse PCs would be uncorrelated and orthogonal to each other.

Table 2: Loadings of the first two PCs by standard PCA and ALSPCA (synthetic data)

Variable    PCA PC1    PCA PC2    ALSPCA PC1    ALSPCA PC2
X1           0.1158     0.4785     0             0.5000
X2           0.1158     0.4785     0             0.5000
X3           0.1158     0.4785     0             0.5000
X4           0.1158     0.4785     0             0.5000
X5          -0.3955     0.1449    -0.5000        0
X6          -0.3955     0.1449    -0.5000        0
X7          -0.3955     0.1449    -0.5000        0
X8          -0.3955     0.1449    -0.5000        0
X9          -0.4005    -0.0095     0             0
X10         -0.4005    -0.0095     0             0
CPAV (%)        99.72                  80.46

In our test we set r = 2, Δ_{ij} = 0 for all i ≠ j, and ρ = 4 in formulation (58) of sparse PCA. In addition, we choose (66) as the termination criterion for ALSPCA, with ε_I = ε_O = 0.1 and ε_E = 10^{−3}.
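The actual covariance matrix of (X₁, ..., X₁₀) used in this test can be assembled directly from the factor model above (a NumPy sketch; the function name is ours):

```python
import numpy as np

def synthetic_covariance():
    """Exact covariance of (X1,...,X10) for the synthetic model of Zou et al.:
    V1 ~ N(0,290), V2 ~ N(0,300), V3 = -0.3*V1 + 0.925*V2 + eps, eps ~ N(0,1),
    with X_1..4 = V1 + noise, X_5..8 = V2 + noise, X_9..10 = V3 + noise,
    and all noise terms independent N(0,1).
    """
    load = np.zeros((10, 3))          # loadings of X on (V1, V2, eps)
    load[0:4, 0] = 1.0
    load[4:8, 1] = 1.0
    load[8:10, 0] = -0.3
    load[8:10, 1] = 0.925
    load[8:10, 2] = 1.0
    F = np.diag([290.0, 300.0, 1.0])  # factor covariance
    return load @ F @ load.T + np.eye(10)  # + independent unit noise
```

In particular, Var(V₃) = 0.3²·290 + 0.925²·300 + 1 ≈ 283.8, matching the value quoted above.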
The results of standard PCA and ALSPCA for this example are presented in Table 2. The loadings of the standard and sparse PCs are given in columns two and three, respectively, and their CPAVs are given in the last row. We clearly see that our sparse PCs are consistent with the ones predicted above. Interestingly, they are identical to the ones obtained by SPCA and DSPCA reported in [28, 8]. For general data, however, these methods may perform quite differently (see Subsection 5.3).

5.2 Random data

In this subsection we compare the performance of the GPower methods [16] and our ALSPCA method in finding sparse PCs on a set of randomly generated data. First, we randomly generate 100 centered data matrices X of size 20 × 20. Throughout this subsection we set Δ_{ij} = 0.5 for all i ≠ j in formulation (58) of sparse PCA, and choose (66) as the termination criterion for ALSPCA, with ε_I = ε_E = ε_O = 0.1.

In the first test we aim to find the first three sparse PCs with the average number of zero loadings around 30 (50% sparsity). To this end, the tuning parameter ρ for problem (58) and the parameters for the GPower methods are chosen appropriately. The results of the GPower methods and ALSPCA on the above randomly generated instances are presented in Table 3. The name of each method is given in column one. The sparsity, measured by the number of zero loadings averaged over all instances, is given in column two. Column three gives the average amount of non-orthogonality of the loading vectors, measured by the maximum absolute difference between 90° and the angles formed by all pairs of loading vectors; a smaller value in this column implies better orthogonality.
In addition, the maximum correlation and the CPAV of the sparse PCs, averaged over all instances, are presented in columns four and five, respectively. From Table 3 we see that the average numbers of zero loadings of the first three sparse PCs are almost the same for all methods, around 30. We also observe that the sparse PCs given by our method ALSPCA are almost uncorrelated and their loading vectors are nearly orthogonal, which is much superior to the GPower methods.

Table 3: Comparison of GPower and ALSPCA (random data: Test I)

Method        Sparsity   Non-orthogonality   Correlation   CPAV (%)
GPower_l1     30.73      3.480               0.104         39.06
GPower_l0     30.37      3.540               0.111         38.74
GPower_l1,m   30.61      5.220               0.161         38.12
GPower_l0,m   30.51      5.216               0.153         37.99
ALSPCA        30.91      1.007               0.038         40.79

Table 4: Comparison of GPower and ALSPCA (random data: Test II)

Method        Sparsity   Non-orthogonality   Correlation   CPAV (%)
GPower_l1     60.58      5.766               0.168         63.46
GPower_l0     60.53      6.031               0.169         63.27
GPower_l1,m   60.02      8.665               0.265         62.34
GPower_l0,m   60.14      9.146               0.272         62.00
ALSPCA        60.74      1.177               0.066         64.62

Moreover, our sparse PCs have a better CPAV than the others. The aim of the second test is to find the first six sparse PCs with the average number of zero loadings around 60 (50% sparsity). To reach this goal, the tuning parameter ρ for problem (58) and the parameters for the GPower methods are chosen appropriately. The results of the GPower methods and ALSPCA on the above randomly generated instances are presented in Table 4; each column of Table 4 has the same meaning as in Table 3. First, we clearly see that our method substantially outperforms the GPower methods in terms of uncorrelation of PCs, orthogonality of loading vectors, and CPAV.
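The four measures reported in Tables 3 and 4 (sparsity, non-orthogonality in degrees, maximum correlation, and CPAV based on AdjVar) can be sketched as follows, assuming nonzero loading vectors and positive PC variances (a NumPy illustration; the function name is ours):

```python
import numpy as np

def spca_metrics(V, Sigma):
    """Sparsity, non-orthogonality (max |90 deg - angle| over pairs of
    loading vectors), maximum absolute PC correlation, and
    CPAV = AdjVar(V)/Tr(Sigma), where
    AdjVar(V) = Tr(V^T Sigma V) - sqrt(sum_{i!=j} (V_i^T Sigma V_j)^2)."""
    VSV = V.T @ Sigma @ V
    off = VSV - np.diag(np.diag(VSV))
    # maximum absolute correlation between pairs of PCs
    d = np.sqrt(np.diag(VSV))
    maxcorr = np.max(np.abs(off / np.outer(d, d)))
    # deviation of loading-vector angles from 90 degrees
    norms = np.linalg.norm(V, axis=0)
    cosang = (V.T @ V) / np.outer(norms, norms)
    np.fill_diagonal(cosang, 0.0)  # ignore self-angles (they map to 90 deg)
    ang = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
    nonorth = np.max(np.abs(90.0 - ang))
    adjvar = np.trace(VSV) - np.sqrt(np.sum(off ** 2))
    return int(np.sum(V == 0)), nonorth, maxcorr, adjvar / np.trace(Sigma)
```

For uncorrelated, orthonormal PCs the off-diagonal terms vanish and CPAV reduces to the usual explained-variance ratio.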
Further, comparing with Table 3, we observe that as the number of PCs increases, the CPAV grows accordingly for all methods; moreover, the sparse PCs given by the GPower methods become much more correlated and non-orthogonal to each other, while the performance of our sparse PCs remains almost the same. This phenomenon is not surprising, given that uncorrelation and orthogonality are not well taken into account in the GPower methods.

5.3 Pitprops data

In this subsection we test the performance of our approach ALSPCA in finding sparse PCs on real data, namely the Pitprops data, and compare the results with several existing methods [28, 8, 25, 16]. The Pitprops data, introduced by Jeffers [14], has 180 observations and 13 measured variables. It is a classic example illustrating the difficulty of interpreting PCs. Recently, several sparse PCA methods [17, 28, 25, 8] have been applied to this data set to find six sparse PCs using the actual covariance matrix. For ease of comparison, we present the standard PCs and some of those sparse PCs in Tables 5-8, respectively. It should be mentioned that two groups of sparse PCs were found by DSPCA with the parameter k₁ = 5 or 6, and they have similar sparsity and total explained variance (see [8] for details); thus we only present the latter (i.e., the one with k₁ = 6) in Table 8.
Also, we applied the GPower methods [16] to this data to find PCs with the largest sparsity among those given in [28, 25, 8], and found that the best result was given by GPower_l1; thus we only report the six sparse PCs obtained by GPower_l1, in Table 9. In addition, we present the sparsity, CPAV, non-orthogonality, and correlation of the PCs obtained by the standard PCA and the sparse PCA methods [28, 25, 8, 16] in columns two to five of Table 13, respectively. In more detail, the second and fifth columns give the sparsity (measured by the number of zero loadings) and the CPAV, respectively.

Table 5: Loadings of the first six PCs by standard PCA (Pitprops data)

Variable    PC1       PC2       PC3       PC4       PC5       PC6
topdiam     0.4038    0.2178    0.2073    0.0912    0.0826    0.1198
length      0.4055    0.1861    0.2350    0.1027    0.1128    0.1629
moist       0.1244    0.5406   -0.1415   -0.0784   -0.3498   -0.2759
testsg      0.1732    0.4556   -0.3524   -0.0548   -0.3558   -0.0540
ovensg      0.0572   -0.1701   -0.4812   -0.0491   -0.1761    0.6256
ringtop     0.2844   -0.0142   -0.4753    0.0635    0.3158    0.0523
ringbut     0.3998   -0.1897   -0.2531    0.0650    0.2151    0.0026
bowmax      0.2936   -0.1892    0.2431   -0.2856   -0.1853   -0.0551
bowdist     0.3566    0.0171    0.2076   -0.0967    0.1061    0.0342
whorls      0.3789   -0.2485    0.1188    0.2050   -0.1564   -0.1731
clear      -0.0111    0.2053    0.0704   -0.8036    0.3430    0.1753
knots      -0.1151    0.3432   -0.0920    0.3008    0.6003   -0.1698
diaknot    -0.1125    0.3085    0.3261    0.3034   -0.0799    0.6263

Table 6: Loadings of the first six PCs by SPCA (Pitprops data)

Variable    PC1       PC2       PC3       PC4       PC5       PC6
topdiam    -0.477     0         0         0         0         0
length     -0.476     0         0         0         0         0
moist       0         0.785     0         0         0         0
testsg      0         0.620     0         0         0         0
ovensg      0.177     0         0.640     0         0         0
ringtop     0         0         0.589     0         0         0
ringbut    -0.250     0         0.492     0         0         0
bowmax     -0.344    -0.021     0         0         0         0
bowdist    -0.416     0         0         0         0         0
whorls     -0.400     0         0         0         0         0
clear       0         0         0        -1         0         0
knots       0         0.013     0         0        -1         0
diaknot     0         0        -0.015     0         0         1
The third column reports the non-orthogonality, which is measured by the maximum absolute difference between 90° and the angles formed by all pairs of loading vectors. The fourth column presents the maximum correlation of the PCs. Though the PCs obtained by these sparse PCA methods have nice sparsity, we observe from Table 13 that they are highly correlated, and moreover almost all of them are far from orthogonal to each other, except the ones given by SPCA [28]. To improve the quality of the sparse PCs, we next apply our approach ALSPCA and compare the results with these methods. For all tests below, we choose (66) as the termination criterion for ALSPCA with ε_O = 0.1 and ε_I = ε_E = 10^{-3}.

In the first experiment, we aim to find six uncorrelated and orthogonal sparse PCs by ALSPCA while explaining most of the variance. In particular, we set r = 6, Δ_ij = 0.07 for all i ≠ j, and ρ = 0.8 for formulation (58) of sparse PCA. The resulting sparse PCs are presented in Table 10, and their sparsity, CPAV, non-orthogonality and correlation are reported in row seven of Table 13.

Table 7: Loadings of the first six PCs by rSVD (Pitprops data)

Variable    PC1      PC2      PC3      PC4      PC5      PC6
topdiam    -0.449    0        0       -0.114    0        0
length     -0.460    0        0       -0.102    0        0
moist       0       -0.707    0        0        0        0
testsg      0       -0.707    0        0        0        0
ovensg      0        0        0.550    0        0       -0.744
ringtop    -0.199    0        0.546   -0.176    0        0
ringbut    -0.399    0        0.366    0        0        0
bowmax     -0.279    0        0        0.422    0        0
bowdist    -0.380    0        0        0.283    0        0
whorls     -0.407    0        0        0        0.231    0
clear       0        0        0       -0.785   -0.973    0
knots       0        0        0       -0.265    0        0.161
diaknot     0        0       -0.515    0        0       -0.648

Table 8: Loadings of the first six PCs by DSPCA (Pitprops data)

Variable    PC1      PC2      PC3      PC4      PC5      PC6
topdiam    -0.4907   0        0        0        0        0
length     -0.5067   0        0        0        0        0
moist       0        0.7071   0        0        0        0
testsg      0        0.7071   0        0        0        0
ovensg      0        0        0        0       -1.0000   0
ringtop    -0.0670   0       -0.8731   0        0        0
ringbut    -0.3566   0       -0.4841   0        0        0
bowmax     -0.2335   0        0        0        0        0
bowdist    -0.3861   0        0        0        0        0
whorls     -0.4089   0        0        0        0        0
clear       0        0        0        0        0        1.0000
knots       0        0        0        1.0000   0        0
diaknot     0        0        0.0569   0        0        0

Table 9: Loadings of the first six PCs by GPower_ℓ1 (Pitprops data)

Variable    PC1      PC2      PC3      PC4      PC5      PC6
topdiam    -0.4182   0        0        0        0        0
length     -0.4205   0        0        0        0        0
moist       0       -0.7472   0        0        0        0
testsg     -0.1713  -0.6646   0        0        0        0
ovensg      0        0        0        0       -0.7877   0
ringtop    -0.2843   0        0        0       -0.6160   0
ringbut    -0.4039   0        0        0        0        0
bowmax     -0.3002   0        0        0        0        0
bowdist    -0.3677   0        0        0        0        0
whorls     -0.3868   0        0        0        0        0
clear       0        0        0        0        0        1.0000
knots       0        0        0        1.0000   0        0
diaknot     0        0        1.0000   0        0        0

Table 10: Loadings of the first six PCs by ALSPCA (Pitprops data: Test I)

Variable    PC1      PC2      PC3      PC4      PC5      PC6
topdiam     0.4394   0        0        0        0        0
length      0.4617   0        0        0        0        0
moist       0.0419   0.4611  -0.1644   0.0688  -0.3127   0
testsg      0.1058   0.7902   0        0        0        0
ovensg      0.0058   0        0        0        0        0
ringtop     0.1302   0        0.2094   0        0        0.9999
ringbut     0.3477   0        0.0515   0        0.3240   0
bowmax      0.2256  -0.3566   0        0        0        0
bowdist     0.4063   0        0        0        0        0
whorls      0.4606   0        0        0        0       -0.0125
clear       0        0.0369   0       -0.9973   0        0
knots      -0.1115   0.1614  -0.0762   0.0239   0.8929   0
diaknot    -0.0487   0.0918   0.9595   0.0137   0        0

We easily observe that our method ALSPCA overall outperforms the other sparse PCA methods substantially in all aspects except sparsity. Naturally, we can improve the sparsity by increasing the value of ρ, yet the total explained variance may be sacrificed, as demonstrated in our next experiment.

We now attempt to find six PCs with similar correlation and orthogonality but higher sparsity than those given in the above experiment. For this purpose, we set Δ_ij = 0.07 for all i ≠ j and choose ρ = 2.1 for problem (58) in this experiment. The resulting sparse PCs are presented in Table 11, and the CPAV, non-orthogonality and correlation of these PCs are given in row eight of Table 13. In comparison with the ones given in the above experiment, the PCs obtained in this experiment are much sparser while retaining almost the same correlation and orthogonality. However, their CPAV goes down dramatically. Combining the results of these two experiments, we deduce that for the Pitprops data it does not seem possible to extract six highly sparse (e.g., around 60 zero loadings), nearly orthogonal and uncorrelated PCs that explain most of the variance, as such PCs may not exist. The following experiment further supports this deduction.

Finally, we are interested in exploring how the correlation-controlling parameters Δ_ij (i ≠ j) affect the performance of the sparse PCs. In particular, we set Δ_ij = 0.5 for all i ≠ j and choose ρ = 0.7 for problem (58). The obtained sparse PCs are presented in Table 12, and the CPAV, non-orthogonality and correlation of these PCs are given in the last row of Table 13. We see that these PCs are highly sparse, orthogonal, and explain a good amount of variance. However, they are quite correlated with each other, which is actually not surprising given that the Δ_ij (i ≠ j) are not small. Despite this drawback, we observe that these sparse PCs still overall outperform those obtained by SPCA, rSVD, DSPCA and GPower_ℓ1.

From the above experiments, we may conclude that for the Pitprops data there do not exist six highly sparse, nearly orthogonal and uncorrelated PCs that explain most of the variance. Therefore, the most acceptable sparse PCs seem to be the ones given in Table 10.

6 Concluding remarks

In this paper we proposed a new formulation of sparse PCA for finding sparse and nearly uncorrelated principal components (PCs) with orthogonal loading vectors while explaining as much of the total variance as possible.
Table 11: Loadings of the first six PCs by ALSPCA (Pitprops data: Test II)

Variable    PC1      PC2      PC3      PC4      PC5      PC6
topdiam     1.0000   0        0        0        0        0
length      0       -0.2916  -0.1421   0        0       -0.0599
moist       0        0.9565  -0.0433   0        0       -0.0183
testsg      0        0        0        0.0786  -0.1330   0
ovensg      0        0       -0.9683   0        0        0
ringtop     0        0        0        0        0        0
ringbut     0        0        0.1949   0        0.2369   0
bowmax      0        0        0        0        0        0
bowdist     0        0        0        0        0        0
whorls      0        0        0        0        0        0
clear       0        0        0       -0.9969   0        0
knots       0        0       -0.0480   0.0109   0.9624   0
diaknot     0        0       -0.0093   0        0        0.9980

Table 12: Loadings of the first six PCs by ALSPCA (Pitprops data: Test III)

Variable    PC1      PC2      PC3      PC4      PC5      PC6
topdiam     0.4051   0        0        0        0        0
length      0.4248   0        0        0        0        0
moist       0        0.7262   0        0        0        0
testsg      0.0018   0.6875   0        0        0        0
ovensg      0        0       -1.0000   0        0        0
ringtop     0.1856   0        0        0        0        0
ringbut     0.4123   0        0        0        0        0
bowmax      0.3278   0        0        0        0        0
bowdist     0.3830   0        0        0        0        0
whorls      0.4437  -0.0028   0        0        0        0
clear       0        0        0       -1.0000   0        0
knots       0        0        0        0        1.0000   0
diaknot     0        0        0        0        0        1.0000

We also developed a novel globally convergent augmented Lagrangian method for solving a class of nonsmooth constrained optimization problems, which is well suited for our formulation of sparse PCA. Additionally, we proposed two nonmonotone gradient methods for solving the augmented Lagrangian subproblems, and established their global and local convergence. Finally, we compared our sparse PCA approach with several existing methods on synthetic, random, and real data, respectively. The computational results demonstrate that the sparse PCs produced by our approach substantially outperform those produced by the other methods in terms of total explained variance, correlation of PCs, and orthogonality of loading vectors.

As observed in our experiments, formulation (3) is very effective in finding the desired sparse PCs. However, there remains a natural theoretical question about it.
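The nonmonotone gradient methods mentioned above are developed earlier in the paper; purely to illustrate the underlying idea, the following is a generic nonmonotone Barzilai-Borwein gradient method in the spirit of [2, 4]. It is a sketch under simplifying assumptions (unconstrained, smooth objective), not the paper's exact algorithms; all names and parameter choices here are illustrative.

```python
import numpy as np

def nonmonotone_bb_gradient(f, grad, x0, M=10, max_iter=500, tol=1e-6,
                            sigma=1e-4, alpha0=1.0):
    """Generic nonmonotone Barzilai-Borwein gradient method (sketch).

    A step is accepted when f decreases relative to the maximum of the
    last M function values (a Grippo-style nonmonotone Armijo rule), and
    the trial steplength for the next iteration is the BB steplength.
    """
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    hist = [f(x)]
    alpha = alpha0
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        # Backtrack until the nonmonotone Armijo condition holds.
        fref = max(hist[-M:])
        t = alpha
        while t > 1e-16 and f(x - t * g) > fref - sigma * t * (g @ g):
            t *= 0.5
        x_new = x - t * g
        g_new = grad(x_new)
        # Barzilai-Borwein steplength for the next iteration, safeguarded.
        s, y = x_new - x, g_new - g
        alpha = float(s @ s) / max(float(s @ y), 1e-12)
        alpha = min(max(alpha, 1e-10), 1e10)
        x, g = x_new, g_new
        hist.append(f(x))
    return x
```

On a strictly convex quadratic, for example f(x) = x'Ax/2 with A = diag(1, 10), the iterates converge to the minimizer at the origin.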
Given a set of random variables, suppose there exist sparse and uncorrelated PCs with orthogonal loading vectors that explain most of the variance of the variables. In other words, their actual covariance matrix Σ has a few dominant eigenvalues and the associated orthonormal eigenvectors are sparse. Since in practice Σ is typically unknown and is only approximated by a sample covariance matrix Σ̂, one natural question is whether or not there exist suitable parameters ρ and Δ_ij (i ≠ j) such that (3) is able to recover those sparse PCs almost surely as the sample size becomes sufficiently large.

Table 13: Comparison of SPCA, rSVD, DSPCA, GPower_ℓ1 and ALSPCA (Pitprops data)

Method       Sparsity   Non-orthogonality   Correlation   CPAV (%)
PCA           0          0                   0             87.00
SPCA         60          0.86                0.395         66.21
rSVD         53         14.76                0.459         67.04
DSPCA        63         13.63                0.573         60.97
GPower_ℓ1    63         10.09                0.353         64.15
ALSPCA-1     46          0.03                0.082         69.55
ALSPCA-2     60          0.03                0.084         39.42
ALSPCA-3     63          0.00                0.222         65.97

In Section 4 we showed that Robinson's condition (10) holds at a set of feasible points of (58). Also, we observed from our experiments that the accumulation points of our augmented Lagrangian method lie in this set when applied to (58), and thus it converges. However, it remains open whether or not Robinson's condition holds at all feasible points of (58).

In addition, Burer and Monteiro [5] recently applied the classical augmented Lagrangian method to a nonconvex nonlinear programming (NLP) reformulation of semidefinite programs (SDPs), and obtained some nice computational results, especially for the SDP relaxations of several hard combinatorial optimization problems. However, the classical augmented Lagrangian method generally cannot guarantee convergence to a feasible point when applied to a nonconvex NLP.
Due to this and [19], at least theoretically, their approach [5] may not converge to a feasible point of the primal SDP. Given that the augmented Lagrangian method proposed in this paper converges globally under some mild assumptions, it would be interesting to apply it to the NLP reformulation of SDPs and compare its performance with the approach studied in [5].

Finally, the codes of our approach for solving the sparse PCA formulation (58) (or, equivalently, (3)) are written in Matlab and are available online at www.math.sfu.ca/~zhaosong. As future research, we will further improve their performance by conducting more extensive computational experiments and exploring more practical applications.

Acknowledgement

We gratefully acknowledge comments from Jim Burke, Terry Rockafellar, Defeng Sun and Paul Tseng at the West Coast Optimization Meeting at the University of Washington, Seattle, USA in Spring 2009.

References

[1] O. Alter, P. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences, 97:10101-10106, 2000.

[2] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8:141-148, 1988.

[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

[4] E. G. Birgin, J. M. Martínez, and M. Raydan. Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization, 4:1196-1211, 2000.

[5] S. Burer and R. D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, Series B, 95:329-357, 2003.

[6] J. Cadima and I. Jolliffe. Loadings and correlations in the interpretation of principal components. Journal of Applied Statistics, 22:203-214, 1995.

[7] A. d'Aspremont, F. R. Bach, and L. El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269-1294, 2008.

[8] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49:434-448, 2007.

[9] K. Fan. On a theorem of Weyl concerning the eigenvalues of linear transformations. Proceedings of the National Academy of Sciences of the U.S.A., 35:652-655, 1949.

[10] P. Hancock, A. Burton, and V. Bruce. Face processing: human perception and principal components analysis. Memory and Cognition, 24:26-40, 1996.

[11] T. Hastie, R. Tibshirani, M. Eisen, P. Brown, D. Ross, U. Scherf, J. Weinstein, A. Alizadeh, L. Staudt, and D. Botstein. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology, 1:1-21, 2000.

[12] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.

[13] J. B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I. Comprehensive Studies in Mathematics, volume 305, Springer-Verlag, New York, 1993.

[14] J. Jeffers. Two case studies in the application of principal component analysis. Applied Statistics, 16:225-236, 1967.

[15] I. Jolliffe. Rotation of principal components: choice of normalization constraints. Journal of Applied Statistics, 22:29-35, 1995.

[16] M. Journée, Yu. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. CORE Discussion Paper 2008/70. Submitted to Journal of Machine Learning Research.

[17] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin. A modified principal component technique based on the Lasso. Journal of Computational and Graphical Statistics, 12:531-547.

[18] B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse PCA: exact and greedy algorithms. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 915-922. MIT Press, Cambridge, MA, 2006.

[19] R. D. C. Monteiro. Private communication, 2009.

[20] M. L. Overton and R. S. Womersley. Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Mathematical Programming, 62:321-357, 1993.

[21] S. M. Robinson. Stability theory for systems of inequalities, Part 2: differentiable nonlinear systems. SIAM Journal on Numerical Analysis, 13:497-513, 1976.

[22] S. M. Robinson. Local structure of feasible sets in nonlinear programming, Part I: regularity. In V. Pereira and A. Reinoza, editors, Numerical Methods, Lecture Notes in Mathematics, vol. 1005, Springer-Verlag, Berlin, 1983.

[23] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

[24] A. Ruszczyński. Nonlinear Optimization. Princeton University Press, 2006.

[25] H. Shen and J. Z. Huang. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99(6):1015-1034, 2008.

[26] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387-423, 2009.

[27] S. J. Wright, R. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. To appear in IEEE Transactions on Signal Processing.

[28] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.