An Augmented Lagrangian Approach for Sparse Principal Component Analysis

Zhaosong Lu∗   Yong Zhang†

July 10, 2009

Abstract

Principal component analysis (PCA) is a widely used technique for data analysis and dimension reduction with numerous applications in science and engineering. However, the standard PCA suffers from the fact that the principal components (PCs) are usually linear combinations of all the original variables, and it is thus often difficult to interpret the PCs. To alleviate this drawback, various sparse PCA approaches were proposed in the literature [15, 6, 17, 28, 8, 25, 18, 7, 16]. Despite success in achieving sparsity, some important properties enjoyed by the standard PCA are lost in these methods, such as uncorrelation of PCs and orthogonality of loading vectors. Also, the total explained variance that they attempt to maximize can be too optimistic. In this paper we propose a new formulation for sparse PCA, aiming at finding sparse and nearly uncorrelated PCs with orthogonal loading vectors while explaining as much of the total variance as possible. We also develop a novel augmented Lagrangian method for solving a class of nonsmooth constrained optimization problems, which is well suited for our formulation of sparse PCA. We show that it converges to a feasible point, and moreover, under some regularity assumptions, it converges to a stationary point. Additionally, we propose two nonmonotone gradient methods for solving the augmented Lagrangian subproblems, and establish their global and local convergence. Finally, we compare our sparse PCA approach with several existing methods on synthetic, random, and real data, respectively.
The computational results demonstrate that the sparse PCs produced by our approach substantially outperform those by other methods in terms of total explained variance, correlation of PCs, and orthogonality of loading vectors.

Key words: sparse PCA, augmented Lagrangian method, nonmonotone gradient methods, nonsmooth minimization

AMS 2000 subject classification: 62H20, 62H25, 62H30, 90C30, 65K05

∗ Department of Mathematics, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada (email: zhaosong@sfu.ca). This author was supported in part by an NSERC Discovery Grant.
† Department of Mathematics, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada (email: yza30@sfu.ca).

1 Introduction

Principal component analysis (PCA) is a popular tool for data processing and dimension reduction. It has been widely used in numerous applications in science and engineering, such as biology, chemistry, image processing, machine learning, and so on. For example, PCA has recently been applied to human face recognition, handwritten zip code classification, and gene expression data analysis (see [10, 12, 1, 11]). In essence, PCA aims at finding a few linear combinations of the original variables, called principal components (PCs), which point in orthogonal directions capturing as much of the variance of the variables as possible. It is well known that the PCs can be found via the eigenvalue decomposition of the covariance matrix Σ. However, Σ is typically unknown in practice. Instead, the PCs can be approximately computed via the singular value decomposition (SVD) of the data matrix or the eigenvalue decomposition of the sample covariance matrix. In detail, let ξ = (ξ^{(1)}, …, ξ^{(p)}) be a p-dimensional random vector, and let X be an n × p data matrix, which records the n observations of ξ.
Without loss of generality, assume X is centered, that is, the column means of X are all 0. Then the commonly used sample covariance matrix is Σ̂ = XᵀX/(n − 1). Suppose the eigenvalue decomposition of Σ̂ is Σ̂ = V D Vᵀ. Then η = ξV gives the PCs, and the columns of V are the corresponding loading vectors. It is worth noting that V can also be obtained by performing the SVD of X (see, for example, [28]). Clearly, the columns of V are orthonormal vectors, and moreover Vᵀ Σ̂ V is diagonal. We thus immediately see that if Σ̂ = Σ, the corresponding PCs are uncorrelated; otherwise, they can be correlated with each other (see Section 2 for details). We now describe several important properties of the PCs obtained by the standard PCA when Σ is well estimated by Σ̂ (see also [28]):

1. The PCs sequentially capture the maximum variance of the variables approximately, thus encouraging minimal information loss;

2. The PCs are nearly uncorrelated, so the explained variance by different PCs has small overlap;

3. The PCs point in orthogonal directions, that is, their loading vectors are orthogonal to each other.

In practice, typically the first few PCs are enough to represent the data, and thus a great dimensionality reduction is achieved. In spite of the popularity and success of PCA due to these nice features, PCA has an obvious drawback: the PCs are usually linear combinations of all p variables, and the loadings are typically nonzero. This often makes it difficult to interpret the PCs, especially when p is large. Indeed, in many applications the original variables have concrete physical meaning. For example, in biology, each variable might represent the expression level of a gene.
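The eigenvalue-decomposition and SVD routes to the PCs described above can be sketched numerically. The following is an illustrative example on synthetic data (not from the paper); variable names mirror the notation used here.

```python
import numpy as np

# Illustrative sketch (synthetic data) of standard PCA as described above.
rng = np.random.default_rng(0)
n, p = 100, 5
# Columns with clearly distinct variances so the eigenvalues are separated.
X = rng.standard_normal((n, p)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0])
X = X - X.mean(axis=0)                    # center: column means are all 0

Sigma_hat = X.T @ X / (n - 1)             # sample covariance matrix
evals, V = np.linalg.eigh(Sigma_hat)      # eigenvalue decomposition
evals, V = evals[::-1], V[:, ::-1]        # reorder eigenvalues descending

# V^T Sigma_hat V is diagonal, so the PCs eta = X V are uncorrelated
# with respect to Sigma_hat:
D = V.T @ Sigma_hat @ V
assert np.abs(D - np.diag(np.diag(D))).max() < 1e-8

# The same loading vectors (up to column signs) come from the SVD of X:
_, s, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(np.abs(Vt.T), np.abs(V), atol=1e-6)
```

The r largest eigenvalues and the corresponding first r columns of V then give the variances and loading vectors of the first r PCs.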
In these cases, the interpretation of the PCs would be facilitated if they were composed only from a small number of the original variables, namely, if each PC involved only a small number of nonzero loadings. It is thus imperative to develop sparse PCA techniques for finding PCs with sparse loadings while enjoying the above three nice properties as much as possible.

Sparse PCA has been an active research topic for more than a decade. The first class of approaches is based on ad-hoc methods that post-process the PCs obtained from the standard PCA mentioned above. For example, Jolliffe [15] applied various rotation techniques to the standard PCs to obtain sparse loading vectors. Cadima and Jolliffe [6] proposed a simple thresholding approach that artificially sets to zero the standard PCs' loadings whose absolute values are smaller than a threshold. In recent years, optimization approaches have been proposed for finding sparse PCs. They usually formulate sparse PCA as an optimization problem, aiming at achieving sparsity of the loadings while maximizing the explained variance as much as possible. For instance, Jolliffe et al. [17] proposed an interesting algorithm, called SCoTLASS, for finding sparse orthogonal loading vectors by sequentially maximizing the approximate variance explained by each PC under an l1-norm penalty on the loading vectors. Zou et al. [28] formulated sparse PCA as a regression-type optimization problem and imposed a combination of l1- and l2-norm penalties on the regression coefficients. d'Aspremont et al. [8] proposed a method, called DSPCA, for finding sparse PCs by solving a sequence of semidefinite programming relaxations of sparse PCA. Shen and Huang [25] recently developed an approach for computing sparse PCs by solving a sequence of rank-one matrix approximation problems under several sparsity-inducing penalties.
Very recently, Journée et al. [16] formulated sparse PCA as nonconcave maximization problems with l0- or l1-norm sparsity-inducing penalties. They showed that these problems can be reduced to the maximization of a convex function on a compact set, and they also proposed a simple but computationally efficient gradient method for finding a stationary point of the latter problems. Additionally, greedy methods were investigated for sparse PCA by Moghaddam et al. [18] and d'Aspremont et al. [7].

The PCs obtained by the above methods [15, 6, 17, 28, 8, 25, 18, 7, 16] are usually sparse. However, the aforementioned nice properties of the standard PCs are lost to some extent in these sparse PCs. Indeed, the likely correlation among the sparse PCs is not considered in these methods; therefore, their sparse PCs can be quite correlated with each other. Also, the total explained variance that these methods attempt to maximize can be too optimistic, as there may be some overlap among the individual variances of the sparse PCs. Finally, the loading vectors of the sparse PCs given by these methods lack orthogonality, except for SCoTLASS [17].

In this paper we propose a new formulation for sparse PCA by taking into account the three nice properties of the standard PCA, that is, maximal total explained variance, uncorrelation of PCs, and orthogonality of loading vectors. We also explore the connection of this formulation with the standard PCA and show that it can be viewed as a certain perturbation of the standard PCA. We further propose a novel augmented Lagrangian method for solving a class of nonsmooth constrained optimization problems, which is well suited for our formulation of sparse PCA.
This method differs from the classical augmented Lagrangian method in that: i) the values of the augmented Lagrangian functions at their approximate minimizers given by the method are bounded from above; and ii) the magnitude of the penalty parameters outgrows that of the Lagrange multipliers (see Section 3.2 for details). We show that this method converges to a feasible point, and moreover it converges to a first-order stationary point under some regularity assumptions. We also propose two nonmonotone gradient methods for minimizing a class of nonsmooth functions over a closed convex set, which can be suitably applied to the subproblems arising in our augmented Lagrangian method. We further establish global convergence and, under a local Lipschitzian error bounds assumption [26], a local linear rate of convergence for these gradient methods. Finally, we compare the sparse PCA approach proposed in this paper with several existing methods [28, 8, 25, 16] on synthetic, random, and real data, respectively. The computational results demonstrate that the sparse PCs obtained by our approach substantially outperform those by the other methods in terms of total explained variance, correlation of PCs, and orthogonality of loading vectors.

The rest of the paper is organized as follows. In Section 2, we propose a new formulation for sparse PCA and explore the connection of this formulation with the standard PCA. In Section 3, we then develop a novel augmented Lagrangian method for a class of nonsmooth constrained problems, and propose two nonmonotone gradient methods for minimizing a class of nonsmooth functions over a closed convex set. In Section 4, we discuss the applicability and implementation details of our augmented Lagrangian method for sparse PCA.
The sparse PCA approach proposed in this paper is then compared with several existing methods on synthetic, random, and real data in Section 5. Finally, we present some concluding remarks in Section 6.

1.1 Notation

In this paper, all vector spaces are assumed to be finite dimensional. The symbols ℜⁿ and ℜⁿ₊ (resp., ℜⁿ₋) denote the n-dimensional Euclidean space and the nonnegative (resp., nonpositive) orthant of ℜⁿ, respectively, and ℜ₊₊ denotes the set of positive real numbers. The space of all m × n matrices with real entries is denoted by ℜ^{m×n}. The space of symmetric n × n matrices is denoted by Sⁿ. Additionally, Dⁿ denotes the space of n × n diagonal matrices. For a real matrix X, we denote by |X| the absolute value of X, that is, |X|_{ij} = |X_{ij}| for all ij, and by sign(X) the sign of X, whose ij-th entry equals the sign of X_{ij} for all ij. Also, the nonnegative part of X is denoted by [X]₊, whose ij-th entry is given by max{0, X_{ij}} for all ij. The rank of X is denoted by rank(X). Further, the identity matrix and the all-ones matrix are denoted by I and E, respectively, whose dimensions should be clear from the context. If X ∈ Sⁿ is positive semidefinite, we write X ⪰ 0. For any X, Y ∈ Sⁿ, we write X ⪯ Y to mean Y − X ⪰ 0. Given matrices X and Y in ℜ^{m×n}, the standard inner product is defined by X • Y := Tr(XYᵀ), where Tr(·) denotes the trace of a matrix, and the component-wise product is denoted by X ⊙ Y, whose ij-th entry is X_{ij}Y_{ij} for all ij. ‖·‖ denotes the Euclidean norm and its associated operator norm unless explicitly stated otherwise. The minimal (resp., maximal) eigenvalue of an n × n symmetric matrix X is denoted by λ_min(X) (resp., λ_max(X)), and λ_i(X) denotes its i-th largest eigenvalue for i = 1, …, n. Given a vector v ∈ ℜⁿ, Diag(v) or Diag(v₁, …, vₙ) denotes the diagonal matrix whose i-th diagonal element is vᵢ for i = 1, …, n. Given an n × n matrix X, Diag(X) denotes the diagonal matrix whose i-th diagonal element is X_{ii} for i = 1, …, n. Let U be a real vector space. Given a closed convex set C ⊆ U, let dist(·, C) : U → ℜ₊ denote the distance function to C measured in terms of ‖·‖, that is,

\[ \mathrm{dist}(u, C) := \inf_{\tilde u \in C} \|u - \tilde u\| \quad \forall u \in U. \tag{1} \]

2 Formulation for sparse PCA

In this section we propose a new formulation for sparse PCA by taking into account sparsity and orthogonality of the loading vectors, and uncorrelation of the PCs. We also address the connection of our formulation with the standard PCA.

Let ξ = (ξ^{(1)}, …, ξ^{(p)}) be a p-dimensional random vector with covariance matrix Σ. Suppose X is an n × p data matrix, which records the n observations of ξ. Without loss of generality, assume the column means of X are 0. Then the commonly used sample covariance matrix of ξ is Σ̂ = XᵀX/(n − 1). For any r loading vectors represented as V = [V₁, …, V_r] ∈ ℜ^{p×r}, where 1 ≤ r ≤ p, the corresponding components are given by η = (η^{(1)}, …, η^{(r)}) = ξV, which are linear combinations of ξ^{(1)}, …, ξ^{(p)}. Clearly, the covariance matrix of η is VᵀΣV, and thus the components η^{(i)} and η^{(j)} are uncorrelated if and only if the ij-th entry of VᵀΣV is zero. Also, if the components are uncorrelated, the total variance they explain equals the sum of their individual variances, that is,

\[ \sum_{i=1}^{r} V_i^T \Sigma V_i = \mathrm{Tr}(V^T \Sigma V). \]

Recall that our aim is to find a set of sparse and orthogonal loading vectors V so that the corresponding components η^{(1)}, …, η^{(r)} are uncorrelated and explain as much variance of the original variables ξ^{(1)}, …, ξ^{(p)} as possible.
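As a quick numerical check of the discussion above (our illustration, not the paper's), the components corresponding to a loading matrix V are uncorrelated exactly when VᵀΣ̂V is diagonal: a random orthonormal V generally fails this, while eigenvector loadings satisfy it.

```python
import numpy as np

# Hypothetical numerical check: eta = xi * V has uncorrelated components
# (w.r.t. Sigma_hat) exactly when V^T Sigma_hat V is diagonal.
rng = np.random.default_rng(1)
n, p, r = 200, 6, 3
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X -= X.mean(axis=0)
Sigma_hat = X.T @ X / (n - 1)

def offdiag_norm(M):
    """Size of the off-diagonal part, i.e., the component correlations."""
    return np.linalg.norm(M - np.diag(np.diag(M)))

# A random orthonormal V (a QR factor) generally gives correlated components:
Q, _ = np.linalg.qr(rng.standard_normal((p, r)))
corr_random = offdiag_norm(Q.T @ Sigma_hat @ Q)

# Eigenvector loadings give (numerically) uncorrelated components:
_, V = np.linalg.eigh(Sigma_hat)
V = V[:, ::-1][:, :r]                     # top-r eigenvectors
corr_eig = offdiag_norm(V.T @ Sigma_hat @ V)
```

Both choices of V have orthonormal columns, yet only the second keeps the components uncorrelated, which is exactly the property the formulation below tries to retain while also enforcing sparsity.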
It appears that our goal can be achieved by solving the following problem:

\[
\begin{array}{ll}
\max\limits_{V \in \Re^{p \times r}} & \mathrm{Tr}(V^T \Sigma V) - \rho \bullet |V| \\
\text{s.t.} & V^T \Sigma V \ \text{is diagonal}, \quad V^T V = I,
\end{array} \tag{2}
\]

where ρ ∈ ℜ^{p×r}₊ is a tuning parameter for controlling the sparsity of V. However, the covariance matrix Σ is typically unknown and can only be approximated by the sample covariance matrix Σ̂. At a glance, it looks plausible to modify (2) by simply replacing Σ with Σ̂. Nevertheless, such a modification would eliminate all optimal solutions V* of (2) from consideration, since (V*)ᵀ Σ̂ V* is generally non-diagonal. For this reason, given a sample covariance Σ̂, we consider the following formulation for sparse PCA, which can be viewed as a modification of problem (2):

\[
\begin{array}{ll}
\max\limits_{V \in \Re^{p \times r}} & \mathrm{Tr}(V^T \hat\Sigma V) - \rho \bullet |V| \\
\text{s.t.} & |V_i^T \hat\Sigma V_j| \le \Delta_{ij} \quad \forall i \ne j, \quad V^T V = I,
\end{array} \tag{3}
\]

where Δ_{ij} ≥ 0 (i ≠ j) are parameters for controlling the correlation of the components corresponding to V. Clearly, Δ_{ij} = Δ_{ji} for all i ≠ j.

We next explore the connection of formulation (3) with the standard PCA. Before proceeding, we state a technical lemma that will be used subsequently. Its proof can be found in [20].

Lemma 2.1 Given any Σ̂ ∈ Sⁿ and an integer 1 ≤ r ≤ n, define

\[ \underline{i}_r = \max\{1 \le i \le n : \lambda_i(\hat\Sigma) > \lambda_r(\hat\Sigma)\}, \qquad \bar i_r = \max\{1 \le i \le n : \lambda_i(\hat\Sigma) = \lambda_r(\hat\Sigma)\}, \tag{4} \]

and let f* be the optimal value of

\[ \max\{\mathrm{Tr}(\hat\Sigma Y) : 0 \preceq Y \preceq I, \ \mathrm{Tr}(Y) = r\}. \tag{5} \]

Then f* = Σ_{i=1}^r λ_i(Σ̂), and Y* is an optimal solution of (5) if and only if Y* = U₁*U₁*ᵀ + U₂*P*U₂*ᵀ, where P* ∈ S^{\bar i_r − \underline{i}_r} satisfies 0 ⪯ P* ⪯ I and Tr(P*) = r − \underline{i}_r, and U₁* ∈ ℜ^{n×\underline{i}_r} and U₂* ∈ ℜ^{n×(\bar i_r − \underline{i}_r)} are the matrices whose columns consist of the orthonormal eigenvectors of Σ̂ corresponding to the eigenvalues (λ₁(Σ̂), …, λ_{\underline{i}_r}(Σ̂)) and (λ_{\underline{i}_r+1}(Σ̂), …, λ_{\bar i_r}(Σ̂)), respectively.
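To make formulation (3) concrete, the following sketch (our own, with hypothetical helper names) evaluates its objective and measures how far a candidate loading matrix V is from satisfying its constraints.

```python
import numpy as np

def sparse_pca_objective(V, Sigma_hat, rho):
    """Objective of formulation (3): Tr(V^T Sigma_hat V) - rho . |V|."""
    return np.trace(V.T @ Sigma_hat @ V) - np.sum(rho * np.abs(V))

def constraint_violation(V, Sigma_hat, Delta):
    """Violation of |V_i^T Sigma_hat V_j| <= Delta_ij (i != j) and of the
    orthonormality constraint V^T V = I, as a pair of nonneg. numbers."""
    C = V.T @ Sigma_hat @ V
    r = V.shape[1]
    corr = max(
        (abs(C[i, j]) - Delta[i, j] for i in range(r) for j in range(r) if i != j),
        default=0.0,
    )
    orth = np.abs(V.T @ V - np.eye(r)).max()
    return max(corr, 0.0), orth
```

For ρ = 0 and Δ = 0, any matrix of r orthonormal eigenvectors of Σ̂ has zero constraint violation, and its objective is the sum of the corresponding eigenvalues, consistent with Proposition 2.2 below.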
Then it follows from Lemma 2.1 that Y* = U₁*U₁*ᵀ + U₂*P*U₂*ᵀ, where P* ∈ S^{\bar i_r − \underline{i}_r} satisfies 0 ⪯ P* ⪯ I and Tr(P*) = r − \underline{i}_r, and U₁* ∈ ℜ^{p×\underline{i}_r} and U₂* ∈ ℜ^{p×(\bar i_r − \underline{i}_r)} are the matrices whose columns consist of the orthonormal eigenvectors of Σ̂ corresponding to the eigenvalues (λ₁(Σ̂), …, λ_{\underline{i}_r}(Σ̂)) and (λ_{\underline{i}_r+1}(Σ̂), …, λ_{\bar i_r}(Σ̂)), respectively. Thus we have

\[ \hat\Sigma U_1^* = U_1^* \Lambda, \qquad \hat\Sigma U_2^* = \lambda_r(\hat\Sigma)\, U_2^*, \tag{6} \]

where Λ = Diag(λ₁(Σ̂), …, λ_{\underline{i}_r}(Σ̂)). In addition, it is easy to show that rank(Y*) = \underline{i}_r + rank(P*). Since Y* = V*V*ᵀ and V*ᵀV* = I, we can observe that rank(Y*) = r. Hence rank(P*) = r − \underline{i}_r, which implies that P* has only r − \underline{i}_r nonzero eigenvalues. Using this fact and the relations 0 ⪯ P* ⪯ I and Tr(P*) = r − \underline{i}_r, we can further conclude that r − \underline{i}_r eigenvalues of P* are 1 and the rest are 0. Therefore, there exists W ∈ ℜ^{(\bar i_r − \underline{i}_r)×(r − \underline{i}_r)} such that

\[ W^T W = I, \qquad P^* = W W^T. \tag{7} \]

Together with Y* = U₁*U₁*ᵀ + U₂*P*U₂*ᵀ, this implies that Y* = U*U*ᵀ, where U* = [U₁*, U₂*W]. In view of (7) and the identities U₁*ᵀU₁* = I, U₂*ᵀU₂* = I and U₁*ᵀU₂* = 0, we see that U*ᵀU* = I. Using this result and the relations V*ᵀV* = I and Y* = U*U*ᵀ = V*V*ᵀ, it is not hard to see that the columns of U* and V*, respectively, form orthonormal bases of the range space of Y*. Thus V* = U*Q for some Q ∈ ℜ^{r×r} satisfying QᵀQ = I. Now let D = V*ᵀ Σ̂ V*. By the definition of V*, we know that D is an r × r diagonal matrix.
Moreover, in view of (6), (7), the definition of U*, and the relations V* = U*Q, U₁*ᵀU₁* = I, U₂*ᵀU₂* = I and U₁*ᵀU₂* = 0, we have

\[
D = V^{*T}\hat\Sigma V^{*} = Q^T U^{*T}\hat\Sigma U^{*} Q
  = Q^T \begin{bmatrix} U_1^{*T} \\ W^T U_2^{*T} \end{bmatrix}
    \hat\Sigma\, [\,U_1^{*} \;\; U_2^{*}W\,]\, Q
  = Q^T \begin{bmatrix} \Lambda & 0 \\ 0 & \lambda_r(\hat\Sigma)\, I \end{bmatrix} Q, \tag{8}
\]

which together with QᵀQ = I implies that D is similar to the diagonal matrix appearing on the right-hand side of (8). Hence, the diagonal elements of D consist of r largest eigenvalues of Σ̂. In addition, let Q₁ ∈ ℜ^{\underline{i}_r×r} and Q₂ ∈ ℜ^{(r−\underline{i}_r)×r} be the submatrices corresponding to the first \underline{i}_r and the last r − \underline{i}_r rows of Q, respectively. Then, in view of the definition of U* and V* = U*Q, we have

\[ [\,U_1^{*} \;\; U_2^{*}W\,] = U^{*} = V^{*} Q^T = [\,V^{*} Q_1^T \;\; V^{*} Q_2^T\,]. \]

Thus we obtain that U₁* = V*Q₁ᵀ and U₂*W = V*Q₂ᵀ. Using these identities, (6), (8), and the relation V* = U*Q, we have

\[
\hat\Sigma V^{*} = \hat\Sigma U^{*} Q = \hat\Sigma\, [\,U_1^{*} \;\; U_2^{*}W\,]\, Q
 = [\,U_1^{*}\Lambda \;\; \lambda_r(\hat\Sigma)\, U_2^{*}W\,]\, Q
 = [\,V^{*}Q_1^T \Lambda \;\; \lambda_r(\hat\Sigma)\, V^{*}Q_2^T\,]\, Q
 = V^{*} Q^T \begin{bmatrix} \Lambda & 0 \\ 0 & \lambda_r(\hat\Sigma)\, I \end{bmatrix} Q = V^{*} D.
\]

It follows that the columns of V* consist of orthonormal eigenvectors of Σ̂ corresponding to r largest eigenvalues of Σ̂, and thus the "only if" part of this proposition holds. ∎

From the above proposition, we see that when ρ = 0 and Δ_{ij} = 0 for all i ≠ j, each solution of (3) consists of orthonormal eigenvectors of Σ̂ corresponding to r largest eigenvalues of Σ̂, which can be computed from the eigenvalue decomposition of Σ̂. Therefore, the loading vectors obtained from (3) are the same as those given by the standard PCA when applied to Σ̂. On the other hand, when ρ and the Δ_{ij} (i ≠ j) are small, the loading vectors found by (3) can be viewed as an approximation of those provided by the standard PCA. We will propose suitable methods for solving (3) in Sections 3 and 4.
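Proposition 2.2 is easy to verify numerically. The sketch below (our illustration on synthetic data, not part of the paper) checks that with ρ = 0 and Δ_{ij} = 0 the top-r eigenvectors of Σ̂ attain the value Σ_{i=1}^r λ_i(Σ̂), while another feasible choice of eigenvectors explains less variance.

```python
import numpy as np

# Numerical illustration of Proposition 2.2 (synthetic data): with
# rho = 0 and Delta = 0, any r orthonormal eigenvectors of Sigma_hat are
# feasible for (3); the top-r ones attain f* = sum of r largest eigenvalues.
rng = np.random.default_rng(2)
p, r = 8, 3
A = rng.standard_normal((p, p))
Sigma_hat = A @ A.T / p
evals, U = np.linalg.eigh(Sigma_hat)
evals, U = evals[::-1], U[:, ::-1]         # descending eigenvalue order

V_top = U[:, :r]                           # top-r eigenvectors: optimal
V_bottom = U[:, -r:]                       # another feasible choice

obj = lambda V: np.trace(V.T @ Sigma_hat @ V)
f_star = evals[:r].sum()
```

Here `obj(V_top)` matches `f_star` to machine precision, and `obj(V_bottom)` is strictly smaller, as the proposition predicts.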
3 Augmented Lagrangian method for nonsmooth constrained nonlinear programming

In this section we propose a novel augmented Lagrangian method for a class of nonsmooth constrained nonlinear programming problems, which is well suited for formulation (3) of sparse PCA. In particular, we study first-order optimality conditions in Subsection 3.1. In Subsection 3.2, we develop an augmented Lagrangian method and establish its global convergence. In Subsection 3.3, we propose two nonmonotone gradient methods for minimizing a class of nonsmooth functions over a closed convex set, which can be suitably applied to the subproblems arising in our augmented Lagrangian method. We also establish global and local convergence for these gradient methods.

3.1 First-order optimality conditions

In this subsection we introduce a class of nonsmooth constrained nonlinear programming problems and study first-order optimality conditions for them. Consider the nonlinear programming problem

\[
\begin{array}{ll}
\min & f(x) + P(x) \\
\text{s.t.} & g_i(x) \le 0, \quad i = 1, \ldots, m, \\
 & h_i(x) = 0, \quad i = 1, \ldots, p, \\
 & x \in X.
\end{array} \tag{9}
\]

We assume that the functions f : ℜⁿ → ℜ, g_i : ℜⁿ → ℜ, i = 1, …, m, and h_i : ℜⁿ → ℜ, i = 1, …, p, are continuously differentiable, that the function P : ℜⁿ → ℜ is convex but not necessarily smooth, and that the set X ⊆ ℜⁿ is closed and convex. For convenience of the subsequent presentation, we denote by Ω the feasible region of problem (9). Before establishing first-order optimality conditions for problem (9), we describe a general constraint qualification for (9), namely Robinson's condition, which was proposed in [21]. Let x ∈ ℜⁿ be a feasible point of problem (9). We denote the set of active inequality constraints at x by

\[ \mathcal{A}(x) = \{\, 1 \le i \le m : g_i(x) = 0 \,\}. \]
In addition, x is said to satisfy Robinson's condition if

\[
\left\{ \begin{pmatrix} g'(x)d - v \\ h'(x)d \end{pmatrix} : d \in T_X(x),\ v \in \Re^m,\ v_i \le 0,\ i \in \mathcal{A}(x) \right\} = \Re^m \times \Re^p, \tag{10}
\]

where g′(x) and h′(x) denote the Jacobians of the functions g = (g₁, …, g_m) and h = (h₁, …, h_p) at x, respectively. Other equivalent expressions of Robinson's condition can be found, for example, in [21, 22, 24]. The following proposition demonstrates that Robinson's condition is indeed a constraint qualification for problem (9). For the sake of completeness, we include a brief proof of it.

Proposition 3.1 Given a feasible point x ∈ ℜⁿ of problem (9), let T_Ω(x) be the tangent cone to Ω at x, and (T_Ω(x))° its polar cone. If Robinson's condition (10) holds at x, then

\[
\begin{aligned}
T_\Omega(x) &= \{\, d \in T_X(x) : d^T \nabla g_i(x) \le 0,\ i \in \mathcal{A}(x);\ d^T \nabla h_i(x) = 0,\ i = 1, \ldots, p \,\}, \\
(T_\Omega(x))^\circ &= \Big\{ \textstyle\sum_{i \in \mathcal{A}(x)} \lambda_i \nabla g_i(x) + \sum_{i=1}^{p} \mu_i \nabla h_i(x) + N_X(x) : \lambda \in \Re^m_+,\ \mu \in \Re^p \Big\},
\end{aligned} \tag{11}
\]

where T_X(x) and N_X(x) are the tangent and normal cones to X at x, respectively.

Proof. By Theorem A.10 of [24], Robinson's condition (10) implies that the assumption of Theorem 3.15 of [24] is satisfied with

x₀ = x, X₀ = X, Y₀ = ℜᵐ₋ × ℜᵖ, g(·) = (g₁(·); …; g_m(·); h₁(·); …; h_p(·)).

The first statement then follows from Theorem 3.15 of [24] with the above x₀, X₀, Y₀ and g(·). Further, let A(x) denote the matrix whose rows are the gradients of all active constraints at x, in the same order as they appear in (9). Then Robinson's condition (10) implies that the assumptions of Theorem 2.36 of [24] are satisfied with

A = A(x), K₁ = T_X(x), K₂ = ℜ^{|𝒜(x)|}₋ × ℜᵖ.

Let K = {d ∈ K₁ : Ad ∈ K₂}. Then it follows from Theorem 2.36 of [24] that

\[ (T_\Omega(x))^\circ = K^\circ = K_1^\circ + \{A^T \xi : \xi \in K_2^\circ\}, \]

which together with the identity (T_X(x))° = N_X(x) and the definitions of A, K₁ and K₂ implies that the second statement holds. ∎

We are now ready to establish first-order optimality conditions for problem (9).

Theorem 3.2 Let x* ∈ ℜⁿ be a local minimizer of problem (9). Assume that Robinson's condition (10) is satisfied at x*. Then there exist Lagrange multipliers λ ∈ ℜᵐ₊ and μ ∈ ℜᵖ such that

\[ 0 \in \nabla f(x^*) + \partial P(x^*) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x^*) + \sum_{i=1}^{p} \mu_i \nabla h_i(x^*) + N_X(x^*), \tag{12} \]

and

\[ \lambda_i\, g_i(x^*) = 0, \quad i = 1, \ldots, m. \tag{13} \]

Moreover, the set of Lagrange multipliers (λ, μ) ∈ ℜᵐ₊ × ℜᵖ satisfying the above conditions, denoted by Λ(x*), is convex and compact.

Proof. We first show that

\[ d^T \nabla f(x^*) + P'(x^*; d) \ge 0 \quad \forall d \in T_\Omega(x^*). \tag{14} \]

Let d ∈ T_Ω(x*) be arbitrarily chosen. Then there exist sequences {x^k}_{k=1}^∞ ⊆ Ω and {t_k}_{k=1}^∞ ⊆ ℜ₊₊ such that t_k ↓ 0 and

\[ d = \lim_{k \to \infty} \frac{x^k - x^*}{t_k}. \]

Thus we have x^k = x* + t_k d + o(t_k). Using this relation along with the fact that f is differentiable and P is convex in ℜⁿ, we have

\[ f(x^* + t_k d) - f(x^k) = o(t_k), \qquad P(x^* + t_k d) - P(x^k) = o(t_k), \tag{15} \]

where the first equality follows from the Mean Value Theorem, while the second one comes from Theorem 10.4 of [23]. Clearly, x^k → x*. This, together with the assumption that x* is a local minimizer of (9), implies that

\[ f(x^k) + P(x^k) \ge f(x^*) + P(x^*) \tag{16} \]

when k is sufficiently large. In view of (15) and (16), we obtain that

\[
\begin{aligned}
d^T \nabla f(x^*) + P'(x^*; d)
&= \lim_{k\to\infty} \frac{f(x^* + t_k d) - f(x^*)}{t_k} + \lim_{k\to\infty} \frac{P(x^* + t_k d) - P(x^*)}{t_k} \\
&= \lim_{k\to\infty} \left( \frac{f(x^k) + P(x^k) - f(x^*) - P(x^*)}{t_k} + \frac{o(t_k)}{t_k} \right) \\
&= \lim_{k\to\infty} \frac{f(x^k) + P(x^k) - f(x^*) - P(x^*)}{t_k} \ \ge\ 0,
\end{aligned}
\]

and hence (14) holds.
For simplicity of notation, let T_Ω° = (T_Ω(x*))° and S = −∇f(x*) − ∂P(x*). We next show that S ∩ T_Ω° ≠ ∅. Suppose for contradiction that S ∩ T_Ω° = ∅. This, together with the fact that S and T_Ω° are nonempty closed convex sets and S is bounded, implies that there exists some d ∈ ℜⁿ such that dᵀy ≤ 0 for any y ∈ T_Ω°, and dᵀy ≥ 1 for any y ∈ S. Clearly, we see that d ∈ (T_Ω°)° = T_Ω(x*), and

\[
1 \le \inf_{y \in S} d^T y = \inf_{z \in \partial P(x^*)} d^T(-\nabla f(x^*) - z)
 = -d^T \nabla f(x^*) - \sup_{z \in \partial P(x^*)} d^T z
 = -d^T \nabla f(x^*) - P'(x^*; d),
\]

which contradicts (14). Hence, S ∩ T_Ω° ≠ ∅. Using this relation, (11), the definitions of S and 𝒜(x*), and letting λ_i = 0 for i ∉ 𝒜(x*), we easily see that (12) and (13) hold.

In view of the fact that ∂P(x*) and N_X(x*) are closed and convex, and moreover ∂P(x*) is bounded, we know that ∂P(x*) + N_X(x*) is closed and convex. Using this result, it is straightforward to see that Λ(x*) is closed and convex. We next show that Λ(x*) is bounded. Suppose for contradiction that Λ(x*) is unbounded. Then there exists a sequence {(λ^k, μ^k)}_{k=1}^∞ ⊆ Λ(x*) such that ‖(λ^k, μ^k)‖ → ∞ and

\[ 0 = \nabla f(x^*) + z^k + \sum_{i=1}^{m} \lambda_i^k \nabla g_i(x^*) + \sum_{i=1}^{p} \mu_i^k \nabla h_i(x^*) + v^k \tag{17} \]

for some {z^k}_{k=1}^∞ ⊆ ∂P(x*) and {v^k}_{k=1}^∞ ⊆ N_X(x*). Let (λ̄^k, μ̄^k) = (λ^k, μ^k)/‖(λ^k, μ^k)‖. By passing to a subsequence if necessary, we can assume that (λ̄^k, μ̄^k) → (λ̄, μ̄). We clearly see that ‖(λ̄, μ̄)‖ = 1, λ̄ ∈ ℜᵐ₊, and λ̄_i = 0 for i ∉ 𝒜(x*). Note that ∂P(x*) is bounded and N_X(x*) is a closed cone.
In view of this fact, and upon dividing both sides of (17) by ‖(λ^k, μ^k)‖ and taking limits along a subsequence if necessary, we obtain that

\[ 0 = \sum_{i=1}^{m} \bar\lambda_i \nabla g_i(x^*) + \sum_{i=1}^{p} \bar\mu_i \nabla h_i(x^*) + \bar v \tag{18} \]

for some v̄ ∈ N_X(x*). Since Robinson's condition (10) is satisfied at x*, there exist d ∈ T_X(x*) and v ∈ ℜᵐ such that v_i ≤ 0 for i ∈ 𝒜(x*), and

\[ d^T \nabla g_i(x^*) - v_i = -\bar\lambda_i \quad \forall i \in \mathcal{A}(x^*), \qquad d^T \nabla h_i(x^*) = -\bar\mu_i, \quad i = 1, \ldots, p. \]

Using these relations, (18), and the facts that d ∈ T_X(x*), v̄ ∈ N_X(x*), λ̄ ∈ ℜᵐ₊, and λ̄_i = 0 for i ∉ 𝒜(x*), we have

\[
\sum_{i=1}^{m} \bar\lambda_i^2 + \sum_{i=1}^{p} \bar\mu_i^2
\ \le\ -\sum_{i=1}^{m} \bar\lambda_i\, d^T \nabla g_i(x^*) - \sum_{i=1}^{p} \bar\mu_i\, d^T \nabla h_i(x^*)
\ =\ -d^T \Big( \sum_{i=1}^{m} \bar\lambda_i \nabla g_i(x^*) + \sum_{i=1}^{p} \bar\mu_i \nabla h_i(x^*) \Big)
\ =\ d^T \bar v \ \le\ 0.
\]

It yields (λ̄, μ̄) = (0, 0), which contradicts the identity ‖(λ̄, μ̄)‖ = 1. Thus, Λ(x*) is bounded. ∎

3.2 Augmented Lagrangian method for (9)

For a convex program, it is known that under some mild assumptions, any accumulation point of the sequence generated by the classical augmented Lagrangian method is an optimal solution (e.g., see [ ]). Nevertheless, when problem (9) is a nonconvex program, especially when the function h_i is not affine or g_i is nonconvex, the classical augmented Lagrangian method may not even converge to a feasible point. To alleviate this drawback, we propose a novel augmented Lagrangian method for problem (9) and establish its global convergence in this subsection. Throughout this subsection, we make the following assumption for problem (9).

Assumption 1 Problem (9) is feasible, and moreover at least one feasible solution, denoted by x_feas, is known.
It is well known that for problem (9) the associated augmented Lagrangian function $\mathcal{L}_\varrho(x,\lambda,\mu): \Re^n \times \Re^m \times \Re^p \to \Re$ is given by
$$\mathcal{L}_\varrho(x,\lambda,\mu) := w(x) + P(x), \qquad (19)$$
where
$$w(x) := f(x) + \frac{1}{2\varrho}\big(\|[\lambda + \varrho g(x)]^+\|^2 - \|\lambda\|^2\big) + \mu^T h(x) + \frac{\varrho}{2}\|h(x)\|^2, \qquad (20)$$
and $\varrho > 0$ is a penalty parameter (e.g., see [3, 24]). Roughly speaking, an augmented Lagrangian method, when applied to problem (9), solves a sequence of subproblems of the form $\min_{x \in X} \mathcal{L}_\varrho(x,\lambda,\mu)$ while updating the Lagrangian multipliers $(\lambda,\mu)$ and the penalty parameter $\varrho$. Let $x_{\rm feas}$ be a known feasible point of (9) (see Assumption 1). We now describe the algorithm framework of a novel augmented Lagrangian method as follows.

Algorithm framework of augmented Lagrangian method:

Let $\{\epsilon_k\}$ be a positive decreasing sequence. Let $\lambda^0 \in \Re^m_+$, $\mu^0 \in \Re^p$, $\varrho_0 > 0$, $\tau > 0$, $\sigma > 1$ be given. Choose an arbitrary initial point $x^0_{\rm init} \in X$ and a constant $\Upsilon \ge \max\{f(x_{\rm feas}),\, \mathcal{L}_{\varrho_0}(x^0_{\rm init},\lambda^0,\mu^0)\}$. Set $k = 0$.

1) Find an approximate solution $x^k \in X$ for the subproblem
$$\min_{x \in X} \mathcal{L}_{\varrho_k}(x,\lambda^k,\mu^k) \qquad (21)$$
such that
$$\mathrm{dist}\big({-\nabla w(x^k)},\, \partial P(x^k) + N_X(x^k)\big) \le \epsilon_k, \qquad \mathcal{L}_{\varrho_k}(x^k,\lambda^k,\mu^k) \le \Upsilon. \qquad (22)$$

2) Update the Lagrange multipliers according to
$$\lambda^{k+1} := [\lambda^k + \varrho_k g(x^k)]^+, \qquad \mu^{k+1} := \mu^k + \varrho_k h(x^k). \qquad (23)$$

3) Set $\varrho_{k+1} := \max\big\{\sigma\varrho_k,\; \|\lambda^{k+1}\|^{1+\tau},\; \|\mu^{k+1}\|^{1+\tau}\big\}$.

4) Set $k \leftarrow k+1$ and go to step 1).

end

The above augmented Lagrangian method differs from the classical augmented Lagrangian method in that: i) the values of the augmented Lagrangian functions at the approximate minimizers produced by the method are bounded from above (see step 1)); and ii) the magnitude of the penalty parameters outgrows that of the Lagrangian multipliers (see step 3)).
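To make the framework above concrete, the following is a minimal numerical sketch (our own illustration, not the implementation used in the paper's experiments) on a toy instance of (9) with $P \equiv 0$ and $X = \Re^2$: minimize $\|x\|^2$ subject to $g(x) = -x_1 \le 0$ and $h(x) = x_1 + x_2 - 1 = 0$. A fixed-budget gradient loop stands in for the inner subproblem solvers of Subsection 3.3, and the toy data, step sizes, and iteration budgets are ad hoc assumptions:

```python
import numpy as np

# Toy instance of (9): f(x) = ||x||^2, g(x) = -x1 <= 0, h(x) = x1 + x2 - 1 = 0.
f_grad = lambda x: 2.0 * x
g      = lambda x: np.array([-x[0]])
g_jac  = lambda x: np.array([[-1.0, 0.0]])
h      = lambda x: np.array([x[0] + x[1] - 1.0])
h_jac  = lambda x: np.array([[1.0, 1.0]])

def augmented_lagrangian(x, lam, mu, rho=1.0, sigma=2.0, tau=0.1,
                         outer=8, inner=5000):
    for _ in range(outer):
        # Step 1: approximately minimize L_rho(., lam, mu); the gradient of
        # w(x) in (20) is f'(x) + Jg^T [lam + rho g(x)]_+ + Jh^T (mu + rho h(x)).
        step = 1.0 / (3.0 + 2.0 * rho)      # ad hoc bound on the local Lipschitz const.
        for _ in range(inner):
            grad_w = (f_grad(x)
                      + g_jac(x).T @ np.maximum(lam + rho * g(x), 0.0)
                      + h_jac(x).T @ (mu + rho * h(x)))
            x = x - step * grad_w
        # Step 2: multiplier update (23).
        lam = np.maximum(lam + rho * g(x), 0.0)
        mu  = mu + rho * h(x)
        # Step 3: the penalty parameter outgrows the multipliers.
        rho = max(sigma * rho,
                  np.linalg.norm(lam) ** (1 + tau),
                  np.linalg.norm(mu) ** (1 + tau))
    return x, lam, mu

x, lam, mu = augmented_lagrangian(np.zeros(2), np.zeros(1), np.zeros(1))
```

On this instance the iterates approach the solution $(1/2, 1/2)$ with multiplier $\mu^* = -1$, while $\varrho_k$ grows geometrically as in step 3); the fixed inner budget plays the role of the tolerance $\epsilon_k$ in (22).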
In addition, to make the above augmented Lagrangian method complete, we need to address how to find an approximate solution $x^k \in X$ for subproblem (21) satisfying (22), as required in step 1). We will leave this discussion to the end of this subsection. For the time being, we establish the main convergence result regarding this method for solving problem (9).

Theorem 3.3 Assume that $\epsilon_k \to 0$. Let $\{x^k\}$ be the sequence generated by the above augmented Lagrangian method satisfying (22). Suppose that a subsequence $\{x^k\}_{k \in K}$ converges to $x^*$. Then the following statements hold:

(a) $x^*$ is a feasible point of problem (9);

(b) Further, if Robinson's condition (10) is satisfied at $x^*$, then the subsequence $\{(\lambda^{k+1},\mu^{k+1})\}_{k \in K}$ is bounded, and each accumulation point $(\lambda^*,\mu^*)$ of $\{(\lambda^{k+1},\mu^{k+1})\}_{k \in K}$ is a vector of Lagrange multipliers satisfying the first-order optimality conditions (12)-(13) at $x^*$.

Proof. In view of (19), (20) and the second relation in (22), we have
$$f(x^k) + P(x^k) + \frac{1}{2\varrho_k}\big(\|[\lambda^k + \varrho_k g(x^k)]^+\|^2 - \|\lambda^k\|^2\big) + (\mu^k)^T h(x^k) + \frac{\varrho_k}{2}\|h(x^k)\|^2 \le \Upsilon \quad \forall k.$$
It follows that
$$\|[\lambda^k/\varrho_k + g(x^k)]^+\|^2 + \|h(x^k)\|^2 \le 2\big[\Upsilon - f(x^k) - P(x^k) - (\mu^k)^T h(x^k)\big]/\varrho_k + (\|\lambda^k\|/\varrho_k)^2.$$
Noticing that $\varrho_0 > 0$, $\tau > 0$, and $\varrho_{k+1} = \max\{\sigma\varrho_k, \|\lambda^{k+1}\|^{1+\tau}, \|\mu^{k+1}\|^{1+\tau}\}$ for $k \ge 0$, we observe that $\varrho_k \to \infty$ and $\|(\lambda^k,\mu^k)\|/\varrho_k \to 0$. We also know that $\{x^k\}_{k \in K} \to x^*$, $\{g(x^k)\}_{k \in K} \to g(x^*)$ and $\{h(x^k)\}_{k \in K} \to h(x^*)$. Using these results, and upon taking limits as $k \in K \to \infty$ on both sides of the above inequality, we obtain that
$$\|[g(x^*)]^+\|^2 + \|h(x^*)\|^2 \le 0,$$
which implies that $g(x^*) \le 0$ and $h(x^*) = 0$. We also know that $x^* \in X$. It thus follows that statement (a) holds.

We next show that statement (b) also holds.
Using (21), (19), (20), (23), and the first relation in (22), we have
$$\big\|\nabla f(x^k) + (\lambda^{k+1})^T \nabla g(x^k) + (\mu^{k+1})^T \nabla h(x^k) + z^k + v^k\big\| \le \epsilon_k \qquad (24)$$
for some $z^k \in \partial P(x^k)$ and $v^k \in N_X(x^k)$. Suppose for contradiction that the subsequence $\{(\lambda^{k+1},\mu^{k+1})\}_{k \in K}$ is unbounded. By passing to a subsequence if necessary, we can assume that $\|(\lambda^{k+1},\mu^{k+1})\| \to \infty$ as $k \in K \to \infty$. Let $(\bar\lambda^{k+1},\bar\mu^{k+1}) = (\lambda^{k+1},\mu^{k+1})/\|(\lambda^{k+1},\mu^{k+1})\|$ and $\bar v^k = v^k/\|(\lambda^{k+1},\mu^{k+1})\|$. Recall that $\{x^k\}_{k \in K} \to x^*$. This together with Theorem 6.2.7 of [13] implies that $\cup_{k \in K}\, \partial P(x^k)$ is bounded, and so is $\{z^k\}_{k \in K}$. In addition, $\{g(x^k)\}_{k \in K} \to g(x^*)$ and $\{h(x^k)\}_{k \in K} \to h(x^*)$. Then we can observe from (24) that $\{\bar v^k\}_{k \in K}$ is bounded. Without loss of generality, assume that $\{(\bar\lambda^{k+1},\bar\mu^{k+1})\}_{k \in K} \to (\bar\lambda,\bar\mu)$ and $\{\bar v^k\}_{k \in K} \to \bar v$ (otherwise, one can consider their convergent subsequences). Clearly, $\|(\bar\lambda,\bar\mu)\| = 1$. Dividing both sides of (24) by $\|(\lambda^{k+1},\mu^{k+1})\|$ and taking limits as $k \in K \to \infty$, we obtain that
$$\bar\lambda^T \nabla g(x^*) + \bar\mu^T \nabla h(x^*) + \bar v = 0. \qquad (25)$$
Further, using the identity $\lambda^{k+1} = [\lambda^k + \varrho_k g(x^k)]^+$ and the fact that $\varrho_k \to \infty$ and $\|\lambda^k\|/\varrho_k \to 0$, we observe that $\lambda^{k+1} \in \Re^m_+$ and $\lambda^{k+1}_i = 0$ for $i \notin A(x^*)$ when $k \in K$ is sufficiently large, which imply that $\bar\lambda \in \Re^m_+$ and $\bar\lambda_i = 0$ for $i \notin A(x^*)$. Moreover, we have $\bar v \in N_X(x^*)$ since $N_X(x^*)$ is a closed cone. Using these results, (25), Robinson's condition (10) at $x^*$, and a similar argument to that in the proof of Theorem 3.2, we can obtain that $(\bar\lambda,\bar\mu) = (0,0)$, which contradicts the identity $\|(\bar\lambda,\bar\mu)\| = 1$. Therefore, the subsequence $\{(\lambda^{k+1},\mu^{k+1})\}_{k \in K}$ is bounded. Using this result together with (24) and the fact that $\{z^k\}_{k \in K}$ is bounded, we immediately see that $\{v^k\}_{k \in K}$ is bounded.
Using semicontinuity of $\partial P(\cdot)$ and $N_X(\cdot)$ (see Theorem 24.4 of [23] and Lemma 2.42 of [24]) and the fact that $\{x^k\}_{k \in K} \to x^*$, we conclude that every accumulation point of $\{z^k\}_{k \in K}$ and $\{v^k\}_{k \in K}$ belongs to $\partial P(x^*)$ and $N_X(x^*)$, respectively. Using these results and (24), we further see that for every accumulation point $(\lambda^*,\mu^*)$ of $\{(\lambda^{k+1},\mu^{k+1})\}_{k \in K}$, there exist some $z^* \in \partial P(x^*)$ and $v^* \in N_X(x^*)$ such that
$$\nabla f(x^*) + (\lambda^*)^T \nabla g(x^*) + (\mu^*)^T \nabla h(x^*) + z^* + v^* = 0.$$
Moreover, using the identity $\lambda^{k+1} = [\lambda^k + \varrho_k g(x^k)]^+$ and the fact that $\varrho_k \to \infty$ and $\|\lambda^k\|/\varrho_k \to 0$, we easily see that $\lambda^* \in \Re^m_+$ and $\lambda^*_i = 0$ for $i \notin A(x^*)$. Thus, $(\lambda^*,\mu^*)$ satisfies the first-order optimality conditions (12)-(13) at $x^*$.

Before ending this subsection, we briefly discuss how to find an approximate solution $x^k \in X$ for subproblem (21) satisfying (22), as required in step 1) of the above augmented Lagrangian method. In particular, we are interested in applying the nonmonotone gradient methods proposed in Subsection 3.3 to (21). As shown in Subsection 3.3 (see Theorems 3.9 and 3.13), these methods are able to find an approximate solution $x^k \in X$ satisfying the first relation of (22). Moreover, if an initial point for these methods is properly chosen, the obtained approximate solution $x^k$ also satisfies the second relation of (22). For example, given $k \ge 0$, let $x^k_{\rm init} \in X$ denote the initial point for solving the $k$th subproblem (21), and define $x^k_{\rm init}$ for $k \ge 1$ as follows:
$$x^k_{\rm init} = \begin{cases} x_{\rm feas}, & \text{if } \mathcal{L}_{\varrho_k}(x^{k-1},\lambda^k,\mu^k) > \Upsilon; \\ x^{k-1}, & \text{otherwise}, \end{cases}$$
where $x^{k-1}$ is the approximate solution to the $(k-1)$st subproblem (21) satisfying (22) (with $k$ replaced by $k-1$). Recall from Assumption 1 that $x_{\rm feas}$ is a feasible solution of (9).
Thus, $g(x_{\rm feas}) \le 0$ and $h(x_{\rm feas}) = 0$, which together with (19), (20) and the definition of $\Upsilon$ implies that
$$\mathcal{L}_{\varrho_k}(x_{\rm feas},\lambda^k,\mu^k) \le f(x_{\rm feas}) \le \Upsilon.$$
It follows from this inequality and the above choice of $x^k_{\rm init}$ that $\mathcal{L}_{\varrho_k}(x^k_{\rm init},\lambda^k,\mu^k) \le \Upsilon$. Additionally, the nonmonotone gradient methods proposed in Subsection 3.3 possess the natural property that the objective function values at all subsequent iterates are bounded above by the one at the initial point. Therefore, we have
$$\mathcal{L}_{\varrho_k}(x^k,\lambda^k,\mu^k) \le \mathcal{L}_{\varrho_k}(x^k_{\rm init},\lambda^k,\mu^k) \le \Upsilon,$$
and so the second relation of (22) is satisfied at $x^k$.

3.3 Nonmonotone gradient methods for nonsmooth minimization

In this subsection we propose two nonmonotone gradient methods for minimizing a class of nonsmooth functions over a closed convex set, which can be suitably applied to the subproblems arising in our augmented Lagrangian method detailed in Subsection 3.2. We also establish global convergence and a local linear rate of convergence for these methods. It should be mentioned that these two methods are closely related to the ones proposed in [27] and [26], but they are not the same (see the remarks below for details). In addition, these methods can be viewed as an extension of the well-known projected gradient methods studied in [4] for smooth problems, but the methods proposed in [27] and [26] cannot.

Throughout this subsection, we consider the following problem:
$$\min_{x \in X} \{F(x) := f(x) + P(x)\}, \qquad (26)$$
where $f: \Re^n \to \Re$ is continuously differentiable, $P: \Re^n \to \Re$ is convex but not necessarily smooth, and $X \subseteq \Re^n$ is closed and convex. We say that $x \in \Re^n$ is a stationary point of problem (26) if $x \in X$ and
$$0 \in \nabla f(x) + \partial P(x) + N_X(x). \qquad (27)$$
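For intuition, in the common special case $P(x) = c\|x\|_1$ with $X = \Re^n$, the stationarity condition above can be tested computationally through the proximal residual $x - \mathrm{prox}_{c\|\cdot\|_1}(x - \nabla f(x))$, which vanishes exactly at the stationary points of this composite problem. The following sketch (the toy data are our own assumption) illustrates this check:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1 (componentwise soft thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def stationarity_residual(x, grad_fx, c):
    """Residual x - prox_{c||.||_1}(x - grad f(x)); it is zero exactly at
    stationary points of f + c||.||_1 over X = R^n."""
    return x - soft_threshold(x - grad_fx, c)

# Toy check on f(x) = 0.5*||x - b||^2: the minimizer of f + c||.||_1 is
# the soft-thresholded vector soft_threshold(b, c).
b = np.array([3.0, 0.2, -1.5])
c = 1.0
x_star = soft_threshold(b, c)
r_star = stationarity_residual(x_star, x_star - b, c)   # vanishes at x_star
r_zero = stationarity_residual(np.zeros(3), -b, c)      # nonzero at the origin
```

This residual is also the negative of the search direction introduced next, specialized to $H = I$.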
Given a point $x \in \Re^n$ and $H \succ 0$, we denote by $d_H(x)$ the solution of the following problem:
$$d_H(x) := \arg\min_d \left\{\nabla f(x)^T d + \tfrac{1}{2} d^T H d + P(x+d) : x + d \in X\right\}. \qquad (28)$$
The following lemma provides an alternative characterization of stationarity that will be used in our subsequent analysis.

Lemma 3.4 For any $H \succ 0$, $x \in X$ is a stationary point of problem (26) if and only if $d_H(x) = 0$.

Proof. We first observe that (28) is a convex problem, and moreover its objective function is strictly convex. The conclusion of this lemma immediately follows from this observation and the first-order optimality condition of (28).

The next lemma shows that $\|d_H(x)\|$ does not change too rapidly with $H$. It will be used to prove Theorems 3.10 and 3.14.

Lemma 3.5 For any $x \in \Re^n$, $H \succ 0$, and $\tilde H \succ 0$, let $d = d_H(x)$ and $\tilde d = d_{\tilde H}(x)$. Then
$$\|\tilde d\| \;\le\; \frac{1 + \lambda_{\max}(Q) + \sqrt{1 - 2\lambda_{\min}(Q) + \lambda_{\max}(Q)^2}}{2\,\lambda_{\min}(\tilde H)}\;\lambda_{\max}(H)\,\|d\|, \qquad (29)$$
where $Q = H^{-1/2}\tilde H H^{-1/2}$.

Proof. The conclusion immediately follows from Lemma 3.2 of [26] with $J = \{1,\ldots,n\}$, $c = 1$, and $P(x) := P(x) + I_X(x)$, where $I_X$ is the indicator function of $X$.

The following lemma will be used to prove Theorems 3.10 and 3.14.

Lemma 3.6 Given $x \in \Re^n$ and $H \succ 0$, let $g = \nabla f(x)$ and $\Delta_d = g^T d + P(x+d) - P(x)$ for all $d \in \Re^n$. Let $\sigma \in (0,1)$ be given. The following statements hold:

(a) If $d = d_H(x)$, then $-\Delta_d \ge d^T H d \ge \lambda_{\min}(H)\|d\|^2$.

(b) For any $\bar x \in \Re^n$, $\alpha \in (0,1]$, $d = d_H(x)$, and $x' = x + \alpha d$,
$$(g + Hd)^T(x' - \bar x) + P(x') - P(\bar x) \le (\alpha - 1)(d^T H d + \Delta_d).$$

(c) If $f$ satisfies
$$\|\nabla f(y) - \nabla f(z)\| \le L\|y - z\| \quad \forall y, z \in \Re^n \qquad (30)$$
for some $L > 0$, then the descent condition $F(x + \alpha d) \le F(x) + \sigma\alpha\Delta_d$ is satisfied for $d = d_H(x)$, provided $0 \le \alpha \le \min\{1, 2(1-\sigma)\lambda_{\min}(H)/L\}$.
(d) If $f$ satisfies (30), then the descent condition $F(x+d) \le F(x) + \sigma\Delta_d$ is satisfied for $d = d_{H(\alpha)}(x)$, where $H(\alpha) = \alpha H$, provided $\alpha \ge L/[2(1-\sigma)\lambda_{\min}(H)]$.

Proof. Statements (a)-(c) follow from Theorem 4.1(a) and Lemma 3.4 of [26] with $J = \{1,\ldots,n\}$, $\gamma = 0$, and $\lambda = \lambda_{\min}(H)$. We now prove statement (d). Let $d = d_{H(\alpha)}(x)$, where $H(\alpha) = \alpha H$ for $\alpha > 0$. It follows from statement (a) that $\|d\|^2 \le -\Delta_d/(\alpha\lambda_{\min}(H))$. Using this relation, (30), and the definitions of $F$ and $\Delta_d$, we have
$$F(x+d) - F(x) = f(x+d) - f(x) + P(x+d) - P(x) = \nabla f(x)^T d + P(x+d) - P(x) + \int_0^1 d^T\big(\nabla f(x+td) - \nabla f(x)\big)\,dt$$
$$\le \Delta_d + \int_0^1 \|\nabla f(x+td) - \nabla f(x)\|\,\|d\|\,dt \;\le\; \Delta_d + L\|d\|^2/2 \;\le\; \big[1 - L/(2\alpha\lambda_{\min}(H))\big]\,\Delta_d,$$
which together with $\alpha \ge L/[2(1-\sigma)\lambda_{\min}(H)]$ immediately implies that statement (d) holds.

We now present the first nonmonotone gradient method for (26) as follows.

Nonmonotone gradient method I:

Choose parameters $\eta > 1$, $0 < \sigma < 1$, $0 < \underline{\alpha} < \bar\alpha$, $0 < \underline{\lambda} \le \bar\lambda$, and integer $M \ge 0$. Set $k = 0$ and choose $x^0 \in X$.

1) Choose $\alpha^0_k \in [\underline{\alpha}, \bar\alpha]$ and $\underline{\lambda} I \preceq H_k \preceq \bar\lambda I$.

2) For $j = 0, 1, \ldots$

2a) Let $\alpha_k = \alpha^0_k \eta^j$. Solve (28) with $x = x^k$ and $H = \alpha_k H_k$ to obtain $d^k = d_H(x)$.

2b) If $d^k$ satisfies
$$F(x^k + d^k) \le \max_{[k-M]^+ \le i \le k} F(x^i) + \sigma\Delta_k, \qquad (31)$$
go to step 3), where
$$\Delta_k := \nabla f(x^k)^T d^k + P(x^k + d^k) - P(x^k). \qquad (32)$$

3) Set $x^{k+1} = x^k + d^k$ and $k \leftarrow k+1$.

end

Remark. The above method is closely related to the one proposed in [27]. They differ from each other only in that distinct $\Delta_k$'s are used in (31). Indeed, the latter method uses $\Delta_k = -\alpha_k\|d^k\|^2/2$. Nevertheless, for global convergence, the method of [27] requires the strong assumption that the function $f$ is Lipschitz continuously differentiable, which is not required for our method (see Theorem 3.9).
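To make the method concrete, the following is a minimal sketch of nonmonotone gradient method I in the special case $H_k = I$, $X = \Re^n$, $P = c\|\cdot\|_1$, where subproblem (28) reduces to a soft-thresholding step; the lasso-type test instance and all parameter values are our own choices for illustration:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def nmg_method_I(x, f, grad_f, c, eta=2.0, sigma=0.1, alpha0=1.0, M=2, iters=100):
    """Sketch of nonmonotone gradient method I for F = f + c*||.||_1 with
    H_k = I and X = R^n, so step 2a) is soft thresholding with H = alpha_k I."""
    F = lambda z: f(z) + c * np.abs(z).sum()
    hist = [F(x)]                                    # recent objective values
    for _ in range(iters):
        alpha = alpha0                               # alpha_k = alpha0 * eta^j
        while True:
            u = soft_threshold(x - grad_f(x) / alpha, c / alpha)  # solves (28)
            d = u - x
            delta = grad_f(x) @ d + c * (np.abs(u).sum() - np.abs(x).sum())  # (32)
            if F(x + d) <= max(hist[-(M + 1):]) + sigma * delta:  # test (31)
                break
            alpha *= eta                             # step 2a): increase j
        x = x + d                                    # step 3): full step
        hist.append(F(x))
    return x

# Toy instance (our own): a small lasso-type problem.
A = np.array([[1.0, 0.3], [0.2, 2.0], [0.5, 0.1]])
b = np.array([1.0, -2.0, 0.3])
f = lambda z: 0.5 * np.sum((A @ z - b) ** 2)
grad_f = lambda z: A.T @ (A @ z - b)
x = nmg_method_I(np.zeros(2), f, grad_f, c=0.1)
```

The inner loop increases $\alpha_k$ by the factor $\eta$ until the nonmonotone acceptance test (31) holds; by Lemma 3.6(c)-(d) it terminates once $\alpha_k$ exceeds a multiple of the Lipschitz constant of $\nabla f$.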
In addition, our method can be viewed as an extension of the first projected gradient method (namely, SPG1) studied in [4] for smooth problems, but their method cannot. Finally, local convergence is established for our method (see Theorem 3.10), but was not studied for their method.

We next prove global convergence of the nonmonotone gradient method I. Before proceeding, we establish two technical lemmas below. The first lemma shows that if $x^k \in X$ is a nonstationary point, there exists an $\alpha_k > 0$ in step 2a) such that (31) is satisfied, and hence the above method is well defined.

Lemma 3.7 Suppose that $H_k \succ 0$ and $x^k \in X$ is a nonstationary point of problem (26). Then there exists $\tilde\alpha > 0$ such that $d^k = d_{H_k(\alpha_k)}(x^k)$, where $H_k(\alpha_k) = \alpha_k H_k$, satisfies (31) whenever $\alpha_k \ge \tilde\alpha$.

Proof. For simplicity of notation, let $d(\alpha) = d_{H_k(\alpha)}(x^k)$, where $H_k(\alpha) = \alpha H_k$ for any $\alpha > 0$. Then it follows from (28) that for all $\alpha > 0$,
$$\alpha\|d(\alpha)\| \;\le\; \frac{-2\big[\nabla f(x^k)^T d(\alpha) + P(x^k + d(\alpha)) - P(x^k)\big]}{\lambda_{\min}(H_k)\,\|d(\alpha)\|} \;\le\; \frac{-2F'(x^k; d(\alpha)/\|d(\alpha)\|)}{\lambda_{\min}(H_k)}. \qquad (33)$$
Thus, we easily see that the set $\tilde S := \{\alpha\|d(\alpha)\| : \alpha > 0\}$ is bounded. It implies that $\|d(\alpha)\| \to 0$ as $\alpha \to \infty$. We claim that
$$\liminf_{\alpha \to \infty}\, \alpha\|d(\alpha)\| > 0. \qquad (34)$$
Suppose not. Then there exists a sequence $\{\bar\alpha_l\} \uparrow \infty$ such that $\bar\alpha_l\|d(\bar\alpha_l)\| \to 0$ as $l \to \infty$. Invoking that $d(\bar\alpha_l)$ is the optimal solution of (28) with $x = x^k$ and $H = \bar\alpha_l H_k$, we have
$$0 \in \nabla f(x^k) + \bar\alpha_l H_k d(\bar\alpha_l) + \partial P(x^k + d(\bar\alpha_l)) + N_X(x^k + d(\bar\alpha_l)).$$
Upon taking limits on both sides as $l \to \infty$, and using semicontinuity of $\partial P(\cdot)$ and $N_X(\cdot)$ (see Theorem 24.4 of [23] and Lemma 2.42 of [24]) and the relations $\|d(\bar\alpha_l)\| \to 0$ and $\bar\alpha_l\|d(\bar\alpha_l)\| \to 0$, we see that (27) holds at $x^k$, which contradicts the nonstationarity of $x^k$.
Hence, (34) holds. We observe that $\alpha d(\alpha)^T H_k d(\alpha) \ge \lambda_{\min}(H_k)\,\alpha\|d(\alpha)\|^2$, which together with (34) and $H_k \succ 0$ implies that $\|d(\alpha)\| = O\big(\alpha d(\alpha)^T H_k d(\alpha)\big)$ as $\alpha \to \infty$. In addition, using the boundedness of $\tilde S$ and $H_k \succ 0$, we have $\alpha d(\alpha)^T H_k d(\alpha) = O(\|d(\alpha)\|)$ as $\alpha \to \infty$. Thus, we obtain that
$$\alpha d(\alpha)^T H_k d(\alpha) = \Theta(\|d(\alpha)\|) \quad \text{as } \alpha \to \infty. \qquad (35)$$
This relation together with Lemma 3.6(a) implies that
$$P(x^k) - \nabla f(x^k)^T d(\alpha) - P(x^k + d(\alpha)) \;\ge\; \alpha d(\alpha)^T H_k d(\alpha) = \Theta(\|d(\alpha)\|). \qquad (36)$$
Using this result and the relation $\|d(\alpha)\| \to 0$ as $\alpha \to \infty$, we further have
$$F(x^k + d(\alpha)) - \max_{[k-M]^+ \le i \le k} F(x^i) \;\le\; F(x^k + d(\alpha)) - F(x^k) = f(x^k + d(\alpha)) - f(x^k) + P(x^k + d(\alpha)) - P(x^k)$$
$$= \nabla f(x^k)^T d(\alpha) + P(x^k + d(\alpha)) - P(x^k) + o(\|d(\alpha)\|) \;\le\; \sigma\big[\nabla f(x^k)^T d(\alpha) + P(x^k + d(\alpha)) - P(x^k)\big], \qquad (37)$$
provided $\alpha$ is sufficiently large. It implies that the conclusion holds.

The following lemma shows that the search directions $\{d^k\}$ approach zero, and that the sequence of objective function values $\{F(x^k)\}$ converges.

Lemma 3.8 Suppose that $F$ is bounded below in $X$ and uniformly continuous in the level set $L = \{x \in X : F(x) \le F(x^0)\}$. Then the sequence $\{x^k\}$ generated by the nonmonotone gradient method I satisfies $\lim_{k \to \infty} d^k = 0$. Moreover, the sequence $\{F(x^k)\}$ converges.

Proof. We first observe that $\{x^k\} \subseteq L$. Let $l(k)$ be an integer such that $[k-M]^+ \le l(k) \le k$ and
$$F(x^{l(k)}) = \max\{F(x^i) : [k-M]^+ \le i \le k\}$$
for all $k \ge 0$. We clearly observe that $F(x^{k+1}) \le F(x^{l(k)})$ for all $k \ge 0$, which together with the definition of $l(k)$ implies that the sequence $\{F(x^{l(k)})\}$ is monotonically nonincreasing. Further, since $F$ is bounded below in $X$, we have
$$\lim_{k \to \infty} F(x^{l(k)}) = F^* \qquad (38)$$
for some $F^* \in \Re$.
We next prove by induction that the following limits hold for all $j \ge 1$:
$$\lim_{k \to \infty} d^{l(k)-j} = 0, \qquad \lim_{k \to \infty} F(x^{l(k)-j}) = F^*. \qquad (39)$$
Using (31) and (32) with $k$ replaced by $l(k)-1$, we obtain that
$$F(x^{l(k)}) \le F(x^{l(l(k)-1)}) + \sigma\Delta_{l(k)-1}. \qquad (40)$$
Replacing $k$ and $\alpha$ by $l(k)-1$ and $\alpha_{l(k)-1}$ in (36), respectively, and using $H_{l(k)-1} \succeq \underline{\lambda} I$ and the definition of $\Delta_{l(k)-1}$ (see (32)), we have $\Delta_{l(k)-1} \le -\underline{\lambda}\,\alpha_{l(k)-1}\|d^{l(k)-1}\|^2$. The above two inequalities yield
$$F(x^{l(k)}) \le F(x^{l(l(k)-1)}) - \sigma\underline{\lambda}\,\alpha_{l(k)-1}\|d^{l(k)-1}\|^2, \qquad (41)$$
which together with (38) implies that $\lim_{k \to \infty} \alpha_{l(k)-1}\|d^{l(k)-1}\|^2 = 0$. Further, noticing that $\alpha_k \ge \underline{\alpha}$ for all $k$, we obtain that $\lim_{k \to \infty} d^{l(k)-1} = 0$. Using this result and (38), we have
$$\lim_{k \to \infty} F(x^{l(k)-1}) = \lim_{k \to \infty} F(x^{l(k)} - d^{l(k)-1}) = \lim_{k \to \infty} F(x^{l(k)}) = F^*, \qquad (42)$$
where the second equality follows from the uniform continuity of $F$ in $L$. Therefore, (39) holds for $j = 1$. We now need to show that if (39) holds for $j$, then it also holds for $j+1$. Using a similar argument to that leading to (41), we have
$$F(x^{l(k)-j}) \le F(x^{l(l(k)-j-1)}) - \sigma\underline{\lambda}\,\alpha_{l(k)-j-1}\|d^{l(k)-j-1}\|^2,$$
which together with (38), the induction assumption $\lim_{k \to \infty} F(x^{l(k)-j}) = F^*$, and the fact that $\alpha_{l(k)-j-1} \ge \underline{\alpha}$ for all $k$, yields $\lim_{k \to \infty} d^{l(k)-j-1} = 0$. Using this result, the induction assumption $\lim_{k \to \infty} F(x^{l(k)-j}) = F^*$, and a similar argument to that leading to (42), we can show that $\lim_{k \to \infty} F(x^{l(k)-j-1}) = F^*$. Hence, (39) holds for $j+1$. Finally, we will prove that $\lim_{k \to \infty} d^k = 0$ and $\lim_{k \to \infty} F(x^k) = F^*$.
By the definition of $l(k)$, we see that for $k \ge M+1$, $k - M - 1 = l(k) - j$ for some $1 \le j \le M+1$, which together with the first limit in (39) implies that $\lim_{k \to \infty} d^k = \lim_{k \to \infty} d^{k-M-1} = 0$. Additionally, we observe that
$$x^{l(k)} = x^{k-M-1} + \sum_{j=1}^{\bar l_k} d^{l(k)-j} \quad \forall k \ge M+1,$$
where $\bar l_k = l(k) - (k - M - 1) \le M+1$. Using the above identity, (39), and the uniform continuity of $F$ in $L$, we see that $\lim_{k \to \infty} F(x^k) = \lim_{k \to \infty} F(x^{k-M-1}) = F^*$. Thus, the conclusion of this lemma holds.

We are now ready to show that the nonmonotone gradient method I is globally convergent.

Theorem 3.9 Suppose that $F$ is bounded below in $X$ and uniformly continuous in the level set $L = \{x \in X : F(x) \le F(x^0)\}$. Then any accumulation point of the sequence $\{x^k\}$ generated by the nonmonotone gradient method I is a stationary point of (26).

Proof. Suppose for contradiction that $x^*$ is an accumulation point of $\{x^k\}$ that is a nonstationary point of (26). Let $K$ be a subsequence such that $\{x^k\}_{k \in K} \to x^*$. We first claim that $\{\alpha_k\}_{k \in K}$ is bounded. Suppose not. Then there exists a subsequence of $\{\alpha_k\}_{k \in K}$ that goes to $\infty$. Without loss of generality, we assume that $\{\alpha_k\}_{k \in K} \to \infty$. For simplicity of notation, let $\bar\alpha_k = \alpha_k/\eta$ and $d^k(\alpha) = d_{H_k(\alpha)}(x^k)$ for $k \in K$ and $\alpha > 0$, where $H_k(\alpha) = \alpha H_k$. Since $\{\alpha_k\}_{k \in K} \to \infty$ and $\alpha^0_k \le \bar\alpha$, there exists some index $\bar k \ge 0$ such that $\alpha_k > \alpha^0_k$ for all $k \in K$ with $k \ge \bar k$.
By the particular choice of $\alpha_k$ specified in steps 2a) and 2b), we have
$$F(x^k + d^k(\bar\alpha_k)) > \max_{[k-M]^+ \le i \le k} F(x^i) + \sigma\big[\nabla f(x^k)^T d^k(\bar\alpha_k) + P(x^k + d^k(\bar\alpha_k)) - P(x^k)\big]. \qquad (43)$$
Using a similar argument to that leading to (33), we have
$$\bar\alpha_k\|d^k(\bar\alpha_k)\| \le \frac{-2F'(x^k; d^k(\bar\alpha_k)/\|d^k(\bar\alpha_k)\|)}{\lambda_{\min}(H_k)} \quad \forall k \in K,$$
which along with the relations $H_k \succeq \underline{\lambda} I$ and $\{x^k\}_{k \in K} \to x^*$ implies that $\{\bar\alpha_k\|d^k(\bar\alpha_k)\|\}_{k \in K}$ is bounded. Since $\{\bar\alpha_k\}_{k \in K} \to \infty$, we further have $\{\|d^k(\bar\alpha_k)\|\}_{k \in K} \to 0$. We now claim that
$$\liminf_{k \in K,\, k \to \infty} \bar\alpha_k\|d^k(\bar\alpha_k)\| > 0. \qquad (44)$$
Suppose not. By passing to a subsequence if necessary, we can assume that $\{\bar\alpha_k\|d^k(\bar\alpha_k)\|\}_{k \in K} \to 0$. Invoking that $d^k(\bar\alpha_k)$ is the optimal solution of (28) with $x = x^k$ and $H = \bar\alpha_k H_k$, we have
$$0 \in \nabla f(x^k) + \bar\alpha_k H_k d^k(\bar\alpha_k) + \partial P(x^k + d^k(\bar\alpha_k)) + N_X(x^k + d^k(\bar\alpha_k)) \quad \forall k \in K.$$
Upon taking limits on both sides as $k \in K \to \infty$, and using semicontinuity of $\partial P(\cdot)$ and $N_X(\cdot)$ (see Theorem 24.4 of [23] and Lemma 2.42 of [24]) and the relations $\underline{\lambda} I \preceq H_k \preceq \bar\lambda I$, $\{\|d^k(\bar\alpha_k)\|\}_{k \in K} \to 0$, $\{\bar\alpha_k\|d^k(\bar\alpha_k)\|\}_{k \in K} \to 0$ and $\{x^k\}_{k \in K} \to x^*$, we see that (27) holds at $x^*$, which contradicts the nonstationarity of $x^*$. Thus, (44) holds.

Now, using (44), the relation $H_k \succeq \underline{\lambda} I$, the boundedness of $\{\bar\alpha_k\|d^k(\bar\alpha_k)\|\}_{k \in K}$, and a similar argument to that leading to (35), we observe that $\bar\alpha_k d^k(\bar\alpha_k)^T H_k d^k(\bar\alpha_k) = \Theta(\|d^k(\bar\alpha_k)\|)$ as $k \in K \to \infty$. Using this result and a similar argument to that leading to (37), we have
$$F(x^k + d^k(\bar\alpha_k)) \le \max_{[k-M]^+ \le i \le k} F(x^i) + \sigma\big[\nabla f(x^k)^T d^k(\bar\alpha_k) + P(x^k + d^k(\bar\alpha_k)) - P(x^k)\big],$$
provided that $k \in K$ is sufficiently large. The above inequality evidently contradicts (43). Thus, $\{\alpha_k\}_{k \in K}$ is bounded.
Finally, invoking that $d^k = d^k(\alpha_k)$ is the optimal solution of (28) with $x = x^k$ and $H = \alpha_k H_k$, we have
$$0 \in \nabla f(x^k) + \alpha_k H_k d^k + \partial P(x^k + d^k) + N_X(x^k + d^k) \quad \forall k \in K. \qquad (45)$$
By Lemma 3.8, we have $\{d^k\}_{k \in K} \to 0$. Upon taking limits on both sides of (45) as $k \in K \to \infty$, and using semicontinuity of $\partial P(\cdot)$ and $N_X(\cdot)$ (see Theorem 24.4 of [23] and Lemma 2.42 of [24]) and the relations $\underline{\lambda} I \preceq H_k \preceq \bar\lambda I$, $\{d^k\}_{k \in K} \to 0$ and $\{x^k\}_{k \in K} \to x^*$, we see that (27) holds at $x^*$, which contradicts the nonstationarity of $x^*$ assumed at the beginning of this proof. Therefore, the conclusion of this theorem holds.

We next analyze the asymptotic convergence rate of the nonmonotone gradient method I under the following assumption, which is the same as that made in [26]. In what follows, we denote by $\bar X$ the set of stationary points of problem (26).

Assumption 2 (a) $\bar X \neq \emptyset$ and, for any $\zeta \ge \min_{x \in X} F(x)$, there exist $\tau > 0$ and $\epsilon > 0$ such that
$$\mathrm{dist}(x, \bar X) \le \tau\|d_I(x)\| \quad \text{whenever } F(x) \le \zeta,\ \|d_I(x)\| \le \epsilon.$$
(b) There exists $\delta > 0$ such that $\|x - y\| \ge \delta$ whenever $x \in \bar X$, $y \in \bar X$, $F(x) \neq F(y)$.

We are now ready to establish a local linear rate of convergence for the nonmonotone gradient method I described above. The proof of the following theorem is inspired by the work of Tseng and Yun [26], who analyzed a similar local convergence result for a coordinate gradient descent method for a class of nonsmooth minimization problems.

Theorem 3.10 Suppose that $f$ satisfies (30), and that $F$ is bounded below in $X$ and uniformly continuous in the level set $L = \{x \in X : F(x) \le F(x^0)\}$. Then the sequence $\{x^k\}$ generated by the nonmonotone gradient method I satisfies
$$F(x^{l(k)}) - F^* \le c\,\big(F(x^{l(l(k)-1)}) - F^*\big),$$
provided $k$ is sufficiently large, where $F^* = \lim_{k \to \infty} F(x^k)$ (see Lemma 3.8) and $c$ is some constant in $(0,1)$.
Proof. Invoking $\alpha^0_k \le \bar\alpha$ and the specific choice of $\alpha_k$, we see from Lemma 3.6(d) that $\hat\alpha := \sup_k \alpha_k < \infty$. Let $H_k(\alpha) = \alpha H_k$. Then it follows from $\underline{\lambda} I \preceq H_k \preceq \bar\lambda I$ and $\alpha_k \ge \underline{\alpha}$ that $(\underline{\alpha}\,\underline{\lambda}) I \preceq H_k(\alpha_k) \preceq \hat\alpha\bar\lambda I$. Using this relation, Lemma 3.5, $H_k \succeq \underline{\lambda} I$, and $d^k = d_{H_k(\alpha_k)}(x^k)$, we obtain that
$$\|d_I(x^k)\| = \Theta(\|d^k\|), \qquad (46)$$
which together with Lemma 3.8 implies $\{d_I(x^k)\} \to 0$. Thus, for any $\epsilon > 0$, there exists some index $\bar k$ such that $\|d_I(x^{l(k)-1})\| \le \epsilon$ for all $k \ge \bar k$. In addition, we clearly observe that $F(x^{l(k)-1}) \le F(x^0)$. Then, by Assumption 2(a) and (46), there exists some index $k'$ such that
$$\|x^{l(k)-1} - \bar x^{l(k)-1}\| \le c_1\|d^{l(k)-1}\| \quad \forall k \ge k' \qquad (47)$$
for some $c_1 > 0$ and $\bar x^{l(k)-1} \in \bar X$. Note that
$$\|x^{l(k+1)-1} - x^{l(k)-1}\| \le \sum_{i=l(k)-1}^{l(k+1)-2} \|d^i\| \le \sum_{i=[k-M-1]^+}^{[k-1]^+} \|d^i\|,$$
which together with $\{d^k\} \to 0$ implies that $\|x^{l(k+1)-1} - x^{l(k)-1}\| \to 0$. Using this result, (47), and Lemma 3.8, we obtain
$$\|\bar x^{l(k+1)-1} - \bar x^{l(k)-1}\| \le \|x^{l(k+1)-1} - \bar x^{l(k+1)-1}\| + \|x^{l(k)-1} - \bar x^{l(k)-1}\| + \|x^{l(k+1)-1} - x^{l(k)-1}\|$$
$$\le c_1\|d^{l(k+1)-1}\| + c_1\|d^{l(k)-1}\| + \|x^{l(k+1)-1} - x^{l(k)-1}\| \to 0.$$
It follows from this relation and Assumption 2(b) that there exist an index $\hat k \ge k'$ and $v \in \Re$ such that
$$F(\bar x^{l(k)-1}) = v \quad \forall k \ge \hat k. \qquad (48)$$
Then, by Lemma 5.1 of [26], we see that
$$F^* = \lim_{k \to \infty} F(x^k) = \liminf_{k \to \infty} F(x^{l(k)-1}) \ge v. \qquad (49)$$
Further, using the definition of $F$, (30), (48), Lemma 3.6(b), and $H_k(\alpha_k) \preceq \hat\alpha\bar\lambda I$, we have, for $k \ge \hat k$,
$$F(x^{l(k)}) - v = f(x^{l(k)}) + P(x^{l(k)}) - f(\bar x^{l(k)-1}) - P(\bar x^{l(k)-1}) = \nabla f(\tilde x^k)^T(x^{l(k)} - \bar x^{l(k)-1}) + P(x^{l(k)}) - P(\bar x^{l(k)-1})$$
$$= \big(\nabla f(\tilde x^k) - \nabla f(x^{l(k)-1})\big)^T(x^{l(k)} - \bar x^{l(k)-1}) - \big(H_{l(k)-1}(\alpha_{l(k)-1})\, d^{l(k)-1}\big)^T(x^{l(k)} - \bar x^{l(k)-1})$$
$$\quad + \big(\nabla f(x^{l(k)-1}) + H_{l(k)-1}(\alpha_{l(k)-1})\, d^{l(k)-1}\big)^T(x^{l(k)} - \bar x^{l(k)-1}) + P(x^{l(k)}) - P(\bar x^{l(k)-1})$$
$$\le L\|\tilde x^k - x^{l(k)-1}\|\,\|x^{l(k)} - \bar x^{l(k)-1}\| + \hat\alpha\bar\lambda\|d^{l(k)-1}\|\,\|x^{l(k)} - \bar x^{l(k)-1}\|, \qquad (50)$$
where $\tilde x^k$ is some point lying on the segment joining $x^{l(k)}$ and $\bar x^{l(k)-1}$. It follows from (47) that, for $k \ge \hat k$,
$$\|\tilde x^k - x^{l(k)-1}\| \le \|x^{l(k)} - x^{l(k)-1}\| + \|x^{l(k)-1} - \bar x^{l(k)-1}\| \le (1 + c_1)\|d^{l(k)-1}\|.$$
Similarly, $\|x^{l(k)} - \bar x^{l(k)-1}\| \le (1 + c_1)\|d^{l(k)-1}\|$ for $k \ge \hat k$. Using these inequalities, Lemma 3.6(a), $H_k(\alpha_k) \succeq (\underline{\alpha}\,\underline{\lambda}) I$, and (50), we see that, for $k \ge \hat k$,
$$F(x^{l(k)}) - v \le -c_2\Delta_{l(k)-1} \quad \text{for some constant } c_2 > 0.$$
This inequality together with (40) gives
$$F(x^{l(k)}) - v \le c_3\big(F(x^{l(l(k)-1)}) - F(x^{l(k)})\big) \quad \forall k \ge \hat k, \qquad (51)$$
where $c_3 = c_2/\sigma$. Using $\lim_{k \to \infty} F(x^{l(k)}) = F^*$, and upon taking limits on both sides of (51), we see that $F^* \le v$, which together with (49) implies that $v = F^*$. Using this result and upon rearranging the terms of (51), we have
$$F(x^{l(k)}) - F^* \le c\,\big(F(x^{l(l(k)-1)}) - F^*\big) \quad \forall k \ge \hat k,$$
where $c = c_3/(1 + c_3)$.

We next present the second nonmonotone gradient method for (26) as follows.

Nonmonotone gradient method II:

Choose parameters $0 < \eta < 1$, $0 < \sigma < 1$, $0 < \underline{\alpha} < \bar\alpha$, $0 < \underline{\lambda} \le \bar\lambda$, and integer $M \ge 0$.
Set $k = 0$ and choose $x^0 \in X$.

1) Choose $\underline{\lambda} I \preceq H_k \preceq \bar\lambda I$.

2) Solve (28) with $x = x^k$ and $H = H_k$ to obtain $d^k = d_H(x)$, and compute $\Delta_k$ according to (32).

3) Choose $\alpha^0_k \in [\underline{\alpha}, \bar\alpha]$. Find the smallest integer $j \ge 0$ such that $\alpha_k = \alpha^0_k\eta^j$ satisfies
$$F(x^k + \alpha_k d^k) \le \max_{[k-M]^+ \le i \le k} F(x^i) + \sigma\alpha_k\Delta_k, \qquad (52)$$
where $\Delta_k$ is defined in (32).

4) Set $x^{k+1} = x^k + \alpha_k d^k$ and $k \leftarrow k+1$.

end

Remark. The above method is closely related to the one proposed in [26]. When the entire coordinate block, that is, $J = \{1,\ldots,n\}$, is chosen for the latter method, it becomes a special case of our method with $M = 0$, which is a gradient descent method. Given that our method is generally a nonmonotone method, especially when $M \ge 1$, most proofs of global and local convergence for the method of [26] do not carry over to our method directly. In addition, our method can be viewed as an extension of the second projected gradient method (namely, SPG2) studied in [4] for smooth problems, but the method of [26] generally cannot.

We next prove global convergence of the nonmonotone gradient method II. Before proceeding, we establish two technical lemmas below. The first lemma shows that if $x^k \in X$ is a nonstationary point, there exists an $\alpha_k > 0$ in step 3) such that (52) is satisfied, and hence the above method is well defined.

Lemma 3.11 Suppose that $H_k \succ 0$ and $x^k \in X$ is a nonstationary point of problem (26). Then there exists $\tilde\alpha > 0$ such that $d^k = d_{H_k}(x^k)$ satisfies (52) whenever $0 < \alpha_k \le \tilde\alpha$.

Proof. In view of Lemma 2.1 of [26] with $J = \{1,\ldots,n\}$, $c = 1$, $x = x^k$, and $H = H_k$, we have
$$F(x^k + \alpha d^k) \le F(x^k) + \alpha\Delta_k + o(\alpha) \le \max_{[k-M]^+ \le i \le k} F(x^i) + \alpha\Delta_k + o(\alpha) \quad \forall \alpha \in (0,1],$$
where $\Delta_k$ is defined in (32). Using the assumption of this lemma, we see from Lemma 3.4 that $d^k \neq 0$, which together with $H_k \succ 0$ and Lemma 3.6(a) implies $\Delta_k < 0$.
The conclusion of this lemma immediately follows from this relation and the above inequality.

The following lemma shows that the scaled search directions $\{\alpha_k d^k\}$ approach zero, and that the sequence of objective function values $\{F(x^k)\}$ converges.

Lemma 3.12 Suppose that $F$ is bounded below in $X$ and uniformly continuous in the level set $L = \{x \in X : F(x) \le F(x^0)\}$. Then the sequence $\{x^k\}$ generated by the nonmonotone gradient method II satisfies $\lim_{k \to \infty} \alpha_k d^k = 0$. Moreover, the sequence $\{F(x^k)\}$ converges.

Proof. Let $l(k)$ be defined as in the proof of Lemma 3.8. We first observe that $\{x^k\} \subseteq L$. Using (32), the definition of $d^k$, and $H_k \succeq \underline{\lambda} I$, we have
$$\Delta_k = \nabla f(x^k)^T d^k + P(x^k + d^k) - P(x^k) \le -\tfrac{1}{2}(d^k)^T H_k d^k \le -\tfrac{1}{2}\underline{\lambda}\|d^k\|^2, \qquad (53)$$
which together with the relation $\alpha_k \le \alpha^0_k \le \bar\alpha$ implies that
$$\alpha_k^2\|d^k\|^2 \le -2\bar\alpha\,\alpha_k\Delta_k/\underline{\lambda}. \qquad (54)$$
By a similar argument to that leading to (38), we see that $\{x^k\}$ satisfies (38) for some $F^*$. We next show by induction that the following limits hold for all $j \ge 1$:
$$\lim_{k \to \infty} \alpha_{l(k)-j}\, d^{l(k)-j} = 0, \qquad \lim_{k \to \infty} F(x^{l(k)-j}) = F^*. \qquad (55)$$
Indeed, using (52) with $k$ replaced by $l(k)-1$, we obtain that
$$F(x^{l(k)}) \le F(x^{l(l(k)-1)}) + \sigma\alpha_{l(k)-1}\Delta_{l(k)-1}.$$
This together with (38) immediately yields $\lim_{k \to \infty} \alpha_{l(k)-1}\Delta_{l(k)-1} = 0$. Using this result and (54), we see that the first identity of (55) holds for $j = 1$. Further, in view of this identity, (38), and the uniform continuity of $F$ in $L$, we can easily see that the second identity of (55) also holds for $j = 1$. We now need to show that if (55) holds for $j$, then it also holds for $j+1$.
First, it follows from (52) that
F(x^{l(k)−j}) ≤ F(x^{l(l(k)−j−1)}) + σ α_{l(k)−j−1} Δ_{l(k)−j−1},
which together with (38) and the induction assumption lim_{k→∞} F(x^{l(k)−j}) = F* yields lim_{k→∞} α_{l(k)−j−1} Δ_{l(k)−j−1} = 0. Using this result and (54), we have lim_{k→∞} α_{l(k)−j−1} d^{l(k)−j−1} = 0. In view of this identity, the uniform continuity of F in L, and the induction assumption lim_{k→∞} F(x^{l(k)−j}) = F*, we can easily show that lim_{k→∞} F(x^{l(k)−j−1}) = F*. Hence (55) holds for j + 1. The conclusion of this lemma then follows from (55) and a similar argument to that in the proof of Lemma 3.12.

We are now ready to show that the nonmonotone gradient method II is globally convergent.

Theorem 3.13. Suppose that F is bounded below on X and uniformly continuous in the level set L = {x ∈ X : F(x) ≤ F(x^0)}. Then any accumulation point of the sequence {x^k} generated by the nonmonotone gradient method II is a stationary point of (26).

Proof. Suppose for contradiction that x* is an accumulation point of {x^k} that is a nonstationary point of (26). Let K be a subsequence such that {x^k}_{k∈K} → x*. We first claim that lim inf_{k∈K, k→∞} ‖d^k‖ > 0. Suppose not. By passing to a subsequence if necessary, we can assume that {‖d^k‖}_{k∈K} → 0. Invoking that d^k is the optimal solution of (28) with x = x^k and H = H_k, we have
0 ∈ ∇f(x^k) + H_k d^k + ∂P(x^k + d^k) + N_X(x^k + d^k)   ∀k ∈ K.
Upon taking limits on both sides as k ∈ K → ∞, and using the semicontinuity of ∂P(·) and N_X(·) (see Theorem 24.4 of [23] and Lemma 2.42 of [24]), the relations λI ⪯ H_k ⪯ λ̄I, {‖d^k‖}_{k∈K} → 0, and {x^k}_{k∈K} → x*, we see that (27) holds at x*, which contradicts the nonstationarity of x*. Thus lim inf_{k∈K, k→∞} ‖d^k‖ > 0 holds.
Further, using a similar argument to the one leading to (33), we have
‖d^k‖ ≤ −2 F′(x^k; d^k/‖d^k‖) / λ_min(H_k)   ∀k ∈ K,
which together with {x^k}_{k∈K} → x*, H_k ⪰ λI, and lim inf_{k∈K, k→∞} ‖d^k‖ > 0 implies that ‖d^k‖ = Θ(1) for k ∈ K. Further, using (53), we see that lim sup_{k∈K, k→∞} Δ_k < 0. Now it follows from Lemma 3.12 and the relation lim inf_{k∈K, k→∞} ‖d^k‖ > 0 that {α_k}_{k∈K} → 0. Since α_k^0 ≥ α > 0, there exists some index k̄ ≥ 0 such that α_k < α_k^0 and α_k < η for all k ∈ K with k ≥ k̄. Let ᾱ_k = α_k/η. Then {ᾱ_k}_{k∈K} → 0 and 0 < ᾱ_k ≤ 1 for all k ∈ K. By the stepsize rule used in step 3), we have, for all k ∈ K with k ≥ k̄,
F(x^k + ᾱ_k d^k) > max_{[k−M]^+ ≤ i ≤ k} F(x^i) + σ ᾱ_k Δ_k.   (56)
On the other hand, in view of the definition of F, (32), the relations ‖d^k‖ = Θ(1) and lim sup_{k∈K, k→∞} Δ_k < 0, and the monotonicity of (P(x^k + α d^k) − P(x^k))/α, we obtain that, for sufficiently large k ∈ K,
F(x^k + ᾱ_k d^k) = f(x^k + ᾱ_k d^k) + P(x^k + ᾱ_k d^k)
  = f(x^k + ᾱ_k d^k) − f(x^k) + P(x^k + ᾱ_k d^k) − P(x^k) + F(x^k)
  = ᾱ_k ∇f(x^k)^T d^k + o(ᾱ_k ‖d^k‖) + P(x^k + ᾱ_k d^k) − P(x^k) + F(x^k)
  ≤ ᾱ_k ∇f(x^k)^T d^k + o(ᾱ_k) + ᾱ_k [P(x^k + d^k) − P(x^k)] + max_{[k−M]^+ ≤ i ≤ k} F(x^i)
  = max_{[k−M]^+ ≤ i ≤ k} F(x^i) + ᾱ_k Δ_k + o(ᾱ_k)
  < max_{[k−M]^+ ≤ i ≤ k} F(x^i) + σ ᾱ_k Δ_k,
which clearly contradicts (56). Therefore the conclusion of this theorem holds.

We next establish the local linear rate of convergence of the nonmonotone gradient method II described above. The proof of the following theorem is inspired by the work of Tseng and Yun [26].

Theorem 3.14. Suppose that ᾱ ≤ 1, f satisfies (30), and F is bounded below on X and uniformly continuous in the level set L = {x ∈ X : F(x) ≤ F(x^0)}.
Then the sequence {x^k} generated by the nonmonotone gradient method II satisfies
F(x^{l(k)}) − F* ≤ c (F(x^{l(l(k)−1)}) − F*)
provided k is sufficiently large, where F* = lim_{k→∞} F(x^k) (see Lemma 3.12) and c is some constant in (0, 1).

Proof. Since α_k is chosen by the Armijo rule with α_k^0 ≥ α > 0, we see from Lemma 3.6(c) that inf_k α_k > 0. Together with Lemma 3.12, this implies that {d^k} → 0. Further, using Lemma 3.5 and the facts that d^k = d_{H_k}(x^k) and λI ⪯ H_k ⪯ λ̄I, we obtain ‖d_I(x^k)‖ = Θ(‖d^k‖), and hence {d_I(x^k)} → 0. Then, by a similar argument to that in the proof of Theorem 3.10, there exist c₁ > 0, v ∈ ℝ, and x̄^{l(k)−1} ∈ X̄ such that
‖x^{l(k)−1} − x̄^{l(k)−1}‖ ≤ c₁ ‖d^{l(k)−1}‖,   F(x̄^{l(k)−1}) = v   ∀k ≥ k̂,
where k̂ is some index. Then, by Lemma 5.1 of [26], we see that (49) holds for {x^k} and the above F* and v. Further, using the definition of F, (30), Lemma 3.6(b), and λI ⪯ H_k ⪯ λ̄I, we have, for k ≥ k̂,
F(x^{l(k)}) − v = f(x^{l(k)}) + P(x^{l(k)}) − f(x̄^{l(k)−1}) − P(x̄^{l(k)−1})
  = ∇f(x̃^k)^T (x^{l(k)} − x̄^{l(k)−1}) + P(x^{l(k)}) − P(x̄^{l(k)−1})
  = (∇f(x̃^k) − ∇f(x^{l(k)−1}))^T (x^{l(k)} − x̄^{l(k)−1}) − (H_{l(k)−1} d^{l(k)−1})^T (x^{l(k)} − x̄^{l(k)−1})
    + (∇f(x^{l(k)−1}) + H_{l(k)−1} d^{l(k)−1})^T (x^{l(k)} − x̄^{l(k)−1}) + P(x^{l(k)}) − P(x̄^{l(k)−1})
  ≤ L ‖x̃^k − x^{l(k)−1}‖ ‖x^{l(k)} − x̄^{l(k)−1}‖ + λ̄ ‖d^{l(k)−1}‖ ‖x^{l(k)} − x̄^{l(k)−1}‖
    + (α_{l(k)−1} − 1)(d^{l(k)−1})^T H_{l(k)−1} d^{l(k)−1} + Δ_{l(k)−1},   (57)
where x̃^k is some point lying on the segment joining x^{l(k)} and x̄^{l(k)−1}.
It follows from (47) and α_k ≤ 1 that, for k ≥ k̂,
‖x̃^k − x^{l(k)−1}‖ ≤ ‖x^{l(k)} − x^{l(k)−1}‖ + ‖x^{l(k)−1} − x̄^{l(k)−1}‖ ≤ (1 + c₁) ‖d^{l(k)−1}‖.
Similarly, ‖x^{l(k)} − x̄^{l(k)−1}‖ ≤ (1 + c₁) ‖d^{l(k)−1}‖ for k ≥ k̂. Using these inequalities, Lemma 3.6(a), H_k ⪰ λI, α_k ≤ 1, and (57), we see that, for k ≥ k̂,
F(x^{l(k)}) − v ≤ −c₂ Δ_{l(k)−1}
for some constant c₂ > 0. The remaining proof follows similarly to that of Theorem 3.10.

4 Augmented Lagrangian method for sparse PCA

In this section we discuss the applicability and implementation details of the augmented Lagrangian method proposed in Section 3 for solving the sparse PCA problem (3).

4.1 Applicability of the augmented Lagrangian method for (3)

We first observe that problem (3) can be reformulated as

min_{V ∈ ℝ^{n×r}}  −Tr(V^T Σ̂ V) + ρ • |V|
s.t.  V_i^T Σ̂ V_j ≤ Δ_{ij}   ∀i ≠ j,
      −V_i^T Σ̂ V_j ≤ Δ_{ij}   ∀i ≠ j,
      V^T V = I.   (58)

Clearly, problem (58) has the same form as (9). From Subsection 3.2 we know that the sufficient conditions for convergence of our augmented Lagrangian method include: i) a feasible point is explicitly given; and ii) Robinson's condition (10) holds at an accumulation point. It is easy to observe that any V ∈ ℝ^{n×r} consisting of r orthonormal eigenvectors of Σ̂ is a feasible point of (58), so the first condition is trivially satisfied. Since the accumulation points are not known beforehand, it is hard to check the second condition directly. Instead, we may check Robinson's condition at all feasible points of (58). However, due to the complexity of the constraints, we are only able to verify Robinson's condition at a set of feasible points below. Before proceeding, we establish a technical lemma that will be used subsequently.

Lemma 4.1. Let V ∈ ℝ^{n×r} be a feasible solution of (58).
Given any W₁, W₂ ∈ S^r, the system
δV^T Σ̂ V + V^T Σ̂ δV + δD = W₁,   (59)
δV^T V + V^T δV = W₂   (60)
has at least one solution (δV, δD) ∈ ℝ^{n×r} × D^r if one of the following conditions holds:
a) V^T Σ̂ V is diagonal and V_i^T Σ̂ V_i ≠ V_j^T Σ̂ V_j for all i ≠ j;
b) V^T Σ̂ (I − V V^T) Σ̂ V is nonsingular.

Proof. Note that the columns of V are r orthonormal vectors. Therefore there exists V̄ ∈ ℝ^{n×(n−r)} such that [V V̄] ∈ ℝ^{n×n} is an orthogonal matrix. It follows that for any δV ∈ ℝ^{n×r} there exist δP ∈ ℝ^{r×r} and δP̄ ∈ ℝ^{(n−r)×r} such that δV = V δP + V̄ δP̄. Performing this change of variable for δV, and using the fact that [V V̄] is orthogonal, we can show that the system (59)–(60) is equivalent to
δP^T G + G δP + δP̄^T Ḡ + Ḡ^T δP̄ + δD = W₁,   (61)
δP^T + δP = W₂,   (62)
where G = V^T Σ̂ V and Ḡ = V̄^T Σ̂ V. The remaining proof of this lemma reduces to showing that the system (61)–(62) has at least one solution (δP, δP̄, δD) ∈ ℝ^{r×r} × ℝ^{(n−r)×r} × D^r if condition (a) or (b) holds.

First, assume that condition (a) holds. Then G is a diagonal matrix and G_{ii} ≠ G_{jj} for all i ≠ j. It follows that there exists a unique δP* ∈ ℝ^{r×r} satisfying δP*_{ii} = (W₂)_{ii}/2 for all i and
δP*_{ji} G_{jj} + G_{ii} δP*_{ij} = (W₁)_{ij}   ∀i ≠ j,   δP*_{ij} + δP*_{ji} = (W₂)_{ij}   ∀i ≠ j.
Now let δP̄* = 0 and δD* = Diag(W₁ − G W₂). It is easy to verify that (δP*, δP̄*, δD*) is a solution of the system (61)–(62).

Next assume that condition (b) holds. Given any δP̄ ∈ ℝ^{(n−r)×r}, there exist δY ∈ ℝ^{(n−r)×r} and δZ ∈ ℝ^{r×r} such that Ḡ^T δY = 0 and δP̄ = δY + Ḡ δZ. Performing this change of variable for δP̄, we see that (61) can be rewritten as
δP^T G + G δP + δZ^T Ḡ^T Ḡ + Ḡ^T Ḡ δZ + δD = W₁.   (63)
Thus it suffices to show that the system (62)–(63) has at least one solution (δP, δZ, δD) ∈ ℝ^{r×r} × ℝ^{r×r} × D^r. Using the definition of Ḡ and the fact that [V V̄] is orthogonal, we see that
Ḡ^T Ḡ = V^T Σ̂ V̄ V̄^T Σ̂ V = V^T Σ̂ (I − V V^T) Σ̂ V,
which together with condition (b) implies that Ḡ^T Ḡ is nonsingular. Now let
δP* = W₂/2,   δZ* = (Ḡ^T Ḡ)^{−1} (2W₁ − W₂ G − G W₂)/4,   δD* = 0.
It is easy to verify that (δP*, δZ*, δD*) is a solution of the system (62)–(63). Therefore the conclusion holds.

We are now ready to show that Robinson's condition (10) holds at a set of feasible points of (58).

Proposition 4.2. Let V ∈ ℝ^{n×r} be a feasible solution of (58). Then Robinson's condition (10) holds at V if one of the following conditions holds:
a) Δ_{ij} = 0 and V_i^T Σ̂ V_i ≠ V_j^T Σ̂ V_j for all i ≠ j;
b) there is at least one active and one inactive inequality constraint of (58) at V, and V^T Σ̂ (I − V V^T) Σ̂ V is nonsingular;
c) all inequality constraints of (58) are inactive at V.

Proof. We first suppose that condition (a) holds. It immediately implies that V^T Σ̂ V is diagonal, and hence condition (a) of Lemma 4.1 holds. In addition, we observe that all constraints of (58) become equality constraints. Using these facts and Lemma 4.1, we see that Robinson's condition (10) holds at V. Next, assume that condition (b) holds. It implies that condition (b) of Lemma 4.1 holds. The conclusion then follows directly from Lemma 4.1. Finally, suppose that condition (c) holds. Then Robinson's condition (10) holds at V if and only if (60) has at least one solution δV ∈ ℝ^{n×r} for any W₂ ∈ S^r. Noting that V^T V = I, we easily see that δV = V W₂/2 is a solution of (60), and thus Robinson's condition (10) holds at V.
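As observed in Subsection 4.1, any V whose columns are r orthonormal eigenvectors of Σ̂ is feasible for (58): V^T V = I holds by orthonormality, and V_i^T Σ̂ V_j = 0 ≤ Δ_{ij} for i ≠ j because eigenvectors diagonalize Σ̂. A minimal NumPy sketch of such a feasible point (the function name and the numerical check are ours, not from the paper):

```python
import numpy as np

def eigvec_feasible_point(Sigma, r):
    """Return the r leading orthonormal eigenvectors of Sigma.

    Such a V satisfies V^T V = I and V_i^T Sigma V_j = 0 for i != j,
    hence it is feasible for (58) for any Delta_ij >= 0.
    """
    w, U = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
    return U[:, np.argsort(w)[::-1][:r]]  # columns: leading eigenvectors
```

This is also the feasible point V_feas used to initialize the outer iterations in Subsection 4.2.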
From Proposition 4.2 we see that Robinson's condition (10) indeed holds at a set of feasible points of (58). Although we are not able to show that it holds at all feasible points of (58), we observe in our implementation that the accumulation points of our augmented Lagrangian method generally satisfy one of the conditions described in Proposition 4.2, and so Robinson's condition usually holds at the accumulation points. Moreover, we have never seen our augmented Lagrangian method fail to converge on any instance in our implementation so far.

4.2 Implementation details of the augmented Lagrangian method for (58)

In this subsection we show how the augmented Lagrangian method proposed in Subsection 3.2 can be applied to solve problem (58) (or, equivalently, (3)). In particular, we discuss the implementation details of the outer and inner iterations of this method.

We first discuss how to efficiently evaluate the function and gradient involved in our augmented Lagrangian method for problem (58). Suppose that ϱ > 0 is the penalty parameter, {λ⁺_{ij}}_{i≠j} and {λ⁻_{ij}}_{i≠j} are the Lagrange multipliers for the inequality constraints of (58), and µ ∈ S^r is the Lagrange multiplier for the equality constraints of (58). For convenience of presentation, let Δ ∈ S^r be the matrix whose (i,j)th entry equals the parameter Δ_{ij} of (58) for all i ≠ j and whose diagonal entries are 0. Similarly, let λ⁺ (resp., λ⁻) be the r × r symmetric matrix whose (i,j)th entry is λ⁺_{ij} (resp., λ⁻_{ij}) for all i ≠ j and whose diagonal entries are 0. We now define λ ∈ ℝ^{2r×r} by stacking λ⁺ over λ⁻.
Using this notation, we observe that the associated augmented Lagrangian function for problem (58) can be rewritten as
L(V, λ, µ) = w(V) + ρ • |V|,   (64)
where
w(V) = −Tr(V^T Σ̂ V) + (1/(2ϱ)) ( ‖[λ⁺ + ϱ(S − Δ)]⁺‖²_F + ‖[λ⁻ − ϱ(S + Δ)]⁺‖²_F − ‖λ⁺‖²_F − ‖λ⁻‖²_F ) + µ • R + (ϱ/2) ‖R‖²_F,
and
S = V^T Σ̂ V − Diag(V^T Σ̂ V),   R = V^T V − I.   (65)
It is not hard to verify that the gradient of w(V) can be computed according to
∇w(V) = 2 ( −Σ̂ V ( I − [λ⁺ + ϱ(S − Δ)]⁺ + [λ⁻ − ϱ(S + Δ)]⁺ ) + V (µ + ϱ R) ).
Clearly, the main effort in the above function and gradient evaluations lies in computing V^T Σ̂ V and Σ̂ V. When Σ̂ ∈ S^p is explicitly given, the computational complexity of evaluating these two quantities is O(p²r). In practice, however, we are typically given the data matrix X ∈ ℝ^{n×p}. Assuming the column means of X are 0, the sample covariance matrix is Σ̂ = X^T X/(n − 1). Nevertheless, when p ≫ n, it is not efficient to compute and store Σ̂, and it is much cheaper to compute V^T Σ̂ V and Σ̂ V by using Σ̂ implicitly rather than explicitly. Indeed, we can first evaluate XV, and then compute V^T Σ̂ V and Σ̂ V according to
V^T Σ̂ V = (XV)^T (XV)/(n − 1),   Σ̂ V = X^T (XV)/(n − 1).
The resulting overall computational complexity is O(npr), which is clearly much better than the O(p²r) cost of using Σ̂ explicitly.

We now address the initialization and termination criteria for our augmented Lagrangian method. In particular, we choose the initial point V⁰_init and the feasible point V_feas to be the loading vectors of the r standard PCs, that is, the orthonormal eigenvectors corresponding to the r largest eigenvalues of Σ̂. In addition, we set the initial penalty parameter and Lagrange multipliers to 1, and set the parameters τ = 0.2 and σ = 10.
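The implicit evaluation of V^T Σ̂ V and Σ̂ V described above can be sketched as follows (a NumPy illustration under the stated assumption that the columns of X are centered; the function name is ours):

```python
import numpy as np

def cov_products(X, V):
    """Compute V^T Sigma V and Sigma V with Sigma = X^T X/(n-1) kept implicit.

    Forming X V first gives an overall cost of O(n p r), versus O(p^2 r)
    if the p x p matrix Sigma were built explicitly -- much cheaper when
    p >> n.
    """
    n = X.shape[0]
    XV = X @ V                   # n x r
    VSV = XV.T @ XV / (n - 1)    # r x r, equals V^T Sigma V
    SV = X.T @ XV / (n - 1)      # p x r, equals Sigma V
    return VSV, SV
```

Both quantities agree with the explicit computation, so the implicit form can be dropped in wherever the function and gradient of (64) are evaluated.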
We terminate our method once the constraint violation and the relative difference between the augmented Lagrangian function and the regular objective function are sufficiently small, that is,
max_{i≠j} [ |V_i^T Σ̂ V_j| − Δ_{ij} ]⁺ ≤ ε_I,   max_{i,j} |R_{ij}| ≤ ε_E,   |L(V, λ, µ) − f(V)| / max(|f(V)|, 1) ≤ ε_O,   (66)
where f(V) = −Tr(V^T Σ̂ V) + ρ • |V|, R is defined in (65), and ε_I, ε_E, ε_O are prescribed accuracy parameters corresponding to the inequality constraints, the equality constraints, and the objective function, respectively.

We next discuss how to apply the nonmonotone gradient methods proposed in Section 3.3 to the augmented Lagrangian subproblems, which have the form
min_V L(V, λ, µ),   (67)
where the function L(·, λ, µ) is defined in (64). Since the implementation details of these nonmonotone gradient methods are similar, we focus only on the second one, that is, the nonmonotone gradient method II. First, the initial point for this method can be chosen according to the scheme described at the end of Subsection 3.2. In addition, given an iterate V^k, the search direction d^k is computed by solving subproblem (28) with H = α_k^{−1} I, which becomes, in the context of (21) and (64),
d^k := arg min_d { ∇w(V^k) • d + (1/(2α_k)) ‖d‖²_F + ρ • |V^k + d| }.   (68)
Here α_k > 0 is chosen according to the scheme proposed by Barzilai and Borwein [2], which is also used by Birgin et al. [4] in studying a class of projected gradient methods. Let 0 < α_min < α_max be given. Initially, choose an arbitrary α_0 ∈ [α_min, α_max]. Then α_k is updated as follows:
α_{k+1} = α_max if b_k ≤ 0;   α_{k+1} = max{α_min, min{α_max, a_k/b_k}} otherwise,
where a_k = ‖V^k − V^{k−1}‖²_F and b_k = (V^k − V^{k−1}) • (∇w(V^k) − ∇w(V^{k−1})).
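The safeguarded Barzilai–Borwein update above can be sketched as follows (a NumPy illustration; the function name and default bounds are ours):

```python
import numpy as np

def bb_stepsize(V, V_prev, g, g_prev, a_min=1e-15, a_max=1.0):
    """Safeguarded Barzilai-Borwein step for updating alpha_k.

    a_k = ||V - V_prev||_F^2 and b_k = <V - V_prev, g - g_prev>; the new
    step is a_k/b_k projected onto [a_min, a_max], or a_max when b_k <= 0.
    """
    s = V - V_prev            # iterate difference
    y = g - g_prev            # gradient difference
    b = np.sum(s * y)
    if b <= 0:
        return a_max
    return max(a_min, min(a_max, np.sum(s * s) / b))
```

For a quadratic w, this step approximates the inverse curvature along the last move, which is why it often performs far better than a fixed stepsize in projected/proximal gradient schemes.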
It is not hard to verify that the optimal solution of problem (68) has a closed-form expression, given by
d^k = sign(C) ⊙ [ |C| − α_k ρ ]⁺ − V^k,   where C = V^k − α_k ∇w(V^k).
In addition, we see from Lemma 3.4 that the following termination criterion is suitable for this method when applied to (67):
max_{ij} |d_I(V^k)|_{ij} / max(|L(V^k, λ, µ)|, 1) ≤ ε,
where d_I(V^k) is the solution of (68) with α_k = 1, and ε is a prescribed accuracy parameter. In our numerical implementation we set α_0 = 1/max_{ij} |d_I(V^0)|_{ij}, α_max = 1, α_min = 10^{−15}, and ε = 10^{−4}.

Table 1: Sparse PCA methods used for our comparison

GPower_l1      Single-unit sparse PCA via l1-penalty
GPower_l0      Single-unit sparse PCA via l0-penalty
GPower_l1,m    Block sparse PCA via l1-penalty
GPower_l0,m    Block sparse PCA via l0-penalty
DSPCA          DSPCA algorithm
SPCA           SPCA algorithm
rSVD           sPCA-rSVD algorithm with soft thresholding
ALSPCA         Augmented Lagrangian algorithm

Finally, it should be mentioned that, for the sake of practical performance, the numerical implementation of our augmented Lagrangian method differs slightly from the one described in Subsection 3.2. In particular, we follow a scheme similar to the one discussed on pp. 405 of [3] to adjust the penalty parameter and Lagrange multipliers: they are updated separately rather than simultaneously. Roughly speaking, given γ ∈ (0, 1), we adjust the penalty parameter only when the constraint violation is not decreased by a factor γ over the previous minimization. Similarly, we update the Lagrange multipliers only when the constraint violation is decreased by a factor γ over the previous minimization. We choose γ = 0.25 in our implementation, as recommended in [3].
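The closed-form solution of subproblem (68) given above is an entrywise soft-thresholding step; a minimal NumPy sketch (the function name is ours; ρ may be a scalar or, via broadcasting, an array of entrywise weights):

```python
import numpy as np

def direction(V, grad_w, rho, alpha):
    """Closed-form minimizer of (68): soft-threshold C = V - alpha*grad_w
    at level alpha*rho, then subtract V to obtain the search direction.
    """
    C = V - alpha * grad_w
    return np.sign(C) * np.maximum(np.abs(C) - alpha * rho, 0.0) - V
```

With α_k = 1 the same routine yields d_I(V^k), which is exactly the quantity entering the inner termination criterion above.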
5 Numerical results

In this section we conduct numerical experiments with the augmented Lagrangian method detailed in Subsections 3.2 and 4.2 for formulation (58) (or, equivalently, (3)) of sparse PCA on synthetic, random, and real data. In particular, we compare the results of our approach with several existing sparse PCA methods in terms of total explained variance, correlation of PCs, and orthogonality of loading vectors: the generalized power methods (Journée et al. [16]), the DSPCA algorithm (d'Aspremont et al. [8]), the SPCA algorithm (Zou et al. [28]), and the sPCA-rSVD algorithm (Shen and Huang [25]). All the methods used in this section are listed in Table 1. Specifically, the methods with the prefix "GPower" are the generalized power methods studied in [16], and ALSPCA is the augmented Lagrangian method proposed in this paper.

As discussed in Section 2, the PCs obtained from standard PCA based on the sample covariance matrix Σ̂ are nearly uncorrelated when the sample size is sufficiently large, and the total variance explained by the first r PCs approximately equals the sum of the individual variances of the PCs, that is, Tr(V^T Σ̂ V), where V ∈ ℝ^{p×r} consists of the loading vectors of these PCs. However, the PCs found by sparse PCA methods may be correlated with each other, and thus the quantity Tr(V^T Σ̂ V) can substantially overestimate the total variance explained by the PCs, due to the overlap among their individual variances. In an attempt to deal with this overlap, two adjusted total explained variances were proposed in [28, 25]. It is not hard to observe that they can be viewed as the total explained variance of a set of variables transformed from the estimated sparse PCs.
Since these transformed variables can be dramatically different from the sparse PCs themselves, their total explained variances may also differ much from each other. To alleviate this drawback while taking into account the possible correlations among PCs, we introduce the following adjusted total explained variance for sparse PCs:
AdjVar(V) = Tr(V^T Σ̂ V) − √( Σ_{i≠j} (V_i^T Σ̂ V_j)² ).
Clearly, when the PCs are uncorrelated, this reduces to the usual total explained variance Tr(V^T Σ̂ V). We also define the cumulative percentage of adjusted variance (CPAV) for the first r sparse PCs as the quotient of the adjusted total explained variance of these PCs and the total variance explained by all standard PCs, that is, AdjVar(V)/Tr(Σ̂).

Finally, we stress that the sole purpose of this section is to compare the performance of the methods listed in Table 1 in finding sparse PCs that nearly enjoy the three important properties possessed by standard PCA (see Section 1). Therefore we do not compare the speed of these methods. Nevertheless, it should be mentioned that our method, ALSPCA, is a first-order method and is capable of solving large-scale problems within a reasonable amount of time, as observed in our experiments.

5.1 Synthetic data

In this subsection we use the synthetic data introduced by Zou et al. [28] to test the effectiveness of our approach ALSPCA in finding sparse PCs. The synthetic example [28] considers three hidden factors:
V₁ ∼ N(0, 290),   V₂ ∼ N(0, 300),   V₃ = −0.3 V₁ + 0.925 V₂ + ε,   ε ∼ N(0, 1),
where V₁, V₂ and ε are independent.
The 10 observable variables are then generated as follows:
X_i = V₁ + ε¹_i,   ε¹_i ∼ N(0, 1),   i = 1, 2, 3, 4,
X_i = V₂ + ε²_i,   ε²_i ∼ N(0, 1),   i = 5, 6, 7, 8,
X_i = V₃ + ε³_i,   ε³_i ∼ N(0, 1),   i = 9, 10,
where the ε^j_i are independent for j = 1, 2, 3 and i = 1, ..., 10. We use the actual covariance matrix of (X₁, ..., X₁₀) to find the standard and sparse PCs, respectively.

We first note that V₁ and V₂ are independent, while V₃ is a linear combination of V₁ and V₂. Moreover, the variances of the three underlying factors V₁, V₂ and V₃ are 290, 300, and 283.8, respectively, so V₂ is slightly more important than V₁, and both are much more important than V₃. In addition, the first two standard PCs together explain 99.72% of the total variance (see Table 2). These facts suggest that the first two sparse PCs should be sufficient to explain most of the variance. Ideally, the first sparse PC recovers the factor V₂ using only (X₅, X₆, X₇, X₈), and the second sparse PC recovers the factor V₁ using only (X₁, X₂, X₃, X₄). Moreover, since (X₅, X₆, X₇, X₈) and (X₁, X₂, X₃, X₄) are independent, these sparse PCs would be uncorrelated and orthogonal to each other.

Table 2: Loadings of the first two PCs by standard PCA and ALSPCA (synthetic data)

Variable    PCA PC1    PCA PC2    ALSPCA PC1    ALSPCA PC2
X1           0.1158     0.4785     0             0.5000
X2           0.1158     0.4785     0             0.5000
X3           0.1158     0.4785     0             0.5000
X4           0.1158     0.4785     0             0.5000
X5          -0.3955     0.1449    -0.5000        0
X6          -0.3955     0.1449    -0.5000        0
X7          -0.3955     0.1449    -0.5000        0
X8          -0.3955     0.1449    -0.5000        0
X9          -0.4005    -0.0095     0             0
X10         -0.4005    -0.0095     0             0
CPAV (%)        99.72                  80.46

In our test we set r = 2, Δ_{ij} = 0 for all i ≠ j, and ρ = 4 in formulation (58) of sparse PCA. In addition, we choose (66) as the termination criterion for ALSPCA, with ε_I = ε_O = 0.1 and ε_E = 10^{−3}.
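The actual covariance matrix of (X₁, ..., X₁₀) used in this test can be assembled directly from the factor model above (a NumPy sketch; the function name is ours):

```python
import numpy as np

def synthetic_covariance():
    """Exact covariance of (X1,...,X10) for the synthetic model of Zou et al.:
    V1 ~ N(0,290), V2 ~ N(0,300), V3 = -0.3*V1 + 0.925*V2 + eps, eps ~ N(0,1),
    with X_1..4 = V1 + noise, X_5..8 = V2 + noise, X_9..10 = V3 + noise,
    and all noise terms independent N(0,1).
    """
    load = np.zeros((10, 3))          # loadings of X on (V1, V2, eps)
    load[0:4, 0] = 1.0
    load[4:8, 1] = 1.0
    load[8:10, 0] = -0.3
    load[8:10, 1] = 0.925
    load[8:10, 2] = 1.0
    F = np.diag([290.0, 300.0, 1.0])  # factor covariance
    return load @ F @ load.T + np.eye(10)  # + independent unit noise
```

In particular, Var(V₃) = 0.3²·290 + 0.925²·300 + 1 ≈ 283.8, matching the value quoted above.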
The results of standard PCA and ALSPCA for this example are presented in Table 2. The loadings of the standard and sparse PCs are given in columns two and three, respectively, and their CPAVs are given in the last row. We clearly see that our sparse PCs are consistent with the ones predicted above. Interestingly, they are identical to the ones obtained by SPCA and DSPCA reported in [28, 8]. For general data, however, these methods may perform quite differently (see Subsection 5.3).

5.2 Random data

In this subsection we compare the performance of the GPower methods [16] and our ALSPCA method in finding sparse PCs on a set of randomly generated data. First, we randomly generate 100 centered data matrices X of size 20 × 20. Throughout this subsection we set Δ_{ij} = 0.5 for all i ≠ j in formulation (58) of sparse PCA, and choose (66) as the termination criterion for ALSPCA, with ε_I = ε_E = ε_O = 0.1.

In the first test we aim to find the first three sparse PCs with the average number of zero loadings around 30 (50% sparsity). To this end, the tuning parameter ρ for problem (58) and the parameters for the GPower methods are chosen appropriately. The results of the GPower methods and ALSPCA on the above randomly generated instances are presented in Table 3. The name of each method is given in column one. The sparsity, measured by the number of zero loadings averaged over all instances, is given in column two. Column three gives the average amount of non-orthogonality of the loading vectors, measured by the maximum absolute difference between 90° and the angles formed by all pairs of loading vectors; a smaller value in this column implies better orthogonality.
In addition, the maximum correlation and the CPAV of the sparse PCs, averaged over all instances, are presented in columns four and five, respectively. From Table 3 we see that the average numbers of zero loadings of the first three sparse PCs are almost the same for all methods, around 30. We also observe that the sparse PCs given by our method ALSPCA are almost uncorrelated and their loading vectors are nearly orthogonal, which is much superior to the GPower methods.

Table 3: Comparison of GPower and ALSPCA (random data: Test I)

Method        Sparsity   Non-orthogonality   Correlation   CPAV (%)
GPower_l1     30.73      3.480               0.104         39.06
GPower_l0     30.37      3.540               0.111         38.74
GPower_l1,m   30.61      5.220               0.161         38.12
GPower_l0,m   30.51      5.216               0.153         37.99
ALSPCA        30.91      1.007               0.038         40.79

Table 4: Comparison of GPower and ALSPCA (random data: Test II)

Method        Sparsity   Non-orthogonality   Correlation   CPAV (%)
GPower_l1     60.58      5.766               0.168         63.46
GPower_l0     60.53      6.031               0.169         63.27
GPower_l1,m   60.02      8.665               0.265         62.34
GPower_l0,m   60.14      9.146               0.272         62.00
ALSPCA        60.74      1.177               0.066         64.62

Moreover, our sparse PCs have a better CPAV than the others. The aim of the second test is to find the first six sparse PCs with the average number of zero loadings around 60 (50% sparsity). To reach this goal, the tuning parameter ρ for problem (58) and the parameters for the GPower methods are chosen appropriately. The results of the GPower methods and ALSPCA on the above randomly generated instances are presented in Table 4; each column of Table 4 has the same meaning as in Table 3. First, we clearly see that our method substantially outperforms the GPower methods in terms of uncorrelation of PCs, orthogonality of loading vectors, and CPAV.
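The four measures reported in Tables 3 and 4 (sparsity, non-orthogonality in degrees, maximum correlation, and CPAV based on AdjVar) can be sketched as follows, assuming nonzero loading vectors and positive PC variances (a NumPy illustration; the function name is ours):

```python
import numpy as np

def spca_metrics(V, Sigma):
    """Sparsity, non-orthogonality (max |90 deg - angle| over pairs of
    loading vectors), maximum absolute PC correlation, and
    CPAV = AdjVar(V)/Tr(Sigma), where
    AdjVar(V) = Tr(V^T Sigma V) - sqrt(sum_{i!=j} (V_i^T Sigma V_j)^2)."""
    VSV = V.T @ Sigma @ V
    off = VSV - np.diag(np.diag(VSV))
    # maximum absolute correlation between pairs of PCs
    d = np.sqrt(np.diag(VSV))
    maxcorr = np.max(np.abs(off / np.outer(d, d)))
    # deviation of loading-vector angles from 90 degrees
    norms = np.linalg.norm(V, axis=0)
    cosang = (V.T @ V) / np.outer(norms, norms)
    np.fill_diagonal(cosang, 0.0)  # ignore self-angles (they map to 90 deg)
    ang = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
    nonorth = np.max(np.abs(90.0 - ang))
    adjvar = np.trace(VSV) - np.sqrt(np.sum(off ** 2))
    return int(np.sum(V == 0)), nonorth, maxcorr, adjvar / np.trace(Sigma)
```

For uncorrelated, orthonormal PCs the off-diagonal terms vanish and CPAV reduces to the usual explained-variance ratio.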
Further, comparing with Table 3, we observe that as the number of PCs increases, the CPAV grows accordingly for all methods; moreover, the sparse PCs given by the GPower methods become much more correlated and non-orthogonal to each other, while the performance of our sparse PCs remains almost the same. This phenomenon is not surprising, given that uncorrelation and orthogonality are not well taken into account in the GPower methods.

5.3 Pitprops data

In this subsection we test the performance of our approach ALSPCA in finding sparse PCs on real data, namely the Pitprops data, and compare the results with several existing methods [28, 8, 25, 16]. The Pitprops data, introduced by Jeffers [14], has 180 observations and 13 measured variables. It is a classic example illustrating the difficulty of interpreting PCs. Recently, several sparse PCA methods [17, 28, 25, 8] have been applied to this data set to find six sparse PCs using the actual covariance matrix. For ease of comparison, we present the standard PCs and some of those sparse PCs in Tables 5-8, respectively. It should be mentioned that two groups of sparse PCs were found by DSPCA with the parameter k₁ = 5 or 6, and they have similar sparsity and total explained variance (see [8] for details); thus we only present the latter (i.e., the one with k₁ = 6) in Table 8.
Also, we applied the GPower methods [16] to this data to find PCs with the largest sparsity among those given in [28, 25, 8], and found that the best result was given by GPower_l1; thus we only report the six sparse PCs obtained by GPower_l1, in Table 9. In addition, we present the sparsity, CPAV, non-orthogonality, and correlation of the PCs obtained by the standard PCA and the sparse PCA methods [28, 25, 8, 16] in columns two to five of Table 13, respectively. In more detail, the second and fifth columns give the sparsity (measured by the number of zero loadings) and the CPAV, respectively.

Table 5: Loadings of the first six PCs by standard PCA (Pitprops data)

Variable    PC1       PC2       PC3       PC4       PC5       PC6
topdiam     0.4038    0.2178    0.2073    0.0912    0.0826    0.1198
length      0.4055    0.1861    0.2350    0.1027    0.1128    0.1629
moist       0.1244    0.5406   -0.1415   -0.0784   -0.3498   -0.2759
testsg      0.1732    0.4556   -0.3524   -0.0548   -0.3558   -0.0540
ovensg      0.0572   -0.1701   -0.4812   -0.0491   -0.1761    0.6256
ringtop     0.2844   -0.0142   -0.4753    0.0635    0.3158    0.0523
ringbut     0.3998   -0.1897   -0.2531    0.0650    0.2151    0.0026
bowmax      0.2936   -0.1892    0.2431   -0.2856   -0.1853   -0.0551
bowdist     0.3566    0.0171    0.2076   -0.0967    0.1061    0.0342
whorls      0.3789   -0.2485    0.1188    0.2050   -0.1564   -0.1731
clear      -0.0111    0.2053    0.0704   -0.8036    0.3430    0.1753
knots      -0.1151    0.3432   -0.0920    0.3008    0.6003   -0.1698
diaknot    -0.1125    0.3085    0.3261    0.3034   -0.0799    0.6263

Table 6: Loadings of the first six PCs by SPCA (Pitprops data)

Variable    PC1       PC2       PC3       PC4       PC5       PC6
topdiam    -0.477     0         0         0         0         0
length     -0.476     0         0         0         0         0
moist       0         0.785     0         0         0         0
testsg      0         0.620     0         0         0         0
ovensg      0.177     0         0.640     0         0         0
ringtop     0         0         0.589     0         0         0
ringbut    -0.250     0         0.492     0         0         0
bowmax     -0.344    -0.021     0         0         0         0
bowdist    -0.416     0         0         0         0         0
whorls     -0.400     0         0         0         0         0
clear       0         0         0        -1         0         0
knots       0         0.013     0         0        -1         0
diaknot     0         0        -0.015     0         0         1
The third column reports the non-orthogonality, which is measured by the maximum absolute difference between 90° and the angles formed by all pairs of loading vectors. The fourth column presents the maximum correlation of the PCs. Though the PCs obtained by these sparse PCA methods have nice sparsity, we observe from Table 13 that they are highly correlated, and moreover almost all of them are far from orthogonal to each other, except the ones given by SPCA [28]. To improve the quality of the sparse PCs, we next apply our approach ALSPCA and compare the results with these methods. For all tests below, we choose (66) as the termination criterion for ALSPCA with ε_O = 0.1 and ε_I = ε_E = 10^{-3}.

In the first experiment, we aim to find six uncorrelated and orthogonal sparse PCs by ALSPCA while explaining most of the variance. In particular, we set r = 6, Δ_ij = 0.07 for all i ≠ j, and ρ = 0.8 for formulation (58) of sparse PCA. The resulting sparse PCs are presented in Table 10, and their sparsity, CPAV, non-orthogonality and correlation are reported in row seven of Table 13.

Table 7: Loadings of the first six PCs by rSVD (Pitprops data)

Variable    PC1      PC2      PC3      PC4      PC5      PC6
topdiam    -0.449    0        0       -0.114    0        0
length     -0.460    0        0       -0.102    0        0
moist       0       -0.707    0        0        0        0
testsg      0       -0.707    0        0        0        0
ovensg      0        0        0.550    0        0       -0.744
ringtop    -0.199    0        0.546   -0.176    0        0
ringbut    -0.399    0        0.366    0        0        0
bowmax     -0.279    0        0        0.422    0        0
bowdist    -0.380    0        0        0.283    0        0
whorls     -0.407    0        0        0        0.231    0
clear       0        0        0       -0.785   -0.973    0
knots       0        0        0       -0.265    0        0.161
diaknot     0        0       -0.515    0        0       -0.648

Table 8: Loadings of the first six PCs by DSPCA (Pitprops data)

Variable    PC1      PC2      PC3      PC4      PC5      PC6
topdiam    -0.4907   0        0        0        0        0
length     -0.5067   0        0        0        0        0
moist       0        0.7071   0        0        0        0
testsg      0        0.7071   0        0        0        0
ovensg      0        0        0        0       -1.0000   0
ringtop    -0.0670   0       -0.8731   0        0        0
ringbut    -0.3566   0       -0.4841   0        0        0
bowmax     -0.2335   0        0        0        0        0
bowdist    -0.3861   0        0        0        0        0
whorls     -0.4089   0        0        0        0        0
clear       0        0        0        0        0        1.0000
knots       0        0        0        1.0000   0        0
diaknot     0        0        0.0569   0        0        0

Table 9: Loadings of the first six PCs by GPower_ℓ1 (Pitprops data)

Variable    PC1      PC2      PC3      PC4      PC5      PC6
topdiam    -0.4182   0        0        0        0        0
length     -0.4205   0        0        0        0        0
moist       0       -0.7472   0        0        0        0
testsg     -0.1713  -0.6646   0        0        0        0
ovensg      0        0        0        0       -0.7877   0
ringtop    -0.2843   0        0        0       -0.6160   0
ringbut    -0.4039   0        0        0        0        0
bowmax     -0.3002   0        0        0        0        0
bowdist    -0.3677   0        0        0        0        0
whorls     -0.3868   0        0        0        0        0
clear       0        0        0        0        0        1.0000
knots       0        0        0        1.0000   0        0
diaknot     0        0        1.0000   0        0        0

Table 10: Loadings of the first six PCs by ALSPCA (Pitprops data: Test I)

Variable    PC1      PC2      PC3      PC4      PC5      PC6
topdiam     0.4394   0        0        0        0        0
length      0.4617   0        0        0        0        0
moist       0.0419   0.4611  -0.1644   0.0688  -0.3127   0
testsg      0.1058   0.7902   0        0        0        0
ovensg      0.0058   0        0        0        0        0
ringtop     0.1302   0        0.2094   0        0        0.9999
ringbut     0.3477   0        0.0515   0        0.3240   0
bowmax      0.2256  -0.3566   0        0        0        0
bowdist     0.4063   0        0        0        0        0
whorls      0.4606   0        0        0        0       -0.0125
clear       0        0.0369   0       -0.9973   0        0
knots      -0.1115   0.1614  -0.0762   0.0239   0.8929   0
diaknot    -0.0487   0.0918   0.9595   0.0137   0        0

We easily observe that our method ALSPCA overall outperforms the other sparse PCA methods substantially in all aspects except sparsity. Naturally, we can improve the sparsity by increasing the value of ρ, yet the total explained variance may be sacrificed, as demonstrated in our next experiment.

We now attempt to find six PCs with similar correlation and orthogonality but higher sparsity than those given in the above experiment. For this purpose, we set Δ_ij = 0.07 for all i ≠ j and choose ρ = 2.1 for problem (58) in this experiment. The resulting sparse PCs are presented in Table 11, and the CPAV, non-orthogonality and correlation of these PCs are given in row eight of Table 13. In comparison with the ones given in the above experiment, the PCs obtained in this experiment are much sparser while retaining almost the same correlation and orthogonality. However, their CPAV goes down dramatically. Combining the results of these two experiments, we deduce that for the Pitprops data it does not seem possible to extract six highly sparse (e.g., around 60 zero loadings), nearly orthogonal and uncorrelated PCs that explain most of the variance, as such PCs may not exist. The following experiment further supports this deduction.

Finally, we are interested in exploring how the correlation-controlling parameters Δ_ij (i ≠ j) affect the performance of the sparse PCs. In particular, we set Δ_ij = 0.5 for all i ≠ j and choose ρ = 0.7 for problem (58). The obtained sparse PCs are presented in Table 12, and the CPAV, non-orthogonality and correlation of these PCs are given in the last row of Table 13. We see that these PCs are highly sparse, orthogonal, and explain a good amount of variance. However, they are quite correlated with each other, which is actually not surprising given that the Δ_ij (i ≠ j) are not small. Despite this drawback, we observe that these sparse PCs still overall outperform those obtained by SPCA, rSVD, DSPCA and GPower_ℓ1.

From the above experiments, we may conclude that for the Pitprops data there do not exist six highly sparse, nearly orthogonal and uncorrelated PCs that explain most of the variance. Therefore, the most acceptable sparse PCs seem to be the ones given in Table 10.

6 Concluding remarks

In this paper we proposed a new formulation of sparse PCA for finding sparse and nearly uncorrelated principal components (PCs) with orthogonal loading vectors while explaining as much of the total variance as possible.
Table 11: Loadings of the first six PCs by ALSPCA (Pitprops data: Test II)

Variable    PC1      PC2      PC3      PC4      PC5      PC6
topdiam     1.0000   0        0        0        0        0
length      0       -0.2916  -0.1421   0        0       -0.0599
moist       0        0.9565  -0.0433   0        0       -0.0183
testsg      0        0        0        0.0786  -0.1330   0
ovensg      0        0       -0.9683   0        0        0
ringtop     0        0        0        0        0        0
ringbut     0        0        0.1949   0        0.2369   0
bowmax      0        0        0        0        0        0
bowdist     0        0        0        0        0        0
whorls      0        0        0        0        0        0
clear       0        0        0       -0.9969   0        0
knots       0        0       -0.0480   0.0109   0.9624   0
diaknot     0        0       -0.0093   0        0        0.9980

Table 12: Loadings of the first six PCs by ALSPCA (Pitprops data: Test III)

Variable    PC1      PC2      PC3      PC4      PC5      PC6
topdiam     0.4051   0        0        0        0        0
length      0.4248   0        0        0        0        0
moist       0        0.7262   0        0        0        0
testsg      0.0018   0.6875   0        0        0        0
ovensg      0        0       -1.0000   0        0        0
ringtop     0.1856   0        0        0        0        0
ringbut     0.4123   0        0        0        0        0
bowmax      0.3278   0        0        0        0        0
bowdist     0.3830   0        0        0        0        0
whorls      0.4437  -0.0028   0        0        0        0
clear       0        0        0       -1.0000   0        0
knots       0        0        0        0        1.0000   0
diaknot     0        0        0        0        0        1.0000

We also developed a novel globally convergent augmented Lagrangian method for solving a class of nonsmooth constrained optimization problems, which is well suited for our formulation of sparse PCA. Additionally, we proposed two nonmonotone gradient methods for solving the augmented Lagrangian subproblems, and established their global and local convergence. Finally, we compared our sparse PCA approach with several existing methods on synthetic, random, and real data, respectively. The computational results demonstrate that the sparse PCs produced by our approach substantially outperform those produced by the other methods in terms of total explained variance, correlation of PCs, and orthogonality of loading vectors.

As observed in our experiments, formulation (3) is very effective in finding the desired sparse PCs. However, there remains a natural theoretical question about it.
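The nonmonotone gradient methods mentioned above are developed earlier in the paper; purely to illustrate the underlying idea, the following is a generic nonmonotone Barzilai-Borwein gradient method in the spirit of [2, 4]. It is a sketch under simplifying assumptions (unconstrained, smooth objective), not the paper's exact algorithms; all names and parameter choices here are illustrative.

```python
import numpy as np

def nonmonotone_bb_gradient(f, grad, x0, M=10, max_iter=500, tol=1e-6,
                            sigma=1e-4, alpha0=1.0):
    """Generic nonmonotone Barzilai-Borwein gradient method (sketch).

    A step is accepted when f decreases relative to the maximum of the
    last M function values (a Grippo-style nonmonotone Armijo rule), and
    the trial steplength for the next iteration is the BB steplength.
    """
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    hist = [f(x)]
    alpha = alpha0
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        # Backtrack until the nonmonotone Armijo condition holds.
        fref = max(hist[-M:])
        t = alpha
        while t > 1e-16 and f(x - t * g) > fref - sigma * t * (g @ g):
            t *= 0.5
        x_new = x - t * g
        g_new = grad(x_new)
        # Barzilai-Borwein steplength for the next iteration, safeguarded.
        s, y = x_new - x, g_new - g
        alpha = float(s @ s) / max(float(s @ y), 1e-12)
        alpha = min(max(alpha, 1e-10), 1e10)
        x, g = x_new, g_new
        hist.append(f(x))
    return x
```

On a strictly convex quadratic, for example f(x) = x'Ax/2 with A = diag(1, 10), the iterates converge to the minimizer at the origin.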
Given a set of random variables, suppose there exist sparse and uncorrelated PCs with orthogonal loading vectors that explain most of the variance of the variables. In other words, their actual covariance matrix Σ has a few dominant eigenvalues and the associated orthonormal eigenvectors are sparse. Since in practice Σ is typically unknown and is only approximated by a sample covariance matrix Σ̂, one natural question is whether or not there exist suitable parameters ρ and Δ_ij (i ≠ j) such that (3) is able to recover those sparse PCs almost surely as the sample size becomes sufficiently large.

Table 13: Comparison of SPCA, rSVD, DSPCA, GPower_ℓ1 and ALSPCA (Pitprops data)

Method       Sparsity   Non-orthogonality   Correlation   CPAV (%)
PCA           0          0                   0             87.00
SPCA         60          0.86                0.395         66.21
rSVD         53         14.76                0.459         67.04
DSPCA        63         13.63                0.573         60.97
GPower_ℓ1    63         10.09                0.353         64.15
ALSPCA-1     46          0.03                0.082         69.55
ALSPCA-2     60          0.03                0.084         39.42
ALSPCA-3     63          0.00                0.222         65.97

In Section 4 we showed that Robinson's condition (10) holds at a set of feasible points of (58). Also, we observed from our experiments that the accumulation points of our augmented Lagrangian method lie in this set when applied to (58), and thus it converges. However, it remains open whether or not Robinson's condition holds at all feasible points of (58).

In addition, Burer and Monteiro [5] recently applied the classical augmented Lagrangian method to a nonconvex nonlinear programming (NLP) reformulation of semidefinite programs (SDPs), and obtained some nice computational results, especially for the SDP relaxations of several hard combinatorial optimization problems. However, the classical augmented Lagrangian method generally cannot guarantee convergence to a feasible point when applied to a nonconvex NLP.
Due to this and [19], at least theoretically, their approach [5] may not converge to a feasible point of the primal SDP. Given that the augmented Lagrangian method proposed in this paper converges globally under some mild assumptions, it would be interesting to apply it to the NLP reformulation of SDPs and compare its performance with the approach studied in [5].

Finally, the codes of our approach for solving the sparse PCA formulation (58) (or, equivalently, (3)) are written in Matlab and are available online at www.math.sfu.ca/~zhaosong. As future research, we will further improve their performance by conducting more extensive computational experiments and exploring more practical applications.

Acknowledgement

We gratefully acknowledge comments from Jim Burke, Terry Rockafellar, Defeng Sun and Paul Tseng at the West Coast Optimization Meeting at the University of Washington, Seattle, USA in Spring 2009.

References

[1] O. Alter, P. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences, 97:10101-10106, 2000.

[2] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8:141-148, 1988.

[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

[4] E. G. Birgin, J. M. Martínez, and M. Raydan. Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization, 4:1196-1211, 2000.

[5] S. Burer and R. D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, Series B, 95:329-357, 2003.

[6] J. Cadima and I. Jolliffe. Loadings and correlations in the interpretation of principal components. Journal of Applied Statistics, 22:203-214, 1995.

[7] A. d'Aspremont, F. R. Bach, and L. El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269-1294, 2008.

[8] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49:434-448, 2007.

[9] K. Fan. On a theorem of Weyl concerning the eigenvalues of linear transformations. Proceedings of the National Academy of Sciences of the U.S.A., 35:652-655, 1949.

[10] P. Hancock, A. Burton, and V. Bruce. Face processing: human perception and principal components analysis. Memory and Cognition, 24:26-40, 1996.

[11] T. Hastie, R. Tibshirani, M. Eisen, P. Brown, D. Ross, U. Scherf, J. Weinstein, A. Alizadeh, L. Staudt, and D. Botstein. 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology, 1:1-21, 2000.

[12] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.

[13] J. B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I. Comprehensive Studies in Mathematics, volume 305, Springer-Verlag, New York, 1993.

[14] J. Jeffers. Two case studies in the application of principal component analysis. Applied Statistics, 16:225-236, 1967.

[15] I. Jolliffe. Rotation of principal components: choice of normalization constraints. Journal of Applied Statistics, 22:29-35, 1995.

[16] M. Journée, Yu. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. CORE Discussion Paper 2008/70. Submitted to Journal of Machine Learning Research.

[17] I. T. Jolliffe, N. T. Trendafilov, and M. Uddin. A modified principal component technique based on the Lasso. Journal of Computational and Graphical Statistics, 12:531-547.

[18] B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse PCA: exact and greedy algorithms. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 915-922. MIT Press, Cambridge, MA, 2006.

[19] R. D. C. Monteiro. Private communication, 2009.

[20] M. L. Overton and R. S. Womersley. Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Mathematical Programming, 62:321-357, 1993.

[21] S. M. Robinson. Stability theory for systems of inequalities, Part 2: differentiable nonlinear systems. SIAM Journal on Numerical Analysis, 13:497-513, 1976.

[22] S. M. Robinson. Local structure of feasible sets in nonlinear programming, Part I: regularity. In V. Pereira and A. Reinoza, editors, Numerical Methods, Lecture Notes in Mathematics, vol. 1005, Springer-Verlag, Berlin, 1983.

[23] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

[24] A. Ruszczyński. Nonlinear Optimization. Princeton University Press, 2006.

[25] H. Shen and J. Z. Huang. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99(6):1015-1034, 2008.

[26] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387-423, 2009.

[27] S. J. Wright, R. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. To appear in IEEE Transactions on Signal Processing.

[28] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.