Convex Optimization without Projection Steps with Applications to Sparse and Low Rank Approximation

Martin Jaggi
ETH Zürich, Switzerland
jaggi@inf.ethz.ch

Abstract. We study the general problem of minimizing a convex function over a compact convex domain. We will investigate a simple iterative approximation algorithm based on the method by Frank & Wolfe [FW56], that does not need projection steps in order to stay inside the optimization domain. Instead of a projection step, the linearized problem defined by a current subgradient is solved, which gives a step direction that will naturally stay in the domain. Our framework generalizes the sparse greedy algorithm of [FW56] and its primal-dual analysis by [Cla10] (and the low-rank SDP approach by [Haz08]) to arbitrary convex domains. Analogously, we give a convergence proof guaranteeing ε-small duality gap after O(1/ε) iterations.

The method allows us to understand the sparsity of approximate solutions for any ℓ1-regularized convex optimization problem (and for optimization over the simplex), expressed as a function of the approximation quality. We obtain matching upper and lower bounds of Θ(1/ε) for the sparsity for ℓ1-problems. The same bounds apply to low-rank semidefinite optimization with bounded trace, showing that rank O(1/ε) is best possible here as well. As another application, we obtain sparse matrices of O(1/ε) non-zero entries as ε-approximate solutions when optimizing any convex function over a class of diagonally dominant symmetric matrices.

We show that our proposed first-order method also applies to nuclear norm and max-norm matrix optimization problems. For nuclear norm regularized optimization, such as matrix completion and low-rank recovery, we demonstrate the practical efficiency and scalability of our algorithm for large matrix problems, as e.g. the Netflix dataset.
For general convex optimization over bounded matrix max-norm, our algorithm is the first with a convergence guarantee, to the best of our knowledge. (This article consists of the first two chapters of the author's PhD thesis [Jag11].)

Contents
1 Introduction
2 The Poor Man's Approach to Convex Optimization and Duality
  2.1 Subgradients of a Convex Function
  2.2 A Duality for Convex Optimization over Compact Domain
3 A Projection-Free First-Order Method for Convex Optimization
  3.1 The Algorithm
  3.2 Obtaining a Guaranteed Small Duality Gap
  3.3 Choosing the Optimal Step-Size by Line-Search
  3.4 The Curvature Measure of a Convex Function
  3.5 Optimizing over Convex Hulls
  3.6 Randomized Variants, and Stochastic Optimization
  3.7 Relation to Classical Convex Optimization
4 Sparse Approximation over the Simplex
  4.1 Upper Bound: Sparse Greedy on the Simplex
  4.2 Ω(1/ε) Lower Bound on the Sparsity
5 Sparse Approximation with Bounded ℓ1-Norm
  5.1 Relation to Matching Pursuit and Basis Pursuit in Compressed Sensing
6 Optimization with Bounded ℓ∞-Norm
7 Semidefinite Optimization with Bounded Trace
  7.1 Low-Rank Semidefinite Optimization with Bounded Trace: The O(1/ε) Algorithm by Hazan
  7.2 Solving Arbitrary SDPs
  7.3 Two Improved Variants of Algorithm 6
  7.4 Ω(1/ε) Lower Bound on the Rank
8 Semidefinite Optimization with ℓ∞-Bounded Diagonal
9 Sparse Semidefinite Optimization
10 Submodular Optimization
11 Optimization with the Nuclear and Max-Norm
  11.1 Introduction
  11.2 The Nuclear Norm for Matrices
    11.2.1 Weighted Nuclear Norm
  11.3 The Max-Norm for Matrices
  11.4 Optimizing with Bounded Nuclear Norm and Max-Norm
    11.4.1 Optimization with a Nuclear Norm Regularization
    11.4.2 Optimization with a Max-Norm Regularization
  11.5 Applications
    11.5.1 Robust Principal Component Analysis
    11.5.2 Matrix Completion and Low Norm Matrix Factorizations
    11.5.3 The Structure of the Resulting Eigenvalue Problems
    11.5.4 Relation to Simon Funk's SVD Method
  11.6 Experimental Results
  11.7 Conclusion
Bibliography

1 Introduction

Motivation.
For the performance of large scale approximation algorithms for convex optimization, the trade-off between the number of iterations on one hand, and the computational cost per iteration on the other hand, is of crucial importance. The lower complexity per iteration is among the main reasons why first-order methods (i.e., methods using only information from the first derivative of the objective function), such as stochastic gradient descent, are currently used much more widely and successfully in many machine learning applications, despite the fact that they often need a larger number of iterations than, for example, second-order methods.

Classical gradient descent optimization techniques usually require a projection step in each iteration, in order to get back to the feasible region. For a variety of applications, this is a non-trivial and costly step. One prominent example is semidefinite optimization, where the projection of an arbitrary symmetric matrix back onto the PSD matrices requires the computation of a complete eigenvalue decomposition.

Here we study a simple first-order approach that does not need any projection steps, and is applicable to any convex optimization problem over a compact convex domain. The algorithm is a generalization of an existing method originally proposed by Frank & Wolfe [FW56], which was recently extended and analyzed in the seminal paper of Clarkson [Cla10] for optimization over the unit simplex.

Instead of a projection, the primitive operation of the optimizer here is to minimize a linear approximation to the function over the same (compact) optimization domain. Any (approximate) minimizer of this simpler linearized problem is then chosen as the next step direction. Because all such candidates are always feasible for the original problem, the algorithm will automatically stay in our convex feasible region.
The analysis will show that the number of steps needed is roughly identical to classical gradient descent schemes, meaning that O(1/ε) steps suffice in order to obtain an approximation quality of ε > 0.

The main question about the efficiency per iteration of our algorithm, compared to a classical gradient descent step, cannot be answered generally in favor of one or the other. Whether a projection or a linearized problem is computationally cheaper will crucially depend on the shape and the representation of the feasible region. Interestingly, if we consider convex optimization over the Euclidean ‖.‖2-ball, the two approaches fully coincide, i.e., we exactly recover classical gradient descent. However, there are several classes of optimization problems where the linearization approach we present here is definitely very attractive, and leads to faster and simpler algorithms. This includes for example ℓ1-regularized problems, which we discuss in Sections 4 and 5, as well as semidefinite optimization under bounded trace, as studied by [Haz08], see Section 7.

Sparsity and Low-Rank. For these mentioned specific classes of convex optimization problems, we will additionally demonstrate that our studied algorithm leads to (optimally) sparse or low-rank solutions. This property is a crucial side-effect that can usually not be achieved by classical optimization techniques, and corresponds to the coreset concept known from computational geometry, see also [GJ09]. More precisely, we show matching upper and lower bounds of Θ(1/ε) for the sparsity of solutions to general ℓ1-regularized problems, and also for optimizing over the simplex, if the required approximation quality is ε. For matrix optimization, an analogous statement will hold for the rank in the case of nuclear norm regularized problems.

Applications.
Applications of the first mentioned class of ℓ1-regularized problems include many machine learning algorithms ranging from support vector machines (SVMs) to boosting and multiple kernel learning, as well as ℓ2-support vector regression (SVR), mean-variance analysis in portfolio selection [Mar52], the smallest enclosing ball problem [BC07], ℓ1-regularized least squares (also known as basis pursuit de-noising in compressed sensing), the Lasso [Tib96], and ℓ1-regularized logistic regression [KKB07], as well as walking of artificial dogs over rough terrain [KBP+10].

The second mentioned class of matrix problems, that is, optimizing over semidefinite matrices with bounded trace, has applications in low-rank recovery [FHB01, CR09, CT10], dimensionality reduction, matrix factorization and completion problems, as well as general semidefinite programs (SDPs). Further applications to nuclear norm and max-norm optimization, such as sparse/robust PCA, will be discussed in Section 11.

History and Related Work. The class of first-order optimization methods in the spirit of Frank and Wolfe [FW56] has a rich history in the literature. Although the focus of the original paper was on quadratic programming, its last section [FW56, Section 6] already introduces the general algorithm for minimizing convex functions using the above mentioned linearization idea, when the optimization domain is given by linear inequality constraints. In this case, each intermediate step consists of solving a linear program. The given convergence guarantee bounds the primal error, and assumes that all internal problems are solved exactly.

Later, [DH78] generalized the same method to arbitrary convex domains, and improved the analysis to also work when the internal subproblems are only solved approximately, see also [Dun80].
Patriksson [Pat93, Pat98] then revisited the general optimization paradigm, investigated several interesting classes of convex domains, and coined the term "cost approximation" for this type of algorithms. More recently, [Zha03] considered optimization over convex hulls, and studied the crucial concept of sparsity of the resulting approximate solutions. However, that proposed algorithm does not use linear subproblems.

The most recent work of Clarkson [Cla10] provides a good overview of the existing lines of research, investigates the sparsity of solutions when the optimization domain is the unit simplex, and establishes the connection to coreset methods from computational geometry. Furthermore, [Cla10] was the first to introduce the stronger notion of convergence in primal-dual error for this class of problems, and to relate this notion of duality gap to Wolfe duality.

Our Contributions. The character of this article mostly lies in reviewing, re-interpreting and generalizing the existing approach given by [Cla10], [Haz08] and the earlier papers [Zha03, DH78, FW56], who do deserve credit for the analysis techniques. Our contribution here is to transfer these methods to the more general case of convex optimization over arbitrary bounded convex subsets of a vector space, while providing stronger primal-dual convergence guarantees. To do so, we propose a very simple alternative concept of optimization duality, which will allow us to generalize the stronger primal-dual convergence analysis which [Cla10] has provided for the simplex case, to optimization over arbitrary convex domains. So far, no such guarantees on the duality gap were known in the literature for the Frank-Wolfe-type algorithms [FW56], except when optimizing over the simplex.
Furthermore, we generalize Clarkson's analysis [Cla10] to work when only approximate linear internal optimizers are used, and to arbitrary starting points. Also, we study the sparsity of solutions in more detail, obtaining upper and lower bounds for the sparsity of approximate solutions for a wider class of domains.

Our proposed notion of duality gives simple certificates for the current approximation quality, which can be used for any optimization algorithm for convex optimization problems over bounded domain, even in the case of non-differentiable objective functions.

We demonstrate the broad applicability of our general technique to several important classes of optimization problems, such as ℓ1- and ℓ∞-regularized problems, as well as semidefinite optimization with uniformly bounded diagonal, and sparse semidefinite optimization. Later, in Section 11, we will give a simple transformation in order to apply the first-order optimization techniques we review here also to nuclear norm and max-norm matrix optimization problems.

Acknowledgments. Credit for the important geometric interpretation of the duality gap over the spectahedron as the distance to the linearization goes to Soeren Laue. Furthermore, the author would like to thank Marek Sulovský, Bernd Gärtner, Arkadi Nemirovski, Elad Hazan, Joachim Giesen, Sebastian Stich, Michel Baes, Michael Bürgisser and Christian Lorenz Müller for helpful discussions and comments.

2 The Poor Man's Approach to Convex Optimization and Duality

The Idea of a Duality given by Supporting Hyperplanes. Suppose we are given the task of minimizing a convex function f over a bounded convex set D ⊂ ℝⁿ, and let us assume for the moment that f is continuously differentiable. Then for any point x ∈ D, it seems natural to consider the tangential "supporting" hyperplane to the graph of the function f at the point (x, f(x)).
Since the function f is convex, any such linear approximation must lie below the graph of the function. Using this linear approximation for each point x ∈ D, we define a dual function value ω(x) as the minimum of the linear approximation to f at the point x, where the minimum is taken over the domain D. We note that the point attaining this linear minimum also seems to be a good direction of improvement for our original minimization problem given by f, as seen from the current point x. This idea will lead to the optimization algorithm that we will discuss below.

As the entire graph of f lies above any such linear approximation, it is easy to see that ω(x) ≤ f(y) holds for each pair x, y ∈ D. This fact is called weak duality in the optimization literature.

This rather simple definition already completes the duality concept that we will need in this paper. We will provide a slightly more formal and concise definition in the next subsection, which is useful also for the case of non-differentiable convex functions. The reason we call this concept a poor man's duality is that we think it is considerably more direct and intuitive for the setting here, when compared to classical Lagrange duality or Wolfe duality, see e.g. [BV04].

2.1 Subgradients of a Convex Function

In the following, we will work over a general vector space X equipped with an inner product ⟨.,.⟩. As the most prominent example in our investigations, the reader might always think of the case X = ℝⁿ with ⟨x, y⟩ = xᵀy being the standard Euclidean scalar product.

We consider general convex optimization problems given by a convex function f : X → ℝ over a compact¹ convex domain D ⊆ X, or formally

    minimize_{x ∈ D} f(x).    (1)

In order to develop both our algorithm and the notion of duality for such convex optimization problems in the following, we need to formally define the supporting hyperplanes at a given point x ∈ D. These planes coincide exactly with the well-studied concept of subgradients of a convex function.

For each point x ∈ D, the subdifferential at x is defined as the set of normal vectors of the affine linear functions through (x, f(x)) that lie below the function f. Formally,

    ∂f(x) := { d_x ∈ X | f(y) ≥ f(x) + ⟨y − x, d_x⟩ ∀ y ∈ D }.    (2)

Any element d_x ∈ ∂f(x) is called a subgradient to f at x. Note that for each x, ∂f(x) is a closed convex set. Furthermore, if f is differentiable, then the subdifferential consists of exactly one element for each x ∈ D, namely ∂f(x) = {∇f(x)}, as explained e.g. in [Nem05, KZ05].

If we assume that f is convex and lower semicontinuous² on D, then it is known that ∂f(x) is non-empty, meaning that there exists at least one subgradient d_x for every point x ∈ D. For a more detailed investigation of subgradients, we refer the reader to the works of e.g. [Roc97, BV04, Nem05, KZ05, BL06].

¹ Here we call a set D ⊆ X compact if it is closed and bounded. See [KZ05] for more details.

2.2 A Duality for Convex Optimization over Compact Domain

For a given point x ∈ D, and any choice of a subgradient d_x ∈ ∂f(x), we define a dual function value

    ω(x, d_x) := min_{y ∈ D} f(x) + ⟨y − x, d_x⟩.    (3)

In other words, ω(x, d_x) ∈ ℝ is the minimum of the linear approximation to f defined by the subgradient d_x at the supporting point x, where the minimum is taken over the domain D. This minimum is always attained, since D is compact and the linear function is continuous in y.
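To make the dual value (3) concrete, here is a small numerical sketch. We assume, purely for illustration, that D is the unit simplex, where a linear function is minimized at the vertex with the smallest coefficient; the objective f(x) = ⟨x, x⟩ and all function names here are our own assumptions, not from the text.

```python
import numpy as np

def f(x):
    # illustrative smooth convex objective f(x) = <x, x> (an assumption)
    return float(x @ x)

def grad_f(x):
    return 2.0 * x

def omega(x):
    # dual value (3) for differentiable f over the unit simplex D:
    # min_{y in D} f(x) + <y - x, grad f(x)>; a linear function over the
    # simplex is minimized at the vertex e_i of its smallest coefficient
    d = grad_f(x)
    return f(x) + float(d.min() - x @ d)

rng = np.random.default_rng(0)
x = rng.dirichlet(np.ones(5))             # a point in the simplex
ys = rng.dirichlet(np.ones(5), size=200)  # feasible test points

# weak duality: omega(x) <= f(y) for every feasible y
assert all(omega(x) <= f(y) + 1e-12 for y in ys)
```

The same one-line linear minimization (pick the smallest gradient coordinate) reappears as the internal primitive of the algorithm in Section 3.1.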
By the definition of the subgradient, as lying below the graph of the function f, we readily obtain the property of weak duality, which is at the core of the optimization approach we will study below.

Lemma 1 (Weak duality). For all pairs x, y ∈ D, it holds that

    ω(x, d_x) ≤ f(y).

Proof. Immediately from the definition of the dual ω(.,.):

    ω(x, d_x) = min_{z ∈ D} f(x) + ⟨z − x, d_x⟩ ≤ f(x) + ⟨y − x, d_x⟩ ≤ f(y).

Here the last inequality is by the definition (2) of a subgradient.

Geometrically, this fact can be understood as saying that any function value f(y), which is "part of" the graph of f, always lies higher than the minimum over any linear approximation (given by d_x) to f.

In the case that f is differentiable, there is only one possible choice for a subgradient, namely d_x = ∇f(x), and so we will then denote the (unique) dual value for each x by

    ω(x) := ω(x, ∇f(x)) = min_{y ∈ D} f(x) + ⟨y − x, ∇f(x)⟩.    (4)

² The assumption that our objective function f is lower semicontinuous on D is equivalent to the fact that its epigraph, i.e. the set {(x, t) ∈ D × ℝ | t ≥ f(x)} of all points lying on or above the graph of the function, is closed, see also [KZ05, Theorem 7.1.2].

The Duality Gap as a Measure of Approximation Quality. The above duality concept allows us to compute a very simple measure of approximation quality, for any candidate solution x ∈ D to problem (1). This measure will be easy to compute even if the true optimum value f(x*) is unknown, which will very often be the case in practical applications.
The duality gap g(x, d_x) at any point x ∈ D, for any chosen subgradient d_x ∈ ∂f(x), is defined as

    g(x, d_x) := f(x) − ω(x, d_x) = max_{y ∈ D} ⟨x − y, d_x⟩,    (5)

or in other words as the difference of the current function value f(x) to the minimum value of the corresponding linearization at x, taken over the domain D.

By the weak duality Lemma 1, we obtain that for any optimal solution x* to problem (1), it holds that

    g(x, d_x) ≥ f(x) − f(x*) ≥ 0    ∀ x ∈ D, ∀ d_x ∈ ∂f(x).    (6)

Here the quantity f(x) − f(x*) is what we call the primal error at the point x, which is usually impossible to compute due to x* being unknown. The above inequality (6) now tells us that the duality gap, which is easy to compute given d_x, is always an upper bound on the primal error. This property makes the duality gap an extremely useful measure, for example as a stopping criterion in practical optimizers or heuristics.

We call a point x ∈ X an ε-approximation if g(x, d_x) ≤ ε for some choice of subgradient d_x ∈ ∂f(x).

For the special case that f is differentiable, we will again use the simpler notation g(x) for the (unique) duality gap for each x, i.e.

    g(x) := g(x, ∇f(x)) = max_{y ∈ D} ⟨x − y, ∇f(x)⟩.

Relation to Duality of Norms. In the special case when the optimization domain D is given by the unit ball of some norm on the space X, we observe the following:

Observation 2. For optimization over any domain D = { x ∈ X | ‖x‖ ≤ 1 } being the unit ball of some norm ‖.‖, the duality gap for the optimization problem min_{x ∈ D} f(x) is given by

    g(x, d_x) = ‖d_x‖* + ⟨x, d_x⟩,

where ‖.‖* is the dual norm of ‖.‖.

Proof.
Directly by the definitions of the dual norm, ‖x‖* = sup_{‖y‖ ≤ 1} ⟨y, x⟩, and of the duality gap, g(x, d_x) = max_{y ∈ D} ⟨y, −d_x⟩ + ⟨x, d_x⟩, as in (5).

3 A Projection-Free First-Order Method for Convex Optimization

3.1 The Algorithm

In the following we will generalize the sparse greedy algorithm of [FW56] and its analysis by [Cla10] to convex optimization over arbitrary compact convex sets D ⊆ X of a vector space. More formally, we assume that the space X is a Hilbert space, and consider problems of the form (1), i.e.,

    minimize_{x ∈ D} f(x).

Here we suppose that the objective function f is differentiable over the domain D, and that for any x ∈ D, we are given the gradient ∇f(x) via an oracle.

The existing algorithms so far applied only to convex optimization over the simplex (or convex hulls in some cases) [Cla10], or over the spectahedron of PSD matrices [Haz08], or otherwise did not provide guarantees on the duality gap. Inspired by the work of Hazan [Haz08], we can also relax the requirement of exactly solving the linearized problem in each step, to just computing approximations thereof, while keeping the same convergence results. This allows for more efficient steps in many applications. Also, our algorithm variant works for arbitrary starting points, without needing to compute the best initial "starting vertex" of D as in [Cla10].

The Primal-Dual Idea. We are motivated by the geometric interpretation of the "poor man's" duality gap, as explained in the previous Section 2. This duality gap is the maximum difference of the current value f(x) to the linear approximation to f at the currently fixed point x, where the linear maximum is taken over the domain D. This observation leads to the algorithmic idea of directly using the current linear approximation (over the convex domain D) to obtain a good direction for the next step, automatically staying in the feasible region.
The general optimization method with its two precision variants is given in Algorithm 1. For the approximate variant, the constant C_f > 0 is an upper bound on the curvature of the objective function f, which we will explain below in more detail.

Algorithm 1 Greedy on a Convex Set
  Input: Convex function f, convex set D, target accuracy ε
  Output: ε-approximate solution for problem (1)
  Pick an arbitrary starting point x^(0) ∈ D
  for k = 0 ... ∞ do
    Let α := 2/(k + 2)
    Compute s := ExactLinear(∇f(x^(k)), D)   {Solve the linearized primitive task exactly}
      or
    Compute s := ApproxLinear(∇f(x^(k)), D, αC_f)   {Approximate the linearized problem}
    Update x^(k+1) := x^(k) + α(s − x^(k))
  end for

The Linearized Primitive. The internal "step direction" procedure ExactLinear(c, D) used in Algorithm 1 is a method that minimizes the linear function ⟨x, c⟩ over the compact convex domain D. Formally, it must return a point s ∈ D such that ⟨s, c⟩ = min_{y ∈ D} ⟨y, c⟩. In terms of the smooth convex optimization literature, the vectors y that have negative scalar product with the gradient, i.e. ⟨y, ∇f(x)⟩ < 0, are called descent directions, see e.g. [BV04, Chapter 9]. The main difference to classical convex optimization is that we always choose descent steps staying in the domain D, whereas traditional gradient descent techniques usually use arbitrary directions and need to project back onto D after each step. We will comment more on this analogy in Section 3.7.

In the alternative interpretation of our duality concept from Section 2, the linearized sub-task means that we search for a point s that "realizes" the current duality gap g(x), that is, the distance to the linear approximation, as given in (5).
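The exact variant of Algorithm 1 can be sketched in a few lines. As an illustrative assumption (not fixed by the text), we take D to be the unit simplex, so that ExactLinear simply returns the vertex at the smallest gradient coordinate, and we use a small quadratic objective; all names and the gap-based stopping rule are our own.

```python
import numpy as np

def exact_linear(c, n):
    # ExactLinear(c, D) for D = unit simplex: min_{s in D} <s, c>
    # is attained at the vertex e_i with i = argmin_i c_i
    s = np.zeros(n)
    s[np.argmin(c)] = 1.0
    return s

def frank_wolfe(grad, n, eps, max_iter=200_000):
    # exact variant of Algorithm 1, stopping once the duality gap
    # g(x) = <x - s, grad f(x)> drops below the target accuracy eps
    x = np.ones(n) / n               # arbitrary feasible starting point
    for k in range(max_iter):
        d = grad(x)
        s = exact_linear(d, n)
        gap = float((x - s) @ d)     # duality gap (5), realized by s
        if gap <= eps:
            break
        alpha = 2.0 / (k + 2)
        x = x + alpha * (s - x)      # convex combination: stays inside D
    return x, gap

# illustrative objective f(x) = ||x - b||^2, with gradient 2(x - b)
b = np.array([0.8, 0.6, -0.2, 0.1])
x, gap = frank_wolfe(lambda x: 2.0 * (x - b), n=4, eps=1e-3)
assert gap <= 1e-3                   # certificate of approximation quality
assert abs(x.sum() - 1.0) < 1e-9 and (x >= 0).all()
```

Since every iterate is a convex combination of the starting point and the visited vertices, feasibility holds automatically, and the returned gap is a computable certificate for the primal error, as discussed in Section 2.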
In the special case that the set D is given by an intersection of linear constraints, this sub-task is exactly equivalent to linear programming, as already observed by [FW56, Section 6]. However, for many other representations of important specific feasible domains D, this internal primitive operation is significantly easier to solve, as we will see in the later sections.

Figure 1: Visualization of a step of Algorithm 1, moving from the current point x = x^(k) towards a linear minimizer s ∈ D. Here the two-dimensional domain D is part of the ground plane, and we plot the function values vertically. Visualization by Robert Carnecky.

The alternative approximate variant of Algorithm 1 uses the procedure ApproxLinear(c, D, ε′) as the internal "step-direction generator". Analogously to the exact primitive, this procedure approximates the minimum of the linear function ⟨x, c⟩ over the convex domain D, with additive error ε′ > 0. Formally, ApproxLinear(c, D, ε′) must return a point s ∈ D such that ⟨s, c⟩ ≤ min_{y ∈ D} ⟨y, c⟩ + ε′. For several applications, this can be done significantly more efficiently than the exact variant, see e.g. the applications to semidefinite programming in Section 7.

The Curvature. Everything we need for the analysis of Algorithm 1 is that the linear approximation to our (convex) function f at any point x does not deviate from f by too much, when taken over the whole optimization domain D. The curvature constant C_f of a convex and differentiable function f : ℝⁿ → ℝ, with respect to the compact domain D, is defined as

    C_f := sup_{x, s ∈ D, α ∈ [0,1], y = x + α(s − x)}  (1/α²) (f(y) − f(x) − ⟨y − x, ∇f(x)⟩).    (7)

A motivation to consider this quantity follows if we imagine our optimization procedure at the current point x = x^(k), choosing the next iterate as y = x^(k+1) := x + α(s − x).
Bounded C_f then means that the deviation of f at y from the "best" linear prediction given by ∇f(x) is bounded, where the acceptable deviation is weighted by the inverse of the squared step-size α. For linear functions f, for example, it holds that C_f = 0.

The defining term f(y) − f(x) − ⟨y − x, ∇f(x)⟩ is also widely known as the Bregman divergence defined by f. The quantity C_f turns out to be small for many relevant applications, some of which we will discuss later, see also [Cla10]. The assumption of bounded curvature C_f is indeed very similar to a Lipschitz assumption on the gradient of f, see also the discussion in Sections 3.4 and 3.7. In the optimization literature, this property is sometimes also called C_f-strong smoothness.

It will not always be possible to compute the constant C_f exactly. However, for all algorithms in the following, and also for their analysis, it is sufficient to just use some upper bound on C_f. We will comment in some more detail on the curvature measure in Section 3.4.

Convergence. The following theorem shows that after O(1/ε) many iterations, Algorithm 1 obtains an ε-approximate solution. The analysis essentially follows the approach of [Cla10], inspired by the earlier work of [FW56, DH78, Pat93] and [Zha03]. Later, in Section 4.2, we will show that this convergence rate is indeed best possible for this type of algorithm, when considering optimization over the unit simplex. More precisely, we will show that the dependence of the sparsity on the approximation quality, as given by the algorithm here, is best possible up to a constant factor. Analogously, for the case of semidefinite optimization with bounded trace, we will prove in Section 7.4 that the obtained (low) rank of the approximations given by this algorithm is optimal, for the given approximation quality.

Theorem 3 (Primal Convergence).
For each k ≥ 1, the iterate x^(k) of the exact variant of Algorithm 1 satisfies

    f(x^(k)) − f(x*) ≤ 4C_f / (k + 2),

where x* ∈ D is an optimal solution to problem (1). For the approximate variant of Algorithm 1, it holds that

    f(x^(k)) − f(x*) ≤ 8C_f / (k + 2).

(In other words, both algorithm variants deliver a solution of primal error at most ε after O(1/ε) many iterations.)

The proof of the above theorem on the convergence rate of the primal error crucially depends on the following Lemma 4 on the improvement in each iteration. We recall from Section 2 that the duality gap for the general convex problem (1) over the domain D is given by g(x) = max_{s ∈ D} ⟨x − s, ∇f(x)⟩.

Lemma 4. For any step x^(k+1) := x^(k) + α(s − x^(k)) with arbitrary step-size α ∈ [0, 1], it holds that

    f(x^(k+1)) ≤ f(x^(k)) − α g(x^(k)) + α² C_f

if s is given by s := ExactLinear(∇f(x), D). If the approximate primitive s := ApproxLinear(∇f(x), D, αC_f) is used instead, then it holds that

    f(x^(k+1)) ≤ f(x^(k)) − α g(x^(k)) + 2α² C_f.

Proof. We write x := x^(k), y := x^(k+1) = x + α(s − x), and d_x := ∇f(x) to simplify the notation, and first prove the second part of the lemma. We use the definition of the curvature constant C_f of our convex function f to obtain

    f(y) = f(x + α(s − x)) ≤ f(x) + α⟨s − x, d_x⟩ + α² C_f.

Now we use that the choice of s := ApproxLinear(d_x, D, ε′) is a good "descent direction" for the linear approximation to f at x. Formally, we are given a point s that satisfies ⟨s, d_x⟩ ≤ min_{y ∈ D} ⟨y, d_x⟩ + ε′, or in other words

    ⟨s − x, d_x⟩ ≤ min_{y ∈ D} ⟨y, d_x⟩ − ⟨x, d_x⟩ + ε′ = −g(x, d_x) + ε′,

from the definition (5) of the duality gap g(x) = g(x, d_x).
Altogether, we obtain
$$f(y) \le f(x) + \alpha(-g(x) + \varepsilon') + \alpha^2 C_f = f(x) - \alpha g(x) + 2\alpha^2 C_f,$$
the last equality following from our choice of $\varepsilon' = \alpha C_f$. This proves the lemma for the approximate case. The first claim for the exact linear primitive $\mathrm{ExactLinear}()$ follows by the same proof with $\varepsilon' = 0$.

Having Lemma 4 at hand, the proof of our above primal convergence Theorem 3 now follows along the same idea as in [Cla10, Theorem 2.3] or [Haz08, Theorem 1]. Note that a weaker variant of Lemma 4 was already proven by [FW56].

Proof of Theorem 3. From Lemma 4 we know that for every step of the exact variant of Algorithm 1, it holds that $f(x^{(k+1)}) \le f(x^{(k)}) - \alpha g(x^{(k)}) + \alpha^2 C_f$. Writing $h(x) := f(x) - f(x^*)$ for the (unknown) primal error at any point $x$, this implies that
$$h(x^{(k+1)}) \le h(x^{(k)}) - \alpha g(x^{(k)}) + \alpha^2 C_f \le h(x^{(k)}) - \alpha h(x^{(k)}) + \alpha^2 C_f = (1 - \alpha)\, h(x^{(k)}) + \alpha^2 C_f, \qquad (8)$$
where we have used weak duality $h(x) \le g(x)$ as given in (6).

We will now use induction over $k$ in order to prove our claimed bound, i.e.,
$$h(x^{(k+1)}) \le \frac{4 C_f}{(k+1)+2}, \qquad k = 0 \ldots \infty.$$
The base case $k = 0$ follows from (8) applied to the first step of the algorithm, using $\alpha = \alpha^{(0)} = \frac{2}{0+2} = 1$.

Now considering $k \ge 1$, the bound (8) gives us
$$h(x^{(k+1)}) \le \big(1 - \alpha^{(k)}\big)\, h(x^{(k)}) + \alpha^{(k)2} C_f = \Big(1 - \tfrac{2}{k+2}\Big) h(x^{(k)}) + \Big(\tfrac{2}{k+2}\Big)^2 C_f \le \Big(1 - \tfrac{2}{k+2}\Big) \tfrac{4 C_f}{k+2} + \Big(\tfrac{2}{k+2}\Big)^2 C_f,$$
where in the last inequality we have used the induction hypothesis for $x^{(k)}$. Simply rearranging the terms gives
$$h(x^{(k+1)}) \le \tfrac{4 C_f}{k+2} - \tfrac{8 C_f}{(k+2)^2} + \tfrac{4 C_f}{(k+2)^2} = 4 C_f \Big( \tfrac{1}{k+2} - \tfrac{1}{(k+2)^2} \Big) = \tfrac{4 C_f}{k+2} \cdot \tfrac{k+2-1}{k+2} \le \tfrac{4 C_f}{k+2} \cdot \tfrac{k+2}{k+3} = \tfrac{4 C_f}{k+3},$$
which is our claimed bound for $k \ge 1$.
The analogous claim for Algorithm 1 using the approximate linear primitive $\mathrm{ApproxLinear}()$ follows from exactly the same argument, by replacing every occurrence of $C_f$ in the proof by $2 C_f$ instead (compare also to Lemma 4).

3.2 Obtaining a Guaranteed Small Duality Gap

From the above Theorem 3 on the convergence of Algorithm 1, we have obtained small primal error. However, the optimum value $f(x^*)$ is unknown in most practical applications, where we would prefer an easily computable measure of the current approximation quality, for example as a stopping criterion of our optimizer in the case that $C_f$ is unknown. The duality gap $g(x)$ that we defined in Section 2 satisfies these requirements, and always upper-bounds the primal error $f(x) - f(x^*)$.

By a nice argument of Clarkson [Cla10, Theorem 2.3], the convergence on the simplex optimization domain can be extended to obtain the stronger property of a guaranteed small duality gap $g(x^{(k)}) \le \varepsilon$, after at most $O(\frac{1}{\varepsilon})$ many iterations. This stronger convergence result was not yet known in the earlier papers of [FW56, DH78, Jon92, Pat93] and [Zha03]. Here we will generalize the primal-dual convergence to arbitrary compact convex domains. The proof of our theorem below again relies on Lemma 4.

Theorem 5 (Primal-Dual Convergence). Let $K := \big\lceil \frac{4 C_f}{\varepsilon} \big\rceil$. We run the exact variant of Algorithm 1 for $K$ iterations (recall that the step-sizes are given by $\alpha^{(k)} := \frac{2}{k+2}$, $0 \le k \le K$), and then continue for another $K + 1$ iterations, now with the fixed step-size $\alpha^{(k)} := \frac{2}{K+2}$ for $K \le k \le 2K + 1$. Then the algorithm has an iterate $x^{(\hat{k})}$, $K \le \hat{k} \le 2K + 1$, with duality gap bounded by $g(x^{(\hat{k})}) \le \varepsilon$.

The same statement holds for the approximate variant of Algorithm 1, when setting $K := \big\lceil \frac{8 C_f}{\varepsilon} \big\rceil$ instead.

Proof. The proof follows the idea of [Cla10, Section 7].
By our previous Theorem 3 we already know that the primal error satisfies $h(x^{(K)}) = f(x^{(K)}) - f(x^*) \le \frac{4 C_f}{K+2}$ after $K$ iterations. For the subsequent $K + 1$ iterations, we now suppose that $g(x^{(k)})$ always stays larger than $\frac{4 C_f}{K+2}$, and derive a contradiction from this assumption.

Putting the assumption $g(x^{(k)}) > \frac{4 C_f}{K+2}$ into the step improvement bound given by Lemma 4, we get that
$$f(x^{(k+1)}) - f(x^{(k)}) \le -\alpha^{(k)} g(x^{(k)}) + \alpha^{(k)2} C_f < -\alpha^{(k)} \tfrac{4 C_f}{K+2} + \alpha^{(k)2} C_f$$
holds for any step-size $\alpha^{(k)} \in (0, 1]$. Now using the fixed step-size $\alpha^{(k)} = \frac{2}{K+2}$ in the iterations $k \ge K$ of Algorithm 1, this reads as
$$f(x^{(k+1)}) - f(x^{(k)}) < -\tfrac{2}{K+2} \cdot \tfrac{4 C_f}{K+2} + \tfrac{4}{(K+2)^2} C_f = -\tfrac{4 C_f}{(K+2)^2}.$$
Summing up over the additional steps, we obtain
$$f(x^{(2K+2)}) - f(x^{(K)}) = \sum_{k=K}^{2K+1} \big( f(x^{(k+1)}) - f(x^{(k)}) \big) < -(K+2)\, \tfrac{4 C_f}{(K+2)^2} = -\tfrac{4 C_f}{K+2},$$
which together with our known primal approximation error $f(x^{(K)}) - f(x^*) \le \frac{4 C_f}{K+2}$ would result in $f(x^{(2K+2)}) - f(x^*) < 0$, a contradiction. Therefore there must exist $\hat{k}$, $K \le \hat{k} \le 2K + 1$, with $g(x^{(\hat{k})}) \le \frac{4 C_f}{K+2} \le \varepsilon$.

The analysis for the approximate variant of Algorithm 1 follows using the analogous second bound from Lemma 4.

Following [Cla10, Theorem 2.3], one can also prove a similar primal-dual convergence theorem for the line-search algorithm variant that uses the optimal step-size in each iteration, as we will discuss in the next Section 3.3. This is somewhat expected, as the line-search algorithm in each step is at least as good as the fixed step-size variant we consider here.

3.3 Choosing the Optimal Step-Size by Line-Search

Alternatively, instead of the fixed step-size $\alpha = \frac{2}{k+2}$ in Algorithm 1, one can also find the optimal $\alpha \in [0,1]$ by line-search.
This will not improve the theoretical convergence guarantee, but might still be considered in practical applications if the best $\alpha$ is easy to compute. Experimentally, we observed that line-search can improve the numerical stability, in particular if approximate step directions are used, which we will discuss e.g. for semidefinite matrix completion problems in Section 11.5.

Formally, given the chosen direction $s$, we then search for the $\alpha$ of best improvement in the objective function $f$, that is
$$\alpha := \arg\min_{\alpha \in [0,1]} f\big(x^{(k)} + \alpha(s - x^{(k)})\big). \qquad (9)$$
The resulting modified version of Algorithm 1 is depicted in Algorithm 2, and was precisely analyzed in [Cla10] for the case of optimizing over the simplex.

Algorithm 2 Greedy on a Convex Set, using Line-Search
Input: Convex function $f$, convex set $D$, target accuracy $\varepsilon$
Output: $\varepsilon$-approximate solution for problem (16)
Pick an arbitrary starting point $x^{(0)} \in D$
for $k = 0 \ldots \infty$ do
  Compute $s := \mathrm{ExactLinear}\big(\nabla f(x^{(k)}), D\big)$
    —or—
  Compute $s := \mathrm{ApproxLinear}\big(\nabla f(x^{(k)}), D, \frac{2 C_f}{k+2}\big)$
  Find the optimal step-size $\alpha := \arg\min_{\alpha \in [0,1]} f\big(x^{(k)} + \alpha(s - x^{(k)})\big)$
  Update $x^{(k+1)} := x^{(k)} + \alpha(s - x^{(k)})$
end for

In many cases, we can solve this line-search analytically in a straightforward manner, by differentiating the above expression with respect to $\alpha$: consider $f_\alpha := f\big(x^{(k+1)}(\alpha)\big) = f\big(x^{(k)} + \alpha(s - x^{(k)})\big)$ and compute
$$0 \overset{!}{=} \frac{\partial}{\partial \alpha} f_\alpha = \Big\langle s - x^{(k)}, \nabla f\big(x^{(k+1)}(\alpha)\big) \Big\rangle. \qquad (10)$$
If this equation can be solved for $\alpha$, then the optimal such $\alpha$ can directly be used as the step-size in Algorithm 1, and the convergence guarantee of Theorem 3 still holds. This is because the improvement in each step will be at least as large as if we were using the older (potentially sub-optimal) fixed choice of $\alpha = \frac{2}{k+2}$. Here we have assumed that $\alpha^{(k)} \in [0,1]$ always holds, which can be ensured with some caution, see also [Cla10].
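To make the two step-size rules concrete, the following is a minimal sketch of the projection-free loop with either the fixed rule $\alpha = \frac{2}{k+2}$ or a grid-based approximation of the line-search (9). The toy instance (minimizing $\|x - b\|_2^2$ over the unit simplex, with the linear minimizer picking the best vertex) is our own illustration and does not appear in the paper.

```python
import numpy as np

def frank_wolfe(grad_f, linear_min, x0, steps, f=None):
    # Sketch of Algorithm 1 / Algorithm 2: no projections are needed,
    # since each update is a convex combination of points of D.
    # If f is given, the step-size approximates the line-search (9)
    # on a fine grid; otherwise the fixed rule alpha = 2/(k+2) is used.
    x = np.asarray(x0, dtype=float)
    for k in range(steps):
        s = linear_min(grad_f(x))            # linearized subproblem
        if f is None:
            alpha = 2.0 / (k + 2)            # fixed step-size rule
        else:
            grid = np.linspace(0.0, 1.0, 1001)
            alpha = grid[int(np.argmin([f(x + a * (s - x)) for a in grid]))]
        x = x + alpha * (s - x)              # stays inside D
    return x

# Toy instance (our choice): minimize ||x - b||^2 over the simplex.
# ExactLinear over the simplex returns the vertex e_i minimizing c_i.
b = np.array([0.2, 0.5, 0.3])
grad = lambda x: 2.0 * (x - b)
lmo = lambda c: np.eye(len(c))[int(np.argmin(c))]
x_fixed = frank_wolfe(grad, lmo, np.eye(3)[0], 200)
x_ls = frank_wolfe(grad, lmo, np.eye(3)[0], 200,
                   f=lambda v: ((v - b) ** 2).sum())
```

For this objective the Hessian is $2I$ and $\operatorname{diam}(\Delta_3)^2 = 2$, so $C_f \le 2$, and both variants satisfy the primal bound $\frac{4 C_f}{k+2}$ of Theorem 3.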
Note that the line-search can also be used for the approximate variant of Algorithm 1, where we keep the accuracy for the internal primitive method $\mathrm{ApproxLinear}\big(\nabla f(x^{(k)}), D, \varepsilon'\big)$ fixed to $\varepsilon' = \alpha_{\text{fixed}} C_f = \frac{2 C_f}{k+2}$. Theorem 3 then holds as in the original case.

3.4 The Curvature Measure of a Convex Function

For the case of differentiable $f$ over the space $X = \mathbb{R}^n$, we recall the definition of the curvature constant $C_f$ with respect to the domain $D \subset \mathbb{R}^n$, as stated in (7):
$$C_f := \sup_{\substack{x, s \in D,\ \alpha \in [0,1],\\ y = x + \alpha(s-x)}} \frac{1}{\alpha^2} \Big( f(y) - f(x) - (y - x)^T \nabla f(x) \Big).$$
An overview of values of $C_f$ for several classes of functions $f$ over domains related to the unit simplex can be found in [Cla10].

Asymptotic Curvature. As Algorithm 1 converges towards some optimal solution $x^*$, it also makes sense to consider the asymptotic curvature $C_f^*$, defined as
$$C_f^* := \sup_{\substack{s \in D,\ \alpha \in [0,1],\\ y = x^* + \alpha(s-x^*)}} \frac{1}{\alpha^2} \Big( f(y) - f(x^*) - (y - x^*)^T \nabla f(x^*) \Big). \qquad (11)$$
Clearly $C_f^* \le C_f$. As described in [Cla10, Section 4.4], we expect that as the algorithm converges towards $x^*$, the improvement bound given by Lemma 4 should be determined by $C_f^* + o(1)$ instead of $C_f$, resulting in a better convergence speed constant than given in Theorem 3, at least for large $k$. The class of strongly convex functions is an example for which the convergence of the relevant constant towards $C_f^*$ is easy to see, since for these functions, convergence in the function value also implies convergence in the domain, towards a unique point $x^*$, see e.g. [BV04, Section 9.1.2].

Relating the Curvature to the Hessian Matrix. Before we can compare the assumption of bounded curvature $C_f$ to a Lipschitz assumption on the gradient of $f$, we need to relate $C_f$ to the Hessian matrix (the matrix of all second derivatives) of $f$.
Here the idea, described in [Cla10, Section 4.1], is to make use of the degree-2 Taylor expansion of our function $f$ at the fixed point $x$, as a function of $\alpha$, which is
$$f(x + \alpha(s - x)) = f(x) + \alpha (s - x)^T \nabla f(x) + \frac{\alpha^2}{2} (s - x)^T \nabla^2 f(z) (s - x),$$
where $z$ is a point on the line segment $[x, y] \subseteq D \subset \mathbb{R}^n$ between the two points $x \in \mathbb{R}^n$ and $y = x + \alpha(s - x) \in \mathbb{R}^n$. To upper-bound the curvature measure, we can now directly plug this expression for $f(y)$ into the above definition of $C_f$, obtaining
$$C_f \le \sup_{\substack{x, y \in D,\\ z \in [x,y] \subseteq D}} \frac{1}{2} (y - x)^T \nabla^2 f(z) (y - x). \qquad (12)$$
From this bound, it follows that $C_f$ is upper-bounded by the largest eigenvalue of the Hessian matrix of $f$, scaled with the domain's squared Euclidean diameter, or formally:

Lemma 6. For any twice differentiable convex function $f$ over a compact convex domain $D$, it holds that
$$C_f \le \frac{1}{2} \operatorname{diam}(D)^2 \cdot \sup_{z \in D} \lambda_{\max}\big(\nabla^2 f(z)\big).$$
Note that as $f$ is convex, the Hessian matrix $\nabla^2 f(z)$ is positive semidefinite for all $z$, see e.g. [KZ05, Theorem 7.3.6].

Proof. Applying the Cauchy–Schwarz inequality to (12), for any $x, y \in D$ (as in the definition of $C_f$) we get
$$(y - x)^T \nabla^2 f(z) (y - x) \le \|y - x\|_2 \, \big\| \nabla^2 f(z)(y - x) \big\|_2 \le \|y - x\|_2^2 \, \big\| \nabla^2 f(z) \big\|_{\mathrm{spec}} \le \operatorname{diam}(D)^2 \cdot \sup_{z \in D} \lambda_{\max}\big(\nabla^2 f(z)\big).$$
The middle inequality follows from the variational characterization of the matrix spectral norm, i.e. $\|A\|_{\mathrm{spec}} := \sup_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2}$. Finally, in the last inequality we have used that by convexity of $f$, the Hessian matrix $\nabla^2 f(z)$ is PSD, so that its spectral norm equals its largest eigenvalue.

Note that in the case of $D$ being the unit simplex (see also the following Section 4), we have $\operatorname{diam}(\Delta_n) = \sqrt{2}$, so that the scaling factor disappears, i.e. $C_f \le \sup_{z \in \Delta_n} \lambda_{\max}\big(\nabla^2 f(z)\big)$.

Bounded Curvature vs. Lipschitz-Continuous Gradient.
Our core assumption on the given optimization problems is that the curvature $C_f$ of the function is bounded over the domain. Equivalently, this means that the function does not deviate from its linear approximation by too much. Here we will explain that this assumption is in fact very close to the natural assumption that the gradient $\nabla f$ is Lipschitz-continuous, which is often made in the classical convex optimization literature, where it is sometimes called $C_f$-strong smoothness, see e.g. [Nem05, KSST09] (or [d'A08] if the gradient information is only approximate).

Lemma 7. Let $f$ be a convex and twice differentiable function, and assume that the gradient $\nabla f$ is Lipschitz-continuous over the domain $D$ with Lipschitz constant $L > 0$. Then
$$C_f \le \frac{1}{2} \operatorname{diam}(D)^2 L.$$

Proof. Having $\|\nabla f(y) - \nabla f(x)\|_2 \le L \|y - x\|_2$ for all $x, y \in D$, the Cauchy–Schwarz inequality implies that $(y - x)^T (\nabla f(y) - \nabla f(x)) \le L \|y - x\|_2^2$, so that
$$f(y) \le f(x) + (y - x)^T \nabla f(x) + \frac{L}{2} \|y - x\|_2^2. \qquad (13)$$
If $f$ is twice differentiable, it can directly be seen that the above condition implies that $L \cdot I \succeq \nabla^2 f(z)$ holds for the Hessian of $f$, that is $\lambda_{\max}\big(\nabla^2 f(z)\big) \le L$. Together with our result from Lemma 6, the claim follows.

The above bound (13), which is implied by a Lipschitz-continuous gradient, means that the function is not "curved" by more than $L$ in some sense, which is an interesting property. In fact this is exactly the opposite inequality compared to the property of strong convexity, which is an assumption on the function $f$ that we do not impose here. Strong convexity on a compact domain means that the function is always curved at least by some constant (as our $L$). We just note that for strongly convex functions, "accelerated" algorithms with an even faster convergence of $\frac{1}{k^2}$ (meaning $O(\frac{1}{\sqrt{\varepsilon}})$ steps) do exist [Nes04, Nes07a].
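The Hessian bound of Lemma 6 can be sanity-checked numerically. The sketch below (our own illustration, with an arbitrarily chosen random PSD Hessian) estimates the curvature from definition (7) by random sampling over the unit simplex and compares it against the bound $\frac{1}{2} \operatorname{diam}(\Delta_n)^2 \lambda_{\max}(Q) = \lambda_{\max}(Q)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Q = A.T @ A                                  # PSD Hessian, so f is convex
f = lambda x: 0.5 * x @ Q @ x                # quadratic: Hessian is Q everywhere
grad = lambda x: Q @ x

# Lemma 6 bound over the unit simplex: diam(Delta_n)^2 = 2, hence
# C_f <= (1/2) * 2 * lambda_max(Q) = lambda_max(Q)
bound = float(np.linalg.eigvalsh(Q)[-1])

def rand_simplex(r, m):
    # random point of Delta_m (normalized exponentials)
    p = r.exponential(size=m)
    return p / p.sum()

emp = 0.0                                    # empirical curvature from (7)
for _ in range(2000):
    x, s = rand_simplex(rng, n), rand_simplex(rng, n)
    a = rng.uniform(1e-3, 1.0)
    y = x + a * (s - x)
    emp = max(emp, (f(y) - f(x) - (y - x) @ grad(x)) / a ** 2)
```

For this quadratic the sampled Bregman term equals $\frac{1}{2}(s-x)^T Q (s-x)$ exactly, so `emp` can never exceed `bound`, in line with the lemma.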
3.5 Optimizing over Convex Hulls

In the case that the optimization domain $D$ is given as the convex hull of a (finite or infinite) subset $V \subset X$, i.e.
$$D = \operatorname{conv}(V),$$
it is particularly easy to solve the linear optimization subproblems as needed in our Algorithm 1. Recall that $\operatorname{conv}(V)$ is defined as the set of all finite convex combinations $\sum_i \alpha_i v_i$ for a finite subset $\{v_1, \ldots, v_k\} \subseteq V$, while $V$ itself can also be an infinite set.

Lemma 8 (Linear Optimization over Convex Hulls). Let $D = \operatorname{conv}(V)$ for any subset $V \subset X$, with $D$ compact. Then any linear function $y \mapsto \langle y, c \rangle$ will attain its minimum and maximum over $D$ at some "vertex" $v \in V$.

Proof. W.l.o.g. we only show the case of the maximum. Let $s \in D$ be a point attaining the linear optimum $\langle s, c \rangle = \max_{y \in D} \langle y, c \rangle$. Then by definition of $D$, we have $s = \sum_{i=1}^{k} \alpha_i v_i$, meaning that $s$ is a weighted average of some finite set of "vertices" $v_1, \ldots, v_k \in V$, with $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$. By linearity of the inner product,
$$\langle s, c \rangle = \Big\langle \sum_{i=1}^{k} \alpha_i v_i, c \Big\rangle = \sum_{i=1}^{k} \alpha_i \langle v_i, c \rangle,$$
and therefore we must have $\langle v_i, c \rangle \ge \langle s, c \rangle$ for at least one of the indices $i$, meaning that this $v_i \in V$ also attains the linear maximum.

In the following we will discuss several applications where this simple fact is useful to solve the linearized subproblem $\mathrm{ExactLinear}()$ more efficiently, as the set $V$ is often much easier to describe than the full compact domain $D$.

The setting of convex optimization over a convex hull in a vector space was already studied by [Zha03]. There, each iteration of the optimizer consists of greedily selecting the point (or "vertex") of $V$ which promises the best improvement. [Zha03] then gave a similar primal convergence guarantee as in our Theorem 3 (but no primal-dual convergence result on general convex hulls was known so far, to the best of our knowledge).
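Lemma 8 reduces $\mathrm{ExactLinear}()$ over $\operatorname{conv}(V)$ to a search over the vertices. A minimal sketch for a finite $V$ (the vertex set and cost vector below are hypothetical, chosen only for illustration):

```python
import numpy as np

def exact_linear(c, V):
    # ExactLinear(c, conv(V)): by Lemma 8, a minimizer of <y, c> over
    # conv(V) may always be taken among the "vertices" V themselves.
    V = np.asarray(V, dtype=float)
    return V[int(np.argmin(V @ c))]

# hypothetical vertex set in the plane; (1, 1) is redundant, as it lies
# in the convex hull of the other three points
V = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
c = np.array([-1.0, -0.5])
s = exact_linear(c, V)

# sanity check: no convex combination of the vertices does better
rng = np.random.default_rng(0)
W = rng.exponential(size=(1000, 4))
W /= W.sum(axis=1, keepdims=True)
combos = W @ np.asarray(V)                  # random points of conv(V)
```

For finite $V$ this costs one pass over the vertices; when $V$ is structured (e.g. the unit basis vectors, as in Section 4), the search collapses to a single coordinate-wise minimum.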
The above Lemma 8 in a sense explains the relation to our linearized internal problem. The main difference is that the algorithm of [Zha03] always evaluates the original non-linear function $f$ at all vertices of $V$, while our slightly more general framework only relies on the linear subproblem, and allows for arbitrary means to approximately solve that subproblem.

3.6 Randomized Variants, and Stochastic Optimization

For a variety of classes of our convex optimization problems, randomization can help to solve the linearized subproblem more efficiently. This idea is strongly related to online and stochastic optimization, see e.g. [Nes11], and also to the popular stochastic gradient descent (SGD) techniques [Bot10]. We can also apply such cheaper randomized steps in our described framework, instead of deterministically solving the internal linear problem in each iteration. Assuming that the user of our method is able to decompose the linearized problem in some arbitrary way using randomization, and that the randomization is such that the linearized problem will be solved "accurately enough" with some probability $p > 0$ in each iteration, our convergence analysis still holds in this probabilistic setting, as follows.

Formally, we assume that we are given access to a randomized procedure $\mathrm{RandomLinear}(c, D, \varepsilon')$, which returns a point $s \in D$ such that $\langle s, c \rangle \le \min_{y \in D} \langle y, c \rangle + \varepsilon'$ with probability at least $p > 0$. In other words, with positive probability, $\mathrm{RandomLinear}()$ will behave like $\mathrm{ApproxLinear}()$. In each iteration of the line-search variant of our algorithm (see Algorithm 2), we now use that randomized internal procedure instead. The expected improvement given by a step towards $s = \mathrm{RandomLinear}()$ is at least $p$ times the amount given in Lemma 4.
(Here we have used that in the events of "failure" of $\mathrm{RandomLinear}()$, the objective function value will at least not become worse, due to the use of line-search.) In other words, if we perform $\frac{1}{p}$ times more iterations than required for the deterministic Algorithm 1, then the convergence guarantee of Theorem 3 also holds for the randomized variant described here.

Stochastic Gradient Descent (SGD). A classical example is when the linearized problem is given by simply finding the maximum over, say, $n$ coordinates, as we will see in the following Sections 4 and 5 for optimizing over the simplex, or over vectors of bounded $\ell_1$-norm. In this case, by sampling uniformly at random, with probability $\frac{1}{n}$ we will pick the correct coordinate, for which the step improvement is as in the deterministic Algorithm 1. Therefore we obtain the same convergence guarantee as for the deterministic algorithm, but the necessary number of steps is multiplied by $n$.

For unconstrained convex optimization, the convergence of SGD and other related methods was analyzed e.g. in [Nes11] and also in the earlier paper [Nes07b, Section 6]. Also there, a comparable slow-down was observed when using the cheaper randomized steps.

3.7 Relation to Classical Convex Optimization

Relation to Gradient Descent and Steepest Descent. The internal linear optimizer in our Algorithm 1 can also be interpreted in terms of descent directions. Recall that all vectors $y$ that have negative scalar product with the current gradient, i.e. $\langle y, \nabla f(x) \rangle < 0$, are called descent directions, see e.g. [BV04, Chapter 9.4]. Also observe that $\langle y, \nabla f(x) \rangle$ is the directional derivative of $f$ in direction $y$ if $y$ is of unit length. Our method therefore chooses the best descent direction over the entire domain $D$, where the quality is measured as the best possible absolute improvement as suggested by the linearization at the current point $x$.
In any iteration, this direction will crucially depend on the global shape of the domain $D$, even if the actual step-size $\alpha^{(k)}$ might be very small. This crucially contrasts with classical gradient descent techniques, which only use local information to determine the step directions, facing the risk of walking out of the domain $D$ and therefore requiring projection steps after each iteration.

Relation to Inaccurate and Missing Gradient Information. The ability of our Algorithm 1 to deal with only approximate internal linear optimizers, as in $\mathrm{ApproxLinear}()$, is also related to existing methods that assume that gradient information is available only with noise, or in a stochastic or sampling setting. For the case of optimizing smooth convex functions, [d'A08] has used a similar measure of error, namely that the linearization given by the "noisy" version $\tilde{d}_x$ of the gradient $\nabla f(x)$ does not differ by more than, say, $\varepsilon'$ when measured over the entire domain $D$, or formally
$$\big| \langle y - z, \tilde{d}_x \rangle - \langle y - z, \nabla f(x) \rangle \big| \le \varepsilon', \qquad \forall x, y, z \in D.$$
This assumption is similar to, but stronger than, the additive approximation quality that we require in our setting above (we only need the linearized optimal values to be within $\varepsilon'$). Also, the algorithm as well as the analysis in [d'A08] are more complicated than the method proposed here, due to the need for projections and proximal operators.

We have discussed the case where gradient information is available only through a stochastic oracle (e.g. such that the gradient is obtained in expectation) in the above Subsection 3.6. For an overview of related randomized methods in unconstrained convex optimization, we refer the reader to the recent work of [Nes11], which also applies when the gradient itself is not available and has to be estimated by oracle calls to the function alone.
If gradient information can be constructed in any way such that the linearized problem $\mathrm{ApproxLinear}()$ can be solved to the desired additive error, then our above analysis of Algorithm 1 will still hold.

Relation to Mirror Descent, Proximal Methods and Conjugate Functions. Our proposed method is related to mirror descent as well as proximal methods in convex optimization, but our approach is usually simpler. The mirror descent technique originates from e.g. [BT03, BTN05]. For a brief overview of proximal methods, with applications to some of the classes of sparse convex optimization problems studied here, we refer to [BJMO11, Section 3].

To investigate the connection, we write $f_{\mathrm{lin}|x}(y) := f(x) + \langle y - x, d_x \rangle$ for the linearization given by the (sub)gradient $d_x = \nabla f(x)$ at a fixed point $x \in D$. A variant of mirror descent, see e.g. [BTN05, Haz11], is to find the next iterate $y$ as the point maximizing the Bregman divergence
$$f(y) - f(x) - \langle y - x, d_x \rangle = f(y) - f_{\mathrm{lin}|x}(y) \qquad (14)$$
relative to the currently fixed old iterate $x$. This is the same task as maximizing the gap between the function $f(y)$ and its linear approximation at $x$; equivalently, we evaluate the conjugate function
$$f^*(z) := \sup_{y \in D} \, \langle y, z \rangle - f(y)$$
at $z = d_x$. The definition of the conjugate dual is also known as Fenchel duality, see e.g. [BL06]. In [Nem05], the conjugate function is also called the Legendre transformation.

However, in our approach, the inner task $\mathrm{ExactLinear}(d_x, D)$, as well as $\mathrm{ApproxLinear}(d_x, D, \varepsilon')$, is a simpler linear problem. Namely, we directly minimize the linearization at the current point $x$, i.e. we maximize
$$-f(x) - \langle y - x, d_x \rangle = -f_{\mathrm{lin}|x}(y) \qquad (15)$$
and then move towards an approximate maximizer $y$. In terms of Fenchel duality, this simpler linear problem is the evaluation of the conjugate dual of the characteristic function of our domain $D$, i.e.
$$\mathbb{1}_D^*(z) := \sup_{y \in X} \, \langle y, z \rangle - \mathbb{1}_D(y),$$
where this function is evaluated at the current subgradient $z = d_x$. Here the characteristic function $\mathbb{1}_D : X \to \mathbb{R}$ of a set $D \subseteq X$ is defined as $\mathbb{1}_D(y) = 0$ for $y \in D$ and $\mathbb{1}_D(y) = \infty$ otherwise.

Compared to our algorithm, mirror descent schemes require a "projection" step in each iteration, sometimes also called the proximal or mirror operator. This refers to minimizing the linearization plus a strongly convex prox-function that penalizes the distance to the starting point. If the squared Euclidean norm is used, the mirror operator corresponds to the standard projection back onto the domain $D$. Our method uses no such prox-function, and neither is the zero function a strongly convex one, as would be required for mirror descent to work. We expect that the computational cost per iteration of our method will in most application cases be lower than for mirror descent schemes.

For convex optimization over the simplex, which we will study in more detail in the following Section 4, [BT03] have proposed a mirror descent algorithm, obtaining a convergence of $f(x^{(k)}) - f(x^*) \le \frac{\sqrt{2 \ln n}\, L}{\sqrt{k}}$. This however is worse than the convergence of our methods as given by Theorem 3: our convergence is independent of the dimension $n$, and goes with $\frac{1}{k}$ instead of $\frac{1}{\sqrt{k}}$. Also the earlier paper by [BTMN01] only obtained a convergence of $O\big(\frac{1}{\sqrt{k}}\big)$ for the case of Lipschitz-continuous convex functions over the simplex.

The NERML optimizer of [BTN05] is a variant of mirror descent that memorizes several past linearizations of the objective function, to improve the available knowledge about the current function landscape. It is an open research question whether this idea could also help in our setting here, or for stochastic gradient descent schemes [Bot10].
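The conjugate-of-the-indicator view can be made concrete on a simple domain. As an illustration of our own choosing (the $\ell_1$-ball, a domain treated later in Section 5), $\mathbb{1}_D^*(z) = \sup_{y \in D} \langle y, z \rangle$ equals $\|z\|_\infty$ and is attained at a signed unit basis vector, which is exactly the kind of vertex that the linearized step returns:

```python
import numpy as np

def indicator_conjugate_l1(z):
    # Evaluate 1*_D(z) = sup_{y in D} <y, z> for D the unit l1-ball.
    # The supremum is ||z||_inf, attained at a signed unit basis vector,
    # i.e. the same vertex an ExactLinear step would move towards.
    i = int(np.argmax(np.abs(z)))
    y = np.zeros_like(z, dtype=float)
    y[i] = 1.0 if z[i] >= 0 else -1.0
    return float(y @ z), y

z = np.array([0.3, -2.0, 1.1])
val, y = indicator_conjugate_l1(z)
```

No prox-function appears anywhere in this evaluation, which is the simplification discussed above.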
4 Sparse Approximation over the Simplex

As a first application of the above general scheme, we will now consider optimization problems defined over the unit simplex, or in other words over the non-negative vectors of $\ell_1$-norm equal to one. This will serve as a warm-up case before we consider $\ell_1$-norm regularized problems in the next Section 5.

Our approach here will allow us to understand the best achievable sparsity of approximate solutions, as a function of the approximation quality, as already shown by [Cla10]. In particular, we will show that our main Algorithm 1 on page 8 and its analysis do lead to Clarkson's approach [Cla10] for optimizing over the simplex. In this case, it was already known that sparsity $O\big(\frac{1}{\varepsilon}\big)$ can always be achieved by applying Algorithm 1 to the simplex domain, see [Cla10]. We will also show that this is indeed optimal, by providing an asymptotically matching lower bound in Section 4.2. Moreover, our analysis holds even if the linear subproblems are only solved approximately, and allows arbitrary starting points, in contrast to [Cla10].

Having this efficient algorithm giving sparsity $O\big(\frac{1}{\varepsilon}\big)$ is particularly attractive in view of the computational complexity of vector cardinality minimization, which is known to be NP-hard, by a reduction to Exact-Cover, see [Nat95]. Vector cardinality minimization here refers to finding the sparsest vector that is an $\varepsilon$-approximation to some given convex minimization problem; formally, finding the sparsest $x$ that satisfies $\|Ax - b\|_2 \le \varepsilon$ for given $A$, $b$ and $\varepsilon$.

Set-Up. We suppose that a basis has been chosen in the space $X$, so that we can assume $X = \mathbb{R}^n$ with the standard inner product $\langle x, y \rangle = x^T y$. Here we consider one special class of the general optimization problems (1), namely optimization over non-negative vectors that sum up to one, that is
$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \ f(x) \quad \text{s.t.} \quad \|x\|_1 = 1, \ x \ge 0.$$
(16)

In the following we write $\Delta_n := \{ x \in \mathbb{R}^n \mid x \ge 0, \ \|x\|_1 = 1 \}$ for the unit simplex in $\mathbb{R}^n$. As the objective function $f$ is now defined over $\mathbb{R}^n$, all subgradients or gradients of $f$ will also be represented by vectors in $\mathbb{R}^n$ in the following.

Note that the alternative case of optimizing under an inequality constraint $\|x\|_1 \le 1$ instead of $\|x\|_1 = 1$ can easily be reduced to the above form (16) by introducing a new "slack" variable. Formally, one uses vectors $\hat{x} = (x_1, \ldots, x_n, x_{n+1}) \in \mathbb{R}^{n+1}$ instead, and optimizes the function $\hat{f}(\hat{x}) := f(x_1, \ldots, x_n)$ over the simplex domain $\|\hat{x}\|_1 = 1$, $\hat{x} \ge 0$.

Coresets. The coreset concept was originally introduced in computational geometry by [BHPI02] and [APV02]. For point-set problems, the coreset idea refers to identifying a very small subset (the coreset) of the points, such that the solution on just the coreset is guaranteed to be a good approximation to the original problem. Here, for general convex optimization, the role of the coreset points is taken by the non-zero coordinates of our sought vector $x$ instead. The coreset size then corresponds to the sparsity of $x$.

Formally, if there exists an $\varepsilon$-approximate solution $x \in D \subseteq \mathbb{R}^n$ to the convex optimization problem (1) using only $k$ many non-zero coordinates, then we say that the corresponding coordinates of $x$ form an $\varepsilon$-coreset of size $k$ for problem (1).

In other words, the following upper and lower bounds of $O\big(\frac{1}{\varepsilon}\big)$ on the sparsity of approximations for problem (16) are indeed matching upper and lower bounds on the coreset size for convex optimization over the simplex, analogous to what we have found in the geometric problem setting in [GJ09].
4.1 Upper Bound: Sparse Greedy on the Simplex

Here we will show how the general algorithm and its analysis from the previous Section 3 in particular lead to Clarkson's approach [Cla10] for minimizing any convex function over the unit simplex. The algorithm follows directly from Algorithm 1, and will have a running time of $O\big(\frac{1}{\varepsilon}\big)$ many gradient evaluations.

We will crucially make use of the fact that every linear function attains its minimum at a vertex of the simplex $\Delta_n$. Formally, for any vector $c \in \mathbb{R}^n$, it holds that $\min_{s \in \Delta_n} s^T c = \min_i c_i$. This property is easy to verify in the special case here, but it is also a direct consequence of the small Lemma 8 which we have proven for general convex hulls, if we accept that the unit simplex is the convex hull of the unit basis vectors. We thus obtain that the internal linearized primitive can be solved exactly by choosing
$$\mathrm{ExactLinear}(c, \Delta_n) := e_i \quad \text{with} \ i = \arg\min_i c_i.$$

Algorithm 3 Sparse Greedy on the Simplex
Input: Convex function $f$, target accuracy $\varepsilon$
Output: $\varepsilon$-approximate solution for problem (16)
Set $x^{(0)} := e_1$
for $k = 0 \ldots \infty$ do
  Compute $i := \arg\min_i \big(\nabla f(x^{(k)})\big)_i$
  Let $\alpha := \frac{2}{k+2}$
  Update $x^{(k+1)} := x^{(k)} + \alpha (e_i - x^{(k)})$
end for

Observe that in each iteration, this algorithm introduces at most one new non-zero coordinate, so that the sparsity of $x^{(k)}$ is always upper-bounded by the number of steps $k$, plus one, given that we start at a vertex. Since Algorithm 3 only moves in coordinate directions, it can be seen as a variant of coordinate descent. The convergence result follows directly from the general analysis we gave in the previous Section 3.

Theorem 9 ([Cla10, Theorem 2.3], Convergence of Sparse Greedy on the Simplex). For each $k \ge 1$, the iterate $x^{(k)}$ of Algorithm 3 satisfies
$$f(x^{(k)}) - f(x^*) \le \frac{4 C_f}{k+2},$$
where $x^* \in \Delta_n$ is an optimal solution to problem (16).
Furthermore, for any $\varepsilon > 0$, after at most $2\big\lceil \frac{4 C_f}{\varepsilon} \big\rceil + 1 = O\big(\frac{1}{\varepsilon}\big)$ many steps (see footnote 3), it has an iterate $x^{(k)}$ of sparsity $O\big(\frac{1}{\varepsilon}\big)$ satisfying $g(x^{(k)}) \le \varepsilon$.

Proof. This is a corollary of Theorem 3 and Theorem 5.

Duality Gap. We recall from Section 2 that the duality gap (5) at any point $x \in \Delta_n$ is easily computable from any subgradient, and in our case becomes
$$g(x, d_x) = x^T d_x - \min_i (d_x)_i, \quad \text{and} \quad g(x) = x^T \nabla f(x) - \min_i \big(\nabla f(x)\big)_i. \qquad (17)$$
Here we have again used the observation that linear functions attain their minimum at a vertex of the domain, i.e. $\min_{s \in \Delta_n} s^T c = \min_i c_i$.

Applications. Many practically relevant optimization problems fit into our setting (16), allowing the application of Algorithm 3. This includes linear classifiers such as support vector machines (SVMs), see also [GJ09], as well as kernel learning (finding the best convex combination among a set of base kernels) [BLJ04]. Some other applications that directly fit into our framework are $\ell_2$-support vector regression (SVR), AdaBoost [Zha03], mean-variance analysis in portfolio selection [Mar52], the smallest enclosing ball problem [BC07], and estimating mixtures of probability densities [Cla10]. For more applications we refer to [Cla10].

Footnote 3: Note that in order for our Theorem 5 on the bounded duality gap to apply, the step-size in the second half of the iterations needs to be fixed to $\alpha^{(k)} := \frac{2}{K+2}$, see Section 3.2. This remark also applies to the later applications of our general Algorithm 1 that we discuss in the following. We already mentioned above that if line-search is used instead, then no such technicality is necessary, see also [Cla10].

Line-Search for the Best Step-Size. In most applications it will be a straightforward task to find the optimal step-size $\alpha \in [0,1]$ in each step instead, as described in Section 3.3.
For the special case of polytope distance and SVM problems, the resulting method then exactly corresponds to Gilbert's geometric algorithm [Gil66], as shown in [GJ09]. Here the wording of "line-search" makes geometric sense in that we need to find the point s on a given line such that s is closest to the origin.

Away Steps. By performing more work with the currently non-zero coordinates, one can reduce the sparsity even further. More precisely, the number of non-zeros can be improved to close to 2C_f/ε, instead of the 2⌈4C_f/ε⌉ given by the above Theorem 9. The idea of away-steps, introduced by [TY07], is to keep the total number of non-zero coordinates (i.e. the coreset size) fixed over all iterations, by removing the smallest non-zero coordinate from x after each adding step. For more background we refer to [Cla10, Algorithm 9.1].

4.2 Ω(1/ε) Lower Bound on the Sparsity

We will now show that the sparsity O(1/ε) obtained by the greedy algorithm we analyzed in the previous section is indeed best possible, by providing a lower bound of Ω(1/ε). In the language of coresets, this means we will provide a matching lower bound on the size of coresets for convex optimization over the simplex. Together with the upper bound, this therefore completely characterizes the trade-off between sparsity and approximation quality for the family of optimization problems of the form (16). The same matching sparsity upper and lower bounds will also hold for optimizing over the ℓ1-ball instead, see Section 5.

For the following lower bound construction we consider the differentiable function f(x) := ||x||₂² = x^T x. This function has gradient ∇f(x) = 2x.
Its curvature constant is C_f = 2, which follows directly from the definition (7) and the fact that here

  f(y) − f(x) − (y − x)^T ∇f(x) = y^T y − x^T x − (y − x)^T 2x = ||x − y||₂²,

so that C_f = sup_{x,s∈∆n} ||x − s||₂² = diam(∆n)² = 2.

The following lemmata show that the sparse greedy algorithm of [Cla10] from Section 4.1 is indeed optimal for the approximation quality (primal as well as dual error, respectively), giving the best possible sparsity up to a small multiplicative constant.

Lemma 10. For f(x) := ||x||₂² and 1 ≤ k ≤ n, it holds that

  min_{x∈∆n, card(x)≤k} f(x) = 1/k.

Proof. We prove the inequality min_x f(x) ≥ 1/k by induction on k.

Case k = 1: For any x ∈ ∆n having just a single non-zero entry, f(x) = ||x||₂² = ||x||₁ = 1.

Case k > 1: For every x ∈ ∆n of sparsity card(x) ≤ k, we can pick a coordinate i with x_i ≠ 0, and write x = (1−α)v + αe_i as the sum of two orthogonal vectors v and a unit basis vector e_i, where v ∈ ∆n is of sparsity ≤ k−1, v_i = 0, and α = x_i. So for every x ∈ ∆n of sparsity ≤ k, we therefore get that

  f(x) = ||x||₂² = x^T x
       = ((1−α)v + αe_i)^T ((1−α)v + αe_i)
       = (1−α)² v^T v + α²
       ≥ (1−α)² · 1/(k−1) + α²
       ≥ min_{0≤β≤1} (1−β)² · 1/(k−1) + β²
       = 1/k.

In the first inequality we have applied the induction hypothesis for v ∈ ∆n of sparsity ≤ k−1.

Equality: The function value f(x) = 1/k is obtained by setting k of the coordinates of x to 1/k each.

In other words, for any vector x of sparsity card(x) = k, the primal error f(x) − f(x*) is always lower bounded by 1/k − 1/n. For the duality gap g(x), the lower bound is even slightly higher:

Lemma 11. For f(x) := ||x||₂² and any k ∈ N, k < n, it holds that

  g(x) ≥ 2/k  for all x ∈ ∆n with card(x) ≤ k.

Proof. g(x) = x^T ∇f(x) − min_i (∇f(x))_i = 2(x^T x − min_i x_i).
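Lemma 10 is easy to probe numerically. The following sketch (random sampling, not a proof) draws k-sparse points of the simplex and checks that their squared norm never drops below 1/k, with equality for the uniform vector on k coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)

def lemma10_check(n, k, trials=2000):
    """Sanity check of Lemma 10: every x in the simplex with at most k
    non-zeros satisfies ||x||^2 >= 1/k. Returns the value at the uniform
    k-sparse vector, which attains the bound. (A sketch, not a proof.)"""
    lower = 1.0 / k
    for _ in range(trials):
        support = rng.choice(n, size=k, replace=False)
        w = rng.random(k)
        x = np.zeros(n)
        x[support] = w / w.sum()        # random simplex point with card(x) <= k
        assert x @ x >= lower - 1e-12
    uniform = np.zeros(n)
    uniform[:k] = 1.0 / k
    return float(uniform @ uniform)     # equals the lower bound 1/k

assert abs(lemma10_check(10, 4) - 0.25) < 1e-12
```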
We now use min_i x_i = 0, which holds because card(x) < n, and that by Lemma 10 we have x^T x = f(x) ≥ 1/k.

Note: We could also consider the function f(x) := γ||x||₂² instead, for some γ > 0. This f has curvature constant C_f = 2γ, and for this scaling our above lower bound on the duality gap also scales linearly, giving 2γ/k = C_f/k.

5 Sparse Approximation with Bounded ℓ1-Norm

In this second application case, we will apply the general greedy approach from Section 3 in order to understand the best achievable sparsity for convex optimization under a bounded ℓ1-norm, as a function of the approximation quality. The situation here is indeed extremely similar to the above Section 4 on optimizing over the simplex, and the resulting algorithm will again have a running time of O(1/ε) many gradient evaluations.

It is known that the ℓ1-norm is the best convex approximation to the sparsity (cardinality) of a vector, that is card(.). More precisely, the function ||.||₁ is the convex envelope of the sparsity, meaning that it is the "largest" convex function that is upper bounded by the sparsity on the convex domain of vectors {x | ||x||∞ ≤ 1}. This can be seen by observing that card(x) ≥ ||x||₁/||x||∞, see e.g. [RFP10]. We will discuss the analogous generalization to matrices in the second part of this article, see Section 11, namely using the matrix nuclear norm as the "best" convex approximation of the matrix rank.

Set-Up. Here we consider one special class of the general optimization problem (1), namely problems over vectors in R^n with bounded ℓ1-norm, that is

  minimize_{x∈R^n} f(x)  s.t.  ||x||₁ ≤ 1.   (18)

We write ♦n := {x ∈ R^n | ||x||₁ ≤ 1} for the ℓ1-ball in R^n. Note that one can simply rescale the function argument to allow for more general constraints ||x||₁ ≤ t for t > 0.
Again we have X = R^n with the standard inner product ⟨x, y⟩ = x^T y, so that the subgradients or gradients of f are also represented as vectors in R^n.

The Linearized Problem. As already in the simplex case, the subproblem of optimizing a linear function over the ℓ1-ball is particularly easy to solve, allowing us to provide a fast implementation of the internal primitive procedure ExactLinear(c, ♦n). Namely, it is again easy to see that every linear function attains its minimum/maximum at a vertex of the ball ♦n, as we have already seen for general convex hulls in our earlier Lemma 8, and here ♦n = conv({±e_i | i ∈ [n]}). This crucial observation can alternatively be interpreted as the known fact that the dual norm to the ℓ1-norm is the ℓ∞-norm, see also our earlier Observation 2.

Observation 12. For any vector c ∈ R^n, it holds that

  e_i · sign(c_i) ∈ arg max_{y∈♦n} y^T c,

where i ∈ [n] is an index of a maximal coordinate of c measured in absolute value, or formally i ∈ arg max_j |c_j|.

Using this observation for c = −∇f(x) in our general Algorithm 1, we directly obtain the following simple method for ℓ1-regularized convex optimization, as depicted in Algorithm 4.

Algorithm 4 Sparse Greedy on the ℓ1-Ball
  Input: Convex function f, target accuracy ε
  Output: ε-approximate solution for problem (18)
  Set x^(0) := 0
  for k = 0 ... ∞ do
    Compute i := arg max_i |(∇f(x^(k)))_i|, and let s := e_i · sign(−(∇f(x^(k)))_i)
    Let α := 2/(k+2)
    Update x^(k+1) := x^(k) + α(s − x^(k))
  end for

Observe that in each iteration, this algorithm introduces at most one new non-zero coordinate, so that the sparsity of x^(k) is always upper bounded by the number of steps k. This means that the method is again of coordinate-descent type, as in the simplex case of the previous Section 4.1.
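Algorithm 4 can be sketched as follows (a numpy illustration, not the paper's code; names and the random data are ours), applied to a small Lasso-type instance whose ground truth lies inside the unit ℓ1-ball:

```python
import numpy as np

def sparse_greedy_l1(grad, n, num_steps):
    """Sketch of Algorithm 4: Frank-Wolfe on the l1-ball {x : ||x||_1 <= 1}.

    Each step moves toward the signed basis vector s = sign(-g_i) * e_i,
    where i is the coordinate of the gradient g largest in absolute value
    (the exact linear minimizer over the l1-ball, by Observation 12).
    """
    x = np.zeros(n)
    for k in range(num_steps):
        g = grad(x)
        i = int(np.argmax(np.abs(g)))
        alpha = 2.0 / (k + 2)
        x *= (1 - alpha)
        x[i] += alpha * (-np.sign(g[i]))   # s := e_i * sign(-g_i)
    return x

# Example: least squares ||Ax - b||^2 with an l1-constraint (a Lasso-type problem).
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = A @ np.array([0.6, -0.4] + [0.0] * 8)   # ground truth inside the unit l1-ball
x = sparse_greedy_l1(lambda x: 2 * A.T @ (A @ x - b), 10, num_steps=500)
```

After k steps the iterate has at most k non-zero coordinates, matching the sparsity bound of the analysis.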
Its convergence analysis again follows directly from the general analysis of Section 3.

Theorem 13 (Convergence of Sparse Greedy on the ℓ1-Ball). For each k ≥ 1, the iterate x^(k) of Algorithm 4 satisfies

  f(x^(k)) − f(x*) ≤ 4C_f / (k+2),

where x* ∈ ♦n is an optimal solution to problem (18). Furthermore, for any ε > 0, after at most 2⌈4C_f/ε⌉ + 1 = O(1/ε) many steps, it has an iterate x^(k) of sparsity O(1/ε), satisfying g(x^(k)) ≤ ε.

Proof. This is a corollary of Theorem 3 and Theorem 5.

The Duality Gap, and Duality of the Norms. We recall the definition of the duality gap (5) given by the linearization at any point x ∈ ♦n, see Section 2. Thanks to our Observation 2, the computation of the duality gap for the ℓ1-ball becomes extremely simple, and is given by the norm that is dual to the ℓ1-norm, namely the ℓ∞-norm of the used subgradient, i.e.,

  g(x, d_x) = ||d_x||∞ + x^T d_x,  and  g(x) = ||∇f(x)||∞ + x^T ∇f(x).

Alternatively, the same expression can also be derived directly (without explicitly using duality of norms) by applying Observation 12.

A Lower Bound on the Sparsity. The lower bound of Ω(1/ε) on the sparsity, as proved in Section 4.2 for the simplex case, in fact directly translates to the ℓ1-ball as well. Instead of choosing the objective function f as the distance to the origin (which is part of the ℓ1-ball), we consider the optimization problem

  min_{||x||₁≤1} f(x) := ||x − r||₂²

with respect to the fixed point r := (2/n, ..., 2/n) ∈ R^n. This problem is of the form (18), and corresponds to optimizing the Euclidean distance to the point r given by mirroring the origin at the positive facet of the ℓ1-ball. Here by the "positive facet" we mean the hyperplane defined by the intersection of the boundary of the ℓ1-ball with the positive orthant, which is exactly the unit simplex.
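The gap formula above is a one-liner. This sketch (the helper name is ours) checks it on the distance objective f(x) = ||x − r||₂², here in dimension two for readability:

```python
import numpy as np

def l1_duality_gap(x, grad_x):
    """Duality gap on the l1-ball: g(x) = ||grad||_inf + x^T grad.

    Follows from duality of norms (the dual of ||.||_1 is ||.||_inf);
    for any feasible x it upper-bounds f(x) - f(x*). (A sketch.)"""
    return float(np.abs(grad_x).max() + x @ grad_x)

# Example with f(x) = ||x - r||^2, r = (0.5, 0.5): gradient is 2(x - r).
r = np.array([0.5, 0.5])
x = np.array([1.0, 0.0])               # a vertex of the l1-ball
gap = l1_duality_gap(x, 2 * (x - r))   # certificate for the primal error
f_err = float((x - r) @ (x - r))       # x* = r is feasible (||r||_1 = 1), f* = 0
assert gap >= f_err - 1e-12
```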
Therefore, the proof for the simplex case from Section 4.2 holds analogously for our setting here. We have thus obtained that the sparsity O(1/ε) achieved by the greedy Algorithm 4 is indeed best possible for ℓ1-regularized optimization problems of the form (18).

Using Barycentric Coordinates Instead. Clarkson [Cla10, Theorem 4.2] already observed that Algorithm 3 over the simplex ∆m can be used to optimize a convex function f(y) over arbitrary convex hulls, by just using barycentric coordinates y = Ax, x ∈ ∆m, for A ∈ R^{n×m} being the matrix containing all m vertices of the convex domain as its columns. Here however we saw that for the ℓ1-ball, the steps of the algorithm are even slightly simpler, and moreover the duality gap can be computed instantly from the ℓ∞-norm of the gradient.

Applications. Our Algorithm 4 applies to arbitrary convex vector optimization problems with an ℓ1-norm regularization term, giving a guaranteed sparsity of O(1/ε) for all of these applications. A classical example of this problem class is the important ℓ1-regularized least squares regression approach, i.e.

  min_{x∈R^n} ||Ax − b||₂² + µ||x||₁

for a fixed matrix A ∈ R^{m×n}, a vector b ∈ R^m, and a fixed regularization parameter µ > 0. The same problem is also known as basis pursuit de-noising in the compressed sensing literature, which we will discuss in more detail in Section 5.1. The above formulation is in fact the Lagrangian formulation of the corresponding constrained problem for ||x||₁ ≤ t, for some fixed parameter t corresponding to µ. This equivalent formulation is also known as the Lasso problem [Tib96], which is

  min_{x∈R^n} ||Ax − b||₂²  s.t.  ||x||₁ ≤ t.

The above formulation is exactly a problem of our form (18), namely

  min_{x̂∈♦n} ||tAx̂ − b||₂²,

if we rescale the argument x =: t·x̂ so that ||x̂||₁ ≤ 1.
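The rescaling step from the Lasso to the unit-ball form (18) is mechanical; a small sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def lasso_to_unit_l1(A, t):
    """Rescaling sketch: the Lasso problem  min ||Ax - b||^2  s.t. ||x||_1 <= t
    becomes a problem of form (18) over the unit l1-ball by substituting
    x = t * x_hat, i.e. minimizing ||(t*A) x_hat - b||^2 with ||x_hat||_1 <= 1."""
    return t * A   # the rescaled design matrix; recover x via x = t * x_hat

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
t = 7.0
At = lasso_to_unit_l1(A, t)
x_hat = np.array([0.2, -0.3, 0.1])   # any point of the unit l1-ball
x = t * x_hat                        # corresponding point with ||x||_1 <= t
# Both formulations yield the same residual, hence the same objective value:
assert np.allclose(At @ x_hat - b, A @ x - b)
```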
Another important application of our result is logistic regression with ℓ1-norm regularization, see e.g. [KKB07], which is also a convex optimization problem [Ren05]. The reduction to an ℓ1-problem of our form (18) works exactly the same way as described here.

Related Work. As we mentioned above, the optimization problem (18), when f is the squared error of a linear function, is very well studied as the Lasso approach, see e.g. [Tib96] and the references therein. For general objective functions f of bounded curvature, the above interesting trade-off between sparsity and approximation quality was already investigated by [SSSZ10], and also by our earlier paper [GJ09] for the analogous case of optimizing over the simplex. [SSSZ10, Theorem 2.4] shows a sparse convergence analogous to our above Theorem 13 for the "forward greedy selection" algorithm on problem (18), but only for the case that f is differentiable.

5.1 Relation to Matching Pursuit and Basis Pursuit in Compressed Sensing

Both our sparse greedy Algorithm 3 for optimizing over the simplex and Algorithm 4 for general ℓ1-problems are very similar to the technique of matching pursuit, which is one of the most popular techniques for sparse recovery in the vector case [Tro04]. Suppose we want to recover a sparse signal vector x ∈ R^n from a noisy measurement vector Ax = y ∈ R^m. For a given dictionary matrix A ∈ R^{m×n}, matching pursuit iteratively chooses the dictionary element A_i ∈ R^m that has the highest inner product with the current residual, and therefore reduces the representation error f(x) = ||Ax − y||₂² by the largest amount. This choice of coordinate, i = arg max_j A_j^T (y − Ax), exactly corresponds⁴ to the choice of i := arg min_j (∇f(x^(k)))_j in Algorithm 3.
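The correspondence between the matching-pursuit selection rule and the gradient rule of Algorithm 3 can be checked directly on random data (a sketch; ties between coordinates are ignored, which is harmless for generic inputs):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 8))
y = rng.standard_normal(20)
x = rng.random(8)
x /= x.sum()                               # a point of the simplex

# Matching-pursuit choice: dictionary column most correlated with the residual.
residual = y - A @ x
i_mp = int(np.argmax(A.T @ residual))

# Algorithm 3 choice: smallest gradient coordinate of f(x) = ||Ax - y||^2,
# using grad f(x) = 2 A^T (Ax - y) (see footnote 4).
grad = 2 * A.T @ (A @ x - y)
i_fw = int(np.argmin(grad))

assert i_mp == i_fw   # the two selection rules coincide
```

The equality holds because the gradient is exactly −2 times the correlation of the columns with the residual, so maximizing one is minimizing the other.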
Another variant of matching pursuit, called orthogonal matching pursuit (OMP) [Tro04, TG07], includes an extra orthogonalization step, and is more closely related to the coreset algorithms that re-optimize over all currently non-zero coordinates before adding a new one, see e.g. [Cla10, Algorithm 8.2], or the analogous "fully corrective" variant of [SSSZ10]. If y = Ax, with x sparse and the columns of A sufficiently incoherent, then OMP recovers the sparsest representation for the given y [Tro04].

The paper [Zha11] recently proposed another algorithm that generalizes OMP, comes with a guarantee of correct sparse recovery, and also corresponds to "completely optimizing within each coreset". The method uses the same choice of the new coordinate i := arg max_j |(∇f(x^(k)))_j| as in our Algorithm 4. However, the analysis of [Zha11] requires not only bounded curvature as in our case, but also strong convexity of the objective function (which then also appears as a multiplicative factor in the number of iterations needed). Our Algorithm 4, as well as the earlier method by [Zha03], is simpler to implement and has a lower complexity per iteration, as we do not need to optimize over several currently non-zero coordinates, but only change one coordinate by a fixed amount in each iteration.

Our Algorithm 4 for general ℓ1-regularized problems also applies to solving the so-called basis pursuit problem [CDS98, FNW07] and [BV04, Section 6.5.4], which is

  min_{x∈R^n} ||x||₁  s.t.  Ax = y.

Note that this is in fact just the constrained variant of the corresponding "robust" ℓ1-regularized least squares regression problem

  min_{x∈R^n} ||Ax − y||₂² + µ||x||₁,

which is the equivalent trade-off variant of our problem of the form (18). [FNW07] propose a traditional gradient descent technique for solving the above least squares problem, but do not give a convergence analysis.
Solution path algorithms with approximation guarantees for related problems (obtaining solutions for all values of the trade-off parameter µ) have been studied in [GJL10, GJL12a, GJL12b] and in the author's PhD thesis [Jag11], building on the same duality gap concept that we introduced in Section 2.

⁴ The objective function f(x) := ||Ax − y||₂² can be written as f(x) = (Ax − y)^T (Ax − y) = x^T A^T Ax − 2y^T Ax + y^T y, so its gradient is ∇f(x) = 2A^T Ax − 2A^T y = 2A^T (Ax − y) ∈ R^n.

6 Optimization with Bounded ℓ∞-Norm

Applying our above general optimization framework to the special case of the domain being the ℓ∞-norm unit ball, we again obtain a very simple greedy algorithm. The running time will again correspond to O(1/ε) many gradient evaluations. Formally, we consider problems of the form

  minimize_{x∈R^n} f(x)  s.t.  ||x||∞ ≤ 1.   (19)

We denote the feasible set, i.e. the ℓ∞-norm unit ball, by □n := {x ∈ R^n | ||x||∞ ≤ 1}. For this set, it will again be very simple to implement the internal primitive operation of optimizing a linear function over the same domain.

The following crucial observation allows us to implement ExactLinear(c, □n) in a very simple way. It can alternatively be interpreted as the known fact that the dual norm to the ℓ∞-norm is the ℓ1-norm, which also explains why the greedy algorithm we obtain here is very similar to the ℓ1-version from the previous Section 5.

Observation 14. For any vector c ∈ R^n, it holds that

  s_c ∈ arg max_{y∈□n} y^T c,

where s_c ∈ R^n is the sign-vector of c, defined by the sign of each individual coordinate, i.e. (s_c)_i = sign(c_i) ∈ {−1, 1}.

Using this observation for c = −d_x in our general Algorithm 1, we directly obtain the following simple method for optimization over a box-domain □n, as depicted in Algorithm 5.
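Observation 14 can be verified by brute force over all cube vertices in a small dimension (a sketch; for linear objectives the maximum over the cube is attained at a vertex, so enumerating vertices suffices):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
c = rng.standard_normal(5)

# Observation 14 (sketch): the sign vector of c maximizes y^T c over the cube.
s_c = np.sign(c)
best_vertex = max(np.array(v) @ c for v in product([-1.0, 1.0], repeat=5))
assert np.isclose(s_c @ c, best_vertex)
assert np.isclose(s_c @ c, np.abs(c).sum())   # the maximum equals ||c||_1
```

The second assertion is exactly the duality of norms: max over the ℓ∞-ball of y^T c is ||c||₁.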
Algorithm 5 Sparse Greedy on the Cube
  Input: Convex function f, target accuracy ε
  Output: ε-approximate solution for problem (19)
  Set x^(0) := 0
  for k = 0 ... ∞ do
    Compute the sign-vector s of −∇f(x^(k)), such that s_i = sign(−(∇f(x^(k)))_i), i = 1..n
    Let α := 2/(k+2)
    Update x^(k+1) := x^(k) + α(s − x^(k))
  end for

The convergence analysis again follows directly from the general analysis of Section 3.

Theorem 15. For each k ≥ 1, the iterate x^(k) of Algorithm 5 satisfies

  f(x^(k)) − f(x*) ≤ 4C_f / (k+2),

where x* ∈ □n is an optimal solution to problem (19). Furthermore, for any ε > 0, after at most 2⌈4C_f/ε⌉ + 1 = O(1/ε) many steps, it has an iterate x^(k) with g(x^(k)) ≤ ε.

Proof. This is a corollary of Theorem 3 and Theorem 5.

The Duality Gap, and Duality of the Norms. We recall the definition of the duality gap (5) given by the linearization at any point x ∈ □n, see Section 2. Thanks to our Observation 2, the computation of the duality gap for the ℓ∞-ball becomes extremely simple, and is given by the norm that is dual to the ℓ∞-norm, namely the ℓ1-norm of the used subgradient, i.e.,

  g(x, d_x) = ||d_x||₁ + x^T d_x,  and  g(x) = ||∇f(x)||₁ + x^T ∇f(x).

Alternatively, the same expression can also be derived directly (without explicitly using duality of norms) by applying Observation 14.

Sparsity and Compact Representations. The analogue of "sparsity" as in Sections 4 and 5 in the context of our Algorithm 5 means that we can describe the obtained approximate solution x as a convex combination of few (i.e. O(1/ε) many) cube vertices. This does not imply that x has few non-zero coordinates, but rather that we have a compact representation given by only O(1/ε) many binary n-vectors indicating the corresponding cube vertices, of which x is a convex combination.

Applications.
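Algorithm 5 in a few lines of numpy (an illustrative sketch, not the paper's code), on f(x) = ||x − r||₂² with an interior optimum x* = r. Here C_f ≤ diam(□₃)² = 12, so Theorem 15 bounds the primal error at step 400 by 48/402 ≈ 0.12:

```python
import numpy as np

def sparse_greedy_cube(grad, n, num_steps):
    """Sketch of Algorithm 5: Frank-Wolfe over the cube {x : ||x||_inf <= 1}.

    The linear subproblem is solved by the sign vector of the negated
    gradient (Observation 14); the gap is ||grad||_1 + x^T grad."""
    x = np.zeros(n)
    for k in range(num_steps):
        g = grad(x)
        s = -np.sign(g)            # vertex of the cube minimizing s^T g
        alpha = 2.0 / (k + 2)
        x += alpha * (s - x)
    return x

# Example: f(x) = ||x - r||^2 for an interior point r; the optimum is x* = r.
r = np.array([0.3, -0.7, 0.1])
x = sparse_greedy_cube(lambda x: 2 * (x - r), 3, num_steps=400)
gap = np.abs(2 * (x - r)).sum() + x @ (2 * (x - r))   # duality gap at x
```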
Any convex problem with coordinate-wise upper and lower constraints can be transformed to the form (19) by re-scaling the optimization argument. A specific interesting application was given by [MR11], who demonstrated that integer linear programs can be relaxed to convex problems of the above form, such that the solutions coincide with high probability under some mild additional assumptions.

Using Barycentric Coordinates Instead. Clarkson [Cla10, Theorem 4.2] already observed that Algorithm 3 over the simplex ∆m can be used to optimize a convex function f(y) over arbitrary convex hulls, by just using barycentric coordinates y = Ax, x ∈ ∆m, for A ∈ R^{n×m} being the matrix containing all m vertices of the convex domain as its columns. Here however we saw that for the unit box, the steps of the algorithm are much simpler, and moreover the duality gap can be computed instantly, without having to explicitly deal with the exponentially many vertices (here m = 2^n) of the cube.

7 Semidefinite Optimization with Bounded Trace

We will now apply the greedy approach from the previous Section 3 to semidefinite optimization problems, for the case of bounded trace. The main paradigm in this section will be to understand the best achievable low-rank property of approximate solutions as a function of the approximation quality.

In particular, we will show that our general Algorithm 1 and its analysis lead to Hazan's method for convex semidefinite optimization with bounded trace, as given by [Haz08]. Hazan's algorithm can also be used as a simple solver for general SDPs. [Haz08] has already shown that guaranteed ε-approximations of rank O(1/ε) can always be found. Here we will also show that this is indeed optimal, by providing an asymptotically matching lower bound in Section 7.4.
Furthermore, we fix some problems in the original analysis of [Haz08], and require only a weaker approximation quality for the internal linearized primitive problems. We also propose two improvement variants for the method in Section 7.3. Later in Section 11, we will discuss the application of these algorithms to nuclear norm and max-norm optimization problems, which have many important applications in practice, such as dimensionality reduction, low-rank recovery, as well as matrix completion and factorizations.

We now consider convex optimization problems of the form (1) over the space X = S^{n×n} of symmetric matrices, equipped with the standard Frobenius inner product ⟨X, Y⟩ = X • Y. It is left to the choice of the reader to identify the symmetric matrices either with R^{n²} and consider functions with f(X) = f(X^T), or to use only the variables in the upper right (or lower left) triangle, corresponding to R^{n(n+1)/2}. In any case, the subgradients or gradients of our objective function f need to be available in the same representation (same choice of basis for the vector space X).

Formally, we consider the following special case of the general optimization problem (1), i.e.,

  minimize_{X∈S^{n×n}} f(X)  s.t.  Tr(X) = 1,  X ⪰ 0.   (20)

We will write S := {X ∈ S^{n×n} | X ⪰ 0, Tr(X) = 1} for the feasible set, that is, the PSD matrices of unit trace. The set S is sometimes called the Spectahedron, and can be seen as a natural generalization of the unit simplex to symmetric matrices. By the Cholesky factorization, it can be seen that the Spectahedron is the convex hull of all rank-1 matrices of unit trace (i.e. the matrices of the form vv^T for a unit vector v ∈ R^n, ||v||₂ = 1).
7.1 Low-Rank Semidefinite Optimization with Bounded Trace: The O(1/ε) Algorithm by Hazan

Applying our general greedy Algorithm 1 from Section 3 to the above semidefinite optimization problem, we directly obtain the following Algorithm 6, which is Hazan's method [Haz08, GM11]. Note that this is a first application of Algorithm 1 in which the internal linearized problem ApproxLinear() is not trivial to solve, in contrast to the applications to vector optimization problems we studied above.

The algorithm here obtains low-rank solutions (sums of rank-1 matrices) to any convex optimization problem of the form (20). More precisely, it guarantees an ε-small duality gap after at most O(1/ε) iterations, where each iteration only involves the calculation of a single approximate eigenvector of a matrix M ∈ S^{n×n}. We will see that in practice, for example, Lanczos' algorithm or the power method can be used as the internal optimizer ApproxLinear().

Algorithm 6 Hazan's Algorithm / Sparse Greedy for Bounded Trace
  Input: Convex function f with curvature constant C_f, target accuracy ε
  Output: ε-approximate solution for problem (20)
  Set X^(0) := vv^T for an arbitrary unit length vector v ∈ R^n
  for k = 0 ... ∞ do
    Let α := 2/(k+2)
    Compute v := v^(k) = ApproxEV(∇f(X^(k)), αC_f)
    Update X^(k+1) := X^(k) + α(vv^T − X^(k))
  end for

Here ApproxEV(A, ε') is a subroutine that delivers an approximate smallest eigenvector (the eigenvector corresponding to the smallest eigenvalue) of a matrix A with the desired accuracy ε' > 0. More precisely, it must return a unit length vector v such that v^T Av ≤ λmin(A) + ε'. Note that as our convex function f takes a symmetric matrix X as an argument, its gradients ∇f(X) are given as symmetric matrices as well.
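Algorithm 6 can be sketched with numpy, using an exact eigensolver in place of ApproxEV (so this is the exact variant, purely for illustration; names and the example objective are ours). The example objective is the Frobenius projection f(X) = ||X − C||_F², with ∇f(X) = 2(X − C):

```python
import numpy as np

def hazan_bounded_trace(grad, n, num_steps):
    """Sketch of Algorithm 6 over the spectahedron {X PSD, Tr(X) = 1}.

    Uses an exact smallest eigenvector (numpy's eigh) in place of the
    ApproxEV subroutine; each step mixes in the rank-1 matrix v v^T, so
    the iterate after k steps has rank at most k + 1."""
    v0 = np.zeros(n)
    v0[0] = 1.0
    X = np.outer(v0, v0)                 # X^(0) := v v^T
    for k in range(num_steps):
        alpha = 2.0 / (k + 2)
        G = grad(X)
        w, V = np.linalg.eigh(G)         # eigenvalues in ascending order
        v = V[:, 0]                      # eigenvector of the smallest eigenvalue
        X = X + alpha * (np.outer(v, v) - X)
    return X

# Example: f(X) = ||X - C||_F^2 for a fixed symmetric C; grad f(X) = 2(X - C).
rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
C = (B + B.T) / 2
X = hazan_bounded_trace(lambda X: 2 * (X - C), 4, num_steps=300)
```

The iterate stays feasible by construction: it is always a convex combination of unit-trace PSD rank-1 matrices.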
If we want to understand the proposed Algorithm 6 as an instance of the general convex optimization Algorithm 1, we just need to explain why an approximate smallest eigenvector of the gradient should indeed be a solution to the internal linearized problem ApproxLinear(), as required in Algorithm 1. Formally, we have to show that v := ApproxEV(A, ε') does approximate the linearized problem, that is,

  vv^T • A ≤ min_{Y∈S} Y • A + ε'

for the choice of v := ApproxEV(A, ε'), and any matrix A ∈ S^{n×n}. This fact is formalized in Lemma 16 below, and will be the crucial property enabling the fast implementation of Algorithm 6. Alternatively, if exact eigenvector computations are available, we can also implement the exact variant of Algorithm 1 using ExactLinear(), thereby halving the total number of iterations.

Observe that an approximate eigenvector is significantly easier to compute than a projection onto the feasible set S. If we were to find the Frobenius-norm-closest PSD matrix to a given symmetric matrix A, we would have to compute a complete eigenvector decomposition of A and keep only the components corresponding to positive eigenvalues, which is computationally expensive. By contrast, a single approximate smallest eigenvector computation as in ApproxEV(A, ε') can be done in near-linear time in the number of non-zero entries of A. We will discuss the implementation of ApproxEV(A, ε') in more detail further below.

Sparsity becomes Low Rank. As the rank-1 matrices are indeed the "vertices" of the domain S, as shown in Lemma 16 below, our Algorithm 6 can be seen as a matrix generalization of the sparse greedy approximation algorithm of [Cla10] for vectors in the unit simplex, see Section 4, which has seen many successful applications. Here sparsity just gets replaced by low rank.
By the analysis of the general algorithm in Theorem 3, we already know that we obtain ε-approximate solutions for any convex optimization problem (20) over the spectahedron S. Because each iterate X^(k) is a convex combination of at most k + 1 many rank-1 matrices vv^T (including the rank-1 starting matrix X^(0)), it follows that X^(k) has rank at most k + 1. Therefore, the resulting ε-approximations are of low rank, i.e. rank O(1/ε).

For large-scale applications where 1/ε ≪ n, the representation of X^(k) as a sum of rank-1 matrices is much more efficient than storing an entire matrix X^(k) ∈ S^{n×n}. Later in Section 11.5 (see also [JS10]) we will demonstrate that Algorithm 6 can readily be applied to practical problems for n ≥ 10⁶ on an ordinary computer, well exceeding the possibilities of interior point methods.

[Haz08] already observed that the same Algorithm 6 with a well-crafted function f can also be used to approximately solve arbitrary SDPs with bounded trace, which we will briefly explain in Section 7.2.

Linearization, the Duality Gap, and Duality of the Norms. Here we will prove that the general duality gap (5) can be calculated very efficiently for the domain being the spectahedron S. From the following Lemma 16, we obtain that

  g(X) = X • ∇f(X) + λmax(−∇f(X)) = X • ∇f(X) − λmin(∇f(X)).   (21)

As predicted by our Observation 2 on formulating the duality gap, we have again obtained the dual of the norm that determines the domain D. Over the space of symmetric matrices, the dual norm of the matrix trace-norm (also known as the nuclear norm) is given by the spectral norm, i.e. the largest eigenvalue in absolute value. For this, we refer the reader to the later Section 11.2 on the properties of the nuclear norm and its dual characterization.
The following Lemma 16 shows that any linear function attains its minimum and maximum at a "vertex" of the Spectahedron S, as we have already proved for the case of general convex hulls in Lemma 8.

Lemma 16. The spectahedron is the convex hull of the rank-1 matrices,

  S = conv({vv^T | v ∈ R^n, ||v||₂ = 1}).

Furthermore, for any symmetric matrix A ∈ S^{n×n}, it holds that

  max_{X∈S} A • X = λmax(A).

Proof. Clearly, vv^T ∈ S for any unit length vector v ∈ R^n, as Tr(vv^T) = ||v||₂² = 1. To prove the other inclusion, we consider an arbitrary matrix X ∈ S, and let X = U^T U be its Cholesky factorization. We let α_i be the squared norms of the rows of U, and let u_i be the row vectors of U, scaled to unit length. From the observation

  1 = Tr(X) = Tr(U^T U) = Tr(UU^T) = Σ_i α_i,

it follows that any X ∈ S can be written as a convex combination of at most n many rank-1 matrices, X = Σ_{i=1}^n α_i u_i u_i^T with unit vectors u_i ∈ R^n, proving the first part of the claim.

Furthermore, this implies that we can write

  max_{X∈S} A • X = max_{u_i, α_i} A • Σ_{i=1}^n α_i u_i u_i^T = max_{u_i, α_i} Σ_{i=1}^n α_i (A • u_i u_i^T),

where the maximization max_{u_i, α_i} is taken over unit vectors u_i ∈ R^n, ||u_i|| = 1, for 1 ≤ i ≤ n, and real coefficients α_i ≥ 0 with Σ_{i=1}^n α_i = 1. Therefore

  max_{X∈S} A • X = max_{u_i, α_i} Σ_{i=1}^n α_i (A • u_i u_i^T) = max_{v∈R^n, ||v||=1} A • vv^T = max_{v∈R^n, ||v||=1} v^T Av = λmax(A),

where the last equality is the variational characterization of the largest eigenvalue.

Curvature. We know that the constant in the actual running time for a given convex function f : S^{n×n} → R is given by the curvature constant C_f as defined in (7), which for the domain S becomes

  C_f := sup_{X,V∈S, α∈[0,1], Y=X+α(V−X)} (1/α²) ( f(Y) − f(X) − (Y − X) • ∇f(X) ).   (22)

Convergence.
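The second claim of Lemma 16 is easy to check numerically (a sketch on random data): random unit-trace PSD matrices never beat λmax(A), while the rank-1 matrix of the top eigenvector attains it:

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2

# Lemma 16 (sketch): max_{X in S} A•X = lambda_max(A), attained at v v^T
# for the top eigenvector v. Compare against random feasible points.
w, V = np.linalg.eigh(A)
lam_max, v = float(w[-1]), V[:, -1]
assert np.isclose(np.trace(A @ np.outer(v, v)), lam_max)

for _ in range(100):
    M = rng.standard_normal((5, 5))
    X = M @ M.T
    X /= np.trace(X)                  # random PSD matrix of unit trace
    assert np.trace(A @ X) <= lam_max + 1e-9
```

Note that A • X = Tr(A X) for symmetric A, which is what the code evaluates.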
We can now see that the convergence analysis for Algorithm 6 follows directly as a corollary of our simple analysis of the general framework in Section 3. The following theorem proves that O(1/ε) many iterations are sufficient to obtain primal error ≤ ε. This result was already known from [Haz08, Theorem 1], or [GM11, Chapter 5], where some corrections to the original paper were made.

Theorem 17. For each k ≥ 1, the iterate X^(k) of Algorithm 6 satisfies

  f(X^(k)) − f(X*) ≤ 8C_f / (k+2),

where X* ∈ S is an optimal solution to problem (20). Furthermore, for any ε > 0, after at most 2⌈8C_f/ε⌉ + 1 = O(1/ε) many steps, it has an iterate X^(k) of rank O(1/ε), satisfying g(X^(k)) ≤ ε.

Proof. This is a corollary of Theorem 3 and Theorem 5.

Approximating the Largest Eigenvector. Approximating the smallest eigenvector of a symmetric matrix ∇f(X) (which is the largest eigenvector of −∇f(X)) is a well-studied problem in the literature. We will see in the following that the internal procedure ApproxEV(M, ε') can be performed in near-linear time, when measured in the number of non-zero entries of the gradient matrix ∇f(X). This follows from the analysis of [KW92] for the power method or Lanczos' algorithm, both with a random start vector. A similar statement has been used in [AHK05, Lemma 2].

Theorem 18. Let M ∈ S^{n×n} be a positive semidefinite matrix. Then with high probability, either (i) O(log(n)/γ) iterations of the power method, or (ii) O(log(n)/√γ) iterations of Lanczos' algorithm, will produce a unit vector x such that

  x^T Mx / λ₁(M) ≥ 1 − γ.

Proof. The statement for the power method follows from [KW92, Theorem 3.1(a)], and for Lanczos' algorithm from [KW92, Theorem 3.2(a)].

The only remaining obstacle to using this result for our internal procedure ApproxEV(M, ε') is that our gradient matrix M = −∇f(X) is usually not PSD.
However, this can easily be fixed by adding a large enough constant $t$ to the diagonal, i.e. $\hat{M} := M + tI$, or in other words shifting the spectrum of $M$ so that the eigenvalues satisfy $\lambda_i(\hat{M}) = \lambda_i(M) + t \ge 0\ \forall i$. The choice $t = -\lambda_{\min}(M)$ is good enough for this to hold. Now by setting $\gamma := \frac{\varepsilon'}{L} \le \frac{\varepsilon'}{\lambda_{\max}(\hat{M})}$ for some upper bound $L \ge \lambda_{\max}(\hat{M}) = \lambda_{\max}(M) - \lambda_{\min}(M)$, this implies that our internal procedure ApproxEV$(M, \varepsilon')$ can be implemented by performing $O\!\left(\frac{\log(n)\sqrt{L}}{\sqrt{\varepsilon'}}\right)$ many Lanczos steps (that is, matrix-vector multiplications). Note that a simple choice for $L$ is twice the spectral norm of $M$, since $2\|M\|_{\mathrm{spec}} = 2\max_i |\lambda_i(M)| \ge \lambda_{\max}(M) - \lambda_{\min}(M)$. We state the implication for our algorithm in the following corollary.

Theorem 19. For $M \in \mathbb{S}^{n \times n}$ and $\varepsilon' > 0$, the procedure ApproxEV$(M, \varepsilon')$ requires a total of $O\!\left(N_f \frac{\log(n)\sqrt{L}}{\sqrt{\varepsilon'}}\right)$ many arithmetic operations, with high probability, by using Lanczos' algorithm.

Here $N_f$ is the number of non-zero entries in $M$, which in the setting of Algorithm 6 is the gradient matrix $-\nabla f(X)$. We have also assumed that the spectral norm of $M$ is bounded by $L$. Since we already know the number of necessary "outer" iterations of Algorithm 6, by Theorem 17, we conclude with the following analysis of the total running time. Here we again use that the required internal accuracy is given by $\varepsilon' = \alpha C_f \le \varepsilon C_f$.

Corollary 20. When using Lanczos' algorithm for the approximate eigenvector procedure ApproxEV$(\cdot, \cdot)$, Algorithm 6 provides an $\varepsilon$-approximate solution in $O\!\left(\frac{1}{\varepsilon}\right)$ iterations, requiring a total of $\tilde{O}\!\left(\frac{N_f}{\varepsilon^{1.5}}\right)$ arithmetic operations (with high probability). Here the notation $\tilde{O}(\cdot)$ suppresses the logarithmic factor in $n$.
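To make the pieces concrete, here is a minimal numerical sketch (ours, not reference code from the text) of the two ingredients: a power-method eigenvector routine using the spectral-shift trick above, and a Frank-Wolfe loop over the spectahedron in the style of Algorithm 6. For simplicity the loop calls an exact eigensolver as a stand-in for ApproxEV; the shifted power method is checked separately, and the test function $f(X) = \|X - C\|_{Fro}^2$ is our own choice:

```python
import numpy as np

def approx_ev(M, iters, rng):
    """Power method for the largest eigenvector of symmetric M, after the
    spectral shift M_hat = M + t*I with t >= -lambda_min(M), which makes
    M_hat PSD while leaving the eigenvectors unchanged."""
    t = np.linalg.norm(M, 2)                 # ||M||_spec upper-bounds -lambda_min(M)
    x = rng.standard_normal(M.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = M @ x + t * x                    # multiply by M_hat = M + t*I
        x /= np.linalg.norm(x)
    return x

def sparse_greedy_sdp(grad, n, steps):
    """Frank-Wolfe over the spectahedron: each step moves towards the
    rank-1 vertex v v^T, v being the top eigenvector of -grad(X), with
    step size alpha = 2/(k+2).  An exact eigensolver stands in for
    ApproxEV here."""
    X = np.zeros((n, n)); X[0, 0] = 1.0      # start at a vertex
    for k in range(steps):
        _, V = np.linalg.eigh(-grad(X))      # ascending eigenvalues
        v = V[:, -1]                         # top eigenvector of -grad(X)
        alpha = 2.0 / (k + 2)
        X = (1 - alpha) * X + alpha * np.outer(v, v)
    return X

rng = np.random.default_rng(0)

# the shifted power method, on a matrix with a known well-separated spectrum
Q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
M = Q @ np.diag(np.linspace(-1.0, 1.0, 30)) @ Q.T
x = approx_ev(M, iters=500, rng=rng)
assert x @ M @ x >= 1.0 - 1e-6               # Rayleigh quotient ~ lambda_max = 1

# the outer loop, on f(X) = ||X - C||_Fro^2 with minimizer C inside S
n = 8
C = np.eye(n) / n
X = sparse_greedy_sdp(lambda X: 2.0 * (X - C), n, steps=2000)
assert np.sum((X - C) ** 2) < 1e-2           # primal error of order C_f / k
```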
This corollary improves the original analysis of [Haz08] by a factor of $\frac{1}{\sqrt{\varepsilon}}$, since [Haz08, Algorithm 1] as well as the proof of [Haz08, Theorem 1] used an internal accuracy bound of $\varepsilon' = O\!\left(\frac{1}{k^2}\right)$ instead of the sufficient choice of $\varepsilon' = O\!\left(\frac{1}{k}\right)$ as in our general analysis here.

Representation of the Estimate $X$ in the Algorithm. The above result on the total running time assumes the following: after having obtained an approximate eigenvector $v$, the rank-1 update $X^{(k+1)} := (1 - \alpha) X^{(k)} + \alpha vv^T$ can be performed efficiently, or more precisely in time $N_f$. In the worst case, when a fully dense matrix $X$ is needed, this update cost is $N_f = n^2$. However, there are many interesting applications where the function $f$ depends only on a small fraction of the entries of $X$, so that $N_f \ll n^2$. A prominent example here is matrix completion for recommender systems. In this case, only those $N_f$ many entries of $X$ will be stored and affected by the rank-1 update; see also our Section 11.5.

An alternative representation of $X$ consists of the low-rank factorization given by the vectors $v$ of each of the $O\!\left(\frac{1}{\varepsilon}\right)$ many update steps, using a smaller memory of size $O\!\left(\frac{n}{\varepsilon}\right)$. However, computing the gradient $\nabla f(X)$ from this representation of $X$ might then require more time.

7.2 Solving Arbitrary SDPs

In [Haz08] it was established that Algorithm 6 can also be used to approximately solve arbitrary semidefinite programs (SDPs) in feasibility form, i.e.,
$$\text{find } X \quad \text{s.t.} \quad A_i \bullet X \le b_i, \ i = 1..m, \quad X \succeq 0. \qquad (23)$$
Also, every classical SDP with a linear objective function,
$$\max_X \; C \bullet X \quad \text{s.t.} \quad A_i \bullet X \le b_i, \ i = 1..m, \quad X \succeq 0, \qquad (24)$$
can be turned into a feasibility SDP (23) by "guessing" the optimal value $C \bullet X$ using binary search [AHK05, Haz08]. Here we will therefore assume that we are given a feasibility SDP of the form (23) by its constraints $A_i \bullet X \le b_i$, which we want to solve for $X$.
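The standard way to do this, developed in the next paragraphs, replaces the hard constraints by a smooth soft-max penalty. The key property is the log-sum-exp sandwich: the penalty overestimates the largest constraint violation by at most $\frac{\log m}{\sigma}$. A quick numerical check (all names and the test instance are ours):

```python
import numpy as np

def soft_max_penalty(X, As, bs, sigma):
    """f(X) = (1/sigma) * log(sum_i exp(sigma * (A_i . X - b_i))).

    Standard log-sum-exp bounds give
      max_i (A_i.X - b_i)  <=  f(X)  <=  max_i (A_i.X - b_i) + log(m)/sigma,
    so f(X) <= eps certifies that X violates no constraint by more
    than eps ('eps-feasibility')."""
    vals = np.array([np.trace(A @ X) - b for A, b in zip(As, bs)])
    return np.logaddexp.reduce(sigma * vals) / sigma

rng = np.random.default_rng(2)
n, m = 5, 3
As = [(B + B.T) / 2 for B in rng.standard_normal((m, n, n))]
X = np.eye(n) / n
bs = [np.trace(A @ X) + 0.1 for A in As]   # X satisfies all constraints with slack 0.1

eps = 0.01
fval = soft_max_penalty(X, As, bs, sigma=np.log(m) / eps)
viol = max(np.trace(A @ X) - b for A, b in zip(As, bs))
assert viol <= fval <= viol + eps + 1e-9   # sandwich within log(m)/sigma = eps
```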
We can instead represent the constraints of (23) in a smooth optimization objective, using the soft-max function
$$f(X) := \frac{1}{\sigma} \log\left( \sum_{i=1}^m e^{\sigma (A_i \bullet X - b_i)} \right). \qquad (25)$$
Suppose that the original SDP was feasible. Then after $O\!\left(\frac{1}{\varepsilon}\right)$ many iterations of Algorithm 6, for a suitable choice of $\sigma$, we have obtained $X$ such that $f(X) \le \varepsilon$, which implies that each constraint is violated by at most $\varepsilon$. This means that $A_i \bullet X \le b_i + \varepsilon$, or in other words we say that $X$ is $\varepsilon$-feasible [Haz08, GM11]. It turns out the best choice for the parameter $\sigma$ is $\frac{\log m}{\varepsilon}$, and the curvature constant $C_f(\sigma)$ for this function is bounded by $\sigma \cdot \max_i \lambda_{\max}(A_i)^2$. The total number of necessary approximate eigenvector computations is therefore in $O\!\left(\frac{\log m}{\varepsilon^2}\right)$. In fact, Algorithm 6 applied to the function (25) is very similar to the multiplicative weights method [AHK05]. Note that the soft-max function (25) is convex in $X$; see also [Ren05]. For a slightly more detailed exposition of this approach of using Algorithm 6 to approximately solve SDPs, we refer the reader to the book of [GM11].

Note that this technique of introducing the soft-max function is closely related to smoothing techniques in the optimization literature [Nem04, BB09], where the soft-max function is introduced to obtain a smooth approximation to the largest eigenvalue function. The transformation to a smooth saddle-point problem suggested by [BB09] is more complicated than the simple notion of $\varepsilon$-feasibility suggested here, and leads to a comparable computational complexity in total.

7.3 Two Improved Variants of Algorithm 6

Choosing the Optimal $\alpha$ by Line-Search. As we already mentioned for the general convex optimization algorithm in Section 3, the optimal $\alpha$ in Algorithm 6, i.e. the $\alpha \in [0,1]$ of best improvement in the objective function $f$, can be found by line-search.
In particular for matrix completion problems, which we will discuss in more detail in Section 11.5, the widely used squared error is easy to analyze in this respect: if the optimization function is given by $f(X) = \frac{1}{2} \sum_{ij \in P} (X_{ij} - Y_{ij})^2$, where $P$ is the set of observed positions of the matrix $Y$, then the optimality condition (10) from Section 3.3 is equivalent to
$$\alpha = \frac{\sum_{ij \in P} (X_{ij} - Y_{ij})(X_{ij} - v_i v_j)}{\sum_{ij \in P} (X_{ij} - v_i v_j)^2}. \qquad (26)$$
Here $X = X^{(k)}$, and $v$ is the approximate eigenvector $v^{(k)}$ used in step $k$ of Algorithm 6. The above expression is computable very efficiently compared to the eigenvector approximation task.

Immediate Feedback in the Power Method. As a second improvement, we propose a heuristic to speed up the eigenvector computation, i.e. the internal procedure ApproxEV$(\nabla f(X), \varepsilon')$. Instead of multiplying the current candidate vector $v_k$ with the gradient matrix $\nabla f(X)$ in each power iteration, we multiply with $\frac{1}{2}\left(\nabla f(X) + \nabla f(\tilde{X})\right)$, that is, the average of the current gradient and the gradient at the new candidate location $\tilde{X} = \left(1 - \frac{1}{k}\right) X^{(k)} + \frac{1}{k}\, v^{(k)} {v^{(k)}}^T$. Thereby, we immediately take into account the effect of the new feature vector $v^{(k)}$. This heuristic (which unfortunately does not fall under our current theoretical guarantee) is inspired by stochastic gradient descent as in Simon Funk's method, which we will describe in Section 11.5.4. In practical experiments, this proposed slight modification results in a significant speed-up of Algorithm 6, as we will observe e.g. for matrix completion problems in Section 11.5.

7.4 An $\Omega\!\left(\frac{1}{\varepsilon}\right)$ Lower Bound on the Rank

Analogous to the vector case discussed in Section 4.2, we can also show that the rank of $O\!\left(\frac{1}{\varepsilon}\right)$, as obtained by the greedy Algorithm 6, is indeed optimal, by providing a lower bound of $\Omega\!\left(\frac{1}{\varepsilon}\right)$.
In other words, we can now exactly characterize the trade-off between rank and approximation quality for convex optimization over the spectahedron. For the lower-bound construction, we consider the convex function $f(X) := \|X\|_{Fro}^2 = X \bullet X$ over the symmetric matrices $\mathbb{S}^{n \times n}$. This function has gradient $\nabla f(X) = 2X$. We will later see that its curvature constant is $C_f = 2$.

The following lemmata show that the above sparse SDP Algorithm 6 is optimal for the approximation quality (primal as well as dual error, respectively), giving the lowest possible rank, up to a small multiplicative constant.

Lemma 21. For $f(X) := \|X\|_{Fro}^2$ and $1 \le k \le n$, it holds that
$$\min_{\substack{X \in \mathcal{S}\\ \operatorname{Rk}(X) \le k}} f(X) = \frac{1}{k}.$$

We will see that this claim can be reduced to the analogous Lemma 10 for the vector case, by the standard technique of diagonalizing a symmetric matrix. (This idea was suggested by Elad Hazan.) Alternatively, an explicit (but slightly longer) proof not requiring the spectral theorem can be obtained by using the Cholesky decomposition together with induction on $k$.

Proof. We observe that the objective function $\|.\|_{Fro}^2$, the trace, as well as the property of being positive semidefinite, are all invariant under orthogonal transformations (or in other words, under the choice of basis). By the standard spectral theorem, for any symmetric matrix $X$ with $\operatorname{Rk}(X) \le k$, there exists an orthogonal transformation mapping $X$ to a diagonal matrix $X'$ with at most $k$ non-zero entries on the diagonal (these being the eigenvalues of $X$). For diagonal matrices, the $\|.\|_{Fro}$ matrix norm coincides with the $\|.\|_2$ vector norm of the diagonal. Finally, by applying the vector-case Lemma 10 to the diagonal of $X'$, we obtain that $f(X) = f(X') \ge \frac{1}{k}$.
To see that the minimum can indeed be attained, one again chooses the "uniform" example $X := \frac{1}{k} I_k \in \mathcal{S}$, the matrix consisting of $k$ non-zero entries (of $\frac{1}{k}$ each) on the diagonal. This gives $f(X) = \frac{1}{k}$. □

Recall from Section 7.1 that for convex problems of the form (20) over the spectahedron, the duality gap is the non-negative value $g(X) := f(X) - \omega(X) = X \bullet \nabla f(X) - \lambda_{\min}(\nabla f(X))$. Also, by weak duality as given in Lemma 1, this value is always an upper bound on the primal error, that is, $f(X) - f(X^*) \le g(X)$ for all $X$.

Lemma 22. For $f(X) := \|X\|_{Fro}^2$ and any $k \in \mathbb{N}$, $k < n$, it holds that
$$g(X) \ge \frac{1}{k} \qquad \forall X \in \mathcal{S} \text{ s.t. } \operatorname{Rk}(X) \le k.$$

Proof. $g(X) = \lambda_{\max}(-\nabla f(X)) + X \bullet \nabla f(X) = -2\lambda_{\min}(X) + 2\, X \bullet X$. We now use that $\lambda_{\min}(X) = 0$ for all symmetric PSD matrices $X$ that are not of full rank $n$, and that by Lemma 21 we have $X \bullet X = \operatorname{Tr}(X^T X) = f(X) \ge \frac{1}{k}$. □

The Curvature. We will compute the curvature $C_f$ of our function $f(X) := X \bullet X$, showing that $C_f = 2$ in this case. Using the definition (7), and the fact that here
$$f(Y) - f(X) - (Y - X) \bullet \nabla f(X) = Y \bullet Y - X \bullet X - (Y - X) \bullet 2X = \|X - Y\|_{Fro}^2,$$
one obtains that $C_f = \sup_{X, Y \in \mathcal{S}} \|X - Y\|_{Fro}^2 = \operatorname{diam}_{Fro}(\mathcal{S})^2 = 2$. Finally, the following Lemma 23 shows that the diameter is indeed 2.

Lemma 23 (Diameter of the Spectahedron). $\operatorname{diam}_{Fro}(\mathcal{S})^2 = 2$.

Proof. Using the fact that the spectahedron $\mathcal{S}$ is the convex hull of the rank-1 matrices of unit trace, see Lemma 16, we know that the diameter must be attained at two vertices of $\mathcal{S}$, i.e. $vv^T$ and $uu^T$ for some $u, v \in \mathbb{R}^n$ with $\|u\|_2 = \|v\|_2 = 1$, and
$$\left\| vv^T - uu^T \right\|_{Fro}^2 = vv^T \bullet vv^T + uu^T \bullet uu^T - 2\, vv^T \bullet uu^T = \|v\|^4 + \|u\|^4 - 2 (u^T v)^2 = 2 - 2(u^T v)^2.$$
Clearly, this quantity is maximized when $u$ and $v$ are orthogonal, in which case it equals 2. □

Note: We could also study $f(X) := \gamma \|X\|_{Fro}^2$ instead, for some $\gamma > 0$.
This function has curvature constant $C_f = 2\gamma$, and for this scaling our above lower bounds also just scale linearly, giving $\frac{C_f}{k}$ instead of $\frac{1}{k}$.

8 Semidefinite Optimization with $\ell_\infty$-Bounded Diagonal

Here we specialize our general Algorithm 1 to semidefinite optimization problems where all diagonal entries are individually constrained. This results in a new optimization method that can also be applied to max-norm optimization problems, which we will discuss in more detail in Section 11.

As in the previous Section 7, we again consider matrix optimization problems over the space $\mathcal{X} = \mathbb{S}^{n \times n}$ of symmetric matrices, equipped with the standard Frobenius inner product $\langle X, Y \rangle = X \bullet Y$. Formally, we consider the following special case of the general optimization problem (1), i.e.
$$\min_{X \in \mathbb{S}^{n \times n}} f(X) \quad \text{s.t.} \quad X_{ii} \le 1\ \forall i, \quad X \succeq 0. \qquad (27)$$
We will write $\mathcal{B} := \left\{ X \in \mathbb{S}^{n \times n} \;\middle|\; X \succeq 0,\ X_{ii} \le 1\ \forall i \right\}$ for the feasible set in this case, that is, the PSD matrices whose diagonal entries are all upper bounded by one.

This class of optimization problems has become widely known for the linear objective case $f(X) = A \bullet X$ with $A$ being the Laplacian matrix of a graph. In this case, one obtains the standard SDP relaxation of the Max-Cut problem [GW95], which we will briefly discuss below. Also, this optimization domain is strongly related to the matrix max-norm, which we study in more detail in Section 11.3.

Our general optimization Algorithm 1 directly applies to this specialized class of optimization problems as well, in which case it becomes the method depicted in the following Algorithm 7.

Algorithm 7 Sparse Greedy for Max-Norm Bounded Semidefinite Optimization
Input: Convex function $f$ with curvature $C_f$, target accuracy $\varepsilon$
Output: $\varepsilon$-approximate solution for problem (27)
Set $X^{(0)} := vv^T$ for an arbitrary unit-length vector $v \in \mathbb{R}^n$.
for $k = 0 \ldots$
$\infty$ do
  Let $\alpha := \frac{2}{k+2}$
  Compute $S := \text{ApproxLinear}\!\left( \nabla f(X^{(k)}),\ \mathcal{B},\ \alpha C_f \right)$
  Update $X^{(k+1)} := X^{(k)} + \alpha \left( S - X^{(k)} \right)$
end for
(Here $\mathcal{B}$ denotes the feasible set of problem (27), i.e. the PSD matrices with diagonal entries at most one.)

The Linearized Problem. Here, the internal subproblem ApproxLinear$(\cdot)$ of approximately minimizing a linear function over the domain $\mathcal{B}$ is a non-trivial task. Every call of ApproxLinear$(A, \mathcal{B}, \varepsilon')$ in fact means that we have to solve a semidefinite program $\min_{Y \in \mathcal{B}} Y \bullet A$ for a given matrix $A$, or in other words
$$\min_Y \; Y \bullet A \quad \text{s.t.} \quad Y_{ii} \le 1\ \forall i, \quad Y \succeq 0, \qquad (28)$$
up to an additive approximation error of $\varepsilon' = \alpha C_f$.

Relation to Max-Cut. In [AHK05, Kal07], the same linear problem is denoted by (MaxQP). In the special case that $A$ is chosen as the Laplacian matrix of a graph, the above SDP is widely known as the standard SDP relaxation of the Max-Cut problem [GW95] (not to be confused with the combinatorial Max-Cut problem itself, which is known to be NP-hard). In fact, the original relaxation uses equality constraints $Y_{ii} = 1$ on the diagonal instead, but for any matrix $A$ with positive diagonal entries (such as e.g. a graph Laplacian), this condition follows automatically in the maximization variant of (28); see [KL96], or also [GM11, Kal07] for more background.

Duality and Duality of Norms. In Section 11.3 we will see that the above quantity (28), which determines both the step in our greedy Algorithm 7 and the duality gap, is in fact the norm of $A$ that is dual to the matrix max-norm. For optimization problems of the form (27), it can again be shown that the poor-man's duality given by the linearization (see also Section 2) indeed coincides with classical Wolfe duality from the optimization literature.

Fortunately, it was shown by [AHK05] that this linearized convex optimization problem (28) (and therefore also our internal procedure ApproxLinear$(\cdot)$) can be solved relatively efficiently if the matrix $A$ (i.e. $\nabla f(X)$ in our case) is sparse.
(Also, Kale [Kal07, Theorem 14] has shown that this problem can be solved very efficiently if the matrix $A = -\nabla f(X)$ is sparse.)

Theorem 24. The algorithm of [AHK05] delivers an additive $\varepsilon'$-approximation to the linearized problem (28) in time $\tilde{O}\!\left(\frac{n^{1.5} L^{2.5}}{\varepsilon'^{\,2.5}}\, N_A\right)$, where the constant $L > 0$ is an upper bound on the maximum value of $Y \bullet A$ over the feasible set of (28), and $N_A$ is the number of non-zeros of $A$.

Proof. The results of [AHK05, Theorem 3] and [Kal07, Theorem 33] give a running time of order $\tilde{O}\!\left(\frac{n^{1.5}}{\varepsilon^{2.5}} \cdot \min\left\{ N_A,\ \frac{n^{1.5}}{\varepsilon \alpha^*} \right\}\right)$ to obtain a multiplicative $(1-\varepsilon)$-approximation, where $\alpha^*$ is the value of an optimal solution; formally, one obtains a feasible $S$ with $S \bullet A \ge (1 - \varepsilon)\alpha^*$. In other words, by using an accuracy of $\varepsilon := \frac{\varepsilon'}{\alpha^*}$, we obtain an additive $\varepsilon'$-approximation to (28). □

Here the notation $\tilde{O}(\cdot)$ again suppresses poly-logarithmic factors in $n$, and $N_A$ is the number of non-zero entries of the matrix $A$. Note that, analogous to the approximate eigenvector computation for Hazan's Algorithm 6, we need the assumption that the linear function $Y \bullet \nabla f(X)$ is bounded over the feasible set of (28). This is a reasonable assumption, as our function $f$ has bounded curvature $C_f$ (corresponding to $\nabla f(X)$ being Lipschitz-continuous over the domain), and the diameter of the domain is bounded. The reason we need an absolute approximation quality lies in the analysis of Algorithm 1, even if it would feel much more natural to work with relative approximation quality in many cases.

Convergence. The convergence result for the general Algorithm 1 directly gives us the analysis for the specialized algorithm here. Note that the curvature, over the feasible set $\mathcal{B}$ of (27), is given by
$$C_f := \sup_{\substack{X, V \in \mathcal{B},\ \alpha \in [0,1],\\ Y = X + \alpha(V - X)}} \frac{1}{\alpha^2} \left( f(Y) - f(X) - (Y - X) \bullet \nabla f(X) \right). \qquad (29)$$

Theorem 25. For each $k \ge 1$, the iterate $X^{(k)}$ of Algorithm 7 satisfies
$$f(X^{(k)}) - f(X^*) \le \frac{8 C_f}{k+2},$$
where $X^*$ is an optimal solution to problem (27). Furthermore, after at most $2\left\lceil \frac{8 C_f}{\varepsilon} \right\rceil + 1 = O\!\left(\frac{1}{\varepsilon}\right)$ many steps, it has an iterate $X^{(k)}$ with $g(X^{(k)}) \le \varepsilon$.

Proof. This is a corollary of Theorem 3 and Theorem 5. □

(A further remark on the linearized problem (28): when $A$ is the Laplacian matrix of a weighted graph, Kale [Kal07] shows that a multiplicative $\varepsilon$-approximation to (28) can be computed in time $\tilde{O}(\Delta^2 d^2 N_A)$, where $N_A$ is the number of non-zero entries of $A$, $\Delta$ is the maximum entry on the diagonal of $A$, and $d$ is the average value on the diagonal.)

Applications. The new algorithm can be used to solve arbitrary max-norm constrained convex optimization problems, such as max-norm regularized matrix completion problems, which we will study in Section 11.

9 Sparse Semidefinite Optimization

Another interesting optimization domain among the semidefinite matrices is given by the matrices with only one non-zero off-diagonal entry. Here we specialize our general Algorithm 1 to convex optimization over the convex hull of such matrices. Our algorithm will therefore obtain $\varepsilon$-approximate solutions that are combinations of only $O\!\left(\frac{1}{\varepsilon}\right)$ such sparse matrices, or in other words, solutions of sparsity $O\!\left(\frac{1}{\varepsilon}\right)$.

Why bother? The same sparse matrices are also used in the graph sparsification approach of [BSS09]. (Their theoretical result guarantees that all eigenvalues of the resulting sparse matrix, corresponding to the Laplacian of a sparse graph, do not differ too much from their counterparts in the original graph.) Furthermore, sparse solutions to convex matrix optimization problems have gained interest in dimensionality reduction, as in sparse PCA; see [ZdG10] for an overview.

Setup. Formally, we again use the standard Frobenius inner product $\langle X, Y \rangle = X \bullet Y$ over the symmetric matrices $\mathbb{S}^{n \times n}$, and consider the sparse PSD matrices given by
$$P^{(ij)} := (e_i + e_j)(e_i + e_j)^T,$$
for any fixed pair of indices $i, j \in [n]$, $i \ne j$. In other words, $P^{(ij)}_{uv} = 1$ for $u, v \in \{i, j\}$, and zero everywhere else.
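Together with their "negative" counterparts $N^{(ij)} := (e_i - e_j)(e_i - e_j)^T$ introduced just below, these atoms admit a very cheap linear-minimization oracle: since $D \bullet P^{(ij)} = D_{ii} + D_{jj} + 2D_{ij}$ and $D \bullet N^{(ij)} = D_{ii} + D_{jj} - 2D_{ij}$, one pass over the entries of $D$ finds the best atom. A minimal sketch (function name ours):

```python
import numpy as np

def exact_linear_sparse(D):
    """ExactLinear(D, S_4sparse): minimize D . S over the atoms
    S in {P(ij), N(ij)} by a single pass over the entries of D, using
    D.P(ij) = D_ii + D_jj + 2*D_ij  and  D.N(ij) = D_ii + D_jj - 2*D_ij."""
    n = D.shape[0]
    best_val, best_atom = np.inf, None
    for i in range(n):
        for j in range(i + 1, n):
            for sign in (+1.0, -1.0):
                val = D[i, i] + D[j, j] + 2 * sign * D[i, j]
                if val < best_val:
                    e = np.zeros(n); e[i] = 1.0; e[j] = sign
                    best_val, best_atom = val, np.outer(e, e)
    return best_atom, best_val

rng = np.random.default_rng(4)
B = rng.standard_normal((7, 7))
D = (B + B.T) / 2
S, val = exact_linear_sparse(D)
assert np.isclose(np.sum(D * S), val)       # the atom realizes the reported value
assert np.isclose(np.trace(S), 2.0)         # every atom has trace two
assert np.linalg.eigvalsh(S)[0] >= -1e-12   # and is PSD (rank one)
```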
We also consider the analogous "negative" counterparts of such matrices, namely
$$N^{(ij)} := (e_i - e_j)(e_i - e_j)^T,$$
i.e. $N^{(ij)}_{uv} = -1$ for the two off-diagonal entries $(u, v) \in \{(i,j), (j,i)\}$, $N^{(ij)}_{uv} = 1$ for the two diagonal entries $(u, v) \in \{(i,i), (j,j)\}$, and zero everywhere else.

Analogously to the two previous applications of our method to semidefinite optimization, we now optimize a convex function, i.e.
$$\min_{X \in \mathcal{S}_{4\text{sparse}}} f(X) \qquad (30)$$
over the domain
$$\mathcal{D} = \mathcal{S}_{4\text{sparse}} := \operatorname{conv}\left( \bigcup_{ij} P^{(ij)} \;\cup\; \bigcup_{ij} N^{(ij)} \right).$$

Optimizing over Sparse Matrices, and Solving the Linearization. Applying our general Algorithm 1 to this class of problems (30) becomes very simple, as the linear primitive problem ExactLinear$\left(D_X, \mathcal{S}_{4\text{sparse}}\right)$ is easily solved over $\mathcal{S}_{4\text{sparse}}$ for any fixed matrix $D_X \in \mathbb{S}^{n \times n}$. From our simple Lemma 8 on linear functions over convex hulls, we know that this linear minimum is attained by the single sparse matrix $P^{(ij)}$ or $N^{(ij)}$ that maximizes the inner product with $-D_X$. The optimal pair of indices $(i, j)$ can be found by a linear pass through the gradient $D_X = \nabla f(X)$. This means that the linearized problem is much easier to solve than in the above two Sections 7 and 8.

Altogether, Algorithm 1 will build approximate solutions $X^{(k)}$, each of which is a convex combination of $k$ of the atomic matrices $P^{(ij)}$ or $N^{(ij)}$, as formalized in the following theorem.

Theorem 26. Let $n \ge 2$, and let $X^{(0)} := P^{(12)}$ be the starting point.
Then for each $k \ge 1$, the iterate $X^{(k)}$ of Algorithm 1 has at most $4(k+1)$ non-zero entries, and satisfies
$$f(X^{(k)}) - f(X^*) \le \frac{8 C_f}{k+2},$$
where $X^* \in \mathcal{S}_{4\text{sparse}}$ is an optimal solution to problem (30). Furthermore, for any $\varepsilon > 0$, after at most $2\left\lceil \frac{8 C_f}{\varepsilon} \right\rceil + 1 = O\!\left(\frac{1}{\varepsilon}\right)$ many steps, it has an iterate $X^{(k)}$ with only $O\!\left(\frac{1}{\varepsilon}\right)$ many non-zero entries, satisfying $g(X^{(k)}) \le \varepsilon$.

Proof. This is a corollary of Theorem 3 and Theorem 5. The sparsity claim follows from our observation that the step directions given by ExactLinear$\left(\nabla f(X), \mathcal{S}_{4\text{sparse}}\right)$ are always one of the sparse matrices $P^{(ij)}$ or $N^{(ij)}$. □

Optimizing over Non-Negative Matrix Factorizations. We also consider the slight variant of (30) where we optimize over only one of the two types of matrices as the domain $\mathcal{D}$, i.e. only combinations of the positive $P^{(ij)}$ or only of the negative $N^{(ij)}$. This means that the domain is $\mathcal{D} = \mathcal{S}_{4+\text{sparse}} := \operatorname{conv}\left( \bigcup_{ij} P^{(ij)} \right)$ or $\mathcal{D} = \mathcal{S}_{4-\text{sparse}} := \operatorname{conv}\left( \bigcup_{ij} N^{(ij)} \right)$. The above analysis for Algorithm 1 holds in exactly the same way. Now for $\mathcal{S}_{4+\text{sparse}}$, each step direction $s = s^{(k)}$ used by the algorithm is given by $s = P^{(ij)} = (e_i + e_j)(e_i + e_j)^T$ for some $i, j$, and so each of the approximations $X^{(k)}$ is a sum of $k$ many positive rank-1 factors of this form. In other words, in each step $k$, $X^{(k)} = LR^T$ is a product of two (entry-wise) non-negative matrices with at most $k$ columns each, i.e. $L \in \mathbb{R}^{n \times k}$ and $R \in \mathbb{R}^{n \times k}$. Consequently, our algorithm provides solutions that are non-negative matrix factorizations, a successful technique in matrix completion problems from recommender systems; see e.g. [Wu07].

Relation to Bounded Trace and Diagonally Dominant Matrices.
Observe that the matrices in $\mathcal{S}_{4\text{sparse}}$ form a subset of the bounded-trace PSD matrices that we studied in the previous Section 7, up to scaling by two, since every matrix $P^{(ij)}$ or $N^{(ij)}$ is PSD and has trace equal to two. Furthermore, we observe that all matrices $X \in \mathcal{S}_{4\text{sparse}}$ are diagonally dominant, meaning that
$$|X_{ii}| \ge \sum_{j \ne i} |X_{ij}| \qquad \forall i \in [n].$$
In the case that we restrict to only one of the two types of matrices, $\mathcal{S}_{4+\text{sparse}}$ or $\mathcal{S}_{4-\text{sparse}}$, as the domain, equality $|X_{ii}| = \sum_{j \ne i} |X_{ij}|$ always holds, since this equality is preserved under taking convex combinations and holds for the atomic matrices $P^{(ij)}$ and $N^{(ij)}$.

The Curvature. The above reasoning also implies that the curvature $C_f$ for problems of the form (30) is upper bounded by the curvature in the spectahedron case as given in (22), since $\mathcal{S}_{4+\text{sparse}} \subseteq \mathcal{S}_{4\text{sparse}} \subseteq 2 \cdot \mathcal{S}$.

Applications and Future Research. Computationally, the approach here looks very attractive, as the cost of a "sparse" step is much cheaper than the approximate eigenvector computation needed in the bounded-trace case, as explained in Section 7. Also, it will be interesting to see how regularization by constraining to a scaled domain $\mathcal{S}_{4\text{sparse}}$ or $\mathcal{S}_{4+\text{sparse}}$ performs in practical machine learning applications such as dimensionality reduction, compared to the nuclear norm regularization that we will discuss in the following Section 11. It also remains to investigate further whether general bounded-trace semidefinite problems of the form (20) can be approximated using only sparse matrices.
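The two structural properties just used, trace two and diagonal dominance, can be checked numerically for a random convex combination of the atoms (a sketch with names of our choosing):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6

def atom(i, j, sign):
    """P(ij) for sign=+1, N(ij) for sign=-1."""
    e = np.zeros(n); e[i] = 1.0; e[j] = sign
    return np.outer(e, e)

# a random convex combination of atoms, i.e. a point of S_4sparse
pairs = [(0, 1, +1.0), (1, 3, -1.0), (2, 4, +1.0), (0, 5, -1.0)]
alphas = rng.random(len(pairs)); alphas /= alphas.sum()
X = sum(a * atom(i, j, s) for a, (i, j, s) in zip(alphas, pairs))

assert np.isclose(np.trace(X), 2.0)         # trace two, i.e. X lies in 2*S
for i in range(n):                          # diagonal dominance, row by row
    off = np.sum(np.abs(X[i, :])) - np.abs(X[i, i])
    assert np.abs(X[i, i]) >= off - 1e-12
```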
10 Submodular Optimization

For a finite ground set $S$, a real-valued function $g$ defined on all subsets of $S$ is called submodular if
$$g(X \cap Y) + g(X \cup Y) \le g(X) + g(Y) \qquad \forall X, Y \subseteq S.$$
For any given submodular function $g$ with $g(\emptyset) = 0$, the paper [Lov83, Section 3] introduces a corresponding convex set in $\mathbb{R}^{|S|}$, called the submodular polyhedron (or Lovász polyhedron),
$$P_g := \left\{ x \in \mathbb{R}^{|S|} \;\middle|\; \sum_{i \in T} x_i \le g(T) \ \forall T \subseteq S \right\}.$$
We would now like to study convex optimization problems over such domains, which become compact convex sets if we intersect with the positive orthant, i.e. $\mathcal{D} := P_g \cap \mathbb{R}^{|S|}_{\ge 0}$.

Conveniently for our optimization framework, [Lov83, Section 3] already showed that there is a simple greedy algorithm which optimizes any linear function over the domain $P_g$, i.e. it solves $\max_{x \in P_g} c^T x$, or in other words, it exactly solves our internal problem ExactLinear$(c, P_g)$. Lovász [Lov83] already demonstrated how to use this kind of linear optimization over $P_g$ to solve submodular minimization problems. It remains to investigate whether there are interesting applications for the wider class of more general convex (non-linear) functions $f$ over such domains, as addressed by our Algorithm 1.

11 Optimization with the Nuclear and Max-Norm

Matrix optimization problems with a nuclear norm or max-norm regularization, such as e.g. low-norm matrix factorizations, have seen many applications recently, ranging from low-rank recovery and dimensionality reduction to recommender systems. We propose two new first-order approximation methods building upon two of the simple semidefinite optimizers we have studied above, namely the approximate SDP solver of [Haz08] from Section 7 on one hand, and our bounded-diagonal optimizer from Section 8 on the other hand. The algorithms come with strong convergence guarantees.
In contrast to existing methods, our nuclear norm optimizer does not need any Cholesky or singular value decompositions internally, and provides guaranteed approximations that are simultaneously of low rank. The method is free of tuning parameters and easy to parallelize.

11.1 Introduction

Here we consider convex optimization problems over matrices which come with a regularization on either the nuclear norm or the max-norm of the optimization variable.

Convex optimization with the nuclear norm has become a very successful technique in various machine learning, computer vision and compressed sensing areas such as low-rank recovery [FHB01, CR09, CT10], dimensionality reduction (such as robust principal component analysis [CLMW11]), and also recommender systems and matrix completion. Here matrix factorizations [SRJ04, KBV09], regularized by the nuclear or max-norm, have gained a lot of attention with the recently ended Netflix Prize competition. Many more applications of similar optimization problems can be found in dimensionality reduction, matrix classification, multi-task learning, spectral clustering and other areas. The success of these methods is fueled by the property of the nuclear norm being a natural convex relaxation of the rank, allowing the use of scalable convex optimization techniques.

Based on the semidefinite optimization methods that we have presented in the above Sections 7 and 8, we propose two new, yet simple, first-order algorithms for nuclear norm as well as max-norm regularized convex optimization. For the nuclear norm case, our proposed method builds upon the first-order scheme for semidefinite optimization by [Haz08], which we have investigated in Section 7.1.
This approach allows us to significantly reduce the computational complexity per iteration, and therefore to scale to much larger datasets: while existing methods need an entire and exact singular value decomposition (SVD) in each step, our method only uses a single approximate eigenvector computation per iteration, which can be done by e.g. the power method. A conference version of our work on nuclear norm regularized problems has appeared in [JS10].

In the same spirit, we will also give a new algorithm with a convergence guarantee for optimizing with a max-norm regularization. For matrix completion problems, experiments show that the max-norm can result in improved generalization performance compared to the nuclear norm in some cases [SRJ04, LRS+10].

Nuclear Norm Regularized Convex Optimization. We consider the following convex optimization problems over matrices:
$$\min_{Z \in \mathbb{R}^{m \times n}} f(Z) + \mu \|Z\|_* \qquad (31)$$
and the corresponding constrained variant
$$\min_{Z \in \mathbb{R}^{m \times n},\ \|Z\|_* \le \frac{t}{2}} f(Z), \qquad (32)$$
where $f(Z)$ is any differentiable convex function (usually called the loss function), and $\|.\|_*$ is the nuclear norm of a matrix, also known as the trace norm (the sum of the singular values, or the $\ell_1$-norm of the spectrum). Here $\mu > 0$ and $t > 0$, respectively, are given parameters, usually called the regularization parameter. The nuclear norm is known as the natural generalization of the (sparsity-inducing) $\ell_1$-norm for vectors to the case of matrices.

When choosing $f(X) := \|\mathcal{A}(X) - b\|_2^2$ for some linear map $\mathcal{A} : \mathbb{R}^{m \times n} \to \mathbb{R}^p$, the above formulation (31) is the matrix generalization of the problem $\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2 + \mu \|x\|_1$ for a fixed matrix $A$, which is the important $\ell_1$-regularized least squares problem, also known as the basis pursuit de-noising problem in the compressed sensing literature; see also Section 5.1.
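The "$\ell_1$-norm of the spectrum" view can be checked directly with numpy (a small sketch; the instances are ours): the nuclear norm is the sum of the singular values, and for a diagonal matrix it reduces to the $\ell_1$-norm of the diagonal, mirroring the vector case.

```python
import numpy as np

rng = np.random.default_rng(6)
Z = rng.standard_normal((5, 4))

# nuclear norm = l1-norm of the spectrum (sum of singular values)
s = np.linalg.svd(Z, compute_uv=False)
assert np.isclose(np.linalg.norm(Z, 'nuc'), np.sum(s))

# for a diagonal matrix it reduces to the l1-norm of the diagonal
d = np.array([1.5, -2.0, 0.5])
assert np.isclose(np.linalg.norm(np.diag(d), 'nuc'), np.sum(np.abs(d)))
```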
The analogous vector variant of (32) is the Lasso problem [Tib96], which is
$$\min_{x \in \mathbb{R}^n} \left\{ \|Ax - b\|_2^2 \;\middle|\; \|x\|_1 \le t \right\}.$$

Max-Norm Regularized Convex Optimization. Intuitively, one can think of the matrix max-norm as the generalization of the vector $\ell_\infty$-norm to PSD matrices. Here we consider optimization problems with a max-norm regularization, which are given by
$$\min_{Z \in \mathbb{R}^{m \times n}} f(Z) + \mu \|Z\|_{\max} \qquad (33)$$
and the corresponding constrained variant
$$\min_{Z \in \mathbb{R}^{m \times n},\ \|Z\|_{\max} \le t} f(Z). \qquad (34)$$

Our Contribution. Using the optimization methods from the previous Sections 7 and 8, we present a much simpler algorithm to solve problems of the form (32) which does not need any internal SVD computations. The same approach will also solve the max-norm regularized problems (34). We achieve this by transforming the problems to the convex optimization setting over positive semidefinite matrices which we have studied in the above Sections 7.1 and 8. Our new approach has several advantages for nuclear norm optimization when compared to existing algorithms such as "proximal gradient" methods (APG) and "singular value thresholding" (SVT), see e.g. [GLW+09, CCS10, TY10, JY09], and also in comparison to the alternating-gradient-descent-type methods (as e.g. [RS05, Lin07]).

i) By employing the approximate SDP solver of [Haz08], see Algorithm 6, we obtain a guaranteed $\varepsilon$-approximate solution $Z$ after $O\!\left(\frac{1}{\varepsilon}\right)$ iterations. Crucially, the resulting solution $Z$ is simultaneously of low rank, namely rank $O\!\left(\frac{1}{\varepsilon}\right)$. Also, the algorithm maintains a compact representation of $Z$ in terms of a low-rank matrix factorization $Z = LR^T$ (with the desired bounded nuclear norm), and can therefore even be applied if the full matrix $Z$ would be far too large to be stored.
ii) Compared to the alternating-gradient-descent-type methods from machine learning, we overcome the problem of working with non-convex formulations of the form $f(LR^T)$, which is NP-hard, and instead solve the original convex problem in $f(Z)$.

iii) The total running time of our algorithm for nuclear norm problems grows linearly in the problem size, and allows us to take full advantage of sparse problems such as e.g. matrix completion. More precisely, the algorithm runs in time $O\!\left(\frac{N_f}{\varepsilon^{1.5}}\right)$, where $N_f$ is the number of matrix entries on which the objective function $f$ depends. Per iteration, our method consists of only a single approximate (largest) eigenvector computation, allowing it to scale to any problem size where the power method (or Lanczos' algorithm) can still be applied. This also makes the method easy to implement and to parallelize. Existing APG/SVT methods by contrast need an entire SVD in each step, which is significantly more expensive.

iv) On the theory side, our simple convergence guarantee of $O(\frac{1}{\varepsilon})$ steps holds even if the used eigenvectors are only approximate. In comparison, those existing methods that come with a convergence guarantee do require an exact SVD in each iteration, which might not always be a realistic assumption in practice.

We demonstrate that our new algorithm improves over the state-of-the-art methods on standard datasets, and scales to large problems such as matrix factorizations on the Netflix dataset. Hazan's Algorithm 6 can be interpreted as the generalization of the coreset approach to problems on symmetric matrices, which we have explained in the previous Section 7.1. Compared to the $O(1/\sqrt{\varepsilon})$ convergence methods in the spirit of [Nes83, Nes07a], our number of steps is larger, which is however more than compensated by the improved step complexity, being lower by a factor of roughly $(n + m)$.
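The single approximate eigenvector computation per iteration mentioned in point iii) can be sketched in a few lines (a minimal power-method implementation in NumPy; the iteration count is an illustrative choice, not a bound from the paper):

```python
import numpy as np

def approx_top_eigenvector(A, iters=500):
    # power method: repeatedly apply the symmetric matrix A and renormalize;
    # converges to the eigenvector of the largest eigenvalue when a spectral
    # gap exists, needing only matrix-vector products with A
    v = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B @ B.T                      # symmetric PSD test matrix
v = approx_top_eigenvector(A)
lam = v @ A @ v                  # Rayleigh quotient estimates the top eigenvalue
assert np.isclose(lam, np.linalg.eigvalsh(A)[-1], rtol=1e-5)
```

For large sparse problems one would instead use Lanczos-type routines, but the key point is that only matrix-vector products are required, never a full SVD.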
Our new method for the nuclear norm case can also be interpreted as a modified, theoretically justified variant of Simon Funk's popular SVD heuristic [Web06] for regularized matrix factorization. To our knowledge this is the first guaranteed convergence result for this class of alternating-gradient-descent-type algorithms.

Related Work. For nuclear norm optimization, there are two lines of existing methods. On the one hand, in the optimization community, [TY10, LST09], [GLW+09] and [JY09] independently proposed algorithms that obtain an $\varepsilon$-accurate solution to (31) in $O(1/\sqrt{\varepsilon})$ steps, by improving the algorithm of [CCS10]. These methods are known under the names "accelerated proximal gradient" (APG) and "singular value thresholding" (SVT). More recently, [MHT10] and [MGC09] also proposed algorithms along the same idea. Each step of all those algorithms requires the computation of the singular value decomposition (SVD) of a matrix of the same size as the solution matrix, which is expensive even with the currently available fast methods such as PROPACK. [TY10] and [JY09] and also [GLW+09] show that the primal error of their algorithm is smaller than $\varepsilon$ after $O(1/\sqrt{\varepsilon})$ steps, using an analysis inspired by [Nes83] and [BT09]. For an overview of related algorithms, we also refer the reader to [CLMW11]. As mentioned above, the method presented here has a significantly lower computational cost per iteration (one approximate eigenvector compared to a full exact SVD), and is also faster in practice on large matrix completion problems.

On the other hand, in the machine learning community, research originated from matrix completion and factorization [SRJ04], later motivated by the Netflix prize challenge, gaining significant momentum from the famous blog post by [Web06].
Only very recently an understanding has formed that many of these methods can indeed be seen as optimizing with a regularization term closely related to the nuclear norm, see Section 11.5.4 and [JS10, SS10]. The majority of the currently existing machine learning methods, such as for example [RS05, Lin07] and later also [Pat07, ZWSP08, KBV09, TPNT09, IR10, GNHS11], are of the type of "alternating" gradient descent applied to $f(LR^T)$, where at each step one of the factors $L$ and $R$ is kept fixed, and the other factor is updated by a gradient or stochastic gradient step. Therefore, despite working well in many practical applications, all these mentioned methods can get stuck in local minima, and are therefore theoretically not well justified; see also the discussion in [DeC06] and our Section 11.4.

The same issue also comes up for max-norm optimization, where for example [LRS+10] optimize over the non-convex factorization (38) for bounded max-norm. To our knowledge, no algorithm with a convergence guarantee was known so far. Furthermore, optimizing with a rank constraint was recently shown to be NP-hard [GG10]. In practical applications, nearly all approaches for large scale problems work over a factorization $Z = LR^T$ of bounded rank, therefore ruling out their ability to obtain a solution in polynomial time in the worst case, unless P = NP. Our new method for both nuclear and max-norm avoids all the above described problems by solving an equivalent convex optimization problem, and provably runs in near linear time in the nuclear norm case.

11.2 The Nuclear Norm for Matrices

The nuclear norm $\|Z\|_*$ of a rectangular matrix $Z \in \mathbb{R}^{m \times n}$, also known as the trace norm or Ky Fan norm, is given by the sum of the singular values of $Z$, which is equal to the $\ell_1$-norm of the singular values of $Z$ (because singular values are always non-negative).
Therefore, the nuclear norm is often called the Schatten $\ell_1$-norm. In this sense, it is a natural generalization of the $\ell_1$-norm for vectors which we have studied earlier.

The nuclear norm has a nice equivalent characterization in terms of matrix factorizations of $Z$, i.e.

$\|Z\|_* := \min_{LR^T = Z} \frac{1}{2}\left(\|L\|_{Fro}^2 + \|R\|_{Fro}^2\right),$   (35)

where the number of columns of the factors $L \in \mathbb{R}^{m \times k}$ and $R \in \mathbb{R}^{n \times k}$ is not constrained [FHB01, SRJ04]. In other words, the nuclear norm constrains the average Euclidean row or column norms of any factorization of the original matrix $Z$.

Furthermore, the nuclear norm is dual to the standard spectral matrix norm (i.e. the matrix operator norm), meaning that $\|Z\|_* = \max_{B,\ \|B\|_{spec} \le 1} B \bullet Z$; see also [RFP10]. Recall that $\|B\|_{spec}$ is defined as the first singular value $\sigma_1(B)$ of the matrix $B$.

Similarly to the property of the vector $\|.\|_1$-norm being the best convex approximation to the sparsity of a vector, as we discussed in Section 5, the nuclear norm is the best convex approximation of the matrix rank. More precisely, $\|.\|_*$ is the convex envelope of the rank [FHB01], meaning that it is the largest convex function that is upper bounded by the rank on the convex domain of matrices $\left\{ Z \mid \|Z\|_{spec} \le 1 \right\}$. This motivates why the nuclear norm is widely used as a proxy function (or convex relaxation) for rank minimization, which otherwise is a hard combinatorial problem.

Its relation to semidefinite optimization (which explains why the nuclear norm is often called the trace norm) is that

$\|Z\|_* = \min_{V,W}\ t \quad \text{s.t.} \quad \begin{pmatrix} V & Z \\ Z^T & W \end{pmatrix} \succeq 0 \ \text{ and } \ \mathrm{Tr}(V) + \mathrm{Tr}(W) \le 2t.$   (36)

Here the two optimization variables range over the symmetric matrices $V \in \mathbb{S}^{m \times m}$ and $W \in \mathbb{S}^{n \times n}$. This semidefinite characterization will in fact be the central tool for our algorithmic approach for nuclear norm regularized problems in the following.
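The correspondence between the factorization (35) and the semidefinite characterization (36) can be checked numerically (a NumPy sketch; for a single arbitrary factorization $LR^T = Z$ we can only verify the upper-bound direction, since (35) is a minimum over all factorizations):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5, 4, 3
L = rng.standard_normal((m, k))
R = rng.standard_normal((n, k))
Z = L @ R.T

# stacking the factors gives the PSD block matrix of (36):
# X = [L; R][L; R]^T = [[L L^T, Z], [Z^T, R R^T]]
S = np.vstack([L, R])
X = S @ S.T
assert np.allclose(X[:m, m:], Z)                 # Z is the upper-right block
assert np.all(np.linalg.eigvalsh(X) >= -1e-9)    # X is PSD by construction

# Tr(V) + Tr(W) = ||L||_Fro^2 + ||R||_Fro^2 =: 2t, so (36) certifies ||Z||_* <= t
two_t = np.sum(L**2) + np.sum(R**2)
nuc = np.linalg.svd(Z, compute_uv=False).sum()
assert nuc <= two_t / 2 + 1e-9
```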
The equivalence of the above characterization to the earlier "factorization" formulation (35) is a consequence of the following simple Lemma 27. The Lemma gives a correspondence between the (rectangular) matrices $Z \in \mathbb{R}^{m \times n}$ of bounded nuclear norm on one hand, and the (symmetric) PSD matrices $X \in \mathbb{S}^{(m+n) \times (m+n)}$ of bounded trace on the other hand.

Lemma 27 ([FHB01, Lemma 1]). For any non-zero matrix $Z \in \mathbb{R}^{m \times n}$ and $t \in \mathbb{R}$, it holds that $\|Z\|_* \le \frac{t}{2}$ if and only if there exist symmetric matrices $V \in \mathbb{S}^{m \times m}$, $W \in \mathbb{S}^{n \times n}$ such that $\begin{pmatrix} V & Z \\ Z^T & W \end{pmatrix} \succeq 0$ and $\mathrm{Tr}(V) + \mathrm{Tr}(W) \le t$.

Proof. $\Rightarrow$ Using the characterization (35) of the nuclear norm, $\|Z\|_* = \min_{LR^T = Z} \frac{1}{2}\left(\|L\|_{Fro}^2 + \|R\|_{Fro}^2\right)$, we get that there exist $L, R$ with $LR^T = Z$ such that $\|L\|_{Fro}^2 + \|R\|_{Fro}^2 = \mathrm{Tr}(LL^T) + \mathrm{Tr}(RR^T) \le t$, or in other words we have found a matrix $\begin{pmatrix} LL^T & Z \\ Z^T & RR^T \end{pmatrix} = (L; R)(L; R)^T \succeq 0$ of trace $\le t$.

$\Leftarrow$ As the matrix $\begin{pmatrix} V & Z \\ Z^T & W \end{pmatrix}$ is symmetric and PSD, it can be (Cholesky) factorized to $(L; R)(L; R)^T$ such that $LR^T = Z$ and $t \ge \mathrm{Tr}(LL^T) + \mathrm{Tr}(RR^T) = \|L\|_{Fro}^2 + \|R\|_{Fro}^2$, therefore $\|Z\|_* \le \frac{t}{2}$.

Interestingly, for characterizing bounded nuclear norm matrices, it does not make any difference whether we enforce an equality or an inequality constraint on the trace. This fact will turn out to be useful in order to apply our Algorithm 6 later on.

Corollary 28. For any non-zero matrix $Z \in \mathbb{R}^{m \times n}$ and $t \in \mathbb{R}$, it holds that $\|Z\|_* \le \frac{t}{2}$ if and only if there exist symmetric matrices $V \in \mathbb{S}^{m \times m}$, $W \in \mathbb{S}^{n \times n}$ such that $\begin{pmatrix} V & Z \\ Z^T & W \end{pmatrix} \succeq 0$ and $\mathrm{Tr}(V) + \mathrm{Tr}(W) = t$.

Proof. $\Rightarrow$ From Lemma 27 we obtain a matrix $\begin{pmatrix} V & Z \\ Z^T & W \end{pmatrix} =: X \succeq 0$ of trace, say, $s \le t$. If $s < t$, we add $(t - s)$ to the top-left entry of $V$, i.e. we add to $X$ the PSD rank-1 matrix $(t - s)\,e_1 e_1^T$ (which again gives a PSD matrix). $\Leftarrow$ follows directly from Lemma 27.

11.2.1 Weighted Nuclear Norm

A promising weighted nuclear norm regularization for matrix completion was recently proposed by [SS10].
For fixed weight vectors $p \in \mathbb{R}^m$, $q \in \mathbb{R}^n$, the weighted nuclear norm $\|Z\|_{nuc(p,q)}$ of $Z \in \mathbb{R}^{m \times n}$ is defined as $\|Z\|_{nuc(p,q)} := \|PZQ\|_*$, where $P = \mathrm{diag}(\sqrt{p}) \in \mathbb{R}^{m \times m}$ denotes the diagonal matrix whose $i$-th diagonal entry is $\sqrt{p_i}$, and analogously for $Q = \mathrm{diag}(\sqrt{q}) \in \mathbb{R}^{n \times n}$. Here $p \in \mathbb{R}^m$ is the vector whose entries are the probabilities $p(i) > 0$ that the $i$-th row is observed in the sampling $\Omega$. Analogously, $q \in \mathbb{R}^n$ contains the probability $q(j) > 0$ for each column $j$. The opposite weighting (using $\frac{1}{p(i)}$ and $\frac{1}{q(j)}$ instead of $p(i)$, $q(j)$) has also been suggested by [WKS08].

Any optimization problem with a weighted nuclear norm regularization

$\min_{Z \in \mathbb{R}^{m \times n},\ \|Z\|_{nuc(p,q)} \le t/2} f(Z)$   (37)

and arbitrary loss function $f$ can therefore be formulated equivalently over the domain $\|PZQ\|_* \le t/2$, such that it reads as (if we substitute $\bar{Z} := PZQ$)

$\min_{\bar{Z} \in \mathbb{R}^{m \times n},\ \|\bar{Z}\|_* \le t/2} f(P^{-1}\bar{Z}Q^{-1}).$

Hence, we have reduced the task to our standard convex problem (32) for $\hat{f}$, which here is defined as $\hat{f}(X) := f(P^{-1}\bar{Z}Q^{-1})$, where $X =: \begin{pmatrix} V & \bar{Z} \\ \bar{Z}^T & W \end{pmatrix}$. This equivalence implies that any algorithm solving (32) also serves as an algorithm for weighted nuclear norm regularization. In particular, Hazan's Algorithm 6 does imply a guaranteed approximation quality of $\varepsilon$ for problem (37) after $O(\frac{1}{\varepsilon})$ many rank-1 updates, as we discussed in Section 7. So far, to the best of our knowledge, no approximation guarantees were known for the weighted nuclear norm.

Solution path algorithms (maintaining approximation guarantees when the regularization parameter $t$ changes), as proposed by [GJL10, GJL12a, GJL12b] and the author's PhD thesis [Jag11], can also be extended to the case of the weighted nuclear norm.
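Evaluating the weighted nuclear norm is mechanical once the definition is in place (a NumPy sketch; the helper name `weighted_nuclear_norm` is ours):

```python
import numpy as np

def weighted_nuclear_norm(Z, p, q):
    # ||Z||_nuc(p,q) := ||P Z Q||_* with P = diag(sqrt(p)), Q = diag(sqrt(q));
    # row/column scaling avoids forming the diagonal matrices explicitly
    PZQ = np.sqrt(p)[:, None] * Z * np.sqrt(q)[None, :]
    return np.linalg.svd(PZQ, compute_uv=False).sum()

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 4))
plain = np.linalg.svd(Z, compute_uv=False).sum()

# with uniform weights p = 1, q = 1, the weighted norm is the plain nuclear norm
assert np.isclose(weighted_nuclear_norm(Z, np.ones(5), np.ones(4)), plain)
```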
11.3 The Max-Norm for Matrices

We think of the matrix max-norm as a generalization of the vector $\ell_\infty$-norm to the case of positive semidefinite matrices, which we have studied before. In some matrix completion applications, the max-norm has been observed to provide solutions of better generalization performance than the nuclear norm [SRJ04]. Both matrix norms can be seen as a convex surrogate of the rank [SS05].

The max-norm $\|Z\|_{\max}$ of a rectangular matrix $Z \in \mathbb{R}^{m \times n}$ has a nice characterization in terms of matrix factorizations of $Z$, i.e.

$\|Z\|_{\max} := \min_{LR^T = Z} \max\left\{ \|L\|_{2,\infty}^2,\ \|R\|_{2,\infty}^2 \right\},$   (38)

where the number of columns of the factors $L \in \mathbb{R}^{m \times k}$ and $R \in \mathbb{R}^{n \times k}$ is not constrained [LRS+10]. Here $\|L\|_{2,\infty}$ is the maximum $\ell_2$-norm of any row $L_{i:}$ of $L$, that is $\|L\|_{2,\infty} := \max_i \|L_{i:}\|_2 = \max_i \sqrt{\sum_k L_{ik}^2}$. Compared to the nuclear norm, we therefore observe that the max-norm constrains the maximal Euclidean row-norms of any factorization of the original matrix $Z$; see also [SS05].⁷ An alternative formulation of the max-norm was given by [LMSS07] and [SS05], stating that $\|Z\|_{\max} = \min_{LR^T = Z} \left(\max_i \|L_{i:}\|_2\right)\left(\max_i \|R_{i:}\|_2\right)$.

The dual norm to the max-norm, as given in [SS05], is

$\|Z\|_{\max}^* = \max_{\|Y\|_{\max} \le 1} Z \bullet Y = \max_{k,\ l_i \in \mathbb{R}^k, \|l_i\|_2 \le 1,\ r_j \in \mathbb{R}^k, \|r_j\|_2 \le 1} \sum_{i,j} Z_{ij}\, l_i^T r_j,$

where the last equality follows from the characterization (38). The relation of the max-norm to semidefinite optimization (which also explains the naming of the max-norm) is that

$\|Z\|_{\max} = \min_{V,W}\ t \quad \text{s.t.} \quad \begin{pmatrix} V & Z \\ Z^T & W \end{pmatrix} \succeq 0 \ \text{ and } \ V_{ii} \le t\ \forall i \in [m],\ W_{ii} \le t\ \forall i \in [n].$   (39)

Here the two optimization variables range over the symmetric matrices $V \in \mathbb{S}^{m \times m}$ and $W \in \mathbb{S}^{n \times n}$; see for example [LRS+10].
As in the nuclear norm case, this semidefinite characterization will again be the central tool for our algorithmic approach for max-norm regularized problems in the following. The equivalence of the above characterization to the earlier "factorization" formulation (38) is a consequence of the following simple Lemma 29. The Lemma gives a correspondence between the (rectangular) matrices $Z \in \mathbb{R}^{m \times n}$ of bounded max-norm on one hand, and the (symmetric) PSD matrices $X \in \mathbb{S}^{(m+n) \times (m+n)}$ of uniformly bounded diagonal on the other hand.

Lemma 29. For any non-zero matrix $Z \in \mathbb{R}^{m \times n}$ and $t \in \mathbb{R}$: $\|Z\|_{\max} \le t$ if and only if there exist symmetric matrices $V \in \mathbb{S}^{m \times m}$, $W \in \mathbb{S}^{n \times n}$ such that $\begin{pmatrix} V & Z \\ Z^T & W \end{pmatrix} \succeq 0$ and $V_{ii} \le t\ \forall i \in [m]$, $W_{ii} \le t\ \forall i \in [n]$.

Proof. $\Rightarrow$ Using the above characterization (38) of the max-norm, namely $\|Z\|_{\max} = \min_{LR^T = Z} \max\{\|L\|_{2,\infty}^2, \|R\|_{2,\infty}^2\}$, we get that there exist $L, R$ with $LR^T = Z$ such that $\max\{\|L\|_{2,\infty}^2, \|R\|_{2,\infty}^2\} = \max\{\max_i \|L_{i:}\|_2^2, \max_i \|R_{i:}\|_2^2\} \le t$, or in other words we have found a matrix $\begin{pmatrix} LL^T & Z \\ Z^T & RR^T \end{pmatrix} = (L; R)(L; R)^T \succeq 0$ where every diagonal element is at most $t$, that is $\|L_{i:}\|_2^2 = (LL^T)_{ii} \le t\ \forall i \in [m]$, and $\|R_{i:}\|_2^2 = (RR^T)_{ii} \le t\ \forall i \in [n]$.

$\Leftarrow$ As the matrix $\begin{pmatrix} V & Z \\ Z^T & W \end{pmatrix}$ is symmetric and PSD, it can be (Cholesky) factorized to $(L; R)(L; R)^T$ such that $LR^T = Z$ and $\|L_{i:}\|_2^2 = (LL^T)_{ii} \le t\ \forall i \in [m]$ and $\|R_{i:}\|_2^2 = (RR^T)_{ii} \le t\ \forall i \in [n]$, which implies $\|Z\|_{\max} \le t$.

⁷ Note that the max-norm does not coincide with the matrix norm induced by the vector $\|.\|_\infty$-norm, that is $\|Z\|_\infty := \sup_{x \ne 0} \frac{\|Zx\|_\infty}{\|x\|_\infty}$. The latter matrix norm by contrast is known to be the maximum of the row sums of $Z$ (i.e. the $\ell_1$-norms of the rows).
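The "$\Rightarrow$" direction of Lemma 29 is easy to check numerically (a NumPy sketch; as with the nuclear norm, a single factorization only certifies the upper bound in (38)):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5, 4, 3
L = rng.standard_normal((m, k))
R = rng.standard_normal((n, k))
Z = L @ R.T

# t := max{ ||L||_{2,inf}^2, ||R||_{2,inf}^2 }, the largest squared row norm
t = max((L**2).sum(axis=1).max(), (R**2).sum(axis=1).max())

# the stacked factorization gives a PSD matrix whose diagonal is bounded by t,
# exactly the certificate required by Lemma 29 (hence ||Z||_max <= t)
X = np.vstack([L, R]) @ np.vstack([L, R]).T
assert np.allclose(X[:m, m:], Z)
assert np.all(np.linalg.eigvalsh(X) >= -1e-9)
assert np.all(np.diag(X) <= t + 1e-9)
```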
11.4 Optimizing with Bounded Nuclear Norm and Max-Norm

Most of the currently known algorithms for matrix factorizations as well as nuclear norm or max-norm regularized optimization problems, such as (31), (32), (33) or (34), suffer from the following problem: in order to optimize the convex objective function $f(Z)$ while controlling the norm $\|Z\|_*$ or $\|Z\|_{\max}$, the methods instead try to optimize $f(LR^T)$ with respect to both factors $L \in \mathbb{R}^{m \times k}$ and $R \in \mathbb{R}^{n \times k}$, with the corresponding regularization constraint imposed on $L$ and $R$. This approach is of course very tempting, as the constraints on the factors, which originate from the matrix factorization characterizations (35) and (38), are simple and in some sense easier to enforce.

Unhealthy Side-Effects of Factorizing. However, there is a significant price to pay: even if the objective function $f(Z)$ is convex in $Z$, the very same function expressed as a function $f(LR^T)$ of both factor variables $L$ and $R$ becomes a severely non-convex problem, naturally consisting of a large number of saddle points (consider for example just the smallest case $L, R \in \mathbb{R}^{1 \times 1}$ together with the identity function $f(Z) = Z \in \mathbb{R}$). The majority of the currently existing methods, such as for example [RS05, Lin07] and later also [Pat07, ZWSP08, KBV09, TPNT09, IR10, GNHS11], is of this "alternating" gradient descent type, where at each step one of the factors $L$ and $R$ is kept fixed, and the other factor is updated by e.g. a gradient or stochastic gradient step. Therefore, despite working well in many practical applications, all these mentioned methods can get stuck in local minima, and so are theoretically not well justified; see also the discussion in [DeC06]. The same issue also comes up for max-norm optimization, where for example [LRS+10] optimize over the non-convex factorization (38) for bounded max-norm.
Concerning the fixed rank of the factorization, [GG10] have shown that finding the optimum under a rank constraint (even if the rank is one) is NP-hard (here the used function $f$ was the standard squared error on an incomplete matrix). On the positive side, [BM03] have shown that if the rank $k$ of the factors $L$ and $R$ exceeds the rank of the optimum solution $X^*$, then, in some cases, it can be guaranteed that the local minima (or saddle points) are also global minima. However, in nearly all practical applications it is computationally infeasible for the above mentioned methods to optimize with the rank $k$ being in the same order of magnitude as the original matrix size $m$ and $n$ (as e.g. in the Netflix problem, such factors $L, R$ could possibly not even be stored on a single machine⁸).

⁸ Algorithm 6, in contrast, never needs to store a full estimate matrix $X$, but instead just keeps the rank-1 factors $v$ obtained in each step, maintaining a factorized representation of $X$.

Relief: Optimizing Over an Equivalent Convex Problem. Here we simply overcome this problem by using the transformation to semidefinite matrices, which we have outlined in the above Corollary 28 and Lemma 29. These bijections from bounded nuclear and max-norm matrices to the PSD matrices over the corresponding natural convex domains allow us to directly optimize a convex problem, avoiding the factorization problems explained above. We describe this simple trick formally in the next two Subsections 11.4.1 and 11.4.2.

But what if you really need a Matrix Factorization? In some applications (such as for example embeddings or certain collaborative filtering problems) of the above mentioned regularized optimization problems over $f(Z)$, one would still want to obtain the solution (or approximation) $Z$ in a factorized representation, that is $Z = LR^T$.
We note that this is also straightforward to achieve when using our transformation: an explicit factorization of any feasible solution to the transformed problem (20) or (27), if needed, can always be directly obtained, since $X \succeq 0$. Alternatively, algorithms for solving the transformed problem (20) can directly maintain the approximate solution $X$ in a factorized representation (as a sum of rank-1 matrices), as achieved for example by Algorithms 6 and 7.

11.4.1 Optimization with a Nuclear Norm Regularization

Having Lemma 27 at hand, we immediately get to the crucial observation of this section, allowing us to apply Algorithm 6: any optimization problem over bounded nuclear norm matrices (32) is in fact equivalent to a standard bounded-trace semidefinite problem (20). The same transformation also holds for problems with a bound on the weighted nuclear norm, as given in (37).

Corollary 30. Any nuclear norm regularized problem of the form (32) is equivalent to a bounded trace convex problem of the form (20), namely

$\min_{X \in \mathbb{S}^{(m+n) \times (m+n)}} \hat{f}(X) \quad \text{s.t.} \quad \mathrm{Tr}(X) = t,\ X \succeq 0,$   (40)

where $\hat{f}$ is defined by $\hat{f}(X) := f(Z)$ for $Z \in \mathbb{R}^{m \times n}$ being the upper-right part of the symmetric matrix $X$. Formally, we again think of $X \in \mathbb{S}^{(m+n) \times (m+n)}$ as consisting of the four parts $X =: \begin{pmatrix} V & Z \\ Z^T & W \end{pmatrix}$ with $V \in \mathbb{S}^{m \times m}$, $W \in \mathbb{S}^{n \times n}$ and $Z \in \mathbb{R}^{m \times n}$.

Here "equivalent" means that for any feasible point of one problem, we have a feasible point of the other problem, attaining the same objective value. The only difference to the original formulation (20) is that the function argument $X$ needs to be rescaled by $\frac{1}{t}$ in order to have unit trace, which however is a very simple operation in practical applications.
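The transformation of Corollary 30 is mechanical; the following sketch illustrates it (the helper `make_fhat` and the example loss are our illustrative names, not the paper's):

```python
import numpy as np

def make_fhat(f, m, n, t):
    # fhat(X) := f(Z) for Z the upper-right m x n block of X; following the
    # remark above, the unit-trace argument X is first rescaled by t
    def fhat(X):
        Z = t * X[:m, m:]
        return f(Z)
    return fhat

# example: squared distance to a target matrix M (an illustrative loss)
rng = np.random.default_rng(0)
m, n, t = 3, 2, 4.0
M = rng.standard_normal((m, n))
f = lambda Z: 0.5 * np.sum((Z - M) ** 2)
fhat = make_fhat(f, m, n, t)

# a feasible unit-trace PSD point: X = s s^T with ||s||_2 = 1
s = rng.standard_normal(m + n)
s /= np.linalg.norm(s)
X = np.outer(s, s)
assert np.isclose(np.trace(X), 1.0)
assert np.isclose(fhat(X), f(t * X[:m, m:]))
```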
Therefore, we can directly apply Hazan's Algorithm 6 for any nuclear norm regularized problem as follows:

Algorithm 8 Nuclear Norm Regularized Solver
Input: A convex nuclear norm regularized problem (32), target accuracy $\varepsilon$
Output: $\varepsilon$-approximate solution for problem (32)
1. Consider the transformed symmetric problem for $\hat{f}$, as given by Corollary 30.
2. Adjust the function $\hat{f}$ so that it first rescales its argument by $t$.
3. Run Hazan's Algorithm 6 for $\hat{f}(X)$ over the domain $X \in \mathcal{S}$.

Using our analysis of Algorithm 6 from Section 7.1, we see that Algorithm 8 runs in time near linear in the number $N_f$ of non-zero entries of the gradient $\nabla f$. This makes it very attractive in particular for recommender systems applications and matrix completion, where $\nabla f$ is a sparse matrix (with the same sparsity pattern as the observed entries), which we will discuss in more detail in Section 11.5.

Corollary 31. After at most $O(\frac{1}{\varepsilon})$ many iterations (i.e. approximate eigenvalue computations), Algorithm 8 obtains a solution that is $\varepsilon$ close to the optimum of (32). The algorithm requires a total of $\tilde{O}\!\left(\frac{N_f}{\varepsilon^{1.5}}\right)$ arithmetic operations (with high probability).

Proof. We use the transformation from Corollary 30 and then rescale all matrix entries by $\frac{1}{t}$. The result then follows from Corollary 20 on page 31 on the running time of Hazan's algorithm.

The fact that each iteration of our algorithm is computationally very cheap, consisting only of the computation of an approximate eigenvector, strongly contrasts with the existing "proximal gradient" and "singular value thresholding" methods [GLW+09, JY09, MGC09, LST09, CCS10, TY10], which in each step need to compute an entire SVD. Such a single incomplete SVD computation (first $k$ singular vectors) amounts to the same computational cost as an entire run of our algorithm (for $k$ steps).
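The core iteration behind Algorithm 6 (and hence step 3 of Algorithm 8) can be sketched in a few lines. This is a simplified rendering under several assumptions of ours: we use an exact top eigenvector via `eigh` where the paper only needs an approximate one, the step size 2/(k+2) follows the standard Frank-Wolfe schedule, and the example objective (projecting a symmetric matrix onto the unit-trace spectahedron) is purely illustrative:

```python
import numpy as np

def hazan_spectahedron(grad, d, iters=500):
    # Frank-Wolfe on the spectahedron {X PSD, Tr(X) = 1}: each step moves
    # toward v v^T, with v the top eigenvector of -grad f(X). No projection
    # is needed: convex combinations of unit-trace PSD matrices stay feasible.
    X = np.eye(d) / d
    for k in range(iters):
        G = grad(X)
        _, V = np.linalg.eigh(-G)
        v = V[:, -1]                               # top eigenvector of -grad f(X)
        alpha = 2.0 / (k + 2.0)
        X = (1 - alpha) * X + alpha * np.outer(v, v)
    return X

# example objective: f(X) = 0.5 * ||X - M||_Fro^2, with gradient X - M
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
M = (B + B.T) / 2
X = hazan_spectahedron(lambda X: X - M, 6)
assert np.isclose(np.trace(X), 1.0)                # stays on the spectahedron
assert np.all(np.linalg.eigvalsh(X) >= -1e-9)      # stays PSD
```

Each update adds a single rank-1 matrix, so after $k$ steps the iterate has rank at most $k+1$ and is stored implicitly by its factors, matching the low-rank and memory properties claimed above.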
Furthermore, those existing methods which come with a theoretical guarantee assume in their analysis that all SVDs used during the algorithm are exact, which is not feasible in practice. By contrast, our analysis is rigorous even if the used eigenvectors are only $\varepsilon'$-approximate. Another nice property of Hazan's method is that the returned solution is guaranteed to be simultaneously of low rank ($k$ after $k$ steps), and that by incrementally adding the rank-1 matrices $v_k v_k^T$, the algorithm automatically maintains a matrix factorization of the approximate solution. Also, Hazan's algorithm, being an instance of our presented general framework, is designed to automatically stay within the feasible region $\mathcal{S}$, whereas most of the existing methods do need a projection step to get back to the feasible region (as e.g. [Lin07, LST09]), making both the theoretical analysis and the implementation more complicated.

11.4.2 Optimization with a Max-Norm Regularization

The same approach works analogously for the max-norm, by using Lemma 29 in order to apply Algorithm 7: any optimization problem over bounded max-norm matrices (34) is in fact equivalent to a semidefinite problem (27) over the "box" of matrices where each element on the diagonal is bounded above by $t$. We think of this domain as generalizing the positive cube of vectors to the PSD matrices.

Corollary 32. Any max-norm regularized problem of the form (34) is equivalent to a bounded diagonal convex problem of the form (27), i.e.

$\min_{X \in \mathbb{S}^{(m+n) \times (m+n)}} \hat{f}(X) \quad \text{s.t.} \quad X_{ii} \le 1\ \forall i,\ X \succeq 0,$   (41)

where $\hat{f}$ is defined by $\hat{f}(X) := f(Z)$ for $Z \in \mathbb{R}^{m \times n}$ being the upper-right part of the symmetric matrix $X$. Formally, we again think of any $X \in \mathbb{S}^{(m+n) \times (m+n)}$ as consisting of the four parts $X =: \begin{pmatrix} V & Z \\ Z^T & W \end{pmatrix}$ with $V \in \mathbb{S}^{m \times m}$, $W \in \mathbb{S}^{n \times n}$ and $Z \in \mathbb{R}^{m \times n}$.
Again, the only difference to the original formulation (27) is that the function argument $X$ needs to be rescaled by $\frac{1}{t}$ in order to have the diagonal bounded by one, which however is a very simple operation in practical applications. This means we can directly apply Algorithm 7 for any max-norm regularized problem as follows:

Algorithm 9 Max-Norm Regularized Solver
Input: A convex max-norm regularized problem (34), target accuracy $\varepsilon$
Output: $\varepsilon$-approximate solution for problem (34)
1. Consider the transformed symmetric problem for $\hat{f}$, as given by Corollary 32.
2. Adjust the function $\hat{f}$ so that it first rescales its argument by $t$.
3. Run Algorithm 7 for $\hat{f}(X)$ over the domain $\{X \succeq 0,\ X_{ii} \le 1\ \forall i\}$.

Using the analysis of our new Algorithm 7 from Section 7.1, we obtain the following guarantee:

Corollary 33. After $\left\lceil \frac{8 C_f}{\varepsilon} \right\rceil$ many iterations, Algorithm 9 obtains a solution that is $\varepsilon$ close to the optimum of (34).

Proof. We use the transformation from Corollary 32 and then rescale all matrix entries by $\frac{1}{t}$. The running time of the algorithm then follows from Theorem 25.

Maximum Margin Matrix Factorizations. In the case of matrix completion, the "loss" function $f$ is defined as measuring the error from $X$ to some fixed observed matrix, but only at a small fixed set of "observed" positions of the matrices. As we already mentioned, semidefinite optimization over $X$ as above can always be interpreted as finding a matrix factorization, as a symmetric PSD matrix $X$ always has a (unique) Cholesky factorization. Now for the setting of matrix completion, it is known that the above described optimization task under bounded max-norm can be geometrically interpreted as learning a maximum margin separating hyperplane for each user/movie. In other words, the factorization problem decomposes into a collection of SVMs, one for each user or movie, if we think of the corresponding other factor as being fixed for a moment [SRJ04].
We will discuss matrix completion in more detail in Section 11.5.

Other Applications of Max-Norm Optimization. Apart from matrix completion, optimization problems employing the max-norm have other prominent applications in spectral methods, spectral graph properties, low-rank recovery, and combinatorial problems such as Max-Cut.

11.5 Applications

Our Algorithm 8 directly applies to arbitrary nuclear norm regularized problems of the form (32). Since the nuclear norm is in a sense the most natural generalization of the sparsity-inducing $\ell_1$-norm to the case of low rank matrices (see also the discussion in the previous chapters), there are many applications of this class of optimization problems.

11.5.1 Robust Principal Component Analysis

One prominent example of a nuclear norm regularized problem in the area of dimensionality reduction is given by the technique of robust PCA as introduced by [CLMW11], also called principal component pursuit, which is the optimization task

$\min_{Z \in \mathbb{R}^{m \times n}} \|Z\|_* + \mu \|M - Z\|_1.$   (42)

Here $M \in \mathbb{R}^{m \times n}$ is the given data matrix, and $\|.\|_1$ denotes the entry-wise $\ell_1$-norm. By considering the equivalent constrained variant $\|Z\|_* \le \frac{t}{2}$ instead, we obtain a problem of the form (32), suitable for our Algorithm 8. However, since the original objective function $f(Z) = \|M - Z\|_1$ is not differentiable, a smoothed version of the $\ell_1$-norm has to be used instead. This situation is analogous to the hinge-loss objective in maximum margin matrix factorization [SRJ04]. Existing algorithms for robust PCA usually require a complete (and exact) SVD in each iteration, as e.g. [TY10, AGI11], and are often harder to analyze compared to our approach. The first algorithm with a convergence guarantee of $O(\frac{1}{\varepsilon})$ was given by [AGI11], requiring an SVD computation per step.
Our Algorithm 8 obtains the same guarantee in the same order of steps, but only requires a single approximate eigenvector computation per step, which is significantly cheaper. Last but not least, the fact that our algorithm delivers approximate solutions to (42) of rank $O(\frac{1}{\varepsilon})$ will be interesting for practical dimensionality reduction applications, as it re-introduces the important concept of low-rank factorizations as in classical PCA. In other words, our algorithm produces an embedding into at most $O(\frac{1}{\varepsilon})$ many new dimensions, which is much easier to deal with in practice compared to the full rank-$n$ solutions resulting from the existing solvers for robust PCA; see e.g. [CLMW11] and the references therein. We did not yet perform practical experiments for robust PCA, but chose to demonstrate the practical performance of Algorithm 6 for matrix completion problems first.

11.5.2 Matrix Completion and Low Norm Matrix Factorizations

For matrix completion problems, as for example in collaborative filtering and recommender systems [KBV09], our algorithm is particularly suitable as it retains the sparsity of the observations and constructs the solution in a factorized way. In the setting of a partially observed matrix such as in the Netflix case, the loss function $f(X)$ only depends on the observed positions, which are very sparse, so $\nabla f(X)$, which is all we need for our algorithm, is also sparse. We want to approximate a partially given matrix $Y$ (let $\Omega$ be the set of known training entries of the matrix) by a product $Z = LR^T$ such that some convex loss function $f(Z)$ is minimized. By $\Omega_{test}$ we denote the unknown test entries of the matrix we want to predict.

Complexity.
Just recently it has been shown that the standard low-rank matrix completion problem, that is, finding the best approximation to an incomplete matrix in the standard $\ell_2$-norm, is an NP-hard problem if the rank of the approximation is constrained. The hardness is claimed to hold even for the rank-1 case [GG10]. In the light of this hardness result, the advantage of relaxing the rank by replacing it with the nuclear norm (or max-norm) is even more evident. Our near linear time Algorithm 8 relies on a convex optimization formulation and does indeed deliver a guaranteed $\varepsilon$-accurate solution for the nuclear norm regularization, for arbitrary $\varepsilon > 0$. Such a guarantee is lacking for the "alternating" descent heuristics such as [RS05, Lin07, Pat07, ZWSP08, KBV09, TPNT09, IR10, GNHS11, SS10, LRS+10, RR11] (which build upon the non-convex factorized versions (35) and (38) while constraining the rank of the used factors $L$ and $R$).

Different Regularizations. Regularization by the weighted nuclear norm is observed by [SS10] to provide better generalization performance than the classical nuclear norm. As it can simply be reduced to the nuclear norm, see Section 11.2.1, our Algorithm 8 can directly be applied in the weighted case as well. On the other hand, experimental evidence also shows that the max-norm sometimes provides better generalization performance than the nuclear norm [SRJ04, LRS+10]. For any convex loss function, our Algorithm 9 solves the corresponding max-norm regularized matrix completion task.

Different Loss Functions. Our method applies to any convex loss function on a low norm matrix factorization problem, and we will mention two loss functions in particular: Maximum Margin Matrix Factorization (MMMF) [SRJ04] can directly be solved by our Algorithm 8.
Here the original (soft margin) formulation is the trade-off formulation (31) with f(Z) := Σ_{ij∈Ω} |Z_ij − y_ij| being the hinge or ℓ1-loss. Because this function is not differentiable, the authors recommend using the differentiable smoothed hinge loss instead.

When using the standard squared loss function f(Z) := Σ_{ij∈Ω} (Z_ij − y_ij)², the problem is known as Regularized Matrix Factorization [Wu07], and both our algorithms directly apply. This loss function is widely used in practice, has a very simple gradient, and is the natural matrix generalization of the ℓ2-loss (notice the analogous Lasso and regularized least squares formulations). The same function is known as the rooted mean squared error, which was the quality measure used in the Netflix competition. We write RMSE_train and RMSE_test for the rooted error on the training ratings Ω and test ratings Ω_test respectively.

Running time and memory. From Corollary 31 we have that the running time of our nuclear norm optimization Algorithm 8 is linear in the size of the input: each matrix-vector multiplication in Lanczos' or the power method costs exactly |Ω| (the number of observed positions of the matrix) operations, and we know that in total we need at most O(1/ε^1.5) many such matrix-vector multiplications. Also the memory requirements are very small: either we store the entire factorization of X^(k) (meaning the O(1/ε) many vectors v^(k)) — which is still much smaller than the full matrix X — or instead we only update and store the prediction values X^(k)_ij for ij ∈ Ω ∪ Ω_test in each step. This, together with the known ratings y_ij, determines the sparse gradient matrix ∇f(X^(k)) during the algorithm. Therefore, the total memory requirement is only |Ω ∪ Ω_test| (the size of the output) plus the size (n + m) of a single feature vector v.

The constant C_f in the running time of Algorithm 6.
One might ask whether the constant hidden in the O(1/ε) number of iterations is indeed controllable. Here we show that for the standard squared error on any fixed set of observed entries Ω, this is indeed the case. For more details on the constant C_f, we refer the reader to Sections 3.4 and 7.1.

Lemma 34. For the squared error f(Z) = 1/2 Σ_{ij∈Ω} (Z_ij − y_ij)² over the spectahedron S, it holds that C_f̂ ≤ 1.

Proof. In Lemma 6, we have seen that the constant C_f̂ is upper bounded by half the diameter of the domain, times the largest eigenvalue of the Hessian ∇²f̂. Here we consider f̂ as a function on the vectors in R^{n²} corresponding to the matrices X ∈ S^{n×n}. For the squared error as in our case here, the Hessian is a diagonal matrix: one can directly compute that the diagonal entries of ∇²f̂ are 1 at the entries corresponding to Ω, and zero everywhere else. Furthermore, the squared diameter of the spectahedron is upper bounded by 2, as we have shown in Lemma 23. Therefore C_f̂ ≤ 1 for the domain S.

If the domain is the scaled spectahedron t·S as used in our Algorithm 8, then the squared diameter of the domain is 2t², compare to Lemma 23. This means that the curvature is upper bounded by C_f̂ ≤ t² in this case. Alternatively, the same bound for the curvature of f̃(X) := f̂(tX) can be obtained along the same lines as for the spectahedron domain in the previous lemma; the same factor of t² appears as the scaling factor of the Hessian, resulting from the chain rule for taking derivatives.

11.5.3 The Structure of the Resulting Eigenvalue Problems

For the actual computation of the approximate largest eigenvector in Algorithm 6, i.e. the internal procedure ApproxEV(−∇f̂(X^(k)), 2C_f̂/(k+2)), either Lanczos' method or the power method (as in PageRank, see e.g. [Ber05]) can be used.
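As a minimal illustration of such an ApproxEV step, the following sketch (hypothetical function and variable names, not the paper's actual implementation) runs the power method on a shifted matrix; the diagonal shift makes the iteration converge to the eigenvector of the largest eigenvalue rather than the one largest in absolute value:

```python
import numpy as np

def approx_ev(M, iters, shift, rng):
    """Approximate the eigenvector of the largest eigenvalue of the
    symmetric matrix M by power iterations on M + shift*I."""
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = M @ v + shift * v          # (M + shift*I) @ v, without forming the shift
        v /= np.linalg.norm(v)
    return v

# Sanity check on a matrix with a known spectrum, where the largest
# eigenvalue (1.0) is NOT the largest one in absolute value (-3.0):
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthogonal basis
M = Q @ np.diag([1.0, 0.2, -0.1, -3.0]) @ Q.T
v = approx_ev(M, iters=300, shift=3.0, rng=rng)
print(abs(v @ Q[:, 0]))                # close to 1: aligned with the top eigenvector
```

Without the shift, plain power iteration on M would instead align with the eigenvector of −3.0, the eigenvalue of largest magnitude.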
In our Theorem 18 of Section 7.1, we stated that both the power method as well as Lanczos' algorithm provably obtain the required approximation quality in a bounded number of steps if the matrix is PSD, with high probability, see also [KW92, AHK05]. Both methods are known to scale well to very large problems and can be parallelized easily, as each iteration consists of just one matrix-vector multiplication. However, we have to be careful that we obtain the eigenvector for the largest eigenvalue, which is not necessarily the principal one (largest in absolute value). In that case the spectrum can be shifted by adding an appropriate constant to the diagonal of the matrix.

For an arbitrary loss function f, the gradient −∇f̂(X), which is the matrix whose largest eigenvector we have to compute in the algorithm, is always a symmetric matrix of the block form

    ∇f̂(X) = [ 0    G
              G^T  0 ]   for G = ∇f(Z), when X =: [ V    Z
                                                   Z^T  W ].

In other words, ∇f̂(X) is the adjacency matrix of a weighted bipartite graph. One vertex class corresponds to the n rows of the original matrix (users in recommender systems), the other class corresponds to the m columns (items or movies). It is easy to see that the spectrum of ∇f̂ is always symmetric: whenever (v; w) is an eigenvector for some eigenvalue λ, then (v; −w) is an eigenvector for −λ. Hence, we have exactly the same setting as in the established Hubs and Authorities (HITS) model [Kle99]: the first part v of any eigenvector is always an eigenvector of the hub matrix GG^T, and the second part w is an eigenvector of the authority matrix G^T G.

Repeated squaring. In the special case that the matrix G is very rectangular (n ≪ m or n ≫ m), one of the two square matrices G^T G or GG^T is very small.
Then it is known that one can obtain an exponential speed-up in the power method by repeatedly squaring the smaller one of the two matrices, analogously to the "square and multiply" approach for computing large integer powers of real numbers. In other words, we can perform O(log(1/ε)) many matrix-matrix multiplications instead of O(1/ε) matrix-vector multiplications.

11.5.4 Relation to Simon Funk's SVD Method

Interestingly, our proposed framework can also be seen as a theoretically justified variant of Simon Funk's [Web06] and related approximate SVD methods, which were used as a building block by most of the teams participating in the Netflix competition (including the winning team). Those methods have been further investigated by [Pat07, TPNT09] and also [KBC07], which already proposed a heuristic using the HITS formulation. These approaches are algorithmically extremely similar to our method, although they are aimed at a slightly different optimization problem, and do not directly guarantee bounded nuclear norm. Recently, [SS10] observed that Funk's algorithm can be seen as stochastic gradient descent to optimize (31) when the regularization term is replaced by a weighted variant of the nuclear norm.

Simon Funk's method considers the standard squared loss function f̂(X) = 1/2 Σ_{ij∈S} (X_ij − y_ij)², and finds the new rank-1 estimate (or feature) v by iterating

    v := v + λ(−∇f̂(X)v − Kv),   or equivalently   v := λ(−∇f̂(X) + (1/λ − K)·I)v,   (43)

a fixed number of times. Here λ is a small fixed constant called the learning rate. Additionally, a decay rate K > 0 is used for regularization, i.e. to penalize the magnitude of the resulting feature v.
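The behaviour of update (43) can be illustrated on a small example (a hedged sketch with hypothetical names; `A` stands in for −∇f̂(X), built with a known spectrum so the outcome is easy to check). Because the shift 1/λ − K is large for small λ, the iteration behaves like a shifted power step and aligns with the eigenvector of the largest (not largest-absolute) eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))    # random orthogonal basis
mu = np.array([1.5, 0.6, -0.3, -0.9, -2.0])         # largest eigenvalue 1.5, largest |.| is -2.0
A = Q @ np.diag(mu) @ Q.T                           # plays the role of -grad f_hat(X)

lam, K = 0.1, 0.02                                  # learning rate and decay rate
v = rng.standard_normal(5)
for _ in range(500):
    v = lam * (A + (1.0 / lam - K) * np.eye(5)) @ v   # update (43)
    v /= np.linalg.norm(v)                          # rescale for numerical stability only

w = Q[:, 0]                                         # eigenvector of the largest eigenvalue 1.5
print(abs(v @ w))                                   # close to 1: the iterate aligns with it
```

The explicit normalization is only there to keep the iterate bounded; in Funk's actual method the magnitude of v is instead governed by λ, K and the number of iterations, which is exactly the sensitivity discussed next.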
This matrix-vector multiplication formulation (43) is equivalent to a step of the power method applied within our framework⁹, and for small enough learning rates λ the resulting feature vector will converge to the largest eigenvector of −∇f̂(Z). However, in Funk's method, the magnitude of each new feature strongly depends on the starting vector v₀, the number of iterations, the learning rate λ as well as the decay K, making the convergence very sensitive to these parameters. This might be one of the reasons that so far no results on the convergence speed could be obtained. Our method is free of these parameters; the k-th new feature vector is always a unit vector scaled by 1/√k. Also, we keep the Frobenius norm ‖U‖²_Fro + ‖V‖²_Fro of the obtained factorization exactly fixed during the algorithm, whereas in Funk's method — which has a different optimization objective — this norm strictly increases with every newly added feature. Our described framework therefore gives a theoretically justified variant of the experimentally successful method [Web06] and its related variants such as [KBC07, Pat07, TPNT09].

11.6 Experimental Results

We ran our algorithm on the following standard datasets¹⁰ for matrix completion problems, using the squared error function.

    dataset         #ratings  n       m
    MovieLens 100k  10^5      943     1682
    MovieLens 1M    10^6      6040    3706
    MovieLens 10M   10^7      69878   10677
    Netflix         10^8      480189  17770

Any eigenvector method can be used as a black box in our algorithm. To keep the experiments simple, we used the power method¹¹, and performed 0.2·k power iterations in step k. If not stated otherwise, the only optimization we used is the improvement by averaging the old and new gradient as explained in Section 7.3. All results were obtained by our (single-thread) implementation in Java 6 on a 2.4 GHz Intel C2D laptop.

Sensitivity.
The generalization performance of our method is relatively stable under changes of the regularization parameter, see Figure 2.

Movielens. Table 1 reports the running times of our algorithm on the three MovieLens datasets. Our algorithm gives an about 5.6-fold speed increase over the timings reported by [TY10], which is a very similar method to [JY09]. [TY10] already improves over the "singular value thresholding" methods [CCS10] and [MGC09]. For MMMF, [RS05] report an optimization time of about 5 hours on the 1M dataset, but use the different smoothed hinge loss function, so that the results cannot be compared directly. [MGC09], [SJ03] and [JY09] only obtained results on much smaller datasets.

In the following experiments on the MovieLens and Netflix datasets, we have pre-normalized all training ratings to the simple average (µ_i + µ_j)/2 of the user and movie mean values, for the sake of being consistent with the comparable literature.

⁹ Another difference of our method to Simon Funk's lies in the stochastic gradient descent type of the latter, i.e. "immediate feedback": during each matrix multiplication, it already takes the modified current feature v into account when calculating the loss f̂(Z), whereas our Algorithm 6 alters Z only after the eigenvector computation is finished.
¹⁰ See www.grouplens.org and archive.ics.uci.edu/ml.
¹¹ We used the power method starting with the uniform unit vector. 1/2 of the approximate eigenvalue corresponding to the previously obtained feature v_{k−1} was added to the matrix diagonal to ensure good convergence.

Figure 2: Sensitivity of the method to the choice of the regularization parameter t in (32), on MovieLens 1M. [Plot of RMSE_test versus trace regularization t, at k = 1000.]

Table 1: Running times t_our (in seconds) of our algorithm on the three MovieLens datasets, compared to the reported timings t_TY of [TY10].
The ratings {1, ..., 5} were used as-is and not normalized to any user and/or movie means. In accordance with [TY10], 50% of the ratings were used for training, the others were used as the test set. Here NMAE is the mean absolute error, times 1/(5−1), over the total set of ratings. k is the number of iterations of our algorithm, #mm is the total number of sparse matrix-vector multiplications performed, and tr is the used trace parameter t in (32). They used Matlab/PROPACK on an Intel Xeon 3.20 GHz processor.

          NMAE   t_TY  t_our  k   #mm  tr
    100k  0.205  7.39  0.156  15  33   9975
    1M    0.176  24.5  1.376  35  147  36060
    10M   0.164  202   36.10  65  468  281942

For MovieLens 10M, we used partition r_b provided with the dataset (10 test ratings per user). The regularization parameter t was set to 48333. We obtained a RMSE_test of 0.8617 after k = 400 steps, in a total running time of 52 minutes (16291 matrix multiplications). Our best RMSE_test value was 0.8573, compared to 0.8543 obtained by [LU09] using their non-linear improvement of MMMF.

Algorithm Variants. Comparing the proposed algorithm variants from Section 7.3, Figure 3 demonstrates moderate improvements compared to our original Algorithm 8.

Netflix. Table 2 compares our method to the two "hard impute" and "soft impute" singular value thresholding methods of [MHT10] on the Netflix dataset, where they used Matlab/PROPACK on an Intel Xeon 3 GHz processor. The "soft impute" variant uses a constrained rank heuristic in each update step, and an "un-shrinking" or fitting heuristic as post-processing. Both are advantages for their method, and were not used in our implementation. Nevertheless, our algorithm seems to perform competitively compared to the reported timings of [MHT10].

Note that the primary goal of this experimental section is not to compete with the prediction quality of the best engineered recommender systems (which are usually ensemble methods, i.e.
combinations of many different individual methods). We just demonstrate that our method solves nuclear norm regularized problems of the form (32) on large sample datasets, obtaining strong performance improvements.

Figure 3: Improvements for the two algorithm variants described in Section 7.3, when running on MovieLens 10M. [Plot of RMSE versus iteration k for the variants "1/k", "best on line segm." and "gradient interp."; the thick lines indicate the error on the test set, while the thinner lines indicate the training error.]

Table 2: Running times t_our (in hours) of our algorithm on the Netflix dataset, compared to the reported timings t_{M,hard} for "hard impute" by [MHT09] and t_{M,soft} for "soft impute" by [MHT10].

    RMSE_test  t_{M,hard}  t_{M,soft}  t_our  k    #mm   tr
    0.986      3.3         n.a.        0.144  20   50    99592
    0.977      5.8         n.a.        0.306  30   109   99592
    0.965      6.6         n.a.        0.504  40   185   99592
    0.962      n.a.        1.36        1.08   45   243   174285
    0.957      n.a.        2.21        1.69   60   416   174285
    0.954      n.a.        2.83        2.68   80   715   174285
    0.9497     n.a.        3.27        6.73   135  1942  174285
    0.9478     n.a.        n.a.        13.6   200  4165  174285

11.7 Conclusion

We have introduced a new method to solve arbitrary convex problems with a nuclear norm regularization, which is simple to implement and to parallelize. The method is parameter-free and comes with a convergence guarantee. This guarantee is, to our knowledge, the first guaranteed convergence result for the class of Simon-Funk-type algorithms, as well as the first algorithm with a guarantee for max-norm regularized problems.

It remains to investigate whether our algorithm can be applied to other matrix factorization problems, such as (potentially only partially observed) low-rank approximations to kernel matrices as used e.g. by the PSVM technique [CZW+07], regularized versions of latent semantic analysis (LSA), or non-negative matrix factorization [Wu07].
References

[AGI11] Necdet Serhat Aybat, Donald Goldfarb, and Garud Iyengar. Fast First-Order Methods for Stable Principal Component Pursuit. arXiv math.OC, May 2011.
[AHK05] Sanjeev Arora, Elad Hazan, and Satyen Kale. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. FOCS 2005 - 46th Annual IEEE Symposium on Foundations of Computer Science, pages 339–348, 2005.
[APV02] Pankaj Agarwal, Cecilia Procopiuc, and Kasturi Varadarajan. Approximation Algorithms for k-Line Center. In Algorithms — ESA 2002, pages 425–432. 2002.
[BB09] Michel Baes and Michael Buergisser. Smoothing techniques for solving semidefinite programs with many constraints. Optimization Online, 2009.
[BC07] Mihai Bădoiu and Kenneth L Clarkson. Optimal core-sets for balls. Computational Geometry: Theory and Applications, 40(1):14–22, 2007.
[Ber05] Pavel Berkhin. A survey on PageRank computing. Internet Mathematics, 2(1):73, 2005.
[BHPI02] Mihai Bădoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. STOC '02: Proceedings of the thirty-fourth annual ACM Symposium on Theory of Computing, 2002.
[BJMO11] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with Sparsity-Inducing Penalties. Technical report, August 2011.
[BL06] Jonathan M Borwein and Adrian S Lewis. Convex analysis and nonlinear optimization: theory and examples. CMS books in mathematics. Springer, 2006.
[BLJ04] Francis Bach, Gert R.G. Lanckriet, and Michael I Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. ICML '04: Proceedings of the twenty-first international conference on Machine Learning, 2004.
[BM03] Samuel Burer and Renato D C Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.
[Bot10] Léon Bottou.
Large-Scale Machine Learning with Stochastic Gradient Descent. In COMPSTAT'2010 - Proceedings of the 19th International Conference on Computational Statistics, pages 177–187, 2010.
[BSS09] Joshua Batson, Daniel Spielman, and Nikhil Srivastava. Twice-Ramanujan sparsifiers. STOC '09: Proceedings of the 41st annual ACM Symposium on Theory of Computing, 2009.
[BT03] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
[BT09] Amir Beck and Marc Teboulle. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1):183, 2009.
[BTMN01] Aharon Ben-Tal, Tamar Margalit, and Arkadi Nemirovski. The Ordered Subsets Mirror Descent Optimization Method with Applications to Tomography. SIAM Journal on Optimization, 12(1):79, 2001.
[BTN05] Aharon Ben-Tal and Arkadi Nemirovski. Non-Euclidean restricted memory level method for large-scale convex optimization. Mathematical Programming, 102(3):407–456, 2005.
[BV04] Stephen P Boyd and Lieven Vandenberghe. Convex Optimization. 2004.
[CCS10] Jian-Feng Cai, Emmanuel J Candes, and Zuowei Shen. A Singular Value Thresholding Algorithm for Matrix Completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[CDS98] Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic Decomposition by Basis Pursuit. SIAM Journal on Scientific Computing, 20(1):33, 1998.
[Cla10] Kenneth L Clarkson. Coresets, Sparse Greedy Approximation, and the Frank-Wolfe Algorithm. ACM Transactions on Algorithms, 6(4):1–30, 2010.
[CLMW11] Emmanuel J Candes, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3), May 2011.
[CR09] Emmanuel J Candes and Benjamin Recht. Exact Matrix Completion via Convex Optimization.
Foundations of Computational Mathematics, 9(6):717–772, 2009.
[CT10] Emmanuel J Candes and Terence Tao. The Power of Convex Relaxation: Near-Optimal Matrix Completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[CZW+07] Edward Chang, Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, and Hang Cui. PSVM: Parallelizing Support Vector Machines on Distributed Computers. In NIPS '07: Advances in Neural Information Processing Systems 20, pages 257–264, 2007.
[d'A08] Alexandre d'Aspremont. Smooth Optimization with Approximate Gradient. SIAM Journal on Optimization, 19(3):1171, 2008.
[DeC06] Dennis DeCoste. Collaborative prediction using ensembles of Maximum Margin Matrix Factorizations. ICML '06: Proceedings of the 23rd International Conference on Machine Learning, 2006.
[DH78] Joseph C Dunn and S Harshbarger. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2):432–444, 1978.
[Dun80] Joseph C Dunn. Convergence Rates for Conditional Gradient Sequences Generated by Implicit Step Length Rules. SIAM Journal on Control and Optimization, 18(5):473, 1980.
[FHB01] Maryam Fazel, Haitham Hindi, and Stephen P Boyd. A Rank Minimization Heuristic with Application to Minimum Order System Approximation. Proceedings American Control Conference, 6:4734–4739, 2001.
[FNW07] Mario A T Figueiredo, Robert D Nowak, and Stephen J Wright. Gradient Projection for Sparse Reconstruction: Application to Compressed Sensing and Other Inverse Problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–597, 2007.
[FW56] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.
[GG10] Nicolas Gillis and François Glineur. Low-Rank Matrix Approximation with Weights or Missing Data is NP-hard. arXiv math.OC, 2010.
[Gil66] Elmer G Gilbert. An Iterative Procedure for Computing the Minimum of a Quadratic Form on a Convex Set. SIAM Journal on Control, 4(1):61–80, 1966.
[GJ09] Bernd Gärtner and Martin Jaggi. Coresets for polytope distance. SCG '09: Proceedings of the 25th Annual Symposium on Computational Geometry, 2009.
[GJL10] Joachim Giesen, Martin Jaggi, and Sören Laue. Approximating Parameterized Convex Optimization Problems. In ESA 2010 - Proceedings of the 18th annual European Conference on Algorithms: Part I, pages 524–535. LNCS, 2010.
[GJL12a] Joachim Giesen, Martin Jaggi, and Sören Laue. Approximating Parameterized Convex Optimization Problems. To appear in ACM Transactions on Algorithms, 2012.
[GJL12b] Joachim Giesen, Martin Jaggi, and Sören Laue. Regularization Paths with Guarantees for Convex Semidefinite Optimization. To appear in AISTATS - Fifteenth International Conference on Artificial Intelligence and Statistics, 2012.
[GLW+09] Arvind Ganesh, Zhouchen Lin, John Wright, Leqin Wu, Minming Chen, and Yi Ma. Fast Algorithms for Recovering a Corrupted Low-Rank Matrix. In CAMSAP - 3rd IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, pages 213–216, 2009.
[GM11] Bernd Gärtner and Jiří Matoušek. Approximation Algorithms and Semidefinite Programming. Springer, December 2011.
[GNHS11] Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent. In KDD '11 - Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011.
[GW95] Michel Goemans and David Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6), 1995.
[Haz08] Elad Hazan. Sparse Approximate Solutions to Semidefinite Programs.
In LATIN 2008, pages 306–316. Springer, 2008.
[Haz11] Elad Hazan. The convex optimization approach to regret minimization. In Optimization for Machine Learning. ie.technion.ac.il, 2011.
[IR10] Alexander Ilin and Tapani Raiko. Practical Approaches to Principal Component Analysis in the Presence of Missing Values. Journal of Machine Learning Research, pages 1–44, 2010.
[Jag11] Martin Jaggi. Sparse Convex Optimization Methods for Machine Learning. PhD thesis, ETH Zürich, 2011.
[Jon92] Lee K Jones. A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training. The Annals of Statistics, 20(1):608–613, 1992.
[JS10] Martin Jaggi and Marek Sulovský. A Simple Algorithm for Nuclear Norm Regularized Problems. ICML 2010: Proceedings of the 27th International Conference on Machine Learning, 2010.
[JY09] Shuiwang Ji and Jieping Ye. An accelerated gradient method for trace norm minimization. ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, 2009.
[Kal07] Satyen Kale. Efficient Algorithms using the Multiplicative Weights Update Method. PhD thesis, cs.princeton.edu, 2007.
[KBC07] Miklós Kurucz, Andras A Benczur, and Karoly Csalogany. Methods for large scale SVD with missing values. KDD Cup and Workshop at the 13th ACM SIGKDD Conference, 2007.
[KBP+10] Mrinal Kalakrishnan, Jonas Buchli, Peter Pastor, Michael Mistry, and Stefan Schaal. Fast, robust quadruped locomotion over challenging terrain. In ICRA 2010 - IEEE International Conference on Robotics and Automation, pages 2665–2670, 2010.
[KBV09] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix Factorization Techniques for Recommender Systems. IEEE Computer, 42(8):30–37, 2009.
[KKB07] Kwangmoo Koh, Seung-Jean Kim, and Stephen P Boyd.
An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007.
[KL96] Philip Klein and Hsueh-I Lu. Efficient approximation algorithms for semidefinite programs arising from MAX CUT and COLORING. In STOC '96: Proceedings of the twenty-eighth annual ACM Symposium on Theory of Computing, 1996.
[Kle99] Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 1999.
[KSST09] Sham M Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Technical report, Toyota Technological Institute - Chicago, USA, 2009.
[KW92] Jacek Kuczyński and Henryk Woźniakowski. Estimating the Largest Eigenvalue by the Power and Lanczos Algorithms with a Random Start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, 1992.
[KZ05] Andrew Kurdila and Michael Zabarankin. Convex Functional Analysis. Birkhäuser Verlag, 2005.
[Lin07] Chih-Jen Lin. Projected Gradient Methods for Nonnegative Matrix Factorization. Neural Computation, 19(10):2756–2779, 2007.
[LMSS07] Nathan Linial, Shahar Mendelson, Gideon Schechtman, and Adi Shraibman. Complexity measures of sign matrices. Combinatorica, 27(4):439–463, 2007.
[Lov83] László Lovász. Submodular functions and convexity. Mathematical Programming: The State of the Art, 1983.
[LRS+10] Jason Lee, Benjamin Recht, Ruslan Salakhutdinov, Nathan Srebro, and Joel A Tropp. Practical Large-Scale Optimization for Max-Norm Regularization. NIPS 2010: Advances in Neural Information Processing Systems 23, 2010.
[LST09] Yong-Jin Liu, Defeng Sun, and Kim-Chuan Toh. An Implementable Proximal Point Algorithmic Framework for Nuclear Norm Minimization. Optimization Online, 2009.
[LU09] Neil Lawrence and Raquel Urtasun.
Non-linear matrix factorization with Gaussian processes. ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, 2009.
[Mar52] Harry Markowitz. Portfolio Selection. The Journal of Finance, 7(1):77–91, 1952.
[MGC09] Shiqian Ma, Donald Goldfarb, and Lifeng Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128(1):321–353, 2009.
[MHT09] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral Regularization Algorithms for Learning Large Incomplete Matrices. Submitted to JMLR, 2009.
[MHT10] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral Regularization Algorithms for Learning Large Incomplete Matrices. Journal of Machine Learning Research, 11:1–36, 2010.
[MR11] Olvi L Mangasarian and Benjamin Recht. Probability of unique integer solution to a system of linear equations. European Journal of Operational Research, in press, 2011.
[Nat95] Balas Kausik Natarajan. Sparse Approximate Solutions to Linear Systems. SIAM Journal on Computing, 24(2):227–234, 1995.
[Nem04] Arkadi Nemirovski. Prox-Method with Rate of Convergence O(1/t) for Variational Inequalities with Lipschitz Continuous Monotone Operators and Smooth Convex-Concave Saddle Point Problems. SIAM Journal on Optimization, 15(1):229, 2004.
[Nem05] Arkadi Nemirovski. Lectures on modern convex optimization. Georgia Institute of Technology, 2005.
[Nes83] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/sqr(k)). Soviet Mathematics Doklady, 27:372–376, 1983.
[Nes04] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2004.
[Nes07a] Yurii Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE and INMA, Université catholique de Louvain, Belgium, 2007.
[Nes07b] Yurii Nesterov.
Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2007.
[Nes11] Yurii Nesterov. Random gradient-free minimization of convex functions. CORE Tech Report, February 2011.
[Pat93] Michael Patriksson. Partial linearization methods in nonlinear programming. Journal of Optimization Theory and Applications, 78(2):227–246, 1993.
[Pat98] Michael Patriksson. Cost Approximation: A Unified Framework of Descent Algorithms for Nonlinear Programs. SIAM Journal on Optimization, 8(2):561, 1998.
[Pat07] Arkadiusz Paterek. Improving regularized singular value decomposition for collaborative filtering. KDD Cup and Workshop at the 13th ACM SIGKDD Conference, 2007.
[Ren05] Jason D M Rennie. Regularized Logistic Regression is Strictly Convex. people.csail.mit.edu, 2005.
[RFP10] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization. SIAM Review, 52(3):471–501, 2010.
[Roc97] R Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1997.
[RR11] Benjamin Recht and Christopher Ré. Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion. Submitted, 2011.
[RS05] Jason D M Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. ICML '05: Proceedings of the 22nd International Conference on Machine Learning, 2005.
[SJ03] Nathan Srebro and Tommi Jaakkola. Weighted Low-Rank Approximations. ICML '03: Proceedings of the 20th International Conference on Machine Learning, 2003.
[SRJ04] Nathan Srebro, Jason D M Rennie, and Tommi Jaakkola. Maximum-margin matrix factorization. NIPS '04: Advances in Neural Information Processing Systems 17, 17:1329–1336, 2004.
[SS05] Nathan Srebro and Adi Shraibman. Rank, Trace-Norm and Max-Norm.
COLT '05: Proceedings of the 18th Annual Workshop on Computational Learning Theory, 3559:545–560, 2005.
[SS10] Ruslan Salakhutdinov and Nathan Srebro. Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm. NIPS 2010: Advances in Neural Information Processing Systems 23, 2010.
[SSSZ10] Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading Accuracy for Sparsity in Optimization Problems with Sparsity Constraints. SIAM Journal on Optimization, 20:2807, 2010.
[TG07] Joel A Tropp and Anna Gilbert. Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.
[Tib96] Robert Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
[TPNT09] Gábor Takács, István Pilászy, Bottyán Németh, and Domonkos Tikk. Scalable Collaborative Filtering Approaches for Large Recommender Systems. Journal of Machine Learning Research, 10, 2009.
[Tro04] Joel A Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.
[TY07] Michael Todd and E Alper Yildirim. On Khachiyan's algorithm for the computation of minimum-volume enclosing ellipsoids. Discrete Applied Mathematics, 155(13):1731–1744, 2007.
[TY10] Kim-Chuan Toh and Sangwoon Yun. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 2010.
[Web06] Brandyn Webb. Netflix Update: Try This at Home. Blog post sifter.org/~simon/journal/20061211.html, 2006.
[WKS08] Markus Weimer, Alexandros Karatzoglou, and Alex J Smola. Improving maximum margin matrix factorization. Machine Learning, 72(3):263–276, 2008.
[Wu07] Mingrui Wu.
Collaborative filtering via ensembles of matrix factorizations. KDD Cup and Workshop at the 13th ACM SIGKDD Conference, 2007.
[ZdG10] Youwei Zhang, Alexandre d'Aspremont, and Laurent El Ghaoui. Sparse PCA: Convex Relaxations, Algorithms and Applications. arXiv math.OC, 2010.
[Zha03] Tong Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682–691, 2003.
[Zha11] Tong Zhang. Sparse Recovery with Orthogonal Matching Pursuit under RIP. IEEE Transactions on Information Theory, 57(9):6215–6221, September 2011.
[ZWSP08] Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. In Algorithmic Aspects in Information and Management, pages 337–348. 2008.