Rows vs Columns for Linear Systems of Equations - Randomized Kaczmarz or Coordinate Descent?


Authors: Aaditya Ramdas

Machine Learning Department, Carnegie Mellon University
aramdas@cs.cmu.edu

September 20, 2018

Abstract

This paper is about randomized iterative algorithms for solving a linear system of equations $X\beta = y$ in different settings. Recent interest in the topic was reignited when Strohmer and Vershynin (2009) proved the linear convergence rate of a Randomized Kaczmarz (RK) algorithm that works on the rows of $X$ (data points). Following that, Leventhal and Lewis (2010) proved the linear convergence of a Randomized Coordinate Descent (RCD) algorithm that works on the columns of $X$ (features). The aim of this paper is to simplify our understanding of these two algorithms, establish the direct relationships between them (though RK is often compared to Stochastic Gradient Descent), and examine the algorithmic commonalities or tradeoffs involved with working on rows or columns. We also discuss Kernel Ridge Regression and present a Kaczmarz-style algorithm that works on data points and has the advantage of solving the problem without ever storing or forming the Gram matrix, one of the recognized problems encountered when scaling kernelized methods.

1 Introduction

Solving linear systems of equations is a classical topic, and our interest in this problem is fairly limited. While we do compare two algorithms - Randomized Kaczmarz (RK) (see Strohmer and Vershynin [12]) and Randomized Coordinate Descent (RCD) (see Leventhal and Lewis [4]) - to each other, we will not be presently concerned with comparing them to the host of other classical algorithms in the literature, like Conjugate Gradient and Gradient Descent methods. Our primary aim will be to understand the algorithmic similarities and differences involved in working with rows and columns (RK and RCD) of a given input.
Assume we have a known $n \times p$ matrix $X$ (representing $n$ data points with $p$ features each), an unknown $p$-dimensional vector $\beta$ (regression coefficients), and a known $n$-dimensional vector $y$ (observations). The system of equations that one wants to solve is

$$X\beta = y \quad (1)$$

When there exists at least one solution to the above system, we say that it is consistent. When there is a unique consistent solution, solving (1) is a special case of the more general problem of minimizing the residual norm (which makes sense when there are no consistent solutions)

$$\min_{\beta \in \mathbb{R}^p} \tfrac{1}{2}\|y - X\beta\|^2 =: L(\beta) \quad (2)$$

When there is a unique consistent solution, it is also a special case of the more general problem of finding the minimum norm consistent solution (which makes sense when there are infinitely many solutions)

$$\min_{\beta \in \mathbb{R}^p} \|\beta\|^2 \quad \text{s.t.} \quad y = X\beta \quad (3)$$

Sparse versions of the above problems, while interesting, are not in the scope of this work. We will later consider the Ridge Regression extension to Eq. (2).

We represent the $i$-th row ($i = 1, \dots, n$) of $X$ by $X^i$, and the $j$-th column ($j = 1, \dots, p$) by $X_j$. Similarly, the $i$-th observation is $y_i$, and the $j$-th regression coefficient is $\beta_j$. The reason for the linear regression setup using the statistical notation of $n, p$ and $X, \beta, y$ is simply author comfort; the literature sometimes uses $Ax = b$ instead, with an $m \times n$ matrix $A$.

For the rest of this paper, we refer to a method as being Kaczmarz-like when its updates depend on rows (data points) of $X$, like

$$\beta_{t+1} := \beta_t + \delta_i X^i$$

where the stepsize $\delta_i$ could also depend on $X^i$, and we refer to a method as being Coordinate Descent style when its updates are on coordinates of $\beta$ and depend on columns (features) of $X$, like

$$\beta_{t+1} := \beta_t + \delta_j e_j$$

where the stepsize $\delta_j$ could depend on $X_j$. Randomized Kaczmarz (and its variations) have been likened to Stochastic Gradient Descent (see Needell et al. [6]).
Indeed, even before having mentioned the form of $\delta_i$, the update already looks a lot like that of the Perceptron algorithm by Rosenblatt [9], a well known stochastic gradient descent algorithm. However, even though we will later describe some differences from RCD, we argue that RK-style methods are still much more like randomized coordinate descent than stochastic gradient descent algorithms - this is useful not only for interpretation but also for new derivations. We will bring out the intuitive similarity between RK and RCD by establishing striking parallels in proofs of convergence (these proofs are traditionally presented in a different manner), and exploit this in designing a simple RK algorithm for Kernel Ridge Regression.

There has been a lot of interest in, and extensions of, both these interesting algorithms, and perhaps connections between the two have been floating around in a subtle manner. One of our aims will be to make these connections explicit and intuitively clear, and to enable the reader to build an understanding of the ideas and proof techniques by the end. We first need to introduce the aforementioned algorithms before we summarize our contributions. In doing so, we will only refer to papers that are directly relevant to our work.

1.1 Randomized Kaczmarz (RK)

Taking $X, y$ as input and starting from an arbitrary $\beta_0$, it repeats the following in each iteration. First, pick a random row $r \in \{1, \dots, n\}$ with probability proportional to its squared Euclidean norm, i.e.

$$\Pr(r = i) = \frac{\|X^i\|^2}{\|X\|_F^2}$$

Then, project the current iterate onto that row, i.e.
$$\beta_{t+1} := \beta_t + \frac{y_r - X^{rT}\beta_t}{\|X^r\|^2} X^r \quad (4)$$

Intuitively, this update can be seen as greedily satisfying the $r$-th equation in the linear system, because it is easy to see that after the update,

$$X^{rT}\beta_{t+1} = y_r \quad (5)$$

Alternatively, referring to Eq. (2), since $L(\beta) = \frac{1}{2}\|y - X\beta\|^2 = \frac{1}{2}\sum_{i=1}^n (y_i - X^{iT}\beta)^2$, we can interpret this update as stochastic gradient descent (we pick a random data point on which to update), where the stepsize is the inverse Lipschitz constant of the stochastic gradient: $\nabla^2 \frac{1}{2}(y_i - X^{iT}\beta)^2 = \|X^i\|^2$. Strohmer and Vershynin [12] showed that the above algorithm has expected linear convergence. We will formally discuss the convergence properties of this algorithm in future sections.

1.2 Randomized Coordinate Descent (RCD)

Taking $X, y$ as input and starting from an arbitrary $\beta_0$, it repeats the following in each iteration. First, pick a random column $c \in \{1, \dots, p\}$ with probability proportional to its squared Euclidean norm, i.e.

$$\Pr(c = j) = \frac{\|X_j\|^2}{\|X\|_F^2}$$

We then minimize the objective $L(\beta) = \frac{1}{2}\|y - X\beta\|^2$ with respect to this coordinate to get

$$\beta_{t+1} := \beta_t + \frac{X_c^T(y - X\beta_t)}{\|X_c\|^2} e_c \quad (6)$$

where $e_c$ is the $c$-th coordinate axis. It can be seen as greedily minimizing the objective with respect to the $c$-th coordinate. Indeed, letting $X_{-c}, \beta_{-c}$ represent $X$ without its $c$-th column and $\beta$ without its $c$-th coordinate,

$$\frac{\partial L}{\partial \beta_c} = -X_c^T(y - X\beta) = -X_c^T(y - X_{-c}\beta_{-c} - X_c\beta_c) \quad (7)$$

Setting this equal to zero for the coordinatewise minimization, we get the aforementioned update for $\beta_c$. Alternately, since $[\nabla L(\beta)]_c = -X_c^T(y - X\beta)$, the above update can intuitively be seen as a univariate descent step where the stepsize is the inverse Lipschitz constant of the gradient along the $c$-th coordinate, since $[\nabla^2 L(\beta)]_{c,c} = (X^T X)_{c,c} = \|X_c\|^2$.
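Both updates are only a few lines each. The following is a minimal numpy sketch of Eqs. (4) and (6); the function names, the seeded generator, and the residual maintained inside the RCD loop are our choices, not from the paper:

```python
import numpy as np

def randomized_kaczmarz(X, y, iters=2000, rng=None):
    """RK (Eq. 4): project the iterate onto a random row's hyperplane.
    Rows are sampled with probability proportional to their squared norms."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = X.shape
    probs = np.sum(X**2, axis=1) / np.sum(X**2)
    beta = np.zeros(p)
    for _ in range(iters):
        i = rng.choice(n, p=probs)
        beta += (y[i] - X[i] @ beta) / (X[i] @ X[i]) * X[i]
    return beta

def randomized_cd(X, y, iters=2000, rng=None):
    """RCD (Eq. 6): exact minimization of L(beta) along a random coordinate.
    Columns are sampled with probability proportional to their squared norms."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = X.shape
    probs = np.sum(X**2, axis=0) / np.sum(X**2)
    beta = np.zeros(p)
    r = y - X @ beta  # maintain the residual so each step costs O(n)
    for _ in range(iters):
        j = rng.choice(p, p=probs)
        delta = X[:, j] @ r / (X[:, j] @ X[:, j])
        beta[j] += delta
        r -= delta * X[:, j]
    return beta
```

On a consistent overconstrained system, both routines recover the unique solution, matching the behavior described in Section 3 below.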
Leventhal and Lewis [4] showed that this algorithm has expected linear convergence. We will discuss the convergence properties of this algorithm in detail in future sections.

2 Main Results

We first examine the differences in behavior of the two algorithms in three distinct but related settings. This will bring out the opposite behaviors of the two similar algorithms. When the system of equations (1) has a unique solution, we represent it by $\beta^*$. This happens when $n \geq p$ and the system is (luckily) consistent. Assuming that $X$ has full column rank,

$$\beta^* = (X^T X)^{-1} X^T y \quad (8)$$

When (1) does not have any consistent solution, we refer to the least-squares solution of Eq. (2) as $\beta_{LS}$. This could happen in the overconstrained case, when $n > p$. Again, assuming that $X$ has full column rank, we have

$$\beta_{LS} = (X^T X)^{-1} X^T y \quad (9)$$

When (1) has infinitely many solutions, we call the minimum norm solution to (3) $\beta^*_{MN}$. This could happen in the underconstrained case, when $n < p$. Assuming that $X$ has full row rank, we have

$$\beta^*_{MN} = X^T (X X^T)^{-1} y \quad (10)$$

In the above notation, the $*$ denotes the fact that the solution is consistent, i.e. it solves the system of equations; LS stands for Least Squares and MN for Minimum Norm. We shall return to each of these three situations, in that order, in future sections of this paper. One of our main contributions is to achieve a unified understanding of the behaviour of RK and RCD in these different situations. The literature for RK deals with only the first two settings (see Strohmer and Vershynin [12], Needell [5], Zouzias and Freris [13]), but avoids the third. The literature for RCD typically focuses on more general setups than our specific quadratic least squares loss function $L(\beta)$ (see Nesterov [7] or Richtárik and Takáč [8]).
However, both for the purposes of completeness and for a more thorough understanding of the relationship between RK and RCD, it turns out to be crucial to analyse all three settings (for equations (1)-(3)).

1. When $\beta^*$ is a unique consistent solution, we present proofs of the linear convergence of both algorithms - the results are known from the papers by Strohmer and Vershynin [12] and Leventhal and Lewis [4], but are presented in a novel manner so that their relationship becomes clearer and direct comparison is easily possible.

2. When $\beta_{LS}$ is the (inconsistent) least squares solution, we show why RCD iterates converge linearly to $\beta_{LS}$, but RK iterates do not - making RCD preferable. These facts are not hard to see, but we make it more intuitively and mathematically clear why this should be the case.

3. When $\beta^*_{MN}$ is the minimum norm consistent solution, we explain why RK converges linearly to it, but RCD iterates do not (both so far undocumented observations) - making RK preferable.

Together, the above three points complete the picture (with solid accompanying intuition) of the opposing behavior of RK and RCD. We then use the insights thus gained to develop a Kaczmarz-style algorithm for Ridge Regression and its kernelized version. It is well known that the solution to

$$\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|^2 + \lambda\|\beta\|^2 \quad (11)$$

can be given in two equivalent forms (using the covariance and Gram matrices) as

$$\beta_{RR} = (X^T X + \lambda I)^{-1} X^T y = X^T (X X^T + \lambda I)^{-1} y \quad (12)$$

The presented algorithms completely avoid inverting, storing or even forming $X X^T$ and $X^T X$.
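The equivalence of the two forms in Eq. (12) is easy to confirm numerically; the following short sketch (dimensions and variable names are ours) checks it on a random instance:

```python
import numpy as np

# Sanity check of Eq. (12):
#   (X^T X + lam I_p)^{-1} X^T y  ==  X^T (X X^T + lam I_n)^{-1} y
rng = np.random.default_rng(0)
n, p, lam = 8, 3, 0.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)  # p x p solve
dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)    # n x n solve
assert np.allclose(primal, dual)
```

The identity is why one may work with whichever of the two matrices is smaller; the point of the algorithms below is that neither matrix needs to be formed at all.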
Later, we will show that the following updates take only $O(p)$ computation per iteration, like RK (starting with $\delta = 0$, $\alpha = 0_n$, $\beta = 0_p$, $r = y$), and have expected linear convergence:

$$\delta_t = \frac{y_i - \beta_t^T X^i - \lambda\alpha_t^i}{\|X^i\|^2 + \lambda} \quad (13)$$
$$\alpha_{t+1}^i = \alpha_t^i + \delta_t \quad (14)$$
$$\beta_{t+1} = \beta_t + \delta_t X^i \quad (15)$$

where the $i$-th row is picked with probability proportional to $\|X^i\|^2 + \lambda$.

If $\mathcal{H}_k$ is a Reproducing Kernel Hilbert Space (RKHS, see Scholkopf and Smola [11] for an introduction) associated to a positive definite kernel $k$ with feature map $\phi_x$, it is well known that the solution to the corresponding Kernel Ridge Regression (see Saunders et al. [10]) problem is

$$f_{KRR} = \arg\min_{f \in \mathcal{H}_k} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda\|f\|^2_{\mathcal{H}_k} \quad (16)$$
$$= \Phi^T (K + \lambda I)^{-1} y \quad (17)$$

where $\Phi = (\phi_{x_1}, \dots, \phi_{x_n})^T$ and $K$ is the Gram matrix with $K_{ij} = k(x_i, x_j)$. One of the main problems with kernel methods is that as the data size grows, the Gram matrix becomes too large to store. This has motivated the study of approximation techniques for such kernel matrices, but we have an alternate suggestion. The aim of a Kaczmarz-style algorithm would be to solve the problem by never forming $K$, as exemplified in the updates for $\beta_{RR}$. For KRR, the update is

$$\alpha_{t+1}^i = \frac{y_i - \sum_{j \neq i} K(x_i, x_j)\alpha_t^j}{K(x_i, x_i) + \lambda} \quad (18)$$

which costs $O(n)$ per iteration and results in linear convergence, as described later. Note that RK for Kernel Ridge Regression costs $O(n)$ per iteration while RK for Ridge Regression costs $O(p)$ per iteration, due to the different parameterization. In the latter, we can keep track of $\beta_t$ as well as $\alpha_t$ easily (see Eqs. (14),(15)), but for KRR, calculations can only be performed via evaluations of the kernel ($\beta_t$ corresponds to a function and cannot be stored), and hence the cost differs.
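A sketch of update (18), assuming the kernel is available as a Python callable; the names `rk_krr` and `K_fn` are ours. The residual of the system $(K + \lambda I)\alpha = y$ is maintained explicitly, so each step costs $n$ kernel evaluations against a single point and the Gram matrix is never stored:

```python
import numpy as np

def rk_krr(K_fn, xs, y, lam, iters=5000, rng=None):
    """Kaczmarz-style KRR (Eq. 18): coordinate descent on (K + lam I) alpha = y.
    Maintains the residual r = y - K alpha - lam alpha; K is never formed."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(xs)
    diag = np.array([K_fn(x, x) for x in xs])
    probs = (diag + lam) / (diag + lam).sum()  # rows picked prop. to K(x_i,x_i)+lam
    alpha = np.zeros(n)
    r = y.astype(float).copy()
    for _ in range(iters):
        i = rng.choice(n, p=probs)
        delta = r[i] / (diag[i] + lam)
        alpha[i] += delta
        # residual update: O(n) kernel evaluations against x_i only
        r -= delta * np.array([K_fn(xs[j], xs[i]) for j in range(n)])
        r[i] -= delta * lam
    return alpha
```

For a small problem one can verify the output against the direct solve of $(K + \lambda I)\alpha = y$, which forms $K$ only for the purpose of the check.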
The aforementioned updates and their convergence can be easily derived once we develop a clear understanding of how RK and RCD methods relate to each other and jointly to positive semi-definite systems of equations. We shall see more of this in Sec. 6.

3 Overconstrained System, Consistent

To be clear, here we will assume that $n > p$, $X$ has full column rank, and that the system is consistent, so $y = X\beta^*$. First, let us write the updates used by both algorithms in a revealing fashion. If RK and RCD picked row $i$ and column $j$ at step $t+1$, and $e_i$ is 1 in the $i$-th position and 0 elsewhere, then the updates can be rewritten as below:

$$\text{(RK)} \quad \beta_{t+1} := \beta_t + \frac{e_i^T r_t}{\|X^i\|^2} X^i \quad (19)$$
$$\text{(RCD)} \quad \beta_{t+1} := \beta_t + \frac{X_j^T r_t}{\|X_j\|^2} e_j \quad (20)$$

where $r_t = y - X\beta_t = X\beta^* - X\beta_t$ is the residual vector. Then multiplying both equations by $X$ gives

$$\text{(RK)} \quad X\beta_{t+1} := X\beta_t + \frac{X^{iT}(\beta^* - \beta_t)}{\|X^i\|^2} X X^i \quad (21)$$
$$\text{(RCD)} \quad X\beta_{t+1} := X\beta_t + \frac{X_j^T X(\beta^* - \beta_t)}{\|X_j\|^2} X_j \quad (22)$$

We now come to an important difference, which is the key update equation for RK and RCD. Firstly, from the update Eq. (19) for RK, $\beta_{t+1} - \beta_t$ is parallel to $X^i$. Also, $\beta_{t+1} - \beta^*$ is orthogonal to $X^i$ (why? because $X^{iT}(\beta_{t+1} - \beta^*) = y_i - y_i = 0$). By Pythagoras,

$$\|\beta_{t+1} - \beta^*\|^2 = \|\beta_t - \beta^*\|^2 - \|\beta_{t+1} - \beta_t\|^2 \quad (23)$$

Next, from the update Eq. (22), $X\beta_{t+1} - X\beta_t$ is parallel to $X_j$. Also, $X\beta_{t+1} - X\beta^*$ is orthogonal to $X_j$ (why? because $X_j^T(X\beta_{t+1} - X\beta^*) = X_j^T(X\beta_{t+1} - y) = 0$ by the optimality condition $\partial L/\partial\beta_j = 0$). By Pythagoras,

$$\|X\beta_{t+1} - X\beta^*\|^2 = \|X\beta_t - X\beta^*\|^2 - \|X\beta_{t+1} - X\beta_t\|^2 \quad (24)$$

The rest of the proof follows by simply substituting for the last term in the above two equations, and is presented in the following table for easy comparison.
Note that $\Sigma = X^T X$ is the full-rank covariance matrix. We first take expectations with respect to the randomness at the $(t+1)$-st step, conditioning on all randomness up to the $t$-th step, and later iterate this expectation.

Randomized Kaczmarz:
$$\mathbb{E}\|\beta_{t+1} - \beta^*\|^2 = \|\beta_t - \beta^*\|^2 - \mathbb{E}\|\beta_{t+1} - \beta_t\|^2$$
$$= \|\beta_t - \beta^*\|^2 - \sum_i \frac{\|X^i\|^2}{\|X\|_F^2} \frac{(X^{iT}(\beta_t - \beta^*))^2}{(\|X^i\|^2)^2}\|X^i\|^2$$
$$= \|\beta_t - \beta^*\|^2 \left(1 - \frac{1}{\|X\|_F^2} \frac{\|X(\beta_t - \beta^*)\|^2}{\|\beta_t - \beta^*\|^2}\right)$$
$$\leq \|\beta_t - \beta^*\|^2 \left(1 - \frac{\sigma_{\min}(\Sigma)}{\mathrm{Tr}(\Sigma)}\right)$$

Randomized Coordinate Descent:
$$\mathbb{E}\|X\beta_{t+1} - X\beta^*\|^2 = \|X\beta_t - X\beta^*\|^2 - \mathbb{E}\|X\beta_{t+1} - X\beta_t\|^2$$
$$= \|X\beta_t - X\beta^*\|^2 - \sum_j \frac{\|X_j\|^2}{\|X\|_F^2} \frac{(X_j^T X(\beta_t - \beta^*))^2}{(\|X_j\|^2)^2}\|X_j\|^2$$
$$= \|X\beta_t - X\beta^*\|^2 \left(1 - \frac{1}{\|X\|_F^2} \frac{\|X^T X(\beta_t - \beta^*)\|^2}{\|X\beta_t - X\beta^*\|^2}\right)$$
$$\leq \|X\beta_t - X\beta^*\|^2 \left(1 - \frac{\sigma_{\min}(\Sigma)}{\mathrm{Tr}(\Sigma)}\right)$$

Here we used $\sigma_{\min}(\Sigma)\|\beta_t - \beta^*\|^2 \leq \|X(\beta_t - \beta^*)\|^2$, where $\sigma_{\min}(\Sigma)$ is the smallest eigenvalue. It follows that

$$\text{(RK)} \quad \mathbb{E}\|\beta_t - \beta^*\|^2 \leq \left(1 - \frac{\sigma_{\min}(\Sigma)}{\mathrm{Tr}(\Sigma)}\right)^t \|\beta_0 - \beta^*\|^2 \quad (25)$$
$$\text{(RCD)} \quad \mathbb{E}\|\beta_t - \beta^*\|^2_\Sigma \leq \left(1 - \frac{\sigma_{\min}(\Sigma)}{\mathrm{Tr}(\Sigma)}\right)^t \|\beta_0 - \beta^*\|^2_\Sigma \quad (26)$$

Since $\Sigma$ is invertible when $n > p$ and $X$ has full column rank, the last equation also implies linear convergence of $\mathbb{E}\|\beta_t - \beta^*\|^2$. The final results do exist in Strohmer and Vershynin [12] and Leventhal and Lewis [4], but there is utility in seeing the two proofs side by side, in a form that differs from their original presentation. In this setting, both RK and RCD are essentially equivalent (without computational considerations).

4 Overconstrained System, Inconsistent

Here, we will assume that $n > p$, $X$ has full column rank, and the system is (expectedly) inconsistent, so $y = X\beta_{LS} + z$, where $z$ is such that $X^T z = 0$. It is easy to see this condition, because as mentioned earlier, $\beta_{LS} = (X^T X)^{-1} X^T y$, implying that $X^T X\beta_{LS} = X^T y$. Substituting $y = X\beta_{LS} + z$ gives $X^T z = 0$.
In this setting, RK is known not to converge to the least squares solution, as is easily verified experimentally. The tightest convergence upper bounds known are by Needell [5] and Zouzias and Freris [13], who show that

$$\mathbb{E}\|\beta_t - \beta_{LS}\|^2 \leq \left(1 - \frac{\sigma^+_{\min}(\Sigma)}{\mathrm{Tr}(\Sigma)}\right)^t \|\beta_0 - \beta_{LS}\|^2 + \frac{\|z\|^2}{\sigma^+_{\min}(\Sigma)}$$

If you tried to follow the previous proof, Eq. (23) does not go through - Pythagoras fails because $\beta_{t+1} - \beta_{LS} \not\perp X^i$, since $X^{iT}(\beta_{t+1} - \beta_{LS}) = y_i - X^{iT}\beta_{LS} \neq 0$. Intuitively, the reason RK does not converge is that every update of RK (say of row $i$) is a projection onto the "wrong" hyperplane that has constant $y_i$ (where the "right" hyperplane would involve projecting onto a parallel hyperplane with constant $y_i - z_i$, $z$ being defined above). An alternate intuition is that all RK updates are in the span of the rows, but $\beta_{LS}$ is not in the row span. These intuitive explanations are easily confirmed by the experiments seen in Needell [5], Zouzias and Freris [13].

The same doesn't hold for RCD. Almost magically, in the previous proof, Pythagoras works because

$$X_j^T(X\beta_{t+1} - X\beta_{LS}) = X_j^T(X\beta_{t+1} - y) + X_j^T(y - X\beta_{LS}) = 0$$

The first term is 0 by the optimality condition for $\beta_{t+1}$, i.e. $X_j^T(X\beta_{t+1} - y) = \partial L/\partial\beta_j = 0$. The second term is zero by the global optimality of $\beta_{LS}$, i.e. $X^T(y - X\beta_{LS}) = \nabla L = 0$. Also, $\Sigma$ is full rank as before. The proof goes through because RCD works in the space of fitted values $X\beta$ and not the iterates $\beta$. In summary, RK does not converge to the LS solution, but RCD does, at the same linear rate. The Randomized Extended Kaczmarz of Zouzias and Freris [13] is a modification of RK designed to converge by randomly projecting out $z$, but its discussion is beyond our scope. We were also alerted to a recent, independent arXiv paper by Dumitrescu [1].
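This opposite behavior is easy to reproduce. The sketch below (our own construction; uniform row/column sampling is used for simplicity in place of norm-weighted sampling) runs one RK step and one RCD step per loop iteration on an inconsistent system:

```python
import numpy as np

# Inconsistent overconstrained system: RCD reaches beta_LS, while RK stalls
# at a noise floor around it, as described in Section 4.
rng = np.random.default_rng(2)
n, p = 60, 4
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)  # no exact solution
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

beta_rk = np.zeros(p)
beta_cd = np.zeros(p)
r = y.copy()                      # residual maintained for RCD
for _ in range(20000):
    i = rng.integers(n)           # RK step: project onto row i's hyperplane
    beta_rk += (y[i] - X[i] @ beta_rk) / (X[i] @ X[i]) * X[i]
    j = rng.integers(p)           # RCD step: exact minimization along e_j
    d = X[:, j] @ r / (X[:, j] @ X[:, j])
    beta_cd[j] += d
    r -= d * X[:, j]

assert np.linalg.norm(beta_cd - beta_ls) < 1e-6   # RCD converges to beta_LS
assert np.linalg.norm(beta_rk - beta_ls) > 1e-3   # RK keeps a residual error
```

The RK iterate keeps jumping between the "wrong" hyperplanes, so its distance to $\beta_{LS}$ stays on the order of the noise, consistent with the additive term in the bound above.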
5 Underconstrained System, Infinite Solutions

Here, we will assume that $p > n$, $X$ has full row rank, and the system is (expectedly) consistent with infinitely many solutions. As mentioned earlier, it is easy to show that $\beta^*_{MN} = X^T(XX^T)^{-1}y$ (which clearly satisfies $X\beta^*_{MN} = y$). Every other consistent solution can be expressed as

$$\beta^* = \beta^*_{MN} + z \quad \text{where } Xz = 0$$

Clearly any such $\beta^*$ would also satisfy $X\beta^* = X\beta^*_{MN} = y$. Since $Xz = 0$, $z \perp \beta^*_{MN}$, implying $\|\beta^*\|^2 = \|\beta^*_{MN}\|^2 + \|z\|^2$, showing that $\beta^*_{MN}$ is indeed the minimum norm solution as claimed.

In this case, RK has good behaviour: starting from $\beta_0 = 0$, it does converge linearly to $\beta^*_{MN}$. Intuitively, $\beta^*_{MN} = X^T\alpha$ (for $\alpha = (XX^T)^{-1}y$) and hence is in the row span of $X$. Starting from $\beta_0 = 0$, RK only adds multiples of rows to its iterates, and hence will never have any component orthogonal to the row span of $X$ (i.e. will never add any $z$ such that $Xz = 0$). There is exactly one solution with no component orthogonal to the row span of $X$, and that is $\beta^*_{MN}$, and hence RK converges linearly to the required point. It is important not to start from an arbitrary $\beta_0$, since the RK updates can never wipe out any component of $\beta_0$ that is perpendicular to the row span of $X$.

Mathematically, the previous proof works because Pythagoras goes through, since it is a consistent system. However, $\Sigma$ is not full rank; but note that since both $\beta^*_{MN}$ and $\beta_t$ are in the row span, $\beta_t - \beta^*_{MN}$ has no component orthogonal to the rows of $X$ (unless it equals zero and we are done). Hence $\sigma^+_{\min}(\Sigma)\|\beta_t - \beta^*_{MN}\|^2 \leq \|X(\beta_t - \beta^*_{MN})\|^2$ does hold, $\sigma^+_{\min}(\Sigma)$ being the smallest positive eigenvalue of $\Sigma$.

RCD unfortunately suffers the opposite fate. The iterates do not converge to $\beta^*_{MN}$, even though $X\beta_t$ does converge to $X\beta^*$.
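Both behaviors can be checked numerically. The sketch below (our own construction, with uniform sampling for simplicity) starts both methods at $\beta_0 = 0$ on an underconstrained consistent system:

```python
import numpy as np

# Underconstrained consistent system: from beta_0 = 0, RK converges to the
# minimum norm solution; RCD fits the data but lands on a different solution.
rng = np.random.default_rng(3)
n, p = 5, 40
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta_mn = X.T @ np.linalg.solve(X @ X.T, y)   # Eq. (10)

beta_rk = np.zeros(p)                         # beta_0 = 0 keeps RK in the row span
beta_cd = np.zeros(p)
r = y.copy()
for _ in range(10000):
    i = rng.integers(n)
    beta_rk += (y[i] - X[i] @ beta_rk) / (X[i] @ X[i]) * X[i]
    j = rng.integers(p)
    d = X[:, j] @ r / (X[:, j] @ X[:, j])
    beta_cd[j] += d
    r -= d * X[:, j]

assert np.allclose(beta_rk, beta_mn, atol=1e-6)   # RK finds beta_MN
assert np.allclose(X @ beta_cd, y, atol=1e-6)     # RCD fits the system...
assert np.linalg.norm(beta_cd - beta_mn) > 1e-3   # ...but not at beta_MN
```

The RCD iterate accumulates components along coordinate axes that lie partly outside the row span, so its limit is a consistent solution that is generically not the minimum norm one.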
Mathematically, the convergence proof still carries forward as before until Eq. (26), but in the last step $X^T X$ cannot be inverted because it is not full rank. Hence we get convergence of the residual to zero, without getting convergence of the iterates to the minimum norm solution. Unfortunately, when each update is cheaper for RK than for RCD (due to matrix size), RCD is preferred for reasons of convergence, and when it is cheaper for RCD than for RK, RK is preferred.

6 Randomized Kaczmarz for Ridge Regression

Both RK and RCD can be viewed in the following fashion. Suppose we have a positive definite matrix $A$, and we want to solve $Ax = b$. Instead of casting it as $\min_x \|Ax - b\|^2$, we can alternatively pose the different problem $\min_x \frac{1}{2}x^T A x - b^T x$. Then one could use the update

$$x_{t+1} = x_t + \frac{b_i - A_i^T x_t}{A_{ii}} e_i$$

where $b_i - A_i^T x_t$ is basically the negative of the $i$-th coordinate of the gradient, and $A_{ii}$ is the Lipschitz constant of the $i$-th coordinate of the gradient (related works include Leventhal and Lewis [4], Nesterov [7], Richtárik and Takáč [8], Lee and Sidford [3]).

In this light, the RK update can be seen as the randomized coordinate descent rule for the psd system $XX^T\alpha = y$ (substituting $\beta = X^T\alpha$), treating $A = XX^T$ and $b = y$. Similarly, the RCD update can be seen as the randomized coordinate descent rule for the psd system $X^T X\beta = X^T y$, treating $A = X^T X$ and $b = X^T y$. Using this connection, we propose the following update rule:

$$\delta_t = \frac{y_i - \beta_t^T X^i - \lambda\alpha_t^i}{\|X^i\|^2 + \lambda} \quad (27)$$
$$\alpha_{t+1}^i = \alpha_t^i + \delta_t \quad (28)$$
$$\beta_{t+1} = \beta_t + \delta_t X^i \quad (29)$$

where the $i$-th row is picked with probability proportional to $\|X^i\|^2 + \lambda$. If all rows are normalized, then this is just a uniform distribution. However, it is more typical in statistics to normalize the columns, and hence one pass over the data must be made to calculate the row norms.
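Updates (27)-(29) can be sketched as follows (a minimal numpy implementation; the function name and defaults are ours). Note that $\beta = X^T\alpha$ is maintained exactly, each step touches a single row, and neither $XX^T$ nor $X^TX$ is ever formed:

```python
import numpy as np

def rk_ridge(X, y, lam, iters=20000, rng=None):
    """Kaczmarz-style ridge regression, Eqs. (27)-(29): coordinate descent on
    the PSD system (X X^T + lam I) alpha = y, tracking beta = X^T alpha.
    O(p) work per iteration; X X^T and X^T X are never formed."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = X.shape
    row_norms = np.sum(X**2, axis=1)
    probs = (row_norms + lam) / (row_norms + lam).sum()  # rows prop. to ||X^i||^2 + lam
    alpha = np.zeros(n)
    beta = np.zeros(p)
    for _ in range(iters):
        i = rng.choice(n, p=probs)
        delta = (y[i] - beta @ X[i] - lam * alpha[i]) / (row_norms[i] + lam)
        alpha[i] += delta          # Eq. (28)
        beta += delta * X[i]       # Eq. (29)
    return beta
```

The returned $\beta$ approaches $\beta_{RR} = (X^TX + \lambda I)^{-1}X^Ty$, since the fixed point of the $\alpha$ recursion is $(XX^T + \lambda I)^{-1}y$ and Eq. (12) ties the two forms together.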
We argue that this update rule exhibits linear convergence for $\min_\beta \|y - X\beta\|^2 + \lambda\|\beta\|^2$. Similarly, for Kernel Ridge Regression as mentioned in Eq. (16), since one hopes to calculate $f_{KRR} = \Phi^T(K + \lambda I_n)^{-1}y$, the RK-style update can be derived from the randomized coordinate descent update rule for the psd system $(K + \lambda I)\alpha = y$ by setting $f_{KRR} = \Phi^T\alpha$. The update for $\alpha$ looks like (where $S_a(z) = \frac{z}{1+a}$)

$$\alpha_{t+1}^i = \frac{K(x_i, x_i)}{K(x_i, x_i) + \lambda}\alpha_t^i + \frac{y_i - \sum_j K(x_i, x_j)\alpha_t^j}{K(x_i, x_i) + \lambda} \quad (30)$$
$$= S_{\frac{\lambda}{K(x_i, x_i)}}\left(\alpha_t^i + \frac{r_i}{K(x_i, x_i)}\right) \quad (31)$$

where row $i$ is picked with probability proportional to $K(x_i, x_i) + \lambda$ (uniform for translation invariant kernels).

Let us contrast this with the randomized coordinate descent update rule for the loss function $\min_\beta \frac{1}{2}\beta^T(X^TX + \lambda I_p)\beta - y^TX\beta$, i.e. the system $(X^TX + \lambda I_p)\beta = X^Ty$:

$$\beta_{t+1}^i = \beta_t^i + \frac{X_i^Ty - X_i^TX\beta_t - \lambda\beta_t^i}{\|X_i\|^2 + \lambda} \quad (32)$$
$$= \frac{\|X_i\|^2}{\|X_i\|^2 + \lambda}\beta_t^i + \frac{X_i^Tr_t}{\|X_i\|^2 + \lambda} \quad (33)$$
$$= S_{\frac{\lambda}{\|X_i\|^2}}\left(\beta_t^i + \frac{X_i^Tr_t}{\|X_i\|^2}\right) \quad (34)$$

6.1 Computation and Convergence

The RCD updates in (32)-(34) take $O(n)$ time, since each column (feature) is of size $n$. In contrast, the proposed RK updates in (27)-(29) take $O(p)$ time, since that is the length of a data point. Lastly, the RK updates in (30)-(31) take $O(n)$ time (to update $r$), not counting time for kernel evaluations. The difference between the two RK updates for Ridge Regression and Kernel Ridge Regression is that for KRR, we cannot maintain both $\alpha$ and $\beta$, since $\beta$ is a function in the RKHS. This different parameterization makes the updates to $\alpha$ cost $O(n)$ instead of $O(p)$. While the RK and RCD algorithms are similar and related, one should not be tempted into thinking their convergence rates are the same.
Indeed, with no normalization assumption, using a similar style of proof as presented earlier, one can show that the convergence rate of RK for Kernel Ridge Regression is

$$\mathbb{E}\|\alpha_t - \alpha^*\|^2_{K+\lambda I_n} \leq \left(1 - \frac{\sigma_{\min}(K + \lambda I_n)}{\mathrm{Tr}(K + \lambda I_n)}\right)^t \|\alpha_0 - \alpha^*\|^2_{K+\lambda I_n} \quad (35)$$
$$= \begin{cases} \left(1 - \frac{\lambda}{\sum_i \sigma_i^2 + n\lambda}\right)^t \|\alpha_0 - \alpha^*\|^2_{K+\lambda I_n} & \text{if } n > p \\[4pt] \left(1 - \frac{\sigma_{\min}^2 + \lambda}{\sum_i \sigma_i^2 + n\lambda}\right)^t \|\alpha_0 - \alpha^*\|^2_{K+\lambda I_n} & \text{if } p > n \end{cases} \quad (36)$$

and the rate of convergence of RCD for Ridge Regression is subtly different:

$$\mathbb{E}\|\beta_t - \beta^*\|^2_{\Sigma+\lambda I_p} \leq \left(1 - \frac{\sigma_{\min}(\Sigma + \lambda I_p)}{\mathrm{Tr}(\Sigma + \lambda I_p)}\right)^t \|\beta_0 - \beta^*\|^2_{\Sigma+\lambda I_p} \quad (37)$$
$$= \begin{cases} \left(1 - \frac{\sigma_{\min}^2 + \lambda}{\sum_i \sigma_i^2 + p\lambda}\right)^t \|\beta_0 - \beta^*\|^2_{\Sigma+\lambda I_p} & \text{if } n > p \\[4pt] \left(1 - \frac{\lambda}{\sum_i \sigma_i^2 + p\lambda}\right)^t \|\beta_0 - \beta^*\|^2_{\Sigma+\lambda I_p} & \text{if } p > n \end{cases} \quad (38)$$

where the $\sigma_i$ denote the singular values of $X$ and $\sigma_{\min}$ the smallest of them.

Kernel Ridge Regression is used as a subroutine in many kernelized machine learning problems (for example, see Section 4.3 of Fukumizu et al. [2] for a recent application to nonparametric state space models). One of the major issues involved with scaling kernel methods is the formation of the Gram matrix, which could be prohibitively large to form, store, invert, etc. Our RK-style algorithm gets around this issue completely by never forming the kernel matrix, just as RCD avoids forming $\Sigma$, making it a great choice for scaling kernel methods.

7 Conclusion

In this paper, we studied the close connections between the RK and RCD algorithms, which have both received a lot of recent attention in the literature, and which we show are in some sense instances of each other when appropriately viewed. While RK is often viewed as a stochastic gradient algorithm, we saw that its ties to RCD are much stronger, and that it is easier to understand its convergence through the RCD perspective than the SGD viewpoint. We first analysed their opposite behavior on linear systems.
If the system was consistent with a unique solution, we showed that both algorithms approach the solution at (approximately) the same linear rate, with extremely similar proofs presented for direct comparison. However, if the system was consistent with infinitely many solutions, we saw that RK converged to the minimum norm solution but RCD did not, making RK preferable. In contrast, if the system was inconsistent, RCD converged to the least squares solution but RK did not, making RCD the preferred choice. Unfortunately, in both cases, the preferred choice has the costlier updates.

We then exploited the connection between RK and RCD to design an RK-style algorithm for Kernel Ridge Regression (KRR) which avoids explicitly forming and inverting the potentially large kernel matrix. We anticipate that this and other randomization techniques will help the scalability of kernel methods.

Acknowledgements

The authors would like to thank Deanna Needell and Anna Ma for some discussions, Peter Richtárik for pointing out Dumitrescu [1], and Ryan Tibshirani, for memorable class notes on coordinate descent and for corrections on an earlier version of this manuscript.

References

[1] Dumitrescu, B. [2014], 'Faster alternatives to the randomized extended Kaczmarz algorithm', arXiv preprint arXiv:1405.6920.

[2] Fukumizu, K., Song, L. and Gretton, A. [2014], 'Kernel Bayes' rule: Bayesian inference with positive definite kernels', Journal of Machine Learning Research 14, 3753-3783.

[3] Lee, Y. T. and Sidford, A. [2013], Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems, in 'Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on', IEEE, pp. 147-156.

[4] Leventhal, D. and Lewis, A. S. [2010], 'Randomized methods for linear constraints: Convergence rates and conditioning', Mathematics of Operations Research 35(3), 641-654.

[5] Needell, D. [2010], 'Randomized Kaczmarz solver for noisy linear systems', BIT Numerical Mathematics 50(2), 395-403.

[6] Needell, D., Srebro, N. and Ward, R. [2013], 'Stochastic gradient descent and the randomized Kaczmarz algorithm', arXiv preprint arXiv:1310.5715.

[7] Nesterov, Y. [2012], 'Efficiency of coordinate descent methods on huge-scale optimization problems', SIAM Journal on Optimization 22(2), 341-362.

[8] Richtárik, P. and Takáč, M. [2012], 'Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function', Mathematical Programming pp. 1-38.

[9] Rosenblatt, F. [1958], 'The perceptron: a probabilistic model for information storage and organization in the brain', Psychological Review 65(6), 386.

[10] Saunders, C., Gammerman, A. and Vovk, V. [1998], Ridge regression learning algorithm in dual variables, in 'Proceedings of the 15th International Conference on Machine Learning (ICML-1998)', Morgan Kaufmann, pp. 515-521.

[11] Scholkopf, B. and Smola, A. [2002], Learning with Kernels, MIT Press, Cambridge.

[12] Strohmer, T. and Vershynin, R. [2009], 'A randomized Kaczmarz algorithm with exponential convergence', Journal of Fourier Analysis and Applications 15(2), 262-278.

[13] Zouzias, A. and Freris, N. M. [2013], 'Randomized extended Kaczmarz for solving least squares', SIAM Journal on Matrix Analysis and Applications 34(2), 773-793.
