Randomized Methods for Linear Constraints: Convergence Rates and Conditioning


Authors: D. Leventhal, A.S. Lewis

October 29, 2018

D. Leventhal (School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14853, U.S.A.; leventhal@orie.cornell.edu)
A.S. Lewis (School of Operations Research and Information Engineering, Cornell University, Ithaca, NY 14853, U.S.A.; aslewis@orie.cornell.edu)

Key words: coordinate descent, linear constraint, condition number, randomization, error bound, iterated projections, averaged projections, distance to ill-posedness, metric regularity

AMS 2000 Subject Classification: 15A12, 15A39, 65F10, 90C25

Abstract. We study randomized variants of two classical algorithms: coordinate descent for systems of linear equations and iterated projections for systems of linear inequalities. Expanding on a recent randomized iterated projection algorithm of Strohmer and Vershynin for systems of linear equations, we show that, under appropriate probability distributions, the linear rates of convergence (in expectation) can be bounded in terms of natural linear-algebraic condition numbers for the problems. We relate these condition measures to distances to ill-posedness, and discuss generalizations to convex systems under metric regularity assumptions.

1 Introduction

The condition number of a problem measures the sensitivity of a solution to small perturbations in its input data. For many problems that arise in numerical analysis, there is often a simple relationship between the condition number of a problem instance and the distance to the set of ill-posed problems, namely those problem instances whose condition numbers are infinite [3]. For example, with respect to the problem of inverting a matrix $A$, it is known (see [11], for example) that if $A$ is perturbed to $A + E$ for sufficiently small $E$, then
\[
\frac{\|(A+E)^{-1} - A^{-1}\|}{\|A^{-1}\|} \le \|A^{-1}\|\,\|E\| + O(\|E\|^2).
\]
Thus, a condition measure for this problem may be taken as $\|A^{-1}\|$. Associated with this is the classical Eckart-Young theorem found in [7], relating the above condition measure to the distance to ill-posedness.

Theorem 1.1 (Eckart-Young) For any non-singular matrix $A$,
\[
\min_E \{\|E\| : A + E \text{ is singular}\} = \frac{1}{\|A^{-1}\|}.
\]

We are typically concerned with relative condition numbers as introduced by Demmel in [3]. For example, with respect to the problem of matrix inversion, the relative condition number is $k(A) := \|A\|\,\|A^{-1}\|$, the commonly used condition measure.

Condition numbers are also important from an algorithmic perspective. In the above example of matrix inversion, for instance, the sensitivity of a problem under perturbations could come into prominence through errors in either the initial problem data or accumulated computational error due to rounding. Hence, it seems natural that condition numbers should affect algorithm speed. For example, in the context of linear programming, Renegar defined a condition measure based on the distance to ill-posedness in [19], in a similar sense as the Eckart-Young result, and showed its effect on the convergence rate of interior point methods in [20]. For another example, consider the problem of finding a solution to the system $Ax = b$ where $A$ is a positive-definite matrix. It was shown in [1] that the steepest descent method is linearly convergent with rate $\big(\frac{k(A)-1}{k(A)+1}\big)^2$, and that this bound is asymptotically tight for almost all choices of initial iterates. Similarly, it is well known (see [8]) that the conjugate gradient method applied to the same problem is also linearly convergent with rate $\frac{\sqrt{k(A)}-1}{\sqrt{k(A)}+1}$.

From a computational perspective, a related and important area of study is that of error bounds.
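The Eckart-Young identity is easy to check numerically. The following sketch (NumPy; not part of the original paper, with a random test matrix) builds the minimizing perturbation $E$ from the smallest singular triple of $A$ and confirms that $A + E$ is numerically singular while $\|E\| = 1/\|A^{-1}\|$:

```python
# Numerical check of the Eckart-Young theorem (Theorem 1.1): for a
# non-singular A, the nearest singular matrix lies at spectral-norm
# distance exactly sigma_min(A) = 1/||A^{-1}||_2.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

U, s, Vt = np.linalg.svd(A)
sigma_min = s[-1]                               # smallest singular value
inv_norm = np.linalg.norm(np.linalg.inv(A), 2)  # ||A^{-1}||_2

# The minimizing perturbation removes the smallest singular triple.
E = -sigma_min * np.outer(U[:, -1], Vt[-1, :])
s2 = np.linalg.svd(A + E, compute_uv=False)

assert abs(sigma_min * inv_norm - 1.0) < 1e-8   # sigma_min = 1/||A^{-1}||
assert s2[-1] < 1e-10                           # A + E is (numerically) singular
```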
Given a subset of a Euclidean space, an error bound is an inequality that bounds the distance from a test vector to the specified subset in terms of some residual function that is typically easy to compute. In that sense, an error bound can be used both as part of a stopping rule during implementation of an algorithm and as an aid in proving algorithmic convergence. A comprehensive survey of error bounds for a variety of problems arising in optimization can be found in [15].

With regard to the problem of solving a nonsingular linear system $Ax = b$, one connection between condition measures and error bounds is immediate. Let $x^*$ be a solution to the system and $x$ be any other vector. Then
\[
\|x - x^*\| = \|A^{-1}A(x - x^*)\| = \|A^{-1}(Ax - b)\| \le \|A^{-1}\|\,\|Ax - b\|,
\]
so the distance to the solution set is bounded by a constant multiple of the norm of the residual vector $\|Ax - b\|$, and this constant is the same one that appears in the context of conditioning and distance to ill-posedness. As we discuss later, this result is not confined to systems of linear equations. Consequently, error bounds and the related condition numbers often make a prominent appearance in convergence proofs for a variety of algorithms.

In this paper, motivated by a recent randomized iterated projection scheme for systems of linear equations due to Strohmer and Vershynin in [22], we revisit some classical algorithms and show that, with an appropriate randomization scheme, we can demonstrate convergence rates directly in terms of these natural condition measures.

The rest of the paper is organized as follows. In the remainder of this section, we define some notation used throughout the rest of this paper.
In Section 3, we consider the problem of solving a linear system $Ax = b$ and show that a randomized coordinate descent scheme, implemented according to a specific probability distribution, is linearly convergent with a rate expressible in terms of traditional conditioning measures. In Section 4, we build upon the work of Strohmer and Vershynin in [22] by considering randomized iterated projection algorithms for linear inequality systems. In particular, we show how randomization can provide convergence rates in terms of the traditional Hoffman error bound in [10] as well as in terms of Renegar's distance to infeasibility from [18]. In Section 5, we consider randomized iterated projection algorithms for general convex sets and, under appropriate metric regularity assumptions, obtain local convergence rates in terms of the modulus of regularity.

The classical, deterministic versions of the simple algorithms we consider here have been widely studied, in part due to the extreme simplicity of each iteration: their linear convergence is well known. However, as remarked for linear systems of equations in [22], randomized versions are interesting for several reasons. The randomized iterated projection method for linear equations from which this work originated may have some practical promise, even compared with conjugate gradients, for example [22]. Our emphasis here, however, is theoretical: randomization provides a framework for simplifying the analysis of algorithms, allowing easy bounds on the rates of linear convergence in terms of natural linear-algebraic condition measures, such as relative condition numbers, Hoffman constants, and the modulus of metric regularity.

2 Notation

On the Euclidean space $\mathbf{R}^n$, we denote the Euclidean norm by $\|\cdot\|$. Let $e_i$ denote the column vector with a 1 in the $i$th position and zeros elsewhere. We consider $m$-by-$n$ real matrices $A$.
We denote the set of rows of $A$ by $\{a_1^T, \ldots, a_m^T\}$ and the set of columns by $\{A_1, \ldots, A_n\}$. The spectral norm of $A$ is the quantity $\|A\|_2 := \max_{\|x\|=1} \|Ax\|$ and the Frobenius norm is $\|A\|_F := \big(\sum_{i,j} a_{ij}^2\big)^{1/2}$. These norms satisfy the following inequality [11]:
\[
\|A\|_F \le \sqrt{n}\,\|A\|_2. \tag{2.1}
\]
For an arbitrary matrix $A$, let $\|A^{-1}\|_2$ be the smallest constant $M$ such that $\|Ax\| \ge \frac{1}{M}\|x\|$ for all vectors $x$. In the case $m \ge n$, if $A$ has singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n$, then $M$ can also be expressed as the reciprocal of the minimum singular value $\sigma_n$; if $A$ is invertible, this quantity equals the spectral norm of $A^{-1}$. The relative condition number of $A$ is the quantity $k(A) := \|A\|_2\,\|A^{-1}\|_2$; related to this is the scaled condition number introduced by Demmel in [4], defined by $\kappa(A) := \|A\|_F\,\|A^{-1}\|_2$. From this, it is easy to verify (using the singular value decomposition, for example) the following relationship between condition numbers:
\[
1 \le \frac{\kappa(A)}{\sqrt{n}} \le k(A).
\]

Now suppose the matrix $A$ is $n$-by-$n$, symmetric, and positive definite. The energy norm (or $A$-norm), denoted $\|\cdot\|_A$, is defined by $\|x\|_A := \sqrt{x^T A x}$. The inequality
\[
\|x\|_A^2 \le \|A^{-1}\|_2 \cdot \|Ax\|^2 \quad \text{for all } x \in \mathbf{R}^n \tag{2.2}
\]
is useful later. Further, if $A$ is merely positive semidefinite, we can generalize Inequality (2.2):
\[
x^T A x \le \frac{1}{\lambda(A)}\,\|Ax\|^2, \tag{2.3}
\]
where $\lambda(A)$ is the smallest nonzero eigenvalue of $A$. We denote the trace of $A$ by $\mathrm{tr}\,A$; it satisfies the inequality
\[
\|A\|_F \ge \frac{\mathrm{tr}\,A}{\sqrt{n}}. \tag{2.4}
\]

Given a nonempty closed convex set $S$, let $P_S(x)$ be the projection of $x$ onto $S$: that is, $P_S(x)$ is the vector $y$ that is the optimal solution to $\min_{z \in S} \|x - z\|^2$. Additionally, define the distance from $x$ to the set $S$ by
\[
d(x, S) = \min_{z \in S} \|x - z\| = \|x - P_S(x)\|.
\]
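As an illustration of these definitions, the following sketch (NumPy; random illustrative data, not from the paper) computes $k(A)$ and $\kappa(A)$ for a square matrix and checks the relationship $1 \le \kappa(A)/\sqrt{n} \le k(A)$:

```python
# Computing the condition measures of Section 2 and verifying
# 1 <= kappa(A)/sqrt(n) <= k(A) on a random square matrix.
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))

sigma = np.linalg.svd(A, compute_uv=False)
inv_norm = 1.0 / sigma[-1]                    # ||A^{-1}||_2 = 1/sigma_min
k_rel = sigma[0] * inv_norm                   # relative condition number k(A)
kappa = np.linalg.norm(A, 'fro') * inv_norm   # scaled condition number kappa(A)

assert 1.0 <= kappa / np.sqrt(n) <= k_rel + 1e-12
```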
The following useful inequality is standard:
\[
\|y - x\|^2 - \|P_S(y) - x\|^2 \ge \|y - P_S(y)\|^2 \quad \text{for all } x \in S,\ y \in \mathbf{R}^n. \tag{2.5}
\]

3 Randomized Coordinate Descent

Let $A$ be an $n$-by-$n$ positive-definite matrix. We consider a linear system of the form $Ax = b$, with solution $x^* = A^{-1}b$. We consider the equivalent problem of minimizing the strictly convex quadratic function
\[
f(x) = \frac{1}{2} x^T A x - b^T x,
\]
and note the standard relationship
\[
f(x) - f(x^*) = \frac{1}{2}\,\|x - x^*\|_A^2. \tag{3.1}
\]
Suppose our current iterate is $x$ and we obtain a new iterate $x^+$ by performing an exact line search in the nonzero direction $d$: that is, $x^+$ is the solution to $\min_{t \in \mathbf{R}} f(x + td)$. This gives us
\[
x^+ = x + \frac{(b - Ax)^T d}{d^T A d}\, d
\]
and
\[
f(x^+) - f(x^*) = \frac{1}{2}\,\|x^+ - x^*\|_A^2 = \frac{1}{2}\,\|x - x^*\|_A^2 - \frac{\big((Ax - b)^T d\big)^2}{2\, d^T A d}. \tag{3.2}
\]
One natural choice of a set of easily-computable search directions is to choose $d$ from the set of coordinate directions $\{e_1, \ldots, e_n\}$. Note that, when using search direction $e_i$, we can compute the new point
\[
x^+ = x + \frac{b_i - a_i^T x}{a_{ii}}\, e_i
\]
using only $2n + 2$ arithmetic operations. If the search direction is chosen at each iteration by successively cycling through the set of coordinate directions, then the algorithm is known to be linearly convergent, but with a rate not easily expressible in terms of typical matrix quantities (see [8] or [17]). However, by choosing a coordinate direction as a search direction randomly according to an appropriate probability distribution, we can obtain a convergence rate in terms of the relative condition number. This is expressed in the following result.

Algorithm 3.3 Consider an $n$-by-$n$ positive semidefinite system $Ax = b$ and let $x_0 \in \mathbf{R}^n$ be an arbitrary starting point. For $j = 0, 1, \ldots$, compute
\[
x_{j+1} = x_j + \frac{b_i - a_i^T x_j}{a_{ii}}\, e_i,
\]
where, at each iteration $j$, the index $i$ is chosen independently at random from the set $\{1, \ldots, n\}$, with distribution
\[
P\{i = k\} = \frac{a_{kk}}{\mathrm{tr}\,A}.
\]

Notice in the algorithm that the matrix $A$ may be singular, but that nonetheless $a_{ii} > 0$ almost surely. If $A$ is merely positive semidefinite, solutions of the system $Ax = b$ coincide with minimizers of the function $f$, and consistency of the system is equivalent to $f$ being bounded below. We now have the following result.

Theorem 3.4 Consider a consistent positive-semidefinite system $Ax = b$, and define the corresponding objective and error by
\[
f(x) = \frac{1}{2} x^T A x - b^T x, \qquad \delta(x) = f(x) - \min f.
\]
Then Algorithm 3.3 is linearly convergent in expectation: indeed, for each iteration $j = 0, 1, 2, \ldots$,
\[
E[\delta(x_{j+1}) \mid x_j] \le \Big(1 - \frac{\lambda(A)}{\mathrm{tr}\,A}\Big)\, \delta(x_j).
\]
In particular, if $A$ is positive-definite and $x^* = A^{-1}b$, we have the equivalent property
\[
E[\|x_{j+1} - x^*\|_A^2 \mid x_j] \le \Big(1 - \frac{1}{\|A^{-1}\|_2\, \mathrm{tr}\,A}\Big)\, \|x_j - x^*\|_A^2.
\]
Hence the expected reduction in the squared error $\|x_j - x^*\|_A^2$ is at least a factor
\[
1 - \frac{1}{\sqrt{n}\,\kappa(A)} \le 1 - \frac{1}{n\,k(A)}
\]
at each iteration.

Proof. Note that if coordinate direction $e_i$ is chosen during iteration $j$, then Equation (3.2) shows
\[
f(x_{j+1}) = f(x_j) - \frac{(b_i - a_i^T x_j)^2}{2\, a_{ii}}.
\]
Hence, it follows that
\[
E[f(x_{j+1}) \mid x_j] = f(x_j) - \sum_{i=1}^n \frac{a_{ii}}{\mathrm{tr}\,A} \cdot \frac{(b_i - a_i^T x_j)^2}{2\, a_{ii}} = f(x_j) - \frac{1}{2\,\mathrm{tr}\,A}\,\|Ax_j - b\|^2.
\]
Using Inequality (2.3) and Equation (3.1), we easily verify
\[
\frac{1}{2}\,\|Ax_j - b\|^2 \ge \lambda(A)\, \delta(x_j),
\]
and the first result follows. Applying Equation (3.1) provides the second result. The final result comes from applying Inequalities (2.1) and (2.4). $\Box$

The simple idea behind the proof of Theorem 3.4 is the main engine driving the remaining results in this paper. Fundamentally, the idea is to choose a probability distribution so that the expected distance to the solution from the new iterate is the distance to the solution from the old iterate minus some multiple of a residual.
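Concretely, Algorithm 3.3 is only a few lines of code. The sketch below (NumPy; the positive-definite test matrix and right-hand side are made-up illustrative data, not the paper's experiments) samples coordinates in proportion to the diagonal of $A$ and confirms that the iterates approach $A^{-1}b$:

```python
# A minimal sketch of Algorithm 3.3: randomized coordinate descent for a
# positive-definite system Ax = b, with P{i = k} = a_kk / tr A.
import numpy as np

rng = np.random.default_rng(2)
n = 20
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # symmetric positive definite
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

p = np.diag(A) / np.trace(A)           # sampling distribution
x = np.zeros(n)
for _ in range(2000):
    i = rng.choice(n, p=p)
    x[i] += (b[i] - A[i] @ x) / A[i, i]   # exact line search along e_i

err = np.linalg.norm(x - x_star)
assert err < 1e-6                      # iterates approach A^{-1} b
```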
Then, using some type of error bound to bound the distance to a solution in terms of the residual, we obtain expected linear convergence of the algorithm.

Now let us consider the more general problem of finding a solution to a linear system $Ax = b$ where $A$ is an $m$-by-$n$ matrix. More generally, since the system might be inconsistent, we seek a "least squares solution" by minimizing the function $\|Ax - b\|^2$. The minimizers are exactly the solutions of the positive-semidefinite system $A^T A x = A^T b$, to which we could easily apply the previous algorithm; however, as usual, we wish to avoid computing the new matrix $A^T A$ explicitly. Instead, we can proceed as follows.

Algorithm 3.5 Consider a linear system $Ax = b$ for a nonzero $m$-by-$n$ matrix $A$. Let $x_0 \in \mathbf{R}^n$ be an arbitrary initial point and let $r_0 = b - Ax_0$ be the initial residual. For each $j = 0, 1, \ldots$, compute
\[
\alpha_j = \frac{A_i^T r_j}{\|A_i\|^2}, \qquad x_{j+1} = x_j + \alpha_j e_i, \qquad r_{j+1} = r_j - \alpha_j A_i,
\]
where, at each iteration $j$, the index $i$ is chosen independently at random from the set $\{1, \ldots, n\}$, with distribution
\[
P\{i = k\} = \frac{\|A_k\|^2}{\|A\|_F^2} \qquad (k = 1, 2, \ldots, n).
\]
(In the formula for $\alpha_j$, notice by assumption that $A_i \ne 0$ almost surely.)

Note that the step size at each iteration can be obtained by directly minimizing the residual in the respective coordinate direction. However, the algorithm can also be viewed as the application of the algorithm for positive-definite systems to the system of normal equations $A^T A x = A^T b$, without actually having to compute the matrix $A^T A$. Given the motivation of directly minimizing the residual, we would expect Algorithm 3.5 to converge to a least squares solution, even in the case where the underlying system is inconsistent. The next result shows that this is, in fact, the case.

Theorem 3.6 Consider any linear system $Ax = b$, where the matrix $A$ is nonzero.
Define the least-squares residual and the error by
\[
f(x) = \frac{1}{2}\,\|Ax - b\|^2, \qquad \delta(x) = f(x) - \min f.
\]
Then Algorithm 3.5 is linearly convergent in expectation to a least squares solution for the system: for each iteration $j = 0, 1, 2, \ldots$,
\[
E[\delta(x_{j+1}) \mid x_j] \le \Big(1 - \frac{\lambda(A^T A)}{\|A\|_F^2}\Big)\, \delta(x_j).
\]
In particular, if $A$ has full column rank, we have the equivalent property
\[
E[\|x_{j+1} - \hat{x}\|_{A^T A}^2 \mid x_j] \le \Big(1 - \frac{1}{\kappa(A)^2}\Big)\, \|x_j - \hat{x}\|_{A^T A}^2,
\]
where $\hat{x} = (A^T A)^{-1} A^T b$ is the unique least-squares solution.

Proof. It is easy to verify, by induction on $j$, that the iterates $x_j$ are exactly the same as the iterates generated by Algorithm 3.3 when applied to the positive-semidefinite system $A^T A x = A^T b$, and furthermore that the residuals satisfy $r_j = b - Ax_j$ for all $j = 0, 1, 2, \ldots$. Hence, the results follow directly from Theorem 3.4. $\Box$

By the coordinate descent nature of this algorithm, once we have computed the initial residual $r_0$ and the column norms $\{\|A_i\|^2\}_{i=1}^n$, we can perform each iteration in $O(n)$ time, just as in the positive-definite case. Specifically, this new iteration takes $4n + 1$ arithmetic operations, compared with $2n + 2$ for the positive-definite case.

For a computational example, we apply Algorithm 3.5 to random $500 \times n$ matrices, where each element of $A$ and $b$ is an independent Gaussian random variable, and we let $n$ take the values 50, 100, 150 and 200. Note that in the above examples, the theoretical bound provided by Theorem 3.6 predicts the actual behavior of the algorithm reasonably well.

4 Randomized Iterated Projections

Iterated projection algorithms share some important characteristics with coordinate descent algorithms. Both are well studied and much convergence theory exists; a comprehensive overview of iterated projections can be found in [5].
However, even for linear systems of equations, standard developments do not provide bounds on convergence rates in terms of the usual condition numbers. By contrast, in the recent paper [22], Strohmer and Vershynin obtained such bounds via the following randomized iterated projection algorithm, which also provided the motivation for our work in the previous section.

Algorithm 4.1 Consider a linear system $Ax = b$ for a nonzero $m$-by-$n$ matrix $A$. Let $x_0 \in \mathbf{R}^n$ be an arbitrary initial point. For each $j = 0, 1, \ldots$, compute
\[
x_{j+1} = x_j - \frac{a_i^T x_j - b_i}{\|a_i\|^2}\, a_i,
\]
where, at each iteration $j$, the index $i$ is chosen independently at random from the set $\{1, \ldots, m\}$, with distribution
\[
P\{i = k\} = \frac{\|a_k\|^2}{\|A\|_F^2} \qquad (k = 1, 2, \ldots, m).
\]

Notice that the new iterate $x_{j+1}$ is simply the orthogonal projection of the old iterate $x_j$ onto the hyperplane $\{x : a_i^T x = b_i\}$. At first sight, the choice of probability distribution may seem curious, since we could rescale the equations arbitrarily without having any impact on the projection operations. However, following [22], we emphasize that the aim is to understand linear convergence rates in terms of linear-algebraic condition measures associated with the original system, rather than in terms of geometric notions associated with the hyperplanes. This randomized algorithm has the following behavior.

Theorem 4.2 (Strohmer-Vershynin, [22]) Given any matrix $A$ with full column rank, suppose the linear system $Ax = b$ has solution $x^*$. Then Algorithm 4.1 converges linearly in expectation: for each iteration $j = 0, 1, 2, \ldots$,
\[
E[\|x_{j+1} - x^*\|^2 \mid x_j] \le \Big(1 - \frac{1}{\kappa(A)^2}\Big)\, \|x_j - x^*\|^2.
\]

We seek a way of generalizing the above algorithm and convergence result to more general systems of linear inequalities, of the form
\[
\begin{cases} a_i^T x \le b_i & (i \in I_{\le}) \\ a_i^T x = b_i & (i \in I_{=}), \end{cases} \tag{4.3}
\]
where the disjoint index sets $I_{\le}$ and $I_{=}$ partition the set $\{1, 2, \ldots, m\}$. To do so, staying with the techniques of the previous section, we need a corresponding error bound for a system of linear inequalities. First, given a vector $x \in \mathbf{R}^n$, define the vector $x^+$ by $(x^+)_i = \max\{x_i, 0\}$. Then a starting point for this subject is a result by Hoffman in [10].

Theorem 4.4 (Hoffman) For any right-hand side vector $b \in \mathbf{R}^m$, let $S_b$ be the set of feasible solutions of the linear system (4.3). Then there exists a constant $L$, independent of $b$, with the following property:
\[
x \in \mathbf{R}^n \text{ and } S_b \ne \emptyset \ \Longrightarrow\ d(x, S_b) \le L\,\|e(Ax - b)\|, \tag{4.5}
\]
where the function $e : \mathbf{R}^m \to \mathbf{R}^m$ is defined by
\[
e(y)_i = \begin{cases} y_i^+ & (i \in I_{\le}) \\ y_i & (i \in I_{=}). \end{cases}
\]

In the above result, each component of the vector $e(Ax - b)$ indicates the error in the corresponding inequality or equation. In particular, $e(Ax - b) = 0$ if and only if $x \in S_b$. Thus Hoffman's result provides a linear bound for the distance from a trial point $x$ to the feasible region in terms of the size of the "a posteriori error" associated with $x$.

We call the minimum constant $L$ such that property (4.5) holds the Hoffman constant for the system (4.3). Several authors give geometric or algebraic meaning to this constant, or exact expressions for it, including [9], [14], [13]; for a more thorough treatment of the subject, see [15].
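For a concrete picture of Algorithm 4.1, here is a minimal sketch (NumPy; the consistent test system is made-up illustrative data, not the experiments of [22]) in which each step projects onto a hyperplane chosen with probability proportional to its squared row norm:

```python
# A minimal sketch of Algorithm 4.1 (the Strohmer-Vershynin randomized
# Kaczmarz method) on a consistent system Ax = b.
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 20
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star                          # consistent by construction

row_norms_sq = np.sum(A**2, axis=1)
p = row_norms_sq / row_norms_sq.sum()   # P{i = k} = ||a_k||^2 / ||A||_F^2

x = np.zeros(n)
for _ in range(5000):
    i = rng.choice(m, p=p)
    # project onto the hyperplane a_i^T x = b_i
    x -= ((A[i] @ x - b[i]) / row_norms_sq[i]) * A[i]

assert np.linalg.norm(x - x_star) < 1e-6
```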
In the case of linear equations (that is, $I_{\le} = \emptyset$), an easy calculation using the singular value decomposition shows that the Hoffman constant is just the reciprocal of the smallest nonzero singular value of the matrix $A$, and hence equals $\|A^{-1}\|_2$ when $A$ has full column rank.

For the problem of finding a solution to a system of linear inequalities, we consider a randomized algorithm generalizing Algorithm 4.1.

Algorithm 4.6 Consider the system of inequalities (4.3). Let $x_0$ be an arbitrary initial point. For each $j = 0, 1, \ldots$, compute
\[
\beta_j = \begin{cases} (a_i^T x_j - b_i)^+ & (i \in I_{\le}) \\ a_i^T x_j - b_i & (i \in I_{=}), \end{cases}
\qquad
x_{j+1} = x_j - \frac{\beta_j}{\|a_i\|^2}\, a_i,
\]
where, at each iteration $j$, the index $i$ is chosen independently at random from the set $\{1, \ldots, m\}$, with distribution
\[
P\{i = k\} = \frac{\|a_k\|^2}{\|A\|_F^2} \qquad (k = 1, 2, \ldots, m).
\]

In the above algorithm, notice $\beta_j = e(Ax_j - b)_i$. We can now generalize Theorem 4.2 as follows.

Theorem 4.7 Suppose the system (4.3) has nonempty feasible region $S$. Then Algorithm 4.6 converges linearly in expectation: for each iteration $j = 0, 1, 2, \ldots$,
\[
E[d(x_{j+1}, S)^2 \mid x_j] \le \Big(1 - \frac{1}{L^2 \|A\|_F^2}\Big)\, d(x_j, S)^2,
\]
where $L$ is the Hoffman constant.

Proof. Note that if the index $i$ is chosen during iteration $j$, then it follows that
\[
\begin{aligned}
\|x_{j+1} - P_S(x_{j+1})\|^2 &\le \|x_{j+1} - P_S(x_j)\|^2 \\
&= \Big\|x_j - \frac{e(Ax_j - b)_i}{\|a_i\|^2}\, a_i - P_S(x_j)\Big\|^2 \\
&= \|x_j - P_S(x_j)\|^2 + \frac{e(Ax_j - b)_i^2}{\|a_i\|^2} - 2\,\frac{e(Ax_j - b)_i}{\|a_i\|^2}\, a_i^T \big(x_j - P_S(x_j)\big).
\end{aligned}
\]
Note $P_S(x_j) \in S$. Hence if $i \in I_{\le}$, then $a_i^T P_S(x_j) \le b_i$ and $e(Ax_j - b)_i \ge 0$, so
\[
e(Ax_j - b)_i\, a_i^T \big(x_j - P_S(x_j)\big) \ge e(Ax_j - b)_i\, (a_i^T x_j - b_i) = e(Ax_j - b)_i^2.
\]
On the other hand, if $i \in I_{=}$, then $a_i^T P_S(x_j) = b_i$, so
\[
e(Ax_j - b)_i\, a_i^T \big(x_j - P_S(x_j)\big) = e(Ax_j - b)_i\, (a_i^T x_j - b_i) = e(Ax_j - b)_i^2.
\]
Putting these two cases together with the previous inequality shows
\[
d(x_{j+1}, S)^2 \le d(x_j, S)^2 - \frac{e(Ax_j - b)_i^2}{\|a_i\|^2}.
\]
Taking the expectation with respect to the specified probability distribution, it follows that
\[
E[d(x_{j+1}, S)^2 \mid x_j] \le d(x_j, S)^2 - \frac{\|e(Ax_j - b)\|^2}{\|A\|_F^2},
\]
and the result now follows from the Hoffman bound. $\Box$

Since Hoffman's bound is not independent of the scaling of the matrix $A$, it is not surprising that a normalizing constant like $\|A\|_F^2$ appears in the result.

For a computational example, we consider linear inequality systems $Ax \le b$ where the elements of $A$ are independent standard Gaussian random variables and $b$ is chosen so that the resulting system has a non-empty interior. We consider matrices $A$ which are $500 \times n$, letting $n$ take the values 50, 100, 150 and 200, and apply Algorithm 4.6 to these problems.

Another natural conditioning measure for linear inequality systems is the distance to infeasibility, defined by Renegar in [18] and shown in [20] to govern the convergence rate of interior point methods for linear programming. It is interesting, therefore, from a theoretical perspective, to obtain a linear convergence rate for iterated projection algorithms in terms of this condition measure as well. For simplicity, we concentrate on the inequality case, $Ax \le b$. To begin, let us recall the following results. The distance to infeasibility [18] for the system $Ax \le b$ is the number
\[
\mu = \inf\Big\{ \max\{\|\Delta A\|, \|\Delta b\|\} : (A + \Delta A)x \le b + \Delta b \text{ is infeasible} \Big\}.
\]

Theorem 4.8 (Renegar, [18], Thm 1.1) Suppose $\mu > 0$. Then there exists a point $\hat{x} \in S$ satisfying $\|\hat{x}\| \le \|b\|/\mu$. Furthermore, any point $x \in \mathbf{R}^n$ satisfies the inequality
\[
d(x, S) \le \frac{\max\{1, \|x\|\}}{\mu}\, \|(Ax - b)^+\|.
\]
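A sketch of Algorithm 4.6 on a pure inequality system $Ax \le b$ (so $I_{=} = \emptyset$ and $\beta_j = (a_i^T x_j - b_i)^+$) might look as follows. The problem data are made up, not the paper's $500 \times n$ experiments; $b$ is chosen to give the feasible region a nonempty interior, and the check uses the Fejér monotonicity of the iterates toward any feasible point:

```python
# A minimal sketch of Algorithm 4.6 for inequalities only: project onto
# the halfspace of a randomly chosen (possibly violated) constraint.
import numpy as np

rng = np.random.default_rng(4)
m, n = 60, 10
A = rng.standard_normal((m, n))
z = rng.standard_normal(n)
b = A @ z + 5.0                           # every constraint has slack 5 at z

row_norms_sq = np.sum(A**2, axis=1)
p = row_norms_sq / row_norms_sq.sum()     # P{i = k} = ||a_k||^2 / ||A||_F^2

x = z + 10.0 * rng.standard_normal(n)     # (almost surely) infeasible start
start_gap = np.linalg.norm(x - z)
for i in rng.choice(m, size=200_000, p=p):
    beta = max(A[i] @ x - b[i], 0.0)      # beta_j = (a_i^T x_j - b_i)^+
    x -= (beta / row_norms_sq[i]) * A[i]

residual = np.maximum(A @ x - b, 0.0)     # e(Ax - b) with I_= empty
assert np.linalg.norm(residual) < 1e-3    # near-feasibility
assert np.linalg.norm(x - z) <= start_gap + 1e-9   # Fejer monotone w.r.t. z
```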
Using this, we can bound the linear convergence rate for Algorithm 4.6 in terms of the distance to infeasibility, as follows. Notice first that $\|x_j - \hat{x}\|$ is nonincreasing in $j$, by Inequality (2.5). Suppose we start Algorithm 4.6 at the initial point $x_0 = 0$. Applying Theorem 4.8, we see that for all $j = 1, 2, \ldots$,
\[
\|x_j\| \le \|\hat{x}\| + \|x_j - \hat{x}\| \le \|\hat{x}\| + \|x_0 - \hat{x}\| \le \frac{2\|b\|}{\mu},
\]
so
\[
d(x_j, S) \le \max\Big\{\frac{1}{\mu}, \frac{2\|b\|}{\mu^2}\Big\}\, \|(Ax_j - b)^+\|.
\]
Using this inequality in place of Hoffman's bound in the proof of Theorem 4.7 gives
\[
E[d(x_{j+1}, S)^2 \mid x_j] \le \Bigg(1 - \frac{1}{\|A\|_F^2\, \big(\max\{\frac{1}{\mu}, \frac{2\|b\|}{\mu^2}\}\big)^2}\Bigg)\, d(x_j, S)^2.
\]
Although this bound may not be the best possible (and, in fact, it may not be as good as the bound provided in Theorem 4.7), it emphasizes a relationship between algorithm speed and conditioning measures that appears naturally in other contexts. In the next section, we proceed with these ideas in a more general framework.

5 Metric Regularity and Local Convergence

The previous section concerned global rates of linear convergence. If instead we are interested in local rates, we can re-examine a generalization of our problem through the alternative perspective of set-valued mappings. Consider a set-valued mapping $\Phi : \mathbf{R}^n \rightrightarrows \mathbf{R}^m$ and the problem of solving the associated constraint system $b \in \Phi(x)$ for the unknown vector $x$. For example, finding a feasible solution to $Ax \le b$ is equivalent to finding an $x$ such that
\[
b \in Ax + \mathbf{R}^m_+. \tag{5.1}
\]
Related to this is the idea of metric regularity of set-valued mappings. We say the set-valued mapping $\Phi$ is metrically regular at $\bar{x}$ for $\bar{b} \in \Phi(\bar{x})$ if there exists $\gamma > 0$ such that
\[
d\big(x, \Phi^{-1}(b)\big) \le \gamma\, d\big(b, \Phi(x)\big) \quad \text{for all } (x, b) \text{ near } (\bar{x}, \bar{b}), \tag{5.2}
\]
where $\Phi^{-1}(b) = \{x : b \in \Phi(x)\}$.
Further, the modulus of regularity is the infimum of all constants $\gamma$ such that condition (5.2) holds. Metric regularity is strongly connected with a variety of ideas from variational analysis: a good background reference is [21].

Metric regularity generalizes the error bounds discussed in previous sections, at the expense of only guaranteeing a bound in local terms. For example, if $\Phi$ is a single-valued linear map, then the modulus of regularity (at any $\bar{x}$ for any $\bar{b}$) corresponds to the typical conditioning measure $\|\Phi^{-1}\|$ (with $\|\Phi^{-1}\| = \infty$ implying the map is not metrically regular), and if $\Phi$ is a smooth single-valued mapping, then the modulus of regularity is the reciprocal of the minimum singular value of the Jacobian $\nabla\Phi(x)$. From an alternative perspective, metric regularity provides a framework for generalizing the Eckart-Young result on the distance to ill-posedness of linear mappings cited in Theorem 1.1. Specifically, if we define the radius of metric regularity at $\bar{x}$ for $\bar{b}$ for a set-valued mapping $\Phi$ between finite-dimensional spaces by
\[
\mathrm{rad}\,\Phi(\bar{x} \mid \bar{b}) = \inf\big\{\|E\| : \Phi + E \text{ not metrically regular at } \bar{x} \text{ for } \bar{b} + E(\bar{x})\big\},
\]
where the infimum is over all linear functions $E$, then one obtains the strikingly simple relationship (see [6])
\[
\text{modulus of regularity of } \Phi \text{ at } \bar{x} \text{ for } \bar{b} \;=\; \frac{1}{\mathrm{rad}\,\Phi(\bar{x} \mid \bar{b})}.
\]
We will not be directly using the above result. Here, we simply use the fundamental idea of metric regularity, which says that the distance from a point to the solution set, $d(x, \Phi^{-1}(b))$, is locally bounded by some constant times a "residual". For example, in the case where $\Phi$ corresponds to the linear inequality system (5.1), we have $d(b, \Phi(x)) = \|(Ax - b)^+\|$, so the modulus of regularity is in fact a global bound and equals the Hoffman constant.
More generally, we wish to emphasize that metric regularity ties together several of the ideas from previous sections, at the expense of those results now only holding locally instead of globally. In what follows, assume all distances are Euclidean.

We wish to consider how the modulus of regularity of $\Phi$ affects the convergence rate of iterated projection algorithms. We remark that linear convergence for iterated projection methods on convex sets has been very widely studied: see [5], for example. Our aim here is to observe, by analogy with previous sections, how randomization makes the linear convergence rate easy to interpret in terms of metric regularity.

Let $S_1, S_2, \ldots, S_m$ be closed convex sets in a Euclidean space $E$ such that $\cap_i S_i \ne \emptyset$, and write $S = \cap_i S_i$. Then, in a manner similar to [12], we can endow the product space $E^m$ with the inner product
\[
\big\langle (u_1, u_2, \ldots, u_m), (v_1, v_2, \ldots, v_m) \big\rangle = \sum_{i=1}^m \langle u_i, v_i \rangle
\]
and consider the set-valued mapping $\Phi : E \rightrightarrows E^m$ given by
\[
\Phi(x) = (S_1 - x,\; S_2 - x,\; \ldots,\; S_m - x). \tag{5.3}
\]
Then it clearly follows that $\bar{x} \in \cap_i S_i \Leftrightarrow 0 \in \Phi(\bar{x})$. Under appropriate regularity assumptions, we obtain the following local convergence result.

Theorem 5.4 Suppose the set-valued mapping $\Phi$ given by Equation (5.3) is metrically regular at $\bar{x}$ for 0 with regularity modulus $\gamma$. Let $\bar{\gamma}$ be any constant strictly larger than $\gamma$ and let $x_0$ be any initial point sufficiently close to $\bar{x}$. Further, suppose that $x_{j+1} = P_{S_i}(x_j)$ with probability $\frac{1}{m}$ for $i = 1, \ldots, m$. Then
\[
E[d(x_{j+1}, S)^2 \mid x_j] \le \Big(1 - \frac{1}{m \bar{\gamma}^2}\Big)\, d(x_j, S)^2.
\]

Proof. First, note that by Inequality (2.5), the distance $\|x_j - \bar{x}\|$ is nonincreasing in $j$. Hence if $x_0$ is sufficiently close to $\bar{x}$, then $x_j$ is as well for all $j \ge 0$.
Then, again using Inequality (2.5) (applied to the set $S_i$), we have, for all points $x \in S \subset S_i$,
\[
\|x_j - x\|^2 - \|x_j - P_{S_i}(x_j)\|^2 \ge \|P_{S_i}(x_j) - x\|^2.
\]
Taking the minimum over $x \in S$, we deduce
\[
d(x_j, S)^2 - \|x_j - P_{S_i}(x_j)\|^2 \ge d\big(P_{S_i}(x_j), S\big)^2.
\]
Hence
\[
\begin{aligned}
E[d(x_{j+1}, S)^2 \mid x_j] &= \frac{1}{m} \sum_{i=1}^m d\big(P_{S_i}(x_j), S\big)^2 \\
&\le \frac{1}{m} \sum_{i=1}^m \big[ d(x_j, S)^2 - d(x_j, S_i)^2 \big] \\
&= d(x_j, S)^2 - \frac{1}{m} \sum_{i=1}^m d(x_j, S_i)^2 \\
&= d(x_j, S)^2 - \frac{1}{m}\, d\big(0, \Phi(x_j)\big)^2 \\
&\le \Big(1 - \frac{1}{m \bar{\gamma}^2}\Big)\, d(x_j, S)^2,
\end{aligned}
\]
using the definition of metric regularity. $\Box$

Note that metric regularity at $\bar{x}$ for 0 is a slightly stronger assumption than actually necessary for this result. Specifically, the above result holds as long as condition (5.2) holds for all $x$ near $\bar{x}$ with $\bar{b} = 0$ fixed, as opposed to the definition above, which requires it to hold for all $b$ near $\bar{b}$ as well.

For a moment, let $m = 2$ and consider the sequence of iterates $\{x_j\}_{j \ge 0}$ generated by the randomized iterated projection algorithm. By idempotency of the projection operator, there is no benefit to projecting onto the same set in two consecutive iterations, so the subsequence consisting of distinct iterates corresponds exactly to that of the non-randomized iterated projection algorithm. In particular, if $x_j \in S_1$, then
\[
d\big(P_{S_2}(x_j), S\big)^2 \le d(x_j, S)^2 - d(x_j, S_2)^2 = d(x_j, S)^2 - \big[ d(x_j, S_1)^2 + d(x_j, S_2)^2 \big],
\]
since $d(x_j, S_1) = 0$. This gives us the following corollary, which also follows through more standard deterministic arguments.

Corollary 5.5 If $\Phi$ is metrically regular at $\bar{x}$ for 0 with regularity modulus $\gamma$ and $\bar{\gamma}$ is larger than $\gamma$, then for $x_0$ sufficiently close to $\bar{x}$, the 2-set iterated projection algorithm is linearly convergent and
\[
d(x_{j+1}, S)^2 \le \Big(1 - \frac{1}{\bar{\gamma}^2}\Big)\, d(x_j, S)^2.
\]
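The randomized scheme of Theorem 5.4 is easy to visualize for $m = 2$ sets: two lines through the origin in the plane, whose intersection is $S = \{0\}$, so that $d(x, S) = \|x\|$. The following sketch (NumPy; the sets and angle are made-up illustrative data) shows the distance to $S$ decaying linearly under uniformly random projections:

```python
# Randomized projections (Theorem 5.4, m = 2) for two lines through the
# origin in R^2; each pair of distinct consecutive projections contracts
# the norm by cos(theta), where theta is the angle between the lines.
import numpy as np

rng = np.random.default_rng(5)

def proj_line(x, u):
    """Project x onto the line spanned by the unit vector u."""
    return (x @ u) * u

u1 = np.array([1.0, 0.0])
theta = np.pi / 6                          # 30-degree angle between the lines
u2 = np.array([np.cos(theta), np.sin(theta)])

x = np.array([3.0, 4.0])
for _ in range(400):
    u = u1 if rng.random() < 0.5 else u2   # pick one of the two sets uniformly
    x = proj_line(x, u)

assert np.linalg.norm(x) < 1e-8            # d(x_j, S) -> 0 linearly
```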
Further, consider the following refined version of the m-set randomized algorithm. Suppose x_0 ∈ S_1 and i_0 = 1. Then for j = 1, 2, ..., let i_j be chosen uniformly at random from {1, ..., m} \ {i_{j−1}} and x_{j+1} = P_{S_{i_j}}(x_j). Then we obtain the following similar result.

Corollary 5.6  If Φ is metrically regular at x̄ for 0 with regularity modulus γ and γ̄ is larger than γ, then for x_0 sufficiently close to x̄, the refined m-set randomized iterated projection algorithm is linearly convergent in expectation and

    E[d(x_{j+1}, S)² | x_j, i_{j−1}] ≤ (1 − 1/((m − 1) γ̄²)) d(x_j, S)².

A simple but effective product space formulation by Pierra in [16] has the benefit of reducing the problem of finding a point in the intersection of finitely many sets to the problem of finding a point in the intersection of two sets. Using the notation above, we consider the closed set in the product space given by T = S_1 × S_2 × ... × S_m and the subspace L = {Ax : x ∈ E}, where the linear mapping A : E → E^m is defined by Ax = (x, x, ..., x). Again, notice that x̄ ∈ ∩_i S_i ⇔ (x̄, ..., x̄) ∈ T ∩ L. One interesting aspect of this formulation is that projections in the product space E^m relate back to projections in the original space E by

    (z_1, ..., z_m) ∈ P_T(Ax) ⇔ z_i ∈ P_{S_i}(x)    (i = 1, 2, ..., m),
    (P_L(z_1, ..., z_m))_i = (1/m)(z_1 + z_2 + ... + z_m)    (i = 1, ..., m).

This formulation provides a nice analytical framework: we can use the above equivalence of projections to consider the method of averaged projections directly, defined as follows.

Algorithm 5.7  Let S_1, ..., S_m ⊆ E be nonempty closed convex sets. Let x_0 be an initial point. For j = 1, 2, ..., let

    x_{j+1} = (1/m) Σ_{i=1}^m P_{S_i}(x_j).
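Algorithm 5.7 can be sketched directly, without explicitly forming the product space. In the minimal toy example below (the three halfspaces and the starting point are illustrative assumptions, not taken from the paper), averaging the individual projections at each step is exactly the composition P_L(P_T(·)) under Pierra's formulation.

```python
def project_halfspace(x, a, b):
    """Orthogonal projection of x onto the halfspace {y : <a, y> <= b}."""
    dot = sum(ai * xi for ai, xi in zip(a, x))
    if dot <= b:
        return list(x)  # already feasible: projection is the identity
    t = (dot - b) / sum(ai * ai for ai in a)
    return [xi - t * ai for xi, ai in zip(x, a)]

# Three halfspaces in R^2: x1 <= 0, x2 <= 0, x1 + x2 <= 1 (intersection nonempty).
halfspaces = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]

x = [5.0, 7.0]
for _ in range(500):
    projections = [project_halfspace(x, a, b) for a, b in halfspaces]
    # Average the m individual projections: the next iterate of Algorithm 5.7.
    x = [sum(p[i] for p in projections) / len(projections) for i in range(2)]

print(x)  # both coordinates shrink toward 0, a point of the intersection
```

With m = 3 here, Theorem 5.8 below guarantees that this averaged step contracts the distance to the intersection at least as fast as choosing one of the three halfspaces uniformly at random.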
Simply put, at each iteration the algorithm projects the current iterate onto each set individually and takes the average of those projections as the next iterate. In the product space formulation, this is equivalent to x_{j+1} = P_L(P_T(x_j)). Expanding on the work of Pierra in [16], additional convergence theory for this algorithm has been examined by Bauschke and Borwein in [2]. Under appropriate regularity conditions, the general idea is that convergence of the iterated projection algorithm for two sets implies convergence of the averaged projection algorithm for m sets. In a similar sense, we prove the following result in terms of randomized projections.

Theorem 5.8  Suppose S = ∩_{i=1}^m S_i is nonempty. If the randomized projection algorithm of Theorem 5.4 is linearly convergent in expectation with rate α, then so is Algorithm 5.7.

Proof  Let x_j be the current iterate, x^{AP}_{j+1} the new iterate in the method of averaged projections, and x^{RP}_{j+1} the new iterate in the method of uniformly randomized projections. Then note that

    x^{AP}_{j+1} = (1/m) Σ_{i=1}^m P_{S_i}(x_j) = E[x^{RP}_{j+1}].

By convexity of the S_i's, it follows that

    d(x^{AP}_{j+1}, S) = d(E[x^{RP}_{j+1} | x_j], S) ≤ E[d(x^{RP}_{j+1}, S) | x_j] ≤ α d(x_j, S),

by Jensen's inequality.  □

Hence, the method of averaged projections converges no more slowly than the method of uniformly random projections. In particular, under the assumptions of Theorem 5.4, the method of averaged projections converges with rate no larger than 1 − 1/(m γ̄²).

References

[1] H. Akaike. On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Annals of the Institute of Statistical Mathematics, 11:1–16, 1959.

[2] H.H. Bauschke and J.M. Borwein. On the convergence of von Neumann's alternating projection algorithm for two sets.
Set-Valued Analysis, 1:185–212, 1993.

[3] J.W. Demmel. On condition numbers and the distance to the nearest ill-posed problem. Numerische Mathematik, 51:251–289, 1987.

[4] J.W. Demmel. The probability that a numerical analysis problem is difficult. Mathematics of Computation, 50(182):449–480, 1988.

[5] F. Deutsch. Best Approximation in Inner Product Spaces. Springer-Verlag, New York, 2001.

[6] A.L. Dontchev, A.S. Lewis, and R.T. Rockafellar. The radius of metric regularity. Transactions of the American Mathematical Society, 355(2):493–517, 2003.

[7] C. Eckart and G. Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211–218, 1936.

[8] G. Golub and C. van Loan. Matrix Computations. Johns Hopkins University Press, 1996.

[9] O. Güler, A.J. Hoffman, and U. Rothblum. Approximations to solutions to systems of linear inequalities. SIAM J. Matrix Anal. Appl., 16(2):688–696, 1995.

[10] A.J. Hoffman. On approximate solutions of systems of linear inequalities. J. Res. Nat. Bur. Stand., 49:263–265, 1952.

[11] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 1999.

[12] A.S. Lewis, D.R. Luke, and J. Malick. Local convergence for alternating and averaged nonconvex projections. Foundations of Computational Mathematics, in revision.

[13] W. Li. The sharp Lipschitz constants for feasible and optimal solutions of a perturbed linear program. Linear Algebra and its Applications, 187:15–40, 1993.

[14] K.F. Ng and X.Y. Zheng. Hoffman's least error bounds for linear inequalities. Journal of Global Optimization, 30:391–403, 2004.

[15] J.-S. Pang. Error bounds in mathematical programming. Mathematical Programming, 79:299–332, 1997.

[16] G. Pierra. Decomposition through formalization in a product space. Mathematical Programming, 28:96–115, 1984.

[17] A. Quarteroni, R. Sacco, and F. Saleri. Numerical Mathematics. Springer-Verlag, 2007.

[18] J.
Renegar. Perturbation theory for linear programming. Mathematical Programming, 65:73–91, 1994.

[19] J. Renegar. Incorporating condition measures into the complexity theory of linear programming. SIAM J. Opt., 5(3):506–524, 1995.

[20] J. Renegar. Linear programming, complexity theory and elementary functional analysis. Mathematical Programming, 70:279–351, 1995.

[21] R.T. Rockafellar and R.J.-B. Wets. Variational Analysis. Springer, Berlin, 1998.

[22] T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, to appear.
