Non-asymptotic Error Bounds For Constant Stepsize Stochastic Approximation For Tracking Mobile Agents
This work revisits the constant stepsize stochastic approximation algorithm for tracking a slowly moving target and obtains a bound for the tracking error that is valid for the entire time axis, using the Alekseev non-linear variation of constants formula.
Authors: Bhumesh Kumar, Vivek Borkar, Akhil Shetty
Non-asymptotic Error Bounds for Constant Stepsize Stochastic Approximation for Tracking Mobile Agents

Bhumesh Kumar, Vivek Borkar, Akhil Shetty*

March 4, 2019

Abstract

This work revisits the constant stepsize stochastic approximation algorithm for tracking a slowly moving target and obtains a bound for the tracking error that is valid for the entire time axis, using the Alekseev non-linear variation of constants formula. It is the first non-asymptotic bound for the entire time axis in the sense that it is not based on the vanishing stepsize limit and associated limit theorems, unlike prior works, and it captures clearly the dependence on problem parameters and the dimension.

Keywords: Stochastic Approximation; Constant Stepsize; Non-asymptotic bound; Alekseev's Formula; Martingale Concentration Inequalities; Perturbation Analysis; Non-stationary Optimization

1 Introduction

1.1 Background

Robbins and Monro proposed in [39] the stochastic iterative scheme

  x_{n+1} = x_n + a_n [ h(x_n) + M_{n+1} ],  n ≥ 0,  (1)

* VB is, and BK, AS were, with the Department of Electrical Engineering, IIT Bombay, Powai, Mumbai, Maharashtra 400076, India. BK is now with the Department of Electrical and Computer Engineering, University of Wisconsin at Madison, Madison, WI 53706, USA. AS is now with the Department of Electrical Engineering and Computer Science, University of California at Berkeley, Cory Hall, Hearst Avenue, Berkeley, CA 94720, USA. Email: bkumar@wisc.edu, borkar.vs@gmail.com, shetty.akhil@berkeley.edu. Work of VB was supported in part by a J. C. Bose Fellowship, CEFIPRA grant No. IFC/DST-Inria-2016-01/448 "Machine Learning for Network Analytics" and a grant for 'Approximation for high dimensional optimization and control problems' from the Department of Science and Technology, Government of India.
for finding the zero(s) of a function h(·) given its noisy evaluations, with M_{n+1} being the measurement noise. By a clever choice of the stepsize sequence {a_n}, viz., those satisfying

  Σ_n a_n = ∞,  Σ_n a_n² < ∞,  (2)

they were able to show almost sure (a.s.) convergence of the scheme to a zero of h under reasonable hypotheses. The scheme has since been a cornerstone not only of statistical computation, but also of a variety of engineering applications ranging from signal processing and adaptive control to, more recently, machine learning. See [8, 15] for some recent pedagogical accounts of stochastic approximation. What makes it so popular is its typically low per-iterate memory and computational requirement and its ability to 'average out' the noise, which makes it ideal for adaptive estimation/learning scenarios. A later viewpoint [17], [34] views (1) as a noisy discretization of the ordinary differential equation (ODE for short)

  ẋ(t) = h(x(t))  (3)

with decreasing stepsize, and argues that the errors due to discretization and noise are asymptotically negligible under (2), so that it has the same asymptotic behaviour as (3). See [8, 10] for a fuller development of this approach. The clean theory under (2) notwithstanding, there has also been an interest in, and a necessity to consider, the constant stepsize a_n ≡ a > 0. The strong convergence claims under (2) can then no longer be expected (barring some very special cases, e.g., when the right-hand side of (1) is contractive uniformly w.r.t. the noise variable); for the simple case of {M_n} being i.i.d. zero mean, the best one can hope for is convergence to a stationary distribution. What one can still expect is a high probability concentration around the desired target, viz., zero(s) of h, if the stepsize a is small [30, 31].
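The contrast between the decreasing stepsizes of (2) and a constant stepsize can be seen in a few lines of simulation. The sketch below is our own toy example (h(x) = −(x − 3) with Gaussian noise, parameters chosen arbitrarily) and is illustrative only: decreasing stepsizes converge to the zero of h, while a constant stepsize only concentrates around it.

```python
import random

def sa(h, x0, steps, a_fn, noise_sd=1.0, seed=0):
    """Run x_{n+1} = x_n + a_n * (h(x_n) + M_{n+1}) and return the final iterate."""
    rng = random.Random(seed)
    x = x0
    for n in range(steps):
        x = x + a_fn(n) * (h(x) + rng.gauss(0.0, noise_sd))
    return x

h = lambda x: -(x - 3.0)          # unique zero at x* = 3

# Decreasing stepsizes a_n = 1/(n+1), satisfying (2): iterates converge to x*.
x_dec = sa(h, x0=0.0, steps=20000, a_fn=lambda n: 1.0 / (n + 1))

# Constant stepsize a: iterates only concentrate near x*, with O(sqrt(a)) spread.
x_con = sa(h, x0=0.0, steps=20000, a_fn=lambda n: 0.01)

print(abs(x_dec - 3.0), abs(x_con - 3.0))
```

With a_n = 1/(n+1) the iterate is essentially a running average of the noisy observations, so its error shrinks like n^{-1/2}; with a constant a the error settles at a stationary spread of order sqrt(a).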
This is acceptable, and in fact unavoidable, in the important application area of tracking a slowly moving target or measuring a slowly evolving signal [4, 23], and in other instances of learning in a slowly varying environment. This is because with decreasing stepsize, the algorithmic time scale, dictated by the decreasing stepsize, eventually becomes slower than the time scale on which the target is moving, and the algorithm thereby loses its tracking ability. The alternative of either frequent resets or an adaptive loop gain is often not desirable because of the additional logic it requires, particularly when the algorithm is hard-wired [8, 44], and one settles for a judiciously chosen constant stepsize. Such schemes are a part of traditional signal processing and neural network algorithms [18, 19, 21, 28, 35, 41] and often show up in important applications such as quasi-stationary experimentation for meteorology [14], slowly evolving physical wave measurement [13], and more recently in online learning and non-stationary optimization [45, 51]. However, the focus in online learning is on cumulative regret bounds instead of all-time bounds. These developments have motivated analysis of constant stepsize schemes [10, 12, 25, 29, 30, 37, 38, 41] in the form of various limit theorems, (non)asymptotic analysis, laws of iterated logarithm, etc., but a convenient bound valid for all time, a useful metric for tracking applications in a slowly varying environment, seems to be a topic of relatively recent interest [51]. Our objective here is to provide precisely one such bound.

1.2 Comparison with Prior Art

As already mentioned, one of the main motivations of constant stepsize stochastic approximation has been its ability to track slowly moving environments.
Not surprisingly, much of the early work has come from signal processing and control, most notably in adaptive filtering, and this continues to be its primary application domain. Some representative works are [4], [20], [22], [23], [24], [41], [51], etc. Much of this work concerns tracking in specific models, and the proposed schemes usually have a very specific structure, e.g., linear. From a purely theoretical angle, analyses appear in [12], [25], [30], [37], [38], among others. The emphasis of the latter is towards analyzing convergence properties in the small stepsize limit and the associated functional central limit theorem for fluctuations around the deterministic o.d.e. limit, except in the case of [25], which establishes a law of iterated logarithms, and [38], which obtains confidence bounds for a specific choice of adaptive stepsizes and stopping rule. The latter is a non-asymptotic result, as its title suggests, but in a different sense than ours.

In the context of tracking, the functional central limit theorem characterizing a Gauss-Markov process as a limit in law of suitably scaled fluctuations has also been used for suggesting performance metrics for tracking applications, see, e.g., [5]. More recently, constant stepsize stochastic gradient descent and its variants have elicited interest in the machine learning literature due to the possibility of using them in conjunction with iterate averaging to get better speed than decreasing stepsizes, see, e.g., [3]. The pros and cons of these have been discussed, e.g., in [32]. This motivation, however, is not relevant for tracking, because iterate averaging is itself a stochastic approximation with decreasing stepsizes (a_n = 1/(n+1), to be precise), and decreasing stepsizes are simply not an option here, because the iterates will eventually become slower than the slowly varying signal and lose their tracking ability.
Another strand of work analyzes tracking in the specific context of tracking the solution of an optimization problem when its parameters drift slowly [45]. Tracking problems have also been studied in the literature as regime-switching stochastic approximations, where the evolution is modulated by a Markov chain on a time scale equal to or faster than that of the algorithm. This situation has been analysed through mean squared error bounds [46, 47] and is close in spirit to ours.

1.3 Our contributions

Our main result is Theorem 4.1. The highlights of this result are as follows.

1. Our set-up is applicable to a very general scenario that includes unbounded correlated noise without any explicit evolution model, no explicit strong convexity or linearity assumptions regarding the dynamics being tracked, and so on, rendering it a more general framework than in prior work.

2. We provide a bound valid for the entire time axis, not only for a finite time interval as in, e.g., 'sample complexity' bounds, nor purely asymptotic as in, e.g., cumulative regret bounds or asymptotic error bounds. That is, it holds uniformly for all n, 0 ≤ n < ∞, not only for n ≤ some N or in the n → ∞ limit, which need not reflect the finite time behavior. This is particularly relevant here because we are considering the problem of continuously tracking a time-varying, in particular non-stationary, target. Furthermore, this is achieved under a very general noise model, viz., martingale differences, which allow dependence across times, requiring only uncorrelatedness. Their conditional distributions given the past are required to satisfy an exponential moment bound that is satisfied by most standard distributions such as the exponential, the Gaussian, their mixtures, etc., except the heavy-tailed ones.

3.
This bound is non-asymptotic, i.e., it is derived for the actual constant stepsize a > 0 and not from an idealized limiting scenario based on a limit theorem for fluctuations in the a ↓ 0 limit, as is often the case in prior studies. To the best of our knowledge, ours is the first result to achieve this. Also, our derivation of the bound allows us to keep track of its dependence on problem parameters, dimension, etc., if needed.

4. We bound the exact error, which is given by the Alekseev formula; there is no approximation at this stage. Furthermore, we analyze this error keeping the slow movement of the target being tracked intact, without treating it as essentially static as, e.g., in [5].

As for potential avenues for improvement, we have the following observations:

1. It appears unlikely that the bounds that use Lipschitz constants, etc., can be improved much, if at all. The moment bounds on martingale differences use state-of-the-art martingale concentration inequalities and could improve if better inequalities become available. It may be noted that we assume exponential tails for the distributions of the martingale differences. Stronger inequalities such as McDiarmid's inequality may be used under stronger hypotheses such as uniformly bounded martingale differences, see, e.g., [6]. On a different note, if we allow heavy-tailed noise, one would get weaker claims using the corresponding, naturally weaker, concentration inequalities. Rather limited results are available here, see, e.g., [26] and its application to stochastic approximation in [2].

2. One potential spot for improvement is in the use of the assumption (†) below, which entails a stability condition for a linearized dynamics which is time-dependent.
Such conditions are available only under constraints on the time scale separation between the fast dynamics (of the algorithm) and the slow one (of the target). This, as argued later, is unavoidable, because there will be no tracking otherwise. This fact necessitates such a condition or something close to it. The one we have used, due to Solo [42], is the most general available to our knowledge. (Another class of sufficient conditions available is based on the existence of Liapunov functions and is not explicit like Solo's.)

1.4 Organization

We begin by describing the problem formulation in the next section. This is followed by the Alekseev formula as a non-linear generalization of the variation of constants formula, and a key exponential stability assumption. A useful set of sufficient conditions for this assumption is recalled. Section 3 details the error analysis characterizing the tracking behaviour, developed through a sequence of lemmas and leading to the main result in Section 4. Section 5 concludes with some discussion. An appendix recalls a martingale concentration inequality used in the main text.

1.5 Symbols and Notation

The section number where the notation first appears is given in parentheses.
x_n = iterate at time n (2.1)
h(·,·) = driving vector field of the tracking scheme (2.1)
a = stepsize (2.1)
M_{n+1} = martingale difference noise (2.1)
ε_{n+1} = additive error (2.1)
ε* = bound on ||ε_{n+1}|| (2.1)
y(·) = slowly varying signal to be tracked (2.1)
ǫ = small (≪ 1) number controlling the rate of y(·) (2.1)
γ(·) = vector field driving y(·) (2.1)
C* = max{ sup_n E[||x_n||²]^{1/2}, sup_n E[||x_n||⁴]^{1/4} } (2.1)
C_γ = sup_{t≥0} ||y(t)|| (2.1)
C_M, δ = constants featuring in the bound for ||M_{n+1}|| (2.1)
Φ = transition matrix of the linear system (2.2)
z(·) = slowly varying equilibrium for the algorithm (2.3)
d = dimension of x_n and z(·) (2.1/2.3)
∇ = gradient operator (2.3)
C_Φ, β = constants featuring in the exponential bound for Φ (2.3)
L_f = Lipschitz constant of a Lipschitz function f (generic)
G_f = constant of linear growth of a function f, i.e., ||f(x)|| ≤ G_f (1 + ||x||) (generic)
B_f = sup_x ||f(x)|| for a bounded function f (generic)
K_γ = max_{||y|| ≤ C_γ} ||γ(y)|| (3.2)
O(·) = big-O notation (3.3)
μ = 1/β (3.3)
K_1 = L_h̃ (1 + C_h + C_γ) + ε* (3.3)
K_2 = C_Φ L_h̃ (3.3)
K_3 = K_1 + L_γ aǫ (3.3)
K_4 = max{ 2C_M/δ², C_M²/δ² } (3.3)
K_5 = C_Φ³ L_D / (2β) (3.4)
K_6 = C_Φ³ L_γ L_D ǫ (3.4)
K_7 = max{ 24C_M/δ⁴, 4C_M²/δ⁴ } (3.4)
K_8 = 2√(6 C_M C_h²/δ²) (3.4)
K_9 = C_M γ_1 d^{1.5}/δ (3.5)

2 Preliminaries

2.1 The tracking problem

We consider a constant stepsize stochastic approximation algorithm given by the d-dimensional iteration

  x_{n+1} = x_n + a [ h(x_n, y_n) + M_{n+1} + ε_{n+1} ],  n ≥ 0,  (4)

for tracking a slowly varying signal governed by

  ẏ(t) = aǫ γ(y(t)),  (5)

with 0 < a < 1, 0 < ǫ ≪ 1. Also, y_n := y(n), n ≥ 0, is, with slight abuse of notation, the trajectory of (5) sampled at unit time intervals (taken to be unit without loss of generality) coincident with the clock of the above iteration.
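As a concrete illustration (ours, not from the paper), the pair (4)–(5) can be simulated in a few lines, here with the least-mean-square field h(x, y) = −(x − y) used as a running example in Section 2.1, a target moving slowly on a circle, and arbitrary small parameters. The tracking error settles at a small, bounded level after a transient.

```python
import math
import random

a, eps = 0.05, 0.1                 # stepsize a and slowness parameter epsilon
rng = random.Random(1)

def gamma(y):                      # vector field driving the target (bounded orbit)
    return (-y[1], y[0])

x = [0.0, 0.0]                     # tracker iterate x_n
y = [1.0, 0.0]                     # slowly varying target y(t), sampled at t = n
errs = []
for n in range(4000):
    M = (rng.gauss(0, 0.1), rng.gauss(0, 0.1))   # martingale difference noise
    # iteration (4) with h(x, y) = -(x - y) and no additive error term
    x = [x[i] + a * (-(x[i] - y[i]) + M[i]) for i in range(2)]
    # one Euler step of dy/dt = a * eps * gamma(y) per unit of iteration time
    g = gamma(y)
    y = [y[i] + a * eps * g[i] for i in range(2)]
    errs.append(math.dist(x, y))

print(max(errs[2000:]))            # tracking error stays bounded after a transient
```

Here the algorithmic time constant is 1/a = 20 iterations while the target turns at angular rate aǫ per iteration, so the steady-state lag is of order ǫ, plus an O(sqrt(a)) noise-induced spread; this is exactly the time scale separation the smallness condition on ǫ is meant to enforce.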
We assume that y(t), t ≥ 0, remains in a bounded set. The term ε_{n+1} represents an added bounded component attributed to possible numerical errors (e.g., error in gradient estimation in the case of stochastic gradient algorithms [8]). We assume the following:

• The smallness condition on ǫ ensures a separation of time scales between the two evolutions (4) and (5); in particular, (5) has to be 'sufficiently slow' in a sense to be made precise later.

• h : (x, y) ↦ h(x, y) is twice continuously differentiable in x, with the first and second partial derivatives in x bounded uniformly for y in a compact set, and Lipschitz in y. A common example is h(x, y) = −(x − y), corresponding to the least mean square criterion for tracking in the above context, with x, resp. y, standing for the states of the tracking scheme and the target.

• γ(·) is Lipschitz continuous.

• C* := max{ sup_n E[||x_n||²]^{1/2}, sup_n E[||x_n||⁴]^{1/4} } < ∞. (See [10] for sufficient conditions for uniform boundedness of second moments. Analogous conditions can be given for fourth moments.)

• C_γ := sup_{t≥0} ||y(t)|| < ∞.

• There exists a constant ε* > 0 such that

  ||ε_{n+1}|| ≤ ε*,  ∀ n ≥ 0.  (6)

• {M_n} is a martingale difference sequence w.r.t. the increasing σ-fields

  F_n := σ(x_m, M_m, ε_m, m ≤ n),  n ≥ 0,

and satisfies: there exist continuous functions c_1, c_2 : R^d → (0, ∞), with c_2 bounded away from 0, such that

  P( ||M_{n+1}|| > u | F_n ) ≤ c_1(x_n) e^{−c_2(x_n) u},  n ≥ 0,  (7)

for all u ≥ v for a fixed, sufficiently large v > 0 (i.e., a sub-exponential tail), with

  sup_n E[c_1(x_n)] < ∞.  (8)

In particular, (7) and (8) together imply that there exist δ, C_M > 0 such that

  E[ e^{δ||M_{n+1}||} ] ≤ C_M,  n ≥ 0.  (9)

Using the Taylor expansion of the exponential function, we get

  Σ_{m=0}^∞ δ^m E[||M_{n+1}||^m] / m! ≤ C_M,  n ≥ 0.
(10)

As each term in the above summation is positive, we can conclude that for all n, m ≥ 0,

  E[||M_{n+1}||^m] ≤ C_M m!/δ^m.  (11)

We shall be interested in m = 2, 4. These bounds will play an important role in our error analysis.

We next state a formula due to Alekseev [1] that captures the difference between the trajectory of a system and its (regular) perturbation, and may be viewed as a 'non-linear variation of constants' formula.

2.2 Alekseev's formula

Consider the ODE

  ẇ(t) = f(t, w(t)),  t ≥ 0,

and its perturbed version,

  u̇(t) = f(t, u(t)) + g(t, u(t)),  t ≥ 0,

where f, g : R × R^d → R^d, with:

• f(t, x) measurable in t and continuously differentiable in x, with derivatives bounded uniformly w.r.t. t, and,

• g(t, x) measurable in t and Lipschitz in x uniformly w.r.t. t.

Let w(t, t_0, u_0) and u(t, t_0, u_0) denote respectively the solutions to the above non-linear systems for t ≥ t_0, satisfying w(t_0, t_0, u_0) = u(t_0, t_0, u_0) = u_0. Then for t ≥ t_0,

  u(t, t_0, u_0) = w(t, t_0, u_0) + ∫_{t_0}^t Φ(t, s, u(s, t_0, u_0)) g(s, u(s, t_0, u_0)) ds,  (12)

where Φ(t, s, w_0), for any w_0 ∈ R^d, is the fundamental matrix of the linearized system

  φ̇(t) = (∂f/∂w)(t, w(t, s, w_0)) φ(t),  t ≥ s,  (13)

with Φ(s, s, w_0) = I_d, the d-dimensional identity matrix. That is, it is the unique solution to the matrix linear differential equation

  Φ̇(t, s, w_0) = (∂f/∂w)(t, w(t, s, w_0)) Φ(t, s, w_0)

with the aforementioned initial condition at t = s. Equation (12) is the Alekseev non-linear variation of constants formula [1] (see also Lemma 3, [11]).
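Alekseev's formula (12) can be checked numerically. The sketch below uses an assumed scalar example, f(w) = −w − w³ with constant perturbation g ≡ 0.1 and explicit Euler integration (all choices ours, not from the paper): the perturbed endpoint u(T) is reconstructed, up to discretization error, as the unperturbed flow plus the integral of Φ·g along the perturbed trajectory.

```python
# Numerical sanity check of Alekseev's formula (12) for a scalar ODE.
f  = lambda w: -w - w**3
fp = lambda w: -1.0 - 3.0 * w**2     # df/dw, drives the linearization (13)
g0, u0, T, dt = 0.1, 1.0, 2.0, 1e-3
N = int(T / dt)

# Perturbed trajectory u(t): du/dt = f(u) + g, u(0) = u0, by explicit Euler.
u = [u0]
for _ in range(N):
    u.append(u[-1] + dt * (f(u[-1]) + g0))

def flow_and_Phi(s_idx, x):
    """Unperturbed flow w(T, s, x) and scalar fundamental 'matrix' Phi(T, s, x),
    integrating dw/dt = f(w) and (13) jointly from time index s_idx to N."""
    w, phi = x, 1.0
    for _ in range(N - s_idx):
        w, phi = w + dt * f(w), phi + dt * fp(w) * phi
    return w, phi

wT, _ = flow_and_Phi(0, u0)
# Right-hand side of (12): w(T, 0, u0) + integral of Phi(T, s, u(s)) g ds.
integral = sum(flow_and_Phi(k, u[k])[1] * g0 * dt for k in range(N))
print(abs(u[-1] - (wT + integral)))   # small discretization error
```

The residual printed at the end is of the order of the Euler step dt, which is the expected discretization error for this first-order scheme.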
The generalization of Alekseev's non-linear variation of constants to differing initial conditions [7] is given by

  u(t, t_0, u_0) = w(t, t_0, w_0) + Φ(t, t_0, u_0)(u_0 − w_0) + ∫_{t_0}^t Φ(t, s, u(s, t_0, u_0)) g(s, u(s, t_0, u_0)) ds,  (14)

where the additional additive term captures the contribution due to the differing initial conditions. This term will decay exponentially under our assumption (†) below.

2.3 Perturbation analysis

In view of the ODE approach described earlier, we consider the candidate ODE

  ẋ(t) = h(x(t), y),  (15)

where we have treated the y component as frozen at a fixed value in view of its slow evolution (recall that ǫ ≪ 1). We assume that this ODE has a globally stable equilibrium λ(y), where λ is twice continuously differentiable with bounded first and second derivatives. (Typically, this can be verified by using the implicit function theorem.) In particular,

  h(λ(y), y) = 0 ∀ y  ⟹  h(λ(y(t)), y(t)) = 0 ∀ t ≥ 0.

Define z(t) = λ(y(t)), t ≥ 0. Then

  ż(t) = ǫa ∇λ(y(t)) γ(y(t))
       = a h(λ(y(t)), y(t)) + ǫa ∇λ(y(t)) γ(y(t))
       = a h(z(t), y(t)) + ǫa ∇λ(y(t)) γ(y(t))
       = a h̃(z(t), y(t))

for

  h̃(z, y) := h(z, y) + ǫ ∇λ(y) γ(y).

The corresponding Euler scheme would be

  z_{n+1} = z_n + a h̃(z_n, y_n).

The tracking algorithm (4) can therefore be equivalently written as:

  x_{n+1} = x_n + a [ h(x_n, y_n) + M_{n+1} + ε_{n+1} ]  (16)
         = x_n + a [ h̃(x_n, y_n) − ǫ ∇λ(y_n) γ(y_n) + M_{n+1} + ε_{n+1} ]  (17)
         = x_n + a [ h̃(x_n, y_n) + κ_n(y_n) ],  (18)

where

  κ_n(y_n) = −ǫ ∇λ(y_n) γ(y_n) + M_{n+1} + ε_{n+1}.  (19)

Let x̄(t) be the linearly interpolated trajectory of the stochastic approximation iterates, so that x̄(t_k) = x_k. That is, for t_n ≡ na ∀ n,

  x̄(t) = x̄(t_n) + ((t − t_n)/a) ( x̄(t_{n+1}) − x̄(t_n) ),  t ∈ [t_n, t_{n+1}].
(20)

Then from (18), we get

  x̄(t_{n+1}) = x̄(t_0) + Σ_{k=0}^n a h̃(x̄(t_k), y(t_k)) − Σ_{k=0}^n aǫ ∇λ(y(t_k)) γ(y(t_k)) + Σ_{k=0}^n a M_{k+1} + Σ_{k=0}^n a ε_{k+1}  (21)
             = x̄(t_0) + Σ_{k=0}^n ∫_{t_k}^{t_{k+1}} h̃(x̄(t_k), y(t_k)) ds − Σ_{k=0}^n ∫_{t_k}^{t_{k+1}} ǫ ∇λ(y(t_k)) γ(y(t_k)) ds + Σ_{k=0}^n ∫_{t_k}^{t_{k+1}} M_{k+1} ds + Σ_{k=0}^n ∫_{t_k}^{t_{k+1}} ε_{k+1} ds.  (22)

For k ≥ 0 and s ∈ [t_k, t_{k+1}], define the perturbation terms:

  ζ_1(s) := h̃(x̄(t_k), y(t_k)) − h̃(x̄(s), y(s)),
  ζ_2(s) := M_{k+1},
  ζ_3(s) := ε_{k+1},
  ζ_4(s) := −ǫ ∇λ(y(t_k)) γ(y(t_k)).

Thus

  x̄(t_{n+1}) = x̄(t_0) + ∫_{t_0}^{t_{n+1}} h̃(x̄(s), y(s)) ds + ∫_{t_0}^{t_{n+1}} [ ζ_1(s) + ζ_2(s) + ζ_3(s) + ζ_4(s) ] ds.

Using (20),

  x̄(t) = x̄(t_0) + ∫_{t_0}^t h̃(x̄(s), y(s)) ds + ∫_{t_0}^t [ ζ_1(s) + ζ_2(s) + ζ_3(s) + ζ_4(s) ] ds.  (23)

Define Ξ(t) = ζ_1(t) + ζ_2(t) + ζ_3(t) + ζ_4(t). Consider the coupled systems

  ż(t) = h̃(z(t), y(t)),  (24)
  ẏ(t) = ǫa γ(y(t)),  (25)

and

  x̄̇(t) = h̃(x̄(t), y(t)) + Ξ(t),  (26)
  ẏ(t) = ǫa γ(y(t)).  (27)

The ODE (26) can be seen as a perturbation of (24), with the perturbation term being Ξ(t). Let D(·,·) ∈ R^{d×d} denote the Jacobian matrix of h (and therefore of h̃) in the first argument, and Γ(·) ∈ R^{d×d} the Jacobian matrix of λ. Then the linearization or 'equation of variation' of (24) is

  ṙ(t) = D(z(t), y(t)) r(t).  (28)

For t ≥ s ≥ 0 and x_0, y_0 ∈ R^d, let Φ(t, s; x_0, y_0) denote the fundamental matrix of the time-varying linear system (28), i.e., the solution to the matrix-valued differential equation

  Φ̇(t, s; x_0, y_0) = D(z(t), y(t)) Φ(t, s; x_0, y_0),  t ≥ s,  (29)

with initial condition Φ(s, s; x_0, y_0) = I.
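A fundamental matrix such as the one in (29) can be obtained numerically by integrating the matrix ODE directly. The minimal sketch below (our own example, with an assumed slowly varying 2×2 Jacobian D(t), not from the paper) also illustrates the kind of exponential decay bound ||Φ(t, s)|| ≤ C_Φ e^{−β(t−s)} that assumption (†) imposes.

```python
import math

def D(t):
    # Assumed slowly varying field: damping plus a slowly rotating skew part.
    c = math.sin(0.01 * t)
    return [[-1.0, c], [-c, -1.0]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def fundamental_matrix(s, t, dt=1e-3):
    """Integrate dPhi/dt = D(t) Phi, Phi(s, s) = I, by explicit Euler; cf. (29)."""
    Phi = [[1.0, 0.0], [0.0, 1.0]]
    tau = s
    while tau < t:
        dPhi = mat_mul(D(tau), Phi)
        Phi = [[Phi[i][j] + dt * dPhi[i][j] for j in range(2)] for i in range(2)]
        tau += dt
    return Phi

def frob(M):
    return math.sqrt(sum(M[i][j] ** 2 for i in range(2) for j in range(2)))

# For this D(t) the flow contracts at rate beta = 1 (Frobenius norm of a
# rotation is sqrt(2)), so C_Phi = 1.1 * sqrt(2) gives a valid bound.
s, t, beta, C_Phi = 0.0, 5.0, 1.0, 1.1 * math.sqrt(2)
nrm = frob(fundamental_matrix(s, t))
bound = C_Phi * math.exp(-beta * (t - s))
print(nrm, bound)
```

For this particular D(t) the damping part −I dominates, so the exponential bound holds; as discussed after (†), for a general time-varying D(t) pointwise stability of each D(t) is not enough.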
Then, by Alekseev's formula,

  x̄(t) = z(t) + Φ(t, t_0; x̄(t_0), y_0)( x̄(t_0) − z(t_0) ) + ∫_{t_0}^t Φ(t, s; x̄(s), y(s)) Ξ(s) ds.

Define

  ∆_n = Φ(t_n, t_0; x̄(t_0), y_0)( x̄(t_0) − z(t_0) ),  (30)

  A_n = Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} Φ(t_n, s; x̄(s), y(s)) [ h̃(x̄(t_k), y(t_k)) − h̃(x̄(s), y(s)) ] ds,  (31)

  B_n = Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} Φ(t_n, s; x̄(s), y(s)) M_{k+1} ds,  (32)

  C_n = Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} Φ(t_n, s; x̄(t_k), y(t_k)) M_{k+1} ds,  (33)

  D_n = Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} Φ(t_n, s; x̄(s), y(s)) ε_{k+1} ds,  (34)

  E_n = Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} Φ(t_n, s; x̄(s), y(s)) ǫ ∇λ(y(t_k)) γ(y(t_k)) ds.  (35)

Then

  x̄(t_n) = z(t_n) + ∫_{t_0}^{t_n} Φ(t_n, s; x̄(s), y(s)) ζ_1(s) ds + ∫_{t_0}^{t_n} Φ(t_n, s; x̄(s), y(s)) ζ_2(s) ds + ∫_{t_0}^{t_n} Φ(t_n, s; x̄(s), y(s)) ζ_3(s) ds + ∫_{t_0}^{t_n} Φ(t_n, s; x̄(s), y(s)) ζ_4(s) ds + ∆_n  (36)
         = z(t_n) + A_n + (B_n − C_n) + C_n + D_n − E_n + ∆_n.  (37)

Therefore

  ||x̄(t_n) − z(t_n)|| ≤ ||A_n|| + ||B_n − C_n|| + ||C_n|| + ||D_n|| + ||E_n|| + ||∆_n||.

Also,

  E[||x̄(t_n) − z(t_n)||²]^{1/2} ≤ E[||A_n||²]^{1/2} + E[||E_n||²]^{1/2} + E[||C_n||²]^{1/2} + E[||D_n||²]^{1/2} + E[||B_n − C_n||²]^{1/2} + E[||∆_n||²]^{1/2}.  (38)

We shall individually bound the above error terms in the next section, under the important assumption of exponential stability of the equation of variation (28):

(†) There exists β > 0 such that for all t > s ≥ 0 and x_0, y_0,

  ||Φ(t, s; x_0, y_0)|| ≤ C_Φ e^{−β(t−s)}.

This seemingly restrictive assumption requires some discussion; we argue in particular that some such assumption is essential if one is to obtain bounds valid for all time. To begin, since the idea is to have the parametrized o.d.e.
(15), which is a surrogate for the original iteration, track its unique asymptotically stable equilibrium parametrized by y as the parameter y ≈ y(t) changes slowly, it is essential that its rate of approach to the equilibrium, dictated by the spectrum of its linearized drift at this equilibrium, be much faster than the rate of change of the parameter. This already makes it clear that there will be a requirement of a minimum time scale separation for tracking to work at all.

A stronger motivation comes from the fact that the tracking error, given exactly by the Alekseev formula, depends on the linearization of the o.d.e. itself around its ideal trajectory z(·), which is a time-varying linear differential equation of the type ṙ(t) = A(t) r(t). It is well known in control theory that this can be unstable even if the matrix A(t) is stable for each t, see, e.g., Example 8.1, p. 131, [40]. Stability is guaranteed to hold only in the special case of A(t) varying slowly with time. The most general result in this direction is that of [42], which we recall below as a sufficient condition for (†). (There have also been some extensions thereof to nonlinear systems, see, e.g., [36].)

Consider the following time-varying linear dynamical system:

  ẋ(t) = [A(t) + P(t)] x(t)  (39)

and assume the following for this perturbed system:

1. There exists Ā > 0 such that

  limsup_{T↑∞} (1/T) ∫_{t_0}^{t_0+T} ||A(s)|| ds ≤ Ā  ∀ t_0.

2. There exist γ ∈ (0, 1], b > 0 and β > 0 sufficiently small, in the sense made precise in the theorem below, such that

  Σ_{t=n_0}^{n_0+n} ||A(t_2 + (t−1)T) − A(t_1 + (t−1)T)|| ≤ Tb + T^γ (n+1) β  ∀ n, n_0,

whenever |t_2 − t_1| ≤ T.

3. Let α(t) be the real part of that eigenvalue of A(t) whose real part is largest in absolute value.
Then there exists ᾱ < 0 such that, for any T > 0,

  limsup_{N↑∞} (1/N) Σ_{n=n_0}^{n_0+N} α(s + nT) ≤ ᾱ  ∀ s, n_0.

4. There exists δ > 0 such that

  limsup_{T↑∞} ∫_{t_0}^{t_0+T} ||P(s)|| ds ≤ δ  ∀ t_0.

Theorem 2.1 (Stability test for deterministic perturbations using an eigenvalue based characterization [42]). If the previously mentioned assumptions (A1)-(A4) hold, the system ẋ(t) = (A(t) + P(t)) x(t) is exponentially stable, provided we choose ǫ, δ > 0 small enough so that

  ᾱ + ǫ < 0,  and  ᾱ + ǫ + M_ǫ δ < 0,  with  M_ǫ = 3 ( 2(Ā + b)/ǫ + 1 )^{1/2},

where Ā, b, ᾱ are as defined in (A1)-(A4), and β is small enough so that

  ᾱ + ǫ + M_ǫ δ + 2 (ln M_ǫ)^{γ/(γ+1)} [ β ( M_ǫ + ǫ/(Ā + b) ) ]^{1/(γ+1)} < 0.

The correspondence of the foregoing with our framework is given by A(·) ↔ D(·,·), P(·) ↔ Ξ(·). We note here that there are also some sufficient conditions for stability of time-varying linear systems in terms of Liapunov functions, e.g., [49], [50], but they appear not so easy to verify.

3 Error bounds

Here we obtain the error bounds through a sequence of lemmas.

3.1 Bound on D_n

Lemma 1. For D_n defined in (34),

  E[||D_n||²]^{1/2} ≤ C_Φ ε*/β.  (40)

Proof. We have

  ||D_n|| = || Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} Φ(t_n, s; x̄(s), y(s)) ε_{k+1} ds ||
          ≤ ε* Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} ||Φ(t_n, s; x̄(s), y(s))|| ds  (41)
          ≤ C_Φ ε* ∫_{t_0}^{t_n} e^{−β(t_n−s)} ds  (42)
          ≤ C_Φ ε*/β,

where (41) and (42) follow from (6) and (†) respectively. Therefore, for all n ≥ 0, we have

  E[||D_n||²]^{1/2} ≤ C_Φ ε*/β.  (43)

3.2 Bound on E_n

Lemma 2. For E_n defined in (35),

  E[||E_n||²]^{1/2} ≤ K_γ L_λ C_Φ ǫ/β,  (44)

where K_γ := max_{||y|| ≤ C_γ} ||γ(y)||.

Proof.
We have

  ||E_n|| = || Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} Φ(t_n, s; x̄(s), y(s)) ǫ ∇λ(y(t_k)) γ(y(t_k)) ds ||
          ≤ ǫ Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} ||Φ(t_n, s; x̄(s), y(s))|| ||∇λ(y(t_k))|| ||γ(y(t_k))|| ds  (45)
          ≤ ǫ K_γ L_λ Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} ||Φ(t_n, s; x̄(s), y(s))|| ds.  (46)

Using (†),

  ||E_n|| ≤ ǫ K_γ L_λ C_Φ ∫_{t_0}^{t_n} e^{−β(t_n−s)} ds  (47)
          ≤ K_γ L_λ C_Φ ǫ/β.  (48)

Hence

  E[||E_n||²]^{1/2} ≤ K_γ L_λ C_Φ ǫ/β.  (49)

3.3 Bound on A_n

The next lemma is a variant of Lemma 5.5 of [43].

Lemma 3. For a suitable constant K_1 > 0,

  ∫_{t_k}^{t_{k+1}} e^{−β(t_n−s)} ||x̄(s) − x̄(t_k)|| ds ≤ [ K_1 + G_h̃ ||x̄(t_k)|| + ||M_{k+1}|| ] e^{−β(t_n−t_{k+1})} a².

Proof. Using (20) and (18), for s ∈ [t_k, t_{k+1}],

  ||x̄(s) − x̄(t_k)|| = ((s − t_k)/a) ||x̄(t_{k+1}) − x̄(t_k)||
                     = (s − t_k) || h̃(x̄(t_k), y(t_k)) + M_{k+1} + ε_{k+1} ||
                     ≤ (s − t_k) [ ||h̃(x̄(t_k), y(t_k))|| + ||M_{k+1}|| + ||ε_{k+1}|| ]
                     ≤ (s − t_k) [ G_h̃ (1 + ||x̄(t_k)|| + ||y(t_k)||) + ||M_{k+1}|| + ε* ]  (50)
                     ≤ (s − t_k) [ G_h̃ (1 + C_γ) + G_h̃ ||x̄(t_k)|| + ||M_{k+1}|| + ε* ]
                     ≤ (s − t_k) [ K_1 + G_h̃ ||x̄(t_k)|| + ||M_{k+1}|| ],  (51)

where K_1 = G_h̃ (1 + C_γ) + ε*. Also,

  ∫_{t_k}^{t_{k+1}} (s − t_k) e^{−β(t_n−s)} ds ≤ e^{−β(t_n−t_{k+1})} a².

Therefore

  ∫_{t_k}^{t_{k+1}} e^{−β(t_n−s)} ||x̄(s) − x̄(t_k)|| ds ≤ [ K_1 + G_h̃ ||x̄(t_k)|| + ||M_{k+1}|| ] e^{−β(t_n−t_{k+1})} a².

Lemma 4. For A_n as defined in (31),

  E[||A_n||²]^{1/2} = O(a).

Proof.
We have

  ||A_n|| = || Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} Φ(t_n, s; x̄(s), y(s)) [ h̃(x̄(t_k), y(t_k)) − h̃(x̄(s), y(s)) ] ds ||
          ≤ Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} ||Φ(t_n, s; x̄(s), y(s))|| ||h̃(x̄(t_k), y(t_k)) − h̃(x̄(s), y(s))|| ds
          ≤ Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} C_Φ e^{−β(t_n−s)} L_h̃ [ ||x̄(t_k) − x̄(s)|| + ||y(t_k) − y(s)|| ] ds  (52)
          ≤ C_Φ L_h̃ Σ_{k=0}^{n−1} [ ( K_1 + G_h̃ ||x̄(t_k)|| + ||M_{k+1}|| ) e^{−β(t_n−t_{k+1})} a² + K_γ aǫ ∫_{t_k}^{t_{k+1}} (s − t_k) e^{−β(t_n−s)} ds ]  (53)
          ≤ C_Φ L_h̃ Σ_{k=0}^{n−1} [ ( K_1 + G_h̃ ||x̄(t_k)|| + ||M_{k+1}|| ) e^{−β(t_n−t_{k+1})} a² + K_γ aǫ e^{−β(t_n−t_{k+1})} a² ]  (54)
          = C_Φ L_h̃ Σ_{k=0}^{n−1} [ K_1 + G_h̃ ||x̄(t_k)|| + K_γ aǫ + ||M_{k+1}|| ] e^{−β(t_n−t_{k+1})} a²
          ≤ a C_Φ L_h̃ [ (K_1 + K_γ aǫ) μ + Σ_{k=0}^{n−1} ( G_h̃ ||x̄(t_k)|| + ||M_{k+1}|| ) a e^{−β(t_n−t_{k+1})} ],

where μ := 1/β. Equation (53) follows from Lemma 3. Denote the terms C_Φ L_h̃, K_1 + K_γ aǫ, Σ_{k=0}^{n−1} ||M_{k+1}|| a e^{−β(t_n−t_{k+1})} and Σ_{k=0}^{n−1} a ||x̄(t_k)|| e^{−β(t_n−t_{k+1})} by K_2, K_3, F_n and F̃_n respectively. Note that K_3 is O(1). Then

  ||A_n|| ≤ a K_2 [ K_3 μ + F_n + G_h̃ F̃_n ],
  E[||A_n||²]^{1/2} ≤ a K_2 [ K_3 μ + E[F_n²]^{1/2} + G_h̃ E[F̃_n²]^{1/2} ].

Now,

  E[F̃_n²]^{1/2} ≤ Σ_{k=0}^{n−1} a E[||x̄(t_k)||²]^{1/2} e^{−β(t_n−t_{k+1})} ≤ C* μ  (55)

and

  E[F_n²]^{1/2} ≤ Σ_{k=0}^{n−1} E[||M_{k+1}||²]^{1/2} a e^{−β(n−k)a} ≤ Σ_{k=0}^{n−1} ( √(2C_M)/δ ) a e^{−β(n−k)a} ≤ K_4 μ  (56)

where K_4 = √(2C_M)/δ. Therefore

  E[||A_n||²]^{1/2} ≤ a K_2 [ K_3 μ + K_4 μ + G_h̃ C* μ ].

Hence E[||A_n||²]^{1/2} = O(a).

3.4 Bound on B_n − C_n

Lemma 5. For B_n and C_n defined in (32) and (33),

  E[||B_n − C_n||²]^{1/2} = O(a).

Proof.
From (32) and (33) we have

  B_n − C_n = Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} [ Φ(t_n, s; x̄(s), y(s)) − Φ(t_n, s; x̄(t_k), y(t_k)) ] M_{k+1} ds.

Therefore

  ||B_n − C_n|| ≤ Σ_{k=0}^{n−1} ∫_{t_k}^{t_{k+1}} ||Φ(t_n, s; x̄(s), y(s)) − Φ(t_n, s; x̄(t_k), y(t_k))|| ||M_{k+1}|| ds.

From (28), we know that Φ(t, s; x̄(s), y(s)) and Φ(t, s; x̄(t_k), y(t_k)) are fundamental matrices for the linear systems given by, for t ≥ s,

  χ̇(t, s; x̄(s), y(s)) = D( z(t, s; x̄(s), y(s)), y(t, s; y(s)) ) χ(t, s; x̄(s), y(s)),  (57)

and

  χ̃̇(t, s; x̄(t_k), y(t_k)) = D( z(t, s; x̄(t_k), y(t_k)), y(t, s; y(t_k)) ) χ̃(t, s; x̄(t_k), y(t_k)).  (58)

So Φ(t, s; x̄(s), y(s)) and Φ(t, s; x̄(t_k), y(t_k)) satisfy the matrix-valued differential equations

  Φ̇(t, s; x̄(s), y(s)) = D( z(t, s; x̄(s), y(s)), y(t, s; y(s)) ) Φ(t, s; x̄(s), y(s)),  (59)

and

  Φ̇(t, s; x̄(t_k), y(t_k)) = D( z(t, s; x̄(t_k), y(t_k)), y(t, s; y(t_k)) ) Φ(t, s; x̄(t_k), y(t_k)).  (60)

For each column, indexed by j, the differential equations (59) and (60) can be equivalently written as

  Φ̇_j(t, s; x̄(s), y(s)) = D( z(t, s; x̄(s), y(s)), y(t, s; y(s)) ) Φ_j(t, s; x̄(s), y(s)),  (61)

and

  Φ̇_j(t, s; x̄(t_k), y(t_k)) = D( z(t, s; x̄(s), y(s)), y(t, s; y(s)) ) Φ_j(t, s; x̄(t_k), y(t_k))
    + [ D( z(t, s; x̄(t_k), y(t_k)), y(t, s; y(t_k)) ) − D( z(t, s; x̄(s), y(s)), y(t, s; y(s)) ) ] Φ_j(t, s; x̄(t_k), y(t_k)).
Treating (62) as a perturbation of (61) and applying Alekseev's formula (12) (in fact, the classical variation of constants formula for linear systems, which it generalizes) to each column of $\Phi(\cdot,\cdot;\cdot,\cdot)$, we have
\[
\begin{aligned}
\Phi_j(t_n,s;\bar x(t_k),y(t_k)) - \Phi_j(t_n,s;\bar x(s),y(s)) = \int_s^{t_n} & \Phi(t_n,t;\bar x(s),y(s))\,\big[D\big(z(t,s;\bar x(t_k),y(t_k)),\,y(t,s;y(t_k))\big) \\
& - D\big(z(t,s;\bar x(s),y(s)),\,y(t,s;y(s))\big)\big]\,\Phi_j(t,s;\bar x(t_k),y(t_k))\,dt.
\end{aligned}
\tag{63}
\]
Combining the equations (63) for all columns, we get
\[
\begin{aligned}
\Phi(t_n,s;\bar x(t_k),y(t_k)) - \Phi(t_n,s;\bar x(s),y(s)) = \int_s^{t_n} & \Phi(t_n,t;\bar x(s),y(s))\,\big[D\big(z(t,s;\bar x(t_k),y(t_k)),\,y(t,s;y(t_k))\big) \\
& - D\big(z(t,s;\bar x(s),y(s)),\,y(t,s;y(s))\big)\big]\,\Phi(t,s;\bar x(t_k),y(t_k))\,dt.
\end{aligned}
\]
Therefore
\[
\begin{aligned}
\|B_n - C_n\| \le{} & \sum_{k=0}^{n-1}\int_{t_k}^{t_{k+1}}\int_s^{t_n} \|\Phi(t_n,t;\bar x(s),y(s))\|\,\big\|D\big(z(t,s;\bar x(t_k),y(t_k)),\,y(t,s;y(t_k))\big) \\
& \qquad - D\big(z(t,s;\bar x(s),y(s)),\,y(t,s;y(s))\big)\big\|\,\|\Phi(t,s;\bar x(t_k),y(t_k))\|\,\|M_{k+1}\|\,dt\,ds \\
\le{} & \sum_{k=0}^{n-1}\int_{t_k}^{t_{k+1}}\int_s^{t_n} C_\Phi^2 L_D\, e^{-\beta(t_n-s)}\, e^{-\beta(t-s)}\,\big[\|z(t,s;\bar x(s),y(s)) - z(t,s;\bar x(t_k),y(t_k))\| \\
& \qquad + \|y(s)-y(t_k)\|\big]\,\|M_{k+1}\|\,dt\,ds && (64)\\
\le{} & \sum_{k=0}^{n-1}\int_{t_k}^{t_{k+1}}\int_s^{t_n} C_\Phi^2 L_D\, e^{-\beta(t_n-s)}\, e^{-\beta(t-s)}\,\Big[\big(\|\bar x(s)-\bar x(t_k)\| + \|y(s)-y(t_k)\|\big)\,C_\Phi e^{-\beta(t-s)} \\
& \qquad + \|y(s)-y(t_k)\|\Big]\,\|M_{k+1}\|\,dt\,ds, && (65)
\end{aligned}
\]
where (64) follows from $(\dagger)$ and the Lipschitz property of $D(\cdot,\cdot)$, while (65) follows from (14) and $(\dagger)$.
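The variation-of-constants identity invoked above can be checked numerically in a simple scalar instance. The sketch below verifies, for two scalar linear systems whose coefficients differ by a perturbation, that the difference of their solutions equals the integral of (transition factor) × (coefficient perturbation) × (perturbed solution); the coefficient functions `d` and `delta` are illustrative choices of ours, not objects from the paper.

```python
import numpy as np

# Check: for phi' = d(t) phi and phi_p' = (d(t) + delta(t)) phi_p, both equal to 1
# at time s, the variation-of-constants identity gives
#   phi_p(T) - phi(T) = int_s^T Phi(T, tau) * delta(tau) * phi_p(tau) dtau,
# where Phi(T, tau) = exp(int_tau^T d) is the nominal transition factor.
d = lambda t: -1.0 - 0.5 * np.sin(t)        # nominal (stable) coefficient
delta = lambda t: 0.3 * np.cos(2.0 * t)     # perturbation of the coefficient

s, t_end, N = 0.0, 2.0, 200_000
tau = np.linspace(s, t_end, N + 1)
h = tau[1] - tau[0]

def cumtrapz(f):                            # cumulative trapezoidal integral from s
    return np.concatenate([[0.0], np.cumsum((f[1:] + f[:-1]) * (h / 2.0))])

D = cumtrapz(d(tau))                        # int_s^tau d
Dp = cumtrapz(d(tau) + delta(tau))          # int_s^tau (d + delta)
phi, phi_p = np.exp(D), np.exp(Dp)          # nominal and perturbed solutions
Phi_T_tau = np.exp(D[-1] - D)               # Phi(t_end, tau)

g = Phi_T_tau * delta(tau) * phi_p          # integrand of the identity
integral = np.sum((g[1:] + g[:-1]) * (h / 2.0))
lhs = phi_p[-1] - phi[-1]
print(lhs, integral)                        # the two sides agree to quadrature error
```

For scalar linear systems the identity is exact, so the only discrepancy is the trapezoidal quadrature error, which is negligible at this resolution.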
We split the analysis into two terms as follows:
\[
\begin{aligned}
G_n :={} & \sum_{k=0}^{n-1}\int_{t_k}^{t_{k+1}}\int_s^{t_n} C_\Phi^3 L_D\, e^{-\beta(t_n-s)}\,\|\bar x(s)-\bar x(t_k)\|\, e^{-2\beta(t-s)}\,\|M_{k+1}\|\,dt\,ds \\
={} & \sum_{k=0}^{n-1}\int_{t_k}^{t_{k+1}} C_\Phi^3 L_D\, e^{-\beta(t_n-s)}\,\|\bar x(s)-\bar x(t_k)\|\,\frac{1-e^{-2\beta(t_n-s)}}{2\beta}\,\|M_{k+1}\|\,ds \\
={} & K_5 \sum_{k=0}^{n-1}\int_{t_k}^{t_{k+1}} e^{-\beta(t_n-s)}\,\|\bar x(s)-\bar x(t_k)\|\,\big(1-e^{-2\beta(t_n-s)}\big)\,\|M_{k+1}\|\,ds \\
\le{} & K_5 \sum_{k=0}^{n-1}\int_{t_k}^{t_{k+1}} e^{-\beta(t_n-s)}\,\|\bar x(s)-\bar x(t_k)\|\,\|M_{k+1}\|\,ds \\
\le{} & K_5 \sum_{k=0}^{n-1} a^2\, e^{-\beta(t_n-t_{k+1})}\,\big[K_1\|M_{k+1}\| + \|M_{k+1}\|^2 + G_{\tilde h}\|\bar x(t_k)\|\,\|M_{k+1}\|\big], && (66)
\end{aligned}
\]
where $K_5$ denotes $C_\Phi^3 L_D/2\beta$ and (66) follows from Lemma 3, and
\[
\begin{aligned}
H_n :={} & \sum_{k=0}^{n-1}\int_{t_k}^{t_{k+1}}\int_s^{t_n} C_\Phi^3 L_D\, e^{-\beta(t_n-s)}\,\|y(s)-y(t_k)\|\, e^{-2\beta(t-s)}\,\|M_{k+1}\|\,dt\,ds \\
& + \sum_{k=0}^{n-1}\int_{t_k}^{t_{k+1}}\int_s^{t_n} C_\Phi^2 L_D\, e^{-\beta(t_n-s)}\,\|y(s)-y(t_k)\|\, e^{-\beta(t-s)}\,\|M_{k+1}\|\,dt\,ds \\
={} & \sum_{k=0}^{n-1}\int_{t_k}^{t_{k+1}} C_\Phi^3 L_D\, e^{-\beta(t_n-s)}\,\|y(s)-y(t_k)\|\,\frac{1}{2\beta}\big(1-e^{-2\beta(t_n-s)}\big)\,\|M_{k+1}\|\,ds \\
& + \sum_{k=0}^{n-1}\int_{t_k}^{t_{k+1}} C_\Phi^2 L_D\, e^{-\beta(t_n-s)}\,\|y(s)-y(t_k)\|\,\frac{1}{\beta}\big(1-e^{-\beta(t_n-s)}\big)\,\|M_{k+1}\|\,ds \\
\le{} & \sum_{k=0}^{n-1}\int_{t_k}^{t_{k+1}} C_\Phi^2\Big(\frac12 + C_\Phi\Big)\frac{L_D}{\beta}\, e^{-\beta(t_n-s)}\, K_\gamma a\epsilon\,(s-t_k)\,\|M_{k+1}\|\,ds \\
\le{} & K_6 \sum_{k=0}^{n-1} e^{-\beta(t_n-t_{k+1})}\, a^3\,\|M_{k+1}\|, && (67)
\end{aligned}
\]
where (67) follows from (5) and $K_6 := \big(\frac12 + C_\Phi\big) C_\Phi^2 K_\gamma L_D \epsilon/\beta$. Further define $G_{1,n}$, $G_{2,n}$ and $G_{3,n}$ as follows:
\[
G_{1,n} = \sum_{k=0}^{n-1} a\,e^{-\beta(t_n-t_{k+1})}\,\|M_{k+1}\|, \qquad
G_{2,n} = \sum_{k=0}^{n-1} a\,e^{-\beta(t_n-t_{k+1})}\,\|M_{k+1}\|^2, \qquad
G_{3,n} = \sum_{k=0}^{n-1} a\,e^{-\beta(t_n-t_{k+1})}\,\|\bar x(t_k)\|\,\|M_{k+1}\| .
\]
Then
\[
\|B_n - C_n\| \le G_n + H_n \le K_5\, a\,\big(K_1 G_{1,n} + G_{2,n} + G_{\tilde h} G_{3,n}\big) + K_6\, a^2\, G_{1,n} .
\]
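The sums $G_{1,n}$, $G_{2,n}$, $G_{3,n}$ are all controlled by the same elementary geometric-sum estimate: with $t_k = ka$, the weights $a\,e^{-\beta(t_n-t_{k+1})}$ form a Riemann sum for $\int_0^{t_n} e^{-\beta(t_n-s)}\,ds \le 1/\beta = \mu$. A quick numerical sanity check (the values of $a$, $\beta$, $n$ below are illustrative choices of ours):

```python
import numpy as np

# Geometric-sum estimate behind G_{1,n}, G_{2,n}, G_{3,n}: the discrete sum of
# a * exp(-beta*(t_n - t_{k+1})) is within a factor exp(beta*a) of mu = 1/beta,
# and that factor disappears as a -> 0.
beta, a, n = 2.0, 0.01, 5000
k = np.arange(n)
weights = a * np.exp(-beta * ((n - (k + 1)) * a))   # a * exp(-beta*(t_n - t_{k+1}))
total = weights.sum()
mu = 1.0 / beta
print(total, mu)                     # ~0.50502 vs 0.5
assert total <= np.exp(beta * a) * mu
assert abs(total - mu) < a           # O(a)-close to mu
```

This is why each of $E[G_{1,n}^2]^{1/2}$, $E[G_{2,n}^2]^{1/2}$, $E[G_{3,n}^2]^{1/2}$ below picks up exactly one factor of $\mu$ and no growing factor in $n$.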
Therefore
\[
E[\|B_n - C_n\|^2]^{1/2} \le K_5\, a\,\big(K_1 E[G_{1,n}^2]^{1/2} + E[G_{2,n}^2]^{1/2} + G_{\tilde h}\, E[G_{3,n}^2]^{1/2}\big) + a^2 K_6\, E[G_{1,n}^2]^{1/2} .
\]
We now bound each of the terms in the previous expression. Using a calculation similar to the one used for (56), we have
\[
E[G_{1,n}^2]^{1/2} \le K_4\mu, \tag{68}
\]
\[
E[G_{2,n}^2]^{1/2} \le \sum_{k=0}^{n-1} E[\|M_{k+1}\|^4]^{1/2}\, a\,e^{-\beta(t_n-t_{k+1})} \le \sum_{k=0}^{n-1} \frac{\sqrt{24 C_M}}{\delta^2}\, a\,e^{-\beta(t_n-t_{k+1})} \le K_7\mu, \tag{69}
\]
where $K_7 = \sqrt{24 C_M}/\delta^2$, and
\[
\begin{aligned}
E[G_{3,n}^2]^{1/2} &\le \sum_{k=0}^{n-1} a\,e^{-\beta(t_n-t_{k+1})}\, E\big[\|\bar x(t_k)\|^2\,\|M_{k+1}\|^2\big]^{1/2} \\
&\le \sum_{k=0}^{n-1} a\,e^{-\beta(t_n-t_{k+1})}\, \big(E\|\bar x(t_k)\|^4\big)^{1/4}\,\big(E\|M_{k+1}\|^4\big)^{1/4} \\
&\le \sum_{k=0}^{n-1} a\,e^{-\beta(t_n-t_{k+1})}\, C^*\,\big(E\|M_{k+1}\|^4\big)^{1/4} \\
&\le \sum_{k=0}^{n-1} a\,e^{-\beta(t_n-t_{k+1})}\, \frac{C^*(24 C_M)^{1/4}}{\delta} \le K_8\mu,
\end{aligned}
\tag{70}
\]
where $K_8 = C^*(24 C_M)^{1/4}/\delta$. Using (68), (69) and (70), we have
\[
E[\|B_n - C_n\|^2]^{1/2} \le K_5\, a\,\big(K_1 K_4\mu + K_7\mu + G_{\tilde h} K_8\mu\big) + a^2 K_6 K_4\mu = O(a). \tag{71}
\]

3.5 Bound on $C_n$

Lemma 6. For $C_n$ defined in (33),
\[
E[\|C_n\|^2]^{1/2} = O\big(\max\{a^{1.5} d^{3.25},\, a^{0.5} d^{2.5}\}\big).
\]

Proof. It is easy to verify that $C_n$ satisfies the conditions for the martingale concentration inequality provided in Theorem 5.1 in the Appendix, with
\[
\alpha_{k,n} = \int_{t_k}^{t_{k+1}} \Phi(t_n,s;\bar x(t_k),y(t_k))\,ds, \qquad \gamma_1 = \frac{C_\Phi}{\beta}, \qquad \gamma_2 = 1, \qquad \beta_n = a,
\]
for $k, n \ge 0$. Thus
\[
E[\|C_n\|^2] = \int_0^\infty P(\|C_n\|^2 \ge s)\,ds = \int_0^\infty P(\|C_n\| \ge \sqrt s)\,ds .
\]
Using the martingale concentration inequality provided in Theorem 5.1 in the Appendix, we have
\[
E[\|C_n\|^2] \le \int_0^{K_9} 2 d^2 \exp\Big(-\frac{c s}{d^3 a}\Big)\,ds + \int_{K_9}^\infty 2 d^2 \exp\Big(-\frac{c\sqrt s}{d^{3/2} a}\Big)\,ds, \tag{72}
\]
where $K_9 = C_M \gamma_1 d^{1.5}/\delta$.
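The first step above is the layer-cake formula $E[\|C_n\|^2] = \int_0^\infty P(\|C_n\| \ge \sqrt s)\,ds$, valid for any nonnegative random variable. It can be sanity-checked on a distribution with known moments; the Exp(1) variable below is our illustrative choice, not an object from the paper.

```python
import numpy as np

# Layer-cake sanity check: for X >= 0, E[X^2] = int_0^inf P(X >= sqrt(s)) ds.
# For X ~ Exp(1), P(X >= sqrt(s)) = exp(-sqrt(s)) and both sides equal 2.
rng = np.random.default_rng(0)
x = rng.exponential(size=1_000_000)         # Monte Carlo estimate of E[X^2]
mc = np.mean(x ** 2)

s = np.linspace(0.0, 400.0, 400_001)        # truncate the integral; tail ~ 1e-7
tail = np.exp(-np.sqrt(s))                  # P(X >= sqrt(s)) in closed form
h = s[1] - s[0]
integral = np.sum((tail[1:] + tail[:-1]) * (h / 2.0))
print(mc, integral)                         # both close to E[X^2] = 2
```

The proof then substitutes the two-regime tail bound of Theorem 5.1 for the exact tail, which is what produces the split at $K_9$ in (72).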
Analysing the terms separately, we have
\[
\int_0^{K_9} 2 d^2 \exp\Big(-\frac{c s}{d^3 a}\Big)\,ds = \frac{2 d^5 a}{c}\Big(1 - \exp\Big(-\frac{c K_9}{d^3 a}\Big)\Big) \le \frac{2 d^5}{c}\, a, \tag{73}
\]
and
\[
\begin{aligned}
\int_{K_9}^\infty 2 d^2 \exp\Big(-\frac{c\sqrt s}{d^{3/2} a}\Big)\,ds &= \frac{4 d^5 a}{c^2} \exp\Big(-\frac{c\sqrt{K_9}}{d^{3/2} a}\Big)\Big(a + \frac{c\sqrt{K_9}}{d^{3/2}}\Big) \\
&\le \frac{4 d^5 a}{c^2}\cdot\frac{a\, d^{3/2}}{c\sqrt{K_9}}\,\Big(a + \frac{c\sqrt{K_9}}{d^{3/2}}\Big) && (74)\\
&= O\big(\max\{a^3 d^{6.5},\, a^2 d^5\}\big), && (75)
\end{aligned}
\]
where (74) follows from the fact that $e^{-1/a} \le a$ for $a > 0$. From (73) and (75), we have
\[
E[\|C_n\|^2] \le \frac{2 d^5}{c}\, a + O\big(\max\{a^3 d^{6.5},\, a^2 d^5\}\big) = O\big(\max\{a^3 d^{6.5},\, a d^5\}\big),
\]
so that
\[
E[\|C_n\|^2]^{1/2} = O\big(\max\{a^{1.5} d^{3.25},\, a^{0.5} d^{2.5}\}\big).
\]

4 Main result

Combining the foregoing bounds leads to our main result, stated as follows.

Theorem 4.1. The mean square deviation of the tracked iterates from the non-stationary trajectory satisfies
\[
\big(E\|x_n - \lambda(y(n))\|^2\big)^{1/2} \le \frac{C_\Phi \varepsilon^*}{\beta} + \frac{K_\gamma L_\lambda C_\Phi \epsilon}{\beta} + O\big(\max\{a^{1.5} d^{3.25},\, a^{0.5} d^{2.5}\}\big) + C_\Phi\, e^{-\beta(t_n - t_0)}\,\|x_0 - \lambda(y(0))\|. \tag{76}
\]

Proof. Using (38), $(\dagger)$ and Lemmas 1--6, we get
\[
\begin{aligned}
\big(E\|\bar x(t_n) - z(t_n)\|^2\big)^{1/2} &\le \frac{C_\Phi \varepsilon^*}{\beta} + \frac{K_\gamma L_\lambda C_\Phi \epsilon}{\beta} + O(a) + O\big(\max\{a^{1.5} d^{3.25},\, a^{0.5} d^{2.5}\}\big) + C_\Phi\, e^{-\beta(t_n - t_0)}\,\|\bar x(t_0) - z(t_0)\| \\
&= \frac{C_\Phi \varepsilon^*}{\beta} + \frac{K_\gamma L_\lambda C_\Phi \epsilon}{\beta} + O\big(\max\{a^{1.5} d^{3.25},\, a^{0.5} d^{2.5}\}\big) + C_\Phi\, e^{-\beta(t_n - t_0)}\,\|\bar x(t_0) - z(t_0)\| .
\end{aligned}
\]
The claim follows.

Remarks:

1. The $O(\cdot)$ notation is used above to isolate the dependence on the stepsize $a$. The exact constants involved are available in the relevant lemmas, but are suppressed in order to improve clarity.

2. The linear dependence of the error bound on $\varepsilon^*$ and $\epsilon$ is natural to expect, these being the contributions from the bounded additive error component $\varepsilon_n$ and the rate of variation of the tracking signal, respectively. The $O(\cdot)$ term is due to the martingale noise and the discretization. The last term accounts for the effect of the initial condition.

3.
By setting $\epsilon = 0$ in (76), we can recover as a special case a bound valid for all time for a stationary target. Then $y(\cdot) \equiv y^*$, a constant, and $z(\cdot) \equiv x^* = \lambda(y^*)$, also a constant, viz., an equilibrium for the system $\dot x(t) = h(x(t), y^*)$.

5 Conclusion and Future Work

We analyzed a constant stepsize stochastic approximation algorithm for tracking a slowly varying dynamical system and obtained a non-asymptotic bound valid for all time, with the dependence on the stepsize and the dimension given explicitly. The latter in particular provides insight into stepsize selection in the high dimensional regime.

A natural extension would be to the problem of tracking stochastic dynamics. Indeed, a suitable extension of Alekseev's formula, which is much more complex, is available for this purpose [48].

Appendix: A martingale concentration inequality

We state here the martingale concentration inequality we have used, from [43], which in turn is a slight adaptation of the results of [33].

Theorem 5.1. Let $S_n = \sum_{k=1}^n \alpha_{k,n} X_k$, where $X_k$ is an $\mathbb{R}^d$-valued $\mathcal{F}_k$-adapted martingale difference sequence and $\alpha_{k,n}$ is a sequence of bounded previsible real-valued $d \times d$ random matrices, i.e., $\alpha_{k,n} \in \mathcal{F}_{k-1}$ and there exists a finite number, say $A_{k,n}$, such that $\|\alpha_{k,n}\| \le A_{k,n}$. Suppose that for some $\delta, C > 0$,
\[
E\big[e^{\delta\|X_k\|} \,\big|\, \mathcal{F}_{k-1}\big] \le C, \qquad k \ge 1 .
\]
Further assume that there exist constants $\gamma_1, \gamma_2 > 0$, independent of $n$, so that $\sum_{k=1}^n A_{k,n} \le \gamma_1$ and $\max_{1 \le k \le n} A_{k,n} \le \gamma_2 \beta_n$, where $\beta_n$ is some positive sequence. Then for $\eta > 0$, there exists some constant $c > 0$ depending on $\delta, C, \gamma_1, \gamma_2$ such that
\[
P(\|S_n\| > \eta) \le
\begin{cases}
2 d^2\, e^{-c\eta^2/(d^3 \beta_n)} & \text{if } \eta \in \big(0,\, C\gamma_1 d^{1.5}/\delta\big], \\
2 d^2\, e^{-c\eta/(d^{1.5} \beta_n)} & \text{otherwise.}
\end{cases}
\]

References

[1] Alekseev, V. M., "An estimate for the perturbations of the solutions of ordinary differential equations." Vestnik Moskov. Univ.
Ser. 1, pp. 28–36, 1961.

[2] Anantharam, V. and Borkar, V. S., "Stochastic approximation with long range dependent and heavy tailed noise." Queueing Systems 71.1-2, pp. 221-242, 2012.

[3] Bach, F. and Moulines, E., "Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)." NIPS, pp. 773-781, 2013.

[4] Benveniste, A., "Design of adaptive algorithms for the tracking of time-varying systems." International Journal of Adaptive Control and Signal Processing 1.1, pp. 3–29, 1987.

[5] Benveniste, A. and Ruget, G., "A measure of the tracking capability of recursive stochastic algorithms with constant gains." IEEE Transactions on Automatic Control 27.3, pp. 639–649, 1982.

[6] Borkar, V. S., "On trapping probability of stochastic approximation." Combinatorics, Probability and Computing 11.1, pp. 11-20, 2002.

[7] Borkar, A. V., Borkar, V. S. and Sinha, A., "Aerial monitoring of slow moving convoys using elliptical orbits." European Journal of Control, available online, 2018.

[8] Borkar, V. S., "Stochastic Approximation: A Dynamical Systems Viewpoint." Hindustan Publishing Agency, New Delhi, and Cambridge University Press, Cambridge, UK, 2008.

[9] Borkar, V. S., "Probability Theory: An Advanced Course." Springer Science & Business Media, New York, 2012.

[10] Borkar, V. S. and Meyn, S. P., "The ODE method for convergence of stochastic approximation and reinforcement learning." SIAM Journal on Control and Optimization 38.2, pp. 447–469, 2000.

[11] Brauer, F., "Perturbations of non-linear systems of differential equations." Journal of Mathematical Analysis and Applications 14.2, pp. 198–206, 1966.

[12] Bucklew, J. A. and Kurtz, T. G., "Weak convergence and local stability properties of fixed step size recursive algorithms." IEEE Transactions on Information Theory 39.3, pp. 966-978, 1993.

[13] Burke, W.
L., "Gravitational radiation damping of slowly moving systems calculated using matched asymptotic expansions." Journal of Mathematical Physics 12.3, pp. 401–418, 1971.

[14] Chappell, C. F., "Quasi-stationary convective events." Mesoscale Meteorology and Forecasting, Springer, pp. 289-310, 1986.

[15] Chen, H.-F., "Stochastic Approximation and Its Applications." Springer Science & Business Media 64, 2006.

[16] Dalal, G., Szorenyi, B., Thoppe, G. and Mannor, S., "Concentration Bounds for Two Time-scale Stochastic Approximation with Applications to Reinforcement Learning." arXiv preprint arXiv:1703.05376, 2017.

[17] Derevitskii, D. P. and Fradkov, A. L., "Two models analyzing the dynamics of adaptation algorithms." Automation and Remote Control 35.1, pp. 59–67, 1974.

[18] Diamantaras, K. I. and Kung, S. Y., "Principal Component Neural Networks: Theory and Applications." John Wiley & Sons, Inc., 1996.

[19] Eweda, E., "Comparison of RLS, LMS, and sign algorithms for tracking randomly time-varying channels." IEEE Transactions on Signal Processing 42.11, pp. 2937–2944, 1994.

[20] Farden, D., "Tracking properties of adaptive signal processing algorithms." IEEE Transactions on Acoustics, Speech, and Signal Processing 29.3, pp. 439-446, 1981.

[21] Finnoff, W., "Diffusion approximations for the constant learning rate backpropagation algorithm and resistance to local minima." Advances in Neural Information Processing Systems, pp. 459-466, 1993.

[22] Guo, L. and Ljung, L., "Exponential stability of general tracking algorithms." IEEE Transactions on Automatic Control 40.8, pp. 1376-1387, 1995.

[23] Guo, L. and Ljung, L., "Performance analysis of general tracking algorithms." IEEE Transactions on Automatic Control 40.8, pp. 1388–1402, 1995.

[24] Guo, L., Ljung, L.
and Wang, G.-J., "Necessary and sufficient conditions for stability of LMS." IEEE Transactions on Automatic Control 42.6, pp. 761-770, 1997.

[25] Joslin, J. A. and Heunis, A. J., "Law of the iterated logarithm for a constant-gain linear stochastic gradient algorithm." SIAM Journal on Control and Optimization 39.2, pp. 533-570, 2000.

[26] Joulin, A., "On maximal inequalities for stable stochastic integrals." Potential Analysis 26.1, pp. 57-78, 2007.

[27] Khalil, H. K., "Nonlinear Systems" (3rd ed.), Prentice Hall, Englewood Cliffs, NJ, 1996.

[28] Kuan, C. M. and Hornik, K., "Convergence of learning algorithms with constant learning rates." IEEE Transactions on Neural Networks 2.5, pp. 484–489, 1991.

[29] Kushner, H. J. and Huang, H., "Averaging methods for the asymptotic analysis of learning and adaptive systems, with small adjustment rate." SIAM Journal on Control and Optimization 19.5, pp. 635-650, 1981.

[30] Kushner, H. J. and Huang, H., "Asymptotic properties of stochastic approximations with constant coefficients." SIAM Journal on Control and Optimization 19.1, pp. 87–105, 1981.

[31] Kushner, H. J. and Yin, G. G., "Stochastic Approximation Algorithms and Applications." Springer Verlag, New York, 1997.

[32] Lakshminarayanan, C. and Szepesvari, C., "Linear stochastic approximation: how far does constant stepsize and iterate averaging go?" Proc. 21st Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), Lanzarote, Spain, 2018.

[33] Liu, Q. and Watbled, F., "Exponential inequalities for martingales and asymptotic properties for the free energy of directed polymers in a random environment." Stochastic Processes and Their Applications 119.10, pp. 3101-3132, 2009.

[34] Ljung, L., "Analysis of recursive stochastic algorithms." IEEE Transactions on Automatic Control 22.4, pp. 551–575, 1977.

[35] Ng, S. C. and Leung, S. H.
and Luk, A., "Fast convergent generalized back-propagation algorithm with constant learning rate." Neural Processing Letters 9.1, pp. 13–23, 1999.

[36] Peuteman, J. and Aeyels, D., "Exponential stability of slowly time-varying non-linear systems." Mathematics of Control, Signals and Systems 15, pp. 202-228, 2002.

[37] Pflug, G. Ch., "Stochastic minimization with constant step-size: asymptotic laws." SIAM Journal on Control and Optimization 24.4, pp. 655–666, 1986.

[38] Pflug, G. Ch., "Non-asymptotic confidence bounds for stochastic approximation algorithms with constant step size." Monatshefte für Mathematik 110.3, pp. 297-314, 1990.

[39] Robbins, H. and Monro, S., "A stochastic approximation method." The Annals of Mathematical Statistics 22.3, pp. 400–407, 1951.

[40] Rugh, W. J., "Linear System Theory." Prentice Hall, Englewood Cliffs, NJ, 1993.

[41] Sharma, R., Sethares, W. A. and Bucklew, J. A., "Asymptotic analysis of stochastic gradient-based adaptive filtering algorithms with general cost functions." IEEE Transactions on Signal Processing 44.9, pp. 2186-2194, 1996.

[42] Solo, V., "On the stability of slowly time-varying linear systems." Mathematics of Control, Signals, and Systems (MCSS) 7.4, pp. 331–350, 1994.

[43] Thoppe, G. and Borkar, V. S., "A concentration bound for stochastic approximation via Alekseev's formula." Stochastic Systems, to appear; arXiv:1506.08657, 2019.

[44] Utkin, V., Guldner, J. and Shi, J., "Sliding Mode Control in Electromechanical Systems." CRC Press, 2017.

[45] Wilson, C., Veeravalli, V. and Nedic, A., "Adaptive Sequential Stochastic Optimization." arXiv preprint arXiv:1610.01970, 2018.

[46] Yin, G., Ion, C. and Krishnamurthy, V.,
"How does a stochastic optimization/approximation algorithm adapt to a randomly evolving optimum/root with jump Markov sample paths." Mathematical Programming 120.1, pp. 67-99, 2009.

[47] Yin, G., Krishnamurthy, V. and Ion, C., "Regime switching stochastic approximation algorithms with application to adaptive discrete stochastic optimization." SIAM Journal on Optimization 14.4, pp. 1187-1215, 2004.

[48] Zerihun, T. and Ladde, G. S., "Fundamental properties of solutions of nonlinear stochastic differential equations and method of variation of parameters." Dynamical Systems and Applications 22, pp. 433-458, 2013.

[49] Zhou, B., "On asymptotic stability of linear time-varying systems." Automatica 68, pp. 266–276, 2016.

[50] Zhou, B., "Stability analysis of non-linear time-varying systems by Lyapunov functions with indefinite derivatives." IET Control Theory & Applications 9, pp. 1434–1442, 2017.

[51] Zhu, J. and Spall, J. C., "Tracking capability of stochastic gradient algorithm with constant gain." 55th IEEE Conf. on Decision and Control (CDC), Las Vegas, pp. 4522–4527, 2016.