Universal Approximation of Input-Output Maps by Temporal Convolutional Nets
Joshua Hanson
University of Illinois
Urbana, IL 61801
jmh4@illinois.edu

Maxim Raginsky
University of Illinois
Urbana, IL 61801
maxim@illinois.edu

Abstract

There has been a recent shift in sequence-to-sequence modeling from recurrent network architectures to convolutional network architectures, due to computational advantages in training and operation while still achieving competitive performance. For systems having limited long-term temporal dependencies, the approximation capability of recurrent networks is essentially equivalent to that of temporal convolutional nets (TCNs). We prove that TCNs can approximate a large class of input-output maps having approximately finite memory to arbitrary error tolerance. Furthermore, we derive quantitative approximation rates for deep ReLU TCNs in terms of the width and depth of the network and the modulus of continuity of the original input-output map, and apply these results to input-output maps of systems that admit finite-dimensional state-space realizations (i.e., recurrent models).

1 Introduction

Until recently, recurrent networks have been considered the de facto standard for modeling input-output maps that transform sequences to sequences. Convolutional network architectures are becoming favorable alternatives for several applications due to reduced computational overhead incurred during both training and regular operation, while often performing as well as or better than recurrent architectures in practice. The computational advantage of convolutional networks follows from the lack of feedback elements, which enables shifted copies of the input sequence to be processed in parallel rather than sequentially [Gehring et al., 2017].
Convolutional architectures have demonstrated exceptional accuracy in sequence modeling tasks that have typically been approached using recurrent architectures, such as machine translation, audio generation, and language modeling [Dauphin et al., 2017, Kalchbrenner et al., 2016, van den Oord et al., 2016, Wu et al., 2016, Gehring et al., 2017, Johnson and Zhang, 2017]. One explanation for this shift is that both convolutional and recurrent architectures are inherently suited to modeling systems with limited long-term dependencies. Recurrent models possess infinite memory (the output at each time is a function of the initial conditions and the entire history of inputs until that time), and thus are strictly more expressive than finite-memory autoregressive models. However, in synthetic stress tests designed to measure the ability to model long-term behavior, recurrent architectures often fail to learn long sequences [Bai et al., 2018]. Furthermore, this unlimited memory property is usually unnecessary, which is supported in theory [Sharan et al., 2018] and in practice [Chelba et al., 2017, Gehring et al., 2017]. In situations where it is only important to learn finite-length sequences, feedforward architectures based on temporal convolutions (temporal convolutional nets, or TCNs) can achieve similar results and even outperform recurrent nets [Dauphin et al., 2017, Yin et al., 2017, Bai et al., 2018].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

These results prompt a closer look at the conditions under which convolutional architectures provide better approximation than recurrent architectures.
Recent work by Miller and Hardt [2019] has shown that recurrent models that are exponentially stable (in the sense that the effect of the initial conditions on the output decays exponentially with time) can be efficiently approximated by feedforward models. A key consequence is that exponentially stable recurrent models can be approximated by systems that only consider a finite number of recent values of the input sequence for determining the value of the subsequent output. However, this notion of stability is inherently tied to a particular state-space realization, and it is not difficult to come up with examples of sequence-to-sequence maps that have both a stable and an unstable state-space realization (e.g., simply by adding unstable states that do not affect the output). This suggests that the question of approximating sequence-to-sequence maps by feedforward convolutional maps should be studied by abstracting away the notion of stability and only requiring that the system output depend appreciably on recent input values and negligibly on input values in the distant past. The formalization of this property was introduced by Sandberg [1991] under the name of approximately finite memory, building on earlier work by Boyd and Chua [1985]. Outputs of systems characterized by this property can be approximated by the output of the same system when applied to a truncated version of the input sequence. These systems are naturally suited to be modeled using TCNs, which by construction only operate on values of the input sequence for times within a finite horizon into the past. In this work, we aim to develop quantitative results for the approximation capability of TCNs for modeling input-output maps that have the properties of causality, time invariance, and approximately finite memory.
In Section 2, we introduce the necessary definitions and review the approximately finite memory property due to Sandberg [1991]. Section 3 gives the main result for approximating input-output maps by ReLU TCNs, together with a quantitative result on the equivalence between approximately finite memory and a related notion of fading memory [Boyd and Chua, 1985, Park and Sandberg, 1992]. These results are applied in Section 4 to recurrent models that are incrementally stable [Tran et al., 2017], i.e., the influence of the initial condition is asymptotically negligible. We show that incrementally stable recurrent models have approximately finite memory, and then use this formalism to derive a generalization of the result of Miller and Hardt [2019]. We provide a comparison in Section 5 to other architectures used for approximating input-output maps. All omitted proofs are provided in the Supplementary Material.

2 Input-output maps and approximately finite memory

Let $S$ denote the set of all real-valued sequences $u = (u_t)_{t \in \mathbb{Z}_+}$, where $\mathbb{Z}_+ := \{0, 1, 2, \dots\}$. An input-output map (or i/o map, for short) is a nonlinear operator $F : S \to S$ that maps an input sequence $u \in S$ to an output sequence $y = Fu \in S$. (We are considering real-valued input and output sequences for simplicity; all our results carry over to vector-valued sequences at the expense of additional notation.) We will denote the application and the composition of i/o maps by concatenation. In this paper, we are concerned with i/o maps $F$ that are:

• causal: for any $t \in \mathbb{Z}_+$, $u_{0:t} = v_{0:t}$ implies $(Fu)_t = (Fv)_t$, where $u_{0:t} := (u_0, \dots, u_t)$;

• time-invariant: for any $k \in \mathbb{Z}_+$,
$$(FR^k u)_t = \begin{cases} (Fu)_{t-k}, & t \ge k \\ 0, & 0 \le t < k \end{cases}$$
where $R : S \to S$ is the right shift operator $(Ru)_t := u_{t-1}\mathbf{1}\{t \ge 1\}$.
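To make these definitions concrete, here is a minimal Python sketch (ours, not from the paper) of the right shift operator and a simple causal, time-invariant i/o map, a sliding-window average, using the convention $u_s = 0$ for $s < 0$:

```python
def right_shift(u, k=1):
    """(R^k u)_t = u_{t-k} for t >= k, and 0 for 0 <= t < k."""
    return [0.0] * k + list(u[:len(u) - k])

def moving_average(u, m=2):
    """A causal, time-invariant i/o map: (Fu)_t = mean(u_{t-m}, ..., u_t),
    with the convention u_s = 0 for s < 0."""
    padded = [0.0] * m + list(u)
    return [sum(padded[t:t + m + 1]) / (m + 1) for t in range(len(u))]

u = [1.0, 2.0, 3.0, 4.0]
lhs = moving_average(right_shift(u))  # F R u
rhs = right_shift(moving_average(u))  # R F u; equals F R u by time invariance
```

On a finite window, applying $F$ after $R$ gives the same sequence as shifting $Fu$, which is exactly the time-invariance identity above (the $0 \le t < k$ branch is handled by the zero padding).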
The key notion we will work with is that of approximately finite memory [Sandberg, 1991]:

Definition 2.1. An i/o map $F$ has approximately finite memory on a set of inputs $M \subseteq S$ if for any $\varepsilon > 0$ there exists $m \in \mathbb{Z}_+$ such that
$$\sup_{u \in M} \sup_{t \in \mathbb{Z}_+} \left| (Fu)_t - (FW_{t,m}u)_t \right| \le \varepsilon, \qquad (1)$$
where $W_{t,m} : S \to S$ is the windowing operator $(W_{t,m}u)_\tau := u_\tau \mathbf{1}\{\max\{t-m, 0\} \le \tau \le t\}$. We will denote by $m^*_F(\varepsilon)$ the smallest $m \in \mathbb{Z}_+$ for which (1) holds.

If $m^*_F(0) < \infty$, then we say that $F$ has finite memory on $M$. If $F$ is causal and time-invariant, this is equivalent to the existence of an integer $m \in \mathbb{Z}_+$ and a nonlinear functional $f : \mathbb{R}^{m+1} \to \mathbb{R}$, such that $f(0, \dots, 0) = 0$ and, for any $u \in M$ and any $t \in \mathbb{Z}_+$,
$$(Fu)_t = f(u_{t-m}, u_{t-m+1}, \dots, u_t), \qquad (2)$$
with the convention that $u_s = 0$ if $s < 0$. In this work, we will focus on the important case when $f$ is a feedforward neural net with rectified linear unit (ReLU) activations $\mathrm{ReLU}(x) := \max\{x, 0\}$. That is, there exist $k$ affine maps $A_i : \mathbb{R}^{d_i} \to \mathbb{R}^{d_{i+1}}$ with $d_1 = m+1$ and $d_{k+1} = 1$, such that $f$ is given by the composition
$$f = A_k \circ \mathrm{ReLU} \circ A_{k-1} \circ \mathrm{ReLU} \circ \dots \circ \mathrm{ReLU} \circ A_1,$$
where, for any $r \ge 1$, $\mathrm{ReLU}(x_1, \dots, x_r) := (\mathrm{ReLU}(x_1), \dots, \mathrm{ReLU}(x_r))$. Here, $k$ is the depth (number of layers) and $\max\{d_2, \dots, d_k\}$ is the width (largest number of units in any hidden layer).

Definition 2.2. An i/o map $F$ is a ReLU temporal convolutional net (or ReLU TCN, for short) with context length $m$ if (2) holds for some feedforward ReLU neural net $f : \mathbb{R}^{m+1} \to \mathbb{R}$.

Remark 2.3. While such an $F$ is evidently causal, it is generally not time-invariant unless $f(0, \dots, 0) = 0$.
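The windowing operator and a ReLU TCN of the form (2) can be sketched directly. The hand-built depth-2 net below is our own toy example: it computes $f(x_0, x_1, x_2) = \mathrm{ReLU}(x_2)$, so the resulting TCN simply applies a ReLU to the most recent input, with context length $m = 2$.

```python
import numpy as np

def window(u, t, m):
    """Windowing operator: (W_{t,m} u)_tau = u_tau for max(t-m,0) <= tau <= t, else 0."""
    w = np.zeros_like(u)
    lo = max(t - m, 0)
    w[lo:t + 1] = u[lo:t + 1]
    return w

def relu_tcn(u, layers, m):
    """(Fu)_t = f(u_{t-m}, ..., u_t) for a feedforward ReLU net f given as a
    list of affine maps (A_i, b_i); the input is zero-padded on the left."""
    padded = np.concatenate([np.zeros(m), u])
    out = []
    for t in range(len(u)):
        x = padded[t:t + m + 1]
        for A, b in layers[:-1]:
            x = np.maximum(A @ x + b, 0.0)  # hidden layer: affine map, then ReLU
        A, b = layers[-1]
        out.append(float((A @ x + b)[0]))   # final affine layer, scalar output
    return np.array(out)

# Weights implementing f(x_0, x_1, x_2) = ReLU(x_2).
layers = [(np.array([[0.0, 0.0, 1.0]]), np.zeros(1)),
          (np.array([[1.0]]), np.zeros(1))]
u = np.array([-1.0, 2.0, -3.0, 4.0])
y = relu_tcn(u, layers, m=2)
```

Note that this $F$ is causal but also time-invariant, since the net maps the all-zeros window to $0$.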
3 The universal approximation theorem

In this section, we state and prove our main result: any causal and time-invariant i/o map that has approximately finite memory and satisfies an additional continuity condition can be approximated arbitrarily well by a ReLU temporal convolutional net.

In what follows, we will consider i/o maps with uniformly bounded inputs, i.e., inputs in the set
$$M(R) := \{ u \in S : \|u\|_\infty := \sup_{t \in \mathbb{Z}_+} |u_t| \le R \}$$
for some $R > 0$. For any $t \in \mathbb{Z}_+$ and any $u \in M(R)$, the finite subsequence $u_{0:t} = (u_0, \dots, u_t)$ is an element of the cube $[-R, R]^{t+1} \subset \mathbb{R}^{t+1}$; conversely, any vector $x \in [-R, R]^{t+1}$ can be embedded into $M(R)$ by setting $u_s = x_s \mathbf{1}\{0 \le s \le t\}$. To any causal and time-invariant i/o map $F$ we can associate the nonlinear functional $\tilde{F}_t : \mathbb{R}^{t+1} \to \mathbb{R}$ defined in the obvious way: for any $x = (x_0, x_1, \dots, x_t) \in \mathbb{R}^{t+1}$, $\tilde{F}_t(x) := (Fu)_t$, where $u \in S$ is any input such that $u_s = x_s$ for $s \in \{0, 1, \dots, t\}$ (the values of $u_s$ for $s > t$ can be arbitrary by causality). We impose the following assumptions on $F$:

Assumption 3.1. The i/o map $F$ has approximately finite memory on $M(R)$.

Assumption 3.2. For any $t \in \mathbb{Z}_+$, the functional $\tilde{F}_t : \mathbb{R}^{t+1} \to \mathbb{R}$ is uniformly continuous on $[-R, R]^{t+1}$ with modulus of continuity
$$\omega_{t,F}(\delta) := \sup\left\{ |\tilde{F}_t(x) - \tilde{F}_t(x')| : x, x' \in [-R, R]^{t+1},\ \|x - x'\|_\infty \le \delta \right\}$$
and inverse modulus of continuity
$$\omega^{-1}_{t,F}(\varepsilon) := \sup\left\{ \delta > 0 : \omega_{t,F}(\delta) \le \varepsilon \right\},$$
where $\|x\|_\infty := \max_{0 \le i \le t} |x_i|$ is the $\ell^\infty$ norm on $\mathbb{R}^{t+1}$.
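The modulus of continuity in Assumption 3.2 can be probed numerically: a Monte Carlo search over pairs of nearby inputs gives a lower estimate of $\omega_{t,F}(\delta)$. The sketch below is our own illustration; the functional is the running maximum over the window, which is 1-Lipschitz in the sup norm, so its modulus satisfies $\omega_{t,F}(\delta) \le \delta$ and the estimate should never exceed $\delta$.

```python
import numpy as np

def modulus_estimate(F_t, t, R, delta, n_pairs=2000, seed=0):
    """Monte Carlo lower estimate of omega_{t,F}(delta): the largest observed
    |F_t(x) - F_t(x')| over random pairs with ||x - x'||_inf <= delta."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(n_pairs):
        x = rng.uniform(-R, R, size=t + 1)
        # Perturb within delta in sup norm; clipping back to the cube can
        # only decrease the coordinate-wise distance to x.
        xp = np.clip(x + rng.uniform(-delta, delta, size=t + 1), -R, R)
        worst = max(worst, abs(F_t(x) - F_t(xp)))
    return worst

F_t = lambda x: float(np.max(x))   # running maximum over the window
est = modulus_estimate(F_t, t=5, R=1.0, delta=0.1)
```

Such an estimate only lower-bounds the true modulus, but it is a cheap sanity check that a given functional is as regular as Assumption 3.2 requires.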
The following qualitative universal approximation result was obtained by Sandberg [1991]: if a causal and time-invariant i/o map $F$ satisfies the above two assumptions, then, for any $\varepsilon > 0$, there exist an affine map $A : \mathbb{R}^{m+1} \to \mathbb{R}^d$ and a lattice map $\ell : \mathbb{R}^d \to \mathbb{R}$, such that
$$\sup_{u \in M(R)} \sup_{t \in \mathbb{Z}_+} \left| (Fu)_t - \ell \circ A(u_{t-m:t}) \right| < \varepsilon, \qquad (3)$$
where we say that a map $\ell : \mathbb{R}^d \to \mathbb{R}$ is a lattice map if $\ell(x_0, \dots, x_{d-1})$ is generated from $x = (x_0, \dots, x_{d-1})$ by a finite number of min and max operations that do not depend on $x$. Any lattice map can be implemented using ReLU units, so (3) is a ReLU TCN approximation guarantee. Our main result is a quantitative version of Sandberg's theorem:

Theorem 3.3. Let $F$ be a causal and time-invariant i/o map satisfying Assumptions 3.1 and 3.2. Then, for any $\varepsilon > 0$ and any $\gamma \in (0, 1)$, there exists a ReLU TCN $\hat{F}$ with context length $m = m^*_F(\gamma\varepsilon)$, width $m+2$, and depth
$$O\left( \frac{R}{\omega^{-1}_{m,F}((1-\gamma)\varepsilon)} \right)^{m+2},$$
such that
$$\sup_{u \in M(R)} \| Fu - \hat{F}u \|_\infty < \varepsilon. \qquad (4)$$

Remark 3.4. The role of the additional parameter $\gamma \in (0, 1)$ is to trade off the context length and the depth of the ReLU TCN.

Remark 3.5. While the approximating ReLU TCN $\hat{F}$ is clearly causal, it may not be time-invariant unless $\hat{f}(0, \dots, 0) = 0$, where $\hat{f}$ is the ReLU net constructed in the proof below.

Proof. Let $m = m^*_F(\gamma\varepsilon)$. Since $\tilde{F}_m : \mathbb{R}^{m+1} \to \mathbb{R}$ is continuous with modulus of continuity $\omega_{m,F}(\cdot)$, there exists a ReLU net $\hat{f} : \mathbb{R}^{m+1} \to \mathbb{R}$ of width $m+2$ and depth $O\big( R / \omega^{-1}_{m,F}((1-\gamma)\varepsilon) \big)^{m+2}$, such that
$$\sup_{x \in [-R,R]^{m+1}} | \tilde{F}_m(x) - \hat{f}(x) | < (1-\gamma)\varepsilon$$
[Hanin and Sellke, 2018]. Consider the TCN $\hat{F}$ defined by $(\hat{F}u)_t := \hat{f}(u_{t-m}, \dots, u_t)$. Fix an input $u \in M(R)$ and consider two cases:

1) If $t \ge m$, then $u_{t-m:t} = (L^{t-m} W_{t,m} u)_{0:m}$, where $L : S \to S$ is the left shift operator $(Lu)_t := u_{t+1}$.
Therefore,
$$(FW_{t,m}u)_t \stackrel{(a)}{=} (FR^{t-m}L^{t-m}W_{t,m}u)_t \stackrel{(b)}{=} (FL^{t-m}W_{t,m}u)_m \stackrel{(c)}{=} \tilde{F}_m(u_{t-m:t}),$$
where (a) uses the fact that $t \ge m$, (b) is by time invariance of $F$, and (c) is by the definition of $\tilde{F}_m$.

2) If $t < m$, then $u_{t-m:t} = (R^{m-t}W_{t,m}u)_{0:m}$ (recall the convention that, for any $v$, we set $v_s = 0$ whenever $s < 0$). Therefore,
$$(FW_{t,m}u)_t \stackrel{(a)}{=} (R^{m-t}FW_{t,m}u)_m \stackrel{(b)}{=} (FR^{m-t}W_{t,m}u)_m \stackrel{(c)}{=} \tilde{F}_m(u_{t-m:t}),$$
where (a) uses the fact that $m > t$, (b) is by time invariance, and (c) is by the definition of $\tilde{F}_m$.

In either case, the triangle inequality gives
$$|(Fu)_t - (\hat{F}u)_t| \le |(Fu)_t - (FW_{t,m}u)_t| + |(FW_{t,m}u)_t - (\hat{F}u)_t| = |(Fu)_t - (FW_{t,m}u)_t| + |\tilde{F}_m(u_{t-m:t}) - \hat{f}(u_{t-m:t})| < \gamma\varepsilon + (1-\gamma)\varepsilon = \varepsilon.$$
Since this holds for all $t$ and all $u$ with $\|u\|_\infty \le R$, the result follows.

3.1 The fading memory property

In order to apply Theorem 3.3, we need control on the context length $m^*_F(\cdot)$ and on the modulus of continuity $\omega_{t,F}(\cdot)$. In general, these quantities are difficult to estimate. However, it was shown by Park and Sandberg [1992] that the property of approximately finite memory is closely related to the notion of fading memory, first introduced by Boyd and Chua [1985]. Intuitively, an i/o map $F$ has fading memory if the outputs at any time $t$ due to any two inputs $u$ and $v$ that were close to one another in the recent past will also be close. Let $W$ denote the subset of $S$ consisting of all sequences $w$ such that $w_t \in (0, 1]$ for all $t$ and $w_t \downarrow 0$ as $t \to \infty$. We will refer to the elements of $W$ as weighting sequences. Then we have the following definition, due to Park and Sandberg [1992]:

Definition 3.6.
We say that an i/o map $F$ has fading memory on $M \subseteq S$ with respect to $w \in W$ if for any $\varepsilon > 0$ there exists $\delta > 0$ such that, for all $u, v \in M$ and all $t \in \mathbb{Z}_+$,
$$\max_{s \in \{0,\dots,t\}} w_{t-s} |u_s - v_s| < \delta \implies |(Fu)_t - (Fv)_t| < \varepsilon. \qquad (5)$$

The weighting sequence $w$ governs the rate at which the past values of the input are discounted in determining the current output. To capture the best trade-offs in (5), we will also use a $w$-dependent modulus of continuity:
$$\alpha_{w,F}(\delta) := \sup\left\{ |(Fu)_t - (Fv)_t| : t \in \mathbb{Z}_+,\ u, v \in M,\ \max_{s \in \{0,\dots,t\}} w_{t-s}|u_s - v_s| \le \delta \right\}.$$

It was shown by Park and Sandberg [1992] that an i/o map satisfies Assumptions 3.1 and 3.2 if and only if it has fading memory with respect to some (and hence any) $w \in W$. The following result provides a quantitative version of this equivalence:

Proposition 3.7. Let $F$ be an i/o map.

1. If $F$ satisfies Assumptions 3.1 and 3.2, then it has fading memory on $M$ with respect to any weighting sequence $w \in W$, and
$$\alpha^{-1}_{w,F}(\varepsilon) \ge w_{m^*_F(\varepsilon/3)} \cdot \omega^{-1}_{m^*_F(\varepsilon/3),F}(\varepsilon/3). \qquad (6)$$

2. If $F$ has fading memory on $M(R)$ with respect to some $w \in W$, then it satisfies Assumptions 3.1 and 3.2, and
$$m^*_F(\varepsilon; R) \le \inf\left\{ m \in \mathbb{Z}_+ : w_m \le \frac{\alpha^{-1}_{w,F}(\varepsilon)}{R} \right\} \quad \text{and} \quad \omega_{t,F}(\delta) \le \alpha_{w,F}(\delta). \qquad (7)$$

4 Recurrent systems

So far, we have considered arbitrary i/o maps $F : S \to S$. However, many such maps admit state-space realizations [Sontag, 1998]: there exist a state transition map $f : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n$, an output map $g : \mathbb{R}^n \to \mathbb{R}$, and an initial condition $\xi \in \mathbb{R}^n$, such that the output sequence $y = Fu$ is determined recursively by
$$x_{t+1} = f(x_t, u_t) \qquad (8a)$$
$$y_t = g(x_t) \qquad (8b)$$
with $x_0 = \xi$. The i/o map $F$ realized in this way is evidently causal, and it is time-invariant if $f(\xi, 0) = \xi$ and $g(\xi) = 0$. In this section, we will identify the conditions under which recurrent models satisfy Assumptions 3.
1 and 3.2. Along the way, we will derive the approximation results of Miller and Hardt [2019] as a special case.

4.1 Approximately finite memory and incremental stability

Consider the system in (8). Given any input $u \in S$, any $\xi \in \mathbb{R}^n$, and any $s, t \in \mathbb{Z}_+$ with $t \ge s$, we denote by $\varphi^u_{s,t}(\xi)$ the state at time $t$ when $x_s = \xi$. Let $M$ be a subset of $S$. We say that $X \subseteq \mathbb{R}^n$ is a positively invariant set of (8) for inputs in $M$ if, for all $\xi \in X$, all $u \in M$, and all $0 \le s \le t$, $\varphi^u_{s,t}(\xi) \in X$. We will be interested in systems with the following property [Tran et al., 2017]:

Definition 4.1. The system (8) is uniformly asymptotically incrementally stable for inputs in $M$ on a positively invariant set $X$ if there exists a function $\beta : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$ of class $\mathcal{KL}$¹ such that the inequality
$$\| \varphi^u_{s,t}(\xi) - \varphi^u_{s,t}(\xi') \| \le \beta(\|\xi - \xi'\|, t - s) \qquad (9)$$
holds for all inputs $u \in M$, all initial conditions $\xi, \xi' \in X$, and all $0 \le s \le t$, where $\|\cdot\|$ is the $\ell^2$ norm on $\mathbb{R}^n$.

In other words, a system is incrementally stable if the influence of any initial condition in $X$ on the state trajectory is asymptotically negligible. A key consequence is the following estimate:

¹ A function $\beta : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$ is of class $\mathcal{KL}$ if it is continuous and strictly increasing in its first argument, continuous and strictly decreasing in its second argument, $\beta(0, t) = 0$ for any $t$, and $\lim_{t \to \infty} \beta(r, t) = 0$ for any $r$ [Sontag, 1998].

Proposition 4.2. Let $u, \tilde{u}$ be two input sequences in $M$. Then, for any $\xi \in X$ and any $t \in \mathbb{Z}_+$,
$$\| \varphi^u_{0,t}(\xi) - \varphi^{\tilde{u}}_{0,t}(\xi) \| \le \sum_{s=0}^{t-1} \beta\big( \|f(\tilde{x}_s, u_s) - f(\tilde{x}_s, \tilde{u}_s)\|,\ t - s - 1 \big), \qquad (10)$$
where $x_s$ and $\tilde{x}_s$ denote the states at time $s$ due to inputs $u$ and $\tilde{u}$, respectively, with $x_0 = \tilde{x}_0 = \xi$.

Consider a state-space model (8) with a positively invariant set $X$, with the following assumptions:

Assumption 4.3.
The state transition map $f(x, u)$ is $L_f$-Lipschitz in $u$ for all $x \in X$, and the output map $g(x)$ is $L_g$-Lipschitz in $x \in X$.

Assumption 4.4. For any initial condition $\xi \in X$ there exists a compact set $S_\xi \subseteq X$ such that $\varphi^u_{0,t}(\xi) \in S_\xi$ for all $u \in M(R)$ and all $t \in \mathbb{Z}_+$.

Assumption 4.5. The system (8) is uniformly asymptotically incrementally stable on $X$ for inputs in $M(R)$, and the function $\beta$ in (9) satisfies the summability condition
$$\sum_{t \in \mathbb{Z}_+} \beta(C, t) < \infty \qquad (11)$$
for any $C \ge 0$. (For example, if $\beta(C, k) = Ck^{-\alpha}$ for some $\alpha > 1$, then this condition is satisfied.)

We are now in a position to prove the main result of this section:

Theorem 4.6. Suppose that Assumptions 4.3–4.5 are satisfied. Then the i/o map $F$ of the system (8) satisfies Assumptions 3.1 and 3.2 with
$$m^*_F(\varepsilon) \le \min\left\{ m \in \mathbb{Z}_+ : \sum_{k \ge m} \beta(\mathrm{diam}(S_\xi), k) < \varepsilon / L_g \right\} \qquad (12)$$
and
$$\omega_{t,F}(\delta) \le L_g \sum_{s=0}^{t-1} \beta(L_f \delta, s), \quad \forall t \in \mathbb{Z}_+. \qquad (13)$$

Proof. Fix some $t, m \in \mathbb{Z}_+$. For an arbitrary input $u \in M(R)$, let $\tilde{u} = W_{t,m}u$, where we may assume without loss of generality that $t \ge m$. Then $\tilde{u}_s = u_s \mathbf{1}\{t - m \le s \le t\}$, and therefore
$$\sum_{s=0}^{t-1} \beta\big( \|f(\tilde{x}_s, u_s) - f(\tilde{x}_s, \tilde{u}_s)\|, t - s - 1 \big) = \sum_{s=0}^{t-m-1} \beta\big( \|f(\tilde{x}_s, u_s) - f(\tilde{x}_s, 0)\|, t - s - 1 \big) \le \sum_{s=0}^{t-m-1} \beta(\mathrm{diam}(S_\xi), t - s - 1) \le \sum_{s=m}^{\infty} \beta(\mathrm{diam}(S_\xi), s). \qquad (14)$$
By the summability condition (11), the summation in (14) converges to $0$ as $m \uparrow \infty$. Thus, if we choose $m$ so that the right-hand side of (14) is smaller than $\varepsilon/L_g$, it follows from Proposition 4.2 that
$$|(Fu)_t - (FW_{t,m}u)_t| = |g(\varphi^u_{0,t}(\xi)) - g(\varphi^{\tilde{u}}_{0,t}(\xi))| \le L_g \| \varphi^u_{0,t}(\xi) - \varphi^{\tilde{u}}_{0,t}(\xi) \| < \varepsilon.$$
This proves (12). Now fix any two $u, \tilde{u} \in M(R)$ with $\|u_{0:t} - \tilde{u}_{0:t}\|_\infty < \delta$.
Then $\max_{0 \le s \le t} \|f(x, u_s) - f(x, \tilde{u}_s)\| \le L_f \delta$ for all $x \in X$, so Proposition 4.2 gives
$$|\tilde{F}_t(u_{0:t}) - \tilde{F}_t(\tilde{u}_{0:t})| = |g(\varphi^u_{0,t}(\xi)) - g(\varphi^{\tilde{u}}_{0,t}(\xi))| \le L_g \| \varphi^u_{0,t}(\xi) - \varphi^{\tilde{u}}_{0,t}(\xi) \| \le L_g \sum_{s=0}^{t-1} \beta(L_f \delta, s),$$
which proves (13).

4.2 Exponential incremental stability and the Demidovich criterion

Miller and Hardt [2019] consider the case of contracting systems: there exist some $\lambda \in (0, 1)$ and a set $U \subseteq \mathbb{R}$, such that
$$\|f(x, u) - f(x', u)\| \le \lambda \|x - x'\| \qquad (15)$$
for all $x, x' \in \mathbb{R}^n$ and all $u \in U$. Such a system is uniformly exponentially incrementally stable on any positively invariant set $X$, with $\beta(C, t) = C\lambda^t$. In this section, we obtain their result as a special case of a more general stability criterion, known in the literature on nonlinear system stability as the Demidovich criterion [Pavlov et al., 2006]. The following result is a simplified version of a more general result of Tran et al. [2017]:

Proposition 4.7 (the discrete-time Demidovich criterion). Consider the recurrent system (8) with a convex positively invariant set $X$, where the state transition map $f(x, u)$ is differentiable in $x$ for any $u \in U$. Suppose that there exist a symmetric positive definite matrix $P$ and a constant $\mu \in (0, 1)$, such that
$$\frac{\partial}{\partial x} f(x, u)^\top P\, \frac{\partial}{\partial x} f(x, u) - \mu P \preceq 0 \qquad (16)$$
for all $x \in X$ and all $u \in U$, where $\frac{\partial}{\partial x} f(x, u)$ is the Jacobian of $f(\cdot, u)$ with respect to $x$. Then the system (8) is uniformly exponentially incrementally stable with $\beta(C, t) = \sqrt{\kappa(P)}\, C \mu^{t/2}$, where $\kappa(P)$ is the condition number of $P$.

Proof. Fix any $u \in U$ and $\xi, \xi' \in X$, and define the function $\Phi : [0, 1] \to \mathbb{R}$ by
$$\Phi(s) := (f(\xi, u) - f(\xi', u))^\top P f(s\xi + (1-s)\xi', u).$$
Then
$$\Phi(1) - \Phi(0) = (f(\xi, u) - f(\xi', u))^\top P (f(\xi, u) - f(\xi', u)).$$
(17)

By the mean-value theorem, there exists some $\bar{s} \in [0, 1]$ such that
$$\Phi(1) - \Phi(0) = \left. \frac{d}{ds}\Phi(s) \right|_{s = \bar{s}} = (f(\xi, u) - f(\xi', u))^\top P\, \frac{\partial}{\partial x} f(\bar{\xi}, u)(\xi - \xi'), \qquad (18)$$
where $\bar{\xi} = \bar{s}\xi + (1 - \bar{s})\xi' \in X$, since $X$ is convex. From (16), (17), and (18) it follows that
$$(f(\xi, u) - f(\xi', u))^\top P (f(\xi, u) - f(\xi', u)) \le (\xi - \xi')^\top \frac{\partial}{\partial x} f(\bar{\xi}, u)^\top P\, \frac{\partial}{\partial x} f(\bar{\xi}, u)(\xi - \xi') \le \mu (\xi - \xi')^\top P (\xi - \xi').$$
Define the function $V : X \times X \to \mathbb{R}_+$ by $V(\xi, \xi') := (\xi - \xi')^\top P (\xi - \xi')$. From the above estimate, it follows that $V$ is a Lyapunov function for the dynamics, i.e., for any $u \in U$ and $\xi, \xi' \in X$,
$$V(f(\xi, u), f(\xi', u)) \le \mu V(\xi, \xi'). \qquad (19)$$
Consequently, for any input $u$ with $u_t \in U$ for all $t$ and any $\xi, \xi' \in X$,
$$V(\varphi^u_{0,t+1}(\xi), \varphi^u_{0,t+1}(\xi')) = V(f(\varphi^u_{0,t}(\xi), u_t), f(\varphi^u_{0,t}(\xi'), u_t)) \le \mu V(\varphi^u_{0,t}(\xi), \varphi^u_{0,t}(\xi')).$$
Iterating, we obtain the inequality $V(\varphi^u_{0,t}(\xi), \varphi^u_{0,t}(\xi')) \le \mu^t V(\xi, \xi')$. Finally, since $P \succ 0$,
$$\| \varphi^u_{0,t}(\xi) - \varphi^u_{0,t}(\xi') \|^2 \le \frac{\lambda_{\max}(P)}{\lambda_{\min}(P)}\, \mu^t \|\xi - \xi'\|^2 = \kappa(P) \|\xi - \xi'\|^2 \mu^t,$$
and the proof is complete.

Theorem 4.8. Suppose the system (8) satisfies Assumption 4.3 and the Demidovich criterion with $U = [-R, R]$, its positively invariant set $X$ contains $0$, and $f(0, 0) = 0$. Then its i/o map $F$ with zero initial condition $x_0 = 0$ satisfies Assumptions 3.1 and 3.2 with
$$m^*_F(\varepsilon) \le \frac{2 \log\left( \frac{2\kappa(P) L_f L_g R}{(1 - \sqrt{\mu})^2 \varepsilon} \right)}{\log \frac{1}{\mu}} \quad \text{and} \quad \omega_{t,F}(\delta) \le \frac{\sqrt{\kappa(P)}\, L_f L_g \delta}{1 - \sqrt{\mu}}. \qquad (20)$$

Proof. Since $P$ is symmetric and positive definite, $\|x\|_P := \sqrt{x^\top P x}$ is a norm on $\mathbb{R}^n$ with $\lambda_{\min}(P)\|\cdot\|^2 \le \|\cdot\|_P^2 \le \lambda_{\max}(P)\|\cdot\|^2$.
Then, for all $\xi \in X$, $u \in M(R)$, and $t$,
$$\| \varphi^u_{0,t+1}(\xi) \|_P = \| f(\varphi^u_{0,t}(\xi), u_t) \|_P \le \| f(\varphi^u_{0,t}(\xi), u_t) - f(0, u_t) \|_P + \| f(0, u_t) - f(0, 0) \|_P \le \sqrt{\mu}\, \| \varphi^u_{0,t}(\xi) \|_P + \sqrt{\lambda_{\max}(P)}\, L_f R,$$
where we have used the Lyapunov bound (19). Unrolling the recursion gives the estimate
$$\sup_{t \in \mathbb{Z}_+} \sup_{u \in M(R)} \| \varphi^u_{0,t}(\xi) \|_P \le \sqrt{\mu}\, \|\xi\|_P + \frac{\sqrt{\lambda_{\max}(P)}\, L_f R}{1 - \sqrt{\mu}}.$$
Thus, Assumption 4.4 is satisfied, where $S_\xi$ is the ball of $\ell^2$-radius $\sqrt{\kappa(P)}\left( \|\xi\| + \frac{L_f R}{1 - \sqrt{\mu}} \right)$ centered at $0$. Assumption 4.5 is also satisfied by Proposition 4.7. The estimates in (20) follow from Theorem 4.6.

The following result now follows as a direct consequence of Theorems 3.3 and 4.8:

Corollary 4.9. If the system (8) satisfies the conditions of Theorem 4.8, then its i/o map $F$ with zero initial condition can be $\varepsilon$-approximated in the sense of Theorem 3.3 by a ReLU TCN $\hat{F}$ with width $\mathrm{polylog}(\frac{1}{\varepsilon})$ and depth $\mathrm{quasipoly}(\frac{1}{\varepsilon})$.²

4.3 Contractivity vs. the Demidovich criterion

If the contractivity condition (15) holds and $f(x, u)$ is differentiable in $x$, then the Demidovich criterion is satisfied with $P = I_n$ and $\mu = \lambda^2$. In that case, we immediately obtain the exponential estimate $\beta(C, t) \le C\lambda^t$. However, the Demidovich criterion covers a wider class of nonlinear systems. As an example, consider a discrete-time nonlinear system of Lur'e type (cf. Sandberg and Xu [1993], Kim and Braatz [2014], Sarkans and Logemann [2016] and references therein):
$$x_{t+1} = Ax_t + B\psi(u_t - y_t) \qquad (21a)$$
$$y_t = Cx_t \qquad (21b)$$
Here, the state $x_t$ is $n$-dimensional while the input $u_t$ and the output $y_t$ are scalar, so $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times 1}$, and $C \in \mathbb{R}^{1 \times n}$. The map $\psi : \mathbb{R} \to \mathbb{R}$ is a fixed differentiable nonlinearity.
The system in (21) has the form (8) with $f(x, u) = Ax + B\psi(u - Cx)$ and $g(x) = Cx$, and can be realized as the negative feedback interconnection of the discrete-time linear system
$$x_{t+1} = Ax_t + Bv_t \qquad (22a)$$
$$y_t = Cx_t \qquad (22b)$$
and the nonlinear element $\psi$ using the feedback law $v_t = \psi(u_t - y_t)$. We make the following assumptions (see, e.g., Sontag [1998] for the requisite control-theoretic background):

Assumption 4.10. The nonlinearity $\psi : \mathbb{R} \to \mathbb{R}$ satisfies $\psi(0) = 0$, and there exist real numbers $-\infty < a \le b < \infty$ such that $a \le \psi'(\cdot) \le b$.

Assumption 4.11. $A$ is a Schur matrix, i.e., its spectral radius $\rho(A)$ is strictly smaller than $1$; the pair $(A, B)$ is controllable, i.e., the $n \times n$ matrix $[B \mid AB \mid \dots \mid A^{n-1}B]$ has rank $n$; and the pair $(A, C)$ is observable, i.e., the $n \times n$ matrix $[C^\top \mid A^\top C^\top \mid \dots \mid (A^\top)^{n-1} C^\top]$ has rank $n$.

Assumption 4.12. Let $\mathbb{T} := \{ z \in \mathbb{C} : |z| = 1 \}$ denote the unit circle in the complex plane. The rational function $G(z) := C(zI_n - A)^{-1}B$ satisfies
$$\|G\|_{H^\infty(\mathbb{T})} := \sup_{z \in \mathbb{T}} |G(z)| < \gamma^{-1} \qquad (23)$$
for some $\gamma > 0$ such that $r^2 \le \gamma^2$ for all $a \le r \le b$.

² We say that a given quantity $N$ has quasipolynomial growth in $1/\varepsilon$, and write $N \le \mathrm{quasipoly}(1/\varepsilon)$, if $N = O(\exp(\mathrm{polylog}(\frac{1}{\varepsilon})))$.

Remark 4.13. Assumption 4.10 imposes a slope condition on $\psi$ and is standard in the analysis of Lur'e systems [Tsypkin, 1964, Sandberg, 1991, Kim and Braatz, 2014]. The function $G(z)$ is the transfer function of the linear system (22). Assumption 4.11 states that the triple $(A, B, C)$ is a minimal realization of $G$. The quantity $\|G\|_{H^\infty(\mathbb{T})}$ appearing in Eq. (23) in Assumption 4.12 is the $H^\infty$-norm of $G$ on the unit circle in the complex plane. Assumptions 4.11 and 4.12 are also common and are in the spirit of the well-known circle criterion [Tsypkin, 1964, Sandberg and Xu, 1993].
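Assumptions 4.11 and 4.12 are all checkable numerically for a given triple $(A, B, C)$. The sketch below is our own illustration on a small stable example; note that the grid-based $H^\infty$ estimate is an approximation of the supremum over the unit circle, not a certified bound.

```python
import numpy as np

def check_lure_assumptions(A, B, C, gamma, n_grid=2048):
    """Numerically check Assumption 4.11 (A Schur, (A,B) controllable, (A,C)
    observable) and Assumption 4.12 (||G||_{H_inf} < 1/gamma, where
    G(z) = C (zI - A)^{-1} B, estimated on a grid over the unit circle)."""
    n = A.shape[0]
    schur = np.max(np.abs(np.linalg.eigvals(A))) < 1.0
    ctrb = np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])
    obsv = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])
    minimal = np.linalg.matrix_rank(ctrb) == n and np.linalg.matrix_rank(obsv) == n
    zs = np.exp(1j * np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False))
    hinf = max(abs((C @ np.linalg.solve(z * np.eye(n) - A, B))[0, 0]) for z in zs)
    return schur, minimal, hinf < 1.0 / gamma

A = np.array([[0.5, 1.0], [0.0, 0.3]])   # upper triangular: eigenvalues 0.5, 0.3
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
schur, minimal, gain_ok = check_lure_assumptions(A, B, C, gamma=0.3)
```

For this example $G(z) = 1/((z - 0.5)(z - 0.3))$, whose peak gain on the unit circle is attained at $z = 1$, so all three checks pass with $\gamma = 0.3$.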
With these preliminaries out of the way, we have the following:

Proposition 4.14. Suppose that the system (21) satisfies Assumptions 4.10–4.12. Then it satisfies the discrete-time Demidovich criterion with $X = \mathbb{R}^n$ and $U = \mathbb{R}$, and moreover $\mu > \rho(A)^2$.

The crucial ingredient in the proof is the Discrete-Time Bounded-Real Lemma [Vaidyanathan, 1985], which guarantees the existence of the matrix $P$ appearing in the Demidovich criterion. The main takeaway here is that the function $f(x, u) = Ax + B\psi(u - Cx)$ need not be contractive (i.e., it may be the case that $P \ne I_n$), but it will be contractive in the $\|\cdot\|_P$ norm.

5 Comparison of architectures

So far, we have shown that any i/o map $F$ with approximately finite memory can be approximated by a ReLU temporal convolutional net. We have also considered recurrent models and shown that any incrementally stable recurrent model has approximately finite memory and can therefore be approximated by a ReLU TCN. As far as their approximation capabilities are concerned, both recurrent models and autoregressive models like TCNs are equivalent, since any finite-memory i/o map of the form (2) admits the state-space realization
$$x^1_{t+1} = x^2_t,\quad x^2_{t+1} = x^3_t,\quad \dots,\quad x^{m-1}_{t+1} = x^m_t,\quad x^m_{t+1} = u_t$$
$$y_t = f(x^1_t, x^2_t, \dots, x^m_t, u_t)$$
of the tapped delay line type, with zero initial condition $(x^1_0, \dots, x^m_0) = (0, \dots, 0)$. (Compared to (8), we are allowing a direct 'feedthrough' connection from the input $u_t$ to the output $y_t$.) The advantage of autoregressive models like TCNs shows up during training and regular operation, since shifted copies of the input sequence can be efficiently processed in parallel rather than sequentially.
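The tapped-delay-line equivalence can be verified directly: sliding a functional $f$ over length-$(m+1)$ windows of the input (the TCN view) produces the same output sequence as running the delay-line state-space recursion (our sketch; the functional below is an arbitrary toy choice).

```python
import numpy as np

def tcn_output(f, u, m):
    """TCN view: (Fu)_t = f(u_{t-m}, ..., u_t), with u_s = 0 for s < 0."""
    padded = np.concatenate([np.zeros(m), np.asarray(u, dtype=float)])
    return np.array([f(padded[t:t + m + 1]) for t in range(len(u))])

def delay_line_output(f, u, m):
    """State-space view: the state (x^1, ..., x^m) is a tapped delay line
    holding the last m inputs; y_t = f(x^1_t, ..., x^m_t, u_t)."""
    x = np.zeros(m)
    ys = []
    for u_t in u:
        ys.append(f(np.concatenate([x, [u_t]])))
        x = np.concatenate([x[1:], [u_t]])  # x^i_{t+1} = x^{i+1}_t, x^m_{t+1} = u_t
    return np.array(ys)

f = lambda w: float(np.maximum(w, 0.0).sum())  # any finite-memory functional works
u = [1.0, -2.0, 3.0, -4.0, 5.0]
```

The two views agree for every finite-memory functional by construction; the difference is purely computational, since the TCN view evaluates all windows independently and hence in parallel.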
Another point worth mentioning is that, while the construction in the proof of Theorem 3.3 makes use of ReLU nets as a universal function approximator, any other family of universal approximators can be used instead, for example, multivariate polynomials or rational functions. In fact, if one uses multivariate polynomials to approximate the functionals $\tilde{F}_t$, the resulting family of i/o maps is known as the (discrete-time) finite Volterra series [Boyd and Chua, 1985], and has been used widely in the analysis of nonlinear systems. However, TCNs generally provide a more parsimonious representation. To see this, consider the following (admittedly contrived) example of an i/o map:
$$(Fu)_t = \mathrm{ReLU}\left( \sum_{s=0}^{\infty} h_s u_{t-s} \right), \qquad (24)$$
where the filter coefficients $h_t$ have the exponential decay property $|h_t| \le C\lambda^t$ for some $C > 0$ and $\lambda \in (0, 1)$. It is not hard to show that $F$ has exponentially fading memory, and a very simple $\varepsilon$-approximation by a TCN is obtained by zeroing out all of the filter coefficients $h_s$, $s > m \sim \log(\frac{1}{\varepsilon})$:
$$(\hat{F}u)_t = \mathrm{ReLU}\left( \sum_{s=0}^{m} h_s u_{t-s} \right).$$
However, any $\varepsilon$-approximation for $F$ using Volterra series would need $\mathrm{poly}(\frac{1}{\varepsilon})$ terms, since the best polynomial $\varepsilon$-approximation of the ReLU on any compact interval has degree $\Omega(\frac{1}{\varepsilon})$ [DeVore and Lorentz, 1993, Chap. 9, Thm. 3.3]. On the other hand, if we consider an i/o map of the form (24), but with a degree-$d$ univariate polynomial instead of the ReLU, then we can $\varepsilon$-approximate it with a TCN of depth $O(d + \log\frac{d}{\varepsilon})$ and $O(d \log\frac{d}{\varepsilon})$ units [Liang and Srikant, 2017].

Acknowledgments

This work was supported in part by the National Science Foundation under the Center for Advanced Electronics through Machine Learning (CAEML) I/UCRC award no. CNS-16-24811.

References

Shaojie Bai, J. Zico Kolter, and Vladlen Koltun.
An empirical evaluation of generic convolutional and recurrent networks for sequence modeling, 2018. URL https://arxiv.org/abs/1803.01271.
Stephen Boyd and Leon O. Chua. Fading memory and the problem of approximating nonlinear operators with Volterra series. IEEE Transactions on Circuits and Systems, CAS-32(11):1150–1161, 1985.
Ciprian Chelba, Mohammad Norouzi, and Samy Bengio. N-gram language modeling using recurrent neural network estimation, 2017. URL https://arxiv.org/abs/1703.10724.
Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In International Conference on Machine Learning, 2017.
Ronald A. DeVore and George G. Lorentz. Constructive Approximation. Springer-Verlag, Berlin, 1993.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In International Conference on Machine Learning, 2017.
Boris Hanin and Mark Sellke. Approximating continuous functions by ReLU nets of minimal width, 2018. URL http://arxiv.org/abs/1710.11278.
Rie Johnson and Tong Zhang. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 562–570, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1052. URL https://www.aclweb.org/anthology/P17-1052.
Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time, 2016. URL https://arxiv.org/abs/1610.10099.
Kwang-Ki K. Kim and Richard D. Braatz. Observer-based output feedback control of discrete-time Lur'e systems with sector-bounded slope-restricted nonlinearities.
International Journal of Robust and Nonlinear Control, 24:2458–2472, 2014.
Shiyu Liang and R. Srikant. Why deep neural networks for function approximation? In International Conference on Learning Representations, 2017.
John Miller and Moritz Hardt. Stable recurrent models. In International Conference on Learning Representations, 2019.
Jooyoung Park and Irwin W. Sandberg. Criteria for the approximation of nonlinear systems. IEEE Transactions on Circuits and Systems—I: Fundamental Theory and Applications, 39(8):673–676, 1992.
Alexey Pavlov, Nathan van de Wouw, and Henk Nijmeijer. Uniform Output Regulation of Nonlinear Systems: A Convergent Dynamics Approach. Birkhäuser, 2006.
Irwin W. Sandberg. Structure theorems for nonlinear systems. Multidimensional Systems and Signal Processing, 2:267–286, 1991.
Irwin W. Sandberg and Lilian Y. Xu. Steady-state errors in discrete-time control systems. Automatica, 29(2):523–526, 1993.
Elvijs Sarkans and Hartmut Logemann. Input-to-state stability of discrete-time Lur'e systems. SIAM Journal on Control and Optimization, 54(3):1739–1768, 2016.
Vatsal Sharan, Sham Kakade, Percy Liang, and Gregory Valiant. Prediction with a short memory. In Symposium on Theory of Computing, 2018.
Eduardo D. Sontag. Mathematical Control Theory: Deterministic Finite Dimensional Systems. Springer-Verlag, 1998.
Duc N. Tran, Björn S. Rüffer, and Christopher M. Kellett. Convergence properties for discrete-time nonlinear systems. IEEE Transactions on Automatic Control, 2017.
Yakov Z. Tsypkin. A criterion of absolute stability for sampled-data systems with monotone characteristics of the nonlinear element. Doklady Akademii Nauk SSSR, 155(5):1029–1032, 1964. In Russian.
Palghat P. Vaidyanathan. The discrete-time bounded-real lemma in digital filtering.
IEEE Transactions on Circuits and Systems, CAS-32(9):918–924, September 1985.
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio, 2016. URL https://arxiv.org/abs/1609.03499.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation, 2016.
Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing, 2017. URL https://arxiv.org/abs/1702.01923.

A Omitted proofs

Proof of Proposition 3.7. Suppose F satisfies Assumptions 3.1 and 3.2. Fix some ε > 0 and let m = m*_F(ε/3) and δ = w_m ω^{−1}_{m,F}(ε/3). Now fix some t ∈ Z_+ and consider any two u, v ∈ M(R) such that

max_{s ∈ {0,...,t}} w_{t−s} |u_s − v_s| < δ.   (A.1)

Using the same reasoning as in the proof of Theorem 3.3, we can write (F W_{t,m} u)_t = F̃_m(u_{t−m:t}) and (F W_{t,m} v)_t = F̃_m(v_{t−m:t}), where, as before, we set u_s = v_s = 0 for s < 0. From the monotonicity of w and (A.1) it follows that

‖u_{t−m:t} − v_{t−m:t}‖_∞ ≤ (1/w_m) max_{s ∈ {t−m,...,t}} w_{t−s} |u_s − v_s| < ω^{−1}_{m,F}(ε/3),

which implies that

|(F W_{t,m} u)_t − (F W_{t,m} v)_t| = |F̃_m(u_{t−m:t}) − F̃_m(v_{t−m:t})| < ε/3.
Altogether, we see that (A.1) implies that

|(F u)_t − (F v)_t| ≤ |(F u)_t − (F W_{t,m} u)_t| + |(F W_{t,m} u)_t − (F W_{t,m} v)_t| + |(F v)_t − (F W_{t,m} v)_t| < ε/3 + ε/3 + ε/3 = ε,

which leads to (6).

Now suppose that F has fading memory w.r.t. w. Given ε > 0, let δ = α^{−1}_{w,F}(ε) and choose any m ∈ Z_+ such that w_m < δ/R. If t < m, then u_{0:t} = (W_{t,m} u)_{0:t}, and thus (F u)_t = (F W_{t,m} u)_t. On the other hand, if t ≥ m, then, for any u ∈ M(R),

|u_s − (W_{t,m} u)_s| = { 0,     t − m ≤ s ≤ t
                        { |u_s|, s < t − m

and therefore, by the monotonicity of w and the choice of m,

max_{s ∈ {0,...,t}} w_{t−s} |u_s − (W_{t,m} u)_s| = max_{s < t−m} w_{t−s} |u_s| ≤ R w_m < δ,

so that |(F u)_t − (F W_{t,m} u)_t| < ε by the fading-memory property.

Proof of Proposition 4.14. The function g(r) := sup_{|z|=1} |G(rz)|, where G(z) = C(zI_n − A)^{−1}B, is continuous for r > ρ(A). In particular, there exists some r_0 ∈ (ρ(A), 1), such that g(r_0) < γ^{−1}. Consequently, the rational function

H(z) := γ G(r_0 z) = γ C_{r_0} (z I_n − A_{r_0})^{−1} B

is well-defined for all z ∈ C with |z| ≥ r_0, and we have the following:
• A_{r_0} is a Schur matrix;
• the pair (A_{r_0}, B) is controllable;
• the pair (A_{r_0}, γ C_{r_0}) is observable;
• ‖H‖_{H∞(T)} < 1.

Then, by the Discrete-Time Bounded-Real Lemma [Vaidyanathan, 1985], there exist real matrices L, W and a symmetric positive definite matrix P ∈ R^{n×n}, such that

A^⊤ P A + γ^2 C^⊤ C + r_0^2 L^⊤ L = r_0^2 P   (A.4a)
B^⊤ P B + W^⊤ W = 1   (A.4b)
A^⊤ P B + r_0 L^⊤ W = 0.   (A.4c)

From (A.4), for any θ ∈ R we have

(A − θBC)^⊤ P (A − θBC) − r_0^2 P
  = A^⊤ P A − θ (C^⊤ B^⊤ P A + A^⊤ P B C) + θ^2 C^⊤ B^⊤ P B C − r_0^2 P
  = (θ^2 − γ^2) C^⊤ C − (r_0 L − θ W C)^⊤ (r_0 L − θ W C).

Let μ := r_0^2. Then, since γ^2 ≥ θ^2 for all θ ∈ [a, b], it follows that

(A − θBC)^⊤ P (A − θBC) − μP ⪯ 0,   a ≤ θ ≤ b.

Since ∂f/∂x(x, u) = ∂/∂x [Ax + Bψ(u − Cx)] = A − ψ′(u − Cx) BC and ψ′(u − Cx) ∈ [a, b] for all x and u, the proposition is proved.
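The truncation argument behind the example (24) in Section 5 can also be sanity-checked numerically. The NumPy sketch below (the specific filter, constants, and horizon are illustrative choices, not from the paper) compares the i/o map with an exponentially decaying filter against its depth-m truncation; since the ReLU is 1-Lipschitz, the error is at most R·C·λ^{m+1}/(1 − λ):

```python
import numpy as np

C, lam, R = 1.0, 0.7, 1.0          # |h_s| <= C * lam**s, inputs bounded by R
S = 400                            # long finite horizon standing in for the infinite sum
s = np.arange(S)
h = C * lam**s * np.cos(s)         # one admissible exponentially decaying filter

def io_map(u, h):
    """(F u)_t = ReLU(sum_{s>=0} h_s u_{t-s}), with u_s = 0 for s < 0."""
    y = np.empty(len(u))
    for t in range(len(u)):
        window = u[max(0, t - len(h) + 1):t + 1][::-1]  # u_t, u_{t-1}, ...
        y[t] = max(0.0, float(h[:len(window)] @ window))
    return y

u = np.random.default_rng(1).uniform(-R, R, size=200)

m = 20                                          # truncation depth ~ log(1/eps)
err = np.max(np.abs(io_map(u, h) - io_map(u, h[:m + 1])))
tail_bound = R * C * lam**(m + 1) / (1 - lam)   # geometric tail; ReLU is 1-Lipschitz
assert err <= tail_bound
```

The bound shrinks geometrically in m, which is why a memory depth of order log(1/ε) suffices for an ε-approximation.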