On the space-time expressivity of ResNets


Authors: Johannes Müller

Presented at the DeepDiffEq workshop at ICLR 2020

ON THE SPACE-TIME EXPRESSIVITY OF RESNETS

Johannes Müller
Max Planck Institute for Mathematics in the Sciences
jmueller@mis.mpg.de

ABSTRACT

Residual networks (ResNets) are a deep learning architecture that substantially improved the state-of-the-art performance in certain supervised learning tasks. Since then, they have received continuously growing attention. ResNets have a recursive structure $x_{k+1} = x_k + R_k(x_k)$ where $R_k$ is a neural network called a residual block. This structure can be seen as the Euler discretisation of an associated ordinary differential equation (ODE) which is called a neural ODE. Recently, ResNets were proposed as the space-time approximation of ODEs which are not of this neural type. To elaborate this connection we show that, by increasing the number of residual blocks as well as their expressivity, the solution of an arbitrary ODE can be approximated in space and time simultaneously by deep ReLU ResNets. Further, we derive estimates on the complexity of the residual blocks required to obtain a prescribed accuracy under certain regularity assumptions.

1 INTRODUCTION

Various neural-network-based methods have been proposed for the numerical analysis of partial differential equations (PDEs) (see Lee and Kang, 1990; Dissanayake and Phan-Thien, 1994; Takeuchi and Kosugi, 1994; Lagaris et al., 1998) as well as for ordinary differential equations (ODEs) (see Meade Jr and Fernandez, 1994a;b; Lagaris et al., 1998; Breen et al., 2019). In subsequent years those methods were improved and extended to a variety of settings, and we refer to Yadav et al. (2015) for an overview of neural-network-based methods for ODEs.
Recently, deep networks have successfully been applied to the numerical simulation of stationary and nonstationary PDEs by E and Yu (2018) and E et al. (2017); Han et al. (2018), respectively; a list of further improvements of those methods can be found in Grohs et al. (2019). The promising empirical performance of those approaches raised interest in theoretical guarantees and led to a number of error estimates (see Jentzen et al., 2018; Han and Long, 2018; Grohs et al., 2018; Elbrächter et al., 2018; Berner et al., 2018; Reisinger and Zhang, 2019; Kutyniok et al., 2019). In particular, it can be shown that neural networks are capable of approximating the solutions of a number of PDEs without suffering from the curse of dimensionality. However, it should be noted that those works only provide estimates for the spatial error at a fixed time rather than the approximation error in space and time simultaneously.

Compared to the case of PDEs, the analysis of the approximation error for ODEs is less complete. Although a priori and a posteriori error estimates are present in the literature (see Filici, 2008; 2010, respectively), they only consider the solution for a single initial value rather than the full space-time solution

$$x(0, y) = y, \qquad \partial_t x(t, y) = f(t, x(t, y)) \quad \text{for all } t, y \tag{1}$$

to the right-hand side $f$. Recently, Grohs et al. (2019) established an approximation result in space-time and showed that Euler discretisations of a certain class of neural ODEs can be approximated by neural networks with error decreasing exponentially in the complexity of the networks. Those are the first space-time error estimates in the study of neural-network-based methods for either PDEs or ODEs.
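The space-time solution (1) can be approximated numerically by running one time-stepping scheme per initial value. The following is a minimal sketch (not from the paper): it assumes the explicit Euler scheme and an illustrative right-hand side $f(t, x) = -x$ with exact solution $x(t, y) = y e^{-t}$; the function name `space_time_solution` is ours.

```python
import numpy as np

def space_time_solution(f, ys, n=1000):
    """Approximate the space-time solution x(t, y) of (1) on [0, 1]:
    one explicit Euler trajectory of step size 1/n per initial value y."""
    ts = np.linspace(0.0, 1.0, n + 1)
    h = 1.0 / n
    out = {}
    for y in ys:
        x = np.empty(n + 1)
        x[0] = y
        for k in range(n):
            x[k + 1] = x[k] + h * f(ts[k], x[k])
        out[y] = x
    return ts, out

# Illustrative right-hand side f(t, x) = -x; exact solution is y * exp(-t).
ts, sol = space_time_solution(lambda t, x: -x, ys=[1.0, -2.0, 0.5])
```

Varying $y$ over a compact set and $t$ over $[0, 1]$ is exactly the uniform space-time viewpoint taken in the results below.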
Yet, in order to obtain space-time error estimates for the solution of an ODE one has to bound the approximation error of the class of Euler discretisations to the solution of this ODE. Such estimates are implied by our main result Theorem 3 concerning the approximation of space-time solutions with residual networks. However, there is a further motivation in the study of the approximation capabilities of residual networks that we will present now.

RESIDUAL NETWORKS AND DYNAMICAL SYSTEMS

Residual networks (ResNets) make use of skip connections, which were introduced to overcome difficulties in the training of deep neural networks in supervised learning tasks. Rather than using the iterative scheme $x_{l+1} := \rho(A_l x_l + b_l)$ like a traditional feedforward network, ResNets copy the input $x_l$ to some subsequent layer, in the easiest case to the following layer, which leads to

$$x_{l+1} := x_l + \rho(A_l x_l + b_l) \quad \text{for } l = 0, \dots, L - 1. \tag{2}$$

Obviously, this is only well defined if the dimensions of all states $x_l$ agree. It was shown in He et al. (2016) that ResNets are superior to traditional feedforward neural networks in some image classification tasks. It has been pointed out in Haber et al. (2018) that the recursive structure (2) can be interpreted as the explicit Euler discretisation of the ordinary differential equation

$$\partial_t x(t) = \rho\big(A(t) x(t) + b(t)\big). \tag{3}$$

Building on this observation, Haber and Ruthotto (2017) transferred the knowledge about the stability of ODEs to the stability of forward propagation in ResNets, and Lu et al. (2017) introduced neural networks corresponding to other numerical schemes for ODEs like implicit Euler or Runge-Kutta schemes. Further, Chen et al.
(2018) replaced ResNets by ODEs in supervised learning tasks and achieved state-of-the-art performance with fewer parameters. A rigorous justification for this approach using the notion of $\Gamma$-convergence was established in Thorpe and van Gennip (2018). Lately, ResNets have been proposed in Rousseau et al. (2019) as an approximation of space-time solutions of a much more general class of ODEs than (3), which always admits nondecreasing solutions. Further, this was applied to the problem of diffeomorphic image registration, which can be interpreted as a controlled ODE problem.

The expressivity of ResNets was studied in different ways. It was shown by Lin and Jegelka (2018) that ResNets are able to approximate arbitrary $L^p$-functions, and Cuchiero et al. (2019) showed that ResNets can take prescribed values on arbitrary point sets. Both works consider the input-output mapping $x_0 \mapsto x_L$ induced by a ResNet. Similarly, Dupont et al. (2019) and Zhang et al. (2019) studied the approximation capabilities of neural ODEs at final time. Although many works perceive ResNets as discrete dynamical systems (see E, 2017; Liu and Markowich, 2019, and subsequent work), an analysis of the expressivity of their dynamics is still absent.

CONTRIBUTIONS

We study the expressivity of the dynamics of ResNets and show that ResNets can approximate solutions of arbitrary ODEs in space-time. This includes the solution of the control problem ResNets were proposed for in Rousseau et al. (2019). More precisely, we make the following contributions:

1. Universality: ResNets can approximate solutions of arbitrary ODEs uniformly in space and time simultaneously (see Theorem 2).

2. Complexity bounds: Assume that the right-hand side $f$ is Lipschitz continuous.
Then the solution to this ODE can be approximated with (local) error $O(n^{-1})$ through ResNets with $n$ residual blocks which have $O(r_n^d n^d)$ neurons; here, $(r_n)_{n \in \mathbb{N}} \subseteq (0, \infty)$ is an arbitrary sequence diverging to $+\infty$ and $d$ is the dimension of the ODE (see Theorem 3).

2 DEFINITIONS AND NOTATION

Let for the remainder $d, m, L$ be natural numbers. Further, we consider tuples $\theta = ((A_1, b_1), \dots, (A_L, b_L))$ of matrix-vector pairs where $A_l \in \mathbb{R}^{N_l \times N_{l-1}}$ and $b_l \in \mathbb{R}^{N_l}$ and $N_0 = d$, $N_L = m$. Every matrix-vector pair $(A_l, b_l)$ induces an affine linear transformation that we denote by $T_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}$. The neural network with parameters $\theta$ and with respect to some activation function $\rho : \mathbb{R} \to \mathbb{R}$ is the function

$$R = R_\theta : \mathbb{R}^d \to \mathbb{R}^m, \quad x \mapsto T_L(\rho(T_{L-1}(\rho(\cdots \rho(T_1(x)))))),$$

where $\rho$ is applied componentwise. We call $d$ the input and $m$ the output dimension, $L$ the depth and $N(\theta) := \sum_{l=0}^{L} N_l$ the number of neurons of the network. If we have $f = R_\theta$ for some $\theta$, we say that the function $f$ is expressed by the neural network.

In the following we restrict ourselves to the case of a specific activation function which is not only commonly used in practice (see Ramachandran et al., 2017) but also exhibits nice theoretical properties (see Arora et al., 2016; Petersen et al., 2018). The rectified linear unit or ReLU activation function is defined via $x \mapsto \max\{0, x\}$, and we call networks with this activation ReLU networks.

Finally, we introduce the notion of residual networks. In order to interpret ResNets as functions in space-time, we define them to be Euler discretisations of a certain class of ODEs which are linearly interpolated in time. It is important to note that this might differ from other definitions of residual networks present in the literature.
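The definitions above translate directly into code. The following is a minimal sketch (our own naming, not from the paper) of the realisation $R_\theta$ and the neuron count $N(\theta)$; the example parameters express the identity on $\mathbb{R}$ via $x = \rho(x) - \rho(-x)$.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def network(theta, x):
    """Realisation R_theta = T_L o rho o T_{L-1} o ... o rho o T_1 for
    theta = [(A_1, b_1), ..., (A_L, b_L)]; rho acts componentwise and is
    not applied after the final affine map T_L."""
    *hidden, (A_last, b_last) = theta
    for A, b in hidden:
        x = relu(A @ x + b)
    return A_last @ x + b_last

def num_neurons(theta, d):
    """N(theta) = N_0 + ... + N_L with N_0 = d."""
    return d + sum(b.shape[0] for _, b in theta)

# A depth-2 ReLU network of width 2 expressing the identity on R,
# using x = relu(x) - relu(-x).
theta = [(np.array([[1.0], [-1.0]]), np.zeros(2)),
         (np.array([[1.0, -1.0]]), np.zeros(1))]
```

Here `num_neurons(theta, 1)` counts $N_0 + N_1 + N_2 = 1 + 2 + 1 = 4$ neurons.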
Definition 1 (Residual network). Let $\theta = (\theta_1, \dots, \theta_n)$ be a tuple of parameters of neural networks with input and output dimension $d$. Let $R_1, \dots, R_n : \mathbb{R}^d \to \mathbb{R}^d$ denote the neural networks with parameters $\theta_1, \dots, \theta_n$ and some activation $\rho$. We refer to those networks as residual blocks. The residual network or ResNet $x_n : [0, 1] \times \mathbb{R}^d \to \mathbb{R}^d$ with parameters $\theta = (\theta_1, \dots, \theta_n)$ and with respect to the activation function $\rho$ is defined via

$$x_n(0, y) := y, \qquad x_n(t_{k+1}, y) := x_n(t_k, y) + n^{-1} \cdot R_{k+1}(x_n(t_k, y))$$

for $k = 0, \dots, n - 1$ and linearly in between. In the remainder, we will only consider ResNets with respect to the ReLU activation function and call those ReLU ResNets.

3 PRESENTATION OF THE MAIN RESULTS

Now we have introduced enough notation to state our main results precisely.

Theorem 2 (Space-time approximation with ResNets). Let $d \in \mathbb{N}$, $f \in L^1([0,1]; C^{0,1}_b(\mathbb{R}^d; \mathbb{R}^d))$¹ and let $x$ be the space-time solution² of the ODE with right-hand side $f$. Then for every compact set $K \subseteq \mathbb{R}^d$ and $\varepsilon > 0$ there is a ReLU ResNet $\tilde{x}$ such that

$$\|\tilde{x}(t, y) - x(t, y)\| \le \varepsilon \quad \text{for all } t \in [0, 1], \, y \in K.$$

The proof is based on the observation that $f$ can be approximated by functions that are piecewise constant in time on the intervals $[k/n, (k+1)/n)$. By standard continuity results for the solution operator of ODEs, the approximation also holds for the associated space-time solutions, and thus one can without loss of generality assume that $f$ is piecewise constant. However, if $f$ is merely integrable in time it is not possible to bound the number of constant regions. Hence, one cannot bound the number of residual blocks that is required in order to achieve a prescribed approximation accuracy under no temporal regularity assumptions.
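Definition 1 can be sketched numerically as follows. This is our own illustration, not from the paper: the residual blocks are passed as generic callables (they need not be ReLU networks here), and the function name `resnet_space_time` is ours.

```python
import numpy as np

def resnet_space_time(t, y, blocks):
    """Evaluate x_n(t, y) from Definition 1: Euler-type steps of size 1/n
    at the grid points t_k = k/n, linear interpolation in between."""
    n = len(blocks)
    x = np.asarray(y, dtype=float)
    k = min(int(np.floor(t * n)), n - 1)       # interval [t_k, t_{k+1}] containing t
    for j in range(k):                          # march to x_n(t_k, y)
        x = x + blocks[j](x) / n
    return x + (t * n - k) * blocks[k](x) / n   # linear interpolation on [t_k, t_{k+1}]

# With every residual block equal to the identity map, the scheme is the
# Euler discretisation of dx/dt = x, so x_n(1, y) = (1 + 1/n)^n * y, close to e * y.
blocks = [lambda x: x] * 50
approx_e = resnet_space_time(1.0, np.array([1.0]), blocks)
```

This makes the viewpoint of the paper concrete: a ResNet is a function of both the time variable $t$ and the initial value $y$.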
Nevertheless, in the proof of the result the approximation in space and in time are clearly separated, and in fact the constructed residual blocks share weights depending on the temporal regularity of the right-hand side. Hence, the same arguments can be used to establish bounds on the complexity of the residual networks like the following.

Theorem 3 (Space-time approximation with complexity bounds). Let $d \in \mathbb{N}$, $(r_n)_{n \in \mathbb{N}} \subseteq (0, \infty)$ be a sequence diverging to $+\infty$ and let $f : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ be a bounded and Lipschitz continuous function. Let $x : [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ be the space-time solution of the ODE with right-hand side $f$. Then for every $n \in \mathbb{N}$ there is a ReLU ResNet $x_n$ with parameters $\theta^n = (\theta^n_1, \dots, \theta^n_n)$ such that the following are satisfied:

1. Approximation: For every compact set $K \subseteq \mathbb{R}^d$ it holds

$$\sup_{t \in [0,1], \, y \in K} \|x_n(t, y) - x(t, y)\| \in O(n^{-1}).$$

2. Complexity bounds: Every residual block $\theta^n_k$ has depth $\lceil \log_2((d+1)!) \rceil + 2$ and satisfies $N(\theta^n_k) \in O(r_n^d n^d)$. Finally, all but $O(r_n^d n^d)$ weights can be fixed.

¹ Up to a technical measurability property (Bochner measurability) this means that $f(t, \cdot)$ is bounded and Lipschitz continuous for almost all $t$ and that the uniform norm and Lipschitz constants are integrable.
² See (1); we use the notion of weak solutions introduced in the appendix; if $f$ is continuous this coincides with the classical notion of a solution; further, the ODE is globally well posed for this class of right-hand sides.

We shall note that in similar fashion further complexity estimates can be obtained if the temporal and spatial regularities are different.
This includes cases where the Lipschitz constants in time and space differ or where one is given by some Sobolev or smoothness property; the resulting complexity bounds would change accordingly to the respective spatial and temporal approximation results.

OUTLINE OF THE PROOF

In a nutshell, the proof of the approximation results presented above relies on a combination of a spatial approximation result for ReLU networks and a Grönwall argument. We quickly present the key arguments of Theorem 3 and postpone any rigorous calculations to the appendix. The proof is based on a variant of the universal approximation results in Hanin (2017) and Yarotsky (2018), and we follow He et al. (2018) for the construction of piecewise linear interpolations. This method achieves optimal rates under the assumption of continuous weight assignment, which are also optimal for bounded-depth networks (see DeVore et al., 1989; Yarotsky, 2018). Although faster approximation rates for deep networks of bounded width are established in Yarotsky (2018), we use the following result as it allows a direct control of the uniform norm of the networks. However, our arguments can be generalised to other universal approximation results.

Proposition 4 (Universal approximation under Lipschitz condition). Let $d, m \in \mathbb{N}$ and $r > 0$ and let $f : \mathbb{R}^d \to \mathbb{R}^m$ be Lipschitz continuous. Then for every $\varepsilon > 0$ there is a ReLU network $R_\varepsilon$ with parameters $\theta_\varepsilon$ that satisfies the following:

1. Approximation: It holds that $\sup_{x \in [-r, r]^d} \|f(x) - R_\varepsilon(x)\| \le \varepsilon$.

2. Complexity bounds: The network has depth $\lceil \log_2((d+1)!) \rceil + 2$, $O(r^d \varepsilon^{-d})$ many neurons, and all but $O(r^d \varepsilon^{-d})$ weights can be fixed. Finally, if $\|f\|$ is bounded by $c$, so is $\|R_\varepsilon\|$.

Proof of Theorem 3. For every $n \in \mathbb{N}$ the previous proposition yields the existence of neural networks $R^n_1, \dots$
, $R^n_n$ of asserted complexity that satisfy

$$\sup_{x \in [-r_n, r_n]^d} \big\| f(t_k, x) - R^n_{k+1}(x) \big\| \le n^{-1}.$$

Since $f$ is bounded, let's say by $c > 0$³, so are all realisations $R^n_k$, independent of $k$ and $n$. Hence, for any initial condition $y \in B_R$ in some ball, the true solution $x(t, y)$ as well as the ResNet $x_n(t, y)$ arising from the networks $R^n_1, \dots, R^n_n$ remain in the bounded set $B_{R+c}$. However, on this bounded set the realisations $R^n_k$ approximate the right-hand side uniformly, and thus every ResNet can be interpreted as a perturbed Euler discretisation of the ODE with right-hand side $f$. Therefore, the residual network satisfies an integral equation for every fixed initial value $y$. An application of Grönwall's inequality yields that $x_n$ does in fact converge towards $x$ uniformly on $[0, 1] \times B_R$ with approximation error in $O(n^{-1})$. Since the ball $B_R$ was arbitrary, the general statement follows.

4 DISCUSSION AND FURTHER RESEARCH

We showed that residual networks are capable of approximating the solution of general ODEs in space-time. Further, under additional regularity assumptions we established bounds on the complexity of the residual blocks. The arguments presented above can directly be generalised to other classes of right-hand sides $f$ that allow a more effective spatial approximation through neural networks. This includes compositional functions or classes of (piecewise) smooth functions (see Mhaskar et al., 2016; Liang and Srikant, 2016; Petersen and Voigtlaender, 2018; Yarotsky, 2018; Shen et al., 2019; Montanelli and Yang, 2019).

For future directions we propose to investigate whether, and if so in what notion, controlled ResNets converge towards controlled ODEs. It would be particularly interesting to see which weight regularisation corresponds to which regularisations of controlled ODEs.
This would be a continuation of the work by Thorpe and van Gennip (2018); Avelin and Nyström (2019), where residual blocks of one layer and constant weights are studied. Further, it is not clear for which class of controlled ODEs the curse of dimensionality can be circumvented, faster approximation rates can be established, or weights can be shared between different residual blocks.

³ By this we mean that the (Euclidean) norm is bounded by $c$.

ACKNOWLEDGMENTS

JM acknowledges support by the Evangelisches Studienwerk Villigst e.V. and the IMPRS MiS. Further, the authors want to thank Nihat Ay, Nicolas Charon, Philipp Harms, Jasper Hofmann, Guido Montúfar and Hsi-Wei Hsieh for valuable comments and discussions.

REFERENCES

W. Arendt, C. J. Batty, M. Hieber, and F. Neubrander. Vector-valued Laplace transforms and Cauchy problems, volume 96. Springer Science & Business Media, 2011.

R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding Deep Neural Networks with Rectified Linear Units. arXiv preprint arXiv:1611.01491, 2016.

B. Avelin and K. Nyström. Neural ODEs as the Deep Limit of ResNets with constant weights. arXiv preprint arXiv:1906.12183, 2019.

J. Berner, P. Grohs, and A. Jentzen. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv preprint arXiv:1809.03062, 2018.

P. G. Breen, C. N. Foley, T. Boekholt, and S. P. Zwart. Newton vs the machine: solving the chaotic three-body problem using deep neural networks. arXiv preprint arXiv:1910.07291, 2019.

T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural Ordinary Differential Equations.
In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.

C. Cuchiero, M. Larsson, and J. Teichmann. Deep neural networks, generic universal interpolation, and controlled ODEs. arXiv preprint arXiv:1908.07838, 2019.

R. A. DeVore, R. Howard, and C. Micchelli. Optimal nonlinear approximation. Manuscripta mathematica, 63(4):469–478, 1989.

J. Diestel and J. Uhl. Vector Measures, 1977.

M. Dissanayake and N. Phan-Thien. Neural-network-based approximations for solving partial differential equations. Communications in Numerical Methods in Engineering, 10(3):195–201, 1994.

E. Dupont, A. Doucet, and Y. W. Teh. Augmented Neural ODEs. arXiv preprint arXiv:1904.01681, 2019.

W. E. A Proposal on Machine Learning via Dynamical Systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

W. E and B. Yu. The Deep Ritz Method: A Deep Learning-Based Numerical Algorithm for Solving Variational Problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.

W. E, J. Han, and A. Jentzen. Deep Learning-Based Numerical Methods for High-Dimensional Parabolic Partial Differential Equations and Backward Stochastic Differential Equations. Communications in Mathematics and Statistics, 5(4):349–380, 2017.

D. Elbrächter, P. Grohs, A. Jentzen, and C. Schwab. DNN Expression Rate Analysis of High-dimensional PDEs: Application to Option Pricing. arXiv preprint arXiv:1809.07669, 2018.

C. Filici. On a Neural Approximator to ODEs. IEEE Transactions on Neural Networks, 19(3):539–543, 2008.

C. Filici. Error estimation in the neural network solution of ordinary differential equations. Neural Networks, 23(5):614–617, 2010.

P. Grohs, F. Hornung, A. Jentzen, and P. Von Wurstemberger.
A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv preprint arXiv:1809.02362, 2018.

P. Grohs, F. Hornung, A. Jentzen, and P. Zimmermann. Space-time error estimates for deep neural network approximations for differential equations, 2019.

E. Haber and L. Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.

E. Haber, L. Ruthotto, E. Holtham, and S.-H. Jun. Learning Across Scales—Multiscale Methods for Convolution Neural Networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

J. Han and J. Long. Convergence of the Deep BSDE Method for Coupled FBSDEs. arXiv preprint arXiv:1811.01165, 2018.

J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.

B. Hanin. Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations. arXiv preprint arXiv:1708.02691, 2017.

J. He, L. Li, J. Xu, and C. Zheng. ReLU Deep Neural Networks and Linear Finite Elements. arXiv preprint arXiv:1807.03973, 2018.

K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

A. Jentzen, D. Salimova, and T. Welti. A proof that deep artificial neural networks overcome the curse of dimensionality in the numerical approximation of Kolmogorov partial differential equations with constant diffusion and nonlinear drift coefficients. arXiv preprint arXiv:1809.07321, 2018.

G. Kutyniok, P. Petersen, M. Raslan, and R. Schneider.
A Theoretical Analysis of Deep Neural Networks and Parametric PDEs. arXiv preprint arXiv:1904.00377, 2019.

I. E. Lagaris, A. Likas, and D. I. Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5):987–1000, 1998.

H. Lee and I. S. Kang. Neural algorithm for solving differential equations. Journal of Computational Physics, 91(1):110–131, 1990.

S. Liang and R. Srikant. Why Deep Neural Networks for Function Approximation? arXiv preprint arXiv:1610.04161, 2016.

H. Lin and S. Jegelka. ResNet with one-neuron hidden layers is a Universal Approximator. In Advances in Neural Information Processing Systems, pages 6169–6178, 2018.

H. Liu and P. Markowich. Selection dynamics for deep neural networks. arXiv preprint arXiv:1905.09076, 2019.

Y. Lu, A. Zhong, Q. Li, and B. Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121, 2017.

A. J. Meade Jr and A. A. Fernandez. The numerical solution of linear ordinary differential equations by feedforward neural networks. Mathematical and Computer Modelling, 19(12):1–25, 1994a.

A. J. Meade Jr and A. A. Fernandez. Solution of nonlinear ordinary differential equations by feedforward neural networks. Mathematical and Computer Modelling, 20(9):19–44, 1994b.

H. Mhaskar, Q. Liao, and T. Poggio. Learning Functions: When Is Deep Better Than Shallow. arXiv preprint arXiv:1603.00988, 2016.

H. Montanelli and H. Yang. Error bounds for deep ReLU networks using the Kolmogorov–Arnold superposition theorem. arXiv preprint arXiv:1906.11945, 2019.

P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
P. Petersen, M. Raslan, and F. Voigtlaender. Topological properties of the set of functions generated by neural networks of fixed size. arXiv preprint arXiv:1806.08459, 2018.

Y. Qin. Analytic Inequalities and Their Applications in PDEs. Springer, 2017.

P. Ramachandran, B. Zoph, and Q. V. Le. Searching for Activation Functions. arXiv preprint arXiv:1710.05941, 2017.

C. Reisinger and Y. Zhang. Rectified deep neural networks overcome the curse of dimensionality for nonsmooth value functions in zero-sum games of nonlinear stiff systems. arXiv preprint arXiv:1903.06652, 2019.

F. Rousseau, L. Drumetz, and R. Fablet. Residual Networks as Flows of Diffeomorphisms. Journal of Mathematical Imaging and Vision, pages 1–11, 2019.

Z. Shen, H. Yang, and S. Zhang. Nonlinear approximation via compositions. Neural Networks, 119:74–84, 2019.

J. Takeuchi and Y. Kosugi. Neural network representation of finite element method. Neural Networks, 7(2):389–395, 1994.

M. Thorpe and Y. van Gennip. Deep Limits of Residual Neural Networks. arXiv preprint arXiv:1810.11741, 2018.

N. Yadav, A. Yadav, M. Kumar, et al. An Introduction to Neural Network Methods for Differential Equations. Springer, 2015.

D. Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. arXiv preprint arXiv:1802.03620, 2018.

L. Younes. Shapes and Diffeomorphisms, volume 171. Springer, 2010.

H. Zhang, X. Gao, J. Unterman, and T. Arodz. Approximation Capabilities of Neural Ordinary Differential Equations. arXiv preprint arXiv:1907.12998, 2019.

A UNIVERSAL APPROXIMATION WITH RELU NETWORKS

This section is concerned with the proof of the universal approximation result in Proposition 4.
Similar proofs relying on the approximation through interpolation can be found in Hanin (2017); Yarotsky (2018). We also follow He et al. (2018) for the expression of nodal basis functions; however, we also bound the complexity of the ReLU networks needed to express such functions.

A.1 TRIANGULATIONS AND PIECEWISE LINEAR FUNCTIONS

Let in the following $\mathcal{T}$ be a locally finite triangulation of the entire Euclidean space $\mathbb{R}^d$ consisting of nondegenerate $d+1$ simplices $\{\tau_k\}_{k \in \mathbb{N}}$ and vertices $\mathcal{V}$. More precisely, this means that the union of the simplices covers the entire space but that their interiors are pairwise disjoint, and that every bounded set only intersects with finitely many simplices. Further, every simplex should be the convex hull of $d+1$ points and have nontrivial interior. For a vertex $x \in \mathcal{V}$ we set $N(x) := \{k \in \mathbb{N} \mid x \in \tau_k\}$ and define the maximum number of neighbouring simplices to be $k_{\mathcal{T}} := \sup_{x \in \mathcal{V}} |N(x)|$, which we will assume to be finite. Further, we set

$$\Omega(x) := \bigcup_{k \in N(x)} \tau_k$$

and call $\mathcal{T}$ locally convex if $\Omega(x)$ is convex for all $x \in \mathcal{V}$. The fineness of the triangulation is defined to be the supremum over the diameters of the simplices, $|\mathcal{T}| := \sup_{k \in \mathbb{N}} \operatorname{diam}(\tau_k)$, and we will assume that it is finite. We will later give an explicit construction of a triangulation that satisfies those conditions.

Definition 5 (Piecewise linear functions). We say a function $f : \mathbb{R}^d \to \mathbb{R}$ is piecewise linear (PWL) with respect to $\mathcal{T}$ if it is affine linear on every simplex of the triangulation. Given such a function $f$, we call $|\mathcal{V}(f)| := |\{x \in \mathcal{V} \mid f(x) \ne 0\}|$ the degrees of freedom of the function.

Note that the definition of PWL functions automatically implies continuity, since the affine regions are closed and cover $\mathbb{R}^d$ and affine functions are continuous.
It is well known from the theory of finite elements that for every vertex $x \in \mathcal{V}$ there is a (with respect to $\mathcal{T}$) piecewise linear function $\varphi$ that satisfies $\varphi(x) = 1$ and vanishes at every other vertex. We call this function the nodal basis function associated with $x$. The nodal basis functions form a basis of the space of PWL functions with finitely many degrees of freedom.

We will give an explicit construction of a triangulation that satisfies the assumptions from above. For this, note that the unit cube $[0, 1]^d$ can be divided into the simplices

$$S_\sigma := \big\{ x \in \mathbb{R}^d \ \big| \ 0 \le x_{\sigma(1)} \le \cdots \le x_{\sigma(d)} \le 1 \big\}$$

where $\sigma$ is a permutation of the set $\{1, \dots, d\}$. It is straightforward to check that those simplices cover the unit cube, have disjoint interiors and are nondegenerate $d+1$ simplices. The fineness of this triangulation is $\sqrt{d}$. We call this triangulation the standard triangulation of the Euclidean space $\mathbb{R}^d$.

We will need the fact that the standard triangulation is locally convex. Since it is periodic, it suffices to show that $\Omega(0)$ is convex. In order to do this we will show that

$$\Omega(0) = \big\{ z \in [-1, 1]^d \ \big| \ z_i \le z_j + 1 \text{ for all } i, j = 1, \dots, d \big\} =: A.$$

This expresses $\Omega(0)$ as an intersection of convex sets and hence shows the convexity of $\Omega(0)$.

Let us take $z = x - y \in \Omega(0)$ with $x, y \in S_\sigma$ where $y$ has binary entries. Then we obviously have $z \in [-1, 1]^d$. Let now $i, j \in \{1, \dots, d\}$; then we have to distinguish two cases. The first one is $\sigma^{-1}(i) \le \sigma^{-1}(j)$, which implies $x_i \le x_j$ and $y_i \le y_j$ and thus

$$z_i - z_j = (x_i - x_j) + (y_j - y_i) \le y_j \le 1.$$

For $\sigma^{-1}(i) > \sigma^{-1}(j)$ an analogous computation shows $z_i \le z_j + 1$, and hence we obtain the inclusion $\Omega(0) \subseteq A$. To see that the other inclusion holds true, we fix $z \in A$ and set $I := \{i \mid z_i \ge 0\}$ and $J := \{1, \dots$
, d\} \setminus I$. Further, we define $y \in \{0, 1\}^d$ via

$$y_i := \begin{cases} 0 & \text{for } i \in I \\ 1 & \text{otherwise} \end{cases}$$

and $x := z + y$. By construction we have $z = x - y$ and $x, y \in [0, 1]^d$, $y \in \{0, 1\}^d$, and hence we only need to show the existence of a permutation $\sigma$ such that $x, y \in S_\sigma$. Obviously, the statement $y \in S_\sigma$ is equivalent to $\sigma^{-1}(i) \le \sigma^{-1}(j)$ for all $i \in I$, $j \in J$. Since for $i \in I$ and $j \in J$ we have $x_i = z_i \le z_j + 1 = x_j$, there is a permutation $\sigma$ with $x \in S_\sigma$ that additionally satisfies $\sigma^{-1}(i) \le \sigma^{-1}(j)$ for all $i \in I$ and $j \in J$.

A.2 EXACT EXPRESSION OF PIECEWISE LINEAR FUNCTIONS AS RELU NETWORKS

We quickly present well-known examples of functions that can exactly be expressed by ReLU networks (see He et al., 2018; Petersen and Voigtlaender, 2018).

1. Identity mapping. A basic calculation shows the identity

$$x = \rho(x) - \rho(-x) \quad \text{for all } x \in \mathbb{R}^d. \tag{4}$$

Hence, the identity can be expressed as a ReLU network of width $2d$, which is visualised in Figure 1. Note that one can express the identity function as an arbitrarily deep ReLU network of width $2d$, since one can simply add more hidden layers where the affine linear transformation is the identity.

Figure 1: An example for an expression of the identity mapping as a ReLU network.

Similarly, one obtains that the absolute value can be expressed as a ReLU network of arbitrary depth and width 2, since $|x| = \rho(x) + \rho(-x)$.

2. Minimum operation. It is elementary to check

$$\min(x, y) = \tfrac{1}{2}\big( x + y - |x - y| \big).$$

We have already seen how the terms on the right-hand side can be expressed as shallow ReLU networks, and hence we obtain

$$\min(x, y) = \tfrac{1}{2}\big( \rho(x + y) - \rho(-x - y) - \rho(x - y) - \rho(-x + y) \big). \tag{5}$$

Therefore, the minimum operation can be expressed as a shallow ReLU network of width 4 and with weights $\pm\tfrac{1}{2}, \pm 1$.
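The identities (4) and (5) are easy to verify numerically; the following quick sketch (our own helper names) checks them on a grid of test points.

```python
import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def identity_relu(x):
    """Eq. (4): x = relu(x) - relu(-x), a width-2d ReLU network per coordinate."""
    return relu(x) - relu(-x)

def min_relu(x, y):
    """Eq. (5): min(x, y) via one hidden ReLU layer of width 4."""
    return 0.5 * (relu(x + y) - relu(-x - y) - relu(x - y) - relu(-x + y))

xs = np.linspace(-3.0, 3.0, 25)
X, Y = np.meshgrid(xs, xs)
max_dev = np.max(np.abs(min_relu(X, Y) - np.minimum(X, Y)))  # zero up to rounding
```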
Figure 2: Expressing the minimum operation through a shallow ReLU network.

A consequence of the fact that the identity can be expressed as a shallow network is that the class of neural networks is closed under summation and also under parallelisation. We will use these concepts of parallelisation and summation of networks, which are relatively intuitive, and refer to Petersen and Voigtlaender (2018) for further details.

The expression of nodal basis functions as ReLU networks relies on the following lemma.

Lemma 6. Let $T$ be a locally finite and locally convex triangulation of $\mathbb{R}^d$ and let $x \in V$ with nodal basis function $\varphi$. Then we have
$$\varphi(y) = \max\Big( 0, \min_{k \in N(x)} g_k(y) \Big) = \min_{k \in N(x)} \rho(g_k(y)) \quad \text{for all } y \in \mathbb{R}^d, \tag{6}$$
where $g_k$ is the globally affine linear function that agrees with $\varphi$ on the simplex $\tau_k$.

For a proof we refer to He et al. (2018), which we also follow closely for the next two results; however, we additionally bound the complexity of the neural networks.

Proposition 7 (Minimum function). The minimum function $\min\colon \mathbb{R}^d \to \mathbb{R}$ can be expressed through a ReLU network of depth $\lceil \log_2(d) \rceil + 1$. Further, such a network can be constructed with weights $\big\{0, \pm\tfrac{1}{2}, \pm 1\big\}$, $O(d)$ many neurons and $O(d)$ non-zero weights.

Proof. Let us, for the sake of easy notation, assume $d = 2^m$. The construction of the representation of the minimum function relies on the observation that the minimum operation is the composition of $\log_2(d) = m$ mappings of the form
$$f_k \colon \mathbb{R}^{2^k} \to \mathbb{R}^{2^{k-1}}, \quad (x_1, \dots, x_{2^k}) \mapsto \big( \min(x_1, x_2), \dots, \min(x_{2^k - 1}, x_{2^k}) \big).$$
Those functions are the realisation of a parallelisation of the representation of the minimum function constructed in Example 5.
More precisely, $f_k$ can be represented through a shallow ReLU network where the dimension of the hidden layer is $2 \cdot 2^k$. The concatenation of the $m$ networks that represent the functions $f_k$ is a representation of the minimum function of depth $m + 1$. By adding the dimensions of the layers we obtain that this network has
$$2^m + 2 \cdot 2^m + \cdots + 2 \cdot 2 + 1 = 5d - 3$$
neurons and $4 \cdot (5d - 4)$ non-zero weights.

Theorem 8 (Exact expression of PWL functions as ReLU networks). Consider $d, m \in \mathbb{N}$ and let $T$ be a locally finite and locally convex triangulation of $\mathbb{R}^d$ with $k_T < \infty$. Every function $f \colon \mathbb{R}^d \to \mathbb{R}^m$ that is piecewise linear with respect to $T$ with $N$ degrees of freedom can be expressed as a deep ReLU network with depth $\lceil \log_2(k_T) \rceil + 2$ and at most $O(m k_T N + d)$ neurons. Further, all but $m(d+1) k_T N$ weights can be fixed.

Proof. We assume $m = 1$ and note that the general statement follows from building a parallelised network. Since $f$ is a linear combination of $N$ nodal basis functions $\varphi$, it suffices to represent $\varphi$ through a neural network, as a representation of $f$ can then be obtained by the standard addition of those networks. In order to represent $\varphi$, we use (6) and the previous proposition. For the sake of easy notation we assume that $N(x) = \{1,\dots,M\}$; then $\varphi$ can be represented by the network depicted in Figure 3, where the dashed part stands for a representation of the minimum function. It is clear that all weights except the ones of the first layer, which are $(d+1)M \le (d+1)k_T$ many, are fixed.

Figure 3: An expression of a nodal basis function $\varphi$, where the dashed part stands for the expression of the minimum function that was constructed in Proposition 7.

A.3 UNIVERSAL APPROXIMATION THROUGH INTERPOLATION

We introduce the modulus of continuity
$$w_f \colon [0,\infty) \to [0,\infty], \quad \delta \mapsto \sup\big\{ \|f(x) - f(y)\| \;\big|\; x, y \in \Omega, \ \|x - y\| \le \delta \big\}.$$
It is elementary to check that a function is uniformly continuous if and only if its modulus of continuity takes finite values and is continuous. The pseudo-inverse $w_f^{-1} \colon [0,\infty) \to [0,\infty)$ of the modulus of continuity is defined via
$$\varepsilon \mapsto \inf\{ \delta > 0 \mid w_f(\delta) > \varepsilon \}.$$
Note that if $w_f$ is continuous, we have $w_f(w_f^{-1}(\varepsilon)) = \varepsilon$ for all $\varepsilon > 0$, i.e. we have $\|f(x) - f(y)\| \le \varepsilon$ for all $x, y \in \Omega$ with $\|x - y\| \le w_f^{-1}(\varepsilon)$. Finally, if $f$ is Lipschitz continuous with constant $L$, we have $w_f(\delta) \le L\delta$ and hence $w_f^{-1}(\varepsilon) \ge \varepsilon / L$.

Proposition 9 (Function approximation with piecewise linear functions). Let $d, m \in \mathbb{N}$, let $f \colon \mathbb{R}^d \to \mathbb{R}^m$ be a continuous function and let $T$ be a locally finite triangulation of the Euclidean space $\mathbb{R}^d$ with fineness $\delta \in (0,\infty)$. Let $\Omega \subseteq \mathbb{R}^d$ be a union of simplices of $T$ and let $g$ be the function, piecewise linear with respect to $T$, that agrees with $f$ on all vertices inside of $\Omega$ and vanishes everywhere else. Then we have
$$\|f - g\|_{\infty, \Omega} \le w_{f|_\Omega}(\delta).$$
Finally, we have $\|g\|_\infty \le \|f\|_\infty$.

Proof. Let $x \in \Omega$; then $x$ lies in a convex simplex with vertices $x_1, \dots, x_{d+1} \in \Omega$. Hence, we find convex weights $\alpha_1, \dots, \alpha_{d+1} \in [0,1]$ such that $x = \sum_{i=1}^{d+1} \alpha_i x_i$. Now we obtain
$$\|f(x) - g(x)\| = \Big\| f(x) - \sum_{i=1}^{d+1} \alpha_i f(x_i) \Big\| \le \sum_{i=1}^{d+1} \alpha_i \|f(x) - f(x_i)\| \le w_{f|_\Omega}(\delta).$$
Furthermore, if $\|f\|$ is bounded by $c$, then we obtain
$$\|g(x)\| \le \sum_{i=1}^{d+1} \alpha_i \cdot \|f(x_i)\| \le \sum_{i=1}^{d+1} \alpha_i \cdot c = c \quad \text{for all } x \in \mathbb{R}^d.$$

Combining the previous results with the construction of the standard triangulation we obtain the following result.
Proposition 10 (Universal approximation with ReLU networks). Consider a continuous function $f \colon \mathbb{R}^d \to \mathbb{R}^m$, where $d, m \in \mathbb{N}$. Let further $r > 0$ and let $w_{f,r}$ be the modulus of continuity of $f|_{[-r,r]^d}$. Then for every $\varepsilon > 0$ there is a ReLU network $R_\varepsilon$ with parameters $\theta_\varepsilon$ that satisfies the following:

1. Approximation: It holds that $\sup_{x \in [-r,r]^d} \|f(x) - R_\varepsilon(x)\| \le \varepsilon$.

2. Complexity bounds: The network has depth $\lceil \log_2((d+1)!) \rceil + 2$ and $O\big( w_{f,r}^{-1}(\varepsilon)^{-d} \big)$ many neurons, and all but $O\big( w_{f,r}^{-1}(\varepsilon)^{-d} \big)$ weights can be fixed.

Finally, we have $\|R_\varepsilon\|_\infty \le \|f\|_\infty$.

Proof. Building on the previous results we only have to check that there is a triangulation $T$ with fineness at most $w_{f,r}^{-1}(\varepsilon)$ and $k_T < \infty$⁴ such that $[-r,r]^d$ is a union of simplices with
$$2^d \cdot \Bigg\lceil \frac{\sqrt{d} \cdot r}{w_{f,r}^{-1}(\varepsilon)} \Bigg\rceil^d$$
vertices in $[-r,r]^d$. We obtain this triangulation by scaling the standard triangulation by
$$r \cdot \Bigg\lceil \frac{\sqrt{d} \cdot r}{w_{f,r}^{-1}(\varepsilon)} \Bigg\rceil^{-1},$$
for which the properties are easily verified.

This result can easily be rewritten for Lipschitz continuous functions, as the Lipschitz continuity controls the modulus of continuity. We obtain the following approximation result.

Proposition 4 (Universal approximation under Lipschitz condition). Let $d, m \in \mathbb{N}$ and $r > 0$ and let $f \colon \mathbb{R}^d \to \mathbb{R}^m$ be Lipschitz continuous. Then for every $\varepsilon > 0$ there is a ReLU network $R_\varepsilon$ with parameters $\theta_\varepsilon$ that satisfies the following:

1. Approximation: It holds that $\sup_{x \in [-r,r]^d} \|f(x) - R_\varepsilon(x)\| \le \varepsilon$.

2. Complexity bounds: The network has depth $\lceil \log_2((d+1)!) \rceil + 2$ and $O\big( r^d \varepsilon^{-d} \big)$ many neurons, and all but $O\big( r^d \varepsilon^{-d} \big)$ weights can be fixed.

Finally, if $\|f\|$ is bounded by $c$, so is $\|R_\varepsilon\|$.

⁴ One can count the neighboring points and show $k_T = (d+1)!$.
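Both statements are easy to probe numerically in one dimension. The sketch below (our illustration, not code from the paper; the choice $f = \sin$ is arbitrary) builds the piecewise linear interpolant of Proposition 9 and evaluates the vertex count from the proof of Proposition 10, using $\delta = \varepsilon/L \le w_{f,r}^{-1}(\varepsilon)$ for an $L$-Lipschitz function:

```python
import math

def pwl_interpolant(f, nodes):
    """Piecewise linear function agreeing with f on the sorted nodes
    (the 1-D analogue of the interpolant g from Proposition 9)."""
    vals = [f(t) for t in nodes]
    def g(x):
        if x <= nodes[0]:
            return vals[0]
        for a, b, fa, fb in zip(nodes, nodes[1:], vals, vals[1:]):
            if x <= b:
                lam = (x - a) / (b - a)
                return (1 - lam) * fa + lam * fb
        return vals[-1]
    return g

def grid_vertex_count(d, r, L, eps):
    """Vertex count 2^d * ceil(sqrt(d) * r / delta)^d of the scaled
    standard triangulation of [-r, r]^d with fineness delta = eps / L,
    illustrating the O(r^d eps^{-d}) bound of Proposition 4."""
    delta = eps / L
    return 2 ** d * math.ceil(math.sqrt(d) * r / delta) ** d

# interpolation error on [0, 1] for f = sin (Lipschitz with L = 1)
n = 100                                   # fineness delta = 1/n
g = pwl_interpolant(math.sin, [i / n for i in range(n + 1)])
sup_err = max(abs(math.sin(x) - g(x)) for x in [i / 1000 for i in range(1001)])
assert sup_err <= 1.0 / n                 # ||f - g||_inf <= w_f(delta) <= L * delta

# the degrees of freedom grow like eps^{-d} when eps shrinks
assert grid_vertex_count(2, 1.0, 1.0, 0.05) > grid_vertex_count(2, 1.0, 1.0, 0.1)
```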
B ERROR ESTIMATE FOR PERTURBED EULER SCHEMES

First, we need to introduce the notion of weak solutions of ordinary differential equations.

Definition 11 (Weak solutions). Let $f \colon [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ be a Carathéodory function, i.e., measurable in the first and continuous in the second argument, and let further $x_0 \in \mathbb{R}^d$. Then we say $x \colon [0,1] \to \mathbb{R}^d$ is a weak solution of the differential equation $\partial_t x(t) = f(t, x(t))$, $x(0) = x_0$ if it satisfies
$$x(t) = x_0 + \int_0^t f(s, x(s)) \, \mathrm{d}s \quad \text{for all } t \in [0,1].$$
The integral on the right hand side can be interpreted as a componentwise Lebesgue integral, where the Carathéodory condition ensures the measurability. Further, we call $x \colon [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ the space-time solution of the ODE with right hand side $f$ if it solves $\partial_t x(t,y) = f(t, x(t,y))$, $x(0,y) = y$.

The well-posedness of ordinary differential equations in the weak sense can be proved just like the well-posedness results from the classical theory. In particular, a global solution $x \colon [0,1] \to \mathbb{R}^d$ exists for every initial value $x_0 \in \mathbb{R}^d$ if $f(t,\cdot)$ is bounded and Lipschitz continuous for almost all $t$ with integrable uniform norm and Lipschitz constant (see Younes, 2010). We denote the space of those functions which are also Bochner-measurable⁵ by $L^1([0,1]; C_b^{0,1}(\mathbb{R}^d; \mathbb{R}^d))$.

Definition 12 (Euler discretisation). Let $0 = t_0 < \cdots < t_n = 1$ be a partition of the unit interval and $x_0 \in \mathbb{R}^d$. Let $f \colon [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ be an arbitrary Carathéodory function. Then we define the Euler discretisation or Euler scheme for the right hand side $f$, initial value $x_0$ and with respect to the partition $(t_0, t_1, \dots, t_n)$ via
$$x_n(0) := x_0, \qquad x_n(t_{i+1}) := x_n(t_i) + (t_{i+1} - t_i) \, f(t_i, x_n(t_i)),$$
and linearly in between.
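For illustration (our sketch, not part of the paper), the scheme on the uniform partition $t_i = i/n$ can be implemented in a few lines; the test problem $x' = x$ with exact solution $e^t$ is an arbitrary choice and exhibits the expected first-order error:

```python
import math

def euler_scheme(f, x0, n):
    """Euler discretisation of x'(t) = f(t, x(t)), x(0) = x0, on the
    uniform partition t_i = i/n of [0, 1]; returns the values x_n(t_i)."""
    xs = [x0]
    for i in range(n):
        xs.append(xs[-1] + (1.0 / n) * f(i / n, xs[-1]))
    return xs

# right hand side f(t, x) = x has the exact solution x(t) = x0 * e^t
for n in (10, 100, 1000):
    xs = euler_scheme(lambda t, x: x, 1.0, n)
    # x_n(1) = (1 + 1/n)^n -> e at rate O(1/n)
    assert abs(xs[-1] - math.e) <= 3.0 / n
```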
It is important to note that the Euler discretisation $x_n$ satisfies the integral equation
$$x_n(t) = x_0 + \int_0^t \gamma(s) \, \mathrm{d}s \quad \text{for all } t \in [0,1], \qquad \text{where } \gamma(t) := \sum_{i=0}^{n-1} \chi_{[t_i, t_{i+1})}(t) \, f(t_i, x_n(t_i)).$$

Lemma 13 (Generalised Grönwall inequality). Let $x_0, y_0 \in \mathbb{R}^d$, let $\gamma_1, \gamma_2 \in L^1([0,1]; \mathbb{R}^d)$⁶ and let $x$ and $y$ satisfy the integral equations
$$x(t) = x_0 + \int_0^t \gamma_1(s) \, \mathrm{d}s \quad \text{and} \quad y(t) = y_0 + \int_0^t \gamma_2(s) \, \mathrm{d}s \quad \text{for all } t \in [0,1].$$
Assume now that there are nonnegative functions $\alpha, \beta \in L^1([0,1])$ such that
$$\|\gamma_1(t) - \gamma_2(t)\| \le \alpha(t) + \beta(t) \cdot \|x(t) - y(t)\| \quad \text{for all } t \in [0,1].$$
Then we have
$$\|x(t) - y(t)\| \le c \cdot \big( \|x_0 - y_0\| + \|\alpha\|_{L^1(I)} \big) \quad \text{for all } t \in [0,1],$$
where we can choose $c = 1 + \|\beta\|_{L^1([0,1])} \cdot \exp\big( \|\beta\|_{L^1([0,1])} \big)$.

⁵ See Diestel and Uhl (1977); there such functions are called strongly measurable.
⁶ I.e., their norms are integrable; see Diestel and Uhl (1977) for an introduction to vector valued integration.

Proof. For $t \ge t_0$ we compute
$$\|x(t) - y(t)\| \le \|x_0 - y_0\| + \int_{t_0}^t \|\gamma_1(s) - \gamma_2(s)\| \, \mathrm{d}s \le \|x_0 - y_0\| + \int_{t_0}^t \alpha(s) \, \mathrm{d}s + \int_{t_0}^t \beta(s) \cdot \|x(s) - y(s)\| \, \mathrm{d}s.$$
An application of Grönwall's inequality yields the assertion.⁷ For $t \le t_0$ the computation follows in an analogous way or by reflection.

Remark 14. If $\|f\|$ is bounded by $c$, we obtain the growth estimate $\|x(t)\| \le \|x_0\| + c$. Further, this estimate also holds for all Euler discretisations of $f$.

Proposition 15 (Continuity of solution map). Let $x_0, y_0 \in \mathbb{R}^d$ and let $f, g \colon [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ be Carathéodory functions such that $f(t,\cdot)$ is Lipschitz continuous with constant $h(t)$ for $t \in [0,1]$, where $h \in L^1([0,1])$.
Further, let $f - g \in L^1([0,1]; L^\infty(\mathbb{R}^d; \mathbb{R}^d))$⁸ and let $x, y \colon [a,b] \to \mathbb{R}^d$ be weak solutions to the differential equations
$$\partial_t x(t) = f(t, x(t)), \ x(t_0) = x_0 \quad \text{and} \quad \partial_t y(t) = g(t, y(t)), \ y(t_0) = y_0.$$
Then we have
$$\sup_{t \in [0,1]} \|x(t) - y(t)\| \le c \cdot \big( \|x_0 - y_0\| + \|f - g\|_{L^1([0,1]; L^\infty(\mathbb{R}^d; \mathbb{R}^d))} \big), \tag{7}$$
where the constant $c$ only depends on $\|h\|_{L^1([0,1])}$.

Proof. We only need to check the requirements of the previous result. We recall that $x$ and $y$ solve the integral equations associated to the ODEs and hence obtain for $t \in I$
$$\begin{aligned}
\|\gamma_1(t) - \gamma_2(t)\| &= \|f(t, x(t)) - g(t, y(t))\| \\
&\le \|f(t, x(t)) - f(t, y(t))\| + \|f(t, y(t)) - g(t, y(t))\| \\
&\le h(t) \cdot \|x(t) - y(t)\| + \|f(t,\cdot) - g(t,\cdot)\|_\infty.
\end{aligned}$$

Later we will view residual networks as perturbed Euler approximations of an ordinary differential equation. To show convergence of those, we provide an error estimate for such perturbations; namely, we replace the direction $f(t_i, x_n(t_i))$ of the Euler approximation $x_n$ on $[t_i, t_{i+1})$ by $z_i \approx f(t_i, x_n(t_i))$.

Proposition 16 (Error estimate for perturbed Euler schemes). Let $f \colon [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ be a Carathéodory function such that $f(t,\cdot)$ is Lipschitz with constant $h(t)$, where $h \in L^1([0,1])$. Let now $x_0 \in \mathbb{R}^d$ and let $x \colon [0,1] \to \mathbb{R}^d$ be the weak solution to $\partial_t x(t) = f(t, x(t))$, $x(0) = x_0$. Fix $z_0, \dots, z_{n-1} \in \mathbb{R}^d$ and set $t_i := i/n$ as well as $\gamma := \sum_{i=0}^{n-1} \chi_{[t_i, t_{i+1})} z_i$. Let $x_n \colon [0,1] \to \mathbb{R}^d$ satisfy the integral equation
$$x_n(t) = x_0 + \int_0^t \gamma(s) \, \mathrm{d}s \quad \text{for all } t \in [0,1].$$
Assume that we have $\|z_i - f(t, x_n(t_i))\| \le \varepsilon$ for all $t \in [t_i, t_{i+1})$, $i = 0, \dots, n-1$, as well as $\|z_i\| \le c$ for all $i = 0, \dots, n-1$.
Then we have
$$\|x_n(t) - x(t)\| \le \tilde{c} \cdot \Big( \varepsilon + \frac{c}{n} \cdot \|h\|_{L^1([0,1])} \Big),$$
where $\tilde{c}$ only depends on $\|h\|_{L^1([0,1])}$.

⁷ For a general version of Grönwall's inequality we refer to Theorem 1.2.8 in Qin (2017).
⁸ I.e., the uniform distance $\|f(t,\cdot) - g(t,\cdot)\|_\infty$ is integrable over $[0,1]$.

Proof. Once more we will use Lemma 13 with the obvious choices of $\gamma_1$ and $\gamma_2$. For $t \in [t_i, t_{i+1})$ we estimate
$$\begin{aligned}
\|\gamma_1(t) - \gamma_2(t)\| &= \|z_i - f(t, x(t))\| \\
&\le \|z_i - f(t, x_n(t_i))\| + \|f(t, x_n(t_i)) - f(t, x(t))\| \\
&\le \varepsilon + \|f(t, x_n(t_i)) - f(t, x_n(t))\| + \|f(t, x_n(t)) - f(t, x(t))\| \\
&\le \varepsilon + h(t) \cdot \|x_n(t_i) - x_n(t)\| + h(t) \cdot \|x_n(t) - x(t)\| \\
&\le \varepsilon + \frac{c}{n} \cdot h(t) + h(t) \cdot \|x_n(t) - x(t)\|.
\end{aligned}$$

C PROOFS OF THE MAIN RESULTS

Let us quickly recall our definition of residual networks. Let $R_1, \dots, R_n \colon \mathbb{R}^d \to \mathbb{R}^d$ be neural networks. The resulting ResNet $x_n \colon [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ is defined via
$$x_n(0, y) := y, \qquad x_n(t_{k+1}, y) := x_n(t_k, y) + n^{-1} \cdot R_{k+1}(x_n(t_k, y)) \quad \text{for } k = 0, \dots, n-1,$$
and linearly in between. It is clear from the definition that ResNets are in fact Euler approximations to the piecewise constant right hand side $f \colon [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ defined via
$$f(t, x) := \sum_{i=0}^{n-1} \chi_{[t_i, t_{i+1})}(t) \, R_{i+1}(x) \quad \text{for all } t \in [0,1], \ x \in \mathbb{R}^d,$$
where $t_i := i/n$. We will use the error estimate on perturbed Euler schemes established in the previous section to show that, by letting $n \to \infty$ and increasing the expressivity of the networks $R_i$, ReLU ResNets are able to approximate space-time solutions of arbitrary right hand sides.

Lemma 17. Let $f \in L^1([0,1]; C_b^{0,1}(\mathbb{R}^d; \mathbb{R}^d))$ and let $x$ be the space-time solution of the ODE with right hand side $f$.
Then for every $\varepsilon > 0$ there are $n \in \mathbb{N}$ and $g \in L^1([0,1]; C_b^{0,1}(\mathbb{R}^d; \mathbb{R}^d))$ that is constant on all intervals of the form $[i/n, (i+1)/n)$ such that the space-time solution $\tilde{x}$ to $g$ satisfies
$$\|x(t,y) - \tilde{x}(t,y)\| \le \varepsilon \quad \text{for all } t \in [0,1], \ y \in \mathbb{R}^d.$$

Proof. By standard Bochner theory (see Arendt et al., 2011) the continuous functions are dense in $L^1\big([0,1]; C_b^{0,1}(\mathbb{R}^d; \mathbb{R}^d)\big)$. Moreover, continuous functions can be approximated arbitrarily well by functions that are constant on intervals of equal length. Now the continuity estimate (7) yields the assertion.

Theorem 2 (Space-time approximation with ResNets). Let $d \in \mathbb{N}$, let $f \in L^1\big([0,1]; C_b^{0,1}(\mathbb{R}^d; \mathbb{R}^d)\big)$ and let $x$ be the space-time solution to $f$. Then for every compact set $K \subseteq \mathbb{R}^d$ and every $\varepsilon > 0$ there is a ReLU ResNet $\tilde{x}$ such that
$$\|\tilde{x}(t,y) - x(t,y)\| \le \varepsilon \quad \text{for all } t \in [0,1], \ y \in K.$$

Proof. By the previous lemma we can without loss of generality assume that $f$ is constant on the intervals $[i/n, (i+1)/n)$ for some $n \in \mathbb{N}$. We note that since $f$ is piecewise constant with values in $C_b^{0,1}(\mathbb{R}^d; \mathbb{R}^d)$, there is $c > 0$ such that $\|f(t,x)\| \le c$ for all $t \in [0,1]$, $x \in \mathbb{R}^d$. It suffices to show the statement for the compact sets $K = B_N$, where $B_N$ denotes the ball of radius $N$ around the origin. By Remark 14 we have $x(t,y) \in B_M$ for every $t \in [0,1]$, $y \in B_N$, where $M = N + c$.

Let now $\varepsilon > 0$; then the universal approximation result of Proposition 10 for ReLU networks yields the existence of ReLU networks $R_0, \dots, R_{n-1} \colon \mathbb{R}^d \to \mathbb{R}^d$ with parameters $\theta_0, \dots, \theta_{n-1}$ such that
$$\|f(t,y) - R_i(y)\| \le \varepsilon \quad \text{for all } y \in B_M, \ t \in [i/n, (i+1)/n), \tag{8}$$
as well as $\|R_i\| \le c$. Further, we choose $k \in \mathbb{N}$ such that
$$\frac{c}{kn} \cdot \|f\|_{L^1([0,1]; C_b^{0,1}(\mathbb{R}^d; \mathbb{R}^d))} \le \varepsilon. \tag{9}$$
Let now $\tilde{x}$ be the ReLU ResNet with parameters
$$(\theta_0, \dots, \theta_0, \theta_1, \dots, \theta_1, \dots, \theta_{n-1}, \dots, \theta_{n-1}), \tag{10}$$
where each network $\theta_i$ is included $k$ times. Now we aim to apply Proposition 16 and hence check its requirements. We fix $y \in B_N$, denote $x(t,y)$ and $\tilde{x}(t,y)$ by $x(t)$ and $\tilde{x}(t)$ respectively, and again Remark 14 yields $\tilde{x}(t) \in B_M$ for all $t \in [0,1]$. In order to use the notation from the proposition we set $t_i := i/(kn)$ and $z_i := R_j(\tilde{x}(t_i))$ for $i = kj, \dots, k(j+1) - 1$ and obtain
$$\tilde{x}(t) = y + \int_0^t \gamma(s) \, \mathrm{d}s \quad \text{for } \gamma = \sum_{i=0}^{kn-1} \chi_{[t_i, t_{i+1})} z_i.$$
Further, it holds that $\|z_i - f(t, \tilde{x}(t_i))\| \le \varepsilon$ for all $t \in [t_i, t_{i+1})$, $i = 0, \dots, kn-1$, as well as $\|z_i\| \le c$. Now Proposition 16 yields
$$\|\tilde{x}(t) - x(t)\| \le \tilde{c} \cdot \Big( \varepsilon + \frac{c}{kn} \cdot \|f\|_{L^1([0,1]; C_b^{0,1}(\mathbb{R}^d; \mathbb{R}^d))} \Big) \le 2\tilde{c} \cdot \varepsilon \quad \text{for all } t \in [0,1],$$
where $\tilde{c}$ only depends on $\|f\|_{L^1([0,1]; C_b^{0,1}(\mathbb{R}^d; \mathbb{R}^d))}$ and not on $y \in B_N$.

The universal approximation theorem presented above is of qualitative nature, since it does not give any estimates on the complexity of the residual network needed to approximate a flow up to a certain precision. This is due to the fact that we work with density results for continuous functions in the Bochner space $L^1\big([0,1]; C_b^{0,1}(\mathbb{R}^d; \mathbb{R}^d)\big)$. In the proof above one could also assume that (9) holds for $k = 1$, since $f$ is also piecewise constant on the intervals $[i/(kn), (i+1)/(kn))$. However, we wanted to separate the approximation procedures in space and in time. More precisely, if $f$ is (almost) constant in time, (8) can be achieved with small $n$ and hence the constructed ResNet (2) shares a lot of weights. This observation could be used to explore the approximation capabilities of ResNets with shared weights under different spatial and temporal regularity of the right hand side $f$.
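The recursion above is straightforward to run numerically. In the sketch below (ours, not from the paper) each residual block evaluates $f(t_i, \cdot)$ exactly instead of being a trained ReLU network, so the ResNet reduces to the perturbed Euler scheme of Proposition 16 with $\varepsilon = 0$; the right hand side $f(t,x) = \cos(t)\,x$ is an arbitrary test case with known flow:

```python
import math

def resnet_flow(blocks, y):
    """ResNet recursion x_{k+1} = x_k + (1/n) * R_{k+1}(x_k) started
    at y; returns the terminal value x_n(1, y)."""
    n = len(blocks)
    x = y
    for R in blocks:
        x = x + (1.0 / n) * R(x)
    return x

# f(t, x) = cos(t) * x has the space-time solution x(t, y) = y * e^{sin(t)}
f = lambda t, x: math.cos(t) * x
n = 2000
blocks = [(lambda x, t=i / n: f(t, x)) for i in range(n)]  # R_{i+1} = f(t_i, .)

y = 0.7
exact = y * math.exp(math.sin(1.0))
assert abs(resnet_flow(blocks, y) - exact) <= 5.0 / n      # O(1/n) as in Theorem 3
```

Replacing the exact blocks by ReLU approximations of $f(t_i, \cdot)$ adds the $\varepsilon$-term of Proposition 16 to this error.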
We use analogous arguments to establish estimates on the number and complexity of the residual blocks required to approximate space-time solutions of ODEs with Lipschitz continuous right hand side $f$.

Theorem 3 (Space-time approximation with complexity bounds). Let $d \in \mathbb{N}$, let $(r_n)_{n \in \mathbb{N}} \subseteq (0,\infty)$ be a sequence tending to $\infty$ and let $f \colon [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ be a bounded and Lipschitz continuous function. Let $x \colon [0,1] \times \mathbb{R}^d \to \mathbb{R}^d$ be the space-time solution of the ODE with right hand side $f$. Then for every $n \in \mathbb{N}$ there is a ReLU ResNet $x_n$ with parameters $\theta^n = (\theta_1^n, \dots, \theta_n^n)$ such that the following are satisfied:

1. Approximation: For every compact set $K \subseteq \mathbb{R}^d$ it holds that
$$\sup_{t \in [0,1], \, y \in K} \|x_n(t,y) - x(t,y)\| \in O(n^{-1}).$$

2. Complexity bounds: Every residual block $\theta_k^n$ has depth $\lceil \log_2((d+1)!) \rceil + 2$ and satisfies $N(\theta_k^n) \in O\big( r_n^d n^d \big)$. Finally, all but $O\big( r_n^d n^d \big)$ weights can be fixed.

Proof. We fix $n \in \mathbb{N}$ and set $t_i := i/n$. Let $R_i^n$ be ReLU networks of the asserted complexity that approximate $f(t_i, \cdot)$ on $[-r_n, r_n]^d$ up to $n^{-1}$, which exist by Proposition 4. Let $x_n$ be the ReLU ResNet with residual blocks $R_1^n, \dots, R_n^n$. We fix $N > 0$ and will show
$$\sup_{t \in [0,1], \, y \in B_N} \|x_n(t,y) - x(t,y)\| \in O(n^{-1}) \quad \text{for } n \to \infty$$
through an application of Proposition 16. Since $f$ is bounded, there is $c > 0$ such that $\|f(t,y)\| \le c$ for all $t \in [0,1]$, $y \in \mathbb{R}^d$, and hence the functions $R_i^n$ satisfy this as well. Setting $M := N + c$, Remark 14 yields $x(t,y), x_n(t,y) \in B_M$ for all $t \in [0,1]$, $y \in B_N$. Now we fix $y \in B_N$ and write $x(t), x_n(t)$ for $x(t,y)$ and $x_n(t,y)$ respectively; to keep to the notation of the error estimate for Euler schemes, we set $z_i := R_i^n(x_n(t_i))$.
For $n$ large enough that $r_n \ge M$ we obtain
$$\begin{aligned}
\|z_i - f(t, x_n(t_i))\| &= \|R_i^n(x_n(t_i)) - f(t, x_n(t_i))\| \\
&\le \|R_i^n(x_n(t_i)) - f(t_i, x_n(t_i))\| + \|f(t_i, x_n(t_i)) - f(t, x_n(t_i))\| \\
&\le n^{-1} \cdot (1 + L)
\end{aligned}$$
for all $t \in [t_i, t_{i+1})$ and $i = 0, \dots, n-1$, where $L$ denotes the Lipschitz constant of $f$. Furthermore, we have $\|z_i\| \le c$ for all $i = 0, \dots, n-1$, and hence Proposition 16 completes the proof.
