Dissecting Neural ODEs


Authors: Stefano Massaroli, Michael Poli, Jinkyoo Park, Atsushi Yamashita, Hajime Asama

Stefano Massaroli∗ (The University of Tokyo, DiffEqML; massaroli@robot.t.u-tokyo.ac.jp), Michael Poli∗ (KAIST, DiffEqML; poli_m@kaist.ac.kr), Jinkyoo Park (KAIST; jinkyoo.park@kaist.ac.kr), Atsushi Yamashita (The University of Tokyo; yamashita@robot.t.u-tokyo.ac.jp), Hajime Asama (The University of Tokyo; asama@robot.t.u-tokyo.ac.jp)

∗ Equal contribution. Author order was decided by flipping a coin.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

Continuous deep learning architectures have recently re-emerged as Neural Ordinary Differential Equations (Neural ODEs). This infinite-depth approach theoretically bridges the gap between deep learning and dynamical systems, offering a novel perspective. However, deciphering the inner workings of these models is still an open challenge, as most applications treat them as generic black-box modules. In this work we "open the box", further developing the continuous-depth formulation with the aim of clarifying the influence of several design choices on the underlying dynamics.

1 Introduction

Neural ODEs (Chen et al., 2018) represent the latest instance of continuous deep learning models, first developed in the context of continuous recurrent networks (Cohen and Grossberg, 1983). Since their introduction, research on Neural ODE variants (Tzen and Raginsky, 2019; Jia and Benson, 2019; Zhang et al., 2019b; Yıldız et al., 2019; Poli et al., 2019) has progressed at a rapid pace. However, the search for concise explanations and experimental evaluations of novel architectures has left many fundamental questions unanswered.

In this work, we establish a general system-theoretic Neural ODE formulation (1) and dissect it into its core components; we analyze each of them separately, shining light on peculiar phenomena unique to the continuous deep learning paradigm. In particular, augmentation strategies are generalized beyond ANODEs (Dupont et al., 2019), and the novel concepts of data-control and adaptive-depth enriching (1) are showcased as effective approaches to learn maps such as reflections or concentric annuli without augmentation. While explicit dependence on the depth variable has been considered in the original formulation (Chen et al., 2018), parameter depth-variance in continuous models has been overlooked. We provide the treatment in infinite-dimensional space required by the true deep limit of ResNets, the solution of which leads to a Neural ODE variant based on a spectral discretization.

The general Neural Ordinary Differential Equation reads

$$\dot z(s) = f_{\theta(s)}(s, x, z(s)), \qquad z(0) = h_x(x), \qquad \hat y(s) = h_y(z(s)), \qquad s \in \mathcal{S}, \tag{1}$$

with input $x \in \mathbb{R}^{n_x}$, output $\hat y \in \mathbb{R}^{n_y}$, (hidden) state $z \in \mathbb{R}^{n_z}$, parameters $\theta(s) \in \mathbb{R}^{n_\theta}$, neural vector field $f_{\theta(s)}$ taking values in $\mathbb{R}^{n_z}$, input network $h_x : \mathbb{R}^{n_x} \to \mathbb{R}^{n_z}$ and output network $h_y : \mathbb{R}^{n_z} \to \mathbb{R}^{n_y}$.

Depth-variance  Vanilla Neural ODEs (Chen et al., 2018) cannot be considered the deep limit of ResNets. We discuss the subtleties involved, uncovering a formal optimization problem in functional space as the price to pay for true depth-variance. Obtaining its solution leads to two novel variants of Neural ODEs: a Galërkin-inspired spectral discretization (GalNODE) and a piecewise-constant model. GalNODEs are showcased on a task involving a loss distributed on the depth domain, requiring the introduction of a generalized version of the adjoint in (Chen et al., 2018).

Augmentation strategies  The augmentation idea of ANODEs (Dupont et al., 2019) is taken further and generalized to novel dynamical-system-inspired and parameter-efficient alternatives, relying on different choices of $h_x$ in (1).
These approaches, which include input-layer and higher-order augmentation, are verified to be more effective than existing methods in terms of performance and parameter efficiency.

Beyond augmentation: data-control and adaptive-depth  We unveil that, although important, augmentation is not always necessary in challenging tasks such as learning reflections or concentric annuli (Dupont et al., 2019). To start, we demonstrate that depth-varying vector fields alone are sufficient in dimensions greater than one. We then provide theoretical and empirical results motivating two novel Neural ODE paradigms: adaptive-depth, where the integration bound is itself determined by an auxiliary neural network, and data-controlled, where $f_{\theta(s)}$ is conditioned on the input data $x$, allowing the ODE to learn a family of vector fields instead of a single one. Finally, we warn against input networks $h_x$ of the multilayer, nonlinear type, as these can make Neural ODE flows superfluous.

2 Continuous-Depth Models

A general formulation  In the context of Neural ODEs, we suppose we are given a stream of input-output data $\{(x_k, y_k)\}_{k \in \mathcal{K}}$, where $\mathcal{K}$ is a linearly-ordered finite subset of $\mathbb{N}$. Inference in Neural ODEs is carried out by solving the initial value problem (IVP) (1), i.e.

$$\hat y(S) = h_y\Big(h_x(x) + \int_{\mathcal{S}} f_{\theta(\tau)}(\tau, x, z(\tau))\,d\tau\Big).$$

Our degree of freedom in the Neural ODE model, other than $h_x$ and $h_y$, is the choice of the parameter $\theta$ inside a given pre-specified class $\mathcal{W}$ of functions $\mathcal{S} \to \mathbb{R}^{n_\theta}$.

Well-posedness  If $f_{\theta(s)}$ is Lipschitz, for each $x_k$ the initial value problem in (1) admits a unique solution $z$ defined on the whole of $\mathcal{S}$. If this is the case, there is a mapping $\phi$ from $\mathcal{W} \times \mathbb{R}^{n_x}$ to the space of absolutely continuous functions $\mathcal{S} \to \mathbb{R}^{n_z}$ such that $z_k := \phi(\theta, x_k)$ satisfies the ODE in (1).
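To make the formulation concrete, the forward pass of (1) can be sketched numerically. The sketch below is illustrative, not the paper's implementation: dimensions and weights are hypothetical, $h_x$, $h_y$ are linear layers, the vector field is a single tanh layer (globally Lipschitz, so the IVP is well-posed), and a fixed-step explicit Euler scheme stands in for the adaptive solvers used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration: n_x = 2, n_z = 4, n_y = 1.
W_in = 0.5 * rng.normal(size=(4, 2))    # linear input network h_x
W_out = 0.5 * rng.normal(size=(1, 4))   # linear output network h_y
W_f = 0.5 * rng.normal(size=(4, 4))     # weights of the neural vector field

def f(s, x, z):
    # neural vector field f_theta(s, x, z); tanh is globally Lipschitz,
    # so the solution exists and is unique on all of S
    return np.tanh(W_f @ z)

def forward(x, S=1.0, n_steps=200):
    """y_hat(S) = h_y( h_x(x) + int_0^S f(tau, x, z(tau)) dtau ), explicit Euler."""
    z = W_in @ x                          # z(0) = h_x(x)
    ds = S / n_steps
    for k in range(n_steps):
        z = z + ds * f(k * ds, x, z)      # one Euler step of the IVP
    return W_out @ z                      # y_hat(S) = h_y(z(S))

x = np.array([1.0, -0.5])
y_hat = forward(x)
```

Refining `n_steps` changes the output only at the order of the discretization error, which is the numerical counterpart of the unique flow $\phi$ guaranteed by well-posedness.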
This in turn implies that, for all $k \in \mathcal{K}$, the map $(\theta, x_k, s) \mapsto \gamma(\theta, x_k, s) := h_y\big(\phi(\theta, x_k)(s)\big)$ satisfies $\hat y(s) = \gamma(\theta, x_k, s)$. For compactness, for any $s \in \mathcal{S}$, we denote $\phi(\theta, x_k)(s)$ by $\phi_s(\theta, x_k)$.

Training: optimal control  Chen et al. (2018) treated the training of constant-parameter Neural ODEs (i.e. $\mathcal{W}$ is the space of constant functions), considering only loss functions depending on the terminal state $z(S)$. However, in the framework of Neural ODEs, the latent state evolves through a continuum of layers steering the model output $\hat y(s)$ towards the label. It thus makes sense to introduce a loss function also distributed on the whole depth domain $\mathcal{S}$, e.g.

$$\ell := L(z(S)) + \int_{\mathcal{S}} l(\tau, z(\tau))\,d\tau. \tag{2}$$

Training can then be cast into the optimal control (Pontryagin et al., 1962) problem

$$\min_{\theta \in \mathcal{W}} \frac{1}{|\mathcal{K}|}\sum_{k \in \mathcal{K}} \ell_k \quad \text{subject to} \quad \dot z(s) = f_{\theta(s)}(s, x_k, z(s)),\ s \in \mathcal{S}; \quad z(0) = h_x(x_k); \quad \hat y(s) = h_y(z(s)), \quad \forall k \in \mathcal{K}, \tag{3}$$

solved by gradient descent. Here, if $\theta$ is constant, the gradients can be computed with $\mathcal{O}(1)$ memory efficiency by generalizing the adjoint sensitivity method of (Chen et al., 2018).

Proposition 1 (Generalized Adjoint Method). Consider the loss function (2). Then,

$$\frac{d\ell}{d\theta} = \int_{\mathcal{S}} a^\top(\tau)\,\frac{\partial f_\theta}{\partial \theta}\,d\tau,$$

where $a(s)$ satisfies

$$\dot a^\top(s) = -a^\top(s)\frac{\partial f_\theta}{\partial z} - \frac{\partial l}{\partial z}, \qquad a^\top(S) = \frac{\partial L}{\partial z(S)}.$$

The supplementary material contains additional insights on the choice of activation, training regularizers and approximation capabilities of Neural ODEs, along with a detailed derivation of the above result.

3 Depth-Variance: Infinite Dimensions for Infinite Layers

Bringing residual networks to the deep limit  Vanilla Neural ODEs, as they appear in the original paper (Chen et al., 2018), cannot be fully considered the deep limit of ResNets.
In fact, while each residual block is characterized by its own parameter vector $\theta_s$, the authors consider the model $\dot z = f_\theta(s, z(s))$, where the depth variable $s$ enters the dynamics per se² rather than through the map $s \mapsto \theta(s)$. The first attempt to pursue the true deep limit of ResNets is the hypernetwork approach of (Zhang et al., 2019b), where another neural network parametrizes the dynamics of $\theta(s)$. However, this approach is not backed by any theoretical argument and it exhibits a considerable parameter inefficiency, as it generally scales polynomially in $n_\theta$. We adopt a different approach, setting out to tackle the problem theoretically in the general formulation. Here, we uncover an optimization problem in functional space, solved by a direct application of the adjoint sensitivity method in infinite dimensions. We then introduce two parameter-efficient depth-variant Neural ODE architectures based on the solution of this problem: Galërkin Neural ODEs and Stacked Neural ODEs.

² In practice, $s$ is often concatenated to $z$ and fed to $f_\theta$.

Gradient descent in functional space  When the model parameters are depth-varying, $\theta : \mathcal{S} \to \mathbb{R}^{n_\theta}$, the nonlinear optimization problem (3) should in principle be solved by iterating a gradient descent algorithm in a functional space (Smyrlis and Zisis, 2004), e.g. $\theta_{k+1}(s) = \theta_k(s) - \eta\,\delta\ell_k/\delta\theta(s)$, once the Gateaux derivative $\delta\ell_k/\delta\theta(s)$ is computed. Let $L^2(\mathcal{S} \to \mathbb{R}^{n_\theta})$ be the space of square-integrable functions $\mathcal{S} \to \mathbb{R}^{n_\theta}$. Hereafter, we show that if $\theta(s) \in \mathcal{W} := L^2(\mathcal{S} \to \mathbb{R}^{n_\theta})$, then the loss sensitivity to $\theta(s)$ can be computed through the adjoint method.

Theorem 1 (Infinite-Dimensional Gradients). Consider the loss function (2) and let $\theta(s) \in L^2(\mathcal{S} \to \mathbb{R}^{n_\theta})$. Then, the sensitivity of $\ell$ with respect to $\theta(s)$ (i.e. the directional derivative in functional space) is

$$\frac{\delta\ell}{\delta\theta(s)} = a^\top(s)\,\frac{\partial f_{\theta(s)}}{\partial \theta(s)},$$

where $a(s)$ satisfies

$$\dot a^\top(s) = -a^\top(s)\frac{\partial f_{\theta(s)}}{\partial z} - \frac{\partial l}{\partial z}, \qquad a^\top(S) = \frac{\partial L}{\partial z(S)}.$$

Note that, although Theorem 1 provides a constructive method to compute the loss gradient in the infinite-dimensional setting, its implementation requires choosing a finite-dimensional approximation of the solution. We offer two alternatives: a spectral discretization approach, relying on reformulating the problem on some functional basis, and a depth discretization approach.

Spectral discretization: Galërkin Neural ODEs  The idea is to expand $\theta(s)$ on a complete orthogonal basis of a predetermined subspace of $L^2(\mathcal{S} \to \mathbb{R}^{n_\theta})$ and truncate the series at the $m$-th term:

$$\theta(s) = \sum_{j=1}^{m} \alpha_j\,\psi_j(s).$$

In this way, the problem is turned into a finite-dimensional one, and training aims to optimize the parameters $\alpha = (\alpha_1, \dots, \alpha_m) \in \mathbb{R}^{m n_\theta}$, whose gradients can be computed as follows.

Corollary 1 (Spectral Gradients). Under the assumptions of Theorem 1, if $\theta(s) = \sum_{j=1}^m \alpha_j\,\psi_j(s)$,

$$\frac{d\ell}{d\alpha} = \int_{\mathcal{S}} a^\top(\tau)\,\frac{\partial f_{\theta(\tau)}}{\partial \theta(\tau)}\,\psi(\tau)\,d\tau, \qquad \psi = (\psi_1, \dots, \psi_m).$$

Depth discretization: Stacked Neural ODEs  An alternative approach to parametrizing $\theta(s)$ is to assume it piecewise constant in $\mathcal{S}$, i.e. $\theta(s) = \theta_i\ \forall s \in [s_i, s_{i+1}]$ and $\mathcal{S} = \bigcup_{i=0}^{p-1}[s_i, s_{i+1}]$. It is easy to see that evaluating this model is equivalent to stacking $p$ Neural ODEs with constant parameters,

$$z(S) = h_x(x) + \sum_{i=0}^{p-1} \int_{s_i}^{s_{i+1}} f_{\theta_i}(\tau, x, z(\tau))\,d\tau.$$

Here, training is carried out by optimizing the resulting $p\,n_\theta$ parameters using the following.

Corollary 2 (Stacked Gradients). Under the assumptions of Theorem 1, if $\theta(s) = \theta_i\ \forall s \in [s_i, s_{i+1}]$,

$$\frac{d\ell}{d\theta_i} = -\int_{s_{i+1}}^{s_i} a^\top(\tau)\,\frac{\partial f_{\theta_i}}{\partial \theta_i}\,d\tau,$$

where $a(s)$ satisfies

$$\dot a^\top(s) = -a^\top(s)\frac{\partial f_{\theta_i}}{\partial z} - \frac{\partial l}{\partial z},\ s \in [s_i, s_{i+1}], \qquad a^\top(S) = \frac{\partial L}{\partial z(S)}.$$

The two approaches offer different perspectives on the problem of parametrizing the evolution of $\theta(s)$: while the spectral method imposes a stronger prior on the model class, based on the chosen basis (e.g. Fourier series, Chebyshev polynomials, etc.), the depth-discretization method allows for more freedom. Further details on proofs, derivations and implementation of the two models are given in the Appendix.

Tracking signals via depth-variance  Consider the problem of tracking a periodic signal $\beta(s)$. We show how this can be achieved without introducing additional inductive biases such as (Greydanus et al., 2019), through a synergistic combination of a two-layer Galërkin Neural ODE and the generalized adjoint with integral loss $l(s) := \|\beta(s) - z(s)\|_2^2$. The models, trained in $s \in [0, 1]$, generalize accurately in extrapolation, recovering the dynamics.

[Figure 1: Periodic tracking with integral loss. Galërkin Neural ODEs trained with integral losses accurately recover periodic signals. Blue curves correspond to different initial conditions and converge asymptotically to the desired reference trajectory.]

Fig. 2 showcases the depth-dynamics of $\theta(s)$ for Galërkin and Stacked variants trained to solve a simple binary classification problem. Additional insights and details are reported in the Appendix. Depth-variance brings Neural ODEs closer to the ideal continuum of neural network layers with untied weights, enhancing their expressivity.

[Figure 2: Galërkin and Stacked parameter-varying Neural ODE variants. Depth flows (above) and evolution of the parameters (below).]

4 Augmenting Neural ODEs

Augmented Neural ODEs (ANODEs) (Dupont et al., 2019) propose solving the initial value problem (IVP) in a higher-dimensional space to limit the complexity of learned flows, i.e. having $n_z > n_x$. The approach proposed in the seminal paper relies on initializing the $n_a := n_z - n_x$ augmented dimensions to zero: $z(0) = [x, \mathbf{0}]$. We will henceforth refer to this augmentation strategy as 0-augmentation. In this section we discuss alternative augmentation strategies for Neural ODEs that match or improve on 0-augmentation in terms of performance or parameter efficiency.

Input-layer augmentation  Following the standard deep learning approach of increasing layer width to achieve improved model capacity, 0-augmentation can be generalized by introducing an input network $h_x : \mathbb{R}^{n_x} \to \mathbb{R}^{n_z}$ to compute $z(0)$:

$$z(0) = h_x(x), \tag{4}$$

leading to the general formulation of (1). This approach gives the model more freedom in determining the initial condition for the IVP, instead of constraining it to a concatenation of $x$ and $\mathbf{0}$, at a small parameter cost if $h_x$ is, e.g., a linear layer. We refer to this type of augmentation as input-layer (IL) augmentation and to the model as an IL-Neural ODE (IL-NODE). Note that 0-augmentation is compatible with the general IL formulation, as it corresponds to $h_x : x \mapsto (x, \mathbf{0})$.

In applications where maintaining the structure of the first $n_x$ dimensions is important, e.g. approximation of dynamical systems, a parameter-efficient alternative to (4) can be obtained by modifying the input network $h_x$ to only affect the additional $n_a$ dimensions, i.e. $h_x := [x, \xi(x)]$, $\xi : \mathbb{R}^{n_x} \to \mathbb{R}^{n_a}$.
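The three choices of $h_x$ discussed so far differ only in how the initial condition $z(0)$ is built. A minimal sketch, with illustrative sizes and a hypothetical linear/one-layer map standing in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_a = 2, 2
n_z = n_x + n_a

W = 0.5 * rng.normal(size=(n_z, n_x))   # hypothetical learned linear layer

def h_zero(x):
    # 0-augmentation (ANODE): z(0) = [x, 0]
    return np.concatenate([x, np.zeros(n_a)])

def h_il(x):
    # input-layer augmentation (IL-NODE): z(0) = h_x(x), h_x a linear layer
    return W @ x

def h_selective(x):
    # structure-preserving alternative: z(0) = [x, xi(x)]; only the extra
    # n_a coordinates depend on a (hypothetical) network xi
    xi = np.tanh(W[:n_a] @ x)
    return np.concatenate([x, xi])

x = np.array([1.0, -2.0])
z0_zero, z0_il, z0_sel = h_zero(x), h_il(x), h_selective(x)
```

All three produce an initial state in $\mathbb{R}^{n_z}$; only the first two constrain (or not) the leading $n_x$ coordinates to equal $x$.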
Higher-order Neural ODEs  Further parameter efficiency can be achieved by lifting the Neural ODE to higher orders. For example, let $z(s) = [z_q(s), z_p(s)]$ in a second-order Neural ODE of the form

$$\ddot z_q(s) = f_{\theta(s)}(s, z(s)), \tag{5}$$

equivalent to the first-order system

$$\dot z_q(s) = z_p(s), \qquad \dot z_p(s) = f_{\theta(s)}(s, z_q(s), z_p(s)). \tag{6}$$

The above can be extended to higher-order Neural ODEs as

$$\frac{d^n z_1}{ds^n} = f_{\theta(s)}\Big(s, z, \frac{dz_1}{ds}, \dots, \frac{d^{n-1} z_1}{ds^{n-1}}\Big), \qquad z = [z_1, z_2, \dots, z_n],\ z_i \in \mathbb{R}^{n_z/n}, \tag{7}$$

or, equivalently, $\dot z_i(s) = z_{i+1}(s)$, $\dot z_n(s) = f_{\theta(s)}(s, z(s))$. Note that the parameter efficiency of this method arises from the fact that $f_{\theta(s)} : \mathbb{R}^{n_z} \to \mathbb{R}^{n_z/n}$ instead of $\mathbb{R}^{n_z} \to \mathbb{R}^{n_z}$.

A limitation of system (6) is that a naive extension to second order requires a number of augmented dimensions $n_a = n_x$. To allow for flexible augmentations of few dimensions, $n_a < n_x$, the formulation of second-order Neural ODEs can be modified to select only a few dimensions with higher-order dynamics. We include the formulation and additional details of selective higher-order augmentation in the supplementary material. Finally, higher-order augmentation is itself compatible with input-layer augmentation.

Revisiting results for augmented Neural ODEs  In higher-dimensional state spaces, such as those of image classification settings, the benefits of augmentation become subtle and manifest as performance improvements and a lower number of function evaluations (NFEs) (Chen et al., 2018). We revisit the image classification experiments of (Dupont et al., 2019) and evaluate four classes of depth-invariant Neural ODEs: vanilla (no augmentation), ANODE (0-augmentation), IL-NODE (input-layer augmentation), and second-order. The input network $h_x$ is composed of a single linear layer.
The main objective of these experiments is to rank the efficiency of the different augmentation strategies; for this reason, the setup does not involve hybrid or composite Neural ODE architectures or data augmentation. The results for the five experiments are reported in Table 1. IL-NODEs consistently preserve lower NFEs than the other variants, whereas second-order Neural ODEs offer a parameter-efficient alternative. The performance gap widens on CIFAR10, where the disadvantage of fixed 0 initial conditions forces 0-augmented Neural ODEs into performing a high number of function evaluations.

                 NODE          ANODE         IL-NODE       2nd-Ord.
               MNIST  CIFAR  MNIST  CIFAR  MNIST  CIFAR  MNIST  CIFAR
  Test Acc.     96.8   58.9   98.9   70.8   99.1   73.4   99.2   72.8
  NFE             98     93     71    169     44     65     43     59
  Param. [K]    21.4   37.1   20.4   35.0   20.7   36.1   20.0   34.6

Table 1: Mean test results across 10 runs on MNIST and CIFAR. We report the mean NFE at convergence. Input-layer and higher-order augmentation improve task performance and preserve low NFEs at convergence.

[Figure 3: Depth trajectories over the vector field of the data-controlled Neural ODE (9) for $x = 1$ and $x = -1$. The model learns a family of vector fields conditioned on the input $x$ to approximate $\varphi(x)$.]

It should be noted that prepending an input multilayer neural network to the Neural ODE was the approach chosen in the experimental evaluations of the original Neural ODE paper (Chen et al., 2018), and that (Dupont et al., 2019) opted for a comparison between no input layer and 0-augmentation. However, a significant difference exists between architectures depending on the depth and expressivity of $h_x$. Indeed, utilizing nonlinear, multilayer input networks can be detrimental, as discussed in Sec. 5.
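Among the variants compared above, the second-order model integrates system (6). A minimal sketch (illustrative sizes and weights, explicit Euler in place of an adaptive solver) makes the parameter saving visible: the state lives in $\mathbb{R}^{2m}$ but the learned field only outputs $\mathbb{R}^m$.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 2                                     # z = [z_q, z_p], with z_q, z_p in R^m
W_f = 0.5 * rng.normal(size=(m, 2 * m))   # field maps R^{2m} -> R^m, not R^{2m}

def accel(s, z):
    # f_theta(s, z): plays the role of the acceleration of z_q in (5);
    # its half-size output is the source of the parameter savings
    return np.tanh(W_f @ z)

def step(z, s, ds):
    zq, zp = z[:m], z[m:]
    # first-order form (6): zq' = zp, zp' = f(s, z)
    return np.concatenate([zq + ds * zp, zp + ds * accel(s, z)])

x = np.array([1.0, -1.0])
z = np.concatenate([x, np.zeros(m)])      # naive second-order init: z(0) = [x, 0]
ds = 0.01
for k in range(100):                      # integrate over s in [0, 1]
    z = step(z, k * ds, ds)
```

Note that this naive initialization sets $n_a = n_x$ velocity dimensions to zero, which is exactly the limitation the selective higher-order variant addresses.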
Augmentation relieves Neural ODEs of their expressivity limitations. Learning the initial conditions improves on 0-augmentation in terms of performance and NFEs.

5 Beyond Augmentation: Data-Control and Depth-Adaptation

Augmentation strategies are not always necessary for Neural ODEs to solve challenging tasks such as concentric annuli (Dupont et al., 2019). While it is indeed true that two distinct trajectories can never intersect in the state space in the one-dimensional case, this does not necessarily hold in general. In fact, dynamics in the first two spatial dimensions are substantially different: e.g., no chaotic behaviors are possible (Khalil and Grizzle, 2002). In $\mathbb{R}^2$ (and thus in $\mathbb{R}^n$), infinitely wider than $\mathbb{R}$, distinct trajectories of a time-varying process can well intersect in the state space, provided that they do not pass through the same point at the same time (Khalil and Grizzle, 2002). This implies, in turn, that depth-varying models such as Galërkin Neural ODEs can solve these tasks in all dimensions but $\mathbb{R}$. Starting from the one-dimensional case, we propose new classes of models allowing Neural ODEs to perform challenging tasks, such as approximating reflections (Dupont et al., 2019), without the need for any augmentation.

5.1 Data-Controlled Neural ODEs

We hereby derive a new class of models, namely data-controlled Neural ODEs. To introduce the proposed approach, we start with an analytical result regarding the approximation of reflection maps such as $\varphi(x) = -x$. The proof provides a design recipe for a simple handcrafted ODE capable of approximating $\varphi$ with arbitrary accuracy by leveraging the input data $x$. We denote the conditioning of the vector field on $x$ necessary to achieve the desired result as data-control.
This result highlights that, through data-control, Neural ODEs can approximate $\varphi$ arbitrarily well without augmentation, providing a novel perspective on existing results about the expressivity limitations of continuous models (Dupont et al., 2019). The result is the following:

Proposition 2. For all $\epsilon > 0$ and $x \in \mathbb{R}$ there exists a parameter $\theta > 0$ such that

$$|\varphi(x) - z(1)| < \epsilon, \tag{8}$$

where $z(1)$ is the solution of the Neural ODE

$$\dot z(s) = -\theta(z(s) + x), \qquad z(0) = x, \qquad s \in [0, 1]. \tag{9}$$

The proof is reported in the Appendix. Fig. 3 shows a version of model (9) where $\theta$ is trained with standard backpropagation. This model is indeed able to closely approximate $\varphi(x)$ without augmentation, confirming the theoretical result. From this preliminary example, we then define the general data-controlled Neural ODE as

$$\dot z(s) = f_{\theta(s)}(s, x, z(s)), \qquad z(0) = h_x(x). \tag{10}$$

Model (10) incorporates the input data $x$ into the vector field, effectively allowing the ODE to learn a family of vector fields instead of a single one. Direct dependence on $x$ further constrains the ODE to be smooth with respect to the initial condition, acting as a regularizer. Indeed, in the experimental evaluation at the end of Sec. 5, data-controlled models recover an accurate decision boundary. Further experimental results on the representation of $\varphi$ with the latter general model are reported in the Appendix.

It should be noted that (10) does not require explicit dependence of the vector field on $x$. Computationally, $x$ can be passed to $f_{\theta(s)}$ in different ways, such as through an additional embedding step. In this setting, data-control offers a natural extension to conditional Neural ODEs.

[Figure 4: Conditional continuous normalizing flows. Data-controlled CNFs can morph prior distributions into distinct posteriors to produce conditional samples. This task often requires crossing trajectories and is not possible with vanilla CNFs.]
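Model (9) can be checked directly: its closed-form solution is $z(s) = -x + 2x\,e^{-\theta s}$, so $|z(1) - \varphi(x)| = 2|x|e^{-\theta}$, which vanishes as $\theta$ grows. The sketch below integrates (9) with a fixed-step Euler scheme (an illustrative stand-in for a proper solver) and verifies the bound numerically.

```python
import numpy as np

def z_at_one(x, theta, n_steps=4000):
    """Euler-integrate the data-controlled ODE (9): z' = -theta (z + x), z(0) = x."""
    ds = 1.0 / n_steps
    z = x
    for _ in range(n_steps):
        z = z + ds * (-theta * (z + x))   # the vector field is conditioned on x
    return z

# closed form: z(s) = -x + 2 x exp(-theta s), so z(1) -> -x as theta grows;
# each input x induces its own vector field, sidestepping crossing flows
errs = [abs(z_at_one(x, theta=10.0) - (-x)) for x in (-1.0, 0.5, 2.0)]
```

With $\theta = 10$, the error $2|x|e^{-10}$ is already below $10^{-3}$ for the inputs tried here, matching Proposition 2.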
Data-control in normalizing flows  Conditional variants of generative models can be guided to produce samples with different characteristics depending on specific requirements. Data-control can be leveraged to obtain a conditional variant of continuous normalizing flows (CNFs) (Chen et al., 2018). Here, we consider the standard setting of learning an unknown data distribution $p(x)$, given samples $\{x_k\}_{k \in \mathcal{K}}$, through a parametrized function $p_\theta$. Continuous normalizing flows (Chen et al., 2018; Grathwohl et al., 2018; Finlay et al., 2020) obtain $p_\theta$ by a change of variables, using the flow of an ODE to warp a known prior distribution $q(z)$, i.e.

$$\log p_\theta(x) = \log q(\phi_S(x)) + \log\det|\nabla \phi_S(x)|,$$

where the log-determinant of the Jacobian is computed via the fluid mechanics identity

$$\frac{d}{ds}\log\det|\nabla \phi_s(x)| = \nabla \cdot f_{\theta(s)}(s, \phi_s(x))$$

(Villani, 2003). CNFs are trained via maximum likelihood, i.e. by minimizing the Kullback-Leibler divergence between $p$ and $p_\theta$, or, equivalently, $\ell := -\frac{1}{|\mathcal{K}|}\sum_k \log p_\theta(x_k)$. A CNF can then be used as a generative model for $p_\theta(x)$ by sampling the known distribution $z_S \sim q(z_S)$ and evolving $z_S$ backward in the depth domain:

$$z(0) = z_S + \int_S^0 f_{\theta(s)}(s, z(s))\,ds.$$

In this context, introducing data-control into $f_\theta$ allows the CNF to be conditioned with data or task information. Data-controlled CNFs can thus be used in multi-objective generative tasks, e.g. using a single model to sample from $N$ different distributions by warping $N$ predetermined known distributions $q_i$. We train one-dimensional, data-controlled CNFs to approximate two different data distributions $p_1$, $p_2$ by sampling from two distinct priors $q_1$, $q_2$ and conditioning the vector field on the samples $z_S$ of the prior distributions, i.e.

$$\dot z(s) = f_\theta(z_S, z(s)), \qquad z_S \sim q_1 \ \text{or}\ z_S \sim q_2.$$

Fig. 4 shows how data-controlled CNFs are capable of conditionally sampling from two normal target data distributions. In this case we selected $p_1$, $p_2$ as univariate normal distributions with means $-1$ and $1$, respectively, and $q_1 \equiv p_2$, $q_2 \equiv p_1$. The resulting learned vector field depends strongly on the value of the prior sample $z_S$ and is almost constant in $z$, meaning that the prior distributions are shifted almost rigidly along the flow in a direction determined by the initial condition. This task is inaccessible to standard CNFs, as it requires crossing flows in $z$. Indeed, the proposed benchmark represents a density-estimation analogue of the crossing-trajectories problem.

5.2 Adaptive-Depth Neural ODEs

Let us come back to the approximation of $\varphi(x)$. Without incorporating the input data into $f_{\theta(s)}$, it is not possible to realize a mapping $x \mapsto \phi_s(x)$ mimicking $\varphi$, due to the topology-preserving property of the flows. Nevertheless, a Neural ODE can be employed to approximate $\varphi(x)$ without the need for any crossing trajectory. In fact, if each input is integrated over a different depth domain, $\mathcal{S}(x) = [0, s^*_x]$, it is possible to learn $\varphi$ without crossing flows, as shown in Fig. 5.

[Figure 5: Adaptive integration depth. Depth trajectories of the inputs through network depth, over the vector field of the adaptive-depth Neural ODE. The reflection map can be learned by the proposed model. The key is to assign different integration times to the inputs, thus not requiring intersecting trajectories.]

In general, we can use a hypernetwork $g$ trained to learn the integration depth of each sample. In this setting, we define the general adaptive-depth class as Neural ODEs performing the mapping $x \mapsto \phi_{g_\omega(x)}(x)$, i.e. leading to

$$\hat y = h_y\Big(h_x(x) + \int_0^{g_\omega(x)} f_{\theta(\tau)}(\tau, x, z(\tau))\,d\tau\Big),$$

where $g_\omega : \mathbb{R}^{n_x} \times \mathbb{R}^{n_\omega} \to \mathbb{R}$ is a neural network with trainable parameters $\omega$. The supplementary material contains details on differentiation under the integral sign, required to back-propagate the loss gradients into $\omega$.

5.3 Additional Results

Experiments with non-augmented models  We inspect the performance of different Neural ODE variants: depth-invariant, depth-variant with $s$ concatenated to $z$ and passed to the vector field, Galërkin Neural ODEs, and data-controlled. The concentric annuli dataset (Dupont et al., 2019) is utilized, and the models are qualitatively evaluated based on the complexity of the learned flows and on how accurately they extrapolate to unseen points, i.e. the learned decision boundaries. For Galërkin Neural ODEs, we choose a Fourier series with $m = 5$ harmonics as the eigenfunctions $\psi_k$, $k = 1, \dots, 5$, to compute the parameters $\theta(s)$, as described in Sec. 3.

Data-control allows Neural ODEs to learn a family of vector fields, conditioning on input data information. Depth-adaptation sidesteps known expressivity limitations of continuous-depth models.

[Figure 6: Solving concentric annuli without augmentation by prepending a nonlinear transformation performed by a two-layer fully-connected network. Panels: original space ("o" first point, "+" last) and flows in latent space.]

Mind your input networks  An alternative approach to learning maps that prove challenging for vanilla Neural ODEs involves solving the ODE in a latent state space. Fig. 6 shows that, with no augmentation, a network composed of two fully-connected layers with nonlinear activation followed by a Neural ODE can solve the concentric annuli problem. However, the flows learned by the Neural ODE are superfluous: indeed, the clusters were already linearly separable after the first nonlinear transformation.
This example w arns against superficial ev aluations of Neural ODE architectures preceded or followed by se veral layers of non–linear input and output transformations. In these scenarios, the learned flows risk performing unnecessary transformations and in pathological cases can collapse into a simple identity map. T o sidestep these issues, we propose visually inspecting trajectories or performing an ablation experiment on the Neural ODE block. 8 − 2 0 2 − 2 0 2 z 1 z 2 V anilla − 2 − 1 0 1 2 h y ( z ( s )) − 2 0 2 − 2 0 2 z 1 z 2 Concatenated Depth − 1 0 1 h y ( z ( s )) − 2 0 2 − 2 0 2 z 1 z 2 Galërkin − 0 . 5 0 0 . 5 1 h y ( z ( s )) − 2 0 2 − 2 0 2 z 1 z 2 Data–Con trolled − 2 − 1 0 1 h y ( z ( s )) Figure 7: Depth-flows of the data in the state–space. The resulting decision boundaries of output linear layer h y are indicated by the dotted orange line. 6 Related W ork W e include a brief history of classical approaches to dynamical system–inspired deep learning. A brief historical note on continuous deep learning Continuous neural networks have a long history that goes back to continuous time v ariants of recurrent networks ( Cohen and Grossberg , 1983 ). Since then, se veral works e xplored the connection between dynamical systems, control theory and machine learning ( Zhang et al. , 2014 ; Li et al. , 2017 ; Lu et al. , 2017 ; W einan , 2017 ). ( Marcus and W estervelt , 1989 ) pro vides stability analyses and introduces delays. Many of these concepts have yet to resurface in the conte xt of Neural ODEs. Haber and Ruthotto ( 2017 ) analyzes ResNet dynamics and links stability with rob ustness. Injecting stability into neural networks has inspired the design of a series of architectures ( Chang et al. , 2019 ; Haber et al. , 2019 ; Bai et al. , 2019 ; Massaroli et al. , 2020 ). Hauser et al. 
( 2019 ) explored the algebraic structure of neural networks go verned by finite difference equations, further linking discretizations of ODEs and ResNets in ( Hauser et al. , 2019 ). Approximating ODEs with neural networks has been discussed in ( W ang and Lin , 1998 ; Filici , 2008 ). ( Poli et al. , 2020a ) explores the interplay between Neural ODEs and their solver . On the optimization front, sev eral works lev erage dynamical system formalism in continuous time ( Wibisono et al. , 2016 ; Maddison et al. , 2018 ; Massaroli et al. , 2019 ). Neural ODEs This work concerns Neural ODEs ( Chen et al. , 2018 ) and a system–theoretic discussion of their dynamical beha vior . The main focus is on Neural ODEs and not the extensions to other classes of dif ferential equations ( Li et al. , 2020 ; Tzen and Raginsk y , 2019 ; Jia and Benson , 2019 ), though the insights dev eloped here can be broadly applied to continuous–depth models. More recently , Finlay et al. ( 2020 ) introduced regularization strategies to alleviate the hea vy computational training o verheads of Neural ODEs. These terms are propagated during the forward pass of the model and thus require state–augmentation. Le veraging our gener alized adjoint formulation provides an approach to integral re gularization terms without augmentation and memory overheads. 7 Conclusion In this work, we establish a general system–theoretic framework for Neural ODEs and dissect it into its core components. W ith the aim of shining light on fundamental questions regarding depth–variance, we formulate and solve the infinite–dimensional problem linked to the true deep limit formulation of Neural ODE. W e provide numerical approximations to the infinite–dimensional problem, leading to nov el model v ariants, such as Gal ¨ e rkin and piece wise–constant Neural ODEs. Augmentation is dev eloped beyond existing approaches ( Dupont et al. 
, 2019) to include input–layer and higher–order augmentation strategies, shown to be more performant and parameter–efficient. Finally, the novel paradigms of data–control and depth–adaptation are introduced to perform challenging tasks such as learning reflections without augmentation. The code to reproduce all the experiments in the paper is built on TorchDyn (Poli et al., 2020b) and PyTorch–Lightning (Falcon et al., 2019) and can be found at: https://github.com/DiffEqML/diffeqml-research/tree/master/dissecting-neural-odes.

Broader Impact

As continuous deep learning sees increased utilization across fields such as healthcare (Rubanova et al., 2019; Yıldız et al., 2019), it is of utmost importance that we develop appropriate tools to further our understanding of neural differential equations. The search for robustness in traditional deep learning has only recently seen a surge in ideas and proposed solutions; this work aims at providing the exploratory first steps necessary to extend the discussion to this novel paradigm. The leitmotif of this work is injecting system–theoretic concepts into the framework of continuous models. These ideas are of foundational importance in tangential fields such as control and forecasting of dynamical systems, and are routinely used to develop robust algorithms with theoretical and practical guarantees.

Acknowledgment

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, 2018R1D1A1B07050443.

References

S. Bai, J. Z. Kolter, and V. Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems 32, pages 688–699, 2019.
B. Chang, M. Chen, E. Haber, and E. H. Chi. AntisymmetricRNN: A dynamical system view on recurrent neural networks.
arXiv preprint arXiv:1902.09689, 2019.
T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.
D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint, 2015.
M. A. Cohen and S. Grossberg. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, (5):815–826, 1983.
E. Dupont, A. Doucet, and Y. W. Teh. Augmented neural ODEs. In Advances in Neural Information Processing Systems, pages 3134–3144, 2019.
W. Falcon et al. PyTorch Lightning. GitHub, https://github.com/williamFalcon/pytorch-lightning, 2019.
C. Filici. On a neural approximator to ODEs. IEEE Transactions on Neural Networks, 19(3):539–543, 2008.
C. Finlay, J.-H. Jacobsen, L. Nurbekyan, and A. M. Oberman. How to train your neural ODE. arXiv preprint arXiv:2002.02798, 2020.
W. Grathwohl, R. T. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint, 2018.
S. Greydanus, M. Dzamba, and J. Yosinski. Hamiltonian neural networks. In Advances in Neural Information Processing Systems, pages 15353–15363, 2019.
E. Haber and L. Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.
E. Haber, K. Lensink, E. Triester, and L. Ruthotto. IMEXnet: A forward stable deep neural network. arXiv preprint arXiv:1903.02639, 2019.
M. Hauser, S. Gunn, S. Saab Jr, and A. Ray. State-space representations of deep neural networks. Neural Computation, 31(3):538–554, 2019.
J. Jia and A. R. Benson. Neural jump stochastic differential equations. In Advances in Neural Information Processing Systems, pages 9843–9854, 2019.
H. K.
Khalil and J. W. Grizzle. Nonlinear Systems, volume 3. Prentice Hall, Upper Saddle River, NJ, 2002.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Q. Li, L. Chen, C. Tai, and E. Weinan. Maximum principle based algorithms for deep learning. The Journal of Machine Learning Research, 18(1):5998–6026, 2017.
Q. Li, T. Lin, and Z. Shen. Deep learning via dynamical systems: An approximation perspective. arXiv preprint arXiv:1912.10382, 2019.
X. Li, T.-K. L. Wong, R. T. Q. Chen, and D. Duvenaud. Scalable gradients for stochastic differential equations. In Proceedings of Machine Learning Research, volume 108, pages 3870–3882, 2020. URL http://proceedings.mlr.press/v108/li20i.html.
Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang. The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, pages 6231–6239, 2017.
C. J. Maddison, D. Paulin, Y. W. Teh, B. O'Donoghue, and A. Doucet. Hamiltonian descent methods. arXiv preprint arXiv:1809.05042, 2018.
C. Marcus and R. Westervelt. Stability of analog neural networks with delay. Physical Review A, 39(1):347, 1989.
S. Massaroli, M. Poli, F. Califano, A. Faragasso, J. Park, A. Yamashita, and H. Asama. Port-Hamiltonian approach to neural network training. arXiv preprint arXiv:1909.02702, 2019.
S. Massaroli, M. Poli, M. Bin, J. Park, A. Yamashita, and H. Asama. Stable neural flows. arXiv preprint arXiv:2003.08063, 2020.
M. Poli, S. Massaroli, J. Park, A. Yamashita, H. Asama, and J. Park. Graph neural ordinary differential equations. arXiv preprint arXiv:1911.07532, 2019.
M. Poli, S. Massaroli, A. Yamashita, H. Asama, and J. Park. Hypersolvers: Toward fast continuous-depth models. arXiv preprint arXiv:2007.09601, 2020a.
M. Poli, S. Massaroli, A. Yamashita, H. Asama, and J. Park. TorchDyn: A neural differential equations library.
arXiv preprint arXiv:2009.09346, 2020b.
L. S. Pontryagin, E. Mishchenko, V. Boltyanskii, and R. Gamkrelidze. The Mathematical Theory of Optimal Processes. 1962.
P. J. Prince and J. R. Dormand. High order embedded Runge–Kutta formulae. Journal of Computational and Applied Mathematics, 7(1):67–75, 1981.
Y. Rubanova, T. Q. Chen, and D. K. Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. In Advances in Neural Information Processing Systems, pages 5321–5331, 2019.
G. Smyrlis and V. Zisis. Local convergence of the steepest descent method in Hilbert spaces. Journal of Mathematical Analysis and Applications, 300(2):436–453, 2004.
B. Tzen and M. Raginsky. Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019.
C. Villani. Topics in Optimal Transportation. Number 58. American Mathematical Society, 2003.
Y.-J. Wang and C.-T. Lin. Runge–Kutta neural network for identification of dynamical systems in high accuracy. IEEE Transactions on Neural Networks, 9(2):294–307, 1998.
E. Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.
Ç. Yıldız, M. Heinonen, and H. Lähdesmäki. ODE²VAE: Deep generative second order ODEs with Bayesian neural networks. arXiv preprint arXiv:1905.10994, 2019.
H. Zhang, Z. Wang, and D. Liu. A comprehensive review of stability analysis of continuous-time recurrent neural networks. IEEE Transactions on Neural Networks and Learning Systems, 25(7):1229–1262, 2014.
H. Zhang, X. Gao, J. Unterman, and T. Arodz. Approximation capabilities of neural ordinary differential equations. arXiv preprint arXiv:1907.12998, 2019a.
T. Zhang, Z. Yao, A.
Gholami, K. Keutzer, J. Gonzalez, G. Biros, and M. Mahoney. ANODEV2: A coupled neural ODE evolution framework. arXiv preprint arXiv:1906.04596, 2019b.
H. Zheng, Z. Yang, W. Liu, J. Liang, and Y. Li. Improving deep neural networks using softplus units. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–4. IEEE, 2015.

Dissecting Neural ODEs: Supplementary Material

Table of Contents
A Proofs and Additional Theoretical Results
  A.1 Proof of Proposition 1
  A.2 Proof of Theorem 1
  A.3 Proof of Corollary 1
  A.4 Proof of Corollary 2
  A.5 Proof of Proposition 2
  A.6 Additional Theoretical Results
B Practical Insights for Neural ODEs
  B.1 Augmentation
  B.2 Activations
  B.3 Regularization for Stability
  B.4 Approximation Capabilities
  B.5 Example Implementation of Data–Control
C Experimental Details
  C.1 Experiments of Section 3
  C.2 Experiments of Section 4
  C.3 Experiments of Section 5

A Proofs and Additional Theoretical Results

A.1 Proof of Proposition 1

Proposition 1 (Generalized Adjoint Method). Consider the loss function (2).
Then,
$$\frac{d\ell}{d\theta} = \int_0^S a^\top(\tau)\,\frac{\partial f_\theta}{\partial \theta}\,d\tau,$$
where $a(s)$ satisfies
$$\dot a^\top(s) = -a^\top(s)\frac{\partial f_\theta}{\partial z} - \frac{\partial l}{\partial z}, \qquad a^\top(S) = \frac{\partial L}{\partial z(S)}.$$

Proof. Let us define a Lagrange multiplier, or adjoint state, $a$, dual to $z$. As the dual of $\mathbb{R}^{n_z}$ is $\mathbb{R}^{n_z}$ itself, $a \in \mathbb{R}^{n_z}$. Moreover, let $\mathcal{L}$ be a perturbed loss function of the form
$$\mathcal{L} := \ell - \int_0^S a^\top(\tau)\left[\dot z(\tau) - f_\theta(\tau, x, z(\tau))\right] d\tau.$$
Since $\dot z - f_\theta(s, x, z) = 0$ by construction, the integral term in $\mathcal{L}$ is always null and, thus, $a(s)$ can be freely assigned while $d\mathcal{L}/d\theta = d\ell/d\theta$. For the sake of compactness, we do not explicitly write the dependence on variables of the considered functions unless strictly necessary. Note that, via integration by parts,
$$\int_0^S a^\top \dot z\, d\tau = \left[a^\top(\tau) z(\tau)\right]_0^S - \int_0^S \dot a^\top z\, d\tau.$$
Hence,
$$\mathcal{L} = \ell - \left[a^\top(\tau) z(\tau)\right]_0^S + \int_0^S \left(\dot a^\top z + a^\top f_\theta\right) d\tau = L(z(S)) - \left[a^\top(\tau) z(\tau)\right]_0^S + \int_0^S \left(\dot a^\top z + a^\top f_\theta + l\right) d\tau. \quad (11)$$
We can compute the gradient of $\ell$ with respect to $\theta$ as
$$\frac{d\ell}{d\theta} = \frac{d\mathcal{L}}{d\theta} = \frac{\partial L(z(S))}{\partial z(S)}\frac{dz(S)}{d\theta} - a^\top(S)\frac{dz(S)}{d\theta} + \int_0^S \left[\dot a^\top \frac{dz}{d\theta} + a^\top\left(\frac{\partial f_\theta}{\partial \theta} + \frac{\partial f_\theta}{\partial z}\frac{dz}{d\theta}\right) + \frac{\partial l}{\partial z}\frac{dz}{d\theta}\right] d\tau,$$
where the terms in $dz(0)/d\theta$, $dx/d\theta$ and $d\tau/d\theta$ vanish, since neither the initial condition, the input nor the depth variable depend on $\theta$. Reorganizing the terms yields
$$\frac{d\ell}{d\theta} = \left[\frac{\partial L}{\partial z(S)} - a^\top(S)\right]\frac{dz(S)}{d\theta} + \int_0^S \left[\dot a^\top + a^\top\frac{\partial f_\theta}{\partial z} + \frac{\partial l}{\partial z}\right]\frac{dz}{d\theta}\, d\tau + \int_0^S a^\top \frac{\partial f_\theta}{\partial \theta}\, d\tau. \quad (12)$$
Now, if $a(s)$ satisfies the final value problem
$$\dot a^\top(s) = -a^\top(s)\frac{\partial f_\theta}{\partial z} - \frac{\partial l}{\partial z}, \qquad a^\top(S) = \frac{\partial L}{\partial z(S)}, \quad (13)$$
to be solved backward in $[0, S]$, then (12) reduces to
$$\frac{d\ell}{d\theta} = \int_0^S a^\top \frac{\partial f_\theta}{\partial \theta}\, d\tau, \quad (14)$$
proving the result.

Remark 1 (Implementation of the generalized adjoint method). Note that, similarly to (Chen et al.
, 2018), the gradient (14) is practically computed by defining the parameter adjoint state $a_\theta$ and solving backward the system of ODEs
$$\dot a^\top = -a^\top\frac{\partial f_\theta}{\partial z} - \frac{\partial l}{\partial z}, \qquad a^\top(S) = \frac{\partial L}{\partial z(S)},$$
$$\dot a_\theta^\top = -a^\top\frac{\partial f_\theta}{\partial \theta}, \qquad a_\theta(S) = 0_{n_\theta}. \quad (15)$$
Then, $d\ell/d\theta = a_\theta(0)$.

A.2 Proof of Theorem 1

Theorem 1 (Infinite–Dimensional Gradients). Consider the loss function (2) and let $\theta(s) \in L^2(\mathcal{S} \to \mathbb{R}^{n_\theta})$. Then, the sensitivity of $\ell$ with respect to $\theta(s)$ (i.e., the directional derivative in functional space) is
$$\frac{\delta\ell}{\delta\theta(s)} = a^\top(s)\frac{\partial f_{\theta(s)}}{\partial \theta(s)},$$
where $a(s)$ satisfies
$$\dot a^\top(s) = -a^\top(s)\frac{\partial f_{\theta(s)}}{\partial z} - \frac{\partial l}{\partial z}, \qquad a^\top(S) = \frac{\partial L}{\partial z(S)}.$$

Proof. The proof follows the same steps as that of Proposition 1 up to (11). However, here $\theta(s) \in L^2$ and the loss sensitivity to $\theta(s)$ corresponds to the directional (Gateaux) derivative $\delta\ell/\delta\theta(s)$ in $L^2$, derived as follows. We start by computing the total variation of $\ell$:
$$\delta\ell = \frac{\partial L}{\partial z(S)}\delta z(S) - a^\top(S)\delta z(S) + a^\top(0)\delta z(0) + \int_0^S \left[\dot a^\top(\tau)\delta z(\tau) + a^\top(\tau)\left(\frac{\partial f_{\theta(\tau)}}{\partial z(\tau)}\delta z(\tau) + \frac{\partial f_{\theta(\tau)}}{\partial \theta(\tau)}\delta\theta(\tau)\right) + \frac{\partial l}{\partial z(\tau)}\delta z(\tau)\right] d\tau.$$
Thus, noting that $\delta z(0)/\delta\theta(s) = 0$ since the initial condition does not depend on $\theta(s)$,
$$\frac{\delta\ell}{\delta\theta(s)} = \left[\frac{\partial L}{\partial z(S)} - a^\top(S)\right]\frac{\delta z(S)}{\delta\theta(s)} + \int_0^S \left[\dot a^\top(\tau)\frac{\delta z(\tau)}{\delta\theta(s)} + a^\top(\tau)\left(\frac{\partial f_{\theta(\tau)}}{\partial z(\tau)}\frac{\delta z(\tau)}{\delta\theta(s)} + \frac{\partial f_{\theta(\tau)}}{\partial \theta(\tau)}\frac{\delta\theta(\tau)}{\delta\theta(s)}\right) + \frac{\partial l}{\partial z(\tau)}\frac{\delta z(\tau)}{\delta\theta(s)}\right] d\tau.$$
Since it must hold that $\int \frac{\delta\theta(\tau)}{\delta\theta(s)}\, d\tau = 1$, the model class choice $\theta(s) \in L^2$ implies $\frac{\delta\theta(\tau)}{\delta\theta(s)} = \delta(\tau - s)$, where $\delta(\tau - s)$ is the Dirac delta.
Therefore, it holds that
$$\frac{\delta\ell}{\delta\theta(s)} = \left[\frac{\partial L}{\partial z(S)} - a^\top(S)\right]\frac{\delta z(S)}{\delta\theta(s)} + \int_0^S \left[\dot a^\top(\tau) + a^\top(\tau)\frac{\partial f_{\theta(\tau)}}{\partial z(\tau)} + \frac{\partial l}{\partial z(\tau)}\right]\frac{\delta z(\tau)}{\delta\theta(s)}\, d\tau + a^\top(s)\frac{\partial f_{\theta(s)}}{\partial \theta(s)}.$$
Hence, if for any $s \in \mathcal{S}$ the adjoint state $a(s)$ satisfies
$$\dot a^\top = -a^\top\frac{\partial f_{\theta(s)}}{\partial z} - \frac{\partial l}{\partial z}, \qquad a^\top(S) = \frac{\partial L}{\partial z(S)},$$
we have
$$\frac{\delta\ell}{\delta\theta(s)} = a^\top(s)\frac{\partial f_{\theta(s)}}{\partial \theta(s)}.$$

A.3 Proof of Corollary 1

Corollary 1 (Spectral Gradients). Under the assumptions of Theorem 1, if $\theta(s) = \sum_{j=1}^m \alpha_j \psi_j(s)$, then
$$\frac{d\ell}{d\alpha} = \int_0^S a^\top(\tau)\frac{\partial f_{\theta(\tau)}}{\partial \theta(\tau)}\,\psi(\tau)\, d\tau, \qquad \psi = (\psi_1, \dots, \psi_m).$$

Proof. The proof follows naturally from Theorem 1 by noticing that if $\theta(s)$ has some parametrization $\theta = \theta(s, \mu)$ with parameters $\mu \in \mathbb{R}^{n_\mu}$, then
$$\frac{d\ell}{d\mu} = \int_0^S a^\top(\tau)\frac{\partial f_\theta}{\partial \theta}\frac{\partial \theta}{\partial \mu}\, d\tau. \quad (16)$$
Therefore, if $\theta(s) = \sum_{j=1}^m \alpha_j \psi_j(s)$, the loss gradient with respect to the parameters $\alpha := (\alpha_1, \dots, \alpha_m) \in \mathbb{R}^{m n_\theta}$ is computed as
$$\frac{d\ell}{d\alpha} = \int_0^S a^\top(\tau)\frac{\partial f_{\theta(\tau)}}{\partial \theta(\tau)}\frac{\partial \theta(\tau)}{\partial \alpha}\, d\tau = \int_0^S a^\top(\tau)\frac{\partial f_{\theta(\tau)}}{\partial \theta(\tau)}\,\psi(\tau)\, d\tau,$$
with $\psi := (\psi_1, \dots, \psi_m)$.

Remark 2 (Choose your parametrization). A further insight from this result, which paves the way to future developments, is that we can easily compute the loss gradients with respect to any parametrization of $\theta(s)$ through (16).

A.4 Proof of Corollary 2

Corollary 2 (Stacked Gradients). Under the assumptions of Theorem 1, if $\theta(s) = \theta_i$ for all $s \in [s_i, s_{i+1}]$, then
$$\frac{d\ell}{d\theta_i} = -\int_{s_{i+1}}^{s_i} a^\top(\tau)\frac{\partial f_{\theta_i}}{\partial \theta_i}\, d\tau,$$
where $a(s)$ satisfies
$$\dot a^\top(s) = -a^\top(s)\frac{\partial f_{\theta_i}}{\partial z} - \frac{\partial l}{\partial z}, \quad s \in [s_i, s_{i+1}], \qquad a^\top(S) = \frac{\partial L}{\partial z(S)}.$$

Proof.
The proof follows from those of Proposition 1 and Theorem 1 by recalling the solution of the stacked Neural ODE:
$$z(S) = h_x(x) + \sum_{i=0}^{p-1}\int_{s_i}^{s_{i+1}} f_{\theta_i}(\tau, x, z(\tau))\, d\tau.$$
We can recover a relation similar to (12):
$$\frac{d\ell}{d\theta_i} = \left[\frac{\partial L}{\partial z(S)} - a^\top(S)\right]\frac{dz(S)}{d\theta_i} + \sum_{j=0}^{p-1}\int_{s_j}^{s_{j+1}}\left[\dot a^\top + a^\top\frac{\partial f_{\theta_j}}{\partial z} + \frac{\partial l}{\partial z}\right]\frac{dz}{d\theta_i}\, d\tau + \sum_{j=0}^{p-1}\int_{s_j}^{s_{j+1}} a^\top\frac{\partial f_{\theta_j}}{\partial \theta_i}\, d\tau.$$
Since, for all $j = 0, \dots, p-1$,
$$\frac{\partial f_{\theta_j}}{\partial \theta_i} \neq 0 \;\Leftrightarrow\; j = i,$$
we have
$$\sum_{j=0}^{p-1}\int_{s_j}^{s_{j+1}} a^\top\frac{\partial f_{\theta_j}}{\partial \theta_i}\, d\tau = \int_{s_i}^{s_{i+1}} a^\top\frac{\partial f_{\theta_i}}{\partial \theta_i}\, d\tau = -\int_{s_{i+1}}^{s_i} a^\top\frac{\partial f_{\theta_i}}{\partial \theta_i}\, d\tau,$$
which leads to the result by assuming $a(\tau)$ to satisfy
$$\dot a^\top(s) = -a^\top(s)\frac{\partial f_{\theta_i}}{\partial z} - \frac{\partial l}{\partial z}, \quad s \in [s_i, s_{i+1}], \qquad a^\top(S) = \frac{\partial L}{\partial z(S)}.$$

A.5 Proof of Proposition 2

Proposition 2. For all $\epsilon > 0$, $x \in \mathbb{R}$ there exists a parameter $\theta > 0$ such that
$$|\varphi(x) - z(1)| < \epsilon, \quad (8)$$
where $z(1)$ is the solution of the Neural ODE
$$\dot z(s) = -\theta(z(s) + x), \qquad z(0) = x, \qquad s \in [0, 1]. \quad (9)$$

Proof. The general solution of (9) is $z(s) = x(2e^{-\theta s} - 1)$. Thus, with $\varphi(x) = -x$, the error is
$$e = z(1) - \varphi(x) = x(2e^{-\theta} - 1) + x = 2xe^{-\theta} \;\Leftrightarrow\; |e| = 2|x|e^{-\theta}.$$
It follows that
$$2|x|e^{-\theta} < \epsilon \;\Leftrightarrow\; e^{-\theta} < \frac{\epsilon}{2|x|} \;\Leftrightarrow\; \theta > -\ln\left(\frac{\epsilon}{2|x|}\right).$$

A.6 Additional Theoretical Results

A.6.1 Explicit Parameter Dependence of the Loss

Note that, in both the seminal paper by Chen et al. (2018) and Theorem 1, the loss function was considered without explicit dependence on the parameters. However, in practical applications (see, e.g., (Finlay et al., 2020)) the loss has this explicit dependence:
$$\ell = L(z(S), \theta) + \int_0^S l(\tau, z(\tau), \theta)\, d\tau. \quad (17)$$
In this case we need to modify the adjoint gradients accordingly.

Theorem 2 (Generalized Adjoint Method with Parameter–Dependent Loss). Consider the loss function (17). Then,
$$\frac{d\ell}{d\theta} = \frac{\partial L}{\partial \theta} + \int_0^S \left[a^\top(\tau)\frac{\partial f_\theta}{\partial \theta} + \frac{\partial l}{\partial \theta}\right] d\tau,$$
where $a(s)$ satisfies (13).

Proof.
The proof follows immediately from that of Proposition 1 by noticing that, with the explicit dependence of $\ell$ on $\theta$, (12) becomes
$$\frac{d\ell}{d\theta} = \frac{\partial L}{\partial \theta} + \left[\frac{\partial L}{\partial z(S)} - a^\top(S)\right]\frac{dz(S)}{d\theta} + \int_0^S \left[\dot a^\top + a^\top\frac{\partial f_\theta}{\partial z} + \frac{\partial l}{\partial z}\right]\frac{dz}{d\theta}\, d\tau + \int_0^S \left[a^\top\frac{\partial f_\theta}{\partial \theta} + \frac{\partial l}{\partial \theta}\right] d\tau,$$
leading to the result. In the depth–variant case, where we might consider a loss function of the type
$$\ell = L(z(S), \theta(S)) + \int_0^S l(z(\tau), \theta(\tau))\, d\tau, \quad (18)$$
a similar result can be obtained for the infinite–dimensional adjoint.

A.6.2 Integration Bound Gradients

It is also possible to obtain the loss gradient with respect to the integration bound $S$.

Theorem 3 (Integration Bound Gradient). Consider the loss function (2). Then,
$$\frac{d\ell}{dS} = \frac{\partial L}{\partial z(S)} f_{\theta(S)}(S, x, z(S)) + l(z(S)).$$

Proof.
$$\frac{d\ell}{dS} = \frac{\partial L}{\partial z(S)}\frac{dz(S)}{dS} + \frac{d}{dS}\int_0^S l(z(\tau))\, d\tau = \frac{\partial L}{\partial z(S)}\frac{d}{dS}\left[h_x(x) + \int_0^S f_{\theta(\tau)}(\tau, x, z(\tau))\, d\tau\right] + \frac{d}{dS}\int_0^S l(z(\tau))\, d\tau.$$
Therefore, by applying the Leibniz integral rule we obtain
$$\frac{d\ell}{dS} = \frac{\partial L}{\partial z(S)} f_{\theta(S)}(S, x, z(S)) + l(z(S)).$$

B Practical Insights for Neural ODEs

B.1 Augmentation

Augmenting convolution and graph based architectures  In the case of convolutional neural network (CNN) or graph neural network (GNN) architectures, augmentation can be performed along different dimensions, i.e., channel, height and width, or, similarly, node features and number of nodes. The most physically consistent approach, employed in (Dupont et al., 2019) for CNNs, is augmenting along the channel dimension, equivalent to providing each pixel in the image with additional states. By viewing an image as a lattice graph, the generalization to GNN–based Neural ODEs (Poli et al., 2019) operating on arbitrary graphs can be achieved by augmenting each node feature with $n_a$ additional states.
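Channel augmentation of the kind described above can be sketched in a few lines; the snippet below is a minimal plain–Python stand–in for the tensor version (function and variable names are illustrative, not from the paper's codebase), appending $n_a$ zero–initialized channels to an image so that each pixel gains additional states.

```python
# Sketch of channel augmentation for image inputs: each pixel receives
# n_a extra zero-initialized states, leaving original content untouched.

def augment_channels(image, n_a):
    """image: list of channels, each an HxW nested list; returns c + n_a channels."""
    h, w = len(image[0]), len(image[0][0])
    zeros = [[[0.0] * w for _ in range(h)] for _ in range(n_a)]
    return image + zeros

img = [[[1.0, 2.0], [3.0, 4.0]]]      # 1 channel, 2x2 image
aug = augment_channels(img, 3)
assert len(aug) == 4                   # channel dimension grew by n_a = 3
assert aug[0] == img[0]                # original channel is unchanged
```

In a PyTorch implementation the same effect is obtained by concatenating a zero tensor along the channel axis before the first ODE block.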
Selective higher–order  A limitation of system (6) is that a naive extension to second order requires a number of augmented dimensions $n_a = n_z/2$. To allow for flexible augmentations of few dimensions $n_a < n_z/2$, the formulation of second–order Neural ODEs can be modified as follows. Let $z := (z_q, z_p, \bar z)$, with $z_q, z_p \in \mathbb{R}^{n_a/2}$ and $\bar z \in \mathbb{R}^{n_z - n_a}$. We can decide to give second–order dynamics only to the first $n_a$ states while the dynamics of the other $n_z - n_a$ states is free. Therefore, this approach yields
$$\begin{pmatrix}\dot z_q \\ \dot z_p \\ \dot{\bar z}\end{pmatrix} = \begin{pmatrix} z_p \\ f^p_{\theta(s)}(s, z) \\ \bar f_{\theta(s)}(s, z)\end{pmatrix}. \quad (19)$$
A similar argument can be applied to orders higher than two. Selective higher–order Neural ODEs are compatible with input–layer augmentation.

B.2 Activations

Mind your activation  We investigate the effects of appending an activation function to the last layer of $f_\theta$. The chosen nonlinearity will strongly affect the "shape" of the vector field and, as a consequence, the flows learnable by the model. Therefore, while designing $f_\theta$ as a multi–layer neural network, it is generally advisable to append a linear layer to maximize the expressiveness of the underlying vector field. In some applications, conditioning the vector field (and thus the flows) with a specific nonlinearity can be desirable, e.g., when there exist priors on the desired transformation, such as boundedness of the vector field.

Figure 8: Depth trajectories of the hidden state and relative vector fields $f_\theta(z)$ for different activation functions in a nonlinear classification task. It can be noticed how the models with tanh and ELU outperform the others, as $f_\theta$ is able to steer $z$ along negative directions.

Effects of activations  In order to compare the effect of different activation functions in the last layer of $f_\theta$, we set up a nonlinear classification task with the half–moons dataset.
For the sake of completeness, we selected activations of different types:

Activation                   Type
Hyperbolic tangent (tanh)    bounded
Sigmoid                      bounded, non–negative output
ReLU                         unbounded, non–negative output
Softplus                     unbounded, non–negative output
ELU                          lower–bounded

The dataset comprises $2^{13}$ data points. We utilize the entire dataset for training and evaluation since the experiment has the aim of delivering a qualitative description of the learned vector fields. $f_\theta$ has been selected as a multilayer perceptron with two hidden layers of 16 neurons each. The training has been carried out using the Adam optimizer (Kingma and Ba, 2014) with learning rate $10^{-3}$ and weight decay set to $10^{-4}$. In Figure 8 we see how different activation functions in the last layer of $f_\theta$ condition the vector fields and the depth evolution of the hidden state in the classification of nonlinearly separable data. It is worth noticing that the models with better performance are the ones with hyperbolic tangent (tanh) and ELU (Clevert et al., 2015), as the vector field can assume both positive and negative values and can thus "force" the hidden state in different directions. On the other hand, with sigmoid, ReLU or softplus (Zheng et al., 2015), the vector field is non–negative in all directions and thus has limited freedom. Further, Figure 9 shows how different activation functions shape the vector field and, as a result, the decision boundary.

B.3 Regularization for Stability

The concept of stability can be used to regularize Neural ODEs through a variety of additional terms or different formulations (Finlay et al., 2020; Massaroli et al., 2020). Finlay et al. (2020) propose minimizing a loss term
$$\ell_{\text{reg}} = \int_0^S \|f_{\theta(\tau)}(\tau, x, z(\tau))\|^2\, d\tau \quad (20)$$
to achieve stability.
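In practice, the integral term (20) is accumulated alongside the state during the forward solve. The sketch below illustrates this with a simple Euler rollout; the field `f` is a hypothetical toy stand–in for the learned network, and all names are illustrative rather than from any library.

```python
# Sketch of the kinetic regularization (20): l_reg = integral of ||f||^2 ds,
# accumulated while stepping the state forward with Euler.

def kinetic_reg(f, x, S=1.0, n=1000):
    h = S / n
    z = list(x)
    reg = 0.0
    for k in range(n):
        v = f(k * h, x, z)
        reg += h * sum(vi * vi for vi in v)        # accumulate ||f(s, x, z)||^2
        z = [zi + h * vi for zi, vi in zip(z, v)]  # Euler step of the state
    return reg

# A flow that is already static incurs zero kinetic energy:
f = lambda s, x, z: [0.0 for _ in z]
assert kinetic_reg(f, [1.0, -1.0]) == 0.0
```

In the adjoint formulation of Appendix A, this same term can instead be treated as an integral loss $l$, avoiding the extra augmented state used by the forward–accumulation approach.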
A simple alternative stabilizing regularization term can be considered at no significant additional computational cost:
$$\ell_{\text{reg}} = \left\|f_{\theta(S)}(S, x, z(S))\right\|^2, \quad (21)$$
which penalizes non–convergence to some fixed point of $f_\theta$ at $s = S$. The above can also be seen as a cheaper alternative to the kinetic energy regularization proposed in (Finlay et al., 2020).

B.4 Approximation Capabilities

Vanilla Neural ODEs are not, in general, universal function approximators (UFAs) (Zhang et al., 2019a). Besides some recent works on the topic (Zhang et al., 2019a; Li et al., 2019), this apparent limitation is still not well understood in the context of continuous–depth models. When Neural ODEs are employed as general–purpose black–box modules, some assurances on the approximation capabilities of the model are necessary. Let $n_z := n_x + 1$ and let $z := (z_x, z_a)$ with $z_x \in \mathbb{R}^{n_x}$, $z_a \in \mathbb{R}$. Zhang et al. (2019a) noticed that a depth–invariant augmented Neural ODE
$$\begin{pmatrix}\dot z_x \\ \dot z_a\end{pmatrix} = \begin{pmatrix}0_{n_x} \\ f_\theta(z_x)\end{pmatrix}, \qquad \begin{pmatrix}z_x(0) \\ z_a(0)\end{pmatrix} = \begin{pmatrix}x \\ 0\end{pmatrix}, \qquad s \in [0, 1], \quad (22)$$
where the output is picked as $\hat y := z_a(1)$, can approximate any function $\Psi: \mathbb{R}^{n_x} \to \mathbb{R}$ provided that the neural network $f_\theta(x)$ is an approximator of $\Psi$, since $z_a(1) = f_\theta(x)$, mimicking the mapping $x \mapsto f_\theta(x)$. Although this simple result is not sufficient to provide a constructive blueprint for the design of Neural ODE models, it suggests the following open questions:
• Why should we use a Neural ODE if its vector field can solve the approximation problem as a standalone neural network?
• Can Neural ODEs be UFAs with non–UFA vector fields?
On the other hand, if Neural ODEs are used for model discovery or observation of dynamical systems, requiring a UFA neural network to parametrize the model provides it with the ability to approximate arbitrary dynamical systems.
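The construction (22) can be checked numerically: since $\dot z_x = 0$, the state $z_x$ stays frozen at $x$, and integrating $\dot z_a = f_\theta(z_x)$ over $[0,1]$ yields exactly $z_a(1) = f_\theta(x)$. A minimal sketch, with a hypothetical toy function standing in for the neural vector field:

```python
# Numerical check of construction (22): z_x' = 0 keeps z_x = x constant, so
# Euler integration of z_a' = f(z_x) over s in [0, 1] returns f(x).

def augmented_node(f, x, n=1000):
    z_x, z_a = list(x), 0.0
    h = 1.0 / n
    for _ in range(n):
        z_a += h * f(z_x)      # z_x never changes, so each step adds h * f(x)
    return z_a

f = lambda v: 3.0 * v[0] - v[1]   # toy stand-in for f_theta
x = [2.0, 1.0]
assert abs(augmented_node(f, x) - f(x)) < 1e-9
```

This makes the first open question above concrete: the ODE merely replays the output of the standalone network $f_\theta$.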
B.5 Example Implementation of Data–Control

We report here a short PyTorch code snippet detailing the implementation of the simplest data–controlled Neural ODE variant, accompanied, for further accessibility, by a brief text description.

    class DC_DEFunc(nn.Module):
        """PyTorch implementation of data--controlled f_theta"""
        def __init__(self, f):
            super().__init__()
            self.f = f
            self.nfe = 0   # number of function evaluations
            self.x = None  # input data, assigned before integration starts

        def forward(self, s, z):
            """Forward is called by the ODE solver repeatedly"""
            self.nfe += 1
            # data-control step:
            # alternatives include embeddings of input data `x`, i.e., g(x),
            # or addition `x + z`
            z = torch.cat([z, self.x], 1)
            dz = self.f(z)
            return dz

where the initial condition x is passed to the model at the start of the integration at $s = 0$. The information contained therein is thus passed repeatedly to the function $f_\theta$, conditioning the dynamics. It should be noted that even in the case of concatenation of $x$ and $z(s)$, the above is not a form of augmentation, since the state itself is not given additional dimensions during forward propagation. In fact, the dynamics take the form of a function $f_\theta: \mathbb{R}^{n_x} \times \mathbb{R}^{n_z} \to \mathbb{R}^{n_z}$ instead of $f_\theta: \mathbb{R}^{n_a} \times \mathbb{R}^{n_x} \to \mathbb{R}^{n_a} \times \mathbb{R}^{n_x}$, as is the case for general first–order augmentation with $n_z = n_x + n_a$.

C Experimental Details

Computational resources  The experiments were carried out on a cluster of two NVIDIA Titan RTX GPUs with CUDA 10.1 and an Intel i9-10980XE CPU. All Neural ODEs were trained on GPU. The code was built upon PyTorch's torchdyn library for neural differential equations (Poli et al., 2020b).

General experimental setup  We report here general information about the experiments. All Neural ODEs are solved numerically via the Dormand–Prince method (Prince and Dormand, 1981). We refer to "concat" as the depth–variant Neural ODE variant where the depth variable $s$ is concatenated to $z(s)$, as done in (Chen et al., 2018).
Furthermore, we denote Galërkin Neural ODEs as GalNODEs for convenience.

Benchmark problems  Throughout the paper we extensively utilize the concentric annuli benchmark task introduced in (Dupont et al., 2019). Namely, given $r > 0$, define $\varphi: \mathbb{R}^n \to \mathbb{Z}$,
$$\varphi(x) = \begin{cases}-1 & \|x\|_2 < r \\ \phantom{-}1 & \|x\|_2 \geq r\end{cases}. \quad (23)$$

Figure 9: Decision boundaries learned by the vector field of a Neural ODE are directly conditioned by the choice of activation function.

We consider learning the map $\varphi(x)$ with Neural ODEs followed by a linear layer $\mathbb{R}^n \to \mathbb{R}$. Notice that $\varphi$ has been slightly modified with respect to (Dupont et al., 2019) to be well defined in its domain. For the one–dimensional case, we will often instead refer to the map $\varphi(x) = -x$ as the crossing trajectories problem. The optimization is carried out by minimizing mean squared error (MSE) losses between model outputs and the mapping $\varphi$.

C.1 Experiments of Section 3

Trajectory tracking  Consider the problem of tracking a periodic signal $\beta(s)$. We show how this can be achieved without introducing additional inductive biases such as (Greydanus et al., 2019), through a synergistic combination of a two–layer Galërkin Neural ODE and the generalized adjoint with integral loss $l(s) := \|\beta(s) - z(s)\|_2^2$. In particular, we construct a two–layer Galërkin Neural ODE with a Fourier series of $m = 2$ harmonics as the eigenfunctions. The training is carried out for 1000 epochs with learning rate $10^{-3}$. The practical implementation of the generalized adjoint necessary to distribute the loss across the depth domain is discussed in Appendix A. The models, trained in $s \in [0, 1]$, generalize accurately when tasked to perform long trajectory extrapolation of several seconds.

Depth–varying classification  We showcase how different discretization options of the functional optimization problem discussed in Sec. 3 affect the final dynamics of $\theta(s)$.
Namely, we consider a simple binary classification task on the nested spirals problem, training all models for 300 epochs with learning rate $5 \cdot 10^{-3}$. Galërkin Neural ODEs are equipped with a polynomial basis with $m = 10$. The figures in Sec. 3 reveal the different nature of $\theta(s)$ depending on model choice: the depth discretization of Stacked yields a flexible, though lower–resolution, form of $\theta(s)$, whereas spectral discretizations limit the functional form of $\theta(s)$ to the span of a chosen eigenbasis.

Mind your input network experiments  We tackle the concentric annuli task with a Neural ODE preceded by a simple two–layer neural network with 16 units and ReLU activation. The second layer is linear.

C.2 Experiments of Section 4

Image classification  We use AdamW with learning rate $10^{-3}$, batch size 64, weight decay $5 \cdot 10^{-4}$ and a learning rate step schedule with multiplicative factor $\gamma = 0.9$ every 5 epochs. We train each model for 20 epochs. The vector fields $f_\theta$ are parametrized by 3–layer depth–invariant CNNs, with each layer followed by an instance normalization layer. The choice of depth invariance is motivated by the discussion carried out in Section 5: both augmentation and depth variance can relieve approximation limitations of vanilla, depth–invariant Neural ODEs. As a result, including both renders the ablation study for augmentation strategies less accurate. We note that the results of this ablation analysis do not utilize any form of data augmentation; data augmentation can indeed be introduced to further improve performance. For input–layer augmented Neural ODE models, namely IL–NODE and 2nd order, we prepend to the Neural ODE a single, linear CNN layer. In the case of 2nd order models, we use input–layer augmentation for the positions and initialize the velocities at 0.
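The position/velocity split used by the 2nd order models can be sketched compactly: the network outputs only the velocity derivatives, while the position derivatives are read off as the current velocities. The toy field below is a hypothetical stand–in for the CNN, and all names are illustrative.

```python
# Sketch of a second-order state split: z = (positions, velocities) and
# dz/ds = (velocities, f_vel(z)), so the learned network is half-dimensional.

def second_order_field(z, f_vel):
    """z is a flat list (positions, velocities); returns dz/ds."""
    n = len(z) // 2
    return z[n:] + f_vel(z)

# Toy harmonic-oscillator-style velocity field: dv/ds = -q
f_vel = lambda z: [-q for q in z[: len(z) // 2]]
z = [1.0, 0.0, 0.0, 2.0]                 # positions (1, 0), velocities (0, 2)
assert second_order_field(z, f_vel) == [0.0, 2.0, -1.0, -0.0]
```

This is why, in the image experiments, the output of $f_\theta$ only needs $n_x/2$ channels: the other half of the derivative is free.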
The hidden channel dimension of the CNN parametrizing $f_\theta$ in augmented models is set to 32 on MNIST and 42 on CIFAR; vanilla Neural ODEs, on the other hand, are equipped with dimensions 42 and 62 for a fair comparison. The output class probabilities are then computed by mapping the output of the Neural ODE through average pooling followed by a linear layer.

Figure 10: Depth evolution over the learned vector fields of the standard models: depth–invariant ($f_\theta(z(s))$), depth–variant "concat" ($f_\theta(s, z(s))$) and GalNODE ($f_{\theta(s)}(s, z(s))$). As expected, the Neural ODE cannot approximate the map $\varphi(x) = -x$.

Second–order Neural ODEs, 2nd, use $f_\theta$ to compute the vector field of the velocities: therefore, the output of $f_\theta$ is $n_x/2$–dimensional, and the remaining $n_x/2$ outputs to concatenate (the vector field of the positions) are obtained as the last $n_x/2$ elements of $z$. We note that vanilla Neural ODEs are capable of convergence without any spikes in loss or NFEs. We speculate the numerical issues encountered in (Dupont et al., 2019) to be a consequence of the specific neural network architecture used to parametrize the vector field $f_\theta$, which employed an excessive number of channels inside $f_\theta$, i.e., 92.

C.3 Experiments of Section 5

Experiments on crossing trajectories  We trained both current state–of–the–art as well as proposed models to learn the map $\varphi(x) = -x$. We created a training dataset by sampling $x$ equally spaced in $[-1, 1]$. The models have been trained to minimize L1 losses using Adam (Kingma and Ba, 2014) with learning rate $10^{-3}$ and weight decay $10^{-5}$ for 1000 epochs using the whole batch. We trained vanilla Neural ODEs, i.e., both depth–invariant and depth–variant models ("concat" and GalNODE). As expected, these models cannot approximate $\varphi$.
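By contrast, Proposition 2 guarantees that the handcrafted data–controlled model (9) can approximate $\varphi(x) = -x$ to any desired accuracy by choosing $\theta$ above the stated bound. A quick numerical check of that bound, using the closed–form solution $z(1) = x(2e^{-\theta} - 1)$ from Appendix A.5:

```python
import math

# Check of Proposition 2: |phi(x) - z(1)| = 2|x|exp(-theta), so any
# theta > -ln(eps / (2|x|)) drives the error below eps.

def z_at_one(x, theta):
    return x * (2 * math.exp(-theta) - 1)

x, eps = 1.5, 1e-3
theta = -math.log(eps / (2 * abs(x))) + 0.1  # slightly above the bound
err = abs(-x - z_at_one(x, theta))
assert err < eps                              # below the target accuracy
assert abs(err - 2 * abs(x) * math.exp(-theta)) < 1e-12
```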
Both depth–invariant and concat have been selected with two hidden layers of 16 and 32 neurons each, respectively, and tanh activation. The GalNODE has been designed with one hidden layer of 32 neurons whose depth–varying weights were parametrized by a Fourier series of five modes. The resulting trajectories over the learned vector fields are shown in Fig. 10.

Data–controlled Neural ODEs. We evaluate both the handcrafted linear depth–invariant model (9) and the general formulation of data–controlled models (10), realized with two hidden layers of 32 neurons each and tanh activation in all layers but the output. Note that the loss of the handcrafted model turns out to be convex and continuously differentiable. Moreover, proof A.5 analytically provides a lower bound on the model parameter ensuring the loss to be upper–bounded by a desired ε, making its training superfluous. Nevertheless, we provide results with a trained version to show that the benefits of data–controlled Neural ODEs are compatible with gradient–based learning. The results are shown in Figs. 10 and 11. The input data information embedded into the vector field allows the Neural ODE to steer the hidden state towards the desired label through its continuous depth. Data–controlled Neural ODEs can be used to learn challenging maps (Dupont et al., 2019) without augmentation.

Concentric annuli with non–augmented variants. We train each model for 1024 iterations using AdamW with learning rate 10⁻³, weight decay 10⁻⁶ and batch size 1024. All models have a single hidden layer of dimension 32. The GalNODE layer is parametrized by a Fourier series of five modes.

Conditional continuous normalizing flows. We train data–controlled continuous normalizing flows for 2000 iterations with samples of size 2¹⁴. We use AdamW with learning rate 10⁻³ and weight decay 10⁻⁷. Absolute and relative tolerances of the chosen solver, dopri5, are set to 10⁻⁸.
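A hedged sketch of the general data–controlled formulation (10): the vector field is conditioned on the input data x at every depth, here realized by concatenating x to the hidden state before the MLP. The layer sizes follow the text (two hidden layers of 32 neurons, tanh everywhere but the output); the concatenation-based conditioning is an assumption about how (10) is realized:

```python
import torch
import torch.nn as nn

class DataControlledField(nn.Module):
    """Sketch of a data-controlled vector field f_theta(z, x).

    The input datum x is injected into the dynamics at every depth s,
    allowing trajectories started from different inputs to follow
    different vector fields and thus approximate maps such as
    phi(x) = -x, which vanilla Neural ODEs cannot represent.
    """
    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, dim),  # linear output layer
        )

    def forward(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, x], dim=-1))
```

During integration, x stays fixed to the initial datum while z(s) evolves, which is what lets the model steer each hidden state towards its label.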
The CNF network has 2 hidden layers of dimension 128 with softplus nonlinearities.

Adaptive depth Neural ODEs. The experiments have been carried out with a depth–variant Neural ODE in “concat” style where f was parametrized by a neural network with two hidden layers of 8 units and tanh activation. Moreover, the function g_ω(x) computing the data–adaptive depth of the Neural ODE was composed of a neural network with one hidden layer (8 neurons and ReLU activation) whose output is summed to one and then taken in absolute value,

g_ω(x) = |1 + w_o⊤ σ(w_i x + b_i) + b_o|,

where σ is the ReLU activation, w_o, w_i, b_i ∈ R⁸ and ω = (w_o, b_o, w_i, b_i). In particular, the summation to one has been employed to help the network “sparsify” the learned integration depths and avoid highly stiff vector fields, while the absolute value is needed to avoid infeasible integration intervals. The training results can be visualized in Fig. 12. This early result should be intended as a proof of concept rather than a definitive evaluation of the depth adaptation methods, which we reserve for future work. We note that the result of Fig. 5 shown in the main text has been obtained by training the model only on x ∈ {−1, 1} and manually setting s*_{−1} = 1, s*_1 = 3.

Figure 11: Depth evolution over the learned vector fields of (9) and a data–controlled Neural ODE. As discussed in Sec. 5, introducing data–control allows the model to approximate the map ϕ(x) = −x.

Figure 12: Evolution of the input data through the depth of the Neural ODEs.
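The adaptive-depth function above translates directly into code; this is a minimal sketch under the stated architecture (8 ReLU hidden units, scalar output), with the module layout an assumption:

```python
import torch
import torch.nn as nn

class AdaptiveDepth(nn.Module):
    """Sketch of g_omega(x) = | 1 + w_o^T relu(w_i x + b_i) + b_o |.

    The `1 +` term biases the learned integration depths away from
    zero (helping "sparsify" them, per the text), while the absolute
    value guarantees a feasible, nonnegative integration interval.
    """
    def __init__(self, in_dim: int = 1, hidden: int = 8):
        super().__init__()
        self.inner = nn.Linear(in_dim, hidden)  # w_i x + b_i
        self.outer = nn.Linear(hidden, 1)       # w_o^T (.) + b_o

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.abs(1.0 + self.outer(torch.relu(self.inner(x))))
```

The returned scalar g_ω(x) would then serve as the per-sample terminal depth s* of the ODE integration.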
