Neural networks-based backward scheme for fully nonlinear PDEs
Huyên Pham†, Xavier Warin‡, Maximilien Germain§

This version: December 7, 2020

Abstract

We propose a numerical method for solving high-dimensional fully nonlinear partial differential equations (PDEs). Our algorithm estimates simultaneously, by backward time induction, the solution and its gradient by multi-layer neural networks, while the Hessian is approximated by automatic differentiation of the gradient computed at the previous step. This methodology extends to the fully nonlinear case the approach recently proposed in [HPW20] for semi-linear PDEs. Numerical tests illustrate the performance and accuracy of our method on several examples in high dimension with nonlinearity on the Hessian term, including a linear quadratic control problem with control on the diffusion coefficient, the Monge-Ampère equation, and a Hamilton-Jacobi-Bellman equation in portfolio optimization.

Key words: Neural networks, fully nonlinear PDEs in high dimension, backward scheme.

MSC Classification: 60H35, 65C20, 65M12.

1 Introduction

This paper is devoted to the resolution in high dimension of fully nonlinear parabolic partial differential equations (PDEs) of the form

$$\begin{cases} \partial_t u + f(\cdot,\cdot,u,D_x u,D_x^2 u) = 0, & \text{on } [0,T)\times\mathbb{R}^d, \\ u(T,\cdot) = g, & \text{on } \mathbb{R}^d, \end{cases} \tag{1.1}$$

with a nonlinearity in the solution, its gradient $D_x u$ and its Hessian $D_x^2 u$ via the function $f(t,x,y,z,\gamma)$ defined on $[0,T]\times\mathbb{R}^d\times\mathbb{R}\times\mathbb{R}^d\times\mathbb{S}^d$ (where $\mathbb{S}^d$ is the set of symmetric $d\times d$ matrices), and a terminal condition $g$. The numerical resolution of this class of PDEs is far more difficult than that of classical semi-linear PDEs, where the nonlinear function $f$ does not depend on $\gamma$. In fact, rather few methods are available to solve fully nonlinear equations, even in moderate dimension.
• First, based on the work of [Che+07], an effective scheme developed in [FTW11] using regression techniques has been shown to be convergent under some ellipticity conditions, later removed by [Tan13]. Due to the use of basis functions, this scheme does not allow solving PDEs in dimension greater than 5.

• A scheme based on nested Monte Carlo has recently been proposed in [War18]. It seems to be effective in very high dimension for maturities $T$ that are not too long and nonlinearities that are not too strong.

• A numerical algorithm to solve fully nonlinear equations has been proposed by [BEJ19], based on the second order backward stochastic differential equation (2BSDE) representation of [Che+07] and global deep neural networks minimizing a terminal objective function, but no test on a truly fully nonlinear case is given. This extends the idea introduced in the pioneering papers [EHJ17; HJE18], which were the first serious works using machine learning methods to solve high-dimensional PDEs.

• The Deep Galerkin method proposed in [SS18], based on machine learning techniques and automatic differentiation of the solution, seems to be effective in some cases. It has been tested in [AA+18], for example on the Merton problem.

∗ This work is supported by FiME, Laboratoire de Finance des Marchés de l'Energie, and the "Finance and Sustainable Development" EDF - CACIB Chair.
† LPSM, Université de Paris, CREST-ENSAE & FiME, pham at lpsm.paris
‡ EDF R&D & FiME, xavier.warin at edf.fr
§ EDF R&D, LPSM, Université de Paris, mgermain at lpsm.paris

In this article, we introduce a numerical method based on machine learning techniques and backward-in-time iterations, which extends the schemes proposed in [VSS18] for linear problems and in the recent work [HPW20] for semi-linear PDEs.
The approach in these works consists in estimating simultaneously the solution and its gradient by multi-layer neural networks, by minimizing a sequence of loss functions defined by backward induction. A basic idea to extend this method to the fully nonlinear case would rely on the representation proposed in [Che+07]: at each time step $t_n$ of an Euler scheme, the Hessian $D_x^2 u$ at $t_n$ is approximated by a neural network minimizing some local $L^2$ criterion associated to a BSDE involving $D_x u$ at date $t_{n+1}$ and $D_x^2 u$. Then, the pair $(u, D_x u)$ at date $t_n$ is approximated/learned with a second minimization, similarly to the method described in [HPW20]. The first minimization can be implemented with different variations, but numerical results show that the global scheme does not scale well with the dimension: instability in the computation of $D_x^2 u$ rapidly propagates during the backward resolution. Besides, the methodology appears to be costly, as it uses two optimizations at each time step. The alternative approach that we develop here combines the ideas of [HPW20] and the splitting method of [Bec+19] in order to derive a new deep learning scheme that requires only one local optimization during the backward resolution for learning the pair $(u, D_x u)$, while approximating $D_x^2 u$ by automatic differentiation of the gradient computed at the previous step.

The outline of the paper is organized as follows. In Section 2, we briefly recall the mathematical description of the classical feedforward approximation, and then derive the proposed neural networks-based backward scheme. We test our method in Section 3 on various examples. First, we illustrate our results with a PDE involving a nonlinearity of the type $u D_x^2 u$. Then, we consider a stochastic linear quadratic problem with controlled volatility, for which an analytic solution is available, and we test the performance and accuracy of our algorithm up to dimension 20.
Next, we apply our algorithm to a Monge-Ampère equation, and finally, we provide numerical tests for the solution to a fully nonlinear Hamilton-Jacobi-Bellman equation, with nonlinearities of the form $|D_x u|^2 / D_x^2 u$, arising in a portfolio selection problem with stochastic volatilities.

2 The proposed deep backward scheme

Our aim is to numerically approximate the function $u : [0,T]\times\mathbb{R}^d \mapsto \mathbb{R}$, assumed to be the unique smooth solution to the fully nonlinear PDE (1.1) under suitable conditions. This will be achieved by means of neural network approximations of $u$ and its gradient $D_x u$, relying on a backward scheme and training on simulated data of some forward diffusion process. Approximations of PDEs in high dimension by neural networks have now become quite popular, and are supported theoretically by recent results in [Hut+18] and [DLM20] showing their efficiency in overcoming the curse of dimensionality.

2.1 Feedforward neural networks to approximate functions

We denote by $d_0$ the dimension of the input variables, and by $d_1$ the dimension of the output variable. A (deep) neural network is characterized by a number of layers $L+1 \in \mathbb{N}\setminus\{1,2\}$, with $m_\ell$, $\ell = 0,\ldots,L$, the number of neurons (units or nodes) on each layer: the first layer is the input layer with $m_0 = d_0$, the last layer is the output layer with $m_L = d_1$, and the $L-1$ layers in between are called hidden layers, where we choose for simplicity the same dimension $m_\ell = m$, $\ell = 1,\ldots,L-1$.

A feedforward neural network is a function from $\mathbb{R}^{d_0}$ to $\mathbb{R}^{d_1}$ defined as the composition

$$x \in \mathbb{R}^{d_0} \longmapsto A_L \circ \varrho \circ A_{L-1} \circ \ldots \circ \varrho \circ A_1(x) \in \mathbb{R}^{d_1}. \tag{2.1}$$

Here $A_\ell$, $\ell = 1,\ldots,L$, are affine transformations: $A_1$ maps from $\mathbb{R}^{d_0}$ to $\mathbb{R}^m$, $A_2, \ldots,$
$A_{L-1}$ map from $\mathbb{R}^m$ to $\mathbb{R}^m$, and $A_L$ maps from $\mathbb{R}^m$ to $\mathbb{R}^{d_1}$, represented by

$$A_\ell(x) = \mathcal{W}_\ell x + \beta_\ell,$$

for a matrix $\mathcal{W}_\ell$ called weight and a vector $\beta_\ell$ called bias term; $\varrho : \mathbb{R} \to \mathbb{R}$ is a nonlinear function, called activation function, applied component-wise on the outputs of $A_\ell$, i.e., $\varrho(x_1,\ldots,x_m) = (\varrho(x_1),\ldots,\varrho(x_m))$. Standard examples of activation functions are the sigmoid, ReLU, ELU, and tanh. All these matrices $\mathcal{W}_\ell$ and vectors $\beta_\ell$, $\ell = 1,\ldots,L$, are the parameters of the neural network, and can be identified with an element $\theta \in \mathbb{R}^{N_m}$, where $N_m = \sum_{\ell=0}^{L-1} m_\ell(1+m_{\ell+1}) = d_0(1+m) + m(1+m)(L-2) + m(1+d_1)$ is the number of parameters. We denote by $\mathcal{N}_{d_0,d_1,L,m}$ the set of all functions generated by (2.1) for $\theta \in \mathbb{R}^{N_m}$.

2.2 Forward-backward representation

Let us introduce a forward diffusion process

$$X_t = X_0 + \int_0^t \mu(s,X_s)\,ds + \int_0^t \sigma(s,X_s)\,dW_s, \quad 0 \le t \le T, \tag{2.2}$$

where $\mu$ is a function defined on $[0,T]\times\mathbb{R}^d$ with values in $\mathbb{R}^d$, $\sigma$ is a function defined on $[0,T]\times\mathbb{R}^d$ with values in $\mathbb{M}^d$, the set of $d\times d$ matrices, and $W$ is a $d$-dimensional Brownian motion on some probability space $(\Omega,\mathcal{F},\mathbb{P})$ equipped with a filtration $\mathbb{F} = (\mathcal{F}_t)_{0\le t\le T}$ satisfying the usual conditions. The process $X$ will be used for the simulation of training data in our deep learning algorithm; we shall discuss later the choice of the drift and diffusion coefficients $\mu$ and $\sigma$, see Remark 2.3.

Let us next denote by $(Y,Z,\Gamma)$ the triple of $\mathbb{F}$-adapted processes valued in $\mathbb{R}\times\mathbb{R}^d\times\mathbb{S}^d$, defined by

$$Y_t = u(t,X_t), \quad Z_t = D_x u(t,X_t), \quad \Gamma_t = D_x^2 u(t,X_t), \quad 0 \le t \le T. \tag{2.3}$$

By Itô's formula applied to $u(t,X_t)$, and since $u$ is a solution to (1.1), we see that $(Y,Z,\Gamma)$ satisfies the backward equation

$$Y_t = g(X_T) - \int_t^T \Big[\mu(s,X_s)\cdot Z_s + \frac{1}{2}\mathrm{tr}\big(\sigma\sigma^{\top}(s,X_s)\Gamma_s\big) - f(s,X_s,Y_s,Z_s,\Gamma_s)\Big] ds - \int_t^T \sigma^{\top}(s,X_s) Z_s \cdot dW_s, \quad 0 \le t \le T. \tag{2.4}$$

Remark 2.1 This BSDE does not uniquely characterize a triple $(Y,Z,\Gamma)$, contrarily to the semilinear case (without nonlinearity with respect to $\Gamma$), in which proper assumptions on the equation coefficients provide existence and uniqueness of a solution pair $(Y,Z)$. In the present case, at least two options can be used to estimate the $\Gamma$ component:

• Rely on the 2BSDE representation of [Che+07], which extends the probabilistic representation of [PP90] for semilinear equations to the fully nonlinear case. This is the approach used by [BEJ19], with a global large minimization problem, as in [HJE18].

• Compute the second order derivative by automatic differentiation. This is the point of view we adopt in this paper, together with a local approach solving several small optimization problems. In this way, we provide an extension of [HPW20] covering a broader range of nonlinear PDEs.

2.3 Algorithm

We now provide a numerical approximation of the forward-backward system (2.2)-(2.4), and consequently of the solution $u$ (as well as its gradient $D_x u$) to the PDE (1.1). We start from a time grid $\pi = \{t_i,\ i = 0,\ldots,N\}$ of $[0,T]$, with $t_0 = 0 < t_1 < \ldots < t_N = T$, and time steps $\Delta t_i := t_{i+1} - t_i$, $i = 0,\ldots,N-1$. The time discretization of the forward process $X$ on $\pi$ is either exact (typically when $\mu$ and $\sigma$ are constant) or approximated by an Euler scheme:

$$X_{t_{i+1}} = X_{t_i} + \mu(t_i,X_{t_i})\Delta t_i + \sigma(t_i,X_{t_i})\Delta W_{t_i}, \quad i = 0,\ldots,N-1,$$

where we set $\Delta W_{t_i} := W_{t_{i+1}} - W_{t_i}$ (by misuse of notation, we keep the same notation $X$ for the continuous-time diffusion process and its Euler scheme). The backward SDE (2.4) is approximated by the time-discretized scheme

$$Y_{t_i} \simeq Y_{t_{i+1}} - \Big[\mu(t_i,X_{t_i})\cdot Z_{t_i} + \frac{1}{2}\mathrm{tr}\big(\sigma\sigma^{\top}(t_i,X_{t_i})\Gamma_{t_i}\big) - f(t_i,X_{t_i},Y_{t_i},Z_{t_i},\Gamma_{t_i})\Big]\Delta t_i - \sigma^{\top}(t_i,X_{t_i}) Z_{t_i}\cdot\Delta W_{t_i},$$

which is written in forward form as

$$Y_{t_{i+1}} \simeq F(t_i,X_{t_i},Y_{t_i},Z_{t_i},\Gamma_{t_i},\Delta t_i,\Delta W_{t_i}), \quad i = 0,\ldots,N-1, \tag{2.5}$$

with

$$F(t,x,y,z,\gamma,h,\Delta) := y - \tilde f(t,x,y,z,\gamma)\,h + z^{\top}\sigma(t,x)\Delta, \tag{2.6}$$

$$\tilde f(t,x,y,z,\gamma) := f(t,x,y,z,\gamma) - \mu(t,x)\cdot z - \frac{1}{2}\mathrm{tr}\big(\sigma\sigma^{\top}(t,x)\gamma\big).$$

The idea of the proposed scheme is the following. As in [HPW20], we approximate at each time $t_i$ the functions $u(t_i,\cdot)$ and $D_x u(t_i,\cdot)$ by neural networks $x \in \mathbb{R}^d \mapsto (\mathcal{U}_i(x;\theta), \mathcal{Z}_i(x;\theta))$ with parameter $\theta$, learned optimally by backward induction: suppose that $\hat{\mathcal{U}}_{i+1} := \mathcal{U}_{i+1}(\cdot;\theta^*_{i+1})$, $\hat{\mathcal{Z}}_{i+1} := \mathcal{Z}_{i+1}(\cdot;\theta^*_{i+1})$ is an approximation of $u(t_{i+1},\cdot)$ and $D_x u(t_{i+1},\cdot)$ at time $t_{i+1}$; then $\theta^*_i$ is computed from the minimization of the quadratic loss function

$$\hat L_i(\theta) = \mathbb{E}\Big|\hat{\mathcal{U}}_{i+1}(X_{t_{i+1}}) - F\big(t_i,X_{t_i},\mathcal{U}_i(X_{t_i};\theta),\mathcal{Z}_i(X_{t_i};\theta), D\hat{\mathcal{Z}}_{i+1}(\mathcal{T}(X_{t_{i+1}})),\Delta t_i,\Delta W_{t_i}\big)\Big|^2,$$

where $\mathcal{T}$ is a truncation operator such that $\mathcal{T}(X)$ is bounded, for example by a quantile of the diffusion process, and $D\hat{\mathcal{Z}}_{i+1}$ stands for the automatic differentiation of $\hat{\mathcal{Z}}_{i+1}$. The idea behind the truncation is the following. During a one-step resolution, the estimation of the gradient is less accurate at the edge of the explored domain, where samples are rarely generated. Differentiating the gradient then gives a highly oscillating Hessian at the edge of the domain. At the following time step, these oscillations propagate to the gradient and the solution, even if the domain where the oscillations occur is rarely attained. The truncation prevents these oscillations of the neural network, fitted in zones that the simulations reach only scarcely, from propagating to areas of importance. This truncation may be necessary to obtain convergence in some rather difficult cases.
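The one-step transition (2.5)-(2.6) is straightforward to implement. The following NumPy sketch (function names and signatures are ours, chosen for illustration; the generator $f$, drift $\mu$ and diffusion $\sigma$ are passed as callables) computes the modified generator $\tilde f$ and the operator $F$:

```python
import numpy as np

def f_tilde(f, mu, sigma, t, x, y, z, gamma):
    """Modified generator of (2.6):
    f_tilde = f(t,x,y,z,gamma) - mu(t,x).z - 0.5*tr(sigma sigma^T gamma)."""
    a = sigma(t, x) @ sigma(t, x).T
    return f(t, x, y, z, gamma) - mu(t, x) @ z - 0.5 * np.trace(a @ gamma)

def F(f, mu, sigma, t, x, y, z, gamma, h, dW):
    """Transition operator of (2.6): F = y - f_tilde*h + z^T sigma(t,x) dW."""
    return y - f_tilde(f, mu, sigma, t, x, y, z, gamma) * h + z @ (sigma(t, x) @ dW)
```

For instance, with $f \equiv 0$, $\mu = 0$ and $\sigma = I_d$, the operator reduces to $y + \frac{1}{2}\mathrm{tr}(\gamma)h + z\cdot\Delta W$, the Euler step of the heat-equation BSDE.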
Of course, this truncation is only valid if the true Hessian does not vary too much. The intuition for the relevance of this scheme to the approximation of the PDE (1.1) is the following. From (2.3) and (2.5), the solution $u$ to (1.1) should approximately satisfy

$$u(t_{i+1},X_{t_{i+1}}) \simeq F\big(t_i,X_{t_i},u(t_i,X_{t_i}),D_x u(t_i,X_{t_i}),D_x^2 u(t_i,X_{t_i}),\Delta t_i,\Delta W_{t_i}\big).$$

Suppose that at time $t_{i+1}$, $\hat{\mathcal{U}}_{i+1}$ is an estimation of $u(t_{i+1},\cdot)$. Recalling the expression of $F$ in (2.6), the quadratic loss function at time $t_i$ is then approximately equal to

$$\hat L_i(\theta) \simeq \mathbb{E}\Big| u(t_i,X_{t_i}) - \mathcal{U}_i(X_{t_i};\theta) + \big(D_x u(t_i,X_{t_i}) - \mathcal{Z}_i(X_{t_i};\theta)\big)^{\top}\sigma(t_i,X_{t_i})\Delta W_{t_i} - \Delta t_i\Big(\tilde f\big(t_i,X_{t_i},u(t_i,X_{t_i}),D_x u(t_i,X_{t_i}),D_x^2 u(t_i,X_{t_i})\big) - \tilde f\big(t_i,X_{t_i},\mathcal{U}_i(X_{t_i};\theta),\mathcal{Z}_i(X_{t_i};\theta),D\hat{\mathcal{Z}}_{i+1}(\mathcal{T}(X_{t_{i+1}}))\big)\Big)\Big|^2.$$

By assuming that $\tilde f$ has small nonlinearities in its arguments $(y,z,\gamma)$, say Lipschitz, possibly with a suitable choice of $\mu$, $\sigma$, the loss function is thus approximately equal to

$$\hat L_i(\theta) \simeq (1 + O(\Delta t_i))\,\mathbb{E}\big|u(t_i,X_{t_i}) - \mathcal{U}_i(X_{t_i};\theta)\big|^2 + O(\Delta t_i)\,\mathbb{E}\big|D_x u(t_i,X_{t_i}) - \mathcal{Z}_i(X_{t_i};\theta)\big|^2 + O(|\Delta t_i|^2).$$

Therefore, by minimizing this quadratic loss function over $\theta$, via stochastic gradient descent (SGD) based on simulations of $(X_{t_i}, X_{t_{i+1}}, \Delta W_{t_i})$ (called training data in the machine learning language), one expects the neural networks $\mathcal{U}_i$ and $\mathcal{Z}_i$ to learn/approximate better and better the functions $u(t_i,\cdot)$ and $D_x u(t_i,\cdot)$, in view of the universal approximation theorem for neural networks. The rigorous convergence analysis of this algorithm is postponed to future work. To sum up, the global algorithm is given in Algorithm 1, in the case where $g$ is Lipschitz and its derivative can be calculated analytically almost everywhere.
If the derivative of $g$ is not available, it can be calculated by automatic differentiation of a neural network approximation of $g$.

Algorithm 1 Algorithm for fully nonlinear equations.
1: Use a single deep neural network $(\mathcal{U}_N(\cdot;\theta), \mathcal{Z}_N(\cdot;\theta)) \in \mathcal{N}_{d,1+d,L,m}$ and minimize (by SGD)
$$\hat L_N(\theta) := \mathbb{E}\big|\mathcal{U}_N(X_{t_N};\theta) - g(X_{t_N})\big|^2 + \frac{\Delta t_{N-1}}{d}\,\mathbb{E}\big|\mathcal{Z}_N(X_{t_N};\theta) - Dg(X_{t_N})\big|^2, \qquad \theta^*_N \in \arg\min_{\theta\in\mathbb{R}^{N_m}} \hat L_N(\theta).$$
2: Set $\hat{\mathcal{U}}_N = \mathcal{U}_N(\cdot;\theta^*_N)$ and $\hat{\mathcal{Z}}_N = \mathcal{Z}_N(\cdot;\theta^*_N)$.
3: for $i = N-1,\ldots,0$ do
4: Use a single deep neural network $(\mathcal{U}_i(\cdot;\theta), \mathcal{Z}_i(\cdot;\theta)) \in \mathcal{N}_{d,1+d,L,m}$ for the approximation of $(u(t_i,\cdot), D_x u(t_i,\cdot))$, and compute (by SGD) the minimizer of the expected quadratic loss function
$$\hat L_i(\theta) := \mathbb{E}\big|\hat{\mathcal{U}}_{i+1}(X_{t_{i+1}}) - F\big(t_i,X_{t_i},\mathcal{U}_i(X_{t_i};\theta),\mathcal{Z}_i(X_{t_i};\theta),D\hat{\mathcal{Z}}_{i+1}(\mathcal{T}(X_{t_{i+1}})),\Delta t_i,\Delta W_{t_i}\big)\big|^2, \qquad \theta^*_i \in \arg\min_{\theta\in\mathbb{R}^{N_m}} \hat L_i(\theta). \tag{2.7}$$
5: Update: $\hat{\mathcal{U}}_i = \mathcal{U}_i(\cdot;\theta^*_i)$ and $\hat{\mathcal{Z}}_i = \mathcal{Z}_i(\cdot;\theta^*_i)$.

Remark 2.2 Several alternatives can be implemented for the computation of the second order derivative. A natural candidate would consist in approximating the solution $u$ at time $t_i$ by a neural network $\mathcal{U}_i$ and estimating $\Gamma_i$ as the iterated automatic differentiation $D_x^2\mathcal{U}_i$. However, it is shown in [HPW20] that choosing only a single neural network for $u$ and using its automatic derivative to estimate the $Z$ component degrades the error in comparison with the choice of two neural networks $\mathcal{U}$, $\mathcal{Z}$. A similar behavior has been observed during our tests for this second order case, and the most efficient choice was to compute the derivative of the $\mathcal{Z}$ network. This derivative could also be estimated at the current time step $t_i$ instead of $t_{i+1}$.
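The key ingredient of the backward induction is the Hessian approximation $\Gamma \approx D\hat{\mathcal{Z}}_{i+1}$, obtained in the paper by TensorFlow automatic differentiation of the $\mathcal{Z}$ network. As an illustration only, the sketch below (our own helper, not the paper's code) differentiates a given gradient map by central finite differences, a stand-in for autodiff, and symmetrizes the result:

```python
import numpy as np

def hessian_from_gradient(zhat, x, eps=1e-5):
    """Approximate Gamma = D zhat(x), where zhat: R^d -> R^d plays the role
    of the gradient network Z-hat.  Central finite differences are used here
    as a stand-in for automatic differentiation."""
    d = x.shape[0]
    jac = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        jac[:, j] = (zhat(x + e) - zhat(x - e)) / (2.0 * eps)
    return 0.5 * (jac + jac.T)  # symmetrize: Gamma approximates a Hessian
```

For example, if zhat is the linear map $x \mapsto Ax$ with $A$ symmetric (the gradient of $\frac{1}{2}x^{\top}Ax$), the recovered $\Gamma$ is $A$ up to rounding error.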
However, this method leads to an additional cost in the training of the neural networks, by complicating the computation of the automatic gradients performed by TensorFlow during backpropagation. It also leads numerically to worse results on the control estimation, as empirically observed in Table 5 and described in the related paragraph "Comparison with an implicit version of the scheme". For these reasons, we decided to apply a splitting method and evaluate the Hessian at time $t_{i+1}$.

Remark 2.3 The diffusion process $X$ provides the training simulations in the stochastic gradient descent method for finding the minimizer of the quadratic loss function in (2.7), where the expectation is replaced by an empirical average in the numerical implementation. The choice of the drift and diffusion parameters is explained in Section 3.1.

3 Numerical results

We first construct an example with different nonlinearities in the Hessian term and the solution. We graphically show that the solution is very well calculated in dimension $d = 1$, and then move to higher dimensions. We then use an example derived from a stochastic optimization problem with an analytic solution, and show that we are able to calculate the solution accurately. Next, we consider the numerical resolution of the Monge-Ampère equation, and finally give some tests for a fully nonlinear Hamilton-Jacobi-Bellman equation arising from portfolio optimization with stochastic volatilities.

3.1 Choice of the algorithm hyperparameters

We describe in this paragraph how we choose the various hyperparameters of the algorithm, and explain the learning strategy.
• Parameters of the training simulations: the choice of the drift coefficient is typically related to the underlying probabilistic problem associated with the PDE (for example a stochastic control problem), and should drive the training process to regions of interest, e.g., regions visited with large probability by the optimal state process in stochastic control. In practice, we can take a drift function $\mu(\cdot)$ equal to the drift associated with some a priori control. This control could be an optimal control for a related problem for which we know the solution, or the control obtained at the first iteration of the algorithm. The choice of the diffusion coefficient $\sigma$ is also important: a large $\sigma$ induces a better exploration of the state space, but, as we will see in most of the examples below, it yields a scheme that converges slowly to the solution with respect to the time discretization, and it generates a higher variance in the results. Moreover, for applications in stochastic control, we might explore regions that are visited with very small probability by the optimal state process, and hence are of little interest. On the other hand, a small $\sigma$ means weak exploration, and we might lack information and precision in some regions of the state space: the solution calculated at each time step is far more sensitive to very local errors induced by the neural network approximation, and tends to generate a bias. Therefore, a trade-off has to be found between rather high variance with slow convergence in time, and fast convergence in time with a potential bias. We also refer to [NR20] for a discussion of the role of the diffusion coefficient. In practice, for the numerical examples in the next section, we test the scheme for different values of $\sigma$ while varying the number of time steps; if it converges to the same solution, one can consider that we have obtained the correct solution.
We also show the impact of the choice of the diffusion coefficient $\sigma$.

• Parameters of truncation: given the training simulations $X$, we choose a truncation operator $\mathcal{T}_p$ indexed by a parameter $p$ close to 1, such that $\mathcal{T}_p(X_t)$ corresponds to a truncation of $X_t$ at a given quantile $\phi_p$. In the numerical tests, we vary $p$ between 0.95 and 0.999.

• Parameters of the optimization algorithm over neural networks: in the whole numerical part, we use a classical feedforward network with layers of $m$ neurons each and a tanh activation function; the output layer uses an identity activation function. At each time step, the resolution of equation (2.7) is achieved using mini-batches of 1000 training trajectories. The training and learning rate adaptation procedure is the following:

• Every 40 inner gradient descent iterations, the loss is checked on 10000 validation trajectories.
• This optimization sequence is repeated with 200 outer iterations for the first optimization step at date $t_N = T$, and only 100 outer iterations at the dates $t_i$ with $i < N$.
• An average of the loss over 10 successive outer iterations is computed. If the decrease of this average loss every 10 outer iterations is less than 5%, then the learning rate is divided by 2.

The optimization is performed using the Adam gradient descent algorithm, see [KB14]. Notice that adapting the learning rate is not common with the Adam method, but in our case it appears to be crucial to obtain a steady decrease of the objective loss function. The procedure is also described in [CWNMW19], and the chosen parameters are similar to that article. At the initial optimization step at time $t_N = T$, the learning rate is taken equal to $10^{-2}$, and at the following optimization steps, we start with a learning rate equal to $10^{-3}$.
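The learning-rate adaptation just described can be sketched as follows (class and attribute names are ours, not from the paper's code): validation losses are averaged over blocks of 10 outer iterations, and the learning rate is halved whenever the averaged loss fails to decrease by at least 5% from one block to the next.

```python
class LearningRateController:
    """Sketch of the learning-rate adaptation (hypothetical helper).
    Losses are averaged over blocks of `window` outer iterations; if the
    average does not decrease by at least `threshold`, lr is halved."""

    def __init__(self, lr=1e-2, window=10, threshold=0.05):
        self.lr = lr
        self.window = window
        self.threshold = threshold
        self.block = []        # losses of the current block
        self.prev_avg = None   # average loss of the previous block

    def update(self, val_loss):
        """Record one outer-iteration validation loss; return current lr."""
        self.block.append(val_loss)
        if len(self.block) == self.window:
            avg = sum(self.block) / self.window
            self.block = []
            if self.prev_avg is not None and avg > (1.0 - self.threshold) * self.prev_avg:
                self.lr /= 2.0  # decrease of the averaged loss was below 5%
            self.prev_avg = avg
        return self.lr
```

A stagnating loss thus halves the rate once per block of 10 outer iterations, which matches the steady decrease of the objective observed with this procedure.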
During the backward time resolution, it is far more effective to initialize the solution of equation (2.7) with the solution $(\mathcal{U},\mathcal{Z})$ obtained at the next time step. Indeed, the previously computed values at time step $t_{i+1}$ are good approximations of the processes at time step $t_i$, provided that the PDE solution and its gradient are continuous. All experiments are achieved using TensorFlow [Aba+15]. In the sequel, the PDE solutions on curves are calculated as the average of 10 runs, and we provide the standard deviation associated with these results. We also show the influence of the number of neurons on the accuracy of the results.

3.2 A nonlinearity in $u D_x^2 u$

We consider a generator of the form

$$f(t,x,y,z,\gamma) = y\,\mathrm{tr}(\gamma) + \frac{y}{2} + 2y^2 - 2y^4 e^{-(T-t)},$$

and $g(x) = \tanh\Big(\frac{\sum_{i=1}^d x_i}{\sqrt d}\Big)$, so that an analytical solution is available:

$$u(t,x) = \tanh\Big(\frac{\sum_{i=1}^d x_i}{\sqrt d}\Big)\, e^{\frac{T-t}{2}}.$$

We fix the horizon $T = 1$, and choose to evaluate the solution at $t = 0$ and $x = 0.5\,\frac{\mathbb{1}_d}{\sqrt d}$ (here $\mathbb{1}_d$ denotes the vector in $\mathbb{R}^d$ with all components equal to 1), for which $u(t,x) = 0.761902$, while its derivative equals 1.2966. This initial value $x$ is chosen such that, independently of the dimension, the solution varies around this point and is not in a region where the tanh function is close to $-1$ or 1. The coefficients of the forward process used to solve the equation are (here $I_d$ is the identity $d\times d$ matrix)

$$\sigma = \frac{\hat\sigma}{\sqrt d} I_d, \quad \mu = 0,$$

and the truncation operator is chosen as

$$\mathcal{T}_p(X_t^{0,x}) = \min\Big(\max\big[x - \sigma\sqrt t\,\phi_p,\ X_t^{0,x}\big],\ x + \sigma\sqrt t\,\phi_p\Big),$$

where $\phi_p = \mathcal{N}^{-1}(p)$, with $\mathcal{N}$ the CDF of a standard Gaussian random variable. In the numerical results, we take $p = 0.999$ and $m = 20$ neurons. We first begin in dimension $d = 1$, and show in Figure 1 how well $u$, $D_x u$ and $D_x^2 u$ are approximated by the resolution method.
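The stated analytic solution can be checked directly. The snippet below (our own verification code, not part of the scheme) evaluates $u(0,x)$ at the test point and the PDE residual $\partial_t u + f$ at the exact solution, using closed-form derivatives:

```python
import numpy as np

T, d = 1.0, 5
x = 0.5 * np.ones(d) / np.sqrt(d)          # test point x = 0.5 * 1_d / sqrt(d)

def pde_residual(t, x):
    """Evaluate du/dt + f(t,x,u,Du,D2u) at the exact solution
    u(t,x) = tanh(sum(x)/sqrt(d)) * exp((T-t)/2), with analytic derivatives."""
    s = np.sum(x) / np.sqrt(d)
    w = np.tanh(s)
    e = np.exp((T - t) / 2.0)
    y = w * e                               # u(t,x)
    tr_gamma = -2.0 * w * (1.0 - w**2) * e  # trace of the Hessian D^2_x u
    du_dt = -0.5 * y
    f = y * tr_gamma + y / 2.0 + 2.0 * y**2 - 2.0 * y**4 * np.exp(-(T - t))
    return du_dt + f

u0 = np.tanh(0.5) * np.exp(0.5)             # u(0, x), approx. 0.761902
```

The residual vanishes identically because $y\,\mathrm{tr}(\gamma) + 2y^2 - 2y^4 e^{-(T-t)} = 0$ along the solution, leaving $\partial_t u + y/2 = 0$.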
On Figure 2, we check the convergence, for different values of $\hat\sigma$, of both the solution $u$ and its derivative at point $x$ and date 0. The standard deviation of the function value is very low, and the standard deviation of the derivative remains low as well.

Figure 1: A single valuation run for test case one in 1D using 160 time steps, $\hat\sigma = 2$, $p = 0.999$, 20 neurons, 2 layers ($Y$, $Z$ and $\Gamma$ at dates $t = 0.5$ and $t = 0.006125$).

Figure 2: Convergence in 1D of case one (value and standard deviation of $u$ and $D_x u$ depending on $\hat\sigma$), number of neurons per layer equal to 20, 2 layers, $p = 0.999$.

As the dimension increases, we have to increase the value of $\hat\sigma$ of the forward process. In dimension 3, the value $\hat\sigma = 0.5$ gives a high standard deviation in the results obtained, as shown in Figure 3, while in dimension 10, see Figure 4, we see that the value $\hat\sigma = 1$ is too low to give good results. We also clearly notice that in 10D a smaller time step should be used, but in our test cases we decided to consider a maximum number of time steps equal to 160.

Figure 3: Convergence in 3D of case one (convergence of $u$ and of the first component of $D_x u$ depending on $\hat\sigma$), number of neurons per layer equal to 20, 2 layers, $p = 0.999$.

Figure 4: Convergence in 10D of case one (convergence of $u$ and of the first component of $D_x u$ depending on $\hat\sigma$), number of neurons per layer equal to 20, 2 layers, $p = 0.999$.

On this simple test case, the dimension is not a problem, and very good results are obtained in dimension 20 or above with only 20 neurons and 2 layers.

3.3 A linear quadratic stochastic test case
In this example, we consider a controlled process $X = X^\alpha$ with dynamics in $\mathbb{R}^d$ according to

$$dX_t = (A X_t + B\alpha_t)\,dt + D\alpha_t\,dW_t, \quad 0 \le t \le T, \quad X_0 = x,$$

where $W$ is a real Brownian motion, the control process $\alpha$ is valued in $\mathbb{R}$, and the constant coefficients satisfy $A \in \mathbb{M}^d$, $B \in \mathbb{R}^d$, $D \in \mathbb{R}^d$. The quadratic cost functional to be minimized is

$$J(\alpha) = \mathbb{E}\Big[\int_0^T \big(X_t^{\top} Q X_t + \alpha_t^2 N\big)\,dt + X_T^{\top} P X_T\Big],$$

where $P$, $Q$ are nonnegative $d\times d$ symmetric matrices and $N \in \mathbb{R}$ is strictly positive. The Bellman equation associated with this stochastic control problem is

$$\begin{cases} \dfrac{\partial u}{\partial t} + \inf_{a\in\mathbb{R}}\Big[(Ax + Ba)\cdot D_x u + \dfrac{a^2}{2}\mathrm{tr}(DD^{\top} D_x^2 u) + x^{\top} Q x + N a^2\Big] = 0, & (t,x) \in [0,T)\times\mathbb{R}^d, \\ u(T,x) = x^{\top} P x, & x \in \mathbb{R}^d, \end{cases}$$

which can be rewritten as a fully nonlinear equation of the form (1.1) with

$$f(t,x,y,z,\gamma) = x^{\top} Q x + Ax\cdot z - \frac{1}{2}\,\frac{|B\cdot z|^2}{\mathrm{tr}(DD^{\top}\gamma) + 2N}.$$

An explicit solution to this PDE is given by

$$u(t,x) = x^{\top} K(t)\,x,$$

where $K(t)$ is a nonnegative $d\times d$ symmetric matrix function solving the Riccati equation

$$\dot K + A^{\top} K + K A + Q - \frac{K B B^{\top} K}{N + D^{\top} K D} = 0, \quad K(T) = P.$$

We take $T = 1$. The coefficients of the forward process used to solve the equation are $\sigma = \frac{\hat\sigma}{\sqrt d} I_d$, $\mu(t,x) = Ax$. In our numerical example, we take the following parameters for the optimization problem:

$$A = I_d, \quad B = D = \mathbb{1}_d, \quad Q = P = \frac{1}{d} I_d, \quad N = d,$$

and we want to estimate the solution at $x = \mathbb{1}_d$. In this example, the truncation operator (indexed by $p$ between 0 and 1 and close to 1) is as follows:

$$\mathcal{T}_p(X_t^x) = \min\Big(\max\Big[x e^{\hat A t} - \sigma\sqrt{\tfrac{e^{2\hat A t} - \hat{\mathbb{1}}}{2\hat A}}\,\phi_p,\ X_t^x\Big],\ x e^{\hat A t} + \sigma\sqrt{\tfrac{e^{2\hat A t} - \hat{\mathbb{1}}}{2\hat A}}\,\phi_p\Big),$$

where $\phi_p = \mathcal{N}^{-1}(p)$, $\hat A$ is the vector with components $\hat A_i = A_{ii}$, $i = 1,\ldots,d$, $\hat{\mathbb{1}}$ is a unit vector, and the square root is taken componentwise. On Figure 5, we give the solution of the PDE with $d = 1$ using $\hat\sigma = 1.5$, obtained at two dates: $t = 0.5$ and $t$ close to zero.
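The reference values for this test come from the Riccati equation above, which is easy to integrate numerically. The following sketch (our own helper, an explicit backward Euler scheme; function name and step count are illustrative choices) returns $K(0)$, from which $u(0,x) = x^{\top}K(0)x$:

```python
import numpy as np

def riccati_K0(A, B, D, Q, N, P, T=1.0, n_steps=10000):
    """Integrate  Kdot = -(A^T K + K A + Q - K B B^T K / (N + D^T K D))
    backward from K(T) = P with explicit Euler; returns K(0).
    B and D are vectors, N is a scalar."""
    h = T / n_steps
    K = P.astype(float).copy()
    for _ in range(n_steps):
        v = K @ B                                      # K B
        drift = A.T @ K + K @ A + Q - np.outer(v, v) / (N + D @ K @ D)
        K = K + h * drift   # backward step: K(t-h) = K(t) - h*Kdot = K(t) + h*drift
    return K

# parameters of the numerical example: A = I_d, B = D = 1_d, Q = P = I_d/d, N = d
d = 2
A, B, D = np.eye(d), np.ones(d), np.ones(d)
Q = P = np.eye(d) / d
K0 = riccati_K0(A, B, D, Q, d, P)
u0 = np.ones(d) @ K0 @ np.ones(d)   # reference value u(0, 1_d)
```

Halving the step size and comparing the two results gives a simple check of the accuracy of the reference value.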
We observe that we have a very good estimation of the function value, and a correct one of the $\Gamma$ value, at date $t = 0.5$. The precision remains good for $\Gamma$ close to $t = 0$, and very good for $u$ and $D_x u$.

Figure 5: Test case linear quadratic 1D using 160 time steps, $\hat\sigma = 1.5$, $p = 0.999$, 100 neurons ($Y$, $Z$ and $\Gamma$ at dates $t = 0.5$ and $t = 0.006125$).

On Figure 6, we give the results obtained in dimension $d = 1$ by varying $\hat\sigma$. For a value $\hat\sigma = 2$, the standard deviation of the result becomes far higher than with $\hat\sigma = 0.5$ or 1.

Figure 6: Convergence in 1D of the linear quadratic case (value and standard deviation of $u$ depending on $\hat\sigma$), number of neurons per layer equal to 50, 2 layers, $p = 0.999$.

On Figure 7, for $d = 3$, we take a quite low truncation factor $p = 0.95$ and observe that the number of neurons has to be rather high. We have also checked that taking a number of hidden layers equal to 3 does not improve the results.

Figure 7: Convergence in 3D of the linear quadratic case, 2 layers, testing the influence of the number of neurons (10, 20, 30, 50), truncation $p = 0.95$.

On Figure 8, for $d = 3$, we give the same graphs for a higher truncation factor. With this higher truncation factor, the results are improved by taking a higher number of neurons (100 in the figure below).

Figure 8: Convergence in 3D of the linear quadratic case, 2 layers, testing the influence of the number of neurons (10, 20, 50, 100), truncation $p = 0.99$.

On Figure 9, we observe in dimension 7 the influence of the number of neurons on the result for a high truncation factor $p = 0.999$. We clearly have a bias for a number of neurons equal to 50. This bias disappears when the number of neurons increases to 100.
Figure 9: Convergence in 7D of the linear quadratic case (50 versus 100 neurons), 2 layers, $p = 0.999$.

On Figure 10, for $d = 7$, we check that the influence of the truncation factor appears to be small in higher dimensions.

Figure 10: Function value convergence in 7D of the linear quadratic case with 2 layers, 100 neurons, testing $p$, using $\hat\sigma = 2$.

Finally, we give results in dimensions 10, 15 and 20 for $p = 0.999$ on Figures 11 and 12. We observe that the number of neurons with 2 hidden layers has to increase with the dimension, but also that the increase is rather slow, in contrast with the case of one hidden layer as theoretically shown in [Pin99]. For $\hat\sigma = 5$, we had to take 300 neurons to get very accurate results.

Figure 11: Function value convergence in 10D and 15D of the linear quadratic case with 2 layers, $p = 0.999$.

Figure 12: Function value convergence in 20D of the linear quadratic case with 2 layers, $p = 0.999$.

3.4 Monge-Ampère equation

Let us consider the parabolic Monge-Ampère equation

$$\begin{cases} \partial_t u + \det(D_x^2 u) = h(x), & (t,x) \in [0,T]\times\mathbb{R}^d, \\ u(T,x) = g(x), \end{cases} \tag{3.1}$$

where $\det(D_x^2 u)$ is the determinant of the Hessian matrix $D_x^2 u$. It is of the form (1.1) with

$$f(t,x,\gamma) = \det(\gamma) - h(x).$$

We test our algorithm by choosing a $C^2$ function $g$, computing $G = \det(D_x^2 g)$, and setting $h := G - 1$. Then, by construction, the function

$$u(t,x) = g(x) + T - t$$

is a solution to the Monge-Ampère equation (3.1). We choose $g(x) = \cos\big(\sum_{i=1}^d x_i/\sqrt d\big)$, and we train with the forward process $X = x_0 + W$, where $W$ is a $d$-dimensional Brownian motion. On this example, we use neural networks with 3 hidden layers and $d + 10$ neurons per layer, and we do not need to apply any truncation to the forward process $X$. Actually, we observe that adding a truncation worsens the results.
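The construction of this test case can be reproduced numerically. In the snippet below (illustration only; the finite-difference Hessian is our own stand-in for a symbolic computation), we approximate $D_x^2 g$, form $h = G - 1$, and verify that $u = g + T - t$ satisfies (3.1) by construction:

```python
import numpy as np

def hessian_fd(g, x, eps=1e-4):
    """Central finite-difference Hessian of a scalar function g at x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        ei = np.zeros(d); ei[i] = eps
        for j in range(d):
            ej = np.zeros(d); ej[j] = eps
            H[i, j] = (g(x + ei + ej) - g(x + ei - ej)
                       - g(x - ei + ej) + g(x - ei - ej)) / (4.0 * eps**2)
    return H

d, T = 3, 1.0
g = lambda x: np.cos(np.sum(x) / np.sqrt(d))
x0 = np.full(d, 0.2)
G = np.linalg.det(hessian_fd(g, x0))   # G = det(D^2_x g)
h = G - 1.0                            # h := G - 1
# u(t,x) = g(x) + T - t:  du/dt = -1 and det(D^2_x u) = det(D^2_x g) = G,
# so  du/dt + det(D^2_x u) - h = -1 + G - (G - 1) = 0.
residual = -1.0 + G - h
```

Note that for this choice of $g$ the Hessian $D_x^2 g$ has rank one, so $G$ is essentially zero for $d \ge 2$ and $h$ is close to $-1$.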
For choosing the truncation level, we first test the method with no truncation before decreasing the quantile parameter p. In the Monge-Ampère case, the best results are obtained without any truncation; this may be caused by the oscillation of the Hessian. The following table gives the results in dimension d = 5, 15, and for T = 1.

Dimension d   Averaged value   Standard deviation   Relative error (%)   Theoretical solution
5             0.37901          0.00312              0.97                 0.382727
15            0.25276          0.00235              1.17                 0.255754

Table 1: Estimate of u(0, x_0 = 1_d) on the Monge-Ampère problem (3.1) with N = 120. Average and standard deviation observed over 10 independent runs are reported.

3.5 Portfolio selection

We consider a portfolio selection problem formulated as follows. There are n risky assets of uncorrelated price process P = (P¹, ..., Pⁿ) with dynamics

dP_t^i = P_t^i σ(V_t^i) ( λ_i(V_t^i) dt + dW_t^i ),  i = 1, ..., n,

where W = (W¹, ..., Wⁿ) is an n-dimensional Brownian motion, λ = (λ_1, ..., λ_n) is the risk premium of the assets, σ is a positive function (e.g. σ(v) = e^v, corresponding to the Scott model), and V = (V¹, ..., Vⁿ) is the volatility factor modeled by an Ornstein-Uhlenbeck (O.U.) process

dV_t^i = κ_i[θ_i − V_t^i] dt + ν_i dB_t^i,  i = 1, ..., n,   (3.2)

with κ_i, θ_i, ν_i > 0, and B = (B¹, ..., Bⁿ) an n-dimensional Brownian motion such that d⟨W^i, B^j⟩ = δ_{ij} ρ_{ij} dt, with ρ_i := ρ_{ii} ∈ (−1, 1). An agent can invest at any time an amount α_t = (α_t^1, ..., α_t^n) in the stocks, which generates a wealth process X = X^α governed by

dX_t = Σ_{i=1}^n α_t^i σ(V_t^i) ( λ_i(V_t^i) dt + dW_t^i ).
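The volatility factor (3.2) is simulated in practice by a plain Euler scheme. A minimal sketch of such a simulation (our own illustration; the parameter values are chosen for illustration, not taken verbatim from the test cases):

```python
import numpy as np

# Euler scheme for the O.U. volatility factor (3.2): dV_t = kappa*(theta - V_t) dt + nu dB_t.
# Parameter values below are illustrative, not the paper's exact test configuration.
def simulate_ou(v0, kappa, theta, nu, T, N, n_paths, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / N
    V = np.full(n_paths, v0, dtype=float)
    for _ in range(N):
        V = V + kappa * (theta - V) * dt + nu * np.sqrt(dt) * rng.normal(size=n_paths)
    return V

V_T = simulate_ou(v0=0.4, kappa=1.0, theta=0.4, nu=0.2, T=1.0, N=120, n_paths=100_000)
# Started at theta, the exact mean stays theta; the exact variance at T is
# nu^2 (1 - exp(-2 kappa T)) / (2 kappa) ~ 0.0173 for these values.
print(V_T.mean(), V_T.var())
```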
The objective of the agent is to maximize her expected utility from terminal wealth:

maximize E[ U(X_T^α) ] over α.

It is well known that the solution to this problem can be characterized by the dynamic programming method (see e.g. [Pha09]), which leads to the Hamilton-Jacobi-Bellman equation for the value function on [0, T) × ℝ × ℝⁿ:

∂_t u + Σ_{i=1}^n ( κ_i(θ_i − v_i) ∂_{v_i} u + ½ ν_i² ∂²_{v_i} u )
  = ½ R(v) (∂_x u)² / ∂²_{xx} u + Σ_{i=1}^n ( ρ_i λ_i(v_i) ν_i ∂_x u ∂²_{x v_i} u / ∂²_{xx} u + ½ ρ_i² ν_i² (∂²_{x v_i} u)² / ∂²_{xx} u ),

u(T, x, v) = U(x),  x ∈ ℝ, v ∈ ℝⁿ,

with the Sharpe ratio R(v) := |λ(v)|², for v = (v_1, ..., v_n) ∈ (0, ∞)ⁿ. The optimal portfolio strategy is then given in feedback form by α*_t = â(t, X*_t, V_t), where â = (â_1, ..., â_n) is given by

â_i(t, x, v) = − (1/σ(v_i)) ( λ_i(v_i) ∂_x u / ∂²_{xx} u + ρ_i ν_i ∂²_{x v_i} u / ∂²_{xx} u ),  (t, x, v = (v_1, ..., v_n)) ∈ [0, T) × ℝ × ℝⁿ,

for i = 1, ..., n. This Bellman equation is in the form (1.1) with

f(t, x, y, z, γ) = Σ_{i=1}^n ( κ_i(θ_i − v_i) z_i + ½ ν_i² γ_ii ) − ½ R(v) z_0²/γ_00 − Σ_{i=1}^n ( ρ_i λ_i(v_i) ν_i z_0 γ_{0i}/γ_00 + ½ ρ_i² ν_i² γ_{0i}²/γ_00 ),

for x = (x, v) ∈ ℝ^{n+1}, z = (z_0, ..., z_n) ∈ ℝ^{n+1}, γ = (γ_{ij})_{0 ≤ i,j ≤ n} ∈ S^{n+1}, and it displays a high nonlinearity in the Hessian argument γ. The truncation operator indexed by a parameter p is chosen equal to

T_p(X_t^{0,x}) = min( max[ x + μt − σ√t φ_p, X_t^{0,x} ], x + μt + σ√t φ_p ),

where φ_p = N⁻¹(p) and N is the CDF of a standard Gaussian random variable. We use neural networks with 2 hidden layers and d + 10 neurons per layer. We shall test this example when the utility function U is of exponential form, U(x) = −exp(−ηx) with η > 0, and under different cases for which closed-form solutions are available:

(1) Merton problem.
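The truncation operator above is a componentwise clipping of the forward process to a p-quantile band. A minimal sketch for a scalar Gaussian forward process (`truncate` is our own helper name, not from the paper's code):

```python
from statistics import NormalDist
import numpy as np

# Sketch of the truncation operator T_p: clip X_t^{0,x} to the p-quantile band of the
# Gaussian forward process x0 + mu*t + sigma*W_t. `truncate` is our own helper name.
def truncate(X, x0, mu, sigma, t, p):
    phi_p = NormalDist().inv_cdf(p)                  # phi_p = N^{-1}(p)
    lo = x0 + mu * t - sigma * np.sqrt(t) * phi_p
    hi = x0 + mu * t + sigma * np.sqrt(t) * phi_p
    return np.clip(X, lo, hi)

Xc = truncate(np.array([-5.0, 0.0, 5.0]), x0=0.0, mu=0.0, sigma=1.0, t=1.0, p=0.999)
print(Xc)  # outliers clipped to roughly +/- 3.09 = N^{-1}(0.999)
```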
This corresponds to a degenerate case where the factor V, hence the volatility σ and the risk premium λ, are constant, so that the generator of the Bellman equation reduces to

f(t, x, y, z, γ) = − ½ |λ|² z²/γ,  (t, x, y, z) ∈ [0, T] × ℝ × ℝ × ℝ,   (3.3)

with explicit solution given by

u(t, x) = e^{−(T−t)|λ|²/2} U(x),  â_i = λ_i/(η σ).

We train with the forward process X_{k+1} = X_k + λ Δt_k + ΔW_k, k = 0, ..., N − 1, X_0 = x_0.

(2) One risky asset: n = 1. A quasi-explicit solution is provided in [Zar01]:

u(t, x, v) = U(x) w(t, v),  with  w(t, v) = E[ exp( − (1 − ρ²)/2 ∫_t^T R(V̂_s^{t,v}) ds ) ]^{1/(1−ρ²)},

where V̂^{t,v} is the solution to the modified O.U. model

dV̂_s = ( κ(θ − V̂_s) − ρν λ(V̂_s) ) ds + ν dB_s,  s ≥ t,  V̂_t = v.

We test our algorithm with λ(v) = λv, λ > 0, for which we have an explicit solution:

w(t, v) = exp( − φ(t) v²/2 − ψ(t) v − χ(t) ),  (t, v) ∈ [0, T] × ℝ,

where (φ, ψ, χ) are solutions of the Riccati system of ODEs

φ̇ − 2κ̄φ − ν²(1 − ρ²)φ² + λ² = 0,  φ(T) = 0,
ψ̇ − (κ̄ + ν²(1 − ρ²)φ)ψ + κθφ = 0,  ψ(T) = 0,
χ̇ + κθψ − (ν²/2)(−φ + (1 − ρ²)ψ²) = 0,  χ(T) = 0,

with κ̄ = κ + ρνλ, and are explicitly given by (see e.g. the Appendix in [SZ99]), writing τ = T − t and D(τ) = κ̂ cosh(κ̂τ) + κ̄ sinh(κ̂τ):

φ(t) = λ² sinh(κ̂τ) / D(τ),
ψ(t) = (λ²κθ/κ̂) (cosh(κ̂τ) − 1) / D(τ),
χ(t) = (1/(2(1−ρ²))) ln( cosh(κ̂τ) + (κ̄/κ̂) sinh(κ̂τ) ) − κ̄τ/(2(1−ρ²))
   − (λ²(κθ)²/(2κ̂²)) ( sinh(κ̂τ)/D(τ) − τ ) − (λ²(κθ)²κ̄/κ̂³) (cosh(κ̂τ) − 1)/D(τ),

with κ̂ = √(κ² + 2ρνλκ + ν²λ²). We train with the forward process

X_{k+1} = X_k + λθ Δt_k + ΔW_k,  k = 0, ..., N − 1,  X_0 = x_0,
V_{k+1} = V_k + ν ΔB_k,  k = 0, ...
, N − 1,  V_0 = θ.

(3) No leverage effect, i.e. ρ_i = 0, i = 1, ..., n. In this case, there is a quasi-explicit solution given by

u(t, x, v) = U(x) w(t, v),  with  w(t, v) = E[ exp( − ½ ∫_t^T R(V_s^{t,v}) ds ) ],  (t, v) ∈ [0, T] × ℝⁿ,   (3.4)

where V^{t,v} is the solution to (3.2) starting from v at time t. We test our algorithm with λ_i(v) = λ_i v_i, λ_i > 0, i = 1, ..., n, v = (v_1, ..., v_n), for which we have an explicit solution given by

w(t, v) = exp( − Σ_{i=1}^n [ φ_i(t) v_i²/2 + ψ_i(t) v_i + χ_i(t) ] ),  (t, v) ∈ [0, T] × ℝⁿ,

with, writing τ = T − t and D_i(τ) = κ̂_i cosh(κ̂_i τ) + κ_i sinh(κ̂_i τ):

φ_i(t) = λ_i² sinh(κ̂_i τ) / D_i(τ),
ψ_i(t) = (λ_i² κ_i θ_i / κ̂_i) (cosh(κ̂_i τ) − 1) / D_i(τ),
χ_i(t) = ½ ln( cosh(κ̂_i τ) + (κ_i/κ̂_i) sinh(κ̂_i τ) ) − κ_i τ/2
   − (λ_i²(κ_i θ_i)²/(2κ̂_i²)) ( sinh(κ̂_i τ)/D_i(τ) − τ ) − (λ_i²(κ_i θ_i)² κ_i/κ̂_i³) (cosh(κ̂_i τ) − 1)/D_i(τ),

and κ̂_i = √(κ_i² + ν_i² λ_i²). We train with the forward process

X_{k+1} = X_k + ΔW_k,  k = 0, ..., N − 1,  X_0 = x_0,
V^i_{k+1} = V^i_k + ν_i ΔB^i_k,  k = 0, ..., N − 1,  V^i_0 = θ_i,

with ⟨W, B^i⟩ = 0.

Merton problem. We take η = 0.5, λ = 0.6, T = 1, N = 120, and σ(v) = e^v. We plot the neural network approximations of u, D_x u, D²_x u, α (in blue) together with their analytic values (in orange). For comparison with Figures 6 and 7, we report the error on the gradient and the initial control. In practice, after empirical tests, we choose p = 0.98 for the truncation.
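The two closed-form benchmarks above can be evaluated directly. The following sketch is our own check, not the paper's code; for the Merton case the constant volatility is assumed to be σ = e^{0.4}, consistent with the optimal-control value reported in Table 2, and the no-leverage case uses the one-asset parameters of the d = 2 line of Table 4:

```python
import math

# Closed-form benchmarks for the portfolio tests (our own check, not the paper's code).
# Merton: u(t,x) = exp(-(T-t)|lam|^2/2) U(x), optimal control a = lam/(eta*sigma);
# the constant volatility is assumed to be sigma = e^{0.4} (consistent with Table 2).
def merton_u(t, x, lam=0.6, eta=0.5, T=1.0):
    return math.exp(-(T - t) * lam**2 / 2) * (-math.exp(-eta * x))

merton_alpha = 0.6 / (0.5 * math.exp(0.4))

# No-leverage case (n = 1), lambda(v) = lam*v: w(t,v) = exp(-phi v^2/2 - psi v - chi),
# with the Riccati solutions written in terms of tau = T - t.
def w_no_leverage(tau, v, lam=1.5, theta=0.4, nu=0.2, kappa=1.0):
    khat = math.sqrt(kappa**2 + nu**2 * lam**2)
    s, c = math.sinh(khat * tau), math.cosh(khat * tau)
    D = khat * c + kappa * s
    phi = lam**2 * s / D
    psi = lam**2 * kappa * theta / khat * (c - 1) / D
    chi = (0.5 * math.log(c + kappa / khat * s) - 0.5 * kappa * tau
           - lam**2 * (kappa * theta)**2 / (2 * khat**2) * (s / D - tau)
           - lam**2 * (kappa * theta)**2 * kappa / khat**3 * (c - 1) / D)
    return math.exp(-(phi * v**2 / 2 + psi * v + chi))

print(round(merton_u(0.0, 1.0), 5))              # -0.50662, the theoretical value in Table 2
print(round(merton_alpha, 5))                    # 0.80438, the control value in Table 2
u0 = -math.exp(-0.5) * w_no_leverage(1.0, 0.4)   # u(0, x0=1, theta) for the d = 2 case
print(round(u0, 6))                              # close to -0.501566 from Table 4
```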
                     Averaged value   Standard deviation   Theoretical value   Relative error (%)
u(0, x_0 = 1)        -0.50561         0.00029              -0.50662            0.20
D_x u(0, x_0 = 1)    0.25081          0.00088              0.25331             0.99
α(0, x_0 = 1)        0.83552          0.02371              0.80438             3.87

Table 2: Estimate of the solution, its derivative and the optimal control at the initial time t = 0 in the Merton problem (3.3). Average and standard deviation observed over 10 independent runs are reported.

[Figure 13 panels: Y, Z and Γ at dates t = 0.5042 and t = 0.0084.]

Figure 13: Estimates of the solution and its derivatives on the Merton problem (3.3) using 120 time steps.

[Figure 14 panels: α at dates t = 0.5042 and t = 0.0084.]

Figure 14: Estimates of the optimal control α on the Merton problem (3.3).

One asset (n = 1) in the Scott volatility model. We take η = 0.5, λ = 1.5, θ = 0.4, ν = 0.4, κ = 1, ρ = −0.7. For all tests we choose T = 1, N = 120, and σ(v) = e^v. In practice, after empirical tests, we choose p = 0.98 for the truncation.

Averaged value   Standard deviation   Relative error (%)
-0.53431         0.00070              0.34

Table 3: Estimate of u(0, x_0 = 1, θ) on the one-asset problem with stochastic volatility (d = 2). Average and standard deviation observed over 10 independent runs are reported. The exact solution is −0.53609477.

No leverage in the Scott model. In the case with one asset (n = 1), we take η = 0.5, λ = 1.5, θ = 0.4, ν = 0.2, κ = 1. For all tests we choose T = 1, N = 120, and σ(v) = e^v. In practice, after empirical tests, we choose p = 0.95 for the truncation.

Dimension d   Averaged value   Standard deviation   Relative error (%)   Theoretical solution
2             -0.49980         0.00073              0.35                 -0.501566
5             -0.43768         0.00137              0.92                 -0.441765
8             -0.38720         0.00363              1.96                 -0.394938
10            -0.27920         0.05734              1.49                 -0.275092

Table 4: Estimate of u(0, x_0 = 1, θ) on the no-leverage problem (3.4).
Average and standard deviation observed over 10 independent runs are reported.

In the case with four assets (n = 4, d = 5), we take η = 0.5, λ = (1.5, 1.1, 2, 0.8), θ = (0.1, 0.2, 0.3, 0.4), ν = (0.2, 0.15, 0.25, 0.31), κ = (1, 0.8, 1.1, 1.3). In the case with seven assets (n = 7, d = 8), we take η = 0.5, λ = (1.5, 1.1, 2, 0.8, 0.5, 1.7, 0.9), θ = (0.1, 0.2, 0.3, 0.4, 0.25, 0.15, 0.18), ν = (0.2, 0.15, 0.25, 0.31, 0.4, 0.35, 0.22), κ = (1, 0.8, 1.1, 1.3, 0.95, 0.99, 1.02). In the case with nine assets (n = 9, d = 10), we take η = 0.5, λ = (1.5, 1.1, 2, 0.8, 0.5, 1.7, 0.9, 1, 0.9), θ = (0.1, 0.2, 0.3, 0.4, 0.25, 0.15, 0.18, 0.08, 0.91), ν = (0.2, 0.15, 0.25, 0.31, 0.4, 0.35, 0.22, 0.4, 0.15), κ = (1, 0.8, 1.1, 1.3, 0.95, 0.99, 1.02, 1.06, 1.6).

The Hamilton-Jacobi-Bellman equation from portfolio optimization is a typical example of full nonlinearity in the second-order derivative, and the above results show that our algorithm performs quite well up to dimension d = 8 but gives a high variance in dimension d = 10.

Comparison with an implicit version of the scheme. As explained in Remark 2.2, an alternative option for the estimation of the Hessian is to approximate it by the automatic differentiation of the current neural network for the Z component. It corresponds to the replacement of DẐ_{i+1}(T(X_{t_{i+1}})) by DZ_i(T(X_{t_i}); θ) in (2.7). An additional change has to be made to the method for it to work: at the last optimization step (for time step t_0 = 0), we notice empirically that the variable Γ_0 is not able to properly learn the initial Hessian value at all. Therefore, for this last step we use the variables Y_0, Z_0 and an explicit estimate of the second-order derivative given by DẐ_1(T(X_{t_1})).
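The idea of recovering Γ by differentiating the approximation of Z can be illustrated on a toy example. In this sketch of ours, an exact gradient stands in for the Z network and a central finite difference stands in for automatic differentiation:

```python
import numpy as np

# Toy illustration (ours) of estimating Gamma by differentiating the Z approximation:
# for u(x) = sin(x), the exact gradient z(x) = cos(x) plays the role of the Z network,
# and a central finite difference plays the role of automatic differentiation.
def z(x):
    return np.cos(x)                                  # stands in for D_x u

def dz(x, eps=1e-5):
    return (z(x + eps) - z(x - eps)) / (2 * eps)      # stands in for D^2_x u

x = np.linspace(-1.0, 1.0, 5)
err = np.max(np.abs(dz(x) + np.sin(x)))               # exact D^2_x u = -sin(x)
print(err)  # small discretization error
```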
We see in Table 5 that the results for the Merton problem are very similar to those in Table 2 for the splitting scheme, but with a worse estimation of the Hessian and of the optimal control (the error is multiplied by around 1.5). When we tested this implicit scheme on the Monge-Ampère problem, we also faced computational problems during the optimization step in TensorFlow: the numerical computation of the gradient of the objective function for the backpropagation step, more precisely for the determinant part, often gives rise to matrix invertibility errors which stop the algorithm execution. For these two reasons, we focused our study on the explicit scheme.

                     Average    Std       True value   Relative error (%)
u(0, x_0 = 1)        -0.50572   0.00034   -0.50662     0.18
D_x u(0, x_0 = 1)    0.25091    0.00067   0.25331      0.95
α(0, x_0 = 1)        0.85254    0.01956   0.80438      5.99

Table 5: Estimate of the solution, its derivative and the optimal control at the initial time t = 0 in the Merton problem (3.3) with implicit estimation of the Hessian. Average and standard deviation (Std) observed over 10 independent runs are reported.

Comparison with the 2BSDE scheme of [BEJ19]. We conclude this paper with a comparison of our algorithm with the global scheme of [BEJ19], called Deep 2BSDE. The tests below concern the Merton problem (3.3), but similar behavior occurs on the other examples with stochastic volatilities. This scheme was implemented in the original paper only for a small number of time steps (e.g. N = 30). Thus we tested this algorithm on two discretizations, with N = 20 and N = 120 time steps respectively, as shown in Figure 15 for T = 1, where we plot the learning curve of the Deep 2BSDE method. These curves correspond to the values taken by the loss function during the gradient descent iterations.
For this algorithm, the loss function to minimize in the training of the neural networks is defined as the mean L² error between the generated Y_N value and the true terminal condition g(X_N). We observe that for this choice of maturity T = 1, the loss function oscillates during the training process and does not vanish; as a consequence, the Deep 2BSDE does not converge in this case. Even when decreasing the learning rate, we could not obtain the convergence of the scheme. However, the Deep 2BSDE method does converge for small maturities T, as illustrated in Table 6 with T = 0.1 and different values for the number of time steps N. Nevertheless, even if the value function is well approximated, the estimation of the gradient and of the control did not converge (the corresponding variance is very large), whereas with our scheme the gradient is very well approximated and the control is quite precise; we also have a much smaller variance in the results. Table 7 shows the results obtained by our method with T = 0.1, in order to compare it with the performance of [BEJ19]. It illustrates the limitations of the global approach and justifies our introduction of a local method.

Figure 15: Learning curve in logarithmic scale for the scheme of [BEJ19] on the Merton problem (3.3) with N = 20 time steps on the left and N = 120 time steps on the right. The maturity is T = 1. 10000 gradient descent iterations were conducted.
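The terminal loss minimized by the global scheme can be sketched as follows (our own illustration; the samples below are placeholders, not outputs of a trained network):

```python
import numpy as np

# Sketch of the global Deep 2BSDE terminal loss on the Merton example: the loss is the
# mean squared gap between the simulated Y_N and the terminal condition g(X_N) = U(X_N).
# X_N and Y_N below are illustrative placeholder samples, not trained-network outputs.
rng = np.random.default_rng(0)
eta = 0.5
X_N = 1.0 + rng.normal(0.0, 0.3, size=10_000)      # simulated terminal wealth (illustrative)
g = -np.exp(-eta * X_N)                            # terminal condition U(x) = -exp(-eta x)
Y_N = g + rng.normal(0.0, 0.05, size=X_N.shape)    # imperfect network output (illustrative)

loss = np.mean((Y_N - g) ** 2)                     # quantity driven toward zero in training
print(loss)  # roughly the variance of the injected error, ~2.5e-3
```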
                     N    Averaged value   Standard deviation   Theoretical value   Relative error (%)
u(0, x_0 = 1)        5    -0.60667         0.01588              -0.59571            1.84
u(0, x_0 = 1)        10   -0.59841         0.02892              -0.59571            0.45
u(0, x_0 = 1)        20   -0.59316         0.04251              -0.59571            0.43
D_x u(0, x_0 = 1)    5    0.09668          0.25630              0.29786             67.54
D_x u(0, x_0 = 1)    10   0.03810          0.44570              0.29786             93.36
D_x u(0, x_0 = 1)    20   0.07557          0.55030              0.29786             74.63
α(0, x_0 = 1)        5    -0.15243         0.61096              0.80438             118.95
α(0, x_0 = 1)        10   0.59971          1.97906              0.80438             25.44
α(0, x_0 = 1)        20   0.28385          0.43775              0.80438             64.71

Table 6: Estimate of the solution, its derivative and the optimal control at the initial time t = 0 in the Merton problem (3.3) with maturity T = 0.1 for the [BEJ19] scheme. Average and standard deviation observed over 10 independent runs are reported.

                     N    Averaged value   Standard deviation   Theoretical value   Relative error (%)
u(0, x_0 = 1)        5    -0.59564         0.01136              -0.59571            0.01
u(0, x_0 = 1)        10   -0.59550         0.00037              -0.59571            0.04
u(0, x_0 = 1)        20   -0.59544         0.00054              -0.59571            0.04
D_x u(0, x_0 = 1)    5    0.29848          0.00044              0.29786             0.21
D_x u(0, x_0 = 1)    10   0.29842          0.00084              0.29786             0.19
D_x u(0, x_0 = 1)    20   0.29785          0.00054              0.29786             0.001
α(0, x_0 = 1)        5    0.82322          0.01014              0.80438             2.34
α(0, x_0 = 1)        10   0.85284          0.07565              0.80438             6.02
α(0, x_0 = 1)        20   0.84201          0.09892              0.80438             4.68

Table 7: Estimate of the solution, its derivative and the optimal control at the initial time t = 0 in the Merton problem (3.3) with maturity T = 0.1 for our scheme. Average and standard deviation observed over 10 independent runs are reported.

References

[AA+18] A. Al-Aradi et al. "Solving Nonlinear and High-Dimensional Partial Differential Equations via Deep Learning". In: arXiv:1811.08782 (2018).
[Aba+15] M. Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015. URL: https://www.tensorflow.org/.
[Bec+19] C. Beck et al. "Deep splitting method for parabolic PDEs".
In: arXiv:1907.03452 (2019).
[BEJ19] C. Beck, W. E, and A. Jentzen. "Machine Learning Approximation Algorithms for High-Dimensional Fully Nonlinear Partial Differential Equations and Second-order Backward Stochastic Differential Equations". In: J. Nonlinear Sci. 29.4 (2019), pp. 1563–1619.
[Che+07] P. Cheridito et al. "Second-order backward stochastic differential equations and fully nonlinear parabolic PDEs". In: Communications on Pure and Applied Mathematics 60.7 (2007), pp. 1081–1110.
[CWNMW19] Quentin Chan-Wai-Nam, Joseph Mikael, and Xavier Warin. "Machine learning for semi linear PDEs". In: Journal of Scientific Computing 79.3 (2019), pp. 1667–1712.
[DLM20] J. Darbon, G. Langlois, and T. Meng. "Overcoming the curse of dimensionality for some Hamilton–Jacobi partial differential equations via neural network architectures". In: Res Math Sci 7 (2020).
[EHJ17] W. E, J. Han, and A. Jentzen. "Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations". In: Communications in Mathematics and Statistics 5.4 (2017), pp. 349–380.
[FTW11] A. Fahim, N. Touzi, and X. Warin. "A probabilistic numerical method for fully nonlinear parabolic PDEs". In: The Annals of Applied Probability (2011), pp. 1322–1364.
[HJE18] J. Han, A. Jentzen, and W. E. "Solving high-dimensional partial differential equations using deep learning". In: Proceedings of the National Academy of Sciences of the United States of America 115.34 (2018), pp. 8505–8510.
[HPW20] Côme Huré, Huyên Pham, and Xavier Warin. "Deep backward schemes for high-dimensional nonlinear PDEs". In: Mathematics of Computation 89.324 (2020), pp. 1547–1579.
[Hut+18] M. Hutzenthaler et al. "Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations". In: arXiv:1807.01212 (2018).
[KB14] D. P. Kingma and J. Ba.
Adam: A Method for Stochastic Optimization. Published as a conference paper at the 3rd International Conference on Learning Representations, San Diego, 2015. 2014.
[NR20] N. Nüsken and L. Richter. "Solving high-dimensional Hamilton-Jacobi-Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space". In: arXiv:2005.05409 (2020).
[Pha09] H. Pham. Continuous-time Stochastic Control and Optimization with Financial Applications. Vol. 61. SMAP. Springer, 2009.
[Pin99] A. Pinkus. "Approximation theory of the MLP model in neural networks". In: Acta Numerica 8 (1999), pp. 143–195.
[PP90] E. Pardoux and S. Peng. "Adapted solution of a backward stochastic differential equation". In: Systems & Control Letters 14.1 (1990), pp. 55–61.
[SS18] J. Sirignano and K. Spiliopoulos. "DGM: A deep learning algorithm for solving partial differential equations". In: Journal of Computational Physics 375 (2018), pp. 1339–1364.
[SZ99] R. Schöbel and J. Zhu. "Stochastic volatility with an Ornstein-Uhlenbeck process and extension". In: Review of Finance 3.1 (1999), pp. 23–46.
[Tan13] X. Tan. "A splitting method for fully nonlinear degenerate parabolic PDEs". In: Electronic Journal of Probability 18 (2013).
[VSS18] M. Sabate Vidales, D. Siska, and L. Szpruch. "Unbiased deep solvers for parametric PDEs". In: arXiv:1810.05094v2 (2018).
[War18] X. Warin. "Monte Carlo for high-dimensional degenerated Semi Linear and Full Non Linear PDEs". In: arXiv:1805.05078 (2018).
[Zar01] T. Zariphopoulou. "A solution approach to valuation with unhedgeable risks". In: Finance and Stochastics 5 (2001), pp. 61–82.