Stochastic Optimal Control as Approximate Input Inference

Joe Watson, Hany Abdulsamad, Jan Peters†
Department of Computer Science, Technische Universität Darmstadt, Germany
†Robot Learning Group, Max Planck Institute for Intelligent Systems, Tübingen, Germany
{watson, abdulsamad, peters}@ias.informatik.tu-darmstadt.de

3rd Conference on Robot Learning (CoRL 2019), Osaka, Japan.

Abstract: Optimal control of stochastic nonlinear dynamical systems is a major challenge in the domain of robot learning. Given the intractability of the global control problem, state-of-the-art algorithms focus on approximate sequential optimization techniques that rely heavily on heuristics for regularization in order to achieve stable convergence. By building upon the duality between inference and control, we develop the view of Optimal Control as Input Estimation, devising a probabilistic stochastic optimal control formulation that iteratively infers the optimal input distributions by minimizing an upper bound of the control cost. Inference is performed through Expectation Maximization and message passing on a probabilistic graphical model of the dynamical system, and time-varying linear Gaussian feedback controllers are extracted from the joint state-action distribution. This perspective incorporates uncertainty quantification, effective initialization through priors, and the principled regularization inherent to the Bayesian treatment. Moreover, we show that for deterministic linearized systems, our framework derives the maximum entropy linear quadratic optimal control law. We provide a complete and detailed derivation of our probabilistic approach and highlight its advantages in comparison to other deterministic and probabilistic solvers.

Keywords: Stochastic Optimal Control, Approximate Inference

1 Introduction

Trajectory optimization for nonlinear dynamical systems is among the most fundamental paradigms in the field of robotics. It has proven itself to be a cornerstone for both low- and high-level planning techniques [1, 2]. A popular tool for devising such planning schemes is Optimal Control [3, 4], which frames the search for the best sequence of inputs into a dynamical system as the optimization of the state-action trajectory. While Optimal Control has had great success both in theory and application, mainly represented by Sequential Quadratic Programming (SQP) techniques [5], it is known to struggle with stochastic environments due to its feedforward nature. Meanwhile, a popular tool for dealing with uncertainty is Bayesian statistics [6], which in part uses the notion of random variables to describe model uncertainty. The process of determining the characteristics of this uncertainty is known as inference, and this too is often framed as an optimization problem. Control-as-inference [7, 8, 9, 10] is a body of research combining these two paradigms, with the proposition that the principled mechanisms of inference will bring the benefits of faster convergence, more principled regularization and the addition of uncertainty quantification [11]. In this work we present Input Inference for Control (I2C), a new perspective on control-as-inference.
By moving away from the typical Optimal Control formulation, while preserving the underlying operations, recursive Bayesian inference can be applied to the inputs in a manner that optimizes a control objective. This builds on previous work that performs recursive approximate inference of the state trajectory [12] and exact input inference for linear systems [13]. Consider the fundamental task of control: to find the sequence of actions that generates a desired trajectory. From an inference perspective, we would call this problem Input Estimation, where the 'desired' observed trajectory in this case is a set of measurements. In Optimal Control, as the desired trajectory is not expected to be fully achieved, the notion of a cost function is used to describe the desired deviation of the observed trajectory. Statistically, this deviation would be framed as a 'disturbance' and described by a probability distribution. As likelihoods are often the optimization objective of an inference problem, comparing the likelihood of this formulation to typical control cost functions informs our choice of disturbance noise in order to achieve equivalence. In this work, we focus on the well-established duality between Gaussian noise and quadratic penalties [6].

By making the linear Gaussian assumption on both our dynamics and observation models, inference can be performed in closed form using message passing, and we show that this input inference reduces to the Linear Quadratic Regulator (LQR) solution in the deterministic case. Moreover, the inference in fact performs the same Discrete Algebraic Riccati Equation (DARE) computation [13]. Additionally, by making the inference approximate through local linearizations, we extend the scheme to nonlinear dynamical systems and arrive at a procedure akin to the popular trajectory optimization of Differential Dynamic Programming (DDP) [14] and variants (e.g. iLQR [15], eLQR [16], GPS [17]). While these methods require explicit regularization, bounds and heuristics to maintain steady convergence, the behaviour of our scheme is governed primarily by the choice of priors, with regularization only required to account for the log-likelihood approximation. The use of Bayesian inference also results in self-regularized exploration, as the covariance of each input is a measure of confidence / robustness. Moreover, by examining the conditional distributions of the resultant posterior state-action distribution, we arrive at (Bayes) optimal time-varying linear (Gaussian) controllers, as in LQR [13]. We show that the covariance of these controllers naturally exhibits the maximum entropy characteristic, achieved without explicit incorporation of a policy entropy term in the objective as done previously.

The contributions of this work are as follows:

- A control-as-inference formulation (I2C) that posits optimal control as input estimation for a dynamical system, such that the optimization objective is separated from the priors over the controls. This allows for Bayesian inference of the controls, rather than fixing them for exploration.
- A practical realisation through approximate Expectation Maximisation, performing inference via linearized Gaussian message passing in the E-step and hyperparameter optimization in the M-step. Compared to previous methods, I2C has more principled regularization, relying primarily on the priors rather than heuristic methods such as line search, smoothing and annealing.

2 Input Inference for Control

Given a stochastic discrete-time fully-observed nonlinear dynamical system, x_{t+1} ∼ f(x_t, u_t), with state x ∈ R^{d_x} and input u ∈ R^{d_u}, we wish to find the optimal control inputs u*_{0:T} over time horizon T that minimize the cost function C(x, u) for moving from an initial state x_0 to a goal state x_g. Our proposed method reframes optimal control as inference of the inputs of the dynamical system. This can be achieved with access to a dynamics model and by incorporating the cost function into the likelihood in an affine manner through an 'observation model' p(z_t | x_t, u_t) of our optimization variables z ∈ R^{d_z}, such that αC(x, u) + β = log p(z_t | x_t, u_t). By maximizing this likelihood,

max_{u_{0:T}, θ} p(z_{0:T}, x_{0:T}, u_{0:T}, θ) = p(x_0) ∏_{t=0}^{T−1} p(x_{t+1} | x_t, u_t) ∏_{t=0}^{T} p(z_t | x_t, u_t, θ) p(u_t | x_t),  (1)

both the control cost (observation likelihood) and trajectory likelihood are jointly optimized, generating an estimated optimal state-action joint distribution p(x, u). From this, the conditional distribution p(u | x) can be found and used as a policy. The likelihood acts as an unconstrained control cost function by incorporating the constraint of the dynamical system, present in typical Optimal Control formulations, as an additional likelihood. This makes sense for stochastic systems, where the dynamical system can no longer be treated as a deterministic constraint. The likelihood also depends on hyperparameters θ, which can be optimized via the marginal likelihood.

2.1 The Linear Gaussian Assumption

By applying the linear Gaussian assumption to the models and their respective uncertainties, Equation 1 can not only be tackled in a tractable manner, but also compared to LQR control (Section A). Firstly, we can express the conditionals as linear state-space models:

Dynamics: p(x_{t+1} | x_t, u_t): x_{t+1} = A_t x_t + B_t u_t + a_t + η_t,  η_t ∼ N(0, Σ_η_t),  (2)
Cost: p(z_t | x_t, u_t): z_t = E_t x_t + F_t u_t + e_t + ξ_t,  ξ_t ∼ N(0, Σ_ξ).  (3)

Figure 1: Linear Gaussian message passing rules for elementary state-space operations [18], with the mean (µ), covariance (Σ), precision (Λ = Σ⁻¹) and scaled mean (ν = Λµ), which describe the moment and information (or canonical) forms of the Normal distribution respectively. The rules are:
(a) Linear Transform (y = Ax): µ→y = A µ→x, Σ→y = A Σ→x Aᵀ; ν←x = Aᵀ ν←y, Λ←x = Aᵀ Λ←y A.
(b) Addition (y = x + z): µ→y = µ→x + µ→z, Σ→y = Σ→x + Σ→z; µ←x = µ←y − µ→z, Σ←x = Σ←y + Σ→z.
(c) Equality (y = x = z): ν→y = ν→x + ν→z, Λ→y = Λ→x + Λ→z; ν←x = ν←y + ν→z, Λ←x = Λ←y + Λ→z.
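For concreteness, the three rules of Figure 1 translate directly into code. Below is a minimal numpy sketch (our function names, not the authors' released implementation), with forward messages in moment form and backward messages in information form, as in the figure.

```python
import numpy as np

def linear_fwd(mu_x, Sig_x, A):
    # (a) Linear transform y = A x, forward direction (moment form)
    return A @ mu_x, A @ Sig_x @ A.T

def linear_bwd(nu_y, Lam_y, A):
    # (a) Linear transform, backward direction (information form)
    return A.T @ nu_y, A.T @ Lam_y @ A

def add_fwd(mu_x, Sig_x, mu_z, Sig_z):
    # (b) Addition y = x + z, forward direction: moments add
    return mu_x + mu_z, Sig_x + Sig_z

def equality(nu_a, Lam_a, nu_b, Lam_b):
    # (c) Equality node: information (canonical) parameters add
    return nu_a + nu_b, Lam_a + Lam_b
```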
Secondly, the log-likelihood is transformed into a convex function (Equation 4), which is quadratic in the optimization variables x, u and z [19]:

−L(θ) = 1/2 ∑_{t=0}^{T} (z_t − E_t x_t − F_t u_t − e_t)ᵀ Σ_ξ⁻¹ (z_t − E_t x_t − F_t u_t − e_t) + (T/2) log|Σ_ξ|
      + 1/2 ∑_{t=0}^{T−1} (x_{t+1} − A_t x_t − B_t u_t − a_t)ᵀ Σ_η_t⁻¹ (x_{t+1} − A_t x_t − B_t u_t − a_t) + 1/2 ∑_{t=0}^{T−1} log|Σ_η_t| + …  (4)

In I2C, the 'measurement' of z represents the desired state-action trajectory. Therefore, to transform the log-likelihood of z into a quadratic control cost, the precision of the 'observation noise' ξ is Σ_ξ⁻¹ = Λ_ξ = αΘ, where Θ represents the weights of the cost function and α accounts for its scale invariance. For the standard LQ problem (Section A), z_t = [x_g u_g]ᵀ and Θ = diag(Q, R). Our hyperparameters θ include α, the scale factor, along with the priors over the inputs u. In Equation 4, α acts as the scale factor of the LQ cost against the other terms in the likelihood. Typically, for multi-objective cost functions this scaling must be user-defined, but as it has a probabilistic interpretation here, it can be iteratively estimated during inference. As α scales the given control cost Θ such that it can be used as the observation noise precision Λ_ξ, it can be estimated based on the current estimated state-action trajectory deviation about the goal. This inference is carried out using the Expectation Maximization (EM) algorithm [20], treating α as a latent variable.

Figure 2: Forney factor graph of the linear Gaussian dynamical system used by I2C. Blue terms are intermediate variables used in the message derivations (Section B).

Expectation Step. The E-step, estimating the state-action trajectory, can be performed in a tractable manner through linear Gaussian message passing. For model-based signal processing on linear Gaussian state-space models [18, 21, 22], expressing inference problems as Forney-style factor graphs enables the construction of message-passing algorithms by following straightforward rules (see Figure 1). For cycle-free graphs, the messages can be expressed in closed form. The forward messages (i.e. →x) represent the priors, while the backward messages (i.e. ←x) represent likelihood functions (up to a scale factor). The updated belief is the posterior of an edge, which is the product of the edge's forward and backward messages:

Σ_x = (Λ→x + Λ←x)⁻¹,  µ_x = Σ_x (ν→x + ν←x).  (5)

In I2C, the backward messages perform optimal control, so the posterior states and controls represent a regularized update of the estimated optimal state-action trajectory.
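As a concrete illustration of Equation 5, the sketch below fuses a forward and a backward message in information form (a minimal example with invented numbers, not the authors' implementation):

```python
import numpy as np

def fuse(mu_fwd, Sig_fwd, mu_bwd, Sig_bwd):
    """Posterior belief on an edge (Equation 5): product of the forward
    and backward Gaussian messages, computed in information form."""
    Lam_fwd, Lam_bwd = np.linalg.inv(Sig_fwd), np.linalg.inv(Sig_bwd)
    Sigma = np.linalg.inv(Lam_fwd + Lam_bwd)
    mu = Sigma @ (Lam_fwd @ mu_fwd + Lam_bwd @ mu_bwd)
    return mu, Sigma

# A broad prior fused with a confident backward message pulls the
# posterior towards the backward mean.
mu, Sigma = fuse(np.zeros(2), 10.0 * np.eye(2), np.ones(2), 0.1 * np.eye(2))
```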
Algorithm 1: EM for Linear Gaussian I2C
  Data: T, α, δ_α, f(x, u), g(x, u), µ→x_0, Σ→x_0, µ→u_t, Σ→u_t for t = 0:T
  Result: K_t, k_t, Σ_k_t for t = 0:T
  while not converged do
      // E-Step
      for t ← 0 to T − 1 do
          compute µ→x_{t+1}, Σ→x_{t+1} from the forward messages (Equations 18-29), updating A_t, a_t, B_t, E_t, e_t and F_t
      end
      for t ← T to 1 do
          compute µ_x_t, Σ_x_t, µ_u_t, Σ_u_t from the backward messages and marginalisation (Equations 32-45)
      end
      // M-Step
      update α with regularization (Equations 6, 10)
      update priors: µ→u = µ_u, Σ→u = Σ_u
  end
  // Controller
  compute the linear Gaussian controller K_t, k_t, Σ_k_t for t = 0:T from the messages (Equations 7-9)

The message passing on the graph of Figure 2 performs the same inference as Kalman filtering and smoothing [23], with the addition that the inputs are also uncertain.¹ Additionally, the inference starts with an 'innovation' (observation) of x_0 in order to evaluate (x_t, u_t) rather than (x_{t+1}, u_t), but this is a minor discrepancy as the subsequent prediction and innovation steps are the same. The forward and backward messages are derived in Sections B.1-B.2. While the message-passing form is more verbose than the standard Kalman filtering and smoothing equations, it allows us to appreciate how this framework performs optimal control [13]. From Equation 4 with the LQ-equivalent z_t and Θ, it is clear that the negative log-likelihood acts as an upper bound on the LQ cost, as it incorporates the trajectory likelihood, which depends on the system's stochasticity and the uncertainty in the controls. Therefore, as the EM algorithm maximizes the log-likelihood, it in turn minimizes the LQ cost, performing Bayesian optimal control. The further connections between I2C and LQ control are discussed in Sections 2.1.1 and B.5.

¹ If the input is incorporated into the state, the two procedures become identical; however, the joint dynamics then become degenerate due to the independence of the inputs.

Maximisation Step. To update α, the scale factor between the LQ cost Θ and the estimated Λ_ξ must be found. This is derived by maximizing the expected log-likelihood via the derivative:

−2 ∂/∂α E[L(α)] = ∂/∂α (tr{Σ_ξ⁻¹ Σ̂_ξ} + T log|Σ_ξ|) = −tr{Θ Σ̂_ξ} + T d_z α⁻¹ = 0,  (6)

where Σ̂_ξ = ∑_{t=0}^{T} [(z_t − E_t µ_x_t − F_t µ_u_t)(z_t − E_t µ_x_t − F_t µ_u_t)ᵀ + E_t Σ_x_t Eᵀ_t + F_t Σ_u_t Fᵀ_t].

In practice this means that over EM iterations, as the state-action trajectory moves towards the goal, Λ_ξ and therefore α steadily increase. This in turn increases the significance of the control cost term in the log-likelihood (Equation 4). The resulting annealing effect aids in stabilizing the optimization. This effect bears a resemblance to curriculum learning [24], where the task (e.g. cost function) increases in difficulty as performance improves, as a strategy for learning complex tasks effectively.

Linear Gaussian Controller. For finite-horizon LQ control, it can be shown that a time-varying linear controller is the optimal policy. Here we show that this is true for the inference setting as well. By examining the conditional distribution between the marginalized posteriors of x and u at each timestep, a time-varying linear Gaussian controller (Equations 7-9) can be derived from the messages (see Section B.4). For a time-varying linear Gaussian controller of the form u_t ∼ N(K_t x_t + k_t, Σ_k_t), I2C computes the parameters as

K_t = −Σ_u_t Bᵀ_t Γ_{t+1} Λ←x_{t+1} Ψ_{t+1} A_t,  (7)
k_t = Σ_u_t (ν→u_t + Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ (z_t − E_t µ→x_t − e_t) + Bᵀ_t (Γ_{t+1} ν←x_{t+1} + (I − Γ_{t+1}) ν→x''_t − Γ_{t+1} Λ←x_{t+1} Ψ_{t+1} a_t)),  (8)
Σ_k_t = Σ_u_t = (Λ→u_t + Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ F_t + Bᵀ_t Γ_{t+1} Λ←x_{t+1} B_t)⁻¹.  (9)

In Section 2.1.1, it is shown how the expressions for the controller resemble the corresponding expressions for LQ control.
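Given the extracted parameters, executing the controller of Equations 7-9 is straightforward. Below is a minimal sketch of a stochastic rollout under such a time-varying linear Gaussian policy (the dynamics function and parameter arrays are placeholders, not the authors' API):

```python
import numpy as np

def rollout(x0, K, k, Sig_k, step, rng=np.random.default_rng(0)):
    """Roll out a time-varying linear Gaussian controller
    u_t ~ N(K_t x_t + k_t, Sigma_k_t) on dynamics x' = step(x, u, t)."""
    x, xs, us = x0, [x0], []
    for t in range(len(K)):
        u = rng.multivariate_normal(K[t] @ x + k[t], Sig_k[t])
        x = step(x, u, t)
        xs.append(x)
        us.append(u)
    return np.array(xs), np.array(us)
```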
Moreover, the discrepancy between the I2C and LQ controllers can be interpreted as uncertainty-derived regulation. In the I2C controller, two additional (dimensionless) terms appear, Γ and Ψ (see Section B.4), which are functions of Σ←x_{t+1}, Σ→x''_t and Σ→u''_t. As process uncertainty increases, Γ acts to 'turn off' the optimal control terms of the controller and rely on the priors. Meanwhile, Ψ represents the confidence in the controller, which counteracts the attenuating effects of Γ given sufficient control certainty. These findings parallel the 'turn-off phenomenon' observed in Dual Control [25, 26] and Bayesian Reinforcement Learning [27], where actions are attenuated under uncertainty. This behaviour is important for settings such as probabilistic model-based Reinforcement Learning [28], where localised regions of uncertainty can indicate modelling error, and such errors can lead to detrimental policy updates. Attenuating the policy updates in these regions between model learning iterations would mitigate this pitfall.

2.1.1 Connections to Finite Horizon Maximum Entropy LQR

To understand how this framework performs optimal control, we look at the backward messages of the probabilistic graphical model described in Figure 2 from a control perspective [13]. By looking at the backward messages of the state ←X_t in Section B.3, the backwards evolution of the precision (Equation 52) and scaled-mean (Equation 57) can be seen to have a similar Riccati form to the quadratic value function parameters for LQ control (Equations 16-17). Extending this analysis to find the linear Gaussian controllers from the conditional distributions, we see that some of the equivalent terms have the additional uncertainty-weighted scaling term Γ. Table 1 details the correspondence.

Table 1: Due to the formulation of I2C, the precision of the observation noise is proportional to the LQ cost function weights. Additionally, due to the linear Gaussian assumption, we can show that the precision and scaled-mean of the backward messages of the state belief correspond to the value function parameters in LQR. These equivalences are explained further in Section B.5.

| LQR Riccati | Backward Message | Message-derived Controller |
|---|---|---|
| Q | Eᵀ_t (Σ_ξ + F_t Σ→u_t Fᵀ_t)⁻¹ E_t | Eᵀ_t (Σ_ξ + F_t Σ→u_t Fᵀ_t)⁻¹ E_t |
| R | Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ F_t | Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ F_t |
| P_t | Λ←x_t | Γ_t Λ←x_t |
| p_t | −ν←x_t | −Γ_t ν←x_t |

The control covariance, Equation 9, can be seen to resemble that of a Maximum Entropy controller. In Control Theory and Reinforcement Learning, the entropy of a policy can be interpreted as a metric for robustness, so a maximum entropy objective has been added to cost functions as regularisation [29]. Augmenting the LQ cost function with the entropy of the control inputs, the covariance of the input at each timestep can be shown to be Σ_t = (R + Bᵀ P_{t+1} B)⁻¹ (using LQ notation, see Section A) [30].
Comparing this to Equation 9 and Table 1, it can be seen that this maximum entropy control is calculated by the backward message, and combined with the prior (forward message) to construct the posterior. This fusion is important as the prior can be used to regularize exploration during inference, which is essential for mitigating the effects of linearizing the dynamics during approximate inference of nonlinear systems (see Section 2.2). This smoothing mechanism has previously been added explicitly or via constraints on the trajectory update during optimization.

2.2 Nonlinear I2C through Approximate Inference

The linear analysis conducted here can be naturally extended to nonlinear dynamical systems through linearization, taking the Jacobian of the dynamics and observation models about the current state-action trajectory. This approach has been applied to both state estimation (i.e. Extended Kalman Smoothing) and optimal control (i.e. DDP). From a probabilistic perspective, this linearization renders the inference approximate. As a consequence, careful consideration of the priors and additional regularization is required, as the act of linearizing imposes a requirement of local improvement during inference. Placing small priors on u ensures that the Bayesian posterior remains close to the prior, and this was found to be critical for systems that were highly nonlinear or had low sampling frequencies. As in Extended Kalman Filtering and other inference schemes for nonlinear systems [31], the dynamics are linearized in the forward pass. This linearization-based approximate inference can be viewed as Gauss-Newton optimization [32], making it closely related to approximate trajectory optimization algorithms such as iLQR. Additionally, it was found that the α update during the M-step must be restricted to ensure the state-action distribution does not change significantly between iterations.

Figure 3: Demonstrating how I2C generalizes the Dynamic Programming finite-horizon LQR solution: (a) comparing state trajectories, (b) comparing the parameters (feedback gains K, feedforward gains k) of the linear controller. This is achieved when the controls have a large prior and the certainty in the target observation is high. Note 'Filtered' and 'Prediction' correspond to µ→x'_t and µ→x_{t+1} in Figure 2 respectively.

By looking at a bound δ_ξ on the KL divergence between Z updates (Equation 10), this can in fact be applied as a bound δ_α on the update ratio, as the expression

D_KL[Z^i ‖ Z^{i+1}] = 1/2 [log(|Σ_ξ^{i+1}| / |Σ_ξ^i|) + tr{Λ_ξ^{i+1} Σ_ξ^i} − d_z] = 1/2 [log(α^{i+1}/α^i) + d_z α^i/α^{i+1} − d_z] ≤ δ_ξ  (10)

is monotonically increasing in the ratio α^{i+1}/α^i. From the perspective of approximate EM, the regularized M-step is motivated by mitigating the adverse effect of the linearization assumption on the likelihood estimate [33].
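The regularized M-step can be written compactly; the following is a minimal sketch (our variable names) that combines the closed-form optimum of Equation 6, α* = T d_z / tr{Θ Σ̂_ξ}, with the ratio bound motivated by Equation 10, implemented on α_i/α_{i+1} as described in Section B.6:

```python
import numpy as np

def m_step_alpha(alpha, Theta, Sig_hat_xi, T, d_z, delta_alpha_inv=0.99):
    """Regularized M-step for the cost scale alpha.
    Equation 6 gives the unconstrained optimum T * d_z / tr(Theta @ Sig_hat_xi);
    the KL bound (Equation 10) is applied as a bound on the update ratio."""
    alpha_star = T * d_z / np.trace(Theta @ Sig_hat_xi)   # Equation 6
    # keep alpha_i / alpha_{i+1} >= delta_alpha_inv, i.e. cap the growth rate
    return min(alpha_star, alpha / delta_alpha_inv)
```

Here Sig_hat_xi is the residual second moment Σ̂_ξ computed from the E-step marginals, per the definition below Equation 6.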
3 Experimental Results

An empirical evaluation is presented, first to highlight the equivalence of I2C to the LQR solution, and second to compare I2C to state-of-the-art algorithms on nonlinear dynamical systems.²

3.1 Equivalence with finite-horizon LQR by Dynamic Programming

In Section 2.1, the LQR problem was used to motivate the linear Gaussian assumption for I2C. In Section B.5 it is shown how, under specific settings, the message-passing expressions reduce to those found when solving the LQR problem via Dynamic Programming. Figure 3 illustrates this numerically, for an LQR problem described in Section C.1.

3.2 Evaluation on nonlinear trajectory optimization tasks

To evaluate the viability of I2C for nonlinear trajectory optimization, its performance on three standard control tasks was compared to similar baseline methods. iLQR and GPS are two popular algorithms that use local linearization for time-varying controllers and have demonstrated strong performance on complex control problems. iLQR is deterministic, so here it is used as a baseline for ignoring uncertainty in stochastic control problems. While GPS was motivated to train neural network policies, here we use its time-varying linear controllers, viewing it as Maximum Entropy iLQG. In order to perform the linearization required for approximate inference (and the baseline approaches), the test environments were implemented using the Autograd library [34]. We test on three classical problems of increasing complexity in state-action-observation dimensionality (d_x, d_u, d_z): Pendulum (2, 1, 4), Cartpole (4, 1, 6) and Double Cartpole (6, 1, 9) swing-up. Both Cartpole domains are also underactuated, which presents a significant planning challenge. All environments also have constrained actuation, which introduces both a nonlinearity and increased sensitivity to disturbances. Experimental details and additional trajectory plots are included in Section C.2.

² The code is available at https://github.com/JoeMWatson/input-inference-for-control

Figure 4: Comparison of the predicted trajectory cost over iterations for the three simulated tasks (Pendulum, Cartpole, Double Cartpole) during trajectory optimization. For all algorithms, the dynamics are linearized once per iteration. For experimental details see Section C.2.

Figure 4 shows that I2C is capable of performing effective trajectory optimization. Due to the EM aspect of the algorithm, a significant portion of the time is spent 'warming up' the priors, which are set to be small in order to carry out steady exploration, rather than optimizing the control cost. iLQR performs superior trajectory optimization, both in rate and final cost. However, actuation constraints were found to lead to suboptimal convergence (in the Pendulum task, Figure 5), and the optimized controllers were comparatively highly aggressive. GPS performed steadier optimization due to the KL bound and exploration in the forward pass. In Table 2, the optimized (deterministic) controllers were evaluated on the stochastic environment. I2C performs the most consistently, operating close to its predicted cost for each task.
GPS and iLQR, with more aggressive controllers and trajectories, both suffered reduced performance when evaluated on the simulated systems. We attribute this to the high-risk strategy of operating at the actuation limits while also subject to disturbances, especially as time-varying control strategies are inherently brittle to any deviation in trajectory.

Figure 5: Comparison of the state-action trajectories (θ, θ̇, u) of I2C, iLQR and GPS on the Pendulum swing-up task after convergence.

Table 2: Evaluating the optimized deterministic controller of each algorithm on the simulated stochastic environments. Predicted Cost refers to the converged value from Figure 4; Evaluated Cost shows the mean and standard deviation over 100 trials.

| Environment | Algorithm | Predicted Cost | Evaluated Cost |
|---|---|---|---|
| Pendulum | I2C | 1.35 × 10⁴ | 1.37 × 10⁴ ± 3.82 |
| | iLQR | 1.66 × 10³ | 1.11 × 10⁵ ± 20.38 |
| | GPS | 2.00 × 10⁴ | 7.01 × 10⁴ ± 30.96 |
| Cartpole | I2C | 1.73 × 10⁵ | 1.74 × 10⁵ ± 0.14 |
| | iLQR | 1.14 × 10⁵ | 1.76 × 10⁷ ± 88.63 |
| | GPS | 1.65 × 10⁵ | 2.94 × 10⁶ ± 17.60 |
| Double Cartpole | I2C | 3.12 × 10⁵ | 3.21 × 10⁵ ± 1.79 |
| | iLQR | 2.37 × 10⁵ | 1.76 × 10⁷ ± 5.27 × 10⁵ |
| | GPS | 3.76 × 10⁵ | 2.94 × 10⁶ ± 44.39 |

4 Related Work

Optimal control of nonlinear dynamical systems through iterative linearization originated from Differential Dynamic Programming (DDP) [14]. A drawback of DDP is the need for the computationally expensive second-order approximation of the dynamics. In the framework of Iterative LQR (iLQR) [35] and its stochastic extension iLQG [36], this requirement is dropped. Both algorithms perform only first-order approximations, making them akin to a regularized Gauss-Newton method. All the former methods, however, lack a principled forward pass and instead rely on a line-search approach to find a suitable regularization that counteracts the greediness of their local approximations. Extended LQR (eLQR) [16] and its stochastic extension seLQR [37] address this issue and perform a forward pass based on the 'cost-to-come', which has similarities to Kalman filtering. A more elegant solution to the problem of regularization is proposed in Guided Policy Search (GPS) [38, 39], where the Stochastic Optimal Control problem is formulated with a KL bound on the change of trajectory distributions. GPS derives a Maximum Entropy iLQG as a means to train neural network policies.

The connection between optimal control and inference, also known as the estimation-control duality and Kalman duality [4, 40], was initially noted by Kalman [41] while working on the Kalman Filter and Optimal Control. Probabilistic Control Design [42, 43, 44] derives a probabilistic variant of LQR through a KL divergence minimization, also noting the connection between the LQR cost weight matrices and the precisions of multivariate normal distributions. Furthermore, the similarity between LQR and Kalman-smoothed trajectories has previously been utilized for the ERTS controller [45]. However, this work uses standard smoothing in the state and does not derive a corresponding controller, relying on an approximate inverse dynamics model instead.
Inference has been applied to reinforcement learning for discrete environments [7], through maximizing the likelihood of a discrete latent optimality variable. AICO [12] applies this approach to the continuous LQR setting, with the state cost defining the optimality probability and the action weight defining the precision of the action prior, which is treated like a disturbance for exploration. As with I2C, the backward messages were found to share similarities with the DAREs of LQR; however, unlike I2C, the input priors are fixed. AICO was generalised to Posterior Policy Iteration (PPI) [46, 47], in which a risk-tuned linear Gaussian controller is obtained from the inferred value function. The idea of controls as a random diffusion process is shared by Todorov [48], along with Path Integral (PI) Control (KL Control for discrete environments) [9, 49, 50], which takes advantage of the Feynman-Kac lemma to approximately solve the continuous-time Hamilton-Jacobi-Bellman equation using stochastic processes. PI methods iteratively compute local improvements to the controls, allowing them to be used to train parametric policies or for model predictive control.

5 Conclusion

In this work we have introduced Input Inference for Control (I2C), a novel control-as-inference formulation, by casting optimal control as Bayesian inference over the inputs. Through making the linear Gaussian assumption, we arrived at a tractable approximate EM algorithm with the use of message passing for approximate inference, and are able to draw connections with linear quadratic optimal control, Kalman filtering and Kalman smoothing through examination of the messages. Compared to prior work, this scheme employs natural regularization through the mechanisms of Bayesian inference, offering a more principled approach than currently established deterministic solvers. Moreover, our approach improves on previous probabilistic approaches by naturally incorporating and optimizing over actions, enabling us to retrieve time-varying feedback controllers. Future avenues of research include the analysis of different approximate inference techniques, such as Monte Carlo, variational methods and numerical quadrature, and the investigation of the trade-off between accuracy of inference, computational cost and benefit to control optimization.

Acknowledgments

The authors would like to thank Michael Lutter and Julen Urain for valuable feedback on the draft. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 640554 (SKILLS4ROBOTS).

References

[1] Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012.
[2] M. Toussaint, K. Allen, K. A. Smith, and J. B. Tenenbaum. Differentiable physics and stable modes for tool-use and manipulation planning. In Robotics: Science and Systems, 2018.
[3] D. E. Kirk. Optimal control theory: an introduction. Courier Corporation, 2012.
[4] R. F. Stengel. Stochastic optimal control: Theory and application. John Wiley & Sons, Inc., 1986.
[5] A. E. Bryson. Applied optimal control: Optimization, estimation and control. Routledge, 2018.
[6] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, 2006.
[7] H. Attias. Planning by probabilistic inference. In Proc. of the 9th Int. Workshop on Artificial Intelligence and Statistics, 2003.
[8] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.
[9] H. J. Kappen, V. Gómez, and M. Opper. Optimal control as a graphical model inference problem. In Proceedings of the Twenty-Third International Conference on Automated Planning and Scheduling, ICAPS 2013, 2013.
[10] S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
[11] P. Hennig, M. A. Osborne, and M. Girolami. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2015.
[12] M. Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
[13] C. Hoffmann and P. Rostalski. Linear optimal control on factor graphs - a message passing perspective. IFAC (International Federation of Automatic Control), 2017.
[14] D. H. Jacobson and D. Q. Mayne. Differential dynamic programming. 1970.
[15] W. Li and E. Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), 2004.
[16] J. van den Berg. Extended LQR: Locally-optimal feedback control for systems with non-linear dynamics and non-quadratic cost. In Robotics Research. Springer, 2016.
[17] S. Levine and V. Koltun. Guided policy search. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, 2013.
[18] H.-A. Loeliger, J. Dauwels, J. Hu, S. Korl, L. Ping, and F. R. Kschischang. The factor graph approach to model-based signal processing. Proceedings of the IEEE, 2007.
[19] Z. Ghahramani and G. E. Hinton. Parameter estimation for linear dynamical systems. Technical report, 1996.
[20] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 1977.
[21] L. Bruderer. Input estimation and dynamical system identification: New algorithms and results. PhD thesis, ETH Zurich, 2015.
[22] H.-A. Loeliger, L. Bruderer, H. Malmberg, F. Wadehn, and N. Zalmai. On sparsity by NUV-EM, Gaussian message passing, and Kalman smoothing. In 2016 Information Theory and Applications Workshop (ITA). IEEE, 2016.
[23] B. D. Anderson and J. B. Moore. Optimal filtering. Courier Corporation, 2012.
[24] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
[25] M. Aoki. Optimization of stochastic systems: topics in discrete-time systems, volume 32. Academic Press, 1967.
[26] Y. Bar-Shalom. Stochastic dynamic programming: Caution and probing. IEEE Transactions on Automatic Control, 1981.
[27] E. D. Klenske and P. Hennig. Dual control for approximate Bayesian reinforcement learning. Journal of Machine Learning Research, 2016.
[28] M. P. Deisenroth, G. Neumann, J. Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2013.
[29] B. D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.
[30] S. Levine and V. Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems, 2013.
[31] Z. Ghahramani and S. T. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Advances in Neural Information Processing Systems, 1999.
[32] B. M. Bell. The iterated Kalman smoother as a Gauss-Newton method. SIAM Journal on Optimization, 1994.
[33] X. Yi and C. Caramanis. Regularized EM algorithms: A unified framework and statistical guarantees. In Advances in Neural Information Processing Systems, 2015.
[34] D. Maclaurin, D. Duvenaud, and R. P. Adams. Autograd: Effortless gradients in numpy. In ICML 2015 AutoML Workshop, volume 238, 2015.
[35] W. Li and E. Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO 2004, Proceedings of the First International Conference on Informatics in Control, Automation and Robotics, 2004.
[36] E. Todorov and W. Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In Proceedings of the 2005 American Control Conference. IEEE, 2005.
[37] W. Sun, J. van den Berg, and R. Alterovitz. Stochastic extended LQR for optimization-based motion planning under uncertainty. IEEE Trans. Automation Science and Engineering, 2016.
[38] S. Levine. Motor skill learning with local trajectory methods. PhD thesis, Stanford University, 2014.
[39] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, 2014.
[40] E. Todorov. General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control. IEEE, 2008.
[41] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 1960.
[42] M. Kárný. Towards fully probabilistic control design. Automatica, 1996.
[43] M. Kárný and T. V. Guy. Fully probabilistic control design. Systems & Control Letters, 2006.
[44] J. Šindelář, I. Vajda, and M. Kárný. Stochastic control optimal in the Kullback sense. Kybernetika, 2008.
[45] M. Zima, L. Armesto, V. Girbés, A. Sala, and V. Šmídl. Extended Rauch-Tung-Striebel controller. In 52nd IEEE Conference on Decision and Control. IEEE, 2013.
[46] K. Rawlik, M. Toussaint, and S. Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
[47] K. C. Rawlik. On probabilistic inference approaches to stochastic optimal control. 2013.
[48] E. Todorov. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, 2007.
[49] Y. Pan, E. Theodorou, and M. Kontitsis. Sample efficient path integral control under uncertainty. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, 2015.
[50] V. Gómez, H. J. Kappen, J. Peters, and G. Neumann. Policy search for path integral control. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014, 2014.
[51] K. B. Petersen, M. S. Pedersen, et al. The matrix cookbook. Technical University of Denmark, 2008.

A The Dynamic Programming solution to the Linear Quadratic Regulator

Given a linear system, we wish to find a control sequence u*_{0:T} that minimizes a quadratic cost function over a finite time horizon T for a goal state x_g and input u_g:

min_{u_{0:T}} [(x_T − x_g)ᵀ Q_f (x_T − x_g) + ∑_{t=0}^{T−1} (x_t − x_g)ᵀ Q (x_t − x_g) + (u_t − u_g)ᵀ R (u_t − u_g)]
s.t. x_{t+1} = A x_t + a + B u_t  (11)

Solving this via Dynamic Programming, we construct a quadratic value function backwards through time to find the optimal control at each timestep, which we can calculate for Equation 11 using Bellman's Principle of Optimality. Starting with P_T = Q_f, p_T = −Q_f x_g and the scalar offset p_T = 0,

V_t(x) = xᵀ P_t x + 2 xᵀ p_t + p_t  (12)
      = min_u [(x_t − x_g)ᵀ Q (x_t − x_g) + (u_t − u_g)ᵀ R (u_t − u_g) + V_{t+1}(x_{t+1})]  (13)

The optimal input can be found to be linear in the state,

u*_t = −(R + Bᵀ P_{t+1} B)⁻¹ (Bᵀ P_{t+1} (A x_t + a) + Bᵀ p_{t+1} − R u_g)  (14)
    = K_t x_t + k_t  (15)

The parameters of the value function follow the recursive form

P_t = Q + Aᵀ P_{t+1} A − Aᵀ P_{t+1} B (R + Bᵀ P_{t+1} B)⁻¹ Bᵀ P_{t+1} A  (16)
p_t = Aᵀ (P_{t+1} a + p_{t+1} − P_{t+1} B (R + Bᵀ P_{t+1} B)⁻¹ (Bᵀ P_{t+1} a + Bᵀ p_{t+1} − R u_g)) − Q x_g  (17)
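For reference against the message-passing derivations that follow, the backward pass of Equations 14-17 can be transcribed directly. The following is a minimal numpy sketch (ours, not the authors' released code):

```python
import numpy as np

def lqr_backward(A, B, a, Q, R, x_g, u_g, Qf, T):
    """Finite-horizon LQR via dynamic programming (Equations 14-17).
    Returns time-varying gains K_t and offsets k_t for t = 0..T-1."""
    P, p = Qf, -Qf @ x_g                      # terminal value function
    K, k = [], []
    for _ in range(T):
        G = np.linalg.inv(R + B.T @ P @ B)
        K_t = -G @ B.T @ P @ A                # Equation 14, state term
        k_t = -G @ (B.T @ P @ a + B.T @ p - R @ u_g)
        # Value function recursion (Equations 16-17), using P_{t+1}, p_{t+1}
        P_new = Q + A.T @ P @ A - A.T @ P @ B @ G @ B.T @ P @ A
        p = A.T @ (P @ a + p - P @ B @ G @ (B.T @ P @ a + B.T @ p - R @ u_g)) - Q @ x_g
        P = P_new
        K.append(K_t)
        k.append(k_t)
    return K[::-1], k[::-1]
```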
B Derivation of I2C Linear Gaussian Messages

All messages are derived following the graphical model in Figure 2. Note the figure includes the intermediate variables (denoted with primes), used to add clarity to the derivations.

B.1 Forward Messages

The forward messages are very close to those of Kalman filtering, except the inputs are also observed and so have their own innovation step. The innovation and propagation of the input into the system dynamics:

ν→u'_t = ν→u_t + Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ (z_t − E_t µ→x_t − e_t)  (18)
Λ→u'_t = Λ→u_t + Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ F_t  (19)
µ→u''_t = B_t µ→u'_t  (20)
Σ→u''_t = B_t Σ→u'_t Bᵀ_t  (21)

The innovation and propagation of the state, incorporating the input:

ν→x'_t = ν→x_t + Eᵀ_t (Σ_ξ + F_t Σ→u_t Fᵀ_t)⁻¹ (z_t − F_t µ→u_t − e_t)  (22)
Λ→x'_t = Λ→x_t + Eᵀ_t (Σ_ξ + F_t Σ→u_t Fᵀ_t)⁻¹ E_t  (23)
µ→x''_t = A_t µ→x'_t + a_t  (24)
Σ→x''_t = A_t Σ→x'_t Aᵀ_t  (25)
µ→x'''_t = µ→x''_t  (26)
Σ→x'''_t = Σ→x''_t + Σ_η_t  (27)
µ→x_{t+1} = µ→x'''_t + µ→u''_t  (28)
Σ→x_{t+1} = Σ→x'''_t + Σ→u''_t  (29)

B.2 Backward Messages

The most efficient means of constructing the backward messages for marginalisation is to make use of the 'auxiliary' form (see [18, 21]), which has several useful properties for message propagation. In particular, the auxiliary quantities are invariant to the Addition factor, so the various offsets are automatically considered:

Λ̃_x = (Σ→x + Σ←x)⁻¹ = Λ→x − Λ→x Σ_x Λ→x  (30)
ν̃_x = Λ̃_x (µ→x − µ←x) = ν→x − Λ→x µ_x  (31)

Like the marginal, they are a fusion of the forward and backward messages, but in the 'dual' form. For initialising the backward pass there are two approaches. One is to follow the idea of the terminal cost from Equation 11, where for example P_T = Q_f = Q and p_T = 0, i.e. Λ←x_T = Λ_ξ and ν←x_T = 0. The marginals can then be constructed following Equation 5. Any Q_f can be used, so long as Λ←x_T is constructed with an α following Section 2.1. In practice, it was found crucial to tune up this terminal cost to ensure the target state is reached with a responsive controller (as many target states lie at unstable equilibria). However, Q_f then represents another (multi-dimensional) hyperparameter to tune. Using the probabilistic perspective, we instead choose Σ_x_T such that the prior Σ→x_T is reduced by a scale factor κ. While this deviates from the previous quadratic cost formulation into an adaptive cost function, it was found to be both simple and effective when tackling difficult domains. The adaptation becomes an important quality for nonlinear problems where the initial dynamics are stable and the target state dynamics are unstable. This scheme acts to tune up the terminal cost as the dynamics become more unstable, which causes the state uncertainty to grow at a greater rate, which in turn acts to increase the responsiveness of the controller. As we also wish to keep the prior and posterior trajectories tight during optimization (to ensure the linearization assumption is valid), we set µ_x_T = µ→x_T. In the experiments of Section 3.2, Q_f = Q.

Starting with Σ_x_T = Σ→x_T and µ_x_T = µ→x_T, construct the auxiliary for x_{t+1}:

Λ̃_x_{t+1} = Λ→x_{t+1} − Λ→x_{t+1} Σ_x_{t+1} Λ→x_{t+1}  (32)
ν̃_x_{t+1} = ν→x_{t+1} − Λ→x_{t+1} µ_x_{t+1}  (33)

The auxiliary is invariant across an addition operation, so

Λ̃_x''_t = Λ̃_x'''_t = Λ̃_x_{t+1}  (34)
ν̃_x''_t = ν̃_x'''_t = ν̃_x_{t+1}  (35)

Propagate the state belief backwards through the system dynamics:

Λ̃_x'_t = Aᵀ_t Λ̃_x''_t A_t  (36)
ν̃_x'_t = Aᵀ_t ν̃_x''_t  (37)

Marginalized variables are invariant across the Equality node, so marginalize x_t at x'_t:

Σ_x_t = Σ_x'_t = Σ→x'_t − Σ→x'_t Λ̃_x'_t Σ→x'_t  (38)
µ_x_t = µ_x'_t = µ→x'_t − Σ→x'_t ν̃_x'_t  (39)

To find µ_u_t, note that due to the addition operation, the auxiliary of u''_t is equal to that of x'''_t:

Λ̃_u''_t = Λ̃_x'''_t  (40)
ν̃_u''_t = ν̃_x'''_t  (41)
Λ̃_u'_t = Bᵀ_t Λ̃_u''_t B_t  (42)
ν̃_u'_t = Bᵀ_t ν̃_u''_t  (43)
Σ_u_t = Σ_u'_t = Σ→u'_t − Σ→u'_t Λ̃_u'_t Σ→u'_t  (44)
µ_u_t = µ_u'_t = µ→u'_t − Σ→u'_t ν̃_u'_t  (45)
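To make the auxiliary-form backward pass concrete, here is a minimal numpy sketch of one smoothing step (our function names; the innovation-step forward moments µ→x'_t, Σ→x'_t are assumed given from the forward pass):

```python
import numpy as np

def auxiliary_from_marginal(mu_fwd, Sig_fwd, mu_post, Sig_post):
    """Auxiliary ('dual') form at x_{t+1} (Equations 32-33), built from
    the forward message and the current marginal."""
    Lam_fwd = np.linalg.inv(Sig_fwd)
    Lam_aux = Lam_fwd - Lam_fwd @ Sig_post @ Lam_fwd
    nu_aux = Lam_fwd @ mu_fwd - Lam_fwd @ mu_post
    return nu_aux, Lam_aux

def smooth_state(A, mu_fwd_inno, Sig_fwd_inno, nu_aux_next, Lam_aux_next):
    """One backward step: the auxiliary quantities pass unchanged through
    the addition nodes (Equations 34-35), propagate through the dynamics
    (Equations 36-37), and yield the marginal of x_t (Equations 38-39)."""
    Lam_aux = A.T @ Lam_aux_next @ A
    nu_aux = A.T @ nu_aux_next
    Sig = Sig_fwd_inno - Sig_fwd_inno @ Lam_aux @ Sig_fwd_inno
    mu = mu_fwd_inno - Sig_fwd_inno @ nu_aux
    return mu, Sig
```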
B.3 Riccati Backward Messages for Control

To understand the relation to optimal control, the backward messages must be represented recursively as a Discrete Algebraic Riccati Equation.

Recursion of the precision:

Λ←x_t = Λ←x'_t + Eᵀ_t Λ←z'_t E_t  (46)
      = Aᵀ_t Λ←x''_t A_t + Eᵀ_t Λ←z'_t E_t  (47)
Λ←z'_t = (Σ_ξ + F_t Σ→u_t Fᵀ_t)⁻¹  (48)

Using the matrix inversion identity (A⁻¹ + B)⁻¹ = A − A(A + B⁻¹)⁻¹A [51]:

Λ←x''_t = (Σ_η_t + Σ→u''_t + Σ←x_{t+1})⁻¹  (49)
        = Λ←x_{t+1} − Λ←x_{t+1} ((Σ_η_t + Σ→u''_t)⁻¹ + Λ←x_{t+1})⁻¹ Λ←x_{t+1}  (50)
Σ→u''_t = B_t (Λ→u_t + Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ F_t)⁻¹ Bᵀ_t  (51)

So the recursion in full is

Λ←x_t = Eᵀ_t (Σ_ξ + F_t Σ→u_t Fᵀ_t)⁻¹ E_t + Aᵀ_t Λ←x_{t+1} A_t
      − Aᵀ_t Λ←x_{t+1} ((Σ_η_t + B_t (Λ→u_t + Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ F_t)⁻¹ Bᵀ_t)⁻¹ + Λ←x_{t+1})⁻¹ Λ←x_{t+1} A_t  (52)

Recursion of the scaled-mean:

ν←x_t = ν←x'_t + Eᵀ_t Λ←z'_t (z_t − F_t µ→u_t − e_t)  (53)
ν←x'_t = Aᵀ_t Λ←x''_t (µ←x''_t − a_t) = Aᵀ_t Λ←x''_t (Σ←x_{t+1} ν←x_{t+1} − µ→u''_t − a_t)  (54)

Substituting Equations 18-21 into Equation 54 (using Λ←z''_t = (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ for brevity):

ν←x'_t = Aᵀ_t Λ←x''_t (Σ←x_{t+1} ν←x_{t+1} − a_t − B_t Σ→u'_t (ν→u_t + Fᵀ_t Λ←z''_t (z_t − E_t µ→x_t − e_t)))  (55)

Substituting Λ←x''_t through Equation 50:

ν←x'_t = Aᵀ_t (Λ←x_{t+1} − Λ←x_{t+1} ((Σ_η_t + Σ→u''_t)⁻¹ + Λ←x_{t+1})⁻¹ Λ←x_{t+1})
         (Σ←x_{t+1} ν←x_{t+1} − a_t − B_t (Λ→u_t + Fᵀ_t Λ←z''_t F_t)⁻¹ (ν→u_t + Fᵀ_t Λ←z''_t z_t))  (56)

So the full recursion is

ν←x_t = Aᵀ_t (I − Λ←x_{t+1} ((Σ_η_t + B_t (Λ→u_t + Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ F_t)⁻¹ Bᵀ_t)⁻¹ + Λ←x_{t+1})⁻¹)
        (ν←x_{t+1} − Λ←x_{t+1} a_t − Λ←x_{t+1} B_t (Λ→u_t + Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ F_t)⁻¹ (ν→u_t + Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ (z_t − E_t µ→x_t − e_t)))
      + Eᵀ_t (Σ_ξ + F_t Σ→u_t Fᵀ_t)⁻¹ (z_t − F_t µ→u_t − e_t)  (57)
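The precision recursion of Equation 52 is mechanical to implement. Below is a minimal numpy sketch of one backward step (our variable names, time subscripts dropped from the arguments); per Section B.5, it reduces to the LQR Riccati recursion as the input priors broaden and the dynamics become deterministic:

```python
import numpy as np

def backward_precision(Lam_next, A, B, E, F, Sig_xi, Sig_eta,
                       Lam_u_prior, Sig_x_fwd, Sig_u_fwd):
    """One step of the Riccati-form backward state precision (Equation 52)."""
    # 'State cost' term, E^T (Sig_xi + F Sig_u_fwd F^T)^-1 E
    state_cost = E.T @ np.linalg.inv(Sig_xi + F @ Sig_u_fwd @ F.T) @ E
    # Effective propagated input covariance after the input innovation (Eq. 51)
    Lam_u = Lam_u_prior + F.T @ np.linalg.inv(Sig_xi + E @ Sig_x_fwd @ E.T) @ F
    Sig_u_eff = B @ np.linalg.inv(Lam_u) @ B.T
    M = np.linalg.inv(np.linalg.inv(Sig_eta + Sig_u_eff) + Lam_next)
    return state_cost + A.T @ Lam_next @ A - A.T @ Lam_next @ M @ Lam_next @ A
```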
B.4 Linear Gaussian Controller

To extract the linear Gaussian controllers, we find the conditional distribution between u_t and x_t. The input estimate is marginalized by fusing the forward and backward messages:

µ_u_t = Σ_u_t (ν→u_t + ν←u_t)  (58)
ν←u_t = Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ (z_t − E_t µ→x_t − e_t) + Bᵀ_t ν←u''_t  (59)
      = Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ (z_t − E_t µ→x_t − e_t) + Bᵀ_t Λ←u''_t µ←u''_t  (60)
      = Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ (z_t − E_t µ→x_t − e_t) + Bᵀ_t Λ←u''_t (µ←x_{t+1} − µ→x'''_t)  (61)

Eventually we need an expression in terms of the marginal x_t, so we need to be able to express Equation 61 in terms of µ_x_t. Taking the marginalisation rule from Equation 5:

µ_x'''_t = A_t µ_x_t + a_t = Σ_x'''_t (ν→x'''_t + ν←x'''_t)  (62)
µ→x'''_t = Σ→x'''_t (Λ_x'''_t µ_x'''_t − ν←x'''_t)  (63)
µ←x_{t+1} − µ→x'''_t = µ←x_{t+1} + Σ→x'''_t ν←x'''_t − Σ→x'''_t Λ_x'''_t µ_x'''_t  (64)

To find Λ←u''_t, we can use the matrix inversion identity (A⁻¹ + B⁻¹)⁻¹ = B(A + B)⁻¹A = A(A + B)⁻¹B [51]:

Λ←u''_t = (Σ←x_{t+1} + Σ→x'''_t)⁻¹  (65)
        = Λ→x'''_t (Λ←x_{t+1} + Λ→x'''_t)⁻¹ Λ←x_{t+1}  (66)
        = Λ←x_{t+1} (Λ←x_{t+1} + Λ→x'''_t)⁻¹ Λ→x'''_t  (67)

We introduce the dimensionless term Γ for brevity, and will discuss interpretations of it in subsequent sections:

Γ_{t+1} = Λ→x'''_t (Λ←x_{t+1} + Λ→x'''_t)⁻¹  (68)
I − Γ_{t+1} = Λ←x_{t+1} (Λ←x_{t+1} + Λ→x'''_t)⁻¹  (69)

This allows us to express the last term of Equation 61 as

Λ←u''_t (µ←x_{t+1} − µ→x'''_t) = Γ_{t+1} ν←x_{t+1} + (I − Γ_{t+1}) ν←x'''_t − Γ_{t+1} Λ←x_{t+1} Σ→x'''_t Λ_x'''_t µ_x'''_t  (70)

where

ν←x'''_t = Λ←x'''_t (µ←x_{t+1} − µ→u''_t)  (71)

To develop the last term of Equation 70, recall the marginalisation rule for Λ (Equation 5):

Σ→x'''_t Λ_x'''_t µ_x'''_t = Σ→x'''_t (Λ→x'''_t + Λ←x'''_t) µ_x'''_t  (72)

To understand this better, it is best to expand Λ←x'''_t:

Λ←x'''_t = (Σ←x_{t+1} + Σ→u''_t)⁻¹ = Λ→u''_t (Λ←x_{t+1} + Λ→u''_t)⁻¹ Λ←x_{t+1}  (73)

Applying this to Equation 72 and introducing Ψ (another dimensionless scaling term):

Σ→x'''_t (Λ→x'''_t + Λ←x'''_t) µ_x'''_t = Σ→x'''_t (Λ→x'''_t + Λ→u''_t (Λ←x_{t+1} + Λ→u''_t)⁻¹ Λ←x_{t+1}) µ_x'''_t  (74)
                                       = Ψ_{t+1} µ_x'''_t  (75)

where

Ψ_{t+1} = Σ→x'''_t (Λ→x'''_t + Λ→u''_t (Λ←x_{t+1} + Λ→u''_t)⁻¹ Λ←x_{t+1})  (76)

To summarize:

µ_u_t = Σ_u_t (ν→u_t + Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ (z_t − E_t µ→x_t − e_t)
        + Bᵀ_t (Γ_{t+1} ν←x_{t+1} + (I − Γ_{t+1}) ν←x'''_t − Γ_{t+1} Λ←x_{t+1} Ψ_{t+1} (A_t µ_x_t + a_t)))  (77)

To find Σ_u_t:

Σ_u_t = (Λ→u_t + Λ←u_t)⁻¹  (78)
Σ_u_t = (Λ→u_t + Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ F_t + Bᵀ_t (Σ←x_{t+1} + Σ→x'''_t)⁻¹ B_t)⁻¹  (79)

Using the matrix inversion identity (A⁻¹ + B⁻¹)⁻¹ = B(A + B)⁻¹A [51]:

Σ_u_t = (Λ→u_t + Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ F_t + Bᵀ_t Λ→x'''_t (Λ→x'''_t + Λ←x_{t+1})⁻¹ Λ←x_{t+1} B_t)⁻¹  (80)
Σ_u_t = (Λ→u_t + Fᵀ_t (Σ_ξ + E_t Σ→x_t Eᵀ_t)⁻¹ F_t + Bᵀ_t Γ_{t+1} Λ←x_{t+1} B_t)⁻¹  (81)
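Before interpreting Γ and Ψ, the definitions of Equations 68 and 76 are simple to compute; the following sketch (our construction, with invented test values) also demonstrates the 'turn-off' limit discussed next:

```python
import numpy as np

def gamma_psi(Lam_bwd_next, Sig_fwd_x, Sig_fwd_u):
    """The dimensionless scale matrices of Equations 68 and 76, from the
    backward precision and the forward (prior) covariances."""
    Lam_fwd_x = np.linalg.inv(Sig_fwd_x)   # process term, Lam_{->x'''}
    Lam_fwd_u = np.linalg.inv(Sig_fwd_u)   # input prior term, Lam_{->u''}
    Gamma = Lam_fwd_x @ np.linalg.inv(Lam_bwd_next + Lam_fwd_x)
    Psi = Sig_fwd_x @ (Lam_fwd_x + Lam_fwd_u
                       @ np.linalg.inv(Lam_bwd_next + Lam_fwd_u) @ Lam_bwd_next)
    return Gamma, Psi

# High process uncertainty 'turns off' the control terms: Gamma -> 0
G, _ = gamma_psi(np.eye(2), 1e6 * np.eye(2), np.eye(2))
print(np.round(G, 6))  # approximately zero
```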
Interpreting the scale matrices Γ and Ψ

Deriving the controller led to the emergence of two scale matrices Γ and Ψ, representing matrix fractions of the forward messages (uncertainty) and backward messages (optimality). To interpret the meaning of these terms, there are four scenarios that are important to consider: high process uncertainty (Λ→x'''_t → 0), low process uncertainty (Λ→x'''_t → ∞), high input prior uncertainty (Λ→u''_t → 0) and low input prior uncertainty (Λ→u''_t → ∞). Note, 'process uncertainty' includes accumulated uncertainty from previous timesteps, including that from the input priors; the input prior described above is specific to that timestep.

1. High process uncertainty, high input prior uncertainty. Here Γ_{t+1} → 0, Ψ_{t+1} → 0 and Λ←x'''_t → 0. Therefore the controller becomes cut off from the backward messages (i.e. any sense of optimality) and becomes a weighted average of its prior and the goal state.

2. High process uncertainty, low input prior uncertainty. Here Γ_{t+1} → 0, Ψ_{t+1} → Γ⁻¹_{t+1} and Λ←x'''_t → Λ←x_{t+1}. Despite the system uncertainty, the controller confidence reactivates the control terms by cancelling out Γ.

3. Low process uncertainty, high input prior uncertainty. Here Γ_{t+1} → I, Ψ_{t+1} → I and Λ←x'''_t → 0. This is the equivalent LQR setting, assuming the deterministic controller is used (see Section B.5).

4. Low process uncertainty, low input prior uncertainty. Here Γ_{t+1} → I, Ψ_{t+1} → I and Λ←x'''_t → Λ←x_{t+1}. As above, this is similar to the LQR setting; however, now the controller update will be closer to its prior.

B.5 Equivalence to the Dynamic Programming LQR Solution

First, remembering that the backward messages correspond to likelihoods, the log-likelihood of a Gaussian distribution is

log N(x; µ, Σ) = (x − µ)ᵀ Σ⁻¹ (x − µ) + constant  (82)
              = xᵀ Σ⁻¹ x − 2 xᵀ Σ⁻¹ µ + constant  (83)
              = xᵀ Λ x − 2 xᵀ ν + constant  (84)

By comparing this to the LQR value function in Equation 12, the equivalence between Λ, P, −ν and p outlined in Table 1 may be appreciated. To arrive at the recursive LQR expressions outlined in Section A, we must consider linear models, deterministic dynamics and infinitely broad priors, which requires Σ_η → 0 and Λ→u_t → 0. Additionally, from the formulation outlined in Section 2.1, Eᵀ Λ_ξ E = αQ and Fᵀ_t Λ_ξ F_t = αR. To recover the LQR result, we require the observation likelihood to dominate, which occurs for suitably large α. Moreover, given large input priors, we require α such that (Σ_ξ + F_t Σ→u_t Fᵀ_t)⁻¹ ≈ Λ_ξ, therefore α → ∞ as Λ→u_t → 0. As the value function parameters scale linearly with the cost function and the controller is invariant to this scale, we can omit α from the analysis for brevity.
Recursion of the precision. Applying these conditions to the Λ←x_t recursion in Equation 52:

Λ←x_t = Q + Aᵀ Λ←x_{t+1} A − Aᵀ Λ←x_{t+1} ((B R⁻¹ Bᵀ)⁻¹ + Λ←x_{t+1})⁻¹ Λ←x_{t+1} A  (85)
      = Q + Aᵀ Λ←x_{t+1} A − Aᵀ Λ←x_{t+1} B (R + Bᵀ Λ←x_{t+1} B)⁻¹ Bᵀ Λ←x_{t+1} A  (86)

Equation 85 to 86 is achieved using the identity (A + B)⁻¹ = B⁻¹(B⁻¹ + A⁻¹)⁻¹A⁻¹ [51],

((B R⁻¹ Bᵀ)⁻¹ + Λ←x_{t+1})⁻¹ = Σ←x_{t+1} (B R⁻¹ Bᵀ + Σ←x_{t+1})⁻¹ B R⁻¹ Bᵀ  (87)

along with (A + Jᵀ B J)⁻¹ Jᵀ B = A⁻¹ Jᵀ (B⁻¹ + J A⁻¹ Jᵀ)⁻¹ [51]:

= B (R + Bᵀ Λ←x_{t+1} B)⁻¹ Bᵀ  (88)

Recursion of the scaled-mean. Applying the conditions to the ν←x_t recursion in Equation 57, along with the identity tricks used above:

ν←x_t = Aᵀ (I − Λ←x_{t+1} ((B R⁻¹ Bᵀ)⁻¹ + Λ←x_{t+1})⁻¹)(ν←x_{t+1} − Λ←x_{t+1} (B R⁻¹ (R u_g)) − Λ←x_{t+1} a) + Q x_g  (89)
      = Aᵀ (I − Λ←x_{t+1} B (R + Bᵀ Λ←x_{t+1} B)⁻¹ Bᵀ)(ν←x_{t+1} − Λ←x_{t+1} (B R⁻¹ (R u_g)) − Λ←x_{t+1} a) + Q x_g  (90)
      = Aᵀ (ν←x_{t+1} − Λ←x_{t+1} a − Λ←x_{t+1} B u_g − Λ←x_{t+1} B (R + Bᵀ Λ←x_{t+1} B)⁻¹ Bᵀ (ν←x_{t+1} − Λ←x_{t+1} B u_g − Λ←x_{t+1} a)) + Q x_g  (91)

Here, there is a discrepancy in the u_g terms, but this can be rectified by adding R u_g − R u_g and rearranging:

      = Aᵀ (ν←x_{t+1} − Λ←x_{t+1} a − Λ←x_{t+1} B u_g − Λ←x_{t+1} B (R + Bᵀ Λ←x_{t+1} B)⁻¹ (Bᵀ ν←x_{t+1} − Bᵀ Λ←x_{t+1} B u_g + R u_g − R u_g − Bᵀ Λ←x_{t+1} a)) + Q x_g  (92)

−(R + Bᵀ Λ←x_{t+1} B) u_g can be taken outside to cancel the existing term there, so only one u_g term remains:

      = Aᵀ (ν←x_{t+1} − Λ←x_{t+1} a − Λ←x_{t+1} B u_g + Λ←x_{t+1} B u_g − Λ←x_{t+1} B (R + Bᵀ Λ←x_{t+1} B)⁻¹ (Bᵀ ν←x_{t+1} + R u_g − Bᵀ Λ←x_{t+1} a)) + Q x_g  (93)
      = Aᵀ (ν←x_{t+1} − Λ←x_{t+1} a − Λ←x_{t+1} B (R + Bᵀ Λ←x_{t+1} B)⁻¹ (Bᵀ ν←x_{t+1} − Bᵀ Λ←x_{t+1} a + R u_g)) + Q x_g  (94)

Recall that p_t is equivalent to −ν_t, so all non-ν terms should have the opposite sign to those in Equation 17.

The linear Gaussian controller. As mentioned above, for the LQR conditions Γ_{t+1} → I, Ψ_{t+1} → I and Λ←x'''_t → 0. Applying the conditions to the controller:

Σ_u_t = (R + Bᵀ_t Λ←x_{t+1} B_t)⁻¹,  (95)
µ_u_t = −Σ_u_t (−R u_g + Bᵀ_t (−ν←x_{t+1} + Λ←x_{t+1} (A µ_x_t + a))),  (96)

remembering that ν←x_{t+1} has the opposite sign to p_{t+1}.

Expanding on the case of high uncertainty, where Γ_t → 0, the stochastic controller is independent of the backward messages (and therefore any notion of optimality). Therefore it depends purely on a weighted combination of its prior and the goal:

Σ_u_t = (Λ→u_t + αR)⁻¹,  (97)
µ_u_t = (Λ→u_t + αR)⁻¹ (ν→u_t + αR u_g),  (98)

where the stationary distribution would be Σ_u_t → 0 and µ_u_t → 0.
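This limit can be checked numerically. The sketch below (our construction; B is chosen square and full-rank so that (B R⁻¹ Bᵀ)⁻¹ in Equation 85 is well-defined) iterates Equation 16 and Equation 85 side by side and confirms they agree:

```python
import numpy as np

# Numerical check of Section B.5: with deterministic dynamics and broad
# input priors, the backward precision recursion (Equation 85/86) matches
# the LQR Riccati recursion (Equation 16).
A = np.array([[1.1, 0.0], [0.1, 1.1]])
B = 0.1 * np.eye(2)           # full rank so B R^-1 B^T is invertible
Q, R = 10.0 * np.eye(2), np.eye(2)

P = Q.copy()                  # LQR value precision
Lam = Q.copy()                # backward message precision
for _ in range(50):
    G = np.linalg.inv(R + B.T @ P @ B)
    P = Q + A.T @ P @ A - A.T @ P @ B @ G @ B.T @ P @ A          # Eq. 16
    M = np.linalg.inv(np.linalg.inv(B @ np.linalg.inv(R) @ B.T) + Lam)
    Lam = Q + A.T @ Lam @ A - A.T @ Lam @ M @ Lam @ A            # Eq. 85
print(np.allclose(P, Lam))    # True
```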
B.6 Hyperparameter Sensitivity

For nonlinear tasks, the crucial hyperparameters of I²C are the initial input prior $\Sigma_{\overrightarrow{u}}$ and the update limit $\delta_\alpha$ on $\alpha$ (motivated as a KL bound). The role of $\Sigma_{\overrightarrow{u}_t}$ is to facilitate exploration, but too much uncertainty in the trajectory causes the linearization assumption to become invalid during inference. This failure mode manifests as the posterior inputs becoming inaccurate, so that the subsequent prior trajectory deviates from the previous posterior trajectory, and the predicted performance of the controller diverges from the true performance when evaluated on the actual system. Therefore, for fast and successful convergence, $\Sigma_{\overrightarrow{u}}$ depends not only on the expected input range, but should also be tuned to the inherent uncertainty of the system and the nonlinearity of the dynamics.

Even after tuning $\Sigma_{\overrightarrow{u}}$, the approximate inference can fail after aggressive updates to $\alpha$ in the M-step, due to the approximate nature of the log-likelihood evaluation under the linearization assumption. A KL bound, simplified to a bound $\delta_\alpha$ on the update ratio, smooths the optimization by limiting such aggressive updates. When tuning, $\delta_\alpha$ should smooth out any large updates to $\alpha$ while limiting the impact on the rate of convergence. Note that since $\alpha$ is increasing, it is numerically easier to work with $\alpha^{-1}$, which tends to zero, so the limit is implemented in practice as $\delta_{\alpha^{-1}}$ acting on the ratio $\alpha_i / \alpha_{i+1}$. Figure 6 demonstrates the behaviour of these hyperparameters for the Cartpole swing-up task.

Environment     | z                                                    | z_g                          | Θ = diag(Q, R)                      | u limit   | Σ_η
Pendulum        | [sin θ, cos θ, θ̇, u]⊺                               | [0, 1, 0, 0]⊺                | diag(1, 100, 1, 1)                  | [-2, 2]   | diag(ε1, ε3)
Cartpole        | [x, sin θ, cos θ, ẋ, θ̇, u]⊺                         | [0, 0, 1, 0, 0, 0]⊺          | diag(1, 1, 100, 1, 1, 1)            | [-5, 5]   | diag(ε1, ε1, ε2, ε2)
Double Cartpole | [x, sin θ1, cos θ1, sin θ2, cos θ2, ẋ, θ̇1, θ̇2, u]⊺  | [0, 0, 1, 0, 1, 0, 0, 0, 0]⊺ | diag(1, 1, 100, 1, 100, 1, 1, 1, 1) | [-10, 10] | diag(ε1, ε1, ε1, ε2, ε2, ε2)

Table 3: Environment parameters of the nonlinear tasks, where ε1 = 1e-12, ε2 = 1e-6 and ε3 = 1e-3.

[Figure 6: Difference between evaluated cost and predicted cost over EM iterations across hyperparameters for the Cartpole swing-up task. (a) Varying Σ_→u ∈ {0.05, 0.1, 0.25, 0.5, 1.0} with δ_α⁻¹ = 0: smaller values delay the point of divergence, although optimization progress is also slowed. (b) Varying δ_α⁻¹ ∈ {0, 0.9, 0.99, 0.993, 0.999} with Σ_→u = 0.25: increasing δ_α⁻¹ stabilizes the approximate inference.]

C Experimental Details

C.1 Equivalence with finite-horizon LQR by Dynamic Programming

$$x_{t+1} = \begin{bmatrix} 1.1 & 0 \\ 0.1 & 1.1 \end{bmatrix} x_t + \begin{bmatrix} 0.1 \\ 0 \end{bmatrix} u_t + \begin{bmatrix} -1 \\ -2 \end{bmatrix} \qquad (99)$$

$$Q = \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix}, \quad R = [1], \quad x_g = \begin{bmatrix} 10 \\ 10 \end{bmatrix}, \quad u_g = [0], \quad \alpha = 1\mathrm{e}5, \quad \Sigma_{\overrightarrow{u}_t} = [100] \qquad (100)$$
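For reference, the finite-horizon LQR solution for this system can be computed with a standard dynamic-programming backward pass. The sketch below is illustrative rather than the experiment code: the horizon T is an assumed value, and only the quadratic value term P_t and gains K_t are tracked, while the affine terms arising from $a$, $x_g$ and $u_g$ enter the linear term $p_t$ via the recursion derived in Section B.5.

```python
import numpy as np

# System from Eq. (99)-(100)
A = np.array([[1.1, 0.0],
              [0.1, 1.1]])
B = np.array([[0.1],
              [0.0]])
Q = np.diag([10.0, 10.0])
R = np.array([[1.0]])
T = 100  # horizon, assumed here for illustration

# Backward Riccati recursion with terminal cost Q_f = Q (Section C.2)
P = Q.copy()
gains = []
for _ in range(T - 1):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # feedback gain K_t
    P = Q + A.T @ P @ (A - B @ K)                      # value-function update
    gains.append(K)
gains.reverse()  # gains[t] gives u_t = -K_t x_t, plus an affine term from p_t (omitted)
```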
C.2 Evaluation on nonlinear trajectory optimization tasks

Both iLQR and GPS required the cost function in Table 3 to be scaled in order to have good numerics. In Tables 5 and 6 we refer to this scaling as α, as it performs the same role as the I²C parameter. Additionally, iLQR and GPS were unable to optimize without a random initialization. In order to compare with I²C, which by design initializes with fixed priors, the random initialization was set to the smallest amplitude that allowed optimization to take place. All algorithms converged faster with random initialization; however, such 'warm start' strategies were not the focus of this work, as we wished to highlight the strength of I²C's deterministic initialization. For these experiments we use the terminal cost Q_f = Q.

Environment     | Σ_→u (init.) | α (init.) | δ_α⁻¹
Pendulum        | [0.2]        | 1/100     | 0.99
Cartpole        | [0.25]       | 1/67      | 0.993
Double Cartpole | [0.04]       | 1/90      | 0.9995

Table 4: I²C parameters for the nonlinear trajectory optimization tasks.

Environment     | λ range    | λ multiplier | σ_k (init.) | α
Pendulum        | [1 - 1e-9] | 1.002        | 1e-2        | 1e-4
Cartpole        | [1 - 1e-7] | 1.001        | 1e-2        | 1e-3
Double Cartpole | [1 - 1e-7] | 1.001        | 1e-2        | 1e-3

Table 5: iLQR parameters for the nonlinear trajectory optimization tasks.

Environment     | Σ Explore | KL bound | σ_k (init.) | α
Pendulum        | [2.0]     | 0.07     | 1e-2        | 1e-4
Cartpole        | [1.25]    | 1.0      | 1e-1        | 1e-3
Double Cartpole | [5.0]     | 0.75     | 1e-1        | 1e-3

Table 6: GPS parameters for the nonlinear trajectory optimization tasks.

[Figure 7: Comparison of the state-action trajectories of I²C, iLQR and GPS on the Cartpole swing-up task after convergence. Panels show x, θ, ẋ, θ̇ and the input u over 1,000 timesteps.]

[Figure 8: Comparison of the state-action trajectories of I²C, iLQR and GPS on the Double Cartpole swing-up task after convergence. Panels show x, θ1, θ2, ẋ, θ̇1, θ̇2 and the input u over 1,000 timesteps.]
