Adaptive Optimal Control for Reference Tracking Independent of Exo-System Dynamics
Authors: Florian K"opf, Johannes Westermann, Michael Flad
Florian Köpf^a,* , Johannes Westermann^b , Michael Flad^a , Sören Hohmann^a

^a Institute of Control Systems, Karlsruhe Institute of Technology (KIT), 76131 Karlsruhe, Germany
^b Intelligent Sensor-Actuator-Systems Laboratory, Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany

* Corresponding author. Email address: florian.koepf@kit.edu (Florian Köpf)

Abstract

Model-free control based on the idea of Reinforcement Learning is a promising approach that has recently gained extensive attention. However, Reinforcement-Learning-based control methods solely focus on the regulation problem or learn to track a reference that is generated by a time-invariant exo-system. In the latter case, controllers are only able to track the time-invariant reference dynamics which they have been trained on and need to be re-trained each time the reference dynamics change. Consequently, these methods fail in a number of applications which obviously rely on a trajectory not being generated by an exo-system. One prominent example is autonomous driving. This paper provides for the first time an adaptive optimal control method capable of tracking reference trajectories that are not generated by a time-invariant exo-system. The main innovation is a novel Q-function that directly incorporates a given reference trajectory on a moving horizon. This new Q-function exhibits a particular structure which allows the design of an efficient, iterative, provably convergent Reinforcement Learning algorithm that enables optimal tracking. Two real-world examples demonstrate the effectiveness of our new method.

Keywords: Adaptive Dynamic Programming, Optimal Tracking, Reinforcement Learning, Learning Systems, Machine Learning, Artificial Intelligence, Optimal Control

1. Introduction

Reinforcement Learning (RL) has recently gained extensive attention as a model-free adaptive optimal control method which learns optimal behavior from interaction with the environment and observations of resulting states and rewards (see e.g. [1-3] and references therein).

The existing control-oriented approaches can be classified in two groups: Some methods focus on the regulation problem, i.e. steering the system towards an equilibrium point (typically $0$), e.g. [3-7]. The second group is devoted to a specific tracking case, where the reference $r_k$ to be tracked is generated by an exo-system, i.e. it is assumed that there exists $r_{k+1} = f_\mathrm{ref}(r_k)$.

It is noteworthy that a controller that has learned to solve the regulation problem is not directly applicable to the tracking case, as the reference trajectory influences the associated rewards that the RL agent is facing and therefore affects the value function which has to be learned.

In order to extend learning-based controllers to the tracking case, various RL-based tracking controllers belonging to the second group of approaches have been proposed recently. For unknown internal system dynamics but a known control coefficient matrix, adaptive tracking controllers approximating the value function are proposed for discrete-time [8] and continuous-time [9, 10] systems.
Notable critic-only Q-learning methods for completely unknown dynamics are proposed in [11] and [12], where [12] focuses on the linear-quadratic setting and [11] allows nonlinear system dynamics. Other works combine system identification and adaptive schemes including past information [13], use filter-based goal representation Heuristic Dynamic Programming for tracking [14], focus on learning a tracking controller from input-output data rather than assuming full state information [15], consider the nonzero-sum game case [16], utilize the exo-system of a prescribed robot impedance model in order to learn a model-following behavior for assistive human-robot interaction [17], focus on systems with matched uncertainties [18] or consider tracking on an infinite horizon with unbounded cost [19].

All of these methods [8-19] consider the case where the reference trajectory is generated by a time-invariant exo-system $f_\mathrm{ref}$, i.e. they rely on the assumption that the reference $r_k$ follows $r_{k+1} = f_\mathrm{ref}(r_k)$ (respectively $\dot{r}(t) = f_\mathrm{ref}(r(t))$ in the continuous-time case). However, for various applications, this assumption does not hold (e.g. vehicles that follow a road, robots in sophisticated human-machine collaboration or specific time-varying sequences in process engineering). Consequently, present exo-system-based methods are not suited for these types of applications. This is because the learned parameters (i.e. Q-function or value function approximations and associated control laws) correspond to the system dynamics and reference dynamics $f_\mathrm{ref}$ and thus need to be re-learned as soon as any other reference $\tilde{f}_\mathrm{ref} \neq f_\mathrm{ref}$ should be followed. A method that tries to cope with this challenge is the multiple-model approach in [20], where an adaptive self-organizing map detects changes and switches between various learned models. However, in this approach new sub-models need to be trained whenever the reference dynamics $f_\mathrm{ref}$ changes.

An RL method that tracks reference trajectories not necessarily resulting from a time-invariant exo-system $f_\mathrm{ref}$ (or multiple switched models [20]) has not yet been proposed. Therefore, in contrast to [8-20], our work provides for the first time an RL method that allows tracking an arbitrary reference trajectory on a moving horizon. Our general idea is to explicitly incorporate the reference trajectory on a moving horizon in our new Q-function, which is learned without requiring the system dynamics. Since it considers not only the current tracking error but rather the reference trajectory on the horizon, this Q-function yields a controller which achieves not only reactive but predictive behavior. In particular, we provide:

• A novel moving-horizon tracking Q-function whose minimizing control also minimizes the tracking costs.
• Derivation and proof of the analytical solution of the novel Q-function given the cost function and the system dynamics for the linear-quadratic tracking case.
• Convergence proofs for our learning algorithm that learns optimal tracking from data without knowledge of the system dynamics. Thus, the estimated Q-function parameters as well as the associated control law converge to the optimal solution.

The rest of this paper is organized as follows.
In Section 2, we define the optimal tracking problem with unknown system dynamics. The novel Q-function for arbitrary reference tracking is defined and analyzed in Section 3. In Section 4, we introduce our learning algorithm that is based on the previously defined Q-function and provide convergence proofs. Simulation results and a comparison of our method with an adaptive tracking controller that is trained on a reference generated by an exo-system $f_\mathrm{ref}$ are presented and discussed in Section 5 before the paper is concluded in Section 6.

2. Problem Definition

Consider a discrete-time linear system

$$x_{k+1} = A x_k + B u_k, \tag{1}$$

where $k \in \mathbb{N}_0$ is the discrete time step, $x_k \in \mathbb{R}^n$ the state vector and $u_k \in \mathbb{R}^m$ the input vector. Both the system matrix $A \in \mathbb{R}^{n \times n}$ and the input matrix $B \in \mathbb{R}^{n \times m}$ are unknown, albeit the system $(A, B)$ is assumed to be controllable.

Assume that at each time step $k$, an arbitrary reference $\tilde{r}_i^{(k)} \in \mathbb{R}^n$ is given on a moving horizon of length $N$, where $i$ is the time index starting at $k$. Beyond the horizon $N$, let the reference be $0$. Thus,

$$r_i^{(k)} = \begin{cases} \tilde{r}_i^{(k)}, & \text{for } i = k, \dots, k+N \\ 0, & \text{for } i = k+N+1, \dots, \infty \end{cases} \tag{2}$$

follows. Our aim is to learn to track $r_i^{(k)}$ (2) optimally w.r.t. the quadratic cost

$$J_k = \sum_{i=k}^{\infty} \gamma^{i-k} \underbrace{\tfrac{1}{2}\left(e_i^\top Q e_i + u_i^\top R u_i\right)}_{c(x_i, u_i, r_i) = c_i} \tag{3}$$

that has to be minimized. Note that we write $r_i = r_i^{(k)}$ for brevity of notation in the following, as $J_k$ indicates that the reference $r_i^{(k)}$ starting at $k$ is used according to (2). Here, $e_i = x_i - r_i$ is the deviation of the system state $x_i$ from the reference $r_i$ at time step $i$. $Q \in \mathbb{R}^{n \times n}$, $Q = Q^\top \succeq 0$, is a symmetric, positive semidefinite matrix penalizing deviations of the state $x_i$ from the reference $r_i$, and $R \in \mathbb{R}^{m \times m}$, $R = R^\top \succ 0$, is a symmetric, positive definite matrix penalizing the control effort. Furthermore, $\gamma \in [0, 1)$ is a discount factor and $c_i \in \mathbb{R}$ denotes the one-step cost.

The choice of our cost function (3) is motivated as follows: On one hand, the reference to be tracked is usually known only on a finite horizon $N$ (e.g. a road course). On the other hand, the Q-learning method that we use in Section 4 relies on an infinite-horizon cost function which allows the efficient formulation of a Bellman-like equation. Thus, this cost function (3) accounts for a given reference of finite horizon $N$ and enables stability and convergence of the associated RL algorithm due to the infinite horizon. Therefore, our problem definition can be summarized in Problem 1.

Problem 1. At each time step $k$, find the optimal control input $u_k^*$ and apply it to system (1), where $u_k^*, u_{k+1}^*, \dots$ is the control sequence minimizing the discounted cost (3) subject to the system dynamics (1), where $A$ and $B$ are unknown, given $x_k$ and $r_k, r_{k+1}, \dots, r_{k+N}$.
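To make the problem setup concrete, the following minimal Python sketch (not part of the paper; all names and the truncation of the infinite sum are our own illustrative assumptions) evaluates the discounted cost (3) along a simulated trajectory, applying the zero-padding convention (2) for the reference beyond the horizon:

```python
import numpy as np

def discounted_cost(x0, controls, r_window, A, B, Q, R, gamma):
    """Truncated evaluation of J_k in Eq. (3): the reference is known for
    i = k, ..., k+N and taken as zero beyond the horizon, cf. Eq. (2)."""
    N = r_window.shape[0] - 1            # r_window stacks r_k, ..., r_{k+N}
    x, J = np.asarray(x0, float), 0.0
    for i, u in enumerate(controls):
        r = r_window[i] if i <= N else np.zeros_like(x)   # Eq. (2)
        e = x - r                                         # tracking error e_i
        J += gamma**i * 0.5 * (e @ Q @ e + u @ R @ u)     # one-step cost c_i
        x = A @ x + B @ u                                 # dynamics, Eq. (1)
    return J
```

Because $\gamma < 1$ and the reference is zero beyond the horizon, truncating the sum after sufficiently many steps approximates $J_k$ arbitrarily well.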
3. Extended Q-Function for Reference Tracking

Our idea is to define a new reference-dependent Q-function in contrast to the commonly used Q-function (i.e. we define a state-action-reference function rather than a state-action function). This Q-function is constructed such that its minimizing control input constitutes a solution to Problem 1. In this section, we introduce the reference-dependent Q-function and derive its analytical solution. This provides important insights into how to parametrize the Q-function in Section 4, which is the basic ingredient for a convergent Reinforcement Learning algorithm capable of learning the optimal tracking solution.

We begin with the general case of a finite optimization horizon $K$; later we let $K \to \infty$. The notation may seem cumbersome but is of high importance, since the optimization horizon $K$ and the moving horizon $N$ of the reference must not be confused. Furthermore, the notation indicating which control input is plugged into the one-step cost and Q-functions will be useful, and we need to distinguish between the actual time step $k$ which the system is in and the time step $\kappa$ on the optimization horizon $K$.

Definition 1 (Reference-dependent Q-function). Our proposed reference-dependent Q-function is defined as

$$Q_\kappa^K = c_{k+\kappa} + \gamma \sum_{i=k+\kappa+1}^{k+K} \gamma^{i-(k+\kappa+1)} c_i \big|_{u_i^*} = c_{k+\kappa} + \gamma\, Q_{\kappa+1}^K \big|_{u_{k+\kappa+1}^*}, \tag{4}$$

where

$$c_i = c(x_i, u_i, r_i), \tag{5}$$
$$Q_\kappa^K = Q^{K-\kappa}(x_{k+\kappa}, u_{k+\kappa}, r_{k+\kappa}, \dots, r_{k+K}), \tag{6}$$
$$Q_K^K = c(x_{k+K}, u_{k+K}, r_{k+K}). \tag{7}$$

Here, $\kappa \in \mathbb{N}_0$, $\kappa < K$, denotes the time step on the current optimization horizon of length $K$ starting at $k$, and $r_i = 0$, $\forall i > k+N$ (see (2)). The notation $c_i|_{u_i^*}$ indicates that the optimal control $u_i^*$ is applied in (5), and $Q_{\kappa+1}^K|_{u_{k+\kappa+1}^*}$ denotes that $u_{k+\kappa+1} = u_{k+\kappa+1}^*$ in $Q_{\kappa+1}^K$ (cf. (6)).

Therefore, $Q_\kappa^K$ is the accumulated discounted cost from time step $k+\kappa$ to $k+K$ if the control $u_{k+\kappa}$ is applied at time step $k+\kappa$ and the optimal controls $u_{k+\kappa+1}^*, \dots, u_{k+K}^*$ minimizing the cost-to-go are applied thereafter, while the reference is provided on a moving horizon of length $N$.

With the finite-horizon cost function defined as

$$J_k^K = \sum_{i=k}^{k+K} \gamma^{i-k}\, \tfrac{1}{2}\left(e_i^\top Q e_i + u_i^\top R u_i\right), \tag{8}$$

the subsequent Lemma 1 follows.

Lemma 1. The control $u_k$ minimizing the reference-dependent Q-function $Q_0^K$ is a solution for $u_k^*$ minimizing $J_k^K$.

Proof. With (8) and

$$\min_{u_k} Q_0^K = c_k\big|_{u_k^*} + \gamma Q_1^K\big|_{u_{k+1}^*} = Q_0^K\big|_{u_k^*} = \sum_{i=k}^{k+K} \gamma^{i-k} c_i\big|_{u_i^*} \tag{9}$$

follows

$$\min_{u_k, \dots, u_{k+K}} J_k^K = \sum_{i=k}^{k+K} \gamma^{i-k} c_i\big|_{u_i^*} = Q_0^K\big|_{u_k^*}. \tag{10}$$

Taking the limit

$$\lim_{K \to \infty} J_k^K = J_k \tag{11}$$

yields $J_k$ (cf. (3)). □

With the Q-function $Q_0^K$ defined according to Definition 1 and as a result of Lemma 1, Problem 1 is equivalent to the following Problem 2.

Problem 2. Given $r_k, r_{k+1}, \dots, r_{k+N}$, in each state $x_k$ at time step $k$, find the control $u_k^*$ minimizing the reference-dependent Q-function $Q_0$, where

$$Q_0 = \lim_{K \to \infty} Q_0^K, \tag{12}$$

and apply it to the system whose matrices $A$ and $B$ are unknown.

We will proceed in two steps. First, it is assumed that the system matrices $A$ and $B$ are known. Later on, this assumption will be dropped and an iterative solution based on a temporal difference error will be introduced.
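As a numerical sanity check on Definition 1 (our addition, not from the paper), $Q_\kappa^K$ can be evaluated by brute-force rollout for known dynamics: apply a candidate first control and then follow a given policy, accumulating the discounted one-step costs. The `policy` argument below is an assumed helper; only if it is optimal does the rollout equal $Q_0^K$:

```python
import numpy as np

def q_value(x, u_first, r_traj, policy, A, B, Q, R, gamma, K):
    """Roll out Eq. (4): apply u_first at kappa = 0, then follow `policy`.

    r_traj has K + 1 rows (references for kappa = 0..K, zero-padded per
    Eq. (2)); if `policy` is optimal, this equals Q_0^K of Definition 1."""
    def step_cost(x, u, r):
        e = x - r
        return 0.5 * (e @ Q @ e + u @ R @ u)

    q = step_cost(x, u_first, r_traj[0])
    x = A @ x + B @ u_first
    for kappa in range(1, K + 1):
        u = policy(x, r_traj[kappa:])        # sees the remaining window
        q += gamma**kappa * step_cost(x, u, r_traj[kappa])
        x = A @ x + B @ u
    return q
```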
We will now propose the analytical solution of $Q_0^K$ in Theorem 1.

Theorem 1 (Analytical solution of $Q_0^K$). For $K \geq N$, the Q-function $Q_0^K$ (cf. Definition 1) with the objective function (3) is given by

$$Q_0^K = \tfrac{1}{2} \begin{bmatrix} x_k^\top & u_k^\top & r_k^\top & \cdots & r_{k+N}^\top & 0^\top \end{bmatrix} H_K \begin{bmatrix} x_k \\ u_k \\ r_k \\ \vdots \\ r_{k+N} \\ 0 \end{bmatrix}, \tag{13}$$

where the trailing zeros represent $r_{k+N+1}, \dots, r_{k+K}$ (cf. (2)) and $H_K = H_K^\top \in \mathbb{R}^{((K+2)n+m) \times ((K+2)n+m)}$ with

$$H_K = \begin{bmatrix}
h_{xx} & h_{xu} & h_{xr_0} & h_{xr_1} & h_{xr_2} & \cdots & h_{xr_K} \\
h_{ux} & h_{uu} & 0 & h_{ur_1} & h_{ur_2} & \cdots & h_{ur_K} \\
h_{r_0 x} & 0 & h_{r_0 r_0} & 0 & 0 & \cdots & 0 \\
h_{r_1 x} & h_{r_1 u} & 0 & h_{r_1 r_1} & 0 & \cdots & 0 \\
h_{r_2 x} & h_{r_2 u} & 0 & 0 & h_{r_2 r_2} & \cdots & h_{r_2 r_K} \\
h_{r_3 x} & h_{r_3 u} & 0 & 0 & h_{r_3 r_2} & \cdots & h_{r_3 r_K} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
h_{r_K x} & h_{r_K u} & 0 & 0 & h_{r_K r_2} & \cdots & h_{r_K r_K}
\end{bmatrix}. \tag{14}$$

The exact values of $H_K$ follow from the subsequent proof.

Proof. The proof is of rather technical nature and given in Appendix A. □

For $K \to \infty$, let $H$ be the northwestern $((N+2)n+m) \times ((N+2)n+m)$ submatrix of $H_K$. Then,

$$Q_0 = \lim_{K \to \infty} Q_0^K = \tfrac{1}{2} z_k^\top H z_k \tag{15}$$

follows, where $z_k = \begin{bmatrix} x_k^\top & u_k^\top & r_k^\top & \cdots & r_{k+N}^\top \end{bmatrix}^\top$, as $r_i = 0$, $\forall i > k+N$. Thus, the Q-function for the LQ tracking problem is quadratic w.r.t. the state $x_k$, the control input $u_k$ and the reference $r_k, \dots, r_{k+N}$, and it is completely parametrized by $H$. This is a generalization of the well-known structure for reference-free RL procedures [21, 22]. In addition, $H$ is not only quadratic but has a specific structure, which is a new result and allows an efficient parametrization for the Q-learning based algorithm in the following.

4. Q-Learning Based Tracking

We use the new Q-function for reference tracking on a moving horizon where the system matrices $A$ and $B$ are unknown. Thus, our aim is to determine $H$ (cf. (15)) from observations of states and rewards. The optimal control $u_k^*$, which is equivalent to (A.14) for $\kappa = 0$ and $K \to \infty$, can be expressed directly by means of $H$ and is given by Corollary 1. Here, (A.15) ensures that $u_k^*$ in fact minimizes the Q-function $Q_0$.

Corollary 1 (Optimal control $u_k^*$). With Lemma 1 and Theorem 1, the optimal control at time step $k$ is given by

$$u_k^* = -h_{uu}^{-1} \begin{bmatrix} h_{ux} & h_{ur_1} & \cdots & h_{ur_N} \end{bmatrix} \begin{bmatrix} x_k \\ r_{k+1} \\ \vdots \\ r_{k+N} \end{bmatrix}. \tag{16}$$

Hence, if the Q-function $Q_0$ is known by means of the matrix $H$, the optimal control directly results from (16).

Note 1. In contrast to usual controllers learned by Reinforcement Learning, our control law (16) explicitly depends on the reference values $r_{k+1}, \dots, r_{k+N}$ and is therefore able to generalize to arbitrary reference trajectories on the horizon $N$.
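The control law (16) is a linear map of the current state and the upcoming reference window. A minimal sketch of its evaluation (ours; the index bookkeeping assumes the block ordering of $z_k$ in (15), and the helper name is hypothetical):

```python
import numpy as np

def optimal_control(H, x, r_future, n, m, N):
    """Eq. (16): u* = -h_uu^{-1} [h_ux  h_ur1 ... h_urN] [x; r_{k+1}; ...; r_{k+N}].

    H is ordered as z = [x; u; r_k; r_{k+1}; ...; r_{k+N}] (cf. Eq. (15));
    r_future stacks r_{k+1}, ..., r_{k+N}. Note that r_k does not enter,
    since h_{u r_0} = 0 by Eq. (14)."""
    h_uu = H[n:n + m, n:n + m]
    h_ux = H[n:n + m, :n]
    h_ur = H[n:n + m, n + m + n:]        # columns of r_{k+1} ... r_{k+N}
    return -np.linalg.solve(h_uu, h_ux @ x + h_ur @ r_future)
```

Using a linear solve instead of an explicit inverse of $h_{uu}$ is a standard numerical design choice; the result is identical.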
4.1. Parametrization of the Reference-dependent Q-function

In order to learn the Q-function, we parametrize $Q_0$ and perform a value iteration on the resulting squared Bellman-like temporal difference (TD) error in order to estimate the Q-function parameters as well as the corresponding optimal control law. Let the estimated Q-function be parametrized by means of a sum of weighted basis functions:

$$\hat{Q}_0 = w^\top \phi(x_k, u_k, r_k, \dots, r_{k+N}) = w^\top \phi(z_k). \tag{17}$$

Here, $w \in \mathbb{R}^L$ is a weight vector and $\phi: \mathbb{R}^{(N+2)n+m} \to \mathbb{R}^L$ is a vector of basis functions. Note that, in contrast to usual Q-function approximations, $\hat{Q}_0$ in (17) explicitly incorporates the reference trajectory $r_k, r_{k+1}, \dots, r_{k+N}$.

Lemma 2. With

$$L = \tfrac{1}{2}\left((N+2)n+m\right)\left((N+2)n+m+1\right) - n^2(2N-1) - mn \tag{18}$$

quadratic basis functions $\phi = \begin{bmatrix} \phi_1 & \cdots & \phi_L \end{bmatrix}^\top$, there exists a weight vector $w = w^*$ such that $\hat{Q}_0 = Q_0$.

Proof. Due to the symmetry and the zeros in $H_K$ (cf. (14)), and therefore also in $H$, there are $L$ non-redundant elements in $H$. Define quadratic basis functions $\phi_l$, $l = 1, \dots, L$, of the form

$$\phi_l = \begin{cases} \{z_k\}_i \{z_k\}_j, & \text{for } i \neq j \\ \tfrac{1}{2} \{z_k\}_i^2, & \text{for } i = j, \end{cases} \tag{19}$$

where $i, j$ indicate the corresponding non-redundant elements of $H$, $\{\cdot\}_i$ denotes the $i$-th element of a vector, and $z_k$ is defined as in (15). Thus, $Q_0$ in (15) is equivalent to $\hat{Q}_0$ in (17) if the weights $\{w\}_i$, $i = 1, \dots, L$, are equal to the corresponding non-redundant elements in $H$, which we denote by $w = w^*$. □

Although $L$ in Lemma 2 gives the maximum number of weights needed in order to parametrize $Q_0$ exactly, in the frequently occurring case of a sparse weighting matrix $Q$ in the cost functional (3), the exact knowledge of the structure of $H$ can be exploited further in order to drastically reduce the number of weights $w$ that are necessary.

Lemma 3. If the $l$-th row and $l$-th column of $Q$ equal zero, then the
• $l$-th column of $h_{xr_i}$ $\forall i \in \{0, \dots, N\}$,
• $l$-th column of $h_{ur_i}$ $\forall i \in \{0, \dots, N\}$ and
• $l$-th row and $l$-th column of $h_{r_i r_j}$ $\forall i, j \in \{0, \dots, N\}$
are all equal to zero. Thus, the number of non-redundant weights $L$ in $H$ (corresponding to the weight vector $w$) reduces to

$$L = (n-q)(n-q+1)\left(\tfrac{N}{2}+1\right) + \tfrac{1}{2}n(n+1) + \left(m + N(n-q)\right)(m+n) + \tfrac{1}{2}(N-2)(N-1)(n-q)^2, \tag{20}$$

where $h_{xr_0} = -Q$ and $h_{r_0 r_0} = Q$ has been considered and $q$ denotes the number of rows and columns of $Q$ that are both zero.

Proof. The rather technical proof directly follows from (A.12) considering (A.8)-(A.11). □

Although Lemma 3 is of technical nature, sparsity of $H$ is essential in order to implement efficient learning controllers, as will be discussed in Section 5.

Note 2. Based on the specific structure of the analytical solution of $Q_0$ derived in Theorem 1, according to Lemma 3, $L$ quadratic basis functions are sufficient to parametrize $Q_0$ exactly.

Note 3. Equation (17) and Lemma 3 show that our proposed Q-function generalizes over reference trajectories, as the weight vector $w$ does not depend on a specific reference we intend to follow.
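The following sketch (ours, not from the paper) builds the quadratic feature vector (19) from a boolean mask marking the non-redundant entries of $H$; constructing that mask from (14) and Lemma 3 is assumed to be done elsewhere:

```python
import numpy as np

def quadratic_features(z, mask):
    """Eq. (19): one feature per non-redundant (upper-triangular) entry of H.

    mask[i, j] is True where H may be nonzero, derived from Eq. (14) and
    Lemma 3. With w holding the matching entries of H, w @ phi(z)
    reproduces 0.5 * z^T H z, i.e. Eq. (15)."""
    idx_i, idx_j = np.where(np.triu(mask))
    return np.where(idx_i == idx_j,
                    0.5 * z[idx_i] * z[idx_j],   # diagonal: 0.5 * z_i^2
                    z[idx_i] * z[idx_j])         # off-diagonal: z_i * z_j
```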
4.2. Online Learning Algorithm

In this section, we propose our value-iteration-based algorithm in order to learn the weights $w$ online by minimizing the squared temporal difference error. Let

$$\hat{Q}_1 = w^\top \phi(x_{k+1}, u_{k+1}, r_{k+1}, \dots, r_{k+N}, 0) = w^\top \phi(z_{k+1}), \tag{21}$$

where $z_{k+1} = \begin{bmatrix} x_{k+1}^\top & u_{k+1}^\top & r_{k+1}^\top & \cdots & r_{k+N}^\top & 0^\top \end{bmatrix}^\top$. The estimated optimal control input $\hat{u}_{k+\kappa}^*$ based on the estimated Q-function $\hat{Q}_\kappa|_{u_{k+\kappa}}$ is defined as

$$\hat{u}_{k+\kappa}^* = \arg\min_{u_{k+\kappa}} \hat{Q}_\kappa \big|_{u_{k+\kappa}} = L \begin{bmatrix} x_{k+\kappa}^\top & r_{k+\kappa+1}^\top & \cdots & r_{k+\kappa+N}^\top \end{bmatrix}^\top, \tag{22}$$

where $L = -\hat{h}_{uu}^{-1} \begin{bmatrix} \hat{h}_{ux} & \hat{h}_{ur_1} & \cdots & \hat{h}_{ur_N} \end{bmatrix}$ (cf. (16)). Here, let $\hat{H}$ be the estimated matrix $H$ based on $w$ and $\hat{h}$ denote the submatrices of $\hat{H}$ as in (14). Furthermore, $\hat{u}_{k+\kappa}^* = u_{k+\kappa}^*$ results if $w = w^*$ according to Lemma 3, and the optimal control law $L^*$ results if $\hat{H} = H$. Then, based on the Bellman-like equation (4), the TD error $\epsilon_k$ [23] is given in Definition 2. If $w = w^*$, $\epsilon_k$ would vanish.

Definition 2 (TD error). The temporal difference error $\epsilon_k$, i.e. the approximation error of the Bellman-like equation (4) due to the deviation of the weight estimate $w$ from $w^*$, is defined as

$$\epsilon_k = c_k + \gamma \hat{Q}_1\big|_{\hat{u}_{k+1}^*} - \hat{Q}_0 = c_k + \gamma w^\top \phi\left(z_{k+1}^*\right) - w^\top \phi(z_k). \tag{23}$$

Here, $z_{k+1}^* = z_{k+1}|_{u_{k+1} = \hat{u}_{k+1}^*}$.

In order to improve the estimated reference-dependent Q-function $\hat{Q}_0$ as well as the resulting estimated optimal control $\hat{u}_k^*$, we employ a value iteration procedure. This iteration consists of a policy evaluation step, which updates the weight estimate $w^{(i)}$ representing the Q-function, and a policy improvement step, where, based on the updated Q-function weight $w^{(i)}$ and the corresponding matrix $H^{(i)}$, the control law $L^{(i)}$ is adapted according to (22).

To evaluate $\epsilon_k$ in (23), $\hat{u}_{k+1}^*$ is required. However, as the optimal weight $w^*$ is unknown a priori, we initialize $w^{(0)} = 0$ and the estimated optimal control by $\hat{u}_{k+1}^{*(0)} = 0$, which is achieved by setting $L^{(0)} = 0$.

In the policy evaluation step, the aim is to find an updated $w^{(i+1)}$ such that $\left(\epsilon_k^{(i)}\right)^2$ is minimized, where

$$\epsilon_k^{(i)} = c_k + \gamma w^{(i)\top} \phi\left(z_{k+1}^{*(i)}\right) - w^{(i+1)\top} \phi(z_k) \tag{24}$$

in analogy to (23). In accordance with Lemma 3, $w \in \mathbb{R}^L$ follows. Thus, $\epsilon_k^{(i)}$ needs to be considered at $M \geq L$ time steps in order to perform a least-squares update, where $M$ is the number of samples used for the policy evaluation. Then, $w^{(i+1)}$ results from

$$w^{(i+1)} = \arg\min_{w^{(i+1)}} \sum_{j=k-M+1}^{k} \left(\epsilon_j^{(i)}\right)^2. \tag{25}$$

Now, we define

$$\Phi = \begin{bmatrix} \phi(z_{k-M+1}) & \cdots & \phi(z_k) \end{bmatrix}^\top \tag{26}$$

and

$$c = \begin{bmatrix} c_{k-M+1} + \gamma w^{(i)\top} \phi\left(z_{k-M+2}^{*(i)}\right) \\ \vdots \\ c_k + \gamma w^{(i)\top} \phi\left(z_{k+1}^{*(i)}\right) \end{bmatrix}. \tag{27}$$

If the excitation condition

$$\operatorname{rank}\left(\Phi^\top \Phi\right) = L \tag{28}$$

is satisfied, the $w^{(i+1)}$ minimizing (25) exists, is unique and given by

$$w^{(i+1)} = \left(\Phi^\top \Phi\right)^{-1} \Phi^\top c \tag{29}$$

(cf. [24, Theorem 2.1]). Then, the policy improvement step is based on the new weight estimate $w^{(i+1)}$ and its corresponding $H^{(i+1)}$ and results in

$$L^{(i+1)} = -\left(h_{uu}^{(i+1)}\right)^{-1} \begin{bmatrix} h_{ux}^{(i+1)} & h_{ur_1}^{(i+1)} & \cdots & h_{ur_N}^{(i+1)} \end{bmatrix} \tag{30}$$

(cf. Corollary 1). With the time step still being fixed at $k$, this iteration is performed until the change in $w$ stays below a given threshold $e_w$.

Although we would like to evaluate the Q-function corresponding to the target policy $\hat{u}_k^*$ (note that $z_{k+1}^{*(i)}$ is used in (24)), we have not yet discussed how to choose the behavior policy, i.e. the control $u_k$ that is applied to the system and appears in $c_k$ and $z_k$ (cf. (24)). While the learning process is active, let

$$u_k = \tilde{u}_k^* = \hat{u}_k^* + \xi. \tag{31}$$

Algorithm 1: Q-function Tracking Controller
1: initialize $M$, $w = w^{(0)} = 0$, $L^{(0)} = 0$
2: for $k = 1, 2, \dots$ do
3:   apply $\tilde{u}_k^*$ (31) to the system
4:   if $k \bmod M = 0$ then
5:     $i = 0$, $w^{(i)} = w$
6:     do
7:       policy evaluation: $w^{(i+1)}$ (25)
8:       policy improvement: $L^{(i+1)}$ (30)
9:       $i = i + 1$
10:    while $\lVert w^{(i)} - w^{(i-1)} \rVert_2 > e_w$
11:    $w = w^{(i)}$
12:  end if
13: end for

Here, the Gaussian noise $\xi \sim \mathcal{N}_m(0, \Sigma)$ serves as exploration noise, as persistent excitation is required for convergence (cf. (28), [24-26]). Furthermore, if the reference to track is smooth, additional excitation noise should be added to the reference in order to satisfy condition (28). The complete algorithm is shown in Algorithm 1 and learns to track an arbitrary reference that is given on a moving horizon of length $N$ without knowledge of the system matrices $A$ and $B$. If the system dynamics is not expected to change over time, learning might be stopped after the first complete value iteration based on $M$ data tuples has been performed. This can be done by stopping Algorithm 1 after line 11. In practice, learning might be enabled again whenever the Bellman error increases, which is an indicator of suboptimal weights $w$.
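A compact sketch of one pass of lines 6-10 of Algorithm 1 (our illustration; `features` and `control_law_from_w` are assumed helpers encoding (19) and (30), with zero gain returned for $w = 0$, and the logged tuples are an assumed data layout):

```python
import numpy as np

def value_iteration(data, w, n, features, control_law_from_w, gamma, e_w,
                    max_iter=100):
    """Iterate policy evaluation (29) and policy improvement (30) on M tuples.

    data: list of tuples (z_k, c_k, x_next, r_next), where r_next stacks
    r_{k+1}, ..., r_{k+N}, 0 as in Eq. (21)."""
    Phi = np.stack([features(z) for (z, _, _, _) in data])        # Eq. (26)
    for _ in range(max_iter):
        L_gain = control_law_from_w(w)                            # Eq. (30)
        targets = []
        for (_, c_k, x_next, r_next) in data:
            # Eq. (22) at step k+1 uses r_{k+2}, ..., so drop the first block.
            u_next = L_gain @ np.concatenate([x_next, r_next[n:]])
            z_star = np.concatenate([x_next, u_next, r_next])     # Eq. (21)
            targets.append(c_k + gamma * w @ features(z_star))    # Eq. (27)
        w_new, *_ = np.linalg.lstsq(Phi, np.asarray(targets), rcond=None)  # Eq. (29)
        if np.linalg.norm(w_new - w) <= e_w:                      # line 10
            return w_new
        w = w_new
    return w
```

Under the rank condition (28), the least-squares solve is equivalent to the normal-equation update (29).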
Note 4. The iterative procedure of policy evaluation and policy improvement in Algorithm 1 is a value iteration (cf. [1]). This is due to the definition of $\epsilon_k$ in (24), where $\hat{Q}_1$ relies on $w^{(i)}$ and $\hat{Q}_0$ on $w^{(i+1)}$ (cf. [26]).

Note 5. Just as regular Q-learning [27], our algorithm belongs to the off-policy RL methods [1], as the behavior policy $\tilde{u}_k^* = \hat{u}_k^* + \xi$ is followed while the agent learns the Q-function belonging to the target policy $\hat{u}_k^*$.

4.3. Convergence Analysis of the Learning Algorithm

In this section, we provide convergence proofs, i.e. we show that the estimated reference-dependent Q-function $\hat{Q}_0$ converges to the underlying Q-function $Q_0$ and that $w^{(i)} \to w^*$, i.e. $H^{(i)} \to H$ as $i \to \infty$. This also implies that $L^{(i)} \to L^*$, i.e. the value iteration converges to the optimal control law.

Our convergence analysis is structured as follows. First, we prove that the value iteration (i.e. iterating between (25) and (30)) is equivalent to a matrix sequence on $H^{(i)}$. Second, we prove that this sequence of matrices is upper bounded in the sense that $0 \preceq H^{(i)} \preceq Y$ while $H^{(i)}$ is monotonically increasing, i.e. $0 \preceq H^{(i)} \preceq H^{(i+1)}$. Hence, the sequence converges. Finally, we show that the converged sequence fulfills the Bellman equation and the corresponding control law is optimal.

The following Lemma 4 extends [28, Lemma 1] to the tracking case and shows that our proposed value iteration is equivalent to a matrix sequence on $H^{(i)}$.

Lemma 4. Let $H^{(0)} = 0$, $R^\top = R \succ 0$, $Q^\top = Q \succeq 0$ and $(A, B)$ controllable. The value iteration described by (25) and (30) is equivalent to the iteration

$$H^{(i+1)} = G + \gamma M_{L^{(i)}}^\top H^{(i)} M_{L^{(i)}}, \tag{32}$$

where

$$G = \begin{bmatrix} Q & 0 & -Q & 0 \\ 0 & R & 0 & 0 \\ -Q & 0 & Q & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \tag{33}$$

and

$$M_{L^{(i)}} = \begin{bmatrix}
A & B & 0 & 0 & 0 & \cdots & 0 \\
L_x^{(i)} A & L_x^{(i)} B & 0 & 0 & L_1^{(i)} & \cdots & L_{N-1}^{(i)} \\
0 & 0 & 0 & I_n & 0 & \cdots & 0 \\
0 & 0 & 0 & 0 & I_n & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & 0 & \cdots & I_n \\
0 & 0 & 0 & 0 & 0 & \cdots & 0
\end{bmatrix} \tag{34}$$

with $L^{(i)} = \begin{bmatrix} L_x^{(i)} & L_1^{(i)} & \cdots & L_N^{(i)} \end{bmatrix} = L\left(H^{(i)}\right)$ depending on $H^{(i)}$ as in (30).

Proof. See Appendix B. □

The following technical Lemma 5 is required later for the proof of Lemma 6.

Lemma 5. For $H^{(0)} = 0$, $R^\top = R \succ 0$ and $Q^\top = Q \succeq 0$, $\forall i > 0$: $L^{(i)} \begin{bmatrix} x_{k+\kappa}^\top & r_{k+\kappa+1}^\top & \cdots & r_{k+\kappa+N}^\top \end{bmatrix}^\top$ is the unique minimizer of

$$\hat{Q}_\kappa^{(i)} = \tfrac{1}{2} z_{k+\kappa}^\top H^{(i)} z_{k+\kappa}. \tag{35}$$

Proof. Due to $H^{(0)} = 0 \succeq 0$, $R \succ 0$ and $Q \succeq 0$, (32) yields $H^{(i)} \succeq 0$. Furthermore, due to $R \succ 0$ it is obvious that $h_{uu}^{(i)} \succ 0$, $\forall i > 0$. Therefore, $\partial \hat{Q}_\kappa^{(i)} / \partial u_{k+\kappa} = 0$ yields $L^{(i)}$, and $\partial^2 \hat{Q}_\kappa^{(i)} / \partial u_{k+\kappa}^2 \succ 0$ follows due to $h_{uu}^{(i)} \succ 0$, which completes the proof. □

Define the operator

$$F\left(\Omega^{(i)}, \Gamma^{(i)}\right) = G + \gamma M_{\Gamma^{(i)}}^\top \Omega^{(i)} M_{\Gamma^{(i)}}, \tag{36}$$

i.e. $F\left(H^{(i)}, L^{(i)}\right) = H^{(i+1)}$ according to (32). In order to prove that $H^{(i)}$ given according to Lemma 4 is upper bounded, the following technical Lemma 6 is required first, which generalizes [22, Lemma B.1.1] to cope with the reference-dependent Q-function. Note that knowledge of the exact structure of $H$, and therefore the analytical solution by means of Theorem 1, plays a crucial role for the extension to the tracking case.
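For known $A$ and $B$, the matrix iteration (32) can be run directly, which is useful to compute a reference $H$ (and thus $w^*$) for validation. A sketch under the block layout of (33)-(34) (the assembly code is our own and assumes that layout; it is not from the paper):

```python
import numpy as np

def value_iteration_matrix(A, B, Q, R, gamma, N, iters=200):
    """Iterate Eq. (32), H <- G + gamma * M_L^T H M_L, with L from Eq. (30)."""
    n, m = B.shape
    dim = (N + 2) * n + m                     # z = [x; u; r_k; ...; r_{k+N}]
    G = np.zeros((dim, dim))                  # Eq. (33)
    G[:n, :n] = Q
    G[n:n + m, n:n + m] = R
    G[:n, n + m:n + m + n] = -Q
    G[n + m:n + m + n, :n] = -Q
    G[n + m:n + m + n, n + m:n + m + n] = Q
    H = np.zeros((dim, dim))                  # H^(0) = 0
    for _ in range(iters):
        h_uu = H[n:n + m, n:n + m]
        if np.linalg.matrix_rank(h_uu) < m:   # first iteration: L^(0) = 0
            L_x, L_r = np.zeros((m, n)), np.zeros((m, N * n))
        else:
            L_x = -np.linalg.solve(h_uu, H[n:n + m, :n])
            L_r = -np.linalg.solve(h_uu, H[n:n + m, n + m + n:])
        M = np.zeros((dim, dim))              # Eq. (34)
        M[:n, :n], M[:n, n:n + m] = A, B
        M[n:n + m, :n], M[n:n + m, n:n + m] = L_x @ A, L_x @ B
        M[n:n + m, n + m + 2 * n:] = L_r[:, :(N - 1) * n]  # L_1 ... L_{N-1}
        for j in range(N):                    # shift the reference blocks
            M[n + m + j * n:n + m + (j + 1) * n,
              n + m + (j + 1) * n:n + m + (j + 2) * n] = np.eye(n)
        H = G + gamma * M.T @ H @ M           # Eq. (32)
    return H
```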
Lemma 6. Let $W^{(i)}$ be an arbitrary matrix sequence and $0 \preceq H^{(0)} \preceq Z^{(0)}$. Then, given the sequences $Z^{(i+1)} = F\left(Z^{(i)}, W^{(i)}\right)$ and $H^{(i+1)} = F\left(H^{(i)}, L\left(H^{(i)}\right)\right)$ with (36), it follows that $0 \preceq H^{(i+1)} \preceq Z^{(i+1)}$.

Proof. See Appendix C. □

In the next step, upper boundedness of $H^{(i)}$ in the sense that $0 \preceq H^{(i)} \preceq Y$ is shown. For the regulation case, boundedness was shown in [22, Lemma B.1.2]. In contrast to that, we consider the specific structure of the iteration in the tracking case (cf. (32)) and prove boundedness of $H^{(i)}$ for the more generalized tracking formulation. For reasons of self-consistency, the complete proof is given.

Lemma 7. Let $(A, B)$ be controllable and $H^{(i)}$ be the sequence (32) with $H^{(0)} = 0$. Then, there exists $Y$ such that $0 \preceq H^{(i)} \preceq Y$.

Proof. See Appendix D. □

We will now get to the main result of our convergence analysis and show that the proposed value iteration converges to the optimal weight vector $w^*$ and the optimal control law $L^*$ in the tracking case described by Problem 1.

Theorem 2. Let $R^\top = R \succ 0$, $Q^\top = Q \succeq 0$, $(A, B)$ controllable and $w^{(0)} = 0$, i.e. $H^{(0)} = 0$. Then, iterating between (25) and (30) yields $H^{(i)} \to H$, i.e. $w^{(i)} \to w^*$ as well as $L^{(i)} \to L^*$.

Proof. According to Lemma 4, the value iteration is equivalent to iterating on $H^{(i)}$ (cf. (32)). With $Z^{(0)} = H^{(0)}$ and $Z^{(i+1)} = F\left(Z^{(i)}, L\left(H^{(i+1)}\right)\right)$, Lemma 6 yields $0 \preceq H^{(i)} \preceq Z^{(i)}$. With $H^{(0)} = 0$ follows $H^{(1)} = G \succeq 0$ and hence $H^{(1)} - Z^{(0)} \succeq 0$. The proof is drawn by induction: Assume the induction hypothesis $H^{(i)} - Z^{(i-1)} \succeq 0$. Then,

$$H^{(i+1)} - Z^{(i)} = \gamma M_{L(H^{(i)})}^\top \left(H^{(i)} - Z^{(i-1)}\right) M_{L(H^{(i)})} \succeq 0. \tag{37}$$

This implies $0 \preceq H^{(i)} \preceq Z^{(i)} \preceq H^{(i+1)}$.

As the matrix sequence is upper bounded by $Y$ according to Lemma 7 and $0 \preceq H^{(i)} \preceq H^{(i+1)}$, the limit $H^{(\infty)}$ exists, i.e. the value iteration converges to

$$H^{(\infty)} = G + \gamma M_{L(H^{(\infty)})}^\top H^{(\infty)} M_{L(H^{(\infty)})}. \tag{38}$$

Furthermore, $L\left(H^{(\infty)}\right)$ minimizes $\hat{Q}_0^{(\infty)}$ according to Lemma 5. Thus,

$$\lim_{i \to \infty} \epsilon_k^{(i)} = c_k + \gamma \left(z_{k+1}^*\right)^\top H^{(\infty)} z_{k+1}^* - z_k^\top F\left(H^{(\infty)}, L\left(H^{(\infty)}\right)\right) z_k = 0. \tag{39}$$

Therefore, $H^{(i)} \to H^{(\infty)} = H$, i.e. $w^{(i)} \to w^*$ and $L^{(i)} \to L^*$, which completes the proof. □

5. Results

In order to validate our proposed method, we show results for two LQ tracking problems with unknown system dynamics. We furthermore compare our results with an RL tracking method which assumes that the reference can be described by a time-invariant exo-system $f_\mathrm{ref}$.

For excitation purposes (cf. (28)), the variance of the input excitation $\xi$ in (31) is set to $\Sigma = 0.1$ and Gaussian noise $\xi_\mathrm{ref} \sim \mathcal{N}(0, 0.1)$ is added to the reference trajectory which we intend to track while learning is active. We furthermore choose the number of data tuples as $M = 1.2L$, with $L$ according to (20), the stopping criterion $e_w = 1 \times 10^{-6}$, the discount factor $\gamma = 0.9$, the horizon on which the reference is known $N = 10$, and finally $x_0 = 0$ for all experiments.

5.1. Simulation Examples

The first system is a second-order rotatory mass-spring-damper system, whereas the second system is a sixth-order linear single-track steering model. Both systems were initially given in continuous time and hence discretized by means of a Tustin approximation with a sampling time of 1 s.
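This discretization step can be reproduced with SciPy's bilinear (Tustin) transform; the continuous-time matrices below are placeholders, not the paper's models:

```python
import numpy as np
from scipy.signal import cont2discrete

# Placeholder continuous-time model dx/dt = A_c x + B_c u (illustrative values).
A_c = np.array([[0.0, 1.0], [-2.0, -0.5]])
B_c = np.array([[0.0], [1.0]])
C_c, D_c = np.eye(2), np.zeros((2, 1))

# Tustin approximation with the paper's sampling time of 1 s.
A_d, B_d, *_ = cont2discrete((A_c, B_c, C_c, D_c), dt=1.0, method='bilinear')
```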
5.1.1. System 1

Consider a rotatory mass-spring-damper system modeled by the discrete-time second-order linear state space representation

$$x_{k+1} = \begin{bmatrix} 0.99 & 0.9 \\ -0.02 & 0.8 \end{bmatrix} x_k + \begin{bmatrix} 0.01 \\ 0.02 \end{bmatrix} u_k. \tag{40}$$

The control input $u_k$ is a torque command applied to the system. Note that this system is not known to the controller and only needed for simulation as well as validation purposes. Furthermore, let

$$Q = \operatorname{diag}(100, 0) \quad \text{and} \quad R = 1 \tag{41}$$

be the parameters of the cost function (3), i.e. we focus on tracking a reference angle $x_{1,\mathrm{ref}} = \alpha_\mathrm{ref}$ that is given on a moving horizon of length $N = 10$.

5.1.2. System 2

Our second example system is a sixth-order linear single-track steering model

$$x_{k+1} = A x_k + B u_k \tag{42}$$

with

$$A = \begin{bmatrix}
-0.741 & -0.033 & 0 & 0 & -2.0 \times 10^{-4} & -6.4 \times 10^{-3} \\
4.146 & -0.914 & 0 & 0 & 4.7 \times 10^{-3} & 0.151 \\
2.073 & 0.043 & 1 & 0 & 2.4 \times 10^{-3} & 0.076 \\
23.326 & 0.1062 & 0 & 1 & 0.022 & 0.693 \\
23.499 & -2.593 & 0 & 0 & -0.939 & -2.053 \\
11.749 & -1.297 & 0 & 0 & 0.031 & -0.027
\end{bmatrix}$$

and

$$B = \begin{bmatrix} -3.8 \times 10^{-4} & 0.0091 & 0.0046 & 0.041 & 0.117 & 0.059 \end{bmatrix}^\top,$$

where $u_k$ is a torque applied to the steering wheel by the controller. The parameters of the corresponding continuous-time model can be found in [16]. The system state $x_k$ is given by

$$x_k = \begin{bmatrix} \beta_k & \psi_{r,k} & \psi_k & y_k & \delta_{v,k} & \delta_k \end{bmatrix}^\top, \tag{43}$$

with sideslip angle $\beta_k$, yaw angle $\psi_k$, yaw rate $\psi_{r,k}$, lateral deviation from the origin $y_k$, steering wheel angle $\delta_k$ and angular velocity $\delta_{v,k}$. Geometric relations are depicted in Fig. 1, where $v$ is the constant velocity ($20\,\mathrm{m\,s^{-1}}$ in our example) and $\delta_s = 0.0625\,\delta$ denotes the steering angle, in contrast to the steering wheel angle $\delta$. As we desire to track the lateral position $x_{4,\mathrm{ref}} = y_\mathrm{ref}$, we choose

$$Q = \operatorname{diag}(0, 0, 0, 100, 0, 0) \quad \text{and} \quad R = 1. \tag{44}$$

[Figure 1: Geometric relations of the single-track model.]

5.2. Evaluation Method

In order to compare our method with the class of RL tracking algorithms where the reference trajectory is assumed to be generated by a time-invariant exo-system $f_\mathrm{ref}(r_k) = F_\mathrm{ref} r_k$, let both the reference angle $\alpha_\mathrm{ref}$ (system 1) and the reference lateral position $y_\mathrm{ref}$ (system 2) be generated by

$$r_{k+1} = \underbrace{\begin{bmatrix} 0.9801 & 0.1987 \\ -0.1987 & 0.9801 \end{bmatrix}}_{F_\mathrm{ref}} r_k, \tag{45}$$

$$\alpha_\mathrm{ref} = y_\mathrm{ref} = \begin{bmatrix} 1 & 0 \end{bmatrix} r_k \tag{46}$$

during the training procedure. For comparison reasons, we then train both our proposed method and an algorithm as in [11, 12] which assumes that the reference always follows the dynamics of an exo-system $f_\mathrm{ref}(r_k)$ (in the following termed comparison algorithm). The comparison algorithm is trained on the augmented system

$$x_{\mathrm{aug},k+1} = \begin{bmatrix} A & 0 \\ 0 & F_\mathrm{ref} \end{bmatrix} x_{\mathrm{aug},k} + \begin{bmatrix} B \\ 0 \end{bmatrix} u_k, \tag{47}$$

where $x_{\mathrm{aug},k} = \begin{bmatrix} x_k^\top & r_k^\top \end{bmatrix}^\top$. After the training, we vary $\alpha_\mathrm{ref} = y_\mathrm{ref}$ (arbitrary references such as different frequencies, ramps and steps) in order to show that the controller learned with our method successfully generalizes to these references. In contrast, we show the resulting behavior of the comparison algorithm, which is constructed on the assumption that the reference dynamics always follows (45).
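For reference, a sketch of the exo-system (45)-(46) and the augmented model (47) used to train the comparison algorithm (matrix values from the paper; the assembly helpers are our own):

```python
import numpy as np

F_ref = np.array([[0.9801, 0.1987],
                  [-0.1987, 0.9801]])   # Eq. (45): slowly rotating exo-system

def reference_step(r):
    """One exo-system step; the tracked output is the first component, Eq. (46)."""
    return F_ref @ r

def augmented_system(A, B):
    """Eq. (47): stack plant and exo-system for the comparison algorithm."""
    n = A.shape[0]
    A_aug = np.block([[A, np.zeros((n, 2))],
                      [np.zeros((2, n)), F_ref]])
    B_aug = np.vstack([B, np.zeros((2, B.shape[1]))])
    return A_aug, B_aug
```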
Our evaluation is twofold. On one hand, after the learning process has finished, we analyze the root mean square (RMS) tracking errors $\alpha_\mathrm{RMS}$ and $y_\mathrm{RMS}$ between the learned tracking behavior, by means of the trajectories $\alpha_\mathrm{learned}$ and $y_\mathrm{learned}$ where the system dynamics is unknown (both for our algorithm and the comparison algorithm), and the optimal solutions $\alpha_\mathrm{opt}$ and $y_\mathrm{opt}$ which result from Theorem 1 and known system dynamics. On the other hand, we compare the learned weights $w$ of our algorithm with the optimal solution $w^*$ (weights corresponding to Theorem 1 and Lemma 3). In order to achieve comparability for different ranges of $w$, we normalize the absolute error of each weight with the maximum absolute weight $\max_j \left|\{w^*\}_j\right|$ and define the average of this normalized absolute error by

$$e_\mathrm{I} = \frac{1}{L} \sum_{i=1}^{L} \frac{\left|\{w\}_i - \{w^*\}_i\right|}{\max_j \left|\{w^*\}_j\right|}. \tag{48}$$

Its maximum is given by

$$e_\mathrm{II} = \max_i \frac{\left|\{w\}_i - \{w^*\}_i\right|}{\max_j \left|\{w^*\}_j\right|}. \tag{49}$$

Note that these measures are only reasonable to judge how well our method has learned the unknown optimal weights $w^*$, as the comparison algorithm does not learn the $w^*$ corresponding to Theorem 1 and Lemma 3 but the optimal weights corresponding to (47).
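A direct implementation of the error measures (48)-(49) (our sketch):

```python
import numpy as np

def weight_errors(w, w_star):
    """Eqs. (48)-(49): mean (e_I) and max (e_II) element-wise weight error,
    both normalized by the largest optimal weight magnitude."""
    err = np.abs(w - w_star) / np.max(np.abs(w_star))
    return err.mean(), err.max()
```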
5.3. Results of the Adaptive Tracking Controller

The RMS tracking errors $\alpha_\mathrm{RMS}$ for system 1 and $y_\mathrm{RMS}$ for system 2 and the corresponding weight estimation errors $e_\mathrm{I}$ and $e_\mathrm{II}$ are given in Table 1.

Table 1: RMS tracking errors and weight estimation errors.

    system 1    proposed method    comparison alg. [11, 12]
    α_RMS       2.1 × 10^-3        1.55
    e_I         6.2 × 10^-5        –
    e_II        2.1 × 10^-3        –

    system 2    proposed method    comparison alg. [11, 12]
    y_RMS       2.9 × 10^-5        1.01
    e_I         2.1 × 10^-7        –
    e_II        1.4 × 10^-5        –

Plots of the corresponding tracking performances of our proposed method and the comparison method are given in Fig. 2 for system 1 and Fig. 3 for system 2. Here, the vertical dash-dotted lines indicate the weight update of our method (i.e. when $k = M$). The reference trajectories $\alpha_\mathrm{ref}$ and $y_\mathrm{ref}$ are depicted in gray. The black dashed lines show the optimal solutions $\alpha_\mathrm{opt}$ and $y_\mathrm{opt}$ calculated using full system knowledge, and the red lines the learned behavior $\alpha_\mathrm{learned,our}$ and $y_\mathrm{learned,our}$ without knowledge of the system matrices $A$ and $B$ when using our proposed method. The resulting tracking behavior $\alpha_\mathrm{learned,comparison}$ and $y_\mathrm{learned,comparison}$ of the comparison method [11, 12] is depicted in blue.

[Figure 2: Q-learning based tracking for system 1 (second-order rotatory mass-spring-damper model); time step $k$ vs. $\alpha = x_1$.]

[Figure 3: Q-learning based tracking for system 2 (sixth-order linear single-track steering model); time step $k$ vs. $y = x_4$.]

The weight estimation measures $e_\mathrm{I}$ and $e_\mathrm{II}$ in Table 1 indicate that for both systems the optimal Q-function weights have successfully been learned. The decay of $e_\mathrm{I}$ and $e_\mathrm{II}$ during the value iteration (with iteration index $i$ in Algorithm 1) is depicted in Fig. 4 (system 1) and Fig. 5 (system 2).

[Figure 4: Decay of weight estimation errors during learning for system 1 (second-order rotatory mass-spring-damper model). Here, $e_\mathrm{I}$ (48) denotes the mean and $e_\mathrm{II}$ (49) the maximum of the element-wise absolute error of $w$, both normalized with $\max_j |\{w^*\}_j|$.]

[Figure 5: Decay of weight estimation errors during learning for system 2 (sixth-order linear single-track steering model); $e_\mathrm{I}$ and $e_\mathrm{II}$ as in Fig. 4.]

5.4. Discussion

Under the excitation condition (28), our learning controller converges to the optimal control law according to Section 4.3. Comparing $\alpha_\mathrm{RMS}$ (system 1) and $y_\mathrm{RMS}$ (system 2) and considering Fig. 2, Fig. 3 and Fig. 6, it is obvious that our algorithm successfully tracks arbitrary references that have not been seen during training. State-of-the-art methods which assume that the reference follows a time-invariant exo-system $f_\mathrm{ref}$ (e.g. [11, 12]) are very effective as long as this assumption holds, but their tracking performance decreases as soon as the reference to track deviates from the sine described by (45)-(46). This behavior is hardly surprising, as these controllers were specifically designed under the assumption of time-invariant exo-system dynamics.

Due to the explicit dependency of our Q-function on the reference on a moving horizon $N$, the learned weights generalize to references that are unknown during the learning procedure, such as the ramps, steps and curves in the examples. Due to the optimal behavior according to (3), our proposed controller exhibits predictive rather than only reactive behavior, as can be seen in Fig. 6, which depicts a more detailed view of the step in Fig. 2.

[Figure 6: More detailed view of Fig. 2 (system 1) to visualize the predictive behavior of our controller due to the moving horizon $N$.]

We would further like to point out that exact knowledge of the structure of $H_K$, and thus $H$, which results from Theorem 1 is very beneficial if not vital for the Q-learning method to work efficiently, as it helps to reduce the number of weights that have to be learned. If one would only assume $H$ to be quadratic and symmetric, $L = 325$ (system 1) and $L = 2701$ (system 2) weights would have to be estimated. Considering Lemma 2, these numbers reduce to $L = 247$ (system 1) and $L = 2011$ (system 2), and exploiting the sparsity properties of $Q$ according to Lemma 3, $L = 84$ (system 1) and $L = 146$ (system 2) result, which renders the complexity tractable.

The choice of the moving horizon length $N$ depends on the available information regarding the reference to track. Furthermore, the larger $N$, the more predictive the learned controller will be, but the more unknown weights $w$ need to be learned (see (20)). Thus, an appropriate choice of $N$ obviously depends on the specific application.
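The weight counts quoted above can be reproduced from (18) and (20). A small check (our sketch; $q$ counts the all-zero rows/columns of $Q$, so $q = 1$ for system 1 and $q = 5$ for system 2):

```python
def num_weights_full(n, m, N):
    """Full symmetric parametrization of H (upper triangle)."""
    d = (N + 2) * n + m
    return d * (d + 1) // 2

def num_weights_lemma2(n, m, N):
    """Eq. (18): non-redundant entries after exploiting the zeros in Eq. (14)."""
    return num_weights_full(n, m, N) - n**2 * (2 * N - 1) - m * n

def num_weights_lemma3(n, m, N, q):
    """Eq. (20): further reduction for a sparse Q with q all-zero rows/columns."""
    return int((n - q) * (n - q + 1) * (N / 2 + 1) + n * (n + 1) / 2
               + (m + N * (n - q)) * (m + n)
               + (N - 2) * (N - 1) * (n - q)**2 / 2)

# System 1 (n=2, m=1, N=10, q=1):  325, 247, 84
print(num_weights_full(2, 1, 10), num_weights_lemma2(2, 1, 10),
      num_weights_lemma3(2, 1, 10, 1))
# System 2 (n=6, m=1, N=10, q=5): 2701, 2011, 146
print(num_weights_full(6, 1, 10), num_weights_lemma2(6, 1, 10),
      num_weights_lemma3(6, 1, 10, 5))
```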
6. Conclusion

In this paper, we proposed a new Reinforcement-Learning-based algorithm that is able to track an arbitrary reference trajectory which is given on a moving horizon while the system dynamics is unknown. In contrast to state-of-the-art methods that are based on RL respectively Adaptive Dynamic Programming, our method does not require the reference trajectory to be generated by an exo-system. It explicitly incorporates arbitrary reference values in a new Q-function. This Q-function, which is constructed such that its minimizing control is part of the solution of the optimal LQ tracking problem, generalizes to arbitrary reference trajectories given on a moving horizon. We showed that the analytical solution to this Q-function has a specific structure w.r.t. the current state and control as well as the reference on the given horizon. Based thereon, sparsity properties of the resulting structure were exploited in order to reduce the number of Q-function weights that have to be estimated. The temporal difference error of the reference-dependent Q-function serves as a target in order to learn the optimal tracking behavior online when the system dynamics is unknown. Here, the choice of basis functions is based on the findings regarding the specific structure of the analytical solution. We proved that this iterative algorithm converges to the optimal solution.

Appendix A. Proof of Theorem 1

Theorem 1 is vital for an efficient Q-function parametrization, as it yields the exact structure of the analytical solution of the novel reference-dependent Q-function, which reduces the number of function approximation weights needed (see Lemma 2 and Lemma 3 and the discussion regarding the number of weights to be learned in Section 5.4). Although the notation is complex, it is required to keep the technical proof correct. This is due to the fact that a clear distinction between the current time step $k$ and the time step $\kappa$ on the current optimization horizon $K$ starting at $k$ is required. The main idea of this proof is to use dynamic programming and prove the analytical solution of the Q-function by means of backwards induction. Writing down this dynamic programming procedure is cumbersome but recompenses by means of the exact structure of our new Q-function.

For $l = 1, \dots, d$, the $l$-th submatrix of a matrix $\Pi \in \mathbb{R}^{p \times nd}$ is defined as

$$\Pi[l] = \begin{bmatrix} \Pi(1, (l-1)n+1) & \cdots & \Pi(1, nl) \\ \vdots & \ddots & \vdots \\ \Pi(p, (l-1)n+1) & \cdots & \Pi(p, nl) \end{bmatrix}. \tag{A.1}$$

Furthermore, note that $I_n$ denotes the $n \times n$ identity matrix and $\zeta \in \mathbb{N}_0$ is a placeholder. Later, $\zeta$ will be replaced by $\eta$ (respectively $\eta + 1$ in the inductive step), where $\eta = K - \kappa$ denotes the remaining time steps on the horizon $K$. Further, $p$ is an index with $p \in \mathbb{N}$, $p > 1$. We define the following shorthand notations:

$$X_\zeta^0 = X^0 = \begin{bmatrix} I_n & -I_n \end{bmatrix},$$

$$X_\zeta^1 = \sqrt{\gamma}\left(-X^0[1] B G_\zeta \begin{bmatrix} F_\zeta & L_\zeta^{\zeta-1} & \cdots & L_\zeta^0 \end{bmatrix} + \begin{bmatrix} X^0[1] A & X^0[2] & 0 & \cdots & 0 \end{bmatrix}\right)$$

and

$$X_\zeta^p = \sqrt{\gamma}\left(-X_\zeta^{p-1}[1] B G_\zeta \begin{bmatrix} F_\zeta & L_\zeta^{\zeta-1} & \cdots & L_\zeta^0 \end{bmatrix} + \begin{bmatrix} X_\zeta^{p-1}[1] A & 0 & X_\zeta^{p-1}[2] & \cdots & X_\zeta^{p-1}[\zeta-1] \end{bmatrix}\right), \tag{A.2}$$

as well as

$$U_\zeta^1 = -G_\zeta \begin{bmatrix} F_\zeta & L_\zeta^{\zeta-1} & \cdots & L_\zeta^0 \end{bmatrix}$$

and

$$U_\zeta^p = \sqrt{\gamma}\left(-U_\zeta^{p-1}[1] B G_\zeta \begin{bmatrix} F_\zeta & L_\zeta^{\zeta-1} & \cdots & L_\zeta^0 \end{bmatrix} + \begin{bmatrix} U_\zeta^{p-1}[1] A & 0 & U_\zeta^{p-1}[2] & \cdots & U_\zeta^{p-1}[\zeta-1] \end{bmatrix}\right), \tag{A.3}$$

with

$$M_\zeta = \gamma B^\top \left(\sum_{i=0}^{\zeta-2} X_\zeta^i[1]^\top Q X_\zeta^i[1] + \sum_{i=1}^{\zeta-2} U_\zeta^i[1]^\top R U_\zeta^i[1]\right), \tag{A.4}$$

$$F_\zeta = M_\zeta A, \tag{A.5}$$

$$G_\zeta^{-1} = M_\zeta B + R, \tag{A.6}$$

$$L_\zeta^j = \begin{cases} \gamma B^\top X^0[1]^\top Q X^0[2], & \text{for } j = \zeta - 1, \\ \gamma B^\top \left(\sum\limits_{i=1}^{\zeta-2} X_\zeta^i[1]^\top Q X_\zeta^i[\zeta - j] + \sum\limits_{i=1}^{\zeta-2} U_\zeta^i[1]^\top R U_\zeta^i[\zeta - j]\right), & \text{for } j < \zeta - 1, \end{cases} \tag{A.7}$$

with $j \in \mathbb{N}_0$. Let furthermore

$$\rho_0^\kappa = \begin{bmatrix} x_{k_\kappa}^\top & r_{k_\kappa}^\top \end{bmatrix} X^{0\top}, \tag{A.8}$$

$$\rho_1^\kappa = \begin{bmatrix} x_{k_\kappa}^\top & u_{k_\kappa}^\top & r_{k_\kappa+1}^\top \end{bmatrix} \begin{bmatrix} X^0[1] A & X^0[1] B & X^0[2] \end{bmatrix}^\top, \tag{A.9}$$
$$\mu_i^\kappa = \begin{bmatrix} x_{k_\kappa}^\top & u_{k_\kappa}^\top & r_{k_\kappa+2}^\top & \cdots & r_{k+K}^\top \end{bmatrix} \begin{bmatrix} U_\eta^{\eta-i}[1] A & U_\eta^{\eta-i}[1] B & U_\eta^{\eta-i}[2] & \cdots & U_\eta^{\eta-i}[\eta] \end{bmatrix}^\top, \tag{A.10}$$

$$\chi_i^\kappa = \begin{bmatrix} x_{k_\kappa}^\top & u_{k_\kappa}^\top & r_{k_\kappa+2}^\top & \cdots & r_{k+K}^\top \end{bmatrix} \begin{bmatrix} X_\eta^{\eta-i}[1] A & X_\eta^{\eta-i}[1] B & X_\eta^{\eta-i}[2] & \cdots & X_\eta^{\eta-i}[\eta] \end{bmatrix}^\top, \tag{A.11}$$

where $k_\kappa = k + K - \eta = k + \kappa$ and $i \in \mathbb{N}$.

Proof. In the first step, we use backwards induction to prove that the Q-function $Q_\kappa^K$ (cf. Definition 1) for system (1) with the objective function (3) is given by

$$Q_\kappa^K = \tfrac{1}{2}\left(\rho_0^\kappa Q (\rho_0^\kappa)^\top + u_{k+\kappa}^\top R u_{k+\kappa} + \gamma \rho_1^\kappa Q (\rho_1^\kappa)^\top + \gamma \sum_{i=1}^{K-\kappa-2} \left(\chi_i^\kappa Q (\chi_i^\kappa)^\top + \mu_i^\kappa R (\mu_i^\kappa)^\top\right)\right). \tag{A.12}$$

Starting from $Q_K^K$ (cf. (7)), $u_{k+K}^* = 0$ directly follows from Definition 1 and

$$\frac{\partial Q_K^K}{\partial u_{k+K}}\bigg|_{u_{k+K}^*} = 0, \quad \frac{\partial^2 Q_K^K}{\partial u_{k+K}^2}\bigg|_{u_{k+K}^*} = R \succ 0. \tag{A.13}$$

Then, by iterating backwards in time, applying (4) and the system dynamics (1), with $\eta = K - \kappa$, (A.12) can be shown to hold for $\eta = 0, 1, 2$, i.e. $\kappa = K, K-1, K-2$. Furthermore,

$$u_{k+\kappa}^* = -G_\eta \left(F_\eta x_{k+\kappa} + \sum_{j=0}^{\eta-1} L_\eta^j r_{k+K-j}\right) \tag{A.14}$$

minimizes (A.12) because $\frac{\partial Q_\kappa^K}{\partial u_{k+\kappa}}\big|_{u_{k+\kappa}^*} = 0$, where

$$\frac{\partial^2 Q_\kappa^K}{\partial u_{k+\kappa}^2}\bigg|_{u_{k+\kappa}^*} \succ 0 \tag{A.15}$$

is guaranteed as $R \succ 0$ and $Q \succeq 0$. The induction hypothesis $Q_{\kappa-1}^K$ (see (A.12) with $\kappa \to \kappa - 1$) is then proven in the inductive step. This is done by representing $Q_{\kappa-1}^K$ by means of (4) and utilizing $u_{k+\kappa}^*$ from (A.14). This yields

$$Q_{\kappa-1}^K = \tfrac{1}{2} \begin{bmatrix} x_{k+\kappa-1} \\ r_{k+\kappa-1} \end{bmatrix}^\top X^{0\top} Q X^0 \begin{bmatrix} x_{k+\kappa-1} \\ r_{k+\kappa-1} \end{bmatrix} + \tfrac{1}{2} u_{k+\kappa-1}^\top R u_{k+\kappa-1} + \tfrac{1}{2}\gamma \begin{bmatrix} x_{k+\kappa} \\ r_{k+\kappa} \end{bmatrix}^\top X^{0\top} Q X^0 \begin{bmatrix} x_{k+\kappa} \\ r_{k+\kappa} \end{bmatrix} + \tfrac{1}{2}\gamma\, \bar{z}_k^\top \sum_{i=1}^{\eta-1} \left(X_{\eta+1}^{i\top} Q X_{\eta+1}^i + U_{\eta+1}^{i\top} R U_{\eta+1}^i\right) \bar{z}_k, \tag{A.16}$$

where $\bar{z}_k^\top = \begin{bmatrix} x_{k+\kappa}^\top & r_{k+\kappa+1}^\top & \cdots & r_{k+K}^\top \end{bmatrix}$. Then, replace $x_{k+\kappa} = A x_{k+\kappa-1} + B u_{k+\kappa-1}$ (cf. (1)) in (A.16), which results in

$$Q_{\kappa-1}^K = \tfrac{1}{2}\left(\rho_0^{\kappa-1} Q (\rho_0^{\kappa-1})^\top + u_{k+\kappa-1}^\top R u_{k+\kappa-1} + \gamma \rho_1^{\kappa-1} Q (\rho_1^{\kappa-1})^\top + \gamma \sum_{i=1}^{K-(\kappa-1)-2} \left(\chi_i^{\kappa-1} Q (\chi_i^{\kappa-1})^\top + \mu_i^{\kappa-1} R (\mu_i^{\kappa-1})^\top\right)\right) \tag{A.17}$$

and yields the induction hypothesis ((A.12) with $\kappa \to \kappa - 1$) and thus proves (A.12). Thus, the analytical solution of $Q_0^K$ is quadratic w.r.t. $\rho_0^\kappa$, $u_k$, $\rho_1^\kappa$, $\chi_i^\kappa$ and $\mu_i^\kappa$. As each of these components is linear w.r.t. $x_k$, $u_k$ and $r_k, \dots, r_{k+N}$ according to (A.8)-(A.11), Theorem 1 follows directly for $\kappa = 0$ and $K \geq N$. □

Appendix B. Proof of Lemma 4

Proof. Let $v(\cdot)$ be a function that transforms a symmetric square matrix to a vector such that $v\left(H^{(i)}\right) = w^{(i)}$. With

$$c_k + \gamma w^{(i)\top} \phi\left(z_{k+1}^{*(i)}\right) = \tfrac{1}{2} z_k^\top \underbrace{\left(G + \gamma M_{L^{(i)}}^\top H^{(i)} M_{L^{(i)}}\right)}_{= H^{(i+1)}} z_k, \tag{B.1}$$

(26), (27) and the definition of $\phi$ according to Lemma 3, it follows that (29) can be written as

$$w^{(i+1)} = \left(\Phi^\top \Phi\right)^{-1} \Phi^\top \begin{bmatrix} \tfrac{1}{2} z_{k-M+1}^\top H^{(i+1)} z_{k-M+1} \\ \vdots \\ \tfrac{1}{2} z_k^\top H^{(i+1)} z_k \end{bmatrix} = \underbrace{\left(\Phi^\top \Phi\right)^{-1} \left(\Phi^\top \Phi\right)}_{= I} v\left(H^{(i+1)}\right). \tag{B.2}$$

Thus, as $H^{(i+1)}$ is symmetrically constructed from $w^{(i+1)}$, it follows from (B.2) that iterating on $w^{(i)}$ by means of the value iteration is equivalent to iterating on (32). □

Appendix C. Proof of Lemma 6

Proof. According to Lemma 5, the control law $L\left(Z^{(i)}\right)$ minimizes $z_{k+1}^\top Z^{(i)} z_{k+1}$, $\forall i > 0$. Thus,

$$z_k^\top M_{W^{(i)}}^\top Z^{(i)} M_{W^{(i)}} z_k \geq z_k^\top M_{L(Z^{(i)})}^\top Z^{(i)} M_{L(Z^{(i)})} z_k \tag{C.1}$$

follows. This yields

$$M_{W^{(i)}}^\top Z^{(i)} M_{W^{(i)}} - M_{L(Z^{(i)})}^\top Z^{(i)} M_{L(Z^{(i)})} \succeq 0 \tag{C.2}$$

and hence

$$Z^{(i+1)} = F\left(Z^{(i)}, W^{(i)}\right) \succeq F\left(Z^{(i)}, L\left(Z^{(i)}\right)\right) =: \hat{Z}^{(i+1)}. \tag{C.3}$$

This also implies

$$F\left(H^{(i)}, L\left(Z^{(i)}\right)\right) \succeq F\left(H^{(i)}, L\left(H^{(i)}\right)\right) = H^{(i+1)}. \tag{C.4}$$
With $0 \preceq H^{(0)} \preceq Z^{(0)}$ and the induction hypothesis $H^{(i)} \preceq Z^{(i)}$,

$$F\left(H^{(i)}, L\left(Z^{(i)}\right)\right) = G + \gamma M_{L(Z^{(i)})}^\top H^{(i)} M_{L(Z^{(i)})} \preceq G + \gamma M_{L(Z^{(i)})}^\top Z^{(i)} M_{L(Z^{(i)})} = F\left(Z^{(i)}, L\left(Z^{(i)}\right)\right) = \hat{Z}^{(i+1)} \tag{C.5}$$

follows. With (C.4), this yields $H^{(i+1)} \preceq \hat{Z}^{(i+1)}$, and incorporating (C.3) completes the proof. □

Appendix D. Proof of Lemma 7

Proof. Let $Z^{(0)} = H^{(0)}$ and $Z^{(i+1)} = F\left(Z^{(i)}, \tilde{L}\right)$, where $\tilde{L}$ is chosen such that all eigenvalues of $A + B\tilde{L}_x$ are inside the unit circle. Note that the existence of $\tilde{L}$ is guaranteed due to $(A, B)$ controllable. With $W^{(i)} = \tilde{L}$ in Lemma 6, $0 \preceq H^{(i)} \preceq Z^{(i)}$ holds. With

$$Z^{(i+1)} - Z^{(i)} = F\left(Z^{(i)}, \tilde{L}\right) - F\left(Z^{(i-1)}, \tilde{L}\right) = \gamma M_{\tilde{L}}^\top \left(Z^{(i)} - Z^{(i-1)}\right) M_{\tilde{L}}, \tag{D.1}$$

$\operatorname{vec}(\cdot)$ stacking the columns of a matrix and $\otimes$ being the Kronecker product,

$$\operatorname{vec}\left(Z^{(i+1)} - Z^{(i)}\right) = \underbrace{\gamma\left(M_{\tilde{L}}^\top \otimes M_{\tilde{L}}^\top\right)}_{= E} \operatorname{vec}\left(Z^{(i)} - Z^{(i-1)}\right) \tag{D.2}$$

follows, thus

$$\operatorname{vec}\left(Z^{(i)} - Z^{(i-1)}\right) = E^{i-1} \operatorname{vec}\left(Z^{(1)} - Z^{(0)}\right). \tag{D.3}$$

If all eigenvalues of $\sqrt{\gamma}\, M_{\tilde{L}}$ are inside the unit circle, this also holds for the eigenvalues of $E$. Due to its specific structure (cf. (34)), $(N+1)n$ eigenvalues of $M_{\tilde{L}}$ are at the origin. Therefore, consider the remaining eigenvalues, i.e. the eigenvalues of $D = \begin{bmatrix} A & B \\ \tilde{L}_x A & \tilde{L}_x B \end{bmatrix}$ (cf. [22, Lemma B.1.2]). Let $\lVert \cdot \rVert$ be the spectral norm of a matrix, respectively the Euclidean norm of a vector. Then,

$$\lim_{i \to \infty} \left\lVert D^i \right\rVert = \lim_{i \to \infty} \left\lVert \begin{bmatrix} I_n \\ \tilde{L}_x \end{bmatrix} \left(A + B\tilde{L}_x\right)^{i-1} \begin{bmatrix} A & B \end{bmatrix} \right\rVert \leq \lim_{i \to \infty} \left\lVert \begin{bmatrix} I_n \\ \tilde{L}_x \end{bmatrix} \right\rVert \left\lVert \begin{bmatrix} A & B \end{bmatrix} \right\rVert \left\lVert \left(A + B\tilde{L}_x\right)^{i-1} \right\rVert = 0. \tag{D.4}$$

As $\lim_{i \to \infty} D^i = 0$, all eigenvalues of $D$ are inside the unit circle. Hence, all eigenvalues of $E$ are also inside the unit circle and $e = \lVert E \rVert < 1$. With

$$\operatorname{vec}\left(Z^{(j)}\right) = \operatorname{vec}\left(Z^{(0)}\right) + \sum_{i=1}^{j} \operatorname{vec}\left(Z^{(i)} - Z^{(i-1)}\right) \overset{\text{(D.3)}}{=} \operatorname{vec}\left(Z^{(0)}\right) + \sum_{i=1}^{j} E^{i-1} \operatorname{vec}\left(Z^{(1)} - Z^{(0)}\right) \tag{D.5}$$

follows

$$\left\lVert \operatorname{vec}\left(Z^{(j)}\right) \right\rVert \leq \left\lVert \operatorname{vec}\left(Z^{(0)}\right) \right\rVert + \sum_{i=0}^{\infty} \lVert E \rVert^i \left\lVert \operatorname{vec}\left(Z^{(1)} - Z^{(0)}\right) \right\rVert = e_0, \tag{D.6}$$

where the upper bound $e_0$ is independent of $j$. As $\lVert \operatorname{vec}(Z^{(j)}) \rVert$ is upper bounded by $e_0$, there exists $e_1$ such that $\lVert Z^{(j)} \rVert \leq e_1$, $\forall j$. With $Y = e_1 I_{\dim(H)}$,

$$0 \preceq H^{(i)} \preceq Z^{(i)} \preceq \left\lVert Z^{(i)} \right\rVert I_{\dim(H)} \preceq e_1 I_{\dim(H)} = Y$$

results, which completes the proof. □

References

[1] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, 2nd Edition, MIT Press, Cambridge, Massachusetts, 2018.
[2] D. Wang, H. He, D. Liu, Adaptive critic nonlinear robust control: A survey, IEEE Transactions on Cybernetics 47 (10) (2017) 3429-3451.
[3] B. Luo, D. Liu, H. Wu, D. Wang, F. L. Lewis, Policy gradient adaptive dynamic programming for data-based optimal control, IEEE Transactions on Cybernetics 47 (10) (2017) 3341-3354.
[4] R. Kamalapurkar, P. Walters, W. E. Dixon, Model-based reinforcement learning for approximate optimal regulation, Automatica 64 (2016) 94-104.
[5] K. G. Vamvoudakis, Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach, Systems & Control Letters 100 (2017) 14-20.
[6] T. Wang, H. Zhang, Y. Luo, Infinite-time stochastic linear quadratic optimal control for unknown discrete-time systems using adaptive dynamic programming approach, Neurocomputing 171 (2016) 379-386.
[7] Q. Wei, D. Liu, Q. Lin, R. Song, Discrete-time optimal control via local policy iteration adaptive dynamic programming, IEEE Transactions on Cybernetics 47 (10) (2017) 3367-3379.
[8] T. Dierks, S. Jagannathan, Optimal tracking control of affine nonlinear discrete-time systems with unknown internal dynamics, in: 2009 Joint 48th IEEE Conference on Decision and Control (CDC) and 28th Chinese Control Conference (CCC), 2009, pp. 6750-6755.
[9] H. Modares, F. L. Lewis, Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning, IEEE Transactions on Automatic Control 59 (11) (2014) 3051-3056.
[10] K. Zhang, H. Zhang, G. Xiao, H. Su, Tracking control optimization scheme of continuous-time nonlinear system via online single network adaptive critic design method, Neurocomputing 251 (2017) 127-135.
[11] B. Luo, D. Liu, T. Huang, D. Wang, Model-free optimal tracking control via critic-only Q-learning, IEEE Transactions on Neural Networks and Learning Systems 27 (10) (2016) 2134-2144.
[12] B. Kiumarsi, F. L. Lewis, H. Modares, A. Karimpour, M.-B. Naghibi-Sistani, Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics, Automatica 50 (4) (2014) 1167-1175.
[13] Y. Lv, J. Na, Q. Yang, G. Herrmann, Adaptive optimal tracking control of unknown nonlinear systems using system augmentation, in: 2016 International Joint Conference on Neural Networks (IJCNN), 2016, pp. 3516-3521.
[14] C. Mu, Z. Ni, C. Sun, H. He, Data-driven tracking control with adaptive dynamic programming for a class of continuous-time nonlinear systems, IEEE Transactions on Cybernetics 47 (6) (2017) 1460-1470.
[15] B. Kiumarsi, F. L. Lewis, M.-B. Naghibi-Sistani, A. Karimpour, Optimal tracking control of unknown discrete-time linear systems using input-output measured data, IEEE Transactions on Cybernetics 45 (12) (2015) 2770-2779.
[16] F. Köpf, S. Ebbert, M. Flad, S. Hohmann, Adaptive dynamic programming for cooperative control with incomplete information, in: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2018.
[17] H. Modares, I. Ranatunga, F. L. Lewis, D. O. Popa, Optimized assistive human-robot interaction using reinforcement learning, IEEE Transactions on Cybernetics 46 (3) (2016) 655-667.
[18] C. Mu, C. Sun, D. Wang, A. Song, Adaptive tracking control for a class of continuous-time uncertain nonlinear systems using the approximate solution of HJB equation, Neurocomputing 260 (2017) 432-442.
[19] S. Bernhard, J. Adamy, LQ optimal tracking with unbounded cost for unknown dynamics: An adaptive dynamic programming approach, in: European Control Conference, 2018.
[20] B. Kiumarsi, F. L. Lewis, D. S. Levine, Optimal control of nonlinear discrete time-varying systems using a new neural network approximation structure, Neurocomputing 156 (2015) 157-165.
[21] S. J. Bradtke, B. E. Ydstie, A. G. Barto, Adaptive linear quadratic control using policy iteration, in: Proceedings of the 1994 American Control Conference, 1994, pp. 3475-3479.
[22] T. Landelius, Reinforcement learning and distributed local model synthesis, Ph.D. thesis, Linköping University Electronic Press (1997).
[23] R. S. Sutton, Learning to predict by the methods of temporal differences, Machine Learning 3 (1) (1988) 9-44.
[24] K. J. Åström, B. Wittenmark, Adaptive Control, 2nd Edition, Addison-Wesley, Reading, Mass., 1995.
[25] K. S. Narendra, A. M. Annaswamy, Persistent excitation in adaptive systems, International Journal of Control 45 (1) (1987) 127-160.
[26] F. Lewis, D. Vrabie, Reinforcement learning and adaptive dynamic programming for feedback control, IEEE Circuits and Systems Magazine 9 (3) (2009) 32-50.
[27] C. J. Watkins, P. Dayan, Q-learning, Machine Learning 8 (3-4) (1992) 279-292.
[28] A. Al-Tamimi, F. L. Lewis, M. Abu-Khalaf, Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control, Automatica 43 (3) (2007) 473-481.