Task Decomposition for Iterative Learning Model Predictive Control
A task decomposition method for iterative learning model predictive control is presented. We consider a constrained nonlinear dynamical system and assume the availability of state-input pair datasets which solve a task T1. Our objective is to find a feasible model predictive control policy for a second task, T2, using stored data from T1. Our approach applies to tasks T2 which are composed of subtasks contained in T1. In this paper we propose a formal definition of subtasks and the task decomposition problem, and provide proofs of feasibility and iteration cost improvement over simple initializations. We demonstrate the effectiveness of the proposed method on autonomous racing and robotic manipulation experiments.
Authors: Charlott Vallon, Francesco Borrelli
This research was sustained in part by fellowship support from the National Physical Science Consortium and the National Institute of Standards and Technology. Charlott Vallon and Francesco Borrelli are with the Department of Mechanical Engineering, University of California, Berkeley, CA 94701, USA. {charlott, fborrelli}@berkeley.edu

I. INTRODUCTION

Control design for systems repeatedly performing a single task has been studied extensively. Such problems arise frequently in practical applications [1], [2], and examples range from autonomous cars racing around a track [3]–[5] to robotic manipulators [6]–[9]. Iterative Learning Controllers (ILCs) aim to autonomously improve a system's closed-loop reference tracking performance at each iteration of a repeated task, while rejecting periodic disturbances [1], [10], [11]. In classical ILC, the controller uses tracking error data from previous task iterations to better track a provided reference trajectory during the current iteration. Recent work has also explored reference-free ILC strategies for tasks whose goals are better defined in terms of an economic metric rather than a reference trajectory. The controller again uses previous iteration data to improve closed-loop performance with respect to the chosen performance metric. Examples include autonomous racing tasks (e.g. "minimize lap time") [5], [12], or optimizing flight paths for tethered energy-harvesting systems (e.g. "maximize average power generation") [13].

The aforementioned iterative learning methods require either a reference trajectory to track (classical ILC) or a feasible trajectory with which to initialize the iterative control algorithm (reference-free ILC). If the task changes, a new trajectory needs to be designed to match the new task. This can be challenging for complex tasks.

Many methods exist to use data collected from a task to efficiently solve variations of that task, including model-based and model-free methods. Here, we focus specifically on model-based methods for using stored trajectories from previous tasks in order to find feasible trajectories for new tasks. The authors in [14] propose running a desired planning method in parallel with a retrieve-and-repair algorithm that adapts reference trajectories from previous tasks to the constraints of a new task. Retrieve and repair was shown to decrease overall planning time, but requires checking for constraint violations at each point along a retrieved trajectory. In [15], environment features are used to divide a task and create a library of local trajectories in relative state space frames. These trajectories are then pieced back together in real time according to the features of the new task environment. A trajectory library built using differential dynamic programming is used in [16] to design a controller for balance control in a humanoid robot.
At each time step, a trajectory is selected from the library based on current task parameter estimates and a k-nearest neighbor selection scheme. A similar method is explored in [17], where differential dynamic programming is combined with receding horizon control. While these methods can decrease planning time, they verify or interpolate saved trajectories at every time step, which can be inefficient and unnecessary. The authors in [18] propose piecing together stored trajectories corresponding to discrete system dynamics only at states of dynamics transition. However, this method only applies to discontinuities in system dynamics, and does not generalize to other task variations. In [19], a static map is learned between a given reference trajectory and the input sequence required to make a linear time-invariant system track that reference trajectory. Once learned, this map can be used to determine an input sequence that lets the linear system track a new reference trajectory.

In this paper, our objective is to find a feasible trajectory to smartly initialize an Iterative Learning Model Predictive Controller (ILMPC) [20] for a new task, using data from previous tasks. ILMPC is a type of reference-free ILC that uses a safe set to design a model predictive control (MPC) policy for an iterative control task. The ILMPC safe set is initialized using a feasible task trajectory. We consider a constrained nonlinear dynamical system and assume the availability of a dataset containing states and inputs corresponding to multiple iterations of a task T1. This dataset can be stored explicitly (e.g. by human demonstrations [21] or an iterative controller [20]) or generated by roll-out of a given policy (e.g. a hand-tuned controller).
We introduce a Task Decomposition for ILMPC algorithm (TDMPC), and show how to use the stored T1 dataset to efficiently construct a non-empty ILMPC safe set for task T2 (a new variation of T1), containing feasible trajectories for T2. The contributions of this paper are twofold:

1) We first present how to build the aforementioned T2 safe set using the TDMPC algorithm. TDMPC reduces the complexity of adapting trajectories from T1 to a new task T2 by decomposing task T1 into different modes of operation, called subtasks. The stored T1 trajectories are adapted to T2 only at points of subtask transition, by solving one-step controllability problems.

2) We prove that the resulting safe set based ILMPC policy is feasible for T2, and that the corresponding closed-loop trajectories have lower iteration cost compared to an ILMPC initialized using simple methods.

II. PROBLEM DEFINITION

A. Safe Set Based ILMPC

We consider a system

    x_{k+1} = f(x_k, u_k),    (1)

where f(x_k, u_k) is the dynamical model, subject to the constraints

    x_k ∈ X,  u_k ∈ U.    (2)

The vectors x_k and u_k collect the states and inputs at time step k. We pick a set P ⊂ X to be the target set for an iterative task T, performed repeatedly by system (1) and defined by the tuple

    T = {X, U, P}.    (3)

Assumption 1: P is a control invariant set [22, Sec 10.9]:

    ∀ x_k ∈ P, ∃ u_k ∈ U : x_{k+1} = f(x_k, u_k) ∈ P.

Each task execution is referred to as an iteration. The goal of an ILMPC is to solve at each iteration the optimal task completion problem:

    V*_{0→T}(x_0) = min_{T, u_0, ..., u_{T-1}}  Σ_{k=0}^{T} h(x_k, u_k)    (4)
                    s.t.  x_{k+1} = f(x_k, u_k),
                          x_k ∈ X, u_k ∈ U  ∀ k ≥ 0,
                          x_T ∈ P,

where V*_{0→T}(x_0) is the optimal cost-to-go from the initial state x_0, and h(x_k, u_k) is a chosen stage cost.
At the j-th successful task iteration, the vectors

    E^j(T) = [x^j, u^j],    (5a)
    x^j = [x^j_0, x^j_1, ..., x^j_{T^j}],  x^j_k ∈ X ∀ k ∈ [0, T^j],  x^j_{T^j} ∈ P,    (5b)
    u^j = [u^j_0, u^j_1, ..., u^j_{T^j}],  u^j_k ∈ U ∀ k ∈ [0, T^j],    (5c)

collect the inputs applied to system (1) and the corresponding state evolution. In (5), x^j_k and u^j_k denote the system state and control input at time k of the j-th iteration, and T^j is the duration of the j-th iteration. After J iterations, we define the sampled safe state set and sampled safe input set as:

    SS^J = ∪_{j=1}^{J} x^j,   SU^J = ∪_{j=1}^{J} u^j,    (6)

where SS^J contains all states visited by the system in previous task iterations, and SU^J the corresponding inputs applied at each of these states. Hence, by construction of the safe set, for any state in SS^J there exists a feasible input sequence contained in SU^J to reach the goal set P while satisfying state and input constraints (2). Similarly, we define the sampled cost set as:

    SQ^J = ∪_{j=1}^{J} q^j,    (7a)
    q^j = [V^j(x^j_0), V^j(x^j_1), ..., V^j(x^j_{T^j})],    (7b)

where V^j(x^j_k) is the realized cost-to-go from state x^j_k at time step k of the j-th task execution:

    V^j(x^j_k) = Σ_{i=k}^{T^j} h(x^j_i, u^j_i).    (8)

The safe set based ILMPC policy tries to solve (4) by using state and input data collected during past task iterations, stored in the sampled safe sets. At time k of iteration J+1, we solve the optimal control problem:

    V^{ILMPC,J+1}(x^{J+1}_k) = min_{u_{k|k}, ..., u_{k+N-1|k}}  Σ_{t=k}^{k+N-1} h(x_{t|k}, u_{t|k}) + V^J(x_{k+N|k})    (9)
                               s.t.  x_{t+1|k} = f(x_{t|k}, u_{t|k}),  ∀ t ∈ [k, k+N-1],
                                     x_{t|k} ∈ X,  u_{t|k} ∈ U,  ∀ t ∈ [k, k+N-1],
                                     x_{k|k} = x^{J+1}_k,
                                     x_{k+N|k} ∈ SS^J ∪ P,

which searches for an input sequence over a chosen planning horizon N that controls the system (1) to the state in the sampled safe state set or task target set P with the lowest cost-to-go (8).
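As an illustration, the realized cost-to-go (8) and the entries of the sampled cost set (7) can be computed from one stored iteration by a single backward pass. This is a minimal sketch, not the authors' implementation; the toy trajectory, stage cost, and target-set test below are illustrative stand-ins.

```python
# Sketch: computing the realized cost-to-go (8) for every time step of a
# stored iteration, i.e. V^j(x_k) = sum_{i=k}^{T} h(x_i, u_i).

def cost_to_go(states, inputs, h):
    """Backward accumulation of the stage cost along a stored trajectory."""
    T = len(states) - 1
    V = [0.0] * (T + 1)
    running = h(states[T], inputs[T])
    V[T] = running
    for k in range(T - 1, -1, -1):
        running += h(states[k], inputs[k])
        V[k] = running
    return V

# Toy example with the minimum-time stage cost used later in the paper:
# h = 0 inside the target set, 1 otherwise.
in_target = lambda x: x >= 10.0
h_time = lambda x, u: 0.0 if in_target(x) else 1.0

states = [0.0, 3.0, 7.0, 10.0]   # 1-D trajectory ending in the target set
inputs = [1.0, 1.0, 1.0, 0.0]
V = cost_to_go(states, inputs, h_time)
# V[0] counts the 3 time steps spent outside the target set.
```

Stacking these vectors over all J stored iterations yields the sampled cost set SQ^J in (7).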
We then apply a receding horizon strategy:

    u(x^j_k) = π^{ILMPC}(x^j_k) = u*_{k|k}.    (10)

System (1) in closed-loop with (10) leads to a feasible task execution if x_0 ∈ SS^J. At each time step, the ILMPC policy searches for the optimal input based on previous task data, leading to performance improvement on the task as the sampled safe sets continue to grow with each subsequent iteration. For details on ILMPC, we refer to [23].

The sampled safe sets used in the ILMPC policy (10) must first be initialized to contain at least one feasible task execution. The aim of TDMPC is to use data collected from a task T1 in order to efficiently find such an execution for a new task T2. This induces an ILMPC policy (10) that can be used to directly solve T2 or to initialize an ILMPC for T2. We approach this using subtasks, formalized in Sec. II-B, and the concept of controllability.

Definition: A system (1) is N-step controllable from an initial state x_0 to a terminal state x_P if there exists an input sequence [u_0, u_1, ..., u_{N-1}] ∈ U such that the corresponding state trajectory satisfies state constraints (2) and x_N = x_P. A system is controllable from x_0 to x_P if there exists an N > 0 such that the system is N-step controllable to x_P [22, Sec 10.1].

B. Subtasks

Consider an iterative task T (3) and a sequence of M subtasks, where the i-th subtask S_i is the tuple

    S_i = {X_i, U_i, R_i}.    (11)

We take X_i ⊆ X as the subtask workspace, U_i ⊆ U as the subtask input space, and R_i ⊆ X_i as the set of transition states from the current subtask workspace X_i into the subsequent workspace X_{i+1}:

    R_i = { x ∈ X_i : ∃ u ∈ U_i, f(x, u) ∈ X_{i+1} }.

A successful subtask execution E(S_i) of a subtask S_i is a trajectory of inputs and corresponding states evolving according to (1) while respecting state and input constraints (2), ending in the transition set.
We define the j-th successful execution of subtask S_i as

    E^j(S_i) = [x^j_i, u^j_i],    (12a)
    x^j_i = [x^j_0, x^j_1, ..., x^j_{T^j_i}],  x^j_k ∈ X_i ∀ k ∈ [0, T^j_i],  x^j_{T^j_i} ∈ R_i,    (12b)
    u^j_i = [u^j_0, u^j_1, ..., u^j_{T^j_i}],  u^j_k ∈ U_i ∀ k ∈ [0, T^j_i],    (12c)

where the vectors u^j_i and x^j_i collect the inputs applied to the system (1) and the resulting states, respectively, and x^j_k and u^j_k denote the system state and the control input at time k of subtask execution j. T^j_i is the duration of the j-th execution of subtask i. The final state of each successful subtask execution is in the subtask transition set, from which it can evolve into new subtasks. For the sake of notational simplicity, we have written all subtask executions as beginning at time step k = 0.

We say the task T is an ordered sequence of the M subtasks (T = {S_i}_{i=1}^{M}) if the j-th successful task execution (5) is equal to the concatenation of successful subtask executions:

    E^j(T) = [E^j(S_1), E^j(S_2), ..., E^j(S_M)] = [x^j, u^j],
    x^j = [x^j_1, x^j_2, ..., x^j_M],  u^j = [u^j_1, u^j_2, ..., u^j_M],
    f(x^j_{T^j[1→i]}, u^j_{T^j[1→i]}) ∈ X_{i+1},  i ∈ [1, M−1],
    x^j_{T^j[1→M]} ∈ R_M,

where T^j[1→i] is the duration of the first i subtasks during the j-th task iteration. When the state reaches a subtask transition set, the system has completed subtask S_i and transitions into the following subtask S_{i+1}. The task is completed when the system reaches the last subtask's transition set, R_M, which we consider as the task's control invariant target set (referred to as P in the previous section).

III. TASK DECOMPOSITION FOR ILMPC

Here we describe the intuition behind TDMPC and provide an algorithm for the method. We also prove feasibility and iteration cost reduction of policies output by TDMPC.

A. TDMPC
Let Task 1 and Task 2 be different ordered sequences of the same M subtasks:

    T1 = {S_i}_{i=1}^{M},  T2 = {S_{l_i}}_{i=1}^{M},    (14)

where the sequence [l_1, l_2, ..., l_M] is a reordering of the sequence [1, 2, ..., M]. Assume non-empty sampled safe sets SS^J[1→M], SU^J[1→M], and SQ^J[1→M] (6), (7) containing executions that solve T1. The goal of TDMPC is to use the executions stored in the sampled safe sets from Task 1 in order to find a feasible execution for Task 2, ending in the new target set R_{l_M}.

The key intuition of the method is that all successful subtask executions from Task 1 are also successful subtask executions for Task 2, since this definition only depends on properties (12) of the subtask itself, not on the subtask sequence. Therefore, only stored points at subtask transitions need to be analyzed. Based on this intuition, Algorithm 1 proceeds backwards through the new subtask sequence [l_1, l_2, ..., l_M]. The key steps are discussed below.

• Consider subtask S_{l_M}. All states from S_{l_M} stored in the Task 1 executions are controllable to R_{l_M} using stored inputs, i.e. there exists a stored input sequence that can be applied to the state such that the system evolves to be in R_{l_M}. We next look for stored states from the preceding subtask, S_{l_{M−1}}, which are controllable to R_{l_M} via S_{l_M}. (Algorithm 1, Lines 4-6) Define the sampled guard set of S_{l_{M−1}} as

    SG_{l_{M−1}} = ∪_{j=1}^{J} x^j_{T^j[1→l_{M−1}]}.    (15)

The set contains those states in S_{l_{M−1}} from which the system transitioned into another subtask during one of the previous J executions of T1. Only controllability from the sampled guard set will be important in our approach.

• We search for the set of points in SG_{l_{M−1}} that are controllable to stored states in S_{l_M}. This problem can be solved using a variety of numerical approaches. (Algorithm 1, Lines 9-15)

• For any stored state x in SG_{l_{M−1}} for which the controllability analysis failed, we remove the stored subtask execution ending in x from the candidate controllable states for Task 2. All remaining stored states in S_{l_{M−1}} are controllable to stored states in S_{l_M}, and therefore also to R_{l_M}. (Algorithm 1, Lines 15-20)

Algorithm 1 TDMPC algorithm
1:  input SS^J[1→M], SU^J[1→M], SQ^J[1→M], [l_1, l_2, ..., l_M]
2:  do guard set clustering(SS^J[1→M]) (15)
3:  initialize empty ŜS, ŜU
4:  ŜS_{l_M} ← ∪_{j=1}^{J} x^j_{l_M}
5:  û^j_{l_M} ← u^j_{l_M} ∀ j ∈ [1, J],  ŜU_{l_M} ← ∪_{j=1}^{J} û^j_{l_M}
6:  ŜQ_{l_M} ← ∪_{j=1}^{J} V^j(x^j_{l_M})
7:  for i ∈ [l_{M−1} : −1 : l_1] do
8:      ŜS_i ← ∪_{j=1}^{J} x^j_i
9:      û^j_i ← u^j_i ∀ j ∈ [1, J]
10:     for x ∈ SG_i do
11:         k = {k : x ∈ x^k_i}
12:         initialize empty Q*
13:         for j : x^j_{i+1} ∈ ŜS_{i+1} do
14:             q̂^j_{i+1} = V^j(x^j_{i+1})
15:             solve (Q*_j, u*_j) = Ctrb(x, x^j_{i+1}, q̂^j_{i+1}) (17)
16:         if Q* not empty then
17:             j* = argmin_j Q*_j
18:             û^k_i[−1] ← u*_{j*}
19:         else
20:             ŜS_i ← ŜS_i \ x^k_i,  ŜU_i ← ŜU_i \ û^k_i
21:     ŜU_i ← ∪_{j=1}^{J} û^j_i,  ŜQ_i ← ∪_{j=1}^{J} V^j(x̂^j_i)
22: Return SS^0[l_1→l_M] = ∪_{i=l_1}^{l_M} ŜS_i,  SU^0[l_1→l_M] = ∪_{i=l_1}^{l_M} ŜU_i,  SQ^0[l_1→l_M] = ∪_{i=l_1}^{l_M} ŜQ_i

Algorithm 1 iterates backwards through the remaining subtask sequence, connecting points in sampled guard sets to previously verified trajectories in the next subtask. Fig. 1 depicts this process across three subtasks from an autonomous racing task detailed in Section IV. The algorithm terminates when it has iterated through the new subtask order, or when no states in a subtask's sampled guard set can be shown to be controllable to R_{l_M}. The algorithm returns sampled safe sets for Task 2 that have been verified through controllability to contain feasible executions of Task 2.
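The backward pass of Algorithm 1 can be sketched as follows. This is a simplified illustration, not the authors' implementation: the data layout and `one_step_ctrb` are hypothetical stand-ins, where `one_step_ctrb(x, traj)` should return a cost if guard state x is one-step controllable to the verified trajectory `traj` (cf. problem (17)), and None otherwise.

```python
# Sketch of Algorithm 1's backward pass over the reordered subtask sequence.

def tdmpc_backward_pass(subtask_trajs, guard_states, order, one_step_ctrb):
    """subtask_trajs[i][j]: stored trajectory of iteration j through subtask i.
    guard_states[i][j]: guard state of iteration j in subtask i.
    order: the new subtask sequence [l_1, ..., l_M]."""
    # All stored executions of the last subtask already end in the new target set.
    safe = {order[-1]: list(subtask_trajs[order[-1]])}
    for pos in range(len(order) - 2, -1, -1):
        i, nxt = order[pos], order[pos + 1]
        kept = []
        for traj, g in zip(subtask_trajs[i], guard_states[i]):
            costs = [one_step_ctrb(g, t) for t in safe[nxt]]
            if any(c is not None for c in costs):
                kept.append(traj)   # guard state connects: keep this execution
            # else: drop this execution (cf. Algorithm 1, Line 20)
        if not kept:
            return None             # no feasible initialization found
        safe[i] = kept
    return safe

# Toy example: scalar guard states, connection possible iff |g| < 3.
trajs = {0: ["A0", "A1"], 1: ["B0", "B1"]}
guards = {0: [0.0, 5.0], 1: [0.0, 0.0]}
result = tdmpc_backward_pass(trajs, guards, [0, 1],
                             lambda g, t: 1.0 if abs(g) < 3 else None)
# Only iteration 0 of subtask 0 survives the controllability check.
```

In the toy run, the guard state 5.0 fails the check, so the corresponding execution is removed from the Task 2 safe set, mirroring Line 20 of Algorithm 1.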
TDMPC can improve on the computational complexity of existing trajectory transfer methods in two key ways: (i) by verifying stored trajectories only at states in the sampled guard set, rather than at each recorded time step, and (ii) by solving a data-driven, one-step controllability problem to adapt the trajectories, rather than a multi-step or set-based controllability method.

B. Properties of TDMPC-Derived Policies

We prove feasibility and iteration cost reduction of ILMPC policies (10) initialized using TDMPC.

Fig. 1: Algorithm 1 checks controllability from states in the sampled guard set (green) to the convex hulls of safe trajectories through the next subtask (light red). If the controllability check fails for a point in the sampled guard set, the backwards reachable states are removed from the safe set (grey). The centerline is plotted in dashed yellow.

Assumption 2: Task 1 and Task 2 are defined as in (14), where the subtask workspaces and input spaces are given by X_i = X, U_i = U for all i ∈ [1, M].

Theorem 1 (Feasibility): Let Assumptions 1-2 hold. Assume non-empty sets SS^J[1→M], SU^J[1→M], SQ^J[1→M] containing trajectories of system (1) for Task 1. Assume Algorithm 1 outputs non-empty sets SS^0[l_1→l_M], SU^0[l_1→l_M], SQ^0[l_1→l_M] for Task 2. Then, if x_0 ∈ SS^0[l_1→l_M], the policy π^{ILMPC}[l_1→l_M], as defined in (10), produces a feasible execution of Task 2.

Proof: At every state x_k, the ILMPC policy (10) searches for a sequence of inputs [u_k, u_{k+1}, ..., u_{k+N−1}] such that, when applied to the system (1), the resulting state x_{k+N} is in SS^0[l_1→l_M] or the target set R_{l_M}.
Since all states in SS^0[l_1→l_M] are either stored as part of feasible trajectories to R_{l_M} or are directly in R_{l_M}, such a sequence of inputs can always be found, and (9) always has a solution:

    ∀ x_k ∈ SS^0[l_1→l_M],  ∃ [u_k, u_{k+1}, ..., u_{k+N−1}] ∈ U :  x_{k+N} ∈ SS^0[l_1→l_M] ⊆ X.

As the terminal constraint set in (9) is itself an invariant set, recursive feasibility follows from standard MPC arguments [22]. It follows that the policy π^{ILMPC}[l_1→l_M] produces feasible trajectories for Task 2.

Theorem 1 implies that the safe sets designed by the TDMPC algorithm induce an ILMPC policy that can be used to successfully complete Task 2 while satisfying all input and state constraints.

Assumption 3: Consider Task 1 and Task 2 as defined in (14). The trajectories stored in SS^J[1→M] and SU^J[1→M] correspond to executions of Task 1 by a nonlinear system (1). One stored trajectory corresponds to an execution of (1) in closed-loop with a policy π_0(·) that is feasible for both Task 1 and Task 2.

Theorem 2 (Cost Improvement): Let Assumptions 2-3 hold. Then Algorithm 1 will return non-empty sets SS^0[l_1→l_M], SU^0[l_1→l_M], SQ^0[l_1→l_M] for Task 2. Furthermore, if x_0 ∈ SS^1[1→M], an ILMPC initialized using SS^0[l_1→l_M] will incur no higher iteration cost during an execution of Task 2 than an ILMPC initialized using a trajectory corresponding to (1) in closed-loop with π_0(·).

Proof: Define the vectors

    x^1 = SS^1[1→M] ⊆ SS^J[1→M],    (16a)
    u^1 = π_0(x^1) = SU^1[1→M] ⊆ SU^J[1→M],    (16b)

to be the stored state and input trajectory associated with the implemented policy π_0(·). Since π_0(·) is also feasible for Task 2, when Algorithm 1 is applied, the entire task execution can be stored as a successful execution for Task 2 without adapting the policy.
It follows that SS^1[1→M] ⊆ SS^0[l_1→l_M] and SU^1[1→M] ⊆ SU^0[l_1→l_M], and the returned sampled safe sets for Task 2 are non-empty.

At the initial state x_0, the ILMPC policy (10) optimizes the chosen input so as to minimize the remaining cost-to-go. Consider an MPC planning horizon of N = 1 (the argument extends directly to any N ≥ 1). Trivially,

    min_{u ∈ U, f(x_0, u) = x^j_p}  h(x_0, u) + V^j(x^j_p)  ≤  h(x_0, π_0(x_0)) + V^j(x^j_p),  where f(x_0, π_0(x_0)) = x^j_p,

since u = π_0(x_0) is a feasible candidate for the minimization. It follows that the cost incurred by a Task 2 execution with π^{ILMPC}[l_1→l_M] is no higher than the cost of an execution with π_0. The proof that the improved closed-loop iteration cost follows from the improved ILMPC cost (9) is not presented here. However, the result holds for the examples presented in Sec. IV-V.

C. Discussion

In the examples presented in this paper, we implement the search for controllable points (Algorithm 1, Line 15) by solving a one-step controllability problem, (Q*, u*) = Ctrb(x, z, q), where

    (u*, λ*) = argmin_{u, λ}  h(x, u) + λᵀq    (17)
               s.t.  f(x, u) = λᵀz,
                     Σ_i λ_i = 1,  λ_i ≥ 0,
                     u ∈ U_i,

    Q* = λ*ᵀq,    (18)

where z is a previously verified state trajectory through the next Task 2 subtask, and q is the sampled cost vector associated with that trajectory. Problem (17) aims to find an input that connects the sampled guard state x to a state in the convex hull of the trajectory [22, Sec 4.4.2]. If such an input is found, the new cost-to-go (8) for the state x is taken to be the convex combination of the stored cost vector (18). Solving the controllability analysis to the convex hull is an additional method for reducing the computational complexity of TDMPC, and is exact only for linear systems with convex constraints.

The points of subtask transition should be defined as is most useful, given the two tasks.
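To make the one-step problem (17) concrete, the following is a simplified, illustrative variant (not the authors' implementation). For clarity it uses scalar dynamics x⁺ = x + u and connects the guard state only to stored states of the verified trajectory, rather than to their convex hull; the trajectory z, cost vector q, and input bounds are toy stand-ins.

```python
# Simplified data-driven one-step controllability check, cf. (17)-(18).

def ctrb_to_stored(x, z, q, u_min=-1.0, u_max=1.0):
    """Return (Q*, u*): the lowest stored cost-to-go reachable from x in one
    step and the input achieving it, or None if no stored state is reachable."""
    best = None
    for z_i, q_i in zip(z, q):
        u = z_i - x                      # the input solving x + u = z_i
        if u_min <= u <= u_max and (best is None or q_i < best[0]):
            best = (q_i, u)
    return best

z = [2.0, 3.0, 4.0]   # verified trajectory through the next Task 2 subtask
q = [2.0, 1.0, 0.0]   # its sampled cost-to-go vector (8)
print(ctrb_to_stored(1.5, z, q))   # only z = 2.0 is reachable -> (2.0, 0.5)
print(ctrb_to_stored(3.5, z, q))   # z = 4.0 has the lowest cost -> (0.0, 0.5)
```

Replacing the discrete search with the convex-hull constraint of (17) turns this check into a small optimization over (u, λ), which a generic solver can handle.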
Subtask transition points simply indicate which segments of the stored trajectories are certain to remain feasible in T2 using the stored policies, but this can change depending on how exactly T2 differs from T1. The TDMPC method is therefore not limited in applicability to a predetermined number of reshuffled tasks.

IV. APPLICATION 1: AUTONOMOUS RACING

A. Task Formulation

Consider an autonomous racing task, in which a vehicle is controlled to minimize lap time driving around a race track with piecewise constant curvature (Fig. 2).

Fig. 2: Each subtask of the racing task corresponds to a segment of the track with constant curvature. The vehicle state s tracks the distance traveled along the centerline.

We model this task as a series of ten subtasks, where the i-th subtask corresponds to a section of the track with constant radius of curvature c_i. Tasks with a different subtask order are tracks consisting of the same road segments in a different order.

The vehicle is modeled in the curvilinear abscissa reference frame [24], with states and inputs at time step k

    x_k = [v^x_k, v^y_k, ψ̇_k, e^ψ_k, s_k, e^y_k]ᵀ,  u_k = [a_k, δ_k]ᵀ,

where v^x_k, v^y_k, and ψ̇_k are the vehicle's longitudinal velocity, lateral velocity, and yaw rate, respectively, at time step k, s_k is the distance travelled along the centerline of the road, and e^ψ_k and e^y_k are the heading angle and lateral distance errors between the vehicle and the path. The inputs are longitudinal acceleration a_k and steering angle δ_k. The system dynamics (1) are described using an Euler discretized dynamic bicycle model [5]. Accordingly, the system state and input spaces are X = R^6, U = R^2.
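The curvilinear state layout can be illustrated with a much simpler model. The sketch below uses a *kinematic* bicycle with Euler discretization on a reduced state [v, e_ψ, s, e_y]; the paper itself uses the six-state dynamic bicycle model of [5], and the time step dt, wheelbase L, and this simplification are assumptions for illustration only.

```python
# Euler-discretized kinematic bicycle in the curvilinear frame (illustrative).
import math

def step(x, u, curv, dt=0.1, L=2.5):
    """x = [v, e_psi, s, e_y] (reduced state), u = [a, delta],
    curv = centerline curvature of the current track segment."""
    v, e_psi, s, e_y = x
    a, delta = u
    s_dot = v * math.cos(e_psi) / (1.0 - curv * e_y)   # progress rate
    return [
        v + dt * a,                                    # longitudinal velocity
        e_psi + dt * (v / L * math.tan(delta) - curv * s_dot),  # heading error
        s + dt * s_dot,                                # distance along centerline
        e_y + dt * v * math.sin(e_psi),                # lateral error
    ]

# On a straight segment (curv = 0) with zero steering, only v and s change.
x = step([1.0, 0.0, 0.0, 0.0], [0.5, 0.0], curv=0.0)
```

The state s advancing monotonically along the centerline is what makes the per-segment subtask bounds below easy to express as box constraints.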
We formulate each subtask according to (11), with:

1) Subtask Workspace X_i:

    X_i = { x : [0, −π/2 rad, s_{i−1}, −l/2]ᵀ ≤ [v^x, e^ψ, s, e^y]ᵀ ≤ [3 m/s, π/2 rad, s_i, l/2]ᵀ },

where s_{i−1} and s_i mark the distances along the centerline to the start and end of the curve, and l = 0.8 is the lane width in meters. s_10 = s_end is the total length of the track. These bounds indicate that the vehicle can only drive forwards on the track, up to a maximum velocity, and must stay within the lane.

2) Subtask Input Space U_i:

    U_i = { u : [−1 m/s², −0.5 rad]ᵀ ≤ [a, δ]ᵀ ≤ [1 m/s², 0.5 rad]ᵀ }.

The input limits are a function of the vehicle, and do not change between subtasks.

3) Subtask Transition Set R_i: Lastly, we define the subtask transition set to be the states along the subtask border where the track's radius of curvature changes:

    R_i = { x ∈ X_i : ∃ u ∈ U_i  s.t.  s⁺ ≥ s_i },

where x⁺ = f(x, u). The task target set is the race track's finish line, R_M = {x : s ≥ s_end}. The task goal is to complete a lap and reach the target set as quickly as possible. Therefore we define the stage cost as:

    h(x_k, u_k) = 0 if x_k ∈ P,  1 otherwise.

B. Simulation Setup

An ILMPC (10) is used to complete J = 5 executions of Task 1, the track depicted in Fig. 2. The vehicle begins each task iteration at a standstill on the centerline at the start of the track. The J executions and their costs are stored in SS[1→M], SU[1→M], and SQ[1→M]. An initial trajectory for the ILMPC safe sets is executed using a centerline-tracking, low-velocity PID controller, Π_0. TDMPC then uses these sampled safe sets to design initial policies for a new track composed of the same track segments. Two ILMPCs are designed for the reconfigured track: one initialized with TDMPC, and another initialized with Π_0. Each ILMPC completes J = 10 laps around the new track.
In this example, the reconfigured track is not continuous, and should be considered a segment of a larger, continuous track.

C. Simulation Results

Fig. 3 compares the first and tenth trajectories around the track of the two ILMPCs, plotted as black and red lines. The Π_0-initialized ILMPC (dashed black) initially stays close to the centerline, taking nearly 18 seconds to traverse the new track. The TDMPC-initialized ILMPC, however, traverses the new track more efficiently starting with the first lap. The first lap completed using the TDMPC-initialized ILMPC (solid black) begins closer to the final locally optimal policies (red) that both ILMPCs eventually converge to. In this example, the TDMPC method is able to leverage experience on another track in order to complete sections of the new track in a locally optimal way, even on the first iteration of a new task.

The lap time of each of the ten ILMPC iterations is plotted in the bottom of Fig. 3. As expected, the TDMPC-initialized ILMPC completes the first several laps faster than the Π_0-initialized ILMPC. The TDMPC-initialized ILMPC requires fewer task iterations and less time per iteration to reach a locally optimal trajectory.

Fig. 3: The TDMPC-initialized ILMPC converges to a locally optimal trajectory faster than the PID-initialized one.

Fig. 4: Top view of the robotic path planning task. Each subtask corresponds to an obstacle in the environment with constant height.

V. APPLICATION 2: ROBOTIC PATH PLANNING

TDMPC can also be used to combine knowledge gained from solving a variety of previous tasks. For example, if D ILMPCs as in (10) complete J iterations of D different tasks, TDMPC can be used to design a policy for a task (D+1). The algorithm draws on subtask executions collected over the D different tasks in order to build safe sets for Task D+1. We evaluated this approach in a robotic path planning example.

A. Task Formulation
Consider a task in which a UR5e robotic arm (https://www.universal-robots.com/products/ur5-robot/) needs to move an object to a target without colliding with obstacles (Fig. 4). The obstacles are modeled as extruded disks of varying heights above and below the robot, leaving a workspace between h_min,i and h_max,i. Here, each subtask corresponds to the workspace above a particular obstacle. Different subtask orderings correspond to a rearranging of the obstacle locations.

The end-effector reference tracking accuracy of the UR5e allows us to use a simplified model in robot experiments, in place of a discretized second-order model as in [25]. We solve the task in the reduced state space:

    x_k = [q_{0,k}, q̇_{0,k}, z_k, ż_k]ᵀ,  u_k = [q̈_{0,k}, z̈_k]ᵀ,

where z_k is the height of the robot end-effector at time step k, calculated from the joint angles via forward kinematics, and ż_k is the upward velocity. We control q̈_{0,k} and z̈_k, the accelerations of q_0 and z, respectively. The system state and input spaces are X = R^4, U = R^2. We model the simplified system as a quadruple integrator:

    x_{k+1} = [1  dt  0  0 ;
               0  1   0  0 ;
               0  0   1  dt;
               0  0   0  1 ] x_k  +  [0  0 ;
                                      dt 0 ;
                                      0  0 ;
                                      0  dt] u_k,    (23)

where dt = 0.01 seconds is the sampling time. This simplified model holds as long as we operate within the region of high end-effector reference tracking accuracy, characterized in previous experiments.

We formulate each subtask according to (11).

1) Subtask Workspace X_i:

    X̃_i = { x : [θ_{i−1}, −π rad/s, h_min,i, −1 m/s]ᵀ ≤ [q_0, q̇_0, z, ż]ᵀ ≤ [θ_i, π rad/s, h_max,i, 1 m/s]ᵀ },

where θ_{i−1} and θ_i mark the cumulative angles to the beginning and end of the i-th obstacle, as in Fig. 4. States q̇_0 and ż are constrained to lie in the experimentally determined region of high end-effector tracking accuracy. The robot end-effector is constrained not to collide with the subtask obstacle.

2) Subtask Input Space U_i:

    Ũ_i = { u : [−π rad/s², −0.6 m/s²]ᵀ ≤ [q̈_0, z̈]ᵀ ≤ [π rad/s², 0.6 m/s²]ᵀ },

where q̈_0 and z̈ are constrained to lie in the experimentally determined region of high end-effector tracking accuracy.

3) Subtask Transition Set R_i: We define the subtask transition set to be the states along the subtask border where the next obstacle begins:

    R_i = { x ∈ X_i : ∃ u ∈ U_i  s.t.  q_0⁺ ≥ θ_i },

where x⁺ = f(x, u). The task target set is the end of the last mode:

    R_M = { x : q_0 = θ_M,  h_min,M ≤ z ≤ h_max,M }.

The task goal is to reach the target set as quickly as possible:

    h(x_k, u_k) = 0 if x_k ∈ P,  1 otherwise.

B. Experimental Setup

An ILMPC (10) was used to complete J = 10 executions of five different training tasks, where each training task corresponded to a reordering of the obstacles. In each task, the ILMPC tries to reach the target set as quickly as possible while avoiding the obstacles. Each ILMPC was initialized with a trajectory resulting from executing a policy Π_0 that tracks the center height of each mode with the end-effector while the robot rotates at a low constant joint velocity q̇_0. TDMPC was then applied to the combined sampled safe sets of the five training tasks, and used to design an initial policy for a new ILMPC on an unseen ordering of obstacles, shown in Fig. 5. The white space corresponds to environment obstacles, so that the ILMPC task is to reach the end of the last mode as quickly as possible while controlling the end-effector to remain within the safe (green) part of the state space. A second ILMPC was initialized with the center-height tracking Π_0 for comparison. After initialization, the two ILMPCs completed J = 20 iterations of the new task. These iterations were executed in simulation using the simplified model (23), and the first and last trajectories of each ILMPC were then tracked by a real UR5e robot using end-effector tracking.

C. Experimental Results

The measured robot trajectories are plotted in Fig. 5.
The Π_0-initialized ILMPC follows the center height of each mode closely during the first task iteration (plotted in dashed black). After ten iterations of the task, the resulting trajectory (plotted in dashed red) has diverged only slightly from the center-height trajectory. Correspondingly, after ten iterations the Π_0-initialized ILMPC still requires more than four seconds to complete the task. The TDMPC-initialized ILMPC, however, draws on knowledge gathered over many previous tasks in order to solve the task efficiently right away. Already on the first trajectory (plotted in solid black), the TDMPC-initialized ILMPC solves the task in under three seconds, a 30% improvement over the Π_0-initialized ILMPC. As in the autonomous driving task, the first trajectory completed by the TDMPC-initialized ILMPC is very close to the ultimate locally optimal trajectory.

Because of the nonconvex obstacles, this task is nonconvex, and there are many locally optimal trajectories. At various iterations of the task, both the TDMPC-initialized and the Π_0-initialized ILMPCs get stuck at such local minima, so that the ILMPCs' performance metric remained constant over several iterations before improving again (Fig. 5). At these performance plateaus, the realized trajectories continue to change. We believe that variability in the mixed-integer solver used in the ILMPC led the ILMPC to follow different trajectories with the same iteration cost, as if encouraging exploration. Some of these different trajectories then allowed for performance improvement in the next iteration.

Fig. 5: The TDMPC-initialized ILMPC solves T_2 much faster than the ILMPC initialized with a center-height tracking policy Π_0.

VI. CONCLUSION

A task decomposition method for ILMPC was presented. The TDMPC algorithm uses stored state and input trajectories from executions of a task, and efficiently designs policies for executing variations of that task.
TDMPC breaks tasks into subtasks and performs a controllability analysis at sampled safe states between subtasks. The algorithm can improve upon other methods by only needing to verify and adapt the original task policy at points of subtask transition, rather than along the entire trajectory. We evaluate the effectiveness of the proposed algorithm on autonomous racing and robotic manipulation tasks. Our results confirm that TDMPC allows an ILMPC to converge to a locally optimal minimum-time trajectory faster than using simple initializations.

REFERENCES

[1] D. A. Bristow, M. Tharayil, and A. G. Alleyne, "A survey of iterative learning control," IEEE Control Systems Magazine, vol. 26, no. 3, pp. 96–114, 2006.
[2] Y. Wang, F. Gao, and F. J. Doyle III, "Survey on iterative learning control, repetitive control, and run-to-run control," Journal of Process Control, vol. 19, no. 10, pp. 1589–1600, 2009.
[3] K. Kritayakirana and J. C. Gerdes, "Using the centre of percussion to design a steering controller for an autonomous race car," Vehicle System Dynamics, vol. 50, no. sup1, pp. 33–51, 2012.
[4] J. V. Carrau, A. Liniger, X. Zhang, and J. Lygeros, "Efficient implementation of randomized MPC for miniature race cars," in 2016 European Control Conference (ECC). IEEE, 2016, pp. 957–962.
[5] U. Rosolia, A. Carvalho, and F. Borrelli, "Autonomous racing using learning model predictive control," in 2017 American Control Conference (ACC). IEEE, 2017, pp. 5115–5120.
[6] R. Horowitz, "Learning control of robot manipulators," Journal of Dynamic Systems, Measurement, and Control, vol. 115, no. 2B, pp. 402–411, 1993.
[7] S. Arimoto, M. Sekimoto, and S. Kawamura, "Task-space iterative learning for redundant robotic systems: Existence of a task-space control and convergence of learning," SICE Journal of Control, Measurement, and System Integration, vol. 1, no. 4, pp. 312–319, 2008.
[8] J. Van Den Berg, S. Miller, D. Duckworth, H. Hu, A. Wan, X.-Y. Fu, K. Goldberg, and P. Abbeel, "Superhuman performance of surgical tasks by robots using iterative learning from human-guided demonstrations," in 2010 IEEE International Conference on Robotics and Automation. IEEE, 2010, pp. 2074–2081.
[9] Y. C. Wang, C. J. Chien, and C. N. Chuang, "Backstepping adaptive iterative learning control for robotic systems," in Applied Mechanics and Materials, vol. 284. Trans Tech Publ, 2013, pp. 1759–1763.
[10] J. H. Lee and K. S. Lee, "Iterative learning control applied to batch processes: An overview," Control Engineering Practice, vol. 15, no. 10, pp. 1306–1318, 2007.
[11] J. Lu, Z. Cao, C. Zhao, and F. Gao, "110th anniversary: An overview on learning-based model predictive control for batch processes," Industrial & Engineering Chemistry Research, vol. 58, no. 37, pp. 17164–17173, 2019.
[12] J. Kabzan, L. Hewing, A. Liniger, and M. N. Zeilinger, "Learning-based model predictive control for autonomous racing," IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3363–3370, 2019.
[13] M. K. Cobb, K. Barton, H. Fathy, and C. Vermillion, "Iterative learning-based path optimization for repetitive path planning, with application to 3-D crosswind flight of airborne wind energy systems," IEEE Transactions on Control Systems Technology, pp. 1–13, 2019.
[14] D. Berenson, P. Abbeel, and K. Goldberg, "A robot path planning framework that learns from experience," in 2012 IEEE International Conference on Robotics and Automation. IEEE, 2012, pp. 3671–3678.
[15] M. Stolle and C. Atkeson, "Finding and transferring policies using stored behaviors," Autonomous Robots, vol. 29, no. 2, pp. 169–200, 2010.
[16] C. Liu and C. G. Atkeson, "Standing balance control using a trajectory library," in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. Citeseer, 2009, pp. 3031–3036.
[17] Y. Tassa, T. Erez, and W. D. Smart, "Receding horizon differential dynamic programming," in Advances in Neural Information Processing Systems, 2008, pp. 1465–1472.
[18] C. G. Atkeson and J. Morimoto, "Nonparametric representation of policies and value functions: A trajectory-based approach," in Advances in Neural Information Processing Systems, 2003, pp. 1643–1650.
[19] K. Pereida, M. K. Helwa, and A. P. Schoellig, "Data-efficient multirobot, multitask transfer learning for trajectory tracking," IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 1260–1267, 2018.
[20] U. Rosolia and F. Borrelli, "Learning model predictive control for iterative tasks. A data-driven control framework," IEEE Transactions on Automatic Control, vol. 63, no. 7, pp. 1883–1896, 2017.
[21] A. Coates, P. Abbeel, and A. Y. Ng, "Learning for control from multiple demonstrations," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 144–151.
[22] F. Borrelli, A. Bemporad, and M. Morari, Predictive Control for Linear and Hybrid Systems. Cambridge University Press, 2017.
[23] U. Rosolia, X. Zhang, and F. Borrelli, "Simple policy evaluation for data-rich iterative tasks," in 2019 American Control Conference (ACC). IEEE, 2019, pp. 2855–2860.
[24] R. Rajamani, Vehicle Dynamics and Control. Springer Science & Business Media, 2011.
[25] M. Spong and M. Vidyasagar, Robot Dynamics and Control, 1989.