Adaptive Guidance with Reinforcement Meta-Learning


Authors: Brian Gaudet, Richard Linares

(Preprint) AAS 19-293

ADAPTIVE GUIDANCE WITH REINFORCEMENT META-LEARNING

Brian Gaudet* and Richard Linares†

This paper proposes a novel adaptive guidance system developed using reinforcement meta-learning with a recurrent policy and value function approximator. The use of recurrent network layers allows the deployed policy to adapt in real time to environmental forces acting on the agent. We compare the performance of the DR/DV guidance law, an RL agent with a non-recurrent policy, and an RL agent with a recurrent policy in four difficult tasks with unknown but highly variable dynamics. These tasks include a safe Mars landing with random engine failure and a landing on an asteroid with unknown environmental dynamics. We also demonstrate the ability of a recurrent policy to navigate using only Doppler radar altimeter returns, thus integrating guidance and navigation.

INTRODUCTION

Many space missions take place in environments with complex and time-varying dynamics that may be incompletely modeled during the mission design phase. For example, during an orbital refueling mission, the inertia tensor of each of the two spacecraft will change significantly as fuel is transferred from one spacecraft to the other, which can make the combined system difficult to control (Reference 1). The wet mass of an exoatmospheric kill vehicle (EKV) consists largely of fuel, and as this is depleted by divert thrusts, the center of mass shifts; the divert thrusts are then no longer orthogonal to the EKV's velocity vector, which wastes fuel and degrades performance. Future missions to asteroids might be undertaken before the asteroid's gravitational field, rotational velocity, and local solar radiation pressure are accurately modeled. Also consider that the aerodynamics of hypersonic re-entry cannot be perfectly modeled. Moreover, sensor distortion can bias the state estimate given to a guidance and control system.
Finally, there is the possibility of actuator failure, which significantly modifies the dynamic system comprising a spacecraft and its environment. These examples show a clear need for a guidance system that can adapt in real time to time-varying system dynamics that are likely to be imperfectly modeled prior to the mission.

Recently, several works have demonstrated improved performance with uncertain and complex dynamics by training with randomized system parameters. In (Reference 2), the authors use a recurrent neural network to explicitly learn model parameters through real-time interaction with an environment; these parameters are then used to augment the observation for a standard reinforcement learning algorithm. In (Reference 3), the authors use a recurrent policy and value function in a modified deep deterministic policy gradient algorithm to learn a policy for a robotic manipulator arm that uses real camera images as observations. In both cases, the agents train over a wide range of randomized system parameters. In the deployed policy, the recurrent network's internal state quickly adapts to the actual system dynamics, providing good performance on the agents' respective tasks.

An RL agent using a recurrent network to learn a policy that performs well over a wide range of environmental conditions can be considered a form of meta-learning. In meta-learning (learning to learn), an agent learns, through experience on a wide range of Markov decision processes (MDPs), a strategy for quickly learning (adapting) to novel MDPs. By considering environments with different dynamical systems to be different MDPs, we see that our approach is a form of meta-learning.

* Co-Founder, DeepAnalytX LLC, 1130 Swall Meadows Rd, Bishop CA 93514
† Charles Stark Draper Assistant Professor, Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA 02139
Specifically, if the system is trained on randomized dynamical systems (as in References 2 and 3), this gives the policy experience on a range of ground-truth dynamics, and minimizing the cost function requires that the recurrent policy's hidden state quickly adapt to different dynamical systems.

In this work, we use reinforcement meta-learning to develop an adaptive guidance law. Specifically, we use proximal policy optimization (PPO) with both the policy and value function implementing recurrent layers in their networks. To understand how recurrent layers result in an adaptive agent, consider that given some agent position and velocity x ∈ R^6 and action vector u ∈ R^3 output by the agent's policy, the next observation depends not only on x and u, but also on the ground-truth agent mass and the external forces acting on the agent. Consequently, during training, the hidden state of the network's recurrent layer evolves differently depending on the observed sequence of observations from the environment and actions output by the policy. Specifically, since the policy's weights (including those in the recurrent layers) are optimized to maximize the likelihood of actions that lead to high advantages, the trained policy's hidden state captures unobserved information such as external forces and the current lander mass, as well as the past history of observations and actions, because this information is useful in minimizing the cost function. In contrast, a non-recurrent policy (which we will refer to as an MLP policy) does not maintain a persistent state vector and can only optimize using the current observations, actions, and advantages; it will tend to under-perform a recurrent policy on tasks with randomized dynamics, although, as we have shown in (Reference 4), training with parameter uncertainty can give good results using an MLP policy, provided the parameter uncertainty is not too extreme.
After training, although the recurrent policy's network weights are frozen, the hidden state continues to evolve in response to the sequence of observations and actions, thus making the policy adaptive. In contrast, an MLP policy's behavior is fixed by the network parameters at test time.

An agent implementing recurrent networks can potentially learn a policy for a partially observable Markov decision process (POMDP). To see why, again consider a sequence of observations and actions taken in a trajectory, with the observations not satisfying the Markov property, i.e., p(o_t | a, o_{t-1}, o_{t-2}, ...) ≠ p(o_t | a, o_{t-1}). Consequently, minimizing the policy's cost function requires the hidden-state representation to contain information on the temporal dependencies in a sequence of observations and actions. This is similar to how a recursive Bayesian filter can learn to predict a spacecraft's full state from a history of position-only measurements. However, a recurrent network can capture much longer temporal dependencies than a Kalman filter.

We compare the performance of the DR/DV guidance law (Reference 5), an RL agent with an MLP policy, and an RL agent with a recurrent policy over a range of tasks with unknown but highly variable dynamics. We use the DR/DV guidance law as a performance baseline, and to improve its performance, we give the DR/DV guidance law access to the ground-truth gravitational force, as well as the lander mass at the start of an episode. In contrast, the RL agent only has access to observations that are a function of the lander's position and velocity. Except for the asteroid landing task, these tasks are variations on the Mars landing task. The tasks include:

1. Unknown Dynamics (Asteroid Landing): In each episode, the acceleration due to gravity, solar radiation pressure, and rotation are randomly chosen over a wide range, limited only by the lander's thrust capability.

2.
Engine Failure: We assume a redundant thruster configuration (such as that used on MSL), and at the start of each episode, with probability p, a random thruster is disabled, resulting in a reduction in thrust capability.

3. Large Mass Variation: We use a large engine specific impulse and assume wet/dry masses of 2000 kg / 200 kg respectively, which results in a large variation in lander mass during the landing. This creates a difficult control problem, as the agent does not have access to the ground-truth mass.

4. Landing using non-Markov observations: The agent must learn to land using only the returns from the four Doppler radar altimeters as an observation, whereas the previous tasks assumed access to observations that are a function of the ground-truth position and velocity. The measurement error becomes quite extreme at lower elevations, making this a difficult task. For obvious reasons, we do not try this with the DR/DV guidance law.

METHODS

Equations of Motion

We model the landing in 3-DOF, where the translational motion is modeled as follows:

\dot{r} = v    (1a)
\dot{v} = \frac{T + F_{env}}{m} + g + 2\,\dot{r}_a \times \omega + (\omega \times r_a) \times \omega    (1b)
\dot{m} = -\frac{\|T\|}{I_{sp}\, g_{ref}}    (1c)

Here r is the lander's position in the target-centered reference frame, r_a is the lander's position in the planet (or asteroid) centered reference frame, T is the lander's thrust vector, g_ref = 9.8 m/s^2, g = [0, 0, -3.7114] m/s^2 is used for Mars, I_sp = 225 s, and m is the spacecraft's mass. F_env is a vector of normally distributed random variables representing environmental disturbances such as wind and variations in atmospheric density. ω is a vector of rotational velocities in the planet (or asteroid) centered reference frame, and is set to zero for the Mars landing experiments. For the Mars landing environment, the thrust is constrained to the range [2000 N, 15000 N], and the lander's nominal wet mass is 2000 kg.
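The translational dynamics above are straightforward to simulate. Below is a minimal sketch (ours, not the authors' code; the function name and the forward-Euler choice are assumptions) of one integration step of Eqs. (1a)-(1c) using the Mars parameters quoted above.

```python
import numpy as np

G_REF = 9.8                              # m/s^2, reference gravity for Isp
ISP = 225.0                              # s, engine specific impulse
G_MARS = np.array([0.0, 0.0, -3.7114])   # m/s^2, Mars gravity vector g

def eom_step(r, v, m, thrust, dt, f_env=np.zeros(3),
             omega=np.zeros(3), r_a=None):
    """One forward-Euler step of Eqs. (1a)-(1c).

    r, v   : position/velocity in the target-centered frame (m, m/s)
    m      : lander mass (kg)
    thrust : commanded thrust vector T (N)
    omega  : planet/asteroid rotation rate (rad/s); zero for Mars
    r_a    : position in the planet-centered frame; defaults to r
    """
    if r_a is None:
        r_a = r
    # Eq. (1b): thrust + disturbances over mass, gravity,
    # Coriolis-like and centrifugal-like terms
    a = (thrust + f_env) / m + G_MARS \
        + 2.0 * np.cross(v, omega) + np.cross(np.cross(omega, r_a), omega)
    r_new = r + v * dt                                       # Eq. (1a)
    v_new = v + a * dt
    m_new = m - np.linalg.norm(thrust) / (ISP * G_REF) * dt  # Eq. (1c)
    return r_new, v_new, m_new
```

With zero thrust the mass stays constant and the velocity simply accumulates Mars gravity, which is a useful sanity check on the signs.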
For the asteroid landing environment, we assume pulsed thrusters with a thrust capability of 2 N along each axis in the target-centered reference frame.

Reinforcement Learning Overview

In the RL framework, an agent learns through episodic interaction with an environment how to successfully complete a task by learning a policy that maps observations to actions. The environment initializes an episode by randomly generating an internal state, mapping this internal state to an observation, and passing the observation to the agent. These observations could be a corrupted version of the internal state (to model sensor noise), or could be raw sensor outputs such as Doppler radar altimeter readings or a multi-channel pixel map from an electro-optical sensor. At each step of the episode, an observation is generated from the internal state and given to the agent. The agent uses this observation to generate an action that is sent to the environment; the environment then uses the action and the current state to generate the next state and a scalar reward signal. The reward and the observation corresponding to the next state are then passed to the agent. The environment can terminate an episode, with the termination signaled to the agent via a done signal. The termination could be due to the agent completing the task or violating a constraint. Initially, the agent's actions are random, which allows the agent to explore the state space and begin learning the value of experiencing a given observation and which actions are to be preferred as a function of that observation. Here the value of an observation is the expected sum of discounted rewards received after experiencing that observation; this is similar to the cost-to-go in optimal control. As the agent gains experience, the amount of exploration is decreased, allowing the agent to exploit this experience.
For most applications (unless a stochastic policy is required), when the policy is deployed in the field, exploration is turned off, as exploration becomes quite expensive using an actual lander. The safe method of continuous learning in the field is to have the lander send back telemetry data, which can be used to improve the environment's dynamics model and update the policy via simulated experience.

In the following discussion, the vector x_k denotes the observation provided by the environment to the agent. Note that in general x_k does not need to satisfy the Markov property. In those cases where it does not, several techniques have proven successful in practice. In one approach, observations spanning multiple time steps are concatenated, allowing the agent access to a short history of observations, which helps the agent infer the motion of objects in consecutive observations. This was the approach used in (Reference 6). In another approach, a recurrent neural network is used for the policy and value function implementations. The recurrent network allows the agent to infer motion from observations, similar to the way a recursive Bayesian filter can infer velocity from a history of position measurements. The use of recurrent network layers has proven effective in supervised learning tasks where a video stream needs to be mapped to a label (Reference 7).

Each episode results in a trajectory defined by observations, actions, and rewards; a step in the trajectory at time t_k can be represented as (x_k, u_k, r_k), where x_k is the observation provided by the environment, u_k the action taken by the agent given that observation, and r_k the reward returned by the environment to the agent. The reward can be a function of both the observation x_k and the action u_k. The reward is typically discounted to allow for infinite horizons and to facilitate temporal credit assignment.
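As a concrete illustration of the first technique (concatenating observations over several time steps), a minimal frame-stacking helper might look as follows; the class name and interface are our own, not from the references.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Concatenate the last k observations so a memoryless (MLP)
    policy can infer motion, as in the frame-stacking approach
    described above. The stack is zero-initialized at episode start."""

    def __init__(self, k, obs_dim):
        self.frames = deque([np.zeros(obs_dim)] * k, maxlen=k)

    def step(self, obs):
        # Append the newest observation; the oldest falls off the end.
        self.frames.append(np.asarray(obs, dtype=float))
        return np.concatenate(self.frames)
```

The agent would feed the concatenated vector, rather than the raw observation, into the policy network.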
Then the sum of discounted rewards for a trajectory can be defined as

r(\tau) = \sum_{k=0}^{T} \gamma^k r_k(x_k, u_k)    (2)

where τ = [x_0, u_0, ..., x_T, u_T] denotes the trajectory and γ ∈ [0, 1) denotes the discount factor. The objective function the RL methods seek to optimize is given by

J(\theta) = \mathbb{E}_{p_\theta(\tau)}[r(\tau)] = \int_T r(\tau)\, p_\theta(\tau)\, d\tau    (3)

where

p_\theta(\tau) = \left[ \prod_{k=0}^{T} p(x_{k+1} \mid x_k, u_\theta) \right] p(x_0)    (4)

Here E_{p_θ(τ)}[·] denotes the expectation over trajectories, and in general u_θ may be a deterministic or stochastic function of the policy parameters θ. However, it was noticed in Reference 8 that if the policy is chosen to be stochastic, where u_k ∼ π_θ(u_k | x_k) is a pdf for u_k conditioned on x_k, then a simple policy gradient expression can be found:

\nabla_\theta J(\theta) = \int_T \sum_{k=0}^{T} r_k(x_k, u_k) \nabla_\theta \log \pi_\theta(u_k \mid x_k)\, p_\theta(\tau)\, d\tau \approx \sum_{i=0}^{M} \sum_{k=0}^{T} r_k(x_k^i, u_k^i) \nabla_\theta \log \pi_\theta(u_k^i \mid x_k^i)    (5)

where the integral over τ is approximated with samples τ^i ∼ p_θ(τ), which are Monte Carlo roll-outs of the policy given the environment's transition pdf p(x_{k+1} | x_k). The expression in Eq. (5) is called the policy gradient, and this form of the equation is referred to as the REINFORCE method (Reference 8). Since the development of the REINFORCE method, additional theoretical work has improved on its performance. In particular, it was shown that the reward r_k(x_k, u_k) in Eq. (5) can be replaced with the state-action value function Q^π(x_k, u_k); this result is known as the Policy Gradient Theorem. Furthermore, the variance of the policy gradient estimate derived from the Monte Carlo roll-outs τ^i is reduced by subtracting a state-dependent baseline from Q^π(x_k, u_k). This baseline is commonly chosen to be the state value function V^π(x_k), and we can define A^π(x_k, u_k) = Q^π(x_k, u_k) − V^π(x_k). This method is known as the Advantage Actor-Critic (A2C) method.
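For a Gaussian policy whose learned parameter is (for simplicity) the mean itself, the Monte Carlo estimator on the right-hand side of Eq. (5) reduces to a few array operations. The sketch below is illustrative only; the function name and the simplified parameterization are ours.

```python
import numpy as np

def reinforce_gradient(actions, means, sigma, rewards):
    """Monte Carlo estimator of Eq. (5) for a Gaussian policy
    u ~ N(mean(theta), sigma^2 I), where for this sketch theta IS the
    mean, so grad_theta log pi(u|x) = (u - mean) / sigma^2.

    actions, means : (M, T, act_dim) rollout actions and policy means
    rewards        : (M, T) per-step rewards r_k(x_k, u_k)
    """
    score = (actions - means) / sigma**2      # per-step score function
    # Weight each step's score by that step's reward and sum over the
    # M rollouts and T steps, mirroring the double sum in Eq. (5).
    return (rewards[..., None] * score).sum(axis=(0, 1))
```

In practice the per-step reward would be replaced by an advantage estimate, as the text goes on to describe.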
The policy gradient for the A2C method is given by (where the w subscript denotes a function parameterized by w)

\nabla_\theta J(\theta) \approx \sum_{i=0}^{M} \sum_{k=0}^{T} A^\pi_w(x_k^i, u_k^i) \nabla_\theta \log \pi_\theta(u_k^i \mid x_k^i)    (6)

Proximal Policy Optimization

The Proximal Policy Optimization (PPO) approach (Reference 9) is a type of policy gradient method which has demonstrated state-of-the-art performance on many RL benchmark problems. The PPO approach is developed from the properties of the Trust Region Policy Optimization (TRPO) method (Reference 10). The TRPO method formulates the policy optimization problem using a constraint to restrict the size of the gradient step taken during each iteration (Reference 11). The TRPO policy update is calculated using the following problem statement:

\underset{\theta}{\text{minimize}} \quad \mathbb{E}_{p(\tau)}\left[ \frac{\pi_\theta(u_k \mid x_k)}{\pi_{\theta_{old}}(u_k \mid x_k)}\, A^\pi_w(x_k, u_k) \right] \quad \text{subject to} \quad \mathbb{E}_{p(\tau)}\left[ \mathrm{KL}\left( \pi_\theta(u_k \mid x_k),\, \pi_{\theta_{old}}(u_k \mid x_k) \right) \right] \le \delta    (7)

The parameter δ is a tuning parameter, but the theory justifying the TRPO method proves monotonic improvement in policy performance if the policy change in each iteration is bounded by a parameter C. The parameter C is computed using the Kullback-Leibler (KL) divergence (Reference 12). Reference 10 computes a closed-form expression for C, but this expression leads to prohibitively small steps; therefore, Eq. (7) with a fixed constraint is used instead. Additionally, Eq. (7) is approximately solved using the conjugate gradient algorithm, which approximates the constrained optimization problem given by Eq. (7) with a linearized objective function and a quadratic approximation for the constraint. The PPO method approximates the TRPO optimization process by accounting for the policy adjustment constraint with a clipped objective function.
The objective function used with PPO can be expressed in terms of the probability ratio p_k(θ) given by

p_k(\theta) = \frac{\pi_\theta(u_k \mid x_k)}{\pi_{\theta_{old}}(u_k \mid x_k)}    (8)

where the PPO objective function is then as follows:

L(\theta) = \mathbb{E}_{p(\tau)}\left[ \min\left( p_k(\theta)\, A^\pi_w(x_k, u_k),\; \mathrm{clip}(p_k(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A^\pi_w(x_k, u_k) \right) \right]    (9)

This clipped objective function has been shown to maintain the KL divergence constraint, which aids convergence by ensuring that the policy does not change drastically between updates. PPO uses an approximation to the advantage function that is the difference between the empirical return and a state value function baseline, as shown below in Eq. (10):

A^\pi_w(x_k, u_k) = \left[ \sum_{\ell=k}^{T} \gamma^{\ell-k} r(x_\ell, u_\ell) \right] - V^\pi_w(x_k)    (10)

Here the value function V^π_w is learned using the cost function given below in Eq. (11):

L(w) = \sum_{i=1}^{M} \left( V^\pi_w(x_k^i) - \left[ \sum_{\ell=k}^{T} \gamma^{\ell-k} r(u_\ell^i, x_\ell^i) \right] \right)^2    (11)

In practice, policy gradient algorithms update the policy using a batch of trajectories (roll-outs) collected by interaction with the environment. Each trajectory is associated with a single episode, with a sample from a trajectory collected at step k consisting of observation x_k, action u_k, and reward r_k(x_k, u_k). Finally, gradient ascent is performed on θ and gradient descent on w; the update equations are given by

w^+ = w^- - \beta_w \nabla_w L(w) \big|_{w = w^-}    (12)

\theta^+ = \theta^- + \beta_\theta \nabla_\theta J(\theta) \big|_{\theta = \theta^-}    (13)

where β_w and β_θ are the learning rates for the value function V^π_w and the policy π_θ(u_k | x_k), respectively. In our implementation, we adjust the clipping parameter ε to target a KL divergence between policy updates of 0.001. The policy and value function are learned concurrently, as the estimated value of a state is policy dependent. We use a Gaussian distribution with mean π_θ(x_k) and a diagonal covariance matrix for the action distribution in the policy.
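The clipped surrogate of Eqs. (8) and (9) is compact enough to state directly in code. The sketch below assumes per-sample log-probabilities under the old and new policies are already available; the function name and the default ε are our own choices, not the paper's.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate of Eq. (9): batch mean of
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = np.exp(logp_new - logp_old)            # p_k(theta), Eq. (8)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

When the new and old policies coincide the ratio is 1 and the objective is simply the mean advantage; when the ratio strays outside [1 - ε, 1 + ε] with a favorable advantage, the clipped branch caps the incentive to move further.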
Because the log probabilities are calculated using the exploration variance, the degree of exploration automatically adapts during learning such that the objective function is maximized.

Recurrent Policy and Value Function Approximator

Since the policy's cost function uses advantages (the difference between the empirical discounted rewards received after taking an action and following the policy, and the expected value of the observation from which the action is taken), we also need the agent to implement a recurrent value function approximator. This recurrent value function has a recurrent layer, where the hidden state evolves differently depending on the past sequence of observations and returns. During learning, we need to unroll the recurrent layer in time in order to capture the temporal relationship between a sequence of observations and actions. However, we want to do this unrolling in a manner consistent with processing a large number of episodes in parallel. To see why, imagine we want to unroll the network for 60 steps for the forward pass through the network. The hidden state at step 61 is not available until we have completed the forward pass from steps 1 through 60; consequently, if we want to process both segments in parallel, we must insert the hidden state from step 60 that was captured when sampling from the policy as the initial state for the forward pass of the next segment during optimization. It follows that to allow parallel computation of the forward pass, we must capture the actual hidden state when we sample from the policy (or predict the value of an observation using the value function) and add it to the rollouts. So where before a single step of interaction with the environment would add the tuple (o, a, r) (observation, action, reward) to the rollouts, we now augment this tuple with the hidden states of the policy and value functions.
The forward pass through the policy and value function networks is then modified so that, prior to the recurrent layer, the network unrolls the output of the previous layer, reshaping the data from R^{m×n} to R^{T×(m/T)×n}, where m is the batch size, n the feature dimension, and T the number of steps we unroll the network. At step zero, we input the hidden state from the rollouts to the recurrent layer, but for all subsequent steps up to T, we let the state evolve according to the current parameterization of the recurrent layer. After the recurrent layer (or layers), the recurrent layer output is reshaped back to R^{m×n}. Note that without injecting the hidden states from the rollouts at the first step, parallel unrolling of a batch would not be possible, which would dramatically increase computation time. There are quite a few implementation details to get this right, as the rollouts must be padded so that m/T is an integer, but unpadded for the computation of the loss. The unrolling in time is illustrated graphically in Figure 1 for the case where we unroll for 5 steps. The numbers associated with an episode are the representation of the observation for that step of the trajectory, as given by the output of the layer preceding the recurrent layer, and an "X" indicates a padded value. It is also important not to shuffle the training data, as this destroys the temporal association in the trajectories captured in the rollouts. To avoid issues with exploding/vanishing gradients when we backpropagate through the unrolled recurrent layer, we implement the recurrent layers as gated recurrent units (GRUs) (Reference 13).

Figure 1. Unrolling the forward pass in time

RL Problem Formulation

A simplified view of the agent and environment is shown in Figure 2. The policy and value functions are implemented using four-layer neural networks with tanh activations on each hidden layer.
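The bookkeeping described above (reshape the batch, inject the stored hidden state only at the first step, then let the state evolve for T steps) can be sketched as follows. The recurrent cell is a stand-in passed in as a function; only the unrolling pattern mirrors the text, and all names are ours.

```python
import numpy as np

def unrolled_forward(features, h0, cell, T):
    """Run a recurrent layer over a flat batch of rollout features.

    features : (m, n) array, m = batch size (padded so m % T == 0),
               kept in trajectory order (do NOT shuffle).
    h0       : (m // T, h) hidden states captured in the rollouts,
               injected only at the first of the T unrolled steps.
    cell     : fn(x, h) -> h_new, a recurrent cell (e.g. a GRU).
    """
    m, n = features.shape
    assert m % T == 0, "pad the rollouts so the batch splits into T steps"
    seq = features.reshape(T, m // T, n)   # R^{m x n} -> R^{T x m/T x n}
    h = h0                                 # stored state, used at step 0 only
    outs = []
    for t in range(T):
        h = cell(seq[t], h)                # state evolves freely for t > 0
        outs.append(h)
    return np.stack(outs).reshape(m, -1)   # back to a flat batch
```

With a trivial additive "cell" the function is easy to check by hand, which is useful before swapping in a real GRU.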
Layer 2 of both the policy and value function is a recurrent layer implemented as a gated recurrent unit. The network architectures are as shown in Table 1, where n_hi is the number of units in layer i, obs_dim is the observation dimension, and act_dim is the action dimension.

Figure 2. Agent-Environment Interface

Table 1. Policy and Value Function network architecture

              Policy Network                 Value Network
Layer         # units            activation  # units            activation
hidden 1      10 * obs_dim       tanh        10 * obs_dim       tanh
hidden 2      sqrt(n_h1 * n_h3)  tanh        sqrt(n_h1 * n_h3)  tanh
hidden 3      10 * act_dim       tanh        5                  tanh
output        act_dim            linear      1                  linear

The most difficult part of solving the planetary landing problem using RL was the development of a reward function that works well in a sparse reward setting. If we only rewarded the agent for making a soft pinpoint landing at the correct attitude with close to zero rotational velocity, the agent would never see the reward within a realistic number of episodes, as the probability of achieving such a landing using random actions in a 3-DOF environment with realistic initial conditions is exceedingly low. The sparse reward problem is typically addressed using inverse reinforcement learning (Reference 14), where a per-timestep reward function is learned from expert demonstrations. With a reward given at each step of agent-environment interaction, the rewards are no longer sparse. Instead, we chose a different approach, where we engineer a reward function that, at each time step, provides hints to the agent (referred to as shaping rewards) that drive it towards a soft pinpoint landing. The recommended approach for such a reward shaping function is to make the reward a difference of potentials, in which case there are theoretical results showing that the additional reward does not change the optimal policy (Reference 15). We experimented with several potential functions without success. Instead, we drew inspiration from biological systems that use the gaze heuristic.
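The hidden-2 sizing rule in Table 1 (the geometric mean of the neighboring layers' widths) is easy to compute. A small helper with our own naming, rounding to the nearest integer since the paper does not state a rounding convention:

```python
import math

def policy_layer_sizes(obs_dim, act_dim):
    """Unit counts for the policy network of Table 1."""
    n_h1 = 10 * obs_dim
    n_h3 = 10 * act_dim
    n_h2 = round(math.sqrt(n_h1 * n_h3))   # geometric mean of neighbors
    return [n_h1, n_h2, n_h3, act_dim]     # tanh, tanh, tanh, linear

def value_layer_sizes(obs_dim):
    """Unit counts for the value network of Table 1."""
    n_h1 = 10 * obs_dim
    n_h3 = 5
    n_h2 = round(math.sqrt(n_h1 * n_h3))
    return [n_h1, n_h2, n_h3, 1]           # tanh, tanh, tanh, linear
```

For example, with a 4-dimensional observation and 3-dimensional action, the policy layers come out to 40, 35, 30, and 3 units.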
The gaze heuristic is used by animals such as hawks and cheetahs to intercept prey (and by baseball players to catch fly balls), and works by keeping the line-of-sight angle constant during the intercept. The gaze heuristic is also the basis of the well-known PN guidance law used for homing-phase missile guidance. In our case, the landing site is not maneuvering, and we have the additional constraint that we want the terminal velocity to be small. Therefore, we use a heuristic where the agent attempts to keep its velocity vector aligned with the line-of-sight vector. Since the target is not moving in the target-centered reference frame, the target's future position is its current position, and the optimal action is to head directly towards the target. Such a rule results in a pinpoint, but not necessarily soft, landing. To achieve the soft landing, the agent estimates time-to-go as the ratio of the range to the magnitude of the lander's velocity, and reduces the targeted velocity as time-to-go decreases. It is also important that the lander's terminal velocity be directed predominantly downward. To achieve these requirements, we use the piecewise reward shaping function given below in Equations (14a) through (14e), where τ_1 and τ_2 are hyperparameters and v_o is set to the magnitude of the lander's velocity at the start of the powered descent phase. We see that the shaping rewards take the form of a velocity field that maps the lander's position to a target velocity. In words, we target a location 15 m above the desired landing site and target a z-component of landing velocity equal to -2 m/s. Below 15 m, the downrange and crossrange components of the target velocity field are set to zero, which encourages a vertical descent.
v_{targ} = -v_o \left( \frac{\hat{r}}{\|\hat{r}\|} \right) \left( 1 - \exp\left( -\frac{t_{go}}{\tau} \right) \right)    (14a)

t_{go} = \frac{\|\hat{r}\|}{\|\hat{v}\|}    (14b)

\hat{r} = \begin{cases} r - [0 \;\; 0 \;\; 15], & \text{if } r_2 > 15 \\ [0 \;\; 0 \;\; r_2], & \text{otherwise} \end{cases}    (14c)

\hat{v} = \begin{cases} v - [0 \;\; 0 \;\; -2], & \text{if } r_2 > 15 \\ v - [0 \;\; 0 \;\; -1], & \text{otherwise} \end{cases}    (14d)

\tau = \begin{cases} \tau_1, & \text{if } r_2 > 15 \\ \tau_2, & \text{otherwise} \end{cases}    (14e)

Finally, we provide a terminal reward bonus when the lander reaches an altitude of zero with terminal position, velocity, and glideslope within specified limits. The reward function is then given by Equation (15), where the various terms are described in the following:

1. α weights a term penalizing the error in tracking the target velocity.

2. β weights a term penalizing control effort.

3. γ is a constant positive term that encourages the agent to keep making progress along the trajectory. Since all other rewards are negative, without this term an agent would be incentivized to violate the attitude constraint and prematurely terminate the episode to maximize the total discounted rewards received starting from the initial state.

4. η is a bonus given for a successful landing, where terminal position, velocity, and glideslope are all within specified limits. The limits are r_lim = 5 m, v_lim = 2 m/s, and g_{s,lim} = 5. The minimum glideslope at touchdown ensures the lander's velocity is directed predominantly downward.

r = \alpha \|v - v_{targ}\| + \beta \|T\| + \gamma + \eta \cdot \mathbb{1}\left( r_2 < 0 \;\text{and}\; \|r\| < r_{lim} \;\text{and}\; \|v\| < v_{lim} \;\text{and}\; g_s > g_{s,lim} \right)    (15)

This reward function allows the agent to trade off between tracking the target velocity given in Eq. (14a), conserving fuel, and maximizing the reward bonus given for a good landing. Note that the constraints are not hard constraints such as might be imposed in an optimal control problem solved using collocation methods. However, the consequences of violating the constraints (a large negative reward and termination of the episode) are sufficient to ensure they are not violated once learning has converged.
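A direct transcription of the shaping velocity field of Eqs. (14a)-(14e), with our own function name and with r and v as 3-vectors whose third component (index 2) is altitude:

```python
import numpy as np

def target_velocity(r, v, v0, tau1, tau2):
    """Shaping target velocity of Eqs. (14a)-(14e).

    r, v : lander position/velocity in the target-centered frame
           (component 2 = altitude, so r[2] is r_2 in the text)
    v0   : speed magnitude at the start of powered descent
    Above 15 m altitude we steer toward a point 15 m over the site;
    below, the field points straight down, encouraging vertical descent.
    """
    above = r[2] > 15.0
    r_hat = r - np.array([0.0, 0.0, 15.0]) if above \
        else np.array([0.0, 0.0, r[2]])                   # Eq. (14c)
    v_hat = v - np.array([0.0, 0.0, -2.0]) if above \
        else v - np.array([0.0, 0.0, -1.0])               # Eq. (14d)
    tau = tau1 if above else tau2                         # Eq. (14e)
    t_go = np.linalg.norm(r_hat) / np.linalg.norm(v_hat)  # Eq. (14b)
    return -v0 * (r_hat / np.linalg.norm(r_hat)) \
        * (1.0 - np.exp(-t_go / tau))                     # Eq. (14a)
```

For a lander high above the site the target velocity points down the line of sight, and its magnitude shrinks as time-to-go decreases, which is the soft-landing behavior described above.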
Hyperparameter settings and coefficients used in this work are given below in Table 2. Note that due to lander symmetry, we do not impose any limits on the lander's yaw.

Table 2. Hyperparameter Settings

v_o (m/s)   τ_1 (s)   τ_2 (s)   α       β       γ      η
||v_o||     20        100       -0.01   -0.05   0.01   10

As shown in Eq. (16), the observation given to the agent during learning and testing consists of v_error = v − v_targ, with v_targ given in Eq. (14a), the lander's estimated altitude, and the time-to-go. Note that aside from the altitude, the lander's translational coordinates do not appear in the observation. This results in a policy with good generalization, in that the policy's behavior can extend to areas of the full state space that were not experienced during learning.

obs = \left[ v_{error} \;\; r_2 \;\; t_{go} \right]    (16)

It turns out that when a terminal reward is used (as we do), it is advantageous to use a relatively large discount rate. However, it is also advantageous to use a lower discount rate for the shaping rewards. To our knowledge, a method of resolving this conflict has not been reported in the literature. In this work, we resolve the conflict by introducing a framework for accommodating multiple discount rates. Let γ_1 be the discount rate used to discount r_1(k), the reward function term associated with the η coefficient in Eq. (15), and let γ_2 be the discount rate used to discount r_2(k), the sum of all other terms in the reward function. We can then rewrite Eqs. (11) and (10) in terms of these rewards and discount rates, as shown below in Eq. (17a) and Eq. (17b). Although the approach is simple, the performance improvement is significant, exceeding that of using generalized advantage estimation (GAE) (Reference 16) both with and without multiple discount rates. Without the use of multiple discount rates, the performance was actually worsened by including the terminal reward term.
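The two-rate returns that appear inside Eqs. (17a) and (17b) can be computed with one backward recursion per rate. The sketch below splits the reward into the terminal-bonus stream r1 and the sum of all other terms r2, as in the text; the function names are ours.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Reward-to-go sum_{tau >= t} gamma^(tau - t) r_tau for each t,
    computed with a single backward pass over the episode."""
    out = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = rewards[t] + gamma * acc
        out[t] = acc
    return out

def two_rate_advantage(r1, r2, values, gamma1, gamma2):
    """Advantage of Eq. (17b): discount the terminal-bonus stream r1
    with gamma1 and the shaping stream r2 with gamma2, then subtract
    the learned value baseline."""
    returns = discounted_return(r1, gamma1) + discounted_return(r2, gamma2)
    return returns - values
```

The value-function target in Eq. (17a) is the same two-rate return, so the `returns` array can be reused for both losses.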
J(w) = \sum_{i=1}^{M} \left( V^\pi_w(x_t^i) - \left[ \sum_{\tau=t}^{T} \gamma_1^{\tau-t} r_1(u_\tau, x_\tau) + \gamma_2^{\tau-t} r_2(u_\tau, x_\tau) \right] \right)^2    (17a)

A^\pi_w(x_t, u_t) = \left[ \sum_{\tau=t}^{T} \gamma_1^{\tau-t} r_1(u_\tau, x_\tau) + \gamma_2^{\tau-t} r_2(u_\tau, x_\tau) \right] - V^\pi_w(x_t)    (17b)

EXPERIMENTS

For each experiment, we compare the performance of a DR/DV policy, an MLP policy, and recurrent policies with the recurrent layer unrolled through 1, 20, and 60 timesteps during the forward pass through the policy network. As a shorthand, we will refer to a recurrent network with the forward pass unrolled T steps in time as a T-step RNN. The DR/DV policy simply instantiates a DR/DV controller. For the experiments that use the Mars landing environment, each episode begins with the initial conditions shown in Table 3. In each experiment, we optimize the policy using PPO and then test the trained policy for 10,000 episodes. For the experiments using the Mars landing environment, the acceleration due to gravity (including downrange and crossrange components) is randomly set over a uniform distribution of +/-5% of nominal, and the lander's initial mass is randomly set over +/-10% of nominal.

Table 3. Mars Lander Initial Conditions for Optimization

              Velocity min (m/s)   Velocity max (m/s)   Position min (m)   Position max (m)
Downrange     -70                  -10                  0                  2000
Crossrange    -30                  30                   -1000              1000
Elevation     -90                  -70                  2300               2400

Experiment 1: Asteroid Landing with Unknown Dynamics

We demonstrate the adaptive guidance system in a simulated landing on an asteroid with unknown and highly variable environmental dynamics. We chose an asteroid landing environment for this task because the asteroid's rotation can cause the Coriolis and centrifugal forces to vary widely, creating in effect an unknown dynamical environment, where the forces are bounded only by the lander's thrust capability (recurrent policies are great, but they can't get around the laws of physics).
At the start of each episode, the asteroid's angular velocity (ω), gravity (g), and the local solar radiation pressure (SRP) are randomly chosen within the bounds given in Table 4. Note that ω, g, and SRP are drawn from uniform distributions with independent directional components. Near-Earth asteroids would normally have parameters known much more tightly than this, but the unknown-dynamics scenario could be realistic for a mission that visits many asteroids, where we want the guidance law to adapt to each unique environment. The lander has pulsed thrusters with a 2 N thrust capability per direction, and a wet mass that is randomly chosen at the start of each episode within the range of 450 kg to 500 kg.

Table 4. Experiment 1: Environmental Forces and Lander Mass

                       Minimum                      Maximum
                       x        y        z          x         y         z
Asteroid ω (rad/s)     -1e-3    -1e-3    -1e-3      1e-3      1e-3      1e-3
Asteroid g (m/s²)      -1e-6    -1e-6    -1e-6      -100e-6   -100e-6   -100e-6
SRP (m/s²)             -1e-6    -1e-6    -1e-6      1e-6      1e-6      1e-6

The lander's initial conditions are shown in Table 5. The variation in initial conditions is typically much less than this for such a mission. Indeed, Reference 17 assumes position and velocity standard deviations at the start of the OSIRIS-REx TAG maneuver of 1 cm and 0.1 cm/s, respectively. The lander targets a position on the asteroid's pole at a distance of 250 m from the asteroid center. Due to the range of environmental parameters tested, the effect would be identical if the target position were on the equator, or anywhere else 250 m from the asteroid's center of rotation. For purposes of computing the Coriolis and centrifugal forces, we translate the lander's position from the target-centered reference frame to the asteroid-centered reference frame. We define a landing plane with a surface normal depending on the targeted landing site, which allows use of the Mars landing environment with minimal changes.

Table 5.
Experiment 1: Lander Initial Conditions

             Velocity (cm/s)     Position (m)
             min     max         min    max
Downrange    -100    -100        900    1100
Crossrange   -100    -100        900    1100
Elevation    -100    -100        900    1100

The RL implementation is similar to that of the Mars landing. We use the reward shaping function shown below in Equations (18a) and (18b), the reward function as given in Eq. (15) but with g_smin set to 0, and the hyperparameter settings shown below in Table 6. For the terminal reward, we use r_lim = 1 m and v_lim = 0.2 m/s. Pulsed thrust is achieved by discretizing the policy output in each dimension to take three values: negative maximum thrust, zero thrust, or positive maximum thrust. We tuned the nominal gravity parameter of the DR/DV policy for best performance; it turns out that the optimal setting for this parameter is quite a bit higher than the actual environmental gravity.

v_targ = −v_o (r̂ / ‖r̂‖) (1 − exp(−t_go / τ))   (18a)

t_go = ‖r̂‖ / ‖v̂‖   (18b)

Table 6. Asteroid Mission Hyperparameter Settings

v_o (m/s)   τ (s)   α      β       γ      η
1           300     -1.0   -0.01   0.01   10

A typical trajectory is shown below in Figure 3, where the lower left subplot shows the value function's prediction of the value of the state at time t. Learning curves are given in Figure 4. Test results are given in Table 7. We see that the DR/DV guidance law fails in this task, with occasional catastrophic position errors (these typically occur after less than 1000 episodes of testing). All of the RL-derived policies have acceptable performance, but we see an increase in fuel efficiency as the number of recurrent steps in the forward pass is increased.

Table 7. Experiment 1: Performance

                Terminal Position (m)   Terminal Velocity (cm/s)   Fuel (kg)
                µ     σ     max         µ     σ     max            µ      σ     max
DR/DV           1.9   54.6  1811        54    23    47             1.66   0.38  4.99
MLP             0.2   0.1   0.9         4.1   1.4   9.7            1.62   0.92  3.32
RNN 1 step      0.2   0.1   0.7         3.8   1.3   8.7            1.48   0.73  3.36
RNN 20 steps    0.2   0.1   1.7         4.0   1.3   11.0           1.42   0.38  3.63
RNN 60 steps    0.3   0.2   1.0         4.2   1.3   9.0            1.44   0.37  3.37
RNN 120 steps   0.3   0.2   0.9         4.1   1.3   7.9            1.43   0.35  2.85

Experiment 2: Mars Landing using Radar Altimeter Observations

In this task, the observations are simulated Doppler altimeter readings from the lander, generated using a digital terrain map (DTM) of the Mars surface in the vicinity of Uzboi Vallis. Since the simulated beams can cover a wide area of terrain, we doubled the map size by reflecting the map and joining it with the original map, as shown in Figure 5. Note that the agent does not have access to the DTM, but must learn how to successfully complete the maneuver using rewards received from the environment. Although these observations are a function of both lander position and lander velocity, they do not satisfy the Markov property: multiple ground-truth positions and velocities could correspond to a given observation, making the optimal action a function of the history of past altimeter readings. For the simulated altimeter readings, the agent's state in the target-centered reference frame is transformed to the DTM frame, which has a target location of [4000, 4000, 400] meters.

In order to simulate altimeter readings fast enough to allow optimization to complete in a reasonable time, we had to create a fast altimeter model. The model uses a stack of planes with surface normals in the z (elevation) direction that spans the elevations in the DTM. Each of the four radar beams has an associated direction vector, which, in conjunction with the spacecraft's position, can quickly be used to find the intersection of the beam and the planes.
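The beam-plane intersection step just described reduces to a single ray-plane test per elevation plane; a minimal sketch (names and conventions are ours):

```python
import numpy as np

def beam_plane_intersection(p, d, h):
    """Intersection of a radar beam with the horizontal plane z = h.
    p: lander position; d: unit beam direction (d[2] < 0 for a downward beam);
    h: plane elevation. Returns the intersection point, or None if the beam
    never reaches the plane."""
    if d[2] >= 0.0 or p[2] <= h:
        return None
    t = (h - p[2]) / d[2]      # slant range along the beam to the plane
    return p + t * d
```

The x, y components of the returned point are what would be quantized into DTM indices; repeating this test over the stack of planes and keeping the plane whose z best matches the indexed DTM elevation gives the fast (if occasionally wrong) range estimate.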
The intersection's x, y indices are used to index the DTM, and the plane intersection with a z value closest to that of the indexed DTM elevation is used to determine the distance between the lander and the terrain feature hit by the beam. This is extremely fast (about 1000X faster than the ray-casting approach we used in Reference 18), but it is error prone at lower elevations, as sometimes the closest match between the DTM elevation and the associated plane intersection's z component lies on the far side of a terrain feature. Rather than call this a bug, we use it to showcase the ability of a recurrent policy to get remarkably close to a good landing given the large errors.

Figure 3. Experiment 1: Typical Trajectory for Asteroid Landing

Figure 4. Experiment 1: Learning Curves

Figure 5. Experiment 2: Digital Terrain Map

The reduction in accuracy at lower elevations is apparent in Table 8. The accuracy was estimated by choosing 10,000 random DTM locations and casting a ray to a random position at the designated elevation; the length of this ray is the ground-truth altimeter reading. We then checked what our measurement model returned from that lander position, with the error being the difference. Note that the DTM elevations range from 0 to 380 m. In this scenario, the lander targets a landing position 50 m above the top of a hill at 350 m elevation.

The altimeter beams are modeled as having equal offset angles (π/8 radians) from a direction vector that is the average of the lander's velocity vector and straight down. We thought this a reasonable assumption, as we are modeling the lander in 3-DOF. We see from Table 9 that although you would not want to entrust an expensive rover to this integrated guidance and navigation algorithm, the performance is remarkably good given the altimeter inaccuracy at lower elevations.
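The beam geometry described above (four beams at equal π/8 offsets from a central axis averaging the velocity direction with straight down) can be sketched as follows; the choice of perpendicular basis and the four azimuth placements are our assumptions, since the paper does not specify them:

```python
import numpy as np

def beam_directions(v, offset=np.pi / 8):
    """Four unit beam directions with equal offset angles from a central
    axis that is the normalized average of the velocity direction and
    straight down."""
    down = np.array([0.0, 0.0, -1.0])
    c = v / np.linalg.norm(v) + down
    c /= np.linalg.norm(c)                       # central beam axis
    # any vector not parallel to c, used to build a perpendicular basis
    a = np.array([1.0, 0.0, 0.0]) if abs(c[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(c, a); u /= np.linalg.norm(u)
    w = np.cross(c, u)
    beams = []
    for az in (0.0, np.pi / 2, np.pi, 3 * np.pi / 2):
        b = np.cos(offset) * c + np.sin(offset) * (np.cos(az) * u + np.sin(az) * w)
        beams.append(b / np.linalg.norm(b))
    return beams
```

Each returned direction makes exactly the offset angle with the central axis, so the four beams bracket the point the central axis is aimed at.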
Learning curves for the 1-step and 120-step RNNs are shown in Figures 6 and 7, which plot statistics for terminal position (r_f) and terminal velocity (v_f) as a function of episode, with the statistics calculated over the 30 episodes used to generate rollouts for updating the policy and value function. We see from the learning curves that the number of steps we unroll the recurrent network in the forward pass has a large impact on optimization performance, and that for the 120-step case, the optimization initially makes good progress but then stalls, likely due to the highly inaccurate altimeter readings at lower altitudes.

Table 8. Experiment 2: Altimeter Error as a Function of Lander Elevation

Elevation (m)   mean (m)   std (m)   max (m)   miss %
100             2600       3200      10000     61
400             513        1500      8800      22
500             122        528       4467      12
600             25         201       2545      6
700             8          92        1702      4
800             4          60        1300      2

Table 9. Experiment 2: Performance

                Terminal Position (m)   Terminal Velocity (m/s)   Glideslope           Fuel (kg)
                µ     σ     max         µ     σ     max           µ     σ     min      µ
MLP             131   101   796         48    18    145           1.61  0.38  0.96     224
RNN 1 step      114   95    1359        37    4     67            3.57  3.87  0.49     228
RNN 20 steps    78    41    390         26    1.7   39            3.54  6.05  0.49     234
RNN 120 steps   72    40    349         28    2     44            3.34  3.78  0.49     233
RNN 200 steps   59    42    288         23    4     40            2.55  2.62  0.70     242

In a variation on this experiment, we assume that the lander has the ability to point its radar altimeters such that the central direction vector remains fixed on the target location, so that the beams themselves bracket the target. This functionality could be achieved with phased-array radar, but a separate pointing policy would need to be learned that keeps the beams pointed in the required direction. We also reduce the initial condition uncertainty to that shown in Table 10. Here we see that performance markedly improves. We postulate one reason for the improved performance is that the altimeter beams remain in areas of high terrain diversity. Indeed, when we repeat the experiment for a landing site farther to the south (bottom of the DTM), we find that performance degrades. Another factor could be that since the policy is focused on a small portion of the map, it does not "forget" the relationship between observations and advantages.

Figure 6. Experiment 2: Learning Curves for 1-step RNN

Figure 7. Experiment 2: Learning Curves for 120-step RNN

Table 10. Experiment 2: Initial Conditions with Altimeter Beams Pointed at Target

             Velocity (m/s)     Position (m)
             min     max        min     max
Downrange    -30     -10        0       1000
Crossrange   -30     30         -500    500
Elevation    -50     -40        1000    1000

Table 11. Experiment 2: Performance with Target Pointing

                Terminal Position (m)   Terminal Velocity (m/s)   Glideslope             Fuel (kg)
                µ     σ     max         µ      σ     max          µ      σ      min      µ
MLP             1.4   2.9   179.2       4.92   2.29  84.16        9.33   13.83  0.41     211
RNN 1 step      3.3   1.9   108.6       5.75   1.38  69.80        6.42   6.42   0.30     220
RNN 20 steps    0.3   1.2   116.0       1.61   0.60  64.84        11.22  15.73  0.98     219
RNN 60 steps    0.4   1.5   111.5       2.00   0.92  53.73        10.74  16.42  0.82     221
RNN 120 steps   0.6   1.6   139.8       2.23   0.94  57.47        8.32   14.67  0.74     223

Taking into account the number of large outliers, it is probably best to focus on average performance when comparing the network architectures. Fuel usage is not a good measure of performance in this experiment, as it decreases with increased landing velocity. The general trend is that performance increases as we increase the number of steps we unroll the recurrent layer for the forward pass. This implies that the temporal dependencies for this task span a significant fraction of a single episode.

Experiment 3: Mars Landing with Engine Failure

To test the ability of the recurrent policy to deal with actuator failure, we increase the Mars lander's maximum thrust to 24000 N. In a 6-DOF environment, each engine would be replaced by two engines with half the thrust, with reduced thrust occurring when one engine in a pair fails.
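The per-episode failure sampling used in this experiment (halve one horizontal thrust axis, reduce vertical thrust by a factor of 1.5, failure probability 0.5) could be sketched as below; whether the vertical reduction accompanies every failure is our reading of the setup, and the helper name is ours:

```python
import numpy as np

def sample_engine_failure(rng, p_fail=0.5):
    """Sample per-axis thrust scale factors [downrange, crossrange, elevation]
    for one episode. With probability p_fail, either the downrange or the
    crossrange thrust is halved and the vertical thrust is reduced by 1.5x."""
    scale = np.ones(3)
    if rng.random() < p_fail:
        axis = int(rng.integers(0, 2))   # 0 = downrange, 1 = crossrange
        scale[axis] = 0.5
        scale[2] = 1.0 / 1.5
    return scale
```

Multiplying the commanded thrust by these factors each episode exposes the policy to every failure mode often enough for the optimizer to learn to compensate.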
At the start of each episode, we simulate an engine failure in 3-DOF by randomly choosing to limit the available downrange or crossrange thrust by a factor of 2, and limiting the vertical (elevation) thrust by a factor of 1.5. Some episodes occur with no failure; we use a failure probability of 0.5. A real engine would hopefully be more reliable, but we want to optimize with each failure mode occurring often. The goal is to demonstrate a level of adaptability that would not be possible without an integrated and adaptive guidance and control system. We see in Table 12 that the DR/DV policy fails catastrophically, whereas the MLP policy comes close to a safe landing. The 1-step recurrent policy has performance close to that of the MLP policy, but performance improves to something consistent with a safe landing for the 20-step and 60-step recurrent policies. The 20-step and 60-step policies also have improved fuel efficiency.

Table 12. Experiment 3: Performance

               Terminal Position (m)   Terminal Velocity (m/s)   Glideslope              Fuel (kg)
               µ     σ     max         µ      σ      max         µ      σ      min       µ
DR/DV          123   186   1061        20.07  19.06  59.29       2.16   1.04   0.96      213
MLP            0.7   0.2   1.8         0.99   0.62   4.67        13.46  5.35   4.93      302
RNN 1 step     0.9   0.2   1.7         0.95   0.44   4.76        14.23  3.44   6.29      298
RNN 20 steps   0.3   0.1   0.9         0.99   0.06   1.15        49.99  82.14  11.36     295
RNN 60 steps   0.3   0.2   1.1         1.00   0.06   1.16        24.98  12.70  9.45      295

Experiment 4: Mars Landing with High Mass Variation

Here we divide the lander engine's specific impulse by a factor of 6, which increases fuel consumption to around 1200 kg on average, with a peak of 1600 kg. This complicates the guidance problem in that the mass varies by a significant fraction of the lander's initial mass during the descent, and we do not give the agent access to the actual mass during the descent.
Although we are using a Mars landing environment for this task, the large variability in mass is more similar to the problem encountered in an EKV interception of an ICBM warhead, where there is a high ratio of wet mass to dry mass. We see in Table 13 that the DR/DV policy has a rather large maximum position error and an unsafe terminal velocity. The MLP policy and 1-step recurrent policy give improved performance, but still result in an unsafe landing. The 20-step recurrent policy achieves a good landing, which is slightly improved on by the 60-step recurrent policy.

Table 13. Experiment 4: Performance

               Terminal Position (m)   Terminal Velocity (m/s)   Glideslope              Fuel (kg)
               µ     σ     max         µ      σ     max          µ      σ      min       µ
DR/DV          0.4   1.5   19.6        0.63   9.56  5.39         36.89  180.6  3.12      1362
MLP            0.7   0.2   3.7         0.92   0.26  5.25         10.12  2.09   0.63      1235
RNN 1 step     0.4   0.1   0.8         0.98   0.42  6.48         20.05  4.20   7.71      1229
RNN 20 steps   0.6   0.1   1.0         1.17   0.07  1.36         22.64  7.18   12.62     1224
RNN 60 steps   0.6   0.2   1.1         1.06   0.05  1.21         20.18  7.94   8.88      1227

CONCLUSION

We have applied reinforcement meta-learning with a recurrent policy and value function to four difficult tasks: an asteroid landing with unknown and highly variable environmental parameters, a Mars landing using noisy radar altimeter readings for observations, a Mars landing with engine actuator failure, and a Mars landing with high mass variability during the descent. In each of these tasks, we find that the best performance is achieved using a recurrent policy, although for some tasks the MLP policy comes close. The DR/DV policy had the worst performance over all four tasks. The takeaway is that the ability to optimize under parameter uncertainty leads to robust policies, and the ability of a recurrent policy to adapt in real time to dynamic environments makes it the best-performing option in environments with unknown or highly variable dynamics. Future work will explore the adaptability of recurrent policies in more realistic 6-DOF environments.
We will also explore different sensor models for observations, such as simulated camera images and flash LIDAR, for asteroid close-proximity missions.

REFERENCES

[1] Z. Guang, Z. Heming, and B. Liang, "Attitude Dynamics of Spacecraft with Time-Varying Inertia During On-Orbit Refueling," Journal of Guidance, Control, and Dynamics, 2018, pp. 1–11.
[2] W. Yu, J. Tan, C. K. Liu, and G. Turk, "Preparing for the unknown: Learning a universal policy with online system identification," arXiv preprint, 2017.
[3] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Sim-to-real transfer of robotic control with dynamics randomization," arXiv preprint, 2017.
[4] B. Gaudet, R. Linares, and R. Furfaro, "Deep Reinforcement Learning for Six Degree-of-Freedom Planetary Powered Descent and Landing," arXiv preprint arXiv:1810.08719, 2018.
[5] C. D'Souza and C. D'Souza, "An optimal guidance law for planetary landing," Guidance, Navigation, and Control Conference, 1997, p. 3709.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, Vol. 518, No. 7540, 2015, p. 529.
[7] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, "Sequential deep learning for human action recognition," International Workshop on Human Behavior Understanding, Springer, 2011, pp. 29–39.
[8] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, Vol. 8, No. 3-4, 1992, pp. 229–256.
[9] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint, 2017.
[10] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," International Conference on Machine Learning, 2015, pp. 1889–1897.
[11] D. C. Sorensen, "Newton's method with a model trust region modification," SIAM Journal on Numerical Analysis, Vol. 19, No. 2, 1982, pp. 409–426.
[12] S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, Vol. 22, No. 1, 1951, pp. 79–86.
[13] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Gated feedback recurrent neural networks," International Conference on Machine Learning, 2015, pp. 2067–2075.
[14] A. Y. Ng, S. J. Russell, et al., "Algorithms for inverse reinforcement learning," ICML, 2000, pp. 663–670.
[15] A. Y. Ng, Shaping and Policy Search in Reinforcement Learning. PhD thesis, University of California, Berkeley, 2003.
[16] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.
[17] B. Udrea, P. Patel, and P. Anderson, "Sensitivity Analysis of the Touchdown Footprint at (101955) 1999 RQ36," Proceedings of the 22nd AAS/AIAA Spaceflight Mechanics Conference, Vol. 143, 2012.
[18] B. Gaudet and R. Furfaro, "A navigation scheme for pinpoint Mars landing using radar altimetry, a digital terrain model, and a particle filter," 2013 AAS/AIAA Astrodynamics Specialist Conference, Astrodynamics 2013, Univelt Inc., 2014.
