Robust Recovery Controller for a Quadrupedal Robot using Deep Reinforcement Learning

©2019 IEEE. Personal use of this material is permitted. Per - mission from IEEE must be obtained for all other uses, in an y current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating ne w collecti v e works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this w ork in other works. Rob ust Reco very Controller for a Quadrupedal Robot using Deep Reinforcement Learning Joonho Lee, Jemin Hwangbo, and Marco Hutter Robotic Systems Lab, ETH Zurich E-mails: { jolee, jhwangbo, mahutter } @ethz.ch Abstract —The ability to reco ver from a fall is an essential feature f or a legged robot to na vigate in challenging envir onments rob ustly . Until today , there has been very little progr ess on this topic. Current solutions mostly build upon (heuristically) prede- ﬁned trajectories, resulting in unnatural beha viors and requiring considerable effort in engineering system-speciﬁc components. In this paper , we present an approach based on model-free Deep Reinfor cement Learning (RL) to control recovery maneuvers of quadrupedal robots using a hierar chical behavior -based con- troller . The controller consists of f our neural network policies including three behaviors and one beha vior selector to coordinate them. Each of them is trained individually in simulation and deployed directly on a real system. W e experimentally validate our approach on the quadrupedal robot ANYmal, which is a dog-sized quadrupedal system with 12 degrees of freedom. With our method, ANYmal manifests dynamic and reactive recov ery behaviors to recov er from an arbitrary fall conﬁguration within less than 5 seconds. W e tested the recov ery maneuver more than 100 times, and the success rate was higher than 97 %. I . I N T R O D U C T I O N In case of a fall, animals sho w the remarkable ability to recov er from any posture by pushing against their sur - roundings and swinging limbs to gain momentum. Having similar abilities in legged robots would signiﬁcantly improve their robustness against failure and extend their applicability in harsh environments. W e address this topic in the present work by dev eloping a control strategy for the robust recov ery maneuver of quadrupedal robots. By recovery maneuver , we mean the maneuver of returning to a typical operating state (standing or walking) from a fall as shown in the Fig. 1. For such maneuver , the robot needs to produce motions that make good use of the interactions with the ground and swinging motion of the legs while avoiding self-collisions. Optimization-based methods [17] have a hard time to solve such a task as they usually require analytic dynamic models and often predeﬁned contact sequences [1], which are both hard to ﬁnd as the system can interact at multiple uncertain contact points or patches. Moreov er , all control methods that are based on simpliﬁed template models are not valid any- more in such fall conﬁguration. Existing methods in recov ery controllers simplify the problem by using handcrafted control sequences [24], [25] or using simpliﬁed models [21], or even adding mechanisms such as a tail or extra limbs [5], [7]. Consequently , they exhibit predictable behavioral patterns, which limit their robustness in corner cases (e.g., when a robot’ s legs get stuck below its base). They also require a considerable engineering effort. Fig. 1: A recovery maneuver of ANYmal. (T op left) ANYmal is initialized at a fall conﬁguration. (T op r ow) ANYmal swings its legs to gain momentum, (Middle row) pushes the ground to regain the upright and stable posture, and then (Bottom ro w) stands up and walks. A promising alternativ e for generating self-righting behav- iors is model-free Deep Reinforcement Learning (RL). In this paradigm, an agent interacts with its environment and learns the dynamics and a control policy from the experience. By using model-free methods, we can generate control policies without any abstraction in modeling complex dynamics. Un- fortunately , existing model-free algorithms typically require an excessi ve number of trial-and-errors to obtain a performant policy , and hence it often becomes impractical to train a policy on sophisticated hardware. This is particularly true for dynamic systems like legged robots where bad policies can be fatal. T o overcome these limitations, existing works leverage simulations where one can generate massive data at no cost and with high consistency . V ery recent and promising research results in the ﬁeld of legged robotics have demonstrated that learned locomotion policies can be transferred from simulation to reality [13], [26]. In order to realize this transfer and ov ercome the so-called reality gap, it was important to use high-ﬁdelity simulations. In [26] this was achieved by model parameter estimation, while [13] proposed a method to learn parts of the simulated model from real data. Beside traditional locomotion, the latter work also demonstrated a sophisticated self-righting policy that can be trained in a few hours in the simulation. Howe ver , it is still challenging to train a single policy that can manifest multiple behaviors including self-righting, standing up, locomotion or others, because the present Deep RL algorithms have limited capability in learn- ing multiple skills. It often shows the degradation of the performance in individual tasks in case of learning multiple skills at once (e.g., Multitasker in [2]). A simple yet powerful solution to incorporate multiple behaviors is to divide-and- conquer . In this approach, a control problem is decomposed into several behaviors, and then the behaviors are coordinated either by hand [6] or learning a rule for behavior selection. The behaviors and the selection rule can be learned separately [16], [14] or simultaneously [8], [19]. For separate learning, a high- lev el behavior selector is trained using a set of pre-learned behaviors. This is a widely adopted approach in behavior - based robotics [15] and has shown success in high-dimensional continuous control problems recently . Merel et al. [16] and Liu and Hodgins [14] demonstrated human-like behaviors in simulation with a hierarchically structured controller that consists of multiple lo w-level controllers and an agent to organize them in a task-oriented way . In this paper we present a hierarchically structured con- troller consisting only of neural networks that is able to generate complex combined maneuvers like recovering from a fall. we build up the complex skill set from individual behaviors including self-righting, standing up, and locomotion. The self-righting and locomotion beha viors are introduced in the previous work [13]. W e reproduced the self-righting be- havior with different action representation and reused existing policy for the locomotion beha vior . These beha viors are trained using T rust Region Polic y Optimization (TRPO) [22] using the simulation framew ork presented in the existing work [13]. W e take inspiration from [16] and [14], and learn the complex behavior selection subsequently . Moreover , we introduce a nov el learning-based state estimator that is learned in parallel to the behavior selection and which even works in degenerated conditions (i.e., when fallen on the ground and being in a complex, unobservable contact condition). W ith these elements, we are able to generate a recov ery controller of unprecedented robustness. The system was tested on the quadrupedal robot ANYmal [10] in more than 100 trials with a success rate higher than 97 %. Thereby , we showed that our method can cope with all kinds of corner cases for which previous solutions failed. As the propose controller is not based on an y heuristics, it has the potential to be applicable for a wide variety of complex skill sets and hence bring our robots as step closer to their natural counterparts. I I . M E T H O D In this section, we ﬁrst provide an overvie w of our method and then describe the details of the designs and implementation of each component. A. Overview Fig. 2 sho ws an ov erview of the control frame work. The behavior selector and three behaviors form a hierarchical behavior -based controller . Each behavior is a control policy Fig. 2: Control architecture for the recovery controller . TSIF refers to the T wo State Implicit Filter [4]. for individual behaviors. The behavior selector selects the most appropriate behavior for the current situation depending on the recent observations, commands, previously chosen behavior type and the previous action. At each time step, the control policy for the chosen behavior sends commands to the actuators. The height estimator is a neural network regression module for estimating base height during the deployment on the real system. W e decomposed the control task into three behaviors: self- righting, standing up, and locomotion. This decomposition is to cope with the difﬁculty of training a single policy that can manifest all of the necessary behaviors. From our experience, a policy trained to perform multiple tasks shows ungraceful behaviors such as frequent slippages and highly conservati ve postures. It also requires a lot more effort to come up with an effecti ve cost function that leads to natural motions because the desired controller has to cover larger state space. Learning three behaviors separately simpliﬁes the cost function design and enables us to troubleshoot each control policy separately on the real system. During the deployment on the real system, the estimated states from the T wo State Implicit Filter (TSIF) [4] are used. Additionally , we used a neural network named height estimator to estimate the base height ( z -position of the base in the world frame) to resolve the drift issues in linear position. The deﬁnition of the state spaces and the characteristics of TSIF are provided in the section II-B. Each behavior is separately trained and tested on the real robot, whereby training is done only in simulation. Sub- sequently , the behavior selector and a height estimator are trained using the set of pretrained beha viors. The simulated en vironments for learning consists of the data-driv en actuator model [13] and the fast contact solver presented in [12], which efﬁciently generates high-ﬁdelity samples. T rust Region Policy Optimization algorithm [22] with the Generalized Advantage Estimator [23] (TRPO+GAE) is used for learning. Although stochastic policies are used during training, the variances are reduced to 0 during deployment to ensure a more consistent behavior . B. F eature Selection for the State Spaces For each behavior , we select the most representativ e and reliable set of states as giv en in table I. The existing state Function Data Self-Righting Policy Gravity vector ( e g ) Base angular velocity in body frame ( ω B I B ) Joint positions ( φ j ) Joint velocities ( ˙ φ j ) History of joint position error & velocity Previous joint position targets ( a t − 1 ) Standing Up Policy Base linear velocity in body frame ( v B I B ) State space of the Self-Righting policy Locomotion Policy V elocity commands Estimated base height ( h e ) State space of the Standing up policy Behavior Selector Previous action (one-hot vector) State space of the Locomotion policy Height Estimator Gravity vector ( e g ) Joint positions Joint velocities History of joint position errors & velocities Actuator Model Desired joint position History of joint position errors & velocities T ABLE I: Deﬁnition of the State Spaces estimation frame work of ANYmal relies on the T wo State Im- plicit Filter (TSIF) [4] for the base pose and twist, and angular encoders for the joint states. TSIF estimates the base twist and the base position in the inertial frame by recursi vely fusing kinematics and measurements from the Inertial Measurement Unit (IMU). It makes use of the positions of feet touching the ground to incorporate kinematic contact constraints. When a foot slips or all four feet lose contacts with the ground, which are likely to happen when ANYmal falls, the estimated base states from TSIF becomes unreliable as the position and linear velocity drift ov er time (see section III-B). T o maximize reliability , the linear position and linear velocity are excluded from the state space of the self-righting controller . A unit vector pointing in the direction of the gravity expressed in the base frame ( e g ) is used to represent the orientation of the base because an IMU can observe only this two dimensional subspace of the orientation (yaw angle is unobservable). The x , y positions of the base in the inertial frame are excluded from the state spaces because they are always unobservable with IMU [3]. On the other hand, the z position of the base (base height) can be accurately estimated using kinematics while walking. W e kept this state in the locomotion policy as it is critical for fast learning and estimate the height using a new estimation network presented in section II-F. C. Cost terms The cost terms are presented in the table II. W e deﬁned cost functions for each behavior by a linear combination of cost terms. For joint position cost, we used the minimum angle difference between two angular positions denoted by d φ ( · , · ) : R × R → [0 , π ] . It is important to deﬁne a bounded cost function because otherwise an agent often ﬁnds it more rewarding to terminate than to explore its en vironment. T o this end, we used a logistic kernel function K : R → [ − 0 . 25 , 0) deﬁned as K ( e | α ) = − 1 / ( e αe + 2 + e − αe ) , α ∈ R > 0 where e represents an error Symbols ˙ φ j slim maximum joint speed I c index set of the contact points I c,f index set of the foot contact points I c,in index set of the self-collision points i c,n impulse of the n th contact g i gap function of the i th contact p f ,i position of the i th foot τ vector of joint torques |·| cardinality of a set or l 1 norm ||·|| l 2 norm ˆ · target value Cost T erms Angular velocity c ω = K ( | ω B I B − ˆ ω B I B | , α a ) Linear velocity c v = K ( | v B I B − ˆ v B I B | , α l ) Height c h = 1 . 0 if h < 0 . 35 , otherwise 0 Joint position c j p = d φ ( φ j , ˆ φ j ) Orientation c o = || [0 , 0 , − 1] T − e g || T orque c τ = || τ || 2 Power c pw = P 12 i =0 max ( ˙ φ j,i τ i , 0) Joint acceleration c a = P 12 i =0 || ¨ φ i || 2 Joint speed c j s = P 12 i =0 max ( ˙ φ j slim − | φ i | , 0) 2 Body impulse c bi = P n ∈ I c \ I c,f || i c,n || / ( | I c | − | I c,f | ) Body slippage c bs = P n ∈ I c || v c,n || 2 / | I c | Foot slippage c f s = P || v f ,i || ∀ i s.t. g i = 0 , i ∈ I f ,c Foot clearance c f c = P ( p f ,i − 0 . 07) 2 || v f ,i || ∀ i s.t. g i > 0 , i ∈ I f ,c Self collision c cin = | I c,in | Action difference c ad = || a t − 1 − a t || 2 T ABLE II: Cost T erms term. This kernel comes handy in tuning because it enables us to leverage relativ e importance between dif ferent cost terms and it enables us to adjust an agent’ s sensitivity to e by adjusting α . D. Behaviors The policies for self-righting, standing up and locomotion are individually trained to achiev e different tasks. 1) T asks: Each behavior is learned based on a different cost function, initial state distributions, and termination conditions. • Self-righting behavior is to regain upright base pose from an arbitrary conﬁguration (Fig. 3a) and re-position joints to the sitting conﬁguration (Fig. 3b) which is designed such that ANYmal has all feet on the ground for a safe stand-up maneuver . The cost function is deﬁned as k o c o + k j p c j p + k a c a + k bi c bi + k bs c bs + k c,in c c,in + k ad c ad + k j slim c j slim + k τ c τ , where k ( · ) is a scaling factor . Each cost terms are explained in table. II. The weight for the orientation cost ( k o ) was set to be the highest such that ANYmal recovers up-right base pose as soon as possible. The magnitudes of contact impulses are penalized to av oid violent motions. The joint accel- erations are penalized to generate smooth motions. T o sample initial states for training in simulation, we dropped ANYmal from 0.5 m above the ground with random joint positions. The termination condition is only (a) (b) Fig. 3: ( a ) Sampled initial states and ( b ) the target conﬁgura- tion of the self-righting task. the time limit. • Standing up behavior is to stand up from arbitrary sitting conﬁgurations. The cost function is similar to the cost of the self-righting but a height cost is additionally introduced: k j p c j p + k o c o + k h c h + k a c a + k ad c ad + k j slim c j slim + k τ c τ . The target joint conﬁguration is the standing conﬁguration. W e use the same strategy for sampling initial states as the self-righting task but with a near-upright pose and do not specify any termination condition except for the time limit. • Locomotion behavior is to follow a giv en velocity com- mand composed of desired forward velocity , lateral veloc- ity , and turning rate (or yaw rate) which are sampled from the uniform distribution, U ( − 1 , 1) m / s , U ( − 0 . 4 , 0 . 4) m / s , and U ( − 1 . 2 , 1 . 2) rad / s respectively . They are deﬁned concerning the capabilities of an existing controller [1]. The cost function penalizes the velocity tracking errors ( c ω and c v ), foot motions ( c f c and c f s ), and constraint violation. It is deﬁned as k ω c ω + k v c v + k o c o + k f c c f c + k f s c f s + k ad c ad + k j slim c j slim + k τ c τ . The initial states are sampled from a multiv ariate Gaus- sian distribution centered at the standing conﬁguration. An episode terminates when the joint limit is violated or ANYmal falls. 2) Action Repr esentation: The output of a policy for a behavior is a 12-dimensional real-valued vector and each dimension is mapped to a position target of the low-impedance Proportional-Deriv ative (PD) controllers running at each joint actuator . Peng and van de Panne [18] showed that using this type of action representations for learning motor skills achiev es better ﬁnal performance, faster learning, and higher robustness compared to other representations such as target joint torques and target joint velocities. The output of a policy network is mapped to joint position targets dif ferently depending on the task. For locomotion, the desired joint position φ d is deﬁned as φ d = k o t + φ n where k is a scaling parameter, o t is the output, and φ n is a nominal joint conﬁguration (standing). It is designed such that the distribution of the target positions has a standard deviation of approximately 1 and mean at the nominal conﬁguration at the beginning of training. It accelerates the learning because the agent explores trajectories near the standing conﬁguration more frequently . For the self-righting and the standing up, we deﬁne φ d = k o t + φ t , where φ t is a vector of current joint positions. It promotes exploration in joint spaces and results in faster learning of self-righting. 3) Arc hitecture: The policies are parameterized by a two- layered feed-forward neural network with tanh units in the hidden layers. The self-righting and standing-up policies have 128 units in each hidden layer and the locomotion policy has 128 and 256 units, respectiv ely . 4) T raining: Each policy is trained separately with TRPO+GAE. For the natural and smooth motions, we penal- ized joint torque, velocity , acceleration, and action difference. W e employed Curriculum Learning (CL) in a way that these terms are near zero at the ﬁrst iteration and increase as the training proceeds [13]; otherwise a learning agent con ver ged to a local minima where it stops moving. E. Behavior Selector The behavior selector learns to determine which beha vior to ex ecute depending on the current situation. 1) T ask: A behavior selector has to choose an appropriate behavior such that ANYmal returns to a nominal operating state every time it loses balance. By a nominal operating state, we mean states where it can locomote. T o this end, the cost function is deﬁned as k ω c ω + k v c v + k o c o + k h c h + k pw c pw + k ad c ad + k j slim c j slim + k τ c τ , which is similar to that of the locomotion task. Additionally , the use of c pw makes the resulting behaviors power efﬁcient, while c ad and c τ ensure smooth transitions between different behaviors. The initial states are sampled from the initial state distri- butions of a randomly selected behaviors and the termination condition is the same as that of self-righting. 2) State and Action Spaces: The state space of the behavior selector consists of the union of the state spaces of behaviors and the index of previously chosen behavior . A behavior selector maps a state s to a categorical dis- tribution over the behaviors. It is denoted as π θ ( a | s ) with a ∈ { 0 , 1 , 2 } and the output is a three dimensional real-valued vector { p 0 , p 1 , p 2 } . Each dimension represents the probability for choosing each behavior . 3) Arc hitecture: The behavior selector is parameterized by a two-layered feed-forward neural network with 128 tanh units in the hidden layers. Softmax function is used at the output such that P 3 i =0 π θ ( i | s ) = 1 for any s . The height estimator is of the same structure b ut without Softmax. Moreover it uses the softsign unit which is computationally more efﬁcient than tanh. 4) T raining: W e use a set of pre-trained behaviors, which are regarded as a part of the en vironment, to train the behavior selector using TRPO+GAE. The height estimator, which is explained in section II-F, is trained concurrently as outlined in the algorithm 1. The reasoning behind this strategy is to match their state visitation frequency . F . Height Estimation The estimated base height becomes unreliable when ANY - mal falls. W e could observe huge errors from the base position Algorithm 1 Training Behavior Selector Initialize θ , ψ randomly for i = 0 , 1 , ..., N do for t = 0 , 1 , ..., T do if i > N w then  N w = W arm-up period Use the estimated height h ψ ( s t ) Sample action a t ∼ π θ ( a | s t ) Excute the corresponding behavior Collect state s t , action a t , and rew ard r t Collect true height h t . Append a s t - h t pair into the replay memory Sample K pairs from the replay memory Update ψ by minimizing P K j =0 || h j − h ψ ( s j ) || 2 Update θ using TRPO [22] estimates during the experiments when ANYmal lies on the ground, which can lead to undesired behaviors. T o resolve this issue, we trained a neural network to estimate the base height. It is denoted as h ψ with a set of parameters ψ . The output is calculated from the body orientation (IMU) and joint positions (encoders), which are states that do not have a drift issue. The base height can be computed using forward kinematics under an assumption that ANYmal is on the ﬂat ground and the geometric properties of the links are known. The linear velocity estimate also had the same issue but the errors were not signiﬁcant. G. Handcrafted Behavior Selector W e introduce a traditional approach that we considered before using the neural network behavior selector . Finite State Machine (FSM) is a widely adopted method for controlling hybrid systems. It is deﬁned by a set of states and transitions between them, which are usually designed by a domain expert. The proposed controller can be seen as an FSM if we regard behaviors as states and the behavior selector as a learned transition rule. As the task is straightforward, we could handcraft it by going through trial-and-error (Fig. 4). T o maximize the success rate, it waits for TSIF to con ver ge for 0.5 seconds after it conducts the self-righting behavior . X = th e angle betw een th e inertial z axis an d the ba se z axis Self -ri ghti ng Stan din g up Locomo tion If X > 35 ° or base heigh t < 0.4 m If X < 20 ° for 0.5 s If base heigh t > 0.3 5 m Fig. 4: FSM for behavior selection. H. Simulating ANYmal The structure of our simulator is depicted in Fig. 5. It has actuator models and stochastic components that are designed to account for modeling errors and to robustify the resulting policies [13]. W e used RaiSim [12] as rigid-body simulator Fig. 5: Simulation for ANYmal. together with learned networks that represent the actuator dynamics. 1) Randomized Physical Pr operties: T o make the solution robust against modeling errors and to av oid tedious parameter estimation for each link as done in [26], we directly use phys- ical properties computed from the CAD model including link lengths and inertial properties, but randomize the simulation to ov ercome modeling errors. It has been shown in sev eral works that randomization enhances the robustness and increases the success rate of a sim-to-real transfer [13], [26], [11]. The link masses are randomized with additiv e noises up to 10 % of the original value, and the COM of the base is randomly translated up to 3 cm in x , y , z directions respectively ev ery episode. W e approximated the collision geometry of ANYmal with collision primitiv es such as a box, a cylinder , and a sphere. The positions and shapes of the these collision bodies are also randomized. The coefﬁcient of friction between the objects is uniformly sampled between 0.8 to 2.0 e very time step because we cannot accurately simulate the material properties. 2) Actuator Model: ANYmal’ s joints are Series Elastic Ac- tuators (SEA) [20]. An SEA consists of multiple components including a spring, gears, encoders, and an electric motor , which results in highly complex dynamics. It is essential to simulate actuators accurately and fast to ef ﬁciently train a sim- to-real transferable polic y because it substantially impro ves the simulation accuracy . Developing an analytic model requires a large number of parameters to be estimated or assumed to be accurately provided by a manuf acturer and thus often results in an inaccurate model [9]. Instead, we use a data-driven model introduced in [13]. The actuator model is a neural network that outputs a joint torque conditioned on the position command and a history of joint position errors and velocities. It is a two- layered feed-forward neural network with 32 softsign units and trained with real data. 3) Additive Noise to the Observation: T o replicate the noisy observation from the real robot, we added up to 0.2 m / s noise to the linear velocity , 0.25 rad / s to the angular velocity , 0.5 rad / s to the joint velocities, and 0.05 rad to the joint positions during training in simulation. Additionally , in order to replicate the behavior of TSIF , we increase the magnitude of the noise for the linear velocity and position of the base when all four legs lose contact. Fig. 6: ANYmal recov ering from arbitrary conﬁgurations. Fig. 7: ANYmal reacting to a kick. I. Implementation Details W e used the Robotics Artiﬁcial Intelligence [11] framework together with the rigid-body simulator [12] , which are both written in C++. T emporal attributes of an RL task such as the control fre- quency , the time limit, and the discount factor are regarded as hyper-parameters. The policies for self-righting, standing up, and locomotion run at 20 Hz, 100 Hz, and 200 Hz respectively and the behavior selector runs at 50 Hz. The height estimator runs synchronously with TSIF , which runs at 400 Hz. When ANYmal switches to a dif ferent beha vior, the output of the chosen behavior is computed immediately . As a result, the time step of the self-righting is often not preserved because it runs at the lowest frequency . As the policies require a history of joint measurements as input, we implemented a history buf fer that saves states for 0.05 seconds in 400 Hz. ANYmal waits in freeze mode to ﬁll the history buf fer before running the controller . I I I . R E S U LT A N D D I S C U S S I O N The experimental results are provided in this section. The behaviors for standing-up and locomotion are not assessed in detail in this paper . W e refer the readers to [13] for a comprehensiv e analysis of the locomotion behavior . A. Robustness W e conducted two kinds of experiments to verify the robustness of the policies. Firstly , we started the proposed controller while ANYmal lies on the ground at an arbitrary conﬁguration. W e tested 50 different conﬁgurations and the self-righting policy could recov er up-right base pose within 5 seconds in all experiments. ANYmal can recov er ev en when its legs are stuck beneath its base (Fig. 6.A). It ﬂips to its side to free the legs and then quickly regains up-right base pose. It recov ers when its base is almost upside-down (Fig. 6.B). Secondly , we applied external disturbances while ANYmal is walking or standing. An example is provided in Fig. 7. It smoothly switches to self-righting behavior or standing up behavior without any noticeable delay . Both experiments are conducted 50 times and ANYmal fell more than 100 times in total. The recovery maneuver failed only three times. Consequently , The success rate was higher than 97 %. Self-righting failed when a joint position ≥ 2 π . It fails because a self-righting policy hardly experiences such a high position during the training. This is very unlikely to happen when ANYmal falls while walking and can be easily ﬁxed by applying modulo operation to the joint positions with 2 π . B. Comparison to Simulation T o qualitativ ely analyze the accuracy of the simulation, we ran the same controller with the same initial state and velocity command. As shown in Fig. 8, the switch in behaviors appears in simulation (top) and reality (bottom) almost at the same time, and the behaviors are visually identical (Fig. 8). The estimated heights for the manuever abo ve is provided in Fig. 9. Unfortunately , due to a lack of motion capture system, we could not measure the height in the experiment but hav e only data from simulation. The output of the TSIF shows a huge error when ANYmal lies on the ground (as shown by the initial value) and ﬂuctuates during the self-righting maneuver . On the other hand, the output of the neural network is stable and accurately matches the simulated data (the RMS error is less than 1 cm). When deployed without the height estimator, Fig. 8: Comparison between simulation and experiments. ANYmal trots in 1 m/s after standing up. The snapshots are taken ev ery 0.5 seconds and the color bars represents the acti ve behavior . The numbers represent the time of the switching and rounded to the ﬁrst decimal place. ( Sim ) Simulated ANYmal with the initial state obtained from the experiment. ( Real ) Deploying the recov ery controller on the real robot. W e used the same noises and randomized dynamic properties as in simulation and did not hand-tune any number to matched the motions of the simulation and the experiment. height ( m) 0.6 0.4 0.2 0 -0.2 -0.4 s i m u l a t i o n n e u r a l n e t w o r k T S I F time (s) 0 1 2 3 4 Fig. 9: Estimated base height from different sources. Self- righting policy runs in the shaded region. ( neural network & TSIF ) Output of the height estimator network and TSIF during the experiment in Fig. 8.Real. ( simulation ) Computed height from the simulation in Fig. 8.Sim. Fig. 10: ANYmal showing undesirable behavior without the height estimator . ANYmal jumps up and hits itself (red circle). The behavior selector switches to locomotion policy while ANYmal is lying on the ground. the error in the base height sometimes results in an undesirable behavior switch as shown in the Fig. 10. C. FSM behavior selector W e discuss the FSM behavior selector introduced in sec- tion II-G. The simplicity of FSM did not allow us to capture corner cases (e ven with signiﬁcant tuning in simulation and on the real system) as the one shown in Fig. 11. Besides, the transitions between the behaviors are often unsmooth. The FSM behavior selector is still robust enough to be used in the ﬁeld. Howe ver , we did not examine the success rate of it because the result can be manipulated if we experiment with the corner cases more. The performance can be improv ed by adding more states and transitions. For example, the corner case in Fig. 11 can be resolved if we check joint positions before standing up. Howe ver , the fundamental problem is that Fig. 11: A corner case of our FSM. It switches to standing up policy at a bad timing and falls. it requires a number of design iterations and experiments on the real robot. I V . C O N C L U S I O N This paper presents a hierarchically structured controller for the quadruped robot ANYmal that can autonomously recover from a fall and locomote on ﬂat terrain. The control task is decomposed into three behaviors: standing, self-righting and walking. This strategy made it easier to design well-deﬁned RL tasks and troubleshoot on the real robot. The policies for behaviors are individually trained to achiev e a distinct behavior , and a behavior selector is trained to coordinate them. Additionally , a height estimator is learned, which turned out to be a crucial part for reliable maneuvers. All the policies are trained in simulation and deployed on the real robot. The proposed controller exhibits dynamic recovery ma- neuvers inv olving multiple ground contacts and the resulting motions are consistent with the simulated ones. ANYmal can recov er from multiple random fall conﬁgurations within 5 seconds and switches seamlessly between three behaviors in response to disturbances. The robustness of our recovery controller is ev aluated by testing the controller more than 100 times on the physical system. The current limitation of the proposed method is that it has only been trained and tested on ﬂat ground and it will fail in case of large inclinations or rough ground. W e plan to ov ercome these limitations in future work by randomizing the en vironment in simulation and by estimating its properties. R E F E R E N C E S [1] C Dario Bellicoso, Fabian Jenelten, Christian Gehring, and Marco Hutter . Dynamic Locomotion Through Online Nonlinear Motion Optimization for Quadrupedal Robots. IEEE Robotics and Automation Letters , 3(3):2261–2268, 2018. [2] Glen Berseth, Cheng Xie, Paul Cernek, and Michiel V an de Panne. Progressiv e reinforcement learning with distillation for multi-skilled motion control. arXiv pr eprint arXiv:1802.04765 , 2018. [3] Michael Bloesch et al. State estimation for legged robots- consistent fusion of leg kinematics and IMU. Robotics , 17:17–24, 2013. [4] Michael Bloesch et al. The T wo-State Implicit Filter Recursiv e Estimation for Mobile Robots. IEEE Robotics and Automation Letters , 3(1):573–580, 2018. [5] BostonDynamics. Intoducing SpotMini, June 2016. URL https://youtu.be/tf7IEVTDjng?t=91. Accessed Jan. 18, 2019. [6] Rodney Brooks. A robust layered control system for a mobile robot. IEEE journal on r obotics and automation , 2(1):14–23, 1986. [7] Dian-sheng Chen, Jun-mao Y in, Y u Huang, Kai Zhao, and Tian-miao W ang. A hopping-righting mechanism analysis and design of the mobile robot. Journal of the Brazilian Society of Mechanical Sciences and Engineer- ing , 35(4):469–478, 2013. [8] Ke vin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. arXiv pr eprint arXiv:1710.09767 , 2017. [9] Christian Gehring et al. Practice makes perfect: An optimization-based approach to controlling agile motions for a quadruped robot. IEEE Robotics & Automation Magazine , 23(1):34–43, 2016. [10] Marco Hutter et al. Anymal-a highly mobile and dynamic quadrupedal robot. In Intelligent Robots and Systems (IR OS), 2016 IEEE/RSJ International Conference on , pages 38–44. IEEE, 2016. [11] Jemin Hwangbo, Inkyu Sa, Roland Siegwart, and Marco Hutter . Control of a quadrotor with reinforcement learn- ing. IEEE Robotics and Automation Letters , 2(4):2096– 2103, 2017. [12] Jemin Hwangbo, Joonho Lee, and Marco Hutter . Per- Contact Iteration Method for Solving Contact Dynamics. IEEE Robotics and Automation Letters , 3(2):895–902, 2018. [13] Jemin Hwangbo, Joonho Lee, Alexe y Dosovitskiy , Dario Bellicoso, V assilios Tsounis, Vladlen K oltun, and Marco Hutter . Learning Agile and Dynamic Motor Skills for Legged Robots. Science Robotics , 4(26):eaau5872, 2019. [14] Libin Liu and Jessica Hodgins. Learning to schedule control fragments for physics-based characters using deep Q-learning. ACM T ransactions on Graphics (TOG) , 36(3):29, 2017. [15] Pattie Maes and Rodney A Brooks. Learning to Coor- dinate Beha viors. In AAAI , volume 90, pages 796–802, 1990. [16] Josh Merel et al. Hierarchical visuomotor control of humanoids. arXiv preprint , 2018. [17] Igor Mordatch, Emanuel T odorov , and Zoran Popovi ´ c. Discov ery of complex behaviors through contact- in variant optimization. ACM T ransactions on Graphics (TOG) , 31(4):43, 2012. [18] Xue Bin Peng and Michiel van de Panne. Learning locomotion skills using deeprl: does the choice of ac- tion space matter? In Proceedings of the ACM SIG- GRAPH/Eur ogr aphics Symposium on Computer Anima- tion , page 12. A CM, 2017. [19] Xue Bin Peng, Glen Berseth, and Michiel V an de Panne. T errain-adaptiv e locomotion skills using deep reinforce- ment learning. ACM T ransactions on Graphics (TOG) , 35(4):81, 2016. [20] Gill A Pratt and Matthew M Williamson. Series elastic actuators. In Intelligent Robots and Systems 95. ’Human Robot Inter action and Cooper ative Robots’, Proceedings. 1995 IEEE/RSJ International Conference on , volume 1, pages 399–406. IEEE, 1995. [21] Uluc Saranli, Alfred A Rizzi, and Daniel E Koditschek. Model-based dynamic self-righting maneuvers for a hexapedal robot. The International J ournal of Robotics Resear ch , 23(9):903–918, 2004. [22] John Schulman, Serge y Le vine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region polic y optimiza- tion. In International Confer ence on Machine Learning , pages 1889–1897, 2015. [23] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized adv antage estimation. arXiv pr eprint arXiv:1506.02438 , 2015. [24] Claudio Semini et al. DESIGN O VER VIEW OF THE HYDRA ULIC QU ADRUPED R OBO TS. In The F our - teenth Scandinavian International Confer ence on Fluid P ower , 2015. [25] J ¨ org St ¨ uckler , Johannes Schwenk, and Sv en Behnke. Get- ting Back on T wo Feet: Reliable Standing-up Routines for a Humanoid Robot. In IAS , pages 676–685, 2006. [26] Jie T an, Tingnan Zhang, Erwin Coumans, Atil Iscen, Y unfei Bai, Danijar Hafner, Steven Bohez, and V in- cent V anhoucke. Sim-to-Real: Learning Agile Lo- comotion For Quadruped Robots. arXiv pr eprint arXiv:1804.10332 , 2018.

Robust Recovery Controller for a Quadrupedal Robot using Deep Reinforcement Learning

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment