Proximal Distilled Evolutionary Reinforcement Learning
Authors: Cristian Bodnar, Ben Day, Pietro Liò
Department of Computer Science & Technology, University of Cambridge, Cambridge, United Kingdom. cb2015@cam.ac.uk

Abstract

Reinforcement Learning (RL) has achieved impressive performance in many complex environments due to the integration with Deep Neural Networks (DNNs). At the same time, Genetic Algorithms (GAs), often seen as a competing approach to RL, had limited success in scaling up to the DNNs required to solve challenging tasks. Contrary to this dichotomic view, in the physical world, evolution and learning are complementary processes that continuously interact. The recently proposed Evolutionary Reinforcement Learning (ERL) framework has demonstrated mutual benefits to performance when combining the two methods. However, ERL has not fully addressed the scalability problem of GAs. In this paper, we show that this problem is rooted in an unfortunate combination of a simple genetic encoding for DNNs and the use of traditional biologically-inspired variation operators. When applied to these encodings, the standard operators are destructive and cause catastrophic forgetting of the traits the networks acquired. We propose a novel algorithm called Proximal Distilled Evolutionary Reinforcement Learning (PDERL) that is characterised by a hierarchical integration between evolution and learning. The main innovation of PDERL is the use of learning-based variation operators that compensate for the simplicity of the genetic representation. Unlike traditional operators, our proposals meet the functional requirements of variation operators when applied to directly encoded DNNs. We evaluate PDERL in five robot locomotion settings from the OpenAI gym. Our method outperforms ERL, as well as two state-of-the-art RL algorithms, PPO and TD3, in all tested environments.

Introduction

The field of Reinforcement Learning (RL) has recently achieved great success by producing artificial agents that can master the game of Go (Silver et al. 2017), play Atari games (Mnih et al. 2015) or control robots to perform complex tasks such as grasping objects (Andrychowicz et al. 2017) or running (Lillicrap et al. 2015). Most of this success is due to the combination of RL with Deep Learning (Goodfellow, Bengio, and Courville 2016), generically called Deep Reinforcement Learning (DRL).

At the same time, Genetic Algorithms (GAs), usually seen as a competing approach to RL, have achieved limited success in evolving DNN-based control policies for complex environments. Though previous work has shown GAs to be competitive with other DRL algorithms in discrete environments (Such et al. 2017), they are still significantly less sample efficient than a simple method like Deep Q-Learning (Mnih et al. 2015). Moreover, in complex robotic environments with large continuous state and action spaces, where environment interactions are costly, their sample inefficiency is even more acute (Khadka and Tumer 2018; Such et al. 2017).

However, in the physical world, evolution and learning interact in subtle ways. Perhaps the most famous product of this interaction is the Baldwin effect (Simpson 1953), which explains how the genotype can assimilate learnt behaviours over the course of many generations.
A more spectacular by-product of this interplay, which has received more attention in recent years, is the epigenetic inheritance of learnt traits (Dias and Ressler 2013).

Despite these exciting intricacies of learning and evolution, the two have almost always received separate treatment in the field of AI. Though they have been analysed together in computational simulations multiple times (Hinton and Nowlan 1987; Ackley and Littman 1992; Suzuki and Arita 2004), they have rarely been combined to produce novel algorithms with direct applicability. This is surprising given that nature has always been a great source of inspiration for AI (Floreano and Mattiussi 2008).

For the first time, Khadka and Tumer (2018) have recently demonstrated on robot locomotion tasks the practical benefits of merging the two approaches in their Evolutionary Reinforcement Learning (ERL) framework. ERL uses an RL-based agent alongside a genetically evolved population, with a transfer of information between the two. However, ERL has not fully addressed the scalability problem of GAs. While the gradient information from the RL agent can significantly speed up the evolutionary search, the population of ERL is evolved using traditional variation operators. Paired with directly encoded DNNs, which is the most common genetic representation in use, we show that these operators are destructive.

This paper brings the following contributions:

• Demonstrates the negative side effects in RL of the traditional genetic operators when applied to directly encoded DNNs.
• Proposes two novel genetic operators based on backpropagation. These operators do not cause catastrophic forgetting in combination with simple DNN representations.
• Integrates these operators as part of a novel framework called Proximal Distilled Evolutionary Reinforcement Learning (PDERL) that uses a hierarchy of interactions between evolution and learning.
• Shows that PDERL outperforms ERL, PPO (Schulman et al. 2017) and TD3 (Fujimoto, van Hoof, and Meger 2018) in five robot locomotion environments from the OpenAI gym (Brockman et al. 2016).

Background

This section introduces the Evolutionary Reinforcement Learning (ERL) algorithm and the genetic operators it uses.

Evolutionary Reinforcement Learning

The proposed methods build upon the ERL framework introduced by Khadka and Tumer (2018). In this framework, a population of policies is evolved using GAs. The fitness of the policies in the population is based on the cumulative total reward obtained over a given number of evaluation rounds. Alongside the population, an actor-critic agent based on DDPG (Lillicrap et al. 2015) is trained via RL. The RL agent and the population synchronise periodically to establish a bidirectional transfer of information.

The first type of synchronisation in ERL, from the RL agent to the genetic population, is meant to speed up the evolutionary search process. This synchronisation step clones the actor of the RL agent into the population every few generations to transfer the policy gradient information. The synchronisation period, ω, is a hyperparameter that controls the rate of information flowing from the RL agent to the population.

The second type of synchronisation consists of a reverse information flow coming from the population to the RL agent. The actors in the population collect experiences from which the RL agent can learn off-policy. All the transitions coming from rollouts in the population are added to the replay buffer of the DDPG agent. The population experiences can be seen as being generated by evolution-guided parameter space noise (Plappert et al. 2018).
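To make the structure of this loop concrete, the following is a schematic Python sketch of the ERL cycle described above. It is not the official implementation: `evaluate`, `rl_update`, the flat parameter vectors and all constants are illustrative stand-ins for the DDPG learner, the rollout machinery and ERL's actual selection scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
POP_SIZE, N_PARAMS, OMEGA, N_ELITES = 10, 64, 10, 2

def evaluate(theta):
    """Stand-in fitness: in ERL this is the cumulative reward of one or more rollouts."""
    return -np.sum((theta - 1.0) ** 2)  # toy objective

def rl_update(theta, replay_buffer):
    """Stand-in for the DDPG agent learning off-policy from the shared replay buffer."""
    return theta + 0.05 * (1.0 - theta)  # toy gradient step

population = [rng.standard_normal(N_PARAMS) for _ in range(POP_SIZE)]
rl_actor = rng.standard_normal(N_PARAMS)
replay_buffer = []  # shared buffer filled by every rollout, genetic or RL

for generation in range(1, 101):
    # 1. Evaluate the population; all rollout transitions also feed the replay buffer.
    fitness = np.array([evaluate(theta) for theta in population])
    replay_buffer.append(f"transitions from generation {generation}")

    # 2. Selection and variation: keep elites, fill the rest with perturbed parents.
    order = np.argsort(fitness)[::-1]
    elites = [population[i].copy() for i in order[:N_ELITES]]
    parents = order[: POP_SIZE // 2]
    children = [population[rng.choice(parents)] + 0.1 * rng.standard_normal(N_PARAMS)
                for _ in range(POP_SIZE - N_ELITES)]
    population = elites + children

    # 3. Population -> RL agent: learn off-policy from the shared buffer.
    rl_actor = rl_update(rl_actor, replay_buffer)

    # 4. RL agent -> population: every omega generations, clone the actor into the
    #    population, replacing the weakest individual.
    if generation % OMEGA == 0:
        population[-1] = rl_actor.copy()
```

In the actual framework, the variation step uses the n-point crossover and Gaussian mutation described next, which is precisely where the problems analysed in this paper arise.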
Genetic encoding and variation operators

The policies in the ERL population are represented by neural networks with a direct encoding. In this common genetic representation, the weights of a network are recorded as a list of real numbers. The ordering of the list is arbitrary but consistent across the population. As we will show, applying the usual biologically inspired variation operators to this representation can produce destructive behaviour modifications.

In the physical world, mutations and crossovers rarely have catastrophic phenotypic effects because the phenotype is protected by the complex layers of physical, biological and chemical processes that translate the DNA. In a direct genetic encoding, these protective layers of translation are absent because the representation is so simple and immediate. As such, the biologically inspired variation operators commonly found in the literature, including in ERL, do not have the desired functionality when paired with a direct encoding. Ideally, crossovers should combine the best behaviours of the two parents. At the same time, mutations should produce only a slight variation in the behaviour of the parent, ensuring that the offspring inherits it to a significant extent. However, because DNNs are sensitive to small modifications of the weights (the genes in a direct encoding), these operators typically cause catastrophic forgetting of the parental behaviours.

ERL evolves the population using two variation operators commonly used for list-based representations: n-point crossovers and Gaussian mutations (Eiben and Smith 2015, p. 49-79). n-point crossovers produce an offspring policy by randomly exchanging segments of the lists of weights belonging to the two parents, where n endpoints determine the segments. ERL uses a version of the operator where the unit segments are rows of the dense layer matrices, ensuring that an offspring receives nodes as they appear in the parents rather than splicing the weights of nodes together. The resulting child policy matrices contain a mix of rows (nodes) coming from the matrices (layers) of both parents. This is intended to produce functional consistency across generations. However, the lack of an inherent node ordering in DNNs means that hidden representations need not be consistent over the population, and as such the input to a node may not be consistent from parent to offspring, creating the possibility for destructive interference. This can cause the child policy to diverge from that of the parents, as we will demonstrate. Similarly, the damaging effects of adding Gaussian noise to the parameters of a DNN have been discussed at great length by Lehman et al. (2018). A common approach to containing these issues, employed by ERL, is to mutate only a fraction of the weights. Nevertheless, these mutations are still destructive. Furthermore, evolving only a small number of weights can slow down the evolutionary search for better policies.
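As a concrete illustration of these two traditional operators, here is a minimal NumPy sketch of a row-level n-point crossover and a partial Gaussian mutation on a directly encoded dense layer. It is a simplified rendering of the operators described above, not ERL's actual code; the function names and constants are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def n_point_row_crossover(layer_a, layer_b, n=2):
    """n-point crossover with whole rows (nodes) of a dense weight matrix as the
    unit segments: the row indices are cut into n+1 segments that are taken
    alternately from each parent, so every node arrives intact from one parent."""
    rows = layer_a.shape[0]
    cuts = np.sort(rng.choice(np.arange(1, rows), size=n, replace=False))
    child, from_a, start = layer_a.copy(), True, 0
    for end in list(cuts) + [rows]:
        if not from_a:
            child[start:end] = layer_b[start:end]
        from_a, start = not from_a, end
    return child

def gaussian_mutation(weights, frac=0.1, sigma=0.1):
    """Add Gaussian noise to a randomly chosen fraction of the weights."""
    child = weights.copy()
    mask = rng.random(child.shape) < frac
    child[mask] += sigma * rng.standard_normal(child.shape)[mask]
    return child

# Two directly encoded parents for a single 6x4 dense layer.
parent_a = rng.standard_normal((6, 4))
parent_b = rng.standard_normal((6, 4))
offspring = gaussian_mutation(n_point_row_crossover(parent_a, parent_b))
```

Even in this tiny example, the rows received from one parent feed into downstream weights tuned for the other parent's hidden representation, which is exactly the destructive interference discussed above.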
Method

This section introduces our proposed learning-based genetic operators and describes how they are integrated with ERL.

The genetic memory

A significant problem of the ERL population is that it does not directly exploit the individual experiences collected by the actors in the population. The population only benefits indirectly, through the RL agent, which uses them to learn and improve. The individual experiences of the agents are an essential aspect of the new operators we introduce in the next sections and, therefore, the agents also need a place to store them.

The first modification we make to ERL is to equip the members of the population, and the RL agent, with a small personal replay buffer containing their most recent experiences, at the expense of a marginally increased memory footprint. Depending on its capacity κ, the buffer can also include experiences of their ancestors. Because the transitions in the buffer can span multiple generations, we refer to this personal replay buffer of each agent as the genetic memory. When the policies interact with the environment, they store their experiences not only in DDPG's replay buffer, as in ERL, but also in their genetic memory.

The ancestral experiences in the genetic memory are introduced through the variation operators. A mutated child policy inherits the genetic memory of the parent entirely. During crossover, the buffer is only partially inherited: the crossover offspring fills its buffer with the most recent half of the transitions coming from each of the two parents' genetic memories.
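The following sketch shows one plausible way to organise such a genetic memory and its two inheritance rules. The class and method names are ours; the paper's repository may structure this differently.

```python
from collections import deque

class GeneticMemory:
    """A small per-agent replay buffer ('genetic memory') holding the agent's most
    recent transitions; capacity kappa bounds how far back, possibly into the
    lifetimes of its ancestors, the stored experience reaches."""

    def __init__(self, kappa=8000):
        self.kappa = kappa
        self.buffer = deque(maxlen=kappa)

    def add(self, transition):
        self.buffer.append(transition)  # oldest transitions fall out first

    def inherit_for_mutation(self):
        """Mutation: the child inherits the parent's genetic memory entirely."""
        child = GeneticMemory(self.kappa)
        child.buffer.extend(self.buffer)
        return child

    @staticmethod
    def inherit_for_crossover(mem_x, mem_y, kappa=8000):
        """Crossover: the child's memory is filled half-and-half with the most
        recent transitions of each parent."""
        child = GeneticMemory(kappa)
        half = kappa // 2
        child.buffer.extend(list(mem_x.buffer)[-half:])
        child.buffer.extend(list(mem_y.buffer)[-half:])
        return child
```

As in ERL, every transition is still written to the DDPG agent's central replay buffer; the genetic memory is an additional, per-agent store used by the operators below.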
Q-filtered distillation crossovers

In this section, we propose a Q-filtered behaviour distillation crossover that selectively merges the behaviour of two parent policies into a child policy. Unlike n-point crossovers, this operator acts in the phenotype space, not in parameter space.

Figure 1: Q-filtered distillation crossover compared to n-point crossover. The contour plots represent the state visitation distributions of the agents (i.e. the average amount of time the agent spends in each state) for the first two state dimensions of an environment. The plot is generated by fitting a Gaussian kernel density model over the states collected over many episodes. These distributions show how the Q-filtered distillation crossover selectively merges the behaviours of the two parents by inheriting from the shapes of both parent distributions. In contrast, the modes of the state visitation distribution obtained by the traditional crossover are mostly disjoint from the modes of the parent distributions.

For a pair of parent actors from the population, the crossover operation works as follows. A new agent with an initially empty genetic memory is created. The genetic memory is filled in equal proportions with the latest transitions coming from the genetic memories of the two parents. The child agent is then trained via a form of Imitation Learning (Osa et al. 2018) to selectively imitate the actions the parents would take in the states from the newly created genetic memory. Equivalently, this process can be seen as a more general type of policy distillation (Rusu et al. 2016), since it aims to "distil" the behaviour of the parents into the child policy.

Algorithm 1: Distillation Crossover
Input: Parent policies μ_x, μ_y with memories R_x, R_y
Output: Child policy μ_z with empty memory R_z
1. Add the latest κ/2 transitions from R_x to R_z
2. Add the latest κ/2 transitions from R_y to R_z
3. Shuffle the transitions in R_z
4. Initialise the weights θ_z of μ_z with the weights of a randomly chosen parent
5. for e ← 1 to epochs do
6.     for i ← 1 to κ/N_C do
7.         Sample a state batch of size N_C from R_z
8.         Optimise θ_z to minimise L^(C) using SGD
9.     end
10. end

Unlike the conventional policy distillation proposed by Rusu et al. (2016), two parent networks are involved, not one. This introduces the problem of divergent behaviours: the two parent policies can take radically different actions in identical or similar states, and the child policy must decide whom to imitate in each state. The key observation of the proposed method is that the critic of the RL agent already knows the values of certain states and actions. Therefore, it can be used to select which actions should be followed in a principled and globally consistent manner. We propose the following Q-filtered behaviour cloning loss to train the child policy:

L^{(C)} = \sum_{i}^{N_C} \|\mu_z(s_i) - \mu_x(s_i)\|^2 \, \mathbb{I}_{Q(s_i, \mu_x(s_i)) > Q(s_i, \mu_y(s_i))}
        + \sum_{j}^{N_C} \|\mu_z(s_j) - \mu_y(s_j)\|^2 \, \mathbb{I}_{Q(s_j, \mu_y(s_j)) > Q(s_j, \mu_x(s_j))}
        + \frac{1}{N_C} \sum_{k}^{N_C} \|\mu_z(s_k)\|^2,

where the sums are taken over a batch of size N_C sampled from the genetic memories of the two parent agents. Here μ_x and μ_y are the deterministic parent policies, while μ_z is the deterministic policy of the child agent. The indicator function 𝕀 uses the Q-network of the RL agent to decide which parent takes the better action in each state, and the child policy is trained to imitate those actions by minimising the first two terms. The final term is an L2 regularisation that prevents the outputs from saturating the hyperbolic tangent activation.
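A PyTorch sketch of this loss is given below. It assumes deterministic policies mapping state batches to tanh-bounded actions and a critic callable on (state, action) pairs; these signatures, and the tie-breaking towards parent y, are our assumptions rather than details taken from the official code.

```python
import torch

def q_filtered_bc_loss(mu_z, mu_x, mu_y, critic, states):
    """Sketch of L^(C): in every state the child mu_z imitates whichever parent
    action the critic scores higher, plus an L2 term on the child's outputs."""
    with torch.no_grad():
        a_x, a_y = mu_x(states), mu_y(states)
        # Indicator I[Q(s, mu_x(s)) > Q(s, mu_y(s))]; ties default to parent y.
        x_wins = (critic(states, a_x) > critic(states, a_y)).float().squeeze(-1)

    a_z = mu_z(states)
    err_x = ((a_z - a_x) ** 2).sum(dim=-1)        # ||mu_z(s) - mu_x(s)||^2
    err_y = ((a_z - a_y) ** 2).sum(dim=-1)        # ||mu_z(s) - mu_y(s)||^2
    imitation = (x_wins * err_x + (1.0 - x_wins) * err_y).sum()
    regulariser = (a_z ** 2).sum(dim=-1).mean()   # (1/N_C) * sum ||mu_z(s)||^2
    return imitation + regulariser

# Toy usage: small tanh-bounded policies and a stand-in critic.
obs_dim, act_dim, n_c = 8, 2, 128
def make_policy():
    return torch.nn.Sequential(torch.nn.Linear(obs_dim, 32), torch.nn.Tanh(),
                               torch.nn.Linear(32, act_dim), torch.nn.Tanh())

mu_x, mu_y, mu_z = make_policy(), make_policy(), make_policy()
toy_critic = lambda s, a: s.sum(-1, keepdim=True) + a.sum(-1, keepdim=True)
loss = q_filtered_bc_loss(mu_z, mu_x, mu_y, toy_critic, torch.randn(n_c, obs_dim))
loss.backward()
```

Training then follows Algorithm 1: for a few epochs, batches of size N_C are sampled from the child's genetic memory and θ_z is updated on this loss (the reported experiments use the Adam optimiser).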
Figure 1 contains a diagram comparing this new crossover with the ERL n-point crossover. We refer to ERL with the distillation crossover as Distilled Evolutionary Reinforcement Learning (DERL). We note that while this operator is indeed more computationally intensive, a small number of training epochs over the relatively small genetic memory suffices. Additionally, we expect a distributed implementation of our method to compensate for the incurred wall-clock time penalties. We leave this endeavour for future work.

Parent selection mechanism

An interesting question is how parents should be selected for this crossover. A general approach is to define a mating score function m : Π × Π → ℝ that takes two policies as input and produces a score; pairs with higher scores are more likely to be selected. Similarly to Gangwani and Peng (2018), we distinguish two ways of computing the score: greedy and distance-based.

Greedy. The score m(μ_x, μ_y) = f(μ_x) + f(μ_y) can be greedily determined by the sum of the fitnesses of the two parents. This type of selection generally increases the stability of the population and makes it unlikely that good individuals are not selected.

Distance-based. The score m(μ_x, μ_y) = d_Π(μ_x, μ_y) can be computed using a distance metric in the space of all possible policies. "Different" policies are more likely to be selected for mating, where the exact notion of "different" depends on the precise form of the distance metric d_Π. Here, we propose a distance metric in the behaviour space of the two policies:

d_\Pi(\mu_x, \mu_y) = \mathbb{E}_{s \sim \rho_x}\big[\|\mu_x(s) - \mu_y(s)\|^2\big] + \mathbb{E}_{s \sim \rho_y}\big[\|\mu_x(s) - \mu_y(s)\|^2\big],

where ρ_x and ρ_y are the state visitation distributions of the two agents. This distance metric measures the expected difference in the actions taken by the two parent policies over states coming from a mixture of their state visitation distributions. In practice, this expectation is stochastically approximated by sampling a large batch from the genetic memories of the two agents. This strategy biases the introduction of novel behaviours into the population at the expense of stability, since the probability that fit individuals are not selected increases.

Proximal mutations

As shown by Lehman et al. (2018), Gaussian mutations can have catastrophic consequences on the behaviour of an agent. In fact, the stability of the policy update is a problem even for gradient descent approaches, where an inappropriate step size can have unpredictable consequences in the performance landscape. Methods like PPO (Schulman et al. 2017) are remarkably stable because they minimise an auxiliary KL divergence term that keeps the behaviour of the new policy close to the old one. Based on these motivations, we integrate the safe mutation operator SM-G-SUM proposed by Lehman et al. (2018) with the genetic memory of the population.

Figure 2: Proximal mutations compared to Gaussian mutations. The blue contour plot shows the state visitation distribution of the parent policy. The red contour plots show the difference between the distribution of the children and that of the parent. The difference plots are generated by taking the normalised difference between the parent and child probability densities. The behaviour of the policy obtained by proximal mutation is a small perturbative adjustment to the parent behaviour. In contrast, the traditional mutation produces a divergent behaviour, even though it modifies only a fraction of the weights (shown in red).

Algorithm 2: Proximal Mutation
Input: Parent policy μ_x with memory R_x
Output: Child policy μ_y with memory R_y
1. Initialise R_y ← R_x and μ_y ← μ_x
2. Sample a state batch of size N_M from R_x
3. Compute the sensitivity s over the batch samples s_i as in Equation 1
4. Mutate θ_y ← θ_y + x ⊘ s, with x ∼ N(0, σI)

The SM-G-SUM operator uses the gradient of each dimension of the output action over a batch of N_M transitions from the genetic memory to compute the sensitivity s of the actions to weight perturbations:

s = \sqrt{\sum_{k}^{|\mathcal{A}|} \Big(\sum_{i}^{N_M} \nabla_\theta\, \mu_\theta(s_i)_k\Big)^2}.    (1)

The sensitivity is then used to rescale the Gaussian perturbation of each weight, θ ← θ + x ⊘ s with x ∼ N(0, σI), where σ is a mutation magnitude hyperparameter and ⊘ denotes element-wise division. The resulting operator produces child policies that are in the proximity of their parent's behaviour. Therefore, we refer to this operator as a proximal mutation (Figure 2), and to the version of ERL using it as Proximal Evolutionary Reinforcement Learning (PERL).

While proximal mutations do not explicitly use learning, they rely on the capacity of the policies to learn, or in other words, to be differentiable. Without this property, these behaviour sensitivities to the parameter perturbations could not be computed analytically.
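A PyTorch sketch of this mutation is shown below. It computes the per-parameter sensitivity of Equation 1 with one backward pass per action dimension and divides the Gaussian perturbation by it; the function name, the epsilon guard against zero sensitivities and the loop-based gradient computation are our simplifications, not necessarily how SM-G-SUM is implemented in the official code.

```python
import torch

def proximal_mutation(policy, states, sigma=0.1, eps=1e-8):
    """Perturb `policy` in place with Gaussian noise rescaled by the per-parameter
    sensitivity of Equation 1, so weights that the behaviour depends on strongly
    receive smaller perturbations. `states` is a batch from the genetic memory."""
    params = [p for p in policy.parameters() if p.requires_grad]
    actions = policy(states)                          # shape (N_M, |A|)
    sq_sums = [torch.zeros_like(p) for p in params]

    for k in range(actions.shape[-1]):
        # sum_i d mu_theta(s_i)_k / d theta, one action dimension at a time.
        grads = torch.autograd.grad(actions[:, k].sum(), params, retain_graph=True)
        for acc, g in zip(sq_sums, grads):
            acc += g ** 2

    with torch.no_grad():
        for p, acc in zip(params, sq_sums):
            sensitivity = acc.sqrt() + eps            # Equation 1, per parameter
            p += sigma * torch.randn_like(p) / sensitivity

# Toy usage: mutate a small deterministic policy on a batch of N_M = 256 states.
policy = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, 2), torch.nn.Tanh())
proximal_mutation(policy, torch.randn(256, 8), sigma=0.1)
```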
Integration

The full benefits of the newly introduced operators are realised when they are used together. The Q-filtered distillation crossover increases the stability of the population and drives the agents towards regions of the state-action space with higher Q-values. The proximal mutations improve the exploration of the population and its ability to discover better policies. As will be seen in the evaluation section, the operators complement each other. We refer to their dual integration with ERL as Proximal Distilled Evolutionary Reinforcement Learning (PDERL).

Figure 3: A high-level view of PDERL. The new components and interactions are drawn in green and red. In PDERL, there is a higher flow of information from the individual experiences and learning (right) to the population (left) than in ERL.

Ultimately, PDERL contains a hierarchy of interactions between learning and evolution. A high-level interaction is realised through the information exchange between the population and the RL agent. The newly introduced operators add a lower layer of interaction, at the level of the genetic operators. A diagram of PDERL is given in Figure 3.

Evaluation

This section evaluates the performance of the proposed methods and takes a closer look at the behaviour of the proposed operators.

Experimental setup

The architecture of the policy and critic networks is identical to ERL. The hyperparameters shared with ERL have the same values as those reported by Khadka and Tumer (2018), with a few exceptions. For Walker2d, the synchronisation rate ω was decreased from 10 to 1 to allow a higher information flow from the RL agent to the population. In the same environment, the number of evaluations ξ was increased from 3 to 5 because of the high total reward variance across episodes. Finally, the fraction of elites in the Hopper and Ant environments was reduced from 0.3 to 0.2. Generally, a higher number of elites increases the stability of the population, but the stability gained through the new operators makes higher values of this parameter unnecessary.

For the PDERL-specific hyperparameters, we performed little tuning due to the limited computational resources. In what follows, we report the chosen values alongside the values that were considered. The crossover and mutation batch sizes are N_C = 128 and N_M = 256 (searched over 64, 128, 256). The genetic memory has a capacity of κ = 8k transitions (2k, 4k, 8k, 10k). The learning rate for the distillation crossover is 10^-3 (10^-2, 10^-3, 10^-4, 10^-5), and the child policy is trained for 12 epochs (4, 8, 12, 16). All the training procedures use the Adam optimiser. Greedy parent selection is used unless otherwise indicated. As in ERL, the population is formed of k = 10 actors.

When reporting the results, we use the official implementations for ERL (https://github.com/ShawK91/erl_paper_nips18) and TD3 (https://github.com/sfujim/TD3), and the OpenAI Baselines (https://github.com/openai/baselines/) implementation for PPO. Our code is publicly available at https://github.com/crisbodnar/pderl.

Performance evaluation

This section evaluates the mean reward obtained by the newly proposed methods as a function of the number of environment frames experienced. The results are reported across five random seeds.
Figure 4 shows the mean reward and the standard deviation obtained by all algorithms on five MuJoCo (Todorov, Erez, and Tassa 2012) environments. While PERL and DERL bring improvements across multiple environments, they do not perform well across all of them. PERL is effective in stable environments like HalfCheetah and Hopper, where the total reward has low variance over multiple rollouts. At the same time, DERL is more useful in unstable environments like Walker2d and Ant, since it drives the population towards regions with higher Q-values. In contrast, PDERL performs consistently well across all the settings, demonstrating that the newly introduced operators are complementary. PDERL significantly outperforms ERL and PPO across all environments and, despite being generally less sample efficient than TD3, it catches up eventually. Ultimately, PDERL significantly outperforms TD3 on Swimmer, HalfCheetah and Ant, and marginally on Hopper and Walker2d. Table 1 reports the final reward statistics for all the tested models and environments. Side-by-side videos of ERL and PDERL running on simulated robots can be found at https://youtu.be/7OGDom1y2YM. The following subsections take a closer look at the newly introduced operators and offer a justification for the improvements achieved by PDERL.

Figure 4: The mean reward obtained on Swimmer (a), HalfCheetah (b), Hopper (c), Walker2d (d) and Ant (e). PDERL outperforms ERL, PPO and TD3 on all the environments.

Table 1: Final performance in all environments. The result with the highest mean is shown in bold. PERL marginally outperforms PDERL in two environments, but PDERL consistently performs well across all environments.

Environment   Metric   TD3     PPO    ERL     PERL    DERL    PDERL
Swimmer       Mean     53      113    334     327     333     337
              Std.     26      3      20      26      6       12
              Median   51      114    346     354     338     348
HalfCheetah   Mean     11534   1810   10963   13668   11362   13522
              Std.     713     28     225     236     358     287
              Median   11334   1810   11025   13625   11609   13553
Hopper        Mean     3231    2348   2049    3497    2869    3397
              Std.     213     342    841     63      920     202
              Median   3282    2484   1807    3501    3446    3400
Walker2D      Mean     4925    3816   1666    3364    4050    5184
              Std.     476     413    737     818     1170    477
              Median   5190    3636   1384    3804    4491    5333
Ant           Mean     6212    3151   4330    4528    4911    6845
              Std.     216     686    1806    2003    1920    407
              Median   6121    3337   5164    3331    5693    6948

Crossover evaluation

A good indicator of the quality of a crossover operator is the fitness of the offspring compared to that of the parents. Figure 5 plots this metric for ten randomly chosen pairs of parents in the Ant environment. Each group of bars gives the fitness of the two parents and of the policies obtained by the two types of crossover. All these values are normalised by the fitness of the first parent.
Figure 5: Normalised crossover performance on the Ant environment. The distillation crossover achieves higher fitness than the n-point crossover. Fitness is relative to Parent 1 in each group.

The performance of the child obtained via an n-point crossover regularly falls below 40% of the fitness of the best parent. At the same time, the fitness of the policies obtained by distillation is generally at least as good as that of the parents.

The state visitation distributions of the parents and children offer a clearer picture of the two operators. Figure 6 shows these distributions for a sample crossover in the Ant environment. The n-point crossover produces a behaviour that diverges from that of the parents. In contrast, the Q-filtered distillation crossover generates a policy whose behaviour contains the best traits of the parent behaviours. The new operator implicitly drives each new generation of the population towards regions with higher Q-values.

Figure 6: State visitation distributions for the distillation crossover and the n-point crossover (parent fitnesses 5489 and 5682; distillation crossover child 5934; n-point crossover child 1239). Unlike the n-point crossover, the distillation crossover produces policies that selectively merge the behaviour of the parents.

Mutation evaluation

Figure 7: Normalised mutation performance on the Ant environment. The proximal mutations obtain significantly higher fitness than the Gaussian mutations. Fitness is relative to the parent in each group.

Figure 7 shows the fitness of the children obtained by the two types of mutation for ten randomly selected parents in the Ant environment. Most Gaussian mutations produce child policies with fitness that is either negative or close to zero. At the same time, proximal mutations create individuals that often surpass the fitness of their parents.

Figure 8: As before, the blue contours represent the state visitation distribution of the parent, whereas the red ones represent the difference. The child obtained by proximal mutation (fitness 5496, KL 0.03) inherits the behaviour of the parent to a large degree and obtains a fitness boost of 600 over the parent (fitness 4896). The behaviour obtained by Gaussian mutation (fitness -187, KL 0.53) is entirely different from that of the parent. The KL divergences between the parent and child distributions quantitatively confirm this.

As in the previous section, the analysis of the state visitation distributions of the policies reveals the destructive behaviour of the Gaussian mutations. The contours of these distributions for a sample mutation are given in Figure 8. The policy mutated by additive Gaussian noise completely diverges from the behaviour of the parent. This sudden change in behaviour causes catastrophic forgetting, and the new offspring falls in performance to a total reward of -187. In contrast, the proximal mutation generates only a subtle change in the state visitation distribution. The offspring thus obtained inherits the behaviour of the parent to a great extent and achieves a significantly higher total reward of 5496.
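The state visitation plots and KL numbers above can be reproduced, at least qualitatively, with a short diagnostic along the following lines. The paper states only that a Gaussian kernel density model is fitted over visited states, so the SciPy KDE, the restriction to the first two state dimensions and the Monte Carlo KL estimate below are our assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

def visitation_kl(parent_states, child_states, n_samples=5000):
    """Fit Gaussian KDEs to the first two state dimensions visited by parent and
    child, then Monte Carlo estimate KL(parent || child) using samples drawn
    from the parent density."""
    p = gaussian_kde(parent_states[:, :2].T)   # scipy expects (dims, n_points)
    q = gaussian_kde(child_states[:, :2].T)
    xs = p.resample(n_samples)
    return float(np.mean(p.logpdf(xs) - q.logpdf(xs)))

# Toy usage with random "states"; in practice these come from many evaluation rollouts.
rng = np.random.default_rng(0)
parent_states = rng.normal(size=(4000, 8))
child_states = rng.normal(loc=0.2, size=(4000, 8))
print(visitation_kl(parent_states, child_states))
```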
Related work

This paper is part of an emerging direction of research attempting to merge Evolutionary Algorithms and Deep Reinforcement Learning (Khadka and Tumer 2018; Pourchot and Sigaud 2019; Gangwani and Peng 2018; Khadka et al. 2019). Most closely related are the papers of Lehman et al. (2018) and Gangwani and Peng (2018). Both of these works address the destructive behaviour of classic variation operators. Lehman et al. (2018) focus exclusively on safe mutations, and one of their proposed operators is directly employed in the proximal mutations. However, their paper lacks a treatment of crossovers and of the integration with learning explored here. The methods of Gangwani and Peng (2018) are focused exclusively on safe operators for stochastic policies, while the methods proposed in this work can be applied to stochastic and deterministic policies alike. The closest aspect of their work is that they also introduce a crossover operator with the goal of merging the behaviour of two agents. Their solution reduces the problem to the traditional single-parent distillation problem, using a maximum-likelihood approach to combine the behaviours of the two parents. They also propose a mutation operator based on gradient ascent using policy gradient methods. However, this deprives their method of the benefits of derivative-free optimisation, such as robustness to local optima.

Discussion

The ERL framework demonstrates that genetic algorithms can be scaled to DNNs when combined with learning methods. In this paper, we have proposed the PDERL extension and shown that performance is further improved with a hierarchical integration of learning and evolution. While maintaining a bidirectional flow of information between the population and the RL agent, our method also uses learning within the genetic operators which, unlike traditional implementations, produce the desired functionality when applied to directly encoded DNNs. Finally, we show that PDERL outperforms ERL, PPO and TD3 in all tested environments.

Many exciting directions for future research remain, as discussed in the text. An immediate extension would be to develop a distributed version able to exploit larger and more diverse populations. Better management of the inherited genetic memories may yield efficiency gains by prioritising key experiences. Lastly, we note the potential for using learning algorithms at the level of the selection operators.

References

Ackley, D., and Littman, M. 1992. Interactions between learning and evolution. In Langton, C. G.; Taylor, C.; Farmer, C. D.; and Rasmussen, S., eds., Artificial Life II, SFI Studies in the Sciences of Complexity, volume X. Reading, MA, USA: Addison-Wesley. 487-509.

Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, P.; and Zaremba, W. 2017. Hindsight experience replay. In NIPS.

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym.
Dias, B. G., and Ressler, K. J. 2013. Parental olfactory experience influences behavior and neural structure in subsequent generations. Nature Neuroscience 17:89-96.

Eiben, A. E., and Smith, J. E. 2015. Introduction to Evolutionary Computing. Springer Publishing Company, 2nd edition.

Floreano, D., and Mattiussi, C. 2008. Bio-Inspired Artificial Intelligence: Theories, Methods, and Technologies. The MIT Press.

Fujimoto, S.; van Hoof, H.; and Meger, D. 2018. Addressing function approximation error in actor-critic methods. In ICML.

Gangwani, T., and Peng, J. 2018. Policy optimization by genetic distillation. In ICLR.

Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.

Hinton, G. E., and Nowlan, S. J. 1987. How learning can guide evolution. Complex Systems 1.

Khadka, S., and Tumer, K. 2018. Evolution-guided policy gradient in reinforcement learning. In NeurIPS.

Khadka, S.; Majumdar, S.; Nassar, T.; Dwiel, Z.; Tumer, E.; Miret, S.; Liu, Y.; and Tumer, K. 2019. Collaborative evolutionary reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, 3341-3350. PMLR.

Lehman, J.; Chen, J.; Clune, J.; and Stanley, K. O. 2018. Safe mutations for deep and recurrent neural networks through output gradients. In GECCO.

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. CoRR abs/1509.02971.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M. A.; Fidjeland, A.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518:529-533.

Osa, T.; Pajarinen, J.; Neumann, G.; Bagnell, J.; Abbeel, P.; and Peters, J. 2018. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics 7(1-2):1-179.

Plappert, M.; Houthooft, R.; Dhariwal, P.; Sidor, S.; Chen, R. Y.; Chen, X.; Asfour, T.; Abbeel, P.; and Andrychowicz, M. 2018. Parameter space noise for exploration. CoRR abs/1706.01905.

Pourchot, A., and Sigaud, O. 2019. CEM-RL: Combining evolutionary and gradient-based methods for policy search. CoRR abs/1810.01222.

Rusu, A. A.; Colmenarejo, S. G.; Gülçehre, Ç.; Desjardins, G.; Kirkpatrick, J.; Pascanu, R.; Mnih, V.; Kavukcuoglu, K.; and Hadsell, R. 2016. Policy distillation. CoRR abs/1511.06295.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR abs/1707.06347.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L. R.; Lai, M.; Bolton, A.; Chen, Y.; Lillicrap, T. P.; Hui, F.; Sifre, L.; van den Driessche, G.; Graepel, T.; and Hassabis, D. 2017. Mastering the game of Go without human knowledge. Nature 550:354-359.

Simpson, G. G. 1953. The Baldwin effect. Evolution 7(2):110-117.
Such, F. P.; Madhavan, V.; Conti, E.; Lehman, J.; Stanley, K. O.; and Clune, J. 2017. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. ArXiv abs/1712.06567.

Suzuki, R., and Arita, T. 2004. Interactions between learning and evolution: the outstanding strategy generated by the Baldwin effect. Biosystems 77(1-3):57-71.

Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026-5033.

Supplementary Material

Behaviour and mutation magnitude

Figure 9: State visitation distributions of child policies obtained by proximal mutation for increasing mutation magnitudes (0.0, 0.01, 0.05, 0.1), with KL divergences from the parent of 0.006, 0.086 and 0.156 for the non-zero magnitudes. The blue contour corresponds to the parent's state visitation distribution. The red contours show how the differences from the parent distribution increase as the mutation magnitude becomes higher. The KL divergence between the parent and child distributions, which strictly increases with the mutation magnitude, also confirms this behaviour.

Another benefit of the proximal mutations is that the size of the change in behaviour can be directly adjusted by tuning the magnitude of the mutations. Figure 9 shows how increasing the mutation magnitude gradually induces a more significant change in the behaviour of the child policy. This is not the case for the Gaussian mutations, where the behavioural change produced by a given mutation size can be unpredictably large and is non-linearly related to that size.

Parent selection mechanism

The previous experiments used the greedy parent selection mechanism for choosing the policies involved in the crossovers. This section offers a comparative view between this greedy selection and the newly proposed distance-based selection.

The mechanisms are compared in Figure 10 on Hopper and Walker2d. On a relatively unstable environment like Walker2d, the distance-based DERL and PDERL perform significantly worse than their fitness-based equivalents. However, on Hopper, which is more stable than Walker2d, the distance-based PDERL surpasses the fitness-based one, while also having very low variance. The fact that the distance-based selection performs better in the early stages of training in both environments supports the idea of using a convex combination of the two selection mechanisms.

Interactions between learning and evolution

One of the motifs of this work is the interaction between learning and evolution. In PDERL, this can be quantitatively analysed by looking at the number of times the RL agent was selected, discarded or became an elite in the population. Figure 11 shows how these numbers evolve during training and reveals that each environment has its own unique underlying interaction pattern.

Swimmer. Swimmer is an environment where the genetic population drives the progress of the agent almost entirely. The RL agent rarely becomes good enough to be selected or to become an elite.
Figure 10: A comparison between the greedy (fitness-based) parent selection mechanism and the distance-based parent selection on Hopper (a) and Walker2d (b). The distance-based selection works better in stable environments like Hopper, but worse in environments with high variance across different episodes, like Walker2d.

Table 2: Cumulative selection rates for the RL agent under ERL and PDERL.

Environment   Algorithm   Elite         Selected      Discarded
Swimmer       ERL         4.0 ± 2.8     20.3 ± 18.1   76.0 ± 20.4
              PDERL       3.9 ± 2.7     11.7 ± 4.7    84.3 ± 7.3
HalfCheetah   ERL         83.8 ± 9.3    14.3 ± 9.1    2.3 ± 2.5
              PDERL       71.6 ± 3.3    21.5 ± 3.1    6.8 ± 2.6
Hopper        ERL         28.7 ± 8.5    33.7 ± 4.1    37.7 ± 4.5
              PDERL       5.3 ± 1.4     29.0 ± 2.7    65.6 ± 3.6
Walker2D      ERL         38.5 ± 1.5    39.0 ± 1.9    22.5 ± 0.5
              PDERL       4.3 ± 1.5     20.6 ± 5.7    75.0 ± 4.8
Ant           ERL         66.7 ± 1.7    15.0 ± 1.4    18.0 ± 0.8
              PDERL       29.7 ± 2.5    26.2 ± 1.2    44.0 ± 2.9

HalfCheetah. HalfCheetah is at the other end of the spectrum from Swimmer. In this environment, the RL agent becomes an elite after over 90% of the synchronisations early in training. As the population reaches the high-reward areas, genetic evolution becomes slightly more important for discovering better policies.

Figure 11: Selection rates for the RL agent over the course of training on Swimmer (a), HalfCheetah (b), Hopper (c), Walker2d (d) and Ant (e). PDERL produces a unique and rich interaction pattern between evolution and learning in each environment.

Hopper. On Hopper, the RL agent drives the population in the first stages of training, but the population obtains superior total rewards beyond 1.5 million frames. Therefore, evolution has a greater contribution in the late stages of training.

Walker2d. In this environment, the dynamics between learning and evolution are very stable. The selection rates do not change significantly during the training process. This is surprising, given the instability of Walker2d. Overall, evolution has a much higher contribution than learning.

Ant. Unlike in the other environments, in Ant the interactions between learning and evolution are more balanced. The curves converge towards a 40% probability of being discarded, and a 60% probability of becoming an elite or being selected.

Table 2 shows the final mean selection rates and their standard deviations side by side with the ones reported by ERL. This comparison indicates how the dynamics between evolution and learning changed after adding the new crossovers and mutations. The general pattern that can be seen across all environments is that the probability that the RL agent becomes an elite decreases.
That probability mass is mainly moved towards the cases in which the RL agent is discarded. This provides further confirmation that the newly introduced variation operators improve the performance of the population: the RL agent is much less often at the same level of fitness as the population policies.

Hyperparameters

Table 3 lists the hyperparameters that were kept constant across all environments. Table 4 specifies the parameters that vary with the task.

Table 3: Hyperparameters constant across all environments.

Hyperparameter                             Value
Population size k                          10
Target weight τ                            0.001
RL actor learning rate                     5e-5
RL critic learning rate                    5e-4
Genetic actor learning rate                1e-3
Discount factor γ                          0.99
Replay buffer size                         1e6
Genetic memory size κ                      8000
RL agent batch size                        128
Genetic agent crossover batch size N_C     128
Genetic agent mutation batch size N_M      256
Distillation crossover epochs              12
Mutation probability                       0.9

Table 4: Hyperparameters that vary across environments.

Parameter            Swimmer   HalfCheetah   Hopper   Walker2D   Ant
Elite fraction ψ     0.1       0.1           0.2      0.2        0.2
Trials ξ             1         1             3        5          1
Sync. period ω       10        10            1        1          1
Mutation mag. σ      0.1       0.1           0.1      0.1        0.01