Neural Architecture Evolution in Deep Reinforcement Learning for Continuous Control


Authors: Jörg K.H. Franke, Gregor Köhler, Noor Awad, Frank Hutter

Jörg K.H. Franke*,1, Gregor Köhler*,2, Noor Awad1, Frank Hutter1,3
1 University of Freiburg, 2 German Cancer Research Center (DKFZ), 3 Bosch Center for Artificial Intelligence

Abstract

Current Deep Reinforcement Learning algorithms still heavily rely on handcrafted neural network architectures. We propose a novel approach to automatically find strong topologies for continuous control tasks while only adding a minor overhead in terms of interactions in the environment. To achieve this, we combine Neuroevolution techniques with off-policy training and propose a novel architecture mutation operator. Experiments on five continuous control benchmarks show that the proposed Actor-Critic Neuroevolution algorithm often outperforms the strong Actor-Critic baseline and is capable of automatically finding topologies in a sample-efficient manner which would otherwise have to be found by expensive architecture search.

1 Introduction

The field of Deep Reinforcement Learning (DRL) bears a lot of potential for meta-learning. DRL has recently achieved remarkable success in areas ranging from playing games [1-3], locomotion control [4] and visual navigation [5] to robotics [6, 7]. However, all of these successes of DRL were based on manually chosen neural architectures rather than on learned architectures. In this paper, we introduce a novel and efficient method for learning the architecture used in DRL algorithms for continuous control online. To achieve this, we jointly learn the architectures of both actor and critic in a Q-function policy gradient setting based on TD3 [8]. Specifically, our contributions are as follows:

- We combine concepts from Neuroevolution and off-policy RL training to evolve and train a population of actor and critic architectures in a sample-efficient way.
- We propose a novel genetic operator based on network distillation [9] for stable architecture mutation.
- Our method is the first to adapt the neural architecture of RL algorithms online, based on off-policy training with the use of environment interactions from architecture evaluations shared in a global replay buffer.
- Our method finds well-performing architectures while incurring only a small overhead in terms of environment steps over an RL run with a single fixed architecture.

* Equal contribution. Correspondence to frankej@cs.uni-freiburg.de and g.koehler@dkfz.de.

2 Background

Our proposed approach is based on Neuroevolution, a technique to optimize neural networks, including their architecture, using Genetic Algorithms (GAs) [10-12]. First, a population of computation graphs with minimal complexity, represented by genomes, is created. Using genetic operators such as adding nodes or edges, additional structure is added incrementally. Different approaches in Neuroevolution usually differ in what is represented by the nodes and edges. In [10, 11], each node represents an individual neuron and the edges determine the inputs to each neuron. Recent work extending Neuroevolution to larger network architectures used nodes to represent whole layers in a network, encoding each layer by its type-specific hyperparameters (e.g. kernel size in convolutional layers) [13]. In this paper we follow a similar approach, encoding the network architecture as multi-layer perceptrons (MLPs). In contrast to the few existing works on learning neural architectures for RL (using blackbox evolutionary optimization [14, 15] or multi-fidelity Bayesian optimization [16]), our approach optimizes the neural network architecture online, substantially improving sample-efficiency.

3 Methods

The foundation of Actor-Critic Neuroevolution (ACN) is a genetic algorithm which evolves a population of N agents P = {A_1, ..., A_N}.
We associate each agent A_n = [f_n, ψ_n^a, θ_n^a, ψ_n^c, θ_n^c] with a fitness value f_n, along with topology descriptions ψ_n^a and ψ_n^c for actor and critic respectively, as well as their parameters θ_n^a and θ_n^c. For simplicity and comparability, we restrict the topology to standard MLPs for both actor and critic. After initializing the networks of each individual, we evaluate the actor MLP in the environment to obtain initial fitness values. With the initial fitness values in place, the evolution loop runs for G generations, using tournament selection [17] for actors and critics individually to find the candidates for the next generation. Since mutation changes the actor, the critic cannot be conditioned on the actor's behaviour and needs to be generally optimal. In the following subsections, we first explain the components of our algorithm and then how they are integrated into a single algorithm.

3.1 Distilled Topology Mutation

We introduce a novel mutation operator that acts on the topology of both the actor and the critic networks of the population. In order to mutate the actor and critic of an individual in a stable way, our proposed method operates in two stages. First, we jointly grow the topology of both networks in order to increase their capacity. With probability p_L, this growing mechanism appends another hidden layer of the same size as the previous last hidden layer to the respective networks; with probability 1 - p_L, an existing hidden layer is chosen at random and a random number of new nodes is added to this layer. Both types of topology change are applied identically to the actor and critic networks.
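The growth stage described above can be sketched in a few lines. The following is a minimal NumPy sketch, assuming an MLP represented as a list of (weight, bias) pairs with weights of shape (fan_in, fan_out); the near-identity and zero initialization of the new parameters is an illustrative assumption, since the paper relies on the subsequent distillation stage rather than on a particular initialization:

```python
import numpy as np

def add_layer(layers):
    """Append a hidden layer with the same width as the previous last
    hidden layer, inserted just before the output layer. Identity
    weights keep the function change small (illustrative choice)."""
    width = layers[-1][0].shape[0]          # fan-in of the output layer
    new_hidden = (np.eye(width), np.zeros(width))
    return layers[:-1] + [new_hidden, layers[-1]]

def add_nodes(layers, idx, n_new):
    """Widen hidden layer `idx` by `n_new` nodes; the next layer grows
    matching (zero-initialized) input rows so shapes stay consistent."""
    w, b = layers[idx]
    w_next, b_next = layers[idx + 1]
    w = np.concatenate([w, np.zeros((w.shape[0], n_new))], axis=1)
    b = np.concatenate([b, np.zeros(n_new)])
    w_next = np.concatenate([w_next, np.zeros((n_new, w_next.shape[1]))], axis=0)
    grown = list(layers)
    grown[idx] = (w, b)
    grown[idx + 1] = (w_next, b_next)
    return grown

def grow(actor, critic, rng, p_l=0.2, node_choices=(4, 8, 16, 32)):
    """Apply the same topology change to actor and critic (which share
    the hidden-layer structure): with probability p_L append a layer,
    otherwise widen one randomly chosen hidden layer."""
    if rng.random() < p_l:
        return add_layer(actor), add_layer(critic)
    idx = int(rng.integers(0, len(actor) - 1))   # pick a hidden layer
    n_new = int(rng.choice(node_choices))
    return add_nodes(actor, idx, n_new), add_nodes(critic, idx, n_new)
```

The node-count set (4, 8, 16, 32) matches Table 2; everything else about the representation is a sketch, not the authors' exact implementation.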
As the necessary initialization of the additional parameters introduced by growing the topology changes both the critic's estimate of the state-action values and the policy of the actor, we propose a second stage of the topology mutation operator based on network distillation [9, 18]. Here we distill the parent's behavior into the offspring, using data D_p = (s_i, q_i^p)_{i=1}^{N} consisting of N states (or state-action pairs for critic distillation) sampled uniformly at random from the global replay memory, along with the parent network's outputs. This data is then used to perform gradient-based updates on the offspring network, using the parent network's outputs as targets in a supervised learning setting:

L(D_p, θ^o) = Σ_{i=1}^{|D_p|} || q_i^p - μ_{θ^o}(s_i) ||_2^2    (1)

We apply a mean-squared-error (MSE) loss for the distillation updates on both the offspring actor and critic, where μ_{θ^o} represents the respective offspring network with parameters θ^o. We use this additional step to stabilize the topology mutation operator, using the parent as a teacher to distill its knowledge into the offspring.

3.2 Gradient-based Mutation

We adopt a second mutation operator (SM-G-SUM) as one of two mutation operators used to evolve the actors in the population. This operator helps to create a more diverse set of actors by altering the parameters of the actor network's layers. Neural network parameter mutations based on Gaussian noise can lead to strong deviations in behavior, often resulting in deteriorated performance [19, 20]. In order to stabilize the policies resulting from the mutation operator, we make use of the safe mutation operator introduced in [19]. This mutation approach scales the perturbations on a per-parameter basis, depending on the sensitivity of the network's outputs with respect to the individual parameter.
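Equation (1) amounts to ordinary supervised regression of the offspring onto the parent's outputs. As a minimal sketch (assuming, for illustration only, a linear offspring network with a hand-derived gradient, rather than the MLPs and autodiff optimizers used in practice):

```python
import numpy as np

def distill_step(theta, states, parent_out, lr=0.1):
    """One gradient step on L(D_p, theta) = sum_i ||q_i^p - mu_theta(s_i)||^2
    for a linear offspring mu_theta(s) = s @ theta, with the gradient
    averaged over the batch for a stable step size."""
    err = states @ theta - parent_out          # (N, out) residuals
    grad = 2.0 * states.T @ err                # dL/dtheta
    return theta - lr * grad / len(states)

rng = np.random.default_rng(0)
states = rng.normal(size=(128, 4))             # states sampled from the replay memory
theta_parent = rng.normal(size=(4, 2))
parent_out = states @ theta_parent             # parent targets q^p
theta = np.zeros((4, 2))                       # freshly grown offspring parameters
for _ in range(200):
    theta = distill_step(theta, states, parent_out)
# the offspring now mimics the parent on the sampled states
assert np.allclose(theta, theta_parent, atol=1e-2)
```

In ACN this regression is run for a fixed number of updates (500, per Table 2) on both the offspring actor and critic before any environment interaction.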
3.3 Integration in Actor-Critic Neuroevolution

To realize all benefits of the proposed genetic operators as well as the Actor-Critic training, we combine them in the Actor-Critic Neuroevolution (ACN) framework; see Algorithm 1 in the appendix. In the ACN framework, we integrate the two mutation operators described above into a standard GA loop, always mutating the selected candidates by either performing distilled topology mutation (with topology growth probability p_G) or gradient-based mutation (with probability 1 - p_G). After mutating, we add a network training phase for each individual, following the setting of Twin-Delayed DDPG (TD3): we perform multiple off-policy gradient updates making use of target policy smoothing and clipped double-Q learning [4, 8]. Due to the changes in the neural network architecture, the training phase requires a re-initialization of the optimizer and a recreation of the target network at the start of each phase. The training of each individual can be performed in parallel, since each individual carries its own actor and critic. By adding this training phase, which uses the experiences from a global replay memory, each individual can benefit from a diverse set of policies exploring the environment during fitness evaluation. The training also improves the sample-efficiency of the GA, as each offspring receives gradient-based updates and thus converges to high-reward solutions faster. This is in contrast to purely evolutionary approaches, which have to explore the network parameter space in a highly inefficient manner.

4 Experiments

We evaluate ACN on five robot locomotion tasks from the MuJoCo suite [21]. On these tasks, we compare the performance of a TD3 [8] baseline against two variations of ACN: one which evolves the architecture automatically and one with a fixed network architecture and parameter mutation only.
This choice is motivated by the fact that TD3 is the algorithm we employ for each agent's individual training phase in ACN. To the best of our knowledge, this is the first work showing online architecture search on MuJoCo tasks. For all evaluated algorithms, we use only a single hyperparameter setting across all MuJoCo environments to facilitate comparison with TD3, which was also evaluated with one fixed setting for all MuJoCo environments. In terms of network architecture, the fixed-architecture algorithms use two layers with 400 and 300 nodes, respectively. In the case of ACN evolving the topology, we start with a single layer of 64 hidden nodes, initialized using He initialization [22].

Figure 1 shows that, at the end of the optimization, the two ACN variants perform on par with or better than TD3 on all evaluated continuous control tasks. The best architectures found by ACN and the architecture experiments with TD3 are given in Table 1. Especially in the Humanoid environment, the ACN algorithm shows a substantial improvement in performance, which can most likely be attributed to the exploratory nature of the algorithm, both in terms of NN topology and parameters. This is also reflected in the rather atypical architectures found by ACN for this task. In HalfCheetah, the final architectures found by ACN are smaller than the default architecture. This is consistent with the experiments in Appendix A, where a smaller size also outperforms the default architecture. In Hopper, ACN takes more environment steps to optimize the network architecture, but eventually catches up, again finding a smaller-than-usual network size consistent with the findings in Appendix A. The evolved network in Walker2d also takes longer to optimize compared with a single TD3 run, but eventually outperforms TD3 with a smaller architecture.
The architecture found in Ant contains only one layer and half the nodes compared to the TD3 default, but shows comparable performance. In this environment, the fixed-architecture variant of ACN outperforms TD3. This could be caused by the re-initialization and recreation of optimizer and target networks, as shown in Appendix D. The experiments demonstrate the capability of ACN to find suitable network architectures ranging from smaller to larger ones, both in terms of the number of layers and the individual layer sizes. ACN achieves this while adding only a minor amount of computational cost.

Appendix A shows experiments with different actor/critic NN architectures for TD3. These experiments show the significant impact network architecture choices can have on the algorithm's performance.

Figure 1: Comparison of mean performance on continuous control benchmarks (HalfCheetah, Ant, Walker2d, Humanoid, Hopper) for ACN with fixed NN topology, ACN with evolving neural network topology, and TD3. We used two random seeds for ACN and five random seeds for TD3; the shaded area represents the standard error.

Environment  | ACN          | TD3 grid search
Hopper       | [136, 72]    | [200, 150]
Ant          | [276]        | [600, 450]
HalfCheetah  | [80, 80, 88] | [600, 450]
Humanoid     | [672, 508]   | [400]
Walker2d     | [200, 144]   | [600, 450]

Table 1: Best actor architectures found by ACN compared with the best-performing TD3 runs.

We also evaluate the impact of re-initializing the optimization algorithm and recreating the target networks during training, as applied during the ACN network training phase, in Appendix D. For that experiment, we apply re-initialization of Adam and recreation of the target networks after every 10k training steps in TD3; it shows that the re-initialization and recreation do not tend to have a negative impact and sometimes even prove beneficial to TD3 training.
5 Conclusion

This paper demonstrates how suitable neural network topologies of actor and critic networks can be found online, while still showing performance comparable with state-of-the-art methods in robot locomotion tasks. We proposed the ACN algorithm, which combines the strengths of Neuroevolution methods with the sample-efficient training of off-policy Actor-Critic methods. To achieve this, we proposed a novel genetic operator which grows the network topology in a stable manner by distilling the parent network's knowledge into the offspring. Additionally, we augmented the GA with an off-policy Actor-Critic training phase, sharing collectively gathered environment interactions in a global replay memory. Our experiments showed that ACN automatically finds suitable neural network architectures for all evaluated tasks, consistent with strong architectures for these tasks, while only adding a small computational overhead over a single RL run with a fixed architecture. Further work could investigate the impact of the mutation operator in RL training and why this combination of a GA and RL training often leads to successful training of smaller topologies while achieving similar or even better performance compared to current RL algorithms.

Acknowledgments

This work has partly been supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant no. 716721.

References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.

[2] David Silver, Aja Huang, Chris J.
Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, Jan 2016. doi: 10.1038/nature16961.

[3] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojtek Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo Ewalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin Dalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor Cai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. AlphaStar: Mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.

[4] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Manfred Otto Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.

[5] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3357-3364. IEEE, 2017.

[6] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334-1373, 2016.
[7] OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew, Jakub W. Pachocki, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation. CoRR, abs/1808.00177, 2018. URL http://arxiv.org/abs/1808.00177.

[8] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1582-1591, 2018.

[9] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.

[10] Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99-127, June 2002. doi: 10.1162/106365602320169811.

[11] Kenneth O. Stanley, David B. D'Ambrosio, and Jason Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185-212, 2009. doi: 10.1162/artl.2009.15.2.15202.

[12] Kenneth O. Stanley, Jeff Clune, Joel Lehman, and Risto Miikkulainen. Designing neural networks through neuroevolution. Nature Machine Intelligence, 1(1):24-35, 2019.

[13] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Daniel Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, et al. Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pages 293-312. Elsevier, 2019.

[14] Hao-Tien Lewis Chiang, Aleksandra Faust, Marek Fiser, and Anthony Francis. Learning navigation behaviors end to end. CoRR, abs/1809.10124, 2018.

[15] Aleksandra Faust, Anthony Francis, and Dar Mehta.
Evolving rewards to automate reinforcement learning. CoRR, abs/1905.07628, 2019.

[16] Frederic Runge, Danny Stoll, Stefan Falkner, and Frank Hutter. Learning to design RNA. In International Conference on Learning Representations, 2019.

[17] Brad L. Miller and David E. Goldberg. Genetic algorithms, tournament selection, and the effects of noise. Complex Systems, 9:193-212, 1995.

[18] Andrei A. Rusu, Sergio Gomez Colmenarejo, Çaglar Gülçehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. CoRR, abs/1511.06295, 2016.

[19] Joel Lehman, Jay Chen, Jeff Clune, and Kenneth O. Stanley. Safe mutations for deep and recurrent neural networks through output gradients. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 117-124. ACM, 2018.

[20] Cristian Bodnar, Ben Day, and Pietro Liò. Proximal distilled evolutionary reinforcement learning. CoRR, abs/1906.09807, 2019.

[21] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026-5033, Oct 2012. doi: 10.1109/IROS.2012.6386109.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026-1034, 2015.

[23] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv e-prints, arXiv:1607.06450, Jul 2016.

A Neural Network Architecture Experiments for TD3

We evaluated different choices of the neural network architecture used for both actor and critic networks in the TD3 algorithm in Figure 2, keeping all hyperparameters fixed as reported in the original work [8].
Figure 2 shows the results for various neural network architecture choices. Perhaps unsurprisingly, the default architecture chosen in recent literature does not show the best performance in all environments. For example, in the Humanoid environment, a simpler topology using only one hidden layer with 400 nodes performs substantially better, while in other environments like HalfCheetah, Ant and Walker2d, larger capacities like [600, 450] seem to be favorable.

Figure 2: Comparison of the mean performance of TD3 with different neural network topologies on MuJoCo continuous control benchmarks (HalfCheetah, Ant, Walker2d, Hopper, Humanoid). We used 25 random seeds, but exclude and report the failed runs, where the average return was less than 5% of the best run. The shaded area represents the standard error.

B ACN Algorithm

A high-level description of the proposed ACN algorithm is given in Algorithm 1, making use of tournament selection, Actor-Critic training based on TD3, and a combined mutation operator joining gradient-based mutation and the proposed distilled topology mutation (see Algorithm 2). The EVALUATE function performs a number of Monte-Carlo rollouts in the environment and determines the fitness value as the average cumulative reward across the performed rollouts. The selection operator TOURNAMENTSELECT runs individual tournaments for all actors and critics of the new population, allowing for different Actor-Critic combinations in subsequent children. This choice is made to prevent actors from exploiting their critic's weaknesses over time, which would lead to undesired behavior. After selection, the new population is mutated using transition batches from the global replay memory. With probability p_growth, the network architecture is grown.
This is achieved either by appending a new layer of the same size as the last hidden layer of the actor (with probability p_addlayer) or by choosing a number of additional nodes from the set given as a hyperparameter and adding this many nodes to a randomly chosen layer of the architecture. Both architecture growth operators perform identical architecture changes to both the actor and the critic, as indicated by ADDSAMELAYER and ADDSAMENODES in Algorithm 2. The set of possible additional node numbers can be found in Table 2. As this alteration of the network architectures changes the respective network's behavior, we perform network distillation updates (see Section 3.1) on both networks, using transition batches sampled uniformly at random from the global replay memory. With this additional step, we can distill the respective parent's knowledge into the offspring, thus enabling the network architectures to grow in a stable manner without requiring additional rollouts in the environment. Alternatively to growing the network architectures, we mutate the individual actors of each agent in the population with probability 1 - p_growth, making use of the SAFEMUTATION operator described in Section 3.2.

Algorithm 1: Actor-Critic Neuroevolution (ACN) algorithm
Input: population size k, number of generations G
 1  Initialize global replay memory R
 2  Initialize k individuals A_i := {actor_i, critic_i, fitness_i = None} as initial population P#_0
 3  for g = 1 to G do
 4      P_{g-1}, T_e <- EVALUATE(P#_{g-1})
 5      Store transitions T_e in R
 6      P_elite <- TOPK(P_{g-1})
 7      P_selection <- TOURNAMENTSELECT(P_{g-1})
 8      T_s <- sample transitions from R
 9      P#_mutated <- MUTATE(P_selection, T_s)
10      P#_trained <- ACTORCRITICTRAINING(P#_mutated, T_s)
11      P#_g <- P#_trained ∪ P_elite
12  end
// # marks individuals that have not yet been evaluated in the environment.
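For intuition, the SAFEMUTATION step can be sketched for the simplest possible case, a linear policy, where the output sensitivities are available in closed form. The sensitivity floor of 1.0 and the optional explicit-noise argument are illustrative assumptions for this sketch, not the SM-G-SUM operator of [19] verbatim:

```python
import numpy as np

def safe_mutate(w, states, sigma=0.1, rng=None, noise=None):
    """Sensitivity-scaled parameter mutation for a linear policy y = s @ w,
    in the spirit of SM-G-SUM: Gaussian noise is divided per parameter by
    the magnitude of d(sum of outputs)/d(parameter) over a state batch,
    so parameters the policy is sensitive to are perturbed less."""
    if noise is None:
        noise = (rng or np.random.default_rng()).normal(size=w.shape)
    # For a linear policy, d(sum_i sum_k y_ik)/d w[j, k] = sum_i states[i, j],
    # i.e. the sensitivity of weight row j is the summed j-th input feature.
    sens = np.abs(states.sum(axis=0))[:, None] * np.ones_like(w)
    # Floor of 1.0: a stabilizing assumption to avoid huge perturbations
    # of weights with near-zero measured sensitivity.
    return w + sigma * noise / np.maximum(sens, 1.0)
```

For deep networks the sensitivities are instead obtained via backpropagation of the summed outputs over a state batch (1500 states in ACN, per Table 2), but the per-parameter rescaling is the same idea.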
This mutation operator alters the individual's policy in a stable way, facilitating exploration in the environment.

Algorithm 2: Mutation algorithm
Input: population P, transitions T
 1  Initialize empty new population P_mutated = {}
 2  for each individual i ∈ P with actor A and critic C do
 3      if random number < p_growth then
 4          if random number < p_addlayer then
 5              A_grown, C_grown <- ADDSAMELAYER(A, C)
 6          else
 7              A_grown, C_grown <- ADDSAMENODES(A, C)
 8          end
 9          A_distilled <- DISTILLPARENT(A_grown, A, T)
10          C_distilled <- DISTILLPARENT(C_grown, C, T)
11      else
12          A_mutated <- SAFEMUTATION(A, T)
13      end
14      P_mutated <- P_mutated ∪ {mutated individual}
15  end
16  return P_mutated

Each offspring created during the mutation phase is then trained individually using the ACTORCRITICTRAINING operator, which follows the off-policy gradient-based updates described in [4], with the extensions introduced in [8]. The trained offspring, along with the elite determined as the best-performing individuals during evaluation, is then used as the next generation in the GA.

C Hyperparameters

All hyperparameters are kept constant across all environments. For the TD3 training, the same set of hyperparameters as reported in the original paper [8] was used. Table 2 shows the hyperparameters used for ACN across all evaluated environments. All neural networks use the ReLU activation function for hidden layers and linear/tanh output activations for critic and actor networks, respectively. We apply LayerNorm [23] after each hidden layer, as it proved beneficial for stability and performance across all experiments in this paper.
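The network structure just described (ReLU hidden layers, LayerNorm after each hidden layer, tanh on the actor output) can be sketched as a plain forward pass. The ordering ReLU-then-LayerNorm and the omission of LayerNorm's learned gain and bias are illustrative assumptions of this sketch:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean / unit variance (no learned
    gain or bias here, an illustrative simplification of [23])."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def actor_forward(s, layers):
    """MLP forward pass in the style of Appendix C: ReLU plus LayerNorm
    after each hidden layer, tanh activation on the actor output.
    `layers` is a list of (weight, bias) pairs, weights (fan_in, fan_out)."""
    h = s
    for w, b in layers[:-1]:
        h = layer_norm(np.maximum(h @ w + b, 0.0))
    w, b = layers[-1]
    return np.tanh(h @ w + b)        # actions bounded in (-1, 1)
```

A critic forward pass would be identical except for a linear (identity) output activation in the last line.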
Hyperparameter                                   | Value
Population size                                  | 20
Elite size                                       | 5%
Tournament size                                  | 3
Network growth probability                       | 0.2
Add layer probability                            | 0.2
Add nodes probability                            | 0.8
Set of possible nodes added during layer growth  | [4, 8, 16, 32]
Network distillation updates                     | 500
Network distillation batch size                  | 100
Network distillation learning rate               | 0.1
Safe mutation batch size                         | 1500
Safe mutation standard deviation                 | 0.1

Table 2: Hyperparameters, constant across all environments.

D Experiments on Re-Initialization of Optimizers in TD3

To assess the performance impact of re-initializing the optimizers, as is inevitably done in ACN, as well as of using a new target network after a certain number of steps, we evaluated different combinations in TD3. Figure 3 shows the impact of the different combinations on the performance of TD3 for the continuous control benchmarks used in this paper. Surprisingly, the default TD3 choice does not show the best performance in all environments, as might be expected. Rather, using the current state of the critic as the new target network from time to time seems to benefit performance.

Figure 3: Comparison of the mean performance of TD3 when re-initializing the optimizer, recreating the target network, or both, after 10k frames on MuJoCo continuous control benchmarks (HalfCheetah, Ant, Walker2d, Hopper, Humanoid). We used 25 random seeds, but exclude and report the failed runs, where the average return was less than 5% of the best run. The shaded area represents the standard error.
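The two target-network treatments compared in Appendix D can be sketched side by side; a minimal sketch contrasting the standard soft (Polyak) target update of TD3 with the periodic hard recreation probed here, with parameters represented as lists of arrays (an assumption for illustration):

```python
import numpy as np

def polyak_update(target, online, tau=0.005):
    """Standard TD3-style soft target update:
    target <- (1 - tau) * target + tau * online."""
    return [(1 - tau) * t + tau * o for t, o in zip(target, online)]

def maybe_recreate(target, online, step, every=10_000):
    """Hard target recreation as probed in Appendix D: every `every`
    steps the target is replaced by a copy of the current online
    network (10k steps matches the interval used in the paper)."""
    if step % every == 0:
        return [o.copy() for o in online]
    return target
```

In ACN the hard recreation happens implicitly at the start of every per-individual training phase, since the target network is rebuilt after each architecture mutation.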
