Autonomous Goal Exploration using Learned Goal Spaces for Visuomotor Skill Acquisition in Robots


Authors: Adrien Laversanne-Finot, Alexandre Péré, Pierre-Yves Oudeyer

Adrien Laversanne-Finot, Flowers Team, Inria and Ensta-ParisTech, France (adrien.laversanne-finot@inria.fr)
Alexandre Péré, Flowers Team, Inria and Ensta-ParisTech, France (alexandre.pere@inria.fr)
Pierre-Yves Oudeyer, Flowers Team, Inria and Ensta-ParisTech, France (pierre-yves.oudeyer@inria.fr)

ABSTRACT

The automatic and efficient discovery of skills, without supervision, for long-living autonomous agents remains a challenge of Artificial Intelligence. Intrinsically Motivated Goal Exploration Processes give learning agents a human-inspired mechanism to sequentially select goals to achieve. This approach gives a new perspective on the lifelong learning problem, with promising results in both simulated and real-world experiments. Until recently, those algorithms were restricted to domains with experimenter knowledge, since the goal space used by the agents was built on engineered feature extractors. Recent advances in deep representation learning enable new ways of designing those feature extractors, using the agent's experience directly. Recent work has shown the potential of those methods on simple yet challenging simulated domains. In this paper, we present recent results showing the applicability of those principles on a real-world robotic setup, where a 6-joint robotic arm learns to manipulate a ball inside an arena by choosing goals in a space learned from its past experience.

1 INTRODUCTION

Despite recent breakthroughs in artificial intelligence, learning agents often remain limited to tasks predefined by human engineers. The autonomous discovery and simultaneous learning of many tasks in an open world remains challenging for reinforcement learning algorithms.
However, discovering autonomously the set of outcomes that can be produced by acting in an environment is of paramount importance for learning agents. It is essential for acquiring world models and repertoires of parameterized skills (Baranes & Oudeyer, 2013; Da Silva et al., 2014; Hester & Stone, 2017) and for efficiently bootstrapping exploration in deep reinforcement learning problems with rare or deceptive rewards (Conti et al., 2017; Colas et al., 2018b). In order to discover as many diverse outcomes as possible, the learner should be able to self-organize its exploration curriculum so as to discover efficiently the outcomes that can be produced in its environment.

When aiming at discovering autonomously what outcomes can be produced by a physical robot, a naive exploration of the space of motor commands is bound to fail. First, the space of motor commands is often continuous and high-dimensional. Secondly, this space is also highly redundant: many motor commands will produce the same effect. Lastly, in any real-world setup, the number of samples that can be collected is limited. Thus, discovering diverse outcomes and learning policies to reproduce them requires more elaborate strategies.

One approach that was shown to be efficient in this context is known as Intrinsically Motivated Goal Exploration Processes (IMGEPs) (Baranes & Oudeyer, 2010; Forestier et al., 2017), an architecture closely related to Goal Babbling (Rolf et al., 2010). The general idea of IMGEPs is to equip the agent with a goal space.

Figure 1: The IMGEP with learned goal spaces strategy.

During exploration, the agent samples goals in this goal space according to a certain strategy, before trying to reach them using an associated goal-parameterized reward function. For each sampled goal, the agent dedicates a budget of experiments to improving its performance regarding this particular goal.
Crucially, the agent stores each outcome discovered during exploration, which allows it to learn in hindsight how to achieve each outcome it discovers, should it later sample that outcome as a goal. This makes the approach powerful, since targeting one goal will often allow an agent to simultaneously learn about other goals. IMGEPs can be implemented with population-based policy learning approaches (Baranes & Oudeyer, 2010; Péré et al., 2018) or with goal-parameterized deep reinforcement learning techniques (Colas et al., 2018a).

Until recently, IMGEPs were limited to engineered goal spaces. This limits the autonomy of the agent, and in many interesting problems such a goal space is not provided and may be hard to design manually. It is always possible to use the sensory space as a goal space. However, in many cases the sensory space is high-dimensional, e.g. made of low-level perceptual measures such as pixels, and building a goal-parameterized reward directly in this space is problematic. Thus, it was proposed in Péré et al. (2018) to leverage representation learning algorithms such as Variational Auto-Encoders (VAEs) and to use the learned latent space as the goal space. It was further shown in Laversanne-Finot et al. (2018) that if the representation used as a goal space is disentangled (e.g. encoding separately different physical properties of the environment), then it becomes possible to achieve more efficient exploration in environments with multiple objects and distractors, through a modular goal exploration algorithm that samples goals which maximize learning progress.

However, all these experiments were performed in simulated environments. Furthermore, they assumed the availability of many observations of outcomes produced by another agent, covering the diversity of possible outcomes, in order to train the goal space representation initially.
In this paper, we provide evidence that the ideas developed in those papers can also be successfully applied to real-world scenarios. We also show how they can be transposed, in a sample-efficient manner, to a fully autonomous learning setting where the representation learning mechanism is trained on outcome data gathered autonomously by the agent. In particular, we consider an experiment where a 6-joint robotic arm interacts with a ball inside a closed arena, and we show that using a learned representation as a goal space leads to better exploration of the environment than a strong baseline consisting in randomly sampling dynamic motion primitives.

2 GOAL EXPLORATION WITH LEARNED GOAL SPACES

This section briefly introduces Intrinsically Motivated Goal Exploration Processes using a learned representation of the goal space. The overall architecture is summarized in Figure 1. For a more thorough introduction to IMGEPs with engineered and learned goal spaces, we refer to Forestier et al. (2017) and Laversanne-Finot et al. (2018), respectively.

In order to understand the general idea of IMGEPs, one must imagine the agent as performing a sequence of contextualized and parameterized experiments. At the beginning of each experiment, the agent will, in sequence: observe the context, sample a goal according to some strategy, use its internal knowledge (policy) to find the best motor parameters to achieve this goal in this context, and then perform the experiment using these parameters. The goals are arbitrary and, when the goal space is hand-crafted, can range from "moving the ball to this specific position" to "moving the end effector of the arm to this location". If no such goal space is available, one strategy proposed in Péré et al. (2018) is to learn a representation of the environment, using data sampled from demonstrations, and to use the latent space as the goal space.
In this case a goal is a point in the latent space, and a similarity function in this space serves as the associated goal-achievement reward function. The agent then tries to produce an outcome that, when encoded, is as close as possible to this point in the latent space. See Algorithmic Architecture 1 for a high-level algorithmic description of IMGEPs and Appendix 6.1 for more details on the different components.

Algorithmic Architecture 1: Goal Exploration Strategy
Input: Policy Π, History H, (optional) Goal space (engineered or learned): (R, γ)
1  begin
2    for a fixed number of bootstrapping iterations do
3      Append to H using Random Motor Exploration
4    Learn the goal space using a representation learning algorithm (if not provided)
5    Initialize Policy Π with history H
6    for a fixed number of exploration iterations do
7      Observe context c
8      Sample a goal, τ ∼ γ
9      Compute θ using Π on tuple (c, τ)
10     Perform experiment and retrieve observation o
11     Append (c, θ, o) to H
12     Update Policy Π with (c, θ, o)
13   return the history H

3 EXPERIMENTS

We carried out experiments on a real-world environment to address the following questions:

• To what extent can the ideas developed in simulated environments be applied to a real-world setup?
• Does the dataset used to train the representation algorithm need to contain examples of all possible outcomes to learn a goal space that gives good performance during exploration? Can it be learned during exploration, as examples of outcomes are collected?

In order to answer those questions, we experimented on a robotic setup that is similar in spirit to the environments considered in the simulated experiments, and that we now describe in detail.

Robotic environment. The environment is composed of a 6-joint robotic arm that evolves in an arena. In this arena, a (tennis) ball can be moved around. Due to the geometry of the arena, the ball is more or less constrained to evolve on a circle.
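As a concrete illustration, the loop of Algorithmic Architecture 1 can be sketched in a few lines of Python. This is a toy sketch under strong simplifying assumptions, not the paper's implementation: the `environment` function, the 2-parameter action space, and the perturb-nearest-neighbor policy stand in for the 48-parameter DMP controller and the real meta-policy.

```python
import math
import random

def environment(theta):
    """Toy stand-in for a rollout: maps 2 motor parameters to a 2-D outcome.
    (Hypothetical; the real setup maps 48 DMP weights to an observed scene.)"""
    return (math.sin(theta[0]), math.cos(theta[0]) * theta[1])

def nearest(history, goal):
    """Return the stored (theta, outcome) pair whose outcome is closest to the goal."""
    return min(history,
               key=lambda h: (h[1][0] - goal[0]) ** 2 + (h[1][1] - goal[1]) ** 2)

def explore(n_bootstrap=50, n_exploration=200, sigma=0.1, seed=0):
    rng = random.Random(seed)
    history = []
    # Bootstrapping: random motor exploration fills the history.
    for _ in range(n_bootstrap):
        theta = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        history.append((theta, environment(theta)))
    # Goal exploration: sample a goal, reuse and perturb the closest known parameters.
    for _ in range(n_exploration):
        goal = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        theta_best, _ = nearest(history, goal)
        theta = [t + rng.gauss(0, sigma) for t in theta_best]  # local perturbation
        history.append((theta, environment(theta)))
    return history
```

Note how every rollout, goal-directed or not, is appended to the history: this is the hindsight reuse that makes targeting one goal informative about others.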
A picture of the environment is shown in Figure 2. The agent perceives the scene as a 64 × 64 pixel image. The motion of the arm is controlled by Dynamical Movement Primitives (DMPs). Actions are the parameters of the DMPs used in the current episode. There is one DMP per joint. Each DMP is parametrized by one weight for each of its 7 basis functions and one supplementary weight specifying the end joint state, for a total of 48 parameters.

Figure 2: The robotic setup. It consists of a 6-joint robotic arm and a ball that is constrained to move in an arena.

For the representation learning phase, we considered different strategies. In the first strategy (as was done in Péré et al. (2018) and Laversanne-Finot et al. (2018)), we consider that the agent has access to a database of examples of the possible set of outcomes. From this database, the agent learns a representation that is then used as a goal space for the exploration phase. This strategy is referred to as RGE (VAE). One could argue that this method introduces knowledge of the set of possible outcomes that can be obtained by the agent. In order to test how this impacts the performance of the exploration algorithms, we also experimented with a representation learned using only the samples collected during the initial iterations of random motor exploration. We refer to this strategy as RGE (Online).

Baselines. Results obtained using IMGEPs with learned goal spaces are compared to two baselines:

• Random Parameter Exploration (RPE), where exploration is performed by uniformly sampling parameters θ. This strategy is inefficient, as it does not leverage information collected during previous rollouts to choose the current parameters. It serves as a lower bound for the performance of the exploration algorithms.
Since DMPs were designed to enable the production of a diversity of arm trajectories with only a few parameters, this lower bound is already a reasonable baseline that performs better than applying random joint torques at each time-step of the episode.

• Goal Exploration with Engineered Features Representation (RGE-EFR): an IMGEP in which the goal space is handcrafted and corresponds (as closely as possible) to the true degrees of freedom of the environment. In this experiment it is not clear what the best representation is, as multiple choices are possible (e.g. Cartesian or polar coordinates for the position of the ball). We settled for polar coordinates, as the ball evolves on a circle. Since essentially all the information is available to the agent in a highly semantic form, this baseline is expected to give an upper bound on the performance of the exploration algorithms.

4 RESULTS

To assess the performance of IMGEPs with learned goal spaces, we performed between 8 and 14 trials for each configuration. In order to speed up the learning procedure, for each configuration using a learned goal space, we used the same representation for all trials [1].

Exploration performance. The performance of an algorithm is defined as the number of ball positions reached during the experiments. In this configuration, the ball is the hard part of the exploration problem, since the end position of the robotic arm can be efficiently explored by performing random motor commands. In practice, the performance of the exploration algorithms is measured by discretizing the outcome space into 900 cells (30 cells for each dimension) and counting the number of distinct cells reached by the ball during the experiment. The number of cells that can be reached is limited due to the finite size of the arm and arena.

[1] We did not pick a particular representation, and preliminary experiments show that similar performances are obtained with other representations.
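The cell-counting measure described above can be computed as follows. This is a minimal sketch assuming ball positions normalized to [-1, 1] in each dimension; the function name and grid bounds are illustrative, not taken from the paper's code.

```python
def cells_reached(positions, n_bins=30, low=-1.0, high=1.0):
    """Count the distinct cells of an n_bins x n_bins grid visited by the ball.

    positions: iterable of (x, y) ball positions observed during exploration.
    """
    width = (high - low) / n_bins
    cells = set()
    for x, y in positions:
        # Clamp the boundary value into the last cell.
        i = min(int((x - low) / width), n_bins - 1)
        j = min(int((y - low) / width), n_bins - 1)
        cells.add((i, j))
    return len(cells)
```

With n_bins = 30 this yields at most 900 cells, matching the discretization used in the experiments.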
Figure 3: Exploration performance during exploration.

The exploration performances are reported in Figure 3. From the plot, it is clear that IMGEPs with both learned and engineered goal spaces perform better than the RPE strategy. When using a representation learned before exploration (RGE (VAE)), the performances are at least as good as exploration using the engineered representation. When the goal space is learned using the online strategy, there is an initial phase where the exploration performance matches that of RPE. However, after this initial collection phase, when the exploration strategy is switched from random parameter exploration to goal exploration using the learned goal space (at 2000 exploration episodes), there is a clear change in the slope of the curve in favor of the goal exploration algorithm [2].

All in all, the differences in performance between IMGEPs and random parameter exploration are less pronounced than in past simulated experiments. We hypothesize that this is because the ball is too simple to move around; thus random parameter exploration, which leverages DMPs to produce diverse arm trajectories, achieves decent exploration results. Also, the motors of the robotic arm are far less precise than in simulation, which makes it harder to learn a good inverse model for the policy and to output parameters that will move the ball.

5 CONCLUSION

In this paper, we studied how learned representations can be used as goal spaces for exploration algorithms. We have shown in a real-world experiment that using a learned representation as a goal space provides better exploration performance than a naive exploration of the space of motor commands. One of the main advantages of using a learned goal space is that it alleviates the need to engineer a representation, which is not a simple task in general.
For example, in the robotic setup it is not clear that the engineered representation used is the most convenient one for the exploration algorithm. In this case, the position of the ball is parametrized using polar coordinates. In this representation, two points that have the same distance to the center but angles 0 and 2π are perceived as very distant even though physically they correspond to the same outcome. Also, the position of the ball is extracted using a handcrafted algorithm. This algorithm may fail (e.g. when the ball is hidden by the robotic arm), in which case it may report wrong values to the policy. Such problems make learning an inverse model harder and thus reduce exploration performance. Using a learned representation, on the other hand, obviates those problems.

As mentioned in the paper, it is possible to imagine more involved goal selection schemes when the representation is disentangled (see Appendix 6.6 for a short description of the results described in Laversanne-Finot et al. (2018)). These goal selection schemes leverage the disentanglement of the representation to provide better exploration performance. We tested these ideas in this experiment and did not find any advantage in using those goal selection schemes. This is not surprising, since there are no distractors in this experiment, and modular goal exploration processes are specifically designed to handle distractors. Consequently, designing a real-world experiment with distractors, in order to test modular goal exploration processes with learned goal spaces, would be of great interest for future work.

[2] Note that the first 2000 exploration episodes for the online strategy are the same for all runs performed on the same platform. In practice, the corresponding curve should be similar to the RPE curve, which was obtained with many more trials.

ACKNOWLEDGMENTS

We would like to thank Sébastien Forestier for help in setting up the experiment.
Simulated experiments presented in this paper were carried out using the PlaFRIM experimental testbed, supported by Inria, CNRS (LABRI and IMB), Université de Bordeaux, Bordeaux INP and Conseil Régional d'Aquitaine (see https://www.plafrim.fr/).

REFERENCES

Adrien Baranes and Pierre-Yves Oudeyer. Intrinsically motivated goal exploration for active motor learning in robots: A case study. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1766–1773. IEEE, 2010.

Adrien Baranes and Pierre-Yves Oudeyer. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robotics and Autonomous Systems, 61(1):49–73, 2013. ISSN 09218890. doi: 10.1016/j.robot.2012.05.008.

Fabien C. Y. Benureau and Pierre-Yves Oudeyer. Behavioral diversity generation in autonomous exploration through reuse of past experience. Frontiers in Robotics and AI, 3, March 2016. ISSN 2296-9144. doi: 10.3389/frobt.2016.00008.

Cédric Colas, Pierre Fournier, Olivier Sigaud, Mohamed Chetouani, and Pierre-Yves Oudeyer. CURIOUS: Intrinsically motivated multi-task, multi-goal reinforcement learning. arXiv preprint arXiv:1810.06284, 2018a.

Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning. In International Conference on Machine Learning (ICML), 2018b.

Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv preprint, 2017.

Bruno Da Silva, George Konidaris, and Andrew Barto. Active learning of parameterized skills. In International Conference on Machine Learning, pp. 1737–1745, 2014.

Sébastien Forestier and Pierre-Yves Oudeyer. Modular active curiosity-driven discovery of tool use.
IEEE International Conference on Intelligent Robots and Systems, pp. 3965–3972, 2016. doi: 10.1109/IROS.2016.7759584.

Sébastien Forestier, Yoan Mollard, and Pierre-Yves Oudeyer. Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190, 2017.

Todd Hester and Peter Stone. Intrinsically motivated model learning for developing curious robots. Artificial Intelligence, 247:170–186, 2017.

Irina Higgins, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, and Alexander Lerchner. Early visual concept learning with unsupervised deep learning. arXiv preprint, June 2016.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017a. URL https://openreview.net/forum?id=Sy2fzU9gl.

Irina Higgins, Arka Pal, Andrei A. Rusu, Loic Matthey, Christopher P. Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. In ICML, July 2017b. ISSN 1938-7228. URL http://arxiv.org/abs/1707.08475.

Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.

Adrien Laversanne-Finot, Alexandre Pere, and Pierre-Yves Oudeyer. Curiosity driven exploration of learned disentangled goal spaces. In Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pp. 487–504. PMLR, 29–31 Oct 2018. URL http://proceedings.mlr.press/v87/laversanne-finot18a.html.

Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A. Efros, and Trevor Darrell. Zero-shot visual imitation. In ICLR, pp. 1–12, 2018.
Alexandre Péré, Sébastien Forestier, Olivier Sigaud, and Pierre-Yves Oudeyer. Unsupervised learning of goal spaces for intrinsically motivated goal exploration. In ICLR, pp. 1–26, 2018. URL http://arxiv.org/abs/1803.00781.

Matthias Rolf, Jochen J. Steil, and Michael Gienger. Goal babbling permits direct learning of inverse kinematics. IEEE Transactions on Autonomous Mental Development, 2(3):216–229, 2010. ISSN 19430604. doi: 10.1109/TAMD.2010.2062511.

Figure 4: The two different approaches to constructing a meta-policy mechanism: (a) Direct-Model Meta-Policy, (b) Inverse-Model Meta-Policy.

6 APPENDICES

6.1 INTRINSICALLY MOTIVATED GOAL EXPLORATION PROCESSES

In this part, we give further explanations on Intrinsically Motivated Goal Exploration Processes.

Meta-Policy mechanism. The (meta-)policy is responsible for outputting the actions/parameters that are used during the episode. Given a context c and a goal τ, the policy should output the parameters θ that are the most likely to produce an observation o that fulfills the task τ. How well an observation o fulfills a task τ can be quantified by a cost function C : T × O → R. There are two different ways to construct a meta-policy, both depicted in Figure 4:

• Direct-Model Meta-Policy: In this case, an approximate model D̃ of the phenomenon dynamics is learned using a regressor (e.g. LWR) and is updated regularly by performing a training step with the newly acquired data. At execution time, for a given goal τ, a loss function is defined over the parameterization space through L(θ) = C(τ, D̃(θ, c)). A black-box optimization algorithm, such as L-BFGS, is then used to optimize this function and find the optimal set of parameters θ (see Baranes & Oudeyer (2013); Forestier & Oudeyer (2016); Benureau & Oudeyer (2016) for examples of such meta-policy implementations in the IMGEP framework).
• Inverse-Model Meta-Policy: In this approach, an inverse model Ĩ : T × C → Θ is learned from the history H, which contains all the previous experiments in the form of tuples (c_i, θ_i, o_i). To learn the inverse model, it is necessary to associate a task τ_i with every observation o_i. The inverse model can then be learned using usual regression techniques from the set {(τ_i, c_i, θ_i)}.

In our case, we took the approach of using an Inverse-Model-based Meta-Policy. We draw the reader's attention to the following implementation details:

• It may happen that different parameters yield the same final outcome. For example, different movements of the arm can put the ball and the arm in the same final position. However, in general, a combination of parameterizations that each lead to the same outcome (e.g. their average) does not produce a similar outcome. This is often referred to as the redundancy problem in robotics, or as a multi-modality issue (Pathak et al., 2018). To tackle this issue, we used a κ-NN regressor with κ = 1.

• In order to associate a goal with each observation, we used the (either learned or engineered) embedding function: to the observation o_i corresponds the goal τ_i defined through τ_i := R(o_i).

Our particular implementation of the meta-policy is outlined in Algorithm 2. The meta-policy is instantiated with one database per goal module. Each database stores the representations of the observations projected on its associated subspace, together with the associated contexts and parameterizations. Given that the meta-policy is implemented with a nearest-neighbor regressor, training the meta-policy simply amounts to updating all the databases. Note that, as stated above, even though at each step the goal is sampled in only one module, the observation obtained after an exploration iteration is used to update all databases.
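A minimal sketch of this database-per-module, κ = 1 nearest-neighbor meta-policy is given below. The class and method names are ours, the context c is omitted for brevity, and the projection P_k is modeled as a selection of latent indices; this illustrates the bookkeeping, not the paper's actual code.

```python
class NearestNeighborMetaPolicy:
    """Inverse-model meta-policy: one database per goal module, kappa = 1."""

    def __init__(self, projections):
        # projections[k] lists the indices of the latent variables kept by module k
        self.projections = projections
        self.databases = [[] for _ in projections]  # entries are (tau_k, theta)

    def update(self, theta, representation):
        # An observation's representation R(o) updates ALL module databases,
        # regardless of which module originally sampled the goal.
        for k, idx in enumerate(self.projections):
            tau_k = tuple(representation[i] for i in idx)
            self.databases[k].append((tau_k, theta))

    def infer(self, goal, k):
        # kappa = 1: return the parameters whose stored outcome is closest to the goal.
        best = min(self.databases[k],
                   key=lambda entry: sum((a - b) ** 2 for a, b in zip(entry[0], goal)))
        return best[1]
```

The κ = 1 lookup sidesteps the multi-modality issue noted above: it always returns one parameterization that actually produced a nearby outcome, instead of averaging over several incompatible ones.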
Algorithm 2: Meta-Policy (simple implementation using a nearest-neighbor model)
1  Require: Goal modules {R, P_k, γ(τ|k), C_k}, k ∈ {1, .., n_mod}
2  Function InitializeMetaPolicy(H):
3    for k ∈ {1, .., n_mod} do
4      database_k ← VoidDatabase
5      for (c, θ, o) ∈ H do
6        Add (c, θ, P_k R(o)) to database_k
7  Function UpdateMetaPolicy(c, θ, o):
8    for k ∈ {1, .., n_mod} do
9      Add (c, θ, P_k R(o)) to database_k
10 Function InferParameterization(c, τ, k):
11   θ ← NearestNeighbor(database_k, c, τ)
12   return θ

6.2 DEEP REPRESENTATION LEARNING ALGORITHMS

In this section, we summarize the theoretical arguments behind Variational Auto-Encoders (VAEs).

Variational Auto-Encoders (VAEs). Let x ∈ X be a set of observations. If we assume that the observed data are realizations of a random variable, we can hypothesize that they are conditioned by a random vector of independent factors z, i.e. that p(x, z) = p(z) p_θ(x | z), where p(z) is a prior distribution over z and p_θ(x | z) is a conditional distribution. In this setting, given an i.i.d. dataset X = {x_1, ..., x_N}, learning the model amounts to searching for the parameters θ that maximize the dataset log-likelihood:

    log L(X) = Σ_{i=1}^{N} log p_θ(x_i)    (1)

In practice, this quantity is often computationally intractable, so models are instead trained to optimize what is referred to as the Evidence Lower Bound (ELBO):

    L(x; θ, φ) = E_{z ∼ q_φ(z|x)}[log p_θ(x | z)] − D_KL[q_φ(z | x) ‖ p(z)],    (2)

where D_KL is the Kullback-Leibler divergence, by jointly optimizing over the parameters θ and φ (often those of neural networks).

6.3 DETAILS OF NEURAL ARCHITECTURES AND TRAINING

Model architecture. The encoder for the VAEs consisted of 4 convolutional layers, each with 32 channels, 4×4 kernels, and a stride of 2, followed by 2 fully connected layers, each of 256 units.
The latent distribution consisted of one fully connected layer of 20 units parametrizing the mean and log standard deviation of 10 Gaussian random variables. The decoder architecture was the transpose of the encoder, with the output parametrizing Bernoulli distributions over the pixels. ReLUs were used as activation functions. This architecture is based on the one proposed in Higgins et al. (2016).

Training details. The optimizer used was Adam (Kingma & Ba, 2015). For the simulated experiment, we used a learning rate of 5e-5 and a batch size of 64; the overall training of the representation took 1M training iterations. For the robotic experiment, we used a learning rate of 1e-5 and a batch size of 64, and trained the network for 300k iterations when the representation was learned before exploration. When the representation was learned from the outcomes obtained by random exploration, we used a batch size of 32 with the same learning rate and trained the network for 200k iterations.

6.4 SCATTER PLOTS: ROBOTIC ENVIRONMENT

Scatter plots of the exploration for the different exploration algorithms, together with the number of cells reached, are shown in Figure 5. Although the exploration of the outcome space of the arm is similar for all algorithms, there is a qualitative difference between RPE and all instantiations of IMGEPs in the outcomes obtained in the outcome space of the ball.

6.5 EXPERIMENTAL SETUP

In practice, experiments are performed in parallel using multiple copies of the same experiment. A picture of the complete experimental setup is shown in Figure 6. Only the 6-joint robotic arm in the center of the arena is used in the experiments presented in this paper. The cameras extracting the images are located on the bar above the setup.

6.6 MODULAR GOAL EXPLORATION PROCESSES

In this section, we recap some of the results presented in Laversanne-Finot et al. (2018).

6.6.1 IMGEPS WITH MODULAR GOAL SPACES

As mentioned in the main text, when the environment is more complex, and in particular when it contains distractors (objects that cannot be controlled), it is possible to design more efficient exploration algorithms. Modular goal exploration algorithms are designed to allow the agent to separate the exploration of different objects. For example, the agent could decide to set itself goals either for the ball or for its arm. The general idea is that some goals are harder (if not impossible) to reach than others. By monitoring its ability to fulfill different kinds of goals, the agent can autonomously discover the difficulty of each type of goal and focus its exploration on goals which are neither too easy nor too hard. Using this strategy, the agent thus autonomously designs a curriculum. See Algorithmic Architecture 3 for the corresponding algorithmic architecture.

When the goal space is engineered, the different modules can be readily defined when designing the goal space. However, in the case of learned goal spaces, there is no such easy solution. The strategy proposed in Laversanne-Finot et al. (2018) is to form modules by grouping some of the latent variables together. The goals of one module are then to reach observations for which the latent variables corresponding to this module have specific values.

Figure 5: Scatter plots of the end-of-arm and ball positions visited during exploration, for the RPE, RGE (EFR), RGE (VAE) and online strategies.

Figure 6: Experiments are performed in parallel over 6 robots. In this experiment, only the 6-joint robotic arm inside the arena is used.
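The module-selection rule sketched above (sample more often from modules where competence is improving, while never fully abandoning the others) can be written, for instance, as follows. This is an illustrative sketch: the epsilon-greedy mixing and the use of absolute progress are common choices in this literature, not details taken from the paper.

```python
def module_probabilities(progress, epsilon=0.1):
    """Turn per-module absolute learning progress into sampling probabilities.

    progress: list with one (signed) learning-progress estimate per module.
    epsilon:  fraction of uniform sampling mixed in, so that modules whose
              progress is currently zero (e.g. distractors) are still probed.
    """
    n = len(progress)
    total = sum(abs(p) for p in progress)
    if total == 0:
        return [1.0 / n] * n  # no signal yet: sample modules uniformly
    return [(1 - epsilon) * abs(p) / total + epsilon / n for p in progress]
```

A module tracking an uncontrollable distractor shows no lasting progress, so its probability decays toward epsilon / n, which is exactly the curriculum effect described above.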
If the representation of the world is disentangled, different latent v ariables encode for different degrees of freedom of the en vironment. In that case modules will correspond to distinct objects corresponding to the latent variables of this module. By monitoring its progress in controlling each of the latent v ariables the agent will discov er that latent variables that encodes for distractors cannot be controlled while latent variables encoding for other objects can be controlled. The agent will thus be able to focus its exploration on controllable latent variables, leading to better e xploration performances. Algorithmic Architectur e 3: Curiosity Driven Modular Goal Exploration Strate gy Input: Goal modules (engineered or learned): { R, P i , γ ( ·| i ) , C i } , Meta-Policy Π , History H 1 begin 2 for A fixed number of Bootstrapping iter ations do 3 Observe conte xt c 4 Sample θ ∼ U ( − 1 , 1) 5 Perform experiment and retrie ve observ ation o 6 Append ( c, θ , o ) to H 7 Initialize Meta-Policy Π with history H 8 Initialize module sampling probability p = U ( n mod ) 9 for A fixed number of Exploration iter ations do 10 Observe conte xt c 11 Sample a module i ∼ p 12 Sample a goal for module i , τ ∼ γ ( ·| i ) 13 Compute θ using Meta-Policy Π on tuple ( c, τ , i ) 14 Perform experiment and retrie ve observ ation o 15 Append ( c, θ , o ) to H 16 Update Meta-Policy Π with ( c, θ , o ) 17 Update module sampling probability p to follow learning progress 18 retur n The history H 13 Figure 7: A roll-out of experiment in the Arm-2-Balls en vironment. The blue ball can be grasped and mo ved, while the orange one is a distractor that can not be handled, and follo ws a random w alk. (a) Small exploration noise ( σ = 0 . 05 ) (b) Large e xploration noise ( σ = 0 . 1 ) Figure 8: Exploration ratio during e xploration for different e xploration noises. 6 . 6 . 
2 Results on Arm-2-Balls

The ideas of modular IMGEPs were tested in the Arm-2-Balls environment, described below.

Arm-2-Balls: the environment consists of a rotating 7-joint robotic arm that evolves in a scene containing two balls of different sizes, as represented in Figure 7. One ball can be grasped and moved around the scene by the robotic arm. The other ball acts as a distractor: it cannot be grasped or moved by the robotic arm but follows a random walk. The agent perceives the scene as a 64×64-pixel image.

For the representation learning phase we used a Variational Auto-Encoder (VAE) for the entangled representation and a β-VAE for the disentangled representation. β-VAEs are a variant of VAEs that have been argued to have better disentanglement properties (Higgins et al., 2016; 2017a;b). To train the representation, we generated a dataset of images for which the positions of the two balls were uniformly distributed over [-1, 1]^4. This dataset was then used to learn a representation using a VAE or a β-VAE. In order to test the impact of disentanglement on the performance of the exploration algorithms, we used the same disentangled/entangled representation for all the instantiations of the exploration algorithms. This allowed us to study the effect of disentangled representations by eliminating the variance due to the inherent difficulty of learning such representations.

3 Scatter plots in the Arm-2-Balls environment

Examples of exploration curves obtained with all the exploration algorithms discussed in this paper are shown in Figure 9 (algorithms with an engineered features representation) and Figure 10 (algorithms with learned goal spaces). It is clear that the random parameterization exploration algorithm fails to produce a wide variety of observations.
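The only difference between the VAE and β-VAE objectives used above is the weight β on the KL term. A minimal NumPy sketch of the loss, assuming a Bernoulli reconstruction term and a diagonal Gaussian posterior (illustrative, not the training code used in the paper):

```python
import numpy as np

def beta_vae_loss(recon, x, mu, logvar, beta=4.0):
    """Negative ELBO with the KL term weighted by beta. beta > 1 is argued
    to encourage disentanglement (Higgins et al., 2017); beta = 1 recovers
    the standard VAE objective used for the entangled representation."""
    eps = 1e-8  # numerical stability for the log terms
    # Bernoulli reconstruction term (binary cross-entropy over pixels).
    recon_loss = -np.sum(x * np.log(recon + eps)
                         + (1 - x) * np.log(1 - recon + eps))
    # KL divergence between N(mu, exp(logvar)) and the N(0, I) prior.
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
    return recon_loss + beta * kl
```

Raising β trades reconstruction fidelity for a more factorized latent code, which is what makes the β-VAE latents easier to group into per-object modules.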
Although the random goal exploration algorithms perform much better than the random parameterization algorithm, they tend to produce observations that are cluttered in a small region of the space. On the other hand, the observations obtained with modular goal exploration algorithms are scattered over all the accessible space, with the exception of the case where the goal space is entangled (VAE).

Figure 9: Examples of achieved observations together with the ratio of covered cells in the Arm-2-Balls environment for the RPE, RGE-EFR and MGE-EFR exploration algorithms: (a) Random Parameterization Exploration (RPE); (b) Random Goal Exploration with Engineered Features Representation (RGE-EFR); (c) Modular Goal Exploration with Engineered Features Representation (MGE-EFR). The number of times the ball was effectively handled is also represented.

Figure 10: Examples of achieved observations together with the ratio of covered cells in the Arm-2-Balls environment for MGE and RGE exploration algorithms using learned goal spaces (VAE and β-VAE): (a) Random Goal Exploration with an entangled representation (VAE) as a goal space (RGE-VAE); (b) Modular Goal Exploration with an entangled representation (VAE) as a goal space (MGE-VAE); (c) Random Goal Exploration with a disentangled representation (β-VAE) as a goal space (RGE-β-VAE); (d) Modular Goal Exploration with a disentangled representation (β-VAE) as a goal space (MGE-β-VAE). The number of times the ball was effectively handled is also represented.
