Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models
Authors: Bradly C. Stadie, Sergey Levine, Pieter Abbeel
Under review as a conference paper at ICLR 2016

ABSTRACT

Achieving efficient and scalable exploration in complex domains poses a major challenge in reinforcement learning. While Bayesian and PAC-MDP approaches to the exploration problem offer strong formal guarantees, they are often impractical in higher dimensions due to their reliance on enumerating the state-action space. Hence, exploration in complex domains is often performed with simple epsilon-greedy methods. In this paper, we consider the challenging Atari games domain, which requires processing raw pixel inputs and delayed rewards. We evaluate several more sophisticated exploration strategies, including Thompson sampling and Boltzmann exploration, and propose a new exploration method based on assigning exploration bonuses from a concurrently learned model of the system dynamics. By parameterizing our learned model with a neural network, we are able to develop a scalable and efficient approach to exploration bonuses that can be applied to tasks with complex, high-dimensional state spaces. In the Atari domain, our method provides the most consistent improvement across a range of games that pose a major challenge for prior methods. In addition to raw game scores, we also develop an AUC-100 metric for the Atari Learning domain to evaluate the impact of exploration on this benchmark.

1 INTRODUCTION

In reinforcement learning (RL), agents acting in unknown environments face the exploration versus exploitation tradeoff.
Without adequate exploration, the agent might fail to discover effective control strategies, particularly in complex domains. Both PAC-MDP algorithms, such as MBIE-EB [1], and Bayesian algorithms such as Bayesian Exploration Bonuses (BEB) [2] have managed this tradeoff by assigning exploration bonuses to novel states. In these methods, the novelty of a state-action pair is derived from the number of times an agent has visited that pair. While these approaches offer strong formal guarantees, their requirement of an enumerable representation of the agent's environment renders them impractical for large-scale tasks. As such, exploration in large RL tasks is still most often performed using simple heuristics, such as the epsilon-greedy strategy [3], which can be inadequate in more complex settings.

In this paper, we evaluate several exploration strategies that can be scaled up to complex tasks with high-dimensional inputs. Our results show that Boltzmann exploration and Thompson sampling significantly improve on the naive epsilon-greedy strategy. However, we show that the biggest and most consistent improvement can be achieved by assigning exploration bonuses based on a learned model of the system dynamics with learned representations. To that end, we describe a method that learns a state representation from observations, trains a dynamics model using this representation concurrently with the policy, and uses the misprediction error in this model to assess the novelty of each state. Novel states are expected to disagree more strongly with the model than those states that have been visited frequently in the past, and assigning exploration bonuses based on this disagreement can produce rapid and effective exploration.

Using learned model dynamics to assess a state's novelty presents several challenges.
Capturing an adequate representation of the agent's environment for use in dynamics predictions can be accomplished by training a model to predict the next state from the previous ground-truth state-action pair. However, one would not expect pixel intensity values to adequately capture the salient features of a given state space. To provide a more suitable representation of the system's state space, we propose a method for encoding the state space into lower-dimensional domains. To achieve sufficient generality and scalability, we modeled the system's dynamics with a deep neural network. This allows for on-the-fly learning of a model representation that can easily be trained in parallel with the policy.

Our main contribution is a scalable and efficient method for assigning exploration bonuses in large RL problems with complex observations, as well as an extensive empirical evaluation of this approach and other simple alternative strategies, such as Boltzmann exploration and Thompson sampling. Our approach assigns model-based exploration bonuses from learned representations and dynamics, using only the observations and actions. It can scale to large problems where Bayesian approaches to exploration become impractical, and we show that it achieves significant improvement in learning speed on the task of learning to play Atari games from raw images [24]. Our approach achieves state-of-the-art results on a number of games, and achieves particularly large improvements for games on which human players strongly outperform prior methods. Aside from achieving a high final score, our method also achieves substantially faster learning. To evaluate the speed of the learning process, we propose the AUC-100 benchmark to evaluate learning progress on the Atari domain.
2 PRELIMINARIES

We consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple (S, A, P, R, ρ_0, γ), where S is a finite set of states, A a finite set of actions, P : S × A × S → R the transition probability distribution, R : S → R the reward function, ρ_0 an initial state distribution, and γ ∈ (0, 1) the discount factor. We are interested in finding a policy π : S × A → [0, 1] that maximizes the expected reward over all time. This maximization can be accomplished using a variety of reinforcement learning algorithms.

In this work, we are concerned with online reinforcement learning, wherein the algorithm receives a tuple (s_t, a_t, s_{t+1}, r_t) at each step. Here, s_t ∈ S is the previous state, a_t ∈ A is the previous action, s_{t+1} ∈ S is the new state, and r_t is the reward collected as a result of this transition. The reinforcement learning algorithm must use this tuple to update its policy and maximize long-term reward and then choose the new action a_{t+1}. It is often insufficient to simply choose the best action based on previous experience, since this strategy can quickly fall into a local optimum. Instead, the learning algorithm must perform exploration.

Prior work has suggested methods that address the exploration problem by acting with "optimism under uncertainty." If one assumes that the reinforcement learning algorithm will tend to choose the best action, it can be encouraged to visit state-action pairs that it has not frequently seen by augmenting the reward function to deliver a bonus for visiting novel states. This is accomplished with the augmented reward function

    R_Bonus(s, a) = R(s, a) + β N(s, a),    (1)

where N(s, a) : S × A → [0, 1] is a novelty function designed to capture the novelty of a given state-action pair. Prior work has suggested a variety of different novelty functions, e.g., [1, 2], based on state visitation frequency.
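For concreteness, a count-based bonus in the spirit of this prior work can be sketched as follows. This is a hypothetical illustration only, not the exact bonus of [1, 2] and not the method of this paper: the helper `make_count_novelty`, the 1/n decay, and the β values are our own illustrative choices.

```python
from collections import defaultdict

def make_count_novelty():
    """Novelty from state-action visitation counts (illustrative only;
    the paper replaces counts with a model-prediction-error novelty)."""
    counts = defaultdict(int)

    def novelty(s, a):
        counts[(s, a)] += 1
        # More visits -> lower novelty, mapped into [0, 1].
        return 1.0 / counts[(s, a)]

    return novelty

def augmented_reward(r, s, a, novelty, beta=0.05):
    # Equation (1): R_Bonus(s, a) = R(s, a) + beta * N(s, a)
    return r + beta * novelty(s, a)

nov = make_count_novelty()
print(augmented_reward(1.0, "s0", "left", nov, beta=0.5))  # 1.5 on the first visit
```

Such a table of counts is exactly what becomes infeasible when the state space cannot be enumerated, which motivates the learned-model novelty introduced below.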
While such methods offer a number of appealing guarantees, such as near-Bayesian exploration in polynomial time [2], they require a concise, often discrete representation of the agent's state-action space to measure state visitation frequencies. In our approach, we will employ function approximation and representation learning to devise an alternative to these requirements.

3 MODEL LEARNING FOR EXPLORATION BONUSES

We would like to encourage agent exploration by giving the agent exploration bonuses for visiting novel states. Identifying states as novel requires that we supply some representation of the agent's state space, as well as a mechanism to use this representation to assess novelty. Unsupervised learning methods offer one promising avenue for acquiring a concise representation of the state with a good similarity metric. This can be accomplished using dimensionality reduction, clustering, or graph-based techniques [4, 5]. In our work, we draw on recent developments in representation learning with neural networks, as discussed in the following section.

Algorithm 1 Reinforcement learning with model prediction exploration bonuses
 1: Initialize max_e = 1, EpochLength, β, C
 2: for iteration t in T do
 3:   Observe (s_t, a_t, s_{t+1}, R(s_t, a_t))
 4:   Encode the observations to obtain σ(s_t) and σ(s_{t+1})
 5:   Compute e(s_t, a_t) = ‖σ(s_{t+1}) − M_φ(σ(s_t), a_t)‖₂² and ē(s_t, a_t) = e(s_t, a_t) / max_e
 6:   Compute R_Bonus(s_t, a_t) = R(s_t, a_t) + β · ē(s_t, a_t) / (t · C)
 7:   if e(s_t, a_t) > max_e then
 8:     max_e = e(s_t, a_t)
 9:   end if
10:   Store (s_t, a_t, R_Bonus) in a memory bank Ω
11:   Pass Ω to the reinforcement learning algorithm to update π
12:   if t mod EpochLength == 0 then
13:     Use Ω to update M
14:     Optionally, update σ
15:   end if
16: end for
17: return optimized policy π
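Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not the paper's implementation: `env_step`, `sigma`, `M`, `rl_update`, and `retrain_model` are placeholder callables standing in for the environment, the encoder σ, the dynamics model M_φ, the policy update, and the model retraining, and the default constants are illustrative.

```python
import numpy as np

def run_exploration_bonus(env_step, sigma, M, rl_update, retrain_model,
                          T=100, beta=0.05, C=1.0, epoch_len=10):
    """Sketch of Algorithm 1 with placeholder callables for all components."""
    max_e = 1.0
    memory = []  # memory bank Omega
    for t in range(1, T + 1):
        s, a, s_next, r = env_step(t)
        e = np.sum((sigma(s_next) - M(sigma(s), a)) ** 2)  # squared L2 error
        e_bar = e / max_e                                   # normalized error
        r_bonus = r + beta * e_bar / (t * C)                # line 6 of Algorithm 1
        if e > max_e:
            max_e = e
        memory.append((s, a, r_bonus))
        rl_update(memory)
        if t % epoch_len == 0:
            retrain_model(memory)  # optionally also retrain sigma here
    return memory
```

With a stand-in model whose error stays constant, the stored bonus decays as β/(t·C), reflecting the 1/t factor in line 6.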
However, even with a good learned state representation, maintaining a table of visitation frequencies becomes impractical for complex tasks. Instead, we learn a model of the task dynamics that can be used to assess the novelty of a new state. Formally, let σ(s) denote the encoding of the state s, and let M_φ : σ(S) × A → σ(S) be a dynamics predictor parameterized by φ. M_φ takes an encoded version of a state s at time t and the agent's action at time t and attempts to predict an encoded version of the agent's state at time t + 1. The parameterization of M is discussed further in the next section. For each state transition (s_t, a_t, s_{t+1}), we can attempt to predict σ(s_{t+1}) from (σ(s_t), a_t) using our predictive model M_φ. This prediction will have some error

    e(s_t, a_t) = ‖σ(s_{t+1}) − M_φ(σ(s_t), a_t)‖₂².    (2)

Let ē_T, the normalized prediction error at time T, be given by ē_T := e_T / max_{t ≤ T} {e_t}. We can assign a novelty function to (s_t, a_t) via

    N(s_t, a_t) = ē_t(s_t, a_t) / (t · C),    (3)

where C > 0 is a decay constant. We can now realize our augmented reward function as

    R_Bonus(s, a) = R(s, a) + β · ē_t(s_t, a_t) / (t · C).    (4)

This approach is motivated by the idea that, as our ability to model the dynamics of a particular state-action pair improves, we have come to understand the state better and hence its novelty is lower. When we do not understand the state-action pair well enough to make accurate predictions, we assume that more knowledge about that particular area of the model dynamics is needed and hence a higher novelty measure is assigned. Using learned model dynamics to assign novelty functions allows us to address the exploration versus exploitation problem in a non-greedy way.
With an appropriate representation σ(s_t), even when we encounter a new state-action pair (s_t, a_t), we expect M_φ(σ(s_t), a_t) to be accurate so long as enough similar state-action pairs have been encountered.

Our model-based exploration bonuses can be incorporated into any online reinforcement learning algorithm that updates the policy based on state, action, reward tuples of the form (s_t, a_t, s_{t+1}, r_t), such as Q-learning or actor-critic algorithms. Our method is summarized in Algorithm 1. At each step, we receive a tuple (s_t, a_t, s_{t+1}, R(s_t, a_t)) and compute the Euclidean distance between the encoded state σ(s_{t+1}) and the prediction made by our model M_φ(σ(s_t), a_t). This is used to compute the exploration-augmented reward R_Bonus using Equation (4). The tuples (s_t, a_t, s_{t+1}, R_Bonus) are stored in a memory bank Ω at the end of every step. Every step, the policy is updated.¹ Once per epoch, corresponding to 50,000 observations in our implementation, the dynamics model M_φ is updated to improve its accuracy. If desired, the representation encoder σ can also be updated at this time. We found retraining σ once every 5 epochs to be sufficient.

This approach is modular and compatible with any representation of σ and M, as well as any reinforcement learning method that updates its policy based on a continuous stream of observation, action, reward tuples. Incorporating exploration bonuses does make the reinforcement learning task nonstationary, though we did not find this to be a major issue in practice, as shown in our experimental evaluation. In the following section, we discuss the particular choices for σ and M that we use for learning policies for playing Atari games from raw images.
4 DEEP LEARNING ARCHITECTURES

Though the dynamics model M_φ and the encoder σ from the previous section can be parametrized by any appropriate method, we found that using deep neural networks for both achieved good empirical results on the Atari games benchmark. In this section, we discuss the particular networks used in our implementation.

4.1 AUTOENCODERS

The most direct way of learning a dynamics model is to directly predict the state at the next time step, which in the Atari games benchmark corresponds to the next frame's pixel intensity values. However, directly predicting these pixel intensity values is unsatisfactory, since we do not expect pixel intensity to capture the salient features of the environment in a robust way. In our experiments, a dynamics model trained to predict raw frames exhibited extremely poor behavior, assigning exploration bonuses in near equality at most time steps, as discussed in our experimental results section.

To overcome these difficulties, we seek a function σ which encodes a lower-dimensional representation of the state s. For the task of representing Atari frames, we found that an autoencoder could be used to successfully obtain an encoding function σ and achieve dimensionality reduction and feature extraction [6]. Our autoencoder has 8 hidden layers, followed by a Euclidean loss layer, which computes the distance between the output features and the original input image. The hidden layers are reduced in dimension until maximal compression occurs with 128 units. After this, the activations are decoded by passing through hidden layers of increasingly large size.

Figure 1: Left: Autoencoder used on the input space. The circle denotes the hidden layer that was extracted and utilized as input for dynamics learning. Right: Model learning architecture.
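The hourglass shape of such an encoder can be illustrated with an untrained numpy forward pass. The layer widths below are assumptions (the paper specifies 8 hidden layers with maximal compression at 128 units, but not every width), and the random weights serve only to demonstrate how an encoding σ(s_t) is read off an intermediate layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layer widths: input, 8 hidden layers (bottleneck 128 at the
# fifth hidden layer), and a reconstruction output the size of the input.
widths = [84 * 84, 1024, 512, 384, 256, 128, 256, 384, 512, 84 * 84]

# Random untrained weights, purely to demonstrate the forward-pass shapes.
weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(widths[:-1], widths[1:])]

def forward(x, n_layers):
    """Pass x through the first n_layers layers (ReLU activations)."""
    for W in weights[:n_layers]:
        x = np.maximum(x @ W, 0.0)
    return x

frame = rng.random(84 * 84)
encoding = forward(frame, 6)  # sigma(s): output of the sixth hidden layer
recon = forward(frame, 9)     # full reconstruction, same size as the input
print(encoding.shape, recon.shape)
```

Note that, as discussed below, the encoding is taken from the sixth hidden layer rather than from the 128-unit bottleneck itself.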
¹ In our implementation, the memory bank Ω is used to retrain the RL algorithm via experience replay once per epoch (50,000 steps). Hence, 49,999 of these policy updates will simply do nothing.

We train the network on a set of 250,000 images and test on a further set of 25,000 images. We compared two separate methodologies for capturing these images:

1. Static AE: A random agent plays for enough time to collect the required images. The autoencoder σ is trained offline before the policy learning algorithm begins.

2. Dynamic AE: Initialize with an epsilon-greedy strategy and collect images and actions while the agent acts under the policy learning algorithm. After 5 epochs, train the autoencoder from this data. Continue to collect data and periodically retrain the autoencoder in parallel with the policy training algorithm.

We found that the reconstructed input achieves a small but non-trivial residual on the test set regardless of which autoencoder training technique is utilized, suggesting that in both cases it learns underlying features of the state space while avoiding overfitting.

To obtain a lower-dimensional representation of the agent's state space, a snapshot of the network's first six layers is saved. The sixth layer's output (circled in Figure 1) is then utilized as an encoding for the original state space. That is, we construct an encoding σ(s_t) by running s_t through the first six hidden layers of our autoencoder and then taking the sixth layer's output to be σ(s_t). In practice, we found that using the sixth layer's output (rather than the bottleneck at the fifth layer) obtained the best model learning results. See the appendix for further discussion of this result.

4.2 MODEL LEARNING ARCHITECTURE

Equipped with an encoding σ, we can now consider the task of predicting model dynamics. For this task, a much simpler two-layer neural network M_φ suffices.
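Such a two-layer predictor, trained with a Euclidean loss, might be sketched as follows. This is a minimal numpy sketch under stated assumptions: the dimensions, one-hot action encoding, initialization, and learning rate are our own illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
enc_dim, n_actions, hidden = 256, 4, 128  # assumed sizes, not from the paper

# Two-layer predictor M_phi: input is [sigma(s_t); one-hot(a_t)].
W1 = rng.normal(0.0, 0.1, (enc_dim + n_actions, hidden))
W2 = rng.normal(0.0, 0.1, (hidden, enc_dim))

def predict(z, a):
    """Predict sigma(s_{t+1}) from the encoded state z and action index a."""
    x = np.concatenate([z, np.eye(n_actions)[a]])
    h = np.maximum(x @ W1, 0.0)  # ReLU hidden layer
    return x, h, h @ W2

def train_step(z, a, z_next, lr=1e-3):
    """One gradient step on the Euclidean loss ||sigma(s_{t+1}) - M_phi(z, a)||^2."""
    global W1, W2
    x, h, pred = predict(z, a)
    err = pred - z_next
    grad_W2 = np.outer(h, 2.0 * err)
    grad_h = (2.0 * err) @ W2.T * (h > 0)   # backprop through the ReLU
    grad_W1 = np.outer(x, grad_h)
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
    return np.sum(err ** 2)
```

Repeated calls to `train_step` on observed transitions drive the loss down, which is exactly the prediction error e(s_t, a_t) used for the exploration bonus.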
M_φ takes as input the encoded version of a state s_t at time t along with the agent's action a_t and seeks to predict the encoded next frame σ(s_{t+1}). Loss is computed via a Euclidean loss layer regressing on the ground truth σ(s_{t+1}). We find that this model initially learns a representation close to the identity function, and consequently the loss residual is similar for most state-action pairs. However, after approximately 5 epochs, it begins to learn more complex dynamics and consequently to better identify novel states. We evaluate the quality of the learned model in the appendix.

5 RELATED WORK

Exploration is an intensely studied area of reinforcement learning. Many of the pioneering algorithms in this area, such as R-max [7] and E³ [8], achieve efficient exploration that scales polynomially with the number of parameters in the agent's state space (see also [9, 10]). However, as the size of state spaces increases, these methods quickly become intractable. A number of prior methods also examine various techniques for using models and prediction to incentivize exploration [11, 12, 13, 14]. However, such methods typically operate directly on the transition matrix of a discrete MDP, and do not provide for a straightforward extension to very large or continuous spaces, where function approximation is required. A number of prior methods have also been proposed to incorporate domain-specific factors to improve exploration. Doshi-Velez et al. [15] proposed incorporating priors into policy optimization, while Lang et al. [16] developed a method specific to relational domains. Finally, Schmidhuber et al. have developed a curiosity-driven approach to exploration which uses model predictors to aid in control [17].

Several exploration techniques have been proposed that can extend more readily to large state spaces.
Among these, methods such as C-PACE [18] and metric-E³ [19] require a good metric on the state space that satisfies the assumptions of the algorithm. The corresponding representation learning issue has some parallels to the representation problem that we address by using an autoencoder, but it is unclear how the appropriate metric for the prior methods can be acquired automatically on tasks with raw sensory input, such as the Atari games in our experimental evaluation. Methods based on Monte-Carlo tree search can also scale gracefully to complex domains [20], and indeed previous work has applied such techniques to the task of playing Atari games from screen images [21]. However, this approach is computationally very intensive, and requires access to a generative model of the system in order to perform the tree search, which is not always available in online reinforcement learning. On the other hand, our method readily integrates into any online reinforcement learning algorithm.

Finally, several recent papers have focused on driving the Q-value higher. In [22], the authors use network dropout to perform Thompson sampling. In Boltzmann exploration, a positive probability is assigned to every possible action according to its expected utility and a temperature parameter [23]. Both of these methods focus on controlling Q-values rather than on model-based exploration. A comparison to both is provided in the next section.

6 EXPERIMENTAL RESULTS

We evaluate our approach on 14 games from the Arcade Learning Environment [24]. The task consists of choosing actions in an Atari emulator based on raw images of the screen. Previous work has tackled this task using Q-learning with epsilon-greedy exploration [3], as well as Monte Carlo tree search [21] and policy gradient methods [25].
We use Deep Q-Networks (DQN) [3] as the reinforcement learning algorithm within our method, and compare its performance to the same DQN method using only epsilon-greedy exploration, Boltzmann exploration, and a Thompson sampling approach. The results for 14 games in the Arcade Learning Environment are presented in Table 1. We chose those games that were particularly challenging for prior methods and ones where human experts outperform prior learning methods. We evaluated two versions of our approach: using either an autoencoder trained in advance by running epsilon-greedy Q-learning to collect data (denoted as "Static AE"), or using an autoencoder trained concurrently with the model and policy on the same image data (denoted as "Dynamic AE"). Table 1 also shows results from the DQN implementation reported in previous work, along with human expert performance on each game [3]. Note that our DQN implementation did not attain the same score on all of the games as prior work due to a shorter running time. Since we are primarily concerned with the rate of learning and not the final results, we do not consider this a deficiency. To directly evaluate the benefit of including exploration bonuses, we compare the performance of our approach primarily to our own DQN implementation, with the prior scores provided for reference.

In addition to raw game scores and learning curves, we also analyze our results on a new benchmark we have named Area Under Curve 100 (AUC-100). For each game, this benchmark computes the area under the game-score learning curve (using the trapezoid rule to approximate the integral). This area is then normalized by 100 times the maximum game score achieved in [3], which represents 100 epochs of play at the best-known levels. This metric more effectively captures improvements to the game's learning rate and does not require running the games for 1000 epochs as in [3].
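Under our reading of this definition, AUC-100 might be computed as follows (the helper name and the unit-width trapezoid convention are our assumptions):

```python
def auc_100(scores, reference_max):
    """AUC-100: area under a 100-epoch learning curve (trapezoid rule),
    normalized by 100 * the maximum score reported for DQN in [3]."""
    s = [float(v) for v in scores]
    area = sum((a + b) / 2.0 for a, b in zip(s[:-1], s[1:]))  # unit-width trapezoids
    return area / (100.0 * reference_max)

# A curve that sits at the reference score for all 100 epochs scores ~1.
print(auc_100([300.0] * 100, 300.0))  # 0.99 (99 unit trapezoids over 100 epochs)
```

A method that reaches the reference score early thus scores close to 1, while one that climbs slowly scores well below it, which is what makes the metric sensitive to learning speed.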
For this reason, we suggest it as an alternative metric to raw game score.

Bowling: The policy without exploration tended to fixate on a set pattern of knocking down six pins per frame. When bonuses were added, the dynamics learner quickly became adept at predicting this outcome, and the agent was thus encouraged to explore other release points.

Frostbite: This game's dynamics changed substantially via the addition of extra platforms as the player progressed. As the dynamics of these more complex systems was not well understood, the system was encouraged to visit them often (which required making further progress in the game).

Seaquest: A submarine must surface for air between bouts of fighting sharks. However, if the player resurfaces too soon, they will suffer a penalty with effects on the game's dynamics. Since these effects are poorly understood by the model learning algorithm, resurfacing receives a high exploration bonus, and hence the agent eventually learns to successfully resurface at the correct time.

Q*bert: Exploration bonuses resulted in a lower score. In Q*bert, the background changes color after level one. The dynamics predictor is unable to quickly adapt to such a dramatic change in the environment, and consequently exploration bonuses are assigned in near equality to almost every state that is visited. This negatively impacts the final policy.

Learning curves for each of the games are shown in Figure 2. Note that both of the exploration bonus algorithms learn significantly faster than epsilon-greedy Q-learning, and often continue learning even after the epsilon-greedy strategy converges. All games had their inputs normalized according to [3] and were run for 100 epochs (where one epoch is 50,000 time steps).

Figure 2: Full learning curves and AUC-100 scores for all Atari games. We present the raw AUC-100 scores in the appendix.
Between each epoch, the policy was updated, and then the new policy underwent 10,000 time steps of testing. The results represent the average testing score across three trials after 100 epochs each.

| Game | DQN (100 epochs) | Exploration Static AE (100 epochs) | Exploration Dynamic AE (100 epochs) | Boltzmann Exploration (100 epochs) | Thompson Sampling (100 epochs) | DQN [3] (1000 epochs) | Human Expert [3] |
|---|---|---|---|---|---|---|---|
| Alien | 1018 | 1436 | 1190 | 1301 | 1322 | 3069 | 6875 |
| Asteroids | 1043 | 1486 | 939 | 1287 | 812 | 1629 | 13157 |
| Bank Heist | 102 | 131 | 95 | 101 | 129 | 429.7 | 734.4 |
| Beam Rider | 1604 | 1520 | 1640 | 1228 | 1361 | 6846 | 5775 |
| Bowling | 68.1 | 130 | 133 | 113 | 85.2 | 42.4 | 154.8 |
| Breakout | 146 | 162 | 178 | 219 | 222 | 401.2 | 31.8 |
| Enduro | 281 | 264 | 277 | 284 | 236 | 301.8 | 309.6 |
| Freeway | 10.5 | 10.5 | 12.5 | 13.9 | 12.0 | 30.3 | 29.6 |
| Frostbite | 369 | 649 | 380 | 605 | 494 | 328.3 | 4335 |
| Montezuma | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4367 |
| Pong | 17.6 | 18.5 | 18.2 | 18.2 | 18.2 | 18.9 | 9.3 |
| Q*bert | 4649 | 3291 | 3263 | 4014 | 3251 | 10596 | 13455 |
| Seaquest | 2106 | 2636 | 4472 | 3808 | 1337 | 5286 | 20182 |
| Space Invaders | 634 | 649 | 716 | 697 | 459 | 1976 | 1652 |

Table 1: A comparison of maximum scores achieved by different methods. Static AE trains the state-space autoencoder on 250,000 raw game frames prior to policy optimization (raw frames are taken from random agent play). Dynamic AE retrains the autoencoder after each epoch, using the last 250,000 images as a training set. Note that exploration bonuses help us to achieve state-of-the-art results on Bowling and Frostbite; each of these games provides a significant exploration challenge. Bolded numbers indicate the best-performing score among our experiments. Note that this score is sometimes lower than the score reported for DQN in prior work, as our implementation ran only one-tenth as long as in [3].

The results show that more nuanced exploration strategies generally improve on the naive epsilon-greedy approach, with the Boltzmann and Thompson sampling methods achieving the best results on three of the games.
However, exploration bonuses achieve the fastest learning and the best results most consistently, outperforming the other three methods on 7 of the 14 games in terms of AUC-100.

7 CONCLUSION

In this paper, we evaluated several scalable and efficient exploration algorithms for reinforcement learning in tasks with complex, high-dimensional observations. Our results show that a new method based on assigning exploration bonuses most consistently achieves the largest improvement on a range of challenging Atari games, particularly those on which human players outperform prior learning methods. Our exploration method learns a model of the dynamics concurrently with the policy. This model predicts a learned representation of the state, and a function of this prediction error is added to the reward as an exploration bonus to encourage the policy to visit states with high novelty.

One of the limitations of our approach is that the misprediction error metric assumes that any misprediction in the state is caused by inaccuracies in the model. While this is true in deterministic environments, stochastic dynamics violate this assumption. An extension of our approach to stochastic systems requires a more nuanced treatment of the distinction between stochastic dynamics and uncertain dynamics, which we hope to explore in future work. Another intriguing direction for future work is to examine how the learned dynamics model can be incorporated into the policy learning process, beyond just providing exploration bonuses. This could in principle enable substantially faster learning than purely model-free approaches.

REFERENCES

[1] A. L. Strehl and M. L. Littman. An Analysis of Model-Based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, 74, 1209–1331.
[2] J. Z. Kolter and A. Y. Ng. Near-Bayesian Exploration in Polynomial Time. Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1–8, 2009.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, et al. Human-level Control Through Deep Reinforcement Learning. Nature, 518(7540):529–533, 2015.
[4] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, 2001.
[5] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. JMLR, vol. 7, pp. 2399–2434, Nov. 2006.
[6] G. E. Hinton and R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313, 504–507, 2006.
[7] R. I. Brafman and M. Tennenholtz. R-max, a General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research, 2002.
[8] M. Kearns and D. Koller. Efficient Reinforcement Learning in Factored MDPs. Proc. IJCAI, 1999.
[9] M. Kearns and S. Singh. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning Journal, 2002.
[10] W. D. Smart and L. P. Kaelbling. Practical Reinforcement Learning in Continuous Spaces. Proc. ICML, 2000.
[11] J. Sorg, S. Singh, and R. L. Lewis. Variance-Based Rewards for Approximate Bayesian Reinforcement Learning. Proc. UAI, 2010.
[12] M. Lopes, T. Lang, M. Toussaint, and P.-Y. Oudeyer. Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress. NIPS, 2012.
[13] M. Geist and O. Pietquin. Managing Uncertainty within Value Function Approximation in Reinforcement Learning. Workshop on Active Learning and Experimental Design, 2010.
[14] M. Araya, O. Buffet, and V. Thomas. Near-Optimal BRL Using Optimistic Local Transitions. ICML, 2012, pp. 97–104.
[15] F. Doshi-Velez, D. Wingate, N. Roy, and J. Tenenbaum. Nonparametric Bayesian Policy Priors for Reinforcement Learning. NIPS, 2014.
[16] T. Lang, M. Toussaint, and K. Kersting. Exploration in Relational Domains for Model-Based Reinforcement Learning. Proc. AAMAS, 2014.
[17] J. Schmidhuber. Developmental Robotics, Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts. Connection Science, vol. 18(2), pp. 173–187, 2006.
[18] J. Pazis and R. Parr. PAC Optimal Exploration in Continuous Space Markov Decision Processes. Proc. AAAI, 2013.
[19] S. Kakade, M. Kearns, and J. Langford. Exploration in Metric State Spaces. Proc. ICML, 2003.
[20] A. Guez, D. Silver, and P. Dayan. Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search. NIPS, 2014.
[21] X. Guo, S. Singh, H. Lee, R. Lewis, and X. Wang. Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. NIPS, 2014.
[22] Y. Gal and Z. Ghahramani. Dropout as a Bayesian Approximation: Insights and Applications. Deep Learning Workshop, ICML.
[23] D. Carmel and S. Markovitch. Exploration Strategies for Model-based Learning in Multi-agent Systems. Autonomous Agents and Multi-Agent Systems, vol. 2(2), pp. 141–172.
[24] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents. JAIR, vol. 47, pp. 235–279, June 2013.
[25] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust Region Policy Optimization. arXiv preprint 1502.05477.

8 APPENDIX

8.1 ON AUTOENCODER LAYER SELECTION

Recall that we trained an autoencoder to encode the game's state space. We then trained a predictive model on the next autoencoded frame rather than directly training on the pixel intensity values of the next frame.
To obtain the encoded space, we ran each state through an eight-layer autoencoder for training and then utilized the autoencoder's sixth layer as an encoded state space. We chose to use the sixth layer rather than the bottleneck fourth layer because we found that, over 20 iterations of Seaquest at 100 epochs per iteration, using this layer for encoding delivered measurably better performance than using the bottleneck layer. The results of that experiment are presented below.

Figure 3: Game score averaged over 20 Seaquest iterations with various choices for the state-space encoding layer. Notice that choosing the sixth layer to encode the state space significantly outperformed the bottleneck layer.

8.2 ON THE QUALITY OF THE LEARNED MODEL DYNAMICS

Evaluating the quality of the learned dynamics model is somewhat difficult because the system is rewarded for achieving higher error rates: a dynamics model that converges quickly is not useful for exploration bonuses. Nevertheless, when we plot the mean of the normalized residuals across all games and all trials used in our experiments, we see that the errors of the learned dynamics models continually decrease over time. The mean normalized residual after 100 epochs is approximately half of the maximal mean achieved. This suggests that each dynamics model was able to correctly learn properties of the underlying dynamics for its given game.

Figure 4: Normalized dynamics model prediction residual across all trials of all games. Note that the dynamics model is retrained from scratch for each trial.

8.3 RAW AUC-100 SCORES

| Game | DQN | Exploration Static AE | Exploration Dynamic AE | Boltzmann Exploration | Thompson Sampling |
|---|---|---|---|---|---|
| Alien | 0.153 | 0.198 | 0.171 | 0.187 | 0.204 |
| Asteroids | 0.259 | 0.415 | 0.254 | 0.456 | 0.223 |
| Bank Heist | 0.0715 | 0.1459 | 0.089 | 0.089 | 0.1303 |
| Beam Rider | 0.1122 | 0.0919 | 0.1112 | 0.0817 | 0.0897 |
| Bowling | 0.964 | 1.493 | 1.836 | 1.338 | 1.122 |
| Breakout | 0.191 | 0.202 | 0.192 | 0.294 | 0.254 |
| Enduro | 0.518 | 0.495 | 0.589 | 0.538 | 0.466 |
| Freeway | 0.206 | 0.213 | 0.295 | 0.313 | 0.228 |
| Frostbite | 0.573 | 0.971 | 0.622 | 0.928 | 0.746 |
| Montezuma | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Pong | 0.52 | 0.56 | 0.424 | 0.612 | 0.612 |
| Q*bert | 0.155 | 0.104 | 0.121 | 0.13 | 0.127 |
| Seaquest | 0.16 | 0.172 | 0.265 | 0.194 | 0.174 |
| Space Invaders | 0.205 | 0.183 | 0.219 | 0.183 | 0.146 |

Table 2: AUC-100 is computed by comparing the area under the game-score learning curve for 100 epochs of play to the area of the rectangle with dimensions 100 by the maximum DQN score the game achieved in [3]. The integral is approximated with the trapezoid rule. This more holistically captures the game's learning rate and does not require running the games for 1000 epochs as in [3]. For this reason, we suggest it as an alternative metric to raw game score.