Learning sparse representations in reinforcement learning

Jacob Rafati, David C. Noelle
Electrical Engineering and Computer Science
Computational Cognitive Neuroscience Laboratory
University of California, Merced
5200 North Lake Road, Merced, CA 95343 USA.

Abstract

Reinforcement learning (RL) algorithms allow artificial agents to improve their selection of actions so as to increase rewarding experiences in their environments. Temporal Difference (TD) Learning – a model-free RL method – is a leading account of the roles of the midbrain dopamine system and the basal ganglia in reinforcement learning. These algorithms typically learn a mapping from the agent's current sensed state to a selected action (known as a policy function) by learning a value function (expected future rewards). TD Learning methods have been very successful on a broad range of control tasks, but learning can become intractably slow as the state space of the environment grows. This has motivated methods that learn internal representations of the agent's state, effectively reducing the size of the state space and restructuring state representations so as to support generalization. However, TD Learning coupled with an artificial neural network as a function approximator has been shown to fail to learn some fairly simple control tasks, challenging this explanation of reward-based learning. We hypothesize that such failures do not arise in the brain because of the ubiquitous presence of lateral inhibition in the cortex, which produces sparse distributed internal representations that support the learning of expected future reward. Sparse conjunctive representations can avoid catastrophic interference while still supporting generalization. We provide support for this conjecture through computational simulations, demonstrating the benefits of learned sparse representations on three problematic classic control tasks: Puddle-world, Mountain-car, and Acrobot.
Keywords: Reinforcement learning, Temporal Difference Learning, Learning representations, Sparse representations, Lateral inhibition, Catastrophic interference, Generalization, Midbrain dopamine system, k-Winners-Take-All (kWTA), SARSA

Email addresses: jrafatiheravi@ucmerced.edu (Jacob Rafati), dnoelle@ucmerced.edu (David C. Noelle). URL: http://rafati.net/ (Jacob Rafati)

1. Introduction

Reinforcement learning (RL) – a class of machine learning problems – is learning how to map situations to actions so as to maximize numerical reward signals received during the experiences that an artificial agent has as it interacts with its environment (Sutton and Barto, 1998). An RL agent must be able to sense the state of its environment and must be able to take actions that affect that state. The agent may also be seen as having a goal (or goals) related to the state of the environment. The capability of humans and non-human animals to learn highly complex skills by reinforcing appropriate behaviors with reward, and the role of the midbrain dopamine system in reward-based learning, have been well described by a class of model-free RL methods called Temporal Difference (TD) Learning (Montague et al., 1996; Schultz et al., 1997). While TD Learning, by itself, certainly does not explain all observed RL phenomena, increasing evidence suggests that it is key to the brain's adaptive nature (Dayan and Niv, 2008).

One of the challenges that arises in applying RL to real-world problems is that the state space can be very large. This is a version of what has classically been called the curse of dimensionality. Nonlinear function approximators coupled with reinforcement learning have made it possible to learn abstractions over high-dimensional state spaces. Formally, such a function approximator is a parameterized equation that maps from state to value, where the parameters can be constructively optimized based on the experiences of the agent.
One common function approximator is an artificial neural network, with the parameters being the connection weights in the network. Choosing the right structure for the value function approximator, as well as a proper method for learning representations, is crucial for robust and successful learning in TD (Rafati Heravi, 2019; Rafati and Marcia, 2019; Rafati and Noelle, 2019a).

Successful examples of using neural networks for RL include learning how to play the game of Backgammon at the Grand Master level (Tesauro, 1995). More recently, researchers at DeepMind Technologies used deep convolutional neural networks (CNNs) to learn how to play some ATARI games from raw video data (Mnih et al., 2015). The resulting performance on the games was frequently at or better than the human expert level. In another effort, DeepMind used deep CNNs and a Monte Carlo Tree Search algorithm that combines supervised learning and reinforcement learning to learn how to play the game of Go at a super-human level (Silver et al., 2016).

1.1. Motivation for the research

Despite these successful examples, there are, surprisingly, some relatively simple problems on which TD coupled with a neural network function approximator has been shown to fail. For example, learning to navigate to a goal location in a simple two-dimensional space (see Figure 4) in which there are obstacles has been shown to pose a substantial challenge to TD Learning using a backpropagation neural network (Boyan and Moore, 1995). Note that the proofs of convergence to optimal performance depend on the agent maintaining a potentially highly discontinuous value function in the form of a large look-up table, so the use of a function approximator for the value function violates the assumptions of those formal analyses. Still, it seems unusual that this approach to learning can succeed at some difficult tasks but fail at some fairly easy ones.
The power of TD Learning to explain biological RL is greatly reduced by this observation. If TD Learning fails at simple tasks that are well within the reach of humans and non-human animals, then it cannot be used to explain how the dopamine system supports such learning.

In response to Boyan and Moore (1995), Sutton (1996) showed that a TD Learning agent can learn this task by hard-wiring the hidden layer units of the backpropagation network (used to learn the value function) to implement a fixed sparse conjunctive (coarse) code of the agent's location. The specific encoding used was one that had been previously proposed in the CMAC model of the cerebellum (Albus, 1975). Each hidden unit would become active only when the agent was in a location within a small region. For any given location, only a small fraction of the hidden units displayed non-zero activity. This is what it means for the hidden representation to be a "sparse" code. Locations that were close to each other in the environment produced more overlap in the active hidden units than locations that were separated by a large distance. By ensuring that most hidden units had zero activity when connection weights were changed, this approach kept changes to the value function in one location from having a broad impact on the expected future reward at distant locations. By engineering the hidden layer representation, this RL problem was solved.

This is not a general solution, however. If the same approach were taken for another RL problem, it is quite possible that the CMAC representation would not be appropriate. Thus, the method proposed by Sutton (1996) does not help us understand how TD Learning might flexibly learn a variety of RL tasks. This approach requires prior knowledge of the kinds of internal representations of sensory state that are easily associated with expected future reward, and there are simple learning problems for which such prior knowledge is unavailable.
We hypothesize that the key feature of the Sutton (1996) approach is that it produces a sparse conjunctive code of the sensory state. Representations of this kind need not be fixed, however, but might be learned at the hidden layers of neural networks.

There is substantial evidence that sparse representations are generated in the cortex by neurons that release the transmitter GABA (O'Reilly and Munakata, 2001) via lateral inhibition. Biologically inspired models of the brain show that sparse representations in the hippocampus can minimize the overlap of representations assigned to different cortical patterns. This leads to pattern separation, avoiding catastrophic interference, but it also supports generalization by modifying the synaptic connections so that these representations can later participate jointly in pattern completion (O'Reilly and McClelland, 1994; Noelle, 2008).

Computational cognitive neuroscience models have shown that a combination of feedforward and feedback inhibition naturally produces sparse conjunctive codes over a collection of excitatory neurons (O'Reilly and Munakata, 2001). Such patterns of lateral inhibition are ubiquitous in the mammalian cortex (Kandel et al., 2012). Importantly, neural networks containing such lateral inhibition can still learn to represent input information in different ways for different tasks, retaining flexibility while producing the kind of sparse conjunctive codes that may support reinforcement learning. Sparse distributed representation schemes have the useful properties of coarse codes while reducing the likelihood of interference between different representations.

1.2. Objective of the paper

In this paper, we demonstrate how incorporating a ubiquitous feature of biological neural networks into the artificial neural networks used to approximate the value function can allow TD Learning to succeed at simple tasks that have previously challenged it.
Specifically, we show that the incorporation of lateral inhibition, producing competition between neurons so as to produce sparse conjunctive representations, can produce success in learning to approximate the value function using an artificial neural network, where only failure had been previously found. Thus, through computational simulation, we provide preliminary evidence that lateral inhibition may help compensate for a weakness of TD Learning, improving this machine learning method and further buttressing the TD Learning account of dopamine-based reinforcement learning in the brain. This paper extends our previous works, Rafati and Noelle (2015, 2017).

Figure 1: The agent/environment interaction in reinforcement learning (Sutton and Barto, 2017).

1.3. Outline of the paper

The organization of this paper is as follows. In Section 2, we provide background on the reinforcement learning problem and temporal difference learning methods. In Section 3, we introduce a method for learning sparse representations in reinforcement learning inspired by lateral inhibition in the cortex. In Section 4, we provide details concerning our computational simulations of TD Learning with lateral inhibition to solve some simple tasks that TD methods were reported in the literature to fail to learn. In Section 5, we present the results of these simulations and compare the performance of our approach to previously examined methods. Concluding remarks and future research plans can be found in Section 6.

2. Reinforcement learning

2.1. Reinforcement learning problem

The Reinforcement Learning (RL) problem is learning through interaction with an environment to achieve a goal. The learner and decision maker is called the agent, and everything outside of the agent is called the environment. The agent and the environment interact over a sequence of discrete time steps, t = 0, 1, 2, ....
At each time step, t, the agent receives a representation of the environment's state, S_t ∈ S, where S is the set of all possible states, and on that basis the agent selects an action, A_t ∈ A, where A is the set of all possible actions for the agent. One time step later, at t + 1, as a consequence of the agent's action, the agent receives a reward R_{t+1} ∈ R and also an update on the agent's new state, S_{t+1}, from the environment. Each cycle of interaction is called an experience. Figure 1 summarizes the agent/environment interaction (see (Sutton and Barto, 2017) for more details).

At each time step t, the agent implements a mapping from states to possible actions, π_t : S → A. This mapping is called the agent's policy function. The objective of the RL problem is to maximize the expected value of the return, i.e. the cumulative sum of the received rewards, defined as follows

    G_t ≜ Σ_{t'=t+1}^{T} γ^{t'−t−1} r_{t'},  0 ≤ t < T,  (1)

where T is a final step (also known as the horizon) and 0 ≤ γ ≤ 1 is a discount factor. As γ gets closer to 1, the objective takes future rewards into account more strongly, and if γ is closer to 0, the agent is only concerned with maximizing the immediate reward. The goal of the RL problem is finding an optimal policy, π*, that maximizes the expected return for each state s ∈ S

    π*(s) = argmax_π E_π[G_t | S_t = s],  (2)

where E_π[·] denotes the expected value given that the agent follows policy π. Reinforcement learning algorithms often involve the estimation of a value function that estimates how good it is to take a given action in a given state. The value of taking action a under the policy π in state s is defined as the expectation of the return starting from state s, taking action a, and then following the policy π

    Q^π(s, a) ≜ E_π[G_t | S_t = s, A_t = a].  (3)
The agent/environment interaction can be broken into subsequences, which we call episodes, such as plays of a game or any sort of repeated interaction. Each episode ends when time is over, i.e., t = T, or when the agent reaches an absorbing terminal state.

2.2. Generalization using neural networks

Reinforcement learning algorithms need to maintain an estimate of the value function Q^π(s, a), and this function could be stored as a simple look-up table. However, when the state space is large or not all states are observable, storing these estimated values in a table is no longer possible. We can, instead, use a function approximator to represent the mapping to the estimated value. A parameterized functional form with trainable parameters w can be used: q(s, a; w) ≈ Q^π(s, a). In this case, RL requires a search through the space of parameter values, w, for the function approximator. One common way to approximate the value function is to use a nonlinear functional form such as that embodied by an artificial neural network. When a change is made to the weights w based on an experience from a particular state, the changed weights will then affect the value function estimates for similar states, producing a form of generalization. Such generalization makes the learning process potentially much more powerful.

2.3. Temporal difference learning

Temporal Difference (TD) learning (Sutton, 1988) is a model-free reinforcement learning algorithm that attempts to learn a policy without learning a model of the environment. TD is a combination of Monte Carlo random sampling and dynamic programming ideas (Sutton and Barto, 1998). The goal of the TD approach is to learn to predict the value of a given state based on what happens in the next state by bootstrapping, i.e., updating values based on learned values, without waiting for a final outcome.
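The return in Eq. (1) is simply a discounted sum of the rewards collected until the end of the episode. As an illustration (a minimal sketch of ours, not the paper's Matlab code), it can be computed as:

```python
def discounted_return(rewards, gamma):
    """Discounted return G_t of Eq. (1): the sum of gamma^(t'-t-1) * r_{t'}
    over the rewards r_{t+1}, ..., r_T observed after time t."""
    g = 0.0
    for k, r in enumerate(rewards):  # k corresponds to t' - t - 1
        g += (gamma ** k) * r
    return g
```

With gamma = 0 only the immediate reward r_{t+1} survives, while gamma = 1 weights all remaining rewards equally, matching the discussion above.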
SARSA (State-Action-Reward-State-Action) is an on-policy TD algorithm that learns the action-value function (Sutton and Barto, 1998). Following the SARSA version of TD Learning (see Algorithm 1), the reinforcement learning agent is controlled in the following way. The current state of the agent, s, is provided as input to the neural network, producing as output the state-action values q(s, a; w) ≈ Q^π(s, a). The action a is selected using the ε-greedy exploration policy: with a small exploration probability, ε, these values are ignored and an action is selected uniformly at random from the possible actions a_i ∈ A; otherwise, the output unit with the highest activation level determines the action, a, to be taken, i.e.,

    a = argmax_a q(s, a; w)   with probability 1 − ε,
    a = random action from A  with probability ε.  (4)

The agent takes action a; the environment then updates the agent's state to s′ and the agent receives a reward signal, r, based on its new state, s′. The action selection process is then repeated at state s′, determining a subsequent action, a′. Before this action is taken, however, the neural network value function approximator has its connection weights updated according to the SARSA Temporal Difference (TD) Error:

    δ = r + γ q(s′, a′; w) − q(s, a; w).  (5)

The TD Error, δ, is used to construct an error signal for the backpropagation network implementing the value function. The network is given the input corresponding to s, and activation is propagated through the network. Each output unit then receives an error value. This error is set to zero for all output units except for the unit corresponding to the action that was taken, a. The selected action unit receives an error signal equal to the TD error, δ. This error value is then backpropagated through the network, using the standard backpropagation of error algorithm (Rumelhart et al., 1986), and the connection weights are updated.
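The ε-greedy rule of Eq. (4) is straightforward to implement. The sketch below (our illustration, not the paper's code) selects an index into a vector of state-action values:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Eq. (4): with probability epsilon pick a uniformly random action;
    otherwise pick the action with the highest state-action value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

Annealing ε toward zero, as done in the simulations below, gradually shifts the agent from exploration to pure exploitation.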
Assuming that the loss function is the squared TD error, i.e. L(w) ≜ δ²/2, the weights are updated by the gradient descent method as w ← w − α∇_w L(w), which can be computed as

    w ← w + α δ ∇_w q(s, a; w),  (6)

where α is the learning rate and ∇_w q is the gradient of q with respect to the parameters w. This process then repeats again, starting at state s′ and taking action a′, until s is a terminal state (goal) or the number of steps exceeds the maximum number of steps T.

Algorithm 1 SARSA: On-Policy TD Learning
  Input: policy π to be evaluated
  Initialize: q(s, a; w)
  repeat (for each episode)
    Initialize s
    Compute the state-action values q(s, ·; w)
    Choose action a given by the ε-greedy exploration policy in Eq. (4)
    repeat (for each step t of the episode)
      Take action a, observe reward r and next state s′
      Compute the state-action values q(s′, ·; w)
      Choose action a′ given by the ε-greedy policy in Eq. (4)
      Compute the TD error, δ ← r + γ q(s′, a′; w) − q(s, a; w)
      Update the parameters, w ← w + α δ ∇_w q(s, a; w)
      s ← s′, a ← a′
    until (s is terminal or the maximum number of steps T is reached)
  until (convergence or the maximum number of episodes is reached)

3. Methods for learning sparse representations

3.1. Lateral inhibition

Lateral inhibition can lead to sparse distributed representations (O'Reilly and Munakata, 2001) by making a small and relatively constant fraction of the artificial neurons active at any one time (e.g., 10% to 25%). Such representations achieve a balance between the generalization benefits of overlapping representations and the interference avoidance offered by sparse representations. Another way of viewing the sparse distributed representations produced by lateral inhibition is in terms of a balance between competition and cooperation between the neurons participating in the representation.
It is important to note that sparsity can also be produced in distributed representations by adding regularization terms to the learning loss function, providing a penalty during optimization for weights that cause too many units to be active at once (French, 1991; Zhang et al., 2015; Liu et al., 2018). This learning process is not necessary, however, when lateral inhibition is used to produce sparse distributed representations. With this method, feedforward and feedback inhibition enforce sparsity from the very beginning of training, offering the benefits of sparse distributed representations even early in the reinforcement learning process.

3.2. k-Winners-Take-All (kWTA) mechanism

Computational cognitive neuroscience models have shown that fast pooled lateral inhibition produces patterns of activation that can be roughly described as k-Winners-Take-All (kWTA) dynamics (O'Reilly and Munakata, 2001). A kWTA function ensures that approximately k units out of the n total units in a hidden layer are strongly active at any given time. Applying a kWTA function to the net input of a hidden layer gives rise to sparse distributed representations, and this happens without the need to solve any form of constrained optimization problem. The kWTA function is provided in Algorithm 2. The kWTA mechanism only requires partially sorting the net input vector of the hidden layer on each feedforward pass to find the top k + 1 active neurons. Consequently, it has at most O(n + k log k) computational time complexity using a partial quicksort algorithm, where n is the number of neurons in the largest hidden layer and k is the number of winner neurons. Here k is much smaller than n; for example, k = 0.1 × n is used for the simulations reported in this paper, and this ratio is commonly used in the literature.
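Only the k + 1 largest net inputs are needed to set the kWTA bias, so a full O(n log n) sort can be avoided. A sketch using Python's standard library (heapq.nlargest runs in roughly O(n log k) time, close to the partial-quicksort bound cited above):

```python
import heapq

def top_k_plus_1(eta, k):
    """Return the k+1 largest net input values in descending order;
    these are the only values the kWTA bias computation needs."""
    return heapq.nlargest(k + 1, eta)
```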
Algorithm 2 The k-Winners-Take-All Function
  Input η: net input to the hidden layer
  Input k: number of winner units
  Input constant parameter 0 < q < 1, e.g., q = 0.25
  Find the top k + 1 most active neurons by sorting η, and store them in η′ in descending order
  Compute the kWTA bias, b ← η′_k − q (η′_k − η′_{k+1})
  return η_kWTA ← η − b

3.3. Feedforward kWTA neural network

In order to bias a neural network toward learning sparse conjunctive codes for sensory state inputs, we constructed a variant of a backpropagation neural network architecture with a single hidden layer (see Figure 2) that utilizes the kWTA mechanism described in Algorithm 2.

Consider a continuous control task (such as Puddle-world in Figure 4) where the state of the agent is described by 2D coordinates, s = (x, y). Suppose that the agent has to choose between four available actions, A = {North, South, East, West}. Suppose that the x coordinate and the y coordinate are in the range [0, 1] and that each of the x and y ranges is discretized uniformly into n_x and n_y points, respectively. Let us denote X = [0 : 1/n_x : 1] and Y = [0 : 1/n_y : 1] as the discretized vectors, i.e., X is a vector with n_x + 1 elements from 0 to 1, with all points between them on a grid of size 1/n_x.

Figure 2: The kWTA neural network architecture: a backpropagation network with a single hidden layer equipped with the k-Winners-Take-All mechanism (from Algorithm 2). The kWTA bias is subtracted from the hidden units' net input, which causes polarized activity that supports the sparse conjunctive representation. Only 10% of the neurons in the hidden layer have high activation. Compare the population of red (winner) neurons to the orange (loser) ones.
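A direct translation of Algorithm 2 (a sketch in Python rather than the paper's Matlab, with η as a plain list):

```python
def kwta(eta, k, q=0.25):
    """Algorithm 2: compute the kWTA bias b from the k-th and (k+1)-th
    largest net inputs and subtract it from every unit, so exactly the
    k strongest units keep a positive adjusted net input."""
    top = sorted(eta, reverse=True)[:k + 1]     # eta'_1 >= ... >= eta'_{k+1}
    b = top[k - 1] - q * (top[k - 1] - top[k])  # between eta'_k and eta'_{k+1}
    return [e - b for e in eta]
```

Since 0 < q < 1, the bias b falls strictly between the k-th and (k + 1)-th largest net inputs, which is what guarantees the split into k positive winners and n − k negative losers.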
To encode a coordinate value for input to the network, a Gaussian distribution with a peak value of 1, a standard deviation of σ_x = 1/n_x for the x coordinate and σ_y = 1/n_y for the y coordinate, and a mean equal to the given continuous coordinate value, μ_x = x and μ_y = y, was used to calculate the activity of each of the X and Y input units. Let us denote the results as x and y. The input to the network is the concatenated vector s := (x, y).

We can calculate the net input values of the hidden units based on the network inputs, i.e., the weighted sum of the inputs

    η := W_ih s + b_ih,  (7)

where W_ih are the weights, and b_ih the biases, from the input layer to the hidden layer. After calculating the net input, we compute the kWTA bias, b, using Algorithm 2. We subtract b from all of the net input values, η, so that the k hidden units with the highest net input values have positive adjusted net input values, while the adjusted net input values of all of the other hidden units become negative. These adjusted net input values, i.e. η_kWTA = η − b, are transformed into unit activation values using a logistic sigmoid activation function (gain of 1, offset of −1), resulting in hidden unit activation values in the range between 0.0 and 1.0,

    h = 1 / (1 + e^{−(η_kWTA − 1)}),  (8)

with the top k units having activations above 0.27 (due to the −1 offset) and the "losing" hidden units having activations below that value. The k parameter controls the degree of sparseness of the hidden layer activation patterns, with low values producing more sparsity (i.e., fewer hidden units with high activations). In the simulations of this paper, we set k to be 10% of the total number of hidden units. The output layer of the kWTA neural network is fully connected to the hidden layer and has |A| units, with each unit i = 1, ..., |A| representing the state-action value q(s, a_i; w).
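The Gaussian input encoding just described can be sketched as follows (our illustration; the unit positions and σ = 1/n follow the description above):

```python
import math

def gaussian_code(value, n):
    """Encode a coordinate in [0, 1] over n + 1 input units placed at
    0, 1/n, ..., 1: a Gaussian bump with peak value 1, mean equal to
    the coordinate value, and standard deviation 1/n."""
    sigma = 1.0 / n
    return [math.exp(-((i / n - value) ** 2) / (2.0 * sigma ** 2))
            for i in range(n + 1)]
```

For n = 20 this reproduces 21-unit pools like those used for the Puddle-world inputs below, with nearby coordinates producing overlapping activity patterns, which is what supports generalization across neighboring states.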
We calculate these values by computing the activation in the output layer

    q = W_ho h + b_ho,  (9)

where W_ho are the weights, and b_ho the biases, from the hidden layer to the output layer. The output units use a linear activation function, so q is a vector of the state-action values.

In addition to encouraging sparse distributed representations, this kWTA mechanism has two properties that are worthy of note. First, introducing this highly nonlinear mechanism violates some of the assumptions relating the backpropagation of error procedure to stochastic gradient descent in error. Thus, the connection weight changes recommended by the backpropagation procedure may slightly deviate from those which would lead to local error minimization in this network. We opted to ignore this discrepancy, however, trusting that a sufficiently small learning rate would keep these deviations small. Second, it is worth noting that this particular kWTA mechanism allows for a distributed pattern of activity over the hidden units, making use of intermediate levels of activation. This provides the learning algorithm with some flexibility, allowing for a graded range of activation levels when doing so reduces network error. As connection weights from the inputs to the hidden units grow in magnitude, however, this mechanism will drive the activations of the top k hidden units closer to 1 and the others closer to 0. Indeed, an examination of the hidden layer activation patterns in the kWTA-equipped networks used in this study revealed that the k winning units consistently had activity levels close to the maximum possible value once the learning process was complete.

4. Experiments and simulation tasks
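Putting Eqs. (7)-(9) together with Algorithm 2, the forward pass of the kWTA network can be sketched as follows (a plain-Python illustration of ours, with weight matrices given as lists of rows):

```python
import math

def kwta_forward(s, W_ih, b_ih, W_ho, b_ho, k, q=0.25):
    """Forward pass of the kWTA network: Eq. (7) net input, Algorithm 2
    bias subtraction, Eq. (8) logistic activation with a -1 offset,
    and Eq. (9) linear output of state-action values."""
    # Eq. (7): eta = W_ih s + b_ih
    eta = [sum(w * x for w, x in zip(row, s)) + b
           for row, b in zip(W_ih, b_ih)]
    # Algorithm 2: bias between the k-th and (k+1)-th largest net inputs
    top = sorted(eta, reverse=True)[:k + 1]
    bias = top[k - 1] - q * (top[k - 1] - top[k])
    # Eq. (8): winners (positive adjusted net input) end up above ~0.27
    h = [1.0 / (1.0 + math.exp(-((e - bias) - 1.0))) for e in eta]
    # Eq. (9): q = W_ho h + b_ho
    return [sum(w * a for w, a in zip(row, h)) + b
            for row, b in zip(W_ho, b_ho)]
```

Backpropagation through this pass simply treats the subtracted bias as a constant, which is the source of the small deviation from exact gradient descent noted above.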
4.1. Numerical simulations design

In order to assess our hypothesis that biasing a neural network toward learning sparse conjunctive codes for sensory state inputs will improve TD Learning when using a neural network as a function approximator for the state-action value function q(s, a; w), we constructed three types of backpropagation networks:

kWTA network. A single-hidden-layer backpropagation neural network equipped with the k-Winners-Take-All mechanism. See Figures 2 and 3(c).

Regular network. A single-hidden-layer backpropagation neural network without the kWTA mechanism. See Figure 3(b).

Linear network. A linear neural network without a hidden layer. See Figure 3(a).

There was complete connectivity between the input units and the hidden units and between the hidden units and the output units. For the Linear networks, there was full connectivity between the input layer and the output layer. All connection weights were initialized to uniformly sampled random values in the range [−0.05, 0.05].

In order to investigate the utility of sparse distributed representations, simulations were conducted involving three relatively simple reinforcement learning control tasks: the Puddle-world task, the Mountain-car task, and the Acrobot task. These reinforcement learning problems were selected because of their extensive use in the literature (Boyan and Moore, 1995; Sutton, 1996). In this section, each task is described. We tested the SARSA variant of TD Learning (Sutton and Barto, 2017; see Algorithm 1) on each of the three neural network architectures. The Matlab code for these simulations is available at http://rafati.net/td-sparse/.

Figure 3: The neural network architectures used as the function approximator for the state-action values q(s, a; w). (a) Linear network. (b) Regular backpropagation neural network. (c) kWTA network.
The specific parameters for each simulation can be found in the description of the simulation tasks below. In the "Results and Discussions" section, the numerical results for training performance on each task are reported and discussed.

4.2. The Puddle-world task

The agent in the Puddle-world task attempts to navigate in a dark 2D grid world to reach a goal location at the top right corner, while avoiding entering the poisonous "puddle" regions (Figure 4). In every episode of Algorithm 1, the agent starts in a random state. The agent can choose to move in one of four directions, A = {North, South, East, West}.

For the Puddle-world task, the x-coordinate and the y-coordinate of the current state, s, were presented to the neural network over two separate pools of input units. Note that these coordinate values were in the range [0, 1], as shown in Figure 4. Each pool of input units consisted of 21 units, with each unit corresponding to a coordinate location between 0 and 1, inclusive, in increments of 0.05. To encode a coordinate value for input to the network, a Gaussian distribution with a peak value of 1, a standard deviation of 0.05, and a mean equal to the given continuous coordinate value was used to calculate the activity of each of the 21 input units (see Figure 2).

We used each of the three neural network models described above as function approximators for the state-action values (see Figure 3). All networks had four output units, with each output corresponding to one of the four directions of motion. The hidden layer for both the Regular and kWTA neural networks had 220 hidden units. In the kWTA network, only 10% (or 22) of the hidden units were allowed to be highly active.

Figure 4: The agent in the Puddle-world task attempts to reach the goal location (fixed in the Northeast corner) in the fewest time steps while avoiding the puddle. The agent moves a distance of 0.05 either North, South, East, or West on each time step.
Entering a puddle produces a reward of −400 × d, where d is the distance from the current location to the edge of the puddle. The reward was −1 for most of the environment, but it had a higher value, 0, at the goal location in the Northeast corner. Finally, the agent received a reward signal of −2 if it had just attempted to leave the square environment. This pattern of reinforcement was selected to parallel that previously used by Sutton (1996).

At the beginning of the simulation, the exploration probability, ε, was set to a relatively high value of 0.1, and it remained at this value for much of the learning process. Once the average magnitude of δ over an episode fell below 0.2, the value of ε was reduced by 0.1% each time the goal location was reached. Thus, as the agent became increasingly successful at reaching the goal location, the exploration probability, ε, approached zero. (Annealing the exploration probability is commonly done in systems using TD Learning.) The agent continued to explore the environment, one episode after another, until the average absolute value of δ was below 0.01 and the goal location was consistently reached, or until a maximum of 44,100 episodes had been completed. This value was heuristically selected as a function of the size of the environment: (21 × 21) × 100 = 44,100. Each episode of SARSA was terminated when the agent reached the goal in the corner of the grid or after the maximum number of steps, T = 80, had been taken. The learning rate remained fixed at α = 0.005 during training.

When this reinforcement learning process was complete, we examined both the behavior of the agent and the degree to which its value function approximations, q(s, a; w), matched the correct values determined by running SARSA to convergence while using a large look-up table to capture the value function.
4.3. The Mountain-car task

In this reinforcement learning problem, the task involves driving a car up a steep mountain road to a high goal location. The task is difficult because the force of gravity is stronger than the car's engine (see Figure 5). In order to solve the problem, the agent must first learn to move away from the goal, then use the stored potential energy in combination with the engine to overcome gravity and reach the goal state. The state of the Mountain-car agent is described by the car's position and velocity, s = (x, ẋ).

Figure 5: The goal is to drive an underpowered car up a steep hill (y = sin(3x)). The agent received −1 reward for each time step until it reached the goal, at which point it received 0 reward.

There are three possible actions, corresponding to the throttle of the motor: A = {forward, neutral, backward}. After choosing action a ∈ A, the car's state is updated by the following equations

x_{t+1} = bound(x_t + ẋ_{t+1}),   (10a)
ẋ_{t+1} = bound(ẋ_t + 0.0001 a_t − 0.0025 cos(3 x_t)),   (10b)

where the bound function keeps the state variables within their limits, x ∈ [−1.2, 0.5] and ẋ ∈ [−0.07, 0.07]. If the car reaches the left environment boundary, its velocity is reset to zero. In order to present the values of the state variables to the neural network, each variable was encoded as a Gaussian activation bump surrounding the input corresponding to the variable value. Every episode started from a random position and velocity. Both the x-coordinate and the velocity ẋ were discretized into 60 mesh intervals, allowing each variable to be represented over 61 inputs. To encode a state value for input to the network, a Gaussian distribution was used to calculate the activity of each of the 61 input units for that variable. The network had a total of 122 inputs. All networks had three output units, each corresponding to one of the possible actions.
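The transition dynamics of Eq. (10) can be implemented directly; a minimal sketch, with function names of our own choosing and the coefficients as given above:

```python
import math

X_MIN, X_MAX = -1.2, 0.5      # position limits
V_MIN, V_MAX = -0.07, 0.07    # velocity limits

def bound(value, lo, hi):
    """Clip a state variable to its limits."""
    return max(lo, min(hi, value))

def step(x, x_dot, a):
    """One Mountain-car transition per Eq. (10); a is in {-1, 0, +1}."""
    # Velocity update (Eq. 10b), then position update using the new velocity (Eq. 10a).
    x_dot_next = bound(x_dot + 0.0001 * a - 0.0025 * math.cos(3 * x), V_MIN, V_MAX)
    x_next = bound(x + x_dot_next, X_MIN, X_MAX)
    if x_next <= X_MIN:  # hitting the left wall resets velocity to zero
        x_dot_next = 0.0
    return x_next, x_dot_next
```

Note that the gravity term −0.0025 cos(3x) dominates the throttle term, which is why the agent must first swing away from the goal to build momentum.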
Between the 122 input units and the 3 output units was a layer of 61 × 61 × 0.7 ≈ 2604 hidden units (for the kWTA and Regular networks). The hidden layer of the kWTA network was subject to the previously described kWTA mechanism, parameterized so as to allow 10%, or 260, of the hidden units to be highly active. We used the SARSA version of TD Learning (Algorithm 1). The exploration probability, ε, was initialized to 0.1, and it was decreased by 1% after each episode until reaching a value of 0.0001. The agent received a reward of r = −1 for most of the states, but it received a reward of r = 0 at the goal location (x ≥ 0.5). If the car collided with the leftmost boundary, the velocity was reset to zero, but there was no extra punishment for bumping into the wall. The learning rate, α = 0.001, stayed fixed during the learning. The agent explored the Mountain-car environment in episodes. Each episode began with the agent being placed at a location within the environment, sampled uniformly at random. Actions were then taken, and connection weights updated, as described above. The episode ended when the agent reached the goal location or after the maximum of T = 3000 actions had been taken. The agent continued to explore the environment, one episode after another, until the average absolute value of δ was below 0.05 and the goal location was consistently reached, or a maximum of 200,000 episodes had been completed.

4.4. The Acrobot task

We also examined the utility of learning sparse distributed representations on the Acrobot control task, which is a more complicated task and one attempted by Sutton (1996). The acrobot is a two-link under-actuated robot, a simple model of a gymnast swinging on a high-bar (Sutton, 1996). (See Figure 6.) The state of the acrobot is determined by four continuous state variables: two joint angles (θ1, θ2) and the corresponding angular velocities.
Thus, the state of the agent can be formally described as s = (θ1, θ2, θ̇1, θ̇2). The goal is to control the acrobot so as to swing the end tip (the "feet") above the horizontal by the length of the lower "leg" link. Torque may only be applied to the second joint. The agent receives −1 reward until it reaches the goal, at which point it receives a reward of 0. The frequency of action selection is set to 5 Hz, and the time step, Δt = 0.05 s, is used for numerical integration of the equations describing the dynamics of the system. A discount factor of γ = 0.99 is used. The equations of motion for the acrobot are

θ̈1 = −d1⁻¹ (d2 θ̈2 + φ1),   (11a)
θ̈2 = (m2 l_{c2}² + I2 − d2²/d1)⁻¹ (τ + (d2/d1) φ1 − m2 l1 l_{c2} θ̇1² sin(θ2) − φ2),   (11b)

where d1, d2, φ1, and φ2 are defined as

d1 = m1 l_{c1}² + m2 (l1² + l_{c2}² + 2 l1 l_{c2} cos(θ2)) + I1 + I2,   (12a)
d2 = m2 (l_{c2}² + l1 l_{c2} cos(θ2)) + I2,   (12b)
φ1 = −m2 l1 l_{c2} θ̇2² sin(θ2) − 2 m2 l1 l_{c2} θ̇2 θ̇1 sin(θ2) + (m1 l_{c1} + m2 l1) g cos(θ1 − π/2) + φ2,   (12c)
φ2 = m2 l_{c2} g cos(θ1 + θ2 − π/2).   (12d)

The agent chooses between three different torque values, τ ∈ {−1, 0, 1}, with torque only applied to the second joint. The angular velocities are bounded to θ̇1 ∈ [−4π, 4π] and θ̇2 ∈ [−9π, 9π]. The values m1 = 1 kg and m2 = 1 kg are the masses of the links, and l1 = 1 m and l2 = 1 m are the lengths of the links. The values l_{c1} = l_{c2} = 0.5 m specify the locations of the centers of mass of the links, and I1 = I2 = 1 kg m² are the moments of inertia of the links. Finally, g = 9.8 m/s² is the acceleration due to gravity. These physical parameters were previously used in Sutton (1996).

Figure 6: The goal is to swing the tip (the "feet") above the horizontal by the length of the lower "leg" link. The agent receives −1 reward until it reaches the goal, at which point it receives 0 reward.
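The acrobot dynamics can be sketched directly from the physical constants listed above. This is a sketch assuming the standard Sutton (1996) form of Eqs. (11)-(12); the function name is our own.

```python
import math

# Physical constants from the text (as in Sutton, 1996)
M1 = M2 = 1.0      # link masses (kg)
L1 = L2 = 1.0      # link lengths (m)
LC1 = LC2 = 0.5    # distances to the link centers of mass (m)
I1 = I2 = 1.0      # link moments of inertia (kg m^2)
G = 9.8            # gravitational acceleration (m/s^2)

def accelerations(th1, th2, th1d, th2d, tau):
    """Joint accelerations (theta1'', theta2'') per Eqs. (11)-(12)."""
    d1 = M1 * LC1**2 + M2 * (L1**2 + LC2**2 + 2 * L1 * LC2 * math.cos(th2)) + I1 + I2
    d2 = M2 * (LC2**2 + L1 * LC2 * math.cos(th2)) + I2
    phi2 = M2 * LC2 * G * math.cos(th1 + th2 - math.pi / 2)
    phi1 = (-M2 * L1 * LC2 * th2d**2 * math.sin(th2)
            - 2 * M2 * L1 * LC2 * th2d * th1d * math.sin(th2)
            + (M1 * LC1 + M2 * L1) * G * math.cos(th1 - math.pi / 2) + phi2)
    th2dd = ((tau + (d2 / d1) * phi1
              - M2 * L1 * LC2 * th1d**2 * math.sin(th2) - phi2)
             / (M2 * LC2**2 + I2 - d2**2 / d1))
    th1dd = -(d2 * th2dd + phi1) / d1
    return th1dd, th2dd
```

As a sanity check, the hanging-at-rest state s = (0, 0, 0, 0) with τ = 0 is an equilibrium: both gravity terms vanish there and the accelerations are zero.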
In our simulation, each of the four state dimensions is divided over 20 uniformly spaced ranges. To encode a state value for input to the network, a Gaussian distribution with a peak value of 1, a standard deviation of 1/20, and a mean equal to the given continuous state variable value was used to calculate the activity of each of the 21 inputs for that variable. Thus, the network had 84 total inputs. The network had three output units, each corresponding to one of the possible values of applied torque: clockwise, neutral, or counter-clockwise. Between the 84 inputs and the 3 output units was a layer of 8400 hidden units for both the kWTA network and the regular backpropagation network. For the kWTA network, only 10%, or 840, of the hidden units were allowed to be highly active. The acrobot agent explores its environment in episodes. Each episode of learning starts with the acrobot agent hanging straight down and at rest (i.e., s = (0, 0, 0, 0)). The episode ends when the agent reaches the goal location or after the maximum of T = 2000 actions has been taken. At the beginning of a simulation, the exploration probability, ε, is set to a relatively high value of 0.05. When the agent first reaches the goal, the value of ε starts to decrease, being reduced by 0.1% each time the goal location is reached. Thus, as the agent becomes increasingly successful at reaching the goal location, the exploration probability, ε, approaches a lower bound of 0.0001. The agent continues to explore the environment, one episode after another, until the average absolute value of δ is below 0.05 and the goal location is consistently reached, or a maximum of 200,000 episodes has been completed. A small learning rate, α = 0.0001, was used and stayed fixed during the learning.

5. Results and discussions

We compared the performance of our kWTA neural network with that produced by using a standard backpropagation network (Regular) with identical parameters.
We also examined the performance of a linear network (Linear), which had no hidden units but only complete connections from all input units directly to the four output units (see Figure 3).

5.1. The puddle-world task

Figure 7(a)-(c) shows the learned value function (plotted as max_a Q(s, a) for each location, s = (x, y)) for representative networks of each kind. Also, Figure 7(d)-(f) displays the learned policy at the end of learning. Finally, we show learning curves displaying the episode-average value of the TD Error, δ, over episodes in Figure 7(g)-(i). In general, the Linear network did not consistently learn to solve this problem, sometimes failing to reach the goal or choosing paths through puddles. The Regular backpropagation network performed much better, but its value function approximation still contained sufficient error to produce a few poor action choices. The kWTA network, in comparison, consistently converged on good approximations of the value function, almost always resulting in optimal paths to the goal. For each network, we quantitatively assessed the quality of the paths produced by agents using the learned value function approximations. For each simulation, we froze the connection weights (or, equivalently, set the learning rate to zero), and we sequentially produced an episode for each possible starting location, except for locations inside of the puddles. The reward accumulated over each episode was recorded, and, for each episode that ended at the goal location, we calculated the sum squared deviation of these accumulated reward values from that produced by an optimal agent (identified using SARSA with a large look-up table for the value function). The mean of this squared deviation measure, over all successful episodes, was recorded for each of the 20 simulations run for each of the 3 network types. The means of these values, over 20 simulations, are displayed in Figure 8.
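With 20 simulations per network type, the reported significance tests take the form of two-sample t-tests with df = 20 + 20 - 2 = 38. A minimal sketch of the pooled statistic; the helper name is ours, and in use the inputs would be the per-simulation deviation scores.

```python
import math
from statistics import mean, variance

def two_sample_t(xs, ys):
    """Pooled two-sample t statistic with df = len(xs) + len(ys) - 2.

    variance() is the sample variance (n - 1 denominator), as the pooled
    estimate requires.
    """
    nx, ny = len(xs), len(ys)
    sp2 = ((nx - 1) * variance(xs) + (ny - 1) * variance(ys)) / (nx + ny - 2)
    t = (mean(xs) - mean(ys)) / math.sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2
```

Two groups of 20 simulations each yield the df = 38 seen in the reported t(38) values.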
The backpropagation network had significantly less error than the linear network (t(38) = 4.692; p < 0.001), and the kWTA network had significantly less error than the standard backpropagation network (t(38) = 6.663; p < 0.001). On average, the kWTA network deviated from optimal performance by less than one reward point. We also recorded the fraction of episodes which succeeded at reaching the goal for each simulation run. The mean rates of goal attainment, across all non-puddle starting locations, for the linear network, the backpropagation network, and the kWTA network were 93.3%, 99.0%, and 99.9%, respectively. Despite these consistently high success rates, the linear network exhibited significantly more failures than the backpropagation network (t(38) = 2.306; p < 0.05), and the backpropagation network exhibited significantly more failures than the kWTA network (t(38) = 2.205; p < 0.05).

5.2. The mountain-car task

Figure 9(a)-(c) shows the value function (plotted as max_a Q(s, a) for each location, s) for representative networks of each kind.
In the middle row of Figure 9(d)-(f), we show the learning curves, displaying the episode-average value of the TD Error, δ, over training episodes, and also the number of time steps needed to reach the goal during training episodes.

Figure 7: The performance of various learned value function approximators may be compared in terms of their success at learning the true value function, the resulting action selection policy, and the amount of experience in the environment needed to learn. The approximation of the state values, expressed as max_a Q(s, a) for each state, s, is given in the top row: (a) values of states for the Linear network; (b) values of states for the Regular network; (c) values of states for the kWTA network. The actions selected at a grid of locations are shown in the middle row: (d) policy derived from the Linear network; (e) policy derived from the Regular network; (f) policy derived from the kWTA network. The learning curves, showing the TD Error over learning episodes, are shown in the bottom row: (g) average TD Error for training the Linear network; (h) average TD Error for training the Regular network; (i) average TD Error for training the kWTA network. These results were initially reported in Rafati and Noelle (2015).
In the last row, we display the results of testing the performance of the networks. The test performances were collected after every epoch of 1000 training episodes, and there were no changes to the weights and no exploration during the testing phase. The value function for the kWTA network is the closest numerically to the optimal Q-table results, and the policy after training for the kWTA network is the most stable.

Figure 8: Averaged over 20 simulations of each network type, these columns display the mean squared deviation of accumulated reward from that of optimal performance. Error bars show one standard error of the mean (Rafati and Noelle, 2015).

5.3. The Acrobot task

In the first row of Figure 10(a)-(c), the episode-average value of the TD Error, δ, over training episodes, and also the number of time steps needed to reach the goal during training episodes, are shown for the different networks during the training phase. In the last row of Figure 10(d)-(f), we display the testing performance for the three network architectures. At testing time, any changes to the weights were avoided, and there was no exploration. From these results, we can see that only the kWTA network could learn the optimal policy. Both the Linear network and the Regular backpropagation network failed to learn the optimal policy for the Acrobot control task.

6. Conclusions and Future Work

Using a function approximator to learn the value function (an estimate of the expected future reward from a given environmental state) has the benefit of supporting generalization across similar states, but this approach can produce a form of catastrophic interference that hinders learning, mostly arising in cases in which similar states have widely different values.
Computational neuroscience models have shown that a combination of feedforward and feedback inhibition in neural circuits naturally produces sparse conjunctive codes over a collection of excitatory neurons (Noelle, 2008). Artificial neural networks can be forced to produce such sparse codes over their hidden units by including a process akin to the sort of pooled lateral inhibition that is ubiquitous in the cerebral cortex (O'Reilly and Munakata, 2001). We implemented a state-action value function approximator that utilizes the k-Winners-Take-All mechanism (O'Reilly, 2001), which efficiently

Figure 9: The performance of various networks trained to perform the Mountain-car task. The top row contains the approximate value of states after training (max_a Q(s, a)): (a) values approximated by the Linear network; (b) values approximated by the Regular BP network; (c) values approximated by the kWTA network.
The middle row shows two statistics over training episodes: the top subplot shows the average value of the TD Error, δ, and the bottom subplot shows the number of time steps during training episodes, for the (d) Linear, (e) Regular, and (f) kWTA networks. The last row reports the testing performance, which was measured after each epoch of 1000 training episodes; during the test episodes the weight parameters were frozen and no exploration was allowed. The average TD Error and the total steps for the (g) Linear, (h) Regular, and (i) kWTA networks are given.

incorporates a kind of lateral inhibition into artificial neural network layers, driving these machine learning systems to produce sparse conjunctive internal representations (Rafati and Noelle, 2015, 2017). The proposed method solves previously impossible-to-solve control tasks by balancing between generalization (pattern completion) and sparsity (pattern separation). We produced computational simulation results as preliminary evidence that learning such sparse representations of the state of an RL agent can help compensate for weaknesses of artificial neural networks in TD Learning.
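The kWTA gating over hidden-layer activations can be illustrated with a simplified hard top-k sketch. This is our own minimal illustration, not the model's exact mechanism: the model follows the O'Reilly (2001) kWTA formulation, which places a pooled inhibition level between the k-th and (k+1)-th most excited units.

```python
def kwta(activations, k_fraction=0.1):
    """Keep the k most active hidden units; suppress the rest to zero.

    Simplified hard top-k gate for illustration; the paper's networks use the
    O'Reilly (2001) kWTA formulation instead of exact zeroing.
    """
    k = max(1, int(len(activations) * k_fraction))
    threshold = sorted(activations, reverse=True)[k - 1]
    return [a if a >= threshold else 0.0 for a in activations]
```

With k_fraction = 0.1, only 10% of the hidden units remain active, matching the sparsity level used in the simulations (e.g., 22 of 220 units for the Puddle-world network).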
These simulation results both support our method for improving representation learning in model-free RL and also lend preliminary support to the hypothesis that the midbrain dopamine system, along with associated circuits in the basal ganglia, does, indeed, implement a form of TD Learning, and that the observed problems with TD Learning do not arise in the brain due to the encoding of sensory state information in circuits that make use of lateral inhibition.

Figure 10: The performance of various networks trained to solve the Acrobot control task. The top row corresponds to the training performance of the SARSA Algorithm 1: the average TD Error (top subplot) and the total number of steps (bottom subplot) for each episode of learning are given for the (a) Linear, (b) Regular, and (c) kWTA neural networks. The bottom row gives the testing performance, measured after every epoch of 1000 training episodes, in which the weight parameters were frozen and no exploration was allowed: the average TD Error and total steps for the (d) Linear, (e) Regular, and (f) kWTA networks are given.
Using a sparse conjunctive representation of the agent's state can not only help in solving simple reinforcement learning tasks, but it might also help improve the learning of some large-scale tasks. For example, FeUdal (state-goal) networks (Vezhnevets et al., 2017) can benefit from using the kWTA mechanism for learning sparse representations in conjunction with Hierarchical Reinforcement Learning (HRL) (see Rafati Heravi (2019); Rafati and Noelle (2019a,b,c)). In the future, we will extend this work to the deep RL framework, where the value function is approximated by a deep Convolutional Neural Network (CNN). The kWTA mechanism can be used in the fully connected layers of the CNN in order to generate sparse representations using the lateral-inhibition-like mechanism.

References

Albus, J. S., 1975. A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement, and Control 97 (3), 220–227.
Boyan, J. A., Moore, A. W., 1995. Generalization in reinforcement learning: Safely approximating the value function. In: Advances in Neural Information Processing Systems 7. MIT Press, Cambridge, MA, pp. 369–376.
Dayan, P., Niv, Y., 2008. Reinforcement learning: The good, the bad and the ugly. Current Opinion in Neurobiology 18, 185–196.
French, R. M., 1991. Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks. In: Proceedings of the 13th Annual Cognitive Science Society Conference. Lawrence Erlbaum, Hillsdale, NJ, pp. 173–178.
Kandel, E., Schwartz, J., Jessell, T., Siegelbaum, S., Hudspeth, A. J., 2012. Principles of Neural Science, 5th Edition. McGraw-Hill, New York.
Liu, V., Kumaraswamy, R., Le, L., White, M., 2018. The utility of sparse representations for control in reinforcement learning. arXiv e-prints (1811.06626).
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M.
G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al., 2015. Human-level control through deep reinforcement learning. Nature 518 (7540), 529–533.
Montague, P. R., Dayan, P., Sejnowski, T. J., 1996. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience 16, 1936–1947.
Noelle, D. C., 2008. Function follows form: Biologically guided functional decomposition of memory systems. In: Biologically Inspired Cognitive Architectures: Papers from the 2008 AAAI Fall Symposium.
O'Reilly, R. C., 2001. Generalization in interactive networks: The benefits of inhibitory competition and Hebbian learning. Neural Computation 13, 1199–1242.
O'Reilly, R. C., McClelland, J. L., 1994. Hippocampal conjunctive encoding, storage, and recall: Avoiding a trade-off. Hippocampus 4 (6), 661–682.
O'Reilly, R. C., Munakata, Y., 2001. Computational Explorations in Cognitive Neuroscience. MIT Press, Cambridge, Massachusetts.
Rafati, J., Marcia, R. F., 2019. Deep reinforcement learning via L-BFGS optimization. arXiv e-print (arXiv:1811.02693).
Rafati, J., Noelle, D. C., 2015. Lateral inhibition overcomes limits of temporal difference learning. In: 37th Annual Cognitive Science Society Meeting. Pasadena, CA, USA.
Rafati, J., Noelle, D. C., 2017. Sparse coding of learned state representations in reinforcement learning. In: Conference on Cognitive Computational Neuroscience. New York City, NY, USA.
Rafati, J., Noelle, D. C., 2019a. Learning representations in model-free hierarchical reinforcement learning. arXiv e-print.
Rafati, J., Noelle, D. C., 2019b. Unsupervised methods for subgoal discovery during intrinsic motivation in model-free hierarchical reinforcement learning. In: 33rd AAAI Conference on Artificial Intelligence (AAAI-19), 2nd Workshop on Knowledge Extraction From Games. Honolulu, HI, USA.
Rafati, J., Noelle, D. C., 2019c.
Unsupervised subgoal discovery method for learning hierarchical representations. In: 7th International Conference on Learning Representations, ICLR 2019 Workshop on "Structure & Priors in Reinforcement Learning", New Orleans, LA, USA.
Rafati Heravi, J., 2019. Learning representations in reinforcement learning. Ph.D. thesis, University of California, Merced. URL https://escholarship.org/uc/item/3dx2f8kq
Rumelhart, D. E., Hinton, G. E., Williams, R. J., 1986. Learning representations by back-propagating errors. Nature 323, 533–536.
Schultz, W., Dayan, P., Montague, P. R., 1997. A neural substrate of prediction and reward. Science 275 (5306), 1593–1599.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D., 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), 484–489. URL http://dx.doi.org/10.1038/nature16961
Sutton, R. S., 1988. Learning to predict by the methods of temporal differences. Machine Learning 3, 9–44.
Sutton, R. S., 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In: Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA, pp. 1038–1044.
Sutton, R. S., Barto, A. G., 1998. Reinforcement Learning: An Introduction, 1st Edition. MIT Press, Cambridge, MA, USA.
Sutton, R. S., Barto, A. G., 2017. Reinforcement Learning: An Introduction, 2nd Edition. MIT Press, Cambridge, MA, USA.
Tesauro, G., 1995. Temporal difference learning and TD-Gammon. Communications of the ACM 38 (3).
Vezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., Kavukcuoglu, K., 2017. Feudal networks for hierarchical reinforcement learning.
In: Proceedings of the Thirty-fourth International Conference on Machine Learning (ICML-17).
Zhang, Z., Xu, Y., Yang, J., Li, X., Zhang, D., 2015. A survey of sparse representation: Algorithms and applications. IEEE Access 3, 490–530.