A neurally plausible model learns successor representations in partially observable environments


Authors: Eszter Vertes, Maneesh Sahani

Eszter Vértes, Maneesh Sahani
Gatsby Computational Neuroscience Unit, University College London, W1T 4JG
{eszter, maneesh}@gatsby.ucl.ac.uk

Abstract

Animals need to devise strategies to maximize returns while interacting with their environment based on incoming noisy sensory observations. Task-relevant states, such as the agent's location within an environment or the presence of a predator, are often not directly observable but must be inferred using available sensory information. Successor representations (SR) have been proposed as a middle ground between model-based and model-free reinforcement learning strategies, allowing for fast value computation and rapid adaptation to changes in the reward function or goal locations. Indeed, recent studies suggest that features of neural responses are consistent with the SR framework. However, it is not clear how such representations might be learned and computed in partially observed, noisy environments. Here, we introduce a neurally plausible model using distributional successor features, which builds on the distributed distributional code for the representation and computation of uncertainty, and which allows for efficient value function computation in partially observed environments via the successor representation. We show that distributional successor features can support reinforcement learning in noisy environments in which direct learning of successful policies is infeasible.

1 Introduction

Humans and other animals are able to evaluate long-term consequences of their actions and adapt their behaviour to maximize reward across different environments. This behavioural flexibility is often thought to result from interactions between two adaptive systems implementing model-based and model-free reinforcement learning (RL).
Model-based learning allows for flexible, goal-directed behaviour, acquiring an internal model of the environment which is used to evaluate the consequences of actions. As a result, an agent can rapidly adjust its policy to localized changes in the environment or in the reward function. But this flexibility comes at a high computational cost, as optimal actions and value functions depend on expensive simulations in the model. Model-free methods, on the other hand, learn cached values for states and actions, enabling rapid action selection. This approach, however, is particularly slow to adapt to changes in the task, as adjusting behaviour even to localized changes, e.g. in the placement of the reward, requires updating cached values at all states in the environment. It has been suggested that the brain makes use of both of these complementary approaches, and that they may compete for behavioural control [Daw et al., 2005]; indeed, several behavioural studies suggest that subjects implement a hybrid of model-free and model-based strategies [Daw et al., 2011, Gläscher et al., 2010].

Successor representations [SR; Dayan, 1993] augment the internal state used by model-free systems by the expected future occupancy of each world state. SRs can be viewed as a precompiled representation of the model under a given policy. Thus, SRs fall in between model-free and model-based approaches and can reproduce a range of corresponding behaviours [Russek et al., 2017]. Recent studies have argued for evidence consistent with SRs in rodent hippocampal and human behavioural data [Stachenfeld et al., 2017, Momennejad et al., 2017].

Preprint. Under review.

Motivated by both theoretical and experimental work arguing that neural RL systems operate over latent states and need to handle state uncertainty [Dayan and Daw, 2008, Gershman, 2018, Starkweather et al., 2017], our work takes the successor framework further by considering partially observable environments.
Adopting the framework of distributed distributional coding [Vértes and Sahani, 2018], we show how learnt latent dynamical models of the environment can be naturally integrated with SRs defined over the latent space. We begin with short overviews of reinforcement learning in the partially observed setting (section 2), the SR (section 3), and distributed distributional codes (DDCs) (section 4). In section 5, we describe how using DDCs in the generative and recognition models leads to a particularly simple algorithm for learning latent state dynamics and the associated SR.

2 Partially observable Markov decision processes

Markov decision processes (MDPs) provide a framework for modelling a wide range of sequential decision-making tasks relevant for reinforcement learning. An MDP is defined by a set of states $\mathcal{S}$ and actions $\mathcal{A}$, a reward function $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and a probability distribution $T(s' \mid s, a)$ that describes the Markovian dynamics of the states conditioned on the actions of the agent. For notational convenience we will take the reward function to be independent of action, depending only on state; but the approach we describe is easily extended to the more general case.

A partially observable Markov decision process (POMDP) is a generalization of an MDP in which the Markovian states $s \in \mathcal{S}$ are not directly observable to the agent. Instead, the agent receives observations $o \in \mathcal{O}$ that depend on the current latent state via an observation process $Z(o \mid s)$. Formally, a POMDP is a tuple $(\mathcal{S}, \mathcal{A}, T, R, \mathcal{O}, Z, \gamma)$, comprising the objects defined above and the discount factor $\gamma$. POMDPs can be defined over either discrete or continuous state spaces. Here, we focus on the more general continuous case, although the model we present is applicable to discrete state spaces as well.

3 The successor representation

As an agent explores an environment, the states it visits are ordered by the agent's policy and the transition structure of the world.
State representations that respect this dynamic ordering are likely to be more efficient for value estimation and may promote more effective generalization. This may not be true of the observed state coordinates: for instance, a barrier in a spatial environment might mean that two states with adjacent physical coordinates are associated with very different values.

Dayan [1993] argued that a natural state space for model-free value estimation is one where distances between states reflect the similarity of future paths given the agent's policy. The successor representation [SR; Dayan, 1993] for state $s_i$ is defined as the expected discounted sum of future occupancies of each state $s_j$, given the current state $s_i$:

$$M^\pi(s_i, s_j) = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k \, \mathbb{I}[s_{t+k} = s_j] \,\Big|\, s_t = s_i\Big] \quad (1)$$

That is, in a discrete state space, the SR is an $N \times N$ matrix, where $N$ is the number of states in the environment. The SR depends on the current policy $\pi$ through the expectation on the right-hand side of eq. 1, taken with respect to a (possibly stochastic) policy $p_\pi(a_t \mid s_t)$ and environment $T(s_{t+1} \mid s_t, a_t)$.

Importantly, the SR makes it possible to express the value function in a particularly simple form. Following from eq. 1 and the definition of the value function:

$$V^\pi(s_i) = \sum_j M^\pi(s_i, s_j) R(s_j), \quad (2)$$

where $R(s_j)$ is the immediate reward in state $s_j$. The successor matrix $M^\pi$ can be learned by TD learning, in much the same way as TD is used to update value functions. In particular, the SR is updated according to a TD error

$$\delta_t(s_j) = \mathbb{I}[s_t = s_j] + \gamma M^\pi(s_{t+1}, s_j) - M^\pi(s_t, s_j), \quad (3)$$

which reflects errors in state predictions rather than rewards, a learning signal typically associated with model-based RL. As shown in eq. 2, the value function can be factorized into the SR (i.e., information about expected future states under the policy) and the instantaneous reward in each state.¹
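As a hypothetical illustration (not the authors' code), the tabular TD rule of eq. 3 takes only a few lines; the 5-state ring world, random-walk policy, and reward vector below are invented for the demo.

```python
import numpy as np

# Hypothetical illustration (not from the paper): learn the tabular SR of
# eq. 1 with the TD rule of eq. 3 on an invented 5-state ring world, then
# read off values via eq. 2.
rng = np.random.default_rng(0)
N, gamma = 5, 0.9
P = np.zeros((N, N))                       # random-walk policy on a ring
for s in range(N):
    P[s, (s - 1) % N] = P[s, (s + 1) % N] = 0.5

M = np.zeros((N, N))                       # successor-matrix estimate
s = 0
for t in range(200_000):
    s_next = rng.choice(N, p=P[s])
    alpha = 100.0 / (1000.0 + t)           # decaying learning rate
    delta = np.eye(N)[s] + gamma * M[s_next] - M[s]   # TD error, eq. 3
    M[s] += alpha * delta
    s = s_next

M_true = np.linalg.inv(np.eye(N) - gamma * P)   # closed form [Dayan, 1993]
assert np.allclose(M, M_true, atol=0.3)

R = np.array([1.0, 0.0, 0.0, 0.0, 0.0])    # arbitrary reward vector
V = M @ R                                  # eq. 2: value = SR times reward
```

Because the reward enters only through the final product, `V` can be recomputed immediately when `R` changes, which is exactly the flexibility the SR buys.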
This modularity enables rapid policy evaluation under changing reward conditions: for a fixed policy, only the reward function needs to be relearned to evaluate $V^\pi(s)$. This contrasts with both model-free and model-based algorithms, which require extensive experience or rely on computationally expensive evaluation, respectively, to recompute the value function.

3.1 Successor representation using features

The successor representation can be generalized to continuous states $s \in \mathcal{S}$ by using a set of feature functions $\{\psi_i(s)\}$ defined over $\mathcal{S}$. In this setting, the successor representation (also referred to as the successor feature representation, or SF) encodes expected feature values instead of occupancies of individual states:

$$M^\pi(s_t, i) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^k \psi_i(s_{t+k}) \,\Big|\, s_t, \pi\Big] \quad (4)$$

Assuming that the reward function can be written (or approximated) as a linear function of the features, $R(s) = w_{\text{rew}}^T \psi(s)$ (where the feature values are collected into a vector $\psi(s)$), the value function $V(s_t)$ has a simple form analogous to the discrete case:

$$V^\pi(s_t) = w_{\text{rew}}^T M^\pi(s_t) \quad (5)$$

For consistency, we can use linear function approximation with the same set of features as in eq. 4 to parametrize the successor features $M^\pi(s_t, i)$:

$$M^\pi(s_t, i) \approx \sum_j U_{ij} \psi_j(s_t) \quad (6)$$

The form of the SFs, embodied by the weights $U_{ij}$, can be found by temporal difference learning:

$$\Delta U_{ij} = \delta_i \psi_j(s_t), \qquad \delta_i = \psi_i(s_t) + \gamma M(s_{t+1}, i) - M(s_t, i) \quad (7)$$

As in the discrete case, the TD error here signals prediction errors about features of the state, rather than about reward.

4 Distributed distributional codes

Distributed distributional codes (DDCs) are a candidate for the neural representation of uncertainty [Zemel et al., 1998, Sahani and Dayan, 2003] and have recently been shown to support accurate inference and learning in hierarchical latent variable models [Vértes and Sahani, 2018].
In a DDC, a population of neurons represents distributions in its firing rates implicitly, as a set of expectations:

$$\mu = \mathbb{E}_{p(s)}[\psi(s)] \quad (8)$$

where $\mu$ is a vector of firing rates, $p(s)$ is the represented distribution, and $\psi(s)$ is a vector of encoding functions specific to each neuron. DDCs can be thought of as representing exponential family distributions with sufficient statistics $\psi(s)$ using their mean parameters $\mathbb{E}_{p(s)}[\psi(s)]$ [Wainwright and Jordan, 2008].

5 Distributional successor representation

As discussed above, the successor representation can support efficient value computation by incorporating information about the policy and the environment into the state representation. However, in more realistic settings, the states themselves are not directly observable and the agent is limited to state-dependent noisy sensory information.

¹ Alternatively, for the more general case of action-dependent reward, the expected instantaneous reward under the policy-dependent action in each state.

Algorithm 1 Wake-sleep algorithm in the DDC state-space model

Initialise $T$, $W$
while not converged do
  Sleep phase:
    sample: $\{s_t^{\text{sleep}}, o_t^{\text{sleep}}\}_{t=0\ldots N} \sim p(S_N, O_N)$
    update $W$: $\Delta W \propto \sum_t \big(\psi(s_t^{\text{sleep}}) - f_W(\mu_{t-1}(\mathcal{O}_{t-1}^{\text{sleep}}), o_t^{\text{sleep}})\big) \nabla_W f_W$
  Wake phase:
    $\mathcal{O}_N \leftarrow$ collect observations
    infer posterior: $\mu_t(\mathcal{O}_t) = f_W(\mu_{t-1}(\mathcal{O}_{t-1}), o_t)$
    update $T$: $\Delta T \propto (\mu_{t+1}(\mathcal{O}_{t+1}) - T \mu_t(\mathcal{O}_t))\, \mu_t(\mathcal{O}_t)^T$
    update observation model parameters
end while

In this section, we lay out how the DDC representation of uncertainty allows for learning and computing with successor representations defined over latent variables. First, we describe an algorithm for learning and inference in dynamical latent variable models using DDCs. We then establish a link between the DDC and successor features (eq. 4) and show how they can be combined to learn what we call the distributional successor features.
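To make the DDC encoding of eq. 8 concrete, here is a small example of our own (not from the paper): a 1-D distribution is encoded as the expected values of a bank of Gaussian-bump encoding functions, estimated from samples. The centres, width, and target distribution are all invented for the demo.

```python
import numpy as np

# Hypothetical illustration of eq. 8: encode a 1-D distribution p(s) as the
# rate vector mu = E_p[psi(s)] for invented Gaussian-bump encoding functions.
rng = np.random.default_rng(1)
centres = np.linspace(-3.0, 3.0, 12)       # one encoding function per "neuron"
width = 0.8

def psi(s):
    """Vector of encoding functions evaluated at state(s) s."""
    s = np.atleast_1d(s)
    return np.exp(-0.5 * ((s[:, None] - centres) / width) ** 2)

samples = rng.normal(loc=0.5, scale=1.0, size=50_000)   # p(s) = N(0.5, 1)
mu = psi(samples).mean(axis=0)             # Monte Carlo estimate of eq. 8

# The rate vector peaks for neurons whose centres sit near the mass of p(s).
assert abs(centres[np.argmax(mu)] - 0.5) < 0.6
```

The same vector `mu` is exactly the mean-parameter representation referred to in the text: any distribution in the exponential family with sufficient statistics `psi` is uniquely identified by it.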
We discuss different algorithmic and implementation-related choices for the proposed scheme and their implications.

5.1 Learning and inference in a state space model using DDCs

Here, we consider POMDPs where the state-space transition model is itself defined by a conditional DDC, with means that depend linearly on the preceding state features. That is, the conditional distribution describing the latent dynamics implied by following the policy $\pi$ can be written in the following form:

$$p_\pi(s_{t+1} \mid s_t) \;\Leftrightarrow\; \mathbb{E}_{s_{t+1} \mid s_t, \pi}[\psi(s_{t+1})] = T_\pi \psi(s_t) \quad (9)$$

where $T_\pi$ is a matrix parametrizing the functional relationship between $s_t$ and the expectation of $\psi(s_{t+1})$ with respect to $p_\pi(s_{t+1} \mid s_t)$.

The agent has access only to sensory observations $o_t$ at each time step; in order to make use of the underlying latent structure, it has to learn the parameters of the generative model $p(s_{t+1} \mid s_t)$, $p(o_t \mid s_t)$, as well as learn to perform inference in that model. We consider online inference (filtering), i.e. at each time step $t$ the recognition model produces an estimate $q(s_t \mid \mathcal{O}_t)$ of the posterior distribution $p(s_t \mid \mathcal{O}_t)$ given all observations up to time $t$: $\mathcal{O}_t = (o_1, o_2, \ldots, o_t)$. As in the DDC Helmholtz machine [Vértes and Sahani, 2018], these distributions are represented by a set of expectations, i.e. by a DDC:

$$\mu_t(\mathcal{O}_t) = \mathbb{E}_{q(s_t \mid \mathcal{O}_t)}[\psi(s_t)] \quad (10)$$

The filtering posterior $\mu_t(\mathcal{O}_t)$ is computed iteratively, using the posterior at the previous time step $\mu_{t-1}(\mathcal{O}_{t-1})$ and the new observation $o_t$. Due to the Markovian structure of the state space model (see fig. 1), the recognition model can be written as a recursive function

$$\mu_t(\mathcal{O}_t) = f_W(\mu_{t-1}(\mathcal{O}_{t-1}), o_t) \quad (11)$$

with a set of parameters $W$. The recognition and generative models are updated using an adapted version of the wake-sleep algorithm [Hinton et al., 1995, Vértes and Sahani, 2018].
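Computationally, the recursion in eq. 11 is a recurrent network unrolled over the observation sequence. The sketch below is purely structural and our own: $f_W$ is an arbitrary untrained one-hidden-layer network with random parameters (in the paper, $f_W$ is trained in the sleep phase), and the observation stream is dummy noise. It shows only the shape of the computation: the posterior code $\mu_t$ is carried forward and combined with each new observation.

```python
import numpy as np

# Structural sketch of eq. 11 (not a trained model): a recurrent recognition
# function f_W maps the previous posterior code and the new observation to
# the next posterior code. Layer sizes and the tanh nonlinearity are our own
# arbitrary choices.
rng = np.random.default_rng(2)
K, D_obs, H = 10, 2, 32                    # DDC size, obs dim, hidden units
W_in = rng.normal(size=(H, K + D_obs)) * 0.1
W_out = rng.normal(size=(K, H)) * 0.1

def f_W(mu_prev, o_t):
    """One step of the recursive recognition model, eq. 11."""
    h = np.tanh(W_in @ np.concatenate([mu_prev, o_t]))
    return W_out @ h

observations = rng.normal(size=(100, D_obs))   # dummy observation stream
mu = np.zeros(K)                                # initial posterior code
trajectory = []
for o_t in observations:
    mu = f_W(mu, o_t)                           # filtering update
    trajectory.append(mu)

trajectory = np.stack(trajectory)               # (T, K) posterior codes
```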
In the following, we describe the two phases of the algorithm in more detail (see Algorithm 1).

Sleep phase  The aim of the sleep phase is to adjust the parameters of the recognition model given the current generative model. Specifically, the recognition model should approximate the expectation of the DDC encoding functions $\psi(s_t)$ under the filtering posterior $p(s_t \mid \mathcal{O}_t)$. This can be achieved by moment matching, i.e., simulating a sequence of latent and observed states from the current model and minimizing the Euclidean distance between the output of the recognition model and the sufficient statistic vector $\psi(\cdot)$ evaluated at the latent state at the next time step:

$$W \leftarrow \arg\min_W \sum_t \big\| \psi(s_t^{\text{sleep}}) - f_W(\mu_{t-1}(\mathcal{O}_{t-1}^{\text{sleep}}), o_t^{\text{sleep}}) \big\|^2 \quad (12)$$

where $\{s_t^{\text{sleep}}, o_t^{\text{sleep}}\}_{t=0\ldots N} \sim p(s_0)\, p(o_0 \mid s_0) \prod_{t=0}^{N-1} p(s_{t+1} \mid s_t, T_\pi)\, p(o_{t+1} \mid s_{t+1})$.

This update rule can be implemented online, and after a sufficiently long sequence of simulations $\{s_t^{\text{sleep}}, o_t^{\text{sleep}}\}_t$ the recognition model will learn to approximate expectations of the form $f_W(\mu_{t-1}(\mathcal{O}_{t-1}^{\text{sleep}}), o_t^{\text{sleep}}) \approx \mathbb{E}_{p(s_t \mid \mathcal{O}_t)}[\psi(s_t)]$, yielding a DDC representation of the posterior.

Wake phase  In the wake phase, the parameters of the generative model are adapted to better capture the sensory observations. Here, we focus on learning the policy-dependent latent dynamics $p_\pi(s_{t+1} \mid s_t)$; the observation model can be learned by the approach of Vértes and Sahani [2018]. Given a sequence of inferred posterior representations $\{\mu_t(\mathcal{O}_t)\}$ computed using wake phase observations, the parameters of the latent dynamics $T$ can be updated by minimizing a simple predictive cost function:

$$T \leftarrow \arg\min_T \sum_t \big\| \mu_{t+1}(\mathcal{O}_{t+1}) - T \mu_t(\mathcal{O}_t) \big\|^2 \quad (13)$$

The intuition behind eq. 13 is that for the optimal generative model the latent dynamics satisfy the equality $T^* \mu_t(\mathcal{O}_t) = \mathbb{E}_{p(o_{t+1} \mid \mathcal{O}_t)}[\mu_{t+1}(\mathcal{O}_{t+1})]$. That is, the predictions made by combining the posterior at time $t$ and the prior agree with the average posterior at the next time step, making $T^*$ a stationary point of the update in eq. 14. For further details on the nature of the approximation implied by the wake phase update and its relationship to variational learning, see the supplementary material. In practice, the update can be done online, using gradient steps analogous to prediction errors:

$$\Delta T \propto \big(\mu_{t+1}(\mathcal{O}_{t+1}) - T \mu_t(\mathcal{O}_t)\big)\, \mu_t(\mathcal{O}_t)^T \quad (14)$$

Figure 1: Learning and inference in a state-space model parametrized by a DDC. (a) The structure of the generative and recognition models. (b) Visualization of the dynamics $T$ learned by wake-sleep (Algorithm 1); arrows show the conditional mean $\mathbb{E}_{s_{t+1} \mid s_t}[s_{t+1}]$ at each location. (c) Posterior mean trajectories inferred using the recognition model, plotted on top of the true latent and observed trajectories.

Figure 1 shows a state-space model corresponding to a random walk policy in the latent space with noisy observations, learned using DDCs (Algorithm 1). For further details of the experiment, see the supplementary material.

5.2 Learning distributional successor features

Next, we show how using a DDC to parametrize the generative model (eq. 9) allows for computing the successor features defined in the latent space in a tractable form, and how this computation can be combined with inference based on sensory observations.

Following the definition of the SFs (eq.
4):

$$M(s_t) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^k \psi(s_{t+k}) \,\Big|\, s_t, \pi\Big] = \sum_{k=0}^{\infty} \gamma^k\, \mathbb{E}[\psi(s_{t+k}) \mid s_t, \pi] \quad (15)$$

We can compute the conditional expectations of the feature vector $\psi$ in eq. 15 by applying the dynamics $k$ times to the features $\psi(s_t)$: $\mathbb{E}_{s_{t+k} \mid s_t}[\psi(s_{t+k})] = T^k \psi(s_t)$. Thus, we have:

$$M(s_t) = \sum_{k=0}^{\infty} \gamma^k T^k \psi(s_t) \quad (16)$$
$$= (I - \gamma T)^{-1} \psi(s_t) \quad (17)$$

Eq. 17 is reminiscent of the result for discrete observed state spaces, $M(s_i, s_j) = (I - \gamma P)^{-1}_{ij}$ [Dayan, 1993], where $P$ is a matrix containing the Markovian transition probabilities between states. In a continuous state space, however, finding a closed form solution like eq. 17 is non-trivial, as it requires evaluating a set of typically intractable integrals. The solution presented here directly exploits the DDC parametrization of the generative model and the correspondence between the features used in the DDC and the SFs.

In this framework, we can not only compute the successor features in closed form in the latent space, but also evaluate the distributional successor features: the posterior expectation of the SFs given a sequence of sensory observations:

$$\mathbb{E}_{s_t \mid \mathcal{O}_t}[M(s_t)] = (I - \gamma T)^{-1}\, \mathbb{E}_{s_t \mid \mathcal{O}_t}[\psi(s_t)] \quad (18)$$
$$= (I - \gamma T)^{-1} \mu_t(\mathcal{O}_t) \quad (19)$$

The results from this section suggest a number of different ways the distributional successor features $\mathbb{E}_{s_t \mid \mathcal{O}_t}[M(s_t)]$ can be learned or computed.

Learning distributional SFs during the sleep phase  The matrix $U = (I - \gamma T)^{-1}$ needed to compute distributional SFs in eq. 19 can be learned from temporal differences in feature predictions based on sleep-phase simulated latent state sequences (section 3.1).

Computing distributional SFs by dynamics  Alternatively, eq. 19 can be implemented as a fixed point of a linear dynamical system, with recurrent connections reflecting the model of the latent dynamics:

$$\tau \dot{x}_n = -x_n + \gamma T x_n + \mu_t(\mathcal{O}_t) \quad (20)$$
$$\Rightarrow x_\infty = (I - \gamma T)^{-1} \mu_t(\mathcal{O}_t) \quad (21)$$

In this case, there is no need to learn $(I - \gamma T)^{-1}$ explicitly; it is computed implicitly through the dynamics. For this to work, the underlying assumption is that the dynamical system in eq. 20 reaches equilibrium on a timescale faster than that on which the observations $\mathcal{O}_t$ evolve. Both of these approaches avoid having to compute the matrix inverse directly and allow for offline evaluation of policies given by a corresponding dynamics matrix $T_\pi$.

Learning distributional SFs during the wake phase  Instead of relying fully on the learned latent dynamics to compute the distributional SFs, we can use posteriors computed by the recognition model during the wake phase, that is, using observed data. We can define the distributional SFs directly on the DDC posteriors, $\tilde{M}(\mathcal{O}_t) = \mathbb{E}_\pi[\sum_k \gamma^k \mu_{t+k}(\mathcal{O}_{t+k}) \mid \mu_t(\mathcal{O}_t)]$, treating the posterior representation $\mu_t(\mathcal{O}_t)$ as a feature space over sequences of observations $\mathcal{O}_t = (o_1 \ldots o_t)$. Analogously to section 3.1, $\tilde{M}(\mathcal{O}_t)$ can be acquired by TD learning, assuming linear function approximation $\tilde{M}(\mathcal{O}_t) \approx U \mu_t(\mathcal{O}_t)$. The matrix $U$ can be updated online, while executing a given policy and continuously inferring latent state representations using the recognition model:

$$\Delta U \propto \delta_t\, \mu_t(\mathcal{O}_t)^T \quad (22)$$
$$\delta_t = \mu_t(\mathcal{O}_t) + \gamma \tilde{M}(\mathcal{O}_{t+1}) - \tilde{M}(\mathcal{O}_t) \quad (23)$$

It can be shown that $\tilde{M}(\mathcal{O}_t)$, as defined here, is equivalent to $\mathbb{E}_{s_t \mid \mathcal{O}_t}[M(s_t)]$ if the learned generative model is optimal (assuming no model mismatch) and the recognition model correctly infers the corresponding posteriors $\mu_t(\mathcal{O}_t)$ (see supplementary material).

Figure 2: Value functions computed using successor features under a random walk policy.
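The fixed point in eqs. 20-21 is easy to verify numerically. The sketch below is our own toy example (random $T$ rescaled so that $\gamma T$ is stable, and a random stand-in for $\mu_t$): it relaxes the dynamics with simple Euler steps and compares the equilibrium against the explicit inverse.

```python
import numpy as np

# Numerical check of eqs. 20-21 (our own toy example): relax the linear
# dynamics tau * dx/dt = -x + gamma*T*x + mu to equilibrium and compare it
# with the closed form (I - gamma*T)^(-1) mu. T and mu are arbitrary; T is
# scaled so the dynamics are stable.
rng = np.random.default_rng(3)
K, gamma, tau, dt = 8, 0.9, 1.0, 0.05
T = rng.normal(size=(K, K))
T *= 0.9 / np.max(np.abs(np.linalg.eigvals(T)))   # spectral radius 0.9
mu = rng.normal(size=K)                           # stand-in posterior code

x = np.zeros(K)
for _ in range(5000):                             # Euler integration, eq. 20
    x += (dt / tau) * (-x + gamma * (T @ x) + mu)

x_inf = np.linalg.solve(np.eye(K) - gamma * T, mu)   # eq. 21
assert np.allclose(x, x_inf, atol=1e-6)
```

The stability condition here is that $\gamma T$ has spectral radius below one, which also guarantees that the geometric series in eq. 16 converges.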
In general, however, exchanging the order of TD learning and inference leads to different SFs. The advantage of learning the distributional successor features in the wake phase is that even when the model does not perfectly capture the data (e.g. due to lack of flexibility, or early on in learning), the learned SFs will still reflect the structure in the observations through the posteriors $\mu_t(\mathcal{O}_t)$.

5.3 Value computation in a noisy 2D environment

We illustrate the importance of handling uncertainty consistently in the SFs by learning value functions in a noisy environment. We use a simple 2-dimensional box environment with continuous state space that includes an internal wall. The agent does not have direct access to its spatial coordinates, but receives observations corrupted by Gaussian noise.

Figure 2 shows the value functions computed using the successor features learned in three different settings: assuming direct access to latent states, treating observations as though they were noise-free state measurements, and using latent state estimates inferred from observations. The value functions computed in the latent space and from DDC posterior representations both reflect the structure of the environment, while the value function relying on SFs over the observed states fails to learn about the barrier.

To demonstrate that this is not simply a consequence of using the suboptimal random walk policy, but persists through learning, we also learned successor features while adapting the policy to a given reward function (see figure 3). The policy was learned by generalized policy iteration [Sutton and Barto, 1998], alternating between taking actions following a greedy policy and updating the successor features to estimate the corresponding value function.
The value of each state and action was computed from the value function $V(s)$ by a one-step look-ahead, combining the immediate reward with the expected value function after taking a given action:

$$Q(s_t, a_t) = r(s_t) + \gamma\, \mathbb{E}_{s_{t+1} \mid s_t, a_t}[V(s_{t+1})] \quad (24)$$

In our case, as the value function in the latent space is expressed as a linear function of the features $\psi(s)$, $V(s) = w_{\text{rew}}^T U \psi(s)$ (eqs. 5-6), the expectation in eq. 24 can be expressed as:

$$\mathbb{E}_{s_{t+1} \mid s_t, a_t}[V(s_{t+1})] = w_{\text{rew}}^T U \cdot \mathbb{E}_{s_{t+1} \mid s_t, a_t}[\psi(s_{t+1})] \quad (25)$$
$$= w_{\text{rew}}^T U \cdot P \cdot (\psi(s_t) \otimes \phi(a_t)) \quad (26)$$

where $P$ is a linear mapping, $P : \Psi \times \Phi \to \Psi$, that contains information about the distribution $p(s_{t+1} \mid s_t, a_t)$. More specifically, $P$ is trained to predict $\mathbb{E}_{s_{t+1} \mid s_t, a_t}[\psi(s_{t+1})]$ as a bilinear function of the state and action features $(\psi(s_t), \phi(a_t))$. Given the state-action value, we can implement a greedy policy by choosing actions that maximize $Q(s, a)$:

$$a^* = \arg\max_{a \in \mathcal{A}} Q(s_t, a_t) \quad (27)$$
$$= \arg\max_{a \in \mathcal{A}} \; r(s_t) + \gamma\, w_{\text{rew}}^T U \cdot P \cdot (\psi(s_t) \otimes \phi(a_t)) \quad (28)$$

The argmax operation in eq. 28 (possibly over a continuous space of actions) could be implemented biologically by a ring attractor in which neurons receive state-dependent input through feedforward weights reflecting the tuning $\phi(a)$ of each neuron in the ring.

Just as in figure 2, we compute the value function in the fully observed case, using inferred states, or using only the noisy observations. For the latter two, we replace $\psi(s_t)$ in eq. 28 with the inferred state representation $\mu(\mathcal{O}_t)$ and the observed features $\psi(o_t)$, respectively. As the agent follows the greedy policy and receives new observations, the corresponding SFs are adapted accordingly. Figure 3 shows the learned value functions $V^\pi(s)$, $V^\pi(\mu)$ and $V^\pi(o)$ for a given reward location, and the corresponding dynamics $T_\pi$.
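The look-ahead of eqs. 24-28 reduces to a few matrix products once $P$, $U$, and $w_{\text{rew}}$ are in hand. The sketch below is our own: every quantity is a random stand-in rather than a learned parameter, and it shows only the shape of the computation and the discrete argmax over a small action set.

```python
import numpy as np

# Sketch of the one-step look-ahead in eqs. 24-28 with random stand-in
# parameters (nothing here is learned; it only shows the computation).
rng = np.random.default_rng(4)
K, A = 16, 4                        # state-feature dim, number of actions
U = rng.normal(size=(K, K)) * 0.1   # successor-feature weights (eq. 6)
w_rew = rng.normal(size=K)          # reward weights, R(s) = w_rew^T psi(s)
P = rng.normal(size=(K, K * A)) * 0.1   # bilinear predictor, eq. 26
gamma = 0.9
phi = np.eye(A)                     # one-hot action features phi(a)

def q_values(psi_s, r_s):
    """Q(s, a) for every action a, via eqs. 24-26."""
    qs = []
    for a in range(A):
        sa = np.kron(psi_s, phi[a])          # psi(s) (x) phi(a)
        e_psi_next = P @ sa                  # E[psi(s') | s, a], eq. 26
        qs.append(r_s + gamma * w_rew @ (U @ e_psi_next))
    return np.array(qs)

psi_s = rng.normal(size=K)          # stand-in state features
q = q_values(psi_s, r_s=0.3)
a_star = int(np.argmax(q))          # greedy action, eqs. 27-28
```

As in the text, swapping `psi_s` for the inferred code $\mu(\mathcal{O}_t)$ or the observed features $\psi(o_t)$ yields the partially observed variants of the same computation.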
Both the agent with access to the true latent state and the agent using distributional SFs successfully learn policies leading to the rewarded location. As before, the agent learning SFs purely from observations remains highly sub-optimal.

Figure 3: Value functions computed by SFs under the learned policy. Top row: reward and value functions learned in the three different conditions. Bottom row: histogram of collected rewards from 100 episodes with random initial states, and the learned dynamics $T_\pi$ visualized as in fig. 1.

6 Discussion

We have shown that representing uncertainty over latent variables using DDCs can be naturally integrated with representations of uncertainty about future states, and can therefore generalize SRs to more realistic environments with partial observability.

In our work, we have defined distributional SFs over states, using a single-step look-ahead to compute state-action values (eq. 24). Alternatively, SFs could be defined directly over both states and actions [Kulkarni et al., 2016, Barreto et al., 2017] with the distributional development presented here. Barreto et al. [2017, 2019] have shown that successor representations corresponding to previously learned tasks can be used as a basis to construct policies for novel tasks, enabling generalization. Our framework can be extended in a similar way, eliminating the need to adapt the SFs as the policy of the agent changes.

The framework for learning distributional successor features presented here makes a number of connections to experimental observations in the hippocampal literature. While it has been argued that the hippocampus holds an internal model of the environment and thereby supports model-based decision making [Miller et al., 2017], little is known about how such a model is acquired.
Hippocampal replays observed in rodents during periods of immobility and sleep have been interpreted as mental simulations from an internal model of the environment, and therefore as a neural substrate for model-based planning [Pfeiffer and Foster, 2013, Mattar and Daw, 2018]. Here, we propose a complementary function of replays, concerned with learning in the context of partially observed environments. The replayed sequences could serve to refine the recognition model to accurately infer distributions over latent states, just as in the sleep phase of our algorithm. Broadly consistent with this idea, Stella et al. [2019] recently observed replays reminiscent of random walk trajectories after an animal freely explored the environment. These paths were not previously experienced by the animal, and could indeed serve as a training signal for the recognition model. Learning to perform inference is itself a prerequisite for learning the dynamics of the latent task-relevant variables, i.e. the internal model.

References

André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado P. van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pages 4055–4065, 2017.

André Barreto, Diana Borsa, John Quan, Tom Schaul, David Silver, Matteo Hessel, Daniel Mankowitz, Augustin Žídek, and Rémi Munos. Transfer in deep reinforcement learning using successor features and generalised policy improvement. arXiv:1901.10964 [cs], January 2019. URL http://arxiv.org/abs/1901.10964.

Nathaniel D. Daw, Yael Niv, and Peter Dayan. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12):1704, December 2005. doi: 10.1038/nn1560. URL https://www.nature.com/articles/nn1560.

Nathaniel D. Daw, Samuel J. Gershman, Ben Seymour, Peter Dayan, and Raymond J.
Dolan. Model-Based Influences on Humans’ Choices and Striatal Prediction Errors. Neur on , 69(6): 1204–1215, March 2011. ISSN 0896-6273. doi: 10.1016/j.neuron.2011.02.027. URL http: //www.sciencedirect.com/science/article/pii/S0896627311001255 . Peter Dayan. Improving Generalization for T emporal Difference Learning: The Successor Representation. Neural Computation , 5(4):613–624, July 1993. ISSN 0899-7667. doi: 10.1162/neco.1993.5.4.613. URL https://doi.org/10.1162/neco.1993.5.4.613 . Peter Dayan and Nathaniel D. Daw . Decision theory , reinforcement learning, and the brain. Cognitive, Affective , & Behavioral Neur oscience , 8(4):429–453, December 2008. ISSN 1531-135X. doi: 10.3758/CABN.8.4.429. URL https://doi.org/10.3758/CABN.8.4.429 . Samuel J. Gershman. The Successor Representation: Its Computational Logic and Neural Substrates. J. Neur osci. , 38(33):7193–7200, August 2018. ISSN 0270-6474, 1529-2401. doi: 10.1523/ JNEUR OSCI.0151- 18.2018. URL http://www.jneurosci.org/content/38/33/7193 . Jan Gläscher, Nathaniel Daw , Peter Dayan, and John P . O’Doherty . States versus Re wards: Dissociable Neural Prediction Error Signals Underlying Model-Based and Model-Free Reinforcement Learning. Neur on , 66(4):585–595, May 2010. ISSN 0896-6273. doi: 10.1016/j.neuron.2010.04.016. URL http://www.sciencedirect.com/science/article/pii/S0896627310002874 . Arthur Gretton, Karsten M. Borgw ardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A Kernel T wo-Sample T est. Journal of Machine Learning Resear ch , 13:723–773, March 2012. URL http://jmlr.csail.mit.edu/papers/v13/gretton12a.html . G E Hinton, P Dayan, B J Frey , and R M Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science , 268(5214):1158–1161, May 1995. ISSN 0036-8075. T ejas D. Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel J. Gershman. Deep Successor Reinforcement Learning. arXiv:1606.02396 [cs, stat] , June 2016. URL abs/1606.02396 . arXi v: 1606.02396. Marcelo G. 
Mattar and Nathaniel D. Da w . Prioritized memory access explains planning and hippocam- pal replay . Nature Neur oscience , 21(11):1609, November 2018. ISSN 1546-1726. doi: 10.1038/ s41593- 018- 0232- z. URL https://www.nature.com/articles/s41593- 018- 0232- z . 9 Ke vin J Miller , Matthew M Botvinick, and Carlos D Brody . Dorsal hippocampus contributes to model-based planning. Nature Neur oscience , 20(9):1269–1276, September 2017. ISSN 1097-6256, 1546-1726. doi: 10.1038/nn.4613. URL http://www.nature.com/articles/nn.4613 . I. Momennejad, E. M. Russek, J. H. Cheong, M. M. Botvinick, N. D. Daw , and S. J. Gershman. The successor representation in human reinforcement learning. Nature Human Behaviour , 1 (9):680, September 2017. ISSN 2397-3374. doi: 10.1038/s41562- 017- 0180- 8. URL https: //www.nature.com/articles/s41562- 017- 0180- 8 . Brad E. Pfeiffer and David J. Foster . Hippocampal place-cell sequences depict future paths to remembered goals. Nature , 497(7447):74–79, May 2013. ISSN 1476-4687. doi: 10.1038/ nature12112. URL https://www.nature.com/articles/nature12112 . Evan M. Russek, Ida Momennejad, Matthe w M. Botvinick, Samuel J. Gershman, and Nathaniel D. Daw . Predictiv e representations can link model-based reinforcement learning to model-free mech- anisms. PLOS Computational Biology , 13(9):e1005768, September 2017. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1005768. URL https://journals.plos.org/ploscompbiol/ article?id=10.1371/journal.pcbi.1005768 . Maneesh Sahani and Peter Dayan. Doubly Distributional Population Codes: Simultaneous Represen- tation of Uncertainty and Multiplicity . Neural Computation , 15(10):2255–2279, October 2003. ISSN 0899-7667. doi: 10.1162/089976603322362356. URL http://dx.doi.org/10.1162/ 089976603322362356 . Kimberly L. Stachenfeld, Matthe w M. Botvinick, and Samuel J. Gershman. The hippocampus as a predictiv e map. Nature Neur oscience , 20(11):1643–1653, November 2017. ISSN 1546-1726. doi: 10.1038/nn.4650. 
URL https://www.nature.com/articles/nn.4650.
Clara Kwon Starkweather, Benedicte M. Babayan, Naoshige Uchida, and Samuel J. Gershman. Dopamine reward prediction errors reflect hidden-state inference across time. Nat. Neurosci., 20(4):581–589, April 2017. ISSN 1546-1726. doi: 10.1038/nn.4520.
Federico Stella, Peter Baracskay, Joseph O'Neill, and Jozsef Csicsvari. Hippocampal Reactivation of Random Trajectories Resembling Brownian Diffusion. Neuron, February 2019. ISSN 0896-6273. doi: 10.1016/j.neuron.2019.01.052. URL http://www.sciencedirect.com/science/article/pii/S0896627319300790.
Richard S. Sutton and Andrew G. Barto. Introduction to reinforcement learning, volume 135. MIT Press, Cambridge, 1998.
Eszter Vértes and Maneesh Sahani. Flexible and accurate inference and learning for deep generative models. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4166–4175. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7671-flexible-and-accurate-inference-and-learning-for-deep-generative-models.pdf.
Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Found. Trends Mach. Learn., 1(1-2):1–305, January 2008. ISSN 1935-8237. doi: 10.1561/2200000001. URL http://dx.doi.org/10.1561/2200000001.
Richard S. Zemel, Peter Dayan, and Alexandre Pouget. Probabilistic interpretation of population codes. Neural Computation, 10(2):403–430, 1998.

Supplementary material

A Approximations in the wake phase update

Here, we give some additional insights into the nature of the approximation implied by the wake-phase update for the DDC state-space model, and discuss its link to variational methods.
According to the standard M step in variational EM, the model parameters are updated to maximize the expected log-joint of the model under the approximate posterior distributions:

$$\Delta\theta \propto \nabla_\theta \sum_t \mathbb{E}_{q(s_t, s_{t+1} \mid \mathcal{O}_{t+1})}\!\left[\log p_\theta(s_{t+1} \mid s_t)\right] \tag{29}$$

$$= \nabla_\theta \sum_t \int q(s_t, s_{t+1} \mid \mathcal{O}_{t+1}) \left(\log p_\theta(s_{t+1} \mid s_t) + \log q(s_t \mid \mathcal{O}_{t+1}) - \log q(s_t, s_{t+1} \mid \mathcal{O}_{t+1})\right) \mathrm{d}(s_t, s_{t+1}) \tag{30}$$

$$= \nabla_\theta \sum_t -\,\mathrm{KL}\!\left[\, q(s_t, s_{t+1} \mid \mathcal{O}_{t+1}) \,\middle\|\, p_\theta(s_{t+1} \mid s_t)\, q(s_t \mid \mathcal{O}_{t+1}) \,\right] \tag{31}$$

(The terms added in eq. 30 do not depend on $\theta$, so they leave the gradient unchanged.)

After projecting the distributions appearing in the KL divergence (eq. 31) into the joint exponential family defined by the sufficient statistics $[\psi(s_t), \psi(s_{t+1})]$, they can be represented using the corresponding mean parameters:

$$q(s_t, s_{t+1} \mid \mathcal{O}_{t+1}) \;\overset{P}{\Longrightarrow}\; \begin{bmatrix} \mathbb{E}_{q(s_t, s_{t+1} \mid \mathcal{O}_{t+1})}[\psi(s_t)] \\ \mathbb{E}_{q(s_t, s_{t+1} \mid \mathcal{O}_{t+1})}[\psi(s_{t+1})] \end{bmatrix} = \begin{bmatrix} \mu_t(\mathcal{O}_{t+1}) \\ \mu_{t+1}(\mathcal{O}_{t+1}) \end{bmatrix} \tag{32}$$

$$p_\theta(s_{t+1} \mid s_t)\, q(s_t \mid \mathcal{O}_{t+1}) \;\overset{P}{\Longrightarrow}\; \begin{bmatrix} \mathbb{E}_{p_\theta(s_{t+1} \mid s_t)\, q(s_t \mid \mathcal{O}_{t+1})}[\psi(s_t)] \\ \mathbb{E}_{p_\theta(s_{t+1} \mid s_t)\, q(s_t \mid \mathcal{O}_{t+1})}[\psi(s_{t+1})] \end{bmatrix} = \begin{bmatrix} \mu_t(\mathcal{O}_{t+1}) \\ T \mu_t(\mathcal{O}_{t+1}) \end{bmatrix} \tag{33}$$

To restrict ourselves to online inference, we can make a further approximation: $\mu_t(\mathcal{O}_{t+1}) \approx \mu_t(\mathcal{O}_t)$. Thus, the wake-phase update can be thought of as replacing the KL divergence in equation 31 by the Euclidean distance between the (projected) mean-parameter representations in eqs. 32–33:

$$\sum_t \left\| \mu_{t+1}(\mathcal{O}_{t+1}) - T \mu_t(\mathcal{O}_t) \right\|^2 \tag{34}$$

Note that this cost function is directly related to the maximum mean discrepancy (MMD; Gretton et al. [2012]), a non-parametric distance metric between two distributions, here with a finite-dimensional RKHS.
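To make the wake-phase objective concrete, the following minimal NumPy sketch learns a transition operator T by a delta rule that descends the cost of eq. 34. The feature dimensionality, learning rate, and the simulated pairs of posterior means are all illustrative choices, not values from the paper; in the full model the mean parameters would come from the recognition network.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 20                                                    # feature dimensionality (hypothetical)
T_true = rng.normal(scale=0.5 / np.sqrt(K), size=(K, K))  # "true" dynamics operator to recover

# Simulated pairs of consecutive posterior mean parameters (mu_t, mu_{t+1})
mu_t = rng.normal(size=(2000, K))
mu_next = mu_t @ T_true.T + 0.01 * rng.normal(size=(2000, K))

# Delta-rule learning of T, descending the cost  sum_t || mu_{t+1} - T mu_t ||^2  (eq. 34)
T_hat = np.zeros((K, K))
lr = 0.02
for x, y in zip(mu_t, mu_next):
    delta = y - T_hat @ x               # prediction error in mean-parameter space
    T_hat += lr * np.outer(delta, x)    # local, outer-product (Hebbian-like) update

err_before = np.linalg.norm(T_true)          # error of the zero initialisation
err_after = np.linalg.norm(T_true - T_hat)
print(err_before, err_after)                 # err_after should be much smaller
```

The update is local in the sense that each step only involves the current pair of mean parameters and the prediction error, which is one reason such a squared-error surrogate is attractive from a neural-plausibility standpoint.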
B Equivalence of $\mathbb{E}_{p(s_t \mid \mathcal{O}_t)}[M(s_t)]$ and $f_M(\mu_t(\mathcal{O}_t))$

$$f_M(\mu_t(\mathcal{O}_t)) = \mathbb{E}_{p(\mathcal{O}_{>t} \mid \mathcal{O}_t)}\!\left[\sum_k \gamma^k \mu_{t+k}(\mathcal{O}_{t+k})\right] \tag{35}$$

where

$$\mathbb{E}_{p(\mathcal{O}_{>t} \mid \mathcal{O}_t)}[\mu_{t+k}(\mathcal{O}_{t+k})] = \mathbb{E}_{p(\mathcal{O}_{t+1:t+k} \mid \mathcal{O}_t)}[\mu_{t+k}(\mathcal{O}_{t+k})] \tag{36}$$

$$= \int \mathrm{d}\mathcal{O}_{t+1:t+k}\, p(\mathcal{O}_{t+1:t+k} \mid \mathcal{O}_t) \int \mathrm{d}s_{t+k}\, p(s_{t+k} \mid \mathcal{O}_{t+k})\, \psi(s_{t+k}) \tag{37}$$

$$= \int \mathrm{d}\mathcal{O}_{t+1:t+k}\, p(\mathcal{O}_{t+1:t+k} \mid \mathcal{O}_t) \int \mathrm{d}s_{t+k}\, \frac{p(s_{t+k}, \mathcal{O}_{t+1:t+k} \mid \mathcal{O}_t)}{p(\mathcal{O}_{t+1:t+k} \mid \mathcal{O}_t)}\, \psi(s_{t+k}) \tag{38}$$

$$= \int \mathrm{d}s_{t+k} \int \mathrm{d}\mathcal{O}_{t+1:t+k}\, p(s_{t+k}, \mathcal{O}_{t+1:t+k} \mid \mathcal{O}_t)\, \psi(s_{t+k}) \tag{39}$$

$$= T^k \mu_t \tag{40}$$

Thus we have:

$$f_M(\mu_t(\mathcal{O}_t)) = \sum_k \gamma^k T^k \mu_t \tag{42}$$

$$= (I - \gamma T)^{-1} \mu_t \tag{43}$$

$$= \mathbb{E}_{p(s_t \mid \mathcal{O}_t)}[M(s_t)] \tag{44}$$

C Further experimental details

Figure 1: Learning and inference in the DDC state-space model.

The generative model corresponding to a random walk policy:

$$p(s_{t+1} \mid s_t) = [s_t + \tilde\eta]_{\mathrm{WALLS}} \tag{45}$$

$$p(o_t \mid s_t) = s_t + \xi \tag{46}$$

where $[\,\cdot\,]_{\mathrm{WALLS}}$ indicates the constraints introduced by the walls in the environment (the outer walls are of unit length), $\eta \sim \mathcal{N}(0, \sigma_s = 1)$, $\tilde\eta = 0.06\, \eta / \|\eta\|$, $\xi \sim \mathcal{N}(0, \sigma_o = 0.1)$, and $s_t, o_t \in \mathbb{R}^2$.

We used K = 100 Gaussian features of width $\sigma_\psi = 0.3$ for both the latent and observed states. A small subset of features was truncated along the internal wall to limit the artifacts from the function approximation. Alternatively, features with various spatial scales can also be used. The recursive recognition model was parametrized linearly using the features:

$$f_W(\mu_{t-1}, o_t) = W[T\mu_{t-1};\, \psi(o_t)] \tag{47}$$

As sampling from the DDC-parametrized latent dynamics is not tractable in general, in the sleep phase we generated approximate samples from a Gaussian distribution with consistent mean. The generative and recognition models were trained through 50 wake-sleep cycles, with $3 \cdot 10^4$ sleep samples and $5 \cdot 10^4$ wake-phase observations.
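The closed form in Appendix B (eqs. 42–43), which replaces the infinite discounted sum of propagated mean parameters by a matrix inverse, can be checked numerically. The sketch below uses an arbitrary random transition operator, rescaled so that the series converges:

```python
import numpy as np

rng = np.random.default_rng(1)
K, gamma = 10, 0.9
T = rng.normal(size=(K, K))
T *= 0.9 / np.max(np.abs(np.linalg.eigvals(T)))   # spectral radius 0.9, so gamma*T converges
mu = rng.normal(size=K)

# Truncated series  sum_k gamma^k T^k mu   (eq. 42)
f_series = np.zeros(K)
term = mu.copy()
for _ in range(2000):
    f_series += term
    term = gamma * (T @ term)

# Closed form  (I - gamma T)^{-1} mu   (eq. 43)
f_closed = np.linalg.solve(np.eye(K) - gamma * T, mu)

print(np.allclose(f_series, f_closed))   # the two agree to numerical precision
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is the numerically preferred way to evaluate eq. 43.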
The latent dynamics in Fig. 1b are visualized by approximating the mean dynamics as a linear readout from the DDC: $\mathbb{E}_{s_{t+1} \mid s_t}[s_{t+1}] \approx \alpha T \psi(s_t)$, where $s \approx \alpha \psi(s)$.

Figure 2: To compute the value functions under the random walk policy, we computed the SFs based on the latent ($\psi(s)$), inferred ($\mu$), or observed ($\psi(o)$) features, with discount factor $\gamma = 0.99$. In each case, we estimated the reward vector $w_{\mathrm{rew}}$ using the available state information.

Figure 3: To construct the state-action value function, we used 10 features over actions $\phi(a)$: von Mises functions ($\kappa = 2$) arranged evenly on $[0, 2\pi]$. Policy iteration was run for 500 cycles, and in each cycle an episode of 500 steps was collected according to the greedy policy. The visited latent, inferred, or observed state sequences were used to update the corresponding SFs to re-evaluate the policy. To facilitate faster learning, only episodes with positive returns were used to update the SFs.
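The policy-evaluation step described above (learning SFs from visited state sequences and reading out values with a reward vector) can be illustrated in a toy tabular setting. The chain, rewards, and step-size schedule below are all hypothetical stand-ins for the paper's continuous environment; with one-hot state features, the successor-feature value matches the exact value of the policy:

```python
import numpy as np

rng = np.random.default_rng(2)
S, gamma = 5, 0.9
P = rng.dirichlet(np.ones(S), size=S)   # transition matrix of a fixed policy (toy chain)
r = rng.normal(size=S)                  # per-state reward
w_rew = r                               # with one-hot features, the reward vector equals r

# TD(0) learning of successor features from a sampled state sequence:
#   M[s] <- M[s] + alpha * (psi(s) + gamma * M[s'] - M[s])
M = np.zeros((S, S))
visits = np.zeros(S)
s = 0
for _ in range(100_000):
    s_next = rng.choice(S, p=P[s])
    visits[s] += 1
    alpha = 50.0 / (50.0 + visits[s])   # decaying step size (arbitrary schedule)
    M[s] += alpha * (np.eye(S)[s] + gamma * M[s_next] - M[s])
    s = s_next

V_sf = M @ w_rew                                     # value via successor features
V_exact = np.linalg.solve(np.eye(S) - gamma * P, r)  # exact value of the policy
print(np.max(np.abs(V_sf - V_exact)))                # small TD approximation error
```

This separation of a policy-dependent SF matrix from a policy-independent reward vector is what allows the rapid re-evaluation of values after reward changes emphasized in the main text.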
