Maximum-Entropy Exploration with Future State-Action Visitation Measures
Maximum entropy reinforcement learning motivates agents to explore states and actions to maximize the entropy of some distribution, typically by providing additional intrinsic rewards proportional to that entropy function. In this paper, we study intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during future time steps. This approach is motivated by two results. First, we show that the expected sum of these intrinsic rewards is a lower bound on the entropy of the discounted distribution of state-action features visited in trajectories starting from the initial states, which we relate to an alternative maximum entropy objective. Second, we show that the distribution used in the intrinsic reward definition is the fixed point of a contraction operator and can therefore be estimated off-policy. Experiments highlight that the new objective leads to improved visitation of features within individual trajectories, in exchange for slightly reduced visitation of features in expectation over different trajectories, as suggested by the lower bound. It also leads to improved convergence speed for learning exploration-only agents. Control performance remains similar across most methods on the considered benchmarks.
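The second result above, that the distribution used in the intrinsic reward is the fixed point of a contraction operator, suggests it can be learned with TD-style updates from off-policy transitions, much as successor features are. Below is a minimal tabular sketch of such an update; the exact operator, the `td_update` helper, and the convention of weighting the current pair by (1 - gamma) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def td_update(d, s, a, s_next, a_next, gamma=0.9, lr=0.1):
    # One TD step toward the fixed point of the (assumed) contraction
    #   d(.|s,a) = (1 - gamma) * onehot(s,a) + gamma * d(.|s',a'),
    # where d[s, a] is a distribution over all state-action pairs.
    # Transitions (s, a, s') can come from a replay buffer, with a'
    # sampled from the current policy, hence off-policy estimation.
    n_actions = d.shape[1]
    target = gamma * d[s_next, a_next].copy()
    target[s * n_actions + a] += 1.0 - gamma
    d[s, a] += lr * (target - d[s, a])

# Trivial one-state, one-action MDP: the fixed point assigns mass 1
# to the only pair, and repeated updates converge toward it.
d = np.zeros((1, 1, 1))
for _ in range(500):
    td_update(d, 0, 0, 0, 0, gamma=0.9, lr=0.5)
```

Because each update only needs a sampled transition and a next action from the policy, no on-policy rollout of the full future is required.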
💡 Research Summary
Maximum‑entropy reinforcement learning (MaxEntRL) traditionally augments the external reward with an intrinsic term that is the (negative) entropy of a distribution, most often the policy itself. While this encourages stochastic actions, it does not directly promote exploration of the state space. The present work introduces a novel intrinsic reward based on the entropy of the discounted future state‑action visitation distribution. For any current state‑action pair \((s,a)\), the authors define a conditional distribution \(d_{\pi,\gamma}(\bar s,\bar a \mid s,a)\) that captures, in expectation, which future state‑action pairs will be visited when following policy \(\pi\). By mapping each future pair through a feature extractor \(h\) into a feature space \(\mathcal Z\), they obtain a conditional feature distribution \(q_\pi(z \mid s,a)\). The intrinsic reward is then proportional to the entropy of this conditional feature distribution, \(r_{\text{int}}(s,a) \propto \mathcal H\big(q_\pi(\cdot \mid s,a)\big)\).
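To make these objects concrete, here is a minimal Monte Carlo sketch: given a single trajectory of discretized features \(z_t = h(s_t, a_t)\), it forms the discounted empirical feature distribution and computes its entropy, which plays the role of the intrinsic reward. The discretization into `n_bins`, the helper names, and the single-trajectory estimator are assumptions for illustration, not the authors' estimator.

```python
import numpy as np

def discounted_feature_visitation(features, gamma=0.99, n_bins=10):
    # Discounted empirical distribution over discretized features
    # z_t = h(s_t, a_t) visited along one sampled trajectory.
    weights = gamma ** np.arange(len(features))
    counts = np.zeros(n_bins)
    for w, z in zip(weights, features):
        counts[z] += w
    return counts / counts.sum()

def entropy(p, eps=1e-12):
    # Shannon entropy, ignoring zero-probability bins.
    p = p[p > eps]
    return -np.sum(p * np.log(p))

# Toy trajectory of feature indices; the entropy of its discounted
# visitation distribution serves as the exploration bonus.
traj = np.random.default_rng(0).integers(0, 10, size=200)
r_int = entropy(discounted_feature_visitation(traj))
```

A trajectory that keeps revisiting the same feature bin yields low entropy (and hence little intrinsic reward), while one that spreads its discounted visitation mass across bins yields a bonus near \(\log n_{\text{bins}}\).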