Reward-Aware Proto-Representations in Reinforcement Learning
In recent years, the successor representation (SR) has attracted increasing attention in reinforcement learning (RL), and it has been used to address some of its key challenges, such as exploration, credit assignment, and generalization. The SR can be seen as representing the underlying credit assignment structure of the environment by implicitly encoding its induced transition dynamics. However, the SR is reward-agnostic. In this paper, we discuss a similar representation that also takes into account the reward dynamics of the problem. We study the default representation (DR), a recently proposed representation with limited theoretical (and empirical) analysis. Here, we lay some of the theoretical foundation underlying the DR in the tabular case by (1) deriving dynamic programming and (2) temporal-difference methods to learn the DR, (3) characterizing the basis for the vector space of the DR, and (4) formally extending the DR to the function approximation case through default features. Empirically, we analyze the benefits of the DR in many of the settings in which the SR has been applied, including (1) reward shaping, (2) option discovery, (3) exploration, and (4) transfer learning. Our results show that, compared to the SR, the DR gives rise to qualitatively different, reward-aware behaviour and quantitatively better performance in several settings.
💡 Research Summary
This paper revisits the idea of proto‑representations in reinforcement learning (RL) and introduces a reward‑aware alternative to the well‑known Successor Representation (SR). While SR encodes the expected discounted visitation counts under a policy and has been successfully applied to exploration, credit assignment, and zero‑shot transfer, it completely ignores the reward function. The authors therefore focus on the Default Representation (DR), originally proposed in a neuroscience context, and develop a comprehensive theoretical and algorithmic foundation for it.
The DR is defined within the framework of linearly solvable Markov Decision Processes (LMDPs). In an LMDP the agent maximizes cumulative reward while incurring a KL‑divergence penalty for deviating from a default policy πᵈ. The resulting optimal value function can be expressed linearly, and the DR matrix Z captures the expected exponentiated cumulative reward accrued when travelling from state s to state s′ under πᵈ:
Z = (diag(exp(−r/λ)) − P_{πᵈ})⁻¹,
where r is the vector of state rewards, λ controls the weight of the KL‑penalty, and P_{πᵈ} is the transition matrix induced by the default policy. Unlike the SR, which counts visits, Z directly incorporates the magnitude of rewards, making it intrinsically reward‑aware.
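As a concrete sketch, the closed form can be evaluated directly on a toy three‑state chain and compared with the SR (all numbers here are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical 3-state chain; P_pid is the transition matrix induced by
# the default policy, r holds per-state rewards (negative = cost).
P_pid = np.array([[0.9, 0.1, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.1, 0.9]])
r = np.array([-1.0, -1.0, -5.0])   # a "dangerous" third state
lam = 1.0                          # KL-penalty weight

# Default representation: Z = (diag(exp(-r/lam)) - P_pid)^{-1}
Z = np.linalg.inv(np.diag(np.exp(-r / lam)) - P_pid)

# Successor representation for comparison: SR = (I - gamma * P_pid)^{-1}
gamma = 0.9
SR = np.linalg.inv(np.eye(3) - gamma * P_pid)

# The SR only counts discounted visits, so its column for the risky third
# state looks like the others; the corresponding DR column shrinks with
# the reward magnitude: Z is reward-aware.
print(Z)
print(SR)
```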
The paper makes four major theoretical contributions. First, it derives a dynamic‑programming (DP) update for Z:
Z₀ = R⁻¹, Z_{k+1} = R⁻¹ + R⁻¹ P_{πᵈ} Z_k,
with R = diag(exp(−r/λ)). By expanding this recursion into a Neumann series, the authors prove convergence to the closed‑form DR. Second, they translate the DP recursion into a sample‑based temporal‑difference (TD) algorithm that does not require knowledge of P_{πᵈ}. For each observed transition (s, a, r, s′) under πᵈ, the TD target is
Y = exp(r/λ)·(𝟙_{s=j} + Z(s′, j)) (non‑terminal)
or Y = exp(r/λ)·𝟙_{s=j} (terminal), and Z(s, j) is updated toward Y with step‑size α. Importance sampling can be used when the behaviour policy differs from πᵈ.
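A tabular sketch of this TD rule on a toy chain (illustrative values; the behaviour policy is assumed to equal πᵈ, so no importance sampling is needed, and the chain has no terminal states):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same kind of hypothetical 3-state chain as before (illustrative values).
P_pid = np.array([[0.9, 0.1, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.1, 0.9]])
r = np.array([-1.0, -1.0, -5.0])
lam, alpha = 1.0, 0.05
n = len(r)

Z = np.diag(np.exp(r / lam))   # initialize at Z_0 = R^{-1}
s = 0
for _ in range(100_000):
    s_next = rng.choice(n, p=P_pid[s])
    # TD target for every column j at once (no terminal states here):
    # Y = exp(r(s)/lam) * (1{s=j} + Z(s', j))
    target = np.exp(r[s] / lam) * (np.eye(n)[s] + Z[s_next])
    Z[s] += alpha * (target - Z[s])
    s = s_next

# Compare against the closed-form DR
Z_closed = np.linalg.inv(np.diag(np.exp(-r / lam)) - P_pid)
print(np.max(np.abs(Z - Z_closed)))   # small residual TD error
```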
Third, the authors analyze the eigen‑structure of the DR. They prove that when the reward is a constant (negative) value across all states, the DR and SR share the same eigenvectors, and their eigenvalues are related by
μ_SR,i = (1 + γ·(μ_DR,i⁻¹ − exp(−r/λ)))⁻¹,
where γ is the SR discount factor. This result explains why SR‑based techniques such as reward shaping and eigen‑option discovery have been successful, and it shows that those techniques can be transferred to DR, with the added benefit that DR’s eigenvectors also encode reward hotspots.
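The shared‑eigenvector relationship is easy to check numerically. Under the definitions Z = (diag(exp(−r/λ)) − P_{πᵈ})⁻¹ and SR = (I − γP_{πᵈ})⁻¹ with constant reward, each eigenvalue ξ of P_{πᵈ} yields μ_DR = 1/(exp(−r/λ) − ξ) and μ_SR = 1/(1 − γξ), which links the two spectra (toy symmetric matrix, illustrative values):

```python
import numpy as np

# Symmetric (hence doubly stochastic) default transition matrix and a
# constant negative reward, so DR and SR share eigenvectors.
P_pid = np.array([[0.6, 0.4],
                  [0.4, 0.6]])
r, lam, gamma = -1.0, 1.0, 0.9

xi = np.linalg.eigvalsh(P_pid)            # eigenvalues of P_pid
mu_dr = 1.0 / (np.exp(-r / lam) - xi)     # DR eigenvalues
mu_sr = 1.0 / (1.0 - gamma * xi)          # SR eigenvalues

# Predicted SR eigenvalues from the DR ones:
# mu_SR = (1 + gamma * (1/mu_DR - exp(-r/lam)))^{-1}
pred = 1.0 / (1.0 + gamma * (1.0 / mu_dr - np.exp(-r / lam)))
print(np.allclose(mu_sr, pred))   # True
```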
Fourth, the paper extends DR to state‑action‑dependent rewards. By lifting the LMDP formulation to the state‑action space, the authors obtain a matrix Z̄ = (diag(exp(−r̄/λ)) − P̄_{πᵈ})⁻¹, where r̄ ∈ ℝ^{|S||A|} and P̄_{πᵈ} is the transition matrix over state‑action pairs. The optimal Q‑values follow
exp(q∗/λ) = Z̄_{NN} P̄_{πᵈ}^{NT} exp(r̄_T/λ), where the subscripts N and T index the non‑terminal and terminal state‑action pairs,
and the optimal policy can be expressed analytically as
π∗(a|s) ∝ πᵈ(a|s)·exp(q∗(s,a)/λ), normalized over actions.
Thus the DR provides a closed‑form route to the optimal action‑values when the reward depends on actions.
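A minimal sketch of this policy‑retrieval step, using hypothetical q∗ values for a single state and a uniform default policy (all numbers illustrative):

```python
import numpy as np

# Hypothetical optimal action-values q_star for one state with 3 actions,
# and a uniform default policy pi_d; lam is the KL-penalty weight.
q_star = np.array([0.0, 1.0, -1.0])
pi_d = np.ones(3) / 3
lam = 1.0

# pi*(a|s) is proportional to pi_d(a|s) * exp(q*(s,a)/lam); normalizing
# over actions yields a proper distribution (in the LMDP the normalizer
# equals exp(v*(s)/lam)).
w = pi_d * np.exp(q_star / lam)
pi_star = w / w.sum()
print(pi_star)   # mass concentrates on the high-value action
```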
To make DR scalable, the authors introduce “default features”. Analogous to successor features, they propose a factorized parameterization Z(s, s′; θ) ≈ φ(s)ᵀ W φ(s′), where φ(s) are learned features that capture the default dynamics. This enables DR to be applied in large or continuous state spaces while preserving its reward‑aware properties.
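A sketch of the factorized form: when the number of features matches the number of states and the feature matrix has full rank, a least‑squares fit of W reproduces the tabular Z exactly (the random features and MDP values are illustrative; in practice φ and W would be learned, e.g. by TD):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tabular DR for a small hypothetical MDP
P_pid = np.array([[0.9, 0.1, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.1, 0.9]])
r = np.array([-1.0, -1.0, -5.0])
lam = 1.0
Z = np.linalg.inv(np.diag(np.exp(-r / lam)) - P_pid)

# Hypothetical state features phi(s), one row per state; with as many
# features as states, Z(s, s') ≈ phi(s)^T W phi(s') can be exact.
Phi = rng.normal(size=(3, 3))
W = np.linalg.pinv(Phi) @ Z @ np.linalg.pinv(Phi).T
print(np.allclose(Phi @ W @ Phi.T, Z))   # True when Phi has full rank
```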
The empirical section evaluates DR across four canonical RL settings where SR has previously shone:
- Reward shaping – In grid‑worlds with negative “danger” tiles, DR‑based shaping leads to faster convergence and higher final returns than SR, because the shaping potential directly reflects expected cumulative penalties.
- Option discovery – Eigen‑options derived from DR’s top eigenvectors produce policies that avoid low‑reward regions while still encouraging exploration, outperforming SR‑based eigen‑options in terms of cumulative reward during the discovery phase.
- Exploration – When combined with count‑based intrinsic bonuses, DR yields a more balanced exploration strategy that accounts for both transition uncertainty and reward risk, achieving quicker full‑state coverage.
- Transfer learning – In tasks where terminal rewards change, DR’s default features allow immediate recomputation of optimal values without re‑learning the transition dynamics, dramatically reducing sample complexity compared to SR, which must be re‑estimated.
Overall, the paper establishes the Default Representation as a principled, reward‑aware proto‑representation that generalizes the Successor Representation. It provides rigorous DP and TD learning algorithms, characterizes the eigen‑structure linking DR to SR, extends the formulation to state‑action rewards, and introduces a scalable feature‑based approximation. Empirical results demonstrate that DR not only matches but often surpasses SR across a suite of RL challenges, offering a compelling new tool for researchers seeking representations that simultaneously capture dynamics and reward structure.