Can We Really Learn One Representation to Optimize All Rewards?


As machine learning has moved towards leveraging large models as priors for downstream tasks, the community has debated the right form of prior for solving reinforcement learning (RL) problems. To prefetch as much computation as possible, one would attempt to learn a prior over policies for some yet-to-be-determined reward function. Recent work (forward-backward (FB) representation learning) has tried this, arguing that an unsupervised representation learning procedure can enable optimal control over arbitrary rewards without further fine-tuning. However, FB’s training objective and learning behavior remain mysterious. In this paper, we demystify FB by clarifying when such representations can exist, what its objective optimizes, and how it converges in practice. We draw connections with rank matching, fitted Q-evaluation, and contraction mapping. Our analysis suggests a simplified unsupervised pre-training method for RL that, instead of enabling optimal control, performs one step of policy improvement. We call our proposed method $\textbf{one-step forward-backward representation learning (one-step FB)}$. Experiments in didactic settings, as well as in $10$ state-based and image-based continuous control domains, demonstrate that one-step FB converges to errors $10^5\times$ smaller and improves zero-shot performance by $+24\%$ on average. Our project website is available at https://chongyi-zheng.github.io/onestep-fb.


💡 Research Summary

This paper critically examines the promise of Forward‑Backward (FB) representation learning, a recent unsupervised pre‑training method that claims a single pair of learned representations can be combined with any downstream reward to recover the optimal policy without further fine‑tuning. The authors first ask whether the “ground‑truth” FB representations assumed by prior work can even exist. By analyzing the successor measure Mπ—a matrix that encodes discounted state‑action visitation probabilities—they show that Mπ is full‑rank for discrete controlled Markov processes. Consequently, any representation that exactly factorizes Mπ must have a dimensionality d at least as large as the total number of state‑action pairs |S×A|. For continuous or large‑scale problems where |S×A| is effectively infinite, a finite‑dimensional representation cannot capture the true successor measure, invalidating the core assumption of FB.
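The full-rank claim above has a simple linear-algebra core: for a fixed policy over a finite set of state-action pairs, the successor measure is the geometric series $(1-\gamma)\sum_t \gamma^t P^t = (1-\gamma)(I-\gamma P)^{-1}$, which is invertible (hence full rank) whenever $\gamma < 1$. A minimal sketch on a toy random MDP, assuming a row-stochastic transition matrix `P` over $n = |S\times A|$ pairs (sizes and the random seed are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 6, 0.9  # toy: |S×A| = 6 state-action pairs, discount 0.9

# Random row-stochastic transition matrix over state-action pairs
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

# Successor measure: M = (1-γ) Σ_t γ^t P^t = (1-γ)(I - γP)^(-1)
M = (1 - gamma) * np.linalg.inv(np.eye(n) - gamma * P)

rank = np.linalg.matrix_rank(M)
print(rank)  # n — full rank, so an exact factorization needs d ≥ |S×A|
```

Because `M` is full rank, any exact factorization $F^\top B$ of it requires representation dimension $d \ge n$, which is the paper's impossibility argument for finite-dimensional FB representations in large or continuous spaces.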

Next, the paper re‑derives the FB training objective using the Least‑Squares Importance Fitting (LSIF) framework. LSIF treats density‑ratio estimation as a regression problem; FB chooses the ratio function g(s,a,z,s′,a′)=F(s,a,z)⊤B(s′,a′) and fits it to the ratio Mπ(s′,a′|s,a,z)/ρ(s′,a′). This yields a Monte‑Carlo loss L_MC_FB that can be rewritten as a temporal‑difference (TD) loss L_TD_FB by substituting the Bellman recursion for Mπ and introducing target networks. The resulting TD loss is mathematically equivalent to the original FB objective up to a constant scaling factor. Importantly, this loss is identical in spirit to the objective of Fitted Q‑Evaluation (FQE), except that FB estimates the successor‑measure ratio rather than Q‑values directly.
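The LSIF idea underlying this derivation can be checked in isolation: minimizing $\tfrac{1}{2}\,\mathbb{E}_q[g(x)^2] - \mathbb{E}_p[g(x)]$ over functions $g$ recovers the density ratio $g^*(x) = p(x)/q(x)$. A minimal sketch on a finite support, where the loss decouples pointwise and can be minimized by grid search (the distributions here are illustrative, not from the paper):

```python
import numpy as np

# LSIF: min_g  (1/2) E_q[g(x)^2] - E_p[g(x)]   →   g*(x) = p(x)/q(x)
p = np.array([0.1, 0.3, 0.6])
q = np.array([0.5, 0.3, 0.2])

# On a finite support the objective decouples per point:
#   d/dg(x) [ q(x) g(x)^2 / 2 - p(x) g(x) ] = 0   →   g(x) = p(x)/q(x)
grid = np.linspace(0.0, 10.0, 1001)
g_star = np.array([
    grid[np.argmin(0.5 * q[i] * grid**2 - p[i] * grid)] for i in range(len(p))
])
print(g_star)  # ≈ p/q = [0.2, 1.0, 3.0]
```

FB applies the same regression with $g(s,a,z,s',a') = F(s,a,z)^\top B(s',a')$ and the target ratio $M^\pi/\rho$; substituting the Bellman recursion for the successor measure turns this Monte-Carlo loss into the TD loss described above.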

The authors then expose a fundamental difficulty: the policy π, the forward representation F, and the backward representation B are mutually dependent. Because the Bellman operator defined by the TD loss does not enjoy a contraction property when the policy itself is a function of F, convergence is not guaranteed. They prove that for representation dimensions d < |S×A|, FB can incur arbitrarily large errors on the optimal Q‑value for some reward functions, contradicting earlier claims of bounded error. They also demonstrate an invariance property: ground‑truth forward representations must be unchanged under positive affine transformations of the latent variable, providing a practical test for whether learned representations have converged to the true ones.

Motivated by these insights, the paper proposes a simplified method called one‑step Forward‑Backward representation learning (one‑step FB). The key idea is to break the circular dependency by fixing a behavioral policy πβ (e.g., a dataset‑collecting policy) during pre‑training. The algorithm first estimates the successor measure Mπβ from offline data, then learns F and B by minimizing the LSIF‑based loss with respect to this fixed Mπβ and a reference measure ρ. When a new reward r is presented, the method computes a latent vector z_r = E_{(s,a)∼ρ}[r(s,a) B(s,a)] from reward‑labeled samples; the Q‑function for r is then approximated by F(s,a,z_r)⊤z_r, and acting greedily with respect to it performs one step of policy improvement over πβ rather than attempting full optimal control.
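The zero-shot step reduces to two matrix operations. A minimal sketch, assuming `B` and `F` are already-trained embeddings evaluated on a batch of reward-labeled samples from ρ (array shapes and the random stand-ins for the learned networks are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 16  # toy: n samples from ρ, latent dimension d

B = rng.standard_normal((n, d))  # stand-in for backward embeddings B(s, a)
F = rng.standard_normal((n, d))  # stand-in for forward embeddings F(s, a, z_r)
r = rng.standard_normal(n)       # rewards labeled on the same ρ-samples

# Latent reward encoding: z_r = E_ρ[r(s, a) B(s, a)], as a sample average
z_r = (r[:, None] * B).mean(axis=0)

# Q-value estimate for each candidate action: Q(s, a) ≈ F(s, a, z_r)ᵀ z_r
Q = F @ z_r
greedy = int(Q.argmax())  # act greedily → one step of improvement over πβ
```

In practice `F` is conditioned on `z_r` by a network forward pass; here it is a fixed random matrix purely to show the shapes of the computation.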

