Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers
In-context reinforcement learning (ICRL) leverages the in-context learning capabilities of transformer models (TMs) to efficiently generalize to unseen sequential decision-making tasks without parameter updates. However, existing ICRL methods rely on explicit reward signals during pretraining, which limits their applicability when rewards are ambiguous, hard to specify, or costly to obtain. To overcome this limitation, we propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback, eliminating the need for reward supervision. We study two variants that differ in the granularity of feedback: Immediate Preference-based RL (I-PRL) with per-step preferences, and Trajectory Preference-based RL (T-PRL) with trajectory-level comparisons. We first show that supervised pretraining, a standard approach in ICRL, remains effective under preference-only context datasets, demonstrating the feasibility of in-context reinforcement learning using only preference signals. To further improve data efficiency, we introduce alternative preference-native frameworks for I-PRL and T-PRL that directly optimize TM policies from preference data without requiring reward signals or optimal action labels. Experiments on dueling bandits, navigation, and continuous control tasks demonstrate that ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
💡 Research Summary
The paper introduces In‑Context Preference‑based Reinforcement Learning (ICPRL), a reward‑free paradigm for in‑context reinforcement learning (ICRL) that relies exclusively on preference feedback during both pre‑training and deployment. Existing ICRL methods require explicit scalar rewards, limiting their applicability when rewards are ambiguous, costly, or hard to specify. ICPRL addresses this by defining two feedback granularities: Immediate Preference‑based RL (I‑PRL), which receives per‑step binary preferences between two actions, and Trajectory Preference‑based RL (T‑PRL), which receives binary comparisons between whole trajectories. In both cases the learner never observes a reward function; preferences may be synthetic (via a Bradley‑Terry model) or human‑generated.
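The Bradley‑Terry model mentioned above generates a synthetic binary preference by comparing the latent utilities of the two options: the probability that option a is preferred over option b is the sigmoid of their utility difference. A minimal sketch (function name and utility values are illustrative, not from the paper):

```python
import numpy as np

def bradley_terry_preference(u_a, u_b, rng):
    """Sample a binary preference between two options with latent
    utilities u_a and u_b under the Bradley-Terry model:
    P(a preferred over b) = sigmoid(u_a - u_b).
    Returns 1 if a is preferred, 0 otherwise."""
    p_a = 1.0 / (1.0 + np.exp(-(u_a - u_b)))
    return int(rng.random() < p_a)

rng = np.random.default_rng(0)
# Option a has higher latent utility, so it is preferred ~88% of the time
# (sigmoid(2.0) ~= 0.88).
prefs = [bradley_terry_preference(1.0, -1.0, rng) for _ in range(1000)]
```

Per-step preferences in I‑PRL would compare two actions at a state this way, while T‑PRL would compare the accumulated utilities of two whole trajectories.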
The authors first adapt the supervised pre‑training framework of Decision‑Pretrained Transformers (DPT) to the preference‑only setting, creating Decision Preference‑Pretrained Transformer (DP²T). DP²T trains a transformer to predict optimal actions for query states conditioned on a context set built from preference data, replacing the reward‑labeled context that DPT requires; like DPT, however, this supervised baseline still relies on optimal‑action labels during pre‑training.
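The DP²T objective is standard supervised cross‑entropy: the transformer, conditioned on the preference context, outputs action logits for the query state, and the loss is the negative log‑likelihood of the optimal‑action label. A sketch of just the loss (the transformer producing `logits` is not shown; shapes are assumptions):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dp2t_loss(logits, optimal_action):
    """DPT-style supervised loss adapted to preference context:
    cross-entropy between the model's action distribution for the
    query state and the optimal-action label.

    logits:         (batch, num_actions), from a transformer conditioned
                    on a preference-only context (shapes are assumptions)
    optimal_action: (batch,) integer labels
    """
    probs = softmax(logits)
    n = logits.shape[0]
    return -np.log(probs[np.arange(n), optimal_action]).mean()
```

With uniform logits over three actions, the loss equals log 3, the entropy of a uniform guess, which is a convenient sanity check during training.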
Beyond this baseline, the paper proposes preference‑native training procedures that directly optimize the transformer using the structure of the preferences. For I‑PRL, a sigmoid‑based loss aligns the model’s action preference scores with observed binary choices. For T‑PRL, a Bradley‑Terry‑style likelihood is used to train the model to rank trajectories according to the observed preferences, without ever reconstructing a reward function.
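Both preference‑native losses reduce to a pairwise logistic form: minimize the negative log sigmoid of the score gap between the preferred and rejected option. A sketch under stated assumptions (the paper's exact parameterization and trajectory aggregation may differ; summing per‑step scores for T‑PRL is an assumption here):

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log sigmoid(x) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)

def iprl_step_loss(score_pref, score_rej):
    """I-PRL: align per-step action scores with the observed binary
    choice via the pairwise loss -log sigmoid(s_pref - s_rej)."""
    return -log_sigmoid(score_pref - score_rej).mean()

def tprl_trajectory_loss(step_scores_pref, step_scores_rej):
    """T-PRL: Bradley-Terry likelihood over whole trajectories; each
    trajectory's score is aggregated (summed, an assumption) from the
    model's per-step scores, with no reward function reconstructed."""
    s_pref = step_scores_pref.sum(axis=-1)
    s_rej = step_scores_rej.sum(axis=-1)
    return -log_sigmoid(s_pref - s_rej).mean()
```

Note that neither loss ever materializes a scalar reward: the model's scores are trained only to respect the observed orderings, which is what distinguishes these objectives from reward‑model‑based pipelines.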
Experiments span dueling bandits, navigation (2‑D and 3‑D mazes), and continuous control (MuJoCo). Across these domains, ICPRL models achieve performance comparable to state‑of‑the‑art reward‑based ICRL baselines (e.g., DPT, Algorithm Distillation). I‑PRL often converges faster and with fewer preference samples than T‑PRL, highlighting the information density of step‑wise feedback. The results demonstrate that preference information alone is sufficient for a transformer meta‑policy to generalize to unseen tasks with only a few additional preference queries at test time.
Key contributions include (1) the formulation of a reward‑free ICRL paradigm, (2) the demonstration that supervised pre‑training remains effective with preference‑only context, and (3) novel preference‑native training methods that remove dependence on optimal‑action labels. Limitations are acknowledged: most preference data are synthetically generated, human‑in‑the‑loop validation is limited, and large‑scale preference dataset collection still incurs cost. Future work is suggested on leveraging large language models as cheap annotators, multi‑modal preference integration, and Bayesian treatment of noisy preference feedback.