Scalable In-Context Q-Learning

Notice: This research summary and analysis were generated automatically with AI. For full accuracy, please refer to the original arXiv source.

Recent advances in language models have demonstrated remarkable in-context learning abilities, prompting the exploration of in-context reinforcement learning (ICRL) to extend this promise to decision-making domains. Because such domains involve more complex dynamics and temporal correlations, existing ICRL approaches may struggle to learn from suboptimal trajectories and to achieve precise in-context inference. In this paper, we propose Scalable In-Context Q-Learning (S-ICQL), an innovative framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining. We design a prompt-based multi-head transformer architecture that simultaneously predicts optimal policies and in-context value functions with separate heads. We pretrain a generalized world model to capture task-relevant information, enabling the construction of a compact prompt that supports fast and precise in-context inference. During training, we perform iterative policy improvement by fitting a state value function to an upper expectile of the Q-function, and distill the in-context value functions into policy extraction via advantage-weighted regression. Extensive experiments across a range of discrete and continuous environments show consistent performance gains over various types of baselines, especially when learning from suboptimal data. Our code is available at https://github.com/NJU-RL/SICQL.


💡 Research Summary

The paper introduces S‑ICQL (Scalable In‑Context Q‑Learning), a novel framework for in‑context reinforcement learning (ICRL) that overcomes the limitations of prior algorithm‑distillation (AD) and decision‑pretrained transformer (DPT) methods. Existing ICRL approaches typically feed raw transition sequences as prompts, which are long, redundant, and entangle task information with suboptimal behavior, making it difficult to learn optimal policies from imperfect data. S‑ICQL tackles this by (1) pre‑training a world model on a multi‑task offline dataset to learn the transition‑reward distribution p(s′,r|s,a), and (2) using the world model to compress a few transitions into a lightweight prompt that captures only the essential dynamics of each task.
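The prompt-compression idea above can be sketched in a toy setting. The snippet below is a minimal illustration, not the paper's method: a linear dynamics-and-reward model fit by least squares stands in for the pretrained world model p(s′,r|s,a), and its flattened parameters act as the compact, fixed-size prompt. The function name `compact_prompt` and the linear-model assumption are hypothetical; in S-ICQL the world model is a learned network and the prompt is a learned task representation.

```python
import numpy as np

def compact_prompt(transitions):
    """Toy sketch: compress a batch of (s, a, r, s') transitions into a
    fixed-size task embedding. A linear model fit by least squares plays
    the role of the world model; its flattened parameters serve as the
    prompt, whose size is independent of the number of transitions."""
    s, a, r, s_next = transitions
    x = np.concatenate([s, a], axis=1)            # model input: (s, a)
    y = np.concatenate([s_next, r[:, None]], 1)   # model target: (s', r)
    # Least-squares fit W of y ≈ x @ W captures task dynamics and reward.
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w.ravel()
```

Unlike raw transition prompts, whose length grows with the number of transitions and which entangle task identity with behavior quality, this embedding has constant size and encodes only the dynamics/reward structure of the task.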

The architecture is a prompt‑based multi‑head transformer. One head predicts the policy πθ(a|s;β) and another predicts an in‑context value V̂θ(s;β), while a separate head estimates Qθ(s,a). Instead of fitting the state value to a simple mean of Q‑values, S‑ICQL fits an upper expectile τ of the Q‑distribution, V̂θ(s) ≈ Exp^τ_a[Qθ(s,a)], which approaches max_a Qθ(s,a) as τ → 1 and thus enables policy improvement without querying out‑of‑distribution actions. The resulting value estimates are then distilled into the policy head via advantage‑weighted regression.
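The two training losses described above can be sketched numerically. This is a minimal illustration of the standard expectile loss and advantage-weighted regression coefficients, as used in expectile-based offline RL generally; the function names and the temperature/clip values are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

def expectile_loss(q, v, tau=0.9):
    """Asymmetric L2 (expectile) loss fitting V toward an upper
    expectile of Q: L^tau(u) = |tau - 1[u<0]| * u^2, with u = Q - V.
    For tau > 0.5, underestimating Q is penalized more than
    overestimating it, pushing V toward the upper tail of Q."""
    u = q - v
    weight = np.where(u < 0.0, 1.0 - tau, tau)
    return np.mean(weight * u ** 2)

def awr_weights(q, v, beta=3.0, clip=100.0):
    """Advantage-weighted regression coefficients exp((Q - V)/beta),
    clipped for numerical stability; actions with higher advantage
    get exponentially larger weight in the policy loss."""
    adv = q - v
    return np.minimum(np.exp(adv / beta), clip)

# Toy check: with tau = 0.9, a value estimate sitting below the
# Q-samples incurs a larger loss than one sitting near their top.
q = np.array([1.0, 2.0, 3.0])
loss_high_v = expectile_loss(q, v=np.full(3, 2.5), tau=0.9)
loss_low_v = expectile_loss(q, v=np.full(3, 1.5), tau=0.9)
```

Here `loss_low_v > loss_high_v`, so gradient descent drives V upward toward an optimistic (upper-expectile) estimate of Q, while the AWR weights turn that value estimate into a supervised policy-extraction signal.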

