Scaling In-Context Online Learning Capability of LLMs via Cross-Episode Meta-RL
Large language models (LLMs) achieve strong performance when all task-relevant information is available upfront, as in static prediction and instruction-following problems. However, many real-world decision-making tasks are inherently online: crucial information must be acquired through interaction, feedback is delayed, and effective behavior requires balancing information collection and exploitation over time. While in-context learning enables adaptation without weight updates, existing LLMs often struggle to reliably leverage in-context interaction experience in such settings. In this work, we show that this limitation can be addressed through training. We introduce ORBIT, a multi-task, multi-episode meta-reinforcement learning framework that trains LLMs to learn from interaction in context. After meta-training, a relatively small open-source model (Qwen3-14B) demonstrates substantially improved in-context online learning on entirely unseen environments, matching the performance of GPT-5.2 and outperforming standard RL fine-tuning by a large margin. Scaling experiments further reveal consistent gains with model size, suggesting significant headroom for learn-at-inference-time decision-making agents. Code reproducing the results in the paper can be found at https://github.com/XiaofengLin7/ORBIT.
💡 Research Summary
The paper tackles a fundamental limitation of large language models (LLMs): while they excel at static prediction and instruction‑following, they struggle to adapt online when information must be gathered through interaction and feedback is delayed. The authors formalize this as “multi‑episode in‑context online learning,” where an agent faces the same underlying task repeatedly (e.g., navigating a new interface or playing a game) and must use the transcript of all previous episodes stored in its context window to improve performance, without any weight updates at inference time.
To endow LLMs with this capability, the authors introduce ORBIT (Online Reinforcement‑Based In‑Context Training), a meta‑reinforcement‑learning (meta‑RL) framework. During meta‑training, a pretrained LLM is exposed to a diverse suite of partially observable Markov decision processes (POMDPs) – Minesweeper, Hangman, Wordle, Blackjack, etc. – and for each sampled task the model interacts for several episodes (T = 3‑5). Crucially, the entire cross‑episode history is concatenated and fed back to the model as part of its prompt, forcing it to learn a non‑Markovian “meta‑policy” that chooses actions based solely on the accumulated transcript. The objective is to maximize the expected number of successful task completions within a trajectory; a unified binary completion reward (0/1) is used across all tasks to avoid reward‑scale imbalances.
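The cross‑episode rollout described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the environment, policy, and function names are ours, and a toy one‑step guessing task stands in for the actual games. The key point it demonstrates is that the same transcript accumulates across episodes of the same task and is passed back to the policy at every step, so improvement must come from in‑context use of that history.

```python
class ToyGuessEnv:
    """Stand-in task: guess a hidden digit, one step per episode.
    The same secret persists across episodes, so the cross-episode
    transcript is informative (hypothetical example, not from the paper)."""
    def __init__(self, secret):
        self.secret = secret

    def reset(self):
        return "guess the number"          # fresh episode, same underlying task

    def step(self, action):
        reward = 1 if action == self.secret else 0
        return f"guessed {action}", reward, True   # one-step episodes


def remembering_policy(transcript, obs):
    """A hand-written in-context learner: skip guesses that already failed."""
    failed = {a for (_, _, a, r) in transcript if r == 0}
    for guess in range(10):
        if guess not in failed:
            return guess
    return 0


def run_cross_episode_rollout(policy, env, num_episodes=3):
    """Collect T episodes of the SAME task; the full transcript of all
    previous episodes is fed back to the policy at every step."""
    transcript = []                        # accumulated cross-episode history
    successes = 0
    for ep in range(num_episodes):
        obs, done, reward = env.reset(), False, 0
        while not done:
            action = policy(transcript, obs)   # non-Markovian meta-policy
            obs, reward, done = env.step(action)
            transcript.append((ep, obs, action, reward))
        successes += reward                # unified binary completion reward
    return successes
```

With `secret=2` and three episodes, the hand-written policy fails twice, eliminates those guesses from its history, and succeeds on the third episode – the qualitative behavior ORBIT trains the LLM to exhibit through the prompt alone.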
Policy optimization is performed with Group Relative Policy Optimization (GRPO), a trajectory‑level policy‑gradient method that normalizes rewards within a batch of K rollouts per task, thereby providing a low‑variance advantage estimate without requiring a value function. PPO‑style clipped updates with asymmetric clipping bounds stabilize training. This design aligns well with the sparse, outcome‑driven reward signal.
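The group‑relative advantage and clipped update described above can be sketched in a few lines. This is an illustrative reconstruction of the general GRPO recipe, not the paper's code; the clipping bounds shown are placeholders, and the exact asymmetric values used by the authors are not given here.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages: normalize the K trajectory rewards
    sampled for one task by their group mean and standard deviation,
    avoiding the need for a learned value function."""
    k = len(rewards)
    mean = sum(rewards) / k
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / k)
    if std == 0:
        return [0.0] * k        # all rollouts tied -> no learning signal
    return [(r - mean) / std for r in rewards]

def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with asymmetric bounds.
    `ratio` is pi_new(a|s) / pi_old(a|s); the epsilon values here are
    illustrative, not the paper's actual hyperparameters."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With the binary completion reward, a batch like `[1, 0, 1, 0]` yields advantages `[1, -1, 1, -1]`: successful rollouts are pushed up and failed ones down, relative to the group, which matches the sparse outcome‑driven signal the paper relies on.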
Evaluation is conducted on two completely unseen environments: a Maze navigation task and the game Mastermind. Both require the agent to explore, infer hidden structure, and then exploit that knowledge in later episodes. The authors compare ORBIT‑trained Qwen3‑14B against several baselines: GPT‑4o, GPT‑5.2 (high‑effort reasoning mode), standard RL fine‑tuning (PPO) on the same tasks, and the base LLM without meta‑training. Results show that after meta‑training, the 14B model matches GPT‑5.2’s success rate on episode 3 and substantially outperforms all other baselines, demonstrating that the model has learned to “learn within context.” Moreover, scaling experiments with 7B, 14B, and 34B variants reveal a consistent performance increase as model size grows, suggesting ample headroom for future improvements.
The paper’s contributions are threefold: (1) a clear definition and protocol for multi‑episode in‑context online learning; (2) a simple yet effective meta‑RL training pipeline that requires no external memory modules or elaborate prompt engineering; (3) empirical evidence that even modest open‑source LLMs can acquire general‑purpose, inference‑time decision‑making abilities comparable to proprietary frontier models when trained with ORBIT.
Limitations include the fixed context window size, which caps the amount of cross‑episode history that can be retained, and the binary reward formulation, which may be too coarse for tasks demanding nuanced behavior. The authors suggest future work on longer contexts, hierarchical meta‑learning, and integration with real‑world tool‑use scenarios.
In summary, ORBIT demonstrates that meta‑reinforcement learning is a viable path to transform pretrained LLMs into agents capable of efficient online learning at inference time, opening the door to more autonomous, adaptable AI systems without the need for continual weight updates.