Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Recently, there has been significant research interest in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinder wide adoption. In this paper, we build on the observation that multi-turn code generation can be formulated as a one-step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single-step code generation. Cobalt outperforms two multi-turn online RL baselines based on GRPO and VeRPO, and substantially improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 points on LiveCodeBench. We also analyze LLMs' in-context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate Cobalt as a promising solution for iterative decision-making tasks like multi-turn code generation. Our code and data are available at https://github.com/OSU-NLP-Group/cobalt.


💡 Research Summary

The paper tackles the challenge of training large language models (LLMs) for multi‑turn code generation, a task that naturally fits reinforcement learning (RL) but suffers from high computational cost and instability when using purely online RL methods. The authors observe that multi‑turn code generation can be modeled as a one‑step recoverable Markov decision process (MDP), meaning that any sub‑optimal action has a bounded negative impact that can be recovered in a single subsequent step. Leveraging this property, they propose Cobalt (Contextual Bandit Learning with Offline Trajectories), a hybrid approach that combines the data efficiency of offline RL with the performance benefits of online RL.

Cobalt proceeds in two phases. First, an offline dataset of code‑generation trajectories is collected using a reference LLM (e.g., fine‑tuned versions of R1‑Distill 8B and Qwen3 8B). For each coding problem, 16 independent trajectories are sampled, and only those containing at least one correct program are retained. Overly easy or completely failed trajectories are filtered out, and a max‑variance down‑sampling step limits each problem to at most four representative trajectories. These full trajectories are then split by turn, producing partial trajectories that serve as contextual states (the history of previous code and feedback).
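The offline collection pipeline above (sample, filter, down-sample, split by turn) can be sketched as follows. This is an illustrative sketch, not the paper's code: the trajectory representation, the `reward` field, and the length-spread heuristic standing in for max-variance down-sampling are all assumptions.

```python
def filter_and_downsample(trajectories, max_keep=4):
    """Hedged sketch of Cobalt's offline filtering step.

    A trajectory is assumed to be a list of per-turn dicts with a binary
    `reward` (1 = the program passed the tests). We keep trajectories that
    contain at least one correct program; problems whose samples are all
    solved or all failed would be filtered out upstream as too easy or
    too hard. If more than `max_keep` trajectories survive, we take an
    evenly spaced selection by length as a stand-in for the paper's
    max-variance down-sampling criterion (an assumption here).
    """
    solved = [t for t in trajectories if any(turn["reward"] == 1 for turn in t)]
    if len(solved) <= max_keep:
        return solved
    solved.sort(key=len)
    step = (len(solved) - 1) / (max_keep - 1)
    return [solved[round(i * step)] for i in range(max_keep)]


def split_by_turn(trajectory):
    """Split a full trajectory into partial-trajectory contexts.

    Each prefix of earlier turns becomes a contextual prompt; the model
    is later trained to generate the next program from that context.
    """
    return [trajectory[:t] for t in range(len(trajectory))]
```

For a two-turn trajectory, `split_by_turn` yields the empty context (the original problem alone) and the one-turn context containing the first attempt and its feedback.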

In the second phase, online learning is performed as a contextual bandit problem. Given a partial trajectory s_t = (o_0, a_0, …, o_t), the LLM generates a single next program a_t. The program is executed against the public test cases, yielding an immediate reward R(s_t, a_t).
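A single bandit update under this formulation can be sketched as below. Everything here is an assumption for illustration: `ToyPolicy` stands in for the trained LLM, the method names are not the paper's API, and the binary pass/fail reward is one plausible choice for R(s_t, a_t).

```python
class ToyPolicy:
    """Minimal stand-in for the trained LLM (illustration only)."""

    def __init__(self, program):
        self.program = program
        self.updates = []

    def generate(self, prompt):
        # A real policy would sample a program conditioned on the prompt.
        return self.program

    def update(self, prompt, action, reward):
        # A real implementation would take a policy-gradient step here
        # (e.g. GRPO-style); we only record the transition.
        self.updates.append((prompt, action, reward))


def render_prompt(partial_trajectory):
    """Concatenate the history (o_0, a_0, …, o_t) into one prompt string."""
    return "\n".join(str(turn) for turn in partial_trajectory)


def bandit_step(policy, partial_trajectory, run_tests):
    """One contextual-bandit step: generate a_t from s_t, score it once."""
    prompt = render_prompt(partial_trajectory)   # contextual state s_t
    program = policy.generate(prompt)            # single next program a_t
    reward = 1.0 if run_tests(program) else 0.0  # immediate reward R(s_t, a_t)
    policy.update(prompt, program, reward)
    return reward
```

Because each partial trajectory is treated as an independent context, the update needs no multi-turn credit assignment: the reward for a_t is observed immediately, which is what makes the bandit formulation cheaper and more stable than full multi-turn online RL.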

