Agentic Policy Optimization via Instruction-Policy Co-Evolution

Reading time: 5 minutes
...

📝 Original Info

  • Title: Agentic Policy Optimization via Instruction-Policy Co-Evolution
  • ArXiv ID: 2512.01945
  • Date: 2025-12-01
  • Authors: Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen

📝 Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent's policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, where an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.
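
The abstract describes INSPO's core loop at a high level: instruction candidates are sampled from a population alongside questions, rollout rewards are attributed back to the instruction that was used, and low performers are periodically pruned. The paper's actual implementation is not reproduced on this page, so the following is only a minimal Python sketch of that bookkeeping under stated assumptions; the names (InstructionPool, attribute, prune), the uniform sampling, and the keep-ratio pruning rule are illustrative choices, not the authors' code.

```python
# Minimal sketch (not the authors' implementation) of an instruction population
# whose members receive reward credit from RL rollouts and are pruned over time.
import random
from dataclasses import dataclass, field

@dataclass
class InstructionStats:
    text: str
    rewards: list = field(default_factory=list)

    @property
    def mean_reward(self) -> float:
        # Average verifiable reward observed under this instruction so far.
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

class InstructionPool:
    def __init__(self, seed_instructions, min_size=2):
        self.pool = [InstructionStats(t) for t in seed_instructions]
        self.min_size = min_size

    def sample(self) -> InstructionStats:
        # Uniform sampling here; the paper may weight candidates by past reward.
        return random.choice(self.pool)

    def attribute(self, inst: InstructionStats, reward: float) -> None:
        # Credit the rollout's reward to the instruction that produced it.
        inst.rewards.append(reward)

    def prune(self, keep_ratio=0.5) -> None:
        # Drop the lowest-scoring instructions, keeping at least min_size.
        ranked = sorted(self.pool, key=lambda s: s.mean_reward, reverse=True)
        keep = max(self.min_size, int(len(ranked) * keep_ratio))
        self.pool = ranked[:keep]

# Usage inside a stub RLVR loop (placeholders stand in for real rollouts):
pool = InstructionPool(["Answer step by step.", "Use the search tool before answering."])
for step in range(100):
    inst = pool.sample()
    question = f"question-{step}"   # placeholder for a sampled task question
    reward = random.random()        # placeholder for the verifiable reward
    pool.attribute(inst, reward)
    if step % 20 == 19:             # periodic pruning
        pool.prune()
```

In INSPO proper, the attribution rule and pruning schedule would be tied into the RLVR training loop; the sketch only illustrates the data flow implied by the abstract: sample, roll out, attribute, prune.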

💡 Deep Analysis

Figure 1

📄 Full Content

Agentic Policy Optimization via Instruction-Policy Co-Evolution

Han Zhou 1, Xingchen Wan 2 *, Ivan Vulić 1, Anna Korhonen 1

1 Language Technology Lab, University of Cambridge. 2 Machine Learning Research Group, University of Oxford. * Now at Google. Correspondence to: Han Zhou. Preprint. February 3, 2026.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent's policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy Co-Evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, where an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.

1. Introduction

The advent of large language models (LLMs) (Brown et al., 2020; Chung et al., 2024) has given rise to autonomous agents that are capable of reasoning, interpreting user intents, and tackling complex tasks via interacting with the environment (Yao et al., 2023). When paired with carefully engineered instructions, LLM-based agents have excelled in a wide range of applications, such as code generation (Jimenez et al., 2023), retrieval-augmented generation (Trivedi et al., 2023), and interactive decision-making (Su et al., 2025). Recently, the reinforcement learning (RL) (Sutton et al., 1999) paradigm has further advanced the reasoning capabilities of LLM agents, enabling them to learn policies from verifiable rewards (RLVR) (Shao et al., 2024) and achieve multi-turn and tool-integrated reasoning (Jin et al., 2025; Xue et al., 2025).

At the core of these agentic capabilities, instructions serve as the protocol for programming agents, characterizing their roles, and defining any available tools/interfaces for interaction. The performance of LLM-based agents has been shown to be highly dependent on the instruction (Zhou et al., 2025), and subtle changes can induce substantial differences in the generated trajectories, hindering robust and generalizable agent applications. The compounding effect of instructions is further amplified when LLMs are post-trained via RL, where changes in instructions can result in different initial spaces for policy learning, thereby largely affecting the converged performance after training (Liu et al., 2025a).
Consequently, instruction design becomes crucial for agent training and typically requires costly human effort for iterative refinement via trial and error.

The traditional paradigm of RLVR treats the instruction as a static, pre-defined input. However, the optimal instruction for the base model is not always known a priori and may even change as the model's policy improves and explores the interaction with the environment (Soylu et al., 2024). Recent findings also underscore the importance of the instruction for RL, where injecting reward specifications (Zhang et al., 2025) or in-context hints (Liu et al., 2025b) into the instruction better aligns the model with the learning objective and generates richer reward signals. While automated prompt optimization (APO) approaches (Zhou et al., 2023; Yang et al., 2024) exist for obtaining a better instruction before commencing the RL phase, generalizing them to the online setting of RL and incorporating adaptive knowledge during policy updates is rather non-trivial.

To bridge this gap, we propose to automate instruction learning not as a static term, but as an integral and dynamic component of the RL learning loop, allowing the instruction and policy to co-evolve in an online setup. We introduce INSPO (INStruction-POlicy co-evolution) for agentic policy optimization, a novel framework that delivers two major inn
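
The introduction and abstract also mention an on-policy reflection mechanism in which an LLM-based optimizer reads recent experience from a replay buffer and proposes new instructions suited to the current policy. Below is a hedged sketch of what such a reflection step could look like; the replay-record fields, the prompt wording, and the call_llm stand-in are all hypothetical illustrations, not details taken from the paper.

```python
# Hedged sketch of a reflection step: show an LLM-based optimizer a few recent
# (instruction, question, trajectory, reward) records and ask for a revised
# instruction. `call_llm` is a hypothetical stand-in for the optimizer model.
from collections import deque
from typing import Callable

# Each replay record is assumed to be a dict with keys:
# "instruction", "question", "trajectory", "reward".

def reflect_new_instruction(buffer: deque, call_llm: Callable[[str], str], k: int = 4) -> str:
    recent = list(buffer)[-k:]
    lines = [
        f"instruction: {r['instruction']}\nreward: {r['reward']:.2f}\n"
        f"trajectory (truncated): {r['trajectory'][:200]}"
        for r in recent
    ]
    prompt = (
        "You are optimizing the system instruction of a tool-using agent.\n"
        "Below are recent rollouts under the current policy with their rewards.\n\n"
        + "\n---\n".join(lines)
        + "\n\nWrite one improved instruction that should raise the reward."
    )
    # The proposed instruction would then be verified and added to the population.
    return call_llm(prompt).strip()

# Example with a dummy optimizer model standing in for a real LLM call:
buffer = deque(maxlen=256)
buffer.append({"instruction": "Answer step by step.", "question": "q1",
               "trajectory": "<search> ... </search> <answer> ... </answer>", "reward": 0.0})
new_inst = reflect_new_instruction(
    buffer, call_llm=lambda p: "Verify each retrieved fact before answering.")
print(new_inst)
```

Because the reflection is conditioned on rollouts from the current policy, the proposed instructions can keep pace with the policy as it improves, which is the co-evolution the paper argues static instructions cannot provide.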

📸 Image Gallery

  • demo3.png
  • main2.png
  • reward.png

Reference

This content is AI-processed based on open access ArXiv data.
