Prompted Policy Search: Reinforcement Learning
through Linguistic and Numerical Reasoning in LLMs
Yifan Zhou1,∗, Sachin Grover1,∗, Mohamed El Mistiri1, Kamalesh Kalirathnam1,
Pratyush Kerhalkar1, Swaroop Mishra3,†, Neelesh Kumar2, Sanket Gaurav2,
Oya Aran2, Heni Ben Amor1
1Interactive Robotics Lab, Arizona State University
2Research & Development, Procter & Gamble
1{yzhou298,sgrover6,melmisti,kamales1,pkerhalk,hbenamor}@asu.edu
2{kumar.n.40,gaurav.s,aran.o}@pg.com, 3swaroopranjanmishra@gmail.com
Abstract
Reinforcement Learning (RL) traditionally relies on scalar reward signals, limiting
its ability to leverage the rich semantic knowledge often available in real-world
tasks. In contrast, humans learn efficiently by combining numerical feedback
with language, prior knowledge, and common sense. We introduce Prompted
Policy Search (ProPS), a novel RL method that unifies numerical and linguistic
reasoning within a single framework. Unlike prior work that augments existing RL
components with language, ProPS places a large language model (LLM) at the
center of the policy optimization loop—directly proposing policy updates based on
both reward feedback and natural language input. We show that LLMs can perform
numerical optimization in-context, and that incorporating semantic signals, such as
goals, domain knowledge, and strategy hints, can lead to more informed exploration
and sample-efficient learning. ProPS is evaluated across 15 Gymnasium tasks,
spanning classic control, Atari games, and MuJoCo environments, and compared
to seven widely-adopted RL algorithms (e.g., PPO, SAC, TRPO). It outperforms
all baselines on 8 out of 15 tasks and demonstrates substantial gains when provided
with domain knowledge. These results highlight the potential of unifying semantics
and numerics for transparent, generalizable, and human-aligned RL.
1 Introduction
Reinforcement Learning (RL) [56] represents a foundational paradigm shift within the broader field
of machine learning. It allows autonomous agents to learn optimal behaviors through interactions
with their environment, i.e., through repeated trial and error. Over the past decades, RL has resulted
in remarkable successes across a range of challenging domains, including mastering strategic games
such as Backgammon [60] and Go [55], achieving human-level performance in robot table tennis [12],
or contributing to scientific breakthroughs such as protein folding [26]. Traditional RL relies
exclusively on numerical feedback in the form of scalar rewards. By contrast, humans often learn
and reason using natural language, prior knowledge, and common sense [37, 45]. Many real-world
tasks are accompanied by rich linguistic context such as manuals, domain descriptions, and expert
instructions, which standard RL algorithms are unable to exploit. Yet, this information can serve as a
powerful inductive bias: guiding exploration, encoding constraints, and expressing useful heuristics
to accelerate learning.
∗Corresponding Authors.
†Currently working at Microsoft AI
To bridge this gap between numerics and semantics, we introduce Prompted Policy Search (ProPS),
a new method that unifies numerical and linguistic reasoning within a single framework. ProPS
enables language models to process and act on both reward signals and natural language inputs, such as
high-level goals, domain knowledge, or strategic hints. This results in a more informed and adaptable
policy search process. While prior works have used Large Language Models (LLMs) [39] to augment
specific components of the RL pipeline (e.g., reward shaping [69], Q-function modeling [64], or
action generation [20]), these approaches still depend on conventional RL algorithms for optimization.
In contrast, we show that LLMs can directly perform policy search, treating optimization as an
in-context reasoning problem. To this end, we first demonstrate that LLMs are capable of numerical
optimization for RL tasks. We then extend this capability to incorporate linguistic abstractions,
enabling a unified reasoning strategy where semantic and quantitative signals complement one
another. The resulting approach accelerates convergence by incorporating prior knowledge, enforcing
constraints, and refining exploration. Moreover, it offers additional transparency by providing natural
language justifications of the proposed policy updates, an essential feature for domains requiring
safety and human oversight.
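To make this loop concrete, the following Python sketch shows one possible way such a prompted policy search loop could be instantiated; it is an illustration, not the authors' implementation. The query_llm client, its JSON reply format, the prompt wording, and the linear policy parameterization are assumptions made for this example.

```python
# Illustrative sketch only: a prompted policy search loop in the spirit of ProPS.
# The LLM client `query_llm`, its reply format, the prompt wording, and the linear
# policy parameterization are assumptions made for this example.
import json
import numpy as np


def rollout(env, theta, episodes=5):
    """Average episodic return of a linear threshold policy with parameters `theta`."""
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action = int(np.dot(theta, obs) > 0)  # binary action, e.g. CartPole-v1
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns))


def prompted_policy_search(env, query_llm, guidance="", iterations=20, dim=4):
    """LLM-in-the-loop policy search: the LLM sees the numerical history of evaluated
    policies plus optional language guidance and proposes the next parameter vector."""
    history, theta = [], np.zeros(dim)
    for _ in range(iterations):
        score = rollout(env, theta)
        history.append({"theta": theta.tolist(), "return": score})
        prompt = (
            "You are optimizing the parameters of a linear policy to maximize return.\n"
            f"Task guidance: {guidance}\n"
            f"Evaluations so far (parameters and average return): {json.dumps(history)}\n"
            f"Reply with JSON {{\"parameters\": [{dim} numbers], \"justification\": \"...\"}}."
        )
        reply = json.loads(query_llm(prompt))  # assumed: query_llm returns a JSON string
        theta = np.asarray(reply["parameters"], dtype=float)
    return max(history, key=lambda h: h["return"])
```

With a Gymnasium environment such as CartPole-v1 and any chat-style LLM wrapped as query_llm, the best parameters found can be read from the returned history entry; the justification field gives the natural language rationale accompanying each update. Our primary contributions are as follows: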
(C1) Unifying Numerical and Linguistic Reasoning for RL: We propose Prompted Policy Search
(ProPS), a framework to integrate scalar reward signals with natural language guidance
in a unified optimization loop, enabling language models to reason over both quantitative
feedback and semantic abstractions.
(C2) Flexible Integration of Human-Centric Knowledge: ProPS leverages the symbolic a