Expanding LLM Agent Boundaries with Strategy-Guided Exploration
Reinforcement learning (RL) has demonstrated notable success in post-training large language models (LLMs) as agents for tasks such as computer use, tool calling, and coding. However, exploration remains a central challenge in RL for LLM agents, especially as they operate in language-action spaces with complex observations and sparse outcome rewards. In this work, we address exploration for LLM agents by leveraging the ability of LLMs to plan and reason in language about the environment to shift exploration from low-level actions to higher-level language strategies. We thus propose Strategy-Guided Exploration (SGE), which first generates a concise natural-language strategy that describes what to do to make progress toward the goal, and then generates environment actions conditioned on that strategy. By exploring in the space of strategies rather than the space of actions, SGE induces structured and diverse exploration that targets different environment outcomes. To increase strategy diversity during RL, SGE introduces mixed-temperature sampling, which explores diverse strategies in parallel, along with a strategy reflection process that grounds strategy generation on the outcomes of previous strategies in the environment. Across UI interaction, tool-calling, coding, and embodied agent environments, SGE consistently outperforms exploration-focused RL baselines, improving both learning efficiency and final performance. We show that SGE enables the agent to learn to solve tasks too difficult for the base model.
💡 Research Summary
The paper tackles a fundamental bottleneck in reinforcement‑learning (RL) fine‑tuning of large language models (LLMs) used as autonomous agents: exploration in high‑dimensional, language‑rich action spaces is extremely inefficient, especially when rewards are sparse and only given at the end of long‑horizon tasks. Existing LLM‑RL methods typically sample actions directly from the pretrained policy, which is already heavily biased toward high‑probability token sequences. Consequently, the learning process tends to refine already‑known behaviours rather than discover novel solution trajectories, limiting the agent’s ability to solve tasks that the base model cannot handle.
Strategy‑Guided Exploration (SGE) introduces a high‑level abstraction called a “strategy”. At each decision step the agent first generates a concise natural‑language description of what it intends to achieve (e.g., “click the submit button after filling the form”). This strategy is sampled from a dedicated distribution Sπ that is prompted separately from the usual action generation. After a strategy is produced, the model generates a chain‑of‑thought and finally the concrete environment actions conditioned on that strategy. By shifting the exploration focus from low‑level token actions to the space of strategies, SGE creates structured, diverse exploratory behaviours that correspond to distinct outcomes in the environment.
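The two-stage rollout described above can be sketched as follows. This is a minimal illustration, assuming a generic `generate` callable standing in for an LLM call; the prompt wording is invented for illustration and is not the paper's actual templates.

```python
# Sketch of SGE's two-stage decision step: sample a strategy first,
# then generate the action conditioned on it. `generate` is an assumed
# hook mapping a prompt string to model output.

def sge_step(generate, observation, goal):
    """One decision step: sample a strategy, then act conditioned on it."""
    # Stage 1: a concise natural-language strategy.
    strategy = generate(
        f"Goal: {goal}\nObservation: {observation}\n"
        "Propose one concise strategy for making progress:"
    )
    # Stage 2: reasoning and the concrete environment action,
    # conditioned on the sampled strategy.
    action = generate(
        f"Goal: {goal}\nObservation: {observation}\nStrategy: {strategy}\n"
        "Reason step by step, then output the next environment action:"
    )
    return strategy, action
```

Because the strategy is generated before any action tokens, two rollouts that sample different strategies are steered toward different environment outcomes even when the downstream action decoding is near-deterministic.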
Two technical mechanisms make SGE effective:
- Mixed‑temperature sampling – Tokens belonging to the strategy are sampled with a high temperature (e.g., T≈1.0) while the subsequent reasoning and action tokens are generated with a low temperature (e.g., T≈0.2). High temperature at the strategy level yields a wide variety of high‑level plans, whereas low temperature for actions ensures that each plan is executed consistently, avoiding the superficial noise that high‑temperature action sampling would introduce (e.g., tapping slightly different coordinates on the same UI button).
- Strategy reflection – During training the system maintains a buffer of previously executed strategies together with their success/failure outcomes. With a certain probability, a failed rollout triggers a “negative reflection” prompt that asks the model to critique the unsuccessful plan, while successful rollouts receive a “positive reflection” prompt that highlights what worked. This feedback loop encourages the generation of strategies that are both novel and grounded in the observed dynamics of the environment, reducing redundancy across episodes.
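The first mechanism can be made concrete with a small decoding sketch. This is an illustrative implementation, not the paper's code: `logits_fn` and `in_strategy_fn` are assumed hooks that, respectively, return next-token logits and report whether decoding is still inside the strategy segment.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Softmax sampling at a given temperature (sharper as T -> 0)."""
    scaled = [l / max(temperature, 1e-6) for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(exps)
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if acc >= r:
            return i
    return len(exps) - 1

def mixed_temperature_sample(logits_fn, in_strategy_fn, steps,
                             t_strategy=1.0, t_action=0.2, seed=0):
    """Decode `steps` tokens: high temperature while the decoder is still
    emitting the strategy, low temperature for reasoning/action tokens."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(steps):
        t = t_strategy if in_strategy_fn(tokens) else t_action
        tokens.append(sample_token(logits_fn(tokens), t, rng))
    return tokens
```

The point of the split is that diversity is injected exactly where it pays off: varied strategies explore different outcomes, while near-greedy action decoding executes each strategy faithfully.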
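The reflection mechanism above can likewise be sketched in a few lines. The prompt wording and buffer format here are assumptions for illustration, not the paper's actual implementation.

```python
import random

def reflection_prompt(buffer, rng=random):
    """Turn a sampled past rollout into a reflection instruction.

    `buffer` holds (strategy, succeeded) pairs from earlier rollouts.
    Failed strategies yield a "negative reflection" (critique), and
    successful ones a "positive reflection" (what worked), grounding
    the next strategy in observed environment outcomes.
    """
    if not buffer:
        return "Propose a new strategy."
    strategy, succeeded = rng.choice(buffer)
    if succeeded:
        return (f"The strategy '{strategy}' succeeded. "
                "Explain what made it work, then propose a refinement.")
    return (f"The strategy '{strategy}' failed. "
            "Critique why it failed, then propose a different strategy.")
```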
SGE also leverages parallelism: K strategies are generated per task in a single forward pass, matching the requirements of group‑based RL algorithms such as GRPO or RLOO, which already need K parallel responses for advantage estimation. Thus SGE adds no extra computational burden but supplies a richer set of exploratory samples, decreasing gradient variance.
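The fit with group-based RL can be illustrated with a leave-one-out baseline, as used by RLOO (GRPO instead normalizes by the group mean and standard deviation). This is a generic sketch of the advantage computation, not SGE-specific code: each of the K strategy rollouts for a task is scored against the mean reward of the other K−1.

```python
def group_advantages(rewards):
    """RLOO-style group-relative advantages over K rollouts of one task.

    Each rollout's advantage is its reward minus the mean reward of the
    remaining K-1 rollouts, so no learned value baseline is needed.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least 2 rollouts per group")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

With sparse outcome rewards, these advantages are informative only when the K rollouts reach different outcomes; this is exactly why diverse strategies (rather than K near-identical action sequences) reduce the number of all-zero or all-one groups.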
Empirical evaluation spans four heterogeneous domains: UI manipulation, tool‑calling, multi‑turn coding, and embodied robot navigation. Across all benchmarks, SGE consistently outperforms strong exploration‑focused baselines (ε‑greedy, entropy bonuses, UCB, Random Network Distillation, etc.) in both learning speed and final success rate. Notably, in the coding environment the base LLM achieves a pass@2048 of 0.69, while the best RL baseline reaches only pass@1 = 0.64. After applying SGE, the agent attains pass@1 = 0.73, demonstrating that the method can discover solutions beyond the ceiling of the pretrained model.
Ablation studies reveal that (i) mixed‑temperature sampling alone improves strategy diversity by a factor of ~2.3 and raises success by ~6 percentage points compared to uniform temperature; (ii) strategy reflection alone reduces strategy duplication by 35 % and adds ~4 percentage points to final performance; (iii) combining both yields the largest gains, confirming a synergistic effect. Experiments with models of different scales (2 B, 6 B, 70 B parameters) show that even modest‑size LLMs benefit from SGE, while larger models gain additional fine‑tuning precision, surpassing their own pre‑RL performance by 3 % or more.
The authors acknowledge limitations: the approach relies on the ability to express useful high‑level plans in natural language, which may be challenging for purely physical or non‑verbal tasks. The current reflection mechanism is simple and could be enriched with meta‑learning or self‑critique architectures. Future work is suggested in (a) extending strategies to multimodal representations (visual, proprioceptive), (b) integrating external planners (e.g., Monte‑Carlo Tree Search) with strategy conditioning, and (c) applying SGE in continual‑learning settings where long‑term strategy retention matters.
In summary, Strategy‑Guided Exploration reframes exploration for LLM agents as a language‑level planning problem. By generating diverse, high‑level strategies with mixed‑temperature sampling and grounding them through reflection, SGE enables agents to traverse previously unreachable regions of the solution space, achieving superior learning efficiency and higher final performance across a broad set of agentic tasks. This work opens a promising direction for scaling LLM‑based RL beyond the confines of the pretrained policy’s bias.