Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization
Software engineering presents complex, multi-step challenges for Large Language Models (LLMs), requiring reasoning over large codebases and coordinated tool use. The difficulty of these tasks is exemplified by benchmarks like SWE-bench, where current LLMs still struggle to resolve real-world issues. A promising approach to enhance performance is test-time scaling (TTS), but its gains are heavily dependent on the diversity of model outputs. While standard alignment methods such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are effective at aligning model outputs with human preferences, this process can come at the cost of reduced diversity, limiting the effectiveness of TTS. Additionally, existing preference optimization algorithms are typically designed for single-turn tasks and do not fully address the complexities of multi-turn reasoning and tool integration required for interactive coding agents. To bridge this gap, we introduce EntroPO, an entropy-enhanced framework that adapts existing preference optimization algorithms to the multi-turn, tool-assisted setting. EntroPO augments the preference objective to explicitly preserve policy entropy and generalizes learning to optimize over multi-turn interactions rather than single-turn responses. We validate EntroPO by fine-tuning a diverse suite of models from different families and sizes (up to 106B parameters). To maximize performance gains from TTS, we further propose a hybrid best-trajectory selection scheme combining a learned verifier model with model-free approaches. On the SWE-bench leaderboard, our approach establishes new state-of-the-art results among open-weight models. A 30B-parameter model trained with EntroPO ranks 1st on SWE-bench Lite and 4th on SWE-bench Verified on the open-weight leaderboard, surpassed only by models with over 10x more parameters (e.g., >350B).
💡 Research Summary
The paper tackles two intertwined challenges that limit the effectiveness of large language models (LLMs) on real‑world software engineering tasks: (1) the need for multi‑turn, tool‑augmented reasoning, and (2) the reliance on test‑time scaling (TTS) which suffers when model outputs collapse into a narrow set of solutions. Existing alignment methods such as Direct Preference Optimization (DPO) and Kahneman‑Tversky Optimization (KTO) improve alignment with human preferences but inadvertently reduce the entropy of the policy, leading to a “winner‑take‑all” dynamic. This diversity collapse makes additional sampling during TTS largely redundant, limiting the marginal gains from deeper search or larger sample budgets.
EntroPO is introduced as an entropy‑enhanced preference‑optimization framework designed specifically for multi‑turn, tool‑using coding agents. The authors formalize the coding task as a finite‑horizon Markov Decision Process (MDP) where each state contains the full interaction history and each action corresponds to a tool call (search, execution, patching, etc.). Preferences are collected at the trajectory level via an automated oracle that labels a trajectory as “preferred” if the final patch passes all unit tests. Using the Bradley‑Terry model to express preference probabilities, the standard DPO/KTO objectives are augmented with a weighted entropy regularizer λ·H(π). This term penalizes low‑entropy policies, encouraging the model to retain a broader distribution over viable action sequences even after alignment.
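To make the augmented objective concrete, the sketch below shows a DPO-style preference loss with a weighted entropy bonus λ·H(π), evaluated on a toy discrete policy. The function names, the two-action toy distribution, and the specific β and λ values are illustrative assumptions, not the paper's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def entropy(probs):
    """Shannon entropy H(pi) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropo_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                     policy_probs, beta=0.1, lam=0.01):
    """DPO preference loss minus a weighted entropy bonus lam * H(pi).

    logp_w / logp_l: policy log-probs of the preferred / dispreferred
    trajectory; ref_logp_w / ref_logp_l are the same quantities under
    the frozen reference policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    preference_loss = -math.log(sigmoid(margin))
    # Subtracting lam * H(pi) rewards policies that keep a broad
    # action distribution, counteracting diversity collapse.
    return preference_loss - lam * entropy(policy_probs)

# For the same preference margin, a more diverse (higher-entropy)
# policy attains a lower loss:
peaked = entropo_dpo_loss(-1.0, -2.0, -1.5, -1.5, [0.9, 0.1])
diverse = entropo_dpo_loss(-1.0, -2.0, -1.5, -1.5, [0.5, 0.5])
assert diverse < peaked
```

The key design point is that the entropy bonus acts on the policy's full action distribution rather than on the two trajectories in the preference pair, so the regularizer can preserve probability mass on alternative viable solutions.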
The paper derives a regularized value function Vπ and Q‑function that incorporate both a KL‑divergence term (to keep the learned policy close to a reference policy) and the entropy term. By rearranging these expressions, the optimal policy takes a soft‑max form over the regularized Q‑values, mirroring maximum‑entropy RL formulations. Lemma 3.1 (Reward Sum Decomposition) shows how total accumulated reward can be expressed in terms of the optimal policy, the reference policy, and the initial value. Building on this, Theorem 3.2 provides a multi‑turn DPO‑style loss (EntroPO‑DPO) that directly incorporates the entropy regularizer. An analogous EntroPO‑KTO loss is also derived. Crucially, the analysis demonstrates that the entropy term preferentially boosts updates for high‑utility trajectories that are under‑represented by the reference policy, thereby preserving diversity specifically among correct solutions.
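For intuition on why the optimal policy takes a soft-max form, a standard maximum-entropy RL derivation (our notation; the paper's exact constants may differ) applies at each state:

```latex
% Per-state objective with both KL and entropy regularization:
J(\pi) = \sum_a \pi(a \mid s)\, Q^\ast(s,a)
         \;-\; \beta\, \mathrm{KL}\!\big(\pi(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\big)
         \;+\; \lambda\, H\!\big(\pi(\cdot \mid s)\big)

% Stationarity under the normalization constraint yields the soft-max policy:
\pi^\ast(a \mid s) \;\propto\;
    \pi_{\mathrm{ref}}(a \mid s)^{\beta/(\beta+\lambda)}
    \exp\!\left(\frac{Q^\ast(s,a)}{\beta+\lambda}\right)
```

Setting λ = 0 recovers the familiar KL-regularized policy π* ∝ π_ref·exp(Q*/β); a positive λ both flattens the exponent and down-weights the reference policy's influence, which is one way to see why high-utility actions that are rare under π_ref receive larger updates.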
Beyond training, the authors address the inference side by proposing a hybrid best‑trajectory selector for TTS. The selector combines (i) a learned verifier model that scores each trajectory’s quality and (ii) model‑free heuristics such as test‑pass flags, trajectory length, and number of tool calls. This hybrid approach mitigates verifier errors while exploiting the richer candidate set produced by EntroPO‑trained policies.
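The hybrid selection logic can be sketched as a weighted blend of the verifier score and the model-free heuristics described above. The dictionary keys, the 0.5 blending weight, and the per-tool-call penalty are illustrative assumptions, not the paper's exact scoring rule.

```python
def select_trajectory(trajectories, verifier_weight=0.5):
    """Pick the best candidate trajectory by blending a learned
    verifier score with simple model-free heuristics.

    Each trajectory is a dict with (assumed) keys:
      verifier_score  - quality score from the learned verifier, in [0, 1]
      tests_passed    - whether the final patch passed the unit tests
      num_tool_calls  - length proxy; shorter trajectories are preferred
    """
    def hybrid_score(t):
        # Model-free heuristic: reward passing tests, lightly penalize
        # long tool-call sequences.
        heuristic = (1.0 if t["tests_passed"] else 0.0) - 0.01 * t["num_tool_calls"]
        return (verifier_weight * t["verifier_score"]
                + (1.0 - verifier_weight) * heuristic)
    return max(trajectories, key=hybrid_score)

candidates = [
    {"verifier_score": 0.9, "tests_passed": False, "num_tool_calls": 5},
    {"verifier_score": 0.4, "tests_passed": True,  "num_tool_calls": 10},
]
# The test-passing trajectory wins despite its lower verifier score,
# illustrating how the heuristics hedge against verifier errors.
best = select_trajectory(candidates)
```

Blending the two signals means a confidently wrong verifier cannot single-handedly discard a trajectory whose patch demonstrably passes the tests, which is the failure mode the hybrid design targets.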
Empirical evaluation spans a suite of models ranging from 7B to 106B parameters, fine‑tuned on the same preference dataset. Baselines include standard DPO, KTO, and the recent multi‑turn DPO (M‑DPO). Experiments are conducted on both SWE‑Bench‑LITE and SWE‑Bench‑VERIFIED. Results show that EntroPO consistently yields higher policy entropy (a 15–30% increase) and superior test‑pass rates. Notably, a 30B model trained with EntroPO achieves 1st place on SWE‑Bench‑LITE and 4th place on SWE‑Bench‑VERIFIED among all open‑weight models, rivaling closed‑weight models that are more than ten times larger (>350B). The hybrid selector further improves performance, outperforming pure model‑free best‑of‑N sampling by 4.3 percentage points and pure verifier‑based ranking by 2.1 percentage points.
The contributions are threefold: (1) an entropy‑regularized multi‑turn preference‑optimization framework that directly combats diversity collapse, (2) a theoretical analysis showing how the entropy term preserves diversity among correct solutions, and (3) a hybrid TTS selector that leverages the richer candidate distribution. The work opens several avenues for future research, including automatic tuning of the entropy weight λ, integration with more complex toolchains (debuggers, refactoring tools, CI pipelines), and hybrid training that blends offline EntroPO with online RLHF for long‑term policy stability. Overall, the paper provides a compelling solution for building powerful, exploratory coding agents capable of tackling the multi‑step, tool‑intensive challenges of modern software engineering.