SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents
Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM-based quality estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE-Bench Pro and Multilingual validates the generalizability of SWE-Replay, establishing it as a robust foundation for efficient test-time scaling of software engineering agents.
💡 Research Summary
The paper addresses the high computational cost associated with test‑time scaling for large‑language‑model (LLM) agents that perform software‑engineering (SWE) tasks. Traditional test‑time scaling generates multiple candidate solutions by sampling from scratch with a positive temperature and then selects the best answer using test‑based feedback. While this approach yields a log‑linear performance improvement with the number of samples, it is expensive because each new candidate requires a full trajectory from the issue description to a final patch. Recent attempts to reduce this cost—such as SWE‑Search (a Monte‑Carlo Tree Search variant) and Satori‑SWE (self‑improvement via a reward model)—rely on value or reward estimators. These estimators suffer from model mis‑calibration and are tightly coupled to pipeline‑style agents that use predefined tools, making them unsuitable for modern agentic frameworks like SWE‑agent or OpenHands, which synthesize custom bash scripts without fixed templates.
SWE‑Replay is introduced as the first efficient, generalizable test‑time scaling technique that eliminates the need for noisy value estimates. Its core insight is to recycle previously sampled trajectories by resuming exploration at carefully selected intermediate steps rather than always starting from scratch. The system maintains an archive of all generated trajectories. For each scaling iteration, it stochastically decides between (i) pure exploration—generating a fresh trajectory from the issue description—and (ii) exploitation—picking a trajectory from the archive, restoring the environment to just before a chosen step sₜ, and branching by sampling a new step s′ₜ to replace sₜ.
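The explore/exploit loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`run_from_scratch`, `branch_from`), the uniform parent choice, and the fixed `explore_prob` schedule are all assumptions for the sake of the example.

```python
import random

def swe_replay_iteration(archive, run_from_scratch, branch_from,
                         explore_prob=0.5, rng=random):
    """One scaling iteration of the replay loop (illustrative sketch).

    `run_from_scratch` samples a fresh trajectory from the issue
    description; `branch_from` resumes an archived trajectory at a
    selected intermediate step. Both are injected callables here.
    """
    if not archive or rng.random() < explore_prob:
        # (i) Pure exploration: a full trajectory from scratch.
        trajectory = run_from_scratch()
    else:
        # (ii) Exploitation: branch an archived trajectory at a chosen step.
        parent = rng.choice(archive)
        trajectory = branch_from(parent)
    archive.append(trajectory)  # every trial is recycled into the archive
    return trajectory
```

With an empty archive the loop necessarily explores; once trajectories accumulate, the coin flip trades fresh sampling against cheaper branching.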
Step selection is driven by two orthogonal criteria. First, “repository‑exploration rarity” is measured by abstracting each step into a state defined as the set of repository files explored up to that point. States that have been visited by fewer concrete steps are deemed rarer. A softmax over the inverse of the step count assigns higher sampling probability to rarer states, encouraging the agent to revisit under‑explored parts of the codebase. Second, “reasoning intensity” is quantified by counting the number of logical paragraphs in the agent’s natural‑language reasoning at a step, rather than raw token length, because paragraph count better captures substantive deliberation. Within a selected state, steps with more paragraphs receive higher softmax probability, ensuring that the replay focuses on decision points where the agent performed deep analysis.
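The two-stage selection described above—a softmax over inverse state-visit counts, followed by a softmax over paragraph counts within the chosen state—can be sketched as below. The step schema (`state` as a frozenset of explored files, `paragraphs` as the paragraph count) and the unit temperature are illustrative assumptions, not the paper's exact formulation.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_branch_step(archive_steps, rng=random):
    """Pick a branch point using rarity, then reasoning intensity.

    Each step is a dict with:
      'state'      : frozenset of repo files explored up to that step
                     (the abstraction used to group concrete steps)
      'paragraphs' : number of logical paragraphs in the step's reasoning
    Field names are illustrative placeholders.
    """
    # Stage 1 — rarity: count concrete steps per abstract state, then
    # softmax over the inverse counts so rarer states are sampled more often.
    counts = {}
    for step in archive_steps:
        counts[step['state']] = counts.get(step['state'], 0) + 1
    states = list(counts)
    state_probs = softmax([1.0 / counts[s] for s in states])
    chosen_state = rng.choices(states, weights=state_probs)[0]

    # Stage 2 — reasoning intensity: within the chosen state, softmax over
    # paragraph counts so more deliberative steps are favored.
    candidates = [s for s in archive_steps if s['state'] == chosen_state]
    step_probs = softmax([c['paragraphs'] for c in candidates])
    return rng.choices(candidates, weights=step_probs)[0]
```

Using paragraph count rather than raw token length as the intensity signal follows the paper's observation that paragraphs better track substantive deliberation than verbosity.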
Environment restoration avoids storing full snapshots. The authors observe that agents mainly modify the repository via text edits, file creation, or deletion. Therefore, they record only file‑level diffs for each step. When replaying, if no non‑repo side effects (e.g., package installations) are detected, the diffs are applied directly, bypassing costly action re‑execution. If side effects exist, the full action sequence is replayed. This hybrid approach dramatically reduces storage and runtime overhead.
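A toy version of this hybrid restoration might look like the following, with the repository modeled as a `{path: contents}` dict and each step recording its file-level edits. The field names (`edits`, `has_side_effects`), the `None`-means-deletion convention, and the injected `replay_action` fallback are assumptions for illustration; the paper presumably operates on real diffs and containers.

```python
def restore_environment(trajectory, branch_idx, replay_action=None):
    """Rebuild the repo state just before step `branch_idx`.

    Each step is a dict with:
      'edits'            : {path: new_contents or None (None = deletion)}
      'has_side_effects' : True for non-repo actions (e.g. package installs)
    `replay_action(repo, step) -> repo` re-executes one action and is only
    needed on the slow path.
    """
    prefix = trajectory[:branch_idx]
    if any(step['has_side_effects'] for step in prefix):
        # Slow path: side effects outside the repo can't be captured by
        # file diffs, so re-execute the full action sequence.
        repo = {}
        for step in prefix:
            repo = replay_action(repo, step)
        return repo
    # Fast path: apply only the recorded file-level diffs, skipping
    # costly action re-execution entirely.
    repo = {}
    for step in prefix:
        for path, contents in step['edits'].items():
            if contents is None:
                repo.pop(path, None)   # file deletion
            else:
                repo[path] = contents  # edit or creation
    return repo
```

The fast path is the common case, which is why recording diffs instead of full snapshots keeps both storage and replay time low.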
Empirical evaluation is conducted on three benchmarks: SWE‑Bench Verified, SWE‑Bench Pro, and SWE‑Bench Multilingual. On SWE‑Bench Verified, SWE‑Replay reduces the average sampling cost by up to 17.4% while maintaining or improving the resolve rate by up to 3.8% across three LLM backends (e.g., GPT‑4, Claude, Llama‑2) and two agent scaffolds (SWE‑agent, OpenHands). The method’s benefits persist on the more complex Pro and multilingual datasets, demonstrating robustness to varied issue types, languages, and repository sizes. An analysis of file‑access patterns shows that SWE‑Replay shifts exploration toward the long tail of repository files, confirming that the rarity‑driven selection expands the search space beyond the shallow regions typically covered by naive scaling.
The authors also provide a theoretical intuition: by branching at high‑impact, rare states, the expected quality of the sampled set improves because the variance of the underlying performance distribution is reduced, and the probability of discovering a high‑quality patch increases. This aligns with the empirical gains observed.
In summary, SWE‑Replay offers a practical, model‑agnostic solution for test‑time scaling of modern SWE agents. It leverages trajectory replay, rarity‑aware state selection, and reasoning‑intensity weighting to cut computational cost without sacrificing—and often enhancing—solution quality. Its independence from value or reward models makes it robust to calibration errors, and its compatibility with agents that generate arbitrary bash scripts ensures broad applicability across current and future agentic frameworks.