AgentStepper: Interactive Debugging of Software Development Agents
Software development agents powered by large language models (LLMs) have shown great promise in automating tasks like environment setup, issue solving, and program repair. Unfortunately, understanding and debugging such agents remain challenging due to their complex and dynamic nature. Developers must reason about trajectories of LLM queries, tool calls, and code modifications, but current techniques reveal little of this intermediate process in a comprehensible format. The key insight of this paper is that debugging software development agents shares many similarities with conventional debugging of software programs, yet requires raising the level of abstraction from low-level implementation details to high-level agent actions. Drawing on this insight, we introduce AgentStepper, the first interactive debugger for LLM-based software engineering agents. AgentStepper enables developers to inspect, control, and interactively manipulate agent trajectories. It represents trajectories as structured conversations among an LLM, the agent program, and tools. It supports breakpoints, stepwise execution, and live editing of prompts and tool invocations, while capturing and displaying intermediate repository-level code changes. Our evaluation applies AgentStepper to three state-of-the-art software development agents, ExecutionAgent, SWE-Agent, and RepairAgent, showing that integrating the approach into existing agents requires minor code changes (39–42 edited lines). Moreover, we report on a user study with twelve participants, indicating that AgentStepper improves the ability of participants to interpret trajectories (from 64% to 67% mean performance) and to identify bugs in the agent’s implementation (from 17% to 60% success rate), while reducing perceived workload (e.g., frustration reduced from 5.4/7.0 to 2.4/7.0) compared to conventional tools.
💡 Research Summary
AgentStepper is presented as the first interactive debugger specifically designed for large‑language‑model (LLM)‑driven software development agents. The paper begins by identifying four core challenges that developers face when trying to understand and debug such agents: (C1) prompt engineering, (C2) tracking the high‑level control and data flow between the LLM and tools, (C3) locating bugs in the orchestrating agent program, and (C4) reviewing intermediate code changes made during execution. Existing solutions—raw log files or generic log viewers—provide only linear, unstructured token streams and lack any means for interactive control, making them ill‑suited for these challenges.
Drawing inspiration from conventional software debugging, the authors propose to raise the level of abstraction from low‑level implementation details to high‑level agent actions. They adapt classic debugging concepts—breakpoints, stepwise execution, and live state editing—to the domain of LLM‑based agents. The resulting system, AgentStepper, consists of three tightly integrated components: a web‑based user interface, a backend that records events and intermediate repository states, and a lightweight API for instrumenting existing agents.
The UI visualizes an agent’s trajectory as two interleaved conversations: one between the agent program and the LLM, and another between the agent program and the invoked tools. Each turn (prompt, LLM response, tool call, tool result) is displayed as a distinct, collapsible event, allowing developers to jump to any point in the execution. Breakpoints can be set on any event; when a breakpoint is hit, execution pauses and the developer may edit the prompt, modify tool arguments, or change the LLM’s response before resuming. The backend simultaneously captures file‑system changes and stores them as Git commits, presenting a commit history alongside the conversation view. This dual representation lets developers inspect the exact code state at any step without having to reconstruct it manually.
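The paper does not reproduce the backend's implementation in this summary, but the commit-per-step idea is easy to picture. The sketch below shows one way to snapshot a repository after each agent step using the plain Git CLI; the function name and commit-message format are hypothetical, not AgentStepper's actual code.

```python
import subprocess

def snapshot_step(repo_dir, step_id, description):
    """Commit the repository's current state so a debugger can show the
    exact code at this step. Illustrative sketch only; uses the plain
    git CLI rather than AgentStepper's real backend."""
    # Stage everything, including deletions and untracked files.
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    # --allow-empty records one commit per step even when no files changed,
    # so the commit history lines up one-to-one with the conversation view.
    subprocess.run(
        ["git", "commit", "--allow-empty", "-m", f"step {step_id}: {description}"],
        cwd=repo_dir,
        check=True,
    )
```

Calling `snapshot_step` after every tool invocation yields a linear commit history that mirrors the event list, which is essentially the dual representation the UI exposes.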
To demonstrate ease of adoption, the authors integrated AgentStepper into three state‑of‑the‑art agents: ExecutionAgent, SWE‑Agent, and RepairAgent. Integration required only 5–7 API calls and 39–42 lines of edited code per agent (less than 1% of the total code base). After instrumentation, each agent’s run automatically generated a structured conversation log and a series of intermediate commits, which could be explored through the debugger.
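To give a feel for what "5–7 API calls per agent" might look like, here is a minimal sketch of instrumentation hooks wrapped around one turn of an agent loop. All class, method, and event names are hypothetical; the paper's actual API is not reproduced in this summary.

```python
class DebuggerHooks:
    """Records events and pauses at breakpoints; a stand-in for a real
    debugger API, kept in memory for illustration."""

    def __init__(self):
        self.events = []
        self.breakpoints = set()  # event kinds to pause on, e.g. {"tool_call"}

    def record(self, kind, payload):
        # Every prompt, response, tool call, and tool result becomes a
        # distinct event the debugger UI could display and jump to.
        self.events.append({"id": len(self.events), "kind": kind, "payload": payload})
        if kind in self.breakpoints:
            payload = self.pause_and_edit(payload)
        return payload

    def pause_and_edit(self, payload):
        # In a real tool this would block until the developer resumes from
        # the UI, possibly with an edited payload; here it is a no-op.
        return payload


def agent_step(hooks, llm, tools, prompt):
    """One turn of a generic agent loop, with each event routed through
    the hooks so the debugger sees (and may alter) it."""
    prompt = hooks.record("prompt", prompt)
    response = hooks.record("llm_response", llm(prompt))
    if response.get("tool"):
        args = hooks.record("tool_call", response["args"])
        return hooks.record("tool_result", tools[response["tool"]](args))
    return response
```

The low integration cost reported in the paper is plausible under this design: an existing agent only needs to route its prompts, responses, and tool invocations through such `record` calls, leaving its control flow untouched.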
The effectiveness of AgentStepper was evaluated with a user study involving twelve participants. Participants performed two tasks: (1) trajectory comprehension—answering questions about the agent’s behavior, and (2) bug identification—locating faults in the agent program. Half of the participants used AgentStepper, while the other half used conventional tools (raw logs, IDE debuggers, or custom scripts). Results showed a modest increase in comprehension accuracy (from 64% to 67%) and a substantial rise in bug‑finding success (from 17% to 60%) for the AgentStepper group. Subjective workload, measured with NASA‑TLX, indicated a dramatic reduction in frustration (from 5.4/7.0 down to 2.4/7.0) and in overall perceived effort.
The paper’s contributions are: (1) a clear taxonomy of debugging challenges for LLM‑based development agents, (2) the conceptual mapping between conventional debugging and agent debugging, (3) the design and implementation of AgentStepper—a conversation‑centric, interactive debugger that also tracks repository‑level changes, (4) empirical evidence of low integration cost across three diverse agents, and (5) a user study confirming that the tool improves developers’ ability to understand and fix agents while reducing cognitive load. All source code and data have been released publicly to encourage further research on debugging intelligent software development agents.