AutoCodeSherpa: Symbolic Explanations in AI Coding Agents
Large language model (LLM) agents integrate external tools with one or more LLMs to accomplish specific tasks. Agents have been rapidly adopted by developers, and they are starting to be deployed in industrial workflows, for example to fix static analysis issues flagged by the widely used SonarQube static analyzer. However, the growing importance of agents means their actions carry greater impact and potential risk. Thus, to use them at scale, an additional layer of trust and evidence is necessary. This work presents AutoCodeSherpa, a technique that provides explanations of software issues in the form of symbolic formulae. Inspired by the reachability, infection, and propagation model of software faults, the explanations are composed of input, infection, and output conditions, collectively providing a specification of the issue. In practice, the symbolic explanation is implemented as a combination of a property-based test (PBT) and program-internal symbolic expressions. Critically, this means our symbolic explanations are executable and can be automatically evaluated, unlike natural language explanations. Experiments show the generated conditions are highly accurate. For example, input conditions from AutoCodeSherpa had an accuracy of 85.7%. This high accuracy makes symbolic explanations particularly useful in two scenarios. First, the explanations can be used in automated issue resolution environments to decide whether to accept or reject patches from issue resolution agents; AutoCodeSherpa could reject 2x as many incorrect patches as baselines. Second, as agentic AI approaches continue to develop, program analysis driven explanations like ours can be provided to other LLM-based repair techniques which do not employ analysis, to improve their output. In our experiments, our symbolic explanations could improve the plausible patch generation rate of the Agentless technique by 60%.
💡 Research Summary
AutoCodeSherpa tackles a pressing problem in the emerging field of large‑language‑model (LLM)‑driven coding agents: the lack of trustworthy, verifiable evidence for the bugs they are asked to fix. While current agents can generate patches, their autonomous nature makes it difficult for developers to assess whether a suggested change truly resolves the underlying issue, especially when only a handful of test cases are used for validation. AutoCodeSherpa introduces “executable symbolic explanations” – formal, machine‑checkable specifications that describe a bug in three interrelated parts: an input condition (the set of inputs that trigger the fault), an infection condition (the internal program state that becomes corrupted), and an output condition (the observable symptom such as an exception or incorrect return value).
The approach is inspired by the classic Reachability‑Infection‑Propagation (RIP) model of software failures. By mapping RIP to input‑infection‑output, the authors can express the bug as a Hoare triple {I} P {O}, where I and O are the input and output conditions respectively, and P is the buggy program. To make these conditions concrete and automatically verifiable, AutoCodeSherpa combines property‑based testing (PBT) with program‑internal symbolic expressions. The input and output conditions are captured as a PBT: a test that includes a generator for inputs satisfying I, a precondition, and an assertion that checks O after executing P. The infection condition is a symbolic formula that distinguishes buggy states from normal ones, derived from a second LLM‑driven agent that explores the codebase and extracts relevant internal variables or objects.
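As a concrete illustration, the sketch below encodes the input and output conditions of such a PBT in plain Python. The bug, generator, and condition names here are hypothetical examples, not taken from the paper: the generator only produces inputs satisfying I, and the check looks for the observable symptom O after executing the buggy program P.

```python
import random
import string

# Hypothetical buggy program P: assumes every line contains a comma.
def second_field(line: str) -> str:
    return line.split(",")[1]  # bug: IndexError when no comma is present

# Input condition I: non-empty lines containing no comma (satisfied by construction).
def gen_input(rng: random.Random) -> str:
    return "".join(rng.choice(string.ascii_letters)
                   for _ in range(rng.randint(1, 10)))

# The PBT: every input satisfying I must exhibit the output condition O,
# here the observable symptom of an IndexError.
def pbt_reproduces_issue(trials: int = 100, seed: int = 0) -> bool:
    rng = random.Random(seed)
    for _ in range(trials):
        line = gen_input(rng)       # precondition: line satisfies I
        try:
            second_field(line)
        except IndexError:
            continue                # O observed for this input
        return False                # counterexample: bug not reproduced
    return True
```

Run against the buggy program, this test passes (the issue is reproduced on every generated input); run against a correct patch, it fails, which is exactly the signal needed for automated patch validation.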
The system is built as a multi‑agent pipeline:
- PBT Generation Agent reads the natural‑language issue description and the buggy source, prompting an LLM to synthesize a property‑based test that generalizes the failing execution.
- Code Exploration Agent runs static analysis tools to locate definitions, data‑flow paths, and potential mutation points related to the issue.
- Infection Condition Synthesis Agent consumes the outputs of the first two agents and asks the LLM to formulate a precise symbolic predicate that holds exactly when the bug’s internal state is corrupted.
Each stage includes refinement steps (e.g., parsing LLM output, cross‑checking with static analysis results, and iterative prompting) to improve precision. The final symbolic explanation can be executed directly: the PBT can be run against any candidate patch, and the infection predicate can be evaluated at runtime to see whether the buggy state still appears.
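Continuing the same hypothetical comma-splitting bug (an illustrative assumption, not the paper's subject), an infection predicate is an executable check over internal program state. The sketch below instruments the buggy function so the predicate can be evaluated at runtime, independently of whether the output symptom later appears.

```python
# Hypothetical infection condition: the internal `fields` list has fewer
# than two elements at the point where index 1 is accessed.
def infection_holds(fields: list) -> bool:
    return len(fields) < 2

# Instrumented variant of the buggy function: records whether the
# infection predicate held just before the faulty access.
def second_field_instrumented(line: str, trace: list) -> str:
    fields = line.split(",")              # internal state named by the predicate
    trace.append(infection_holds(fields))
    return fields[1]

# Evaluate the infection condition for one input, tolerating the crash.
def state_is_infected(line: str) -> bool:
    trace: list = []
    try:
        second_field_instrumented(line, trace)
    except IndexError:
        pass                              # the symptom may or may not follow
    return bool(trace) and trace[-1]
```

Because the predicate is evaluated on internal state rather than on the final output, it can distinguish a patch that genuinely prevents the corrupted state from one that merely masks the symptom.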
Evaluation was performed on real‑world bugs extracted from SonarQube static analysis reports. Over 3,000 bug instances were processed, yielding the following accuracies: input condition 85.7%, infection condition 79.7%, and output condition 79.0%. The authors then used the generated PBTs to filter patches produced by existing coding agents. Compared with the baseline SpecRover (which also writes a reproducer test but without symbolic infection predicates), AutoCodeSherpa rejected 123% more incorrect patches while preserving correct ones.
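The patch-filtering use can be sketched as a loop that runs an executable explanation against each candidate patch and rejects those on which the bug is still reproducible. The inputs and candidate patches below are hypothetical stand-ins, not the paper's benchmark:

```python
# Hypothetical failing inputs drawn from the input condition I:
# lines without a comma must raise IndexError in the buggy version.
FAILING_INPUTS = ["alpha", "beta42", "no comma here"]

def reproduces_issue(candidate) -> bool:
    """True if the candidate still shows the buggy symptom on all inputs in I."""
    for line in FAILING_INPUTS:
        try:
            candidate(line)
        except IndexError:
            continue
        return False
    return True

# Candidate patches an agent might propose (illustrative only).
def patch_bad(line: str) -> str:
    return line.split(",")[1]                 # unchanged: still buggy

def patch_good(line: str) -> str:
    parts = line.split(",")
    return parts[1] if len(parts) > 1 else ""  # guards the faulty access

def filter_patches(candidates):
    # Accept only patches on which the issue is no longer reproducible.
    return [p for p in candidates if not reproduces_issue(p)]
```

A filter like this is stricter than replaying a single reproducer test, since the explanation quantifies over a whole class of inputs rather than one concrete failure.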
A second set of experiments examined whether the symbolic explanations could assist other agents. By feeding the explanations to Agentless—a minimally autonomous repair system that follows a fixed three‑phase pipeline—the plausible‑patch generation rate rose by 60.7%. This demonstrates that even agents that do not perform any internal analysis can benefit from external, formally specified bug knowledge.
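One plausible way to hand such an explanation to an analysis-free agent is to serialize the three conditions and prepend them to the issue report in the agent's prompt. The field names and wording below are an assumed encoding for illustration, not the paper's exact format:

```python
import json

# Hypothetical serialized explanation for the comma-splitting example.
explanation = {
    "input_condition": "line is a non-empty string containing no comma",
    "infection_condition": "len(line.split(',')) < 2 before the index-1 access",
    "output_condition": "IndexError is raised",
}

def explanation_prompt(issue_text: str, expl: dict) -> str:
    # Prepend the symbolic explanation to the issue description so a
    # downstream repair agent can use it for localization and patching.
    return ("Symbolic explanation of the issue:\n"
            + json.dumps(expl, indent=2)
            + "\n\nOriginal issue report:\n"
            + issue_text)
```

The downstream agent needs no program-analysis capability of its own; it simply receives a sharper, machine-derived specification of the fault alongside the natural-language report.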
The authors also tested the approach across several LLM back‑ends (e.g., GPT‑4, Claude, LLaMA‑2) and observed consistent explanation quality, indicating that the pipeline is not tied to a particular model.
Key contributions are:
- The notion of executable symbolic explanations for software bugs, bridging the gap between natural‑language issue reports and formal verification artifacts.
- A concrete multi‑agent system (AutoCodeSherpa) that automatically derives input, infection, and output conditions from an issue description and source code.
- Empirical evidence that these explanations improve automated patch validation and can boost the effectiveness of other repair techniques.
Limitations and future work are acknowledged. The infection condition extraction currently relies on LLM reasoning over static analysis data; complex control‑flow, concurrency, or nondeterministic behavior may be missed. The PBTs focus on input‑output relationships and may not capture environment‑dependent bugs (e.g., file‑system or network interactions). Future directions include tighter integration with sophisticated static analysis or symbolic execution engines, richer runtime instrumentation to capture more nuanced infection predicates, and extending the framework to multi‑modal inputs such as logs or execution traces.
In summary, AutoCodeSherpa provides a practical, model‑agnostic method to endow AI coding agents with verifiable, executable explanations of bugs. By turning a natural‑language issue into a formal Hoare triple that can be automatically checked, the system enhances trust, enables more aggressive automated patch filtering, and offers reusable knowledge that can improve the performance of other repair agents. This work represents a significant step toward safe, large‑scale deployment of LLM‑driven software engineering tools in production environments.