Agent-Based Software Artifact Evaluation

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv paper.

Artifact evaluation has been adopted in the Software Engineering (SE) research community for 15 years, substantially improving research reproducibility across major SE conferences. However, this success has introduced a growing scalability challenge, as artifact evaluation relies heavily on reviewers’ manual execution and debugging, leading to escalating human effort amid rapidly increasing paper submissions. To address this problem, we investigate automated artifact evaluation. We first conduct a preliminary study on artifacts from top-tier SE conferences and identify three key challenges: perceiving execution states, maintaining stable execution environments, and recovering from execution errors. Inspired by these findings, we propose ArtifactCopilot, the first end-to-end agent-based framework for automated artifact evaluation. ArtifactCopilot automates environment construction, instruction execution, and error recovery by combining an execution normalization strategy to ensure environment stability with an artifact evaluation graph that transforms README documents into dependency-aware command graphs, enabling structured execution planning, execution-state tracking, and error recovery. Evaluation on 48 real-world artifacts shows that ArtifactCopilot matches human artifact evaluation outcomes for 85.42% of the artifacts, outperforming Claude Code by 52.09 percentage points, while costing only $0.091 per artifact on average and requiring zero human intervention for 45 out of 48 artifacts.


💡 Research Summary

The paper addresses the growing scalability problem of software engineering (SE) artifact evaluation, a practice that has been used for over a decade to certify the reproducibility of research results at major conferences such as ASE, FSE, ICSE, and ISSTA. While the manual process—downloading a repository, following README instructions, and debugging failures—has greatly improved reproducibility, the rapid increase in paper submissions (e.g., ICSE 2025 accepted 245 papers, a 2.8‑fold rise since 2015) has made the manual approach untenable due to the escalating human effort required.

To understand why automation is difficult, the authors built a dataset of 48 real‑world artifacts from the four top‑tier SE venues. They first performed a preliminary study on a subset of 12 artifacts, recording 237 human intervention steps. These steps fell into three categories: (1) command execution (53% of interventions), (2) error handling (40%), and (3) dynamic configuration (7%). The study revealed that artifacts are typically described in unstructured natural‑language README files, that environment boundaries (host vs. container) are often implicit, and that failures can occur at any point in the workflow, requiring ad‑hoc debugging.

Guided by these findings, the authors propose ArtifactCopilot, the first end‑to‑end agent‑based framework for automated artifact evaluation. Its design rests on three pillars:

  1. AE Graph (Artifact Evaluation Graph) – Using large language models (LLMs) and NLP techniques, the system parses the README (or PDF) into discrete commands, infers dependencies, and constructs a directed acyclic graph where nodes are executable commands and edges encode ordering constraints. This graph provides a structured execution plan and a state‑aware representation of progress.

  2. Environment‑aware Execution Monitor – Because artifacts may run on Docker, Conda, or native environments, the monitor tracks context switches in real time, ensuring each command is dispatched to the correct runtime. It normalizes environment specifications, automatically generates Dockerfiles or Conda environments, and translates file‑system paths between host and container.

  3. Multi‑Agent Error Recovery – When a command fails, a primary agent triggers specialized sub‑agents (e.g., package‑installation agent, version‑conflict resolver, path‑fixer). These agents analyze error logs with LLMs, propose and apply corrective actions, and then reschedule the failed node in the AE Graph. This loop enables autonomous debugging without human input.
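The first pillar, the AE Graph, is at its core a dependency-aware DAG of shell commands. A minimal sketch of the idea using Python's standard `graphlib` is shown below; the commands and dependency edges are illustrative stand-ins for what the paper extracts from a README with an LLM:

```python
from graphlib import TopologicalSorter

# Illustrative AE-graph nodes: each key is a command, each value the set of
# commands it depends on. In ArtifactCopilot an LLM infers these from the
# README; here they are hard-coded for the sketch.
ae_graph = {
    "docker build -t artifact .": set(),
    "docker run -d --name ae artifact": {"docker build -t artifact ."},
    "docker exec ae python run_experiments.py": {"docker run -d --name ae artifact"},
    "docker exec ae python plot_results.py": {"docker exec ae python run_experiments.py"},
}

# Any topological order of the graph is a valid, dependency-respecting
# execution plan; tracking which nodes have run gives execution-state awareness.
plan = list(TopologicalSorter(ae_graph).static_order())
```

Because the graph records which nodes have completed, a failed node can later be rescheduled without re-running its already-satisfied predecessors.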
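The second pillar, environment-aware execution, amounts to tracking which runtime (host, Docker, or Conda) each command targets and rewriting paths and commands accordingly. The following is a minimal sketch under assumed conventions; the class, field names, and mount layout are illustrative, not the paper's actual API:

```python
from dataclasses import dataclass

@dataclass
class ExecContext:
    """Tracks the runtime a command targets; field values are illustrative."""
    kind: str               # "host", "docker", or "conda"
    name: str = ""          # container or conda environment name
    host_root: str = ""     # artifact directory on the host
    guest_root: str = ""    # where that directory is mounted in the container

    def translate(self, host_path: str) -> str:
        """Map a host filesystem path into the container's view of it."""
        if self.kind == "docker" and host_path.startswith(self.host_root):
            return self.guest_root + host_path[len(self.host_root):]
        return host_path

    def wrap(self, cmd: str) -> str:
        """Prefix a command so it is dispatched to the correct runtime."""
        if self.kind == "docker":
            return f"docker exec {self.name} sh -c {cmd!r}"
        if self.kind == "conda":
            return f"conda run -n {self.name} {cmd}"
        return cmd
```

Keeping this context explicit is what lets the monitor notice context switches (e.g. a README step that enters a container) and keep subsequent commands in the right environment.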
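The third pillar, the recovery loop, can be reduced to: run a node, match the failure against specialized sub-agents, apply the proposed fix, and retry. This sketch replaces the paper's LLM-driven log analysis with a hard-coded signature table; the agent patterns and fixes are assumptions for illustration only:

```python
import subprocess

# Illustrative sub-agents: each pairs an error signature with a corrective
# action. The real system analyzes logs with LLMs; this table is a stand-in.
RECOVERY_AGENTS = [
    ("ModuleNotFoundError", lambda log: "pip install " + log.split()[-1].strip("'\"")),
    ("Permission denied",   lambda log: "chmod -R u+rwX ."),
]

def recover(error_log: str):
    """Return a corrective command from the first matching sub-agent, or None."""
    for signature, agent in RECOVERY_AGENTS:
        if signature in error_log:
            return agent(error_log)
    return None  # no agent matched: escalate to a human reviewer

def run_with_recovery(cmd: str, max_retries: int = 2) -> bool:
    """Run a command; on failure, apply an agent-proposed fix and retry."""
    for _ in range(max_retries + 1):
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if proc.returncode == 0:
            return True
        fix = recover(proc.stderr)
        if fix is None:
            return False
        subprocess.run(fix, shell=True)  # apply the fix, then retry the node
    return False
```

On success or exhausted retries, control returns to the graph scheduler, which either marks the node done or surfaces it for manual intervention.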

The framework was evaluated on the full 48‑artifact dataset. ArtifactCopilot achieved 85.42% agreement with human‑assigned artifact badges, outperforming Claude Code (33.33% agreement) and a baseline LLM script‑generation approach (20.83%). The average monetary cost, from GPT‑4 API calls, was $0.091 per artifact, and 45 out of 48 artifacts required zero human intervention, with manual effort averaging only 0.11 steps per artifact across the dataset. Ablation experiments demonstrated the necessity of each component: removing the AE Graph reduced agreement by 43.75 percentage points, dropping environment normalization caused a 29.17‑point drop, and eliminating multi‑agent collaboration lowered performance by 22.92 points.

The authors conclude that artifact evaluation is fundamentally a state‑aware workflow problem rather than a simple script‑generation task. By formalizing the process with a dependency‑aware graph, providing runtime‑aware monitoring, and equipping the system with autonomous error‑recovery agents, ArtifactCopilot offers a practical solution to the scalability bottleneck faced by the SE research community. The paper also suggests that the underlying principles—graph‑based execution modeling and multi‑agent debugging—could be extended to other reproducibility and deployment automation scenarios across software engineering domains.

