Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems
The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards – SWE-Bench Lite and SWE-Bench Verified – have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (79 entries) and Verified (99 entries) leaderboards, analyzing 80 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system architecture. Our findings reveal the dominance of proprietary LLMs (especially Claude 3.5), the presence of both agentic and non-agentic designs, and a contributor base spanning from individual developers to large tech companies.
💡 Research Summary
This paper presents the first comprehensive, systematic study of all submissions to the two public SWE‑Bench leaderboards—SWE‑Bench Lite (300 curated instances) and SWE‑Bench Verified (500 instances curated by OpenAI). The authors collected every entry (79 in Lite, 99 in Verified, corresponding to 80 distinct approaches) and extracted metadata from the leaderboard pages, the associated GitHub repositories (README, metadata.yaml), scientific papers, blog posts, and LinkedIn announcements. Using a mixed deductive‑inductive content‑analysis coding scheme, they classified each submission along four major dimensions: (1) submitter type, (2) product purpose and form (i.e., how the solution is delivered), (3) large language model (LLM) used, and (4) system architecture (single‑LLM vs. multi‑LLM, agentic vs. non‑agentic, autonomy of execution).
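The four coding dimensions can be pictured as a small record type per leaderboard entry. The sketch below is illustrative only: the field names, enum values, and the example entry are hypothetical, not the authors' actual coding scheme or data.

```python
from dataclasses import dataclass
from enum import Enum

class SubmitterType(Enum):
    COMPANY = "company"
    ACADEMIA = "academia"
    INDIVIDUAL = "individual"
    COMMUNITY = "open-source community"

@dataclass
class Submission:
    """One coded leaderboard entry (field names are illustrative)."""
    name: str
    leaderboard: str          # "Lite" or "Verified"
    submitter: SubmitterType  # dimension 1: submitter type
    product_form: str         # dimension 2: how the solution is delivered
    llms: list[str]           # dimension 3: LLM(s) used
    agentic: bool             # dimension 4a: agentic vs. non-agentic
    multi_llm: bool           # dimension 4b: single-LLM vs. multi-LLM

def architecture_category(s: Submission) -> str:
    """Map a coded entry onto one of the four architecture categories."""
    kind = "multi-LLM" if s.multi_llm else "single-LLM"
    style = "agentic" if s.agentic else "non-agentic"
    return f"{kind} {style}"

# Hypothetical entry, coded along the four dimensions:
entry = Submission(
    name="ExampleAgent",
    leaderboard="Verified",
    submitter=SubmitterType.COMPANY,
    product_form="CLI tool",
    llms=["Claude 3.5 Sonnet"],
    agentic=True,
    multi_llm=False,
)
print(architecture_category(entry))  # → single-LLM agentic
```

A record like this, filled in from READMEs, `metadata.yaml` files, papers, and blog posts, is what the deductive-inductive coding step would produce for each of the 80 approaches.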
Research Question 1 (RQ 1) investigates who is contributing and with what tools. The analysis reveals that industry dominates: more than half of the unique approaches originate from companies, ranging from small startups (<50 employees) to large publicly traded corporations such as Amazon, IBM, Google, and Microsoft. Academia accounts for roughly 15 % of the approaches, while a non‑trivial fraction (≈10 %) comes from individual developers or open‑source community projects. The most frequently used LLMs are proprietary: Anthropic’s Claude 3.5/4 Sonnet appears in 38 % of the submissions, followed by OpenAI’s GPT‑4/GPT‑4o and Google’s Gemini 1.5. Open‑source models (Llama 2, Qwen, etc.) are present but rarely achieve top‑ranked performance.
RQ 2 focuses on architectural patterns. The authors distinguish four categories: (a) single‑LLM non‑agentic, (b) multi‑LLM non‑agentic, (c) single‑LLM agentic, and (d) multi‑LLM agentic. Approximately 45 % of the approaches are simple single‑LLM pipelines that generate a patch via a prompt and then run a basic verification step. About 35 % employ an agent framework (e.g., SWE‑Agent, AutoCodeRover) that can plan, decompose tasks, and invoke multiple LLM calls dynamically. The remaining submissions are hybrids that combine several LLMs for different pipeline stages (e.g., one model for bug localization, another for patch synthesis). Notably, the five highest‑performing entries (precision > 30 %) all rely on Claude 4 Sonnet, and three of them are multi‑LLM agentic systems, suggesting that “proprietary LLM + agentic orchestration” is currently the most effective recipe.
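Given category labels for each approach, the distribution reported in RQ 2 reduces to a simple tally. The sample below is hypothetical (five made-up labels, not the study's data); the real study reports roughly 45 % single-LLM non-agentic and roughly 35 % agentic approaches.

```python
from collections import Counter

# Hypothetical coded sample -- NOT the study's actual 80 approaches.
coded = [
    "single-LLM non-agentic", "single-LLM non-agentic",
    "single-LLM agentic", "multi-LLM agentic",
    "multi-LLM non-agentic",
]

tally = Counter(coded)                               # count per category
share = {cat: n / len(coded) for cat, n in tally.items()}  # fraction per category
print(share["single-LLM non-agentic"])  # → 0.4
```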
RQ 3 maps each approach onto the seven‑stage software‑maintenance pipeline proposed by Liu et al.: preprocessing, issue reproduction, bug localization, task decomposition, patch generation, patch verification, and ranking. The study finds that most submissions implement preprocessing and reproduction using GitHub APIs and containerized environments. Bug localization frequently blends static analysis (AST matching, type checking) with LLM‑driven natural‑language‑to‑code mapping. Task decomposition is explicitly present only in agentic systems, where a planner generates a list of subtasks that are dispatched to separate LLM instances. Patch generation is almost universally “prompt‑to‑code,” sometimes enriched with code‑snippet retrieval. Verification combines the SWE‑Bench‑supplied test suite with additional static checks (linters, type checkers). Ranking aggregates precision, execution time, and patch size, and a few submissions add custom quality metrics such as readability scores.
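The seven stages can be sketched as a sequence of functions threaded over a shared context. Every stage body below is a stub (the real systems call GitHub APIs, run containers, and invoke LLMs); the function names merely mirror the pipeline from Liu et al., and all produced values are placeholders.

```python
def preprocess(ctx):
    """Fetch the repository snapshot and parse the issue text (stubbed)."""
    ctx["preprocessed"] = True
    return ctx

def reproduce_issue(ctx):
    """Set up a containerized environment and confirm the failure (stubbed)."""
    ctx["reproduced"] = True
    return ctx

def localize_bug(ctx):
    """Blend static analysis with LLM-driven issue-to-code mapping (stubbed)."""
    ctx["suspect_files"] = ["src/module.py"]  # hypothetical localization result
    return ctx

def decompose_task(ctx):
    """Agentic systems only: a planner splits the fix into subtasks (stubbed)."""
    ctx["subtasks"] = ["edit src/module.py"]
    return ctx

def generate_patch(ctx):
    """Prompt an LLM for candidate diffs (stubbed)."""
    ctx["patches"] = ["<candidate diff>"]
    return ctx

def verify_patch(ctx):
    """Run the benchmark test suite plus linters/type checkers (stubbed)."""
    ctx["verified"] = list(ctx["patches"])  # pretend every candidate passes
    return ctx

def rank_candidates(ctx):
    """Order surviving patches by quality heuristics and pick one (stubbed)."""
    ctx["best"] = ctx["verified"][0]
    return ctx

PIPELINE = [preprocess, reproduce_issue, localize_bug, decompose_task,
            generate_patch, verify_patch, rank_candidates]

def repair(issue_text):
    """Run an issue through all seven stages and return the final context."""
    ctx = {"issue": issue_text}
    for stage in PIPELINE:
        ctx = stage(ctx)
    return ctx

result = repair("example issue text")
print(result["best"])  # → <candidate diff>
```

Non-agentic pipelines would simply omit (or no-op) the `decompose_task` stage, which matches the study's observation that explicit task decomposition appears only in agentic systems.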
The discussion highlights two dominant trends: (1) the overwhelming advantage of proprietary LLMs, especially Claude 4 Sonnet, which raises concerns about reproducibility and accessibility for the broader research community; and (2) the architectural diversity, where both simple non‑agentic pipelines and sophisticated multi‑agent orchestrations can achieve state‑of‑the‑art results, indicating that there is no single “golden” architecture yet. The authors also warn of potential benchmark saturation—only 500 instances are currently curated—and suggest expanding the benchmark with more languages, frameworks, and bug types to avoid overfitting to a narrow set of problems. Finally, they call for standardized metadata and open‑source release policies for leaderboard submissions to improve reproducibility and foster more transparent progress tracking.
In sum, this work not only provides a detailed taxonomy of who is building LLM‑based APR systems and how they are built, but also offers actionable insights for future research directions: improving open‑source LLM capabilities, formalizing agentic orchestration patterns, and evolving SWE‑Bench to remain a robust, future‑proof testbed for automated program repair.