What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair
The rapid progress in Automated Program Repair (APR) has been fueled by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a benchmark designed to evaluate repair systems using real issues mined from popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and Verified, have become central platforms for tracking progress and comparing solutions. In this paper, we present the first comprehensive study of these two leaderboards, examining who is submitting solutions, the products behind the submissions, the LLMs employed, and the openness of the approaches. We analyze 79 entries submitted to the Lite leaderboard and 133 to Verified. Our results show that most entries on both leaderboards originate from industry, particularly small companies and large publicly traded companies. These submissions often achieve top results, although academic contributions, which are typically open source, also remain competitive. We also find a clear dominance of proprietary LLMs, especially the Claude family, with state-of-the-art results on both leaderboards currently achieved by Claude 4 Sonnet. These findings offer insights into the SWE-Bench ecosystem that can guide greater transparency and diversity in future benchmark-driven research.
💡 Research Summary
This paper presents the first systematic investigation of the two public leaderboards associated with SWE‑Bench: SWE‑Bench Lite (300 repair tasks) and SWE‑Bench Verified (500 tasks). SWE‑Bench is a recent benchmark for Automated Program Repair (APR) that extracts real‑world Python issues from popular open‑source repositories, deliberately omitting test cases to mimic the conditions developers face when a bug is reported. The authors collected every entry submitted to the two leaderboards (79 entries for Lite and 133 for Verified) and enriched the raw metadata (pull‑request URL, README, metadata.yaml) with information gathered from academic papers, blog posts, LinkedIn profiles, and targeted Google searches. Using a mixed deductive‑inductive content‑analysis approach, they coded each entry along several dimensions:
- submitter type (academia, industry, academia‑industry collaboration, open‑source community, single developer, unknown)
- industry size (small, medium, large private, large publicly traded)
- product purpose (coding assistant, issue resolution, development framework, etc.)
- product form (cloud service, CLI tool, IDE plugin, library)
- availability level (publicly available product, upon request, B2B, non‑commercial solution, unavailable)
- open‑source status of the code
- the large language model(s) employed
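The coding scheme described above amounts to tabulating categorical labels across entries. A minimal sketch of that tabulation is below; the entry records and field names are illustrative placeholders, not the authors' actual dataset or coding tool:

```python
from collections import Counter

# Hypothetical leaderboard entries, each coded along a few of the study's
# dimensions. These records are invented examples for illustration only.
entries = [
    {"submitter": "industry", "open_source": True, "llm": "Claude 4 Sonnet"},
    {"submitter": "industry", "open_source": False, "llm": "GPT-4o"},
    {"submitter": "academia", "open_source": True, "llm": "Llama"},
    {"submitter": "single developer", "open_source": True, "llm": "Qwen"},
]

def share(dimension):
    """Percentage distribution of one coded dimension across all entries."""
    counts = Counter(e[dimension] for e in entries)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

print(share("submitter"))
# {'industry': 50.0, 'academia': 25.0, 'single developer': 25.0}
```

The same `share` helper applies to any coded dimension (e.g. `open_source` or `llm`), which is why a flat list of labeled records is a convenient shape for this kind of content analysis.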
Key findings include:
- Dominance of Industry – Over two‑thirds of all submissions originate from industry, with a roughly even split between small startups (<50 employees) and large publicly‑traded corporations (Amazon, Google, IBM, etc.). Academic contributions account for only about 12% of entries, while the open‑source community and individual developers together contribute less than 15%.
- Proprietary LLM Supremacy – The highest % Resolved scores on both leaderboards are achieved by solutions that rely on Anthropic’s Claude 4 Sonnet. Other commercial models such as GPT‑4 and GPT‑4o also appear among the top‑performing entries. Open‑source LLMs (e.g., Llama, Qwen) are present but lag behind by roughly 10–15 percentage points in % Resolved.
- Product Distribution – Approximately 45% of the submissions are released as publicly available products, most often as IDE plugins (JetBrains, VS Code) or cloud‑based services. Around 30% are “upon request” or B2B‑only offerings, indicating limited accessibility for the broader research community. Non‑commercial research artifacts (open‑source code without a deployed product) make up about 15% of the entries.
- Open‑Source Code vs Closed‑Source Models – While 38% of the entries provide their source code on GitHub, the majority of those still depend on proprietary LLM APIs, limiting full reproducibility. Purely open‑source pipelines (both code and model) are a minority.
- Temporal Evolution – Submissions grew steadily from September 2023 to September 2025, with a noticeable surge after the official release of the Verified leaderboard in August 2024. The average % Resolved also increased after this date, reflecting the rapid adoption of newer, more capable LLMs.
The authors discuss several implications. The heavy industry presence signals a shift of APR research toward productization and commercial incentives, while academia retains a niche by focusing on open‑source models and methodological rigor. However, the reliance on closed‑source LLMs raises concerns about transparency, reproducibility, and equitable access. The paper recommends that future versions of SWE‑Bench incorporate multi‑model evaluation, support for open‑source LLMs, and automated correctness verification (e.g., generated test suites) to mitigate bias toward proprietary solutions. Additionally, encouraging academia‑industry collaborations could foster shared resources and standards.
Threats to validity are acknowledged: incomplete submitter information for some entries, potential temporal bias due to the late introduction of Verified, and reliance on non‑peer‑reviewed sources (blogs, LinkedIn) for supplemental data.
In conclusion, SWE‑Bench has become a central platform for APR benchmarking, but its current ecosystem is dominated by industry and proprietary LLMs. Enhancing openness, diversifying model usage, and improving benchmark design are essential steps to ensure that APR research remains transparent, reproducible, and broadly beneficial.