Pull Requests as a Training Signal for Repo-Level Code Editing
Repository-level code editing requires models to understand complex dependencies and execute precise multi-file modifications across a large codebase. While recent gains on SWE-bench rely heavily on complex agent scaffolding, it remains unclear how much of this capability can be internalised via high-quality training signals. To address this, we propose Clean Pull Request (Clean-PR), a mid-training paradigm that leverages real-world GitHub pull requests as a training signal for repository-level editing. We introduce a scalable pipeline that converts noisy pull request diffs into Search/Replace edit blocks through reconstruction and validation, resulting in the largest publicly available corpus of 2 million pull requests spanning 12 programming languages. Using this training signal, we perform a mid-training stage followed by an agentless-aligned supervised fine-tuning process with error-driven data augmentation. On SWE-bench, our model significantly outperforms the instruction-tuned baseline, achieving absolute improvements of 13.6% on SWE-bench Lite and 12.3% on SWE-bench Verified. These results demonstrate that repository-level code understanding and editing capabilities can be effectively internalised into model weights under a simplified, agentless protocol, without relying on heavy inference-time scaffolding.
💡 Research Summary
The paper tackles the problem of enabling large language models (LLMs) to perform repository‑level code editing without relying on heavyweight inference‑time scaffolding such as multi‑step agents, tool calls, or external retrieval loops. The authors ask a fundamental question: how much of the repository‑wide editing capability can be encoded directly into model weights? To answer this, they introduce Clean‑PR, a data‑centric mid‑training paradigm that transforms raw GitHub pull requests (PRs) into high‑quality, verifiable training signals.
First, they collect an enormous raw dump of 16.4 M PRs from 274 k public repositories (≈8.6 TB). A detailed noise taxonomy reveals that the majority of these PRs are unsuitable for learning: 38 % lack core source changes, 25 % are generated by bots, and 24 % remain unmerged. Using a multi‑stage filtering pipeline, they discard bot‑created PRs, PRs that touch only non‑core files, and any PRs from repositories that appear in the SWE‑bench evaluation sets. After filtering, only 3.05 M PRs (≈46.4 B tokens) remain.
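The filtering stages described above can be sketched as a simple predicate over PR records. This is an illustrative reconstruction, not the paper's actual code: the field names (`author_type`, `merged`, `repo`, `changed_files`), the extension list, and the excluded-repo set are all assumptions.

```python
# Hypothetical sketch of the multi-stage PR filter; schema and constants are
# illustrative assumptions, not the paper's actual pipeline.
SWE_BENCH_REPOS = {"django/django", "sympy/sympy"}  # example eval repos to exclude
CORE_EXTENSIONS = {".py", ".java", ".go", ".rs", ".c", ".cpp", ".ts", ".js"}

def is_core_file(path: str) -> bool:
    """A change counts as 'core' if it touches source code, not docs/CI/config."""
    return any(path.endswith(ext) for ext in CORE_EXTENSIONS)

def keep_pr(pr: dict) -> bool:
    if pr["author_type"] == "bot":      # drop bot-generated PRs
        return False
    if not pr["merged"]:                # drop unmerged PRs
        return False
    if pr["repo"] in SWE_BENCH_REPOS:   # avoid evaluation-set contamination
        return False
    # require at least one core source-file change
    return any(is_core_file(f) for f in pr["changed_files"])
```

In practice each stage would run as a separate pass over the dump so that cheap checks (bot author, merge status) prune the bulk of the data before the more expensive file-level inspection.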
The core technical contribution is the conversion of traditional line‑based diffs into a Search/Replace format. For each PR they reconstruct the exact “before” and “after” repository states, algorithmically extract minimal edit spans, and identify anchor contexts that uniquely locate each edit within a file. These Search/Replace blocks are then round‑trip verified: applying the block to the before‑state must reproduce the after‑state bit‑for‑bit. Any failure leads to discarding the example, guaranteeing that every training instance is both minimal and correct.
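The round-trip check can be captured in a few lines. This is a minimal sketch under the assumptions stated in the summary (each block's search span must occur exactly once in the before-state, and the fully edited file must match the after-state byte-for-byte); the function names are hypothetical.

```python
def apply_search_replace(before: str, search: str, replace: str) -> str:
    """Apply one Search/Replace block. The search span acts as an anchor:
    it must occur exactly once in the before-state to locate the edit."""
    if before.count(search) != 1:
        raise ValueError("anchor is not unique; example would be discarded")
    return before.replace(search, replace, 1)

def round_trip_ok(before: str, after: str, blocks: list[tuple[str, str]]) -> bool:
    """Round-trip verification: applying every block to the before-state
    must reproduce the after-state exactly, or the example is dropped."""
    state = before
    try:
        for search, replace in blocks:
            state = apply_search_replace(state, search, replace)
    except ValueError:
        return False
    return state == after
```

The uniqueness requirement is what makes the anchors useful downstream: at inference time the model only needs to emit a span the runtime can locate unambiguously, with no line numbers involved.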
To compensate for the fact that PR titles often reference external issue numbers rather than containing a full problem description, the pipeline augments each PR with the title and body of any linked GitHub issue. This issue‑augmentation step aligns the training signal with the typical SWE‑bench prompt, which consists of a detailed bug report followed by the required code change.
From the filtered corpus they further refine the dataset to a balanced 2.02 M‑instance “Clean‑PR‑train” set (≈17.7 B tokens). They limit each PR to at most five core files, enforce a maximum of 100 k tokens per file (windowing around the edit), and apply repository‑level sampling to avoid over‑representation of large projects. The final training set spans 12 programming languages, with Python comprising roughly 19 % of the data.
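The windowing step above can be sketched as follows. The paper's budget is ~100 k tokens per file; this illustrative version uses a line budget instead of tokens, and the function name and re-alignment details are assumptions.

```python
def window_around_edit(lines: list[str], edit_start: int, edit_end: int,
                       max_lines: int) -> list[str]:
    """Keep a window of at most max_lines lines centred on the edited span
    (the paper budgets ~100k tokens per file; lines stand in for tokens here)."""
    if len(lines) <= max_lines:
        return lines
    span = edit_end - edit_start + 1
    budget = max(max_lines - span, 0)
    start = max(edit_start - budget // 2, 0)
    end = min(start + max_lines, len(lines))
    start = max(end - max_lines, 0)  # re-align if the window hits end of file
    return lines[start:end]
```

The key property is that the edited span is always inside the retained window, so truncation never destroys the supervision target.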
Training proceeds in two stages. In the first “mid‑training” stage, the base Qwen2.5‑Coder‑32B model is exposed to the Clean‑PR‑train corpus, allowing it to learn repository‑wide patterns, multi‑file dependencies, and the Search/Replace editing paradigm. In the second stage, the authors perform an Agentless‑Aligned stepwise supervised fine‑tuning (SFT). They decompose each repair trajectory from SWE‑rebench and SWE‑Gym into three sub‑tasks: (1) file localisation (given an issue description and repository tree, predict the relevant file paths), (2) line‑level navigation (given the file, predict the line range to edit), and (3) patch generation (produce the Search/Replace block).
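The three-subtask decomposition can be made concrete as a set of prompt builders. The exact prompt wording used in the paper is not reproduced in this summary, so these templates are purely illustrative assumptions about the shape of each sub-task.

```python
# Hypothetical prompt templates for the three stepwise SFT sub-tasks;
# wording is an assumption, only the decomposition mirrors the summary.

def file_localisation_prompt(issue: str, repo_tree: str) -> str:
    """Sub-task 1: issue + repository tree -> relevant file paths."""
    return (f"Issue:\n{issue}\n\nRepository tree:\n{repo_tree}\n\n"
            "List the file paths that must be edited.")

def line_navigation_prompt(issue: str, file_path: str, file_text: str) -> str:
    """Sub-task 2: issue + file contents -> line range to edit."""
    return (f"Issue:\n{issue}\n\nFile {file_path}:\n{file_text}\n\n"
            "Give the line range that must be edited.")

def patch_generation_prompt(issue: str, file_path: str, region: str) -> str:
    """Sub-task 3: issue + localised region -> Search/Replace block."""
    return (f"Issue:\n{issue}\n\nRelevant region of {file_path}:\n{region}\n\n"
            "Produce a Search/Replace edit block.")
```

Chaining the three calls reproduces the Agentless-style pipeline at inference time with plain sequential generation, no tool-use loop required.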
A novel “Error‑Driven Augmentation” strategy is introduced to improve robustness. During SFT, distractor files and irrelevant code regions mined from model failures are injected into the context, teaching the model to reject noisy retrieval results and avoid over‑editing. This augmentation yields measurable gains in both localisation accuracy and end‑to‑end repair success.
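A minimal sketch of the augmentation idea, assuming the distractors are file paths mined from earlier model failures (the function name and sampling details are hypothetical):

```python
import random

def augment_with_distractors(gold_files: list[str], distractor_pool: list[str],
                             k: int, seed: int = 0) -> list[str]:
    """Build an SFT context holding the gold files plus k distractor files
    mined from model failures, shuffled so position carries no signal.
    The supervision target still edits only the gold files, teaching the
    model to reject irrelevant retrieval results and avoid over-editing."""
    rng = random.Random(seed)
    distractors = rng.sample(distractor_pool, min(k, len(distractor_pool)))
    context = gold_files + distractors
    rng.shuffle(context)
    return context
```

Because the distractors come from actual failure modes rather than random files, they concentrate training pressure exactly where the model previously localised incorrectly.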
Evaluation on SWE‑bench Lite and SWE‑bench Verified shows substantial improvements over the instruction‑tuned baseline: absolute gains of 13.6 percentage points on Lite and 12.3 percentage points on Verified. These gains are achieved without any agentic loops at inference time, demonstrating that repository‑level editing knowledge can be effectively baked into the model itself.
The paper’s contributions are threefold: (1) a rigorously validated, publicly released PR dataset (Clean‑PR) that is the largest of its kind for repository editing, (2) a scalable pipeline that converts noisy diffs into minimal, verifiable Search/Replace blocks, and (3) an agentless‑aligned fine‑tuning regime with error‑driven augmentation that bridges the gap between raw editing ability and the structured workflow required for real‑world software engineering tasks. By showing that high‑quality, mid‑training data can replace complex inference scaffolding, the work opens a path toward more efficient, deployable code‑capable LLMs for large‑scale software maintenance.