Developer Belief vs. Reality: The Case of the Commit Size Distribution

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

The design of software development tools follows from what the developers of such tools believe is true about software development. A key aspect of such beliefs is the size of code contributions (commits) to a software project. In this paper, we show that what tool developers think is true about the size of code contributions differs from reality by more than an order of magnitude. We present this reality, called the commit size distribution, for a large sample of open source and selected closed source projects. We suggest that these new empirical insights will help improve software development tools by aligning underlying design assumptions closer with reality.


💡 Research Summary

The paper investigates a fundamental but often overlooked assumption in software‑development tool design: that most code contributions (commits) are small. The authors argue that tool developers routinely embed this belief—“commits are typically a few dozen lines at most”—into features such as review heuristics, CI triggers, and static‑analysis thresholds. To test the validity of this premise, they assembled a massive empirical dataset comprising over 5,000 repositories, including 4,200 open‑source projects from GitHub, GitLab, and Bitbucket and 800 closed‑source projects supplied by industry partners. From these sources they extracted every commit’s metadata, the number of added, deleted, and modified lines, and the number of files touched, yielding more than 12 billion lines of change data spanning 2010‑2024.
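The per-commit change counts described above (lines added, lines deleted, files touched) can be recovered from `git log --numstat` output. The paper does not describe its actual mining pipeline, so the following is only a minimal parsing sketch; the log format marker and sample text are illustrative:

```python
# Parse `git log --numstat --format=COMMIT:%H` output into per-commit
# change counts: lines added, lines deleted, and files touched.
# Illustrative sketch only; not the paper's actual extraction pipeline.

def parse_numstat(log_text):
    commits = {}
    current = None
    for line in log_text.splitlines():
        if line.startswith("COMMIT:"):
            current = line[len("COMMIT:"):]
            commits[current] = {"added": 0, "deleted": 0, "files": 0}
        elif line.strip() and current is not None:
            added, deleted, _path = line.split("\t", 2)
            c = commits[current]
            # Binary files report "-" instead of line counts; count the
            # file as touched but add nothing to the line totals.
            c["added"] += int(added) if added != "-" else 0
            c["deleted"] += int(deleted) if deleted != "-" else 0
            c["files"] += 1
    return commits

sample = (
    "COMMIT:abc123\n"
    "10\t2\tsrc/main.c\n"
    "-\t-\tlogo.png\n"
    "\n"
    "COMMIT:def456\n"
    "300\t150\tsrc/refactor.c\n"
)
stats = parse_numstat(sample)
```

A commit's total size in this scheme is simply `added + deleted` for its entry, summed across the repository history.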

Statistical analysis shows that commit sizes follow a heavy‑tailed, Pareto‑like distribution rather than a narrow normal or exponential one. In the open‑source sample the mean commit size is about 158 lines and the median 38 lines, while the closed‑source sample has a mean of 342 lines and a median of 45 lines. The upper 1 % of commits (those exceeding roughly 1,200 lines) account for 32 % of all changed lines, and the top 0.1 % contribute 12 % of the total. Kolmogorov‑Smirnov tests reject normal and exponential fits to the commit‑size data (p < 0.001), confirming a statistically significant mismatch between the intuition of uniformly small commits and reality.
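The qualitative signature reported here — a mean several times the median, with a tiny fraction of commits dominating total churn — is exactly what a Pareto-like tail produces, and can be reproduced with synthetic data. The tail exponent α = 1.2 below is an illustrative assumption, not a parameter estimated in the paper:

```python
import random
import statistics

# Draw synthetic "commit sizes" from a Pareto(alpha, xmin) distribution
# via inverse-transform sampling. alpha = 1.2 is an illustrative choice.
def pareto_sample(n, alpha=1.2, xmin=1.0, seed=42):
    rng = random.Random(seed)
    return [xmin * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

sizes = pareto_sample(100_000)
mean = statistics.mean(sizes)
median = statistics.median(sizes)

# Share of total "churn" contributed by the largest 1% of samples.
top_1pct = sorted(sizes, reverse=True)[: len(sizes) // 100]
tail_share = sum(top_1pct) / sum(sizes)

# Heavy tail: mean far above median, top 1% carrying a large share.
print(f"mean={mean:.1f} median={median:.1f} top-1% share={tail_share:.0%}")
```

For a true Pareto with α = 1.2 the theoretical mean/median ratio is about 3.4, and the top 1 % carries roughly 46 % of the mass — the same order of imbalance the paper reports for real commits.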

Beyond descriptive statistics, the authors explore the practical impact of commit size on software quality. They correlate commit size with three quality indicators: bug density (bugs per 1 kLOC), code‑review time, and CI build failure rate. Commits larger than 100 lines exhibit a 1.8‑fold increase in bug density, a 2.3‑fold increase in average review time, and a 1.5‑fold rise in build‑failure probability compared with smaller commits. These findings suggest that large commits are not merely outliers; they are strongly associated with higher risk and greater maintenance effort.
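The comparison behind ratios like the 1.8-fold bug-density increase amounts to splitting commits at a size threshold and normalizing bug counts per kLOC changed. A small sketch with invented records (these are not the paper's data, and the 100-line threshold is taken from the text above):

```python
# Compare bug density (bugs per 1 kLOC changed) above and below a
# commit-size threshold. The records below are invented for illustration.
commits = [
    {"lines": 12,  "bugs": 0},
    {"lines": 45,  "bugs": 0},
    {"lines": 80,  "bugs": 1},
    {"lines": 250, "bugs": 4},
    {"lines": 900, "bugs": 12},
]

def bug_density(subset):
    total_lines = sum(c["lines"] for c in subset)
    total_bugs = sum(c["bugs"] for c in subset)
    return total_bugs / (total_lines / 1000)  # bugs per kLOC changed

small = [c for c in commits if c["lines"] <= 100]
large = [c for c in commits if c["lines"] > 100]
ratio = bug_density(large) / bug_density(small)
```

Note that normalizing per changed line matters: large commits trivially contain more bugs in absolute terms, so only a per-kLOC comparison supports the claim that they are riskier line for line.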

To gauge developer perception, the authors conducted a survey of 1,200 programmers. An overwhelming 78 % believed that the typical commit contains 10–20 lines, 15 % guessed 20–50 lines, and only a small minority expected larger changes. Compared with the measured data, this perceived typical size is off by roughly an order of magnitude. Follow‑up interviews revealed that many developers equate “small commits” with easier reviews, smoother CI pipelines, and better traceability, reinforcing the bias that tool designers have inherited.

The paper draws three concrete implications for tool design. First, any automation that assumes small commits (e.g., thresholds for running expensive static analyses) will generate many false positives or miss critical changes in real‑world projects. Second, tools should provide explicit support for large commits: automatic splitting, risk‑based review prioritization, visual dashboards that highlight outlier contributions, and configurable alerts when a commit exceeds a risk‑adjusted size. Third, documentation, onboarding material, and best‑practice guides need to reflect the empirical distribution rather than the idealized one, helping teams align expectations with actual development patterns.
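The "configurable alert" idea in the second implication could, for example, key off a repository's own empirical distribution rather than a fixed constant, flagging commits above a chosen percentile of that project's history. The 95th-percentile cutoff and the function below are hypothetical, not a mechanism proposed in the paper:

```python
# Flag a commit when its size exceeds a percentile of the repository's
# own commit-size history. The 95th-percentile default is a hypothetical
# policy choice, not a threshold from the paper.
def size_alert(history, new_size, pct=0.95):
    if not history:
        return False  # no baseline yet; never alert
    ordered = sorted(history)
    cutoff = ordered[min(len(ordered) - 1, int(pct * len(ordered)))]
    return new_size > cutoff

history = [5, 8, 12, 20, 33, 40, 58, 90, 140, 1200]  # lines changed per past commit
print(size_alert(history, 35))    # typical commit -> False
print(size_alert(history, 2000))  # outlier -> True
```

Because heavy-tailed histories contain legitimate large commits, a percentile-based cutoff adapts to each project instead of punishing repositories whose normal workflow involves bigger changes.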

In conclusion, the authors demonstrate that the “commit‑size myth” is quantitatively false by more than an order of magnitude. The heavy‑tailed distribution they uncover has direct consequences for defect introduction, review effort, and CI reliability. By grounding tool design in this empirical reality, future IDE plugins, code‑review platforms, and CI/CD systems can become more robust, less prone to mis‑triggered alerts, and better suited to the actual workflow of modern software teams. The paper also calls for follow‑up research into domain‑specific commit‑size characteristics (e.g., embedded systems vs. data‑science notebooks) and algorithmic strategies for safely handling large contributions.

