How Different Are Different diff Algorithms in Git?
Automatic identification of the differences between two versions of a file is a common and basic task in several applications of mining code repositories. Git, a version control system, provides a diff utility whose algorithm users can select, from the default Myers algorithm to the more advanced Histogram algorithm. From our systematic mapping, we identified three popular applications of diff in recent studies. For code churn metrics in 14 Java projects, the values differed in 1.7% to 8.2% of commits depending on the diff algorithm. For bug-introducing change identification in 10 Java projects, we found that 6.0% and 13.3% of the identified bug-fix commits yielded different bug-introducing changes. For patch application, our manual analysis found that Histogram is more suitable than Myers for presenting code changes. Thus, we strongly recommend using the Histogram algorithm when mining Git repositories to analyze differences in source code.
💡 Research Summary
This paper investigates the practical impact of choosing different diff algorithms in Git, focusing on the default Myers algorithm and the more recent Histogram algorithm introduced in Git 1.7.7 (2011). The authors first conduct a systematic mapping study of 52 recent software‑engineering papers (published 2013‑2017) that employ Git diff. The mapping reveals that almost all studies (98 %) analyze open‑source projects and that diff is primarily used for three purposes: (1) extracting code changes, (2) collecting change‑based metrics, and (3) identifying bug‑introducing changes via the SZZ technique. Despite the availability of multiple algorithms (Myers, Minimal, Patience, Histogram), prior work overwhelmingly relies on Myers without explicit justification.
To assess whether the algorithm choice matters, the authors perform three empirical comparisons on real‑world data.
- Code churn metrics – Using 14 Java projects that employ continuous integration, they compute added and deleted lines per commit with both algorithms. The results show that the number of added lines differs by 0.8 %–6.2 % and deleted lines by 1.4 %–7.6 % across files, affecting 1.7 %–8.2 % of commits. Such variations can influence productivity, size, and complexity analyses that depend on churn statistics.
- Bug‑introducing change identification – On ten Apache projects, the authors apply the SZZ algorithm to locate the original buggy commit for each bug‑fix. The two diff algorithms produce divergent bug‑introducing line sets in 6.0 % and 13.3 % of the identified bug‑fix commits, and the share of files with differing deleted lines ranges from 2.4 % to 6.6 %. This discrepancy suggests that the choice of diff algorithm can bias the labeling of buggy versus clean code, potentially affecting downstream defect‑prediction models.
- Patch quality assessment – From a total of 21,590 changes, the authors randomly sample 377 changes and have expert reviewers manually evaluate the quality of the generated patches. Evaluation criteria include how well the hunks reflect the developers’ intent and whether unnecessary small differences are introduced. The manual analysis finds that for pure code changes, Histogram yields higher‑quality patches in 62.6 % of files, whereas Myers is superior in only 16.9 % of files. For non‑code changes (comments, formatting), both algorithms perform similarly.
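To make the first two comparisons concrete, the following Python sketch shows how two different diffs of the same file pair can yield different churn counts and different sets of deleted lines (the lines an SZZ-style analysis would trace back). It is an illustration under stated assumptions, not the authors' tooling: `parse_unified_diff`, `FileDiffStats`, and `compare` are hypothetical names, the parser assumes a single-file unified diff, and the two sample diffs are hand-written edit scripts standing in for the output of two algorithms rather than actual Git output.

```python
import re
from dataclasses import dataclass, field

# Matches a unified-diff hunk header such as "@@ -1,3 +1,3 @@".
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass
class FileDiffStats:
    added: int = 0                                    # number of '+' lines
    deleted: int = 0                                  # number of '-' lines
    deleted_lines: set = field(default_factory=set)   # old-file line numbers


def parse_unified_diff(diff_text):
    """Collect churn counts and deleted old-file line numbers.

    Assumption: diff_text is a unified diff for a single file, so the
    '---' / '+++' header lines only appear before the first hunk.
    """
    stats = FileDiffStats()
    old_lineno = None  # stays None until the first hunk header is seen
    for line in diff_text.splitlines():
        match = HUNK_RE.match(line)
        if match:
            old_lineno = int(match.group(1))
            continue
        if old_lineno is None:
            continue  # file header lines before the first hunk
        if line.startswith("+"):
            stats.added += 1
        elif line.startswith("-"):
            stats.deleted += 1
            stats.deleted_lines.add(old_lineno)
            old_lineno += 1
        elif line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        else:
            old_lineno += 1  # context line, present in the old file
    return stats


def compare(stats_a, stats_b):
    """Report whether churn and the SZZ input (deleted lines) diverge."""
    return {
        "churn_differs": (stats_a.added, stats_a.deleted)
        != (stats_b.added, stats_b.deleted),
        "deleted_lines_differ": stats_a.deleted_lines != stats_b.deleted_lines,
    }


# Two hand-written, equally valid edit scripts turning [a, b, c] into [c, a, b].
DIFF_A = "\n".join([
    "--- a/f.txt", "+++ b/f.txt", "@@ -1,3 +1,3 @@",
    "+c", " a", " b", "-c",
])
DIFF_B = "\n".join([
    "--- a/f.txt", "+++ b/f.txt", "@@ -1,3 +1,3 @@",
    "-a", "-b", " c", "+a", "+b",
])

stats_a = parse_unified_diff(DIFF_A)
stats_b = parse_unified_diff(DIFF_B)
result = compare(stats_a, stats_b)
```

For these samples, the first diff records one added and one deleted line (old line 3), while the second records two added and two deleted lines (old lines 1 and 2), so both flags in `result` are true. Divergence of this kind is what the study measures at the scale of whole repositories.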
The paper also discusses validity threats such as selection bias of projects, sampling strategy for commits, and subjectivity in manual labeling. Mitigation strategies include using multiple projects, random sampling, and double‑blind expert reviews. All datasets and scripts are made publicly available for reproducibility.
In conclusion, the empirical evidence indicates that the choice of diff algorithm measurably changes mining results, and that the Histogram algorithm more often produces diffs that reflect developers’ intent than Myers, especially for code‑level changes. The authors therefore strongly recommend that researchers and practitioners who mine Git repositories for metrics, bug‑introduction analysis, or patch generation prefer the --diff-algorithm=histogram option over the default setting. This recommendation aims to improve the reliability of empirical software‑engineering studies that depend on diff‑derived data.
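A mining script can apply this recommendation by passing the algorithm explicitly on every `git diff` call. The Python sketch below is one possible way to do that, assuming `git` is on the PATH; `diff_command` and `run_diff` are hypothetical helper names, while `--diff-algorithm` and `--numstat` are standard `git diff` options.

```python
import subprocess

# Algorithm names accepted by git diff --diff-algorithm.
ALGORITHMS = ("myers", "minimal", "patience", "histogram")


def diff_command(old_rev, new_rev, algorithm="histogram", numstat=False):
    """Build the argument list for a git diff with an explicit algorithm."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown diff algorithm: {algorithm!r}")
    cmd = ["git", "diff", f"--diff-algorithm={algorithm}"]
    if numstat:
        cmd.append("--numstat")  # per-file added/deleted line counts
    cmd += [old_rev, new_rev]
    return cmd


def run_diff(repo_path, old_rev, new_rev, algorithm="histogram", numstat=False):
    """Run the diff inside repo_path and return its standard output as text."""
    cmd = diff_command(old_rev, new_rev, algorithm, numstat)
    completed = subprocess.run(
        cmd, cwd=repo_path, capture_output=True, text=True, check=True
    )
    return completed.stdout


# Hypothetical revisions, shown only to illustrate the resulting command.
example = diff_command("HEAD~1", "HEAD", numstat=True)
```

Calling `run_diff` once with `"myers"` and once with `"histogram"` on the same pair of revisions gives the two outputs to compare. To change the default for a whole repository instead, `git config diff.algorithm histogram` sets the same choice in the configuration.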