GitEvo: Code Evolution Analysis for Git Repositories
Analyzing the code evolution of software systems is relevant for practitioners, researchers, and educators. It can help practitioners identify design trends and maintenance challenges, provide researchers with empirical data to study changes over time, and give educators real-world examples that enhance the teaching of software evolution concepts. Unfortunately, we lack tools specifically designed to support code evolution analysis. In this paper, we propose GitEvo, a multi-language and extensible tool for analyzing code evolution in Git repositories. GitEvo leverages Git frameworks and code parsing tools to integrate both Git-level and code-level analysis. We conclude by describing how GitEvo can support the development of novel empirical studies on code evolution and act as a learning tool for educators and students. GitEvo is available at: https://github.com/andrehora/gitevo.
💡 Research Summary
The paper addresses a clear gap in the software-engineering tooling landscape. Many tools exist for mining Git repositories (e.g., GitPython, PyDriller, JGit), but they operate only on version-control metadata such as commits, authors, and file changes, and they provide no insight into how the source code itself evolves. Conversely, many parsers and static-analysis frameworks (AST modules, JavaParser, Babel, Tree-sitter) can examine the structure of a single snapshot but are not integrated with Git history, which makes longitudinal studies cumbersome. To bridge this divide, the authors introduce GitEvo, a Python-based, multi-language, extensible tool that performs both Git-level and code-level analysis on any Git repository.
GitEvo’s architecture consists of four stages. First, it accepts a repository identifier (URL, local path, or a directory of repositories). Second, it leverages GitPython and PyDriller to iterate over commits within a user‑specified time window (defaulting to the most recent five years) and to retrieve the source files for each commit. Third, it parses each file with Tree‑sitter, which provides concrete syntax trees (CSTs) for the supported languages. Fourth, user‑defined metric functions are invoked on a ParsedCommit object that bundles the commit hash, a list of ParsedFile objects, and the corresponding CST nodes. The tool aggregates metric values per year or month and automatically generates both HTML visual reports and CSV data files.
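The aggregation step described above can be illustrated with a minimal, standard-library-only sketch. It does not use GitEvo, GitPython, PyDriller, or Tree-sitter: `Snapshot`, `last_commit_per_year`, and `evolution_csv` are hypothetical names invented for this illustration, and taking the last commit of each year as that year's snapshot is an assumption of the sketch, not a documented GitEvo behavior.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass
class Snapshot:
    """Hypothetical stand-in for one commit: its hash, date, and file contents."""
    commit_hash: str
    day: date
    files: dict[str, str]  # path -> source text


def last_commit_per_year(commits: list[Snapshot]) -> dict[int, Snapshot]:
    """Pick the most recent commit of each year as that year's snapshot (assumption)."""
    chosen: dict[int, Snapshot] = {}
    for c in sorted(commits, key=lambda c: c.day):
        chosen[c.day.year] = c  # later commits overwrite earlier ones
    return chosen


def loc(snapshot: Snapshot) -> int:
    """Count non-blank lines across all files in the snapshot."""
    return sum(
        1
        for text in snapshot.files.values()
        for line in text.splitlines()
        if line.strip()
    )


def file_count(snapshot: Snapshot) -> int:
    """Count the files present in the snapshot."""
    return len(snapshot.files)


def evolution_csv(
    commits: list[Snapshot],
    metrics: dict[str, Callable[[Snapshot], int]],
) -> str:
    """Return CSV text with one row per metric and one column per year."""
    per_year = last_commit_per_year(commits)
    years = sorted(per_year)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["metric", *years])
    for name, fn in metrics.items():
        writer.writerow([name, *(fn(per_year[y]) for y in years)])
    return out.getvalue()


# Tiny in-memory history standing in for a real repository.
history = [
    Snapshot("a1", date(2023, 3, 1), {"app.py": "x = 1\n"}),
    Snapshot("b2", date(2023, 11, 9), {"app.py": "x = 1\ny = 2\n"}),
    Snapshot(
        "c3",
        date(2024, 6, 5),
        {"app.py": "x = 1\ny = 2\n\nz = 3\n", "test_app.py": "assert True\n"},
    ),
]
report = evolution_csv(history, {"LOC": loc, "Files": file_count})
print(report)
```

In a real pipeline the `Snapshot` objects would come from the Git-mining stage and the metric functions would inspect parsed syntax trees rather than raw text; the bucketing and tabulation logic would look broadly similar.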
Two usage modalities are offered. The command‑line interface (e.g., gitevo -r python https://github.com/pallets/flask) runs a set of built‑in metrics (lines of code, number of source files, test files, LOC per file, etc.) and produces ready‑to‑view reports. The programmatic API is more powerful: developers create a GitEvo instance, decorate metric functions with @evo.metric(name, …), and write arbitrary CST‑based analyses in pure Python. The paper provides concrete examples, such as counting data‑structure nodes, loop constructs, or functions decorated with @pytest.
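The decorator-based registration pattern described above can be sketched as follows. This is not the GitEvo API: `MiniEvo` is a hypothetical, simplified registry written for illustration, and Python's built-in `ast` module stands in for Tree-sitter so that the example runs without third-party packages (it therefore walks an abstract syntax tree of Python code only, not a multi-language CST).

```python
import ast
from typing import Callable


class MiniEvo:
    """Hypothetical, simplified metric registry; not the GitEvo class itself."""

    def __init__(self) -> None:
        self.metrics: dict[str, Callable[[ast.AST], int]] = {}

    def metric(self, name: str):
        """Register the decorated function under a human-readable name."""
        def register(fn):
            self.metrics[name] = fn
            return fn
        return register

    def run(self, source: str) -> dict[str, int]:
        """Parse one file and evaluate every registered metric on its tree."""
        tree = ast.parse(source)
        return {name: fn(tree) for name, fn in self.metrics.items()}


evo = MiniEvo()


@evo.metric("Loops")
def loops(tree: ast.AST) -> int:
    """Count for and while statements anywhere in the tree."""
    return sum(isinstance(n, (ast.For, ast.While)) for n in ast.walk(tree))


@evo.metric("pytest-decorated functions")
def pytest_decorated(tree: ast.AST) -> int:
    """Count functions with at least one decorator starting with 'pytest.'."""
    count = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.unparse turns each decorator expression back into source text.
            if any(ast.unparse(d).startswith("pytest.") for d in node.decorator_list):
                count += 1
    return count


SAMPLE = '''
import pytest

@pytest.fixture
def client():
    return object()

@pytest.mark.parametrize("n", [1, 2])
def test_positive(n):
    for _ in range(n):
        assert n > 0

def helper():
    while False:
        pass
'''

result = evo.run(SAMPLE)
print(result)
```

The registry keeps metric definitions independent of the traversal code, so adding a metric means adding one decorated function. `ast.unparse` requires Python 3.9 or later.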
Implementation details reveal that GitEvo currently supports Python, JavaScript, TypeScript, and Java by loading the respective Tree‑sitter grammars. Adding a new language merely requires installing the grammar and registering it in the configuration, which underscores the tool’s extensibility. The authors note that parsing overhead can become significant for very large histories, but the design allows parallelization or selective commit sampling if needed.
The authors demonstrate the utility of GitEvo through three practical applications. In empirical research, they processed over 1.2 million commits across 2,168 repositories (TypeScript, JavaScript, Python) to identify 44,900 commits that introduced mocking into test suites. All analysis scripts were written once in Python, despite the multi‑language nature of the data. A second study examined ten years of functional‑programming feature usage (lambdas, comprehensions, generator expressions, etc.) in three major Python projects (CPython, Pandas, Django), revealing usage trends that were visualized with the generated reports. In industry, a custom report for the FastAPI framework was shared on GitHub Discussions, receiving positive feedback and feature requests, illustrating the tool’s relevance to practitioners.
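To make the functional-programming study more concrete, the sketch below counts lambdas, comprehensions, and generator expressions in two versions of a small file. It is an illustration only: it uses Python's standard `ast` module rather than GitEvo or Tree-sitter, and `functional_features`, `OLD`, and `NEW` are hypothetical names and inputs chosen for the example, not material from the paper.

```python
import ast
from collections import Counter

# Map syntax-tree node types to the feature names reported in the result.
FUNCTIONAL_NODES = {
    ast.Lambda: "lambda",
    ast.ListComp: "list comprehension",
    ast.SetComp: "set comprehension",
    ast.DictComp: "dict comprehension",
    ast.GeneratorExp: "generator expression",
}


def functional_features(source: str) -> Counter:
    """Count occurrences of each functional construct in one source file."""
    counts: Counter = Counter()
    for node in ast.walk(ast.parse(source)):
        label = FUNCTIONAL_NODES.get(type(node))
        if label is not None:
            counts[label] += 1
    return counts


# Two hypothetical versions of the same file, as if taken from two commits.
OLD = (
    "squares = []\n"
    "for i in range(5):\n"
    "    squares.append(i * i)\n"
)
NEW = (
    "squares = [i * i for i in range(5)]\n"
    "total = sum(x for x in squares)\n"
    "key = lambda p: p[0]\n"
)

before = functional_features(OLD)
after = functional_features(NEW)
print(dict(before), dict(after))
```

Running such a counter on one snapshot per year and plotting the totals would give the kind of trend line the summary describes, under the assumption that each snapshot is parsed with a grammar matching its Python version.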
For education, GitEvo was integrated into an undergraduate software-engineering course with more than 90 students. The assignment required students to select a real open-source repository, run GitEvo, explore the generated charts, and explain the patterns they observed. Students successfully linked metric trends to concrete design decisions (e.g., increased use of const in JavaScript, adoption of Java records), demonstrating the tool's effectiveness as a teaching aid for software-evolution concepts.
The paper concludes that GitEvo uniquely combines Git‑level and CST‑level analysis, reducing the need for multiple disparate tools and programming languages. Future work includes expanding support to all languages covered by Tree‑sitter and enriching the API with higher‑level abstractions for classes, methods, and decorators, thereby simplifying metric definition further. Overall, GitEvo offers a practical, extensible platform for researchers, developers, and educators to study and visualize code evolution across diverse code bases.