Did You Forkget It? Detecting One-Day Vulnerabilities in Open-source Forks With Global History Analysis

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv paper.

Tracking vulnerabilities inherited from third-party open-source software is a well-known challenge, often addressed by tracing the threads of dependency information. However, vulnerabilities can also propagate through forking: a code repository forked after the introduction of a vulnerability, but before it is patched, may remain vulnerable long after the vulnerability has been fixed in the initial repository. History analysis approaches are used to track vulnerable software versions at scale. However, such approaches fail to track vulnerabilities in forks, leaving fork maintainers to identify them manually. This paper presents a global history analysis approach to help software developers identify one-day (known but unpatched) vulnerabilities in forked repositories. Leveraging the global graph of public code, as captured by the Software Heritage archive, our approach propagates vulnerability information at the commit level and performs automated impact analysis. Starting from 7162 repositories with vulnerable commits listed in OSV, we propagate vulnerability information to 2.2 million forks. We evaluate our approach by filtering forks with significant user bases whose latest commit is still potentially vulnerable, manually auditing the code, and contacting maintainers for confirmation and responsible disclosure. This process identified 135 high-severity one-day vulnerabilities, achieving a precision of 0.69, with 9 confirmed by maintainers.


💡 Research Summary

The paper tackles a largely overlooked security problem in the open‑source ecosystem: the propagation of “one‑day” vulnerabilities through forks. A one‑day vulnerability is a known flaw that has been fixed upstream but remains unpatched in a downstream fork that was created after the vulnerable commit and before the fixing commit. Existing vulnerability databases such as OSV only track commits in the original (upstream) repository, so downstream forks are invisible to these tools, leaving fork maintainers to discover and remediate such issues manually.
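For context, an OSV record ties its Git commit ranges to a single repository URL, which is why forks fall outside it. The sketch below reads the `affected[].ranges[]` entries of type `GIT` from an OSV-format record. The record itself (id, repository URL, commit hashes) is a made-up placeholder, and `git_ranges` is a hypothetical helper name, not part of any OSV tooling.

```python
import json

# Placeholder record in the OSV JSON format; id, repo URL, and hashes are invented.
record = json.loads("""
{
  "id": "EXAMPLE-0000-0001",
  "affected": [{
    "ranges": [{
      "type": "GIT",
      "repo": "https://example.org/upstream/project.git",
      "events": [
        {"introduced": "1111111111111111111111111111111111111111"},
        {"fixed": "2222222222222222222222222222222222222222"}
      ]
    }]
  }]
}
""")

def git_ranges(osv_record):
    """Yield (repo, introduced_commits, fixed_commits) for each GIT range."""
    for affected in osv_record.get("affected", []):
        for rng in affected.get("ranges", []):
            if rng.get("type") != "GIT":
                continue  # skip SEMVER / ECOSYSTEM ranges
            events = rng.get("events", [])
            introduced = [e["introduced"] for e in events if "introduced" in e]
            fixed = [e["fixed"] for e in events if "fixed" in e]
            yield rng.get("repo"), introduced, fixed

ranges = list(git_ranges(record))
for repo, introduced, fixed in ranges:
    print(repo, introduced, fixed)
```

Note that the only repository named in the range is the upstream one; nothing in the record points to repositories that share the same commits.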

To address this gap, the authors propose a global history‑analysis approach that leverages the Software Heritage (SWH) archive – the world’s largest public source‑code archive – to propagate vulnerability information at the commit level across the entire fork ecosystem. Their method consists of three main steps:

  1. Extraction of Vulnerable Commits – From OSV they collect all vulnerability entries that specify Git commit ranges (introduced and fixed commits). This yields 7,162 upstream repositories with known vulnerable commits.

  2. Construction of a Global Fork Graph – Using the SWH deduplicated Merkle DAG, they identify “shared‑commit forks” as defined by Pietri et al.: two repositories are forks of each other if they share at least one commit. This definition is platform‑agnostic and captures forks across GitHub, GitLab, self‑hosted instances, and even cross‑forge clones.

  3. Commit‑Level Propagation Algorithm – Starting from each vulnerable commit, a breadth‑first (or depth‑first) traversal marks all descendant commits as vulnerable until a fixing commit is encountered, after which all later descendants are marked safe. Because the SWH graph contains ~5 billion commits and ~0.5 trillion edges, the authors implement a compressed in‑memory representation that enables a single‑machine analysis of the entire public code base.
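The propagation step can be sketched on a toy commit graph. This is a minimal illustration, not the authors' implementation: the graph is a plain Python dict rather than the compressed Software Heritage representation, and the names `descendants`, `vulnerable_commits`, and the short commit ids are hypothetical. It assumes a commit counts as vulnerable when it descends from an introducing commit and does not descend from a fixing commit.

```python
from collections import deque

def descendants(children, roots):
    """Return the roots plus every commit reachable through parent->child edges."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        commit = queue.popleft()
        for child in children.get(commit, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def vulnerable_commits(children, introduced, fixed):
    """Commits that contain an introducing commit but no fixing commit (assumed rule)."""
    return descendants(children, introduced) - descendants(children, fixed)

# Toy history (hypothetical ids): upstream is a -> b -> c -> d, where c is the fix.
# A fork branched at b and added f1 -> f2 without ever picking up c.
children = {"a": ["b"], "b": ["c", "f1"], "c": ["d"], "f1": ["f2"]}
affected = vulnerable_commits(children, introduced={"a"}, fixed={"c"})

# Check the latest commit of each repository against the propagated set.
heads = {"upstream": "d", "fork": "f2"}
flagged = sorted(name for name, head in heads.items() if head in affected)
print(flagged)  # ['fork']
```

Subtracting the descendants of the fix, rather than merely stopping the traversal at the fix, keeps a merge commit that pulls in the fix from being flagged through its unfixed parent.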

Applying this pipeline to the SWH graph (as of April 2024) propagates vulnerability information to 2.2 million forks. To evaluate practical impact, the authors filter forks by popularity (GitHub stars) and severity (CVSS ≥ 7.0), yielding about 1,500 candidate forks. Manual inspection of the latest commits in these candidates identifies 135 high‑severity one‑day vulnerabilities, at a precision of 0.69 (i.e., 69% of the flagged cases that were audited turned out to be true positives). The authors also carried out responsible disclosure: they contacted maintainers of the affected forks and received confirmation for 9 high‑severity cases, showing that the flagged issues include real, unpatched vulnerabilities in downstream projects.
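The precision figure can be read as simple arithmetic. The sketch below assumes that the 135 reported vulnerabilities are the true positives of the manual audit; the number of audited cases is not stated in this summary, so the value computed here is an inference under that assumption, not a reported result.

```python
def precision(true_positives, false_positives):
    """Share of flagged cases that are real: TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)

true_positives = 135        # reported count, assumed to be the audit's true positives
reported_precision = 0.69   # reported precision

# Inferred (not reported): number of audited cases consistent with both figures.
implied_audited = round(true_positives / reported_precision)
implied_false_positives = implied_audited - true_positives

print(implied_audited, implied_false_positives)
print(round(precision(true_positives, implied_false_positives), 2))
```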

Beyond the detection pipeline, the paper presents two tooling artifacts:

  • A public lookup website where users can query any public commit (by SHA‑1 or SWHID) to see whether it is currently vulnerable according to the propagated data.
  • A prototype CI/CD plugin for GitHub Actions and GitLab CI that automatically checks new pushes against the propagated vulnerability database, alerting developers in real time.
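A lookup of this kind has to accept either a bare SHA‑1 or a SWHID for the same commit. The sketch below normalizes both forms to a revision SWHID (`swh:1:rev:<sha1>`) and checks it against a local table. It does not call the authors' website or plugin, whose interfaces are not described here; `to_revision_swhid`, `is_vulnerable`, the table, and the advisory id are hypothetical.

```python
import re

SHA1_RE = re.compile(r"[0-9a-f]{40}")
REV_PREFIX = "swh:1:rev:"

def to_revision_swhid(identifier):
    """Normalize a 40-hex SHA-1 or a revision SWHID to 'swh:1:rev:<sha1>'."""
    core = identifier.strip().lower().split(";", 1)[0]  # drop SWHID qualifiers
    sha = core[len(REV_PREFIX):] if core.startswith(REV_PREFIX) else core
    if not SHA1_RE.fullmatch(sha):
        raise ValueError(f"not a commit SHA-1 or revision SWHID: {identifier!r}")
    return REV_PREFIX + sha

# Hypothetical local table: revision SWHID -> advisory ids that still apply.
propagated = {REV_PREFIX + "a" * 40: ["EXAMPLE-0000-0001"]}

def is_vulnerable(identifier, table=propagated):
    """Return the advisory ids recorded for this commit (empty list if none)."""
    return table.get(to_revision_swhid(identifier), [])

print(is_vulnerable("A" * 40))
print(is_vulnerable("swh:1:rev:" + "b" * 40))
```

In a CI job, the same check would run on the pushed head commit and fail or warn when the returned list is non-empty.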

The authors discuss how their method complements existing static or dynamic analysis tools. While static analysis can pinpoint vulnerable code patterns, it is language‑specific and computationally expensive at scale. History‑based analysis, by contrast, is language‑agnostic and scalable, but traditionally limited to upstream repositories. By extending history analysis to the global commit graph, the paper bridges this gap and provides a systematic way to monitor fork‑based risk.

Limitations acknowledged include:

  • Dependence on OSV coverage – vulnerabilities not recorded in OSV cannot be propagated.
  • Sensitivity to rewritten history – cherry‑picked or rebased patches receive new commit hashes, so a fork that backports a fix this way can still be flagged as vulnerable; the current algorithm relies on commit identity being preserved.
  • Lack of deployment‑time information – the approach flags vulnerable commits but does not know whether a fork’s code is actually deployed in production, which would affect real‑world risk assessment.
  • Legal and privacy considerations when analyzing a massive public code base.
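The cherry‑pick limitation follows from how Git derives commit ids: the id is the SHA‑1 of the commit object, which includes the parent ids. The sketch below computes ids the way Git does for commit objects and shows that the same change applied on a different parent gets a different id. The tree and parent hashes and the author line are placeholders, and `git_commit_id` is a hypothetical helper name.

```python
import hashlib

def git_commit_id(tree, parents, author, committer, message):
    """SHA-1 of a Git commit object: 'commit <size>\\0' header plus the body."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode()
    header = f"commit {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

who = "Dev <dev@example.org> 1700000000 +0000"  # placeholder identity and timestamp
tree = "3" * 40                                  # placeholder tree hash

# Same tree, author, and message; only the parent differs.
upstream_fix = git_commit_id(tree, ["1" * 40], who, who, "Fix overflow\n")
cherry_picked = git_commit_id(tree, ["2" * 40], who, who, "Fix overflow\n")

print(upstream_fix != cherry_picked)  # True
```

Because the fork's copy of the fix has a different id, a hash‑based lookup does not recognize it as the fixing commit listed in the advisory.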

Future work suggested involves hybridizing the global history approach with static/dynamic analysis to improve precision, integrating package‑manager metadata and CI logs to infer actual deployment, and building automated notification channels for downstream maintainers.

In summary, the paper introduces a scalable method for detecting one‑day vulnerabilities across the open‑source fork ecosystem by propagating commit‑level vulnerability data through the Software Heritage graph. The empirical study propagates vulnerability information to 2.2 million forks and, among popular forks, identifies 135 high‑severity one‑day vulnerabilities, 9 of which were confirmed by maintainers. The presented tools show practical ways to integrate this insight into developers’ workflows and strengthen the security of the open‑source supply chain.

