Does Functional Package Management Enable Reproducible Builds at Scale? Yes
Reproducible Builds (R-B) guarantee that rebuilding a software package from source leads to bitwise identical artifacts. R-B is a promising approach to increase the integrity of the software supply chain when installing open source software built by third parties. Unfortunately, despite success stories like high build reproducibility levels in Debian packages, uncertainty remains among field experts on the scalability of R-B to very large package repositories. In this work, we perform the first large-scale study of bitwise reproducibility, in the context of the Nix functional package manager, rebuilding 709 816 packages from historical snapshots of the nixpkgs repository, the largest cross-ecosystem open source software distribution, sampled in the period 2017-2023. We obtain very high bitwise reproducibility rates, between 69 and 91% with an upward trend, and even higher rebuildability rates, over 99%. We investigate unreproducibility causes, showing that about 15% of failures are due to embedded build dates. We release a novel dataset with all build statuses, logs, as well as full “diffoscopes”: recursive diffs of where unreproducible build artifacts differ.
💡 Research Summary
The paper investigates whether functional package management (FPM), as embodied by the Nix package manager, can deliver reproducible builds (R‑B) at the scale of a large, cross‑ecosystem software distribution. The authors focus on the nixpkgs repository, which contains roughly 100 000 packages as of late 2024, and they rebuild a total of 709 816 packages drawn from historical snapshots spanning July 2017 to April 2023. By employing Nix’s --check flag together with --keep-failed, each package is built from source, compared bit‑for‑bit against any cached binary of the same derivation, and any mismatches are retained for further analysis. The study is organized around five research questions (RQ0–RQ4) that address the evolution of reproducibility over time, ecosystem‑specific patterns, root causes of non‑reproducibility, and the mechanisms by which fixes are introduced.
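The bit-for-bit comparison mentioned above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the mechanism Nix itself uses: it hashes only regular file contents and ignores symlinks and permission bits, and the names tree_digest and differing_files are hypothetical helpers introduced for this example.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Map each regular file under root (by relative path) to its SHA-256 digest.

    Simplifying assumption: symlinks and permission bits are ignored.
    """
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and not path.is_symlink():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def differing_files(reference, rebuilt):
    """Return the sorted relative paths whose contents differ or exist on one side only."""
    ref, new = tree_digest(reference), tree_digest(rebuilt)
    return sorted(p for p in ref.keys() | new.keys() if ref.get(p) != new.get(p))


def is_bitwise_reproducible(reference, rebuilt):
    """A rebuild counts as reproducible here only if no file differs."""
    return not differing_files(reference, rebuilt)
```

An empty result from differing_files means the two output trees are identical under this simplified definition; any listed path is a candidate for closer inspection with a recursive diff tool such as diffoscope.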
Methodologically, the authors start from 200 revisions previously selected for a related study. Because full builds are computationally intensive, they apply a dichotomic sampling strategy: they iteratively select the middle revision of the largest remaining time interval, ending up with 17 evenly spaced snapshots. For each snapshot they trigger a full Hydra‑style CI build of every derivation in nixpkgs, a massive workload that required careful orchestration of compute resources. The reproducibility check needs a reference output to compare against: the pre‑built output is fetched from the official binary cache (cache.nixos.org), the derivation is rebuilt locally, and the two sets of outputs are compared byte for byte. When a mismatch occurs, the diffoscope tool is invoked to generate recursive diffs (the authors call the resulting artifacts “diffoscopes”), which are stored in JSON/HTML form for later automated analysis.
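The dichotomic sampling step can be sketched in Python as follows. This is one plausible reading of the description rather than the authors' actual procedure: seeding the selection with the first and last revision, breaking ties toward the earlier interval, and the function name dichotomic_sample are all assumptions made for illustration.

```python
from bisect import insort


def dichotomic_sample(timestamps, k):
    """Select k timestamps by repeatedly bisecting the widest time interval.

    Assumption: the selection is seeded with the earliest and latest revision.
    """
    ts = sorted(timestamps)
    if k < 2 or k > len(ts):
        raise ValueError("k must be between 2 and the number of revisions")
    chosen = [0, len(ts) - 1]  # indices into ts, kept sorted
    while len(chosen) < k:
        best = None
        # Find the widest interval that still contains an unselected revision.
        for lo, hi in zip(chosen, chosen[1:]):
            if hi - lo < 2:
                continue
            gap = ts[hi] - ts[lo]
            if best is None or gap > best[0]:
                best = (gap, lo, hi)
        if best is None:
            break
        _, lo, hi = best
        midpoint = (ts[lo] + ts[hi]) / 2
        # Pick the revision strictly inside the interval that is closest to its midpoint.
        mid = min(range(lo + 1, hi), key=lambda i: abs(ts[i] - midpoint))
        insort(chosen, mid)
    return [ts[i] for i in chosen]
```

Under the endpoint-seeding assumption, four complete rounds of bisection yield 2 + 1 + 2 + 4 + 8 = 17 points, which would be consistent with the 17 snapshots reported, although the summary does not state how the procedure was initialized.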
The empirical results are striking. Bit‑wise reproducibility rates range from 69 % to 91 % across the sampled snapshots, with an upward trend over time, despite the repository’s growth from roughly 60 000 to over 100 000 packages. Overall rebuildability exceeds 99 %, indicating that almost every package can be built from source, even if some outputs differ at the byte level. Ecosystem‑level analysis reveals substantial variance: packages from interpreted language ecosystems (Python, Ruby, Perl) and certain system libraries exhibit higher non‑reproducibility, while pure compiled ecosystems (C/C++) tend to be more stable. About 15 % of reproducibility failures are attributed to embedded build dates, a classic source of non‑determinism that can be mitigated by consistently honoring the SOURCE_DATE_EPOCH environment variable. Other identified causes include non‑deterministic compiler flags, unfixed random seeds, and accidental inclusion of absolute build‑path strings in binaries.
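To make the timestamp mitigation concrete, here is a minimal Python sketch of a build step that honors SOURCE_DATE_EPOCH (an integer number of seconds since the Unix epoch, interpreted in UTC) instead of reading the wall clock. The functions build_timestamp and version_banner are hypothetical and are not taken from the paper or from nixpkgs.

```python
import os
import time
from datetime import datetime, timezone


def build_timestamp(environ=None):
    """Return the timestamp to embed in build output.

    If SOURCE_DATE_EPOCH is set, use it so that rebuilds embed the same date;
    otherwise fall back to the current time, which is not reproducible.
    """
    environ = os.environ if environ is None else environ
    epoch = environ.get("SOURCE_DATE_EPOCH")
    seconds = int(epoch) if epoch is not None else int(time.time())
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def version_banner(name, environ=None):
    """Format a banner of the kind that often ends up embedded in build artifacts."""
    stamp = build_timestamp(environ).strftime("%Y-%m-%d")
    return f"{name} (built {stamp})"
```

With the variable pinned, two builds run on different days produce the same banner, so the embedded date no longer causes a byte-level difference between the artifacts.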
Fix tracking shows that many non‑reproducible packages are eventually patched, either through targeted reproducibility patches (e.g., upstream patches that strip timestamps) or as side‑effects of broader version upgrades. The authors distinguish intentional fixes—often linked to reproducibility tickets in the Nix community—from accidental fixes that arise when a package is updated for unrelated reasons. By correlating the timing of patches with the appearance of reproducibility improvements, they demonstrate that the Nix ecosystem’s continuous integration pipeline (Hydra) and its policy of building every package in a sandboxed environment play a crucial role in surfacing and resolving these issues.
A major contribution of the work is the release of a comprehensive dataset comprising build logs, metadata for all 709 000+ builds, and more than 114 000 diffoscopes documenting the exact differences between non‑reproducible artifacts. This dataset enables future research on automated root‑cause analysis, machine‑learning models that predict reproducibility risk, and longitudinal studies of package freshness and maintenance practices.
In discussion, the authors argue that the high reproducibility rates achieved demonstrate that the perceived impracticality of R‑B at large scale is largely a myth; functional package management provides the necessary isolation, deterministic environment variables, and input‑addressed storage to make bit‑wise reproducibility feasible. They acknowledge remaining challenges, such as handling packages that deliberately embed build‑time information (e.g., version strings derived from git timestamps) and the need for better tooling to automatically rewrite such patterns. Threats to validity include the reliance on Nix’s own reproducibility checks (which may miss subtle differences) and the fact that the study only covers default build configurations, not alternative build flags that could introduce additional nondeterminism.
Overall, the paper delivers the first large‑scale, empirical validation that functional package management can support reproducible builds across a massive, heterogeneous software collection, and it provides a valuable open dataset for the community to further improve supply‑chain security and build determinism.