Systematically Examining Reproducibility: A Case Study for High Throughput Sequencing using the PRIMAD Model and BioCompute Object
The reproducibility of computational pipelines is an expectation in biomedical science, particularly in critical domains like human health. In this context, the need to report next-generation genome sequencing methods used in precision medicine spurred the development of the IEEE 2791-2020 standard for Bioinformatics Analyses Generated by High Throughput Sequencing (HTS), known as the BioCompute Object (BCO). Championed by the U.S. Food and Drug Administration, the BCO is a pragmatic framework for documenting pipelines; however, it has not been systematically assessed for its reproducibility claims. This study uses the PRIMAD model, a conceptual framework for describing computational experiments for reproducibility purposes, to systematically review the BCO for depth and coverage. A meticulous mapping of BCO and PRIMAD elements onto a published BCO use case reveals potential omissions and necessary extensions within both frameworks. This underscores the significance of systematically validating the reproducibility claims of published digital objects, thereby enhancing the reliability of scientific research in bioscience and related disciplines. This study, along with its artifacts, is reported as an RO-Crate, providing a structured reporting approach; it is available at https://doi.org/10.5281/zenodo.14317922.
💡 Research Summary
The paper addresses a critical gap in the reproducibility of computational pipelines used in high‑throughput sequencing (HTS) for precision medicine. While the BioCompute Object (BCO) – formalized as IEEE 2791‑2020 – has been promoted by the U.S. Food and Drug Administration as a pragmatic framework for documenting HTS analyses, its ability to guarantee reproducibility has not been systematically examined. To fill this void, the authors adopt the PRIMAD model, a conceptual schema that decomposes a computational experiment into seven orthogonal elements: Problem, Research goal, Implementation, Method, Data, Actor, and Device. PRIMAD is widely recognized for its capacity to pinpoint which aspects of an experiment must be reported to enable faithful replication.
The study proceeds by selecting a publicly available BCO use case and mapping each of its ten domains (e.g., Provenance Domain, Execution Domain, Description Domain) onto the PRIMAD categories. This mapping reveals several systematic omissions. First, the "Research goal" – the explicit scientific question driving the analysis – lacks a dedicated field in the BCO, making it difficult for downstream users to assess whether the pipeline aligns with their own objectives. Second, the "Actor" element, which captures the individuals or institutions responsible for the analysis, is not explicitly recorded, obscuring accountability and hindering communication when discrepancies arise. Third, the "Device" component, encompassing hardware specifications, operating‑system versions, and virtualization settings, is only partially represented in the Execution Domain; critical details such as exact container image digests, GPU availability, and low‑level library versions are missing, jeopardizing environment reconstruction.
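The alignment described above can be sketched as a simple lookup table. This is a coarse illustration, not the authors' actual mapping: the BCO domain keys follow the JSON naming used in IEEE 2791-2020, and the empty entries mark the omissions the summary reports.

```python
# Illustrative PRIMAD -> BCO-domain alignment (assumed, simplified).
# Empty lists flag PRIMAD elements with no dedicated BCO field, as
# reported in the study; partial coverage is noted in comments.
PRIMAD_TO_BCO = {
    "Problem": ["description_domain"],
    "Research goal": [],                      # reported omission
    "Implementation": ["execution_domain"],
    "Method": ["description_domain", "parametric_domain"],
    "Data": ["io_domain"],
    "Actor": [],                              # reported omission
    "Device": ["execution_domain"],           # only partially covered
}

def uncovered_elements(mapping):
    """Return PRIMAD elements with no covering BCO domain."""
    return [element for element, domains in mapping.items() if not domains]

print(uncovered_elements(PRIMAD_TO_BCO))  # → ['Research goal', 'Actor']
```

A table like this makes the audit mechanical: any element whose domain list is empty is a candidate gap in the documentation standard.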
On the data side, the BCO includes checksums and source URLs for raw input files (e.g., FASTQ), but it does not systematically document derived artifacts such as aligned BAM files, Variant Call Format (VCF) outputs, or intermediate quality‑control reports. Consequently, a researcher attempting to reproduce the full analytical workflow would need to regenerate these intermediate products without guidance on the exact parameters or software versions used. On the implementation side, the BCO references Docker images and script locations, yet it fails to provide a complete dependency manifest (e.g., a requirements.txt or conda environment file) that would allow exact recreation of the software stack.
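The kind of checksum coverage argued for here, spanning both raw inputs and derived artifacts, is straightforward to generate. A minimal sketch (not part of the BCO tooling; the file names are synthetic stand-ins):

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256sum(path, chunk_size=1 << 20):
    """Stream a file in chunks and return its hex SHA-256 digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksum_manifest(paths):
    """Map each path (raw input or derived artifact) to its digest."""
    return {str(p): sha256sum(p) for p in paths}

# Tiny demo on synthetic files standing in for a FASTQ input and a
# derived VCF output.
with tempfile.TemporaryDirectory() as workdir:
    raw = Path(workdir) / "sample.fastq"
    derived = Path(workdir) / "sample.vcf"
    raw.write_text("@read1\nACGT\n+\nIIII\n")
    derived.write_text("##fileformat=VCFv4.2\n")
    print(json.dumps(checksum_manifest([raw, derived]), indent=2))
```

Recording such a manifest for every intermediate product (BAM, VCF, QC reports) lets a re-runner verify each pipeline stage independently rather than only the final result.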
Based on these findings, the authors propose concrete extensions to the BCO specification. They recommend adding a “Research goal” field to the Description Domain, a standardized “Actor” field to capture contributor identifiers (ORCID, institutional affiliation), and a richer “Device” sub‑section that records OS version, hardware architecture, container image SHA‑256 digests, and any accelerator details. For data provenance, they suggest a hierarchical “Data” section that logs both primary inputs and all derived outputs, each with persistent identifiers, checksums, and storage locations. Finally, they advocate for an explicit “Implementation” manifest that enumerates all software dependencies with exact version numbers.
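A hedged sketch of what a BCO fragment carrying these proposed extensions might look like. All field names below (research_goal, actors, device, derived_subdomain, implementation_manifest) are hypothetical illustrations of the recommendations, not part of the IEEE 2791-2020 schema, and all values are placeholders:

```python
import json

# Hypothetical extended-BCO fragment (illustrative only; not IEEE 2791-2020).
extended_bco_fragment = {
    "description_domain": {
        "research_goal": "<explicit scientific question driving the analysis>",
    },
    "actors": [
        {
            "name": "<contributor name>",
            "orcid": "https://orcid.org/0000-0000-0000-0000",  # placeholder
            "affiliation": "<institution>",
            "role": "pipeline author",
        }
    ],
    "device": {
        "os_version": "<operating-system release>",
        "architecture": "<hardware architecture, e.g. x86_64>",
        "container_image_digest": "sha256:<digest>",
        "accelerators": ["<GPU model and driver, if any>"],
    },
    "io_domain": {
        "input_subdomain": [
            {"uri": "<raw FASTQ location>", "sha256": "<digest>", "pid": "<persistent id>"}
        ],
        "derived_subdomain": [
            {"uri": "<VCF location>", "sha256": "<digest>", "derived_from": "<input uri>"}
        ],
    },
    "implementation_manifest": [
        {"package": "<tool name>", "version": "<exact version>"},
    ],
}

print(json.dumps(extended_bco_fragment, indent=2))
```

The point of the sketch is structural: each proposed extension lands in a named, machine-readable slot, so validation tooling could check for its presence rather than relying on free-text description.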
To demonstrate a reproducible publishing workflow, the authors package the entire study—including the original BCO, the PRIMAD‑BCO mapping, and all supplementary scripts—into a RO‑Crate. The RO‑Crate, hosted on Zenodo with a DOI (https://doi.org/10.5281/zenodo.14317922), provides a machine‑readable JSON‑LD metadata bundle that interlinks data, code, and documentation. This approach showcases how RO‑Crate can complement BCO by offering a unified container for all research artifacts, thereby facilitating automated discovery, citation, and reuse.
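The RO-Crate packaging can be illustrated with a minimal hand-rolled ro-crate-metadata.json following the RO-Crate 1.1 conventions. The file names in the @graph below are placeholders, not the actual contents of the Zenodo deposit:

```python
import json

# Minimal RO-Crate 1.1 metadata sketch (illustrative entity names).
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # The metadata descriptor, pointing at the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # The root dataset: the study and its artifacts.
            "@id": "./",
            "@type": "Dataset",
            "name": "PRIMAD-BCO reproducibility case study",
            "hasPart": [{"@id": "bco.json"}, {"@id": "primad_bco_mapping.csv"}],
        },
        {"@id": "bco.json", "@type": "File",
         "encodingFormat": "application/json"},
        {"@id": "primad_bco_mapping.csv", "@type": "File",
         "encodingFormat": "text/csv"},
    ],
}

print(json.dumps(crate, indent=2))
```

Because the @graph interlinks the BCO, the mapping, and any scripts as typed entities, harvesters can discover and cite each artifact individually, which is the complementarity with BCO the paragraph describes.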
In conclusion, the paper delivers a rigorous, model‑driven audit of the BioCompute Object, exposing specific deficiencies that could undermine reproducibility claims. By aligning BCO with PRIMAD and enriching it with the proposed extensions, the framework can evolve from a documentation checklist into a robust reproducibility guarantee. The authors also outline future directions, including the development of automated tools for PRIMAD‑BCO mapping across diverse HTS pipelines and broader community adoption of RO‑Crate as a standard packaging format for computational biology studies.