Standing Together for Reproducibility in Large-Scale Computing: Report on reproducibility@XSEDE
This is the final report on reproducibility@XSEDE, a one-day workshop held in conjunction with XSEDE14, the annual conference of the Extreme Science and Engineering Discovery Environment (XSEDE). The workshop’s discussion-oriented agenda focused on reproducibility in large-scale computational research. Two important themes capture the spirit of the workshop submissions and discussions: (1) organizational stakeholders, especially supercomputer centers, are in a unique position to promote, enable, and support reproducible research; and (2) individual researchers should conduct each experiment as though someone will replicate that experiment. Participants documented numerous issues, questions, technologies, practices, and potentially promising initiatives emerging from the discussion, but also highlighted four areas of particular interest to XSEDE: (1) documentation and training that promotes reproducible research; (2) system-level tools that provide build- and run-time information at the level of the individual job; (3) the need to model best practices in research collaborations involving XSEDE staff; and (4) continued work on gateways and related technologies. In addition, an intriguing question emerged from the day’s interactions: would there be value in establishing an annual award for excellence in reproducible research?
💡 Research Summary
The final report of the “reproducibility@XSEDE” workshop, held in conjunction with XSEDE14, captures a community‑wide discussion on how to improve reproducibility in large‑scale computational research. The workshop’s agenda was deliberately discussion‑driven, allowing participants to surface real‑world challenges, emerging technologies, and promising practices. Two overarching themes emerged. First, organizational stakeholders—particularly national supercomputing centers—occupy a unique position to champion, enable, and sustain reproducible science. By standardizing system software stacks, managing library versions, and providing automated logging of build‑time and run‑time metadata, these centers can create the infrastructural backbone that makes replication feasible. Moreover, they can institutionalize reproducibility policies, embed training modules into user onboarding, and act as custodians of best‑practice guidelines.
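To make the build-time side of this concrete, the sketch below shows one way a center-provided wrapper might record compiler and module information alongside a build. It is only a sketch under stated assumptions: the `LOADEDMODULES` variable presumes an environment-modules or Lmod installation, and the function and file names are illustrative rather than anything the report specifies.

```python
# Hypothetical build-time capture step a center might wrap around `make`
# or `cmake`. Assumes an environment-modules/Lmod system that exports
# LOADEDMODULES; all names and paths are illustrative.
import json
import os
import subprocess
from datetime import datetime, timezone

def capture_build_metadata(compiler="cc", out_path="build-metadata.json"):
    """Record the compiler version and loaded modules next to a build."""
    try:
        out = subprocess.run(
            [compiler, "--version"], capture_output=True, text=True, check=True
        ).stdout
        version = (out.splitlines() or ["unknown"])[0]
    except (OSError, subprocess.CalledProcessError):
        version = "unknown"

    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "compiler": compiler,
        "compiler_version": version,
        # LOADEDMODULES is colon-separated under environment modules / Lmod.
        "loaded_modules": os.environ.get("LOADEDMODULES", "").split(":"),
        "cwd": os.getcwd(),
    }
    with open(out_path, "w") as f:
        json.dump(record, f, indent=2)

if __name__ == "__main__":
    capture_build_metadata()
```

A center could invoke such a step automatically from its standard build tooling, so that every compiled artifact carries a record of the environment that produced it without any action by the user.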
Second, individual researchers must adopt a mindset that every experiment is conducted as if it will be replicated by someone else. This requires disciplined version control, explicit declaration of software dependencies, thorough documentation of input data, parameters, and environment variables, and the use of containerization (Docker, Singularity) and workflow management tools (Nextflow, Snakemake, Pegasus) to capture the full execution context.
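A minimal sketch may help illustrate this "someone will replicate it" mindset in practice: before a run, record the code version, interpreter version, parameters, and input checksums in a manifest. The function names, file names, and parameter set below are hypothetical; the report does not prescribe a specific format.

```python
# Hypothetical pre-run manifest: code version, dependencies, parameters,
# and input checksums, so a replicator can verify they have the same setup.
import hashlib
import json
import subprocess
import sys
from datetime import datetime, timezone

def sha256(path):
    """Checksum an input file so a replicator can verify identical data."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(inputs, params, out_path="manifest.json"):
    """Write a JSON manifest capturing the execution context of a run."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "not-a-git-checkout"

    manifest = {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "git_commit": commit,
        "parameters": params,
        "inputs": {p: sha256(p) for p in inputs},
    }
    with open(out_path, "w") as f:
        json.dump(manifest, f, indent=2)

if __name__ == "__main__":
    # Illustrative parameters only; a real experiment would record its own.
    write_manifest(inputs=sys.argv[1:], params={"seed": 42, "tolerance": 1e-6})
```

Containers and workflow managers automate much of this capture, but even a hand-rolled manifest like the above moves an experiment a long way toward replicability.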
From the collective input, four priority areas for XSEDE were identified. (1) Documentation and training: develop reproducibility‑focused tutorials, case studies, and curricula, and integrate them into the XSEDE portal and user‑support channels. (2) System‑level tooling: create mechanisms that automatically capture build logs, runtime environment details, resource usage, and job‑level provenance, storing this information in a standardized, machine‑readable format (e.g., JSON or YAML) linked to each job’s identifier. (3) Collaboration models: establish a reproducibility checklist for joint projects involving XSEDE staff and external researchers, conduct periodic peer reviews of reproducibility artifacts, and disseminate successful collaboration patterns as community templates. (4) Gateways and portals: embed provenance capture and snapshot functionality into scientific gateways so that a complete workflow snapshot—including code, container images, input datasets, and parameter files—can be exported, archived, or shared with collaborators.
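Item (2) is the most concrete of these, and a hedged sketch can make it tangible: a small job epilog that writes a machine-readable provenance record keyed to the scheduler's job identifier. The `SLURM_JOB_ID` variable and the field names below are assumptions for illustration, not an XSEDE specification; the `resource` module is Unix-only, which is a safe assumption on HPC systems.

```python
# Hypothetical job epilog: write a JSON provenance record linked to the
# scheduler's job ID. Assumes a Slurm-style SLURM_JOB_ID environment
# variable; the record schema is illustrative only.
import json
import os
import platform
import resource  # Unix-only, as expected on HPC systems
from datetime import datetime, timezone

def write_job_provenance(out_dir="."):
    """Capture job-level resource usage and environment as JSON."""
    job_id = os.environ.get("SLURM_JOB_ID", "interactive")
    # RUSAGE_CHILDREN: usage of the job steps this epilog's shell spawned.
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    record = {
        "job_id": job_id,
        "finished_at": datetime.now(timezone.utc).isoformat(),
        "hostname": platform.node(),
        "user_cpu_seconds": usage.ru_utime,
        "system_cpu_seconds": usage.ru_stime,
        "max_rss_kb": usage.ru_maxrss,  # kilobytes on Linux
        "environment": {
            k: v for k, v in os.environ.items() if k.startswith("SLURM_")
        },
    }
    path = os.path.join(out_dir, f"provenance-{job_id}.json")
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return path

if __name__ == "__main__":
    print(write_job_provenance())
```

Storing such records in a standardized, machine-readable format is what makes the downstream goals feasible: gateways can bundle them into exportable workflow snapshots, and collaborations can audit them against a reproducibility checklist.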
An additional, more aspirational suggestion was the creation of an annual award for excellence in reproducible research. Such an award would provide a tangible incentive, raise the profile of reproducibility within the high‑performance computing community, and encourage the development of innovative tools and practices.
Overall, the report offers a concrete roadmap: by aligning institutional policies, providing automated provenance infrastructure, fostering a culture of meticulous documentation, and rewarding exemplary reproducible work, XSEDE can become a catalyst for reproducibility across the nation’s most demanding computational science projects. The identified focus areas are actionable, measurable, and directly address the pain points raised by workshop participants, positioning XSEDE to lead the next generation of transparent, repeatable, and trustworthy large‑scale scientific computing.