Capturing the “Whole Tale” of Computational
Research: Reproducibility in Computing
Environments
Bertram Ludäscher, School of Information Sciences, University of Illinois at Urbana-Champaign;
Kyle Chard, University of Chicago; Niall Gaffney, Texas Advanced Computing Center,
University of Texas at Austin; Matthew B. Jones, University of California Santa Barbara;
Jaroslaw Nabrzyski, University of Notre Dame; Victoria Stodden,* School of Information
Sciences, University of Illinois at Urbana-Champaign; and Matthew Turk, School of Information
Sciences, University of Illinois at Urbana-Champaign
*Corresponding author address: School of Information Sciences, University of Illinois at Urbana-
Champaign, Champaign, IL 61820, USA; email: vcs@illinois.edu
Abstract: We present an overview of the recently
funded “Merging Science and Cyberinfrastructure
Pathways: The Whole Tale” project (NSF award
#1541450). Our approach has two nested goals: 1)
deliver an environment that enables researchers to
create a complete narrative of the research
process including exposure of the data-to-
publication lifecycle, and 2) systematically and
persistently link research publications to their
associated digital scholarly objects such as the
data, code, and workflows. To enable this, Whole
Tale will create an environment where researchers
can collaborate on data, workspaces, and
workflows and then publish them for future
adoption or modification. Published data and
applications will either be consumed directly by
users of the Whole Tale environment or be
integrated into existing or future domain Science
Gateways.
- Introduction
Computational resources and scientific services are
now nearly ubiquitous in scientific investigations;
however, the applications used to discover and
analyze data are extremely fragmented and can be
intractable, creating a large and meaningful gap
between the research processes and the ability to
verify the findings [1]. There is frequently no way
to trace findings in publications back through the
originating computations and data. The Whole Tale
project aims to remedy this gap in two ways: 1)
integrate existing cyberinfrastructure that supports
the entire computational process underlying
discovery, thus making it simpler for researchers to
conduct computational research; and 2) capture and
deliver relevant workflow and processing provenance
that will be discoverable and accessible from the
associated publication. Whole Tale envisions a
collaborative environment where data providers,
application developers, and data consumers work
together to create end-to-end workflows that convert
data to information using reproducible computational
methods.
- The Whole Tale Research Environment
Whole Tale will enable a research environment
that seamlessly supports computational tools for
tackling pressing research problems in a way that
is scalable and reproducible but that still supports
software familiar to current researchers. Our aim is
to support scientific investigation at all
computational scales, from HPC environments to
single-user endeavors (the “long tail” of science).
We will provide a research environment that
captures and, at the time of publication, exposes
salient details of the research via access to
persistent versions of the data and code used,
workflow provenance, data lineage, parameter
settings, and output data. Our approach differs
from, and is complementary to, that provided by some
science gateways in that we rely on commodity tools
rather than bespoke, domain-specific instruments.
The Whole Tale environment will link to existing
cyberinfrastructure and will be instrumented with
workflow and reproducibility tools that aid in
capturing and storing the key scripts, function
calls, parameter settings, and machine state
information that are essential for reproducing the
results.
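As a rough illustration of the kind of information such instrumentation might record, the following minimal Python sketch builds a run record containing a script fingerprint, parameter settings, and basic machine state. It is not the Whole Tale implementation: the function names (`sha256_of_file`, `capture_run_record`, `save_run_record`) and the record layout are assumptions made for this example only.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def sha256_of_file(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def capture_run_record(script_path, parameters):
    """Build a JSON-serializable record of one analysis run.

    The schema is hypothetical: it simply groups the script identity,
    the parameter settings, and a small amount of machine state.
    """
    return {
        "script": {
            "path": script_path,
            "sha256": sha256_of_file(script_path),
        },
        "parameters": dict(parameters),
        "machine": {
            "platform": platform.platform(),
            "python_version": sys.version.split()[0],
        },
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }


def save_run_record(record, out_path):
    """Write the record as sorted, indented JSON next to the results."""
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)
```

Hashing the script rather than storing only its path means a later reader can check that the code they obtained is byte-for-byte the code that produced the results, which is the property a persistent link from a publication needs.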
The cyberinfrastructure will be exposed to
users through well-known applications such as
Jupyter Notebooks that support commonly used
data analysis languages including R and Python.
Storage will be exposed to users through several
interfaces, including Globus, a web-based
filesystem interface, FUSE modules for filesystem-
level access to local and remote data repositories,
the DataONE federation of data repositories, and
the open source cloud storage environment
Nextcloud. By building data repository access into
modules that present file-like interfaces, we further
lower the barrier to access for remote data stores.
The system will also incorporate Globus Auth, a
unified identity management system that will
allow users to leverage their own campus identity,
ORCID identifier, or other existing identities.
Whole Tale will also enable the deployment of
Dockerfile-based environments to support extensible
and customizable research workflows.
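The idea of presenting repository access through file-like interfaces can be sketched in Python as follows. This is only an illustration under stated assumptions: `RepositoryBackend`, `InMemoryBackend`, `open_remote`, and the identifier `example-dataset/table.csv` are hypothetical names invented here, and the in-memory backend stands in for a real remote repository so that the example runs offline.

```python
import csv
import io
from abc import ABC, abstractmethod


class RepositoryBackend(ABC):
    """Hypothetical adapter that fetches raw bytes for an identifier."""

    @abstractmethod
    def fetch(self, identifier: str) -> bytes:
        ...


class InMemoryBackend(RepositoryBackend):
    """Stand-in for a remote repository (assumption for this sketch)."""

    def __init__(self, objects):
        self._objects = dict(objects)

    def fetch(self, identifier):
        try:
            return self._objects[identifier]
        except KeyError:
            raise FileNotFoundError(identifier) from None


def open_remote(backend, identifier, encoding="utf-8"):
    """Return a read-only text file object over a repository object."""
    raw = io.BytesIO(backend.fetch(identifier))
    # newline="" leaves line endings untouched, as the csv module expects.
    return io.TextIOWrapper(raw, encoding=encoding, newline="")


if __name__ == "__main__":
    backend = InMemoryBackend(
        {"example-dataset/table.csv": b"site,temp\nA,12.5\nB,14.1\n"}
    )
    # Analysis code uses the same idioms it would use for a local file.
    with open_remote(backend, "example-dataset/table.csv") as handle:
        for row in csv.reader(handle):
            print(row)
```

Because `open_remote` returns an ordinary file object, existing analysis code written against local files needs no repository-specific logic, which is the sense in which a file-like interface lowers the barrier to remote data.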