Using Provenance to support Good Laboratory Practice in Grid Environments
Conducting experiments and documenting results is the daily business of scientists. Good, traceable documentation enables other scientists to confirm procedures and results, which increases credibility. Documentation and scientific conduct are regulated under the term “good laboratory practice.” Laboratory notebooks are used to record each step in conducting an experiment and processing data. Originally, these notebooks were paper based. With computerised research systems, acquired data became more elaborate, increasing the need for electronic notebooks with data storage, computational features, and reliable electronic documentation. As a new approach, a scientific data management system (DataFinder) is enhanced with features for traceable documentation. Provenance recording is used to meet the traceability requirements, and the recorded information can later be queried for further analysis. DataFinder has further important features for scientific documentation: it employs a heterogeneous and distributed data storage concept, which enables access to different types of data storage systems (e.g. Grid data infrastructure, file servers). In this chapter we describe a number of building blocks that are available or close to finished development. These components are intended for assembling an electronic laboratory notebook for use in Grid environments, while retaining maximal flexibility in usage scenarios and maximal compatibility with each other. With such a system, provenance can be used to trace the scientific workflow of preparation, execution, evaluation, interpretation, and archiving of research data. The reliability of research results increases and the research process remains transparent to remote research partners.
💡 Research Summary
The paper addresses the growing need for electronic laboratory notebooks that satisfy Good Laboratory Practice (GLP) in modern, data‑intensive scientific research. Traditional paper notebooks cannot keep pace with the volume, complexity, and distributed nature of contemporary experiments, especially when collaborations span multiple institutions and geographic locations. To meet this challenge, the authors extend the open‑source DataFinder system—originally developed by the German Aerospace Centre (DLR)—with comprehensive provenance (origin and lineage) recording capabilities, thereby creating a GLP‑compliant electronic notebook suitable for Grid environments.
DataFinder is a Python‑based client‑server application that manages both data and associated metadata. Its key strength lies in supporting heterogeneous, distributed storage back‑ends (WebDAV, FTP, GridFTP, Subversion, Amazon S3, etc.) through a modular plug‑in architecture. Metadata can be stored centrally, while the actual data may reside on any of the supported back‑ends, allowing seamless access to both digital files and physical artifacts (e.g., samples, tapes) within a single unified interface.
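The plug-in idea described above can be sketched as a registry that maps a URI scheme to a storage back-end. This is a minimal illustration, not DataFinder's actual API: the names `StorageBackend`, `InMemoryBackend`, and `BackendRegistry` are hypothetical, and an in-memory dictionary stands in for real WebDAV or GridFTP connections.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class StorageBackend(ABC):
    """Hypothetical interface that each storage plug-in would implement."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class InMemoryBackend(StorageBackend):
    """Stand-in for a real back-end such as WebDAV or GridFTP."""

    def __init__(self) -> None:
        self._files = {}

    def read(self, path: str) -> bytes:
        return self._files[path]

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data


class BackendRegistry:
    """Maps a URI scheme to the plug-in that handles it."""

    def __init__(self) -> None:
        self._backends = {}

    def register(self, scheme: str, backend: StorageBackend) -> None:
        self._backends[scheme] = backend

    def resolve(self, uri: str):
        """Return the responsible back-end and the path inside it."""
        parsed = urlparse(uri)
        if parsed.scheme not in self._backends:
            raise ValueError(f"no plug-in registered for scheme {parsed.scheme!r}")
        return self._backends[parsed.scheme], parsed.netloc + parsed.path


registry = BackendRegistry()
registry.register("webdav", InMemoryBackend())
registry.register("gridftp", InMemoryBackend())

# Client code addresses data by URI and never names a back-end directly.
backend, path = registry.resolve("gridftp://site-a/run42/raw.dat")
backend.write(path, b"sensor readings")
```

Because callers only see the `StorageBackend` interface, adding support for another storage system in this sketch means registering one more class, without touching the metadata layer.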
Provenance is modeled using the Open Provenance Model (OPM), which defines three node types, Artifact (data or physical sample), Process (experimental or computational step), and Agent (person, instrument, or software), and directed edges such as “used”, “wasGeneratedBy”, “wasDerivedFrom”, “wasTriggeredBy”, and “wasControlledBy”. By embedding OPM‑based logging into DataFinder’s file operations, every action (opening, copying, importing, etc.) automatically generates provenance entries, constructing a directed graph that captures the full scientific workflow from specimen collection through analysis to publication. Operations such as copying need special handling to avoid unintended forks in the provenance graph; the current prototype flags this as an area for future refinement.
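To make the graph model concrete, the following sketch records one file import as an OPM sub-graph in plain Python. The edge names and their source/target constraints follow the OPM core specification (`used`, `wasGeneratedBy`, `wasControlledBy`, `wasTriggeredBy`, `wasDerivedFrom`); the `ProvenanceGraph` class and the `record_import` hook are hypothetical and do not reflect DataFinder's internal logging code.

```python
from dataclasses import dataclass, field

# OPM relations and the node kinds they connect: (source kind, target kind).
EDGE_RULES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
}


@dataclass
class ProvenanceGraph:
    nodes: dict = field(default_factory=dict)  # node id -> kind
    edges: list = field(default_factory=list)  # (source, relation, target)

    def add_node(self, node_id: str, kind: str) -> None:
        if kind not in {"artifact", "process", "agent"}:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = kind

    def add_edge(self, source: str, relation: str, target: str) -> None:
        # Reject edges whose endpoints have the wrong kind for the relation.
        expected = EDGE_RULES[relation]
        actual = (self.nodes[source], self.nodes[target])
        if actual != expected:
            raise ValueError(f"{relation} must link {expected}, got {actual}")
        self.edges.append((source, relation, target))


def record_import(graph, source_file, imported_file, user):
    """Hypothetical hook: log one file import as an OPM sub-graph."""
    process_id = f"import:{imported_file}"
    graph.add_node(source_file, "artifact")
    graph.add_node(imported_file, "artifact")
    graph.add_node(process_id, "process")
    graph.add_node(user, "agent")
    graph.add_edge(process_id, "used", source_file)
    graph.add_edge(imported_file, "wasGeneratedBy", process_id)
    graph.add_edge(process_id, "wasControlledBy", user)
    graph.add_edge(imported_file, "wasDerivedFrom", source_file)
    return process_id


graph = ProvenanceGraph()
record_import(graph, "microscope/img_001.tif", "project/img_001.tif", "alice")
```

Validating edge endpoints at insertion time is one possible way to keep the graph well-formed; a copy operation would need an analogous hook that decides whether the copy is a new artifact or the same one.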
Provenance data are stored in the prOOst system, a semi‑structured graph database built on Neo4j. The system exposes a REST API for recording provenance events and provides a web front‑end for visualization and querying. Queries are expressed in the Gremlin graph‑traversal language, enabling complex investigations such as: “Which artifacts contributed to a given result?”, “What agents and software versions were involved?”, “What is the current processing stage of a sample?”, and “Has a particular workflow satisfied predefined quality criteria?”. This query capability directly supports GLP audit requirements, as regulators can retrieve precise, time‑stamped evidence of compliance.
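The first of these questions, which artifacts contributed to a given result, amounts to a transitive traversal of the provenance graph. The sketch below answers it over an in-memory edge list in plain Python. It is a stand-in for illustration only: it uses neither Gremlin nor the prOOst REST API, and the sample edges and the function name `contributing_artifacts` are assumptions.

```python
from collections import deque

# (source, relation, target) triples; edges point from effect to cause.
EDGES = [
    ("figure.png", "wasGeneratedBy", "plot"),
    ("plot", "used", "stats.csv"),
    ("stats.csv", "wasGeneratedBy", "analyse"),
    ("analyse", "used", "img_001.tif"),
    ("analyse", "used", "img_002.tif"),
    ("analyse", "wasControlledBy", "alice"),
    ("notes.txt", "wasGeneratedBy", "annotate"),
]


def contributing_artifacts(edges, result):
    """Return every artifact that lies upstream of `result`."""
    causal = {"used", "wasGeneratedBy", "wasDerivedFrom"}
    upstream = {}
    artifacts = set()
    for source, relation, target in edges:
        if relation in causal:
            upstream.setdefault(source, []).append(target)
        # Infer which nodes are artifacts from the relations they appear in.
        if relation in ("wasGeneratedBy", "wasDerivedFrom"):
            artifacts.add(source)
        if relation in ("used", "wasDerivedFrom"):
            artifacts.add(target)

    # Breadth-first walk from the result back towards its causes.
    seen = {result}
    queue = deque([result])
    while queue:
        node = queue.popleft()
        for parent in upstream.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted((seen & artifacts) - {result})


lineage = contributing_artifacts(EDGES, "figure.png")
# lineage == ['img_001.tif', 'img_002.tif', 'stats.csv']
```

The unrelated artifact `notes.txt` is excluded because no causal path connects it to `figure.png`. In a graph database the same question would be expressed as a repeated traversal over the causal edge labels.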
The authors illustrate the approach with a realistic use case involving a team of biologists who collect field specimens, process them in wet labs, and collaborate with remote partners. Each specimen’s physical storage location, collection metadata, and subsequent analytical steps are entered into DataFinder. Provenance records link the specimen to derived data (e.g., images, measurement files) and to higher‑level artifacts such as experimental protocols and manuscripts. Because the provenance graph is centrally stored, any collaborator can query the lineage of an artifact regardless of where the underlying data physically reside, ensuring transparency across institutional boundaries.
The paper also maps GLP requirements (traceability, accountability, data integrity, and auditability) onto specific provenance features. For instance, the “Agent” node satisfies the GLP demand to identify who performed each step, while mandatory metadata fields enforce documentation of when and how data were generated. By integrating provenance capture directly into the data management workflow, the system eliminates the need for manual notebook entries, reducing human error and improving reproducibility.
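One way to picture this mapping is as a set of automated checks over recorded process entries. The sketch below is illustrative only: the `ProcessRecord` fields and the three rules are assumptions chosen to mirror accountability, auditability, and traceability, not the text of any GLP regulation or a feature of the prototype.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ProcessRecord:
    """Hypothetical flattened view of one process node and its neighbours."""
    process_id: str
    agent: Optional[str]             # who controlled the step
    started_at: Optional[datetime]   # when the step ran
    outputs: tuple = ()              # artifacts generated by the step


def audit(records):
    """Return (process_id, finding) pairs for every rule violation."""
    findings = []
    for record in records:
        if not record.agent:
            findings.append((record.process_id, "accountability: no responsible agent"))
        if record.started_at is None:
            findings.append((record.process_id, "auditability: no timestamp"))
        if not record.outputs:
            findings.append((record.process_id, "traceability: no recorded output"))
    return findings


records = [
    ProcessRecord("stain-sample", "alice", datetime(2011, 3, 1, 9, 30), ("slide_17",)),
    ProcessRecord("image-slide", None, None, ("img_001.tif",)),
]
findings = audit(records)
```

Here the first record passes all checks, while the second is reported twice, once for the missing agent and once for the missing timestamp.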
Limitations identified include incomplete handling of provenance during file duplication, potential consistency issues when external clients bypass DataFinder to access storage directly, and the need for a more user‑friendly notebook‑style interface for non‑technical scientists. Future work is proposed to tighten integration with Grid authentication/authorization services, develop automated provenance consistency checks, and design domain‑specific UI templates that hide technical complexity while exposing essential GLP documentation fields.
In summary, the authors present a practical, extensible framework that combines heterogeneous distributed data storage, rich metadata management, and graph‑based provenance recording to deliver a GLP‑compliant electronic laboratory notebook for Grid environments. Their prototype demonstrates that provenance can be leveraged not only for regulatory compliance but also for advanced scientific queries, thereby enhancing both the credibility and efficiency of collaborative research.