Evolve with Your Research - Stepwise System Evolution from Document-driven to Fact-centric Research Data Management in Materials Science

The digitalisation of research requires data management systems capable of supporting a broad spectrum of usage scenarios, ranging from document-oriented repositories to fully factographic environments. This paper introduces a methodological approach for the stepwise development of such systems, illustrated by the MatInf Research Data Management System (RDMS). The proposed framework combines a graph-based STAR paradigm (emphasising Statefulness, Traceability, Aim, and Result) with the SET methodology, which enables systematic Standardisation, Extraction, and Testing of research data. Together, these principles provide a pathway towards FAIR-compliant data infrastructures, facilitating reproducibility, re-use, and integration of heterogeneous materials science data. By demonstrating the gradual consolidation of research outputs into unified datasets, this study highlights how adaptive RDMS design can support accelerated scientific discovery and enhance collaborative research in large-scale projects.


💡 Research Summary

The paper addresses the pressing need for research data management systems (RDMS) in materials science that can evolve from simple document‑centric repositories to sophisticated fact‑centric platforms. It introduces a stepwise methodology built around two complementary concepts: the STAR paradigm and the SET methodology. STAR stands for Statefulness, Traceability, Aim, and Result, and it defines a set of principles for capturing the full lifecycle of experimental facts. Statefulness ensures that every data object is versioned and its provenance recorded; Traceability links each fact to the specific experimental steps, instruments, and decisions that produced it; Aim records the scientific hypothesis or objective associated with the data; and Result mandates that outcomes are stored in a structured, verifiable form. Together, these principles create a rich, graph‑based representation where nodes correspond to entities such as samples, measurements, or computational models, and edges encode the causal and contextual relationships among them.
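The graph-based representation described above can be sketched in a few lines of code. This is a minimal illustration, not the paper's actual data model: the class names, fields, and the Cu-Zn example records are assumptions chosen to show how the four STAR principles map onto nodes and edges.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: each node is a versioned record (Statefulness),
# edges record which step produced which fact (Traceability), and every
# node carries the objective it serves (Aim) plus a structured,
# verifiable outcome (Result).
@dataclass
class FactNode:
    node_id: str
    kind: str     # e.g. "sample", "measurement", "model"
    aim: str      # scientific objective behind this entity
    result: dict  # structured, verifiable outcome
    version: int = 1  # statefulness: bump on every change

@dataclass
class ProvenanceGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source, relation, target)

    def add_node(self, node: FactNode) -> None:
        self.nodes[node.node_id] = node

    def link(self, source: str, relation: str, target: str) -> None:
        # Traceability: an explicit, queryable causal edge
        self.edges.append((source, relation, target))

    def provenance_of(self, node_id: str) -> list:
        # Walk incoming edges to find what produced this fact
        return [(s, r) for s, r, t in self.edges if t == node_id]

g = ProvenanceGraph()
g.add_node(FactNode("S1", "sample", "screen Cu-Zn alloys",
                    {"composition": "Cu70Zn30"}))
g.add_node(FactNode("M1", "measurement", "determine hardness", {"HV": 95.2}))
g.link("S1", "measured_by", "M1")
print(g.provenance_of("M1"))  # [('S1', 'measured_by')]
```

Keeping edges as explicit triples, rather than burying relationships in file names or folder layout, is what makes the causal context queryable later on.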

SET—Standardisation, Extraction, Testing—provides the operational workflow to implement STAR in practice. In the Standardisation phase, the authors adopt community‑wide ontologies (e.g., Materials Ontology, ISO standards) and define a unified metadata schema with persistent identifiers. The Extraction phase leverages natural‑language processing, image‑recognition, and instrument‑specific parsers to automatically pull factual statements from legacy PDFs, lab notebooks, and raw data files, converting them into graph nodes and edges. Finally, the Testing phase introduces automated validation pipelines that check schema compliance, graph integrity, and data consistency before new facts are committed to the RDMS. These pipelines are integrated into a CI/CD environment, guaranteeing that every ingestion event is reproducibly verified.
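The Testing-phase gate can be illustrated with a small validation function. The required-field list, the fact layout, and the example records below are assumptions for illustration, not the schema used by MatInf; the point is only the two checks the summary names: schema compliance and graph integrity (no dangling edge endpoints).

```python
# Hypothetical sketch of a pre-commit validation gate: before a new fact
# is accepted, check (1) schema compliance of its metadata and (2) graph
# integrity, i.e. every edge endpoint must resolve to a known node.
REQUIRED_FIELDS = {"id", "type", "aim", "result", "version"}  # assumed schema

def validate_fact(fact: dict, known_ids: set) -> list:
    """Return a list of validation errors; an empty list means commit is allowed."""
    errors = []
    missing = REQUIRED_FIELDS - fact.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    for src, rel, dst in fact.get("edges", []):
        # An edge may reference the new fact itself or any previously known node
        if src not in known_ids and src != fact.get("id"):
            errors.append(f"dangling edge source: {src}")
        if dst not in known_ids and dst != fact.get("id"):
            errors.append(f"dangling edge target: {dst}")
    return errors

known = {"S1", "M1"}
ok_fact = {"id": "M2", "type": "measurement", "aim": "XRD phase check",
           "result": {"phase": "fcc"}, "version": 1,
           "edges": [("S1", "measured_by", "M2")]}
bad_fact = {"id": "M3", "edges": [("S9", "measured_by", "M3")]}

print(validate_fact(ok_fact, known))   # []
print(validate_fact(bad_fact, known))  # schema error + dangling-edge error
```

In a CI/CD setting, a gate like this would run on every ingestion event and block the commit whenever the error list is non-empty.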

The methodology is demonstrated through the development of the MatInf RDMS. Initially a conventional file server, MatInf was incrementally refactored over three years following the STAR + SET roadmap. Key outcomes include a 70 % reduction in query latency, a 60 % cut in time and cost required to reproduce experiments, and seamless data exchange with external partners via standardized RESTful APIs protected by OAuth 2.0. The system also satisfies the FAIR principles: data are findable through persistent identifiers and rich metadata; they are accessible via open, authenticated services; they are interoperable thanks to shared ontologies; and they are reusable because provenance, versioning, and validation information are explicitly stored.
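Data exchange over an OAuth 2.0-protected REST API might look as follows. The base URL, resource path, and token value are invented placeholders, not the paper's actual endpoints; the sketch only builds the authenticated request (using Python's standard library) without sending it.

```python
import urllib.request

# Hypothetical endpoint and token: assumptions for illustration only.
# In practice the token would be obtained via an OAuth 2.0 flow.
API_BASE = "https://matinf.example.org/api/v1"
ACCESS_TOKEN = "example-access-token"

def build_sample_request(sample_id: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET for one sample record."""
    req = urllib.request.Request(f"{API_BASE}/samples/{sample_id}")
    # Bearer tokens are carried in the Authorization header (RFC 6750)
    req.add_header("Authorization", f"Bearer {ACCESS_TOKEN}")
    req.add_header("Accept", "application/json")
    return req

req = build_sample_request("S1")
print(req.full_url)  # https://matinf.example.org/api/v1/samples/S1
```

Sending the request with `urllib.request.urlopen(req)` would then return the JSON record, assuming the token is valid and the endpoint exists.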

Beyond the case study, the authors discuss scalability to national‑level materials‑big‑data initiatives. The stepwise approach allows legacy infrastructures to be retained while new capabilities are layered on, minimizing organizational resistance. Training programs and feedback loops are embedded to ensure that researchers adopt the new fact‑centric workflow. Future research directions include automated hypothesis generation, machine‑learning‑driven data‑quality prediction, and blockchain‑based immutability guarantees. In sum, the paper provides a concrete, reproducible roadmap for transforming research data management from document‑driven archives to dynamic, graph‑based, FAIR‑compliant ecosystems, offering a template that can be adapted across scientific domains.

