A Framework for Managing Evolving Information Resources on the Data Web

The web of data has brought forth the need to preserve and sustain evolving information within linked datasets; however, a basic requirement of data preservation is the maintenance of the datasets' structural characteristics as well. As open data often follow different and heterogeneous data models and schemata from one source to another, these mismatches must be reconciled and common denominations of interpretation provided on multiple levels, in order to preserve and manage the evolution of the generated resources. In this paper, we present a linked data approach for the preservation and archiving of open heterogeneous datasets that evolve through time, at both the structural and the semantic layer. We first propose a set of requirements for modelling evolving linked datasets. We then conceptualize a modelling framework for evolving entities and place these in a 2×2 model space that consists of the semantic and the temporal dimensions.


💡 Research Summary

The paper addresses the growing challenge of preserving and managing linked open data that evolves over time, both in its structural schema and its semantic content. Traditional data preservation approaches focus on static snapshots, which are insufficient for the dynamic nature of the Data Web where datasets are frequently updated, re‑modeled, and re‑aligned across heterogeneous sources. To tackle this, the authors first articulate six core requirements for any system that aims to handle evolving linked datasets: (1) explicit temporal versioning of each resource, (2) structural consistency across schema changes, (3) semantic continuity despite concept redefinitions, (4) integration of heterogeneous sources with differing ontologies and namespaces, (5) rich metadata that captures provenance, trust, and access policies, and (6) scalability and automation to cope with large‑scale, possibly streaming, data.

Building on these requirements, the authors introduce a conceptual 2 × 2 model space that maps the dimensions of time (static vs. dynamic) against semantics (static vs. dynamic). The four quadrants represent distinct evolution patterns: (i) static semantics & static time (traditional immutable datasets), (ii) static semantics & dynamic time (datasets where only instance data changes over time), (iii) dynamic semantics & static time (schema evolution without instance change), and (iv) dynamic semantics & dynamic time (both schema and data evolve simultaneously). This space serves as a guide for selecting appropriate modeling strategies and for visualizing complex evolution scenarios.
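The 2×2 model space can be sketched as a simple classifier. This is an illustrative Python sketch, not code from the paper; the enum names and the `classify` helper are hypothetical, chosen only to mirror the four quadrants described above.

```python
from enum import Enum


class EvolutionQuadrant(Enum):
    """The four quadrants of the 2x2 model space (semantics x time)."""
    STATIC_SEMANTICS_STATIC_TIME = "traditional immutable dataset"
    STATIC_SEMANTICS_DYNAMIC_TIME = "only instance data changes over time"
    DYNAMIC_SEMANTICS_STATIC_TIME = "schema evolves, instances fixed"
    DYNAMIC_SEMANTICS_DYNAMIC_TIME = "schema and instances evolve together"


def classify(dynamic_semantics: bool, dynamic_time: bool) -> EvolutionQuadrant:
    """Map a dataset's evolution profile onto its quadrant."""
    if dynamic_semantics and dynamic_time:
        return EvolutionQuadrant.DYNAMIC_SEMANTICS_DYNAMIC_TIME
    if dynamic_semantics:
        return EvolutionQuadrant.DYNAMIC_SEMANTICS_STATIC_TIME
    if dynamic_time:
        return EvolutionQuadrant.STATIC_SEMANTICS_DYNAMIC_TIME
    return EvolutionQuadrant.STATIC_SEMANTICS_STATIC_TIME
```

Framing the quadrants as code makes the point that the two dimensions are independent axes: a dataset's position in the space follows mechanically from two yes/no questions about its evolution profile.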

At the heart of the proposed framework is the notion of an Evolving Entity. Each entity is identified by a URI and carries two intertwined histories: a version history that records timestamps, change types (addition, deletion, modification), and relationships to other versions; and a semantic meta‑model that links each version to a specific ontology version. By separating these concerns, the framework enables point‑in‑time and semantic queries that go beyond standard SPARQL. For example, a user can query "What properties did Entity X have in June 2020?" or "Which ontology version underlies the data released in Q1 2021?" using named‑graph queries or temporal filters.
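The separation of version history and ontology linkage can be illustrated with a minimal in-memory sketch. This is an assumption-laden toy, not the paper's data model: the `Version` fields and the `at()` point-in-time lookup are hypothetical stand-ins for what, in the actual framework, would be RDF named graphs queried with temporal filters.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Version:
    valid_from: date        # when this version became current
    change_type: str        # "addition" | "deletion" | "modification"
    ontology_version: str   # IRI of the ontology release this version conforms to
    properties: dict        # property -> value snapshot for this version


@dataclass
class EvolvingEntity:
    uri: str
    versions: list = field(default_factory=list)  # kept sorted by valid_from

    def add_version(self, v: Version) -> None:
        self.versions.append(v)
        self.versions.sort(key=lambda x: x.valid_from)

    def at(self, when: date) -> Optional[Version]:
        """Point-in-time lookup: the latest version valid at `when`."""
        current = None
        for v in self.versions:
            if v.valid_from <= when:
                current = v
        return current
```

With such a structure, "What properties did Entity X have in June 2020?" reduces to `entity.at(date(2020, 6, 1)).properties`, and the ontology underlying any release is read off the same version record.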

Implementation-wise, the authors layer a version graph on top of a conventional RDF triple store. They adopt PROV‑O for provenance and DCAT for dataset description, ensuring that each version is traceable to its source and governed by explicit distribution policies. Ontology alignment is handled through a combination of SKOS mapping and OWL alignment, stored as separate meta‑graphs. To mitigate the growth of version graphs, the paper proposes incremental indexing and graph compression techniques.
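As a rough illustration of how a version's provenance might be recorded in a separate meta-graph, the sketch below emits named-graph quads using the real PROV-O and DCAT namespace IRIs. Everything else (the `EX` namespace, the `provenance_quads` helper, the quad-tuple representation) is a hypothetical simplification, not the paper's implementation.

```python
from datetime import datetime

# W3C vocabularies used by the framework; EX is a hypothetical example namespace.
PROV = "http://www.w3.org/ns/prov#"
DCAT = "http://www.w3.org/ns/dcat#"
EX = "http://example.org/"


def provenance_quads(version_iri: str, source_iri: str, generated: datetime):
    """Emit (graph, subject, predicate, object) quads recording that a
    dataset version was derived from a source at a given time, stored in
    a dedicated metadata graph separate from the instance data."""
    meta_graph = EX + "meta"
    return [
        (meta_graph, version_iri, "rdf:type", DCAT + "Dataset"),
        (meta_graph, version_iri, PROV + "wasDerivedFrom", source_iri),
        (meta_graph, version_iri, PROV + "generatedAtTime", generated.isoformat()),
    ]
```

Keeping provenance in its own named graph is what lets the version graph sit on top of a conventional triple store: instance data stays untouched, while lineage checks query only the meta-graph.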

The framework is evaluated on two real‑world open data portals: a governmental statistical dataset and an environmental monitoring dataset. In the dynamic‑time quadrants (II and IV), query latency increased by roughly 30 % compared to a baseline static system, but the ability to retrieve semantically consistent results across versions was preserved. Metadata enrichment with PROV‑O enabled reliable provenance checks, and policy‑driven access control demonstrated compliance with legal and ethical requirements.

In conclusion, the authors present a comprehensive, ontology‑aware approach to managing evolving linked data that simultaneously respects temporal and semantic dimensions. While the prototype shows promise, the paper acknowledges limitations such as performance overhead for massive version graphs and the nascent state of automated semantic alignment. Future work is outlined to include (a) advanced indexing for large‑scale graphs, (b) machine‑learning‑driven ontology matching, (c) support for non‑RDF data models (e.g., JSON‑LD, CSV), and (d) tighter integration with legal‑compliance frameworks. This research thus lays a solid foundation for robust, long‑term preservation of the ever‑changing Data Web.
