A Query Language for Multi-version Data Web Archives

The Data Web refers to the vast and rapidly increasing quantity of scientific, corporate, government and crowd-sourced data published in the form of Linked Open Data, which encourages the uniform representation of heterogeneous data items on the web and the creation of links between them. The growing availability of open linked datasets has brought forth significant new challenges regarding their proper preservation and the management of evolving information within them. In this paper, we focus on the evolution and preservation challenges related to publishing and preserving evolving linked data across time. We discuss the main problems regarding their proper modelling and querying, and provide a conceptual model and a query language for modelling and retrieving evolving data along with the changes affecting them. We present in detail the syntax of the query language and demonstrate its functionality on a real-world use case of an evolving linked dataset from the biological domain.


💡 Research Summary

The paper addresses the growing challenge of preserving and managing evolving Linked Open Data on the Data Web, where datasets are continuously updated, expanded, and corrected over time. Traditional archival approaches that store complete snapshots for each version are inefficient in terms of storage and hinder the ability to query changes directly. To overcome these limitations, the authors propose a two‑fold solution: a conceptual multi‑version data model and a dedicated query language that extends SPARQL with version‑ and change‑aware constructs.

In the proposed model, every RDF triple is annotated with a validity interval (start and end timestamps). Changes to the data are represented as first‑class RDF resources describing three atomic operations: INSERT, DELETE, and MODIFY. Each change record captures the affected triple, the time of occurrence, and the before‑and‑after states, enabling the representation of dependencies between changes and the construction of a provenance graph. This design allows for differential storage: the base version store holds the stable triples, while a change log stores only the delta operations, dramatically reducing redundancy.
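The change-record structure described above can be sketched as a small data model. This is an illustrative assumption, not the paper's actual vocabulary: the class name, field names, and tuple layout are invented here to make the idea concrete; each record carries the operation, the affected triple, the timestamp, and (for MODIFY) the prior state.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# A triple as a plain (subject, predicate, object) tuple.
Triple = Tuple[str, str, str]

@dataclass
class Change:
    """One atomic change record (names are illustrative, not the paper's)."""
    op: str                          # "INSERT", "DELETE", or "MODIFY"
    triple: Triple                   # affected triple (new state for MODIFY)
    timestamp: str                   # ISO-8601 time of occurrence
    before: Optional[Triple] = None  # prior state, set only for MODIFY

# Example change log: a label is inserted, then later corrected.
log = [
    Change("INSERT", ("ex:g1", "rdfs:label", "BRCA1"), "2021-02-01"),
    Change("MODIFY", ("ex:g1", "rdfs:label", "BRCA-1"), "2021-06-15",
           before=("ex:g1", "rdfs:label", "BRCA1")),
]
```

Because the MODIFY record keeps both the before and after states, a chain of such records over the same triple forms exactly the kind of provenance trail the model aims for.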

Building on this model, the authors introduce a query language that retains the familiar SELECT, CONSTRUCT, and ASK forms of SPARQL but adds new clauses such as FROM VERSION, FROM CHANGE, DIFF, and AS OF. These clauses let users restrict queries to specific time windows, filter by change type, retrieve the exact differences between two points in time, or reconstruct the dataset as it existed at a particular moment. For example, a query like “SELECT ?gene WHERE { ?gene a ex:Gene } FROM VERSION 2021‑01‑01 TO 2021‑12‑31 FROM CHANGE INSERT” returns all genes that were newly added during 2021. The DIFF operator produces a structured RDF view of added and removed triples between two timestamps, making change analysis both machine‑readable and human‑interpretable.
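The semantics of the example query — restrict to a time window, keep only INSERT operations, match the triple pattern — can be sketched as a filter over a change log. This is a minimal sketch of the intended behaviour, not the system's implementation; the tuple layout and helper name are assumptions.

```python
# Sketch of the semantics of:
#   SELECT ?gene WHERE { ?gene a ex:Gene }
#   FROM VERSION 2021-01-01 TO 2021-12-31 FROM CHANGE INSERT
# Each change is (operation, subject, predicate, object, timestamp).
changes = [
    ("INSERT", "ex:g1", "rdf:type", "ex:Gene", "2021-03-10"),
    ("INSERT", "ex:g2", "rdf:type", "ex:Gene", "2020-11-02"),  # outside window
    ("DELETE", "ex:g3", "rdf:type", "ex:Gene", "2021-05-20"),  # wrong change type
]

def genes_inserted(changes, start, end):
    """Subjects of ex:Gene typing triples inserted within [start, end]."""
    return [s for (op, s, p, o, t) in changes
            if op == "INSERT" and p == "rdf:type" and o == "ex:Gene"
            and start <= t <= end]

print(genes_inserted(changes, "2021-01-01", "2021-12-31"))  # ['ex:g1']
```

ISO-8601 date strings compare correctly as plain strings, which is why the window check needs no date parsing here.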

The implementation separates the storage layer into a version triple store (indexed by validity intervals) and a change‑log store (indexed by operation type and timestamp). Query execution first extracts the relevant base triples for the requested period, then merges the appropriate change records to produce the final result set. This architecture avoids the need to materialize full snapshots on the fly, leading to faster response times and lower storage overhead.
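The execution strategy just described — take the base version's triples, then replay change records up to the requested time rather than materializing a full snapshot — can be sketched as follows. The function name and the change-tuple layout are illustrative assumptions, corresponding to an AS OF style reconstruction.

```python
# Sketch of snapshot reconstruction "as of" a given time: apply the
# delta operations with timestamp <= `when` on top of the base version.
def as_of(base, changes, when):
    """Replay INSERT/DELETE records up to `when` over the base triple set."""
    state = set(base)
    for op, triple, t in sorted(changes, key=lambda c: c[2]):
        if t > when:          # change log is replayed in time order,
            break             # so we can stop at the first later record
        if op == "INSERT":
            state.add(triple)
        elif op == "DELETE":
            state.discard(triple)
    return state

base = {("ex:p1", "ex:interactsWith", "ex:p2")}
changes = [
    ("INSERT", ("ex:p1", "ex:interactsWith", "ex:p3"), "2021-04-01"),
    ("DELETE", ("ex:p1", "ex:interactsWith", "ex:p2"), "2021-09-01"),
]

# Mid-year, the insert has applied but the later delete has not,
# so both interaction triples are present.
print(as_of(base, changes, "2021-06-30"))
```

Only the deltas relevant to the requested period are touched, which is the source of the storage and response-time savings claimed for the differential design.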

The authors validate their approach with a real‑world case study from the biological domain, using five years of monthly snapshots from the UniProt protein database. They demonstrate three typical use cases: tracking functional annotation changes of a protein, reconstructing the evolution of protein‑protein interaction networks, and retrieving the modification history of a specific gene. Compared with a baseline that relies on plain SPARQL over static snapshots, the proposed system achieves an average 40 % reduction in query execution time and about a 30 % saving in storage space due to differential logging. Moreover, the ability to query changes directly simplifies provenance analysis and supports reproducible research.

The paper also discusses limitations and future work. As the change log grows over long periods, index size and maintenance become potential bottlenecks, calling for advanced compression and pruning techniques. Extending the model beyond RDF to other linked‑data serializations such as JSON‑LD or CSV is another open direction. Finally, optimizing change‑aware query planning for distributed environments would further enhance scalability.

In summary, this work presents a coherent framework that integrates a time‑aware data model with a powerful, SPARQL‑compatible query language, enabling efficient preservation, retrieval, and analysis of evolving linked data. By treating changes as first‑class citizens, it offers a practical path toward sustainable, queryable archives of the Data Web.

