CLEF. A Linked Open Data Native System for Crowdsourcing
💡 Research Summary
The paper presents CLEF (Crowdsourcing Linked Entities via web Form), a lightweight, Linked Open Data (LOD)‑native platform designed to support the entire lifecycle of cultural‑heritage data collection, peer review, and publication. The authors begin by outlining the shortcomings of existing content management systems such as Omeka S and Semantic MediaWiki, which either rely on rigid relational schemas, lack proper provenance handling, or provide cumbersome user interfaces that hinder participation from non‑expert contributors. To address these gaps, CLEF adopts a bottom‑up design driven by two real‑world case studies: ARTchives, a multi‑institutional archival‑metadata crowdsourcing project, and musoW, an open catalogue of music‑related web resources.
Key requirements extracted from the case studies include:

1. seamless integration with external vocabularies (e.g., Wikidata, Getty AAT);
2. fine‑grained provenance tracking;
3. conflict resolution and version control for competing contributions;
4. user‑friendly, template‑driven data entry;
5. automatic validation and editorial workflows;
6. long‑term preservation and discoverability;
7. easy deployment for projects with limited technical resources.
CLEF’s architecture consists of a React‑based front‑end that renders dynamic web forms defined by JSON‑LD templates, a Node.js middleware that mediates between the UI and an Apache Jena Fuseki triple store, and a continuous‑integration pipeline built on GitHub Actions. When a contributor fills in a form, the system validates the input against the selected ontology (e.g., CIDOC‑CRM, EDM), enriches it with auto‑completion suggestions drawn from SPARQL endpoints, and immediately stores the resulting RDF triples in a named graph. Simultaneously, the same data is committed as RDF files to a GitHub repository; the commit metadata (author, timestamp, message) is mirrored in a provenance graph, making provenance a first‑class LOD entity that can be queried alongside the data itself.
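The paper does not include code, but the pairing of a data graph with a provenance graph described above can be sketched in N‑Quads. The following minimal Python example is illustrative only: the URIs, graph names, and property choices are assumptions, not taken from the CLEF codebase (it reuses `prov:` terms, which CLEF plausibly but not necessarily employs).

```python
# Illustrative sketch: a contributed triple and its commit-style provenance,
# each serialized into its own named graph so both are queryable together.
# All URIs and graph names below are hypothetical examples.

DATA_GRAPH = "<https://example.org/graph/contribution-42>"
PROV_GRAPH = "<https://example.org/graph/contribution-42/prov>"

def quad(s, p, o, g):
    """Serialize one RDF quad in N-Quads syntax."""
    return f"{s} {p} {o} {g} ."

def contribution_quads(subject, label, author, timestamp, commit_msg):
    """Return N-Quads for one contribution plus mirrored commit metadata."""
    data = [
        quad(subject, "<http://www.w3.org/2000/01/rdf-schema#label>",
             f'"{label}"', DATA_GRAPH),
    ]
    prov = [
        quad(DATA_GRAPH, "<http://www.w3.org/ns/prov#wasAttributedTo>",
             f'"{author}"', PROV_GRAPH),
        quad(DATA_GRAPH, "<http://www.w3.org/ns/prov#generatedAtTime>",
             f'"{timestamp}"', PROV_GRAPH),
        quad(DATA_GRAPH, "<http://www.w3.org/2000/01/rdf-schema#comment>",
             f'"{commit_msg}"', PROV_GRAPH),
    ]
    return "\n".join(data + prov)

print(contribution_quads(
    "<https://example.org/entity/fondo-rossi>",
    "Fondo Rossi", "alice", "2022-01-15T10:00:00Z",
    "Add archival fonds record"))
```

Because the provenance statements are about the data graph's name (not about the triples one by one), a single provenance record covers an entire contribution while remaining addressable in SPARQL.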
The editorial process is managed through role‑based access control: guest contributors submit drafts, internal reviewers approve or reject changes, and approved records transition to a “published” state that can be modified only by being reverted to draft. This ensures data integrity while still allowing iterative improvement. For long‑term archiving, CLEF automatically pushes stable releases to Zenodo, assigning DOIs and guaranteeing persistence.
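The role-gated state transitions described above amount to a small state machine. A minimal sketch follows; the state, action, and role names are illustrative, not CLEF's actual identifiers.

```python
# Sketch of a role-based editorial workflow of the kind CLEF describes:
# guests submit drafts, reviewers publish, and a published record can
# change only by being reverted to draft. All names are hypothetical.

ALLOWED = {
    # (current_state, action, role) -> next_state
    ("draft", "submit", "guest"): "under_review",
    ("under_review", "approve", "reviewer"): "published",
    ("under_review", "reject", "reviewer"): "draft",
    ("published", "reopen", "reviewer"): "draft",
}

class Record:
    def __init__(self):
        self.state = "draft"

    def apply(self, action, role):
        """Perform an action if this role may do so in the current state."""
        key = (self.state, action, role)
        if key not in ALLOWED:
            raise PermissionError(
                f"{role} cannot {action} a record in state {self.state}")
        self.state = ALLOWED[key]
        return self.state

r = Record()
r.apply("submit", "guest")      # draft -> under_review
r.apply("approve", "reviewer")  # under_review -> published
print(r.state)                  # published
```

Encoding the whitelist as a single transition table makes the "immutable unless reverted to draft" guarantee easy to audit: no entry maps out of `published` except the reviewer's `reopen`.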
Compared with prior solutions, CLEF offers three distinctive advantages. First, provenance is stored natively in RDF named graphs, enabling fine‑grained, queryable provenance that aligns with FAIR principles. Second, the form‑driven UI abstracts away the complexity of SPARQL and RDF, allowing non‑technical users to contribute high‑quality linked data. Third, integration with GitHub provides built‑in version control, CI testing, and a familiar collaboration environment, reducing the operational burden on small research teams.
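To illustrate the first advantage, provenance held in named graphs can be queried in the same SPARQL request as the data it describes. The following query is a sketch under assumed modelling choices (graph naming and `prov:` properties are hypothetical, not CLEF's documented schema):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prov: <http://www.w3.org/ns/prov#>

# Hypothetical query: which contributor asserted each label, and when?
SELECT ?entity ?label ?author ?time
WHERE {
  GRAPH ?g { ?entity rdfs:label ?label }
  GRAPH ?prov {
    ?g prov:wasAttributedTo ?author ;
       prov:generatedAtTime ?time .
  }
}
```

The key point is that `?g`, the name of the data graph, appears as an ordinary subject in the provenance graph, so data and provenance join on it without any out-of-band lookup.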
The authors acknowledge current limitations: scalability of the Fuseki store for very large datasets, the need for more sophisticated UI customization tools, and the relatively manual nature of external vocabulary mapping. Future work will explore machine‑learning‑assisted entity extraction, SPARQL endpoint optimization, and a plugin ecosystem that lets institutions extend CLEF with domain‑specific validation rules.
In summary, CLEF demonstrates that a fully LOD‑native, provenance‑aware, and user‑centric platform can effectively bridge the gap between cultural‑heritage institutions and scholarly projects, facilitating FAIR, reusable, and trustworthy linked data production without requiring heavyweight legacy infrastructure.