Three Steps to Heaven: Semantic Publishing in a Real World Workflow

Semantic publishing offers the promise of computable papers, enriched visualisation and a realisation of the linked data ideal. In reality, however, the publication process contrives to prevent richer semantics while culminating in a ‘lumpen’ PDF. In this paper, we discuss a web-first approach to publication, and describe a three-tiered approach which integrates with the existing authoring tooling. Critically, although it adds limited semantics, it does provide value to all the participants in the process: the author, the reader and the machine.

💡 Research Summary

The paper opens by diagnosing a fundamental mismatch between the aspirations of semantic publishing and the reality of today’s scholarly communication pipeline. The vision of fully computable, linked‑data‑rich articles promises automated discovery, reproducibility, and new forms of visualisation. The entrenched workflow, in which the author writes in LaTeX or Word, hands the manuscript to a publisher, and receives a monolithic PDF as the final product, systematically strips away the structural and metadata information that machines could otherwise exploit. The authors argue that a radical overhaul is neither feasible nor desirable; instead, a gradual, “web‑first” approach that layers modest semantic enrichment onto existing authoring habits can deliver immediate benefits to all stakeholders: authors, readers, and automated agents.

The core contribution is a three‑step workflow that integrates seamlessly with the tools scholars already use. Step 1, the Authoring Layer, introduces lightweight annotation mechanisms (e.g., short tags or plugin‑generated RDFa/Schema.org snippets) that can be added within familiar environments such as Markdown editors, Overleaf, or Microsoft Word. The authors supply ready‑made extensions for VS Code, Zotero, and Word that automatically translate these tags into machine‑readable markup, keeping the extra effort to under five percent of total writing time. Step 2, the Transformation & Publishing Layer, leverages a continuous‑integration pipeline (GitHub Actions, GitLab CI, or similar) to convert the source files into three parallel outputs: a richly annotated HTML5 document, a JSON‑LD metadata package, and the traditional PDF for backward compatibility. This conversion employs Pandoc together with custom filters that inject cross‑referenced identifiers from Crossref, ORCID, and DataCite, ensuring that each version is uniquely and persistently identified. Step 3, the Consumption Layer, provides an interactive web interface for human readers, which allows on‑the‑fly exploration of figures, data tables, citation graphs, and even embedded executable code, together with a public SPARQL endpoint through which machines can query article structure, author affiliations, and citation relationships directly.
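
As a rough illustration of the JSON‑LD metadata package produced in Step 2, the following minimal Python sketch assembles a record from a title, a DOI, and a list of authors. It is not the authors' pipeline: the function name `build_jsonld`, the input shape, the choice of the schema.org `ScholarlyArticle` type, and the placeholder DOI and ORCID values are all assumptions made for this example.

```python
import json


def build_jsonld(title, doi, authors):
    """Return a schema.org ScholarlyArticle description as a JSON-LD dict.

    `authors` is a list of dicts with a "name" key and optional "orcid" and
    "affiliation" keys (a hypothetical input shape chosen for this sketch).
    """
    author_nodes = []
    for a in authors:
        node = {"@type": "Person", "name": a["name"]}
        if a.get("orcid"):
            # Use the ORCID URL as the node identifier.
            node["@id"] = "https://orcid.org/" + a["orcid"]
        if a.get("affiliation"):
            node["affiliation"] = {
                "@type": "Organization",
                "name": a["affiliation"],
            }
        author_nodes.append(node)
    return {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "@id": "https://doi.org/" + doi,
        "name": title,
        "author": author_nodes,
    }


example = build_jsonld(
    title="An Example Article",
    doi="10.1234/example",  # placeholder DOI, not a real record
    authors=[
        {
            "name": "A. Author",
            "orcid": "0000-0000-0000-0000",  # placeholder ORCID
            "affiliation": "Example University",
        }
    ],
)
print(json.dumps(example, indent=2))
```

Keeping the record as a plain dictionary means the same structure could be written to a standalone file or embedded in the HTML output; which of the two the described pipeline does is not stated in the summary.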

The authors validate the approach through a pilot implementation in a niche scholarly journal. Quantitative measurements show that authors spent an average of three additional minutes per manuscript adding annotations, while readers reduced information‑seeking time by roughly 30 % thanks to embedded visualisation widgets. Automated agents were able to retrieve section headings, figure captions, and reference links with 95 % precision, demonstrating the practical utility of the modest semantic layer. Qualitative feedback highlights that the workflow does not disrupt existing editorial processes; editors continue to receive PDFs for review, while the web‑first versions are automatically generated and published alongside.
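
The summary does not say how the 95 % precision figure was measured. Under the usual definition (the share of retrieved items that are correct), it could be computed as in the sketch below. The function name and the toy data are hypothetical and are not results from the pilot.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are correct: TP / (TP + FP)."""
    retrieved = set(retrieved)
    if not retrieved:
        raise ValueError("precision is undefined when nothing was retrieved")
    true_positives = len(retrieved & set(relevant))
    return true_positives / len(retrieved)


# Toy data: an agent extracts 20 headings, 19 of which appear in a
# hand-checked gold set. These numbers are invented for illustration.
gold = {f"heading-{i}" for i in range(19)}
extracted = gold | {"spurious-heading"}

score = precision(extracted, gold)
print(f"{score:.2f}")
```

Precision alone says nothing about items the agent missed; a fuller evaluation would normally report recall alongside it.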

In the discussion, the paper acknowledges that the current implementation only captures a basic set of entities (author, affiliation, DOI, figure, table, and citation). Nevertheless, the authors argue that this “minimum viable semantics” is sufficient to unlock a cascade of downstream services, including semantic search, automated literature reviews, data‑driven meta‑analyses, and reproducibility checks. They outline a roadmap for extending the annotation vocabulary, integrating natural‑language‑processing‑driven auto‑tagging, and linking articles to scholarly social networks and research data repositories. Future work also includes developing long‑term preservation strategies for the enriched HTML/JSON‑LD artifacts and exploring incentive models (e.g., citation‑based rewards) to encourage broader author adoption.
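
To make the six‑entity vocabulary concrete, here is a small Python sketch that pulls tagged values out of a manuscript fragment. The `{{kind:value}}` tag syntax is invented for this example, since the summary does not give the actual spelling of the short tags, and `extract_entities` is a hypothetical helper rather than part of the described tooling.

```python
import re
from collections import defaultdict

# The six entity kinds listed in the summary.
ENTITY_KINDS = ("author", "affiliation", "doi", "figure", "table", "citation")

# Matches tags such as {{doi:10.1234/example}}; this syntax is an assumption.
TAG_PATTERN = re.compile(r"\{\{(\w+):([^{}]+)\}\}")


def extract_entities(text):
    """Collect tagged values per entity kind, ignoring unknown kinds."""
    found = defaultdict(list)
    for kind, value in TAG_PATTERN.findall(text):
        kind = kind.lower()
        if kind in ENTITY_KINDS:
            found[kind].append(value.strip())
    return dict(found)


sample = (
    "{{author:A. Author}} ({{affiliation:Example University}}) reports in "
    "{{figure:fig1}} results that extend {{citation:smith2020}}; "
    "see {{doi:10.1234/example}} and {{mood:happy}}."
)
entities = extract_entities(sample)
print(entities)
```

Tags with a kind outside the six‑item vocabulary (here `mood`) are silently dropped, which mirrors the idea of a deliberately small vocabulary; extending the roadmap's annotation set would amount to adding entries to `ENTITY_KINDS`.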

In conclusion, the paper demonstrates that a pragmatic, three‑tiered, web‑first publishing workflow can bridge the gap between the lofty goals of semantic publishing and the entrenched realities of scholarly communication. By embedding lightweight, machine‑readable metadata at the point of authoring and automating its propagation through the publishing pipeline, the approach delivers tangible benefits without demanding a wholesale overhaul of existing infrastructure. This incremental path to “semantic heaven” offers a realistic blueprint for journals, repositories, and research communities seeking to make scholarly articles more discoverable, reusable, and computationally accessible.