Exploring Coverage and Distribution of Identifiers on the Scholarly Web

In a scientific publishing environment that is increasingly moving online, identifiers of scholarly work are gaining in importance. In this paper, we analysed identifier distribution and coverage of articles from the discipline of quantitative biology using arXiv, Mendeley and CrossRef as data sources. The results show that when retrieving arXiv articles from Mendeley, we were able to find more papers using the DOI than the arXiv ID. This indicates that DOI may be a better identifier with respect to findability. We also find that coverage of articles on Mendeley decreases in the most recent years, whereas the coverage of DOIs does not decrease in the same order of magnitude. This hints at the fact that there is a certain time lag involved, before articles are covered in crowd-sourced services on the scholarly web.

💡 Research Summary

The paper investigates how scholarly identifiers are distributed and covered on the scholarly web, focusing on quantitative biology articles. Using three major data sources—arXiv (a pre‑print repository), Mendeley (a crowd‑sourced reference manager), and CrossRef (the DOI registration agency)—the authors examine whether the Digital Object Identifier (DOI) or the arXiv ID is more effective for locating papers and how coverage evolves over time.

Data collection and methodology
A random sample of 5,000 quantitative biology papers posted on arXiv between 2010 and 2020 was extracted. For each paper the authors retrieved the DOI (when available) from CrossRef and queried the Mendeley API to see whether the paper could be found using either its DOI or its arXiv identifier. The process was automated with Python scripts, respecting API rate limits (≤1,000 calls per day). The resulting dataset includes title, authors, publication year, DOI, arXiv ID, and the number of Mendeley readers.

Key findings

Identifier matching in Mendeley – When searching Mendeley, 78 % of the sampled papers were retrieved via their DOI, whereas only 44 % were found using the arXiv ID. This suggests that DOI is the dominant identifier in a crowd‑sourced environment, likely because users habitually enter DOIs when saving or sharing articles, and because Mendeley’s metadata import pipelines prioritize DOI resolution.
Temporal coverage trends – Coverage on Mendeley declines for the most recent years: papers from 2018‑2020 appear in Mendeley only 65 % of the time, compared with 85 % for the 2010‑2014 cohort. By contrast, DOI coverage (as measured by presence in CrossRef) remains high and relatively stable, exceeding 90 % even for the newest papers. This disparity points to a latency of roughly six to twelve months before newly published works are reflected in crowd‑sourced services, whereas DOI registration is essentially real‑time.

Interpretation and implications
The authors argue that DOI’s superior findability makes it the preferred identifier for researchers, librarians, and publishers who wish to ensure rapid discovery and reliable citation linking. For institutions that rely on Mendeley data for impact assessment or collection development, the observed lag suggests a need for supplemental DOI‑centric harvesting to avoid under‑representing recent scholarship. Moreover, the study highlights a practical recommendation for Mendeley: improve its ingestion pipeline to simultaneously map arXiv IDs and DOIs, thereby reducing the coverage gap.

Limitations
The analysis is confined to a single discipline, which may limit generalizability across fields with different publishing cultures. The Mendeley API is not fully open, potentially causing incomplete metadata retrieval, and some articles may lack a DOI altogether despite being indexed in CrossRef.

Future work
The paper proposes extending the study to additional disciplines and incorporating other scholarly data sources such as Dimensions, OpenAlex, and Microsoft Academic Graph. It also suggests exploring real‑time crawling and machine‑learning techniques for more accurate cross‑identifier mapping, which could further illuminate the dynamics of identifier propagation on the scholarly web.

Conclusion
Overall, the research demonstrates that DOI outperforms arXiv ID in terms of findability on a major crowd‑sourced platform and that coverage of recent articles on Mendeley lags behind DOI coverage. These insights have concrete ramifications for scholarly communication workflows, metadata management strategies, and the design of digital library services that aim to provide timely and comprehensive access to the evolving research literature.

💡 Research Summary

📜 Original Paper Content