Analyzing data citation practices using the Data Citation Index

We present an analysis of data citation practices based on the Data Citation Index (DCI) from Thomson Reuters. This database, launched in 2012, aims to link data sets and data studies with the citations they receive from the other citation indexes. The DCI harvests citations to research data from papers indexed in the Web of Science. It relies on the information provided by the data repository, as data citation practices are inconsistent or nonexistent in many cases. The findings of this study show that data citation practices are far from common in most research fields. Researchers differ in what they cite: in Science and in Engineering and Technology, data sets were the most cited, whereas in Social Sciences and in Arts and Humanities data studies play a greater role. A total of 88.1 percent of the records have received no citations, but some repositories show very low uncitedness rates. Although data citation practices are rare in most fields, they have expanded in disciplines such as crystallography and genomics. We conclude by emphasizing the role that the DCI could play in encouraging the consistent, standardized citation of research data; a role that would enhance their value as a means of following the research process from data collection to publication.

💡 Research Summary

The paper presents a comprehensive quantitative assessment of data citation practices across the scholarly landscape using Thomson Reuters’ Data Citation Index (DCI), a database launched in 2012 that links research data sets and data studies with citations drawn from the Web of Science citation indexes. The authors harvested the entire DCI corpus—over two million records—examining citation counts, disciplinary patterns, and repository‑specific characteristics. Their analysis reveals that data citation remains a marginal activity: 88.1 % of all records have never been cited, underscoring the nascent state of data‑citation culture and the inconsistency or outright absence of citation conventions in many fields.

Disciplinary differences are pronounced. In Science and in Engineering and Technology, data sets are the most cited record type, which is consistent with an emphasis on direct reuse of experimental or simulation outputs. In Social Sciences and in Arts and Humanities, by contrast, data studies play the greater role. In the DCI, a data study is a record describing a study or experiment together with its associated data, so this pattern suggests that scholars in these areas tend to cite the study as a whole rather than its individual data sets. One plausible reading is that the former fields treat data as a reusable commodity, while the latter treat data as part of a contextualized study.

Repository‑level analysis shows that specialized archives such as the Crystallography Open Database, GenBank, and the Protein Data Bank exhibit markedly lower uncitedness rates. These repositories benefit from long‑standing community standards, well‑defined metadata schemas, and explicit citation guidelines, which together facilitate consistent referencing. In contrast, multidisciplinary or generic repositories often suffer from incomplete or non‑standardized metadata, making it difficult for authors to generate accurate citations and for the DCI to capture them.
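The uncitedness rate discussed above is simply the share of records with zero citations, computed overall or per repository. The following is a minimal sketch of that arithmetic, not the authors' method: the record layout, the function name `uncitedness_by_repository`, and the repository labels `repo_a` and `repo_b` are hypothetical, and the toy counts are illustrative rather than taken from the DCI.

```python
from collections import defaultdict


def uncitedness_by_repository(records):
    """Return {repository: share of its records with zero citations}.

    `records` is an iterable of (repository, times_cited) pairs.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    uncited = defaultdict(int)
    for repository, times_cited in records:
        totals[repository] += 1
        if times_cited == 0:
            uncited[repository] += 1
    return {repo: uncited[repo] / totals[repo] for repo in totals}


# Hypothetical toy records, not real DCI data.
records = [
    ("repo_a", 0), ("repo_a", 0), ("repo_a", 3), ("repo_a", 0),
    ("repo_b", 5), ("repo_b", 1), ("repo_b", 0), ("repo_b", 2),
]

rates = uncitedness_by_repository(records)

# Overall uncitedness: uncited records divided by all records.
overall = sum(1 for _, cited in records if cited == 0) / len(records)

for repo, rate in sorted(rates.items()):
    print(f"{repo}: {rate:.1%} uncited")
print(f"overall: {overall:.1%} uncited")
```

Reporting the rate per repository alongside the overall figure is what lets a low-uncitedness archive stand out against a high aggregate such as the 88.1 % reported in the study.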

The authors argue that data citations differ fundamentally from traditional article citations. While article citations largely reflect intellectual influence, data citations are heavily contingent on data quality, accessibility, and the visibility of the hosting repository. Consequently, the DCI’s citation metrics can serve as a novel indicator of a dataset’s scholarly impact, complementing conventional bibliometrics. For example, a dataset repeatedly cited across diverse studies signals high reuse value and may warrant recognition akin to that afforded to highly cited articles.

Crucially, the paper positions the DCI as a catalyst for standardizing and encouraging data citation. By aggregating citation links and exposing them to the research community, the DCI can promote transparency throughout the research lifecycle—from data collection, through analysis, to publication. The authors recommend coordinated actions: repositories must improve metadata completeness and adopt persistent identifiers; publishers should enforce data‑citation policies; funding agencies and institutions ought to incorporate data citations into evaluation criteria; and researchers need education on proper data‑citation practices.

Despite the overall rarity of data citation, the study identifies pockets of robust activity in fields such as crystallography and genomics, where community norms and technical infrastructure already support systematic referencing. The authors conclude that as these practices diffuse, the scholarly ecosystem will gain a richer, more traceable record of research outputs, enhancing reproducibility, accountability, and the attribution of credit to data creators. The DCI, therefore, has the potential to become a cornerstone infrastructure for the emerging data‑centric paradigm of scholarly communication.

📜 Original Paper Content