Research Data Explored: Citations versus Altmetrics
The study explores the citedness of research data, its distribution over time, and its relation to the availability of a DOI (Digital Object Identifier) in Thomson Reuters’ DCI (Data Citation Index). We investigate whether cited research data “impact” the (social) web, as reflected by altmetrics scores, and whether there is any relationship between the number of citations and the sum of altmetrics scores from various social media platforms. Three tools are used to collect and compare altmetrics scores: PlumX, ImpactStory, and Altmetric.com. In terms of coverage, PlumX is the most helpful altmetrics tool. While research data remain mostly uncited (about 85%), there has been a growing trend toward citing data sets published since 2007. Surprisingly, the percentage of cited research data with a DOI in DCI has decreased in recent years. Only nine repositories account for research data with DOIs and two or more citations. The share of cited research data with altmetrics scores is even lower (4 to 9%) but shows higher coverage of research data from the last decade. However, no correlation between the number of citations and the total number of altmetrics scores is observable. Certain data types (i.e., survey, aggregate data, and sequence data) are cited more often and receive higher altmetrics scores.
💡 Research Summary
The paper investigates the citation behavior of research data indexed in Thomson Reuters’ Data Citation Index (DCI) and examines whether cited datasets also generate impact on the social web, as measured by altmetrics. The authors retrieved all DCI records from 1960‑2014 that had received at least two citations, yielding a primary sample of 10,934 items (Sample 1). Metadata fields included DOI or URL availability, document type (data set, data study, repository), source repository, research area, publication year, data type, citation count, and ORCID presence.
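The sample selection amounts to a simple filter over the exported DCI records (years 1960–2014, at least two citations). The `DciRecord` structure and the records below are illustrative stand-ins, not the authors’ actual data model or data:

```python
from dataclasses import dataclass

@dataclass
class DciRecord:
    # Hypothetical structure mirroring the metadata fields listed above.
    title: str
    doc_type: str      # "data set", "data study", or "repository"
    year: int
    citations: int
    has_doi: bool

records = [
    DciRecord("A", "data set", 2010, 0, True),
    DciRecord("B", "data study", 2008, 5, True),
    DciRecord("C", "data set", 2012, 2, False),
]

# Sample 1: all records from 1960-2014 that received at least two citations.
sample_1 = [r for r in records if 1960 <= r.year <= 2014 and r.citations >= 2]
print(len(sample_1))  # → 2
```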
Citation analysis shows that roughly 86% of all DCI records remain uncited, confirming earlier findings of extreme skewness in data citation distributions. However, a temporal trend emerges: data published after 2007 are cited more frequently despite a shorter citation window, indicating growing recognition of data as scholarly output. Data studies receive far more citations on average (≈17.5 per item) than data sets (≈3.2), and items with a DOI attract higher citation averages (≈20 vs. 14 for URL‑only items). Repositories as a document type, though few in number, exhibit the highest mean citations (≈198), suggesting that citing a repository aggregates credit for many underlying datasets.
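Per-type averages of this kind come from grouping citation counts by document type and dividing the citation sum by the item count. The counts below are made-up stand-ins chosen to land near the reported averages, not the study’s data:

```python
from collections import defaultdict

# Hypothetical (document type, citations) pairs; real values come from DCI exports.
items = [
    ("data set", 3), ("data set", 4),
    ("data study", 15), ("data study", 20),
    ("repository", 198),
]

totals = defaultdict(lambda: [0, 0])  # doc type -> [citation sum, item count]
for doc_type, cites in items:
    totals[doc_type][0] += cites
    totals[doc_type][1] += 1

for doc_type, (cite_sum, count) in totals.items():
    print(f"{doc_type}: {cite_sum / count:.1f} citations per item")
```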
For altmetrics, the study employed three commercial aggregators—PlumX, ImpactStory, and Altmetric.com—focusing on items with at least two citations. PlumX provided the broadest coverage; nevertheless, only 4–9% of DOI‑bearing items registered any altmetric score. The proportion of DOI‑linked data with social‑media mentions has risen steadily over the last decade, whereas the share of URL‑only items with such mentions has declined. Importantly, statistical tests reveal no meaningful correlation between citation counts and total altmetric scores, implying that traditional scholarly impact and social‑media visibility capture distinct dimensions of data reuse.
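The summary does not name the exact statistical test; for skewed count data such as citations and altmetric scores, a rank correlation like Spearman’s ρ is a common choice. A self-contained sketch on hypothetical counts, where a value near zero would match the “no meaningful correlation” finding:

```python
def rank(values):
    """Rank values from 1..n, assigning average ranks to ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical counts: citations barely track altmetric scores.
citations = [2, 3, 5, 8, 13, 40]
altmetrics = [0, 7, 1, 0, 3, 2]
print(round(spearman_rho(citations, altmetrics), 2))  # → 0.23 (weak)
```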
Repository‑level analysis identifies nine repositories (e.g., ICPSR, World Wide Protein Data Bank, UK Data Archive) that house the majority of DOI‑linked, highly‑cited datasets. While DCI indexes nearly 2 million records, only a small subset (≈25% from figshare) contributed any citations, and figshare itself showed no items with ≥2 citations, highlighting a mismatch between repository size and citation impact.
Data‑type categorisation (merged manually) reveals that survey data, sequence data, and aggregate data dominate both citation and altmetric activity. Survey data alone account for 17,334 citations across 1,734 items, while sequence data contribute 10,458 citations from 3,408 items. These patterns align with disciplinary norms: social‑science surveys are heavily cited, whereas life‑science sequence datasets attract more social‑media attention (e.g., Twitter mentions).
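The citations-per-item ratios implied by these figures are easy to check, and they underline the disciplinary contrast: survey data draw roughly three times as many citations per item as sequence data:

```python
# Citations per item for the two leading data types, from the figures above.
survey = 17_334 / 1_734      # survey data: citations / items
sequence = 10_458 / 3_408    # sequence data: citations / items

print(f"survey: {survey:.1f}, sequence: {sequence:.1f}")  # → survey: 10.0, sequence: 3.1
```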
The authors conclude that (1) the majority of research data remain uncited, (2) DOI assignment modestly enhances citation rates but does not guarantee altmetric visibility, (3) altmetrics coverage for data is low and uneven across repositories and data types, (4) citation and altmetric metrics are largely independent, and (5) a small number of repositories concentrate the most impactful data. These findings have practical implications for data‑management policies, the promotion of DOI registration for datasets, and the design of evaluation frameworks that incorporate both bibliometric and altmetric indicators to capture the multifaceted impact of research data.