Expected values in percentile indicators
Lutz Bornmann* & Robin Haunschild**
*First author and corresponding author:
Division for Science and Innovation Studies
Administrative Headquarters of the Max Planck Society
Hofgartenstr. 8,
80539 Munich, Germany.
E-mail: bornmann@gv.mpg.de
**Max Planck Institute for Solid State Research
Heisenbergstr. 1,
70569 Stuttgart, Germany.
E-mail: R.Haunschild@fkf.mpg.de
Abstract
PP(top x%) is the proportion of the papers of a unit (e.g. an institution or a group of researchers)
that belong to the x% most frequently cited papers in the corresponding fields and
publication years. It has been proposed that x% of a unit's papers can be expected to belong to
the x% most frequently cited papers. In this Letter to the Editor we present the results of
an empirical test of whether this expectation actually holds and how strongly the results
deviate from the expected values when many random samples are drawn from the database.
Key words
Percentiles; expected values; reference sets; citation impact
The Leiden Manifesto presents ten guiding principles for research evaluation, especially
for the proper use of bibliometrics in research evaluation. According to Hicks, Wouters,
Waltman, de Rijcke, and Rafols (2015), “the most robust normalization method is based on
percentiles: each paper is weighted on the basis of the percentile to which it belongs in the
citation distribution of its field (the top 1%, 10% or 20%, for example)” (p. 430). PP(top x%)
is the proportion of the papers of a unit (e.g. an institution or a group of researchers) that
belong to the x% most frequently cited papers in the corresponding fields and publication
years. The Leiden Ranking (http://www.leidenranking.com/ranking/2016/list) uses PP(top
x%) as one of the central indicators to rank universities worldwide.
An important advantage of PP(top x%) is that the indicator allows a comparison with
an expected value. It has been proposed that x% of a unit's papers can be expected to belong to
the x% most frequently cited papers (e.g. Bornmann, Mutz, Marx, Schier, & Daniel, 2011). In
this Letter to the Editor we present the results of an empirical test of whether this expectation
actually holds and how strongly the results deviate from the expected values when many
random samples are drawn from the database.
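The sampling experiment described above can be sketched as follows. This is a minimal illustration (not the authors' actual code) using a synthetic database in which exactly 10% of papers are flagged as top-10% papers; the sample sizes and the database size are our own assumptions.

```python
import random
import statistics

# Synthetic "database": exactly 10% of papers are top-10% papers,
# so the expected PP(top 10%) of any random sample is 10%.
random.seed(42)
N_PAPERS = 100_000  # assumed database size for illustration
papers = [True] * (N_PAPERS // 10) + [False] * (N_PAPERS - N_PAPERS // 10)

def pp_top10(sample_size, n_samples=1000):
    """Return PP(top 10%) in percent for n_samples random samples."""
    return [100.0 * sum(random.sample(papers, sample_size)) / sample_size
            for _ in range(n_samples)]

for size in (100, 1_000, 10_000):
    values = pp_top10(size)
    print(f"n={size:>6}: mean={statistics.mean(values):5.2f}%, "
          f"sd={statistics.stdev(values):4.2f}%")
```

The sample means scatter around the expected 10%, and the standard deviation shrinks as the sample size grows, which is the kind of deviation pattern the empirical test examines.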
The bibliometric data used in this paper are from an in-house database developed and
maintained by the Max Planck Digital Library (MPDL, Munich) and derived from the Science
Citation Index Expanded (SCI-E), Social Sciences Citation Index (SSCI), Arts and
Humanities Citation Index (AHCI) prepared by Thomson Reuters (Philadelphia,
Pennsylvania, USA). The in-house database contains not only bibliographic and times cited
information for single papers, but also several field-normalized indicators. Three indicators
are PP(top 50%), PP(top 10%), and PP(top 1%) which are calculated following Waltman and
Schreiber (2013). These indicators account for ties in the citation data when the ties occur at
the threshold separating the top x% papers from the bottom (100-x)%. The values of PP(top 50%),
PP(top 10%), and PP(top 1%) for all papers published between 1980 and 2010 (n = 23,154,624)
are PP(top 50%) = 49.380, PP(top 10%) = 9.904, and PP(top 1%) = 0.990. The values are not
exactly 50%, 10% and 1%, respectively, because the impact of the papers in our database is
not fractionally assigned to subject categories. Instead, an average citation impact is
calculated for papers assigned to more than one subject category. Waltman, van Eck, van
Leeuwen, Visser, and van Raan (2011) explain with vivid examples how these deviations
emerge if the impact is not fractionally measured.
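The tie handling at the threshold can be illustrated with a short sketch. This follows the idea in Waltman and Schreiber (2013) of giving papers tied at the citation threshold a fractional top-x% score so that a reference set contributes exactly x% top papers; the function and variable names are our own, not taken from that paper.

```python
import math

def top_x_scores(citations, x):
    """Return a top-x% score in [0, 1] for each paper in a reference set.

    Papers strictly above the citation threshold score 1, papers below
    score 0, and papers tied at the threshold share the remaining weight
    so that the scores sum to exactly x% of the set size.
    """
    n = len(citations)
    target = n * x / 100.0                       # expected number of top papers
    ranked = sorted(citations, reverse=True)
    t = ranked[math.ceil(target) - 1]            # citation count at the boundary rank
    above = sum(c > t for c in citations)        # papers clearly in the top
    tied = sum(c == t for c in citations)        # papers tied at the threshold
    frac = (target - above) / tied               # fractional score for each tied paper
    return [1.0 if c > t else frac if c == t else 0.0 for c in citations]
```

For example, in a set of 10 papers where three papers are tied at the top-20% threshold, each tied paper receives a score of 1/3, so the scores still sum to 2 papers (20% of 10).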
Table 1. Key figures for PP(top 50%), PP(top 10%), and PP(top 1%) from 1000 random
samples of different size