Slovenia's Current Research Information System (SICRIS) currently hosts 86,443 publications with citation data from 8,359 researchers working on the whole plethora of social and natural sciences from 1970 till present. Using these data, we show that the citation distributions derived from individual publications have Zipfian properties in that they can be fitted by a power law $P(x) \sim x^{-\alpha}$, with $\alpha$ between 2.4 and 3.1 depending on the institution and field of research. Distributions of indexes that quantify the success of researchers rather than individual publications, on the other hand, cannot be associated with a power law. We find that for Egghe's g-index and Hirsch's h-index the log-normal form $P(x) \sim \exp[-a\ln x -b(\ln x)^2]$ applies best, with $a$ and $b$ depending moderately on the underlying set of researchers. In special cases, particularly for institutions with a strongly hierarchical constitution and research fields with high self-citation rates, exponential distributions can be observed as well. Both indexes yield distributions with equivalent statistical properties, which is a strong indicator for their consistency and logical connectedness. At the same time, differences in the assessment of citation histories of individual researchers strengthen their importance for properly evaluating the quality and impact of scientific output.
arXiv:1003.1018v1 [physics.data-an] 4 Mar 2010
Zipf’s law and log-normal distributions in measures of scientific output across
fields and institutions: 40 years of Slovenia’s research as an example
Matjaž Perc∗∗
Department of Physics, Faculty of Natural Sciences and Mathematics, University of Maribor,
Koroška cesta 160, SI-2000 Maribor, Slovenia
Abstract
Slovenia’s Current Research Information System (SICRIS) currently hosts 86,443 publications with citation data from
8,359 researchers working on the whole plethora of social and natural sciences from 1970 till present. Using these
data, we show that the citation distributions derived from individual publications have Zipfian properties in that they
can be fitted by a power law $P(x) \sim x^{-\alpha}$, with $\alpha$ between 2.4 and 3.1 depending on the institution and field of research.
Distributions of indexes that quantify the success of researchers rather than individual publications, on the other hand,
cannot be associated with a power law. We find that for Egghe’s g-index and Hirsch’s h-index the log-normal form
$P(x) \sim \exp[-a\ln x -b(\ln x)^2]$ applies best, with $a$ and $b$ depending moderately on the underlying set of researchers.
In special cases, particularly for institutions with a strongly hierarchical constitution and research fields with high
self-citation rates, exponential distributions can be observed as well. Both indexes yield distributions with equivalent
statistical properties, which is a strong indicator for their consistency and logical connectedness. At the same time,
differences in the assessment of citation histories of individual researchers strengthen their importance for properly
evaluating the quality and impact of scientific output.
Keywords: Zipf’s law, citation distribution, g-index, h-index, ranking
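The two functional forms named in the abstract can be illustrated with a minimal sketch. It assumes the standard maximum-likelihood estimator for a continuous power law with a known lower cutoff (as discussed by Clauset et al., 2009), which is not necessarily the fitting procedure used in this paper, and it treats citation counts as continuous, which is only an approximation. The function names, the synthetic data, and the cutoff `x_min` are hypothetical choices made for illustration. The second helper uses the identity that a log-normal density with parameters $\mu$ and $\sigma$ can be written as $\exp[-a\ln x - b(\ln x)^2]$ up to a constant factor, with $a = 1 - \mu/\sigma^2$ and $b = 1/(2\sigma^2)$.

```python
import math
import random


def powerlaw_alpha_mle(samples, x_min):
    """MLE of alpha for a continuous power law P(x) ~ x^(-alpha), x >= x_min.

    Assumption: x_min is known in advance; choosing it from data is a
    separate problem that this sketch does not address.
    """
    tail = [x for x in samples if x >= x_min]
    if not tail:
        raise ValueError("no samples at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0.0:
        raise ValueError("all tail samples equal x_min; alpha is undefined")
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(alpha, x_min, n, seed=0):
    """Draw n synthetic values by inverse-transform sampling (illustration only)."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], so the power is always well defined.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def lognormal_to_ab(mu, sigma):
    """Map log-normal parameters (mu, sigma) to the coefficients (a, b) in
    P(x) ~ exp[-a ln x - b (ln x)^2]."""
    return 1.0 - mu / sigma**2, 1.0 / (2.0 * sigma**2)


if __name__ == "__main__":
    synthetic = sample_powerlaw(alpha=2.7, x_min=1.0, n=50000)
    print("estimated alpha:", powerlaw_alpha_mle(synthetic, x_min=1.0))
    print("(a, b) for mu=0, sigma=1:", lognormal_to_ab(0.0, 1.0))
```

Setting $b = 0$ in the second form recovers a pure power law with $\alpha = a$, so on a log-log plot the power law appears as a straight line and the log-normal form as a parabola.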
1. Introduction
Ranking of researchers is both important and interesting. Its importance stems largely from the determination
of advancement and selection criteria that underlie faculty recruitment and the awarding of research grants and
funds to individuals with the best indicators (Garfield, 1983; Adam, 2002; Ventura and Mombrú, 2006), while the
fact that it is interesting has many more aspects worth considering. For one, researchers seem to have a keen
interest in determining who is the most cited, the most connected, or the most influential of them all. Certainly
this is in part to gratify the personal sense of achievement, but more intricately, there is a lot we do not yet
understand about how and why certain researchers get more attention than others, and why some cannot rise above
a given level of recognition. Scientific excellence is definitely a crucial factor to consider, yet that alone
cannot explain all the fascinating properties that have been revealed in recent years with regard to citation
distributions (Egghe and Rousseau, 1990; Laherrere and Sornette, 1998; Redner, 1998, 2005; Radicchi et al., 2008;
Vieira and Gomes, 2010), indexes that quantify individual scientific output (Hirsch, 2005; Egghe, 2006, 2008a;
Bornmann et al., 2008; Zhang, 2009; Guns and Rousseau, 2009; Cabrerizoa et al., 2010), the importance of
first-movers (Newman, 2009) and self-citations (Fowler and Aksnes, 2007; Schreiber, 2007, 2008a), or the
structure of scientific collaboration networks (Newman, 2001), to name but a few.
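As a concrete illustration of the indexes that quantify individual scientific output, the following minimal sketch computes the h-index (Hirsch, 2005) and the g-index (Egghe, 2006) from a list of per-publication citation counts. It assumes the commonly stated definitions: h is the largest number such that h publications have at least h citations each, and g is the largest number such that the g most cited publications together have at least g² citations. It further assumes that g may not exceed the number of publications, which is one of several conventions in use. The function names and the example counts are hypothetical and are not taken from SICRIS.

```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break  # counts only decrease from here, so no larger h exists
    return h


def g_index(citations):
    """Largest g such that the top g publications have at least g^2 citations in total.

    Assumption: g is capped at the number of publications.
    """
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


if __name__ == "__main__":
    example = [25, 8, 5, 3, 3, 1, 0]  # hypothetical citation counts
    print("h-index:", h_index(example))
    print("g-index:", g_index(example))
```

For the hypothetical record above, a single highly cited publication raises the g-index well above the h-index, which shows how the two indexes can assess the same citation history differently.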
∗Electronic address: matjaz.perc@uni-mb.si; Homepage: http://www.matjazperc.com/
∗∗Supplementary tables for this paper are accessible via: http://www.matjazperc.com/sicris/stats.html
Preprint submitted to Journal of Informetrics
November 1, 2018
Empirical studies are important since they provide fuel for potential attempts at modeling and related theoretical
approaches aimed at deepening our understanding of citation practices, as well as for sharpening criteria and
indexes that quantify individual scientific output. Notably, one fact stands quite solid and has been pointed out on
several occasions [see e.g. Redner (2005)], namely that the more a paper is cited, the more likely it is to attract
further citations in the future. This phenomenon is by now known under different names. The Matthew effect (Merton,
1968) is likely the oldest term for it, but one also comes across cumulative advantage (de Solla Price, 1965, 1976)
and preferential attachment (Barabási and Albert, 1999), depending on the field of research and motivation of the study.
Linear preferential attachment models in particular enjoy exceptional popularity in describing the growth and structure
of complex networks (Albert and Barabási, 2002; Dorogovtsev and Mendes, 2003; Pastor-Satorras and Vespignani,
2004) and have become synonymous with the power-law distributions of connections that can be observed in many of
them (Faloutsos et al., 1999; Sornette, 2003; Newman, 2005; Clauset et al., 2009). There is evidence suggesting that
citation statistics may obey similar rules, yet deviations from the power-law distribution leave the reasoning
open to amendments (Redner, 2005), especially i
…(Full text truncated)…