Power laws in citation distributions: Evidence from Scopus

Michal Brzezinski

Faculty of Economic Sciences, University of Warsaw, Warsaw, Poland

E-mail: mbrzezinski@wne.uw.edu.pl

Abstract

Modeling distributions of citations to scientific papers is crucial for understanding how science develops. However, there is considerable empirical controversy over which statistical model fits citation distributions best. This paper is concerned with the rigorous empirical detection of power-law behaviour in the distribution of citations received by the most highly cited scientific papers. We have used a large, novel data set on citations to scientific papers published between 1998 and 2002 drawn from Scopus. The power-law model is compared with a number of alternative models using a likelihood ratio test. We have found that the power-law hypothesis is rejected for around half of the Scopus fields of science. For these fields of science, the Yule, power-law with exponential cut-off and log-normal distributions seem to fit the data better than the pure power-law model. On the other hand, when the power-law hypothesis is not rejected, it is usually empirically indistinguishable from most of the alternative models. The pure power-law model seems to be the best model only for the most highly cited papers in “Physics and Astronomy”. Overall, our results seem to support theories implying that the most highly cited scientific papers follow the Yule, power-law with exponential cut-off or log-normal distribution. Our findings also suggest that power laws in citation distributions, when present, account only for a very small fraction of the published papers (less than 1% for most fields of science) and that the power-law scaling parameter (exponent) is substantially higher (from around 3.2 to around 4.7) than found in the older literature.

Introduction

It is often argued in scientometrics, social physics and other sciences that distributions of some scientific items (e.g., articles, citations) produced by some scientific sources (e.g., authors, journals) have heavy tails that can be modelled using a power-law model. These distributions are then said to conform to Lotka’s law [1]. Examples of such distributions include author productivity, occurrence of words, citations received by papers, nodes of social networks, number of authors per paper, scattering of scientific literature in journals, and many others [2]. In fact, power-law models are widely used in many sciences, such as physics, biology, earth and planetary sciences, economics, finance, computer science, and others [3, 4]. Models equivalent to Lotka’s law are known as Pareto’s law in economics [5] and as Zipf’s law in linguistics [6]. Appropriately measuring power laws and providing scientific explanations for them play an important role in understanding the behaviour of various natural and social phenomena.

This paper is concerned with the empirical detection of power-law behaviour in the distribution of citations received by scientific papers. The power-law distribution of citations for highly cited papers was first suggested by Price [7], who also proposed a “cumulative advantage” mechanism that could generate the power-law distribution [8]. More recently, a growing literature has developed that aims at measuring power laws in the right tails of citation distributions. In particular, Redner [9, 10] found that the right tails of citation distributions for articles published in Physical Review over a century and for articles published in 1981 in journals covered by Thomson Scientific’s Web of Science (WoS) follow power laws. The latter data set was also modelled with power-law techniques by Clauset et al. [4] and Peterson et al. [11].
The latter study also used data from a 2007 list of the living chemists with the highest h-index and from Physical Review D between 1975 and 1994. Van Raan [12] observed that the top of the distribution of around 18,000 papers published between 1991 and 1998 in the field of chemistry in the Netherlands follows a power-law distribution. Power-law models were also fitted to data from high energy physics [13], data for the most cited physicists [14], data for all papers published in journals of the American Physical Society from 1983 to 2008 [15], and data for all physics papers published between 1980 and 1989 [16].

Recently, Albarrán and Ruiz-Castillo [17] tested for power-law behaviour using a large WoS data set of 3.9 million articles published between 1998 and 2002 and categorized in 22 WoS research fields. The same data set was also used to search for power laws in the right tails of citation distributions categorized in 219 WoS scientific sub-fields [18, 19]. These studies offer the largest existing body of evidence on the power-law behaviour of citation distributions. Three major conclusions emerge from them. First, power-law behaviour is not universal. The existence of a power law cannot be rejected in the WoS data for 17 out of 22 fields and for 140 out of 219 sub-fields studied in [17] and in [18, 19], respectively. Second, in contrast to previous studies, these papers found that the scaling parameter (exponent) of the power-law distribution is above 3.5 in most cases, while the older literature suggested that the parameter value is between 2 and 3 [19]. Third, power laws in citation distributions are rather small: on average they cover just about 2% of the most highly cited articles in a given WoS field of science and account for about 13.5% of all citations in the field.

The main aim of this paper is to use a statistically rigorous approach to answer the empirical question of whether the power-law model best describes the observed distribution of highly cited papers. We use the statistical toolbox for detecting power-law behaviour introduced by Clauset et al. [4]. There are two major contributions of the present paper. First, we use a very large, previously unused data set on the citation distributions of the most highly cited papers in several fields of science. This data set comes from Scopus, a bibliographic database introduced in 2004 by Elsevier, and contains 2.2 million articles published between 1998 and 2002 and categorized in 27 Scopus major subject areas of science. Most of the previous studies used rather small data sets, which were not suitable for rigorous statistical detection of power-law behaviour. In contrast, our sample is even bigger with respect to the most highly cited papers than the large sample used in the recent contributions based on WoS data [17–19]. This results from the fact that Scopus indexes about 70% more sources than WoS [20, 21] and therefore gives a more comprehensive coverage of citation distributions (footnote 1).

The second major contribution of the paper is to provide a rigorous statistical comparison of the power-law model and a number of alternative models with respect to the question of which theoretical distribution better fits empirical data on citations. This problem of model selection has been studied in some previous contributions to the literature. It has been argued that models like the stretched exponential [14], Yule [8], log-normal [10, 23, 24], Tsallis [25–27] or shifted power law [15] fit citation distributions equally well or better than the pure power-law model.
However, previous papers have either focused on a single alternative distribution or used only visual methods to choose between the competing models. The present paper fills the gap by providing a systematic and statistically rigorous comparison of the power-law distribution with such alternative models as the log-normal, exponential, stretched exponential (Weibull), Tsallis, Yule and power-law with exponential cut-off. The comparison between models was performed using a likelihood ratio test [4, 28].

Footnote 1. From the perspective of measuring power laws in citation distributions, the most important part of the distribution is the right tail. It seems that the database used in this paper has better coverage of the right tail of citation distributions. The most highly cited paper in our database has received 5187 citations (see Table 2), while the corresponding number for the database based on WoS is 4461 [22]. Our database is further described in the “Materials and Methods” section.

Materials and Methods

Fitting power-law model to citation data

We follow Clauset et al. [4] in choosing methods for fitting power laws to citation distributions. These authors carefully show that, in general, the appropriate methods depend on whether the data are continuous or discrete. In our case, the latter is true as citations are non-negative integers. Let $x$ be the number of citations received by an article in a given field of science. The probability density function (pdf) of the discrete power-law model is defined as

$$p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_0)}, \tag{1}$$

where $\zeta(\alpha, x_0)$ is the generalized or Hurwitz zeta function. The parameter $\alpha$ is a shape parameter of the power-law distribution, known as the power-law exponent or scaling parameter. Power-law behaviour is usually found only for values greater than some minimum, denoted by $x_0$.
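
As a rough illustration of Eq. (1), the discrete power-law pdf can be evaluated numerically. The sketch below is not the code used in the paper: the function names `hurwitz_zeta` and `power_law_pmf` are hypothetical, and the Hurwitz zeta function is approximated by a finite sum plus an Euler-Maclaurin tail correction, an assumption made here only to keep the example free of external dependencies.

```python
import math


def hurwitz_zeta(alpha, q, n_terms=1000):
    """Approximate zeta(alpha, q) = sum_{k>=0} (k + q)^(-alpha), alpha > 1, q > 0.

    Assumption of this sketch: a direct partial sum of n_terms terms plus the
    first Euler-Maclaurin correction terms for the remaining tail.
    """
    if alpha <= 1 or q <= 0:
        raise ValueError("requires alpha > 1 and q > 0")
    head = sum((k + q) ** -alpha for k in range(n_terms))
    a = n_terms + q
    # Tail sum_{k >= n_terms}: integral term + half of first term + derivative term.
    tail = (a ** (1 - alpha) / (alpha - 1)
            + 0.5 * a ** -alpha
            + alpha * a ** (-alpha - 1) / 12)
    return head + tail


def power_law_pmf(x, alpha, x0):
    """Eq. (1): p(x) = x^(-alpha) / zeta(alpha, x0) for integer x >= x0, else 0."""
    if x < x0:
        return 0.0
    return x ** -alpha / hurwitz_zeta(alpha, x0)
```

For instance, `power_law_pmf(10, 3.0, 10)` gives the probability mass at the lower bound when the exponent is 3; summing the function over all integers from `x0` upward should approach 1, which is what the zeta normalization is for.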
In the case of citation distributions, power-law behaviour has been found on average only in the top 2% of all articles published in a field of science [18, 19]. The lower bound on the power-law behaviour, $x_0$, should therefore be estimated if we want to measure precisely in which part of a citation distribution the model applies. Moreover, we need an estimate of $x_0$ if we want to obtain an unbiased estimate of the power-law exponent, $\alpha$.

We estimate $\alpha$ using maximum likelihood (ML) estimation. The log-likelihood function corresponding to (1) is

$$L(\alpha) = -n \ln \zeta(\alpha, x_0) - \alpha \sum_{i=1}^{n} \ln x_i, \tag{2}$$

where $n$ is the number of observations $x_i \ge x_0$. The ML estimate of $\alpha$ is found by numerical maximization of (2) (footnote 2). Following Clauset et al. [4], we use the following procedure to estimate the lower bound on the power-law behaviour, $x_0$. For each candidate value $x \ge x_{\min}$, we calculate the ML estimate of the power-law exponent, $\hat{\alpha}$, and then we compute the well-known Kolmogorov-Smirnov (KS) statistic for the data and the fitted model. The KS statistic is defined as

$$KS = \max_{x \ge x_0} \left| S(x) - P(x; \hat{\alpha}) \right|, \tag{3}$$

where $S(x)$ is the cumulative distribution function (cdf) for the observations with value at least $x_0$, and $P(x; \hat{\alpha})$ is the cdf of the power-law model fitted to the observations for which $x \ge x_0$. The estimate $\hat{x}_0$ is then chosen as the value of $x_0$ for which the KS statistic is the smallest. The standard errors for both estimated parameters, $\hat{\alpha}$ and $\hat{x}_0$, are computed with standard bootstrap methods with 1,000 replications.

Footnote 2. Clauset et al. [4] also provide an approximate method of estimating $\alpha$ for the discrete power-law model by assuming that continuous power-law distributed reals are rounded to the nearest integers. However, in this paper we use an exact approach based on maximizing (2).

Goodness-of-fit and model selection tests

The next step in measuring power laws involves testing goodness of fit.
A positive result of such a test allows one to conclude that a power-law model is consistent with the data. Following Clauset et al. [4] again, we use a test based on a semi-parametric bootstrap approach (footnote 3). The procedure starts with fitting a power-law model to the data and calculating the KS statistic for this fit, $k$. Next, a large number of synthetic data sets are generated that follow the originally fitted power-law model above the estimated $\hat{x}_0$ and have the same non-power-law distribution as the original data set below $\hat{x}_0$. Then, a power-law model is fitted to each of the generated data sets using the same methods as for the original data set, and the KS statistics are calculated. The fraction of data sets for which their own KS statistic is larger than $k$ is the $p$-value of the test. It represents the probability that the KS statistic computed for data drawn from the power-law model fitted to the original data is at least as large as $k$. The power-law hypothesis is rejected if the $p$-value is smaller than some chosen threshold. Following Clauset et al. [4], we rule out the power-law model if the estimated $p$-value for this test is smaller than 0.1. In the present paper, we use 1,000 generated data sets.

If the goodness-of-fit test rejects the power-law hypothesis, we may conclude that the power law has not been found. However, if a data set is fitted well by a power law, the question remains whether there is an alternative distribution that is an equally good or better fit to this data set. We need, therefore, to fit some rival distributions and evaluate which distribution gives a better fit. To this end, we use the likelihood ratio test, which tests whether the compared models are equally close to the true model against the alternative that one is closer.
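
The fitting procedure of the previous subsection and the semi-parametric bootstrap test described above can be sketched together as follows. This is a hedged illustration, not the Matlab and R software acknowledged in the paper. All function names are hypothetical. The search interval for the exponent (1.5 to 8), the minimum tail size `min_tail`, the evaluation of the KS distance only at observed values, and the sampler that rounds a continuous power-law variate (an approximation, unlike the exact ML fit) are assumptions of this sketch.

```python
import math
import random
from collections import Counter


def hurwitz_zeta(alpha, q, n_terms=100):
    """zeta(alpha, q): partial sum plus Euler-Maclaurin tail (sketch assumption)."""
    a = n_terms + q
    head = sum((k + q) ** -alpha for k in range(n_terms))
    return (head + a ** (1 - alpha) / (alpha - 1)
            + 0.5 * a ** -alpha + alpha * a ** (-alpha - 1) / 12)


def mle_alpha(tail, x0, lo=1.5, hi=8.0, iters=40):
    """Maximize L(alpha) = -n ln zeta(alpha, x0) - alpha * sum(ln x_i) on [lo, hi]."""
    n, log_sum = len(tail), sum(math.log(x) for x in tail)

    def loglik(a):
        return -n * math.log(hurwitz_zeta(a, x0)) - a * log_sum

    g = (math.sqrt(5) - 1) / 2  # golden-section search; L is concave in alpha
    for _ in range(iters):
        c, d = hi - g * (hi - lo), lo + g * (hi - lo)
        if loglik(c) > loglik(d):
            hi = d
        else:
            lo = c
    return (lo + hi) / 2


def ks_distance(tail, alpha, x0):
    """max |S(x) - P(x; alpha)|, evaluated at the distinct observed values."""
    n, z0, counts = len(tail), hurwitz_zeta(alpha, x0), Counter(tail)
    seen, worst = 0, 0.0
    for v in sorted(counts):
        seen += counts[v]
        model_cdf = 1.0 - hurwitz_zeta(alpha, v + 1) / z0  # P(X <= v)
        worst = max(worst, abs(seen / n - model_cdf))
    return worst


def fit_power_law(data, min_tail=50):
    """Scan candidate lower bounds; return (ks, alpha_hat, x0_hat) with smallest KS."""
    best = None
    for x0 in sorted(set(data)):
        if x0 < 1:
            continue  # the model is defined for positive integers only
        tail = [x for x in data if x >= x0]
        if len(tail) < min_tail:
            break  # assumption: too few points for a stable fit
        alpha = mle_alpha(tail, x0)
        ks = ks_distance(tail, alpha, x0)
        if best is None or ks < best[0]:
            best = (ks, alpha, x0)
    return best


def sample_power_law(alpha, x0, rng):
    """Approximate draw: round a continuous power-law variate (not exact)."""
    u = rng.random()
    return int((x0 - 0.5) * (1.0 - u) ** (-1.0 / (alpha - 1.0)) + 0.5)


def gof_p_value(data, n_boot=100, seed=0):
    """Fraction of synthetic data sets whose own KS is at least the observed KS."""
    rng = random.Random(seed)
    ks_obs, alpha, x0 = fit_power_law(data)
    body = [x for x in data if x < x0]
    p_tail = 1.0 - len(body) / len(data)
    exceed = 0
    for _ in range(n_boot):
        # Power-law draw above x0 with probability p_tail, else resample the body.
        synth = [sample_power_law(alpha, x0, rng)
                 if (not body or rng.random() < p_tail) else rng.choice(body)
                 for _i in data]
        refit = fit_power_law(synth)
        if refit is not None and refit[0] >= ks_obs:
            exceed += 1
    return exceed / n_boot
```

With the paper's settings one would call `gof_p_value(data, n_boot=1000)` and reject the power-law hypothesis when the result is below 0.1. In pure Python this is slow for samples of the size used in the paper, so the sketch is meant to show the logic rather than to serve as a production implementation.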
The test computes the logarithm of the ratio of the likelihoods of the data under two competing distributions, LR, which is negative or positive depending on which model fits the data better. Specifically, let us consider two distributions with pdfs denoted by $p_1(x)$ and $p_2(x)$. The LR is defined as

$$LR = \sum_{i=1}^{n} \left[ \ln p_1(x_i) - \ln p_2(x_i) \right]. \tag{4}$$

A positive value of the LR suggests that model $p_1(x)$ fits the data better. However, the sign of the LR can be used to determine which model should be favoured only if the LR is significantly different from zero. Vuong [28] showed that in the case of non-nested models the normalized log-likelihood ratio $NLR = n^{-1/2} LR / \sigma$, where $\sigma$ is the estimated standard deviation of LR, has a limit standard normal distribution (footnote 4). This result can be used to compute a $p$-value for the test discriminating between the competing models. If the $p$-value is small (for example, smaller than 0.1), then the sign of the LR can probably be trusted as an indicator of which model is preferred. However, if the $p$-value is large, then the test is unable to choose between the compared distributions.

Footnote 3. If our data were drawn from a given model, then we could use the KS statistic in testing goodness of fit, because the distribution of the KS statistic is known in such a case. However, when the underlying model is not known or when its parameters are estimated from the data, which is our case, the distribution of the KS statistic must be obtained by simulation.

We have followed Clauset et al. [4] in choosing the following alternative discrete distributions: exponential, stretched exponential (Weibull), log-normal, Yule and the power law with exponential cut-off (footnote 5). Most of these models have been considered in the previous literature on modeling citation distributions.
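
The likelihood ratio of Eq. (4) and its normalization can be sketched in a few lines. The function name `vuong_test` is hypothetical. The sketch also assumes that $\sigma$ is estimated as the standard deviation of the pointwise log-likelihood differences $\ln p_1(x_i) - \ln p_2(x_i)$ and that the reported $p$-value is two-sided; the paper does not spell out either detail, so both should be read as one plausible interpretation.

```python
import math


def vuong_test(loglik1, loglik2):
    """Return (LR, NLR, two-sided p-value) for two non-nested models.

    loglik1[i] and loglik2[i] are ln p1(x_i) and ln p2(x_i) for the same
    observations x_i, each model evaluated at its own fitted parameters.
    """
    n = len(loglik1)
    diffs = [a - b for a, b in zip(loglik1, loglik2)]
    lr = sum(diffs)                                   # Eq. (4)
    mean = lr / n
    # Assumption: sigma is the std. deviation of the pointwise differences.
    sigma = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    if sigma == 0.0:
        raise ValueError("identical pointwise differences: test is undefined")
    nlr = lr / (sigma * math.sqrt(n))                 # NLR = n^(-1/2) LR / sigma
    p_value = math.erfc(abs(nlr) / math.sqrt(2))      # 2 * (1 - Phi(|NLR|))
    return lr, nlr, p_value
```

A positive `lr` with a small `p_value` would favour the first model, a negative `lr` with a small `p_value` the second, and a large `p_value` leaves the two models indistinguishable on the given data.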
As another alternative, we also use the Tsallis distribution, which has also been proposed as a model for citation distributions [26, 27]. The definitions of our alternative distributions are given in Table 1.

[Please insert Table 1 about here]

Data

We use citation data from Scopus, a bibliographic database introduced in 2004 by Elsevier. Scopus is a major competitor to the most widely used data source in the literature on modeling citation distributions, Web of Science (WoS) from Thomson Reuters. Scopus covers 29 million records with references going back to 1996 and 21 million pre-1996 records going back as far as 1823. An important limitation of the database is that it does not cover cited references for pre-1996 articles. Scopus contains 21,000 peer-reviewed journals from more than 5,000 international publishers. It covers about 70% more sources than WoS [20], but a large part of the additional sources are low-impact journals. A recent literature review has found that the quite extensive literature that compares WoS and Scopus from the perspective of citation analysis offers mixed results [21]. However, most of the studies suggest that, at least for the period from 1996 on, the number of citations in both databases is either roughly similar or higher in Scopus than in WoS. Therefore, it seems that Scopus constitutes a useful alternative to WoS from the perspective of modeling citation distributions.

Footnote 4. In the case of nested models, 2LR has a limit chi-squared distribution [28].

Footnote 5. The power-law with exponential cut-off behaves like the pure power-law model for smaller values of $x$, $x \ge x_0$, while for larger values of $x$ it behaves like an exponential distribution. The pure power-law model is nested within the power-law with exponential cut-off, and for this reason the latter always provides a fit at least as good as the former.

Journals in Scopus are classified under four main subject areas: life sciences (4,200 journals), health sciences (6,500 journals), physical sciences (7,100 journals) and social sciences including arts and humanities (7,000 journals). The four main subject areas are further divided into 27 major subject areas and more than 300 minor subject areas. Journals may be classified under more than one subject area. The analysis in this paper was performed at the level of the 27 Scopus major subject areas of science (footnote 6).

From the various document types contained in Scopus, we have selected only articles. For the purpose of comparability with the recent WoS-based studies [17, 18], only the articles published between 1998 and 2002 were considered. Following previous literature, we have chosen a common 5-year citation window for all articles published in 1998-2002 (footnote 7). See Albarrán and Ruiz-Castillo [17] for a justification of choosing the 5-year citation window common to all fields of science.

In order to measure the power-law behaviour of citations, we need data on the right tails of citation distributions. To this end, we have used the Scopus Citation Tracker to collect citations for min(100,000; $x$) of the most highly cited articles, where $x$ is the actual number of articles published in a given field of science during 1998-2002. This analysis was performed separately for each of the 27 science fields categorized by Scopus. Descriptive statistics for our data sets are presented in Table 2.

[Please insert Table 2 about here]

In some cases, there were fewer than 100,000 articles published in a field of science during 1998-2002 and we were able to obtain complete or almost complete distributions of citations (see columns 2-4 of Table 2).
In other cases, we have obtained only a part of the relevant distribution, encompassing the right tail and some part of the middle of the distribution (footnote 8). The smallest portions of citation distributions were obtained for Medicine (8.4% of total papers), Biochemistry, Genetics and Molecular Biology (15.7%) and Physics and Astronomy (18.4%). However, using the WoS data for 22 science categories, Albarrán and Ruiz-Castillo [17] found that power laws usually account for less than 2% of the highest-cited articles. Therefore, it seems that the coverage of the right tails of citation distributions in our samples is satisfactory for our purposes.

Footnote 6. See Table 2 for a list of the analyzed Scopus areas of science.

Footnote 7. For example, for articles published in 1998 we have analyzed citations received during 1998-2002, while for articles published in 2002, those received during 2002-2006.

Footnote 8. For all fields of science analyzed, there were some articles with missing information on citations. These articles were removed from our samples. However, this has usually affected only about 0.1% of our samples.

Results and Discussion

Table 3 presents results of fitting the discrete power-law model to our data sets consisting of citations to scientific articles published over 1998-2002 (with a common 5-year citation window), separately for each of the 27 Scopus major subject areas of science. The last row also gives results for all subject areas combined (“All Sciences”). Besides estimates of the power-law exponent ($\hat{\alpha}$) and the lower bound on the power-law behaviour ($\hat{x}_0$), the table also gives the estimated number and percentage of power-law distributed papers, as well as the $p$-value for our goodness-of-fit test.

[Please insert Table 3 about here]

Results with respect to goodness of fit suggest that the power-law hypothesis cannot be rejected for the following 14 Scopus science fields: “Agricultural and Biological Sciences”, “Biochemistry, Genetics and Molecular Biology”, “Chemical Engineering”, “Chemistry”, “Energy”, “Environmental Science”, “Materials Science”, “Neuroscience”, “Nursing”, “Pharmacology, Toxicology and Pharmaceutics”, “Physics and Astronomy”, “Psychology”, “Health Professions”, and “Multidisciplinary”. The remaining 13 Scopus fields of science, for which the power-law model is rejected, include humanities and social sciences (“Arts and Humanities”, “Business, Management and Accounting”, “Economics, Econometrics and Finance”, “Social Sciences”), but also formal sciences (“Computer Science”, “Decision Sciences”, “Mathematics”), life sciences (“Immunology and Microbiology”, “Medicine”, “Veterinary”, “Dentistry”), as well as “Earth and Planetary Sciences” and “Engineering”. The best power-law fits for these fields of science are shown in Figure 1.

[Please insert Figure 1 about here]

For most of the distributions shown in Figure 1, it can be clearly seen that their right tails decay faster than the pure power-law model indicates. This suggests that the largest observations for these distributions should rather be modelled with a distribution having a lighter tail than the pure power-law model, such as the log-normal or power-law with exponential cut-off models.

The $p$-value for our goodness-of-fit test in the case of “All Sciences” is 0.076, which is below our acceptance threshold of 0.1. However, this $p$-value is non-negligible and significantly higher than the $p$-values for most of the 13 Scopus fields of science for which we reject the power-law hypothesis. For this reason, we conclude that the evidence is not conclusive in this case.
Our result for “All Sciences” is, however, in stark contrast with that of Albarrán and Ruiz-Castillo [17], who, using the WoS data, found that the fit for a corresponding data set was very good (with a $p$-value of 0.85) (footnote 9).

The estimates of the power-law exponent for the 14 Scopus science fields for which the power law seems to be a plausible hypothesis range from 3.24 to 4.69. This is in good agreement with Albarrán and Ruiz-Castillo [17] and confirms their assessment that the true value of this parameter is substantially higher than found in the earlier literature [9, 13, 25], which offered estimates ranging from around 2.3 to around 3. We also confirm the observation of Albarrán and Ruiz-Castillo [17] that power laws in citation distributions are rather small: they usually account for less than 1% of total articles published in a field of science. The only two fields in our study with slightly “bigger” power laws are “Chemistry” (2%) and “Multidisciplinary” (2.8%).

The comparison between the power-law hypothesis and the alternatives using Vuong’s test is presented in Table 4. It can be observed that the exponential model can be ruled out in most of the cases. We first discuss other results for the 13 Scopus fields of science that did not pass our goodness-of-fit test. For all of these fields, except for “Veterinary”, the Yule and power-law with exponential cut-off models fit the data better than the pure power-law model in a statistically significant way. The log-normal model is better than the pure power-law model in 10 of the discussed fields; the same holds for the Weibull distribution in the case of 5 fields. However, these results do not imply that the distributions that give a better fit to the non-power-law distributed data than the pure power-law model are plausible hypotheses for these data sets.
This issue should be further studied using appropriate goodness-of-fit tests.

[Please insert Table 4 about here]

Footnote 9. In Albarrán and Ruiz-Castillo [17], the power-law hypothesis is found plausible for 17 out of 22 WoS fields of science. It is rejected for “Pharmacology and Toxicology”, “Physics”, “Agricultural Sciences”, “Engineering”, and “Social Sciences, General”. These results are not directly comparable with those of the present paper as Scopus and WoS use different classification systems to categorize journals.

We now turn to results for the remaining Scopus fields of science that were not rejected by our goodness-of-fit test. The power-law hypothesis seems to be the best model only for “Physics and Astronomy”. In this case, the test statistic is always non-negative, implying that the power-law model fits the data as well as or better than each of the alternatives. For the remaining 13 fields of science, the log-normal, Yule and power-law with exponential cut-off models always have higher log-likelihoods, suggesting that these models may fit the data better than the pure power-law distribution. However, only in a few cases are the differences between models statistically significant. For “Chemistry” and “Multidisciplinary”, both the Yule and power-law with exponential cut-off models are favoured over the pure power-law model. The power-law with exponential cut-off is also favoured in the case of “Health Professions”. In other cases, the $p$-values for the likelihood ratio test are large, which implies that there is no conclusive evidence that would allow one to distinguish between the pure power-law, log-normal, Yule and power-law with exponential cut-off distributions.
Comparing the power-law distribution with the Weibull and Tsallis distributions, we observe that the sign of the test statistic is positive in roughly half of the cases, but the $p$-values are always large and neither model can be ruled out. Our likelihood ratio tests therefore suggest that when the power law is a plausible hypothesis according to our goodness-of-fit test, it is often indistinguishable from alternative models.

Overall, our results show that the evidence in favour of power-law behaviour of the right tails of citation distributions is rather weak. For roughly half of the Scopus fields of science studied, the power-law hypothesis is rejected. Other distributions, especially the Yule, power-law with exponential cut-off and log-normal, seem to fit the data from these fields of science better than the pure power-law model. On the other hand, when the power-law hypothesis is not rejected, it is usually empirically indistinguishable from all alternatives with the exception of the exponential distribution. The pure power-law model seems to be favoured over alternative models only for the most highly cited papers in “Physics and Astronomy”. Our results suggest that theories implying that the most highly cited scientific papers follow the Yule, power-law with exponential cut-off or log-normal distribution may have slightly more support in the data than theories predicting pure power-law behaviour.

Conclusions

We have used a large, novel data set on citations to scientific papers published between 1998 and 2002 drawn from Scopus to test empirically for power-law behaviour in the right tails of citation distributions. We have found that the power-law hypothesis is rejected for around half of the Scopus fields of science.
For the remaining fields of science, the power-law distribution is a plausible model, but the differences between the power law and alternative models are usually statistically insignificant. The paper also confirmed recent findings of Albarrán and Ruiz-Castillo [17] that power laws in citation distributions, when they are plausible, account only for a very small fraction of the published papers (less than 1% for most fields of science) and that the power-law exponent is substantially higher than found in the older literature.

Acknowledgments

I would like to gratefully acknowledge the use of the Matlab and R software accompanying the papers by Clauset et al. [4] and Shalizi [29]. Any remaining errors are my responsibility.

References

1. Lotka A (1926) The frequency distribution of scientific productivity. Journal of Washington Academy Sciences 16: 317–323.
2. Egghe L (2005) Power laws in the information production process: Lotkaian informetrics. Oxford: Elsevier.
3. Newman ME (2005) Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46: 323–351.
4. Clauset A, Shalizi CR, Newman ME (2009) Power-law distributions in empirical data. SIAM Review 51: 661–703.
5. Gabaix X (2009) Power laws in economics and finance. Annual Review of Economics 1: 255–294.
6. Baayen RH (2001) Word frequency distributions. Dordrecht: Kluwer.
7. de Solla Price D (1965) Networks of scientific papers. Science 149: 510–515.
8. de Solla Price D (1976) A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science 27: 292–306.
9. Redner S (1998) How popular is your paper? An empirical study of the citation distribution. The European Physical Journal B 4: 131–134.
10. Redner S (2005) Citation statistics from 110 years of Physical Review. Physics Today 58: 49–54.
11.
Peterson GJ, Pressé S, Dill KA (2010) Nonuniversal power law scaling in the probability distribution of scientific citations. Proceedings of the National Academy of Sciences 107: 16023–16027.
12. Van Raan AF (2006) Statistical properties of bibliometric indicators: Research group indicator distributions and correlations. Journal of the American Society for Information Science and Technology 57: 408–430.
13. Lehmann S, Lautrup B, Jackson A (2003) Citation networks in high energy physics. Physical Review E 68: 026113.
14. Laherrère J, Sornette D (1998) Stretched exponential distributions in nature and economy: fat tails with characteristic scales. The European Physical Journal B 2: 525–539.
15. Eom YH, Fortunato S (2011) Characterizing and modeling citation dynamics. PLoS One 6: e24926.
16. Golosovsky M, Solomon S (2012) Runaway events dominate the heavy tail of citation distributions. The European Physical Journal Special Topics 205: 303–311.
17. Albarrán P, Ruiz-Castillo J (2011) References made and citations received by scientific articles. Journal of the American Society for Information Science and Technology 62: 40–49.
18. Albarrán P, Crespo JA, Ortuño I, Ruiz-Castillo J (2011) The skewness of science in 219 sub-fields and a number of aggregates. Scientometrics 88: 385–397.
19. Albarrán P, Crespo JA, Ortuño I, Ruiz-Castillo J (2011) The skewness of science in 219 sub-fields and a number of aggregates. Working paper 11-09, Universidad Carlos III.
20. López-Illescas C, de Moya-Anegón F, Moed HF (2008) Coverage and citation impact of oncological journals in the Web of Science and Scopus. Journal of Informetrics 2: 304–316.
21. Aghaei Chadegani A, Salehi H, Md Yunus M, Farhadi H, Fooladi M, et al. (2013) A comparison between two main academic literature collections: Web of Science and Scopus databases.
Asian So cial Science 9 : 18–26 . 22. Li Y, Ruiz-Castillo J (2 013) The impact of e x treme o bserv ations in citation distributions. T echnical rep ort, Universidad Carlos II I, Departamento de Econom ´ ıa. 13 23. Stringer MJ, Sales-Pardo M, Amaral LAN (200 8 ) Eﬀectiveness of journal ranking s chemes as a to ol fo r lo ca ting infor mation. PLoS One 3: e1683 . 24. Radicc hi F, F ortunato S, Ca stellano C (2008 ) Universality o f cita tion distributions: T ow ar d an ob jective measure o f scie n tiﬁc impact. Pro ceedings of the National Academy of Sciences 105 : 17268 –172 7 2. 25. Tsallis C, de Albuquerque MP (2000) Are citations of scientiﬁc pa p er s a ca se of nonextensivity? The E ur op ean Physical Journal B 13 : 777–78 0. 26. Anastasiadis AD, de Albuquerque MP , de Albuquerque MP , Mussi DB (201 0) Ts a llis q -exp onential describ es the distr ibution of scientiﬁc citations - a new characterization of the impact. Scientomet- rics 83: 205– 218. 27. W allace ML, Larivi` ere V, Gingras Y (2 0 09) Mo deling a c e n tury of citation distr ibutions. Jo ur nal of Infor metrics 3: 296– 303. 28. V uong QH (1989 ) Likeliho o d ratio tests for mo del sele c tio n and non-nested hypotheses. Econo- metrica 57: 307– 333. 29. Shalizi CR (2 007) Maximum likelihoo d es timation for q-exp o ne ntial (Tsa llis) distributions. T ech- nical r ep ort, arXiv preprint math/070 1 854. 
Figure Legends

Figure 1. The complementary cumulative distribution functions (blue circles) and best power-law fits (dashed black line) for citation distributions that did not pass the goodness-of-fit test, Scopus, 1998–2002, 5-year citation window.
[Figure not reproduced. Panels, each on log-log axes: Arts and Humanities; Business; Computer Science; Decision Sciences; Earth Sciences; Economics; Engineering; Immunology; Mathematics; Medicine; Social Sciences; Veterinary; Dentistry.]

Tables

Table 1. Definitions of alternative discrete distributions.

| Distribution name | Probability distribution function |
|---|---|
| Exponential | $(1 - e^{-\lambda})\, e^{\lambda x_0}\, e^{-\lambda x}$ |
| Stretched exponential (Weibull) | $\left[\sum_{k=x_0}^{\infty} \left(q^{k^{\beta}} - q^{(k+1)^{\beta}}\right)\right]^{-1} \left(q^{x^{\beta}} - q^{(x+1)^{\beta}}\right)$ |
| Log-normal | $\sqrt{\frac{2}{\pi\sigma^{2}}} \left[\operatorname{erfc}\!\left(\frac{\ln x_0 - \mu}{\sqrt{2}\,\sigma}\right)\right]^{-1} \frac{1}{x} \exp\!\left[-\frac{(\ln x - \mu)^{2}}{2\sigma^{2}}\right]$ |
| Tsallis | $\left[\sum_{k=x_0}^{\infty} (1 + k/\sigma)^{-\theta-1}\right]^{-1} (1 + x/\sigma)^{-\theta-1}$ |
| Yule | $(\alpha - 1)\, \frac{\Gamma(x_0 + \alpha - 1)}{\Gamma(x_0)}\, \frac{\Gamma(x)}{\Gamma(x + \alpha)}$ |
| Power law with exponential cut-off | $\left(\sum_{k=x_0}^{\infty} k^{-\alpha} e^{-\lambda k}\right)^{-1} x^{-\alpha} e^{-\lambda x}$ |

Note: The distributions have been normalized to ensure that the total probability in the domain $[x_0, +\infty)$ is 1. The discrete log-normal distribution is approximated by rounding the continuous log-normally distributed reals to the nearest integers. For the Tsallis distribution, we use a parametrization considered by Shalizi [29].

Table 2. Descriptive statistics for citation distributions, Scopus, 1998–2002, 5-year citation window.

| Scopus subject area of science | Total number of papers | No. of papers in the sample | % of all papers in the sample | Mean no. of citations | Std. Dev. of citations | Max. no. of citations |
|---|---|---|---|---|---|---|
| Agricultural and Biological Sciences | 372575 | 99804 | 26.8 | 15.17 | 14.36 | 628 |
| Arts and Humanities | 47191 | 47074 | 99.8 | 1.256 | 3.357 | 91 |
| Biochemistry, Genetics and Molecular Biology | 636421 | 99819 | 15.7 | 49.09 | 46.29 | 3118 |
| Business, Management and Accounting | 61211 | 61156 | 99.9 | 3.452 | 7.273 | 287 |
| Chemical Engineering | 158673 | 98989 | 62.4 | 7.232 | 9.236 | 344 |
| Chemistry | 416660 | 99398 | 23.9 | 21.07 | 21.17 | 1065 |
| Computer Science | 134179 | 99933 | 74.5 | 6.44 | 18.13 | 2737 |
| Decision Sciences | 27409 | 27393 | 99.9 | 3.467 | 5.496 | 143 |
| Earth and Planetary Sciences | 228197 | 99788 | 43.7 | 14.1 | 17.03 | 1195 |
| Economics, Econometrics and Finance | 49645 | 49559 | 99.8 | 4.652 | 8.653 | 287 |
| Energy | 67076 | 66378 | 99.0 | 2.553 | 5.596 | 334 |
| Engineering | 439719 | 99765 | 22.7 | 11.77 | 15.83 | 971 |
| Environmental Science | 186898 | 99847 | 53.4 | 10.72 | 11.27 | 730 |
| Immunology and Microbiology | 195339 | 99858 | 51.1 | 22.11 | 25.11 | 926 |
| Materials Science | 331310 | 99591 | 30.1 | 12.48 | 14.49 | 697 |
| Mathematics | 193740 | 99922 | 51.6 | 6.912 | 11.38 | 929 |
| Medicine | 1191154 | 99823 | 8.4 | 48.55 | 60.14 | 4365 |
| Neuroscience | 445181 | 99886 | 22.4 | 18.97 | 20.39 | 771 |
| Nursing | 51283 | 50464 | 98.4 | 5.274 | 12.07 | 518 |
| Pharmacology, Toxicology and Pharmaceutics | 179427 | 99757 | 55.6 | 12.19 | 12.28 | 347 |
| Physics and Astronomy | 541328 | 99817 | 18.4 | 24.75 | 31.64 | 3118 |
| Psychology | 104449 | 99736 | 95.5 | 7.446 | 11.55 | 377 |
| Social Sciences | 215410 | 99890 | 46.4 | 6.148 | 8.055 | 519 |
| Veterinary | 53203 | 53117 | 99.8 | 3.637 | 5.843 | 128 |
| Dentistry | 27470 | 27437 | 99.9 | 4.943 | 6.736 | 115 |
| Health Professions | 75491 | 75414 | 99.9 | 7.272 | 11.49 | 348 |
| Multidisciplinary | 50287 | 50226 | 99.9 | 30.38 | 76.08 | 5187 |
| All Sciences | 6480926 | 2203841 | 34.0 | 14.92 | 27.74 | 5187 |

Table 3. Power-law fits to citation distributions, Scopus, 1998–2002, 5-year citation window.

| Scopus subject area of science | $\hat{x}_0$ | $\hat{\alpha}$ | No. of power-law papers | % of total papers | p-value |
|---|---|---|---|---|---|
| Agricultural and Biological Sciences | 92 (15.1) | 4.19 (0.25) | 488 | 0.1 | 0.566 |
| Arts and Humanities | 14 (5.4) | 3.46 (0.47) | 655 | 1.4 | 0.005 |
| Biochemistry, Genetics and Molecular Biology | 148 (28.0) | 3.72 (0.13) | 2813 | 0.4 | 0.175 |
| Business, Management and Accounting | 24 (10.1) | 3.4 (0.38) | 1339 | 2.2 | 0.000 |
| Chemical Engineering | 38 (6.7) | 4.01 (0.19) | 1418 | 0.9 | 0.099 |
| Chemistry | 41 (7.1) | 3.4 (0.05) | 8193 | 2.0 | 0.110 |
| Computer Science | 26 (10.6) | 2.78 (0.11) | 3989 | 3.0 | 0.000 |
| Decision Sciences | 12 (4.0) | 3.36 (0.24) | 1596 | 5.8 | 0.000 |
| Earth and Planetary Sciences | 36 (8.9) | 3.37 (0.09) | 5834 | 2.6 | 0.000 |
| Economics, Econometrics and Finance | 21 (10.2) | 3.13 (0.36) | 1995 | 4.0 | 0.000 |
| Energy | 32 (5.4) | 3.91 (0.22) | 356 | 0.5 | 0.825 |
| Engineering | 26 (9.4) | 3.14 (0.09) | 7986 | 1.8 | 0.000 |
| Environmental Science | 63 (10.3) | 4.33 (0.22) | 624 | 0.3 | 0.506 |
| Immunology and Microbiology | 78 (13.6) | 3.48 (0.10) | 2713 | 1.4 | 0.049 |
| Materials Science | 43 (8.9) | 3.47 (0.11) | 2687 | 0.8 | 0.193 |
| Mathematics | 24 (4.0) | 3.11 (0.06) | 4152 | 2.1 | 0.012 |
| Medicine | 59 (16.3) | 3.07 (0.04) | 20163 | 1.7 | 0.000 |
| Neuroscience | 135 (28.4) | 4.69 (0.41) | 423 | 0.1 | 0.896 |
| Nursing | 60 (15.7) | 3.68 (0.40) | 439 | 0.9 | 0.256 |
| Pharmacology, Toxicology and Pharmaceutics | 56 (6.8) | 4.1 (0.12) | 1215 | 0.7 | 0.865 |
| Physics and Astronomy | 61 (6.5) | 3.35 (0.04) | 5034 | 0.9 | 0.797 |
| Psychology | 52 (8.8) | 3.9 (0.17) | 1060 | 1.0 | 0.812 |
| Social Sciences | 24 (6.4) | 3.56 (0.15) | 2963 | 1.4 | 0.007 |
| Veterinary | 23 (4.0) | 4.09 (0.27) | 858 | 1.6 | 0.017 |
| Dentistry | 20 (2.4) | 3.89 (0.18) | 1012 | 3.7 | 0.011 |
| Health Professions | 49 (10.2) | 3.85 (0.24) | 942 | 1.2 | 0.352 |
| Multidisciplinary | 209 (40.4) | 3.24 (0.14) | 1147 | 2.8 | 0.100 |
| All Sciences | 186 (46.3) | 3.45 (0.10) | 6364 | 0.2 | 0.076 |

Note: standard errors are given in parentheses.

Table 4. Model selection tests for citation distributions, Scopus, 1998–2002, 5-year citation window.

| Scopus subject area of science | Power law p-value | Exponential LR | Exponential p | Weibull LR | Weibull p | Log-normal LR | Log-normal p | Tsallis LR | Tsallis p | Yule LR | Yule p | PL with cut-off NLR | PL with cut-off p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Agricultural and Biological Sciences | 0.566 | 20.740 | 0.009 | 0.338 | 0.779 | -0.096 | 0.782 | 0.054 | 0.890 | -0.011 | 0.858 | -0.268 | 0.464 |
| Arts and Humanities | 0.005 | 6.287 | 0.457 | -6.93 | 0.023 | -6.56 | 0.025 | -4.325 | 0.189 | -1.38 | 0.000 | -7.37 | 0.000 |
| Biochemistry, Genetics and Molecular Biology | 0.175 | 204.5 | 0.000 | 1.22 | 0.758 | -1.12 | 0.473 | -1.227 | 0.479 | -0.155 | 0.108 | -0.567 | 0.287 |
| Business, Management and Accounting | 0.000 | 34.390 | 0.034 | -9.60 | 0.013 | -9.24 | 0.013 | -7.279 | 0.065 | -1.39 | 0.000 | -9.98 | 0.000 |
| Chemical Engineering | 0.099 | 69.480 | 0.001 | -0.021 | 0.994 | -0.972 | 0.480 | 0.025 | 0.990 | -0.358 | 0.187 | -0.78 | 0.211 |
| Chemistry | 0.110 | 736.0 | 0.000 | 7.48 | 0.262 | -2.67 | 0.204 | 1.290 | 0.687 | -0.999 | 0.060 | -3.31 | 0.010 |
| Computer Science | 0.000 | 609.4 | 0.000 | -7.05 | 0.248 | -8.80 | 0.035 | -6.719 | 0.132 | -2.00 | 0.000 | -5.23 | 0.001 |
| Decision Sciences | 0.000 | 77.730 | 0.001 | -6.71 | 0.046 | -6.81 | 0.048 | -0.0275 | 0.956 | -2.66 | 0.000 | -5.91 | 0.001 |
| Earth and Planetary Sciences | 0.000 | 459.7 | 0.000 | -4.69 | 0.451 | -7.52 | 0.045 | -4.928 | 0.264 | -1.95 | 0.000 | -5.69 | 0.001 |
| Economics, Econometrics and Finance | 0.000 | 45.080 | 0.021 | -21.6 | 0.000 | -20.4 | 0.000 | -17.027 | 0.002 | -2.68 | 0.000 | -22.9 | 0.000 |
| Energy | 0.825 | 20.630 | 0.065 | 0.357 | 0.789 | -0.072 | 0.838 | 0.347 | 0.690 | -0.023 | 0.884 | -0.119 | 0.625 |
| Engineering | 0.000 | 825.5 | 0.000 | - | - | -7.98 | 0.032 | -0.763 | 0.877 | -2.71 | 0.000 | -7.52 | 0.000 |
| Environmental Science | 0.506 | 26.730 | 0.104 | 0.003 | 0.999 | -0.422 | 0.685 | -0.333 | 0.793 | -0.114 | 0.334 | -0.18 | 0.547 |
| Immunology and Microbiology | 0.049 | 170.3 | 0.000 | -1.85 | 0.539 | -2.48 | 0.176 | -1.111 | 0.496 | -0.268 | 0.076 | -3.98 | 0.005 |
| Materials Science | 0.193 | 233.4 | 0.000 | 2.02 | 0.610 | -1.02 | 0.460 | -0.034 | 0.987 | -0.412 | 0.178 | -0.850 | 0.192 |
| Mathematics | 0.012 | 414.8 | 0.000 | -1.54 | 0.784 | -4.97 | 0.083 | -0.264 | 0.943 | -1.56 | 0.007 | -5.19 | 0.001 |
| Medicine | 0.000 | 2740.0 | 0.000 | - | - | -7.78 | 0.043 | -4.566 | 0.309 | -2.03 | 0.000 | -5.62 | 0.001 |
| Neuroscience | 0.896 | 11.920 | 0.072 | -0.018 | 0.987 | -0.178 | 0.726 | -0.066 | 0.888 | -0.020 | 0.637 | -0.285 | 0.451 |
| Nursing | 0.256 | 21.520 | 0.012 | -0.284 | 0.803 | -0.372 | 0.580 | -0.048 | 0.936 | -0.045 | 0.565 | -0.733 | 0.226 |
| Pharmacology, Toxicology and Pharmaceutics | 0.865 | 47.520 | 0.000 | -0.361 | 0.844 | -0.747 | 0.449 | -0.002 | 0.999 | -0.148 | 0.337 | -1.24 | 0.115 |
| Physics and Astronomy | 0.797 | 706.2 | 0.000 | 19.5 | 0.006 | 0.048 | 0.646 | 0.954 | 0.495 | 0.091 | 0.771 | 0.000 | 1.000 |
| Psychology | 0.812 | 53.220 | 0.000 | 0.186 | 0.920 | -0.460 | 0.562 | 0.129 | 0.904 | -0.112 | 0.475 | -0.791 | 0.208 |
| Social Sciences | 0.007 | 173.3 | 0.000 | -3.56 | 0.366 | -4.27 | 0.114 | 0.0774 | 0.983 | -1.43 | 0.007 | -4.21 | 0.004 |
| Veterinary | 0.017 | 38.090 | 0.000 | 0.841 | 0.598 | -0.183 | 0.677 | 1.953 | 0.330 | -0.047 | 0.874 | -0.542 | 0.298 |
| Dentistry | 0.011 | 11.830 | 0.200 | -6.60 | 0.025 | -6.26 | 0.028 | -3.714 | 0.257 | -1.28 | 0.000 | -7.14 | 0.000 |
| Health Professions | 0.352 | 38.620 | 0.001 | -0.944 | 0.599 | -1.10 | 0.352 | -0.395 | 0.760 | -0.192 | 0.189 | -1.63 | 0.071 |
| Multidisciplinary | 0.100 | 98.560 | 0.001 | -1.37 | 0.595 | -1.67 | 0.339 | -1.497 | 0.377 | -0.067 | 0.069 | -1.44 | 0.090 |
| All Sciences | 0.076 | 672.3 | 0.000 | 18.30 | 0.009 | -0.125 | 0.797 | -0.007 | 0.992 | -0.054 | 0.625 | -0.240 | 0.488 |

Note: The second column gives the p-value for the hypothesis that the data follow a power-law model. "-" means that the maximum likelihood estimator did not converge. Positive values of the log-likelihood ratio (LR) or the normalized log-likelihood ratio (NLR) indicate that the power-law model is favored over the alternative.
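The sign convention for the likelihood ratio in Table 4 can be illustrated with a minimal Python sketch. It is not the Matlab or R code used for the paper. It assumes the standard discrete power law $x^{-\alpha}/\zeta(\alpha, x_0)$ and the discrete exponential of Table 1, fits both by maximum likelihood, and applies a Vuong-type test [28] for non-nested models. The function names, the search interval for $\alpha$, and the toy sample are hypothetical choices made only for this illustration.

```python
import math

def hurwitz_zeta(alpha, x0, terms=1000):
    """Approximate sum over x >= x0 of x**(-alpha), for alpha > 1."""
    head = sum((x0 + k) ** -alpha for k in range(terms))
    a = x0 + terms
    # Euler-Maclaurin tail: integral term plus half of the first omitted term.
    tail = a ** (1.0 - alpha) / (alpha - 1.0) + 0.5 * a ** -alpha
    return head + tail

def powerlaw_logpmf(x, alpha, x0):
    # Discrete power law on x >= x0 (assumed form, not listed in Table 1).
    return -alpha * math.log(x) - math.log(hurwitz_zeta(alpha, x0))

def exponential_logpmf(x, lam, x0):
    # Table 1: (1 - e^{-lam}) e^{lam x0} e^{-lam x}.
    return math.log1p(-math.exp(-lam)) + lam * (x0 - x)

def fit_exponential(data, x0):
    # Closed-form MLE; requires at least one observation above x0.
    mean_excess = sum(x - x0 for x in data) / len(data)
    return math.log1p(1.0 / mean_excess)

def fit_powerlaw(data, x0, lo=1.5, hi=6.0, iters=60):
    # Ternary search on the concave log-likelihood; [lo, hi] is an assumption.
    n = len(data)
    sum_log = sum(math.log(x) for x in data)
    def loglik(alpha):
        return -n * math.log(hurwitz_zeta(alpha, x0)) - alpha * sum_log
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

def vuong_test(data, logpmf_a, logpmf_b):
    """Return (LR, normalized LR, two-sided p) for model a against model b."""
    d = [logpmf_a(x) - logpmf_b(x) for x in data]
    n = len(d)
    lr = sum(d)
    mean = lr / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in d) / n)
    nlr = lr / (sd * math.sqrt(n))
    p = math.erfc(abs(nlr) / math.sqrt(2.0))
    return lr, nlr, p

# Made-up heavy-tailed sample with x0 = 1 (illustration only, not Scopus data).
data = [1] * 75 + [2] * 13 + [3] * 5 + [4] * 2 + [5, 6, 8, 12, 40]
x0 = 1
alpha = fit_powerlaw(data, x0)
lam = fit_exponential(data, x0)
lr, nlr, p = vuong_test(
    data,
    lambda x: powerlaw_logpmf(x, alpha, x0),
    lambda x: exponential_logpmf(x, lam, x0),
)
print(f"alpha={alpha:.3f} lambda={lam:.3f} LR={lr:.2f} NLR={nlr:.2f} p={p:.3f}")
```

With the power law passed as the first model, a positive LR means that the power law is favored over the alternative, which matches the reading of Table 4. The p-value indicates whether the sign of LR is statistically distinguishable from zero. A nested comparison, such as the power law against its exponential cut-off version, would call for a different test and is not covered by this sketch.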
