Caveats for using statistical significance tests in research assessments


Authors: Jesper W. Schneider

Jesper W. Schneider
Danish Centre for Studies in Research and Research Policy, Department of Political Science & Government, Aarhus University, Finlandsgade 4, DK-8200 Aarhus N, Denmark
jws@cfa.au.dk
Telephone: 0045 8716 5241
Fax: 0045 8716 4341

Abstract

This article raises concerns about the supposed advantages of using statistical significance tests in research assessments, as recently suggested in the debate about proper normalization procedures for citation indicators by Opthof and Leydesdorff (2010). Statistical significance tests are highly controversial and numerous criticisms have been leveled against their use. Based on examples from articles by proponents of the use of statistical significance tests in research assessments, we address some of the numerous problems with such tests. The issues specifically discussed are the ritual practice of such tests, their dichotomous application in decision making, the difference between statistical and substantive significance, the implausibility of most null hypotheses, the crucial assumption of randomness, as well as the utility of standard errors and confidence intervals for inferential purposes. We argue that applying statistical significance tests and mechanically adhering to their results is highly problematic and detrimental to critical thinking. We claim that the use of such tests does not provide any advantages in deciding whether differences between citation indicators are important or not; on the contrary, their use may be harmful. Like many other critics, we generally believe that statistical significance tests are over- and misused in the empirical sciences, including scientometrics, and we encourage a reform on these matters.

Highlights:
• We warn against the use of statistical significance tests (NHST) in research assessments.
• We introduce the controversial debate on NHST to the informetric community.
• We demonstrate some of the numerous flaws, misconceptions and misuses of NHST.
• We discuss potential alternatives and conclude that no "easy fixes" exist.
• We advocate informed judgment, free of the NHST ritual, in decision processes.

Acknowledgement. The author would like to thank the three anonymous reviewers for their very useful comments on previous drafts of this article.

1. Introduction

In a recent article, Opthof and Leydesdorff (2010; hereafter O&L) make several claims against the validity of the journal and field normalization procedures applied in the so-called "crown indicator" developed by the Centre for Science and Technology Studies (CWTS) at Leiden University in the Netherlands. Like Lundberg (2007) before them, O&L suggest a normalization procedure based on a sum of ratios instead of the ratio of sums used in the "crown indicator". While Lundberg (2007) and O&L give different reasons for such a normalization approach, they commonly argue that it is a statistically sounder procedure. O&L for their part argue that, contrary to the "crown indicator", their proposed normalization procedure, which follows the arithmetic order of operations, provides a distribution with statistics that can be applied for statistical significance tests. This claim is repeated in Leydesdorff and Opthof (2010), as well as Leydesdorff et al. (2011); indeed, in all these articles, Leydesdorff and co-authors distinctly indicate that significance tests are important, advantageous and somewhat necessary in order to detect "significant" differences between the units assessed.
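The two normalization procedures at issue can be stated concretely. Below is a minimal sketch, using toy citation counts and hypothetical expected baseline rates (not CWTS or O&L data), of the ratio of sums versus the sum (mean) of ratios:

```python
# Toy sketch of the two normalization procedures debated by O&L.
# `cites` are citation counts for a unit's papers; `expected` are
# hypothetical field/journal baseline citation rates.
cites = [10, 0, 4, 2]
expected = [5.0, 2.0, 4.0, 1.0]

# "Crown indicator" style: ratio of sums
ratio_of_sums = sum(cites) / sum(expected)

# Lundberg / O&L style: mean of the per-paper ratios
mean_of_ratios = sum(c / e for c, e in zip(cites, expected)) / len(cites)

print(ratio_of_sums, mean_of_ratios)  # the two scores generally differ
```

The sum-of-ratios score is a mean over per-paper values, which is what yields a distribution of observations, and hence standard errors, in O&L's argument.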
O&L's critique and proposals are interesting, and they have raised a heated but needed debate in the community (e.g., Bornmann, 2010; Moed, 2010; Spaan, 2010; Van Raan et al., 2010; Gingras & Larivière, 2011; Larivière & Gingras, 2011; Waltman et al., 2011a). We are sympathetic to the claims that sums of ratios are advantageous; nevertheless, on at least one important point we think that O&L's claims are flawed, and that is the role of statistical significance tests. The authors seem to ignore the numerous criticisms raised against statistical significance tests over several decades in numerous empirical fields within the social, behavioral, medical and life sciences, for example in psychology and education (Rozeboom, 1960; Bakan, 1966; Carver, 1978; Meehl, 1978; 1990; Oakes, 1986; Cohen, 1990; 1994; Gigerenzer, 1993; Schmidt & Hunter, 1997), sociology (Morrison & Henkel, 1969), economics (McCloskey, 1985; McCloskey & Ziliak, 1996), clinical medicine and epidemiology (Rothman, 1986; Goodman, 1993; 2008; Stang, Poole & Kuss, 2010), as well as statistics proper (Berkson, 1938; Cox et al., 1977; Kruskal, 1978; Guttman, 1985; Tukey, 1991; Krantz, 1999), and recently Marsh, Jayasinghe and Bond (2011) in this journal, to name just a few non-Bayesian critical works out of literally hundreds. Unawareness of the criticisms leveled against significance tests seems to be the standard in many empirical disciplines (e.g., Huberty & Pike, 1999). For many decades the substantial criticisms have been neglected, a fact Rozeboom (1997) has called a "sociology-of-science wonderment" (p. 335).
Only recently, at least in some disciplines, e.g., medicine, psychology and ecology, has the criticism slowly begun to have an effect on some researchers, journal editors, and in guidelines and textbooks, but the effect is still scanty (e.g., Wilkinson et al., 1999; Fidler & Cumming, 2007). Statistical significance tests are highly controversial. They are surrounded by myths. They are overused and very often misunderstood and misused (for a fine overview, see Kline, 2004). Criticisms are numerous. Some point to the inherent logical flaws in statistical significance tests (e.g., Cohen, 1994). Others claim that such tests have no scientific relevance; in fact they may be harmful (e.g., Armstrong, 2007). Others have documented a whole catalogue of misinterpretations of statistical significance tests and especially the p value (e.g., Oakes, 1986; Goodman, 2008). Still others have documented various misuses, such as neglecting statistical power, indifference to randomness, adherence to a mechanical ritual, arbitrary significance levels forcing dichotomous decision making, and implausible nil null hypotheses, to name some (e.g., Gigerenzer, 1993; Shaver, 1993). Rothman (1986) exemplifies the critical perspective:

Testing for statistical significance today continues not on its merits as a methodological tool but on the momentum of tradition. Rather than serving as a thinker's tool, it has become for some a clumsy substitute for thought, subverting what should be a contemplative exercise into an algorithm prone to error (p. 445).

Alternatives and supplements to significance tests have been suggested, among these for example effect size estimation and confidence intervals, power analyses, and study replications (e.g., Kirk, 1996).
Interestingly, relatively few have defended statistical significance tests, and those who have agree with many of the criticisms leveled against such tests (e.g., Abelson, 1997; Cortina & Dunlap, 1997; Chow, 1998; Wainer, 1999). The defenders do, however, claim that most of these failings are due to humans and that significance tests can play a role, albeit a limited one, in research. Critics will have none of this, as history testifies that the so-called limited role is not practicable; people continue to overuse, misunderstand and misuse such tests. To critics, statistical significance tests have led the social sciences astray, and scientific research can and should live without them (e.g., Carver, 1978; Armstrong, 2007). The aim of the present article is to warn against what Gigerenzer (2004) calls "mindless statistics" and the "null ritual". We argue that applying statistical significance tests and mechanically adhering to their results in research, and more specifically in research assessments as suggested by O&L, is highly problematic and detrimental to critical (scientific) thinking. We claim that the use of such tests does not provide any advantages in deciding whether differences between citation indicators are important or not; on the contrary, their use may be harmful. Like many other critics, we generally believe that statistical significance tests are so problematic that reform is urgently needed (see for example Cumming, 2012). Centered on examples mainly from O&L (Opthof & Leydesdorff, 2010), we address some of the numerous problems of such tests. It is important to emphasize that the fallacies we discuss here are extremely common in the social sciences and not distinctive of the particular article by O&L that we scrutinize.
To emphasize this we provide further brief examples haphazardly retrieved from the recent scientometric literature. The reason we specifically respond to O&L's article is a grave concern that such a flawed "ritual" should be used in relation to the already sensitive issue of research assessment based on citation indicators, as well as a reaction to the argument that such tests are supposed to be advantageous in that respect. This article is not an exhaustive review of all problems and criticisms leveled at statistical significance tests; we simply do not have the space for that, and several such reviews already exist (e.g., Oakes, 1986; Nickerson, 2000; Kline, 2004). Thus we only address some of the problems and controversies, prompted by their appearance in the article by O&L. They do not come in any natural order and are intrinsically related. We have organized the article according to the problems addressed. First we outline our understanding of how O&L approach significance tests in their article. In the second section, we proceed with some of the caveats related to the use of statistical significance tests. We outline the practice of significance tests, and we discuss some of the misconceptions and misuses, including the assumption of randomness and the utility of standard errors and confidence intervals. We conclude with a summary and some recommendations for best practice, as inspiration for authors, reviewers and editors.

2. The conception and application of significance tests in the article by O&L

In this section we outline how O&L employ significance tests in their article, how they seemingly conceive of "significance", and how they present their arguments for the supposed advantages of such tests. Notice, O&L's reasoning based upon statistical significance tests has briefly been criticized in Waltman et al. (2011b).
In the present article we elaborate on these matters. O&L state that if we average over the aggregate then we can "test for the significance of the deviation of the test set from the reference set" (2010, p. 424). O&L thus claim that the normalization procedure they suggest (i.e., sum of ratios) enables statistical significance testing and that the latter can decide whether citation scores deviate "significantly" from the baseline. According to O&L this is a clear advantage, and we assume that they consider this process somewhat objective. In order to compare their normalization procedure to that of CWTS, and to demonstrate the claimed advantages of significance tests, O&L produce a number of empirical examples combined with test statistics. Differences in journal normalizations are first explored in a smaller data set. Next, a "real-life" data set from the Academic Medical Centre (AMC) in the Netherlands is used to compare the relative citation scores for one scientist based on the two different normalization procedures, and subsequently to compare the effects of the different normalizations on values and ranks for 232 scholars at AMC.¹ The AMC data set is further explored in Leydesdorff and Opthof (2010a; 2010b). O&L use p values in connection with Pearson and Spearman correlation coefficients, as well as the Kruskal-Wallis test used with Bonferroni corrections. In the latter case the significance level is given as 5%, whereas 1% and 5% are used with the correlation statistics. Also, citation scores based on O&L's normalization approach come with standard errors of the mean in the articles.

¹ Using the Academic Medical Centre for the demonstration is interesting since CWTS has produced and delivered relative citation indicators in a previous evaluation of the centre.
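O&L's reporting style, a mean score with a standard error of the mean checked against the "world average" of 1, can be sketched as follows. The per-paper ratios below are illustrative numbers, not the AMC data:

```python
import statistics as st

# Hypothetical per-paper citation ratios (observed/expected) for one scientist.
ratios = [0.4, 1.6, 0.3, 1.2, 0.8, 1.1, 0.2, 1.5]

score = st.mean(ratios)                      # sum-of-ratios score
sem = st.stdev(ratios) / len(ratios) ** 0.5  # standard error of the mean

# Surrogate "test": does the interval subsume the world average of 1?
lo, hi = score - 1.96 * sem, score + 1.96 * sem
print(f"{score:.2f} (±{sem:.2f})", "contains 1" if lo <= 1 <= hi else "excludes 1")
```

Whether such an interval check is a sound basis for inference is exactly what is at issue in the sections that follow.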
It is a central point in O&L's argument that their normalization approach produces a statistic where uncertainty (i.e., random error) can be estimated by providing standard errors, and the argument goes: "[i]f the normalization is performed as proposed by us, the score is 0.91 (±0.11) and therewith not significantly different" (p. 426). This quote is exemplary of O&L's treatment of statistical significance tests and their apparent implicit or explicit view of "significance". First, O&L consider the question of "significance" as a dichotomous decision: either a result is significant or not. Second, their rhetoric suggests that "significance" implies importance, or rather lack of importance in this case, as the "world average" citation score of 1 is treated as a point null hypothesis, and since 1 is located within the confidence limits they cannot reject the null hypothesis, concluding that there is no "significant difference" from the "world average". Notice also that O&L use standard errors as a surrogate for significance tests by determining whether the estimated interval subsumes the "world average" citation score or not. We claim that the approach to statistical significance testing described above is common among social scientists. Nevertheless, it requires some critical comments because it is deeply entangled in the quagmire of problems relating to significance tests, and if applied as suggested by O&L in research assessments, it may distort the decision making process and have serious consequences for those assessed. The next section addresses some of these problems and controversies.

3. Some caveats related to statistical significance tests

In this section we address some important problems in relation to statistical significance tests on the basis of practice, arguments and claims in O&L.
First we briefly discuss what statistical significance tests are and outline their ritualistic practice. Second, we define effect size and statistical power. Subsequently we address some common misinterpretations of statistical significance tests and, closely related to this, we discuss the mechanical dichotomous decision process that significance tests usually lead to. The following subsection discusses some misuses of statistical significance tests, especially the implausibility of most nil null hypotheses. This leads to a discussion of one of the crucial assumptions behind such tests, randomness; and finally, we address the issue of standard errors and confidence intervals and their supposed advantages compared to p values.

3.1 The purpose and practice of statistical significance tests

The dominant approach to statistical significance testing is an unusual hybrid of two fundamentally different frequentist approaches to statistical inference, Ronald A. Fisher's "inductive inference" and Jerzy Neyman and Egon Pearson's "inductive behavior" (Gigerenzer et al., 1989). According to Gigerenzer (2004, p. 588), most hybrid significance tests are performed as a "null ritual", where:

• A statistical null hypothesis of "no difference" or "zero correlation" in the population is set up, sometimes called a nil null hypothesis. Predictions of the research hypothesis or any alternative substantive hypotheses are not specified. Notice, other hypotheses to be nullified, such as directional, non-zero or interval estimates, are possible but seldom used, hence the "null ritual".
• An arbitrary but conventional 5% significance level (or lower) is used for rejecting the null hypothesis. If the result is "significant" the research hypothesis is accepted. Results are reported as p < .05, p < .01, or p < .001 (whichever comes next to the obtained p value).
Notice, other significance levels can be used.
• This procedure is always performed.

While the "null ritual" has refined aspects, these do not change the essence of the ritual, which is identical for all statistical significance tests in the frequentist tradition. Statistical significance tests in this hybrid tradition are also popularly known as null hypothesis significance tests (NHST). NHST produces a probability value (p value). The definition of the p value is as follows:

• The probability of the observed data, plus more extreme data across all possible random samples, if the null hypothesis is true, given randomness² and a sample size of n (i.e., the sample size used in the particular study), and all assumptions of the test statistic are satisfied (e.g., Goodman, 2008, p. 136).

The general form can be written as p(Data | H0). While the mathematical definition of the p value is rather simple, its meaning has proven very difficult to interpret correctly. Carver (1978), Kline (2004) and Goodman (2008) list many misconceptions about p values, for example the incorrect interpretation that if p = .05, the null hypothesis has only a 5% chance of being true. As the p value is calculated under the assumption that the null hypothesis is true, it cannot simultaneously be a probability that the null hypothesis is false. We are not to blame for this confusion; Fisher himself could not explain the inferential meaning of his own invention (Goodman, 2008).

The individual elements of the above statement about NHST are very important, though often neglected or ignored. First, it is important to realize that p values are conditional probabilities that should be interpreted from an objective frequentist philosophy of probability, i.e., a relative frequency "in-the-long-run" perspective (von Mises, 1928). Because the "long-run" relative frequency is a property of all events in the collective³, it follows that a probability applies to a collective and not a single event (Dienes, 2008). Neither do probabilities apply to the truth of hypotheses, as a hypothesis is not a collective. Consequently, a p value is not a probability of a single result; it is a conditional probability "in the long run". This can also be inferred from the definition above: "... the observed data, plus more extreme data across all possible random samples". More extreme data actually refer to results that have not happened. Thus, if we repeat the study many times by drawing random samples from the same population(s), what would happen? In reality we sample only once and relate the p value to the actual result!

Second, the p value is a conditional probability of the data based on the assumption that the null hypothesis is true in the population, i.e., p(Data | H0), and therefore not the inverse probability p(H0 | Data), as often believed (Cohen, 1994). The theoretical sampling distributions against which results are compared (e.g., t, F, χ² distributions) are generated by assuming that sampling occurs from a population(s) in which the null hypothesis is exactly true.

Third, randomness is a fundamental assumption; it is the raison d'être of NHST. Without randomness, NHST becomes meaningless, as we cannot address sampling error, the sole purpose of such tests (Shaver, 1993).

Fourth, sample size is a crucial consideration, because the p value is a function of effect and sample sizes, as well as spread in the data (Cohen, 1990).

Fifth, the result of NHST is a probability statement, often expressed as a dichotomy in terms of whether the probability was less or more than the significance level (α). Notice that p values and α levels are two different theoretical entities. To Fisher, p values are a property of the data, and his notion of probability relates to the study. To Neyman-Pearson, α is a fixed property of the test, not the data, and their conception of an error rate in the "long run" strikes a balance between α, β (the probability of making a Type II error), and sample size n (Gigerenzer et al., 1989). The conventional 5% is due to Fisher (1925). Later, in his bitter arguments with Neyman and Pearson, he would discard the conventional level and argue for reporting exact p values.

² We use randomness to include both random sampling and random assignment.
³ The set of events to which an objective probability (understood as a relative long-run frequency) applies. Technically, the set should be infinite, but this requirement is often relaxed in practice.

3.2 Effect size and statistical power

In this context, two important concepts should briefly be clarified before we continue: effect size and statistical power. An effect size is a statistic that estimates the magnitude of the result in the population (e.g., Kirk, 1996). Measures of effect size can be classified as standardized or unstandardized. Standardized measures are scale-free because they are defined in terms of the variability in the data. Some well-known standardized measures include Cohen's d, r, R² and odds ratios (e.g., Kirk, 1996; Grissom & Kim, 2005). Unstandardized measures are expressed in the original units or in terms of percentages or proportions.
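The standardized/unstandardized distinction can be made concrete with Cohen's d, computed here from two hypothetical samples of per-paper citation counts (illustrative data only):

```python
import statistics as st

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * st.variance(a) + (nb - 1) * st.variance(b)) / (na + nb - 2)
    return (st.mean(a) - st.mean(b)) / pooled_var ** 0.5

# Hypothetical per-paper citation counts for two journals
a = [0, 1, 3, 2, 5, 1, 0, 4]
b = [1, 0, 1, 2, 0, 1, 3, 0]

raw_diff = st.mean(a) - st.mean(b)  # unstandardized: in original units (citations)
d = cohens_d(a, b)                  # standardized: scale-free
print(raw_diff, round(d, 2))
```

The raw difference keeps its substantive units; d expresses the same difference relative to the spread in the data, which is what makes it comparable across studies.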
Effect sizes are important for at least three reasons: 1) they provide crucial information for judging the importance of a result; 2) they are important for the accumulation of evidence over time and thus for meta-analysis and theory building; and 3) prior to a study, estimates of anticipated effect sizes can be used in power analyses to project the sample size adequate for detecting statistically significant results (e.g., Kirk, 1996; Kline, 2004; Ellis, 2010).

The statistical power of a significance test is the probability of rejecting the null hypothesis when it is false (Cohen, 1988). Power is the complement of β (1 - β). A statistical power analysis involves four variables: significance level (α), sample size (n), effect size, and power. For any statistical model, these relationships are such that each is a function of the other three. Statistical power is affected chiefly by the size of the effect and the size of the sample used to detect it (Cohen, 1988; 1990). Bigger effects are easier to detect than smaller effects, while large samples offer greater test sensitivity than small samples. Given α and the anticipated effect size, we can determine the sample size needed for detecting a statistically significant effect with a certain likelihood (i.e., power) when there is an effect there to be detected. Statistical power is particularly important when there is a true difference or association in the population. The test must be powerful enough to detect such differences or associations; otherwise, a non-significant result would simply mean that a Type II error has been committed.

3.3 Some common misinterpretations of statistical significance tests

Despite frequent warnings in the literature, statistical significance is too often conflated with the practical or theoretical importance of empirical results.
In a recent survey of management research, Seth et al. (2009, pp. 7-8) found that 90% of the papers did not distinguish between statistical significance and practical importance. Statistical significance is often used as the sole criterion of importance, leading to ritualistic dichotomous decision behavior and thereby deemphasizing interpretation of effect sizes (e.g., Scarr, 1997). A clear distinction must be made because statistically significant results are not necessarily important. Statistical significance leads simply to a conclusion that A is different from B, or, at best, that A is greater than B, or that insufficient evidence has been found for a difference. Typically, when we reject a null hypothesis of zero effect we conclude that there is a "significant" effect in the population; when we fail to reject, we conclude that there is no effect. Many critics argue, however, that such mere binary decisions provide an impoverished view of what science seeks or can achieve. Kirk asks "[h]ow far would physics have progressed if their researches had focused on discovering ordinal relationships?" (1996, p. 754). What we appear to forget, however, is that statistical significance is a function of sample size and the magnitude of the actual effect (e.g., Rosenthal, 1994, p. 232). A large effect size with a small sample, and a small effect size with a large sample, can both bring about statistical significance with matching p values, but, more disturbingly, such "significant effects" are most often treated the same way. Effect and sample sizes are rarely considered, but they should be. Consider the following example from Schubert and Glänzel (1983), who suggest the w statistic as a statistical significance test for differences between journal impact factors (JIF). In their example they compare two journals with impact factors 0.611 and 0.913.
The w statistic of 1.98 i s larger than 1.96, the value co rresponding to the 5% signifi cance level, h ence the authors con clude that "... the impact factors of the two journals differ significantly ..." (Schubert & Glänzel, 1983, p. 65). The important question , however, is whet her this diffe rence is importa nt? Informed human judgment is needed for such a decision. "Human judgment " refers to th e fact that d ecisio n - making in sta tistical infer ence is basic ally subjective , context depended and goal oriented ( e.g., Bakan, 1966; Carver, 1978; Tukey, 1991 ). "Informed " refers to the fundamental premise of providing a sound basis upon which one ca n make a decision about importance. In that respect, we need to focus on ef fect sizes and confidence intervals, consider the research design, and perhaps most important, relate the result to former empirical findings and theoretical insights (e.g., Kirk, 1996). But when it com es t o research asses sment s, such an i ntell ectual bas e is v irtually a bsent. It is a fundamen tal problem in re lation to the applica tion of statistica l signific ance tests f or comparison betwe en citation indic ators in re search assessm ents t hat we b asicall y do no t know a priori what differences would be important. O bviously importance depends on contex t and goal of the assessment , as well as cos ts and bene fits . But we do not have a substantial empiric al and the oretica l literature tha t can g uide us with some anticipatory effect sizes to look for. Thus , in this examp le, for lack of any thing better , we j udge th e standardized effect size in relation to Cohen's benchmarks f or "small", "medium" and "large" effec t sizes (Cohen, 1988). Cohen reluctantly proposed his benchmarks for sta tis tical power analyses to help rese arc hers guess on effect size when no other sourc es for estimation exist . 
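Cohen's benchmarks originate in power analysis, where an anticipated effect size is translated into a required sample size. A rough sketch using the common normal-approximation formula for a two-sample comparison (an approximation, not Cohen's exact tables):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n to detect a standardized effect d
    in a two-sample comparison (two-sided test, normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

# Smaller anticipated effects demand far larger samples:
large = n_per_group(0.8)  # Cohen's "large" benchmark
small = n_per_group(0.2)  # Cohen's "small" benchmark
print(large, small)
```

A "small" anticipated effect requires more than ten times the sample of a "large" one at the same α and power, which is why anticipated effect sizes matter before a study commences.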
Using his conventional definitions to interpret observed effect sizes in general is problematic and could easily lead to yet another form of "mindless" statistics. Cohen himself urged interpreting effect sizes based on the context of the data, and his benchmarks should be seen as a last resort. Returning to the JIF example above, Cohen's d, a standardized mean difference effect size, yields an effect size around .24. According to Cohen (1988), "small" effect sizes begin around .20 for mean differences; effect sizes lower than this are considered trivial, and the benchmark for "medium-sized" effects is set at .50. Is the apparently "small" but statistically significant effect between the two journal impact factors important? The confidence interval for the effect size is -.01 to .48, which runs from zero effect to almost a medium effect; can we base our decision on this level of uncertainty?

Important effect sizes, those we determine would make a difference, and accepted levels of uncertainty should be defined before the study commences. But beware: big effects are not necessarily important effects, neither are small effects necessarily unimportant. Now the question of course is "how big is big?" Obviously the question is relative to the actual study and certainly not easy to answer. A relative citation impact of 1.05 can be "statistically significant" above the world average of 1, but we would probably not consider this result important; or rather, it depends on the context.

3.4 Misuse of the term "significance" and the practice of dichotomous decisions

One of the reasons for the widespread use of the null ritual may well be the false belief that statistical significance tests can decide for us whether results are important or not. By relying on the ritual we are conveniently relieved of further pains of hard thinking about differences that make a difference (Gigerenzer, 2004).
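The point that a trivial citation impact of 1.05 can be "statistically significant" is easy to demonstrate: holding the observed score and its spread fixed, the p value of a test against the world average of 1 falls with sample size alone. The spread of 0.9 below is an arbitrary illustrative assumption:

```python
from math import sqrt
from statistics import NormalDist

mean_score, world_avg, sd = 1.05, 1.0, 0.9  # sd is an assumed, arbitrary spread

ps = []
for n in (100, 1000, 10000):
    z = (mean_score - world_avg) / (sd / sqrt(n))  # one-sample z statistic
    p = 2 * (1 - NormalDist().cdf(abs(z)))         # two-sided p value
    ps.append(p)
    print(n, round(p, 4))
```

The identical 5% deviation is "non-significant" at n = 100 and overwhelmingly "significant" at n = 10,000, yet nothing about its importance has changed.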
An overwhelming number of tests are produced in this mechanical fashion. But the truth is that most of them do not scrutinize the statistically significant differences found, and it is likely that most differences are trivial despite the implied rhetoric (e.g., Webster & Starbuck, 1988). The rhetorical practice is often to drop the qualifier "statistical" and speak instead of "significant differences". Using the term "significance" without the qualifier certainly gives an impression of importance, but "significance" in its statistical sense means something quite different. It has a very limited interpretation specifically related to sampling error. Reporting that a result is "highly significant" simply means a "long-run" interpretation of how strongly the data, or more extreme data, contradict the null hypothesis that the effect is zero in the population, given repeated random sampling with the same sample size. Whether the result is "highly important" is another question, still not answered. Nowhere in their articles do O&L use the qualifier "statistical". They continuously speak of "significance", "significantly different" or "not significantly different". For example, they emphasize that such a procedure (sum of ratios) "…allows us to test for significance of the deviation of the test set from the reference set" (2010, p. 424), or "… the researcher under study would show as performing significantly below the world average in his reference group, both with (0.71) or without self-citations (0.58)" (2010, p. 426). To us at least, it seems evident that "significance" to O&L is somehow conceived of as a criterion of importance and used as a dichotomous decision making tool in relation to research assessment, i.e., either the results are "significant" or not.
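The "long-run" reading of "significant" can be illustrated by simulation: when the null hypothesis is exactly true, tests at α = .05 will still flag roughly 5% of comparisons as "significant" in the long run, all of it sampling error. A minimal sketch using a normal-approximation two-sample test:

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(1)
alpha, reps, n = 0.05, 2000, 30
hits = 0
for _ in range(reps):
    # Two samples from the SAME population: the null hypothesis is exactly true.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = sqrt(stdev(a) ** 2 / n + stdev(b) ** 2 / n)
    z = (mean(a) - mean(b)) / se
    if 2 * (1 - NormalDist().cdf(abs(z))) < alpha:
        hits += 1  # a "significant" difference that is pure sampling error
print(hits / reps)  # hovers near alpha
```

None of these "significant" results marks an important difference; they are exactly the long-run error rate that the term "statistical significance" refers to.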
There are countless examples in the scientometric literature of similar practice, where statistical significance is treated as the binary criterion for the importance of results. For example, in regression analysis the importance of predictor variables or the fit of the model usually comes down to whether t or F statistics are "significant" or not at the conventional alpha levels (e.g., Stremersch, Verniers & Verhoef, 2007; Haslam et al., 2008; and Mingers & Xu, 2010, to name just a few studies that try to identify variables that predict citation impact). Now it may be that O&L do in fact mean "statistical significance" in its limited frequentist sense relating to sampling error, which this quote could indicate: "[t]he significance of differences depends on the shape of the underlying distributions and the size of the samples" (2010, p. 428); but if so, they explicitly fail to attend to it in a proper and unambiguous manner. And this merely raises new concerns, such as the plausibility of the null hypothesis, the assumption of randomness and the actual statistical power involved. We will address these questions in the following subsections.

Dichotomous decisions based on arbitrary significance levels are uninformed. Consider Table 1 in Leydesdorff & Opthof (2010a, p. 645), where the Spearman rank correlation between the field-normalized sum of ratios versus the ratio of sums is not significant in this particular study (we assume at the 5% level). A calculation of the exact p value, with n = 7, gives a probability of .052. To quote Rosnow and Rosenthal, "surely, God loves the .06 nearly as much as the .05" (1989, p. 1277). A rhetorical variant of this example is found in the following quote from Jacob, Lehrl and Henkel (2007, p. 125): "... citation rates in co-authorships almost reach significance (p = 0.059) pointing to a trend to support this assumption". Surely, the practical difference between .049 and .059 is minuscule, but the quote also reveals a very common and serious misunderstanding, the so-called "inverse probability fallacy" (e.g., Carver, 1978). The p value provides no direct information about the truth or falsity of the null hypothesis, conditional or otherwise. To recapitulate, NHST provides p(Data | H0) and not p(H0 | Data). The latter, however, is usually what researchers want to know. According to Cohen, "[NHST] does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is 'Given these data, what is the probability that H0 is true?' But as most of us know, what it tells us is 'Given that H0 is true, what is the probability of these (or more extreme) data?'" (p. 997). We think it is unsophisticated to treat the "truth" as a clear-cut binary variable, and ironic that a decision that makes no allowance for uncertainty occurs in a domain that purports to describe degrees of uncertainty. Remember also that frequentists are concerned with collectives, not the truth of a single event. The ritualistic use of the arbitrary 5% or 1% levels induces researchers to neglect critical examination of the relevance and importance of the findings. Researchers must always report not merely statistical significance but also the actual statistics, and reflect upon the practical or theoretical importance of the results. This is also true for citation indicators and differences in performance rankings. To become more quantitative, precise, and theoretically rich, we need to move beyond dichotomous decision making.

3.5 The misuse of nil null hypotheses and the neglect of Type II errors

Significance tests are computed based on the assumption that the null hypothesis is true in the population. This is hardly ever the fact in the social sciences.
Nil null hypotheses are almost always implausible, at least in observational studies (e.g., Berkson, 1938; Lykken, 1968; Meehl, 1978; Cohen, 1990; Anderson et al., 2000). A nil null hypothesis is one which posits, in an absolute sense, no difference or no association in a parameter, and it is almost universally applied (Cohen, 1994). There will always be uncontrolled spurious factors in observational studies, and it is even questionable whether randomization can be expected to exactly balance out the effects of all extraneous factors in experiments (Meehl, 1978). As a result, the observed correlation between any two variables or the difference between any two means will seldom be exactly 0.0000 to the nth decimal. A null hypothesis of no difference is therefore most probably implausible, and if so, disproving it is both unimpressive and uninformative (Lykken, 1968; Cohen, 1994). Add to this the sample size sensitivity of NHST (e.g., Cohen, 1990; Mayo, 2006). For example, an observed effect of r = .25 is statistically significant if n = 63 but not if n = 61 in a two-tailed test. A large enough sample can reject any nil null hypothesis. This property of NHST follows directly from the fact that a nil null hypothesis defines an infinitesimal point on a continuum. As the sample size increases, the confidence interval shrinks and becomes less and less likely to include the point corresponding to the null hypothesis. Given a large enough sample size, many relationships can emerge as being statistically significant because "everything correlates to some extent with everything else" (Meehl, 1990, p. 204). These correlations exist for a combination of interesting and trivial reasons. Meehl (1990) referred to the tendency to reject null hypotheses when the true relationships are trivial as the "crud factor".
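The sample-size sensitivity in the r = .25 example can be checked in a few lines; the sketch below uses the standard t transformation of a correlation coefficient and is our own illustration, not a computation taken from the cited sources.

```python
import math
from scipy import stats

def r_pvalue(r, n):
    """Two-tailed p value for an observed correlation r with n pairs,
    testing the nil null rho = 0 via t = r*sqrt(n-2)/sqrt(1-r^2)."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return 2 * stats.t.sf(abs(t), df=n - 2)

# The same observed effect crosses the 5% line purely through sample size:
p63 = r_pvalue(0.25, 63)  # just below .05
p61 = r_pvalue(0.25, 61)  # just above .05
```

Nothing about the effect changed between the two calls; only n did.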
And Tukey (1991) piercingly wrote that "… it is foolish to ask 'are the effects of A and B different?' They are always different - for some decimal place" (p. 100). What we want to know is the size of the difference between A and B and the error associated with our estimate. Consequently, a difference of trivial effect size, or even a totally spurious one, will eventually be statistically significant in an overpowered study. Similarly, important differences can fail to reach statistical significance in poorly designed, underpowered studies. Notice that it is a fallacy to treat a statistically non-significant result as showing no difference or no effect. For example, O&L (2010, p. 426) state that "[i]f the normalization is performed as proposed by us, the score is 0.91 (±0.11) and therewith not significantly different from the world average". Without considering statistical power and effect sizes, statistically non-significant results are virtually uninterpretable. Consider another example. In a study on determinants of faculty research productivity, Long et al. (2009, p. 245) conclude that they cannot support their research hypothesis that doctoral students in information systems with high-status academic origins exhibit greater research productivity, in terms of both quantity and quality, than doctoral graduates with moderate- or low-status academic origins. An F test indicated that differences in mean citation counts across academic origins (i.e., 108.82, 95.26 and 37.08, respectively) were not statistically significant (p = .09). Likewise, no "significant pairwise differences" were found. But p = .09 does not mean that equality between the mean citation counts exists. It does mean that the data were not inconsistent with the assumed nil null statistical hypothesis at the conventional 5% alpha level, given the actual sample size.
Again we see the misconception p(H0 | Data). Though it is often emphasized that failing to reject the null hypothesis does not mean that the null hypothesis is true, when it comes to a decision this is a distinction without a difference. The practical consequence is that we act as if there were no difference in citation counts between high-status, moderate-status and low-status graduates. We suspect that similar differences in mean citation counts at p = .05 would have led the authors to a supportive conclusion. But perhaps the nil null hypothesis was implausible to begin with? As in the previous example, considerations of power and effect sizes are also needed in this case in order to say anything concerning the research hypothesis. It is important to note that Type I errors can only occur when the null hypothesis is actually true. The p value only exists assuming the null hypothesis to be true. Accordingly, with implausible null hypotheses, the effective rate of Type I errors in many studies may essentially be zero, and the only kind of decision errors are Type II. If the nil null hypothesis is unlikely to be true, testing it is unlikely to advance our knowledge. It is more realistic to assume non-zero population associations or differences, but we seldom do that in our statistical hypotheses. In fact, we seldom reflect upon the plausibility of our null hypotheses, or for that matter adequately address other underlying assumptions associated with NHST (e.g., Keselman et al., 1998). O&L do not reflect upon the plausibility of their various unstated nil null hypotheses. As far as we understand O&L, they apply Kruskal-Wallis tests in order to decide whether the citation scores of their AMC researchers are "significantly different" from unity (the world average of 1).
Is the null hypothesis of exactly no difference to the nth decimal in the population plausible and in a technical sense true? We question that. Surely citation scores deviate from 1 at some level of precision (e.g., Berkson, 1942). Sample sizes in O&L are generally small. In the particular case of AMC researchers #117 and #118 (p. 427), their citation scores of 1.50 and .93 turn out to be not "significantly different" from the world average of 1. But the results are a consequence of low statistical power, i.e., small sample sizes combined with Bonferroni procedures, and we are more likely dealing with a Type II error, a failure to reject a false null hypothesis of no absolute difference. If sample sizes could be enlarged randomly, then the statistical power of the studies would be strengthened. However, if the null hypothesis is false anyway, then it is just a question of finding a sufficiently large sample to reject it. Sooner or later the citation scores of AMC researchers #117 and #118 will be "significantly different" from the world average. The question of course is whether such a deviation is trivial or important. Informed human judgment is needed for such decisions. Whether the actual AMC samples are probability samples, and whether they could be enlarged randomly, is a delicate question to which we return in section 3.7. But consider this: if the samples are not probability samples, addressing sampling error becomes meaningless, and the samples should be considered as convenience samples or apparent populations. In both cases, NHST would be irrelevant and the citation scores as they are would do, for example 1.50 and .93. Are these deviations from 1 trivial or important? Again, we are left with informed human judgment for such decisions.
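The point that a fixed, possibly trivial deviation from the world average of 1 will eventually reach "significance" can be illustrated with a one-sample t test computed from summary statistics. The mean of 1.05 and standard deviation of 0.5 below are hypothetical numbers chosen for illustration, not O&L's data.

```python
import math
from scipy import stats

def one_sample_p(mean, sd, n, mu0=1.0):
    """Two-tailed one-sample t test p value from summary statistics."""
    t = (mean - mu0) / (sd / math.sqrt(n))
    return 2 * stats.t.sf(abs(t), df=n - 1)

# The deviation (1.05 vs 1.00) never changes; only the sample size does.
for n in (25, 100, 400, 1600):
    print(n, one_sample_p(1.05, 0.5, n))
```

At n = 25 the deviation is nowhere near significance; by n = 1600 it is "highly significant", although its practical meaning is unchanged.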
3.6 Overpowered studies

Larivière and Gingras (2011) are generally supportive of O&L, but contrary to O&L, most of their analyses of the differences between the two normalization approaches are overpowered, with very low p values. Wilcoxon signed-rank tests are used to decide whether the distributions are "statistically different" (p. 395). Like O&L, Larivière and Gingras (2011) do not reflect upon the assumptions associated with NHST, such as randomness or the plausibility of their null hypotheses. It is indeed questionable whether these crucial assumptions are met. A null hypothesis of a common median equal to zero is questionable, and a (plausible) stochastic data generation mechanism is not presented. Not surprisingly, given the sample sizes involved, the differences between the two distributions are "statistically different" or "significantly different" in the words of Larivière & Gingras (2011, p. 395). The more interesting question is to what extent the results are important or, conversely, an example of the "crud factor". Wilcoxon signed-rank tests alone cannot tell us whether the differences between the two distributions are noteworthy, especially not in high-powered studies with implausible nil null hypotheses. More information is needed. Larivière and Gingras (2011) do in fact address the importance of some of their results with information extrinsic to the Wilcoxon signed-rank tests. Scores for the sum of ratios, for example, seem to be generally higher than those of the ratio of sums, and the authors reflect upon some of the potential consequences of these findings (p. 395). This is commendable. Again, effect sizes and informed human judgment are needed for such inferential purposes, and it seems that Larivière and Gingras (2011) indeed use differences in descriptive statistics to address the importance of the results.
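How an overpowered Wilcoxon signed-rank test certifies a practically trivial shift can be shown with synthetic data; the example below is our own construction (a fixed grid of paired differences shifted by a small constant), not the Larivière and Gingras data.

```python
import numpy as np
from scipy import stats

# Paired differences with a trivial constant shift of about 0.05
# against a spread of roughly one unit.
small = 0.0504 + np.linspace(-1, 1, 21)    # n = 21
big = 0.0504 + np.linspace(-1, 1, 2001)    # n = 2001

p_small = stats.wilcoxon(small).pvalue   # far from significant
p_big = stats.wilcoxon(big).pvalue       # "significant"
```

The shift is identical in both cases; only the sample size differs, so a verdict of "significance" says nothing about whether a shift of .05 matters.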
Why then use Wilcoxon signed-rank tests? As a mechanical ritual that can decide upon importance? As argued, this is untenable. Or as an inferential tool? In that case we should focus upon the assumption of randomness and whether it is satisfied. This is questionable. As always, sufficient power guarantees falsification of implausible null hypotheses, and this seems to be the case in Larivière and Gingras (2011). Interestingly, prior to Larivière and Gingras (2011), Waltman et al. (2011c) obtained similar empirical results comparing the two normalization approaches, albeit without involving the null ritual.

3.7 The assumption of randomness and its potential misuse

Statistical significance tests concern sampling error, and we sample in order to make statistical inferences, either descriptive inferences from sample to population or causal claims (Greenland, 1990). Statistical inference relies on probability theory. In order for probability theory and statistical tests to work, randomness is required (e.g., Cox, 2006). This is a mathematical necessity, as standard errors and p values are estimated in distributions that assume random sampling from well-defined populations (Berk & Freedman, 2003). Information on how data are generated becomes critical when we go beyond description. In other words, when we make statistical inferences we assume that data are generated by a stochastic mechanism and/or that data are assigned to treatments randomly. The empirical world has a structure that typically negates the possibility of random selection unless random sampling is imposed. Ideally, random sampling ensures that sample units are selected independently and with a known nonzero chance of being selected (Shaver, 1993). As a consequence, random samples should come from well-defined finite populations, not "imaginary" or "super-populations" (Berk & Freedman, 2003).
With random sampling, an important empirical matter is resolved. Without random sampling, we must legitimate that nature or the social world produced the equivalent of a random sample, or constructed the data in a manner that can be accurately represented by a convenient and well-understood model. Redner (2005), for example, suggests that citation data have a stochastic nature generated by a linear preferential attachment mechanism. Perhaps, but we are skeptical about treating social processes as genuine stochastic processes that generate the equivalent of random samples in the way that, for example, a model of radioactive decay does in the physical world (Berk, Western & Weiss, 1995a). The social world is the domain of man-made laws, social regulations, customs, the particulars of a specific culture and the spontaneous actions of people (Winkler, 2009, p. 190-104). Also, there seems to be some debate in our community as to which theoretical distribution best approximates a citation distribution (e.g., Viera & Gomes, 2010). We believe that randomness is best obtained through an appropriate probability sample from a well-defined population. Alternatively, data may constitute a convenience sample or an apparent population (or a census from a population) (Berk, Western & Weiss, 1995a).

3.7.1 Convenience samples, apparent populations and "super-populations"

Very few observational studies using inferential statistics in the social sciences clarify how data are generated, what chance mechanism is assumed, if any, or define the population to which results are generalized, whether explicitly or implicitly. Presumably, most observational studies, also in our field, are based on convenience and not probability samples (Kline, 2004). Albeit many social scientists do it, it is nevertheless a category mistake to make statistical inferences based upon samples of convenience.
With convenience samples, bias is to be expected and independence becomes problematic (Copas & Li, 1997). When independence is lacking, conventional estimation procedures will likely provide incorrect standard errors, and p values can be grossly misleading. Berk and Freedman (2003) suggest that standard errors and p values will be too small, and that many research results are held to be statistically significant when they are the mere product of chance variation. Indeed, there really is no point in addressing sampling error when there is no random mechanism to ensure that the probability and mathematical theory behind the calibration is working consistently. Turning to O&L, it is not at all clear in what way they assume that their samples are the product of a chance mechanism, and from what well-defined populations they may have been drawn. For example, one of the cases studied in O&L concerns one principal investigator (PI) from AMC. Sampling units are 65 publications affiliated with the PI for the period 1997-2006. The questions are: (a) in what sense does this data set comprise a probability sample?; and (b) how is the population defined? We assume that O&L have tried to identify all eligible publications in the database for the PI in question for the given period. Most likely, the data constitute all the available observations from the "apparent" population of publications affiliated with the PI. If so, frequentist inference based on a long-run interpretation of some repeatable data mechanism is not appropriate. There is no uncertainty due to variation in repeated sampling from the population. A counterargument could be that "the data are just one of many possible data sets that could have been generated if the PI's career were to be replayed many times over". But this does not clarify what sampling mechanism selected the career we happened to observe.
No one knows, or can know. It is simply not relevant for the problem at hand to think of observations as draws from a random process when further realizations are impossible in practice and lack meaning even as abstract propositions. Adhering to a frequentist conception of probability in the face of non-repeatable data and in a non-stochastic setting seems dubious. Neither can the set of publications identified by O&L in the specific citation database be considered a random draw from the finite population of all papers affiliated with the PI, including those external to the database. It is unlikely that the data generation mechanism can be stochastic when governed by indexing policies in one database. Most likely, the data set constitutes a convenience sample of specific publication types coincidentally indexed in the specific database. Convenience samples are often treated as if they were a random realization from some large, poorly defined population. This unsupported assumption is sometimes called the "super-population model" (Cochran, 1953). While some authors argue that "super-populations" are justifiable for statistical significance tests (e.g., Bollen, 1995), we do not find such arguments convincing for frequentist statistics with non-experimental data (see, for example, Western and Jackman (1994) and Berk, Western and Weiss (1995a; 1995b) for similar views). "Super-populations" are defined in a circular way as the population from which the data would have come if the data were a random sample (Berk & Freedman, 2003). "Super-populations" are imaginary with no empirical existence; as a consequence, they do not generate real statistics, and inferences to them do not directly answer any empirical questions. What draw from an "imaginary super-population" does the real-world sample we have in hand represent? We simply cannot know.
Inferences to imaginary populations are also imaginary (Berk & Freedman, 2003).⁴ One could of course treat data as an apparent population. In this non-stochastic setting, statistical inference is unnecessary because all the available information is collected. Nonetheless, we often still produce standard errors and significance tests for such settings, but their contextual meaning is obscure. There is no sampling error; means and variances are population parameters. Notice, population parameters can still be inaccurate due to measurement error, an issue seldom discussed in relation to citation indicators. Leaving measurement error aside for a moment, what we are left with is the citation indicator, the actual parameter, what used to be the estimated statistic. In the AMC case the indicator for the PI is .91, which is below the world average of one. Is it an important deviation from the world average? Maybe not. It is somehow absurd to address standard errors, p values and confidence intervals with strict adherence as if these numbers and intervals were precise, when they are not. Consider the bibliometric data used for indicator calculations. They are selective, packed with potential errors that influence, amongst other things, the matching process of citing-cited documents (Moed, 2002). Obviously, the best possible data should be used, but this actually means that a high workload should be invested to improve bibliometric data quality. It is more than likely that the measurements that go into indicators are biased or at least not "precise" (see Adler, Ewing and Taylor (2009) for a critical review of citation data and the statistics derived from them). Notwithstanding the basic violation of assumptions, we think it is questionable to put so much trust in significance tests with sharp margins of failure when our data and measurements most likely at best are imprecise.
In practice, sampling is complicated, and because even well-designed probability samples are usually implemented imperfectly, the usefulness of statistical inference will usually be a matter of degree. Nevertheless, this is rarely reflected upon. In practice, sampling assumptions are most often left unconsidered. We believe that the reason why the assumption of randomness is often ignored is the widespread and indiscriminate misuse of statistical significance tests, which may have created a cognitive illusion where the assumptions behind such tests have "elapsed" from our minds and their results are thought to be something they are not, namely decision statements about the importance of the findings. One can always make inferences, but statistical inferences come with restrictive assumptions, and frequentist inference is not applicable in non-stochastic settings.

⁴ Notice, there is an important difference between imaginary populations that plausibly could exist and those that could not. An imaginary population is produced by some real and well-defined stochastic process. The conditioning circumstances and stochastic processes are clearly articulated, often in mathematical terms. In the natural sciences such imaginary populations are common. This is not the case in the social sciences, yet super-populations are very often assumed, but seldom justified.

3.8 Standard errors and confidence intervals permit, but do not guarantee, better inference

It is generally accepted among critics of statistical significance tests that interval estimates such as standard errors (SE) and confidence intervals (CI) are superior to p values and should replace them as a means of describing variability in estimators. If used properly, SE and CI are certainly more informative than p values and should replace them.
They focus on uncertainty and interval estimation and simultaneously provide an idea of the likely direction and magnitude of the underlying difference and the random variability of the point estimate (e.g., Cumming, Fidler & Vaux, 2007; Coulson et al., 2010). But as inferential tools, they are bound to the same frequentist theory of probability, meaning "in-the-long-run" interpretations, and riddled with assumptions such as randomness and a "correct" statistical model used to construct the limits. A CI derived from a valid test will, over unlimited repetitions of the study, contain the true parameter with a frequency no less than its confidence level. This definition specifies the coverage property of the method used to generate the interval, not the probability that the true parameter value lies within the interval; it either does or does not. Thus, frequentist inference makes only pre-sample probability assertions. A 95% CI contains the true parameter value with probability .95 only before one has seen the data. After the data have been seen, the probability is zero or one. Yet CIs are universally interpreted in practice as guides to post-sample uncertainty. As Abelson puts it: "[u]nder the Law of the Diffusion of Idiocy, every foolish application of significance testing is sooner or later going to be translated into a corresponding foolish practice for confidence limits" (1997, p. 130). When a SE or CI is only used to check whether the interval subsumes the point null hypothesis, the procedure is no different from checking whether a test statistic is statistically significant or not. It is a covert NHST. O&L provide SEs for the citation indicators in their study.
These roughly correspond to 68% CIs.⁵ Thus, in a perfect world, the interpretation should be that under repeated realizations, the interval would cover the true citation score 68% of the time. But we have no way of knowing whether the current interval is one of the fortunate 68%, and we probably have no possibility for further replications. In addition, as pointed out in the previous section, it is indeed questionable whether the samples in O&L are genuine probability samples. If they are not, interpretation of SE becomes confused. According to Freedman (2003), "an SE for a convenience sample is best viewed as a de minimis error estimate: if this were — contrary to fact — a simple random sample, the uncertainty due to randomness would be something like the SE". In the case of O&L, what seems at first to be a more informative analysis with interval estimates turns out to be a genuine significance test with dichotomous decision behavior and questionable fulfillment of crucial assumptions. SEs are used as a surrogate for NHST: "[i]f the normalization is performed as proposed by us, the score is 0.91 (±0.11) and therewith not significantly different from the world average" (p. 426). We can see from this quote, and others, that their main interest is to check whether the interval subsumes the world average of 1. In this case it does, and consequently the implicit nil null hypothesis is not rejected. This is the null ritual and it is highly problematic.

⁵ A CI of 95% is roughly 2 SE. SE bars around point estimates that represent means in graphs are typically one SE wide, which corresponds roughly to a 32% significance level and a 68% CI.

4. Summary and recommendations

Opthof and Leydesdorff (2010) provide at least one sound reason for altering the normalization procedures in relation to citation indicators; however, it is certainly not statistical significance testing.
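The coverage property described above can be demonstrated by simulation; the sketch below repeatedly draws samples from a known population and counts how often the 95% t interval covers the true mean. All numbers (true mean 1.0, SD 0.5, n = 30) are illustrative assumptions of our own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mu, sigma, n, reps = 1.0, 0.5, 30, 1000
tcrit = stats.t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(reps):
    x = rng.normal(true_mu, sigma, n)
    half = tcrit * x.std(ddof=1) / np.sqrt(n)
    covered += (x.mean() - half) <= true_mu <= (x.mean() + half)

coverage = covered / reps  # close to 0.95 across repetitions
```

The .95 describes the procedure over the 1000 repetitions; each individual interval simply did or did not cover 1.0, which is exactly the pre-sample versus post-sample distinction made above.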
As we have discussed in the present article, statistical significance tests are highly problematic. They are logically flawed, misunderstood, ritualistically misused and, foremost, mechanically overused. They only address sampling error, not necessarily the most important issue. They must be interpreted from a frequentist theory of probability, and their use is conditional on restrictive assumptions, most pertinently that the null hypothesis must be true and that data generation is the result of a plausible stochastic process. These assumptions are seldom met, rendering such tests virtually meaningless. The problems sketched here are well known. Criticisms and disenchantments are mounting, but changing mentality and practice in the social sciences is a slow affair; given the evidence, indeed a "sociology-of-science wonderment" as Rozeboom (1997, p. 335) phrased it. The use of significance tests by O&L is probably within "normal science", and their practice is not necessarily inferior to that of most others; the other examples testify to that. What has caused our response in this article is a grave ethical concern about the ritualistic use of statistical significance testing in connection with research assessments. Assessments and their by-products, funding, promotion, hiring or sacking, should not be based on a mechanical tool known to be deeply controversial. Whether differences in rankings or impact factors between units are important should be based on human judgment informed by numbers, not on mechanical decisions based on tests that are logically flawed and very seldom meet the assumptions they are supposed to. Indeed, in their argument for changing the normalization procedures, O&L point to what they see as flawed assessments and the real consequences they have had for individual researchers at AMC.
This is laudable, but then arguing that statistical significance tests are an advancement for such assessments is problematic in our view. As we have argued, it hardly brings more objectivity or fairness to research assessments; on the contrary. Suggested alternatives to significance tests include the use of CIs and SEs. They do provide more information and are superior to statistical significance tests, and should as such be preferred. But neither CIs nor SEs are a panacea for the problems outlined in this article. They are based on the same frequentist foundation as NHST. Resampling techniques (e.g., bootstrap, jackknife and randomization) are considered by some to be a suitable alternative to statistical significance tests (Diaconis & Efron, 1983). Resampling techniques are basically internal replications that recombine the observations in a data set in different ways to estimate precision, often with fewer assumptions about underlying population distributions compared to traditional methods (Lunneborg, 2000). Resampling techniques are versatile and certainly have merits. The bootstrap technique seems especially well suited for interval estimation if we are unwilling or unable to make a lot of assumptions about population distributions. A potential application in this area is the estimation of CIs for effect sizes and for sensitivity analyses (e.g., Colliander & Ahlgren, 2011). In the 2011 Leiden Ranking by CWTS, such interval estimations seem to have been implemented in the form of "stability intervals" for the various indicators. Notice, they are used for uncertainty estimation, not statistical inference. Indeed, the statistical inferential capabilities of resampling techniques are highly questionable. Basically, one is simulating the frequentist "in-the-long-run" interpretation but using only the data set on hand as if it were the population.
If this data set is small, unrepresentative, biased, non-random, or the observations are not independent, resampling from it will not somehow fix these problems. In fact, resampling can magnify the effects of unusual features in a data set. Consequently, resampling does not entirely free us from having to make assumptions about the population distribution, and it is not a substitute for external replication, which is always preferable. Other statistical tools that could inform a decision-making process when it comes to comparing results and judging their importance are, for example, exploratory data analyses such as box-and-whisker plots. But there are more satisfactory inferential alternatives which, contrary to NHST, do assess the degree of support that data provide for hypotheses, e.g., Bayesian inference (e.g., Gelman et al., 2004), model-based inference based on information theory (e.g., Anderson, 2008), and likelihood inference (e.g., Royall, 1997).

4.1. Some recommendations for best practice

Some researchers have called for a ban on NHST (e.g., Hunter, 1997). Censoring is not the way forward, but neither is the status quo. What we need is statistical reform, as suggested for example by Wilkinson et al. (1999), Kline (2004), and Cumming (2012). Here the emphasis is on parameter estimation, i.e., effect size estimation with confidence intervals. Important publication guidelines such as APA (2010) still sanction the use of NHST, albeit with strong recommendations to report measures of effect size and confidence intervals around them (e.g., APA, 2010, p. 34).
Based on the aforementioned sources on statistical reform, here are some recommendations on data analysis practices from the frequentist perspective:

1) Statistical inference only makes sense when data come from a probability sample or have been randomly assigned to treatment and control groups.
2) Whenever possible, take an estimation framework, starting with the formulation of research aims such as "how much?" or "to what extent?".
3) Interpretation of research results should be based on point and interval estimates.
4) Calculate effect size estimates and confidence intervals to answer those questions, then interpret results based on informed judgment.
5) If statistical significance tests are used, (a) information on power must be reported, and (b) the null hypothesis should be plausible.
6) Effect sizes and confidence intervals must be reported whenever possible for all effects studied, whether large or small, statistically significant or not.
7) Exact p values should be reported.
8) It is unacceptable to describe results solely in terms of statistical significance.
9) Use the word "significant" without the qualifier "statistically" only to describe results that are truly noteworthy.
10) It is the researcher's responsibility to explain why the results have substantive significance; statistical tests are inadequate for this purpose.
11) Replication is the best way to deal with sampling error.

Finally, it is important to emphasize what significance tests, or CIs and SEs used for the same purpose, are not, and what they cannot do for us. They do not make a decision for us. Standard limits for retaining or rejecting our null hypothesis have no mathematical or empirical relevance; they are arbitrary thresholds. There can and should be no universal standard. Each case must be judged on its merits.
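As an illustration of the estimation framework recommended above, the following sketch computes Cohen's d for two groups together with an approximate 95% confidence interval, using the standard large-sample approximation Var(d) ≈ (n1+n2)/(n1·n2) + d²/(2(n1+n2)). The citation counts and the function name are made up for illustration only:

```python
import math

def cohens_d_with_ci(x, y, z=1.96):
    """Cohen's d for two independent groups, with an approximate
    confidence interval from the usual large-sample variance formula."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    # pooled standard deviation
    ss1 = sum((v - m1) ** 2 for v in x)
    ss2 = sum((v - m2) ** 2 for v in y)
    sp = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d, (d - z * se, d + z * se)

# Hypothetical citation counts for two research units
unit_a = [3, 5, 8, 2, 7, 6, 4, 9]
unit_b = [1, 4, 2, 3, 5, 2, 3, 1]
d, (lo, hi) = cohens_d_with_ci(unit_a, unit_b)
print(f"d = {d:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The point of such output is that the interval, not any conventional threshold, is what enters informed judgment about whether a difference is substantively important.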
Significance tests are based on unrealistic assumptions, giving them limited applicability in practice. They relate only to the assessment of the role of chance, and they are not very informative at that, if at all. They tell us nothing about the impact of errors, and they do not help decide whether any plausible substantive result is true. First and foremost, there are no magical solutions besides informed human judgment. Like the current debate on field normalization, it is time to start a debate concerning the (mis)use of statistical significance testing within our field. We encourage quantitative and statistical thinking, not mindless statistics. We do not think that the null ritual has much, if anything, that speaks for it.

References

Abelson, R. P. (1997). A retrospective on the significance test ban of 1999 (if there were no significance tests, they would be invented). IN: L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? Hillsdale, NJ: Erlbaum, 117-141.
Abelson, R. P. (1997). On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science, 8(1), 12-15.
Adler, R., Ewing, J. & Taylor, P. (2009). Citation statistics. Statistical Science, 24(1), 1-14.
American Psychological Association. (2010). Publication Manual of the APA (6th ed.). Washington, DC: APA.
Anderson, D. R. (2008). Model Based Inference in the Life Sciences: A Primer on Evidence. New York, NY: Springer.
Anderson, D. R., Burnham, K. P., & Thompson, W. L. (2000). Null hypothesis testing: Problems, prevalence, and an alternative. Journal of Wildlife Management, 64, 912-923.
Armstrong, J. S. (2007). Significance tests harm progress in forecasting. International Journal of Forecasting, 23(2), 321-327.
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66(6), 423-437.
Berk, R. A., Western, B. & Weiss, R. E. (1995a). Statistical inference for apparent populations. Sociological Methodology, 25, 421-458.
Berk, R. A., Western, B. & Weiss, R. E. (1995b). Reply to Bollen, Firebaugh, and Rubin. Sociological Methodology, 25, 481-485.
Berk, R. A., & Freedman, D. A. (2003). Statistical assumptions as empirical commitments. IN: T. G. Blomberg & S. Cohen (Eds.), Law, punishment, and social control: Essays in honor of Sheldon Messinger. New York: Aldine, 235-254.
Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association, 37(219), 325-335.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33(203), 526-536.
Bollen, K. A. (1995). Apparent and nonapparent significance tests. Sociological Methodology, 25, 459-468.
Bornmann, L. (2010). Towards an ideal method of measuring research performance: Some comments to the Opthof and Leydesdorff (2010) paper. Journal of Informetrics, 4(3), 441-443.
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48(3), 378-399.
Chow, S. L. (1998). Précis of Statistical significance: Rationale, validity, and utility. Behavioral and Brain Sciences, 21(2), 169-239.
Cochran, W. G. (1953). Sampling Techniques. New York: Wiley.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997-1003.
Colliander, C. & Ahlgren, P. (2011). The effects and their stability of field normalization baseline on relative performance with respect to citation impact: a case study of 20 natural science departments. Journal of Informetrics, 5(1), 101-113.
Copas, J. B. & Li, H. G. (1997). Inference for non-random samples (with discussion). Journal of the Royal Statistical Society, Series B, 59(1), 55-95.
Cortina, J. M. & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2(2), 161-172.
Coulson, M., Healey, M., Fidler, F. & Cumming, G. (2010). Confidence intervals permit, but do not guarantee, better inference than statistical significance testing. Frontiers in Psychology, 1, 1-9.
Cox, D. R., Spjøtvoll, E., Johansen, S., van Zwet, W. R., Bithell, J. F., Barndorff-Nielsen, O. & Keuls, M. (1977). The role of significance tests [with discussion and reply]. Scandinavian Journal of Statistics, 4(2), 49-70.
Cox, D. R. (2006). Principles of Statistical Inference. Cambridge, UK: Cambridge University Press.
Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.
Cumming, G., Fidler, F. & Vaux, D. L. (2007). Error bars in experimental biology. Journal of Cell Biology, 177(1), 7-11.
Diaconis, P. & Efron, B. (1983). Computer-intensive methods in statistics. Scientific American, 248(5), 116-130.
Dienes, Z. (2008). Understanding psychology as a science: An introduction to scientific and statistical inference. Houndmills, Basingstoke, UK: Palgrave Macmillan.
Ellis, P. D. (2010). The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge, UK: Cambridge University Press.
Fidler, F. & Cumming, G. (2007). Lessons learned from statistical reform efforts in other disciplines. Psychology in the Schools, 44(5), 441-449.
Fisher, R. A. (1925). Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.
Freedman, D. A. (2003). Sampling. Encyclopedia of Social Science Research Methods. Retrieved December 01, 2011, from http://sage-ereference.com/view/socialscience/n879.xml.
Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. (2004). Bayesian Data Analysis. Boca Raton: Chapman & Hall/CRC.
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. IN: G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: methodological issues. Hillsdale, NJ: Erlbaum, 311-339.
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587-606.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Kruger, L. (1989). The empire of chance: How probability changed science and everyday life. New York: Cambridge University Press.
Gingras, Y., & Larivière, V. (2011). There are neither "king" nor "crown" in scientometrics: Comments on a supposed "alternative" method of normalization. Journal of Informetrics, 5(1), 226-227.
Goodman, S. N. (1993). P values, hypothesis tests, and likelihood: Implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 137(5), 485-496.
Goodman, S. N. (2008). A dirty dozen: twelve p-value misconceptions. Seminars in Hematology, 45(3), 135-140.
Greenland, S. (1990). Randomization, statistics, and causal inference. Epidemiology, 1(6), 421-429.
Grissom, R. J. & Kim, J. J. (2005). Effect Sizes for Research: A Broad Practical Approach. Mahwah, NJ: Lawrence Erlbaum.
Guttman, L. (1985). The illogic of statistical inference for cumulative science. Applied Stochastic Models and Data Analysis, 1(1), 3-10.
Haslam, N., Ban, L., Kaufmann, L., Loughnan, S., Peters, K., Whelan, J. & Wilson, S. (2008). What makes an article influential? Predicting impact in social and personality psychology. Scientometrics, 76(1), 169-185.
Huberty, C. J. & Pike, C. J. (1999). On some history regarding statistical testing. IN: B. Thompson (Ed.), Advances in social science methodology, volume 5, 1-22.
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Jacob, J., Lehrl, S. & Henkel, A. (2007). Early recognition of high quality researchers of the German psychiatry by worldwide accessible bibliometric indicators. Scientometrics, 73(2), 117-130.
Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., Kowalchuk, R. K., Lowman, L. L., Petoskey, M. D., Keselman, J. C., & Levin, J. R. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses. Review of Educational Research, 68, 350-386.
Kirk, R. E. (1996). Practical significance: a concept whose time has come. Educational and Psychological Measurement, 56(5), 746-759.
Kline, R. B. (2004). Beyond significance testing: reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Krantz, D. (1999). The null hypothesis testing controversy in psychology. Journal of the American Statistical Association, 94(448), 1372-1381.
Kruskal, W. H. (1978). Significance, tests of. IN: W. H. Kruskal & J. M. Tanur (Eds.), International Encyclopedia of Statistics. New York: Free Press, 944-958.
Larivière, V. & Gingras, Y. (2011). Averages of ratios vs. ratios of averages: An empirical analysis of four levels of aggregation. Journal of Informetrics, 5(3), 392-399.
Leydesdorff, L., & Opthof, T. (2010a). Normalization at the field level: Fractional counting of citations. Journal of Informetrics, 4(4), 644-646.
Leydesdorff, L. & Opthof, T. (2010b). Normalization, CWTS indicators, and the Leiden Rankings: Differences in citation behavior at the level of fields. http://arxiv.org/ftp/arxiv/papers/1003/1003.3977.pdf
Leydesdorff, L., Bornmann, L., Mutz, R. & Opthof, T. (2011). Turning the tables on citation analysis one more time: Principles for comparing sets of documents. Journal of the American Society for Information Science and Technology, 62(7), 1370-1381.
Long, R., Crawford, A., White, M. & Davis, K. (2009). Determinants of faculty research productivity in information systems: An empirical analysis of the impact of academic origin and academic affiliation. Scientometrics, 78(2), 231-260.
Lundberg, J. (2007). Lifting the crown – citation z-score. Journal of Informetrics, 1(2), 145-154.
Lunneborg, C. E. (2000). Data Analysis by Resampling. Pacific Grove, CA: Brooks Cole.
Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3, part 1), 151-159.
Marsh, H. W., Jayasinghe, U. W. & Bond, N. W. (2011). Gender differences in peer reviews of grant applications: A substantive-methodological synergy in support of the null hypothesis model. Journal of Informetrics, 5(1), 167-180.
Mayo, D. (2006). Philosophy of statistics. IN: S. Sarkar & J. Pfeifer (Eds.), The Philosophy of Science: An Encyclopedia. London: Routledge, 802-815.
McCloskey, D. N. (1985). The loss function has been mislaid: the rhetoric of significance tests. American Economic Review, 75(2), 201-205.
McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regressions. Journal of Economic Literature, 34(3), 97-114.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Meehl, P. E. (1990). Appraising and amending theories: the strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1, 108-141.
Mingers, J. & Xu, F. (2010). The drivers of citations in management science journals. European Journal of Operational Research, 205(2), 422-430.
Moed, H. F. (2002). The impact-factors debate: the ISI's uses and limits. Towards a critical, informative, accurate and policy-relevant bibliometrics. Nature, 415(6873), 731-732.
Moed, H. F. (2010). CWTS crown indicator measures citation impact of a research group's publication oeuvre. Journal of Informetrics, 4(3), 436-438.
Morrison, D. E. & Henkel, R. E. (1969). Significance tests reconsidered. American Sociologist, 4, 131-140.
Nickerson, R. S. (2000). Null hypothesis significance testing: a review of an old and continuing controversy. Psychological Methods, 5(2), 241-301.
Oakes, M. (1986). Statistical Inference: A Commentary for the Social and Behavioral Sciences. New York: Wiley.
Opthof, T., & Leydesdorff, L. (2010). Caveats for the journal and field normalizations in the CWTS ("Leiden") evaluations of research performance. Journal of Informetrics, 4(3), 423-430.
Redner, S. (2005). Citation statistics from 110 years of Physical Review. Physics Today, 58(6), 49-54.
Rosenthal, R. (1994). Parametric measures of effect size. IN: H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis. New York: Sage, 231-244.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44(10), 1276-1284.
Rothman, K. J. (1986). Significance questing. Annals of Internal Medicine, 105(3), 445-447.
Royall, R. M. (1997). Statistical Evidence: A Likelihood Paradigm. London: Chapman & Hall.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57(5), 416-428.
Rozeboom, W. W. (1997). Good science is abductive, not hypothetico-deductive. IN: L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? Hillsdale, NJ: Erlbaum, 335-392.
Scarr, S. (1997). Rules of evidence: A larger context for the statistical debate. Psychological Science, 8, 16-17.
Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. IN: L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? Hillsdale, NJ: Erlbaum, 37-64.
Schubert, A. & Glänzel, W. (1983). Statistical reliability of comparisons based on the citation impact of scientific publications. Scientometrics, 5(1), 59-74.
Seth, A., Carlson, K. D., Hatfield, D. E., & Lan, H. W. (2009). So what? Beyond statistical significance to substantive significance in strategy research. IN: D. Bergh & D. Ketchen (Eds.), Research Methodology in Strategy and Management, vol. 5. New York: Elsevier JAI, 3-28.
Shaver, J. P. (1993). What statistical significance testing is, and what it is not. Journal of Experimental Education, 61(4), 293-316.
Spaan, J. A. E. (2010). The danger of pseudoscience in informetrics. Journal of Informetrics, 4(3), 439-440.
Stang, A., Poole, C., & Kuss, O. (2010). The ongoing tyranny of statistical significance testing in biomedical research. European Journal of Epidemiology, 25(4), 225-230.
Stremersch, S., Verniers, I. & Verhoef, P. C. (2007). The quest for citations: drivers of article impact. Journal of Marketing, 71(3), 171-193.
Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 6(1), 100-116.
van Raan, A. F. J., van Leeuwen, T. N., Visser, M. S., van Eck, N. J. & Waltman, L. (2010). Rivals for the crown: Reply to Opthof and Leydesdorff. Journal of Informetrics, 4(3), 431-435.
Vieira, E. S. & Gomes, J. A. N. F. (2010). Citations to scientific articles: Its distribution and dependence on the article features. Journal of Informetrics, 4(1), 1-13.
von Mises, R. (1928). Wahrscheinlichkeit, Statistik und Wahrheit. Wien: Springer.
Wainer, H. (1999). One cheer for null hypothesis significance testing. Psychological Methods, 4(2), 212-213.
Waltman, L., van Eck, N. J., van Leeuwen, T. N., Visser, M. S., & van Raan, A. F. J. (2011a). Towards a new crown indicator: Some theoretical considerations. Journal of Informetrics, 5(1), 37-47.
Waltman, L., van Eck, N. J., van Leeuwen, T. N., Visser, M. S. & van Raan, A. F. J. (2011b). On the correlation between bibliometric indicators and peer review: reply to Opthof and Leydesdorff. Scientometrics, 83(3), 1017-1022.
Waltman, L., van Eck, N. J., van Leeuwen, T. N., Visser, M. S. & van Raan, A. F. J. (2011c). Towards a new crown indicator: an empirical analysis. Scientometrics, 87(3), 467-481.
Webster, E. J. & Starbuck, W. H. (1988). Theory building in industrial and organizational psychology. IN: C. L. Cooper & I. Robertson (Eds.), International review of industrial and organizational psychology. London: Wiley, 93-138.
Western, B. & Jackman, S. (1994). Bayesian inference for comparative research. The American Political Science Review, 88(2), 412-423.
Winkler, O. W. (2009). Interpreting Economic and Social Data: A Foundation of Descriptive Statistics. Berlin Heidelberg: Springer-Verlag.
Wilkinson, L. and Task Force on Statistical Inference, APA Board of Scientific Affairs (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594-604.
